Re: Why does nutch need to parse documents --- clarification needed

Sebastian Nagel Wed, 23 Jul 2014 09:02:59 -0700

Hi Harald,

have a look at NUTCH-1785 <https://issues.apache.org/jira/browse/NUTCH-1785>:
it's about the same problem.


> a) where does the binary blob appear in NutchDocument and
Just add a NutchField. The value can be any type, but the indexer must
be able to handle it.

> b) how does it get there?
In Nutch 1.x, adding raw/binary content can only be done within
IndexerMapReduce: indexing filters do not have the binary content at hand.
In 2.x this is different: an indexing filter can request any field/column
to be added. I haven't tried it, but it should be possible to request the
raw content (the column has the same name).
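
Either way, the raw content has to be stored at fetch time, as mentioned
earlier in this thread (property fetcher.store.content). A minimal sketch
of the override, assuming it goes inside the <configuration> element of
conf/nutch-site.xml; check nutch-default.xml of your version for the
actual default value:

```xml
<!-- conf/nutch-site.xml fragment (sketch; verify against nutch-default.xml) -->
<property>
  <name>fetcher.store.content</name>
  <!-- keep the fetched raw bytes so a later stage can still read them -->
  <value>true</value>
</property>
```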

Sebastian


2014-07-23 16:29 GMT+02:00 Harald Kirsch <[email protected]>:

> Hi,
>
> coming back to this question. Now I have basically the following
> parse-plugins.xml:
>
>         <mimeType name="text/html">
>                 <plugin id="parse-html" />
>         </mimeType>
>
> All other mime-types shall not be parsed for links. The documents shall be
> sent as-is, i.e. as binary blobs, to the index stage. (To preempt outcries:
> this is a custom index stage that knows how to deal with binary blobs.)
>
> Now where and how will the binary blob be made available within the
> NutchDocument sent to my indexer?
>
> For parsed content I see text coming along in the content field, but
>
> a) where does the binary blob appear in NutchDocument and
> b) how does it get there?
>
> Regards,
> Harald.
>
>
> On 03.07.2014 22:30, Sebastian Nagel wrote:
>
>> Hi Harald,
>>
>>> it is sufficient to only activate the parse-html plugin
>>
>> Yes. If parse-tika is active, other document types
>> (PDFs, etc.) are also searched for links.
>>
>>> or is even this not necessary
>>
>> You need to parse HTMLs. It's impossible to extract links without
>> parsing HTML. Think of relative links (base URL), <!-- comments -->,
>> <![CDATA[...]]>, and other subtleties which will harm other
>> approaches for link extraction (eg, regular expressions).
>>
>>> b) provide HTML and all other documents found as such to some external
>>> tool as is, i.e. unparsed.
>>
>> Make sure that the raw content is stored (in segments or WebTable), cf.
>> property fetcher.store.content.
>>
>>> (Is there a more detailed description of what the individual stages of
>>> nutch do beyond the tutorial?)
>>
>> Still a good introduction: Andrzej Białecki's chapter in "Hadoop: The
>> Definitive Guide" by Tom White.
>>
>> Sebastian
>>
>> On 07/01/2014 03:12 PM, Harald Kirsch wrote:
>>
>>> Suppose I want nutch to fetch URLs and
>>>
>>> a) follow links in HTML documents *only*
>>> b) provide HTML and all other documents found as such to some external
>>> tool as is, i.e. unparsed.
>>>
>>> Is it correct that it is sufficient to only activate the parse-html
>>> plugin from all the parse-*
>>> plugins or is even this not necessary?
>>>
>>> (Is there a more detailed description of what the individual stages of
>>> nutch do beyond the tutorial?)
>>>
>>> Thanks,
>>> Harald.
>>>
>>>
>>
>>
