Hi Markus,

Thanks for the explanation. I just realized that parsers do not alter the fetched content; they only create new metadata fields from what they parse. But can a plugin parse metadata that another parser has already produced?

Also, I tested jsoup-extractor and it doesn't handle HTML well, only XML. Do you think there's a relatively easy way to adapt it for all HTML?

Thanks!

Michael


On 08/02/2017 12:28 PM, Markus Jelsma wrote:
You only need an IndexingFilter if you didn't do the logic in the ParseFilter,
or if you want to do something with metadata added by two or more different
ParseFilters.

You can use multiple Indexing- or ParseFilters, not a problem.
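
To make the division of labour above concrete, here is a minimal, self-contained sketch. It does not use the Nutch API: ParseStep, IndexStep, firstTagText and the field names "title", "heading" and "summary" are all hypothetical stand-ins invented for illustration. It only shows the idea described in this thread: parse steps add metadata without touching the fetched content, and an indexing step is needed only when some extra logic, such as combining fields from two different parse steps, has to happen.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FilterChainSketch {

    // Hypothetical stand-in for a ParseFilter (NOT the Nutch interface):
    // it reads the raw content and adds metadata, leaving the content unchanged.
    interface ParseStep {
        void apply(String rawContent, Map<String, String> metadata);
    }

    // Hypothetical stand-in for an IndexingFilter (NOT the Nutch interface):
    // it reads the collected metadata and writes fields into the index document.
    interface IndexStep {
        void apply(Map<String, String> metadata, Map<String, String> indexDoc);
    }

    // Naive helper: text of the first <tag>...</tag> pair, or "" if absent.
    // It ignores attributes and nesting; a real plugin would use a proper parser.
    static String firstTagText(String html, String tag) {
        Pattern p = Pattern.compile("<" + tag + ">(.*?)</" + tag + ">", Pattern.DOTALL);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    static Map<String, String> indexPage(String rawContent) {
        // Two independent parse steps, each adding one metadata field.
        List<ParseStep> parseSteps = new ArrayList<>();
        parseSteps.add((raw, md) -> md.put("title", firstTagText(raw, "title")));
        parseSteps.add((raw, md) -> md.put("heading", firstTagText(raw, "h1")));

        List<IndexStep> indexSteps = new ArrayList<>();
        // Plain copy of a metadata value: no extra logic involved.
        indexSteps.add((md, doc) -> doc.put("title", md.get("title")));
        // Extra logic: combine fields that came from two different parse steps.
        indexSteps.add((md, doc) ->
                doc.put("summary", md.get("title") + ": " + md.get("heading")));

        // Parse phase: every step sees the same raw content and a shared metadata map.
        Map<String, String> metadata = new LinkedHashMap<>();
        for (ParseStep step : parseSteps) {
            step.apply(rawContent, metadata);
        }

        // Index phase: every step sees the metadata and a shared index document.
        Map<String, String> indexDoc = new LinkedHashMap<>();
        for (IndexStep step : indexSteps) {
            step.apply(metadata, indexDoc);
        }
        return indexDoc;
    }

    public static void main(String[] args) {
        String raw = "<html><title>Nutch</title><h1>Filters</h1></html>";
        System.out.println(indexPage(raw));
    }
}
```

In this sketch the first index step does nothing but copy a value, which is the kind of work that needs no custom IndexingFilter; only the second step, which merges two fields, carries logic of its own.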

-----Original message-----
From:Michael Chen <yiningchen2...@u.northwestern.edu>
Sent: Wednesday 2nd August 2017 21:23
To: user@nutch.apache.org
Subject: Re: ParseFilter and IndexingFilter


Hi Markus,

Thanks for the quick response! Please let me know at any point if I
should just read some part of the code. But judging from the data stored
in HBase (with Nutch 2.x), I'm guessing that "parse" changed the
"Document" (in my case, it cleaned up the HTML tags in "content").

Do you mean that parse only adds meta-data somewhere waiting for
indexing filters to index it into HBase? Maybe I'm not understanding
"indexing" correctly.

I'm trying to use the new jsoup-extractor to parse (and index) certain
fields with CSS selectors. I also want to keep the indexing by
index-basic and index-anchor, and preferably the raw html/data as well.
Am I on the right track?

Thank you!

Michael


On 08/02/2017 12:06 PM, Markus Jelsma wrote:
Hi,

A ParseFilter can add metadata to parsed records. An IndexingFilter can then
access the metadata fields added earlier by the ParseFilter and process them
before they are indexed.

If you just want to index the values added by the ParseFilter, you can just use 
index-metadata to index it directly. Only use an IndexingFilter if you need 
additional logic.

Regards,
Markus

-----Original message-----
From:Michael Chen <yiningchen2...@u.northwestern.edu>
Sent: Wednesday 2nd August 2017 20:58
To: user@nutch.apache.org
Subject: ParseFilter and IndexingFilter

Hi,

Does anyone know how multiple ParseFilters and IndexingFilters work
together, e.g. does the first parse affect the second, and does one
index operation affect the next? Given that the factories generate
multiple filters in the first place... I couldn't find a definitive
answer in the docs and it would be great if someone could help answer
this question. Thanks in advance.

Best regards,

Michael




