Re: Using Nutch for crawling and Lucene for searching (Wildcard/Fuzzy)

Andrzej Bialecki Fri, 15 May 2009 07:13:11 -0700

inghe wrote:

Andrzej Bialecki wrote:
Page content is NOT stored in Lucene indexes that Nutch creates. It's only indexed, which is not the same. Luke can show you the text in the "content" field only because it reconstructs it from the index. This reconstruction is incomplete because some information is missing (the information discarded by NutchDocumentAnalyzer).
As I wrote before, full content is stored in Nutch segments. That's why Nutch can show you the full content, but Luke cannot.
Thanks again, but is there a way to get the "content" information through
the Lucene libraries? I would like to work on the content of the extracted
web pages.

As it is now - there is no method. You would have to modify Nutch to create indexes where "content" is both indexed and stored - but then performance of your index will suffer.
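
To make the indexed vs. stored distinction concrete, here is a minimal toy sketch in plain Java. It is not Lucene or Nutch code: the ToyField class, its methods, and the stop-word list are all made up for illustration, and what NutchDocumentAnalyzer really discards may differ. In Lucene terms the two cases correspond roughly to Field.Store.NO and Field.Store.YES.

```java
import java.util.*;

/** Toy model of a single document field. NOT the Lucene or Nutch API. */
final class ToyField {
    // Hypothetical stop-word list, standing in for whatever an analyzer discards.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "is", "a", "of", "in"));

    // What "indexed" keeps in this toy: token position -> lowercased term.
    private final SortedMap<Integer, String> postings = new TreeMap<>();

    // What "stored" keeps: the original text, or null if the field is indexed only.
    private final String stored;

    ToyField(String text, boolean store) {
        // Lowercasing and splitting on non-word characters drops case and punctuation.
        String[] tokens = text.toLowerCase().split("\\W+");
        for (int pos = 0; pos < tokens.length; pos++) {
            if (!tokens[pos].isEmpty() && !STOP_WORDS.contains(tokens[pos])) {
                postings.put(pos, tokens[pos]);
            }
        }
        this.stored = store ? text : null;
    }

    /** Best-effort rebuild from the index terms, similar in spirit to what Luke does. */
    String reconstruct() {
        return String.join(" ", postings.values());
    }

    /** The original text, or null when the field was not stored. */
    String stored() {
        return stored;
    }

    public static void main(String[] args) {
        String page = "The content of a page is in the segment.";
        ToyField indexedOnly = new ToyField(page, false);
        ToyField indexedAndStored = new ToyField(page, true);

        System.out.println(indexedOnly.reconstruct());   // content page segment
        System.out.println(indexedOnly.stored());        // null
        System.out.println(indexedAndStored.stored());   // the original sentence
    }
}
```

The indexed-only field can only give back the terms that survived analysis, while the stored field returns the text unchanged. The stored copy is extra data the index has to carry for every document.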



--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
