inghe wrote:

Andrzej Bialecki wrote:
Page content is NOT stored in Lucene indexes that Nutch creates. It's only indexed, which is not the same. Luke can show you the text in the "content" field only because it reconstructs it from the index. This reconstruction is incomplete because some information is missing (the information discarded by NutchDocumentAnalyzer).

As I wrote before, full content is stored in Nutch segments. That's why Nutch can show you the full content, but Luke cannot.



Thanks again, but is there a way to get the "content" information through
the Lucene libraries? I would like to work on the content of the extracted
web pages.


As it is now, there is no such method. You would have to modify Nutch to create indexes where "content" is both indexed and stored, but then the performance of your index will suffer.
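
To illustrate the difference between "indexed" and "stored", here is a minimal toy sketch in plain Java. It is not Lucene or Nutch code: the class ToyIndex, its methods, and the stop-word list are all hypothetical, and lowercasing plus stop-word removal only stand in for whatever information the real analyzer discards.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical toy model of one document field; NOT the Lucene API.
final class ToyIndex {
    // Assumed stop words, standing in for what an analyzer might drop.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "is", "a", "of"));

    // "Indexed" part: term -> token positions where the term occurred.
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    // "Stored" part: the original text, kept only if requested.
    private final String stored;

    ToyIndex(String text, boolean store) {
        this.stored = store ? text : null;
        int position = 0;
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            if (!STOP_WORDS.contains(token)) {
                postings.computeIfAbsent(token, k -> new ArrayList<>()).add(position);
            }
            position++; // stop words still consume a position
        }
    }

    // Searching needs only the indexed part.
    boolean matches(String term) {
        return postings.containsKey(term.toLowerCase());
    }

    // The original text is available only if it was stored.
    Optional<String> storedContent() {
        return Optional.ofNullable(stored);
    }

    // Best-effort rebuild from the postings: case, punctuation and
    // stop words are gone, so the result is incomplete.
    String reconstruct() {
        TreeMap<Integer, String> byPosition = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
            for (int position : entry.getValue()) {
                byPosition.put(position, entry.getKey());
            }
        }
        return String.join(" ", byPosition.values());
    }

    public static void main(String[] args) {
        ToyIndex indexedOnly = new ToyIndex("The page is a test of Nutch.", false);
        System.out.println(indexedOnly.matches("nutch"));
        System.out.println(indexedOnly.storedContent().isPresent());
        System.out.println(indexedOnly.reconstruct());
    }
}
```

In this sketch the indexed-only field can still be searched, but the text rebuilt from it has lost the discarded terms, the case and the punctuation. Passing true for the store flag keeps the original text at the cost of a larger structure, which mirrors the trade-off described above.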


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
