inghe wrote:

Hi,
I want to use Nutch for crawling content and Lucene to extract and analyze
the contents of the index created by Nutch. I'm trying to extract the
contents of web pages from the index, but I don't know how to set the
NutchDocumentAnalyzer in my application. If I use Lucene's StandardAnalyzer,
I can extract the "title" and "url" fields but not "content".
I'm using Nutch 1.0 and Lucene 2.4.0.

The Lucene index that Nutch builds does not contain the page content. The original content is stored in the Nutch segments. You can use the command bin/nutch readseg to retrieve all (or selected) pages.
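
As a rough sketch of what that looks like in practice, the two small wrappers below call readseg in its -dump and -get modes. The function names, the NUTCH variable and any segment paths are illustrative assumptions, not part of Nutch; the flags are the ones listed by running bin/nutch readseg with no arguments on Nutch 1.0, so check that usage text against your own version.

```shell
# Path to the nutch launcher script (assumed default; override as needed).
NUTCH="${NUTCH:-bin/nutch}"

# Dump every page in a segment to a text file under the output directory.
# The -no* flags drop the generate/fetch/parse records so that only the
# raw content and the parsed text remain in the dump.
#   $1 = segment directory, $2 = output directory
dump_segment() {
    "$NUTCH" readseg -dump "$1" "$2" -nofetch -nogenerate -noparse -noparsedata
}

# Print the records stored in a segment for one URL.
#   $1 = segment directory, $2 = URL used as the lookup key
get_page() {
    "$NUTCH" readseg -get "$1" "$2"
}
```

For example, with a hypothetical segment directory crawl/segments/20090601120000, you would run "dump_segment crawl/segments/20090601120000 dump_out" to export everything, or "get_page crawl/segments/20090601120000 http://www.example.com/" to look at a single page.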


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
