inghe wrote:
Hi, I want to use Nutch for crawling content and Lucene to extract and analyze the contents of the index created by Nutch. I'm trying to extract the contents of web pages from the index, but I don't know how to set the NutchDocumentAnalyzer in my application. If I use Lucene's StandardAnalyzer, I can extract the fields "title" and "url" but not "content". I'm using Nutch 1.0 and Lucene 2.4.0.
There is no page content in the Lucene index. The original content is stored in the Nutch segments. You can use the command bin/nutch readseg to retrieve all (or selected) pages.
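As a rough sketch of what that could look like, assuming you run it from the Nutch installation directory: the crawl directory, segment name, output directory and URL below are hypothetical placeholders, so substitute your own. The -no* switches only limit which parts of the segment are printed; here everything except the parsed text is skipped.

```shell
#!/bin/sh
# Hypothetical paths and names: adjust them to your own crawl.
NUTCH="${NUTCH:-bin/nutch}"
SEGMENTS_DIR="crawl/segments"
SEGMENT="$SEGMENTS_DIR/20090101000000"   # hypothetical segment name
OUT_DIR="segment_dump"                   # hypothetical output directory
URL="http://www.example.com/"            # hypothetical page URL

# Skip every segment part except parse_text (the extracted page text).
SKIP="-nocontent -nofetch -nogenerate -noparse -noparsedata"

if [ -x "$NUTCH" ]; then
  # List the segments and their basic statistics.
  "$NUTCH" readseg -list -dir "$SEGMENTS_DIR"
  # Dump all pages of one segment as plain text into OUT_DIR.
  "$NUTCH" readseg -dump "$SEGMENT" "$OUT_DIR" $SKIP
  # Retrieve a single page by its URL.
  "$NUTCH" readseg -get "$SEGMENT" "$URL" $SKIP
else
  echo "bin/nutch not found; run this from the Nutch installation directory" >&2
fi
```

If you want the raw fetched content rather than the parsed text, drop -nocontent and add -noparsetext instead.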
-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
