Alternatively, you can modify Nutch to store the content in the index.

Andrzej, is that a bad idea? What do you think?

Best Regards
Alexander Aristov


2009/5/14 Andrzej Bialecki <[email protected]>

> inghe wrote:
>
>>
>> Hi,
>> I want to use Nutch to crawl content and Lucene to extract and analyze
>> the contents of the index created by Nutch. I'm trying to extract the
>> contents of web pages from the index, but I don't know how to set the
>> NutchDocumentAnalyzer in my application. If I use Lucene's
>> StandardAnalyzer, I can extract the "title" and "url" fields but not
>> "content".
>> I'm using Nutch 1.0 and Lucene 2.4.0.
>>
>
> There is no content in Lucene indexes. The original content is stored in
> Nutch segments. You can use the command bin/nutch readseg to retrieve all
> (or selected) pages.
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
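
For reference, a minimal sketch of the readseg usage Andrzej mentions. The
segment path, output directory, and URL below are made-up placeholders, and
the option names are the ones I recall from the Nutch 1.0 readseg usage
message, so please check them against the output of running bin/nutch
readseg with no arguments:

```shell
# Hypothetical segment path and output directory; substitute your own.
SEGMENT=crawl/segments/20090514000000
OUT=segdump

# Dump only the raw content of every page in the segment, skipping the
# other segment parts. The result should be a text file under $OUT.
bin/nutch readseg -dump "$SEGMENT" "$OUT" \
    -nofetch -nogenerate -noparse -noparsedata -noparsetext

# Retrieve the record for a single page, keyed by its URL (placeholder URL).
bin/nutch readseg -get "$SEGMENT" "http://www.example.com/"
```

If the extracted text is what you are after rather than the raw bytes,
dropping -noparsetext and adding -nocontent should give the parsed text
instead.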
