Alternatively, you could modify Nutch to store the content in the index. Andrzej, is that a bad idea? What do you think?
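
If you stay with the readseg route Andrzej describes in the quoted message below, here is a rough sketch of how the calls could be assembled from a script. It assumes the script runs from the Nutch install directory (so that bin/nutch resolves) and that the usual SegmentReader options (-dump, -get and the -no<part> switches) are available in your version. The segment and output paths are placeholders, and the helper names are made up for illustration.

```python
import subprocess

# Segment parts that readseg can skip via a matching "-no<part>" switch.
SEGMENT_PARTS = ("content", "fetch", "generate", "parse", "parsedata", "parsetext")


def readseg_dump_cmd(segment_dir, output_dir, keep=("parsetext",), nutch="bin/nutch"):
    """Build the argument list for `bin/nutch readseg -dump`.

    `keep` names the segment parts to include in the dump; every other
    part is suppressed with its "-no<part>" switch.
    """
    unknown = set(keep) - set(SEGMENT_PARTS)
    if unknown:
        raise ValueError("unknown segment part(s): %s" % ", ".join(sorted(unknown)))
    cmd = [nutch, "readseg", "-dump", segment_dir, output_dir]
    cmd += ["-no" + part for part in SEGMENT_PARTS if part not in keep]
    return cmd


def readseg_get_cmd(segment_dir, url, nutch="bin/nutch"):
    """Build the argument list for fetching a single page's record by URL."""
    return [nutch, "readseg", "-get", segment_dir, url]


def run_readseg(cmd):
    """Run a prepared readseg command; raises CalledProcessError on failure."""
    subprocess.run(cmd, check=True)
```

For example, run_readseg(readseg_dump_cmd("crawl/segments/<segment>", "dump_out")) would dump only the parsed text of a segment (the path is hypothetical), while keep=("content",) would dump the raw fetched content instead.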

Best Regards
Alexander Aristov

2009/5/14 Andrzej Bialecki <[email protected]>

> inghe wrote:
>
>> Hi,
>> I want to use Nutch for crawling contents and Lucene for extract and
>> analyze the contents of the index created by Nutch. I'm trying to
>> extract from the index the contents of web pages, but i don' know how
>> to set the NutchDocumentAnalyzer in my application, if i use the
>> StandardAnalyzer of Lucene, i'll get to extract the fields "title",
>> "url" but not the "content".
>> I'm using Nutch1.0 and Lucene2.4.0
>
> There is no content in Lucene indexes. The original content is stored in
> Nutch segments. You can use the command bin/nutch readseg to retrieve all
> (or selected) pages.
>
> --
> Best regards,
> Andrzej Bialecki <><
> Information Retrieval, Semantic Web
> Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
