On 2010-01-08 19:07, Ulysses Rangel Ribeiro wrote:
I'm crawling with Nutch 1.0 and indexing with Solr 1.4, and came up with some
questions regarding data redundancy with this setup.

Considering the following sample segment:

2.0G    content
196K    crawl_fetch
152K    crawl_generate
376K    crawl_parse
392K    parse_data
441M    parse_text

1. From what I have found through searches, "content" holds the raw fetched
content. Is there any problem if I remove it, i.e. does Nutch need it to
apply any sort of logic when re-crawling that content/URL?

No, the raw content is no longer needed, unless you want to provide a "cached" view of the content.


2. The previous question also applies to parse_data and parse_text after I've called
nutch solrindex on that segment.

Depends how you set up your search. If you search using NutchBean (i.e. the Nutch web application) then you need them. If you search using Solr, then you don't need them.
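
For a Solr-only setup with no "cached" view, the cleanup described in answers 1 and 2 could be sketched roughly as below. prune_segment is a hypothetical helper, not a Nutch command, and the segment path in the usage comment is an assumption. It only removes the subdirectories you name and leaves everything else in the segment alone.

```shell
# Sketch: delete the named subdirectories of one Nutch segment.
# prune_segment is a hypothetical helper, not part of Nutch.
prune_segment() {
  segment="${1:?usage: prune_segment <segment> <subdir>...}"
  shift
  for name in "$@"; do
    # Only touch directories that actually exist inside the segment.
    if [ -d "$segment/$name" ]; then
      rm -rf "${segment:?}/$name"
    fi
  done
}

# Example call (the segment path is an assumption):
#   prune_segment crawl/segments/20100108190700 content parse_data parse_text
```

It is probably safest to run this only after both solrindex and invertlinks have processed the segment, since, as far as I can tell, invertlinks reads its links from parse_data.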


3. In sample scripts and tutorials I always see invertlinks being
called over all segments, but its output mentions merging. When I
fetch/parse new segments, can I call invertlinks only over them?

Yes, invertlinks will incrementally merge the existing linkdb with new links from a new segment.
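
A minimal sketch of that incremental update, assuming the Nutch 1.0 form "bin/nutch invertlinks <linkdb> <segment> ...". update_linkdb is a hypothetical wrapper, and the linkdb and segment paths in the usage comment are assumptions about your crawl layout.

```shell
# Path to the nutch launcher; override NUTCH if yours lives elsewhere (assumption).
NUTCH="${NUTCH:-bin/nutch}"

# Sketch: merge links from only the given new segment(s) into an existing linkdb.
# update_linkdb is a hypothetical wrapper, not part of Nutch.
update_linkdb() {
  linkdb="${1:?usage: update_linkdb <linkdb> <segment>...}"
  shift
  # Pass the new segments explicitly instead of pointing at the whole segments dir.
  "$NUTCH" invertlinks "$linkdb" "$@"
}

# Example call (paths are assumptions):
#   update_linkdb crawl/linkdb crawl/segments/20100108190700
```

Passing the segments explicitly, rather than the whole segments directory, keeps already-inverted segments from being reprocessed on every run.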

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
