Manoharam Reddy wrote:
My segment merger is not functioning properly. I am unable to figure
out the problem.
These are the commands I am using.
bin/nutch inject crawl/crawldb seedurls
In a loop iterating 10 times:-
bin/nutch generate crawl/crawldb crawl/segments -topN 1000 -adddays 5
segment=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $segment -threads 50
bin/nutch updatedb crawl/crawldb $segment
After loop:-
bin/nutch mergesegs crawl/merged_segments crawl/segments/*
rm -rf crawl/segments/*
mv --verbose crawl/merged_segments/* crawl/segments
rm -rf crawl/merged_segments
Merging 10 segments to crawl/MERGEDsegments/20050529095045
SegmentMerger: adding crawl/segments/20050528144604
SegmentMerger: adding crawl/segments/20050528144619
SegmentMerger: adding crawl/segments/20050528145426
SegmentMerger: adding crawl/segments/20050528151323
SegmentMerger: adding crawl/segments/20050528164032
SegmentMerger: adding crawl/segments/20050528170544
SegmentMerger: adding crawl/segments/20050528192341
SegmentMerger: adding crawl/segments/20050528203512
SegmentMerger: adding crawl/segments/20050528210029
SegmentMerger: adding crawl/segments/20050529055733
SegmentMerger: using segment data from: crawl_generate
`crawl/MERGEDsegments/20050529095045' -> `crawl/segments/20050529095045'
As can be seen here, only crawl_generate was used to merge. Other
folders like parse_data, crawl_fetch were not used. Why?
This behavior is described in the javadoc of SegmentMerger. In the
general case, users may wish to merge segments at different stages of
processing: only generated, fetched but not parsed, or parsed. That is
easy to do if the segments are homogeneous, i.e. they all contain the
same parts. However, if the segments are heterogeneous, i.e. some of
them are processed further than others, we cannot merge all their parts,
because the result would be an incomplete segment (e.g. for some URLs we
would have parse_data, for other URLs it would be missing).
In such cases SegmentMerger processes only the lowest common
denominator, i.e. only those segment parts that are present in all input
segments, and disregards any other existing parts.
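To check in advance which parts are common to a set of segments, one can
list the subdirectories that exist in every one of them. This is only a
rough sketch: common_parts is a hypothetical helper, not part of Nutch,
and it assumes each part (crawl_generate, crawl_fetch, parse_data, ...)
is a directory directly under the segment directory.

```shell
# common_parts: print the names of the parts (subdirectories) that are
# present in ALL of the segment directories given as arguments.
# Hypothetical helper, not part of Nutch. Assumes each part is a
# directory directly under its segment directory.
common_parts() {
    [ $# -gt 0 ] || return 0
    first=$1
    for part_dir in "$first"/*/; do
        [ -d "$part_dir" ] || continue
        part=$(basename "$part_dir")
        in_all=yes
        # A part counts only if every segment has a directory of that name.
        for seg in "$@"; do
            [ -d "$seg/$part" ] || { in_all=no; break; }
        done
        [ "$in_all" = yes ] && echo "$part"
    done
    return 0
}
```

Called as "common_parts crawl/segments/*", it should print only the part
names that every segment has. If the list is shorter than expected, at
least one segment is less processed than the rest.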
That's a long answer to your problem, which is that one or more of your
input segments hasn't been fetched yet. Leave that segment out of the
list of input segments, and all should be fine.
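One way to leave out unfetched segments is to pass mergesegs only those
segment directories that already contain crawl_fetch. A minimal sketch,
assuming the directory layout from the question; fetched_segments is a
hypothetical helper, not a Nutch command.

```shell
# fetched_segments: print every segment directory under $1 that contains
# a crawl_fetch subdirectory, one per line.
# Hypothetical helper, not part of Nutch; the crawl_fetch name is taken
# from the question above.
fetched_segments() {
    for seg in "$1"/*/; do
        seg=${seg%/}    # drop the trailing slash left by the glob
        [ -d "$seg/crawl_fetch" ] && echo "$seg"
    done
    return 0
}

# Usage, with the same mergesegs invocation as in the question:
#   bin/nutch mergesegs crawl/merged_segments $(fetched_segments crawl/segments)
```

The unquoted $(...) relies on segment names containing no whitespace,
which holds for timestamp names like 20050528144604. If parsed data
should be merged as well, the same test can be repeated for parse_data.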
--
Best regards,
Andrzej Bialecki <><