Manoharam Reddy wrote:
> My segment merger is not functioning properly. I am unable to figure
> out the problem.
>
> These are the commands I am using.
>
> bin/nutch inject crawl/crawldb seedurls
>
> In a loop iterating 10 times:-
>
> bin/nutch generate crawl/crawldb crawl/segments -topN 1000 -adddays 5
> segment=`ls -d crawl/segments/* | tail -1`
> bin/nutch fetch $segment -threads 50
> bin/nutch updatedb crawl/crawldb $segment
>
> After loop:-
>
> bin/nutch mergesegs crawl/merged_segments crawl/segments/*
> rm -rf crawl/segments/*
> mv --verbose crawl/merged_segments/* crawl/segments
> rm -rf crawl/merged_segments
>
> Merging 10 segments to crawl/MERGEDsegments/20050529095045
> SegmentMerger: adding crawl/segments/20050528144604
> SegmentMerger: adding crawl/segments/20050528144619
> SegmentMerger: adding crawl/segments/20050528145426
> SegmentMerger: adding crawl/segments/20050528151323
> SegmentMerger: adding crawl/segments/20050528164032
> SegmentMerger: adding crawl/segments/20050528170544
> SegmentMerger: adding crawl/segments/20050528192341
> SegmentMerger: adding crawl/segments/20050528203512
> SegmentMerger: adding crawl/segments/20050528210029
> SegmentMerger: adding crawl/segments/20050529055733
> SegmentMerger: using segment data from: crawl_generate
> `crawl/MERGEDsegments/20050529095045' -> `crawl/segments/20050529095045'
>
> As can be seen here, only crawl_generate was used to merge. Other
> folders like parse_data, crawl_fetch were not used. Why?

This behavior is described in the javadoc of SegmentMerger. In the general
case, users may wish to merge segments at different stages of processing:
only generated, fetched but not parsed, or parsed. This is easy to do if
the segments are homogeneous, i.e. they all contain the same parts.
However, if the segments are heterogeneous, i.e. some of them are processed
further than others, we cannot merge all their parts, because the result
would be an incomplete segment (e.g. for some URLs we would have
parse_data, for other URLs it would be missing). In such cases
SegmentMerger processes only the lowest common denominator, i.e. only
those segment parts that are present in all input segments, and disregards
any other existing parts.

That's a long answer to your problem, which is that one or more of your
input segments isn't fetched yet. Don't include that segment in the list
of input segments, and all should be fine.

-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
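
The advice above (leave unfetched segments out of the merge input) could be sketched as a small shell helper. This is only an illustration under assumptions not stated in the thread: the segments sit on the local filesystem, a fetched segment can be recognised by its crawl_fetch subdirectory, and the function name complete_segments is hypothetical rather than part of Nutch.

```shell
#!/bin/sh
# Sketch: print only the segments that contain every required part,
# so that the input to the merge is homogeneous.
# Assumption: segments are plain directories on the local filesystem.

# complete_segments SEGMENTS_DIR PART...
# Prints each directory under SEGMENTS_DIR that has all PART subdirectories.
complete_segments() {
    segments_dir=$1
    shift
    for seg in "$segments_dir"/*; do
        # Skip anything that is not a directory (also covers an unmatched glob).
        [ -d "$seg" ] || continue
        ok=1
        for part in "$@"; do
            if [ ! -d "$seg/$part" ]; then
                ok=0
                break
            fi
        done
        if [ "$ok" -eq 1 ]; then
            echo "$seg"
        fi
    done
    return 0
}

# Possible use with the commands from the original message (not run here;
# relies on segment names containing no whitespace):
#   bin/nutch mergesegs crawl/merged_segments \
#       $(complete_segments crawl/segments crawl_fetch)
```

Passing further part names (for example parse_data) narrows the list to segments that have been processed at least that far, which mirrors the "lowest common denominator" rule, except that incomplete segments are left out instead of dragging the whole merge down to crawl_generate.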
