[ https://issues.apache.org/jira/browse/NUTCH-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Nioche closed NUTCH-971.
-------------------------------

> IndexMerger produces indexes itself cannot merge anymore
> --------------------------------------------------------
>
>                 Key: NUTCH-971
>                 URL: https://issues.apache.org/jira/browse/NUTCH-971
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.2
>            Reporter: Gabriele Kahlout
>            Priority: Minor
>              Labels: patch
>             Fix For: 1.3
>
>         Attachments: IndexMerger-part.diff
>
>
> Here's what I do:
> 1. index the fetched segs
> $ rm -r $new_indexes $temp_indexes
> $ bin/nutch index $new_indexes $it_crawldb crawl/linkdb crawl/segments/*
>
> I examine the index with Luke and it's as expected.
> 2. merge the new index with the previous
> $ bin/nutch merge $temp_indexes $new_indexes $indexes
> IndexMerger: starting at 2011-03-26 10:24:58
> IndexMerger: merging indexes to: crawl/temp_indexes
> Adding file:/Users/simpatico/nutch-1.2/crawl/new_indexes/part-00000
> IndexMerger: finished at 2011-03-26 10:24:59, elapsed: 00:00:01
> On the first iteration, when $indexes is empty, it works fine by essentially
> duplicating $new_indexes into $temp_indexes.
> But on the 2nd iteration, after I mv $temp_indexes $indexes[1], the merged
> index $temp_indexes contains only $new_indexes and nothing from $indexes,
> which indeed still contains the data from the previous iteration. That is, it
> doesn't merge.
> This unexpected merge behavior does NOT depend on the argument order, i.e.
> $ bin/nutch merge $temp_indexes $indexes $new_indexes
> IndexMerger: starting at 2011-03-26 10:32:15
> IndexMerger: merging indexes to: crawl/temp_indexes
> Adding file:/Users/simpatico/nutch-1.2/crawl/new_indexes/part-00000
> IndexMerger: finished at 2011-03-26 10:32:16, elapsed: 00:00:01
> The moral of the story is that a merged index cannot be merged with another,
> i.e. bin/nutch merge is meant to merge only 2 indices generated with
> bin/nutch index (or solrindex, perhaps).
> The difference between the 2 indices, as far as I can tell, is that the merged
> index doesn't contain the file index_done (and a hidden companion), but adding
> those to the merged index before merging it again doesn't solve it either.
> The way/workaround to make the merged index equivalent to the bin/nutch index
> generated index seems to be putting it in a "part" subdirectory:
> $ bin/nutch merge crawl/temp_indexes/part-1 crawl/indexes crawl/new_indexes
> IndexMerger: starting at 2011-03-26 11:18:10
> IndexMerger: merging indexes to: crawl/temp_indexes/part-1
> Adding file:/Users/simpatico/nutch-1.2/crawl/indexes/part-1
> Adding file:/Users/simpatico/nutch-1.2/crawl/new_indexes/part-00000
> IndexMerger: finished at 2011-03-26 11:18:12, elapsed: 00:00:01
> Where was this documented? I'd recommend not documenting it, but rather having
> IndexMerger handle it as in the attached patch.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
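
The "part" subdirectory workaround from the report could be wrapped in a small helper for repeated merge iterations. This is only a sketch under assumptions: the function name merge_iteration, the NUTCH variable, and the directory arguments are hypothetical and not part of Nutch, and the only Nutch command used is the "bin/nutch merge <output> <index dirs...>" form shown in the report. It has not been checked against the attached patch.

```shell
#!/bin/sh
# Hypothetical helper around the workaround described in the report.
# NUTCH is an assumed variable; it defaults to the launcher used above.
NUTCH=${NUTCH:-bin/nutch}

# merge_iteration INDEXES NEW_INDEXES TEMP_INDEXES
# Merges INDEXES and NEW_INDEXES into TEMP_INDEXES/part-1, then
# replaces INDEXES with the merged result.
merge_iteration() {
    indexes=$1
    new_indexes=$2
    temp_indexes=$3

    # Start from a clean temporary output directory.
    rm -rf "$temp_indexes"

    # Write the merged index into a "part-1" subdirectory so that it has
    # the same layout as the output of "bin/nutch index" and is picked up
    # again by the next merge.
    $NUTCH merge "$temp_indexes/part-1" "$indexes" "$new_indexes" || return 1

    # Swap the merged index into place. This deletes the old index, so
    # keep a backup if the merge result has not been inspected yet.
    rm -rf "$indexes"
    mv "$temp_indexes" "$indexes"
}
```

A call such as merge_iteration crawl/indexes crawl/new_indexes crawl/temp_indexes would then correspond to one iteration of the steps in the report. How the merge behaves when the first directory does not exist yet is not covered by this sketch.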