[ https://issues.apache.org/jira/browse/NUTCH-525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514910 ]
Vishal Shah commented on NUTCH-525:
-----------------------------------

Hi,

   I'll add a unit test. On the undelete question: the need could arise when we are adding segments incrementally. For example, say docs A and B are duplicates and A is selected as the winner. In the next incremental update, A is refetched, but its status is page_gone (a 404 or similar). Rerunning dedup should then undelete B, since it is no longer a duplicate. Or, if there were another duplicate C with a score lower than B's, B should emerge as the winner once page A is dead. (A toy sketch at the end of this message makes the scenario concrete.)

Regards,

-vishal.

> DeleteDuplicates generates ArrayIndexOutOfBoundsException when trying to rerun dedup on a segment
> --------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-525
>                 URL: https://issues.apache.org/jira/browse/NUTCH-525
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.9.0
>        Environment: Fedora OS, JDK 1.6, Hadoop FS
>           Reporter: Vishal Shah
>        Attachments: deleteDups.patch
>
> When trying to rerun dedup on a segment, we get the following exception:
>
>   java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 261883
>           at org.apache.lucene.util.BitVector.get(BitVector.java:72)
>           at org.apache.lucene.index.SegmentReader.isDeleted(SegmentReader.java:346)
>           at org.apache.nutch.indexer.DeleteDuplicates1$InputFormat$DDRecordReader.next(DeleteDuplicates1.java:167)
>           at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
>           at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
>           at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
>           at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)
>
> To reproduce the error, create two segments with identical urls, then fetch, parse, index, and dedup the two segments. Rerunning dedup after that triggers the exception.
> The error comes from the DDRecordReader.next() method:
>
>   // skip past deleted documents
>   while (indexReader.isDeleted(doc) && doc < maxDoc) doc++;
>
> If the last document in the index is deleted, this loop skips past it and calls indexReader.isDeleted(doc) once more with doc == maxDoc, which is out of range. The two conditions should be swapped so that the bounds check runs first (a corrected version is sketched below, after this quoted description).
> I've attached a patch here.
> On a related note, why should we skip past deleted documents at all? The only time this happens is when we are rerunning dedup on a segment, and documents in a segment index are not deleted for any reason other than dedup, so shouldn't they be given a chance to compete again? We could fix this by putting an indexReader.undeleteAll() in the constructor for DDRecordReader (also sketched below). Any thoughts on this?
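For reference, the fix described in the quoted text is just a swap of the two conditions so that the bounds check short-circuits the isDeleted() call; presumably this is what the attached deleteDups.patch amounts to:

  // Skip past deleted documents. Checking doc < maxDoc first guarantees we
  // never call indexReader.isDeleted(doc) with doc == maxDoc.
  while (doc < maxDoc && indexReader.isDeleted(doc)) doc++;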
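And a minimal sketch of the undelete idea from the last paragraph of the quote, assuming the DDRecordReader constructor opens the index reader itself with write access (indexDir is an illustrative variable here, not the actual name used in DeleteDuplicates1.java):

  import org.apache.lucene.index.IndexReader;
  ...
  // Sketch only: clear any deletions left over from a previous dedup pass
  // so those documents can compete again on the rerun.
  IndexReader indexReader = IndexReader.open(indexDir);
  indexReader.undeleteAll();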
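Finally, to make the incremental scenario from my comment concrete, here is a toy winner-selection sketch (all names in it are hypothetical; the real dedup job groups documents by URL or content hash):

  import java.util.*;

  // Toy model: among duplicates, dedup keeps the highest-scoring live
  // document; a doc refetched as page_gone drops out of the contest.
  public class DedupSketch {
    static class Doc {
      final String name; final float score; final boolean gone;
      Doc(String name, float score, boolean gone) {
        this.name = name; this.score = score; this.gone = gone;
      }
    }

    static Doc pickWinner(List<Doc> dups) {
      Doc winner = null;
      for (Doc d : dups) {
        if (d.gone) continue;  // page_gone: no longer competes
        if (winner == null || d.score > winner.score) winner = d;
      }
      return winner;
    }

    public static void main(String[] args) {
      // A was the original winner, but is page_gone after the incremental
      // update, so B (the next-best live duplicate) emerges as the winner.
      List<Doc> dups = Arrays.asList(
          new Doc("A", 1.0f, true),
          new Doc("B", 0.8f, false),
          new Doc("C", 0.5f, false));
      System.out.println(pickWinner(dups).name);  // prints B
    }
  }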