DeleteDuplicates.HashPartitioner depends on the order of IndexDocs
------------------------------------------------------------------
Key: NUTCH-420
URL: http://issues.apache.org/jira/browse/NUTCH-420
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 0.9.0
Reporter: Dogacan Güney
Priority: Minor
DeleteDuplicates.HashPartitioner.reduce():
    // byScore case
    if (value.score > highest.score) {
      highest.keep = false;
      LOG.debug("-discard " + highest + ", keep " + value);
      output.collect(highest.url, highest); // delete highest
      highest = value;
    }
    // !byScore is also similar
So assume two docs with the same hash are in values. If the first has a lower
score than the second, then the first doc will be deleted. But if the first has
a higher score than the second, then none will be deleted. AFAICS, there should
be an else branch that deletes value and keeps highest as it is.
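
A minimal standalone sketch of the order dependence and of the suggested else branch follows. It is not the actual Nutch code: the IndexDoc class here is a hypothetical stand-in with only the fields the snippet above uses, the discarded() helper and its withElse flag are made up for illustration, and it assumes that highest starts out as the first doc seen.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DedupSketch {
  // Hypothetical stand-in for Nutch's IndexDoc; only the fields used above.
  static class IndexDoc {
    final String url;
    final float score;
    boolean keep = true;
    IndexDoc(String url, float score) { this.url = url; this.score = score; }
  }

  // Returns the URLs marked for deletion among docs that share one hash.
  // withElse=false mirrors the quoted byScore logic; withElse=true adds the
  // suggested else branch. Assumes highest is seeded with the first doc.
  static List<String> discarded(List<IndexDoc> values, boolean withElse) {
    List<String> out = new ArrayList<>();
    IndexDoc highest = null;
    for (IndexDoc value : values) {
      if (highest == null) {
        highest = value;
        continue;
      }
      if (value.score > highest.score) {
        highest.keep = false;
        out.add(highest.url);   // delete the previous highest
        highest = value;
      } else if (withElse) {
        value.keep = false;
        out.add(value.url);     // delete value, keep highest as it is
      }
    }
    return out;
  }

  // Builds two docs with the same hash, in either order.
  static List<IndexDoc> pair(boolean lowFirst) {
    IndexDoc lo = new IndexDoc("lo", 1f);
    IndexDoc hi = new IndexDoc("hi", 2f);
    return lowFirst ? Arrays.asList(lo, hi) : Arrays.asList(hi, lo);
  }

  public static void main(String[] args) {
    System.out.println(discarded(pair(true), false));   // [lo]
    System.out.println(discarded(pair(false), false));  // []
    System.out.println(discarded(pair(true), true));    // [lo]
    System.out.println(discarded(pair(false), true));   // [lo]
  }
}
```

Without the else branch the result depends on the order of the two docs; with it, both orders discard the lower-scored doc. The !byScore case would presumably need the same treatment.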