[jira] Commented: (NUTCH-835) document deduplication (exact duplicates) failed using MD5Signature
[ https://issues.apache.org/jira/browse/NUTCH-835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884630#action_12884630 ]

Andrzej Bialecki commented on NUTCH-835:
----------------------------------------

Sorry, I should've been more precise - I committed this to branch-1.2 as well (r95963).

> document deduplication (exact duplicates) failed using MD5Signature
> -------------------------------------------------------------------
>
>                 Key: NUTCH-835
>                 URL: https://issues.apache.org/jira/browse/NUTCH-835
>             Project: Nutch
>          Issue Type: Bug
>   Affects Versions: 1.0.0, 1.1
>         Environment: Linux, Ubuntu 10.04, Java 1.6.0_20
>            Reporter: Sebastian Nagel
>            Assignee: Andrzej Bialecki
>             Fix For: 1.2, 2.0
>
>
> The MD5Signature class calculates different signatures for identical documents.
> The reason is that
>   byte[] data = content.getContent();
>   ... StringBuilder().append(data) ...
> uses java.lang.Object.toString() to get a string representation of the (binary) content
> which results in unique hash codes (e.g., [...@30dc9065) even for two byte arrays
> with identical content.
> A solution would be to take the MD5 sum of the binary content as first part of the
> final signature calculation (the parsed content is the second part):
>   ... .append(StringUtil.toHexString(MD5Hash.digest(data).getDigest())).append(parse.getText());
> Of course, there are many other solutions...

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
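The behaviour described in the issue can be reproduced outside Nutch. The following is a minimal sketch, not the Nutch code: the class and method names are hypothetical, and it uses java.nio.charset.StandardCharsets, which assumes Java 7 or later rather than the Java 1.6 listed in the report. It shows that appending a byte[] to a StringBuilder resolves to append(Object), so the result is the array's type tag and identity hash (a string starting with "[B@") rather than anything derived from the bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteArrayToStringDemo {

    // Hypothetical helper mimicking the pattern quoted in the report.
    static String appendRaw(byte[] data) {
        // There is no StringBuilder.append(byte[]) overload, so this call
        // resolves to append(Object) and uses Object.toString() on the array.
        return new StringBuilder().append(data).toString();
    }

    public static void main(String[] args) {
        byte[] a = "same content".getBytes(StandardCharsets.UTF_8);
        byte[] b = "same content".getBytes(StandardCharsets.UTF_8);

        // The two arrays hold identical bytes.
        System.out.println(Arrays.equals(a, b)); // prints "true"

        // Each line starts with "[B@" followed by a per-object hash value,
        // so the two strings will normally differ between a and b.
        System.out.println(appendRaw(a));
        System.out.println(appendRaw(b));
    }
}
```

Hashing either of these strings therefore gives a signature tied to the array object, not to the document content.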
[jira] Commented: (NUTCH-835) document deduplication (exact duplicates) failed using MD5Signature
[ https://issues.apache.org/jira/browse/NUTCH-835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884624#action_12884624 ]

Julien Nioche commented on NUTCH-835:
-------------------------------------

This patch has been marked for 1.2 but has been committed to trunk only (2.0). Shall we also apply it to /nutch/branches/branch-1.2 ?
[jira] Commented: (NUTCH-835) document deduplication (exact duplicates) failed using MD5Signature
[ https://issues.apache.org/jira/browse/NUTCH-835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884540#action_12884540 ]

Hudson commented on NUTCH-835:
------------------------------

Integrated in Nutch-trunk #1195 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1195/])
NUTCH-835 Document deduplication failed using MD5Signature (Sebastian Nagel via ab)
[jira] Commented: (NUTCH-835) document deduplication (exact duplicates) failed using MD5Signature
[ https://issues.apache.org/jira/browse/NUTCH-835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884255#action_12884255 ]

Andrzej Bialecki commented on NUTCH-835:
----------------------------------------

Yes, this is a bug. In fact the implementation makes things even worse by appending the parsed text, contrary to its specification that says it should use just the raw content... I'll fix this shortly.
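For illustration, a signature computed from the raw content alone might look like the sketch below. This is not the committed patch: it uses the JDK's java.security.MessageDigest as a stand-in for Nutch's MD5Hash, and the class and method names are hypothetical. The point is only that the digest is taken over the bytes themselves, so identical content gives identical signatures.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class RawContentSignature {

    // Hypothetical stand-in for a signature over raw content only:
    // the MD5 digest of the bytes, with no parsed text mixed in.
    static byte[] signature(byte[] rawContent) {
        try {
            return MessageDigest.getInstance("MD5").digest(rawContent);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Lower-case hex rendering of a digest, for printing or comparison.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] a = "same content".getBytes(StandardCharsets.UTF_8);
        byte[] b = "same content".getBytes(StandardCharsets.UTF_8);

        // Distinct array objects with equal bytes yield equal signatures.
        System.out.println(Arrays.equals(signature(a), signature(b))); // prints "true"
        System.out.println(toHex(signature(a)));
    }
}
```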