[ https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864835#comment-13864835 ]
Sebastian Nagel commented on NUTCH-1693:
----------------------------------------

A useful improvement! Thanks!

{quote}
> both 2x and 1x should provide identical signature impl.
{quote}
Definitely.

Falling back to MD5 on raw content for empty documents may be useful, e.g. for PDFs with scanned images and no readable textual content: two binary-identical PDFs are then still deduplicated.

Regarding calculation of MD5 from text:
* [patch for trunk] String.getBytes() depends on the default encoding / locale. If it differs, e.g. between development and production environments, this may cause some headaches. We could either pass a fixed Charset (UTF-8) as a parameter to getBytes(...) or use MD5Hash.digest(String string), which encodes the string as UTF-8 before checksumming.
* [patch for 2x] instead of converting a UTF-8-encoded byte array to a Java String and back: MD5Hash.digest(page.getText().getBytes()) may be more efficient.

> TextMD5Signatue compute on textual content
> ------------------------------------------
>
>                 Key: NUTCH-1693
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1693
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Tien Nguyen Manh
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 2.3, 1.8
>
>         Attachments: NUTCH-1693-trunk.patch, NUTCH-1693.patch
>
>
> I create a new MD5Signature that is based on textual content. In our case we use
> boilerpipe to extract the main text from content, so this signature is more
> effective for deduplication.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
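
The encoding concern and the raw-content fallback from the comment can be illustrated with a minimal sketch. It uses only the JDK's MessageDigest in place of Hadoop's MD5Hash, and the class and method names (TextSignatureSketch, md5OfText, signature) are hypothetical. It is not the code from either attached patch.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration only; not the code from the NUTCH-1693 patches.
public class TextSignatureSketch {

    // MD5 over the text encoded with a fixed charset (UTF-8), so the result
    // does not depend on the JVM's default encoding or locale.
    public static byte[] md5OfText(String text) {
        return md5(text.getBytes(StandardCharsets.UTF_8));
    }

    // Use the extracted text when there is any; otherwise fall back to the
    // raw content bytes (e.g. a scanned PDF with no readable text).
    public static byte[] signature(String text, byte[] rawContent) {
        if (text == null || text.isEmpty()) {
            return md5(rawContent);
        }
        return md5OfText(text);
    }

    public static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Lower-case hex rendering of a digest, for printing and comparison.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // non-ASCII, so the charset matters
        System.out.println(hex(md5OfText(text)));
        // The same text in a different encoding yields a different digest:
        System.out.println(hex(md5(text.getBytes(StandardCharsets.ISO_8859_1))));
    }
}
```

With a fixed charset, the same text yields the same signature regardless of how the machine is configured, which is the property a deduplication signature needs.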