[ https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1693:
---------------------------------
    Attachment: NUTCH-1693-trunk.patch

Patch for trunk. This patch works identically to the original MD5Signature, except that it reads its bytes from the parse text. It does not fall back to MD5Signature.

> TextMD5Signatue compute on textual content
> ------------------------------------------
>
>                 Key: NUTCH-1693
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1693
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Tien Nguyen Manh
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 2.3, 1.8
>
>         Attachments: NUTCH-1693-trunk.patch, NUTCH-1693.patch
>
>
> I created a new MD5Signature that is based on textual content. In our case we
> use boilerpipe to extract the main text from the content, so this signature is
> more effective for deduplication.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
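
For readers without the patch at hand, the idea described above (an MD5 digest taken over the extracted text rather than over the raw fetched bytes) can be sketched in plain Java. This is a minimal illustration and not the code in NUTCH-1693-trunk.patch: the class name TextDigestSketch, the methods md5OfText and toHex, and the choice of UTF-8 are all assumptions, and no Nutch classes are used.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration only; not the class shipped in the NUTCH-1693 patch.
final class TextDigestSketch {

    private TextDigestSketch() {}

    /**
     * Returns the MD5 digest of the extracted (parse) text.
     * Assumption: the text is encoded as UTF-8 before hashing.
     * There is no fallback to other input; empty text hashes the empty byte array.
     */
    static byte[] md5OfText(String parseText) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return md5.digest(parseText.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Renders a digest as lowercase hexadecimal for logging or comparison. */
    static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }
}
```

Under these assumptions, two pages whose markup differs but whose extracted main text is identical produce the same signature, which is why a text-based digest suits deduplication after boilerpipe extraction.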