[ https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861420#comment-13861420 ]
Markus Jelsma commented on NUTCH-1693:
--------------------------------------

+1, but this should also be ported to trunk.

> TextMD5Signatue compute on textual content
> ------------------------------------------
>
> Key: NUTCH-1693
> URL: https://issues.apache.org/jira/browse/NUTCH-1693
> Project: Nutch
> Issue Type: New Feature
> Reporter: Tien Nguyen Manh
> Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1693.patch
>
>
> I created a new MD5Signature that is based on textual content. In our case we
> use boilerpipe to extract the main text from the content, so this signature is
> more effective for deduplication.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
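
For readers without the patch at hand, the idea described in the issue (hash the extracted text rather than the raw page bytes) can be sketched roughly as follows. This is not the code in NUTCH-1693.patch and it does not use Nutch's Signature API or boilerpipe; the class name TextSignatureSketch, its helper methods, and the whitespace normalization step are all assumptions made for illustration, and the input is assumed to be text that has already been extracted.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration only; not the class added by the patch.
public class TextSignatureSketch {

    // Assumption: collapse whitespace runs so that layout-only
    // differences in the extracted text do not change the digest.
    static String normalize(String text) {
        return text == null ? "" : text.trim().replaceAll("\\s+", " ");
    }

    // MD5 digest of the normalized, already-extracted main text.
    static byte[] signature(String extractedText) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return md5.digest(normalize(extractedText).getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Lowercase hex rendering of a digest, for printing and comparison.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two pages with the same main text but different whitespace.
        String pageA = "Main article text.";
        String pageB = "  Main   article\ntext. ";
        System.out.println(hex(signature(pageA)).equals(hex(signature(pageB))));
    }
}
```

Under these assumptions, two pages whose extracted text differs only in whitespace get the same signature, which is what would make a text-based signature more useful for deduplication than a digest of the full raw content.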