[ 
https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864835#comment-13864835
 ] 

Sebastian Nagel commented on NUTCH-1693:
----------------------------------------

A useful improvement! Thanks!
{quote}
> both 2x and 1x should provide identical signature impl.
{quote}
Definitely. Falling back to MD5 on raw content for empty documents may be 
useful, e.g., for PDFs with scanned images and no readable textual content: two 
binary identical PDFs are then still deduplicated.
Regarding calculation of MD5 from text:
* [patch for trunk] String.getBytes() depends on the default encoding / locale. 
If it differs, e.g., between development and production environments, this may 
cause some headaches. We could either pass a fixed Charset (UTF-8) as parameter 
to getBytes(...) or use MD5Hash.digest(String string), which encodes the string 
as UTF-8 before checksumming.
* [patch for 2x] instead of converting a UTF-8-encoded byte array to a Java 
String and back, MD5Hash.digest(page.getText().getBytes()) may be more efficient.
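
To illustrate the encoding concern from the first point, here is a minimal 
sketch using only the JDK (java.security.MessageDigest), not the patch itself 
and not Hadoop's MD5Hash. The class name TextDigestSketch and the helper 
md5Hex are made up for this example. It shows that the same text yields 
different MD5 values under different charsets, which is why a fixed charset 
keeps signatures stable across environments.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical demo class; not part of Nutch or Hadoop.
public class TextDigestSketch {

  /** Returns the MD5 of the text, encoded with an explicit charset, as lowercase hex. */
  static String md5Hex(String text, Charset charset) {
    try {
      MessageDigest md = MessageDigest.getInstance("MD5");
      // Explicit charset: the result does not depend on the JVM default encoding.
      byte[] digest = md.digest(text.getBytes(charset));
      StringBuilder hex = new StringBuilder(digest.length * 2);
      for (byte b : digest) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // MD5 is required to be available on every Java platform.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String text = "Z\u00fcrich"; // contains one non-ASCII character
    String utf8 = md5Hex(text, StandardCharsets.UTF_8);
    String latin1 = md5Hex(text, StandardCharsets.ISO_8859_1);
    System.out.println("UTF-8:      " + utf8);
    System.out.println("ISO-8859-1: " + latin1);
    System.out.println("same signature: " + utf8.equals(latin1));
  }
}
```

Pure ASCII text hashes identically under both charsets, so a mismatch of 
this kind would only show up for documents with non-ASCII characters.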

> TextMD5Signatue compute on textual content
> ------------------------------------------
>
>                 Key: NUTCH-1693
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1693
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Tien Nguyen Manh
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 2.3, 1.8
>
>         Attachments: NUTCH-1693-trunk.patch, NUTCH-1693.patch
>
>
> I created a new MD5Signature that is based on textual content. In our case we 
> use boilerpipe to extract the main text from the content, so this signature 
> is more effective for deduplication.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
