[ 
https://issues.apache.org/jira/browse/NUTCH-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2391:
-----------------------------------
    Description: 
We're seeing a large number of documents incorrectly marked as duplicates 
in our crawl.

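We traced it back to one of the crawl plugins returning an empty array for the 
content field. The MD5 digest of an empty array is always the same value, so 
every such document receives an identical signature.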

We'd like to propose changing the MD5 signature generation from:
{code}
public byte[] calculate(Content content, Parse parse) {
  byte[] data = content.getContent();
  if (data == null)
    data = content.getUrl().getBytes();
  return MD5Hash.digest(data).getDigest();
}
{code}
to:
{code}
public byte[] calculate(Content content, Parse parse) {
  byte[] data = content.getContent();
  if ((data == null) || (data.length == 0))
    data = content.getUrl().getBytes();
  return MD5Hash.digest(data).getDigest();
}
{code}
to address the issue.
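
The effect of the proposed check can be reproduced outside Nutch. The following is only a minimal sketch under stated assumptions: it uses the JDK's `MessageDigest` as a stand-in for `MD5Hash.digest(data).getDigest()`, takes the content bytes and URL as plain parameters instead of a `Content` object, and encodes the URL as UTF-8 (the snippets above use the platform default charset). The class and method names (`SignatureSketch`, `signatureBefore`, `signatureAfter`) are hypothetical and not part of Nutch.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical stand-alone illustration; not the Nutch MD5Signature class.
final class SignatureSketch {

    // Stand-in for MD5Hash.digest(data).getDigest(), using only the JDK.
    static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Mirrors the current check: only null content falls back to the URL.
    static byte[] signatureBefore(byte[] content, String url) {
        byte[] data = content;
        if (data == null)
            data = url.getBytes(StandardCharsets.UTF_8); // assumption: UTF-8
        return md5(data);
    }

    // Mirrors the proposed check: null or empty content falls back to the URL.
    static byte[] signatureAfter(byte[] content, String url) {
        byte[] data = content;
        if (data == null || data.length == 0)
            data = url.getBytes(StandardCharsets.UTF_8); // assumption: UTF-8
        return md5(data);
    }

    public static void main(String[] args) {
        byte[] empty = new byte[0];
        String a = "http://example.com/a";
        String b = "http://example.com/b";

        // Two different URLs with empty content share one signature today.
        System.out.println(Arrays.equals(
            signatureBefore(empty, a), signatureBefore(empty, b))); // prints "true"

        // With the proposed check the signatures differ.
        System.out.println(Arrays.equals(
            signatureAfter(empty, a), signatureAfter(empty, b)));   // prints "false"
    }
}
```

Documents with non-empty content are unaffected, since both variants hash the content bytes in that case.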

  was:
We're seeing some incidence of a large number of documents being marked as 
duplicate in our crawl.

We traced it back to one of the crawl plugins returning an empty array for the 
content field.

We'd like to propose changing the MD5 signature generation from:
public byte[] calculate(Content content, Parse parse) {
    byte[] data = content.getContent();
    if (data == null)
      data = content.getUrl().getBytes();
    return MD5Hash.digest(data).getDigest();
  }

to:
public byte[] calculate(Content content, Parse parse) {
    byte[] data = content.getContent();
    if ((data == null) || (data.length == 0))
      data = content.getUrl().getBytes();
    return MD5Hash.digest(data).getDigest();
  }

to address the issue


> Spurious Duplications for MD5
> -----------------------------
>
>                 Key: NUTCH-2391
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2391
>             Project: Nutch
>          Issue Type: Bug
>          Components: commoncrawl
>    Affects Versions: 1.11
>            Reporter: David Johnson
>            Priority: Minor
>             Fix For: 1.14



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
