Hi Markus,

I am using the default one, which I believe is MD5. I did not change anything
in this part.
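
For reference, here is a rough sketch of how an exact-content signature such as MD5 behaves. This is not Nutch code: the function names, URLs, and page contents are made up for illustration, and it assumes signatures are compared across the whole index rather than per site. Under that assumption, a site's only page would be dropped whenever its bytes match a page seen elsewhere, with no digest collision involved.

```python
import hashlib
from typing import Dict


def md5_signature(content: bytes) -> str:
    """Return the MD5 hex digest of the raw page content."""
    return hashlib.md5(content).hexdigest()


def deduplicate(pages: Dict[str, bytes]) -> Dict[str, bytes]:
    """Keep only the first URL seen for each distinct content signature."""
    kept = {}
    seen = set()
    for url, content in pages.items():
        sig = md5_signature(content)
        if sig in seen:
            # Same bytes as an earlier page: treated as a duplicate.
            continue
        seen.add(sig)
        kept[url] = content
    return kept


# Hypothetical index: two different sites serve an identical placeholder page.
pages = {
    "http://site-a.example/": b"<html>Under construction</html>",
    "http://site-a.example/about": b"<html>About us</html>",
    "http://site-b.example/": b"<html>Under construction</html>",
}

kept = deduplicate(pages)
for url in kept:
    print(url)
```

In this sketch the root page of site-b.example disappears because its content is byte-identical to the root page of site-a.example. Which copy survives depends on the order in which pages are visited here; a real de-duplication job may choose differently.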

Regards,

Arkadi

> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Wednesday, 2 November 2011 6:34 PM
> To: [email protected]
> Subject: Re: De-duplication seems to work too aggressively
> 
> Hi
> 
> You didn't mention the signature algorithm you're using.
> 
> Thanks
> 
> > Hi,
> >
> > I stopped using de-duplication in Nutch 0.9-1.2 versions because too many
> > URLs were being removed for no apparent reason. I did not report the
> > problem to the list though. I am working with version 1.4 now, tried
> > de-duplication again, and the problem appears to be still there. There are
> > significant numbers of URLs being removed when de-duplication is applied.
> > I could blame it on duplicated content, but it is hard to believe that so
> > much is duplicated. One small site is represented by 1639 URLs in the
> > index, and this number goes down to 1068 after de-duplication is done. OK,
> > theoretically, this can happen, but, here is another example. Another site
> > has just one (root) page in the index. This entry gets removed by
> > de-duplication. How can this happen? There can be a collision in digests,
> > but this is hard to believe, especially given other suspicious phenomena.
> >
> > I am not going to use de-duplication anyway, because duplicated entries may
> > exist in Arch index for a valid reason (e.g. different owners). However,
> > it seems that I have a good case that could help to pinpoint the problem,
> > if it indeed exists. If anyone would want to do it, I am happy to help.
> >
> > Regards,
> >
> > Arkadi
