Hi

You didn't mention the signature algorithm you're using.
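
It matters because an exact-content digest (MD5 over the page content, for instance) treats any two byte-identical pages as duplicates, while a fuzzier text-profile signature can also merge pages that are merely similar. Below is a minimal sketch of digest-based de-duplication, not Nutch's actual code. It assumes the grouping is done across the whole index rather than per site; the function name, the URLs, and the "first seen wins" tie-break are all hypothetical.

```python
import hashlib
from collections import defaultdict


def dedup_by_digest(docs):
    """Keep one URL per distinct content digest; return (kept, removed)."""
    groups = defaultdict(list)
    for url, content in docs:
        # Exact-content signature: identical bytes give an identical digest.
        digest = hashlib.md5(content.encode("utf-8")).hexdigest()
        groups[digest].append(url)

    kept, removed = [], []
    for urls in groups.values():
        kept.append(urls[0])      # hypothetical tie-break: first seen wins
        removed.extend(urls[1:])  # every other URL with the same digest goes
    return kept, removed


# Hypothetical index: site B has only a root page, identical to site A's.
docs = [
    ("http://site-a.example/", "<html>Default placeholder page</html>"),
    ("http://site-a.example/about", "<html>About A</html>"),
    ("http://site-b.example/", "<html>Default placeholder page</html>"),
]
kept, removed = dedup_by_digest(docs)
print(removed)
```

Under these assumptions, a site represented by a single root page can disappear from the index without any digest collision, simply because the same content was already seen under another URL.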

Thanks

> Hi,
> 
> I stopped using de-duplication in Nutch versions 0.9 through 1.2 because too
> many URLs were being removed for no apparent reason. I did not report the
> problem to the list, though. I am now working with version 1.4, tried
> de-duplication again, and the problem still appears to be there. A
> significant number of URLs are removed when de-duplication is applied.
> I could blame it on duplicated content, but it is hard to believe that so
> much is duplicated. One small site is represented by 1639 URLs in the
> index, and this number goes down to 1068 after de-duplication is done. OK,
> theoretically, this can happen, but, here is another example. Another site
> has just one (root) page in the index. This entry gets removed by
> de-duplication. How can this happen? There can be a collision in digests,
> but this is hard to believe, especially given other suspicious phenomena.
> 
> I am not going to use de-duplication anyway, because duplicated entries may
> exist in the Arch index for a valid reason (e.g. different owners). However,
> it seems that I have a good case that could help to pinpoint the problem,
> if it indeed exists. If anyone wants to look into it, I am happy to help.
> 
> Regards,
> 
> Arkadi
