Hi,

You didn't mention which signature algorithm you're using.
Thanks

> Hi,
>
> I stopped using de-duplication in Nutch 0.9-1.2 versions because too many
> URLs were being removed for no apparent reason. I did not report the
> problem to the list though. I am working with version 1.4 now, tried
> de-duplication again, and the problem appears to be still there. There are
> significant numbers of URLs being removed when de-duplication is applied.
> I could blame it on duplicated content, but it is hard to believe that so
> much is duplicated. One small site is represented by 1639 URLs in the
> index, and this number goes down to 1068 after de-duplication is done. OK,
> theoretically, this can happen, but, here is another example. Another site
> has just one (root) page in the index. This entry gets removed by
> de-duplication. How can this happen? There can be a collision in digests,
> but this is hard to believe, especially given other suspicious phenomena.
>
> I am not going to use de-duplication anyway, because duplicated entries may
> exist in Arch index for a valid reason (e.g. different owners). However,
> it seems that I have a good case that could help to pinpoint the problem,
> if it indeed exists. If anyone would want to do it, I am happy to help.
>
> Regards,
>
> Arkadi
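
The signature question matters because de-duplication by digest removes every entry whose signature matches one that is kept, across sites as well as within a site. In Nutch the implementation is selected through the db.signature.class property (for example MD5Signature or TextProfileSignature). Below is a minimal sketch of the general idea only, not Nutch's actual code: the helper names, the example URLs, and the "keep the first URL in sorted order" rule are all assumptions made for illustration. It shows one way a lone root page could disappear without any real hash collision: if two pages both yield empty content, an exact-content digest is identical for both.

```python
import hashlib
from collections import defaultdict


def md5_signature(content: bytes) -> str:
    """Hex MD5 of the raw content (a stand-in for an exact-match signature)."""
    return hashlib.md5(content).hexdigest()


def dedup_by_signature(pages):
    """Group URLs by signature and keep one URL per group.

    Hypothetical rule: the first URL in sorted order is kept.
    Returns (kept, removed) as two lists of URLs.
    """
    groups = defaultdict(list)
    for url in sorted(pages):
        groups[md5_signature(pages[url])].append(url)
    kept = [urls[0] for urls in groups.values()]
    removed = [url for urls in groups.values() for url in urls[1:]]
    return kept, removed


# Hypothetical index: the root pages of two unrelated sites both produced
# empty content (for instance a redirect or a page that failed to parse).
pages = {
    "http://site-a.example/": b"",
    "http://site-b.example/": b"",
    "http://site-a.example/about": b"About us",
}

kept, removed = dedup_by_signature(pages)
print("kept:", kept)
print("removed:", removed)
```

If something like this is what is happening, listing the removed URLs together with their signatures and checking whether many of them share one value would be a quick way to confirm it.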

