Hi Markus,

The default one, which, I believe, is MD5. I did not change anything in this part.

Regards,
Arkadi

> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Wednesday, 2 November 2011 6:34 PM
> To: [email protected]
> Subject: Re: De-duplication seems to work too aggressively
>
> Hi
>
> You didn't mention the signature algorithm you're using.
>
> Thanks
>
> > Hi,
> >
> > I stopped using de-duplication in Nutch 0.9-1.2 versions because too many
> > URLs were being removed for no apparent reason. I did not report the
> > problem to the list though. I am working with version 1.4 now, tried
> > de-duplication again, and the problem appears to be still there. There are
> > significant numbers of URLs being removed when de-duplication is applied.
> > I could blame it on duplicated content, but it is hard to believe that so
> > much is duplicated. One small site is represented by 1639 URLs in the
> > index, and this number goes down to 1068 after de-duplication is done. OK,
> > theoretically, this can happen, but, here is another example. Another site
> > has just one (root) page in the index. This entry gets removed by
> > de-duplication. How can this happen? There can be a collision in digests,
> > but this is hard to believe, especially given other suspicious phenomena.
> >
> > I am not going to use de-duplication anyway, because duplicated entries may
> > exist in Arch index for a valid reason (e.g. different owners). However,
> > it seems that I have a good case that could help to pinpoint the problem,
> > if it indeed exists. If anyone would want to do it, I am happy to help.
> >
> > Regards,
> >
> > Arkadi

