Hi,

I stopped using de-duplication in Nutch versions 0.9 through 1.2 because too many URLs 
were being removed for no apparent reason. I did not report the problem to the 
list at the time. I am now working with version 1.4, tried de-duplication again, and 
the problem still appears to be there. A significant number of URLs 
are removed when de-duplication is applied. I could blame it on duplicated 
content, but it is hard to believe that so much is duplicated. One small site 
is represented by 1639 URLs in the index, and this number drops to 1068 
after de-duplication (571 removed, about 35%). In theory this can happen, but here is 
another example: a second site has just one (root) page in the index, and this entry 
is removed by de-duplication. How can this happen? There could be a collision 
in digests, but this is hard to believe, especially given other suspicious 
phenomena.
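
To make the digest point concrete, here is a minimal sketch of content-digest de-duplication. It is not Nutch's actual code: it assumes de-duplication keys on an MD5 digest of the page text and keeps the first URL seen per digest. The function names `group_by_digest` and `removed_by_dedup` and the sample URLs are made up for illustration.

```python
import hashlib
from collections import defaultdict


def group_by_digest(pages):
    """Map each content digest to the list of URLs that share it."""
    groups = defaultdict(list)
    for url, content in pages.items():
        # Assumption: the digest is MD5 over the page text.
        digest = hashlib.md5(content.encode("utf-8")).hexdigest()
        groups[digest].append(url)
    return dict(groups)


def removed_by_dedup(pages):
    """Return the URLs dropped when only the first URL per digest is kept."""
    removed = []
    for urls in group_by_digest(pages).values():
        removed.extend(urls[1:])
    return removed


# Hypothetical sample data: two URLs serve identical content.
pages = {
    "http://example.org/": "welcome",
    "http://example.org/index.html": "welcome",
    "http://example.net/": "something else",
}

for url in removed_by_dedup(pages):
    print("would be removed:", url)
```

Under these assumptions, an entry is dropped only when some other indexed page has the same digest, so listing the URLs in each digest group would show which page a removed entry was matched against.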

I am not going to use de-duplication anyway, because duplicate entries may 
legitimately exist in the Arch index (e.g. pages with different owners). However, it 
seems that I have a good test case that could help pinpoint the problem, if it 
indeed exists. If anyone wants to investigate, I am happy to help.

Regards,

Arkadi
