An alternative to Nutch de-duplication (which I've found to fail when
multiple document sources don't provide a 'digest' field in the SOLR
index) is to have SOLR detect duplicates at update time - I've done this
with a SOLR plugin.
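For reference, Solr ships an update processor for exactly this. Below is a minimal sketch of a dedup update chain in solrconfig.xml using the built-in SignatureUpdateProcessorFactory; the chain name and the field names ("url", "content", "signatureField") are illustrative assumptions, not necessarily what your schema uses:

```xml
<!-- Sketch only: update-time de-duplication via Solr's built-in
     SignatureUpdateProcessorFactory. Field names are assumptions. -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- Field that stores the computed signature (must exist in schema) -->
    <str name="signatureField">signatureField</str>
    <!-- Delete earlier documents that share the same signature -->
    <bool name="overwriteDupes">true</bool>
    <!-- Fields the signature is computed from -->
    <str name="fields">url,content</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

You then point your update handler (or the update request's update.chain parameter) at the "dedupe" chain so signatures are computed as documents are indexed.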

On Wed, Nov 2, 2011 at 5:38 PM, <[email protected]> wrote:

> Hi,
>
> I stopped using de-duplication in Nutch 0.9-1.2 versions because too many
> URLs were being removed for no apparent reason. I did not report the
> problem to the list though. I am working with version 1.4 now, tried
> de-duplication again, and the problem appears to be still there. There are
> significant numbers of URLs being removed when de-duplication is applied. I
> could blame it on duplicated content, but it is hard to believe that so
> much is duplicated. One small site is represented by 1639 URLs in the
> index, and this number goes down to 1068 after de-duplication is done. OK,
> theoretically, this can happen, but, here is another example. Another site
> has just one (root) page in the index. This entry gets removed by
> de-duplication. How can this happen? There can be a collision in digests,
> but this is hard to believe, especially given other suspicious phenomena.
>
> I am not going to use de-duplication anyway, because duplicated entries
> may exist in the Arch index for a valid reason (e.g. different owners).
> However, it seems that I have a good case that could help to pinpoint the
> problem, if it indeed exists. If anyone wants to investigate it, I am happy
> to help.
>
> Regards,
>
> Arkadi
>
>
