Hi Richard,

Yes is the simple answer. We have been aware for some time that dedup is broken in nutchgora [1]; Markus has also reported an issue with current trunk development [2]. Could you please review these and comment if you can reproduce the problem, or alternatively browse through our indexer issues [3] and comment accordingly? A patch would be very welcome.

Thank you
[1] https://issues.apache.org/jira/browse/NUTCH-992
[2] https://issues.apache.org/jira/browse/NUTCH-1100
[3] https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+NUTCH+AND+resolution+%3D+Unresolved+AND+component+%3D+indexer+ORDER+BY+priority+DESC&mode=hide

On Sun, Oct 9, 2011 at 9:35 PM, Rich d'Rich <[email protected]> wrote:

> >>Dedup will not work without digest field. Perhaps we can extend solrdedup
> >>so it skips all documents
> >>with a digest field. Will that work for you?
> >You mean skip all documents *without* a digest field?
> >Yes, that would work.
> >But wouldn't it be better for performance reasons to query only against
> >documents with the field already compiled?
>
> I'm getting this issue as well - we've got a heterogenous SOLR index with
> various sources apart from Nutch, and the lack of a digest field crashes
> dedup when it hits a non-Nutch doc, as described by Matthias.
>
> Is there an issue logged for this? I might be making a patch just to keep
> us going.
>
> --
> Richard

--
*Lewis*

