RE: Solr Deduplication and Field Collapsing
You could create a custom update processor that adds a digest field for newly added documents that do not have the digest field themselves. This way, the documents that are not added by Nutch get a proper non-empty digest field, so the deduplication processor won't create the same empty hash and overwrite those. Or you could extend org.apache.solr.update.processor.SignatureUpdateProcessorFactory so that it skips documents with an empty digest field. I'd think the latter would be the quickest route, but correct me if I'm wrong. (A rough sketch of the first approach follows below the quoted message.)

Cheers,

-----Original message-----
From: Nemani, Raj raj.nem...@turner.com
Sent: Tue 28-09-2010 23:28
To: solr-user@lucene.apache.org
Subject: Solr Deduplication and Field Collapsing

All,

I have set up Nutch to submit the crawl results to the Solr index. I have some duplicates in the documents generated by the Nutch crawl. There is a field 'digest' that Nutch generates that is the same for those documents that are duplicates. While setting up the dedupe processor in the Solr config file, I have used this 'digest' field in the following way (see below for config details). Since my index has documents other than the ones generated by Nutch, I cannot use overwriteDupes=true, because for non-Nutch-generated documents the digest field will not be populated, and I found that Solr deletes every one of those documents that do not have the digest field populated. Probably because they all will have the same 'sig' field value, generated from an 'empty' digest field, forcing Solr to delete everything?

In any case, given that scenario, I thought I would set overwriteDupes=false and use field collapsing based on the digest or sig field, but I could not get field collapsing to work. Based on the wiki documentation, I was adding the query string group=true&group.field=sig (or group=true&group.field=digest) to my overall query in the admin console, and I still got the duplicate documents in the results. Is there anything special I need to do to get field collapsing working? I am running Solr 1.4.

All this is because Nutch thinks that http://mysite.mydomain.com/index.html and http://mysite/index.html are different documents, depending on how the link is set up (the URL *is* the unique id for the Nutch document; the difference is only in the alias, and for an internal site both are valid). This is the reason for me to try deduplication. I cannot submit the SolrDedup command from Nutch because non-Nutch-generated documents do not have the digest field populated, and I read on the mailing lists that this will cause the SolrDedup initiated from Nutch to fail. This forced me to try deduplication on the Solr side.

Thanks so much in advance for your help.

Here is my configuration:

SolrConfig.xml:

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">sig</str>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    <str name="fields">digest</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>

Schema.xml:

<field name="sig" type="string" stored="true" indexed="true" multiValued="true" />

Thanks so much for your help
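For the first suggestion (an update processor that back-fills the digest), a minimal, untested sketch against the Solr 1.4 update-processor API might look like the following. The package, class name, and the choice of the "id" field as the fallback value are my own assumptions, not something from the thread:

// Back-fills "digest" when a document arrives without one, so the signature
// processor never hashes an empty value. Untested sketch; package, class name,
// and the "id" fallback field are assumptions.
package com.example.solr;

import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
// Note: SolrQueryResponse lives in org.apache.solr.request in Solr 1.4;
// it moved to org.apache.solr.response in later releases.
import org.apache.solr.request.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class FillDigestUpdateProcessorFactory extends UpdateRequestProcessorFactory {

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                            SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object digest = doc.getFieldValue("digest");
        if (digest == null || digest.toString().trim().length() == 0) {
          // Fall back to the unique key so every non-Nutch document gets a
          // distinct, non-empty digest and therefore a distinct signature.
          Object id = doc.getFieldValue("id");
          if (id != null) {
            doc.setField("digest", id.toString());
          }
        }
        super.processAdd(cmd);
      }
    };
  }
}

Such a factory would be registered in the dedupe updateRequestProcessorChain before the SignatureUpdateProcessorFactory, so the signature is always computed from a non-empty value.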
Re: Solr Deduplication and Field Collapsing
I have the digest field already in the schema because the index is shared between Nutch docs and others. I do not know if the second approach is the quickest in my case. I can set the digest value to something unique for non-Nutch documents easily (I have an id field that I can use to populate the digest field during indexing of new non-Nutch documents; I have a custom tool that does the indexing of these docs). But I have more than 3 million documents in the index already, and I don't want to start over with new indexing again if I don't have to. Is there a way I can update the digest field with the value from the corresponding id field using Solr?

Thanks,
Raj

----- Original Message -----
From: Markus Jelsma markus.jel...@buyways.nl
To: solr-user@lucene.apache.org
Sent: Tue Sep 28 18:19:17 2010
Subject: RE: Solr Deduplication and Field Collapsing

[...]
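For what it's worth, Solr 1.4 has no partial updates, so back-filling digest on the existing documents would mean reading each one back and re-adding it in full, which only works if all the fields you need are stored. A rough, untested SolrJ sketch of such a one-off job follows; the server URL, page size, and the "id"/"digest" field names are assumptions rather than details from the thread:

// One-off back-fill: copy the unique id into "digest" for every document that
// has no digest yet. Untested sketch; assumes all fields are stored (re-adding
// a document drops anything indexed but not stored) and the unique key is "id".
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.util.ClientUtils;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class BackfillDigest {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    int rows = 500;
    long total = Long.MAX_VALUE;

    // No commit happens inside the loop, so the query's result set stays stable
    // and plain start/rows paging visits each matching document once.
    for (int start = 0; start < total; start += rows) {
      SolrQuery q = new SolrQuery("*:* -digest:[* TO *]");
      q.setStart(start);
      q.setRows(rows);
      QueryResponse rsp = solr.query(q);
      total = rsp.getResults().getNumFound();

      for (SolrDocument found : rsp.getResults()) {
        SolrInputDocument doc = ClientUtils.toSolrInputDocument(found);
        // Copy the unique key into digest so the dedupe signature is non-empty
        // and unique for this document.
        doc.setField("digest", found.getFieldValue("id"));
        solr.add(doc);
      }
    }
    solr.commit();
  }
}

Each re-added document goes back through the dedupe chain, so the sig field would be recomputed from the freshly filled digest.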