RE: Solr Deduplication and Field Collapsing
You could create a custom update processor that adds a digest field for newly added documents that do not have the digest field themselves. This way, the documents that are not added by Nutch get a proper non-empty digest field, so the deduplication processor won't create the same empty hash and overwrite those. Or you could extend org.apache.solr.update.processor.SignatureUpdateProcessorFactory so that it skips documents with an empty digest field. I'd think the latter would be the quickest route, but correct me if I'm wrong. (A rough sketch of the first approach follows below the quoted message.)

Cheers,

-----Original message-----
From: Nemani, Raj raj.nem...@turner.com
Sent: Tue 28-09-2010 23:28
To: solr-user@lucene.apache.org
Subject: Solr Deduplication and Field Collapsing

All,

I have set up Nutch to submit the crawl results to the Solr index. I have some duplicates in the documents generated by the Nutch crawl. There is a field 'digest' that Nutch generates that is the same for those documents that are duplicates. While setting up the dedupe processor in the Solr config file, I have used this 'digest' field in the following way (see below for config details). Since my index has documents other than the ones generated by Nutch, I cannot use overwriteDupes=true, because for non-Nutch-generated documents the digest field will not be populated, and I found that Solr deletes every one of those documents that do not have the digest field populated. Probably because they all will have the same 'sig' field value, generated from an 'empty' digest field, forcing Solr to delete everything?

In any case, given that scenario, I thought I would set overwriteDupes=false and use field collapsing based on the digest or sig field, but I could not get field collapsing to work. Based on the wiki documentation, I was adding the query string group=true&group.field=sig (or group=true&group.field=digest) to my overall query in the admin console, and I still got the duplicate documents in the results. Is there anything special I need to do to get field collapsing working? I am running Solr 1.4.

All this is because Nutch thinks that http://mysite.mydomain.com/index.html and http://mysite/index.html are different documents, depending on how the link is set up (the URL *is* the unique id for the Nutch document; the difference is only in the alias, and for an internal site both are valid). This is the reason for me to try deduplication. I cannot submit the SolrDedup command from Nutch because non-Nutch-generated documents do not have the digest field populated, and I read on the mailing lists that this will cause the SolrDedup initiated from Nutch to fail. This forced me to try deduplication on the Solr side.

Thanks so much in advance for your help.

Here is my configuration:

SolrConfig.xml:

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">sig</str>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    <str name="fields">digest</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>

Schema.xml:

<field name="sig" type="string" stored="true" indexed="true" multiValued="true" />

Thanks so much for your help
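For the first suggestion (an update processor that back-fills the digest), a minimal, untested sketch against the Solr 1.4 update-processor API might look like the following. The package, class name, and the choice of the "id" field as the fallback value are my own assumptions, not something from the thread:

// Back-fills "digest" when a document arrives without one, so the signature
// processor never hashes an empty value. Untested sketch; package, class name,
// and the "id" fallback field are assumptions.
package com.example.solr;

import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
// Note: SolrQueryResponse lives in org.apache.solr.request in Solr 1.4;
// it moved to org.apache.solr.response in later releases.
import org.apache.solr.request.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class FillDigestUpdateProcessorFactory extends UpdateRequestProcessorFactory {

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                            SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object digest = doc.getFieldValue("digest");
        if (digest == null || digest.toString().trim().length() == 0) {
          // Fall back to the unique key so every non-Nutch document gets a
          // distinct, non-empty digest and therefore a distinct signature.
          Object id = doc.getFieldValue("id");
          if (id != null) {
            doc.setField("digest", id.toString());
          }
        }
        super.processAdd(cmd);
      }
    };
  }
}

Such a factory would be registered in the dedupe updateRequestProcessorChain before the SignatureUpdateProcessorFactory, so the signature is always computed from a non-empty value.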
Re: Solr Deduplication and Field Collapsing
I have the digest field already in the schema because the index is shared between Nutch docs and others. I do not know if the second approach is the quickest in my case. I can set the digest value to something unique for non-Nutch documents easily (I have an id field that I can use to populate the digest field during indexing of new non-Nutch documents; I have a custom tool that does the indexing of these docs). But I have more than 3 million documents in the index already, and I don't want to start over with new indexing again if I don't have to. Is there a way I can update the digest field with the value from the corresponding id field using Solr?

Thanks,
Raj

----- Original Message -----
From: Markus Jelsma markus.jel...@buyways.nl
To: solr-user@lucene.apache.org
Sent: Tue Sep 28 18:19:17 2010
Subject: RE: Solr Deduplication and Field Collapsing

[...]
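For what it's worth, Solr 1.4 has no partial updates, so back-filling digest on the existing documents would mean reading each one back and re-adding it in full, which only works if all the fields you need are stored. A rough, untested SolrJ sketch of such a one-off job follows; the server URL, page size, and the "id"/"digest" field names are assumptions rather than details from the thread:

// One-off back-fill: copy the unique id into "digest" for every document that
// has no digest yet. Untested sketch; assumes all fields are stored (re-adding
// a document drops anything indexed but not stored) and the unique key is "id".
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.util.ClientUtils;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class BackfillDigest {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    int rows = 500;
    long total = Long.MAX_VALUE;

    // No commit happens inside the loop, so the query's result set stays stable
    // and plain start/rows paging visits each matching document once.
    for (int start = 0; start < total; start += rows) {
      SolrQuery q = new SolrQuery("*:* -digest:[* TO *]");
      q.setStart(start);
      q.setRows(rows);
      QueryResponse rsp = solr.query(q);
      total = rsp.getResults().getNumFound();

      for (SolrDocument found : rsp.getResults()) {
        SolrInputDocument doc = ClientUtils.toSolrInputDocument(found);
        // Copy the unique key into digest so the dedupe signature is non-empty
        // and unique for this document.
        doc.setField("digest", found.getFieldValue("id"));
        solr.add(doc);
      }
    }
    solr.commit();
  }
}

Each re-added document goes back through the dedupe chain, so the sig field would be recomputed from the freshly filled digest.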