Hi,
I am trying to use the dedupe feature to detect and mark near-duplicate
content in my collections.
I don't want to prevent duplicate content. I would like to detect it and
keep it for further processing. That's why I'm using an extra field and
not the document's unique key field.
Here is how I added it to solrconfig.xml:
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">fill_signature</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="fill_signature" processor="signature">
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<updateProcessor class="solr.processor.SignatureUpdateProcessorFactory"
                 name="signature">
  <bool name="enabled">true</bool>
  <str name="signatureField">signature</str>
  <bool name="overwriteDupes">false</bool>
  <str name="fields">content</str>
  <str name="signatureClass">solr.processor.TextProfileSignature</str>
  <str name="quantRate">.2</str>
  <str name="minTokenLen">3</str>
</updateProcessor>
When I initially add the documents to the cloud, everything works as
expected: the documents are added and the signature is created and
stored. Perfect :)
The problem occurs when I want to update an existing document. In that
case the update.chain=fill_signature parameter is of course set too,
and I get a bad request error.
I found this solr issue: https://issues.apache.org/jira/browse/SOLR-3473
Is that the problem I am running into?
Is it somehow possible to add parameters or select a specific update
handler when I'm adding documents to the cloud using SolrJ?
In that case I could either set update.chain manually and remove it
from the request handler, or write a second request handler that I only
use when I want to set the signature field.
I know I can do that manually when I'm using e.g. curl, but is it also
possible with SolrJ? :)
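To make the second option concrete, this is roughly what I have in mind
for solrconfig.xml. It is only an untested sketch: the handler name
/update/signature is a placeholder I made up, and I am assuming that the
plain /update handler would then skip the signature chain because it no
longer sets update.chain as a default.

```xml
<!-- Plain handler: no update.chain default, intended for updates of
     existing documents (assumption: the signature chain is then skipped) -->
<requestHandler name="/update" class="solr.UpdateRequestHandler" />

<!-- Hypothetical second handler (placeholder name) that always applies
     the fill_signature chain; only used when the signature should be set -->
<requestHandler name="/update/signature" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">fill_signature</str>
  </lst>
</requestHandler>
```

The fill_signature chain and the signature processor definition would
stay exactly as shown above.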
Thanks,
Markus