Re: Is deduplication possible during Tika extract?

Markus Jelsma Mon, 17 Jan 2011 16:44:26 -0800

In my opinion it should work for every update handler. If you're really sure 
your configuration is fine and it still doesn't work, you might have to file an 
issue.


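Your configuration looks alright, but don't forget you've configured 
overwriteDupes=false! With that setting the signature is still computed and 
stored, but existing documents with the same signature are not removed.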
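
If the goal is for duplicates to be replaced at index time, a minimal sketch of 
the same chain with overwriteDupes switched to true might look like the 
following. It reuses the names from the quoted solrconfig.xml and assumes the 
"signature" field is defined as an indexed field in schema.xml, which the 
excerpt does not show.

```xml
<!-- Sketch only: same chain as in the quoted solrconfig.xml, with overwriteDupes set to true -->
<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- Assumption: "signature" is declared as an indexed field in schema.xml -->
    <str name="signatureField">signature</str>
    <!-- true: documents already indexed with the same signature are replaced -->
    <bool name="overwriteDupes">true</bool>
    <str name="fields">text</str>
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

Whether this also takes effect through /update/extract depends on the chain 
actually being applied by that handler, which is the open question here.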

> Hello,
> 
> here is an excerpt of my solrconfig.xml:
> 
> <requestHandler name="/update/extract"
> class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
> startup="lazy">
> <lst name="defaults">
> 
> <str name="update.processor">dedupe</str>
> 
> <!-- All the main content goes into "text"... if you need to return
>             the extracted text or do highlighting, use a stored field. -->
> <str name="fmap.content">text</str>
> <str name="lowernames">true</str>
> <str name="uprefix">ignored_</str>
> 
> <!-- capture link hrefs but ignore div attributes -->
> <str name="captureAttr">true</str>
> <str name="fmap.a">links</str>
> <str name="fmap.div">ignored_</str>
> </lst>
> </requestHandler>
> 
> and
> 
> <updateRequestProcessorChain name="dedupe">
> <processor
> class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
> <bool name="enabled">true</bool>
> <str name="signatureField">signature</str>
> <bool name="overwriteDupes">false</bool>
> <str name="fields">text</str>
> <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
> </processor>
> <processor class="solr.LogUpdateProcessorFactory" />
> <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
> 
> deduplication works when I use only "/update" but not when solr does an
> extract with Tika!
> Is deduplication possible during Tika extract?
> 
> Thanks in advance,
> Arno
