Re: Bypassing ExtractingRequestHandler

Erick Erickson Sun, 12 Jun 2016 13:34:48 -0700

Two things. First, here's a sample bit of SolrJ code; pulling out
the DB stuff should be straightforward:
http://searchhub.org/2012/02/14/indexing-with-solrj/


It's a little out of date, but not by much. CloudSolrServer,
mentioned in one of the comments, has been deprecated in
favor of CloudSolrClient; similarly, StreamingUpdateSolrServer
is now ConcurrentUpdateSolrClient.
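
For anyone who wants the shape of the "extract outside Solr, then
send plain fields" approach without SolrJ, here is a minimal sketch
over plain HTTP in Python. It is not the code from the post above.
It assumes a stock Apache Tika server listening on localhost:9998
and a Solr core named "docs" with a field called "text"; those URLs,
the core name, and the field name are placeholders of mine, not
anything from this thread.

```python
import json
import urllib.request

# Assumed endpoints: a stock Apache Tika server and a Solr core named "docs".
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/docs/update?commit=true"


def build_tika_request(pdf_bytes, tika_url=TIKA_URL):
    """Build a PUT request asking the Tika server for plain text."""
    return urllib.request.Request(
        tika_url,
        data=pdf_bytes,
        method="PUT",
        headers={"Accept": "text/plain", "Content-Type": "application/pdf"},
    )


def build_solr_request(doc_id, text, extra_fields=None, solr_url=SOLR_UPDATE_URL):
    """Build a POST request that adds one document via Solr's JSON update API."""
    doc = {"id": doc_id, "text": text}  # "text" is a hypothetical field name
    doc.update(extra_fields or {})      # e.g. extra data pulled from the PDF
    return urllib.request.Request(
        solr_url,
        data=json.dumps([doc]).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def index_pdf(path, doc_id, timeout=60):
    """Extract text with Tika, then index it in Solr. Returns Solr's HTTP status."""
    with open(path, "rb") as f:
        pdf_bytes = f.read()
    # The timeout keeps a hung parse from blocking the indexer indefinitely.
    with urllib.request.urlopen(build_tika_request(pdf_bytes), timeout=timeout) as resp:
        text = resp.read().decode("utf-8")
    with urllib.request.urlopen(build_solr_request(doc_id, text), timeout=timeout) as resp:
        return resp.status
```

Since extraction happens in a separate process, a file that crashes
or hangs the parser only costs you that one request, and any extra
data you pull out of the PDF can go in through extra_fields.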


Second, since Solr 5.4 you can pass parser-specific parameters
through config; see SOLR-8166. I just added this to the
6.x Ref Guide today; it missed getting into the earlier Ref Guide
releases.
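
As a rough illustration of what that config looks like (recalled
from the Ref Guide section, so please check it against your Solr
version): the handler in solrconfig.xml points at a separate file
through the parseContext.config parameter, and that file sets
properties on the Tika parser config class. The file name
parseContext.xml and the two PDFParserConfig properties below are
just example choices.

```xml
<!-- solrconfig.xml: point the extracting handler at a parse-context file -->
<requestHandler name="/update/extract"
                class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <str name="parseContext.config">parseContext.xml</str>
</requestHandler>

<!-- parseContext.xml: properties applied to Tika's PDF parser config -->
<entries>
  <entry class="org.apache.tika.parser.pdf.PDFParserConfig"
         impl="org.apache.tika.parser.pdf.PDFParserConfig">
    <property name="extractInlineImages" value="true"/>
    <property name="sortByPosition" value="true"/>
  </entry>
</entries>
```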

Best,
Erick

On Fri, Jun 10, 2016 at 1:22 AM, Charlie Hull <char...@flax.co.uk> wrote:
> On 10/06/2016 02:20, Justin Lee wrote:
>>
>> Has anybody had any experience bypassing ExtractingRequestHandler and
>> simply managing Tika manually?  I want to make a small modification to Tika
>> to get and save additional data from my PDFs, but I have been
>> procrastinating in no small part due to the unpleasant prospect of setting
>> up a development environment where I could compile and debug modifications
>> that might run through PDFBox, Tika, and ExtractingRequestHandler.  It
>> occurs to me that it would be much easier if the two were separate, so I
>> could have direct control over Tika and just submit the text to Solr after
>> extraction.  Am I going to regret this approach?  I'm not sure what
>> ExtractingRequestHandler really does for me that Tika doesn't already do.
>
>
> We tend to prefer running Tika externally as it's entirely possible that
> Tika will crash or hang with certain files - and that will bring down Solr
> if you're running Tika within it. Here's a Dropwizard wrapper around Tika
> that might be of use:
> https://github.com/mattflax/dropwizard-tika-server
>
> Cheers
>
> Charlie
>
>>
>> Also, I was reading this
>>
>> <http://stackoverflow.com/questions/33292776/solr-tika-processor-not-crawling-my-pdf-files-prefectly>
>> stackoverflow entry and someone offhandedly mentioned that
>> ExtractingRequestHandler might be separated in the future anyway. Is there
>> a public roadmap for the project, or does one have to keep up with the
>> developers' mailing list and hunt through JIRA entries to keep up with the
>> pulse of the project?
>>
>> Thanks,
>> Justin
>>
>
>
> --
> Charlie Hull
> Flax - Open Source Enterprise Search
>
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.flax.co.uk
