[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server

Tim Allison (JIRA) Mon, 20 Jun 2016 06:49:26 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15339513#comment-15339513
 ]


Tim Allison commented on SOLR-7632:
-----------------------------------

Given the effort that [~thetaphi] and [~lewismc] just went through to upgrade 
to Tika 1.13...I think we might want to pick up work on this issue again.

To carry out [~ehatcher]'s recommendation...I don't know if we'd need CORS for 
this or not, but it might be neat to modify Tika's server to allow users to 
inject their own resources (endpoints) via a config file and an extra jar.  
Within the Solr project, we'd just have to implement a resource that takes an 
input stream, runs Tika and then adds a SolrInputDocument.

For simplicity, it will take some effort on the Solr devs' side to figure out 
how to start and stop at least one tika-server seamlessly so that the "getting 
started" user doesn't have to do a thing.

For scaling, one could imagine users configuring multiple tika-servers, and the 
handler randomly selecting which tika-server to hit (I'm sure there are better 
strategies, but random selection could get us started).
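
As a rough illustration of the random-selection idea, here is a minimal 
sketch in Java. Nothing in it is existing Solr or Tika API: the class name 
RandomTikaServerSelector, its constructor, and next() are hypothetical, and 
the base URLs in the usage note are made up. It only shows choosing one of 
the configured tika-server base URLs uniformly at random.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/**
 * Hypothetical helper (not part of Solr or Tika): holds the configured
 * tika-server base URLs and returns one of them at random per request.
 */
final class RandomTikaServerSelector {
    private final List<String> serverUrls;
    private final Random random;

    RandomTikaServerSelector(List<String> serverUrls, Random random) {
        if (serverUrls == null || serverUrls.isEmpty()) {
            throw new IllegalArgumentException("at least one tika-server URL is required");
        }
        // Defensive copy so later changes to the caller's list have no effect here.
        this.serverUrls = new ArrayList<>(serverUrls);
        this.random = random;
    }

    /** Returns one configured base URL, chosen uniformly at random. */
    String next() {
        return serverUrls.get(random.nextInt(serverUrls.size()));
    }
}
```

A handler could build this once from its configuration, e.g. 
new RandomTikaServerSelector(Arrays.asList("http://localhost:9998", 
"http://localhost:9999"), new Random()), and call next() for each document. 
Retrying against a different server on failure is left out of the sketch.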

I'm more than happy to contribute on the Tika side and to some of the 
integration on the Solr side.  Any takers among the Solr devs? 

Overall, is this the right direction?  Is this worth the effort given the 
number of other options for ETL into Solr?

> Change the ExtractingRequestHandler to use Tika-Server
> ------------------------------------------------------
>
>                 Key: SOLR-7632
>                 URL: https://issues.apache.org/jira/browse/SOLR-7632
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - Solr Cell (Tika extraction)
>            Reporter: Chris A. Mattmann
>              Labels: memex
>
> It's a pain to upgrade Tika's jars all the time when we release, and if Tika 
> fails it messes up the ExtractingRequestHandler (e.g., the document type 
> caused Tika to fail, etc). A more reliable, better separated, and easier to 
> deploy version of the ExtractingRequestHandler would make a network call 
> to the Tika JAXRS server, and then call Tika on the Solr server side, get the 
> results and then index the information that way. I have a patch in the works 
> from the DARPA Memex project and I hope to post it soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
