[
https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15339513#comment-15339513
]
Tim Allison commented on SOLR-7632:
-----------------------------------
Given the effort that [~thetaphi] and [~lewismc] just went through to upgrade
to Tika 1.13, I think we might want to pick up work on this issue again.
To carry out [~ehatcher]'s recommendation: I don't know whether we'd need CORS
for this, but it might be neat to modify Tika's server so that users can
inject their own resources (endpoints) via a config file and an extra jar.
Within the Solr project, we'd just have to implement a resource that takes an
input stream, runs Tika and then adds a SolrInputDocument.
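The document-building half of such a resource could look roughly like the sketch below. It is written in Python for brevity, not as Java against SolrInputDocument. The function name `to_solr_doc`, the target field names (`id`, `content`, `content_type`) and the `attr_` prefix are all assumptions for illustration. The input is assumed to be a flat dict of Tika metadata, with the extracted text stored under the `X-TIKA:content` key.

```python
def to_solr_doc(doc_id, tika_meta):
    """Map a flat dict of Tika metadata to a Solr-style document dict.

    Hypothetical helper: the field names used here are assumptions,
    not an existing Solr or Tika convention.
    """
    doc = {"id": doc_id}
    for key, value in tika_meta.items():
        if key == "X-TIKA:content":
            # Extracted body text goes into a single content field.
            doc["content"] = value
        elif key == "Content-Type":
            doc["content_type"] = value
        else:
            # Everything else is kept under a prefixed, normalized name.
            field = "attr_" + key.lower().replace(":", "_").replace("-", "_")
            doc[field] = value
    return doc
```

A real resource would do the same mapping in Java and hand the result to Solr, but the shape of the work (stream in, Tika metadata out, fields added to a document) is the same.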
To keep things simple for new users, the Solr devs would need to figure out
how to start and stop at least one tika-server seamlessly so that the "getting
started" user doesn't have to do a thing.
For scaling, one could imagine users configuring multiple tika-servers, and the
handler randomly selecting which tika-server to hit (I'm sure there are better
strategies, but random selection could get us started).
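As a minimal sketch of the random-selection idea, assuming the handler is configured with a plain list of tika-server base URLs: the class name `TikaServerPool` and its methods are hypothetical, and the example is in Python only to keep it short.

```python
import random


class TikaServerPool:
    """Hypothetical helper that picks one configured tika-server at random."""

    def __init__(self, urls, rng=None):
        if not urls:
            raise ValueError("at least one tika-server URL is required")
        self.urls = list(urls)
        # An injectable random source keeps the selection testable.
        self.rng = rng or random.Random()

    def pick(self):
        """Return one configured URL, chosen uniformly at random."""
        return self.rng.choice(self.urls)

    def pick_excluding(self, failed):
        """Pick at random among servers not in `failed` (e.g. for a retry)."""
        candidates = [u for u in self.urls if u not in failed]
        if not candidates:
            raise RuntimeError("no tika-server left to try")
        return self.rng.choice(candidates)
```

Random selection needs no shared state between requests, which is why it is an easy starting point. Round-robin or load-aware strategies could replace `pick` later without changing callers.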
I'm more than happy to contribute on the Tika side and on some of the
integration with Solr side. Any takers among the Solr devs?
Overall, is this the right direction? Is this worth the effort given the
number of other options for ETL into Solr?
> Change the ExtractingRequestHandler to use Tika-Server
> ------------------------------------------------------
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
> Issue Type: Improvement
> Components: contrib - Solr Cell (Tika extraction)
> Reporter: Chris A. Mattmann
> Labels: memex
>
> It's a pain to upgrade Tika's jars every time we release, and if Tika
> fails it messes up the ExtractingRequestHandler (e.g., when a document type
> causes Tika to fail). A more reliable, better separated, and easier-to-deploy
> version of the ExtractingRequestHandler would make a network call from the
> Solr server to the Tika JAXRS server, get the results, and then index the
> information that way. I have a patch in the works from the DARPA Memex
> project and I hope to post it soon.