I did a series of blog posts about Tika, and while conventional wisdom is that running Tika in Solr is bad, I’ve had GREAT luck with it over the years: https://opensourceconnections.com/blog/2019/10/24/it-s-okay-to-run-tika-inside-of-solr-if-and-only-if/

Having said that, my bigger beef with Tika in Solr is all the dependencies that it drags along. I am constantly looking up a package, wondering how we use it in Solr, just to find it’s a Tika package. For that reason, I think we need to do something better.

I like moving SolrCell to a package (https://issues.apache.org/jira/browse/SOLR-15951). We have this powerful packaging feature, and yet we hardly dogfood it ourselves. I’d love to see us separate out SolrCell and make it easy to do `bin/solr package install solrcell` and have it work! It would both validate the whole Package concept and minimize the dependencies in Solr’s tarball.

Secondly, for folks who really do want to run a separate Tika server, I’d love to make it easier to use. Tika has introduced a new “pipes” concept to reduce the amount of back and forth when working with Tika Server, which might tie nicely into the Solr update pipeline. I don’t think any real work has been done on this.

Hoping Tim Allison weighs in on this topic ;-)

Eric

> On Mar 8, 2023, at 9:50 PM, Shawn Heisey <apa...@elyograg.org> wrote:
>
> On 3/7/2023 3:48 PM, Jan Høydahl wrote:
>> * Move SolrCell to a package, outside of Solr's tarball SOLR-15951
>>   <https://issues.apache.org/jira/browse/SOLR-15951>
>> * Deprecate SolrCell SOLR-13973
>>   <https://issues.apache.org/jira/browse/SOLR-13973>
>> * Keep in Solr but use Tika-Server
>>   <https://cwiki.apache.org/confluence/display/TIKA/TikaServer>, SOLR-7632
>>   <https://issues.apache.org/jira/browse/SOLR-7632>
>> * Integrate Tika client-side SOLR-1526
>>   <https://issues.apache.org/jira/browse/SOLR-1526>
>
> As you likely know, the big problem is that Tika has a habit of crashing or
> misbehaving, particularly with PDFs, and if it's running inside Solr, then
> Solr itself is going to suffer whatever bad effects Tika causes.
>
>> My current thinking / proposal is to:
>> * Build a new, thin Solr module that exposes a compatible /update/extract
>>   handler, delegating to Tika-Server (user-hosted)
>> * Deprecate SolrCell in current form
>> * From 10.0, Solr will not ship with embedded Tika, only the new handler
>>   delegating to Tika-Server
>
> I was thinking something along these lines too. A separate JVM running Tika
> Server that can crash without taking Solr down, and communication so ERH can
> send commands to it, receive extracted data, and hopefully know when the
> other JVM crashes. If we design it well, then the framework could be used to
> integrate with other extraction mechanisms besides Tika. I think that would
> be quite a bit of work.
>
> It might be a good idea to make that a separate project as was done for DIH,
> but I have no way of guessing whether there is enough interest in the
> community to keep it maintained. If it's a separate project, then I think it
> would just incorporate SolrJ and Tika, rather than using a special handler.
> I have never used ERH in a production setting, and barely have experience
> with it in non-production.
>
> Thanks,
> Shawn

_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com | My Free/Busy <http://tinyurl.com/eric-cal>
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>