I did a series of blog posts about Tika, and while conventional wisdom is that 
running Tika in Solr is bad, I’ve had GREAT luck with it over the years.  
https://opensourceconnections.com/blog/2019/10/24/it-s-okay-to-run-tika-inside-of-solr-if-and-only-if/
 

Having said that, my bigger beef with Tika in Solr is all the dependencies 
that it drags along. I am constantly looking up a package, wondering how we 
use it in Solr, only to find it’s a Tika package. For that reason I think we 
need to do something better.

I like moving SolrCell to a package 
(https://issues.apache.org/jira/browse/SOLR-15951). We have this powerful 
packaging feature, and yet we hardly dogfood it ourselves. I’d love to see 
us separate out SolrCell and make it easy to run `bin/solr package install 
solrcell` and have it work! It would both validate the whole Package concept 
and minimize the dependencies in Solr’s tarball.

Secondly, for folks who really do want to run a separate Tika server, I’d love 
to make it easier to use. Tika has introduced a new “pipes” concept that 
reduces the amount of back and forth when working with Tika Server, and it 
might tie nicely into the Solr update pipeline. I don’t think any real work 
has been done on this. Hoping Tim Allison weighs in on this topic ;-)
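
For anyone who has not tried the separate-server route, here is a rough sketch 
of the round trip as I understand it works today, without pipes: PUT the raw 
file to Tika Server, get plain text back, then POST a JSON document to Solr. 
The host names, the "docs" collection, and the `content_txt` field are 
assumptions for illustration only, and the helper names are made up. This is 
not a proposal for how a new handler should be built.

```python
import json
import urllib.request

# Assumed endpoints: Tika Server on its default port, Solr with a
# hypothetical collection named "docs".
TIKA_URL = "http://localhost:9998/tika"
SOLR_URL = "http://localhost:8983/solr/docs/update?commit=true"


def tika_request(data: bytes) -> urllib.request.Request:
    """Build a PUT to Tika Server's /tika endpoint asking for plain text."""
    return urllib.request.Request(
        TIKA_URL, data=data, method="PUT", headers={"Accept": "text/plain"}
    )


def solr_request(doc_id: str, text: str) -> urllib.request.Request:
    """Build a POST that sends one JSON document to Solr's update handler."""
    # "content_txt" is a hypothetical field name; adjust to your schema.
    body = json.dumps([{"id": doc_id, "content_txt": text}]).encode("utf-8")
    return urllib.request.Request(
        SOLR_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def extract_and_index(path: str) -> None:
    """Extract text in the Tika JVM, then index it; Solr never parses the file."""
    with open(path, "rb") as f:
        raw = f.read()
    with urllib.request.urlopen(tika_request(raw), timeout=60) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    with urllib.request.urlopen(solr_request(path, text), timeout=60) as resp:
        resp.read()  # drain the response; urlopen raises on HTTP errors
```

With both servers running, `extract_and_index("some.pdf")` would do the whole 
trip. If Tika chokes on a bad PDF, the failure shows up as an exception on the 
client side and Solr is untouched.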

Eric


> On Mar 8, 2023, at 9:50 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> 
> On 3/7/2023 3:48 PM, Jan Høydahl wrote:
>> * Move SolrCell to a package, outside of Solr's tarball SOLR-15951 
>> <https://issues.apache.org/jira/browse/SOLR-15951>
>> * Deprecate SolrCell SOLR-13973 
>> <https://issues.apache.org/jira/browse/SOLR-13973>
>> * Keep in Solr but use Tika-Server 
>> <https://cwiki.apache.org/confluence/display/TIKA/TikaServer>,  SOLR-7632 
>> <https://issues.apache.org/jira/browse/SOLR-7632>
>> * Integrate Tika client-side SOLR-1526 
>> <https://issues.apache.org/jira/browse/SOLR-1526>
> 
> As you likely know, the big problem is that Tika has a habit of crashing or 
> misbehaving, particularly with PDFs, and if it's running inside Solr, then 
> Solr itself is going to suffer whatever bad effects Tika causes.
> 
>> My current thinking / proposal is to:
>> * Build a new, thin Solr module that exposes a compatible /update/extract 
>> handler, delegating to Tika-Server (user-hosted)
>> * Deprecate SolrCell in current form
>> * From 10.0, Solr will not ship with embedded Tika, only the new handler 
>> delegating to Tika-Server
> 
> I was thinking something along these lines too.  A separate JVM running Tika 
> Server that can crash without taking Solr down, and communication so ERH can 
> send commands to it, receive extracted data, and hopefully know when the 
> other JVM crashes.  If we design it well, then the framework could be used to 
> integrate with other extraction mechanisms besides Tika.  I think that would 
> be quite a bit of work.
> 
> It might be a good idea to make that a separate project as was done for DIH, 
> but I have no way of guessing whether there is enough interest in the 
> community to keep it maintained.  If it's a separate project, then I think it 
> would just incorporate SolrJ and Tika, rather than using a special handler.  
> I have never used ERH in a production setting, and barely have experience 
> with it in non-production.
> 
> Thanks,
> Shawn
> 

_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | 
My Free/Busy <http://tinyurl.com/eric-cal>  
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed 
<https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
    
