+1 to Shawn's and Erick's points about isolating Tika in a separate JVM.
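
For what it's worth, here is a minimal sketch of that kind of isolation, in
Python with only the standard library. It is not anything that ships with
Solr or Tika. It assumes the stock tika-app jar is available locally, and the
jar path, collection name, Solr URL and the "content_txt" field are all
placeholders I made up for illustration. Each document is parsed in its own
child JVM with a capped heap and a timeout, so a parse that hangs or runs out
of memory only loses that one document and never touches Solr's heap.

```python
import json
import subprocess
import urllib.request

# Hypothetical locations; adjust for your own setup.
TIKA_APP_JAR = "tika-app.jar"
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update?commit=true"


def tika_command(path, jar=TIKA_APP_JAR, max_heap="512m"):
    """Build the command line for a child JVM running tika-app in text mode."""
    return ["java", f"-Xmx{max_heap}", "-jar", jar,
            "--encoding=UTF-8", "--text", path]


def extract_text(path, timeout_s=60):
    """Run Tika in its own JVM; return the extracted text, or None on failure."""
    try:
        result = subprocess.run(tika_command(path), capture_output=True,
                                timeout=timeout_s, check=True)
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
        # Timeout, non-zero exit (e.g. OutOfMemoryError) or missing java binary:
        # give up on this document only.
        return None
    return result.stdout.decode("utf-8", errors="replace")


def solr_payload(doc_id, text):
    """Encode one document as the JSON array that Solr's /update accepts."""
    # "content_txt" is an assumed field name, not one taken from this thread.
    return json.dumps([{"id": doc_id, "content_txt": text}]).encode("utf-8")


def index_file(path):
    """Extract one file out of process and post the result to Solr."""
    text = extract_text(path)
    if text is None:
        return False
    request = urllib.request.Request(
        SOLR_UPDATE_URL,
        data=solr_payload(path, text),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status == 200
```

Calling index_file() once per file and logging the paths that return False
would also give you a list of the documents that blow up, which is the kind
of thing that helps narrow down a test case.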

Yes, please do let us know at u...@tika.apache.org. We might be able to
help out, and you, in turn, can help the community figure out what's
going on; see e.g.: https://issues.apache.org/jira/browse/TIKA-2703

On Sun, Aug 5, 2018 at 1:22 PM Shawn Heisey <apa...@elyograg.org> wrote:
>
> On 8/2/2018 5:30 AM, Thomas Scheffler wrote:
> > my final verdict is the upgrade to Tika 1.17. If I downgrade the libraries
> > just for Tika back to 1.16 and keep the rest of SOLR 7.4.0, the heap usage
> > after about 85% of the index process and a manual trigger of the garbage
> > collector is about 60-70 MB (That low!!!)
> >
> > My problem now is that we have several setups that trigger this reliably
> > but there is no simple test case that „fails“ if Tika 1.17 or 1.18 is used.
> > I also do not know if the error is inside Tika or inside the glue code that
> > makes Tika usable in SOLR.
>
> If downgrading Tika fixes the issue, then it doesn't seem (to me) very
> likely that Solr's glue code for ERH has a problem. If it's not Solr's
> code that has the problem, there will be nothing we can do about it
> other than change the Tika library included with Solr.
>
> Before filing an issue, you should discuss this with the Tika project on
> their mailing list.  They'll want to make sure that they can fix the
> problem in a future version.  It might not be an actual memory leak ...
> it could just be that one of the documents you're trying to index is one
> that Tika requires a huge amount of memory to handle.  But it could be a
> memory leak.
>
> If you know which document is being worked on when it runs out of
> memory, can you try not including that document in your indexing, to see
> if it still has a problem?
>
> Please note that it is strongly recommended that you do not use the
> Extracting Request Handler in production.  Tika is prone to many
> problems, and those problems will generally affect Solr if Tika is being
> run inside Solr.  Because of this, it is recommended that you write a
> separate program using Tika that handles extracting information from
> documents and sending that data to Solr.  If that program crashes, Solr
> remains operational.
>
> There is already an issue to upgrade Tika to the latest version in Solr,
> but you've said that you tried 1.18 already with no change to the
> problem.  So whatever the problem is, it will need to be solved in 1.19
> or later.
>
> Thanks,
> Shawn
>
