+1 to Shawn's and Erick's points about isolating Tika in a separate JVM.
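
For illustration only, here is a rough sketch of what that isolation could look like: a small driver that starts a short-lived Tika JVM per document, with its own heap cap and a timeout, and posts the extracted text to Solr's JSON update endpoint. The jar path, the Solr URL, the collection name, and the `content_txt` field are all assumptions of mine, not something taken from this thread, and none of the function names come from Tika or Solr. It assumes the tika-app jar and its `--text` / `--encoding` options.

```python
import json
import subprocess
import urllib.request

# Assumptions (hypothetical values, adjust to your setup):
TIKA_APP_JAR = "tika-app.jar"
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update?commit=true"


def tika_command(path, max_heap="512m"):
    """Build the command line for a child JVM running tika-app with a heap cap."""
    return ["java", f"-Xmx{max_heap}", "-jar", TIKA_APP_JAR,
            "--encoding=UTF-8", "--text", path]


def extract_text(path, timeout_s=120):
    """Run Tika in a separate JVM; return the extracted text, or None on failure."""
    try:
        result = subprocess.run(
            tika_command(path),
            capture_output=True,
            timeout=timeout_s,  # a hung parser is killed when this expires
            check=True,         # non-zero exit (e.g. OutOfMemoryError) raises
        )
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
        # Only the child process is affected; Solr never sees the failure.
        return None
    return result.stdout.decode("utf-8", errors="replace")


def solr_payload(doc_id, text):
    """Encode one document as a JSON array for Solr's /update handler."""
    return json.dumps([{"id": doc_id, "content_txt": text}]).encode("utf-8")


def index_file(doc_id, path):
    """Extract one file and send it to Solr; return True if Solr accepted it."""
    text = extract_text(path)
    if text is None:
        return False  # skip (and ideally log) the document that broke Tika
    request = urllib.request.Request(
        SOLR_UPDATE_URL,
        data=solr_payload(doc_id, text),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status == 200
```

Starting a JVM per document is slow, so a long-running tika-server would likely be the better choice for real workloads. The point of the sketch is only that a crash, hang, or out-of-memory error stays in the child process, and that the file name that triggered it is known.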
Y, please do let us know: u...@tika.apache.org

We might be able to help out, and you, in turn, can help the community figure out what's going on; see e.g.: https://issues.apache.org/jira/browse/TIKA-2703

On Sun, Aug 5, 2018 at 1:22 PM Shawn Heisey <apa...@elyograg.org> wrote:
>
> On 8/2/2018 5:30 AM, Thomas Scheffler wrote:
> > my final verdict is the upgrade to Tika 1.17. If I downgrade the libraries
> > just for tika back to 1.16 and keep the rest of SOLR 7.4.0 the heap usage
> > after about 85 % of the index process and manual trigger of the garbage
> > collector is about 60-70 MB (That low!!!)
> >
> > My problem now is that we have several setups that triggers this reliably
> > but there is no simple test case that „fails“ if Tika 1.17 or 1.18 is used.
> > I also do not know if the error is inside Tika or inside the glue code that
> > makes Tika usable in SOLR.
>
> If downgrading Tika fixes the issue, then it doesn't seem (to me) very
> likely that Solr's glue code for ERH has a problem. If it's not Solr's
> code that has the problem, there will be nothing we can do about it
> other than change the Tika library included with Solr.
>
> Before filing an issue, you should discuss this with the Tika project on
> their mailing list. They'll want to make sure that they can fix the
> problem in a future version. It might not be an actual memory leak ...
> it could just be that one of the documents you're trying to index is one
> that Tika requires a huge amount of memory to handle. But it could be a
> memory leak.
>
> If you know which document is being worked on when it runs out of
> memory, can you try not including that document in your indexing, to see
> if it still has a problem?
>
> Please note that it is strongly recommended that you do not use the
> Extracting Request Handler in production. Tika is prone to many
> problems, and those problems will generally affect Solr if Tika is being
> run inside Solr. Because of this, it is recommended that you write a
> separate program using Tika that handles extracting information from
> documents and sending that data to Solr. If that program crashes, Solr
> remains operational.
>
> There is already an issue to upgrade Tika to the latest version in Solr,
> but you've said that you tried 1.18 already with no change to the
> problem. So whatever the problem is, it will need to be solved in 1.19
> or later.
>
> Thanks,
> Shawn
>