Thanks for the suggestion. This provides some benefits, but if it hits an OOME, I'm still stuck with an undefined state: I'd need to kill off the Tika thread before it hits the OOME. The advantage of running it separately is that I can give tika-server a fairly small amount of RAM, so that it hits an OOME well before the main system would, since the main system has a much larger maximum heap. 512M seems to be enough for Tika to function properly on most documents, while the main system has a limit of 16G.
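As a rough illustration of the separate-process setup described above, here is a minimal Java sketch of a supervisor that launches a child JVM with a small heap and starts it again whenever it exits. Everything specific in it is an assumption rather than something taken from this thread: the class name ServerSupervisor, the jar path tika-server.jar, and the one-second pause are hypothetical, and the HotSpot flag -XX:+ExitOnOutOfMemoryError is only available on JDK 8u92 and later. Note that it only restarts a server that has exited; it does not detect one that is hung.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

final class ServerSupervisor {

    /**
     * Starts the given command, waits for it to exit, and starts it again,
     * up to maxRuns launches in total. Returns the number of launches.
     */
    static int superviseLoop(List<String> command, int maxRuns, long pauseMillis)
            throws IOException, InterruptedException {
        int runs = 0;
        while (runs < maxRuns) {
            Process process = new ProcessBuilder(command)
                    .inheritIO() // show the child's output on our own console
                    .start();
            runs++;
            int exitCode = process.waitFor();
            System.err.println("child exited with code " + exitCode
                    + " (launch " + runs + ")");
            Thread.sleep(pauseMillis); // avoid a tight restart loop
        }
        return runs;
    }

    public static void main(String[] args) throws Exception {
        // Assumptions: "tika-server.jar" is a placeholder path, and the JVM
        // supports -XX:+ExitOnOutOfMemoryError (HotSpot, JDK 8u92 or later).
        List<String> command = Arrays.asList(
                "java",
                "-Xmx512m",                    // small heap, as described above
                "-XX:+ExitOnOutOfMemoryError", // exit instead of limping on
                "-jar", "tika-server.jar");
        superviseLoop(command, Integer.MAX_VALUE, 1000);
    }
}
```

Because the child has its own 512M heap, an OOME there should not affect the 16G heap of the main system; the supervisor only ever sees a process exit.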

The script-on-OOME sounds like a proper solution, I'll look into that, thanks!

Regards,

Egbert


On 04-11-16 10:57, Markus Jelsma wrote:
By the way, if you run Tika embedded in your application and you expect to pass 
it lots of trash - which is usual when crawling the web - it is a good idea to 
launch a single thread for the parse job. Your application can wait for 
completion and if necessary terminate the thread after a timeout period.
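
A minimal Java sketch of the approach described above, using only java.util.concurrent. The class name TimedParse and the method parseWithTimeout are hypothetical, and the parse job is passed in as a plain Callable so that the sketch does not depend on any particular Tika call. One caveat: Java has no safe way to forcibly kill a thread, so "terminate" here means interrupting the worker and abandoning its result; a parser that ignores interrupts keeps running in the background.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimedParse {

    /**
     * Runs parseJob on its own thread and waits at most timeoutMillis.
     * Returns the extracted text, or null on timeout or parse failure.
     */
    static String parseWithTimeout(Callable<String> parseJob, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread worker = new Thread(runnable, "parse-worker");
            worker.setDaemon(true); // a stuck parse must not keep the JVM alive
            return worker;
        });
        Future<String> future = executor.submit(parseJob);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; cannot force it to stop
            return null;
        } catch (ExecutionException e) {
            return null; // the parse job itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt
            future.cancel(true);
            return null;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

With Tika embedded, the job might look like `parseWithTimeout(() -> new Tika().parseToString(file), 30000)`, where `file` and the 30-second limit are placeholders. This guards against hangs, but the worker still shares the application's heap.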

Regards,
Markus



-----Original message-----
From:Egbert van der Wal <[email protected]>
Sent: Friday 4th November 2016 9:18
To: [email protected]
Subject: Tika-server: shutdown on exceptions (esp. OOME)?

Hi,

In a web crawling application, we're using Tika to parse binary files
such as PDFs that the crawler encounters, in order to extract text from
them.

However, due to the wide variety of garbage encountered on the internet,
this isn't always successful, and sometimes Tika throws exceptions as a
result. One example is the OutOfMemoryError I reported (which should be
fixed in the upcoming release):
https://issues.apache.org/jira/browse/TIKA-2045

This used to crash the entire application. I've recently isolated the
parsing by running Tika-server and sending the documents to it over
HTTP. However, when I send such broken documents, the OutOfMemoryError
is still thrown in the Tika server, and the server does not terminate.
It keeps running, but it either runs *very* slowly, doesn't accept new
connections, or doesn't respond to them. The usual 'undetermined state'
after an OOME, I suppose.

Anyway, I'd like to fix this by checking regularly whether the server
is still running and restarting it if necessary. But for that to work,
I need it to shut down when an OOME occurs.

Is there anything I can use to make this happen? Do I need to change the
code or is there a possibility to configure this using a config file of
some sort?

Thanks!

Egbert van der Wal
