Be careful -- you cannot terminate a thread in Java (see TIKA-456).  You can 
ask it nicely to shut down via interrupt() or destroy(), but if you're in a 
TIKA-1132 situation, that thread will continue to run no matter what you do 
until the process is killed.  The only way to actually stop a parse that has 
gone wrong is to kill the process.  I added MockParser in tika-core/test, 
which allows you to test whether your application correctly handles an OOM or 
a permanent hang.
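
A minimal sketch of the kill-the-process approach, assuming the risky work runs 
as a child process (for example the tika-app command in [1]).  The class and 
method names are hypothetical and not part of Tika; the timeout value is 
arbitrary.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.OptionalInt;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not a Tika class: runs a command in a child process
// and kills that process if it does not finish in time.
public class ChildProcessWatchdog {

    /**
     * Returns the child's exit code, or an empty OptionalInt if the child
     * had to be killed because it exceeded timeoutMillis.
     */
    public static OptionalInt runWithTimeout(List<String> command, long timeoutMillis)
            throws IOException, InterruptedException {
        // inheritIO() avoids a child blocking on a full, unread output pipe.
        Process child = new ProcessBuilder(command).inheritIO().start();
        if (child.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
            return OptionalInt.of(child.exitValue());
        }
        // Unlike a thread, a process can be stopped from the outside.
        child.destroyForcibly();
        child.waitFor(); // wait until the OS has actually reaped the child
        return OptionalInt.empty();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: ChildProcessWatchdog <command> [args...]");
            return;
        }
        // 60 seconds is an arbitrary limit chosen for illustration.
        OptionalInt exit = runWithTimeout(Arrays.asList(args), 60_000);
        System.out.println(exit.isPresent()
                ? "exit code " + exit.getAsInt()
                : "killed after timeout");
    }
}
```

A child that dies from an OOM simply shows up here as a non-zero exit code, 
and a hung child is killed, so the calling application stays healthy either 
way.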

tika-batch in filesystem-to-filesystem mode [1][2] and the ForkParser are 
currently the only ways we offer to handle permanent hangs and OOMs.

We need to fix tika-server to be robust against these.  Please open an issue on 
our JIRA.

The JVM option to restart on OOM is a great idea. Thank you, Markus!

Cheers,

           Tim

[1] java -jar tika-app.jar -i <input_dir> -o <output_dir>
[2] http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]] 
Sent: Friday, November 4, 2016 5:57 AM
To: [email protected]
Subject: RE: Tika-server: shutdown on exceptions (esp. OOME)?

By the way, if you run Tika embedded in your application and you expect to pass 
it lots of trash - which is usual when crawling the web - it is a good idea to 
launch a single thread for the parse job. Your application can wait for 
completion and if necessary terminate the thread after a timeout period.
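
A minimal sketch of that pattern, assuming the parse is wrapped in a 
Callable<String> (the lambdas in main are stand-ins, not real Tika calls; the 
class and method names are hypothetical).  Note that cancelling only 
interrupts the worker thread; a parser that ignores interrupts will keep 
running in the background.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not a Tika class.
public class TimedParse {

    /**
     * Runs parseJob on its own thread and waits at most timeoutMillis.
     * Returns the job's result, or null if the job timed out.
     * A failure inside the job surfaces as an ExecutionException.
     */
    public static String parseWithTimeout(Callable<String> parseJob, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<String> future = executor.submit(parseJob);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // This only requests an interrupt; it cannot force the thread to stop.
            future.cancel(true);
            return null;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a parse that completes quickly.
        System.out.println(parseWithTimeout(() -> "extracted text", 1000));
        // Stand-in for a parse that takes far too long.
        System.out.println(parseWithTimeout(() -> {
            Thread.sleep(60_000);
            return "never returned";
        }, 200));
    }
}
```

The caller gets control back after the timeout in both cases, which is the 
point of the pattern; whether the worker really stops depends on the code it 
is running.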

Regards,
Markus

 
 
-----Original message-----
> From:Egbert van der Wal <[email protected]>
> Sent: Friday 4th November 2016 9:18
> To: [email protected]
> Subject: Tika-server: shutdown on exceptions (esp. OOME)?
> 
> Hi,
> 
> In a web crawling application, we're using Tika to parse binary files 
> such as PDFs that the crawler encounters and extract text from them.
> 
> However, due to the wide variety of garbage encountered on the 
> internet, this isn't always successful, and sometimes Tika throws 
> exceptions as a result. For example, the OutOfMemoryError I 
> reported (which should be fixed in the upcoming release):
> https://issues.apache.org/jira/browse/TIKA-2045
> 
> This used to crash the entire application. I've recently isolated 
> the parsing by running Tika-server and sending the documents to it 
> over HTTP. However, when I send such broken documents, the 
> OutOfMemoryError is still thrown in the Tika server, and the server 
> does not terminate. It keeps running, but either runs *very* slowly, 
> doesn't accept new connections, or doesn't respond to them. The usual 
> 'undetermined state' after an OOME, I suppose.
> 
> Anyway, I'd like to fix this by checking regularly whether the 
> server is still running and restarting it if necessary. But for that 
> to work, I need it to shut down when an OOME occurs.
> 
> Is there anything I can use to make this happen? Do I need to change 
> the code or is there a possibility to configure this using a config 
> file of some sort?
> 
> Thanks!
> 
> Egbert van der Wal
> 
