By the way, if you run Tika embedded in your application and you expect to feed 
it lots of garbage, which is usual when crawling the web, it is a good idea to 
launch a separate thread for each parse job. Your application can then wait for 
completion and, if necessary, terminate the thread after a timeout period.
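
A minimal sketch of that idea, using only the JDK. The class and method names 
(TimedParse, runWithTimeout) are made up for illustration, and the job here is 
a stand-in: in a real crawler it would wrap whatever Tika parse call you use. 
Note that cancelling a Future only interrupts the worker thread; a parser that 
never checks for interruption may keep running in the background, so treat this 
as a way to stop waiting rather than a guaranteed kill.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedParse {

    /**
     * Runs the job on its own thread and waits at most timeoutMillis for it.
     * Returns the job's result, or null if the job timed out or failed.
     */
    public static <T> T runWithTimeout(Callable<T> job, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true); // a stuck parse must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(job);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; cooperative only
            return null;
        } catch (ExecutionException e) {
            // The job itself threw; this also covers Errors such as
            // OutOfMemoryError raised on the worker thread.
            return null;
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt(); // preserve interrupt status
            return null;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in jobs; a real job would call into Tika here.
        String fast = runWithTimeout(() -> "ok", 1000);
        String slow = runWithTimeout(() -> {
            Thread.sleep(5000); // simulates a parse that hangs
            return "late";
        }, 100);
        System.out.println("fast job: " + fast);
        System.out.println("slow job: " + slow);
    }
}
```

This does not protect the rest of the application from an OutOfMemoryError: 
the heap is shared by all threads, so only a separate process gives real 
isolation.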

Regards,
Markus

 
 
-----Original message-----
> From:Egbert van der Wal <[email protected]>
> Sent: Friday 4th November 2016 9:18
> To: [email protected]
> Subject: Tika-server: shutdown on exceptions (esp. OOME)?
> 
> Hi,
> 
> In a web crawling application, we're using Tika to parse binary files 
> such as PDFs that the crawler encounters and extract text from them.
> 
> However, due to the wide variety of garbage encountered on the internet, 
> this isn't always successful, and sometimes Tika throws exceptions as a 
> result. One example is the OutOfMemoryError I reported (which should be 
> fixed in the upcoming release): 
> https://issues.apache.org/jira/browse/TIKA-2045
> 
> This used to crash the entire application. I've recently isolated it 
> by running Tika-server and sending the documents to it over HTTP. 
> However, when I send such broken documents, the OutOfMemoryError 
> is still thrown in the Tika server, and the server does not 
> terminate. It keeps running, but either runs *very* slowly, doesn't 
> accept new connections, or doesn't respond to them. The usual 
> 'undetermined state' after an OOME, I suppose.
> 
> Anyway, I'd like to fix this by checking regularly whether the 
> server is still running and restarting it if necessary. But for that to 
> work, I need it to shut down when an OOME occurs.
> 
> Is there anything I can use to make this happen? Do I need to change the 
> code or is there a possibility to configure this using a config file of 
> some sort?
> 
> Thanks!
> 
> Egbert van der Wal
> 
