By the way, if you run Tika embedded in your application and expect to feed it a lot of garbage, which is usual when crawling the web, it is a good idea to run each parse job in a separate thread. Your application can then wait for completion and, if necessary, terminate the thread after a timeout.
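A minimal sketch of that pattern, using only java.util.concurrent. The class name TimedParse, the method runWithTimeout and the timeout values are made up for illustration, and the lambdas stand in for the real Tika call, so treat it as an outline rather than a drop-in solution:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedParse {

    // Hypothetical helper: runs parseJob on its own thread and returns its
    // result, or null if the job fails or exceeds timeoutMillis.
    static String runWithTimeout(Callable<String> parseJob, long timeoutMillis)
            throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<String> result = executor.submit(parseJob);
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker. This is only a request: a thread that
            // never checks its interrupt flag will keep running.
            result.cancel(true);
            return null;
        } catch (ExecutionException e) {
            // The job threw; e.getCause() holds the original Throwable,
            // which may be an Error such as OutOfMemoryError.
            return null;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // In a real crawler the lambda would wrap the actual parse, for
        // example new Tika().parseToString(inputStream).
        String fast = runWithTimeout(() -> "parsed text", 1000);
        String slow = runWithTimeout(() -> {
            Thread.sleep(5000); // simulates a parse that hangs
            return "too late";
        }, 200);
        System.out.println("fast job: " + fast);
        System.out.println("slow job: " + slow);
    }
}
```

Keep in mind that Java cannot forcibly kill a thread: cancel(true) only interrupts it, so a parser stuck in a tight loop may keep burning CPU, and an OOME in the worker can still leave the whole JVM in a bad state. If that matters, running the parse in a separate process is the safer route.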
Regards,
Markus

-----Original message-----
> From:Egbert van der Wal <[email protected]>
> Sent: Friday 4th November 2016 9:18
> To: [email protected]
> Subject: Tika-server: shutdown on exceptions (esp. OOME)?
>
> Hi,
>
> In a web crawling application, we're using Tika to parse binary files
> such as PDF that the crawler encounters to extract text from it.
>
> However, due to the wide variety of garbage encountered on the internet,
> this isn't always succesful, and sometimes Tika throws exceptions due to
> this. For example the OutOfMemory exception I reported (and should be
> fixed in the upcoming release):
> https://issues.apache.org/jira/browse/TIKA-2045
>
> This used to crash the entire application. I've recently separated this
> by running Tika-server and sending the documents over HTTP to this
> server. However, when sending such broken documents, the OutOfMemory
> process is still thrown in the Tika server. However, it does not
> terminate. It keeps running, but will either run *very* slow, doesn't
> accept new connections or doesn't respond to them. The usual
> 'undetermined state' after a OOME, I suppose.
>
> Anyway, I'd like to fix this by having the server check regularly if the
> server is still running and restart it if necessary. But for that to
> happen, I need it to shutdown when a OOME occurs.
>
> Is there anything I can use to make this happen? Do I need to change the
> code or is there a possibility to configure this using a config file of
> some sort?
>
> Thanks!
>
> Egbert van der Wal
>
