The ForkParser does have the ability to kill and restart on permanent hangs. We don't have the RecursiveParserWrapper integrated into the ForkParser currently...patches are welcomed.
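To make the kill-on-hang idea concrete, here is a minimal sketch using only the JDK. It is not how ForkParser is implemented and uses no Tika API; the class name ChildProcessWatchdog and the method runWithTimeout are made up for illustration. It assumes Java 9+ and that the work to be guarded can be launched as a separate command.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Tika: run work in a child process and
// kill that process if it does not finish within the timeout.
final class ChildProcessWatchdog {

    /**
     * Runs the command in a child process.
     *
     * @return true if the child exited on its own within the timeout,
     *         false if it had to be killed
     */
    static boolean runWithTimeout(List<String> command, long timeoutMillis) {
        Process child = null;
        try {
            child = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    // Discard output so the child never blocks on a full pipe.
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (child.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return true; // finished in time
            }
            // Timed out: killing the process is the only reliable way to stop it.
            child.destroyForcibly().waitFor();
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            if (child != null) {
                child.destroyForcibly();
            }
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            throw new IllegalStateException("interrupted while waiting for child", e);
        }
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("usage: ChildProcessWatchdog <timeoutMillis> <command...>");
            return;
        }
        List<String> command = List.of(args).subList(1, args.length);
        boolean finished = runWithTimeout(command, Long.parseLong(args[0]));
        System.out.println(finished ? "finished" : "killed after timeout");
    }
}
```

A caller that gets false back would typically log the offending file and start a fresh child for the next one.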
At the Tika level, we generally don't check for Thread.interrupted() because our dependencies don't do it. Unfortunately, you do have to kill a process for a parser that hits a permanent hang. Nothing you can do to a thread will actually be useful; see TIKA-456 for a discussion of this.

Some options:

1) The ForkParser will time out and restart.
2) tika-batch, e.g. java -jar tika-app.jar -i <input_dir> -o <output_dir>, will run multithreaded, and it spawns a child process that will be killed and restarted on a permanent hang/OOM.
3) tika-server...we could/should harden that via a child process that could be killed/restarted, but that doesn't currently exist.
4) A framework, e.g. Hadoop. See http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/ and Ken Krugler's email (somewhere on our list?!) about spawning a separate thread for each parse and then aborting the process if there's a timeout.

Finally, no matter what option you use, you can use the MockParser in tika-core/tests to test that your processing pipeline can correctly handle timeouts/OOM etc. Add that to your class path and then ask Tika to parse, e.g. <mock><oom/></mock>. See: https://wiki.apache.org/tika/MockParser

-----Original Message-----
From: Jim Idle [mailto:[email protected]]
Sent: Tuesday, November 21, 2017 11:13 PM
To: [email protected]
Subject: RE: Very slow parsing of a few PDF files

I didn't know that there was a ForkParser, but that might possibly be a significant overhead on the application - looks like it has a pool, though I don't know if it gives the ability to, say, kill a long-running parser and restart the pool.

I will look into it: one thing I see already is that it intercepts Interrupted, wraps it in a TikaException but does not set the Thread interrupted flag and cannot rethrow Interrupted because the Parser interface does not throw it.
It catches inability to communicate, but does it start a new process if I cancel one? I may have no choice though, as RecursiveParserWrapper, like any implementation of Parser, does not check for Thread.interrupted() or throw Interrupted, which means that I cannot time out a Future and cancel it.

Anyway, thanks for the pointer - I will play with it.

Jim

> -----Original Message-----
> From: Nick Burch [mailto:[email protected]]
> Sent: Tuesday, November 21, 2017 17:10
> To: [email protected]
> Subject: RE: Very slow parsing of a few PDF files
>
> On Tue, 21 Nov 2017, Jim Idle wrote:
> > Following up on this, I will try cancelling my thread based tasks
> > after a pre-set time limit. That is only going to work if Tika and
> > the underlying parsers behave correctly with the interrupted exception.
> > Anyone had any success with that? I am mainly looking at Office, PDF
> > and HTML right now. I will try it myself of course, but perhaps
> > someone has already been down this path?
>
> Have you tried with ForkParser? That would also protect you against
> other kinds of failures like OOM too
>
> Nick
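For anyone following the thread, the Future/cancel problem discussed above can be reproduced without Tika at all. The sketch below is a JDK-only illustration (Java 9+); the class InterruptIgnoredDemo and its busy loop are made-up stand-ins for a parser that never checks the interrupt flag, not actual Tika code. It shows that Future.cancel(true) marks the Future as cancelled while the worker thread keeps running.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical demo, not part of Tika.
final class InterruptIgnoredDemo {

    /** Returns true if the worker kept making progress after Future.cancel(true). */
    static boolean workerSurvivesCancel() {
        AtomicBoolean stop = new AtomicBoolean(false); // escape hatch so the demo can end
        AtomicLong ticks = new AtomicLong();           // progress counter of the worker

        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "stand-in-parser");
            t.setDaemon(true); // never keep the JVM alive because of this thread
            return t;
        });
        try {
            Future<?> parse = pool.submit(() -> {
                // Stand-in for parser code that never looks at the interrupt flag.
                while (!stop.get()) {
                    ticks.incrementAndGet();
                }
            });
            try {
                parse.get(200, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                parse.cancel(true); // only sets the worker's interrupt flag
            }
            long atCancel = ticks.get();
            Thread.sleep(200);
            // The Future reports cancelled, yet the counter is still advancing.
            return parse.isCancelled() && ticks.get() > atCancel;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            stop.set(true); // a real hung parser offers no such switch
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("worker still running after cancel: " + workerSurvivesCancel());
    }
}
```

The stop flag exists only so the demo terminates; with a genuinely hung parser the thread would run until the process itself is killed.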
