The ForkParser can kill and restart its child process on a permanent hang.  
The RecursiveParserWrapper is not currently integrated into the ForkParser; 
patches are welcome.

At the Tika level, we generally don't check Thread.interrupted() because 
our dependencies don't check it either.  
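
To illustrate why that matters, here is a minimal sketch using only the JDK 
(no Tika involved). The class name InterruptDemo and the busy loop standing in 
for a stuck parser are hypothetical. The point is only that 
Future.cancel(true) sets the interrupt flag, and code that never checks the 
flag keeps running.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;

class InterruptDemo {

    // Hypothetical stand-in for a parser stuck in a loop: it never looks at
    // the interrupt flag. The 'release' flag exists only so the demo can end.
    static void stubbornWork(AtomicBoolean release) {
        while (!release.get()) {
            // busy loop, deliberately ignoring Thread.interrupted()
        }
    }

    /** Returns true if the worker was still running after Future.cancel(true). */
    static boolean survivesCancel() {
        AtomicBoolean release = new AtomicBoolean(false);
        AtomicBoolean finished = new AtomicBoolean(false);
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<?> future = pool.submit(() -> {
                stubbornWork(release);
                finished.set(true);
            });
            Thread.sleep(100);      // give the task time to start
            future.cancel(true);    // interrupts the worker thread
            Thread.sleep(200);      // give it a chance to react (it won't)
            return future.isCancelled() && !finished.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            release.set(true);      // let the worker exit so the JVM can stop
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("still running after cancel: " + survivesCancel());
    }
}
```

The Future reports itself as cancelled, but the thread underneath is still 
burning CPU, which is the situation described for a hung parser.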

Unfortunately, you do have to kill a process when a parser hits a permanent 
hang.  Nothing you can do to a thread will actually be useful; see TIKA-456 for 
a discussion of this.
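
As a rough sketch of the kill-the-process approach, again using only the JDK: 
run the work in a child process, wait with a timeout, and destroy the child if 
it has not exited. The class ChildProcessWatchdog, the method name, and 
whatever command you pass in are assumptions for illustration; this is not how 
ForkParser or tika-batch is actually implemented.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

class ChildProcessWatchdog {

    /**
     * Runs the command in a child process. Returns true if the child exited
     * on its own within the timeout, false if it had to be killed.
     */
    static boolean runWithTimeout(List<String> command, long timeoutMillis) {
        try {
            Process child = new ProcessBuilder(command)
                    .inheritIO()    // child shares our stdout/stderr
                    .start();
            if (child.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return true;        // finished in time
            }
            child.destroyForcibly(); // the OS kills it; no cooperation needed
            child.waitFor();         // wait for the kill to take effect
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("usage: ChildProcessWatchdog <command> [args...]");
            return;
        }
        // The 10 second limit is an arbitrary value for this sketch.
        boolean ok = runWithTimeout(Arrays.asList(args), 10_000);
        System.out.println(ok ? "child finished" : "child killed after timeout");
    }
}
```

A caller that gets false back can log the offending file and start a fresh 
child for the next one.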

Some options:

1) The ForkParser will time out and restart.

2) tika-batch, e.g. java -jar tika-app.jar -i <input_dir> -o <output_dir>, 
runs multithreaded and spawns a child process that will be killed and 
restarted on a permanent hang or OOM.

3) tika-server...we could/should harden that via a child process that could be 
killed/restarted, but that doesn't currently exist.

4) A framework, e.g. Hadoop; see 
http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
and Ken Krugler's email (somewhere on our list?!) about spawning a separate 
thread for each parse and then aborting the process if there's a timeout.
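
A minimal sketch of the idea in option 4 (a separate thread per parse, abort 
the process on timeout), using only the JDK. The class ParseOrAbort, the 
method callOrAbort, and the injectable abort action are all hypothetical 
names; this is one reading of the description, not the code from that email.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class ParseOrAbort {

    /**
     * Runs the task on its own daemon thread. If it does not finish within
     * the timeout, runs the abort action instead of trying to stop the thread.
     */
    static <T> T callOrAbort(Callable<T> task, long timeoutMillis, Runnable abort) {
        FutureTask<T> future = new FutureTask<>(task);
        Thread worker = new Thread(future, "parse-worker");
        worker.setDaemon(true);   // a stuck worker must not keep the JVM alive
        worker.start();
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            abort.run();
            return null;          // reached only if abort did not end the process
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    public static void main(String[] args) {
        // Assumption: in a real worker the abort action would end the JVM and
        // the framework would reschedule the task elsewhere.
        String result = callOrAbort(
                () -> { Thread.sleep(60_000); return "done"; },
                500,
                () -> Runtime.getRuntime().halt(1));
        System.out.println(result);
    }
}
```

Passing the abort action in as a Runnable keeps the timeout logic testable 
without actually killing the JVM that runs the test.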

Finally, whichever option you use, you can use the MockParser in 
tika-core/tests to test that your processing pipeline correctly handles 
timeouts, OOMs, etc.  Add it to your classpath and then ask Tika to parse, e.g. 
<mock><oom/></mock>.  See: https://wiki.apache.org/tika/MockParser 




-----Original Message-----
From: Jim Idle [mailto:[email protected]] 
Sent: Tuesday, November 21, 2017 11:13 PM
To: [email protected]
Subject: RE: Very slow parsing of a few PDF files

I didn't know that there was a ForkParser, but it might add significant 
overhead to the application. It looks like it has a pool, though I don't know 
whether it lets you, say, kill a long-running parser and restart the pool. I 
will look into it. One thing I see already is that it intercepts 
InterruptedException and wraps it in a TikaException, but does not set the 
thread's interrupted flag and cannot rethrow InterruptedException because the 
Parser interface does not declare it. It catches the inability to communicate, 
but does it start a new process if I cancel one?

I may have no choice though, as RecursiveParserWrapper, like any implementation 
of Parser, does not check Thread.interrupted() or throw InterruptedException, 
which means that I cannot time out a Future and cancel it.

Anyway, thanks for the pointer - I will play with it.

Jim

> -----Original Message-----
> From: Nick Burch [mailto:[email protected]]
> Sent: Tuesday, November 21, 2017 17:10
> To: [email protected]
> Subject: RE: Very slow parsing of a few PDF files
> 
> On Tue, 21 Nov 2017, Jim Idle wrote:
> > Following up on this, I will try cancelling my thread based tasks 
> > after a pre-set time limit. That is only going to work if Tika and 
> > the underlying parsers behave correctly with the interrupted exception.
> > Anyone had any success with that? I am mainly looking at Office, PDF 
> > and HTML right now. I will try it myself of course, but perhaps 
> > someone has already been down this path?
> 
> Have you tried with ForkParser? That would also protect you against 
> other kinds of failures like OOM too
> 
> Nick
