Thanks for the mockparser pointer - I saw that when looking at the tests for 
ForkParser.

I am going to have to write my own application-specific solution for this, as
ForkParser tries to serialize every class it thinks will be needed across the
connection, and a lot of third-party classes are not serializable. I think that
ForkParser is a good enough idea, but I am not sure how practical it is in a
real-life application. For instance, I have to use:

        <dependency>
            <groupId>nu.validator</groupId>
            <artifactId>htmlparser</artifactId>
            <version>1.4.6</version>
        </dependency>

because the HTML parser in Tika does not produce SAX events in the correct
order. The nu.validator parser is great, but it does not support
serialization, and so on.
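
One way to find out in advance which objects will fail is to push a candidate through plain Java serialization and see whether it throws. This is only a rough sketch: SerializationProbe, Holder and ThirdPartyState are made-up names for illustration, and it assumes the failures seen with ForkParser correspond to ordinary java.io serialization failures. It does not reproduce ForkParser's own mechanism.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;

final class SerializationProbe {

    /** Returns true if Java serialization can write the whole object graph. */
    static boolean canSerialize(Object candidate) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(candidate);
            return true;
        } catch (IOException e) {
            // NotSerializableException is a subclass of IOException.
            return false;
        }
    }

    /** Hypothetical stand-in for a third-party class that is not Serializable. */
    static final class ThirdPartyState { }

    /** Serializable itself, but only as serializable as the payload it carries. */
    static final class Holder implements Serializable {
        private static final long serialVersionUID = 1L;
        final Object payload;

        Holder(Object payload) {
            this.payload = payload;
        }
    }

    public static void main(String[] args) {
        System.out.println(canSerialize(new ArrayList<String>()));
        System.out.println(canSerialize(new Holder(new ThirdPartyState())));
    }
}
```

The Holder case is the one that tends to bite: a class can implement Serializable and still fail at runtime because of a single non-serializable field somewhere in the graph.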

Jim

> -----Original Message-----
> From: Allison, Timothy B. [mailto:[email protected]]
> Sent: Monday, November 27, 2017 23:05
> To: [email protected]
> Subject: RE: Very slow parsing of a few PDF files
> 
> The ForkParser does have the ability to kill and restart on permanent hangs.
> We don't have the RecursiveParserWrapper integrated into the ForkParser
> currently...patches are welcomed.
> 
> At the Tika level, we generally don't check for a Thread.interrupted() because
> our dependencies don't do it.
> 
> Unfortunately, you do have to kill a process for a parser that hits a
> permanent hang.  Nothing you can do to a thread will actually be useful, see
> TIKA-456 for a discussion of this.
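
To illustrate the point about threads, here is a small sketch using only java.util.concurrent (no Tika involved). It times out a Future and cancels it, yet the worker thread keeps running because the task never checks the interrupt flag. The busy loop is a hypothetical stand-in for a hung parser, InterruptDemo is a made-up name, and the release flag exists only so that the demo can end.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

final class InterruptDemo {

    /**
     * Cancels a task that ignores interrupts and reports whether its
     * thread is still alive after the cancel.
     */
    static boolean workerSurvivesCancel() {
        AtomicBoolean release = new AtomicBoolean(false); // demo-only escape hatch
        Thread[] worker = new Thread[1];
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "stuck-parser");
            t.setDaemon(true);
            worker[0] = t;
            return t;
        });
        try {
            Future<?> parse = pool.submit(() -> {
                // Stand-in for a hung parser: never looks at the interrupt flag.
                while (!release.get()) {
                    Thread.onSpinWait();
                }
            });
            try {
                parse.get(200, TimeUnit.MILLISECONDS);
                return false; // the task finished by itself, nothing to show
            } catch (TimeoutException e) {
                parse.cancel(true); // only sets the interrupt flag on the worker
            }
            Thread.sleep(200); // give the worker time to react, if it ever would
            return parse.isCancelled() && worker[0].isAlive();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            release.set(true); // let the loop end so the demo can exit
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("worker still alive after cancel: " + workerSurvivesCancel());
    }
}
```

The Future reports itself as cancelled, but the thread behind it is still burning CPU, which is why the thread-level timeout does not help with a real hang.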
> 
> Some options:
> 
> 1) The ForkParser will timeout and restart.
> 
> 2) tika-batch, e.g. java -jar tika-app.jar -i <input_dir> -o <output_dir>,
> will run multithreaded and it spawns a child process that will be killed and
> restarted on permanent hang/oom
> 
> 3) tika-server...we could/should harden that via a child process that could be
> killed/restarted, but that doesn't currently exist.
> 
> 4) framework, e.g. Hadoop, etc. see
> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
> and Ken Krugler's email (somewhere on our list?!) about spawning a
> separate thread for each parse and then aborting the process if there's a
> timeout
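
The process-based options above could look roughly like the following: start the work in a child process, wait with a timeout, and kill the child if it has not exited. This is a generic sketch with java.lang.Process, not how ForkParser or tika-batch are actually implemented. ChildProcessTimeout and runWithTimeout are made-up names, and the command in main is just a convenient short-lived child.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

final class ChildProcessTimeout {

    /**
     * Runs the command in a child process. Returns true if it exited on
     * its own within the timeout, false if it had to be killed.
     */
    static boolean runWithTimeout(List<String> command, long timeout, TimeUnit unit) {
        Process child;
        try {
            child = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    // Discard output so a chatty child cannot block on a full pipe.
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        try {
            if (child.waitFor(timeout, unit)) {
                return true;
            }
            child.destroyForcibly(); // unlike a thread, a process can be killed
            child.waitFor();         // wait for the kill to take effect
            return false;
        } catch (InterruptedException e) {
            child.destroyForcibly();
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for child", e);
        }
    }

    public static void main(String[] args) {
        String java = System.getProperty("java.home") + "/bin/java";
        System.out.println(runWithTimeout(List.of(java, "-version"), 60, TimeUnit.SECONDS));
    }
}
```

A caller that gets false back would typically log the offending file and start a fresh child for the next one.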
> 
> Finally, no matter what option you use, you can use the MockParser in
> tika-core/tests to test that your processing pipeline can correctly handle
> timeouts/oom etc.  Add that to your class path and then ask Tika to parse,
> e.g. <mock><oom/></mock>.  See:
> https://wiki.apache.org/tika/MockParser
> 
> 
> 
> 
> -----Original Message-----
> From: Jim Idle [mailto:[email protected]]
> Sent: Tuesday, November 21, 2017 11:13 PM
> To: [email protected]
> Subject: RE: Very slow parsing of a few PDF files
> 
> I didn't know that there was a ForkParser, but that might possibly be a
> significant overhead on the application - looks like it has a pool, though I
> don't know if it gives the ability to, say, kill a long-running parser and
> restart the pool. I will look into it: one thing I see already is that it
> intercepts InterruptedException, wraps it in a TikaException but does not set
> the Thread interrupted flag, and cannot rethrow it because the Parser
> interface does not throw it. It catches inability to communicate, but does it
> start a new process if I cancel one?
> 
> I may have no choice though, as RecursiveParserWrapper, like any
> implementation of Parser, does not check for Thread.interrupted() or throw
> InterruptedException, which means that I cannot time out a Future and cancel
> it.
> 
> Anyway, thanks for the pointer - I will play with it.
> 
> Jim
> 
> > -----Original Message-----
> > From: Nick Burch [mailto:[email protected]]
> > Sent: Tuesday, November 21, 2017 17:10
> > To: [email protected]
> > Subject: RE: Very slow parsing of a few PDF files
> >
> > On Tue, 21 Nov 2017, Jim Idle wrote:
> > > Following up on this, I will try cancelling my thread based tasks
> > > after a pre-set time limit. That is only going to work if Tika and
> > > the underlying parsers behave correctly with the interrupted exception.
> > > Anyone had any success with that? I am mainly looking at Office, PDF
> > > and HTML right now. I will try it myself of course, but perhaps
> > > someone has already been down this path?
> >
> > Have you tried with ForkParser? That would also protect you against
> > other kinds of failures like OOM too
> >
> > Nick
