Thanks for the MockParser pointer - I saw that when looking at the tests for
ForkParser.
I am going to have to write my own application-specific solution for this, as
ForkParser tries to serialize every class it thinks will be needed across the
connection, and a lot of third-party classes are not serializable. I think
ForkParser is a good idea, but I am not sure how practical it is in a
real-life application. For instance, I have to use:
    <dependency>
      <groupId>nu.validator</groupId>
      <artifactId>htmlparser</artifactId>
      <version>1.4.6</version>
    </dependency>
I need it because the HTML parser in Tika does not produce SAX events in the
correct order. The nu.validator parser is great, but it does not support
serialization, and so on.
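
For what it is worth, the application-specific approach I have in mind is the
kill-the-process idea discussed below. A minimal JDK-only sketch, assuming the
parse runs in a separate child process so nothing has to be serialized across
a connection. ChildProcessTimeout and runWithTimeout are hypothetical names,
not Tika classes, and the command in main is only a placeholder for whatever
would launch the real parse:

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class ChildProcessTimeout {

    /**
     * Runs a command in a child process and waits up to timeoutMillis.
     * Returns true if the child exited on its own, false if it had to be killed.
     * (Hypothetical helper for illustration; not part of Tika.)
     */
    static boolean runWithTimeout(List<String> command, long timeoutMillis)
            throws IOException, InterruptedException {
        Process child = new ProcessBuilder(command)
                // Merge stderr into stdout so there is only one stream to deal with.
                .redirectErrorStream(true)
                // Java 9+: discard output so the child cannot block on a full pipe.
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                .start();

        // waitFor returns true only if the child exited within the time limit.
        if (child.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
            return true;
        }

        // The child is presumed hung: kill the whole process, then reap it.
        child.destroyForcibly();
        child.waitFor();
        return false;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder command: start a JVM that exits quickly.
        String java = System.getProperty("java.home") + "/bin/java";
        boolean finished = runWithTimeout(List.of(java, "-version"), 30_000);
        System.out.println("child finished in time: " + finished);
    }
}
```

In a real pipeline the command would launch a child JVM that does the parse
and writes its result somewhere the parent can read. How the result comes back
is left out of this sketch.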
Jim
> -----Original Message-----
> From: Allison, Timothy B. [mailto:[email protected]]
> Sent: Monday, November 27, 2017 23:05
> To: [email protected]
> Subject: RE: Very slow parsing of a few PDF files
>
> The ForkParser does have the ability to kill and restart on permanent hangs.
> We don't have the RecursiveParserWrapper integrated into the ForkParser
> currently...patches are welcomed.
>
> At the Tika level, we generally don't check for a Thread.interrupted() because
> our dependencies don't do it.
>
> Unfortunately, you do have to kill a process for a parser that hits a
> permanent hang. Nothing you can do to a thread will actually be useful, see
> TIKA-456 for a discussion of this.
>
> Some options:
>
> 1) The ForkParser will time out and restart.
>
> 2) tika-batch, e.g. java -jar tika-app.jar -i <input_dir> -o <output_dir>,
> will run multithreaded, and it spawns a child process that will be killed and
> restarted on permanent hang/oom
>
> 3) tika-server...we could/should harden that via a child process that could be
> killed/restarted, but that doesn't currently exist.
>
> 4) framework, e.g. Hadoop, etc. see
> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
> and Ken Krugler's email (somewhere on our list?!) about spawning a
> separate thread for each parse and then aborting the process if there's a
> timeout
>
> Finally, no matter what option you use, you can use the MockParser in
> tika-core/tests to test that your processing pipeline can correctly handle
> timeouts/oom etc. Add that to your class path and then ask Tika to parse,
> e.g. <mock><oom/></mock>. See:
> https://wiki.apache.org/tika/MockParser
>
>
>
>
> -----Original Message-----
> From: Jim Idle [mailto:[email protected]]
> Sent: Tuesday, November 21, 2017 11:13 PM
> To: [email protected]
> Subject: RE: Very slow parsing of a few PDF files
>
> I didn't know that there was a ForkParser, but that might be a significant
> overhead on the application. It looks like it has a pool, though I don't
> know if it gives the ability to, say, kill a long-running parser and restart
> the pool. I will look into it. One thing I see already is that it intercepts
> InterruptedException and wraps it in a TikaException, but does not set the
> thread's interrupted flag and cannot rethrow InterruptedException because the
> Parser interface does not declare it. It catches inability to communicate,
> but does it start a new process if I cancel one?
>
> I may have no choice though, as RecursiveParserWrapper, like any
> implementation of Parser, does not check for Thread.interrupted() or throw
> InterruptedException, which means that I cannot time out a Future and
> cancel it.
>
> Anyway, thanks for the pointer - I will play with it.
>
> Jim
>
> > -----Original Message-----
> > From: Nick Burch [mailto:[email protected]]
> > Sent: Tuesday, November 21, 2017 17:10
> > To: [email protected]
> > Subject: RE: Very slow parsing of a few PDF files
> >
> > On Tue, 21 Nov 2017, Jim Idle wrote:
> > > Following up on this, I will try cancelling my thread based tasks
> > > after a pre-set time limit. That is only going to work if Tika and
> > > the underlying parsers behave correctly with the interrupted exception.
> > > Anyone had any success with that? I am mainly looking at Office, PDF
> > > and HTML right now. I will try it myself of course, but perhaps
> > > someone has already been down this path?
> >
> > Have you tried with ForkParser? That would also protect you against
> > other kinds of failures like OOM too
> >
> > Nick