RE: Very slow parsing of a few PDF files

Jim Idle Mon, 20 Nov 2017 19:32:27 -0800

Following up on this, I will try cancelling my thread-based tasks after a 
pre-set time limit. That is only going to work if Tika and the underlying 
parsers respond correctly to thread interruption (InterruptedException). Has 
anyone had any success with that? I am mainly looking at Office, PDF and HTML 
right now. I will try it myself of course, but perhaps someone has already 
been down this path?
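
For reference, a minimal sketch of the timeout approach described above, using 
only java.util.concurrent. The class name TimedParse and the method 
runWithTimeout are hypothetical, and the task passed in is a stand-in for a 
real Tika parse call. Note that Future.cancel(true) only interrupts the worker 
thread; if the parser never checks the interrupt flag, the thread keeps 
running, which is why the sketch uses a daemon thread.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs a task on a worker thread with a time limit.
final class TimedParse {

    // Returns the task's result, or Optional.empty() if the limit is exceeded.
    static <T> Optional<T> runWithTimeout(Callable<T> task, long timeoutMillis)
            throws ExecutionException, InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true); // a stuck parser must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // Requests interruption; the task stops only if it honours the interrupt.
            future.cancel(true);
            return Optional.empty();
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a slow parse: sleeps far longer than the limit.
        Optional<String> slow = runWithTimeout(() -> {
            Thread.sleep(10_000);
            return "text";
        }, 200);
        System.out.println("slow parse finished: " + slow.isPresent());

        // Stand-in for a fast parse: returns immediately.
        Optional<String> fast = runWithTimeout(() -> "text", 200);
        System.out.println("fast parse finished: " + fast.isPresent());
    }
}
```

Thread.sleep responds to interruption, so the stand-in task stops promptly 
here. Whether a given parser does the same is exactly the open question.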


Jim

> -----Original Message-----
> From: Jim Idle [mailto:[email protected]]
> Sent: Monday, November 20, 2017 11:54
> To: [email protected]
> Subject: RE: Very slow parsing of a few PDF files
> 
> Tim,
> 
> I am seeing a lot of files that are taking a long time to parse and I am
> currently gathering some samples from our company's servers that I can use
> publicly as most are proprietary to our customers and a good number are
> malware, which may mean they are deliberately broken in format and
> causing the underlying parsers some issues.
> 
> Do you think that Tika could be made to abort a parse after a certain time, or
> is that too complicated given that there are so many underlying parser
> mechanisms?
> 
> Cheers,
> 
> Jim
> 
> > -----Original Message-----
> > From: Allison, Timothy B. [mailto:[email protected]]
> > Sent: Friday, November 17, 2017 00:04
> > To: [email protected]
> > Subject: RE: Very slow parsing of a few PDF files
> >
> > It boggles my mind that SAX parsing would take 5 minutes, but, um, maybe?
> > Now that I think about it there was a beastly pptx file that someone
> > submitted on our JIRA that did take 2 minutes, so, maybe???
> >
> > Open an issue in our JIRA to make extraction of charts/diagrams
> > configurable, and you'll be able to tell. 😊
> >
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]]
> > Sent: Wednesday, November 15, 2017 10:23 AM
> > To: [email protected]
> > Subject: Re: Very slow parsing of a few PDF files
> >
> >
> >
> > On 2017-11-07 02:52, Jim Idle <[email protected]> wrote:
> > > I have a few PDF files that are taking a very long time to parse.
> > >
> > > For instance I have a file that is 6.89MB that is taking minutes to
> > > parse. If I use jvisualvm and take a long sample, I get:
> > >
> > I'm having a similar problem (yes, with XLS). A 2MB xlsx is taking
> > around 5 mins to run. While looking at the CHANGES file I noted some
> > new stuff around xlsx, namely:
> >
> > * Extract text from charts in .docx, .pptx, .xlsx and .xlsb(TIKA-2254).
> > * Extract text from diagrams in .docx, .pptx, .xlsx and .xlsb(TIKA-1945).
> >
> > I then rolled back to version 1.15 and the same file took less than a 
> > second.
> >
> > Is there a way to be sure if these changes were responsible for the
> > extra processing time? If so, how can I disable them?
> >
> > Sorry, I can't share the file, but I can say it has some chart data:
> >  unzip  -l  tika-killer.xlsx  | grep -c xl/chart
> > 564
> >
> >
> > José Borges Ferreira
