RE: Very slow parsing of a few PDF files

Jim Idle Mon, 20 Nov 2017 22:48:08 -0800

Thanks Dave,

Yes - I already have a thread pool that runs the parsing tasks. I am just not 
sure whether all the parsers will handle being interrupted correctly, so I 
wondered if anyone had already tried it. My system sees about 1 million assets 
a day of various types. Not too many have runaway issues. I hope I can provide 
some samples to either the Tika team or the underlying parser team to see if 
anyone can do anything about them. But for now, if I can bail when needed then 
it will be fine.
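
For reference, the cancel-after-a-time-limit idea looks roughly like the 
sketch below. It is only an illustration, not tested against Tika: the class 
and method names (TimedParse, runWithTimeout) are made up, and the Callable 
stands in for whatever actually invokes the parser. Note that cancel(true) 
only delivers an interrupt, so a parser that never checks for it will keep 
running on its pool thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: the names here are illustrative, not part of Tika.
class TimedParse {

    /**
     * Submits the task to the pool and waits up to the given time limit.
     * Returns the task's result, or null if the limit was exceeded.
     */
    static <T> T runWithTimeout(ExecutorService pool, Callable<T> task,
                                long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException {
        Future<T> future = pool.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Interrupts the worker thread. This only stops the task if the
            // code it runs responds to interruption.
            future.cancel(true);
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // Stand-in for a parse that finishes quickly.
            String fast = runWithTimeout(pool, () -> "parsed",
                    1, TimeUnit.SECONDS);
            // Stand-in for a runaway parse; sleep() responds to interrupts.
            String slow = runWithTimeout(pool, () -> {
                Thread.sleep(10_000);
                return "never";
            }, 200, TimeUnit.MILLISECONDS);
            System.out.println(fast + " / " + slow);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

If a parser ignores the interrupt, the pool thread stays busy even though the 
caller has moved on, which is why running the parse in a separate process is 
the safer fallback.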


Cheers,

jim

> -----Original Message-----
> From: Dave Fisher [mailto:[email protected]]
> Sent: Tuesday, November 21, 2017 12:06
> To: [email protected]
> Subject: Re: Very slow parsing of a few PDF files
> 
> IIRC - In a Mac version of PowerPoint some seven years ago Microsoft went
> off the OOXML spec, which caused POI-produced files to run away. A POI user
> was able to get enough attention from MSFT to get it fixed.
> 
> Also any additional objects that POI now parses into objects inefficiently.
> 
> I had a project where some 500,000 assets - PDF, Office, etc went through
> Tika. Some 3-4 dozen files caused parsing trouble. The key is to try to
> isolate your Tika runs in a separate process.
> 
> IIRC from Tim’s ApacheCon Miami talk this is something for Tika Eval.
> 
> Regards,
> Dave
> 
> Sent from my iPhone
> 
> > On Nov 20, 2017, at 7:28 PM, Jim Idle <[email protected]> wrote:
> >
> > Following up on this, I will try cancelling my thread based tasks after a
> > pre-set time limit. That is only going to work if Tika and the underlying
> > parsers behave correctly with the interrupted exception. Anyone had any
> > success with that? I am mainly looking at Office, PDF and HTML right now.
> > I will try it myself of course, but perhaps someone has already been down
> > this path?
> >
> > Jim
> >
> >> -----Original Message-----
> >> From: Jim Idle [mailto:[email protected]]
> >> Sent: Monday, November 20, 2017 11:54
> >> To: [email protected]
> >> Subject: RE: Very slow parsing of a few PDF files
> >>
> >> Tim,
> >>
> >> I am seeing a lot of files that are taking a long time to parse and I
> >> am currently gathering some samples from our company's servers that I
> >> can use publicly as most are proprietary to our customers and a good
> >> number are malware, which may mean they are deliberately broken in
> >> format and causing the underlying parsers some issues.
> >>
> >> Do you think that Tika could be made to abort a parse after a certain
> >> time, or is that too complicated given that there are so many
> >> underlying parser mechanisms?
> >>
> >> Cheers,
> >>
> >> Jim
> >>
> >>> -----Original Message-----
> >>> From: Allison, Timothy B. [mailto:[email protected]]
> >>> Sent: Friday, November 17, 2017 00:04
> >>> To: [email protected]
> >>> Subject: RE: Very slow parsing of a few PDF files
> >>>
> >>> It boggles my mind that SAX parsing would take 5 minutes, but, um, maybe?
> >>> Now that I think about it there was a beastly pptx file that someone
> >>> submitted on our JIRA that did take 2 minutes, so, maybe???
> >>>
> >>> Open an issue in our JIRA to make extraction of charts/diagrams
> >>> configurable, and you'll be able to tell. 😊
> >>>
> >>> -----Original Message-----
> >>> From: [email protected] [mailto:[email protected]]
> >>> Sent: Wednesday, November 15, 2017 10:23 AM
> >>> To: [email protected]
> >>> Subject: Re: Very slow parsing of a few PDF files
> >>>
> >>>
> >>>
> >>>> On 2017-11-07 02:52, Jim Idle <[email protected]> wrote:
> >>>> I have a few PDF files that are taking a very long time to parse.
> >>>>
> >>>> For instance I have a file that is 6.89MB that is taking minutes to
> >>>> parse. If I use jvisualvm and take a long sample, I get:
> >>>>
> >>> I'm having a similar problem (yes, with XLS). A 2MB xlsx is taking
> >>> around 5 mins to run. While looking at the CHANGES file I noted some
> >>> new stuff around xlsx, namely:
> >>>
> >>> * Extract text from charts in .docx, .pptx, .xlsx and .xlsb (TIKA-2254).
> >>> * Extract text from diagrams in .docx, .pptx, .xlsx and .xlsb (TIKA-1945).
> >>>
> >>> I then rolled back to version 1.15 and the same file took less than a
> >>> second.
> >>>
> >>> Is there a way to be sure whether these changes were responsible for the
> >>> extra processing time? If so, how can I disable them?
> >>>
> >>> Sorry, I can't share the file, but I can say it has some chart data:
> >>> unzip  -l  tika-killer.xlsx  | grep -c xl/chart
> >>> 564
> >>>
> >>>
> >>> José Borges Ferreira
> >
