Re: Very slow parsing of a few PDF files

Dave Fisher Mon, 20 Nov 2017 20:05:55 -0800

IIRC - In a Mac version of PowerPoint some seven years ago, Microsoft went off
the OOXML spec, which caused POI-produced files to run away. A POI user was
able to get enough attention from MSFT to get it fixed.


Also, any additional objects that POI now parses may be parsed inefficiently.

I had a project where some 500,000 assets - PDF, Office, etc went through Tika. 
Some 3-4 dozen files caused parsing trouble. The key is to try to isolate your 
Tika runs in a separate process.
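
A minimal sketch of that kind of isolation, assuming the tika-app jar is run from the command line with its --text option. The jar path, the 60-second limit, and the class and method names are illustrative assumptions, not something from my project:

```java
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

final class IsolatedParse {

    /**
     * Runs a command in a child process and kills it if it exceeds the timeout.
     * Returns the child's exit code, or -1 if it was killed for timing out.
     */
    static int runWithTimeout(List<String> command, File output, long timeoutSeconds)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true);   // merge stderr into stdout
        pb.redirectOutput(output);      // write to a file so the pipe cannot fill up
        Process p = pb.start();
        if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            p.destroyForcibly();        // runaway parse: kill the whole child JVM
            p.waitFor();                // wait until the child is really gone
            return -1;
        }
        return p.exitValue();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: IsolatedParse <file-to-parse>");
            return;
        }
        // Assumption: tika-app.jar sits in the working directory.
        List<String> cmd = List.of("java", "-jar", "tika-app.jar", "--text", args[0]);
        int rc = runWithTimeout(cmd, new File(args[0] + ".txt"), 60);
        System.out.println(rc == -1 ? "timed out" : "exit code " + rc);
    }
}
```

Killing a child process does not depend on the parser honoring a thread interrupt, and a crash or OutOfMemoryError in the child cannot take down the parent. The cost is one JVM start per file unless the runs are batched.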

IIRC from Tim’s ApacheCon Miami talk, this is something for Tika Eval.

Regards,
Dave

Sent from my iPhone

> On Nov 20, 2017, at 7:28 PM, Jim Idle <[email protected]> wrote:
> 
> Following up on this, I will try cancelling my thread-based tasks after a 
> pre-set time limit. That is only going to work if Tika and the underlying 
> parsers respond correctly to thread interruption. Anyone had any 
> success with that? I am mainly looking at Office, PDF and HTML right now. I 
> will try it myself of course, but perhaps someone has already been down this 
> path?
> 
> Jim
> 
>> -----Original Message-----
>> From: Jim Idle [mailto:[email protected]]
>> Sent: Monday, November 20, 2017 11:54
>> To: [email protected]
>> Subject: RE: Very slow parsing of a few PDF files
>> 
>> Tim,
>> 
>> I am seeing a lot of files that are taking a long time to parse and I am
>> currently gathering some samples from our company's servers that I can use
>> publicly as most are proprietary to our customers and a good number are
>> malware, which may mean they are deliberately broken in format and
>> causing the underlying parsers some issues.
>> 
>> Do you think that Tika could be made to abort a parse after a certain time,
>> or is that too complicated given that there are so many underlying parser
>> mechanisms?
>> 
>> Cheers,
>> 
>> Jim
>> 
>>> -----Original Message-----
>>> From: Allison, Timothy B. [mailto:[email protected]]
>>> Sent: Friday, November 17, 2017 00:04
>>> To: [email protected]
>>> Subject: RE: Very slow parsing of a few PDF files
>>> 
>>> It boggles my mind that SAX parsing would take 5 minutes, but, um, maybe?
>>> Now that I think about it there was a beastly pptx file that someone
>>> submitted on our JIRA that did take 2 minutes, so, maybe???
>>> 
>>> Open an issue in our JIRA to make extraction of charts/diagrams
>>> configurable, and you'll be able to tell. 😊
>>> 
>>> -----Original Message-----
>>> From: [email protected] [mailto:[email protected]]
>>> Sent: Wednesday, November 15, 2017 10:23 AM
>>> To: [email protected]
>>> Subject: Re: Very slow parsing of a few PDF files
>>> 
>>> 
>>> 
>>>> On 2017-11-07 02:52, Jim Idle <[email protected]> wrote:
>>>> I have a few PDF files that are taking a very long time to parse.
>>>> 
>>>> For instance I have a file that is 6.89MB that is taking minutes to
>>>> parse. If I use jvisualvm and take a long sample, I get:
>>>> 
>>> I'm having a similar problem (yes, with XLS). A 2MB xlsx is taking
>>> around 5 mins to parse. While looking at the CHANGES file, I noted some
>>> new stuff around xlsx, namely:
>>> 
>>> * Extract text from charts in .docx, .pptx, .xlsx and .xlsb (TIKA-2254).
>>> * Extract text from diagrams in .docx, .pptx, .xlsx and .xlsb (TIKA-1945).
>>> 
>>> I then rolled back to version 1.15 and the same file took less than a 
>>> second.
>>> 
>>> Is there a way to be sure if these changes were responsible for the
>>> extra processing time? If so, how can I disable them?
>>> 
>>> Sorry, I can't share the file, but I can say it has some chart data:
>>> unzip  -l  tika-killer.xlsx  | grep -c xl/chart
>>> 564
>>> 
>>> 
>>> José Borges Ferreira
>
