Re: Very slow parsing of a few PDF files

[email protected] Wed, 15 Nov 2017 07:23:03 -0800


On 2017-11-07 02:52, Jim Idle <[email protected]> wrote: 
> I have a few PDF files that are taking a very long time to parse.
> 
> For instance I have a file that is 6.89MB that is taking minutes to parse. If 
> I use jvisualvm and take a long sample, I get:
> 
I'm having a similar problem (yes, with XLS). A 2MB xlsx is taking around 5 
minutes to parse. While looking at the CHANGES file, I noticed some new items 
around xlsx, namely:


* Extract text from charts in .docx, .pptx, .xlsx and .xlsb (TIKA-2254).
* Extract text from diagrams in .docx, .pptx, .xlsx and .xlsb (TIKA-1945).

I then rolled back to version 1.15 and the same file took less than a second.

Is there a way to confirm whether these changes are responsible for the extra 
processing time? If so, how can I disable them?

Sorry, I can't share the file, but I can say it has a lot of chart data:
 unzip -l tika-killer.xlsx | grep -c xl/chart
564
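
Note that grep -c xl/chart counts every listing line containing "xl/chart", so 
the 564 may include .rels and other parts under xl/charts, not only the chart 
XML itself. For a per-folder breakdown, here is a rough sketch in Python 
(standard library only). The function name count_parts_by_folder and the depth 
parameter are my own and have nothing to do with Tika; it only inspects the zip 
listing and says nothing about where the parse time actually goes.

```python
import io
import sys
import zipfile
from collections import Counter


def count_parts_by_folder(xlsx, depth=2):
    """Count entries in an OOXML (zip) package, grouped by leading folders.

    `xlsx` may be a path or a file-like object. `depth` is how many leading
    path components to keep, e.g. depth=2 groups under "xl/charts".
    """
    counts = Counter()
    with zipfile.ZipFile(xlsx) as package:
        for name in package.namelist():
            if name.endswith("/"):
                continue  # skip explicit directory entries
            folders = name.split("/")[:-1]  # drop the file name itself
            key = "/".join(folders[:depth]) or "(root)"
            counts[key] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Print the folders with the most parts first.
    for folder, n in count_parts_by_folder(sys.argv[1]).most_common():
        print(n, folder)
```

Run it with the workbook path as the only argument (for example the 
tika-killer.xlsx above) to see whether xl/charts really dominates the package.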


José Borges Ferreira
