On 2017-11-07 02:52, Jim Idle <[email protected]> wrote:
> I have a few PDF files that are taking a very long time to parse.
>
> For instance I have a file that is 6.89MB that is taking minutes to parse. If
> I use jvisualvm and take a long sample, I get:
>

I'm having a similar problem (though with XLSX rather than PDF). A 2MB xlsx takes around 5 minutes to parse. While looking at the CHANGES file, I noticed some new entries related to xlsx, namely:

* Extract text from charts in .docx, .pptx, .xlsx and .xlsb (TIKA-2254).
* Extract text from diagrams in .docx, .pptx, .xlsx and .xlsb (TIKA-1945).

I then rolled back to version 1.15, and the same file took less than a second.

Is there a way to confirm whether these changes are responsible for the extra processing time? If so, how can I disable them?

Sorry, I can't share the file, but I can say it does contain chart data:

unzip -l tika-killer.xlsx | grep -c xl/chart
564

José Borges Ferreira
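
P.S. In case it helps anyone check their own files for the same pattern on a machine without unzip, here is a minimal Python sketch of the count above. It only inspects the entry names inside the .xlsx (which is a zip archive) and does not involve Tika at all. The helper names count_parts and parts_by_folder are hypothetical, chosen just for this illustration.

```python
import zipfile
from collections import Counter


def count_parts(path, needle="xl/chart"):
    """Count archive entries whose name contains `needle`.

    Roughly what `unzip -l file | grep -c xl/chart` reports.
    """
    with zipfile.ZipFile(path) as zf:
        return sum(1 for name in zf.namelist() if needle in name)


def parts_by_folder(path):
    """Tally entries by their first two path components, e.g. 'xl/charts'."""
    with zipfile.ZipFile(path) as zf:
        return Counter("/".join(name.split("/")[:2]) for name in zf.namelist())
```

Calling parts_by_folder("tika-killer.xlsx").most_common(5) would show which folders inside the workbook hold the most parts, which may help judge whether chart parts dominate a slow file.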
