IIRC, in a Mac version of PowerPoint some seven years ago, Microsoft went off the OOXML spec, which caused POI-produced files to run away. A POI user was able to get enough attention from MSFT to get it fixed.
Also, there may be additional objects that POI now parses into objects inefficiently.

I had a project where some 500,000 assets (PDF, Office, etc.) went through Tika. Some 3-4 dozen files caused parsing trouble. The key is to try to isolate your Tika runs in a separate process.

IIRC from Tim's ApacheCon Miami talk, this is something for Tika Eval.

Regards,
Dave

Sent from my iPhone

> On Nov 20, 2017, at 7:28 PM, Jim Idle <[email protected]> wrote:
>
> Following up on this, I will try cancelling my thread based tasks after a
> pre-set time limit. That is only going to work if Tika and the underlying
> parsers behave correctly with the interrupted exception. Anyone had any
> success with that? I am mainly looking at Office, PDF and HTML right now. I
> will try it myself of course, but perhaps someone has already been down this
> path?
>
> Jim
>
>> -----Original Message-----
>> From: Jim Idle [mailto:[email protected]]
>> Sent: Monday, November 20, 2017 11:54
>> To: [email protected]
>> Subject: RE: Very slow parsing of a few PDF files
>>
>> Tim,
>>
>> I am seeing a lot of files that are taking a long time to parse and I am
>> currently gathering some samples from our company's servers that I can use
>> publicly as most are proprietary to our customers and a good number are
>> malware, which may mean they are deliberately broken in format and
>> causing the underlying parsers some issues.
>>
>> Do you think that Tika could be made to abort a parse after a certain time,
>> or is that too complicated given that there are so many underlying parser
>> mechanisms?
>>
>> Cheers,
>>
>> Jim
>>
>>> -----Original Message-----
>>> From: Allison, Timothy B. [mailto:[email protected]]
>>> Sent: Friday, November 17, 2017 00:04
>>> To: [email protected]
>>> Subject: RE: Very slow parsing of a few PDF files
>>>
>>> It boggles my mind that SAX parsing would take 5 minutes, but, um, maybe?
>>> Now that I think about it there was a beastly pptx file that someone
>>> submitted on our JIRA that did take 2 minutes, so, maybe???
>>>
>>> Open an issue in our JIRA to make extraction of charts/diagrams
>>> configurable, and you'll be able to tell. 😊
>>>
>>> -----Original Message-----
>>> From: [email protected] [mailto:[email protected]]
>>> Sent: Wednesday, November 15, 2017 10:23 AM
>>> To: [email protected]
>>> Subject: Re: Very slow parsing of a few PDF files
>>>
>>>
>>>
>>>> On 2017-11-07 02:52, Jim Idle <[email protected]> wrote:
>>>> I have a few PDF files that are taking a very long time to parse.
>>>>
>>>> For instance I have a file that is 6.89MB that is taking minutes to
>>>> parse. If I use jvisualvm and take a long sample, I get:
>>>>
>>> I'm having a similar problem (yes, with XLS). A 2MB xlsx is taking
>>> around 5 mins to run. While looking at the CHANGES file I noted some
>>> new stuff around xlsx, namely:
>>>
>>> * Extract text from charts in .docx, .pptx, .xlsx and .xlsb (TIKA-2254).
>>> * Extract text from diagrams in .docx, .pptx, .xlsx and .xlsb (TIKA-1945).
>>>
>>> I then rolled back to version 1.15 and the same file took less than a
>>> second.
>>>
>>> Is there a way to be sure if these changes were responsible for the
>>> extra processing time? If so, how can I disable them?
>>>
>>> Sorry, but I can't share the file but can say it has some chart data.
>>> unzip -l tika-killer.xlsx | grep -c xl/chart
>>> 564
>>>
>>>
>>> José Borges Ferreira
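
For what it's worth, the "cancel the task after a pre-set time limit" idea discussed in this thread could be sketched roughly as below. This is only a minimal sketch using the plain JDK: the class name TimedParse and the method runWithTimeout are made up for illustration, and the Callable stands in for whatever parse call you actually make. Nothing here is Tika API. Note that cancelling only sends an interrupt, so a parser that never checks its interrupted status will keep running in the background, which is the reason a separate process is the more robust isolation.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Tika: runs a task with a time limit.
class TimedParse {

    /**
     * Runs the task on a worker thread and waits at most timeoutMillis.
     * Returns Optional.empty() if the task timed out or threw.
     */
    static <T> Optional<T> runWithTimeout(Callable<T> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // Sends an interrupt; this only helps if the task reacts to it.
            future.cancel(true);
            return Optional.empty();
        } catch (ExecutionException e) {
            // The task itself failed (for example, a parse error).
            return Optional.empty();
        } catch (InterruptedException e) {
            // The caller was interrupted while waiting; restore the flag.
            Thread.currentThread().interrupt();
            future.cancel(true);
            return Optional.empty();
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        Optional<String> fast = runWithTimeout(() -> "parsed", 2000);
        Optional<String> slow = runWithTimeout(() -> {
            Thread.sleep(10_000); // stands in for a parser stuck on a bad file
            return "too late";
        }, 200);
        System.out.println("fast finished: " + fast.isPresent());
        System.out.println("slow finished: " + slow.isPresent());
    }
}
```

In the demo the stand-in task sleeps, and Thread.sleep does respond to interrupts, so the worker stops promptly. A real parser stuck in a tight loop may not behave that way.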

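
The unzip/grep check quoted above (counting archive entries whose name contains xl/chart) can also be done from Java before handing a file to the parser, for example to flag chart-heavy workbooks up front. A rough sketch with java.util.zip follows; the class ChartPartCounter and both of its methods are hypothetical names for illustration, and writeSampleArchive exists only so the example can be tried without a real workbook.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: mirrors `unzip -l file | grep -c xl/chart`.
class ChartPartCounter {

    /** Counts zip entries whose name contains the given fragment. */
    static long countEntries(Path archive, String fragment) {
        try (ZipFile zip = new ZipFile(archive.toFile())) {
            return zip.stream()
                      .filter(entry -> entry.getName().contains(fragment))
                      .count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo-only helper: writes a temp zip holding empty entries with these names. */
    static Path writeSampleArchive(String... entryNames) {
        try {
            Path archive = Files.createTempFile("sample", ".xlsx");
            try (OutputStream out = Files.newOutputStream(archive);
                 ZipOutputStream zip = new ZipOutputStream(out)) {
                for (String name : entryNames) {
                    zip.putNextEntry(new ZipEntry(name));
                    zip.closeEntry();
                }
            }
            return archive;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Usage: java ChartPartCounter some-workbook.xlsx
        if (args.length == 1) {
            System.out.println(countEntries(Paths.get(args[0]), "xl/chart"));
        }
    }
}
```

A substring match is used on purpose so the count lines up with what grep reports, including relationship parts under the charts directory.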