Following up on this, I will try cancelling my thread-based tasks after a preset time limit. That will only work if Tika and the underlying parsers respond correctly to thread interruption (InterruptedException). Has anyone had any success with that? I am mainly looking at Office, PDF and HTML right now. I will try it myself of course, but perhaps someone has already been down this path?
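
For reference, a minimal sketch of that kind of time-limited cancellation using only java.util.concurrent. The class name TimeoutSketch, the method runWithTimeout, and the sleeping stand-in task are hypothetical and not part of Tika. Whether a real Tika parse stops when its thread is interrupted is exactly the open question here, so the sketch only shows the cancellation mechanics.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not a Tika API: runs a task with a time limit.
class TimeoutSketch {

    // Runs the task on a worker thread and returns its result, or an empty
    // Optional if the task does not finish within limitMillis.
    static <T> Optional<T> runWithTimeout(Callable<T> task, long limitMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(task);
        try {
            return Optional.ofNullable(future.get(limitMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // cancel(true) interrupts the worker thread. The work only stops
            // if the task checks for interruption or blocks interruptibly.
            future.cancel(true);
            return Optional.empty();
        } catch (InterruptedException e) {
            // The calling thread was interrupted while waiting: give up too.
            future.cancel(true);
            Thread.currentThread().interrupt();
            return Optional.empty();
        } catch (ExecutionException e) {
            // The task itself threw; surface the original cause.
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            // Interrupts the worker if it is still running and frees the pool.
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a quick parse.
        Optional<String> fast = runWithTimeout(() -> "parsed text", 1000);

        // Stand-in for a slow parse; Thread.sleep responds to interruption,
        // which a real parser may or may not do.
        Optional<String> slow = runWithTimeout(() -> {
            Thread.sleep(10_000);
            return "never returned";
        }, 200);

        System.out.println("fast: " + fast);
        System.out.println("slow: " + slow);
    }
}
```

If the task ignores interruption, the caller still gets control back after the limit, but the worker thread keeps running in the background. In that case, running the parse in a separate process that can be killed may be the only reliable option.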
Jim

> -----Original Message-----
> From: Jim Idle [mailto:[email protected]]
> Sent: Monday, November 20, 2017 11:54
> To: [email protected]
> Subject: RE: Very slow parsing of a few PDF files
>
> Tim,
>
> I am seeing a lot of files that are taking a long time to parse and I am
> currently gathering some samples from our company's servers that I can use
> publicly as most are proprietary to our customers and a good number are
> malware, which may mean they are deliberately broken in format and
> causing the underlying parsers some issues.
>
> Do you think that Tika could be made to abort a parse after a certain time, or
> is that too complicated given that there are so many underlying parser
> mechanisms?
>
> Cheers,
>
> Jim
>
> > -----Original Message-----
> > From: Allison, Timothy B. [mailto:[email protected]]
> > Sent: Friday, November 17, 2017 00:04
> > To: [email protected]
> > Subject: RE: Very slow parsing of a few PDF files
> >
> > It boggles my mind that SAX parsing would take 5 minutes, but, um, maybe?
> > Now that I think about it there was a beastly pptx file that someone
> > submitted on our JIRA that did take 2 minutes, so, maybe???
> >
> > Open an issue in our JIRA to make extraction of charts/diagrams
> > configurable, and you'll be able to tell. 😊
> >
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]]
> > Sent: Wednesday, November 15, 2017 10:23 AM
> > To: [email protected]
> > Subject: Re: Very slow parsing of a few PDF files
> >
> >
> >
> > On 2017-11-07 02:52, Jim Idle <[email protected]> wrote:
> > > I have a few PDF files that are taking a very long time to parse.
> > >
> > > For instance I have a file that is 6.89MB that is taking minutes to
> > > parse. If I use jvisualvm and take a long sample, I get:
> > >
> > I'm having a similar problem ( yes, with XLS ) . A 2MB xlsx is taken
> > around 5 mins to run.
> > While looking at the CHANGES files noted some
> > new stuff around xlsx, namely :
> >
> > * Extract text from charts in .docx, .pptx, .xlsx and .xlsb(TIKA-2254).
> > * Extract text from diagrams in .docx, .pptx, .xlsx and .xlsb(TIKA-1945).
> >
> > I then rolled back to version 1.15 and the same file took less than a
> > second.
> >
> > Is there a way to be sure if these changes were responsible for the
> > extra processing time? if so how can I disable them?
> >
> > Sorry, but I cant share the file but can say it has some chart data.
> > unzip -l tika-killer.xlsx | grep -c xl/chart
> > 564
> >
> >
> > José Borges Ferreira
