Thanks Dave,

Yes - I already have a thread pool that runs the parsing tasks. I am just not sure whether all the parsers will handle being interrupted correctly, so I wondered if anyone had already tried it. My system sees about 1 million assets a day of various types, and not too many have runaway issues. I hope I can provide some samples to either the Tika team or the underlying parser team to see if anyone can do anything about them. But for now, if I can bail when needed then it will be fine.
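For reference, here is a minimal sketch of the timeout-and-cancel approach using only `java.util.concurrent`. The class `TimedParse`, the method `runWithTimeout`, and the sleeping stand-in task are hypothetical names for illustration, not Tika APIs; a real parse call would go where the stand-in is.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not part of Tika): runs one parse task with a time limit.
final class TimedParse {

    // Daemon threads, so a task that ignores interrupts cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "parse-worker");
        t.setDaemon(true);
        return t;
    });

    // Returns the task's result, or the fallback if the time limit is exceeded.
    static <T> T runWithTimeout(Callable<T> task, long timeoutMillis, T fallback)
            throws ExecutionException, InterruptedException {
        Future<T> future = POOL.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // cancel(true) only delivers an interrupt. The task stops only if the
            // code it runs checks the interrupt flag or blocks in an interruptible call.
            future.cancel(true);
            return fallback;
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a real parse call; Thread.sleep is interruptible,
        // so this task ends promptly once it is cancelled.
        Callable<String> slowParse = () -> {
            Thread.sleep(10_000);
            return "parsed text";
        };
        System.out.println(runWithTimeout(slowParse, 200, "TIMED OUT"));
    }
}
```

The caller gets control back after the time limit either way, but a parser that never checks for interruption will keep burning a worker thread in the background. That is the case where isolating the parse in a separate process, as Dave suggests, is the safer option.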
Cheers,

jim

> -----Original Message-----
> From: Dave Fisher [mailto:dave2w...@comcast.net]
> Sent: Tuesday, November 21, 2017 12:06
> To: user@tika.apache.org
> Subject: Re: Very slow parsing of a few PDF files
>
> IIRC - In a Mac version of PowerPoint some seven years Microsoft went off
> OOXML spec which caused POI Produced files to runaway. A POI user was
> able to get enough attention from MSFT to get it fixed.
>
> Also any additional objects that POI now parses into objects inefficiently.
>
> I had a project where some 500,000 assets - PDF, Office, etc went through
> Tika. Some 3-4 dozen files caused parsing trouble. The key is to try to isolate
> your Tika runs in a separate process.
>
> IIRC from Tim’s Apachecon Miami talk this is something for Tika Eval.
>
> Regards,
> Dave
>
> Sent from my iPhone
>
> > On Nov 20, 2017, at 7:28 PM, Jim Idle <ji...@proofpoint.com> wrote:
> >
> > Following up on this, I will try cancelling my thread based tasks after a
> > pre-set time limit. That is only going to work if Tika and the underlying parsers
> > behave correctly with the interrupted exception. Anyone had any success
> > with that? I am mainly looking at Office, PDF and HTML right now. I will try it
> > myself of course, but perhaps someone has already been down this path?
> >
> > Jim
> >
> >> -----Original Message-----
> >> From: Jim Idle [mailto:ji...@proofpoint.com]
> >> Sent: Monday, November 20, 2017 11:54
> >> To: user@tika.apache.org
> >> Subject: RE: Very slow parsing of a few PDF files
> >>
> >> Tim,
> >>
> >> I am seeing a lot of files that are taking a long time to parse and I
> >> am currently gathering some samples from our company's servers that I
> >> can use publicly as most are proprietary to our customers and a good
> >> number are malware, which may mean they are deliberately broken in
> >> format and causing the underlying parsers some issues.
> >>
> >> Do you think that Tika could be made to abort a parse after a certain
> >> time, or is that too complicated given that there are so many
> >> underlying parser mechanisms?
> >>
> >> Cheers,
> >>
> >> Jim
> >>
> >>> -----Original Message-----
> >>> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> >>> Sent: Friday, November 17, 2017 00:04
> >>> To: user@tika.apache.org
> >>> Subject: RE: Very slow parsing of a few PDF files
> >>>
> >>> It boggles my mind that SAX parsing would take 5 minutes, but, um, maybe?
> >>> Now that I think about it there was a beastly pptx file that someone
> >>> submitted on our JIRA that did take 2 minutes, so, maybe???
> >>>
> >>> Open an issue in our JIRA to make extraction of charts/diagrams
> >>> configurable, and you'll be able to tell. 😊
> >>>
> >>> -----Original Message-----
> >>> From: undersp...@gmail.com [mailto:undersp...@gmail.com]
> >>> Sent: Wednesday, November 15, 2017 10:23 AM
> >>> To: user@tika.apache.org
> >>> Subject: Re: Very slow parsing of a few PDF files
> >>>
> >>>> On 2017-11-07 02:52, Jim Idle <ji...@proofpoint.com> wrote:
> >>>> I have a few PDF files that are taking a very long time to parse.
> >>>>
> >>>> For instance I have a file that is 6.89MB that is taking minutes to parse. If I
> >>>> use jvisualvm and take a long sample, I get:
> >>>>
> >>> I'm having a similar problem ( yes, with XLS ) . A 2MB xlsx is taken
> >>> around 5 mins to run. While looking at the CHANGES files noted some
> >>> new stuff around xlsx, namely :
> >>>
> >>> * Extract text from charts in .docx, .pptx, .xlsx and .xlsb(TIKA-2254).
> >>> * Extract text from diagrams in .docx, .pptx, .xlsx and .xlsb(TIKA-1945).
> >>>
> >>> I then rolled back to version 1.15 and the same file took less than a second.
> >>>
> >>> Is there a way to be sure if these changes were responsible for the
> >>> extra processing time? if so how can I disable them?
> >>>
> >>> Sorry, but I cant share the file but can say it has some chart data.
> >>> unzip -l tika-killer.xlsx | grep -c xl/chart
> >>> 564
> >>>
> >>> José Borges Ferreira