At the very least, the parser should not hang when processing a corrupted document. IMHO, cases where parser code hangs should be considered blocker issues.
Personally, I prefer the variant that returns a partial result plus some metadata saying that document parsing failed somehow. But that can be hard to do.

-- 
Best regards,
Konstantin Gribov

Mon, 30 March 2015 at 16:52, Allison, Timothy B. <talli...@mitre.org>:

> I think this is an open question within Tika. Some parsers prefer one
> thing over another. And there are different levels of corruption.
>
> In the two cases where govdocs1 docs might be useful in tests, the
> hyperlinks in .doc files do not appear to be "standard", but MSWord opens
> them without a problem. In cases where an application can open and
> correctly process the content, I think we ought to try to extract content
> without throwing exceptions.
>
> -----Original Message-----
> From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
> Sent: Monday, March 30, 2015 9:39 AM
> To: dev@tika.apache.org
> Subject: RE: including refactored docs from govdocs1 in test suite
>
> Ah. I see.
>
> In general, what is the goal with handling corrupted files? Extract as much
> as possible and fail gracefully?
>
> Tyler
>
> On Mar 30, 2015 9:32 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:
> >
> > Unfortunately, no. MSOffice fixes the document when I do that.
> >
> > -----Original Message-----
> > From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
> > Sent: Monday, March 30, 2015 9:24 AM
> > To: dev@tika.apache.org
> > Subject: Re: including refactored docs from govdocs1 in test suite
> >
> > Can you copy the hyperlink into a new doc and change the URL? I have no
> > idea about including the modified version.
> >
> > Tyler
> > On Mar 30, 2015 9:18 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:
> > >
> > > All,
> > >
> > > As part of TIKA-1512, I found that I can delete all of the contents,
> > > including the metadata, except for one hyperlink in two documents from
> > > govdocs1 and still get the proper behavior -- fail before fix, work
> > > after fix.
> > >
> > > These documents are in the public domain.
> > >
> > > Is it ok to include these modified documents in our test suite or
> > > should I avoid inclusion?
> > >
> > > Happy to avoid inclusion for the sake of a quick release of 1.8 and
> > > then we have time to discuss/determine way ahead... unless the answer
> > > is obvious.
> > >
> > > Best,
> > >
> > > Tim
> > >
> > > -----Original Message-----
> > > From: Allison, Timothy B. [mailto:talli...@mitre.org]
> > > Sent: Monday, March 30, 2015 7:03 AM
> > > To: dev@tika.apache.org
> > > Subject: RE: [DISCUSS] Tika 1.8 or 1.7.1
> > >
> > > Unless there are objections, I'd like these to be resolved before 1.8:
> > >
> > > TIKA-1584 -- I'll fix
> > > TIKA-1575 -- Resolved by Konstantin Gribov (thank you!)
> > > TIKA-1512 -- I'll put in a temporary fix so that we don't get IOOBEs,
> > > but I'll leave this open and do some more digging to see if we need to
> > > open a ticket at the POI level
> > > TIKA-1511 -- I'll remove "provided" for xerial
> > >
> > > TIKA-1549 -- We should thank Toke Eskildsen in CHANGES.txt, no?
> > >
> > > I'll have these fixes completed by noon EDT. Should I run against
> > > govdocs1 before or after the RC?
> > >
> > > My last build of Tika app (a few days ago) ballooned to ~43MB, and
> > > that's before I add ~3MB for xerial. Tika server is now ~48MB. As of
> > > my last build, we are still including ~4MB of pdfs (README.NLDAS1.pdf
> > > and README.NLDAS2.pdf) from the GRIB(?) parser in the tika-app and
> > > tika-server jars.
> > >
> > > Best,
> > >
> > > Tim
> > >
> > > -----Original Message-----
> > > From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
> > > Sent: Sunday, March 29, 2015 9:13 AM
> > > To: dev@tika.apache.org
> > > Subject: Re: [DISCUSS] Tika 1.8 or 1.7.1
> > >
> > > Once TIKA-1584 and TIKA-1575 are resolved, I'll work up an RC (unless
> > > something else pops up).
> > >
> > > Thank you everyone.
> > >
> > > Tyler
> > > On Mar 29, 2015 4:43 AM, "Hong-Thai Nguyen" <thaicha...@gmail.com> wrote:
> > >
> > > > +1 for 1.8
> > > >
> > > > Hong-Thai
> > > >
> > > > > On 28 Mar 2015, at 16:01, Tyler Palsulich <tpalsul...@apache.org> wrote:
> > > > >
> > > > > Hi Folks,
> > > > >
> > > > > Now that TIKA-1581 (JHighlight licensing issues) is resolved, we
> > > > > need to release a new version of Tika. I'll volunteer to be the
> > > > > release manager again.
> > > > >
> > > > > Should we release this as 1.8 or 1.7.1?
> > > > >
> > > > > Does anyone have any last minute issues they'd like to finish and
> > > > > see in Tika 1.X? I'd like to get the example working with CORS
> > > > > (TIKA-1585 and TIKA-1586). Any others?
> > > > >
> > > > > Have a good weekend,
> > > > > Tyler