I think this is an open question within Tika. Some parsers prefer one thing over another. And there are different levels of corruption.
In the two cases where govdocs1 docs might be useful in tests, the hyperlinks in .doc files do not appear to be "standard", but MSWord opens them without a problem. In cases where an application can open and correctly process the content, I think we ought to try to extract content without throwing exceptions.

-----Original Message-----
From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
Sent: Monday, March 30, 2015 9:39 AM
To: dev@tika.apache.org
Subject: RE: including refactored docs from govdocs1 in test suite

Ah. I see. In general, what is the goal with handling corrupted files? Extract as much as possible and fail gracefully?

Tyler
On Mar 30, 2015 9:32 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:
>
> Unfortunately, no. MSOffice fixes the document when I do that.
>
> -----Original Message-----
> From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
> Sent: Monday, March 30, 2015 9:24 AM
> To: dev@tika.apache.org
> Subject: Re: including refactored docs from govdocs1 in test suite
>
> Can you copy the hyperlink into a new doc and change the URL? I have no
> idea about including the modified version.
>
> Tyler
> On Mar 30, 2015 9:18 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:
>
> > All,
> >
> > As part of TIKA-1512, I found that I can delete all of the contents,
> > including the metadata, except for one hyperlink in two documents from
> > govdocs1 and still get the proper behavior -- fail before fix, work after
> > fix.
> >
> > These documents are in the public domain.
> >
> > Is it ok to include these modified documents in our test suite or should
> > I avoid inclusion?
> >
> > Happy to avoid inclusion for the sake of a quick release of 1.8 and then
> > we have time to discuss/determine way ahead... unless the answer is obvious.
> >
> > Best,
> >
> > Tim
> >
> > -----Original Message-----
> > From: Allison, Timothy B. [mailto:talli...@mitre.org]
> > Sent: Monday, March 30, 2015 7:03 AM
> > To: dev@tika.apache.org
> > Subject: RE: [DISCUSS] Tika 1.8 or 1.7.1
> >
> > Unless there are objections, I'd like these to be resolved before 1.8:
> >
> > TIKA-1584 -- I'll fix
> > TIKA-1575 -- Resolved by Konstantin Gribov (thank you!)
> > TIKA-1512 -- I'll put in a temporary fix so that we don't get IOOBEs, but
> > I'll leave this open and do some more digging to see if we need to open a
> > ticket at the POI level
> > TIKA-1511 -- I'll remove "provided" for xerial
> >
> > TIKA-1549 -- We should thank Toke Eskildsen in CHANGES.txt, no?
> >
> > I'll have these fixes completed by noon EDT. Should I run against
> > govdocs1 before or after the RC?
> >
> > My last build of Tika app (a few days ago) ballooned to ~43MB, and that's
> > before I add ~3MB for xerial. Tika server is now ~48MB. As of my last
> > build, we are still including ~4MB of pdfs (README.NLDAS1.pdf and
> > README.NLDAS2.pdf) from the GRIB(?) parser in the tika-app and tika-server
> > jars.
> >
> > Best,
> >
> > Tim
> >
> > -----Original Message-----
> > From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
> > Sent: Sunday, March 29, 2015 9:13 AM
> > To: dev@tika.apache.org
> > Subject: Re: [DISCUSS] Tika 1.8 or 1.7.1
> >
> > Once TIKA-1584 and TIKA-1575 are resolved, I'll work up an RC (unless
> > something else pops up).
> >
> > Thank you everyone.
> >
> > Tyler
> > On Mar 29, 2015 4:43 AM, "Hong-Thai Nguyen" <thaicha...@gmail.com> wrote:
> >
> > > +1 for 1.8
> > >
> > > Hong-Thai
> > >
> > > > On 28 Mar 2015, at 16:01, Tyler Palsulich <tpalsul...@apache.org> wrote:
> > > >
> > > > Hi Folks,
> > > >
> > > > Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to
> > > > release a new version of Tika. I'll volunteer to be the release manager
> > > > again.
> > > >
> > > > Should we release this as 1.8 or 1.7.1?
> > > >
> > > > Does anyone have any last minute issues they'd like to finish and see in
> > > > Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and
> > > > TIKA-1586). Any others?
> > > >
> > > > Have a good weekend,
> > > > Tyler