I think this is an open question within Tika.  Some parsers prefer one thing 
over another.  And there are different levels of corruption.

In the two cases where govdocs1 docs might be useful in tests, the hyperlinks 
in .doc files do not appear to be "standard", but MSWord opens them without a 
problem.  In cases where an application can open and correctly process the 
content, I think we ought to try to extract content without throwing exceptions.

-----Original Message-----
From: Tyler Palsulich [mailto:tpalsul...@gmail.com] 
Sent: Monday, March 30, 2015 9:39 AM
To: dev@tika.apache.org
Subject: RE: including refactored docs from govdocs1 in test suite

Ah. I see.

In general, what is the goal with handling corrupted files? Extract as much
as possible and fail gracefully?

Tyler

On Mar 30, 2015 9:32 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:
>
> Unfortunately, no.  MSOffice fixes the document when I do that.
>
> -----Original Message-----
> From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
> Sent: Monday, March 30, 2015 9:24 AM
> To: dev@tika.apache.org
> Subject: Re: including refactored docs from govdocs1 in test suite
>
> Can you copy the hyperlink into a new doc and change the URL? I have no
> idea about including the modified version.
>
> Tyler
> On Mar 30, 2015 9:18 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:
>
> > All,
> >
> >   As part of TIKA-1512, I found that I can delete all of the contents,
> > including the metadata, except for one hyperlink in two documents from
> > govdocs1 and still get the proper behavior -- fail before fix, work after
> > fix.
> >
> >   These documents are in the public domain.
> >
> >   Is it ok to include these modified documents in our test suite or should
> > I avoid inclusion?
> >
> >   Happy to avoid inclusion for the sake of a quick release of 1.8 and then
> > we have time to discuss/determine way ahead... unless the answer is obvious.
> >
> >          Best,
> >
> >                      Tim
> >
> > -----Original Message-----
> > From: Allison, Timothy B. [mailto:talli...@mitre.org]
> > Sent: Monday, March 30, 2015 7:03 AM
> > To: dev@tika.apache.org
> > Subject: RE: [DISCUSS] Tika 1.8 or 1.7.1
> >
> > Unless there are objections, I'd like these to be resolved before 1.8:
> >
> > TIKA-1584 -- I'll fix
> > TIKA-1575 -- Resolved by Konstantin Gribov (thank you!)
> > TIKA-1512 -- I'll put in a temporary fix so that we don't get IOOBEs, but
> > I'll leave this open and do some more digging to see if we need to open a
> > ticket at the POI level
> > TIKA-1511 -- I'll remove "provided" for xerial
> >
> > TIKA-1549 -- We should thank Toke Eskildsen in CHANGES.txt, no?
> >
> > I'll have these fixes completed by noon EDT.  Should I run against
> > govdocs1 before or after the RC?
> >
> > My last build of Tika app (a few days ago) ballooned to ~43MB, and that's
> > before I add ~3MB for xerial.  Tika server is now ~48MB.  As of my last
> > build, we are still including ~4MB of pdfs (README.NLDAS1.pdf and
> > README.NLDAS2.pdf) from the GRIB(?) parser in the tika-app and tika-server
> > jars.
> >
> > Best,
> >
> >               Tim
> >
> >
> >
> > -----Original Message-----
> > From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
> > Sent: Sunday, March 29, 2015 9:13 AM
> > To: dev@tika.apache.org
> > Subject: Re: [DISCUSS] Tika 1.8 or 1.7.1
> >
> > Once TIKA-1584 and TIKA-1575 are resolved, I'll work up an RC (unless
> > something else pops up).
> >
> > Thank you everyone.
> >
> > Tyler
> > On Mar 29, 2015 4:43 AM, "Hong-Thai Nguyen" <thaicha...@gmail.com> wrote:
> >
> > > +1 for 1.8
> > >
> > > Hong-Thai
> > >
> > > > On 28 Mar 2015, at 16:01, Tyler Palsulich <tpalsul...@apache.org> wrote:
> > > >
> > > > Hi Folks,
> > > >
> > > > Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to
> > > > release a new version of Tika. I'll volunteer to be the release manager
> > > > again.
> > > >
> > > > Should we release this as 1.8 or 1.7.1?
> > > >
> > > > Does anyone have any last minute issues they'd like to finish and see in
> > > > Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and
> > > > TIKA-1586). Any others?
> > > >
> > > > Have a good weekend,
> > > > Tyler
> > >
> >
