At the very least, a parser should not hang while processing a corrupted
document. IMHO, any case where parser code hangs should be considered a
blocker issue.

Personally, I prefer the variant that returns a partial result along with
some metadata saying that parsing of the document failed somehow. But that
can be hard to do.
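
A minimal sketch of the "partial result plus metadata" idea, in Python for
brevity. It is not Tika's API: ParseResult, parse_leniently, and the
"parse-status" / "parse-error" keys are hypothetical names used only for
illustration, and the "document" is simply a list of byte chunks.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class ParseResult:
    # Hypothetical container: text extracted before any failure, plus metadata.
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def parse_leniently(chunks: Iterable[bytes],
                    decode: Callable[[bytes], str]) -> ParseResult:
    """Decode chunks in order; on the first failure, stop and flag it."""
    extracted: List[str] = []
    metadata: Dict[str, str] = {"parse-status": "ok"}
    for index, chunk in enumerate(chunks):
        try:
            extracted.append(decode(chunk))
        except Exception as exc:
            # Corrupted chunk: keep what was extracted so far and record why
            # parsing stopped, instead of propagating the exception.
            metadata["parse-status"] = "failed"
            metadata["parse-error"] = f"chunk {index}: {type(exc).__name__}"
            break
    return ParseResult("".join(extracted), metadata)


# Usage: the second chunk is not valid UTF-8, so parsing stops there.
result = parse_leniently([b"Hello, ", b"\xff\xfe", b"world"],
                         lambda b: b.decode("utf-8"))
```

The caller gets the text extracted before the failure and can check the
metadata to tell a clean parse from a truncated one. The sketch does not
address hangs; that would need a separate time limit around the parse.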

-- 
Best regards,
Konstantin Gribov

Mon, 30 Mar 2015 at 16:52, Allison, Timothy B. <talli...@mitre.org>:

> I think this is an open question within Tika.  Some parsers prefer one
> thing over another.  And there are different levels of corruption.
>
> In the two cases where govdocs1 docs might be useful in tests, the
> hyperlinks in .doc files do not appear to be "standard", but MSWord opens
> them without a problem.  In cases where an application can open and
> correctly process the content, I think we ought to try to extract content
> without throwing exceptions.
>
> -----Original Message-----
> From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
> Sent: Monday, March 30, 2015 9:39 AM
> To: dev@tika.apache.org
> Subject: RE: including refactored docs from govdocs1 in test suite
>
> Ah. I see.
>
> In general, what is the goal with handling corrupted files? Extract as much
> as possible and fail gracefully?
>
> Tyler
>
> On Mar 30, 2015 9:32 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:
> >
> > Unfortunately, no.  MSOffice fixes the document when I do that.
> >
> > -----Original Message-----
> > From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
> > Sent: Monday, March 30, 2015 9:24 AM
> > To: dev@tika.apache.org
> > Subject: Re: including refactored docs from govdocs1 in test suite
> >
> > Can you copy the hyperlink into a new doc and change the URL? I have no
> > idea about including the modified version.
> >
> > Tyler
> > On Mar 30, 2015 9:18 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:
> >
> > > All,
> > >
> > >   As part of TIKA-1512, I found that I can delete all of the contents,
> > > including the metadata, except for one hyperlink in two documents from
> > > govdocs1 and still get the proper behavior -- fail before fix, work after
> > > fix.
> > >
> > >   These documents are in the public domain.
> > >
> > >   Is it ok to include these modified documents in our test suite or should
> > > I avoid inclusion?
> > >
> > >   Happy to avoid inclusion for the sake of a quick release of 1.8 and then
> > > we have time to discuss/determine way ahead... unless the answer is obvious.
> > >
> > >          Best,
> > >
> > >                      Tim
> > >
> > > -----Original Message-----
> > > From: Allison, Timothy B. [mailto:talli...@mitre.org]
> > > Sent: Monday, March 30, 2015 7:03 AM
> > > To: dev@tika.apache.org
> > > Subject: RE: [DISCUSS] Tika 1.8 or 1.7.1
> > >
> > > Unless there are objections, I'd like these to be resolved before 1.8:
> > >
> > > TIKA-1584 -- I'll fix
> > > TIKA-1575 -- Resolved by Konstantin Gribov (thank you!)
> > > TIKA-1512 -- I'll put in a temporary fix so that we don't get IOOBEs, but
> > > I'll leave this open and do some more digging to see if we need to open a
> > > ticket at the POI level
> > > TIKA-1511 -- I'll remove "provided" for xerial
> > >
> > > TIKA-1549 -- We should thank Toke Eskildsen in CHANGES.txt, no?
> > >
> > > I'll have these fixes completed by noon EDT.  Should I run against
> > > govdocs1 before or after the RC?
> > >
> > > My last build of Tika app (a few days ago) ballooned to ~43MB, and that's
> > > before I add ~3MB for xerial.  Tika server is now ~48MB.  As of my last
> > > build, we are still including ~4MB of pdfs (README.NLDAS1.pdf and
> > > README.NLDAS2.pdf) from the GRIB(?) parser in the tika-app and tika-server
> > > jars.
> > >
> > > Best,
> > >
> > >               Tim
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
> > > Sent: Sunday, March 29, 2015 9:13 AM
> > > To: dev@tika.apache.org
> > > Subject: Re: [DISCUSS] Tika 1.8 or 1.7.1
> > >
> > > Once TIKA-1584 and TIKA-1575 are resolved, I'll work up an RC (unless
> > > something else pops up).
> > >
> > > Thank you everyone.
> > >
> > > Tyler
> > > On Mar 29, 2015 4:43 AM, "Hong-Thai Nguyen" <thaicha...@gmail.com> wrote:
> > >
> > > > +1 for 1.8
> > > >
> > > > Hong-Thai
> > > >
> > > > > On 28 Mar 2015, at 16:01, Tyler Palsulich <tpalsul...@apache.org> wrote:
> > > > >
> > > > > Hi Folks,
> > > > >
> > > > > Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to
> > > > > release a new version of Tika. I'll volunteer to be the release manager
> > > > > again.
> > > > >
> > > > > Should we release this as 1.8 or 1.7.1?
> > > > >
> > > > > Does anyone have any last minute issues they'd like to finish and see in
> > > > > Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and
> > > > > TIKA-1586). Any others?
> > > > >
> > > > > Have a good weekend,
> > > > > Tyler
> > > >
> > >
>
