Do tell...

Interesting.  Any pointers?

-----Original Message-----
From: Ben Chambers [mailto:bchamb...@google.com.INVALID] 
Sent: Friday, September 22, 2017 12:50 PM
To: dev@beam.apache.org
Cc: d...@tika.apache.org
Subject: Re: TikaIO concerns

Regarding specifically elements that are failing -- I believe some other IO has 
used the concept of a "Dead Letter" side-output, where documents that failed 
to process are side-output so the user can handle them appropriately.
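
The dead-letter idea above can be sketched in plain Python as follows. This is 
only an illustration under stated assumptions, not Beam's or Tika's actual API: 
`parse_with_dead_letter` and `toy_parse` are hypothetical names, and the toy 
parser stands in for a real one. It only covers ordinary exceptions; hangs and 
process crashes would need isolation or timeouts on top of this.

```python
from typing import Callable, Iterable, List, Tuple


def parse_with_dead_letter(
    docs: Iterable[Tuple[str, bytes]],
    parse: Callable[[bytes], str],
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Route each document to a main output or a dead-letter output."""
    parsed = []        # (doc_id, extracted_text) for documents that parsed
    dead_letters = []  # (doc_id, error description) for documents that failed
    for doc_id, payload in docs:
        try:
            parsed.append((doc_id, parse(payload)))
        except Exception as exc:
            # Catches ordinary exceptions only, not hangs or process crashes.
            dead_letters.append((doc_id, f"{type(exc).__name__}: {exc}"))
    return parsed, dead_letters


def toy_parse(payload: bytes) -> str:
    """Hypothetical stand-in for a real parser; fails on non-UTF-8 input."""
    return payload.decode("utf-8").strip()


docs = [("a.txt", b" hello "), ("b.bin", b"\xff\xfe"), ("c.txt", b"world")]
parsed, dead_letters = parse_with_dead_letter(docs, toy_parse)
```

Here the failing document ends up in `dead_letters` with its error, so the 
caller can log, inspect, or reprocess it instead of failing the whole batch.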

On Fri, Sep 22, 2017 at 9:47 AM Eugene Kirpichov <kirpic...@google.com.invalid> 
wrote:

> Hi Tim,
> From what you're saying it sounds like the Tika library has a big 
> problem with crashes and freezes, and that applying it at scale (e.g. 
> in the context of Beam) requires explicitly addressing this problem, 
> e.g. accepting the fact that in many realistic applications some 
> documents will just need to be skipped because they are unprocessable? 
> This would be the first example of a Beam IO that has this concern, so 
> I'd like to confirm that my understanding is correct.
>
> On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. 
> <talli...@mitre.org>
> wrote:
>
> > Reuven,
> >
> > Thank you!  This suggests to me that it is a good idea to integrate 
> > Tika with Beam so that people don't have to 1) (re)discover the need 
> > to make their wrappers robust and then 2) have to reinvent these 
> > wheels for robustness.
> >
> > For kicks, see William Palmer's post on his toe-stubbing efforts 
> > with Hadoop [1].  He and other Tika users independently have wound 
> > up carrying out exactly your recommendation for 1) below.
> >
> > We have a MockParser that you can get to simulate regular 
> > exceptions, OOMs and permanent hangs by asking Tika to parse a 
> > <mock> xml [2].
> >
> > > However if processing the document causes the process to crash, 
> > > then it will be retried.
> > Any ideas on how to get around this?
> >
> > Thank you again.
> >
> > Cheers,
> >
> >            Tim
> >
> > [1] http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
> > [2] https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml
> >
>
