Hi Tim,
From what you're saying, it sounds like the Tika library has a big problem
with crashes and freezes, and that applying it at scale (e.g. in the
context of Beam) requires explicitly addressing this problem, e.g. by
accepting that in many realistic applications some documents will just
need to be skipped because they are unprocessable? This would be the first
example of a Beam IO that has this concern, so I'd like to confirm that my
understanding is correct.

On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. <[email protected]>
wrote:

> Reuven,
>
> Thank you!  This suggests to me that it is a good idea to integrate Tika
> with Beam so that people don't have to 1) (re)discover the need to make
> their wrappers robust and then 2) have to reinvent these wheels for
> robustness.
>
> For kicks, see William Palmer's post on his toe-stubbing efforts with
> Hadoop [1].  He and other Tika users independently have wound up carrying
> out exactly your recommendation for 1) below.
>
> We have a MockParser that you can get to simulate regular exceptions, OOMs
> and permanent hangs by asking Tika to parse a <mock> xml [2].
>
> > However if processing the document causes the process to crash, then it
> > will be retried.
> Any ideas on how to get around this?
>
> Thank you again.
>
> Cheers,
>
>            Tim
>
> [1]
> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
> [2]
> https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml
>
