Regarding failing elements specifically: I believe some other IOs have used the concept of a "Dead Letter" side output, where documents that fail to process are emitted to a separate output so the user can handle them appropriately.
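
To make the idea concrete, here is a rough sketch in plain Python. It is not Beam API and not how any existing IO implements it; `process_with_dead_letter` and `toy_parse` are hypothetical names I made up for illustration. In a real pipeline the two returned lists would be two outputs of one transform.

```python
from typing import Callable, Iterable, List, Tuple


def process_with_dead_letter(
    docs: Iterable[str],
    parse: Callable[[str], str],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Split inputs into a main output and a dead-letter output.

    Hypothetical helper for illustration only; in a pipeline the two lists
    would be separate outputs of a single transform.
    """
    parsed: List[str] = []
    dead_letters: List[Tuple[str, str]] = []
    for doc in docs:
        try:
            parsed.append(parse(doc))
        except Exception as exc:  # ordinary failures only, not crashes or hangs
            # Keep the failing input plus the reason so the user can decide
            # whether to log, retry, or drop it.
            dead_letters.append((doc, repr(exc)))
    return parsed, dead_letters


def toy_parse(doc: str) -> str:
    # Stand-in for a real parser: rejects empty documents.
    if not doc:
        raise ValueError("empty document")
    return doc.upper()


good, bad = process_with_dead_letter(["a", "", "b"], toy_parse)
```

Note that this only covers documents that fail with an ordinary exception. A document that crashes or hangs the whole process never reaches the `except` branch, so that case would still need some separate form of isolation.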

On Fri, Sep 22, 2017 at 9:47 AM Eugene Kirpichov <kirpic...@google.com.invalid> wrote:

> Hi Tim,
> From what you're saying it sounds like the Tika library has a big problem
> with crashes and freezes, and when applying it at scale (eg. in the context
> of Beam) requires explicitly addressing this problem, eg. accepting the
> fact that in many realistic applications some documents will just need to
> be skipped because they are unprocessable? This would be first example of a
> Beam IO that has this concern, so I'd like to confirm that my understanding
> is correct.
>
> On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. <talli...@mitre.org>
> wrote:
>
> > Reuven,
> >
> > Thank you! This suggests to me that it is a good idea to integrate Tika
> > with Beam so that people don't have to 1) (re)discover the need to make
> > their wrappers robust and then 2) have to reinvent these wheels for
> > robustness.
> >
> > For kicks, see William Palmer's post on his toe-stubbing efforts with
> > Hadoop [1]. He and other Tika users independently have wound up carrying
> > out exactly your recommendation for 1) below.
> >
> > We have a MockParser that you can get to simulate regular exceptions, OOMs
> > and permanent hangs by asking Tika to parse a <mock> xml [2].
> >
> > > However if processing the document causes the process to crash, then it
> > > will be retried.
> > Any ideas on how to get around this?
> >
> > Thank you again.
> >
> > Cheers,
> >
> > Tim
> >
> > [1] http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
> > [2] https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml