Please see comments below, and I'm positive this thread is nearly over :-)
On 22/09/17 22:49, Eugene Kirpichov wrote:
On Fri, Sep 22, 2017 at 2:20 PM Sergey Beryozkin <sberyoz...@gmail.com> wrote:

Hi,
On 22/09/17 22:02, Eugene Kirpichov wrote:
Sure - with hundreds of different file formats and the abundance of weird / malformed / malicious files in the wild, it's quite expected that sometimes the library will crash.

Some kinds of issues are easier to address than others. We can catch exceptions and return a ParseResult representing a failure to parse this document. Addressing freezes and native JVM process crashes is much harder and probably not necessary in the first version.
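For illustration, here is a minimal sketch of what such a result type could look like - hypothetical names, not a final API; Tika's Metadata is flattened into a Map so the bean stays serializable:

  import java.io.Serializable;
  import java.util.Collections;
  import java.util.Map;

  // Hypothetical container: one ParseResult per document, carrying the
  // filename, the extracted content, the Tika metadata, and the error
  // (if any), so a failed parse is data rather than a crashed job.
  public class ParseResult implements Serializable {
    private final String fileLocation;
    private final String content;             // extracted text; empty on failure
    private final Map<String, String> metadata;
    private final String error;               // null when parsing succeeded

    public static ParseResult success(
        String file, String content, Map<String, String> metadata) {
      return new ParseResult(file, content, metadata, null);
    }

    public static ParseResult failure(String file, Exception e) {
      return new ParseResult(file, "", Collections.emptyMap(), e.toString());
    }

    private ParseResult(
        String file, String content, Map<String, String> metadata, String error) {
      this.fileLocation = file;
      this.content = content;
      this.metadata = metadata;
      this.error = error;
    }

    public boolean isSuccess() { return error == null; }
    public String getFileLocation() { return fileLocation; }
    public String getContent() { return content; }
    public Map<String, String> getMetadata() { return metadata; }
  }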

Sergey - I think, the moment you introduce ParseResult into the code, other changes I suggested will follow "by construction":
- There'll be 1 ParseResult per document, containing filename, content and metadata, since per discussion above it probably doesn't make sense to deliver these in separate PCollection elements

I was still harboring the hope that maybe using a container bean like ParseResult (with the other changes you proposed) could somehow let us stream from Tika into the pipeline.

If it is 1 ParseResult per document, then the pipeline will not see a document until Tika has parsed all of it.

This is correct, and this is the API I'm suggesting to start with, because it's simple and sufficiently useful. I suggest getting into this state first, and then dealing with creating a separate API that avoids holding the entire parse result as a single PCollection element in memory. This should work fine for cases where each document's parse result (not the input document itself!) is up to a few hundred megabytes in size.
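As a rough sketch of that shape - assuming the hypothetical ParseResult bean above and a ParseWithTikaFn DoFn sketched later in this thread, none of which is final code:

  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.io.FileIO;
  import org.apache.beam.sdk.options.PipelineOptions;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;
  import org.apache.beam.sdk.transforms.ParDo;
  import org.apache.beam.sdk.values.PCollection;

  // One ParseResult per matched file, each held as a single element.
  PipelineOptions options = PipelineOptionsFactory.create();
  Pipeline p = Pipeline.create(options);
  PCollection<ParseResult> results = p
      .apply(FileIO.match().filepattern("/path/to/library/*"))
      .apply(FileIO.readMatches())
      .apply(ParDo.of(new ParseWithTikaFn()));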

+1. I was thinking about it yesterday evening and had to admit I had no real idea of what I wanted to achieve by streaming the document through the pipeline - partly because my Beam knowledge is still fairly limited, but also because I had difficulty coming up with concrete use cases.
So yes, let's make the 'mainstream' case work well first.


I'm sorry if I'm starting to go in circles, but let me ask this: how can a Beam user write a Beam function which ensures the Tika content pieces are seen in order by the pipeline, without TikaIO?

To answer this, I'd need you to clarify what you mean by "seen ordered by the pipeline" - order is a very vague term when it comes to parallel processing. What would you like the pipeline to compute that requires order within a document, but does NOT require having the contents of a document as a single String?
See above - I don't know :-). The case which I do like, and which I'll work on a demo for at a later stage in a dedicated branch, is what I described earlier: I would use, say, FileIO to get a list of thousands of matching PDFs, run them through Tika(IO), and have a function which outputs the list of matching PDFs (or other formats). Ex: someone needs to find all the Word docs in a given online library which talk about some event. I think it won't matter in this case whether the individual lines are ordered or not - we have a link to the file name and that's enough...
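A rough sketch of that demo, continuing from the earlier wiring sketch (the keyword "some event" and the ParseResult accessors are illustrative assumptions):

  import org.apache.beam.sdk.transforms.Filter;
  import org.apache.beam.sdk.transforms.MapElements;
  import org.apache.beam.sdk.values.PCollection;
  import org.apache.beam.sdk.values.TypeDescriptors;

  // Keep only successfully parsed documents that mention the event,
  // and output their file names - ordering of lines inside a document
  // never comes into play.
  PCollection<String> matchingDocs = results
      .apply(Filter.by((ParseResult r) ->
          r.isSuccess() && r.getContent().contains("some event")))
      .apply(MapElements
          .into(TypeDescriptors.strings())
          .via((ParseResult r) -> r.getFileLocation()));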

But I'll return to this favourite case of mine later :-)

Or are you asking simply how users can use Tika for arbitrary use cases without TikaIO?

Thinking about it later, what I was really interested in was whether it is important for any of Beam IO's consumers that the individual data chunks arrive ordered or not, and if it is, how that is achieved... Knowing that would help me/us consider what can possibly be done at a later stage.

If you'd prefer to talk about it later, that's OK...

Thanks for the help
Sergey



Maybe knowing that will help us come up with an idea of how to generalize somehow with the help of TikaIO?

- Since you're returning a single value per document, there's no reason to use a BoundedReader
- Likewise, there's no reason to use asynchronicity because you're not delivering the result incrementally

I'd suggest starting the refactoring by removing the asynchronous codepath, then converting from BoundedReader to ParDo or MapElements, then converting from String to ParseResult.
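For illustration, a sketch of what that ParDo could look like, using the hypothetical ParseResult bean from earlier - a sketch, not the final code:

  import java.io.InputStream;
  import java.nio.channels.Channels;
  import java.util.HashMap;
  import java.util.Map;
  import org.apache.beam.sdk.io.FileIO;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.AutoDetectParser;
  import org.apache.tika.parser.ParseContext;
  import org.apache.tika.sax.BodyContentHandler;

  // Synchronous parse, one ParseResult out per file; exceptions become
  // failure results instead of killing the bundle.
  class ParseWithTikaFn extends DoFn<FileIO.ReadableFile, ParseResult> {
    @ProcessElement
    public void process(ProcessContext c) {
      FileIO.ReadableFile file = c.element();
      String name = file.getMetadata().resourceId().toString();
      try (InputStream is = Channels.newInputStream(file.open())) {
        BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
        Metadata tikaMetadata = new Metadata();
        new AutoDetectParser().parse(is, handler, tikaMetadata, new ParseContext());
        Map<String, String> metadata = new HashMap<>();
        for (String n : tikaMetadata.names()) {
          metadata.put(n, tikaMetadata.get(n));
        }
        c.output(ParseResult.success(name, handler.toString(), metadata));
      } catch (Exception e) {
        c.output(ParseResult.failure(name, e));
      }
    }
  }
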
This is a good plan, thanks. I guess at least for small documents it should work well (unless I've misunderstood the ParseResult idea).

Thanks, Sergey

On Fri, Sep 22, 2017 at 12:10 PM Sergey Beryozkin <sberyoz...@gmail.com> wrote:

Hi Tim, All
On 22/09/17 18:17, Allison, Timothy B. wrote:
Y, I think you have it right.

Tika library has a big problem with crashes and freezes

I wouldn't want to overstate it.  Crashes and freezes are exceedingly rare, but when you are processing millions/billions of files in the wild [1], they will happen.  We fix the problems, or try to get our dependencies to fix the problems, when we can,

I would only like to add that IMHO it would be more correct to say it's not the Tika library's 'fault' that the crashes might occur. Tika does its best to pick up the latest libraries that help it parse the files, but indeed there will always be some file out there that might use some incomplete format-specific tag etc. which may cause the specific parser to spin - but Tika will include the updated parser library ASAP.

And with Beam's help, the crashes that can kill the Tika jobs completely will probably become history...

Cheers, Sergey
but given our past history, I have no reason to believe that these problems won't happen again.

Thank you, again!

Best,

               Tim

[1] Stuff on the internet or ... some of our users are forensics examiners dealing with broken/corrupted files

P.S./FTR  😊
1) We've gathered a TB of data from CommonCrawl and we run regression tests against this TB (thank you, Rackspace, for hosting our vm!) to try to identify these problems.
2) We've started a fuzzing effort to try to identify problems.
3) We added "tika-batch" for robust single box fileshare/fileshare processing for our low volume users.
4) We're trying to get the message out.  Thank you for working with us!!!

-----Original Message-----
From: Eugene Kirpichov [mailto:kirpic...@google.com.INVALID]
Sent: Friday, September 22, 2017 12:48 PM
To: dev@beam.apache.org
Cc: d...@tika.apache.org
Subject: Re: TikaIO concerns

Hi Tim,
From what you're saying it sounds like the Tika library has a big problem with crashes and freezes, and applying it at scale (e.g. in the context of Beam) requires explicitly addressing this problem, e.g. accepting the fact that in many realistic applications some documents will just need to be skipped because they are unprocessable? This would be the first example of a Beam IO that has this concern, so I'd like to confirm that my understanding is correct.

On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. <talli...@mitre.org> wrote:

Reuven,

Thank you!  This suggests to me that it is a good idea to integrate Tika with Beam so that people don't have to 1) (re)discover the need to make their wrappers robust and then 2) have to reinvent these wheels for robustness.

For kicks, see William Palmer's post on his toe-stubbing efforts with Hadoop [1].  He and other Tika users have independently wound up carrying out exactly your recommendation for 1) below.

We have a MockParser that you can use to simulate regular exceptions, OOMs and permanent hangs by asking Tika to parse a <mock> xml [2].
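For anyone curious, those mock documents look roughly like this - the element names below are taken from the linked example.xml [2], which is the authoritative reference:

  <!-- A sketch of a mock document; see example.xml [2] for the full set
       of supported commands and attributes. -->
  <mock>
    <write element="p">some normal content</write>
    <throw class="java.io.IOException">simulated parse failure</throw>
    <hang millis="30000" heavy="false" interruptible="false"/>
    <oom/>
  </mock>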

However if processing the document causes the process to crash, then it will be retried.
Any ideas on how to get around this?

Thank you again.

Cheers,

              Tim

[1]
http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
[2]
https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml




