On Tue, Sep 19, 2017 at 5:13 AM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Hi Sergey,
>
> as we discussed together during the review, I fully understand the choices
> you made.
>
> Your plan sounds reasonable. Thanks !
>
> Generally speaking, in order to give visibility and encourage contribution, I
> think it would make sense to accept a PR if it's basically right (even if it's
> not yet perfect) and doesn't break the build.
>
This is a wider discussion than the current thread, but I don't think I
agree with this approach.

We have followed a much stricter standard in the past, and thanks to that,
Beam currently has (in my opinion) an extremely high-quality library of
IOs, and Beam can take pride in not being one of "those" open-source
projects that advertise everything but guarantee nothing and are
frustrating to work with, because everything is slightly broken in some way
or another.

I can recall at most 1 or 2 cases where a contributor gave up on a PR due to the
number of issues pointed out during review - and in those cases, the PR was in a
state where Beam would not have benefited from merging the issue-ridden code
anyway. Basically, in every case I've seen so far, a thorough review has been a
good idea in retrospect.

There may be trivial fixups best done by a committer rather than the author
(e.g. javadoc typos), but I think nontrivial, high-level issues should be
reviewed rigorously.

People trust Beam (especially Beam IOs) with their data, and at least the
correctness-critical stuff *must* be done right. Beam also generally
promises a stable API, so API mistakes are forever and cannot be fixed
iteratively [this can be addressed by marking in-progress work as
@Experimental] - so APIs must be done right as well. On the other hand,
performance, documentation, lack of support for certain features etc. can
be fixed iteratively - I agree that we shouldn't push too hard on that
during review.
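
For illustration, here is a minimal sketch of what that marking looks like (the
IO class below is made up, not a real Beam transform):

  import org.apache.beam.sdk.annotations.Experimental;
  import org.apache.beam.sdk.transforms.PTransform;
  import org.apache.beam.sdk.values.PBegin;
  import org.apache.beam.sdk.values.PCollection;

  // @Experimental signals that this API may still change incompatibly
  // between releases, so it is not covered by the stability promise yet.
  @Experimental
  public class SomeNewIO extends PTransform<PBegin, PCollection<String>> {
    @Override
    public PCollection<String> expand(PBegin input) {
      throw new UnsupportedOperationException("sketch only");
    }
  }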

There's also the mentorship aspect: I think it is valuable for new Beam
contributors to get thorough review, especially for their first
contributions, as a kick-start to learning the best practices - they are
going to need them repeatedly in their future contributions. Merging code
without sufficient review gives them the immediate gratification of "having
contributed", but denies the mentorship. Moreover, most contributions are
made by a relatively small number of prolific "serial contributors" (you
being a prime example!) who are responsive to feedback and eager to learn,
so the immediate gratification I think is not very important.

I think the best way to handle code reviews for Beam is to give it our best
as reviewers, especially for first-time contributors; and if it feels like
the amount of required changes is too large for the contributor to handle,
then work with them to prioritize the changes, or start small and decompose
the contribution into more manageable pieces, but each merged piece must be
high-quality.


> I would be happy to help on TikaIO as I did during the first review round
> ;)
>
> Regards
> JB
>
> On 09/19/2017 12:41 PM, Sergey Beryozkin wrote:
> > Hi All
> >
> > This is my first post to the dev list. I work for Talend, I'm a Beam novice
> > and an Apache Tika fan, and I thought it would be really great to try and link
> > both projects together, which led me to opening [1] where I typed some early
> > thoughts, followed by PR [2].
> >
> > I noticed yesterday I had the robust :-) (but useful and helpful) newer review
> > comments from Eugene pending, so I'd like to summarize a bit why I did TikaIO
> > (reader) the way I did, and then decide, based on the feedback from the
> > experts, what to do next.
> >
> > Apache Tika parsers report the text content in chunks, via SAX parser events.
> > It's not possible with Tika to take a file and read it bit by bit at the
> > 'initiative' of the Beam reader, line by line; the only way is to handle the
> > SAX parser callbacks which report the data chunks. Some parsers may report
> > complete lines, some individual words, and some are able to report the data
> > only after they have completely parsed the document. It all depends on the
> > data format.
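> >
> > For reference, here is roughly what handling those callbacks looks like (just
> > a sketch, not the actual TikaIO code; the file name is made up):
> >
> >   import java.io.InputStream;
> >   import java.nio.file.Files;
> >   import java.nio.file.Paths;
> >   import org.apache.tika.metadata.Metadata;
> >   import org.apache.tika.parser.AutoDetectParser;
> >   import org.apache.tika.parser.ParseContext;
> >   import org.xml.sax.helpers.DefaultHandler;
> >
> >   public class TikaCallbackSketch {
> >     public static void main(String[] args) throws Exception {
> >       try (InputStream stream = Files.newInputStream(Paths.get("example.pdf"))) {
> >         new AutoDetectParser().parse(
> >             stream,
> >             new DefaultHandler() {
> >               @Override
> >               public void characters(char[] ch, int start, int length) {
> >                 // Tika decides when this fires and how big each chunk is;
> >                 // the caller can only react to the callbacks.
> >                 System.out.println("chunk: " + new String(ch, start, length));
> >               }
> >             },
> >             new Metadata(),
> >             new ParseContext());
> >       }
> >     }
> >   }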
> >
> > At the moment TikaIO's TikaReader does not use the Beam threads to parse the
> > files; the Beam threads only collect the data from the internal queue into
> > which the internal TikaReader thread puts the data (note the data chunks are
> > ordered, even though the tests might suggest otherwise).
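> >
> > The current shape of this is roughly the following (class and method names are
> > illustrative, not the real TikaIO ones):
> >
> >   import java.util.concurrent.LinkedBlockingQueue;
> >   import java.util.concurrent.TimeUnit;
> >
> >   class QueueingReaderSketch {
> >     private final LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();
> >
> >     void start() {
> >       // Internal thread: runs the Tika parse and pushes each SAX chunk onto the queue.
> >       new Thread(() -> parseWithTikaInto(queue)).start();
> >     }
> >
> >     String advance() throws InterruptedException {
> >       // Beam's reader thread never parses; it only polls the queue for the next chunk.
> >       return queue.poll(50, TimeUnit.MILLISECONDS); // null if nothing arrived in time
> >     }
> >
> >     private void parseWithTikaInto(LinkedBlockingQueue<String> out) {
> >       // parser.parse(stream, handlerThatForwardsChunksTo(out::add), metadata, context);
> >     }
> >   }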
> >
> > The reason I did it this way was because I thought:
> >
> > 1) it would make the individual data chunks available to the pipeline faster -
> > the parser will continue working through the binary/video etc. file while the
> > data will already start flowing. I agree there should be some test data
> > available confirming it, but I'm fairly confident this approach can yield some
> > performance gains with large data sets. If the file is large, or if it has
> > embedded attachments/videos to deal with, then it may be more effective not to
> > have the Beam thread deal with it...
> >
> > 2) as I commented at the end of [2], having an option to concatenate the data
> > chunks before making them available to the pipeline is useful, and I guess
> > doing the same in a ParDo would introduce some synchronization issues (though
> > I'm not exactly sure yet).
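> >
> > For the concatenation option, one way to get the whole document as a single
> > string is Tika's own BodyContentHandler (again just a sketch of the idea, not
> > necessarily how the TikaIO option would be implemented):
> >
> >   import java.io.InputStream;
> >   import java.nio.file.Files;
> >   import java.nio.file.Paths;
> >   import org.apache.tika.metadata.Metadata;
> >   import org.apache.tika.parser.AutoDetectParser;
> >   import org.apache.tika.parser.ParseContext;
> >   import org.apache.tika.sax.BodyContentHandler;
> >
> >   public class ConcatenatedTextSketch {
> >     public static void main(String[] args) throws Exception {
> >       BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
> >       try (InputStream stream = Files.newInputStream(Paths.get("example.pdf"))) {
> >         new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
> >       }
> >       // All the SAX chunks end up buffered and concatenated into one string.
> >       System.out.println(handler.toString());
> >     }
> >   }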
> >
> > One of the valid concerns there is that the reader is polling the internal
> > queue, so, in theory at least, and perhaps in some rare cases too, we may have
> > a case where the max polling time has been reached while the parser is still
> > busy, and TikaIO fails to report all the file data. I think it can be solved by
> > either 2a) configuring the max polling time to a very large number which will
> > never be reached in any practical case, or 2b) simply using a blocking queue
> > without any time limits - in the worst case, if the Tika parser spins and fails
> > to report the end of the document, then Beam can heal itself if the pipeline
> > blocks.
> > I propose to follow 2b).
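> >
> > Something along these lines for 2b) (a sketch with made-up names, using a
> > sentinel object to mark the end of the document):
> >
> >   import java.util.concurrent.LinkedBlockingQueue;
> >
> >   class BlockingReaderSketch {
> >     private static final String END_OF_DOCUMENT = new String(); // sentinel, compared by identity
> >     private final LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();
> >
> >     void onChunk(String chunk) {
> >       queue.add(chunk); // called from the parser thread for every SAX chunk
> >     }
> >
> >     void onParseFinished() {
> >       queue.add(END_OF_DOCUMENT); // parser thread signals there is no more data
> >     }
> >
> >     String nextChunk() throws InterruptedException {
> >       String chunk = queue.take(); // blocks, with no time limit, until the parser produces something
> >       return chunk == END_OF_DOCUMENT ? null : chunk; // null means the document is done
> >     }
> >   }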
> >
> >
> > Please let me know what you think.
> > My plan so far is:
> > 1) start addressing most of Eugene's comments which would require some minor
> > TikaIO updates
> > 2) at the next stage, work on removing the TikaSource internal code dealing
> > with file patterns which I copied from TextIO
> > 3) if needed, mark TikaIO Experimental to give Tika and Beam users some time to
> > try it with some real, complex files, and also decide whether TikaIO can
> > continue to be implemented as a BoundedSource/Reader or not
> >
> > Eugene, all, will it work if I start with 1) ?
> >
> > Thanks, Sergey
> >
> > [1] https://issues.apache.org/jira/browse/BEAM-2328
> > [2] https://github.com/apache/beam/pull/3378
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
