I am not deep on the details myself, but I have reviewed various Avro upgrade
changes such as https://github.com/apache/beam/pull/9779 and also some
internal ones that I cannot link to. I believe the changes are small, and quite
possibly we can create sdks/java/extensions/avro that works with both Avro
1.8 and 1.9 and make the Dataflow worker compatible with whichever version the
user chooses. (I would expect Spark is trying to get to that point too?)
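
To make concrete what "works with both" could mean, here is a minimal sketch
of the kind of version-agnostic code such a module would have to stick to: it
only uses Schema.Parser and the generic datum reader/writer, which behave the
same on 1.8 and 1.9, and avoids the Jackson-1-based accessors that 1.9
removed. The schema and file path are made up for the example.

    import java.io.File;
    import java.io.IOException;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.generic.GenericRecordBuilder;

    public class AvroCompatSketch {
      // Parsed the same way on Avro 1.8 and 1.9.
      private static final Schema SCHEMA = new Schema.Parser().parse(
          "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
              + "{\"name\":\"name\",\"type\":\"string\"},"
              + "{\"name\":\"age\",\"type\":\"int\"}]}");

      public static void main(String[] args) throws IOException {
        File file = new File("/tmp/users.avro");  // hypothetical path

        // Write one record with the generic API (identical across versions).
        GenericRecord user = new GenericRecordBuilder(SCHEMA)
            .set("name", "test").set("age", 42).build();
        try (DataFileWriter<GenericRecord> writer =
            new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(SCHEMA))) {
          writer.create(SCHEMA, file);
          writer.append(user);
        }

        // Read it back; DataFileReader is Iterable on both versions.
        try (DataFileReader<GenericRecord> reader =
            new DataFileReader<>(file, new GenericDatumReader<GenericRecord>(SCHEMA))) {
          for (GenericRecord record : reader) {
            System.out.println(record);
          }
        }
      }
    }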

So then, if we have that, can we achieve the goals? Spark runner users that
do not use Avro in their own code get Spark's version, Spark runner users
that do use Avro have to choose anyhow, and we make the Dataflow worker work
with both 1.8 and 1.9.

We can probably achieve the same goals by just making the core compatible
with both 1.8 and 1.9. Users who don't want the dep can also exclude it at the
build level (sketch below). It doesn't have a bunch of transitive deps,
though, so there isn't a lot of value in excluding it. So core vs. extensions
is more of a clean-engineering thing, but compatibility with 1.9 is a
user-driven thing.
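
For reference, such an exclusion in a user's pom would look roughly like this
(the Beam version here is just a placeholder):

    <dependency>
      <groupId>org.apache.beam</groupId>
      <artifactId>beam-sdks-java-core</artifactId>
      <version>2.25.0</version>
      <exclusions>
        <exclusion>
          <groupId>org.apache.avro</groupId>
          <artifactId>avro</artifactId>
        </exclusion>
      </exclusions>
    </dependency>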

Kenn

On Fri, Sep 11, 2020 at 10:49 AM Ismaël Mejía <ieme...@gmail.com> wrote:

> > The concern here is that Avro 1.9 is not backwards compatible with Avro
> 1.8, so the future world would not be a simple "bring your own avro"
> but might require separate dataflow-with-avro-1.8 and
> dataflow-with-avro-1.9 targets which certainly isn't scalable. (Or am I
> mistaken here? Maybe we could solve this with vending?)
>
> Thinking a bit about it, this looks similar to what I mentioned with the
> Spark runner, except that we cannot control those targets, which is why I
> talked about source code compatibility.
> Avro is really hard to shade correctly because of the way its code
> generation works; otherwise shading would have been the best solution.
>
> On Fri, Sep 11, 2020 at 7:28 PM Robert Bradshaw <rober...@google.com>
> wrote:
> >
> > On Fri, Sep 11, 2020 at 10:05 AM Kenneth Knowles <k...@apache.org>
> wrote:
> >>
> >> Top-post: I'm generally in favor of moving Avro out of core
> specifically because it is something where different users (and dep chains)
> want different versions. The pain caused by having it in core has come up a
> lot to me. I don't think backwards-compatibility absolutism helps our users
> in this case. I do think gradual migration to ease pain is important.
> >
> >
> > Agree. Backwards compatibility is not the absolute goal; whatever is
> best for existing and new users is what we should go for. That being said,
> this whole issue is caused by one of our dependencies not being backwards
> compatible itself...
> >
> >>
> >> On Fri, Sep 11, 2020 at 9:30 AM Robert Bradshaw <rober...@google.com>
> wrote:
> >>>
> >>> On Thu, Sep 10, 2020 at 2:48 PM Brian Hulette <bhule...@google.com>
> wrote:
> >>>>
> >>>>
> >>>> On Tue, Sep 8, 2020 at 9:18 AM Robert Bradshaw <rober...@google.com>
> wrote:
> >>>>>
> >>>>> IIRC Dataflow (and perhaps others) implicitly depend on Avro to write
> >>>>> out intermediate files (e.g. for non-shuffle Fusion breaks). Would
> >>>>> this break if we just removed it?
> >>>>
> >>>>
> >>>> I think Dataflow would just need to declare a dependency on the new
> extension.
> >>>
> >>>
> >>> I'm not sure this would solve the underlying problem (it just pushes
> it onto users and makes it more obscure). Maybe my reasoning is incorrect,
> but from what I see
> >>>
> >>> * Many Beam modules (e.g. dataflow, spark, file-based-io, sql, kafka,
> parquet, ...) depend on Avro.
> >>> * Using Avro 1.9 with the above modules doesn't work.
> >>
> >>
> >> I suggest taking these on case-by-case.
> >>
> >>  - Dataflow: implementation detail, probably not a major problem (we
> can just upgrade the pre-portability worker while for portability it is a
> non-issue)
> >>  - Spark: probably need to use whatever version of Avro works for each
> version of Spark (portability mitigates)
> >>  - SQL: happy to upgrade lib version, just needs to be able to read the
> data, Avro version not user-facing
> >>  - IOs: I'm guessing that we have a diamond dep getting resolved by
> clobbering. A quick glance suggests Parquet is on Avro 1.10.0; Kafka's
> Avro serde is a separate thing distributed by Confluent, with its Avro
> version obfuscated by use of parent poms and properties, but their examples
> use Avro 1.9.1.
> >
> >
> > The concern here is that Avro 1.9 is not backwards compatible with Avro
> 1.8, so the future world would not be a simple "bring your own avro"
> but might require separate dataflow-with-avro-1.8 and
> dataflow-with-avro-1.9 targets which certainly isn't scalable. (Or am I
> mistaken here? Maybe we could solve this with vending?)
> >
> >>> Doesn't this mean that, even if we remove avro from Beam core, a user
> that uses Beam + Avro 1.9 will have issues with any of the above (fairly
> fundamental) modules?
> >>>
> >>>>  We could mitigate this by first adding the new extension module and
> deprecating the core Beam counterpart for a release (or multiple releases).
> >>>
> >>>
> >>> +1 to Reuven's concerns here.
> >>
> >>
> >> Agree we should add the module and release it for at least one release,
> probably a few, because users tend to hop a few releases. We have some
> precedent for breaking changes with dropping Python/Flink versions after
> asking users on user@ and polling on Twitter, etc.
> >>
> >> Kenn
>
