Getting Avro out of core is a good idea in principle, but it has consequences for users.
The cleanest solution is to move the packages, as done in the PR, but that will
essentially break every existing user, so we should measure the impact and agree
on whether it is worth breaking SDK core backwards compatibility (or wait until
Beam 3 to do this). Users tend to be really frustrated by this kind of breakage,
in particular if they don't see a concrete benefit [1]. I thought for a moment
that a solution could be to keep the same packages in the extension, but that
won't work because we would end up with split packages (the same package in two
modules), which causes issues with Java 11.

There are two other 'issues' with the change:

We MUST guarantee that such an upgrade does not break users of the Spark runner.
Spark leaks Avro 1.8.x by default in the recent 2.4.x / 3.x.x versions, so Beam
code that uses a different version of Avro should at least be source compatible
across Avro versions, otherwise it will break users of that runner. In concrete
terms this means we should stay on Avro 1.8.2 by default and have the other
modules only 'provide' the upgraded Avro dependencies, leaving it up to users to
supply the compatible versions they want to use.

And for the concrete case of Confluent Schema Registry support in KafkaIO, this
implies we need to find a way to keep the Avro dependency aligned between core
and KafkaIO, otherwise users can hit issues because of unavailable
classes/methods, in particular if their code depends on an unaligned version of
Avro (see the build sketch at the end of this mail).

I have to say I have mixed feelings. I am essentially pro removal from core
because it should never have been there in the first place, but I am afraid of
the impact vs the potential gains.

[1] https://medium.com/@steve.yegge/dear-google-cloud-your-deprecation-policy-is-killing-you-ee7525dc05dc

On Fri, Sep 11, 2020 at 7:05 PM Kenneth Knowles <k...@apache.org> wrote:
>
> Top-post: I'm generally in favor of moving Avro out of core specifically
> because it is something where different users (and dep chains) want different
> versions. The pain caused by having it in core has come up a lot to me. I
> don't think backwards-compatibility absolutism helps our users in this case.
> I do think gradual migration to ease pain is important.
>
> On Fri, Sep 11, 2020 at 9:30 AM Robert Bradshaw <rober...@google.com> wrote:
>>
>> On Thu, Sep 10, 2020 at 2:48 PM Brian Hulette <bhule...@google.com> wrote:
>>>
>>> On Tue, Sep 8, 2020 at 9:18 AM Robert Bradshaw <rober...@google.com> wrote:
>>>>
>>>> IIRC Dataflow (and perhaps others) implicitly depend on Avro to write
>>>> out intermediate files (e.g. for non-shuffle Fusion breaks). Would
>>>> this break if we just removed it?
>>>
>>> I think Dataflow would just need to declare a dependency on the new
>>> extension.
>>
>> I'm not sure this would solve the underlying problem (it just pushes it onto
>> users and makes it more obscure). Maybe my reasoning is incorrect, but from
>> what I see
>>
>> * Many Beam modules (e.g. dataflow, spark, file-based-io, sql, kafka,
>> parquet, ...) depend on Avro.
>> * Using Avro 1.9 with the above modules doesn't work.
>
> I suggest taking these on case-by-case.
>
> - Dataflow: implementation detail, probably not a major problem (we can just
> upgrade the pre-portability worker while for portability it is a non-issue)
> - Spark: probably need to use whatever version of Avro works for each
> version of Spark (portability mitigates)
> - SQL: happy to upgrade lib version, just needs to be able to read the data,
> Avro version not user-facing
> - IOs: I'm guessing that we have a diamond dep getting resolved by
> clobbering. A quick glance seems like Parquet is on avro 1.10.0, Kafka's Avro
> serde is a separate thing distributed by Confluent with Avro version
> obfuscated by use of parent poms and properties, but their examples use Avro
> 1.9.1.
>
>> Doesn't this mean that, even if we remove avro from Beam core, a user that
>> uses Beam + Avro 1.9 will have issues with any of the above (fairly
>> fundamental) modules?
>>
>>> We could mitigate this by first adding the new extension module and
>>> deprecating the core Beam counterpart for a release (or multiple releases).
>>
>> +1 to Reuven's concerns here.
>
> Agree we should add the module and release it for at least one release,
> probably a few because users tend to hop a few releases. We have some
> precedent for breaking changes with the Python/Flink version dropping after
> asking users on user@ and polling on Twitter, etc.
>
> Kenn
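
To make the alignment point above more concrete, this is roughly what a user
on the Spark runner can already do from their own build to keep a single Avro
version across core, KafkaIO and the Confluent serde. Untested sketch in
Gradle Kotlin DSL; the artifact versions are only illustrative, not a
recommendation:

  // build.gradle.kts (user project) -- sketch only, versions are illustrative
  repositories {
      mavenCentral()
      // Confluent's serializers are not on Maven Central
      maven(url = "https://packages.confluent.io/maven/")
  }

  dependencies {
      implementation("org.apache.beam:beam-sdks-java-core:2.24.0")
      implementation("org.apache.beam:beam-sdks-java-io-kafka:2.24.0")
      implementation("io.confluent:kafka-avro-serializer:5.3.2")
  }

  // Force one Avro version everywhere so core, KafkaIO and the Confluent
  // serde all resolve the same classes (here the Spark-friendly 1.8.2).
  configurations.all {
      resolutionStrategy {
          force("org.apache.avro:avro:1.8.2")
      }
  }

They can then check which Avro version actually wins with
./gradlew dependencyInsight --dependency avro --configuration runtimeClasspath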