The Avro differences in our implementation are pretty minimal if you look at the PR, to the point that an adapter should be really tiny, if one is even needed.
The big backwards-incompatible changes in Avro > 1.8 were the removal of external libraries from the public APIs, e.g. Guava, Jackson and Joda-Time. That does not sound like much, but almost every project using Avro was relying on some of these dependencies; luckily for Beam it was only Joda-Time, and that is already fixed.

Keeping backwards compatibility, by making Avro part of an extension that is optional for core and using only Avro 1.8-compatible features in Beam's source code, is the simplest path and lets us avoid all of these issues. Notice that the dependency that triggered the need for Avro 1.9 (and this thread) is a runtime dependency used by Confluent Schema Registry, and this is an issue because sdks-java-core is leaking Avro. Apart from that, I am not aware of any feature in any other project that obliges anyone to use Avro 1.9- or 1.10-specific code. Of course, a really valid reason to want a more recent version of Avro is that Avro 1.8 also leaks unmaintained dependencies with serious security issues (Jackson 1.x).

So in the end my main concern is breaking existing users' code. This has less impact for us (Talend) but probably more for the rest of the community. But if we agree to break backwards compatibility for the sake of cleanliness, we should probably proceed, and of course also give users some warning.

On Mon, Sep 14, 2020 at 7:13 PM Luke Cwik <lc...@google.com> wrote:
>
> In the Kafka module we reflectively figure out which version of Kafka exists[1] on the classpath and then reflectively invoke some APIs to work around differences in Kafka, allowing our users to bring whichever version they want.
>
> We could do something similar here and reflectively figure out which Avro is on the classpath and invoke the appropriate methods. If we are worried about the performance of using reflection, we can write and compile two different versions of an Avro adapter class and choose which one to use (using reflection only once during class loading).
>
> e.g.
> AvroAdapter {
>   static final AvroAdapter INSTANCE;
>   static {
>     if (avro19?) {
>       INSTANCE = new Avro19Adapter();
>     } else {
>       INSTANCE = new Avro18Adapter();
>     }
>   }
>
>   ... methods needed for AvroAdapter implementations ...
> }
>
> Using reflection allows the user to choose which version they use and pushes down the incompatibility issue from Apache Beam to our deps (e.g. Spark).
>
> 1: https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/ConsumerSpEL.java
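To make the adapter idea concrete, here is a minimal sketch of what such a version-selecting adapter could look like. It is only an illustration, not actual Beam code: the class names and the defaultValueOf method are hypothetical, and the version probe assumes that Schema.Field#defaultValue() (the accessor that returned a Jackson 1.x JsonNode in Avro 1.8) is absent in Avro 1.9.

    import org.apache.avro.Schema;

    // Hypothetical sketch of the "reflect once at class load" adapter idea.
    abstract class AvroAdapter {
      static final AvroAdapter INSTANCE;

      static {
        AvroAdapter chosen;
        try {
          // Probe for an API assumed to exist only in Avro 1.8: Schema.Field#defaultValue(),
          // which returned a Jackson 1.x JsonNode and is gone in 1.9. Its absence is taken
          // to mean Avro 1.9+ is on the classpath. Reflection runs only once, here.
          Schema.Field.class.getMethod("defaultValue");
          chosen = new Avro18Adapter();
        } catch (NoSuchMethodException e) {
          chosen = new Avro19Adapter();
        }
        INSTANCE = chosen;
      }

      // Methods that paper over the 1.8 vs 1.9 API differences would go here, e.g.
      // anything that touches the Joda-Time or Jackson 1.x types leaked by Avro 1.8.
      abstract Object defaultValueOf(Schema.Field field);
    }

    class Avro18Adapter extends AvroAdapter {
      @Override
      Object defaultValueOf(Schema.Field field) {
        // Assumes defaultVal(), which returns a plain Object, is available on the 1.8 line.
        return field.defaultVal();
      }
    }

    class Avro19Adapter extends AvroAdapter {
      @Override
      Object defaultValueOf(Schema.Field field) {
        // Identical here for brevity; a real adapter would differ wherever the 1.9 API changed.
        return field.defaultVal();
      }
    }

Call sites would then go through AvroAdapter.INSTANCE.defaultValueOf(field), or whatever methods the adapter ends up needing, so the reflection cost is paid once during class loading rather than on every call.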
>
> On Fri, Sep 11, 2020 at 11:42 AM Kenneth Knowles <k...@apache.org> wrote:
>>
>> I am not deep on the details myself but have reviewed various Avro upgrade changes such as https://github.com/apache/beam/pull/9779 and also some internal ones that I cannot link to. I believe the changes are small, and quite possibly we can create sdks/java/extensions/avro that works with both Avro 1.8 and 1.9 and make the Dataflow worker compatible with whatever the user chooses. (I would expect Spark is trying to get to that point too?)
>>
>> So then, if we have that, can we achieve the goals? Spark runner users that do not use Avro in their own code get Spark's version, Spark runner users that do use Avro have to choose anyhow, and we make the Dataflow worker work with 1.8 and 1.9.
>>
>> We can probably achieve the same goals by just making the core compatible with both 1.8 and 1.9. Users who don't want the dep can exclude it, too. It doesn't have a bunch of transitive deps, so there isn't a lot of value in trying to exclude it though. So core vs. extensions is more of a clean engineering thing, but having compatibility with 1.9 is a user-driven thing.
>>
>> Kenn
>>
>> On Fri, Sep 11, 2020 at 10:49 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>
>>> > The concern here is that Avro 1.9 is not backwards compatible with Avro 1.8, so the future world would not be a simple "bring your own avro" but might require separate dataflow-with-avro-1.8 and dataflow-with-avro-1.9 targets, which certainly isn't scalable. (Or am I mistaken here? Maybe we could solve this with vending?)
>>>
>>> Thinking a bit about it, this looks similar to what I mentioned for the Spark runner, save that we cannot control those targets; for that reason I talked about source code compatibility.
>>> Avro is really hard to shade correctly because of the way the code generation works, otherwise that could have been the best solution.
>>>
>>> On Fri, Sep 11, 2020 at 7:28 PM Robert Bradshaw <rober...@google.com> wrote:
>>> >
>>> > On Fri, Sep 11, 2020 at 10:05 AM Kenneth Knowles <k...@apache.org> wrote:
>>> >>
>>> >> Top-post: I'm generally in favor of moving Avro out of core, specifically because it is something where different users (and dep chains) want different versions. The pain caused by having it in core has come up a lot to me. I don't think backwards-compatibility absolutism helps our users in this case. I do think gradual migration to ease the pain is important.
>>> >
>>> > Agree. Backwards compatibility is not the absolute goal; whatever is best for existing and new users is what we should go for. That being said, this whole issue is caused by one of our dependencies not being backwards compatible itself...
>>> >>
>>> >> On Fri, Sep 11, 2020 at 9:30 AM Robert Bradshaw <rober...@google.com> wrote:
>>> >>>
>>> >>> On Thu, Sep 10, 2020 at 2:48 PM Brian Hulette <bhule...@google.com> wrote:
>>> >>>>
>>> >>>> On Tue, Sep 8, 2020 at 9:18 AM Robert Bradshaw <rober...@google.com> wrote:
>>> >>>>>
>>> >>>>> IIRC Dataflow (and perhaps others) implicitly depends on Avro to write out intermediate files (e.g. for non-shuffle fusion breaks). Would this break if we just removed it?
>>> >>>>
>>> >>>> I think Dataflow would just need to declare a dependency on the new extension.
>>> >>>
>>> >>> I'm not sure this would solve the underlying problem (it just pushes it onto users and makes it more obscure). Maybe my reasoning is incorrect, but from what I see:
>>> >>>
>>> >>> * Many Beam modules (e.g. dataflow, spark, file-based-io, sql, kafka, parquet, ...) depend on Avro.
>>> >>> * Using Avro 1.9 with the above modules doesn't work.
>>> >>
>>> >> I suggest taking these case by case.
>>> >>
>>> >> - Dataflow: implementation detail, probably not a major problem (we can just upgrade the pre-portability worker, while for portability it is a non-issue)
>>> >> - Spark: probably need to use whatever version of Avro works for each version of Spark (portability mitigates)
>>> >> - SQL: happy to upgrade the lib version, it just needs to be able to read the data; the Avro version is not user-facing
>>> >> - IOs: I'm guessing that we have a diamond dep getting resolved by clobbering.
>>> >> A quick glance suggests Parquet is on Avro 1.10.0; Kafka's Avro serde is a separate thing distributed by Confluent, with the Avro version obfuscated by the use of parent poms and properties, but their examples use Avro 1.9.1.
>>> >
>>> > The concern here is that Avro 1.9 is not backwards compatible with Avro 1.8, so the future world would not be a simple "bring your own avro" but might require separate dataflow-with-avro-1.8 and dataflow-with-avro-1.9 targets, which certainly isn't scalable. (Or am I mistaken here? Maybe we could solve this with vending?)
>>> >
>>> >>> Doesn't this mean that, even if we remove Avro from Beam core, a user that uses Beam + Avro 1.9 will have issues with any of the above (fairly fundamental) modules?
>>> >>>
>>> >>>> We could mitigate this by first adding the new extension module and deprecating the core Beam counterpart for a release (or multiple releases).
>>> >>>
>>> >>> +1 to Reuven's concerns here.
>>> >>
>>> >> Agree we should add the module and release it for at least one release, probably a few, because users tend to hop a few releases. We have some precedent for breaking changes with the Python/Flink version dropping, after asking users on user@ and polling on Twitter, etc.
>>> >>
>>> >> Kenn