What if we introduced a core-lite package without Avro? (We could take inventory and see whether there are other dependencies we could/should make optional as well.) The existing core module would remain the same, but core-lite would give users a way to use other Avro versions with Beam.
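To make the idea concrete, here is a rough sketch of what a user's build could look like. All artifact names here are hypothetical (neither beam-sdks-java-core-lite nor beam-sdks-java-extensions-avro exists today), and the version numbers are only illustrative:

    // build.gradle (hypothetical user project; artifact names are illustrative only)
    dependencies {
        // Slimmed-down core with no hard Avro dependency
        implementation "org.apache.beam:beam-sdks-java-core-lite:2.25.0"

        // Opt-in Avro support, kept source compatible with a range of Avro versions
        implementation "org.apache.beam:beam-sdks-java-extensions-avro:2.25.0"

        // The user supplies whichever Avro version matches the rest of their stack
        implementation "org.apache.avro:avro:1.9.2"
    }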
On Fri, Sep 11, 2020 at 10:28 AM Ismaël Mejía <[email protected]> wrote:

> Getting Avro out of core is a good idea in principle, but it has consequences for users.
>
> The cleanest solution implies moving the packages as done in the PR, but that would essentially break every existing user, so we should measure the impact on users and agree whether it is worth breaking SDK core backwards compatibility (or wait until Beam 3 to do this). Users tend to be really frustrated by this kind of breakage, in particular if they don't see a concrete benefit [1]. I thought for a moment that a solution could be to keep the same packages in the extension, but that won't work because we would end up with broken packages/modules, which would cause issues with Java 11.
>
> There are two other 'issues' with the change:
>
> We MUST guarantee that such an upgrade does not break users of the Spark runner. Spark leaks Avro 1.8.x by default in the recent 2.4.x / 3.x.x versions, so any Beam code that uses a different version of Avro should at least be source compatible across Avro versions, or it will break users on that runner. In concrete terms this means we should stay on Avro 1.8.2 by default and have the other modules provide only the upgraded Avro dependencies; it will be up to users to supply the compatible versions they want to use.
>
> And for the concrete case of Confluent Schema Registry in KafkaIO, this implies we need to find a way to keep the Avro dependency aligned between core and KafkaIO; otherwise users can run into missing classes/methods, in particular if their code depends on an unaligned version of Avro.
>
> I have to say I have mixed feelings. I am essentially pro removal from core because it should never have been there in the first place, but I am afraid of the impact vs the potential gains.
>
> [1] https://medium.com/@steve.yegge/dear-google-cloud-your-deprecation-policy-is-killing-you-ee7525dc05dc
>
> On Fri, Sep 11, 2020 at 7:05 PM Kenneth Knowles <[email protected]> wrote:
> >
> > Top-post: I'm generally in favor of moving Avro out of core, specifically because it is something where different users (and dep chains) want different versions. The pain caused by having it in core has come up a lot to me. I don't think backwards-compatibility absolutism helps our users in this case. I do think gradual migration to ease the pain is important.
> >
> > On Fri, Sep 11, 2020 at 9:30 AM Robert Bradshaw <[email protected]> wrote:
> >>
> >> On Thu, Sep 10, 2020 at 2:48 PM Brian Hulette <[email protected]> wrote:
> >>>
> >>> On Tue, Sep 8, 2020 at 9:18 AM Robert Bradshaw <[email protected]> wrote:
> >>>>
> >>>> IIRC Dataflow (and perhaps others) implicitly depends on Avro to write out intermediate files (e.g. for non-shuffle fusion breaks). Would this break if we just removed it?
> >>>
> >>> I think Dataflow would just need to declare a dependency on the new extension.
> >>
> >> I'm not sure this would solve the underlying problem (it just pushes it onto users and makes it more obscure). Maybe my reasoning is incorrect, but from what I see:
> >>
> >> * Many Beam modules (e.g. dataflow, spark, file-based-io, sql, kafka, parquet, ...) depend on Avro.
> >> * Using Avro 1.9 with the above modules doesn't work.
> >
> > I suggest taking these on case-by-case.
> >
> > - Dataflow: implementation detail, probably not a major problem (we can just upgrade the pre-portability worker, while for portability it is a non-issue)
> > - Spark: probably need to use whatever version of Avro works for each version of Spark (portability mitigates)
> > - SQL: happy to upgrade the lib version; it just needs to be able to read the data, and the Avro version is not user-facing
> > - IOs: I'm guessing we have a diamond dep getting resolved by clobbering. A quick glance suggests Parquet is on Avro 1.10.0; Kafka's Avro serde is a separate thing distributed by Confluent, with its Avro version obfuscated by the use of parent poms and properties, but their examples use Avro 1.9.1.
> >
> >> Doesn't this mean that, even if we remove Avro from Beam core, a user that uses Beam + Avro 1.9 will have issues with any of the above (fairly fundamental) modules?
> >>
> >>> We could mitigate this by first adding the new extension module and deprecating the core Beam counterpart for a release (or multiple releases).
> >>
> >> +1 to Reuven's concerns here.
> >
> > Agree we should add the module and release it for at least one release, probably a few, because users tend to hop a few releases. We have some precedent for breaking changes with the dropping of Python/Flink versions, after asking users on user@ and polling on Twitter, etc.
> >
> > Kenn
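Re the Spark and KafkaIO alignment points in the quoted thread: until the dependency graphs are aligned, the workaround available to users today is to pin a single Avro version across their whole build. A minimal Gradle sketch, assuming a user who has to stay on the 1.8.2 that Spark 2.4.x leaks (version numbers illustrative):

    // build.gradle (user project): force one Avro version everywhere so Beam,
    // Parquet, and the Confluent serdes all resolve to the same artifact
    configurations.all {
        resolutionStrategy {
            force "org.apache.avro:avro:1.8.2"
        }
    }

Running "gradlew dependencyInsight --dependency avro" (or "mvn dependency:tree" in Maven builds) shows which module was winning the diamond dep before the force is applied.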
