Getting Avro out of core is a good idea in principle but it has
consequences for users.

The cleanest solution implies moving the packages as done in the PR,
but this will essentially break every existing user, so we should
measure the impact on users and agree on whether it is worth breaking
SDK core backwards compatibility (or wait until Beam 3 to do this).
Users tend to be really frustrated by this kind of breakage, in
particular if they don't see a concrete benefit [1]. I thought for a
moment that a solution could be to keep the same package names in the
extension, but that won't work because we would end up with split
packages/modules, which causes problems with Java 11.

There are two other ‘issues’ with the change:

We MUST guarantee that such an upgrade does not break users of the
Spark runner. Spark leaks Avro 1.8.x by default in its recent versions
(2.4.x / 3.x.x), so any Beam code that uses a different version of
Avro should at least be source compatible across Avro versions,
otherwise it will break users of that runner. In concrete terms this
means we should stay on Avro 1.8.2 by default, have the other modules
declare the upgraded Avro dependencies only as provided, and leave it
up to the users to supply the compatible versions they want to use.
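
To make the alignment concrete, here is a rough sketch of what it
could look like from a user's build (Gradle Kotlin DSL; the Beam
artifact and version numbers are illustrative assumptions on my side,
not something the PR defines):

    // build.gradle.kts of a user pipeline running on the Spark runner
    dependencies {
        // hypothetical Beam release, use whatever version you are on
        implementation("org.apache.beam:beam-runners-spark:2.24.0")
        // Spark 2.4.x already ships Avro 1.8.x on the cluster classpath,
        // so pin the pipeline's Avro to the same line instead of letting
        // a newer transitive version win the resolution
        implementation("org.apache.avro:avro:1.8.2")
    }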

And for the concrete case of Confluent Schema Registry support in
KafkaIO, this implies that we need to find a way to keep the Avro
dependency aligned between core and KafkaIO, otherwise users can hit
missing classes/methods at runtime, in particular if their own code
depends on an unaligned version of Avro.
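
A similar sketch for the KafkaIO + Schema Registry case (again Gradle
Kotlin DSL, and the Confluent coordinates/versions here are
assumptions on my side):

    // build.gradle.kts – force a single Avro version so that core,
    // KafkaIO and the Confluent Avro serde all resolve the same
    // classes/methods
    configurations.all {
        resolutionStrategy {
            force("org.apache.avro:avro:1.8.2")
        }
    }
    dependencies {
        implementation("org.apache.beam:beam-sdks-java-io-kafka:2.24.0")
        // Confluent's serializer brings its own Avro (and needs the
        // Confluent maven repository)
        implementation("io.confluent:kafka-avro-serializer:5.3.2")
    }

Whichever Avro line we pick, the point is that a single version has to
win across core, KafkaIO and the serde.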

I have to say I have mixed feelings. I am essentially pro removal from
core because Avro should never have been there in the first place, but
I am wary of the impact versus the potential gains.

[1] https://medium.com/@steve.yegge/dear-google-cloud-your-deprecation-policy-is-killing-you-ee7525dc05dc

On Fri, Sep 11, 2020 at 7:05 PM Kenneth Knowles <k...@apache.org> wrote:
>
> Top-post: I'm generally in favor of moving Avro out of core specifically 
> because it is something where different users (and dep chains) want different 
> versions. The pain caused by having it in core has come up a lot to me. I 
> don't think backwards-compatibility absolutism helps our users in this case. 
> I do think gradual migration to ease pain is important.
>
> On Fri, Sep 11, 2020 at 9:30 AM Robert Bradshaw <rober...@google.com> wrote:
>>
>> On Thu, Sep 10, 2020 at 2:48 PM Brian Hulette <bhule...@google.com> wrote:
>>>
>>>
>>> On Tue, Sep 8, 2020 at 9:18 AM Robert Bradshaw <rober...@google.com> wrote:
>>>>
>>>> IIRC Dataflow (and perhaps others) implicitly depend on Avro to write
>>>> out intermediate files (e.g. for non-shuffle Fusion breaks). Would
>>>> this break if we just removed it?
>>>
>>>
>>> I think Dataflow would just need to declare a dependency on the new 
>>> extension.
>>
>>
>> I'm not sure this would solve the underlying problem (it just pushes it onto 
>> users and makes it more obscure). Maybe my reasoning is incorrect, but from 
>> what I see
>>
>> * Many Beam modules (e.g. dataflow, spark, file-based-io, sql, kafka, 
>> parquet, ...) depend on Avro.
>> * Using Avro 1.9 with the above modules doesn't work.
>
>
> I suggest taking these on case-by-case.
>
>  - Dataflow: implementation detail, probably not a major problem (we can just 
> upgrade the pre-portability worker while for portability it is a non-issue)
>  - Spark: probably need to use whatever version of Avro works for each 
> version of Spark (portability mitigates)
>  - SQL: happy to upgrade lib version, just needs to be able to read the data, 
> Avro version not user-facing
>  - IOs: I'm guessing that we have a diamond dep getting resolved by 
> clobbering. A quick glance seems like Parquet is on avro 1.10.0, Kafka's Avro 
> serde is a separate thing distributed by Confluent with Avro version 
> obfuscated by use of parent poms and properties, but their examples use Avro 
> 1.9.1.
>
>> Doesn't this mean that, even if we remove avro from Beam core, a user that 
>> uses Beam + Avro 1.9 will have issues with any of the above (fairly 
>> fundamental) modules?
>>
>>>  We could mitigate this by first adding the new extension module and 
>>> deprecating the core Beam counterpart for a release (or multiple releases).
>>
>>
>> +1 to Reuven's concerns here.
>
>
> Agree we should add the module and release it for at least one release, 
> probably a few because users tend to hop a few releases. We have some 
> precedent for breaking changes with the Python/Flink version dropping after 
> asking users on user@ and polling on Twitter, etc.
>
> Kenn
