The Beam SQL module faces similar problems: several of our dependencies are
constrained by the need to maintain compatibility with the versions used by
Calcite. We've written tests to detect some of these incompatibilities. Could
we add integration tests for these major Hadoop distros that ensure we
maintain compatibility, rather than explicitly calling them out in our
upgrade policy?
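As an aside, the "more than three minor versions behind" rule mentioned later
in the thread is easy to automate. A minimal sketch (a hypothetical helper,
not actual Beam tooling) of such a check under semantic versioning:

```java
// Hypothetical helper: flags a dependency as outdated when the latest
// release is more than three minor versions ahead on the same major line.
public class VersionPolicy {

    /** Parses "major.minor[.patch]" into {major, minor}. */
    static int[] parse(String version) {
        String[] parts = version.split("\\.");
        return new int[] {
            Integer.parseInt(parts[0]),
            parts.length > 1 ? Integer.parseInt(parts[1]) : 0
        };
    }

    /** True if 'current' lags 'latest' by more than three minor versions. */
    static boolean isOutdated(String current, String latest) {
        int[] cur = parse(current);
        int[] lat = parse(latest);
        if (lat[0] != cur[0]) {
            // A new major version is always a case-by-case decision.
            return true;
        }
        return lat[1] - cur[1] > 3;
    }

    public static void main(String[] args) {
        System.out.println(isOutdated("2.6.0", "2.8.0")); // false: 2 minors behind
        System.out.println(isOutdated("1.1.0", "1.6.0")); // true: 5 minors behind
    }
}
```

A check like this could run in CI against a list of pinned versions and open
a JIRA only when the threshold is crossed, instead of on a fixed schedule.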

Andrew

On Tue, Aug 28, 2018 at 10:31 AM Chamikara Jayalath <chamik...@google.com>
wrote:

> Thanks Tim for raising this, and thanks JB and Ismaël for all the great
> points.
>
> I agree that a one-size-fits-all solution will not work when it comes to
> dependencies. Based on past examples, clearly there are many cases where we
> should proceed with caution and upgrade dependencies with care.
>
> That said, given that Beam respects semantic versioning and most of our
> dependencies respect semantic versioning I think we should be able to
> upgrade most minor (and patch) versions of dependencies with relative ease.
> Current policy is to automatically create JIRAs if we are more than three
> minor versions behind. I'm not sure if HBase respects semantic versioning.
> If it does not, I think it should be the exception, not the norm.
>
> When it comes to major version upgrades, though, we'll have to proceed with
> caution. In addition to all the case-by-case reasoning Ismaël gave above,
> there's also the real possibility of a major version upgrade changing the
> Beam API (syntax or semantics) in a non-backwards-compatible way and
> breaking the backwards compatibility guarantee offered by Beam. The current
> dependency policy [1] tries to capture this in a separate section and
> requires all PRs that upgrade dependencies to contain a statement regarding
> backwards compatibility.
>
> I agree that there may be many modifications we have to make to existing
> policies when it comes to upgrading Beam dependencies in accordance with
> industry standards. The current policies are there as a first version for
> us to try out. We should definitely reevaluate and update the policies from
> time to time as needed. I'm also extremely eager to hear what others in the
> community think about this.
>
> Thanks,
> Cham
>
> [1] https://beam.apache.org/contribute/dependencies/
>
> On Tue, Aug 28, 2018 at 8:38 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>
>> I think we should refine the strategy on dependencies discussed
>> recently. Sorry to come late with this (I did not follow the previous
>> discussion closely), but the current approach is clearly not in line
>> with the industry reality (at least not for IO connectors + Hadoop +
>> Spark/Flink use).
>>
>> A really proactive approach to dependency updates is a good practice
>> for the core dependencies we have, e.g. Guava, Bytebuddy, Avro,
>> Protobuf, etc., and of course for the cloud-based IOs, e.g. GCS,
>> BigQuery, AWS S3, etc. However, when we talk about self-hosted data
>> sources or processing systems this gets more complicated, and I think
>> we should be more flexible and handle these case by case (and remove
>> them from the auto-update email reminder).
>>
>> Some open source projects have at least three maintained versions:
>> - LTS – maps to what most people have installed (or what the big
>> data distributions use), e.g. HBase 1.1.x, Hadoop 2.6.x
>> - Stable – current recommended version, e.g. HBase 1.4.x, Hadoop 2.8.x
>> - Next – latest release, e.g. HBase 2.1.x, Hadoop 3.1.x
>>
>> Following the most recent versions can be good for staying close to the
>> current development of other projects and some of the fixes, but these
>> versions are commonly not deployed by most users, and adopting an LTS-
>> or stable-only approach won't satisfy all cases either. To understand
>> why this is complex, let's look at some historical issues:
>>
>> IO versioning
>> * Elasticsearch: we delayed the move to version 6 until we heard of
>> more active users needing it (more deployments). We support 2.x and
>> 5.x (but 2.x recently went EOL). Support for 6.x is in progress.
>> * SolrIO: stable version is 7.x, LTS is 6.x. We support only 5.x
>> because most big data distributions still use 5.x (however, 5.x has
>> reached EOL).
>> * KafkaIO uses version 1.x, but Kafka recently moved to 2.x; however,
>> most Kafka deployments use versions earlier than 1.x. This module
>> uses a single version, with the Kafka client as a provided
>> dependency, and so far it works (but we don't have multi-version
>> tests).
>>
>> Runners versioning
>> * The move from Spark 1 to Spark 2 was decided after weighing the
>> burden of maintaining support for multiple versions against the cost
>> of a breaking change. This is a rare case, but one with consequences.
>> This dependency is provided, but we don't actively test issues on
>> version migration.
>> * Flink moved to version 1.5, introducing an incompatibility in
>> checkpointing (discussed recently, with no consensus yet on how to
>> handle it).
>>
>> As you can see, it seems really hard to have a solution that fits all
>> cases. Probably the only rule that I see from this list is that we
>> should upgrade versions for connectors whose supported versions have
>> been deprecated or have reached EOL (e.g. Solr 5.x, Elasticsearch 2.x).
>>
>> For the case of the provided dependencies, I wonder if as part of the
>> test suite we should run tests against multiple versions (note that
>> this is currently blocked by BEAM-4087).
>>
>> Any other ideas or opinions on how we can handle this? What do other
>> people in the community think? (Notice that this can relate to the
>> ongoing LTS discussion.)
>>
>>
>> On Tue, Aug 28, 2018 at 10:44 AM Tim Robertson
>> <timrobertson...@gmail.com> wrote:
>> >
>> > Hi folks,
>> >
>> > I'd like to revisit the discussion around our versioning policy,
>> specifically for the Hadoop ecosystem, and make sure we are aware of the
>> implications.
>> >
>> > As an example, our policy today would have us on HBase 2.1, and I have
>> reminders to address this.
>> >
>> > However, currently the versions of HBase in the major Hadoop distros
>> are:
>> >
>> >  - Cloudera 5 on HBase 1.2 (Cloudera 6 is 2.1 but is only in beta)
>> >  - Hortonworks HDP3 on HBase 2.0 (only recently released, so we can
>> assume it is not widely adopted)
>> >  - AWS EMR on HBase 1.4
>> >
>> > On versioning, I think we might need a more nuanced approach to
>> ensure that we target real communities of existing and potential users.
>> Enterprise users need to stick to the supported versions in the
>> distributions to maintain support contracts from the vendors.
>> >
>> > Should our versioning policy leave more room for case-by-case
>> consideration?
>> >
>> > For Hadoop, might we benefit from a strategy on which community of
>> users Beam is targeting?
>> >
>> > (OT: I'm collecting some thoughts on what we might consider to target
>> enterprise Hadoop users - Kerberos on all relevant IO, performance,
>> temporary files leaking beyond encryption zones, etc.)
>> >
>> > Thanks,
>> > Tim
>>
>
