Re: [DISCUSS] Versioning, Hadoop related dependencies and enterprise users

Raghu Angadi Tue, 28 Aug 2018 11:29:55 -0700

Thanks for the IO versioning summary.
KafkaIO's policy of 'let the user decide the exact version at runtime' has
been quite useful so far. How feasible is that for other connectors?


Also, KafkaIO does not limit itself to the minimum set of features available
across all the supported versions. Some features (e.g. server-side
timestamps) are disabled based on the Kafka version found at runtime. The
unit tests currently run against a single recent version. Integration tests
could certainly use multiple versions. With some more effort in writing
tests, we could run the unit tests against multiple versions too.
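
A minimal sketch of that kind of runtime gating, in Java. This is not
KafkaIO's actual code: the class and method names are hypothetical, the
version string is assumed to be supplied by the caller, and the 0.10.0
threshold reflects the Kafka release in which message timestamps were
introduced.

```java
/** Hypothetical sketch: enable optional features based on the Kafka client version seen at runtime. */
public final class KafkaFeatureGate {

  private KafkaFeatureGate() {}

  /** Parses "1.1.0" or "0.10.2.1" into numeric parts; suffixes such as "-SNAPSHOT" are dropped. */
  public static int[] parse(String version) {
    String[] parts = version.split("\\.");
    int[] out = new int[parts.length];
    for (int i = 0; i < parts.length; i++) {
      String digits = parts[i].replaceAll("[^0-9].*$", "");
      out[i] = digits.isEmpty() ? 0 : Integer.parseInt(digits);
    }
    return out;
  }

  /** Returns true if runtime >= minimum, comparing part by part and padding with zeros. */
  public static boolean atLeast(String runtime, String minimum) {
    int[] a = parse(runtime);
    int[] b = parse(minimum);
    int n = Math.max(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int x = i < a.length ? a[i] : 0;
      int y = i < b.length ? b[i] : 0;
      if (x != y) {
        return x > y;
      }
    }
    return true;
  }

  /** Assumption for this sketch: timestamp features need a 0.10.0 or newer client. */
  public static boolean supportsTimestamps(String runtimeVersion) {
    return atLeast(runtimeVersion, "0.10.0");
  }

  public static void main(String[] args) {
    for (String v : new String[] {"0.9.0.1", "0.10.0", "1.1.0", "2.0.0"}) {
      System.out.println(v + " -> timestamps enabled: " + supportsTimestamps(v));
    }
  }
}
```

The connector would call something like supportsTimestamps once at setup
and skip the timestamp code path on older clients, rather than restricting
every user to the oldest supported feature set.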

Raghu.

> IO versioning
> * Elasticsearch. We delayed the move to version 6 until we heard of
> more active users needing it (more deployments). We support 2.x and
> 5.x (but 2.x went recently EOL). Support for 6.x is in progress.
> * SolrIO, stable version is 7.x, LTS is 6.x. We support only 5.x
> because most big data distributions still use 5.x (however 5.x has
> been EOL).
> * KafkaIO uses version 1.x but Kafka recently moved to 2.x, however
> most of the deployments of Kafka use earlier versions than 1.x. This
> module uses a single version with the kafka client as a provided
> dependency and so far it works (but we don’t have multi version
> tests).
>


On Tue, Aug 28, 2018 at 8:38 AM Ismaël Mejía <[email protected]> wrote:

> I think we should refine the strategy on dependencies discussed
> recently. Sorry to come late with this (I did not follow closely the
> previous discussion), but the current approach is clearly not in line
> with the industry reality (at least not for IO connectors + Hadoop +
> Spark/Flink use).
>
> A really proactive approach to dependency updates is a good practice
> for the core dependencies we have e.g. Guava, Bytebuddy, Avro,
> Protobuf, etc, and of course for the case of cloud based IOs e.g. GCS,
> Bigquery, AWS S3, etc. However when we talk about self hosted data
> sources or processing systems this gets more complicated and I think
> we should be more flexible and do this case by case (and remove these
> from the auto update email reminder).
>
> Some open source projects have at least three maintained versions:
> - LTS – maps to what most of the people have installed (or the big
> data distributions use) e.g. HBase 1.1.x, Hadoop 2.6.x
> - Stable – current recommended version. HBase 1.4.x, Hadoop 2.8.x
> - Next – latest release. HBase 2.1.x Hadoop 3.1.x
>
> Following the most recent versions can be good for staying close to the
> current development of other projects and for picking up some of the
> fixes, but these versions are not commonly deployed by most users, and
> adopting an LTS-only or stable-only approach won't satisfy all cases
> either. To understand why this is complex, let's look at some historical
> issues:
>
> IO versioning
> * Elasticsearch. We delayed the move to version 6 until we heard of
> more active users needing it (more deployments). We support 2.x and
> 5.x (but 2.x went recently EOL). Support for 6.x is in progress.
> * SolrIO, stable version is 7.x, LTS is 6.x. We support only 5.x
> because most big data distributions still use 5.x (however 5.x has
> been EOL).
> * KafkaIO uses version 1.x but Kafka recently moved to 2.x, however
> most of the deployments of Kafka use earlier versions than 1.x. This
> module uses a single version with the kafka client as a provided
> dependency and so far it works (but we don’t have multi version
> tests).
>
> Runners versioning
> * The move from Spark 1 to Spark 2 was decided after weighing the cost
> of maintaining support for multiple versions against the cost of a
> breaking change. This is a rare case, but one with consequences. This
> dependency is provided but we don't actively test for issues on version
> migration.
> * Flink moved to version 1.5, introducing an incompatibility in
> checkpointing (discussed recently, with no consensus yet on how to
> handle it).
>
> As you can see, it seems really hard to have a solution that fits all
> cases. Probably the only rule that I see from this list is that we
> should upgrade connectors whose supported versions have been deprecated
> or have reached EOL (e.g. Solr 5.x, Elasticsearch 2.x).
>
> For the case of the provided dependencies I wonder if as part of the
> tests we should provide tests with multiple versions (note that this
> is currently blocked by BEAM-4087).
>
> Any other ideas or opinions on how we can handle this? What do other
> people in the community think? (Note that this may be related to the
> ongoing LTS discussion.)
>
>
> On Tue, Aug 28, 2018 at 10:44 AM Tim Robertson
> <[email protected]> wrote:
> >
> > Hi folks,
> >
> > I'd like to revisit the discussion around our versioning policy
> > specifically for the Hadoop ecosystem and make sure we are aware of the
> > implications.
> >
> > As an example our policy today would have us on HBase 2.1 and I have
> > reminders to address this.
> >
> > However, currently the versions of HBase in the major hadoop distros are:
> >
> >  - Cloudera 5 on HBase 1.2 (Cloudera 6 is 2.1 but is only in beta)
> >  - Hortonworks HDP3 on HBase 2.0 (only recently released so we can
> >    assume is not widely adopted)
> >  - AWS EMR HBase on 1.4
> >
> > On the versioning I think we might need a more nuanced approach to
> > ensure that we target real communities of existing and potential users.
> > Enterprise users need to stick to the supported versions in the
> > distributions to maintain support contracts from the vendors.
> >
> > Should our versioning policy have more room to consider on a case by
> > case basis?
> >
> > For Hadoop might we benefit from a strategy on which community of users
> > Beam is targeting?
> >
> > (OT: I'm collecting some thoughts on what we might consider to target
> > enterprise hadoop users - kerberos on all relevant IO, performance,
> > leaking beyond encryption zones with temporary files etc)
> >
> > Thanks,
> > Tim
>
