Hi Hadoop devs,

I spent a good part of the past 7 months working with a dozen colleagues
to update the Guava version in Cloudera's software (which includes Hadoop,
HBase, Spark, Hive, Cloudera Manager ... more than 20 projects).

After 7 months, I finally came to a conclusion: updating to Hadoop 3.3 /
3.2.1 / 3.1.3, even if you are only coming from Hadoop 3.0 / 3.1.0, is going
to be really hard because of Guava. The amount of work to certify a minor
release update is almost equivalent to that of a major release update.

That is because:
(1) Going from Guava 11 to Guava 27 is a big jump. There are incompatible
API changes in many places. Too bad the Google developers are not
sympathetic to their users.
(2) Guava is used in all Hadoop jars: not just the Hadoop servers but also
the client jars and Hadoop common libs.
(3) The Hadoop library is used in practically all software at Cloudera.

Here is my proposal:
(1) Shade Guava into hadoop-thirdparty and relocate its packages to
org.hadoop.thirdparty.com.google.common.*
(2) Make a hadoop-thirdparty 1.1.0 release.
(3) Update existing references to Guava to use the relocated path. There are
more than 2k imports that need an update.
(4) Release Hadoop 3.3.1 / 3.2.2 containing this change.
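
To give a feel for step (3), here is a minimal sketch of how the import
rewrite could be scripted. It is only an illustration: the function name
relocate_guava_imports is hypothetical, the target prefix is taken from the
package proposed in step (1), and it assumes Guava is referenced through
ordinary `import` / `import static` lines.

```python
import re
from typing import Tuple

# Assumption: the relocated prefix is the one proposed in step (1).
OLD_PREFIX = "com.google.common."
NEW_PREFIX = "org.hadoop.thirdparty.com.google.common."

# Matches the start of "import com.google.common..." and
# "import static com.google.common..." lines, keeping the keyword part
# in group 1 so it can be re-emitted unchanged.
_IMPORT_RE = re.compile(
    r"^([ \t]*import[ \t]+(?:static[ \t]+)?)" + re.escape(OLD_PREFIX),
    re.MULTILINE,
)


def relocate_guava_imports(java_source: str) -> Tuple[str, int]:
    """Return (rewritten source, number of import lines changed)."""
    return _IMPORT_RE.subn(lambda m: m.group(1) + NEW_PREFIX, java_source)


if __name__ == "__main__":
    sample = (
        "package demo;\n"
        "\n"
        "import com.google.common.base.Preconditions;\n"
        "import static com.google.common.collect.Lists.newArrayList;\n"
        "import java.util.List;\n"
    )
    rewritten, changed = relocate_guava_imports(sample)
    print(rewritten)
    print("imports changed:", changed)
```

The sketch only touches import lines. Fully qualified references in code
bodies, strings, or build files would need separate handling. Running it a
second time changes nothing, because already relocated imports no longer
match the pattern.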

In this way, we will be able to update guava in Hadoop in the future
without disrupting Hadoop applications.

Note: HBase already did this, and this Guava update project would have been
much more difficult if HBase had not done so.

Thoughts? Other options include
(1) Force downstream applications to migrate to Hadoop client artifacts as
listed here
https://hadoop.apache.org/docs/r3.1.1/hadoop-project-dist/hadoop-common/DownstreamDev.html
but that's nearly impossible.
(2) Replace Guava usages with standard Java APIs. I suppose this is a big
project, and I can't estimate how much work it's going to be.

Weichiu
