[
https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16032497#comment-16032497
]
Josh Rosen edited comment on HIVE-16391 at 6/1/17 5:59 AM:
---
I tried to see whether Spark can consume existing Hive 1.2.1 artifacts, but it
looks like neither the regular nor the {{core}} hive-exec artifact can work:
* We can't use the regular Hive uber-JAR artifacts because they include many
transitive dependencies but do not relocate those dependencies' classes into a
private namespace, so this will cause multiple versions of the same class to be
included on the classpath. To see this, compare the long list of bundled
artifacts at https://github.com/apache/hive/blob/release-1.2.1/ql/pom.xml#L685
with the single relocation pattern (for Kryo).
* We can't use the {{core}}-classified artifact:
** We actually need Kryo to be shaded in {{hive-exec}} because Spark now uses
Kryo 3 (which is needed by Chill 0.8.x, which is needed for Scala 2.12) while
Hive uses Kryo 2.
** In addition, I think that Spark needs to shade Hive's
{{com.google.protobuf:protobuf-java}} dependency.
** The published {{hive-exec}} POM is a "dependency-reduced" POM which doesn't
declare {{hive-exec}}'s transitive dependencies. To see this, compare the
declared dependencies in the published POM in Maven Central
(http://central.maven.org/maven2/org/apache/hive/hive-exec/1.2.1/hive-exec-1.2.1.pom)
to the dependencies in the source repo's POM:
https://github.com/apache/hive/blob/release-1.2.1/ql/pom.xml. The lack of
declared dependencies creates an additional layer of pain for us when consuming
the {{core}} JAR because we now have to shoulder the burden of declaring
explicit dependencies on {{hive-exec}}'s transitive dependencies (since they're
no longer bundled in an uber JAR when we use the {{core}} JAR), making it
harder to use tools like Maven's {{dependency:tree}} to help us spot potential
dep. conflicts.
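For reference, a relocation in the shade plugin looks roughly like the following. This is a sketch of {{maven-shade-plugin}} configuration, not the exact contents of Hive's {{ql/pom.xml}}, and the shaded package names are illustrative:
{code:xml}
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <!-- The lone relocation in Hive 1.2.1 (Kryo): -->
      <relocation>
        <pattern>com.esotericsoftware</pattern>
        <shadedPattern>org.apache.hive.com.esotericsoftware</shadedPattern>
      </relocation>
      <!-- A similar relocation would be needed for Protobuf: -->
      <relocation>
        <pattern>com.google.protobuf</pattern>
        <shadedPattern>org.apache.hive.com.google.protobuf</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
{code}
Every bundled-but-unrelocated dependency besides these is what causes the duplicate-class problem with the regular uber JAR.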
Spark's current custom Hive fork effectively makes three changes relative to
Hive 1.2.1 in order to work around the above problems (plus some legacy issues
which are no longer relevant):
* Remove the shading/bundling of most non-Hive classes, with the exception of
Kryo and Protobuf. This has the effect of making the published POM declare
proper transitive dependencies, easing the dep. management story in Spark's
POMs, while still ensuring that we relocate classes that conflict with Spark.
* Package the hive-shims into the hive-exec JAR. I don't think that this is
strictly necessary.
* Downgrade Kryo to 2.21. This isn't necessary anymore: there was an earlier
time where we purposely _unshaded_ Kryo and pinned Hive's version to match
Spark's. The only reason this change is still present is to minimize the
diff between versions 1 and 2 of Spark's Hive fork.
For the full details, see
https://github.com/apache/hive/compare/release-1.2.1...JoshRosen:release-1.2.1-spark2,
which compares the current Version 2 of our Hive fork to stock Hive 1.2.1.
Maven does not allow an artifact to declare different dependencies for each of
its classifiers (all classified artifacts share the main artifact's POM), so if
we wanted to publish a {{hive-exec core}}-like artifact which declares its
transitive dependencies then this would need to be done under a new Maven
artifact name or a new version (e.g. Hive 1.2.2-spark).
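To make the classifier limitation concrete: a consumer pulls in the {{core}} JAR like this, and Maven resolves transitive dependencies from the single published {{hive-exec}} POM no matter which classifier is requested:
{code:xml}
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>1.2.1</version>
  <classifier>core</classifier>
</dependency>
{code}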
That said, proper declaration of transitive dependencies isn't a hard blocker
for us: a long, long, long time ago, I think that Spark may have actually built
with a stock {{core}} artifact and explicitly declared the transitive deps, so
if we've handled that dependency declaration before then we can do it again at
the cost of some pain in the future if we want to bump to Hive 2.x.
Therefore, I think the minimal change needed in Hive's build is to add a new
classifier, say {{core-spark}}, which behaves like {{core}} except that it
shades and relocates Kryo and Protobuf. If this artifact existed then I think
Spark could use that classified artifact, declare an explicit dependency on the
shim artifacts (assuming Kryo and Protobuf don't need to be shaded there) and
explicitly pull in all of {{hive-exec}}'s transitive dependencies. This avoids
the need to publish separate _versions_ for Spark: instead, Spark would just
consume a differently-packaged/differently-classified version of a stock Hive
release.
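Sketching what Spark's POM might look like under this proposal (the {{core-spark}} classifier and the 1.2.3 version are hypothetical at this point):
{code:xml}
<!-- Hypothetical: a future Hive release publishing a core-spark classifier -->
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>1.2.3</version>
  <classifier>core-spark</classifier>
</dependency>
<!-- Explicit dependency on the shims, since they would not be bundled -->
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-shims</artifactId>
  <version>1.2.3</version>
</dependency>
<!-- ...plus explicit declarations of hive-exec's transitive dependencies -->
{code}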
If we go with this latter approach, then I guess Hive would need to publish
1.2.3 or 1.2.2.1 in order to introduce the new classified artifact.
Does this sound like a reasonable approach? Or would it make more sense to have
a separate Hive branch and versioning scheme for Spark (e.g.
{{branch-1.2-spark}} and Hive {{1.2.1-spark}})? I lean towards the former
approach (releasing 1.2.3 with an additional Spark-specific classifier),
especially if we want to fix bugs or make functional / non-packaging changes
later down the road.