[jira] [Comment Edited] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)

2018-06-05 Thread Saisai Shao (JIRA)


[ https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16501561#comment-16501561 ]

Saisai Shao edited comment on HIVE-16391 at 6/5/18 10:15 AM:
-

It seems I don't have permission to upload a file; there's no such button.


was (Author: jerryshao):
Seems there's no permission for me to upload a file.

> Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
> -
>
> Key: HIVE-16391
> URL: https://issues.apache.org/jira/browse/HIVE-16391
> Project: Hive
> Issue Type: Task
> Components: Build Infrastructure
> Reporter: Reynold Xin
> Priority: Major
> Labels: pull-request-available
>
> Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the 
> only change in the fork is to work around the issue that Hive publishes only 
> two sets of jars: one set with no dependency declared, and another with all 
> the dependencies included in the published uber jar. That is to say, Hive 
> doesn't publish a set of jars with the proper dependencies declared.
> There is general consensus on both sides that we should remove the forked 
> Hive.
> The change in the forked version is recorded here 
> https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2
> Note that the fork in the past included other fixes but those have all become 
> unnecessary.





[jira] [Comment Edited] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)

2017-06-01 Thread Josh Rosen (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16032497#comment-16032497 ]

Josh Rosen edited comment on HIVE-16391 at 6/1/17 5:59 AM:
---

I tried to see whether Spark can consume the existing Hive 1.2.1 artifacts, but 
it looks like neither the regular nor the {{core}} hive-exec artifact will work:

* We can't use the regular Hive uber-JAR artifacts because they bundle many 
transitive dependencies but do not relocate those dependencies' classes into a 
private namespace, so multiple versions of the same class end up on the 
classpath. To see this, note the long list of bundled artifacts at 
https://github.com/apache/hive/blob/release-1.2.1/ql/pom.xml#L685 versus the 
single relocation pattern (for Kryo); a sketch of what a relocation entry looks 
like follows this list.
* We can't use the {{core}}-classified artifact:
** We actually need Kryo to be shaded in {{hive-exec}} because Spark now uses 
Kryo 3 (which is needed by Chill 0.8.x, which is needed for Scala 2.12) while 
Hive uses Kryo 2.
** In addition, I think that Spark needs to shade Hive's 
{{com.google.protobuf:protobuf-java}} dependency.
** The published {{hive-exec}} POM is a "dependency-reduced" POM which doesn't 
declare {{hive-exec}}'s transitive dependencies. To see this, compare the 
declared dependencies in the published POM in Maven Central 
(http://central.maven.org/maven2/org/apache/hive/hive-exec/1.2.1/hive-exec-1.2.1.pom)
 to the dependencies in the source repo's POM: 
https://github.com/apache/hive/blob/release-1.2.1/ql/pom.xml. The lack of 
declared dependencies creates an additional layer of pain when consuming the 
{{core}} JAR: we have to shoulder the burden of declaring explicit dependencies 
on {{hive-exec}}'s transitive dependencies ourselves (since they're no longer 
bundled in an uber JAR), which makes it harder to use tools like Maven's 
{{dependency:tree}} to spot potential dependency conflicts.
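
To make the relocation point concrete, here is a rough sketch of what a 
shade-plugin relocation entry looks like (the pattern and shadedPattern values 
are illustrative and not copied from Hive's actual {{ql/pom.xml}}):

{code:xml}
<!-- Illustrative sketch only, not Hive's actual build configuration.        -->
<!-- A relocation rewrites a bundled dependency's package names so the copy  -->
<!-- inside the uber JAR cannot clash with another version on the classpath. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <!-- Hive 1.2.1 relocates only Kryo; every other bundled dependency
               keeps its original package names and can therefore conflict. -->
          <relocation>
            <pattern>com.esotericsoftware</pattern>
            <shadedPattern>org.apache.hive.com.esotericsoftware</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
{code}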

Spark's current custom Hive fork effectively makes three changes compared to 
Hive 1.2.1 in order to work around the above problems, plus some legacy issues 
which are no longer relevant:

* Remove the shading/bundling of most non-Hive classes, with the exception of 
Kryo and Protobuf. This has the effect of making the published POM declare 
proper transitive dependencies, easing the dep. management story in Spark's 
POMs, while still ensuring that we relocate classes that conflict with Spark.
* Package the hive-shims into the hive-exec JAR. I don't think that this is 
strictly necessary.
* Downgrade Kryo to 2.21. This isn't necessary anymore: there was an earlier 
time when we purposely _unshaded_ Kryo and pinned Hive's version to match 
Spark's. The only reason this change is still present today is to minimize the 
diff between versions 1 and 2 of Spark's Hive fork.

For the full details, see 
https://github.com/apache/hive/compare/release-1.2.1...JoshRosen:release-1.2.1-spark2,
 which compares the current Version 2 of our Hive fork to stock Hive 1.2.1.

Maven does not allow a classified artifact to declare dependencies that differ 
from the main artifact's POM, so if we wanted to publish a 
{{hive-exec core}}-like artifact which declares its transitive dependencies 
then this would need to be done under a new Maven artifact name or a new 
version (e.g. Hive 1.2.2-spark).
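
For reference, this is roughly how a consumer pulls in a classified artifact; 
note that there is nowhere to attach a classifier-specific dependency list, 
since the classified JAR shares the main {{hive-exec}} POM:

{code:xml}
<!-- Sketch: depending on the core-classified jar. Maven still resolves      -->
<!-- transitive dependencies from the single (dependency-reduced) hive-exec  -->
<!-- POM, regardless of which classifier is requested.                       -->
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>1.2.1</version>
  <classifier>core</classifier>
</dependency>
{code}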

That said, proper declaration of transitive dependencies isn't a hard blocker 
for us: a long, long, long time ago, I think that Spark may have actually built 
with a stock {{core}} artifact and explicitly declared the transitive deps, so 
if we've handled that dependency declaration before then we can do it again at 
the cost of some pain in the future if we want to bump to Hive 2.x.
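
In that setup Spark's own POM would carry the dependency list itself, roughly 
like the sketch below (the two extra dependencies are examples only; the real 
list is the much longer one in {{ql/pom.xml}} linked above):

{code:xml}
<!-- Sketch of the "declare the transitive deps ourselves" approach: the     -->
<!-- core-classified hive-exec plus explicit declarations of what it needs.  -->
<!-- Only two example dependencies are shown; the real list is much longer.  -->
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>1.2.1</version>
  <classifier>core</classifier>
</dependency>
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-metastore</artifactId>
  <version>1.2.1</version>
</dependency>
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-common</artifactId>
  <version>1.2.1</version>
</dependency>
{code}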

Therefore, I think the minimal change needed in Hive's build is to add a new 
classifier, say {{core-spark}}, which behaves like {{core}} except that it 
shades and relocates Kryo and Protobuf. If this artifact existed then I think 
Spark could use that classified artifact, declare an explicit dependency on the 
shim artifacts (assuming Kryo and Protobuf don't need to be shaded there) and 
explicitly pull in all of {{hive-exec}}'s transitive dependencies. This avoids 
the need to publish separate _versions_ for Spark: instead, Spark would just 
consume a differently-packaged/differently-classified version of a stock Hive 
release.
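
A hypothetical sketch of what such a {{core-spark}} execution in {{ql/pom.xml}} 
could look like (names and patterns are illustrative, not an actual patch):

{code:xml}
<!-- Hypothetical shade-plugin execution that attaches a second, classified  -->
<!-- jar bundling and relocating only Kryo and Protobuf, leaving everything  -->
<!-- else as ordinary declared dependencies.                                 -->
<execution>
  <id>core-spark</id>
  <phase>package</phase>
  <goals><goal>shade</goal></goals>
  <configuration>
    <shadedArtifactAttached>true</shadedArtifactAttached>
    <shadedClassifierName>core-spark</shadedClassifierName>
    <artifactSet>
      <includes>
        <include>org.apache.hive:hive-exec</include>
        <include>com.esotericsoftware.kryo:kryo</include>
        <include>com.google.protobuf:protobuf-java</include>
      </includes>
    </artifactSet>
    <relocations>
      <relocation>
        <pattern>com.esotericsoftware</pattern>
        <shadedPattern>org.apache.hive.com.esotericsoftware</shadedPattern>
      </relocation>
      <relocation>
        <pattern>com.google.protobuf</pattern>
        <shadedPattern>org.apache.hive.com.google.protobuf</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</execution>
{code}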

If we go with this latter approach, then I guess Hive would need to publish 
1.2.3 or 1.2.2.1 in order to introduce the new classified artifact.

Does this sound like a reasonable approach? Or would it make more sense to have 
a separate Hive branch and versioning scheme for Spark (e.g. 
{{branch-1.2-spark}} and Hive {{1.2.1-spark}})? I lean towards the former 
approach (releasing 1.2.3 with an additional Spark-specific classifier), 
especially if we want to fix bugs or make functional / non-packaging changes 
later down the road.