Sahil Takiar created HIVE-14240:
-----------------------------------

             Summary: HoS itests shouldn't depend on a Spark distribution
                 Key: HIVE-14240
                 URL: https://issues.apache.org/jira/browse/HIVE-14240
             Project: Hive
          Issue Type: Improvement
          Components: Spark
    Affects Versions: 2.0.1, 2.1.0, 2.0.0
            Reporter: Sahil Takiar
            Assignee: Sahil Takiar


The HoS integration tests download a full Spark distribution (a tar-ball) from 
CloudFront and use it to run Spark locally: a few tests run with Spark in 
embedded mode, and others run against a local Spark on YARN cluster. The 
{{itests/pom.xml}} actually contains the scripts that download the tar-ball 
from a pre-defined location.

This is problematic because the Spark distribution shades all of its 
dependencies, including the Hadoop ones. This can cause problems when 
upgrading the Hadoop version used by Hive (ref: HIVE-13930).

Removing this dependency would also avoid downloading the tar-ball during 
every build, and would simplify the build process for the itests module.

The Hive itests should instead depend directly on the Spark artifacts published 
to Maven Central. It will take some effort to get this working. The current 
Hive Spark Client uses a launch script from the Spark installation to run Spark 
jobs; the script basically does some setup work and then invokes 
{{org.apache.spark.deploy.SparkSubmit}}. It is possible to invoke this class 
directly, which removes the need to have a full Spark distribution available 
locally (in fact this option already exists, but isn't tested).

There may be other issues around classpath conflicts between Hive and Spark. 
For example, Hive and Spark require different versions of Kryo. One solution 
would be to take the Spark artifacts and shade Kryo inside them.
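As a sketch only (the relocated package prefix and the module that would host 
this shading step are assumptions), a maven-shade-plugin relocation along these 
lines could rename Kryo's packages inside a re-published Spark artifact so that 
Spark's copy cannot clash with Hive's:

{code:xml}
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <!-- Rename Kryo's packages (com.esotericsoftware.*) so the copy
               bundled with Spark is isolated from the version Hive uses. -->
          <relocation>
            <pattern>com.esotericsoftware</pattern>
            <shadedPattern>org.apache.hive.spark.com.esotericsoftware</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
{code}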



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
