GitHub user steveloughran opened a pull request: https://github.com/apache/spark/pull/20923
[SPARK-23807][BUILD][WIP] Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and binding

## What changes were proposed in this pull request?

1. Adds a `hadoop-3` profile building against the Hadoop 3.1 artifacts. It is tagged as WiP because Hadoop 3.1 isn't out the door yet; the profile depends on the hadoop 3.1-SNAPSHOT.
1. In the hadoop-cloud module, adds an explicit hadoop-3 profile which switches from explicitly pulling in the cloud connectors (hadoop-openstack, hadoop-aws, hadoop-azure) to depending on the hadoop-cloud-storage POM artifact, which pulls these in, pre-excludes artifacts such as hadoop-common, and stays up to date as new connectors (hadoop-azuredatalake, hadoop-aliyun) are added. Goal: keeping this dependency set clean becomes the Hadoop project's homework, and the Spark project doesn't need to track the new dependencies each Hadoop release adds. It also lines Spark up for switching to a shaded hadoop-cloud-storage bundle when that is implemented.
1. In the hadoop-cloud module, adds new source and tests for connecting to the `PathOutputCommitter` factory mechanism of Hadoop 3.1.
1. Increases the curator and zookeeper versions to match those in Hadoop 3, fixing spark-core to build in sbt with the hadoop-3 dependencies.

Why 3.1-SNAPSHOT over 3.0.1?

* 3.0.0 has to be viewed as an early release of the code; 3.1 should be the stable one.
* The committer changes are only in the forthcoming 3.1.0 and 3.0.2 releases.
* The cloud-storage dependencies are still unstable in the 3.0.x line (too many transitive dependencies, omitted hadoop-aliyun). The hadoop-3 profile does exclude the transitive cruft, for anyone who does want to use branch-3.0 builds.

Hadoop 3.1 should be viewed as the version where Hadoop 3.x is really ready to play.

## How was this patch tested?

* There are some minimal unit tests of the new source in the hadoop-cloud module when built with the hadoop-3 connector.
* Everything has been built and tested against both ASF Hadoop branch-3.1 and Hadoop trunk.
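For reference, a build along these lines might look like the following. This is a sketch only: the `hadoop-3` profile name comes from this patch, `hadoop-cloud` is Spark's existing profile for the hadoop-cloud module, and the exact `hadoop.version` value is an assumption based on the 3.1-SNAPSHOT dependency described above.

```shell
# Sketch: build Spark with the new hadoop-3 profile plus the hadoop-cloud
# module, overriding the Hadoop release via the standard hadoop.version
# property (here pointing at the 3.1 snapshot this PR depends on).
./build/mvn -Phadoop-3 -Phadoop-cloud \
    -Dhadoop.version=3.1.0-SNAPSHOT \
    -DskipTests clean install
```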
The spark-hive JAR has problems here, as its version check logic fails for Hadoop versions > 2. This can be avoided with either of:

* Building the Hadoop JARs to declare their version as Hadoop 2.11: `mvn install -DskipTests -DskipShade -Ddeclared.hadoop.version=2.11`. This is safe for local test runs, but not for deployment (HDFS is very strict about cross-version deployment).
* A modified version of spark-hive whose version check switch statement is happy with Hadoop 3.

I've done both, with maven and SBT. Two issues surfaced:

1. A spark-core test failure; fixed in SPARK-23787.
1. SBT only: Zookeeper not being found in spark-core. Somehow curator 2.12.0 triggers slightly different dependency resolution logic from previous versions, and Ivy was missing zookeeper.jar entirely. This patch adds the explicit declaration for all Spark profiles, setting the ZK version to 3.4.9 for hadoop-3.

The integration tests against real infrastructures live [on github](https://github.com/hortonworks-spark/cloud-integration/tree/master/cloud-examples). These verify that s3, azure wasb, azure-datalake and openstack swift stores can be used as the source and destination of work.
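The explicit Zookeeper declaration mentioned above would take roughly this shape in the POM (a sketch; the use of a `zookeeper.version` property and its exact placement in Spark's parent POM are assumptions, only the `org.apache.zookeeper:zookeeper` coordinates and the 3.4.9 value for hadoop-3 come from this patch description):

```xml
<!-- Sketch of the explicit declaration described above, so Ivy/SBT
     resolve zookeeper.jar directly instead of via curator's
     transitive graph. Property name is hypothetical. -->
<dependency>
  <groupId>org.apache.zookeeper</groupId>
  <artifactId>zookeeper</artifactId>
  <!-- 3.4.9 under the hadoop-3 profile -->
  <version>${zookeeper.version}</version>
</dependency>
```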
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/steveloughran/spark cloud/SPARK-23807-hadoop-31

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20923.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20923

----

commit 29e73242cba9797ed24127b24bb0380c69a608d3
Author: Steve Loughran <stevel@...>
Date: 2018-03-28T17:38:57Z

    SPARK-23807 Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and binding

    Change-Id: Ia4526f184ced9eef5b67aee9e91eced0dd38d723

----