GitHub user steveloughran opened a pull request: https://github.com/apache/spark/pull/20923
[SPARK-23807][BUILD][WIP] Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and binding

## What changes were proposed in this pull request?

1. Adds a `hadoop-3` profile building against the Hadoop 3.1 artifacts. It is tagged as WiP because Hadoop 3.1 isn't out the door yet; the profile depends on the hadoop 3.1-SNAPSHOT.
1. In the hadoop-cloud module, adds an explicit hadoop-3 profile which switches from explicitly pulling in the cloud connectors (hadoop-openstack, hadoop-aws, hadoop-azure) to depending on the hadoop-cloud-storage POM artifact, which pulls these in, pre-excludes artifacts such as hadoop-common, and stays up to date as new connectors (hadoop-azuredatalake, hadoop-aliyun) are added. Goal: keeping this dependency set clean becomes the Hadoop project's homework, and the Spark project doesn't need to track the new dependencies each Hadoop release adds. It also lines Spark up for switching to a shaded hadoop-cloud-storage bundle when that is implemented.
1. In the hadoop-cloud module, adds new source and tests for connecting to the `PathOutputCommitter` factory mechanism of Hadoop 3.1.
1. Increases the curator and zookeeper versions to match those in Hadoop 3, fixing spark-core to build in sbt with the hadoop-3 dependencies.

Why 3.1-SNAPSHOT over 3.0.1?

* 3.0.0 has to be viewed as an early release of the code; 3.1 should be the stable one.
* The committer changes are only in the forthcoming 3.1.0 and 3.0.2 releases.
* The cloud-storage dependencies are still unstable in the 3.0.x line (too many transitive dependencies, omitted hadoop-aliyun). The hadoop-3 profile does exclude the transitive cruft, for anyone who does want to use branch-3.0 builds.

Hadoop 3.1 should be viewed as the version where Hadoop 3.x is really ready to play.

## How was this patch tested?

* There are some minimal unit tests of the new source in the hadoop-cloud module when built with the hadoop-3 connector.
* Everything has been built and tested against both ASF Hadoop branch-3.1 and Hadoop trunk.
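For reference, a build along these lines might look like the following. This is a sketch only: the `hadoop-3` profile name comes from this patch, `hadoop-cloud` is Spark's existing profile for the hadoop-cloud module, and the exact `hadoop.version` value is an assumption based on the 3.1-SNAPSHOT dependency described above.

```shell
# Sketch: build Spark with the new hadoop-3 profile plus the hadoop-cloud
# module, overriding the Hadoop release via the standard hadoop.version
# property (here pointing at the 3.1 snapshot this PR depends on).
./build/mvn -Phadoop-3 -Phadoop-cloud \
    -Dhadoop.version=3.1.0-SNAPSHOT \
    -DskipTests clean install
```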
The spark-hive JAR has problems here, as its version check logic fails for Hadoop versions > 2. This can be avoided with either of:

* Building the Hadoop JARs to declare their version as Hadoop 2.11: `mvn install -DskipTests -DskipShade -Ddeclared.hadoop.version=2.11`. This is safe for local test runs, but not for deployment (HDFS is very strict about cross-version deployment).
* A modified version of spark-hive whose version check switch statement is happy with Hadoop 3.

I've done both, with maven and SBT. Two issues surfaced:

1. A spark-core test failure; fixed in SPARK-23787.
1. SBT only: Zookeeper not being found in spark-core. Somehow curator 2.12.0 triggers slightly different dependency resolution logic from previous versions, and Ivy was missing zookeeper.jar entirely. This patch adds the explicit declaration for all Spark profiles, setting the ZK version to 3.4.9 for hadoop-3.

The integration tests against real infrastructures live [on github](https://github.com/hortonworks-spark/cloud-integration/tree/master/cloud-examples). These verify that s3, azure wasb, azure-datalake and openstack swift stores can be used as the source and destination of work.
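The explicit Zookeeper declaration mentioned above would take roughly this shape in the POM (a sketch; the use of a `zookeeper.version` property and its exact placement in Spark's parent POM are assumptions, only the `org.apache.zookeeper:zookeeper` coordinates and the 3.4.9 value for hadoop-3 come from this patch description):

```xml
<!-- Sketch of the explicit declaration described above, so Ivy/SBT
     resolve zookeeper.jar directly instead of via curator's
     transitive graph. Property name is hypothetical. -->
<dependency>
  <groupId>org.apache.zookeeper</groupId>
  <artifactId>zookeeper</artifactId>
  <!-- 3.4.9 under the hadoop-3 profile -->
  <version>${zookeeper.version}</version>
</dependency>
```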
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/steveloughran/spark cloud/SPARK-23807-hadoop-31

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20923.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20923

----

commit 29e73242cba9797ed24127b24bb0380c69a608d3
Author: Steve Loughran <stevel@...>
Date: 2018-03-28T17:38:57Z

    SPARK-23807 Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and binding

    Change-Id: Ia4526f184ced9eef5b67aee9e91eced0dd38d723

----