[GitHub] spark pull request #20923: [SPARK-23807][BUILD][WIP] Add Hadoop 3 profile wi...
Github user steveloughran commented on a diff in the pull request: https://github.com/apache/spark/pull/20923#discussion_r178823511 --- Diff: pom.xml --- @@ -2671,6 +2671,15 @@ + + hadoop-3 + +3.1.0-SNAPSHOT --- End diff -- RC0 is up for testing right now! @leftnoteasy is managing the release --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20923: [SPARK-23807][BUILD][WIP] Add Hadoop 3 profile wi...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/20923#discussion_r178251635 --- Diff: pom.xml --- @@ -2671,6 +2671,15 @@ + + hadoop-3 + +3.1.0-SNAPSHOT --- End diff -- Hey @steveloughran what is the possible release date for Hadoop 3.1.0?
[GitHub] spark pull request #20923: [SPARK-23807][BUILD][WIP] Add Hadoop 3 profile wi...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20923#discussion_r178195279 --- Diff: hadoop-cloud/pom.xml --- @@ -141,13 +93,98 @@ httpcore ${hadoop.deps.scope} + + + + hadoop-2.6 + +true --- End diff -- I think that's ok as an initial step. It would be better if you could customize individual dependencies within profiles (e.g. exclude some transitive deps in the hadoop-3 profile), but I'm not sure whether Maven would complain about something like that. `jackson-dataformat-cbor` can become interesting if Spark decides to upgrade Jackson, since that project's GitHub page says it was removed in 2.8.
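To make the per-profile customization above concrete, here is a minimal sketch in which the profile declares the dependency itself and carries its own `<exclusions>`. Declaring the dependency wholly inside the profile sidesteps the open question of whether Maven accepts redeclaring one that is already in the main `<dependencies>` section. The excluded artifact is purely illustrative and is an assumption, not what the PR actually excludes.

```xml
<!-- Hypothetical sketch: a hadoop-3 profile that owns its dependency
     declaration and trims one transitive dependency.
     The commons-logging exclusion is an illustrative assumption. -->
<profile>
  <id>hadoop-3</id>
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-cloud-storage</artifactId>
      <version>${hadoop.version}</version>
      <scope>${hadoop.deps.scope}</scope>
      <exclusions>
        <exclusion>
          <groupId>commons-logging</groupId>
          <artifactId>commons-logging</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
  </dependencies>
</profile>
```

Running `mvn dependency:tree` with and without `-Phadoop-3` is one way to check what such a profile really changes.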
[GitHub] spark pull request #20923: [SPARK-23807][BUILD][WIP] Add Hadoop 3 profile wi...
Github user steveloughran commented on a diff in the pull request: https://github.com/apache/spark/pull/20923#discussion_r178072258 --- Diff: hadoop-cloud/pom.xml --- @@ -177,6 +214,188 @@ + + + org.apache.hadoop + hadoop-aws + ${hadoop.version} + ${hadoop.deps.scope} + + + org.apache.hadoop + hadoop-common + + + commons-logging + commons-logging + + + org.codehaus.jackson + jackson-mapper-asl + + + org.codehaus.jackson + jackson-core-asl + + + com.fasterxml.jackson.core + jackson-core + + + com.fasterxml.jackson.core + jackson-databind + + + com.fasterxml.jackson.core + jackson-annotations + + + + + org.apache.hadoop + hadoop-openstack + ${hadoop.version} + ${hadoop.deps.scope} + + + org.apache.hadoop + hadoop-common + + + commons-logging + commons-logging + + + junit + junit + + + org.mockito + mockito-all + + + + + + + joda-time + joda-time + ${hadoop.deps.scope} + + + + com.fasterxml.jackson.core + jackson-databind + ${hadoop.deps.scope} + + + com.fasterxml.jackson.core + jackson-annotations + ${hadoop.deps.scope} + + + com.fasterxml.jackson.dataformat + jackson-dataformat-cbor + ${fasterxml.jackson.version} + + + + org.apache.httpcomponents + httpclient + ${hadoop.deps.scope} + + + + org.apache.httpcomponents + httpcore + ${hadoop.deps.scope} + + + + + + + hadoop-3 + +src/hadoop-3/main/scala + src/hadoop-3/test/scala + + + + + + +org.codehaus.mojo +build-helper-maven-plugin + + +add-scala-sources +generate-sources + + add-source + + + +${extra.source.dir} + + + + +add-scala-test-sources +generate-test-sources + + add-test-source + + + +${extra.testsource.dir} + + + + + + + + + + +
[GitHub] spark pull request #20923: [SPARK-23807][BUILD][WIP] Add Hadoop 3 profile wi...
Github user steveloughran commented on a diff in the pull request: https://github.com/apache/spark/pull/20923#discussion_r178060744 --- Diff: hadoop-cloud/pom.xml --- @@ -141,13 +93,98 @@ httpcore ${hadoop.deps.scope} + + + + hadoop-2.6 + +true --- End diff --

Hmmm. There's another option, which is to leave all of those in the standard list; you then get a few extra dependencies which aren't needed for the 3.x line (marked `*`):

```
[INFO] +- com.fasterxml.jackson.core:jackson-databind:jar:2.6.7.1:compile *
[INFO] |  \- com.fasterxml.jackson.core:jackson-core:jar:2.6.7:compile *
[INFO] +- com.fasterxml.jackson.core:jackson-annotations:jar:2.6.7:compile *
[INFO] +- com.fasterxml.jackson.dataformat:jackson-dataformat-cbor:jar:2.6.7:compile *
[INFO] +- org.apache.httpcomponents:httpclient:jar:4.5.4:compile
[INFO] |  +- commons-logging:commons-logging:jar:1.2:compile
[INFO] |  \- commons-codec:commons-codec:jar:1.10:compile
[INFO] +- org.apache.httpcomponents:httpcore:jar:4.4.8:compile
[INFO] +- org.apache.hadoop:hadoop-aws:jar:3.0.2-SNAPSHOT:compile
[INFO] |  \- com.amazonaws:aws-java-sdk-bundle:jar:1.11.271:compile
[INFO] +- org.apache.hadoop:hadoop-openstack:jar:3.0.2-SNAPSHOT:compile
[INFO] +- joda-time:joda-time:jar:2.9.3:compile *
[INFO] +- org.apache.hadoop:hadoop-cloud-storage:jar:3.0.2-SNAPSHOT:compile
[INFO] |  +- org.apache.hadoop:hadoop-aliyun:jar:3.0.2-SNAPSHOT:compile
[INFO] |  |  \- com.aliyun.oss:aliyun-sdk-oss:jar:2.8.3:compile
[INFO] |  |     \- org.jdom:jdom:jar:1.1:compile
[INFO] |  +- org.apache.hadoop:hadoop-azure:jar:3.0.2-SNAPSHOT:compile
[INFO] |  |  +- com.microsoft.azure:azure-storage:jar:5.4.0:compile
[INFO] |  |  |  \- com.microsoft.azure:azure-keyvault-core:jar:0.8.0:compile
[INFO] |  |  \- org.eclipse.jetty:jetty-util-ajax:jar:9.3.19.v20170502:compile
[INFO] |  \- org.apache.hadoop:hadoop-azure-datalake:jar:3.0.2-SNAPSHOT:compile
[INFO] |     \- com.microsoft.azure:azure-data-lake-store-sdk:jar:2.2.5:compile
```

`jackson-dataformat-cbor` is the funny one: this is its sole declaration within Spark, and with the shaded AWS JAR it's not needed at all. The rest all make their way into the Spark assembly through other routes.

What do you think? Leave them as the default and not worry about it? It would remove the duplication in the 2.7 profile and, apart from the extraneous entries on hadoop-3 builds, it is harmless.
[GitHub] spark pull request #20923: [SPARK-23807][BUILD][WIP] Add Hadoop 3 profile wi...
Github user steveloughran commented on a diff in the pull request: https://github.com/apache/spark/pull/20923#discussion_r178057319 --- Diff: hadoop-cloud/pom.xml --- @@ -177,6 +214,188 @@ + + + org.apache.hadoop + hadoop-aws + ${hadoop.version} + ${hadoop.deps.scope} + + + org.apache.hadoop + hadoop-common + + + commons-logging + commons-logging + + + org.codehaus.jackson + jackson-mapper-asl + + + org.codehaus.jackson + jackson-core-asl + + + com.fasterxml.jackson.core + jackson-core + + + com.fasterxml.jackson.core + jackson-databind + + + com.fasterxml.jackson.core + jackson-annotations + + + + + org.apache.hadoop + hadoop-openstack + ${hadoop.version} + ${hadoop.deps.scope} + + + org.apache.hadoop + hadoop-common + + + commons-logging + commons-logging + + + junit + junit + + + org.mockito + mockito-all + + + + + + + joda-time + joda-time + ${hadoop.deps.scope} + + + + com.fasterxml.jackson.core + jackson-databind + ${hadoop.deps.scope} + + + com.fasterxml.jackson.core + jackson-annotations + ${hadoop.deps.scope} + + + com.fasterxml.jackson.dataformat + jackson-dataformat-cbor + ${fasterxml.jackson.version} + + + + org.apache.httpcomponents + httpclient + ${hadoop.deps.scope} + + + + org.apache.httpcomponents + httpcore + ${hadoop.deps.scope} + + + + + + + hadoop-3 + +src/hadoop-3/main/scala + src/hadoop-3/test/scala + + + + + + +org.codehaus.mojo +build-helper-maven-plugin + + +add-scala-sources +generate-sources + + add-source + + + +${extra.source.dir} + + + + +add-scala-test-sources +generate-test-sources + + add-test-source + + + +${extra.testsource.dir} + + + + + + + + + + +
[GitHub] spark pull request #20923: [SPARK-23807][BUILD][WIP] Add Hadoop 3 profile wi...
Github user steveloughran commented on a diff in the pull request: https://github.com/apache/spark/pull/20923#discussion_r178054506 --- Diff: hadoop-cloud/pom.xml --- @@ -177,6 +214,188 @@ + + + org.apache.hadoop + hadoop-aws + ${hadoop.version} + ${hadoop.deps.scope} + + + org.apache.hadoop + hadoop-common + + + commons-logging + commons-logging + + + org.codehaus.jackson + jackson-mapper-asl + + + org.codehaus.jackson + jackson-core-asl + + + com.fasterxml.jackson.core + jackson-core + + + com.fasterxml.jackson.core + jackson-databind + + + com.fasterxml.jackson.core + jackson-annotations + + + + + org.apache.hadoop + hadoop-openstack + ${hadoop.version} + ${hadoop.deps.scope} + + + org.apache.hadoop + hadoop-common + + + commons-logging + commons-logging + + + junit + junit + + + org.mockito + mockito-all + + + + + + + joda-time + joda-time + ${hadoop.deps.scope} + + + + com.fasterxml.jackson.core + jackson-databind + ${hadoop.deps.scope} + + + com.fasterxml.jackson.core + jackson-annotations + ${hadoop.deps.scope} + + + com.fasterxml.jackson.dataformat + jackson-dataformat-cbor + ${fasterxml.jackson.version} + + + + org.apache.httpcomponents + httpclient + ${hadoop.deps.scope} + + + + org.apache.httpcomponents + httpcore + ${hadoop.deps.scope} + + + + + + + hadoop-3 + +src/hadoop-3/main/scala + src/hadoop-3/test/scala + + + + + + +org.codehaus.mojo +build-helper-maven-plugin + + +add-scala-sources +generate-sources + + add-source + + + +${extra.source.dir} + + + + +add-scala-test-sources +generate-test-sources + + add-test-source + + + +${extra.testsource.dir} + + + + + + + --- End diff -- done --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20923: [SPARK-23807][BUILD][WIP] Add Hadoop 3 profile wi...
Github user steveloughran commented on a diff in the pull request: https://github.com/apache/spark/pull/20923#discussion_r178054451 --- Diff: hadoop-cloud/pom.xml --- @@ -177,6 +214,188 @@ + + + org.apache.hadoop + hadoop-aws + ${hadoop.version} + ${hadoop.deps.scope} + + + org.apache.hadoop + hadoop-common + + + commons-logging + commons-logging + + + org.codehaus.jackson + jackson-mapper-asl + + + org.codehaus.jackson + jackson-core-asl + + + com.fasterxml.jackson.core + jackson-core + + + com.fasterxml.jackson.core + jackson-databind + + + com.fasterxml.jackson.core + jackson-annotations + + + + + org.apache.hadoop + hadoop-openstack + ${hadoop.version} + ${hadoop.deps.scope} + + + org.apache.hadoop + hadoop-common + + + commons-logging + commons-logging + + + junit + junit + + + org.mockito + mockito-all + + + + + + + joda-time + joda-time + ${hadoop.deps.scope} + + + + com.fasterxml.jackson.core + jackson-databind + ${hadoop.deps.scope} + + + com.fasterxml.jackson.core + jackson-annotations + ${hadoop.deps.scope} + + + com.fasterxml.jackson.dataformat + jackson-dataformat-cbor + ${fasterxml.jackson.version} + + + + org.apache.httpcomponents + httpclient + ${hadoop.deps.scope} + + + + org.apache.httpcomponents + httpcore + ${hadoop.deps.scope} + + + + + + + hadoop-3 + +src/hadoop-3/main/scala + src/hadoop-3/test/scala + + + + + --- End diff -- my bad. Cut and paste error. Will make explicit what it's really doing. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20923: [SPARK-23807][BUILD][WIP] Add Hadoop 3 profile wi...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20923#discussion_r177853335 --- Diff: hadoop-cloud/pom.xml --- @@ -177,6 +214,188 @@ + + + org.apache.hadoop + hadoop-aws + ${hadoop.version} + ${hadoop.deps.scope} + + + org.apache.hadoop + hadoop-common + + + commons-logging + commons-logging + + + org.codehaus.jackson + jackson-mapper-asl + + + org.codehaus.jackson + jackson-core-asl + + + com.fasterxml.jackson.core + jackson-core + + + com.fasterxml.jackson.core + jackson-databind + + + com.fasterxml.jackson.core + jackson-annotations + + + + + org.apache.hadoop + hadoop-openstack + ${hadoop.version} + ${hadoop.deps.scope} + + + org.apache.hadoop + hadoop-common + + + commons-logging + commons-logging + + + junit + junit + + + org.mockito + mockito-all + + + + + + + joda-time + joda-time + ${hadoop.deps.scope} + + + + com.fasterxml.jackson.core + jackson-databind + ${hadoop.deps.scope} + + + com.fasterxml.jackson.core + jackson-annotations + ${hadoop.deps.scope} + + + com.fasterxml.jackson.dataformat + jackson-dataformat-cbor + ${fasterxml.jackson.version} + + + + org.apache.httpcomponents + httpclient + ${hadoop.deps.scope} + + + + org.apache.httpcomponents + httpcore + ${hadoop.deps.scope} + + + + + + + hadoop-3 + +src/hadoop-3/main/scala + src/hadoop-3/test/scala + + + + + + +org.codehaus.mojo +build-helper-maven-plugin + + +add-scala-sources +generate-sources + + add-source + + + +${extra.source.dir} + + + + +add-scala-test-sources +generate-test-sources + + add-test-source + + + +${extra.testsource.dir} + + + + + + + + + + +
[GitHub] spark pull request #20923: [SPARK-23807][BUILD][WIP] Add Hadoop 3 profile wi...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20923#discussion_r177852191 --- Diff: hadoop-cloud/pom.xml --- @@ -177,6 +214,188 @@ + + + org.apache.hadoop + hadoop-aws + ${hadoop.version} + ${hadoop.deps.scope} + + + org.apache.hadoop + hadoop-common + + + commons-logging + commons-logging + + + org.codehaus.jackson + jackson-mapper-asl + + + org.codehaus.jackson + jackson-core-asl + + + com.fasterxml.jackson.core + jackson-core + + + com.fasterxml.jackson.core + jackson-databind + + + com.fasterxml.jackson.core + jackson-annotations + + + + + org.apache.hadoop + hadoop-openstack + ${hadoop.version} + ${hadoop.deps.scope} + + + org.apache.hadoop + hadoop-common + + + commons-logging + commons-logging + + + junit + junit + + + org.mockito + mockito-all + + + + + + + joda-time + joda-time + ${hadoop.deps.scope} + + + + com.fasterxml.jackson.core + jackson-databind + ${hadoop.deps.scope} + + + com.fasterxml.jackson.core + jackson-annotations + ${hadoop.deps.scope} + + + com.fasterxml.jackson.dataformat + jackson-dataformat-cbor + ${fasterxml.jackson.version} + + + + org.apache.httpcomponents + httpclient + ${hadoop.deps.scope} + + + + org.apache.httpcomponents + httpcore + ${hadoop.deps.scope} + + + + + + + hadoop-3 + +src/hadoop-3/main/scala + src/hadoop-3/test/scala + + + + + + +org.codehaus.mojo +build-helper-maven-plugin + + +add-scala-sources +generate-sources + + add-source + + + +${extra.source.dir} + + + + +add-scala-test-sources +generate-test-sources + + add-test-source + + + +${extra.testsource.dir} + + + + + + + --- End diff -- nit: remove --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20923: [SPARK-23807][BUILD][WIP] Add Hadoop 3 profile wi...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20923#discussion_r177852057 --- Diff: hadoop-cloud/pom.xml --- @@ -177,6 +214,188 @@ + + + org.apache.hadoop + hadoop-aws + ${hadoop.version} + ${hadoop.deps.scope} + + + org.apache.hadoop + hadoop-common + + + commons-logging + commons-logging + + + org.codehaus.jackson + jackson-mapper-asl + + + org.codehaus.jackson + jackson-core-asl + + + com.fasterxml.jackson.core + jackson-core + + + com.fasterxml.jackson.core + jackson-databind + + + com.fasterxml.jackson.core + jackson-annotations + + + + + org.apache.hadoop + hadoop-openstack + ${hadoop.version} + ${hadoop.deps.scope} + + + org.apache.hadoop + hadoop-common + + + commons-logging + commons-logging + + + junit + junit + + + org.mockito + mockito-all + + + + + + + joda-time + joda-time + ${hadoop.deps.scope} + + + + com.fasterxml.jackson.core + jackson-databind + ${hadoop.deps.scope} + + + com.fasterxml.jackson.core + jackson-annotations + ${hadoop.deps.scope} + + + com.fasterxml.jackson.dataformat + jackson-dataformat-cbor + ${fasterxml.jackson.version} + + + + org.apache.httpcomponents + httpclient + ${hadoop.deps.scope} + + + + org.apache.httpcomponents + httpcore + ${hadoop.deps.scope} + + + + + + + hadoop-3 + +src/hadoop-3/main/scala + src/hadoop-3/test/scala + + + + + --- End diff -- Not really based on the Scala version right? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20923: [SPARK-23807][BUILD][WIP] Add Hadoop 3 profile wi...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20923#discussion_r177854961 --- Diff: hadoop-cloud/pom.xml --- @@ -141,13 +93,98 @@ httpcore ${hadoop.deps.scope} + + + + hadoop-2.6 + +true --- End diff -- `activeByDefault` is a little misleading. It only enables the profile if you don't explicitly activate any other profiles, so if you enable any other profile in the build, this one won't be enabled automatically. And since the cloud module itself is already under a profile, I don't think you can ever trigger this. This will probably need to be documented in the build docs, or maybe you can think of a different solution, like enabling the cloud profile via a property instead.
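One possible shape for a property-driven alternative, sketched under assumptions: instead of relying on `activeByDefault`, each profile is keyed on whether a property is present. This is a variation on the idea above rather than the exact proposal, and the property name `hadoop3` is hypothetical. Maven's `<property><name>!x</name></property>` form activates a profile when the property `x` is absent.

```xml
<!-- Hypothetical sketch: choose the Hadoop line with -Dhadoop3 instead of
     relying on activeByDefault. The property name "hadoop3" is an assumption. -->
<profiles>
  <profile>
    <id>hadoop-2.6</id>
    <activation>
      <!-- Active whenever the hadoop3 property is NOT defined -->
      <property>
        <name>!hadoop3</name>
      </property>
    </activation>
  </profile>
  <profile>
    <id>hadoop-3</id>
    <activation>
      <!-- Active whenever the hadoop3 property IS defined -->
      <property>
        <name>hadoop3</name>
      </property>
    </activation>
  </profile>
</profiles>
```

With this layout the 2.6 profile stays active no matter which other `-P` profiles are passed, and `mvn -Dhadoop3 ...` switches to the Hadoop 3 one. `mvn help:active-profiles` can be used to confirm which profiles a given command line ends up activating.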
[GitHub] spark pull request #20923: [SPARK-23807][BUILD][WIP] Add Hadoop 3 profile wi...
GitHub user steveloughran opened a pull request: https://github.com/apache/spark/pull/20923

[SPARK-23807][BUILD][WIP] Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and binding

## What changes were proposed in this pull request?

1. Adds a `hadoop-3` profile build depending on the hadoop-3.1 artifacts. It's tagged as WiP because Hadoop 3.1 isn't out the door yet; it depends on the Hadoop 3.1-SNAPSHOT.
1. In the hadoop-cloud module, adds an explicit hadoop-3 profile which switches from explicitly pulling in cloud connectors (hadoop-openstack, hadoop-aws, hadoop-azure) to depending on the hadoop-cloud-storage POM artifact, which pulls these in, has pre-excluded things like hadoop-common, and stays up to date with new connectors (hadoop-azure-datalake, hadoop-aliyun). Goal: keeping this clean becomes the Hadoop project's homework, and the Spark project doesn't need to handle new Hadoop releases adding more dependencies. It also lines Spark up for switching to a shaded hadoop-cloud-storage bundle when implemented.
1. In the hadoop-cloud module, adds new source and tests for connecting to the `PathOutputCommitter` factory mechanism of Hadoop 3.1.
1. Increases the curator and zookeeper versions to match those in hadoop-3, fixing spark core to build in sbt with the hadoop-3 dependencies.

Why 3.1-SNAPSHOT over 3.0.1?

* 3.0.0 has to be viewed as an early release of the code; 3.1 should be the stable one.
* The committer changes are only in the forthcoming 3.1.0 and 3.0.2 releases.
* The cloud-storage dependencies are still unstable in the 3.0.x line (too many transitive dependencies, omitted hadoop-aliyun). The hadoop-3 profile does exclude the transitive cruft, for anyone who does want to use branch-3.0 builds.

Hadoop 3.1 should be viewed as the version where Hadoop 3.x is really ready to play.

## How was this patch tested?

* There are some minimal unit tests of the new source in the hadoop-cloud module when built with the hadoop-3 connector.
* Everything has been built and tested against both ASF Hadoop branch-3.1 and hadoop trunk.

The spark hive JAR has problems here, as its version check logic fails for Hadoop versions > 2. This can be avoided with either of:

* The hadoop JARs built to declare their version as Hadoop 2.11: `mvn install -DskipTests -DskipShade -Ddeclared.hadoop.version=2.11`. This is safe for local test runs, not for deployment (HDFS is very strict about cross-version deployment).
* A modified version of spark hive whose version check switch statement is happy with hadoop 3.

I've done both, with maven and SBT.

Two issues surfaced:

1. A spark-core test failure, fixed in SPARK-23787.
1. SBT only: Zookeeper not being found in spark-core. Somehow curator 2.12.0 triggers some slightly different dependency resolution logic from previous versions, and Ivy was missing zookeeper.jar entirely. This patch adds the explicit declaration for all spark profiles, setting the ZK version = 3.4.9 for hadoop-3.

The integration tests against real infrastructures live [on github](https://github.com/hortonworks-spark/cloud-integration/tree/master/cloud-examples). These verify that s3, azure wasb, azure-datalake and openstack swift stores can be used as the source and destination of work.
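The version bumps and the explicit ZooKeeper declaration described above could look roughly like the following. This is a sketch under assumptions, not the PR's actual diff: the property names `curator.version` and `zookeeper.version` are assumed, and only the version numbers (3.1.0-SNAPSHOT, 2.12.0, 3.4.9) come from this thread.

```xml
<!-- Hypothetical sketch of the root pom.xml pieces; property names are assumptions. -->
<profile>
  <id>hadoop-3</id>
  <properties>
    <hadoop.version>3.1.0-SNAPSHOT</hadoop.version>
    <curator.version>2.12.0</curator.version>
    <zookeeper.version>3.4.9</zookeeper.version>
  </properties>
</profile>

<!-- Declared explicitly (outside any profile) so that sbt/Ivy resolves
     zookeeper.jar instead of depending on curator's transitive graph. -->
<dependency>
  <groupId>org.apache.zookeeper</groupId>
  <artifactId>zookeeper</artifactId>
  <version>${zookeeper.version}</version>
</dependency>
```

Keeping the version in a property means other profiles can leave the dependency declaration untouched and only override the number.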
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/steveloughran/spark cloud/SPARK-23807-hadoop-31

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20923.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20923

commit 29e73242cba9797ed24127b24bb0380c69a608d3
Author: Steve Loughran
Date: 2018-03-28T17:38:57Z

    SPARK-23807 Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and binding

    Change-Id: Ia4526f184ced9eef5b67aee9e91eced0dd38d723