Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/12004

Here's why this matters, and why a simple "isn't this just a matter of dropping in the JARs" isn't the solution: *getting the right JARs together with the right Spark version is a non-trivial problem*. That's essentially it.

1. Does everyone know which version of the AWS SDK is needed for Hadoop 2.7? (1.7.4.)
1. And whether that is compatible with the version in 2.6? (Maybe.)
1. Will it be the same in Hadoop 2.8? (Yes.)
1. Will it be the same in Hadoop 2.9+? No: that moves to a 1.11.x release, where AWS split the SDK into separate JARs (S3 SDK, core SDK, etc.).
1. Are the transitive dependencies in hadoop branch-2 always the same? No; for Hadoop 2.9+ you need to declare a version of jackson-dataformat-cbor compatible with Spark's, otherwise the AWS SDK will pull one in which is incompatible with the one Spark declares at the top level.
1. Are the dependencies in Hadoop 3 going to stay the same? Absolutely not: we will be adding DynamoDB to the classpath for S3Guard, that consistent world view, and the O(1) committer.
1. What about Azure? Which version of the SDK is there? (2.0.0 for Hadoop 2.7.x; 4.2.2 for Hadoop 2.8+, where you also need to undeclare the dependency on commons-lang3.)

You see? It's an unstable transitive graph of things which absolutely need to be kept 100% in sync with the version of Hadoop which Spark was built against, and with the versions of other stuff (jackson, httpclient) which Spark also pulls in, or you end up with stack traces appearing on mailing lists, JIRAs and Stack Overflow.

The way to do that is not to have some text file somewhere saying "this is what you have to do"; it's to have a machine-readable file which does it, and ensures that things can be pulled in, shaded together as appropriate, and then delivered into people's hands. And the metadata can be published as another artifact into the repo, so if someone downstream wants a fully consistent set of artifacts, all they need to do is ask for it in their own machine-readable bit of code:

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-hadoop-cloud_${scala.binary.version}</artifactId>
  <version>${spark.version}</version>
</dependency>
```

@srowen Regarding S3A consistency, you'll be wanting [S3Guard](https://issues.apache.org/jira/browse/HADOOP-13345), whose developers include your colleague @ajfabbri. I am transitively testing Spark + S3Guard, using this module and the [downstream set of examples which I pulled out from this patch](https://github.com/steveloughran/spark-cloud-examples). That ensures that Spark becomes one of the things where we can say "works".

S3Guard will bring a consistent listing to the API: if you create a file, then do a list, it'll be there. This is going to be a prerequisite for the zero-rename committer of [HADOOP-13786](https://issues.apache.org/jira/browse/HADOOP-13786). That's going to allow anything under `FileOutputFormat` to write directly to the destination directory, supporting both speculation and failures. People will want that. And they will only be able to use it if every dependency in their packaging is consistent.
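To make the consistency point concrete, here is a minimal sketch (bucket and file names hypothetical) of the create-then-list pattern that commit algorithms built on `FileOutputFormat` depend on. Against raw S3, the listing is only eventually consistent, so the assertion below can fail; with a consistent listing, it cannot:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ListAfterCreate {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // assumes fs.s3a.* credentials are already configured
    val dest = new Path("s3a://some-bucket/output/")
    val fs: FileSystem = dest.getFileSystem(conf)

    // write a file, as a committed task would
    val file = new Path(dest, "part-00000")
    val out = fs.create(file, true)
    out.write("hello".getBytes("UTF-8"))
    out.close()

    // without a consistent listing, the freshly created file may be
    // missing from this enumeration -- fatal for commit algorithms
    // that discover task output by listing the destination directory
    val listed = fs.listStatus(dest).map(_.getPath.getName)
    assert(listed.contains("part-00000"), "listing missed a just-written file")
  }
}
```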
Another way to look at it is this: a very large percentage of people using Spark are running in-cloud deployments, and there is no reliable way for them to get their dependencies right from the ASF releases. That not only cripples those releases compared to commercial products which bundle Spark themselves, it makes it near-impossible for developers to work at the Maven level, building applications off the Maven dependency graph.
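For a sense of what "working at the Maven level" costs today, here is a sketch of the kind of hand-maintained pinning a downstream build ends up carrying without this module. Every coordinate and version number below is an illustrative placeholder, not a recommendation; the point is that each one has to be kept in sync with Hadoop and Spark by hand:

```scala
// build.sbt -- a hypothetical, hand-maintained dependency set for a
// Spark application using S3A; the versions shown are placeholders
val sparkVersion  = "2.3.0"
val hadoopVersion = "2.9.0"

libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core" % sparkVersion % Provided,
  // hadoop-aws must match the Hadoop version Spark was built against;
  // the exclusion illustrates dropping a transitively conflicting artifact
  ("org.apache.hadoop" % "hadoop-aws" % hadoopVersion)
    .exclude("com.fasterxml.jackson.core", "jackson-databind"),
  // the AWS SDK must match what hadoop-aws was tested against, and its
  // jackson CBOR dependency must match Spark's top-level jackson version
  "com.amazonaws" % "aws-java-sdk-s3" % "1.11.199",
  "com.fasterxml.jackson.dataformat" % "jackson-dataformat-cbor" % "2.6.7"
)
```

With a published `spark-hadoop-cloud` artifact, all of that collapses into the single dependency shown earlier.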