Github user steveloughran commented on the issue: https://github.com/apache/spark/pull/12004

# Packaging

1. This addresses the problem that it's not always immediately obvious what people have to do to get, say, s3a working. Do you know precisely which version of the Amazon AWS SDK you need on your classpath for a specific version of hadoop-aws.jar, to avoid getting a linkage error? That's the problem Maven handles for you.
1. With a new module, downstream applications can build with that support, knowing that issues related to dependency versions have been handled for them.

# Documentation

It has an overview of how to use this stuff, lists those dependencies, explains whether the stores can be used as a direct destination for work, why the Direct committer was taken away, etc.

# Testing

The tests make sure everything works: the packaging, the versioning of Jackson, propagation of configuration options, failure handling, etc. Which offers:

1. Verifying the packaging. The initial role of the tests was to make sure the classpaths were coming in right, filesystems registering, etc.
1. Compliance testing of the object store client libraries: have they implemented the relevant APIs the way they are meant to, so that Spark can use them to list, read, and write data?
1. Regression testing of the Hadoop client libs: functionality and performance. This module, along with some Hive stuff, is the basis for benchmarking S3A performance improvements.
1. Regression testing of Spark functionality/performance, highlighting places to tune things like directory listing operations.
1. Regression testing of the cloud infrastructures themselves. More relevant with OpenStack than the others, as that's the one where you can go against nightly builds.
1. Cross-object-store benchmarking. Compare how long the dataframe example takes to complete on Azure vs. S3A, and crank up the debugging to see where the delays are (it's in the S3 copy being way, way slower; Azure appears not to actually be copying bytes).
1. Integration testing. That is, rather than just running a minimal scalatest operation, you can use spark-submit to submit the work to a full cluster, verifying that the right JARs made it out, that the cluster isn't running incompatible versions of the JVM and Joda-Time, etc.

With this module, then, people get the option of building Spark with the JARs on the classpath. But they also gain the ability to have Jenkins set up to make sure that everything works, all the time.

It also provides the placeholder for adding any code specific to object stores, like, perhaps, some kind of committer. I don't have any plans there, but others might.
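To illustrate the packaging point: a sketch of what a downstream build might declare, assuming a Hadoop 2.7.x line (the version numbers here are illustrative, not taken from this PR; the point is that hadoop-aws and the AWS SDK must be kept in lockstep, which is what letting Maven resolve the transitive dependency does for you):

```xml
<!-- Illustrative versions only. hadoop-aws 2.7.x was built against
     aws-java-sdk 1.7.4; declaring a mismatched SDK is exactly the
     linkage-error trap described above. Normally you declare only
     hadoop-aws and let Maven pull the matching SDK transitively. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-aws</artifactId>
  <version>2.7.3</version>
</dependency>
<!-- Only pin the SDK explicitly if you must, and then pin the match: -->
<dependency>
  <groupId>com.amazonaws</groupId>
  <artifactId>aws-java-sdk</artifactId>
  <version>1.7.4</version>
</dependency>
```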