Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    # Packaging:
    
    1. It addresses the problem that it's not always immediately obvious what 
people have to do to get, say, s3a working. Do you know precisely which version 
of the Amazon AWS SDK you need on your classpath for a specific version of 
hadoop-aws.jar to avoid a linkage error? That's the problem Maven handles for 
you.
    1. With a new module, downstream applications can build with that support, 
knowing that dependency-version issues have been handled for them (see the 
sketch after this list).
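
    Purely as an illustration of that second point: a downstream sbt build 
could pull in the module with a single dependency line. The coordinates below 
are hypothetical (the module name and published artifact depend on how this PR 
lands); the point is that one declaration brings in the hadoop-aws/hadoop-azure 
connectors and the matching SDK versions transitively.

    ```scala
    // Hypothetical coordinates, shown only to illustrate the idea that one
    // dependency pulls in the object-store connectors and matching SDKs.
    libraryDependencies += "org.apache.spark" %% "spark-cloud" % "2.0.0"
    ```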
    
    # Documentation
    
    The documentation gives an overview of how to use the object stores, lists 
the dependencies, explains whether they can be used as a direct destination for 
work, why the direct output committer was taken away, and so on.
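
    As a rough sketch of what the documented usage looks like (bucket name and 
credentials here are placeholders, and reading them from environment variables 
is just one option): `spark.hadoop.*` options are copied into the Hadoop 
configuration, which is how `fs.s3a.*` settings reach the S3A client, and a 
DataFrame can then be written straight to an `s3a://` path.

    ```scala
    import org.apache.spark.sql.SparkSession

    // Sketch only: "my-bucket" and the credential sources are placeholders.
    // spark.hadoop.* settings are propagated into the Hadoop configuration.
    val spark = SparkSession.builder()
      .appName("cloud-output-example")
      .config("spark.hadoop.fs.s3a.access.key",
        sys.env.getOrElse("AWS_ACCESS_KEY_ID", ""))
      .config("spark.hadoop.fs.s3a.secret.key",
        sys.env.getOrElse("AWS_SECRET_ACCESS_KEY", ""))
      .getOrCreate()

    val df = spark.range(1000).toDF("id")
    // Writing directly to the object store; the docs cover when this is safe
    // and why the direct output committer was removed.
    df.write.mode("overwrite").parquet("s3a://my-bucket/output/ids")
    ```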
    
    # Testing
    
    The tests make sure everything works: the packaging, the Jackson 
versioning, propagation of configuration options, failure handling, and so on. 
This offers:
    
    1. Verifying the packaging. The initial role of the tests was to make sure 
the classpaths were correct, the filesystems were registering, and so on (a 
minimal sketch of this kind of check appears after this list).
    1. Compliance testing of the object store client libraries: have they 
implemented the relevant APIs the way they are meant to, so that Spark can use 
them to list, read, and write data.
    1. Regression testing of the Hadoop client libraries: functionality and 
performance. This module, along with some Hive work, is the basis for 
benchmarking S3A performance improvements.
    1. Regression testing of Spark functionality and performance; highlighting 
places to tune, such as directory listing operations.
    1. Regression testing of the cloud infrastructures themselves. This is more 
relevant for OpenStack than the others, as that's the one where you can test 
against nightly builds.
    1. Cross-object-store benchmarking. Compare how long the dataframe example 
takes to complete on Azure vs. s3a, and crank up the debugging to see where the 
delays are (it's the S3 copy operation being way, way slower; Azure appears not 
to copy the bytes at all).
    1. Integration testing. Rather than just running a minimal scalatest 
operation, you can use spark-submit to send the work to a full cluster, to 
verify that the right JARs made it out, that the cluster isn't running 
incompatible versions of the JVM and Joda Time, etc.
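
    For the packaging and compliance checks above, the kind of test involved 
looks roughly like the sketch below. This is not the suite in this PR, just an 
illustration; the bucket name is a placeholder and credentials are assumed to 
come from the usual `fs.s3a.*` settings or the environment.

    ```scala
    import java.net.URI

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.scalatest.FunSuite

    // Minimal sketch of a packaging/compliance check; "test-bucket" is a
    // placeholder and credentials come from fs.s3a.* or the environment.
    class S3ARegistrationSuite extends FunSuite {

      test("s3a filesystem is registered and can round-trip a small file") {
        val conf = new Configuration()
        // Fails with an unknown-scheme or linkage error if hadoop-aws or the
        // AWS SDK are missing or mismatched on the classpath.
        val fs = FileSystem.get(new URI("s3a://test-bucket/"), conf)
        val path = new Path("s3a://test-bucket/spark-cloud-test/probe.txt")
        val out = fs.create(path, true)
        out.write("hello".getBytes("UTF-8"))
        out.close()
        assert(fs.getFileStatus(path).getLen == 5)
        fs.delete(path, true)
      }
    }
    ```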
    
    With this module, then, people get the option of building Spark with the 
JARs on the classpath. They also gain the ability to have Jenkins set up to 
make sure that everything works, all the time.
    It also provides a placeholder for any code specific to object stores, 
such as, perhaps, some kind of committer. I don't have any plans there, but 
others might.

