Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Here's why this matters, and why a simple "isn't this just a matter of dropping in the JARs" answer isn't the solution:
    
    *Getting the right JARs together with the right Spark version is a non-trivial problem*.
    
    That's essentially it. 
    1. Does everyone know which version of the AWS SDK is needed for Hadoop 2.7? (1.7.4)
    1. And whether that is compatible with the version in Hadoop 2.6? (maybe)
    1. Will it be the same in Hadoop 2.8? (yes)
    1. Will it be the same in Hadoop 2.9+? No: it moves to a 1.11.x release, where AWS also broke the SDK up into separate JARs (AWS S3 SDK, core SDK, etc.).
    1. Are the transitive dependencies in hadoop branch-2 always the same? No; for Hadoop 2.9+ you need to declare a version of jackson-dataformat-cbor compatible with Spark's Jackson, otherwise the AWS SDK will pull one in which is incompatible with the one Spark declares at the top level.
    1. Are the dependencies in Hadoop 3 going to stay the same? Absolutely not: we will be adding the DynamoDB SDK to the classpath for S3Guard, that consistent world view and the O(1) committer.
    1. What about Azure? Which version of the SDK is there? (2.0.0 for Hadoop 2.7.x; 4.2.2 for Hadoop 2.8+, where you also need to undeclare the transitive dependency on commons-lang3; see the sketch after this list.)
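    
    To make that concrete, here's a rough sketch of what a downstream POM has to get right by hand today against a 2.9-line build. The version properties and the exact exclusions here are illustrative, not authoritative:
    
    ```xml
        <!-- hadoop-aws must match the exact Hadoop version Spark was built against -->
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-aws</artifactId>
          <version>${hadoop.version}</version>
        </dependency>
        <!-- pin the CBOR artifact the AWS SDK drags in so it matches Spark's Jackson -->
        <dependency>
          <groupId>com.fasterxml.jackson.dataformat</groupId>
          <artifactId>jackson-dataformat-cbor</artifactId>
          <version>${fasterxml.jackson.version}</version>
        </dependency>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-azure</artifactId>
          <version>${hadoop.version}</version>
          <exclusions>
            <!-- the Azure SDK's commons-lang3 clashes with the one Spark ships -->
            <exclusion>
              <groupId>org.apache.commons</groupId>
              <artifactId>commons-lang3</artifactId>
            </exclusion>
          </exclusions>
        </dependency>
    ```
    
    Get any one of those pins wrong and Maven will silently hand you a mix of incompatible JARs.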
    
    You see? It's an unstable transitive graph of things which absolutely needs to be kept 100% in sync with the version of Hadoop which Spark was built against, and with the versions of other things (Jackson, httpclient) which Spark also pulls in, or you end up with stack traces appearing on mailing lists, JIRAs and Stack Overflow.
    
    The way to do that is not to have some text file somewhere saying "this is what you have to do"; it's to have a machine-readable file which does it, and ensures that things can be pulled in, shaded together as appropriate, and then delivered into people's hands. And the metadata can be published as another artifact into the repo, so if someone downstream wants a fully consistent set of artifacts, all they need to do is ask for it in their own machine-readable bit of code:
    
    ```xml
        <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-hadoop-cloud_${scala.binary.version}</artifactId>
          <version>${spark.version}</version>
        </dependency>
    ```
    
    @srowen 
    Regarding S3A consistency, you'll be wanting [S3Guard](https://issues.apache.org/jira/browse/HADOOP-13345), whose developers include your colleague @ajfabbri. I am transitively testing Spark + S3Guard, using this module and the [downstream set of examples which I pulled out from this patch](https://github.com/steveloughran/spark-cloud-examples). That ensures that Spark becomes one of the things where we can say "works".
    
    S3Guard will bring consistent listing to the API: if you create a file, then do a list, it'll be there. This is going to be a prerequisite for the zero-rename committer of [HADOOP-13786](https://issues.apache.org/jira/browse/HADOOP-13786). That's going to allow anything under `FileOutputFormat` to write directly to the destination directory, supporting both speculation and failures. People will want that. And they will only be able to use it if every dependency in their packaging is consistent.
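    
    For a sense of what that looks like downstream, here's a sketch of the kind of site configuration (core-site.xml) S3Guard and the committer are expected to need. The key names below come from the in-progress HADOOP-13345/HADOOP-13786 work and may change before release:
    
    ```xml
        <!-- enable the DynamoDB-backed metadata store for consistent listings
             (provisional key name from the HADOOP-13345 work) -->
        <property>
          <name>fs.s3a.metadatastore.impl</name>
          <value>org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore</value>
        </property>
        <!-- select the zero-rename committer of HADOOP-13786
             (provisional key name and value) -->
        <property>
          <name>fs.s3a.committer.name</name>
          <value>directory</value>
        </property>
    ```
    
    Note what the first property implies: the DynamoDB SDK is now on the classpath too, which is exactly the kind of dependency churn the list above describes.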
    
    Another way to look at it is this: a very large percentage of people using Spark are running in-cloud deployments. There is no reliable way for them to get their dependencies right with the ASF releases, which not only cripples those releases compared to commercial products which bundle Spark themselves, but makes it near-impossible for developers to work at the Maven level, building applications off the Maven dependency graph.


