[jira] [Commented] (SPARK-7481) Add spark-hadoop-cloud module to pull in object store support
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16000502#comment-16000502 ]

Steve Loughran commented on SPARK-7481:
---

thank you!

> Add spark-hadoop-cloud module to pull in object store support
> -
>
> Key: SPARK-7481
> URL: https://issues.apache.org/jira/browse/SPARK-7481
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 2.1.0
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Fix For: 2.3.0
>
> To keep the s3n classpath right, to add s3a, swift & azure, the dependencies
> of spark in a 2.6+ profile need to add the relevant object store packages
> (hadoop-aws, hadoop-openstack, hadoop-azure)

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7481) Add spark-hadoop-cloud module to pull in object store support
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981602#comment-15981602 ]

Steve Loughran commented on SPARK-7481:
---

One thing I want to emphasise here is: I have no loyalty to my code. I just want packagings of Spark, and applications pulling it in via maven/SBT, to be able to have a consistent set of artifacts needed to successfully interact with object stores as a source of data.

Most of the stuff related to spark/object store integration I can do elsewhere, such as in Apache Bahir (integration) and on github. It's just the classpath setup which you can't really do downstream, as it depends on getting the combination of things like spark, hadoop, aws-sdk and jackson all 100% consistent. That's all I care about. And I don't care if someone else does it, as long as the patch works for current and future versions of hadoop/aws-SDK. If someone else does it, I'll gladly test that stuff downstream.

But Spark does need that integration. It had some in the past, when s3n was implemented in hadoop-common, but that's been gone since things were moved out in Hadoop 2.6. I personally think it should go back in; implicitly, so does everyone whose downstream spark-based product includes a set of the cloud storage clients and JARs 100% in sync with the rest of their product's artifacts.

So: does anyone have any alternative designs? The easiest would be to add it to spark-core itself, but that has consequences if people ship anything built on the shaded AWS JAR (the one which fixes its jackson inconsistencies internally), as it adds tens of MB to everything pulling in spark-core. A separate module is the way to manage this. Which is pretty much all the final version of the patch is.
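For downstream builds, the point of the separate module is that the consistent artifact set becomes a single dependency declaration. A sketch of what that could look like in a consumer's POM; the coordinates here are hypothetical (artifact name, Scala suffix and version must be checked against whatever Spark release actually publishes the module):

```xml
<!-- Hypothetical coordinates, for illustration only: verify the artifact
     name and version against the Spark release that ships the module. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-hadoop-cloud_2.11</artifactId>
  <version>${spark.version}</version>
</dependency>
```

The module would then pull in hadoop-aws, hadoop-azure and the matching aws-sdk/jackson versions transitively, which is exactly the consistency problem that cannot be solved downstream.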
[jira] [Commented] (SPARK-7481) Add spark-hadoop-cloud module to pull in object store support
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15982587#comment-15982587 ]

Steven Rand commented on SPARK-7481:
---

What happened to https://github.com/apache/spark/pull/12004? It doesn't look like there were any concrete objections to the changes made there -- was it just closed for lack of a reviewer?

As someone who has spent several tens of hours (and counting!) debugging classpath issues for Spark applications that read from and write to an object store, I think this change is hugely valuable. I suspect that the large number of votes and watchers indicates that others think this as well, so it'd be pretty depressing if it didn't happen just because no one will review the patch. Unfortunately I'm not qualified to review it myself, but I'd be quite grateful if someone more competent were to do so.
[jira] [Commented] (SPARK-7481) Add spark-hadoop-cloud module to pull in object store support
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15982591#comment-15982591 ]

Sean Owen commented on SPARK-7481:
---

I don't believe my last round of comments was addressed, and it was one of quite a lot of rounds. This is a real problem.
[jira] [Commented] (SPARK-7481) Add spark-hadoop-cloud module to pull in object store support
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984771#comment-15984771 ]

Steve Loughran commented on SPARK-7481:
---

I think we ended up going in circles on that PR. Sean has actually been very tolerant of me; however, the process has been hampered by my full-time focus on other things. I've only had time to work on the spark PR intermittently, and that's been hard for all: me in the rebase/retest, the one reviewer in having to catch up again.

Now, anyone who does manage to get that CP right will discover that S3A absolutely flies with Spark: in partitioning (list file improvements), data input (set the fadvise policy to random for ORC and Parquet), and output (enable the fast output mode and play with the pool options). This patch also sets things up for the integration tests downstream, so I and others can be confident that things actually work, at speed, at scale. Indeed, much of the S3A performance work was actually based on Hive and Spark workloads: the data formats & their seek patterns, directory layouts, file generation.

All that's left is the little problem of getting the classpath right. Oh, and the committer.

For now, for people's enjoyment, here are some videos from Spark Summit East on the topic:

* [Spark and object stores|https://youtu.be/8F2Jqw5_OnI]
* [Robust and Scalable ETL over Cloud Storage With Spark|https://spark-summit.org/east-2017/events/robust-and-scalable-etl-over-cloud-storage-with-spark/]
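The tuning knobs mentioned above map onto S3A configuration keys. A sketch of a spark-defaults.conf fragment, using the option names as they exist in Hadoop 2.8+; these keys are my mapping of the shorthand in the comment, so verify them against the Hadoop version actually on the classpath:

```properties
# S3A tuning, Hadoop 2.8+ key names -- verify against your Hadoop version.

# Random IO for columnar formats (the "fadvise" setting; helps ORC/Parquet seeks)
spark.hadoop.fs.s3a.experimental.input.fadvise  random

# Incremental upload of generated output (the "fast output" setting)
spark.hadoop.fs.s3a.fast.upload                 true
spark.hadoop.fs.s3a.fast.upload.buffer          disk

# Thread/connection pool options to experiment with
spark.hadoop.fs.s3a.threads.max                 64
spark.hadoop.fs.s3a.connection.maximum          64
```

Note that `fadvise=random` trades sequential read throughput for seek performance, so it suits columnar formats but not whole-file reads such as CSV scans or distcp.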
[jira] [Commented] (SPARK-7481) Add spark-hadoop-cloud module to pull in object store support
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15985040#comment-15985040 ]

Steve Loughran commented on SPARK-7481:
---

(This is a fairly long comment, but it tries to summarise the entire state of interaction with object stores, esp. S3A on Hadoop 2.8+. Azure is simpler; GCS: google's problem; Swift: not used very much.)

If you look at object store & Spark (or indeed, any code which uses a filesystem as the source and dest of work), there are problems which can generally be grouped into various categories.

h3. Foundational: talking to the object stores

* Classpath & execution: can you wire the JARs up? A longstanding issue in ASF Spark releases (SPARK-5348, SPARK-12557). This was exacerbated by the movement of s3n:// to the hadoop-aws package (FWIW, I hadn't noticed that move; I'd have blocked it if I'd been paying attention). This includes transitive problems (SPARK-11413).
* Credential propagation. Spark's env var propagation is pretty cute here; SPARK-19739 picks up {{AWS_SESSION_TOKEN}} too.
* Diagnostics on failure is a real pain.

h3. Observable inconsistencies leading to data loss

Generally where the metaphor "it's just a filesystem" fails. These are bad because they often "just work", especially in dev & test with small datasets, and when they go wrong, they can fail by generating bad results *and nobody notices*.

* Expectations of consistent listing of "directories". S3Guard deals with this (HADOOP-13345), as can Netflix's S3mper and AWS's premium Dynamo-backed S3 storage.
* Expectations on the transacted nature of directory renames, the core atomic commit operation against full filesystems.
* Expectations that when things are deleted they go away. This does become visible sometimes, usually in checks for a destination not existing (SPARK-19013).
* Expectations that write-in-progress data is visible/flushed, and that {{close()}} is low cost. SPARK-19111.
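The "consistent listing" failure mode above can be illustrated with a toy model (entirely hypothetical, not any real client API): a store where newly written keys take a while to surface in listings, so a reader that lists a "directory" immediately after a write silently misses output.

```python
# Toy model of an eventually consistent object store listing: keys become
# visible to list() only some number of calls after they were written.
# This is how a job can "successfully" read a directory and miss fresh data.

class EventuallyConsistentStore:
    def __init__(self, list_lag=2):
        self.objects = {}         # key -> (value, tick when written)
        self.tick = 0
        self.list_lag = list_lag  # list() hides keys newer than this many ticks

    def put(self, key, value):
        self.objects[key] = (value, self.tick)

    def list(self, prefix):
        self.tick += 1
        return sorted(k for k, (_, t) in self.objects.items()
                      if k.startswith(prefix) and self.tick - t >= self.list_lag)

store = EventuallyConsistentStore()
store.put("job/part-0000", b"data")
first = store.list("job/")    # too soon: the new key is still invisible
second = store.list("job/")   # eventually it shows up
print(first, second)
```

No error is raised at any point, which is exactly why this class of bug goes unnoticed until the missing rows matter.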
Committing pretty much combines all of these; see below for more details.

h3. Aggressively bad performance

That's the mismatch between what the object store offers, what the apps expect, and the metaphor work in the Hadoop FileSystem implementations, which, in trying to hide the conceptual mismatch, can actually amplify the problem.

Example: directory tree scanning at the start of a query. The mock directory structure allows callers to do treewalks, when really a full list of all children can be done as a direct O(1) call. SPARK-17159 covers some of this for scanning directories in Spark Streaming, but there's a hidden tree walk in every call to {{FileSystem.globStatus()}} (HADOOP-13371). Given how S3Guard transforms this treewalk, and you need it for consistency, that's probably the best solution for now. Although I have a PoC which does a full List **/* followed by a filter, that's not viable when you have a wide, deep tree and do need to prune aggressively.

Checkpointing to object stores is similar: it's generally not dangerous to do the write+rename, it just adds the copy overhead, consistency issues notwithstanding.

h3. Suboptimal code

There are opportunities for speedup, but if something is not on the critical path, it's not worth the hassle. That said, as every call to {{getFileStatus()}} can take hundreds of millis, things get onto the critical path quite fast. Examples: checking for a file existing before calling {{fs.delete(path)}} (the delete is always a no-op if the dest path isn't there), and the equivalent on mkdirs: {{if (!fs.exists(dir)) fs.mkdirs(dir)}}. Hadoop 3.0 will help steer people on the path of righteousness there by deprecating a couple of methods which encourage inefficiencies (isFile/isDir).

h3. The commit problem

The full commit problem combines all of these: you need a consistent list of source data, your deleted destination path mustn't appear in listings, and the commit of each task must promote a task's work to the pending output of the job, while an abort must leave no trace of it. The final job commit must place data into the final destination; again, a job abort must not make any output visible. There's some ambiguity about what happens if task and job commits fail; generally the safest answer is "abort everything". Furthermore, nobody has any idea what to do if an {{abort()}} raises exceptions. Oh, and all of this must be fast.

Spark is no better or worse than the core MapReduce committers here, or that of Hive. Spark generally uses the Hadoop {{FileOutputFormat}} via the {{HadoopMapReduceCommitProtocol}}, directly or indirectly (e.g. {{ParquetOutputFormat}}), extracting its committer and casting it to {{FileOutputCommitter}}, primarily to get a working directory. This committer assumes the destination is a consistent FS, and uses renames when promoting task and job output, assuming that is so fast it doesn't even bother to log a message "about to rename". Hence the recurrent Sta
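A toy model (hypothetical, not any real filesystem API) of why the rename-based commit described above hurts on an object store: with no metadata-only rename available, "rename" is implemented as copy-then-delete, so commit cost scales with the bytes of output instead of being a near-constant metadata operation.

```python
# Toy model of rename-as-copy on an object store. A FileOutputCommitter-style
# flow writes task output under a temporary prefix, then "renames" it into
# place at commit; here that rename re-copies every byte of output.

class ObjectStore:
    def __init__(self):
        self.objects = {}       # key -> bytes
        self.bytes_copied = 0   # total bytes moved by "renames"

    def put(self, key, data):
        self.objects[key] = data

    def rename(self, src_prefix, dst_prefix):
        # No atomic directory rename: copy each object, then delete the source.
        for key in [k for k in self.objects if k.startswith(src_prefix)]:
            data = self.objects.pop(key)
            self.objects[dst_prefix + key[len(src_prefix):]] = data
            self.bytes_copied += len(data)

store = ObjectStore()
# Task attempts write under a temporary path...
store.put("out/_temporary/attempt_0/part-0000", b"x" * 1000)
store.put("out/_temporary/attempt_0/part-0001", b"x" * 1000)
# ...and "commit" promotes them by rename. On a real FS this is a cheap
# metadata operation; here the commit cost is proportional to the output size.
store.rename("out/_temporary/attempt_0/", "out/")
print(store.bytes_copied)
```

The copies also take time proportional to data size, during which a failure leaves the destination in a partially promoted state, which is the other half of why rename-based commit is unsafe against object stores.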
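The cost of the check-then-act patterns called out under "Suboptimal code" can be made concrete with a toy client (hypothetical, it just counts remote calls): since each {{exists()}} is a full remote round trip and {{delete()}} is already a no-op on a missing path, the guard simply doubles the cost.

```python
# Toy cost model: each exists()/delete() is one remote HTTP round trip,
# potentially hundreds of milliseconds against an object store.

class CountingStore:
    """Hypothetical store client that counts remote calls."""
    def __init__(self):
        self.calls = 0
        self.keys = set()

    def exists(self, key):
        self.calls += 1
        return key in self.keys

    def delete(self, key):
        # Deleting a missing key is already a no-op, so no guard is needed.
        self.calls += 1
        self.keys.discard(key)

def naive_cleanup(store, key):
    if store.exists(key):       # wasted round trip
        store.delete(key)

def direct_cleanup(store, key):
    store.delete(key)           # idempotent: just call it

s1, s2 = CountingStore(), CountingStore()
s1.keys.add("out/_temporary")
s2.keys.add("out/_temporary")
naive_cleanup(s1, "out/_temporary")
direct_cleanup(s2, "out/_temporary")
print(s1.calls, s2.calls)
```

The same reasoning applies to the {{exists()}}-before-{{mkdirs()}} guard: {{mkdirs()}} on an existing directory already succeeds, so the precondition check only adds latency.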
[jira] [Commented] (SPARK-7481) Add spark-hadoop-cloud module to pull in object store support
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15993465#comment-15993465 ]

Apache Spark commented on SPARK-7481:
---

User 'steveloughran' has created a pull request for this issue:
https://github.com/apache/spark/pull/17834