[jira] [Commented] (SPARK-7481) Add spark-hadoop-cloud module to pull in object store support

2017-05-08 Thread Steve Loughran (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16000502#comment-16000502 ]

Steve Loughran commented on SPARK-7481:
---

thank you!

> Add spark-hadoop-cloud module to pull in object store support
> -
>
> Key: SPARK-7481
> URL: https://issues.apache.org/jira/browse/SPARK-7481
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
> Fix For: 2.3.0
>
>
> To keep the s3n classpath right, to add s3a, swift & azure, the dependencies 
> of spark in a 2.6+ profile need to add the relevant object store packages 
> (hadoop-aws, hadoop-openstack, hadoop-azure)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7481) Add spark-hadoop-cloud module to pull in object store support

2017-04-24 Thread Steve Loughran (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981602#comment-15981602 ]

Steve Loughran commented on SPARK-7481:
---

One thing I want to emphasise here is: I have no loyalty to my code. I just 
want packagings of Spark, and applications pulling it in via Maven/SBT, to be 
able to get a consistent set of artifacts needed to successfully interact with 
object stores as a source of data. Most of the work related to spark/object 
store integration I can do elsewhere, such as in Apache Bahir (integration) and 
on GitHub. It's just that classpath setup which you can't really do downstream, 
as it depends on getting the combination of things like Spark, Hadoop, the 
aws-sdk and Jackson all 100% consistent.

That's all I care about. And I don't care if someone else does it, as long as 
the patch works for current and future versions of hadoop/aws-SDK. If someone 
else does it, I'll gladly test that stuff downstream.

But Spark does need that integration. It had some in the past, when s3n was 
implemented in hadoop-common, but that's been gone since things were moved in 
Hadoop 2.6. I personally think it should go back in, as, implicitly, so does 
everyone whose downstream spark-based product includes a set of the cloud 
storage clients and JARs 100% in sync with the rest of their product's 
artifacts. 

So: does anyone have any alternative designs? The easiest would be to add it to 
spark-core itself, but that's got consequences if people ship anything built on 
the shaded-AWS JAR (the one which fixes its jackson inconsistencies 
internally), as it adds tens of MB to everything pulling in spark-core. A 
separate module is the way to manage this. Which is pretty much all the final 
version of the patch is.



[jira] [Commented] (SPARK-7481) Add spark-hadoop-cloud module to pull in object store support

2017-04-25 Thread Steven Rand (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15982587#comment-15982587 ]

Steven Rand commented on SPARK-7481:


What happened to https://github.com/apache/spark/pull/12004? It doesn't look 
like there were any concrete objections to the changes made there -- was it 
just closed for lack of a reviewer?

As someone who has spent several tens of hours (and counting!) debugging 
classpath issues for Spark applications that read from and write to an object 
store, I think this change is hugely valuable. I suspect that the large number 
of votes and watchers indicates that others think this as well, so it'd be 
pretty depressing if it didn't happen just because no one will review the 
patch. Unfortunately I'm not qualified to review it myself, but I'd be quite 
grateful if someone more competent were to do so.



[jira] [Commented] (SPARK-7481) Add spark-hadoop-cloud module to pull in object store support

2017-04-25 Thread Sean Owen (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15982591#comment-15982591 ]

Sean Owen commented on SPARK-7481:
--

I don't believe my last round of comments were addressed, and it was one of 
quite a lot of rounds. This is a real problem.



[jira] [Commented] (SPARK-7481) Add spark-hadoop-cloud module to pull in object store support

2017-04-26 Thread Steve Loughran (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984771#comment-15984771 ]

Steve Loughran commented on SPARK-7481:
---

I think we ended up going in circles on that PR. Sean has actually been very 
tolerant of me; however, progress was hampered by my full-time focus on other 
things. I've only had time to work on the Spark PR intermittently, and that's 
been hard for everyone: me in the rebase/retest cycle, and the one reviewer in 
having to catch up again each time.

Now, anyone who does manage to get that classpath right will discover that S3A 
absolutely flies with Spark: in partitioning (list-file improvements), data 
input (set fadvise=random for ORC and Parquet), and output (set 
fast.upload=true, play with the pool options). It delivers that performance 
because this patch sets things up for the integration tests, downstream of this 
patch, so I and others can be confident that things actually work, at speed, 
at scale. Indeed, much of the S3A performance work was actually based on Hive 
and Spark workloads: the data formats & their seek patterns, directory layouts, 
file generation. All that's left is the little problem of getting the classpath 
right. Oh, and the committer.


For now, for people's enjoyment, here are some videos from Spark Summit East on 
the topic:

* [Spark and object stores|https://youtu.be/8F2Jqw5_OnI]
* [Robust and Scalable ETL over Cloud Storage with 
Spark|https://spark-summit.org/east-2017/events/robust-and-scalable-etl-over-cloud-storage-with-spark/]





[jira] [Commented] (SPARK-7481) Add spark-hadoop-cloud module to pull in object store support

2017-04-26 Thread Steve Loughran (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15985040#comment-15985040 ]

Steve Loughran commented on SPARK-7481:
---

(This is a fairly long comment, but it tries to summarise the entire state of 
interaction with object stores, especially S3A on Hadoop 2.8+. Azure is 
simpler; GCS is Google's problem; Swift isn't used very much.)

If you look at object stores & Spark (or indeed, any code which uses a 
filesystem as the source and destination of work), the problems can generally 
be grouped into a few categories.

h3. Foundational: talking to the object stores

Classpath & execution: can you wire the JARs up? A longstanding issue in ASF 
Spark releases (SPARK-5348, SPARK-12557). This was exacerbated by the movement 
of s3n:// to the hadoop-aws package (FWIW, I hadn't noticed that move; I'd have 
blocked it if I'd been paying attention). This includes transitive problems 
(SPARK-11413).

Credential propagation. Spark's env var propagation is pretty cute here; 
SPARK-19739 picks up {{AWS_SESSION_TOKEN}} too. Diagnostics on failure are a 
real pain.
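The env-var pickup can be modelled roughly as below. (An editorial sketch, not Spark's actual code path; only the variable names, including the {{AWS_SESSION_TOKEN}} that SPARK-19739 adds, come from the comment above.)

```python
import os

# The standard AWS credential variables; AWS_SESSION_TOKEN is the one
# SPARK-19739 adds to the set that gets propagated.
AWS_ENV_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN")

def collect_aws_credentials(env=None):
    """Pick up whichever AWS credential variables are set, for forwarding on."""
    env = os.environ if env is None else env
    return {k: env[k] for k in AWS_ENV_VARS if k in env}

# A session (temporary) credential triple; unrelated variables are ignored.
creds = collect_aws_credentials({
    "AWS_ACCESS_KEY_ID": "AKIAEXAMPLE",
    "AWS_SECRET_ACCESS_KEY": "secret",
    "AWS_SESSION_TOKEN": "token",
    "HOME": "/home/alice",
})
```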


h3. Observable Inconsistencies leading to Data loss

Generally where the metaphor "it's just a filesystem" fails. These are bad 
because they often "just work", especially in dev & test with small datasets, 
and when they go wrong, they can fail by generating bad results *and nobody 
notices*.

* Expectations of consistent listing of "directories". S3Guard (HADOOP-13345) 
deals with this, as can Netflix's S3mper and AWS's premium Dynamo-backed S3 
storage.
* Expectations about the transacted nature of directory renames, the core 
atomic commit operation against full filesystems.
* Expectations that when things are deleted, they go away. This does become 
visible sometimes, usually in checks for a destination not existing 
(SPARK-19013).
* Expectations that write-in-progress data is visible/flushed, and that 
{{close()}} is low cost (SPARK-19111).

Committing pretty much combines all of these, see below for more details.
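To make the listing-consistency hazard concrete, here is a toy Python model. (Editorial illustration only: the lag behaviour mimics a pre-S3Guard S3-style listing delay; nothing here is Hadoop code.)

```python
class EventuallyConsistentStore:
    """Toy object store whose LIST lags PUT: a freshly written key is
    readable immediately but only appears in listings after `lag` further
    listing operations -- the "just works in dev, loses data at scale" trap."""

    def __init__(self, lag=2):
        self.objects = {}
        self.pending = {}   # key -> listings remaining before it is visible
        self.lag = lag

    def put(self, key, data):
        self.objects[key] = data
        self.pending[key] = self.lag

    def list_keys(self):
        # Age the pending entries, then list only the "settled" ones.
        for k in list(self.pending):
            self.pending[k] -= 1
            if self.pending[k] <= 0:
                del self.pending[k]
        return sorted(k for k in self.objects if k not in self.pending)

store = EventuallyConsistentStore(lag=2)
store.put("out/part-0000", b"rows")
first = store.list_keys()    # too early: the part file is missing, silently
second = store.list_keys()   # later, the listing catches up
```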

h3. Aggressively bad performance

That's the mismatch between what the object store offers, what the apps expect, 
and the metaphor work in the Hadooop FileSystem implementations, which, in 
trying to hide the conceptual mismatch can actually amplify the problem. 
Example: Directory tree scanning at the start of a query. The mock directory 
structure allows callers to do treewalks, when really a full list of all 
children can be done as a direct O(1) call. SPARK-17159 covers some of this for 
scanning directories in Spark Streaming, but there's a hidden tree walk in 
every call to {{FileSystem.globStatus()}} (HADOOP-13371). Given how S3Guard 
transforms this treewalk, and you need it for consistency, that's probably the 
best solution for now. Although I have a PoC which does a full List **/* 
followed by a filter, that's not viable when you have a wide deep tree and do 
need to prune aggressively.
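A toy model of why the hidden treewalk hurts. (Editorial sketch: {{MockObjectStore}} and its call counting are invented for illustration; the point is one LIST round trip per "directory" versus a single request for the whole tree.)

```python
class MockObjectStore:
    """Flat key space posing as a directory tree; counts LIST requests so
    the cost of a treewalk versus one flat listing is visible."""

    def __init__(self, keys):
        self.keys = keys
        self.list_calls = 0

    def list_prefix(self, prefix, recursive):
        self.list_calls += 1
        if recursive:
            # One request returns every key under the prefix.
            return [k for k in self.keys if k.startswith(prefix)]
        # Non-recursive: immediate children only; "directories" get a "/".
        children = set()
        for k in self.keys:
            if k.startswith(prefix):
                head, *rest = k[len(prefix):].split("/", 1)
                children.add(prefix + head + ("/" if rest else ""))
        return sorted(children)

def treewalk(store, prefix):
    """One LIST per 'directory', as a globStatus()-style walk does."""
    files = []
    for entry in store.list_prefix(prefix, recursive=False):
        if entry.endswith("/"):
            files.extend(treewalk(store, entry))
        else:
            files.append(entry)
    return files

KEYS = ["data/y=2017/m=04/part-0", "data/y=2017/m=05/part-0"]
walk_store, flat_store = MockObjectStore(KEYS), MockObjectStore(KEYS)
walked = treewalk(walk_store, "data/")                  # one LIST per directory
flat = flat_store.list_prefix("data/", recursive=True)  # one LIST in total
```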

Checkpointing to object stores is similar: it's generally not dangerous to do 
the write+rename, just adds the copy overhead, consistency issues 
notwithstanding.


h3. Suboptimal code. 

There are opportunities for speedup, but if it's not on the critical path, it's 
not worth the hassle. That said, as every call to {{getFileStatus()}} can take 
hundreds of millis, things get onto the critical path quite fast. Examples: 
checking for a file existing before calling {{fs.delete(path)}} (delete is 
always a no-op if the path isn't there), and the equivalent on mkdirs: {{if 
(!fs.exists(dir)) fs.mkdirs(dir)}}. Hadoop 3.0 will help steer people onto the 
path of righteousness here by deprecating a couple of methods which encourage 
inefficiencies (isFile/isDir).
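The exists-then-delete anti-pattern, counted out in a toy model. (Editorial sketch; {{CountingFS}} is invented for illustration, with each metadata call standing in for one HTTP round trip of hundreds of milliseconds against S3.)

```python
class CountingFS:
    """Mock filesystem where every metadata operation counts as one
    (slow) HTTP round trip."""

    def __init__(self, paths=()):
        self.paths = set(paths)
        self.calls = 0

    def exists(self, path):          # ~ one getFileStatus round trip
        self.calls += 1
        return path in self.paths

    def delete(self, path):          # already a no-op on a missing path
        self.calls += 1
        self.paths.discard(path)
        return True

# Anti-pattern: probe first, then delete -- two round trips.
fs = CountingFS(["out/_temporary"])
if fs.exists("out/_temporary"):
    fs.delete("out/_temporary")
redundant_calls = fs.calls

# Just delete: same end state, half the round trips.
fs2 = CountingFS(["out/_temporary"])
fs2.delete("out/_temporary")
direct_calls = fs2.calls
```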


h3. The commit problem 

The full commit problem combines all of these: you need a consistent listing of 
source data; your deleted destination path mustn't appear in listings; the 
commit of each task must promote that task's work to the pending output of the 
job, and an abort must leave no trace of it. The final job commit must place 
data into the final destination; again, a job abort must not make any output 
visible. There's some ambiguity about what happens if task and job commits 
fail; generally the safest answer is "abort everything". Furthermore, nobody 
has any idea what to do if an {{abort()}} raises exceptions. Oh, and all of 
this must be fast. Spark is no better or worse than the core MapReduce 
committers here, or those of Hive.
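The contract just described can be sketched as a toy rename-based committer. (Editorial illustration of the pattern only, not the actual {{FileOutputCommitter}} code; all the path names are invented. On a real filesystem each rename is an O(1) atomic move; on an object store it's a copy of every byte.)

```python
class RenameCommitter:
    """Toy commit protocol: tasks write under an attempt directory, task
    commit renames into the job's pending area, job commit renames into
    the final destination; aborts leave no trace."""

    def __init__(self):
        self.store = {}   # path -> data, a flat mock filesystem

    def write(self, task, name, data):
        self.store[f"dest/_temporary/attempt_{task}/{name}"] = data

    def commit_task(self, task):
        # Promote the task attempt's output to the job's pending set.
        self._rename(f"dest/_temporary/attempt_{task}/", f"dest/_pending/{task}/")

    def abort_task(self, task):
        prefix = f"dest/_temporary/attempt_{task}/"
        for p in [p for p in self.store if p.startswith(prefix)]:
            del self.store[p]

    def commit_job(self):
        # Promote all pending task output to the final destination.
        self._rename("dest/_pending/", "dest/")

    def _rename(self, src, dst):
        for p in [p for p in self.store if p.startswith(src)]:
            self.store[dst + p[len(src):]] = self.store.pop(p)

c = RenameCommitter()
c.write(0, "part-0000", b"rows")
c.write(1, "part-0001", b"speculative rows")
c.commit_task(0)
c.abort_task(1)     # the failed/speculative task leaves no trace
c.commit_job()      # only committed task output reaches the destination
```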

Spark generally uses the Hadoop {{FileOutputFormat}} via the 
{{HadoopMapReduceCommitProtocol}}, directly or indirectly (e.g. 
{{ParquetOutputFormat}}), extracting its committer and casting it to 
{{FileOutputCommitter}}, primarily to get a working directory. This committer 
assumes the destination is a consistent FS and uses renames when promoting task 
and job output, assuming rename is so fast it doesn't even bother to log an 
"about to rename" message. Hence the recurrent Sta

[jira] [Commented] (SPARK-7481) Add spark-hadoop-cloud module to pull in object store support

2017-05-02 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15993465#comment-15993465 ]

Apache Spark commented on SPARK-7481:
-

User 'steveloughran' has created a pull request for this issue:
https://github.com/apache/spark/pull/17834
