Re: Writing files to S3 without temporary directory

2017-12-01 Thread Steve Loughran
Hadoop trunk (i.e. 3.1 when it comes out) has the code to do zero-rename commits: http://steveloughran.blogspot.co.uk/2017/11/subatomic.html. If you want to play today, you can build Hadoop trunk and Spark master, plus a little glue JAR of mine, to get Parquet to play properly.
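For concreteness, a minimal sketch of what wiring up one of the new S3A committers might look like, based on the s3a committer documentation and the cloud-integration glue module; the exact configuration keys and class names may differ between builds, and the bucket name is a placeholder:

```scala
// Sketch only: enabling a "zero-rename" S3A committer on a Hadoop 3.1-based
// build of Spark. Class names follow the cloud-integration glue module and
// may differ by version; "my-bucket" is a placeholder.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-committer-demo")
  // pick one of the committers: "directory", "partitioned", or "magic"
  .config("spark.hadoop.fs.s3a.committer.name", "directory")
  // route Spark's commit protocol through the Hadoop PathOutputCommitter
  .config("spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  // let Parquet use the bound committer instead of its own
  .config("spark.sql.parquet.output.committer.class",
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()

// output is committed directly to the destination, with no rename
// of a _temporary directory
spark.range(1000).toDF("id").write.parquet("s3a://my-bucket/out/")
```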

Re: Writing files to S3 without temporary directory

2017-11-22 Thread Haoyuan Li
This blog / tutorial may be helpful for running Spark in the cloud with Alluxio. Best regards, Haoyuan
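As a rough illustration of the Alluxio route, a sketch assuming an Alluxio master at master:19998 whose under-store is the S3 bucket, with the Alluxio client JAR on the classpath; host, port, and paths are placeholders:

```scala
// Sketch only: writing through Alluxio instead of directly to S3.
// The commit-time rename happens inside Alluxio (a cheap metadata
// operation); Alluxio persists the data to the S3 under-store itself.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("alluxio-demo")
  // register the Alluxio Hadoop-compatible filesystem client
  .config("spark.hadoop.fs.alluxio.impl", "alluxio.hadoop.FileSystem")
  .getOrCreate()

spark.range(1000).toDF("id").write.parquet("alluxio://master:19998/out/")
```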

Re: Writing files to S3 without temporary directory

2017-11-21 Thread Jim Carroll
I got it working, and it's much faster. If someone else wants to try it: 1) I was already using the code from the Presto S3 Hadoop FileSystem implementation, modified to sever it from the rest of the Presto codebase. 2) I extended it and overrode the method "keyFromPath" so that anytime the Path …
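The truncated step above suggests rewriting keys so that part files land at their final destination. A hypothetical sketch of such a mapping (Presto's real keyFromPath is private, so this assumes a fork where it can be overridden; the layout handling is illustrative only):

```scala
// Hypothetical sketch of the "keyFromPath" override described above.
// FileOutputCommitter writes under <dest>/_temporary/<appAttempt>/<taskAttempt>/;
// dropping those intermediate segments makes part files land at <dest>
// directly, so the commit-time rename becomes a no-op.
object KeyMapping {
  def keyFromPath(path: org.apache.hadoop.fs.Path): String = {
    val raw = path.toUri.getPath.stripPrefix("/")
    val segments = raw.split("/").toIndexedSeq
    val i = segments.indexOf("_temporary")
    val kept =
      if (i < 0) segments
      // keep everything before _temporary plus the final file name,
      // skipping the attempt/task directories in between
      else segments.take(i) ++ segments.takeRight(1)
    kept.mkString("/")
  }
}
```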

Re: Writing files to S3 without temporary directory

2017-11-21 Thread Jim Carroll
It's not actually that tough. We already use a custom Hadoop FileSystem for S3, because when we started using Spark with S3 the native FileSystem was very unreliable. Ours is based on the code from Presto. (see …
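For reference, a sketch of how such a custom FileSystem gets plugged in; com.example.PrestoBasedS3FileSystem is a made-up class name standing in for the Presto-derived implementation:

```scala
// Sketch only: Hadoop maps a URI scheme to a FileSystem class via the
// fs.<scheme>.impl key. The class name below is a placeholder for the
// Presto-derived implementation described above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("custom-s3-fs")
  .config("spark.hadoop.fs.s3.impl", "com.example.PrestoBasedS3FileSystem")
  .getOrCreate()

// any s3:// path now resolves through the custom implementation
spark.read.parquet("s3://my-bucket/input/")
```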

Re: Writing files to S3 without temporary directory

2017-11-20 Thread lucas.g...@gmail.com
That sounds like a lot of work, and if I understand you correctly it sounds like a piece of middleware that already exists (I could be wrong): Alluxio? Good luck and let us know how it goes! Gary

Re: Writing files to S3 without temporary directory

2017-11-20 Thread Jim Carroll
Thanks. In the meantime I might just write a custom file system that maps writes to Parquet file parts to their final locations and then skips the move.

Re: Writing files to S3 without temporary directory

2017-11-20 Thread lucas.g...@gmail.com
You can expect to see some fixes for this sort of issue in the medium-term future (multiple months, probably not years). As Tayler notes, it's a Hadoop problem, not a Spark problem, so whichever version of Hadoop includes the fix will then wait for a Spark release to get built against it. Last …

Re: Writing files to S3 without temporary directory

2017-11-20 Thread Tayler Lawrence Jones
It is an open issue with the Hadoop file committer, not Spark. The simple workaround is to write to HDFS, then copy to S3. Netflix gave a talk about their custom output committer at the last Spark Summit, which is a clever, efficient way of doing that; I'd check it out on YouTube. They have open-sourced …
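A sketch of the write-to-HDFS-then-copy workaround (not the Netflix committer itself); paths and bucket are placeholders, and for large outputs a distributed copy such as distcp would be preferable to this single-process copy:

```scala
// Sketch only: commit to HDFS, where the temporary-directory rename is
// cheap, then copy the finished output to S3 in one pass.
import org.apache.hadoop.fs.{FileUtil, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hdfs-then-s3").getOrCreate()

// 1) write and commit on HDFS
val hdfsOut = "hdfs:///tmp/job-output"
spark.range(1000).toDF("id").write.parquet(hdfsOut)

// 2) copy the committed output to S3; FileUtil.copy runs in the driver,
// so use distcp instead for large datasets
val conf = spark.sparkContext.hadoopConfiguration
val src  = new Path(hdfsOut)
val dst  = new Path("s3a://my-bucket/job-output")
FileUtil.copy(src.getFileSystem(conf), src,
              dst.getFileSystem(conf), dst,
              /* deleteSource = */ false, conf)
```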

Re: Writing files to S3 without temporary directory

2017-11-20 Thread Jim Carroll
I have this exact issue. I was going to intercept the call in the filesystem if I had to (since we're using the S3 filesystem from Presto anyway), but if there's simply a way to do this correctly I'd much prefer it. This basically doubles the time to write Parquet files to S3.