Re: Spark <--> S3 flakiness

2017-05-18 Thread Steve Loughran
On 18 May 2017, at 05:29, lucas.g...@gmail.com wrote: Steve, just to clarify: "FWIW, if you can move up to the Hadoop 2.8 version of the S3A client it is way better on high-performance reads, especially if you are working with column data and can set the

Re: Spark <--> S3 flakiness

2017-05-17 Thread lucas.g...@gmail.com
Steve, just to clarify: "FWIW, if you can move up to the Hadoop 2.8 version of the S3A client it is way better on high-performance reads, especially if you are working with column data and can set the fs.s3a.experimental.fadvise=random option. " Are you talking about the hadoop-aws lib or hadoop
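For readers following along, the option Steve mentions is a Hadoop S3A client setting, not a Spark-native one; with Hadoop 2.8+ on the classpath it can be passed through Spark's `spark.hadoop.*` prefix. A minimal sketch (file location and surrounding settings are illustrative):

```
# spark-defaults.conf — sketch; requires the Hadoop 2.8+ hadoop-aws / S3A client
# "random" fadvise favors columnar (Parquet/ORC) seek-heavy reads over full scans
spark.hadoop.fs.s3a.experimental.fadvise  random
```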

Re: Spark <--> S3 flakiness

2017-05-17 Thread Steve Loughran
On 17 May 2017, at 06:00, lucas.g...@gmail.com wrote: Steve, thanks for the reply. Digging through all the documentation now. Much appreciated! FWIW, if you can move up to the Hadoop 2.8 version of the S3A client it is way better on high-performance reads,

Re: Spark <--> S3 flakiness

2017-05-16 Thread lucas.g...@gmail.com
Steve, thanks for the reply. Digging through all the documentation now. Much appreciated! On 16 May 2017 at 10:10, Steve Loughran wrote: > > On 11 May 2017, at 06:07, lucas.g...@gmail.com wrote: > > Hi users, we have a bunch of pyspark jobs that are using S3 for

Re: Spark <--> S3 flakiness

2017-05-16 Thread Steve Loughran
On 11 May 2017, at 06:07, lucas.g...@gmail.com wrote: Hi users, we have a bunch of pyspark jobs that are using S3 for loading / intermediate steps and final output of parquet files. Please don't, not without a committer specially written to work against S3 in the

Re: Spark <--> S3 flakiness

2017-05-14 Thread Gourav Sengupta
Are you running EMR? On Sun, May 14, 2017 at 4:59 AM, Miguel Morales wrote: > Some things just didn't work as i had first expected it. For example, > when writing from a spark collection to an alluxio destination didn't > persist them to s3 automatically. > > I
(Quoted snippet lightly edited for capitalization.)

Re: Spark <--> S3 flakiness

2017-05-13 Thread Miguel Morales
Some things just didn't work as I had first expected. For example, writing a Spark collection to an Alluxio destination didn't automatically persist the files to S3. I remember having to use the Alluxio library directly to force the files to persist to S3 after Spark finished writing to

Re: Spark <--> S3 flakiness

2017-05-12 Thread Gene Pang
Hi, Yes, you can use Alluxio with Spark to read/write to S3. Here is a blog post on Spark + Alluxio + S3 , and here is some documentation for configuring Alluxio + S3
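The usual pattern Gene describes is to point Spark at `alluxio://` paths while Alluxio itself is configured with S3 as its under-store. A sketch only — master hostname, bucket, and paths below are hypothetical, and property names should be checked against the Alluxio docs for your version:

```
# alluxio-site.properties — sketch; bucket name is hypothetical
alluxio.underfs.address=s3a://my-bucket/alluxio
```

Spark then reads and writes through the Alluxio namespace, e.g. `spark.read.parquet("alluxio://alluxio-master:19998/data/input")`, with Alluxio handling persistence to S3 (subject to the write-type caveats Miguel raises later in the thread).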

Re: Spark <--> S3 flakiness

2017-05-11 Thread Vadim Semenov
Use the official mailing list archive http://mail-archives.apache.org/mod_mbox/spark-user/201705.mbox/%3ccajyeq0gh1fbhbajb9gghognhqouogydba28lnn262hfzzgf...@mail.gmail.com%3e On Thu, May 11, 2017 at 2:50 PM, lucas.g...@gmail.com wrote: > Also, and this is unrelated to the

Re: Spark <--> S3 flakiness

2017-05-11 Thread Miguel Morales
Might want to try to use gzip as opposed to parquet. The only way I ever reliably got parquet to work on S3 is by using Alluxio as a buffer, but it's a decent amount of work. On Thu, May 11, 2017 at 11:50 AM, lucas.g...@gmail.com wrote: > Also, and this is unrelated to the

Re: Spark <--> S3 flakiness

2017-05-11 Thread lucas.g...@gmail.com
Also, and this is unrelated to the actual question... Why don't these messages show up in the archive? http://apache-spark-user-list.1001560.n3.nabble.com/ Ideally I'd want to post a link to our internal wiki for these questions, but can't find them in the archive. On 11 May 2017 at 07:16,

Re: Spark <--> S3 flakiness

2017-05-11 Thread lucas.g...@gmail.com
Looks like this isn't viable in Spark 2.0.0 (and greater, I presume). I'm pretty sure I came across this blog and ignored it due to that. Any other thoughts? The linked tickets in: https://issues.apache.org/jira/browse/SPARK-10063 https://issues.apache.org/jira/browse/HADOOP-13786

Re: Spark <--> S3 flakiness

2017-05-10 Thread Miguel Morales
Try using the DirectParquetOutputCommitter: http://dev.sortable.com/spark-directparquetoutputcommitter/ On Wed, May 10, 2017 at 10:07 PM, lucas.g...@gmail.com wrote: > Hi users, we have a bunch of pyspark jobs that are using S3 for loading / > intermediate steps and final
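For context, this committer was enabled via a Spark SQL config in the 1.x line; as noted later in the thread it was removed in Spark 2.0 (SPARK-10063), so this is a Spark 1.x-only sketch:

```
# spark-defaults.conf — Spark 1.x only; class was removed in Spark 2.0 (SPARK-10063)
spark.sql.parquet.output.committer.class  org.apache.spark.sql.parquet.DirectParquetOutputCommitter
```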

Spark <--> S3 flakiness

2017-05-10 Thread lucas.g...@gmail.com
Hi users, we have a bunch of pyspark jobs that are using S3 for loading / intermediate steps and final output of parquet files. We're running into the following issues on a semi-regular basis: * These are intermittent errors, i.e. we have about 300 jobs that run nightly... And a fairly random but
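Intermittent, non-deterministic failures like these are the classic symptom of S3's transient errors and (at the time) eventual consistency. Independent of the committer/fadvise fixes discussed above, one generic mitigation for driver-side S3 calls is a retry-with-exponential-backoff wrapper. A sketch only — not Spark-specific, and all names here are illustrative:

```python
import random
import time


def with_retries(fn, attempts=4, base_delay=0.5):
    """Call fn(), retrying transient IOErrors with jittered exponential
    backoff. Generic sketch for flaky S3 reads/listings; tune the
    exception type and delays for your client library."""
    for attempt in range(attempts):
        try:
            return fn()
        except IOError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the original error
            # sleep base * 2^attempt, with jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

This won't fix a broken commit protocol (rename-based committers on S3 remain unsafe, as Steve notes), but it does smooth over one-off read and listing failures in setup code around the Spark jobs.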