Re: CSV write to S3 failing silently with partial completion

2017-09-27 Thread Mcclintic, Abbi
> Hi, Can you please let me know the following: 1. Why are you using JAVA? 2. The way you are creating the SPARK cluster 3. The way you are initiating SPARK session or cont…

Re: CSV write to S3 failing silently with partial completion

2017-09-11 Thread Gourav Sengupta
Hi, Can you please let me know the following: 1. Why are you using JAVA? 2. The way you are creating the SPARK cluster 3. The way you are initiating SPARK session or context 4. Are you able to query the data that is written to S3 using a SPARK dataframe and validate that the number of rows in the…

Re: CSV write to S3 failing silently with partial completion

2017-09-08 Thread Steve Loughran
On 7 Sep 2017, at 18:36, Mcclintic, Abbi wrote: Thanks all – a couple of notes below. Generally all our partitions are of equal size (i.e. on a normal day in this particular case I see 10 equally sized partitions of 2.8 GB). We see the problem with…

Re: CSV write to S3 failing silently with partial completion

2017-09-07 Thread Mcclintic, Abbi
Thanks all – a couple of notes below. Generally all our partitions are of equal size (i.e. on a normal day in this particular case I see 10 equally sized partitions of 2.8 GB). We see the problem with repartitioning and without – in this example we are repartitioning to 10, but we also see the…

Re: CSV write to S3 failing silently with partial completion

2017-09-07 Thread Patrick Alwell
Sounds like an S3 bug. Can you replicate locally with HDFS? Try using the S3A protocol too; there is a jar you can leverage like so: spark-submit --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 my_spark_program.py EMR can sometimes be buggy. :/ You could also try…
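
The spark-submit invocation from the message above, laid out as a command (the artifact versions are the ones quoted in the thread and must match your Hadoop build; the script name is the thread's placeholder):

```shell
# Pull in the S3A filesystem support at submit time so s3a:// paths work.
spark-submit \
  --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 \
  my_spark_program.py
```

With those packages on the classpath, writes can target s3a:// URIs instead of the older s3:// or s3n:// schemes.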

RE: CSV write to S3 failing silently with partial completion

2017-09-07 Thread JG Perrin
Are you assuming that all partitions are of equal size? Did you try with more partitions (like repartitioning)? Does the error always happen with the last (or smaller) file? If you are sending to Redshift, why not use the JDBC driver?