Sounds like an S3 bug. Can you replicate locally with HDFS?

Try using the s3a protocol too; there's a hadoop-aws connector jar you can pull in like so:
spark-submit --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 my_spark_program.py
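
Once those jars are on the classpath, the change is mostly the URI scheme. A quick untested sketch (paths are placeholders, and df is whatever Dataset you're writing):

// Same write, but through the s3a connector instead of EMR's s3:// filesystem.
df.write().csv("s3a://some-bucket/some_location");

// And for a local repro against HDFS, only the scheme changes again:
df.write().csv("hdfs:///tmp/csv_write_repro");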

EMR can sometimes be buggy. :/

You could also try spinning up your own cluster on plain EC2 nodes, configured manually with passwordless SSH.

But I feel your pain, man; I’ve had weird issues with Redshift and EMR as well.

Let me know whether or not you can replicate it locally; I can bring it up with our S3 team for the next HDP release, and we can file a bug with AWS.

-Pat

On 9/7/17, 2:59 AM, "JG Perrin" <jper...@lumeris.com> wrote:

    Are you assuming that all partitions are of equal size? Did you try with more partitions (like repartitioning)? Does the error always happen with the last (or smaller) file? If you are sending to Redshift, why not use the JDBC driver?
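    Roughly like this (untested; the endpoint, table, and credentials are placeholders, and it assumes the Redshift JDBC driver jar is on the classpath):

    // Even out partition sizes before the CSV write (the partition count here is arbitrary).
    df.repartition(20).write().csv("s3://some-bucket/some_location");

    // Or skip the CSV-in-S3 hop and load Redshift directly over JDBC.
    df.write()
      .format("jdbc")
      .option("driver", "com.amazon.redshift.jdbc42.Driver")       // assumes the Redshift JDBC 4.2 driver
      .option("url", "jdbc:redshift://example-endpoint:5439/dev")  // placeholder endpoint/database
      .option("dbtable", "public.my_table")                        // placeholder table
      .option("user", "db_user")                                   // placeholder credentials
      .option("password", "db_password")
      .mode("append")
      .save();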
    
    -----Original Message-----
    From: abbim [mailto:ab...@amazon.com] 
    Sent: Thursday, September 07, 2017 1:02 AM
    To: user@spark.apache.org
    Subject: CSV write to S3 failing silently with partial completion
    
    Hi all,
    My team has been hitting a recurring, unpredictable bug where the write to CSV in S3 completes only partially for one partition of our Dataset. For example, with a Dataset of 10 partitions written to CSV in S3, we might see 9 of the partitions at 2.8 GB each, but one of them at only 1.6 GB. However, the job does not exit with an error code.
    
    This becomes problematic in the following ways:
    1. When we copy the data to Redshift, we get a bad decrypt error on the partial file, suggesting that the write was cut off at an arbitrary byte in the file.
    2. We lose data - sometimes as much as 10%.
    
    We don't see this problem with the Parquet format, which we also use, but moving all of our data to Parquet is not currently feasible. We're using the Java API with Spark 2.2 on Amazon EMR 5.8, and the code is as simple as this: df.write().csv("s3://some-bucket/some_location"). We're experiencing the issue 1-3x/week on a daily job and are unable to reliably reproduce the problem.
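    For reference, a stripped-down version of the job (the input path and format below are placeholders; the real Dataset comes out of earlier pipeline stages):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class CsvS3Write {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("csv-s3-write").getOrCreate();
            // Placeholder input; the real Dataset is produced upstream.
            Dataset<Row> df = spark.read().parquet("s3://some-bucket/input/");
            // The write that intermittently leaves one partition truncated.
            df.write().csv("s3://some-bucket/some_location");
            spark.stop();
        }
    }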
    
    Any thoughts on why we might be seeing this and how to resolve it?
    Thanks in advance.
    
    
    
    
