Re: CSV write to S3 failing silently with partial completion

2017-09-27 Thread Mcclintic, Abbi
Hi folks,
We appear to have mitigated the issue by adding the following configurations 
to our jobs, with a significant improvement in S3 consistency for both CSV and 
JSON (which initially turned out to be even worse than CSV):

spark.speculation=false
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1
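
A minimal sketch of how the same settings can be applied from the Java API, in
case it saves someone else the config hunt (the app name and the rest of the
job are placeholders):

    import org.apache.spark.sql.SparkSession;

    // Turn off speculative (duplicate) task attempts, and use the v1 output
    // committer, which defers renames to job commit instead of task commit.
    SparkSession spark = SparkSession.builder()
        .appName("my-csv-job")  // placeholder
        .config("spark.speculation", "false")
        .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
        .getOrCreate();

The same pair can also be passed to spark-submit via --conf key=value, which
avoids touching the job code.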

Still not really sure of the root cause, but this has at least stopped the 
bleeding for my team and so far hasn’t caused any large degradation in runtime 
for our jobs.

I’ve looked into the spark-redshift connector, but I don’t think it supports 
client-side encryption, which is a requirement for our data, and it wouldn’t 
solve the problem for our data used outside of Redshift.

Hope that helps someone else out if you hit the same issue.

-Abbi




Re: CSV write to S3 failing silently with partial completion

2017-09-11 Thread Gourav Sengupta
Hi,

Can you please let me know the following:
1. Why are you using Java?
2. The way you are creating the Spark cluster
3. The way you are initiating the Spark session or context
4. Are you able to query the data that is written to S3 using a Spark dataframe
and validate that the number of rows in the source is the same as the number
written to the target?
5. How are you loading the data into Redshift (cluster size, version, command,
compression, manifest file)?
6. If you use Redshift JDBC (https://github.com/databricks/spark-redshift), you
will have to play around with it a bit to understand how it works (be careful
that it does not drop the table in the target Redshift database).

Regards,
Gourav



Re: CSV write to S3 failing silently with partial completion

2017-09-08 Thread Steve Loughran

On 7 Sep 2017, at 18:36, Mcclintic, Abbi <ab...@amazon.com> wrote:

> My understanding was that Amazon EMR does not support s3a
> <https://aws.amazon.com/premiumsupport/knowledge-center/emr-file-system-s3/>,
> but it may be worth looking into.

1. No, it doesn't.
2. You can't currently use s3a as a direct destination for work, because S3 is
not consistent; you'd need a consistency layer on top (S3Guard, etc.).

> We may also try a combination of writing to HDFS combined with s3distcp.


+1
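
A rough sketch of that pattern, assuming an EMR cluster (paths and bucket are
placeholders): write to cluster-local HDFS, where renames are atomic, then push
the finished output to S3 in a single copy step:

    // In the Spark job: write to HDFS instead of directly to S3.
    df.write().csv("hdfs:///tmp/my_output");

    // Then, from the master node or as an EMR step (a shell command, not Java):
    // s3-dist-cp --src hdfs:///tmp/my_output --dest s3://some-bucket/some_location

A partially failed task then never leaves a truncated file in the bucket, since
the S3 upload only happens after the job has committed on HDFS.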




Re: CSV write to S3 failing silently with partial completion

2017-09-07 Thread Mcclintic, Abbi
Thanks all – a couple of notes below.

Generally all our partitions are of equal size (i.e., on a normal day in this 
particular case I see 10 equally sized partitions of 2.8 GB). We see the 
problem both with repartitioning and without: in this example we are 
repartitioning to 10, but we also see the problem without any repartitioning, 
when the default partition count is 200. We know that data loss is occurring 
because we have a final quality gate that counts the number of rows and halts 
the process if we see too large a drop.
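
(The gate itself is nothing fancy; a sketch of the idea in the Java API, with
the threshold and paths as placeholders, and assuming df is the Dataset we
wrote and spark is the active session:)

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Read back what landed in S3 and compare row counts against what we wrote.
    long expected = df.count();
    Dataset<Row> readBack = spark.read().csv("s3://some-bucket/some_location");
    long actual = readBack.count();
    if (actual < expected * 0.99) {  // placeholder threshold: halt on >1% loss
        throw new IllegalStateException("Partial write detected: expected "
            + expected + " rows, found " + actual);
    }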



We have one use case where the data needs to be read on a local machine after 
processing, and one use case of copying to Redshift. Regarding the Redshift 
copy, it gets a bit complicated owing to VPC and encryption requirements, so we 
haven’t looked into using the JDBC driver yet.



My understanding was that Amazon EMR does not support s3a
<https://aws.amazon.com/premiumsupport/knowledge-center/emr-file-system-s3/>,
but it may be worth looking into. We may also try a combination of writing to 
HDFS combined with s3distcp.



Thanks,



Abbi










Re: CSV write to S3 failing silently with partial completion

2017-09-07 Thread Patrick Alwell
Sounds like an S3 bug. Can you replicate locally with HDFS?

Try using the s3a protocol too; there are jars you can leverage, like so: 
spark-submit --packages 
com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 
my_spark_program.py
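
(If the job is the Java one from the original post, the write itself would then
just target an s3a:// URI; a one-line sketch against the same placeholder path:)

    df.write().csv("s3a://some-bucket/some_location");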

EMR can sometimes be buggy. :/

You could also try leveraging EC2 nodes and manually creating a cluster with 
passwordless SSH.

But I feel your pain, man; I’ve had weird issues with Redshift and EMR as well.

Let me know whether or not you can replicate locally; I can bring it up with 
our S3 team for the next release of HDP, and we can file a bug with AWS.

-Pat



RE: CSV write to S3 failing silently with partial completion

2017-09-07 Thread JG Perrin
Are you assuming that all partitions are of equal size? Did you try with more 
partitions (like repartitioning)? Does the error always happen with the last 
(or smaller) file? If you are sending to Redshift, why not use the JDBC driver?




CSV write to S3 failing silently with partial completion

2017-09-07 Thread abbim
Hi all,
My team has been experiencing a recurring unpredictable bug where only a
partial write to CSV in S3 on one partition of our Dataset is performed. For
example, in a Dataset of 10 partitions written to CSV in S3, we might see 9
of the partitions as 2.8 GB in size, but one of them as 1.6 GB. However, the
job does not exit with an error code.

This becomes problematic in the following ways:
1. When we copy the data to Redshift, we get a bad decrypt error on the
partial file, suggesting that the failure occurred at a weird byte in the
file. 
2. We lose data - sometimes as much as 10%.

We don't see this problem with the Parquet format, which we also use, but moving
all of our data to Parquet is not currently feasible. We're using the Java
API with Spark 2.2 and Amazon EMR 5.8; the code is as simple as this:
df.write().csv("s3://some-bucket/some_location"). We're experiencing the
issue 1-3x/week on a daily job and are unable to reliably reproduce the
problem.

Any thoughts on why we might be seeing this and how to resolve?
Thanks in advance.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
