On 7 Sep 2017, at 18:36, Mcclintic, Abbi <ab...@amazon.com> wrote:

Thanks all – a couple of notes below.

Generally all our partitions are of equal size (i.e. on a normal day in this
particular case I see 10 equally sized partitions of 2.8 GB). We see the
problem both with and without repartitioning: in this example we are
repartitioning to 10, but we also see the problem without any repartitioning,
when the default partition count is 200. We know that data loss is occurring
because we have a final quality gate that counts the number of rows and halts
the process if we see too large a drop.
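
A minimal sketch of a row-count gate of the kind described above. This is not
the actual gate from this pipeline: the function name, the 1% default threshold
and the exception type are all hypothetical. In a Spark job the two counts
would typically come from calling count() on the input and output DataFrames.

```python
def check_row_count(rows_in: int, rows_out: int,
                    max_drop_fraction: float = 0.01) -> float:
    """Return the fractional row drop; raise if it exceeds the threshold.

    All names and the default threshold are illustrative assumptions.
    """
    if rows_in < 0 or rows_out < 0:
        raise ValueError("row counts must be non-negative")
    if rows_in == 0:
        # Nothing was read, so there is nothing that could have been lost.
        return 0.0
    drop = (rows_in - rows_out) / rows_in
    if drop > max_drop_fraction:
        # Halt the pipeline rather than publish a truncated data set.
        raise RuntimeError(
            f"row count dropped by {drop:.2%} "
            f"({rows_in} -> {rows_out}), limit is {max_drop_fraction:.2%}"
        )
    return drop
```

Running the check against the files as read back from the final destination,
rather than against the in-memory result, is what would let a gate like this
catch output that was lost at write or commit time.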

We have one use case where the data needs to be read on a local machine after
processing and one use case of copying to Redshift. Regarding the Redshift
copy, it gets a bit complicated owing to VPC and encryption requirements, so we
haven’t looked into using the JDBC driver yet.

My understanding was that Amazon EMR does not support s3a
(https://aws.amazon.com/premiumsupport/knowledge-center/emr-file-system-s3/),
but it may be worth looking into.

1. No, it doesn't.
2. You can't currently use s3a as a direct destination of work, because S3 is
not consistent; you would need a consistency layer on top (S3Guard, etc.).

We may also try writing to HDFS and then copying to S3 with s3distcp.


+1

