On 7 Sep 2017, at 18:36, Mcclintic, Abbi <ab...@amazon.com> wrote:
Thanks all, a couple of notes below.

Generally all our partitions are of equal size (i.e. on a normal day in this particular case I see 10 equally sized partitions of 2.8 GB). We see the problem both with and without repartitioning: in this example we are repartitioning to 10, but we also see it without any repartitioning, when the default partition count is 200. We know that data loss is occurring because we have a final quality gate that counts the number of rows and halts the process if we see too large a drop.

We have one use case where the data needs to be read on a local machine after processing and one use case of copying to Redshift. Regarding the Redshift copy, it gets a bit complicated owing to VPC and encryption requirements, so we haven’t looked into using the JDBC driver yet.

My understanding was that Amazon EMR does not support s3a <https://aws.amazon.com/premiumsupport/knowledge-center/emr-file-system-s3/>, but it may be worth looking into.

1. No, it doesn't.
2. You can't currently use s3a as a direct destination of work, because S3 is not consistent; not without a consistency layer on top (S3Guard, etc.).

We may also try a combination of writing to HDFS combined with s3distcp.

+1
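
For anyone following the thread, the row-count quality gate described above could look roughly like the sketch below. This is not the poster's actual code: the function name, the exception class, and the 1% default threshold are all made up for illustration, and the thread does not say what drop is considered "too large".

```python
class RowCountDropError(RuntimeError):
    """Raised when the output lost more rows than the gate allows (hypothetical name)."""


def check_row_count(expected_rows: int, actual_rows: int,
                    max_drop_fraction: float = 0.01) -> float:
    """Return the fractional row drop, or raise if it exceeds the threshold.

    max_drop_fraction is an assumed tolerance (1% by default); the real
    threshold used in the pipeline discussed here is not given in the thread.
    """
    if expected_rows <= 0:
        raise ValueError("expected_rows must be positive")

    # Fraction of rows missing from the output relative to the input.
    drop = (expected_rows - actual_rows) / expected_rows

    if drop > max_drop_fraction:
        # Halt the pipeline rather than publish a short output.
        raise RowCountDropError(
            f"row count dropped by {drop:.2%} "
            f"({expected_rows} -> {actual_rows}), limit is {max_drop_fraction:.2%}"
        )
    return drop
```

In a Spark job the two counts would presumably come from something like `df.count()` before the write and a count of the data read back from the destination, so that rows lost during the commit to S3 show up in the comparison.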