Are you sure that you are actually using S3A? The EMR documentation says that S3A is not supported:
https://aws.amazon.com/premiumsupport/knowledge-center/emr-file-system-s3/

> Amazon EMR does not currently support use of the Apache Hadoop S3A file system.

I think that the HEAD requests come from the `createBucketIfNotExists` in the AWS S3 library that checks if the bucket exists every time you do a PUT request, i.e. creates a HEAD request. You can disable that by setting `fs.s3.buckets.create.enabled` to `false`

http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-upload-s3.html

On Thu, Jun 29, 2017 at 4:56 PM, Everett Anderson <ever...@nuna.com.invalid> wrote:

> Hi,
>
> We're using Spark 2.0.2 + Hadoop 2.7.3 on AWS EMR with S3A for direct I/O
> from/to S3 from our Spark jobs. We set
> mapreduce.fileoutputcommitter.algorithm.version=2 and are using encrypted S3
> buckets.
>
> This has been working fine for us, but perhaps as we've been running more
> jobs in parallel, we've started getting errors like
>
> Status Code: 503, AWS Service: Amazon S3, AWS Request ID: ..., AWS Error
> Code: SlowDown, AWS Error Message: Please reduce your request rate., S3
> Extended Request ID: ...
>
> We enabled CloudWatch S3 request metrics for one of our buckets and I was
> a little alarmed to see spikes of over 800k S3 requests over a minute or
> so, with the bulk of them HEAD requests.
>
> We read and write Parquet files, and most tables have around 50
> shards/parts, though some have up to 200. I imagine there's additional
> parallelism when reading a shard in Parquet, though.
>
> Has anyone else encountered this? How did you solve it?
>
> I'd sure prefer to avoid copying all our data in and out of HDFS for each
> job, if possible.
>
> Thanks!
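
Regarding the `fs.s3.buckets.create.enabled` setting mentioned in my reply: one way to try it for a single job is to pass it through Spark, which copies any `spark.hadoop.<key>` property into the job's Hadoop configuration. The sketch below only builds the `spark-submit` command line. The helper name `hadoop_conf_args` and the script name `my_job.py` are made up for illustration, and I am assuming that EMRFS picks the property up at the job level; it may need to be set cluster-wide instead. Note that it applies to EMRFS (`s3://` paths), not to S3A.

```python
from typing import Dict, List


def hadoop_conf_args(props: Dict[str, str]) -> List[str]:
    """Turn Hadoop properties into spark-submit --conf arguments.

    Hypothetical helper: relies on Spark copying every
    `spark.hadoop.<key>` setting into the Hadoop configuration as `<key>`.
    """
    args: List[str] = []
    for key, value in sorted(props.items()):
        args += ["--conf", f"spark.hadoop.{key}={value}"]
    return args


# Property name taken from the EMR documentation linked in the reply.
emrfs_props = {"fs.s3.buckets.create.enabled": "false"}

# `my_job.py` is a placeholder for your own Spark application.
submit_cmd = ["spark-submit", *hadoop_conf_args(emrfs_props), "my_job.py"]
print(" ".join(submit_cmd))
```

If the HEAD request volume in CloudWatch does not drop after this, the property is probably not being honoured per job and should be set in the cluster configuration instead.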