I think I found where the problem comes from. I am writing LZO-compressed Thrift records using elephant-bird. My guess is that one side computes the checksum on the uncompressed data and the other on the compressed data, hence the mismatch.
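To make that guess concrete, here is a minimal sketch of how the two digests would differ. It is only an illustration of the hypothesis, not the actual EMR, S3 client, or elephant-bird code path. Assumptions: `zlib` stands in for LZO (which is not in the Python standard library), the payload is made up, and `content_md5` is a hypothetical helper name. The only thing taken as given is that a Content-MD5 value is the base64 encoding of the 128-bit MD5 digest of the bytes being sent.

```python
import base64
import hashlib
import zlib


def content_md5(payload: bytes) -> str:
    """Return the base64-encoded MD5 digest of payload (Content-MD5 form)."""
    return base64.b64encode(hashlib.md5(payload).digest()).decode("ascii")


# Made-up payload standing in for serialized Thrift records.
raw = b"example thrift record " * 1000

# zlib is a stand-in for LZO here; any compressor changes the bytes.
compressed = zlib.compress(raw)

md5_of_raw = content_md5(raw)
md5_of_compressed = content_md5(compressed)

print("uncompressed:", md5_of_raw)
print("compressed:  ", md5_of_compressed)
print("match:       ", md5_of_raw == md5_of_compressed)
```

If one side hashed `raw` and the other hashed `compressed`, the comparison would fail in exactly this way, whatever the codec.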
When writing the data as strings using a plain TextOutputFormat, the multipart upload works. This suggests that the LZO compression is probably the problem... but it is not a solution :(

2015-04-13 18:46 GMT+02:00 Eugen Cepoi <cepoi.eu...@gmail.com>:

> Hi,
>
> I am not sure my problem is relevant to spark, but perhaps someone else
> had the same error. When I try to write files that need multipart upload to
> S3 from a job on EMR I always get this error:
>
> com.amazonaws.services.s3.model.AmazonS3Exception: The Content-MD5 you
> specified did not match what we received.
>
> If I disable multipart upload via fs.s3n.multipart.uploads.enabled (or
> output smaller files that don't require multi part upload), then everything
> works fine.
>
> I've seen an old thread on the ML where someone has the same error, but in
> my case I don't have any other errors on the worker nodes.
>
> I am using spark 1.2.1 and hadoop 2.4.0.
>
> Thanks,
> Eugen