Thanks, Vadim! That helps and makes sense. I don't think we have so many
keys that we need to worry about it. If we do, I think I would go with an
approach similar to the one you suggested.
Thanks again,
Subhash
> On Mar 8, 2018, at 11:56 AM, Vadim Semenov wrote:
You need to put the randomness at the beginning of the key; if it goes
anywhere other than the beginning, good performance isn't guaranteed.
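As a rough sketch of what that looks like in Scala (the helper name and the
key layout are just examples, not what we actually run):

    import java.security.MessageDigest

    // Illustrative helper: prepend a short hash of the logical key so that
    // objects spread across S3's index partitions instead of sharing one
    // hot prefix.
    def randomizedKey(logicalKey: String): String = {
      val md5 = MessageDigest.getInstance("MD5")
        .digest(logicalKey.getBytes("UTF-8"))
      // First byte of the hash, as two hex characters, becomes the prefix.
      val prefix = md5.take(1).map("%02x".format(_)).mkString
      s"$prefix/$logicalKey"
    }

    // randomizedKey("logs/2018/03/08/part-00000.parquet")
    //   might return something like "a3/logs/2018/03/08/part-00000.parquet"

Because the hash is deterministic, the same logical key always maps to the
same prefixed key, so readers can recompute the full path.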
The way we achieved this is by writing to HDFS first and then running a
custom DistCp, implemented in Spark, that copies the Parquet files to S3.
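A stripped-down sketch of that pattern; the bucket name, paths, and the hash
prefix are all placeholders, not what our actual job uses:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileUtil, Path}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("spark-distcp-sketch").getOrCreate()

    val srcDir = "hdfs:///tmp/staging/output"   // assumed HDFS staging dir
    val dstBucket = "s3a://my-bucket"           // assumed destination bucket

    // Enumerate the staged files on the driver.
    val srcFs = new Path(srcDir).getFileSystem(spark.sparkContext.hadoopConfiguration)
    val files = srcFs.listStatus(new Path(srcDir)).map(_.getPath.toString)

    // Copy each file from an executor, one task per file, prefixing the
    // destination key with a short hash to spread S3 partitions.
    spark.sparkContext.parallelize(files.toSeq, math.max(files.length, 1)).foreach { src =>
      val conf = new Configuration()   // executors rebuild config from core-site.xml
      val from = new Path(src)
      val prefix = Integer.toHexString(from.getName.hashCode).take(2)
      val to = new Path(s"$dstBucket/$prefix/${from.getName}")
      FileUtil.copy(from.getFileSystem(conf), from,
                    to.getFileSystem(conf), to,
                    false /* deleteSource */, conf)
    }

Writing to HDFS first also means the slow S3 copy happens after the job's
output is already durable, and the copy parallelism is just the number of
Spark tasks.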
Hey Spark user community,
I am writing Parquet files from Spark to S3 using the s3a connector. I was
reading this article about improving S3 bucket performance, specifically
about how introducing randomness into your key names can help, since data
then gets written to different partitions.
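For reference, the write itself is nothing special; something like this,
with the bucket and prefix as placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("parquet-to-s3").getOrCreate()
    val df = spark.read.parquet("hdfs:///data/input")   // placeholder input

    // With a date-based prefix like this, every object shares the same
    // leading characters, which is exactly the hot-partition pattern the
    // article warns about.
    df.write.parquet("s3a://my-bucket/logs/2018/03/08/")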