Hey Spark user community,

I am writing Parquet files from Spark to S3 using s3a. I was reading the
article below about improving S3 bucket performance, specifically the
suggestion that introducing randomness into key names helps spread writes
across different S3 partitions.

https://aws.amazon.com/premiumsupport/knowledge-center/s3-bucket-performance-improve/

Is there a straightforward way to accomplish this randomness in Spark via
the Dataset API? The only thing I could think of would be to actually
split the large set into multiple sets (based on row boundaries), and then
write each one under a random key name.
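Something along the lines of the sketch below is what I had in mind. The
function name, the bucket path, and the number of chunks are just
placeholders I made up for illustration, not anything from the article:

  // Minimal sketch of the split-and-write idea: split the DataFrame into
  // chunks and write each chunk under a randomly generated key prefix so
  // the objects land in different S3 partitions.
  import java.util.UUID
  import org.apache.spark.sql.{DataFrame, SaveMode}

  def writeWithRandomPrefixes(df: DataFrame, basePath: String, numChunks: Int): Unit = {
    // randomSplit divides the rows into roughly equal, non-overlapping chunks
    val weights = Array.fill(numChunks)(1.0 / numChunks)
    df.randomSplit(weights).foreach { chunk =>
      // Short random prefix used as the leading part of the key name
      val prefix = UUID.randomUUID().toString.take(8)
      chunk.write
        .mode(SaveMode.Append)
        .parquet(s"$basePath/$prefix")
    }
  }

  // e.g. writeWithRandomPrefixes(df, "s3a://my-bucket/data", 16)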

Is there an easier way that I am missing?

Thanks in advance!
Subhash
