Re: Spark & S3 - Introducing random values into key names

2018-03-08 Thread Subhash Sriram

Thanks, Vadim! That helps and makes sense. I don't think we have a number of 
keys so large that we have to worry about it. If we do, I think I would go with 
an approach similar to what you suggested.

Thanks again,
Subhash 


Re: Spark & S3 - Introducing random values into key names

2018-03-08 Thread Vadim Semenov

You need to put the randomness at the beginning of the key. If you put it
anywhere else, you're not guaranteed to get good performance.
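
As a minimal sketch of what "randomness at the beginning of the key" can look
like: the helper below prepends a short prefix to each key. Note that it uses
a prefix derived from an MD5 hash of the key rather than a truly random value,
so the same key always maps to the same prefix. The function name
`prefixed_key`, the prefix length, and the example key layout are assumptions
made for illustration only; they do not come from this thread.

```python
import hashlib


def prefixed_key(key: str, prefix_len: int = 4) -> str:
    """Return the key with a short hash-derived prefix at the very beginning.

    Hypothetical helper: the name and the 4-character default are
    illustrative assumptions, not an established convention.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    # The varying part comes first, so keys differ from their first character.
    return f"{digest[:prefix_len]}-{key}"


# Example keys with an assumed date-partitioned layout.
keys = [f"events/date=2018-03-08/part-{i:05d}.parquet" for i in range(3)]
for k in keys:
    print(prefixed_key(k))
```

A hash-derived prefix has the advantage that the final key can be recomputed
from the original name; a truly random prefix would instead require storing
the resulting keys somewhere.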

The way we achieved this was by writing to HDFS first, and then running a
custom DistCp, implemented in Spark, that copies the Parquet files to S3
under random keys and then saves the list of resulting keys to S3. When we
want to use those Parquet files, we just load the listing file, take the
keys from it, and pass them to the loader.
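
The bookkeeping side of that workflow might be sketched roughly as follows.
This is not the actual DistCp implementation described above: it only plans
destination keys that begin with random hex, serializes the listing, and reads
it back. The copy itself is left out, and the function names, the JSON-lines
listing format, the bucket name, and the HDFS paths are all assumptions made
for illustration.

```python
import json
import uuid


def plan_copy(source_paths, bucket):
    """Map each source file to a destination whose key starts with random hex."""
    plan = []
    for src in source_paths:
        name = src.rsplit("/", 1)[-1]
        # uuid4().hex is random, so each key varies from its first character.
        dest_key = f"{uuid.uuid4().hex}/{name}"
        plan.append({"source": src, "dest": f"s3a://{bucket}/{dest_key}"})
    return plan


def dump_listing(plan):
    """Serialize the plan as one JSON object per line (assumed format)."""
    return "\n".join(json.dumps(entry) for entry in plan)


def load_listing(text):
    """Return the destination paths recorded in a listing."""
    return [json.loads(line)["dest"] for line in text.splitlines() if line.strip()]


# Hypothetical staging paths and bucket name.
sources = [f"hdfs:///staging/table/part-{i:05d}.parquet" for i in range(3)]
plan = plan_copy(sources, "example-bucket")
listing = dump_listing(plan)  # in practice this text would be saved to S3
paths = load_listing(listing)
# With a SparkSession named `spark`, the files could then be read with:
#   df = spark.read.parquet(*paths)
```

Because the keys are random, the listing is the only record of where the files
ended up, so it has to be written reliably once the copy has finished.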

You only need to do this when you have a very large number of files. If the
number of keys you operate on is reasonably small (let's say, in the
thousands), you won't get any benefit.

Also, S3 buckets have internal optimizations, and over time they adjust to
the workload, i.e. additional underlying partitions get added, splits
happen, etc.
If you want good performance from the start, you would need to use
randomization, yes.
Alternatively, you can contact AWS and tell them about the naming scheme
that you're going to use (but it must be set in stone), and they can try to
pre-optimize the bucket for you.


Spark & S3 - Introducing random values into key names

2018-03-08 Thread Subhash Sriram

Hey Spark user community,

I am writing Parquet files from Spark to S3 using S3a. I was reading this
article about improving S3 bucket performance, specifically about how it
can help to introduce randomness to your key names so that data is written
to different partitions.

https://aws.amazon.com/premiumsupport/knowledge-center/s3-bucket-performance-improve/

Is there a straightforward way to accomplish this randomness in Spark via
the Dataset API? The only thing that I could think of would be to actually
split the large set into multiple sets (based on row boundaries), and then
write each one with a random key name.

Is there an easier way that I am missing?

Thanks in advance!
Subhash
