The primary goal of balancing partitions is the write to S3. We would like to 
prevent unbalanced partitions (repartition can handle that), but also avoid 
partitions that are too small or too large.

So for that case, getting the cache size would work, Maropu, if it's roughly 
accurate, but for data ingest we aren't caching, just writing straight through 
to S3.

The idea of writing to disk and checking the size is interesting, Hatim. For 
certain jobs, it seems very doable to write a small percentage of the data to 
S3, check the file size through the AWS API, and use that to estimate the 
total size. Thanks for the idea.
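
Roughly, I'm picturing something like the sketch below (PySpark plus boto3; 
the bucket/prefix names, the 1% sample fraction, and the 256 MB target are 
placeholders, and it ignores S3 listing pagination for brevity):

from math import ceil
import boto3

SAMPLE_FRACTION = 0.01                      # write ~1% of the data first
TARGET_PARTITION_BYTES = 256 * 1024 * 1024  # aim for ~256 MB output files

# 1. Write a small sample straight to S3 in the final format.
df.sample(withReplacement=False, fraction=SAMPLE_FRACTION) \
    .write.mode("overwrite").parquet("s3a://BUCKET/sample-prefix/")

# 2. Ask S3 how many bytes the sample actually took (first page of keys only).
s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="BUCKET", Prefix="sample-prefix/")
sample_bytes = sum(obj["Size"] for obj in resp.get("Contents", []))

# 3. Extrapolate to the full dataset and pick a partition count.
estimated_total_bytes = sample_bytes / SAMPLE_FRACTION
partitions = max(1, int(ceil(estimated_total_bytes / TARGET_PARTITION_BYTES)))

df.repartition(partitions).write.parquet("s3a://BUCKET/output-prefix/")

The extrapolation assumes the sample compresses about as well as the full 
dataset, which seems reasonable for fairly homogeneous data.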

—
Pedro Rodriguez
PhD Student in Large-Scale Machine Learning | CU Boulder
Systems Oriented Data Scientist
UC Berkeley AMPLab Alumni

pedrorodriguez.io | 909-353-4423
github.com/EntilZha | LinkedIn

On July 12, 2016 at 7:26:17 PM, Hatim Diab (timd...@gmail.com) wrote:

Hi,

Since the final size depends on data types and compression, I've had to first 
get a rough estimate of the data written to disk, then compute the number of 
partitions.

partitions = int(ceil(size_data * conversion_ratio / block_size))

In my case the block size is 256 MB, the source is txt, the destination is 
snappy Parquet, and the conversion ratio is 0.6.

df.repartition(partitions).write.parquet(output)

This yields files in the range of 230 MB.
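
Put together, a rough sketch of that recipe (summing local source files with 
os.path.getsize is just a stand-in for however you measure your input):

import glob
import os
from math import ceil

block_size = 256 * 1024 * 1024   # 256 MB target output files
conversion_ratio = 0.6           # txt -> snappy parquet, from past runs

# Estimate the input size, then derive the partition count from the formula.
size_data = sum(os.path.getsize(f) for f in glob.glob("/data/source/*.txt"))
partitions = max(1, int(ceil(size_data * conversion_ratio / block_size)))

df.repartition(partitions).write.parquet(output)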

Another way was to count the records and come up with an empirical formula.

Cheers,
Hatim


On Jul 12, 2016, at 9:07 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:

Hi,

There is no simple way to access the size on the driver side.
Since partitions of primitive-typed data (e.g., int) are compressed by 
`DataFrame#cache`,
the cached size can differ a little from the size of the partitions during 
processing.

// maropu

On Wed, Jul 13, 2016 at 4:53 AM, Pedro Rodriguez <ski.rodrig...@gmail.com> 
wrote:
Hi,

Are there any tools for partitioning RDDs/DataFrames by size at runtime? The 
idea would be to specify that each partition should be roughly X megabytes, 
then write that through to S3. I haven't found anything off the shelf, and 
looking through Stack Overflow posts doesn't seem to yield anything concrete.

Is there a way to programmatically get the size, or a size estimate, for an 
RDD/DataFrame at runtime (e.g., the size of one record would be sufficient)? I 
gave SizeEstimator a try, but the results varied quite a bit (tried on the 
whole RDD and on a sample). It would also be useful to get programmatic access 
to the size of the RDD in memory if it is cached.
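
For example, something crude like sampling a handful of records and measuring 
their serialized size (the 0.1% fraction and pickle-based sizing below are 
arbitrary, and pickled size won't match the compressed on-disk size):

import pickle

# Rough per-record estimate from a small sample; this reflects pickled size
# in the driver, not the compressed size the data will have on disk.
sample = df.sample(withReplacement=False, fraction=0.001).collect()
if sample:
    avg_record_bytes = sum(len(pickle.dumps(r)) for r in sample) / len(sample)
    estimated_total_bytes = avg_record_bytes * df.count()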

Thanks,
--
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn: 
https://www.linkedin.com/in/pedrorodriguezscience




--
---
Takeshi Yamamuro
