Re: Tools for Balancing Partitions by Size

2016-07-13 Thread Pedro Rodriguez
Hi Gourav, In our case, we process raw logs into parquet tables that downstream applications can use for other jobs. The desired outcome is that we only need to worry about unbalanced input data at the preprocess step so that downstream jobs can assume balanced input data. In our specific case,
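A minimal sketch of the preprocessing step described above, assuming the raw logs are already loaded as a DataFrame; the function name, output path, and the fixed partition count of 200 are illustrative assumptions, not from the original mail (choosing that count by size is exactly what this thread is about):

  import org.apache.spark.sql.DataFrame

  // Preprocess step: absorb the skew of the raw logs once, so downstream jobs
  // can read evenly sized parquet partitions without caring about the raw input.
  def preprocess(rawLogs: DataFrame, outputPath: String): Unit = {
    rawLogs
      .repartition(200)        // fixed count here; deriving it from size is the open question
      .write
      .parquet(outputPath)     // e.g. an s3a:// path read by the downstream jobs
  }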

Re: Tools for Balancing Partitions by Size

2016-07-13 Thread Gourav Sengupta
Hi, Using file size is a very bad way of managing data, provided you think that volume, variety and veracity do not hold true. Actually, it's a very bad way of thinking about and designing data solutions; you are bound to hit bottlenecks, optimization issues, and manual interventions. I have found

Re: Tools for Balancing Partitions by Size

2016-07-12 Thread Pedro Rodriguez
The primary goal for balancing partitions would be for the write to S3. We would like to prevent unbalanced partitions (which we can do with repartition), but also avoid partitions that are too small or too large. So for that case, getting the cache size would work, Maropu, if it's roughly accurate, but
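One way to read the "not too small, not too large" requirement, assuming a rough estimate of the total output size in bytes is already available (how to obtain that estimate is the open question in this thread); the helper name and the 64 MB / 512 MB bounds below are illustrative assumptions:

  // Pick a partition count so that each partition lands between minBytes and maxBytes.
  def partitionCountFor(totalBytes: Long,
                        minBytes: Long = 64L * 1024 * 1024,
                        maxBytes: Long = 512L * 1024 * 1024): Int = {
    val fewest = math.ceil(totalBytes.toDouble / maxBytes).toInt  // fewest partitions keeping each <= maxBytes
    val most   = math.max(1L, totalBytes / minBytes).toInt        // most partitions keeping each >= minBytes
    math.max(1, math.min(fewest, most))
  }

  // e.g. df.repartition(partitionCountFor(totalBytes)).write.parquet("s3a://bucket/out/")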

Re: Tools for Balancing Partitions by Size

2016-07-12 Thread Hatim Diab
Hi, Since the final size depends on data types and compression, I've had to first get a rough estimate of the data written to disk, then compute the number of partitions. partitions = int(ceil(size_data * conversion_ratio / block_size)) In my case the block size is 256 MB, the source is txt and the destination is snappy
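The same back-of-the-envelope formula, spelled out in Scala for concreteness. The conversion ratio (compressed output bytes per raw source byte) has to be measured from a small test write; the 0.3 default below is purely an assumed placeholder, while the 256 MB block size is taken from the mail above:

  // partitions = ceil(size_data * conversion_ratio / block_size)
  def partitionsFor(sizeDataBytes: Long,
                    conversionRatio: Double = 0.3,                 // measured once from a sample write (assumption)
                    blockSizeBytes: Long = 256L * 1024 * 1024): Int =
    math.max(1, math.ceil(sizeDataBytes * conversionRatio / blockSizeBytes).toInt)

  // e.g. 100 GB of raw text expected to shrink to ~30 GB of snappy output:
  // partitionsFor(100L * 1024 * 1024 * 1024) == 120, i.e. roughly 256 MB per partition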

Re: Tools for Balancing Partitions by Size

2016-07-12 Thread Takeshi Yamamuro
Hi, There is no simple way to access the size on the driver side. Since partitions of primitive-typed data (e.g., int) are compressed by `DataFrame#cache`, the actual size is possibly a little different from the size of the partitions as processed. // maropu
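For completeness, a rough sketch of the cache-based estimate being discussed: cache the data, materialize it, then read its in-memory size from the driver's storage listing. As maropu points out, DataFrame caching stores primitive columns compressed, so the reported size can be smaller than the size seen while processing or writing; treat it as an approximation only. Shown for a plain RDD, where the id lookup is straightforward (getRDDStorageInfo is a developer API):

  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD

  // Returns the in-memory size of the cached RDD in bytes, if it is tracked.
  def cachedSizeBytes(sc: SparkContext, rdd: RDD[_]): Option[Long] = {
    rdd.cache()
    rdd.count()                                  // force materialization into the block manager
    sc.getRDDStorageInfo.find(_.id == rdd.id).map(_.memSize)
  }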

Tools for Balancing Partitions by Size

2016-07-12 Thread Pedro Rodriguez
Hi, Are there any tools for partitioning RDDs/DataFrames by size at runtime? The idea would be to specify that I would like each partition to be roughly X megabytes and then write that through to S3. I haven't found anything off the shelf, and looking through Stack Overflow posts doesn't
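As the replies above suggest, there doesn't appear to be anything off the shelf for this; the sketch below shows roughly what such a helper would look like, with the size estimator deliberately left as a parameter because obtaining that estimate is the hard part discussed in the thread (all names here are illustrative):

  import org.apache.spark.sql.DataFrame

  // Repartition so that each output partition is roughly targetMegabytes, then write.
  def writeWithPartitionSize(df: DataFrame,
                             path: String,                         // e.g. "s3a://bucket/output/"
                             targetMegabytes: Int,
                             estimateTotalBytes: DataFrame => Long): Unit = {
    val targetBytes = targetMegabytes.toLong * 1024 * 1024
    val n = math.max(1, math.ceil(estimateTotalBytes(df).toDouble / targetBytes).toInt)
    df.repartition(n).write.parquet(path)
  }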