The primary goal of balancing partitions is the write to S3. We would like to prevent unbalanced partitions (which we can do with repartition), but also avoid partitions that are too small or too large.
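Roughly, what we want at write time looks something like the sketch below. The 256 MB target, the bucket paths, and especially the size estimate are placeholders; getting that estimate is the open question.

from math import ceil

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("balanced-s3-write").getOrCreate()

# Placeholder input; in our case this is the raw data being ingested.
df = spark.read.json("s3a://some-bucket/raw/")

target_partition_bytes = 256 * 1024 * 1024   # aim for roughly 256 MB output files
estimated_total_bytes = 50 * 1024 ** 3       # placeholder -- this is the number we don't have yet

num_partitions = max(1, int(ceil(float(estimated_total_bytes) / target_partition_bytes)))

# Evenly sized partitions give evenly sized objects on S3, avoiding both
# tiny files and huge ones.
df.repartition(num_partitions).write.parquet("s3a://some-bucket/out/")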
So for that case, getting the cache size would work, Maropu, if it's roughly accurate, but for data ingest we aren't caching, just writing straight through to S3. The idea of writing to disk and checking the size is interesting, Hatim. For certain jobs, it seems very doable to write a small percentage of the data to S3, check the file size through the AWS API, and use that to estimate the total size. Thanks for the idea.

--
Pedro Rodriguez
PhD Student in Large-Scale Machine Learning | CU Boulder
Systems Oriented Data Scientist
UC Berkeley AMPLab Alumni
pedrorodriguez.io | 909-353-4423
github.com/EntilZha | LinkedIn

On July 12, 2016 at 7:26:17 PM, Hatim Diab (timd...@gmail.com) wrote:

Hi,

Since the final size depends on data types and compression, I've had to first get a rough estimate of the data written to disk, then compute the number of partitions:

    partitions = int(ceil(size_data * conversion_ratio / block_size))

In my case the block size is 256 MB, the source is text, the destination is snappy Parquet, and the conversion_ratio is 0.6:

    df.repartition(partitions).write.parquet(output)

which yields files in the range of 230 MB. Another way was to count the records and come up with an empirical formula.

Cheers,
Hatim

On Jul 12, 2016, at 9:07 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:

Hi,

There is no simple way to access the size on the driver side. Also, since partitions of primitive-typed data (e.g., int) are compressed by `DataFrame#cache`, the actual cached size can be a bit different from the size of the partitions during processing.

// maropu

On Wed, Jul 13, 2016 at 4:53 AM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote:

Hi,

Are there any tools for partitioning RDDs/DataFrames by size at runtime? The idea would be to specify that each partition should be roughly X megabytes, then write that through to S3. I haven't found anything off the shelf, and looking through Stack Overflow posts doesn't seem to yield anything concrete.

Is there a way to programmatically get the size, or a size estimate, of an RDD/DataFrame at runtime (e.g., the size of one record would be sufficient)? I gave SizeEstimator a try, but the results varied quite a bit (tried on the whole RDD and on a sample). It would also be useful to get programmatic access to the size of the RDD in memory if it is cached.

Thanks,
--
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni
ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn: https://www.linkedin.com/in/pedrorodriguezscience

--
---
Takeshi Yamamuro
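A rough sketch of the sample-write-and-measure approach discussed above: write a small sample to S3, read back its size through the AWS API (boto3 is assumed here), and extrapolate to pick a partition count. The bucket name, the 1% sample fraction, and the 256 MB target are placeholders, not values from the thread.

from math import ceil

import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-size-probe").getOrCreate()
df = spark.read.json("s3a://some-bucket/raw/")    # placeholder input

bucket = "some-bucket"                            # placeholder bucket
probe_prefix = "tmp/size-probe/"
sample_fraction = 0.01                            # write ~1% of the data as a probe

# Write a small sample and measure how big it actually is on S3.
df.sample(False, sample_fraction).write.mode("overwrite").parquet(
    "s3a://{}/{}".format(bucket, probe_prefix))

# list_objects_v2 returns up to 1000 keys per call, which is enough for a small probe.
s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket=bucket, Prefix=probe_prefix)
sample_bytes = sum(obj["Size"] for obj in resp.get("Contents", []))

# Extrapolate from the sample to the full dataset and pick a partition count.
estimated_total_bytes = sample_bytes / sample_fraction
target_partition_bytes = 256 * 1024 * 1024
num_partitions = max(1, int(ceil(estimated_total_bytes / target_partition_bytes)))

df.repartition(num_partitions).write.parquet("s3a://{}/output/".format(bucket))

Because the probe goes through the same format and compression as the final write, the estimate already reflects the on-disk size rather than the in-memory size, which is what makes it usable for picking output partition counts.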