Hi,

Are there any tools for partitioning RDDs/DataFrames by size at runtime? The idea would be to specify that each partition should be roughly X megabytes, and then write the result out to S3. I haven't found anything off the shelf, and looking through Stack Overflow posts doesn't seem to yield anything concrete.
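For concreteness, the closest thing I can sketch myself looks roughly like this (untested; the name repartitionToTargetSize, the target size, and the sampling approach are just placeholders, and SizeEstimator measures in-memory size rather than the serialized size that would actually land in S3):

import org.apache.spark.rdd.RDD
import org.apache.spark.util.SizeEstimator

// Untested sketch: estimate bytes per record from a small sample, then pick a
// partition count so each partition lands near the target size. SizeEstimator
// reports in-memory (JVM object) size, which can differ quite a bit from the
// serialized/compressed bytes that actually end up in S3.
def repartitionToTargetSize[T](rdd: RDD[T],
                               targetPartitionBytes: Long,
                               sampleSize: Int = 1000): RDD[T] = {
  val sample = rdd.take(sampleSize)
  val bytesPerRecord = SizeEstimator.estimate(sample) / math.max(sample.length, 1)
  val totalBytes = bytesPerRecord * rdd.count()
  val numPartitions = math.max(1, math.ceil(totalBytes.toDouble / targetPartitionBytes).toInt)
  rdd.repartition(numPartitions)
}

// e.g. aim for roughly 128 MB partitions before writing out:
// repartitionToTargetSize(myRdd, 128L * 1024 * 1024).saveAsTextFile("s3a://my-bucket/some/path")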
Is there a way to programmatically get the size, or a size estimate, of an RDD/DataFrame at runtime (e.g., the size of one record would be sufficient)? I gave SizeEstimator a try, but the results seemed to vary quite a bit (tried on the whole RDD and on a sample). It would also be useful to have programmatic access to the in-memory size of an RDD when it is cached.

Thanks,

--
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni
ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn: https://www.linkedin.com/in/pedrorodriguezscience
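P.S. In case it helps clarify what I'm after, a rough sketch of the probing I have in mind: SizeEstimator for a per-record estimate (roughly what I tried above), plus SparkContext.getRDDStorageInfo, which looks like it exposes the memory/disk footprint of a cached RDD once it has been materialized:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.util.SizeEstimator

// Rough in-memory estimate for a single record; results vary depending on
// which record happens to be sampled.
def estimateRecordBytes[T](rdd: RDD[T]): Option[Long] =
  rdd.take(1).headOption.map(r => SizeEstimator.estimate(r.asInstanceOf[AnyRef]))

// For a cached RDD, getRDDStorageInfo appears to report the actual memory and
// disk footprint, but only after the RDD has been materialized (e.g. by count()).
def cachedSizeBytes(sc: SparkContext, rdd: RDD[_]): Option[(Long, Long)] =
  sc.getRDDStorageInfo.find(_.id == rdd.id).map(info => (info.memSize, info.diskSize))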