Hi,

There is no simple way to access the size on the driver side. Since partitions of primitive-typed data (e.g., int) are compressed by `DataFrame#cache`, the actual cached size can differ a bit from the size of the partitions being processed.
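As a pragmatic workaround, you can sample a few records, estimate the average bytes per record, and derive a partition count for a target partition size before writing. A minimal sketch in plain Python (the helper names here are hypothetical, and `sys.getsizeof` only measures the top-level object, so treat it as a coarse estimate, not the serialized or cached size):

```python
import math
import sys


def estimate_record_size(sample_records):
    """Roughly estimate the average in-memory size of one record.

    sys.getsizeof only counts the top-level object, so for nested
    records this is a lower bound on the true size.
    """
    if not sample_records:
        raise ValueError("need at least one sample record")
    total = sum(sys.getsizeof(r) for r in sample_records)
    return total / len(sample_records)


def partitions_for_target_size(record_count, avg_record_bytes,
                               target_partition_bytes=128 * 1024 * 1024):
    """Derive a partition count so each partition is roughly the target size."""
    total_bytes = record_count * avg_record_bytes
    return max(1, math.ceil(total_bytes / target_partition_bytes))
```

In Spark you would then pass the resulting count to `repartition(n)` before writing to S3; sampling with `rdd.takeSample(False, k)` keeps the estimate cheap, though as noted above the cached size can still differ from the estimate.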
// maropu

On Wed, Jul 13, 2016 at 4:53 AM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote:
> Hi,
>
> Are there any tools for partitioning RDDs/DataFrames by size at runtime?
> The idea would be to specify that I would like each partition to be
> roughly X megabytes, then write that through to S3. I haven't
> found anything off the shelf, and looking through Stack Overflow posts
> doesn't seem to yield anything concrete.
>
> Is there a way to programmatically get the size, or a size estimate, of an
> RDD/DataFrame at runtime (e.g., the size of one record would be sufficient)? I
> gave SizeEstimator a try, but the results varied quite a bit
> (tried on the whole RDD and on a sample). It would also be useful to get
> programmatic access to the size of the RDD in memory if it is cached.
>
> Thanks,
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience

--
---
Takeshi Yamamuro