Hi,

There is no simple way to get that size on the driver side.
Since partitions of primitive-typed data (e.g., int) are compressed by
`DataFrame#cache`,
the cached size can differ somewhat from the in-memory size of the
partitions during processing.
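One rough workaround (a sketch, not a built-in Spark API; all names below are illustrative): sample a few records on the driver, estimate bytes per record via serialization, and derive a partition count for a target partition size before repartitioning and writing out. Serialized size is only a proxy for the on-disk/in-memory size, so treat the result as an approximation:

```python
import math
import pickle


def estimate_num_partitions(sample_records, total_record_count,
                            target_partition_bytes):
    """Crude driver-side estimate: average the pickled size of a small
    sample of records, scale to the full dataset, and divide by the
    desired partition size. Returns at least 1."""
    avg_bytes = (sum(len(pickle.dumps(r)) for r in sample_records)
                 / len(sample_records))
    total_bytes = avg_bytes * total_record_count
    return max(1, math.ceil(total_bytes / target_partition_bytes))


# Hypothetical usage with an RDD, aiming for ~64 MB partitions:
#   n = estimate_num_partitions(rdd.take(100), rdd.count(), 64 * 1024 * 1024)
#   rdd.repartition(n).saveAsTextFile("s3://bucket/path")
```

Note that compression (e.g., the cached-column compression mentioned above, or codecs applied on write to S3) will make the actual output sizes smaller than this estimate.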

// maropu

On Wed, Jul 13, 2016 at 4:53 AM, Pedro Rodriguez <ski.rodrig...@gmail.com>
wrote:

> Hi,
>
> Are there any tools for partitioning RDD/DataFrames by size at runtime?
> The idea would be to specify that I would like for each partition to be
> roughly X number of megabytes then write that through to S3. I haven't
> found anything off the shelf, and looking through stack overflow posts
> doesn't seem to yield anything concrete.
>
> Is there a way to programmatically get the size or a size estimate for an
> RDD/DataFrame at runtime (e.g., the size of one record would be sufficient)? I
> gave SizeEstimator a try, but it seems like the results varied quite a bit
> (tried on whole RDD and a sample). It would also be useful to get
> programmatic access to the size of the RDD in memory if it is cached.
>
> Thanks,
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>


-- 
---
Takeshi Yamamuro
