Hi,

Are there any tools for partitioning RDDs/DataFrames by size at runtime? The idea would be to specify that each partition should be roughly X megabytes, and then write the result out to S3. I haven't found anything off the shelf, and looking through Stack Overflow posts doesn't seem to yield anything concrete.
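For concreteness, the closest thing I can sketch myself looks roughly like this (untested; the name repartitionToTargetSize, the target size, and the sampling approach are just placeholders, and SizeEstimator measures in-memory size rather than the serialized size that would actually land in S3):

import org.apache.spark.rdd.RDD
import org.apache.spark.util.SizeEstimator

// Untested sketch: estimate bytes per record from a small sample, then pick a
// partition count so each partition lands near the target size. SizeEstimator
// reports in-memory (JVM object) size, which can differ quite a bit from the
// serialized/compressed bytes that actually end up in S3.
def repartitionToTargetSize[T](rdd: RDD[T],
                               targetPartitionBytes: Long,
                               sampleSize: Int = 1000): RDD[T] = {
  val sample = rdd.take(sampleSize)
  val bytesPerRecord = SizeEstimator.estimate(sample) / math.max(sample.length, 1)
  val totalBytes = bytesPerRecord * rdd.count()
  val numPartitions = math.max(1, math.ceil(totalBytes.toDouble / targetPartitionBytes).toInt)
  rdd.repartition(numPartitions)
}

// e.g. aim for roughly 128 MB partitions before writing out:
// repartitionToTargetSize(myRdd, 128L * 1024 * 1024).saveAsTextFile("s3a://my-bucket/some/path")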
Is there a way to programmatically get the size, or a size estimate, of an RDD/DataFrame at runtime (e.g., the size of one record would be sufficient)? I gave SizeEstimator a try, but the results seemed to vary quite a bit (tried on the whole RDD and on a sample). It would also be useful to have programmatic access to the in-memory size of an RDD when it is cached.

Thanks,

--
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni
ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn: https://www.linkedin.com/in/pedrorodriguezscience
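P.S. In case it helps clarify what I'm after, a rough sketch of the probing I have in mind: SizeEstimator for a per-record estimate (roughly what I tried above), plus SparkContext.getRDDStorageInfo, which looks like it exposes the memory/disk footprint of a cached RDD once it has been materialized:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.util.SizeEstimator

// Rough in-memory estimate for a single record; results vary depending on
// which record happens to be sampled.
def estimateRecordBytes[T](rdd: RDD[T]): Option[Long] =
  rdd.take(1).headOption.map(r => SizeEstimator.estimate(r.asInstanceOf[AnyRef]))

// For a cached RDD, getRDDStorageInfo appears to report the actual memory and
// disk footprint, but only after the RDD has been materialized (e.g. by count()).
def cachedSizeBytes(sc: SparkContext, rdd: RDD[_]): Option[(Long, Long)] =
  sc.getRDDStorageInfo.find(_.id == rdd.id).map(info => (info.memSize, info.diskSize))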