Hi,

Since the final size depends on the data types and the compression, I first get a rough estimate of the data size written to disk and then compute the number of partitions:
partitions = int(ceil(size_data * compression_ratio / block_size))

In my case the block size is 256 MB, the source is plain text, the destination is Snappy-compressed Parquet, and the compression ratio is roughly 0.6:

df.repartition(partitions).write.parquet(output)

This yields files in the range of 230 MB. Another way is to count the records and come up with an empirical formula. A rough PySpark sketch of the whole flow is appended below the quoted thread.

Cheers,
Hatim

> On Jul 12, 2016, at 9:07 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:
>
> Hi,
>
> There is no simple way to access the size on the driver side.
> Since the partitions of primitive-typed data (e.g., int) are compressed by
> `DataFrame#cache`, the actual size can be a little different from the size of
> the partitions being processed.
>
> // maropu
>
>> On Wed, Jul 13, 2016 at 4:53 AM, Pedro Rodriguez <ski.rodrig...@gmail.com>
>> wrote:
>> Hi,
>>
>> Are there any tools for partitioning RDDs/DataFrames by size at runtime? The
>> idea would be to specify that I would like each partition to be roughly
>> X megabytes and then write that through to S3. I haven't found
>> anything off the shelf, and looking through Stack Overflow posts doesn't
>> seem to yield anything concrete.
>>
>> Is there a way to programmatically get the size, or a size estimate, of an
>> RDD/DataFrame at runtime (e.g., the size of one record would be sufficient)?
>> I gave SizeEstimator a try, but the results varied quite a bit (tried
>> on the whole RDD and on a sample). It would also be useful to get programmatic
>> access to the size of the RDD in memory if it is cached.
>>
>> Thanks,
>> --
>> Pedro Rodriguez
>> PhD Student in Distributed Machine Learning | CU Boulder
>> UC Berkeley AMPLab Alumni
>>
>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>> Github: github.com/EntilZha | LinkedIn:
>> https://www.linkedin.com/in/pedrorodriguezscience
>
>
>
> --
> ---
> Takeshi Yamamuro
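
P.S. As promised above, here is a minimal PySpark sketch of the sizing heuristic. The 256 MB block size and 0.6 ratio come from my case, and the estimate_partitions helper, the size_data value, and the input/output paths are just placeholders for illustration; plug in your own input size and tune the ratio against the file sizes you actually see on disk.

# Minimal sketch of the size-based partitioning heuristic (PySpark).
# Assumption: the rough on-disk size of the source text is known up front,
# and text -> snappy parquet shrinks by roughly a 0.6 factor.
import math
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("size-based-partitioning").getOrCreate()

BLOCK_SIZE = 256 * 1024 * 1024        # target ~256 MB per output file
COMPRESSION_RATIO = 0.6               # empirical text -> snappy parquet ratio

def estimate_partitions(size_bytes,
                        ratio=COMPRESSION_RATIO,
                        block_size=BLOCK_SIZE):
    """Number of partitions so each written file is roughly one block."""
    return max(1, int(math.ceil(size_bytes * ratio / block_size)))

# size_data: rough on-disk size of the source text, e.g. summed input file
# sizes. Placeholder value here; compute it from your own input listing.
size_data = 50 * 1024 * 1024 * 1024   # e.g. ~50 GB of input text

df = spark.read.text("s3a://bucket/path/to/input/")   # hypothetical path
partitions = estimate_partitions(size_data)

# One output file per partition, each around the target block size after
# compression (about 230 MB in my runs).
df.repartition(partitions).write.parquet("s3a://bucket/path/to/output/")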