Hi,

Since the final size depends on the data types and on compression, I've had to 
first get a rough estimate of the data size as written to disk, then compute 
the number of partitions:

partitions = int(ceil(size_data * compression_ratio / block_size))

In my case the block size is 256 MB, the source is txt, the destination is 
snappy-compressed Parquet, and the compression ratio is 0.6:

df.repartition(partitions).write.parquet(output)

This yields files in the range of 230 MB.
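
Roughly, in PySpark terms, the whole thing looks like the sketch below. The 
paths, the SparkSession setup, and the Hadoop FileSystem call used to get the 
input size are placeholders/assumptions for illustration, not my exact job:

import math
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

input_path = "hdfs:///data/source_txt"     # placeholder paths
output_path = "hdfs:///data/dest_parquet"

block_size = 256 * 1024 * 1024             # target output file size: 256 MB
compression_ratio = 0.6                    # txt -> snappy parquet, measured on my data

# Total on-disk size of the source text files, via the Hadoop FileSystem API.
hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
size_data = fs.getContentSummary(hadoop.fs.Path(input_path)).getLength()

partitions = int(math.ceil(size_data * compression_ratio / block_size))

df = spark.read.text(input_path)
df.repartition(partitions).write.parquet(output_path)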

Another way was to count the records and come up with an empirical formula.
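
Something along these lines (again just a sketch; the sample fraction and 
scratch path are made up, and df / spark / block_size are as in the previous 
snippet):

import math

# Write a small sample, measure its on-disk Parquet size, then scale the
# per-row size by the total row count to pick the partition count.
sample_path = "hdfs:///tmp/size_probe"     # hypothetical scratch location
sample = df.sample(False, 0.01)
sample_rows = sample.count()
sample.write.mode("overwrite").parquet(sample_path)

conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop = spark.sparkContext._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(conf)
sample_bytes = fs.getContentSummary(hadoop.fs.Path(sample_path)).getLength()

bytes_per_row = sample_bytes / float(sample_rows)
partitions = int(math.ceil(df.count() * bytes_per_row / block_size))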

Cheers,
Hatim


> On Jul 12, 2016, at 9:07 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:
> 
> Hi,
> 
> There is no simple way to access the size on the driver side.
> Since partitions of primitive-typed data (e.g., int) are compressed by 
> `DataFrame#cache`, the actual size can be a little different from the size of 
> the partitions being processed.
> 
> // maropu
> 
>> On Wed, Jul 13, 2016 at 4:53 AM, Pedro Rodriguez <ski.rodrig...@gmail.com> 
>> wrote:
>> Hi,
>> 
>> Are there any tools for partitioning RDD/DataFrames by size at runtime? The 
>> idea would be to specify that I would like for each partition to be roughly 
>> X number of megabytes then write that through to S3. I haven't found 
>> anything off the shelf, and looking through stack overflow posts doesn't 
>> seem to yield anything concrete.
>> 
>> Is there a way to programmatically get the size, or a size estimate, for an 
>> RDD/DataFrame at runtime (e.g., the size of one record would be sufficient)? 
>> I gave SizeEstimator a try, but the results varied quite a bit (tried on the 
>> whole RDD and on a sample). It would also be useful to get programmatic 
>> access to the size of the RDD in memory if it is cached.
>> 
>> Thanks,
>> -- 
>> Pedro Rodriguez
>> PhD Student in Distributed Machine Learning | CU Boulder
>> UC Berkeley AMPLab Alumni
>> 
>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>> Github: github.com/EntilZha | LinkedIn: 
>> https://www.linkedin.com/in/pedrorodriguezscience
> 
> 
> 
> -- 
> ---
> Takeshi Yamamuro
