Hi Gourav,
In our case, we process raw logs into Parquet tables that downstream
applications can use for other jobs. The desired outcome is that we only
need to worry about unbalanced input data at the preprocessing step, so that
downstream jobs can assume balanced input data.
In our specific case,
Hi,
Using file size is a very bad way of managing data, provided you think that
volume, variety and veracity do not hold true. Actually, it's a very bad
way of thinking about and designing data solutions; you are bound to hit
bottlenecks, optimization issues, and manual interventions.
I have found
The primary goal for balancing partitions would be the write to S3. We
would like to prevent unbalanced partitions (which we can do with repartition),
but also avoid partitions that are too small or too large.
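In rough terms, the repartition approach looks like the sketch below; the
partition count and the S3 path are placeholders rather than our real values:

# Rebalance before the write so no single output file is much larger than
# the rest. 200 and the bucket path are placeholder values.
balanced = df.repartition(200)
balanced.write.mode("overwrite").parquet("s3://bucket/processed/logs/")

The open question is how to pick that partition count at runtime based on
data size instead of hard-coding it.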
So for that case, getting the cache size would work, Maropu, if it's roughly
accurate, but
Hi,
Since the final size depends on data types and compression, I've had to first
get a rough estimate of the data written to disk and then compute the number
of partitions:
partitions = int(ceil(size_data * conversion_ratio / block_size))
In my case the block size is 256 MB, the source is plain text, and the
destination is Snappy-compressed.
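A rough PySpark sketch of that estimate, for anyone curious: it writes a small
sample in the same format as the final output, measures the on-disk size
through the Hadoop FileSystem API, and scales up. The helper name, sample path,
and sample fraction are just illustrative, and the _jvm/_jsc access goes
through py4j internals rather than a stable public API.

import math

def estimate_partitions(df, sample_path, sample_fraction=0.01,
                        block_size=256 * 1024 * 1024):
    # Hypothetical helper: write a sample in the destination format, so the
    # conversion ratio from the source format is already baked in.
    df.sample(False, sample_fraction) \
      .write.mode("overwrite").parquet(sample_path)

    # Measure the bytes actually written via Hadoop's FileSystem (py4j internals).
    sc = df.sql_ctx._sc
    path = sc._jvm.org.apache.hadoop.fs.Path(sample_path)
    fs = path.getFileSystem(sc._jsc.hadoopConfiguration())
    sample_bytes = fs.getContentSummary(path).getLength()

    # Scale the sample up to the full dataset and divide by the target block size.
    size_data = sample_bytes / sample_fraction
    return max(1, int(math.ceil(float(size_data) / block_size)))

# Usage, e.g.:
# n = estimate_partitions(df, "s3://bucket/tmp/size-probe")
# df.repartition(n).write.parquet("s3://bucket/output")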
Hi,
There is no simple way to access the size on the driver side.
Since the partitions of primitive-typed data (e.g., int) are compressed by
`DataFrame#cache`,
the actual cached size can differ somewhat from the size of the partitions
as they are processed.
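If a rough driver-side number is enough, one way is to read what the driver
tracks for cached data, e.g. (assuming sc is the active SparkContext; this
goes through py4j internals, not a stable public API):

# Peek at the storage info the driver keeps for cached RDDs. memSize is the
# compressed in-memory size, so it only approximates the processing size.
df.cache()
df.count()  # materialize the cache first
infos = sc._jsc.sc().getRDDStorageInfo()
for i in range(len(infos)):
    info = infos[i]
    print(info.name(), info.numCachedPartitions(), info.memSize())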
// maropu
On Wed, Jul 13, 2016 at 4:53 AM, Pedro
Hi,
Are there any tools for partitioning RDDs/DataFrames by size at runtime? The
idea would be to specify that I would like each partition to be roughly
X megabytes and then write that through to S3. I haven't found
anything off the shelf, and looking through Stack Overflow posts doesn't