Hi,

Using file size is a very bad way of managing data unless you assume that
volume, variety, and veracity do not hold true. It is a poor basis for
thinking about and designing data solutions: you are bound to hit
bottlenecks, optimization issues, and manual interventions.

I have found that thinking about data in logical partitions helps overcome
most of the design problems mentioned above.

You can either use repartition (with shuffling) or coalesce (with shuffle
turned off) to manage loads.
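
For instance, a minimal PySpark sketch (the paths and partition counts are
illustrative, not a recommendation):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("s3://bucket/input/")  # hypothetical input path

    # repartition(n) does a full shuffle and produces n evenly sized
    # partitions; use it to rebalance skewed data before writing.
    balanced = df.repartition(200)

    # coalesce(n) merges existing partitions without a shuffle; use it to
    # cheaply reduce the partition count (it cannot increase it).
    merged = df.coalesce(50)

    balanced.write.parquet("s3://bucket/output/")  # hypothetical output path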

If you are using Hive, just let me know.


Regards,
Gourav Sengupta

On Wed, Jul 13, 2016 at 5:39 AM, Pedro Rodriguez <ski.rodrig...@gmail.com>
wrote:

> The primary goal for balancing partitions would be for the write to S3. We
> would like to prevent unbalanced partitions (can do with repartition), but
> also avoid partitions that are too small or too large.
>
> So for that case, getting the cache size would work, Maropu, if it's
> roughly accurate, but for data ingest we aren't caching, just writing
> straight through to S3.
>
> The idea of writing to disk and checking the size is interesting, Hatim.
> For certain jobs it seems very doable to write a small percentage of the
> data to S3, check the file size through the AWS API, and use that to
> estimate the total size. Thanks for the idea.
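>
> A rough sketch of that approach (the bucket, prefix, and sample fraction
> are hypothetical; assumes boto3 is available on the driver):
>
>     import boto3
>
>     # Write a small sample to S3, then ask S3 how big it came out.
>     sample_fraction = 0.01
>     df.sample(False, sample_fraction).write.parquet("s3://bucket/size-probe/")
>
>     s3 = boto3.client("s3")
>     resp = s3.list_objects_v2(Bucket="bucket", Prefix="size-probe/")
>     sample_bytes = sum(obj["Size"] for obj in resp.get("Contents", []))
>
>     # Scale up to estimate the full dataset size on S3.
>     estimated_total_bytes = sample_bytes / sample_fraction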
>
> —
> Pedro Rodriguez
> PhD Student in Large-Scale Machine Learning | CU Boulder
> Systems Oriented Data Scientist
> UC Berkeley AMPLab Alumni
>
> pedrorodriguez.io | 909-353-4423
> github.com/EntilZha | LinkedIn
> <https://www.linkedin.com/in/pedrorodriguezscience>
>
> On July 12, 2016 at 7:26:17 PM, Hatim Diab (timd...@gmail.com) wrote:
>
> Hi,
>
> Since the final size depends on data types and compression, I've had to
> first get a rough estimate of the data written to disk, then compute the
> number of partitions.
>
> partitions = int(ceil(size_data * conversion_ratio / block_size))
>
> In my case the block size is 256 MB, the source is text, the destination
> is Snappy-compressed Parquet, and the conversion_ratio is 0.6.
>
> df.repartition(partitions).write.parquet(output)
>
> Which yields files in the range of 230mb.
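>
> Put together, a minimal sketch of that recipe (size_data, output, and the
> constants are placeholders to fill in for your own job):
>
>     from math import ceil
>
>     block_size = 256 * 1024 * 1024        # target ~256 MB per output file
>     conversion_ratio = 0.6                # measured txt -> snappy parquet
>     size_data = 500 * 1024 ** 3           # rough input size in bytes
>
>     partitions = int(ceil(size_data * conversion_ratio / block_size))
>     df.repartition(partitions).write.parquet(output)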
>
> Another way was to count the records and come up with an empirical formula.
>
> Cheers,
> Hatim
>
>
> On Jul 12, 2016, at 9:07 PM, Takeshi Yamamuro <linguin....@gmail.com>
> wrote:
>
> Hi,
>
> There is no simple way to access the size on the driver side.
> Since partitions of primitive-typed data (e.g., int) are compressed by
> `DataFrame#cache`, the actual size can also differ a bit from the size of
> the partitions being processed.
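>
> If an order-of-magnitude figure is enough, one rough driver-side
> approximation (the sample fraction is arbitrary, and pickled size is only
> a proxy for the serialized or on-disk size):
>
>     import pickle
>
>     # Estimate the average record size from a small sample, then scale up.
>     sizes = df.rdd.sample(False, 0.001).map(lambda row: len(pickle.dumps(row)))
>     avg_record_bytes = sizes.mean()
>     estimated_total_bytes = avg_record_bytes * df.count()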
>
> // maropu
>
> On Wed, Jul 13, 2016 at 4:53 AM, Pedro Rodriguez <ski.rodrig...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Are there any tools for partitioning RDD/DataFrames by size at runtime?
>> The idea would be to specify that I would like each partition to be
>> roughly X megabytes, then write that through to S3. I haven't found
>> anything off the shelf, and looking through Stack Overflow posts doesn't
>> seem to yield anything concrete.
>>
>> Is there a way to programmatically get the size, or a size estimate, for
>> an RDD/DataFrame at runtime (e.g., the size of one record would be
>> sufficient)? I gave SizeEstimator a try, but the results varied quite a
>> bit (tried on the whole RDD and on a sample). It would also be useful to
>> get programmatic access to the size of the RDD in memory if it is cached.
>>
>> Thanks,
>> --
>> Pedro Rodriguez
>> PhD Student in Distributed Machine Learning | CU Boulder
>> UC Berkeley AMPLab Alumni
>>
>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>> Github: github.com/EntilZha | LinkedIn:
>> https://www.linkedin.com/in/pedrorodriguezscience
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>
>
