Hi all,
I have a job that, for every input row, creates about 20 new objects (so an RDD
of 100 rows in becomes an RDD of 2,000 rows out). The reason is that each row
is tagged with the list of 'buckets' or 'windows' it belongs to.
The actual data is about 10 billion rows. Each executor has 60GB of memory.
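To make the fan-out concrete, here is a minimal plain-Python sketch of the kind of per-row expansion described above. The bucketing rule (sliding windows of width 20, stepped by 1 over an integer timestamp) is an assumption for illustration; the original post does not show the actual bucket logic. In Spark this `expand` function would be the argument to `rdd.flatMap`.

```python
def buckets_for(ts, width=20, step=1):
    # Hypothetical bucket rule: every window of `width` units that
    # contains timestamp `ts`, with window starts stepped by `step`.
    first = max(0, ts - width + 1)
    return range(first, ts + 1, step)

def expand(row):
    ts, payload = row
    # One output row per (bucket, payload) pair -- this is the ~20x
    # fan-out; in Spark it would run inside rdd.flatMap(expand).
    return [(bucket, payload) for bucket in buckets_for(ts)]

rows = [(i, f"event-{i}") for i in range(100)]
out = [tagged for row in rows for tagged in expand(row)]
```

Note that `expand` returns a list rather than mutating shared state, which is what lets `flatMap` apply it independently per partition; the fan-out factor is then bounded by `width // step`.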
Subject: Tuning/Patterns for Data Generation Heavy/Throughput Jobs