Hi, I've been struggling with the reliability of a job that, on the face of it, should be fairly simple: given a number of events, group them by user, event type and the week in which they occurred, and aggregate their counts.
The input data is fairly skewed, so I repartition and add a salt to the key so that the aggregation is performed in two phases to (hopefully) mitigate this (a sketch of the approach is at the end of this mail). What I see is that the size of the data handled by each task is uniform (170 MB) with a parallelism of 5000 (I also tried 2500); however, tasks take increasingly longer: they start at a few seconds and are up to 18 minutes by the end of the job. As I am not storing any RDDs, I have tried switching to the legacy memory management, giving 70% of the heap to execution, 10% to storage and 20% to unroll (settings also below). Could anyone give me any pointers on what might be causing this? It feels like a memory leak, but I'm struggling to see where it is coming from. Many thanks.

Here are some details on my set-up, plus I've attached the stats from the aggregation phase:

Spark version: 2.0 (also tried 1.6.2)
Driver: 2 cores, 10 GB RAM
Workers: 20 nodes, each with 16 cores and 100 GB RAM

Data from the map phase:
Input data size: 668.2 GB
Shuffle write size: 831.9 GB

[image: Screen Shot 2016-09-20 at 17.15.12.png]
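For reference, here is a minimal sketch of the two-phase salted aggregation I described above. The Event case class, its field names, and SALT_BUCKETS are hypothetical stand-ins for my actual schema:

import java.util.concurrent.ThreadLocalRandom
import org.apache.spark.rdd.RDD

// Hypothetical record shape; my real data has more fields.
case class Event(userId: String, eventType: String, week: Int)

val SALT_BUCKETS = 100 // spreads each hot key over this many partial keys

def countEvents(events: RDD[Event]): RDD[((String, String, Int), Long)] = {
  events
    // Phase 1: append a random salt so a single hot (user, type, week)
    // key is partially aggregated across SALT_BUCKETS reducers.
    .map { e =>
      val salt = ThreadLocalRandom.current().nextInt(SALT_BUCKETS)
      ((e.userId, e.eventType, e.week, salt), 1L)
    }
    .reduceByKey(_ + _, numPartitions = 5000)
    // Phase 2: drop the salt and combine the partial counts.
    .map { case ((user, etype, week, _), cnt) => ((user, etype, week), cnt) }
    .reduceByKey(_ + _)
}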
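And these are the legacy memory settings I'm using (property names as in Spark 1.6/2.0; note that spark.storage.unrollFraction is documented as a fraction of the storage fraction, not of the whole heap, so "20% for unroll" above should be read that way):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.memory.useLegacyMode", "true")  // fall back to pre-1.6 static memory management
  .set("spark.shuffle.memoryFraction", "0.7") // ~70% of heap for execution/shuffle
  .set("spark.storage.memoryFraction", "0.1") // ~10% of heap for cached blocks
  .set("spark.storage.unrollFraction", "0.2") // fraction of storage memory for unrolling blocks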