Hi,

Sometimes the very same Spark application binary behaves differently with every execution:
- The Ganglia profile is different every run: sometimes the job takes 0.5 TB of memory, the next time 1 TB, the next time 0.75 TB...
- The Spark UI shows more succeeded tasks than the total number of tasks for a stage, e.g. 3500/3000, even though there are no failed tasks. At this point the computation keeps running for a long time without returning an answer.
- The only way to get a result is to keep re-running the application until, by luck, one execution completes.

I have not been able to reproduce this with a minimal example, as some random factors seem to affect the behavior. I suspect, though I am not sure, that one or more groupByKey() calls intensify the problem (a hypothetical sketch of the pattern I mean is below). Another source of suspicion is the unpredictable latency and I/O performance of the EC2 cluster. Is this a known issue with Spark?