> - Spark UI shows number of succeeded tasks is more than total number
> of tasks, eg: 3500/3000. There are no failed tasks. At this stage the
> computation keeps carrying on for a long time without returning an answer.

No sign of resubmitted tasks in the command-line logs either? You might want to get more information on what is going on in the JVM. I don't know what others use, but jvmtop is easy to install on EC2 and you can use it to monitor some processes.

> - The only way to get an answer from an application is to hopelessly
> keep running that application multiple times, until by some luck it gets
> converged.
>
> I was not able to regenerate this by a minimal code, as it seems some
> random factors affect this behavior. I have a suspicion, but I'm not sure,
> that use of one or more groupByKey() calls intensifies this problem.

Is this related to the amount of data you are processing? Is it more likely to happen on large data? My experience on EC2 is that whenever the memory/partitioning/timeout settings are reasonable, the output is quite consistent, even if I stop the cluster and restart it another day.