Hi,

Sometimes the very same Spark application binary behaves differently
with every execution:

   - The Ganglia profile differs with every execution: sometimes the job
   uses 0.5 TB of memory, the next time 1 TB, the next time 0.75 TB...
   - The Spark UI shows more succeeded tasks than total tasks, e.g.
   3500/3000, yet there are no failed tasks. At this stage the computation
   keeps running for a long time without returning an answer.
   - The only way to get an answer is to keep rerunning the application
   until, by some luck, it completes.

I was not able to reproduce this with a minimal example, as some random
factors seem to affect the behavior. I suspect, though I am not sure,
that one or more groupByKey() calls intensify the problem.
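For what it's worth, here is a plain-Python sketch (not PySpark, just an illustration) of why I suspect groupByKey(): it materializes every value for a key before reducing, whereas a reduceByKey()-style aggregation only keeps one running partial per key, so the former's memory footprint depends on skew and shuffle ordering:

```python
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3), ("a", 4)]

# groupByKey-style: collect every value per key in memory, then reduce
grouped = defaultdict(list)
for k, v in pairs:
    grouped[k].append(v)          # full value lists held in memory
group_sums = {k: sum(vs) for k, vs in grouped.items()}

# reduceByKey-style: combine incrementally, one partial aggregate per key
reduced = defaultdict(int)
for k, v in pairs:
    reduced[k] += v               # only the running sum is held

assert group_sums == dict(reduced)  # same answer, far less state held
```

In the real application the per-key value lists are large, so if groupByKey() is the culprit I could try rewriting those stages with reduceByKey() or aggregateByKey().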

Another suspect is the unpredictable latency and I/O performance of EC2
clusters.

Is this a known issue with Spark?
