Yes, things get more unstable with larger data. But that's the whole point of my question:
Why should Spark get unstable when data gets larger? When data gets larger,
Spark should get *slower*, not more unstable. Lack of stability makes
parameter tuning very difficult, time-consuming, and a painful experience.
It is also a mystery to me why Spark gets unstable in a non-deterministic
fashion. Why should it use twice, or half, the memory it used in the
previous run of exactly the same code?

On Wed, Apr 23, 2014 at 10:43 AM, Andras Barjak <
andras.bar...@lynxanalytics.com> wrote:

>>    - Spark UI shows the number of succeeded tasks as more than the total
>>    number of tasks, e.g. 3500/3000. There are no failed tasks. At this
>>    stage the computation keeps carrying on for a long time without
>>    returning an answer.
>>
> No sign of resubmitted tasks in the command-line logs either?
> You might want to get more information on what is going on in the JVM.
> I don't know what others use, but jvmtop is easy to install on EC2 and you
> can monitor some processes.
>
>>    - The only way to get an answer from an application is to hopelessly
>>    keep running that application multiple times, until by some luck it
>>    converges.
>>
>> I was not able to reproduce this with minimal code, as it seems some
>> random factors affect this behavior. I have a suspicion, but I'm not
>> sure, that the use of one or more groupByKey() calls intensifies this
>> problem.
>>
> Is this related to the amount of data you are processing? Is it more
> likely to happen on large data?
> My experience on EC2 is that whenever the memory/partitioning/timeout
> settings are reasonable, the output is quite consistent, even if I stop
> and restart the cluster on another day.
>
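
Regarding the groupByKey() suspicion quoted above, here is a minimal sketch
of the kind of change that could be tried. It assumes a simple
word-count-style job; the master, app name, input path, and partition count
are placeholders, not details from the actual application. groupByKey()
buffers all the values for a key in memory on one executor, so peak memory
depends on the largest key group and on where tasks land, which can differ
from run to run; reduceByKey() combines values map-side before the shuffle,
and an explicit partition count keeps per-task state bounded.

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._  // pair-RDD functions (groupByKey, reduceByKey)

    // Placeholder master, app name and input path -- not the real job.
    val sc = new SparkContext("local[4]", "groupByKey-vs-reduceByKey")
    val pairs = sc.textFile("hdfs:///some/input")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))

    // Ships every value across the shuffle and buffers whole groups in
    // memory; peak memory tracks the largest key group and task placement.
    val countsViaGroup = pairs.groupByKey().mapValues(_.sum)

    // Combines values map-side before the shuffle; an explicit partition
    // count (200 is arbitrary here) keeps per-task state bounded.
    val countsViaReduce = pairs.reduceByKey(_ + _, 200)

If a job stabilizes once its groupByKey() calls are replaced this way, that
would at least point at shuffle-side buffering rather than at something
inherently non-deterministic in Spark itself.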