Yes, things do get more unstable with larger data. But that's the whole point
of my question:

Why should Spark become unstable when the data gets larger?

When data gets larger, Spark should get *slower*, not more unstable. Lack
of stability makes parameter tuning very difficult, time-consuming, and a
painful experience.
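
To be concrete, these are roughly the knobs I keep re-tuning between runs
(a hypothetical sketch with placeholder values, not my actual configuration;
the exact property names may differ between Spark versions):

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder values only; finding a combination that works is the painful part.
val conf = new SparkConf()
  .setAppName("my-job")
  .set("spark.executor.memory", "6g")          // heap per executor
  .set("spark.default.parallelism", "600")     // default partition count for shuffles
  .set("spark.storage.memoryFraction", "0.5")  // RDD cache vs. working-memory split
  .set("spark.akka.timeout", "300")            // seconds; raised for long-running stages
val sc = new SparkContext(conf)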

Also, it is a mystery to me why Spark becomes unstable in a non-deterministic
fashion. Why should it use twice, or half, the memory it used in the
previous run of exactly the same code?
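
For reference, the jobs that show this behavior have roughly the following
shape (a simplified, hypothetical sketch, not the real code; the paths and
field layout are made up). As I mentioned before, the groupByKey() step is my
main suspect; a reduceByKey() formulation should shuffle less data, though I
have not verified that it avoids the instability:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // implicit conversions for pair RDD operations

object GroupByKeySketch {
  def main(args: Array[String]) {
    val sc = new SparkContext("spark://master:7077", "GroupByKeySketch")

    // (key, value) pairs; many values share the same key
    val pairs = sc.textFile("hdfs:///input/events").map { line =>
      val fields = line.split('\t')
      (fields(0), fields(1).toLong)
    }

    // groupByKey() pulls every value of a key onto a single executor before
    // anything is combined, which is the step I suspect drives memory usage up.
    val groupedSums = pairs.groupByKey().mapValues(_.sum)
    groupedSums.saveAsTextFile("hdfs:///output/grouped-sums")

    // reduceByKey() combines values map-side before the shuffle and should
    // need far less memory for the same result.
    val reducedSums = pairs.reduceByKey(_ + _)
    reducedSums.saveAsTextFile("hdfs:///output/reduced-sums")

    sc.stop()
  }
}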



On Wed, Apr 23, 2014 at 10:43 AM, Andras Barjak <
andras.bar...@lynxanalytics.com> wrote:

>
>
>>    - The Spark UI shows more succeeded tasks than total tasks, e.g.
>>    3500/3000. There are no failed tasks. At this stage the computation
>>    keeps running for a long time without returning an answer.
>>
> No sign of resubmitted tasks in the command-line logs either?
> You might want to get more information on what is going on inside the JVM.
> I don't know what others use, but jvmtop is easy to install on EC2 and you
> can monitor the relevant processes with it.
>
>>
>>    - The only way to get an answer from an application is to hopelessly
>>    keep running it multiple times until, by some luck, it converges.
>>
>> I was not able to reproduce this with a minimal code example, as it seems
>> some random factors affect this behavior. I have a suspicion, though I'm
>> not sure, that the use of one or more groupByKey() calls intensifies this
>> problem.
>>
> Is this related to the amount of data you are processing? Is it more
> likely to happen on large data?
> My experience on EC2 is that whenever the memory/partitioning/timeout
> settings are reasonable, the output is quite consistent, even if I stop
> and restart the cluster on another day.
>
