Re: Help Explain Tasks in WebUI:4040

2015-08-31 Thread Igor Berman
Are there other processes running on sk3? Or, more generally, are you sharing
resources with anyone else (virtualization, etc.)?

Does your transformation consume other services (e.g. reading from S3, in
which case S3 latency could play a role)?
Could the task for one key take longer than the same task for another key
(I mean in your business logic)? I see that some tasks take ~1 min and
others ~1 h, which is strange.
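
One way to check whether the business logic is slower for some keys than for others is to time it per key on a sample of the data. The following is only a minimal sketch in plain Python, not Spark code: `time_per_key`, `slowest_keys`, the `process` callback and the sample records are all hypothetical names introduced for illustration.

```python
import time
from collections import defaultdict


def time_per_key(records, process, clock=time.perf_counter):
    """Run process(key, value) for each record and total the elapsed time per key."""
    elapsed = defaultdict(float)
    for key, value in records:
        start = clock()
        process(key, value)
        elapsed[key] += clock() - start
    return dict(elapsed)


def slowest_keys(elapsed, top=3):
    """Return the (key, seconds) pairs with the largest total time, slowest first."""
    return sorted(elapsed.items(), key=lambda kv: kv[1], reverse=True)[:top]


if __name__ == "__main__":
    # Hypothetical business logic: key "b" is artificially slow.
    def process(key, value):
        time.sleep(0.05 if key == "b" else 0.001)

    sample = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
    print(slowest_keys(time_per_key(sample, process)))
```

If one key dominates the totals, the slowness probably comes from the logic or from an external call made for that key, not from the node itself.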




On 28 August 2015 at 21:47, Muler wrote:

> I have a 7 node cluster running in standalone mode (1 executor per node,
> 100g/executor, 18 cores/executor)
>
> Attached is the Task status for two of my nodes. I'm not clear why some of
> my tasks are taking too long:
>
>    1. [node sk5, green] task 197 took 35 mins while task 218 took less
>       than 2 mins. But if you look at the output size/records, they are
>       almost the same. Even more strange, the shuffle spill (memory and
>       disk) is 0 for task 197, and yet it takes a long time.
>
> Same issue for my other node (sk3, red)
>
> Can you please explain what is going on?
>
> Thanks,
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>


Re: Help Explain Tasks in WebUI:4040

2015-08-30 Thread Akhil Das
Are you doing a join/groupBy or a similar operation? In that case I would
suspect that the keys are not evenly distributed, which is why a few of the
tasks spend far too much time on the actual processing. You might want to look
into custom partitioners to avoid these scenarios.
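
To illustrate the skew argument, here is a minimal plain-Python sketch (no Spark required) that compares how records spread over partitions under a plain hash versus a partitioner that reserves a partition for each known hot key. All names and the sample data are hypothetical, and CRC32 is only a stand-in for whatever hash the default partitioner uses. In PySpark a function like `partition` could be passed as the `partitionFunc` argument of `RDD.partitionBy`; in Scala the equivalent would be a subclass of `org.apache.spark.Partitioner`.

```python
import zlib
from collections import Counter


def hash_partition(key, num_partitions):
    """Stable stand-in for a default hash partitioner (CRC32 of the key)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def make_skew_aware_partitioner(hot_keys, num_partitions):
    """Reserve one partition per known hot key; hash the rest into what is left."""
    hot_slots = {key: slot for slot, key in enumerate(sorted(hot_keys))}
    shared = num_partitions - len(hot_slots)
    if shared < 1:
        raise ValueError("need at least one partition for the non-hot keys")

    def partition(key):
        if key in hot_slots:
            return hot_slots[key]
        return len(hot_slots) + hash_partition(key, shared)

    return partition


def partition_sizes(keys, partition_func, num_partitions):
    """Count how many records land in each partition."""
    counts = Counter(partition_func(k) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


if __name__ == "__main__":
    # Hypothetical data: one key ("big") carries most of the records.
    keys = ["big"] * 9000 + ["k%d" % i for i in range(1000)]
    n = 8
    print("hash only :", partition_sizes(keys, lambda k: hash_partition(k, n), n))
    print("skew-aware:", partition_sizes(keys, make_skew_aware_partitioner({"big"}, n), n))
```

Note that this only isolates a hot key so that other keys do not queue behind it; it does not split the hot key itself. For a groupBy on a single dominant key, the work for that key still ends up in one task.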

Thanks
Best Regards

On Sat, Aug 29, 2015 at 12:17 AM, Muler wrote:

> I have a 7 node cluster running in standalone mode (1 executor per node,
> 100g/executor, 18 cores/executor)
>
> Attached is the Task status for two of my nodes. I'm not clear why some of
> my tasks are taking too long:
>
>    1. [node sk5, green] task 197 took 35 mins while task 218 took less
>       than 2 mins. But if you look at the output size/records, they are
>       almost the same. Even more strange, the shuffle spill (memory and
>       disk) is 0 for task 197, and yet it takes a long time.
>
> Same issue for my other node (sk3, red)
>
> Can you please explain what is going on?
>
> Thanks,


Re: Help Explain Tasks in WebUI:4040

2015-08-28 Thread Alexey Grishchenko
It really depends on the code. I would say the easiest way is to rerun the
problematic action, find the straggler task, and analyze what is happening
in it with jstack, or take a heap dump and analyze it locally. For example,
your tasks might be connecting to some external resource that times out
under pressure. Also call toDebugString on the problematic RDD before calling
the action that triggers the computation; this would give you an understanding
of what your tasks are really doing.
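
As a rough aid for the "find the straggler task" step, the sketch below flags tasks whose duration is far above the median of the stage. It is plain Python under stated assumptions: `find_stragglers` and the threshold factor of 5 are hypothetical choices, and the durations would have to be copied from the web UI by hand. Only tasks 197 and 218 use numbers from this thread; the other entries are made up.

```python
from statistics import median


def find_stragglers(durations, factor=5.0):
    """Return task ids whose duration exceeds factor times the median duration."""
    if not durations:
        return []
    threshold = factor * median(durations.values())
    return sorted(task for task, d in durations.items() if d > threshold)


if __name__ == "__main__":
    # Minutes per task. 197 and 218 follow the numbers in the thread (218 is
    # "less than 2 mins", rounded up here); the other entries are made up.
    durations = {197: 35.0, 218: 2.0, 219: 1.5, 220: 2.5, 221: 1.0}
    print(find_stragglers(durations))
```

Once a straggler is identified, the executor running it is the process to inspect with jstack while the task is still in flight.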

On Fri, Aug 28, 2015 at 7:47 PM, Muler wrote:

> I have a 7 node cluster running in standalone mode (1 executor per node,
> 100g/executor, 18 cores/executor)
>
> Attached is the Task status for two of my nodes. I'm not clear why some of
> my tasks are taking too long:
>
>    1. [node sk5, green] task 197 took 35 mins while task 218 took less
>       than 2 mins. But if you look at the output size/records, they are
>       almost the same. Even more strange, the shuffle spill (memory and
>       disk) is 0 for task 197, and yet it takes a long time.
>
> Same issue for my other node (sk3, red)
>
> Can you please explain what is going on?
>
> Thanks,



-- 
Alexey Grishchenko, http://0x0fff.com