Re: Scheduler hang?

2015-02-28 Thread Victor Tso-Guillen
Moving user to bcc.

What I found was that the TaskSetManager for my task set of 5 tasks had
preferred locations set for 4 of the 5. Three had localhost/driver as their
location and had completed; the one with no preferred location had also
completed. The last one's location was set by our code to my machine's IP
address. Local mode can hang on this because of
https://issues.apache.org/jira/browse/SPARK-4939, addressed by
https://github.com/apache/spark/pull/4147, which is obviously not an optimal
solution, but since it only affects local mode it's good enough. I don't want
to wait for those seconds to tick by before the task can complete, so I'll fix
the IP address reporting side for local mode in my code.
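
For anyone hitting the same thing, here is a rough sketch of the two workarounds
I have in mind. It assumes the wait in question is the locality wait and that
your code builds RDDs with explicit preferred locations; the helper name and the
IPs are made up for illustration, but spark.locality.wait and
SparkContext.makeRDD(Seq[(T, Seq[String])]) are real Spark APIs.

import org.apache.spark.{SparkConf, SparkContext}

// Workaround 1: don't sit out a locality level that local mode can never satisfy.
// spark.locality.wait is how long the TaskSetManager waits before relaxing locality.
val conf = new SparkConf()
  .setMaster("local[8]")
  .setAppName("locality-workaround")
  .set("spark.locality.wait", "0")

val sc = new SparkContext(conf)

// Workaround 2 (what I'm going to do): when running local, report "localhost"
// as the preferred location instead of this machine's IP address.
def preferredHosts(ip: String): Seq[String] =
  if (sc.master.startsWith("local")) Seq("localhost") else Seq(ip)

// Illustrative records tagged with the host that should process them.
val records: Seq[(String, Seq[String])] = Seq(
  ("row-1", preferredHosts("10.0.0.12")),
  ("row-2", preferredHosts("10.0.0.13")))

// makeRDD(Seq[(T, Seq[String])]) builds an RDD that honors the location hints.
val rdd = sc.makeRDD(records)
println(rdd.count())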


Re: Scheduler hang?

2015-02-26 Thread Victor Tso-Guillen
Okay I confirmed my suspicions of a hang. I made a request that stopped
progressing, though the already-scheduled tasks had finished. I made a
separate request that was small enough not to hang, and it kicked the hung
job enough to finish. I think what's happening is that the scheduler or the
local backend is not kicking the revive offers messaging at the right time,
but I have to dig into the code some more to nail the culprit. Does anyone on
this list have experience in those code areas who could help?
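
To make that hypothesis concrete, here is a toy sketch of the failure mode I
suspect. This is not Spark's actual LocalBackend code, just an illustration of
how a backend that only hands out cores when explicitly revived can strand
queued tasks if the revive isn't kicked when a task finishes:

import scala.collection.mutable

// Toy model of the suspected omission, not Spark's implementation.
class ToyBackend(totalCores: Int) {
  private var freeCores = totalCores
  private val pending = mutable.Queue[String]()

  def submit(task: String): Unit = { pending.enqueue(task); reviveOffers() }

  // Hands free cores to pending tasks; only runs when someone revives offers.
  private def reviveOffers(): Unit =
    while (freeCores > 0 && pending.nonEmpty) {
      freeCores -= 1
      println(s"launching ${pending.dequeue()}")
    }

  // The hypothesized bug: a finished task frees its core, but nothing revives
  // offers, so anything still queued just sits there.
  def taskFinished(task: String): Unit = {
    freeCores += 1
    // reviveOffers()  // <- without this kick, the queue can idle forever
  }
}

object ToyDemo extends App {
  val backend = new ToyBackend(totalCores = 1)
  backend.submit("task-A")       // launches immediately, takes the only core
  backend.submit("task-B")       // queued; no core free yet
  backend.taskFinished("task-A") // core comes back, but no revive -> task-B hangs
}

Note that a later submit() would call reviveOffers() and release the stuck
queue, which matches the way my small follow-up request kicked the hung job
loose.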


Re: Scheduler hang?

2015-02-26 Thread Victor Tso-Guillen
The data is small. The job is composed of many small stages.

* I found that the problem exhibits itself even with fewer than 222 partitions.
What would be gained by going higher?
* Pushing up the parallelism only pushes up the boundary at which the
system appears to hang. I'm worried about some sort of message loss or
inconsistency.
* Yes, we are using Kryo.
* I'll try that, but again I'm a little confused about why you're recommending
this. I'm stumped, so I might as well.


Re: Scheduler hang?

2015-02-26 Thread Victor Tso-Guillen
Thanks for the link. Unfortunately, I turned on RDD compression and nothing
changed. I also tried switching from netty to nio and saw no change :(
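
For anyone following along, these are the knobs in question, assuming
spark.shuffle.blockTransferService is the right name for the netty/nio switch in
1.2.x:

// The pair of settings described above, values as I tried them.
val conf = new org.apache.spark.SparkConf()
  .set("spark.rdd.compress", "true")                 // RDD compression on
  .set("spark.shuffle.blockTransferService", "nio")  // default is "netty" in 1.2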


Re: Scheduler hang?

2015-02-26 Thread Victor Tso-Guillen
Is there any potential problem going from 1.1.1 to 1.2.1 with shuffle
dependencies that produce no data?
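
By "shuffle dependencies that produce no data" I mean something like the sketch
below, where the map side of a shuffle emits zero records (assumes a
SparkContext named sc):

// Illustrative only: a shuffle stage whose map output is empty.
val empty = sc.parallelize(1 to 1000, 8)
  .filter(_ < 0)           // nothing survives the filter
  .map(x => (x % 4, x))
  .reduceByKey(_ + _)      // shuffle dependency with zero map-side output
println(empty.count())     // 0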


Re: Scheduler hang?

2015-02-26 Thread Akhil Das
Not many that I know of, but I did bump into this one:
https://issues.apache.org/jira/browse/SPARK-4516

Thanks
Best Regards


Re: Scheduler hang?

2015-02-26 Thread Victor Tso-Guillen
I'd love to hear some input on this. I did get a standalone cluster up on my
local machine, and the problem didn't present itself there. I'm pretty confident
that means the problem is in the LocalBackend or something near it.
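
For anyone reproducing the comparison: the job was identical in both runs and
only the master URL changed, roughly as below. 7077 is the default standalone
master port; adjust if yours differs.

// Hangs for me:
val localConf = new org.apache.spark.SparkConf().setMaster("local[8]")
// Does not hang, against a standalone master started on the same machine:
val standaloneConf = new org.apache.spark.SparkConf().setMaster("spark://localhost:7077")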


Re: Scheduler hang?

2015-02-26 Thread Victor Tso-Guillen
Of course, setting breakpoints on every status update and every revive-offers
invocation kept the problem from happening. Where could the race be?


Scheduler hang?

2015-02-25 Thread Victor Tso-Guillen
I'm getting this really reliably on Spark 1.2.1. Basically, I'm in local mode
with parallelism set to 8. I have 222 tasks and I never seem to get far past 40;
usually somewhere in the 20s to 30s it will just hang. The last log output is
below, along with a screenshot of the UI.

2015-02-25 20:39:55.779 GMT-0800 INFO  [task-result-getter-3]
TaskSetManager - Finished task 3.0 in stage 16.0 (TID 22) in 612 ms on
localhost (1/5)
2015-02-25 20:39:55.825 GMT-0800 INFO  [Executor task launch worker-10]
Executor - Finished task 1.0 in stage 16.0 (TID 20). 2492 bytes result sent
to driver
2015-02-25 20:39:55.825 GMT-0800 INFO  [Executor task launch worker-8]
Executor - Finished task 2.0 in stage 16.0 (TID 21). 2492 bytes result sent
to driver
2015-02-25 20:39:55.831 GMT-0800 INFO  [task-result-getter-0]
TaskSetManager - Finished task 1.0 in stage 16.0 (TID 20) in 670 ms on
localhost (2/5)
2015-02-25 20:39:55.836 GMT-0800 INFO  [task-result-getter-1]
TaskSetManager - Finished task 2.0 in stage 16.0 (TID 21) in 674 ms on
localhost (3/5)
2015-02-25 20:39:55.891 GMT-0800 INFO  [Executor task launch worker-9]
Executor - Finished task 0.0 in stage 16.0 (TID 19). 2492 bytes result sent
to driver
2015-02-25 20:39:55.896 GMT-0800 INFO  [task-result-getter-2]
TaskSetManager - Finished task 0.0 in stage 16.0 (TID 19) in 740 ms on
localhost (4/5)

[image: screenshot of the Spark UI]
What should I make of this? Where do I start?

Thanks,
Victor


Re: Scheduler hang?

2015-02-25 Thread Akhil Das
What operation are you trying to do and how big is the data that you are
operating on?

Here are a few things you can try (a quick sketch putting them together follows
the list):

- Repartition the RDD to more partitions than 222
- Specify the master as local[*] or local[10]
- Use the Kryo serializer (.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))
- Enable RDD compression (.set("spark.rdd.compress", "true"))
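
Putting those together, a minimal sketch (the app name, input file, and
partition count are just examples):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("tuning-sketch")
  .setMaster("local[*]")  // or "local[10]"
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.rdd.compress", "true")

val sc = new SparkContext(conf)

// Repartition to more partitions than the current 222.
val repartitioned = sc.textFile("input.txt").repartition(400)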


Thanks
Best Regards
