Re: Some spark apps fail with "All masters are unresponsive", while others pass normally

2015-11-09 Thread Akhil Das
Is that all you have in the executor logs? I suspect some of those jobs are
having a hard time managing memory.

Thanks
Best Regards



Re: Some spark apps fail with "All masters are unresponsive", while others pass normally

2015-11-09 Thread Romi Kuntsman
If they had a problem managing memory, wouldn't there be an OOM?
Why does AppClient throw an NPE?

*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com



Re: Some spark apps fail with "All masters are unresponsive", while others pass normally

2015-11-09 Thread Akhil Das
Did you find anything regarding the OOM in the executor logs?

Thanks
Best Regards



Re: Some spark apps fail with "All masters are unresponsive", while others pass normally

2015-11-09 Thread Romi Kuntsman
I didn't see anything about an OOM.
This sometimes happens before the application has done anything at all, and
it hits a few applications at the same time - so I guess it's a
communication failure, but the problem is that the error shown doesn't
reflect the actual cause (which may be a network timeout etc.)

*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com



Re: Some spark apps fail with "All masters are unresponsive", while others pass normally

2015-11-09 Thread Tim Preece
Searching shows several people have hit this same NPE at AppClient.scala line
160 (perhaps because appId was null - could the application have been stopped
before it registered?)
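
If that's the case, the pattern would look something like the sketch below
(purely illustrative - this is not the actual Spark source): an actor handler
dereferences state that is only assigned once registration succeeds.

  import akka.actor.{Actor, ActorSelection}

  // Hypothetical message types for the illustration.
  case class RegisteredApplication(appId: String, masterUrl: String)
  case class ExecutorUpdated(id: Int, state: String)
  case class ExecutorStateChanged(id: Int, state: String)

  class ClientActor extends Actor {
    // Assigned only when RegisteredApplication arrives; null until then,
    // and conceivably null again after the application is torn down.
    private var master: ActorSelection = null

    def receive = {
      case RegisteredApplication(_, masterUrl) =>
        master = context.actorSelection(masterUrl)
      case ExecutorUpdated(id, state) =>
        // A message that races ahead of registration (or arrives after
        // the app was stopped) finds master == null and throws the NPE.
        master ! ExecutorStateChanged(id, state)
    }
  }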




-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Some spark apps fail with "All masters are unresponsive", while others pass normally

2015-11-01 Thread Romi Kuntsman
[adding the dev list since it's probably a bug, but I'm not sure how to
reproduce it, so I can't yet open a proper bug report]

Hi,

I have a standalone Spark 1.4.0 cluster with 100s of applications running
every day.

From time to time, applications crash with the error below. But at the same
time (and also afterwards), other applications keep running, so I can safely
assume the master and workers are working.

1. Why is there a NullPointerException? (I can't trace the stack frames back
to the Scala code, but an NPE is usually an obvious bug, even if the
underlying cause is actually a network error...)
2. Why can't it connect to the master? (If it's a network timeout, how do I
increase it? I see the values are hardcoded inside AppClient - see the
excerpt below.)
3. How do I recover from this error?
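
For reference, the hardcoded values I mean look roughly like this
(paraphrased from memory of the 1.4 AppClient source, so the exact names and
numbers may be slightly off):

  import scala.concurrent.duration._

  object AppClientDefaults {
    // Plain vals in AppClient, not read from SparkConf, so there is no
    // spark.* property that raises them.
    val REGISTRATION_TIMEOUT = 20.seconds
    val REGISTRATION_RETRIES = 3
  }

The error: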


  ERROR 01-11 15:32:54,991 SparkDeploySchedulerBackend - Application has
been killed. Reason: All masters are unresponsive! Giving up.
  ERROR 01-11 15:32:55,087 OneForOneStrategy - logs/error.log
  java.lang.NullPointerException
  at
org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$receiveWithLogging$1.applyOrElse(AppClient.scala:160)
  at
scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
  at
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
  at
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
  at
org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
  at
org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
  at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
  at
org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
  at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
  at
org.apache.spark.deploy.client.AppClient$ClientActor.aroundReceive(AppClient.scala:61)
  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
  at akka.actor.ActorCell.invoke(ActorCell.scala:487)
  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
  at akka.dispatch.Mailbox.run(Mailbox.scala:220)
  at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
  at
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
  at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
  at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
  at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
  ERROR 01-11 15:32:55,603 SparkContext - Error
initializing SparkContext.
  java.lang.IllegalStateException: Cannot call methods on a stopped
SparkContext
  at
org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
  at
org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1501)
  at
org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2005)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:543)
  at
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
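
Regarding question 3, the only mitigation I can think of so far is to retry
creating the context at the application level, along these lines (just a
sketch - the retry count and backoff are arbitrary, and a failed attempt may
leave a half-initialized context that has to be cleaned up first):

  import org.apache.spark.{SparkConf, SparkContext}

  object ContextRetry {
    def createWithRetry(conf: SparkConf, retries: Int = 3): SparkContext =
      try new SparkContext(conf) catch {
        case e: Exception if retries > 0 =>
          Thread.sleep(10000L) // arbitrary backoff before retrying
          createWithRetry(conf, retries - 1)
      }
  }

Is there a better way to handle it?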


Thanks!

*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com