Re: Some spark apps fail with "All masters are unresponsive", while others pass normally
Is that all you have in the executor logs? I suspect some of those jobs are having a hard time managing the memory.

Thanks
Best Regards

On Sun, Nov 1, 2015 at 9:38 PM, Romi Kuntsman wrote:
> ERROR 01-11 15:32:54,991 SparkDeploySchedulerBackend - Application
> has been killed. Reason: All masters are unresponsive! Giving up.
Re: Some spark apps fail with "All masters are unresponsive", while others pass normally
Did you find anything regarding the OOM in the executor logs?

Thanks
Best Regards

On Mon, Nov 9, 2015 at 8:44 PM, Romi Kuntsman wrote:
> If they have a problem managing memory, shouldn't there be an OOM?
> Why does AppClient throw an NPE?
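One way to answer the executor-log question is to scan the log lines for the JVM's out-of-memory marker. The sketch below is only an illustration: the class and method names (`ExecutorLogScan`, `findOomLines`) are hypothetical, and where the executor logs live depends on the deployment, so the caller is assumed to have already read the lines from the relevant stderr file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Spark: filters log lines that mention
// the JVM's java.lang.OutOfMemoryError.
final class ExecutorLogScan {

    // Returns every line that contains the OutOfMemoryError class name,
    // in the order the lines appeared in the input.
    static List<String> findOomLines(List<String> logLines) {
        List<String> hits = new ArrayList<>();
        for (String line : logLines) {
            if (line.contains("java.lang.OutOfMemoryError")) {
                hits.add(line);
            }
        }
        return hits;
    }
}
```

An empty result would suggest the executors did not die from heap exhaustion, though it says nothing about why the driver lost contact with the master.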
Re: Some spark apps fail with "All masters are unresponsive", while others pass normally
If they have a problem managing memory, shouldn't there be an OOM?
Why does AppClient throw an NPE?

*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com

On Mon, Nov 9, 2015 at 4:59 PM, Akhil Das wrote:
> Is that all you have in the executor logs? I suspect some of those jobs
> are having a hard time managing the memory.
>
> Thanks
> Best Regards
Some spark apps fail with "All masters are unresponsive", while others pass normally
[adding dev list since it's probably a bug, but I'm not sure how to reproduce it so I can open a bug about it]

Hi,

I have a standalone Spark 1.4.0 cluster with 100s of applications running every day.

From time to time, the applications crash with the following error (see below).
But at the same time (and also after that), other applications are running, so I can safely assume the master and workers are working.

1. Why is there a NullPointerException? (I can't trace the Scala stack trace back to the code, but an NPE is usually an obvious bug, even if there's actually a network error...)
2. Why can't it connect to the master? (If it's a network timeout, how can I increase it? I see the values are hardcoded inside AppClient.)
3. How can I recover from this error?

ERROR 01-11 15:32:54,991 SparkDeploySchedulerBackend - Application has been killed. Reason: All masters are unresponsive! Giving up.
ERROR ERROR 01-11 15:32:55,087 OneForOneStrategy -
ERROR logs/error.log
java.lang.NullPointerException NullPointerException
    at org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$receiveWithLogging$1.applyOrElse(AppClient.scala:160)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
    at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
    at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
    at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    at org.apache.spark.deploy.client.AppClient$ClientActor.aroundReceive(AppClient.scala:61)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
ERROR 01-11 15:32:55,603 SparkContext - Error initializing SparkContext.
ERROR java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext
    at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
    at org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1501)
    at org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2005)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:543)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)

Thanks!

*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
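On question 3, one possible driver-side workaround is to retry context creation, since the trace shows the failure surfacing as an exception from the JavaSparkContext constructor. This is a minimal sketch, not something the thread confirms: the names (`RetryStartup`, `withRetries`) are hypothetical, it is written against a plain `Supplier` so it runs without Spark, and it is an assumption that constructing a fresh context after such a failure is safe on 1.4.0. It does not address the NullPointerException in AppClient.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of Spark: retries a factory call a fixed
// number of times with a linearly growing pause between attempts.
final class RetryStartup {

    // Calls factory.get() up to maxAttempts times. After a failed attempt it
    // sleeps backoffMillis * attemptNumber, then tries again. If every
    // attempt fails, the last exception is rethrown.
    static <T> T withRetries(Supplier<T> factory, int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return factory.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(backoffMillis * attempt);
                    } catch (InterruptedException ie) {
                        // Restore the interrupt flag and stop retrying.
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("interrupted while waiting to retry", ie);
                    }
                }
            }
        }
        throw last;
    }
}
```

In a driver this might be called as `RetryStartup.withRetries(() -> new JavaSparkContext(conf), 3, 5000L)`, where `conf` is the application's existing SparkConf. Retrying only hides the symptom; if the master really is unreachable for longer than the retry window, the application still fails.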