I didn't see anything about an OOM. This sometimes happens before anything in the application has happened, and it happens to a few applications at the same time - so I guess it's a communication failure. The problem is that the error shown doesn't represent the actual problem (which may be a network timeout, etc.).
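In case it is indeed a network timeout, below is a minimal sketch of the timeout-related settings I would try raising first on the driver. The property names are the configurable ones in Spark 1.4 (spark.akka.timeout, spark.akka.heartbeat.interval, spark.network.timeout); the values and the master URL/app name are just placeholders, and as noted in the thread below, the AppClient registration timeout/retries themselves appear to be hardcoded, so this may not cover the failing code path:

```scala
import org.apache.spark.SparkConf

// Sketch only: raise the Akka/network timeouts that are configurable in Spark 1.4.
// The registration timeout and retry count inside AppClient are hardcoded, so this
// may not affect the exact failure seen here.
val conf = new SparkConf()
  .setAppName("my-app")                          // placeholder
  .setMaster("spark://master-host:7077")         // placeholder
  .set("spark.akka.timeout", "300")              // seconds; default is 100
  .set("spark.akka.heartbeat.interval", "1000")  // seconds
  .set("spark.network.timeout", "300s")          // default is 120s
```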
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com

On Mon, Nov 9, 2015 at 6:00 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

> Did you find anything regarding the OOM in the executor logs?
>
> Thanks
> Best Regards
>
> On Mon, Nov 9, 2015 at 8:44 PM, Romi Kuntsman <r...@totango.com> wrote:
>
>> If they had a problem managing memory, wouldn't there be an OOM? Why does
>> AppClient throw an NPE?
>>
>> *Romi Kuntsman*, *Big Data Engineer*
>> http://www.totango.com
>>
>> On Mon, Nov 9, 2015 at 4:59 PM, Akhil Das <ak...@sigmoidanalytics.com>
>> wrote:
>>
>>> Is that all you have in the executor logs? I suspect some of those jobs
>>> are having a hard time managing the memory.
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Sun, Nov 1, 2015 at 9:38 PM, Romi Kuntsman <r...@totango.com> wrote:
>>>
>>>> [adding the dev list since it's probably a bug, but I'm not sure how to
>>>> reproduce it so that I can open a bug about it]
>>>>
>>>> Hi,
>>>>
>>>> I have a standalone Spark 1.4.0 cluster with hundreds of applications
>>>> running every day.
>>>>
>>>> From time to time, the applications crash with the error below. But at
>>>> the same time (and also after that), other applications are running, so I
>>>> can safely assume the master and workers are working.
>>>>
>>>> 1. Why is there a NullPointerException? (I can't trace the Scala stack
>>>> trace back to the code, but an NPE is usually an obvious bug, even if the
>>>> underlying cause is actually a network error...)
>>>> 2. Why can't it connect to the master? (If it's a network timeout, how
>>>> can I increase it? I see the values are hardcoded inside AppClient.)
>>>> 3. How can I recover from this error?
>>>>
>>>> ERROR 01-11 15:32:54,991 SparkDeploySchedulerBackend - Application
>>>> has been killed. Reason: All masters are unresponsive! Giving up.
>>>> ERROR 01-11 15:32:55,087 OneForOneStrategy -
>>>> java.lang.NullPointerException
>>>> at org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$receiveWithLogging$1.applyOrElse(AppClient.scala:160)
>>>> at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>>>> at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>>>> at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>>>> at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
>>>> at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
>>>> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>>>> at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
>>>> at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>>>> at org.apache.spark.deploy.client.AppClient$ClientActor.aroundReceive(AppClient.scala:61)
>>>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>>> at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>>>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>>>> at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>>>> at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
>>>> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>> ERROR 01-11 15:32:55,603 SparkContext - Error initializing SparkContext.
>>>> java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext
>>>> at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
>>>> at org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1501)
>>>> at org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2005)
>>>> at org.apache.spark.SparkContext.<init>(SparkContext.scala:543)
>>>> at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
>>>>
>>>> Thanks!
>>>>
>>>> *Romi Kuntsman*, *Big Data Engineer*
>>>> http://www.totango.com
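On question 3 in the quoted message (how to recover): since the failure surfaces on the driver as the SparkContext constructor failing with a killed/stopped context, one workaround might be to simply retry context creation a few times before giving up. A minimal sketch, assuming the application can tolerate a delayed start and that the partially initialized context doesn't block a second attempt in the same JVM; `createContextWithRetry` and the backoff value are just illustrations, not an established pattern:

```scala
import org.apache.spark.{SparkConf, SparkContext}

import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

// Hypothetical driver-side workaround: retry SparkContext creation when the
// master is temporarily unreachable and the context comes up killed/stopped.
@tailrec
def createContextWithRetry(conf: SparkConf, attemptsLeft: Int): SparkContext =
  Try(new SparkContext(conf)) match {
    case Success(sc) => sc
    case Failure(e) if attemptsLeft > 1 =>
      // e would typically be the IllegalStateException / killed-application error above
      Thread.sleep(30000) // back off before retrying (30s, arbitrary)
      createContextWithRetry(conf, attemptsLeft - 1)
    case Failure(e) => throw e
  }

val sc = createContextWithRetry(new SparkConf().setAppName("my-app"), attemptsLeft = 3)
```

This only masks the symptom, of course; it doesn't explain why AppClient throws an NPE instead of reporting the underlying connection failure.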