I am running on a Spark 1.5.1 cluster managed by Mesos. I have an application that handles a chemistry problem whose size can be scaled up by increasing the number of atoms, which increases the number of Spark stages. I do a repartition at each stage, and Stage 9 is the last stage. At each stage the size and complexity increase by a factor of 8 or so.

Problems with 8 stages run with no difficulty. Ones with 9 stages never work: they always crash in a manner similar to the stack dump below (sorry for the length, but NONE of the steps are mine). I do not see any slaves throwing an exception (which would produce different errors anyway). I am completely baffled and believe the error is in something Spark is doing.

I use 7000 or so tasks to try to divide the work. I see the same issue when I cut the parallelism to 256, but the tasks run longer. My mean task takes about 5 minutes (oh yes, I expect the job to take about 8 hours on my 15-node cluster). Any bright ideas?
[Stage 9:======================================> (5827 + 60) / 7776]
Exception in thread "main" org.apache.spark.SparkException: Job 0 cancelled because Stage 9 was cancelled
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
    at org.apache.spark.scheduler.DAGScheduler.handleJobCancellation(DAGScheduler.scala:1229)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleStageCancellation$1.apply$mcVI$sp(DAGScheduler.scala:1217)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleStageCancellation$1.apply(DAGScheduler.scala:1216)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleStageCancellation$1.apply(DAGScheduler.scala:1216)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofInt.foreach(ArrayOps.scala:156)
    at org.apache.spark.scheduler.DAGScheduler.handleStageCancellation(DAGScheduler.scala:1216)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1469)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1835)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1848)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1919)
    at org.apache.spark.rdd.RDD.count(RDD.scala:1121)
    at org.apache.spark.api.java.JavaRDDLike$class.count(JavaRDDLike.scala:445)
    at org.apache.spark.api.java.AbstractJavaRDDLike.count(JavaRDDLike.scala:47)
    at com.lordjoe.molgen.SparkAtomGenerator.run(SparkAtomGenerator.java:150)
    at com.lordjoe.molgen.SparkAtomGenerator.run(SparkAtomGenerator.java:110)
    at com.lordjoe.molgen.VariantCounter.main(VariantCounter.java:80)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
15/12/14 09:53:20 WARN ServletHandler: /stages/stage/kill/
java.lang.InterruptedException: sleep interrupted
    at java.lang.Thread.sleep(Native Method)
    at org.apache.spark.ui.jobs.StagesTab.handleKillRequest(StagesTab.scala:49)
    at org.apache.spark.ui.SparkUI$$anonfun$3.apply(SparkUI.scala:71)
    at org.apache.spark.ui.SparkUI$$anonfun$3.apply(SparkUI.scala:71)
    at org.apache.spark.ui.JettyUtils$$anon$2.doRequest(JettyUtils.scala:141)
    at org.apache.spark.ui.JettyUtils$$anon$2.doGet(JettyUtils.scala:128)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
    at org.spark-project.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
    at org.spark-project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
    at org.spark-project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
    at org.spark-project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
    at org.spark-project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
    at org.spark-project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.spark-project.jetty.server.handler.GzipHandler.handle(GzipHandler.java:264)
    at org.spark-project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.spark-project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.spark-project.jetty.server.Server.handle(Server.java:370)
    at org.spark-project.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
    at org.spark-project.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
    at org.spark-project.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
    at org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:644)
    at org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
    at org.spark-project.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
    at org.spark-project.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
    at org.spark-project.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
    at org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Thread.java:745)
I1214 09:53:20.040680 31127 sched.cpp:1589] Asked to stop the driver
I1214 09:53:20.040848 22738 sched.cpp:831] Stopping framework '20151020-114053-711206558-5050-2549-0220'