Hi,

I solved it by increasing the Akka timeout.

All the best,

2016-06-28 15:04 GMT+02:00 ANDREA SPINA <74...@studenti.unimore.it>:
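For reference, a change of this kind can be made in the same config style as below. The value 600 is only illustrative, not the one I claim is required; on Spark 1.4 the relevant settings are spark.akka.timeout (seconds, default 100) and spark.network.timeout (seconds, default 120), which serves as the fallback for several other network timeouts:

    # illustrative values; tune to your workload
    spark.akka.timeout = "600"
    spark.network.timeout = "600"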
> Hello everyone,
>
> I am running some experiments with Spark 1.4.0 on a ~80 GiB dataset located
> on hdfs-2.7.1. The environment is a 25-node cluster with 16 cores per node.
> I set the following params:
>
> spark.master = "spark://"${runtime.hostname}":7077"
>
> # 28 GiB of memory
> spark.executor.memory = "28672m"
> spark.worker.memory = "28672m"
> spark.driver.memory = "2048m"
>
> spark.driver.maxResultSize = "0"
>
> I ran some scaling experiments, varying the number of machines.
> Experiments with the whole set of 25 nodes, and also with 20 nodes,
> complete successfully. Experiments on 5-node and 10-node environments
> relentlessly fail. During the run, the Spark executors begin to accumulate
> failed jobs from different stages and end with the following trace:
>
> 16/06/28 03:11:09 INFO DAGScheduler: Job 14 failed: reduce at
> sGradientDescent.scala:229, took 1778.508309 s
> Exception in thread "main" org.apache.spark.SparkException: Job aborted
> due to stage failure: Task 212 in stage 14.0 failed 4 times, most recent
> failure: Lost task 212.3 in stage 14.0 (TID 12278, 130.149.21.19):
> java.io.IOException: Connection from /130.149.21.16:35997 closed
>     at org.apache.spark.network.client.TransportResponseHandler.channelUnregistered(TransportResponseHandler.java:104)
>     at org.apache.spark.network.server.TransportChannelHandler.channelUnregistered(TransportChannelHandler.java:91)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:183)
>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:169)
>     at io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:183)
>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:169)
>     at io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:183)
>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:169)
>     at io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:183)
>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:169)
>     at io.netty.channel.DefaultChannelPipeline.fireChannelUnregistered(DefaultChannelPipeline.java:738)
>     at io.netty.channel.AbstractChannel$AbstractUnsafe$6.run(AbstractChannel.java:606)
>     at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:380)
>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
>     at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
>     at java.lang.Thread.run(Thread.java:745)
>
> Driver stacktrace:
>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
>     at scala.Option.foreach(Option.scala:236)
>     at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
>     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>
> Here
> <https://dl.dropboxusercontent.com/u/78598929/spark-hadoop-org.apache.spark.deploy.master.Master-1-cloud-11.log>
> is the full Master log.
> In addition, each Worker receives signal SIGTERM: 15.
>
> I can't figure out a solution.
> Thank you. Regards,
>
> Andrea
>
>
> --
> *Andrea Spina*
> N.Tessera: *74598*
> MAT: *89369*
> *Ingegneria Informatica* *[LM]* (D.M. 270)

--
*Andrea Spina*
N.Tessera: *74598*
MAT: *89369*
*Ingegneria Informatica* *[LM]* (D.M. 270)