Hi All,

We have set up a 2-node cluster (NODE-DSRV05 and NODE-DSRV02). Each node has 32 GB of RAM, 1 TB of hard disk capacity, and 8 CPU cores. We have set up HDFS with 2 TB of capacity and a block size of 256 MB. When we try to process a 1 GB file on Spark, we see the following exception:

14/11/14 17:01:42 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, NODE-DSRV05.impetus.co.in, NODE_LOCAL, 1667 bytes)
14/11/14 17:01:42 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, NODE-DSRV05.impetus.co.in, NODE_LOCAL, 1667 bytes)
14/11/14 17:01:42 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, NODE-DSRV05.impetus.co.in, NODE_LOCAL, 1667 bytes)
14/11/14 17:01:43 INFO cluster.SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@IMPETUS-DSRV02:41124/user/Executor#539551156] with ID 0
14/11/14 17:01:43 INFO storage.BlockManagerMasterActor: Registering block manager NODE-DSRV05.impetus.co.in:60432 with 2.1 GB RAM
14/11/14 17:01:43 INFO storage.BlockManagerMasterActor: Registering block manager NODE-DSRV02:47844 with 2.1 GB RAM
14/11/14 17:01:43 INFO network.ConnectionManager: Accepted connection from [ NODE-DSRV05.impetus.co.in/192.168.145.195:51447]
14/11/14 17:01:43 INFO network.SendingConnection: Initiating connection to [ NODE-DSRV05.impetus.co.in/192.168.145.195:60432]
14/11/14 17:01:43 INFO network.SendingConnection: Connected to [ NODE-DSRV05.impetus.co.in/192.168.145.195:60432], 1 messages pending
14/11/14 17:01:43 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on NODE-DSRV05.impetus.co.in:60432 (size: 17.1 KB, free: 2.1 GB)
14/11/14 17:01:43 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on NODE-DSRV05.impetus.co.in:60432 (size: 14.1 KB, free: 2.1 GB)
14/11/14 17:01:44 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, NODE-DSRV05.impetus.co.in): java.lang.NullPointerException:
        org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:609)
        org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:609)
        org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        org.apache.spark.scheduler.Task.run(Task.scala:54)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        java.lang.Thread.run(Thread.java:722)
14/11/14 17:01:44 INFO scheduler.TaskSetManager: Starting task 0.1 in stage 0.0 (TID 3, NODE-DSRV05.impetus.co.in, NODE_LOCAL, 1667 bytes)
14/11/14 17:01:44 INFO scheduler.TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1) on executor NODE-DSRV05.impetus.co.in: java.lang.NullPointerException (null) [duplicate 1]
14/11/14 17:01:44 INFO scheduler.TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2) on executor NODE-DSRV05.impetus.co.in: java.lang.NullPointerException (null) [duplicate 2]
14/11/14 17:01:44 INFO scheduler.TaskSetManager: Starting task 2.1 in stage 0.0 (TID 4, NODE-DSRV05.impetus.co.in, NODE_LOCAL, 1667 bytes)
14/11/14 17:01:44 INFO scheduler.TaskSetManager: Starting task 1.1 in stage 0.0 (TID 5, NODE-DSRV02, NODE_LOCAL, 1667 bytes)
14/11/14 17:01:44 INFO scheduler.TaskSetManager: Lost task 0.1 in stage 0.0 (TID 3) on executor NODE-DSRV05.impetus.co.in: java.lang.NullPointerException (null) [duplicate 3]
14/11/14 17:01:44 INFO scheduler.TaskSetManager: Starting task 0.2 in stage 0.0 (TID 6, NODE-DSRV02, NODE_LOCAL, 1667 bytes)
14/11/14 17:01:44 INFO scheduler.TaskSetManager: Lost task 2.1 in stage 0.0 (TID 4) on executor NODE-DSRV05.impetus.co.in: java.lang.NullPointerException (null) [duplicate 4]
14/11/14 17:01:44 INFO scheduler.TaskSetManager: Starting task 2.2 in stage 0.0 (TID 7, NODE-DSRV02, NODE_LOCAL, 1667 bytes)

What I see is that it could not launch tasks on NODE-DSRV05 and is processing them on a single node, i.e. NODE-DSRV02. When we tried with 360 MB of data, I did not see any exception, but the entire processing was done by only one node. I couldn't figure out where the issue lies. Any suggestions on what kind of situations might cause such an issue?

Thanks,
Padma Ch