Nishkam Ravi created SPARK-2398:
-----------------------------------

             Summary: Trouble running Spark 1.0 on Yarn
                 Key: SPARK-2398
                 URL: https://issues.apache.org/jira/browse/SPARK-2398
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.0.0
            Reporter: Nishkam Ravi
Trouble running workloads in Spark-on-YARN cluster mode for Spark 1.0. For example: SparkPageRank, when run in standalone mode, goes through without any errors (tested for up to a 30GB input dataset on a 6-node cluster). It also runs fine for a 1GB dataset in yarn cluster mode, but starts to choke (in yarn cluster mode) as the input data size is increased. Confirmed for a 16GB input dataset. The same workload runs fine with Spark 0.9 in both standalone and yarn cluster mode (for up to a 30GB input dataset on a 6-node cluster).

Command line used:

/opt/cloudera/parcels/CDH/lib/spark/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --properties-file pagerank.conf \
  --driver-memory 30g \
  --driver-cores 16 \
  --num-executors 5 \
  --class org.apache.spark.examples.SparkPageRank \
  /opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0-SNAPSHOT.jar \
  pagerank_in $NUM_ITER

pagerank.conf:

spark.master                             spark://c1704.halxg.cloudera.com:7077
spark.home                               /opt/cloudera/parcels/CDH/lib/spark
spark.executor.memory                    32g
spark.default.parallelism                118
spark.cores.max                          96
spark.storage.memoryFraction             0.6
spark.shuffle.memoryFraction             0.3
spark.shuffle.compress                   true
spark.shuffle.spill.compress             true
spark.broadcast.compress                 true
spark.rdd.compress                       false
spark.io.compression.codec               org.apache.spark.io.LZFCompressionCodec
spark.io.compression.snappy.block.size   32768
spark.reducer.maxMbInFlight              48
spark.local.dir                          /var/lib/jenkins/workspace/tmp
spark.driver.memory                      30g
spark.executor.cores                     16
spark.locality.wait                      6000
spark.executor.instances                 5

The UI shows ExecutorLostFailure.
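For reference, the SparkPageRank workload named above is the classic iterative PageRank loop: each iteration, every page splits its rank evenly across its outgoing links, and new ranks are computed with a 0.15/0.85 damping factor. Below is a minimal sketch of that algorithm in plain Scala collections (no Spark), only to illustrate the shape of the computation; the graph, iteration count, and object name are illustrative, not the reported dataset or the example's exact source.

```scala
// Hypothetical sketch: the PageRank iteration the SparkPageRank example performs,
// expressed over in-memory Scala collections instead of RDDs.
object PageRankSketch {
  def pageRank(links: Map[String, Seq[String]], iters: Int): Map[String, Double] = {
    // Every page starts with rank 1.0 (as in the Spark example).
    var ranks: Map[String, Double] = links.keys.map(_ -> 1.0).toMap
    for (_ <- 1 to iters) {
      // Each page contributes rank / outDegree to every page it links to.
      val contribs: Seq[(String, Double)] = links.toSeq.flatMap { case (page, outs) =>
        outs.map(dest => dest -> ranks(page) / outs.size)
      }
      // Sum contributions per destination page.
      val summed: Map[String, Double] = contribs.groupMapReduce(_._1)(_._2)(_ + _)
      // Damped update: 0.15 + 0.85 * (sum of incoming contributions).
      ranks = ranks.map { case (page, _) =>
        page -> (0.15 + 0.85 * summed.getOrElse(page, 0.0))
      }
    }
    ranks
  }
}
```

In Spark, `contribs` and the per-page sum correspond to a `flatMap` followed by `reduceByKey`, which is the shuffle stage where the executor losses above are being reported.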
Yarn logs contain numerous exceptions:

14/07/07 17:59:49 WARN network.SendingConnection: Error writing in connection to ConnectionManagerId(a1016.halxg.cloudera.com,54105)
java.nio.channels.AsynchronousCloseException
    at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:205)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:496)
    at org.apache.spark.network.SendingConnection.write(Connection.scala:361)
    at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:142)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

--------

java.io.IOException: Filesystem closed
    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
    at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619)
    at java.io.FilterInputStream.close(FilterInputStream.java:181)
    at org.apache.hadoop.util.LineReader.close(LineReader.java:150)
    at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:244)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226)
    at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
    at org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197)
    at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
    at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

-------

14/07/07 17:59:52 WARN network.SendingConnection: Error finishing connection to a1016.halxg.cloudera.com/10.20.184.116:54105
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
    at org.apache.spark.network.SendingConnection.finishConnect(Connection.scala:313)
    at org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

--
This message was sent by Atlassian JIRA
(v6.2#6252)