[ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14055931#comment-14055931 ]
Guoqiang Li commented on SPARK-2398:
------------------------------------

The root cause is that the Java process's total memory use grows beyond what {{-Xmx}} is set to (the JVM uses off-heap memory in addition to the heap), so YARN kills the container. This problem also exists in Hadoop. Increasing the value of {{spark.yarn.executor.memoryOverhead}} can solve the problem; a concrete sketch of the change follows the quoted issue below.

> Trouble running Spark 1.0 on Yarn
> ----------------------------------
>
>                 Key: SPARK-2398
>                 URL: https://issues.apache.org/jira/browse/SPARK-2398
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.0
>            Reporter: Nishkam Ravi
>
> Trouble running workloads in Spark-on-YARN cluster mode for Spark 1.0.
> For example: SparkPageRank, when run in standalone mode, goes through without any errors (tested for up to a 30GB input dataset on a 6-node cluster). It also runs fine for a 1GB dataset in yarn cluster mode, but starts to choke (in yarn cluster mode) as the input data size is increased. Confirmed for a 16GB input dataset.
> The same workload runs fine with Spark 0.9 in both standalone and yarn cluster mode (for up to a 30GB input dataset on a 6-node cluster).
> Commandline used:
> (/opt/cloudera/parcels/CDH/lib/spark/bin/spark-submit --master yarn --deploy-mode cluster --properties-file pagerank.conf --driver-memory 30g --driver-cores 16 --num-executors 5 --class org.apache.spark.examples.SparkPageRank /opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0-SNAPSHOT.jar pagerank_in $NUM_ITER)
> pagerank.conf:
> spark.master spark://c1704.halxg.cloudera.com:7077
> spark.home /opt/cloudera/parcels/CDH/lib/spark
> spark.executor.memory 32g
> spark.default.parallelism 118
> spark.cores.max 96
> spark.storage.memoryFraction 0.6
> spark.shuffle.memoryFraction 0.3
> spark.shuffle.compress true
> spark.shuffle.spill.compress true
> spark.broadcast.compress true
> spark.rdd.compress false
> spark.io.compression.codec org.apache.spark.io.LZFCompressionCodec
> spark.io.compression.snappy.block.size 32768
> spark.reducer.maxMbInFlight 48
> spark.local.dir /var/lib/jenkins/workspace/tmp
> spark.driver.memory 30g
> spark.executor.cores 16
> spark.locality.wait 6000
> spark.executor.instances 5
> UI shows ExecutorLostFailure.
> Yarn logs contain numerous exceptions:
> 14/07/07 17:59:49 WARN network.SendingConnection: Error writing in connection to ConnectionManagerId(a1016.halxg.cloudera.com,54105)
> java.nio.channels.AsynchronousCloseException
>         at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:205)
>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:496)
>         at org.apache.spark.network.SendingConnection.write(Connection.scala:361)
>         at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:142)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> --------
> java.io.IOException: Filesystem closed
>         at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
>         at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619)
>         at java.io.FilterInputStream.close(FilterInputStream.java:181)
>         at org.apache.hadoop.util.LineReader.close(LineReader.java:150)
>         at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:244)
>         at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226)
>         at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
>         at org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197)
>         at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
>         at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>         at org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
>         at org.apache.spark.scheduler.Task.run(Task.scala:51)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> -------
> 14/07/07 17:59:52 WARN network.SendingConnection: Error finishing connection to a1016.halxg.cloudera.com/10.20.184.116:54105
> java.net.ConnectException: Connection refused
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>         at org.apache.spark.network.SendingConnection.finishConnect(Connection.scala:313)
>         at org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:203)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
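To make the suggested fix concrete: a minimal sketch of the change, written as additions to the reporter's {{pagerank.conf}} above. The 2048 figure is illustrative only, not a recommendation; the right amount of headroom depends on the workload.

{code}
# Additions to pagerank.conf (the 2048 value is illustrative).
# YARN kills any container whose total process memory (heap + off-heap)
# exceeds its allocation, so the container must be requested larger
# than -Xmx alone:
#   container size ~= spark.executor.memory + spark.yarn.executor.memoryOverhead
# In this version of Spark the overhead value is interpreted as megabytes.
spark.yarn.executor.memoryOverhead 2048
{code}

Since the spark-submit command above already passes {{--properties-file pagerank.conf}}, no other change should be needed; re-running the same command picks up the new setting.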