Nishkam Ravi created SPARK-2398:
-----------------------------------

             Summary: Trouble running Spark 1.0 on Yarn 
                 Key: SPARK-2398
                 URL: https://issues.apache.org/jira/browse/SPARK-2398
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.0.0
            Reporter: Nishkam Ravi


Trouble running workloads in Spark-on-YARN cluster mode with Spark 1.0.

For example, SparkPageRank runs without errors in standalone mode (tested with up to a 30 GB input dataset on a 6-node cluster), and also runs fine in yarn-cluster mode with a 1 GB dataset. In yarn-cluster mode, however, it starts to fail as the input size is increased; confirmed with a 16 GB input dataset.

The same workload runs fine with Spark 0.9 in both standalone and yarn-cluster mode (with up to a 30 GB input dataset on the same 6-node cluster).

Command line used:

/opt/cloudera/parcels/CDH/lib/spark/bin/spark-submit \
  --master yarn --deploy-mode cluster \
  --properties-file pagerank.conf \
  --driver-memory 30g --driver-cores 16 --num-executors 5 \
  --class org.apache.spark.examples.SparkPageRank \
  /opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0-SNAPSHOT.jar \
  pagerank_in $NUM_ITER
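
For comparison, the standalone-mode run (which succeeds) differs only in the master/deploy-mode flags; a sketch, assuming the same jar, conf file, and $NUM_ITER as above (the master URL is the one from pagerank.conf):

```shell
# Standalone-mode run for comparison (succeeds with up to 30 GB input).
# Same jar, properties file, and $NUM_ITER as the yarn-cluster command above.
/opt/cloudera/parcels/CDH/lib/spark/bin/spark-submit \
  --master spark://c1704.halxg.cloudera.com:7077 \
  --properties-file pagerank.conf \
  --class org.apache.spark.examples.SparkPageRank \
  /opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0-SNAPSHOT.jar \
  pagerank_in $NUM_ITER
```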

pagerank.conf:

spark.master            spark://c1704.halxg.cloudera.com:7077
spark.home      /opt/cloudera/parcels/CDH/lib/spark
spark.executor.memory   32g
spark.default.parallelism       118
spark.cores.max 96
spark.storage.memoryFraction    0.6
spark.shuffle.memoryFraction    0.3
spark.shuffle.compress  true
spark.shuffle.spill.compress    true
spark.broadcast.compress        true
spark.rdd.compress      false
spark.io.compression.codec      org.apache.spark.io.LZFCompressionCodec
spark.io.compression.snappy.block.size  32768
spark.reducer.maxMbInFlight     48
spark.local.dir  /var/lib/jenkins/workspace/tmp
spark.driver.memory     30g
spark.executor.cores    16
spark.locality.wait     6000
spark.executor.instances        5

The UI shows ExecutorLostFailure, and the YARN logs contain numerous exceptions:

14/07/07 17:59:49 WARN network.SendingConnection: Error writing in connection to ConnectionManagerId(a1016.halxg.cloudera.com,54105)
java.nio.channels.AsynchronousCloseException
        at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:205)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:496)
        at org.apache.spark.network.SendingConnection.write(Connection.scala:361)
        at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:142)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

--------

java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
        at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619)
        at java.io.FilterInputStream.close(FilterInputStream.java:181)
        at org.apache.hadoop.util.LineReader.close(LineReader.java:150)
        at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:244)
        at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226)
        at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
        at org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197)
        at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
        at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
        at org.apache.spark.scheduler.Task.run(Task.scala:51)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
-------

14/07/07 17:59:52 WARN network.SendingConnection: Error finishing connection to a1016.halxg.cloudera.com/10.20.184.116:54105
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
        at org.apache.spark.network.SendingConnection.finishConnect(Connection.scala:313)
        at org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:203)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
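
For anyone triaging, the full aggregated container logs (beyond the excerpts above) can be pulled from YARN once the application finishes; the application ID below is a placeholder and should be taken from the ResourceManager UI or the spark-submit output of the failing run:

```shell
# Application ID is hypothetical -- substitute the one shown in the
# ResourceManager UI / spark-submit output for the failing run.
APP_ID=application_XXXXXXXXXXXXX_XXXX

# Fetch aggregated logs for all containers
# (requires yarn.log-aggregation-enable=true on the cluster).
yarn logs -applicationId "$APP_ID" > "${APP_ID}.log"

# Look for the executor-loss symptoms seen above.
grep -nE "ExecutorLostFailure|Filesystem closed|Connection refused" "${APP_ID}.log"
```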



--
This message was sent by Atlassian JIRA
(v6.2#6252)