[jira] [Commented] (SPARK-2398) Trouble running Spark 1.0 on Yarn
[ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14211896#comment-14211896 ]

Nishkam Ravi commented on SPARK-2398:

[~srowen] Yes, this has been resolved by changing the YARN memory overhead from a constant additive term to a multiplier of the executor memory, as we had discussed.

> Trouble running Spark 1.0 on Yarn
> ---------------------------------
>
> Key: SPARK-2398
> URL: https://issues.apache.org/jira/browse/SPARK-2398
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.0.0
> Reporter: Nishkam Ravi
>
> Trouble running workloads in Spark-on-YARN cluster mode for Spark 1.0.
> For example: SparkPageRank runs without errors in standalone mode (tested for up to a 30 GB input dataset on a 6-node cluster). It also runs fine for a 1 GB dataset in yarn-cluster mode, but starts to choke (in yarn-cluster mode) as the input data size is increased; confirmed for a 16 GB input dataset.
> The same workload runs fine with Spark 0.9 in both standalone and yarn-cluster mode (for up to a 30 GB input dataset on a 6-node cluster).
> Command line used:
> /opt/cloudera/parcels/CDH/lib/spark/bin/spark-submit --master yarn \
>   --deploy-mode cluster --properties-file pagerank.conf \
>   --driver-memory 30g --driver-cores 16 --num-executors 5 \
>   --class org.apache.spark.examples.SparkPageRank \
>   /opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0-SNAPSHOT.jar \
>   pagerank_in $NUM_ITER
>
> pagerank.conf:
> spark.master                             spark://c1704.halxg.cloudera.com:7077
> spark.home                               /opt/cloudera/parcels/CDH/lib/spark
> spark.executor.memory                    32g
> spark.default.parallelism                118
> spark.cores.max                          96
> spark.storage.memoryFraction             0.6
> spark.shuffle.memoryFraction             0.3
> spark.shuffle.compress                   true
> spark.shuffle.spill.compress             true
> spark.broadcast.compress                 true
> spark.rdd.compress                       false
> spark.io.compression.codec               org.apache.spark.io.LZFCompressionCodec
> spark.io.compression.snappy.block.size   32768
> spark.reducer.maxMbInFlight              48
> spark.local.dir                          /var/lib/jenkins/workspace/tmp
> spark.driver.memory                      30g
> spark.executor.cores                     16
> spark.locality.wait                      6000
> spark.executor.instances                 5
>
> UI shows ExecutorLostFailure.
> Yarn logs contain numerous exceptions:
>
> 14/07/07 17:59:49 WARN network.SendingConnection: Error writing in connection to ConnectionManagerId(a1016.halxg.cloudera.com,54105)
> java.nio.channels.AsynchronousCloseException
>         at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:205)
>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:496)
>         at org.apache.spark.network.SendingConnection.write(Connection.scala:361)
>         at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:142)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
> java.io.IOException: Filesystem closed
>         at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
>         at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619)
>         at java.io.FilterInputStream.close(FilterInputStream.java:181)
>         at org.apache.hadoop.util.LineReader.close(LineReader.java:150)
>         at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:244)
>         at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226)
>         at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
>         at org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197)
>         at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
>         at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>         at org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
>         at org.apache.spark.scheduler.Task.run(Task.scala:51)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
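The resolution described in the comment above (additive pad replaced by a multiplier) can be sketched as follows. The 0.07 factor and 384 MB floor are assumptions for illustration, drawn from the values commonly cited for later Spark releases, not necessarily the exact constants in the merged patch:

```python
# Illustrative sketch: additive vs. multiplicative YARN memory overhead.
# The factor (0.07) and floor (384 MB) are assumed values for illustration.

MEMORY_OVERHEAD_MIN_MB = 384
MEMORY_OVERHEAD_FACTOR = 0.07

def container_size_additive(executor_memory_mb):
    """Old scheme: a constant pad regardless of executor size."""
    return executor_memory_mb + MEMORY_OVERHEAD_MIN_MB

def container_size_multiplicative(executor_memory_mb):
    """New scheme: overhead scales with the executor, with a floor."""
    overhead = max(int(MEMORY_OVERHEAD_FACTOR * executor_memory_mb),
                   MEMORY_OVERHEAD_MIN_MB)
    return executor_memory_mb + overhead

# For the 32g executors in this report, the fixed 384 MB pad leaves almost
# no off-heap headroom, while the multiplier reserves roughly 2.2 GB.
print(container_size_additive(32 * 1024))        # 33152
print(container_size_multiplicative(32 * 1024))  # 35061
```

With small executors the floor dominates, so small jobs see the same behavior as before; only large heaps get proportionally more headroom.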
[jira] [Commented] (SPARK-2398) Trouble running Spark 1.0 on Yarn
[ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206395#comment-14206395 ]

Sean Owen commented on SPARK-2398:

Was this finally resolved by [~nravi]'s changes to make the YARN container padding scale differently?
[jira] [Commented] (SPARK-2398) Trouble running Spark 1.0 on Yarn
[ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060113#comment-14060113 ]

Mridul Muralidharan commented on SPARK-2398:

As discussed in the PR, I am attempting to list the various factors that contribute to overhead. Note that this is not exhaustive (yet) - please add more to this JIRA - so that once we are reasonably confident, we can model the expected overhead based on these factors. These factors are typically off-heap - anything within the heap is budgeted for by -Xmx and enforced by the VM - and so should ideally (though not always in practice; see GC overheads) not exceed the -Xmx value.

1) 256 KB per socket accepted via ConnectionManager for inter-worker communication (setReceiveBufferSize). Typically there will be (numExecutors - 1) sockets open.

2) 128 KB per socket for writing output to DFS. For reads, this does not seem to be configured - and should be 8 KB per socket, IIRC. Typically one per executor at a given point in time?

3) 256 KB per akka socket for its send/receive buffer. One per worker (to talk to the master)? So 512 KB? Any other use of akka?

4) If I am not wrong, netty might allocate multiple "spark.akka.frameSize"-sized direct buffers. There might be a few of these allocated and pooled/reused. I did not go into detail in the netty code, though - if someone with more know-how can clarify, that would be great! The default for spark.akka.frameSize is 10 MB.

5) The default size of the assembled Spark jar is about 12x MB (and changing). Though not all classes get loaded, the overhead would be some function of this, and the actual footprint would be higher than the on-disk size. IIRC this is outside the heap - [~sowen], any comments on this? I have not looked into these in about 10 years now!

6) Per-thread stack (-Xss) overhead of 1 MB (for a 64-bit VM). Last I recall, we have about 220-odd threads - not sure if this was on the master or on the workers. Of course, this depends on the various thread pools we use (IO, computation, etc.), akka and netty config, and so on.

7) Disk read overhead. Thanks to [~pwendell]'s fix, at least for small files the overhead is not too high, since we read them directly rather than mmap them. But for anything larger than 8 KB (the default), we use memory-mapped buffers. The actual overhead depends on the number of files opened for read via DiskStore: the entire file contents get mmap'ed into virtual memory, and there is some additional native-level overhead for these buffers beyond the virtual memory itself. The number of files opened should be tracked carefully to understand its effect on Spark's overhead, since this aspect has been changing a lot of late. The impact is on shuffle and disk-persisted RDDs, among others, and the actual value is application dependent (how large the data is!).

8) The overhead introduced by the VM not being able to reclaim memory completely (the cost of moving data vs. the amount of space reclaimed). Ideally this should be low, but it depends on the heap size, the collector used, and other things. I am not very knowledgeable about recent advances in GC collectors, so I hesitate to put a number to this.

I am sure this is not an exhaustive list; please do add to it. In our case specifically - and [~tgraves] could add more - the number of containers can be high (300+ is easily possible) and memory per container is modest (8 GB usually).

To add details of observed overhead patterns (from the PR discussion):

a) I have had an in-house GBDT implementation run without customizing overhead (so the default of 384 MB) with 12 GB containers and 22 nodes on a reasonably large dataset.

b) I have had to customize overhead to 1.7 GB for collaborative filtering with 8 GB containers and 300 nodes (on a fairly large dataset).

c) I have had to minimally customize overhead to do an in-house QR factorization of a 50k x 50k distributed dense matrix on 45 nodes at 12 GB each (this was incorrectly specified in the PR discussion).
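A back-of-envelope model of per-executor off-heap overhead built from the factors listed above might look like the sketch below. Every constant is a rough assumption taken from the estimates in the comment (and the jar footprint is a guess), so the output is only illustrative:

```python
# Rough, illustrative model of per-executor off-heap overhead, using the
# estimates from the comment above. Every constant here is an assumption.

def estimated_overhead_mb(num_executors, open_mmap_file_mb=0, num_threads=220):
    kb = 1.0 / 1024  # KB -> MB
    overhead = 0.0
    overhead += (num_executors - 1) * 256 * kb  # (1) ConnectionManager recv buffers
    overhead += 128 * kb                        # (2) one DFS write socket
    overhead += 2 * 256 * kb                    # (3) akka send/receive buffers
    overhead += 10                              # (4) netty direct buffers ~ frameSize
    overhead += 120                             # (5) loaded classes from the assembly jar (guess)
    overhead += num_threads * 1                 # (6) 1 MB stack per thread (-Xss)
    overhead += open_mmap_file_mb               # (7) mmap'ed shuffle/persisted files
    return overhead

# A 6-node cluster with ~2 GB of mmap'ed files blows far past the old
# fixed 384 MB pad; thread stacks alone account for ~220 MB.
print(round(estimated_overhead_mb(6, open_mmap_file_mb=2048)))
```

The model omits factor (8), GC fragmentation, which the comment declines to quantify; even so, the fixed-size components exceed the old default pad before any files are mapped.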
[jira] [Commented] (SPARK-2398) Trouble running Spark 1.0 on Yarn
[ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059494#comment-14059494 ]

Nishkam Ravi commented on SPARK-2398:

In this case, it should be Spark taking care of it, not YARN, since Spark sets -Xmx based on the executor memory parameter.
[jira] [Commented] (SPARK-2398) Trouble running Spark 1.0 on Yarn
[ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059489#comment-14059489 ]

Nishkam Ravi commented on SPARK-2398:

[~gq] [~sowen] I don't specify -Xmx; I'm only requesting a YARN container of a certain size. If I specified both, I'd be talking to the JVM and YARN at the same time and potentially sending inconsistent messages. If I only specify the container size, YARN should take care of this without bothering the developer (i.e., allocate specified_container_size + memory_overhead, where memory_overhead = f(specified_container_size)). Ideally.

I double-checked and made sure that all config parameters are identical between the 0.9 and 1.0 deployments. I suspect something has changed for the worse. I can do some further diagnosis by redeploying 0.9 and looking at the nodemanager logs.

Increasing spark.yarn.executor.memoryOverhead fixes this problem.
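The workaround of increasing spark.yarn.executor.memoryOverhead mentioned above would be applied in the job's properties file (or via --conf on the spark-submit command line). The 2048 MB value below is purely an illustrative choice, not a figure recommended anywhere in this thread:

```
# Added to pagerank.conf - illustrative value, in MB.
# Reserves extra off-heap headroom per executor beyond the heap (-Xmx),
# so the YARN container request exceeds spark.executor.memory by this amount.
spark.yarn.executor.memoryOverhead   2048
```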
[jira] [Commented] (SPARK-2398) Trouble running Spark 1.0 on Yarn
[ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14058552#comment-14058552 ]

Sean Owen commented on SPARK-2398:

[~gq] This does not really have to do with allocating memory off-heap per se; your second reply is closer.

[~nravi] If you tell a Java process it can use 16 GB of memory, and tell YARN the container can use 16 GB of memory, then it will get killed at some point, since the JVM's physical memory footprint will certainly go beyond 16 GB. This is just how Java and YARN work. I suspect your cluster config is actually different. There are several YARN configurations that matter here - most importantly, the maximum memory that a container can request. Yes, spark.yarn.executor.memoryOverhead could be increased to give more room, but I don't even know that this is the issue. How big is the YARN container vs. your heap size?
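To make the container-vs-heap arithmetic in this exchange concrete, here is a small sketch using the numbers from this report (32g executor heap) and the 384 MB default additive overhead cited elsewhere in this thread; both values are taken from the discussion, not from inspecting any particular deployment:

```python
# Sketch: why a process whose heap nearly fills the container gets killed.
# Values from this report: 32g executor heap; 384 MB default overhead
# (the figure cited in this thread for spark.yarn.executor.memoryOverhead).

executor_heap_mb = 32 * 1024       # spark.executor.memory = 32g -> -Xmx
default_overhead_mb = 384          # default additive overhead
container_mb = executor_heap_mb + default_overhead_mb

# Off-heap usage (thread stacks, direct buffers, mmap'ed files, ...) can
# easily exceed 384 MB, so the physical footprint passes the container
# limit even though the heap itself never exceeds -Xmx.
headroom_mb = container_mb - executor_heap_mb
print(container_mb, headroom_mb)   # 33152 384
```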
[jira] [Commented] (SPARK-2398) Trouble running Spark 1.0 on Yarn
[ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14058298#comment-14058298 ]

Guoqiang Li commented on SPARK-2398:

Q1. {{-Xmx}} limits only the heap space; native libraries and {{sun.misc.Unsafe}} can easily allocate memory outside the Java heap. Reference: http://stackoverflow.com/questions/6527131/java-using-more-memory-than-the-allocated-memory

Q2. This is not a bug. We can disable this check by setting {{yarn.nodemanager.pmem-check-enabled}} to false. Its default value is {{true}} in [yarn-default.xml|http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml].
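The check referenced in Q2 above is configured in yarn-site.xml on each NodeManager; a minimal fragment would look like the following. Note that disabling the physical-memory check trades container isolation for tolerance of off-heap usage, so it is a blunt workaround rather than a fix:

```xml
<!-- yarn-site.xml: disable the NodeManager's physical-memory check.
     Containers will no longer be killed for exceeding their requested
     physical memory, at the cost of weaker resource isolation. -->
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
</property>
```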
[jira] [Commented] (SPARK-2398) Trouble running Spark 1.0 on Yarn
[ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14056858#comment-14056858 ] Nishkam Ravi commented on SPARK-2398: - I'm a bit unclear on the root cause. Q1. Are the semantics of -Xmx being violated? Which component is responsible for this issue? Can you explain the problem in a bit more detail? Q2. This problem is not encountered with CDH 5.0/Spark 0.9 with identical configuration parameters, so it has to be a regression somewhere. Also, with CDH 5.1/Spark 1.0 there are no issues in standalone mode. Could it be that a newly introduced bug in Spark/YARN has exposed a problem that has existed for a while?
[jira] [Commented] (SPARK-2398) Trouble running Spark 1.0 on Yarn
[ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14055931#comment-14055931 ] Guoqiang Li commented on SPARK-2398: The root cause is that the Java process's total memory footprint grows beyond the {{-Xmx}} heap size that is set: the JVM also uses off-heap memory (thread stacks, direct buffers, metaspace/permgen, JIT code), and YARN counts all of it against the container limit. This problem also exists in Hadoop. Increasing the value of {{spark.yarn.executor.memoryOverhead}} can solve the problem.
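For context, a rough sketch (not from the ticket) of how the container request relates to the heap. Spark 1.0 added a fixed overhead on top of the executor heap; the fix referenced in this thread replaced that with a heap-proportional overhead. The 384 MB default and 0.07 factor below are assumptions based on Spark defaults of that era:

```python
def container_size_additive(executor_mem_mb, overhead_mb=384):
    # Spark 1.0-era scheme: container = heap + fixed overhead
    # (384 MB default is an assumption, check your Spark version's default)
    return executor_mem_mb + overhead_mb

def container_size_multiplicative(executor_mem_mb, factor=0.07, floor_mb=384):
    # Later scheme: overhead scales with the heap, with a floor,
    # so large executors get proportionally more off-heap headroom
    return executor_mem_mb + max(int(executor_mem_mb * factor), floor_mb)

heap = 32768  # -Xmx32768m, as in the container command line in this thread
print(container_size_additive(heap))        # 33152 MB: ~1% headroom over the heap
print(container_size_multiplicative(heap))  # 35061 MB: ~7% headroom
```

With a 32 GB heap, the fixed 384 MB allowance leaves only about 1% of slack for JVM off-heap usage, which is consistent with the container being killed just past its limit.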
[jira] [Commented] (SPARK-2398) Trouble running Spark 1.0 on Yarn
[ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14055637#comment-14055637 ] Nishkam Ravi commented on SPARK-2398: - Thanks [~gq], looks related. I can see the following error in the nodemanager logs:

Container [pid=19511,containerID=container_1404772822174_0009_01_02] is running beyond physical memory limits. Current usage: 32.6 GB of 32.5 GB physical memory used; 34.0 GB of 68.3 GB virtual memory used. Killing container.
Dump of the process-tree for container_1404772822174_0009_01_02:
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 19511 16428 19511 19511 (bash) 0 0 108863488 315 /bin/bash -c /usr/java/jdk1.7.0_55-cloudera/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms32768m -Xmx32768m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Djava.io.tmpdir=/data/3/yarn/nm/usercache/jenkins/appcache/application_1404772822174_0009/container_1404772822174_0009_01_02/tmp org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sp...@a1017.halxg.cloudera.com:34062/user/CoarseGrainedScheduler 1 a1014.halxg.cloudera.com 16 1> /var/log/hadoop-yarn/container/application_1404772822174_0009/container_1404772822174_0009_01_02/stdout 2> /var/log/hadoop-yarn/container/application_1404772822174_0009/container_1404772822174_0009_01_02/stderr
|- 19516 19511 19511 19511 (java) 413192 18976 36427812864 8556873 /usr/java/jdk1.7.0_55-cloudera/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms32768m -Xmx32768m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Djava.io.tmpdir=/data/3/yarn/nm/usercache/jenkins/appcache/application_1404772822174_0009/container_1404772822174_0009_01_02/tmp org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sp...@a1017.halxg.cloudera.com:34062/user/CoarseGrainedScheduler 1 a1014.halxg.cloudera.com 16

What is the root cause of this problem?
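As a side note for readers of the dump above (not part of the ticket): the RSSMEM_USAGE column is in pages, not bytes. Assuming the common 4 KiB page size, the java process's figure converts to roughly the 32.6 GB that YARN reports, confirming the kill is driven by the process's resident set exceeding the container's physical-memory limit:

```python
PAGE_SIZE = 4096  # bytes; assuming the typical 4 KiB page size on Linux

def pages_to_gib(pages):
    """Convert a page count (as in YARN's process-tree dump) to GiB."""
    return pages * PAGE_SIZE / 2**30

# java process from the dump: RSSMEM_USAGE(PAGES) = 8556873
rss_gib = pages_to_gib(8556873)
print(f"{rss_gib:.1f} GiB")  # ~32.6, matching "32.6 GB of 32.5 GB physical memory used"
```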
Is this a regression in YARN? Your PR seems to suggest that the workaround is to specify a higher value for spark.yarn.executor.memoryOverhead — is that right?
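For readers hitting the same container kill, a hedged sketch of the workaround discussed in this thread. The 2048 value and the jar/argument placeholders are illustrative guesses, not from the ticket; the right overhead depends on the workload's off-heap usage:

```shell
# Raise the off-heap allowance YARN grants on top of the executor heap.
# Value is in MB in Spark 1.x; 2048 here is an illustrative starting point.
spark-submit --master yarn --deploy-mode cluster \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --class org.apache.spark.examples.SparkPageRank \
  your-application.jar pagerank_in 10   # hypothetical jar and iteration count
```

Alternatively, the same key can be set in a properties file such as the pagerank.conf used above.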
[jira] [Commented] (SPARK-2398) Trouble running Spark 1.0 on Yarn
[ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054434#comment-14054434 ] Guoqiang Li commented on SPARK-2398: This seems to be related to [SPARK-1930|https://issues.apache.org/jira/browse/SPARK-1930]. Can you post the yarn node manager log?