[jira] [Commented] (SPARK-2398) Trouble running Spark 1.0 on Yarn

2014-11-13 Thread Nishkam Ravi (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14211896#comment-14211896 ]

Nishkam Ravi commented on SPARK-2398:
-

[~srowen] yes, this has been resolved by changing the YARN memory overhead from an 
additive constant to a multiplier, as we had discussed.
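
As a rough sketch of what that change amounts to (the 0.07 factor and the 384 MB 
floor below are illustrative values for the sake of the example, not necessarily 
the exact merged defaults):

{code}
// Before: a fixed additive pad on top of the executor memory, regardless of its size.
def overheadOldMb(executorMemMb: Int): Int = 384

// After: the pad scales with the requested executor memory, with a floor so that
// small executors still get some headroom.
def overheadNewMb(executorMemMb: Int, factor: Double = 0.07): Int =
  math.max((executorMemMb * factor).toInt, 384)

// For the 32g executors in this issue:
//   overheadOldMb(32768) == 384 MB
//   overheadNewMb(32768) == 2293 MB
{code}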

> Trouble running Spark 1.0 on Yarn 
> --
>
> Key: SPARK-2398
> URL: https://issues.apache.org/jira/browse/SPARK-2398
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Nishkam Ravi
>
> Trouble running workloads in Spark-on-YARN cluster mode for Spark 1.0. 
> For example: SparkPageRank when run in standalone mode goes through without 
> any errors (tested for up to 30GB input dataset on a 6-node cluster).  Also 
> runs fine for a 1GB dataset in yarn cluster mode. Starts to choke (in yarn 
> cluster mode) as the input data size is increased. Confirmed for 16GB input 
> dataset.
> The same workload runs fine with Spark 0.9 in both standalone and yarn 
> cluster mode (for up to 30 GB input dataset on a 6-node cluster).
> Commandline used:
> (/opt/cloudera/parcels/CDH/lib/spark/bin/spark-submit --master yarn 
> --deploy-mode cluster --properties-file pagerank.conf  --driver-memory 30g 
> --driver-cores 16 --num-executors 5 --class 
> org.apache.spark.examples.SparkPageRank 
> /opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0-SNAPSHOT.jar
>  pagerank_in $NUM_ITER)
> pagerank.conf:
> spark.master                            spark://c1704.halxg.cloudera.com:7077
> spark.home                              /opt/cloudera/parcels/CDH/lib/spark
> spark.executor.memory                   32g
> spark.default.parallelism               118
> spark.cores.max                         96
> spark.storage.memoryFraction            0.6
> spark.shuffle.memoryFraction            0.3
> spark.shuffle.compress                  true
> spark.shuffle.spill.compress            true
> spark.broadcast.compress                true
> spark.rdd.compress                      false
> spark.io.compression.codec              org.apache.spark.io.LZFCompressionCodec
> spark.io.compression.snappy.block.size  32768
> spark.reducer.maxMbInFlight             48
> spark.local.dir                         /var/lib/jenkins/workspace/tmp
> spark.driver.memory                     30g
> spark.executor.cores                    16
> spark.locality.wait                     6000
> spark.executor.instances                5
> UI shows ExecutorLostFailure. Yarn logs contain numerous exceptions:
> 14/07/07 17:59:49 WARN network.SendingConnection: Error writing in connection to ConnectionManagerId(a1016.halxg.cloudera.com,54105)
> java.nio.channels.AsynchronousCloseException
>     at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:205)
>     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:496)
>     at org.apache.spark.network.SendingConnection.write(Connection.scala:361)
>     at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:142)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> 
> java.io.IOException: Filesystem closed
>     at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
>     at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619)
>     at java.io.FilterInputStream.close(FilterInputStream.java:181)
>     at org.apache.hadoop.util.LineReader.close(LineReader.java:150)
>     at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:244)
>     at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226)
>     at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
>     at org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197)
>     at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
>     at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>     at org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
>     at org.apache.spark.scheduler.Task.run(Task.scala:51)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker

[jira] [Commented] (SPARK-2398) Trouble running Spark 1.0 on Yarn

2014-11-11 Thread Sean Owen (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206395#comment-14206395 ]

Sean Owen commented on SPARK-2398:
--

Was this finally resolved by [~nravi]'s changes to make the YARN container 
padding scale differently?


[jira] [Commented] (SPARK-2398) Trouble running Spark 1.0 on Yarn

2014-07-13 Thread Mridul Muralidharan (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060113#comment-14060113 ]

Mridul Muralidharan commented on SPARK-2398:



As discussed in the PR, I am attempting to list the various factors which 
contribute to overhead.
Note, this is not exhaustive (yet) - please add more to this JIRA - so that 
when we are reasonably sure, we can model the expected overhead based on these 
factors. A rough back-of-the-envelope tally follows the list below.

These factors are typically off-heap - since anything within the heap is budgeted 
for by -Xmx and enforced by the VM - and so should ideally (though not always in 
practice, see GC overheads) not exceed the -Xmx value.

1) 256 KB per socket accepted via ConnectionManager for inter-worker communication 
(setReceiveBufferSize).
Typically, there will be (numExecutors - 1) sockets open.

2) 128 KB per socket for writing output to DFS. For reads, this does not seem 
to be configured - and should be 8 KB per socket, IIRC.
Typically one per executor at a given point in time?

3) 256 KB for each Akka socket's send/receive buffer.
One per worker? (to talk to the master) - so 512 KB? Any other use of Akka?

4) If I am not wrong, Netty might allocate multiple direct buffers of 
"spark.akka.frameSize" size. There might be a few of these allocated and pooled/reused.
I did not go into the Netty code in detail though. If someone with more 
know-how can clarify, that would be great!
The default for spark.akka.frameSize is 10 MB.

5) The default size of the assembled Spark jar is about 12x MB (and changing) - 
though not all classes get loaded, the overhead would be some function of this.
The actual footprint would be higher than the on-disk size.
IIRC this is outside of the heap - [~sowen], any comments on this? I have not 
looked into this in something like 10 years now!

6) Per-thread (-Xss) overhead of 1 MB (for a 64-bit VM).
Last I recall, we have about 220-odd threads - not sure if this was on the 
master or on the workers.
Of course, this depends on the various thread pools we use (IO, computation, 
etc.), the Akka and Netty config, and so on.

7) Disk read overhead.
Thanks to [~pwendell]'s fix, at least for small files the overhead is not too 
high - since we do not mmap files but read them directly.
But for anything larger than 8 KB (the default), we use memory-mapped buffers.
The actual overhead depends on the number of files opened for read via 
DiskStore - and the entire file contents get mmap'ed into virtual memory.
Note that there is also some non-virtual-memory overhead at the native level for 
these buffers.

The actual number of files opened should be tracked carefully to understand its 
effect on Spark overhead, since this aspect has been changing a lot of late.
It impacts shuffle and disk-persisted RDDs, among others.
The actual value would be application dependent (how large the data is!).


8) The overhead introduced by the VM not being able to reclaim memory completely 
(the cost of moving data vs the amount of space reclaimed).
Ideally, this should be low - but it would depend on the heap size, the 
collector used, and other things.
I am not very knowledgeable about recent advances in GC collectors, so I 
hesitate to put a number to this.
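
To make the factors above concrete, here is a rough back-of-the-envelope tally 
per executor (every count and size below is an illustrative guess, not a 
measured value):

{code}
val numExecutors    = 5
val connBufMb       = 0.256 * (numExecutors - 1)  // (1) ConnectionManager receive buffers
val dfsWriteBufMb   = 0.128                       // (2) one DFS write socket
val akkaBufMb       = 0.512                       // (3) Akka send/receive buffers
val nettyFramesMb   = 2 * 10.0                    // (4) a couple of frameSize-sized direct buffers
val loadedClassesMb = 150.0                       // (5) some function of the assembly jar size
val threadStacksMb  = 220 * 1.0                   // (6) ~220 threads x 1 MB -Xss
val mmapMb          = 512.0                       // (7) mmap'ed shuffle / DiskStore files (app dependent)
val gcSlackMb       = 256.0                       // (8) space the collector cannot cheaply reclaim

val totalOverheadMb = connBufMb + dfsWriteBufMb + akkaBufMb + nettyFramesMb +
  loadedClassesMb + threadStacksMb + mmapMb + gcSlackMb
println(f"Estimated overhead: $totalOverheadMb%.0f MB")  // well above the old 384 MB default
{code}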



I am sure this is not an exhaustive list; please do add to this.
In our case specifically - and [~tgraves] could add more - the number of 
containers can be high (300+ is easily possible) and memory per container is 
modest (usually 8 GB).
To add details of observed overhead patterns (from the PR discussion):
a) I have had an in-house GBDT implementation run without customizing the overhead 
(so the default of 384 MB) with 12 GB containers and 22 nodes on a reasonably 
large dataset.
b) I have had to customize the overhead to 1.7 GB for collaborative filtering with 
8 GB containers and 300 nodes (on a fairly large dataset).
c) I have had to minimally customize the overhead to do an in-house QR factorization 
of a 50k x 50k distributed dense matrix on 45 nodes at 12 GB each (this was 
incorrectly specified in the PR discussion).


[jira] [Commented] (SPARK-2398) Trouble running Spark 1.0 on Yarn

2014-07-11 Thread Nishkam Ravi (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059494#comment-14059494 ]

Nishkam Ravi commented on SPARK-2398:
-

In this case, it should be Spark taking care of it, not YARN, since Spark sets 
-Xmx based on the executor memory parameter.


[jira] [Commented] (SPARK-2398) Trouble running Spark 1.0 on Yarn

2014-07-11 Thread Nishkam Ravi (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059489#comment-14059489 ]

Nishkam Ravi commented on SPARK-2398:
-

[~gq] [~sowen]  I don't specify -Xmx; I'm only requesting a YARN container of a 
certain size. If I specified both, I'd be talking to the JVM and YARN at the same 
time and potentially sending inconsistent messages. If I only specify the container 
size, YARN should take care of this without bothering the developer (i.e., 
allocate specified_container_size + memory_overhead, where memory_overhead = 
f(specified_container_size)). Ideally.

I double-checked and made sure that all config parameters are identical between 
the 0.9 and 1.0 deployments. I suspect something has changed for the worse. I can 
do some further diagnosis by redeploying 0.9 and looking at the NodeManager logs. 
Increasing spark.yarn.executor.memoryOverhead fixes this problem.
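
For reference, a minimal sketch of what that workaround looks like in a 
properties file such as the pagerank.conf above (the 2048 MB value is only an 
illustrative guess; the right number depends on the workload):

{code}
spark.yarn.executor.memoryOverhead  2048
{code}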


[jira] [Commented] (SPARK-2398) Trouble running Spark 1.0 on Yarn

2014-07-11 Thread Sean Owen (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14058552#comment-14058552 ]

Sean Owen commented on SPARK-2398:
--

[~gq] This does not really have to do with allocating memory off heap per se. 
Your second reply is closer.
[~nravi] If you tell a Java process it can use 16GB of memory, and tell YARN 
the container can use 16GB of memory, then it will get killed at some point 
since the JVM's physical memory footprint will certainly go beyond 16GB. This 
is just how Java and YARN work. 

I suspect your cluster config is actually different. There are several YARN 
configurations that matter here - most importantly, the maximum memory that a 
container can request. Yes, spark.yarn.executor.memoryOverhead could be increased 
to give more room, but I don't even know whether this is the issue.

How big is the YARN container vs your heap size?
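
(A minimal sketch of the accounting described above, with illustrative numbers: 
the NodeManager compares the whole process footprint, not just the heap, against 
the container size it granted.)

{code}
val heapGb      = 16.0              // what -Xmx tells the JVM it may use
val offHeapGb   = 1.2               // stacks, direct buffers, mmap'ed files, the JVM itself (a guess)
val containerGb = 16.0              // what was requested from YARN

val processGb = heapGb + offHeapGb  // the actual physical footprint
if (processGb > containerGb)
  println(s"NodeManager kills the container: $processGb GB used of $containerGb GB allowed")
{code}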


[jira] [Commented] (SPARK-2398) Trouble running Spark 1.0 on Yarn

2014-07-10 Thread Guoqiang Li (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14058298#comment-14058298 ]

Guoqiang Li commented on SPARK-2398:


Q1. {{-Xmx}} limits only the heap space; native libraries and {{sun.misc.Unsafe}} 
can easily allocate memory outside the Java heap.
Reference: 
http://stackoverflow.com/questions/6527131/java-using-more-memory-than-the-allocated-memory
Q2. This is not a bug. We can disable this check by setting 
{{yarn.nodemanager.pmem-check-enabled}} to false. Its default value is {{true}} 
in 
[yarn-default.xml|http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml].
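
For reference, a sketch of how that check would be disabled in yarn-site.xml; 
note that this removes the safety net rather than fixing the overhead accounting:

{code}
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
</property>
{code}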


[jira] [Commented] (SPARK-2398) Trouble running Spark 1.0 on Yarn

2014-07-09 Thread Nishkam Ravi (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14056858#comment-14056858 ]

Nishkam Ravi commented on SPARK-2398:
-

A bit unclear on the root cause. 
Q1. Are the semantics of -Xmx being violated? Which component is 
responsible for this issue? Can you explain the problem in a bit more detail?
Q2. This problem is not encountered with CDH 5.0/Spark 0.9 with identical 
configuration parameters, so it has to be a regression somewhere. Also, with 
CDH 5.1/Spark 1.0, there are no issues in standalone mode. Could it be the 
case that a newly introduced bug in Spark/YARN has exposed a problem that has 
existed for a while?


[jira] [Commented] (SPARK-2398) Trouble running Spark 1.0 on Yarn

2014-07-09 Thread Guoqiang Li (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14055931#comment-14055931 ]

Guoqiang Li commented on SPARK-2398:


The root cause is that the Java process uses more memory than the {{-Xmx}} value 
that is set. This problem also exists in Hadoop. Increasing the value of 
{{spark.yarn.executor.memoryOverhead}} can solve the problem.


[jira] [Commented] (SPARK-2398) Trouble running Spark 1.0 on Yarn

2014-07-08 Thread Nishkam Ravi (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14055637#comment-14055637 ]

Nishkam Ravi commented on SPARK-2398:
-

Thanks [~gq], looks related. I can see the following error in the NodeManager logs:

Container [pid=19511,containerID=container_1404772822174_0009_01_02] is 
running beyond physical memory limits. Current usage: 32.6 GB of 32.5 GB 
physical memory used; 34.0 GB of 68.3 GB virtual memory used. Killing container.
Dump of the process-tree for container_1404772822174_0009_01_02 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) 
SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 19511 16428 19511 19511 (bash) 0 0 108863488 315 /bin/bash -c 
/usr/java/jdk1.7.0_55-cloudera/bin/java -server -XX:OnOutOfMemoryError='kill 
%p' -Xms32768m -Xmx32768m  -verbose:gc -XX:+PrintGCDetails 
-XX:+PrintGCTimeStamps 
-Djava.io.tmpdir=/data/3/yarn/nm/usercache/jenkins/appcache/application_1404772822174_0009/container_1404772822174_0009_01_02/tmp
 org.apache.spark.executor.CoarseGrainedExecutorBackend 
akka.tcp://sp...@a1017.halxg.cloudera.com:34062/user/CoarseGrainedScheduler 1 
a1014.halxg.cloudera.com 16 1> 
/var/log/hadoop-yarn/container/application_1404772822174_0009/container_1404772822174_0009_01_02/stdout
 2> 
/var/log/hadoop-yarn/container/application_1404772822174_0009/container_1404772822174_0009_01_02/stderr
 
|- 19516 19511 19511 19511 (java) 413192 18976 36427812864 8556873 
/usr/java/jdk1.7.0_55-cloudera/bin/java -server -XX:OnOutOfMemoryError=kill %p 
-Xms32768m -Xmx32768m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
-Djava.io.tmpdir=/data/3/yarn/nm/usercache/jenkins/appcache/application_1404772822174_0009/container_1404772822174_0009_01_02/tmp
 org.apache.spark.executor.CoarseGrainedExecutorBackend 
akka.tcp://sp...@a1017.halxg.cloudera.com:34062/user/CoarseGrainedScheduler 1 
a1014.halxg.cloudera.com 16 

What is the root cause of this problem? Is this a regression in YARN? Your PR 
seems to suggest that the workaround is to specify a higher value for 
spark.yarn.executor.memoryOverhead?



[jira] [Commented] (SPARK-2398) Trouble running Spark 1.0 on Yarn

2014-07-07 Thread Guoqiang Li (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054434#comment-14054434 ]

Guoqiang Li commented on SPARK-2398:


Seems to be related to 
[SPARK-1930|https://issues.apache.org/jira/browse/SPARK-1930].
Can you post the YARN NodeManager log?
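
For what it's worth, once the application has finished the container logs can 
usually be pulled with the YARN CLI (assuming log aggregation is enabled); the 
NodeManager's own log lives under its configured log directory on each node:

{code}
yarn logs -applicationId <application id>
{code}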
