[jira] [Commented] (SPARK-4516) Netty off-heap memory use causes executors to be killed by OS

2014-11-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225627#comment-14225627
 ]

Apache Spark commented on SPARK-4516:
--------------------------------------

User 'aarondav' has created a pull request for this issue:
https://github.com/apache/spark/pull/3465

 Netty off-heap memory use causes executors to be killed by OS
 --------------------------------------------------------------

                  Key: SPARK-4516
                  URL: https://issues.apache.org/jira/browse/SPARK-4516
              Project: Spark
           Issue Type: Bug
           Components: Shuffle
     Affects Versions: 1.2.0
          Environment: Linux, Mesos
             Reporter: Hector Yee
             Priority: Critical
               Labels: netty, shuffle

 The netty block transfer manager has a race condition where it closes an
 active connection, resulting in the error below. Switching to nio seems to
 alleviate the problem.
 {code}
 14/11/20 18:53:43 INFO TransportClientFactory: Found inactive connection to i-974cd879.inst.aws.airbnb.com/10.154.228.43:57773, closing it.
 14/11/20 18:53:43 ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
 java.io.IOException: Failed to connect to i-974cd879.inst.aws.airbnb.com/10.154.228.43:57773
   at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:141)
   at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
   at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
   at org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
   at org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:87)
   at org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:148)
   at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:288)
   at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
   at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at com.airbnb.common.ml.training.LinearRankerTrainer$$anonfun$7.apply(LinearRankerTrainer.scala:246)
   at com.airbnb.common.ml.training.LinearRankerTrainer$$anonfun$7.apply(LinearRankerTrainer.scala:235)
   at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
   at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: java.net.ConnectException: Connection refused: i-974cd879.inst.aws.airbnb.com/10.154.228.43:57773
   at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
   at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
   at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
   at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
   at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
   at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
   at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
   at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
 {code}



[jira] [Commented] (SPARK-4516) Netty off-heap memory use causes executors to be killed by OS

2014-11-25 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225700#comment-14225700
 ]

Aaron Davidson commented on SPARK-4516:
----------------------------------------

It turns out there was a real bug which caused us to allocate memory 
proportional to both the number of cores and the number of _executors_ in the 
cluster. PR [#3465|https://github.com/apache/spark/pull/3465] removes the 
latter factor, which should greatly decrease the amount of off-heap memory 
allocated.

Do note that even with this patch, one key feature of the Netty transport 
service is that we do allocate and reuse significant off-heap buffer space 
rather than on-heap, which helps reduce GC pauses. So it's possible that 
certain environments which previously heavily constrained off-heap memory (by 
giving almost all of the container/cgroup's memory to the Spark heap) may have 
to be modified to ensure that at least 32 * (number of cores) MB is available 
to be allocated off-JVM heap.

If this is not possible, you can either disable direct byte buffer usage via 
spark.shuffle.io.preferDirectBufs or set spark.shuffle.io.serverThreads and 
spark.shuffle.io.clientThreads to something smaller than the number of 
executor cores. Typically we find that a 10Gb/s network cannot saturate more 
than, say, 8 cores on a machine (in practice I've never seen even that many 
required), so we would expect no performance degradation if you set these 
parameters accordingly on beefier machines, and it should cap off-heap 
allocation at roughly 256 MB.
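
For reference, here is a minimal sketch of the two workarounds above expressed 
through SparkConf (the same keys can go in spark-defaults.conf or --conf). The 
4-core local master and the thread count of 2 are illustrative assumptions, 
not values taken from this ticket, and in practice you would pick one of the 
two knobs rather than setting both:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only. With hypothetical 4-core executors, the rule of thumb above
// suggests leaving at least 32 MB * 4 = 128 MB of off-JVM-heap headroom in
// the container/cgroup on top of the executor heap.
val conf = new SparkConf()
  .setAppName("shuffle-offheap-workaround")
  .setMaster("local[4]") // illustrative; substitute your real master URL
  // Workaround 1: fall back to on-heap (non-direct) byte buffers entirely.
  .set("spark.shuffle.io.preferDirectBufs", "false")
  // Workaround 2: cap the transport thread pools below the core count,
  // which bounds how much off-heap buffer space the Netty transport keeps.
  .set("spark.shuffle.io.serverThreads", "2")
  .set("spark.shuffle.io.clientThreads", "2")

val sc = new SparkContext(conf)
{code}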

[jira] [Commented] (SPARK-4516) Netty off-heap memory use causes executors to be killed by OS

2014-11-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225731#comment-14225731
 ]

Apache Spark commented on SPARK-4516:
--------------------------------------

User 'aarondav' has created a pull request for this issue:
https://github.com/apache/spark/pull/3469

[jira] [Commented] (SPARK-4516) Netty off-heap memory use causes executors to be killed by OS

2014-11-25 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225743#comment-14225743
 ]

Aaron Davidson commented on SPARK-4516:
----------------------------------------

About my last point, [~rxin], [~pwendell], and I decided it may be better to 
just cap the number of threads we use by default at 8, to try to avoid issues 
for people who run executors with a very large number of cores and were 
already on the edge of their off-heap limits. #3469 implements this, which may 
cause a performance regression if we're wrong about the magic number 8 being 
an upper bound on the useful number of cores. It can be overridden via the 
serverThreads/clientThreads properties, but if anyone sees this as an issue, 
please let me know.
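
As a back-of-the-envelope sketch of what that default cap implies, reusing the 
32 MB-per-thread rule of thumb from the earlier comment (an approximation, not 
an exact accounting of Netty's pools; the 16-thread override is an arbitrary 
example, not a recommendation from this thread):

{code}
// Rough bound on off-heap shuffle buffer space under the default cap.
val mbPerThread = 32  // rule-of-thumb figure quoted above, not a measured constant
val defaultThreads = math.min(8, Runtime.getRuntime.availableProcessors())
val approxOffHeapMb = defaultThreads * mbPerThread  // at most 8 * 32 = 256 MB

// Raising spark.shuffle.io.serverThreads / spark.shuffle.io.clientThreads
// above 8 raises this bound proportionally, e.g. 16 threads -> roughly 512 MB.
println(s"expect on the order of $approxOffHeapMb MB of off-heap shuffle buffers")
{code}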

[jira] [Commented] (SPARK-4516) Netty off-heap memory use causes executors to be killed by OS

2014-11-22 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222172#comment-14222172
 ]

Patrick Wendell commented on SPARK-4516:
-----------------------------------------

Okay, then I think this is just a documentation issue. We should add 
documentation about direct buffers to the main configuration page and also 
mention it in the doc about network options.

[jira] [Commented] (SPARK-4516) Netty off-heap memory use causes executors to be killed by OS

2014-11-21 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221519#comment-14221519
 ]

Patrick Wendell commented on SPARK-4516:
-----------------------------------------

Okay, sounds good. Does changing the Netty config help?


[jira] [Commented] (SPARK-4516) Netty off-heap memory use causes executors to be killed by OS

2014-11-21 Thread Hector Yee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221758#comment-14221758
 ]

Hector Yee commented on SPARK-4516:
------------------------------------

Yes, turning off direct buffers worked with Netty.

[jira] [Commented] (SPARK-4516) Netty off-heap memory use causes executors to be killed by OS

2014-11-20 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220352#comment-14220352
 ]

Patrick Wendell commented on SPARK-4516:
-----------------------------------------

[~hector.yee] I updated the title; let me know if you decide this is not 
related to being killed by the OS, but it seems like that is the case.

[jira] [Commented] (SPARK-4516) Netty off-heap memory use causes executors to be killed by OS

2014-11-20 Thread Hector Yee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220362#comment-14220362
 ]

Hector Yee commented on SPARK-4516:
------------------------------------

I checked the Mesos log on the slave and it was an OOM kill:

{code}
I1120 22:36:20.193279 95373 slave.cpp:3321] Current usage 33.69%. Max allowed age: 1.126220739407303days
I1120 22:36:23.329488 95371 mem.cpp:532] OOM notifier is triggered for container def5caa2-c0f3-4175-9f5c-210735e6e009
I1120 22:36:23.329684 95371 mem.cpp:551] OOM detected for container def5caa2-c0f3-4175-9f5c-210735e6e009
I1120 22:36:23.330762 95371 mem.cpp:605] Memory limit exceeded: Requested: 26328MB Maximum Used: 26328MB

MEMORY STATISTICS:
cache 126976
rss 27606781952
rss_huge 0
mapped_file 16384
writeback 0
swap 0
pgpgin 14435895
pgpgout 7695927
pgfault 63682623
pgmajfault 824
inactive_anon 0
active_anon 27606781952
inactive_file 126976
active_file 0
unevictable 0
hierarchical_memory_limit 27606908928
hierarchical_memsw_limit 18446744073709551615
total_cache 126976
total_rss 27606781952
total_rss_huge 0
total_mapped_file 16384
total_writeback 0
total_swap 0
total_pgpgin 14435895
total_pgpgout 7695927
total_pgfault 63682623
total_pgmajfault 824
total_inactive_anon 0
total_active_anon 27606781952
total_inactive_file 126976
total_active_file 0
total_unevictable 0
I1120 22:36:23.330862 95371 containerizer.cpp:1133] Container def5caa2-c0f3-4175-9f5c-210735e6e009 has reached its limit for resource mem(*):26328 and will be terminated
I1120 22:36:23.330899 95371 containerizer.cpp:946] Destroying container 'def5caa2-c0f3-4175-9f5c-210735e6e009'
I1120 22:36:23.332049 95367 cgroups.cpp:2207] Freezing cgroup /sys/fs/cgroup/freezer/mesos/def5caa2-c0f3-4175-9f5c-210735e6e009
I1120 22:36:23.434741 95371 cgroups.cpp:1374] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/def5caa2-c0f3-4175-9f5c-210735e6e009 after 102.648064ms
I1120 22:36:23.436122 95391 cgroups.cpp:2224] Thawing cgroup /sys/fs/cgroup/freezer/mesos/def5caa2-c0f3-4175-9f5c-210735e6e009
I1120 22:36:23.437611 95391 cgroups.cpp:1403] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/def5caa2-c0f3-4175-9f5c-210735e6e009 after 1.39904ms
I1120 22:36:23.439303 95394 containerizer.cpp:1117] Executor for container 'def5caa2-c0f3-4175-9f5c-210735e6e009' has exited
I1120 22:36:25.953094 95368 slave.cpp:2898] Executor '33' of framework 20141119-235105-1873488138-31272-108823-0041 terminated with signal Killed
I1120 22:36:25.953872 95368 slave.cpp:2215] Handling status update TASK_FAILED (UUID: 986cf483-d400-4edc-8423-dd0e51dfeeb8) for task 33 of framework 20141119-235105-1873488138-31272-108823-0041 from @0.0.0.0:0
I1120 22:36:25.953943 95368 slave.cpp:4305] Terminating task 33
{code}


[jira] [Commented] (SPARK-4516) Netty off-heap memory use causes executors to be killed by OS

2014-11-20 Thread Hector Yee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220382#comment-14220382
 ]

Hector Yee commented on SPARK-4516:
------------------------------------

Also, the log was from 
/tmp/mesos/slaves/20141023-174642-3852091146-5050-41161-1224/frameworks/20141119-235105-1873488138-31272-108823-0041/executors/33/runs/def5caa2-c0f3-4175-9f5c-210735e6e009, 
so just to confirm, the executor and container IDs match up.
