[jira] [Commented] (SPARK-4516) Netty off-heap memory use causes executors to be killed by OS
[ https://issues.apache.org/jira/browse/SPARK-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221758#comment-14221758 ]

Hector Yee commented on SPARK-4516:
-----------------------------------

Yes, turning off direct buffers worked with Netty.

> Netty off-heap memory use causes executors to be killed by OS
> -------------------------------------------------------------
>
>                 Key: SPARK-4516
>                 URL: https://issues.apache.org/jira/browse/SPARK-4516
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 1.2.0
>         Environment: Linux, Mesos
>            Reporter: Hector Yee
>            Priority: Critical
>              Labels: netty, shuffle
>
> The netty block transfer manager has a race condition where it closes an
> active connection, resulting in the error below. Switching to nio seems to
> alleviate the problem.
> {code}
> 14/11/20 18:53:43 INFO TransportClientFactory: Found inactive connection to i-974cd879.inst.aws.airbnb.com/10.154.228.43:57773, closing it.
> 14/11/20 18:53:43 ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
> java.io.IOException: Failed to connect to i-974cd879.inst.aws.airbnb.com/10.154.228.43:57773
> 	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:141)
> 	at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
> 	at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
> 	at org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
> 	at org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:87)
> 	at org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:148)
> 	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:288)
> 	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
> 	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> 	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> 	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> 	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> 	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> 	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> 	at com.airbnb.common.ml.training.LinearRankerTrainer$$anonfun$7.apply(LinearRankerTrainer.scala:246)
> 	at com.airbnb.common.ml.training.LinearRankerTrainer$$anonfun$7.apply(LinearRankerTrainer.scala:235)
> 	at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
> 	at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
> 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:56)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.ConnectException: Connection refused: i-974cd879.inst.aws.airbnb.com/10.154.228.43:57773
> 	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> 	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
> 	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
> 	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
> 	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
> 	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> 	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> 	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> 	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
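The two workarounds discussed in this thread boil down to a pair of Spark confs: disabling Netty's direct (off-heap) buffers, or falling back to the nio block transfer service entirely. A minimal sketch assembling the corresponding `spark-submit` arguments, assuming the Spark 1.2-era keys `spark.shuffle.io.preferDirectBufs` and `spark.shuffle.blockTransferService`; the helper function itself is hypothetical, not part of Spark:

```python
def shuffle_workaround_confs(fall_back_to_nio=False):
    """Build spark-submit --conf arguments for the Netty off-heap workarounds."""
    confs = {
        # Keep Netty but ask it to prefer heap buffers over direct ones.
        "spark.shuffle.io.preferDirectBufs": "false",
    }
    if fall_back_to_nio:
        # Or sidestep the Netty transfer service entirely, as the reporter did.
        confs["spark.shuffle.blockTransferService"] = "nio"
    # Flatten into ["--conf", "key=value", ...] pairs for spark-submit.
    return [arg for k, v in sorted(confs.items()) for arg in ("--conf", f"{k}={v}")]

args = shuffle_workaround_confs(fall_back_to_nio=True)
```

Either conf reduces off-heap allocation pressure, which matters here because that memory is invisible to the JVM heap limit but still counted against the Mesos container limit.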
[jira] [Commented] (SPARK-4516) Netty off-heap memory use causes executors to be killed by OS
[ https://issues.apache.org/jira/browse/SPARK-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220382#comment-14220382 ]

Hector Yee commented on SPARK-4516:
-----------------------------------

Also, the log was from the Mesos sandbox at tmp/mesos/slaves/20141023-174642-3852091146-5050-41161-1224/frameworks/20141119-235105-1873488138-31272-108823-0041/executors/33/runs/def5caa2-c0f3-4175-9f5c-210735e6e009, so just to confirm: the executor and container IDs match up.
[jira] [Commented] (SPARK-4516) Netty off-heap memory use causes executors to be killed by OS
[ https://issues.apache.org/jira/browse/SPARK-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220362#comment-14220362 ]

Hector Yee commented on SPARK-4516:
-----------------------------------

I checked the mesos log on the slave and it was an OOM kill:

{code}
I1120 22:36:20.193279 95373 slave.cpp:3321] Current usage 33.69%. Max allowed age: 1.126220739407303days
I1120 22:36:23.329488 95371 mem.cpp:532] OOM notifier is triggered for container def5caa2-c0f3-4175-9f5c-210735e6e009
I1120 22:36:23.329684 95371 mem.cpp:551] OOM detected for container def5caa2-c0f3-4175-9f5c-210735e6e009
I1120 22:36:23.330762 95371 mem.cpp:605] Memory limit exceeded: Requested: 26328MB Maximum Used: 26328MB

MEMORY STATISTICS:
cache 126976
rss 27606781952
rss_huge 0
mapped_file 16384
writeback 0
swap 0
pgpgin 14435895
pgpgout 7695927
pgfault 63682623
pgmajfault 824
inactive_anon 0
active_anon 27606781952
inactive_file 126976
active_file 0
unevictable 0
hierarchical_memory_limit 27606908928
hierarchical_memsw_limit 18446744073709551615
total_cache 126976
total_rss 27606781952
total_rss_huge 0
total_mapped_file 16384
total_writeback 0
total_swap 0
total_pgpgin 14435895
total_pgpgout 7695927
total_pgfault 63682623
total_pgmajfault 824
total_inactive_anon 0
total_active_anon 27606781952
total_inactive_file 126976
total_active_file 0
total_unevictable 0

I1120 22:36:23.330862 95371 containerizer.cpp:1133] Container def5caa2-c0f3-4175-9f5c-210735e6e009 has reached its limit for resource mem(*):26328 and will be terminated
I1120 22:36:23.330899 95371 containerizer.cpp:946] Destroying container 'def5caa2-c0f3-4175-9f5c-210735e6e009'
I1120 22:36:23.332049 95367 cgroups.cpp:2207] Freezing cgroup /sys/fs/cgroup/freezer/mesos/def5caa2-c0f3-4175-9f5c-210735e6e009
I1120 22:36:23.434741 95371 cgroups.cpp:1374] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/def5caa2-c0f3-4175-9f5c-210735e6e009 after 102.648064ms
I1120 22:36:23.436122 95391 cgroups.cpp:2224] Thawing cgroup /sys/fs/cgroup/freezer/mesos/def5caa2-c0f3-4175-9f5c-210735e6e009
I1120 22:36:23.437611 95391 cgroups.cpp:1403] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/def5caa2-c0f3-4175-9f5c-210735e6e009 after 1.39904ms
I1120 22:36:23.439303 95394 containerizer.cpp:1117] Executor for container 'def5caa2-c0f3-4175-9f5c-210735e6e009' has exited
I1120 22:36:25.953094 95368 slave.cpp:2898] Executor '33' of framework 20141119-235105-1873488138-31272-108823-0041 terminated with signal Killed
I1120 22:36:25.953872 95368 slave.cpp:2215] Handling status update TASK_FAILED (UUID: 986cf483-d400-4edc-8423-dd0e51dfeeb8) for task 33 of framework 20141119-235105-1873488138-31272-108823-0041 from @0.0.0.0:0
I1120 22:36:25.953943 95368 slave.cpp:4305] Terminating task 33
{code}
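The OOM decision visible in the Mesos log above comes down to comparing `rss` against `hierarchical_memory_limit` in the cgroup's `memory.stat` output. A small sketch of that check, assuming cgroup-v1 `memory.stat` formatting; the helper functions are hypothetical, and the sample values are taken from the log:

```python
def parse_memory_stat(text):
    """Parse cgroup-v1 memory.stat-style "key value" lines into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats

def near_oom(stats, margin=0.01):
    """True when rss is within `margin` of the hierarchical memory limit."""
    limit = stats["hierarchical_memory_limit"]
    return stats["rss"] >= limit * (1 - margin)

# Values copied from the MEMORY STATISTICS block in the Mesos log above.
SAMPLE = """\
cache 126976
rss 27606781952
hierarchical_memory_limit 27606908928
"""
```

With the logged numbers, rss (about 25.7 GiB) is within 128 KiB of the 26328 MB container limit, so the notifier fires; note the JVM heap limit never sees this because Netty's direct buffers live outside the heap.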
[jira] [Updated] (SPARK-4516) Lost task with netty
[ https://issues.apache.org/jira/browse/SPARK-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hector Yee updated SPARK-4516:
------------------------------
    Summary: Lost task with netty  (was: Race condition in netty)
[jira] [Commented] (SPARK-4516) Race condition in netty
[ https://issues.apache.org/jira/browse/SPARK-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220157#comment-14220157 ]

Hector Yee commented on SPARK-4516:
-----------------------------------

Digging deeper, it looks like you are right: the first machine fails silently, with no reason in the log. My guess is that it ran out of memory; the last time this kind of thing happened, the native snappy library was using too much RAM, and I upped the mesos overhead to 1G to fix those snappy errors:

{code}
--conf spark.mesos.executor.memoryOverhead=1024
{code}

Is it possible that netty uses something off the Java heap and is allocating too much? Or maybe there is a silent failure somewhere that is not logged?

Diagnostics follow. The 1st machine (i-f20aaa19) fails with nothing in the log; the last thing it logs is starting 3 remote fetches:

{code}
14/11/20 22:35:18 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 1, fetching them
14/11/20 22:35:18 INFO MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://sparkdri...@i-62305989.inst.aws.airbnb.com:46605/user/MapOutputTracker#-1862215473]
14/11/20 22:35:18 INFO MapOutputTrackerWorker: Got the output locations
14/11/20 22:35:18 INFO ShuffleBlockFetcherIterator: Getting 498 non-empty blocks out of 724 blocks
14/11/20 22:35:18 INFO ShuffleBlockFetcherIterator: Started 3 remote fetches in 67 ms
{code}

On the master it says:

{code}
14/11/20 22:36:25 ERROR TaskSchedulerImpl: Lost executor 20141023-174642-3852091146-5050-41161-1224 on i-f20aaa19.inst.aws.airbnb.com: remote Akka client disassociated
14/11/20 22:36:25 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkexecu...@i-f20aaa19.inst.aws.airbnb.com:54417] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
14/11/20 22:36:25 INFO TaskSetManager: Re-queueing tasks for 20141023-174642-3852091146-5050-41161-1224 from TaskSet 0.0
14/11/20 22:36:25 WARN TaskSetManager: Lost task 7.0 in stage 1.0 (TID 898, i-f20aaa19.inst.aws.airbnb.com): ExecutorLostFailure (executor 20141023-174642-3852091146-5050-41161-1224 lost)
14/11/20 22:36:25 ERROR CoarseMesosSchedulerBackend: Asked to remove non-existent executor 20141023-174642-3852091146-5050-41161-1224
{code}

The 2nd machine fails saying it could not connect:

{code}
14/11/20 22:36:36 INFO TransportClientFactory: Found inactive connection to i-f20aaa19.inst.aws.airbnb.com/10.225.139.181:51003, closing it.
14/11/20 22:36:36 ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to i-f20aaa19.inst.aws.airbnb.com/10.225.139.181:51003
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:141)
	at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
	at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
	at org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
	at org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:87)
{code}
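The sizing logic behind raising `spark.mesos.executor.memoryOverhead` can be sketched as simple arithmetic: the Mesos container limit is the executor heap plus the overhead, and any off-JVM-heap allocation (such as Netty direct buffers) has to fit inside the overhead slice. The helper below and the 25304 MB heap figure are illustrative assumptions, not Spark code; 25304 + 1024 happens to match the 26328 MB limit reported in the OOM log:

```python
def container_limit_mb(executor_memory_mb, memory_overhead_mb):
    """Mesos container memory limit: JVM heap plus off-heap overhead headroom."""
    return executor_memory_mb + memory_overhead_mb

def offheap_headroom_mb(memory_overhead_mb, native_lib_usage_mb):
    """Overhead left for Netty direct buffers after other native users (e.g. snappy)."""
    return max(0, memory_overhead_mb - native_lib_usage_mb)

# With a 1024 MB overhead and an assumed 25304 MB executor heap:
limit = container_limit_mb(25304, 1024)  # matches the 26328 MB limit in the log
```

If Netty's direct buffers alone exceed the headroom, rss crosses the container limit even though the JVM heap is fine, which is exactly the failure mode the OOM log shows.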
[jira] [Commented] (SPARK-4516) Race condition in netty
[ https://issues.apache.org/jira/browse/SPARK-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220025#comment-14220025 ]

Hector Yee commented on SPARK-4516:
-----------------------------------

The channel is marked as inactive while it is being used, I believe. I didn't dig into the code, so I have no idea, but according to the logs that is what seems to be happening.
[jira] [Updated] (SPARK-4516) Race condition in netty
[ https://issues.apache.org/jira/browse/SPARK-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hector Yee updated SPARK-4516: -- Affects Version/s: (was: 1.1.1) 1.2.0 > Race condition in netty > --- > > Key: SPARK-4516 > URL: https://issues.apache.org/jira/browse/SPARK-4516 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.2.0 > Environment: Linux, Mesos >Reporter: Hector Yee > Labels: netty, shuffle > > The netty block transfer manager has a race condition where it closes an > active connection resulting in the error below. Switching to nio seems to > alleviate the problem. > 14/11/20 18:53:43 INFO TransportClientFactory: Found inactive connection to > i-974cd879.inst.aws.airbnb.com/10.154.228.43:57773, closing it. > 14/11/20 18:53:43 ERROR RetryingBlockFetcher: Exception while beginning fetch > of 1 outstanding blocks > java.io.IOException: Failed to connect to > i-974cd879.inst.aws.airbnb.com/10.154.228.43:57773 > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:141) > at > org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120) > at > org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:87) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:148) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:288) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> 	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> 	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> 	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> 	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> 	at com.airbnb.common.ml.training.LinearRankerTrainer$$anonfun$7.apply(LinearRankerTrainer.scala:246)
> 	at com.airbnb.common.ml.training.LinearRankerTrainer$$anonfun$7.apply(LinearRankerTrainer.scala:235)
> 	at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
> 	at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
> 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:56)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.ConnectException: Connection refused: i-974cd879.inst.aws.airbnb.com/10.154.228.43:57773
> 	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> 	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
> 	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
> 	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
> 	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
> 	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> 	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> 	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> 	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
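For reference, the two workarounds discussed in this thread (falling back to nio, and keeping netty but disabling off-heap direct buffers) map to Spark 1.x-era configuration properties. These names are from the Spark 1.2 configuration and should be verified against the deployed version:

{code}
# spark-defaults.conf (Spark 1.2-era property names; verify against your version)
spark.shuffle.blockTransferService   nio

# or, staying on netty but avoiding off-heap direct buffers:
spark.shuffle.io.preferDirectBufs    false
{code}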
[jira] [Updated] (SPARK-4516) Race condition in netty
[ https://issues.apache.org/jira/browse/SPARK-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hector Yee updated SPARK-4516:
------------------------------
    Affects Version/s:     (was: 1.1.0)
                       1.1.1

> Race condition in netty
> -----------------------
>
>                 Key: SPARK-4516
>                 URL: https://issues.apache.org/jira/browse/SPARK-4516
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 1.1.1
>         Environment: Linux, Mesos
>            Reporter: Hector Yee
>              Labels: netty, shuffle
>
> The netty block transfer manager has a race condition where it closes an
> active connection, resulting in the error quoted in full at the top of this
> thread. Switching to nio seems to alleviate the problem.
[jira] [Created] (SPARK-4516) Race condition in netty
Hector Yee created SPARK-4516:
---------------------------------

             Summary: Race condition in netty
                 Key: SPARK-4516
                 URL: https://issues.apache.org/jira/browse/SPARK-4516
             Project: Spark
          Issue Type: Bug
          Components: Shuffle
    Affects Versions: 1.1.0
         Environment: Linux, Mesos
            Reporter: Hector Yee

The netty block transfer manager has a race condition where it closes an active connection, resulting in the error quoted in full at the top of this thread. Switching to nio seems to alleviate the problem.
[jira] [Commented] (SPARK-3633) Fetches failure observed after SPARK-2711
[ https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219807#comment-14219807 ]

Hector Yee commented on SPARK-3633:
-----------------------------------

I think it may be a different bug. I looked at the failed executor, and it looks like something is closing the connection, causing fetches to fail:

{code}
14/11/20 18:53:43 INFO TransportClientFactory: Found inactive connection to i-974cd879.inst.aws.airbnb.com/10.154.228.43:57773, closing it.
14/11/20 18:53:43 ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to i-974cd879.inst.aws.airbnb.com/10.154.228.43:57773
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:141)
	at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
	at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
	at org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
	at org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:87)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:148)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:288)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at com.airbnb.common.ml.training.LinearRankerTrainer$$anonfun$7.apply(LinearRankerTrainer.scala:246)
	at com.airbnb.common.ml.training.LinearRankerTrainer$$anonfun$7.apply(LinearRankerTrainer.scala:235)
	at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
	at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:56)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: i-974cd879.inst.aws.airbnb.com/10.154.228.43:57773
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
	... 1 more
{code}

> Fetches failure observed after SPARK-2711
> -----------------------------------------
>
>                 Key: SPARK-3633
>                 URL: https://issues.apache.org/jira/browse/SPARK-3633
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager
>    Affects Versions: 1.1.0
>            Reporter: Nishkam Ravi
>            Priority: Blocker
>
> Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset.
> Recently upgraded to Spark 1.1. The workload fails with the following error
> message(s):
> {code}
> 14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552,
> c1705.ha
[jira] [Commented] (SPARK-3633) Fetches failure observed after SPARK-2711
[ https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219765#comment-14219765 ]

Hector Yee commented on SPARK-3633:
-----------------------------------

I'm still seeing a similar error in Spark 1.2 RC2:

{code}
14/11/20 18:41:12 WARN TaskSetManager: Lost task 4.1 in stage 1.0 (TID 907, i-8cb72661.inst.aws.airbnb.com): FetchFailed(null, shuffleId=1, mapId=-1, reduceId=4, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1
	at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:386)
	at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:383)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
	at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:382)
	at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:178)
	at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
	at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
	at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
{code}

> Fetches failure observed after SPARK-2711
> -----------------------------------------
>
>                 Key: SPARK-3633
>                 URL: https://issues.apache.org/jira/browse/SPARK-3633
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager
>    Affects Versions: 1.1.0
>            Reporter: Nishkam Ravi
>            Priority: Blocker
>
> Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset.
> Recently upgraded to Spark 1.1. The workload fails with the following error
> message(s):
> {code}
> 14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552,
> c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1,
> c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120)
> 14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages
> {code}
> In order to identify the problem, I carried out change set analysis. As I go
> back in time, the error message changes to:
> {code}
> 14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519,
> c1706.halxg.cloudera.com): java.io.FileNotFoundException:
> /var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034
> (Too many open files)
> 	java.io.FileOutputStream.open(Native Method)
> 	java.io.FileOutputStream.<init>(FileOutputStream.java:221)
> 	org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117)
> 	org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185)
> 	org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197)
> 	org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145)
> 	org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
> 	org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51)
> 	org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> 	org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> 	org.apache.spark.scheduler.Task.run(Task.scala:54)
> 	org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
> 	java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	java.lang.Thread.run(Thread.java:745)
> {code}
> All the way until Aug 4th. Turns out the problem changeset is 4fde28c.
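The "Too many open files" failure quoted above is typically worked around by raising the file-descriptor limit on the workers and/or consolidating shuffle output files; the Spark property below existed in the 1.x era and should be checked against the version in use:

{code}
# On each worker, before launching the executor (example limit):
ulimit -n 65536

# spark-defaults.conf (Spark 0.8.1+/1.x; verify against your version):
spark.shuffle.consolidateFiles   true
{code}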
[jira] [Created] (SPARK-3871) compute-classpath.sh does not escape :
Hector Yee created SPARK-3871:
---------------------------------

             Summary: compute-classpath.sh does not escape :
                 Key: SPARK-3871
                 URL: https://issues.apache.org/jira/browse/SPARK-3871
             Project: Spark
          Issue Type: Bug
          Components: Mesos
    Affects Versions: 1.1.0
            Reporter: Hector Yee
            Priority: Minor

Chronos jobs on Mesos are scheduled in temp directories such as

{code}
/tmp/mesos/slaves/20140926-142803-3852091146-5050-3487-375/frameworks/20140719-203536-160311562-5050-10655-0007/executors/ct:1412815902180:2:search_ranking_scoring/runs/f1e0d058-3ef0-4838-816e-e3fa5e179dd8
{code}

compute-classpath.sh does not properly escape the : characters in the temp dirs generated by Mesos, so spark-submit ends up with a broken classpath.
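The failure mode can be reproduced without Mesos: joining classpath entries with ':' and later splitting on ':' breaks any entry whose path itself contains a colon. A minimal sketch (the paths below are made up for illustration):

```scala
// Hypothetical Mesos-style executor dir whose path contains ':' characters
val mesosDir = "/tmp/mesos/executors/ct:1412815902180:2:job/runs/abc"
val jar = "/opt/spark/lib/spark-assembly.jar"

// Joining entries with ':' (as compute-classpath.sh effectively does)...
val classpath = Seq(jar, mesosDir).mkString(":")

// ...means a later naive split no longer recovers the two entries:
val entries = classpath.split(":")
// entries has 5 elements, not 2: the Mesos dir is broken into pieces
```

Proper quoting or escaping of each entry before concatenation avoids the ambiguity.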
[jira] [Created] (SPARK-3753) Spark hive join results in empty with shared hive context
Hector Yee created SPARK-3753:
---------------------------------

             Summary: Spark hive join results in empty with shared hive context
                 Key: SPARK-3753
                 URL: https://issues.apache.org/jira/browse/SPARK-3753
             Project: Spark
          Issue Type: Bug
    Affects Versions: 1.1.0
            Reporter: Hector Yee
            Priority: Minor

When I have two hive tables and do a join with the same hive context, I get the empty set, e.g.:

{code}
val hc = new HiveContext(sc)
val table1 = hc.sql("SELECT * from t1")
val table2 = hc.sql("SELECT * from t2")
val intersect = table1.join(table2).take(10) // empty set
{code}

but this works if I use a separate context per table:

{code}
val hc1 = new HiveContext(sc)
val table1 = hc1.sql("SELECT * from t1")
val hc2 = new HiveContext(sc)
val table2 = hc2.sql("SELECT * from t2")
val intersect = table1.join(table2).take(10)
{code}

I am not sure whether take is being propagated up to table1 and table2 before the intersect (which for large tables would mean no results), or whether it is some other problem with the hive context. Doing the join in a single SQL query also returns the empty set.
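As a sanity check on what the join should return, here is a plain-Scala analogue of a keyed equi-join over toy rows (this is not the Spark/Hive API; the table contents are made up). If Spark returns empty while the same rows joined by key are not empty, the problem lies in the context or query plan rather than the data:

```scala
// Toy rows standing in for t1 and t2 (hypothetical data, not Hive tables)
val t1 = Seq(("a", 1), ("b", 2))
val t2 = Seq(("b", 20), ("c", 30))

// Keyed equi-join on the first column
val joined = for ((k1, v1) <- t1; (k2, v2) <- t2 if k1 == k2) yield (k1, v1, v2)
// joined == Seq(("b", 2, 20))
```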
[jira] [Commented] (SPARK-1547) Add gradient boosting algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049659#comment-14049659 ]

Hector Yee commented on SPARK-1547:
-----------------------------------

Just generic log loss with L1 regularization should suffice; most of the work is in feature engineering anyway. There is no hurry at all: I already have several implementations outside MLlib that I am using. It would just be convenient to have another implementation to compare against.

> Add gradient boosting algorithm to MLlib
> ----------------------------------------
>
>                 Key: SPARK-1547
>                 URL: https://issues.apache.org/jira/browse/SPARK-1547
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>    Affects Versions: 1.0.0
>            Reporter: Manish Amde
>            Assignee: Manish Amde
>
> This task requires adding the gradient boosting algorithm to Spark MLlib. The
> implementation needs to adapt the gradient boosting algorithm to the scalable
> tree implementation.
> The tasks involve:
> - Comparing the various tradeoffs and finalizing the algorithm before implementation
> - Code implementation
> - Unit tests
> - Functional tests
> - Performance tests
> - Documentation
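For concreteness, "generic log loss with L1 regularization" for a single dense example reduces to the following; this is a from-scratch sketch, not MLlib code, and the function names are made up:

```scala
// Logistic (log) loss with an L1 penalty, for one dense example; y in {0, 1}
def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

def lossAndGradient(w: Array[Double], x: Array[Double], y: Double,
                    lambda: Double): (Double, Array[Double]) = {
  val margin = w.zip(x).map { case (wi, xi) => wi * xi }.sum
  val p = sigmoid(margin)
  // Negative log-likelihood plus the L1 penalty on the weights
  val loss = -y * math.log(p) - (1 - y) * math.log(1 - p) + lambda * w.map(math.abs).sum
  // Gradient: (p - y) * x plus the L1 subgradient lambda * sign(w)
  val grad = w.zip(x).map { case (wi, xi) => (p - y) * xi + lambda * math.signum(wi) }
  (loss, grad)
}

// At w = 0 the prediction is 0.5, the loss is ln 2, and the gradient
// pulls the weight toward the positive label
val (loss0, grad0) = lossAndGradient(Array(0.0), Array(1.0), 1.0, 0.1)
```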
[jira] [Commented] (SPARK-1547) Add gradient boosting algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049394#comment-14049394 ]

Hector Yee commented on SPARK-1547:
-----------------------------------

Honestly, trees are most useful when the feature vectors are dense. Any possibility that the solver could be decoupled from the tree part for dealing with sparse data?