[jira] [Commented] (SPARK-4516) Netty off-heap memory use causes executors to be killed by OS

2014-11-21 Thread Hector Yee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221758#comment-14221758
 ] 

Hector Yee commented on SPARK-4516:
---

Yes, turning off direct buffers worked with Netty.
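
For reference, the two workarounds discussed in this thread can be expressed as ordinary Spark configuration. This is only a sketch against the Spark 1.2-era settings spark.shuffle.blockTransferService and spark.shuffle.io.preferDirectBufs; check them against your build before relying on them.

{code}
// Sketch (not from the original report): the two mitigations mentioned in this thread.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-workaround-sketch")
  // Option 1: fall back from the netty block transfer service to the older nio one.
  .set("spark.shuffle.blockTransferService", "nio")
  // Option 2: keep netty but prefer heap buffers over direct (off-heap) buffers.
  .set("spark.shuffle.io.preferDirectBufs", "false")

val sc = new SparkContext(conf)
{code}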

> Netty off-heap memory use causes executors to be killed by OS
> -
>
> Key: SPARK-4516
> URL: https://issues.apache.org/jira/browse/SPARK-4516
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.2.0
> Environment: Linux, Mesos
>Reporter: Hector Yee
>Priority: Critical
>  Labels: netty, shuffle
>
> The netty block transfer manager has a race condition where it closes an 
> active connection resulting in the error below. Switching to nio seems to 
> alleviate the problem.
> {code}
> 14/11/20 18:53:43 INFO TransportClientFactory: Found inactive connection to 
> i-974cd879.inst.aws.airbnb.com/10.154.228.43:57773, closing it.
> 14/11/20 18:53:43 ERROR RetryingBlockFetcher: Exception while beginning fetch 
> of 1 outstanding blocks 
> java.io.IOException: Failed to connect to 
> i-974cd879.inst.aws.airbnb.com/10.154.228.43:57773
> at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:141)
> at 
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
> at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
> at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
> at 
> org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:87)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:148)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:288)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at 
> com.airbnb.common.ml.training.LinearRankerTrainer$$anonfun$7.apply(LinearRankerTrainer.scala:246)
> at 
> com.airbnb.common.ml.training.LinearRankerTrainer$$anonfun$7.apply(LinearRankerTrainer.scala:235)
> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.ConnectException: Connection refused: 
> i-974cd879.inst.aws.airbnb.com/10.154.228.43:57773
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
> at 
> io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
> at 
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
> {code}






[jira] [Commented] (SPARK-4516) Netty off-heap memory use causes executors to be killed by OS

2014-11-20 Thread Hector Yee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220382#comment-14220382
 ] 

Hector Yee commented on SPARK-4516:
---

Also, the log was from /tmp/mesos/slaves/20141023-174642-3852091146-5050-41161-1224/frameworks/20141119-235105-1873488138-31272-108823-0041/executors/33/runs/def5caa2-c0f3-4175-9f5c-210735e6e009, so just to confirm, the executor and container IDs match up.





[jira] [Commented] (SPARK-4516) Netty off-heap memory use causes executors to be killed by OS

2014-11-20 Thread Hector Yee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220362#comment-14220362
 ] 

Hector Yee commented on SPARK-4516:
---

I checked the mesos log on the slave and it was an OOM kill

I1120 22:36:20.193279 95373 slave.cpp:3321] Current usage 33.69%. Max allowed 
age: 1.126220739407303days
I1120 22:36:23.329488 95371 mem.cpp:532] OOM notifier is triggered for 
container def5caa2-c0f3-4175-9f5c-210735e6e009
I1120 22:36:23.329684 95371 mem.cpp:551] OOM detected for container 
def5caa2-c0f3-4175-9f5c-210735e6e009
I1120 22:36:23.330762 95371 mem.cpp:605] Memory limit exceeded: Requested: 
26328MB Maximum Used: 26328MB

MEMORY STATISTICS: 
cache 126976
rss 27606781952
rss_huge 0
mapped_file 16384
writeback 0
swap 0
pgpgin 14435895
pgpgout 7695927
pgfault 63682623
pgmajfault 824
inactive_anon 0
active_anon 27606781952
inactive_file 126976
active_file 0
unevictable 0
hierarchical_memory_limit 27606908928
hierarchical_memsw_limit 18446744073709551615
total_cache 126976
total_rss 27606781952
total_rss_huge 0
total_mapped_file 16384
total_writeback 0
total_swap 0
total_pgpgin 14435895
total_pgpgout 7695927
total_pgfault 63682623
total_pgmajfault 824
total_inactive_anon 0
total_active_anon 27606781952
total_inactive_file 126976
total_active_file 0
total_unevictable 0
I1120 22:36:23.330862 95371 containerizer.cpp:1133] Container 
def5caa2-c0f3-4175-9f5c-210735e6e009 has reached its limit for resource 
mem(*):26328 and will be terminated
I1120 22:36:23.330899 95371 containerizer.cpp:946] Destroying container 
'def5caa2-c0f3-4175-9f5c-210735e6e009'
I1120 22:36:23.332049 95367 cgroups.cpp:2207] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/def5caa2-c0f3-4175-9f5c-210735e6e009
I1120 22:36:23.434741 95371 cgroups.cpp:1374] Successfully froze cgroup 
/sys/fs/cgroup/freezer/mesos/def5caa2-c0f3-4175-9f5c-210735e6e009 after 
102.648064ms
I1120 22:36:23.436122 95391 cgroups.cpp:2224] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/def5caa2-c0f3-4175-9f5c-210735e6e009
I1120 22:36:23.437611 95391 cgroups.cpp:1403] Successfullly thawed cgroup 
/sys/fs/cgroup/freezer/mesos/def5caa2-c0f3-4175-9f5c-210735e6e009 after 
1.39904ms
I1120 22:36:23.439303 95394 containerizer.cpp:1117] Executor for container 
'def5caa2-c0f3-4175-9f5c-210735e6e009' has exited
I1120 22:36:25.953094 95368 slave.cpp:2898] Executor '33' of framework 
20141119-235105-1873488138-31272-108823-0041 terminated with signal Killed
I1120 22:36:25.953872 95368 slave.cpp:2215] Handling status update TASK_FAILED 
(UUID: 986cf483-d400-4edc-8423-dd0e51dfeeb8) for task 33 of framework 
20141119-235105-1873488138-31272-108823-0041 from @0.0.0.0:0
I1120 22:36:25.953943 95368 slave.cpp:4305] Terminating task 33
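
For context on the numbers above: the reported rss of 27,606,781,952 bytes is roughly 26,327 MB, essentially equal to the 26,328 MB container limit, so the process as a whole (JVM heap plus any off-heap allocations) hit the Mesos memory cap rather than throwing a Java OutOfMemoryError. The usual mitigation, and the setting referenced elsewhere in this thread, is to leave more headroom between the executor heap and the container limit. A minimal sketch; the sizes are illustrative, not taken from this job:

{code}
// Sketch (illustrative sizes): reserve extra container headroom for off-heap use
// (netty direct buffers, native snappy, etc.) on top of the executor's JVM heap.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "24g")                // JVM heap
  .set("spark.mesos.executor.memoryOverhead", "2048") // extra MB added to the Mesos container
{code}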



[jira] [Updated] (SPARK-4516) Lost task with netty

2014-11-20 Thread Hector Yee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hector Yee updated SPARK-4516:
--
Summary: Lost task with netty  (was: Race condition in netty)







[jira] [Commented] (SPARK-4516) Race condition in netty

2014-11-20 Thread Hector Yee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220157#comment-14220157
 ] 

Hector Yee commented on SPARK-4516:
---

Digging deeper, it looks like you are right: the first machine fails silently, with no reason given in the log.
My guess is that it ran out of memory; that is usually what is going on when this kind of thing happens. The last time it happened, the native snappy library was using too much RAM, and I upped the Mesos overhead to 1G to fix those snappy errors: --conf spark.mesos.executor.memoryOverhead=1024
Is it possible that netty uses something off the Java heap and is allocating too much? Or maybe there is a silent failure somewhere that is not logged? (See the configuration sketch after the diagnostics below.)

Diagnostics follow:

The 1st machine (f20aaa19) fails with nothing in the log; the last thing it logs is starting 3 remote fetches:

14/11/20 22:35:18 INFO MapOutputTrackerWorker: Don't have map outputs for 
shuffle 1, fetching them
14/11/20 22:35:18 INFO MapOutputTrackerWorker: Doing the fetch; tracker actor = 
Actor[akka.tcp://sparkdri...@i-62305989.inst.aws.airbnb.com:46605/user/MapOutputTracker#-1862215473]
14/11/20 22:35:18 INFO MapOutputTrackerWorker: Got the output locations
14/11/20 22:35:18 INFO ShuffleBlockFetcherIterator: Getting 498 non-empty 
blocks out of 724 blocks
14/11/20 22:35:18 INFO ShuffleBlockFetcherIterator: Started 3 remote fetches in 
67 ms

On the master it says
14/11/20 22:36:25 ERROR TaskSchedulerImpl: Lost executor 
20141023-174642-3852091146-5050-41161-1224 on i-f20aaa19.inst.aws.airbnb.com: 
remote Akka client disassociated
14/11/20 22:36:25 WARN ReliableDeliverySupervisor: Association with remote 
system [akka.tcp://sparkexecu...@i-f20aaa19.inst.aws.airbnb.com:54417] has 
failed, address is now gated for [5000] ms. Reason is: [Disassociated].
14/11/20 22:36:25 INFO TaskSetManager: Re-queueing tasks for 
20141023-174642-3852091146-5050-41161-1224 from TaskSet 0.0

14/11/20 22:36:25 WARN TaskSetManager: Lost task 7.0 in stage 1.0 (TID 898, 
i-f20aaa19.inst.aws.airbnb.com): ExecutorLostFailure (executor 
20141023-174642-3852091146-5050-41161-1224 lost)
14/11/20 22:36:25 ERROR CoarseMesosSchedulerBackend: Asked to remove 
non-existent executor 20141023-174642-3852091146-5050-41161-1224

The 2nd machine fails, saying it could not connect:

14/11/20 22:36:36 INFO TransportClientFactory: Found inactive connection to 
i-f20aaa19.inst.aws.airbnb.com/10.225.139.181:51003, closing it.
14/11/20 22:36:36 ERROR RetryingBlockFetcher: Exception while beginning fetch 
of 1 outstanding blocks 
java.io.IOException: Failed to connect to 
i-f20aaa19.inst.aws.airbnb.com/10.225.139.181:51003
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:141)
at 
org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
at 
org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:87)
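
Relating this back to the off-heap question above: one hedged way to tell whether netty's direct buffers are the culprit is to cap direct memory explicitly, so an over-allocation surfaces as an in-JVM OutOfMemoryError with a stack trace instead of the container being OOM-killed by Mesos. A sketch only; the 2g cap is an assumption, not a value from this thread:

{code}
// Sketch (assumed cap): bound direct (off-heap) buffer allocations in the executor JVM.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-XX:MaxDirectMemorySize=2g")
{code}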


[jira] [Commented] (SPARK-4516) Race condition in netty

2014-11-20 Thread Hector Yee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220025#comment-14220025
 ] 

Hector Yee commented on SPARK-4516:
---

I believe the channel is being marked as inactive while it is still in use. I didn't dig into the code, so I can't say for certain, but according to the logs that is what seems to be happening.







[jira] [Updated] (SPARK-4516) Race condition in netty

2014-11-20 Thread Hector Yee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hector Yee updated SPARK-4516:
--
Affects Version/s: (was: 1.1.1)
   1.2.0







[jira] [Updated] (SPARK-4516) Race condition in netty

2014-11-20 Thread Hector Yee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hector Yee updated SPARK-4516:
--
Affects Version/s: (was: 1.1.0)
   1.1.1







[jira] [Created] (SPARK-4516) Race condition in netty

2014-11-20 Thread Hector Yee (JIRA)
Hector Yee created SPARK-4516:
-

 Summary: Race condition in netty
 Key: SPARK-4516
 URL: https://issues.apache.org/jira/browse/SPARK-4516
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 1.1.0
 Environment: Linux, Mesos
Reporter: Hector Yee


The netty block transfer manager has a race condition where it closes an active 
connection resulting in the error below. Switching to nio seems to alleviate 
the problem.

14/11/20 18:53:43 INFO TransportClientFactory: Found inactive connection to 
i-974cd879.inst.aws.airbnb.com/10.154.228.43:57773, closing it.
14/11/20 18:53:43 ERROR RetryingBlockFetcher: Exception while beginning fetch 
of 1 outstanding blocks 
java.io.IOException: Failed to connect to 
i-974cd879.inst.aws.airbnb.com/10.154.228.43:57773
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:141)
at 
org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
at 
org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:87)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:148)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:288)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
com.airbnb.common.ml.training.LinearRankerTrainer$$anonfun$7.apply(LinearRankerTrainer.scala:246)
at 
com.airbnb.common.ml.training.LinearRankerTrainer$$anonfun$7.apply(LinearRankerTrainer.scala:235)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: 
i-974cd879.inst.aws.airbnb.com/10.154.228.43:57773
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)






[jira] [Commented] (SPARK-3633) Fetches failure observed after SPARK-2711

2014-11-20 Thread Hector Yee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219807#comment-14219807
 ] 

Hector Yee commented on SPARK-3633:
---

I think it may be a different bug. I looked at the failed executor, and it looks like something is closing the connection, causing fetches to fail:

14/11/20 18:53:43 INFO TransportClientFactory: Found inactive connection to 
i-974cd879.inst.aws.airbnb.com/10.154.228.43:57773, closing it.
14/11/20 18:53:43 ERROR RetryingBlockFetcher: Exception while beginning fetch 
of 1 outstanding blocks 
java.io.IOException: Failed to connect to 
i-974cd879.inst.aws.airbnb.com/10.154.228.43:57773
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:141)
at 
org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
at 
org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:87)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:148)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:288)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
com.airbnb.common.ml.training.LinearRankerTrainer$$anonfun$7.apply(LinearRankerTrainer.scala:246)
at 
com.airbnb.common.ml.training.LinearRankerTrainer$$anonfun$7.apply(LinearRankerTrainer.scala:235)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: 
i-974cd879.inst.aws.airbnb.com/10.154.228.43:57773
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
... 1 more


[jira] [Commented] (SPARK-3633) Fetches failure observed after SPARK-2711

2014-11-20 Thread Hector Yee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219765#comment-14219765
 ] 

Hector Yee commented on SPARK-3633:
---

I'm still seeing a similar error in Spark 1.2 RC2:

14/11/20 18:41:12 WARN TaskSetManager: Lost task 4.1 in stage 1.0 (TID 907, 
i-8cb72661.inst.aws.airbnb.com): FetchFailed(null, shuffleId=1, mapId=-1, 
reduceId=4, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
location for shuffle 1
at 
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:386)
at 
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:383)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:382)
at 
org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:178)
at 
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
at 
org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)


> Fetches failure observed after SPARK-2711
> -
>
> Key: SPARK-3633
> URL: https://issues.apache.org/jira/browse/SPARK-3633
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 1.1.0
>Reporter: Nishkam Ravi
>Priority: Blocker
>
> Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. 
> Recently upgraded to Spark 1.1. The workload fails with the following error 
> message(s):
> {code}
> 14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, 
> c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, 
> c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120)
> 14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages
> {code}
> In order to identify the problem, I carried out change set analysis. As I go 
> back in time, the error message changes to:
> {code}
> 14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, 
> c1706.halxg.cloudera.com): java.io.FileNotFoundException: 
> /var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034
>  (Too many open files)
> java.io.FileOutputStream.open(Native Method)
> java.io.FileOutputStream.<init>(FileOutputStream.java:221)
> 
> org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117)
> 
> org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185)
> 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197)
> 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145)
> org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
> 
> org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51)
> 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> org.apache.spark.scheduler.Task.run(Task.scala:54)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
> {code}
> All the way until Aug 4th. Turns out the problem changeset is 4fde28c. 






[jira] [Created] (SPARK-3871) compute-classpath.sh does not escape :

2014-10-08 Thread Hector Yee (JIRA)
Hector Yee created SPARK-3871:
-

 Summary: compute-classpath.sh does not escape :
 Key: SPARK-3871
 URL: https://issues.apache.org/jira/browse/SPARK-3871
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.1.0
Reporter: Hector Yee
Priority: Minor


Chronos jobs on Mesos run out of temp (sandbox) directories such as
/tmp/mesos/slaves/20140926-142803-3852091146-5050-3487-375/frameworks/20140719-203536-160311562-5050-10655-0007/executors/ct:1412815902180:2:search_ranking_scoring/runs/f1e0d058-3ef0-4838-816e-e3fa5e179dd8

compute-classpath.sh does not properly escape the ':' in the temp directories generated by Mesos, so spark-submit ends up with a broken classpath.
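
To illustrate the failure mode (a hypothetical sketch, not code from compute-classpath.sh): because ':' is also the JVM classpath separator, a sandbox directory whose name contains ':' silently splits one classpath entry into several bogus ones.

{code}
// Hypothetical illustration: a Mesos/Chronos sandbox path containing ':' is joined
// into a classpath without escaping, so the JVM sees extra, bogus entries.
// (Slave/framework/run IDs shortened to S1/F1/R1 for readability.)
val sandbox = "/tmp/mesos/slaves/S1/frameworks/F1/executors/ct:1412815902180:2:search_ranking_scoring/runs/R1"
val classpath = Seq(s"$sandbox/app.jar", "/usr/lib/spark/lib/spark-assembly.jar").mkString(":")

// What the JVM actually sees as classpath entries:
classpath.split(":").foreach(println)
// /tmp/mesos/slaves/S1/frameworks/F1/executors/ct
// 1412815902180
// 2
// search_ranking_scoring/runs/R1/app.jar
// /usr/lib/spark/lib/spark-assembly.jar
{code}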






[jira] [Created] (SPARK-3753) Spark hive join results in empty with shared hive context

2014-09-30 Thread Hector Yee (JIRA)
Hector Yee created SPARK-3753:
-

 Summary: Spark hive join results in empty with shared hive context
 Key: SPARK-3753
 URL: https://issues.apache.org/jira/browse/SPARK-3753
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0
Reporter: Hector Yee
Priority: Minor


When I have two Hive tables and join them using the same HiveContext, I get an empty result set.

e.g.

val hc = new HiveContext(sc)
val table1 = hc.sql("SELECT * from t1")
val table2 = hc.sql("SELECT * from t2")
val intersect = table1.join(table2).take(10)
// empty set

but this works if I use two separate contexts:
val hc1 = new HiveContext(sc)
val table1 = hc1.sql("SELECT * from t1")
val hc2 = new HiveContext(sc)
val table2 = hc2.sql("SELECT * from t2")
val intersect = table1.join(table2).take(10)

I am not sure whether the take is being pushed down to table1 and table2 before the join is performed (which for large tables would mean no results), or whether it is some other problem with the HiveContext.

Doing the join in a single SQL query also seems to produce an empty result.







[jira] [Commented] (SPARK-1547) Add gradient boosting algorithm to MLlib

2014-07-01 Thread Hector Yee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049659#comment-14049659
 ] 

Hector Yee commented on SPARK-1547:
---

Just generic log loss with L1 regularization should suffice; most of the work is in feature engineering anyway. There is no hurry at all, as I already have several implementations outside MLlib that I am using. It would just be convenient to have another implementation to compare against.
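
For concreteness, one standard form of the objective being asked for here, written out as an assumption about what "generic log loss with L1 regularization" means (binary labels y_i in {-1, +1}, weight vector w, regularization strength lambda):

{code}
L(w) = \sum_{i=1}^{n} \log\bigl(1 + \exp(-y_i \, w^\top x_i)\bigr) + \lambda \lVert w \rVert_1
{code}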

> Add gradient boosting algorithm to MLlib
> 
>
> Key: SPARK-1547
> URL: https://issues.apache.org/jira/browse/SPARK-1547
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Manish Amde
>Assignee: Manish Amde
>
> This task requires adding the gradient boosting algorithm to Spark MLlib. The 
> implementation needs to adapt the gradient boosting algorithm to the scalable 
> tree implementation.
> The task involves:
> - Comparing the various tradeoffs and finalizing the algorithm before 
> implementation
> - Code implementation
> - Unit tests
> - Functional tests
> - Performance tests
> - Documentation





[jira] [Commented] (SPARK-1547) Add gradient boosting algorithm to MLlib

2014-07-01 Thread Hector Yee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049394#comment-14049394
 ] 

Hector Yee commented on SPARK-1547:
---

Honestly, trees are most useful when the feature vectors are dense. Is there any possibility that the solver could be decoupled from the tree part, to deal with sparse data?



