Re: executor failures w/ scala 2.10

2013-11-13 Thread Prashant Sharma
We may no longer need to track disassociation and, IMHO, can use the *improved* feature in akka 2.2.x called remote death watch, which lets us acknowledge a remote death in case of both a natural demise and an accidental death. This was not the case with remote death watch in previous akka releases.
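
For illustration only, a minimal sketch of what relying on death watch could look like, assuming Akka 2.2.x on the classpath. The `ExecutorWatcher` actor and the `RegisterExecutor` message are hypothetical names, not taken from the Spark code base:

```scala
import akka.actor.{Actor, ActorLogging, ActorRef, Terminated}

// Hypothetical message: an executor announces itself to the driver-side actor.
case class RegisterExecutor(executor: ActorRef)

// Hypothetical driver-side actor that relies on death watch instead of
// tracking remote lifecycle (disassociation) events itself.
class ExecutorWatcher extends Actor with ActorLogging {
  private var executors = Set.empty[ActorRef]

  def receive = {
    case RegisterExecutor(executor) =>
      // watch() accepts remote ActorRefs as well as local ones.
      context.watch(executor)
      executors += executor

    case Terminated(executor) =>
      // Delivered when the watched actor stops or is considered unreachable.
      executors -= executor
      log.warning("Executor {} terminated", executor)
  }
}
```

Whether this fully replaces the existing disassociation handling is an open question; the sketch only shows the watch/Terminated pattern itself.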

Re: executor failures w/ scala 2.10

2013-11-13 Thread Matei Zaharia
Hey Prashant, do messages still get lost while we’re dissociated? Or can you set the timeouts high enough to prevent that? Matei On Nov 13, 2013, at 12:39 AM, Prashant Sharma scrapco...@gmail.com wrote: We may no longer need to track disassociation and IMHO use the *improved* feature in akka

Re: executor failures w/ scala 2.10

2013-11-13 Thread Prashant Sharma
We can set the timeouts high enough, the same as the connection timeout that we already set. On Wed, Nov 13, 2013 at 11:37 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hey Prashant, do messages still get lost while we’re dissociated? Or can you set the timeouts high enough to prevent that? Matei

Re: executor failures w/ scala 2.10

2013-11-01 Thread Matei Zaharia
Yes, so far they’ve been built on that assumption — not that Akka would *guarantee* delivery in that as soon as the send() call returns you know it’s delivered, but that Akka would act the same way as a TCP socket, allowing you to send a stream of messages in order and hear when the connection

Re: executor failures w/ scala 2.10

2013-10-31 Thread Imran Rashid
Unfortunately, that change wasn't the silver bullet I was hoping for. Even with 1) ignoring DisassociatedEvent, 2) the executor using ReliableProxy to send messages back to the driver, and 3) turning up akka.remote.watch-failure-detector.threshold=12, there is a lot of weird behavior. First, there are a few

Re: executor failures w/ scala 2.10

2013-10-31 Thread Matei Zaharia
It’s true that Akka’s delivery guarantees are in general at-most-once, but if you look at the text there it says that they differ by transport. In the previous version, I’m quite sure that except maybe in very rare circumstances or cases where we had a bug, Akka’s remote layer always kept

Re: executor failures w/ scala 2.10

2013-10-31 Thread Sriram Ramachandrasekaran
Sorry if my understanding is wrong. Maybe, for this particular case, it might have something to do with the load/network, but in general, are you saying that we build these communication channels (block manager communication, task events communication, etc.) assuming akka would take care of it? I

Re: executor failures w/ scala 2.10

2013-10-30 Thread Prashant Sharma
I have had things running (from the scala 2.10 branch) for over 3-4 hours now without a problem, and my jobs write about the same amount of data as you suggested. My cluster size is 7 nodes and it is not *congested* for memory. I am going to leave the jobs running all night long. Meanwhile, I'd encourage you to try to spot the

Re: executor failures w/ scala 2.10

2013-10-30 Thread Imran Rashid
I'm gonna try turning on more akka debugging msgs as described at http://akka.io/faq/ and http://doc.akka.io/docs/akka/current/scala/testing.html#Tracing_Actor_Invocations. Unfortunately, that will require a patch to spark, but hopefully that will give us more info to go on ... On Wed, Oct 30,

Re: executor failures w/ scala 2.10

2013-10-30 Thread Prashant Sharma
Can you apply this patch too and check the logs of the driver and worker? diff --git a/core/src/main/scala/org/apache/spark/scheduler/cluster/StandaloneSchedulerBackend.scala b/core/src/main/scala/org/apache/spark/scheduler/cluster/StandaloneSchedulerBackend.scala index b6f0ec9..ad0ebf7 100644 ---

Re: executor failures w/ scala 2.10

2013-10-30 Thread Prashant Sharma
I am guessing something is wrong with using the disassociation event, then. Try applying something along the lines of this patch. This might cause the executors to hang, so be prepared for that. diff --git a/core/src/main/scala/org/apache/spark/executor/StandaloneExecutorBackend.scala

Re: executor failures w/ scala 2.10

2013-10-30 Thread Imran Rashid
Yeah, it just causes them to hang. The first deadLetters message shows up at about the same time. Oddly, after it first happens, I keep getting some results trickling in from those executors (maybe they were just queued up on the driver already, I dunno), but then it just hangs. the stage has a

executor failures w/ scala 2.10

2013-10-29 Thread Imran Rashid
We've been testing out the 2.10 branch of spark, and we're running into some issues where akka disconnects from the executors after a while. We ran some simple tests first, and all was well, so we started upgrading our whole codebase to 2.10. Everything seemed to be working, but then we noticed