Hi Matt,

I’m not sure whether those tests will actually find this specific issue.  The 
tests that I linked to exercise Spark’s ZooKeeper-based multi-master mode, whereas 
it sounds like you’re seeing this issue in a regular standalone cluster.  In 
those tests, the workers disconnect from the master because we've actually 
killed the master, whereas it sounds like your workers are disconnecting due to 
intermittent network issues and you’d like them to reconnect to the same master.

FaultToleranceTest is not run automatically and I don’t think that it was 
designed for automatic use: for example, it doesn’t clean up Docker containers 
after failed tests, doesn’t use a framework like ScalaTest for structured test 
result reporting, etc.  The new framework that I’m working on will address 
these issues.

- Josh

On October 15, 2014 at 3:04:42 PM, Matthew Cheah (matthew.c.ch...@gmail.com) 
wrote:

Thanks Josh! These tests seem to cover the cases I'm looking for already =).

What's interesting though is that we still ran into SPARK-3736 despite such 
integration tests being in place to catch it. Specifically, when the master 
disconnects and then restarts, the workers should reconnect to it. Are the 
tests here run regularly, i.e. in a Jenkins build or a nightly build, and if 
so, how did that test case pass while SPARK-3736 apparently still exists?

At any rate, I think I'll submit my fix PR but with no particular extra 
automated test written for it, since it seems like FaultToleranceTest 
sufficiently covers what I need.

On Wed, Oct 15, 2014 at 2:42 PM, Josh Rosen <rosenvi...@gmail.com> wrote:
There are some end-to-end integration tests of Master <-> Worker 
fault-tolerance in 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/FaultToleranceTest.scala

I’ve actually been working to develop a more generalized Docker-based 
integration-testing framework for Spark in order to test Master <-> Worker 
interactions.  I’d like to eventually clean up my code and release it publicly.

On October 15, 2014 at 2:39:22 PM, Matthew Cheah (matthew.c.ch...@gmail.com) 
wrote:

I think on a higher level I also want to ask why such unit testing has not
actually been done in this codebase. If it's not a common practice to test
message passing then I'm fine with leaving out the unit test, however I'm
more curious as to why such testing was not done before.

On Wed, Oct 15, 2014 at 2:18 PM, Chester Chen <ches...@alpinenow.com> wrote:

> You can call ActorSelection.resolveOne() to see if the actor is still there
> and the path is correct. The method returns a future, and you can wait for
> it with a timeout. This way, you know whether the actor is alive, already
> dead, or the path is incorrect.
>
> Another way is to send an Identify message to the ActorSelection; if it
> replies with a matching ActorIdentity message, you can act on it, otherwise, ...
>
> hope this helps
>
> Chester
>
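A minimal sketch of the two liveness checks suggested above, assuming classic Akka 2.4 or later (for system.terminate()). The names Echo, LivenessCheck, isAlive and identify, and the path /user/master, are hypothetical and chosen only for illustration; this is not code from Spark.

```scala
import akka.actor.{Actor, ActorIdentity, ActorRef, ActorSystem, Identify, Props}
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Await
import scala.concurrent.duration._
import scala.util.Try

// Hypothetical stand-in for a real actor such as the Master.
class Echo extends Actor {
  def receive: Receive = { case msg => sender() ! msg }
}

object LivenessCheck {
  // resolveOne returns a Future[ActorRef]; the future fails if no actor
  // answers at the given path within the timeout.
  def isAlive(system: ActorSystem, path: String): Boolean =
    Try(Await.result(system.actorSelection(path).resolveOne(2.seconds), 3.seconds)).isSuccess

  // Identify is answered automatically with ActorIdentity(correlationId, ref);
  // ref is None when nothing lives at the path.
  def identify(system: ActorSystem, path: String): Option[ActorRef] = {
    implicit val timeout: Timeout = Timeout(2.seconds)
    val reply = (system.actorSelection(path) ? Identify(path)).mapTo[ActorIdentity]
    Await.result(reply, 3.seconds).ref
  }

  def main(args: Array[String]): Unit = {
    val system = ActorSystem("liveness")
    try {
      system.actorOf(Props(new Echo), "master")
      println(isAlive(system, "/user/master"))
      println(isAlive(system, "/user/no-such-actor"))
      println(identify(system, "/user/master"))
    } finally system.terminate()
  }
}
```

Blocking with Await is acceptable in a test or a one-off check; inside an actor it is usually preferable to handle the ActorIdentity reply as an ordinary message.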
> On Wed, Oct 15, 2014 at 1:38 PM, Matthew Cheah <matthew.c.ch...@gmail.com>
> wrote:
>
>> What's happening when I do this is that the Worker tries to get the Master
>> actor by calling context.actorSelection(), and the RegisterWorker message
>> gets sent to the dead letters mailbox instead of being picked up by
>> expectMsg. I'm new to Akka and I've tried various ways of registering a
>> "mock" master to no avail.
>>
>> I would think there would be at least some kind of test for master-worker
>> message passing, no?
>>
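One way to stand in a mock master, sketched under assumptions: a TestProbe plays the master, and the worker is handed the probe's own path, so that context.actorSelection() resolves to the probe rather than to a path where no actor lives. MiniWorker, RegisterWorker, ReregisterWithMaster and MockMasterSketch are hypothetical, simplified stand-ins and not Spark's actual classes; Akka 2.4 or later with akka-testkit is assumed.

```scala
import akka.actor.{Actor, ActorPath, ActorSystem, Props}
import akka.testkit.TestProbe
import scala.concurrent.duration._

// Hypothetical messages, loosely modelled on the ones discussed in the thread.
case class RegisterWorker(id: String)
case object ReregisterWithMaster

// Hypothetical worker: looks the master up by path, as described above.
class MiniWorker(id: String, masterPath: ActorPath) extends Actor {
  private def register(): Unit =
    context.actorSelection(masterPath) ! RegisterWorker(id)

  override def preStart(): Unit = register()

  def receive: Receive = { case ReregisterWithMaster => register() }
}

object MockMasterSketch {
  // Returns the registration messages the mock master received, in order.
  def run(): Seq[RegisterWorker] = {
    val system = ActorSystem("mock-master")
    try {
      val master = TestProbe()(system)
      // The selection resolves because the path is the probe's real path.
      system.actorOf(Props(new MiniWorker("worker-1", master.ref.path)))
      val first = master.expectMsg(3.seconds, RegisterWorker("worker-1"))
      // reply() answers the sender of the last message, i.e. the worker.
      master.reply(ReregisterWithMaster)
      val second = master.expectMsg(3.seconds, RegisterWorker("worker-1"))
      Seq(first, second)
    } finally system.terminate()
  }
}
```

Whether Spark's Worker can be pointed at a probe this way depends on how it builds the master's path, so treat this as a pattern rather than a drop-in test.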
>> On Wed, Oct 15, 2014 at 11:28 AM, Nan Zhu <zhunanmcg...@gmail.com> wrote:
>>
>> > I don’t think there are test cases for Worker itself
>> >
>> >
>> > You can
>> >
>> >
>> > val actorRef = TestActorRef[Master](Props(classOf[Master], ...))(actorSystem)
>> > actorRef.underlyingActor.receive(Heartbeat)
>>
>> >
>> > and use expectMsg to test whether the Master replies with the correct
>> > message, assuming the Worker is absolutely correct.
>> >
>> > Then, in another test case, test whether the Worker sends the register
>> > message to the Master after receiving the Master’s “re-register”
>> > instruction (in this test case, assuming that the Master is absolutely
>> > right).
>> >
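A sketch of the master-side half of this approach, assuming Akka 2.4 or later with akka-testkit. MiniMaster, RegisterWorker, RegisteredWorker and MasterSideSketch are hypothetical stand-ins, not Spark's Master or its messages. The TestActorRef processes messages synchronously and exposes the underlying actor's state, while a TestProbe plays a worker that is assumed to be correct.

```scala
import akka.actor.{Actor, ActorSystem, Props}
import akka.testkit.{TestActorRef, TestProbe}

// Hypothetical messages for illustration only.
case class RegisterWorker(id: String)
case class RegisteredWorker(id: String)

// Hypothetical master: records the worker and acknowledges the registration.
class MiniMaster extends Actor {
  var workers = Set.empty[String]

  def receive: Receive = {
    case RegisterWorker(id) =>
      workers += id
      sender() ! RegisteredWorker(id)
  }
}

object MasterSideSketch {
  // Returns the set of workers the master knows about after one registration.
  def run(): Set[String] = {
    implicit val system: ActorSystem = ActorSystem("master-side")
    try {
      val master = TestActorRef[MiniMaster](Props(new MiniMaster))
      val worker = TestProbe()
      // send() makes the probe the sender, so the reply comes back to it.
      worker.send(master, RegisterWorker("worker-1"))
      worker.expectMsg(RegisteredWorker("worker-1"))
      // underlyingActor gives direct access to the actor's internal state.
      master.underlyingActor.workers
    } finally system.terminate()
  }
}
```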
>> > Best,
>> >
>> > --
>> > Nan Zhu
>> >
>> > On Wednesday, October 15, 2014 at 2:04 PM, Matthew Cheah wrote:
>> >
>> > Thanks, the example was helpful.
>> >
>> > However, testing the Worker itself is a lot more complicated than
>> > WorkerWatcher, since the Worker class is quite a bit more complex. Are
>> > there any tests that inspect the Worker itself?
>> >
>> > Thanks,
>> >
>> > -Matt Cheah
>> >
>> > On Tue, Oct 14, 2014 at 6:40 PM, Nan Zhu <zhunanmcg...@gmail.com> wrote:
>> >
>> > You can use akka testkit
>> >
>> > Example:
>> >
>> > https://github.com/apache/spark/blob/ef4ff00f87a4e8d38866f163f01741c2673e41da/core/src/test/scala/org/apache/spark/deploy/worker/WorkerWatcherSuite.scala
>> >
>> > --
>> > Nan Zhu
>> >
>> > On Tuesday, October 14, 2014 at 9:17 PM, Matthew Cheah wrote:
>> >
>> > Hi everyone,
>> >
>> > I’m adding some new message passing between the Master and Worker actors
>> > in order to address https://issues.apache.org/jira/browse/SPARK-3736 .
>> >
>> > I was wondering if these kinds of interactions are tested in the automated
>> > Jenkins test suite, and if so, where I could find some examples to help me
>> > do the same.
>> >
>> > Thanks!
>> >
>> > -Matt Cheah
>> >
>> >
>> >
>> >
>> >
>>
>
>
