> On March 30, 2017, 8:13 a.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/mesos/VersionedMesosSchedulerImpl.java
> > Lines 120 (patched)
> > <https://reviews.apache.org/r/58053/diff/1/?file=1680496#file1680496line120>
> >
> > Do we have to give up eventually? (I suppose not...)

I don't think so. If we give up, I assume the scheduler is going to shut down.
If Mesos is down, a scheduler shutdown means we will elect a new leader. A new
leader (by default) has a one-minute timeout to register with Mesos. If we give
up, we will just be flapping between leaders until the system heals. I think
that's pretty undesirable.


> On March 30, 2017, 8:13 a.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/mesos/VersionedMesosSchedulerImpl.java
> > Lines 125-129 (patched)
> > <https://reviews.apache.org/r/58053/diff/1/?file=1680496#file1680496line125>
> >
> > Do the Mesos docs say anything about simultaneous `SUBSCRIBE` calls?
> >
> > If the backoff time is still pretty low we might end up sending another
> > subscribe before we have received an answer for the previous one.

From what I understand, multiple subscriptions per framework are not allowed,
and subsequent subscribe attempts will fail if a connection was already
established. The underlying driver ignores those failures, so we should be fine.


> On March 30, 2017, 8:13 a.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/mesos/VersionedMesosSchedulerImpl.java
> > Lines 128-130 (original), 165-167 (patched)
> > <https://reviews.apache.org/r/58053/diff/1/?file=1680496#file1680496line165>
> >
> > You are unsetting `isSubscribed` in the `disconnected` handler. Doesn't
> > this imply we will never run the reregistration code here?

Good catch, fixed.


> On March 30, 2017, 8:13 a.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/mesos/VersionedMesosSchedulerImpl.java
> > Lines 137-138 (original), 174-175 (patched)
> > <https://reviews.apache.org/r/58053/diff/1/?file=1680496#file1680496line175>
> >
> > I am wondering why we need this here for `OFFERS` but not for
> > `RESCIND`, `INVERSE_OFFERS`, etc.

I put it in here for the same kind of errors as the unversioned driver.
Technically we could put it everywhere.
I'm not opposed if you think we should do it.


- Zameer


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58053/#review170580
-----------------------------------------------------------


On March 29, 2017, 4:52 p.m., Zameer Manji wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58053/
> -----------------------------------------------------------
>
> (Updated March 29, 2017, 4:52 p.m.)
>
>
> Review request for Aurora and Stephan Erb.
>
>
> Bugs: AURORA-1911
>     https://issues.apache.org/jira/browse/AURORA-1911
>
>
> Repository: aurora
>
>
> Description
> -------
>
> As noted in AURORA-1911, the `V1Mesos` driver doesn't retry `SUBSCRIBE` calls
> if they fail. This means that after a leader subscribes and disconnects, it
> is possible for it to never resubscribe if the Mesos Master is unhealthy.
>
> To fix this, I have moved the subscription into the dedicated
> `SchedulerExecutor`, and it continues to attempt to subscribe using truncated
> binary backoff. It only stops if we are disconnected or if we successfully
> connect.
>
>
> Diffs
> -----
>
>   src/jmh/java/org/apache/aurora/benchmark/StatusUpdateBenchmark.java 206b11458da2b0f938f0fcab5e5d3259a88ac9ee
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 5bf1e4e8c46044cb69b266cd203b5ec2f8b9ab61
>   src/main/java/org/apache/aurora/scheduler/mesos/SchedulerDriverModule.java 10d4f1b515b91d85b283cb7c655275c22fb133f9
>   src/main/java/org/apache/aurora/scheduler/mesos/VersionedMesosSchedulerImpl.java 67d356ab66c926a3b56860b906a453d57d6b694d
>   src/test/java/org/apache/aurora/scheduler/mesos/VersionedMesosSchedulerImplTest.java 756d0d9e30a447f9fba75c1c60f2f2f3c610399b
>
>
> Diff: https://reviews.apache.org/r/58053/diff/1/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Zameer Manji
>
>
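
For illustration, the retry behaviour in the description above (keep sending
`SUBSCRIBE` with truncated binary backoff, and stop once we are disconnected or
subscribed) could be sketched roughly as follows. This is not the code from the
patch: the class `SubscribeBackoffSketch`, the methods `backoffMillis` and
`subscribeWithBackoff`, the flag names and the delay values are all
hypothetical, and a plain `ScheduledExecutorService` stands in for the
dedicated `SchedulerExecutor`.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch only; names and values are assumptions, not the patch.
class SubscribeBackoffSketch {

  // Truncated binary backoff: initialMs doubled once per attempt, capped at maxMs.
  static long backoffMillis(int attempt, long initialMs, long maxMs) {
    long delay = initialMs;
    for (int i = 0; i < attempt && delay < maxMs; i++) {
      delay *= 2;
    }
    return Math.min(delay, maxMs);
  }

  // Sends SUBSCRIBE, then reschedules itself after the backoff delay.
  // The loop ends when we are disconnected or already subscribed.
  static void subscribeWithBackoff(
      ScheduledExecutorService executor,
      Runnable sendSubscribe,
      AtomicBoolean connected,
      AtomicBoolean subscribed,
      int attempt,
      long initialMs,
      long maxMs) {
    if (!connected.get() || subscribed.get()) {
      return;
    }
    sendSubscribe.run();
    executor.schedule(
        () -> subscribeWithBackoff(
            executor, sendSubscribe, connected, subscribed, attempt + 1, initialMs, maxMs),
        backoffMillis(attempt, initialMs, maxMs),
        TimeUnit.MILLISECONDS);
  }

  public static void main(String[] args) throws Exception {
    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    AtomicBoolean connected = new AtomicBoolean(true);
    AtomicBoolean subscribed = new AtomicBoolean(false);
    AtomicInteger calls = new AtomicInteger();
    CountDownLatch done = new CountDownLatch(1);

    // Stand-in for the real call: pretend the third SUBSCRIBE succeeds.
    Runnable sendSubscribe = () -> {
      int n = calls.incrementAndGet();
      System.out.println("SUBSCRIBE attempt " + n);
      if (n == 3) {
        subscribed.set(true);
        done.countDown();
      }
    };

    subscribeWithBackoff(executor, sendSubscribe, connected, subscribed, 0, 10, 40);
    done.await();
    // Let the in-flight task finish rescheduling before shutting down.
    executor.submit(() -> {}).get();
    executor.shutdownNow();
  }
}
```

Checking both flags at the top of every attempt means a `disconnected` event or
a successful subscription ends the loop at the next scheduled run, without
having to cancel the pending task explicitly.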