thanks!

On Thu, Oct 26, 2017 at 9:51 PM, Bill Farner <wfar...@apache.org> wrote:
> FYI - i believe we stumbled on this issue, which is greatly exacerbated by f2755e1 <https://github.com/apache/aurora/commit/f2755e1cdd67f3c1516726c21d6e8f13059a5a01>. The good news is that we now have a good handle on the culprit! More details at https://issues.apache.org/jira/browse/AURORA-1953
>
> On Thu, Sep 28, 2017 at 2:14 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>
>> Hmm...it is a very busy cluster and 10 mins of logs will be voluminous. They contain some internal details which I cannot share publicly. If you suspect specific areas, I can try to get those logs and remove internal info.
>>
>> Re: code, we have a fork which is very close to master.
>>
>> On Wed, Sep 27, 2017 at 10:03 PM, Bill Farner <wfar...@apache.org> wrote:
>>
>>> What commit/release was this with? From the looks of the log contents, it's not master. I'd like to make sure i'm looking at the correct code.
>>>
>>> Are there more logs not being included? If so, can you share more complete logs? In particular, logs during the 10 minute delay would be particularly helpful.
>>>
>>> On Tue, Sep 26, 2017 at 11:51 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>
>>>> Updating subject...as it looks like leader election was fine but the registration ack did not make it to the SchedulerLifecycle code. Weird that an event would get lost like that.
>>>>
>>>> On Tue, Sep 26, 2017 at 4:21 PM, John Sirois <john.sir...@gmail.com> wrote:
>>>>
>>>>> On Tue, Sep 26, 2017 at 4:33 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>
>>>>>> John,
>>>>>> I was referring to the following log message...isn't that the right one?
>>>>>
>>>>> Aha - it is, apologies.
>>>>>
>>>>>> Sep 26 18:11:58 machine62 aurora-scheduler[24743]: I0926 18:11:58.795 [Thread-814, MesosCallbackHandler$MesosCallbackHandlerImpl:180] Registered with ID value: "4ca9aa06-3214-4d2c-a678-0832e2f84d17-0000"
>>>>>>
>>>>>> On Tue, Sep 26, 2017 at 3:30 PM, John Sirois <john.sir...@gmail.com> wrote:
>>>>>>
>>>>>>> > ... but it succeeded (see logs at the end) ...
>>>>>>>
>>>>>>> The underlying C++ code in the scheduler driver successfully connected to the leading master socket:
>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.231549 24868 scheduler.cpp:361] Connected with the master at http://10.163.25.45:5050/master/api/v1/scheduler
>>>>>>>
>>>>>>> This is not the same as a framework registration call being successfully executed against the newly connected master.
>>>>>>> You need to be careful about what you derive from the logs just based on a reading of the words. Generally you'll need to look carefully / grep source code to be sure you are mentally modelling the code flows correctly. It certainly gets tricky.
>>>>>>>
>>>>>>> On Tue, Sep 26, 2017 at 4:14 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>
>>>>>>>> Hmm...so it is indeed the Mesos registration, but it succeeded (see logs at the end). Looking at the code, I see that com.google.common.eventbus is used as the pub-sub mechanism to link the registration (first log message) to the registrationAcked flag...and this flag is not being set for 10 mins (?), otherwise the registration timeout handler would not print the second log message.
>>>>>>>>
>>>>>>>>     delayedActions.onRegistrationTimeout(
>>>>>>>>         () -> {
>>>>>>>>           if (!registrationAcked.get()) {
>>>>>>>>             LOG.error(
>>>>>>>>                 "Framework has not been registered within the tolerated delay.");
>>>>>>>>             stateMachine.transition(State.DEAD);
>>>>>>>>           }
>>>>>>>>         });
>>>>>>>>
>>>>>>>> Sep 26 18:11:58 machine62 aurora-scheduler[24743]: I0926 18:11:58.795 [Thread-814, MesosCallbackHandler$MesosCallbackHandlerImpl:180] Registered with ID value: "4ca9aa06-3214-4d2c-a678-0832e2f84d17-0000"
>>>>>>>>
>>>>>>>> ...
>>>>>>>>
>>>>>>>> ...
>>>>>>>>
>>>>>>>> Sep 26 18:21:37 machine62 aurora-scheduler[24743]: E0926 18:21:37.205 [Lifecycle-0, SchedulerLifecycle$4:235] Framework has not been registered within the tolerated delay.
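
To make the linkage Mohit describes concrete, here is a minimal, self-contained sketch of that pattern: a registration event posted on a Guava EventBus flips an AtomicBoolean that a delayed action later checks. The class and event names are hypothetical placeholders rather than Aurora's actual types; the point is only that if the post never happens, or goes to a different bus instance than the subscriber, registrationAcked stays false and the timeout above fires.

    import com.google.common.eventbus.EventBus;
    import com.google.common.eventbus.Subscribe;

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicBoolean;

    public final class RegistrationAckSketch {
      // Placeholder for the event published when the master acks framework registration.
      static final class DriverRegistered {}

      private final EventBus eventBus = new EventBus();
      private final AtomicBoolean registrationAcked = new AtomicBoolean(false);
      private final ScheduledExecutorService timer =
          Executors.newSingleThreadScheduledExecutor();

      void start() {
        // Subscribe this object; onRegistered() only sees events posted to this same bus.
        eventBus.register(this);

        // Analogue of the quoted onRegistrationTimeout delayed action:
        // if no ack arrived within 10 minutes, give up.
        timer.schedule(
            () -> {
              if (!registrationAcked.get()) {
                System.err.println(
                    "Framework has not been registered within the tolerated delay.");
                // The real scheduler transitions its state machine to DEAD here.
              }
            },
            10,
            TimeUnit.MINUTES);
      }

      // Analogue of the Mesos callback handler posting after it logs "Registered with ID ...".
      void handleMesosRegistered() {
        eventBus.post(new DriverRegistered());
      }

      // Sets the flag the timeout action checks.
      @Subscribe
      public void onRegistered(DriverRegistered event) {
        registrationAcked.set(true);
      }
    }

In the logs above the ack is clearly logged by the callback handler, yet the timeout still fires ten minutes later, which is consistent with Mohit's observation that the ack never reached the SchedulerLifecycle code.
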
>>>>>>>>
>>>>>>>> On Tue, Sep 26, 2017 at 2:34 PM, John Sirois <john.sir...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> On Tue, Sep 26, 2017 at 3:31 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>
>>>>>>>>>> Actually I wonder if the "register" here is for Aurora to register with Mesos as a framework...I was assuming it referred to registering with ZK as a leader.
>>>>>>>>>
>>>>>>>>> Bingo.
>>>>>>>>>
>>>>>>>>>> On Tue, Sep 26, 2017 at 2:27 PM, Renan DelValle Rueda <renanidelva...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I've had this, or something very similar, happen before. It's an issue with Aurora and ZK. Election is based upon ZK, so if writing down who the leader is to the ZK server path fails, or if ZK is unable to reach quorum on the write, the election will fail. Sometimes this might manifest itself in weird ways, such as two Aurora schedulers believing they are leaders. If you could tell us a little bit about your ZK setup we might be able to narrow down the issue. Also, the Aurora version and whether you are using Curator or the commons library will help as well.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Sep 26, 2017 at 2:02 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hmm..it seems machine62 became a leader but could not "register" as leader. Not sure what that means. My naive assumption is that "becoming leader" and "registering as leader" are "atomic".
>>>>>>>>>>>>
>>>>>>>>>>>> ------- grep on SchedulerLifecycle -----
>>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]: I0926 18:11:33.158 [LeaderSelector-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition STORAGE_PREPARED -> LEADER_AWAITING_REGISTRATION
>>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:11:33 machine62 aurora-scheduler[24743]: I0926 18:11:33.159 [LeaderSelector-0, SchedulerLifecycle$4:224] Elected as leading scheduler!
>>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.204 [LeaderSelector-0, SchedulerLifecycle$DefaultDelayedActions:163] Giving up on registration in (10, mins)
>>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: E0926 18:21:37.205 [Lifecycle-0, SchedulerLifecycle$4:235] Framework has not been registered within the tolerated delay.
>>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.205 [Lifecycle-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition LEADER_AWAITING_REGISTRATION -> DEAD
>>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition DEAD -> DEAD
>>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62 aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0, SchedulerLifecycle$6:275] Shutdown already invoked, ignoring extra call.
>>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:22:05 machine62 aurora-scheduler[54502]: I0926 18:22:05.681 [main, StateMachine$Builder:389] SchedulerLifecycle state machine transition IDLE -> PREPARING_STORAGE
>>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:22:06 machine62 aurora-scheduler[54502]: I0926 18:22:06.396 [main, StateMachine$Builder:389] SchedulerLifecycle state machine transition PREPARING_STORAGE -> STORAGE_PREPARED
>>>>>>>>>>>>
>>>>>>>>>>>> ------ connecting to mesos -----
>>>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211750 24871 group.cpp:757] Found non-sequence node 'log_replicas' at '/mesos' in ZooKeeper
>>>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211817 24871 detector.cpp:152] Detected a new leader: (id='1506')
>>>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.211917 24871 group.cpp:699] Trying to get '/mesos/json.info_0000001506' in ZooKeeper
>>>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.216063 24871 zookeeper.cpp:262] A new leading master (UPID=master@10.163.25.45:5050) is detected
>>>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.216162 24871 scheduler.cpp:470] New master detected at master@10.163.25.45:5050
>>>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.217772 24871 scheduler.cpp:479] Waiting for 12.81503ms before initiating a re-(connection) attempt with the master
>>>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.231549 24868 scheduler.cpp:361] Connected with the master at http://10.163.25.45:5050/master/api/v1/scheduler
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Sep 26, 2017 at 1:24 PM, Bill Farner <wfar...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Is there a reason a non-leading scheduler will talk to Mesos
>>>>>>>>>>>>>
>>>>>>>>>>>>> No, there is not a legitimate reason. Did this occur for an extended period of time? Do you have logs from the scheduler indicating that it lost ZK leadership and subsequently interacted with Mesos?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Sep 26, 2017 at 1:02 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Fellows,
>>>>>>>>>>>>>> While examining Aurora log files, I noticed a condition where a scheduler was talking to Mesos but it was not showing up as a leader in ZooKeeper. It ultimately restarted itself and another scheduler became the leader.
>>>>>>>>>>>>>> Is there a reason a non-leading scheduler will talk to Mesos?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Mohit.
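
A closing note on the election mechanics Renan raises above: leader election is ZK-backed, via either Apache Curator or the older commons library depending on the Aurora version, and holding ZK leadership is a separate step from registering with Mesos as a framework, which is why the scheduler sits in LEADER_AWAITING_REGISTRATION after "Elected as leading scheduler!". A rough Curator-based sketch of the election half follows; the connect string, ZK path, and class name are placeholders, not Aurora's actual implementation.

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.leader.LeaderSelector;
    import org.apache.curator.framework.recipes.leader.LeaderSelectorListenerAdapter;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    import java.util.concurrent.CountDownLatch;

    public final class LeaderElectionSketch {
      public static void main(String[] args) throws Exception {
        // Placeholder ZK ensemble; quorum must be reachable for the election write to succeed.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
            "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        CountDownLatch forever = new CountDownLatch(1);

        LeaderSelector selector = new LeaderSelector(
            client,
            "/example/scheduler/leader",   // placeholder election path
            new LeaderSelectorListenerAdapter() {
              @Override
              public void takeLeadership(CuratorFramework curator) throws Exception {
                // Leadership is held only while this method is executing.
                // A scheduler would advertise itself and then begin framework
                // registration with Mesos here; a lost ZK session interrupts
                // this method and revokes leadership.
                System.out.println("Elected as leading scheduler!");
                forever.await();
              }
            });

        // Re-enter the election if leadership is ever relinquished.
        selector.autoRequeue();
        selector.start();

        forever.await();
      }
    }

Losing the ZK session after election, or the ack event never reaching SchedulerLifecycle as discussed earlier, both end the same way in the logs: the state machine gives up and transitions to DEAD.
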