thanks!

On Thu, Oct 26, 2017 at 9:51 PM, Bill Farner <wfar...@apache.org> wrote:

> FYI - I believe we stumbled on this issue, which is greatly exacerbated by
> f2755e1
> <https://github.com/apache/aurora/commit/f2755e1cdd67f3c1516726c21d6e8f13059a5a01>.
> The good news is that we now have a good handle on the culprit!  More
> details at https://issues.apache.org/jira/browse/AURORA-1953
>
> On Thu, Sep 28, 2017 at 2:14 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>
>> Hmm...it is a very busy cluster and 10 mins of logs will be voluminous.
>> They contain some internal details which I cannot share publicly. If you
>> suspect specific areas, I can try to get those logs and remove internal
>> info.
>>
>> Re: code, we have a fork which is very close to master.
>>
>> On Wed, Sep 27, 2017 at 10:03 PM, Bill Farner <wfar...@apache.org> wrote:
>>
>>> What commit/release was this with?  From the looks of the log contents,
>>> it's not master.  I'd like to make sure I'm looking at the correct code.
>>>
>>> Are there more logs that were not included?  If so, can you share more
>>> complete logs?  Logs from during the 10-minute delay would be especially
>>> helpful.
>>>
>>> On Tue, Sep 26, 2017 at 11:51 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>> wrote:
>>>
>>>> Updating the subject, as it looks like leader election was fine but the
>>>> registration ack did not make it to the SchedulerLifecycle code. Weird that
>>>> an event would get lost like that.
>>>>
>>>> On Tue, Sep 26, 2017 at 4:21 PM, John Sirois <john.sir...@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Tue, Sep 26, 2017 at 4:33 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>>>> wrote:
>>>>>
>>>>>> John,
>>>>>> I was referring to the following log message...isn't that the right
>>>>>> one?
>>>>>>
>>>>>
>>>>> Aha - it is, apologies.
>>>>>
>>>>>
>>>>>> Sep 26 18:11:58 machine62 aurora-scheduler[24743]: I0926 18:11:58.795
>>>>>>  [Thread-814, MesosCallbackHandler$MesosCallbackHandlerImpl:180]
>>>>>> Registered with ID value: "4ca9aa06-3214-4d2c-a678-0832e2f84d17-0000"
>>>>>>
>>>>>> On Tue, Sep 26, 2017 at 3:30 PM, John Sirois <john.sir...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> > ... but it succeeded (see logs at the end) ...
>>>>>>>
>>>>>>> The underlying C++ code in the scheduler driver successfully
>>>>>>> connected to the leading master socket:
>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>> 18:11:37.231549 24868 scheduler.cpp:361] Connected with the master
>>>>>>> at http://10.163.25.45:5050/master/api/v1/scheduler
>>>>>>>
>>>>>>> This is not the same as a framework registration call being successfully
>>>>>>> executed against the newly connected master.
>>>>>>> You need to be careful about what you derive from the logs based on a
>>>>>>> reading of the words alone. Generally you'll need to look carefully at /
>>>>>>> grep the source code to be sure you are mentally modelling the code flows
>>>>>>> correctly. It certainly gets tricky.
>>>>>>>
>>>>>>> On Tue, Sep 26, 2017 at 4:14 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hmm... so it is indeed the Mesos registration, but it succeeded (see
>>>>>>>> logs at the end). Looking at the code, I see that
>>>>>>>> com.google.common.eventbus is used as a pub-sub mechanism to link the
>>>>>>>> registration (first log message) to the registrationAcked flag... and
>>>>>>>> that flag was not set for 10 minutes (?), otherwise the registration
>>>>>>>> timeout handler would not have printed the second log message.
>>>>>>>>
>>>>>>>> delayedActions.onRegistrationTimeout(
>>>>>>>>     () -> {
>>>>>>>>       if (!registrationAcked.get()) {
>>>>>>>>         LOG.error(
>>>>>>>>             "Framework has not been registered within the tolerated delay.");
>>>>>>>>         stateMachine.transition(State.DEAD);
>>>>>>>>       }
>>>>>>>>     });
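>>>>>>>>
>>>>>>>> To make that wiring concrete, here is a minimal, self-contained sketch of
>>>>>>>> the Guava EventBus pattern I mean. The RegisteredEvent class and field
>>>>>>>> names are hypothetical, just to illustrate how an ack event would flip the
>>>>>>>> flag that the timeout handler checks; it is not the actual Aurora code.
>>>>>>>>
>>>>>>>> import com.google.common.eventbus.EventBus;
>>>>>>>> import com.google.common.eventbus.Subscribe;
>>>>>>>> import java.util.concurrent.atomic.AtomicBoolean;
>>>>>>>>
>>>>>>>> // Hypothetical sketch of the pub-sub pattern described above; the event
>>>>>>>> // class and handler are illustrative, not the actual Aurora types.
>>>>>>>> final class RegistrationAckSketch {
>>>>>>>>   // Event a Mesos callback handler would post once the master acks registration.
>>>>>>>>   static final class RegisteredEvent { }
>>>>>>>>
>>>>>>>>   final AtomicBoolean registrationAcked = new AtomicBoolean(false);
>>>>>>>>
>>>>>>>>   // EventBus delivers every posted RegisteredEvent to this method.
>>>>>>>>   @Subscribe
>>>>>>>>   public void onRegistered(RegisteredEvent event) {
>>>>>>>>     registrationAcked.set(true);
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   public static void main(String[] args) {
>>>>>>>>     EventBus eventBus = new EventBus();
>>>>>>>>     RegistrationAckSketch handler = new RegistrationAckSketch();
>>>>>>>>     eventBus.register(handler);
>>>>>>>>
>>>>>>>>     // Simulate the "Registered with ID ..." callback publishing the event.
>>>>>>>>     eventBus.post(new RegisteredEvent());
>>>>>>>>
>>>>>>>>     // If this post were lost or never delivered, the timeout handler above
>>>>>>>>     // would still see registrationAcked == false and transition to DEAD.
>>>>>>>>     System.out.println("registrationAcked = " + handler.registrationAcked.get());
>>>>>>>>   }
>>>>>>>> }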
>>>>>>>>
>>>>>>>> Sep 26 18:11:58 machine62 aurora-scheduler[24743]: I0926 18:11:58.795
>>>>>>>> [Thread-814, MesosCallbackHandler$MesosCallbackHandlerImpl:180]
>>>>>>>> Registered with ID value: "4ca9aa06-3214-4d2c-a678-0832e2f84d17-0000"
>>>>>>>>
>>>>>>>> ...
>>>>>>>>
>>>>>>>> ...
>>>>>>>>
>>>>>>>> Sep 26 18:21:37 machine62 aurora-scheduler[24743]: E0926 18:21:37.205
>>>>>>>> [Lifecycle-0, SchedulerLifecycle$4:235] Framework has not been registered
>>>>>>>> within the tolerated delay.
>>>>>>>>
>>>>>>>> On Tue, Sep 26, 2017 at 2:34 PM, John Sirois <john.sir...@gmail.com
>>>>>>>> > wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Sep 26, 2017 at 3:31 PM, Mohit Jaggi <mohit.ja...@uber.com
>>>>>>>>> > wrote:
>>>>>>>>>
>>>>>>>>>> Actually I wonder if the "register" here is for Aurora to
>>>>>>>>>> register with Mesos as a framework... I was assuming it referred to
>>>>>>>>>> registering with ZK as a leader.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Bingo.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 26, 2017 at 2:27 PM, Renan DelValle Rueda <
>>>>>>>>>> renanidelva...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I've had this, or something very similar, happen before. It's an
>>>>>>>>>>> issue with Aurora and ZK. Election is based on ZK, so if writing down who
>>>>>>>>>>> the leader is to the ZK server path fails, or if ZK is unable to reach
>>>>>>>>>>> quorum on the write, the election will fail. Sometimes this manifests
>>>>>>>>>>> itself in weird ways, such as two Aurora schedulers believing they are
>>>>>>>>>>> leaders. If you could tell us a little bit about your ZK setup, we might be
>>>>>>>>>>> able to narrow down the issue. Also, the Aurora version and whether you are
>>>>>>>>>>> using Curator or the commons library would help as well.
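>>>>>>>>>>>
>>>>>>>>>>> For reference, if you are on Curator, leader election boils down to
>>>>>>>>>>> something like the LeaderLatch sketch below. The connect string, latch
>>>>>>>>>>> path, and participant id are made up for illustration; the point is that
>>>>>>>>>>> leadership is only reported once the ephemeral znode write to ZK succeeds.
>>>>>>>>>>>
>>>>>>>>>>> import org.apache.curator.framework.CuratorFramework;
>>>>>>>>>>> import org.apache.curator.framework.CuratorFrameworkFactory;
>>>>>>>>>>> import org.apache.curator.framework.recipes.leader.LeaderLatch;
>>>>>>>>>>> import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
>>>>>>>>>>> import org.apache.curator.retry.ExponentialBackoffRetry;
>>>>>>>>>>>
>>>>>>>>>>> // Illustrative sketch only; ensemble and path names are hypothetical.
>>>>>>>>>>> public final class LeaderLatchSketch {
>>>>>>>>>>>   public static void main(String[] args) throws Exception {
>>>>>>>>>>>     CuratorFramework client = CuratorFrameworkFactory.newClient(
>>>>>>>>>>>         "zk1:2181,zk2:2181,zk3:2181",            // hypothetical ZK ensemble
>>>>>>>>>>>         new ExponentialBackoffRetry(1000, 3));
>>>>>>>>>>>     client.start();
>>>>>>>>>>>
>>>>>>>>>>>     // The latch creates an ephemeral sequential znode under this path; the
>>>>>>>>>>>     // participant owning the lowest sequence number is the leader.
>>>>>>>>>>>     LeaderLatch latch = new LeaderLatch(client, "/aurora/scheduler", "machine62");
>>>>>>>>>>>     latch.addListener(new LeaderLatchListener() {
>>>>>>>>>>>       @Override public void isLeader() {
>>>>>>>>>>>         // Reached only after the znode write succeeded against quorum.
>>>>>>>>>>>         System.out.println("Elected as leading scheduler");
>>>>>>>>>>>       }
>>>>>>>>>>>       @Override public void notLeader() {
>>>>>>>>>>>         // Losing the ZK session (e.g. quorum loss) lands here; a scheduler
>>>>>>>>>>>         // still talking to Mesos after this point is misbehaving.
>>>>>>>>>>>         System.out.println("Lost leadership");
>>>>>>>>>>>       }
>>>>>>>>>>>     });
>>>>>>>>>>>     latch.start();
>>>>>>>>>>>     latch.await();  // block until this instance becomes leader
>>>>>>>>>>>   }
>>>>>>>>>>> }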
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Sep 26, 2017 at 2:02 PM, Mohit Jaggi <
>>>>>>>>>>> mohit.ja...@uber.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hmm..it seems machine62 became a leader but could not
>>>>>>>>>>>> "register" as leader. Not sure what that means. My naive 
>>>>>>>>>>>> assumption is that
>>>>>>>>>>>> "becoming leader" and "registering as leader" are "atomic".
>>>>>>>>>>>>
>>>>>>>>>>>> ------- grep on SchedulerLifecycle -----
>>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:11:33 machine62
>>>>>>>>>>>> aurora-scheduler[24743]: I0926 18:11:33.158 [LeaderSelector-0,
>>>>>>>>>>>> StateMachine$Builder:389] SchedulerLifecycle state machine 
>>>>>>>>>>>> transition
>>>>>>>>>>>> STORAGE_PREPARED -> LEADER_AWAITING_REGISTRATION
>>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:11:33 machine62
>>>>>>>>>>>> aurora-scheduler[24743]: I0926 18:11:33.159 [LeaderSelector-0,
>>>>>>>>>>>> SchedulerLifecycle$4:224] Elected as leading scheduler!
>>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:11:37 machine62
>>>>>>>>>>>> aurora-scheduler[24743]: I0926 18:11:37.204 [LeaderSelector-0,
>>>>>>>>>>>> SchedulerLifecycle$DefaultDelayedActions:163] Giving up on
>>>>>>>>>>>> registration in (10, mins)
>>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62
>>>>>>>>>>>> aurora-scheduler[24743]: E0926 18:21:37.205 [Lifecycle-0,
>>>>>>>>>>>> SchedulerLifecycle$4:235] Framework has not been registered within 
>>>>>>>>>>>> the
>>>>>>>>>>>> tolerated delay.
>>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62
>>>>>>>>>>>> aurora-scheduler[24743]: I0926 18:21:37.205 [Lifecycle-0,
>>>>>>>>>>>> StateMachine$Builder:389] SchedulerLifecycle state machine 
>>>>>>>>>>>> transition
>>>>>>>>>>>> LEADER_AWAITING_REGISTRATION -> DEAD
>>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62
>>>>>>>>>>>> aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0,
>>>>>>>>>>>> StateMachine$Builder:389] SchedulerLifecycle state machine 
>>>>>>>>>>>> transition DEAD
>>>>>>>>>>>> -> DEAD
>>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62
>>>>>>>>>>>> aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0,
>>>>>>>>>>>> SchedulerLifecycle$6:275] Shutdown already invoked, ignoring extra 
>>>>>>>>>>>> call.
>>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:22:05 machine62
>>>>>>>>>>>> aurora-scheduler[54502]: I0926 18:22:05.681 [main,
>>>>>>>>>>>> StateMachine$Builder:389] SchedulerLifecycle state machine 
>>>>>>>>>>>> transition IDLE
>>>>>>>>>>>> -> PREPARING_STORAGE
>>>>>>>>>>>> aurora-scheduler.log:Sep 26 18:22:06 machine62
>>>>>>>>>>>> aurora-scheduler[54502]: I0926 18:22:06.396 [main,
>>>>>>>>>>>> StateMachine$Builder:389] SchedulerLifecycle state machine 
>>>>>>>>>>>> transition
>>>>>>>>>>>> PREPARING_STORAGE -> STORAGE_PREPARED
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> ------ connecting to mesos -----
>>>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>>>>>> 18:11:37.211750 24871 group.cpp:757] Found non-sequence node 
>>>>>>>>>>>> 'log_replicas'
>>>>>>>>>>>> at '/mesos' in ZooKeeper
>>>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>>>>>> 18:11:37.211817 24871 detector.cpp:152] Detected a new leader: 
>>>>>>>>>>>> (id='1506')
>>>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>>>>>> 18:11:37.211917 24871 group.cpp:699] Trying to get
>>>>>>>>>>>> '/mesos/json.info_0000001506' in ZooKeeper
>>>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>>>>>> 18:11:37.216063 24871 zookeeper.cpp:262] A new leading master 
>>>>>>>>>>>> (UPID=
>>>>>>>>>>>> master@10.163.25.45:5050) is detected
>>>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>>>>>> 18:11:37.216162 24871 scheduler.cpp:470] New master detected at
>>>>>>>>>>>> master@10.163.25.45:5050
>>>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>>>>>> 18:11:37.217772 24871 scheduler.cpp:479] Waiting for 12.81503ms 
>>>>>>>>>>>> before
>>>>>>>>>>>> initiating a re-(connection) attempt with the master
>>>>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>>>>>> 18:11:37.231549 24868 scheduler.cpp:361] Connected with the master 
>>>>>>>>>>>> at
>>>>>>>>>>>> http://10.163.25.45:5050/master/api/v1/scheduler
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Sep 26, 2017 at 1:24 PM, Bill Farner <
>>>>>>>>>>>> wfar...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Is there a reason a non-leading scheduler will talk to Mesos
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> No, there is not a legitimate reason.  Did this occur for an
>>>>>>>>>>>>> extended period of time?  Do you have logs from the scheduler 
>>>>>>>>>>>>> indicating
>>>>>>>>>>>>> that it lost ZK leadership and subsequently interacted with Mesos?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Sep 26, 2017 at 1:02 PM, Mohit Jaggi <
>>>>>>>>>>>>> mohit.ja...@uber.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Fellows,
>>>>>>>>>>>>>> While examining Aurora log files, I noticed a condition where
>>>>>>>>>>>>>> a scheduler was talking to Mesos but it was not showing up as a 
>>>>>>>>>>>>>> leader in
>>>>>>>>>>>>>> ZooKeeper. It ultimately restarted itself and another scheduler became the
>>>>>>>>>>>>>> became the
>>>>>>>>>>>>>> leader.
>>>>>>>>>>>>>> Is there a reason a non-leading scheduler will talk to Mesos?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Mohit.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
