Vinod Kone created MESOS-299:
--------------------------------

             Summary: Master detector doesn't notify about leading master after network disconnection
                 Key: MESOS-299
                 URL: https://issues.apache.org/jira/browse/MESOS-299
             Project: Mesos
          Issue Type: Bug
            Reporter: Vinod Kone
            Assignee: Benjamin Hindman


This occurred during a rack switch upgrade event at Twitter.

The slave lost connectivity with the leading master. When the network switch 
came back up, the slave was not notified of the leading master, so it did not 
re-register. In the log below it re-registers only at 18:35, after a 
ZOO_CHILD_EVENT on the masters path.

{code}
I1025 17:05:23.435269 33693 detector.cpp:389] Master detector lost connection 
to ZooKeeper, attempting to reconnect ...
2012-10-25 
17:05:30,105:33681(0x49807940):ZOO_ERROR@handle_socket_error_msg@1528: Socket 
[10.35.96.123:2181] zk retcode=-7, errno=110(Connection timed out): connection 
timed out (exceeded timeout by 5ms)
I1025 17:05:33.011539 33698 http.cpp:156] HTTP request for 
'/slave(1)/stats.json'
W1025 17:05:33.436805 33686 detector.cpp:450] Timed out waiting to reconnect to 
ZooKeeper (sessionId=13969feb5654992)
I1025 17:05:33.436957 33686 slave.cpp:362] Lost master(s) ... waiting
.......
.......
I1025 17:07:23.214442 33684 http.cpp:156] HTTP request for 
'/slave(1)/stats.json'
2012-10-25 17:07:23,249:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got 
ping response in 0 ms
2012-10-25 17:07:26,586:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got 
ping response in 0 ms
2012-10-25 17:07:29,920:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got 
ping response in 0 ms
I1025 17:07:33.233223 33698 http.cpp:156] HTTP request for 
'/slave(1)/stats.json'
2012-10-25 17:07:33,255:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got 
ping response in 0 ms
2012-10-25 17:07:36,592:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got 
ping response in 0 ms
2012-10-25 17:07:39,929:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got 
ping response in 0 ms
I1025 17:07:43.251569 33698 http.cpp:156] HTTP request for 
'/slave(1)/stats.json'
2012-10-25 17:07:43,265:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got 
ping response in 0 ms
2012-10-25 17:07:46,602:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got 
ping response in 0 ms
2012-10-25 17:07:49,939:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got 
ping response in 0 ms
I1025 17:07:53.270818 33691 http.cpp:156] HTTP request for 
'/slave(1)/stats.json'
2012-10-25 17:07:53,275:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got 
ping response in 0 ms
2012-10-25 17:07:56,612:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got 
ping response in 0 ms
2012-10-25 17:07:59,949:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got 
ping response in 0 ms
2012-10-25 17:08:03,286:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got 
ping response in 0 ms
I1025 17:08:03.293431 33687 http.cpp:156] HTTP request for 
'/slave(1)/stats.json'
2012-10-25 17:08:06,620:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got 
ping response in 0 ms
2012-10-25 17:08:09,956:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got 
ping response in 0 ms
2012-10-25 17:08:13,291:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got 
ping response in 0 ms
I1025 17:08:13.316841 33684 http.cpp:156] HTTP request for 
'/slave(1)/stats.json'
2012-10-25 17:08:16,628:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got 
ping response in 0 ms
2012-10-25 17:08:19,964:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got 
ping response in 0 ms
2012-10-25 17:08:23,300:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got 
ping response in 0 ms
I1025 17:08:23.337929 33695 http.cpp:156] HTTP request for 
'/slave(1)/stats.json'
2012-10-25 17:08:26,637:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got 
ping response in 0 ms
2012-10-25 17:08:29,974:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got 
ping response in 0 ms
2012-10-25 17:08:33,310:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got 
ping response in 0 ms
I1025 17:08:33.358109 33696 http.cpp:156] HTTP request for 
'/slave(1)/stats.json'
2012-10-25 17:08:36,646:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got 
ping response in 0 ms
2012-10-25 17:08:39,981:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got 
ping response in 0 ms
2012-10-25 17:08:43,317:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got 
ping response in 0 ms
I1025 17:08:43.377002 33685 http.cpp:156] HTTP request for 
'/slave(1)/stats.json'
W1025 17:08:45.004808 33696 slave.cpp:336] Ignoring shutdown message from 
[email protected]:5050because it is not from the registered master 
(@0.0.0.0:0)
....
....
2012-10-25 18:35:04,005:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1916: 
Processing WATCHER_EVENT
2012-10-25 18:35:04,005:33681(0x49807940):ZOO_DEBUG@process_completions@1765: 
Calling a watcher for node [/home/mesos/prod/master], type = -1 
event=ZOO_CHILD_EVENT
2012-10-25 18:35:04,005:33681(0x47002940):ZOO_DEBUG@zoo_awget_children_@2626: 
Sending request xid=0x50858793 for path [/home/mesos/prod/master] to 
10.35.98.111:2181
2012-10-25 18:35:04,011:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1989: 
Queueing asynchronous response
2012-10-25 18:35:04,011:33681(0x49807940):ZOO_DEBUG@process_completions@1795: 
Calling COMPLETION_STRINGLIST for xid=0x50858793 rc=0
I1025 18:35:04.012164 33696 detector.cpp:469] Master detector found 2 
registered masters
2012-10-25 18:35:04,012:33681(0x47002940):ZOO_DEBUG@zoo_awget@2414: Sending 
request xid=0x50858794 for path [/home/mesos/prod/master/0000001188] to 
10.35.98.111:2181
2012-10-25 18:35:04,017:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1989: 
Queueing asynchronous response
2012-10-25 18:35:04,017:33681(0x49807940):ZOO_DEBUG@process_completions@1772: 
Calling COMPLETION_DATA for xid=0x50858794 rc=0
I1025 18:35:04.017673 33696 detector.cpp:504] Master detector got new master 
pid: [email protected]:5050
I1025 18:35:04.017858 33696 slave.cpp:350] New master detected at 
[email protected]:5050
I1025 18:35:04.330201 33697 slave.cpp:407] Re-registered with master
I1025 18:35:04.330456 33697 slave.cpp:694] Updating framework 
201104070004-0000002563-0000 pid to scheduler(1)@10.34.231.115:41277
I1025 18:35:04.954231 33684 http.cpp:156] HTTP request for 
'/slave(1)/stats.json'
2012-10-25 18:35:05,715:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1916: 
Processing WATCHER_EVENT
2012-10-25 18:35:05,715:33681(0x49807940):ZOO_DEBUG@process_completions@1765: 
Calling a watcher for node [/home/mesos/prod/master], type = -1 
event=ZOO_CHILD_EVENT
2012-10-25 18:35:05,716:33681(0x41ff8940):ZOO_DEBUG@zoo_awget_children_@2626: 
Sending request xid=0x50858795 for path [/home/mesos/prod/master] to 
10.35.98.111:2181
2012-10-25 18:35:05,719:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1989: 
Queueing asynchronous response
2012-10-25 18:35:05,719:33681(0x49807940):ZOO_DEBUG@process_completions@1795: 
Calling COMPLETION_STRINGLIST for xid=0x50858795 rc=0
I1025 18:35:05.720186 33685 detector.cpp:469] Master detector found 3 
registered masters
{code}
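
One possible reading of the log above is that the detector reports the lost 
master on the reconnect timeout but does not look up the leader again once the 
ZooKeeper connection returns, so the slave waits until the set of masters 
changes. The sketch below is a minimal Python model of that reading only. It is 
not the Mesos detector code, and DetectorSketch, refetch_on_reconnect, replay 
and the "master@A" pid are all hypothetical names made up for illustration.

```python
from typing import Callable, List, Optional


class DetectorSketch:
    """Hypothetical stand-in for a master detector; not Mesos code."""

    def __init__(self, notify: Callable[[Optional[str]], None],
                 refetch_on_reconnect: bool) -> None:
        self.notify = notify
        self.refetch_on_reconnect = refetch_on_reconnect
        self.leader: Optional[str] = None

    def on_leader(self, leader: Optional[str]) -> None:
        # Tell the listener (the slave) only when the leader changes.
        if leader != self.leader:
            self.leader = leader
            self.notify(leader)

    def on_reconnect_timeout(self) -> None:
        # Mirrors "Timed out waiting to reconnect" then "Lost master(s)".
        self.on_leader(None)

    def on_reconnected(self, current_leader: str) -> None:
        # Assumed fix: look up the leader again as soon as we reconnect.
        # Without it, nothing is reported until the master set changes.
        if self.refetch_on_reconnect:
            self.on_leader(current_leader)


def replay(refetch_on_reconnect: bool) -> List[Optional[str]]:
    """Replay the outage and return what the slave was told, in order."""
    seen: List[Optional[str]] = []
    detector = DetectorSketch(seen.append, refetch_on_reconnect)
    detector.on_leader("master@A")        # initial detection
    detector.on_reconnect_timeout()       # network goes away
    detector.on_reconnected("master@A")   # network returns, same leader
    return seen


if __name__ == "__main__":
    print(replay(False))  # slave is left with no master
    print(replay(True))   # slave is told about the leader again
```

With refetch_on_reconnect=False the last notification the slave sees is None, 
which matches the "Lost master(s) ... waiting" state in the log. With it set to 
True the leader is reported again right after the reconnect.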
