Vinod Kone created MESOS-299:
--------------------------------
Summary: Master detector doesn't notify about leading master after
network disconnection
Key: MESOS-299
URL: https://issues.apache.org/jira/browse/MESOS-299
Project: Mesos
Issue Type: Bug
Reporter: Vinod Kone
Assignee: Benjamin Hindman
This occurred during a rack switch upgrade at Twitter.
The slave lost connectivity with the leading master. When the network switch
came back up, the slave was not notified of the leading master and did not
register. In the log below it re-registers only at 18:35:04, after a
ZOO_CHILD_EVENT on the master path.
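The symptom can be restated as a toy model. This is only a sketch in Python,
not Mesos code and not a diagnosis: the class, its methods, and the
redetect_on_reconnect flag are all hypothetical names made up for illustration.
It contrasts a detector that stays silent after a reconnect timeout with one
that looks up the leader again once the connection returns.

```python
class ToyDetector:
    """Toy model of a master detector. Every name here is hypothetical."""

    def __init__(self, redetect_on_reconnect: bool):
        self.redetect_on_reconnect = redetect_on_reconnect
        self.timed_out = False
        self.notifications = []  # what the slave is told, in order

    def on_reconnect_timeout(self):
        # Models "Timed out waiting to reconnect" followed by "Lost master(s)".
        self.timed_out = True
        self.notifications.append("no master")

    def on_reconnected(self, leader: str):
        # Expected behavior: after a timeout, detect the leader again.
        if self.timed_out and self.redetect_on_reconnect:
            self._notify(leader)

    def on_children_changed(self, leader: str):
        # A change in the master group always triggers detection.
        self._notify(leader)

    def _notify(self, leader: str):
        self.timed_out = False
        self.notifications.append("new master: " + leader)


LEADER = "master@example:5050"  # placeholder, not an address from the log

for redetect in (False, True):
    detector = ToyDetector(redetect_on_reconnect=redetect)
    detector.on_reconnect_timeout()
    detector.on_reconnected(LEADER)
    print(redetect, detector.notifications)
```

With the flag off, the model stays at "no master" until on_children_changed
is called, which mirrors what the log shows at 18:35:04.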
{code}
I1025 17:05:23.435269 33693 detector.cpp:389] Master detector lost connection
to ZooKeeper, attempting to reconnect ...
2012-10-25
17:05:30,105:33681(0x49807940):ZOO_ERROR@handle_socket_error_msg@1528: Socket
[10.35.96.123:2181] zk retcode=-7, errno=110(Connection timed out): connection
timed out (exceeded timeout by 5ms)
I1025 17:05:33.011539 33698 http.cpp:156] HTTP request for
'/slave(1)/stats.json'
W1025 17:05:33.436805 33686 detector.cpp:450] Timed out waiting to reconnect to
ZooKeeper (sessionId=13969feb5654992)
I1025 17:05:33.436957 33686 slave.cpp:362] Lost master(s) ... waiting
.......
.......
I1025 17:07:23.214442 33684 http.cpp:156] HTTP request for
'/slave(1)/stats.json'
2012-10-25 17:07:23,249:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got
ping response in 0 ms
2012-10-25 17:07:26,586:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got
ping response in 0 ms
2012-10-25 17:07:29,920:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got
ping response in 0 ms
I1025 17:07:33.233223 33698 http.cpp:156] HTTP request for
'/slave(1)/stats.json'
2012-10-25 17:07:33,255:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got
ping response in 0 ms
2012-10-25 17:07:36,592:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got
ping response in 0 ms
2012-10-25 17:07:39,929:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got
ping response in 0 ms
I1025 17:07:43.251569 33698 http.cpp:156] HTTP request for
'/slave(1)/stats.json'
2012-10-25 17:07:43,265:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got
ping response in 0 ms
2012-10-25 17:07:46,602:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got
ping response in 0 ms
2012-10-25 17:07:49,939:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got
ping response in 0 ms
I1025 17:07:53.270818 33691 http.cpp:156] HTTP request for
'/slave(1)/stats.json'
2012-10-25 17:07:53,275:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got
ping response in 0 ms
2012-10-25 17:07:56,612:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got
ping response in 0 ms
2012-10-25 17:07:59,949:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got
ping response in 0 ms
2012-10-25 17:08:03,286:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got
ping response in 0 ms
I1025 17:08:03.293431 33687 http.cpp:156] HTTP request for
'/slave(1)/stats.json'
2012-10-25 17:08:06,620:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got
ping response in 0 ms
2012-10-25 17:08:09,956:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got
ping response in 0 ms
2012-10-25 17:08:13,291:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got
ping response in 0 ms
I1025 17:08:13.316841 33684 http.cpp:156] HTTP request for
'/slave(1)/stats.json'
2012-10-25 17:08:16,628:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got
ping response in 0 ms
2012-10-25 17:08:19,964:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got
ping response in 0 ms
2012-10-25 17:08:23,300:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got
ping response in 0 ms
I1025 17:08:23.337929 33695 http.cpp:156] HTTP request for
'/slave(1)/stats.json'
2012-10-25 17:08:26,637:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got
ping response in 0 ms
2012-10-25 17:08:29,974:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got
ping response in 0 ms
2012-10-25 17:08:33,310:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got
ping response in 0 ms
I1025 17:08:33.358109 33696 http.cpp:156] HTTP request for
'/slave(1)/stats.json'
2012-10-25 17:08:36,646:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got
ping response in 0 ms
2012-10-25 17:08:39,981:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got
ping response in 0 ms
2012-10-25 17:08:43,317:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got
ping response in 0 ms
I1025 17:08:43.377002 33685 http.cpp:156] HTTP request for
'/slave(1)/stats.json'
W1025 17:08:45.004808 33696 slave.cpp:336] Ignoring shutdown message from
[email protected]:5050because it is not from the registered master
(@0.0.0.0:0)
....
....
2012-10-25 18:35:04,005:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1916:
Processing WATCHER_EVENT
2012-10-25 18:35:04,005:33681(0x49807940):ZOO_DEBUG@process_completions@1765:
Calling a watcher for node [/home/mesos/prod/master], type = -1
event=ZOO_CHILD_EVENT
2012-10-25 18:35:04,005:33681(0x47002940):ZOO_DEBUG@zoo_awget_children_@2626:
Sending request xid=0x50858793 for path [/home/mesos/prod/master] to
10.35.98.111:2181
2012-10-25 18:35:04,011:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1989:
Queueing asynchronous response
2012-10-25 18:35:04,011:33681(0x49807940):ZOO_DEBUG@process_completions@1795:
Calling COMPLETION_STRINGLIST for xid=0x50858793 rc=0
I1025 18:35:04.012164 33696 detector.cpp:469] Master detector found 2
registered masters
2012-10-25 18:35:04,012:33681(0x47002940):ZOO_DEBUG@zoo_awget@2414: Sending
request xid=0x50858794 for path [/home/mesos/prod/master/0000001188] to
10.35.98.111:2181
2012-10-25 18:35:04,017:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1989:
Queueing asynchronous response
2012-10-25 18:35:04,017:33681(0x49807940):ZOO_DEBUG@process_completions@1772:
Calling COMPLETION_DATA for xid=0x50858794 rc=0
I1025 18:35:04.017673 33696 detector.cpp:504] Master detector got new master
pid: [email protected]:5050
I1025 18:35:04.017858 33696 slave.cpp:350] New master detected at
[email protected]:5050
I1025 18:35:04.330201 33697 slave.cpp:407] Re-registered with master
I1025 18:35:04.330456 33697 slave.cpp:694] Updating framework
201104070004-0000002563-0000 pid to scheduler(1)@10.34.231.115:41277
I1025 18:35:04.954231 33684 http.cpp:156] HTTP request for
'/slave(1)/stats.json'
2012-10-25 18:35:05,715:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1916:
Processing WATCHER_EVENT
2012-10-25 18:35:05,715:33681(0x49807940):ZOO_DEBUG@process_completions@1765:
Calling a watcher for node [/home/mesos/prod/master], type = -1
event=ZOO_CHILD_EVENT
2012-10-25 18:35:05,716:33681(0x41ff8940):ZOO_DEBUG@zoo_awget_children_@2626:
Sending request xid=0x50858795 for path [/home/mesos/prod/master] to
10.35.98.111:2181
2012-10-25 18:35:05,719:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1989:
Queueing asynchronous response
2012-10-25 18:35:05,719:33681(0x49807940):ZOO_DEBUG@process_completions@1795:
Calling COMPLETION_STRINGLIST for xid=0x50858795 rc=0
I1025 18:35:05.720186 33685 detector.cpp:469] Master detector found 3
registered masters
{code}
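To put a number on how long the slave went without a master, the two slave.cpp
timestamps in the log above can be subtracted. The following is a minimal
Python sketch: parse_glog_time is a hypothetical helper written for this note,
and the year 2012 is an assumption taken from the ZooKeeper lines, since the
glog prefix (I1025 ...) carries only month and day.

```python
from datetime import datetime, timedelta


def parse_glog_time(line: str, year: int = 2012) -> datetime:
    """Parse the timestamp of a glog line such as 'I1025 17:05:33.436957 ...'.

    The year is not part of the glog prefix, so it must be supplied.
    """
    tag, clock = line.split()[:2]
    month, day = int(tag[1:3]), int(tag[3:5])
    parsed = datetime.strptime(clock, "%H:%M:%S.%f")
    return parsed.replace(year=year, month=month, day=day)


# Lines copied from the log above (the wrapped line is joined, and the second
# line is cut before the master address).
lost = parse_glog_time(
    "I1025 17:05:33.436957 33686 slave.cpp:362] Lost master(s) ... waiting")
found = parse_glog_time(
    "I1025 18:35:04.017858 33696 slave.cpp:350] New master detected at")

gap = found - lost
print(gap)  # 1:29:30.580901
```

By this reading the slave had no master for roughly an hour and a half, even
though ZooKeeper ping responses appear in the log again by 17:07:23.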