Hi,

we have a Mesos cluster installation with 3 masters (0.21.0) and ZK (3.4.5), running Mesos, Spark, Chronos, Marathon and Storm 0.9.3. All nodes are running Ubuntu 14.04.
My problem is that I have to start MesosNimbus on the currently elected leader, otherwise MesosNimbus gets stuck. From the log I can see that it detects the currently leading master correctly, but then it gets stuck. When the leader changes to the node running Nimbus, it works again.

nimbus upstart.log:

I0119 12:20:03.289799 10728 detector.cpp:433] A new leading master (UPID= master@192.168.56.11:5050) is detected
I0119 12:20:03.290081 10733 sched.cpp:234] New master detected at master@192.168.56.11:5050
I0119 12:20:03.290592 10733 sched.cpp:242] No credentials provided. Attempting to register without authentication

nimbus.log:

2015-01-19T12:15:40.478+0100 o.m.log [DEBUG] started Server@20e1ceb3
2015-01-19T12:15:40.478+0100 s.m.MesosNimbus [INFO] Started serving config dir under http://192.168.56.10:49202/conf
2015-01-19T12:15:40.535+0100 s.m.MesosNimbus [INFO] Waiting for scheduler to initialize...

On the leading Mesos master I see the following log (repeated every second).

mesos.log:

I0119 12:40:53.208027 4957 master.cpp:1520] Received re-registration request from framework 20150119-114412-171485376-5050-6660-0002 (Storm 0.9.3) at scheduler-37d9a510-1136-4adb-be09-c9c2e388611f@127.0.1.1:52310
I0119 12:40:53.208860 4957 master.cpp:1573] Re-registering framework 20150119-114412-171485376-5050-6660-0002 (Storm 0.9.3) at scheduler-37d9a510-1136-4adb-be09-c9c2e388611f@127.0.1.1:52310
I0119 12:40:53.209205 4957 master.cpp:1602] Framework 20150119-114412-171485376-5050-6660-0002 (Storm 0.9.3) at scheduler-37d9a510-1136-4adb-be09-c9c2e388611f@127.0.1.1:52310 failed over
I0119 12:40:53.211552 4957 hierarchical_allocator_process.hpp:375] Activated framework 20150119-114412-171485376-5050-6660-0002
I0119 12:40:53.211932 4959 master.cpp:789] Framework 20150119-114412-171485376-5050-6660-0002 (Storm 0.9.3) at scheduler-37d9a510-1136-4adb-be09-c9c2e388611f@127.0.1.1:52310 disconnected
I0119 12:40:53.212004 4959 master.cpp:1752] Disconnecting framework 20150119-114412-171485376-5050-6660-0002 (Storm 0.9.3) at scheduler-37d9a510-1136-4adb-be09-c9c2e388611f@127.0.1.1:52310
I0119 12:40:53.212198 4959 master.cpp:1768] Deactivating framework 20150119-114412-171485376-5050-6660-0002 (Storm 0.9.3) at scheduler-37d9a510-1136-4adb-be09-c9c2e388611f@127.0.1.1:52310
I0119 12:40:53.212446 4959 master.cpp:811] Giving framework 20150119-114412-171485376-5050-6660-0002 (Storm 0.9.3) at scheduler-37d9a510-1136-4adb-be09-c9c2e388611f@127.0.1.1:52310 1hrs to failover
I0119 12:40:53.212550 4959 hierarchical_allocator_process.hpp:405] Deactivated framework 20150119-114412-171485376-5050-6660-0002
I0119 12:40:54.209858 4959 master.cpp:1520] Received re-registration request from framework 20150119-114412-171485376-5050-6660-0002 (Storm 0.9.3) at scheduler-37d9a510-1136-4adb-be09-c9c2e388611f@127.0.1.1:52310

Other frameworks work fine and correctly handle a leading master on another node.

From a brief look at the source code, it hangs in https://github.com/mesos/storm/blob/master/src/storm/mesos/MesosNimbus.java at line 153 when trying to acquire a semaphore.

Thank you for your great job,
Ondrej Smola