Hi,

We have a Mesos cluster installation: 3 masters (Mesos 0.21.0) and ZK (3.4.5),
running Spark, Chronos, Marathon and Storm 0.9.3. All nodes run Ubuntu
14.04.

My problem is that I have to start MesosNimbus on the currently elected
leader; otherwise MesosNimbus gets stuck. From the log I can see that it
detects the current leading master correctly, but then it gets stuck. When
the leader changes to the node running Nimbus, it works again.

nimbus upstart.log

I0119 12:20:03.289799 10728 detector.cpp:433] A new leading master (UPID=
master@192.168.56.11:5050) is detected
I0119 12:20:03.290081 10733 sched.cpp:234] New master detected at
master@192.168.56.11:5050
I0119 12:20:03.290592 10733 sched.cpp:242] No credentials provided.
Attempting to register without authentication

nimbus.log

2015-01-19T12:15:40.478+0100 o.m.log [DEBUG] started Server@20e1ceb3
2015-01-19T12:15:40.478+0100 s.m.MesosNimbus [INFO] Started serving config
dir under http://192.168.56.10:49202/conf
2015-01-19T12:15:40.535+0100 s.m.MesosNimbus [INFO] Waiting for scheduler
to initialize...

On the leading Mesos master I see the following log (repeated every second):

mesos.log

I0119 12:40:53.208027  4957 master.cpp:1520] Received re-registration
request from framework 20150119-114412-171485376-5050-6660-0002 (Storm
0.9.3) at scheduler-37d9a510-1136-4adb-be09-c9c2e388611f@127.0.1.1:52310
I0119 12:40:53.208860  4957 master.cpp:1573] Re-registering framework
20150119-114412-171485376-5050-6660-0002 (Storm 0.9.3)  at
scheduler-37d9a510-1136-4adb-be09-c9c2e388611f@127.0.1.1:52310
I0119 12:40:53.209205  4957 master.cpp:1602] Framework
20150119-114412-171485376-5050-6660-0002 (Storm 0.9.3) at
scheduler-37d9a510-1136-4adb-be09-c9c2e388611f@127.0.1.1:52310 failed over
I0119 12:40:53.211552  4957 hierarchical_allocator_process.hpp:375]
Activated framework 20150119-114412-171485376-5050-6660-0002
I0119 12:40:53.211932  4959 master.cpp:789] Framework
20150119-114412-171485376-5050-6660-0002 (Storm 0.9.3) at
scheduler-37d9a510-1136-4adb-be09-c9c2e388611f@127.0.1.1:52310 disconnected
I0119 12:40:53.212004  4959 master.cpp:1752] Disconnecting framework
20150119-114412-171485376-5050-6660-0002 (Storm 0.9.3) at
scheduler-37d9a510-1136-4adb-be09-c9c2e388611f@127.0.1.1:52310
I0119 12:40:53.212198  4959 master.cpp:1768] Deactivating framework
20150119-114412-171485376-5050-6660-0002 (Storm 0.9.3) at
scheduler-37d9a510-1136-4adb-be09-c9c2e388611f@127.0.1.1:52310
I0119 12:40:53.212446  4959 master.cpp:811] Giving framework
20150119-114412-171485376-5050-6660-0002 (Storm 0.9.3) at
scheduler-37d9a510-1136-4adb-be09-c9c2e388611f@127.0.1.1:52310 1hrs to
failover
I0119 12:40:53.212550  4959 hierarchical_allocator_process.hpp:405]
Deactivated framework 20150119-114412-171485376-5050-6660-0002
I0119 12:40:54.209858  4959 master.cpp:1520] Received re-registration
request from framework 20150119-114412-171485376-5050-6660-0002 (Storm
0.9.3) at scheduler-37d9a510-1136-4adb-be09-c9c2e388611f@127.0.1.1:52310


Other frameworks work fine and correctly handle a leading master on another
node.

From a brief look at the source code, it hangs in

https://github.com/mesos/storm/blob/master/src/storm/mesos/MesosNimbus.java
at line 153

when trying to acquire a semaphore.
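
To illustrate the kind of hang I mean, here is a minimal sketch of the pattern as I understand it. This is not the actual MesosNimbus code: the class and method names (RegistrationWait, onRegistered, awaitRegistration) are made up for the example, and I am assuming the semaphore is only released from the scheduler's registration callback. If the master never completes the registration with the scheduler, the callback never fires and a plain acquire() blocks forever. The sketch uses a timed tryAcquire instead, so the missing callback becomes visible.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; names are not taken from the storm-mesos source.
class RegistrationWait {

    // Stand-in for the scheduler's "registered" callback (assumption):
    // it releases one permit so the waiting thread can continue.
    static void onRegistered(Semaphore initter) {
        initter.release();
    }

    // Waits for the callback, but gives up after timeoutMillis instead of
    // blocking forever. Returns true if a permit was acquired in time.
    static boolean awaitRegistration(Semaphore initter, long timeoutMillis) {
        try {
            return initter.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        }
    }

    public static void main(String[] args) {
        Semaphore initter = new Semaphore(0); // no permits until registration

        // No callback has fired yet: a plain acquire() would hang here.
        System.out.println("registered=" + awaitRegistration(initter, 200));

        // Simulate the callback arriving, then wait again.
        onRegistered(initter);
        System.out.println("registered=" + awaitRegistration(initter, 200));
    }
}
```

With a timeout like this, Nimbus could at least log that registration never completed rather than sitting at "Waiting for scheduler to initialize...".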


Thank you for your great work.

Ondrej Smola
