[ https://issues.apache.org/jira/browse/MESOS-9690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16804905#comment-16804905 ]
Benno Evers commented on MESOS-9690:
------------------------------------

The authentication issues mentioned in the original ticket turned out to be a red herring, so I updated the ticket description and labels.

> Framework registration can silently fail w/o visible error
> ----------------------------------------------------------
>
>                 Key: MESOS-9690
>                 URL: https://issues.apache.org/jira/browse/MESOS-9690
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benno Evers
>            Priority: Major
>              Labels: foundations
>
> When running a v1 framework, the master can sometimes respond with "503
> Service Unavailable" to a SUBSCRIBE request, without any log message hinting
> at what might be wrong, even at log level `GLOG_v=4`. For example, this is
> from an attempt to run the `OperationFeedbackFramework` against `mesos-local`:
> {noformat}
> I0328 18:17:53.273442 7793 scheduler.cpp:600] Sending SUBSCRIBE call to http://127.0.1.1:36423/master/api/v1/scheduler
> I0328 18:17:53.273653 7797 leveldb.cpp:347] Persisting action (14 bytes) to leveldb took 3.185352ms
> I0328 18:17:53.273695 7797 replica.cpp:712] Persisted action NOP at position 0
> I0328 18:17:53.274099 7798 containerizer.cpp:1123] Recovering isolators
> I0328 18:17:53.274602 7794 replica.cpp:695] Replica received learned notice for position 0 from log-network(1)@127.0.1.1:36423
> I0328 18:17:53.274829 7798 containerizer.cpp:1162] Recovering provisioner
> I0328 18:17:53.275249 7795 process.cpp:3588] Handling HTTP event for process 'master' with path: '/master/api/v1/scheduler'
> I0328 18:17:53.276659 7792 provisioner.cpp:494] Provisioner recovery complete
> I0328 18:17:53.277318 7796 slave.cpp:7602] Recovering executors
> I0328 18:17:53.277470 7796 slave.cpp:7755] Finished recovery
> I0328 18:17:53.277743 7794 leveldb.cpp:347] Persisting action (16 bytes) to leveldb took 3.110989ms
> I0328 18:17:53.277777 7794 replica.cpp:712] Persisted action NOP at position 0
> I0328 18:17:53.278400 7795 http.cpp:1105] HTTP POST for /master/api/v1/scheduler from 127.0.0.1:45952
> I0328 18:17:53.278426 7793 task_status_update_manager.cpp:181] Pausing sending task status updates
> I0328 18:17:53.278453 7794 log.cpp:570] Writer started with ending position 0
> I0328 18:17:53.278425 7798 status_update_manager_process.hpp:379] Pausing operation status update manager
> I0328 18:17:53.278431 7796 slave.cpp:1258] New master detected at master@127.0.1.1:36423
> I0328 18:17:53.278502 7796 slave.cpp:1312] No credentials provided. Attempting to register without authentication
> I0328 18:17:53.278560 7796 slave.cpp:1323] Detecting new master
> W0328 18:17:53.279768 7791 scheduler.cpp:697] Received '503 Service Unavailable' () for SUBSCRIBE
> {noformat}
> Regardless of the actual issue that caused the error response, I think that,
> at the very least,
> - the `mesos::scheduler::Mesos` class should either have a way to provide
> some feedback to the user or retry itself, not silently swallow the error
> - our documentation should mention the possibility of this call returning
> errors
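
To illustrate the two behaviours the ticket asks for (retry on "503 Service Unavailable", and surface the failure instead of swallowing it), here is a minimal Python sketch. It is not the `mesos::scheduler::Mesos` implementation. `send_subscribe`, `subscribe_with_retry`, `SubscribeError`, and the backoff parameters are hypothetical names chosen for illustration, and the sketch assumes a successful SUBSCRIBE is signalled by HTTP status 200.

```python
import time
from typing import Callable


class SubscribeError(Exception):
    """Raised so that the caller sees the failure instead of it being dropped."""


def subscribe_with_retry(
    send_subscribe: Callable[[], int],
    max_attempts: int = 5,
    initial_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Retry a SUBSCRIBE call while the master answers 503.

    send_subscribe is a hypothetical callable that performs the HTTP POST
    and returns the response status code. Returns the number of attempts
    that were needed; raises SubscribeError on any other outcome.
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        status = send_subscribe()
        if status == 200:  # assumption: 200 means the subscription was accepted
            return attempt
        if status != 503:
            # Not a transient "service unavailable": report it immediately.
            raise SubscribeError(f"SUBSCRIBE failed with HTTP {status}")
        if attempt < max_attempts:
            sleep(delay)  # back off before the next attempt
            delay *= 2    # exponential backoff
    raise SubscribeError(
        f"SUBSCRIBE still returned 503 after {max_attempts} attempts"
    )
```

A caller would wrap its own HTTP request in `send_subscribe` and catch `SubscribeError` to log or report the failure. The `sleep` parameter is injectable only so that the backoff can be exercised without real delays.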