[jira] [Commented] (MESOS-9690) Framework registration can silently fail w/o visible error

Benno Evers (JIRA) Fri, 29 Mar 2019 05:30:32 -0700


    [ 
https://issues.apache.org/jira/browse/MESOS-9690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16804905#comment-16804905
 ]


Benno Evers commented on MESOS-9690:
------------------------------------

The authentication issues mentioned in the original ticket turned out to be a 
red herring, so I updated the ticket description and labels.

> Framework registration can silently fail w/o visible error
> ----------------------------------------------------------
>
>                 Key: MESOS-9690
>                 URL: https://issues.apache.org/jira/browse/MESOS-9690
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benno Evers
>            Priority: Major
>              Labels: foundations
>
> When running a v1 framework, the master can sometimes respond with "503 
> Service Unavailable" to a SUBSCRIBE request without logging anything that 
> hints at the cause, even at log level `GLOG_v=4`. For example, this is 
> from an attempt to run the `OperationFeedbackFramework` against `mesos-local`:
> {noformat}
> I0328 18:17:53.273442  7793 scheduler.cpp:600] Sending SUBSCRIBE call to 
> http://127.0.1.1:36423/master/api/v1/scheduler
> I0328 18:17:53.273653  7797 leveldb.cpp:347] Persisting action (14 bytes) to 
> leveldb took 3.185352ms
> I0328 18:17:53.273695  7797 replica.cpp:712] Persisted action NOP at position 0
> I0328 18:17:53.274099  7798 containerizer.cpp:1123] Recovering isolators
> I0328 18:17:53.274602  7794 replica.cpp:695] Replica received learned notice 
> for position 0 from log-network(1)@127.0.1.1:36423
> I0328 18:17:53.274829  7798 containerizer.cpp:1162] Recovering provisioner
> I0328 18:17:53.275249  7795 process.cpp:3588] Handling HTTP event for process 
> 'master' with path: '/master/api/v1/scheduler'
> I0328 18:17:53.276659  7792 provisioner.cpp:494] Provisioner recovery complete
> I0328 18:17:53.277318  7796 slave.cpp:7602] Recovering executors
> I0328 18:17:53.277470  7796 slave.cpp:7755] Finished recovery
> I0328 18:17:53.277743  7794 leveldb.cpp:347] Persisting action (16 bytes) to 
> leveldb took 3.110989ms
> I0328 18:17:53.277777  7794 replica.cpp:712] Persisted action NOP at position 0
> I0328 18:17:53.278400  7795 http.cpp:1105] HTTP POST for 
> /master/api/v1/scheduler from 127.0.0.1:45952
> I0328 18:17:53.278426  7793 task_status_update_manager.cpp:181] Pausing 
> sending task status updates
> I0328 18:17:53.278453  7794 log.cpp:570] Writer started with ending position 0
> I0328 18:17:53.278425  7798 status_update_manager_process.hpp:379] Pausing 
> operation status update manager
> I0328 18:17:53.278431  7796 slave.cpp:1258] New master detected at 
> master@127.0.1.1:36423
> I0328 18:17:53.278502  7796 slave.cpp:1312] No credentials provided. 
> Attempting to register without authentication
> I0328 18:17:53.278560  7796 slave.cpp:1323] Detecting new master
> W0328 18:17:53.279768  7791 scheduler.cpp:697] Received '503 Service 
> Unavailable' () for SUBSCRIBE
> {noformat}
> Regardless of the actual issue that caused the error response, I think at the 
> very least,
>  - the `mesos::scheduler::Mesos` class should either have a way to provide 
> some feedback to the user or retry itself, not silently swallow the error
>  - our documentation should mention the possibility of this call returning 
> errors
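
To illustrate the "retry itself" option from the first bullet, here is a minimal Python sketch. It is not Mesos code and does not reflect how `mesos::scheduler::Mesos` is implemented. The names `send_subscribe`, `SubscribeError` and `subscribe_with_retry` are hypothetical; `send_subscribe` stands in for whatever performs the POST to /master/api/v1/scheduler and returns the HTTP status code. The attempt limit and backoff delays are assumed values.

```python
import time
from typing import Callable


class SubscribeError(Exception):
    """Raised so that a failed SUBSCRIBE is visible to the caller."""


def subscribe_with_retry(
    send_subscribe: Callable[[], int],  # hypothetical: does the POST, returns status
    max_attempts: int = 5,              # assumed limit, not a Mesos default
    base_delay: float = 0.5,            # assumed initial backoff in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Retry SUBSCRIBE while the master answers 503.

    Returns the number of attempts that were needed. Raises SubscribeError
    on any other non-200 status, or when all attempts returned 503.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        status = send_subscribe()
        if status == 200:
            return attempt
        if status != 503:
            # Not a transient "service unavailable": report it right away.
            raise SubscribeError(f"SUBSCRIBE failed with HTTP {status}")
        if attempt < max_attempts:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))

    raise SubscribeError(
        f"SUBSCRIBE still returned 503 after {max_attempts} attempts"
    )
```

Whether the retry lives in the library or in the framework, the point is that the final failure is raised to the caller and not only written to a log.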



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
