[ https://issues.apache.org/jira/browse/MESOS-8171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Harper updated MESOS-8171: ------------------------------ Affects Version/s: 1.1.3 1.2.2 1.3.1 > Using a failoverTimeout of 0 with Mesos native scheduler client can result in > infinite subscribe loop > ----------------------------------------------------------------------------------------------------- > > Key: MESOS-8171 > URL: https://issues.apache.org/jira/browse/MESOS-8171 > Project: Mesos > Issue Type: Bug > Components: c++ api, java api, scheduler driver > Affects Versions: 1.1.3, 1.2.2, 1.3.1, 1.4.0 > Reporter: Tim Harper > Priority: Minor > > Over the past year, the Marathon team has been plagued with an issue that > hits our CI builds periodically in which the scheduler driver enters a tight > loop, sending 10,000s of SUBSCRIBE calls to the master per second. I turned > on debug logging for the client and the server, and it pointed to an issue > with the {{doReliableRegistration}} method in sched.cpp. Here's the logs: > {code} > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.099815 13397 process.cpp:1383] libprocess is initialized on > 127.0.1.1:60957 with 8 worker threads > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.118237 13397 logging.cpp:199] Logging to STDERR > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.128921 13416 sched.cpp:232] Version: 1.4.0 > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.151785 13791 group.cpp:341] Group process > (zookeeper-group(1)@127.0.1.1:60957) connected to ZooKeeper > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.151823 13791 group.cpp:831] Syncing group operations: queue size > (joins, cancels, datas) = (0, 0, 0) > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.151837 13791 group.cpp:419] Trying to create path '/mesos' in > ZooKeeper > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.152586 13791 group.cpp:758] Found non-sequence node 'log_replicas' > at '/mesos' in ZooKeeper > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.152662 13791 detector.cpp:152] Detected a new leader: (id='0') > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.152762 13791 group.cpp:700] Trying to get > '/mesos/json.info_0000000000' in ZooKeeper > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.157148 13791 zookeeper.cpp:262] A new leading master > (UPID=master@172.16.10.95:32856) is detected > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.157347 13787 sched.cpp:336] New master detected at > master@172.16.10.95:32856 > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.157557 13787 sched.cpp:352] No credentials provided. Attempting to > register without authentication > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.157565 13787 sched.cpp:836] Sending SUBSCRIBE call to > master@172.16.10.95:32856 > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.157635 13787 sched.cpp:869] Will retry registration in 0ns if > necessary > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.158979 13785 sched.cpp:836] Sending SUBSCRIBE call to > master@172.16.10.95:32856 > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.159029 13785 sched.cpp:869] Will retry registration in 0ns if > necessary > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.159265 13790 sched.cpp:836] Sending SUBSCRIBE call to > master@172.16.10.95:32856 > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.159303 13790 sched.cpp:869] Will retry registration in 0ns if > necessary > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.159479 13786 sched.cpp:836] Sending SUBSCRIBE call to > master@172.16.10.95:32856 > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.159521 13786 sched.cpp:869] Will retry registration in 0ns if > necessary > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.159622 13788 sched.cpp:836] Sending SUBSCRIBE call to > master@172.16.10.95:32856 > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.159658 13788 sched.cpp:869] Will retry registration in 0ns if > necessary > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.159749 13789 sched.cpp:836] Sending SUBSCRIBE call to > master@172.16.10.95:32856 > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.159785 13789 sched.cpp:869] Will retry registration in 0ns if > necessary > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.159878 13792 sched.cpp:836] Sending SUBSCRIBE call to > master@172.16.10.95:32856 > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.159916 13792 sched.cpp:869] Will retry registration in 0ns if > necessary > {code} > In Marathon, when we are running our tests, we set the failoverTimeout to 0 > in order to cause the Mesos master to immediately forget about a framework > when it disconnects. > On line 860 of sched.cpp, the retry-delay is set to 1/10th the > failoverTimeout, which provides the best explanation for why the value is 0: > {code} > /Users/tim/src/m8e/mesos/src/sched/sched.cpp > 818 | void doReliableRegistration(Duration maxBackoff) > 819 | { > ... > 851 | // Bound the maximum backoff by 'REGISTRATION_RETRY_INTERVAL_MAX'. > 852 | maxBackoff = > 853 | std::min(maxBackoff, scheduler::REGISTRATION_RETRY_INTERVAL_MAX); > 854 | > 855 | // If failover timeout is present, bound the maximum backoff > 856 | // by 1/10th of the failover timeout. > 857 | if (framework.has_failover_timeout()) { > 858 | Try<Duration> duration = > Duration::create(framework.failover_timeout()); > 859 | if (duration.isSome()) { > 860 | maxBackoff = std::min(maxBackoff, duration.get() / 10); > 861 | } > 862 | } > 863 | > 864 | // Determine the delay for next attempt by picking a random > 865 | // duration between 0 and 'maxBackoff'. > 866 | // TODO(vinod): Use random numbers from <random> header. > 867 | Duration delay = maxBackoff * ((double) os::random() / RAND_MAX); > 868 | > 869 | VLOG(1) << "Will retry registration in " << delay << " if > necessary"; > 870 | > 871 | // Backoff. > 872 | frameworkRegistrationTimer = process::delay( > 873 | delay, self(), &Self::doReliableRegistration, maxBackoff * 2); > 874 | } > 875 | > {code} > Reading through the code, it seems that once this value is 0, it will always > be zero, since backoff is multiplicative (0 * 2 == 0), and the > failover_timeout / 10 limit is applied each time. -- This message was sent by Atlassian JIRA (v6.4.14#64029)