[ https://issues.apache.org/jira/browse/MESOS-7376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benjamin Mahler updated MESOS-7376:
-----------------------------------
    Summary: Reduce copying of the Registry to improve Registrar performance.  (was: Long registry updates when the number of agents is high)

> Reduce copying of the Registry to improve Registrar performance.
> ----------------------------------------------------------------
>
>                 Key: MESOS-7376
>                 URL: https://issues.apache.org/jira/browse/MESOS-7376
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 1.3.0
>            Reporter: Ilya Pronin
>            Assignee: Ilya Pronin
>            Priority: Critical
>
> During scale testing we discovered that as the number of registered agents
> grows the time it takes to update the registry grows to unacceptable values
> very fast. At some point it starts exceeding {{registry_store_timeout}} which
> doesn't fire.
> With 55k agents we saw this ({{registry_store_timeout=20secs}}):
> {noformat}
> I0331 17:11:21.227442 36472 registrar.cpp:473] Applied 69 operations in 3.138843387secs; attempting to update the registry
> I0331 17:11:24.441409 36464 log.cpp:529] LogStorage.set: acquired the lock in 74461ns
> I0331 17:11:24.441541 36464 log.cpp:543] LogStorage.set: started in 51770ns
> I0331 17:11:26.869323 36462 log.cpp:628] LogStorage.set: wrote append at position=6420881 in 2.41043644secs
> I0331 17:11:26.869454 36462 state.hpp:179] State.store: storage.set has finished in 2.428189561secs (b=1)
> I0331 17:11:56.199453 36469 registrar.cpp:518] Successfully updated the registry in 34.971944192secs
> {noformat}
> This is caused by repeated {{Registry}} copying which involves copying a big
> object graph that takes roughly 0.4 sec (with 55k agents).

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
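
To illustrate the kind of cost the description attributes to repeated {{Registry}} copying, here is a minimal C++ sketch. It is not the Mesos code: the Registry struct, admitByCopy, admitInPlace and countCopies are hypothetical names, and the copy counter exists only for demonstration. It assumes an update path that deep-copies the whole object once per operation and compares it with one that mutates the object in place.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the Registry object graph; NOT the real Mesos type.
struct Registry {
  std::vector<std::string> agents;

  // Counts deep copies; exists only for this illustration.
  static std::size_t copies;

  Registry() = default;
  Registry(const Registry& other) : agents(other.agents) { ++copies; }
  Registry(Registry&&) = default;
  Registry& operator=(const Registry& other) {
    agents = other.agents;
    ++copies;
    return *this;
  }
  Registry& operator=(Registry&&) = default;
};

std::size_t Registry::copies = 0;

// Copy-heavy variant (assumed pattern): every operation duplicates the
// whole agent list before changing it, so the cost grows with registry size.
Registry admitByCopy(const Registry& registry, const std::string& agent) {
  Registry updated = registry;  // deep copy of all agents
  updated.agents.push_back(agent);
  return updated;               // moved or elided, not copied again
}

// In-place variant: the operation mutates the registry it is handed.
void admitInPlace(Registry* registry, const std::string& agent) {
  registry->agents.push_back(agent);
}

// Applies `operations` admissions to a registry that already holds
// `agents` entries and returns how many deep copies were made.
std::size_t countCopies(bool inPlace, std::size_t agents, std::size_t operations) {
  Registry registry;
  registry.agents.assign(agents, "agent");
  Registry::copies = 0;

  for (std::size_t i = 0; i < operations; ++i) {
    if (inPlace) {
      admitInPlace(&registry, "new-agent");
    } else {
      registry = admitByCopy(registry, "new-agent");  // move-assigned
    }
  }

  assert(registry.agents.size() == agents + operations);
  return Registry::copies;
}
```

In this sketch, countCopies(false, 1000, 69) performs one deep copy per operation, while countCopies(true, 1000, 69) performs none. Where and how often the real Registrar copies the {{Registry}} is not stated in the description above, so the one-copy-per-operation pattern is purely an assumption of the example.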