[ https://issues.apache.org/jira/browse/MESOS-6803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adam B updated MESOS-6803: -------------------------- Component/s: scheduler driver > Agent authentication does not have an initial `delay` > ----------------------------------------------------- > > Key: MESOS-6803 > URL: https://issues.apache.org/jira/browse/MESOS-6803 > Project: Mesos > Issue Type: Bug > Components: agent, scheduler driver > Reporter: Alex Clemmer > Assignee: Alex Clemmer > Priority: Critical > Labels: microsoft, security, windows-mvp > > When an agent registers, there is currently a somewhat subtle difference in > behavior between the cases when it does and does not authenticate: > * In the case that it DOES NOT authenticate, we will choose a random time > between 0 and the agent `registration_backoff_factor` to initiate > registration. The reason for this is to avoid every agent hitting the master > at once during master failover. (We also employ backoff to help this.) See: > [1] > * In the case that it DOES authenticate, we always attempt to authenticate > and register the Agent immediately. So currently in authenticated clusters, > after failover, all agents will immediately try to register with a master > upon failover; though, this is helped somewhat by the fact that the > authenticated codepath still uses backoff. See: [2] > It is important to resolve this disparity, not only to make the system more > resilient, but also because it directly blocks us from passing many tests on > platforms where authentication is not supported at all (Windows in > particular). > For some time, we have meant to make both the authenticated and > unauthenticated codepaths use a random `delay` to begin. See Adam's TODO in > [3]. Historically, people seem to have had a few problems with this: > 1. Deep in the bowels of git history, Vinod notes[4] that the Agent might end > up trying to authenticate twice, if a new master is detected before the auth > is processed. It seems to me that this should not be an issue (or at least, > not any more). > 2. Many of our tests depend on authenticated registration happening even if > `Clock::pause()` has been called; that is, because our first attempt at > authentication and Agent registration are dispatched for immediate execution, > even when we pause the clock, these events should still happen. If we use a > `delay`, then they are scheduled to happen in the future, and any tests > employing `Clock::pause` during this time will fail. > The resolution of this bug, at minimum, involves fixing the semantics of the > above tests to pass when `HAS_AUTHENTICATION` is set to false. Following > this, it is realistic to expect that we add `delay` to the authentication > codepath as well. > In terms of resolution, it is useful to know the specific tests that will > fail if `HAS_AUTHENTICATION` is set to false: > ``` > [ FAILED ] ExamplesTest.V1JavaFramework > [ FAILED ] ExamplesTest.PythonFramework > [ FAILED ] FaultToleranceTest.FrameworkReregister > [ FAILED ] MasterAllocatorTest/0.RebalancedForUpdatedWeights, where > TypeParam = > mesos::internal::master::allocator::MesosAllocator<mesos::internal::master::allocator::HierarchicalAllocatorProcess<mesos::internal::master::allocator::DRFSorter, > mesos::internal::master::allocator::DRFSorter, > mesos::internal::master::allocator::DRFSorter> > > [ FAILED ] MasterAllocatorTest/1.RebalancedForUpdatedWeights, where > TypeParam = mesos::internal::tests::Module<mesos::allocator::Allocator, > (mesos::internal::tests::ModuleID)6> > [ FAILED ] MasterTest.EndpointsForHalfRemovedSlave > [ FAILED ] MasterTest.UnreachableTaskAfterFailover > [ FAILED ] MasterTest.CancelRecoveredSlaveRemoval > [ FAILED ] MasterTest.RecoveredFramework > [ FAILED ] OversubscriptionTest.RescindRevocableOfferWithIncreasedRevocable > [ FAILED ] OversubscriptionTest.RescindRevocableOfferWithDecreasedRevocable > [ FAILED ] OversubscriptionTest.Reregistration > [ FAILED ] PartitionTest.ReregisterSlavePartitionAware > [ FAILED ] PartitionTest.ReregisterSlaveNotPartitionAware > [ FAILED ] PartitionTest.PartitionedSlaveReregistrationMasterFailover > [ FAILED ] PartitionTest.PartitionedSlaveOrphanedTask > [ FAILED ] PartitionTest.SpuriousSlaveReregistration > [ FAILED ] PartitionTest.PartitionedSlaveStatusUpdates > [ FAILED ] PartitionTest.RegistryGcByCount > [ FAILED ] PartitionTest.RegistryGcByAge > [ FAILED ] PartitionTest.RegistryGcRace > [ FAILED ] OneWayPartitionTest.MasterToSlave > [ FAILED ] ReconciliationTest.ReconcileStatusUpdateTaskState > [ FAILED ] ReservationTest.ACLMultipleOperations > [ FAILED ] ReservationTest.WithoutAuthenticationWithoutPrincipal > [ FAILED ] ReservationTest.WithoutAuthenticationWithPrincipal > [ FAILED ] SlaveTest.DuplicateTerminalUpdateBeforeAck > [ FAILED ] SlaveTest.StateEndpoint > [ FAILED ] SlaveTest.PingTimeoutNoPings > [ FAILED ] SlaveTest.PingTimeoutSomePings > [ FAILED ] SlaveTest.ReregisterWithStatusUpdateTaskState > [ FAILED ] SlaveTest.MaxCompletedExecutorsPerFrameworkFlag > [ FAILED ] ContentType/AgentAPITest.NestedContainerLaunchFalse/0, where > GetParam() = application/x-protobuf > [ FAILED ] ContentType/AgentAPITest.NestedContainerLaunchFalse/1, where > GetParam() = application/json > [ FAILED ] ContentType/AgentAPITest.NestedContainerLaunch/0, where > GetParam() = application/x-protobuf > [ FAILED ] ContentType/AgentAPITest.NestedContainerLaunch/1, where > GetParam() = application/json > [ FAILED ] > ContentType/AgentAPITest.LaunchNestedContainerSessionAttachFailure/0, where > GetParam() = application/x-protobuf > [ FAILED ] > ContentType/AgentAPITest.LaunchNestedContainerSessionAttachFailure/1, where > GetParam() = application/json > [ FAILED ] DiskResource/PersistentVolumeTest.MasterFailover/0, where > GetParam() = 0 > [ FAILED ] DiskResource/PersistentVolumeTest.AccessPersistentVolume/0, > where GetParam() = 0 > [ FAILED ] DiskResource/PersistentVolumeTest.AccessPersistentVolume/1, > where GetParam() = 1 > [ FAILED ] > DiskResource/PersistentVolumeTest.SharedPersistentVolumeRescindOnDestroy/0, > where GetParam() = 0 > [ FAILED ] > DiskResource/PersistentVolumeTest.SharedPersistentVolumeRescindOnDestroy/1, > where GetParam() = 1 > [ FAILED ] MountDiskResource/PersistentVolumeTest.AccessPersistentVolume/0, > where GetParam() = 2 > [ FAILED ] > MountDiskResource/PersistentVolumeTest.SharedPersistentVolumeRescindOnDestroy/0, > where GetParam() = 2 > ``` > [1] > https://github.com/apache/mesos/blob/c5c5c13deab834e6db7e1f9d687b8cc0f6a0641f/src/slave/slave.cpp#L948 > [2] > https://github.com/apache/mesos/blob/c5c5c13deab834e6db7e1f9d687b8cc0f6a0641f/src/slave/slave.cpp#L942 > [3] > https://github.com/apache/mesos/blob/c5c5c13deab834e6db7e1f9d687b8cc0f6a0641f/src/slave/slave.cpp#L938 > [4] > https://github.com/apache/mesos/commit/09b1dc3e95955aa187458fcb61e1d66b04ec3af2#diff-01648193f4029dc9fc1e024949f6ea28R562 -- This message was sent by Atlassian JIRA (v6.3.4#6332)