[ 
https://issues.apache.org/jira/browse/MESOS-6803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-6803:
--------------------------
    Component/s: scheduler driver

> Agent authentication does not have an initial `delay`
> -----------------------------------------------------
>
>                 Key: MESOS-6803
>                 URL: https://issues.apache.org/jira/browse/MESOS-6803
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, scheduler driver
>            Reporter: Alex Clemmer
>            Assignee: Alex Clemmer
>            Priority: Critical
>              Labels: microsoft, security, windows-mvp
>
> When an agent registers, there is currently a somewhat subtle difference in 
> behavior between the cases when it does and does not authenticate:
> * In the case that it DOES NOT authenticate, we will choose a random time 
> between 0 and the agent `registration_backoff_factor` to initiate 
> registration. The reason for this is to avoid every agent hitting the master 
> at once during master failover. (We also employ backoff to help with this.) 
> See: [1]
> * In the case that it DOES authenticate, we always attempt to authenticate 
> and register the Agent immediately. So currently, in authenticated clusters, 
> all agents will immediately try to register with the new master after a 
> failover, though this is helped somewhat by the fact that the 
> authenticated codepath still uses backoff. See: [2]
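
The difference between the two cases above can be sketched as follows. This is a minimal standalone illustration, not the code in `slave.cpp`: the helper `initialDelay` and its parameters are hypothetical names, and the real agent uses libprocess rather than `<random>`.

```cpp
#include <cassert>
#include <chrono>
#include <random>

using Millis = std::chrono::milliseconds;

// Hypothetical helper (not part of Mesos): returns how long an agent waits
// before its first registration attempt, as described in the issue.
Millis initialDelay(bool authenticates, Millis backoffFactor, std::mt19937& rng)
{
  // A negative upper bound would make the distribution below invalid.
  assert(backoffFactor.count() >= 0);

  if (authenticates) {
    // Authenticated path: authenticate and register immediately.
    return Millis(0);
  }

  // Unauthenticated path: uniform random delay in [0, backoffFactor], so
  // agents do not all hit the master at once after a failover.
  std::uniform_int_distribution<Millis::rep> dist(0, backoffFactor.count());
  return Millis(dist(rng));
}
```

Under this sketch, resolving the disparity would amount to removing the `authenticates` branch so that both paths draw a random initial delay.
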
> It is important to resolve this disparity, not only to make the system more 
> resilient, but also because it directly blocks us from passing many tests on 
> platforms where authentication is not supported at all (Windows in 
> particular).
> For some time, we have meant to make both the authenticated and 
> unauthenticated codepaths use a random `delay` to begin. See Adam's TODO in 
> [3]. Historically, people seem to have had a few problems with this:
> 1. Deep in the bowels of git history, Vinod notes[4] that the Agent might end 
> up trying to authenticate twice, if a new master is detected before the auth 
> is processed. It seems to me that this should not be an issue (or at least, 
> not any more).
> 2. Many of our tests depend on authenticated registration happening even if 
> `Clock::pause()` has been called; that is, because our first attempt at 
> authentication and Agent registration are dispatched for immediate execution, 
> even when we pause the clock, these events should still happen. If we use a 
> `delay`, then they are scheduled to happen in the future, and any tests 
> employing `Clock::pause` during this time will fail.
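
Point 2 can be illustrated with a toy model of a paused clock. `FakeClock` below is a hypothetical stand-in written for this sketch, not libprocess's `Clock`; it only shows why an immediately dispatched event still runs while a delayed one waits until time is advanced.

```cpp
#include <chrono>
#include <functional>
#include <map>
#include <utility>

using Millis = std::chrono::milliseconds;

// Toy model of a paused test clock (hypothetical, not libprocess's Clock).
class FakeClock
{
public:
  // Runs `f` right away, like an event dispatched for immediate execution.
  void dispatch(const std::function<void()>& f) { f(); }

  // Schedules `f` to run once the clock has advanced by at least `d`.
  void delay(Millis d, std::function<void()> f)
  {
    timers.emplace(now + d, std::move(f));
  }

  // Moves time forward by `d` and fires every timer that is now due.
  void advance(Millis d)
  {
    now += d;
    while (!timers.empty() && timers.begin()->first <= now) {
      std::function<void()> f = std::move(timers.begin()->second);
      timers.erase(timers.begin());
      f();
    }
  }

private:
  Millis now{0};
  std::multimap<Millis, std::function<void()>> timers;
};
```

A test that relies on registration happening while the clock is paused would, in this model, need an explicit `advance` past the delay before its expectations can be met. In the real test suite the analogous tools would presumably be `Clock::advance` and `Clock::settle`.
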
> The resolution of this bug, at minimum, involves fixing the semantics of the 
> above tests to pass when `HAS_AUTHENTICATION` is set to false. Following 
> this, it is realistic to expect that we add `delay` to the authentication 
> codepath as well.
> In terms of resolution, it is useful to know the specific tests that will 
> fail if `HAS_AUTHENTICATION` is set to false:
> ```
> [  FAILED  ] ExamplesTest.V1JavaFramework
> [  FAILED  ] ExamplesTest.PythonFramework
> [  FAILED  ] FaultToleranceTest.FrameworkReregister
> [  FAILED  ] MasterAllocatorTest/0.RebalancedForUpdatedWeights, where TypeParam = mesos::internal::master::allocator::MesosAllocator<mesos::internal::master::allocator::HierarchicalAllocatorProcess<mesos::internal::master::allocator::DRFSorter, mesos::internal::master::allocator::DRFSorter, mesos::internal::master::allocator::DRFSorter> >
> [  FAILED  ] MasterAllocatorTest/1.RebalancedForUpdatedWeights, where TypeParam = mesos::internal::tests::Module<mesos::allocator::Allocator, (mesos::internal::tests::ModuleID)6>
> [  FAILED  ] MasterTest.EndpointsForHalfRemovedSlave
> [  FAILED  ] MasterTest.UnreachableTaskAfterFailover
> [  FAILED  ] MasterTest.CancelRecoveredSlaveRemoval
> [  FAILED  ] MasterTest.RecoveredFramework
> [  FAILED  ] OversubscriptionTest.RescindRevocableOfferWithIncreasedRevocable
> [  FAILED  ] OversubscriptionTest.RescindRevocableOfferWithDecreasedRevocable
> [  FAILED  ] OversubscriptionTest.Reregistration
> [  FAILED  ] PartitionTest.ReregisterSlavePartitionAware
> [  FAILED  ] PartitionTest.ReregisterSlaveNotPartitionAware
> [  FAILED  ] PartitionTest.PartitionedSlaveReregistrationMasterFailover
> [  FAILED  ] PartitionTest.PartitionedSlaveOrphanedTask
> [  FAILED  ] PartitionTest.SpuriousSlaveReregistration
> [  FAILED  ] PartitionTest.PartitionedSlaveStatusUpdates
> [  FAILED  ] PartitionTest.RegistryGcByCount
> [  FAILED  ] PartitionTest.RegistryGcByAge
> [  FAILED  ] PartitionTest.RegistryGcRace
> [  FAILED  ] OneWayPartitionTest.MasterToSlave
> [  FAILED  ] ReconciliationTest.ReconcileStatusUpdateTaskState
> [  FAILED  ] ReservationTest.ACLMultipleOperations
> [  FAILED  ] ReservationTest.WithoutAuthenticationWithoutPrincipal
> [  FAILED  ] ReservationTest.WithoutAuthenticationWithPrincipal
> [  FAILED  ] SlaveTest.DuplicateTerminalUpdateBeforeAck
> [  FAILED  ] SlaveTest.StateEndpoint
> [  FAILED  ] SlaveTest.PingTimeoutNoPings
> [  FAILED  ] SlaveTest.PingTimeoutSomePings
> [  FAILED  ] SlaveTest.ReregisterWithStatusUpdateTaskState
> [  FAILED  ] SlaveTest.MaxCompletedExecutorsPerFrameworkFlag
> [  FAILED  ] ContentType/AgentAPITest.NestedContainerLaunchFalse/0, where GetParam() = application/x-protobuf
> [  FAILED  ] ContentType/AgentAPITest.NestedContainerLaunchFalse/1, where GetParam() = application/json
> [  FAILED  ] ContentType/AgentAPITest.NestedContainerLaunch/0, where GetParam() = application/x-protobuf
> [  FAILED  ] ContentType/AgentAPITest.NestedContainerLaunch/1, where GetParam() = application/json
> [  FAILED  ] ContentType/AgentAPITest.LaunchNestedContainerSessionAttachFailure/0, where GetParam() = application/x-protobuf
> [  FAILED  ] ContentType/AgentAPITest.LaunchNestedContainerSessionAttachFailure/1, where GetParam() = application/json
> [  FAILED  ] DiskResource/PersistentVolumeTest.MasterFailover/0, where GetParam() = 0
> [  FAILED  ] DiskResource/PersistentVolumeTest.AccessPersistentVolume/0, where GetParam() = 0
> [  FAILED  ] DiskResource/PersistentVolumeTest.AccessPersistentVolume/1, where GetParam() = 1
> [  FAILED  ] DiskResource/PersistentVolumeTest.SharedPersistentVolumeRescindOnDestroy/0, where GetParam() = 0
> [  FAILED  ] DiskResource/PersistentVolumeTest.SharedPersistentVolumeRescindOnDestroy/1, where GetParam() = 1
> [  FAILED  ] MountDiskResource/PersistentVolumeTest.AccessPersistentVolume/0, where GetParam() = 2
> [  FAILED  ] MountDiskResource/PersistentVolumeTest.SharedPersistentVolumeRescindOnDestroy/0, where GetParam() = 2
> ```
> [1] https://github.com/apache/mesos/blob/c5c5c13deab834e6db7e1f9d687b8cc0f6a0641f/src/slave/slave.cpp#L948
> [2] https://github.com/apache/mesos/blob/c5c5c13deab834e6db7e1f9d687b8cc0f6a0641f/src/slave/slave.cpp#L942
> [3] https://github.com/apache/mesos/blob/c5c5c13deab834e6db7e1f9d687b8cc0f6a0641f/src/slave/slave.cpp#L938
> [4] https://github.com/apache/mesos/commit/09b1dc3e95955aa187458fcb61e1d66b04ec3af2#diff-01648193f4029dc9fc1e024949f6ea28R562



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
