...
- Remove the HostAwareContainerAllocator & ContainerAllocator, simplify Container Allocator as a simple lightweight entity allocating requests to available resources (PR1, PR2)
- Introduce ContainerManager which acts as a brain for validating and issuing any actions on containers in the Job Coordinator for both active & Standby containers. (PR)
- Transfer state & validation of container launch & expired request handling from ContainerAllocator to ContainerManager
- Transfer state & validation for callback handler lifecycle management of Container allocator & resource request on boot from ClusterResourceManager.CallBack(ContainerProcessManager) to ContainerManager
- Encapsulates logic and state related to container placement actions like move, restarts for active & standby container in ContainerManager (PR-1, TDB)
- It is ContainerManager’s duty to validate any ContainerPlacementRequestMessages & also invalidate messages from the previous deployment incarnation
- It is ContainerManager’s duty to write ContainerPlacementResponseMessages to Metastore for external control to query the status of the request
- ContainerPlacementMetadata is a metadata holder for container actions (ControlActionMetadata) for ex request_id, current status, requested resources etc
Note: ClusterResourceManager.Callback (ContainerProcessManager) is tightly coupled with ClusterbasedJobCoordinator today, all the proposed changes will be done except for moving state & lifecycle management of Container allocator & resource request on boot from ClusterResourceManager.CallBack(ContainerProcessManager) to ContainerManager in phase 1 of the implementation so that this feature can be developed faster. Hence ContainerProcessManager will still be tied with ClusterBasedJobCoordinator and will intercept any container placement requests. 2.1 Container Move 2.1.1 Stateless Container Move & Stateful Container Move (without Standby) ...
- If the preferred resources are not able to be accrued the active container is never stopped and a failure notification is sent for the ContainerPacementRequest
- If the ContainerPlacementManager is not able to stop the active container (3.1 #1 above fails) in that case the request is marked failed & a failure notification is sent for the ContainerPacementRequest
- If ClusterResourceManager fails to start the stopped active container on the accrued destination host, then we attempt to start the container back on the source host and a failure notification is sent for the ContainerPacementRequest. If container fails to start on source host then an attempt is made to start on ANY_HOST
Note: ClusterResourceManager.Callback (ContainerProcessManager) is tightly coupled with ClusterbasedJobCoordinator today, all the proposed changes will be done except for moving state & lifecycle management of Container allocator & resource request on boot from ClusterResourceManager.CallBack(ContainerProcessManager) to ContainerManager in phase 1 of the implementation so that this feature can be developed faster. Hence ContainerProcessManager will still be tied with ClusterBasedJobCoordinator and will intercept any container placement requests. Option 3: Stateful without Standby (Spin Up StandBy container & then move) (Phase 2) [Strech] ... |