Hey Stephen,

Thanks for this great write-up.

Regarding Problem 3, where non-rack-tolerant pipelines end up long lived,
should we consider alternatives to destroying the pipeline? AFAIK, destroying
pipelines is expensive. Another option is to seal/pause the open containers
and transfer them to a new pipeline. During the transition, the containers are
read-only on the old pipeline. Once they have been transferred to the new
pipeline (meaning the new pipeline has been created and has fully registered
itself in the SCM DB), they become writable on the new pipeline. We would
probably need a ref-count per container to know how many reads are still in
flight, to avoid race conditions. This could avoid some of the cost that
destroying a pipeline incurs.
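
To make the idea concrete, here is a rough sketch of the seal/transfer flow
with a per-container ref-count. It is Python only for brevity and is not Ozone
code: the Container class and the seal/transfer/read names are all made up for
illustration, and the real SCM registration step is reduced to a field update.

```python
import threading
from contextlib import contextmanager


class Container:
    """Hypothetical container handle; not an actual Ozone class."""

    def __init__(self, container_id, pipeline_id):
        self.container_id = container_id
        self.pipeline_id = pipeline_id
        self.writable = True
        self._in_flight_reads = 0
        self._cond = threading.Condition()

    @contextmanager
    def read(self):
        """Track one in-flight read for the duration of the with-block."""
        with self._cond:
            self._in_flight_reads += 1
        try:
            yield self
        finally:
            with self._cond:
                self._in_flight_reads -= 1
                self._cond.notify_all()

    def check_writable(self):
        """Raise if the container is sealed (read-only) right now."""
        with self._cond:
            if not self.writable:
                raise RuntimeError("container is sealed for transfer")

    def seal(self):
        """Make the container read-only on its current pipeline."""
        with self._cond:
            self.writable = False

    def transfer(self, new_pipeline_id, timeout=None):
        """Move a sealed container once all in-flight reads have drained.

        Returns False, leaving the container sealed on the old pipeline,
        if reads are still in flight when the timeout expires.
        """
        with self._cond:
            if self.writable:
                raise RuntimeError("seal() must be called before transfer()")
            drained = self._cond.wait_for(
                lambda: self._in_flight_reads == 0, timeout)
            if not drained:
                return False
            # Assumption: the new pipeline already exists and is registered.
            self.pipeline_id = new_pipeline_id
            self.writable = True
            return True
```

The sketch does not stop new reads from starting while transfer() is waiting,
so a busy container could delay the switch. A real implementation would need
to decide whether such reads are redirected or briefly held back.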

-Li

On 2020/3/12, 9:19 PM, "Stephen O'Donnell" <[email protected]> 
wrote:

    We had a discussion yesterday with some of the team related to network
    topology and we came up with the following list of proposals which probably
    need to be implemented to cover some edge cases and make the feature more
    supportable. I am sharing them here to gather any further ideas, problems
    and feedback before we attempt to fix these issues.
    
    
    Problem 1:
    
    As of now, there is no tool to tell us if any containers are not replicated
    on 2 racks.
    
    Solution:
    
    A feature should be added to Recon to check the replication and highlight
    containers which are not on two racks.
    
    
    Problem 2:
    
    If closed containers somehow end up on only 1 rack, there is no facility to
    correct that.
    
    Solution:
    
    Replication Manager should be extended to check for both under replicated
    and mis-replicated containers and it should work to correct them. It was
    also suggested that if a container has only 2 replicas on 1 rack, the
    cluster is rack aware, and no node is available from another rack,
    replication manager should not schedule a 3rd copy on the same rack. It
    should instead wait for a node on another rack to become available.
    
    Problem 3:
    
    If pipelines get created which are not rack tolerant, then they will be
    long lived and will create containers which are not rack tolerant for a
    long time. This can happen if nodes from another rack are not available
    when pipelines are being created, or 1 rack of a 2 rack cluster is stopped.
    
    Solution:
    
    The existing pipeline scrubber should be extended to check for pipelines
    which are not rack tolerant and also check if there are nodes available
    from at least two racks. If so, it will destroy non-rack tolerant pipelines
    in a controlled fashion.
    
    For a badly configured cluster, e.g. rack_1 has 10 nodes, rack_2 has 1 node,
    we should never create non-rack tolerant pipelines even though it will
    reduce the cluster throughput. That is, the fall back option when creating
    pipelines should only be used when there is only 1 rack available.
    
    
    Problem 4:
    
    With the existing design, pipelines start to be created as soon as 3 nodes
    have registered with SCM. If 3 nodes from the same rack register first, the
    system does not yet know the cluster is rack aware (the current logic
    checks the number of racks which have checked in) and so it will create a
    non-rack tolerant pipeline. The solution to problem 3 can take care of
    this, but it seems it would be better to try to prevent these bad pipelines
    getting created to begin with. Additionally, with multi-raft, it would be
    better to have most nodes registered before creating pipelines to spread
    them out across the cluster more evenly.
    
    Solution:
    
    SCM already has a Safemode check. It is the ideal place to add a check like
    this and we decided it would make sense to have some safe mode rules which
    must pass before pipelines can start to be created. Several ideas were
    discussed:
    
    1. Wait for a static number of nodes to register. This is simple, but a
    static configuration that must be changed as the cluster grows is not
    ideal. This check already exists for exiting safemode, but it would need to
    be changed slightly to block pipeline creation too.
    
    2. Wait for the node count to stabilize. In this way, the safemode rule
    would check the node count has not changed during some interval of time,
    implying all nodes have registered. A negative is slowing down the startup
    time, but due to (3) below this would not be a problem on an established
    cluster.
    
    3. Wait for some percentage of the total expected containers to be
    reported, which would imply most of the expected nodes have registered.
    This check is already present to exit safe mode, so we would need it to
    block pipeline creation too. The one negative is that it may not work well
    for clusters with a small number of nodes or few containers (i.e. new
    clusters). It would also be possible for all containers to be reported with
    only one third of the nodes registered in an extreme case.
    
    4. Wait for at least 2 racks to be registered if the cluster is configured
    as rack tolerant. This does help with ensuring the pipelines are spread
    across all the nodes.
    
    This area needs some more exploration to figure out which of these ideas is
    best.
    
    
    Problem 5:
    
    The closed container replication policy is different from the pipeline
    policy and it is possible to configure Replication Manager to use an
    incompatible policy.
    
    Solution:
    
    It may not be possible or desirable to merge the closed container placement
    policy with the pipeline policy, but we need to think about unifying the
    configuration so it is not possible to set incompatible options.
    
    Thanks,
    
    Stephen.
    

