Hi Stephen,

Thanks for initiating this.  Here are some of my thoughts based on field
experience and on the discussion during today's community call.

The goals we want to achieve are:
a)  make sure a container can tolerate a rack-level failure, so we need
cross-rack pipelines;
b)  make sure containers are evenly distributed and the container overlap
between any three datanodes is minimized, so that the amount of data that
becomes unavailable when three datanodes go down at the same time is
minimized.

Given the current state:
1.   Pipeline creation and the destruction of healthy pipelines are
expensive operations, so it's better to avoid unnecessary operations as
much as possible.
2.   Massive container replication is an expensive operation too.

Let's go through a case-by-case analysis:
1.  For a brand new cluster, SCM will exit safe mode very soon after
startup.  As long as there are 3 healthy datanodes available, the
background thread starts to create pipelines, and most pipelines end up
with the same group members and the same datanode as the Ratis leader (in
this case, even HDDS-4062 cannot alleviate it).  A CLI command to turn the
background pipeline creation on and off, just like the current
ReplicationManager one, seems a good choice.  For a fresh cluster, the
admin turns on pipeline auto-creation only after enough datanodes have
registered.  Though this adds some burden on the cluster admin, it seems
acceptable.
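
Just to make the switch idea concrete, here is a rough sketch (the class,
field and method names are only illustrative, not the real SCM code):

    // Hypothetical on/off switch for background pipeline creation, mirroring
    // how ReplicationManager can be started/stopped from the CLI.
    public class PipelineCreationSwitchSketch implements Runnable {

      private volatile boolean autoCreationEnabled = false; // off until enabled
      private volatile boolean running = true;
      private final long intervalMillis = 60_000; // check interval (assumption)

      /** Would be wired to a CLI command the admin runs once enough
          datanodes have registered. */
      public void setAutoCreationEnabled(boolean enabled) {
        this.autoCreationEnabled = enabled;
      }

      @Override
      public void run() {
        while (running) {
          if (autoCreationEnabled) {
            createPipelinesIfNeeded(); // delegate to the existing creation logic
          }
          try {
            Thread.sleep(intervalMillis);
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            running = false;
          }
        }
      }

      private void createPipelinesIfNeeded() {
        // Placeholder for the existing background pipeline creation logic.
      }
    }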

2.  For a stably running cluster with no datanodes being added or removed,
periodically closing pipelines is generally not preferred, because after a
pipeline is closed you will get a new pipeline with the same datanodes as
group members.  But to achieve goal b), we have to periodically close
pipelines and create new ones.  The only thing I worry about is that we
will end up with more partially full containers, so we need "HDDS-3952.
Container Compact".  The pipeline close frequency is very important.

3.  When expanding a cluster by adding a batch of new datanodes, with the
pipeline close mechanism the key point is again the frequency.  If we want
to spread write requests to the new datanodes quickly, a short interval is
preferred.  But again, a short interval means more partially full
containers.

4.  When a cluster degrades, because of a network partition, a rack power
failure or old servers being recycled, pipelines on the involved datanodes
will be closed after a timeout.  But whether new pipelines can then be
created is an open question.  So it's a trade-off: fallback vs. no
fallback, write throughput and availability vs. data safety.  I think the
situations that cause this kind of degradation are usually fixed in a short
time, so fallback is more user friendly and the data safety risk is
relatively low.

Bests,
Sammi


On Mon, Aug 3, 2020 at 7:35 PM Stephen O'Donnell
<[email protected]> wrote:

> Hi All,
>
> I would like to revisit network topology around pipeline creation and
> destruction.
>
> Pipelines are created by the RatisPipelineProvider which delegates
> responsibility for picking the pipeline nodes to the
> PipelinePlacementPolicy.
>
> When picking the nodes for a pipeline, the PipelinePlacementPolicy will
> check for the topology and presence of more than 1 rack, and if so, try to
> create pipelines spanning multiple racks. Otherwise it will select random
> nodes - this is the fallback mechanism, intended to be used by clusters
> with a single rack, or no topology configured.
>
> As I have raised before, we have a couple of problems:
>
> 1) On cluster startup, pipeline creation is triggered immediately when
> nodes register. If at least 3 nodes from 1 rack register before any others,
> they can be part of a pipeline which is not rack aware. We have somewhat
> fixed this with safemode rules.
>
> 2) If the nodes per rack are not perfectly balanced, it would be possible
> for 3 DNs in 1 rack to have capacity for more pipelines, with all other
> nodes having no capacity. If that happens, the fallback mechanism would be
> used, and a non-rack aware pipeline would be created.
>
> 3) If something happens such that only 1 rack is available for some time
> (restart or rack outage) the cluster will create new pipelines on 1 rack,
> and these will never get destroyed, even when the missing rack returns to
> service.
>
> Proposal 1:
>
> If the cluster has multiple racks AND there are healthy nodes covering at
> least 2 racks, where healthy is defined as a node which is registered and
> not stale or dead, then we should not allow "fallback" (pipelines which
> span only 1 rack) pipelines to be created.
>
> This means if you have a badly configured cluster - eg Rack 1 = 10 nodes;
> Rack 2 = 1 node, the pipeline limit will be constrained by the capacity of
> that 1 node on rack 2. Even a setup like Rack 1 = 10 nodes, Rack 2 = 5
> would be constrained by this.
>
> IMO, this constraint is better than creating non rack aware pipelines, as
> the cluster setup should be fixed.
>
> This should also handle the case when the cluster degrades to 1 rack, as
> the healthy node definition will notice only 1 rack is alive.
>
> It would be quite easy to implement this in
> PipelinePlacementPolicy#filterViableNodes, as we already get the list of
> healthy nodes, and then exclude overloaded nodes.
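>
> Something along these lines, as a sketch only (the method and parameter
> names here are illustrative, not the real PipelinePlacementPolicy API):
>
>     // Decide whether single-rack "fallback" pipelines should be allowed,
>     // given the racks of the currently healthy (registered, not stale or
>     // dead) nodes.
>     static boolean allowFallback(boolean multipleRacksConfigured,
>                                  java.util.List<String> healthyNodeRacks) {
>       long healthyRackCount = healthyNodeRacks.stream().distinct().count();
>       // Multiple racks configured and healthy nodes cover at least 2 of
>       // them: never fall back to a single-rack pipeline.
>       return !(multipleRacksConfigured && healthyRackCount >= 2);
>     }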
>
> Questions:
>
> 1. Pipeline creation does not consider capacity - do we need to consider
> capacity in the "healthy node" definition? Eg, extend it to nodes which are
> not stale or dead, and have X bytes of available space? What if no nodes
> have enough space?
>
> 2. What happens to a pipeline if one node in the pipeline runs out of
> space? Will this be detected and the pipeline destroyed?
>
>
> Proposal 2:
>
> In the PipelineManager, there is already a thread called the
> BackgroundPipelineCreator. As well as creating pipelines, I think it should
> check existing pipelines using a similar rule as proposal 1. Ie, if the
> cluster has multiple racks, and there are healthy nodes spanning more than
> 1 rack, it should destroy non-rack aware pipelines.
>
> This would handle the case where the cluster degrades to a single rack,
> single rack pipelines get created, and then it returns to multi-rack. It
> would also allow for any non rack aware pipelines created at startup to be
> cleaned up.
>
> Questions:
>
> 1. Should the pipeline destruction be throttled? Consider the case where
> the cluster goes from 2 racks to 1 rack. All nodes on the remaining rack
> will be involved in non-rack-aware pipelines up to their pipeline limit.
> When the second rack comes back online, it will not be able to create any
> pipelines, until we free capacity on the existing nodes.
>
> 2. Assuming the destruction is throttled, I would welcome some ideas about
> metrics that can be used to throttle it and that will handle small to large
> clusters. Perhaps something as simple as "destroy at least 1 and at most
> X% of bad pipelines, then run createPipelines, sleep, repeat".
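>
> Roughly, as a sketch only (the 10% per round is just an assumed figure and
> all class and method names here are illustrative):
>
>     // One round of a throttled scrubber: destroy at least 1 and at most a
>     // fixed fraction of the non-rack-aware pipelines, then trigger pipeline
>     // creation again before sleeping and repeating.
>     class ThrottledScrubberSketch {
>       static final double MAX_FRACTION = 0.1; // assumed 10% per round
>
>       void scrubOnce(java.util.List<String> nonRackAwarePipelines) {
>         int limit =
>             Math.max(1, (int) (nonRackAwarePipelines.size() * MAX_FRACTION));
>         for (int i = 0; i < limit && i < nonRackAwarePipelines.size(); i++) {
>           destroy(nonRackAwarePipelines.get(i)); // existing close path
>         }
>         createPipelines(); // existing BackgroundPipelineCreator logic
>       }
>
>       void destroy(String pipelineId) { /* placeholder */ }
>       void createPipelines() { /* placeholder */ }
>     }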
>
> Note that in a very small cluster - 6 nodes, 3 nodes per rack - if 1 rack
> is down and its 3 nodes are in a pipeline, we cannot create a new pipeline
> without briefly going to zero pipelines on the cluster.
>
> I would like to get some agreement on the proposals before making any code
> changes. Please let me know if there are any things I have missed or other
> potential problems this could introduce.
>
> Thanks,
>
> Stephen.
>
