Hi Sammi, Thanks for the detailed reply.
After HDDS-3270 we can prevent pipelines from getting created in the cluster until hdds.scm.safemode.min.datanode nodes have registered, so that should largely avoid "bad" pipelines getting created on startup. >From a topology perspective, I am not too concerned about non-rack-aware pipelines being created at startup time, so long as we can close those pipelines later. If non-rack-aware pipelines get created and remain forever, then it gives Replication Manager more work to do, as it must replicate one copy of every closed container which is not rack aware. It seems to be an open question about whether regularly closing pipelines is a good idea. I have captured the details gathered from several people in a short document and attached it to https://issues.apache.org/jira/browse/HDDS-4065 so the information is all in one place. A less controversial question is whether we should close non-rack-aware pipelines on a rack aware cluster. One suggestion I heard was to highlight the problem in Recon and allow an admin to close the non-rack-aware pipelines manually at a convenient time. This could work, but I feel it would be better to handle this automatically. The longer it goes unhandled, the more containers will need to be replicated to ensure data availability. I agree that the fallback is important, but I believe it is a bug for the fallback to be used when there are multiple racks alive. A small two rack cluster is one corner case where this can happen, however for a large cluster with 100's of pipelines, it is still possible for a non-rack-aware pipeline to get created in some scenarios on a healthy cluster, and we should avoid that - https://issues.apache.org/jira/browse/HDDS-4062 is intended to fix that. Thanks, Stephen. On Fri, Aug 7, 2020 at 10:37 AM Sammi Chen <[email protected]> wrote: > Hi Stephen, > > Thanks for initiating this. Here are some of my thoughts based on in-field > experience and discussion during today's community call. > > The goals we want to achieve are, > a) make sure container can tolerate rack level failure, so we need cross > rack pipeline > b) make sure containers are evenly distributed, and container overlap > between every three datanodes are minimized, then the unavailable data size > once three data nodes go down at the same time is minimized. > > Given the current state is > 1. pipeline creation and healthy pipeline destruction are expensive > operations, it's better to avoid unnecessary operations as much as > possible. > 2. massive container replication is an expensive operation too. > > Let's have a case by case analysis, > 1. For a fresh new cluster, SCM will exit safe mode very soon after start > up. Most pipelines are created with the same group members and same > datanode as the ratis leader. As long as there are 3 healthy nodes > available, background thread starts to create pipelines(In this case, even > HDDS-4062 cannot alleviate it). A CLI command, just like current > ReplicationManage, to turn on/off the background thread creation, seems a > good choice. For a fresh new cluster, admin turns on the pipeline auto > creation after enough data nodes are registered. Though this will add some > burn to cluster admin, it's kind of acceptable. > > 2. For a stably running cluster without datanode add/remove, periodically > close pipeline is basically not preferred, because after the pipeline > closed, you will get a new pipeline with the same datanodes as group > members. But to achieve goal b), we have to periodically close the > pipelines and recreate new pipelines. The only thing I worry about is we > will have more partial full containers. We need "HDDS-3952. Container > Compact" . The pipeline close frequency is very important. > > 3. When expanding a cluster by adding a batch of new datanodes, with the > close pipeline mechanism, the key point is the frequency. If we want to > distribute the write request quickly to the new datanodes, the short > interval is preferred. But again, short interval means more partial full > containers. > > 4. When downgrading a cluster, because of network partition, rack power > failure or old servers recycle, pipelines on these involved datanodes will > be closed after timeout. But whether new pipelines can be created out is a > question. So it's a trade off, fallback vs not fallback, write > throughput & availability vs data safety. Basically,I think all these > cases which cause cluster downgrading usallly will be fixed in a short > time. So fallback is more user friendly and data safety risk is relatively > low. > > Bests, > Sammi > > > On Mon, Aug 3, 2020 at 7:35 PM Stephen O'Donnell > <[email protected]> wrote: > > > Hi All, > > > > I would like to revisit network topology around pipeline creation and > > destruction. > > > > Pipelines are created by the RatisPipelineProvider which delegates > > responsibility for picking the pipeline nodes to the > > PipelinePlacementPolicy. > > > > When picking the nodes for a pipeline, the PipelinePlacementPolicy will > > check for the topology and presence of more than 1 rack, and if so, try > to > > create pipelines spanning multiple racks. Otherwise it will select random > > nodes - this is the fall back mechanism, intended to be used by clusters > > with a single rack, or no topology configured. > > > > As I have raised before, we have a couple of problems: > > > > 1) On cluster startup, pipeline creation is triggered immediately when > > nodes register. If at least 3 nodes from 1 rack register before any > others, > > they can be part of a pipeline which is not rack aware. We have somewhat > > fixed this with safemode rules. > > > > 2) If the nodes per rack are not perfectly balanced, it would be possible > > for 3 DNs in 1 rack to have capacity for more pipelines, with all other > > nodes having no capacity. If that happens, the fallback mechanism would > be > > used, and a non-rack aware pipeline would be created. > > > > 3) If something happens such that only 1 rack is available for some time > > (restart or rack outage) the cluster will create new pipelines on 1 rack, > > and these will never get destroyed, even when the missing rack returns to > > service. > > > > Proposal 1: > > > > If the cluster has multiple racks AND there are healthy nodes covering at > > least 2 racks, where healthy is defined as a node which is registered and > > not stale or dead, then we should not allow "fallback" (pipelines which > > span only 1 rack) pipelines to be created. > > > > This means if you have a badly configured cluster - eg Rack 1 = 10 nodes; > > Rack 2 = 1 node, the pipeline limit will be constrained by the capacity > of > > that 1 node on rack 2. Even a setup like Rack 1 = 10 nodes, Rack 2 = 5 > > would be constrained by this. > > > > IMO, this constraint is better than creating non rack aware pipelines, as > > the cluster setup should be fixed. > > > > This should also handle the case when the cluster degrades to 1 rack, as > > the healthy node definition will notice only 1 rack is alive. > > > > It would be quite easy to implement this in > > PipelinePlacementPolicy#filterViableNodes, as we already get the list of > > healthy nodes, and then exclude overloaded nodes. > > > > Questions: > > > > 1. Pipeline creation does not consider capacity - do we need to consider > > capacity in the "healthy node" definition? Eg, extend it to nodes which > are > > not stale or dead, and have X bytes of available space? What if no nodes > > have enough space? > > > > 2. What happens to a pipeline if one node in the pipeline runs out of > > space? Will this be detected and the pipeline destroyed? > > > > > > Proposal 2: > > > > In the PipelineManager, there is already a thread called the > > BackgroundPipelineCreator. As well as creating pipelines, I think it > should > > check existing pipelines using a similar rule as proposal 1. Ie, if the > > cluster has multiple racks, and there are healthy nodes spanning more > than > > 1 rack, it should destroy non-rack aware pipelines. > > > > This would handle the case where the cluster degrades to a single rack, > > single rack pipelines get created, and then it returns to multi-rack. It > > would also allow for any non rack aware pipelines created at startup to > be > > cleaned up. > > > > Questions: > > > > 1. Should the pipeline destruction be throttled? Consider the case where > > the cluster goes from 2 racks to 1 rack. All nodes on the remaining rack > > will be involved in non-rack-aware pipelines up to their pipeline limit. > > When the second rack comes back online, it will not be able to create any > > pipelines, until we free capacity on the existing nodes. > > > > 2. Assuming the destruction is throttled, I would welcome some ideas > about > > metrics that can be used to throttle it, that will handle small to large > > clusters. Perhaps something as simple as "destroy at least 1 and up to at > > most X% of bad pipelines, then run createPipelines, sleep, repeat". > > > > Note, that in a very small cluster - 6 nodes, 3 nodes per rack. If 1 rack > > is down and its 3 nodes are in a pipeline - we cannot create a new > pipeline > > without briefly going to zero pipelines on the cluster. > > > > I would like to get some agreement on the proposals before making any > code > > changes. Please let me know if there are any things I have missed or > other > > potential problems this could introduce. > > > > Thanks, > > > > Stephen. > > >
