> So this would be more like a performance/economy vs redundancy kinda
> thing. Shall we actually allow both and make it configurable in rack
> awareness config?
Totally agree, we can have multiple strategies (both for open and closed
containers).
But finding a good / smart default is still an important question (though
it's easier if we support multiple options).
Slightly related:
In the comments of HDDS-3167 [1] a discussion has been started about how
different replication strategies can be configured:
The idea is to introduce a new abstraction "storage-class" (similar to
the S3 classes [2]) to make it easy to configure different replication
strategies (open / closed) for containers and keys.
For example:
storageClass=STANDARD :=
openContainerReplication = RATIS/THREE
closedContainerReplication = 3
openContainerRackAwareness = Mandatory
storageClass=REDUCED :=
openContainerReplication = RATIS/ONE
closedContainerReplication = 2
closedContainerRackAwareness = Preferred
storageClass=COLD :=
openContainerReplication = RATIS/THREE
closedContainerType = EC
closedContainerReplication = EC-5-3
Each container would store its storageClass, and a container would be
chosen for any key with the same storage class. The storage class can then
define the required steps for any replication decision.
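As a rough illustration, such a storage class could be modelled as a
simple enum which bundles all replication-related settings in one place
(all names here are hypothetical, nothing like this exists in the code
yet):

  // Hypothetical sketch: a storage class bundles every replication
  // decision for both open and closed containers in one place.
  public enum StorageClass {

    STANDARD("RATIS/THREE", 3, RackAwareness.MANDATORY),
    REDUCED("RATIS/ONE", 2, RackAwareness.PREFERRED),
    COLD("RATIS/THREE", 5 + 3 /* EC-5-3 */, RackAwareness.PREFERRED);

    /** Replication used while the container is open (pipeline type). */
    private final String openContainerReplication;
    /** Replica (or EC chunk) count once the container is closed. */
    private final int closedContainerReplication;
    /** How strictly placement must honour rack awareness. */
    private final RackAwareness rackAwareness;

    StorageClass(String open, int closed, RackAwareness awareness) {
      this.openContainerReplication = open;
      this.closedContainerReplication = closed;
      this.rackAwareness = awareness;
    }

    /** Hypothetical strictness levels for the placement policy. */
    public enum RackAwareness { MANDATORY, PREFERRED, NONE }
  }

A key would then only need to carry its storage class, and every
replication decision (pipeline choice, re-replication, EC) could be
looked up from it.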
Marton
[1] https://issues.apache.org/jira/browse/HDDS-3167
[2] https://aws.amazon.com/s3/storage-classes/
On 3/17/20 7:28 AM, timmycheng(程力) wrote:
So this would be more like a performance/economy vs redundancy kinda thing.
Shall we actually allow both and make it configurable in rack awareness config?
At the end of the day, rack awareness could bring better redundancy by
spreading nodes across different racks, as well as greater economy by
choosing adjacent nodes. It just depends on what users prefer.
In reality, larger clusters would prefer durability to economy, but smaller
clusters may lean in the opposite direction.
Shall we actually start to think about multiple redundancy modes for Ozone?
Object storage services like S3 offer different redundancy modes.
-Li
On 2020/3/16, 7:14 PM, "Stephen O'Donnell" <[email protected]>
wrote:
> Problem 3. It's inevitable to allocate new pipelines which don't meet
> the rack tolerance policy.
I would argue that in a working, well-configured cluster (i.e. the same
number of nodes per rack, at least 2 racks available) we should ensure we
never fall back to non-rack-aware pipelines. The only time we should fall
back is if there are no other racks available. Even if we have a badly
configured cluster (10 nodes on 1 rack and 1 node on the only other rack)
we should only allocate rack-aware pipelines, and the cluster configuration
should be fixed. Yes, this would limit the write capacity of the cluster,
but it's a side effect of other issues which need to be fixed.
> Another way is to migrate one member of the pipeline from one datanode
> to another datanode.
If we ensure that non-rack-tolerant pipelines are rare (i.e. only when
there are no other racks available) and we re-create these pipelines in a
controlled way, then I feel we should avoid trying to do something overly
complex here. Just closing down a bad pipeline and creating a new one would
be the easiest approach.
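To make that concrete, here is a minimal sketch of the fallback rule
(hypothetical names, not the actual SCM placement code): a single-rack
pipeline is only acceptable when the whole cluster has a single rack.

import java.util.List;

public class PipelineFallbackSketch {

  /** Thrown when rack-aware placement is required but not satisfied. */
  static class PlacementException extends RuntimeException {
    PlacementException(String msg) { super(msg); }
  }

  /**
   * @param candidateRacks rack of each healthy candidate datanode
   * @param chosenRacks    rack of each datanode chosen for the pipeline
   */
  static void validatePlacement(List<String> candidateRacks,
      List<String> chosenRacks) {
    long clusterRacks = candidateRacks.stream().distinct().count();
    long pipelineRacks = chosenRacks.stream().distinct().count();
    // A single-rack pipeline is acceptable only on a single-rack cluster.
    if (clusterRacks > 1 && pipelineRacks < 2) {
      throw new PlacementException("Refusing non-rack-tolerant pipeline: "
          + clusterRacks + " racks exist but only " + pipelineRacks
          + " is used");
    }
  }

  public static void main(String[] args) {
    // Badly configured cluster: 10 nodes on rack1, only 1 on rack2.
    List<String> cluster = List.of("r1", "r1", "r1", "r1", "r1", "r1",
        "r1", "r1", "r1", "r1", "r2");
    validatePlacement(cluster, List.of("r1", "r1", "r2")); // fine
    validatePlacement(cluster, List.of("r1", "r1", "r1")); // throws
  }
}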
On Mon, Mar 16, 2020 at 5:51 AM Sammi Chen <[email protected]> wrote:
> Thanks Stephen for the summary and for leading the discussion.
>
> Problem 1&2, we definitely need such an indicator, and to move the closed
> container replicas around to make sure that not only the replica number
> meets the requirements, but also the rack awareness requirements. I'm not
> familiar with the Recon server functions. If it can't be done in
> Replication Manager, Recon server is also a good choice.
> Problem 5, the closed container placement policy is a little different
> from the pipeline placement policy. The former considers rack awareness
> and datanode free disk space. The latter considers rack awareness and
> datanode pipeline load balancing. Agree to unify the configuration and
> keep the configuration consistent.
>
> Problem 4, thanks for brainstorming four possible solutions for this
> problem. For the safe mode, there are actually 2 sub-cases: one is the
> safe mode of a new cluster, another is the safe mode of an existing
> stable cluster. Solutions 1~3 seem to work for the safe mode of an
> existing cluster, while Solution 4 seems to work for both sub-cases.
>
> Problem 3. It's inevitable to allocate new pipelines which don't meet the
> rack tolerance policy. One way is we can close the pipeline once we find
> a better pipeline candidate set.
> Another way is to migrate one member of the pipeline from one datanode to
> another datanode. For example, we can let a 4th datanode join the
> pipeline and learn the data; once it catches up, we can turn on this
> datanode and turn off the datanode being replaced. The second solution
> requires RATIS support. I'm not clear if it's feasible or how much work
> it will be. The first solution is quite straightforward. Though closing
> a pipeline is a costly operation, with multi-raft enabled, maybe we can
> afford this cost, for a higher data durability guarantee.
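> If RATIS peer reconfiguration turns out to be feasible, the flow could
> look roughly like this (all names are invented for illustration; the
> actual RATIS API would need checking):
>
> // Hypothetical sketch of replacing one pipeline member in place.
> interface PipelineAdmin {
>   void addLearner(String pipelineId, String datanode);  // catch-up only
>   boolean isCaughtUp(String pipelineId, String datanode);
>   void promote(String pipelineId, String datanode);     // make voter
>   void removeMember(String pipelineId, String datanode);
> }
>
> class MemberReplacement {
>   static void replace(PipelineAdmin admin, String pipelineId,
>       String badNode, String newNode) throws InterruptedException {
>     admin.addLearner(pipelineId, newNode);
>     while (!admin.isCaughtUp(pipelineId, newNode)) {
>       Thread.sleep(1000);  // poll until the new member has caught up
>     }
>     admin.promote(pipelineId, newNode);       // turn on the new datanode
>     admin.removeMember(pipelineId, badNode);  // turn off the old one
>   }
> }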
>
> Bests,
> Sammi
>
>
>
>
>
> On Thu, Mar 12, 2020 at 9:19 PM Stephen O'Donnell
> <[email protected]> wrote:
>
> > We had a discussion yesterday with some of the team related to network
> > topology, and we came up with the following list of proposals which
> > probably need to be implemented to cover some edge cases and make the
> > feature more supportable. I am sharing them here to gather any further
> > ideas, problems and feedback before we attempt to fix these issues.
> >
> >
> > Problem 1:
> >
> > As of now, there is no tool to tell us if any containers are not
> > replicated on 2 racks.
> >
> > Solution:
> >
> > A feature should be added to Recon to check the replication and
> > highlight containers which are not on two racks.
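> > For illustration, the core of such a check could be as small as
> > counting distinct racks per container (a hypothetical sketch, not
> > actual Recon code; java.util imports assumed):
> >
> > // Flag containers whose replicas span fewer than 2 racks.
> > static List<Long> findSingleRackContainers(
> >     Map<Long, List<String>> replicaRacksByContainer) {
> >   List<Long> misReplicated = new ArrayList<>();
> >   for (Map.Entry<Long, List<String>> e
> >       : replicaRacksByContainer.entrySet()) {
> >     long racks = e.getValue().stream().distinct().count();
> >     if (racks < 2) {
> >       misReplicated.add(e.getKey()); // all replicas on one rack
> >     }
> >   }
> >   return misReplicated;
> > }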
> >
> >
> > Problem 2:
> >
> > If closed containers somehow end up on only 1 rack, there is no
> > facility to correct that.
> >
> > Solution:
> >
> > Replication Manager should be extended to check for both
> > under-replicated and mis-replicated containers, and it should work to
> > correct them. It was also suggested that if a container has only 2
> > replicas on 1 rack, the cluster is rack aware, and no node is available
> > from another rack, replication manager should not schedule a 3rd copy
> > on the same rack. It should instead wait for a node on another rack to
> > become available.
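> > A rough sketch of that decision logic (hypothetical names, not the
> > real Replication Manager code; java.util imports assumed):
> >
> > // Decide what to do for one closed container, given the racks of its
> > // current replicas and the racks which currently have usable nodes.
> > static String decide(List<String> replicaRacks, Set<String> liveRacks) {
> >   boolean underReplicated = replicaRacks.size() < 3;
> >   boolean misReplicated =
> >       replicaRacks.stream().distinct().count() < 2;
> >   if (!underReplicated && !misReplicated) {
> >     return "HEALTHY";
> >   }
> >   boolean newRackAvailable =
> >       liveRacks.stream().anyMatch(r -> !replicaRacks.contains(r));
> >   if (misReplicated && !newRackAvailable) {
> >     // Don't schedule yet another copy on the same rack; wait for a
> >     // node on a different rack to become available instead.
> >     return "WAIT_FOR_NEW_RACK";
> >   }
> >   return "SCHEDULE_REPLICATION";
> > }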
> >
> > Problem 3:
> >
> > If pipelines get created which are not rack tolerant, then they will be
> > long lived and will create containers which are not rack tolerant for a
> > long time. This can happen if nodes from another rack are not available
> > when pipelines are being created, or 1 rack of a 2 rack cluster is
> > stopped.
> >
> > Solution:
> >
> > The existing pipeline scrubber should be extended to check for
> > pipelines which are not rack tolerant and also check if there are nodes
> > available from at least two racks. If so, it will destroy non-rack
> > tolerant pipelines in a controlled fashion.
> >
> > For a badly configured cluster, e.g. rack_1 has 10 nodes, rack_2 has 1
> > node, we should never create non-rack tolerant pipelines even though it
> > will reduce the cluster throughput. That is, the fallback option when
> > creating pipelines should only be used when there is only 1 rack
> > available.
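> > The scrub condition itself would be simple (hypothetical sketch;
> > java.util imports assumed):
> >
> > // Destroy a non-rack-tolerant pipeline only when a better one is
> > // actually possible, i.e. at least two racks have live nodes.
> > static boolean shouldScrub(List<String> pipelineRacks,
> >     Set<String> liveRacks) {
> >   boolean rackTolerant = pipelineRacks.stream().distinct().count() > 1;
> >   return !rackTolerant && liveRacks.size() > 1;
> > }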
> >
> >
> > Problem 4:
> >
> > With the existing design, pipelines start to be created as soon as 3
> > nodes have registered with SCM. If 3 nodes from the same rack register
> > first, the system does not know the cluster is rack aware as yet (the
> > current logic checks the number of racks which have checked in) and so
> > it will create a non-rack tolerant pipeline. The solution to problem 3
> > can take care of this, but it seems it would be better to try to
> > prevent these bad pipelines getting created to begin with.
> > Additionally, with multi-raft, it would be better to have most nodes
> > registered before creating pipelines to spread them out across the
> > cluster more evenly.
> >
> > Solution:
> >
> > SCM already has a Safemode check. It is the ideal place to add a check
> > like this, and we decided it would make sense to have some safe mode
> > rules which must pass before pipelines can start to be created. Several
> > ideas were discussed:
> >
> > 1. Wait for a static number of nodes to register. This is simple, but a
> > static configuration that must be changed as the cluster grows is not
> > ideal. This check already exists for exiting safemode, but it would
> > need to be changed slightly to block pipeline creation too.
> >
> > 2. Wait for the node count to stabilize. In this way, the safemode rule
> > would check the node count has not changed during some interval of
> > time, implying all nodes have registered. A negative is slowing down
> > the startup time, but due to (3) below this would not be a problem on
> > an established cluster.
> >
> > 3. Wait for some percentage of the total expected containers to be
> > reported, which would imply most of the expected nodes have registered.
> > This check is already present to exit safe mode, so we would need it to
> > block pipeline creation too. The one negative is that it may not work
> > well for clusters with a small number of nodes or few containers (i.e.
> > new clusters). It would also be possible for all containers to be
> > reported with only one third of the nodes registered in an extreme
> > case.
> >
> > 4. Wait for at least 2 racks to be registered if the cluster is
> > configured as rack tolerant. This does help with ensuring the pipelines
> > are spread across all the nodes.
> >
> > This area needs some more exploration to figure out which of these
> > ideas is best.
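> > As a concrete illustration of idea 2, a node-count stabilization rule
> > could look roughly like this (hypothetical sketch, not an actual SCM
> > safemode rule):
> >
> > // Pass only once the registered node count has been unchanged for a
> > // full interval, implying all nodes have registered.
> > class NodeCountStableRule {
> >   private final long intervalMillis;
> >   private long lastCount = -1;
> >   private long lastChangeTime;
> >
> >   NodeCountStableRule(long intervalMillis) {
> >     this.intervalMillis = intervalMillis;
> >   }
> >
> >   /** Called periodically with the current registered node count. */
> >   boolean validate(long nodeCount, long nowMillis) {
> >     if (nodeCount != lastCount) {
> >       lastCount = nodeCount;
> >       lastChangeTime = nowMillis;
> >     }
> >     return nowMillis - lastChangeTime >= intervalMillis;
> >   }
> > }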
> >
> >
> > Problem 5:
> >
> > The closed container replication policy is different from the pipeline
> > policy and it is possible to configure Replication Manager to use an
> > incompatible policy.
> >
> > Solution:
> >
> > It may not be possible or desirable to merge the closed container
> > placement policy with the pipeline policy, but we need to think about
> > unifying the configuration so it is not possible to set incompatible
> > options.
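> > For example (hypothetical settings, mirroring the storage-class
> > notation above), a single rack awareness switch could drive both
> > policies:
> >
> > rackAwareness = MANDATORY :=
> > pipelinePlacement = rack awareness + pipeline load balancing
> > closedContainerPlacement = rack awareness + free disk space
> >
> > so the two policies can still differ in their secondary criteria but
> > can never disagree on rack awareness.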
> >
> > Thanks,
> >
> > Stephen.
> >
>
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]