Re: [Gluster-devel] Split-brain present and future in afr
----- Original Message -----
From: "Jeff Darcy" jda...@redhat.com
To: "Pranith Kumar Karampuri" pkara...@redhat.com
Cc: "Gluster Devel" gluster-devel@gluster.org
Sent: Tuesday, May 20, 2014 10:08:12 PM
Subject: Re: [Gluster-devel] Split-brain present and future in afr

>> 1. Better protection for split-brain over time.
>> 2. Policy based split-brain resolution.
>> 3. Provide better availability with client quorum and replica 2.
>
> I would add the following:
>
> (4) Quorum enforcement - any kind - on by default.

For replica 3 we can do that. For replica 2, the quorum implementation at the
moment is not good enough. Until we fix it correctly, maybe we should let it
be. We can revisit that decision once we come up with a better solution for
replica 2.

> (5) Fix the problem of volumes losing quorum because unrelated nodes went
> down (i.e. implement volume-level quorum).
>
> (6) Better tools for users to resolve split brain themselves.

Agreed. Already in plan for 3.6.

>> For 3, we are planning to introduce arbiter bricks that can be used to
>> determine quorum. The arbiter bricks will be dummy bricks that host only
>> files that will be updated from multiple clients. This will be achieved by
>> bringing about a variable replication count for a configurable class of
>> files within a volume. In the case of a replicated volume with one arbiter
>> brick per replica group, certain files that are prone to split-brain will
>> be on 3 bricks (2 data bricks + 1 arbiter brick). All other files will be
>> present on the regular data bricks.
>>
>> For example, when oVirt VM disks are hosted on a replica 2 volume, sanlock
>> is used by oVirt for arbitration. sanlock lease files are written by all
>> clients, while VM disks are written by only a single client at any given
>> point in time. In this scenario, we can place the sanlock lease files on
>> 2 data + 1 arbiter bricks; the VM disk files will only be present on the
>> 2 data bricks. Client quorum is then determined by looking at 3 bricks
>> instead of 2, and we have better protection when network split-brains
>> happen.
> Constantly filtering requests to use either N or N+1 bricks is going to be
> complicated and hard to debug. Every data-structure allocation or loop
> based on replica count will have to be examined, and many will have to be
> modified. That's a *lot* of places. This also overlaps significantly with
> functionality that can be achieved with data classification (i.e.
> supporting multiple replica levels within the same volume). What use case
> requires that it be implemented within AFR instead of more generally and
> flexibly?

1) It still wouldn't bring in an arbiter for replica 2.
2) That would need more bricks, more processes, more ports.

Pranith

_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
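[Editor's note: the client-quorum arithmetic behind the arbiter proposal above can be sketched in a few lines. This is a hypothetical illustration, not GlusterFS code; the function name `have_quorum` is invented for the example. It shows why 2 data bricks + 1 arbiter give safer partition behaviour than plain replica 2.]

```python
# Majority-based client quorum over a replica set: a client may write
# only if it can reach a strict majority of the set's bricks.

def have_quorum(bricks_up, replica_count):
    """True if the bricks visible to this client form a strict majority."""
    return bricks_up > replica_count // 2

# Plain replica 2: a 1/1 network split leaves each client with exactly
# half the bricks. No strict majority exists on either side, so either
# both sides block (no availability) or both write (split-brain).
print(have_quorum(1, 2))   # False: neither side of the split may write

# 2 data + 1 arbiter: the replica set has 3 members, and at most one
# side of any partition can see 2 of them.
print(have_quorum(2, 3))   # True: the side with the arbiter keeps writing
print(have_quorum(1, 3))   # False: the minority side is fenced off
```

Because only one side of a partition can ever hold a majority of 3, the two sides can never both accept writes, which is exactly the split-brain protection the arbiter brick is meant to buy.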
Re: [Gluster-devel] Split-brain present and future in afr
>> Constantly filtering requests to use either N or N+1 bricks is going to
>> be complicated and hard to debug. Every data-structure allocation or loop
>> based on replica count will have to be examined, and many will have to be
>> modified. That's a *lot* of places. This also overlaps significantly with
>> functionality that can be achieved with data classification (i.e.
>> supporting multiple replica levels within the same volume). What use case
>> requires that it be implemented within AFR instead of more generally and
>> flexibly?
>
> 1) It wouldn't still bring in arbiter for replica 2.

It's functionally the same, just implemented in a more modular fashion.
Either way, for the same set of data that was previously replicated twice,
most data would still be replicated twice but some subset would be
replicated three times. The policy filter is just implemented in a
translator dedicated to the purpose, instead of within AFR. In addition to
being simpler, this keeps the user experience consistent for setting this
vs. other kinds of policies.

> 2) That would need more bricks, more processes, more ports.

Fewer, actually. Either approach requires that we split bricks (as the user
sees them). One way, we turn N user bricks into N regular bricks plus N/2
arbiter bricks. The other way, we turn N user bricks into N bricks for the
replica-2 part and another N for the replica-3 part. That seems like
slightly more, but (a) it's the same user view, and (b) for processes and
ports it will actually be less. Since data classification is likely to
involve splitting bricks many times, and multi-tenancy likewise, the data
classification project is already scoped to include multiplexing multiple
bricks into one process on one port (like HekaFS used to do). Thus the total
number of ports and processes for an N-brick volume will go back down to N,
even with the equivalent of arbiter functionality.

Doing replica 2.5 as part of data classification instead of within AFR also
has other advantages.
For example, it naturally gives us support for overlapping replica sets - an
often requested feature to spread load more evenly after a failure. Perhaps
most importantly, it doesn't require separate implementations or debugging
for AFRv1, AFRv2, and NSR. Let's for once put our effort where it will do us
the most good, instead of succumbing to the streetlight effect [1] yet again
and hacking on the components that are most familiar.

[1] http://en.wikipedia.org/wiki/Streetlight_effect
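[Editor's note: the brick/process/port counting argument above can be made concrete with a back-of-envelope model. This is a hypothetical sketch, not GlusterFS code; the function names and the assumption that multiplexing folds all split bricks of one user brick back into a single process on a single port are taken from the stated plan, not from an existing implementation.]

```python
# Count bricks, processes, and ports for an N-brick volume under the two
# approaches discussed: arbiter bricks inside AFR vs. splitting bricks
# via data classification with brick multiplexing.

def arbiter_in_afr(n):
    # N regular bricks plus N/2 arbiter bricks, each running as its own
    # process on its own port (no multiplexing assumed here).
    bricks = n + n // 2
    return {"bricks": bricks, "processes": bricks, "ports": bricks}

def data_classification(n):
    # Each user brick splits into a replica-2 part and a replica-3 part
    # (2N internal bricks), but multiplexing folds the parts of each user
    # brick back into one process on one port, so counts return to N.
    return {"bricks": 2 * n, "processes": n, "ports": n}

print(arbiter_in_afr(6))       # {'bricks': 9, 'processes': 9, 'ports': 9}
print(data_classification(6))  # {'bricks': 12, 'processes': 6, 'ports': 6}
```

The point of the comparison: data classification creates more internal bricks on paper, but once multiplexing is in place it exposes fewer processes and ports than the AFR-internal arbiter approach.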
Re: [Gluster-devel] Split-brain present and future in afr
On 23/05/2014, at 10:17 AM, Pranith Kumar Karampuri wrote:
<snip>
> 2) That would need more bricks, more processes, more ports.

Meh to more ports. We should be moving to a model (maybe in 4.x?) where we
use fewer ports. Preferably just one or two in total, if it's feasible from
a network layer. Backup applications can manage it, and they're transferring
a tonne of data too. ;)

+ Justin

--
Open Source and Standards @ Red Hat

twitter.com/realjustinclift