Re: [Gluster-users] split-brain on glusterfs running with quorum on server and client

Pranith Kumar Karampuri Sat, 06 Sep 2014 07:01:50 -0700


On 09/06/2014 04:53 AM, Jeff Darcy wrote:

I have a replicate glusterfs setup on 3 Bricks ( replicate = 3 ). I have
client and server quorum turned on. I rebooted one of the 3 bricks. When it
came back up, the client started throwing error messages that one of the
files went into split brain.

This is a good example of how split brain can happen even with all kinds of
quorum enabled.  Let's look at those xattrs.  BTW, thank you for a very
nicely detailed bug report which includes those.

BRICK 1
========
[root@ip-172-31-38-189 ~]# getfattr -d -m . -e hex
/data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
getfattr: Removing leading '/' from absolute path names
# file:
data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
trusted.afr.PL2-client-0=0x000000000000000000000000
trusted.afr.PL2-client-1=0x000000010000000000000000
trusted.afr.PL2-client-2=0x000000010000000000000000
trusted.gfid=0xea950263977e46bf89a0ef631ca139c2

BRICK 2
=======
[root@ip-172-31-16-220 ~]# getfattr -d -m . -e hex
/data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
getfattr: Removing leading '/' from absolute path names
# file:
data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
trusted.afr.PL2-client-0=0x00000d460000000000000000
trusted.afr.PL2-client-1=0x000000000000000000000000
trusted.afr.PL2-client-2=0x000000000000000000000000
trusted.gfid=0xea950263977e46bf89a0ef631ca139c2
BRICK 3
=========
[root@ip-172-31-12-218 ~]# getfattr -d -m . -e hex
/data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
getfattr: Removing leading '/' from absolute path names
# file:
data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
trusted.afr.PL2-client-0=0x00000d460000000000000000
trusted.afr.PL2-client-1=0x000000000000000000000000
trusted.afr.PL2-client-2=0x000000000000000000000000
trusted.gfid=0xea950263977e46bf89a0ef631ca139c2
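
For anyone reading these dumps for the first time, here is a rough sketch of how the values decode. It assumes the usual AFR changelog layout, where each 12-byte trusted.afr.* value packs three big-endian 32-bit pending counters (data, metadata, entry). The helper name decode_afr is made up for illustration and is not part of any Gluster tool.

```python
import struct


def decode_afr(hex_value):
    """Split a 12-byte trusted.afr.* value into its three pending counters.

    Assumed layout: three unsigned 32-bit big-endian integers, in the
    order data, metadata, entry.
    """
    digits = hex_value[2:] if hex_value.lower().startswith("0x") else hex_value
    raw = bytes.fromhex(digits)
    if len(raw) != 12:
        raise ValueError("expected 12 bytes, got %d" % len(raw))
    data, metadata, entry = struct.unpack(">III", raw)
    return {"data": data, "metadata": metadata, "entry": entry}


# Values copied from the getfattr output above.
# Brick 1's record about client-1 (the same value is stored for client-2):
print(decode_afr("0x000000010000000000000000"))
# Brick 2's record about client-0 (brick 3 stores the same value):
print(decode_afr("0x00000d460000000000000000"))
```

Only the data counter is non-zero in these dumps, which fits a file that was being appended to while one brick was away.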

Here, we see that brick 1 shows a single pending operation for the other
two, while they show 0xd46 (3398) pending operations for brick 1.
Here's how this can happen.

(1) There is exactly one pending operation.

(2) Brick1 completes the write first, and says so.

(3) Client sends messages to all three, saying to decrement brick1's
count.

(4) All three bricks receive and process that message.

(5) Brick1 fails.

(6) Brick2 and brick3 complete the write, and say so.

(7) Client tells all bricks to decrement remaining counts.

(8) Brick2 and brick3 receive and process that message.

(9) Brick1 is dead, so its counts for brick2/3 stay at one.

(10) Brick2 and brick3 have quorum, with all-zero pending counters.

(11) Client sends 0xd46 more writes to brick2 and brick3.
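
The sequence above can be replayed with a toy model to check that it ends in the counters from the bug report. This is only a sketch under simplifying assumptions: each brick keeps one pending count per brick, a pre-op increments every count on every reachable brick, and a post-op decrements the counts of the bricks whose write succeeded. The function names are hypothetical and the model ignores everything else AFR does (locking, metadata and entry counters, delayed changelog).

```python
def pre_op(bricks, up):
    """Before a write, each reachable brick bumps its pending count for all bricks."""
    for b in up:
        bricks[b] = [count + 1 for count in bricks[b]]


def post_op(bricks, up, succeeded):
    """After a write, each reachable brick clears one count per brick that succeeded."""
    for b in up:
        for s in succeeded:
            bricks[b][s] -= 1


def in_split_brain(bricks):
    """Simplified check: two bricks each hold a non-zero pending count for the other."""
    n = len(bricks)
    return any(
        bricks[i][j] > 0 and bricks[j][i] > 0
        for i in range(n)
        for j in range(i + 1, n)
    )


def replay():
    bricks = [[0, 0, 0] for _ in range(3)]  # bricks[b][x]: what brick b records about brick x
    everyone = [0, 1, 2]
    pre_op(bricks, everyone)              # (1) exactly one pending operation
    post_op(bricks, everyone, [0])        # (2)-(4) brick1 done first, all three decrement its count
    survivors = [1, 2]                    # (5) brick1 fails
    post_op(bricks, survivors, [1, 2])    # (6)-(9) brick1 never sees this decrement
    for _ in range(0xD46):                # (11) further writes reach only brick2 and brick3
        pre_op(bricks, survivors)
        post_op(bricks, survivors, survivors)
    return bricks


final = replay()
print(final)
print(in_split_brain(final))
```

In this model brick1 ends with counts of one for brick2 and brick3, while they each hold 0xd46 for brick1, which is the same shape as the xattrs in the report.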

Note that at no point did we lose quorum. Note also the tight timing
required.  If brick1 had failed an instant earlier, it would not have
decremented its own counter.  If it had failed an instant later, it
would have decremented brick2's and brick3's as well.  If brick1 had not
finished first, we'd be in yet another scenario.  If delayed changelog
had been operative, the messages at (3) and (7) would have been combined
to leave us in yet another scenario.  As far as I can tell, we would
have been able to resolve the conflict in all those cases.
*** Key point: quorum enforcement does not totally eliminate split
brain.  It only makes the frequency a few orders of magnitude lower. ***

Not quite right. After we fixed the bug https://bugzilla.redhat.com/show_bug.cgi?id=1066996, the only two possible ways to introduce split-brain are:

1) If we have an implementation bug in changelog xattr marking. I believe that to be the case here.

2) Keep writing to the file from the mount, then:
a) take brick 1 down, wait until at least one write is successful

b) bring brick 1 back up and take brick 2 down (self-heal should not happen), wait until at least one write is successful

c) bring brick 2 back up and take brick 3 down (self-heal should not happen), wait until at least one write is successful
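
To make case 2 concrete, here is a small sketch of the rolling-outage sequence. It assumes the same simplified bookkeeping as the rest of this thread: a write that succeeds while a brick is down leaves each surviving brick with one more pending count against the missing brick, and no self-heal runs between the steps. The names write and unaccused are hypothetical.

```python
def write(bricks, up):
    """Record one successful write made while only the bricks in `up` are reachable."""
    # Client quorum on a 3-way replica: more than half the bricks must be up.
    assert len(up) * 2 > len(bricks), "write would be rejected without quorum"
    down = [i for i in range(len(bricks)) if i not in up]
    for b in up:
        for d in down:
            bricks[b][d] += 1  # brick b now records a pending op against brick d


def unaccused(bricks):
    """Bricks that no other brick holds a pending count against (possible heal sources)."""
    n = len(bricks)
    return [x for x in range(n) if all(bricks[b][x] == 0 for b in range(n) if b != x)]


bricks = [[0, 0, 0] for _ in range(3)]  # bricks[b][x]: what brick b records about brick x
write(bricks, up=[1, 2])  # a) brick 1 down
write(bricks, up=[0, 2])  # b) brick 1 back, brick 2 down, no self-heal in between
write(bricks, up=[0, 1])  # c) brick 2 back, brick 3 down, no self-heal in between

print(bricks)
print(unaccused(bricks))
```

Every write had two of three bricks available, so quorum was never lost, yet each brick ends up accused by the other two and there is no clean copy to heal from.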

With the outcast implementation, case 2 will also be immune to split-brain errors.

Then the only way we have split-brains in afr is implementation errors of changelog marking. If we test it thoroughly and fix such problems we can get it to be immune to split-brain :-).


Pranith

So, is there any way to prevent this completely?  Some AFR enhancements,
such as the oft-promised "outcast" feature[1], might have helped.
NSR[2] is immune to this particular problem.  "Policy based split brain
resolution"[3] might have resolved it automatically instead of merely
flagging it.  Unfortunately, those are all in the future.  For now, I'd
say the best approach is to resolve the conflict manually and try to
move on.  Unless there's more going on than meets the eye, recurrence
should be very unlikely.

[1] http://www.gluster.org/community/documentation/index.php/Features/outcast

[2] http://www.gluster.org/community/documentation/index.php/Features/new-style-replication

[3] http://www.gluster.org/community/documentation/index.php/Features/pbspbr
_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users

