Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-03-29 Thread Erik Jacobson
Thank you for replying!! Responses below...

I have attached the volume def (meant to before).
I have attached a couple logs from one of the leaders.

> That's odd.
> As far as I know, the clients are accessing one of the gluster nodes
> that serves as NFS server and then syncs data across the peers, right?

Correct, although in this case, with a 1x3, all of them should have
local copies. Our first reports came in from 3x3 (9 server) systems but
we have been able to duplicate on 1x3 thankfully in house. This is a
huge step forward as I had no reproducer previously.

> What happens when the virtual IP(s) are failed over to the other gluster
> node? Is the issue resolved?

While we do use CTDB for managing the IP aliases, I don't start the test until
the IP is stabilized. I put all 76 nodes on one IP alias to make a more
similar load to what we have in the field.

I think it is important to point out that if I reduce the load, all is
well. For example, if the test were just booting -- where the initial
reports were seen -- just 1 or 2 nodes out of 1,000 would have an issue
each cycle. They all boot the same way and are all using the same IP
alias for NFS in my test case. So I think the split-brain messages are maybe
a symptom of some sort of timeout ??? (making stuff up here).

> Also, what kind of load balancing are you using?
[I moved this question up because the below answer has too much
output]

We are doing very simple balancing - manual balancing. As we add compute
nodes to the cluster, a couple racks are assigned to IP alias #1, the
next couple to IP alias #2, and so on. I'm happy to not have the
complexity of a real load balancer right now.
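
To make the assignment rule concrete, here is a minimal Python sketch of it. The function name and the choice of exactly 2 racks per alias are my assumptions for illustration ("a couple" in practice), not something our tooling actually contains.

```python
def alias_for_rack(rack_index: int, racks_per_alias: int = 2) -> int:
    """Return the 1-based IP alias number for a 0-based rack index.

    Hypothetical helper: racks are assigned to aliases in fixed-size
    groups, in the order the racks were added to the cluster.
    """
    if rack_index < 0:
        raise ValueError("rack_index must be >= 0")
    if racks_per_alias < 1:
        raise ValueError("racks_per_alias must be >= 1")
    return rack_index // racks_per_alias + 1


# Racks 0 and 1 share alias #1, racks 2 and 3 share alias #2, and so on.
for rack in range(6):
    print(f"rack {rack} -> IP alias #{alias_for_rack(rack)}")
```

The point of keeping it this simple is that the mapping is static and easy to reason about; nothing rebalances on its own.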


> Do you get any split brain entries via 'gluster volume heal info'?

I ran three trials for the 'gluster volume heal ...'

Trial 1 - with all 3 servers up and while running the load:
[root@leader2 ~]# gluster volume heal cm_shared info
Brick 172.23.0.4:/data/brick_cm_shared
Status: Connected
Number of entries: 0

Brick 172.23.0.5:/data/brick_cm_shared
Status: Connected
Number of entries: 0

Brick 172.23.0.6:/data/brick_cm_shared
Status: Connected
Number of entries: 0


Trial 2 - with 1 server down (stopped glusterd on 1 server) and before
running any test load, I see the output below. Let me explain, though:
outside the error path, I am using RW NFS filesystem image blobs on this
same volume for the writable areas of the nodes. In the field, we
duplicate the problem when using TMPFS for that writable area. I am
happy to re-do the test with RO NFS and TMPFS for the writable area; my
GUESS is that the healing messages would then go away. Would that help?
If you look at the heal count of 76 (Trial 3 below), that equals the node
count, i.e. the number of writable XFS image files, one per node.

[root@leader2 ~]# gluster volume heal cm_shared info
Brick 172.23.0.4:/data/brick_cm_shared
Status: Transport endpoint is not connected
Number of entries: -

Brick 172.23.0.5:/data/brick_cm_shared
Status: Connected
Number of entries: 8

Brick 172.23.0.6:/data/brick_cm_shared
Status: Connected
Number of entries: 8



Trial 3 - ran the heal command around the time the split-brain errors
were being reported


[root@leader2 glusterfs]# gluster volume heal cm_shared info
Brick 172.23.0.4:/data/brick_cm_shared
Status: Transport endpoint is not connected
Number of entries: -

Brick 172.23.0.5:/data/brick_cm_shared
Status: Connected
Number of entries: 76

Brick 172.23.0.6:/data/brick_cm_shared
Status: Connected
Number of entries: 76
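
In case it helps anyone compare trials like these, here is a minimal Python sketch that turns the text of 'gluster volume heal <vol> info' into one record per brick. It only assumes the layout visible in the output above ("Brick", "Status:", "Number of entries:" lines); the function name and the dict keys are my own, and any other non-blank line under a brick is simply collected as an entry.

```python
from typing import Any, Dict, List

# Trial 2 output as posted (the per-entry lines are not reproduced here).
SAMPLE = """\
Brick 172.23.0.4:/data/brick_cm_shared
Status: Transport endpoint is not connected
Number of entries: -

Brick 172.23.0.5:/data/brick_cm_shared
Status: Connected
Number of entries: 8

Brick 172.23.0.6:/data/brick_cm_shared
Status: Connected
Number of entries: 8
"""


def parse_heal_info(text: str) -> List[Dict[str, Any]]:
    """Parse heal-info text into a list of per-brick dicts."""
    bricks: List[Dict[str, Any]] = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Brick "):
            # A new brick section starts here.
            current = {"brick": line[len("Brick "):], "status": None,
                       "entries": None, "paths": []}
            bricks.append(current)
        elif current is None or not line:
            continue  # skip blank lines and anything before the first brick
        elif line.startswith("Status:"):
            current["status"] = line.split(":", 1)[1].strip()
        elif line.startswith("Number of entries:"):
            value = line.split(":", 1)[1].strip()
            # A disconnected brick reports "-" instead of a number.
            current["entries"] = int(value) if value.isdigit() else None
        else:
            current["paths"].append(line)  # a listed heal entry
    return bricks


for brick in parse_heal_info(SAMPLE):
    print(brick["brick"], "|", brick["status"], "|", brick["entries"])
```

A brick whose count comes back as None is one the command could not reach, which is different from a reachable brick with 0 entries.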

 
Volume Name: cm_shared
Type: Replicate
Volume ID: f6175f56-8422-4056-9891-f9ba84756b87
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 172.23.0.4:/data/brick_cm_shared
Brick2: 172.23.0.5:/data/brick_cm_shared
Brick3: 172.23.0.6:/data/brick_cm_shared
Options Reconfigured:
nfs.event-threads: 3
config.brick-threads: 16
config.client-threads: 16
performance.iot-pass-through: false
config.global-threading: off
performance.client-io-threads: on
nfs.disable: off
storage.fips-mode-rchecksum: on
transport.address-family: inet
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
cluster.lookup-optimize: on
client.event-threads: 32
server.event-threads: 32
performance.stat-prefetch: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 100
performance.io-thread-count: 32
performance.cache-size: 8GB
performance.parallel-readdir: on
cluster.lookup-unhashed: auto
performance.flush-behind: on
performance.aggregate-size: 2048KB
performance.write-behind-trickling-writes: off
transport.listen-backlog: 16384
performance.write-behind-window-size: 1024MB
server.outstanding-rpc-limit: 1024
nfs.outstanding-rpc-limit: 1024
nfs.acl: on
storage.max-hardlinks: 0
performance.cache-refresh-timeout: 60
per

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-03-29 Thread Ravishankar N



On 29/03/20 9:40 am, Erik Jacobson wrote:
> Hello all,
>
> I am getting split-brain errors in the gnfs nfs.log when 1 gluster
> server is down in a 3-brick/3-node gluster volume. It only happens under
> intense load.
>
> In the lab, I have a test case that can repeat the problem on a single
> subvolume cluster.
>
> If all leaders are up, we see no errors.
>
> Here are example nfs.log errors:
>
> [2020-03-29 03:42:52.295532] E [MSGID: 108008]
> [afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: Failing
> ACCESS on gfid 8eed77d3-b4fa-4beb-a0e7-e46c2b71ffe1: split-brain observed.
> [Input/output error]
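
To see which gfids are affected and how often, the nfs.log could be summarised with something like the Python sketch below. It is only a sketch: the regular expression is written against the single example message quoted above (assumed to be one line in the real log, unwrapped), and the names SPLIT_BRAIN_RE and split_brain_counts are mine.

```python
import re
from collections import Counter
from typing import Iterable

# Matches MSGID 108008 messages of the form quoted above.
SPLIT_BRAIN_RE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] E \[MSGID: 108008\] .*?"
    r"Failing (?P<fop>\w+) on gfid (?P<gfid>[0-9a-f-]{36}): "
    r"split-brain observed"
)

# The example message from the report, joined back into one line.
EXAMPLE = (
    "[2020-03-29 03:42:52.295532] E [MSGID: 108008] "
    "[afr-read-txn.c:312:afr_read_txn_refresh_done] "
    "0-cm_shared-replicate-0: Failing ACCESS on gfid "
    "8eed77d3-b4fa-4beb-a0e7-e46c2b71ffe1: split-brain observed. "
    "[Input/output error]"
)


def split_brain_counts(lines: Iterable[str]) -> Counter:
    """Count split-brain failures per (operation, gfid) pair."""
    counts: Counter = Counter()
    for line in lines:
        match = SPLIT_BRAIN_RE.search(line)
        if match:
            counts[(match.group("fop"), match.group("gfid"))] += 1
    return counts


for (fop, gfid), n in split_brain_counts([EXAMPLE]).items():
    print(f"{n} x {fop} on gfid {gfid}")
```

For a real log, pass an open file object instead of the one-element list. The resulting gfids can then be compared with the entries that heal info lists for the bricks that are still up.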

Since you say that the errors go away when all 3 bricks (which I guess
is what you refer to as 'leaders') of the replica are up, it is
possible that the brick you brought down had the only good copy. In such
cases, even though you have the other 2 bricks of the replica up, they
both are bad copies waiting to be healed and hence all operations on
those files will fail with EIO. Since you say this occurs under high
load only, I suspect this is the case, since heal hasn't had the time to
catch up with the nodes going up and down.
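
The reasoning above can be put as a toy model in Python. This is deliberately simplified and is not how the replicate translator is implemented: the real decision is made from its own pending-heal metadata, which is not modelled here, and the function name and the sets are purely illustrative.

```python
from typing import Set


def read_outcome(good_copies: Set[str], online: Set[str]) -> str:
    """Return "OK" if some online brick holds a good copy, else "EIO".

    Toy model only: a read can be served when at least one brick that
    is currently up has a copy that is not waiting to be healed.
    """
    return "OK" if good_copies & online else "EIO"


BRICKS = {"172.23.0.4", "172.23.0.5", "172.23.0.6"}

# Suppose the only good copy of a file is on the brick that was taken down.
good = {"172.23.0.4"}

print(read_outcome(good, BRICKS))                   # all 3 bricks up
print(read_outcome(good, BRICKS - {"172.23.0.4"}))  # good copy offline
```

With all bricks up the read is served from the good copy; with that brick down, two bricks are online but neither can serve the file, which matches the EIO seen only while one server is down.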


If you see the split-brain errors despite all 3 replica bricks being 
online and the gnfs server being able to connect to all of them, then it 
could be a genuine split-brain problem. But I don't think that is the 
case here.


Regards,
Ravi





Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968

Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users