[Gluster-users] Question About Stable Versions

2019-09-16 Thread Timothy Orme
Hi All,

I'm new to Gluster and got a little lost in the docs trying to figure out
what the de facto stable version is.  I see there is a relatively new release
cycle, and versions 4, 5, 6, and 7 are all listed as being maintained.  I
tried a couple of releases in the 6.x line, ran into some issues, and ended
up reverting to 4.1.  Is there a recommended version that is suited for
production?
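
For reference, a quick sketch of confirming what a cluster is actually
running, using only the stock gluster CLI (this assumes a reasonably recent
release where the op-version queries are available):

  # Installed build on this node:
  gluster --version

  # Cluster-wide operating version currently in effect, and the highest
  # op-version this build could be bumped to:
  gluster volume get all cluster.op-version
  gluster volume get all cluster.max-op-version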

Thanks!
Tim


[Gluster-users] split-brain errors under heavy load when one brick down

2019-09-16 Thread Erik Jacobson
Hello all. I'm new to the list but not to gluster.

We are using gluster to serve NFS boot (NFS root) on a top500 cluster. It is
a Distributed-Replicate volume, 3 x 3 = 9 bricks (three replica-3 subvolumes).
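
For reference, a volume with that layout would typically be created along
these lines; the hostnames here are placeholders (only the brick path matches
the real volume info further down), so this is just a sketch of the shape,
not the actual build commands:

  gluster volume create cm_shared replica 3 \
      server1:/data/brick_cm_shared server2:/data/brick_cm_shared \
      server3:/data/brick_cm_shared server4:/data/brick_cm_shared \
      server5:/data/brick_cm_shared server6:/data/brick_cm_shared \
      server7:/data/brick_cm_shared server8:/data/brick_cm_shared \
      server9:/data/brick_cm_shared
  gluster volume start cm_shared

  # With replica 3 and 9 bricks, gluster lays out three replica-3 subvolumes
  # (bricks 1-3, 4-6, 7-9), i.e. the "3 x 3 = 9" shown in the volume info below.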

We are having a problem when one server in a subvolume goes down: we get
random missing files and split-brain errors in the nfs.log file.

We are using Gluster NFS (we are interested in switching to Ganesha, but this
workload presents problems there that we still need to work through).
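
For anyone trying to reproduce this, "Gluster NFS" here means the built-in
gNFS server; a minimal sketch of checking that it is the one in use and
watching its log (assuming the default log location):

  # gNFS only runs for volumes where nfs.disable is off:
  gluster volume get cm_shared nfs.disable

  # The gNFS server on each Gluster server logs to nfs.log:
  tail -f /var/log/glusterfs/nfs.log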

Unfortunately, as with many such large systems, I am unable to take much of
the system out of service for debugging and unable to take the system down
for testing very often. However, my hope is to be well prepared when the next
large system comes through the factory, so I can try to reproduce this issue
or at least have some things to try.

In the lab, I have a test system with the same 3 x 3 = 9 setup as the
customer site, but with only 3 compute nodes instead of 2,592. We use CTDB
for IP alias management; the compute nodes mount NFS via the alias.
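
The CTDB piece is just the usual nodes/public_addresses setup; a rough sketch
with made-up alias addresses and interface names (only the 172.23.0.x server
addresses match the volume info further down):

  # /etc/ctdb/nodes - one private address per Gluster/GNFS server
  # (remaining servers omitted here)
  172.23.0.3
  172.23.0.4
  172.23.0.5

  # /etc/ctdb/public_addresses - floating aliases the compute nodes mount
  172.23.255.1/16 eth0
  172.23.255.2/16 eth0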

Here is the issue we are having:
- 2592 nodes all PXE-booting at once and using the Gluster servers as
  their NFS root is working great. This includes when one subvolume is
  degraded due to the loss of a server. No issues at boot, no split-brain
  messages in the log.
- The problem comes in when we do an intensive job launch. This launch
  uses SLURM and then loads hundreds of shared libraries over NFS across
  all 2592 nodes.
- When all servers in the 3x9 pool are up, we're in good shape - no
  issues on the compute nodes, no split-brain messages in the log.
- When one subvolume has one missing server (its ethernet adapters
  died), while we boot fine, the SLURM launch has random missing files.
  Gluster nfs.log shows split-brain messages and ACCESS I/O errors.
- Taking an example failed file and accessing it across all compute nodes
  afterwards always works; the issue is transient.
- The missing file is always present on the other bricks in the subvolume
  when we search for it there as well (see the getfattr sketch after this
  list).
- No FS/disk IO errors in the logs or dmesg and the files are accessible
  before and after the transient error (and from the bricks themselves as I
  said).
- So when we are degraded, the customer's jobs fail to launch; they fail
  with library read errors, missing config files, etc.
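
Since the "missing" files always turn out to be present on the surviving
bricks, one way to check them directly on the brick backends (a sketch; the
brick path is from the volume info below and the file is just the example
from the logs) is to look at the AFR changelog xattrs:

  # Run on each surviving server of the affected replica subvolume.
  # Non-zero trusted.afr.* values would indicate pending heals or a real
  # split-brain; all-zero values while gNFS still logs split-brain would
  # suggest a transient lookup problem rather than on-disk divergence.
  getfattr -d -m . -e hex \
    /data/brick_cm_shared/image/images_ro_nfs/toss-20190730/usr/lib64/libslurm.so.32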


What is perplexing is that the huge load of 2592 nodes with NFS roots
PXE-booting does not trigger the issue when one subvolume is degraded.

Thank you for reading this far and thanks to the community for
making Gluster!!

Example errors:

ex1

[2019-09-06 18:26:42.665050] E [MSGID: 108008]
[afr-read-txn.c:123:afr_read_txn_refresh_done] 0-cm_shared-replicate-1: Failing
ACCESS on gfid ee3f5646-9368-4151-92a3-5b8e7db1fbf9: split-brain observed.
[Input/output error]

ex2

[2019-09-06 18:26:55.359272] E [MSGID: 108008]
[afr-read-txn.c:123:afr_read_txn_refresh_done] 0-cm_shared-replicate-1: Failing
READLINK on gfid f2be38c2-1cd1-486b-acad-17f2321a18b3: split-brain observed.
[Input/output error]
[2019-09-06 18:26:55.359367] W [MSGID: 112199]
[nfs3-helpers.c:3435:nfs3_log_readlink_res] 0-nfs-nfsv3:
/image/images_ro_nfs/toss-20190730/usr/lib64/libslurm.so.32 => (XID: 88651c80,
READLINK: NFS: 5(I/O error), POSIX: 5(Input/output error)) target: (null)
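
A hedged note on working from these messages: the gfid in each error can be
mapped back to a path on a brick through its .glusterfs handle, e.g. for the
first gfid above (brick path from the volume info below; this works for
regular files, which keep a hard link there):

  GFID=ee3f5646-9368-4151-92a3-5b8e7db1fbf9
  BRICK=/data/brick_cm_shared
  find "$BRICK" -samefile "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID" \
      -not -path "*/.glusterfs/*"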



The errors seem to happen only on the replicate subvolume that has a server
down (and of course any of the GNFS servers can trigger them when it accesses
files on the degraded subvolume).
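
One thing worth capturing while the errors are firing (just the stock heal
queries, nothing exotic) is whether AFR itself thinks anything on that
subvolume is actually divergent:

  gluster volume heal cm_shared info
  gluster volume heal cm_shared info split-brain

  # Confirms which bricks and self-heal daemons are actually up:
  gluster volume status cm_shared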



Now, I am no longer able to access this customer system, and it is moving to
more secret work, so I can't easily run tests at that scale until we have
something else come through the factory. However, I'm desperate for help and
would like a bag of tricks to attack this with the next time I can hit it.
Having the HA fail exactly when it was needed has given the solution a bit of
a black eye, and it taught me a lesson about testing the HA properly: I had
tested full-system boots while degraded many times, but never thought to test
job launch while degraded. That pain will haunt me, but it will also make me
better.
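
In case it helps build that bag of tricks, a sketch of extra diagnostics that
could be switched on around the next degraded job-launch test; these are
standard volume options and CLI calls, and DEBUG logging is noisy, so it
should be reverted afterwards:

  # More detail from the client-side (AFR) code paths, which gNFS uses:
  gluster volume set cm_shared diagnostics.client-log-level DEBUG

  # Dump internal state (to /var/run/gluster) while the errors occur:
  gluster volume statedump cm_shared
  gluster volume statedump cm_shared nfs

  # Back to normal afterwards:
  gluster volume set cm_shared diagnostics.client-log-level INFO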



Info on the volumes:
 - RHEL 7.6 x86_64 Gluster/GNFS servers
 - Gluster version 4.1.6 (I set up the build myself)
 - Clients are AArch64 NFSv3 clients, technically configured with a read-only
   NFS root (running a version of Linux somewhat like CentOS 7.6)
 - The base filesystems for the bricks are XFS, with no LVM layer (see the
   mkfs sketch below)
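
For completeness, a sketch of how bricks like that are commonly prepared; the
device name is a placeholder, and the 512-byte inode size is the usual
Gluster recommendation so the extended attributes fit in the inode:

  mkfs.xfs -i size=512 /dev/sdb
  mkdir -p /data
  mount -o noatime /dev/sdb /data
  mkdir -p /data/brick_cm_shared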


What follows is the volume info from my test system in the lab, which
has the same versions and setup. I cannot get this info from the
customer without an approval process but the same scripts and tools set
up my test system so I'm confident the settings are the same.


[root@leader1 ~]# gluster volume info

Volume Name: cm_shared
Type: Distributed-Replicate
Volume ID: e7f2796b-7a94-41ab-a07d-bdce4900c731
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: 172.23.0.3:/data/brick_cm_shared
Brick2: 172.23.0.4:/data/brick_cm_shared
Brick3: 172.23.0.5:/data/brick_cm_shared
Brick4: 172.23.0.6:/data/