Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
On 08/04/20 9:55 pm, Erik Jacobson wrote:

> 9439138:[2020-04-08 15:48:44.737590] E [afr-common.c:754:afr_inode_event_gen_reset] (-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) [0x7fa4fb1cb735] -->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) [0x7fa4fb1cb0b9] -->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) [0x7fa4fb1b8701] ) 0-cm_shared-replicate-0: Resetting event gen for f2d7abf0-5444-48d6-863d-4b128502daf9

Could you print the function/line no. of each of these 3 functions in the backtrace and see who calls afr_inode_event_gen_reset? `addr2line` should give you that info:

addr2line -f -e /your/path/to/lib/glusterfs/7.2/xlator/cluster/afr.so 0x6f735
addr2line -f -e /your/path/to/lib/glusterfs/7.2/xlator/cluster/afr.so 0x6f0b9
addr2line -f -e /your/path/to/lib/glusterfs/7.2/xlator/cluster/afr.so 0x5c701

I think it is likely called from afr_lookup_done, which I don't think is necessary. I will send a patch for review. Once reviews are over, I will share it with you and if it fixes the issue in your testing, we can merge it with confidence.

Thanks,
Ravi

Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968

Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
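(For a log full of such backtraces, the `(+0x...)` offsets can be pulled out mechanically before handing them to addr2line. A rough sketch, using the replicate.so path from the log line above; adjust the path for your install, and note a build with debug symbols is needed for useful file:line output:)

```shell
# Extract each "+0xNNNN" in-library offset from one backtrace log line
# and resolve it with addr2line when the matching .so is present.
LOGLINE='(-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) [0x7fa4fb1cb735] -->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) [0x7fa4fb1cb0b9] -->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) [0x7fa4fb1b8701] )'
SO=/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so

# Only the "(+0x...)" offsets are relative to the .so; the bracketed
# absolute addresses are not useful to addr2line.
offsets=$(printf '%s\n' "$LOGLINE" | grep -o '+0x[0-9a-f]*' | tr -d '+')

for off in $offsets; do
    echo "offset $off"
    if [ -e "$SO" ]; then
        # -f prints the function name in addition to file:line.
        addr2line -f -e "$SO" "$off"
    fi
done
```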
[Gluster-users] Impressive boot times for big clusters: NFS, Image Objects, and Sharding
I wanted to share some positive news with the group here.

Summary: Using sharding and squashfs image files instead of expanded directory trees for RO NFS OS images has led to impressive boot times for 2k-node diskless clusters using 12 servers for gluster+tftp+etc+etc.

Details:

As you may have seen in some of my other posts, we have been using gluster to boot giant clusters, some of which are in the top500 list of HPC resources. The compute nodes are diskless.

Up until now, we have done this by pushing an operating system from our head node to the storage cluster, which is made up of one or more 3-server (3-brick) subvolumes in a distributed/replicate configuration. The servers are also PXE-boot and tftpboot servers and also serve the "miniroot" (basically a fat initrd with a cluster manager toolchain). We also locate other management functions there unrelated to boot and root.

This copy of the operating system is simply a directory tree representing the whole operating system image; you could 'chroot' into it, for example. The operating system is exported as a read-only NFS mount point used by all compute nodes as the base of their root filesystem.

This has been working well, getting us boot times (not including BIOS startup) of between 10 and 15 minutes for a 2,000-node cluster. Typically a cluster like this would have 12 gluster/NFS servers in 3 subvolumes. On simple RHEL8 images without much customization, I tend to get 10 minutes.

We have observed some slow-downs for customers whose job launch workloads are very metadata-intensive, with giant loads observed on the gluster servers.

We recently started supporting RW NFS (as opposed to tmpfs) for the writable components of root, since our customers tend to prefer to keep every byte of memory for jobs.
We came up with a solution of hosting RW NFS sparse files, with XFS filesystems on top, from a writable area in gluster exported over NFS. This makes the RW NFS solution very fast because it reduces the per-node RW NFS metadata. Boot times didn't go up significantly (though our first attempt, just using a directory tree, was a slow disaster, hitting the worst-case lots-of-small-file-writes plus lots-of-metadata workload). So we solved that problem with XFS filesystem images on RW NFS.

Building on that idea, we have in our development branch a version of the solution that changes the RO NFS image to a squashfs file on a sharding volume. That is, instead of each operating system being many thousands of files that are (slowly) synced to the gluster servers, the head node makes a squashfs file out of the image and pushes that. All the compute nodes then mount the squashfs image from the NFS mount (mount the RO NFS export, then loop-mount the squashfs image).

On a 2,000-node cluster I had access to for a time, our prototype got us boot times of 5 minutes -- including the RO NFS squashfs image and the RW NFS writable areas like /etc and /var (on an XFS image file).

* We also tried RW NFS with overlay, and no problems there.

I expect that, for people who prefer the squashfs non-expanded format, we can reduce the leader-per-compute density (fewer leader nodes per compute node). Now, not all customers will want squashfs; some want to be able to edit a file and see it instantly on all nodes. However, customers looking for fast boot times, or who are suffering slowness on metadata-intensive job launch workloads, will have a new fast option.

Therefore, it's still very important that we solve the bug we're working on in the other thread. But I wanted to share something positive.
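(To make the sparse-image point concrete, here is a tiny illustration of why a sparse image file is cheap to host: the apparent size is large, but almost no blocks exist until written, so very little data and metadata actually crosses the RW NFS mount. The image name is made up, and the mkfs.xfs/loop-mount steps are only sketched in comments since they need root:)

```shell
# Create a 1 GiB sparse image file; only written blocks consume space.
# "node0001.img" is a hypothetical name for one compute node's RW image.
IMG=node0001.img
truncate -s 1G "$IMG"

stat -c 'apparent size: %s bytes' "$IMG"   # 1073741824
du -k "$IMG"                               # blocks actually allocated: ~0

# A real deployment would then, as root:
#   mkfs.xfs "$IMG"
#   mount -o loop "$IMG" /mnt/writable
rm -f "$IMG"
```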
So now I've said something positive instead of only asking for help :) :)

Erik
Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
Thank you again for the help so far. Here is what Scott Titus came up with. Let us know if you have suggestions for next steps.

We never hit the "Event gen is zero" message, so it appears that afr_access() never has a zero event_gen to begin with. However, the "Resetting event gen" message was quite chatty, growing our nfs.log to >2.4GB. Many were against a gfid populated with zeros. Around each split-brain log entry, we did find "Resetting event gen" messages containing a matching gfid:

9439138:[2020-04-08 15:48:44.737590] E [afr-common.c:754:afr_inode_event_gen_reset] (-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) [0x7fa4fb1cb735] -->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) [0x7fa4fb1cb0b9] -->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) [0x7fa4fb1b8701] ) 0-cm_shared-replicate-0: Resetting event gen for f2d7abf0-5444-48d6-863d-4b128502daf9
9439139:[2020-04-08 15:48:44.737636] E [afr-common.c:754:afr_inode_event_gen_reset] (-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) [0x7fa4fb1cb735] -->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) [0x7fa4fb1cb0b9] -->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) [0x7fa4fb1b8701] ) 0-cm_shared-replicate-0: Resetting event gen for f2d7abf0-5444-48d6-863d-4b128502daf9
9439140:[2020-04-08 15:48:44.737663] E [MSGID: 108008] [afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: Failing ACCESS on gfid f2d7abf0-5444-48d6-863d-4b128502daf9: split-brain observed. [Input/output error]
9439143:[2020-04-08 15:48:44.737801] E [afr-common.c:754:afr_inode_event_gen_reset] (-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) [0x7fa4fb1cb735] -->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) [0x7fa4fb1cb0b9] -->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) [0x7fa4fb1b8701] ) 0-cm_shared-replicate-0: Resetting event gen for f2d7abf0-5444-48d6-863d-4b128502daf9
9439145:[2020-04-08 15:48:44.737861] E [afr-common.c:754:afr_inode_event_gen_reset] (-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) [0x7fa4fb1cb735] -->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) [0x7fa4fb1cb0b9] -->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) [0x7fa4fb1b8701] ) 0-cm_shared-replicate-0: Resetting event gen for f2d7abf0-5444-48d6-863d-4b128502daf9
9439148:[2020-04-08 15:48:44.738125] E [afr-common.c:754:afr_inode_event_gen_reset] (-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) [0x7fa4fb1cb735] -->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) [0x7fa4fb1cb0b9] -->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) [0x7fa4fb1b8701] ) 0-cm_shared-replicate-0: Resetting event gen for f2d7abf0-5444-48d6-863d-4b128502daf9
9439225:[2020-04-08 15:48:44.749920] E [afr-common.c:754:afr_inode_event_gen_reset] (-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) [0x7fa4fb1cb735] -->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) [0x7fa4fb1cb0b9] -->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) [0x7fa4fb1b8701] ) 0-cm_shared-replicate-0: Resetting event gen for f2d7abf0-5444-48d6-863d-4b128502daf9

Thanks,
-Scott

On 4/8/20 8:31 AM, Erik Jacobson wrote:
> Hi team -
>
> We got an update to try more stuff from the community.
>
> I feel like I've been "given an inch but am taking a mile" but if we
> do happen to have time on orbit41 again, we'll do the next round of
> debugging.
> Erik

On Wed, Apr 08, 2020 at 01:53:00PM +0530, Ravishankar N wrote:
> On 08/04/20 4:59 am, Erik Jacobson wrote:
> > Apologies for misinterpreting the backtrace.
> >
> > #0 afr_read_txn_refresh_done (frame=0x7ffcf4146478,
> >    this=0x7fff64013720, err=5) at afr-read-txn.c:312
> > #1 0x7fff68938d2b in afr_txn_refresh_done
> >    (frame=frame@entry=0x7ffcf4146478, this=this@entry=0x7fff64013720,
> >    err=5, err@entry=0)
> >    at afr-common.c:1222
> Sorry, I missed this too.
> > (gdb) print event_generation
> > $3 = 0
> >
> > (gdb) print priv->fav_child_policy
> > $4 = AFR_FAV_CHILD_NONE
> >
> > I am not sure what this signifies though. It appears to be a read
> > transaction with no event generation and no favorite child policy.
> >
> > Feel free to ask for clarification in case my thought process went awry
> > somewhere.
>
> Favorite child policy is only for automatically resolving split-brains and
> is 0 unless that volume option is set. The problem is indeed that
> event_generation is zero. Could you try to apply this logging patch and see
> if afr_inode_event_gen_reset() for that gfid is hit or if afr_access() has a
> zero event_gen to begin with?
>
> Thanks,
>
> Ravi
>
> diff --git a/xlators/cluster/afr/src/afr-common.c
> b/xlators/cluster/afr/src/afr-common.c
> index 4bfaef9e8..61f21795e 100644
> --- a/xlators/cluster/afr/src/afr-co
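(Digging matching gfids out of a multi-gigabyte nfs.log, as above, can be done mechanically. A sketch that tallies "Resetting event gen" messages per gfid, so the all-zero-gfid noise separates from entries matching a split-brain gfid; the sample lines are abbreviated stand-ins for a real nfs.log:)

```shell
# Count "Resetting event gen" messages per gfid; point LOG at your
# real nfs.log instead of the abbreviated sample built here.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
9439138:[2020-04-08 15:48:44.737590] E ... Resetting event gen for f2d7abf0-5444-48d6-863d-4b128502daf9
9439139:[2020-04-08 15:48:44.737636] E ... Resetting event gen for 00000000-0000-0000-0000-000000000000
9439143:[2020-04-08 15:48:44.737801] E ... Resetting event gen for f2d7abf0-5444-48d6-863d-4b128502daf9
EOF

# Pull out message+gfid, keep only the gfid, then tally per gfid
# (uniq -c needs sorted input; sort -rn puts the noisiest gfid first).
grep -o 'Resetting event gen for [0-9a-f-]*' "$LOG" \
    | awk '{print $NF}' | sort | uniq -c | sort -rn

rm -f "$LOG"
```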
Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
On 08/04/20 4:59 am, Erik Jacobson wrote:
> Apologies for misinterpreting the backtrace.
>
> #0 afr_read_txn_refresh_done (frame=0x7ffcf4146478,
>    this=0x7fff64013720, err=5) at afr-read-txn.c:312
> #1 0x7fff68938d2b in afr_txn_refresh_done
>    (frame=frame@entry=0x7ffcf4146478, this=this@entry=0x7fff64013720,
>    err=5, err@entry=0)
>    at afr-common.c:1222

Sorry, I missed this too.

> (gdb) print event_generation
> $3 = 0
>
> (gdb) print priv->fav_child_policy
> $4 = AFR_FAV_CHILD_NONE
>
> I am not sure what this signifies though. It appears to be a read
> transaction with no event generation and no favorite child policy.
>
> Feel free to ask for clarification in case my thought process went awry
> somewhere.

Favorite child policy is only for automatically resolving split-brains and is 0 unless that volume option is set. The problem is indeed that event_generation is zero. Could you try to apply this logging patch and see if afr_inode_event_gen_reset() for that gfid is hit or if afr_access() has a zero event_gen to begin with?

Thanks,
Ravi

diff --git a/xlators/cluster/afr/src/afr-common.c b/xlators/cluster/afr/src/afr-common.c
index 4bfaef9e8..61f21795e 100644
--- a/xlators/cluster/afr/src/afr-common.c
+++ b/xlators/cluster/afr/src/afr-common.c
@@ -750,6 +750,8 @@ afr_inode_event_gen_reset(inode_t *inode, xlator_t *this)

     GF_VALIDATE_OR_GOTO(this->name, inode, out);

+    gf_msg_callingfn(this->name, GF_LOG_ERROR, 0, AFR_MSG_SPLIT_BRAIN,
+                     "Resetting event gen for %s", uuid_utoa(inode->gfid));
     LOCK(&inode->lock);
     {
         ret = __afr_inode_event_gen_reset(inode, this);
diff --git a/xlators/cluster/afr/src/afr-inode-read.c b/xlators/cluster/afr/src/afr-inode-read.c
index 9204add5b..5ac83d6c8 100644
--- a/xlators/cluster/afr/src/afr-inode-read.c
+++ b/xlators/cluster/afr/src/afr-inode-read.c
@@ -172,6 +172,12 @@ afr_access(call_frame_t *frame, xlator_t *this, loc_t *loc, int mask,

     if (xdata)
         local->xdata_req = dict_ref(xdata);

+    if (local->event_generation == 0)
+        gf_msg(this->name, GF_LOG_ERROR, 0, AFR_MSG_SPLIT_BRAIN,
+               "Event gen is zero for %s(%s)", local->loc.name,
+               local->loc.inode->gfid ? uuid_utoa(local->loc.inode->gfid)
+                                      : "NULL");
+
     afr_read_txn(frame, this, loc->inode, afr_access_wind,
                  AFR_METADATA_TRANSACTION);