Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-15 Thread Ravishankar N


On 10/04/20 2:06 am, Erik Jacobson wrote:

> Once again thanks for sticking with us. Here is a reply from Scott
> Titus. If you have something for us to try, we'd love it. The code had
> your patch applied when gdb was run:
>
>
> Here is the addr2line output for those addresses.  Very interesting command, of
> which I was not aware.
>
> [root@leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/cluster/afr.so 0x6f735
> afr_lookup_metadata_heal_check
> afr-common.c:2803
> [root@leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/cluster/afr.so 0x6f0b9
> afr_lookup_done
> afr-common.c:2455
> [root@leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/cluster/afr.so 0x5c701
> afr_inode_event_gen_reset
> afr-common.c:755

Right, so afr_lookup_done() is resetting the event gen to zero. This 
looks like a race between lookup and inode refresh code paths. We made 
some changes to the event generation logic in AFR. Can you apply the 
attached patch and see if it fixes the split-brain issue? It should 
apply cleanly on glusterfs-7.4.


Thanks,
Ravi
From 4389908252c886c22897d8c52c0ce027a511453f Mon Sep 17 00:00:00 2001
From: Ravishankar N 
Date: Mon, 24 Dec 2018 13:00:19 +0530
Subject: [PATCH] afr: mark pending xattrs as a part of metadata heal

...if pending xattrs are zero for all children.

Problem:
If there are no pending xattrs and a metadata heal needs to be
performed, it is possible that we end up with xattrs inadvertently
deleted from all bricks, as explained in the BZ.

Fix:
After picking one among the sources as the good copy, mark pending xattrs on
all sources to blame the sinks. Now even if this metadata heal fails midway,
a subsequent heal will still choose one of the valid sources that it
picked previously.

Fixes: #1067
Change-Id: If1b050b70b0ad911e162c04db4d89b263e2b8d7b
Signed-off-by: Ravishankar N 
---
 .../cluster/afr/src/afr-self-heal-metadata.c  | 62 ++-
 1 file changed, 61 insertions(+), 1 deletion(-)

diff --git a/xlators/cluster/afr/src/afr-self-heal-metadata.c b/xlators/cluster/afr/src/afr-self-heal-metadata.c
index f4e31b65b..03f43bad1 100644
--- a/xlators/cluster/afr/src/afr-self-heal-metadata.c
+++ b/xlators/cluster/afr/src/afr-self-heal-metadata.c
@@ -190,6 +190,59 @@ out:
     return ret;
 }
 
+static int
+__afr_selfheal_metadata_mark_pending_xattrs(call_frame_t *frame, xlator_t *this,
+                                            inode_t *inode,
+                                            struct afr_reply *replies,
+                                            unsigned char *sources)
+{
+    int ret = 0;
+    int i = 0;
+    int m_idx = 0;
+    afr_private_t *priv = NULL;
+    int raw[AFR_NUM_CHANGE_LOGS] = {0};
+    dict_t *xattr = NULL;
+
+    priv = this->private;
+    m_idx = afr_index_for_transaction_type(AFR_METADATA_TRANSACTION);
+    raw[m_idx] = 1;
+
+    xattr = dict_new();
+    if (!xattr)
+        return -ENOMEM;
+
+    for (i = 0; i < priv->child_count; i++) {
+        if (sources[i])
+            continue;
+        ret = dict_set_static_bin(xattr, priv->pending_key[i], raw,
+                                  sizeof(int) * AFR_NUM_CHANGE_LOGS);
+        if (ret) {
+            ret = -1;
+            goto out;
+        }
+    }
+
+    for (i = 0; i < priv->child_count; i++) {
+        if (!sources[i])
+            continue;
+        ret = afr_selfheal_post_op(frame, this, inode, i, xattr, NULL);
+        if (ret < 0) {
+            gf_msg(this->name, GF_LOG_INFO, -ret, AFR_MSG_SELF_HEAL_INFO,
+                   "Failed to set pending metadata xattr on child %d for %s", i,
+                   uuid_utoa(inode->gfid));
+            goto out;
+        }
+    }
+
+    afr_replies_wipe(replies, priv->child_count);
+    ret = afr_selfheal_unlocked_discover(frame, inode, inode->gfid, replies);
+
+out:
+    if (xattr)
+        dict_unref(xattr);
+    return ret;
+}
+
 /*
  * Look for mismatching uid/gid or mode or user xattrs even if
  * AFR xattrs don't say so, and pick one arbitrarily as winner. */
@@ -210,6 +263,7 @@ __afr_selfheal_metadata_finalize_source(call_frame_t *frame, xlator_t *this,
     };
     int source = -1;
     int sources_count = 0;
+    int ret = 0;
 
     priv = this->private;
 
@@ -300,7 +354,13 @@ __afr_selfheal_metadata_finalize_source(call_frame_t *frame, xlator_t *this,
             healed_sinks[i] = 1;
         }
     }
-
+    if ((sources_count == priv->child_count) && (source > -1) &&
+        (AFR_COUNT(healed_sinks, priv->child_count) != 0)) {
+        ret = __afr_selfheal_metadata_mark_pending_xattrs(frame, this, inode,
+                                                          replies, sources);
+        if (ret < 0)
+            return ret;
+    }
 out:
     afr_mark_active_sinks(this, sources, locked_on, healed_sinks);
     return source;
-- 
2.25.1
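
The "Fix" paragraph of the patch above can be illustrated with a small standalone sketch. This is not AFR code: `TOY_CHILDREN`, `toy_mark_pending()` and `toy_count_unblamed()` are hypothetical names, and the 3x3 `pending` matrix is an assumed stand-in for the per-brick pending xattrs. It only shows the idea that every source records blame against every sink, so bricks that nobody blames remain identifiable as good copies if the heal is interrupted.

```c
#include <assert.h>
#include <string.h>

#define TOY_CHILDREN 3 /* assumed replica count, for illustration only */

/* pending[i][j] != 0 means "brick i blames brick j".
 * Every source marks every sink, mirroring the two loops in the patch:
 * the first collects the sinks, the second applies the marks on the sources. */
static void toy_mark_pending(const unsigned char sources[TOY_CHILDREN],
                             int pending[TOY_CHILDREN][TOY_CHILDREN])
{
    memset(pending, 0, sizeof(int) * TOY_CHILDREN * TOY_CHILDREN);
    for (int i = 0; i < TOY_CHILDREN; i++) {
        if (!sources[i])
            continue; /* only sources record blame */
        for (int j = 0; j < TOY_CHILDREN; j++) {
            if (!sources[j])
                pending[i][j] = 1; /* source i blames sink j */
        }
    }
}

/* Count bricks that no other brick blames; in this toy model those are
 * the ones a later heal could still pick as a source. */
static int toy_count_unblamed(int pending[TOY_CHILDREN][TOY_CHILDREN])
{
    int unblamed = 0;
    for (int j = 0; j < TOY_CHILDREN; j++) {
        int blamed = 0;
        for (int i = 0; i < TOY_CHILDREN; i++)
            blamed |= pending[i][j];
        if (!blamed)
            unblamed++;
    }
    return unblamed;
}

/* Small self-check: bricks 0 and 1 are sources, brick 2 is the sink. */
int toy_pending_self_check(void)
{
    const unsigned char sources[TOY_CHILDREN] = {1, 1, 0};
    int pending[TOY_CHILDREN][TOY_CHILDREN];

    toy_mark_pending(sources, pending);
    assert(pending[0][2] == 1 && pending[1][2] == 1);
    assert(toy_count_unblamed(pending) == 2);
    return 0;
}
```

In the real patch the marks are written with afr_selfheal_post_op() on each source; the matrix here is only a way to see the result.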





Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968

Gluster-users mailing list
Gluster-users@g

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-15 Thread Ravishankar N
Attached the wrong patch by mistake in my previous mail. Sending the 
correct one now.


-Ravi


Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968

Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
From 11601e709a97ce7c40078866bf5d24b486f39454 Mon Sep 17 00:00:00 2001
From: Ravishankar N 
Date: Wed, 15 Apr 2020 13:53:26 +0530
Subject: [PATCH] afr: event gen changes

The general idea of the changes is to prevent resetting event generation
to zero in the inode ctx, since event gen is something that should
follow 'causal order'.

Change #1:
For a read txn, in inode refresh cbk, if event_generation is
found zero, we are failing the read fop. This is not needed
because change in event gen is only a marker for the next inode refresh to
happen and should not be taken into account by the current read txn.

Change #2:
The event gen being zero above can happen if there is a racing lookup,
which resets event gen (in afr_lookup_done) if there are non-zero afr
xattrs. The resetting is done only to trigger an inode refresh and a
possible client side heal on the next lookup. That can be achieved by
setting the need_refresh flag in the inode ctx. So replaced all
occurrences of resetting event gen to zero with a call to
afr_inode_need_refresh_set().

Change #3:
In both lookup and discover path, we are doing an inode refresh which is
not required since all 3 essentially do the same thing: update the inode
ctx with the good/bad copies from the brick replies. Inode refresh also
triggers background heals, but I think it is okay to do it when we call
refresh during the read and write txns and not in the lookup path.

Change-Id: Id0600dd34b144b4ae7a3bf3c397551adf7e402f1
Signed-off-by: Ravishankar N 
---
 ...ismatch-resolution-with-fav-child-policy.t |  8 +-
 xlators/cluster/afr/src/afr-common.c  | 92 ---
 xlators/cluster/afr/src/afr-dir-write.c   |  6 +-
 xlators/cluster/afr/src/afr.h |  5 +-
 4 files changed, 29 insertions(+), 82 deletions(-)

diff --git a/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t b/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
index f4aa351e4..12af0c854 100644
--- a/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
+++ b/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
@@ -168,8 +168,8 @@ TEST [ "$gfid_1" != "$gfid_2" ]
 #We know that second brick has the bigger size file
 BIGGER_FILE_MD5=$(md5sum $B0/${V0}1/f3 | cut -d\  -f1)
 
-TEST ls $M0/f3
-TEST cat $M0/f3
+TEST ls $M0 #Trigger entry heal via readdir inode refresh
+TEST cat $M0/f3 #Trigger data heal via readv inode refresh
 EXPECT_WITHIN $HEAL_TIMEOUT "^0$" get_pending_heal_count $V0
 
 #gfid split-brain should be resolved
@@ -215,8 +215,8 @@ TEST $CLI volume start $V0 force
 EXPECT_WITHIN $PROCESS_UP_TIMEOUT "1" brick_up_status $V0 $H0 $B0/${V0}2
 EXPECT_WITHIN $CHILD_UP_TIMEOUT "1" afr_child_up_status $V0 2
 
-TEST ls $M0/f4
-TEST cat $M0/f4
+TEST ls $M0 #Trigger entry heal via readdir inode refresh
+TEST cat $M0/f4  #Trigger data heal via readv inode refresh
 EXPECT_WITHIN $HEAL_TIMEOUT "^0$" get_pending_heal_count $V0
 
 #gfid split-brain should be resolved
diff --git a/xlators/cluster/afr/src/afr-common.c b/xlators/cluster/afr/src/afr-common.c
index 61f21795e..319665a14 100644
--- a/xlators/cluster/afr/src/afr-common.c
+++ b/xlators/cluster/afr/src/afr-common.c
@@ -282,7 +282,7 @@ __afr_set_in_flight_sb_status(xlator_t *this, afr_local_t *local,
 metadatamap |= (1 << index);
 }
 if (metadatamap_old != metadatamap) {
-event = 0;
+__afr_inode_need_refresh_set(inode, this);
 }
 break;
 
@@ -295,7 +295,7 @@ __afr_set_in
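
As a rough illustration of Change #2 in the commit message above, the sketch below contrasts the two ways of asking for a refresh. It is a toy model under stated assumptions, not the AFR implementation: `struct toy_inode_ctx` and all `toy_*` functions are hypothetical, and `toy_read_txn_check()` only imitates the pre-patch behaviour the commit message describes (a read failing when it finds event gen at zero).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the per-inode context. */
struct toy_inode_ctx {
    uint32_t event_gen; /* generation seen at the last refresh */
    bool need_refresh;  /* explicit "refresh on next access" marker */
};

/* Old approach: overload event_gen == 0 as the refresh request. */
static void toy_mark_stale_by_reset(struct toy_inode_ctx *ctx)
{
    ctx->event_gen = 0;
}

/* New approach: leave event_gen untouched and set a flag instead. */
static void toy_mark_stale_by_flag(struct toy_inode_ctx *ctx)
{
    ctx->need_refresh = true;
}

/* Imitates a reader that treats event_gen == 0 as a failure.
 * Returns 0 on success and -1 on failure. */
static int toy_read_txn_check(const struct toy_inode_ctx *ctx)
{
    return ctx->event_gen == 0 ? -1 : 0;
}

/* Either mechanism still causes a refresh on the next access. */
static bool toy_refresh_wanted(const struct toy_inode_ctx *ctx,
                               uint32_t current_gen)
{
    return ctx->need_refresh || ctx->event_gen != current_gen;
}

/* Small self-check comparing both approaches on identical contexts. */
int toy_event_gen_self_check(void)
{
    struct toy_inode_ctx by_reset = {.event_gen = 7, .need_refresh = false};
    struct toy_inode_ctx by_flag = {.event_gen = 7, .need_refresh = false};

    toy_mark_stale_by_reset(&by_reset); /* a racing lookup, old style */
    toy_mark_stale_by_flag(&by_flag);   /* a racing lookup, new style */

    assert(toy_read_txn_check(&by_reset) == -1); /* in-flight read fails */
    assert(toy_read_txn_check(&by_flag) == 0);   /* in-flight read survives */
    assert(toy_refresh_wanted(&by_reset, 7));
    assert(toy_refresh_wanted(&by_flag, 7));
    return 0;
}
```

In this model the flag keeps the generation counter monotonic while still requesting the refresh, which is the property the commit message calls following 'causal order'.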

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-15 Thread Erik Jacobson
> Attached the wrong patch by mistake in my previous mail. Sending the correct
> one now.

Early results look GREAT !!

We'll keep beating on it. We applied it to gluster72 as that is what we
have to ship with. It applied fine with some line moves.

If you would like us to also run a test with gluster74 so that you can
say that's tested, we can run that test. I can do a special build.

THANK YOU!!


Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-15 Thread Erik Jacobson
After several successful runs of the test case, we thought the problem
was solved. Indeed, split-brain is gone.

But we're triggering a seg fault now, even in a less loaded case.

We're going to switch to gluster74, which was your intention, and report
back.


Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-15 Thread Erik Jacobson
The new split-brain issue is much harder to reproduce, but after several
intense runs, it usually hits once.

We switched to pure gluster74 plus your patch so we're apples to apples
now.

I'm going to see if Scott can help debug it.

Here is the back trace info from the core dump:

-rw-r-  1 root root 1.9G Apr 15 12:40 core.glusterfs.0.52467a7e67964553aa9971eb2bb0148c.61084.158697232400
-rw-r-  1 root root 221M Apr 15 12:40 core.glusterfs.0.52467a7e67964553aa9971eb2bb0148c.61084.158697232400.lz4
drwxrwxrwt  9 root root  20K Apr 15 12:40 .
[root@leader3 tmp]#
[root@leader3 tmp]#
[root@leader3 tmp]# gdb core.glusterfs.0.52467a7e67964553aa9971eb2bb0148c.61084.158697232400
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-5.el8
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
.
Find the GDB manual and other documentation resources online at:
.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
[New LWP 61102]
[New LWP 61085]
[New LWP 61087]
[New LWP 61117]
[New LWP 61086]
[New LWP 61108]
[New LWP 61089]
[New LWP 61090]
[New LWP 61121]
[New LWP 61088]
[New LWP 61091]
[New LWP 61093]
[New LWP 61095]
[New LWP 61092]
[New LWP 61094]
[New LWP 61098]
[New LWP 61096]
[New LWP 61097]
[New LWP 61084]
[New LWP 61100]
[New LWP 61103]
[New LWP 61104]
[New LWP 61099]
[New LWP 61105]
[New LWP 61101]
[New LWP 61106]
[New LWP 61109]
[New LWP 61107]
[New LWP 61112]
[New LWP 61119]
[New LWP 61110]
[New LWP 6]
[New LWP 61118]
[New LWP 61123]
[New LWP 61122]
[New LWP 61113]
[New LWP 61114]
[New LWP 61120]
[New LWP 61116]
[New LWP 61115]

warning: core file may not match specified executable file.
Reading symbols from /usr/sbin/glusterfsd...Reading symbols from 
/usr/lib/debug/usr/sbin/glusterfsd-7.4-1.el8722.0800.200415T1052.a.rhel8hpeerikj.x86_64.debug...done.
done.

warning: Ignoring non-absolute filename: 
Missing separate debuginfo for linux-vdso.so.1
Try: dnf --enablerepo='*debug*' install 
/usr/lib/debug/.build-id/06/44254f9cbaa826db070a796046026adba58266

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments
Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id 
gluster/nfs -p /var/run/gluster/n'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x7fe63bb5d7bb in FRAME_DESTROY (frame=0x7fe5ac096288)
at ../../../../libglusterfs/src/glusterfs/stack.h:193
193 FRAME_DESTROY(frame);
[Current thread is 1 (Thread 0x7fe617fff700 (LWP 61102))]
Missing separate debuginfos, use: dnf debuginfo-install 
glibc-2.28-42.el8.x86_64 keyutils-libs-1.5.10-6.el8.x86_64 
krb5-libs-1.16.1-22.el8.x86_64 libacl-2.2.53-1.el8.x86_64 
libattr-2.4.48-3.el8.x86_64 libcom_err-1.44.3-2.el8.x86_64 
libgcc-8.2.1-3.5.el8.x86_64 libselinux-2.8-6.el8.x86_64 
libtirpc-1.1.4-3.el8.x86_64 libuuid-2.32.1-8.el8.x86_64 
openssl-libs-1.1.1-8.el8.x86_64 pcre2-10.32-1.el8.x86_64 
zlib-1.2.11-10.el8.x86_64
(gdb) bt
#0  0x7fe63bb5d7bb in FRAME_DESTROY (frame=0x7fe5ac096288)
at ../../../../libglusterfs/src/glusterfs/stack.h:193
#1  STACK_DESTROY (stack=0x7fe5ac6d65f8)
at ../../../../libglusterfs/src/glusterfs/stack.h:193
#2  rda_fill_fd_cbk (frame=0x7fe5acf18eb8, cookie=<optimized out>,
this=0x7fe63c0162b0, op_ret=3, op_errno=0, entries=<optimized out>,
xdata=0x0) at readdir-ahead.c:623
#3  0x7fe63bd6c3aa in afr_readdir_cbk (frame=<optimized out>,
cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>,
op_errno=<optimized out>, s

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-15 Thread Erik Jacobson
> The new split-brain issue is much harder to reproduce, but after several

(correcting to say new seg fault issue, the split brain is gone!!)

> intense runs, it usually hits once.

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-15 Thread Erik Jacobson
It is important to note that our testing has shown zero split-brain
errors since the patch, and that the seg fault is significantly harder
to hit than split-brain was before. It's still sufficiently frequent
that we can't let it out the door.  In my intensive test case (found
elsewhere in the thread), split-brain previously occurred at least once
in every run with 57 nodes. With the patch, there is zero split brain,
but maybe 1 in 4 runs seg faults. We didn't have a seg fault problem
previously. This is all within the context of 1 of the 3 servers in the
subvolume being down. I hit the seg fault once with just 57 nodes
booting (using NFS for their root FS) and no other load.


Scott was able to take an analysis pass. Any suggestions? His words
follow:


The segfault appears to occur in readdir-ahead functionality.  We will keep 
the core in case it needs to be looked at again, being sure to copy off 
all necessary metadata to maintain adequate symbol lookup within gdb.  
It may also be possible to breakpoint immediately prior to the segfault, 
but setting the right conditions may prove to be difficult.

A bit of analysis:

Prior to the segfault, the op_errno field in a struct rda_fd_ctx 
shows an ENOENT error (op_errno = 2).  The structure is reached from the 
call_frame_t parameter of rda_fill_fd_cbk() (backtrace frame #2).  The 
following shows the progression from the call_frame_t parameter to the 
op_errno field of the rda_fd_ctx structure.

(gdb) print {call_frame_t}0x7fe5acf18eb8
$26 = {root = 0x7fe5ac6d65f8, parent = 0x0, frames = {next = 
0x7fe5ac6d6cf0, prev = 0x7fe5ac096298}, local = 0x7fe5ac1dbc78,
   this = 0x7fe63c0162b0, ret = 0x0, ref_count = 0, lock = {spinlock = 
0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0,
     __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = 
{__prev = 0x0, __next = 0x0}}, __size = '\000' ,
   __align = 0}}, cookie = 0x0, complete = false, op = GF_FOP_NULL, 
begin = {tv_sec = 4234, tv_nsec = 637078332}, end = {tv_sec = 4234,
     tv_nsec = 803882781}, wind_from = 0x0, wind_to = 0x0, unwind_from = 
0x0, unwind_to = 0x0}

(gdb) print {struct rda_local}0x7fe5ac1dbc78
$27 = {ctx = 0x7fe5ace46590, fd = 0x7fe60433d8b8, xattrs = 0x0, inode = 
0x0, offset = 0, generation = 0, skip_dir = 0}

(gdb) print {struct rda_fd_ctx}0x7fe5ace46590
$28 = {cur_offset = 0, cur_size = 638, next_offset = 1538, state = 36, 
lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0,
     __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 
0, __list = {__prev = 0x0, __next = 0x0}},
   __size = '\000' , __align = 0}}, entries = 
{{list = {next = 0x7fe60cda5f90, prev = 0x7fe60ca08190}, {
     next = 0x7fe60cda5f90, prev = 0x7fe60ca08190}}, d_ino = 0, 
d_off = 0, d_len = 0, d_type = 0, d_stat = {ia_flags = 0, ia_ino = 0,
   ia_dev = 0, ia_rdev = 0, ia_size = 0, ia_nlink = 0, ia_uid = 0, 
ia_gid = 0, ia_blksize = 0, ia_blocks = 0, ia_atime = 0,
   ia_mtime = 0, ia_ctime = 0, ia_btime = 0, ia_atime_nsec = 0, 
ia_mtime_nsec = 0, ia_ctime_nsec = 0, ia_btime_nsec = 0,
   ia_attributes = 0, ia_attributes_mask = 0, ia_gfid = '\000' 
, ia_type = IA_INVAL, ia_prot = {suid = 0 '\000',
     sgid = 0 '\000', sticky = 0 '\000', owner = {read = 0 '\000', 
write = 0 '\000', exec = 0 '\000'}, group = {read = 0 '\000',
   write = 0 '\000', exec = 0 '\000'}, other = {read = 0 '\000', 
write = 0 '\000', exec = 0 '\000'}}}, dict = 0x0, inode = 0x0,
     d_name = 0x7fe5ace466a8 ""}, fill_frame = 0x0, stub = 0x0, op_errno 
= 2, xattrs = 0x0, writes_during_prefetch = 0x0, prefetching = {
     lk = 0x7fe5ace466d0 "", value = 0}}
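
The pointer walk in the three gdb prints above (call_frame_t, then its local, then the ctx and its op_errno) can be restated as a short C sketch. The `toy_*` structures are hypothetical, trimmed-down stand-ins that keep only the fields followed in the walk; the real GlusterFS definitions have many more members, and this is not how readdir-ahead itself reads the value.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical minimal mirrors of the structures printed in gdb. */
struct toy_rda_fd_ctx {
    int op_errno; /* 2 in the core dump, which is ENOENT on Linux */
};

struct toy_rda_local {
    struct toy_rda_fd_ctx *ctx;
};

struct toy_call_frame {
    void *local; /* points at a toy_rda_local in this sketch */
};

/* Follow frame->local->ctx->op_errno, returning -1 if any link is NULL. */
static int toy_frame_op_errno(const struct toy_call_frame *frame)
{
    if (frame == NULL || frame->local == NULL)
        return -1;

    const struct toy_rda_local *local = frame->local;
    if (local->ctx == NULL)
        return -1;

    return local->ctx->op_errno;
}

/* Small self-check that rebuilds the chain seen in the core dump. */
int toy_walk_self_check(void)
{
    struct toy_rda_fd_ctx ctx = {.op_errno = ENOENT};
    struct toy_rda_local local = {.ctx = &ctx};
    struct toy_call_frame frame = {.local = &local};

    assert(toy_frame_op_errno(&frame) == ENOENT);
    return 0;
}
```

The NULL checks are an addition for the sketch; gdb simply dereferences whatever addresses it is given.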

The segfault occurs at the bottom of rda_fill_fd_cbk() where the rpc 
call stack frames are being destroyed.  The following are what I believe 
to be the three frames that are intended to be destroyed, but it is 
unclear which frame is causing the problem.  If I were to dig more into 
this, I would use ddd (graphical debugger).  It's been a while since I've 
done low level debugging like this, so I'm a bit rusty.

(gdb) print {call_frame_t}0x7fe5acf18eb8
$34 = {root = 0x7fe5ac6d65f8, parent = 0x0, frames = {next = 
0x7fe5ac6d6cf0, prev = 0x7fe5ac096298}, local = 0x7fe5ac1dbc78,
   this = 0x7fe63c0162b0, ret = 0x0, ref_count = 0, lock = {spinlock = 
0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0,
     __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = 
{__prev = 0x0, __next = 0x0}}, __size = '\000' ,
   __align = 0}}, cookie = 0x0, complete = false, op = GF_FOP_NULL, 
begin = {tv_sec = 4234, tv_nsec = 637078332}, end = {tv_sec = 4234,
     tv_nsec = 803882781}, wind_from = 0x0, wind_to = 0x0, unwind_from = 
0x0, unwind_to = 0x0}
(gdb) print {call_frame_t}0x7fe5ac6d6ce0
$35 = {root = 0x0, parent = 0x563f5a955920, frames = {next = 
0x7fe5ac096298, prev = 0x7fe5acf18ec8}, local = 0x0, this = 0x108a,
   ret = 0x25f90b3c, ref_count = 0, lock = {spinlock = 0, mutex = 
{__data = {__lock = 0, __count = 0, _