Re: [Gluster-users] NFS crashes - bug 1010241
On 11/20/2014 5:51 AM, Niels de Vos wrote:
> Do you have a bug for this against the 3.4 version? If not, please file
> one and I'll post the NFS change for inclusion.
>
> Note that 3.4.2 does not get any updates, you would need to use the 3.4
> stable release series, currently at 3.4.6.

I've filed a bug.

https://bugzilla.redhat.com/show_bug.cgi?id=1166278

Hopefully I did all that right, but I'm sure it can be fixed if not.

Thanks,
Shawn
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] NFS crashes - bug 1010241
... and now with attached patch :-/

On Thu, Nov 20, 2014 at 01:51:05PM +0100, Niels de Vos wrote:
> On Wed, Nov 19, 2014 at 09:21:23PM -0700, Shawn Heisey wrote:
> > On 11/19/2014 6:53 PM, Ravishankar N wrote:
> > > Heterogeneous op-version cluster is not supported. You would need to
> > > upgrade all servers.
> > >
> > > http://www.gluster.org/community/documentation/index.php/Upgrade_to_3.5
> >
> > I would be running 3.4.2 bricks with a later 3.4.x release on the NFS
> > peers, not different minor versions. I was hoping that at least would
> > be a setup that is likely to work. I would not expect things to work
> > right on a long-term basis if I mixed 3.4.2 bricks with 3.5 or 3.6 NFS
> > servers.
> >
> > I could really use a fixed 3.4.x, but having just read Joe Julian's
> > message saying that he no longer recommends 3.4 because of the large
> > number of bugfixes that have not been backported, I am not holding my
> > breath. My monitor/restart script manages the problem fairly
> > effectively, and we won't be using Gluster for longer than a few more
> > months.
> >
> > I would be willing to try patching the 3.4.2 source and installing new
> > binaries, if someone can tell me exactly how to obtain the proper source
> > and how to build new RPM packages (CentOS 6). I installed 3.4.2 using
> > the glusterfs-epel.repo file from download.gluster.org when 3.4.2 was new.
>
> I am pretty sure we can backport the changes. At least the NFS patch is
> pretty straightforward (attached). The DHT fix needs a little more
> attention while backporting (http://review.gluster.org/6219).
>
> Do you have a bug for this against the 3.4 version? If not, please file
> one and I'll post the NFS change for inclusion.
>
> Note that 3.4.2 does not get any updates, you would need to use the 3.4
> stable release series, currently at 3.4.6.
>
> Thanks,
> Niels

From 3f94f5e0c31e18f4957aeb7fa43a074a290fbf9f Mon Sep 17 00:00:00 2001
From: Niels de Vos
Date: Thu, 20 Nov 2014 13:40:06 +0100
Subject: [PATCH] gNFS: NFS segfaults with nfstest_posix tool

Problem:
nfs3_stat_to_fattr3() missed a NULL check.

FIX:
(1) Added a NULL check.
(2) In all fop cbk path, if the op_ret is -1 and op_errno is 0,
    then handle it as a special case. Set the NFS3 status as
    NFS3ERR_SERVERFAULT instead of NFS3_OK.
(3) The other component of FIX would be in DHT module and is on
    the way.

Cherry picked from commit 0b2487d3bc8bc526d9b08698ea1434e94a6420d5:
> Change-Id: I6f03c9a02d794f8b807574f2755094dab1b90c92
> BUG: 1010241
> Signed-off-by: Santosh Kumar Pradhan
> Reviewed-on: http://review.gluster.org/6026
> Reviewed-by: Rajesh Joseph
> Reviewed-by: Niels de Vos
> Reviewed-by: Vijay Bellur
> Tested-by: Gluster Build System

Change-Id: I6f03c9a02d794f8b807574f2755094dab1b90c92
Signed-off-by: Niels de Vos
---
 xlators/nfs/server/src/nfs3-helpers.c | 4 ++
 xlators/nfs/server/src/nfs3.c | 75 ---
 2 files changed, 48 insertions(+), 31 deletions(-)

diff --git a/xlators/nfs/server/src/nfs3-helpers.c b/xlators/nfs/server/src/nfs3-helpers.c
index fc910bd..95edc8b 100644
--- a/xlators/nfs/server/src/nfs3-helpers.c
+++ b/xlators/nfs/server/src/nfs3-helpers.c
@@ -275,6 +275,9 @@ nfs3_stat_to_fattr3 (struct iatt *buf)
 {
         fattr3 fa = {0, };
 
+        if (buf == NULL)
+                goto out;
+
         if (IA_ISDIR (buf->ia_type))
                 fa.type = NF3DIR;
         else if (IA_ISREG (buf->ia_type))
@@ -344,6 +347,7 @@ nfs3_stat_to_fattr3 (struct iatt *buf)
         fa.mtime.seconds = buf->ia_mtime;
         fa.mtime.nseconds = buf->ia_mtime_nsec;
 
+out:
         return fa;
 }
 
diff --git a/xlators/nfs/server/src/nfs3.c b/xlators/nfs/server/src/nfs3.c
index a72..98fb154 100644
--- a/xlators/nfs/server/src/nfs3.c
+++ b/xlators/nfs/server/src/nfs3.c
@@ -64,6 +64,19 @@
         } while (0);\
 
 
+/*
+ * Special case: If op_ret is -1, it's very unusual op_errno being
+ * 0 which means something came wrong from upper layer(s). If it
+ * happens by any means, then set NFS3 status to NFS3ERR_SERVERFAULT.
+ */
+static inline nfsstat3 nfs3_cbk_errno_status (int32_t op_ret, int32_t op_errno)
+{
+        if ((op_ret == -1) && (op_errno == 0)){
+                return NFS3ERR_SERVERFAULT;
+        }
+        return nfs3_errno_to_nfsstat3 (op_errno);
+}
+
 struct nfs3_export *
 __nfs3_get_export_by_index (struct nfs3_state *nfs3, uuid_t exportid)
 {
@@ -694,7 +707,7 @@ nfs3svc_getattr_lookup_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                 gf_log (GF_NFS, GF_LOG_WARNING, "%x: %s => -1 (%s)",
                         rpcsvc_request_xid (cs->req),
                         cs->resolvedloc.path, strerror (op_errno));
-                status = nfs3_errno_to_nfsstat3 (op_errno);
+                status = nfs3_cbk_errno_status (op_ret, op_errno);
         } else {
                 nfs_fix_generatio
Re: [Gluster-users] NFS crashes - bug 1010241
On Wed, Nov 19, 2014 at 09:21:23PM -0700, Shawn Heisey wrote:
> On 11/19/2014 6:53 PM, Ravishankar N wrote:
> > Heterogeneous op-version cluster is not supported. You would need to
> > upgrade all servers.
> >
> > http://www.gluster.org/community/documentation/index.php/Upgrade_to_3.5
>
> I would be running 3.4.2 bricks with a later 3.4.x release on the NFS
> peers, not different minor versions. I was hoping that at least would
> be a setup that is likely to work. I would not expect things to work
> right on a long-term basis if I mixed 3.4.2 bricks with 3.5 or 3.6 NFS
> servers.
>
> I could really use a fixed 3.4.x, but having just read Joe Julian's
> message saying that he no longer recommends 3.4 because of the large
> number of bugfixes that have not been backported, I am not holding my
> breath. My monitor/restart script manages the problem fairly
> effectively, and we won't be using Gluster for longer than a few more
> months.
>
> I would be willing to try patching the 3.4.2 source and installing new
> binaries, if someone can tell me exactly how to obtain the proper source
> and how to build new RPM packages (CentOS 6). I installed 3.4.2 using
> the glusterfs-epel.repo file from download.gluster.org when 3.4.2 was new.

I am pretty sure we can backport the changes. At least the NFS patch is
pretty straightforward (attached). The DHT fix needs a little more
attention while backporting (http://review.gluster.org/6219).

Do you have a bug for this against the 3.4 version? If not, please file
one and I'll post the NFS change for inclusion.

Note that 3.4.2 does not get any updates, you would need to use the 3.4
stable release series, currently at 3.4.6.

Thanks,
Niels
Re: [Gluster-users] NFS crashes - bug 1010241
On 11/19/2014 6:53 PM, Ravishankar N wrote:
> Heterogeneous op-version cluster is not supported. You would need to upgrade
> all servers.
>
> http://www.gluster.org/community/documentation/index.php/Upgrade_to_3.5

I would be running 3.4.2 bricks with a later 3.4.x release on the NFS
peers, not different minor versions. I was hoping that at least would
be a setup that is likely to work. I would not expect things to work
right on a long-term basis if I mixed 3.4.2 bricks with 3.5 or 3.6 NFS
servers.

I could really use a fixed 3.4.x, but having just read Joe Julian's
message saying that he no longer recommends 3.4 because of the large
number of bugfixes that have not been backported, I am not holding my
breath. My monitor/restart script manages the problem fairly
effectively, and we won't be using Gluster for longer than a few more
months.

I would be willing to try patching the 3.4.2 source and installing new
binaries, if someone can tell me exactly how to obtain the proper source
and how to build new RPM packages (CentOS 6). I installed 3.4.2 using
the glusterfs-epel.repo file from download.gluster.org when 3.4.2 was new.

Thanks,
Shawn
Re: [Gluster-users] NFS crashes - bug 1010241
In my experience this usually happens because of NFS lockd trying to
traverse a firewall. Turn off NFS locking on the source host and you
will be fine. The root cause is not a problem with Gluster; it's
actually a deficiency in the NFS RFCs about RPC which has never been
properly addressed.

-- Sent from my HP Pre3

On Nov 19, 2014 10:45 PM, Alex Crow wrote:

Also if OP is on non-supported gluster 3.4.x rather than RHSS or at
least 3.5.x, and given sufficient space, how about taking enough hosts
out of the cluster to bring fully up to date and store the data, syncing
the data across, updating the originals, syncing back and then adding
back the hosts you took out to the first backup?

On 20/11/14 01:53, Ravishankar N wrote:
> On 11/19/2014 10:11 PM, Shawn Heisey wrote:
>> We are running into this crash stacktrace on 3.4.2.
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=1010241
>>
>> The NFS process dies with no predictability. I've written a shell
>> script that detects the crash and runs a process to completely kill all
>> gluster processes and restart glusterd, which has eliminated
>> customer-facing fallout from these problems.
>
> No kill required. `gluster volume start <VOLNAME> force` should re-spawn
> the dead processes.
>
>> Because of continual stability problems from day one, the gluster
>> storage is being phased out, but there are many terabytes of data still
>> used there. It would be nice to have it remain stable while we still
>> use it. As soon as we can fully migrate all data to another storage
>> solution, the gluster machines will be decommissioned.
>>
>> That BZ id is specific to version 3.6, and it's always difficult for
>> mere mortals to determine which fixes have been backported to earlier
>> releases.
>
> A (not so?) easy way is to clone the source, check out the desired
> branch and grep the git-log for the commit message you're interested in.
>
>> Has the fix for bug 1010241 been backported to any 3.4 release?
>
> I just did the grep and no, it's not. I don't know if a backport is
> possible. (CC'ed the respective devs). The (two) fixes are present in
> 3.5 though.
>
>> If so, is it possible for me to upgrade my servers without being
>> concerned about the distributed+replicated volume going offline? When
>> we upgraded from 3.3 to 3.4, the volume was not fully functional as
>> soon as we upgraded one server, and did not become fully functional
>> until all servers were upgraded and rebooted.
>>
>> Assuming again that there is a 3.4 version with the fix ... the gluster
>> peers that I use for NFS do not have any bricks. Would I need to
>> upgrade ALL the servers, or could I get away with just upgrading the
>> servers that are being used for NFS?
>
> Heterogeneous op-version cluster is not supported. You would need to
> upgrade all servers.
>
> http://www.gluster.org/community/documentation/index.php/Upgrade_to_3.5
>
> Thanks,
> Ravi
>
>> Thanks,
>> Shawn

--
This message is intended only for the addressee and may contain
confidential information. Unless you are that person, you may not
disclose its contents or use it in any way and are requested to delete
the message along with any attachments and notify us immediately.
"Transact" is operated by Integrated Financial Arrangements plc. 29
Clement's Lane, London EC4N 7AE. Tel: (020) 7608 4900 Fax: (020) 7608
5300. (Registered office: as above; Registered in England and Wales
under number: 3727592). Authorised and regulated by the Financial
Conduct Authority (entered on the Financial Services Register; no. 190856).
Re: [Gluster-users] NFS crashes - bug 1010241
Also if OP is on non-supported gluster 3.4.x rather than RHSS or at
least 3.5.x, and given sufficient space, how about taking enough hosts
out of the cluster to bring fully up to date and store the data, syncing
the data across, updating the originals, syncing back and then adding
back the hosts you took out to the first backup?

On 20/11/14 01:53, Ravishankar N wrote:
> On 11/19/2014 10:11 PM, Shawn Heisey wrote:
>> We are running into this crash stacktrace on 3.4.2.
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=1010241
>>
>> The NFS process dies with no predictability. I've written a shell
>> script that detects the crash and runs a process to completely kill all
>> gluster processes and restart glusterd, which has eliminated
>> customer-facing fallout from these problems.
>
> No kill required. `gluster volume start <VOLNAME> force` should re-spawn
> the dead processes.
>
>> Because of continual stability problems from day one, the gluster
>> storage is being phased out, but there are many terabytes of data still
>> used there. It would be nice to have it remain stable while we still
>> use it. As soon as we can fully migrate all data to another storage
>> solution, the gluster machines will be decommissioned.
>>
>> That BZ id is specific to version 3.6, and it's always difficult for
>> mere mortals to determine which fixes have been backported to earlier
>> releases.
>
> A (not so?) easy way is to clone the source, check out the desired
> branch and grep the git-log for the commit message you're interested in.
>
>> Has the fix for bug 1010241 been backported to any 3.4 release?
>
> I just did the grep and no, it's not. I don't know if a backport is
> possible. (CC'ed the respective devs). The (two) fixes are present in
> 3.5 though.
>
>> If so, is it possible for me to upgrade my servers without being
>> concerned about the distributed+replicated volume going offline? When
>> we upgraded from 3.3 to 3.4, the volume was not fully functional as
>> soon as we upgraded one server, and did not become fully functional
>> until all servers were upgraded and rebooted.
>>
>> Assuming again that there is a 3.4 version with the fix ... the gluster
>> peers that I use for NFS do not have any bricks. Would I need to
>> upgrade ALL the servers, or could I get away with just upgrading the
>> servers that are being used for NFS?
>
> Heterogeneous op-version cluster is not supported. You would need to
> upgrade all servers.
>
> http://www.gluster.org/community/documentation/index.php/Upgrade_to_3.5
>
> Thanks,
> Ravi
>
>> Thanks,
>> Shawn

--
This message is intended only for the addressee and may contain
confidential information. Unless you are that person, you may not
disclose its contents or use it in any way and are requested to delete
the message along with any attachments and notify us immediately.
"Transact" is operated by Integrated Financial Arrangements plc. 29
Clement's Lane, London EC4N 7AE. Tel: (020) 7608 4900 Fax: (020) 7608
5300. (Registered office: as above; Registered in England and Wales
under number: 3727592). Authorised and regulated by the Financial
Conduct Authority (entered on the Financial Services Register; no. 190856).
Re: [Gluster-users] NFS crashes - bug 1010241
On 11/19/2014 10:11 PM, Shawn Heisey wrote:
> We are running into this crash stacktrace on 3.4.2.
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1010241
>
> The NFS process dies with no predictability. I've written a shell
> script that detects the crash and runs a process to completely kill all
> gluster processes and restart glusterd, which has eliminated
> customer-facing fallout from these problems.

No kill required. `gluster volume start <VOLNAME> force` should re-spawn
the dead processes.

> Because of continual stability problems from day one, the gluster
> storage is being phased out, but there are many terabytes of data still
> used there. It would be nice to have it remain stable while we still
> use it. As soon as we can fully migrate all data to another storage
> solution, the gluster machines will be decommissioned.
>
> That BZ id is specific to version 3.6, and it's always difficult for
> mere mortals to determine which fixes have been backported to earlier
> releases.

A (not so?) easy way is to clone the source, check out the desired
branch and grep the git-log for the commit message you're interested in.

> Has the fix for bug 1010241 been backported to any 3.4 release?

I just did the grep and no, it's not. I don't know if a backport is
possible. (CC'ed the respective devs). The (two) fixes are present in
3.5 though.

> If so, is it possible for me to upgrade my servers without being
> concerned about the distributed+replicated volume going offline? When
> we upgraded from 3.3 to 3.4, the volume was not fully functional as
> soon as we upgraded one server, and did not become fully functional
> until all servers were upgraded and rebooted.
>
> Assuming again that there is a 3.4 version with the fix ... the gluster
> peers that I use for NFS do not have any bricks. Would I need to
> upgrade ALL the servers, or could I get away with just upgrading the
> servers that are being used for NFS?

Heterogeneous op-version cluster is not supported. You would need to
upgrade all servers.

http://www.gluster.org/community/documentation/index.php/Upgrade_to_3.5

Thanks,
Ravi

> Thanks,
> Shawn
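The "clone, check out the branch, grep the git log" check described in this message can be sketched as a small shell function. This is only an illustration: the function name `fix_in_branch`, the checkout path, and the branch names in the usage note are assumptions, not something given in the thread. It uses `cd` in a subshell rather than `git -C` so that it also works with older git versions.

```shell
#!/usr/bin/env bash
# Report whether any commit reachable from a branch has a commit message
# matching a pattern.
# Usage: fix_in_branch <repo-dir> <branch> <pattern>
# Returns 0 and prints the matching commits if found, 1 if none match,
# 2 if git itself failed (bad repo path, unknown branch, ...).
fix_in_branch() {
    local repo="$1" branch="$2" pattern="$3"
    local matches
    # --grep filters on the commit message; --oneline prints "<hash> <subject>".
    matches=$(cd "$repo" && git log --oneline --grep="$pattern" "$branch") || return 2
    if [ -n "$matches" ]; then
        printf '%s\n' "$matches"
        return 0
    fi
    return 1
}
```

For example, assuming a clone in `~/src/glusterfs` with the stable branches available as `origin/release-3.4` and `origin/release-3.5`, `fix_in_branch ~/src/glusterfs origin/release-3.4 'BUG: 1010241'` would look for the bug id in commit messages. Searching for the Change-Id (`I6f03c9a02d794f8b807574f2755094dab1b90c92`) may be more reliable, since a backport can be filed under a different bug number while keeping the same Change-Id.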
[Gluster-users] NFS crashes - bug 1010241
We are running into this crash stacktrace on 3.4.2.

https://bugzilla.redhat.com/show_bug.cgi?id=1010241

The NFS process dies with no predictability. I've written a shell
script that detects the crash and runs a process to completely kill all
gluster processes and restart glusterd, which has eliminated
customer-facing fallout from these problems.

Because of continual stability problems from day one, the gluster
storage is being phased out, but there are many terabytes of data still
used there. It would be nice to have it remain stable while we still
use it. As soon as we can fully migrate all data to another storage
solution, the gluster machines will be decommissioned.

That BZ id is specific to version 3.6, and it's always difficult for
mere mortals to determine which fixes have been backported to earlier
releases.

Has the fix for bug 1010241 been backported to any 3.4 release? If so,
is it possible for me to upgrade my servers without being concerned
about the distributed+replicated volume going offline? When we upgraded
from 3.3 to 3.4, the volume was not fully functional as soon as we
upgraded one server, and did not become fully functional until all
servers were upgraded and rebooted.

Assuming again that there is a 3.4 version with the fix ... the gluster
peers that I use for NFS do not have any bricks. Would I need to
upgrade ALL the servers, or could I get away with just upgrading the
servers that are being used for NFS?

Thanks,
Shawn
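The monitor script mentioned in this message is not shown anywhere in the thread. A minimal sketch of a gentler variant, which forces a volume start instead of killing every gluster process, might look like the following. The function names and the process-matching pattern are assumptions: the pattern presumes the Gluster NFS server runs as a `glusterfs` process whose command line contains `volfile-id gluster/nfs`, which should be checked against `ps` output on the actual hosts.

```shell
#!/usr/bin/env bash
# Assumption: the Gluster NFS server appears in the process list with this
# string in its command line. Override NFS_PROCESS_PATTERN if it differs.
NFS_PROCESS_PATTERN="${NFS_PROCESS_PATTERN:-volfile-id gluster/nfs}"

# Succeeds if at least one process matches the pattern.
nfs_is_running() {
    pgrep -f "$NFS_PROCESS_PATTERN" >/dev/null
}

# Force-start the volume only when the NFS server process is missing.
# Usage: respawn_nfs_if_dead <volume-name>
respawn_nfs_if_dead() {
    local volume="$1"
    if nfs_is_running; then
        return 0
    fi
    echo "$(date '+%F %T') NFS server for ${volume} not running; forcing volume start" >&2
    # CLI form: gluster volume start <VOLNAME> force
    gluster volume start "$volume" force
}
```

A cron job or a `while sleep 60` loop could call `respawn_nfs_if_dead myvol` on each NFS peer, where `myvol` is a placeholder volume name. Whether a forced start is sufficient in every crash scenario is not established in the thread, so the existing kill-and-restart script may still be worth keeping as a fallback.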