Re: [Gluster-users] NFS crashes - bug 1010241

2014-11-20 Thread Shawn Heisey
On 11/20/2014 5:51 AM, Niels de Vos wrote:
> Do you have a bug for this against the 3.4 version? If not, please file
> one and I'll post the NFS change for inclusion.
> 
> Note that 3.4.2 does not get any updates; you would need to use the 3.4
> stable release series, currently at 3.4.6.

I've filed a bug.

https://bugzilla.redhat.com/show_bug.cgi?id=1166278

Hopefully I did all that right, but I'm sure it can be fixed if not.

Thanks,
Shawn

___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] NFS crashes - bug 1010241

2014-11-20 Thread Niels de Vos
... and now with attached patch :-/

On Thu, Nov 20, 2014 at 01:51:05PM +0100, Niels de Vos wrote:
> On Wed, Nov 19, 2014 at 09:21:23PM -0700, Shawn Heisey wrote:
> > On 11/19/2014 6:53 PM, Ravishankar N wrote:
> > > Heterogeneous op-version cluster is not supported. You would need to 
> > > upgrade all servers.
> > > 
> > > http://www.gluster.org/community/documentation/index.php/Upgrade_to_3.5
> > 
> > I would be running 3.4.2 bricks with a later 3.4.x release on the NFS
> > peers, not different minor versions.  I was hoping that at least would
> > be a setup that is likely to work.  I would not expect things to work
> > right on a long-term basis if I mixed 3.4.2 bricks with 3.5 or 3.6 NFS
> > servers.
> > 
> > I could really use a fixed 3.4.x, but having just read Joe Julian's
> > message saying that he no longer recommends 3.4 because of the large
> > number of bugfixes that have not been backported, I am not holding my
> > breath.  My monitor/restart script manages the problem fairly
> > effectively, and we won't be using Gluster for longer than a few more
> > months.
> > 
> > I would be willing to try patching the 3.4.2 source and installing new
> > binaries, if someone can tell me exactly how to obtain the proper source
> > and how to build new RPM packages (CentOS 6).  I installed 3.4.2 using
> > the glusterfs-epel.repo file from download.gluster.org when 3.4.2 was new.
> 
> I am pretty sure we can backport the changes. At least the NFS patch is
> pretty straightforward (attached). The DHT fix needs a little more
> attention while backporting (http://review.gluster.org/6219).
> 
> Do you have a bug for this against the 3.4 version? If not, please file
> one and I'll post the NFS change for inclusion.
> 
> > Note that 3.4.2 does not get any updates; you would need to use the 3.4
> stable release series, currently at 3.4.6.
> 
> Thanks,
> Niels


From 3f94f5e0c31e18f4957aeb7fa43a074a290fbf9f Mon Sep 17 00:00:00 2001
From: Niels de Vos 
Date: Thu, 20 Nov 2014 13:40:06 +0100
Subject: [PATCH] gNFS: NFS segfaults with nfstest_posix tool

Problem:
nfs3_stat_to_fattr3() missed a NULL check.

FIX:
(1) Added a NULL check.
(2) In all fop cbk path, if the op_ret is -1 and op_errno is 0,
then handle it as a special case. Set the NFS3 status as
NFS3ERR_SERVERFAULT instead of NFS3_OK.
(3) The other component of FIX would be in DHT module and
is on the way.

Cherry picked from commit 0b2487d3bc8bc526d9b08698ea1434e94a6420d5:
> Change-Id: I6f03c9a02d794f8b807574f2755094dab1b90c92
> BUG: 1010241
> Signed-off-by: Santosh Kumar Pradhan 
> Reviewed-on: http://review.gluster.org/6026
> Reviewed-by: Rajesh Joseph 
> Reviewed-by: Niels de Vos 
> Reviewed-by: Vijay Bellur 
> Tested-by: Gluster Build System 

Change-Id: I6f03c9a02d794f8b807574f2755094dab1b90c92
Signed-off-by: Niels de Vos 
---
 xlators/nfs/server/src/nfs3-helpers.c |  4 ++
 xlators/nfs/server/src/nfs3.c | 75 ---
 2 files changed, 48 insertions(+), 31 deletions(-)

diff --git a/xlators/nfs/server/src/nfs3-helpers.c b/xlators/nfs/server/src/nfs3-helpers.c
index fc910bd..95edc8b 100644
--- a/xlators/nfs/server/src/nfs3-helpers.c
+++ b/xlators/nfs/server/src/nfs3-helpers.c
@@ -275,6 +275,9 @@ nfs3_stat_to_fattr3 (struct iatt *buf)
 {
         fattr3  fa = {0, };
 
+        if (buf == NULL)
+                goto out;
+
         if (IA_ISDIR (buf->ia_type))
                 fa.type = NF3DIR;
         else if (IA_ISREG (buf->ia_type))
@@ -344,6 +347,7 @@ nfs3_stat_to_fattr3 (struct iatt *buf)
         fa.mtime.seconds = buf->ia_mtime;
         fa.mtime.nseconds = buf->ia_mtime_nsec;
 
+out:
         return fa;
 }
 
diff --git a/xlators/nfs/server/src/nfs3.c b/xlators/nfs/server/src/nfs3.c
index a72..98fb154 100644
--- a/xlators/nfs/server/src/nfs3.c
+++ b/xlators/nfs/server/src/nfs3.c
@@ -64,6 +64,19 @@
 } while (0);\
 
 
+/*
+ * Special case: If op_ret is -1, it's very unusual op_errno being
+ * 0 which means something came wrong from upper layer(s). If it
+ * happens by any means, then set NFS3 status to NFS3ERR_SERVERFAULT.
+ */
+static inline nfsstat3 nfs3_cbk_errno_status (int32_t op_ret, int32_t op_errno)
+{
+        if ((op_ret == -1) && (op_errno == 0)){
+                return NFS3ERR_SERVERFAULT;
+        }
+        return nfs3_errno_to_nfsstat3 (op_errno);
+}
+
 struct nfs3_export *
 __nfs3_get_export_by_index (struct nfs3_state *nfs3, uuid_t exportid)
 {
@@ -694,7 +707,7 @@ nfs3svc_getattr_lookup_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                 gf_log (GF_NFS, GF_LOG_WARNING,
                         "%x: %s => -1 (%s)", rpcsvc_request_xid (cs->req),
                         cs->resolvedloc.path, strerror (op_errno));
-                status = nfs3_errno_to_nfsstat3 (op_errno);
+                status = nfs3_cbk_errno_status (op_ret, op_errno);
         }
         else {
                 nfs_fix_generatio

Re: [Gluster-users] NFS crashes - bug 1010241

2014-11-20 Thread Niels de Vos
On Wed, Nov 19, 2014 at 09:21:23PM -0700, Shawn Heisey wrote:
> On 11/19/2014 6:53 PM, Ravishankar N wrote:
> > Heterogeneous op-version cluster is not supported. You would need to 
> > upgrade all servers.
> > 
> > http://www.gluster.org/community/documentation/index.php/Upgrade_to_3.5
> 
> I would be running 3.4.2 bricks with a later 3.4.x release on the NFS
> peers, not different minor versions.  I was hoping that at least would
> be a setup that is likely to work.  I would not expect things to work
> right on a long-term basis if I mixed 3.4.2 bricks with 3.5 or 3.6 NFS
> servers.
> 
> I could really use a fixed 3.4.x, but having just read Joe Julian's
> message saying that he no longer recommends 3.4 because of the large
> number of bugfixes that have not been backported, I am not holding my
> breath.  My monitor/restart script manages the problem fairly
> effectively, and we won't be using Gluster for longer than a few more
> months.
> 
> I would be willing to try patching the 3.4.2 source and installing new
> binaries, if someone can tell me exactly how to obtain the proper source
> and how to build new RPM packages (CentOS 6).  I installed 3.4.2 using
> the glusterfs-epel.repo file from download.gluster.org when 3.4.2 was new.

I am pretty sure we can backport the changes. At least the NFS patch is
pretty straightforward (attached). The DHT fix needs a little more
attention while backporting (http://review.gluster.org/6219).
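
For anyone who wants to see whether a change like this applies to their own
3.4 checkout before a release carries it, a dry run could look roughly like
the sketch below. The function name and both paths are hypothetical; only
`git apply --check` itself is standard git. The copy of the patch in a list
archive may be truncated or whitespace-damaged, so the patch file would need
to come from the original mail or the review system.

```shell
#!/bin/sh
# Sketch: dry-run a patch against a source checkout without modifying it.
# ASSUMPTION: repo_dir and patch_file are placeholders for your own paths.
# patch_file must be an absolute path, because `git -C` changes directory
# before the patch file is opened.
try_patch() {
    repo_dir=$1
    patch_file=$2
    if git -C "$repo_dir" apply --check "$patch_file" 2>/dev/null; then
        echo "applies cleanly"
    else
        echo "needs manual backporting"
        return 1
    fi
}
```

If the check passes, `git am` with the same file should record the change as
a commit; if it fails, the hunks have to be adapted by hand, which is what
backporting the DHT fix would probably involve.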

Do you have a bug for this against the 3.4 version? If not, please file
one and I'll post the NFS change for inclusion.

Note that 3.4.2 does not get any updates; you would need to use the 3.4
stable release series, currently at 3.4.6.

Thanks,
Niels



Re: [Gluster-users] NFS crashes - bug 1010241

2014-11-19 Thread Shawn Heisey
On 11/19/2014 6:53 PM, Ravishankar N wrote:
> Heterogeneous op-version cluster is not supported. You would need to upgrade 
> all servers.
> 
> http://www.gluster.org/community/documentation/index.php/Upgrade_to_3.5

I would be running 3.4.2 bricks with a later 3.4.x release on the NFS
peers, not different minor versions.  I was hoping that at least would
be a setup that is likely to work.  I would not expect things to work
right on a long-term basis if I mixed 3.4.2 bricks with 3.5 or 3.6 NFS
servers.

I could really use a fixed 3.4.x, but having just read Joe Julian's
message saying that he no longer recommends 3.4 because of the large
number of bugfixes that have not been backported, I am not holding my
breath.  My monitor/restart script manages the problem fairly
effectively, and we won't be using Gluster for longer than a few more
months.

I would be willing to try patching the 3.4.2 source and installing new
binaries, if someone can tell me exactly how to obtain the proper source
and how to build new RPM packages (CentOS 6).  I installed 3.4.2 using
the glusterfs-epel.repo file from download.gluster.org when 3.4.2 was new.

Thanks,
Shawn



Re: [Gluster-users] NFS crashes - bug 1010241

2014-11-19 Thread Paul Robert Marino
In my experience this usually happens because of NFS lockd trying to
traverse a firewall. Turn off NFS locking on the source host and you will
be fine. The root cause is not a problem with Gluster; it's actually a
deficiency in the NFS RFCs about RPC which has never been properly
addressed.

-- Sent from my HP Pre3

On Nov 19, 2014 10:45 PM, Alex Crow wrote:
Also, if the OP is on unsupported gluster 3.4.x rather than RHSS or at
least 3.5.x, and given sufficient space, how about taking enough hosts
out of the cluster to bring them fully up to date and store the data,
syncing the data across, updating the originals, syncing back and then
adding back the hosts you took out to the first backup?
On 20/11/14 01:53, Ravishankar N wrote:
> On 11/19/2014 10:11 PM, Shawn Heisey wrote:
>> We are running into this crash stacktrace on 3.4.2.
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=1010241
>>
>> The NFS process dies with no predictability.  I've written a shell
>> script that detects the crash and runs a process to completely kill all
>> gluster processes and restart glusterd, which has eliminated
>> customer-facing fallout from these problems.
>
> No kill required. `gluster volume start <VOLNAME> force` should re-spawn the dead processes.
>
>
>> Because of continual stability problems from day one, the gluster
>> storage is being phased out, but there are many terabytes of data still
>> used there.  It would be nice to have it remain stable while we still
>> use it.  As soon as we can fully migrate all data to another storage
>> solution, the gluster machines will be decommissioned.
>>
>> That BZ id is specific to version 3.6, and it's always difficult for
>> mere mortals to determine which fixes have been backported to earlier
>> releases.
>>
> A (not so?) easy way is to clone the source, check out the desired branch and grep the git log for the commit message you're interested in.
>
>> Has the fix for bug 1010241 been backported to any 3.4 release?
> I just did the grep and no, it has not been. I don't know if a backport is possible (CC'ed the respective devs). The (two) fixes are present in 3.5 though.
>
>
>> If so,
>> is it possible for me to upgrade my servers without being concerned
>> about the distributed+replicated volume going offline?  When we upgraded
>> from 3.3 to 3.4, the volume was not fully functional as soon as we
>> upgraded one server, and did not become fully functional until all
>> servers were upgraded and rebooted.
>>
>> Assuming again that there is a 3.4 version with the fix ... the gluster
>> peers that I use for NFS do not have any bricks.  Would I need to
>> upgrade ALL the servers, or could I get away with just upgrading the
>> servers that are being used for NFS?
>
> Heterogeneous op-version cluster is not supported. You would need to upgrade all servers.
>
> http://www.gluster.org/community/documentation/index.php/Upgrade_to_3.5
>
> Thanks,
> Ravi
>
>> Thanks,
>> Shawn

-- 
This message is intended only for the addressee and may contain
confidential information. Unless you are that person, you may not
disclose its contents or use it in any way and are requested to delete
the message along with any attachments and notify us immediately.
"Transact" is operated by Integrated Financial Arrangements plc. 29
Clement's Lane, London EC4N 7AE. Tel: (020) 7608 4900 Fax: (020) 7608
5300. (Registered office: as above; Registered in England and Wales
under number: 3727592). Authorised and regulated by the Financial
Conduct Authority (entered on the Financial Services Register; no. 190856).


Re: [Gluster-users] NFS crashes - bug 1010241

2014-11-19 Thread Alex Crow
Also, if the OP is on unsupported gluster 3.4.x rather than RHSS or at
least 3.5.x, and given sufficient space, how about taking enough hosts
out of the cluster to bring them fully up to date and store the data,
syncing the data across, updating the originals, syncing back and then
adding back the hosts you took out to the first backup?

On 20/11/14 01:53, Ravishankar N wrote:
> On 11/19/2014 10:11 PM, Shawn Heisey wrote:
>> We are running into this crash stacktrace on 3.4.2.
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=1010241
>>
>> The NFS process dies with no predictability.  I've written a shell
>> script that detects the crash and runs a process to completely kill all
>> gluster processes and restart glusterd, which has eliminated
>> customer-facing fallout from these problems.
>
> No kill required. `gluster volume start <VOLNAME> force` should re-spawn the
> dead processes.
>
>> Because of continual stability problems from day one, the gluster
>> storage is being phased out, but there are many terabytes of data still
>> used there.  It would be nice to have it remain stable while we still
>> use it.  As soon as we can fully migrate all data to another storage
>> solution, the gluster machines will be decommissioned.
>>
>> That BZ id is specific to version 3.6, and it's always difficult for
>> mere mortals to determine which fixes have been backported to earlier
>> releases.
>
> A (not so?) easy way is to clone the source, check out the desired branch
> and grep the git log for the commit message you're interested in.
>
>> Has the fix for bug 1010241 been backported to any 3.4 release?
>
> I just did the grep and no, it has not been. I don't know if a backport is
> possible (CC'ed the respective devs). The (two) fixes are present in 3.5 though.
>
>> If so,
>> is it possible for me to upgrade my servers without being concerned
>> about the distributed+replicated volume going offline?  When we upgraded
>> from 3.3 to 3.4, the volume was not fully functional as soon as we
>> upgraded one server, and did not become fully functional until all
>> servers were upgraded and rebooted.
>>
>> Assuming again that there is a 3.4 version with the fix ... the gluster
>> peers that I use for NFS do not have any bricks.  Would I need to
>> upgrade ALL the servers, or could I get away with just upgrading the
>> servers that are being used for NFS?
>
> Heterogeneous op-version cluster is not supported. You would need to upgrade
> all servers.
>
> http://www.gluster.org/community/documentation/index.php/Upgrade_to_3.5
>
> Thanks,
> Ravi
>
>> Thanks,
>> Shawn






Re: [Gluster-users] NFS crashes - bug 1010241

2014-11-19 Thread Ravishankar N
On 11/19/2014 10:11 PM, Shawn Heisey wrote:
> We are running into this crash stacktrace on 3.4.2.
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1010241
> 
> The NFS process dies with no predictability.  I've written a shell
> script that detects the crash and runs a process to completely kill all
> gluster processes and restart glusterd, which has eliminated
> customer-facing fallout from these problems.


No kill required. `gluster volume start <VOLNAME> force` should re-spawn the
dead processes.
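
As a rough illustration of how a monitor script could use that command
instead of killing everything, here is a minimal sketch. The volume name
`myvol` and the pid file location are assumptions, not details from this
thread, and would need checking against the actual installation; the
function names are made up for the example.

```shell
#!/bin/sh
# Hypothetical watchdog sketch: respawn the Gluster NFS server if it died.
# ASSUMPTIONS (not from the thread): the volume name "myvol" and the pid
# file path below. Verify both on your own servers before relying on this.
VOLNAME="${VOLNAME:-myvol}"
GLUSTER="${GLUSTER:-gluster}"
PIDFILE="${PIDFILE:-/var/lib/glusterd/nfs/run/nfs.pid}"

# Succeeds only if the pid file is readable and names a live process.
nfs_is_running() {
    [ -r "$PIDFILE" ] || return 1
    pid=$(cat "$PIDFILE")
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

# One check: if the NFS server is gone, ask glusterd to respawn it.
check_once() {
    if nfs_is_running; then
        echo "nfs running"
    else
        echo "nfs down, respawning"
        "$GLUSTER" volume start "$VOLNAME" force
    fi
}
```

Calling `check_once` from cron or from a `while sleep 30` loop would give
roughly the behaviour of a monitor/restart script, without stopping the
brick processes or glusterd.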


> 
> Because of continual stability problems from day one, the gluster
> storage is being phased out, but there are many terabytes of data still
> used there.  It would be nice to have it remain stable while we still
> use it.  As soon as we can fully migrate all data to another storage
> solution, the gluster machines will be decommissioned.
> 
> That BZ id is specific to version 3.6, and it's always difficult for
> mere mortals to determine which fixes have been backported to earlier
> releases.
> 

A (not so?) easy way is to clone the source, check out the desired branch
and grep the git log for the commit message you're interested in.
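
A minimal sketch of that search follows. The clone URL and the branch name
`release-3.4` in the comments are assumptions about how the repository is
laid out, and the two helper functions are invented for the example; only
`git log --grep` is standard git.

```shell
#!/bin/sh
# Sketch: list commits on a branch whose message contains a given string.
# Typical use against a clone (URL and branch name are ASSUMPTIONS):
#   git clone https://github.com/gluster/glusterfs.git
#   has_fix glusterfs origin/release-3.4 'BUG: 1010241'
find_fix() {
    repo_dir=$1   # path to the clone
    branch=$2     # branch or tag to search
    pattern=$3    # literal text expected in the commit message
    git -C "$repo_dir" log --oneline --fixed-strings --grep="$pattern" "$branch" --
}

# Report whether at least one matching commit exists on the branch.
has_fix() {
    if [ -n "$(find_fix "$1" "$2" "$3")" ]; then
        echo "present on $2"
    else
        echo "not found on $2"
    fi
}
```

A backport may be filed under a different bug number than the original fix,
so searching for the Change-Id of the original commit is probably the more
reliable pattern when it is known.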

> Has the fix for bug 1010241 been backported to any 3.4 release?

I just did the grep and no, it has not been. I don't know if a backport is
possible (CC'ed the respective devs). The (two) fixes are present in 3.5 though.


> If so,
> is it possible for me to upgrade my servers without being concerned
> about the distributed+replicated volume going offline?  When we upgraded
> from 3.3 to 3.4, the volume was not fully functional as soon as we
> upgraded one server, and did not become fully functional until all
> servers were upgraded and rebooted.
> 
> Assuming again that there is a 3.4 version with the fix ... the gluster
> peers that I use for NFS do not have any bricks.  Would I need to
> upgrade ALL the servers, or could I get away with just upgrading the
> servers that are being used for NFS?


Heterogeneous op-version cluster is not supported. You would need to upgrade 
all servers.

http://www.gluster.org/community/documentation/index.php/Upgrade_to_3.5

Thanks,
Ravi

> 
> Thanks,
> Shawn
> 



[Gluster-users] NFS crashes - bug 1010241

2014-11-19 Thread Shawn Heisey
We are running into this crash stacktrace on 3.4.2.

https://bugzilla.redhat.com/show_bug.cgi?id=1010241

The NFS process dies with no predictability.  I've written a shell
script that detects the crash and runs a process to completely kill all
gluster processes and restart glusterd, which has eliminated
customer-facing fallout from these problems.

Because of continual stability problems from day one, the gluster
storage is being phased out, but there are many terabytes of data still
used there.  It would be nice to have it remain stable while we still
use it.  As soon as we can fully migrate all data to another storage
solution, the gluster machines will be decommissioned.

That BZ id is specific to version 3.6, and it's always difficult for
mere mortals to determine which fixes have been backported to earlier
releases.

Has the fix for bug 1010241 been backported to any 3.4 release?  If so,
is it possible for me to upgrade my servers without being concerned
about the distributed+replicated volume going offline?  When we upgraded
from 3.3 to 3.4, the volume was not fully functional as soon as we
upgraded one server, and did not become fully functional until all
servers were upgraded and rebooted.

Assuming again that there is a 3.4 version with the fix ... the gluster
peers that I use for NFS do not have any bricks.  Would I need to
upgrade ALL the servers, or could I get away with just upgrading the
servers that are being used for NFS?

Thanks,
Shawn