Re: [Gluster-users] NFS crashes under load

2010-11-07 Thread Shehjar Tikoo

Thanks. I'll be looking into it. I've filed a bug at:

http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=2061

You may add yourself to the CC list for notifications.

It seems the crash is easily reproduced on your setup. Can you please 
post the log from the Gluster NFS process at TRACE log level to the bug?
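For anyone hitting the same crash, capturing that log looks roughly like
this. This is only a sketch: the pid-file and volfile paths are assumptions
for a default 3.1 RPM install, so check your glusterd working directory
before using it.

```shell
# Sketch only: restart the Gluster NFS process with TRACE logging.
# The pidfile/volfile paths below are assumptions for a default 3.1
# RPM install; adjust them to your glusterd working directory.
pidfile=/etc/glusterd/nfs/run/nfs.pid
volfile=/etc/glusterd/nfs/nfs-server.vol
cmd="glusterfs -f $volfile -L TRACE -l /var/log/glusterfs/nfs-trace.log"
echo "$cmd"
# To actually do it (commented out so the sketch is harmless to paste):
# [ -f "$pidfile" ] && kill "$(cat "$pidfile")"
# $cmd
```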


Dan Bretherton wrote:

I upgraded to GlusterFS 3.1 a couple of weeks ago and overall I am very
impressed; I think it is a big step forward.  Unfortunately there is one
"feature" that is causing me a big problem - the NFS process crashes
every few hours when under load.  I have pasted the relevant error
messages from nfs.log at the end of this message.  Incidentally, the
rest of the log file is swamped with these messages.

[2010-11-06 23:07:04.977055] E [rpcsvc.c:1249:nfs_rpcsvc_program_actor]
nfsrpc: RPC program not available

There are no apparent problems while these errors are being produced so
this issue probably isn't relevant to the crashes.


Correct. That error is misleading and will be removed in 3.1.1

Thanks
-Shehjar


To give an indication of what I mean by "under load", we have a small
HPC cluster that is used for running ocean models.  A typical model run
involves 20 processors, all needing to read simultaneously from the same
input data files at regular intervals during the run.  There are roughly
20 files, each ~1GB in size.  While this is going on, several people
are typically processing output from previous runs from this
and other (much bigger) clusters, chugging through hundreds of GB and
tens of thousands of files every few hours.  I don't think the
Gluster-NFS crashes are purely load dependent because they seem to occur
at different load levels, which is what leads me to suspect something
subtle related to the cluster's 20-processor model runs.  I would prefer
to use the GlusterFS client on the cluster's compute nodes, but
unfortunately the pre-FUSE Linux kernel has been customised in a way
that has thwarted all my attempts to build a FUSE module that the kernel
will accept (see
http://gluster.org/pipermail/gluster-users/2010-April/004538.html)

The servers that are exporting NFS are all running CentOS 5.5 with
GlusterFS installed from RPMs, and the GlusterFS volumes are distributed
(not replicated).  Two of the servers with GlusterFS bricks are actually
running SuSE Enterprise 10; I don't know if this is relevant.  I used
previous GlusterFS versions with SLES10 without any problems, but as
RPMs are not provided for SuSE I presume it is not an officially
supported distro.  For that reason I am only using the CentOS machines
as NFS servers for the GlusterFS volumes.

I would be very grateful for any suggested solutions or workarounds that
might help to prevent these NFS crashes.

-Dan.
nfs.log extract
--
[2010-11-06 23:07:10.380744] E [fd.c:506:fd_unref_unbind]
(-->/usr/lib64/glusterfs/3.1.0/xlator/debug/io-stats.so(io_stats_fstat_cbk+0x8e)
[0x2b30813e]
(-->/usr/lib64/glusterfs/3.1.0/xlator/nfs/server.so(nfs_fop_fstat_cbk+0x41)
[0x2b9a6da1]
(-->/usr/lib64/glusterfs/3.1.0/xlator/nfs/server.so(nfs3svc_readdir_fstat_cbk+0x22d)
[0x2b9b0bdd]))) : Assertion failed: fd->refcount
pending frames:

patchset: v3.1.0
signal received: 11
time of crash: 2010-11-06 23:07:10
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.1.0
/lib64/libc.so.6[0x35746302d0]
/lib64/libpthread.so.0(pthread_spin_lock+0x2)[0x357520b722]
/usr/lib64/libglusterfs.so.0(fd_unref_unbind+0x3d)[0x38f223511d]
/usr/lib64/glusterfs/3.1.0/xlator/nfs/server.so(nfs3svc_readdir_fstat_cbk+0x22d)[0x2b9b0bdd]
/usr/lib64/glusterfs/3.1.0/xlator/nfs/server.so(nfs_fop_fstat_cbk+0x41)[0x2b9a6da1]
/usr/lib64/glusterfs/3.1.0/xlator/debug/io-stats.so(io_stats_fstat_cbk+0x8e)[0x2b30813e]
/usr/lib64/libglusterfs.so.0(default_fstat_cbk+0x79)[0x38fa69]
/usr/lib64/glusterfs/3.1.0/xlator/performance/read-ahead.so(ra_attr_cbk+0x79)[0x2aeec459]
/usr/lib64/glusterfs/3.1.0/xlator/performance/write-behind.so(wb_fstat_cbk+0x9f)[0x2ace402f]
/usr/lib64/glusterfs/3.1.0/xlator/cluster/distribute.so(dht_attr_cbk+0xf4)[0x2b521d24]
/usr/lib64/glusterfs/3.1.0/xlator/protocol/client.so(client3_1_fstat_cbk+0x287)[0x2aacd2b7]
/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x38f1a0f2e2]
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x8d)[0x38f1a0f4dd]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x2c)[0x38f1a0a77c]
/usr/lib64/glusterfs/3.1.0/rpc-transport/socket.so(socket_event_poll_in+0x3f)[0x2aaac3eb435f]
/usr/lib64/glusterfs/3.1.0/rpc-transport/socket.so(socket_event_handler+0x168)[0x2aaac3eb44e8]
/usr/lib64/libglusterfs.so.0[0x38f2236ee7]
/usr/sbin/glusterfs(main+0x37d)[0x4046ad]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x357461d994]
/usr/sbin/glusterfs[0x402dc9]
-



___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users

Re: [Gluster-users] question on NFS mounting

2010-11-07 Thread Shehjar Tikoo

Joe Landman wrote:

On 11/07/2010 02:00 AM, Bernard Li wrote:


I'm not sure about distribute, but with replicate, each brick should
be able to act as the NFS server.  What does `showmount -e` say for
each brick?  And what error message did you get when you tried to
mount it?




With any kind of volume config, NFS starts up by default on all bricks. 
You'll have to ensure that no other NFS servers are running on the 
bricks when Gluster volumes are started.
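A quick way to spot a conflicting kernel NFS server is the port in the
portmapper registration. This sketch runs the check against illustrative
`rpcinfo -p` output; the sample lines are not from any machine in this
thread, and 38467 as the Gluster NFS default port is worth verifying
locally.

```shell
# Sketch: the kernel NFS server registers program 100003 on port 2049,
# while Gluster's NFS server uses a high port (38467 by default).
# Seeing 2049 in `rpcinfo -p` output therefore suggests knfsd grabbed
# the registration first. The sample output below is illustrative.
rpcinfo_out='
   program vers proto   port
    100003    3   tcp   2049  nfs
'
# A live check would use: rpcinfo_out="$(rpcinfo -p)"
port=$(printf '%s\n' "$rpcinfo_out" | awk '$1 == 100003 && $3 == "tcp" {print $4; exit}')
if [ "$port" = 2049 ]; then
    echo "kernel nfsd owns the NFS registration - stop it before starting the volume"
fi
```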




Actually, showmount didn't work.

We get permission denied.  Even after playing with the auth.allowed flag.



Please paste the output of rpcinfo -p. It'll help point out what's 
going on.
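For comparison, a healthy Gluster NFS server should show the NFS and MOUNT
programs registered over TCP. Here is a rough self-check against sample
output; the sample is illustrative (not from Joe's machine), and the
38465-38467 default ports are an assumption to verify on your install.

```shell
# Sketch: verify that rpcinfo output lists the two RPC programs the
# Gluster NFS server registers: 100003 (nfs) and 100005 (mountd).
# The sample output is illustrative.
rpcinfo_out='
   program vers proto   port
    100000    2   tcp    111  portmapper
    100003    3   tcp  38467  nfs
    100005    3   tcp  38465  mountd
'
# A live check would use: rpcinfo_out="$(rpcinfo -p)"
nfs_ok=$(printf '%s\n' "$rpcinfo_out" | awk '$1 == 100003 && $3 == "tcp" {print "yes"; exit}')
mnt_ok=$(printf '%s\n' "$rpcinfo_out" | awk '$1 == 100005 && $3 == "tcp" {print "yes"; exit}')
if [ "$nfs_ok" = yes ] && [ "$mnt_ok" = yes ]; then
    echo "nfs and mountd registered - gNFS looks up"
else
    echo "nfs/mountd missing - gNFS probably not running"
fi
```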


Thanks
-Shehjar



Cheers,

Bernard





___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Gluster crash

2010-11-07 Thread Shehjar Tikoo
Please file a bug. It'd help to have the steps to reproduce and, if it 
is easily reproduced, the client log at TRACE log level. Thanks.


Samuel Hassine wrote:

Hi all,

Our service using GlusterFS has been in production for a week and we
are handling huge traffic. Last night, one of the Gluster clients (on a
physical node with a lot of virtual engines) crashed. Can you give me
more information about the log of the crash?

Here is the log: 


pending frames:
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(CREATE)
frame : type(1) op(CREATE)

patchset: v3.0.6
signal received: 6
time of crash: 2010-11-06 05:38:11
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.0.6
/lib/libc.so.6[0x7f7644e76f60]
/lib/libc.so.6(gsignal+0x35)[0x7f7644e76ed5]
/lib/libc.so.6(abort+0x183)[0x7f7644e783f3]
/lib/libc.so.6(__assert_fail+0xe9)[0x7f7644e6fdc9]
/lib/libpthread.so.0(pthread_mutex_lock+0x686)[0x7f76451a0b16]
/lib/glusterfs/3.0.6/xlator/performance/io-cache.so(ioc_create_cbk
+0x87)[0x7f7643dcd3f7]
/lib/glusterfs/3.0.6/xlator/performance/read-ahead.so(ra_create_cbk
+0x1a2)[0x7f7643fd9322]
/lib/glusterfs/3.0.6/xlator/cluster/replicate.so(afr_create_unwind
+0x126)[0x7f76441f1866]
/lib/glusterfs/3.0.6/xlator/cluster/replicate.so(afr_create_wind_cbk
+0x10f)[0x7f76441f25ef]
/lib/glusterfs/3.0.6/xlator/protocol/client.so(client_create_cbk
+0x5aa)[0x7f764443a00a]
/lib/glusterfs/3.0.6/xlator/protocol/client.so(protocol_client_pollin
+0xca)[0x7f76444284ba]
/lib/glusterfs/3.0.6/xlator/protocol/client.so(notify
+0xe0)[0x7f7644437d70]
/lib/libglusterfs.so.0(xlator_notify+0x43)[0x7f76455cd483]
/lib/glusterfs/3.0.6/transport/socket.so(socket_event_handler
+0xe0)[0x7f76433819e0]
/lib/libglusterfs.so.0[0x7f76455e7e0f]
/sbin/glusterfs(main+0x82c)[0x40446c]
/lib/libc.so.6(__libc_start_main+0xe6)[0x7f7644e631a6]
/sbin/glusterfs[0x402a29]

I just want to know "why" Gluster crashed.

Regards.
Sam





___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users




[Gluster-users] NFS crashes under load

2010-11-07 Thread Dan Bretherton
I upgraded to GlusterFS 3.1 a couple of weeks ago and overall I am very
impressed; I think it is a big step forward.  Unfortunately there is one
"feature" that is causing me a big problem - the NFS process crashes
every few hours when under load.  I have pasted the relevant error
messages from nfs.log at the end of this message.  Incidentally, the
rest of the log file is swamped with these messages.

[2010-11-06 23:07:04.977055] E [rpcsvc.c:1249:nfs_rpcsvc_program_actor]
nfsrpc: RPC program not available

There are no apparent problems while these errors are being produced so
this issue probably isn't relevant to the crashes.

To give an indication of what I mean by "under load", we have a small
HPC cluster that is used for running ocean models.  A typical model run
involves 20 processors, all needing to read simultaneously from the same
input data files at regular intervals during the run.  There are roughly
20 files, each ~1GB in size.  While this is going on, several people
are typically processing output from previous runs from this
and other (much bigger) clusters, chugging through hundreds of GB and
tens of thousands of files every few hours.  I don't think the
Gluster-NFS crashes are purely load dependent because they seem to occur
at different load levels, which is what leads me to suspect something
subtle related to the cluster's 20-processor model runs.  I would prefer
to use the GlusterFS client on the cluster's compute nodes, but
unfortunately the pre-FUSE Linux kernel has been customised in a way
that has thwarted all my attempts to build a FUSE module that the kernel
will accept (see
http://gluster.org/pipermail/gluster-users/2010-April/004538.html)

The servers that are exporting NFS are all running CentOS 5.5 with
GlusterFS installed from RPMs, and the GlusterFS volumes are distributed
(not replicated).  Two of the servers with GlusterFS bricks are actually
running SuSE Enterprise 10; I don't know if this is relevant.  I used
previous GlusterFS versions with SLES10 without any problems, but as
RPMs are not provided for SuSE I presume it is not an officially
supported distro.  For that reason I am only using the CentOS machines
as NFS servers for the GlusterFS volumes.

I would be very grateful for any suggested solutions or workarounds that
might help to prevent these NFS crashes.

-Dan.
nfs.log extract
--
[2010-11-06 23:07:10.380744] E [fd.c:506:fd_unref_unbind]
(-->/usr/lib64/glusterfs/3.1.0/xlator/debug/io-stats.so(io_stats_fstat_cbk+0x8e)
[0x2b30813e]
(-->/usr/lib64/glusterfs/3.1.0/xlator/nfs/server.so(nfs_fop_fstat_cbk+0x41)
[0x2b9a6da1]
(-->/usr/lib64/glusterfs/3.1.0/xlator/nfs/server.so(nfs3svc_readdir_fstat_cbk+0x22d)
[0x2b9b0bdd]))) : Assertion failed: fd->refcount
pending frames:

patchset: v3.1.0
signal received: 11
time of crash: 2010-11-06 23:07:10
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.1.0
/lib64/libc.so.6[0x35746302d0]
/lib64/libpthread.so.0(pthread_spin_lock+0x2)[0x357520b722]
/usr/lib64/libglusterfs.so.0(fd_unref_unbind+0x3d)[0x38f223511d]
/usr/lib64/glusterfs/3.1.0/xlator/nfs/server.so(nfs3svc_readdir_fstat_cbk+0x22d)[0x2b9b0bdd]
/usr/lib64/glusterfs/3.1.0/xlator/nfs/server.so(nfs_fop_fstat_cbk+0x41)[0x2b9a6da1]
/usr/lib64/glusterfs/3.1.0/xlator/debug/io-stats.so(io_stats_fstat_cbk+0x8e)[0x2b30813e]
/usr/lib64/libglusterfs.so.0(default_fstat_cbk+0x79)[0x38fa69]
/usr/lib64/glusterfs/3.1.0/xlator/performance/read-ahead.so(ra_attr_cbk+0x79)[0x2aeec459]
/usr/lib64/glusterfs/3.1.0/xlator/performance/write-behind.so(wb_fstat_cbk+0x9f)[0x2ace402f]
/usr/lib64/glusterfs/3.1.0/xlator/cluster/distribute.so(dht_attr_cbk+0xf4)[0x2b521d24]
/usr/lib64/glusterfs/3.1.0/xlator/protocol/client.so(client3_1_fstat_cbk+0x287)[0x2aacd2b7]
/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x38f1a0f2e2]
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x8d)[0x38f1a0f4dd]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x2c)[0x38f1a0a77c]
/usr/lib64/glusterfs/3.1.0/rpc-transport/socket.so(socket_event_poll_in+0x3f)[0x2aaac3eb435f]
/usr/lib64/glusterfs/3.1.0/rpc-transport/socket.so(socket_event_handler+0x168)[0x2aaac3eb44e8]
/usr/lib64/libglusterfs.so.0[0x38f2236ee7]
/usr/sbin/glusterfs(main+0x37d)[0x4046ad]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x357461d994]
/usr/sbin/glusterfs[0x402dc9]
-

-- 
Mr. D.A. Bretherton
Computer System Manager
Environmental Systems Science Centre
Harry Pitt Building
3 Earley Gate
University of Reading
Reading, RG6 6AL
UK

Tel. +44 118 378 5205
Fax: +44 118 378 6413

___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] question on NFS mounting

2010-11-07 Thread Bernard Li
Hi Joe:

On Sun, Nov 7, 2010 at 12:03 AM, Joe Landman
 wrote:

> Actually, showmount didn't work.
>
> We get permission denied.  Even after playing with the auth.allowed flag.

That's an indication that the gNFS server is not running.

I would recommend you review the FAQ and some of the recent posts on
the list, as there have been a couple of threads discussing numerous
NFS-related issues and their solutions.  I've collected them here for
your convenience:

http://www.gluster.org/faq/index.php?sid=679&lang=en&action=show&cat=5
http://gluster.org/pipermail/gluster-users/2010-November/005692.html

Cheers,

Bernard
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] question on NFS mounting

2010-11-07 Thread Joe Landman

On 11/07/2010 02:00 AM, Bernard Li wrote:


I'm not sure about distribute, but with replicate, each brick should
be able to act as the NFS server.  What does `showmount -e` say for
each brick?  And what error message did you get when you tried to
mount it?


Actually, showmount didn't work.

We get permission denied.  Even after playing with the auth.allowed flag.



Cheers,

Bernard



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] question on NFS mounting

2010-11-07 Thread Bernard Li
Hi Joe:

On Sat, Nov 6, 2010 at 9:53 PM, Joe Landman
 wrote:

> We have a 3.1 cluster set up, and NFS mounting is operational.  We are
> trying to get our heads around the mounting of this cluster.  What we
> found works (for a 6-brick distributed cluster) is using the same
> server:/export in all the mounts.
>
> My questions are
>
> 1) can we use any of the bricks as the server?  We tried using another
> brick in the volume, but it doesn't seem to work.

I'm not sure about distribute, but with replicate, each brick should
be able to act as the NFS server.  What does `showmount -e` say for
each brick?  And what error message did you get when you tried to
mount it?
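One way to run that check across bricks in one go is sketched below. The
hostnames are hypothetical placeholders, and the NFSv3/TCP mount options
in the trailing comment are the commonly cited ones for Gluster's
NFSv3-only server, worth verifying for your distro.

```shell
# Sketch: probe each brick's mountd with showmount; hostnames are
# hypothetical placeholders. Bricks that answer should be usable as
# the NFS server for the volume.
checked=""
for h in brick1 brick2 brick3; do
    if showmount -e "$h" >/dev/null 2>&1; then
        checked="$checked $h:ok"
    else
        checked="$checked $h:unreachable"
    fi
done
echo "probed:$checked"
# A brick that answers can then be mounted with NFSv3 over TCP, e.g.:
#   mount -t nfs -o vers=3,proto=tcp,nolock brick1:/volname /mnt/volname
```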

Cheers,

Bernard
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users