Re: [Gluster-users] NFS crashes under load
Thanks. I'll be looking into it. I've filed a bug at:

http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=2061

You may add yourself to the CC list for notifications. It seems the crash is easily reproduced on your setup. Could you please post the log from the Gluster NFS process at TRACE log level to the bug? (See the P.S. after the quoted log extract for one way to capture it.)

Dan Bretherton wrote:
> I upgraded to GlusterFS 3.1 a couple of weeks ago and overall I am very impressed; I think it is a big step forward. Unfortunately there is one "feature" that is causing me a big problem - the NFS process crashes every few hours when under load. I have pasted the relevant error messages from nfs.log at the end of this message. Incidentally, the rest of the log file is swamped with these messages:
>
> [2010-11-06 23:07:04.977055] E [rpcsvc.c:1249:nfs_rpcsvc_program_actor] nfsrpc: RPC program not available
>
> There are no apparent problems while these errors are being produced, so this issue probably isn't relevant to the crashes.

Correct. That error is misleading and will be removed in 3.1.1.

Thanks
-Shehjar

> To give an indication of what I mean by "under load": we have a small HPC cluster that is used for running ocean models. A typical model run involves 20 processors, all needing to read simultaneously from the same input data files at regular intervals during the run. There are roughly 20 files, each ~1GB in size. While this is going on, several people are typically processing output from previous runs from this and other (much bigger) clusters, chugging through hundreds of GB and tens of thousands of files every few hours. I don't think the Gluster-NFS crashes are purely load dependent, because they seem to occur at different load levels, which leads me to suspect something subtle related to the cluster's 20-processor model runs.
>
> I would prefer to use the GlusterFS client on the cluster's compute nodes, but unfortunately the cluster's pre-FUSE Linux kernel has been customised in a way that has thwarted all my attempts to build a FUSE module that the kernel will accept (see http://gluster.org/pipermail/gluster-users/2010-April/004538.html).
>
> The servers that are exporting NFS are all running CentOS 5.5 with GlusterFS installed from RPMs, and the GlusterFS volumes are distributed (not replicated). Two of the servers with GlusterFS bricks are actually running SuSE Enterprise 10; I don't know if this is relevant. I used previous GlusterFS versions with SLES10 without any problems, but as RPMs are not provided for SuSE I presume it is not an officially supported distro. For that reason I am only using the CentOS machines as NFS servers for the GlusterFS volumes.
>
> I would be very grateful for any suggested solutions or workarounds that might help to prevent these NFS crashes.
>
> -Dan.
nfs.log extract
--
[2010-11-06 23:07:10.380744] E [fd.c:506:fd_unref_unbind]
(-->/usr/lib64/glusterfs/3.1.0/xlator/debug/io-stats.so(io_stats_fstat_cbk+0x8e) [0x2b30813e]
(-->/usr/lib64/glusterfs/3.1.0/xlator/nfs/server.so(nfs_fop_fstat_cbk+0x41) [0x2b9a6da1]
(-->/usr/lib64/glusterfs/3.1.0/xlator/nfs/server.so(nfs3svc_readdir_fstat_cbk+0x22d) [0x2b9b0bdd])))
: Assertion failed: fd->refcount

pending frames:

patchset: v3.1.0
signal received: 11
time of crash: 2010-11-06 23:07:10
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.1.0
/lib64/libc.so.6[0x35746302d0]
/lib64/libpthread.so.0(pthread_spin_lock+0x2)[0x357520b722]
/usr/lib64/libglusterfs.so.0(fd_unref_unbind+0x3d)[0x38f223511d]
/usr/lib64/glusterfs/3.1.0/xlator/nfs/server.so(nfs3svc_readdir_fstat_cbk+0x22d)[0x2b9b0bdd]
/usr/lib64/glusterfs/3.1.0/xlator/nfs/server.so(nfs_fop_fstat_cbk+0x41)[0x2b9a6da1]
/usr/lib64/glusterfs/3.1.0/xlator/debug/io-stats.so(io_stats_fstat_cbk+0x8e)[0x2b30813e]
/usr/lib64/libglusterfs.so.0(default_fstat_cbk+0x79)[0x38fa69]
/usr/lib64/glusterfs/3.1.0/xlator/performance/read-ahead.so(ra_attr_cbk+0x79)[0x2aeec459]
/usr/lib64/glusterfs/3.1.0/xlator/performance/write-behind.so(wb_fstat_cbk+0x9f)[0x2ace402f]
/usr/lib64/glusterfs/3.1.0/xlator/cluster/distribute.so(dht_attr_cbk+0xf4)[0x2b521d24]
/usr/lib64/glusterfs/3.1.0/xlator/protocol/client.so(client3_1_fstat_cbk+0x287)[0x2aacd2b7]
/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x38f1a0f2e2]
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x8d)[0x38f1a0f4dd]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x2c)[0x38f1a0a77c]
/usr/lib64/glusterfs/3.1.0/rpc-transport/socket.so(socket_event_poll_in+0x3f)[0x2aaac3eb435f]
/usr/lib64/glusterfs/3.1.0/rpc-transport/socket.so(socket_event_handler+0x168)[0x2aaac3eb44e8]
/usr/lib64/libglusterfs.so.0[0x38f2236ee7]
/usr/sbin/glusterfs(main+0x37d)[0x4046ad]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x357461d994]
/usr/sbin/glusterfs[0x402dc9]
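P.S. In case it's useful, here is a rough sketch of one way to capture the TRACE-level log by restarting the NFS process by hand. The volfile path is an assumption (use wherever glusterd keeps its generated NFS volfile on your install), and the kill pattern is only illustrative:

    # Stop the glusterfs process that is serving NFS (pattern is illustrative;
    # confirm the PID with ps before killing anything).
    kill $(ps ax | grep '[g]lusterfs' | grep nfs | awk '{print $1}')

    # Restart it with TRACE logging to a dedicated file. The volfile path
    # below is an assumption; check where glusterd keeps it on your system.
    glusterfs -f /etc/glusterd/nfs/nfs-server.vol \
              --log-level=TRACE \
              --log-file=/var/log/glusterfs/nfs-trace.log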
Re: [Gluster-users] question on NFS mounting
Joe Landman wrote:
> On 11/07/2010 02:00 AM, Bernard Li wrote:
>> I'm not sure about distribute, but with replicate, each brick should be able to act as the NFS server. What does `showmount -e` say for each brick? And what error message did you get when you tried to mount it?

With any kind of volume config, NFS starts up by default on all bricks. You'll have to ensure that no other NFS servers are running on the bricks when Gluster volumes are started.

> Actually, showmount didn't work.
>
> We get permission denied, even after playing with the auth.allow flag.

Please paste the output of rpcinfo -p. It'll help point out what's going on.

Thanks
-Shehjar
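P.S. For comparison, here is a sketch of what a healthy registration looks like, assuming the kernel NFS server has been stopped first. The hostname is hypothetical and the 38465/38467 ports are the usual Gluster NFS defaults, so verify them on your install:

    # Keep the kernel NFS server from grabbing the portmapper registration
    # (CentOS/RHEL service names assumed):
    service nfs stop
    chkconfig nfs off

    # With the Gluster NFS server running, rpcinfo should show roughly:
    $ rpcinfo -p server1
       program vers proto   port
        100000    2   tcp    111  portmapper
        100000    2   udp    111  portmapper
        100005    3   tcp  38465  mountd   # Gluster mountd (illustrative port)
        100003    3   tcp  38467  nfs      # Gluster NFS server (illustrative port)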
Re: [Gluster-users] Gluster crash
Please file a bug. It'd help to have the steps to reproduce and, if the crash is easily reproduced, the client log at TRACE log level. Thanks. (See the P.S. after the quoted message for one way to capture it.)

Samuel Hassine wrote:
> Hi all,
>
> Our service using GlusterFS has been in production for one week and we are handling a huge amount of traffic. Last night, one of the Gluster clients (on a physical node with a lot of virtual engines) crashed. Can you give me more information about the crash based on its log? Here is the log:
>
> pending frames:
> frame : type(1) op(READ)
> frame : type(1) op(READ)
> frame : type(1) op(READ)
> frame : type(1) op(CREATE)
> frame : type(1) op(CREATE)
>
> patchset: v3.0.6
> signal received: 6
> time of crash: 2010-11-06 05:38:11
> configuration details:
> argp 1
> backtrace 1
> dlfcn 1
> fdatasync 1
> libpthread 1
> llistxattr 1
> setfsid 1
> spinlock 1
> epoll.h 1
> xattr.h 1
> st_atim.tv_nsec 1
> package-string: glusterfs 3.0.6
> /lib/libc.so.6[0x7f7644e76f60]
> /lib/libc.so.6(gsignal+0x35)[0x7f7644e76ed5]
> /lib/libc.so.6(abort+0x183)[0x7f7644e783f3]
> /lib/libc.so.6(__assert_fail+0xe9)[0x7f7644e6fdc9]
> /lib/libpthread.so.0(pthread_mutex_lock+0x686)[0x7f76451a0b16]
> /lib/glusterfs/3.0.6/xlator/performance/io-cache.so(ioc_create_cbk+0x87)[0x7f7643dcd3f7]
> /lib/glusterfs/3.0.6/xlator/performance/read-ahead.so(ra_create_cbk+0x1a2)[0x7f7643fd9322]
> /lib/glusterfs/3.0.6/xlator/cluster/replicate.so(afr_create_unwind+0x126)[0x7f76441f1866]
> /lib/glusterfs/3.0.6/xlator/cluster/replicate.so(afr_create_wind_cbk+0x10f)[0x7f76441f25ef]
> /lib/glusterfs/3.0.6/xlator/protocol/client.so(client_create_cbk+0x5aa)[0x7f764443a00a]
> /lib/glusterfs/3.0.6/xlator/protocol/client.so(protocol_client_pollin+0xca)[0x7f76444284ba]
> /lib/glusterfs/3.0.6/xlator/protocol/client.so(notify+0xe0)[0x7f7644437d70]
> /lib/libglusterfs.so.0(xlator_notify+0x43)[0x7f76455cd483]
> /lib/glusterfs/3.0.6/transport/socket.so(socket_event_handler+0xe0)[0x7f76433819e0]
> /lib/libglusterfs.so.0[0x7f76455e7e0f]
> /sbin/glusterfs(main+0x82c)[0x40446c]
> /lib/libc.so.6(__libc_start_main+0xe6)[0x7f7644e631a6]
> /sbin/glusterfs[0x402a29]
>
> I just want to know "why" Gluster crashed.
>
> Regards.
> Sam
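P.S. If it helps with the bug report, here is a sketch of remounting a 3.0.x client with TRACE logging. The volfile path and mount point are assumptions; substitute your own:

    # Unmount, then remount with verbose logging to a dedicated file.
    umount /mnt/gluster
    glusterfs -f /etc/glusterfs/glusterfs-client.vol \
              --log-level=TRACE \
              --log-file=/var/log/glusterfs/client-trace.log \
              /mnt/gluster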
[Gluster-users] NFS crashes under load
I upgraded to GlusterFS 3.1 a couple of weeks ago and overall I am very impressed; I think it is a big step forward. Unfortunately there is one "feature" that is causing me a big problem - the NFS process crashes every few hours when under load. I have pasted the relevant error messages from nfs.log at the end of this message. Incidentally, the rest of the log file is swamped with these messages:

[2010-11-06 23:07:04.977055] E [rpcsvc.c:1249:nfs_rpcsvc_program_actor] nfsrpc: RPC program not available

There are no apparent problems while these errors are being produced, so this issue probably isn't relevant to the crashes.

To give an indication of what I mean by "under load": we have a small HPC cluster that is used for running ocean models. A typical model run involves 20 processors, all needing to read simultaneously from the same input data files at regular intervals during the run. There are roughly 20 files, each ~1GB in size. While this is going on, several people are typically processing output from previous runs from this and other (much bigger) clusters, chugging through hundreds of GB and tens of thousands of files every few hours. I don't think the Gluster-NFS crashes are purely load dependent, because they seem to occur at different load levels, which leads me to suspect something subtle related to the cluster's 20-processor model runs.

I would prefer to use the GlusterFS client on the cluster's compute nodes, but unfortunately the cluster's pre-FUSE Linux kernel has been customised in a way that has thwarted all my attempts to build a FUSE module that the kernel will accept (see http://gluster.org/pipermail/gluster-users/2010-April/004538.html).

The servers that are exporting NFS are all running CentOS 5.5 with GlusterFS installed from RPMs, and the GlusterFS volumes are distributed (not replicated). Two of the servers with GlusterFS bricks are actually running SuSE Enterprise 10; I don't know if this is relevant. I used previous GlusterFS versions with SLES10 without any problems, but as RPMs are not provided for SuSE I presume it is not an officially supported distro. For that reason I am only using the CentOS machines as NFS servers for the GlusterFS volumes.

I would be very grateful for any suggested solutions or workarounds that might help to prevent these NFS crashes.

-Dan.
nfs.log extract
--
[2010-11-06 23:07:10.380744] E [fd.c:506:fd_unref_unbind]
(-->/usr/lib64/glusterfs/3.1.0/xlator/debug/io-stats.so(io_stats_fstat_cbk+0x8e) [0x2b30813e]
(-->/usr/lib64/glusterfs/3.1.0/xlator/nfs/server.so(nfs_fop_fstat_cbk+0x41) [0x2b9a6da1]
(-->/usr/lib64/glusterfs/3.1.0/xlator/nfs/server.so(nfs3svc_readdir_fstat_cbk+0x22d) [0x2b9b0bdd])))
: Assertion failed: fd->refcount

pending frames:

patchset: v3.1.0
signal received: 11
time of crash: 2010-11-06 23:07:10
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.1.0
/lib64/libc.so.6[0x35746302d0]
/lib64/libpthread.so.0(pthread_spin_lock+0x2)[0x357520b722]
/usr/lib64/libglusterfs.so.0(fd_unref_unbind+0x3d)[0x38f223511d]
/usr/lib64/glusterfs/3.1.0/xlator/nfs/server.so(nfs3svc_readdir_fstat_cbk+0x22d)[0x2b9b0bdd]
/usr/lib64/glusterfs/3.1.0/xlator/nfs/server.so(nfs_fop_fstat_cbk+0x41)[0x2b9a6da1]
/usr/lib64/glusterfs/3.1.0/xlator/debug/io-stats.so(io_stats_fstat_cbk+0x8e)[0x2b30813e]
/usr/lib64/libglusterfs.so.0(default_fstat_cbk+0x79)[0x38fa69]
/usr/lib64/glusterfs/3.1.0/xlator/performance/read-ahead.so(ra_attr_cbk+0x79)[0x2aeec459]
/usr/lib64/glusterfs/3.1.0/xlator/performance/write-behind.so(wb_fstat_cbk+0x9f)[0x2ace402f]
/usr/lib64/glusterfs/3.1.0/xlator/cluster/distribute.so(dht_attr_cbk+0xf4)[0x2b521d24]
/usr/lib64/glusterfs/3.1.0/xlator/protocol/client.so(client3_1_fstat_cbk+0x287)[0x2aacd2b7]
/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x38f1a0f2e2]
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x8d)[0x38f1a0f4dd]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x2c)[0x38f1a0a77c]
/usr/lib64/glusterfs/3.1.0/rpc-transport/socket.so(socket_event_poll_in+0x3f)[0x2aaac3eb435f]
/usr/lib64/glusterfs/3.1.0/rpc-transport/socket.so(socket_event_handler+0x168)[0x2aaac3eb44e8]
/usr/lib64/libglusterfs.so.0[0x38f2236ee7]
/usr/sbin/glusterfs(main+0x37d)[0x4046ad]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x357461d994]
/usr/sbin/glusterfs[0x402dc9]
-
--
Mr. D.A. Bretherton
Computer System Manager
Environmental Systems Science Centre
Harry Pitt Building
3 Earley Gate
University of Reading
Reading, RG6 6AL
UK
Tel. +44 118 378 5205
Fax: +44 118 378 6413
Re: [Gluster-users] question on NFS mounting
Hi Joe:

On Sun, Nov 7, 2010 at 12:03 AM, Joe Landman wrote:
> Actually, showmount didn't work.
>
> We get permission denied, even after playing with the auth.allow flag.

That's an indication that the gNFS server is not running. I would recommend you review the FAQ and some of the recent posts on the list, as there have been a couple of threads discussing numerous NFS-related issues and their solutions. I've collected them here for convenience:

http://www.gluster.org/faq/index.php?sid=679&lang=en&action=show&cat=5
http://gluster.org/pipermail/gluster-users/2010-November/005692.html

Cheers,
Bernard
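P.S. A few quick checks can confirm whether the gNFS server is actually up; this is a sketch, and the log path is the usual 3.1 default, so adjust to your install:

    ps ax | grep '[g]lusterfs'               # is a glusterfs process serving NFS running?
    tail -n 50 /var/log/glusterfs/nfs.log    # look for portmap registration failures
    rpcinfo -p localhost                     # are nfs (100003) and mountd (100005) registered?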
Re: [Gluster-users] question on NFS mounting
On 11/07/2010 02:00 AM, Bernard Li wrote:
> I'm not sure about distribute, but with replicate, each brick should be able to act as the NFS server. What does `showmount -e` say for each brick? And what error message did you get when you tried to mount it?

Actually, showmount didn't work.

We get permission denied. Even after playing with the auth.allow flag (see the note on access-list options after this message).

> Cheers,
> Bernard

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
       http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
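Note that in 3.1 the access list for the native protocol and the one for the built-in NFS server are separate volume options, so tweaking one does not affect the other. A sketch, with the volume name and subnet being hypothetical:

    # Native (FUSE) protocol access list:
    gluster volume set myvol auth.allow '192.168.1.*'

    # The built-in NFS server keeps its own, separate access list:
    gluster volume set myvol nfs.rpc-auth-allow '192.168.1.*'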
Re: [Gluster-users] question on NFS mounting
Hi Joe:

On Sat, Nov 6, 2010 at 9:53 PM, Joe Landman wrote:
> We have a 3.1 cluster set up, and NFS mounting is operational. We are trying to get our heads around the mounting of this cluster. What we found works (for a 6 brick distributed cluster) is using the same server:/export in all the mounts.
>
> My questions are:
>
> 1) can we use any of the bricks for server? We tried using another brick in the volume, but it doesn't seem to work.

I'm not sure about distribute, but with replicate, each brick should be able to act as the NFS server. What does `showmount -e` say for each brick? And what error message did you get when you tried to mount it?

Cheers,
Bernard
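P.S. For reference, here is a sketch of an NFS mount against the built-in server. The hostname, volume name and mount point are hypothetical, and note the volume is exported under its name rather than a brick path:

    # The built-in server speaks NFSv3 over TCP; nolock sidesteps NLM
    # locking, which this server may not support (an assumption; drop it
    # if locking works for you).
    mount -t nfs -o vers=3,proto=tcp,nolock server2:/myvol /mnt/myvol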