Re: [Gluster-users] Replication not working on server hang
Hi

>> c) does not glusterfs ping the servers periodically to see if they
>> are available or not? if so, why does it not detect that situation?
>
> It does, but in this case the server is up and running and replying
> with pongs. The current ping-pong only checks for network
> reachability to the server process.

I am not sure that the server is replying to pings in this situation ...

Anyway, I was trying to check how glusterfs behaves when no server is available, so I set up a replicated volume identical to the one I'm using, but with all the remote-host options pointing to IP addresses not used in our network. I mounted it and tried to do an ls on the mount point. The client hung the same way (forever); I killed the glusterfs process 25 minutes later (past all configurable timeouts).

Although this can be useful in some situations (i.e. when both servers and clients are rebooting, so clients will wait until some server is available), it can also be bad, as applications will never notice that something is going wrong.

Given volfile:
+--+
volume data1
  type protocol/client
  option transport-type tcp
  option remote-host 192.168.1.99
  option remote-subvolume export
  option ping-timeout 5
end-volume

volume data2
  type protocol/client
  option transport-type tcp
  option remote-host 192.168.2.99
  option remote-subvolume export
  option ping-timeout 5
end-volume

volume data
  type cluster/replicate
  subvolumes data1 data2
end-volume
+--+
[2009-09-01 11:05:44] N [glusterfsd.c:1152:main] glusterfs: Successfully started
[2009-09-01 11:05:47] E [socket.c:744:socket_connect_finish] data1: connection to failed (No route to host)
[2009-09-01 11:05:47] E [socket.c:744:socket_connect_finish] data1: connection to failed (No route to host)
[2009-09-01 11:05:47] E [socket.c:744:socket_connect_finish] data2: connection to failed (No route to host)
[2009-09-01 11:05:47] E [socket.c:744:socket_connect_finish] data2: connection to failed (No route to host)
[2009-09-01 11:31:30] W [glusterfsd.c:827:cleanup_and_exit] glusterfs: shutting down

Please also note the "connection to failed" message, which is a) duplicated and b) does not say where it tried to connect.

--
Best regards ...

David Saez Padros http://www.ols.es
On-Line Services 2000 S.L. telf +34 902 50 29 75

___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
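The remark quoted above, that the ping-pong only checks network reachability to the server process, can be illustrated with a small probe. This is a minimal sketch in Python and not GlusterFS code: the function name `server_reachable` and the throwaway local listener are assumptions made for the illustration. It shows that a bounded connect attempt gives a yes/no answer within a known time, and also that a "yes" says nothing about whether the backend filesystem behind the server can still serve requests.

```python
import socket


def server_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    This only proves that something accepts connections on that port.
    It does not prove that the storage behind it is responsive.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


# Demo against a throwaway local listener (hypothetical stand-in for a server).
listener = socket.socket()
listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
listener.listen(1)
open_port = listener.getsockname()[1]

reachable = server_reachable("127.0.0.1", open_port, timeout=1.0)
listener.close()
unreachable = server_reachable("127.0.0.1", open_port, timeout=1.0)
print(reachable, unreachable)
```

An application could run such a check before touching the mount point, so that "no server available" turns into an error instead of an indefinite wait.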
Re: [Gluster-users] Replication not working on server hang
> we have no problem on using the volume for reading data and if
> the server is not under heavy load it works well for writing too,
> only hangs when server is using 100% cpu for math intensive
> calculations and we try to write a lot of data to glusterfs
> volume (it usually takes some hours for it to hang)

Interesting, as I suspect our hangs have occurred during very low CPU utilization. Can't tell for sure, as we can't get a shell during the hang.

>> Talk of XFS being stable is encouraging me to give it a shot.
>
> It will be very difficult for us to migrate everything to xfs
> now, but i will like to see someone having problems without xfs
> not having that problems with xfs rather than people not having
> problems saying that xfs is stable to start such large migration
> process

Agreed, that is exactly why I would like to try a different FS. If I still get hangs under XFS, then at least I'll know the hangs are almost certainly not due to EXT3 bugs. If it does work, then I have sneaked around the problem rather than solved it (cheat!).

Fortunately, I don't have too much data right now, and I can devote a partition to XFS for testing. My systems are in 24x6 production though, so I will need to wait for a quiet (Sunday) period for the real testing.

Thanks, Jeff.
Re: [Gluster-users] Replication not working on server hang
Hi

>> In our case the hanged server is:
>>
>> Dell PE2900
>> 2 x x5...@3.33 8Gb RAM
>> SAS 6/iR Integrated RAID Controller
>> 7 x SEAGATE ST31000640SS
>> 1 x SEAGATE ST3300656SS
>> Debian testing
>> Kernel 2.6.26-2
>>
>> server hanged when writing to a unified volume (7 x 1Tb +
>> namespace and system on the ST3300656SS)
>
> We have:
>
> IBM x3650
> 2 x x5...@3.16 32Gb RAM
> SATA Integrated RAID Controller
> 4 X 1TB SATA Hitachi HUA72101
> RHEL 5.3
> Kernel 2.6.18-128.4.1.el5xen
> Glusterfs 2.0.3 w/ patch 943
>
> Couldn't be more different really!

Yes, and we are using glusterfs 2.0.1.

> Server hangs when building software on a 100GB replicated volume,
> mounted with direct-io-mode=disabled.

We have no problem using the volume for reading data, and if the server is not under heavy load it works well for writing too. It only hangs when the server is using 100% CPU for math-intensive calculations and we try to write a lot of data to the glusterfs volume (it usually takes some hours for it to hang).

> Talk of XFS being stable is encouraging me to give it a shot.

It would be very difficult for us to migrate everything to xfs now. Before starting such a large migration, I would like to see someone who has problems without xfs stop having them with xfs, rather than people who never had problems saying that xfs is stable.

--
Best regards ...

David Saez Padros http://www.ols.es
On-Line Services 2000 S.L. telf +34 902 50 29 75
Re: [Gluster-users] Replication not working on server hang
-------- Original Message --------
> Date: Mon, 31 Aug 2009 10:58:41 +1000 (EST)
> From: "Jeff Evans"
> To: da...@ols.es
> CC: gluster-users@gluster.org
> Subject: Re: [Gluster-users] Replication not working on server hang

> Hola David,
>
> > Maybe we could try to
> > see if all the ones experiencing this problem have something in
> > common.
>
> Agreed.
>
> > In our case the hanged server is:
> >
> > Dell PE2900
> > 2 x x5...@3.33 8Gb RAM
> > SAS 6/iR Integrated RAID Controller
> > 7 x SEAGATE ST31000640SS
> > 1 x SEAGATE ST3300656SS
> > Debian testing
> > Kernel 2.6.26-2
> >
> > server hanged when writing to a unified volume (7 x 1Tb +
> > namespace and system on the ST3300656SS)
>
> We have:
>
> IBM x3650
> 2 x x5...@3.16 32Gb RAM
> SATA Integrated RAID Controller
> 4 X 1TB SATA Hitachi HUA72101
> RHEL 5.3
> Kernel 2.6.18-128.4.1.el5xen
> Glusterfs 2.0.3 w/ patch 943
>
> Couldn't be more different really!
>
> Server hangs when building software on a 100GB replicated volume,
> mounted with direct-io-mode=disabled.
>
> I have found that building:
>
> http://mirror.cs.wisc.edu/pub/mirrors/ghost/AFPL/GhostPCL/ghostpcl_1.40.tar.bz2
>
> Reliably produces the hang in my case.
> Even just a grep -R of the source gives me the dreaded hang.
>
I have just tested it on my GlusterFS with XFS underneath and it works without issues:

---
uranos test # wget http://mirror.cs.wisc.edu/pub/mirrors/ghost/AFPL/GhostPCL/ghostpcl_1.40.tar.bz2
--2009-08-31 03:49:06-- http://mirror.cs.wisc.edu/pub/mirrors/ghost/AFPL/GhostPCL/ghostpcl_1.40.tar.bz2
Resolving mirror.cs.wisc.edu... 128.105.103.12
Connecting to mirror.cs.wisc.edu|128.105.103.12|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10093425 (9.6M) [application/x-tar]
Saving to: `ghostpcl_1.40.tar.bz2'

100%[=>] 10,093,425 289K/s in 35s

2009-08-31 03:49:42 (278 KB/s) - `ghostpcl_1.40.tar.bz2' saved [10093425/10093425]

uranos test # time tar xjf ghostpcl_1.40.tar.bz2

real 1m32.895s
user 0m4.520s
sys 0m0.720s
uranos test # time echo $(grep -iR "test" ghostpcl_1.40/ | wc -l)
3180

real 0m9.776s
user 0m0.070s
sys 0m0.310s
uranos test #
---

> Talk of XFS being stable is encouraging me to give it a shot.
>
> XFS isn't shipped with RHEL 5.3, but then neither is FUSE! (both
> should be in 5.4 though, finally).
>
> Thanks, Jeff.
>
Steve
Re: [Gluster-users] Replication not working on server hang
Hola David,

> Maybe we could try to
> see if all the ones experiencing this problem have something in
> common.

Agreed.

> In our case the hanged server is:
>
> Dell PE2900
> 2 x x5...@3.33 8Gb RAM
> SAS 6/iR Integrated RAID Controller
> 7 x SEAGATE ST31000640SS
> 1 x SEAGATE ST3300656SS
> Debian testing
> Kernel 2.6.26-2
>
> server hanged when writing to a unified volume (7 x 1Tb +
> namespace and system on the ST3300656SS)

We have:

IBM x3650
2 x x5...@3.16 32Gb RAM
SATA Integrated RAID Controller
4 X 1TB SATA Hitachi HUA72101
RHEL 5.3
Kernel 2.6.18-128.4.1.el5xen
Glusterfs 2.0.3 w/ patch 943

Couldn't be more different really!

Server hangs when building software on a 100GB replicated volume, mounted with direct-io-mode=disabled.

I have found that building:

http://mirror.cs.wisc.edu/pub/mirrors/ghost/AFPL/GhostPCL/ghostpcl_1.40.tar.bz2

reliably produces the hang in my case. Even just a grep -R of the source gives me the dreaded hang.

Talk of XFS being stable is encouraging me to give it a shot.

XFS isn't shipped with RHEL 5.3, but then neither is FUSE! (both should be in 5.4 though, finally).

Thanks, Jeff.
Re: [Gluster-users] Replication not working on server hang
-------- Original Message --------
> Date: Sun, 30 Aug 2009 21:52:17 +0200
> From: Jasper van Wanrooy - Chatventure
> To: gluster-users
> Subject: Re: [Gluster-users] Replication not working on server hang

> Hi,
>
Hello

> Sideways I'm reading the discussions about server hangs the last few
> weeks. However, I did quite a few stress tests on our test systems,
> but I'm unable to reproduce the hangs. The only real difference I see
> is that we are using the XFS filesystem. Does anyone have experience
> with that?
>
I use XFS as well and can't reproduce any issues with it when using anything >= GlusterFS 2.0.4. Older 2.0.x releases of GlusterFS were ultra unstable for me, but starting from 2.0.4 things seem to get better. Currently I am using 2.1.0git in production for serving web pages and things work flawlessly. If it continues like that then I am going to try again to move my mail storage onto GlusterFS. But not in the next 2 to 3 weeks.

Anyway... GlusterFS and XFS = no hangs at all for me. Crashing GlusterFS? Yes! Hangs? No!

If you use XFS then be sure not to use a kernel from the 2.6.29 and 2.6.30 series, as they have a bug with XFS. There are patches for 2.6.29 and 2.6.30, but none of them is included in the mainline kernel. Maybe the released 2.6.31 kernel will fix the issue? RC8, however, still has the same issue as 2.6.29/2.6.30.

> Kind Regards,
>
> Jasper
>
Steve
Re: [Gluster-users] Replication not working on server hang
Hi,

I have been following the discussions about server hangs from the sidelines for the last few weeks. I did quite a few stress tests on our test systems, but I'm unable to reproduce the hangs. The only real difference I see is that we are using the XFS filesystem. Does anyone have experience with that?

Kind Regards,

Jasper
Re: [Gluster-users] Replication not working on server hang
On 08/30/2009 04:00 AM, Anand Avati wrote:
>> I'm wondering if there's some way for glusterfs to detect the flaws of the
>> underlying operating system. I believe there's no bug-free file systems in
>> the universe, so I believe it is the job of the glusterfs developer to
>> specify which underlying filesystem is tested and supported. It's not good
>> to simply say that glusterfs works on all real-world approximations to an
>> imaginary bug-free posix filesystem.
>
> I would be genuinely interested to know about another project which is
> geared up to be resilient against kernel hangs so that we can borrow
> some ideas on how to reliably detect kernel soft lockups or syscall
> hangs. As far as I know, even mature projects like Apache have not
> bothered fixing such hangs (or even detecting this kind of underlying
> OS flaw).

There are projects that require kernel patches to work properly (for example, the OpenVZ project), and most Linux distributions (e.g. RedHat) maintain a set of kernel patches. Vendors may provide workarounds for known kernel problems - for example, the dovecot people go through various means to flush the NFS or FUSE cache (including for GlusterFS) before doing certain operations, and these are done using non-portable operations.

The summary is that relying on the Linux kernel (or any kernel, for that matter) to be correct in all situations has limits. Sometimes it is necessary to track down the problem, correct it, and provide a patch. This can involve discussions on linux-dev leading to the problem finally being corrected upstream, so that the patch is no longer needed.

I'm not saying it has to go this far - but unless the problem is understood, it shouldn't be written off either. If GlusterFS can issue a set of operations that reproducibly causes ext3 to freeze, this is of concern to both the ext3 developers/maintainers and the GlusterFS developers/maintainers, and it is a joint problem to solve, since ext3 is so common.

As for detecting lockups or hangs - I'm not aware of this being done in userspace, but it could be argued that the comparison is a bit artificial, because GlusterFS is, at its base, a network file system, and it *is* common for network file systems (such as NFS) to deal with problems with the underlying volumes. GlusterFS uses FUSE as a novel approach to avoiding the problem entirely - but if GlusterFS from user space can cause the backend storage volume to freeze up, even from outside GlusterFS, then it seems like the user space barrier is insufficient.

For all of the above, I am assuming that GlusterFS is being used to do something which ends up locking up the entire volume, even from outside GlusterFS. If anybody is experiencing GlusterFS *only* problems, where the underlying volume is still accessible from another process, then this would be a different problem, probably GlusterFS specific.

Cheers,
mark

--
Mark Mielke
Re: [Gluster-users] Replication not working on server hang
Hi

> The calls (as you have seen in the logs as well) which are hanging are
> lookup calls, which have to be sent to all subvolumes to ensure all
> the copies are in sync.

One thing that I could not understand: if these calls are sent to all servers to keep files in sync, why does replicate only self-heal when the files exist on the first subvolume, and not when they do not exist on the first subvolume?

--
Best regards ...

David Saez Padros http://www.ols.es
On-Line Services 2000 S.L. telf +34 902 50 29 75
Re: [Gluster-users] Replication not working on server hang
On Sun, 30 Aug 2009 01:00:13 -0700 Anand Avati wrote:

> > I'm wondering if there's some way for glusterfs to detect the flaws of the
> > underlying operating system. I believe there's no bug-free file systems in
> > the universe, so I believe it is the job of the glusterfs developer to
> > specify which underlying filesystem is tested and supported. It's not good
> > to simply say that glusterfs works on all real-world approximations to an
> > imaginary bug-free posix filesystem.
>
> I would be genuinely interested to know about another project which is
> geared up to be resilient against kernel hangs so that we can borrow
> some ideas on how to reliably detect kernel soft lockups or syscall
> hangs. As far as I know, even mature projects like Apache have not
> bothered fixing such hangs (or even detecting this kind of underlying
> OS flaw).

Apache is not software whose primary use is to overcome hardware (and software) issues that lead to offline filesystems. You cannot compare two applications with totally different usage patterns.

And, just to say it clearly, nobody expects you to _solve_ or fix a hang. The users only expect glusterfs to _recognise_ a problem and just shut down. It is far better to shut down without a real problem than to continue while having one and hang. The first leads to more work at most, but the second leads to an offline service. And that's exactly why we are all here: to prevent an offline file service.

> Avati

--
Regards,
Stephan
Re: [Gluster-users] Replication not working on server hang
Hi

>> This backend fs hang can happen only because of a kernel bug.
>
> That is indeed false. The fs hang can well have simple hardware
> reasons, too.

Maybe it could also be due to some sort of wrong access to the filesystem. What is clear is that the soft lockup itself is a kernel bug. The open questions are what exactly is causing this bug (file system, controller driver, hardware, the kernel itself, ...) and why glusterfs triggers it while direct operations on the ext3 file system, or operations through nfs, do not.

--
Best regards ...

David Saez Padros http://www.ols.es
On-Line Services 2000 S.L. telf +34 902 50 29 75
Re: [Gluster-users] Replication not working on server hang
Hi

>> The discussion in this thread is about those situations where the
>> server (machine hosting the storage/posix volume) hangs the backend
>> filesystem (verified by kernel console logs) and that in turn
>> results in the mountpoint hang.
>
> That seems to be the case in Stephan's situation, yes, as we have
> evidence from reiserFS. What evidence have we in the ext3 cases?

Just searching on the net, I found similar cases that were due to the SATA driver (although in our case all disks are SAS), so the problem could also be due to the disk driver or to some other piece of the system. It is very unlikely that both reiserfs and ext3 have a bug that produces these hangs. Maybe we could try to see if all the ones experiencing this problem have something in common.

In our case the hanged server is:

Dell PE2900
2 x x5...@3.33 8Gb RAM
SAS 6/iR Integrated RAID Controller
7 x SEAGATE ST31000640SS
1 x SEAGATE ST3300656SS
Debian testing
Kernel 2.6.26-2

The server hanged when writing to a unified volume (7 x 1Tb + namespace and system on the ST3300656SS).

--
Best regards ...

David Saez Padros http://www.ols.es
On-Line Services 2000 S.L. telf +34 902 50 29 75
Re: [Gluster-users] Replication not working on server hang
On Sat, 29 Aug 2009 18:23:23 -0700 Anand Avati wrote:

> This backend fs
> hang can happen only because of a kernel bug.

That is indeed false. The fs hang can well have simple hardware reasons, too. In fact it is good, defensive programming style not to count on everybody being perfect - just as on the street you should not count on the drivers of other cars being perfect, either.

Let's say your favourite hd controller just died half way: you cannot blame the kernel for keeping networking up while everything fs related just blocks.

--
Regards,
Stephan
Re: [Gluster-users] Replication not working on server hang
Hi Avati,

I'm experiencing complete system-wide hangs exactly as David has mentioned.

> The discussion in this
> thread is about those situations where the server (machine
> hosting the
> storage/posix volume) hangs the backend filesystem (verified by
> kernel console logs) and that in turn results in the mountpoint
> hang.

That seems to be the case in Stephan's situation, yes, as we have evidence from reiserFS. What evidence have we in the ext3 cases?

> While your symptoms are similar on the client side hanging,

In the case of 144, my systems didn't hang. Maybe I was just lucky. Now that I have disabled read-ahead to work around 144, I am seeing total system hangs. I also saw these hangs back before I used read-ahead (with 1.3).

As I have said, it is as if new FDs cannot be allocated, while those already open continue normally. I'm talking about regular ext3 mounts here, not glusterfs ones.

> The discussion thread is about the situation where the server side
> kernel misbehaves and results in glusterfs hanging. The two
> actual problems are quite different.

Perhaps, as I said, it may be coincidence, but when I ran with read-ahead, I didn't get any system hangs, just the core dumps. Now, I don't get core dumps any more. I get system-wide hangs.

Thanks, Jeff.
Re: [Gluster-users] Replication not working on server hang
Hi

> As far as I know, even mature projects like Apache have not
> bothered fixing such hangs (or even detecting this kind of underlying
> OS flaw).

That's right, but Apache is not a fault-tolerant file system. On the other hand, some applications that face bugs in other apps have options to work around them (like dovecot has for some Outlook bugs). From a fault-tolerant file system I would expect that it can at least detect and handle a problem in any of the subsystems involved.

--
Best regards ...

David Saez Padros http://www.ols.es
On-Line Services 2000 S.L. telf +34 902 50 29 75
Re: [Gluster-users] Replication not working on server hang
On Sun, Aug 30, 2009 at 01:00:13AM -0700, Anand Avati wrote:
> > I'm wondering if there's some way for glusterfs to detect the flaws of the
> > underlying operating system. I believe there's no bug-free file systems in
> > the universe, so I believe it is the job of the glusterfs developer to
> > specify which underlying filesystem is tested and supported. It's not good
> > to simply say that glusterfs works on all real-world approximations to an
> > imaginary bug-free posix filesystem.
>
> I would be genuinely interested to know about another project which is
> geared up to be resilient against kernel hangs so that we can borrow
> some ideas on how to reliably detect kernel soft lockups or syscall
> hangs. As far as I know, even mature projects like Apache have not
> bothered fixing such hangs (or even detecting this kind of underlying
> OS flaw).

Check out heartbeat, and the rest (perhaps you knew of this):
http://www.linux-ha.org/

cheers
zenaan

--
Homepage: www.SoulSound.net -- Free Australia: www.UPMART.org
Please respect the confidentiality of this email as sensibly warranted.
Re: [Gluster-users] Replication not working on server hang
Hi

> What we have here (kernel lockups and glusterfs on the same machine)
> might not be a co-incidence.

But it could be.

> There might well be a correlation -- but by nature of the problem it
> is not right to treat this as a cause-effect relation with glusterfs
> being the cause.

I think it's also not right to simply discard glusterfs as the cause.

> It is just not right to blame _any_ userspace application for any
> kind of kernel lockups or hangs.

Well, I'm not saying that this is glusterfs' fault. What I'm saying is that it is very likely that glusterfs is at least triggering this fault.

> So any hang or lockup in the kernel can only be caused by a bug in
> itself, which could possibly be triggered by a specific user
> application.

Maybe, but don't you feel that this needs to be investigated in order to know what is really happening?

> What we will be fixing is failing over to other machines when the
> backend FS hangs. The reason why this was not a priority (so far
> atleast) is because a kernel is a trusted piece of software in the
> system, and when you are having a kernel which has a bug in the fs,
> you should just upgrade to a newer kernel.

Yes, but right now there is no evidence that this is a kernel bug. From a user's point of view, if this did not happen when using nfs and happens when using glusterfs, the most evident solution is to switch back to nfs (like you, we usually trust kernel stability over application stability) and not to do any kernel upgrade unless there is evidence that this is a kernel bug. A kernel upgrade could mean having to upgrade many other pieces of software that were working fine and that would need to be tested again.

> What we promise to fix is a way to (as best as possible) somehow
> translate a backend FS hang into a "subvolume down" status and
> consider that subvolume to be down. After that, you will _still_
> continue to face kernel hangs and lockups and just glusterfs will
> stop hanging. Your machines would still remain locked up.

That's great!

--
Thanx & best regards ...

David Saez Padros http://www.ols.es
On-Line Services 2000 S.L. telf +34 902 50 29 75
Re: [Gluster-users] Replication not working on server hang
Hi

>> a) documentation says "All operations that do not modify the file or
>> directory are sent to all the subvolumes and the first successful
>> reply is returned to the application", why is it blocking then? it's
>> supposed that the reply from the non blocked server will come first
>> and nothing will block, but clients are blocking on a simple ls
>> operation
>
> The calls (as you have seen in the logs as well) which are hanging are
> lookup calls, which have to be sent to all subvolumes to ensure all
> the copies are in sync.

OK, then the simplest fix would be to add a timeout for lookup calls. I would, however, prefer to optionally also have the first reply to the lookup sent to the application, and then wait in the background for the other replies so gluster can keep files in sync. This would eliminate this hang and also make the system more responsive.

BTW, will switching off some of the self-heal options in the client make glusterfs use only the first reply received to the lookup call?

--
Thanx & best regards ...

David Saez Padros http://www.ols.es
On-Line Services 2000 S.L. telf +34 902 50 29 75
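The proposal above (send the lookup to every subvolume, hand the first successful reply to the application, bound the wait with a timeout) can be sketched roughly as follows. This is only an illustration in Python of the general pattern, not how cluster/replicate is implemented: `first_successful_lookup`, the `healthy` and `hung` callables and the reply dictionary are all hypothetical stand-ins.

```python
import concurrent.futures as cf
import threading


def first_successful_lookup(subvolumes, path, timeout):
    """Send a lookup to every subvolume and return (name, reply) for the
    first one that answers successfully within `timeout` seconds.

    `subvolumes` maps a name to a callable(path) -> reply (hypothetical).
    Lookups that are still running are left to finish in the background.
    """
    pool = cf.ThreadPoolExecutor(max_workers=len(subvolumes))
    futures = {pool.submit(fn, path): name for name, fn in subvolumes.items()}
    try:
        for fut in cf.as_completed(futures, timeout=timeout):
            if fut.exception() is None:
                return futures[fut], fut.result()
    except cf.TimeoutError:
        pass  # nobody answered in time; report it below
    finally:
        pool.shutdown(wait=False)  # do not block on stuck workers
    raise LookupError(f"no successful reply for {path!r} within {timeout}s")


# Demo: one subvolume is stuck, the other answers immediately.
release = threading.Event()


def healthy(path):
    return {"path": path, "size": 42}


def hung(path):
    release.wait()  # blocks like a server whose backend fs is stuck
    raise OSError("backend answered too late")


name, reply = first_successful_lookup(
    {"data1": hung, "data2": healthy}, "/export/file", timeout=2.0
)
print(name, reply)
release.set()  # let the stuck worker finish so the interpreter can exit
```

In this sketch the late replies are simply dropped. A real implementation would still have to collect them in the background to decide whether self-heal is needed, which is the part the quoted reply says the lookup exists for.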
Re: [Gluster-users] Replication not working on server hang
> I'm wondering if there's some way for glusterfs to detect the flaws of the
> underlying operating system. I believe there's no bug-free file systems in
> the universe, so I believe it is the job of the glusterfs developer to
> specify which underlying filesystem is tested and supported. It's not good
> to simply say that glusterfs works on all real-world approximations to an
> imaginary bug-free posix filesystem.

I would be genuinely interested to know about another project which is geared up to be resilient against kernel hangs so that we can borrow some ideas on how to reliably detect kernel soft lockups or syscall hangs. As far as I know, even mature projects like Apache have not bothered fixing such hangs (or even detecting this kind of underlying OS flaw).

Avati
Re: [Gluster-users] Replication not working on server hang
On 08/29/2009 09:23 PM, Anand Avati wrote:
> sharing my point of view). What we promise to fix is a way to (as best
> as possible) somehow translate a backend FS hang into a "subvolume
> down" status and consider that subvolume to be down. After that, you
> will _still_ continue to face kernel hangs and lockups and just
> glusterfs will stop hanging. Your machines would still remain locked
> up.

Thanks, Anand. This is exactly what I would expect. I'm looking forward to seeing this fix. :-)

Cheers,
mark

--
Mark Mielke
Re: [Gluster-users] Replication not working on server hang
> i don't think that this is a hang on the file system itself, another
> user reported the same problem on another file system type and it's
> very unlikely that both file systems had the same problem. Also i have
> the problem with a ext3 file system which is very mature. I will say
> that is more likely that the failure is in the driver, the kernel itself
> or even in glusterfs than in the file system itself. The fact that this
> bug does not appear when using the file system directly or through nfs
> make glusterfs the most obvious candidate.

What we have here (kernel lockups and glusterfs on the same machine) might not be a coincidence. There might well be a correlation -- but by nature of the problem it is not right to treat this as a cause-effect relation with glusterfs being the cause. At best, glusterfs could be performing some IO pattern (extended attribute usage?) which might be triggering some bugs in the kernel. It is just not right to blame _any_ userspace application for any kind of kernel lockups or hangs. This is not just my personal opinion, but follows from the definition of OS kernels - they run in protected memory and communicate only via a secure system call interface and a few other mechanisms. So any hang or lockup in the kernel can only be caused by a bug in the kernel itself, which could possibly be triggered by a specific user application.

>> It's not a simple problem to solve. But, it should be solved.
>
> right, i'm also a developer and can understand that some developers
> may see some bug reports as a personal attack, but the ones in this
> list (at least me) see glusterfs as a good thing (that why we are
> using it) and we are not attacking the software or the authors, we
> are just reporting annoying situations that we feel should be
> corrected. Bug reports like these should be investigated and
> workarounds should be found so glusterfs volumes will be able to
> tolerate this 'other software' faults gracefully

Well, nothing has been taken personally so far. We acknowledge that it is a limitation in today's version of glusterfs that backend FS hangs resulting from kernel lockups are not handled by a graceful failover -- the reason being that internally glusterfs still does not translate "backend fs hang" into a "subvolume down" status. In fact glusterfs does not even recognize a "backend fs hang" situation as of today. This backend fs hang can happen only because of a kernel bug. It is true that this situation is not handled by glusterfs today. We will fix this soon.

What we will be fixing is failing over to other machines when the backend FS hangs. The reason why this was not a priority (so far at least) is that a kernel is a trusted piece of software in the system, and when you have a kernel which has a bug in the fs, you should just upgrade to a newer kernel. Kernel hangs and application hangs are very different (on which you are probably not sharing my point of view). What we promise to fix is a way to (as best as possible) somehow translate a backend FS hang into a "subvolume down" status and consider that subvolume to be down. After that, you will _still_ continue to face kernel hangs and lockups and just glusterfs will stop hanging. Your machines would still remain locked up.

Avati
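One way to picture "translate a backend FS hang into a subvolume down status" is a watchdog that probes the backend directory from a separate process and gives up after a deadline. The following is a minimal sketch in Python under stated assumptions and says nothing about how glusterfs will actually implement the fix: `probe_backend`, its status strings and the `probe_cmd` parameter are hypothetical names. The probe runs in a child process because a `stat` stuck inside the kernel cannot be interrupted; the parent therefore only polls and never blocks waiting for the child.

```python
import subprocess
import sys
import time


def probe_backend(path, timeout, probe_cmd=None):
    """Classify a backend directory as 'up', 'down' or 'hung'.

    'up'   : the probe finished successfully within `timeout` seconds
    'down' : the probe finished with an error (e.g. path missing)
    'hung' : the probe did not finish in time
    """
    # Default probe: a child interpreter that stats the path.
    cmd = probe_cmd or [sys.executable, "-c",
                        "import os, sys; os.stat(sys.argv[1])", path]
    child = subprocess.Popen(cmd, stdout=subprocess.DEVNULL,
                             stderr=subprocess.DEVNULL)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        rc = child.poll()  # non-blocking check
        if rc is not None:
            return "up" if rc == 0 else "down"
        time.sleep(0.05)
    # Best effort only: a process blocked in the kernel will not die
    # until its system call returns, so do not wait for it here.
    child.kill()
    return "hung"


# Demo: a healthy directory, and a simulated hang (a child that just sleeps).
status_ok = probe_backend(".", timeout=5.0)
status_hung = probe_backend(
    ".", timeout=0.5,
    probe_cmd=[sys.executable, "-c", "import time; time.sleep(30)"],
)
print(status_ok, status_hung)
```

A caller could treat 'hung' the same as 'down' and stop sending requests to that subvolume. As noted in the message above, this would only stop the file service from hanging; the machine with the stuck kernel stays locked up.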
Re: [Gluster-users] Replication not working on server hang
Jeff,

We are working on bug 144. We think one of the changes we plan to bring in 2.0.7 will fix this problem.

The discussion in this thread is about those situations where the backend filesystem on the server (the machine hosting the storage/posix volume) hangs (verified by kernel console logs) and that in turn results in the mountpoint hang. Your symptoms are similar in that the client side hangs, and we acknowledge that yours is definitely a glusterfs bug. This thread, however, is about the situation where the server-side kernel misbehaves and causes glusterfs to hang. The two actual problems are quite different.

Avati

On Sat, Aug 29, 2009 at 10:40 AM, Jeff Evans wrote:
> Hi All,
>
> I'm afraid that I have some more fuel to add to the glusterfs hanging "fire".
>
> Way back when experimenting with 1.3, I began experiencing hangs.
>
> Then we added the read-ahead Xlator to the server and the hangs miraculously stopped. That may well be a coincidence, I don't know, but we never hung while read-ahead was loaded.
>
> Then came version 2.0 and we hit a bug:
>
> http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=144
>
> So, we had to take out read-ahead and now we have the hangs again! Doh & Double Doh!
>
> This has forced me to take glusterfs out of production, and management is now questioning my decision to utilize it at all (a subscription won't be purchased anytime soon).
>
> Some points to note:
>
> I'm using ext3, the rest of my set-up is detailed in the above bugzilla report.
>
> My hangs have often been triggered with a grep -R on the glusterFS mount (yes, just reading!).
>
> None of my hangs have ever given me a single log entry.
>
> When hung, the server affected cannot be logged into. Just get the first line 'Last login:...'
>
> This, and other services I run, seem to indicate that existing processes that have already open FD's NOT on glusterfs can continue to execute, but no new FD's can be opened at all, system wide.
>
> To date there has been a lot of talk about the underlying FS being an issue in these cases.
>
> I seriously doubt it, & certainly not in the case of ext3.
>
> I agree that the server process shouldn't be able to hang a stable system, but what about the client?
>
> Could this be the work of GlusterFS/Fuse/Kernel interaction?
>
> Whatever the cause, it is one very large show stopper that we MUST rectify.
>
> We may well be dealing with several parallel issues. Finding common factors in our glusterfs instances should help us narrow down the search.
>
> Regards, Jeff.
Re: [Gluster-users] Replication not working on server hang
> Well, that never happened before when using nfs with the same computers, same disk, etc. for almost 2 years, so it's more than possible that glusterfs is the one triggering this supposed ext3 bug, but apart from this:
>
> a) The documentation says "All operations that do not modify the file or directory are sent to all the subvolumes and the first successful reply is returned to the application". Why is it blocking then? The reply from the non-blocked server is supposed to come first so that nothing blocks, but clients are blocking on a simple ls operation.

The calls which are hanging (as you have seen in the logs as well) are lookup calls, which have to be sent to all subvolumes to ensure all the copies are in sync.

> b) server1 (the non-blocked one) also has the volumes mounted like any other client, but with option read-subvolume set to the local volume. It also hangs when it was supposed to read from the local volume, not from the hung one.

The read calls are indeed served from read-subvolume, but that applies only to read() system calls, so that you can avoid bulk data transfer over the network. Calls like lookup() have to be sent to all subvolumes as long as they report to be "up". The problem is that in the current version there is no way to translate a "hanging backend fs" into a "down subvolume".

> c) Doesn't glusterfs ping the servers periodically to see whether they are available? If so, why doesn't it detect that situation?

It does, but in this case the server is up and running and replying with pongs. The current ping-pong only checks for network reachability to the server process.

Avati
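Editorial sketch: the routing rule described above (read() goes to read-subvolume, lookup() goes to every subvolume that reports "up") can be modelled in a few lines of Python. This is a toy model, not glusterfs code; `targets` and `completion_time` are hypothetical names, and it assumes that a lookup returns only once every targeted subvolume has replied, which is how the hang in this thread would arise.

```python
def targets(op, subvolumes, read_subvolume):
    """Pick which subvolumes an operation is sent to.

    subvolumes maps name -> True if the subvolume reports itself as "up".
    """
    up = [name for name, is_up in subvolumes.items() if is_up]
    if op == "read" and read_subvolume in up:
        return [read_subvolume]  # bulk data is served from one copy only
    return up                    # lookup and friends go to every "up" copy


def completion_time(op, subvolumes, read_subvolume, latency):
    """Seconds until the operation returns, assuming it waits for every target.

    latency maps name -> seconds until that subvolume replies;
    float("inf") models a hung backend that never replies.
    """
    return max(latency[name] for name in targets(op, subvolumes, read_subvolume))


if __name__ == "__main__":
    state = {"data1": True, "data2": True}  # data2 still answers pings
    latency = {"data1": 0.001, "data2": float("inf")}
    print(completion_time("read", state, "data1", latency))    # 0.001
    print(completion_time("lookup", state, "data1", latency))  # inf
    state["data2"] = False  # what a "subvolume down" status would change
    print(completion_time("lookup", state, "data1", latency))  # 0.001
```

In this model the hung server blocks every lookup for as long as it is still considered "up", even though reads alone would succeed.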
Re: [Gluster-users] Replication not working on server hang
Hi All,

I'm afraid that I have some more fuel to add to the glusterfs hanging "fire".

Way back when experimenting with 1.3, I began experiencing hangs.

Then we added the read-ahead Xlator to the server and the hangs miraculously stopped. That may well be a coincidence, I don't know, but we never hung while read-ahead was loaded.

Then came version 2.0 and we hit a bug:

http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=144

So, we had to take out read-ahead and now we have the hangs again! Doh & Double Doh!

This has forced me to take glusterfs out of production, and management is now questioning my decision to utilize it at all (a subscription won't be purchased anytime soon).

Some points to note:

I'm using ext3, the rest of my set-up is detailed in the above bugzilla report.

My hangs have often been triggered with a grep -R on the glusterFS mount (yes, just reading!).

None of my hangs have ever given me a single log entry.

When hung, the server affected cannot be logged into. Just get the first line 'Last login:...'

This, and other services I run, seem to indicate that existing processes that have already open FD's NOT on glusterfs can continue to execute, but no new FD's can be opened at all, system wide.

To date there has been a lot of talk about the underlying FS being an issue in these cases.

I seriously doubt it, & certainly not in the case of ext3.

I agree that the server process shouldn't be able to hang a stable system, but what about the client?

Could this be the work of GlusterFS/Fuse/Kernel interaction?

Whatever the cause, it is one very large show stopper that we MUST rectify.

We may well be dealing with several parallel issues. Finding common factors in our glusterfs instances should help us narrow down the search.

Regards, Jeff.
Re: [Gluster-users] Replication not working on server hang
Just wanted to chime in that the EXACT same issue has occurred for me. I was going to work through the support chain, but given that others are seeing it and hopefully have logs, perhaps I don't need to do so. Basically, I hope it can be fixed!

Justice London
E-mail: jlon...@lawinfo.com

-----Original Message-----
From: gluster-users-boun...@gluster.org [mailto:gluster-users-boun...@gluster.org] On Behalf Of Stephan von Krawczynski
Sent: Friday, August 28, 2009 4:33 AM
To: David Saez Padros
Cc: Anand Avati; gluster-users
Subject: Re: [Gluster-users] Replication not working on server hang

> [...]
> Glusterfs log only shows lines like these:
>
> [2009-08-28 09:19:28] E [client-protocol.c:292:call_bail] data2: bailing out frame LOOKUP(32) frame sent = 2009-08-28 08:49:18. frame-timeout = 1800
> [2009-08-28 09:23:38] E [client-protocol.c:292:call_bail] data2: bailing out frame LOOKUP(32) frame sent = 2009-08-28 08:53:28. frame-timeout = 1800
>
> Once server2 has been rebooted, all gluster fs become available again on all clients and the hung df and ls processes terminate, but it is difficult to understand why a replicated share that must survive a failure on one server does not.

You are suffering from the problem we talked about a few days ago on the list. If your local fs somehow produces a deadlock on one server, glusterfs is currently unable to cope with the situation and just _waits_ for things to come. This deadlocks your clients, too, without any need. Your experience backs my criticism of the handling of these situations.

--
Regards, Stephan
Re: [Gluster-users] Replication not working on server hang
Hi

>> a) The documentation says "All operations that do not modify the file or directory are sent to all the subvolumes and the first successful reply is returned to the application". Why is it blocking then? The reply from the non-blocked server is supposed to come first so that nothing blocks, but clients are blocking on a simple ls operation.
>
> My impression is that you have to imagine the setup as a serialized queue on the server. If there was one operation with a hang, all future ones will be hanging, too.

If it says that it will take the first answer to come, then the only way to do so is for the client not to block on any request; blocking would make very little sense. The only thing I can imagine is that it blocked when trying to write the last access time of some folder.

> My idea of a solution would be to implement something like a bail-out timeout, configurable in the client vol file for every brick. This would allow intermixing slow and fast servers, and it would cope with a situation where some clients are far away with slow connections and others are nearby with very fast connections to the same servers.

Completely agree.

> The biggest problem is probably not bailing out servers, but re-integrating them. Currently there seems to be no userspace tool to tell a client to re-integrate a formerly dead server. Obviously this should not happen auto-magically, to prevent flapping.

In our case, rebooting the hung server made all clients run again without any notable bad effect. The ones that were reading from the blocked volume continued reading as if nothing had happened; in this respect everything ran OK.

--
Best regards ...

David Saez Padros http://www.ols.es
On-Line Services 2000 S.L. telf +34 902 50 29 75
Re: [Gluster-users] Replication not working on server hang
Hi

> NFS doesn't offer the options that GlusterFS does, so comparing 1:1 doesn't really make sense.

I agree, but I was not comparing features. What I wanted to point out is that this failure never happened when using nfs and that it almost always happens when using glusterfs. If you just consider nfs not to be very reliable, then it's even better for comparing against glusterfs in that situation.

> There is no legitimate reason at all that the underlying file system should lock up as a result of this. For some file systems, there is a hang, not a lock up, until some critical section of the disk or in-memory representation is finished being worked on. I understand that ext3 removing a large file is an example of something that might trigger this for a period.

I don't think that this is a hang in the file system itself. Another user reported the same problem on another file system type, and it's very unlikely that both file systems had the same problem. Also, I have the problem with an ext3 file system, which is very mature. I would say it is more likely that the failure is in the driver, the kernel itself, or even in glusterfs than in the file system itself. The fact that this bug does not appear when using the file system directly or through nfs makes glusterfs the most obvious candidate.

> It's not a simple problem to solve. But, it should be solved.

Right, I'm also a developer and can understand that some developers may see some bug reports as a personal attack, but the ones on this list (at least me) see glusterfs as a good thing (that's why we are using it) and we are not attacking the software or the authors; we are just reporting annoying situations that we feel should be corrected. Bug reports like these should be investigated and workarounds should be found so that glusterfs volumes will be able to tolerate these 'other software' faults gracefully.

--
Best regards ...

David Saez Padros http://www.ols.es
On-Line Services 2000 S.L. telf +34 902 50 29 75
Re: [Gluster-users] Replication not working on server hang
I'm wondering if there's some way for glusterfs to detect the flaws of the underlying operating system. I believe there are no bug-free file systems in the universe, so I believe it is the job of the glusterfs developers to specify which underlying filesystems are tested and supported. It's not good to simply say that glusterfs works on all real-world approximations of an imaginary bug-free posix filesystem.

- Wei

Stephan von Krawczynski wrote:
> On Fri, 28 Aug 2009 14:28:51 +0200 David Saez Padros wrote:
>
>> Hi
>>
>> Well, that never happened before when using nfs with the same computers, same disk, etc. for almost 2 years, so it's more than possible that glusterfs is the one triggering this supposed ext3 bug, but apart from this:
>
> I can assure you that you will never have an agreement on this point on this list; this happens to be the only bug-free software in the universe according to its authors ;-)
>
>> a) The documentation says "All operations that do not modify the file or directory are sent to all the subvolumes and the first successful reply is returned to the application". Why is it blocking then? The reply from the non-blocked server is supposed to come first so that nothing blocks, but clients are blocking on a simple ls operation.
>
> My impression is that you have to imagine the setup as a serialized queue on the server. If there was one operation with a hang, all future ones will be hanging, too.
>
>> b) server1 (the non-blocked one) also has the volumes mounted like any other client, but with option read-subvolume set to the local volume. It also hangs when it was supposed to read from the local volume, not from the hung one.
>
> This is exactly my experience. You cannot make it work either way. There seems to be some locking across all used servers.
>
>> c) Doesn't glusterfs ping the servers periodically to see whether they are available? If so, why doesn't it detect that situation?
>
> Well, this ping-pong procedure seems to only detect offline servers (i.e. network down), but is obviously not able to give hints about whether the server is operational.
>
> My idea of a solution would be to implement something like a bail-out timeout, configurable in the client vol file for every brick. This would allow intermixing slow and fast servers, and it would cope with a situation where some clients are far away with slow connections and others are nearby with very fast connections to the same servers.
>
> The biggest problem is probably not bailing out servers, but re-integrating them. Currently there seems to be no userspace tool to tell a client to re-integrate a formerly dead server. Obviously this should not happen auto-magically, to prevent flapping.
>
>>>> [...]
>>>> Glusterfs log only shows lines like these:
>>>>
>>>> [2009-08-28 09:19:28] E [client-protocol.c:292:call_bail] data2: bailing out frame LOOKUP(32) frame sent = 2009-08-28 08:49:18. frame-timeout = 1800
>>>> [2009-08-28 09:23:38] E [client-protocol.c:292:call_bail] data2: bailing out frame LOOKUP(32) frame sent = 2009-08-28 08:53:28. frame-timeout = 1800
>>>>
>>>> Once server2 has been rebooted, all gluster fs become available again on all clients and the hung df and ls processes terminate, but it is difficult to understand why a replicated share that must survive a failure on one server does not.
>>>
>>> You are suffering from the problem we talked about a few days ago on the list. If your local fs somehow produces a deadlock on one server, glusterfs is currently unable to cope with the situation and just _waits_ for things to come. This deadlocks your clients, too, without any need. Your experience backs my criticism of the handling of these situations.
>>
>> --
>> Best regards ...
>>
>> David Saez Padros http://www.ols.es
>> On-Line Services 2000 S.L. telf +34 902 50 29 75
Re: [Gluster-users] Replication not working on server hang
On 08/28/2009 09:28 AM, Stephan von Krawczynski wrote:
> On Fri, 28 Aug 2009 14:28:51 +0200 David Saez Padros wrote:
>
>> Hi
>>
>> Well, that never happened before when using nfs with the same computers, same disk, etc. for almost 2 years, so it's more than possible that glusterfs is the one triggering this supposed ext3 bug, but apart from this:

FYI: NFS will have problems in such a situation as well. If NFS tries to write to a local file system which hangs, NFS will hang too. With soft mounts, these can time out, but that's not clearly the better option either. Anybody who has worked with NFS for some time has seen this before. Once upon a time at our company (1990?) all mounts were hard mounts, and when a problem occurred, the processes completely locked out the user. Control-C wouldn't even work!

NFS doesn't offer the options that GlusterFS does, so comparing 1:1 doesn't really make sense.

As for "not experiencing it in 2 years": I think people really need to understand that GlusterFS is a *user space application*. Most specifically, this means that it *only* runs standard system calls that any other program such as /bin/cat, /bin/tar, or /bin/du would run. There is no legitimate reason at all that the underlying file system should lock up as a result of this. For some file systems, there is a hang, not a lock up, until some critical section of the disk or in-memory representation is finished being worked on. I understand that ext3 removing a large file is an example of something that might trigger this for a period.

I think GlusterFS should try to be more resilient to this sort of thing as well, but comparisons to NFS are invalid, and treating this as a GlusterFS problem only (i.e. not tracking down the FS vendor and having them fix their FS) is also invalid.

It's not a simple problem to solve. But, it should be solved. RAID disks, for example, are often tuned to significantly reduce the retry attempts and timeouts, so that the system remains responsive even when the disk is failing. GlusterFS should do the same. It's not a perfect solution (having a long-running operation time out too soon is a risk), but it's a necessary solution for any large-scale cluster.

Cheers,
mark

--
Mark Mielke
Re: [Gluster-users] Replication not working on server hang
On Fri, 28 Aug 2009 14:28:51 +0200 David Saez Padros wrote:

> Hi
>
> Well, that never happened before when using nfs with the same computers, same disk, etc. for almost 2 years, so it's more than possible that glusterfs is the one triggering this supposed ext3 bug, but apart from this:

I can assure you that you will never have an agreement on this point on this list; this happens to be the only bug-free software in the universe according to its authors ;-)

> a) The documentation says "All operations that do not modify the file or directory are sent to all the subvolumes and the first successful reply is returned to the application". Why is it blocking then? The reply from the non-blocked server is supposed to come first so that nothing blocks, but clients are blocking on a simple ls operation.

My impression is that you have to imagine the setup as a serialized queue on the server. If there was one operation with a hang, all future ones will be hanging, too.

> b) server1 (the non-blocked one) also has the volumes mounted like any other client, but with option read-subvolume set to the local volume. It also hangs when it was supposed to read from the local volume, not from the hung one.

This is exactly my experience. You cannot make it work either way. There seems to be some locking across all used servers.

> c) Doesn't glusterfs ping the servers periodically to see whether they are available? If so, why doesn't it detect that situation?

Well, this ping-pong procedure seems to only detect offline servers (i.e. network down), but is obviously not able to give hints about whether the server is operational.

My idea of a solution would be to implement something like a bail-out timeout, configurable in the client vol file for every brick. This would allow intermixing slow and fast servers, and it would cope with a situation where some clients are far away with slow connections and others are nearby with very fast connections to the same servers.

The biggest problem is probably not bailing out servers, but re-integrating them. Currently there seems to be no userspace tool to tell a client to re-integrate a formerly dead server. Obviously this should not happen auto-magically, to prevent flapping.

>>> [...]
>>> Glusterfs log only shows lines like these:
>>>
>>> [2009-08-28 09:19:28] E [client-protocol.c:292:call_bail] data2: bailing out frame LOOKUP(32) frame sent = 2009-08-28 08:49:18. frame-timeout = 1800
>>> [2009-08-28 09:23:38] E [client-protocol.c:292:call_bail] data2: bailing out frame LOOKUP(32) frame sent = 2009-08-28 08:53:28. frame-timeout = 1800
>>>
>>> Once server2 has been rebooted, all gluster fs become available again on all clients and the hung df and ls processes terminate, but it is difficult to understand why a replicated share that must survive a failure on one server does not.
>>
>> You are suffering from the problem we talked about a few days ago on the list. If your local fs somehow produces a deadlock on one server, glusterfs is currently unable to cope with the situation and just _waits_ for things to come. This deadlocks your clients, too, without any need. Your experience backs my criticism of the handling of these situations.
>
> --
> Best regards ...
>
> David Saez Padros http://www.ols.es
> On-Line Services 2000 S.L. telf +34 902 50 29 75

--
Regards, Stephan
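Editorial sketch: the proposal above (a bail-out timeout per brick, plus re-integration only on explicit request so that bricks do not flap) could look roughly like this in Python. Everything here is hypothetical and for illustration only: `Brick`, `fan_out`, `reintegrate`, `fast` and `hung` are invented names, the remote calls are plain local functions, and none of it is an existing glusterfs option or tool.

```python
import concurrent.futures
import time


class Brick:
    """Hypothetical client-side view of one brick (not a glusterfs type)."""

    def __init__(self, name, bail_out_s, handler):
        self.name = name
        self.bail_out_s = bail_out_s  # per-brick bail-out timeout in seconds
        self.handler = handler        # callable standing in for the remote call
        self.up = True


def fan_out(bricks, request, pool):
    """Send request to every brick that is up; bail out bricks that are too slow.

    Returns {brick name: reply} for bricks that answered within their timeout.
    """
    started = time.monotonic()
    pending = {b.name: (b, pool.submit(b.handler, request)) for b in bricks if b.up}
    replies = {}
    for name, (brick, future) in pending.items():
        remaining = max(0.0, started + brick.bail_out_s - time.monotonic())
        try:
            replies[name] = future.result(timeout=remaining)
        except concurrent.futures.TimeoutError:
            brick.up = False  # bailed out; stays down until reintegrate()
    return replies


def reintegrate(brick):
    """Explicit operator action; nothing brings a brick back automatically."""
    brick.up = True


def fast(request):
    return "ok:" + request


def hung(request):
    time.sleep(0.5)  # simulated hang, longer than the bail-out timeout
    return "late"


if __name__ == "__main__":
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        bricks = [Brick("data1", 0.1, fast), Brick("data2", 0.1, hung)]
        print(fan_out(bricks, "lookup", pool))  # {'data1': 'ok:lookup'}
        print([b.up for b in bricks])           # [True, False]
```

Giving each brick its own `bail_out_s` is what would let slow and fast servers be mixed; keeping `reintegrate` a separate, manual step reflects the concern about flapping.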
Re: [Gluster-users] Replication not working on server hang
Hi

Well, that never happened before when using nfs with the same computers, same disk, etc. for almost 2 years, so it's more than possible that glusterfs is the one triggering this supposed ext3 bug, but apart from this:

a) The documentation says "All operations that do not modify the file or directory are sent to all the subvolumes and the first successful reply is returned to the application". Why is it blocking then? The reply from the non-blocked server is supposed to come first so that nothing blocks, but clients are blocking on a simple ls operation.

b) server1 (the non-blocked one) also has the volumes mounted like any other client, but with option read-subvolume set to the local volume. It also hangs when it was supposed to read from the local volume, not from the hung one.

c) Doesn't glusterfs ping the servers periodically to see whether they are available? If so, why doesn't it detect that situation?

>> [...]
>> Glusterfs log only shows lines like these:
>>
>> [2009-08-28 09:19:28] E [client-protocol.c:292:call_bail] data2: bailing out frame LOOKUP(32) frame sent = 2009-08-28 08:49:18. frame-timeout = 1800
>> [2009-08-28 09:23:38] E [client-protocol.c:292:call_bail] data2: bailing out frame LOOKUP(32) frame sent = 2009-08-28 08:53:28. frame-timeout = 1800
>>
>> Once server2 has been rebooted, all gluster fs become available again on all clients and the hung df and ls processes terminate, but it is difficult to understand why a replicated share that must survive a failure on one server does not.
>
> You are suffering from the problem we talked about a few days ago on the list. If your local fs somehow produces a deadlock on one server, glusterfs is currently unable to cope with the situation and just _waits_ for things to come. This deadlocks your clients, too, without any need. Your experience backs my criticism of the handling of these situations.

--
Best regards ...

David Saez Padros http://www.ols.es
On-Line Services 2000 S.L. telf +34 902 50 29 75
Re: [Gluster-users] Replication not working on server hang
> [...]
> Glusterfs log only shows lines like these:
>
> [2009-08-28 09:19:28] E [client-protocol.c:292:call_bail] data2: bailing out frame LOOKUP(32) frame sent = 2009-08-28 08:49:18. frame-timeout = 1800
> [2009-08-28 09:23:38] E [client-protocol.c:292:call_bail] data2: bailing out frame LOOKUP(32) frame sent = 2009-08-28 08:53:28. frame-timeout = 1800
>
> Once server2 has been rebooted, all gluster fs become available again on all clients and the hung df and ls processes terminate, but it is difficult to understand why a replicated share that must survive a failure on one server does not.

You are suffering from the problem we talked about a few days ago on the list. If your local fs somehow produces a deadlock on one server, glusterfs is currently unable to cope with the situation and just _waits_ for things to come. This deadlocks your clients, too, without any need. Your experience backs my criticism of the handling of these situations.

--
Regards, Stephan
[Gluster-users] Replication not working on server hang
Hi

> - "David Saez Padros" wrote:
>
>> In the problem we have, the server also hung to the point that there was no way to access it, and we ended up rebooting the server to regain access to it.
>
> Do you mean you were unable to log in to the machine over the network? Unable to get a responsive console shell? The machine would not respond to ICMP on the network? Do you still have the logfiles and volfiles, and can you describe the steps to reproduce in a bug report?
>
> As a rule of thumb, if your server hangs to the degree of not even having a usable shell, it just means that heavy IO via glusterfs triggered some bug in the operating system. Try to get kernel output via dmesg or console logs if you have any. glusterfsd only issues system calls and does not do anything funky with the server. Think of some application local to the server causing such a hang. glusterfsd is no different in that respect.

We had the same problem again with the following setup: server1 and server2 are Dell PE 2900 computers with Debian (kernel 2.6.26-2 x64) running gluster 2.0.1-1. Each server has 6 SAS disks unified and exported with gluster, which are mounted as a replicated gluster fs on all clients. The problem happened when server2 was under heavy load (8 wrf processes running) and one client was copying a large number of files to the replicated gluster fs. We have been running the same setup on the same computers, but using nfs instead of glusterfs, for almost 2 years without having this problem.

When the problem happened, server2 was totally locked up: it responds to pings and accepts ssh connections, but once the username and password are correctly entered nothing happens, and after some seconds ssh disconnects. Direct terminal access (keyboard) is also impossible. The kernel log shows a "BUG: soft lockup - CPU#1 stuck for ..." for each core (all for wrf processes) with a trace at different points of wrf.exe. Every time we had this problem, it was triggered by copying a lot of files to the gluster fs.

It may or may not be related to glusterfs, but the worst thing is that when this happens, access to the replicated gluster fs from any client also hangs. In that case df hangs on one of the glusterfs shares (df cannot be terminated in any way, including ctrl-c or kill -KILL). Although df shows the other share, any ls operation on that share hangs, and ls also cannot be terminated in any way, including ctrl-c or kill -KILL.

Glusterfs log only shows lines like these:

[2009-08-28 09:19:28] E [client-protocol.c:292:call_bail] data2: bailing out frame LOOKUP(32) frame sent = 2009-08-28 08:49:18. frame-timeout = 1800
[2009-08-28 09:23:38] E [client-protocol.c:292:call_bail] data2: bailing out frame LOOKUP(32) frame sent = 2009-08-28 08:53:28. frame-timeout = 1800

Once server2 has been rebooted, all gluster fs become available again on all clients and the hung df and ls processes terminate, but it is difficult to understand why a replicated share that must survive a failure on one server does not.

--
Best regards ...

David Saez Padros http://www.ols.es
On-Line Services 2000 S.L. telf +34 902 50 29 75
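Editorial sketch: the two call_bail lines in the report can be read mechanically to see how long each LOOKUP was outstanding before the client gave up on it. The Python below is a small illustration written against exactly those two lines; the regular expression and the name `parse_bail` are assumptions of this sketch, not a documented glusterfs log format.

```python
import re
from datetime import datetime

# Pattern derived only from the two log lines quoted in this thread.
BAIL_RE = re.compile(
    r"\[(?P<logged>[\d-]+ [\d:]+)\] E \[[^\]]*call_bail\] (?P<subvol>\w+): "
    r"bailing out frame (?P<op>\w+)\(\d+\) "
    r"frame sent = (?P<sent>[\d-]+ [\d:]+)\. frame-timeout = (?P<timeout>\d+)"
)


def parse_bail(line):
    """Return subvolume, operation, seconds waited and frame-timeout, or None."""
    m = BAIL_RE.search(line)
    if m is None:
        return None
    fmt = "%Y-%m-%d %H:%M:%S"
    logged = datetime.strptime(m["logged"], fmt)
    sent = datetime.strptime(m["sent"], fmt)
    return {
        "subvolume": m["subvol"],
        "op": m["op"],
        "waited_s": int((logged - sent).total_seconds()),
        "frame_timeout_s": int(m["timeout"]),
    }


if __name__ == "__main__":
    line = (
        "[2009-08-28 09:19:28] E [client-protocol.c:292:call_bail] data2: "
        "bailing out frame LOOKUP(32) frame sent = 2009-08-28 08:49:18. "
        "frame-timeout = 1800"
    )
    print(parse_bail(line)["waited_s"])  # 1810
```

Both quoted lines give 1810 seconds, i.e. each LOOKUP sent to data2 was abandoned only about ten seconds after the 1800-second frame-timeout expired, which matches clients appearing hung for roughly half an hour per call.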