Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
What version? File a bugzilla (oss.oracle.com/bugzilla) with all the version
details, etc. Easier to track issues that-a-way.

mike wrote:
> Here's another issue:
>
> I have a client with 9049 files/dirs in a specific dir. The first time
> I did an ls /that/dir/ it froze up - and hitting control C actually
> made it behave interestingly.
>
> Every two times I hit control C, I got one of these permission denied
> lines (the files exist, I can open them, I can edit them, I am root,
> too)
>
> [EMAIL PROTECTED] ~]# ls /home/mark/web/domain.com/
> ls: cannot access /home/mark/web/domain.com/file32.htm: Permission denied
> ls: cannot access /home/mark/web/domain.com/file.htm: Permission denied
>
> etc.
>
> Also I am seeing in my webserver log once in a while:
>
> 2008/04/21 18:10:09 [crit] 6917#0: *1256684 stat()
> "/home/mike/web/michaelshadle.com/" failed (13: Permission denied),
> client: 1.2.3.4, server: michaelshadle.com, request: "GET / HTTP/1.0",
> host: "michaelshadle.com"
>
> a stat() call fails on a directory, and that directory not only
> exists, it's readable and behaves properly 99.999% of the time. these
> random stat() failures are a bit odd. However, this hasn't thrown an
> error on my proxy server, which is nice. But it is something I am
> worried about. I can stat it in shell:
>
> [EMAIL PROTECTED] web]# stat /home/mike/web/michaelshadle.com/
>   File: `/home/mike/web/michaelshadle.com/'
>   Size: 4096       Blocks: 64         IO Block: 32768  directory
> Device: 811h/2065d  Inode: 135710860  Links: 15
> Access: (0711/drwx--x--x)  Uid: ( 1000/mike)  Gid: ( 1000/mike)
> Access: 2008-03-25 04:27:52.0 -0700
> Modify: 2008-03-04 02:17:22.0 -0800
> Change: 2008-03-27 04:45:57.0 -0700
> [EMAIL PROTECTED] web]#
>
> Does anything there look out of place? Nothing shows up in dmesg or
> any of my /var/log/* logs...

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
Here's another issue:

I have a client with 9049 files/dirs in a specific dir. The first time I did
an ls /that/dir/ it froze up - and hitting control C actually made it behave
interestingly.

Every two times I hit control C, I got one of these permission denied lines
(the files exist, I can open them, I can edit them, I am root, too)

[EMAIL PROTECTED] ~]# ls /home/mark/web/domain.com/
ls: cannot access /home/mark/web/domain.com/file32.htm: Permission denied
ls: cannot access /home/mark/web/domain.com/file.htm: Permission denied

etc.

Also I am seeing in my webserver log once in a while:

2008/04/21 18:10:09 [crit] 6917#0: *1256684 stat()
"/home/mike/web/michaelshadle.com/" failed (13: Permission denied),
client: 1.2.3.4, server: michaelshadle.com, request: "GET / HTTP/1.0",
host: "michaelshadle.com"

a stat() call fails on a directory, and that directory not only exists, it's
readable and behaves properly 99.999% of the time. these random stat()
failures are a bit odd. However, this hasn't thrown an error on my proxy
server, which is nice. But it is something I am worried about. I can stat it
in shell:

[EMAIL PROTECTED] web]# stat /home/mike/web/michaelshadle.com/
  File: `/home/mike/web/michaelshadle.com/'
  Size: 4096       Blocks: 64         IO Block: 32768  directory
Device: 811h/2065d  Inode: 135710860  Links: 15
Access: (0711/drwx--x--x)  Uid: ( 1000/mike)  Gid: ( 1000/mike)
Access: 2008-03-25 04:27:52.0 -0700
Modify: 2008-03-04 02:17:22.0 -0800
Change: 2008-03-27 04:45:57.0 -0700
[EMAIL PROTECTED] web]#

Does anything there look out of place? Nothing shows up in dmesg or any of
my /var/log/* logs...

On 4/21/08, Herbert van den Bergh <[EMAIL PROTECTED]> wrote:
> Does the web server that the proxy server talks to have any extended
> debugging you can turn on? In particular, would it be able to log
> timestamps of things it does, so you can narrow down where the hiccup
> occurs? A brute force method to do this would be to run strace -T on all
> server processes, and look for things that take much longer than they
> should, like disk reads exceeding 100ms, or other syscalls taking much
> longer than usual. Ideally you'd have some timing around code you suspect,
> and log a message if the time exceeds some configurable limit.
>
> Thanks,
> Herbert.
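To pin down when the random stat() failures above actually happen, one low-impact option is to poll the directory and log timestamped failures, then line those timestamps up against the proxy's error log. A minimal sketch; the path, poll count, interval, and log destination below are placeholders, not anything from the thread:

```shell
# Poll a path with stat and log each failure with a timestamp, so the
# sporadic EACCES errors can be correlated with proxy timeouts.
# Arguments: path, number of polls, sleep between polls in seconds.
watch_stat() {
    path="$1"; count="$2"; interval="${3:-1}"
    i=0
    while [ "$i" -lt "$count" ]; do
        if ! stat "$path" >/dev/null 2>&1; then
            echo "$(date '+%Y-%m-%d %H:%M:%S') stat failed: $path"
        fi
        i=$((i + 1))
        sleep "$interval"
    done
}

# Hypothetical usage, appending to a log for later correlation:
#   watch_stat /home/mike/web/michaelshadle.com/ 86400 1 >> /var/log/statwatch.log
```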
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
Is there any OCFS2 debugging I could turn on as well? To log perhaps every
file request and how long it took, or if something hits a threshold, etc?

I think I can turn on the webserver's debugging (if not, the author is
pretty responsive) and hopefully find the bottleneck through there.

On 4/21/08, Herbert van den Bergh <[EMAIL PROTECTED]> wrote:
> Does the web server that the proxy server talks to have any extended
> debugging you can turn on? In particular, would it be able to log
> timestamps of things it does, so you can narrow down where the hiccup
> occurs? A brute force method to do this would be to run strace -T on all
> server processes, and look for things that take much longer than they
> should, like disk reads exceeding 100ms, or other syscalls taking much
> longer than usual. Ideally you'd have some timing around code you suspect,
> and log a message if the time exceeds some configurable limit.
>
> Thanks,
> Herbert.
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
Does the web server that the proxy server talks to have any extended
debugging you can turn on? In particular, would it be able to log timestamps
of things it does, so you can narrow down where the hiccup occurs? A brute
force method to do this would be to run strace -T on all server processes,
and look for things that take much longer than they should, like disk reads
exceeding 100ms, or other syscalls taking much longer than usual. Ideally
you'd have some timing around code you suspect, and log a message if the
time exceeds some configurable limit.

Thanks,
Herbert.

mike wrote:
> You're right, it -is- possible, but if you look at it (and I can log
> it for hours) it only seems to do that right before I get a timeout
> message from the proxy. The two appear to be related.
>
> I will continue to monitor this and make sure that my hypothesis is
> correct. Something is flaking out every so often.
>
> I get this on my nginx proxy server:
>
> 2008/04/21 17:37:01 [error] 1256#0: *7406286 upstream timed out (110:
> Connection timed out) while reading response header from upstream,
> client: 1.2.3.4, server: lvs01.domain.com, request: "GET /someURL.php
> HTTP/1.1", upstream: "http://10.13.5.12:80/someURL.php",
> host: "somedomain.com", referrer: "http://somedomain.com/someURL.php"
>
> That only happens after it's sitting for 3 real-time seconds waiting
> for a reply from the server. Note: this happens no matter what proxy
> and webserver I use. It does not seem to be anything related to that.
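Herbert's strace -T suggestion can be turned into a crude filter: strace -T appends each syscall's elapsed time in angle brackets at the end of the line, so slow calls can be picked out with awk. A sketch; the 100ms threshold and the nginx pid lookup in the usage comment are examples, not anything prescribed in the thread:

```shell
# slow_syscalls: read `strace -T` output on stdin and print only the
# syscalls whose elapsed time (the trailing "<seconds>") exceeds a
# threshold in seconds, default 0.100 (100ms).
slow_syscalls() {
    awk -v t="${1:-0.100}" '
        match($0, /<[0-9]+\.[0-9]+>$/) {
            # Strip the surrounding angle brackets to get the duration.
            elapsed = substr($0, RSTART + 1, RLENGTH - 2)
            if (elapsed + 0 > t + 0) print
        }'
}

# Hypothetical usage against one server process:
#   strace -T -f -p "$(pidof -s nginx)" 2>&1 | slow_syscalls 0.100
```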
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
You're right, it -is- possible, but if you look at it (and I can log it for
hours) it only seems to do that right before I get a timeout message from
the proxy. The two appear to be related.

I will continue to monitor this and make sure that my hypothesis is correct.
Something is flaking out every so often.

I get this on my nginx proxy server:

2008/04/21 17:37:01 [error] 1256#0: *7406286 upstream timed out (110:
Connection timed out) while reading response header from upstream,
client: 1.2.3.4, server: lvs01.domain.com, request: "GET /someURL.php
HTTP/1.1", upstream: "http://10.13.5.12:80/someURL.php",
host: "somedomain.com", referrer: "http://somedomain.com/someURL.php"

That only happens after it's sitting for 3 real-time seconds waiting for a
reply from the server. Note: this happens no matter what proxy and webserver
I use. It does not seem to be anything related to that.

On 4/21/08, Herbert van den Bergh <[EMAIL PROTECTED]> wrote:
> Mike,
>
> Are you sure it's not possible for sdb to be idle for just 1 second? If you
> look at the interval right after the one you pointed out, you'll see r/s is
> 2.97 and w/s is .99, so it did 3 reads and 1 write in that one second
> interval. The device appears to be used very little. I think it's quite
> possible that some 1 second intervals have no reads or writes at all, don't
> you think?
>
> Thanks,
> Herbert.
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
Mike,

Are you sure it's not possible for sdb to be idle for just 1 second? If you
look at the interval right after the one you pointed out, you'll see r/s is
2.97 and w/s is .99, so it did 3 reads and 1 write in that one second
interval. The device appears to be used very little. I think it's quite
possible that some 1 second intervals have no reads or writes at all, don't
you think?

Thanks,
Herbert.

mike wrote:
> Thanks.
>
> If I have the opportunity to run the (buggy) new kernel again I will
> try this. That is definitely a problem and I think I need to set the
> oracle behavior to crash and not auto reboot for this to be effective,
> right?
>
> That is just one issue.
> 1) 2.6.24-16 with load completely crashes node producing largest i/o
> 2) 2.6.22-19 utilization drops to 0% and causes a hiccup randomly (I
> don't see a pattern and no batch jobs, or other things running at the
> time it happens) - this is more important as it still is happening
> even though I'm running the more "stable" kernel.
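The zero-utilization intervals under discussion here can be flagged automatically rather than eyeballed: a small sketch that scans timestamped extended iostat output (the `Time:` / device-line format quoted in this thread) and prints any one-second interval where a given device shows 0.00 in the final %util column. The device name and the `iostat` invocation in the usage comment are examples:

```shell
# idle_intervals: read timestamped `iostat -x` output on stdin and
# report each interval in which the named device shows 0.00 %util
# (the last column of the device line).
idle_intervals() {
    awk -v dev="$1" '
        /^Time:/ { ts = $2 " " $3 }                 # remember the interval timestamp
        $1 == dev && $NF + 0 == 0 {                 # device line with zero %util
            print ts ": " dev " idle (%util 0.00)"
        }'
}

# Hypothetical usage, sampling once a second: iostat -t -x 1 | idle_intervals sdb
```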
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
Thanks.

If I have the opportunity to run the (buggy) new kernel again I will try
this. That is definitely a problem and I think I need to set the oracle
behavior to crash and not auto reboot for this to be effective, right?

That is just one issue.
1) 2.6.24-16 with load completely crashes node producing largest i/o
2) 2.6.22-19 utilization drops to 0% and causes a hiccup randomly (I don't
see a pattern and no batch jobs, or other things running at the time it
happens) - this is more important as it still is happening even though I'm
running the more "stable" kernel.

On 4/21/08, Sunil Mushran <[EMAIL PROTECTED]> wrote:
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/networking/netconsole.txt;h=3c2f2b3286385337ce5ec24afebd4699dd1e6e0a;hb=HEAD
>
> netconsole is a facility to capture oops traces. It is not a console
> per se and does not require a head/gtk/x11 etc to work. The link above
> explains the usage, etc.
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/networking/netconsole.txt;h=3c2f2b3286385337ce5ec24afebd4699dd1e6e0a;hb=HEAD netconsole is a facility to capture oops traces. It is not a console per se and does not require a head/gtk/x11 etc to work. The link above explains the usage, etc.
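For reference, a minimal netconsole setup along the lines Sunil describes might look like the following. This is only a sketch: the IP addresses, MAC address, interface name, ports and log path are placeholders, not values taken from this cluster.

```shell
# On the node that panics: stream kernel messages over UDP to a log host.
# Parameter format: netconsole=<src-port>@<src-ip>/<dev>,<dst-port>@<dst-ip>/<dst-mac>
modprobe netconsole netconsole=6665@192.168.0.5/eth0,6666@192.168.0.9/00:11:22:33:44:55

# On the log host: capture whatever arrives on that UDP port.
# (netcat flag syntax varies: traditional netcat wants -p, BSD nc does not)
nc -l -u -p 6666 | tee /var/log/netconsole-web03.log
```

Because the messages leave the box over the network before the kernel dies, the oops trace survives even when the machine reboots itself afterwards.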
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
Well these are headless production servers, CLI only. no GTK, no X11. also I am not running the newer kernels (and I can't...) it looks like I cannot run a hybrid of 2.6.24-16 and 2.6.22-19, whichever one has mounted the drive first is the winner. If I mix them, I can get the 2.6.24's to mount, then the older ones give the "number too large" error or whatever. So I can't currently use one server on my cluster to test because it would require upgrading all of them just for this test.
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
Setting up netconsole does not require a reboot. The idea is to catch the oops trace when the oops happens. Without that trace, we are flying blind.
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
Since these are production I can't do much. But I did get an error (it's not happening as much but it still blips here and there) Notice that /dev/sdb (my iscsi target using ocfs2) hits 0.00% utilization, 3 seconds before my proxy says "hey, timeout" - every other second there is -always- some utilization going on. What could be steps to figure out this issue? Using debugfs.ocfs2 or something? It's mounted as: /dev/sdb1 on /home type ocfs2 (rw,_netdev,noatime,data=writeback,heartbeat=local) I know I'm not being much help, but I'm willing to try almost anything as long as it doesn't cause downtime or require cluster-wide changes (since those require downtime...) - I want to try to go back to 2.6.24-16 with data=writeback and see if that fixes the crashing issue, but if I'm having issues already like this perhaps I should resolve this before moving up.

[EMAIL PROTECTED] ~]# cat /root/web03-iostat.txt

Time: 02:11:46 PM
avg-cpu:  %user  %nice %system %iowait %steal  %idle
           3.71   0.00   27.23    8.91   0.00  60.15
Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00   54.46   0.00  309.90    0.00  2914.85     9.41    23.08  74.47   0.93  28.71
sdb       12.87    0.00  17.82    0.00  245.54     0.00    13.78     0.33  17.78  18.33  32.67

Time: 02:11:47 PM
avg-cpu:  %user  %nice %system %iowait %steal  %idle
           0.25   0.00   26.24    2.23   0.00  71.29
Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00    0.00   0.00    0.00    0.00     0.00     0.00     0.00   0.00   0.00   0.00
sdb        5.94    0.00  22.77    0.99  228.71     0.99     9.67     0.42  17.92  17.08  40.59

Time: 02:11:48 PM <- THIS HAS THE ISSUE
avg-cpu:  %user  %nice %system %iowait %steal  %idle
           0.00   0.00   25.99    0.00   0.00  74.01
Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00   10.89   0.00    2.97    0.00   110.89    37.33     0.00   0.00   0.00   0.00
sdb        0.00    0.00   0.00    0.00    0.00     0.00     0.00     0.00   0.00   0.00   0.00

Time: 02:11:49 PM
avg-cpu:  %user  %nice %system %iowait %steal  %idle
           0.25   0.00   14.85    0.99   0.00  83.91
Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00    0.00   0.00    0.00    0.00     0.00     0.00     0.00   0.00   0.00   0.00
sdb        0.99    0.00   2.97    0.99   30.69     0.99     8.00     0.07  17.50  17.50   6.93

Time: 02:11:50 PM
avg-cpu:  %user  %nice %system %iowait %steal  %idle
           0.74   0.00    1.24    1.73   0.00  96.29
Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00    0.00   0.00    0.00    0.00     0.00     0.00     0.00   0.00   0.00   0.00
sdb        0.99    0.00   5.94    0.00   55.45     0.00     9.33     0.07  11.67  11.67   6.93

Time: 02:11:51 PM
avg-cpu:  %user  %nice %system %iowait %steal  %idle
           0.00   0.00    1.24   16.34   0.00  82.43
Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00  153.47   0.00  494.06    0.00  5156.44    10.44    55.62 107.23   1.16  57.43
sdb        2.97    0.00  11.88    0.99  117.82     0.99     9.23     0.26  13.08  20.00  25.74

Time: 02:11:52 PM
avg-cpu:  %user  %nice %system %iowait %steal  %idle
           0.00   0.00    0.25    3.22   0.00  96.53
Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00    0.00   0.00   16.83    0.00   158.42     9.41     0.13 164.71   1.18   1.98
sdb        1.98    0.00   2.97    0.00   39.60     0.00    13.33     0.13  73.33  43.33  12.87

Time: 02:11:53 PM
avg-cpu:  %user  %nice %system %iowait %steal  %idle
           0.50   0.00    0.25    4.70   0.00  94.55
Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00    0.00   0.00    0.00    0.00     0.00     0.00     0.00   0.00   0.00   0.00
sdb        5.94    0.00  11.88    0.99  141.58     0.99    11.08     0.20  15.38  15.38  19.80

Time: 02:11:54 PM
avg-cpu:  %user  %nice %system %iowait %steal  %idle
           3.96   0.00   10.15    0.74   0.00  85.15
Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00   20.79   0.00    4.95    0.00   205.94    41.60     0.00   0.00   0.00   0.00
sdb        4.95    0.00   5.94    0.00   87.13     0.00    14.67     0.07  11.67  11.67   6.93
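One way to spot the dead intervals mike describes without eyeballing the log: a small awk helper (hypothetical, not something from the thread) that prints every sample in an `iostat -x 1` capture where a given device did zero reads and zero writes.

```shell
# Flag sampling intervals in which a device reports r/s and w/s both 0.00.
# Assumes the "Time:" line precedes each sample, as in the capture above,
# and that device lines start with the device name followed by the
# rrqm/s wrqm/s r/s w/s ... columns ($4 = r/s, $5 = w/s).
flag_idle() {  # usage: flag_idle <device> < iostat.log
  awk -v dev="$1" '
    /^Time:/ { t = $0 }                  # remember the current timestamp
    $1 == dev && $4 == "0.00" && $5 == "0.00" {
      print t " -> " dev " idle"         # device did no I/O this interval
    }'
}
```

Fed the capture above with `flag_idle sdb`, this would single out the 02:11:48 sample, the one marked as having the issue.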
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
Do you have the panic output... kernel stack trace. We'll need that to figure this out. Without that, we can only speculate. ___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
> On 4/21/08, Tao Ma <[EMAIL PROTECTED]> wrote: > Also please provide more details about it. I am using nginx for a frontend load balancer, and nginx for a webserver as well. This doesn't seem to be related to the webserver at all though, it was happening before this. lvs01 proxies traffic in to web01, web02, and web03 (currently using nginx, before I was using LVS/ipvsadm) *** Sorry, gmail triggered "send" before I was done *** Every so often, one of the webservers sends me back a connection timeout. Even under the more "stable" kernel. In the newer kernel, when I put some load on it, the other machines would stop I/O to all disks it seems like completely for a few seconds, like something was blocking it (I think even the local non-ISCSI disk had 0% util) The machine putting the load on it would panic and reboot (I have kernel.panic = 60 in my sysctl.conf) > This value is a little smaller, so how did you build up your shared > disk(iSCSI or ...)? The most common value I heard of is 61. It is about 120 > secs. I don't know the reason and maybe Sunil can tell you. ;) > You can also refer to I think originally I did not understand why the values would be so high. These machines are pretty much dedicated, there's not many of them and they shouldn't be stressing anything that much to begin with. However, for the sake of things, I can reset these to the defaults. Problem is I have to restart the entire cluster for that. Will it work if I change it on all of them, and then reboot each machine one at a time? It seems like it won't link up if it has conflicting values, so how can I ensure the machines won't be fighting each other all at the same time? FYI, This is how I made it: mkfs.ocfs2 -b4k -C32k -N8 -L iscsi /dev/sdb1 It's an iSCSI export that my provider has. All this traffic is on the same private gigabit interconnect. Could there be any additional tuning parameters - there's probably over 3 million files being stored on this 600GB volume. 
I'd say average filesize is under 1MB too.
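On the restart question raised above: the o2cb init script can take a single node out of the cluster and reconfigure it without rebooting the box. The catch is exactly the one mike suspects: nodes with mismatched timeouts generally refuse to join the same cluster, so in practice a timeout change is done with the whole cluster quiesced rather than truly rolling. A rough per-node sequence, sketched from the Ubuntu packaging (script path and cluster name as they appear in this thread; treat it as an assumption, not a verified procedure):

```shell
# Unmount the shared filesystem first, then take o2cb down on this node only.
umount /home
/etc/init.d/o2cb offline mycluster   # stop heartbeating on this node
/etc/init.d/o2cb unload              # unload the cluster stack modules
/etc/init.d/o2cb configure           # interactive prompts for the timeouts
/etc/init.d/o2cb online mycluster    # rejoin the cluster with the new values
mount /home
```

Repeated on each node after all nodes have been taken offline, this avoids a full OS reboot even though it still means a window with the filesystem unavailable.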
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
On 4/21/08, Tao Ma <[EMAIL PROTECTED]> wrote: > mike wrote: > > I have changed my kernel back to 2.6.22-14-server, and now I don't get > > the kernel panics. It seems like an issue with 2.6.24-16 and some i/o > > made it crash... > > > OK, so it seems that it is a bug for ocfs2 kernel, not the ocfs2-tools. :) > Then could you please describe it in more detail about how the kernel panic > happens? Yeah, this specific issue seems like a kernel issue. I don't know, these are production systems and I am already getting angry customers. I can't really test anymore. Both are standard Ubuntu kernels. Okay: 2.6.22-14-server (I think still minor file access issues) Breaks under load: 2.6.24-16-server > > However I am still getting file access timeouts once in a while. I am > > nervous about putting more load on the setup. > > > Also please provide more details about it. I am using nginx for a frontend load balancer, and nginx for a webserver as well. This doesn't seem to be related to the webserver at all though, it was happening before this. lvs01 proxies traffic in to web01, web02, and web03 (currently using nginx, before I was using LVS/ipvsadm) Every so often, one of the webservers sends me back
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
mike wrote: > I have changed my kernel back to 2.6.22-14-server, and now I don't get > the kernel panics. It seems like an issue with 2.6.24-16 and some i/o > made it crash... > OK, so it seems that it is a bug for ocfs2 kernel, not the ocfs2-tools. :) Then could you please describe it in more detail about how the kernel panic happens? > However I am still getting file access timeouts once in a while. I am > nervous about putting more load on the setup. > Also please provide more details about it. > > [EMAIL PROTECTED] .batch]# cat /etc/default/o2cb > > # O2CB_ENABLED: 'true' means to load the driver on boot. > O2CB_ENABLED=true > > # O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start. > O2CB_BOOTCLUSTER=mycluster > > # O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead. > O2CB_HEARTBEAT_THRESHOLD=7 > This value is a little smaller, so how did you build up your shared disk(iSCSI or ...)? The most common value I heard of is 61. It is about 120 secs. I don't know the reason and maybe Sunil can tell you. ;) You can also refer to http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#TIMEOUT. > # O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is > considered dead. > O2CB_IDLE_TIMEOUT_MS=1 > > # O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is sent > O2CB_KEEPALIVE_DELAY_MS=5000 > > # O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts > O2CB_RECONNECT_DELAY_MS=2000
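The arithmetic behind Tao's "61 ... is about 120 secs": per the OCFS2 FAQ linked above, the disk heartbeat timeout works out to (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds, so the threshold of 7 in this cluster gives a node only 12 seconds of missed disk heartbeats before the others fence it. A one-line check (the helper name is ours, not from the FAQ):

```shell
# Disk heartbeat timeout in seconds, per the OCFS2 FAQ formula
# (O2CB_HEARTBEAT_THRESHOLD - 1) * 2.
hb_timeout_secs() { echo $(( ($1 - 1) * 2 )); }

hb_timeout_secs 61   # the commonly recommended threshold -> 120 seconds
hb_timeout_secs 7    # this cluster's threshold -> 12 seconds
```

Twelve seconds is short enough that a busy iSCSI target or a momentary network stall could plausibly be mistaken for a dead node, which would fit the symptoms in this thread.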
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
On Mon, Apr 21, 2008 at 05:02:33PM +0800, Tao Ma wrote: > Then there is only one thing maybe. Have you modify > /etc/sysconfig/o2cb(This is the place for RHEL, not sure the place in > ubuntu)? I have checked the rpm package for RHEL, it will update > /etc/sysconfig/o2cb and this file has some timeouts defined in it. It is probably /etc/default/o2cb for Ubuntu. Joel -- "I am working for the time when unqualified blacks, browns, and women join the unqualified men in running our government." - Sissy Farenthold Joel Becker Principal Software Developer Oracle E-mail: [EMAIL PROTECTED] Phone: (650) 506-8127
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
I have changed my kernel back to 2.6.22-14-server, and now I don't get the kernel panics. It seems like an issue with 2.6.24-16 and some i/o made it crash... However I am still getting file access timeouts once in a while. I am nervous about putting more load on the setup.

[EMAIL PROTECTED] .batch]# cat /etc/default/o2cb

# O2CB_ENABLED: 'true' means to load the driver on boot.
O2CB_ENABLED=true

# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
O2CB_BOOTCLUSTER=mycluster

# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD=7

# O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is considered dead.
O2CB_IDLE_TIMEOUT_MS=1

# O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is sent
O2CB_KEEPALIVE_DELAY_MS=5000

# O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts
O2CB_RECONNECT_DELAY_MS=2000

On 4/21/08, Tao Ma <[EMAIL PROTECTED]> wrote: > Hi Mike, > Are you sure it is caused by the update of ocfs2-tools? > AFAIK, the ocfs2-tools only include tools like mkfs, fsck and tunefs etc. So > if you don't make any change to the disk(by using this new tools), it > shouldn't cause the problem of kernel panic since they are all user space > tools. > Then there is only one thing maybe. Have you modify /etc/sysconfig/o2cb(This > is the place for RHEL, not sure the place in ubuntu)? I have checked the rpm > package for RHEL, it will update /etc/sysconfig/o2cb and this file has some > timeouts defined in it. > So do you have some backups for this file? If yes, please restore it to see > whether it helps(I can't say it for sure). > If not, do you remember the old value of some timeouts you set for ocfs2? If > yes, you can use o2cb configure to set them by yourself.
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
Hi Mike, Are you sure it is caused by the update of ocfs2-tools? AFAIK, the ocfs2-tools only include tools like mkfs, fsck and tunefs etc. So if you don't make any change to the disk(by using this new tools), it shouldn't cause the problem of kernel panic since they are all user space tools. Then there is only one thing maybe. Have you modify /etc/sysconfig/o2cb(This is the place for RHEL, not sure the place in ubuntu)? I have checked the rpm package for RHEL, it will update /etc/sysconfig/o2cb and this file has some timeouts defined in it. So do you have some backups for this file? If yes, please restore it to see whether it helps(I can't say it for sure). If not, do you remember the old value of some timeouts you set for ocfs2? If yes, you can use o2cb configure to set them by yourself. Good Luck. Regards, Tao
[Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
Hi, I'm running into a big issue. I believe it is OCFS2, I can get my machines to kernel panic consistently. Before I was running Ubuntu Gutsy (7.10) ocfs2-tools 1.2.4. Now I am running Ubuntu Hardy (8.04) ocfs2-tools 1.3.9. I am even running the same kernel (2.6.22-14), but the behavior has changed with my OCFS2 mounts it seems. At first I thought it was due to the newer kernel (2.6.24-16) but it isn't the case. Now it is happening no matter which kernel I use. I even compiled my own vanilla 2.6.25, and it still has this issue. I have 6 total clients mounting the ocfs2 partition:
- 2 batch servers which only access it every 5 or 10 minutes to load up a PHP script to process
- 1 server I am trying to rsync from local RAID disk -> ocfs2 - I am limiting this to 250kb/sec
- 3 webservers loading normal stuff - PHP scripts, graphics, media files - maybe 2MB/sec combined total
That's not even 3MB/sec - yet when I start the rsync, pretty quickly the server doing the rsync kernel panics and reboots. The 3 webservers all have issues with reading from the OCFS2 mounted partition. The %util all drops to 0, it's like it bottlenecks and suspends all disk I/O on the webservers for a few seconds. Then things go back to normal for a while. Is there any additional info that could be useful? I am desperately in need of help. I have hosting customers and somehow this upgrade has pretty much crippled me...