Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
What version? File a bugzilla (oss.oracle.com/bugzilla) with all the version
details, etc. Easier to track issues that-a-way.

mike wrote:
> Here's another issue:
>
> I have a client with 9049 files/dirs in a specific dir. The first time
> I did an ls /that/dir/ it froze up - and hitting control C actually
> made it behave interestingly.
>
> Every two times I hit control C, I got one of these permission denied
> lines (the files exist, I can open them, I can edit them, I am root,
> too)
>
> [EMAIL PROTECTED] ~]# ls /home/mark/web/domain.com/
> ls: cannot access /home/mark/web/domain.com/file32.htm: Permission denied
> ls: cannot access /home/mark/web/domain.com/file.htm: Permission denied
>
> etc.
>
> Also I am seeing in my webserver log once in a while:
>
> 2008/04/21 18:10:09 [crit] 6917#0: *1256684 stat()
> "/home/mike/web/michaelshadle.com/" failed (13: Permission denied),
> client: 1.2.3.4, server: michaelshadle.com, request: "GET / HTTP/1.0",
> host: "michaelshadle.com"
>
> a stat() call fails on a directory, and that directory not only
> exists, it's readable and behaves properly 99.999% of the time. these
> random stat() failures are a bit odd. However, this hasn't thrown an
> error on my proxy server, which is nice. But it is something I am
> worried about. I can stat it in shell:
>
> [EMAIL PROTECTED] web]# stat /home/mike/web/michaelshadle.com/
>   File: `/home/mike/web/michaelshadle.com/'
>   Size: 4096       Blocks: 64         IO Block: 32768  directory
> Device: 811h/2065d  Inode: 135710860  Links: 15
> Access: (0711/drwx--x--x)  Uid: ( 1000/mike)  Gid: ( 1000/mike)
> Access: 2008-03-25 04:27:52.0 -0700
> Modify: 2008-03-04 02:17:22.0 -0800
> Change: 2008-03-27 04:45:57.0 -0700
> [EMAIL PROTECTED] web]#
>
> Does anything there look out of place? Nothing shows up in dmesg or
> any of my /var/log/* logs...

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
Here's another issue:

I have a client with 9049 files/dirs in a specific dir. The first time I did
an ls /that/dir/ it froze up - and hitting control C actually made it behave
interestingly.

Every two times I hit control C, I got one of these permission denied lines
(the files exist, I can open them, I can edit them, I am root, too)

[EMAIL PROTECTED] ~]# ls /home/mark/web/domain.com/
ls: cannot access /home/mark/web/domain.com/file32.htm: Permission denied
ls: cannot access /home/mark/web/domain.com/file.htm: Permission denied

etc.

Also I am seeing in my webserver log once in a while:

2008/04/21 18:10:09 [crit] 6917#0: *1256684 stat()
"/home/mike/web/michaelshadle.com/" failed (13: Permission denied),
client: 1.2.3.4, server: michaelshadle.com, request: "GET / HTTP/1.0",
host: "michaelshadle.com"

a stat() call fails on a directory, and that directory not only exists, it's
readable and behaves properly 99.999% of the time. these random stat()
failures are a bit odd. However, this hasn't thrown an error on my proxy
server, which is nice. But it is something I am worried about. I can stat it
in shell:

[EMAIL PROTECTED] web]# stat /home/mike/web/michaelshadle.com/
  File: `/home/mike/web/michaelshadle.com/'
  Size: 4096       Blocks: 64         IO Block: 32768  directory
Device: 811h/2065d  Inode: 135710860  Links: 15
Access: (0711/drwx--x--x)  Uid: ( 1000/mike)  Gid: ( 1000/mike)
Access: 2008-03-25 04:27:52.0 -0700
Modify: 2008-03-04 02:17:22.0 -0800
Change: 2008-03-27 04:45:57.0 -0700
[EMAIL PROTECTED] web]#

Does anything there look out of place? Nothing shows up in dmesg or any of
my /var/log/* logs...

On 4/21/08, Herbert van den Bergh <[EMAIL PROTECTED]> wrote:
> Does the web server that the proxy server talks to have any extended
> debugging you can turn on? In particular, would it be able to log
> timestamps of things it does, so you can narrow down where the hiccup
> occurs? A brute force method to do this would be to run strace -T on all
> server processes, and look for things that take much longer than they
> should, like disk reads exceeding 100ms, or other syscalls taking much
> longer than usual. Ideally you'd have some timing around code you suspect,
> and log a message if the time exceeds some configurable limit.
>
> Thanks,
> Herbert.
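To pin down when the random stat() failures above actually happen, one low-impact option is to poll the directory and log timestamped failures, then line those timestamps up against the proxy's error log. A minimal sketch; the path, poll count, interval, and log destination below are placeholders, not anything from the thread:

```shell
# Poll a path with stat and log each failure with a timestamp, so the
# sporadic EACCES errors can be correlated with proxy timeouts.
# Arguments: path, number of polls, sleep between polls in seconds.
watch_stat() {
    path="$1"; count="$2"; interval="${3:-1}"
    i=0
    while [ "$i" -lt "$count" ]; do
        if ! stat "$path" >/dev/null 2>&1; then
            echo "$(date '+%Y-%m-%d %H:%M:%S') stat failed: $path"
        fi
        i=$((i + 1))
        sleep "$interval"
    done
}

# Hypothetical usage, appending to a log for later correlation:
#   watch_stat /home/mike/web/michaelshadle.com/ 86400 1 >> /var/log/statwatch.log
```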
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
Is there any OCFS2 debugging I could turn on as well? To log perhaps every
file request and how long it took, or if something hits a threshold, etc?

I think I can turn on the webserver's debugging (if not, the author is
pretty responsive) and hopefully find the bottleneck through there.

On 4/21/08, Herbert van den Bergh <[EMAIL PROTECTED]> wrote:
> Does the web server that the proxy server talks to have any extended
> debugging you can turn on? In particular, would it be able to log
> timestamps of things it does, so you can narrow down where the hiccup
> occurs? A brute force method to do this would be to run strace -T on all
> server processes, and look for things that take much longer than they
> should, like disk reads exceeding 100ms, or other syscalls taking much
> longer than usual. Ideally you'd have some timing around code you suspect,
> and log a message if the time exceeds some configurable limit.
>
> Thanks,
> Herbert.
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
Does the web server that the proxy server talks to have any extended
debugging you can turn on? In particular, would it be able to log timestamps
of things it does, so you can narrow down where the hiccup occurs? A brute
force method to do this would be to run strace -T on all server processes,
and look for things that take much longer than they should, like disk reads
exceeding 100ms, or other syscalls taking much longer than usual. Ideally
you'd have some timing around code you suspect, and log a message if the
time exceeds some configurable limit.

Thanks,
Herbert.

mike wrote:
> You're right, it -is- possible, but if you look at it (and I can log
> it for hours) it only seems to do that right before I get a timeout
> message from the proxy. The two appear to be related.
>
> I will continue to monitor this and make sure that my hypothesis is
> correct. Something is flaking out every so often.
>
> I get this on my nginx proxy server:
>
> 2008/04/21 17:37:01 [error] 1256#0: *7406286 upstream timed out (110:
> Connection timed out) while reading response header from upstream,
> client: 1.2.3.4, server: lvs01.domain.com, request: "GET /someURL.php
> HTTP/1.1", upstream: "http://10.13.5.12:80/someURL.php",
> host: "somedomain.com", referrer: "http://somedomain.com/someURL.php"
>
> That only happens after it's sitting for 3 real-time seconds waiting
> for a reply from the server. Note: this happens no matter what proxy
> and webserver I use. It does not seem to be anything related to that.
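Herbert's strace -T suggestion can be turned into a crude filter: strace -T appends each syscall's elapsed time in angle brackets at the end of the line, so slow calls can be picked out with awk. A sketch; the 100ms threshold and the nginx pid lookup in the usage comment are examples, not anything prescribed in the thread:

```shell
# slow_syscalls: read `strace -T` output on stdin and print only the
# syscalls whose elapsed time (the trailing "<seconds>") exceeds a
# threshold in seconds, default 0.100 (100ms).
slow_syscalls() {
    awk -v t="${1:-0.100}" '
        match($0, /<[0-9]+\.[0-9]+>$/) {
            # Strip the surrounding angle brackets to get the duration.
            elapsed = substr($0, RSTART + 1, RLENGTH - 2)
            if (elapsed + 0 > t + 0) print
        }'
}

# Hypothetical usage against one server process:
#   strace -T -f -p "$(pidof -s nginx)" 2>&1 | slow_syscalls 0.100
```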
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
You're right, it -is- possible, but if you look at it (and I can log it for
hours) it only seems to do that right before I get a timeout message from
the proxy. The two appear to be related.

I will continue to monitor this and make sure that my hypothesis is correct.
Something is flaking out every so often.

I get this on my nginx proxy server:

2008/04/21 17:37:01 [error] 1256#0: *7406286 upstream timed out (110:
Connection timed out) while reading response header from upstream,
client: 1.2.3.4, server: lvs01.domain.com, request: "GET /someURL.php
HTTP/1.1", upstream: "http://10.13.5.12:80/someURL.php",
host: "somedomain.com", referrer: "http://somedomain.com/someURL.php"

That only happens after it's sitting for 3 real-time seconds waiting for a
reply from the server. Note: this happens no matter what proxy and webserver
I use. It does not seem to be anything related to that.

On 4/21/08, Herbert van den Bergh <[EMAIL PROTECTED]> wrote:
> Mike,
>
> Are you sure it's not possible for sdb to be idle for just 1 second? If you
> look at the interval right after the one you pointed out, you'll see r/s is
> 2.97 and w/s is .99, so it did 3 reads and 1 write in that one second
> interval. The device appears to be used very little. I think it's quite
> possible that some 1 second intervals have no reads or writes at all, don't
> you think?
>
> Thanks,
> Herbert.
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
Mike,

Are you sure it's not possible for sdb to be idle for just 1 second? If you
look at the interval right after the one you pointed out, you'll see r/s is
2.97 and w/s is .99, so it did 3 reads and 1 write in that one second
interval. The device appears to be used very little. I think it's quite
possible that some 1 second intervals have no reads or writes at all, don't
you think?

Thanks,
Herbert.

mike wrote:
> Thanks.
>
> If I have the opportunity to run the (buggy) new kernel again I will
> try this. That is definitely a problem and I think I need to set the
> oracle behavior to crash and not auto reboot for this to be effective,
> right?
>
> That is just one issue.
> 1) 2.6.24-16 with load completely crashes node producing largest i/o
> 2) 2.6.22-19 utilization drops to 0% and causes a hiccup randomly (I
> don't see a pattern and no batch jobs, or other things running at the
> time it happens) - this is more important as it still is happening
> even though I'm running the more "stable" kernel.
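The zero-utilization intervals under discussion here can be flagged automatically rather than eyeballed: a small sketch that scans timestamped extended iostat output (the `Time:` / device-line format quoted in this thread) and prints any one-second interval where a given device shows 0.00 in the final %util column. The device name and the `iostat` invocation in the usage comment are examples:

```shell
# idle_intervals: read timestamped `iostat -x` output on stdin and
# report each interval in which the named device shows 0.00 %util
# (the last column of the device line).
idle_intervals() {
    awk -v dev="$1" '
        /^Time:/ { ts = $2 " " $3 }                 # remember the interval timestamp
        $1 == dev && $NF + 0 == 0 {                 # device line with zero %util
            print ts ": " dev " idle (%util 0.00)"
        }'
}

# Hypothetical usage, sampling once a second: iostat -t -x 1 | idle_intervals sdb
```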
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
Thanks.

If I have the opportunity to run the (buggy) new kernel again I will try
this. That is definitely a problem and I think I need to set the oracle
behavior to crash and not auto reboot for this to be effective, right?

That is just one issue.
1) 2.6.24-16 with load completely crashes node producing largest i/o
2) 2.6.22-19 utilization drops to 0% and causes a hiccup randomly (I don't
see a pattern and no batch jobs, or other things running at the time it
happens) - this is more important as it still is happening even though I'm
running the more "stable" kernel.

On 4/21/08, Sunil Mushran <[EMAIL PROTECTED]> wrote:
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/networking/netconsole.txt;h=3c2f2b3286385337ce5ec24afebd4699dd1e6e0a;hb=HEAD
>
> netconsole is a facility to capture oops traces. It is not a console
> per se and does not require a head/gtk/x11 etc to work. The link above
> explains the usage, etc.
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/networking/netconsole.txt;h=3c2f2b3286385337ce5ec24afebd4699dd1e6e0a;hb=HEAD netconsole is a facility to capture oops traces. It is not a console per se and does not require a head/gtk/x11 etc to work. The link above explains the usage, etc.
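For reference, a minimal netconsole setup along the lines Sunil describes might look like the following. This is only a sketch: the IP addresses, MAC address, interface name, ports and log path are placeholders, not values taken from this cluster.

```shell
# On the node that panics: stream kernel messages over UDP to a log host.
# Parameter format: netconsole=<src-port>@<src-ip>/<dev>,<dst-port>@<dst-ip>/<dst-mac>
modprobe netconsole netconsole=6665@192.168.0.5/eth0,6666@192.168.0.9/00:11:22:33:44:55

# On the log host: capture whatever arrives on that UDP port.
# (netcat flag syntax varies: traditional netcat wants -p, BSD nc does not)
nc -l -u -p 6666 | tee /var/log/netconsole-web03.log
```

Because the messages leave the box over the network before the kernel dies, the oops trace survives even when the machine reboots itself afterwards.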
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
Well these are headless production servers, CLI only. no GTK, no X11. also I am not running the newer kernels (and I can't...) it looks like I cannot run a hybrid of 2.6.24-16 and 2.6.22-19, whichever one has mounted the drive first is the winner. If I mix them, I can get the 2.6.24's to mount, then the older ones give the "number too large" error or whatever. So I can't currently use one server on my cluster to test because it would require upgrading all of them just for this test.
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
Setting up netconsole does not require a reboot. The idea is to catch the oops trace when the oops happens. Without that trace, we are flying blind.
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
Since these are production I can't do much. But I did get an error (it's not happening as much but it still blips here and there) Notice that /dev/sdb (my iscsi target using ocfs2) hits 0.00% utilization, 3 seconds before my proxy says "hey, timeout" - every other second there is -always- some utilization going on. What could be steps to figure out this issue? Using debugfs.ocfs2 or something? It's mounted as: /dev/sdb1 on /home type ocfs2 (rw,_netdev,noatime,data=writeback,heartbeat=local) I know I'm not being much help, but I'm willing to try almost anything as long as it doesn't cause downtime or require cluster-wide changes (since those require downtime...) - I want to try to go back to 2.6.24-16 with data=writeback and see if that fixes the crashing issue, but if I'm having issues already like this perhaps I should resolve this before moving up.

[EMAIL PROTECTED] ~]# cat /root/web03-iostat.txt

Time: 02:11:46 PM
avg-cpu:  %user  %nice %system %iowait %steal  %idle
           3.71   0.00   27.23    8.91   0.00  60.15
Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00   54.46   0.00  309.90    0.00  2914.85     9.41    23.08  74.47   0.93  28.71
sdb       12.87    0.00  17.82    0.00  245.54     0.00    13.78     0.33  17.78  18.33  32.67

Time: 02:11:47 PM
avg-cpu:  %user  %nice %system %iowait %steal  %idle
           0.25   0.00   26.24    2.23   0.00  71.29
Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00    0.00   0.00    0.00    0.00     0.00     0.00     0.00   0.00   0.00   0.00
sdb        5.94    0.00  22.77    0.99  228.71     0.99     9.67     0.42  17.92  17.08  40.59

Time: 02:11:48 PM <- THIS HAS THE ISSUE
avg-cpu:  %user  %nice %system %iowait %steal  %idle
           0.00   0.00   25.99    0.00   0.00  74.01
Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00   10.89   0.00    2.97    0.00   110.89    37.33     0.00   0.00   0.00   0.00
sdb        0.00    0.00   0.00    0.00    0.00     0.00     0.00     0.00   0.00   0.00   0.00

Time: 02:11:49 PM
avg-cpu:  %user  %nice %system %iowait %steal  %idle
           0.25   0.00   14.85    0.99   0.00  83.91
Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00    0.00   0.00    0.00    0.00     0.00     0.00     0.00   0.00   0.00   0.00
sdb        0.99    0.00   2.97    0.99   30.69     0.99     8.00     0.07  17.50  17.50   6.93

Time: 02:11:50 PM
avg-cpu:  %user  %nice %system %iowait %steal  %idle
           0.74   0.00    1.24    1.73   0.00  96.29
Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00    0.00   0.00    0.00    0.00     0.00     0.00     0.00   0.00   0.00   0.00
sdb        0.99    0.00   5.94    0.00   55.45     0.00     9.33     0.07  11.67  11.67   6.93

Time: 02:11:51 PM
avg-cpu:  %user  %nice %system %iowait %steal  %idle
           0.00   0.00    1.24   16.34   0.00  82.43
Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00  153.47   0.00  494.06    0.00  5156.44    10.44    55.62 107.23   1.16  57.43
sdb        2.97    0.00  11.88    0.99  117.82     0.99     9.23     0.26  13.08  20.00  25.74

Time: 02:11:52 PM
avg-cpu:  %user  %nice %system %iowait %steal  %idle
           0.00   0.00    0.25    3.22   0.00  96.53
Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00    0.00   0.00   16.83    0.00   158.42     9.41     0.13 164.71   1.18   1.98
sdb        1.98    0.00   2.97    0.00   39.60     0.00    13.33     0.13  73.33  43.33  12.87

Time: 02:11:53 PM
avg-cpu:  %user  %nice %system %iowait %steal  %idle
           0.50   0.00    0.25    4.70   0.00  94.55
Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00    0.00   0.00    0.00    0.00     0.00     0.00     0.00   0.00   0.00   0.00
sdb        5.94    0.00  11.88    0.99  141.58     0.99    11.08     0.20  15.38  15.38  19.80

Time: 02:11:54 PM
avg-cpu:  %user  %nice %system %iowait %steal  %idle
           3.96   0.00   10.15    0.74   0.00  85.15
Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00   20.79   0.00    4.95    0.00   205.94    41.60     0.00   0.00   0.00   0.00
sdb        4.95    0.00   5.94    0.00   87.13     0.00    14.67     0.07  11.67  11.67   6.93
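One way to spot the dead intervals mike describes without eyeballing the log: a small awk helper (hypothetical, not something from the thread) that prints every sample in an `iostat -x 1` capture where a given device did zero reads and zero writes.

```shell
# Flag sampling intervals in which a device reports r/s and w/s both 0.00.
# Assumes the "Time:" line precedes each sample, as in the capture above,
# and that device lines start with the device name followed by the
# rrqm/s wrqm/s r/s w/s ... columns ($4 = r/s, $5 = w/s).
flag_idle() {  # usage: flag_idle <device> < iostat.log
  awk -v dev="$1" '
    /^Time:/ { t = $0 }                  # remember the current timestamp
    $1 == dev && $4 == "0.00" && $5 == "0.00" {
      print t " -> " dev " idle"         # device did no I/O this interval
    }'
}
```

Fed the capture above with `flag_idle sdb`, this would single out the 02:11:48 sample, the one marked as having the issue.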
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
Do you have the panic output... kernel stack trace. We'll need that to figure this out. Without that, we can only speculate. ___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
> On 4/21/08, Tao Ma <[EMAIL PROTECTED]> wrote: > Also please provide more details about it. I am using nginx for a frontend load balancer, and nginx for a webserver as well. This doesn't seem to be related to the webserver at all though, it was happening before this. lvs01 proxies traffic in to web01, web02, and web03 (currently using nginx, before I was using LVS/ipvsadm) *** Sorry, gmail triggered "send" before I was done *** Every so often, one of the webservers sends me back a connection timeout. Even under the more "stable" kernel. In the newer kernel, when I put some load on it, the other machines would stop I/O to all disks it seems like completely for a few seconds, like something was blocking it (I think even the local non-ISCSI disk had 0% util) The machine putting the load on it would panic and reboot (I have kernel.panic = 60 in my sysctl.conf) > This value is a little smaller, so how did you build up your shared > disk(iSCSI or ...)? The most common value I heard of is 61. It is about 120 > secs. I don't know the reason and maybe Sunil can tell you. ;) > You can also refer to I think originally I did not understand why the values would be so high. These machines are pretty much dedicated, there's not many of them and they shouldn't be stressing anything that much to begin with. However, for the sake of things, I can reset these to the defaults. Problem is I have to restart the entire cluster for that. Will it work if I change it on all of them, and then reboot each machine one at a time? It seems like it won't link up if it has conflicting values, so how can I ensure the machines won't be fighting each other all at the same time? FYI, This is how I made it: mkfs.ocfs2 -b4k -C32k -N8 -L iscsi /dev/sdb1 It's an iSCSI export that my provider has. All this traffic is on the same private gigabit interconnect. Could there be any additional tuning parameters - there's probably over 3 million files being stored on this 600GB volume. 
I'd say average filesize is under 1MB too.
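On the restart question raised above: the o2cb init script can take a single node out of the cluster and reconfigure it without rebooting the box. The catch is exactly the one mike suspects: nodes with mismatched timeouts generally refuse to join the same cluster, so in practice a timeout change is done with the whole cluster quiesced rather than truly rolling. A rough per-node sequence, sketched from the Ubuntu packaging (script path and cluster name as they appear in this thread; treat it as an assumption, not a verified procedure):

```shell
# Unmount the shared filesystem first, then take o2cb down on this node only.
umount /home
/etc/init.d/o2cb offline mycluster   # stop heartbeating on this node
/etc/init.d/o2cb unload              # unload the cluster stack modules
/etc/init.d/o2cb configure           # interactive prompts for the timeouts
/etc/init.d/o2cb online mycluster    # rejoin the cluster with the new values
mount /home
```

Repeated on each node after all nodes have been taken offline, this avoids a full OS reboot even though it still means a window with the filesystem unavailable.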
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
On 4/21/08, Tao Ma <[EMAIL PROTECTED]> wrote: > mike wrote: > > I have changed my kernel back to 2.6.22-14-server, and now I don't get > > the kernel panics. It seems like an issue with 2.6.24-16 and some i/o > > made it crash... > > > OK, so it seems that it is a bug for ocfs2 kernel, not the ocfs2-tools. :) > Then could you please describe it in more detail about how the kernel panic > happens? Yeah, this specific issue seems like a kernel issue. I don't know, these are production systems and I am already getting angry customers. I can't really test anymore. Both are standard Ubuntu kernels. Okay: 2.6.22-14-server (I think still minor file access issues) Breaks under load: 2.6.24-16-server > > However I am still getting file access timeouts once in a while. I am > > nervous about putting more load on the setup. > > > Also please provide more details about it. I am using nginx for a frontend load balancer, and nginx for a webserver as well. This doesn't seem to be related to the webserver at all though, it was happening before this. lvs01 proxies traffic in to web01, web02, and web03 (currently using nginx, before I was using LVS/ipvsadm) Every so often, one of the webservers sends me back
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
mike wrote: > I have changed my kernel back to 2.6.22-14-server, and now I don't get > the kernel panics. It seems like an issue with 2.6.24-16 and some i/o > made it crash... > OK, so it seems that it is a bug for ocfs2 kernel, not the ocfs2-tools. :) Then could you please describe it in more detail about how the kernel panic happens? > However I am still getting file access timeouts once in a while. I am > nervous about putting more load on the setup. > Also please provide more details about it. > > [EMAIL PROTECTED] .batch]# cat /etc/default/o2cb > > # O2CB_ENABLED: 'true' means to load the driver on boot. > O2CB_ENABLED=true > > # O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start. > O2CB_BOOTCLUSTER=mycluster > > # O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead. > O2CB_HEARTBEAT_THRESHOLD=7 > This value is a little smaller, so how did you build up your shared disk(iSCSI or ...)? The most common value I heard of is 61. It is about 120 secs. I don't know the reason and maybe Sunil can tell you. ;) You can also refer to http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#TIMEOUT. > # O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is > considered dead. > O2CB_IDLE_TIMEOUT_MS=1 > > # O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is sent > O2CB_KEEPALIVE_DELAY_MS=5000 > > # O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts > O2CB_RECONNECT_DELAY_MS=2000
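The arithmetic behind Tao's "61 ... is about 120 secs": per the OCFS2 FAQ linked above, the disk heartbeat timeout works out to (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds, so the threshold of 7 in this cluster gives a node only 12 seconds of missed disk heartbeats before the others fence it. A one-line check (the helper name is ours, not from the FAQ):

```shell
# Disk heartbeat timeout in seconds, per the OCFS2 FAQ formula
# (O2CB_HEARTBEAT_THRESHOLD - 1) * 2.
hb_timeout_secs() { echo $(( ($1 - 1) * 2 )); }

hb_timeout_secs 61   # the commonly recommended threshold -> 120 seconds
hb_timeout_secs 7    # this cluster's threshold -> 12 seconds
```

Twelve seconds is short enough that a busy iSCSI target or a momentary network stall could plausibly be mistaken for a dead node, which would fit the symptoms in this thread.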
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
On Mon, Apr 21, 2008 at 05:02:33PM +0800, Tao Ma wrote: > Then there is only one thing maybe. Have you modify > /etc/sysconfig/o2cb(This is the place for RHEL, not sure the place in > ubuntu)? I have checked the rpm package for RHEL, it will update > /etc/sysconfig/o2cb and this file has some timeouts defined in it. It is probably /etc/default/o2cb for Ubuntu. Joel -- "I am working for the time when unqualified blacks, browns, and women join the unqualified men in running our government." - Sissy Farenthold Joel Becker Principal Software Developer Oracle E-mail: [EMAIL PROTECTED] Phone: (650) 506-8127
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
I have changed my kernel back to 2.6.22-14-server, and now I don't get the kernel panics. It seems like an issue with 2.6.24-16 and some i/o made it crash... However I am still getting file access timeouts once in a while. I am nervous about putting more load on the setup.

[EMAIL PROTECTED] .batch]# cat /etc/default/o2cb

# O2CB_ENABLED: 'true' means to load the driver on boot.
O2CB_ENABLED=true

# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
O2CB_BOOTCLUSTER=mycluster

# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD=7

# O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is considered dead.
O2CB_IDLE_TIMEOUT_MS=1

# O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is sent
O2CB_KEEPALIVE_DELAY_MS=5000

# O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts
O2CB_RECONNECT_DELAY_MS=2000

On 4/21/08, Tao Ma <[EMAIL PROTECTED]> wrote: > Hi Mike, > Are you sure it is caused by the update of ocfs2-tools? > AFAIK, the ocfs2-tools only include tools like mkfs, fsck and tunefs etc. So > if you don't make any change to the disk(by using this new tools), it > shouldn't cause the problem of kernel panic since they are all user space > tools. > Then there is only one thing maybe. Have you modify /etc/sysconfig/o2cb(This > is the place for RHEL, not sure the place in ubuntu)? I have checked the rpm > package for RHEL, it will update /etc/sysconfig/o2cb and this file has some > timeouts defined in it. > So do you have some backups for this file? If yes, please restore it to see > whether it helps(I can't say it for sure). > If not, do you remember the old value of some timeouts you set for ocfs2? If > yes, you can use o2cb configure to set them by yourself.
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
Hi Mike, Are you sure it is caused by the update of ocfs2-tools? AFAIK, the ocfs2-tools only include tools like mkfs, fsck and tunefs etc. So if you don't make any change to the disk(by using this new tools), it shouldn't cause the problem of kernel panic since they are all user space tools. Then there is only one thing maybe. Have you modify /etc/sysconfig/o2cb(This is the place for RHEL, not sure the place in ubuntu)? I have checked the rpm package for RHEL, it will update /etc/sysconfig/o2cb and this file has some timeouts defined in it. So do you have some backups for this file? If yes, please restore it to see whether it helps(I can't say it for sure). If not, do you remember the old value of some timeouts you set for ocfs2? If yes, you can use o2cb configure to set them by yourself. Good Luck. Regards, Tao
[Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
Hi, I'm running into a big issue. I believe it is OCFS2, I can get my machines to kernel panic consistently. Before I was running Ubuntu Gutsy (7.10) ocfs2-tools 1.2.4. Now I am running Ubuntu Hardy (8.04) ocfs2-tools 1.3.9. I am even running the same kernel (2.6.22-14), but the behavior has changed with my OCFS2 mounts it seems. At first I thought it was due to the newer kernel (2.6.24-16) but it isn't the case. Now it is happening no matter which kernel I use. I even compiled my own vanilla 2.6.25, and it still has this issue. I have 6 total clients mounting the ocfs2 partition:
- 2 batch servers which only access it every 5 or 10 minutes to load up a PHP script to process
- 1 server I am trying to rsync from local RAID disk -> ocfs2 - I am limiting this to 250kb/sec
- 3 webservers loading normal stuff - PHP scripts, graphics, media files - maybe 2MB/sec combined total
That's not even 3MB/sec - yet when I start the rsync, pretty quickly the server doing the rsync kernel panics and reboots. The 3 webservers all have issues with reading from the OCFS2 mounted partition. The %util all drops to 0, it's like it bottlenecks and suspends all disk I/O on the webservers for a few seconds. Then things go back to normal for a while. Is there any additional info that could be useful? I am desperately in need of help. I have hosting customers and somehow this upgrade has pretty much crippled me...