kernel go-slow

2003-02-02 Thread Russell Coker
I'm running a number of machines with 2.4.20 and the ReiserFS journal patches.

One problem that has started occurring is that periodically some of the 
machines will go really slow for a while.  It's as if the CPU speed has just 
dropped to 1% of its regular speed.  Then after 10 minutes or so it will 
continue as normal.

Has anyone heard of such things before?

I am asking here first because the ReiserFS patch is the most significant 
kernel patch I've applied on what is otherwise a stock 2.4.20 kernel.

Interestingly the machines that have the problems are not the most active in 
the file system (mail store), but the mail spool machines.  The mail spool 
machines do a good amount of file access (but well below the limits of the 
hardware) and also use more memory and have large load spikes on occasion 
(virus and spam scanning).

-- 
http://www.coker.com.au/selinux/   My NSA Security Enhanced Linux packages
http://www.coker.com.au/bonnie++/  Bonnie++ hard drive benchmark
http://www.coker.com.au/postal/    Postal SMTP/POP benchmark
http://www.coker.com.au/~russell/  My home page




Re: kernel go-slow

2003-02-02 Thread Rudy L. Zijlstra
Russell Coker wrote:

> I'm running a number of machines with 2.4.20 and the ReiserFS journal patches.
>
> One problem that has started occurring is that periodically some of the
> machines will go really slow for a while.  It's as if the CPU speed has just
> dropped to 1% of its regular speed.  Then after 10 minutes or so it will
> continue as normal.
>
> Has anyone heard of such things before?

Russell,

I am (was) running a vanilla 2.4.20 kernel and experienced a slow-down 
each night during the virus scan. The system would stop responding to HTTP 
at unpredictable moments. It happened reliably every night, though at a 
different time each night. I've just rebooted into 2.4.19 to check whether 
the cause is 2.4.20 or the hardware modification I made 2 weeks ago. The 
system is lightly loaded. The file systems in use are mostly ReiserFS, with 
a smattering of left-over ext2.

Cheers,

Rudy



Re: kernel go-slow

2003-02-02 Thread Ookhoi
Russell Coker wrote (ao):
> I'm running a number of machines with 2.4.20 and the ReiserFS journal
> patches.
>
> One problem that has started occurring is that periodically some of the
> machines will go really slow for a while. It's as if the CPU speed has
> just dropped to 1% of its regular speed. Then after 10 minutes or so
> it will continue as normal.
>
> Has anyone heard of such things before?

It seems there is a 'bug' in 2.4.20 which causes the stall. (I don't know
the details, but you're not the only one.)

Maybe a -pre release fixes it, though in your case I would wait for 2.4.21,
I think.



Re: kernel go-slow

2003-02-06 Thread Alexander Lyamin
Mon, Feb 03, 2003 at 12:27:40AM +0100, Russell Coker wrote:
> I'm running a number of machines with 2.4.20 and the ReiserFS journal patches.
> 
> One problem that has started occurring is that periodically some of the 
> machines will go really slow for a while.  It's as if the CPU speed has just 
> dropped to 1% of its regular speed.  Then after 10 minutes or so it will 
> continue as normal.

When it slows down, please check with vmstat for I/O, or with your
LED for disk activity. That's the simple and stupid method.

But there's no really good way to understand what's going on in the kernel
if you are in userland yourself. So go into the kernel with profiling and see
where it spends its precious time. It's slightly more complicated than the
method above, but much more effective.

> 
> Has anyone heard of such things before?
> 
> I am asking here first because the ReiserFS patch is the most significant 
> kernel patch I've applied on what is otherwise a stock 2.4.20 kernel.
> 
> Interestingly the machines that have the problems are not the most active in 
> the file system (mail store), but the mail spool machines.  The mail spool 
> machines do a good amount of file access (but well below the limits of the 
> hardware) and also use more memory and have large load spikes on occasion 
> (virus and spam scanning).

-- 
"Cache remedies via multi-variable logic shorts will leave you crying."(cl)
Lex Lyamin



Re: kernel go-slow

2003-02-06 Thread Alexander Lyamin
Thu, Feb 06, 2003 at 02:26:49PM +0300, Alexander Lyamin wrote:
> Mon, Feb 03, 2003 at 12:27:40AM +0100, Russell Coker wrote:
> > I'm running a number of machines with 2.4.20 and the ReiserFS journal patches.
> > 
> > One problem that has started occurring is that periodically some of the 
> > machines will go really slow for a while.  It's as if the CPU speed has just 
> > dropped to 1% of its regular speed.  Then after 10 minutes or so it will 
> > continue as normal.
> 
> When it slows down, please check with vmstat for I/O, or with your
I think I wasn't clear enough.
So, first: if you "go-slow" during disk activity, chances are good
that it is caused by the FS or the VM, or by their misunderstandings.

But there are possible situations that will not generate disk activity
and may still cause your system to "go-slow". If you see unusual I/O
numbers while disk activity is moderate to low, it is most likely the
same sweet pair.

But Oleg Drokin pointed out situations where even the I/O numbers will not
indicate what's going on :)

So the advice is still the same: if you are having slowdowns, profiling
might help you much more than the witchy methods described above.

> LED for disk activity. That's the simple and stupid method.
> 
> But there's no really good way to understand what's going on in the kernel
> if you are in userland yourself. So go into the kernel with profiling and see
> where it spends its precious time. It's slightly more complicated than the
> method above, but much more effective.
> 
> > 
> > Has anyone heard of such things before?
> > 
> > I am asking here first because the ReiserFS patch is the most significant 
> > kernel patch I've applied on what is otherwise a stock 2.4.20 kernel.
> > 
> > Interestingly the machines that have the problems are not the most active in 
> > the file system (mail store), but the mail spool machines.  The mail spool 
> > machines do a good amount of file access (but well below the limits of the 
> > hardware) and also use more memory and have large load spikes on occasion 
> > (virus and spam scanning).
Talking about virus/spam scanning: what do you use, and how is it integrated
into your SMTP MTA?

-- 
"Cache remedies via multi-variable logic shorts will leave you crying."(cl)
Lex Lyamin



Re: kernel go-slow

2003-02-06 Thread Russell Coker
On Thu, 6 Feb 2003 17:32, Alexander Lyamin wrote:
> > > One problem that has started occurring is that periodically some of the
> > > machines will go really slow for a while.  It's as if the CPU speed has
> > > just dropped to 1% of its regular speed.  Then after 10 minutes or so
> > > it will continue as normal.
> >
> > When it slows down, please check with vmstat for I/O, or with your
>
> I think I wasn't clear enough.
> So, first: if you "go-slow" during disk activity, chances are good
> that it is caused by the FS or the VM, or by their misunderstandings.

vmstat doesn't work properly.  CPU time is 99% system, which suggests that one 
CPU is spending all its time in kernel space (for both threads of a 
hyper-threaded CPU) or that both CPUs have each got one thread locked in 
kernel space.
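
Where vmstat and sar cannot be trusted, the "99% system" figure can be cross-checked straight from the kernel counters. The following is only a minimal sketch, not something used in this thread: it assumes the aggregate "cpu" line of /proc/stat begins with the user, nice, system and idle counters, and the function names are made up for illustration.

```python
def system_share(before, after):
    """Percentage of elapsed CPU time spent in kernel mode between two samples.

    Each argument is the aggregate "cpu" line from /proc/stat. The first
    four counters are assumed to be user, nice, system and idle time.
    """
    first = [int(value) for value in before.split()[1:]]
    second = [int(value) for value in after.split()[1:]]
    deltas = [new - old for old, new in zip(first, second)]
    total = sum(deltas)
    if total <= 0:
        # No time elapsed between the samples (or the counters wrapped).
        return 0.0
    # Index 2 is the "system" counter under the assumed column order.
    return 100.0 * deltas[2] / total


def read_cpu_line(path="/proc/stat"):
    """Return the aggregate "cpu" line (hypothetical helper for this sketch)."""
    with open(path) as stat:
        for line in stat:
            if line.startswith("cpu "):
                return line
    raise RuntimeError("no aggregate cpu line found")
```

Calling read_cpu_line() twice, a few seconds apart, and passing both results to system_share() gives the share of time spent in the kernel over that interval. It says nothing about where in the kernel the time goes.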

It's not disk related; those machines don't do a huge amount of disk access.  
The machines with the serious disk activity don't have any problems.

> But there are possible situations that will not generate disk activity
> and may still cause your system to "go-slow". If you see unusual I/O
> numbers while disk activity is moderate to low, it is most likely the
> same sweet pair.

The problem is that sar etc. produce jumbled results.  Profiling the kernel may 
help, but may also hide the error, and it's not something I can easily do.

The servers are locked in a managed server room on the other side of the city 
so seeing the blinken lights is not an option.

I've put the aa1 kernel on half the machines and now I'll wait to see what 
happens.  If the aa1 machines don't have the problem but the others do then 
I'll go all aa1.

> > > Interestingly the machines that have the problems are not the most
> > > active in the file system (mail store), but the mail spool machines. 
> > > The mail spool machines do a good amount of file access (but well below
> > > the limits of the hardware) and also use more memory and have large
> > > load spikes on occasion (virus and spam scanning).
>
> Talking about virus/spam scanning: what do you use, and how is it integrated
> into your SMTP MTA?

RAV.  I'm not sure of the details, I think it runs as a daemon that qmail 
talks to.  I try to avoid the anti-virus stuff.

-- 
http://www.coker.com.au/selinux/   My NSA Security Enhanced Linux packages
http://www.coker.com.au/bonnie++/  Bonnie++ hard drive benchmark
http://www.coker.com.au/postal/    Postal SMTP/POP benchmark
http://www.coker.com.au/~russell/  My home page




Re: kernel go-slow

2003-02-06 Thread Oleg Drokin
Hello!

On Thu, Feb 06, 2003 at 05:41:46PM +0100, Russell Coker wrote:

> > But there are possible situations that will not generate disk activity
> > and may still cause your system to "go-slow". If you see unusual I/O
> > numbers while disk activity is moderate to low, it is most likely the
> > same sweet pair.
> The problem is that sar etc. produce jumbled results.  Profiling the kernel may 
> help, but may also hide the error, and it's not something I can easily do.

Well, you can do it very easily.
Reboot with the "profile=2" kernel option.
When the 100% sys CPU situation starts, execute readprofile -r to reset the
profiling counters.
When it is finished, execute readprofile -m /path/to/System.map >somefile
Then sort somefile and you are done: you can now see where most of the time
is spent.
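
The "sort somefile" step can be as simple as a numeric reverse sort (sort -nr somefile | head). As a rough sketch of the same idea in Python, assuming the output has three whitespace-separated columns (ticks, function name, normalized load) with a final "total" row, as the readprofile man page describes; the function name top_functions and the sample numbers are made up for illustration:

```python
def top_functions(profile_text, limit=10):
    """Return up to `limit` (ticks, function) pairs, busiest first."""
    rows = []
    for line in profile_text.splitlines():
        fields = line.split()
        # Assumed layout: ticks, function name, normalized load.
        if len(fields) != 3 or not fields[0].isdigit():
            continue
        ticks, name = int(fields[0]), fields[1]
        if name == "total":
            # Summary row, not a kernel function.
            continue
        rows.append((ticks, name))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:limit]


# Made-up sample in the assumed layout; the numbers are not real measurements.
SAMPLE = """\
    12 schedule                   0.0100
  9500 default_idle             180.0000
   340 do_page_fault              0.2500
  9852 total                      0.0080
"""

if __name__ == "__main__":
    for ticks, name in top_functions(SAMPLE):
        print(f"{ticks:8d}  {name}")
```

On a machine stuck at 100% system time, the function at the top of such a list is the first place to look. The sample above only shows the file layout.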

> The servers are locked in a managed server room on the other side of the city 
> so seeing the blinken lights is not an option.

;)
webcam

> I've put the aa1 kernel on half the machines and now I'll wait to see what 
> happens.  If the aa1 machines don't have the problem but the others do then 
> I'll go all aa1.

Ah, if your problem was caused by highmem I/O support not being present, then that might actually help.

Bye,
Oleg



Re: kernel go-slow

2003-02-06 Thread Hans Reiser
Russell Coker wrote:

> On Thu, 6 Feb 2003 17:32, Alexander Lyamin wrote:
> > > > One problem that has started occurring is that periodically some of the
> > > > machines will go really slow for a while.  It's as if the CPU speed has
> > > > just dropped to 1% of its regular speed.  Then after 10 minutes or so
> > > > it will continue as normal.
> > >
> > > When it slows down, please check with vmstat for I/O, or with your
> >
> > I think I wasn't clear enough.
> > So, first: if you "go-slow" during disk activity, chances are good
> > that it is caused by the FS or the VM, or by their misunderstandings.
>
> vmstat doesn't work properly.  CPU time is 99% system, which suggests that one
> CPU is spending all its time in kernel space (for both threads of a
> hyper-threaded CPU) or that both CPUs have each got one thread locked in
> kernel space.

 

I propose that you try reverting the datalogging patch for long enough 
to find out whether it is our new code that is buggy.

If it is not our code, and it matters enough to justify the cost, we can 
log in remotely and analyze the kernel for you for an hourly fee.  Probably 
the fee you charge them is good enough for us too. ;-)

--
Hans