kernel go-slow
I'm running a number of machines with 2.4.20 and the ReiserFS journal patches.

One problem that has started occurring is that periodically some of the machines will go really slow for a while. It's as if the CPU speed has just dropped to 1% of its regular speed. Then after 10 minutes or so it will continue as normal.

Has anyone heard of such things before?

I am asking here first because the ReiserFS patch is the most significant kernel patch I've applied on what is otherwise a stock 2.4.20 kernel.

Interestingly, the machines that have the problems are not the ones most active in the file system (mail store), but the mail spool machines. The mail spool machines do a good amount of file access (but well below the limits of the hardware) and also use more memory and have large load spikes on occasion (virus and spam scanning).

-- 
http://www.coker.com.au/selinux/ My NSA Security Enhanced Linux packages
http://www.coker.com.au/bonnie++/ Bonnie++ hard drive benchmark
http://www.coker.com.au/postal/ Postal SMTP/POP benchmark
http://www.coker.com.au/~russell/ My home page
Re: kernel go-slow
Russell Coker wrote:
> I'm running a number of machines with 2.4.20 and the ReiserFS journal
> patches.
>
> One problem that has started occurring is that periodically some of the
> machines will go really slow for a while. It's as if the CPU speed has
> just dropped to 1% of its regular speed. Then after 10 minutes or so
> it will continue as normal.
>
> Has anyone heard of such things before?

Russell,

I am (was) running a vanilla 2.4.20 kernel and experienced a slow-down each night during the virus scan. The system would stop responding to HTTP at unpredictable moments. It was repeatable, though: it happened every night, each time at a different moment during the night.

I've just rebooted into 2.4.19 to check whether it's 2.4.20 or the result of a hardware modification I did 2 weeks ago.

The system is lightly loaded. The file systems in use are mostly ReiserFS, with a spattering of left-over ext2.

Cheers,
Rudy
Re: kernel go-slow
Russell Coker wrote (ao):
> I'm running a number of machines with 2.4.20 and the ReiserFS journal
> patches.
>
> One problem that has started occurring is that periodically some of the
> machines will go really slow for a while. It's as if the CPU speed has
> just dropped to 1% of its regular speed. Then after 10 minutes or so
> it will continue as normal.
>
> Has anyone heard of such things before?

It seems there is a 'bug' in 2.4.20 which causes the stall (I don't know the details, but you're not the only one). Maybe a -pre fixes it, though in your case I think I would wait for .21.
Re: kernel go-slow
Mon, Feb 03, 2003 at 12:27:40AM +0100, Russell Coker wrote:
> I'm running a number of machines with 2.4.20 and the ReiserFS journal patches.
>
> One problem that has started occurring is that periodically some of the
> machines will go really slow for a while. It's as if the CPU speed has just
> dropped to 1% of its regular speed. Then after 10 minutes or so it will
> continue as normal.

When it slows down, please check with vmstat for IO, or watch your LED for disk activity. That's simple and stupid.

But there's no really good way to understand what's going on in the kernel if you are in userland yourself. So go into the kernel with profiling and see where it spends its precious time. That is slightly more complicated than the method above, but much more effective.

>
> Has anyone heard of such things before?
>
> I am asking here first because the ReiserFS patch is the most significant
> kernel patch I've applied on what is otherwise a stock 2.4.20 kernel.
>
> Interestingly, the machines that have the problems are not the ones most
> active in the file system (mail store), but the mail spool machines. The
> mail spool machines do a good amount of file access (but well below the
> limits of the hardware) and also use more memory and have large load
> spikes on occasion (virus and spam scanning).

-- 
"Cache remedies via multi-variable logic shorts will leave you crying."(cl)
Lex Lyamin
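As a rough stand-in for the vmstat check suggested above, the share of CPU time spent in kernel mode can be estimated from two snapshots of /proc/stat. What follows is only a minimal sketch in Python, not something posted in this thread. The function names are made up for illustration, and it assumes the aggregate "cpu" line lists user, nice, system and idle jiffies first, possibly followed by extra columns on newer kernels.

```python
def cpu_times(stat_text):
    """Return the jiffy counters from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # Assumption: user, nice, system, idle come first. Keep at most
            # eight columns so guest time (already counted in user on newer
            # kernels) is not added twice.
            return [int(value) for value in fields[1:9]]
    raise ValueError("no aggregate 'cpu' line found")


def system_share(before, after):
    """Percentage of elapsed CPU time spent in kernel mode between snapshots."""
    old, new = cpu_times(before), cpu_times(after)
    deltas = [b - a for a, b in zip(old, new)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    # Index 2 is the 'system' column under the assumption stated above.
    return 100.0 * deltas[2] / total


def read_stat(path="/proc/stat"):
    """Read one snapshot of the kernel's CPU counters."""
    with open(path) as handle:
        return handle.read()
```

To use it, take one snapshot with read_stat(), wait a few seconds, take a second one, and pass both to system_share(). A value near 100 during a slowdown with little disk traffic would point at time burned inside the kernel rather than at IO.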
Re: kernel go-slow
Thu, Feb 06, 2003 at 02:26:49PM +0300, Alexander Lyamin wrote:
> Mon, Feb 03, 2003 at 12:27:40AM +0100, Russell Coker wrote:
> > I'm running a number of machines with 2.4.20 and the ReiserFS journal patches.
> >
> > One problem that has started occurring is that periodically some of the
> > machines will go really slow for a while. It's as if the CPU speed has just
> > dropped to 1% of its regular speed. Then after 10 minutes or so it will
> > continue as normal.
>
> When it slows down, please check with vmstat for IO, or watch your

I think I wasn't clear enough.

So, first: if you "go-slow" during disk activity, chances are good that it is caused by the FS or the VM, or by their misunderstandings.

But there are possible situations that will not generate disk activity, yet may still cause your system to "go-slow". If you see unusual IO numbers while disk activity is moderate to low, it is most likely the same sweet pair.

But Oleg Drokin pointed out situations where even the IO numbers will not indicate what's going on :)

So the advice is still the same: if you are having slowdowns, profiling will help you much more than the witchy methods described above.

> LED for disk activity. That's simple and stupid.
>
> But there's no really good way to understand what's going on in the kernel
> if you are in userland yourself. So go into the kernel with profiling and
> see where it spends its precious time. That is slightly more complicated
> than the method above, but much more effective.
>
> >
> > Has anyone heard of such things before?
> >
> > I am asking here first because the ReiserFS patch is the most significant
> > kernel patch I've applied on what is otherwise a stock 2.4.20 kernel.
> >
> > Interestingly, the machines that have the problems are not the ones most
> > active in the file system (mail store), but the mail spool machines. The
> > mail spool machines do a good amount of file access (but well below the
> > limits of the hardware) and also use more memory and have large load
> > spikes on occasion (virus and spam scanning).

Talking about virus/spam scanning: what do you use, and how is it integrated into your SMTP MTA?

-- 
"Cache remedies via multi-variable logic shorts will leave you crying."(cl)
Lex Lyamin
Re: kernel go-slow
On Thu, 6 Feb 2003 17:32, Alexander Lyamin wrote:
> > > One problem that has started occurring is that periodically some of the
> > > machines will go really slow for a while. It's as if the CPU speed has
> > > just dropped to 1% of its regular speed. Then after 10 minutes or so
> > > it will continue as normal.
> >
> > When it slows down, please check with vmstat for IO, or watch your
>
> I think I wasn't clear enough.
> So, first: if you "go-slow" during disk activity, chances are good
> that it is caused by the FS or the VM, or by their misunderstandings.

vmstat doesn't work properly. CPU time is 99% system, which suggests that one CPU is spending all its time in kernel space (for both threads of a hyper-threaded CPU) or that both CPUs have each got one thread locked in kernel space.

It's not disk related; those machines don't have a huge amount of disk access. The machines with the serious disk activity don't have any problems.

> But there are possible situations that will not generate disk activity,
> yet may still cause your system to "go-slow". If you see unusual IO
> numbers while disk activity is moderate to low, it is most likely the
> same sweet pair.

The problem is that sar etc. produce jumbled results. Profiling the kernel may help, but may also hide the error, and it's not something I can easily do.

The servers are locked in a managed server room on the other side of the city, so seeing the blinken lights is not an option.

I've put the aa1 kernel on half the machines and now I'll wait to see what happens. If the aa1 machines don't have the problem but the others do, then I'll go all aa1.

> > > Interestingly, the machines that have the problems are not the ones
> > > most active in the file system (mail store), but the mail spool
> > > machines. The mail spool machines do a good amount of file access (but
> > > well below the limits of the hardware) and also use more memory and
> > > have large load spikes on occasion (virus and spam scanning).
>
> Talking about virus/spam scanning: what do you use, and how is it
> integrated into your SMTP MTA?

RAV. I'm not sure of the details; I think it runs as a daemon that qmail talks to. I try to avoid the anti-virus stuff.

-- 
http://www.coker.com.au/selinux/ My NSA Security Enhanced Linux packages
http://www.coker.com.au/bonnie++/ Bonnie++ hard drive benchmark
http://www.coker.com.au/postal/ Postal SMTP/POP benchmark
http://www.coker.com.au/~russell/ My home page
Re: kernel go-slow
Hello!

On Thu, Feb 06, 2003 at 05:41:46PM +0100, Russell Coker wrote:
> > But there are possible situations that will not generate disk activity,
> > yet may still cause your system to "go-slow". If you see unusual IO
> > numbers while disk activity is moderate to low, it is most likely the
> > same sweet pair.
> The problem is that sar etc. produce jumbled results. Profiling the kernel may
> help, but may also hide the error, and it's not something I can easily do.

Well, you can do it very easily. Reboot with the "profile=2" kernel option. When the 100% sys CPU situation starts, execute
    readprofile -r
When it is finished, execute
    readprofile -m /path/to/System.map >somefile
Then sort somefile and you are done: you can now see where most of the time is spent.

> The servers are locked in a managed server room on the other side of the city,
> so seeing the blinken lights is not an option.

;) webcam

> I've put the aa1 kernel on half the machines and now I'll wait to see what
> happens. If the aa1 machines don't have the problem but the others do, then
> I'll go all aa1.

Ah, if your problem came from highmem I/O support not being present, then that might actually help.

Bye,
    Oleg
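The last step above ("sort somefile") can be sketched as a short script. This is an illustration only, written in Python. It assumes each line of readprofile output has the form "ticks symbol load" and that the listing ends with a "total" line. The sample data and the symbol name hypothetical_hot_function are invented, not taken from any real profile.

```python
def top_symbols(profile_text, limit=10):
    """Return (ticks, symbol) pairs, hottest first, from readprofile output."""
    rows = []
    for line in profile_text.splitlines():
        fields = line.split()
        # Assumed layout: <ticks> <symbol> <normalized load>
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        ticks, symbol = int(fields[0]), fields[1]
        if symbol == "total":
            continue  # summary line, not a kernel function
        rows.append((ticks, symbol))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:limit]


# Invented sample in the assumed format, for demonstration only.
SAMPLE = """\
     2 do_fork                      0.0012
   480 hypothetical_hot_function    1.8750
    18 default_idle                 0.3750
   500 total                        0.0003
"""

if __name__ == "__main__":
    for ticks, symbol in top_symbols(SAMPLE):
        print(f"{ticks:8d} {symbol}")
```

Against a real capture, the text of somefile would be passed in place of SAMPLE. Whatever symbol lands at the top of the list is where the kernel spent most of its ticks during the slowdown.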
Re: kernel go-slow
Russell Coker wrote:
> On Thu, 6 Feb 2003 17:32, Alexander Lyamin wrote:
> > > > One problem that has started occurring is that periodically some of
> > > > the machines will go really slow for a while. It's as if the CPU
> > > > speed has just dropped to 1% of its regular speed. Then after 10
> > > > minutes or so it will continue as normal.
> > >
> > > When it slows down, please check with vmstat for IO, or watch your
> >
> > I think I wasn't clear enough.
> > So, first: if you "go-slow" during disk activity, chances are good
> > that it is caused by the FS or the VM, or by their misunderstandings.
>
> vmstat doesn't work properly. CPU time is 99% system, which suggests that
> one CPU is spending all its time in kernel space (for both threads of a
> hyper-threaded CPU) or that both CPUs have each got one thread locked in
> kernel space.

I propose that you try reversing the datalogging patch for long enough to know whether it is our new code that is buggy.

If it is not our code, and it matters enough to justify the cost, we can log in remotely and analyze the kernel for you for an hourly fee. Probably the fee you charge them is good enough for us too. ;-)

-- 
Hans