Re: I don't get where the load comes from
On 06/02/11 02:31, Corey wrote: On 06/01/2011 10:16 AM, Christiano F. Haesbaert wrote:

I had tinkered with a solution for this: Cron wakes up a minute before the batch run is scheduled to run. Cron will then copy a random 4kb sector from the hard disk to RAM, then run either an MD5 or SHA hash against it. The whole process would be timed, and if it completed within a reasonable amount of time for the system then it would kick off a batch job. This was the easiest way I thought of measuring the actual performance of the system at any given time, since it measures the entire system and closely emulates actual work. While this isn't really the right thing to do, I found it to be the most effective on my systems.

You really think cron should be doing its own calculation? I don't like that *at all*. Can't we just have a higher default threshold for cron? Can't we default to 0? I think this is something that should be looked at; if we admit load average is a shitty measure, we shouldn't rely on it for running cron jobs. I hereby vote for default to 0. (Thank god this isn't a democracy :-) )

Just have cron look at the system load average... ducking :)

a few posts it was mentioned that that is actually the case
Re: I don't get where the load comes from
On 06/03/11 16:10, Alexander Hall wrote: On 06/02/11 02:31, Corey wrote: On 06/01/2011 10:16 AM, Christiano F. Haesbaert wrote:

I had tinkered with a solution for this: Cron wakes up a minute before the batch run is scheduled to run. Cron will then copy a random 4kb sector from the hard disk to RAM, then run either an MD5 or SHA hash against it. The whole process would be timed, and if it completed within a reasonable amount of time for the system then it would kick off a batch job. This was the easiest way I thought of measuring the actual performance of the system at any given time, since it measures the entire system and closely emulates actual work. While this isn't really the right thing to do, I found it to be the most effective on my systems.

You really think cron should be doing its own calculation? I don't like that *at all*. Can't we just have a higher default threshold for cron? Can't we default to 0? I think this is something that should be looked at; if we admit load average is a shitty measure, we shouldn't rely on it for running cron jobs. I hereby vote for default to 0. (Thank god this isn't a democracy :-) )

Just have cron look at the system load average... ducking :)

a few posts it was mentioned that that is actually the case

^ *ago*
Re: I don't get where the load comes from
On Thu, Jun 02, 2011 at 01:12:54AM -0400, Nico Kadel-Garcia wrote:
| Then perhaps learn to write. If you're measuring a different
| phenomenon, one that has different units, then it's a distinctly
| different *calculation* because you're measuring a distinct collection
| of objects. One may as well add up a restaurant bill, leave out the
| tax and tip, and say it's unchanged because I used the same plus
| signs.

No different measurement, nothing has changed. Your tax+tip example is off; day one you just have soup, the next day you have soup plus a main course. The *price* changed, not the tax rate or the tip rate. With a changed price, the final sum is different but the calculation is exactly the same. You're not arguing that the calculation is different because the outcome changes, are you? If that's your point then I'm not really sure what you're doing here; that's just inane.

These days there are more processes on a machine, including those kernel threads, you know... the ones with non-random (ie sequential) PIDs that also do work. Also, the speed of the system has changed. Units do not change; variables change (just like the amount of work a machine does over the course of 1, 5 or 15 minutes) but the calculation does not: it gathers some variables and outputs a nigh-meaningless number.

Paul 'WEiRD' de Weerd
--
[++-]+++.+++[---].+++[+ +++-].++[-]+.--.[-] http://www.weirdnet.nl/
Re: I don't get where the load comes from
On Thu, Jun 2, 2011 at 1:12 AM, Nico Kadel-Garcia nka...@gmail.com wrote: On Thu, Jun 2, 2011 at 12:48 AM, Theo de Raadt dera...@cvs.openbsd.org wrote:

100% right. The load average calculation has not changed in 25 years. Anyone who says otherwise hasn't got a single fact on their side. What has changed, however, is that the kernel has more kernel threads running (for instance, run ps aguxk, and look at the first few which have the 'K' flag set in the 'STAT' field). Some kernels have decided to not count those threads, others do count them. Since these kernel threads make various decisions for when to do their next tasks and how to context switch, the statistical monitoring of the system which ends up creating load values can get perturbed. That's what this comes down to.

Which... sounds exactly like a change in the load average calculation, due to kernel changes, that has occurred in the last 25 years.

You clearly cannot read. The calculation has NOT CHANGED. The way that work is done in the kernel has changed. You better get back to class; your potty break is over.

Then perhaps learn to write. If you're measuring a different phenomenon, one that has different units, then it's a distinctly different *calculation* because you're measuring a distinct collection of objects. One may as well add up a restaurant bill, leave out the tax and tip, and say it's unchanged because I used the same plus signs. It's particularly confusing, as the original poster was confused, when trying to compare prices, in this case system loads.

Thinking about this: I'm not saying that this implies *OpenBSD* changed its calculation. As Theo pointed out, other kernels have changed what they report to the load tool. So that shifts the measure on other kernels. Perhaps he took this personally.
Re: I don't get where the load comes from
On Thu, Jun 2, 2011 at 4:58 AM, Paul de Weerd we...@weirdnet.nl wrote: On Thu, Jun 02, 2011 at 01:12:54AM -0400, Nico Kadel-Garcia wrote:
| Then perhaps learn to write. If you're measuring a different
| phenomenon, one that has different units, then it's a distinctly
| different *calculation* because you're measuring a distinct collection
| of objects. One may as well add up a restaurant bill, leave out the
| tax and tip, and say it's unchanged because I used the same plus
| signs.

No different measurement, nothing has changed. Your tax+tip example is off; day one you just have soup, the next day you have soup plus a main course. The *price* changed, not the tax rate or the tip rate. With a changed price, the final sum is different but the calculation is exactly the same. You're not arguing that the calculation is different because the outcome changes, are you? If that's your point then I'm not really sure what you're doing here; that's just inane.

No, no. I did read Theo's note, especially where he said: Some kernels have decided to not count those threads, others do count them. The kernel is still running the processes, in both cases. They're consuming system resources. Running too many such processes will still interfere with other production. I'm still having the soup, either way.

These days there are more processes on a machine, including those kernel threads, you know... the ones with non-random (ie sequential) PIDs that also do work. Also, the speed of the system has changed. Units do not change; variables change (just like the amount of work a machine does over the course of 1, 5 or 15 minutes) but the calculation does not: it gathers some variables and outputs a nigh-meaningless number.

So you're implying that because we make more money, we don't notice the tax and tip so much on the bill. It still matters. It's still part of the bill, and very confusing when comparison shopping.
It's a metaphor I use because I'm in Massachusetts, not that far from New Hampshire: different things are tax-free in each state. The money, or in this case the resources, is coming out of *somebody's* pocket.
Re: I don't get where the load comes from
On 2011-06-02 13.36, Nico Kadel-Garcia wrote:

Thinking about this: I'm not saying that this implies *OpenBSD* changed its calculation. As Theo pointed out, other kernels have changed what they report to the load tool. So that shifts the measure on other kernels. Perhaps he took this personally.

Theo is hardly the one who's taken things personally in this discussion. I for one have seen the darned load average being misunderstood so incredibly often over the years, that seeing it being almost inexplicably abused by so many otherwise well-educated unix-literate people for sure makes it personal to *ME*. And while it's undeniably hard to correct people who may have had this misconception their entire careers, it has to be done. This is so deeply rooted, and has been the cause of so much bad advice, that the implications for those misinformed are of course hard to grasp. Denial is a word that comes to mind, but it is so easy to set the record straight - just read the damn code.

Now, can we move on please? If anyone feels they've given bad advice in the past due to this debacle, learn from it and don't do it again. :-)

Regards,
/Benny
--
internetlabbet.se / work: +46 8 551 124 80 / Words must
Benny Löfgren / mobile: +46 70 718 11 90 / be weighed,
/ fax: +46 8 551 124 89 / not counted.
/ email: benny -at- internetlabbet.se
Re: I don't get where the load comes from
On 2011-05-31 14.45, Artur Grabowski wrote:

The load average is a decaying average of the number of processes in the runnable state, or currently running on a cpu, or in the process of being forked, or that have spent less than a second in a sleep state with sleep priority lower than PZERO, which includes waiting for memory resources, disk I/O, filesystem locks and a bunch of other things. You could say it's a very vague estimate of how much work the cpu might need to be doing soon, maybe. Or it could be completely wrong because of sampling bias. It's not very important, so it's not really critical for the system to do a good job guessing this number, so the system doesn't really try too hard. This number may tell you something useful, or it might be totally misleading. Or both.

One thing that often bites me in the butt is that cron relies on the load average to decide if it should let batch(1) jobs run or not. The default is that if cron sees a load average over 1.5 it keeps the batch job enqueued until it drops below that value. As I often see much, much higher loads on my systems, invariably I find myself wondering why my batch jobs never finish, just to discover that they have yet to run. *duh* So whenever I remember to, on every new system I set up I configure a different load threshold value for cron. But I tend to forget, so... :-)

I have no really good suggestion for how else cron should handle this, otherwise I would have submitted a patch ages ago...

Regards,
/Benny
--
internetlabbet.se / work: +46 8 551 124 80 / Words must
Benny Löfgren / mobile: +46 70 718 11 90 / be weighed,
/ fax: +46 8 551 124 89 / not counted.
/ email: benny -at- internetlabbet.se
Re: I don't get where the load comes from
Load is generally a measure of single processor core utilization over a kernel-dependent time range. Generally, as others have pointed out, it is a very broad measure (not as in meadow, as in continent). Different OSes report load very differently from each other today.

Traditionally you would see a load average of 1-2 on a multicore system (I am talking HP-UX X client servers etc. of the early 90's vintage). A load average of 1 means a single core of the system is being utilized close to 100% of the time. On dual-core systems a load average of 1 should be absolutely no cause for concern. Linux has moved away from reporting load average as a percentage of a single core's time in recent days for precisely this reason; people see a load of 1 and think their systems are exploding. In the traditional mold, today's processors should in theory get loads of 4-7 and still be responsive...

On 31 May 2011 19:10, Joel Carnat j...@carnat.net wrote: Le 31 mai 2011 à 08:10, Tony Abernethy a écrit : Joel Carnat wrote:

well, compared to my previous box, running NetBSD/xen, the same services and showing about 0.3-0.6 of load; I thought a load of 1.21 was quite much.

Different systems will agree on the spelling of the word load. That is about as much agreement as you can expect. Does the 0.3-0.6 really mean 30-60 percent loaded?

As far as I understood the counters on my previous nbsd box, 0.3 meant that the cpu was used at 30% of its total capacity. Then, looking at the sys/user counters, I'd see what kind of things the system was doing.

1.21 tasks seems kinda low for a multi-tasking system.

ok :)
Re: I don't get where the load comes from
On 2011-06-01 15.12, Joel Wiramu Pauling wrote:

Load is generally a measure of single processor core utilization over a kernel-dependent time range.

No it isn't. You have totally misunderstood what the load average is.

Generally, as others have pointed out, it is a very broad measure (not as in meadow, as in continent). Different OSes report load very differently from each other today.

That one's sort of correct, although I've yet to see an OS where the load doesn't in some way refer to an *average* *count* *of* *processes*.

Traditionally you would see a load average of 1-2 on a multicore system (I am talking HP-UX X client servers etc. of the early 90's vintage). A load average of 1 means a single core of the system is being utilized close to 100% of the time.

No, no, no. Absolutely *NOT*. It doesn't reflect CPU usage at all. And it never has. The load average must be the single most misunderstood kernel metric there has ever been in the history of unix systems. Very simplified, it reflects the *number* *of* *processes* in a runnable state, averaged over some time. Not necessarily processes actually on core, mind you, but the number of processes *wanting* to run. Now, a process can be in a runnable state for a variety of reasons, and there is for example nothing that says it even needs to use up its allotted time slice when actually running, but it still counts as runnable. It can be runnable when waiting for a system resource; then it consumes *no* CPU cycles at all, but it still counts towards the load average.

On dual-core systems a load average of 1 should be absolutely no cause for concern.

I routinely see load averages of 30-40-50, upwards of 100 on some of my systems. They run absolutely smooth and beautiful, with no noticeable lag or delays. The processors may be near idling, they may be doing some work, it varies, but it is nothing I can tell from the load average alone.

Linux has moved away from reporting load average as a percentage of a single core's time in recent days for precisely this reason; people see a load of 1 and think their systems are exploding. In the traditional mold, today's processors should in theory get loads of 4-7 and still be responsive...

I'm sorry to say, but your entire text is based on a misunderstanding of what the load average really is, so the above sentences are totally irrelevant.

Regards,
/Benny

On 31 May 2011 19:10, Joel Carnat j...@carnat.net wrote: Le 31 mai 2011 à 08:10, Tony Abernethy a écrit : Joel Carnat wrote:

well, compared to my previous box, running NetBSD/xen, the same services and showing about 0.3-0.6 of load; I thought a load of 1.21 was quite much.

Different systems will agree on the spelling of the word load. That is about as much agreement as you can expect. Does the 0.3-0.6 really mean 30-60 percent loaded?

As far as I understood the counters on my previous nbsd box, 0.3 meant that the cpu was used at 30% of its total capacity. Then, looking at the sys/user counters, I'd see what kind of things the system was doing.

1.21 tasks seems kinda low for a multi-tasking system.

ok :)
--
internetlabbet.se / work: +46 8 551 124 80 / Words must
Benny Löfgren / mobile: +46 70 718 11 90 / be weighed,
/ fax: +46 8 551 124 89 / not counted.
/ email: benny -at- internetlabbet.se
Re: I don't get where the load comes from
On 01-Jun-11 05:46, Benny Lofgren wrote: On 2011-05-31 14.45, Artur Grabowski wrote:

The load average is a decaying average of the number of processes in the runnable state, or currently running on a cpu, or in the process of being forked, or that have spent less than a second in a sleep state with sleep priority lower than PZERO, which includes waiting for memory resources, disk I/O, filesystem locks and a bunch of other things. You could say it's a very vague estimate of how much work the cpu might need to be doing soon, maybe. Or it could be completely wrong because of sampling bias. It's not very important, so it's not really critical for the system to do a good job guessing this number, so the system doesn't really try too hard. This number may tell you something useful, or it might be totally misleading. Or both.

One thing that often bites me in the butt is that cron relies on the load average to decide if it should let batch(1) jobs run or not. The default is that if cron sees a load average over 1.5 it keeps the batch job enqueued until it drops below that value. As I often see much, much higher loads on my systems, invariably I find myself wondering why my batch jobs never finish, just to discover that they have yet to run. *duh* So whenever I remember to, on every new system I set up I configure a different load threshold value for cron. But I tend to forget, so... :-) I have no really good suggestion for how else cron should handle this, otherwise I would have submitted a patch ages ago...

I had tinkered with a solution for this: Cron wakes up a minute before the batch run is scheduled to run. Cron will then copy a random 4kb sector from the hard disk to RAM, then run either an MD5 or SHA hash against it. The whole process would be timed, and if it completed within a reasonable amount of time for the system then it would kick off a batch job. This was the easiest way I thought of measuring the actual performance of the system at any given time, since it measures the entire system and closely emulates actual work. While this isn't really the right thing to do, I found it to be the most effective on my systems.
Re: I don't get where the load comes from
On 1 June 2011 11:01, LeviaComm Networks n...@leviacomm.net wrote: On 01-Jun-11 05:46, Benny Lofgren wrote: On 2011-05-31 14.45, Artur Grabowski wrote:

The load average is a decaying average of the number of processes in the runnable state, or currently running on a cpu, or in the process of being forked, or that have spent less than a second in a sleep state with sleep priority lower than PZERO, which includes waiting for memory resources, disk I/O, filesystem locks and a bunch of other things. You could say it's a very vague estimate of how much work the cpu might need to be doing soon, maybe. Or it could be completely wrong because of sampling bias. It's not very important, so it's not really critical for the system to do a good job guessing this number, so the system doesn't really try too hard. This number may tell you something useful, or it might be totally misleading. Or both.

One thing that often bites me in the butt is that cron relies on the load average to decide if it should let batch(1) jobs run or not. The default is that if cron sees a load average over 1.5 it keeps the batch job enqueued until it drops below that value. As I often see much, much higher loads on my systems, invariably I find myself wondering why my batch jobs never finish, just to discover that they have yet to run. *duh* So whenever I remember to, on every new system I set up I configure a different load threshold value for cron. But I tend to forget, so... :-) I have no really good suggestion for how else cron should handle this, otherwise I would have submitted a patch ages ago...

I had tinkered with a solution for this: Cron wakes up a minute before the batch run is scheduled to run. Cron will then copy a random 4kb sector from the hard disk to RAM, then run either an MD5 or SHA hash against it. The whole process would be timed, and if it completed within a reasonable amount of time for the system then it would kick off a batch job. This was the easiest way I thought of measuring the actual performance of the system at any given time, since it measures the entire system and closely emulates actual work. While this isn't really the right thing to do, I found it to be the most effective on my systems.

You really think cron should be doing its own calculation? I don't like that *at all*. Can't we just have a higher default threshold for cron? Can't we default to 0? I think this is something that should be looked at; if we admit load average is a shitty measure, we shouldn't rely on it for running cron jobs. I hereby vote for default to 0. (Thank god this isn't a democracy :-) )
Re: I don't get where the load comes from
On 2011-06-01 15.53, Joel Wiramu Pauling wrote: On 2 June 2011 01:41, Benny Lofgren bl-li...@lofgren.biz wrote:

I agree with what you are saying, and I worded this quite badly; the frame I was trying to set up was back in the day when multi-user meant something (VAX/PDP) - the load average WAS tied to core utilization - as you would queue a job, and it would go into the queue, and there would be lots of stuff in the queue, and the load average would bump, because there wasn't much core to go around.

Not wanting to turn this into a pissing contest, I still have to say that you are fundamentally wrong about this. I'm sorry, but what you are saying simply is not correct. I've worked in-depth on just about every unixlike architecture there is since I started out in this business back in 1983, and on every single one (that employed it at all) the load average concept has worked similarly to how I described it in my previous mail. (Not always EXACTLY alike, but the general principle has always been the same.)

The reason I'm so adamant about this is that the interpretation of the load average metric truly is one of the longest-standing misconceptions about the finer points of unix system administration there is, and if this discussion thread can set just one individual straight about it then it is worth the extra mail bandwidth. :-)

One only needs to look at all of the very confident, yet dead-wrong, answers to the OP's question in this thread to realize that it is indeed a confusing subject. And the importance of getting it straightened out cannot be overstated. I've long ago lost count of the number of times I've been called in to fix a problem with high system loads only to find that the only metric used to determine that is... yes, the load average. I wonder how much money has been wasted over the years trying to throw hardware on what might not even have been a problem in the first place...

Regards,
/Benny

That hasn't been the case for a very very long time, and once we entered the age of multi-tasking load became unintuitive. Point being it's an indication of something today that isn't at all intuitive. Sorry for muddying the waters even more, my fuck up.

On 31 May 2011 19:10, Joel Carnat j...@carnat.net wrote: Le 31 mai 2011 à 08:10, Tony Abernethy a écrit : Joel Carnat wrote:

well, compared to my previous box, running NetBSD/xen, the same services and showing about 0.3-0.6 of load; I thought a load of 1.21 was quite much.

Different systems will agree on the spelling of the word load. That is about as much agreement as you can expect. Does the 0.3-0.6 really mean 30-60 percent loaded?

As far as I understood the counters on my previous nbsd box, 0.3 meant that the cpu was used at 30% of its total capacity. Then, looking at the sys/user counters, I'd see what kind of things the system was doing.

1.21 tasks seems kinda low for a multi-tasking system.

ok :)
--
internetlabbet.se / work: +46 8 551 124 80 / Words must
Benny Löfgren / mobile: +46 70 718 11 90 / be weighed,
/ fax: +46 8 551 124 89 / not counted.
/ email: benny -at- internetlabbet.se
--
internetlabbet.se / work: +46 8 551 124 80 / Words must
Benny Löfgren / mobile: +46 70 718 11 90 / be weighed,
/ fax: +46 8 551 124 89 / not counted.
/ email: benny -at- internetlabbet.se
Re: I don't get where the load comes from
On 2011-06-01 17.16, Christiano F. Haesbaert wrote: On 1 June 2011 11:01, LeviaComm Networks n...@leviacomm.net wrote: On 01-Jun-11 05:46, Benny Lofgren wrote: On 2011-05-31 14.45, Artur Grabowski wrote:

The load average is a decaying average of the number of processes in the runnable state, or currently running on a cpu, or in the process of being forked, or that have spent less than a second in a sleep state with sleep priority lower than PZERO, which includes waiting for memory resources, disk I/O, filesystem locks and a bunch of other things. You could say it's a very vague estimate of how much work the cpu might need to be doing soon, maybe. Or it could be completely wrong because of sampling bias. It's not very important, so it's not really critical for the system to do a good job guessing this number, so the system doesn't really try too hard. This number may tell you something useful, or it might be totally misleading. Or both.

One thing that often bites me in the butt is that cron relies on the load average to decide if it should let batch(1) jobs run or not. The default is that if cron sees a load average over 1.5 it keeps the batch job enqueued until it drops below that value. As I often see much, much higher loads on my systems, invariably I find myself wondering why my batch jobs never finish, just to discover that they have yet to run. *duh* So whenever I remember to, on every new system I set up I configure a different load threshold value for cron. But I tend to forget, so... :-) I have no really good suggestion for how else cron should handle this, otherwise I would have submitted a patch ages ago...

I had tinkered with a solution for this: Cron wakes up a minute before the batch run is scheduled to run. Cron will then copy a random 4kb sector from the hard disk to RAM, then run either an MD5 or SHA hash against it. The whole process would be timed, and if it completed within a reasonable amount of time for the system then it would kick off a batch job. This was the easiest way I thought of measuring the actual performance of the system at any given time, since it measures the entire system and closely emulates actual work. While this isn't really the right thing to do, I found it to be the most effective on my systems.

You really think cron should be doing its own calculation? I don't like that *at all*. Can't we just have a higher default threshold for cron? Can't we default to 0? I think this is something that should be looked at; if we admit load average is a shitty measure, we shouldn't rely on it for running cron jobs. I hereby vote for default to 0. (Thank god this isn't a democracy :-) )

I didn't really like Christopher's suggestion either. For one thing, *any* kind of attempt at userland performance measurement will over time (as hardware gets faster) become less accurate to the point of not being usable unless tuned, and we really DON'T want to have to tune cron (or anything else userland for that matter) for different architectures and/or generations of systems.

Also, what kind of metric should cron measure? What if the batch job is CPU-bound only, but will take two weeks to run and it's simply most convenient to start it using batch(1)? Or if the second batch job is i/o bound and doesn't get to run because I just started up the two-week CPU-bound job and cron only measures that?

In fact I really don't feel the load average is such a bad metric for cron to use; it's just that the default was probably set a millennium ago and hasn't changed since then. Easiest is to set the default to 0.0 as you suggest, disabling the feature altogether; more complicated, but perhaps better in this world of multi-core systems, might be to set it equal to the number of cores.

(Which also reminds me, sendmail has a similar feature using load average, which has also bugged me from time to time. Might be others as well, but none come to mind right now.)

Regards,
/Benny
--
internetlabbet.se / work: +46 8 551 124 80 / Words must
Benny Löfgren / mobile: +46 70 718 11 90 / be weighed,
/ fax: +46 8 551 124 89 / not counted.
/ email: benny -at- internetlabbet.se
Re: I don't get where the load comes from
On 2011-06-01 15.53, Joel Wiramu Pauling wrote: On 2 June 2011 01:41, Benny Lofgren bl-li...@lofgren.biz wrote:

I agree with what you are saying, and I worded this quite badly; the frame I was trying to set up was back in the day when multi-user meant something (VAX/PDP) - the load average WAS tied to core utilization - as you would queue a job, and it would go into the queue, and there would be lots of stuff in the queue, and the load average would bump, because there wasn't much core to go around.

Not wanting to turn this into a pissing contest, I still have to say that you are fundamentally wrong about this. I'm sorry, but what you are saying simply is not correct. I've worked in-depth on just about every unixlike architecture there is since I started out in this business back in 1983, and on every single one (that employed it at all) the load average concept has worked similarly to how I described it in my previous mail. (Not always EXACTLY alike, but the general principle has always been the same.)

The reason I'm so adamant about this is that the interpretation of the load average metric truly is one of the longest-standing misconceptions about the finer points of unix system administration there is, and if this discussion thread can set just one individual straight about it then it is worth the extra mail bandwidth. :-)

100% right. The load average calculation has not changed in 25 years. Anyone who says otherwise hasn't got a single fact on their side. What has changed, however, is that the kernel has more kernel threads running (for instance, run ps aguxk, and look at the first few which have the 'K' flag set in the 'STAT' field). Some kernels have decided to not count those threads, others do count them. Since these kernel threads make various decisions for when to do their next tasks and how to context switch, the statistical monitoring of the system which ends up creating load values can get perturbed. That's what this comes down to.
Re: I don't get where the load comes from
On 01-Jun-11 08:39, Benny Lofgren wrote: On 2011-06-01 17.16, Christiano F. Haesbaert wrote: On 1 June 2011 11:01, LeviaComm Networksn...@leviacomm.net wrote: On 01-Jun-11 05:46, Benny Lofgren wrote: On 2011-05-31 14.45, Artur Grabowski wrote: The load average is a decaying average of the number of processes in the runnable state or currently running on a cpu or in the process of being forked or that have spent less than a second in a sleep state with sleep priority lower than PZERO, which includes waiting for memory resources, disk I/O, filesystem locks and a bunch of other things. You could say it's a very vague estimate of how much work the cpu might need to be doing soon, maybe. Or it could be completely wrong because of sampling bias. It's not very important so it's not really critical for the system to do a good job guessing this number, so the system doesn't really try too hard. This number may tell you something useful, or it might be totally misleading. Or both. One thing that often bites me in the butt is that cron relies on the load average to decide if it should let batch(1) jobs run or not. The default is if cron sees a loadavg 1.5 it keeps the batch job enqueued until it drops below that value. As I often see much, much higher loads on my systems, invariably I find myself wondering why my batch jobs never finish, just to discover that they have yet to run. *duh* So whenever I remember to, on every new system I set up I configure a different load threshold value for cron. But I tend to forget, so... :-) I have no really good suggestion for how else cron should handle this, otherwise I would have submitted a patch ages ago... I had tinkered with a solution for this: Cron wakes up a minute before the batch run is scheduled to run. Cron will then copy a random 4kb sector from the hard disk to RAM, then run either an MD5 or SHA hash against it. 
The whole process would be timed and if it completed within a reasonable amount of time for the system then it would kick off a batch job. This was the easiest way I thought of measuring the actual performance of the system at any given time since it measures the entire system and closely emulates actual work. While this isn't really the right thing to do, I found it to be the most effective on my systems. You really think cron should be doing its own calculation ? I don't like that *at all*. Can't we just have a higher default threshold for cron ? Can't we default to 0 ? I think this is something that should be looked at, if we admit load average is a shitty measure, we shouldn't rely on it for running cron jobs. I hereby vote for default to 0. (Thank god this isn't a democracy :-) ) I didn't really like Christopher's suggestion either. I don't like it either, but it's the only way to get my file server to run batch jobs without noticeable performance loss. For one thing, *any* kind of attempt at userland performance measurement will over time (as hardware gets faster) become less accurate to the point of not being usable unless tuned, and we really DON'T want to have to tune cron (or anything else userland for that matter) for different architectures and/or generations of systems. I never intended my suggestion to be used as-is in OpenBSD proper, just as a mention of what I am doing to bypass bogus load values preventing my system from doing what it needs. Also, what kind of metric should cron measure? What if the batch job is CPU-bound only, but will take two weeks to run and it's simply most convenient to start it using batch(1)? Or if the second batch job is i/o bound and doesn't get to run because I just started up the two-week CPU-bound job and cron only measures that?
The jobs I run on my file servers require a bit of everything. In fact I really don't feel the load average is such a bad metric for cron to use, it's just that the default was probably set a millennium ago and hasn't changed since then. Easiest is to set the default to 0.0 as you suggest, disabling the feature altogether; more complicated, but perhaps better in this world of multi-core systems, might be to set it to the number of cores. I agree with that, but I've had times where the load was genuinely high and the batch jobs cause the system to slow to the point of being unusable, or at least to the point where the users start complaining about it. (Which also reminds me, sendmail has a similar feature using load average, which has also bugged me from time to time. Might be others as well, but none come to mind right now.) Regards, /Benny -- -Christopher Ahrens
Re: I don't get where the load comes from
On 1 June 2011 17:49, Theo de Raadt dera...@cvs.openbsd.org wrote: What has changed, however, is that the kernel has more kernel threads running (for instance, run ps aguxk and look at the first few processes, which have the 'K' flag set in the 'STAT' field). In ps aguxk, what does the g do? I didn't find enlightenment on the man page or the Googles. -- PS: Apologies btw. for not responding to and thanking list members who wrote helpful replies after my graphics tablet question. That one's kinda slipped through and I meant to get back to it, but...
Re: I don't get where the load comes from
On Wed, Jun 01, 2011 at 09:49:17AM -0600, Theo de Raadt wrote: On 2011-06-01 15.53, Joel Wiramu Pauling wrote: On 2 June 2011 01:41, Benny Lofgren bl-li...@lofgren.biz mailto:bl-li...@lofgren.biz wrote: I agree with what you are saying, and I worded this quite badly, the frame I was trying to set up was back in the day when multi-user meant something (VAX/PDP) - the load average WAS tied to core utilization - as you would queue a job, and it would go into the queue and there would be lots of stuff in the queue and the load average would bump, because there wasn't much core to go around. Not wanting to turn this into a pissing contest, I still have to say that you are fundamentally wrong about this. I'm sorry, but what you are saying simply is not correct. I've worked in-depth on just about every unixlike architecture there is since I started out in this business back in 1983, and on every single one (that employed it at all) the load average concept has worked similarly to how I described it in my previous mail. (Not always EXACTLY alike, but the general principle has always been the same.) The reason I'm so adamant about this is that the interpretation of the load average metric truly is one of the longest-standing misconceptions about the finer points of unix system administration there is, and if this discussion thread can set just one individual straight about it then it is worth the extra mail bandwidth. :-) 100% right. The load average calculation has not changed in 25 years. Anyone who says otherwise hasn't got a single fact on their side. The metric was invented in a time when processes were swapped in and out routinely. This is why load, nice and the scheduler all behave the way they do (which to our modern minds seems strange and arcane). What has changed, however, is that the kernel has more kernel threads running (for instance, run ps aguxk and look at the first few processes, which have the 'K' flag set in the 'STAT' field).
Some kernels have decided to not count those threads, others do count them. Since these kernel threads make various decisions for when to do their next tasks and how to context switch, the statistical monitoring of the system which ends up creating load values can get perturbed. That's what this comes down to. -- Ariane
Re: I don't get where the load comes from
On Wed, Jun 01, 2011 at 11:09:03PM +0200, ropers wrote: On 1 June 2011 17:49, Theo de Raadt dera...@cvs.openbsd.org wrote: What has changed, however, is that the kernel has more kernel threads running (for instance, run ps aguxk and look at the first few processes, which have the 'K' flag set in the 'STAT' field). In ps aguxk, what does the g do? I didn't find enlightenment on the man page or the Googles. These days it does nothing. It used to be a flag up to 4.3BSD, which showed more processes than normal, like processes not attached to your terminal and zombies, iirc. -Otto -- PS: Apologies btw. for not responding to and thanking list members who wrote helpful replies after my graphics tablet question. That one's kinda slipped through and I meant to get back to it, but...
Re: I don't get where the load comes from
On 06/01/2011 10:16 AM, Christiano F. Haesbaert wrote: I had tinkered with a solution for this: Cron wakes up a minute before the batch run is scheduled to run. Cron will then copy a random 4kb sector from the hard disk to RAM, then run either an MD5 or SHA hash against it. The whole process would be timed and if it completed within a reasonable amount of time for the system then it would kick off a batch job. This was the easiest way I thought of measuring the actual performance of the system at any given time since it measures the entire system and closely emulates actual work. While this isn't really the right thing to do, I found it to be the most effective on my systems. You really think cron should be doing its own calculation ? I don't like that *at all*. Can't we just have a higher default threshold for cron ? Can't we default to 0 ? I think this is something that should be looked at, if we admit load average is a shitty measure, we shouldn't rely on it for running cron jobs. I hereby vote for default to 0. (Thank god this isn't a democracy :-) ) Just have cron look at the system load average... ducking :)
Re: I don't get where the load comes from
On 1 June 2011 17:49, Theo de Raadt dera...@cvs.openbsd.org wrote: What has changed, however, is that the kernel has more kernel threads running (for instance, run ps aguxk and look at the first few processes, which have the 'K' flag set in the 'STAT' field). In ps aguxk, what does the g do? I didn't find enlightenment on the man page or the Googles. These days it does nothing. It used to be a flag up to 4.3BSD, which showed more processes than normal, like processes not attached to your terminal and zombies, iirc. It is such a bummer that we don't provide source code so that _educated_ people could read it. Oh wait, we provide source code. So access to source code isn't the problem? No it has never been the problem.
Re: I don't get where the load comes from
On Wed, Jun 1, 2011 at 11:49 AM, Theo de Raadt dera...@cvs.openbsd.org wrote: On 2011-06-01 15.53, Joel Wiramu Pauling wrote: On 2 June 2011 01:41, Benny Lofgren bl-li...@lofgren.biz mailto:bl-li...@lofgren.biz wrote: I agree with what you are saying, and I worded this quite badly, the frame I was trying to set up was back in the day when multi-user meant something (VAX/PDP) - the load average WAS tied to core utilization - as you would queue a job, and it would go into the queue and there would be lots of stuff in the queue and the load average would bump, because there wasn't much core to go around. Not wanting to turn this into a pissing contest, I still have to say that you are fundamentally wrong about this. I'm sorry, but what you are saying simply is not correct. I've worked in-depth on just about every unixlike architecture there is since I started out in this business back in 1983, and on every single one (that employed it at all) the load average concept has worked similarly to how I described it in my previous mail. (Not always EXACTLY alike, but the general principle has always been the same.) The reason I'm so adamant about this is that the interpretation of the load average metric truly is one of the longest-standing misconceptions about the finer points of unix system administration there is, and if this discussion thread can set just one individual straight about it then it is worth the extra mail bandwidth. :-) 100% right. The load average calculation has not changed in 25 years. Anyone who says otherwise hasn't got a single fact on their side. What has changed, however, is that the kernel has more kernel threads running (for instance, run ps aguxk and look at the first few processes, which have the 'K' flag set in the 'STAT' field). Some kernels have decided to not count those threads, others do count them.
Since these kernel threads make various decisions for when to do their next tasks and how to context switch, the statistical monitoring of the system which ends up creating load values can get perturbed. That's what this comes down to. Which.. sounds exactly like a change in the load average calculation, due to kernel changes, that has occurred in the last 25 years.
Re: I don't get where the load comes from
100% right. The load average calculation has not changed in 25 years. Anyone who says otherwise hasn't got a single fact on their side. What has changed, however, is that the kernel has more kernel threads running (for instance, run ps aguxk and look at the first few processes, which have the 'K' flag set in the 'STAT' field). Some kernels have decided to not count those threads, others do count them. Since these kernel threads make various decisions for when to do their next tasks and how to context switch, the statistical monitoring of the system which ends up creating load values can get perturbed. That's what this comes down to. Which.. sounds exactly like a change in the load average calculation, due to kernel changes, that has occurred in the last 25 years. You clearly cannot read. The calculation has NOT CHANGED. The way that work is done in the kernel has changed. You better get back to class; your potty break is over.
Re: I don't get where the load comes from
On Thu, Jun 2, 2011 at 12:48 AM, Theo de Raadt dera...@cvs.openbsd.org wrote: 100% right. The load average calculation has not changed in 25 years. Anyone who says otherwise hasn't got a single fact on their side. What has changed, however, is that the kernel has more kernel threads running (for instance, run ps aguxk and look at the first few processes, which have the 'K' flag set in the 'STAT' field). Some kernels have decided to not count those threads, others do count them. Since these kernel threads make various decisions for when to do their next tasks and how to context switch, the statistical monitoring of the system which ends up creating load values can get perturbed. That's what this comes down to. Which.. sounds exactly like a change in the load average calculation, due to kernel changes, that has occurred in the last 25 years. You clearly cannot read. The calculation has NOT CHANGED. The way that work is done in the kernel has changed. You better get back to class; your potty break is over. Then perhaps learn to write. If you're measuring a different phenomenon, one that has different units, then it's a distinctly different *calculation* because you're measuring a distinct collection of objects. One may as well add up a restaurant bill, leave out the tax and tip, and say it's unchanged because I used the same plus signs. It's particularly confusing, as the original poster was confused, when trying to compare prices, in this case system loads.
Re: I don't get where the load comes from
On 31 May 2011 at 00:15, Paul de Weerd wrote: On Mon, May 30, 2011 at 11:44:29PM +0200, Joel Carnat wrote: | Hi, | | I am running a personal Mail+Web system on a Core2Duo 2GHz using Speedstep. | It is mostly doing nothing but still has a high load average. Wait, what ? ~1 is 'a high load average' now ? What are that database and webserver doing on your machine 'doing nothing' ? What other processes do you have running ? Note that you don't have to use lots of CPU to get a (really) high load... well, compared to my previous box, running NetBSD/xen, the same services and showing about 0.3-0.6 of load ; I thought a load of 1.21 was quite much. Do you see a lot of interrupts perhaps ? Try `systat -s1 vm` or `vmstat -i`. # vmstat -i interrupt total rate irq0/clock 9709553 199 irq0/ipi 1291416 26 irq144/acpi0 1 0 irq145/inteldrm0 9 0 irq96/uhci0 117 0 irq98/ehci0 2 0 irq97/azalia0 1 0 irq101/wpi0 1 0 irq101/bge0 366615 7 irq96/ehci1 20 0 irq101/ahci0 332349 6 irq147/pckbc0 6 0 irq148/pckbc0 38 0 Total 11700128 240 Paul 'WEiRD' de Weerd | I've checked various stat tools but didn't find the reason for the load. | | Anyone have ideas? | | TIA, | Jo | | PS: here are some of the results I checked.
| | # uname -a | OpenBSD bagheera.tumfatig.net 4.9 GENERIC.MP#819 amd64 | | # sysctl hw | hw.machine=amd64 | hw.model=Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz | hw.ncpu=2 | hw.byteorder=1234 | hw.pagesize=4096 | hw.disknames=cd0:,sd0:01d3664288919ae7 | hw.diskcount=2 | hw.sensors.cpu0.temp0=45.00 degC | hw.sensors.cpu1.temp0=45.00 degC | hw.sensors.acpitz0.temp0=45.50 degC (zone temperature) | hw.sensors.acpiac0.indicator0=On (power supply) | hw.sensors.acpibat0.volt0=11.10 VDC (voltage) | hw.sensors.acpibat0.volt1=12.71 VDC (current voltage) | hw.sensors.acpibat0.amphour0=4.61 Ah (last full capacity) | hw.sensors.acpibat0.amphour1=0.52 Ah (warning capacity) | hw.sensors.acpibat0.amphour2=0.16 Ah (low capacity) | hw.sensors.acpibat0.amphour3=5.20 Ah (remaining capacity), OK | hw.sensors.acpibat0.raw0=0 (battery full), OK | hw.sensors.acpibat0.raw1=1 (rate) | hw.cpuspeed=800 | hw.setperf=0 | hw.vendor=Dell Inc. | hw.product=XPS M1330 | hw.serialno=CK0W33J | hw.uuid=44454c4c-4b00-1030-8057-c3c04f4a | hw.physmem=3747008512 | hw.usermem=3734933504 | hw.ncpufound=2 | | # top -n -o cpu -T | load averages: 1.19, 1.14, 0.99bagheera.tumfatig.net 23:39:09 | 78 processes: 77 idle, 1 on processor | CPU0 states: 1.8% user, 0.0% nice, 0.7% system, 0.1% interrupt, 97.4% | idle | CPU1 states: 2.4% user, 0.0% nice, 0.8% system, 0.0% interrupt, 96.8% | idle | Memory: Real: 238M/656M act/tot Free: 2809M Swap: 0K/8197M used/tot | | PID USERNAME PRI NICE SIZE RES STATE WAIT TIMECPU COMMAND | 3230 root 20 2156K 3152K sleep/1 netio 0:00 0.20% sshd | 1867 sshd 20 2148K 2368K sleep/0 select0:00 0.05% sshd | 19650 www 140 5640K 30M sleep/0 semwait 0:59 0.00% httpd | 4225 www 140 5984K 42M sleep/1 semwait 0:58 0.00% httpd | 3624 www 140 5644K 30M sleep/1 semwait 0:53 0.00% httpd | 24875 www 140 5740K 32M sleep/1 semwait 0:52 0.00% httpd | 22848 www 140 5724K 30M sleep/1 semwait 0:50 0.00% httpd | 13508 www 140 5832K 31M sleep/1 semwait 0:48 0.00% httpd | 24210 www 140 5652K 30M sleep/1 
semwait 0:48 0.00% httpd | 510 www 140 5660K 30M sleep/1 semwait 0:46 0.00% httpd | 20258 www20 5536K 32M sleep/0 select0:46 0.00% httpd | 6543 www 140 5772K 32M sleep/0 semwait 0:43 0.00% httpd | 9783 _mysql 20 55M 30M sleep/1 poll 0:20 0.00% mysqld | 19071 root 20 640K 1416K sleep/1 select0:09 0.00% sshd | 10389 root 20 3376K 2824K sleep/0 poll 0:07 0.00% monit | 21695 _sogo 20 7288K 18M sleep/1 poll 0:05 0.00% sogod | 1888 named 20 20M 21M sleep/1 select0:05 0.00% named | 18781 _sogo 20 15M 29M sleep/1 poll 0:04 0.00% sogod | | # iostat -c 10 -w 1 | ttycd0 sd0 cpu | tin tout KB/t t/s MB/s KB/t t/s MB/s us ni sy in id |07 0.00 0 0.00 20.64 7 0.14 2 0 1 0 97 |0 174 0.00 0 0.00 0.00 0 0.00 0 0 0 0100 |0 57 0.00 0 0.00 0.00 0 0.00 1 0 2 0 97 |0 57 0.00 0 0.00 32.00 17 0.53
Re: I don't get where the load comes from
Joel Carnat wrote well, compared to my previous box, running NetBSD/xen, the same services and showing about 0.3-0.6 of load ; I thought a load of 1.21 was quite much. Different systems will agree on the spelling of the word load. That is about as much agreement as you can expect. Does the 0.3-0.6 really mean 30-60 percent loaded? 1.21 tasks seems kinda low for a multi-tasking system.
Re: I don't get where the load comes from
On 31 May 2011 at 02:19, Gonzalo L. R. wrote: Take a look at this http://undeadly.org/cgi?action=article&sid=20090715034920 I found this article before posting. But one thing that didn't convince me is that, if I shut down apmd and configure hw.setperf=100, the load drops down to 0.30-0.20. I don't get how A high load is just that: high. It means you have a lot of processes that sometimes run. can show load variation depending on CPU speed only. On 05/30/11 18:44, Joel Carnat wrote: Hi, I am running a personal Mail+Web system on a Core2Duo 2GHz using Speedstep. It is mostly doing nothing but still has a high load average. I've checked various stat tools but didn't find the reason for the load. Anyone have ideas? TIA, Jo PS: here are some of the results I checked. # uname -a OpenBSD bagheera.tumfatig.net 4.9 GENERIC.MP#819 amd64 # sysctl hw hw.machine=amd64 hw.model=Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz hw.ncpu=2 hw.byteorder=1234 hw.pagesize=4096 hw.disknames=cd0:,sd0:01d3664288919ae7 hw.diskcount=2 hw.sensors.cpu0.temp0=45.00 degC hw.sensors.cpu1.temp0=45.00 degC hw.sensors.acpitz0.temp0=45.50 degC (zone temperature) hw.sensors.acpiac0.indicator0=On (power supply) hw.sensors.acpibat0.volt0=11.10 VDC (voltage) hw.sensors.acpibat0.volt1=12.71 VDC (current voltage) hw.sensors.acpibat0.amphour0=4.61 Ah (last full capacity) hw.sensors.acpibat0.amphour1=0.52 Ah (warning capacity) hw.sensors.acpibat0.amphour2=0.16 Ah (low capacity) hw.sensors.acpibat0.amphour3=5.20 Ah (remaining capacity), OK hw.sensors.acpibat0.raw0=0 (battery full), OK hw.sensors.acpibat0.raw1=1 (rate) hw.cpuspeed=800 hw.setperf=0 hw.vendor=Dell Inc.
hw.product=XPS M1330 hw.serialno=CK0W33J hw.uuid=44454c4c-4b00-1030-8057-c3c04f4a hw.physmem=3747008512 hw.usermem=3734933504 hw.ncpufound=2 # top -n -o cpu -T load averages: 1.19, 1.14, 0.99bagheera.tumfatig.net 23:39:09 78 processes: 77 idle, 1 on processor CPU0 states: 1.8% user, 0.0% nice, 0.7% system, 0.1% interrupt, 97.4% idle CPU1 states: 2.4% user, 0.0% nice, 0.8% system, 0.0% interrupt, 96.8% idle Memory: Real: 238M/656M act/tot Free: 2809M Swap: 0K/8197M used/tot PID USERNAME PRI NICE SIZE RES STATE WAIT TIMECPU COMMAND 3230 root 20 2156K 3152K sleep/1 netio 0:00 0.20% sshd 1867 sshd 20 2148K 2368K sleep/0 select0:00 0.05% sshd 19650 www 140 5640K 30M sleep/0 semwait 0:59 0.00% httpd 4225 www 140 5984K 42M sleep/1 semwait 0:58 0.00% httpd 3624 www 140 5644K 30M sleep/1 semwait 0:53 0.00% httpd 24875 www 140 5740K 32M sleep/1 semwait 0:52 0.00% httpd 22848 www 140 5724K 30M sleep/1 semwait 0:50 0.00% httpd 13508 www 140 5832K 31M sleep/1 semwait 0:48 0.00% httpd 24210 www 140 5652K 30M sleep/1 semwait 0:48 0.00% httpd 510 www 140 5660K 30M sleep/1 semwait 0:46 0.00% httpd 20258 www20 5536K 32M sleep/0 select0:46 0.00% httpd 6543 www 140 5772K 32M sleep/0 semwait 0:43 0.00% httpd 9783 _mysql 20 55M 30M sleep/1 poll 0:20 0.00% mysqld 19071 root 20 640K 1416K sleep/1 select0:09 0.00% sshd 10389 root 20 3376K 2824K sleep/0 poll 0:07 0.00% monit 21695 _sogo 20 7288K 18M sleep/1 poll 0:05 0.00% sogod 1888 named 20 20M 21M sleep/1 select0:05 0.00% named 18781 _sogo 20 15M 29M sleep/1 poll 0:04 0.00% sogod # iostat -c 10 -w 1 ttycd0 sd0 cpu tin tout KB/t t/s MB/s KB/t t/s MB/s us ni sy in id 07 0.00 0 0.00 20.64 7 0.14 2 0 1 0 97 0 174 0.00 0 0.00 0.00 0 0.00 0 0 0 0100 0 57 0.00 0 0.00 0.00 0 0.00 1 0 2 0 97 0 57 0.00 0 0.00 32.00 17 0.53 1 0 1 0 98 0 58 0.00 0 0.00 0.00 0 0.00 7 0 7 0 86 0 57 0.00 0 0.00 0.00 0 0.00 1 0 1 0 98 0 57 0.00 0 0.00 0.00 0 0.00 1 0 1 0 98 0 57 0.00 0 0.00 0.00 0 0.00 2 0 0 0 98 0 57 0.00 0 0.00 4.00 1 0.00 0 0 1 0 99 0 58 0.00 0 0.00 
0.00 0 0.00 1 0 0 1 98 # vmstat -c 10 -w 1 procsmemory pagediskstraps cpu r b wavm fre flt re pi po fr sr cd0 sd0 int sys cs us sy id 1 1 0 243420 2866736 655 0 0 0 0 0 0 1 15 1828 77 2 1 97 0 1 0 243636 2866336 234 0 0 0 0 0 0 0 10 540 47 0 1 99 0 1 0 243668 2866304 95 0 0 0 0 0 0 0 17 329 44 1 0 99 0 1 0 242848 2867552 644 0 0 0 0 0 0 08 1445 115 1 1 98 0 1 0 243612 2866352 1076 0 0 0 0 0 0 09 2436 44 0 2 98 0 1 0 243668 2866288 117 0 0 0 0 0 0
Re: I don't get where the load comes from
Joel Carnat wrote: But one thing that didn't convinced me is that, if I shutdown apmd and configure hw.setperf=100, the load drops down to 0.30-0.20. I don't get how A high load is just that: high. It means you have a lot of processes that sometimes run. can show load variation depending on CPU speed only. Actually that should convince you that the numbers do not mean much. You are measuring the difference between just barely being counted and just barely not being counted.
Re: I don't get where the load comes from
On 31 May 2011 at 08:10, Tony Abernethy wrote: Joel Carnat wrote: well, compared to my previous box, running NetBSD/xen, the same services and showing about 0.3-0.6 of load ; I thought a load of 1.21 was quite much. Different systems will agree on the spelling of the word load. That is about as much agreement as you can expect. Does the 0.3-0.6 really mean 30-60 percent loaded? As far as I understood the counters on my previous nbsd box, 0.3 meant that the cpu was used at 30% of its total capacity. Then, looking at the sys/user counters, I'd see what kind of things the system was doing. 1.21 tasks seems kinda low for a multi-tasking system. ok :)
Re: I don't get where the load comes from
Hi all, load is not really a CPU usage %. In fact it is a sum of many % (CPU real load, memory, buffers, etc...) which explains why load can go over 5.0 for each CPU without any crash or freeze of the host. We should consider load as a host resources %... this is not real of course but it is more realistic than considering it as only CPU use. For example, in fact, all my machines run permanently at about 1.1 or 1.2 and sometimes for a short time (a few minutes) go up to 2.5 to 3.0 of load, so I don't worry; below 5.0, we should not worry about that. regards From: Joel Carnat j...@carnat.net Sent: Tue May 31 09:10:59 CEST 2011 To: Tony Abernethy t...@servasoftware.com Subject: Re: I don't get where the load comes from On 31 May 2011 at 08:10, Tony Abernethy wrote: Joel Carnat wrote: well, compared to my previous box, running NetBSD/xen, the same services and showing about 0.3-0.6 of load ; I thought a load of 1.21 was quite much. Different systems will agree on the spelling of the word load. That is about as much agreement as you can expect. Does the 0.3-0.6 really mean 30-60 percent loaded? As far as I understood the counters on my previous nbsd box, 0.3 meant that the cpu was used at 30% of its total capacity. Then, looking at the sys/user counters, I'd see what kind of things the system was doing. 1.21 tasks seems kinda low for a multi-tasking system. ok :) Regards Francois Pussault 3701 - 8 rue Marcel Pagnol 31100 Toulouse France +33 6 17 230 820 +33 5 34 365 269 fpussa...@contactoffice.fr
Re: I don't get where the load comes from
On Tue, May 31, 2011 at 2:24 AM, Francois Pussault fpussa...@contactoffice.fr wrote: load is not really a CPU usage %. In fact it is a sum of many % (CPU real load, memory, buffers, etc...) which explains why load can go over 5.0 for each CPU without any crash or freeze of the host. We should consider load as a host resources %... this is not real of course but it is more realistic than considering it as only CPU use. The load average numbers give the number of jobs in the run queue averaged over 1, 5, and 15 minutes, from top(1).
Re: I don't get where the load comes from
On May 31, 2011, at 12:33 AM, Abel Abraham Camarillo Ojeda wrote: On Tue, May 31, 2011 at 2:24 AM, Francois Pussault fpussa...@contactoffice.fr wrote: load is not really a CPU usage %. In fact it is a sum of many % (CPU real load, memory, buffers, etc...) which explains why load can go over 5.0 for each CPU without any crash or freeze of the host. We should consider load as a host resources %... this is not real of course but it is more realistic than considering it as only CPU use. The load average numbers give the number of jobs in the run queue averaged over 1, 5, and 15 minutes, from top(1). As was mentioned earlier, no two systems agree on what load average is. Making statements about it for a particular system should be based on the code for that system. Some systems count processes runnable if only the NFS back-end storage were available to page in the file. Other systems say it's in a wait state. The former can easily lead to load averages in the 100s (or more) with a CPU idling at 99% (because everything's waiting on NFS). Some systems don't even agree on what it means to average. Load averages generally suck as a metric for system busyness. Look at interrupts and CPU time -- they're what matter. If you want to break out CPU beyond system, user and idle, you can do that, too. Sean
Re: I don't get where the load comes from
That is why I mentioned it is not real, but as a user-land approximation it can be understood. From: Sean Kamath kam...@geekoids.com Sent: Tue May 31 11:07:46 CEST 2011 To: Misc OpenBSD misc@openbsd.org Subject: Re: I don't get where the load comes from On May 31, 2011, at 12:33 AM, Abel Abraham Camarillo Ojeda wrote: On Tue, May 31, 2011 at 2:24 AM, Francois Pussault fpussa...@contactoffice.fr wrote: load is not really a CPU usage %. In fact it is a sum of many % (CPU real load, memory, buffers, etc...) which explains why load can go over 5.0 for each CPU without any crash or freeze of the host. We should consider load as a host resources %... this is not real of course but it is more realistic than considering it as only CPU use. The load average numbers give the number of jobs in the run queue averaged over 1, 5, and 15 minutes, from top(1). As was mentioned earlier, no two systems agree on what load average is. Making statements about it for a particular system should be based on the code for that system. Some systems count processes runnable if only the NFS back-end storage were available to page in the file. Other systems say it's in a wait state. The former can easily lead to load averages in the 100s (or more) with a CPU idling at 99% (because everything's waiting on NFS). Some systems don't even agree on what it means to average. Load averages generally suck as a metric for system busyness. Look at interrupts and CPU time -- they're what matter. If you want to break out CPU beyond system, user and idle, you can do that, too. Sean Regards Francois Pussault 3701 - 8 rue Marcel Pagnol 31100 Toulouse France +33 6 17 230 820 +33 5 34 365 269 fpussa...@contactoffice.fr
Re: I don't get where the load comes from
On Tue, May 31, 2011 at 9:24 AM, Francois Pussault fpussa...@contactoffice.fr wrote: Hi all, load is not really a CPU usage %. In fact it is a sum of many % (CPU real load, memory, buffers, etc...) No, it isn't. we should consider load as a host resources %... No, we shouldn't. The load average is a decaying average of the number of processes in the runnable state or currently running on a cpu or in the process of being forked or that have spent less than a second in a sleep state with sleep priority lower than PZERO, which includes waiting for memory resources, disk I/O, filesystem locks and a bunch of other things. You could say it's a very vague estimate of how much work the cpu might need to be doing soon, maybe. Or it could be completely wrong because of sampling bias. It's not very important so it's not really critical for the system to do a good job guessing this number, so the system doesn't really try too hard. This number may tell you something useful, or it might be totally misleading. Or both. //art
Re: I don't get where the load comes from
On Mon, May 30, 2011 at 11:44:29PM +0200, Joel Carnat wrote: | Hi, | | I am running a personal Mail+Web system on a Core2Duo 2GHz using Speedstep. | It is mostly doing nothing but still has a high load average. Wait, what ? ~1 is 'a high load average' now ? What are that database and webserver doing on your machine 'doing nothing' ? What other processes do you have running ? Note that you don't have to use lots of CPU to get a (really) high load... Do you see a lot of interrupts perhaps ? Try `systat -s1 vm` or `vmstat -i`. Paul 'WEiRD' de Weerd | I've checked various stat tools but didn't find the reason for the load. | | Anyone have ideas? | | TIA, | Jo | | PS: here are some of the results I checked. | | # uname -a | OpenBSD bagheera.tumfatig.net 4.9 GENERIC.MP#819 amd64 | | # sysctl hw | hw.machine=amd64 | hw.model=Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz | hw.ncpu=2 | hw.byteorder=1234 | hw.pagesize=4096 | hw.disknames=cd0:,sd0:01d3664288919ae7 | hw.diskcount=2 | hw.sensors.cpu0.temp0=45.00 degC | hw.sensors.cpu1.temp0=45.00 degC | hw.sensors.acpitz0.temp0=45.50 degC (zone temperature) | hw.sensors.acpiac0.indicator0=On (power supply) | hw.sensors.acpibat0.volt0=11.10 VDC (voltage) | hw.sensors.acpibat0.volt1=12.71 VDC (current voltage) | hw.sensors.acpibat0.amphour0=4.61 Ah (last full capacity) | hw.sensors.acpibat0.amphour1=0.52 Ah (warning capacity) | hw.sensors.acpibat0.amphour2=0.16 Ah (low capacity) | hw.sensors.acpibat0.amphour3=5.20 Ah (remaining capacity), OK | hw.sensors.acpibat0.raw0=0 (battery full), OK | hw.sensors.acpibat0.raw1=1 (rate) | hw.cpuspeed=800 | hw.setperf=0 | hw.vendor=Dell Inc.
Re: I don't get where the load comes from
Take a look at this: http://undeadly.org/cgi?action=article&sid=20090715034920

On 05/30/11 18:44, Joel Carnat wrote:

Hi,

I am running a personal Mail+Web system on a Core2Duo 2GHz using Speedstep. It is mostly doing nothing but still has a high load average. I've checked various stat tools but didn't find the reason for the load. Does anyone have ideas?

TIA,
Jo

PS: here are some of the results I checked.

# uname -a
OpenBSD bagheera.tumfatig.net 4.9 GENERIC.MP#819 amd64

# sysctl hw
hw.machine=amd64
hw.model=Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz
hw.ncpu=2
hw.byteorder=1234
hw.pagesize=4096
hw.disknames=cd0:,sd0:01d3664288919ae7
hw.diskcount=2
hw.sensors.cpu0.temp0=45.00 degC
hw.sensors.cpu1.temp0=45.00 degC
hw.sensors.acpitz0.temp0=45.50 degC (zone temperature)
hw.sensors.acpiac0.indicator0=On (power supply)
hw.sensors.acpibat0.volt0=11.10 VDC (voltage)
hw.sensors.acpibat0.volt1=12.71 VDC (current voltage)
hw.sensors.acpibat0.amphour0=4.61 Ah (last full capacity)
hw.sensors.acpibat0.amphour1=0.52 Ah (warning capacity)
hw.sensors.acpibat0.amphour2=0.16 Ah (low capacity)
hw.sensors.acpibat0.amphour3=5.20 Ah (remaining capacity), OK
hw.sensors.acpibat0.raw0=0 (battery full), OK
hw.sensors.acpibat0.raw1=1 (rate)
hw.cpuspeed=800
hw.setperf=0
hw.vendor=Dell Inc.
hw.product=XPS M1330
hw.serialno=CK0W33J
hw.uuid=44454c4c-4b00-1030-8057-c3c04f4a
hw.physmem=3747008512
hw.usermem=3734933504
hw.ncpufound=2

# top -n -o cpu -T
load averages: 1.19, 1.14, 0.99    bagheera.tumfatig.net 23:39:09
78 processes: 77 idle, 1 on processor
CPU0 states:  1.8% user,  0.0% nice,  0.7% system,  0.1% interrupt, 97.4% idle
CPU1 states:  2.4% user,  0.0% nice,  0.8% system,  0.0% interrupt, 96.8% idle
Memory: Real: 238M/656M act/tot Free: 2809M Swap: 0K/8197M used/tot

  PID USERNAME PRI NICE  SIZE   RES STATE    WAIT     TIME    CPU COMMAND
 3230 root       2    0 2156K 3152K sleep/1  netio    0:00  0.20% sshd
 1867 sshd       2    0 2148K 2368K sleep/0  select   0:00  0.05% sshd
19650 www       14    0 5640K   30M sleep/0  semwait  0:59  0.00% httpd
 4225 www       14    0 5984K   42M sleep/1  semwait  0:58  0.00% httpd
 3624 www       14    0 5644K   30M sleep/1  semwait  0:53  0.00% httpd
24875 www       14    0 5740K   32M sleep/1  semwait  0:52  0.00% httpd
22848 www       14    0 5724K   30M sleep/1  semwait  0:50  0.00% httpd
13508 www       14    0 5832K   31M sleep/1  semwait  0:48  0.00% httpd
24210 www       14    0 5652K   30M sleep/1  semwait  0:48  0.00% httpd
  510 www       14    0 5660K   30M sleep/1  semwait  0:46  0.00% httpd
20258 www        2    0 5536K   32M sleep/0  select   0:46  0.00% httpd
 6543 www       14    0 5772K   32M sleep/0  semwait  0:43  0.00% httpd
 9783 _mysql     2    0   55M   30M sleep/1  poll     0:20  0.00% mysqld
19071 root       2    0  640K 1416K sleep/1  select   0:09  0.00% sshd
10389 root       2    0 3376K 2824K sleep/0  poll     0:07  0.00% monit
21695 _sogo      2    0 7288K   18M sleep/1  poll     0:05  0.00% sogod
 1888 named      2    0   20M   21M sleep/1  select   0:05  0.00% named
18781 _sogo      2    0   15M   29M sleep/1  poll     0:04  0.00% sogod

# iostat -c 10 -w 1
      tty            cd0             sd0             cpu
 tin tout  KB/t t/s MB/s  KB/t t/s MB/s  us ni sy in id
   0    7  0.00   0 0.00 20.64   7 0.14   2  0  1  0 97
   0  174  0.00   0 0.00  0.00   0 0.00   0  0  0  0 100
   0   57  0.00   0 0.00  0.00   0 0.00   1  0  2  0 97
   0   57  0.00   0 0.00 32.00  17 0.53   1  0  1  0 98
   0   58  0.00   0 0.00  0.00   0 0.00   7  0  7  0 86
   0   57  0.00   0 0.00  0.00   0 0.00   1  0  1  0 98
   0   57  0.00   0 0.00  0.00   0 0.00   1  0  1  0 98
   0   57  0.00   0 0.00  0.00   0 0.00   2  0  0  0 98
   0   57  0.00   0 0.00  4.00   1 0.00   0  0  1  0 99
   0   58  0.00   0 0.00  0.00   0 0.00   1  0  0  1 98

# vmstat -c 10 -w 1
 procs    memory       page                    disks    traps          cpu
 r b w    avm     fre  flt  re  pi  po  fr  sr cd0 sd0  int   sys   cs us sy id
 1 1 0 243420 2866736  655   0   0   0   0   0   0   1   15  1828   77  2  1 97
 0 1 0 243636 2866336  234   0   0   0   0   0   0   0   10   540   47  0  1 99
 0 1 0 243668 2866304   95   0   0   0   0   0   0   0   17   329   44  1  0 99
 0 1 0 242848 2867552  644   0   0   0   0   0   0   0    8  1445  115  1  1 98
 0 1 0 243612 2866352 1076   0   0   0   0   0   0   0    9  2436   44  0  2 98
 0 1 0 243668 2866288  117   0   0   0   0   0   0   0    7   369   46  1  1 98
 0 1 0 243836 2866112  337   0   0   0   0   0   0   0    7   818   86  0  1 99
 0 1 0 243428 2866728 1216   0   0   0   0   0   0   0   11  2920   69  1  2 97
 0 1 0 243640 2866332  212   0   0   0   0   0   0   0    6   313   38  1  0 99
 0 1 0 243684 2866284   96   0   0   0   0   0   0   0    8   334   48  1  0 99

--
Sending from my Computer.