Linux CPU timer (was Re: question on top)
Martin and myself took our argument off-list over a few virtual beers. I still owe you the outcome of that.

* CPU time accounting in Linux is *different* with later kernels (from SLES10 / RHEL5 on)

The older kernels used TOD clock (wall clock based) where the newer kernels use the CPU timer (virtual CPU time based). These measure entirely different things and it is not very relevant whether one of them is more accurate or more precise than the other. Whether one metric is *better* than the other depends on what you want to measure. Either approach hides some aspects of virtualization and reveals others. Martin and myself seem to want different metrics. Without knowing what you want to measure, you can't tell which one is better.

* Understanding of performance problems requires both Linux and z/VM data

Full understanding of any performance problems with Linux on z/VM will normally require that you combine z/VM metrics with Linux statistics. For z/VM metrics you need a performance monitor. For Linux the kernel provides the metrics for various measurement tools.

* Using *top* on SLES10 / RHEL5

Since "top" uses the metrics provided by the kernel, it will also show different numbers based on the new CPU timer accounting when used on later kernels. Depending on the configuration, it may point you in the right direction when a Linux server is using excessive resources. But that still does not make "top" a monitoring tool, for example because its heavy resource consumption actually disturbs the system you try to measure.

* Two kernel problems caused incorrect CPU usage accounting in Linux

Martin says the bad numbers we noticed in our test case would be corrected by fixes that are in the pipeline. The impact of an ill-behaving application (like in our case) would also be mitigated by these fixes. I have to take his word for it, since I have not been able to verify it. The fixes for these problems will over time show up in your favorite Linux distribution.
Rob -- Rob van der Heij Velocity Software http://www.velocitysoftware.com/ -- For LINUX-390 subscribe / signoff / archive access instructions, send email to [EMAIL PROTECTED] with the message: INFO LINUX-390 or visit http://www.marist.edu/htbin/wlvindex?LINUX-390
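To make the difference between the two accounting bases concrete, here is a minimal sketch in C. It is not kernel code: the struct, the function names and the sample intervals are all hypothetical, and it simply assumes that a wall-clock based scheme charges a task for the whole elapsed interval, while a CPU-timer based scheme charges only the part during which the virtual CPU was actually running.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One dispatch interval of a task, in microseconds (made-up illustration data). */
struct interval {
    uint64_t wall_us;  /* elapsed wall-clock (TOD) time */
    uint64_t vcpu_us;  /* part of it during which the virtual CPU really ran */
};

/* Wall-clock style: the whole elapsed interval is charged to the task. */
uint64_t charge_by_tod(const struct interval *iv, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += iv[i].wall_us;
    return total;
}

/* CPU-timer style: only the time the virtual CPU was running is charged. */
uint64_t charge_by_cpu_timer(const struct interval *iv, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        /* a virtual CPU cannot run longer than the elapsed time */
        assert(iv[i].vcpu_us <= iv[i].wall_us);
        total += iv[i].vcpu_us;
    }
    return total;
}
```

With two intervals of 1000 and 2000 microseconds, of which the virtual CPU ran 600 and 2000, the first function charges 3000 and the second 2600. Neither number is wrong; they answer different questions, which is the point of the first bullet above.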
Re: question on top
So, Martin, I learned a long time ago, that if the doc says 2+2 is 5, that don't make it right. Here is real data, we do understand it, and we do understand how to account for the "error", which is why we don't push for a "fix". So using native Linux tools, this data would be off by factor of 7 if trying to account for CPU, and for what Linux should account for, it is off by factor of 4. Don't bet Rob any beverages (or include me on the bet please), he only made it this bad to demonstrate he understood the problem, after a real production issue showed up at an installation that cares about accurate data and accounting.

"Linux claims to be idle 86% of time. From VM data I know that we run 100% TTIME and 50% VTIME."

Linux is using a complete IFL, 50% of it virtual, but only thinks it's using 14% of it... This should be enough of a clue for you

Martin Schwidefsky wrote:
> On Mon, 2008-12-08 at 17:41 +0100, Rob van der Heij wrote:
>>>> a production server). Not sure we've bothered to report the details since this problem would not impact our users. So the data still can not be used for serious performance
>>>
>>> The last time we talked, your tool used the Linux data as one input value of your calculations. So if the Linux data is really wrong, any fix would improve the accuracy of your tool, no?
>>
>> I don't think the measurements based on CPU timer are more accurate than those based on TOD.
>
> Sorry Rob but this is nonsense.
>
>> For one thing because the CPU timer is less accurate than the TOD clock.
>
> Principles of Operation chapter 4 about the CPU timer: "The CPU timer is a binary counter with a format which is the same as that of bits 0-63 of the TOD clock, except that bit 0 is considered a sign. The CPU timer nominally is decremented by subtracting a one in bit position 51 every microsecond."
> I would call this as accurate as the TOD clock. The stepping rates are not 100% the same if the TOD-clock-steering facility is installed but the difference is very very small. By the way z/VM is using the same mechanism to do its own cputime accounting.
>
>> It's accurate enough when you measure a single virtual machine. But when the kernel is reloading the CPU timer again and again for each process or thread using a small amount of CPU, the error adds up very quick.
>
> This statement is wrong. The CPU timer is reprogrammed when a CPU goes idle, after it wakes up from idle, when a new earliest CPU timer event is added and when a CPU timer event expires. Usually there are no CPU timer events so we only reprogram the CPU timer going in and out of idle. In particular the kernel does not reprogram the CPU timer for each process.
>
> The overall error is minuscule, the following function programs the CPU timer:
>
> static inline void set_vtimer(__u64 expires)
> {
>         __u64 timer;
>
>         asm volatile ("  STPT %0\n"  /* Store current cpu timer value */
>                       "  SPT %1"     /* Set new value immediately afterwards */
>                       : "=m" (timer) : "m" (expires) );
>         S390_lowcore.system_timer += S390_lowcore.last_update_timer - timer;
>         S390_lowcore.last_update_timer = expires;
>
>         /* store expire time for this CPU timer */
>         __get_cpu_var(virt_cpu_timer).to_expire = expires;
> }
>
> The instruction to store the current value and the instruction to set the new value are next to each other. You cannot do better. There is one problem we recently identified and that is the cputime spent by the idle process doing actual system work is accounted as idle time instead of system time. I have a patch for this problem, it will go upstream with the next merge window. The maximum difference I was able to create with my testcases has been 0.35%.
>
>> And because the CPU timer measures only in-SIE time, you miss the resources that CP and SIE spent on behalf of the virtual machine. Even when you don't measure it, someone still has to pay for it ;-)
>
> This is called CP overhead and there are two cases. If CP wants to account CPU time to the guest because it has done work on behalf of the guest, it can simply add the time to the guest CPU timer in the SIE control block before the guest cpu is restarted. The cputime spent by CP for things not directly related to a guest should NOT be accounted to the guest. This part of the CP overhead has to be accounted by z/VM.
>
>> When I was diagnosing the customer problem, I did notice one bug in the kernel that probably could be fixed. But I did not have time yet to try that and see how big the difference would be. And in general, the additional code for dealing with the CPU timer makes the unmeasured part of time longer, so in general reduces the capture ratio.
>
> How is the unmeasured part of the time longer? There is some overhead for doing the improved Linux cputime accounting but the additional instructions are fully accounted as cputime in Linux.
>
> --
> blue skies,
>    Martin.
>
> "Reality continues to ruin my life." - Calvin.
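The factors of 7 and 4 in the reply above follow from the three percentages quoted there (86% idle in Linux, 100% TTIME and 50% VTIME in z/VM). A small sketch of that arithmetic in C, with hypothetical function names chosen only for this illustration:

```c
#include <assert.h>

/* Busy percentage as Linux sees it, derived from its reported idle percentage. */
double linux_busy_percent(double idle_percent)
{
    assert(idle_percent >= 0.0 && idle_percent < 100.0);
    return 100.0 - idle_percent;
}

/* How many times larger the z/VM figure is than the Linux figure. */
double under_report_factor(double vm_percent, double linux_percent)
{
    assert(linux_percent > 0.0);
    return vm_percent / linux_percent;
}
```

With 86% idle, Linux sees 14% busy. Against 100% TTIME that is a factor of about 7.1; against 50% VTIME it is about 3.6, which the message rounds to 4.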
Re: question on top
On Mon, Dec 8, 2008 at 5:49 PM, Martin Schwidefsky <[EMAIL PROTECTED]> wrote:

> 2) The factor of 4-5 is based on what numbers exactly? I doubt that you
> get that discrepancy if you are running a cpu bound linux process that
> uses more than a few percent of cpu. As already pointed out the
> situation that started this thread is very likely a multi threaded
> program and top aggregates the cputime.

I think I should wait until you start to put Adult Bavarian Beverages on that assumption. :-)

-Rob
Re: question on top
On Mon, 2008-12-08 at 17:41 +0100, Rob van der Heij wrote:
> >> a production server). Not sure we've bothered to report the details since this problem would not impact our users. So the data still can not be used for serious performance
> >
> > The last time we talked, your tool used the Linux data as one input value of your calculations. So if the Linux data is really wrong, any fix would improve the accuracy of your tool, no?
>
> I don't think the measurements based on CPU timer are more accurate than those based on TOD.

Sorry Rob but this is nonsense.

> For one thing because the CPU timer is less accurate than the TOD clock.

Principles of Operation chapter 4 about the CPU timer: "The CPU timer is a binary counter with a format which is the same as that of bits 0-63 of the TOD clock, except that bit 0 is considered a sign. The CPU timer nominally is decremented by subtracting a one in bit position 51 every microsecond."
I would call this as accurate as the TOD clock. The stepping rates are not 100% the same if the TOD-clock-steering facility is installed but the difference is very very small. By the way z/VM is using the same mechanism to do its own cputime accounting.

> It's accurate enough when you measure a single virtual machine. But when the kernel is reloading the CPU timer again and again for each process or thread using a small amount of CPU, the error adds up very quick.

This statement is wrong. The CPU timer is reprogrammed when a CPU goes idle, after it wakes up from idle, when a new earliest CPU timer event is added and when a CPU timer event expires. Usually there are no CPU timer events so we only reprogram the CPU timer going in and out of idle. In particular the kernel does not reprogram the CPU timer for each process.

The overall error is minuscule, the following function programs the CPU timer:

static inline void set_vtimer(__u64 expires)
{
        __u64 timer;

        asm volatile ("  STPT %0\n"  /* Store current cpu timer value */
                      "  SPT %1"     /* Set new value immediately afterwards */
                      : "=m" (timer) : "m" (expires) );
        S390_lowcore.system_timer += S390_lowcore.last_update_timer - timer;
        S390_lowcore.last_update_timer = expires;

        /* store expire time for this CPU timer */
        __get_cpu_var(virt_cpu_timer).to_expire = expires;
}

The instruction to store the current value and the instruction to set the new value are next to each other. You cannot do better. There is one problem we recently identified and that is the cputime spent by the idle process doing actual system work is accounted as idle time instead of system time. I have a patch for this problem, it will go upstream with the next merge window. The maximum difference I was able to create with my testcases has been 0.35%.

> And because the CPU timer measures only in-SIE time, you miss the resources that CP and SIE spent on behalf of the virtual machine. Even when you don't measure it, someone still has to pay for it ;-)

This is called CP overhead and there are two cases. If CP wants to account CPU time to the guest because it has done work on behalf of the guest, it can simply add the time to the guest CPU timer in the SIE control block before the guest cpu is restarted. The cputime spent by CP for things not directly related to a guest should NOT be accounted to the guest. This part of the CP overhead has to be accounted by z/VM.

> When I was diagnosing the customer problem, I did notice one bug in the kernel that probably could be fixed. But I did not have time yet to try that and see how big the difference would be. And in general, the additional code for dealing with the CPU timer makes the unmeasured part of time longer, so in general reduces the capture ratio.

How is the unmeasured part of the time longer? There is some overhead for doing the improved Linux cputime accounting but the additional instructions are fully accounted as cputime in Linux.

--
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.
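The bookkeeping in set_vtimer can be followed without s390 hardware. The sketch below is a plain C imitation, not the kernel's code: the struct and function names are hypothetical, the STPT/SPT pair is replaced by ordinary parameters, and it assumes the timer has not gone negative. The only hardware fact it relies on is the one quoted from the Principles of Operation: one microsecond is a one in bit position 51, so a microsecond is 2^12 timer units.

```c
#include <assert.h>
#include <stdint.h>

/* One microsecond is a one in bit position 51 of a 64-bit value, i.e. 1 << 12. */
#define CPU_TIMER_UNITS_PER_USEC (UINT64_C(1) << 12)

/* Stand-in for the two lowcore fields that set_vtimer updates. */
struct vtimer_acct {
    uint64_t system_timer;      /* accumulated consumed time, in timer units */
    uint64_t last_update_timer; /* timer value loaded at the last update */
};

/* Convert CPU timer units to whole microseconds. */
uint64_t cpu_timer_units_to_usecs(uint64_t units)
{
    return units / CPU_TIMER_UNITS_PER_USEC;
}

/*
 * Same arithmetic as set_vtimer: 'stored' plays the role of the value
 * STPT would store, 'expires' the value SPT would load. The timer counts
 * down, so the consumed time is (previously loaded value - stored value).
 */
void account_and_reload(struct vtimer_acct *a, uint64_t stored, uint64_t expires)
{
    assert(stored <= a->last_update_timer);
    a->system_timer += a->last_update_timer - stored;
    a->last_update_timer = expires;
}
```

If the timer was last loaded with the equivalent of 5000 microseconds and now reads 4000, the update adds 1000 microseconds to system_timer before the new value is loaded. The only time that can escape this scheme is whatever elapses between the store and the reload, which is why the two instructions sit next to each other.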
Re: question on top
Sorry Alan, if you are an open source company, or a monopoly, then you don't mind helping your competitors. With the growth in installations doing accounting for Linux applications, this stuff is actually important

Alan Altmark wrote:
> On Monday, 12/08/2008 at 10:28 EST, Barton Robinson <[EMAIL PROTECTED]> wrote:
>> Not sure we've bothered to report the details since this problem would not impact our users.
>
> It would nonetheless be a good service to the community to report the problem you found, whether it affects your customers or not.

Barton Robinson
Sr. Architect
http://velocitysoftware.com
Re: question on top
On Monday, 12/08/2008 at 10:28 EST, Barton Robinson <[EMAIL PROTECTED]> wrote:
> Not sure we've bothered to report the details since this problem
> would not impact our users.

It would nonetheless be a good service to the community to report the problem you found, whether it affects your customers or not.

While we all want tools like 'top' to show accurate information, they only provide a slice of the information you need to really manage the throughput of your system, something arguably more important than absolute performance (vague term) of a single guest.

To manage your *system* you need something like Performance Toolkit, OMEGAMON, or ESAMON. They make it easier (possible?) to tune your system to meet the needs of *your* workload, build an accurate charge-back system if you need one, and perform capacity planning based on historical data.

I think getting a guest's view of performance is ok if you don't sit there with it in a loop and you use it primarily to compare to the output from yesterday, looking for unusual differences, not absolute numbers. As the number of servers grows, you can get more variability, so you might start seeing 'unusual' differences which, if you have a view of the entire system, are not unusual.

Alan Altmark
z/VM Development
IBM Endicott
Re: question on top
On Mon, 2008-12-08 at 07:25 -0800, Barton Robinson wrote:
> Sorry Christian, but with the latest and greatest, there are many cases where Linux and TOP now seriously under report utilization (I think by factor of 5 in the lab, and by 4 in a production server). Not sure we've bothered to report the details since this problem would not impact our users. So the data still can not be used for serious performance work, capacity planning or accounting/chargeback. It's like putting gas in a car, and the price per "unit" varies with the number of other people wanting gas. Doesn't lead one to trust the instrumentation.

1) Rob, please report these discrepancies. The numbers linux reports should be correct.

2) The factor of 4-5 is based on what numbers exactly? I doubt that you get that discrepancy if you are running a cpu bound linux process that uses more than a few percent of cpu. As already pointed out the situation that started this thread is very likely a multi threaded program and top aggregates the cputime.

3) Top is by no means a monitoring tool. You can use it to get a rough snapshot of the current situation but please don't use it instead of a real monitor because top itself uses a lot of cpu.

--
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.
Re: question on top
On Mon, Dec 8, 2008 at 4:54 PM, Christian Borntraeger <[EMAIL PROTECTED]> wrote:
> Barton,
>
>> Sorry Christian, but with the latest and greatest, there are many cases where Linux and TOP now seriously under report utilization (I think by factor of 5 in the lab, and by 4 in
>
> Do you have a short description of one test case? If there is a real problem, we should fix it.

Yes, we do understand the fault in the application now and could reproduce the behavior. The "real problem" for the customer was that the Linux data did not reveal that his (DB2) application is misbehaving. Most people recognize that you can't do much with the Linux data without knowing the context (i.e. what was happening on z/VM during that time). As long as you want to capture things in a single metric, you probably can't fix the problem.

I believe the customer is working with the DB2 people to get them to fix the problem. Let's hope he has more luck than we had when we tried a year ago.

>> a production server). Not sure we've bothered to report the details since this problem would not impact our users. So the data still can not be used for serious performance
>
> The last time we talked, your tool used the Linux data as one input value of your calculations. So if the Linux data is really wrong, any fix would improve the accuracy of your tool, no?

I don't think the measurements based on CPU timer are more accurate than those based on TOD.

For one thing because the CPU timer is less accurate than the TOD clock. It's accurate enough when you measure a single virtual machine. But when the kernel is reloading the CPU timer again and again for each process or thread using a small amount of CPU, the error adds up very quickly.

And because the CPU timer measures only in-SIE time, you miss the resources that CP and SIE spent on behalf of the virtual machine. Even when you don't measure it, someone still has to pay for it ;-)

When I was diagnosing the customer problem, I did notice one bug in the kernel that probably could be fixed. But I did not have time yet to try that and see how big the difference would be. And in general, the additional code for dealing with the CPU timer makes the unmeasured part of time longer, so in general reduces the capture ratio.

Rob
--
Rob van der Heij
Velocity Software
http://www.velocitysoftware.com/
Re: question on top
Barton,

> Sorry Christian, but with the latest and greatest, there are many cases where Linux and TOP now seriously under report utilization (I think by factor of 5 in the lab, and by 4 in

Do you have a short description of one test case? If there is a real problem, we should fix it.

> a production server). Not sure we've bothered to report the details since this problem would not impact our users. So the data still can not be used for serious performance

The last time we talked, your tool used the Linux data as one input value of your calculations. So if the Linux data is really wrong, any fix would improve the accuracy of your tool, no?

Christian
Re: question on top
Sorry Christian, but with the latest and greatest, there are many cases where Linux and TOP now seriously under report utilization (I think by factor of 5 in the lab, and by 4 in a production server). Not sure we've bothered to report the details since this problem would not impact our users. So the data still can not be used for serious performance work, capacity planning or accounting/chargeback. It's like putting gas in a car, and the price per "unit" varies with the number of other people wanting gas. Doesn't lead one to trust the instrumentation.

Christian Borntraeger wrote:
> On Monday, 8 December 2008, Barton Robinson wrote:
>> Yes, top lies.
>
> This is no longer true with SLES10+ and RHEL5+. They use the stpt instruction for accurate accounting. If you still see wrong numbers with a recent distro, this would be a bug and should be reported.

Barton Robinson
Re: question on top
On Monday, 8 December 2008, Barton Robinson wrote:
> Yes, top lies.

This is no longer true with SLES10+ and RHEL5+. They use the stpt instruction for accurate accounting. If you still see wrong numbers with a recent distro, this would be a bug and should be reported.

> So are you really saying a single threaded process is using more than one
> cpu? Don't need much more proof than that.

Per default top accumulates all threads of a process. It is completely normal to see values > 100% if there are threads. Use shift-H to see single thread values.

Christian
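The point about accumulated threads is just a sum. A minimal sketch in C, with a hypothetical function name and made-up per-thread values (this is not how top is implemented, only the arithmetic behind a process line above 100%):

```c
#include <assert.h>
#include <stddef.h>

/* Sum the per-thread CPU percentages into the single per-process figure. */
double process_cpu_percent(const double *thread_percent, size_t nthreads)
{
    double total = 0.0;

    assert(thread_percent != NULL || nthreads == 0);
    for (size_t i = 0; i < nthreads; i++)
        total += thread_percent[i];
    return total;
}
```

Three threads at 90%, 80% and 70% of a CPU each show up as one process line at 240%, although no single thread exceeds 100%.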
Re: question on top
Barton Robinson wrote:
> Yes, top lies. So are you really saying a single threaded process is
> using more than one cpu? Don't need much more proof than that.

Barton

I don't think TOP shows threads by default, so a multithreaded process would show up as a single line. In TOP type "H" to show threads. Also you can type "1" to list all CPUs rather than a summary.

mark
Re: question on top
On 12/8/08 8:48 AM, "Ayer, Paul W" <[EMAIL PROTECTED]> wrote:

> I have seen notes before that "top lies" but;

Like a rug.

> Looking at a top display sorted by %CPU I see some processes using over
> 200 or 300 % CPU.
> This LPAR has 4 IFL's installed and the Linux has access to all four.
> Can we use the %cpu below 400 % as an indicator that we are not using
> all four at that moment?

Probably not. Work is distributed on all available CPUs by the Linux scheduler. You're using less than the total capacity of the four CPUs, but I don't think there are any guarantees that you're only using 3 out of the 4 (i.e. you could take #4 away without problems).

> Also if the %cpu says lets say ... 382% are we using 100% of two CPU's
> and 82% of the third maybe?

See above. You can conclude that you are using 382% of the total 400% at this instant. Beyond that, you'd need a performance monitor of some sort.
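What the 382% figure does and does not say can be written down directly. The sketch below uses hypothetical function names and only the numbers from the question; it assumes nothing about how the load is spread over the individual CPUs, which is exactly what the summed value cannot tell you.

```c
#include <assert.h>

/* Average utilization across all CPUs, given the summed %CPU figure. */
double overall_utilization_percent(double summed_percent, unsigned ncpus)
{
    assert(ncpus > 0);
    return summed_percent / ncpus;
}

/* Unused capacity in percent of one CPU, across all CPUs together. */
double idle_capacity_percent(double summed_percent, unsigned ncpus)
{
    assert(ncpus > 0);
    return 100.0 * ncpus - summed_percent;
}
```

For 382% on 4 CPUs this gives an average of 95.5% per CPU and 18% of one CPU left over in total. It does not mean two CPUs are full and a third is at 82%.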
Re: question on top
Yes, top lies. So are you really saying a single threaded process is using more than one cpu? Don't need much more proof than that.

What does your performance monitor say (can you measure Linux in an LPAR with your performance monitor)???

Normally the processes would be somewhat balanced over the physical CPUs, so 382% would mean 95% utilization overall of the 4 CPUs.

Ayer, Paul W wrote:
> Good Morning,
>
> I have seen notes before that "top lies" but;
>
> Looking at a top display sorted by %CPU I see some processes using over 200 or 300 % CPU.
> This LPAR has 4 IFL's installed and the Linux has access to all four.
> Can we use the %cpu below 400 % as an indicator that we are not using all four at that moment?
>
> Also if the %cpu says lets say ... 382% are we using 100% of two CPU's and 82% of the third maybe?
>
> Any input or ideas here would be great.
>
> Thanks,
> Paul

Barton Robinson
Re: question on top
I think top only has a reasonable view of the world when running Linux native, i.e. no z/VM and the IFLs are dedicated to the lpar, i.e. not shared with another lpar. Otherwise various virtualization effects come into play which make the top numbers rather meaningless without a means to prorate them according to outside info.

Best regards,
Pieter Harder
[EMAIL PROTECTED]
tel +31-73-6837133 / +31-6-47272537

-----Original Message-----
From: Linux on 390 Port [mailto:[EMAIL PROTECTED] On Behalf Of Ayer, Paul W
Sent: Monday, 8 December 2008 14:49
To: LINUX-390@VM.MARIST.EDU
Subject: question on top

Good Morning,

I have seen notes before that "top lies" but;

Looking at a top display sorted by %CPU I see some processes using over 200 or 300 % CPU.
This LPAR has 4 IFL's installed and the Linux has access to all four.
Can we use the %cpu below 400 % as an indicator that we are not using all four at that moment?

Also if the %cpu says lets say ... 382% are we using 100% of two CPU's and 82% of the third maybe?

Any input or ideas here would be great.

Thanks,
Paul

Brabant Water N.V.
Postbus 1068
5200 BC 's-Hertogenbosch
http://www.brabantwater.nl
Handelsregister: 16005077
question on top
Good Morning,

I have seen notes before that "top lies" but;

Looking at a top display sorted by %CPU I see some processes using over 200 or 300 % CPU.
This LPAR has 4 IFL's installed and the Linux has access to all four.
Can we use the %cpu below 400 % as an indicator that we are not using all four at that moment?

Also if the %cpu says lets say ... 382% are we using 100% of two CPU's and 82% of the third maybe?

Any input or ideas here would be great.

Thanks,
Paul