On Mon, Dec 8, 2008 at 4:54 PM, Christian Borntraeger <[EMAIL PROTECTED]> wrote:
> Barton,
>
>> Sorry Christian, but with the latest and greatest, there are many cases
>> where Linux and
>> TOP now seriously under report utilization (I think by factor of 5 in the
>> lab, and by 4 in
>
> Do you have a short description of one test case? If there is a real problem,
> we should fix it.

Yes, we do understand the fault in the application now and could
reproduce the behavior. The "real problem" for the customer was that
the Linux data did not reveal that his (DB2) application is
misbehaving. Most people recognize that you can't do much with the
Linux data without knowing the context (i.e. what was happening on
z/VM during that time). As long as you want to capture things in a
single metric, you probably can't fix the problem.

I believe the customer is working with the DB2 people to get them to
fix the problem. Let's hope he has more luck than we had when we tried
a year ago.

>> a production server). Not sure we've bothered to report the details since
>> this problem
>> would not impact our users. So the data still can not be used for serious
>> performance
>
> The last time we talked, your tool used the Linux data as one input value of
> your calculations. So if the Linux data is really wrong, any fix would improve
> the accuracy of your tool, no?

I don't think the measurements based on the CPU timer are more accurate
than those based on the TOD clock. For one thing, the CPU timer is less
accurate than the TOD clock. It's accurate enough when you measure a
single virtual machine. But when the kernel is reloading the CPU timer
again and again for each process or thread using a small amount of
CPU, the error adds up very quickly. And because the CPU timer measures
only in-SIE time, you miss the resources that CP and SIE spent on
behalf of the virtual machine. Even when you don't measure it, someone
still has to pay for it ;-)

When I was diagnosing the customer problem, I did notice one bug in
the kernel that probably could be fixed. But I have not yet had time
to try that and see how big the difference would be. And in general,
the additional code for dealing with the CPU timer makes the
unmeasured part of the time longer, and so reduces the capture ratio.
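
To make the arithmetic concrete, here is a minimal sketch of a toy model, not a description of what the kernel actually does. It assumes that a fixed amount of CPU work is split over a number of CPU timer reloads, that each reload loses a fixed amount of measured time, and that each reload also costs some time that is never measured (CP, SIE, and the accounting code itself). The function name, the parameters, and all numbers are made up for illustration.

```python
def capture_ratio(total_cpu_us, reloads, reload_error_us, unmeasured_us):
    """Toy model: fraction of the CPU time really consumed that gets reported.

    total_cpu_us     CPU time the workload really uses (hypothetical)
    reloads          number of CPU timer reloads, one per dispatch
    reload_error_us  measured time assumed lost at each reload
    unmeasured_us    time per reload that is never measured at all
    """
    # What the accounting reports: real usage minus the accumulated error.
    measured = max(total_cpu_us - reloads * reload_error_us, 0.0)
    # What someone actually pays for: real usage plus the unmeasured part.
    actual = total_cpu_us + reloads * unmeasured_us
    return measured / actual


if __name__ == "__main__":
    # Same one second of CPU, split into ever smaller pieces.
    for reloads in (1_000, 10_000, 100_000):
        ratio = capture_ratio(1_000_000.0, reloads, 1.0, 2.0)
        print(f"{reloads:7d} reloads -> capture ratio {ratio:.3f}")
```

With the same total CPU time, the reported fraction drops as the work is cut into more and shorter dispatches, which is the effect described above. How large the per-reload terms really are would have to be measured.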

Rob
--
Rob van der Heij
Velocity Software
http://www.velocitysoftware.com/

----------------------------------------------------------------------
For LINUX-390 subscribe / signoff / archive access instructions,
send email to [EMAIL PROTECTED] with the message: INFO LINUX-390 or visit
http://www.marist.edu/htbin/wlvindex?LINUX-390