Re: question on top

Rob van der Heij Mon, 08 Dec 2008 08:42:43 -0800

On Mon, Dec 8, 2008 at 4:54 PM, Christian Borntraeger
<[EMAIL PROTECTED]> wrote:
> Barton,
>
>> Sorry Christian, but with the latest and greatest, there are many cases
>> where Linux and TOP now seriously under report utilization (I think by
>> factor of 5 in the lab, and by 4 in
>
> Do you have a short description of one test case? If there is a real problem,
> we should fix it.


Yes, we do understand the fault in the application now and could
reproduce the behavior. The "real problem" for the customer was that
the Linux data did not reveal that his (DB2) application is
misbehaving. Most people recognize that you can't do much with the
Linux data without knowing the context (i.e. what was happening on
z/VM during that time).
As long as you want to capture things in a single metric, you probably
can't fix the problem. I believe the customer is working with the DB2
people to get them to fix the problem. Let's hope he has more luck than
we had when we tried a year ago.

>> a production server). Not sure we've bothered to report the details since
>> this problem would not impact our users.  So the data still can not be
>> used for serious performance
>
> The last time we talked, your tool used the Linux data as one input value of
> your calculations. So if the Linux data is really wrong, any fix would improve
> the accuracy of your tool, no?

I don't think the measurements based on the CPU timer are more accurate
than those based on the TOD clock. For one thing, the CPU timer is less
accurate than the TOD clock. It's accurate enough when you measure a
single virtual machine. But when the kernel is reloading the CPU timer
again and again for each process or thread using a small amount of
CPU, the error adds up very quickly.
And because the CPU timer measures only in-SIE time, you miss the
resources that CP and SIE spent on behalf of the virtual machine. Even
when you don't measure it, someone still has to pay for it ;-)
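
To make the accumulation argument concrete, here is a rough sketch in
Python. It assumes a fixed amount of time goes unmeasured on every CPU
timer reload; that model, the function name and all numbers are made up
for illustration and are not measurements from any real system.

```python
# Hypothetical model: every CPU timer reload leaves a fixed slice unmeasured.
def measured_cpu_us(true_cpu_us: float, reloads: int,
                    loss_per_reload_us: float) -> float:
    """CPU time that would be reported after losing a fixed amount per reload."""
    return max(true_cpu_us - reloads * loss_per_reload_us, 0.0)

# The same 100 ms of real work, done in few or in many short dispatches.
# All values are invented for the example.
few = measured_cpu_us(100_000, reloads=100, loss_per_reload_us=2.0)
many = measured_cpu_us(100_000, reloads=20_000, loss_per_reload_us=2.0)
print(few, many)
```

Under these assumptions the per-reload loss is negligible for a few
long-running dispatches, but becomes a large share of the total once
the work is split over many short ones.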

When I was diagnosing the customer problem, I did notice one bug in
the kernel that probably could be fixed. But I have not had time yet
to try that and see how big the difference would be. And in general,
the additional code for dealing with the CPU timer makes the
unmeasured part of the time longer, and so reduces the capture
ratio.
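
As a small illustration of the capture ratio point, the sketch below
takes the ratio as measured CPU time divided by total CPU time
consumed. The function name and the numbers are assumptions chosen for
the example only.

```python
def capture_ratio(measured_cpu_s: float, total_cpu_s: float) -> float:
    """Fraction of the CPU time actually consumed that the measurement captures."""
    if total_cpu_s <= 0:
        raise ValueError("total_cpu_s must be positive")
    return measured_cpu_s / total_cpu_s

# Invented numbers: measured time stays at 45 s, but extra accounting
# code adds 10 s of unmeasured time to the total.
before = capture_ratio(45.0, 50.0)
after = capture_ratio(45.0, 60.0)
print(before, after)
```

With the measured part unchanged, any growth in the unmeasured part
raises the total and so lowers the ratio.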

Rob
--
Rob van der Heij
Velocity Software
http://www.velocitysoftware.com/

----------------------------------------------------------------------
For LINUX-390 subscribe / signoff / archive access instructions,
send email to [EMAIL PROTECTED] with the message: INFO LINUX-390 or visit
http://www.marist.edu/htbin/wlvindex?LINUX-390
