Stephane Eranian wrote:
Will,

On Tue, Oct 30, 2007 at 03:37:17PM -0400, William Cohen wrote:
Looking over the OLS 2007 slides. Are you talking about slide 19, on full virtualization vs. para-virtualization? I haven't read the VT-x/AMD-V documentation.

Yes, my point was that any solution would have to work with both para and full.

Exposing the real CPUID is going to make things difficult when migrating between machines. On Xen a guest can be moved from one physical machine to another. What happens if the physical machines are different? Is there some sane way that perfmon can indicate that this migration has occurred and make data collection no longer feasible?


Well, the PMU cannot be virtualized completely in software. You need to access
the actual registers. That means that you need to know which PMU you are
dealing with. On x86, it means you need to rely on CPUID to figure this out,
and not just the family/model. Clearly, on Intel processors starting with the
Intel Core Duo, we can use CPUID leaf 0xA to query the architectural perfmon
version.

I agree with you that this causes some problems with migration. But it is not
necessarily different from the SSE problem. You may migrate to a system
with an older version of SSE. It is a trade-off between full use of the
capabilities of the underlying hardware and flexibility for migration.

This is the perfect example of why having an architected PMU is important.
It ensures some minimum level of functionality from the PMU. So if your tools
were coded only to the architected PMU, it would be easier to migrate from
one machine to another. Now, if your server farm is composed of both AMD and
Intel machines, there is unfortunately not much we can do.

I think it would not be that crazy to impose the following trade-off:
  * if you use the PMU, then you can only migrate to machines using the
    same processor.

Yes, restricting migrations to machines with the same processor would be too limiting. I was thinking more along the lines that the PMU hardware might disappear or change from under a program. Having the ability to know that that situation occurred might be useful. In this case the program may want to discard the previous perfmon setup and set up a new one with equivalent events on the current architecture.

Does the VT-x/AMD-V hardware allow true save and restore of the counter values? On some earlier x86 processors there is no way to correctly restore the upper bits of the performance counters: writes just sign-extend the 32-bit value written in. Or are the registers just going to be treated as the lower 32 bits of a 64-bit counter? It would have been nice if the performance counters had been implemented as true 64-bit writable counters.

I remember running into an issue about this particular point when hacking
on KVM. I think there was a workaround: you had to force the upper
bits before letting VT-x restore the value.

As for 64-bit counters, this is an old story. There are trade-offs at
the hardware level. You don't really need 64-bit counters. You need counters
that do not overflow too frequently, say once per hour. As for sampling,
you always want to trigger a counter overflow, and you will never need the
full 64 bits to express a sampling period. So I would rather have the
hardware add more counters than wider counters.

It is nice to have more counters. However, the pain with the narrower registers is that reading a counter value is no longer a quick atomic read of a single register. The value needs to be synthesized by reading the register plus some value in memory, while checking that a rollover didn't occur in between.

I would think there are three usage models; the last two are subdivisions within a guest OS:

system-wide (everything on the system, including the VMM, like the current xenoprof)

Agreed.

guest-wide (just within the one guest OS context)

I think your guest-wide corresponds to what we would call per-thread on
a native system. From a certain control point, e.g. domain 0 on Xen, you
attach to a guest OS and measure it as it is mapped onto possibly different
physical CPUs. The PMU state is saved/restored on guest domain switches.


thread-wide (just within one thread/process context)

Not sure I understand this one.

A thread within a guest OS. The guest OS is responsible for setting up the PMU when a context switch happens within the guest OS.

But all 3 are different from what I call PMU virtualization for guests. This
mode ensures continuity of service to guest applications. If I run inside a
virtualized Linux guest, I want to be able to run pfmon or OProfile just like
on native.

What creates additional difficulty here is that all three modes could
potentially co-exist, thereby increasing the need for sharing safely the PMU
resource.

The VMM would be invoked each time the PMU interrupt handler runs, due to NMI handling, and on PMU writes? How expensive will that be?

Probably not much different from what happens today. Note that, depending on
the mode, the interrupt would need to be forwarded (reposted) to a guest OS.
In the case of KVM, it could also be consumed by the host Linux kernel
directly.

How would system-wide profiling work if guest-wide or thread-wide profiling is already running? Would the in-kernel PMU allocation be incorporated into the VMM? Would it steal the existing PMU resource from the host OS?


That's the sharing problem I mention in my slides. Suppose you are running
a system-wide session like you describe above, and then in a guest OS you want
to run pfmon. Totally legitimate, yet hard to control who gets what. One thing
is for sure: with the current hardware, they all need to use the same
interrupt vector.


How many different operation modes are there for the PMU hardware?
        1) just counting events; an interrupt indicates a counter overflow
        2) sampling; take some action when a counter overflows
        3) hardware-stored samples, e.g. PEBS/IBS; need to handle the hardware buffer

OProfile's sampling mechanism needs to determine which counter caused the interrupt in order to add the sample. Would it make sense to have a table of function pointers with an action for each counter? When the PMU interrupt occurs, the overflowing counters are identified and the associated functions in the table are called. Or is this going to be unworkable because the counters share PMU state and handling one PMU event may affect the rest of the PMU?

-Will
_______________________________________________
perfmon mailing list
[email protected]
http://www.hpl.hp.com/hosted/linux/mail-archives/perfmon/