I haven't seen any updates on the "lost events" thread in a while.
Losing a thread about lost events would be especially ironic.

I did have a conversation with John Mellor-Crummey around 2008-03-10
about this.  I reminded him that several years ago (when the 2.6.9
kernel was brand new) we first saw this behavior at Rice on an Opteron
box (2 sockets, single core) running 2.6.7 and the corresponding
versions of perfctr, PAPI, and hpcrun.

It turned out that there was a window of vulnerability in context
switches when switching from a monitored process to an unmonitored one.
The driver got an event in the new process context, found no virtualized
counters, dropped the event on the floor, and returned without
reinitializing the counter for the next event.  Diagnosing this
was easy once I looked in the kernel message log, because perfctr was
civilized enough to leave descriptive messages.  It was otherwise
silent about quitting, though, so I had to go snooping as root on the
box to find them.
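
To make that failure mode concrete, here is a toy model in Python.  It
is only a sketch and not perfctr or Perfmon2 code: the function name,
its parameters, and the event stream are all made up for illustration.
It assumes a sampling counter that produces further overflows only if
the handler reloads it, and it compares a handler that forgets to reload
after an overflow in an unmonitored context with one that drops the
sample but still reloads.

```python
# Toy model only: not perfctr or Perfmon2 code; all names are hypothetical.

def count_samples(owners, period, rearm_when_unmonitored):
    """Simulate a sampling counter over a stream of events.

    owners: sequence of booleans, one per event; True means the event
            arrives while the monitored process is the current context.
    period: number of events between counter overflows.
    rearm_when_unmonitored: if False, model the bug: an overflow seen in
            an unmonitored context is dropped AND the counter is never
            reloaded, so sampling stops for good.
    """
    armed = True
    remaining = period
    samples = 0
    for monitored in owners:
        if not armed:
            continue              # counter was never reloaded
        remaining -= 1
        if remaining > 0:
            continue
        # Overflow: the handler runs in whatever context is current.
        if monitored:
            samples += 1          # record the sample
            remaining = period    # and reload the counter
        elif rearm_when_unmonitored:
            remaining = period    # drop the sample but keep sampling
        else:
            armed = False         # the bug: return without reloading
    return samples


# 1000 events, overflow every 10; exactly one overflow (event index 49)
# lands in an unmonitored context.
owners = [True] * 1000
owners[49] = False

buggy = count_samples(owners, period=10, rearm_when_unmonitored=False)
fixed = count_samples(owners, period=10, rearm_when_unmonitored=True)
print(buggy, fixed)   # prints "4 99"
```

In this model a single badly timed overflow costs the fixed handler one
sample, while the buggy handler loses every sample after it.  A large
undercount of that kind is at least consistent with the symptom
described below, but the model proves nothing about the real drivers.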

Fixing this was easy (on me) -- I just sent an email to Mikael Pettersson,
gave him an account (with root) on the machine, and sat back.  Within 2 days
he'd upgraded the box to 2.6.9, fixed perfctr, and propagated the patch to
the world.

I'm speculating that there's a similar condition in Perfmon 2 in which the
handler returns without enabling the next event.  Note that the original
Perfmon on IA64 from that era was absolutely rock solid, even when we
abused it by having it deliver events at very high rates.




[EMAIL PROTECTED] wrote:
> I have a customer who has an application that when run under pfmon reports
> 154 billion CPU_CYCLES used (appears to be a reasonable value).  When this
> same application is run under Hpcrun (from HPCToolkit using PAPI) it only
> reports about 2 billion CPU_CYCLES used.  These tests are run on an Intel
> IA64 platform.
> 
<stuff elided here>
> 
> I have also browsed this mailing list and found a thread called
> "papi on compute node linux" which was last updated 2008-03-10.  The
> discussion in this thread sounds to me like it could easily explain what
> I am seeing.
> 
> Is there a way I can determine if this discussion (i.e., losing interrupts)
> is what I am seeing?
> 
> Thanks for any help you can provide.
> 
> Gary
> 
> 
> -------------------------------------------------------------------------
> This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
> Don't miss this year's exciting event. There's still time to save $100. 
> Use priority code J8TL2D2. 
> http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
> _______________________________________________
> perfmon2-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/perfmon2-devel

-- 
Robert J. Fowler
Chief Domain Scientist, HPC
Renaissance Computing Institute
The University of North Carolina at Chapel Hill
100 Europa Dr, Suite 540
Chapel Hill, NC 27517
V: 919.445.9670
F: 919 445.9669
[EMAIL PROTECTED]
