I've been spending way too much time being a manager lately
to actually dig around in the Perfmon code,
but my gut feeling on this issue continues to be that the
problem is one of two things:

My first guess is that it is something similar to the
perfctr "got confused and quit because of a race condition
in a context swap between a monitored process/thread and one
that is not monitored" problem.  In a sidebar to Gary I suggested
looking for control paths that allow the event handler to return
without enabling the next event.

Second, the PAPI workaround of using SIGRTMIN rather than SIGIO leads me
to suspect that a second or third trap occurs while the first is being
handled and that going to queued signals ensures that multiple events are
delivered in separate signals.  This is a consequence of using a
"1-signal = 1-event" assumption.
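
To illustrate the difference between the two signal types, here is a minimal sketch (not PAPI or perfmon code) that sends a signal to the current process several times while it is blocked and counts how many handler invocations follow once it is unblocked. The names count_delivery and count_deliveries are hypothetical, and the sketch assumes Linux/glibc and a single-threaded process.

```c
#define _DEFAULT_SOURCE   /* ask glibc for sigaction, sigqueue, SIGIO */
#include <signal.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t deliveries;

/* Hypothetical handler: only counts how often it is invoked. */
static void count_delivery(int signo, siginfo_t *info, void *ctx)
{
    (void)signo; (void)info; (void)ctx;
    deliveries++;
}

/* Send `signo` to ourselves `sends` times while it is blocked, then
 * unblock it. Returns the number of handler invocations, or -1 on error. */
int count_deliveries(int signo, int sends)
{
    struct sigaction sa, old_sa;
    sigset_t block, old_mask;
    union sigval val;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = count_delivery;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(signo, &sa, &old_sa) != 0)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, signo);
    if (sigprocmask(SIG_BLOCK, &block, &old_mask) != 0)
        return -1;

    deliveries = 0;
    for (int i = 0; i < sends; i++) {
        val.sival_int = i;
        if (sigqueue(getpid(), signo, val) != 0)
            return -1;
    }

    /* Pending signals are delivered once the mask is restored. */
    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    sigaction(signo, &old_sa, NULL);
    return (int)deliveries;
}
```

On Linux, a standard signal such as SIGIO that is already pending is not queued again, so several sends collapse into one delivery, whereas each real-time signal such as SIGRTMIN is queued and delivered separately. A handler that assumes one event per signal therefore loses events with SIGIO but not with SIGRTMIN.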

Using a "1-signal = zero to n events" model is a more common, and more
efficient, way of handling this situation.  That is, make sure that the
signal handler has a way of seeing and acting on all of the events that
have occurred before it returns, e.g., wrap the event-handling core of the
handler in a WHILE loop.  (This requires that the driver note that
multiple events have occurred.)  Also, make sure that events are not cleared
unless they are actually acted on, e.g., clear just the relevant bit
rather than zeroing an entire event bit mask.  This is a pretty common
strategy in high-performance networking situations where there are a
gazillion open sockets;  you can do a huge amount of work for each signal
delivered to user space, thus saving lots of signal handling overhead.
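
A minimal sketch of that handler structure follows. It is not perfmon or PAPI code: pending_mask stands in for whatever the driver exposes to record outstanding events, and post_event, drain_events, handled and on_signal are hypothetical names chosen for illustration.

```c
#include <stdatomic.h>

#define NUM_EVENTS 32u

/* Hypothetical stand-in for driver state: one bit per event source. */
atomic_uint pending_mask;
unsigned handled[NUM_EVENTS];

/* Simulated driver side: note that an event occurred. */
void post_event(unsigned bit)
{
    atomic_fetch_or(&pending_mask, 1u << bit);
}

/* Core of the handler: keep going until no event is pending.
 * Returns how many events were acted on (possibly zero). */
int drain_events(void)
{
    int acted_on = 0;
    unsigned mask;

    while ((mask = atomic_load(&pending_mask)) != 0) {
        for (unsigned bit = 0; bit < NUM_EVENTS; bit++) {
            unsigned flag = 1u << bit;
            if ((mask & flag) == 0)
                continue;
            /* Clear only the bit being acted on, never the whole mask.
             * If the same event is posted again while we work, its bit
             * is set again and the outer loop picks it up. */
            atomic_fetch_and(&pending_mask, ~flag);
            handled[bit]++;
            acted_on++;
        }
    }
    return acted_on;
}

/* One signal may stand for zero, one, or many events. */
void on_signal(int signo)
{
    (void)signo;
    drain_events();
}
```

Note that a bit mask coalesces repeated occurrences of the same event; if every occurrence must be counted, the driver would need to keep a per-event counter instead of a single bit.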

Typically, two SIGIOs are delivered:  one occurs with the
first event; all visible events are handled with SIGIO blocked; when
the handler returns, a second SIGIO is delivered to indicate the
occurrence of the 2nd through Nth events.  Since all of the events
may have already been handled before the 2nd SIGIO is delivered, the
signal handler also needs to be able to deal with the "no events
pending" case.
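
The two-delivery sequence can be reproduced in a small self-contained sketch. Everything here is an assumption made for illustration: notify_event plays the role of the driver, pending_events is a simple counter rather than real perfmon state, and inject_extra simulates further events arriving while the handler is still running. It assumes Linux/glibc and a single-threaded process.

```c
#define _DEFAULT_SOURCE   /* ask glibc for sigaction and SIGIO */
#include <signal.h>
#include <stdatomic.h>
#include <string.h>

atomic_int pending_events;             /* hypothetical driver-side count */
volatile sig_atomic_t handler_calls;   /* how often the handler ran */
volatile sig_atomic_t empty_calls;     /* runs that found nothing to do */
volatile sig_atomic_t events_handled;
volatile sig_atomic_t inject_extra;    /* events to simulate mid-handler */

/* Simulated driver: record an event, then signal the process. */
void notify_event(void)
{
    atomic_fetch_add(&pending_events, 1);
    raise(SIGIO);
}

void sigio_handler(int signo)
{
    int handled = 0;

    (void)signo;
    handler_calls++;
    while (atomic_load(&pending_events) > 0) {
        atomic_fetch_sub(&pending_events, 1);
        handled++;
        events_handled++;
        /* SIGIO is blocked while we are in the handler, so this
         * notification only leaves one SIGIO pending. */
        if (inject_extra > 0) {
            inject_extra--;
            notify_event();
        }
    }
    if (handled == 0)
        empty_calls++;   /* the "no events pending" case */
}

int install_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = sigio_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGIO, &sa, NULL);
}
```

With inject_extra set to 2 and a single call to notify_event(), the first handler invocation should drain all three events, and the SIGIO left pending should trigger a second invocation that finds nothing to do.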

An aside:

Mark K sent a note to me that
said "On CNL 2.1 and perfmon 2.3, I was able to complete 1-hour runs
with multiple threads, at 2,000 interrupts/second, 10,000/second and
even 100,000/second without dropping interrupts."

It would be good if Cray pushed their fixes back to the world.


stephane eranian wrote:
> Phil,
> 
> On Fri, Apr 18, 2008 at 9:49 AM, Philip Mucci <[EMAIL PROTECTED]> wrote:
>> Gary,
>>
>>  I'm CC'ing this to the HPCTOOLKIT folks to see if they have an idea.
>>  My guess is that there is some interaction with monitoring around the
>>  forks/execs. Mark Krentel from Rice raised this issue a few times and
>>  as far as I know, it remains unfixed. This is the old perfmon
>>  interface as far as I can tell.
>>
>>  I'm no longer confident that this is not a monitoring/masking issue
>>  nor is it a threading/signaling issue, it could be related to PAPI
>>  while profiling, not functioning properly across a fork/exec. Either
>>  way, you wouldn't see anything in the kernel logs. Is there any way we
>>  can make code.exe run without doing the fork/exec's just to
>>  see if it proceeds normally?
>>
> Keep in mind that by default pfmon DOES NOT follow across fork.
> 
>>  Another question, is there a way to trick pfmon into dumping out the
>>  sample counts for individual processes? That would make it a heck of a
>>  lot easier to compare.
>>
> It depends on which version you are using. You may want to upgrade to
> CVS. The new version does print total samples + number of buffer overflows.
> 
> -------------------------------------------------------------------------
> This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
> Don't miss this year's exciting event. There's still time to save $100. 
> Use priority code J8TL2D2. 
> http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
> _______________________________________________
> perfmon2-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/perfmon2-devel

-- 
Robert J. Fowler
Chief Domain Scientist, HPC
Renaissance Computing Institute
The University of North Carolina at Chapel Hill
100 Europa Dr, Suite 540
Chapel Hill, NC 27517
V: 919.445.9670
F: 919 445.9669
[EMAIL PROTECTED]
