I've been spending way too much time being a manager lately to actually dig around in the Perfmon code, but my gut feeling on this issue continues to be that the problem is one of two things:
My first guess is that it is something similar to the perfctr problem in which it "got confused and quit" because of a race condition in a context swap between a monitored process/thread and one that is not monitored. In a sidebar to Gary I suggested looking for control paths that allow the event handler to return without enabling the next event.

Second, the PAPI workaround of using SIGRTMIN rather than SIGIO leads me to suspect that a second or third trap occurs while the first is being handled, and that switching to queued signals ensures that multiple events are delivered in separate signals. This is a consequence of using a "1 signal = 1 event" assumption. Using a "1 signal = zero to n events" model is a more common, and more efficient, way of handling this situation. Thus, make sure that the signal handler has a way of seeing and acting on all of the events that have occurred before it returns, e.g., wrap the event-handling core of the handler in a WHILE loop. (This requires that the driver record that multiple events have occurred.) Also, make sure that events are not cleared unless they are actually acted on, e.g., clear just the relevant bit rather than zeroing an entire event bit mask.

This is a pretty common strategy in high-performance networking, where there are a gazillion open sockets: you can do a huge amount of work for each signal delivered to user space, thus saving lots of signal-handling overhead. Typically, two SIGIOs are delivered. The first occurs with the first event, and all visible events are handled with SIGIO blocked; when the handler returns, a second SIGIO is delivered to indicate the occurrence of the 2nd through Nth events. Since all of those events may already have been handled before the second SIGIO is delivered, the signal handler also needs to be able to deal with the "no events pending" case.
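The "1 signal = zero to n events" handler suggested above can be sketched in C roughly as follows. This is a minimal sketch under stated assumptions, not perfmon or PAPI code: `pending_events` is a hypothetical shared bitmask that a driver would set (here the hypothetical `post_event()` plays the driver's role), and `NUM_EVENTS`, `run_demo()` and the counters are illustrative names only. The handler loops until the mask is empty, clears one bit at a time, and tolerates a SIGIO that finds nothing left to do.

```c
#define _GNU_SOURCE          /* expose SIGIO and sigaction() under strict -std flags */
#include <assert.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>

#define NUM_EVENTS 8         /* hypothetical number of event sources (e.g., counters) */

/* Hypothetical shared "events pending" bitmask. A real driver would set
 * these bits; in this sketch post_event() stands in for the driver. */
static atomic_uint pending_events;

static volatile sig_atomic_t handled[NUM_EVENTS]; /* times each event was acted on */
static volatile sig_atomic_t invocations;         /* times the handler ran */
static volatile sig_atomic_t empty_wakeups;       /* runs that found nothing pending */

/* One signal = zero to n events: drain everything before returning. */
static void sigio_handler(int sig)
{
    (void)sig;
    int found = 0;
    unsigned snapshot;

    invocations++;
    /* Keep looping until the mask is empty; events posted while we are
     * working are caught by the next pass instead of being lost. */
    while ((snapshot = atomic_load(&pending_events)) != 0) {
        for (unsigned bit = 0; bit < NUM_EVENTS; bit++) {
            unsigned mask = 1u << bit;
            if (!(snapshot & mask))
                continue;
            /* Clear only this event's bit (never zero the whole mask) and
             * act on the event only if this call is the one that cleared it. */
            if (atomic_fetch_and(&pending_events, ~mask) & mask) {
                handled[bit]++;   /* stand-in for real event processing */
                found = 1;
            }
        }
    }
    if (!found)
        empty_wakeups++;          /* late signal: its events were already handled */
}

/* Hypothetical driver stand-in: record the event, then signal. */
void post_event(unsigned bit)
{
    assert(bit < NUM_EVENTS);
    atomic_fetch_or(&pending_events, 1u << bit);
    raise(SIGIO);
}

/* Burst of three events with SIGIO blocked, then one extra SIGIO.
 * Returns 0 on success, -1 if the handler could not be installed. */
int run_demo(void)
{
    struct sigaction sa;
    sigset_t block, old;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = sigio_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGIO, &sa, NULL) != 0)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, SIGIO);
    sigprocmask(SIG_BLOCK, &block, &old);
    post_event(0);                        /* three events, but only one SIGIO */
    post_event(3);                        /* stays pending: standard signals */
    post_event(5);                        /* do not queue */
    sigprocmask(SIG_SETMASK, &old, NULL); /* one handler run drains all three */

    raise(SIGIO);                         /* extra SIGIO finds nothing pending */
    return 0;
}
```

With SIGIO blocked, the three `raise()` calls leave a single SIGIO pending, so one handler run processes all three events, and the extra SIGIO exercises the "no events pending" path. Switching the sketch to SIGRTMIN would instead queue one signal per event, which is the kind of behavior the PAPI workaround relies on.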
An aside: Mark K sent a note to me that said "On CNL 2.1 and perfmon 2.3, I was able to complete 1-hour runs with multiple threads, at 2,000 interrupts/second, 10,000/second and even 100,000/second without dropping interrupts." It would be good if Cray pushed their fixes back to the world.

stephane eranian wrote:
> Phil,
>
> On Fri, Apr 18, 2008 at 9:49 AM, Philip Mucci <[EMAIL PROTECTED]> wrote:
>> Gary,
>>
>> I'm CC'ing this to the HPCTOOLKIT folks to see if they have an idea.
>> My guess is that there is some interaction with monitoring around the
>> forks/execs. Mark Krentel from Rice raised this issue a few times and
>> as far as I know, it remains unfixed. This is the old perfmon
>> interface as far as I can tell.
>>
>> I'm no longer confident that this is not a monitoring/masking issue
>> nor is it a threading/signaling issue, it could be related to PAPI
>> while profiling, not functioning properly across a fork/exec. Either
>> way, you wouldn't see anything in the kernel logs. Is there any way we
>> can make code.exe run without doing the fork/execs just to
>> see if it proceeds normally?
>>
> Keep in mind that by default pfmon DOES NOT follow across fork.
>
>> Another question, is there a way to trick pfmon into dumping out the
>> sample counts for individual processes? That would make it a heck of a
>> lot easier to compare.
>>
> It depends on which version you are using. You may want to upgrade to
> CVS. The new version does print total samples + number of buffer overflows.
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by the 2008 JavaOne(SM) Conference
> Don't miss this year's exciting event. There's still time to save $100.
> Use priority code J8TL2D2.
> http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
> _______________________________________________
> perfmon2-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/perfmon2-devel

--
Robert J. Fowler
Chief Domain Scientist, HPC
Renaissance Computing Institute
The University of North Carolina at Chapel Hill
100 Europa Dr, Suite 540
Chapel Hill, NC 27517
V: 919.445.9670
F: 919 445.9669
[EMAIL PROTECTED]
