Philip and friends,

I have gone from praying that someone out there would help to being
overwhelmed with suggestions and opinions.  Thank you; this is so much
better than some other open-source environments I have worked in, where
you post a message on a mailing list and then sit around getting old
waiting for a response.

I had downloaded the perfmon2 package but have not installed it yet.  My
understanding of that package is that it changes the way signals are used
in multi-threaded environments.  Since we do not have that environment, I
did not think it would help.  If someone thinks it might, I can install it
and give it a try.

Unfortunately our customer is in a secure environment, so all I have is a
copy of the binary executable for the application.  It took us three
months and a non-disclosure agreement with them to get even that much.
Since I do not have source, I cannot modify the application to skip the
fork/execs.

As Stephane pointed out in another email on this thread, pfmon (by
default) does not count events for forked processes.  The values reported
by pfmon in sampling mode do seem to account for the CPU time used by the
application.  When pfmon is run in sampling mode on this application it
reports 4788780 samples; when hpcrun is used it reports only 41602 samples
(in the code.exe data file).  Both runs were done with a sample period of
32767.
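For what it's worth, the pfmon number is roughly what back-of-the-envelope
arithmetic predicts.  This is only a sketch: the 1600 MHz clock and the
roughly 98 seconds of user time are assumptions taken from numbers quoted
elsewhere in this thread, not new measurements.

```shell
# Rough sanity check: expected samples = CPU seconds * clock Hz / period.
# 1600 MHz Montecito clock and ~98 s of user time (from /bin/time) are
# assumptions pulled from the figures reported elsewhere in this thread.
awk 'BEGIN {
  cpu_sec  = 98          # user time reported by /bin/time
  clock_hz = 1600e6      # assumed 1600 MHz clock
  period   = 32767       # sample period used in both runs
  printf "expected samples : %.0f\n", cpu_sec * clock_hz / period
  printf "pfmon reported   : 4788780\n"
  printf "hpcrun reported  : 41602\n"
}'
```

The expected count comes out near 4.8 million, in line with pfmon and about
a hundred times larger than what hpcrun reports.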

I have PAPI built with debug and have run it with PAPI_DEBUG set to:
PAPI_DEBUG=SUBSTRATE,API,INTERNAL,MULTIPLEX,OVERFLOW,PROFILE

I have looked through the debug output, but this untrained eye did not see
anything "unusual".  When I try to attach files of that size to the
mailing list, the message gets rejected as too big.  If you are willing to
look at the output, I can send it directly to your mailbox.

I also spent some time yesterday looking through the kernel messages
generated by perfmon during the hpcrun test, and I found two things in
that log that looked unusual: several places where monitoring was being
masked when it was already masked, and many places where monitoring was
being restarted without having been masked.  I found these by grepping the
file for every place where pmc[4] is displayed.

Excerpts from that grep output follow:

Apr  8 15:33:22 molson kernel: pfm_mask_monitoring.928: CPU0 [18168]
pmc[4]=0x801220
Apr  8 15:33:22 molson kernel: pfm_restore_monitoring.1013: CPU0 [18168]
[18168] pmc[4]=0x801228
Apr  8 15:33:22 molson kernel: pfm_mask_monitoring.928: CPU0 [18168]
pmc[4]=0x801220
Apr  8 15:33:22 molson kernel: pfm_mask_monitoring.928: CPU0 [18168]
pmc[4]=0x801220
Apr  8 15:33:22 molson kernel: pfm_restore_monitoring.1013: CPU0 [18168]
[18168] pmc[4]=0x801228
Apr  8 15:33:23 molson kernel: pfm_mask_monitoring.928: CPU0 [18168]
pmc[4]=0x801220


Apr  8 15:33:20 molson kernel: pfm_mask_monitoring.928: CPU0 [18168]
pmc[4]=0x801220
Apr  8 15:33:20 molson kernel: pfm_restore_monitoring.1013: CPU0 [18168]
[18168] pmc[4]=0x801228
Apr  8 15:33:20 molson kernel: pfm_mask_monitoring.928: CPU0 [18168]
pmc[4]=0x801220
Apr  8 15:33:20 molson kernel: pfm_restore_monitoring.1013: CPU0 [18168]
[18168] pmc[4]=0x801228
Apr  8 15:33:20 molson kernel: pfm_restore_monitoring.1013: CPU0 [18168]
[18168] pmc[4]=0x801228
Apr  8 15:33:20 molson kernel: pfm_mask_monitoring.928: CPU0 [18168]
pmc[4]=0x801220
Apr  8 15:33:20 molson kernel: pfm_restore_monitoring.1013: CPU0 [18168]
[18168] pmc[4]=0x801228
Apr  8 15:33:20 molson kernel: pfm_mask_monitoring.928: CPU0 [18168]
pmc[4]=0x801220
Apr  8 15:33:20 molson kernel: pfm_restore_monitoring.1013: CPU0 [18168]
[18168] pmc[4]=0x801228
Apr  8 15:33:20 molson kernel: pfm_mask_monitoring.928: CPU0 [18168]
pmc[4]=0x801220
Apr  8 15:33:20 molson kernel: pfm_restore_monitoring.1013: CPU0 [18168]
[18168] pmc[4]=0x801228
Apr  8 15:33:20 molson kernel: pfm_restore_monitoring.1013: CPU0 [18168]
[18168] pmc[4]=0x801228
Apr  8 15:33:20 molson kernel: pfm_mask_monitoring.928: CPU0 [18168]
pmc[4]=0x801220
Apr  8 15:33:20 molson kernel: pfm_restore_monitoring.1013: CPU0 [18168]
[18168] pmc[4]=0x801228
Apr  8 15:33:20 molson kernel: pfm_restore_monitoring.1013: CPU0 [18168]
[18168] pmc[4]=0x801228
Apr  8 15:33:20 molson kernel: pfm_mask_monitoring.928: CPU0 [18168]
pmc[4]=0x801220
Apr  8 15:33:20 molson kernel: pfm_restore_monitoring.1013: CPU0 [18168]
[18168] pmc[4]=0x801228
Apr  8 15:33:20 molson kernel: pfm_restore_monitoring.1013: CPU0 [18168]
[18168] pmc[4]=0x801228
Apr  8 15:33:20 molson kernel: pfm_mask_monitoring.928: CPU0 [18168]
pmc[4]=0x801220
Apr  8 15:33:20 molson kernel: pfm_restore_monitoring.1013: CPU0 [18168]
[18168] pmc[4]=0x801228
Apr  8 15:33:20 molson kernel: pfm_restore_monitoring.1013: CPU0 [18168]
[18168] pmc[4]=0x801228
Apr  8 15:33:20 molson kernel: pfm_mask_monitoring.928: CPU0 [18168]
pmc[4]=0x801220
Apr  8 15:33:21 molson kernel: pfm_restore_monitoring.1013: CPU0 [18168]
[18168] pmc[4]=0x801228
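The back-to-back mask/restore pattern can be pulled out mechanically rather
than by eye.  A hedged sketch: it assumes log lines matching the format
shown above, and "perfmon.log" is a placeholder for wherever your syslog
puts the pfm_* debug messages (e.g. /var/log/messages).

```shell
# Flag any two consecutive events of the same type on the perfmon debug
# stream: a mask while already masked, or a restore without an intervening
# mask.  "perfmon.log" is a placeholder path; substitute your syslog file.
cat perfmon.log 2>/dev/null |
grep -E 'pfm_(mask|restore)_monitoring' |
awk '{
  state = ($0 ~ /pfm_mask_monitoring/) ? "mask" : "restore"
  if (state == prev)
    printf "event %d: %s immediately after %s\n", NR, state, prev
  prev = state
}'
```

Run against the excerpts above, this prints one line per anomaly, which
makes it easy to count how often each condition occurs.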

I can also provide this file to anyone who would like to see the full
contents.

Are these conditions normal, or could they be a clue to the cause of this
problem?

Gary


Philip Mucci <[EMAIL PROTECTED]> wrote on 04/18/2008 12:49:46 AM:

> Gary,
>
> I'm CC'ing this to the HPCTOOLKIT folks to see if they have an idea.
> My guess is that there is some interaction with monitoring around the
> forks/execs. Mark Krentel from Rice raised this issue a few times and
> as far as I know, it remains unfixed. This is the old perfmon
> interface as far as I can tell.
>
> I'm no longer confident whether this is a monitoring/masking issue or a
> threading/signaling issue; it could also be that PAPI, while profiling,
> is not functioning properly across a fork/exec. Either way, you wouldn't
> see anything in the kernel logs. Is there any way we can make code.exe
> run without doing the fork/execs, just to see if it proceeds normally?
>
> Another question, is there a way to trick pfmon into dumping out the
> sample counts for individual processes? That would make it a heck of a
> lot easier to compare.
>
> Depending on how you built PAPI, you should be able to set
> PAPI_DEBUG=PROFILE,OVERFLOW to catch/see the samples. It's possible
> that may give you some indication when profiling stops being
> registered. Also, try passing -d 5 -v to hpcrun, that may also give
> some results.
>
> Let us know.
>
> Thanks.
>
> Phil
>
> P.S. You may want to forward the initial message or two to the HPC
> folks above...
>
> On Apr 17, 2008, at 10:05 PM, [EMAIL PROTECTED] wrote:
>
> > Phil
> >
> > I should have provided you with a little more information.  The data
> > files
> > created by the hpcrun have the following names:
> >
> > bash.CPU_CYCLES.molson.5636.0x1604
> > bash.CPU_CYCLES.molson.5637.0x1605
> > bash.CPU_CYCLES.molson.5638.0x1606
> > bash.CPU_CYCLES.molson.5813.0x16b5
> > code.exe.CPU_CYCLES.molson.5634.0x1602
> > gzip.CPU_CYCLES.molson.5637.0x1605
> > rm.CPU_CYCLES.molson.5638.0x1606
> > rm.CPU_CYCLES.molson.5813.0x16b5
> >
> > As the data files show, this executable runs 4 bash shell scripts, two
> > rm commands, and one gzip command, in addition to the program itself
> > (code.exe).  There are no thread files created, and I have run the
> > program under strace and cannot find any kernel calls to clone (which
> > I am told would show up if a pthread_create had been done).
> >
> > My understanding is that each of the things run by this application
> > runs as a separate process.  The data files show different PID's
> > (except where a bash shell is executing one of the commands).  I would
> > think that the problems handling signals in a multi-threaded process
> > should not apply here.  Am I wrong about this?
> >
> > The /proc/perfmon file shows:
> >
> > perfmon version           : 2.0
> > model                     : Montecito
> > fastctxsw                 : No
> > -- and lots more --
> >
> > The Papi version installed is:
> >
> > Name        : papi                         Relocations: /usr
> > Version     : 3.5.0                             Vendor: Bull
> > Release     : 1.b.3.Bull                    Build Date: Thu 05 Apr
> > 2007
> > 05:05:28 AM MST
> > Install Date: Thu 13 Mar 2008 08:32:31 AM MST      Build Host: apollon
> > Group       : Application                   Source RPM:
> > papi-3.5.0-1.b.3.Bull.src.rpm
> > -- and some more --
> >
> > This was created from Papi-3.5.0 but packaged by Bull into an RPM.
> >
> > I reran the test using a sample period of 16,000,000 and got the
> > following
> > results:
> >
> > real    1m38.636s
> > user    1m38.240s
> > sys     0m0.440s
> >
> > Output from hpcprof:
> >
> > Columns correspond to the following events [event:period (events/
> > sample)]
> >  CPU_OP_CYCLES_ALL:16000000 - CPU Operating Cycles -- All CPU cycles
> > counted (0 samples) [not shown]
> >  CPU_OP_CYCLES_ALL:16000000 - CPU Operating Cycles -- All CPU cycles
> > counted (0 samples) [not shown]
> >  CPU_OP_CYCLES_ALL:16000000 - CPU Operating Cycles -- All CPU cycles
> > counted (0 samples) [not shown]
> >  CPU_OP_CYCLES_ALL:16000000 - CPU Operating Cycles -- All CPU cycles
> > counted (0 samples) [not shown]
> >  CPU_OP_CYCLES_ALL:16000000 - CPU Operating Cycles -- All CPU cycles
> > counted (82 samples) [not shown]
> >  CPU_OP_CYCLES_ALL:16000000 - CPU Operating Cycles -- All CPU cycles
> > counted (47 samples) [not shown]
> >  CPU_OP_CYCLES_ALL:16000000 - CPU Operating Cycles -- All CPU cycles
> > counted (0 samples) [not shown]
> >  CPU_OP_CYCLES_ALL:16000000 - CPU Operating Cycles -- All CPU cycles
> > counted (0 samples) [not shown]
> >  CPU_OP_CYCLES_ALL (min):16000000 - CPU Operating Cycles -- All CPU
> > cycles
> > counted (The minimum for events of this type.) (0 samples)
> >  CPU_OP_CYCLES_ALL (max):16000000 - CPU Operating Cycles -- All CPU
> > cycles
> > counted (The maximum for events of this type.) (129 samples)
> >  CPU_OP_CYCLES_ALL (sum):16000000 - CPU Operating Cycles -- All CPU
> > cycles
> > counted (Summed over all events of this type.) (129 samples)
> >
> > Calculations with these numbers (129 samples times 16,000,000 cycles
> > per sample, divided by 1600 cycles per microsecond) show:
> >
> > 1,290,000 microseconds (or about 1.3 seconds of CPU time)
> >
> > This is still a very small percentage of the CPU time reported by
> > "time".  We have experimented with lots of different sample rates
> > (very large and very small), and all the ones we have tried produce
> > about the same results.
> >
> > When I run this test, the number of perfmon interrupts that occur (as
> > shown in /proc/interrupts) is very close to the total number of
> > samples shown in the hpcprof output.  That suggests to me that we are
> > not ignoring or losing interrupts.  Instead, it looks like perfmon has
> > masked monitoring (as part of handling an overflow) and then the
> > monitored process somehow got back into execution without monitoring
> > being restarted.  This likely happened during interrupt processing,
> > since that is where the code responsible for managing all this seems
> > to be.  What I do not know is whether this happened many times or just
> > once.  If it happened just once, I would expect the perfmon kernel
> > debug data to show the last interrupt processed masking monitoring but
> > never restarting it; in fact, the kernel debug data does not show
> > that.  This suggests to me that somehow we are allowing the monitored
> > process to execute with monitoring masked, so the hardware is not
> > counting the events.
> >
> > First, I would like to know whether you guys think the above analysis
> > is correct (after all, I am just a rookie and do not understand all of
> > this yet).
> >
> > Second, does anyone have suggestions that could help me detect that
> > the monitored process is being executed while monitoring is masked?
> > For instance, is there a place in the code that we always reach when
> > we finish processing an overflow, where I could add a check for
> > whether monitoring is masked and spit out some additional debug info
> > if it is?
> >
> > I am happy to do the work if you can point me in the right direction.
> >
> > Gary
> >
> >
> >
> >
> >
> > Philip Mucci <[EMAIL PROTECTED]> wrote on 04/16/2008 04:20 PM:
> > To: [EMAIL PROTECTED]
> > cc: [EMAIL PROTECTED], [EMAIL PROTECTED]
> > Subject: Re: [perfmon2] Pfmon and Hpcrun give inconsistent results.
> >
> > Hi Gary,
> >
> >>> Phil,
> >>>
> >>> On Wed, Apr 16, 2008 at 11:22 AM, Philip Mucci <[EMAIL PROTECTED]>
> >>> wrote:
> >>>> Folks,
> >>>>
> >>>> hpcrun does its sampling inside the target process using
> >>>> first-person access, not third-person ptrace() like pfmon, so the
> >>>> process is implicitly blocked while processing samples, i.e. there
> >>>> are no dropped samples unless something else has gone wrong.
> >>>>
> >>> Thanks for clarifying this. It makes more sense given how PAPI works.
> >>
> >> I think this means that I just lost the best explanation I have had
> >> so far
> >> as to why I see these inconsistencies.
> >>
> >
> > Yes, you are correct. The blocking is not an issue in hpcrun.
> >
> >>>
> >>>> Another thing: you cannot rely on hpcrun's sample count to compute
> >>>> cycles. Why? Because those are only the samples that have not been
> >>>> dropped. If samples occur outside of the sample space (as can
> >>>> happen when one has floating-point exceptions), the address will be
> >>>> in kernel space and the sample will be dropped. pfmon has no
> >>>> concept of filtering out addresses, so even if you ask for
> >>>> user-space samples, you'll still get samples in the output with
> >>>> kernel addresses. I'm not sure what the default is for your version
> >>>> of pfmon.
> >>>>
> >>> PFmon does not do filtering of samples. It relies on the hardware
> >>> via
> >>> the priv levels. By default, pfmon only measures at the user level.
> >>> That does not mean you won't get kernel-level samples because there
> >>> are boundary effects when sampling.
> >>>
> >>>> Which value is correct, according to /bin/time? 2 billion or 154
> >>>> billion?
> >>>>
> >>> This is a valid point. Which value makes most sense related to time?
> >>
> >>
> >> Let me provide all the steps I use when running these tests.  Maybe I
> >> am just doing something wrong and you can correct my
> >> misunderstanding.
> >>
> >> When I use "time" with my hpcrun test, it provides this information:
> >>
> >> time hpcrun -e CPU_CYCLES:32767 -o hpcrun.data -- ./code.exe >hpcrun
> >> 2>hpcrun.debug
> >> real    1m44.921s
> >> user    1m39.490s
> >> sys     0m2.420s
> >>
> >> The output from the hpcprof run on all of the data files produced by
> >> this
> >> test shows the following summary information:
> >>
> >> Columns correspond to the following events [event:period (events/
> >> sample)]
> >> CPU_CYCLES:32767 - CPU Cycles (29 samples) [not shown]
> >> CPU_CYCLES:32767 - CPU Cycles (9 samples) [not shown]
> >> CPU_CYCLES:32767 - CPU Cycles (29 samples) [not shown]
> >> CPU_CYCLES:32767 - CPU Cycles (29 samples) [not shown]
> >> CPU_CYCLES:32767 - CPU Cycles (41602 samples) [not shown]
> >> CPU_CYCLES:32767 - CPU Cycles (24490 samples) [not shown]
> >> CPU_CYCLES:32767 - CPU Cycles (7 samples) [not shown]
> >> CPU_CYCLES:32767 - CPU Cycles (5 samples) [not shown]
> >> CPU_CYCLES (min):32767 - CPU Cycles (The minimum for events of this
> >> type.) (1 samples)
> >> CPU_CYCLES (max):32767 - CPU Cycles (The maximum for events of this
> >> type.) (66127 samples)
> >> CPU_CYCLES (sum):32767 - CPU Cycles (Summed over all events of this
> >> type.) (66200 samples)
> >>
> >
> > Gary, this executable must be multi-threaded? Do each of the threads
> > do the same amount of work? If so, the above is your clue. Most of the
> > threads are experiencing the perfmon2 race where the signal comes in
> > but gets dropped and thus monitoring is not restarted. PAPI from CVS
> > has fixes in there for this on perfmon2 platforms. Is this OS using
> > perfmon2 or the old 'monolithic' perfmon interface? If this is
> > perfmon1, then we may have an issue here. But PAPI-CVS handles this
> > properly for perfmon2 by using a real-time signal.
> >
> > Judging from the other numbers you have below, I'd guess that if you
> > set the sample rate to something much lower (which is certainly
> > reasonable; a period of 32768 is awfully small for a 1600 MHz
> > processor), you'd get more reasonable results. Experience (from Mark
> > and the Rice folks) has shown that this signal dropping is much less
> > likely to happen when the interrupt load is low.
> >
> > So, I reckon if you set the sample period to 16,000,000
> > (approximately 100 samples/second), you'll get answers that match up.
> >
> > Phil
> >
> > P.S. Please get back to me on which version of PAPI and perfmon kernel
> > support you have.
> >
> >
> >
>


_______________________________________________
perfmon2-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/perfmon2-devel
