Philip and friends, I have gone from praying that someone out there would help to being overwhelmed with suggestions and opinions. Thank you; this is so much better than some other open-source environments I have worked in, where you put a message on a mailing list and then sit around getting old waiting for a response.
I had downloaded the perfmon2 package but have not installed it yet. My understanding of that package is that it changes the way signals are used in multi-threaded environments. Since we do not have that environment, I did not think it would help, but if someone thinks it might, I can install it and give it a try.

Unfortunately our customer is in a secure environment, so all I have is a copy of the binary executable for the application. It took us three months and a non-disclosure agreement with them to get even that much. Since I do not have source, I cannot modify the application to skip doing the fork/execs. As Stephane pointed out in another email on this chain, pfmon (by default) does not count events for the forked processes.

The values reported by pfmon in sampling mode seem to account for the CPU time used by the application. When pfmon is run in sampling mode on this application, it reports 4788780 samples. When hpcrun is used, it reports only 41602 samples (in the code.exe data file). Both runs were done with a sample period of 32767.

I have PAPI built with debug and have run it with PAPI_DEBUG set to PAPI_DEBUG=SUBSTRATE,API,INTERNAL,MULTIPLEX,OVERFLOW,PROFILE. I have looked through the debug output, but this untrained eye did not see anything "unusual". When I try to put attachments that size on the mailing list, the message gets rejected for being too big. If you would be willing to look at it, I can send it directly to your mailbox.

I also spent some time yesterday looking through the kernel messages generated by perfmon during the hpcrun test, and I found two things in that log that looked unusual. I found several places where monitoring was being masked when it was already masked, and lots of places where monitoring was being restarted without having been masked. I found these by grepping the file for all places where pmc[4] was being displayed.
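As a quick sanity check on those sample counts: samples times period should roughly equal the CPU cycles the profile accounts for, which in turn should match user CPU time times clock rate. A minimal sketch, assuming the 1.6 GHz (1600 MHz) clock rate mentioned later in this thread:

```python
# Sanity check: samples * period ~= CPU cycles attributed by the profile,
# which should match user CPU time * clock rate if no samples are lost.
# The 1.6 GHz clock rate is an assumption taken from the "1600Mhz
# processor" remark elsewhere in this thread.
CLOCK_HZ = 1.6e9
PERIOD = 32767  # sample period used in both runs

def attributed_seconds(samples, period=PERIOD, clock_hz=CLOCK_HZ):
    """CPU seconds accounted for by `samples` overflow samples."""
    return samples * period / clock_hz

pfmon_s = attributed_seconds(4788780)   # roughly 98 s
hpcrun_s = attributed_seconds(41602)    # under 1 s
```

By this estimate, pfmon's samples account for essentially all of the roughly 98 seconds of user time reported by /bin/time, while the hpcrun profile accounts for less than a second of it, which quantifies the size of the discrepancy.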
Excerpts from this grep output follow:

Apr 8 15:33:22 molson kernel: pfm_mask_monitoring.928: CPU0 [18168] pmc[4]=0x801220
Apr 8 15:33:22 molson kernel: pfm_restore_monitoring.1013: CPU0 [18168] [18168] pmc[4]=0x801228
Apr 8 15:33:22 molson kernel: pfm_mask_monitoring.928: CPU0 [18168] pmc[4]=0x801220
Apr 8 15:33:22 molson kernel: pfm_mask_monitoring.928: CPU0 [18168] pmc[4]=0x801220
Apr 8 15:33:22 molson kernel: pfm_restore_monitoring.1013: CPU0 [18168] [18168] pmc[4]=0x801228
Apr 8 15:33:23 molson kernel: pfm_mask_monitoring.928: CPU0 [18168] pmc[4]=0x801220
Apr 8 15:33:20 molson kernel: pfm_mask_monitoring.928: CPU0 [18168] pmc[4]=0x801220
Apr 8 15:33:20 molson kernel: pfm_restore_monitoring.1013: CPU0 [18168] [18168] pmc[4]=0x801228
Apr 8 15:33:20 molson kernel: pfm_mask_monitoring.928: CPU0 [18168] pmc[4]=0x801220
Apr 8 15:33:20 molson kernel: pfm_restore_monitoring.1013: CPU0 [18168] [18168] pmc[4]=0x801228
Apr 8 15:33:20 molson kernel: pfm_restore_monitoring.1013: CPU0 [18168] [18168] pmc[4]=0x801228
Apr 8 15:33:20 molson kernel: pfm_mask_monitoring.928: CPU0 [18168] pmc[4]=0x801220
Apr 8 15:33:20 molson kernel: pfm_restore_monitoring.1013: CPU0 [18168] [18168] pmc[4]=0x801228
Apr 8 15:33:20 molson kernel: pfm_mask_monitoring.928: CPU0 [18168] pmc[4]=0x801220
Apr 8 15:33:20 molson kernel: pfm_restore_monitoring.1013: CPU0 [18168] [18168] pmc[4]=0x801228
Apr 8 15:33:20 molson kernel: pfm_mask_monitoring.928: CPU0 [18168] pmc[4]=0x801220
Apr 8 15:33:20 molson kernel: pfm_restore_monitoring.1013: CPU0 [18168] [18168] pmc[4]=0x801228
Apr 8 15:33:20 molson kernel: pfm_restore_monitoring.1013: CPU0 [18168] [18168] pmc[4]=0x801228
Apr 8 15:33:20 molson kernel: pfm_mask_monitoring.928: CPU0 [18168] pmc[4]=0x801220
Apr 8 15:33:20 molson kernel: pfm_restore_monitoring.1013: CPU0 [18168] [18168] pmc[4]=0x801228
Apr 8 15:33:20 molson kernel: pfm_restore_monitoring.1013: CPU0 [18168] [18168] pmc[4]=0x801228
Apr 8 15:33:20 molson kernel: pfm_mask_monitoring.928: CPU0 [18168] pmc[4]=0x801220
Apr 8 15:33:20 molson kernel: pfm_restore_monitoring.1013: CPU0 [18168] [18168] pmc[4]=0x801228
Apr 8 15:33:20 molson kernel: pfm_restore_monitoring.1013: CPU0 [18168] [18168] pmc[4]=0x801228
Apr 8 15:33:20 molson kernel: pfm_mask_monitoring.928: CPU0 [18168] pmc[4]=0x801220
Apr 8 15:33:20 molson kernel: pfm_restore_monitoring.1013: CPU0 [18168] [18168] pmc[4]=0x801228
Apr 8 15:33:20 molson kernel: pfm_restore_monitoring.1013: CPU0 [18168] [18168] pmc[4]=0x801228
Apr 8 15:33:20 molson kernel: pfm_mask_monitoring.928: CPU0 [18168] pmc[4]=0x801220
Apr 8 15:33:21 molson kernel: pfm_restore_monitoring.1013: CPU0 [18168] [18168] pmc[4]=0x801228

I can also provide this file to anyone who would like to see the full contents. Are these conditions normal, or could they be a clue to the cause of this problem?

Gary

Philip Mucci <[EMAIL PROTECTED]> wrote on 04/18/2008 12:49:46 AM:

> Gary,
>
> I'm CC'ing this to the HPCTOOLKIT folks to see if they have an idea. My guess is that there is some interaction with monitoring around the forks/execs. Mark Krentel from Rice raised this issue a few times and, as far as I know, it remains unfixed. This is the old perfmon interface as far as I can tell.
>
> I'm no longer confident whether this is a monitoring/masking issue, a threading/signaling issue, or PAPI not functioning properly across a fork/exec while profiling. Either way, you wouldn't see anything in the kernel logs. Is there any way we can make code.exe run without doing the fork/execs, just to see if it proceeds normally?
>
> Another question: is there a way to trick pfmon into dumping out the sample counts for individual processes? That would make it a heck of a lot easier to compare.
>
> Depending on how you built PAPI, you should be able to set PAPI_DEBUG=PROFILE,OVERFLOW to catch/see the samples. It's possible that may give you some indication of when profiling stops being registered.
> Also, try passing -d 5 -v to hpcrun; that may also give some results.
>
> Let us know.
>
> Thanks.
>
> Phil
>
> P.S. You may want to forward the initial message or two to the HPC folks above...
>
> On Apr 17, 2008, at 10:05 PM, [EMAIL PROTECTED] wrote:
>
> > Phil
> >
> > I should have provided you with a little more information. The data files created by hpcrun have the following names:
> >
> > bash.CPU_CYCLES.molson.5636.0x1604
> > bash.CPU_CYCLES.molson.5637.0x1605
> > bash.CPU_CYCLES.molson.5638.0x1606
> > bash.CPU_CYCLES.molson.5813.0x16b5
> > code.exe.CPU_CYCLES.molson.5634.0x1602
> > gzip.CPU_CYCLES.molson.5637.0x1605
> > rm.CPU_CYCLES.molson.5638.0x1606
> > rm.CPU_CYCLES.molson.5813.0x16b5
> >
> > As the data files show, this executable runs 4 bash shell scripts, does two remove commands, and does one gzip command, in addition to the output for the program itself (code.exe). There are no thread files created, and I have run the program with strace and cannot find any kernel calls to clone (which I am told would show up if a pthread_create had been done).
> >
> > My understanding is that each of the things run by this application runs as a separate process. The data files show different PIDs (except where a bash shell is executing one of the commands). I would think that the problems handling signals in a multi-threaded process should not apply here. Am I wrong about this?
> > The /proc/perfmon file shows:
> >
> > perfmon version : 2.0
> > model           : Montecito
> > fastctxsw       : No
> > -- and lots more --
> >
> > The PAPI version installed is:
> >
> > Name        : papi          Relocations: /usr
> > Version     : 3.5.0         Vendor: Bull
> > Release     : 1.b.3.Bull    Build Date: Thu 05 Apr 2007 05:05:28 AM MST
> > Install Date: Thu 13 Mar 2008 08:32:31 AM MST    Build Host: apollon
> > Group       : Application   Source RPM: papi-3.5.0-1.b.3.Bull.src.rpm
> > -- and some more --
> >
> > This was created from PAPI-3.5.0 but packaged by Bull into an RPM.
> >
> > I reran the test using a sample period of 16,000,000 and got the following results:
> >
> > real 1m38.636s
> > user 1m38.240s
> > sys  0m0.440s
> >
> > Output from hpcprof:
> >
> > Columns correspond to the following events [event:period (events/sample)]
> > CPU_OP_CYCLES_ALL:16000000 - CPU Operating Cycles -- All CPU cycles counted (0 samples) [not shown]
> > CPU_OP_CYCLES_ALL:16000000 - CPU Operating Cycles -- All CPU cycles counted (0 samples) [not shown]
> > CPU_OP_CYCLES_ALL:16000000 - CPU Operating Cycles -- All CPU cycles counted (0 samples) [not shown]
> > CPU_OP_CYCLES_ALL:16000000 - CPU Operating Cycles -- All CPU cycles counted (0 samples) [not shown]
> > CPU_OP_CYCLES_ALL:16000000 - CPU Operating Cycles -- All CPU cycles counted (82 samples) [not shown]
> > CPU_OP_CYCLES_ALL:16000000 - CPU Operating Cycles -- All CPU cycles counted (47 samples) [not shown]
> > CPU_OP_CYCLES_ALL:16000000 - CPU Operating Cycles -- All CPU cycles counted (0 samples) [not shown]
> > CPU_OP_CYCLES_ALL:16000000 - CPU Operating Cycles -- All CPU cycles counted (0 samples) [not shown]
> > CPU_OP_CYCLES_ALL (min):16000000 - CPU Operating Cycles -- All CPU cycles counted (The minimum for events of this type.) (0 samples)
> > CPU_OP_CYCLES_ALL (max):16000000 - CPU Operating Cycles -- All CPU cycles counted (The maximum for events of this type.) (129 samples)
> > CPU_OP_CYCLES_ALL (sum):16000000 - CPU Operating Cycles -- All CPU cycles counted (Summed over all events of this type.) (129 samples)
> >
> > Calculations with these numbers (129 samples times 16,000,000 cycles per sample, divided by 1600 MHz) show: 1290000 microseconds (or about 1.3 total seconds of CPU time).
> >
> > This is still a very small percentage of the CPU time reported by "time". Actually we have experimented with lots of different sample rates (very large and very small), and all the ones we have tried seem to produce about the same results.
> >
> > When I run this test, the number of perfmon interrupts that occur (as shown in /proc/interrupts) is very close to the total number of samples shown in the hpcprof output. That suggests to me that we are not ignoring or losing interrupts. Instead, it looks to me like perfmon has masked monitoring (as part of handling an overflow) and then the monitored process somehow got back into execution without monitoring being restarted. This likely happened during interrupt processing, since that is where the code responsible for managing all this seems to be. What I do not know is whether this has happened many times or just once. If it had happened just once, I would expect the perfmon kernel debug data to show the last interrupt processed masking monitoring but never restoring it. In fact, the kernel debug data does not show that. This suggests to me that somehow we are allowing the monitored process to execute with monitoring masked, so the hardware is not counting the events.
> >
> > First, I would like to know if you guys think the above analysis is correct (after all, I am just a rookie and do not understand all this stuff yet).
> >
> > And second, if anyone has any suggestions that could help me detect that the monitored process is being executed while monitoring is masked.
> > For instance, is there a place in the code where we always go when we are done processing an overflow, where I may be able to add a check to see if monitoring is masked and spit out some additional debug info if that happens?
> >
> > I am happy to do the work if you can point me in the right direction.
> >
> > Gary
> >
> > On 04/16/2008 04:20 PM, Philip Mucci <[EMAIL PROTECTED]> wrote (Subject: Re: [perfmon2] Pfmon and Hpcrun give inconsistent results.):
> >
> > Hi Gary,
> >
> >>> Phil,
> >>>
> >>> On Wed, Apr 16, 2008 at 11:22 AM, Philip Mucci <[EMAIL PROTECTED]> wrote:
> >>>> Folks,
> >>>>
> >>>> hpcrun does its sampling inside the target process using first-person access, not a third-person ptrace() like pfmon, so the process is implicitly blocked when processing samples, i.e. there are no dropped samples unless something else has gone wrong.
> >>>>
> >>> Thanks for clarifying this. It makes more sense given how PAPI works.
> >>
> >> I think this means that I just lost the best explanation I have had so far as to why I see these inconsistencies.
> >>
> >
> > Yes, you are correct. The blocking is not an issue in hpcrun.
> >
> >>>> Another thing: you cannot rely on the sample count of hpcrun to compute cycles. Why? Because those are only the samples that have not been dropped. If samples occur outside of the sample space (as can happen when one has floating point exceptions), the address will be in kernel space and it will be dropped. pfmon has no concept of filtering out addresses, so even if you ask for user-space samples, you'll still get samples in the output with kernel addresses. I'm not sure what the default is for your version of pfmon.
> >>> pfmon does not do filtering of samples. It relies on the hardware via the priv levels. By default, pfmon only measures at the user level. That does not mean you won't get kernel-level samples, because there are boundary effects when sampling.
> >>>
> >>>> Which value is correct, according to /bin/time? 2 billion or 154 billion?
> >>>>
> >>> This is a valid point. Which value makes the most sense related to time?
> >>
> >> Let me provide all the steps I use when running these tests. Maybe I am just doing something wrong and you can correct my misunderstanding.
> >>
> >> When I use "time" with my hpcrun test, it provides this information:
> >>
> >> time hpcrun -e CPU_CYCLES:32767 -o hpcrun.data -- ./code.exe >hpcrun 2>hpcrun.debug
> >>
> >> real 1m44.921s
> >> user 1m39.490s
> >> sys  0m2.420s
> >>
> >> The output from the hpcprof run on all of the data files produced by this test shows the following summary information:
> >>
> >> Columns correspond to the following events [event:period (events/sample)]
> >> CPU_CYCLES:32767 - CPU Cycles (29 samples) [not shown]
> >> CPU_CYCLES:32767 - CPU Cycles (9 samples) [not shown]
> >> CPU_CYCLES:32767 - CPU Cycles (29 samples) [not shown]
> >> CPU_CYCLES:32767 - CPU Cycles (29 samples) [not shown]
> >> CPU_CYCLES:32767 - CPU Cycles (41602 samples) [not shown]
> >> CPU_CYCLES:32767 - CPU Cycles (24490 samples) [not shown]
> >> CPU_CYCLES:32767 - CPU Cycles (7 samples) [not shown]
> >> CPU_CYCLES:32767 - CPU Cycles (5 samples) [not shown]
> >> CPU_CYCLES (min):32767 - CPU Cycles (The minimum for events of this type.) (1 samples)
> >> CPU_CYCLES (max):32767 - CPU Cycles (The maximum for events of this type.) (66127 samples)
> >> CPU_CYCLES (sum):32767 - CPU Cycles (Summed over all events of this type.) (66200 samples)
> >>
> > Gary, this executable must be multi-threaded? Does each of the threads do the same amount of work?
> > If so, the above is your clue. Most of the threads are experiencing the perfmon2 race where the signal comes in but gets dropped, and thus monitoring is not restarted. PAPI from CVS has fixes for this on perfmon2 platforms. Is this OS using perfmon2 or the old 'monolithic' perfmon interface? If this is perfmon1, then we may have an issue here. But PAPI-CVS handles this properly for perfmon2 by using a real-time signal.
> >
> > Judging from the other numbers you have below, I'd guess that if you set the sample rate to something much lower (which is certainly reasonable; 32768 is awfully small for a 1600 MHz processor), you'd get more reasonable results. Experience (from Mark and the Rice folks) has shown that this signal dropping is much less likely to happen when the interrupt load is low.
> >
> > So, I reckon if you set the sample period to 16,000,000 (approximately 100 samples/second), you'll get answers that match up.
> >
> > Phil
> >
> > P.S. Please get back to me on which version of PAPI and perfmon kernel support you have.

_______________________________________________
perfmon2-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/perfmon2-devel
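The mask/restore sequencing that Gary greps for near the top of the thread can also be checked mechanically. A minimal sketch: the log-line format below is inferred from the excerpts quoted in this thread and may differ on other kernels, and events before the start of the log can cause spurious flags on the first restore seen for a PID.

```python
import re

# Flag suspicious mask/restore ordering in perfmon kernel debug output.
# Log format is inferred from the excerpts in this thread (an assumption);
# adjust the regex if your kernel prints these lines differently.
LINE_RE = re.compile(r"pfm_(mask|restore)_monitoring\.\d+: (CPU\d+) \[(\d+)\]")

def find_anomalies(lines):
    """Return (lineno, kind) pairs where monitoring is masked while already
    masked, or restored while not masked, tracked per PID.
    Note: state before the first logged event is unknown, so an initial
    restore for a PID may be flagged spuriously."""
    masked = {}          # pid -> currently masked?
    anomalies = []
    for n, line in enumerate(lines, 1):
        m = LINE_RE.search(line)
        if not m:
            continue
        op, _cpu, pid = m.groups()
        if op == "mask":
            if masked.get(pid):
                anomalies.append((n, "mask while already masked"))
            masked[pid] = True
        else:  # restore
            if not masked.get(pid, False):
                anomalies.append((n, "restore without prior mask"))
            masked[pid] = False
    return anomalies
```

Run over the full kernel log, this would list every line where a double mask or an unpaired restore occurs, rather than eyeballing the grep output.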
