Stephane

I sent this response directly to your email address when I meant to post it
back to the mailing list.  I
have added to the end some information related to running pfmon in sampling
mode and am now
posting it to the mailing list.  Sorry for the incorrect routing that
caused you to get multiple copies.

Stephane, Phil

Responses imbeded below.

"stephane eranian" <[EMAIL PROTECTED]> wrote on 04/16/2008 02:31:49
AM:

> Phil,
>
> On Wed, Apr 16, 2008 at 11:22 AM, Philip Mucci <[EMAIL PROTECTED]> wrote:
> > Folks,
> >
> >  hpcrun does it's sampling inside the target process using first person
> >  access, not a 3rd person ptrace() like pfmon, so the process is
> >  implicitly blocked when processing samples, i.e. there are no dropped
> >  samples unless something else has gone wrong.
> >
> Thanks for clarifing this. It makes more sense given how PAPI works.

I think this means that I just lost the best explanation I have had so far
as to why I see these inconsistencies.

>
> >  Another thing, you cannot rely on the sample count of hpcrun to
> >  compute cycles. Why? Because those are samples that only have not been
> >  dropped. If samples occur ourside of the sample space (as can happen
> >  when one has floating point exceptions), the address will be in kernel
> >  space and it will be dropped. pfmon has no concept of filtering out
> >  addresses, so even if you ask for user-space samples, you'll still get
> >  samples in the output with kernel addresses. I'm not sure what the
> >  default is for your version of pfmon.
> >
> PFmon does not do filtering of samples. It relies on the hardware via
> the priv levels. By default, pfmon only measures at the user level.
> That does not mean you won't get kernel-level samples because there
> are boundary effects when sampling.
>
> >  Which value is correct, according to /bin/time? 2Billion or 154
Billion?
> >
> This is a valid point. Which value makes most sense related to time?


Let me provide all the steps I use when running these tests.  Maybe I
am just doing something wrong and you can correct my misunderstanding.

When I use "time" with my hpcrun test, it provides this information:

time hpcrun -e CPU_CYCLES:32767 -o hpcrun.data -- ./code.exe >hpcrun
2>hpcrun.debug
real    1m44.921s
user    1m39.490s
sys     0m2.420s

The output from the hpcprof run on all of the data files produced by this
test shows the following summary information:

Columns correspond to the following events [event:period (events/sample)]
  CPU_CYCLES:32767 - CPU Cycles (29 samples) [not shown]
  CPU_CYCLES:32767 - CPU Cycles (9 samples) [not shown]
  CPU_CYCLES:32767 - CPU Cycles (29 samples) [not shown]
  CPU_CYCLES:32767 - CPU Cycles (29 samples) [not shown]
  CPU_CYCLES:32767 - CPU Cycles (41602 samples) [not shown]
  CPU_CYCLES:32767 - CPU Cycles (24490 samples) [not shown]
  CPU_CYCLES:32767 - CPU Cycles (7 samples) [not shown]
  CPU_CYCLES:32767 - CPU Cycles (5 samples) [not shown]
  CPU_CYCLES (min):32767 - CPU Cycles (The minimum for events of this
type.) (1 samples)
  CPU_CYCLES (max):32767 - CPU Cycles (The maximum for events of this
type.) (66127 samples)
  CPU_CYCLES (sum):32767 - CPU Cycles (Summed over all events of this
type.) (66200 samples)

When I multiply the number of samples (66200) times the sample
period I get the total number of CPU_CYCLES of:

2169175400

The file /proc/cpuinfo shows (this is repeated for 7 more processors):

processor  : 0
vendor     : GenuineIntel
arch       : IA-64
family     : Itanium 2
model      : 2
revision   : 1
archrev    : 0
features   : branchlong
cpu number : 0
cpu regs   : 4
cpu MHz    : 1600.000442
itc MHz    : 1600.442800
BogoMIPS   : 2398.61
siblings   : 1
physical id: 24576
LID        : 0x00000060

When I divide this by the cpu frequency (1600) I get:

1355734.625 (or 1.3 seconds)

When I use "time" with my pfmon test, it provides this information:

pfmon --debug -v -e CPU_CYCLES ./code.exe >pfmon 2>pfmon.debug
real    1m38.834s
user    1m38.360s
sys     0m0.490s

The output from the pfmon test run shows:

[10489] monitoring ./code.exe ...
results are on terminal
[10489] activating monitoring
measurements started at Wed Apr 16 09:12:59 2008

[10489] task exited
156255456222 CPU_CYCLES
[10489] detached
created tasks        : 1
maximum tasks        : 1
maximum active tasks : 1
measurements completed at Wed Apr 16 09:14:38 2008

When I divide this by the cpu frequency (1600) I get:

97659660.13875 (or 97.6 seconds)

When I use "time" with pfmon (with sampling), it provides this information:

pfmon --debug -v -e CPU_CYCLES --long-smpl-periods=32767
--smpl-outfile=pfmons.data ./code.exe >pfmons 2>pfmons.debug
real    1m55.932s
user    1m53.370s
sys     0m2.350s

The output from the pfmon (with sampling) test run shows:

loaded 16582 text symbols from /proc/kallsyms
loaded 4566 data symbols from /proc/kallsyms
table_size=188319 effective=88389 min_addr=0x40000000004a82f0
loaded 88389 text symbols from ELF file ./code.exe
table_size=188319 effective=9030 min_addr=0x4000000003138f00
loaded 9030 data symbols from ELF file ./code.exe
measuring at user privilege level ONLY
1 event set(s) defined
long  sampling periods(val/mask/seed): 32767/0x0/0
short sampling periods(val/mask/seed): 32767/0x0/0
using detailed-ia64 sampling module
pfmlib_itanium2.c (pfm_ita2_dispatch_counters.484): ev[0]=CPU_CYCLES
counters=0xf0
pfmlib_itanium2.c (pfm_ita2_dispatch_counters.500): max_l0=8 max_l1=4
max_l2=4 max_l3=4
4  (-1,-1): CPU_CYCLES
[PMC4(pmc4)=0x001228 thres=0 es=0x12 plm=8 umask=0x0 pm=0 ism=0x0 oi=1]
CPU_CYCLES
[PMD4(pmd4)]
[PMC15(pmc15)=0xffffffff ibrp0_pmc8=1 ibrp1_pmc9=1 ibrp2_pmc8=1
ibrp3_pmc9=1]
pfmlib_itanium2.c (pfm_dispatch_btb.949): found_btb=0 found_bar_dear=0
pmd setup for event set0:
[pmd4 set=0 ival=0xffffffffffff8001 long_rate=0xffffffffffff8001
short_rate=0xffffffffffff8001 mask=0x0 seed=0 randomize=n]
[pmd4 set=0 smpl_pmds=0x0 reset_pmds=0x0]
sampling buffer #entries=15437341 size=617493912, max_entry_size=40
vCPU1 -> pCPU1
vCPU0 -> pCPU0
vCPU6 -> pCPU6
vCPU2 -> pCPU2
vCPU3 -> pCPU3
vCPU4 -> pCPU4
exec-pattern=*
vCPU5 -> pCPU5
vCPU7 -> pCPU7
[14370] started task: ./code.exe
follow_exec=n follow_vfork=n follow_fork=n follow_pthread=n
[14370] monitoring ./code.exe ...
[14370] sampling results will be in file "pfmon.data"
[14370] activating monitoring
measurements started at Wed Apr 16 14:03:21 2008

[14370] task exited
[14370] 4804705 samples collected (0 buffer overflows)
[14370] detached
created tasks        : 1
maximum tasks        : 1
maximum active tasks : 1
measurements completed at Wed Apr 16 14:05:17 2008

Samples (4804705) times sample period (32767) divided by CPU frequency
(1600) gives:

98397355.459375 (or 98.3 seconds)


The pfmon test (both with and without sampling) reports CPU cycle
counts that seem to account for the cpu time used during the test.
In the sampling case I can see a slight loss (113 seconds used and
98 seconds accounted for) which is probably explained by the
sampling blind spot Stehpane described in previous email.

The hpcrun test reports a CPU cycle count that only seems to represent
about 1 to 2 percent of the cpu time used during the test.  Phil pointed
out that samples with kernel addresses would be droped by hpcrun but in
is it really reasonable to assume that 98 percent of the samples happened
in kernel space when only about 3 percent of the processing time happens
in the kernel ?

I guess I still do not understand why hpcrun does not report sample counts
that account for the CPU time used in user space ?

Thanks again for any information that may help solve this mystery.

Gary


-------------------------------------------------------------------------
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
_______________________________________________
perfmon2-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/perfmon2-devel

Reply via email to