On Mon, Sep 24, 2018 at 04:09:09PM +0300, Alexey Budankov wrote:
> Hi,
> 
> On 24.09.2018 10:02, Alexey Budankov wrote:
> > Hi,
> > 
> > On 23.09.2018 22:30, Jiri Olsa wrote:
> >> On Fri, Sep 21, 2018 at 09:13:08AM +0300, Alexey Budankov wrote:
> >>
> >> SNIP
> >>
> >>> Events:
> >>> cpu/period=P,event=0x3c/Duk;CPU_CLK_UNHALTED.THREAD
> >>> cpu/period=P,umask=0x3/Duk;CPU_CLK_UNHALTED.REF_TSC
> >>> cpu/period=P,event=0xc0/Duk;INST_RETIRED.ANY
> >>> cpu/period=0xaae61,event=0xc2,umask=0x10/uk;UOPS_RETIRED.ALL
> >>> cpu/period=0x11171,event=0xc2,umask=0x20/uk;UOPS_RETIRED.SCALAR_SIMD
> >>> cpu/period=0x11171,event=0xc2,umask=0x40/uk;UOPS_RETIRED.PACKED_SIMD
> >>>
> >>> =================================================
> >>>
> >>> Command:
> >>> /usr/bin/time /tmp/vtune_amplifier_2019.574715/bin64/perf.thr record 
> >>> --threads=T \
> >>>   -a -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \
> >>>         -e cpu/period=P,event=0x3c/Duk,\
> >>>            cpu/period=P,umask=0x3/Duk,\
> >>>            cpu/period=P,event=0xc0/Duk,\
> >>>            cpu/period=0x30d40,event=0xc2,umask=0x10/uk,\
> >>>            cpu/period=0x4e20,event=0xc2,umask=0x20/uk,\
> >>>            cpu/period=0x4e20,event=0xc2,umask=0x40/uk \
> >>>          --clockid=monotonic_raw -- ./matrix.(icc|gcc)
> >>
> >> hum, so I guess the results suck because of the -a option,
> >> getting extra samples for all the perf record threads
> >>
> >> could you try without the -a? you monitor only user events,
> >> so you're interested only in ./matrix.* samples, right?
> > 
> > Ok, trying without -a, in per-process mode. 
> 
> Command:
> 
> /usr/bin/time ./perf.thr record --threads=T \
>       -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \
>       -e cpu/period=P,event=0x3c/Duk,\
>          cpu/period=P,umask=0x3/Duk,\
>          cpu/period=P,event=0xc0/Duk,\
>          cpu/period=0xaae61,event=0xc2,umask=0x10/uk,\
>          cpu/period=0x11171,event=0xc2,umask=0x20/uk,\
>          cpu/period=0x11171,event=0xc2,umask=0x40/uk \
>       --clockid=monotonic_raw -- ./matrix.gcc
> 
> Workload: matrix multiplication in 128 threads
> 
> T : 272
>       P (period, ms)       : 0.35 
>       runtime overhead (%) : 13x ~ 87.73 / 6.81

how do you meassure this?

>       data loss (%)        : 0
>       LOST events          : 36
>       SAMPLE events        : 8048542
>         perf.data size (GiB) : 10

any idea why does it have some much more samples?

thanks,
jirka

Reply via email to