Hi,

On 24.09.2018 10:02, Alexey Budankov wrote:
> Hi,
> 
> On 23.09.2018 22:30, Jiri Olsa wrote:
>> On Fri, Sep 21, 2018 at 09:13:08AM +0300, Alexey Budankov wrote:
>>
>> SNIP
>>
>>> Events:
>>> cpu/period=P,event=0x3c/Duk;CPU_CLK_UNHALTED.THREAD
>>> cpu/period=P,umask=0x3/Duk;CPU_CLK_UNHALTED.REF_TSC
>>> cpu/period=P,event=0xc0/Duk;INST_RETIRED.ANY
>>> cpu/period=0xaae61,event=0xc2,umask=0x10/uk;UOPS_RETIRED.ALL
>>> cpu/period=0x11171,event=0xc2,umask=0x20/uk;UOPS_RETIRED.SCALAR_SIMD
>>> cpu/period=0x11171,event=0xc2,umask=0x40/uk;UOPS_RETIRED.PACKED_SIMD
>>>
>>> =================================================
>>>
>>> Command:
>>> /usr/bin/time /tmp/vtune_amplifier_2019.574715/bin64/perf.thr record --threads=T \
>>>     -a -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \
>>>         -e cpu/period=P,event=0x3c/Duk,\
>>>            cpu/period=P,umask=0x3/Duk,\
>>>            cpu/period=P,event=0xc0/Duk,\
>>>            cpu/period=0x30d40,event=0xc2,umask=0x10/uk,\
>>>            cpu/period=0x4e20,event=0xc2,umask=0x20/uk,\
>>>            cpu/period=0x4e20,event=0xc2,umask=0x40/uk \
>>>          --clockid=monotonic_raw -- ./matrix.(icc|gcc)
>>
>> hum, so I guess the results suck because of the -a option,
>> getting extra samples for all the perf record threads
>>
>> could you try without the -a? you monitor only user events,
>> so you're interested only in ./matrix.* samples, right?
> 
> Ok, trying without -a, in per-process mode. 

Command:

/usr/bin/time ./perf.thr record --threads=T \
        -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \
        -e cpu/period=P,event=0x3c/Duk,\
           cpu/period=P,umask=0x3/Duk,\
           cpu/period=P,event=0xc0/Duk,\
           cpu/period=0xaae61,event=0xc2,umask=0x10/uk,\
           cpu/period=0x11171,event=0xc2,umask=0x20/uk,\
           cpu/period=0x11171,event=0xc2,umask=0x40/uk \
        --clockid=monotonic_raw -- ./matrix.gcc

Workload: matrix multiplication in 128 threads

T : 272
        P (period, ms)       : 0.35 
        runtime overhead (%) : 13x ~ 87.73 / 6.81
        data loss (%)        : 0
        LOST events          : 36
        SAMPLE events        : 8048542
        perf.data size (GiB) : 10

T : 128
        P (period, ms)       : 0.35 
        runtime overhead (%) : 10x ~ 71.12 / 6.81
        data loss (%)        : 0
        LOST events          : 2
        SAMPLE events        : 6524363
        perf.data size (GiB) : 8

T : 64
        P (period, ms)       : 0.35 
        runtime overhead (%) : 10x ~ 71.89 / 6.81
        data loss (%)        : 0
        LOST events          : 2
        SAMPLE events        : 7160623
        perf.data size (GiB) : 9
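
For reference, the "Nx" figures above read as the ratio of the profiled run time to the baseline run time. A minimal sketch of that arithmetic, assuming both numbers are elapsed times as reported by /usr/bin/time (the unit is not stated here) and that 6.81 is the run without perf; the names below are illustrative only:

```python
# Assumption: 6.81 is the elapsed time of ./matrix.gcc without profiling,
# and the first number in each "A / 6.81" pair is the time under perf.thr.
BASELINE = 6.81


def slowdown(profiled, baseline=BASELINE):
    """Return how many times slower the profiled run is than the baseline."""
    return profiled / baseline


# Elapsed times copied from the --threads=T results above.
threads_runs = {272: 87.73, 128: 71.12, 64: 71.89}

ratios = {t: round(slowdown(elapsed), 2) for t, elapsed in threads_runs.items()}

for t, ratio in ratios.items():
    print(f"T={t}: {ratio}x")
```

The computed ratios land close to the rounded 13x and 10x values listed above.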

=================================================

Command:

/usr/bin/time ./perf.aio record --aio=N \
        -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \
        -e cpu/period=P,event=0x3c/Duk,\
           cpu/period=P,umask=0x3/Duk,\
           cpu/period=P,event=0xc0/Duk,\
           cpu/period=0xaae61,event=0xc2,umask=0x10/uk,\
           cpu/period=0x11171,event=0xc2,umask=0x20/uk,\
           cpu/period=0x11171,event=0xc2,umask=0x40/uk \
        --clockid=monotonic_raw ./matrix.gcc

Workload: matrix multiplication in 128 threads

N : 512
        P (period, ms)       : 1.5
        runtime overhead (%) : 2.8x ~ 19.20 / 6.81
        data loss (%)        : 0
        LOST events          : 0
        SAMPLE events        : 1094976
        perf.data size (GiB) : 1.3

N : 272
        P (period, ms)       : 1.5
        runtime overhead (%) : 3.3x ~ 22.34 / 6.81
        data loss (%)        : 0
        LOST events          : 0
        SAMPLE events        : 1089252
        perf.data size (GiB) : 1.3
  
N : 128
        P (period, ms)       : 1.5
        runtime overhead (%) : 2.6x ~ 15.15 / 6.81
        data loss (%)        : 1
        LOST events          : 1
        SAMPLE events        : 1094102
        perf.data size (GiB) : 1.3
 
N : 64
        P (period, ms)       : 1.5
        runtime overhead (%) : 2.4x ~ 16.23 / 6.81
        data loss (%)        : 2
        LOST events          : 18
        SAMPLE events        : 1105986
        perf.data size (GiB) : 1.3
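
As a sanity check on the numbers above, the perf.data size divided by the SAMPLE count gives the average bytes written per sample. A rough sketch, assuming the reported sizes are in GiB (2^30 bytes) and ignoring non-SAMPLE records; the sizes are rounded in the report, so the result is only approximate:

```python
GIB = 2**30


def bytes_per_sample(size_gib, samples):
    """Average perf.data bytes per SAMPLE event (rounded input sizes)."""
    return size_gib * GIB / samples


# (perf.data size in GiB, SAMPLE events), copied from the results above.
runs = {
    "threads T=272": (10, 8048542),
    "threads T=128": (8, 6524363),
    "threads T=64": (9, 7160623),
    "aio N=512": (1.3, 1094976),
    "aio N=272": (1.3, 1089252),
    "aio N=128": (1.3, 1094102),
    "aio N=64": (1.3, 1105986),
}

per_sample = {name: bytes_per_sample(*v) for name, v in runs.items()}

for name, value in per_sample.items():
    print(f"{name}: ~{value:.0f} bytes/sample")
```

Every run comes out at roughly 1.2 to 1.4 KB per sample, which is in line with --call-graph dwarf,1024 dumping up to 1024 bytes of user stack with each sample. So the difference in file size between the two modes follows from the sample count (and thus the period P), not from a different record size.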

Thanks,
Alexey

> VTune collects both user and kernel mode samples, using the /uk modifiers.
> The modifier set can be extended to collect on the VM host and guests as well.
> 
> Thanks,
> Alexey
> 
>>
>> thanks,
>> jirka
>>
> 
