On 16.04.21 18:48, Philippe Gerum wrote:
> 
> Jan Kiszka <jan.kis...@siemens.com> writes:
> 
>> On 15.04.21 09:54, Philippe Gerum wrote:
>>>
>>> Jan Kiszka <jan.kis...@siemens.com> writes:
>>>
>>>> On 15.04.21 09:21, Philippe Gerum wrote:
>>>>>
>>>>> Jan Kiszka <jan.kis...@siemens.com> writes:
>>>>>
>>>>>> On 27.03.21 11:19, Philippe Gerum wrote:
>>>>>>> From: Philippe Gerum <r...@xenomai.org>
>>>>>>>
>>>>>>> Since v5.9-rc1, csum_partial_copy_nocheck() no longer takes a csum
>>>>>>> seed and always passes a zero seed to csum_partial(). According to
>>>>>>> #cc44c17baf7f3, passing a non-zero value would not even yield the
>>>>>>> proper result on some architectures.
>>>>>>>
>>>>>>> Nevertheless, the current ICMP code does expect a non-zero csum seed
>>>>>>> to be used in the next computation, so let's wrap net_csum_copy()
>>>>>>> around csum_partial_copy_nocheck() for pre-5.9 kernels, and open-code
>>>>>>> it for later kernels so that we still feed csum_partial() with the
>>>>>>> user-given csum. We still expect the x86, ARM and arm64
>>>>>>> implementations of csum_partial() to do the right thing.
>>>>>>>
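>>>>>>> Schematically, the resulting helper boils down to this (simplified
>>>>>>> sketch for illustration, not the literal patch):
>>>>>>>
>>>>>>>   /* needs <linux/version.h>, <linux/string.h>, <net/checksum.h> */
>>>>>>>   static inline __wsum net_csum_copy(const void *src, void *dst,
>>>>>>>                                      int len, __wsum csum)
>>>>>>>   {
>>>>>>>   #if LINUX_VERSION_CODE < KERNEL_VERSION(5, 9, 0)
>>>>>>>           /* Pre-5.9 kernels still accept a non-zero seed here. */
>>>>>>>           return csum_partial_copy_nocheck(src, dst, len, csum);
>>>>>>>   #else
>>>>>>>           /*
>>>>>>>            * csum_partial_copy_nocheck() forces a zero seed since
>>>>>>>            * v5.9, so copy first, then feed csum_partial() with the
>>>>>>>            * user-given csum.
>>>>>>>            */
>>>>>>>           memcpy(dst, src, len);
>>>>>>>           return csum_partial(dst, len, csum);
>>>>>>>   #endif
>>>>>>>   }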
>>>>>>
>>>>>> If that issue only affects the ICMP code path, why not change only
>>>>>> that, leaving rtskb_copy_and_csum_bits() with the benefit of doing
>>>>>> copy and csum in one step?
>>>>>>
>>>>>
>>>>> As a result of #cc44c17baf7f3, the kernel no longer provides a common
>>>>> helper folding the copy and checksum operations, so I see no way to
>>>>> keep rtskb_copy_and_csum_bits() as is. Did you find an all-in-one
>>>>> replacement for csum_partial_copy_nocheck() which would allow a csum
>>>>> value to be fed in?
>>>>>
>>>>
>>>> rtskb_copy_and_csum_dev does not need that.
>>>>
>>>
>>> You made rtskb_copy_and_csum_bits() part of the exported API. So how do
>>> you want to deal with this?
>>>
>>
>> That is an internal API, so we don't care.
>>
>> But even if we converted rtskb_copy_and_csum_bits to work as it used to
>> (with a csum != 0), there would be no reason to make
>> rtskb_copy_and_csum_dev pay that price.
>>
> 
> Well, there may be a reason to challenge the idea that a folded
> copy_and_csum is better for a real-time system than a split memcpy+csum
> in the first place. Out of curiosity, I recently ran a few benchmarks on
> several SoCs regarding this, and it turned out that optimizing the data
> copy to get the buffer quickly in place before checksumming the result
> may well yield much better results with respect to jitter than what
> csum_and_copy currently achieves on these SoCs.
> 
> Inline csum_and_copy did perform slightly better on average (by a couple
> of hundred nanoseconds at best), but with pathological jitter in the
> worst case at times. By contrast, the split memcpy+csum method never
> exhibited such jitter during these tests, not once.
> 
> - four SoCs tested (2 x x86, armv7, armv8a)
> - test code ran in kernel space (real-time task context,
>   out-of-band/primary context)
> - csum_partial_copy_nocheck() vs memcpy()+csum_partial()
> - 3 buffer sizes tested (32, 1024, 1500 bytes), 3 runs each
> - all buffers (src & dst) aligned on L1_CACHE_BYTES
> - each run performed 1,000,000 iterations of a given checksum loop
>   (sketched below), no pause in between.
> - no concurrent load on the board
> - all results in nanoseconds
> 
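> Schematically, one COPY+CSUM run boils down to a loop like this
> (simplified sketch for illustration; the actual test code may differ,
> e.g. in the clock source used):
> 
>   /* needs <linux/kernel.h>, <linux/ktime.h>, <linux/string.h>,
>      <net/checksum.h> */
>   static void bench_copy_csum(const void *src, void *dst, int len)
>   {
>           s64 ns, tmin = S64_MAX, tmax = 0, total = 0;
>           int i;
> 
>           for (i = 0; i < 1000000; i++) {
>                   ktime_t t0 = ktime_get();
> 
>                   memcpy(dst, src, len);
>                   (void)csum_partial(dst, len, 0);
> 
>                   ns = ktime_to_ns(ktime_sub(ktime_get(), t0));
>                   tmin = min(tmin, ns);
>                   tmax = max(tmax, ns);
>                   total += ns;
>           }
> 
>           pr_info("COPY+CSUM %db: min=%lld, max=%lld, avg=%lld\n",
>                   len, tmin, tmax, total / 1000000);
>   }
> 
> The CSUM_COPY variant simply replaces the memcpy()+csum_partial() pair
> with a single csum_partial_copy_nocheck() call.
> 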
> The worst results obtained are presented here for each SoC.
> 
> x86[1]
> ------
> 
> vendor_id     : GenuineIntel
> cpu family    : 6
> model         : 92
> model name    : Intel(R) Atom(TM) Processor E3940 @ 1.60GHz
> stepping      : 9
> cpu MHz               : 1593.600
> cache size    : 1024 KB
> cpuid level   : 21
> wp            : yes
> flags         : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
> pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb 
> rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology 
> tsc_reliable nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 
> ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe 
> popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault 
> cat_l2 pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust 
> smep erms mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec 
> xgetbv1 xsaves dtherm ida arat pln pts
> vmx flags     : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad 
> ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid 
> unrestricted_guest vapic_reg vid ple shadow_vmcs
> 
> ==
> 
> CSUM_COPY 32b: min=68, max=653, avg=70
> CSUM_COPY 1024b: min=248, max=373, avg=251
> CSUM_COPY 1500b: min=344, max=3123, avg=350   <=================
> COPY+CSUM 32b: min=101, max=790, avg=103
> COPY+CSUM 1024b: min=297, max=397, avg=300
> COPY+CSUM 1500b: min=402, max=490, avg=405
> 
> ==
> 
> CSUM_COPY 32b: min=68, max=1420, avg=70
> CSUM_COPY 1024b: min=248, max=29706, avg=251   <=================
> CSUM_COPY 1500b: min=344, max=792, avg=350
> COPY+CSUM 32b: min=101, max=872, avg=103
> COPY+CSUM 1024b: min=297, max=776, avg=300
> COPY+CSUM 1500b: min=402, max=853, avg=405
> 
> ==
> 
> CSUM_COPY 32b: min=68, max=661, avg=70
> CSUM_COPY 1024b: min=248, max=1714, avg=251
> CSUM_COPY 1500b: min=344, max=33937, avg=350   <=================
> COPY+CSUM 32b: min=101, max=610, avg=103
> COPY+CSUM 1024b: min=297, max=605, avg=300
> COPY+CSUM 1500b: min=402, max=712, avg=405
> 
> x86[2]
> ------
> 
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 23
> model name      : Intel(R) Core(TM)2 Duo CPU     E7200  @ 2.53GHz
> stepping        : 6
> microcode       : 0x60c
> cpu MHz         : 1627.113
> cache size      : 3072 KB
> cpuid level     : 10
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
> cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx lm 
> constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 
> monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm pti dtherm
> 
> CSUM_COPY 32b: min=558, max=31010, avg=674     <=================
> CSUM_COPY 1024b: min=558, max=2794, avg=745
> CSUM_COPY 1500b: min=558, max=2794, avg=841
> COPY+CSUM 32b: min=558, max=2794, avg=671
> COPY+CSUM 1024b: min=558, max=2794, avg=778
> COPY+CSUM 1500b: min=838, max=2794, avg=865
> 
> ==
> 
> CSUM_COPY 32b: min=59, max=532, avg=62
> CSUM_COPY 1024b: min=198, max=270, avg=201
> CSUM_COPY 1500b: min=288, max=34921, avg=289   <=================
> COPY+CSUM 32b: min=53, max=326, avg=56
> COPY+CSUM 1024b: min=228, max=461, avg=232
> COPY+CSUM 1500b: min=311, max=341, avg=317
> 
> ==
> 
> CSUM_COPY 32b: min=59, max=382, avg=62
> CSUM_COPY 1024b: min=198, max=383, avg=201
> CSUM_COPY 1500b: min=285, max=21235, avg=289   <=================
> COPY+CSUM 32b: min=52, max=300, avg=56
> COPY+CSUM 1024b: min=228, max=348, avg=232
> COPY+CSUM 1500b: min=311, max=409, avg=317
> 
> Cortex A9 quad-core 1.2GHz (iMX6qp)
> -----------------------------------
> 
> model name    : ARMv7 Processor rev 10 (v7l)
> Features      : half thumb fastmult vfp edsp neon vfpv3 tls vfpd32 
> CPU implementer       : 0x41
> CPU architecture: 7
> CPU variant   : 0x2
> CPU part      : 0xc09
> CPU revision  : 10
> 
> CSUM_COPY 32b: min=333, max=1334, avg=440
> CSUM_COPY 1024b: min=1000, max=2666, avg=1060
> CSUM_COPY 1500b: min=1333, max=45333, avg=1357   <=================
> COPY+CSUM 32b: min=333, max=1334, avg=476
> COPY+CSUM 1024b: min=1000, max=2333, avg=1324
> COPY+CSUM 1500b: min=1666, max=2334, avg=1713
> 
> ==
> 
> CSUM_COPY 32b: min=333, max=1334, avg=439
> CSUM_COPY 1024b: min=1000, max=46000, avg=1060   <=================
> CSUM_COPY 1500b: min=1333, max=5000, avg=1351
> COPY+CSUM 32b: min=333, max=1334, avg=476
> COPY+CSUM 1024b: min=1000, max=2334, avg=1324
> COPY+CSUM 1500b: min=1666, max=2667, avg=1713
> 
> ==
> 
> CSUM_COPY 32b: min=333, max=1666, avg=454
> CSUM_COPY 1024b: min=1000, max=2000, avg=1060
> CSUM_COPY 1500b: min=1333, max=45000, avg=1348   <=================
> COPY+CSUM 32b: min=333, max=1334, avg=454
> COPY+CSUM 1024b: min=1000, max=2334, avg=1317
> COPY+CSUM 1500b: min=1666, max=6000, avg=1712
> 
> Cortex A55 quad-core 2GHz (Odroid C4)
> -------------------------------------
> 
> Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp 
> asimdhp cpuid asimdrdm lrcpc dcpop asimddp
> CPU implementer : 0x41
> CPU architecture: 8
> CPU variant     : 0x1
> CPU part        : 0xd05
> CPU revision    : 0
> 
> 
> CSUM_COPY 32b: min=125, max=833, avg=140
> CSUM_COPY 1024b: min=625, max=41916, avg=673   <=================
> CSUM_COPY 1500b: min=875, max=3875, avg=923
> COPY+CSUM 32b: min=125, max=458, avg=140
> COPY+CSUM 1024b: min=625, max=1166, avg=666
> COPY+CSUM 1500b: min=875, max=1167, avg=913
> 
> ==
> 
> CSUM_COPY 32b: min=125, max=1292, avg=139
> CSUM_COPY 1024b: min=541, max=48333, avg=555
> CSUM_COPY 1500b: min=708, max=3458, avg=740
> COPY+CSUM 32b: min=125, max=292, avg=136
> COPY+CSUM 1024b: min=541, max=750, avg=556
> COPY+CSUM 1500b: min=708, max=834, avg=740
> 
> ==
> 
> CSUM_COPY 32b: min=125, max=833, avg=140
> CSUM_COPY 1024b: min=666, max=55667, avg=673   <=================
> CSUM_COPY 1500b: min=875, max=4208, avg=913
> COPY+CSUM 32b: min=125, max=375, avg=140
> COPY+CSUM 1024b: min=666, max=916, avg=673
> COPY+CSUM 1500b: min=875, max=1042, avg=913
> 
> ============
> 
> A few additional observations from looking at the implementation:
> 
> For memcpy, legacy x86[2] uses movsq, finishing with movsb to complete
> buffers of unaligned length. Current x86[1] uses ERMS-optimized movsb,
> which is faster.
> 
> arm32/armv7 optimizes memcpy by loading up to 8 words in a single
> instruction. csum_and_copy loads/stores at best 4 words at a time, and
> only when src and dst are 32-bit aligned (which matches the test case).
> 
> arm64/armv8a uses load/store pair instructions to copy memory
> blocks. It does not have asm-optimized csum_and_copy support, so it
> uses the generic C version.
> 
> Differences in prefetching and speculation behavior might explain some
> of the gap between the approaches as well.
> 
> I would be interested in any converging / diverging results from testing
> the same combo with different test code, because from my standpoint,
> things are not as clear-cut as they are assumed to be at the moment.
> 

If the folded csum_and_copy is not using any recent memcpy optimizations,
that is an argument for the split approach achieving at least equivalent
performance.

But I don't get yet where the huge jitter is coming from. Was the
measurement loop preemptible? In that case, I would expect a split copy
followed by a separate csum loop to give much worse results, as it needs
the cache to stay warm - while copy-csum only touches the data once.

Jan

-- 
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux
