Hi Eric,

Thanks for your quick reply.
On 2019/7/24 16:56, Eric Dumazet wrote:
>
> On 7/24/19 10:38 AM, Zhangshaokun wrote:
>> Hi,
>>
>> I've observed a significant performance regression with commit
>> adb03115f459 ("net: get rid of an signed integer overflow in
>> ip_idents_reserve()").
>
> Yes, this UBSAN false positive has been painful.
>
>> Here are my test scenes:
>>
>> ----Server----
>> Cmd: iperf3 -s xxx.xxx.xxxx.xxx -p 10000 -i 0 -A 0
>> Kernel: 4.19.34
>> Server number: 32
>> Port: 10000 – 10032
>> CPU affinity: 0 – 32
>> CPU architecture: aarch64
>> NUMA node0 CPU(s): 0-23
>> NUMA node1 CPU(s): 24-47
>>
>> ----Client----
>> Cmd: iperf3 -u -c xxx.xxx.xxxx.xxx -p 10000 -l 16 -b 0 -t 0 -i 0 -A 8
>> Kernel: 4.19.34
>> Client number: 32
>> Port: 10000 – 10032
>> CPU affinity: 0 – 32
>> CPU architecture: aarch64
>> NUMA node0 CPU(s): 0-23
>> NUMA node1 CPU(s): 24-47
>>
>> Firstly, with patch adb03115f459 applied, the client's CPU usage is
>> 100% and ip_idents_reserve() dominates the profile, but the
>> throughput is poor:
>>
>> 03:08:32 AM     IFACE   rxpck/s      txpck/s   rxkB/s     txkB/s  rxcmp/s  txcmp/s  rxmcst/s  %ifutil
>> 03:08:33 AM      eth0      0.00         0.00     0.00       0.00     0.00     0.00      0.00     0.00
>> 03:08:33 AM      eth1      0.00   3461296.00     0.00  196049.97     0.00     0.00      0.00     0.00
>> 03:08:33 AM        lo      0.00         0.00     0.00       0.00     0.00     0.00      0.00     0.00
>>
>> Secondly, with that patch reverted (i.e. using atomic_add_return()
>> again), the result is better:
>>
>> 03:23:24 AM     IFACE   rxpck/s      txpck/s   rxkB/s     txkB/s  rxcmp/s  txcmp/s  rxmcst/s  %ifutil
>> 03:23:25 AM        lo      0.00         0.00     0.00       0.00     0.00     0.00      0.00     0.00
>> 03:23:25 AM      eth1      0.00  12834590.00     0.00  726959.20     0.00     0.00      0.00     0.00
>> 03:23:25 AM      eth0      7.00        11.00     0.40       2.95     0.00     0.00      0.00     0.00
>>
>> Thirdly, with atomics removed from ip_idents_reserve() entirely and
>> each CPU core allocating IDs from its own segment (e.g. core 0
>> allocates IDs 0 – 1023, core 1 allocates 1024 – 2047, etc.), the
>> result is the best:
>
> Not sure what you mean.
>
> Less entropy in IPv4 ID is not going to help when fragments _are_ needed.
>
> Send 40,000 datagrams of 2000 bytes each, add delays, reorders, and boom,
> most of the packets will be lost.
>
> The fact that your use case does not need proper IP IDs does not mean
> we can mess with them.

Got it, thanks for the further explanation.

> If you need to send packets very fast, maybe use AF_PACKET ?

Ok, I will try it later.

>> 03:27:06 AM     IFACE   rxpck/s      txpck/s   rxkB/s     txkB/s  rxcmp/s  txcmp/s  rxmcst/s  %ifutil
>> 03:27:07 AM        lo      0.00         0.00     0.00       0.00     0.00     0.00      0.00     0.00
>> 03:27:07 AM      eth1      0.00  14275505.00     0.00  808573.53     0.00     0.00      0.00     0.00
>> 03:27:07 AM      eth0      0.00         2.00     0.00       0.18     0.00     0.00      0.00     0.00
>>
>> Because the atomic operation becomes the bottleneck as the CPU core
>> count increases, can we revert the patch, or use an ID segment per
>> CPU core instead?
>
> This has been discussed in the past.
>
> https://lore.kernel.org/lkml/b0160f4b-b996-b0ee-405a-3d5f18662...@gmail.com/
>
> We can revert now that UBSAN has been fixed.
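
For completeness, the contended part of the change is roughly the
following (a sketch from my reading of adb03115f459, not the verbatim
kernel source, so details may differ across versions; p_id is the
per-bucket atomic_t counter, delta and segs are u32):

	/* Before adb03115f459: one atomic RMW per ID reservation. */
	return atomic_add_return(segs + delta, p_id) - segs;

	/* After adb03115f459: a cmpxchg retry loop.  Each failed
	 * cmpxchg re-reads the counter and retries, so with 32 cores
	 * hammering the same counter most attempts fail and the cache
	 * line keeps bouncing between cores.
	 */
	u32 old, new;

	do {
		old = (u32)atomic_read(p_id);
		new = old + delta + segs;
	} while (atomic_cmpxchg(p_id, old, new) != old);

	return new - segs;

That retry loop is why ip_idents_reserve() shows up so high in the
client profile.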
> Or even use Peter's patch:
> https://lore.kernel.org/lkml/20181101172739.ga3...@hirez.programming.kicks-ass.net/

I have tried this patch, with try_cmpxchg removed because arm64 has no
such API (the loop I actually ran is sketched at the bottom of this
mail):

09:21:16 PM     IFACE   rxpck/s      txpck/s   rxkB/s     txkB/s  rxcmp/s  txcmp/s  rxmcst/s  %ifutil
09:21:17 PM        lo      0.00         0.00     0.00       0.00     0.00     0.00      0.00     0.00
09:21:17 PM      eth1      0.00  10434613.00     0.00  591023.00     0.00     0.00      0.00     0.00
09:21:17 PM      eth0      1.00         0.00     0.12       0.00     0.00     0.00      0.00     0.00

The result is 10434613.00 pps, which is lower than with
atomic_add_return() (12834590.00 pps). Any thoughts?

Thanks,
Shaokun

> However, you will still hit a shared cache line badly, no matter what.
>
> Some arches are known to have terrible LL/SC implementations :/
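
For reference, here is roughly what I benchmarked (a sketch: Peter's
patch uses atomic_try_cmpxchg(), which I emulated with a plain
atomic_cmpxchg(), so the exact code differs from his; p_id, delta and
segs as in ip_idents_reserve()):

	u32 old = (u32)atomic_read(p_id);
	u32 cur, new;

	for (;;) {
		new = old + delta + segs;
		cur = (u32)atomic_cmpxchg(p_id, old, new);
		if (cur == old)
			break;
		/* Unlike the loop in adb03115f459, reuse the value the
		 * failed cmpxchg returned instead of issuing another
		 * atomic_read(), saving one load per retry.
		 */
		old = cur;
	}

	return new - segs;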