On 2017/01/15 09:00AM, Benjamin Herrenschmidt wrote: > On Fri, 2017-01-13 at 23:22 +0530, 'Naveen N. Rao' wrote: > > > That rather depends on whether the processor has a store to load forwarder > > > that will satisfy the read from the store buffer. > > > I don't know about ppc, but at least some x86 will do that. > > > > Interesting - good to know that. > > > > However, I don't think powerpc does that and in-register swap is likely > > faster regardless. Note also that gcc prefers this form at higher > > optimization levels. > > Of course powerpc has a load-store forwarder these days, however, I > wouldn't be surprised if the in-register form was still faster on some > implementations, but this needs to be tested.
Thanks for clarifying! To test this, I wrote a simple (perhaps naive) test that just issues a whole lot of endian swaps and in _that_ test, it does look like the load-store forwarder is doing pretty well. The tests: bpf-bswap.S: ----------- .file "bpf-bswap.S" .abiversion 2 .section ".text" .align 2 .globl main .type main, @function main: mflr 0 std 0,16(1) stdu 1,-32760(1) addi 3,1,32 li 4,0 li 5,32720 li 11,32720 mulli 11,11,8 li 10,0 li 7,16 1: ldx 6,3,4 stdx 6,1,7 ldbrx 6,1,7 stdx 6,3,4 addi 4,4,8 cmpd 4,5 beq 2f b 1b 2: addi 10,10,1 li 4,0 cmpd 10,11 beq 3f b 1b 3: li 3,0 addi 1,1,32760 ld 0,16(1) mtlr 0 blr bpf-bswap-reg.S: --------------- .file "bpf-bswap-reg.S" .abiversion 2 .section ".text" .align 2 .globl main .type main, @function main: mflr 0 std 0,16(1) stdu 1,-32760(1) addi 3,1,32 li 4,0 li 5,32720 li 11,32720 mulli 11,11,8 li 10,0 1: ldx 6,3,4 rldicl 7,6,32,32 rlwinm 8,6,24,0,31 rlwimi 8,6,8,8,15 rlwinm 9,7,24,0,31 rlwimi 8,6,8,24,31 rlwimi 9,7,8,8,15 rlwimi 9,7,8,24,31 rldicr 8,8,32,31 or 6,8,9 stdx 6,3,4 addi 4,4,8 cmpd 4,5 beq 2f b 1b 2: addi 10,10,1 li 4,0 cmpd 10,11 beq 3f b 1b 3: li 3,0 addi 1,1,32760 ld 0,16(1) mtlr 0 blr Profiling the two variants: # perf stat ./bpf-bswap Performance counter stats for './bpf-bswap': 1395.979224 task-clock (msec) # 0.999 CPUs utilized 0 context-switches # 0.000 K/sec 0 cpu-migrations # 0.000 K/sec 45 page-faults # 0.032 K/sec 4,651,874,673 cycles # 3.332 GHz (66.87%) 3,141,186 stalled-cycles-frontend # 0.07% frontend cycles idle (50.57%) 1,117,289,485 stalled-cycles-backend # 24.02% backend cycles idle (50.57%) 8,565,963,861 instructions # 1.84 insn per cycle # 0.13 stalled cycles per insn (67.05%) 2,174,029,771 branches # 1557.351 M/sec (49.69%) 262,656 branch-misses # 0.01% of all branches (50.05%) 1.396893189 seconds time elapsed # perf stat ./bpf-bswap-reg Performance counter stats for './bpf-bswap-reg': 1819.758102 task-clock (msec) # 0.999 CPUs utilized 3 context-switches # 0.002 K/sec 0 cpu-migrations # 0.000 K/sec 44 page-faults # 0.024 K/sec 6,034,777,602 cycles # 3.316 GHz (66.83%) 2,010,983 stalled-cycles-frontend # 0.03% frontend cycles idle (50.47%) 1,024,975,759 stalled-cycles-backend # 16.98% backend cycles idle (50.52%) 16,043,732,849 instructions # 2.66 insn per cycle # 0.06 stalled cycles per insn (67.01%) 2,148,710,750 branches # 1180.767 M/sec (49.57%) 268,046 branch-misses # 0.01% of all branches (49.52%) 1.821501345 seconds time elapsed This is all in a POWER8 vm. On POWER7, the in-register variant is around 4 times faster than the ldbrx variant. So, yes, unless I've missed something, the ldbrx variant seems to perform better, if not on par with the in-register swap variant on POWER8. > > Ideally, you'd want to try to "optimize" load+swap or swap+store > though. Agreed. This is already the case with BPF for packet access - those use skb helpers which issue the appropriate lhbrx/lwbrx/ldbrx. The newer BPF_FROM_LE/BPF_FROM_BE are for endian operations with other BPF programs. We can probably implement an extra pass to detect use of endian swap and try to match it up with a previous load or a subsequent store though... Thanks! - Naveen