Re: [PATCH 3/3] powerpc: bpf: implement in-register swap for 64-bit endian operations

2017-01-24 Thread 'Naveen N. Rao'
On 2017/01/24 04:13PM, David Laight wrote:
> From: 'Naveen N. Rao'
> > Sent: 23 January 2017 19:22
> > On 2017/01/15 09:00AM, Benjamin Herrenschmidt wrote:
> > > On Fri, 2017-01-13 at 23:22 +0530, 'Naveen N. Rao' wrote:
> > > > > That rather depends on whether the processor has a store to load 
> > > > > forwarder
> > > > > that will satisfy the read from the store buffer.
> > > > > I don't know about ppc, but at least some x86 will do that.
> > > >
> > > > Interesting - good to know that.
> > > >
> > > > However, I don't think powerpc does that and in-register swap is likely
> > > > faster regardless. Note also that gcc prefers this form at higher
> > > > optimization levels.
> > >
> > > Of course powerpc has a load-store forwarder these days, however, I
> > > wouldn't be surprised if the in-register form was still faster on some
> > > implementations, but this needs to be tested.
> > 
> > Thanks for clarifying! To test this, I wrote a simple (perhaps naive)
> > test that just issues a whole lot of endian swaps and in _that_ test, it
> > does look like the load-store forwarder is doing pretty well.
> ...
> > This is all in a POWER8 vm. On POWER7, the in-register variant is around
> > 4 times faster than the ldbrx variant.
> ...
> 
> I wonder which is faster on the little 1GHz embedded ppc we use here.

Worth a test, for sure.
FWIW, this patch won't matter there, since the eBPF JIT is only for ppc64.

Thanks,
Naveen



RE: [PATCH 3/3] powerpc: bpf: implement in-register swap for 64-bit endian operations

2017-01-24 Thread David Laight
From: 'Naveen N. Rao'
> Sent: 23 January 2017 19:22
> On 2017/01/15 09:00AM, Benjamin Herrenschmidt wrote:
> > On Fri, 2017-01-13 at 23:22 +0530, 'Naveen N. Rao' wrote:
> > > > That rather depends on whether the processor has a store to load 
> > > > forwarder
> > > > that will satisfy the read from the store buffer.
> > > > I don't know about ppc, but at least some x86 will do that.
> > >
> > > Interesting - good to know that.
> > >
> > > However, I don't think powerpc does that and in-register swap is likely
> > > faster regardless. Note also that gcc prefers this form at higher
> > > optimization levels.
> >
> > Of course powerpc has a load-store forwarder these days, however, I
> > wouldn't be surprised if the in-register form was still faster on some
> > implementations, but this needs to be tested.
> 
> Thanks for clarifying! To test this, I wrote a simple (perhaps naive)
> test that just issues a whole lot of endian swaps and in _that_ test, it
> does look like the load-store forwarder is doing pretty well.
...
> This is all in a POWER8 vm. On POWER7, the in-register variant is around
> 4 times faster than the ldbrx variant.
...

I wonder which is faster on the little 1GHz embedded ppc we use here.

David




Re: [PATCH 3/3] powerpc: bpf: implement in-register swap for 64-bit endian operations

2017-01-23 Thread 'Naveen N. Rao'
On 2017/01/15 09:00AM, Benjamin Herrenschmidt wrote:
> On Fri, 2017-01-13 at 23:22 +0530, 'Naveen N. Rao' wrote:
> > > That rather depends on whether the processor has a store to load forwarder
> > > that will satisfy the read from the store buffer.
> > > I don't know about ppc, but at least some x86 will do that.
> > 
> > Interesting - good to know that.
> > 
> > However, I don't think powerpc does that and in-register swap is likely 
> > faster regardless. Note also that gcc prefers this form at higher 
> > optimization levels.
> 
> Of course powerpc has a load-store forwarder these days, however, I
> wouldn't be surprised if the in-register form was still faster on some
> implementations, but this needs to be tested.

Thanks for clarifying! To test this, I wrote a simple (perhaps naive) 
test that just issues a whole lot of endian swaps and in _that_ test, it 
does look like the load-store forwarder is doing pretty well.

The tests:

bpf-bswap.S:
---
.file   "bpf-bswap.S"
.abiversion 2
.section ".text"
.align 2
.globl main
.type   main, @function
main:
    mflr    0
    std     0,16(1)
    stdu    1,-32760(1)
    addi    3,1,32
    li      4,0
    li      5,32720
    li      11,32720
    mulli   11,11,8
    li      10,0
    li      7,16
1:  ldx     6,3,4
    stdx    6,1,7
    ldbrx   6,1,7
    stdx    6,3,4
    addi    4,4,8
    cmpd    4,5
    beq     2f
    b       1b
2:  addi    10,10,1
    li      4,0
    cmpd    10,11
    beq     3f
    b       1b
3:  li      3,0
    addi    1,1,32760
    ld      0,16(1)
    mtlr    0

bpf-bswap-reg.S:
---
.file   "bpf-bswap-reg.S"
.abiversion 2
.section ".text"
.align 2
.globl main
.type   main, @function
main:
    mflr    0
    std     0,16(1)
    stdu    1,-32760(1)
    addi    3,1,32
    li      4,0
    li      5,32720
    li      11,32720
    mulli   11,11,8
    li      10,0
1:  ldx     6,3,4
    rldicl  7,6,32,32
    rlwinm  8,6,24,0,31
    rlwimi  8,6,8,8,15
    rlwinm  9,7,24,0,31
    rlwimi  8,6,8,24,31
    rlwimi  9,7,8,8,15
    rlwimi  9,7,8,24,31
    rldicr  8,8,32,31
    or      6,8,9
    stdx    6,3,4
    addi    4,4,8
    cmpd    4,5
    beq     2f
    b       1b
2:  addi    10,10,1
    li      4,0
    cmpd    10,11
    beq     3f
    b       1b
3:  li      3,0
    addi    1,1,32760
    ld      0,16(1)
    mtlr    0
    blr
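
The rldicl/rlwinm/rlwimi sequence above is two independent 32-bit byte swaps
followed by a merge: the rotate-with-insert pairs build each swapped word,
rldicr repositions one word into the upper half, and or combines them. A C
sketch of the same dataflow (function names are mine, not from the patch):

```c
#include <stdint.h>

/* 32-bit byte swap; each masked term corresponds to one rotate/insert step. */
static uint32_t bswap32_model(uint32_t x)
{
    return ((x & 0x000000ffu) << 24) |
           ((x & 0x0000ff00u) <<  8) |
           ((x & 0x00ff0000u) >>  8) |
           ((x & 0xff000000u) >> 24);
}

/* In-register 64-bit swap: split (rldicl), swap each half,
 * then merge (rldicr + or). */
static uint64_t swap_in_registers(uint64_t v)
{
    uint32_t hi = (uint32_t)(v >> 32);
    uint32_t lo = (uint32_t)v;
    return ((uint64_t)bswap32_model(lo) << 32) | bswap32_model(hi);
}
```

No loads or stores are involved, which is the whole point of the in-register
variant being benchmarked here.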

Profiling the two variants:

# perf stat ./bpf-bswap

 Performance counter stats for './bpf-bswap':

       1395.979224      task-clock (msec)         #    0.999 CPUs utilized
                 0      context-switches          #    0.000 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                45      page-faults               #    0.032 K/sec
     4,651,874,673      cycles                    #    3.332 GHz                      (66.87%)
         3,141,186      stalled-cycles-frontend   #    0.07% frontend cycles idle     (50.57%)
     1,117,289,485      stalled-cycles-backend    #   24.02% backend cycles idle      (50.57%)
     8,565,963,861      instructions              #    1.84  insn per cycle
                                                  #    0.13  stalled cycles per insn  (67.05%)
     2,174,029,771      branches                  # 1557.351 M/sec                    (49.69%)
           262,656      branch-misses             #    0.01% of all branches          (50.05%)

       1.396893189 seconds time elapsed

# perf stat ./bpf-bswap-reg

 Performance counter stats for './bpf-bswap-reg':

       1819.758102      task-clock (msec)         #    0.999 CPUs utilized
                 3      context-switches          #    0.002 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                44      page-faults               #    0.024 K/sec
     6,034,777,602      cycles                    #    3.316 GHz                      (66.83%)
         2,010,983      stalled-cycles-frontend   #    0.03% frontend cycles idle     (50.47%)
     1,024,975,759      stalled-cycles-backend    #   16.98% backend cycles idle      (50.52%)
    16,043,732,849      instructions              #    2.66  insn per cycle
                                                  #    0.06  stalled cycles per insn  (67.01%)
     2,148,710,750      branches                  # 1180.767 M/sec                    (49.57%)
           268,046      branch-misses             #    0.01% of all branches          (49.52%)

       1.821501345 seconds time elapsed


This is all in a POWER8 vm. On POWER7, the in-register variant is around 
4 times faster than the ldbrx variant.

Re: [PATCH 3/3] powerpc: bpf: implement in-register swap for 64-bit endian operations

2017-01-15 Thread Benjamin Herrenschmidt
On Fri, 2017-01-13 at 23:22 +0530, 'Naveen N. Rao' wrote:
> > That rather depends on whether the processor has a store to load forwarder
> > that will satisfy the read from the store buffer.
> > I don't know about ppc, but at least some x86 will do that.
> 
> Interesting - good to know that.
> 
> However, I don't think powerpc does that and in-register swap is likely 
> faster regardless. Note also that gcc prefers this form at higher 
> optimization levels.

Of course powerpc has a load-store forwarder these days, however, I
wouldn't be surprised if the in-register form was still faster on some
implementations, but this needs to be tested.

Ideally, you'd want to try to "optimize" load+swap or swap+store
though.

Cheers,
Ben.



Re: [PATCH 3/3] powerpc: bpf: implement in-register swap for 64-bit endian operations

2017-01-13 Thread 'Naveen N. Rao'
On 2017/01/13 05:17PM, David Laight wrote:
> From: Naveen N. Rao
> > Sent: 13 January 2017 17:10
> > Generate instructions to perform the endian conversion using registers,
> > rather than generating two memory accesses.
> > 
> > The "way easier and faster" comment was obviously for the author, not
> > the processor.
> 
> That rather depends on whether the processor has a store to load forwarder
> that will satisfy the read from the store buffer.
> I don't know about ppc, but at least some x86 will do that.

Interesting - good to know that.

However, I don't think powerpc does that and in-register swap is likely 
faster regardless. Note also that gcc prefers this form at higher 
optimization levels.

Thanks,
Naveen



RE: [PATCH 3/3] powerpc: bpf: implement in-register swap for 64-bit endian operations

2017-01-13 Thread David Laight
From: Naveen N. Rao
> Sent: 13 January 2017 17:10
> Generate instructions to perform the endian conversion using registers,
> rather than generating two memory accesses.
> 
> The "way easier and faster" comment was obviously for the author, not
> the processor.

That rather depends on whether the processor has a store to load forwarder
that will satisfy the read from the store buffer.
I don't know about ppc, but at least some x86 will do that.

David



[PATCH 3/3] powerpc: bpf: implement in-register swap for 64-bit endian operations

2017-01-13 Thread Naveen N. Rao
Generate instructions to perform the endian conversion using registers,
rather than generating two memory accesses.

The "way easier and faster" comment was obviously for the author, not
the processor.

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/net/bpf_jit_comp64.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit_comp64.c b/arch/powerpc/net/bpf_jit_comp64.c
index 1e313db..0413a89 100644
--- a/arch/powerpc/net/bpf_jit_comp64.c
+++ b/arch/powerpc/net/bpf_jit_comp64.c
@@ -599,16 +599,22 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32 *image,
 			break;
 		case 64:
 			/*
-			 * Way easier and faster(?) to store the value
-			 * into stack and then use ldbrx
+			 * We'll split it up into two words, swap those
+			 * independently and then merge them back.
 			 *
-			 * ctx->seen will be reliable in pass2, but
-			 * the instructions generated will remain the
-			 * same across all passes
+			 * First up, let's swap the most-significant word.
 			 */
-			PPC_STD(dst_reg, 1, bpf_jit_stack_local(ctx));
-			PPC_ADDI(b2p[TMP_REG_1], 1, bpf_jit_stack_local(ctx));
-			PPC_LDBRX(dst_reg, 0, b2p[TMP_REG_1]);
+			PPC_RLDICL(b2p[TMP_REG_1], dst_reg, 32, 32);
+			PPC_RLWINM(b2p[TMP_REG_2], b2p[TMP_REG_1], 8, 0, 31);
+			PPC_RLWIMI(b2p[TMP_REG_2], b2p[TMP_REG_1], 24, 0, 7);
+			PPC_RLWIMI(b2p[TMP_REG_2], b2p[TMP_REG_1], 24, 16, 23);
+			/* Then, the second half */
+			PPC_RLWINM(b2p[TMP_REG_1], dst_reg, 8, 0, 31);
+			PPC_RLWIMI(b2p[TMP_REG_1], dst_reg, 24, 0, 7);
+			PPC_RLWIMI(b2p[TMP_REG_1], dst_reg, 24, 16, 23);
+			/* Merge back */
+			PPC_RLDICR(dst_reg, b2p[TMP_REG_1], 32, 31);
+			PPC_OR(dst_reg, dst_reg, b2p[TMP_REG_2]);
 			break;
 		}
 		break;
-- 
2.10.2