> On 27 Dec 2019, at 07:09, Lijian Zhang <lijian.zh...@arm.com> wrote:
> 
> Hi All,
> 
> We have tried to improve L2 performance on Arm servers by investigating the 
> bihash data structure. Could you please review our findings and share your 
> suggestions?
> 
> 1. Apply prefetches in Bihash MAC-table
> 
> 
> According to software profiling during L2 throughput benchmarking on Arm 
> CPUs, cache misses are the major factor affecting performance. This is 
> reasonable and expected for hash operations. We tried applying prefetches to 
> the bihash MAC table in the ACL and L2 related nodes; however, only the 
> l2_fwd node showed an improvement.
> 
> 
> 
> With 10K L2 flows, the prefetches applied to the l2_fwd node save 12% of 
> clocks on Cortex-A72, 23% on ThunderX2, and 19.9% on x86 Haswell. They also 
> improve L2 throughput by 2% on Cortex-A72, 2.7% on ThunderX2, and 3% on x86 
> Haswell.
> 
> 
> 
> But with a single L2 flow, the additional prefetches cause some degradation 
> in node cycles and throughput.
> 
> 
> This also seems reasonable. With a single L2 flow, the bihash data being hit 
> is already in cache, so nothing needs to be prefetched and any additional 
> prefetch instructions just waste cycles. With 10K flows, prefetching the 
> bihash data structure can reduce the cost of cache misses.
> 
> 
> 
> So the question is: how would you decide whether to apply prefetches in the 
> node processing functions? Is there a trade-off between applying prefetches 
> for many flows and skipping them when there are fewer flows?

That node is poorly written performance-wise, so the best option will be to 
rewrite it completely.
I would expect that a 50% improvement in clocks/packet is possible there, even 
in the single-MAC case...
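
For readers following the prefetch discussion quoted above, here is a minimal 
sketch of the general pattern: issue a prefetch for the bucket of a later 
packet while processing the current one. This is not the l2_fwd or bihash 
code. The table layout, the toy hash, and the names bucket_t, bucket_index, 
lookup_batch, N_BUCKETS and PREFETCH_AHEAD are all hypothetical, and 
__builtin_prefetch assumes GCC or clang.

```c
#include <stddef.h>
#include <stdint.h>

#define N_BUCKETS 1024u      /* hypothetical table size, power of two */
#define PREFETCH_AHEAD 4u    /* hypothetical lookahead distance, needs tuning */

/* Hypothetical direct-mapped bucket; the real bihash layout differs. */
typedef struct
{
  uint64_t key;
  uint32_t value;
} bucket_t;

/* Toy multiplicative hash, standing in for the real hash function. */
static inline uint32_t
bucket_index (uint64_t key)
{
  return (uint32_t) ((key * 0x9E3779B97F4A7C15ull) >> 32) & (N_BUCKETS - 1u);
}

/* Look up n keys; results[i] receives the stored value, or 0 on a miss. */
static void
lookup_batch (const bucket_t *table, const uint64_t *keys,
              uint32_t *results, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      /* Start loading the bucket needed PREFETCH_AHEAD iterations from now,
         so the miss latency overlaps with the work done in between. */
      if (i + PREFETCH_AHEAD < n)
        __builtin_prefetch (&table[bucket_index (keys[i + PREFETCH_AHEAD])],
                            0 /* read */, 3 /* keep in all cache levels */);

      const bucket_t *b = &table[bucket_index (keys[i])];
      results[i] = (b->key == keys[i]) ? b->value : 0;
    }
}
```

With a single flow every lookup touches the same, already cached bucket, so 
the prefetch and the extra hash computation are pure overhead. That matches 
the degradation described in the quoted text.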

> 
> 
> 
> 2. Force loop unrolling in hash calculation
> 
> 
> The hash calculation clib_crc32c (u8 * s, int len) and key comparison 
> function clib_bihash_key_compare_x_x () are key to Bihash performance.
> 
> 
> The key length in most bihash instances is fixed, i.e., the len parameter of 
> clib_crc32c () is known at compile time. VPP is compiled with the -O2 
> option, which automatically unrolls a loop only when the iteration count is 
> 3 or fewer.
> 
> 
> 
> Forcing loop unrolling in the CRC32 calculation function can save some CPU 
> cycles when the iteration count is fixed and known at compile time and the 
> key is longer than 24 bytes.
> 
> 
> /* Unroll loops when iteration number is known at compiling stage */
> #define __CLIB_UNROLL(n) _Pragma(#n)
> #define _CLIB_UNROLL(n) __CLIB_UNROLL(n)
> #ifdef __clang__
> #define CLIB_UNROLL(n) _CLIB_UNROLL(unroll n)
> #elif __GNUC__ >= 8
> #define CLIB_UNROLL(n) _CLIB_UNROLL(GCC unroll n)
> #else
> #define CLIB_UNROLL(n)
> #endif
> 
> static_always_inline u32
> clib_crc32c (u8 * s, int len)
> {
>   u32 v = 0;
> 
> #if defined(__x86_64__)
>   CLIB_UNROLL(8) for (; len >= 8; len -= 8, s += 8)
>     v = _mm_crc32_u64 (v, *((u64 *) s));
> #else
> 
> Below are micro-benchmarking results for clib_crc32c () when the key size is 
> fixed and known at compile time and '_Pragma (GCC unroll n)' is applied.
> 
> 
> 
> Key length (bytes):           8     16     24     32     40
> Cycles saved (Taishan):       0%    0%     0%     69%    53%
> Cycles saved (ThunderX1):     0%    0%     0%     26%    5%
> Cycles saved (ThunderX2):     0%    0%     0%     19%    71%
> Cycles saved (Haswell):       0%    0%     0%     50%    38%
> 

I can see that gcc unrolls this even without the unroll pragma, even for 
64-byte keys. It looks like you are doing something wrong, at least on x86.

static_always_inline u32
clib_crc32c (u8 * s, int len)
{
  u32 v = 0;
  for (; len >= 8; len -= 8, s += 8)
    v = _mm_crc32_u64 (v, *((u64 *) s));
  return v;
}

u32
foo (u8 * s, int len)
{
  return clib_crc32c (s, 64);
}

gives:

foo:
  xor  eax, eax
  crc32  rax, QWORD PTR [rdi]
  mov  eax, eax
  crc32  rax, QWORD PTR [rdi+8]
  mov  eax, eax
  crc32  rax, QWORD PTR [rdi+16]
  mov  eax, eax
  crc32  rax, QWORD PTR [rdi+24]
  mov  eax, eax
  crc32  rax, QWORD PTR [rdi+32]
  mov  eax, eax
  crc32  rax, QWORD PTR [rdi+40]
  mov  eax, eax
  crc32  rax, QWORD PTR [rdi+48]
  mov  eax, eax
  crc32  rax, QWORD PTR [rdi+56]
  ret
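
To repeat this comparison on a machine without SSE4.2 or Arm CRC 
instructions, something like the following portable sketch could be built 
twice and the generated assembly compared (for example with gcc -O2 -S). It is 
only an illustration: crc32c_u64_sw is a bitwise software stand-in for the 
hardware CRC32C step, and all DEMO_* and hash_* names are hypothetical and not 
part of vppinfra.

```c
#include <stdint.h>
#include <string.h>

/* Same shape as the CLIB_UNROLL macro quoted earlier; names hypothetical. */
#define DEMO_PRAGMA_(x) _Pragma (#x)
#define DEMO_PRAGMA(x) DEMO_PRAGMA_ (x)
#if defined(__clang__)
#define DEMO_UNROLL(n) DEMO_PRAGMA (unroll n)
#elif defined(__GNUC__) && __GNUC__ >= 8
#define DEMO_UNROLL(n) DEMO_PRAGMA (GCC unroll n)
#else
#define DEMO_UNROLL(n)
#endif

/* Bitwise CRC32C update over 8 bytes, reflected polynomial 0x82F63B78. */
static inline uint32_t
crc32c_u64_sw (uint32_t crc, uint64_t data)
{
  for (int byte = 0; byte < 8; byte++)
    {
      crc ^= (uint8_t) (data >> (8 * byte));
      for (int bit = 0; bit < 8; bit++)
        crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  return crc;
}

/* memcpy avoids unaligned and strict-aliasing problems of a pointer cast. */
static inline uint64_t
load_u64 (const uint8_t *p)
{
  uint64_t v;
  memcpy (&v, p, sizeof (v));
  return v;
}

/* Plain loop: unrolling is left to the optimizer. */
static inline uint32_t
hash_plain (const uint8_t *s, int len)
{
  uint32_t v = 0;
  for (; len >= 8; len -= 8, s += 8)
    v = crc32c_u64_sw (v, load_u64 (s));
  return v;
}

/* Same loop with an explicit unroll request. */
static inline uint32_t
hash_unrolled (const uint8_t *s, int len)
{
  uint32_t v = 0;
  DEMO_UNROLL (8) for (; len >= 8; len -= 8, s += 8)
    v = crc32c_u64_sw (v, load_u64 (s));
  return v;
}

/* Fixed 40-byte key: len is a compile-time constant once inlined. */
uint32_t hash_key40_plain (const uint8_t *key) { return hash_plain (key, 40); }
uint32_t hash_key40_unrolled (const uint8_t *key) { return hash_unrolled (key, 40); }
```

Both wrappers must return the same value for the same key. Only the generated 
code for the outer loop should differ, and only if the compiler had not 
already unrolled it on its own.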





View/Reply Online (#14979): https://lists.fd.io/g/vpp-dev/message/14979