> Here are some more thoughts about loop unroll...
> In another mail [1], you are discussing manual loop unroll for 
> rte_ipv4/ipv6_phdr_cksum().
> Perhaps the compiler already loop unrolls those.
> Check the assembler output for the existing code calling __rte_raw_cksum().
> If the compiler doesn't loop unroll __rte_raw_cksum() for those two 
> functions, maybe you can help it by modifying __rte_raw_cksum(); try 
> replacing the end pointer with an int counter, which will be compile time 
> constant when called by rte_ipv4/ipv6_phdr_cksum().
>
> [1]: 
> https://inbox.dpdk.org/dev/CAFn2buA5NzmzA0+t1_5auigvQTyT7Ne6RMVaPVU=sdc03nd...@mail.gmail.com/
>
> PS: I do the following when optimizing inline functions: Add non-inline 
> functions calling the inline functions, and then use "objdump -S" to look at 
> the generated code. E.g.:
>
> uint32_t review__rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
> { return __rte_raw_cksum(buf, len, sum); }
>
> uint32_t review__rte_raw_cksum_len20(const void *buf, uint32_t sum)
> { return __rte_raw_cksum(buf, 20, sum); }
>
> uint32_t review__rte_raw_cksum_len8(const void *buf, uint32_t sum)
> { return __rte_raw_cksum(buf, 8, sum); }
>

https://godbolt.org/z/qr39hf76s
rte_ipv4_phdr_cksum() and rte_ipv6_phdr_cksum() are both fully unrolled
at -O2 and higher. Vectorization also happens, except that clang chooses
not to vectorize the ipv4 case. Yay compilers :)
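
For anyone who wants to repeat the check locally instead of on godbolt,
here is a minimal standalone sketch of the wrapper trick from the PS above,
combined with the "int counter instead of end pointer" idea. Note that
sketch_raw_cksum() and the review_sketch_* wrappers are hypothetical names
I made up for illustration; the loop is a simplified stand-in, not the
actual __rte_raw_cksum() from DPDK, so treat the result only as a hint of
what the compiler can do with a compile-time constant length.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for __rte_raw_cksum() (NOT the DPDK code):
 * sums 16-bit words using an integer counter rather than an end pointer,
 * so the trip count is a compile-time constant when len is constant.
 */
static inline uint32_t
sketch_raw_cksum(const void *buf, size_t len, uint32_t sum)
{
	const uint8_t *p = buf;
	size_t words = len / 2;
	size_t i;

	for (i = 0; i < words; i++) {
		uint16_t v;

		/* memcpy keeps the load alignment- and aliasing-safe */
		memcpy(&v, p + 2 * i, sizeof(v));
		sum += v;
	}

	if (len & 1) {
		uint16_t v = 0;

		/* trailing odd byte, copied into a zeroed 16-bit word */
		memcpy(&v, p + 2 * words, 1);
		sum += v;
	}

	return sum;
}

/* Non-inline wrappers with constant lengths, for inspecting the output */
uint32_t review_sketch_cksum_len20(const void *buf, uint32_t sum)
{ return sketch_raw_cksum(buf, 20, sum); }

uint32_t review_sketch_cksum_len8(const void *buf, uint32_t sum)
{ return sketch_raw_cksum(buf, 8, sum); }
```

Build it as an object file with something like "cc -O2 -g -c" and look at
the two wrappers with "objdump -S"; if the loop was unrolled, there should
be no backward branch left in them.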
