Hi, I must say: great work.
I have some small comments:

> +/**
> + * Macro for copying unaligned block from one location to another,
> + * 47 bytes leftover maximum,
> + * locations should not overlap.
> + * Requirements:
> + * - Store is aligned
> + * - Load offset is <offset>, which must be immediate value within [1, 15]
> + * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
> + * - <dst>, <src>, <len> must be variables
> + * - __m128i <xmm0> ~ <xmm8> must be pre-defined
> + */
> +#define MOVEUNALIGNED_LEFT47(dst, src, len, offset) \
> +{ \
...
> +}

Why not wrap the body in do { ... } while (0) or ({ ... })? A bare
{ ... } block could have unpredictable side effects, for example when
the macro is used as the body of an if/else without braces.

Second: why did you completely replace

    #define rte_memcpy(dst, src, n) \
        ({ (__builtin_constant_p(n)) ? \
        memcpy((dst), (src), (n)) : \
        rte_memcpy_func((dst), (src), (n)); })

with an inline rte_memcpy()? This construction can help the compiler
deduce which version to use: the (static?) inline implementation or a
call to an external function.

Did you try the 'extern inline' type? It could help reduce compilation
time.
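
To illustrate the first comment, here is a minimal sketch of the problem. COPY_AND_ADVANCE and copy_if_nonempty are hypothetical names made up for this example, not code from the patch; the macro only stands in for any multi-statement macro such as MOVEUNALIGNED_LEFT47.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical multi-statement macro (not the one from the patch).
 * The do { ... } while (0) wrapper makes the expansion a single
 * statement that still needs the trailing ';' at the call site.
 */
#define COPY_AND_ADVANCE(dst, src, len) \
do {                                    \
    memcpy((dst), (src), (len));        \
    (dst) += (len);                     \
    (src) += (len);                     \
} while (0)

/* Copy n bytes when n is nonzero; return the number of bytes copied. */
static size_t
copy_if_nonempty(unsigned char *dst, const unsigned char *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(dst != NULL && src != NULL);

    /*
     * With a bare { ... } macro body this would expand to
     * "if (n > 0) { ... }; else", and the stray ';' would end the
     * if statement, so the else would not compile.
     */
    if (n > 0)
        COPY_AND_ADVANCE(d, s, n);
    else
        return 0;

    return (size_t)(d - dst);
}
```

The ({ ... }) form avoids the same problem, but it is a GNU extension, while do { ... } while (0) is plain C.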