pitrou commented on PR #47573:
URL: https://github.com/apache/arrow/pull/47573#issuecomment-3406771727

   > One hypothesis I'm wondering about is whether mixing scalar code with SIMD 
introduces additional latency.
   
   Indeed, when looking at the disassembly above, I noticed that many (perhaps 
most) instructions just do scalar reads followed by scalar-to-vector register 
moves to load the data into the SIMD registers.
   
   > For comparison, the Lemire implementation I'm curious about is loading 
data once with a `load_unaligned`, then using a swizzle (byte reorder), then a 
rshift, then a mask.
   
   That's probably the ideal implementation indeed. However, one must be 
careful not to read data out of bounds: a full-width unaligned load near the 
end of the input can run past the buffer, which introduces some complication. 
Perhaps the various `unpack` specializations would need to take an additional 
argument indicating the available padding.
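
   To make the padding idea concrete, here is a rough, portable sketch of the 
load / swizzle / shift / mask sequence with an extra padding argument. It is a 
scalar model of the vector steps, not real SIMD code and not Arrow's actual 
`unpack` API: the name `unpack4_sketch`, its signature, the `padding` parameter, 
the 16-byte register width, and the LSB-first packing are all assumptions made 
for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of Arrow): unpack four `bit_width`-bit values,
// packed LSB-first, from `in` into `out`.
//   in_size: number of valid input bytes
//   padding: extra bytes after the valid ones that the caller guarantees
//            are readable (their contents do not matter)
inline void unpack4_sketch(const uint8_t* in, std::size_t in_size,
                           std::size_t padding, int bit_width,
                           uint32_t out[4]) {
  // Each value plus its intra-byte shift (at most 7 bits) must fit in 32 bits.
  assert(bit_width >= 1 && bit_width <= 25);
  constexpr std::size_t kVectorBytes = 16;  // models one 128-bit register

  // Step 1: a single wide load. The full-width read is only allowed when the
  // caller has promised enough readable bytes; otherwise fall back to a
  // bounded copy into a zero-filled buffer.
  uint8_t reg[kVectorBytes] = {0};
  if (in_size + padding >= kVectorBytes) {
    std::memcpy(reg, in, kVectorBytes);  // stands in for load_unaligned
  } else {
    std::memcpy(reg, in, std::min(in_size, kVectorBytes));
  }

  const uint32_t mask = (uint32_t{1} << bit_width) - 1;
  for (int lane = 0; lane < 4; ++lane) {
    const int bit_offset = lane * bit_width;
    // Step 2: "swizzle", i.e. gather the 4 bytes that contain this value
    // into the lane (little-endian byte order).
    uint32_t word = 0;
    for (int b = 0; b < 4; ++b) {
      word |= uint32_t{reg[bit_offset / 8 + b]} << (8 * b);
    }
    // Steps 3 and 4: per-lane right shift, then mask off the neighbours.
    out[lane] = (word >> (bit_offset % 8)) & mask;
  }
}
```

   With something along these lines, a caller that knows its buffer is padded 
(for example `unpack4_sketch(buf, 3, 13, 5, out)`) would take the single-load 
path, while a caller with a tightly sized buffer passes `padding = 0` and gets 
the slower but safe copy. In a real SIMD version the inner loops would 
presumably become one byte shuffle and one variable shift.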
   
   

