pitrou commented on PR #47573: URL: https://github.com/apache/arrow/pull/47573#issuecomment-3406771727
> One hypothesis I'm wondering about is whether mixing scalar code with SIMD introduces additional latency. Indeed, when looking at the disassembly above, I noticed that many/most instructions actually just do scalar reads + scalar-to-vector register moves to load the data into the SIMD registers.

> For comparison, the Lemire implementation I'm curious about loads the data once with a `load_unaligned`, then uses a swizzle (byte reorder), then a rshift, then a mask.

That's probably the ideal implementation indeed. However, one must be careful not to read data out of bounds, which introduces some complication. Perhaps the various `unpack` specializations would need to take an additional argument indicating available padding.
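To make the padding concern concrete, here is a minimal scalar sketch. It is not Arrow's code and not the Lemire implementation: `unpack8_sketch` and its `padding_bytes` parameter are hypothetical names, a 64-bit word stands in for a SIMD register, and a little-endian host with LSB-first packing is assumed. The wide load reads 8 bytes even though only `bit_width` bytes carry data, so the caller has to say how many extra bytes are readable.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical sketch: unpack 8 values of `bit_width` bits (1..7), packed
// LSB-first, into `out`. Assumes a little-endian host.
// `padding_bytes` is the number of readable bytes after the `bit_width`
// bytes of packed data (an assumed extra argument, as suggested above).
inline void unpack8_sketch(const std::uint8_t* in, int bit_width,
                           std::size_t padding_bytes, std::uint32_t* out) {
  const std::size_t data_bytes = static_cast<std::size_t>(bit_width);
  std::uint64_t word = 0;
  if (data_bytes + padding_bytes >= sizeof(word)) {
    // Fast path: one wide load, which reads past the packed data.
    std::memcpy(&word, in, sizeof(word));
  } else {
    // Safe path: copy only the bytes that carry data; the rest stay zero.
    std::memcpy(&word, in, data_bytes);
  }
  const std::uint64_t mask = (std::uint64_t{1} << bit_width) - 1;
  for (int i = 0; i < 8; ++i) {
    // Right shift, then mask, mirroring the rshift + mask steps.
    out[i] = static_cast<std::uint32_t>((word >> (i * bit_width)) & mask);
  }
}
```

With `bit_width = 3`, only 3 bytes carry data, so the fast path needs at least 5 bytes of padding. A real SIMD version would face the same choice with a wider load, which is why the available padding would have to reach each `unpack` specialization.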
