gsmiller commented on PR #12417: URL: https://github.com/apache/lucene/pull/12417#issuecomment-1629974586
> @gsmiller It is not surprising that ByteVector is faster for narrow bit width scenarios. The issue is how to represent these different bit widths in a cohesive way, without coercing the bits into a narrower format - since that defeats the point.

@ChrisHegarty I'm not entirely sure what you mean here, but I'm certain that's my lack of understanding in this space surfacing.

Taking a bit width of 2 as an illustrative example, my super-naive thought is that we'd be encoding the values to disk using essentially 8 total lanes, right? So we'd store 8 ints to represent these 8 lanes (each int packing 16 2-bit values). The first int would store docs 0, 8, 16, 24, [...], 120 (in the block of 128 docs). The second int would store docs 1, 9, 17, 25, [...], 121. And so on. For each of these ints, we need to perform 16 "operations" (shift + mask) to decode the 16 values. If we're using a 128-bit vector register, we can operate on 4 lanes at a time, so we have to load up 4 ints, do these 16 operations, then do the next 4, etc. But then each lane directly contains the int we want to "copy out" into the final `int[]` we're populating, so that part is easy.

Alternatively, what if we used 32 lanes that were only 8 bits wide instead of 8 lanes that are 32 bits wide? If we stored 32 bytes to represent these lanes, each packing 4 values, we'd have something like lane 1 containing docs 0, 32, 64, 96; lane 2 containing docs 1, 33, 65, 97; and so on, up to our last lane of 31, 63, 95, 127. We could read these bytes from disk and directly load 16 of them into a 128-bit register, performing 16 concurrent shift + mask operations and then copying out 16 values at a time instead of only 4 as before. What I'm not sure about, though, is whether or not it's possible to copy out those 16 values at a time into ints. We really just care about copying out the 8-bit lane values and left-padding the remaining 24 bits with 0s, but maybe that can't easily be done?
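To make the int-lane layout concrete, here's a scalar sketch of the packing and shift + mask decode I have in mind. The class and method names are hypothetical, not code from this PR, and the slot ordering within each int is an assumed detail:

```java
import java.util.Arrays;

// Hypothetical scalar model of the 8-int-lane layout for bit width 2:
// int j holds the 16 values for docs j, j+8, j+16, ..., j+120,
// one value per 2-bit slot (slot ordering is an assumption).
class Packed2IntLanes {
  static int[] pack(int[] values) { // values.length == 128, each in [0, 3]
    int[] packed = new int[8];
    for (int j = 0; j < 8; j++) {
      for (int k = 0; k < 16; k++) {
        packed[j] |= values[j + 8 * k] << (2 * k);
      }
    }
    return packed;
  }

  // Decode: 16 shift + mask "operations" per int, as counted above.
  static int[] unpack(int[] packed) {
    int[] values = new int[128];
    for (int j = 0; j < 8; j++) {
      for (int k = 0; k < 16; k++) {
        values[j + 8 * k] = (packed[j] >>> (2 * k)) & 0x3;
      }
    }
    return values;
  }
}
```

The vectorized version would run the same inner shift + mask across 4 int lanes of a 128-bit register at a time.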
I messed about with `reinterpretAsInts`, but that doesn't really seem well suited? Maybe there's a better way to approach this? Or maybe it just fundamentally can't be done? I clearly have a large knowledge gap in this space, so again, I'm sure this reasoning is fundamentally flawed in some way, but I'm approaching it with a bit of an outsider's perspective and wondering why we can't increase our concurrency here in cases of small bit widths. It just seems a bit wasteful to shift + mask 128 bits at a time and only care about 8 of them when we could shift + mask 128 bits at a time and care about 32 of them.

Also, happy to take this conversation elsewhere if it makes sense. If it goes beyond this comment, I'll move it somewhere else to make sure we don't clutter up this PR. Thanks for your help understanding!
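For reference, in scalar terms the "copy out the 8-bit lane and left-pad with 0s" step is just a `& 0xFF` zero-extension per lane; in the Vector API, I'd guess the analogous step is a widening conversion such as `convertShape` with `VectorOperators.ZERO_EXTEND_B2I` rather than `reinterpretAsInts`, though I haven't verified that it's cheap. A scalar sketch of the byte-lane layout (hypothetical names, slot ordering assumed):

```java
import java.util.Arrays;

// Hypothetical scalar model of the 32-byte-lane layout for bit width 2:
// byte i holds the four values for docs i, i+32, i+64, i+96.
class Packed2ByteLanes {
  static byte[] pack(int[] values) { // values.length == 128, each in [0, 3]
    byte[] packed = new byte[32];
    for (int i = 0; i < 32; i++) {
      for (int k = 0; k < 4; k++) {
        packed[i] |= values[i + 32 * k] << (2 * k);
      }
    }
    return packed;
  }

  static int[] unpack(byte[] packed) {
    int[] values = new int[128];
    for (int i = 0; i < 32; i++) {
      // Zero-extend the 8-bit lane into an int: keep the low 8 bits and
      // left-pad the upper 24 bits with zeros. This is the per-lane effect
      // the vectorized widening step would need to have.
      int lane = packed[i] & 0xFF;
      for (int k = 0; k < 4; k++) {
        values[i + 32 * k] = (lane >>> (2 * k)) & 0x3;
      }
    }
    return values;
  }
}
```

A 128-bit register would process 16 of these byte lanes per shift + mask, versus 4 int lanes in the layout above.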
