gsmiller commented on PR #12417: URL: https://github.com/apache/lucene/pull/12417#issuecomment-1629974586
> @gsmiller It is not surprising that ByteVector is faster for narrow bit width scenarios. The issue is how to represent these different bit widths in a cohesive way, without coercing the bits into a narrower format - since that defeats the point.

@ChrisHegarty I'm not entirely sure what you mean here, but I'm certain that's my lack of understanding in this space surfacing.

Taking a bit width of 2 as an illustrative example, my super-naive thought is that we'd be encoding the values to disk using essentially 8 total lanes, right? So we'd store 8 ints to represent these 8 lanes (each int packing 16 2-bit values). The first int would store docs 0, 8, 16, 24, [...], 120 (in the block of 128 docs). The second int would store docs 1, 9, 17, 25, [...], 121. And so on. For each of these ints, we need to perform 16 "operations" (shift + mask) to decode the 16 values. If we're using a 128-bit vector register, we can operate on 4 lanes at a time, so we have to load up 4 ints, do these 16 operations, then do the next 4, etc. But then each lane directly contains the int we want to "copy out" into the final `int[]` we're populating, so that part is easy.

Alternatively, what if we used 32 lanes that were only 8 bits wide instead of 8 lanes that are 32 bits wide? If we stored 32 bytes to represent these lanes, each packing 4 values, we'd have something like lane 1 containing docs 0, 32, 64, 96; lane 2 containing docs 1, 33, 65, 97; and so on, up to our last lane of 31, 63, 95, 127. We could read these bytes from disk and directly load 16 of them into a 128-bit register, performing 16 concurrent shift + mask operations and then copying out 16 values at a time instead of only 4 as before. What I'm not sure about, though, is whether or not it's possible to copy out those 16 values at a time into ints. We really just care about copying out the 8-bit lane values and left-padding the remaining 24 bits with 0s, but maybe that can't easily be done?
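To make the int-lane layout concrete, here's a scalar sketch of the packing and shift + mask decode I have in mind. The class and method names are hypothetical, not code from this PR, and the slot ordering within each int is an assumed detail:

```java
import java.util.Arrays;

// Hypothetical scalar model of the 8-int-lane layout for bit width 2:
// int j holds the 16 values for docs j, j+8, j+16, ..., j+120,
// one value per 2-bit slot (slot ordering is an assumption).
class Packed2IntLanes {
  static int[] pack(int[] values) { // values.length == 128, each in [0, 3]
    int[] packed = new int[8];
    for (int j = 0; j < 8; j++) {
      for (int k = 0; k < 16; k++) {
        packed[j] |= values[j + 8 * k] << (2 * k);
      }
    }
    return packed;
  }

  // Decode: 16 shift + mask "operations" per int, as counted above.
  static int[] unpack(int[] packed) {
    int[] values = new int[128];
    for (int j = 0; j < 8; j++) {
      for (int k = 0; k < 16; k++) {
        values[j + 8 * k] = (packed[j] >>> (2 * k)) & 0x3;
      }
    }
    return values;
  }
}
```

The vectorized version would run the same inner shift + mask across 4 int lanes of a 128-bit register at a time.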
I messed about with `reinterpretAsInts`, but that doesn't really seem well suited? Maybe there's a better way to approach this? Or maybe it just fundamentally can't be done? I clearly have a large knowledge gap in this space, so again, I'm sure this reasoning is fundamentally flawed in some way, but I'm approaching it with a bit of an outsider's perspective and wondering why we can't increase our concurrency here in cases of small bit widths. It just seems a bit wasteful to shift + mask 128 bits at a time and only care about 8 of them when we could shift + mask 128 bits at a time and care about 32 of them.

Also, happy to take this conversation elsewhere if it makes sense. If it goes beyond this comment, I'll move it somewhere else to make sure we don't clutter up this PR. Thanks for your help understanding!
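For reference, in scalar terms the "copy out the 8-bit lane and left-pad with 0s" step is just a `& 0xFF` zero-extension per lane; in the Vector API, I'd guess the analogous step is a widening conversion such as `convertShape` with `VectorOperators.ZERO_EXTEND_B2I` rather than `reinterpretAsInts`, though I haven't verified that it's cheap. A scalar sketch of the byte-lane layout (hypothetical names, slot ordering assumed):

```java
import java.util.Arrays;

// Hypothetical scalar model of the 32-byte-lane layout for bit width 2:
// byte i holds the four values for docs i, i+32, i+64, i+96.
class Packed2ByteLanes {
  static byte[] pack(int[] values) { // values.length == 128, each in [0, 3]
    byte[] packed = new byte[32];
    for (int i = 0; i < 32; i++) {
      for (int k = 0; k < 4; k++) {
        packed[i] |= values[i + 32 * k] << (2 * k);
      }
    }
    return packed;
  }

  static int[] unpack(byte[] packed) {
    int[] values = new int[128];
    for (int i = 0; i < 32; i++) {
      // Zero-extend the 8-bit lane into an int: keep the low 8 bits and
      // left-pad the upper 24 bits with zeros. This is the per-lane effect
      // the vectorized widening step would need to have.
      int lane = packed[i] & 0xFF;
      for (int k = 0; k < 4; k++) {
        values[i + 32 * k] = (lane >>> (2 * k)) & 0x3;
      }
    }
    return values;
  }
}
```

A 128-bit register would process 16 of these byte lanes per shift + mask, versus 4 int lanes in the layout above.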
