benwtrent commented on PR #12582:
URL: https://github.com/apache/lucene/pull/12582#issuecomment-1731463225
> Do we know why search is faster? Is it mostly because working on the
quantized vectors requires lower memory bandwidth?
Search is faster in two regards:
- PanamaVector allows for more `byte` actions to occur at once than
`float32` (should be major)
- Reading `byte[]` off of a buffer doesn't require decoding floats (very
minor change)
IMO, we should be seeing WAY better search numbers. I need to do more
testing to triple check.
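To illustrate the first point: a minimal sketch (not the actual Panama Vector implementation) of the kind of loop involved. An `int8` dot product accumulates widened bytes into an `int`, and a SIMD register holds 4x as many `byte` lanes as `float32` lanes, so the auto-vectorized or Panama version can process more elements per instruction; there is also no float decoding step when reading from the buffer:

```java
// Illustrative scalar version of an int8 dot product; the Panama Vector API
// can run this loop with 4x more lanes per register than a float32 loop.
public class ByteDot {
    static int dotProduct(byte[] a, byte[] b) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i]; // bytes widen to int; no float decoding needed
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3};
        byte[] b = {4, 5, 6};
        System.out.println(dotProduct(a, b)); // 4 + 10 + 18 = 32
    }
}
```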
> Do you know how recall degrades compared to without quantization? I saw
the numbers you shared but I don't have a good sense of what recall we usually
had until now.
++ I want to graph the two together to compare so it's clearer.
> I don't feel great about the logic that merges quantiles at merge time and
only requantizes if the merged quantiles don't differ too much from the input
quantiles. It feels like quantiles could slowly change over multiple merging
rounds and we'd end up in a state where the quantized vectors would be
different from requantizing the raw vectors with the quantization state that is
stored in the segment, which feels wrong. Am I missing something?
The quantization buckets could change slightly over time, but since we are
bucketing `float32` into `int8`, the error bounds are comparatively large.
The cost of requantization is almost never worth it. In my testing,
quantiles computed over random samples from the same data set show that
segments differ by only around `1e-4`, which is tiny and shouldn't require
requantization.
@tveasey helped me do some empirical analysis here and can provide some
numbers.
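For concreteness, here is a rough sketch of that merge-time decision (names and the tolerance are illustrative, not the PR's actual API): compare the merged segment's quantiles against the stored ones and only re-quantize when they diverge beyond a tolerance, which a `~1e-4` drift never triggers:

```java
import java.util.Arrays;

// Illustrative sketch of the re-quantization decision at merge time.
// Names and the tolerance value are hypothetical, not Lucene's actual API.
public class QuantileCheck {
    // Nearest-rank quantile over a sorted sample.
    static float quantile(float[] sortedValues, double q) {
        int idx = (int) Math.round(q * (sortedValues.length - 1));
        return sortedValues[idx];
    }

    static boolean shouldRequantize(float oldLower, float oldUpper,
                                    float newLower, float newUpper,
                                    float tolerance) {
        return Math.abs(oldLower - newLower) > tolerance
            || Math.abs(oldUpper - newUpper) > tolerance;
    }

    public static void main(String[] args) {
        float[] sample = {0.01f, 0.2f, 0.35f, 0.5f, 0.77f, 0.99f};
        Arrays.sort(sample);
        float lower = quantile(sample, 0.01);
        float upper = quantile(sample, 0.99);
        // a ~1e-4 drift is well inside an int8 bucket width, so no re-quantization
        System.out.println(shouldRequantize(lower, upper, lower + 1e-4f, upper, 1e-2f)); // prints false
    }
}
```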
> Related to the above, it looks like we ignore deletions when merging
quantiles. It would probably be ok in practice most of the time but I worry
that there might be corner cases?
A corner case in what way: that we potentially include deleted docs when
computing quantiles, or in deciding whether re-quantization is required?
We can easily exclude them; conceptually, the "new" doc (if it were an
update) would exist in another segment, so counting the deleted copy would
double-count that vector, and we probably shouldn't do that.
> > Do we want to have a new "flat" vector codec that HNSW (or other
complicated vector indexing methods), can use? Detractor here is that now HNSW
codec relies on another pluggable thing that is a "flat" vector index (just
provides mechanisms for reading, writing, merging vectors in a flat index).
> I don't have a strong opinion on this. Making it a codec though has the
downside that it would require more files since two codecs can't write to the
same file. Maybe having utility methods around reading/writing flat vectors is
good enough?
Utility methods are honestly what I am leaning towards. It's then a
discussion around how a codec (like HNSW) is configured to use it.
> > Should "quantization" just be a thing that is provided to vector codecs?
> I might be misunderstanding the question, but to me this is what the
byte[] encoding is about. And this quantization that's getting added here is
more powerful because it's adaptive and will change over time depending on
what vectors get indexed or deleted? If it needs to adapt to the data then it
belongs to the codec. We could have utility code to make it easier to write
codecs that quantize their data though (maybe this is what your question
suggested?).
Yeah, it needs to adapt over time. There are adverse cases (indexing vectors
sorted by relative clusters is one) that need to be handled. But they can be
handled easily at merge time by recomputing quantiles and potentially
re-quantizing.
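The re-quantization step itself is just re-bucketing: clamp each `float32` component to the quantile interval and map it linearly onto the `int8` range. A minimal sketch (the target range and rounding here are assumptions, not the PR's exact scheme):

```java
// Hypothetical sketch of bucketing float32 into int8 given quantile bounds.
// The 0..127 target range and round-to-nearest are assumptions for illustration.
public class Quantize {
    static byte quantize(float v, float lower, float upper) {
        float clamped = Math.max(lower, Math.min(upper, v)); // clamp to quantile interval
        float scale = 127f / (upper - lower);                // linear map onto 0..127
        return (byte) Math.round((clamped - lower) * scale);
    }

    public static void main(String[] args) {
        System.out.println(quantize(0.5f, 0f, 1f)); // midpoint -> 64
        System.out.println(quantize(2.0f, 0f, 1f)); // out-of-range clamps -> 127
    }
}
```

Because each bucket spans `(upper - lower) / 127` of the float range, quantile drift far smaller than a bucket width leaves almost every quantized value unchanged, which is why re-quantization is rarely worth it.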
> > Should the "quantizer" keep the raw vectors around itself?
> My understanding is that we have to, as the accuracy of the quantization
could otherwise degrade over time in an unbounded fashion.
After a period of time, if vectors are part of the same corpus and created
via the same model, the quantiles actually level out and re-quantization will
rarely or never occur, since the calculated quantiles are statistically
equivalent, especially given the coarse binning into `int8`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]