benwtrent commented on issue #14342:
URL: https://github.com/apache/lucene/issues/14342#issuecomment-2734567137
OK, a colleague and I spent some time digging into this, and Option 0 (a bug)
turned out to be the case. It's a 5-character change (like all good bugs), but
here are the new recall numbers for fashion-mnist:
Still not mid-90s at 5x oversampling, but WAY better than the abysmal
results from before.
```
FLAT Results:
recall  latency(ms)   nDoc  topK  fanout  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  overSample  vec_disk(MB)  vec_RAM(MB)  indexType
 0.444        4.110  60000    10      50     1 bits      0.00      Infinity             1          186.43       1.000       181.646        2.203       FLAT
 0.629        4.383  60000    10      50     1 bits      0.00      Infinity             1          186.43       2.000       181.646        2.203       FLAT
 0.730        4.437  60000    10      50     1 bits      0.00      Infinity             1          186.43       3.000       181.646        2.203       FLAT
 0.792        4.455  60000    10      50     1 bits      0.00      Infinity             1          186.43       4.000       181.646        2.203       FLAT
 0.833        4.445  60000    10      50     1 bits      0.00      Infinity             1          186.43       5.000       181.646        2.203       FLAT
 0.926        4.607  60000    10      50     1 bits      0.00      Infinity             1          186.43      10.000       181.646        2.203       FLAT
```
```
HNSW
recall  latency(ms)   nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  overSample  vec_disk(MB)  vec_RAM(MB)  indexType
 0.443        0.188  60000    10      50       64        250     1 bits      0.00      Infinity             1          189.55       1.000       181.646        2.203       HNSW
 0.629        0.274  60000    10      50       64        250     1 bits      0.00      Infinity             1          189.55       2.000       181.646        2.203       HNSW
 0.730        0.349  60000    10      50       64        250     1 bits      0.00      Infinity             1          189.55       3.000       181.646        2.203       HNSW
 0.792        0.471  60000    10      50       64        250     1 bits      0.00      Infinity             1          189.55       4.000       181.646        2.203       HNSW
 0.833        0.479  60000    10      50       64        250     1 bits      0.00      Infinity             1          189.55       5.000       181.646        2.203       HNSW
 0.926        0.786  60000    10      50       64        250     1 bits      0.00      Infinity             1          189.55      10.000       181.646        2.203       HNSW
```
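For readers unfamiliar with the overSample and recall columns, here is a rough sketch of how oversampled search with reranking is commonly done. It assumes overSample means "collect topK * overSample candidates using the cheap quantized score, then rerank them with the full-precision score"; the function names and the two scoring callables are hypothetical and are not Lucene APIs.

```python
import heapq
import math


def search_with_oversampling(doc_ids, approx_score, exact_score, top_k, over_sample):
    """Collect candidates by the cheap score, then rerank them by the exact score.

    approx_score and exact_score are hypothetical callables mapping a doc id
    to a similarity (higher is better).
    """
    n_candidates = max(top_k, math.ceil(top_k * over_sample))
    # Stage 1: gather more candidates than needed using the approximate score.
    candidates = heapq.nlargest(n_candidates, doc_ids, key=approx_score)
    # Stage 2: keep the best top_k of those candidates by the exact score.
    return heapq.nlargest(top_k, candidates, key=exact_score)


def recall(found, truth):
    """Fraction of the true nearest neighbours that the search returned."""
    return len(set(found) & set(truth)) / len(truth)
```

With a noisy approximate score, a larger over_sample gives the rerank stage more chances to recover true neighbours that the quantized score ranked too low, which matches the trend in the tables: recall climbs with overSample, and HNSW latency climbs with it.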
For the curious, it had to do with shifting the normal distribution
initialization parameters correctly given the standard deviation of the actual
vector distribution. We had the mean & std flipped. When the vectors are well
behaved, this sort of bug has a tiny effect (which is why we never caught it),
but mnist isn't well behaved and brought this nasty little bug to light.
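To make this class of bug concrete, here is a toy sketch. It is not the actual Lucene code. It assumes the quantizer seeds its 1-bit interval by taking the minimum-MSE levels of a standard normal (plus or minus sqrt(2/pi), about 0.798), scaling them by the vector's std, and shifting by its mean. The helper names and the toy vector are made up for illustration, and the sketch covers only this initialization step, not any refinement that follows it.

```python
import math

# Reconstruction levels of the 1-bit minimum-MSE quantizer for N(0, 1).
LEVEL = math.sqrt(2.0 / math.pi)  # ~0.798


def mean_and_std(values):
    """Population mean and standard deviation of a vector's components."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)


def init_interval(mean, std):
    """Scale the N(0, 1) levels by std and shift by mean (hypothetical helper)."""
    return (mean - LEVEL * std, mean + LEVEL * std)


def quantization_mse(values, interval):
    """Mean squared error when each value snaps to the nearer interval endpoint."""
    lo, hi = interval
    total = 0.0
    for v in values:
        nearest = lo if abs(v - lo) <= abs(v - hi) else hi
        total += (v - nearest) ** 2
    return total / len(values)


# Skewed toy vector: mostly zeros with a few large components.
values = [0.0] * 8 + [1.0] * 2
mean, std = mean_and_std(values)

correct = init_interval(mean, std)   # arguments in the intended order
flipped = init_interval(std, mean)   # the swapped-argument mistake

mse_correct = quantization_mse(values, correct)
mse_flipped = quantization_mse(values, flipped)
```

On this skewed toy vector the swapped call produces a narrower, shifted interval and a higher reconstruction error than the intended call. In this sketch the two calls coincide exactly only when mean and std happen to be equal.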
I am gonna run some more benchmarks and will open a PR soon with the fix.
As an aside, there are likely even more gains to be had for non-normally
distributed vectors like mnist, but they will take more time and effort.