msokolov commented on PR #926:
URL: https://github.com/apache/lucene/pull/926#issuecomment-1164418508
Hi Alessandro, thank you for running the tests. I'm suspicious of the
results though -- they just look too good to be true! I know from profiling
that we spend most of the time in similarity computations, yet this change
doesn't impact how many of those we do nor how costly they are.
One thing I see is that you are using an `hdf5` file as input, but this
tester was not designed to accept that format. This is a script I have used to
extract raw floating-point data (what KnnGraphTester expects) from hdf5. This
also takes care of normalizing to unit vectors, which you should do for angular
data, but not for Euclidean data:
```
import h5py
import numpy as np
import sys
with h5py.File(sys.argv[1], 'r') as f:
    for key in f.keys():
        print(f"{key}: {f[key].shape}")
        ds = f[key]
        print(f"copying {ds.shape} from {key}")
        arr = np.zeros(ds.shape, dtype='float32')
        ds.read_direct(arr)
        # normalize all vectors (along dim 1) to unit length
        norm = np.linalg.norm(arr, 2, 1)
        norm[norm==0] = 1
        arr = arr / np.expand_dims(norm, 1)
        arr.tofile(sys.argv[1] + "-" + key)
```
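To sanity-check a file written by the script above before feeding it to the tester, something like the following might help. It is only a sketch and not part of the original script: the function name `check_raw_vectors` and its parameters are hypothetical, the vector dimension has to be supplied by hand because the raw file has no header, and it assumes the file is read on the same machine that wrote it (native byte order).

```python
import math
import os
from array import array


def check_raw_vectors(path, dim, expect_unit=True, tol=1e-4):
    """Read a headerless float32 file and return the number of vectors in it.

    Raises ValueError if the file size is not a whole number of
    dim-dimensional vectors or, when expect_unit is set, if a non-zero
    vector does not have unit length (within tol).
    """
    size = os.path.getsize(path)
    if size % (4 * dim) != 0:
        raise ValueError(f"{path}: {size} bytes is not a multiple of {4 * dim}")
    count = size // (4 * dim)
    values = array("f")  # 4-byte floats, native byte order (assumption)
    with open(path, "rb") as f:
        values.fromfile(f, count * dim)
    if expect_unit:
        for i in range(count):
            vec = values[i * dim:(i + 1) * dim]
            norm = math.sqrt(sum(x * x for x in vec))
            # all-zero vectors are left as zeros by the script above
            if norm != 0 and abs(norm - 1.0) > tol:
                raise ValueError(f"{path}: vector {i} has norm {norm:.6f}")
    return count
```

For a file of 100-dimensional vectors this would be called as `check_raw_vectors(path, 100)`. For Euclidean data, where the normalization step would be skipped, pass `expect_unit=False` so that only the file size is checked.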