GitHub user chrevanthreddy added a comment to the discussion: RFC-104 - Add 
native Vector Search Index type for Hudi Vector Type

Benchmark results on a 1-million-vector dataset:

The ivf-exact path does:

1. batch_queries crossJoin broadcast(ivf_centroids)
2. rank centroids per query by squared L2
3. keep nprobe
4. join to ivf_vectors by cluster_id
5. compute exact cosine distance
6. rank per query by exact distance
7. keep top-20

Observed 10k-query no-write result:

IVF-only probe rows: 40000
IVF-only probe stage: 29.480s
IVF-only materialize stage: 2042.473s
IVF-only result rows: 200000
IVF-only total runtime (no GCS write): 2071.955s
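
The ivf-exact steps listed above can be sketched in plain Python as follows. This is only an illustrative, single-process sketch, not the Spark job that was benchmarked: the function name ivf_exact_search, the dictionary layouts, and the sample parameters are assumptions made for the example.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance, used only to rank centroids."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; zero-norm vectors are treated as maximally far."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def ivf_exact_search(queries, centroids, vectors_by_cluster, nprobe, top_k):
    """Hypothetical in-memory stand-in for the ivf-exact path.

    queries:            {query_id: vector}
    centroids:          {cluster_id: vector}
    vectors_by_cluster: {cluster_id: [(vector_id, vector), ...]}
    """
    results = {}
    for query_id, query in queries.items():
        # Cross join with all centroids, rank by squared L2, keep nprobe.
        probes = sorted(
            centroids, key=lambda cid: squared_l2(query, centroids[cid])
        )[:nprobe]
        # Join to the vectors of the probed clusters and score them exactly.
        scored = [
            (cosine_distance(query, vec), vec_id)
            for cid in probes
            for vec_id, vec in vectors_by_cluster.get(cid, [])
        ]
        # Rank per query by exact distance and keep the top_k ids.
        scored.sort()
        results[query_id] = [vec_id for _, vec_id in scored[:top_k]]
    return results
```

In this sketch every vector in a probed cluster reaches the exact-distance step, which corresponds to the materialize stage in the timings above.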

The ivf-rabitq path does:

1. batch_queries crossJoin broadcast(ivf_centroids)
2. rank centroids per query by squared L2
3. keep nprobe
4. join to ivf_postings by cluster_id
5. compute Hamming distance
6. keep top 1000 candidates per query
7. join surviving candidates to ivf_vectors
8. compute exact cosine distance
9. rank per query by exact distance
10. keep top-20

Observed 10k-query no-write result:

IVF probe rows: 40000
IVF candidate rows (top-1000 per query): 10000000
Probe stage: 133.598s
Hamming stage: 175.685s
IVF + RaBitQ materialize stage: 178.305s
IVF + RaBitQ result rows: 200000
IVF + RaBitQ total runtime (no GCS write): 487.591s
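
The two-stage shape of the ivf-rabitq path (cheap Hamming filter first, exact cosine rerank on the survivors) can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: sign_code is a hypothetical one-bit-per-dimension stand-in and not the actual RaBitQ encoding, and the function name ivf_rabitq_search and the dictionary layouts are made up for the example.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance, used only to rank centroids."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; zero-norm vectors are treated as maximally far."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def sign_code(vec):
    """ASSUMPTION: simplified binary code (bit i set when vec[i] >= 0)."""
    code = 0
    for i, x in enumerate(vec):
        if x >= 0:
            code |= 1 << i
    return code


def hamming(code_a, code_b):
    """Number of differing bits between two integer bit codes."""
    return bin(code_a ^ code_b).count("1")


def ivf_rabitq_search(queries, centroids, postings, vectors,
                      nprobe, n_candidates, top_k):
    """Hypothetical in-memory stand-in for the ivf-rabitq path.

    postings: {cluster_id: [(vector_id, bit_code), ...]}
    vectors:  {vector_id: vector}
    """
    results = {}
    for query_id, query in queries.items():
        # Rank centroids by squared L2 and keep nprobe clusters.
        probes = sorted(
            centroids, key=lambda cid: squared_l2(query, centroids[cid])
        )[:nprobe]
        # Coarse stage: Hamming distance against the postings' bit codes.
        query_code = sign_code(query)
        coarse = sorted(
            (hamming(query_code, code), vec_id)
            for cid in probes
            for vec_id, code in postings.get(cid, [])
        )
        survivors = [vec_id for _, vec_id in coarse[:n_candidates]]
        # Fine stage: exact cosine distance on the surviving candidates only.
        exact = sorted(
            (cosine_distance(query, vectors[vec_id]), vec_id)
            for vec_id in survivors
        )
        results[query_id] = [vec_id for _, vec_id in exact[:top_k]]
    return results
```

Only the n_candidates survivors are joined back to the full vectors, so the exact-distance step handles a bounded number of rows per query (1000 in the run above) regardless of cluster size.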

GitHub link: 
https://github.com/apache/hudi/discussions/18500#discussioncomment-16909264
