benwtrent commented on PR #14078:
URL: https://github.com/apache/lucene/pull/14078#issuecomment-2713454213
Hey @lpld
> May I also ask about the selection of datasets being used for the
> benchmarks? How do you choose them?
I haven't tested with SIFT, though be sure to use Euclidean distance when
testing it. I would imagine that vectors with so few dimensions (SIFT has
128) might not perform very well. There is just too much information loss.
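As a rough illustration of why the choice of metric matters, here is a minimal sketch in plain Python. It is not Lucene code, and the vectors and names (`query`, `docs`, `near`, `long`) are made up for the example. It shows that, for vectors that are not normalized, the nearest neighbour under Euclidean distance can differ from the best match under max inner product:

```python
import math


def euclidean(a, b):
    """Euclidean (L2) distance: smaller means closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dot(a, b):
    """Inner product: larger means a better match."""
    return sum(x * y for x, y in zip(a, b))


# Toy, hypothetical data: "near" sits right next to the query,
# "long" points in a similar direction but has a much larger magnitude.
query = [1.0, 0.0]
docs = {
    "near": [1.0, 0.1],
    "long": [5.0, 3.0],
}

nearest_by_euclidean = min(docs, key=lambda k: euclidean(query, docs[k]))
nearest_by_inner_product = max(docs, key=lambda k: dot(query, docs[k]))

print(nearest_by_euclidean, nearest_by_inner_product)
```

The two metrics pick different documents here, so benchmarking a dataset with a metric it was not built for can give misleading recall numbers.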
But the datasets I have been using are ones built with modern
transformer-based models. Lucene Util has tooling for downloading and using
Cohere multilingual embeddings (max inner product, 768 dims).
Specifically, for this data format, we did testing with the following
datasets and models:
- https://huggingface.co/Snowflake/snowflake-arctic-embed-l with dbpedia
- https://huggingface.co/intfloat/e5-small with dbpedia, hotpotqa, quora,
fiqa
- https://huggingface.co/thenlper/gte-base with hotpotqa, fiqa, dbpedia
- And of course, Cohere v3 multilingual:
https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3
- https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings
- GIST-1M (sibling dataset to SIFT), with Euclidean and max inner product.
If you are testing for a product, I would use the model that you are
planning to use.