It is basically the code which Michael Sokolov posted at
https://markmail.org/message/kf4nzoqyhwacb7ri
except that
- I have replaced KnnVectorField with KnnFloatVectorField, because
KnnVectorField is deprecated, and
- I don't hard-code the dimension as 2048 and the metric as
EUCLIDEAN, but take the dimension and metric (VectorSimilarityFunction)
used by the model, which in the case of, for example,
text-embedding-ada-002 are 1536 and COSINE
(https://platform.openai.com/docs/guides/embeddings/which-distance-function-should-i-use).
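In rough outline the field creation looks like this (a minimal sketch only, assuming Lucene 9.x; the class and method names are just placeholders, and the FieldType override is my reading of the approach in Michael's post rather than the exact code):

import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.VectorEncoding;
import org.apache.lucene.index.VectorSimilarityFunction;

public class ModelAwareVectorField {

  // Build a KNN vector field whose dimension and similarity come from the
  // embedding model (e.g. 1536 / COSINE for text-embedding-ada-002) instead
  // of being hard-coded. Overriding the FieldType's vector attributes avoids
  // the max-dimension check done by the convenience constructors.
  public static KnnFloatVectorField create(
      String name, float[] vector, VectorSimilarityFunction similarity) {
    FieldType fieldType =
        new FieldType() {
          @Override
          public int vectorDimension() {
            return vector.length;
          }

          @Override
          public VectorEncoding vectorEncoding() {
            return VectorEncoding.FLOAT32;
          }

          @Override
          public VectorSimilarityFunction vectorSimilarityFunction() {
            return similarity;
          }
        };
    return new KnnFloatVectorField(name, vector, fieldType);
  }
}

Indexing is then just doc.add(ModelAwareVectorField.create("vector", embedding, VectorSimilarityFunction.COSINE)), where "embedding" is whatever float[] the embedding client returns.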
HTH
Michael
On 18.05.23 at 11:10, Ishan Chattopadhyaya wrote:
That sounds promising, Michael. Can you share scripts/steps/code to
reproduce this?
On Thu, 18 May, 2023, 1:16 pm Michael Wechner,
<michael.wech...@wyona.com> wrote:
I just implemented it and tested it with OpenAI's
text-embedding-ada-002, which uses 1536 dimensions, and it
works very well :-)
Thanks
Michael
On 18.05.23 at 00:29, Michael Wechner wrote:
IIUC KnnVectorField is deprecated and one is supposed to use
KnnFloatVectorField when using floats as vector values, right?
On 17.05.23 at 16:41, Michael Sokolov wrote:
see https://markmail.org/message/kf4nzoqyhwacb7ri
On Wed, May 17, 2023 at 10:09 AM David Smiley
<dsmi...@apache.org> wrote:
> easily be circumvented by a user
This is a revelation to me and others, if true. Michael,
please then point to a test or code snippet that shows the
Lucene user community what it wants to see, so that people are
unblocked in their explorations of vector search.
~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley
On Wed, May 17, 2023 at 7:51 AM Michael Sokolov
<msoko...@gmail.com> wrote:
I think I've said before on this list that we don't actually
enforce the limit in any way that can't easily be
circumvented by a user. The codec already supports any
size vector - it doesn't impose any limit. The way the
API is written, you can *already today* create an index
with max-int-sized vectors, and we are committed to
supporting that going forward by our backwards-compatibility
policy, as Robert points out. This wasn't
intentional, I think, but those are the facts.
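To make the "the codec doesn't impose any limit" point concrete, searching such a field is just the normal API (a minimal sketch, assuming Lucene 9.x and a field named "vector" indexed with an oversized vector; directory setup and error handling elided):

import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;

public class VectorSearchSketch {

  // Run a k-NN search with a query vector of the same (large) dimension as
  // the indexed field; the query API takes whatever float[] it is given.
  public static void search(Directory dir, float[] queryVector) throws IOException {
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      TopDocs hits = searcher.search(new KnnFloatVectorQuery("vector", queryVector, 10), 10);
      System.out.println("top hits: " + hits.totalHits);
    }
  }
}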
Given that, I think this whole discussion is not really
necessary.
On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti
<a.benede...@sease.io> wrote:
Hi all,
we have finalized all the options proposed by the
community and we are ready to vote for the preferred
one and then proceed with the implementation.
*Option 1*
Keep it as it is (dimension limit hardcoded to 1024)
*Motivation*:
We are close to improving on many fronts. Given the
criticality of Lucene in computing infrastructure
and the concerns raised by one of the most active
stewards of the project, I think we should keep
working toward improving the feature as is, and move
to raise the limit only after we can demonstrate improvement
unambiguously.
*Option 2*
Make the limit configurable, for example through a
system property.
*Motivation*:
The system administrator can enforce a limit that their
users need to respect and that is in line with
whatever the admin has decided is acceptable for them.
The default can stay the current one.
This should open the doors for Apache Solr,
Elasticsearch, OpenSearch, and any sort of plugin
development.
*Option 3*
Move the max dimension limit to a lower level, into an
HNSW-specific implementation. Once there, this limit
would not bind any other potential vector engine
alternative/evolution.
*Motivation:* There seem to be contradictory
performance interpretations of the current HNSW
implementation. Some consider its performance OK,
some do not, and it depends on the target data set and
use case. Increasing the max dimension limit where
it currently lives (in the top-level FloatVectorValues)
would not allow potential alternatives (e.g. for
other use cases) to be based on a lower limit.
*Option 4*
Make it configurable and move it to an appropriate
place.
In particular, a
simple Integer.getInteger("lucene.hnsw.maxDimensions",
1024) should be enough.
*Motivation*:
Both changes are good, are not mutually exclusive, and could
happen in any order.
Someone suggested first settling what the _default_
limit should be, but I've not seen an argument
_against_ configurability, especially in this form:
a toggle that doesn't bind Lucene's APIs in any way (a minimal
sketch follows below).
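For illustration only, such a toggle could look roughly like this (a hypothetical sketch, not a committed design; the class name and placement are placeholders, only the property name and default come from the proposal above):

public final class HnswLimits {

  // Hypothetical sketch of Option 4: the max dimension limit becomes a JVM
  // system property with the current value (1024) as the default, instead of
  // a hard-coded constant.
  public static final int MAX_DIMENSIONS =
      Integer.getInteger("lucene.hnsw.maxDimensions", 1024);

  public static void checkDimension(int numDimensions) {
    if (numDimensions > MAX_DIMENSIONS) {
      throw new IllegalArgumentException(
          "vector dimension must be <= " + MAX_DIMENSIONS + ", got " + numDimensions);
    }
  }

  private HnswLimits() {}
}

An operator would then run with, for example, -Dlucene.hnsw.maxDimensions=2048 to raise the limit, and everything stays as it is today when the property is not set.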
I'll keep this [VOTE] open for a week and then
proceed to the implementation.
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
/Apache Lucene/Solr Committer/
/Apache Solr PMC Member/
e-mail: a.benede...@sease.io
*Sease* - Information Retrieval Applied
Consulting | Training | Open Source
Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> |
Twitter <https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
Github <https://github.com/seaseltd>