It is basically the code which Michael Sokolov posted at

https://markmail.org/message/kf4nzoqyhwacb7ri

except
 - that I have replaced KnnVectorField with KnnFloatVectorField, because KnnVectorField is deprecated.
 - that I don't hard-code the dimension as 2048 and the metric as EUCLIDEAN, but take the dimension and metric (VectorSimilarityFunction) used by the model. For text-embedding-ada-002, for example, these are 1536 and COSINE (https://platform.openai.com/docs/guides/embeddings/which-distance-function-should-i-use)

HTH

Michael



On 18.05.23 at 11:10, Ishan Chattopadhyaya wrote:
That sounds promising, Michael. Can you share scripts/steps/code to reproduce this?

On Thu, 18 May, 2023, 1:16 pm Michael Wechner, <michael.wech...@wyona.com> wrote:

    I just implemented it and tested it with OpenAI's
    text-embedding-ada-002, which uses 1536 dimensions, and it
    works fine :-)

    Thanks

    Michael



    On 18.05.23 at 00:29, Michael Wechner wrote:
    IIUC KnnVectorField is deprecated and one is supposed to use
    KnnFloatVectorField when using float as vector values, right?

    On 17.05.23 at 16:41, Michael Sokolov wrote:
    see https://markmail.org/message/kf4nzoqyhwacb7ri

    On Wed, May 17, 2023 at 10:09 AM David Smiley
    <dsmi...@apache.org> wrote:

        > easily be circumvented by a user

        This is a revelation to me and others, if true.  Michael,
        please then point to a test or code snippet that shows the
        Lucene user community what they want to see so they are
        unblocked from their explorations of vector search.

        ~ David Smiley
        Apache Lucene/Solr Search Developer
        http://www.linkedin.com/in/davidwsmiley


        On Wed, May 17, 2023 at 7:51 AM Michael Sokolov
        <msoko...@gmail.com> wrote:

            I think I've said before on this list we don't actually
            enforce the limit in any way that can't easily be
            circumvented by a user. The codec already supports any
            size vector - it doesn't impose any limit. The way the
            API is written you can *already today* create an index
            with max-int sized vectors and we are committed to
            supporting that going forward by our backwards
            compatibility policy, as Robert points out. This wasn't
            intentional, I think, but those are the facts.

            Given that, I think this whole discussion is not really
            necessary.

            On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti
            <a.benede...@sease.io> wrote:

                Hi all,
                we have finalized all the options proposed by the
                community and we are ready to vote for the preferred
                one and then proceed with the implementation.

                *Option 1*
                Keep it as it is (dimension limit hardcoded to 1024)
                *Motivation*:
                We are close to improving on many fronts. Given the
                criticality of Lucene in computing infrastructure
                and the concerns raised by one of the most active
                stewards of the project, I think we should keep
                working toward improving the feature as is and move
                to raise the limit once we can unambiguously
                demonstrate improvement.

                *Option 2*
                make the limit configurable, for example through a
                system property
                *Motivation*:
                The system administrator can enforce a limit that
                users must respect, in line with whatever the admin
                has decided is acceptable for them.
                The default can stay the current one.
                This should open the doors for Apache Solr,
                Elasticsearch, OpenSearch, and any sort of plugin
                development

                *Option 3*
                Move the max dimension limit down to an HNSW-specific
                implementation. Once there, this limit
                would not bind any other potential vector engine
                alternative/evolution.
                *Motivation*:
                There seem to be contradictory
                performance interpretations about the current HNSW
                implementation. Some consider its performance ok,
                some not, and it depends on the target data set and
                use case. Increasing the max dimension limit where
                it is currently (in top level FloatVectorValues)
                would not allow potential alternatives (e.g. for
                other use-cases) to be based on a lower limit.

                *Option 4*
                Make it configurable and move it to an appropriate
                place.
                In particular, a
                simple Integer.getInteger("lucene.hnsw.maxDimensions",
                1024) should be enough.
                *Motivation*:
                Both are good and not mutually exclusive and could
                happen in any order.
                Someone suggested to perfect what the _default_
                limit should be, but I've not seen an argument
                _against_ configurability.  Especially in this way
                -- a toggle that doesn't bind Lucene's APIs in any way.
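
A minimal sketch of how the lookup proposed in Option 4 could behave, assuming the limit is read from the system property on each check. The property name "lucene.hnsw.maxDimensions" and the default of 1024 come from the proposal above. The class MaxDimensionConfig and the method checkDimension are hypothetical names used only for illustration and are not part of Lucene.

```java
// Hypothetical helper, not part of Lucene: illustrates the Option 4 proposal only.
public class MaxDimensionConfig {

    // Property name and default taken from the proposal in this thread.
    static final String PROPERTY = "lucene.hnsw.maxDimensions";
    static final int DEFAULT_MAX = 1024;

    // Reads the limit from the system property, falling back to the default
    // when the property is unset or not a valid integer.
    static int maxDimensions() {
        return Integer.getInteger(PROPERTY, DEFAULT_MAX);
    }

    // Rejects vector dimensions that are non-positive or above the configured limit.
    static void checkDimension(int dimension) {
        int max = maxDimensions();
        if (dimension <= 0 || dimension > max) {
            throw new IllegalArgumentException(
                "vector dimension " + dimension + " must be in [1, " + max + "]");
        }
    }

    public static void main(String[] args) {
        checkDimension(768);                    // within the default limit
        System.setProperty(PROPERTY, "2048");   // same effect as -Dlucene.hnsw.maxDimensions=2048
        checkDimension(1536);                   // accepted once the limit is raised
        System.out.println("max dimensions: " + maxDimensions());
    }
}
```

In a real deployment the property would normally be set on the JVM command line (for example -Dlucene.hnsw.maxDimensions=2048) rather than in code, so the administrator controls the limit without any API change.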

                I'll keep this [VOTE] open for a week and then
                proceed to the implementation.
                --------------------------
                *Alessandro Benedetti*
                Director @ Sease Ltd.
                /Apache Lucene/Solr Committer/
                /Apache Solr PMC Member/

                e-mail: a.benede...@sease.io

                *Sease* - Information Retrieval Applied
                Consulting | Training | Open Source

                Website: Sease.io <http://sease.io/>
                LinkedIn <https://linkedin.com/company/sease-ltd> |
                Twitter <https://twitter.com/seaseltd> | Youtube
                <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
                Github <https://github.com/seaseltd>


