On 18.05.23 at 12:22, Michael McCandless wrote:

I love all the energy and passion going into debating all the ways to poke at this limit, but please let's also spend some of this passion on actually improving the scalability of our aKNN implementation!  E.g. Robert opened an exciting "Plan B" ( https://github.com/apache/lucene/issues/12302 ) to work around OpenJDK's crazy slowness on enabling access to vectorized SIMD CPU instructions (the Java Vector API, JEP 426: https://openjdk.org/jeps/426 ).  This could help postings and doc values performance too!


Agreed, but I do not think the MAX_DIMENSIONS decision should depend on this, because whatever improvements are eventually accomplished, there will very likely always be some limit.

Thanks

Michael


Mike McCandless

http://blog.mikemccandless.com


On Thu, May 18, 2023 at 5:24 AM Alessandro Benedetti <a.benede...@sease.io> wrote:

    That's great and a good plan B, but let's try to keep this thread
    focused on collecting votes for a week (let's keep discussions on the
    nice PR opened by David or on the discussion thread we already have
    in the mailing list :)

    On Thu, 18 May 2023, 10:10 Ishan Chattopadhyaya,
    <ichattopadhy...@gmail.com> wrote:

        That sounds promising, Michael. Can you share
        scripts/steps/code to reproduce this?

        On Thu, 18 May, 2023, 1:16 pm Michael Wechner,
        <michael.wech...@wyona.com> wrote:

            I just implemented it and tested it with OpenAI's
            text-embedding-ada-002, which uses 1536 dimensions, and it
            works very well :-)
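
            For anyone who wants to reproduce this, a rough sketch of such
            a test could look like the following -- the field name and the
            placeholder vectors are only illustrative, and it assumes the
            1024-dimension check has been raised or bypassed as discussed
            elsewhere in this thread:

            import org.apache.lucene.document.Document;
            import org.apache.lucene.document.KnnFloatVectorField;
            import org.apache.lucene.index.DirectoryReader;
            import org.apache.lucene.index.IndexWriter;
            import org.apache.lucene.index.IndexWriterConfig;
            import org.apache.lucene.index.VectorSimilarityFunction;
            import org.apache.lucene.search.IndexSearcher;
            import org.apache.lucene.search.KnnFloatVectorQuery;
            import org.apache.lucene.search.TopDocs;
            import org.apache.lucene.store.ByteBuffersDirectory;
            import org.apache.lucene.store.Directory;

            public class Ada002KnnSketch {
              public static void main(String[] args) throws Exception {
                int dims = 1536; // text-embedding-ada-002 dimensionality
                try (Directory dir = new ByteBuffersDirectory();
                     IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
                  // In a real test the vectors would come from the OpenAI
                  // embeddings API; a placeholder of the right length stands in here.
                  float[] embedding = new float[dims];
                  embedding[0] = 1f; // avoid an all-zero vector for cosine similarity
                  Document doc = new Document();
                  doc.add(new KnnFloatVectorField("embedding", embedding,
                      VectorSimilarityFunction.COSINE));
                  writer.addDocument(doc);
                  writer.commit();

                  try (DirectoryReader reader = DirectoryReader.open(dir)) {
                    IndexSearcher searcher = new IndexSearcher(reader);
                    float[] query = new float[dims];
                    query[0] = 1f;
                    TopDocs hits = searcher.search(
                        new KnnFloatVectorQuery("embedding", query, 10), 10);
                    System.out.println("matches: " + hits.totalHits);
                  }
                }
              }
            }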

            Thanks

            Michael



            On 18.05.23 at 00:29, Michael Wechner wrote:
            IIUC KnnVectorField is deprecated and one is supposed to
            use KnnFloatVectorField when using float vector values,
            right?
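
            (For anyone following along, the switch is just the field
            class; the field name, vector, and similarity function below
            are only illustrative:)

            // Deprecated:
            doc.add(new KnnVectorField("embedding", vector,
                VectorSimilarityFunction.COSINE));
            // Current replacement for float-valued vectors:
            doc.add(new KnnFloatVectorField("embedding", vector,
                VectorSimilarityFunction.COSINE));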

            On 17.05.23 at 16:41, Michael Sokolov wrote:
            see https://markmail.org/message/kf4nzoqyhwacb7ri

            On Wed, May 17, 2023 at 10:09 AM David Smiley
            <dsmi...@apache.org> wrote:

                > easily be circumvented by a user

                This is a revelation to me and others, if true. 
                Michael, please then point to a test or code snippet
                that shows the Lucene user community what they want
                to see so they are unblocked from their explorations
                of vector search.

                ~ David Smiley
                Apache Lucene/Solr Search Developer
                http://www.linkedin.com/in/davidwsmiley


                On Wed, May 17, 2023 at 7:51 AM Michael Sokolov
                <msoko...@gmail.com> wrote:

                    I think I've said before on this list that we don't
                    actually enforce the limit in any way that can't
                    easily be circumvented by a user. The codec
                    already supports any size vector - it doesn't
                    impose any limit. The way the API is written you
                    can *already today* create an index with max-int
                    sized vectors, and we are committed to supporting
                    that going forward by our backwards
                    compatibility policy, as Robert points out. This
                    wasn't intentional, I think, but those are the facts.
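
                    As a sketch of what such a circumvention might look
                    like (hypothetical class, not an endorsed usage; it
                    assumes the dimension check lives only in the
                    field/FieldType layer, while the codec itself never
                    re-checks the value):

                    import org.apache.lucene.document.FieldType;
                    import org.apache.lucene.index.VectorEncoding;
                    import org.apache.lucene.index.VectorSimilarityFunction;

                    // FieldType subclass that simply reports a dimension above
                    // the default limit, sidestepping the setter that enforces it.
                    public class UncheckedVectorFieldType extends FieldType {
                      private final int dims;

                      public UncheckedVectorFieldType(int dims) {
                        this.dims = dims;
                        freeze();
                      }

                      @Override
                      public int vectorDimension() {
                        return dims;
                      }

                      @Override
                      public VectorEncoding vectorEncoding() {
                        return VectorEncoding.FLOAT32;
                      }

                      @Override
                      public VectorSimilarityFunction vectorSimilarityFunction() {
                        return VectorSimilarityFunction.COSINE;
                      }
                    }

                    If, as described above, nothing further down re-checks the
                    dimension, a field built as new KnnFloatVectorField("embedding",
                    vector, new UncheckedVectorFieldType(vector.length)) would
                    carry vectors of any length into the codec.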

                    Given that, I think this whole discussion is not
                    really necessary.

                    On Tue, May 16, 2023 at 4:50 AM Alessandro
                    Benedetti <a.benede...@sease.io> wrote:

                        Hi all,
                        we have finalized all the options proposed
                        by the community and we are ready to vote
                        for the preferred one and then proceed with
                        the implementation.

                        *Option 1*
                        Keep it as it is (dimension limit hardcoded
                        to 1024)
                        *Motivation*:
                        We are close to improving on many fronts.
                        Given the criticality of Lucene in computing
                        infrastructure and the concerns raised by
                        one of the most active stewards of the
                        project, I think we should keep working
                        toward improving the feature as is, and raise
                        the limit only after we can demonstrate
                        improvement unambiguously.

                        *Option 2*
                        make the limit configurable, for example
                        through a system property
                        *Motivation*:
                        The system administrator can enforce a limit
                        that users need to respect, in line with
                        whatever the admin decides is acceptable for
                        them.
                        The default can stay the current one.
                        This should open the door for Apache Solr,
                        Elasticsearch, OpenSearch, and any sort of
                        plugin development.

                        *Option 3*
                        Move the max dimension limit to a lower
                        level, into the HNSW-specific implementation.
                        Once there, this limit would not bind any
                        other potential vector engine
                        alternative/evolution.
                        *Motivation*:
                        There seem to be contradictory performance
                        interpretations of the current HNSW
                        implementation. Some consider its performance
                        ok, some do not, and it depends on the target
                        data set and use case.
                        Increasing the max dimension limit where it
                        currently lives (in the top-level
                        FloatVectorValues) would not allow potential
                        alternatives (e.g. for other use cases) to be
                        based on a lower limit.

                        *Option 4*
                        Make it configurable and move it to an
                        appropriate place.
                        In particular, a
                        simple Integer.getInteger("lucene.hnsw.maxDimensions",
                        1024) should be enough.
                        *Motivation*:
                        Both are good and not mutually exclusive and
                        could happen in any order.
                        Someone suggested perfecting what the
                        _default_ limit should be, but I've not seen
                        an argument _against_ configurability,
                        especially in this form -- a toggle that
                        doesn't bind Lucene's APIs in any way.
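
                        As a concrete illustration of Option 4 (the
                        property name is the one proposed above; the
                        holder class below is hypothetical and only
                        shows the mechanism):

                        // Limit read once from a system property, falling back
                        // to the current hardcoded value.
                        // Run with e.g.: java -Dlucene.hnsw.maxDimensions=2048 ...
                        public final class HnswLimits {
                          public static final int DEFAULT_MAX_DIMENSIONS = 1024;

                          public static final int MAX_DIMENSIONS =
                              Integer.getInteger("lucene.hnsw.maxDimensions",
                                                 DEFAULT_MAX_DIMENSIONS);

                          private HnswLimits() {}
                        }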

                        I'll keep this [VOTE] open for a week and
                        then proceed to the implementation.
                        --------------------------
                        *Alessandro Benedetti*
                        Director @ Sease Ltd.
                        /Apache Lucene/Solr Committer/
                        /Apache Solr PMC Member/

                        e-mail: a.benede...@sease.io

                        *Sease* - Information Retrieval Applied
                        Consulting | Training | Open Source

                        Website: Sease.io <http://sease.io/>
                        LinkedIn <https://linkedin.com/company/sease-ltd> |
                        Twitter <https://twitter.com/seaseltd> |
                        Youtube <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
                        Github <https://github.com/seaseltd>


