I'm trying to better understand the code, so IIUC the vector MAX_DIMENSIONS is currently used inside

lucene/core/src/java/org/apache/lucene/document/FieldType.java
lucene/core/src/java/org/apache/lucene/document/KnnFloatVectorField.java
lucene/core/src/java/org/apache/lucene/document/KnnByteVectorField.java
lucene/core/src/java/org/apache/lucene/index/FloatVectorValues.java
    public static final int MAX_DIMENSIONS = 1024;
lucene/core/src/java/org/apache/lucene/index/ByteVectorValues.java
    public static final int MAX_DIMENSIONS = 1024;
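
If I read the enforcement correctly, the check itself looks roughly
like this (my paraphrase of what happens in FieldType when the vector
attributes are set, not the verbatim code, so details may differ):

    // paraphrased sketch of the dimension check as I understand it
    if (numDimensions > FloatVectorValues.MAX_DIMENSIONS) {
      throw new IllegalArgumentException(
          "vector numDimensions must be <= " + FloatVectorValues.MAX_DIMENSIONS
              + "; got " + numDimensions);
    }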

and when you write that it should be moved to the HNSW-specific code, do you mean moving it somewhere into

lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java
lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsReader.java
lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java
lucene/core/src/java/org/apache/lucene/codecs/lucene95/OffHeapByteVectorValues.java
lucene/core/src/java/org/apache/lucene/codecs/lucene95/OffHeapFloatVectorValues.java
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java
lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborQueue.java
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
lucene/core/src/java/org/apache/lucene/util/hnsw/RandomAccessVectorValues.java

?

Thanks

Michael




On 17.05.23 03:50, Robert Muir wrote:
by the way, I agree with the idea to MOVE THE LIMIT UNCHANGED to the hnsw-specific code.

This way, someone can write an alternative codec with vectors using some other, completely different approach that incorporates a different, more appropriate limit (maybe lower, maybe higher) depending upon their tradeoffs. We should encourage this, as I think it is the "only true fix" to the scalability issues: use a scalable algorithm! Also, alternative codecs don't force the project into many years of index backwards compatibility, which is really my ultimate concern. We can lock ourselves into a truly bad place and become irrelevant (especially with scalar code implementing all this vector stuff, it is really senseless). In the meantime I suggest we try to reduce pain for the default codec with the current implementation if possible. If that is not possible, we need a new codec that performs.
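
To sketch what I mean (hypothetical code, nothing like this exists
today; the format name is invented, only the KnnVectorsFormat API is
real):

import java.io.IOException;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.codecs.KnnVectorsWriter;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;

// Hypothetical alternative codec that owns its own limit. A brute-force
// flat scan builds no graph, so it can justify a different cap than
// HNSW (maybe lower, maybe higher), without touching anyone else.
public final class FlatScanVectorsFormat extends KnnVectorsFormat {

  // this format's own tradeoff, independent of the HNSW limit
  public static final int MAX_DIMENSIONS = 4096;

  public FlatScanVectorsFormat() {
    super("FlatScanVectorsFormat");
  }

  @Override
  public KnnVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException {
    // a real implementation would write flat, scannable vectors here
    throw new UnsupportedOperationException("sketch only");
  }

  @Override
  public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException {
    throw new UnsupportedOperationException("sketch only");
  }
}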

On Tue, May 16, 2023 at 8:53 PM Robert Muir <rcm...@gmail.com> wrote:

    Gus, I think I explained myself multiple times on issues and in
    this thread. The performance is unacceptable, everyone knows it,
    but nobody is talking about it.
    I don't need to explain myself time and time again here.
    You don't seem to understand the technical issues (at least you
    sure as fuck don't know how service loading works, or you wouldn't
    have opened https://github.com/apache/lucene/issues/12300 😂)

    I'm just the only one here completely unconstrained by any of
    Silicon Valley's influences to speak my true mind, without any
    repercussions, so I do it. I don't give any fucks about ChatGPT.

    I'm standing by my technical veto. If you bypass it, I'll revert
    the offending commit.

    As far as fixing the technical performance goes, I just opened an
    issue with some ideas to at least improve CPU usage by a factor of
    N. It does not help with the crazy heap memory usage or other
    issues of the KNN implementation causing shit like OOM on merge.
    But it is one step: https://github.com/apache/lucene/issues/12302



    On Tue, May 16, 2023 at 7:45 AM Gus Heck <gus.h...@gmail.com> wrote:

        Robert,

        Can you explain in clear technical terms the standard that
        must be met for performance? A benchmark that must run in X
        time on Y hardware, for example (and why that test is
        suitable)? Or some other reproducible criterion? So far I've
        heard you give an *opinion* that it's unusable, but that's not
        a technical criterion; others may have a different concept of
        what is usable to them.

        Forgive me if I misunderstand, but the essence of your
        argument has seemed to be

        "Performance isn't good enough, therefore we should force
        anyone who wants to experiment with something bigger to fork
        the code base to do it"

        Thus, it is necessary to have a clear, unambiguous standard
        that anyone can verify for "good enough". A clear standard
        would also focus efforts at improvement.

        Where are the goal posts?

        FWIW I'm +1 on any of options 2-4, since I believe the
        existence of a hard limit is fundamentally counterproductive
        in an open source setting, as it will lead to *fewer people*
        pushing the limits. Extremely few people are going to get into
        the nitty-gritty of optimizing things unless they are staring
        at code that they can prove does something interesting but
        doesn't run fast enough for their purposes. If people hit a
        hard limit, more of them give up and never develop the code
        that will motivate them to look for optimizations.

        -Gus

        On Tue, May 16, 2023 at 6:04 AM Robert Muir <rcm...@gmail.com>
        wrote:

            I still feel -1 (veto) on increasing this limit. Sending
            more emails does not change the technical facts or make
            the veto go away.

            On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti
            <a.benede...@sease.io> wrote:

                Hi all,
                we have finalized all the options proposed by the
                community and we are ready to vote for the preferred
                one and then proceed with the implementation.

                *Option 1*
                Keep it as it is (dimension limit hardcoded to 1024)
                *Motivation*:
                We are close to improving on many fronts. Given the
                criticality of Lucene in computing infrastructure and
                the concerns raised by one of the most active stewards
                of the project, I think we should keep working toward
                improving the feature as is, and move to raise the
                limit after we can demonstrate improvement unambiguously.

                *Option 2*
                Make the limit configurable, for example through a
                system property.
                *Motivation*:
                The system administrator can enforce a limit that
                users need to respect, in line with whatever the
                admin has decided is acceptable for them.
                The default can stay the current one.
                This should open the doors for Apache Solr,
                Elasticsearch, OpenSearch, and any sort of plugin
                development.

                *Option 3*
                Move the max dimension limit to a lower level, into an
                HNSW-specific implementation. Once there, this limit
                would not bind any other potential vector engine
                alternative/evolution.
                *Motivation:*
                There seem to be contradictory performance
                interpretations of the current HNSW implementation.
                Some consider its performance OK, some do not, and it
                depends on the target data set and use case.
                Increasing the max dimension limit where it currently
                lives (in the top-level FloatVectorValues) would not
                allow potential alternatives (e.g. for other use
                cases) to be based on a lower limit.
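
                As a sketch (illustrative only, not the actual patch),
                the constant would simply be relocated:

                // sketch: moved unchanged from FloatVectorValues /
                // ByteVectorValues into Lucene95HnswVectorsFormat, so
                // only this codec, not vectors in general, is bound by it
                public static final int MAX_DIMENSIONS = 1024;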

                *Option 4*
                Make it configurable and move it to an appropriate place.
                In particular, a
                simple Integer.getInteger("lucene.hnsw.maxDimensions",
                1024) should be enough.
                *Motivation*:
                Both are good and not mutually exclusive and could
                happen in any order.
                Someone suggested perfecting what the _default_ limit
                should be, but I've not seen an argument _against_
                configurability. Especially in this way -- a toggle
                that doesn't bind Lucene's APIs in any way.
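
                As a sketch, the whole change could be as small as this
                (property name as proposed above, default unchanged):

                // sketch: default stays 1024; an expert user can override
                // it at JVM startup, e.g. -Dlucene.hnsw.maxDimensions=2048
                public static final int MAX_DIMENSIONS =
                    Integer.getInteger("lucene.hnsw.maxDimensions", 1024);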

                I'll keep this [VOTE] open for a week and then proceed
                with the implementation.
                --------------------------
                *Alessandro Benedetti*
                Director @ Sease Ltd.
                /Apache Lucene/Solr Committer/
                /Apache Solr PMC Member/

                e-mail: a.benede...@sease.io

                *Sease* - Information Retrieval Applied
                Consulting | Training | Open Source

                Website: Sease.io <http://sease.io/>
                LinkedIn <https://linkedin.com/company/sease-ltd> |
                Twitter <https://twitter.com/seaseltd> | Youtube
                <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
                Github <https://github.com/seaseltd>



--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)
