Hi,

I don't understand the whole discussion here and I fully agree with Robert. As of now it IS possible to change the maximum vector dimensions by defining your own codec with a few lines of Java code. Solr is doing that today. This approach is IMHO perfectly ok for backwards compatibility, easy to do and allows people to kill their CPU and hardware as they like:

You just need a wrapper for the vectors format and glue that into the codec:

 * https://github.com/apache/solr/blob/3aa6aa2085ac3ec5b90d181a7db7577c57318d4a/solr/core/src/java/org/apache/solr/core/SchemaCodecFactory.java#L149-L178
 * https://github.com/apache/solr/blob/3aa6aa2085ac3ec5b90d181a7db7577c57318d4a/solr/core/src/java/org/apache/solr/core/SchemaCodecFactory.java#L125-L139

This has several nice features:

 * The default codec isn't changed, so there is no backwards
   compatibility issue.
 * The user must make sure the codec is constructed correctly. In case
   of Lucene codec updates they must apply the correct version numbers.
   This ensures that people know what they are doing!
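
To make the wrapper-plus-codec idea above concrete, here is a minimal, library-free sketch of the delegation pattern. It is only an illustration: VectorsFormat, DefaultVectorsFormat, DelegatingVectorsFormat and SketchCodec are hypothetical stand-ins invented for this example and are NOT Lucene or Solr classes, and the 2048 limit is an arbitrary example value. The real code is behind the two Solr links above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a vectors format; this is NOT a Lucene class. */
interface VectorsFormat {
    String name();
    int maxDimensions();
    void write(float[] vector);
}

/** Stand-in for the default format, using the 1024 limit discussed in this thread. */
final class DefaultVectorsFormat implements VectorsFormat {
    final List<float[]> written = new ArrayList<>();

    @Override public String name() { return "DefaultVectors"; }
    @Override public int maxDimensions() { return 1024; }
    @Override public void write(float[] vector) { written.add(vector); }
}

/** The wrapper: delegates everything except the dimension limit. */
final class DelegatingVectorsFormat implements VectorsFormat {
    private final VectorsFormat delegate;
    private final int maxDimensions;

    DelegatingVectorsFormat(VectorsFormat delegate, int maxDimensions) {
        this.delegate = delegate;
        this.maxDimensions = maxDimensions;
    }

    // In this sketch the wrapper simply reuses the delegate's name.
    @Override public String name() { return delegate.name(); }
    @Override public int maxDimensions() { return maxDimensions; }
    @Override public void write(float[] vector) { delegate.write(vector); }
}

/** Stand-in for a custom codec: it picks the format and enforces that format's limit. */
final class SketchCodec {
    private final String name;
    private final VectorsFormat format;

    SketchCodec(String name, VectorsFormat format) {
        this.name = name;
        this.format = format;
    }

    String name() { return name; }

    void addVector(float[] vector) {
        if (vector.length > format.maxDimensions()) {
            throw new IllegalArgumentException(
                "vector has " + vector.length + " dimensions, limit is " + format.maxDimensions());
        }
        format.write(vector);
    }
}

final class WrapperDemo {
    public static void main(String[] args) {
        DefaultVectorsFormat base = new DefaultVectorsFormat();
        // The custom codec opts in to a higher limit; the default format is untouched.
        SketchCodec custom = new SketchCodec("MyCodec", new DelegatingVectorsFormat(base, 2048));
        custom.addVector(new float[2048]);
        System.out.println(custom.name() + " wrote " + base.written.size() + " vector(s)");
    }
}
```

The point of the sketch is that the higher limit lives only in the user's own codec: anything built on the unwrapped default format still rejects vectors above 1024 dimensions.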

I agree this is a bit of overhead for the implementor, but it still allows advanced users to change the limit. Basically they define their own codec, and that's what Robert wants. As a simplification we may optionally add some easier ways to define a codec with less boilerplate code, but that's unrelated to the current discussion. I'd like to see some builder pattern to create a codec with a custom name.

Please stop arguing about all these limits!

Uwe

On 17.05.2023 at 04:58, Robert Muir wrote:
My problem is that it impacts the default codec, which is supported by our backwards compatibility policy for many years. We can't just let the user determine backwards compatibility with a sysprop. How will CheckIndex work? We have to have bounds and also allow for more performant implementations that might have different limitations. And I'm pretty sure that in the future we will want a faster implementation than what we have, and it will probably have different limits.

For other codecs, it is fine to have a different limit as I already said, as it is implementation dependent. And honestly the stuff in lucene/codecs can be more "Fast and loose" because it doesn't require the extensive index back compat guarantee.

Again, my ultimate concern is that index back compat guarantee. When it comes to limits, the proper way is not to just keep bumping them without technical reasons; instead, the correct approach is to fix the technical problems and make them irrelevant. Great example here (merged this morning): https://github.com/apache/lucene/commit/f53eb28af053d7612f7e4d1b2de05d33dc410645


On Tue, May 16, 2023 at 10:49 PM David Smiley <[email protected]> wrote:

    Robert, I have not heard from you (or anyone) an argument against
    System property based configurability (as I described in Option 4
    via a System property).  Uwe notes wisely some care must be taken
    to ensure it actually works.  Sure, of course.  What concerns do
    you have with this?

    ~ David Smiley
    Apache Lucene/Solr Search Developer
    http://www.linkedin.com/in/davidwsmiley


    On Tue, May 16, 2023 at 9:50 PM Robert Muir <[email protected]> wrote:

        By the way, I agree with the idea to MOVE THE LIMIT UNCHANGED
        to the HNSW-specific code.

        This way, someone can write an alternative codec with vectors
        using some other completely different approach that
        incorporates a different more appropriate limit (maybe lower,
        maybe higher) depending upon their tradeoffs. We should
        encourage this as I think it is the "only true fix" to the
        scalability issues: use a scalable algorithm! Also,
        alternative codecs don't force the project into many years of
        index backwards compatibility, which is really my ultimate
        concern. We can lock ourselves into a truly bad place and
        become irrelevant (especially with scalar code implementing
        all this vector stuff, it is really senseless). In the
        meantime I suggest we try to reduce pain for the default codec
        with the current implementation if possible. If it is not
        possible, we need a new codec that performs.

        On Tue, May 16, 2023 at 8:53 PM Robert Muir <[email protected]>
        wrote:

            Gus, I think I explained myself multiple times on issues
            and in this thread. The performance is unacceptable,
            everyone knows it, but nobody is talking about it.
            I don't need to explain myself time and time again here.
            You don't seem to understand the technical issues (at
            least you sure as fuck don't know how service loading
            works or you wouldn't have opened
            https://github.com/apache/lucene/issues/12300 😂)

            I'm just the only one here completely unconstrained by any
            of silicon valley's influences to speak my true mind,
            without any repercussions, so I do it. Don't give any
            fucks about ChatGPT.

            I'm standing by my technical veto. If you bypass it, I'll
            revert the offending commit.

            As far as fixing the technical performance, I just opened
            an issue with some ideas to at least improve cpu usage by
            a factor of N. It does not help with the crazy heap memory
            usage or other issues of KNN implementation causing shit
            like OOM on merge. But it is one step:
            https://github.com/apache/lucene/issues/12302



            On Tue, May 16, 2023 at 7:45 AM Gus Heck
            <[email protected]> wrote:

                Robert,

                Can you explain in clear technical terms the standard
                that must be met for performance? A benchmark that
                must run in X time on Y hardware for example (and why
                that test is suitable)? Or some other reproducible
                criteria? So far I've heard you give an *opinion* that
                it's unusable, but that's not a technical criterion;
                others may have a different concept of what is usable
                to them.

                Forgive me if I misunderstand, but the essence of your
                argument has seemed to be

                "Performance isn't good enough, therefore we should
                force anyone who wants to experiment with something
                bigger to fork the code base to do it"

                Thus, it is necessary to have a clear
                unambiguous standard that anyone can verify for "good
                enough". A clear standard would also focus efforts at
                improvement.

                Where are the goal posts?

                FWIW I'm +1 on any of 2-4 since I believe the
                existence of a hard limit is fundamentally
                counterproductive in an open source setting, as it
                will lead to *fewer people* pushing the limits.
                Extremely few people are going to get into the
                nitty-gritty of optimizing things unless they are
                staring at code that they can prove does something
                interesting, but doesn't run fast enough for their
                purposes. If people hit a hard limit, more of them
                give up and never develop the code that will motivate
                them to look for optimizations.

                -Gus

                On Tue, May 16, 2023 at 6:04 AM Robert Muir
                <[email protected]> wrote:

                    I am still -1 (veto) on increasing this limit.
                    Sending more emails does not change the technical
                    facts or make the veto go away.

                    On Tue, May 16, 2023 at 4:50 AM Alessandro
                    Benedetti <[email protected]> wrote:

                        Hi all,
                        we have finalized all the options proposed by
                        the community and we are ready to vote for the
                        preferred one and then proceed with the
                        implementation.

                        *Option 1*
                        Keep it as it is (dimension limit hardcoded to
                        1024)
                        *Motivation*:
                        We are close to improving on many fronts.
                        Given the criticality of Lucene in computing
                        infrastructure and the concerns raised by one
                        of the most active stewards of the project, I
                        think we should keep working toward improving
                        the feature as is and move to raise the limit
                        after we can demonstrate improvement
                        unambiguously.

                        *Option 2*
                        Make the limit configurable, for example
                        through a system property.
                        *Motivation*:
                        The system administrator can enforce a limit
                        that users need to respect and that is in line
                        with whatever the admin has decided is
                        acceptable for them.
                        The default can stay the current one.
                        This should open the doors for Apache Solr,
                        Elasticsearch, OpenSearch, and any sort of
                        plugin development.

                        *Option 3*
                        Move the max dimension limit down to an
                        HNSW-specific implementation. Once there, this
                        limit would not bind any other potential
                        vector engine alternative/evolution.
                        *Motivation*:
                        There seem to be contradictory
                        performance interpretations about the current
                        HNSW implementation. Some consider its
                        performance ok, some not, and it depends on
                        the target data set and use case. Increasing
                        the max dimension limit where it is currently
                        (in top level FloatVectorValues) would not
                        allow potential alternatives (e.g. for other
                        use-cases) to be based on a lower limit.

                        *Option 4*
                        Make it configurable and move it to an
                        appropriate place.
                        In particular, a
                        simple Integer.getInteger("lucene.hnsw.maxDimensions",
                        1024) should be enough.
                        *Motivation*:
                        Both are good and not mutually exclusive and
                        could happen in any order.
                        Someone suggested to perfect what the
                        _default_ limit should be, but I've not seen
                        an argument _against_ configurability. 
                        Especially in this way -- a toggle that
                        doesn't bind Lucene's APIs in any way.

                        I'll keep this [VOTE] open for a week and then
                        proceed to the implementation.
                        --------------------------
                        *Alessandro Benedetti*
                        Director @ Sease Ltd.
                        /Apache Lucene/Solr Committer/
                        /Apache Solr PMC Member/

                        e-mail: [email protected]

                        *Sease* - Information Retrieval Applied
                        Consulting | Training | Open Source

                        Website: Sease.io <http://sease.io/>
                        LinkedIn
                        <https://linkedin.com/company/sease-ltd> |
                        Twitter <https://twitter.com/seaseltd> |
                        Youtube <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
                        Github <https://github.com/seaseltd>



-- http://www.needhamsoftware.com (work)
                http://www.the111shift.com (play)

--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: [email protected]
