IIUC we all agree that the limit could be raised, but we need some solid reasoning for which limit makes sense, i.e. why we would set that particular limit (e.g. 2048), right?

Thanks

Michael


On 04.04.23 at 15:32, Michael McCandless wrote:
> I am not in favor of just doubling it as suggested by some people; I would ideally prefer a solution that lasts for a decent amount of time, rather than having to modify it every time someone requires a higher limit.

The problem with this approach is that it is a one-way door once released: we would not be able to lower the limit again in the future without possibly breaking some applications.

> For example, we don't limit the number of docs per index to an arbitrary maximum of N; you push in as many docs as you like, and if they are too many for your system, you get terrible performance/crashes/whatever.

Correction: we do check this limit and throw a specific exception now: https://github.com/apache/lucene/issues/6905
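
For reference, here is a minimal sketch of that kind of guard (my illustration, not Lucene's actual internals: IndexWriter.MAX_DOCS is a real public constant, slightly below Integer.MAX_VALUE, but the class and method names here are hypothetical):

    import org.apache.lucene.index.IndexWriter;

    // Hypothetical sketch of a doc-count guard similar to the one Lucene
    // applies internally; only IndexWriter.MAX_DOCS is from the real API.
    class DocLimitGuard {
      private long pendingNumDocs = 0;

      void reserveDocs(long addedNumDocs) {
        if (pendingNumDocs + addedNumDocs > IndexWriter.MAX_DOCS) {
          throw new IllegalArgumentException(
              "number of documents in the index cannot exceed " + IndexWriter.MAX_DOCS);
        }
        pendingNumDocs += addedNumDocs;
      }
    }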

+1 to raise the limit, but not remove it.

Mike McCandless

http://blog.mikemccandless.com


On Mon, Apr 3, 2023 at 9:51 AM Alessandro Benedetti <a.benede...@sease.io> wrote:

    ... and what would be the next limit?
    I guess we'll need to justify it better than the 1024 one.
    I appreciate that a limit is pretty much wanted by everyone,
    but I suspect we'll need some solid foundation for deciding
    the amount (and it should be high enough to avoid continuous
    changes).

    Cheers

    On Sun, 2 Apr 2023, 07:29 Michael Wechner,
    <michael.wech...@wyona.com> wrote:

        btw, what was the reasoning to set the current limit to 1024?

        Thanks

        Michael

        On 01.04.23 at 14:47, Michael Sokolov wrote:
        I'm also in favor of raising this limit. We do see some
        datasets with higher than 1024 dims. I also think we need
        to keep a limit: for example, we currently need to keep all
        the vectors in RAM while indexing, and we want to be able
        to support reasonable numbers of vectors in an index
        segment. Also, we don't know what innovations might come
        down the road. Maybe someday we want to do product
        quantization and enforce that (k, m) both fit in a byte --
        we wouldn't be able to do that if a vector's dimension were
        to exceed 32K.
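
        To make the product-quantization point concrete, here is a
        minimal, hypothetical encoder sketch (my illustration, not a
        Lucene API): with k <= 256 centroids per sub-quantizer, each
        of the m codes fits in a single byte, which is the kind of
        compact layout a dimension cap helps guarantee.

            import java.util.Arrays;

            // Hypothetical PQ encoder: split the vector into m sub-vectors
            // and store, for each, the index of its nearest centroid in one
            // byte. codebooks[i][c] is the c-th centroid (length subDim) of
            // sub-quantizer i; the byte cast is valid only because k <= 256.
            class PqEncoder {
              static byte[] encode(float[] vector, float[][][] codebooks) {
                int m = codebooks.length;
                int subDim = vector.length / m;
                byte[] codes = new byte[m];
                for (int i = 0; i < m; i++) {
                  float[] sub = Arrays.copyOfRange(vector, i * subDim, (i + 1) * subDim);
                  int best = 0;
                  float bestDist = Float.MAX_VALUE;
                  for (int c = 0; c < codebooks[i].length; c++) {
                    float dist = 0f;
                    for (int d = 0; d < subDim; d++) {
                      float diff = sub[d] - codebooks[i][c][d];
                      dist += diff * diff;
                    }
                    if (dist < bestDist) {
                      bestDist = dist;
                      best = c;
                    }
                  }
                  codes[i] = (byte) best; // fits only because k <= 256
                }
                return codes;
              }
            }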

        On Fri, Mar 31, 2023 at 11:57 AM Alessandro Benedetti
        <a.benede...@sease.io> wrote:

            I am also curious what the worst-case scenario would be
            if we removed the constant altogether (so the limit
            automatically becomes Java's Integer.MAX_VALUE).
            i.e.
            right now if you exceed the limit you get:

                if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
                  throw new IllegalArgumentException(
                      "cannot index vectors with dimension greater than "
                          + ByteVectorValues.MAX_DIMENSIONS);
                }


            in relation to:

                These limits allow us to better tune our data
                structures, prevent overflows, help ensure we have
                good test coverage, etc.

            I agree 100%, especially on typing things properly and
            avoiding resource waste here and there, but I am not
            entirely sure this is the case for the current
            implementation, i.e. do we have optimizations in place
            that assume the max dimension to be 1024?
            If I missed that (and I likely have), I of course suggest
            that the contribution should not just blindly remove the
            limit, but do it appropriately.
            I am not in favor of just doubling it as suggested by
            some people; I would ideally prefer a solution that lasts
            for a decent amount of time, rather than having to modify
            it every time someone requires a higher limit.

            Cheers
            --------------------------
            Alessandro Benedetti
            Director @ Sease Ltd.
            Apache Lucene/Solr Committer
            Apache Solr PMC Member

            e-mail: a.benede...@sease.io

            Sease - Information Retrieval Applied
            Consulting | Training | Open Source

            Website: Sease.io <http://sease.io/>
            LinkedIn <https://linkedin.com/company/sease-ltd> |
            Twitter <https://twitter.com/seaseltd> |
            Youtube <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
            Github <https://github.com/seaseltd>


            On Fri, 31 Mar 2023 at 16:12, Michael Wechner
            <michael.wech...@wyona.com> wrote:

                OpenAI reduced their embedding size to 1536 dimensions

                https://openai.com/blog/new-and-improved-embedding-model

                so 2048 would work :-)

                but other services also provide higher dimensions,
                sometimes with slightly better accuracy

                Thanks

                Michael


                On 31.03.23 at 14:45, Adrien Grand wrote:
                > I'm supportive of bumping the limit on the maximum dimension
                > for vectors to something that is above what the majority of
                > users need, but I'd like to keep a limit. We have limits for
                > other things like the max number of docs per index, the max
                > term length, the max number of dimensions of points, etc.,
                > and there are a few things that we don't have limits on that
                > I wish we had limits on. These limits allow us to better
                > tune our data structures, prevent overflows, help ensure we
                > have good test coverage, etc.
                >
                > That said, these other limits we have in place are quite
                > high. E.g. the 32kB term limit: nobody would ever type a
                > 32kB term in a text box. Likewise for the max of 8
                > dimensions for points: a segment cannot possibly have 2
                > splits per dimension on average if it doesn't have
                > 512*2^(8*2)=34M docs, a sizable dataset already, so more
                > dimensions than 8 would likely defeat the point of indexing.
                > In contrast, our limit on the number of dimensions of
                > vectors seems to be under what some users would like, and
                > while I understand the performance argument against bumping
                > the limit, it doesn't feel to me like something that would
                > be so bad that we need to prevent users from using numbers
                > of dimensions in the low thousands, e.g. top-k KNN searches
                > would still look at a very small subset of the full dataset.
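
                [Worked expansion of the arithmetic above, added for
                clarity -- 512 is Lucene's default BKD leaf size:
                averaging 2 splits per each of the 8 dimensions gives
                $2^{8 \cdot 2} = 2^{16} = 65{,}536$ leaf cells, so a
                segment needs at least
                $512 \cdot 2^{16} = 33{,}554{,}432 \approx 34\mathrm{M}$
                docs before every dimension is even split twice.]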
                >
                > So overall, my vote would be to bump the limit to 2048 as
                > suggested by Mayya on the issue that you linked.
                >
                > On Fri, Mar 31, 2023 at 2:38 PM Michael Wechner
                > <michael.wech...@wyona.com> wrote:
                >> Thanks Alessandro for summarizing the discussion below!
                >>
                >> I understand that there is no clear reasoning about what
                >> the best embedding size is, whereas I think heuristic
                >> approaches like the one described at the following link can
                >> be helpful:
                >>
                >> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
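
                [Illustrative aside: one rule of thumb discussed at
                that link sizes the embedding near the fourth root of
                the number of categories, $d \approx V^{1/4}$; e.g. a
                vocabulary of $V = 10^6$ terms gives
                $d \approx \sqrt[4]{10^6} \approx 32$ -- far below
                even the current 1024 limit.]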
                >>
                >> Having said this, we see various embedding services
                >> providing higher dimensions than 1024, for example OpenAI,
                >> Cohere and Aleph Alpha.
                >>
                >> And it would be great if we could run benchmarks without
                >> having to recompile Lucene ourselves.
                >>
                >> Therefore I would suggest to either increase the limit or,
                >> even better, remove the limit and add a disclaimer that
                >> people should be aware of possible crashes etc.
                >>
                >> Thanks
                >>
                >> Michael
                >>
                >>
                >>
                >>
                >> On 31.03.23 at 11:43, Alessandro Benedetti wrote:
                >>
                >>
                >> I've been monitoring various discussions on Pull Requests
                >> about changing the max number of dimensions allowed for
                >> Lucene HNSW vectors:
                >>
                >> https://github.com/apache/lucene/pull/12191
                >>
                >> https://github.com/apache/lucene/issues/11507
                >>
                >>
                >> I would like to set up a discussion and potentially a vote
                >> about this.
                >>
                >> I have seen some strong opposition from a few people, but a
                >> majority in favor of this direction.
                >>
                >>
                >> Motivation
                >>
                >> We were discussing in the Solr slack channel with Ishan
                >> Chattopadhyaya, Marcus Eagan, and David Smiley about some
                >> neural search integrations in Solr:
                >> https://github.com/openai/chatgpt-retrieval-plugin
                >>
                >>
                >> Proposal
                >>
                >> No hard limit at all.
                >>
                >> As for many other Lucene areas, users will be allowed to
                >> push the system to the limit of their resources and get
                >> terrible performance or crashes if they want.
                >>
                >>
                >> What we are NOT discussing
                >>
                >> - Quality and scalability of the HNSW algorithm
                >>
                >> - dimensionality reduction
                >>
                >> - strategies to fit in an arbitrary self-imposed limit
                >>
                >>
                >> Benefits
                >>
                >> - users can use the models they want to generate vectors
                >>
                >> - removal of an arbitrary limit that blocks some integrations
                >>
                >>
                >> Cons
                >>
                >>   - if you go for vectors with high dimensions, there's no
                >>   guarantee you get acceptable performance for your use case
                >>
                >>
                >>
                >> I want to keep it simple: right now, in many Lucene areas,
                >> you can push the system to unacceptable performance or
                >> crashes.
                >>
                >> For example, we don't limit the number of docs per index
                >> to an arbitrary maximum of N; you push in as many docs as
                >> you like, and if they are too many for your system, you get
                >> terrible performance/crashes/whatever.
                >>
                >>
                >> Limits caused by primitive Java types will stay there
                >> behind the scenes, and that's acceptable, but I would
                >> prefer not to have arbitrary hard-coded ones that may limit
                >> the software's usability and integration, which is
                >> extremely important for a library.
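
                [Editorial aside, not part of the original mail: even
                with the Lucene constant removed, the JVM itself
                imposes a ceiling, since Java arrays are indexed by
                int. A minimal illustration:

                    // A float[] can never exceed Integer.MAX_VALUE elements
                    // (~2.1 billion); one such vector alone would need about
                    // 8 GB of heap, so allocation fails long before any
                    // Lucene-level check would matter.
                    float[] vector = new float[Integer.MAX_VALUE];
                ]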
                >>
                >>
                >> I strongly encourage people to add benefits and cons that
                >> I missed (I am sure I missed some of them, but wanted to
                >> keep it simple).
                >>
                >>
                >> Cheers
                >>
                >> --------------------------
                >> Alessandro Benedetti
                >> Director @ Sease Ltd.
                >> Apache Lucene/Solr Committer
                >> Apache Solr PMC Member
                >>
                >> e-mail: a.benede...@sease.io
                >>
                >>
                >> Sease - Information Retrieval Applied
                >> Consulting | Training | Open Source
                >>
                >> Website: Sease.io
                >> LinkedIn | Twitter | Youtube | Github
                >>
                >>
                >


                

