Sorry to interrupt, but I think we got side-tracked from the original discussion, which was about increasing the vector dimension limit.

I think improving vector indexing performance is one thing, and making sure Lucene does not crash when the vector dimension limit is increased is another.

I think it is great to find better ways to index vectors, but that should not prevent people from being able to use models with vector dimensions higher than 1024.
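
For context, the limit under discussion is enforced the moment a vector field is constructed. A minimal sketch of what users hit (the exact field class and exception message vary across Lucene 9.x versions):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.KnnFloatVectorField;
    import org.apache.lucene.index.VectorSimilarityFunction;

    public class DimensionLimitDemo {
      public static void main(String[] args) {
        float[] vec = new float[2048]; // e.g. output of a larger embedding model
        Document doc = new Document();
        // Throws IllegalArgumentException while the limit is 1024,
        // before a single document ever reaches the IndexWriter.
        doc.add(new KnnFloatVectorField("embedding", vec,
            VectorSimilarityFunction.DOT_PRODUCT));
      }
    }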

The following comparison might not be perfect, but imagine we have invented a combustion engine that is strong enough to move a car on flat terrain, yet fails when applied to a truck hauling loads over mountains, because it is not strong enough for that. Would you prevent people from using that combustion engine for a car on flat terrain?

Thanks

Michael



On 08.04.23 at 00:15, jim ferenczi wrote:
> Keep in mind, there may be other ways to do it. In general if merging
> something is going to be "heavyweight", we should think about it to
> prevent things from going really bad overall.

Yep, I agree. Personally I don't see how we can solve this without prior knowledge of the vectors. Faiss has a nice implementation that fits naturally with Lucene, called IVF (
https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexIVF.html),
but if we want to avoid running kmeans on every merge, we'd need to provide the clusters for the entire index before indexing the first vector.
It's a complex issue…
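
To make the merge point concrete, here is a toy sketch of the IVF idea (plain Java, not the Faiss or Lucene API; all names are illustrative): vectors are bucketed under fixed, precomputed centroids, so two segments built against the same centroids can be merged by list-wise concatenation, while segments clustered independently would force a re-run of kmeans.

    import java.util.ArrayList;
    import java.util.List;

    // Toy IVF-style index: assignment depends only on the shared centroids,
    // which is exactly what makes merges cheap.
    public class IvfSketch {
      final float[][] centroids;       // learned once up front, e.g. via kmeans
      final List<List<float[]>> lists; // one posting list of vectors per centroid

      IvfSketch(float[][] centroids) {
        this.centroids = centroids;
        this.lists = new ArrayList<>();
        for (int i = 0; i < centroids.length; i++) {
          lists.add(new ArrayList<>());
        }
      }

      void add(float[] v) {
        lists.get(nearest(v)).add(v); // independent of all other vectors
      }

      int nearest(float[] v) {
        int best = 0;
        float bestDist = Float.MAX_VALUE;
        for (int i = 0; i < centroids.length; i++) {
          float d = 0;
          for (int j = 0; j < v.length; j++) {
            float diff = v[j] - centroids[i][j];
            d += diff * diff;
          }
          if (d < bestDist) { bestDist = d; best = i; }
        }
        return best;
      }

      // Merging two sketches that share centroids is a list-wise append;
      // no kmeans needed at merge time.
      void merge(IvfSketch other) {
        for (int i = 0; i < lists.size(); i++) {
          lists.get(i).addAll(other.lists.get(i));
        }
      }
    }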

On Fri, 7 Apr 2023 at 22:58, Robert Muir <rcm...@gmail.com> wrote:

    Personally I'd have to re-read the paper, but in general the merging
    issue has to be addressed somehow to fix the overall indexing-time
    problem. It seems it gets "dodged" with huge RAM buffers in the emails
    here.
    Keep in mind, there may be other ways to do it. In general if merging
    something is going to be "heavyweight", we should think about it to
    prevent things from going really bad overall.

    As an example, I'm most familiar with adding DEFLATE compression to
    stored fields. Previously, we'd basically decompress and recompress
    the stored fields on merge, and LZ4 is so fast that it wasn't
    obviously a problem. But with DEFLATE it got slower/heavier (more
    intense compression algorithm), something had to be done or indexing
    would be unacceptably slow. Hence if you look at the stored-fields
    writer, there is "dirtiness" logic etc. so that recompression is
    amortized over time and doesn't happen on every merge.
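
The actual logic lives in Lucene's compressing stored-fields writer; the snippet below is only a hedged paraphrase of the idea, with illustrative names and an illustrative threshold, not the real API:

    // Sketch of the "dirtiness" heuristic: chunks that were flushed while
    // under-full are "dirty". On merge, a clean segment's compressed chunks
    // are bulk-copied verbatim; only a segment with too many dirty chunks
    // pays for decompression and recompression, so the heavy work is
    // amortized instead of happening on every merge.
    class DirtinessSketch {
      static final double MAX_DIRTY_RATIO = 0.01; // assumed value, for illustration

      long numChunks;      // compressed chunks in the segment being merged
      long numDirtyChunks; // chunks flushed before they filled up

      boolean tooDirtyToBulkCopy() {
        return numDirtyChunks > MAX_DIRTY_RATIO * numChunks;
      }
    }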

    On Fri, Apr 7, 2023 at 5:38 PM jim ferenczi
    <jim.feren...@gmail.com> wrote:
    >
    > I am also not sure that diskann would solve the merging issue.
    > The idea described in the paper is to run kmeans first to create
    > multiple graphs, one per cluster. In our case the vectors in each
    > segment could belong to different clusters, so I don’t see how we
    > could merge them efficiently.
    >
    > On Fri, 7 Apr 2023 at 22:28, jim ferenczi
    <jim.feren...@gmail.com> wrote:
    >>
    >> The inference time (and cost) to generate these big vectors
    >> must be quite large too ;).
    >> Regarding the RAM buffer, we could drastically reduce its size
    >> by writing the vectors to disk instead of keeping them in the
    >> heap. With 1k dimensions, the RAM buffer fills up with these
    >> vectors quite rapidly.
    >>
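
One way to read that suggestion, as a plain-NIO sketch rather than the actual Lucene writer: stream each vector to a temp file through a small reused buffer, so the per-vector heap cost stays constant regardless of how many vectors are buffered.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import java.nio.channels.FileChannel;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Spills raw float vectors to disk instead of holding them on the heap.
    class OffHeapVectorBuffer implements AutoCloseable {
      private final FileChannel out;
      private final ByteBuffer scratch; // reused for every vector
      private long count;

      OffHeapVectorBuffer(Path dir, int dim) throws IOException {
        Path file = Files.createTempFile(dir, "vectors", ".tmp");
        this.out = FileChannel.open(file, StandardOpenOption.WRITE);
        this.scratch =
            ByteBuffer.allocate(dim * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
      }

      void add(float[] vector) throws IOException {
        scratch.clear();
        scratch.asFloatBuffer().put(vector); // encode into the reused buffer
        out.write(scratch);                  // heap usage stays O(1) per add
        count++;
      }

      long size() { return count; }

      @Override
      public void close() throws IOException { out.close(); }
    }
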
    >> On Fri, 7 Apr 2023 at 21:59, Robert Muir <rcm...@gmail.com> wrote:
    >>>
    >>> On Fri, Apr 7, 2023 at 7:47 AM Michael Sokolov
    >>> <msoko...@gmail.com> wrote:
    >>> >
    >>> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer
    >>> > size=1994)
    >>> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW
    >>> > buffer size=1994)
    >>> >
    >>> > Robert, since you're the only on-the-record veto here, does this
    >>> > change your thinking at all, or if not could you share some test
    >>> > results that didn't go the way you expected? Maybe we can
    >>> > find some mitigation if we focus on a specific issue.
    >>> >
    >>>
    >>> My scale concerns are both space and time. What does the execution
    >>> time look like if you don't set an insanely large IW RAM buffer? The
    >>> default is 16MB. Just concerned we're shoving some problems
    >>> under the rug :)
    >>>
    >>> Even with the yuge RAM buffer, we're still talking about almost
    >>> 2 hours to index 4M documents with these 2k vectors. Whereas with
    >>> typical Lucene indexing you'd measure this in seconds; it's nothing.
    >>>
    >>>

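For anyone who wants to reproduce the trade-off being debated, here is a hedged sketch of the setup the numbers above imply, reading "IW buffer size=1994" as roughly 1994 MB against the 16 MB default; it assumes a build in which dimensions above 1024 are permitted:

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.KnnFloatVectorField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.VectorSimilarityFunction;
    import org.apache.lucene.store.FSDirectory;

    public class VectorIndexBench {
      public static void main(String[] args) throws Exception {
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        // Default is IndexWriterConfig.DEFAULT_RAM_BUFFER_SIZE_MB (16 MB);
        // the runs above use a much larger buffer to delay flushes/merges.
        iwc.setRAMBufferSizeMB(1994);
        try (FSDirectory dir = FSDirectory.open(Paths.get("vec-index"));
             IndexWriter writer = new IndexWriter(dir, iwc)) {
          float[] vec = new float[2048]; // assumes the dimension limit is lifted
          Document doc = new Document();
          doc.add(new KnnFloatVectorField("embedding", vec,
              VectorSimilarityFunction.DOT_PRODUCT));
          writer.addDocument(doc); // the actual runs repeat this for 4-8M docs
        }
      }
    }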
