Yes, that was explicitly mentioned in the original mail: improving Lucene's
vector-based search is an interesting area, but it is off topic here.

Let's summarise:
- We want to at least increase the limit (or remove it)
- We showed that performance is acceptable (and can be improved further in
the future), and that no harm is done to users who intend to stick to
low-dimensional vectors

What are the next steps?
What Apache community tool can we use to agree on a new limit, or on no
explicit limit (max integer)?
I think we need some place where each of us proposes a limit with a
motivation, and then we vote on the best option.
Any ideas on how to do it?

Cheers

On Sat, 8 Apr 2023, 03:57 Michael Wechner, <michael.wech...@wyona.com>
wrote:

> sorry to interrupt, but I think we get side-tracked from the original
> discussion to increase the vector dimension limit.
>
> I think improving the vector indexing performance is one thing and making
> sure Lucene does not crash when increasing the vector dimension limit is
> another.
>
> I think it is great to find better ways to index vectors, but I think this
> should not prevent people from being able to use models with higher vector
> dimensions than 1024.
>
> The following comparison might not be perfect, but imagine we have
> invented a combustion engine that is strong enough to move a car on flat
> ground, but fails when applied to a truck hauling loads over mountains,
> because it is not strong enough. Would you prevent people from using the
> combustion engine for a car on flat ground?
>
> Thanks
>
> Michael
>
>
>
> Am 08.04.23 um 00:15 schrieb jim ferenczi:
>
> > Keep in mind, there may be other ways to do it. In general if merging
> something is going to be "heavyweight", we should think about it to
> prevent things from going really bad overall.
>
> Yep I agree. Personally I don't see how we can solve this without prior
> knowledge of the vectors. Faiss has a nice implementation that fits
> naturally with Lucene called IVF (
> https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexIVF.html),
> but if we want to avoid running kmeans on every merge we'd need to
> provide the clusters for the entire index before indexing the first vector.
> It's a complex issue…
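For readers unfamiliar with IVF, here is a toy sketch of the idea in plain Python (purely illustrative, not the Faiss or Lucene API): vectors are k-means-clustered into coarse cells up front, and a query scans only its nearest cell(s) instead of every vector. It also illustrates why merging is hard: the centroids must be fixed before any vector is added.

```python
# Toy IVF (inverted file) sketch -- illustrative only, not Faiss/Lucene code.
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=10, seed=0):
    # Plain Lloyd's k-means; this is the step IVF needs before indexing.
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in vectors:
            i = min(range(k), key=lambda c: sq_dist(v, centroids[c]))
            buckets[i].append(v)
        centroids = [
            [sum(col) / len(b) for col in zip(*b)] if b else centroids[i]
            for i, b in enumerate(buckets)
        ]
    return centroids

class IVFIndex:
    def __init__(self, centroids):
        self.centroids = centroids
        self.cells = [[] for _ in centroids]  # one inverted list per cell

    def add(self, v):
        # Assign each vector to its nearest coarse centroid.
        i = min(range(len(self.centroids)),
                key=lambda c: sq_dist(v, self.centroids[c]))
        self.cells[i].append(v)

    def search(self, q, nprobe=1):
        # Scan only the nprobe nearest cells, not the whole index.
        order = sorted(range(len(self.centroids)),
                       key=lambda c: sq_dist(q, self.centroids[c]))
        candidates = [v for c in order[:nprobe] for v in self.cells[c]]
        return min(candidates, key=lambda v: sq_dist(q, v))
```

Note that `IVFIndex` takes the centroids as a constructor argument: that is exactly the constraint Jim describes, since two segments clustered independently would have incompatible cells.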
>
> On Fri, 7 Apr 2023 at 22:58, Robert Muir <rcm...@gmail.com> wrote:
>
>> Personally i'd have to re-read the paper, but in general the merging
>> issue has to be addressed somehow to fix the overall indexing time
>> problem. It seems it gets "dodged" with huge rambuffers in the emails
>> here.
>> Keep in mind, there may be other ways to do it. In general if merging
>> something is going to be "heavyweight", we should think about it to
>> prevent things from going really bad overall.
>>
>> As an example, I'm most familiar with adding DEFLATE compression to
>> stored fields. Previously, we'd basically decompress and recompress
>> the stored fields on merge, and LZ4 is so fast that it wasn't
>> obviously a problem. But with DEFLATE it got slower/heavier (more
>> intense compression algorithm), something had to be done or indexing
>> would be unacceptably slow. Hence if you look at storedfields writer,
>> there is "dirtiness" logic etc so that recompression is amortized over
>> time and doesn't happen on every merge.
>>
>> On Fri, Apr 7, 2023 at 5:38 PM jim ferenczi <jim.feren...@gmail.com>
>> wrote:
>> >
>> > I am also not sure that diskann would solve the merging issue. The idea
>> described in the paper is to run kmeans first to create multiple graphs, one
>> per cluster. In our case the vectors in each segment could belong to
>> different clusters, so I don't see how we could merge them efficiently.
>> >
>> > On Fri, 7 Apr 2023 at 22:28, jim ferenczi <jim.feren...@gmail.com>
>> wrote:
>> >>
>> >> The inference time (and cost) to generate these big vectors must be
>> quite large too ;).
>> >> Regarding the ram buffer, we could drastically reduce the size by
>> writing the vectors on disk instead of keeping them in the heap. With 1k
>> dimensions the ram buffer is filled with these vectors quite rapidly.
>> >>
>> >> On Fri, 7 Apr 2023 at 21:59, Robert Muir <rcm...@gmail.com> wrote:
>> >>>
>> >>> On Fri, Apr 7, 2023 at 7:47 AM Michael Sokolov <msoko...@gmail.com>
>> wrote:
>> >>> >
>> >>> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer
>> size=1994)
>> >>> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer
>> size=1994)
>> >>> >
>> >>> > Robert, since you're the only on-the-record veto here, does this
>> >>> > change your thinking at all, or if not could you share some test
>> >>> > results that didn't go the way you expected? Maybe we can find some
>> >>> > mitigation if we focus on a specific issue.
>> >>> >
>> >>>
>> >>> My scale concerns are both space and time. What does the execution
>> >>> time look like if you don't set insanely large IW rambuffer? The
>> >>> default is 16MB. Just concerned we're shoving some problems under the
>> >>> rug :)
>> >>>
>> >>> Even with the yuge RAMbuffer, we're still talking about almost 2 hours
>> >>> to index 4M documents with these 2k vectors. Whereas you'd measure
>> >>> this in seconds with typical Lucene indexing; it's nothing.
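A quick back-of-envelope check on the numbers in this sub-thread (assuming float32 vectors, i.e. 4 bytes per dimension, and ignoring per-vector overhead such as the HNSW graph itself):

```python
# Rough arithmetic only; assumes 4-byte float components and that the
# IW buffer (1994 MB) is filled entirely by raw vector data.
dims = 2048
bytes_per_vector = dims * 4  # 8192 bytes per 2048-d float vector
rambuffer_bytes = 1994 * 1024 * 1024
vectors_per_flush = rambuffer_bytes // bytes_per_vector  # ~255k vectors
flushes = 4_000_000 // vectors_per_flush  # ~15 flushes for the 4M-doc run
```

So even the very large buffer flushes roughly every 255k vectors, which gives a feel for how many initial segments the 4M-document run produces before merging starts.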
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> >>> For additional commands, e-mail: dev-h...@lucene.apache.org
>> >>>
>>
>>
>>
>
