On 06.04.23 at 17:47, Robert Muir wrote:
Well, I'm asking people to actually try testing with such high dimensions.
Based on my own experience, I consider it unusable. It seems other
folks may have run into trouble too. If the project committers can't
even really use vectors with such high dimension counts, then it's not
in an OK state for users, and we shouldn't bump the limit.

I'm happy to discuss/compromise etc., but simply bumping the limit
without addressing the underlying usability/scalability is a real
no-go.

I agree that this needs to be addressed.



it is not really solving anything, nor is it giving users any
freedom or allowing them to do something they couldn't do before.
Because if it still doesn't work, it still doesn't work.

I disagree, because it *does work* with "smaller" document sets.

Currently we have to compile Lucene ourselves to avoid the exception when using a model with a vector dimension greater than 1024,
which is of course possible, but not really convenient.
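
For illustration, a minimal sketch of what currently fails (the 1536-dim size, field name, and exact field class are assumptions for the example; KnnFloatVectorField is the float vector field in recent 9.x releases):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.KnnFloatVectorField;
    import org.apache.lucene.index.VectorSimilarityFunction;

    public class DimensionLimitDemo {
        public static void main(String[] args) {
            // Hypothetical 1536-dim embedding, i.e. larger than the
            // current 1024-dim maximum.
            float[] embedding = new float[1536];
            Document doc = new Document();
            // Throws IllegalArgumentException at field-construction time,
            // before the vector ever reaches an index.
            doc.add(new KnnFloatVectorField("embedding", embedding,
                    VectorSimilarityFunction.COSINE));
        }
    }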

As I wrote before, to resolve this discussion, I think we should test and address possible issues.

I will try to stop discussing now :-) and instead try to understand the actual issues better. It would be great if others could join in on this!

Thanks

Michael




We all need to be on the same page, grounded in reality, not fantasy:
if we set a limit of 1024 or 2048, you should actually be able to index
vectors with that many dimensions, and it should actually work and scale.

On Thu, Apr 6, 2023 at 11:38 AM Alessandro Benedetti
<a.benede...@sease.io> wrote:
As I said earlier, a max limit limits usability.
It's not forcing users with small vectors to pay the performance penalty of big
vectors; it's literally preventing some users from using Lucene/Solr/Elasticsearch
at all.
As far as I know, the max limit is used to raise an exception; it's not used to
initialise or optimise data structures (please correct me if I'm wrong).
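
To my understanding, the check is just a guard clause of roughly this shape; this is a paraphrase for illustration, not the literal Lucene source:

    // Paraphrased guard, not the literal Lucene code: the configured
    // maximum is only compared against the requested dimension and used
    // to throw; it does not size or preallocate any data structure.
    static void checkDimension(int dimension, int maxDimensions) {
        if (dimension > maxDimensions) {
            throw new IllegalArgumentException(
                "vector dimension must be <= " + maxDimensions + "; got " + dimension);
        }
    }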

Improving the algorithm performance is a separate discussion.
I don't see how the fact that indexing billions of vectors, of whatever
dimension, is slow correlates with a usability parameter.

What about potential users who need only a few high-dimensional vectors?

As I said before, I am a big +1 for NOT just raising it blindly, but I believe we
need to remove the limit or size it in a way that it's not a problem for either users
or internal data-structure optimizations, if any.


On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcm...@gmail.com> wrote:
I'd ask anyone voting +1 to raise this limit to at least try to index
a few million vectors with 756 or 1024 dimensions, which is allowed today.

IMO, based on how painful it is, the limit seems already too high. I
realize that will sound controversial, but please at least try
it out!

Voting +1 without at least doing this is really the
"weak/unscientifically minded" approach.

On Wed, Apr 5, 2023 at 12:52 PM Michael Wechner
<michael.wech...@wyona.com> wrote:
Thanks for your feedback!

I agree that it should not crash.

So far we did not experience crashes ourselves, but we did not index
millions of vectors.

I will try to reproduce the crash, maybe this will help us to move forward.

Thanks

Michael

On 05.04.23 at 18:30, Dawid Weiss wrote:
Can you describe your crash in more detail?
I can't. That experiment was a while ago, a quick test to see if I
could index rather large-ish USPTO (patent office) data as vectors. I
couldn't do it then.

How much RAM?
My indexing jobs run with rather smallish heaps to leave space for I/O
buffers. Think 4-8GB at most. So yes, it could have been the problem.
I recall segment merging grew slower and slower and then simply
crashed. Lucene should work with low heap requirements, even if it
slows down. Throwing RAM at the indexing/segment-merging problem
is... I don't know, not elegant?
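
A sketch of how such a small-heap indexing job might be configured, assuming the standard IndexWriterConfig knobs; the values are placeholders, not tuning advice:

    import org.apache.lucene.index.IndexWriterConfig;

    public class LowHeapIndexingConfig {
        // Illustrative only: cap IndexWriter's in-memory buffer so segments
        // flush early and the job stays inside a small (4-8GB) heap.
        static IndexWriterConfig lowHeapConfig() {
            IndexWriterConfig iwc = new IndexWriterConfig();
            iwc.setRAMBufferSizeMB(256); // flush once ~256MB of docs are buffered
            iwc.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH); // flush by RAM alone
            return iwc;
        }
    }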

Anyway. My main point was to remind folks about how Apache works:
code is merged in when there are no vetoes. If Rob (or anybody else)
remains unconvinced, he or she can block the change. (I didn't invent
those rules.)

D.
