Re: [VOTE] Dimension Limit for KNN Vectors

Michael McCandless Thu, 18 May 2023 03:22:50 -0700

This isn't really a VOTE (no specific code change is being proposed), but
rather a poll?


Anyway, I would prefer Option 3: put the limit check into the HNSW
algorithm itself.  This is the right place for the limit check, since HNSW
has its own scaling behaviour.  It might have other limits, like max
fanout, etc.  And we really should fix the loophole Mike S posted -- that's
just a dangerous long-term trap for users, thinking they have the back
compat promise of Lucene, when in fact they do not.
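
For illustration only, here is a minimal sketch of what a limit check living
next to the HNSW code might look like, assuming it also reads the
"lucene.hnsw.maxDimensions" system property floated in Option 4. The class
and method names (HnswDimensionLimit, maxDimensions, checkDimension) are
hypothetical and are not part of Lucene's API:

```java
// Hypothetical sketch: this class does not exist in Lucene.
final class HnswDimensionLimit {
  // Property name and default value are the ones suggested in Option 4.
  static final String PROPERTY = "lucene.hnsw.maxDimensions";
  static final int DEFAULT_MAX_DIMENSIONS = 1024;

  private HnswDimensionLimit() {}

  // Read the limit at call time, falling back to the default when the
  // property is unset or is not a parseable integer.
  static int maxDimensions() {
    return Integer.getInteger(PROPERTY, DEFAULT_MAX_DIMENSIONS);
  }

  // Return the dimension unchanged if it is acceptable, otherwise throw.
  static int checkDimension(int dimension) {
    if (dimension <= 0) {
      throw new IllegalArgumentException(
          "vector dimension must be positive; got " + dimension);
    }
    int max = maxDimensions();
    if (dimension > max) {
      throw new IllegalArgumentException(
          "vector dimension " + dimension + " exceeds the HNSW limit of " + max);
    }
    return dimension;
  }

  public static void main(String[] args) {
    System.out.println("accepted: " + checkDimension(768));
    try {
      checkDimension(Integer.MAX_VALUE);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

With a check like this owned by the HNSW implementation, another vector
format could enforce a different limit, or none, without touching the
top-level field classes.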

I love all the energy and passion going into debating all the ways to poke
at this limit, but please let's also spend some of this passion on actually
improving the scalability of our aKNN implementation!  E.g. Robert opened
an exciting "Plan B" ( https://github.com/apache/lucene/issues/12302 ) to
work around OpenJDK's crazy slowness in enabling access to vectorized SIMD
CPU instructions (the Java Vector API, JEP 426: https://openjdk.org/jeps/426
).  This could help postings and doc values performance too!

Mike McCandless

http://blog.mikemccandless.com


On Thu, May 18, 2023 at 5:24 AM Alessandro Benedetti <a.benede...@sease.io>
wrote:

> That's great and a good plan B, but let's try to focus this thread on
> collecting votes for a week (let's keep discussions on the nice PR opened
> by David or the discussion thread we have in the mailing list already) :)
>
> On Thu, 18 May 2023, 10:10 Ishan Chattopadhyaya, <
> ichattopadhy...@gmail.com> wrote:
>
>> That sounds promising, Michael. Can you share scripts/steps/code to
>> reproduce this?
>>
>> On Thu, 18 May, 2023, 1:16 pm Michael Wechner, <michael.wech...@wyona.com>
>> wrote:
>>
>>> I just implemented it and tested it with OpenAI's
>>> text-embedding-ada-002, which uses 1536 dimensions, and it works very
>>> well :-)
>>>
>>> Thanks
>>>
>>> Michael
>>>
>>>
>>>
>>> On 18.05.23 at 00:29, Michael Wechner wrote:
>>>
>>> IIUC KnnVectorField is deprecated and one is supposed to use
>>> KnnFloatVectorField when using float as vector values, right?
>>>
>>> On 17.05.23 at 16:41, Michael Sokolov wrote:
>>>
>>> see https://markmail.org/message/kf4nzoqyhwacb7ri
>>>
>>> On Wed, May 17, 2023 at 10:09 AM David Smiley <dsmi...@apache.org>
>>> wrote:
>>>
>>>> > easily be circumvented by a user
>>>>
>>>> This is a revelation to me and others, if true.  Michael, please then
>>>> point to a test or code snippet that shows the Lucene user community what
>>>> they want to see so they are unblocked from their explorations of vector
>>>> search.
>>>>
>>>> ~ David Smiley
>>>> Apache Lucene/Solr Search Developer
>>>> http://www.linkedin.com/in/davidwsmiley
>>>>
>>>>
>>>> On Wed, May 17, 2023 at 7:51 AM Michael Sokolov <msoko...@gmail.com>
>>>> wrote:
>>>>
>>>>> I think I've said before on this list we don't actually enforce the
>>>>> limit in any way that can't easily be circumvented by a user. The codec
>>>>> already supports any size vector - it doesn't impose any limit. The way the
>>>>> API is written you can *already today* create an index with max-int sized
>>>>> vectors and we are committed to supporting that going forward by our
>>>>> backwards compatibility policy as Robert points out. This wasn't
>>>>> intentional, I think, but it is the facts.
>>>>>
>>>>> Given that, I think this whole discussion is not really necessary.
>>>>>
>>>>> On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti <
>>>>> a.benede...@sease.io> wrote:
>>>>>
>>>>>> Hi all,
>>>>>> we have finalized all the options proposed by the community and we
>>>>>> are ready to vote for the preferred one and then proceed with the
>>>>>> implementation.
>>>>>>
>>>>>> *Option 1*
>>>>>> Keep it as it is (dimension limit hardcoded to 1024)
>>>>>> *Motivation*:
>>>>>> We are close to improving on many fronts. Given the criticality of
>>>>>> Lucene in computing infrastructure and the concerns raised by one of the
>>>>>> most active stewards of the project, I think we should keep working
>>>>>> toward improving the feature as is and raise the limit only after we can
>>>>>> demonstrate improvement unambiguously.
>>>>>>
>>>>>> *Option 2*
>>>>>> make the limit configurable, for example through a system property
>>>>>> *Motivation*:
>>>>>> The system administrator can enforce a limit that its users must
>>>>>> respect, in line with whatever the admin has decided is acceptable
>>>>>> for them.
>>>>>> The default can stay the current one.
>>>>>> This should open the doors for Apache Solr, Elasticsearch,
>>>>>> OpenSearch, and any sort of plugin development
>>>>>>
>>>>>> *Option 3*
>>>>>> Move the max dimension limit down to an HNSW-specific
>>>>>> implementation. Once there, this limit would not bind any other potential
>>>>>> vector engine alternative/evolution.
>>>>>> *Motivation:* There seem to be contradictory performance
>>>>>> interpretations about the current HNSW implementation. Some consider its
>>>>>> performance ok, some not, and it depends on the target data set and use
>>>>>> case. Increasing the max dimension limit where it is currently (in top
>>>>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>>>>> other use-cases) to be based on a lower limit.
>>>>>>
>>>>>> *Option 4*
>>>>>> Make it configurable and move it to an appropriate place.
>>>>>> In particular, a
>>>>>> simple Integer.getInteger("lucene.hnsw.maxDimensions", 1024) should be
>>>>>> enough.
>>>>>> *Motivation*:
>>>>>> Both are good and not mutually exclusive and could happen in any
>>>>>> order.
>>>>>> Someone suggested perfecting what the _default_ limit should be, but
>>>>>> I've not seen an argument _against_ configurability.  Especially in this
>>>>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>>>>
>>>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>>>> implementation.
>>>>>> --------------------------
>>>>>> *Alessandro Benedetti*
>>>>>> Director @ Sease Ltd.
>>>>>> *Apache Lucene/Solr Committer*
>>>>>> *Apache Solr PMC Member*
>>>>>>
>>>>>> e-mail: a.benede...@sease.io
>>>>>>
>>>>>>
>>>>>> *Sease* - Information Retrieval Applied
>>>>>> Consulting | Training | Open Source
>>>>>>
>>>>>> Website: Sease.io <http://sease.io/>
>>>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>>>> <https://twitter.com/seaseltd> | Youtube
>>>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>>>> <https://github.com/seaseltd>
>>>>>>
>>>>>
>>>
>>>
