Re: [VOTE] Dimension Limit for KNN Vectors

Alessandro Benedetti Thu, 18 May 2023 02:24:44 -0700

That's great and a good plan B, but let's try to focus this thread on
collecting votes for a week (let's keep discussions on the nice PR opened
by David or the discussion thread we have in the mailing list already) :)


On Thu, 18 May 2023, 10:10 Ishan Chattopadhyaya, <ichattopadhy...@gmail.com>
wrote:

> That sounds promising, Michael. Can you share scripts/steps/code to
> reproduce this?
>
> On Thu, 18 May, 2023, 1:16 pm Michael Wechner, <michael.wech...@wyona.com>
> wrote:
>
>> I just implemented it and tested it with OpenAI's text-embedding-ada-002,
>> which uses 1536 dimensions, and it works very well :-)
>>
>> Thanks
>>
>> Michael
>>
>>
>>
>> Am 18.05.23 um 00:29 schrieb Michael Wechner:
>>
>> IIUC KnnVectorField is deprecated and one is supposed to use
>> KnnFloatVectorField when using float as vector values, right?
>>
>> Am 17.05.23 um 16:41 schrieb Michael Sokolov:
>>
>> see https://markmail.org/message/kf4nzoqyhwacb7ri
>>
>> On Wed, May 17, 2023 at 10:09 AM David Smiley <dsmi...@apache.org> wrote:
>>
>>> > easily be circumvented by a user
>>>
>>> This is a revelation to me and others, if true.  Michael, please then
>>> point to a test or code snippet that shows the Lucene user community what
>>> they want to see so they are unblocked from their explorations of vector
>>> search.
>>>
>>> ~ David Smiley
>>> Apache Lucene/Solr Search Developer
>>> http://www.linkedin.com/in/davidwsmiley
>>>
>>>
>>> On Wed, May 17, 2023 at 7:51 AM Michael Sokolov <msoko...@gmail.com>
>>> wrote:
>>>
>>>> I think I've said before on this list we don't actually enforce the
>>>> limit in any way that can't easily be circumvented by a user. The codec
>>>> already supports any size vector - it doesn't impose any limit. The way the
>>>> API is written you can *already today* create an index with max-int sized
>>>> vectors and we are committed to supporting that going forward by our
>>>> backwards compatibility policy as Robert points out. This wasn't
>>>> intentional, I think, but those are the facts.
>>>>
>>>> Given that, I think this whole discussion is not really necessary.
>>>>
>>>> On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti <
>>>> a.benede...@sease.io> wrote:
>>>>
>>>>> Hi all,
>>>>> we have finalized all the options proposed by the community and we are
>>>>> ready to vote for the preferred one and then proceed with the
>>>>> implementation.
>>>>>
>>>>> *Option 1*
>>>>> Keep it as it is (dimension limit hardcoded to 1024)
>>>>> *Motivation*:
>>>>> We are close to improving on many fronts. Given the criticality of
>>>>> Lucene in computing infrastructure and the concerns raised by one of the
>>>>> most active stewards of the project, I think we should keep working toward
>>>>> improving the feature as is and move to raise the limit after we can
>>>>> demonstrate improvement unambiguously.
>>>>>
>>>>> *Option 2*
>>>>> make the limit configurable, for example through a system property
>>>>> *Motivation*:
>>>>> The system administrator can enforce a limit that users must respect,
>>>>> in line with whatever the admin has decided is acceptable for
>>>>> them.
>>>>> The default can stay the current one.
>>>>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>>>>> and any sort of plugin development
>>>>>
>>>>> *Option 3*
>>>>> Move the max dimension limit to a lower level, into an HNSW-specific
>>>>> implementation. Once there, this limit would not bind any other potential
>>>>> vector engine alternative/evolution.
>>>>> *Motivation:* There seem to be contradictory performance
>>>>> interpretations about the current HNSW implementation. Some consider its
>>>>> performance ok, some not, and it depends on the target data set and use
>>>>> case. Increasing the max dimension limit where it is currently (in top
>>>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>>>> other use-cases) to be based on a lower limit.
>>>>>
>>>>> *Option 4*
>>>>> Make it configurable and move it to an appropriate place.
>>>>> In particular, a
>>>>> simple Integer.getInteger("lucene.hnsw.maxDimensions", 1024) should be
>>>>> enough.
>>>>> *Motivation*:
>>>>> Both are good and not mutually exclusive and could happen in any order.
>>>>> Someone suggested perfecting what the _default_ limit should be, but
>>>>> I've not seen an argument _against_ configurability.  Especially in this
>>>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>>>
>>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>>> implementation.
>>>>> --------------------------
>>>>> *Alessandro Benedetti*
>>>>> Director @ Sease Ltd.
>>>>> *Apache Lucene/Solr Committer*
>>>>> *Apache Solr PMC Member*
>>>>>
>>>>> e-mail: a.benede...@sease.io
>>>>>
>>>>>
>>>>> *Sease* - Information Retrieval Applied
>>>>> Consulting | Training | Open Source
>>>>>
>>>>> Website: Sease.io <http://sease.io/>
>>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>>> <https://twitter.com/seaseltd> | Youtube
>>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>>> <https://github.com/seaseltd>
>>>>>
>>>>
>>
>>
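
For readers following Option 4, here is a minimal sketch of what a limit read from a system property might look like. Only the property name "lucene.hnsw.maxDimensions" and the default of 1024 come from the proposal quoted above; the class VectorDimensionLimit and its methods are hypothetical names used for illustration and are not part of Lucene.

```java
// Hypothetical helper, not Lucene code: illustrates the Option 4 proposal only.
final class VectorDimensionLimit {
    // Property name and default value as quoted in the proposal.
    static final String PROPERTY = "lucene.hnsw.maxDimensions";
    static final int DEFAULT_MAX_DIMENSIONS = 1024;

    private VectorDimensionLimit() {}

    // Reads the limit on every call. Integer.getInteger falls back to the
    // default when the property is unset or is not a parseable integer.
    static int maxDimensions() {
        return Integer.getInteger(PROPERTY, DEFAULT_MAX_DIMENSIONS);
    }

    // Rejects vector dimensions that are non-positive or above the limit.
    static void checkDimension(int dimension) {
        int max = maxDimensions();
        if (dimension <= 0) {
            throw new IllegalArgumentException(
                "vector dimension must be positive; got " + dimension);
        }
        if (dimension > max) {
            throw new IllegalArgumentException(
                "vector dimension " + dimension + " exceeds the limit of " + max);
        }
    }

    public static void main(String[] args) {
        checkDimension(768);                     // accepted under the default limit
        System.setProperty(PROPERTY, "2048");    // same effect as -Dlucene.hnsw.maxDimensions=2048
        checkDimension(1536);                    // accepted once the limit is raised
        System.out.println("max dimensions: " + maxDimensions());
    }
}
```

In this sketch the property is re-read on each check, so a change made at runtime takes effect immediately. An implementation that caches the value in a static final field would instead fix the limit at class-load time, which is a design choice the thread does not settle.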
