Thanks, Michael, for sharing your code snippet on how to circumvent the limit. My reaction to it is the same as Alessandro's.
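
For anyone who doesn't want to chase the archive link quoted below: as I understand it, the general shape of the workaround (an untested sketch of my own, not Michael's exact snippet; the class and field names here are made up) is to hand the indexer a FieldType that reports the larger dimension itself, so the check in FieldType#setVectorAttributes that Alessandro mentions below never runs, and the codec simply writes whatever dimension the field type reports. Exact constructors and method names may differ between 9.x releases.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.VectorEncoding;
import org.apache.lucene.index.VectorSimilarityFunction;

public class OversizedVectorExample {

  // A FieldType that reports its vector attributes directly instead of going
  // through setVectorAttributes, which is where the dimension check lives.
  static FieldType bigVectorType(int dimension) {
    return new FieldType() {
      @Override
      public int vectorDimension() {
        return dimension;
      }

      @Override
      public VectorEncoding vectorEncoding() {
        return VectorEncoding.FLOAT32;
      }

      @Override
      public VectorSimilarityFunction vectorSimilarityFunction() {
        return VectorSimilarityFunction.DOT_PRODUCT;
      }
    };
  }

  static Document docWithBigVector(String fieldName, float[] vector) {
    Document doc = new Document();
    // As far as I can tell, the (name, vector, fieldType) constructor checks
    // the vector against the field type but does not re-apply the limit.
    doc.add(new KnnFloatVectorField(fieldName, vector, bigVectorType(vector.length)));
    return doc;
  }
}
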
I just created a PR to make the limit configurable:
https://github.com/apache/lucene/pull/12306

If there is to be a veto presented to the PR, it should include technical
reasons specific to the PR and be raised on the PR itself. Afterwards, I leave
it to others to move the limit, along with its configurability, so that it is
enforced in a codec-specific way.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, May 17, 2023 at 12:58 PM Mayya Sharipova
<mayya.sharip...@elastic.co.invalid> wrote:

> Alessandro,
> Thanks for raising the code of conduct; it is very discouraging and
> intimidating to participate in discussions where such language is used,
> especially by senior members.
>
> Michael S.,
> thanks for your suggestion; that's what we used in Elasticsearch to raise
> the dims limit, and Alessandro, perhaps you can use it as well in Solr for
> the time being.
>
> On Wed, May 17, 2023 at 11:03 AM Alessandro Benedetti <
> a.benede...@sease.io> wrote:
>
>> Thanks, Michael,
>> that example backs even more strongly the need to clean this up and make
>> the limit configurable without requiring custom field types, I guess. (I
>> was taking a look at the code again, and it seems the limit is also
>> checked twice:
>> in org.apache.lucene.document.KnnByteVectorField#createType and then
>> in org.apache.lucene.document.FieldType#setVectorAttributes, for both the
>> byte and float variants.)
>> This should help people vote, great!
>>
>> Cheers
>> --------------------------
>> *Alessandro Benedetti*
>> Director @ Sease Ltd.
>> *Apache Lucene/Solr Committer*
>> *Apache Solr PMC Member*
>>
>> e-mail: a.benede...@sease.io
>>
>>
>> *Sease* - Information Retrieval Applied
>> Consulting | Training | Open Source
>>
>> Website: Sease.io <http://sease.io/>
>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>> <https://twitter.com/seaseltd> | Youtube
>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>> <https://github.com/seaseltd>
>>
>>
>> On Wed, 17 May 2023 at 15:42, Michael Sokolov <msoko...@gmail.com> wrote:
>>
>>> see https://markmail.org/message/kf4nzoqyhwacb7ri
>>>
>>> On Wed, May 17, 2023 at 10:09 AM David Smiley <dsmi...@apache.org>
>>> wrote:
>>>
>>>> > easily be circumvented by a user
>>>>
>>>> This is a revelation to me and others, if true. Michael, please then
>>>> point to a test or code snippet that shows the Lucene user community
>>>> what they want to see, so they are unblocked from their explorations
>>>> of vector search.
>>>>
>>>> ~ David Smiley
>>>> Apache Lucene/Solr Search Developer
>>>> http://www.linkedin.com/in/davidwsmiley
>>>>
>>>>
>>>> On Wed, May 17, 2023 at 7:51 AM Michael Sokolov <msoko...@gmail.com>
>>>> wrote:
>>>>
>>>>> I think I've said before on this list that we don't actually enforce
>>>>> the limit in any way that can't easily be circumvented by a user. The
>>>>> codec already supports any size vector - it doesn't impose any limit.
>>>>> The way the API is written, you can *already today* create an index
>>>>> with max-int-sized vectors, and we are committed to supporting that
>>>>> going forward by our backwards-compatibility policy, as Robert points
>>>>> out. This wasn't intentional, I think, but those are the facts.
>>>>>
>>>>> Given that, I think this whole discussion is not really necessary.
>>>>>
>>>>> On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti <
>>>>> a.benede...@sease.io> wrote:
>>>>>
>>>>>> Hi all,
>>>>>> we have finalized all the options proposed by the community, and we
>>>>>> are ready to vote for the preferred one and then proceed with the
>>>>>> implementation.
>>>>>>
>>>>>> *Option 1*
>>>>>> Keep it as it is (dimension limit hardcoded to 1024).
>>>>>> *Motivation*:
>>>>>> We are close to improving on many fronts. Given the criticality of
>>>>>> Lucene in computing infrastructure and the concerns raised by one of
>>>>>> the most active stewards of the project, I think we should keep
>>>>>> working toward improving the feature as is and move to raise the
>>>>>> limit only after we can demonstrate improvement unambiguously.
>>>>>>
>>>>>> *Option 2*
>>>>>> Make the limit configurable, for example through a system property.
>>>>>> *Motivation*:
>>>>>> The system administrator can enforce a limit that their users need
>>>>>> to respect and that is in line with whatever the admin has decided
>>>>>> is acceptable for them.
>>>>>> The default can stay the current one.
>>>>>> This should open the doors for Apache Solr, Elasticsearch,
>>>>>> OpenSearch, and any sort of plugin development.
>>>>>>
>>>>>> *Option 3*
>>>>>> Move the max dimension limit to a lower level, i.e. an HNSW-specific
>>>>>> implementation. Once there, this limit would not bind any other
>>>>>> potential vector engine alternative/evolution.
>>>>>> *Motivation*:
>>>>>> There seem to be contradictory performance interpretations of the
>>>>>> current HNSW implementation. Some consider its performance OK, some
>>>>>> not, and it depends on the target data set and use case. Increasing
>>>>>> the max dimension limit where it currently sits (in the top-level
>>>>>> FloatVectorValues) would not allow potential alternatives (e.g. for
>>>>>> other use cases) to be based on a lower limit.
>>>>>>
>>>>>> *Option 4*
>>>>>> Make it configurable and move it to an appropriate place.
>>>>>> In particular, a simple
>>>>>> Integer.getInteger("lucene.hnsw.maxDimensions", 1024) should be
>>>>>> enough.
>>>>>> *Motivation*:
>>>>>> Both are good, not mutually exclusive, and could happen in any order.
>>>>>> Someone suggested perfecting what the _default_ limit should be, but
>>>>>> I've not seen an argument _against_ configurability. Especially in
>>>>>> this way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>>>>
>>>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>>>> implementation.
>>>>>> --------------------------
>>>>>> *Alessandro Benedetti*
>>>>>> Director @ Sease Ltd.
>>>>>> *Apache Lucene/Solr Committer*
>>>>>> *Apache Solr PMC Member*
>>>>>>
>>>>>> e-mail: a.benede...@sease.io
>>>>>>
>>>>>>
>>>>>> *Sease* - Information Retrieval Applied
>>>>>> Consulting | Training | Open Source
>>>>>>
>>>>>> Website: Sease.io <http://sease.io/>
>>>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>>>> <https://twitter.com/seaseltd> | Youtube
>>>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>>>> <https://github.com/seaseltd>
>>>>>
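
P.S. To make option 4 from the quoted [VOTE] a bit more concrete, a configurable limit along those lines could look roughly like the sketch below. This is a simplified illustration, not the actual diff in my PR; the class and method names are invented, and the property name is simply the one proposed in the vote mail.

// Hypothetical sketch of an option-4-style configurable limit; not the PR's
// actual implementation.
public final class VectorLimit {

  /** Today's hardcoded default. */
  public static final int DEFAULT_MAX_DIMENSIONS = 1024;

  /**
   * Effective limit, overridable with e.g. -Dlucene.hnsw.maxDimensions=4096.
   * Integer.getInteger falls back to the default when the property is absent
   * or not a valid integer.
   */
  public static final int MAX_DIMENSIONS =
      Integer.getInteger("lucene.hnsw.maxDimensions", DEFAULT_MAX_DIMENSIONS);

  private VectorLimit() {}

  /** The existing dimension checks would compare against this instead of a constant. */
  public static void checkDimension(int dimension) {
    if (dimension <= 0 || dimension > MAX_DIMENSIONS) {
      throw new IllegalArgumentException(
          "vector dimension must be between 1 and " + MAX_DIMENSIONS + "; got " + dimension);
    }
  }
}

The default stays at 1024, so nothing changes for anyone who doesn't set the property.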