Re: [VOTE] Dimension Limit for KNN Vectors

Alessandro Benedetti Wed, 17 May 2023 08:03:23 -0700

Thanks, Michael,
that example backs even more strongly the need of cleaning it up and making
the limit configurable without the need for custom field types I guess (I
was taking a look at the code again, and it seems the limit is also checked
twice: in org.apache.lucene.document.KnnByteVectorField#createType and then
in org.apache.lucene.document.FieldType#setVectorAttributes (for both byte
and float variants).
This should help people vote, great!


Cheers
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benede...@sease.io


*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>


On Wed, 17 May 2023 at 15:42, Michael Sokolov <msoko...@gmail.com> wrote:

> see https://markmail.org/message/kf4nzoqyhwacb7ri
>
> On Wed, May 17, 2023 at 10:09 AM David Smiley <dsmi...@apache.org> wrote:
>
>> > easily be circumvented by a user
>>
>> This is a revelation to me and others, if true.  Michael, please then
>> point to a test or code snippet that shows the Lucene user community what
>> they want to see so they are unblocked from their explorations of vector
>> search.
>>
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>>
>>
>> On Wed, May 17, 2023 at 7:51 AM Michael Sokolov <msoko...@gmail.com>
>> wrote:
>>
>>> I think I've said before on this list we don't actually enforce the
>>> limit in any way that can't easily be circumvented by a user. The codec
>>> already supports any size vector - it doesn't impose any limit. The way the
>>> API is written you can *already today* create an index with max-int sized
>>> vectors and we are committed to supporting that going forward by our
>>> backwards compatibility policy as Robert points out. This wasn't
>>> intentional, I think, but it is the facts.
>>>
>>> Given that, I think this whole discussion is not really necessary.
>>>
>>> On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti <
>>> a.benede...@sease.io> wrote:
>>>
>>>> Hi all,
>>>> we have finalized all the options proposed by the community and we are
>>>> ready to vote for the preferred one and then proceed with the
>>>> implementation.
>>>>
>>>> *Option 1*
>>>> Keep it as it is (dimension limit hardcoded to 1024)
>>>> *Motivation*:
>>>> We are close to improving on many fronts. Given the criticality of
>>>> Lucene in computing infrastructure and the concerns raised by one of the
>>>> most active stewards of the project, I think we should keep working toward
>>>> improving the feature as is and move to up the limit after we can
>>>> demonstrate improvement unambiguously.
>>>>
>>>> *Option 2*
>>>> make the limit configurable, for example through a system property
>>>> *Motivation*:
>>>> The system administrator can enforce a limit its users need to respect
>>>> that it's in line with whatever the admin decided to be acceptable for
>>>> them.
>>>> The default can stay the current one.
>>>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>>>> and any sort of plugin development
>>>>
>>>> *Option 3*
>>>> Move the max dimension limit lower level to a HNSW specific
>>>> implementation. Once there, this limit would not bind any other potential
>>>> vector engine alternative/evolution.
>>>> *Motivation:* There seem to be contradictory performance
>>>> interpretations about the current HNSW implementation. Some consider its
>>>> performance ok, some not, and it depends on the target data set and use
>>>> case. Increasing the max dimension limit where it is currently (in top
>>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>>> other use-cases) to be based on a lower limit.
>>>>
>>>> *Option 4*
>>>> Make it configurable and move it to an appropriate place.
>>>> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
>>>> 1024) should be enough.
>>>> *Motivation*:
>>>> Both are good and not mutually exclusive and could happen in any order.
>>>> Someone suggested to perfect what the _default_ limit should be, but
>>>> I've not seen an argument _against_ configurability.  Especially in this
>>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>>
>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>> implementation.
>>>> --------------------------
>>>> *Alessandro Benedetti*
>>>> Director @ Sease Ltd.
>>>> *Apache Lucene/Solr Committer*
>>>> *Apache Solr PMC Member*
>>>>
>>>> e-mail: a.benede...@sease.io
>>>>
>>>>
>>>> *Sease* - Information Retrieval Applied
>>>> Consulting | Training | Open Source
>>>>
>>>> Website: Sease.io <http://sease.io/>
>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>> <https://twitter.com/seaseltd> | Youtube
>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>> <https://github.com/seaseltd>
>>>>
>>>

Re: [VOTE] Dimension Limit for KNN Vectors

Reply via email to