My vote is for option 3. It prevents Lucene's own limit from having to be
increased, and it allows others who implement a different codec to set a
limit of their choosing.

That said, I don't know the historical reasons for putting specific
configuration items at the codec level. This limit is performance-related,
though, and different codec implementations will have different performance
concerns, which makes the codec a natural place for it.
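
To make that concrete, here is a minimal sketch of what options 3 and 4
combined could look like: the limit owned by an HNSW-specific holder class
and read from the system property floated in option 4. All class and method
names below are hypothetical, not existing Lucene API; only
Integer.getInteger is real JDK.

    // Hedged sketch only: HnswLimits is a made-up holder class, and
    // "lucene.hnsw.maxDimensions" is the property name proposed in
    // option 4 of the vote below.
    public final class HnswLimits {

      // Integer.getInteger reads the JVM system property if it is set
      // and parses as an int; otherwise it falls back to the default
      // (1024, the current hardcoded limit).
      public static final int MAX_DIMENSIONS =
          Integer.getInteger("lucene.hnsw.maxDimensions", 1024);

      private HnswLimits() {}

      // An HNSW-specific format would call this before indexing a
      // vector field; an alternative vector engine would be free to
      // define its own ceiling rather than being bound by this one.
      public static void checkDimension(int dimension) {
        if (dimension > MAX_DIMENSIONS) {
          throw new IllegalArgumentException(
              "vector dimension " + dimension
                  + " exceeds the configured maximum " + MAX_DIMENSIONS);
        }
      }
    }

Starting the JVM with -Dlucene.hnsw.maxDimensions=2048 would then raise the
ceiling for that deployment only, without binding Lucene's public API.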


On Tue, May 16, 2023, 8:02 AM Michael Wechner <michael.wech...@wyona.com>
wrote:

> +1 to Gus' reply.
>
> I think that Robert's veto, or anyone else's veto, is fair enough, but I
> also think that anyone who is vetoing should be very clear about the
> objectives / goals that must be achieved in order to get a +1.
>
> If no clear objectives / goals can be defined and agreed on, then the
> whole thing becomes arbitrary.
>
> Therefore I would also be interested to know the objectives / goals that
> must be met for there to be a +1 on this vote.
>
> Thanks
>
> Michael
>
>
>
> Am 16.05.23 um 13:45 schrieb Gus Heck:
>
> Robert,
>
> Can you explain in clear technical terms the standard that must be met for
> performance? For example, a benchmark that must run in X time on Y hardware
> (and why that test is suitable)? Or some other reproducible criterion? So
> far I've heard you give an *opinion* that it's unusable, but that's not a
> technical criterion; others may have a different concept of what is usable
> to them.
>
> Forgive me if I misunderstand, but the essence of your argument has seemed
> to be:
>
> "Performance isn't good enough, therefore we should force anyone who wants
> to experiment with something bigger to fork the code base to do it"
>
> Thus, it is necessary to have a clear, unambiguous standard for "good
> enough" that anyone can verify. A clear standard would also focus efforts
> on improvement.
>
> Where are the goal posts?
>
> FWIW I'm +1 on any of options 2-4, since I believe the existence of a hard limit is
> fundamentally counterproductive in an open source setting, as it will lead
> to *fewer people* pushing the limits. Extremely few people are going to
> get into the nitty-gritty of optimizing things unless they are staring at
> code that they can prove does something interesting, but doesn't run fast
> enough for their purposes. If people hit a hard limit, more of them give up
> and never develop the code that will motivate them to look for
> optimizations.
>
> -Gus
>
> On Tue, May 16, 2023 at 6:04 AM Robert Muir <rcm...@gmail.com> wrote:
>
>> I still feel -1 (veto) on increasing this limit. Sending more emails does
>> not change the technical facts or make the veto go away.
>>
>> On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti <
>> a.benede...@sease.io> wrote:
>>
>>> Hi all,
>>> we have finalized all the options proposed by the community and we are
>>> ready to vote for the preferred one and then proceed with the
>>> implementation.
>>>
>>> *Option 1*
>>> Keep it as it is (dimension limit hardcoded to 1024)
>>> *Motivation*:
>>> We are close to improving on many fronts. Given the criticality of
>>> Lucene in computing infrastructure and the concerns raised by one of the
>>> most active stewards of the project, I think we should keep working toward
>>> improving the feature as is, and raise the limit only after we can
>>> demonstrate improvement unambiguously.
>>>
>>> *Option 2*
>>> Make the limit configurable, for example through a system property
>>> *Motivation*:
>>> The system administrator can enforce a limit that users need to respect,
>>> in line with whatever the admin has decided is acceptable for them.
>>> The default can stay the current one.
>>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>>> and any sort of plugin development.
>>>
>>> *Option 3*
>>> Move the max dimension limit to a lower level, i.e. an HNSW-specific
>>> implementation. Once there, this limit would not bind any other potential
>>> vector engine alternative/evolution.
>>> *Motivation:* There seem to be contradictory performance
>>> interpretations of the current HNSW implementation. Some consider its
>>> performance OK, some do not, and it depends on the target data set and use
>>> case. Increasing the max dimension limit where it currently lives (in the
>>> top-level FloatVectorValues) would not allow potential alternatives (e.g.
>>> for other use cases) to be based on a lower limit.
>>>
>>> *Option 4*
>>> Make it configurable and move it to an appropriate place.
>>> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
>>> 1024) should be enough.
>>> *Motivation*:
>>> Options 2 and 3 are both good, are not mutually exclusive, and could
>>> happen in any order. Someone suggested perfecting what the _default_
>>> limit should be, but I've not seen an argument _against_ configurability,
>>> especially in this form -- a toggle that doesn't bind Lucene's APIs in
>>> any way.
>>>
>>> I'll keep this [VOTE] open for a week and then proceed with the
>>> implementation.
>>> --------------------------
>>> *Alessandro Benedetti*
>>> Director @ Sease Ltd.
>>> *Apache Lucene/Solr Committer*
>>> *Apache Solr PMC Member*
>>>
>>> e-mail: a.benede...@sease.io
>>>
>>>
>>> *Sease* - Information Retrieval Applied
>>> Consulting | Training | Open Source
>>>
>>> Website: Sease.io <http://sease.io/>
>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>> <https://twitter.com/seaseltd> | Youtube
>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>> <https://github.com/seaseltd>
>>>
>>
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
>
>
>
