Re: [VOTE] Dimension Limit for KNN Vectors

Alessandro Benedetti Fri, 19 May 2023 02:58:24 -0700

Thanks to everyone involved so far!
I confirm that the proper subject should have been [POLL] rather than [VOTE];
apologies for the confusion.


We are in the middle of the poll and this is the summary so far (ordered by
preference):

Option 2-4: 9 votes
make the limit configurable, potentially moving the limit to the
appropriate place

Option 3: 4 votes
keep it as it is (1024) but move it to a lower level, into the HNSW-specific
implementation

Option 1: 0 votes
keep it as it is (1024)
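
For readers skimming the poll, here is a minimal sketch of what the
configurable variant could look like. It assumes the property name
suggested in the Option 4 text quoted below ("lucene.hnsw.maxDimensions")
and the current default of 1024. The class VectorDimensionLimit and the
method checkDimension are hypothetical names used only for illustration;
they are not existing Lucene APIs.

```java
final class VectorDimensionLimit {
    // Property name taken from the Option 4 proposal; an assumption, not an existing Lucene setting.
    static final String MAX_DIMENSIONS_PROPERTY = "lucene.hnsw.maxDimensions";
    // The currently hardcoded limit, kept as the default.
    static final int DEFAULT_MAX_DIMENSIONS = 1024;

    private VectorDimensionLimit() {}

    /** Reads the limit from the system property, falling back to the default. */
    public static int maxDimensions() {
        return Integer.getInteger(MAX_DIMENSIONS_PROPERTY, DEFAULT_MAX_DIMENSIONS);
    }

    /** Hypothetical check an HNSW-specific format could run before accepting a vector field. */
    public static void checkDimension(int dimension) {
        int max = maxDimensions();
        if (dimension <= 0 || dimension > max) {
            throw new IllegalArgumentException(
                "vector dimension must be in [1, " + max + "]; got " + dimension);
        }
    }

    public static void main(String[] args) {
        // The administrator raises the limit, e.g. with -Dlucene.hnsw.maxDimensions=2048.
        System.setProperty(MAX_DIMENSIONS_PROPERTY, "2048");
        checkDimension(1536); // accepted once the limit has been raised
        System.out.println("max dimensions: " + maxDimensions());
    }
}
```

The point of this shape is that the default stays at 1024 unless the
administrator explicitly opts in, and nothing in the public API changes.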

I've also seen many people responding in the mail thread without
indicating their preference.
I believe it would be very useful if everyone interested expressed their
preference.

Have a good day!
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benede...@sease.io


*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>


On Thu, 18 May 2023 at 14:34, Nicholas Knize <nkn...@gmail.com> wrote:

> Difficult to keep up with this topic when it's spread across issues, PRs,
> and email lists. My poll response is option 3. -1 to option 2, I think the
> configuration should be moved to the HNSW specific implementation. At this
> point of technical maturity, it doesn't make sense (to me) to have the
> config be a global system property.
>
> Given the conversation fragmentation I'll ask here what I asked in my
> comment on the github issue
> <https://github.com/apache/lucene/issues/11507#issuecomment-1548612414>.
>
> "Can anyone smart here post their benchmarks to substantiate their
> claims?"
>
> For as enthusiastic a topic as vector dimensionality is, it sure is
> discouraging there isn't empirical data to help make an informed decision
> around what the recommended limit should be. I've only seen broad benchmark
> claims like "We benchmarked a patched Lucene/Solr. We fully understand (we
> measured it :-P)" It sure would be useful to see these benchmarks! Not
> having them to help improve these arbitrary limits seems like a serious
> disservice to the Lucene/Solr user community. I think until trustworthy
> numbers are made available all we'll have is conjecture and opinions.
>
> IMHO, given Java's lag in SIMD Vector support I'd rather see equal energy
> put into Robert's Vector API Integration, Plan B
> <https://github.com/apache/lucene/issues/12302> proposal. I'm not trying
> to minimize the importance of adding a configuration to the HNSW
> dimensionality, I just think we have the requisite expertise on this
> project to fix the bigger performance issues that are a direct result of
> Java's bigger vector performance deficiencies.
>
> Nicholas Knize, Ph.D., GISP
> Principal Engineer - Search  |  Amazon
> Apache Lucene PMC Member and Committer
> nkn...@apache.org
>
>
> On Thu, May 18, 2023 at 7:07 AM Michael Wechner <michael.wech...@wyona.com>
> wrote:
>
>>
>>
>> On 18.05.23 at 12:22, Michael McCandless wrote:
>>
>>
>> I love all the energy and passion going into debating all the ways to
>> poke at this limit, but please let's also spend some of this passion on
>> actually improving the scalability of our aKNN implementation!  E.g. Robert
>> opened an exciting "Plan B" (
>> https://github.com/apache/lucene/issues/12302 ) to work around
>> OpenJDK's crazy slowness on enabling access to vectorized SIMD CPU
>> instructions (the Java Vector API, JEP 426: https://openjdk.org/jeps/426
>> ).  This could help postings and doc values performance too!
>>
>>
>>
>> Agreed, but I do not think the MAX_DIMENSIONS decision should depend on
>> this: whatever improvements are eventually accomplished, there will very
>> likely always be some limit.
>>
>> Thanks
>>
>> Michael
>>
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Thu, May 18, 2023 at 5:24 AM Alessandro Benedetti <
>> a.benede...@sease.io> wrote:
>>
>>> That's great and a good plan B, but let's try to focus this thread on
>>> collecting votes for a week (let's keep discussions on the nice PR opened
>>> by David or the discussion thread we have in the mailing list already) :)
>>>
>>> On Thu, 18 May 2023, 10:10 Ishan Chattopadhyaya, <
>>> ichattopadhy...@gmail.com> wrote:
>>>
>>>> That sounds promising, Michael. Can you share scripts/steps/code to
>>>> reproduce this?
>>>>
>>>> On Thu, 18 May, 2023, 1:16 pm Michael Wechner, <
>>>> michael.wech...@wyona.com> wrote:
>>>>
>>>>> I just implemented it and tested it with OpenAI's
>>>>> text-embedding-ada-002, which uses 1536 dimensions, and it works very
>>>>> well :-)
>>>>>
>>>>> Thanks
>>>>>
>>>>> Michael
>>>>>
>>>>>
>>>>>
>>>>> On 18.05.23 at 00:29, Michael Wechner wrote:
>>>>>
>>>>> IIUC KnnVectorField is deprecated and one is supposed to use
>>>>> KnnFloatVectorField when using float as vector values, right?
>>>>>
>>>>> On 17.05.23 at 16:41, Michael Sokolov wrote:
>>>>>
>>>>> see https://markmail.org/message/kf4nzoqyhwacb7ri
>>>>>
>>>>> On Wed, May 17, 2023 at 10:09 AM David Smiley <dsmi...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> > easily be circumvented by a user
>>>>>>
>>>>>> This is a revelation to me and others, if true.  Michael, please then
>>>>>> point to a test or code snippet that shows the Lucene user community what
>>>>>> they want to see so they are unblocked from their explorations of vector
>>>>>> search.
>>>>>>
>>>>>> ~ David Smiley
>>>>>> Apache Lucene/Solr Search Developer
>>>>>> http://www.linkedin.com/in/davidwsmiley
>>>>>>
>>>>>>
>>>>>> On Wed, May 17, 2023 at 7:51 AM Michael Sokolov <msoko...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I think I've said before on this list we don't actually enforce the
>>>>>>> limit in any way that can't easily be circumvented by a user. The codec
>>>>>>> already supports any size vector - it doesn't impose any limit. The way
>>>>>>> the API is written you can *already today* create an index with max-int
>>>>>>> sized vectors and we are committed to supporting that going forward by
>>>>>>> our backwards compatibility policy as Robert points out. This wasn't
>>>>>>> intentional, I think, but those are the facts.
>>>>>>>
>>>>>>> Given that, I think this whole discussion is not really necessary.
>>>>>>>
>>>>>>> On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti <
>>>>>>> a.benede...@sease.io> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>> we have finalized all the options proposed by the community and we
>>>>>>>> are ready to vote for the preferred one and then proceed with the
>>>>>>>> implementation.
>>>>>>>>
>>>>>>>> *Option 1*
>>>>>>>> Keep it as it is (dimension limit hardcoded to 1024)
>>>>>>>> *Motivation*:
>>>>>>>> We are close to improving on many fronts. Given the criticality of
>>>>>>>> Lucene in computing infrastructure and the concerns raised by one of
>>>>>>>> the most active stewards of the project, I think we should keep
>>>>>>>> working toward improving the feature as is and move to raise the limit
>>>>>>>> after we can demonstrate improvement unambiguously.
>>>>>>>>
>>>>>>>> *Option 2*
>>>>>>>> Make the limit configurable, for example through a system property
>>>>>>>> *Motivation*:
>>>>>>>> The system administrator can enforce a limit that users need to
>>>>>>>> respect, in line with whatever the admin has decided is acceptable
>>>>>>>> for them.
>>>>>>>> The default can stay the current one.
>>>>>>>> This should open the doors for Apache Solr, Elasticsearch,
>>>>>>>> OpenSearch, and any sort of plugin development.
>>>>>>>> *Option 3*
>>>>>>>> Move the max dimension limit to a lower level, into an HNSW-specific
>>>>>>>> implementation. Once there, this limit would not bind any other
>>>>>>>> potential vector engine alternative/evolution.
>>>>>>>> *Motivation:* There seem to be contradictory performance
>>>>>>>> interpretations about the current HNSW implementation. Some consider
>>>>>>>> its performance ok, some not, and it depends on the target data set and
>>>>>>>> use case. Increasing the max dimension limit where it is currently (in
>>>>>>>> top level FloatVectorValues) would not allow potential alternatives
>>>>>>>> (e.g. for other use-cases) to be based on a lower limit.
>>>>>>>>
>>>>>>>> *Option 4*
>>>>>>>> Make it configurable and move it to an appropriate place.
>>>>>>>> In particular, a
>>>>>>>> simple Integer.getInteger("lucene.hnsw.maxDimensions", 1024) should be
>>>>>>>> enough.
>>>>>>>> *Motivation*:
>>>>>>>> Options 2 and 3 are both good, not mutually exclusive, and could happen
>>>>>>>> in any order.
>>>>>>>> Someone suggested perfecting what the _default_ limit should be,
>>>>>>>> but I've not seen an argument _against_ configurability.  Especially in
>>>>>>>> this way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>>>>>>
>>>>>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>>>>>> implementation.
>>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>
