Thanks to everyone involved so far! I confirm that the proper subject should have been [POLL] rather than [VOTE]; apologies for the confusion.
We are in the middle of the poll, and this is the summary so far (ordered by preference):

Option 2-4: 9 votes
make the limit configurable, potentially moving the limit to the appropriate place

Option 3: 4 votes
keep it as it is (1024) but move it to a lower level, in the HNSW-specific implementation

Option 1: 0 votes
keep it as it is (1024)

I've also seen many people responding in the mail thread without indicating their preference.
I believe it would be very useful if everyone interested expressed their preference.

Have a good day!
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benede...@sease.io

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter <https://twitter.com/seaseltd> | Youtube <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github <https://github.com/seaseltd>


On Thu, 18 May 2023 at 14:34, Nicholas Knize <nkn...@gmail.com> wrote:

> Difficult to keep up with this topic when it's spread across issues, PRs,
> and email lists. My poll response is option 3. -1 to option 2, I think the
> configuration should be moved to the HNSW specific implementation. At this
> point of technical maturity, it doesn't make sense (to me) to have the
> config be a global system property.
>
> Given the conversation fragmentation I'll ask here what I asked in my
> comment on the github issue
> <https://github.com/apache/lucene/issues/11507#issuecomment-1548612414>.
>
> "Can anyone smart here post their benchmarks to substantiate their
> claims?"
>
> For as enthusiastic a topic as vector dimensionality is, it sure is
> discouraging there isn't empirical data to help make an informed decision
> around what the recommended limit should be. I've only seen broad benchmark
> claims like "We benchmarked a patched Lucene/Solr.
> We fully understand (we measured it :-P)" It sure would be useful to see
> these benchmarks! Not having them to help improve these arbitrary limits
> seems like a serious disservice to the Lucene/Solr user community. I think
> until trustworthy numbers are made available all we'll have is conjecture
> and opinions.
>
> IMHO, given Java's lag in SIMD Vector support I'd rather see equal energy
> put into Robert's Vector API Integration, Plan B
> <https://github.com/apache/lucene/issues/12302> proposal. I'm not trying
> to minimize the importance of adding a configuration to the HNSW
> dimensionality, I just think we have the requisite expertise on this
> project to fix the bigger performance issues that are a direct result of
> Java's bigger vector performance deficiencies.
>
> Nicholas Knize, Ph.D., GISP
> Principal Engineer - Search | Amazon
> Apache Lucene PMC Member and Committer
> nkn...@apache.org
>
>
> On Thu, May 18, 2023 at 7:07 AM Michael Wechner <michael.wech...@wyona.com>
> wrote:
>
>> On 18.05.23 at 12:22, Michael McCandless wrote:
>>
>> I love all the energy and passion going into debating all the ways to
>> poke at this limit, but please let's also spend some of this passion on
>> actually improving the scalability of our aKNN implementation! E.g. Robert
>> opened an exciting "Plan B" (
>> https://github.com/apache/lucene/issues/12302 ) to work around
>> OpenJDK's crazy slowness on enabling access to vectorized SIMD CPU
>> instructions (the Java Vector API, JEP 426: https://openjdk.org/jeps/426
>> ). This could help postings and doc values performance too!
>>
>>
>> agreed, but I do not think the MAX_DIMENSIONS decision should depend on
>> this, because I think whatever improvements can be accomplished eventually,
>> very likely there will always be some limit.
>>
>> Thanks
>>
>> Michael
>>
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Thu, May 18, 2023 at 5:24 AM Alessandro Benedetti <a.benede...@sease.io>
>> wrote:
>>
>>> That's great and a good plan B, but let's try to focus this thread on
>>> collecting votes for a week (let's keep discussions on the nice PR opened
>>> by David or the discussion thread we have in the mailing list already :)
>>>
>>> On Thu, 18 May 2023, 10:10 Ishan Chattopadhyaya,
>>> <ichattopadhy...@gmail.com> wrote:
>>>
>>>> That sounds promising, Michael. Can you share scripts/steps/code to
>>>> reproduce this?
>>>>
>>>> On Thu, 18 May, 2023, 1:16 pm Michael Wechner,
>>>> <michael.wech...@wyona.com> wrote:
>>>>
>>>>> I just implemented it and tested it with OpenAI's
>>>>> text-embedding-ada-002, which is using 1536 dimensions and it works very
>>>>> fine :-)
>>>>>
>>>>> Thanks
>>>>>
>>>>> Michael
>>>>>
>>>>>
>>>>>
>>>>> On 18.05.23 at 00:29, Michael Wechner wrote:
>>>>>
>>>>> IIUC KnnVectorField is deprecated and one is supposed to use
>>>>> KnnFloatVectorField when using float as vector values, right?
>>>>>
>>>>> On 17.05.23 at 16:41, Michael Sokolov wrote:
>>>>>
>>>>> see https://markmail.org/message/kf4nzoqyhwacb7ri
>>>>>
>>>>> On Wed, May 17, 2023 at 10:09 AM David Smiley <dsmi...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> > easily be circumvented by a user
>>>>>>
>>>>>> This is a revelation to me and others, if true. Michael, please then
>>>>>> point to a test or code snippet that shows the Lucene user community what
>>>>>> they want to see so they are unblocked from their explorations of vector
>>>>>> search.
>>>>>>
>>>>>> ~ David Smiley
>>>>>> Apache Lucene/Solr Search Developer
>>>>>> http://www.linkedin.com/in/davidwsmiley
>>>>>>
>>>>>>
>>>>>> On Wed, May 17, 2023 at 7:51 AM Michael Sokolov <msoko...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I think I've said before on this list we don't actually enforce the
>>>>>>> limit in any way that can't easily be circumvented by a user. The codec
>>>>>>> already supports any size vector - it doesn't impose any limit. The way
>>>>>>> the API is written you can *already today* create an index with max-int
>>>>>>> sized vectors and we are committed to supporting that going forward by
>>>>>>> our backwards compatibility policy as Robert points out. This wasn't
>>>>>>> intentional, I think, but it is the facts.
>>>>>>>
>>>>>>> Given that, I think this whole discussion is not really necessary.
>>>>>>>
>>>>>>> On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti
>>>>>>> <a.benede...@sease.io> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>> we have finalized all the options proposed by the community and we
>>>>>>>> are ready to vote for the preferred one and then proceed with the
>>>>>>>> implementation.
>>>>>>>>
>>>>>>>> *Option 1*
>>>>>>>> Keep it as it is (dimension limit hardcoded to 1024)
>>>>>>>> *Motivation*:
>>>>>>>> We are close to improving on many fronts. Given the criticality of
>>>>>>>> Lucene in computing infrastructure and the concerns raised by one of
>>>>>>>> the most active stewards of the project, I think we should keep working
>>>>>>>> toward improving the feature as is and move to up the limit after we can
>>>>>>>> demonstrate improvement unambiguously.
>>>>>>>>
>>>>>>>> *Option 2*
>>>>>>>> make the limit configurable, for example through a system property
>>>>>>>> *Motivation*:
>>>>>>>> The system administrator can enforce a limit its users need to
>>>>>>>> respect that it's in line with whatever the admin decided to be
>>>>>>>> acceptable for them.
>>>>>>>> The default can stay the current one.
>>>>>>>> This should open the doors for Apache Solr, Elasticsearch,
>>>>>>>> OpenSearch, and any sort of plugin development
>>>>>>>>
>>>>>>>> *Option 3*
>>>>>>>> Move the max dimension limit lower level to a HNSW specific
>>>>>>>> implementation. Once there, this limit would not bind any other
>>>>>>>> potential vector engine alternative/evolution.
>>>>>>>> *Motivation:* There seem to be contradictory performance
>>>>>>>> interpretations about the current HNSW implementation. Some consider
>>>>>>>> its performance ok, some not, and it depends on the target data set and
>>>>>>>> use case. Increasing the max dimension limit where it is currently (in
>>>>>>>> top level FloatVectorValues) would not allow potential alternatives
>>>>>>>> (e.g. for other use-cases) to be based on a lower limit.
>>>>>>>>
>>>>>>>> *Option 4*
>>>>>>>> Make it configurable and move it to an appropriate place.
>>>>>>>> In particular, a
>>>>>>>> simple Integer.getInteger("lucene.hnsw.maxDimensions", 1024) should be
>>>>>>>> enough.
>>>>>>>> *Motivation*:
>>>>>>>> Both are good and not mutually exclusive and could happen in any
>>>>>>>> order.
>>>>>>>> Someone suggested to perfect what the _default_ limit should be,
>>>>>>>> but I've not seen an argument _against_ configurability. Especially in
>>>>>>>> this way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>>>>>>
>>>>>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>>>>>> implementation.
>>>>>>>> --------------------------
>>>>>>>> *Alessandro Benedetti*
>>>>>>>> Director @ Sease Ltd.
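
For readers following the poll, here is a minimal sketch of what the Option 4 suggestion quoted above could look like in practice. Only the property name "lucene.hnsw.maxDimensions" and the default of 1024 come from the thread; the class MaxDimensionsSketch and its methods are hypothetical names used for illustration and are not Lucene APIs.

```java
// Sketch only: MaxDimensionsSketch and its methods are hypothetical names,
// not part of Lucene. The property name and the 1024 default are taken
// from the Option 4 text in the poll.
final class MaxDimensionsSketch {
    static final String PROPERTY = "lucene.hnsw.maxDimensions";
    static final int DEFAULT_MAX_DIMENSIONS = 1024;

    /** Reads the limit on every call, so a changed property is picked up. */
    static int maxDimensions() {
        // Integer.getInteger falls back to the default when the property
        // is unset or cannot be parsed as an integer.
        return Integer.getInteger(PROPERTY, DEFAULT_MAX_DIMENSIONS);
    }

    /** Rejects vector dimensions outside 1..maxDimensions(). */
    static void checkDimension(int dimension) {
        int max = maxDimensions();
        if (dimension < 1 || dimension > max) {
            throw new IllegalArgumentException(
                "vector dimension " + dimension + " must be between 1 and " + max);
        }
    }

    public static void main(String[] args) {
        // Example launch: java -Dlucene.hnsw.maxDimensions=2048 MaxDimensionsSketch.java
        System.out.println("max dimensions = " + maxDimensions());
        checkDimension(1024); // accepted under the default limit
    }
}
```

With the property set to 2048, a 1536-dimension vector such as the text-embedding-ada-002 case mentioned in the thread would pass this check. Note that a system property is process-wide, which is the concern raised in the thread about making the configuration global rather than specific to the HNSW implementation.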