That's great, and a good plan B, but let's try to keep this thread focused on collecting votes for a week (let's keep discussions on the nice PR opened by David, or on the discussion thread we already have on the mailing list :)
On Thu, 18 May 2023, 10:10 Ishan Chattopadhyaya, <ichattopadhy...@gmail.com> wrote:

> That sounds promising, Michael. Can you share scripts/steps/code to
> reproduce this?
>
> On Thu, 18 May, 2023, 1:16 pm Michael Wechner, <michael.wech...@wyona.com>
> wrote:
>
>> I just implemented it and tested it with OpenAI's text-embedding-ada-002,
>> which is using 1536 dimensions and it works very fine :-)
>>
>> Thanks
>>
>> Michael
>>
>>
>>
>> On 18.05.23 at 00:29, Michael Wechner wrote:
>>
>> IIUC KnnVectorField is deprecated and one is supposed to use
>> KnnFloatVectorField when using float as vector values, right?
>>
>> On 17.05.23 at 16:41, Michael Sokolov wrote:
>>
>> see https://markmail.org/message/kf4nzoqyhwacb7ri
>>
>> On Wed, May 17, 2023 at 10:09 AM David Smiley <dsmi...@apache.org> wrote:
>>
>>> > easily be circumvented by a user
>>>
>>> This is a revelation to me and others, if true. Michael, please then
>>> point to a test or code snippet that shows the Lucene user community what
>>> they want to see so they are unblocked from their explorations of vector
>>> search.
>>>
>>> ~ David Smiley
>>> Apache Lucene/Solr Search Developer
>>> http://www.linkedin.com/in/davidwsmiley
>>>
>>>
>>> On Wed, May 17, 2023 at 7:51 AM Michael Sokolov <msoko...@gmail.com>
>>> wrote:
>>>
>>>> I think I've said before on this list we don't actually enforce the
>>>> limit in any way that can't easily be circumvented by a user. The codec
>>>> already supports any size vector - it doesn't impose any limit. The way the
>>>> API is written you can *already today* create an index with max-int sized
>>>> vectors and we are committed to supporting that going forward by our
>>>> backwards compatibility policy as Robert points out. This wasn't
>>>> intentional, I think, but it is the facts.
>>>>
>>>> Given that, I think this whole discussion is not really necessary.
>>>>
>>>> On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti <a.benede...@sease.io> wrote:
>>>>
>>>>> Hi all,
>>>>> we have finalized all the options proposed by the community and we are
>>>>> ready to vote for the preferred one and then proceed with the
>>>>> implementation.
>>>>>
>>>>> *Option 1*
>>>>> Keep it as it is (dimension limit hardcoded to 1024)
>>>>> *Motivation*:
>>>>> We are close to improving on many fronts. Given the criticality of
>>>>> Lucene in computing infrastructure and the concerns raised by one of the
>>>>> most active stewards of the project, I think we should keep working toward
>>>>> improving the feature as is and move to up the limit after we can
>>>>> demonstrate improvement unambiguously.
>>>>>
>>>>> *Option 2*
>>>>> make the limit configurable, for example through a system property
>>>>> *Motivation*:
>>>>> The system administrator can enforce a limit its users need to respect
>>>>> that it's in line with whatever the admin decided to be acceptable for
>>>>> them.
>>>>> The default can stay the current one.
>>>>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>>>>> and any sort of plugin development
>>>>>
>>>>> *Option 3*
>>>>> Move the max dimension limit lower level to a HNSW specific
>>>>> implementation. Once there, this limit would not bind any other potential
>>>>> vector engine alternative/evolution.
>>>>> *Motivation:* There seem to be contradictory performance
>>>>> interpretations about the current HNSW implementation. Some consider its
>>>>> performance ok, some not, and it depends on the target data set and use
>>>>> case. Increasing the max dimension limit where it is currently (in top
>>>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>>>> other use-cases) to be based on a lower limit.
>>>>>
>>>>> *Option 4*
>>>>> Make it configurable and move it to an appropriate place.
>>>>> In particular, a
>>>>> simple Integer.getInteger("lucene.hnsw.maxDimensions", 1024) should be
>>>>> enough.
>>>>> *Motivation*:
>>>>> Both are good and not mutually exclusive and could happen in any order.
>>>>> Someone suggested to perfect what the _default_ limit should be, but
>>>>> I've not seen an argument _against_ configurability. Especially in this
>>>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>>>
>>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>>> implementation.
>>>>> --------------------------
>>>>> *Alessandro Benedetti*
>>>>> Director @ Sease Ltd.
>>>>> *Apache Lucene/Solr Committer*
>>>>> *Apache Solr PMC Member*
>>>>>
>>>>> e-mail: a.benede...@sease.io
>>>>>
>>>>>
>>>>> *Sease* - Information Retrieval Applied
>>>>> Consulting | Training | Open Source
>>>>>
>>>>> Website: Sease.io <http://sease.io/>
>>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>>> <https://twitter.com/seaseltd> | Youtube
>>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>>> <https://github.com/seaseltd>
>>>>>
>>>>
>>
>>
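
For anyone weighing Option 4, here is a rough sketch of what the Integer.getInteger("lucene.hnsw.maxDimensions", 1024) lookup quoted above could look like in isolation. Only the property name and the default of 1024 come from the proposal; the class MaxDimensionsSketch and the methods maxDimensions() and checkDimension() are hypothetical names made up for illustration, and this is not Lucene's actual code or where the check would necessarily live.

```java
// Hypothetical illustration only: not part of Lucene.
final class MaxDimensionsSketch {

    // Property name and default value as quoted in Option 4.
    static final String PROPERTY = "lucene.hnsw.maxDimensions";
    static final int DEFAULT_MAX_DIMENSIONS = 1024;

    // Reads the limit from the system property on every call.
    // Integer.getInteger falls back to the default when the property
    // is unset or cannot be parsed as an integer.
    static int maxDimensions() {
        return Integer.getInteger(PROPERTY, DEFAULT_MAX_DIMENSIONS);
    }

    // Rejects vector dimensions outside the range [1, maxDimensions()].
    static void checkDimension(int dimension) {
        int max = maxDimensions();
        if (dimension <= 0 || dimension > max) {
            throw new IllegalArgumentException(
                "vector dimension must be in [1, " + max + "]; got " + dimension);
        }
    }

    public static void main(String[] args) {
        System.out.println("effective max dimensions: " + maxDimensions());
    }
}
```

Under these assumptions, an administrator would raise the limit by starting the JVM with -Dlucene.hnsw.maxDimensions=2048, after which a 1536-dimension vector such as the text-embedding-ada-002 one mentioned earlier in the thread would pass checkDimension(), while the default stays at 1024 for everyone else.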