Thanks, Michael, for sharing your code snippet on how to circumvent the limit. My reaction to it is the same as Alessandro's.
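
For anyone who doesn't want to chase the archive link quoted below: as I understand it, the general shape of the workaround (an untested sketch of my own, not Michael's exact snippet; the class and field names here are made up) is to hand the indexer a FieldType that reports the larger dimension itself, so the check in FieldType#setVectorAttributes that Alessandro mentions below never runs, and the codec simply writes whatever dimension the field type reports. Exact constructors and method names may differ between 9.x releases.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.VectorEncoding;
import org.apache.lucene.index.VectorSimilarityFunction;

public class OversizedVectorExample {

  // A FieldType that reports its vector attributes directly instead of going
  // through setVectorAttributes, which is where the dimension check lives.
  static FieldType bigVectorType(int dimension) {
    return new FieldType() {
      @Override
      public int vectorDimension() {
        return dimension;
      }

      @Override
      public VectorEncoding vectorEncoding() {
        return VectorEncoding.FLOAT32;
      }

      @Override
      public VectorSimilarityFunction vectorSimilarityFunction() {
        return VectorSimilarityFunction.DOT_PRODUCT;
      }
    };
  }

  static Document docWithBigVector(String fieldName, float[] vector) {
    Document doc = new Document();
    // As far as I can tell, the (name, vector, fieldType) constructor checks
    // the vector against the field type but does not re-apply the limit.
    doc.add(new KnnFloatVectorField(fieldName, vector, bigVectorType(vector.length)));
    return doc;
  }
}
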
I just created a PR to make the limit configurable:
https://github.com/apache/lucene/pull/12306

If there is to be a veto presented to the PR, it should include technical
reasons specific to the PR and be raised on the PR itself. Afterwards, I leave
it to others to move the limit, along with its configurability, so that it is
enforced in a codec-specific way.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, May 17, 2023 at 12:58 PM Mayya Sharipova
<mayya.sharip...@elastic.co.invalid> wrote:

> Alessandro,
> Thanks for raising the code of conduct; it is very discouraging and
> intimidating to participate in discussions where such language is used,
> especially by senior members.
>
> Michael S.,
> thanks for your suggestion; that's what we used in Elasticsearch to raise
> the dims limit, and Alessandro, perhaps you can use it as well in Solr for
> the time being.
>
> On Wed, May 17, 2023 at 11:03 AM Alessandro Benedetti <
> a.benede...@sease.io> wrote:
>
>> Thanks, Michael,
>> that example backs even more strongly the need to clean this up and make
>> the limit configurable without requiring custom field types, I guess. (I
>> was taking a look at the code again, and it seems the limit is also
>> checked twice:
>> in org.apache.lucene.document.KnnByteVectorField#createType and then
>> in org.apache.lucene.document.FieldType#setVectorAttributes, for both the
>> byte and float variants.)
>> This should help people vote, great!
>>
>> Cheers
>> --------------------------
>> *Alessandro Benedetti*
>> Director @ Sease Ltd.
>> *Apache Lucene/Solr Committer*
>> *Apache Solr PMC Member*
>>
>> e-mail: a.benede...@sease.io
>>
>>
>> *Sease* - Information Retrieval Applied
>> Consulting | Training | Open Source
>>
>> Website: Sease.io <http://sease.io/>
>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>> <https://twitter.com/seaseltd> | Youtube
>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>> <https://github.com/seaseltd>
>>
>>
>> On Wed, 17 May 2023 at 15:42, Michael Sokolov <msoko...@gmail.com> wrote:
>>
>>> see https://markmail.org/message/kf4nzoqyhwacb7ri
>>>
>>> On Wed, May 17, 2023 at 10:09 AM David Smiley <dsmi...@apache.org>
>>> wrote:
>>>
>>>> > easily be circumvented by a user
>>>>
>>>> This is a revelation to me and others, if true. Michael, please then
>>>> point to a test or code snippet that shows the Lucene user community
>>>> what they want to see, so they are unblocked from their explorations
>>>> of vector search.
>>>>
>>>> ~ David Smiley
>>>> Apache Lucene/Solr Search Developer
>>>> http://www.linkedin.com/in/davidwsmiley
>>>>
>>>>
>>>> On Wed, May 17, 2023 at 7:51 AM Michael Sokolov <msoko...@gmail.com>
>>>> wrote:
>>>>
>>>>> I think I've said before on this list that we don't actually enforce
>>>>> the limit in any way that can't easily be circumvented by a user. The
>>>>> codec already supports any size vector - it doesn't impose any limit.
>>>>> The way the API is written, you can *already today* create an index
>>>>> with max-int-sized vectors, and we are committed to supporting that
>>>>> going forward by our backwards-compatibility policy, as Robert points
>>>>> out. This wasn't intentional, I think, but those are the facts.
>>>>>
>>>>> Given that, I think this whole discussion is not really necessary.
>>>>>
>>>>> On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti <
>>>>> a.benede...@sease.io> wrote:
>>>>>
>>>>>> Hi all,
>>>>>> we have finalized all the options proposed by the community, and we
>>>>>> are ready to vote for the preferred one and then proceed with the
>>>>>> implementation.
>>>>>>
>>>>>> *Option 1*
>>>>>> Keep it as it is (dimension limit hardcoded to 1024).
>>>>>> *Motivation*:
>>>>>> We are close to improving on many fronts. Given the criticality of
>>>>>> Lucene in computing infrastructure and the concerns raised by one of
>>>>>> the most active stewards of the project, I think we should keep
>>>>>> working toward improving the feature as is and move to raise the
>>>>>> limit only after we can demonstrate improvement unambiguously.
>>>>>>
>>>>>> *Option 2*
>>>>>> Make the limit configurable, for example through a system property.
>>>>>> *Motivation*:
>>>>>> The system administrator can enforce a limit that their users need
>>>>>> to respect and that is in line with whatever the admin has decided
>>>>>> is acceptable for them.
>>>>>> The default can stay the current one.
>>>>>> This should open the doors for Apache Solr, Elasticsearch,
>>>>>> OpenSearch, and any sort of plugin development.
>>>>>>
>>>>>> *Option 3*
>>>>>> Move the max dimension limit to a lower level, i.e. an HNSW-specific
>>>>>> implementation. Once there, this limit would not bind any other
>>>>>> potential vector engine alternative/evolution.
>>>>>> *Motivation*:
>>>>>> There seem to be contradictory performance interpretations of the
>>>>>> current HNSW implementation. Some consider its performance OK, some
>>>>>> not, and it depends on the target data set and use case. Increasing
>>>>>> the max dimension limit where it currently sits (in the top-level
>>>>>> FloatVectorValues) would not allow potential alternatives (e.g. for
>>>>>> other use cases) to be based on a lower limit.
>>>>>>
>>>>>> *Option 4*
>>>>>> Make it configurable and move it to an appropriate place.
>>>>>> In particular, a simple
>>>>>> Integer.getInteger("lucene.hnsw.maxDimensions", 1024) should be
>>>>>> enough.
>>>>>> *Motivation*:
>>>>>> Both are good, not mutually exclusive, and could happen in any order.
>>>>>> Someone suggested perfecting what the _default_ limit should be, but
>>>>>> I've not seen an argument _against_ configurability. Especially in
>>>>>> this way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>>>>
>>>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>>>> implementation.
>>>>>> --------------------------
>>>>>> *Alessandro Benedetti*
>>>>>> Director @ Sease Ltd.
>>>>>> *Apache Lucene/Solr Committer*
>>>>>> *Apache Solr PMC Member*
>>>>>>
>>>>>> e-mail: a.benede...@sease.io
>>>>>>
>>>>>>
>>>>>> *Sease* - Information Retrieval Applied
>>>>>> Consulting | Training | Open Source
>>>>>>
>>>>>> Website: Sease.io <http://sease.io/>
>>>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>>>> <https://twitter.com/seaseltd> | Youtube
>>>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>>>> <https://github.com/seaseltd>
>>>>>
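
P.S. To make option 4 from the quoted [VOTE] a bit more concrete, a configurable limit along those lines could look roughly like the sketch below. This is a simplified illustration, not the actual diff in my PR; the class and method names are invented, and the property name is simply the one proposed in the vote mail.

// Hypothetical sketch of an option-4-style configurable limit; not the PR's
// actual implementation.
public final class VectorLimit {

  /** Today's hardcoded default. */
  public static final int DEFAULT_MAX_DIMENSIONS = 1024;

  /**
   * Effective limit, overridable with e.g. -Dlucene.hnsw.maxDimensions=4096.
   * Integer.getInteger falls back to the default when the property is absent
   * or not a valid integer.
   */
  public static final int MAX_DIMENSIONS =
      Integer.getInteger("lucene.hnsw.maxDimensions", DEFAULT_MAX_DIMENSIONS);

  private VectorLimit() {}

  /** The existing dimension checks would compare against this instead of a constant. */
  public static void checkDimension(int dimension) {
    if (dimension <= 0 || dimension > MAX_DIMENSIONS) {
      throw new IllegalArgumentException(
          "vector dimension must be between 1 and " + MAX_DIMENSIONS + "; got " + dimension);
    }
  }
}

The default stays at 1024, so nothing changes for anyone who doesn't set the property.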