Thanks, Michael. I guess that example makes an even stronger case for cleaning this up and making the limit configurable without the need for custom field types. (I was taking another look at the code, and it seems the limit is also checked twice: in org.apache.lucene.document.KnnByteVectorField#createType and then in org.apache.lucene.document.FieldType#setVectorAttributes, for both the byte and float variants.) This should help people vote, great!
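For anyone weighing Option 4, here is a minimal sketch of what a system-property-backed limit with a single validation point could look like. The property name and the 1024 default come from the proposal in this thread; the class and method names are hypothetical and are not existing Lucene APIs.

```java
// Hypothetical helper, NOT part of Lucene: illustrates the lookup proposed in Option 4.
final class VectorDimensionLimit {
  // Property name and default value are taken from the proposal in this thread.
  static final String PROPERTY = "lucene.hnsw.maxDimensions";
  static final int DEFAULT_MAX_DIMENSIONS = 1024;

  private VectorDimensionLimit() {}

  // Reads the property on every call so a changed value is picked up.
  // Integer.getInteger falls back to the default when the property is
  // unset or cannot be parsed as a number.
  static int maxDimensions() {
    return Integer.getInteger(PROPERTY, DEFAULT_MAX_DIMENSIONS);
  }

  // One place to validate a dimension, rather than checking it in two spots.
  static void checkDimension(int dimension) {
    int max = maxDimensions();
    if (dimension <= 0 || dimension > max) {
      throw new IllegalArgumentException(
          "vector dimension must be in [1, " + max + "]; got " + dimension);
    }
  }

  public static void main(String[] args) {
    checkDimension(768); // passes with the default limit
    System.out.println("max dimensions: " + maxDimensions());
  }
}
```

Started with -Dlucene.hnsw.maxDimensions=2048, the same check would accept vectors of up to 2048 dimensions. Whether the value should be read once at class load or on every call is a design choice this sketch leaves open.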

Cheers
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benede...@sease.io


*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>


On Wed, 17 May 2023 at 15:42, Michael Sokolov <msoko...@gmail.com> wrote:

> see https://markmail.org/message/kf4nzoqyhwacb7ri
>
> On Wed, May 17, 2023 at 10:09 AM David Smiley <dsmi...@apache.org> wrote:
>
>> > easily be circumvented by a user
>>
>> This is a revelation to me and others, if true. Michael, please then
>> point to a test or code snippet that shows the Lucene user community what
>> they want to see so they are unblocked from their explorations of vector
>> search.
>>
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>>
>>
>> On Wed, May 17, 2023 at 7:51 AM Michael Sokolov <msoko...@gmail.com>
>> wrote:
>>
>>> I think I've said before on this list we don't actually enforce the
>>> limit in any way that can't easily be circumvented by a user. The codec
>>> already supports any size vector - it doesn't impose any limit. The way the
>>> API is written you can *already today* create an index with max-int sized
>>> vectors and we are committed to supporting that going forward by our
>>> backwards compatibility policy as Robert points out. This wasn't
>>> intentional, I think, but it is the facts.
>>>
>>> Given that, I think this whole discussion is not really necessary.
>>>
>>> On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti <
>>> a.benede...@sease.io> wrote:
>>>
>>>> Hi all,
>>>> we have finalized all the options proposed by the community and we are
>>>> ready to vote for the preferred one and then proceed with the
>>>> implementation.
>>>>
>>>> *Option 1*
>>>> Keep it as it is (dimension limit hardcoded to 1024)
>>>> *Motivation*:
>>>> We are close to improving on many fronts. Given the criticality of
>>>> Lucene in computing infrastructure and the concerns raised by one of the
>>>> most active stewards of the project, I think we should keep working toward
>>>> improving the feature as is and move to up the limit after we can
>>>> demonstrate improvement unambiguously.
>>>>
>>>> *Option 2*
>>>> make the limit configurable, for example through a system property
>>>> *Motivation*:
>>>> The system administrator can enforce a limit its users need to respect
>>>> that it's in line with whatever the admin decided to be acceptable for
>>>> them.
>>>> The default can stay the current one.
>>>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>>>> and any sort of plugin development
>>>>
>>>> *Option 3*
>>>> Move the max dimension limit lower level to a HNSW specific
>>>> implementation. Once there, this limit would not bind any other potential
>>>> vector engine alternative/evolution.
>>>> *Motivation:* There seem to be contradictory performance
>>>> interpretations about the current HNSW implementation. Some consider its
>>>> performance ok, some not, and it depends on the target data set and use
>>>> case. Increasing the max dimension limit where it is currently (in top
>>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>>> other use-cases) to be based on a lower limit.
>>>>
>>>> *Option 4*
>>>> Make it configurable and move it to an appropriate place.
>>>> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
>>>> 1024) should be enough.
>>>> *Motivation*:
>>>> Both are good and not mutually exclusive and could happen in any order.
>>>> Someone suggested to perfect what the _default_ limit should be, but
>>>> I've not seen an argument _against_ configurability. Especially in this
>>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>>
>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>> implementation.
>>>> --------------------------
>>>> *Alessandro Benedetti*
>>>> Director @ Sease Ltd.
>>>> *Apache Lucene/Solr Committer*
>>>> *Apache Solr PMC Member*
>>>>
>>>> e-mail: a.benede...@sease.io
>>>>
>>>>
>>>> *Sease* - Information Retrieval Applied
>>>> Consulting | Training | Open Source
>>>>
>>>> Website: Sease.io <http://sease.io/>
>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>> <https://twitter.com/seaseltd> | Youtube
>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>> <https://github.com/seaseltd>
>>>>
>>>