I replied with a tentative comment on SOLR-17974 - let's continue discussion there!
Jason

On Wed, Oct 22, 2025 at 2:56 PM Chris Hostetter <[email protected]> wrote:
>
> TL;DR: How much do we care about our client APIs supporting multi-valued
> fields where each value is a "Vector" (aka: "java.util.List" of numbers),
> when the alternative (for now) is to use a String representation of each
> value?
>
> ---
>
> Once upon a time, I was working on some custom plugin work where I had a
> field type using "Tuples" (of strings) as values -- but the field type
> also needed to be multi-valued (ie: a variable number of tuples in a
> single document)
>
> Writing my field type (as a composite around multiple abstracted away
> indexed & docValue fields) was pretty easy, but I ran into weirdness
> writing tests with indexing (with SolrInputDocument), and getting my
> "stored" values back (via SolrDocument) because I was trying to represent
> my "Tuples" in code as java.util.List instances (to play nice with Solr's
> existing serialization logic, response writers, etc).
>
> So *each* field value was a List<String>
>
> The problem came in my tests of multi-valued fields -- due to long
> standing "convenience" logic in SolrInputDocument and SolrDocument, those
> classes assume that if you set/add a java.util.Collection as a field
> "value", the *real* field values are the contents of that Collection -- it
> either re-uses it as is, or will add *each* of those elements to its
> existing java.util.Collection for that field.
>
> (Since this email was getting really long & convoluted the first 3 times I
> tried to draft it, I created SOLR-17974 to go into all the details and
> attached a test case showing how weird it all is.)
>
>
> Even though it was/is possible -- since I knew how the internals work --
> to carefully set a field value in SolrInputDocument to be a
> List<List<String>>, I instead gave up and represented each "Tuple" as a
> String using a special delimiter -- making my "multi-valued" fields a
> List<String>.
>
> (Which is/was similar to how Solr's spatial field types work, and plays
> nice with all ContentStream loaders and response writers)
>
>
> I was reminded of all this about a year ago when I realized:
>
> 1) Solr's DenseVectorField can only be multiValued=false
>
> 2) For external representation purposes, DenseVectorField *acts* like a
> multi-valued numeric field (either List<Float> or List<Byte>)
>
> 3) There was/is work being considered in Lucene to add "multiple vectors
> per document" to the underlying HNSW graph logic (lucene/issues/12313)
>
>
> All of which raised my eyebrow: "I wonder how Solr's going to deal with
> the List<List<Float>> problem if/when that happens?" ... but I had other
> things on my mind at the time.
>
>
> Skip ahead to today...
>
> Lucene still doesn't support multiple vectors per document in the HNSW
> graphs, but 10.3 *did* introduce a new LateInteractionField which is a
> different type of "vector" field where each document has a "float[][]" (a
> variable number of fixed sized "vectors", each represented as a fixed size
> float[]).
>
> So the question becomes: How do we represent these document values in Solr
> if we want to add support for this new field type? (SOLR-17975)
>
>
> The most expedient approach would be to follow in the footsteps of the
> spatial fields (and my old "Tuple" type) and use a String encoding --
> either mapping a String<->float[] and being multiValued=true in schema, or
> mapping String<->float[][] and requiring multiValued=false (the "query"
> side of this field type will already need some way for users to express a
> "float[][]" in a query string)
>
>
> The alternative is to pay off our very old tech debt: SOLR-17974.
>
> *IF* we redesign SolrInputDocument and SolrDocument to have more explicit
> APIs allowing for the possibility that a single "value" in a multi-valued
> field might be represented as a complex (possibly nested)
> java.util.Collection of primitives (recognizing that along the way, we
> will probably find lots of other places in the code base that make
> assumptions about multi-valued fields, just because it's always been that
> way.) ... *THEN* ... we could model each vector value as a "List<Float>"
> and still have "multi-valued" vector fields.
>
>
> But how much do people actually care about this?
>
> Are there other use cases where this would be useful?
>
> Is having a "cleaner" API worth the headaches of changing this now?
>
> ?
>
>
> -Hoss
> http://www.lucidworks.com/
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
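
For anyone skimming the thread, here is a rough, self-contained sketch of the two options discussed above. It is NOT Solr code: MultiVectorSketch, addFlattening, encode, decode and the "," delimiter are all hypothetical names made up for illustration. addFlattening is only a toy stand-in for the "add" convenience behavior described in the quoted email (a Collection passed as a value is unpacked into its elements); the real SolrInputDocument/SolrDocument logic is documented in SOLR-17974.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.StringJoiner;

/** Toy illustration only; none of this is actual Solr or SolrJ API. */
final class MultiVectorSketch {

  /** Hypothetical delimiter between the numbers of one encoded vector. */
  static final String DELIM = ",";

  /**
   * Toy model of the "add" convenience logic: if the value is a
   * Collection, each of its elements is added as a separate field value.
   */
  static void addFlattening(List<Object> fieldValues, Object value) {
    if (value instanceof Collection<?>) {
      fieldValues.addAll((Collection<?>) value);
    } else {
      fieldValues.add(value);
    }
  }

  /** Encodes one vector as a delimited String (assumed format). */
  static String encode(float[] vector) {
    StringJoiner joiner = new StringJoiner(DELIM);
    for (float f : vector) {
      joiner.add(Float.toString(f));
    }
    return joiner.toString();
  }

  /** Decodes a String produced by encode() back into a float[]. */
  static float[] decode(String encoded) {
    if (encoded.isEmpty()) {
      return new float[0];
    }
    String[] parts = encoded.split(DELIM);
    float[] vector = new float[parts.length];
    for (int i = 0; i < parts.length; i++) {
      vector[i] = Float.parseFloat(parts[i].trim());
    }
    return vector;
  }

  public static void main(String[] args) {
    // Two vectors added as Lists: the vector boundaries are lost.
    List<Object> asLists = new ArrayList<>();
    addFlattening(asLists, Arrays.asList(1.0f, 2.0f));
    addFlattening(asLists, Arrays.asList(3.0f, 4.0f));
    System.out.println(asLists.size()); // prints 4

    // Two vectors added as encoded Strings: one value per vector.
    List<Object> asStrings = new ArrayList<>();
    addFlattening(asStrings, encode(new float[] {1.0f, 2.0f}));
    addFlattening(asStrings, encode(new float[] {3.0f, 4.0f}));
    System.out.println(asStrings.size()); // prints 2
  }
}
```

The String route keeps one field value per vector with no changes to the document classes, at the cost of a parse/format step on every read and write; the List<Float> route would avoid that step but depends on the API redesign discussed in SOLR-17974.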
