Read your example again and yes, that makes sense. I was only thinking in terms of single dimensions, my bad!
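For anyone skimming the thread later, here is a minimal sketch of why the wiper-blade example breaks with separate fields. It is plain Python, not Lucene code; the data and the function names (`matches_separate`, `matches_combined`, `matches_aligned`) are made up for illustration. It assumes the separate multi-valued fields behave like sorted, deduplicated value sets, which is roughly how I understand the sorted doc values types to work.

```python
# Hypothetical fitment data for one document (one wiper blade product):
# it fits 2010 Ford and 2011 Chevy, but not 2010 Chevy or 2011 Ford.
fitments = [(2010, "Ford"), (2011, "Chevy")]

# Approach 1: two separate multi-valued fields. Sorting each field
# independently (as a sorted multi-valued field would) loses the pairing.
years_sorted = sorted({year for year, _ in fitments})   # [2010, 2011]
makes_sorted = sorted({make for _, make in fitments})   # ["Chevy", "Ford"]

def matches_separate(year, make):
    """AND of two independent fields: accepts invalid combinations."""
    return year in years_sorted and make in makes_sorted

# Approach 2: index the combinations themselves (multi-dim points).
combined = set(fitments)

def matches_combined(year, make):
    """Only accepts combinations that were actually indexed."""
    return (year, make) in combined

# Approach 3: separate fields whose values keep insertion order, so
# index i in one field lines up with index i in the other.
years_aligned = [year for year, _ in fitments]
makes_aligned = [make for _, make in fitments]

def matches_aligned(year, make):
    """Walks both fields in lockstep and compares aligned entries."""
    return any(y == year and m == make
               for y, m in zip(years_aligned, makes_aligned))
```

With this data, `matches_separate(2010, "Chevy")` is true even though that combination was never indexed, while the combined and aligned variants reject it. Note also that after sorting, position 0 holds 2010 in one field and "Chevy" in the other, so the sorted fields cannot simply be zipped back together.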
On Wed, May 25, 2022 at 11:08 AM Greg Miller <gsmil...@gmail.com> wrote:
> I appreciate all the feedback, but disagree that we can accomplish what
> we’re trying to do here with the existing fields.
>
> It’s not sufficient to AND together multiple fields for this use-case
> because of the fact that the different dimensions can be multi-valued and
> not all combinations are valid. To go back to my example, imagine wiper
> blades that fit 2010 Ford vehicles and 2011 Chevy vehicles but not 2010
> Chevy or 2011 Ford. You have to index the combinations, not the separate
> component values. I can’t see a way to retain this information with
> separate fields. Am I missing something? I guess with an “unsorted” numeric
> DV type we could get there with aligned indices, as you describe, but that
> seems less appealing than supporting multi-dim points directly.
>
> I’m in agreement though that there isn’t a compelling need to add a new
> field type for this. I have no problem building on BDV and putting this in
> the sandbox module to start. Makes sense to me. It sounds like we’d have
> consensus to take that approach and re-evaluate if there are future needs?
> Any objections?
>
> Cheers,
> -g
>
>
> On Wed, May 25, 2022 at 10:05 Marc D'Mello <marcd2...@gmail.com> wrote:
>
>> But adding a new type should be the last resort.
>>
>>
>> I did not realize that was the case, that's good to know. It seems like I
>> should just use BDV (which does make the code change easier/faster so I
>> have no issues with it).
>>
>> As for Patrick's suggestion of using separate numeric fields instead of
>> packing them together, that actually does sound like an interesting idea, I
>> think the biggest issue with it though would be implementing a multivalued
>> version of this. As Robert pointed out, we would need an UnsortedNumericDV.
>>
>> Thanks for all the feedback!
>>
>>
>> On Wed, May 25, 2022 at 8:17 AM Robert Muir <rcm...@gmail.com> wrote:
>>
>>> On Wed, May 25, 2022 at 12:17 AM Greg Miller <gsmil...@gmail.com> wrote:
>>> >
>>> > A "two separate field approach" would
>>> > consist of indexing year and make separately, and you'd lose the
>>> > information that only certain combinations are valid. Am I overlooking
>>> > something with your suggestion? Maybe there's something we can do with
>>> > Lucene already that solves for this case and I'm just not aware of it?
>>> > That's entirely possible and I'd love to learn more if there is!
>>>
>>> This makes no sense to me. If there are two dimensions, there's no
>>> difference in faceting code calling fieldA.value and fieldB.value,
>>> than calling field.valueA and field.valueB.
>>>
>>> In other words, doesn't make any sense to needlessly "pack dimensions
>>> together" at docvalues level, especially for what should be a
>>> column-stride field. There's really no difference from the app
>>> perspective. Any issues you have here seem to be issues around facet
>>> module and not docvalues...
>>>
>>> >
>>> > As for MultiRangeQuery and the mention of sandbox modules, I think
>>> > that's a bit of a different use-case. MultiRangeQuery lets you filter
>>> > by a disjunction of ranges. The "multi" part doesn't relate to
>>> > "multiple values in a doc" (but it does support that, as do the
>>> > "standard" range queries).
>>> >
>>> > Where I see a gap right now, beyond just faceting, is that we can
>>> > represent N-dim points in the points index and filter on them (using
>>> > the points index), but we have no doc values equivalent. This means,
>>> > 1) we can't facet, and 2) we can't create a "slow" query that does
>>> > post-filtering instead of using the points index (which could be a
>>> > very real advantage in cases with a sparse match set but a dense
>>> > points index). So I like the idea of creating that concept and being
>>> > able to facet and filter on it.
>>> > Whether-or-not this is a "formal" doc
>>> > values type or sits on top of BDV, I have less of a strong opinion.
>>>
>>> We shouldn't add new docvalues types because of "slow queries", I'm
>>> really against that. The root problem is that points impl can't filter
>>> well (like the inverted index can), and as a hack, docvalues "picks up
>>> the slack". If its becoming a major issue, address this with points
>>> directly?
>>>
>>> >
>>> > And finally... it really should be multi-valued. The points index
>>> > supports multiple points-per-field within a single document. Seems
>>> > like a big gap that we wouldn't support that with a doc value field.
>>> > Because BDV is inherently single-valued, I propose we come up with an
>>> > encoding scheme that encodes multiple points on top of that "single"
>>> > BDV entry. This is where building on BDV started to feel a little icky
>>> > to me and it seemed like it might be a good use-case for actually
>>> > formalizing a format/encoding, but again, no strong preference. We
>>> > could certainly do something more quickly on top of BDV and formalize
>>> > an encoding later if/as necessary.
>>>
>>> Doesn't matter that points index supports it. Do the use-cases make
>>> sense? It's especially stupid that e.g. LatLonDocValueField supports
>>> multi-values. Really? What kind of quantum documents are in multiple
>>> locations at the same time?
>>>
>>> The sortedset/sortednumeric exist to support use-cases on String and
>>> int, where user wants to "sort on a multivalued field", which is
>>> really crazy if you think about it. So they both sort the numbers at
>>> index-time, so that you can pick a "representative" value
>>> (min/max/median) in constant time.
>>> I think a lot of this existing
>>> stuff is just brain-damage from the no-sql fads, alternatively we
>>> could remove this multivalued nonsense and the crazy servers that want
>>> to follow no-sql fads could index just the "representative value"
>>> (min/max/median) in a single-valued field.
>>>
>>> Sorry, I'm just not seeing a lot of strong use-cases here to justify
>>> creating a new DV field, which we should really avoid, as its a hugely
>>> expensive cost. I would recommend prototyping stuff with
>>> BinaryDocValues, using the sandbox, etc. See if the features get
>>> popular and people use them.
>>>
>>> If they really "catch on", and we think its more efficient, then we
>>> can think about how the stuff could be best encoded/compressed/etc.
>>> But adding a new type should be the last resort. Adding some
>>> specialized multi-dimensional type is IMO out of the question. It
>>> would be a lot less horrible to just use separate DV fields, one for
>>> each dimension. If there is *strong* compelling use-cases for
>>> multi-valued stuff, then in the worst case we could think about
>>> something like a UnsortedNumericDV, which would allow fieldA[0] to
>>> align with fieldB[0] and fieldA[1] to align with fieldB[1], which
>>> would solve the issue for faceting. Just don't allow sorting. And
>>> probably not any "slow" query stuff too.
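
As a rough illustration of the "encode multiple points on top of a single BDV entry" idea quoted above, here is one possible byte layout, sketched in plain Python rather than against the Lucene API. Everything in it is an assumption made for illustration: the layout (a 4-byte point count followed by fixed-width 64-bit values), the function names `encode_points` and `decode_points`, and the use of a numeric id to stand in for the make. It is not a proposed or existing Lucene format.

```python
import struct

def encode_points(points, dims):
    """Pack a list of `dims`-dimensional integer points into one bytes value.

    Assumed layout: big-endian unsigned 32-bit point count, followed by
    count * dims big-endian signed 64-bit values, point by point.
    """
    for point in points:
        if len(point) != dims:
            raise ValueError(f"expected {dims} dimensions, got {len(point)}")
    flat = [value for point in points for value in point]
    return struct.pack(f">I{len(flat)}q", len(points), *flat)

def decode_points(blob, dims):
    """Inverse of encode_points: recover the list of points as tuples."""
    (count,) = struct.unpack_from(">I", blob, 0)
    flat = struct.unpack_from(f">{count * dims}q", blob, 4)
    return [tuple(flat[i * dims:(i + 1) * dims]) for i in range(count)]

# Hypothetical usage: (year, make id) pairs for a single document.
doc_points = [(2010, 1), (2011, 2)]
blob = encode_points(doc_points, dims=2)
decoded = decode_points(blob, dims=2)
```

Because each point is stored as a unit, the per-document pairing of dimensions survives decoding, which is the property the separate sorted fields cannot provide. A real encoding would likely care more about compression and variable-width values than this sketch does.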