Re: Adding a new PointDocValuesField

Greg Miller Wed, 25 May 2022 11:08:01 -0700

I appreciate all the feedback, but disagree that we can accomplish what
we’re trying to do here with the existing fields.


It’s not sufficient to AND together multiple fields for this use-case
because the different dimensions can be multi-valued and not all
combinations are valid. To go back to my example, imagine wiper blades
that fit 2010 Ford vehicles and 2011 Chevy vehicles but not 2010 Chevy or
2011 Ford. You have to index the combinations, not the separate component
values. I can’t see a way to retain this information with separate fields.
Am I missing something? I guess with an “unsorted” numeric DV type we could
get there with aligned indices, as you describe, but that seems less
appealing than supporting multi-dim points directly.
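
To make the wiper-blade example concrete, here is a minimal sketch in plain
Java. It uses no Lucene classes; the class and method names are hypothetical
and only model the matching logic. ANDing two independent multi-valued
fields accepts the invalid 2010/Chevy combination, while a field holding
(year, make) pairs rejects it.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only: models the matching logic, not Lucene's API.
class ComboMatchSketch {

  // Two separate multi-valued fields: the doc matches when it has the year
  // AND has the make, regardless of whether they were indexed together.
  static boolean matchesSeparateFields(
      Set<Integer> years, Set<String> makes, int year, String make) {
    return years.contains(year) && makes.contains(make);
  }

  // One multi-valued field of (year, make) points: the pairing is preserved.
  static boolean matchesPoints(
      Set<Map.Entry<Integer, String>> points, int year, String make) {
    return points.contains(Map.entry(year, make));
  }

  public static void main(String[] args) {
    // Wiper blades that fit 2010 Ford and 2011 Chevy only.
    Set<Integer> years = Set.of(2010, 2011);
    Set<String> makes = Set.of("Ford", "Chevy");
    Set<Map.Entry<Integer, String>> points =
        Set.of(Map.entry(2010, "Ford"), Map.entry(2011, "Chevy"));

    // Separate fields report a false positive for 2010 Chevy.
    System.out.println(matchesSeparateFields(years, makes, 2010, "Chevy"));
    // The combined points do not.
    System.out.println(matchesPoints(points, 2010, "Chevy"));
  }
}
```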

I’m in agreement though that there isn’t a compelling need to add a new
field type for this. I have no problem building on BDV and putting this in
the sandbox module to start. Makes sense to me. It sounds like we’d have
consensus to take that approach and re-evaluate if there are future needs?
Any objections?
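
For the BDV route, one possible encoding (purely a sketch, nothing here is
settled) is to concatenate fixed-width points into a single byte[] per
document. The class name, the two-int layout, and the use of a numeric id
for "make" are all assumptions for illustration. In Lucene the resulting
bytes would presumably be wrapped in a BytesRef and indexed with a
BinaryDocValuesField.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: packs several 2-dimensional int points into one
// binary value, as a single-valued binary doc value would require.
class PackedPointsSketch {
  static final int DIMS = 2;
  static final int POINT_BYTES = DIMS * Integer.BYTES;

  // Concatenate all points, each dimension as a big-endian 4-byte int.
  static byte[] encode(int[][] points) {
    ByteBuffer buf = ByteBuffer.allocate(points.length * POINT_BYTES);
    for (int[] point : points) {
      if (point.length != DIMS) {
        throw new IllegalArgumentException("expected " + DIMS + " dimensions");
      }
      for (int value : point) {
        buf.putInt(value);
      }
    }
    return buf.array();
  }

  // Recover the points; the count is implied by the byte length.
  static int[][] decode(byte[] bytes) {
    if (bytes.length % POINT_BYTES != 0) {
      throw new IllegalArgumentException("length is not a multiple of point size");
    }
    ByteBuffer buf = ByteBuffer.wrap(bytes);
    int[][] points = new int[bytes.length / POINT_BYTES][DIMS];
    for (int[] point : points) {
      for (int d = 0; d < DIMS; d++) {
        point[d] = buf.getInt();
      }
    }
    return points;
  }
}
```

A faceting or post-filtering consumer would decode the value per document
and test each point, so the year/make pairing survives even though the doc
value itself is single-valued.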

Cheers,
-g


On Wed, May 25, 2022 at 10:05 Marc D'Mello <marcd2...@gmail.com> wrote:

> But adding a new type should be the last resort.
>
>
> I did not realize that was the case, that's good to know. It seems like I
> should just use BDV (which does make the code change easier/faster so I
> have no issues with it).
>
> As for Patrick's suggestion of using separate numeric fields instead of
> packing them together, that actually does sound like an interesting idea. I
> think the biggest issue with it though would be implementing a multivalued
> version of this. As Robert pointed out, we would need an UnsortedNumericDV.
>
> Thanks for all the feedback!
>
>
> On Wed, May 25, 2022 at 8:17 AM Robert Muir <rcm...@gmail.com> wrote:
>
>> On Wed, May 25, 2022 at 12:17 AM Greg Miller <gsmil...@gmail.com> wrote:
>> >
>> >  A "two separate field approach" would
>> > consist of indexing year and make separately, and you'd lose the
>> > information that only certain combinations are valid. Am I overlooking
>> > something with your suggestion? Maybe there's something we can do with
>> > Lucene already that solves for this case and I'm just not aware of it?
>> > That's entirely possible and I'd love to learn more if there is!
>>
>> This makes no sense to me. If there are two dimensions, there's no
>> difference in faceting code calling fieldA.value and fieldB.value,
>> than calling field.valueA and field.valueB.
>>
>> In other words, doesn't make any sense to needlessly "pack dimensions
>> together" at docvalues level, especially for what should be a
>> column-stride field. There's really no difference from the app
>> perspective. Any issues you have here seem to be issues around facet
>> module and not docvalues...
>>
>> >
>> > As for MultiRangeQuery and the mention of sandbox modules, I think
>> > that's a bit of a different use-case. MultiRangeQuery lets you filter
>> > by a disjunction of ranges. The "multi" part doesn't relate to
>> > "multiple values in a doc" (but it does support that, as do the
>> > "standard" range queries).
>> >
>> > Where I see a gap right now, beyond just faceting, is that we can
>> > represent N-dim points in the points index and filter on them (using
>> > the points index), but we have no doc values equivalent. This means,
>> > 1) we can't facet, and 2) we can't create a "slow" query that does
>> > post-filtering instead of using the points index (which could be a
>> > very real advantage in cases with a sparse match set but a dense
>> > points index). So I like the idea of creating that concept and being
>> > able to facet and filter on it. Whether-or-not this is a "formal" doc
>> > values type or sits on top of BDV, I have less of a strong opinion.
>>
>> We shouldn't add new docvalues types because of "slow queries", I'm
>> really against that. The root problem is that points impl can't filter
>> well (like the inverted index can), and as a hack, docvalues "picks up
>> the slack". If it's becoming a major issue, address this with points
>> directly?
>>
>> >
>> > And finally... it really should be multi-valued. The points index
>> > supports multiple points-per-field within a single document. Seems
>> > like a big gap that we wouldn't support that with a doc value field.
>> > Because BDV is inherently single-valued, I propose we come up with an
>> > encoding scheme that encodes multiple points on top of that "single"
>> > BDV entry. This is where building on BDV started to feel a little icky
>> > to me and it seemed like it might be a good use-case for actually
>> > formalizing a format/encoding, but again, no strong preference. We
>> > could certainly do something more quickly on top of BDV and formalize
>> > an encoding later if/as necessary.
>>
>> Doesn't matter that points index supports it. Do the use-cases make
>> sense? It's especially stupid that e.g. LatLonDocValueField supports
>> multi-values. Really? What kind of quantum documents are in multiple
>> locations at the same time?
>>
>> The sortedset/sortednumeric exist to support use-cases on String and
>> int, where user wants to "sort on a multivalued field", which is
>> really crazy if you think about it. So they both sort the numbers at
>> index-time, so that you can pick a "representative" value
>> (min/max/median) in constant time. I think a lot of this existing
>> stuff is just brain-damage from the no-sql fads, alternatively we
>> could remove this multivalued nonsense and the crazy servers that want
>> to follow no-sql fads could index just the "representative value"
>> (min/max/median) in a single-valued field.
>>
>> Sorry, I'm just not seeing a lot of strong use-cases here to justify
>> creating a new DV field, which we should really avoid, as it's a hugely
>> expensive cost. I would recommend prototyping stuff with
>> BinaryDocValues, using the sandbox, etc. See if the features get
>> popular and people use them.
>>
>> If they really "catch on", and we think it's more efficient, then we
>> can think about how the stuff could be best encoded/compressed/etc.
>> But adding a new type should be the last resort. Adding some
>> specialized multi-dimensional type is IMO out of the question. It
>> would be a lot less horrible to just use separate DV fields, one for
>> each dimension. If there are *strong* compelling use-cases for
>> multi-valued stuff, then in the worst case we could think about
>> something like a UnsortedNumericDV, which would allow fieldA[0] to
>> align with fieldB[0] and fieldA[1] to align with fieldB[1], which
>> would solve the issue for faceting. Just don't allow sorting. And
>> probably not any "slow" query stuff too.
>>
