Re: Adding a new PointDocValuesField

Marc D'Mello Wed, 25 May 2022 11:30:34 -0700

Read your example again and yes, that makes sense. I was only thinking in
terms of single dimensions, my bad!
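
To make the combination issue concrete for anyone reading along, here is a
minimal sketch in plain Java of the wiper-blade example quoted below. It uses
no Lucene APIs, and the class and method names are hypothetical. The first
check treats year and make as two independent multi-valued fields and ANDs
them; the second only matches when the same position holds both values, which
is what indexing the combinations (or aligned indices) would preserve.

```java
/** Hypothetical illustration only; none of these names come from Lucene. */
final class FitmentExample {

    /** Separate fields: year and make are checked independently, then ANDed. */
    static boolean matchesSeparateFields(int[] years, String[] makes, int year, String make) {
        boolean yearHit = false;
        boolean makeHit = false;
        for (int y : years) {
            if (y == year) yearHit = true;
        }
        for (String m : makes) {
            if (m.equals(make)) makeHit = true;
        }
        return yearHit && makeHit;
    }

    /** Combined values: years[i] and makes[i] together form one valid (year, make) pair. */
    static boolean matchesCombined(int[] years, String[] makes, int year, String make) {
        for (int i = 0; i < years.length; i++) {
            if (years[i] == year && makes[i].equals(make)) return true;
        }
        return false;
    }
}
```

With years {2010, 2011} and makes {"Ford", "Chevy"}, the separate-field check
accepts (2010, "Chevy") even though that combination was never indexed, while
the combined check rejects it.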


On Wed, May 25, 2022 at 11:08 AM Greg Miller <gsmil...@gmail.com> wrote:

> I appreciate all the feedback, but disagree that we can accomplish what
> we’re trying to do here with the existing fields.
>
> It’s not sufficient to AND together multiple fields for this use-case
> because of the fact that the different dimensions can be multi-valued and
> not all combinations are valid. To go back to my example, imagine wiper
> blades that fit 2010 Ford vehicles and 2011 Chevy vehicles but not 2010
> Chevy or 2011 Ford. You have to index the combinations, not the separate
> component values. I can’t see a way to retain this information with
> separate fields. Am I missing something? I guess with an “unsorted” numeric
> DV type we could get there with aligned indices, as you describe, but that
> seems less appealing than supporting multi-dim points directly.
>
> I’m in agreement though that there isn’t a compelling need to add a new
> field type for this. I have no problem building on BDV and putting this in
> the sandbox module to start. Makes sense to me. It sounds like we’d have
> consensus to take that approach and re-evaluate if there are future needs?
> Any objections?
>
> Cheers,
> -g
>
>
> On Wed, May 25, 2022 at 10:05 Marc D'Mello <marcd2...@gmail.com> wrote:
>
>> But adding a new type should be the last resort.
>>
>>
>> I did not realize that was the case, that's good to know. It seems like I
>> should just use BDV (which does make the code change easier/faster so I
>> have no issues with it).
>>
>> As for Patrick's suggestion of using separate numeric fields instead of
>> packing them together, that actually does sound like an interesting idea. I
>> think the biggest issue with it, though, would be implementing a multivalued
>> version of this. As Robert pointed out, we would need an UnsortedNumericDV.
>>
>> Thanks for all the feedback!
>>
>>
>> On Wed, May 25, 2022 at 8:17 AM Robert Muir <rcm...@gmail.com> wrote:
>>
>>> On Wed, May 25, 2022 at 12:17 AM Greg Miller <gsmil...@gmail.com> wrote:
>>> >
>>> >  A "two separate field approach" would
>>> > consist of indexing year and make separately, and you'd lose the
>>> > information that only certain combinations are valid. Am I overlooking
>>> > something with your suggestion? Maybe there's something we can do with
>>> > Lucene already that solves for this case and I'm just not aware of it?
>>> > That's entirely possible and I'd love to learn more if there is!
>>>
>>> This makes no sense to me. If there are two dimensions, there's no
>>> difference in faceting code calling fieldA.value and fieldB.value,
>>> than calling field.valueA and field.valueB.
>>>
>>> In other words, it doesn't make any sense to needlessly "pack dimensions
>>> together" at docvalues level, especially for what should be a
>>> column-stride field. There's really no difference from the app
>>> perspective. Any issues you have here seem to be issues around facet
>>> module and not docvalues...
>>>
>>> >
>>> > As for MultiRangeQuery and the mention of sandbox modules, I think
>>> > that's a bit of a different use-case. MultiRangeQuery lets you filter
>>> > by a disjunction of ranges. The "multi" part doesn't relate to
>>> > "multiple values in a doc" (but it does support that, as do the
>>> > "standard" range queries).
>>> >
>>> > Where I see a gap right now, beyond just faceting, is that we can
>>> > represent N-dim points in the points index and filter on them (using
>>> > the points index), but we have no doc values equivalent. This means,
>>> > 1) we can't facet, and 2) we can't create a "slow" query that does
>>> > post-filtering instead of using the points index (which could be a
>>> > very real advantage in cases with a sparse match set but a dense
>>> > points index). So I like the idea of creating that concept and being
>>> > able to facet and filter on it. Whether-or-not this is a "formal" doc
>>> > values type or sits on top of BDV, I have less of a strong opinion.
>>>
>>> We shouldn't add new docvalues types because of "slow queries", I'm
>>> really against that. The root problem is that points impl can't filter
>>> well (like the inverted index can), and as a hack, docvalues "picks up
>>> the slack". If it's becoming a major issue, address this with points
>>> directly?
>>>
>>> >
>>> > And finally... it really should be multi-valued. The points index
>>> > supports multiple points-per-field within a single document. Seems
>>> > like a big gap that we wouldn't support that with a doc value field.
>>> > Because BDV is inherently single-valued, I propose we come up with an
>>> > encoding scheme that encodes multiple points on top of that "single"
>>> > BDV entry. This is where building on BDV started to feel a little icky
>>> > to me and it seemed like it might be a good use-case for actually
>>> > formalizing a format/encoding, but again, no strong preference. We
>>> > could certainly do something more quickly on top of BDV and formalize
>>> > an encoding later if/as necessary.
>>>
>>> Doesn't matter that points index supports it. Do the use-cases make
>>> sense? It's especially stupid that e.g. LatLonDocValuesField supports
>>> multi-values. Really? What kind of quantum documents are in multiple
>>> locations at the same time?
>>>
>>> The sortedset/sortednumeric exist to support use-cases on String and
>>> int, where user wants to "sort on a multivalued field", which is
>>> really crazy if you think about it. So they both sort the numbers at
>>> index-time, so that you can pick a "representative" value
>>> (min/max/median) in constant time. I think a lot of this existing
>>> stuff is just brain-damage from the no-sql fads, alternatively we
>>> could remove this multivalued nonsense and the crazy servers that want
>>> to follow no-sql fads could index just the "representative value"
>>> (min/max/median) in a single-valued field.
>>>
>>> Sorry, I'm just not seeing a lot of strong use-cases here to justify
>>> creating a new DV field, which we should really avoid, as it's a hugely
>>> expensive cost. I would recommend prototyping stuff with
>>> BinaryDocValues, using the sandbox, etc. See if the features get
>>> popular and people use them.
>>>
>>> If they really "catch on", and we think it's more efficient, then we
>>> can think about how the stuff could be best encoded/compressed/etc.
>>> But adding a new type should be the last resort. Adding some
>>> specialized multi-dimensional type is IMO out of the question. It
>>> would be a lot less horrible to just use separate DV fields, one for
>>> each dimension. If there are *strong* compelling use-cases for
>>> multi-valued stuff, then in the worst case we could think about
>>> something like an UnsortedNumericDV, which would allow fieldA[0] to
>>> align with fieldB[0] and fieldA[1] to align with fieldB[1], which
>>> would solve the issue for faceting. Just don't allow sorting. And
>>> probably not any "slow" query stuff too.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>
>>>
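
On the BDV route the thread converges on: one possible way to hold several
N-dim points in a single binary value is a count prefix followed by
fixed-width longs. The sketch below is plain Java under that assumption. The
layout and the names PackedPoints, pack, and unpack are hypothetical, not an
encoding anyone proposed in this thread. In Lucene, the resulting byte[] could
be wrapped in a BytesRef and stored in a BinaryDocValuesField.

```java
import java.nio.ByteBuffer;

/** Hypothetical encoding sketch: [int count][count * dims longs], big-endian. */
final class PackedPoints {

    /** Packs all points of one document into a single byte[] value. */
    static byte[] pack(long[][] points, int dims) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + points.length * dims * Long.BYTES);
        buf.putInt(points.length);
        for (long[] point : points) {
            if (point.length != dims) {
                throw new IllegalArgumentException("expected " + dims + " dimensions");
            }
            for (long value : point) {
                buf.putLong(value);
            }
        }
        return buf.array();
    }

    /** Reverses pack(); the caller must already know the number of dimensions. */
    static long[][] unpack(byte[] bytes, int dims) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int count = buf.getInt();
        long[][] points = new long[count][dims];
        for (int i = 0; i < count; i++) {
            for (int d = 0; d < dims; d++) {
                points[i][d] = buf.getLong();
            }
        }
        return points;
    }
}
```

Because the points are written in insertion order and never sorted per
dimension, each decoded row is still one valid combination. A production
encoding would likely want something more compact than fixed-width longs.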
