Re: Getting all values for a specific dimension for SortedSetDocValues per document

Greg Miller Fri, 01 Jul 2022 09:58:48 -0700

To address the last topic (building up ordinal ranges per-segment),
what I'm thinking is that you'd iterate all unique ordinals in the
SSDV field and "memorize" the ordinal range for each dimension
up-front, but on a per-segment basis. This would be very similar to
what DefaultSortedSetDocValuesReaderState#createOneFlatFacetDimState
is doing, but you'd do it per-segment. That per-segment state
would only need to be computed once, since segments are
immutable. Then, when you're iterating all your hits within each
segment, you can determine whether or not each stored doc value for
the doc is within the dim range using your segment-level ordinal range
information, avoiding the need to do the global mapping. You can
short-circuit the document-level ordinal iteration once an ordinal
goes beyond the range since the ordinals are stored in sorted order,
but in general, you're still doing a linear operation per document. No
way to avoid that. But you can also resolve the BytesRef within the
context of each segment, completely avoiding the need for a global
mapping.
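
To make that a bit more concrete, here is a rough, Lucene-free sketch of
the idea. It models one segment's term dictionary as a sorted String
array (array index == segment ordinal) and one document's ordinals as an
ascending long array. In real code those would come from the segment's
SortedSetDocValues (lookupOrd / nextOrd). The class and method names are
made up for illustration, and the U+001F separator between dimension and
label is an assumption about how the facet terms are encoded.

```java
import java.util.ArrayList;
import java.util.List;

final class SegmentDimRanges {
    // Assumed separator between dimension and label inside each stored term.
    static final char SEP = '\u001f';

    /**
     * One pass over a segment's sorted terms (index == segment ordinal).
     * Returns {firstOrd, lastOrd} for the dimension, or {-1, -2} if absent.
     * Segments are immutable, so this can be computed once and cached.
     */
    static long[] dimRange(String[] segmentTerms, String dim) {
        String prefix = dim + SEP;
        long first = -1, last = -2;
        for (int ord = 0; ord < segmentTerms.length; ord++) {
            if (segmentTerms[ord].startsWith(prefix)) {
                if (first < 0) first = ord;
                last = ord;
            } else if (first >= 0) {
                break; // terms are sorted, so the dimension's block has ended
            }
        }
        return new long[] {first, last};
    }

    /**
     * Collects the labels of one document that fall inside the range.
     * docOrds must be the document's segment ordinals in ascending order.
     */
    static List<String> valuesForDoc(long[] docOrds, long[] range, String[] segmentTerms) {
        List<String> values = new ArrayList<>();
        for (long ord : docOrds) {
            if (ord > range[1]) break;    // past the range: short-circuit
            if (ord < range[0]) continue; // before the range: keep scanning
            String term = segmentTerms[(int) ord]; // resolved within the segment
            values.add(term.substring(term.indexOf(SEP) + 1));
        }
        return values;
    }

    public static void main(String[] args) {
        String[] terms = {
            "author\u001fbob", "author\u001flisa",
            "color\u001fblue", "color\u001fred",
            "size\u001fL"
        };
        long[] colorRange = dimRange(terms, "color");
        // Doc has ords 0, 2, 3, 4; only 2 and 3 are in the "color" range.
        System.out.println(valuesForDoc(new long[] {0, 2, 3, 4}, colorRange, terms)); // [blue, red]
    }
}
```

The per-document loop is still linear in the number of ordinals up to
the end of the range, but it involves no global ordinal lookup.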


Not sure this will result in any major performance improvements for
you, but it does let you avoid building a global ordinal map and doing
map lookups within the tight loop.

Cheers,
-Greg

On Fri, Jul 1, 2022 at 2:35 AM Harald Braumann <[email protected]> wrote:
>
> Hi!
>
> On 01.07.22 00:46, Greg Miller wrote:
> > Have you considered taxonomy faceting for your use-case? Because the
> > taxonomy structure is maintained in a separate index, it's
> > (relatively) trivial to iterate all direct child ordinals of a given
> > dimension. The cost of mapping to a global ordinal space is done when
> > the index is merged.
>
> Thanks for the tip. I will certainly look into it.
>
> > Separately, I'd be curious about where you're running into performance
> > issues within the context of your system. Is the cost you're concerned
> > with building up the ordinal map? That's certainly expensive, but it's
> > a one-time cost (until you refresh your index).
>
> That's not the problem. The index changes rarely during operation, anyways.
>
> > Or are you concerned
> > with the actual map lookup within your tight loop?
>
> Yes. The index I tested has about 3M documents and overall about 180M
> doc value ords. Just iterating, without retrieving the actual values or
> building the result map, takes close to 3s. All the time seems to be
> spent in SortedSetDocValues.nextOrd and LongValues.get.
>
> > If the latter, you
> > could consider doing more work at the slice-level by separately
> > determining the child ords for each dim ord within the context of each
> > segment (there's no off-the-shelf code for this that I'm aware of, so
> > you'd have to roll your own).
>
> I was thinking about this as well. So let's say I have the ord ranges
> for dimensions per segment. I guess I could easily test this by making
> sure there is only one segment, so I can use the global ord ranges I
> already got. What would be the best way to jump to those ords for each
> document? If I use SortedSetDocValues, I'd still have to iterate through
> all ords per document, right? Maybe that's not really the problem. I've
> tried to time this, but both the profiler and hand-crafted timing code
> massively skew the results, so I'm not sure I trust those measurements.
>
> Cheers
> harry
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>

