To address the last topic (building up ordinal ranges per-segment), what I'm thinking is that you'd iterate all unique ordinals in the SSDV field and "memorize" the ordinal range for each dimension up-front, but on a per-segment basis. This would be very similar to what DefaultSortedSetDocValuesReaderState#createOneFlatFacetDimState is doing, but you'd do it per-segment. That per-segment state would only need to be computed once for each segment, since segments are immutable.

Then, when you're iterating all your hits within each segment, you can determine whether or not each doc value ordinal for the doc is within the dim range using your segment-level ordinal range information, avoiding the need to do the global mapping. You can short-circuit the document-level ordinal iteration once an ordinal goes beyond the range, since the ordinals for each doc are returned in sorted order, but in general you're still doing a linear operation per document. No way to avoid that. But you can also resolve the BytesRef within the context of each segment, completely avoiding the need for a global mapping.
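To make that a bit more concrete, here's a rough, Lucene-free sketch of the idea. Everything in it is a stand-in I made up for illustration: the String[] plays the role of one segment's sorted term dictionary, the int[][] plays the role of the sorted per-document ordinals you'd get from SortedSetDocValues, and the class name, method names and DELIM are hypothetical (I'm assuming labels are stored as dim + delimiter + value). Treat it as a model of the approach, not as working Lucene code.

```java
import java.util.Arrays;

class SegmentDimRanges {
    // Assumed separator between dimension and value in the stored label.
    static final char DELIM = '\u001F';

    /**
     * Computes the inclusive segment-local ordinal range {first, last} for one
     * dimension. Done once per segment, since a segment never changes.
     * Note: this model uses Java String order; Lucene orders terms by bytes.
     */
    static int[] dimRange(String[] sortedTerms, String dim) {
        String prefix = dim + DELIM;
        int start = Arrays.binarySearch(sortedTerms, prefix);
        if (start < 0) {
            start = -start - 1; // insertion point = first term >= prefix
        }
        int end = start;
        while (end < sortedTerms.length && sortedTerms[end].startsWith(prefix)) {
            end++;
        }
        return end == start ? null : new int[] {start, end - 1};
    }

    /**
     * Counts hits per ordinal inside the dimension's range, using only
     * segment-local ordinals (no global ordinal lookup in the loop).
     * counts[i] belongs to ordinal range[0] + i.
     */
    static int[] countDim(int[][] docOrds, int[] hits, int[] range) {
        int[] counts = new int[range[1] - range[0] + 1];
        for (int doc : hits) {
            for (int ord : docOrds[doc]) {
                if (ord > range[1]) {
                    break; // ords are sorted per doc, nothing further can match
                }
                if (ord >= range[0]) {
                    counts[ord - range[0]]++;
                }
            }
        }
        return counts;
    }
}
```

The label for counts[i] would be looked up in the same segment (sortedTerms[range[0] + i] in this model), so if you have more than one segment you'd merge the per-segment counts by label at the end rather than by ordinal.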
Not sure this will result in any major performance improvements for you, but it does let you avoid building a global ordinal map and doing map lookups within the tight loop.

Cheers,
-Greg

On Fri, Jul 1, 2022 at 2:35 AM Harald Braumann <braum...@m2n.at> wrote:
>
> Hi!
>
> On 01.07.22 00:46, Greg Miller wrote:
> > Have you considered taxonomy faceting for your use-case? Because the
> > taxonomy structure is maintained in a separate index, it's
> > (relatively) trivial to iterate all direct child ordinals of a given
> > dimension. The cost of mapping to a global ordinal space is done when
> > the index is merged.
>
> Thanks for the tip. I will certainly look into it.
>
> > Separately, I'd be curious about where you're running into performance
> > issues within the context of your system. Is the cost you're concerned
> > with building up the ordinal map? That's certainly expensive, but it's
> > a one-time cost (until you refresh your index).
>
> That's not the problem. The index changes rarely during operation, anyways.
>
> > Or are you concerned
> > with the actual map lookup within your tight loop?
>
> Yes. The index I tested has about 3M documents and overall about 180M
> doc value ords. Just iterating, without retrieving the actual values or
> building the result map, takes close to 3s. All the time seems to be
> spent in SortedSetDocValues.nextOrd and LongValues.get.
>
> > If the latter, you
> > could consider doing more work at the slice-level by separately
> > determining the child ords for each dim ord within the context of each
> > segment (there's no off-the-shelf code for this that I'm aware of, so
> > you'd have to roll your own).
>
> I was thinking about this as well. So let's say I have the ord ranges
> for dimensions per segment. I guess I could easily test this by making
> sure there is only one segment, so I can use the global ord ranges I
> already got. What would be the best way to jump to those ords for each
> document? If I use SortedSetDocValues, I'd still have to iterate through
> all ords per document, right? Maybe that's not really the problem. I've
> tried to time this, but both the profiler and hand-crafted timing code
> massively skew the results, so I'm not sure I trust those measurements.
>
> Cheers
> harry
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>