[ https://issues.apache.org/jira/browse/LUCENE-10080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408429#comment-17408429 ]
Robert Muir commented on LUCENE-10080: -------------------------------------- I kinda imagined something like this: instead of: {code} // either add or increment a count in the hppc int/int map hppcmap.insertOrIncrement(ordinal); {code} we do this: {code} if (longtail.getAndSet(ordinal)) { // bit was already set, so bump the hppc map counter (or add a new entry) hppcmap.insertOrIncrement(ordinal); } else { // otherwise bit was not set before, we've recorded our count of 1. } {code} The idea is that today, the long tail of ordinals with just count=1 are wasting a lot of space (32-bits at least plus whatever overhead that specialized map has). Instead we could use 1 bit-per-ordinal for these "exceptional cases" with something like FixedBitSet (maybe even SparseFixedBitSet later). It is true the heap cost would be O(maxOrdinal). But I think in case of a *large* number of ordinals with a zipfian distribution, it could save space overall, esp. If the query is a worst-case one (such as MatchAllDocsQuery). This approach doesn't make sense to wrap the int[], only the specialized map, for the int[] it would just add more memory space and cpu cycles. But if the number of unique ordinals is *large* it makes sense to try to not be so wasteful and allocate tons of int[]'s that are mostly zeros. > Use a bit set to count long-tail of singleton FacetLabels? > ---------------------------------------------------------- > > Key: LUCENE-10080 > URL: https://issues.apache.org/jira/browse/LUCENE-10080 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet > Reporter: Michael McCandless > Priority: Major > > I was talking about this with [~rcmuir ] about LUCENE-9969, and he had a neat > idea for more efficient facet counting. > Today we accumulate counts directly in an HPPC native int/int map, or a > non-sparse {{int[]}} (if enough hits match the query). > But it is likely that many of these facet counts are singletons (occur only > once in each query). To be more space efficient, we could wrap a bit set > around the map or {{int[]}}. The first time we see an ordinal, we set its > bit. The second and subsequent times, we increment the count as we do today. > If we use a non-sparse bitset (e.g. {{FixedBitSet}}) that will add some > non-sparse heap cost O(maxDoc) for each segment, but if there are enough > ordinals to count, that can be a win over just the HPPC native int map for > some cases? > Maybe this could be an intermediate implementation, since we already cover > the "very low hit count" (use HPPC int/int map) and "very high hit count" > (using {{int[]}}) today? > Also, this bit set would be able to quickly iterate over the sorted ordinals, > which might be helpful if we move the three big {{int[]}} into numeric doc > values? -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org