Hey,
CountDistinctMetric and PercentileMetric in the stream metrics module have
no-op implementations. They only work with FacetStream/StatsStream via the JSON
Facet API, which means they can't be used with the rollup stream decorator. As
a result, we cannot compute percentiles and countDistinct alongside other
metrics on the output of a streaming expression.
I'd like to add support for these metrics for rollup. The approach I'm
considering:
CountDistinctMetric:
* Maintain a HashSet of seen values per group ("over").
* Exact counting with memory proportional to cardinality.
* A separate hll() function for approximate countDistinct can be added later if
  needed.
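To make the HashSet idea concrete, here is a rough sketch of what I have in
mind. It is not written against the real Metric API: the class name
CountDistinctSketch, the update/getValue signatures, and the plain Map standing
in for a tuple are all placeholders for illustration only.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for illustration; not the solrj-streaming Metric API. */
public class CountDistinctSketch {
    private final String column;
    // One set of seen values per group key (the "over" bucket).
    private final Map<String, Set<Object>> seenByGroup = new HashMap<>();

    public CountDistinctSketch(String column) {
        this.column = column;
    }

    /** Record the column value of one tuple under its group key. */
    public void update(String groupKey, Map<String, Object> tuple) {
        Object value = tuple.get(column);
        if (value == null) {
            return; // assumption: tuples missing the field are ignored
        }
        seenByGroup.computeIfAbsent(groupKey, k -> new HashSet<>()).add(value);
    }

    /** Exact distinct count for a group; 0 if the group was never seen. */
    public long getValue(String groupKey) {
        Set<Object> seen = seenByGroup.get(groupKey);
        return seen == null ? 0L : seen.size();
    }
}
```

Memory grows with the number of distinct values in each group, which is the
trade-off noted above for exact counting.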
PercentileMetric:
* Use t-digest. This would require adding the t-digest dependency to
  solrj-streaming.
* This gives us bounded memory.
* This is the same approach the JSON Facet API already uses.
Let me know if there are any concerns or if you'd recommend a different
approach for either of these.
Thanks,
Khush