Hey,
CountDistinctMetric and PercentileMetric in the stream metrics module have 
no-op implementations. They only work with FacetStream/StatsStream, which push 
the computation down to the JSON Facet API. This means they can't be used with 
the rollup stream decorator, so we cannot compute percentiles and countDistinct 
alongside other metrics on the output of a streaming expression.

I'd like to add support for these metrics for rollup. The approach I'm 
considering:

CountDistinctMetric: 

* Maintain a HashSet of seen values per group (one set per distinct value of 
the "over" fields).
* This gives exact counts, with memory proportional to each group's 
cardinality.
* A separate hll() function for approximate countDistinct can be added later if 
needed.
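
To make the HashSet idea concrete, here is a minimal, self-contained sketch. 
It is not the Solr Metric API: ExactCountDistinct, RollupSketch, and the use of 
a plain Map as a stand-in for a tuple are hypothetical names chosen for 
illustration, and skipping null values is an assumption rather than settled 
behaviour.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical exact countDistinct metric; one instance per rollup group. */
class ExactCountDistinct {
    private final String column;
    // Every distinct value seen so far; memory grows with group cardinality.
    private final Set<Object> seen = new HashSet<>();

    ExactCountDistinct(String column) {
        this.column = column;
    }

    /** Records the tuple's value for the column (nulls skipped by assumption). */
    void update(Map<String, ?> tuple) {
        Object value = tuple.get(column);
        if (value != null) {
            seen.add(value);
        }
    }

    long getValue() {
        return seen.size();
    }

    /** Fresh, empty metric for the next group. */
    ExactCountDistinct newInstance() {
        return new ExactCountDistinct(column);
    }
}

/** Hypothetical stand-in for a rollup over a single "over" field. */
class RollupSketch {
    static Map<String, Long> countDistinctPerGroup(
            List<? extends Map<String, ?>> tuples, String over, String column) {
        ExactCountDistinct prototype = new ExactCountDistinct(column);
        Map<String, ExactCountDistinct> metrics = new LinkedHashMap<>();
        for (Map<String, ?> tuple : tuples) {
            String group = String.valueOf(tuple.get(over));
            // Create a fresh metric the first time a group is seen.
            metrics.computeIfAbsent(group, g -> prototype.newInstance()).update(tuple);
        }
        Map<String, Long> result = new LinkedHashMap<>();
        metrics.forEach((group, metric) -> result.put(group, metric.getValue()));
        return result;
    }
}
```

Calling RollupSketch.countDistinctPerGroup(tuples, "dept", "user") returns one 
exact distinct count per "dept" value. Since each group holds its own set, 
total memory is the sum of the per-group cardinalities, which is the trade-off 
an approximate hll() variant would avoid.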

PercentileMetric:

* Use t-digest. This would require adding the t-digest dependency to 
solrj-streaming. 
* This gives us bounded memory.
* This is the same approach the JSON Facet API already uses.

Let me know if there are any concerns or if you'd recommend a different 
approach for either of these.

Thanks,
Khush
