Copilot commented on code in PR #4394:
URL: https://github.com/apache/solr/pull/4394#discussion_r3181449818
##########
solr/solrj-streaming/src/java/org/apache/solr/client/solrj/io/stream/metrics/CountDistinctMetric.java:
##########
@@ -66,7 +72,10 @@ private void init(String functionName, String columnName) {
@Override
public void update(Tuple tuple) {
- // Nop for now
+ Object value = tuple.get(columnName);
+ if (value != null) {
+ distinctValues.add(value);
+ }
Review Comment:
`CountDistinctMetric(String, true)` is already used for SQL
`APPROX_COUNT_DISTINCT`, but this implementation ignores that mode and still
retains every distinct value in a `HashSet`. That means the "approximate"
variant now has exact-count memory growth, which can turn grouped queries over
high-cardinality fields into large heap usage instead of the bounded-memory HLL
behavior callers expect.
##########
solr/solrj-streaming/src/java/org/apache/solr/client/solrj/io/stream/metrics/CountDistinctMetric.java:
##########
@@ -81,14 +90,11 @@ public String[] getColumns() {
@Override
public Number getValue() {
- // No op for now
- return null;
+ return distinctValues.size();
}
@Override
public StreamExpressionParameter toExpression(StreamFactory factory) throws
IOException {
- return new StreamExpression(getFunctionName())
- .withParameter(columnName)
- .withParameter(Boolean.toString(outputLong));
+ return new StreamExpression(getFunctionName()).withParameter(columnName);
Review Comment:
This new expression form is still incomplete in the distributed
SQL/map-reduce path: `SolrTable.handleGroupByMapReduce()` builds the worker
`StreamFactory` with `sum/min/max/avg/count` only, so a `ParallelStream` that
pushes `rollup(..., countDist(...))` to workers will still fail when
`numWorkers > 1`. The feature won't actually be usable end-to-end for parallel
group-by queries until that factory also registers `countDist`.
##########
solr/solrj-streaming/src/java/org/apache/solr/client/solrj/io/stream/metrics/CountDistinctMetric.java:
##########
@@ -53,6 +55,10 @@ public CountDistinctMetric(StreamExpression expression,
StreamFactory factory)
expression,
functionName));
}
+ if (1 != expression.getParameters().size()) {
+ throw new IOException(
+ String.format(Locale.ROOT, "Invalid expression %s - unknown operands
found", expression));
Review Comment:
Previous releases serialized this metric as `countDist(field,true)` from
`toExpression()`. Rejecting any 2-argument form here makes those
already-generated expressions fail to parse after upgrade instead of remaining
backward compatible while you fix the serializer.
##########
solr/solrj-streaming/src/java/org/apache/solr/client/solrj/io/stream/metrics/CountDistinctMetric.java:
##########
@@ -81,14 +90,11 @@ public String[] getColumns() {
@Override
public Number getValue() {
- // No op for now
- return null;
+ return distinctValues.size();
}
@Override
public StreamExpressionParameter toExpression(StreamFactory factory) throws
IOException {
- return new StreamExpression(getFunctionName())
- .withParameter(columnName)
- .withParameter(Boolean.toString(outputLong));
+ return new StreamExpression(getFunctionName()).withParameter(columnName);
Review Comment:
The PR fixes `CountDistinctMetric.toExpression()` and tightens its parser,
but there is still no dedicated round-trip regression test for `countDist`
alongside the existing metric `toExpression()` tests in
`StreamExpressionToExpressionTest`. Without that, the exact serialization bug
fixed here is easy to reintroduce unnoticed.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]