Copilot commented on code in PR #4394:
URL: https://github.com/apache/solr/pull/4394#discussion_r3181449818


##########
solr/solrj-streaming/src/java/org/apache/solr/client/solrj/io/stream/metrics/CountDistinctMetric.java:
##########
@@ -66,7 +72,10 @@ private void init(String functionName, String columnName) {
 
   @Override
   public void update(Tuple tuple) {
-    // Nop for now
+    Object value = tuple.get(columnName);
+    if (value != null) {
+      distinctValues.add(value);
+    }

Review Comment:
   `CountDistinctMetric(String, true)` is already used for SQL 
`APPROX_COUNT_DISTINCT`, but this implementation ignores that mode and still 
retains every distinct value in a `HashSet`. That means the "approximate" 
variant now has exact-count memory growth, which can turn grouped queries over 
high-cardinality fields into large heap usage instead of the bounded-memory HLL 
behavior callers expect.



##########
solr/solrj-streaming/src/java/org/apache/solr/client/solrj/io/stream/metrics/CountDistinctMetric.java:
##########
@@ -81,14 +90,11 @@ public String[] getColumns() {
 
   @Override
   public Number getValue() {
-    // No op for now
-    return null;
+    return distinctValues.size();
   }
 
   @Override
   public StreamExpressionParameter toExpression(StreamFactory factory) throws 
IOException {
-    return new StreamExpression(getFunctionName())
-        .withParameter(columnName)
-        .withParameter(Boolean.toString(outputLong));
+    return new StreamExpression(getFunctionName()).withParameter(columnName);

Review Comment:
   This new expression form is still incomplete in the distributed 
SQL/map-reduce path: `SolrTable.handleGroupByMapReduce()` builds the worker 
`StreamFactory` with `sum/min/max/avg/count` only, so a `ParallelStream` that 
pushes `rollup(..., countDist(...))` to workers will still fail when 
`numWorkers > 1`. The feature won't actually be usable end-to-end for parallel 
group-by queries until that factory also registers `countDist`.



##########
solr/solrj-streaming/src/java/org/apache/solr/client/solrj/io/stream/metrics/CountDistinctMetric.java:
##########
@@ -53,6 +55,10 @@ public CountDistinctMetric(StreamExpression expression, 
StreamFactory factory)
               expression,
               functionName));
     }
+    if (1 != expression.getParameters().size()) {
+      throw new IOException(
+          String.format(Locale.ROOT, "Invalid expression %s - unknown operands 
found", expression));

Review Comment:
   Previous releases serialized this metric as `countDist(field,true)` from 
`toExpression()`. Rejecting any 2-argument form here makes those 
already-generated expressions fail to parse after upgrade instead of remaining 
backward compatible while you fix the serializer.



##########
solr/solrj-streaming/src/java/org/apache/solr/client/solrj/io/stream/metrics/CountDistinctMetric.java:
##########
@@ -81,14 +90,11 @@ public String[] getColumns() {
 
   @Override
   public Number getValue() {
-    // No op for now
-    return null;
+    return distinctValues.size();
   }
 
   @Override
   public StreamExpressionParameter toExpression(StreamFactory factory) throws 
IOException {
-    return new StreamExpression(getFunctionName())
-        .withParameter(columnName)
-        .withParameter(Boolean.toString(outputLong));
+    return new StreamExpression(getFunctionName()).withParameter(columnName);

Review Comment:
   The PR fixes `CountDistinctMetric.toExpression()` and tightens its parser, 
but there is still no dedicated round-trip regression test for `countDist` 
alongside the existing metric `toExpression()` tests in 
`StreamExpressionToExpressionTest`. Without that, the exact serialization bug 
fixed here is easy to reintroduce unnoticed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to