oglego opened a new pull request, #4376: URL: https://github.com/apache/datafusion-comet/pull/4376
## Which issue does this PR close? Closes #. ## Rationale for this change `cardinality` is a commonly used Spark SQL function (since 2.4.0) that was not yet supported natively by Comet, causing fallback to Spark execution. It returns the number of elements in an array or the number of key-value pairs in a map. Per the Spark docs, `cardinality` returns `-1` for null input only when both `spark.sql.legacy.sizeOfNull=true` AND `spark.sql.ansi.enabled=false`. With default settings it always returns `NULL` for null input. In Spark's implementation, `cardinality` is a direct alias for the `Size` expression with `legacySizeOfNull = false` hardcoded at parse time — the two-config gate never applies to `cardinality`. ## What changes are included in this PR? Adds native support for `cardinality(expr)` for both array and map inputs by extending the existing `CometSize` serde. There was no separate `Cardinality` class to wire — both `cardinality` and `size` produce a `Size` node in the logical plan. Two gaps in the Scala layer and one in the Rust layer prevented `cardinality` from offloading: 1. `CometSize.getSupportLevel` was returning `Unsupported` for `MapType` inputs 2. `CometSize.convert` was reading `SQLConf.get.legacySizeOfNull` (the global config) rather than `expr.legacySizeOfNull` (the instance field set at parse time), so `cardinality` could incorrectly return `-1` for `NULL` when the legacy config was enabled 3. `SparkSizeFunc` in the Rust layer used `Signature::uniform` with exact type stubs, causing DataFusion's type coercion to reject real map columns with concrete inner field types Changes: - Extended `CometSize` in `arrays.scala` to accept `MapType` as `Compatible()` - Fixed `CometSize.convert` to use `expr.legacySizeOfNull` instead of the global config - Changed `SparkSizeFunc` signature to `Signature::any(1)` so real map and array types are accepted regardless of inner field types - Fixed null map entries in `SparkSizeFunc` to append `null` instead of hardcoded `-1` - Removed `spark_answer_only` from the `size(arr), size(m)` column query in `size.sql` since map inputs now run natively - Annotated the `Size` row in `expressions.md` to note it also covers the `cardinality` SQL alias The `implement-comet-expression` and `wire-datafusion-function` skills were used to scaffold this implementation. ## How are these changes tested? New Comet SQL Test at `spark/src/test/resources/sql-tests/expressions/array/cardinality.sql`. Covers: - Array column input - Map column input - Array and map columns in the same query - `NULL` array and map inputs via column reference (asserts `NULL` is returned, not `-1`) - Nested array column input (`array<array<int>>`) - Array-of-structs column input (`array<struct<a: int>>`) - Literal array and map arguments (`spark_answer_only` — `CreateArray`/`CreateMap` not yet natively supported by Comet, but correctness is verified against Spark) The existing `size.sql` test now also covers the map-input path natively (removed `spark_answer_only`). ```bash ./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite cardinality" -Dtest=none ./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite size" -Dtest=none ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
