CuteChuanChuan commented on PR #21988:
URL: https://github.com/apache/datafusion/pull/21988#issuecomment-4366624297
Hi @alamb, thanks for raising the concern. I researched how PostgreSQL and
DuckDB handle this, and added Spark / BigQuery / ClickHouse for completeness
since the question is really "what's the cross-engine norm." Summary up front,
then docs + reproducible tests.
## TL;DR
Across all five engines I checked, `median` (or its closest equivalent) over
an integer column **never returns the input integer type** — it always returns
a floating-point / decimal type that preserves the `.5` from averaging.
| Engine | Has `median()`? | Return type for integer input | Truncates
`.5`? |
|------------|-----------------|-------------------------------|------------------|
| DuckDB | Yes | `DOUBLE` | No
|
| PostgreSQL | No (uses `percentile_cont(0.5)`) | `double precision` | No
|
| Spark | Yes | floating point (docs example: `(0),(10)` →
`5.0`) | No |
| BigQuery | No (uses `PERCENTILE_CONT`) | INT64 not even in supported
types — must CAST to
`NUMERIC`/`BIGNUMERIC`/`FLOAT64` | N/A by design |
| ClickHouse | Yes (alias for `quantile`) | `Float64` | No |
## Per-engine evidence
### DuckDB —
[docs](https://duckdb.org/docs/current/sql/functions/aggregates#medianx)
> `median(x)` — The middle value of the set. **For even value counts,
quantitative values are averaged** and ordinal values return the lower value.
> Formula: `quantile_cont(x, 0.5)`
Reproducible (DuckDB 1.5.2):
```bash
duckdb -c "SELECT median(v) AS m, typeof(median(v)) AS rtype \
FROM (VALUES (CAST(-128 AS TINYINT)), (CAST(127 AS TINYINT))) AS t(v);"
# m: -0.5 rtype: DOUBLE
```
### PostgreSQL —
[docs](https://www.postgresql.org/docs/current/functions-aggregate.html#FUNCTIONS-ORDEREDSET-TABLE)
PostgreSQL has no built-in median; the wiki
([Aggregate_Median](https://wiki.postgresql.org/wiki/Aggregate_Median))
recommends percentile_cont(0.5) WITHIN
GROUP (ORDER BY ...), which is documented as returning double precision.
Reproducible (PostgreSQL 17.9):
```bash
docker run -d --name pg17 -e POSTGRES_PASSWORD=test postgres:17 && sleep 3
docker exec pg17 psql -U postgres -c "\
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY v) AS m, \
pg_typeof(percentile_cont(0.5) WITHIN GROUP (ORDER BY v)) AS rtype \
FROM (VALUES (CAST(-128 AS smallint)), (CAST(127 AS smallint))) AS t(v);"
# m: -0.5 rtype: double precision
```
### Spark —
[docs](https://spark.apache.org/docs/latest/api/sql/index.html#median)
Docs example for median(col): SELECT median(col) FROM VALUES (0), (10) AS
tab(col);
-- 5.0
Integer input (0), (10) returns 5.0 (floating point) — even when the result
happens to be a whole number.
### BigQuery —
[docs](https://docs.cloud.google.com/bigquery/docs/reference/standard-sql/navigation_functions#percentile_cont)
No built-in MEDIAN. PERCENTILE_CONT's Supported Argument Types are
explicitly limited to NUMERIC, BIGNUMERIC, FLOAT64 — INT64 is not in the list.
The return-type table never produces an integer type, so truncation is
structurally impossible.
### ClickHouse — [median
docs](https://clickhouse.com/docs/en/sql-reference/aggregate-functions/reference/median)
/ [quantile
docs](https://clickhouse.com/docs/sql-reference/aggregate-functions/reference/quantile)
▎ The median* functions are the aliases for the corresponding quantile*
functions.
▎ Returned value: Approximate quantile of the specified level. Float64 or
Date or DateTime.
Docs example with UInt32 input (1, 1, 2, 3) returns 1.5.
## On the downstream-disruption concern
You're right that this is still a return-type breaking change, and the
cross-engine evidence above tell us:
1. Users coming from any of the five engines above already expect a
floating-point return for them.
2. Anyone relying on median(int_col) → int_col is depending on behavior that
is non-portable across the rest of the ecosystem.
3. Pure-DataFusion users who happen to have median(int_col) feeding a
downstream integer-typed schema would still need to update their code.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]