code-mc opened a new issue, #55235:
URL: https://github.com/apache/spark/issues/55235
The newly added KLL sketch functions behave in an unexpected way when you
encounter a sketch that is empty (which is often the case when dealing with
nullable input fields or using FILTER WHERE expressions in your sketch
aggregate).
Minimal example to reproduce on the latest 4.2.0-preview3:
```sql
WITH sketch_data AS (
SELECT kll_sketch_agg_double(value) AS sketch
FROM (
select * from VALUES (1.23D), (2.34D), (3.45D) AS T(value)
limit 0
)
)
SELECT kll_sketch_get_quantile_double(sketch, array(0.25, 0.75)) FROM
sketch_data
;
```
Due to the limit 0 an empty double KLL sketch will be created, which when
queried using the `kll_sketch_get_quantile_double` will raise the specified
`KLL_INVALID_INPUT_SKETCH_BUFFER` error.
While the pyspark docs do not mention anything about behavior when reading
an empty sketch, the databricks docs (I know not pyspark but still) do say the
following:
> Notes
> Returns NULL if the sketch is empty.
See:
https://docs.databricks.com/gcp/en/sql/language-manual/functions/kll_sketch_get_quantile_bigint#notes
Which makes a lot more sense to me, as it does not crash any query that
operates over a row with an empty sketch, which is currently difficult to
ensure you will not encounter. The only way to work around this error is to
first filter out any empty sketch:
```sql
WITH sketch_data AS (
SELECT kll_sketch_agg_double(value) AS sketch
FROM (
select * from VALUES (1.23D), (2.34D), (3.45D) AS T(value)
limit 0
)
)
SELECT
CASE WHEN kll_sketch_get_n_double(sketch) != 0 THEN
kll_sketch_get_quantile_double(sketch, array(0.25, 0.75)) ELSE null END
FROM sketch_data
```
This adds a lot of boilerplate that ruins the readability of the query.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]