code-mc opened a new issue, #55235:
URL: https://github.com/apache/spark/issues/55235

   The newly added KLL sketch functions behave in an unexpected way when you 
encounter a sketch that is empty (which is often the case when dealing with 
nullable input fields or using FILTER WHERE expressions in your sketch 
aggregate).
   
   Minimal example to reproduce on the latest 4.2.0-preview3:
   
   ```sql
   WITH sketch_data AS (
       SELECT kll_sketch_agg_double(value) AS sketch
       FROM (
       select * from VALUES (1.23D), (2.34D), (3.45D) AS T(value)
       limit 0
       )
   )
   SELECT kll_sketch_get_quantile_double(sketch, array(0.25, 0.75)) FROM 
sketch_data
   ;
   ```
   
   Due to the limit 0 an empty double KLL sketch will be created, which when 
queried using the `kll_sketch_get_quantile_double` will raise the specified 
`KLL_INVALID_INPUT_SKETCH_BUFFER` error.
   
   While the pyspark docs do not mention anything about behavior when reading 
an empty sketch, the databricks docs (I know not pyspark but still) do say the 
following:
   
   > Notes
   > Returns NULL if the sketch is empty.
   
   See: 
https://docs.databricks.com/gcp/en/sql/language-manual/functions/kll_sketch_get_quantile_bigint#notes
   
   Which makes a lot more sense to me, as it does not crash any query that 
operates over a row with an empty sketch, which is currently difficult to 
ensure you will not encounter. The only way to work around this error is to 
first filter out any empty sketch:
   
   ```sql
   WITH sketch_data AS (
         SELECT kll_sketch_agg_double(value) AS sketch
         FROM (
         select * from VALUES (1.23D), (2.34D), (3.45D) AS T(value)
         limit 0
         )
     )
     SELECT 
       CASE WHEN kll_sketch_get_n_double(sketch) != 0 THEN 
kll_sketch_get_quantile_double(sketch, array(0.25, 0.75)) ELSE null END
     FROM sketch_data
   ```
   
   This adds a lot of boilerplate that ruins the readability of the query.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to