[ https://issues.apache.org/jira/browse/SPARK-42346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684332#comment-17684332 ]
L. C. Hsieh commented on SPARK-42346:
-------------------------------------

I will wait for this patch before cutting RC1.

> distinct(count colname) with UNION ALL causes query analyzer bug
> ----------------------------------------------------------------
>
>                 Key: SPARK-42346
>                 URL: https://issues.apache.org/jira/browse/SPARK-42346
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0, 3.4.0, 3.5.0
>            Reporter: Robin
>            Priority: Major
>
> If you combine a UNION ALL with a count(distinct colname) you get a query
> analyzer bug.
>
> This behaviour was introduced in 3.3.0. The bug was not present in 3.2.1.
>
> Here is a reprex in PySpark:
>
>     import pandas as pd
>
>     # `spark` is assumed to be an existing SparkSession (e.g. the PySpark shell).
>     df_pd = pd.DataFrame([
>         {'surname': 'a', 'first_name': 'b'}
>     ])
>     df_spark = spark.createDataFrame(df_pd)
>     df_spark.createOrReplaceTempView("input_table")
>     sql = """
>     SELECT
>         (SELECT Count(DISTINCT first_name) FROM input_table)
>             AS distinct_value_count
>     FROM input_table
>     UNION ALL
>     SELECT
>         (SELECT Count(DISTINCT surname) FROM input_table)
>             AS distinct_value_count
>     FROM input_table """
>     spark.sql(sql).toPandas()
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
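
For reference, the result the reprex query should return can be checked on a different SQL engine. The sketch below is not part of the original report and does not touch Spark: it assumes SQLite (through Python's built-in `sqlite3` module) as a stand-in, and recreates the one-row `input_table` to show what a correct evaluation of the same query looks like.

```python
import sqlite3

# Assumption: SQLite is used only as a reference engine for the expected
# result; it says nothing about how Spark's analyzer handles the query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE input_table (surname TEXT, first_name TEXT)")
conn.execute("INSERT INTO input_table VALUES ('a', 'b')")

# Same query shape as the reprex: a scalar subquery with COUNT(DISTINCT ...)
# in each branch of a UNION ALL.
sql = """
SELECT
    (SELECT COUNT(DISTINCT first_name) FROM input_table)
        AS distinct_value_count
FROM input_table
UNION ALL
SELECT
    (SELECT COUNT(DISTINCT surname) FROM input_table)
        AS distinct_value_count
FROM input_table
"""

rows = conn.execute(sql).fetchall()
print(rows)  # [(1,), (1,)]
```

Each branch yields one row with a distinct count of 1, so a fixed Spark build would be expected to return the same two rows for the reprex.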