Punit Shah created SPARK-33327:
----------------------------------

             Summary: grouped by first and last against date column returns 
incorrect results
                 Key: SPARK-33327
                 URL: https://issues.apache.org/jira/browse/SPARK-33327
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.7, 2.4.6
            Reporter: Punit Shah


The attached CSV file has two columns, "User" and "FromDate". With 
inferSchema=True, the import reads the "FromDate" column as a timestamp. 
 * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True)
 * outDF.createOrReplaceTempView("table02")

In this default case, the following SQL query returns correct results:

    select count(`User`) as cnt, first(`FromDate`) as `FromDate_First`,
    last(`FromDate`) as `FromDate_Last`,
    count(distinct(`FromDate`)) as cntdist from table02 group by `User`

However, if we read the dataframe as follows (so that "FromDate" is read in 
as a date), the above SQL query returns incorrect results:
 * outDF = spark_session.read.csv("users.csv", inferSchema=True, 
header=True).selectExpr("`User`", "cast(`FromDate` as date)")
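
The steps above can be combined into one script that runs the same query 
against both variants of the dataframe. This is only a rough sketch: the 
attached users.csv is not reproduced here, so the sample rows are invented, and 
the helper names (write_sample_csv, run_both_cases, main) are hypothetical. The 
read, cast, and query lines are taken from the report as written.

```python
import csv
import os
import tempfile

# The query from the report, unchanged.
QUERY = (
    "select count(`User`) as cnt, "
    "first(`FromDate`) as `FromDate_First`, "
    "last(`FromDate`) as `FromDate_Last`, "
    "count(distinct(`FromDate`)) as cntdist "
    "from table02 group by `User`"
)


def write_sample_csv(path):
    """Write a small stand-in for users.csv (rows are invented, not the attachment)."""
    rows = [
        ("alice", "2020-01-01"),
        ("alice", "2020-02-01"),
        ("bob", "2020-03-01"),
    ]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["User", "FromDate"])
        writer.writerows(rows)
    return len(rows)


def run_both_cases(spark_session, csv_path):
    """Run QUERY with FromDate as inferred by the reader, then cast to date."""
    base = spark_session.read.csv(csv_path, inferSchema=True, header=True)
    variants = {
        "timestamp": base,
        "date": base.selectExpr("`User`", "cast(`FromDate` as date)"),
    }
    results = {}
    for label, df in variants.items():
        # Re-register the view so the same query text runs against each variant.
        df.createOrReplaceTempView("table02")
        results[label] = spark_session.sql(QUERY).collect()
    return results


def main():
    # Imported here so the helpers above can be loaded without pyspark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("SPARK-33327-repro")
        .getOrCreate()
    )
    path = os.path.join(tempfile.mkdtemp(), "users.csv")
    write_sample_csv(path)
    for label, rows in run_both_cases(spark, path).items():
        print(label, rows)
    spark.stop()
```

Calling main() prints the grouped rows for each variant so they can be compared 
side by side. Replacing the generated file with the attached users.csv should 
match the reported setup more closely.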

 


