[ https://issues.apache.org/jira/browse/SPARK-33327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226712#comment-17226712 ]
Punit Shah commented on SPARK-33327:
------------------------------------

The correct behaviour of running the query should be:

cnt, FromDate_First, FromDate_Last, cntdist
15, 2013-02-21, 2013-12-13, 4

or:

cnt, FromDate_First, FromDate_Last, cntdist
15, 2013-02-21 00:00:00, 2013-12-13 00:00:00, 4

Thanks for asking [~hyukjin.kwon]

Now I notice that both imports fail, as shown below.

spark_session.read.csv("users.csv", inferSchema=True, header=True) behaves incorrectly, returning:

cnt, FromDate_First, FromDate_Last, cntdist
15, 2013-12-13 00:00:00, 2013-03-18 00:00:00, 4

spark_session.read.csv("users.csv", inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as date)") also behaves incorrectly, returning:

cnt, FromDate_First, FromDate_Last, cntdist
15, 2013-12-13, 2013-02-21, 4

> grouped by first and last against date column returns incorrect results
> -----------------------------------------------------------------------
>
>                 Key: SPARK-33327
>                 URL: https://issues.apache.org/jira/browse/SPARK-33327
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.6, 2.4.7
>            Reporter: Punit Shah
>            Priority: Major
>         Attachments: users.csv
>
>
> The attached csv file has two columns, namely "User" and "FromDate". The
> import reads the "FromDate" column as a timestamp by default.
> * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True)
> * outDF.createOrReplaceTempView("table02")
>
> In this default case the following sql generates correct results:
>
> "select count(`User`) as cnt, first(`FromDate`) as `FromDate_First`,
> last(`FromDate`) as `FromDate_Last`, count(distinct(`FromDate`)) as cntdist
> from table02 group by `User`"
>
> However, if we read the dataframe like so (where "FromDate" is read in
> as a Date), then the above sql query generates incorrect results:
>
> * outDF = spark_session.read.csv("users.csv", inferSchema=True,
> header=True).selectExpr("`User`", "cast(`FromDate` as date)")
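
One possible reading of the results above, offered as an assumption rather than a confirmed diagnosis: in every reported output cnt (15) and cntdist (4) agree, and only the first/last values differ. Spark documents first() and last() as non-deterministic because they depend on row order, which is not guaranteed after a shuffle. The sketch below is plain Python, not Spark, and uses hypothetical rows (not the contents of the attached users.csv) to show how order-dependent aggregates change when the same rows arrive in a different order, while min/max, count and count-distinct do not.

```python
from collections import defaultdict

# Hypothetical rows (user, ISO date string); NOT the attached users.csv.
rows = [
    ("u1", "2013-02-21"),
    ("u1", "2013-03-18"),
    ("u1", "2013-03-18"),
    ("u1", "2013-12-13"),
]


def aggregate(rows):
    """Group by user and compute aggregates analogous to the query's."""
    groups = defaultdict(list)
    for user, date in rows:
        groups[user].append(date)
    return {
        user: {
            "cnt": len(dates),
            "first": dates[0],   # order-dependent, like first()
            "last": dates[-1],   # order-dependent, like last()
            "min": min(dates),   # order-independent
            "max": max(dates),   # order-independent
            "cntdist": len(set(dates)),
        }
        for user, dates in groups.items()
    }


in_file_order = aggregate(rows)
# Reversing the rows is a stand-in for a shuffle changing row order.
reordered = aggregate(list(reversed(rows)))

print(in_file_order["u1"])
print(reordered["u1"])
```

If the intent of the query is the earliest and latest date per user, replacing first(`FromDate`) and last(`FromDate`) with min(`FromDate`) and max(`FromDate`) in the SQL would give an order-independent answer. Whether that matches the behaviour expected in this issue is an assumption.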