[jira] [Commented] (SPARK-33327) grouped by first and last against date column returns incorrect results

Punit Shah (Jira) Thu, 05 Nov 2020 05:14:38 -0800


    [ 
https://issues.apache.org/jira/browse/SPARK-33327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226712#comment-17226712
 ]


Punit Shah commented on SPARK-33327:
------------------------------------

The correct result of running the query should be:

cnt, FromDate_First, FromDate_Last, cntdist

15, 2013-02-21, 2013-12-13, 4

or:

cnt, FromDate_First, FromDate_Last, cntdist

15, 2013-02-21 00:00:00, 2013-12-13 00:00:00, 4

Thanks for asking, [~hyukjin.kwon]. I now notice that both imports return 
incorrect results, as shown below:

spark_session.read.csv("users.csv", inferSchema=True, header=True) returns 
the following incorrect result:

cnt, FromDate_First, FromDate_Last, cntdist

15, 2013-12-13 00:00:00, 2013-03-18 00:00:00, 4

spark_session.read.csv("users.csv", inferSchema=True, 
header=True).selectExpr("`User`", "cast(`FromDate` as date)") also returns 
an incorrect result:

cnt, FromDate_First, FromDate_Last, cntdist

15, 2013-12-13, 2013-02-21, 4
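
A minimal sketch for running the same query against both imports side by side. It assumes an existing SparkSession is passed in as spark_session and that users.csv (the attachment on the issue) is in the working directory. The helper name run_both_imports and the layout of its return value are hypothetical, chosen only for illustration; the reads, the view name table02, and the query come from the issue description.

```python
# The query from the issue description, unchanged apart from line wrapping.
QUERY = (
    "select count(`User`) as cnt, "
    "first(`FromDate`) as `FromDate_First`, "
    "last(`FromDate`) as `FromDate_Last`, "
    "count(distinct(`FromDate`)) as cntdist "
    "from table02 group by `User`"
)


def run_both_imports(spark_session, path="users.csv"):
    """Run QUERY against both imports described in the issue.

    Hypothetical helper: returns a dict mapping a label for each import
    ("timestamp" or "date") to the list of rows collected for it.
    """
    # Import 1: schema inference reads FromDate as a timestamp.
    ts_df = spark_session.read.csv(path, inferSchema=True, header=True)

    # Import 2: the same read, with FromDate cast to a date.
    date_df = ts_df.selectExpr("`User`", "cast(`FromDate` as date)")

    results = {}
    for label, df in (("timestamp", ts_df), ("date", date_df)):
        # Re-register the view so QUERY always sees the current import.
        df.createOrReplaceTempView("table02")
        results[label] = spark_session.sql(QUERY).collect()
    return results
```

Calling run_both_imports(spark_session) and printing the two row lists makes it easy to compare FromDate_First and FromDate_Last between the timestamp and date imports.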

> grouped by first and last against date column returns incorrect results
> -----------------------------------------------------------------------
>
>                 Key: SPARK-33327
>                 URL: https://issues.apache.org/jira/browse/SPARK-33327
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.6, 2.4.7
>            Reporter: Punit Shah
>            Priority: Major
>         Attachments: users.csv
>
>
> The attached CSV file has two columns, namely "User" and "FromDate".  By 
> default, the import reads the "FromDate" column as a timestamp. 
>  * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True)
>  * outDF.createOrReplaceTempView("table02")
> In this default case the following SQL generates correct results:
> "select count(`User`) as cnt, first(`FromDate`) as 
> `FromDate_First`, last(`FromDate`) as `FromDate_Last`, 
> count(distinct(`FromDate`)) as cntdist from table02 group by `User`"
> However, if we read the dataframe as follows (where "FromDate" is read in 
> as a date), then the above SQL query generates incorrect results:
>  * outDF = spark_session.read.csv("users.csv", 
> inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as 
> date)")
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
