GitHub user rednaxelafx opened a pull request: https://github.com/apache/spark/pull/20163
[SPARK-22966][PySpark] Spark SQL should handle Python UDFs that return a datetime.date or datetime.datetime

## What changes were proposed in this pull request?

Perform the appropriate conversions for results coming from Python UDFs that return `datetime.date` or `datetime.datetime`. Before this PR, Pyrolite unpickled both `datetime.date` and `datetime.datetime` into a `java.util.Calendar`, a type Spark SQL does not understand, which led to incorrect results. An example of such an incorrect result:

```
>>> py_date = udf(datetime.date)
>>> spark.range(1).select(py_date(lit(2017), lit(10), lit(30))).show(truncate=False)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|date(2017, 10, 30)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet=false,lenient=true,zone=sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-28800000,dstSavings=3600000,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-28800000,dstSavings=3600000,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=7200000,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=7200000,endTimeMode=0]],firstDayOfWeek=1,minimalDaysInFirstWeek=1,ERA=?,YEAR=2017,MONTH=9,WEEK_OF_YEAR=?,WEEK_OF_MONTH=?,DAY_OF_MONTH=30,DAY_OF_YEAR=?,DAY_OF_WEEK=?,DAY_OF_WEEK_IN_MONTH=?,AM_PM=0,HOUR=0,HOUR_OF_DAY=0,MINUTE=0,SECOND=0,MILLISECOND=?,ZONE_OFFSET=?,DST_OFFSET=?]|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```

After this PR, the same query gives the correct result:

```
>>> spark.range(1).select(py_date(lit(2017), lit(10), lit(30))).show(truncate=False)
+------------------+
|date(2017, 10, 30)|
+------------------+
|2017-10-30        |
+------------------+
```

An explicit non-goal of this PR is to change the behavior of timezone awareness or timezone settings of `datetime.datetime` objects collected from a `DataFrame`. Currently PySpark always returns such `datetime.datetime` objects as timezone-unaware (naive) ones that respect Python's current local timezone (#19607 changed the default behavior for Pandas support, but not for plain `collect()`). This PR does not change that behavior.

## How was this patch tested?

Added unit tests to `pyspark.sql.tests` covering the following cases:

* `datetime.date` -> `StringType`
* `datetime.date` -> `DateType`
* `datetime.datetime` -> `StringType`
* `datetime.datetime` -> `TimestampType`
* `datetime.datetime` with a non-default timezone
* `datetime.datetime` with a null timezone (naive datetime)
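For reference, here is a minimal sketch of what coverage for two of these cases could look like. This is illustrative only, not the PR's actual test code; it assumes an active `SparkSession` named `spark` and consistent driver/session timezone settings:

```
import datetime

from pyspark.sql.functions import lit, udf
from pyspark.sql.types import DateType, TimestampType

# A Python UDF returning datetime.date, declared with an explicit DateType.
py_date = udf(lambda y, m, d: datetime.date(y, m, d), DateType())
# A Python UDF returning datetime.datetime, declared with an explicit TimestampType.
py_ts = udf(lambda y, m, d: datetime.datetime(y, m, d), TimestampType())

row = spark.range(1).select(
    py_date(lit(2017), lit(10), lit(30)).alias("d"),
    py_ts(lit(2017), lit(10), lit(30)).alias("ts"),
).first()

# With the fix, both values round-trip back to the original Python objects
# (the timestamp comes back as a naive datetime in the local timezone).
assert row.d == datetime.date(2017, 10, 30)
assert row.ts == datetime.datetime(2017, 10, 30)
```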
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rednaxelafx/apache-spark pyspark-udf-datetime

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20163.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20163

----

commit ca026d31a489f1e0eb451fe85df97083659d0f67
Author: Kris Mok <kris.mok@...>
Date: 2018-01-05T00:29:20Z

    Spark SQL should handle Python UDFs that return a datetime.date or datetime.datetime

    Perform appropriate conversions for results coming from such UDFs. Also added some unit tests to pyspark.sql.tests for such UDFs.

----
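As an aside for reviewers: the `java.util.GregorianCalendar` output above prints as garbage because Spark SQL internally encodes `DateType` as days since the Unix epoch and `TimestampType` as microseconds since the Unix epoch, and a raw `Calendar` matches neither representation. A rough Python illustration of those target encodings follows; it is illustrative only, since the actual conversion in this patch happens on the JVM side:

```
import datetime

EPOCH_DATE = datetime.date(1970, 1, 1)

def to_internal_date(d):
    # DateType is stored as the number of days since 1970-01-01.
    return (d - EPOCH_DATE).days

def to_internal_timestamp(dt):
    # TimestampType is stored as microseconds since the Unix epoch;
    # a naive datetime is interpreted in the local timezone here.
    return int(dt.timestamp() * 1_000_000)

print(to_internal_date(datetime.date(2017, 10, 30)))  # 17469
```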