GitHub user rednaxelafx opened a pull request: https://github.com/apache/spark/pull/20163
[SPARK-22966][PySpark] Spark SQL should handle Python UDFs that return a datetime.date or datetime.datetime

## What changes were proposed in this pull request?

Perform the appropriate conversions for results coming from Python UDFs that return `datetime.date` or `datetime.datetime`. Before this PR, Pyrolite unpickled both `datetime.date` and `datetime.datetime` into a `java.util.Calendar`, a type Spark SQL does not understand, which led to incorrect results. An example of such an incorrect result:

```
>>> py_date = udf(datetime.date)
>>> spark.range(1).select(py_date(lit(2017), lit(10), lit(30))).show(truncate=False)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|date(2017, 10, 30)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet=false,lenient=true,zone=sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-28800000,dstSavings=3600000,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-28800000,dstSavings=3600000,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=7200000,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=7200000,endTimeMode=0]],firstDayOfWeek=1,minimalDaysInFirstWeek=1,ERA=?,YEAR=2017,MONTH=9,WEEK_OF_YEAR=?,WEEK_OF_MONTH=?,DAY_OF_MONTH=30,DAY_OF_YEAR=?,DAY_OF_WEEK=?,DAY_OF_WEEK_IN_MONTH=?,AM_PM=0,HOUR=0,HOUR_OF_DAY=0,MINUTE=0,SECOND=0,MILLISECOND=?,ZONE_OFFSET=?,DST_OFFSET=?]|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```

After this PR, the same query gives the correct result:

```
>>> spark.range(1).select(py_date(lit(2017), lit(10), lit(30))).show(truncate=False)
+------------------+
|date(2017, 10, 30)|
+------------------+
|2017-10-30        |
+------------------+
```

An explicit non-goal of this PR is to change the behavior of timezone awareness or timezone settings of `datetime.datetime` objects collected from a `DataFrame`. Currently PySpark always returns such `datetime.datetime` objects as timezone-unaware (naive) ones that respect Python's current local timezone (#19607 changed the default behavior for Pandas support, but not for plain `collect()`). This PR does not change that behavior.

## How was this patch tested?

Added unit tests to `pyspark.sql.tests` covering the following cases:

* `datetime.date` -> `StringType`
* `datetime.date` -> `DateType`
* `datetime.datetime` -> `StringType`
* `datetime.datetime` -> `TimestampType`
* `datetime.datetime` with a non-default timezone
* `datetime.datetime` with a null timezone (naive datetime)
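For reference, here is a minimal sketch of what coverage for two of these cases could look like. This is illustrative only, not the PR's actual test code; it assumes an active `SparkSession` named `spark` and consistent driver/session timezone settings:

```
import datetime

from pyspark.sql.functions import lit, udf
from pyspark.sql.types import DateType, TimestampType

# A Python UDF returning datetime.date, declared with an explicit DateType.
py_date = udf(lambda y, m, d: datetime.date(y, m, d), DateType())
# A Python UDF returning datetime.datetime, declared with an explicit TimestampType.
py_ts = udf(lambda y, m, d: datetime.datetime(y, m, d), TimestampType())

row = spark.range(1).select(
    py_date(lit(2017), lit(10), lit(30)).alias("d"),
    py_ts(lit(2017), lit(10), lit(30)).alias("ts"),
).first()

# With the fix, both values round-trip back to the original Python objects
# (the timestamp comes back as a naive datetime in the local timezone).
assert row.d == datetime.date(2017, 10, 30)
assert row.ts == datetime.datetime(2017, 10, 30)
```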
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rednaxelafx/apache-spark pyspark-udf-datetime

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20163.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20163

----

commit ca026d31a489f1e0eb451fe85df97083659d0f67
Author: Kris Mok <kris.mok@...>
Date: 2018-01-05T00:29:20Z

    Spark SQL should handle Python UDFs that return a datetime.date or datetime.datetime

    Perform appropriate conversions for results coming from such UDFs. Also added some unit tests to pyspark.sql.tests for such UDFs.

----
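As an aside for reviewers: the `java.util.GregorianCalendar` output above prints as garbage because Spark SQL internally encodes `DateType` as days since the Unix epoch and `TimestampType` as microseconds since the Unix epoch, and a raw `Calendar` matches neither representation. A rough Python illustration of those target encodings follows; it is illustrative only, since the actual conversion in this patch happens on the JVM side:

```
import datetime

EPOCH_DATE = datetime.date(1970, 1, 1)

def to_internal_date(d):
    # DateType is stored as the number of days since 1970-01-01.
    return (d - EPOCH_DATE).days

def to_internal_timestamp(dt):
    # TimestampType is stored as microseconds since the Unix epoch;
    # a naive datetime is interpreted in the local timezone here.
    return int(dt.timestamp() * 1_000_000)

print(to_internal_date(datetime.date(2017, 10, 30)))  # 17469
```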