Noah Lidell created SPARK-24752:
-----------------------------------

             Summary: date_format provides incorrect year after a timezone 
conversation changes the year on a timestamp
                 Key: SPARK-24752
                 URL: https://issues.apache.org/jira/browse/SPARK-24752
             Project: Spark
          Issue Type: Bug
          Components: PySpark, Spark Core, SQL
    Affects Versions: 2.1.2
            Reporter: Noah Lidell


Initially I have a dataframe with a column for date and a column for time in 
UTC time. I combine these two columns into a single string that represents a 
UTC timestamp. I then convert that UTC timestamp into a Central Time timestamp 
using pyspark.sql.functions.from_utc_timestamp(). In the example below I have 
timestamps that are early in the day on Janurary 1st UTC, so that when they are 
converted to Central Time the resulting timestamp is on December 31st of the 
previous year. These timestamps are produced correctly. However, if I then use 
pypspark.sql.functions.date_format() to format a date from the one of these 
December 31st Central Time timestamps then the year portion of the output will 
be wrong. 

 

 
{code:java}
import pyspark.sql.functions as F

columns = ["observation_date","observation_time"]
vals = [
('2015-01-01', '02:22'),
('2016-01-01', '02:22'),
('2016-01-01', '08:22'),
('2016-01-02', '02:22'),
('2017-01-01', '02:22'),
('2018-01-01', '02:22'),
('2019-01-01', '02:22'),
]

test_df = spark.createDataFrame(vals, columns)

test_df.show()
# OUTPUT
# +----------------+----------------+
# |observation_date|observation_time|
# +----------------+----------------+
# |      2015-01-01|           02:22|
# |      2016-01-01|           02:22|
# |      2016-01-01|           08:22|
# |      2016-01-02|           02:22|
# |      2017-01-01|           02:22|
# |      2018-01-01|           02:22|
# |      2019-01-01|           02:22|
# +----------------+----------------+

renamed = 
test_df.withColumnRenamed("observation_date","UTC_observation_date").withColumnRenamed("observation_time","UTC_observation_time")

utc_ts = renamed.withColumn("UTC", F.concat(F.col("UTC_observation_date"), 
F.lit(" "), F.col("UTC_observation_time")))

utc_ts.show()
# OUTPUT
# +--------------------+--------------------+----------------+
# |UTC_observation_date|UTC_observation_time|             UTC|
# +--------------------+--------------------+----------------+
# |          2015-01-01|               02:22|2015-01-01 02:22|
# |          2016-01-01|               02:22|2016-01-01 02:22|
# |          2016-01-01|               08:22|2016-01-01 08:22|
# |          2016-01-02|               02:22|2016-01-02 02:22|
# |          2017-01-01|               02:22|2017-01-01 02:22|
# |          2018-01-01|               02:22|2018-01-01 02:22|
# |          2019-01-01|               02:22|2019-01-01 02:22|
# +--------------------+--------------------+----------------+

central_ts = utc_ts.withColumn("central_timestamp", 
F.from_utc_timestamp(utc_ts.UTC, 'US/Central'))

final = central_ts.withColumn('observation_date', 
F.date_format(central_ts.central_timestamp, "YYYY-MM-dd"))\
.withColumn('observation_time', F.date_format(central_ts.central_timestamp, 
"HH:mm"))

final.select('observation_date','observation_time','central_timestamp','UTC').show()
# OUTPUT
# +----------------+----------------+--------------------+----------------+
# |observation_date|observation_time|   central_timestamp|             UTC|
# +----------------+----------------+--------------------+----------------+
# |      2015-12-31|           20:22|2014-12-31 20:22:...|2015-01-01 02:22|
# |      2016-12-31|           20:22|2015-12-31 20:22:...|2016-01-01 02:22|
# |      2016-01-01|           02:22|2016-01-01 02:22:...|2016-01-01 08:22|
# |      2016-01-01|           20:22|2016-01-01 20:22:...|2016-01-02 02:22|
# |      2016-12-31|           20:22|2016-12-31 20:22:...|2017-01-01 02:22|
# |      2018-12-31|           20:22|2017-12-31 20:22:...|2018-01-01 02:22|
# |      2019-12-31|           20:22|2018-12-31 20:22:...|2019-01-01 02:22|
# 
+----------------+----------------+--------------------+----------------+{code}
The Central Time timestamp is generated correctly by the from_utc_timestamp 
function but when formatted by the date_format function _and if_ the timezone 
conversation caused a change in year then the formatted string produced by the 
date_format function will have the wrong year.

In summary, this is the incorrect behavior:

UTC timestamp="2016-01-01 02:22" -(apply from_utc_timestamp)-> Central 
timestamp="2015-12-31 20:22" --(apply date_format "YYYY-MM-dd")-> 
date="2016-12-31"

output instead should be "2015-12-31"

 

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to