[ https://issues.apache.org/jira/browse/SPARK-24752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noah Lidell updated SPARK-24752:
--------------------------------
    Description: 
Initially I have a dataframe with a column for date and a column for time, both in 
UTC. I combine these two columns into a single string that represents a 
UTC timestamp, then convert that UTC timestamp into a Central Time timestamp 
using pyspark.sql.functions.from_utc_timestamp(). In the example below the 
timestamps fall early in the day on January 1st UTC, so that when they are 
converted to Central Time the resulting timestamps fall on December 31st of the 
previous year. These timestamps are produced correctly. However, if I then use 
pyspark.sql.functions.date_format() to format a date from one of these 
December 31st Central Time timestamps, the year portion of the output is 
wrong. 

 

 
{code:java}
import pyspark.sql.functions as F

columns = ["observation_date","observation_time"]
vals = [
('2015-01-01', '02:22'),
('2016-01-01', '02:22'),
('2016-01-01', '08:22'),
('2016-01-02', '02:22'),
('2017-01-01', '02:22'),
('2018-01-01', '02:22'),
('2019-01-01', '02:22'),
]

test_df = spark.createDataFrame(vals, columns)

test_df.show()
# OUTPUT
# +----------------+----------------+
# |observation_date|observation_time|
# +----------------+----------------+
# |      2015-01-01|           02:22|
# |      2016-01-01|           02:22|
# |      2016-01-01|           08:22|
# |      2016-01-02|           02:22|
# |      2017-01-01|           02:22|
# |      2018-01-01|           02:22|
# |      2019-01-01|           02:22|
# +----------------+----------------+

renamed = test_df.withColumnRenamed("observation_date", "UTC_observation_date") \
    .withColumnRenamed("observation_time", "UTC_observation_time")

utc_ts = renamed.withColumn("UTC", F.concat(F.col("UTC_observation_date"), 
F.lit(" "), F.col("UTC_observation_time")))

utc_ts.show()
# OUTPUT
# +--------------------+--------------------+----------------+
# |UTC_observation_date|UTC_observation_time|             UTC|
# +--------------------+--------------------+----------------+
# |          2015-01-01|               02:22|2015-01-01 02:22|
# |          2016-01-01|               02:22|2016-01-01 02:22|
# |          2016-01-01|               08:22|2016-01-01 08:22|
# |          2016-01-02|               02:22|2016-01-02 02:22|
# |          2017-01-01|               02:22|2017-01-01 02:22|
# |          2018-01-01|               02:22|2018-01-01 02:22|
# |          2019-01-01|               02:22|2019-01-01 02:22|
# +--------------------+--------------------+----------------+

central_ts = utc_ts.withColumn("central_timestamp", 
F.from_utc_timestamp(utc_ts.UTC, 'US/Central'))

final = central_ts.withColumn('observation_date', 
F.date_format(central_ts.central_timestamp, "YYYY-MM-dd"))\
.withColumn('observation_time', F.date_format(central_ts.central_timestamp, 
"HH:mm"))

final.select('observation_date','observation_time','central_timestamp','UTC').show()
# OUTPUT
# +----------------+----------------+--------------------+----------------+
# |observation_date|observation_time|   central_timestamp|             UTC|
# +----------------+----------------+--------------------+----------------+
# |      2015-12-31|           20:22|2014-12-31 20:22:...|2015-01-01 02:22|
# |      2016-12-31|           20:22|2015-12-31 20:22:...|2016-01-01 02:22|
# |      2016-01-01|           02:22|2016-01-01 02:22:...|2016-01-01 08:22|
# |      2016-01-01|           20:22|2016-01-01 20:22:...|2016-01-02 02:22|
# |      2016-12-31|           20:22|2016-12-31 20:22:...|2017-01-01 02:22|
# |      2018-12-31|           20:22|2017-12-31 20:22:...|2018-01-01 02:22|
# |      2019-12-31|           20:22|2018-12-31 20:22:...|2019-01-01 02:22|
# +----------------+----------------+--------------------+----------------+{code}
The Central Time timestamp is generated correctly by the from_utc_timestamp 
function, but when it is formatted by the date_format function _and_ the timezone 
conversion caused a change in year, the formatted string produced by the 
date_format function has the wrong year.

In summary, this is the incorrect behavior:
{code:java}
UTC timestamp="2016-01-01 02:22" 
--(apply from_utc_timestamp)--> Central timestamp="2015-12-31 20:22" 
--(apply date_format "YYYY-MM-dd")--> date="2016-12-31"{code}
The output should instead be "2015-12-31".
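
For reference, the mismatch can be reproduced in isolation from the wider pipeline. The following is a minimal sketch (it assumes an active SparkSession bound to spark, as in the example above) that feeds a single UTC string through the same two functions:
{code:java}
import pyspark.sql.functions as F

# One UTC timestamp early on January 1st, so the Central Time value falls on
# December 31st of the previous year.
df = spark.createDataFrame([('2016-01-01 02:22',)], ['UTC'])

central = F.from_utc_timestamp(df.UTC, 'US/Central')

df.select(
    central.alias('central_timestamp'),                      # 2015-12-31 20:22:00 (correct)
    F.date_format(central, 'YYYY-MM-dd').alias('formatted')  # 2016-12-31 (expected 2015-12-31)
).show()
{code}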

 

 

 


> date_format provides incorrect year after a timezone conversion changes the 
> year on a timestamp
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24752
>                 URL: https://issues.apache.org/jira/browse/SPARK-24752
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core, SQL
>    Affects Versions: 2.1.2
>            Reporter: Noah Lidell
>            Priority: Minor
>


