[ 
https://issues.apache.org/jira/browse/SPARK-33863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nasir Ali updated SPARK-33863:
------------------------------
    Description: 
*Problem*:

If I create a new column using a UDF, the PySpark UDF changes the timestamps to UTC 
time. I have used the following configs to tell Spark that the timestamps are in UTC:

 
{code:java}
--conf spark.driver.extraJavaOptions=-Duser.timezone=UTC 
--conf spark.executor.extraJavaOptions=-Duser.timezone=UTC
--conf spark.sql.session.timeZone=UTC
{code}
Below is a code snippet to reproduce the error:

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
import datetime
spark = SparkSession.builder.config("spark.sql.session.timeZone",
                                    "UTC").getOrCreate()

df = spark.createDataFrame([("usr1",17.00, "2018-03-10T15:27:18+00:00"),
                            ("usr1",13.00, "2018-03-11T12:27:18+00:00"),
                            ("usr1",25.00, "2018-03-12T11:27:18+00:00"),
                            ("usr1",20.00, "2018-03-13T15:27:18+00:00"),
                            ("usr1",17.00, "2018-03-14T12:27:18+00:00"),
                            ("usr2",99.00, "2018-03-15T11:27:18+00:00"),
                            ("usr2",156.00, "2018-03-22T11:27:18+00:00"),
                            ("usr2",17.00, "2018-03-31T11:27:18+00:00"),
                            ("usr2",25.00, "2018-03-15T11:27:18+00:00"),
                            ("usr2",25.00, "2018-03-16T11:27:18+00:00")
                            ],
                           ["user","id", "ts"])
df = df.withColumn('ts', df.ts.cast('timestamp'))
df.show(truncate=False)

def some_time_udf(i):
    tmp=""
    if  datetime.time(5, 0)<=i.time() < datetime.time(12, 0):
        tmp="Morning -> "+str(i)
    return tmp

udf = F.udf(some_time_udf, StringType())

df.withColumn("day_part", udf(df.ts)).show(truncate=False)


{code}
I have concatenated the timestamp with a string to show that PySpark passes 
the timestamps as UTC.
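For reference, the classification logic inside the UDF can be checked in plain Python, independent of Spark. This is a minimal sketch (the helper name `classify_time` is hypothetical, not from the report) showing what the UDF returns for one of the sample UTC timestamps:

```python
import datetime

def classify_time(ts):
    # Mirror the UDF's check: classify a datetime as "Morning"
    # when its time-of-day falls in [05:00, 12:00).
    t = ts.time()
    if datetime.time(5, 0) <= t < datetime.time(12, 0):
        return "Morning -> " + str(ts)
    return ""

# One of the sample rows: 2018-03-12T11:27:18+00:00, taken as UTC wall-clock time.
ts = datetime.datetime(2018, 3, 12, 11, 27, 18)
print(classify_time(ts))  # "Morning -> 2018-03-12 11:27:18" if the UDF sees UTC
```

If the UDF instead received the timestamp shifted to a local timezone, the string concatenated into `day_part` would show the shifted wall-clock time, which is what the concatenation in the reproduction is meant to reveal.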

  was:
*Problem*:

If I create a new column using a UDF, the PySpark UDF changes the timestamps to UTC 
time. I have used the following configs to tell Spark that the timestamps are in UTC:

 
{code:java}
--conf spark.driver.extraJavaOptions=-Duser.timezone=UTC 
--conf spark.executor.extraJavaOptions=-Duser.timezone=UTC
--conf spark.sql.session.timeZone=UTC
{code}
Below is a code snippet to reproduce the error:

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
import datetime
spark = SparkSession.builder.config("spark.sql.session.timeZone", 
"UTC").getOrCreate()

df = spark.createDataFrame([("usr1",17.00, "2018-03-10T15:27:18+00:00"),
                            ("usr1",13.00, "2018-03-11T12:27:18+00:00"),
                            ("usr1",25.00, "2018-03-12T11:27:18+00:00"),
                            ("usr1",20.00, "2018-03-13T15:27:18+00:00"),
                            ("usr1",17.00, "2018-03-14T12:27:18+00:00"),
                            ("usr2",99.00, "2018-03-15T11:27:18+00:00"),
                            ("usr2",156.00, "2018-03-22T11:27:18+00:00"),
                            ("usr2",17.00, "2018-03-31T11:27:18+00:00"),
                            ("usr2",25.00, "2018-03-15T11:27:18+00:00"),
                            ("usr2",25.00, "2018-03-16T11:27:18+00:00")
                            ],
                           ["user","id", "ts"])
df = df.withColumn('ts', df.ts.cast('timestamp'))

df.show(truncate=False)

def some_time_udf(i):
    tmp=""
    if  datetime.time(5, 0)<=i.time() < datetime.time(12, 0):
        tmp="Morning -> "+str(i)
    elif  datetime.time(12, 0)<=i.time() < datetime.time(17, 0):
        tmp= "Afternoon -> "+str(i)
    elif  datetime.time(17, 0)<=i.time() < datetime.time(21, 0):
        tmp= "Evening -> "+str(i)
    elif  datetime.time(21, 0)<=i.time() < datetime.time(0, 0):
        tmp= "Night -> "+str(i)
    elif  datetime.time(0, 0)<=i.time() < datetime.time(5, 0):
        tmp= "Night -> "+str(i)
    return tmp

sometimeudf = F.udf(some_time_udf, StringType())

df.withColumn("day_part", sometimeudf("ts")).show(truncate=False)

{code}
I have concatenated the timestamp with a string to show that PySpark passes 
the timestamps as UTC.


> Pyspark UDF changes timestamps to UTC
> -------------------------------------
>
>                 Key: SPARK-33863
>                 URL: https://issues.apache.org/jira/browse/SPARK-33863
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.0.1
>         Environment: MAC/Linux
> Standalone cluster / local machine
>            Reporter: Nasir Ali
>            Priority: Major
>
> *Problem*:
> If I create a new column using a UDF, the PySpark UDF changes the timestamps to 
> UTC time. I have used the following configs to tell Spark that the timestamps 
> are in UTC:
>  
> {code:java}
> --conf spark.driver.extraJavaOptions=-Duser.timezone=UTC 
> --conf spark.executor.extraJavaOptions=-Duser.timezone=UTC
> --conf spark.sql.session.timeZone=UTC
> {code}
> Below is a code snippet to reproduce the error:
>  
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
> from pyspark.sql.types import StringType
> import datetime
> spark = SparkSession.builder.config("spark.sql.session.timeZone",
>                                     "UTC").getOrCreate()
>
> df = spark.createDataFrame([("usr1",17.00, "2018-03-10T15:27:18+00:00"),
>                             ("usr1",13.00, "2018-03-11T12:27:18+00:00"),
>                             ("usr1",25.00, "2018-03-12T11:27:18+00:00"),
>                             ("usr1",20.00, "2018-03-13T15:27:18+00:00"),
>                             ("usr1",17.00, "2018-03-14T12:27:18+00:00"),
>                             ("usr2",99.00, "2018-03-15T11:27:18+00:00"),
>                             ("usr2",156.00, "2018-03-22T11:27:18+00:00"),
>                             ("usr2",17.00, "2018-03-31T11:27:18+00:00"),
>                             ("usr2",25.00, "2018-03-15T11:27:18+00:00"),
>                             ("usr2",25.00, "2018-03-16T11:27:18+00:00")
>                             ],
>                            ["user","id", "ts"])
> df = df.withColumn('ts', df.ts.cast('timestamp'))
> df.show(truncate=False)
>
> def some_time_udf(i):
>     tmp=""
>     if  datetime.time(5, 0)<=i.time() < datetime.time(12, 0):
>         tmp="Morning -> "+str(i)
>     return tmp
>
> udf = F.udf(some_time_udf, StringType())
> df.withColumn("day_part", udf(df.ts)).show(truncate=False)
> {code}
> I have concatenated the timestamp with a string to show that PySpark passes 
> the timestamps as UTC.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
