[ 
https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349897#comment-16349897
 ] 

Felix Cheung commented on SPARK-23314:
--------------------------------------

Repro code:

>>> flights = (spark.read
...     .option("inferSchema", True)
...     .option("header", True)
...     .option("dateFormat", "yyyy-MM-dd HH:mm:ss")
...     .csv("data*.csv"))
>>> from pyspark.sql.functions import pandas_udf, PandasUDFType
>>> @pandas_udf(flights.schema, PandasUDFType.GROUPED_MAP)
... def subtract_mean_year_mfr(pdf):
...     # Center year_mfr within each 'mfr' group
...     return pdf.assign(year_mfr=pdf.year_mfr - pdf.year_mfr.mean())
...
>>> g = flights.groupby('mfr').apply(subtract_mean_year_mfr)
>>> g.count()

> Pandas grouped udf on dataset with timestamp column error 
> ----------------------------------------------------------
>
>                 Key: SPARK-23314
>                 URL: https://issues.apache.org/jira/browse/SPARK-23314
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.0
>            Reporter: Felix Cheung
>            Priority: Major
>
> Under SPARK-22216.
> When testing pandas_udf on grouped data, I saw this error on a dataset with
> a timestamp column:
> AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 01:29:30'), try using the 'ambiguous' argument
> For details on the repro, see the comments.
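
For context on why this particular value trips the conversion: 2015-11-01 01:29:30 falls in the hour that US clocks repeated when daylight saving time ended in 2015, so the wall-clock time alone maps to two different instants. The sketch below illustrates that using only the Python standard library. The time zone America/Los_Angeles and the helper name is_ambiguous are assumptions made for illustration; the issue does not state the session time zone, and this is not the code path Spark or pandas uses.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

# Assumption: the issue does not say which zone was in effect; any US zone
# that observes DST repeats the 01:00-02:00 hour on 2015-11-01.
tz = ZoneInfo("America/Los_Angeles")

def is_ambiguous(naive, zone):
    """Return True if the naive wall-clock time occurs twice in `zone`.

    Hypothetical helper for illustration: it compares the UTC offsets of
    the two possible readings (fold=0 and fold=1) of the same wall time.
    """
    first = naive.replace(tzinfo=zone, fold=0)
    second = naive.replace(tzinfo=zone, fold=1)
    return first.utcoffset() != second.utcoffset()

# The timestamp from the error message.
naive = datetime(2015, 11, 1, 1, 29, 30)
first = naive.replace(tzinfo=tz, fold=0)   # first pass through 01:29:30 (daylight time)
second = naive.replace(tzinfo=tz, fold=1)  # second pass through 01:29:30 (standard time)

print("first reading offset: ", first.utcoffset())
print("second reading offset:", second.utcoffset())
print("ambiguous:", is_ambiguous(naive, tz))

# A time later the same morning has only one reading.
print("03:00 ambiguous:", is_ambiguous(datetime(2015, 11, 1, 3, 0, 0), tz))
```

A naive timestamp in that hour carries no information about which of the two readings is meant, which is consistent with the error message suggesting the 'ambiguous' argument.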



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

