[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

Li Jin (JIRA) Fri, 02 Feb 2018 13:40:43 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350971#comment-16350971
 ]


Li Jin commented on SPARK-23314:
--------------------------------

Hi [~felixcheung]

Thanks for the information. However, I still cannot reproduce the error with 
Python 2.7.14, pandas 0.22.0, and pyarrow 0.8.0.

(I did have to drop the "flight_id" column, though, because its type is parsed 
as decimal.)

Is it possible that you have more than one pandas installation on your path?
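
One way to check for a duplicate installation is sketched below. This is only an illustration, not something I ran as part of the reproduction: the helper names `describe_module` and `find_candidates` are made up for this example, and the scan only looks for a directory or a `.py` file with the given name on each search-path entry.

```python
import importlib
import os
import sys


def describe_module(name):
    """Import `name` and return (version, file path) of the copy that was loaded."""
    module = importlib.import_module(name)
    return getattr(module, "__version__", None), getattr(module, "__file__", None)


def find_candidates(name, search_path=None):
    """Return every search-path entry that holds a package or module called `name`.

    Hypothetical helper: it checks for `<entry>/<name>/` or `<entry>/<name>.py`
    and does not try to cover zip imports or other import hooks.
    """
    entries = sys.path if search_path is None else search_path
    hits = []
    for entry in entries:
        base = os.path.join(entry, name)
        if os.path.isdir(base) or os.path.isfile(base + ".py"):
            hits.append(entry)
    return hits
```

Calling `find_candidates("pandas")` and getting more than one entry back would suggest a second copy is on the path, and `describe_module("pandas")` shows which copy actually gets imported. The Python workers on the executors may resolve a different environment than the driver shell, so the same check would probably need to run there as well.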

 
{code:python}
>>> flights.printSchema()
root
 |-- adshex: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- altitude: integer (nullable = true)
 |-- speed: integer (nullable = true)
 |-- track: integer (nullable = true)
 |-- squawk: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- name: string (nullable = true)
 |-- other_names1: string (nullable = true)
 |-- other_names2: string (nullable = true)
 |-- n_number: string (nullable = true)
 |-- serial_number: string (nullable = true)
 |-- mfr_mdl_code: integer (nullable = true)
 |-- mfr: string (nullable = true)
 |-- model: string (nullable = true)
 |-- year_mfr: integer (nullable = true)
 |-- type_aircraft: integer (nullable = true)
 |-- agency: string (nullable = true)

>>> flights.show()
+------+--------+----------+--------+-----+-----+------+----+-------------------+--------------------+--------------------+--------------------+--------+-------------+------------+--------------------+-----+--------+-------------+------+
|adshex|latitude| longitude|altitude|speed|track|squawk|type|          timestamp|                name|        other_names1|        other_names2|n_number|serial_number|mfr_mdl_code|                 mfr|model|year_mfr|type_aircraft|agency|
+------+--------+----------+--------+-----+-----+------+----+-------------------+--------------------+--------------------+--------------------+--------+-------------+------------+--------------------+-----+--------+-------------+------+
|A72AA1| 33.2552|-117.91699|    5499|  111|  137|  4401|B350|2015-08-18 03:58:54|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|    561A|        FM-36|     4220012|HAWKER BEECHCRAFT...|B300C|    2010|            5|   dhs|
|A72AA1| 33.2659|  -117.928|    5500|  109|  138|  4401|B350|2015-08-18 03:58:39|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|    561A|        FM-36|     4220012|HAWKER BEECHCRAFT...|B300C|    2010|            5|   dhs|
|A72AA1| 33.2741|-117.93599|    5500|  109|  137|  4401|B350|2015-08-18 03:58:28|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|    561A|        FM-36|     4220012|HAWKER BEECHCRAFT...|B300C|    2010|            5|   dhs|
|A72AA1|33.28251|  -117.945|    5500|  112|  138|  4401|B350|2015-08-18 03:58:13|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|    561A|        FM-36|     4220012|HAWKER BEECHCRAFT...|B300C|    2010|            5|   dhs|
|A72AA1|33.29341|-117.95699|    5500|  102|  134|  4401|B350|2015-08-18 03:57:58|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|    561A|        FM-36|     4220012|HAWKER BEECHCRAFT...|B300C|    2010|            5|   dhs|
+------+--------+----------+--------+-----+-----+------+----+-------------------+--------------------+--------------------+--------------------+--------+-------------+------------+--------------------+-----+--------+-------------+------+

>>> from pyspark.sql.functions import pandas_udf, PandasUDFType
>>> @pandas_udf(flights.schema, PandasUDFType.GROUPED_MAP)
... def subtract_mean_year_mfr(pdf):
...     return pdf.assign(year_mfr=pdf.year_mfr - pdf.year_mfr.mean())
...
>>> g = flights.groupby('mfr').apply(subtract_mean_year_mfr)
>>> g.show()
+------+--------+----------+--------+-----+-----+------+----+-------------------+--------------------+--------------------+--------------------+--------+-------------+------------+--------------------+-----+--------+-------------+------+
|adshex|latitude| longitude|altitude|speed|track|squawk|type|          timestamp|                name|        other_names1|        other_names2|n_number|serial_number|mfr_mdl_code|                 mfr|model|year_mfr|type_aircraft|agency|
+------+--------+----------+--------+-----+-----+------+----+-------------------+--------------------+--------------------+--------------------+--------+-------------+------------+--------------------+-----+--------+-------------+------+
|A72AA1| 33.2552|-117.91699|    5499|  111|  137|  4401|B350|2015-08-18 03:58:54|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|    561A|        FM-36|     4220012|HAWKER BEECHCRAFT...|B300C|       0|            5|   dhs|
|A72AA1| 33.2659|  -117.928|    5500|  109|  138|  4401|B350|2015-08-18 03:58:39|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|    561A|        FM-36|     4220012|HAWKER BEECHCRAFT...|B300C|       0|            5|   dhs|
|A72AA1| 33.2741|-117.93599|    5500|  109|  137|  4401|B350|2015-08-18 03:58:28|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|    561A|        FM-36|     4220012|HAWKER BEECHCRAFT...|B300C|       0|            5|   dhs|
|A72AA1|33.28251|  -117.945|    5500|  112|  138|  4401|B350|2015-08-18 03:58:13|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|    561A|        FM-36|     4220012|HAWKER BEECHCRAFT...|B300C|       0|            5|   dhs|
|A72AA1|33.29341|-117.95699|    5500|  102|  134|  4401|B350|2015-08-18 03:57:58|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|    561A|        FM-36|     4220012|HAWKER BEECHCRAFT...|B300C|       0|            5|   dhs|
+------+--------+----------+--------+-----+-----+------+----+-------------------+--------------------+--------------------+--------------------+--------+-------------+------------+--------------------+-----+--------+-------------+------+

>>> import pandas as pd
>>> pd.__version__
u'0.22.0'
>>> import pyarrow as pa
>>> pa.__version__
'0.8.0'
>>> import sys
>>> sys.version_info
sys.version_info(major=2, minor=7, micro=14, releaselevel='final', serial=0)
{code}

> Pandas grouped udf on dataset with timestamp column error 
> ----------------------------------------------------------
>
>                 Key: SPARK-23314
>                 URL: https://issues.apache.org/jira/browse/SPARK-23314
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 2.3.0
>            Reporter: Felix Cheung
>            Priority: Major
>
> Under SPARK-22216
> When testing pandas_udf on group bys, I saw this error with the timestamp 
> column.
> File "pandas/_libs/tslib.pyx", line 3593, in pandas._libs.tslib.tz_localize_to_utc
> AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 01:29:30'), try using the 'ambiguous' argument
> For details, see Comment box. I'm able to reproduce this on the latest 
> branch-2.3 (last change from Feb 1 UTC)
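
For context on why that particular timestamp can be ambiguous, here is a small standard-library sketch. It assumes the session time zone is US Pacific, which the issue does not state; US daylight saving time ended on 2015-11-01, so the 01:00 to 02:00 wall-clock hour occurred twice that night. The fixed offsets `PDT` and `PST` and the helper `utc_candidates` are illustrative names, not anything from Spark or pandas.

```python
from datetime import datetime, timedelta, timezone

# Assumption: US Pacific time; the issue does not say which zone was in use.
PDT = timezone(timedelta(hours=-7), "PDT")  # daylight time, before the switch
PST = timezone(timedelta(hours=-8), "PST")  # standard time, after the switch

# Wall-clock value taken from the error message in the issue description.
wall_clock = datetime(2015, 11, 1, 1, 29, 30)


def utc_candidates(naive):
    """Return the two UTC instants a wall-clock time in the repeated hour may mean."""
    return [naive.replace(tzinfo=tz).astimezone(timezone.utc) for tz in (PDT, PST)]


first, second = utc_candidates(wall_clock)
print(first.isoformat())   # reading the value as daylight time
print(second.isoformat())  # reading the value as standard time
```

The two results are one hour apart, which is the choice pandas declines to make on its own when it raises `AmbiguousTimeError`; the `ambiguous` argument of `tz_localize` exists to make that choice explicit.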



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
