[ https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350971#comment-16350971 ]
Li Jin commented on SPARK-23314:
--------------------------------

Hi [~felixcheung]

Thanks for the information. However, I still cannot reproduce with python2, pandas 0.22.0 and pyarrow 0.8.0 ...

(Although I do have to drop the "flight_id" column because the type is parsed to decimal)

Is it possible you have more than one pandas on your path?

{code:java}
>>> flights.printSchema()
root
 |-- adshex: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- altitude: integer (nullable = true)
 |-- speed: integer (nullable = true)
 |-- track: integer (nullable = true)
 |-- squawk: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- name: string (nullable = true)
 |-- other_names1: string (nullable = true)
 |-- other_names2: string (nullable = true)
 |-- n_number: string (nullable = true)
 |-- serial_number: string (nullable = true)
 |-- mfr_mdl_code: integer (nullable = true)
 |-- mfr: string (nullable = true)
 |-- model: string (nullable = true)
 |-- year_mfr: integer (nullable = true)
 |-- type_aircraft: integer (nullable = true)
 |-- agency: string (nullable = true)

>>> flights.show()
+------+--------+----------+--------+-----+-----+------+----+-------------------+--------------------+--------------------+--------------------+--------+-------------+------------+--------------------+-----+--------+-------------+------+
|adshex|latitude| longitude|altitude|speed|track|squawk|type|          timestamp|                name|        other_names1|        other_names2|n_number|serial_number|mfr_mdl_code|                 mfr|model|year_mfr|type_aircraft|agency|
+------+--------+----------+--------+-----+-----+------+----+-------------------+--------------------+--------------------+--------------------+--------+-------------+------------+--------------------+-----+--------+-------------+------+
|A72AA1| 33.2552|-117.91699|    5499|  111|  137|  4401|B350|2015-08-18 03:58:54|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|    561A|        FM-36|     4220012|HAWKER BEECHCRAFT...|B300C|    2010|            5|   dhs|
|A72AA1| 33.2659|  -117.928|    5500|  109|  138|  4401|B350|2015-08-18 03:58:39|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|    561A|        FM-36|     4220012|HAWKER BEECHCRAFT...|B300C|    2010|            5|   dhs|
|A72AA1| 33.2741|-117.93599|    5500|  109|  137|  4401|B350|2015-08-18 03:58:28|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|    561A|        FM-36|     4220012|HAWKER BEECHCRAFT...|B300C|    2010|            5|   dhs|
|A72AA1|33.28251|  -117.945|    5500|  112|  138|  4401|B350|2015-08-18 03:58:13|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|    561A|        FM-36|     4220012|HAWKER BEECHCRAFT...|B300C|    2010|            5|   dhs|
|A72AA1|33.29341|-117.95699|    5500|  102|  134|  4401|B350|2015-08-18 03:57:58|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|    561A|        FM-36|     4220012|HAWKER BEECHCRAFT...|B300C|    2010|            5|   dhs|
+------+--------+----------+--------+-----+-----+------+----+-------------------+--------------------+--------------------+--------------------+--------+-------------+------------+--------------------+-----+--------+-------------+------+

>>> from pyspark.sql.functions import pandas_udf, PandasUDFType
>>> @pandas_udf(flights.schema, PandasUDFType.GROUPED_MAP)
... def subtract_mean_year_mfr(pdf):
...     return pdf.assign(year_mfr=pdf.year_mfr - pdf.year_mfr.mean())
...
>>> g = flights.groupby('mfr').apply(subtract_mean_year_mfr)
>>> g.show()
+------+--------+----------+--------+-----+-----+------+----+-------------------+--------------------+--------------------+--------------------+--------+-------------+------------+--------------------+-----+--------+-------------+------+
|adshex|latitude| longitude|altitude|speed|track|squawk|type|          timestamp|                name|        other_names1|        other_names2|n_number|serial_number|mfr_mdl_code|                 mfr|model|year_mfr|type_aircraft|agency|
+------+--------+----------+--------+-----+-----+------+----+-------------------+--------------------+--------------------+--------------------+--------+-------------+------------+--------------------+-----+--------+-------------+------+
|A72AA1| 33.2552|-117.91699|    5499|  111|  137|  4401|B350|2015-08-18 03:58:54|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|    561A|        FM-36|     4220012|HAWKER BEECHCRAFT...|B300C|       0|            5|   dhs|
|A72AA1| 33.2659|  -117.928|    5500|  109|  138|  4401|B350|2015-08-18 03:58:39|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|    561A|        FM-36|     4220012|HAWKER BEECHCRAFT...|B300C|       0|            5|   dhs|
|A72AA1| 33.2741|-117.93599|    5500|  109|  137|  4401|B350|2015-08-18 03:58:28|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|    561A|        FM-36|     4220012|HAWKER BEECHCRAFT...|B300C|       0|            5|   dhs|
|A72AA1|33.28251|  -117.945|    5500|  112|  138|  4401|B350|2015-08-18 03:58:13|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|    561A|        FM-36|     4220012|HAWKER BEECHCRAFT...|B300C|       0|            5|   dhs|
|A72AA1|33.29341|-117.95699|    5500|  102|  134|  4401|B350|2015-08-18 03:57:58|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|    561A|        FM-36|     4220012|HAWKER BEECHCRAFT...|B300C|       0|            5|   dhs|
+------+--------+----------+--------+-----+-----+------+----+-------------------+--------------------+--------------------+--------------------+--------+-------------+------------+--------------------+-----+--------+-------------+------+

>>> import pandas as pd
>>> pd.__version__
u'0.22.0'
>>> import pyarrow as pa
>>> pa.__version__
'0.8.0'
>>> sys.version_info
sys.version_info(major=2, minor=7, micro=14, releaselevel='final', serial=0)
{code}

> Pandas grouped udf on dataset with timestamp column error
> ----------------------------------------------------------
>
>                 Key: SPARK-23314
>                 URL: https://issues.apache.org/jira/browse/SPARK-23314
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 2.3.0
>            Reporter: Felix Cheung
>            Priority: Major
>
> Under SPARK-22216
> When testing pandas_udf on group bys, I saw this error with the timestamp column.
> File "pandas/_libs/tslib.pyx", line 3593, in pandas._libs.tslib.tz_localize_to_utc
> AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 01:29:30'), try using the 'ambiguous' argument
> For details, see Comment box. I'm able to reproduce this on the latest branch-2.3 (last change from Feb 1 UTC)
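
On the question in the comment about having more than one pandas on the path: below is a minimal sketch of one way to check, using only the standard library. The helper name `find_package_copies` is hypothetical (it is not part of pandas or PySpark), and it only looks for plain package directories or `.py` files on `sys.path`, so it would miss zip imports and namespace packages.

```python
import os
import sys


def find_package_copies(name, search_path=None):
    """Return each entry of search_path that holds an importable `name`.

    Hypothetical helper for illustration only. It checks for
    <entry>/<name>/__init__.py or <entry>/<name>.py and nothing else.
    """
    if search_path is None:
        search_path = sys.path
    hits = []
    for entry in search_path:
        # An empty string on sys.path means the current directory.
        base = entry or os.getcwd()
        is_package = os.path.isfile(os.path.join(base, name, "__init__.py"))
        is_module = os.path.isfile(os.path.join(base, name + ".py"))
        if (is_package or is_module) and base not in hits:
            hits.append(base)
    return hits


if __name__ == "__main__":
    # Print every location on sys.path that provides a pandas package.
    for location in find_package_copies("pandas"):
        print(location)
```

If this prints more than one location, the first is the one a plain `import pandas` would normally pick up, and `pd.__file__` in the same interpreter shows which copy was actually loaded. The Python workers that run a grouped UDF may use a different interpreter than the driver, so it may be worth running the same check there as well.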