[ https://issues.apache.org/jira/browse/SPARK-16394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15906266#comment-15906266 ]
Keith Bourgoin commented on SPARK-16394:
----------------------------------------

We've been having the same issue. To illustrate, this is what happens when you put datetimes first into an RDD and then a DataFrame:

{code}
In [10]: rdd = sc.parallelize([(1, 2, dt.datetime.now(), pytz.utc.localize(dt.datetime.now()))])

In [11]: rdd.collect()
Out[11]: [(1, 2, datetime.datetime(2017, 3, 11, 12, 40, 25, 834984), datetime.datetime(2017, 3, 11, 12, 40, 25, 834996, tzinfo=<UTC>))]

In [12]: df = sqlContext.createDataFrame(rdd)

In [13]: df.collect()
Out[13]: [Row(_1=1, _2=2, _3=datetime.datetime(2017, 3, 11, 12, 40, 25, 834984), _4=datetime.datetime(2017, 3, 11, 7, 40, 25, 834996))]
{code}

The datetime in the DataFrame has lost its timezone information and now holds a different value. This is incredibly confusing, and I just lost about a day trying to figure out exactly what was going on. As the previous commenter said, the only way around this is to use a string or int representation of the value, but if you do that, you lose all of Spark's date-related functionality. IMO, this is more of a Major bug than a Minor one.

> Timestamp conversion error in pyspark.sql.Row because of timezones
> ------------------------------------------------------------------
>
>                 Key: SPARK-16394
>                 URL: https://issues.apache.org/jira/browse/SPARK-16394
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.1
>            Reporter: Martin Tapp
>            Priority: Minor
>
> We use DataFrame.map to convert each row to a dictionary using Row.asDict().
> The problem occurs when a Timestamp column is converted. It seems the
> Timestamp gets converted to a naive Python datetime. This causes processing
> errors, since all naive datetimes get adjusted to the process's timezone. For
> instance, a Timestamp with a time of midnight sees its time bounce based on
> the local timezone (+/- x hours).
> Current fix is to apply the pytz.utc timezone to each datetime instance.
> Proposed solution is to make all datetime instances timezone-aware, using the
> pytz.utc timezone.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
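The workaround mentioned in the issue (reattaching the UTC timezone to each naive datetime coming back from collect()) can be sketched in plain Python. This sketch uses the stdlib datetime.timezone.utc rather than pytz, and the helper name make_utc_aware is illustrative, not part of any Spark API:

```python
import datetime as dt

def make_utc_aware(naive):
    # Treat a naive datetime (as returned by DataFrame.collect())
    # as UTC by attaching tzinfo without shifting the clock time.
    return naive.replace(tzinfo=dt.timezone.utc)

ts = dt.datetime(2017, 3, 11, 12, 40, 25, 834996)
aware = make_utc_aware(ts)
print(aware.isoformat())  # 2017-03-11T12:40:25.834996+00:00
```

Note that this only relabels the value; it does not undo the local-timezone shift Spark applied during conversion, so rows collected in a non-UTC process would still need an offset correction.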