[ https://issues.apache.org/jira/browse/SPARK-25467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626902#comment-16626902 ]
Aleksandr Koriagin edited comment on SPARK-25467 at 9/25/18 11:46 AM:
----------------------------------------------------------------------

Can be reproduced with HDP (2.6.5.0-292) with Spark 2.3.0:

{code:python}
import datetime
from pyspark.sql import Row

date = datetime.date.fromordinal(1)
print date  # >> '0001-01-01'
a = [Row(date=date)]
sqlContext.createDataFrame(a).toJSON().collect()
# >> [u'{"date":"0001-01-03"}']
{code}

Here is the part of the code where the issue probably happens:
https://github.com/apache/spark/blob/7d8f5b62c57c9e2903edd305e8b9c5400652fdb0/python/pyspark/sql/session.py#L750

{code:python}
if isinstance(data, RDD):
    rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
else:
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
# ipdb> rdd.collect() --> [(-719162,)]
# See "sql.types.DateType#toInternal" about the '-719162' value:
# https://github.com/apache/spark/blob/7d8f5b62c57c9e2903edd305e8b9c5400652fdb0/python/pyspark/sql/types.py#L161
# The value is '-719162' because:
# datetime.date.fromordinal(1).toordinal() - datetime.datetime(1970, 1, 1).toordinal() = -719162
#
# ipdb> schema --> StructType(List(StructField(date,DateType,true)))
# ipdb> schema.json() --> '{"fields":[{"metadata":{},"name":"date","nullable":true,"type":"date"}],"type":"struct"}'
# Everything up to here looks correct.

jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
# ipdb> jrdd.rdd().collect()[0][0] --> -719162

jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
# After the 'applySchemaToPythonRDD' transformation the value is incorrect: '0001-01-03'.
# ipdb> jdf.show()
# +----------+
# |      date|
# +----------+
# |0001-01-03|  <<-- should be '0001-01-01'
# +----------+
#
# So the issue seems to happen in the Java/Scala part:
# https://github.com/apache/spark/blob/2a0a8f753bbdc8c251f8e699c0808f35b94cfd20/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L734

df = DataFrame(jdf, self._wrapped)
df._schema = schema
return df
{code}


> Python date/datetime objects in dataframes increment by 1 day when converted to JSON
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-25467
>                 URL: https://issues.apache.org/jira/browse/SPARK-25467
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.3.1
>        Environment: Spark 2.3.1
> Python 3.6.5 | packaged by conda-forge | (default, Apr 6 2018, 13:39:56)
> [GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]
> openjdk version "1.8.0_181"
> OpenJDK Runtime Environment (build 1.8.0_181-b13)
> OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
> CentOS 7 3.10.0-862.11.6.el7.x86_64 #1 SMP Tue Aug 14 21:49:04 UTC 2018 x86_64 x86_64 GNU/Linux
>            Reporter: David V. Hill
>            Priority: Major
>
> When a DataFrame contains datetime.date or datetime.datetime instances and toJSON() is called on it, the day is incremented in the JSON date representation.
> {code}
> # Create a DataFrame containing datetime.date instances, convert to JSON and display
> rows = [Row(cx=1, cy=2, dates=[datetime.date.fromordinal(1), datetime.date.fromordinal(2)])]
> df = sqc.createDataFrame(rows)
> df.collect()
> [Row(cx=1, cy=2, dates=[datetime.date(1, 1, 1), datetime.date(1, 1, 2)])]
> df.toJSON().collect()
> ['{"cx":1,"cy":2,"dates":["0001-01-03","0001-01-04"]}']
>
> # Issue also occurs with datetime.datetime instances
> rows = [Row(cx=1, cy=2, dates=[datetime.datetime.fromordinal(1), datetime.datetime.fromordinal(2)])]
> df = sqc.createDataFrame(rows)
> df.collect()
> [Row(cx=1, cy=2, dates=[datetime.datetime(1, 1, 1, 0, 0, fold=1), datetime.datetime(1, 1, 2, 0, 0)])]
> df.toJSON().collect()
> ['{"cx":1,"cy":2,"dates":["0001-01-02T23:50:36.000-06:00","0001-01-03T23:50:36.000-06:00"]}']
> {code}
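
For context: the two-day shift is consistent with the JVM side interpreting the internal day count using a hybrid Julian/Gregorian calendar (as java.sql.Date does), while Python's datetime uses the proleptic Gregorian calendar. This is a hypothesis sketched in plain Python, not code from Spark; the julian_to_jdn helper is just the standard formula for the Julian Day Number of a date given in the Julian calendar:

{code:python}
import datetime

# Spark stores a DateType value as days since the Unix epoch
# (see sql.types.DateType#toInternal). For 0001-01-01 that is -719162:
epoch = datetime.date(1970, 1, 1)
days = datetime.date.fromordinal(1).toordinal() - epoch.toordinal()
assert days == -719162

# Python's datetime uses the proleptic Gregorian calendar. A hybrid
# calendar that is Julian before 1582 labels the very same physical day
# differently: around year 1 the Julian label is two days later.
def julian_to_jdn(year, month, day):
    # Julian Day Number of a date expressed in the *Julian* calendar
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083

JDN_UNIX_EPOCH = 2440588  # Julian Day Number of Gregorian 1970-01-01

# The day lying 719162 days before the epoch carries the Julian label
# 0001-01-03 -- exactly the wrong date seen in toJSON() above:
assert julian_to_jdn(1, 1, 3) - JDN_UNIX_EPOCH == -719162
{code}

If this hypothesis is right, any date before the Gregorian cutover (1582-10-15) would presumably shift by some number of days, which matches the examples above.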