[jira] [Commented] (SPARK-25467) Python date/datetime objects in dataframes increment by 1 day when converted to JSON
[ https://issues.apache.org/jira/browse/SPARK-25467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16627024#comment-16627024 ]

Aleksandr Koriagin commented on SPARK-25467:

In case it helps, the last incorrectly handled date appears to be:
- {{datetime.date(year=1582, month=10, day=14)}} or {{datetime.date.fromordinal(577735)}}

and the first correctly handled date is:
- {{datetime.date(year=1582, month=10, day=15)}} or {{datetime.date.fromordinal(577736)}}

{noformat}
{"date":{"1582-10-12_(577733)":"1582-10-02"}}
{"date":{"1582-10-13_(577734)":"1582-10-03"}}
{"date":{"1582-10-14_(577735)":"1582-10-04"}}
# dates after are ok
{"date":{"1582-10-15_(577736)":"1582-10-15"}}
{"date":{"1582-10-16_(577737)":"1582-10-16"}}
{"date":{"1582-10-17_(577738)":"1582-10-17"}}
{noformat}

How to find it:

{code:python}
import datetime
import itertools

from pyspark.sql import Row

# Find the approximate year
rows = []
for year in range(1, 2001):
    dt = datetime.date(year=year, month=1, day=1)
    rows.append(Row(date={'{0}_({1})'.format(dt, dt.toordinal()): dt}))

for line in sqlContext.createDataFrame(rows).toJSON().collect():
    print(line)
# {"date":{"1580-01-01_(576718)":"1579-12-22"}}
# {"date":{"1581-01-01_(577084)":"1580-12-22"}}
# {"date":{"1582-01-01_(577449)":"1581-12-22"}}
# dates after are ok
# {"date":{"1583-01-01_(577814)":"1583-01-01"}}
# {"date":{"1584-01-01_(578179)":"1584-01-01"}}
# {"date":{"1585-01-01_(578545)":"1585-01-01"}}

# Find the approximate date
years = range(1580, 1584)
days = range(1, 2)
months = range(1, 13)
rows = []
for date in itertools.product(years, months, days):
    dt = datetime.date(*date)
    rows.append(Row(date={'{0}_({1})'.format(dt, dt.toordinal()): dt}))

for line in sqlContext.createDataFrame(rows).toJSON().collect():
    print(line)
# {"date":{"1582-09-01_(577692)":"1582-08-22"}}
# {"date":{"1582-10-01_(577722)":"1582-09-21"}}
# dates after are ok
# {"date":{"1582-11-01_(577753)":"1582-11-01"}}
# {"date":{"1582-12-01_(577783)":"1582-12-01"}}

# Find the exact last bad date
rows = []
for orddate in range(577722, 577784):
    dt = datetime.date.fromordinal(orddate)
    rows.append(Row(date={'{0}_({1})'.format(dt, dt.toordinal()): dt}))

for line in sqlContext.createDataFrame(rows).toJSON().collect():
    print(line)
# {"date":{"1582-10-11_(577732)":"1582-10-01"}}
# {"date":{"1582-10-12_(577733)":"1582-10-02"}}
# {"date":{"1582-10-13_(577734)":"1582-10-03"}}
# {"date":{"1582-10-14_(577735)":"1582-10-04"}}
# dates after are ok
# {"date":{"1582-10-15_(577736)":"1582-10-15"}}
# {"date":{"1582-10-16_(577737)":"1582-10-16"}}
# {"date":{"1582-10-17_(577738)":"1582-10-17"}}
# {"date":{"1582-10-18_(577739)":"1582-10-18"}}
{code}

> Python date/datetime objects in dataframes increment by 1 day when converted
> to JSON
>
>                 Key: SPARK-25467
>                 URL: https://issues.apache.org/jira/browse/SPARK-25467
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.3.1
>         Environment: Spark 2.3.1
> Python 3.6.5 | packaged by conda-forge | (default, Apr 6 2018, 13:39:56)
> [GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]
> openjdk version "1.8.0_181"
> OpenJDK Runtime Environment (build 1.8.0_181-b13)
> OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
> CentOS 7 3.10.0-862.11.6.el7.x86_64 #1 SMP Tue Aug 14 21:49:04 UTC 2018
> x86_64 x86_64 GNU/Linux
>            Reporter: David V. Hill
>            Priority: Major
>
> When a DataFrame contains datetime.date or datetime.datetime instances and
> toJSON() is called on it, the day is incremented in the JSON date
> representation.
> {code}
> # Create a DataFrame containing datetime.date instances, convert to JSON and display
> rows = [Row(cx=1, cy=2, dates=[datetime.date.fromordinal(1), datetime.date.fromordinal(2)])]
> df = sqc.createDataFrame(rows)
> df.collect()
> [Row(cx=1, cy=2, dates=[datetime.date(1, 1, 1), datetime.date(1, 1, 2)])]
> df.toJSON().collect()
> ['{"cx":1,"cy":2,"dates":["0001-01-03","0001-01-04"]}']
> # Issue also occurs with datetime.datetime instances
> rows = [Row(cx=1, cy=2, dates=[datetime.datetime.fromordinal(1), datetime.datetime.fromordinal(2)])]
> df = sqc.createDataFrame(rows)
> df.collect()
> [Row(cx=1, cy=2, dates=[datetime.datetime(1, 1, 1, 0, 0, fold=1), datetime.datetime(1, 1, 2, 0, 0)])]
> df.toJSON().collect()
> ['{"cx":1,"cy":2,"dates":["0001-01-02T23:50:36.000-06:00","0001-01-03T23:50:36.000-06:00"]}']
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail:
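The boundary found above coincides with the Gregorian calendar cutover of 1582-10-15. A plausible explanation (an assumption, not confirmed in this thread) is that the JVM side goes through {{java.sql.Date}}, whose hybrid Julian/Gregorian calendar drops the 10 days removed at the cutover, while Python's {{datetime}} is proleptic Gregorian throughout. The observed errors near the cutover are all consistent with a 10-day shift:

```python
import datetime

# Input date -> JSON value, taken from the output above.
observed = {
    datetime.date(1582, 10, 14): datetime.date(1582, 10, 4),
    datetime.date(1582, 10, 1): datetime.date(1582, 9, 21),
    datetime.date(1580, 1, 1): datetime.date(1579, 12, 22),
}

# Each wrong value lags the input by exactly the 10 days skipped at the
# 1582 Julian -> Gregorian cutover.
for good, bad in observed.items():
    print(good, '->', bad, '| shift =', (good - bad).days)  # shift = 10
```

For much earlier dates the offset differs (the original report shows 0001-01-01 coming back as 0001-01-03), which also fits a Julian/Gregorian mismatch, since the difference between the two calendars changes by roughly 3 days every 400 years.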
[jira] [Commented] (SPARK-25467) Python date/datetime objects in dataframes increment by 1 day when converted to JSON
[ https://issues.apache.org/jira/browse/SPARK-25467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626902#comment-16626902 ]

Aleksandr Koriagin commented on SPARK-25467:

Can be reproduced with HDP (2.6.5.0-292) with Spark 2.3.0:

{code:python}
import datetime

from pyspark.sql import Row

date = datetime.date.fromordinal(1)
print(date)  # >> '0001-01-01'

a = [Row(date=date)]
sqlContext.createDataFrame(a).toJSON().collect()  # >> [u'{"date":"0001-01-03"}']
{code}

Here is the part of the code where the issue probably happens:
https://github.com/apache/spark/blob/7d8f5b62c57c9e2903edd305e8b9c5400652fdb0/python/pyspark/sql/session.py#L748

{code:python}
if isinstance(data, RDD):
    rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
else:
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
# ipdb> rdd.collect() --> [(-719162,)]
# See "sql.types.DateType#toInternal" about the '-719162' value:
# https://github.com/apache/spark/blob/7d8f5b62c57c9e2903edd305e8b9c5400652fdb0/python/pyspark/sql/types.py#L161
# The value '-719162' arises because:
# datetime.date.fromordinal(1).toordinal() - datetime.datetime(1970, 1, 1).toordinal() = -719162
#
# ipdb> schema --> StructType(List(StructField(date,DateType,true)))
# ipdb> schema.json() --> '{"fields":[{"metadata":{},"name":"date","nullable":true,"type":"date"}],"type":"struct"}'
# Everything up to this point looks correct.

jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
# After the 'applySchemaToPythonRDD' transformation the value is incorrect: '0001-01-03'.
# ipdb> jdf.show()
# +----------+
# |      date|
# +----------+
# |0001-01-03|  <<-- should be '0001-01-01'
# +----------+
#
# The issue seems to happen on the Java/Scala side:
# https://github.com/apache/spark/blob/2a0a8f753bbdc8c251f8e699c0808f35b94cfd20/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L734

df = DataFrame(jdf, self._wrapped)
df._schema = schema
return df
{code}
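The {{-719162}} value above is easy to reproduce: per the linked {{DateType#toInternal}} source, a date is serialized as days since the Unix epoch. A minimal sketch of the same arithmetic (the name {{date_to_internal}} is illustrative, not Spark's):

```python
import datetime

# Ordinal of the Unix epoch in Python's proleptic Gregorian calendar.
EPOCH_ORDINAL = datetime.date(1970, 1, 1).toordinal()  # 719163

def date_to_internal(d):
    # Mirrors the arithmetic in pyspark.sql.types.DateType.toInternal:
    # a date is stored internally as days since 1970-01-01.
    return d.toordinal() - EPOCH_ORDINAL

print(date_to_internal(datetime.date.fromordinal(1)))   # -719162
print(date_to_internal(datetime.date(1970, 1, 1)))      # 0
print(date_to_internal(datetime.date(1582, 10, 15)))    # -141427
```

Since both endpoints of this subtraction use the proleptic Gregorian calendar, the integer sent to the JVM is internally consistent; the discrepancy only appears once the JVM reinterprets that day count through its own calendar.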
[jira] [Commented] (SPARK-25467) Python date/datetime objects in dataframes increment by 1 day when converted to JSON
[ https://issues.apache.org/jira/browse/SPARK-25467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16621391#comment-16621391 ]

holdenk commented on SPARK-25467:

cc [~bryanc]