[jira] [Commented] (SPARK-25467) Python date/datetime objects in dataframes increment by 1 day when converted to JSON

2018-09-25 Thread Aleksandr Koriagin (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16627024#comment-16627024 ]

Aleksandr Koriagin commented on SPARK-25467:


Just in case this helps: the last incorrectly handled date seems to be
- {{datetime.date(year=1582, month=10, day=14)}}, i.e.
{{datetime.date.fromordinal(577735)}}

and the first correctly handled date is:
- {{datetime.date(year=1582, month=10, day=15)}}, i.e.
{{datetime.date.fromordinal(577736)}}

{noformat}
{"date":{"1582-10-12_(577733)":"1582-10-02"}}
{"date":{"1582-10-13_(577734)":"1582-10-03"}}
{"date":{"1582-10-14_(577735)":"1582-10-04"}}
# dates after are ok
{"date":{"1582-10-15_(577736)":"1582-10-15"}}
{"date":{"1582-10-16_(577737)":"1582-10-16"}}
{"date":{"1582-10-17_(577738)":"1582-10-17"}}
{noformat}
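A likely explanation for that boundary: 1582-10-15 is the first day of the Gregorian calendar (the day after Julian 1582-10-04), and Java's legacy calendar classes switch to the Julian calendar for dates before that cutover. The wrong values above are exactly the Julian-calendar labels of the same physical days. A small pure-Python sketch (helper names are my own) that reproduces the pre-cutover output:

```python
import datetime

def julian_from_ordinal(n):
    """Proleptic-Julian date (y, m, d) for day count n, where day 1 = Julian 0001-01-01."""
    cycles, rem = divmod(n - 1, 1461)  # a 4-year Julian cycle = 3*365 + 366 days
    year = 4 * cycles + 1
    for ylen in (365, 365, 365, 366):  # years 4k+1..4k+3 are common, 4k+4 is leap
        if rem < ylen:
            break
        rem -= ylen
        year += 1
    mdays = (31, 29 if year % 4 == 0 else 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
    month = 1
    for dm in mdays:
        if rem < dm:
            break
        rem -= dm
        month += 1
    return year, month, rem + 1

def julian_label(d):
    # The Julian epoch 0001-01-01 falls two days before the proleptic-Gregorian
    # epoch, so the same physical day has a Julian day count of toordinal() + 2.
    return '%04d-%02d-%02d' % julian_from_ordinal(d.toordinal() + 2)

# Matches the wrong JSON values for every pre-cutover date:
print(julian_label(datetime.date(1, 1, 1)))        # 0001-01-03
print(julian_label(datetime.date(1582, 10, 14)))   # 1582-10-04
```

Post-cutover dates come out of Spark unchanged, which is consistent with a hybrid (Julian-then-Gregorian) calendar being applied on the JVM side.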

How to find it:
{code:python}
import datetime
import itertools
from pyspark.sql import Row

# Find the approximate year
rows = []
for year in range(1, 2001):
    dt = datetime.date(year=year, month=1, day=1)
    rows.append(
        Row(date={'{0}_({1})'.format(dt, dt.toordinal()): dt})
    )
for line in sqlContext.createDataFrame(rows).toJSON().collect():
    print(line)

# {"date":{"1580-01-01_(576718)":"1579-12-22"}}
# {"date":{"1581-01-01_(577084)":"1580-12-22"}}
# {"date":{"1582-01-01_(577449)":"1581-12-22"}}
# dates after are ok
# {"date":{"1583-01-01_(577814)":"1583-01-01"}}
# {"date":{"1584-01-01_(578179)":"1584-01-01"}}
# {"date":{"1585-01-01_(578545)":"1585-01-01"}}


# Find the approximate date
years = range(1580, 1584)
days = range(1, 2)
months = range(1, 13)

rows = []
for date in itertools.product(years, months, days):
    dt = datetime.date(*date)
    rows.append(
        Row(date={'{0}_({1})'.format(dt, dt.toordinal()): dt})
    )
for line in sqlContext.createDataFrame(rows).toJSON().collect():
    print(line)

# {"date":{"1582-09-01_(577692)":"1582-08-22"}}
# {"date":{"1582-10-01_(577722)":"1582-09-21"}}
# dates after are ok
# {"date":{"1582-11-01_(577753)":"1582-11-01"}}
# {"date":{"1582-12-01_(577783)":"1582-12-01"}}


# Find the exact last bad date
rows = []
for orddate in range(577722, 577784):
    dt = datetime.date.fromordinal(orddate)
    rows.append(
        Row(date={'{0}_({1})'.format(dt, dt.toordinal()): dt})
    )
for line in sqlContext.createDataFrame(rows).toJSON().collect():
    print(line)

# {"date":{"1582-10-11_(577732)":"1582-10-01"}}
# {"date":{"1582-10-12_(577733)":"1582-10-02"}}
# {"date":{"1582-10-13_(577734)":"1582-10-03"}}
# {"date":{"1582-10-14_(577735)":"1582-10-04"}}
#  dates after are ok
# {"date":{"1582-10-15_(577736)":"1582-10-15"}}
# {"date":{"1582-10-16_(577737)":"1582-10-16"}}
# {"date":{"1582-10-17_(577738)":"1582-10-17"}}
# {"date":{"1582-10-18_(577739)":"1582-10-18"}}
{code}
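Since every date up to the cutover is wrong and every date after it is right, the three sweeps above could also be collapsed into one binary search over ordinals. A sketch with the Spark round-trip stubbed out (in the real predicate you would serialize {{datetime.date.fromordinal(n)}} through {{createDataFrame(...).toJSON()}} and compare it with {{str(date)}}; 577736 is the boundary found above):

```python
import datetime

def first_correct_ordinal(is_correct, lo=1, hi=800000):
    """Binary search for the smallest ordinal whose date round-trips correctly."""
    while lo < hi:
        mid = (lo + hi) // 2
        if is_correct(mid):
            hi = mid       # mid is correct; boundary is at mid or earlier
        else:
            lo = mid + 1   # mid is wrong; boundary is strictly later
    return lo

# Stub predicate standing in for the Spark round-trip check:
def simulated(n):
    return n >= 577736

boundary = first_correct_ordinal(simulated)
print(datetime.date.fromordinal(boundary))  # 1582-10-15
```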




> Python date/datetime objects in dataframes increment by 1 day when converted 
> to JSON
> 
>
> Key: SPARK-25467
> URL: https://issues.apache.org/jira/browse/SPARK-25467
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.1
> Environment: Spark 2.3.1
> Python 3.6.5 | packaged by conda-forge | (default, Apr  6 2018, 13:39:56) 
> [GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]
> openjdk version "1.8.0_181"
> OpenJDK Runtime Environment (build 1.8.0_181-b13)
> OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
> Centos 7 3.10.0-862.11.6.el7.x86_64 #1 SMP Tue Aug 14 21:49:04 UTC 2018 
> x86_64 x86_64 GNU/Linux
>Reporter: David V. Hill
>Priority: Major
>
> When a DataFrame contains datetime.date or datetime.datetime instances and 
> toJSON() is called on it, the day is incremented in the JSON date 
> representation.
> {code}
> # Create a Dataframe containing datetime.date instances, convert to JSON and display
> rows = [Row(cx=1, cy=2, dates=[datetime.date.fromordinal(1), datetime.date.fromordinal(2)])]
> df = sqc.createDataFrame(rows)
> df.collect()
> [Row(cx=1, cy=2, dates=[datetime.date(1, 1, 1), datetime.date(1, 1, 2)])]
> df.toJSON().collect()
> ['{"cx":1,"cy":2,"dates":["0001-01-03","0001-01-04"]}']
> # Issue also occurs with datetime.datetime instances
> rows = [Row(cx=1, cy=2, dates=[datetime.datetime.fromordinal(1), datetime.datetime.fromordinal(2)])]
> df = sqc.createDataFrame(rows)
> df.collect()
> [Row(cx=1, cy=2, dates=[datetime.datetime(1, 1, 1, 0, 0, fold=1), datetime.datetime(1, 1, 2, 0, 0)])]
> df.toJSON().collect()
> ['{"cx":1,"cy":2,"dates":["0001-01-02T23:50:36.000-06:00","0001-01-03T23:50:36.000-06:00"]}']
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: 

[jira] [Commented] (SPARK-25467) Python date/datetime objects in dataframes increment by 1 day when converted to JSON

2018-09-25 Thread Aleksandr Koriagin (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626902#comment-16626902 ]

Aleksandr Koriagin commented on SPARK-25467:


Can be reproduced on HDP 2.6.5.0-292 with Spark 2.3.0:
{code:python}
import datetime
from pyspark.sql import Row

date = datetime.date.fromordinal(1)
print(date)  # >> '0001-01-01'

a = [Row(date=date)]
sqlContext.createDataFrame(a).toJSON().collect()  # >> [u'{"date":"0001-01-03"}']
{code}
Here is the part of the code where the issue probably happens:
https://github.com/apache/spark/blob/7d8f5b62c57c9e2903edd305e8b9c5400652fdb0/python/pyspark/sql/session.py#L748

{code:python}
if isinstance(data, RDD):
    rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
else:
    rdd, schema = self._createFromLocal(map(prepare, data), schema)

# ipdb> rdd.collect() --> [(-719162,)]
#   See "sql.types.DateType#toInternal" about the '-719162' value:
#   https://github.com/apache/spark/blob/7d8f5b62c57c9e2903edd305e8b9c5400652fdb0/python/pyspark/sql/types.py#L161
#   The number is '-719162' because:
#   datetime.date.fromordinal(1).toordinal() - datetime.datetime(1970, 1, 1).toordinal() = -719162
#
# ipdb> schema --> StructType(List(StructField(date,DateType,true)))
# ipdb> schema.json() --> '{"fields":[{"metadata":{},"name":"date","nullable":true,"type":"date"}],"type":"struct"}'
# Up to this point everything looks correct.

jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

# After the 'applySchemaToPythonRDD' transformation the value is incorrect: '0001-01-03'.
# ipdb> jdf.show()
# +----------+
# |      date|
# +----------+
# |0001-01-03|   <<-- should be '0001-01-01'
# +----------+
#
# So the issue seems to happen on the Java/Scala side:
# https://github.com/apache/spark/blob/2a0a8f753bbdc8c251f8e699c0808f35b94cfd20/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L734

df = DataFrame(jdf, self._wrapped)
df._schema = schema
return df
{code}
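The '-719162' internal value noted above is easy to verify: DateType#toInternal stores days since the Unix epoch, and the proleptic-Gregorian ordinal of 1970-01-01 is 719163:

```python
import datetime

EPOCH_ORDINAL = datetime.date(1970, 1, 1).toordinal()  # 719163

def to_internal(d):
    # Mirrors the computation in pyspark.sql.types.DateType#toInternal:
    # the number of days between the date and the Unix epoch.
    return d.toordinal() - EPOCH_ORDINAL

print(to_internal(datetime.date.fromordinal(1)))  # -719162
print(to_internal(datetime.date(1970, 1, 1)))     # 0
```

So the Python side hands the JVM a correct days-since-epoch value; the corruption appears only after applySchemaToPythonRDD interprets it.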


[jira] [Commented] (SPARK-25467) Python date/datetime objects in dataframes increment by 1 day when converted to JSON

2018-09-19 Thread holdenk (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16621391#comment-16621391 ]

holdenk commented on SPARK-25467:

cc [~bryanc]
