[ https://issues.apache.org/jira/browse/SPARK-18381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15888752#comment-15888752 ]
Nicholas Chammas commented on SPARK-18381:
------------------------------------------

I am seeing a very similar issue when trying to read some date data from Parquet. When I tried to create a minimal repro, I uncovered this error, which is probably related to what is reported above:

{code}
>>> import datetime
>>> spark.createDataFrame([(datetime.date(1, 1, 1),)], ('date',)).show(1)
+----------+
|      date|
+----------+
|0001-01-03|
+----------+
{code}

I'm not sure how Jan 1 became Jan 3, but something is obviously wrong. Here's another odd example:

{code}
>>> spark.createDataFrame([(datetime.date(1000, 10, 10),)], ('date',)).show(1)
+----------+
|      date|
+----------+
|1000-10-04|
+----------+
{code}

In both cases, pulling the rows back into Python returns the correct result:

{code}
>>> spark.createDataFrame([(datetime.date(1, 1, 1),)], ('date',)).take(1)
[Row(date=datetime.date(1, 1, 1))]
>>> spark.createDataFrame([(datetime.date(1000, 10, 10),)], ('date',)).take(1)
[Row(date=datetime.date(1000, 10, 10))]
{code}

I'm guessing there is a problem [somewhere in here|https://github.com/apache/spark/blob/9734a928a75d29ea202e9f309f92ca4637d35671/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala] that is causing what Luca and I are seeing. Specifically, I am suspicious of [these lines|https://github.com/apache/spark/blob/9734a928a75d29ea202e9f309f92ca4637d35671/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L173-L185]. Most likely there is a historical calendar change (the Julian-to-Gregorian transition) that the Java library applies when converting a number of days into a date, and that change is not respected consistently elsewhere in Spark or on the Python side. (A small standalone sketch of this suspected mismatch is included after the quoted report below.)

cc [~davies] [~holdenk]

> Wrong date conversion between spark and python for dates before 1583
> --------------------------------------------------------------------
>
>                 Key: SPARK-18381
>                 URL: https://issues.apache.org/jira/browse/SPARK-18381
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.0.0
>            Reporter: Luca Caniparoli
>
> Dates before 1583 (Julian/Gregorian calendar transition) are processed incorrectly.
> * With a Python UDF (datetime.strptime), .show() returns wrong dates but .collect() returns correct dates.
> * With pyspark.sql.functions.to_date, .show() shows correct dates but .collect() returns wrong dates. Additionally, collecting '0001-01-01' raises an error.
> {code:none}
> from pyspark.sql.types import DateType
> from pyspark.sql.functions import to_date, udf
> from datetime import datetime
> strToDate = udf(lambda x: datetime.strptime(x, '%Y-%m-%d'), DateType())
> l = [('0002-01-01', 1), ('1581-01-01', 2), ('1582-01-01', 3), ('1583-01-01', 4), ('1584-01-01', 5), ('2012-01-21', 6)]
> l_older = [('0001-01-01', 1)]
> test_df = spark.createDataFrame(l, ["date_string", "number"])
> test_df_older = spark.createDataFrame(l_older, ["date_string", "number"])
> test_df_strptime = test_df.withColumn("date_cast", strToDate(test_df["date_string"]))
> test_df_todate = test_df.withColumn("date_cast", to_date(test_df["date_string"]))
> test_df_older_todate = test_df_older.withColumn("date_cast", to_date(test_df_older["date_string"]))
> test_df_strptime.show()
> test_df_todate.show()
> print test_df_strptime.collect()
> print test_df_todate.collect()
> print test_df_older_todate.collect()
> {code}
> {noformat}
> +-----------+------+----------+
> |date_string|number| date_cast|
> +-----------+------+----------+
> | 0002-01-01|     1|0002-01-03|
> | 1581-01-01|     2|1580-12-22|
> | 1582-01-01|     3|1581-12-22|
> | 1583-01-01|     4|1583-01-01|
> | 1584-01-01|     5|1584-01-01|
> | 2012-01-21|     6|2012-01-21|
> +-----------+------+----------+
> +-----------+------+----------+
> |date_string|number| date_cast|
> +-----------+------+----------+
> | 0002-01-01|     1|0002-01-01|
> | 1581-01-01|     2|1581-01-01|
> | 1582-01-01|     3|1582-01-01|
> | 1583-01-01|     4|1583-01-01|
> | 1584-01-01|     5|1584-01-01|
> | 2012-01-21|     6|2012-01-21|
> +-----------+------+----------+
> [Row(date_string=u'0002-01-01', number=1, date_cast=datetime.date(2, 1, 1)), Row(date_string=u'1581-01-01', number=2, date_cast=datetime.date(1581, 1, 1)), Row(date_string=u'1582-01-01', number=3, date_cast=datetime.date(1582, 1, 1)), Row(date_string=u'1583-01-01', number=4, date_cast=datetime.date(1583, 1, 1)), Row(date_string=u'1584-01-01', number=5, date_cast=datetime.date(1584, 1, 1)), Row(date_string=u'2012-01-21', number=6, date_cast=datetime.date(2012, 1, 21))]
> [Row(date_string=u'0002-01-01', number=1, date_cast=datetime.date(1, 12, 30)), Row(date_string=u'1581-01-01', number=2, date_cast=datetime.date(1581, 1, 11)), Row(date_string=u'1582-01-01', number=3, date_cast=datetime.date(1582, 1, 11)), Row(date_string=u'1583-01-01', number=4, date_cast=datetime.date(1583, 1, 1)), Row(date_string=u'1584-01-01', number=5, date_cast=datetime.date(1584, 1, 1)), Row(date_string=u'2012-01-21', number=6, date_cast=datetime.date(2012, 1, 21))]
> Traceback (most recent call last):
>   File "/tmp/zeppelin_pyspark-6043517212596195478.py", line 267, in <module>
>     raise Exception(traceback.format_exc())
> Exception: Traceback (most recent call last):
>   File "/tmp/zeppelin_pyspark-6043517212596195478.py", line 265, in <module>
>     exec(code)
>   File "<stdin>", line 15, in <module>
>   File "/usr/local/spark/python/pyspark/sql/dataframe.py", line 311, in collect
>     return list(_load_from_socket(port, BatchedSerializer(PickleSerializer())))
>   File "/usr/local/spark/python/pyspark/rdd.py", line 142, in _load_from_socket
>     for item in serializer.load_stream(rf):
>   File "/usr/local/spark/python/pyspark/serializers.py", line 139, in load_stream
>     yield self._read_with_length(stream)
>   File "/usr/local/spark/python/pyspark/serializers.py", line 164, in _read_with_length
>     return self.loads(obj)
>   File "/usr/local/spark/python/pyspark/serializers.py", line 422, in loads
>     return pickle.loads(obj)
>   File "/usr/local/spark/python/pyspark/sql/types.py", line 1354, in <lambda>
>     return lambda *a: dataType.fromInternal(a)
>   File "/usr/local/spark/python/pyspark/sql/types.py", line 600, in fromInternal
>     values = [f.fromInternal(v) for f, v in zip(self.fields, obj)]
>   File "/usr/local/spark/python/pyspark/sql/types.py", line 439, in fromInternal
>     return self.dataType.fromInternal(obj)
>   File "/usr/local/spark/python/pyspark/sql/types.py", line 176, in fromInternal
>     return datetime.date.fromordinal(v + self.EPOCH_ORDINAL)
> ValueError: ('ordinal must be >= 1', <function <lambda> at 0x7fa21bf7baa0>, (u'0001-01-01', 1, -719164))
> {noformat}