Maxim Gekk created SPARK-31159:
----------------------------------

             Summary: Incompatible Parquet dates/timestamps with Spark 2.4
                 Key: SPARK-31159
                 URL: https://issues.apache.org/jira/browse/SPARK-31159
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Maxim Gekk


Write dates/timestamps to a Parquet file in Spark 2.4:
{code}
$ export TZ="UTC"
$ ~/spark-2.4/bin/spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_231)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.conf.set("spark.sql.session.timeZone", "UTC")

scala> val df = Seq(("1001-01-01", "1001-01-01 01:02:03.123456")).toDF("dateS", "tsS").select($"dateS".cast("date").as("d"), $"tsS".cast("timestamp").as("ts"))
df: org.apache.spark.sql.DataFrame = [d: date, ts: timestamp]

scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")

scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros")
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros").show(false)
+----------+--------------------------+
|d         |ts                        |
+----------+--------------------------+
|1001-01-01|1001-01-01 01:02:03.123456|
+----------+--------------------------+
{code}
Spark 2.4 saves dates/timestamps using the Julian calendar for values before the Gregorian cutover in 1582. The parquet-mr tool prints the stored values as *1001-01-07* and *1001-01-07T01:02:03.123456+0000*:
{code}
$ java -jar /Users/maxim/proj/parquet-mr/parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar \
    dump -m \
    ./2_4_5_micros/part-00000-fe310bfa-0f61-44af-85ee-489721042c14-c000.snappy.parquet
INT32 d
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 1 ***
value 1: R:0 D:1 V:1001-01-07

INT64 ts
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 1 ***
value 1: R:0 D:1 V:1001-01-07T01:02:03.123456+0000
{code}
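The 6-day gap for this date can be reproduced outside Spark with plain JDK classes. The sketch below is only an illustration, not Spark's code: it assumes that java.util.GregorianCalendar (hybrid Julian/Gregorian, default cutover in October 1582) is a reasonable stand-in for how the day count was produced on the 2.4 side, and that java.time.LocalDate (Proleptic Gregorian) is a stand-in for the reader side. All value names are made up for this example.
{code}
import java.time.LocalDate
import java.util.{Calendar, GregorianCalendar, TimeZone}
import java.util.concurrent.TimeUnit

// 1001-01-01 in the hybrid calendar: dates before the 1582 cutover
// are interpreted as Julian dates.
val hybrid = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
hybrid.clear()
hybrid.set(1001, Calendar.JANUARY, 1)

// Days since 1970-01-01, the unit used for the INT32 date column.
val hybridDays = Math.floorDiv(hybrid.getTimeInMillis, TimeUnit.DAYS.toMillis(1))

// The same day count read back in the Proleptic Gregorian calendar.
val asProleptic = LocalDate.ofEpochDay(hybridDays)

// Day count of 1001-01-01 when the date is Proleptic Gregorian to begin with.
val prolepticDays = LocalDate.of(1001, 1, 1).toEpochDay

println(asProleptic)                 // 1001-01-07
println(hybridDays - prolepticDays)  // 6
{code}
The same stored day count therefore corresponds to 1001-01-01 in one calendar and 1001-01-07 in the other, which matches the difference between the two outputs in this report.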
Spark 3.0.0-preview2 (and 3.1.0-SNAPSHOT) prints the same values as parquet-mr, which differ from the values shown by Spark 2.4:
{code}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-preview2
      /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_231)
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros").show(false)
+----------+--------------------------+
|d         |ts                        |
+----------+--------------------------+
|1001-01-07|1001-01-07 01:02:03.123456|
+----------+--------------------------+
{code}


