[ https://issues.apache.org/jira/browse/SPARK-31672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Maxim Gekk updated SPARK-31672:
-------------------------------
Description:
Write timestamps with dictionary encoding enabled to parquet files:
{code:scala}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.0-SNAPSHOT
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)

scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")

scala> :paste
// Entering paste mode (ctrl-D to finish)

Seq.tabulate(8)(_ => "1001-01-01 01:02:03.123").toDF("tsS")
  .select($"tsS".cast("timestamp").as("ts"))
  .repartition(1)
  .write
  .option("parquet.enable.dictionary", true)
  .mode("overwrite")
  .parquet("/Users/maximgekk/tmp/parquet-ts-dict")

// Exiting paste mode, now interpreting.
{code}
Read them back:
{code:scala}
scala> spark.read.parquet("/Users/maximgekk/tmp/parquet-ts-dict").show(false)
+-----------------------+
|ts                     |
+-----------------------+
|1001-01-07 00:32:20.123|
|1001-01-07 00:32:20.123|
|1001-01-07 00:32:20.123|
|1001-01-07 00:32:20.123|
|1001-01-07 00:32:20.123|
|1001-01-07 00:32:20.123|
|1001-01-07 00:32:20.123|
|1001-01-07 00:32:20.123|
+-----------------------+
{code}
*The expected values are 1001-01-01 01:02:03.123.*

I checked that the timestamp column is dictionary encoded via:
{code}
➜  parquet-ts-dict java -jar ~/Downloads/parquet-tools-1.12.0.jar dump ./part-00000-2c6c89b1-d165-4528-9a9d-796baa07908e-c000.snappy.parquet
row group 0
--------------------------------------------------------------------------------
ts:  INT64 SNAPPY DO:0 FPO:4 SZ:94/90/0.96 VC:8 ENC:BIT_PACKED,RLE,PLA [more]...

    ts TV=8 RL=0 DL=1 DS: 1 DE:PLAIN_DICTIONARY
    ----------------------------------------------------------------------------
    page 0:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY [more]... VC:8

INT64 ts
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 8 ***
value 1: R:0 D:1 V:1001-01-06T22:02:03.123000+0000
value 2: R:0 D:1 V:1001-01-06T22:02:03.123000+0000
value 3: R:0 D:1 V:1001-01-06T22:02:03.123000+0000
value 4: R:0 D:1 V:1001-01-06T22:02:03.123000+0000
value 5: R:0 D:1 V:1001-01-06T22:02:03.123000+0000
value 6: R:0 D:1 V:1001-01-06T22:02:03.123000+0000
value 7: R:0 D:1 V:1001-01-06T22:02:03.123000+0000
value 8: R:0 D:1 V:1001-01-06T22:02:03.123000+0000
{code}

was:
Write dates with dictionary encoding enabled to parquet files:
{code:scala}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.0-SNAPSHOT
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)

scala> :paste
// Entering paste mode (ctrl-D to finish)

Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS")
  .select($"dateS".cast("date").as("date"))
  .repartition(1)
  .write
  .option("parquet.enable.dictionary", true)
  .mode("overwrite")
  .parquet("/Users/maximgekk/tmp/parquet-date-dict")

// Exiting paste mode, now interpreting.
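// Hypothetical sanity check (not part of the original session): the 6-day
// shift seen on read-back below matches the gap between the hybrid
// Julian/Gregorian calendar (used by legacy Spark via java.sql.Date) and the
// proleptic Gregorian calendar (used by Spark 3.x) around the year 1001.
import java.time.LocalDate
import java.util.{GregorianCalendar, TimeZone}

// 1001-01-01 as a proleptic Gregorian epoch day
val gregorianDays = LocalDate.of(1001, 1, 1).toEpochDay
// The same nominal date under the hybrid calendar: GregorianCalendar
// applies Julian rules to dates before the 1582 cutover
val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
cal.clear()
cal.set(1001, 0, 1) // month is 0-based
val hybridDays = cal.getTimeInMillis / 86400000L // exact: time fields are zero

println(hybridDays - gregorianDays) // 6, i.e. Jan 1 is mislabeled as Jan 7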
{code}
Read them back:
{code:scala}
scala> spark.read.parquet("/Users/maximgekk/tmp/parquet-date-dict").show(false)
+----------+
|date      |
+----------+
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
+----------+
{code}
*The expected values are 1001-01-01.*

I checked that the date column is dictionary encoded via:
{code}
➜  parquet-date-dict java -jar ~/Downloads/parquet-tools-1.12.0.jar dump ./part-00000-84a77214-0c8c-45e9-ac41-5ca863b9dd94-c000.snappy.parquet
row group 0
--------------------------------------------------------------------------------
date:  INT32 SNAPPY DO:0 FPO:4 SZ:74/70/0.95 VC:8 ENC:BIT_PACKED,RLE,P [more]...

    date TV=8 RL=0 DL=1 DS: 1 DE:PLAIN_DICTIONARY
    ----------------------------------------------------------------------------
    page 0:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY [more]... VC:8

INT32 date
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 8 ***
value 1: R:0 D:1 V:1001-01-07
value 2: R:0 D:1 V:1001-01-07
value 3: R:0 D:1 V:1001-01-07
value 4: R:0 D:1 V:1001-01-07
value 5: R:0 D:1 V:1001-01-07
value 6: R:0 D:1 V:1001-01-07
value 7: R:0 D:1 V:1001-01-07
value 8: R:0 D:1 V:1001-01-07
{code}

> Reading wrong timestamps from dictionary encoded columns in Parquet files
> -------------------------------------------------------------------------
>
>                 Key: SPARK-31672
>                 URL: https://issues.apache.org/jira/browse/SPARK-31672
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.0.0, 3.1.0
>            Reporter: Maxim Gekk
>            Assignee: Maxim Gekk
>            Priority: Major
>             Fix For: 3.0.0
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org