[jira] [Created] (SPARK-26797) Start using the new logical types API of Parquet 1.11.0 instead of the deprecated one
Zoltan Ivanfi created SPARK-26797:
-------------------------------------

             Summary: Start using the new logical types API of Parquet 1.11.0 instead of the deprecated one
                 Key: SPARK-26797
                 URL: https://issues.apache.org/jira/browse/SPARK-26797
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Zoltan Ivanfi

The 1.11.0 release of parquet-mr will deprecate its logical type API in favour of a newly introduced one. The new API also introduces new subtypes for different timestamp semantics, support for which should be added to Spark in order to read those types correctly. At this point only a release candidate of parquet-mr 1.11.0 is available, but that already allows implementing and reviewing this change.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
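For illustration, the API change the ticket refers to looks roughly as follows when a Parquet schema is built directly against parquet-mr. This is a sketch based on the 1.11.0 release-candidate API mentioned above (names could still change before the final release), not code taken from the Spark patch itself:

{code}
import org.apache.parquet.schema.{LogicalTypeAnnotation, MessageType, OriginalType, Types}
import org.apache.parquet.schema.LogicalTypeAnnotation.TimeUnit
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT64

// Deprecated API: OriginalType carries no explicit timestamp semantics.
val oldStyle: MessageType = Types.buildMessage()
  .required(INT64).as(OriginalType.TIMESTAMP_MICROS).named("ts")
  .named("schema")

// New API (parquet-mr 1.11.0 RC): the annotation distinguishes UTC-normalized
// ("instant") timestamps from timezone-agnostic ("local") ones.
val newStyle: MessageType = Types.buildMessage()
  .required(INT64)
  .as(LogicalTypeAnnotation.timestampType(/* isAdjustedToUTC = */ false, TimeUnit.MICROS))
  .named("ts")
  .named("schema")
{code}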
[jira] [Comment Edited] (SPARK-26345) Parquet support Column indexes
[ https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16757209#comment-16757209 ] Zoltan Ivanfi edited comment on SPARK-26345 at 1/31/19 1:02 PM: Please note that column indexes will automatically get utilized if [spark.sql.parquet.enableVectorizedReader|https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-vectorized-parquet-reader.html#spark.sql.parquet.enableVectorizedReader] = false. If spark.sql.parquet.enableVectorizedReader = true, on the other hand (which is the default), then column indexes could only be utilized by duplicating the internal logic of parquet-mr, which would be a disproportionate effort. We, the developers of the column index feature in parquet-mr, did not expect Spark to make this huge investment, and we would like to provide a vectorized API instead in a future release of parquet-mr. was (Author: zi): Please note that column indexes will automatically get utilized if [spark.sql.parquet.enableVectorizedReader|https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-vectorized-parquet-reader.html#spark.sql.parquet.enableVectorizedReader] = false. If spark.sql.parquet.enableVectorizedReader = true, on the other hand (which is the default), then column indexes could only be utilized by duplicating the internal logic of parquet-mr, which would be a disproportionate effort. We, the developers of the column index feature, did not expect Spark to make this huge investment, and we would like to provide a vectorized API instead in a future release. > Parquet support Column indexes > -- > > Key: SPARK-26345 > URL: https://issues.apache.org/jira/browse/SPARK-26345 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Parquet 1.11.0 supports column indexing. Spark can support this feature for > good read performance. > More details: > https://issues.apache.org/jira/browse/PARQUET-1201 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26345) Parquet support Column indexes
[ https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16757209#comment-16757209 ] Zoltan Ivanfi commented on SPARK-26345: --- Please note that column indexes will automatically get utilized if [spark.sql.parquet.enableVectorizedReader|https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-vectorized-parquet-reader.html#spark.sql.parquet.enableVectorizedReader] = false. If spark.sql.parquet.enableVectorizedReader = true, on the other hand (which is the default), then column indexes could only be utilized by duplicating the internal logic of parquet-mr, which would be a disproportionate effort. We, the developers of the column index feature, did not expect Spark to make this huge investment, and we would like to provide a vectorized API instead in a future release. > Parquet support Column indexes > -- > > Key: SPARK-26345 > URL: https://issues.apache.org/jira/browse/SPARK-26345 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Parquet 1.11.0 supports column indexing. Spark can support this feature for > good read performance. > More details: > https://issues.apache.org/jira/browse/PARQUET-1201 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
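For a concrete picture of the trade-off described in the comment, the setting it names can be toggled per session. The snippet below is only a sketch: the table path and filter column are hypothetical, and it assumes a Spark build whose bundled parquet-mr actually ships the column-index feature (PARQUET-1201):

{code}
// With the vectorized reader disabled, scans go through parquet-mr's own record reader,
// which is the code path where parquet-mr's column-index filtering applies.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

// Column-index and row-group pruning are driven by pushed-down predicates,
// so filter pushdown should remain enabled (it is by default).
spark.conf.set("spark.sql.parquet.filterPushdown", "true")

import spark.implicits._
val df = spark.read.parquet("/path/to/parquet/table")  // hypothetical path
df.filter($"id" === 42L).show()
{code}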
[jira] [Commented] (SPARK-25102) Write Spark version information to Parquet file footers
[ https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609062#comment-16609062 ] Zoltan Ivanfi commented on SPARK-25102: --- Hi [~npoberezkin], Sorry for answering so late, I was on vacation. Unfortunately I can't tell you what the proper way is to get the Spark version as I am not a Spark developer myself but a Parquet library developer instead. > Write Spark version information to Parquet file footers > --- > > Key: SPARK-25102 > URL: https://issues.apache.org/jira/browse/SPARK-25102 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Zoltan Ivanfi >Priority: Major > > -PARQUET-352- added support for the "writer.model.name" property in the > Parquet metadata to identify the object model (application) that wrote the > file. > The easiest way to write this property is by overriding getName() of > org.apache.parquet.hadoop.api.WriteSupport. In Spark, this would mean adding > getName() to the > org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25102) Write Spark version information to Parquet file footers
Zoltan Ivanfi created SPARK-25102:
-------------------------------------

             Summary: Write Spark version information to Parquet file footers
                 Key: SPARK-25102
                 URL: https://issues.apache.org/jira/browse/SPARK-25102
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.3.1
            Reporter: Zoltan Ivanfi

-PARQUET-352- added support for the "writer.model.name" property in the Parquet metadata to identify the object model (application) that wrote the file. The easiest way to write this property is by overriding getName() of org.apache.parquet.hadoop.api.WriteSupport. In Spark, this would mean adding getName() to the org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport class.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
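A minimal sketch of what the description proposes is shown below. It is not the actual Spark patch: the real ParquetWriteSupport has many more members, and obtaining the version string via org.apache.spark.SPARK_VERSION is an assumption — the comment above notes that the proper way to get the Spark version was still an open question:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.hadoop.api.WriteSupport
import org.apache.parquet.io.api.RecordConsumer
import org.apache.spark.sql.catalyst.InternalRow

class VersionedWriteSupport extends WriteSupport[InternalRow] {
  // parquet-mr stores the value returned here under "writer.model.name" in the file footer.
  override def getName(): String = "spark version " + org.apache.spark.SPARK_VERSION

  // The remaining members are elided in this sketch; a real implementation would keep
  // the existing ParquetWriteSupport logic for schema conversion and record writing.
  override def init(configuration: Configuration): WriteSupport.WriteContext = ???
  override def prepareForWrite(recordConsumer: RecordConsumer): Unit = ???
  override def write(record: InternalRow): Unit = ???
}
{code}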
[jira] [Issue Comment Deleted] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
[ https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zoltan Ivanfi updated SPARK-20297: -- Comment: was deleted (was: Sorry, commented to the wrong JIRA.) > Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala > --- > > Key: SPARK-20297 > URL: https://issues.apache.org/jira/browse/SPARK-20297 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Mostafa Mokhtar >Priority: Major > Labels: integration > > While trying to load some data using Spark 2.1 I realized that decimal(12,2) > columns stored in Parquet written by Spark are not readable by Hive or Impala. > Repro > {code} > CREATE TABLE customer_acctbal( > c_acctbal decimal(12,2)) > STORED AS Parquet; > insert into customer_acctbal values (7539.95); > {code} > Error from Hive > {code} > Failed with exception > java.io.IOException:parquet.io.ParquetDecodingException: Can not read value > at 1 in block 0 in file > hdfs://server1:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-03d6e3bb-fe5e-4f20-87a4-88dec955dfcd.snappy.parquet > Time taken: 0.122 seconds > {code} > Error from Impala > {code} > File > 'hdfs://server:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-32db4c61-fe67-4be2-9c16-b55c75c517a4.snappy.parquet' > has an incompatible Parquet schema for column > 'tpch_nested_3000_parquet.customer_acctbal.c_acctbal'. Column type: > DECIMAL(12,2), Parquet schema: > optional int64 c_acctbal [i:0 d:1 r:0] (1 of 2 similar) > {code} > Table info > {code} > hive> describe formatted customer_acctbal; > OK > # col_name data_type comment > c_acctbal decimal(12,2) > # Detailed Table Information > Database: tpch_nested_3000_parquet > Owner: mmokhtar > CreateTime: Mon Apr 10 17:47:24 PDT 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention: 0 > Location: > hdfs://server1.com:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal > Table Type: MANAGED_TABLE > Table Parameters: > COLUMN_STATS_ACCURATE true > numFiles1 > numRows 0 > rawDataSize 0 > totalSize 120 > transient_lastDdlTime 1491871644 > # Storage Information > SerDe Library: > org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe > InputFormat: > org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat > Compressed: No > Num Buckets:-1 > Bucket Columns: [] > Sort Columns: [] > Storage Desc Params: > serialization.format1 > Time taken: 0.032 seconds, Fetched: 31 row(s) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
[ https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419338#comment-16419338 ] Zoltan Ivanfi commented on SPARK-20297: --- Sorry, commented to the wrong JIRA. > Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala > --- > > Key: SPARK-20297 > URL: https://issues.apache.org/jira/browse/SPARK-20297 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Mostafa Mokhtar >Priority: Major > Labels: integration > > While trying to load some data using Spark 2.1 I realized that decimal(12,2) > columns stored in Parquet written by Spark are not readable by Hive or Impala. > Repro > {code} > CREATE TABLE customer_acctbal( > c_acctbal decimal(12,2)) > STORED AS Parquet; > insert into customer_acctbal values (7539.95); > {code} > Error from Hive > {code} > Failed with exception > java.io.IOException:parquet.io.ParquetDecodingException: Can not read value > at 1 in block 0 in file > hdfs://server1:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-03d6e3bb-fe5e-4f20-87a4-88dec955dfcd.snappy.parquet > Time taken: 0.122 seconds > {code} > Error from Impala > {code} > File > 'hdfs://server:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-32db4c61-fe67-4be2-9c16-b55c75c517a4.snappy.parquet' > has an incompatible Parquet schema for column > 'tpch_nested_3000_parquet.customer_acctbal.c_acctbal'. Column type: > DECIMAL(12,2), Parquet schema: > optional int64 c_acctbal [i:0 d:1 r:0] (1 of 2 similar) > {code} > Table info > {code} > hive> describe formatted customer_acctbal; > OK > # col_name data_type comment > c_acctbal decimal(12,2) > # Detailed Table Information > Database: tpch_nested_3000_parquet > Owner: mmokhtar > CreateTime: Mon Apr 10 17:47:24 PDT 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention: 0 > Location: > hdfs://server1.com:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal > Table Type: MANAGED_TABLE > Table Parameters: > COLUMN_STATS_ACCURATE true > numFiles1 > numRows 0 > rawDataSize 0 > totalSize 120 > transient_lastDdlTime 1491871644 > # Storage Information > SerDe Library: > org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe > InputFormat: > org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat > Compressed: No > Num Buckets:-1 > Bucket Columns: [] > Sort Columns: [] > Storage Desc Params: > serialization.format1 > Time taken: 0.032 seconds, Fetched: 31 row(s) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
[ https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zoltan Ivanfi updated SPARK-20297: -- Comment: was deleted (was: Could you please clarify how those DECIMALS were written in the first place? * If some manual configuration was done to allow Spark to choose this representation, then we are fine. * If an upstream Spark version wrote data using this representation by default, that's a valid reason to feel mildly uncomfortable. * If a downstream Spark version wrote data using this representation by default, then we should open a JIRA to prevent CDH Spark from doing so until Hive and Impala supports it. Thanks!) > Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala > --- > > Key: SPARK-20297 > URL: https://issues.apache.org/jira/browse/SPARK-20297 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Mostafa Mokhtar >Priority: Major > Labels: integration > > While trying to load some data using Spark 2.1 I realized that decimal(12,2) > columns stored in Parquet written by Spark are not readable by Hive or Impala. > Repro > {code} > CREATE TABLE customer_acctbal( > c_acctbal decimal(12,2)) > STORED AS Parquet; > insert into customer_acctbal values (7539.95); > {code} > Error from Hive > {code} > Failed with exception > java.io.IOException:parquet.io.ParquetDecodingException: Can not read value > at 1 in block 0 in file > hdfs://server1:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-03d6e3bb-fe5e-4f20-87a4-88dec955dfcd.snappy.parquet > Time taken: 0.122 seconds > {code} > Error from Impala > {code} > File > 'hdfs://server:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-32db4c61-fe67-4be2-9c16-b55c75c517a4.snappy.parquet' > has an incompatible Parquet schema for column > 'tpch_nested_3000_parquet.customer_acctbal.c_acctbal'. Column type: > DECIMAL(12,2), Parquet schema: > optional int64 c_acctbal [i:0 d:1 r:0] (1 of 2 similar) > {code} > Table info > {code} > hive> describe formatted customer_acctbal; > OK > # col_name data_type comment > c_acctbal decimal(12,2) > # Detailed Table Information > Database: tpch_nested_3000_parquet > Owner: mmokhtar > CreateTime: Mon Apr 10 17:47:24 PDT 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention: 0 > Location: > hdfs://server1.com:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal > Table Type: MANAGED_TABLE > Table Parameters: > COLUMN_STATS_ACCURATE true > numFiles1 > numRows 0 > rawDataSize 0 > totalSize 120 > transient_lastDdlTime 1491871644 > # Storage Information > SerDe Library: > org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe > InputFormat: > org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat > Compressed: No > Num Buckets:-1 > Bucket Columns: [] > Sort Columns: [] > Storage Desc Params: > serialization.format1 > Time taken: 0.032 seconds, Fetched: 31 row(s) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
[ https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419335#comment-16419335 ] Zoltan Ivanfi commented on SPARK-20297: --- Could you please clarify how those DECIMALS were written in the first place? * If some manual configuration was done to allow Spark to choose this representation, then we are fine. * If an upstream Spark version wrote data using this representation by default, that's a valid reason to feel mildly uncomfortable. * If a downstream Spark version wrote data using this representation by default, then we should open a JIRA to prevent CDH Spark from doing so until Hive and Impala supports it. Thanks! > Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala > --- > > Key: SPARK-20297 > URL: https://issues.apache.org/jira/browse/SPARK-20297 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Mostafa Mokhtar >Priority: Major > Labels: integration > > While trying to load some data using Spark 2.1 I realized that decimal(12,2) > columns stored in Parquet written by Spark are not readable by Hive or Impala. > Repro > {code} > CREATE TABLE customer_acctbal( > c_acctbal decimal(12,2)) > STORED AS Parquet; > insert into customer_acctbal values (7539.95); > {code} > Error from Hive > {code} > Failed with exception > java.io.IOException:parquet.io.ParquetDecodingException: Can not read value > at 1 in block 0 in file > hdfs://server1:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-03d6e3bb-fe5e-4f20-87a4-88dec955dfcd.snappy.parquet > Time taken: 0.122 seconds > {code} > Error from Impala > {code} > File > 'hdfs://server:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-32db4c61-fe67-4be2-9c16-b55c75c517a4.snappy.parquet' > has an incompatible Parquet schema for column > 'tpch_nested_3000_parquet.customer_acctbal.c_acctbal'. Column type: > DECIMAL(12,2), Parquet schema: > optional int64 c_acctbal [i:0 d:1 r:0] (1 of 2 similar) > {code} > Table info > {code} > hive> describe formatted customer_acctbal; > OK > # col_name data_type comment > c_acctbal decimal(12,2) > # Detailed Table Information > Database: tpch_nested_3000_parquet > Owner: mmokhtar > CreateTime: Mon Apr 10 17:47:24 PDT 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention: 0 > Location: > hdfs://server1.com:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal > Table Type: MANAGED_TABLE > Table Parameters: > COLUMN_STATS_ACCURATE true > numFiles1 > numRows 0 > rawDataSize 0 > totalSize 120 > transient_lastDdlTime 1491871644 > # Storage Information > SerDe Library: > org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe > InputFormat: > org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat > Compressed: No > Num Buckets:-1 > Bucket Columns: [] > Sort Columns: [] > Storage Desc Params: > serialization.format1 > Time taken: 0.032 seconds, Fetched: 31 row(s) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
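One mitigation that is often suggested for this class of error — offered here as an assumption based on the error messages above, not as the recorded resolution of this ticket — is to have Spark write decimals in the legacy Parquet representation (fixed-length byte arrays) that older Hive and Impala readers expect, instead of INT32/INT64:

{code}
// Hypothetical table name; the flag changes how Spark's Parquet writer encodes decimals.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

spark.sql("CREATE TABLE customer_acctbal_compat (c_acctbal DECIMAL(12,2)) STORED AS PARQUET")
spark.sql("INSERT INTO customer_acctbal_compat VALUES (7539.95)")
{code}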
[jira] [Comment Edited] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350240#comment-16350240 ] Zoltan Ivanfi edited comment on SPARK-12297 at 2/2/18 12:35 PM: Hive already has a workaround based on a the writer metadata. HIVE-12767 was about a more sophisticated and complicated solution based on table properties. But since the Spark community decided to implement a similar workaround to the one that already exists in Hive (based on a the writer metadata), the solution using table properties is not needed any more. I have resolved HIVE-12767 as "Won't Fix". was (Author: zi): Hive already has a workaround based on a the writer metadata. HIVE-12767 was about a more sophisticated and complicated solution based on table properties. But the Spark community decided to implement a similar workaround to the one that already exists in Hive (based on a the writer metadata), the solution using table properties is not needed any more. I have resolved HIVE-12767 as "Won't Fix". > Add work-around for Parquet/Hive int96 timestamp bug. > - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue >Assignee: Imran Rashid >Priority: Major > Fix For: 2.3.0 > > > Spark copied Hive's behavior for parquet, but this was inconsistent with > other file formats, and inconsistent with Impala (which is the original > source of putting a timestamp as an int96 in parquet, I believe). This made > timestamps in parquet act more like timestamps with timezones, while in other > file formats, timestamps have no time zone, they are a "floating time". > The easiest way to see this issue is to write out a table with timestamps in > multiple different formats from one timezone, then try to read them back in > another timezone. 
Eg., here I write out a few timestamps to parquet and > textfile hive tables, and also just as a json file, all in the > "America/Los_Angeles" timezone: > {code} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val tblPrefix = args(0) > val schema = new StructType().add("ts", TimestampType) > val rows = sc.parallelize(Seq( > "2015-12-31 23:50:59.123", > "2015-12-31 22:49:59.123", > "2016-01-01 00:39:59.123", > "2016-01-01 01:29:59.123" > ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) > val rawData = spark.createDataFrame(rows, schema).toDF() > rawData.show() > Seq("parquet", "textfile").foreach { format => > val tblName = s"${tblPrefix}_$format" > spark.sql(s"DROP TABLE IF EXISTS $tblName") > spark.sql( > raw"""CREATE TABLE $tblName ( > | ts timestamp > | ) > | STORED AS $format > """.stripMargin) > rawData.write.insertInto(tblName) > } > rawData.write.json(s"${tblPrefix}_json") > {code} > Then I start a spark-shell in "America/New_York" timezone, and read the data > back from each table: > {code} > scala> spark.sql("select * from la_parquet").collect().foreach{println} > [2016-01-01 02:50:59.123] > [2016-01-01 01:49:59.123] > [2016-01-01 03:39:59.123] > [2016-01-01 04:29:59.123] > scala> spark.sql("select * from la_textfile").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").join(spark.sql("select * from > la_textfile"), "ts").show() > ++ > | ts| > ++ > |2015-12-31 23:50:...| > |2015-12-31 22:49:...| > |2016-01-01 00:39:...| > |2016-01-01 01:29:...| > ++ > scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), > "ts").show() > +---+ > | ts| > +---+ > +---+ > {code} > The textfile and json based data shows the same times, and can be joined > against each other, while the times from the parquet data have changed (and > obviously joins fail). > This is a big problem for any organization that may try to read the same data > (say in S3) with clusters in multiple timezones. It can also be a nasty > surprise as an organization tries to migrate file formats. Finally, its a > source of incompatibility between Hive, Impala, and Spark. > HIVE-12767 aims to fix this by introducing a table property which indicates > the "storage timezone" for the table. Spark should add the same to ensure > consistency between file formats, and with Hive & Impala. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
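For readers unfamiliar with the "writer metadata" mentioned in the comment: it lives in the Parquet file footer. The sketch below only shows how that metadata can be inspected with parquet-mr — it is not Hive's actual detection logic, and the file path is hypothetical:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

val file = HadoopInputFile.fromPath(new Path("/path/to/file.parquet"), new Configuration())
val reader = ParquetFileReader.open(file)
try {
  val meta = reader.getFooter.getFileMetaData
  // created_by identifies the writing application (e.g. parquet-mr or Impala);
  // the key/value map holds additional writer-supplied properties.
  println("created_by: " + meta.getCreatedBy)
  println("key/value metadata: " + meta.getKeyValueMetaData)
} finally {
  reader.close()
}
{code}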
[jira] [Commented] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350240#comment-16350240 ] Zoltan Ivanfi commented on SPARK-12297: --- Hive already has a workaround based on a the writer metadata. HIVE-12767 was about a more sophisticated and complicated solution based on table properties. But the Spark community decided to implement a similar workaround to the one that already exists in Hive (based on a the writer metadata), the solution using table properties is not needed any more. I have resolved HIVE-12767 as "Won't Fix". > Add work-around for Parquet/Hive int96 timestamp bug. > - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue >Assignee: Imran Rashid >Priority: Major > Fix For: 2.3.0 > > > Spark copied Hive's behavior for parquet, but this was inconsistent with > other file formats, and inconsistent with Impala (which is the original > source of putting a timestamp as an int96 in parquet, I believe). This made > timestamps in parquet act more like timestamps with timezones, while in other > file formats, timestamps have no time zone, they are a "floating time". > The easiest way to see this issue is to write out a table with timestamps in > multiple different formats from one timezone, then try to read them back in > another timezone. Eg., here I write out a few timestamps to parquet and > textfile hive tables, and also just as a json file, all in the > "America/Los_Angeles" timezone: > {code} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val tblPrefix = args(0) > val schema = new StructType().add("ts", TimestampType) > val rows = sc.parallelize(Seq( > "2015-12-31 23:50:59.123", > "2015-12-31 22:49:59.123", > "2016-01-01 00:39:59.123", > "2016-01-01 01:29:59.123" > ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) > val rawData = spark.createDataFrame(rows, schema).toDF() > rawData.show() > Seq("parquet", "textfile").foreach { format => > val tblName = s"${tblPrefix}_$format" > spark.sql(s"DROP TABLE IF EXISTS $tblName") > spark.sql( > raw"""CREATE TABLE $tblName ( > | ts timestamp > | ) > | STORED AS $format > """.stripMargin) > rawData.write.insertInto(tblName) > } > rawData.write.json(s"${tblPrefix}_json") > {code} > Then I start a spark-shell in "America/New_York" timezone, and read the data > back from each table: > {code} > scala> spark.sql("select * from la_parquet").collect().foreach{println} > [2016-01-01 02:50:59.123] > [2016-01-01 01:49:59.123] > [2016-01-01 03:39:59.123] > [2016-01-01 04:29:59.123] > scala> spark.sql("select * from la_textfile").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").join(spark.sql("select * from > la_textfile"), "ts").show() > ++ > | ts| > ++ > |2015-12-31 23:50:...| > |2015-12-31 22:49:...| > |2016-01-01 00:39:...| > |2016-01-01 01:29:...| > ++ > scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), > "ts").show() > +---+ > | ts| > +---+ > +---+ > {code} > The textfile and json based data shows the same times, and can be joined > against each other, while the times from the parquet data have changed (and > 
obviously joins fail). > This is a big problem for any organization that may try to read the same data > (say in S3) with clusters in multiple timezones. It can also be a nasty > surprise as an organization tries to migrate file formats. Finally, its a > source of incompatibility between Hive, Impala, and Spark. > HIVE-12767 aims to fix this by introducing a table property which indicates > the "storage timezone" for the table. Spark should add the same to ensure > consistency between file formats, and with Hive & Impala. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
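For context, the table-property approach referenced above (HIVE-12767) would look roughly like the DDL below. The property name parquet.mr.int96.write.zone comes from the Hive/Impala work cited later in this thread; whether Spark itself honours it depends on the Spark-side change debated in this ticket, so treat this purely as an illustration:

{code}
// Illustrative only: declares the timezone the table's int96 timestamps are written in.
spark.sql("""
  CREATE TABLE events_utc (ts TIMESTAMP)
  STORED AS PARQUET
  TBLPROPERTIES ('parquet.mr.int96.write.zone' = 'UTC')
""")
{code}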
[jira] [Commented] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16242261#comment-16242261 ] Zoltan Ivanfi commented on SPARK-12297: --- Yes, we reverted that change, because without a corresponding change in SparkSQL, it can not achieve interoperability in itself. We can only fix this issue by addressing all affected components (SparkSQL, Hive and Impala) at the same time in a consistent manner. > Add work-around for Parquet/Hive int96 timestamp bug. > - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue > > Spark copied Hive's behavior for parquet, but this was inconsistent with > other file formats, and inconsistent with Impala (which is the original > source of putting a timestamp as an int96 in parquet, I believe). This made > timestamps in parquet act more like timestamps with timezones, while in other > file formats, timestamps have no time zone, they are a "floating time". > The easiest way to see this issue is to write out a table with timestamps in > multiple different formats from one timezone, then try to read them back in > another timezone. Eg., here I write out a few timestamps to parquet and > textfile hive tables, and also just as a json file, all in the > "America/Los_Angeles" timezone: > {code} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val tblPrefix = args(0) > val schema = new StructType().add("ts", TimestampType) > val rows = sc.parallelize(Seq( > "2015-12-31 23:50:59.123", > "2015-12-31 22:49:59.123", > "2016-01-01 00:39:59.123", > "2016-01-01 01:29:59.123" > ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) > val rawData = spark.createDataFrame(rows, schema).toDF() > rawData.show() > Seq("parquet", "textfile").foreach { format => > val tblName = s"${tblPrefix}_$format" > spark.sql(s"DROP TABLE IF EXISTS $tblName") > spark.sql( > raw"""CREATE TABLE $tblName ( > | ts timestamp > | ) > | STORED AS $format > """.stripMargin) > rawData.write.insertInto(tblName) > } > rawData.write.json(s"${tblPrefix}_json") > {code} > Then I start a spark-shell in "America/New_York" timezone, and read the data > back from each table: > {code} > scala> spark.sql("select * from la_parquet").collect().foreach{println} > [2016-01-01 02:50:59.123] > [2016-01-01 01:49:59.123] > [2016-01-01 03:39:59.123] > [2016-01-01 04:29:59.123] > scala> spark.sql("select * from la_textfile").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").join(spark.sql("select * from > la_textfile"), "ts").show() > ++ > | ts| > ++ > |2015-12-31 23:50:...| > |2015-12-31 22:49:...| > |2016-01-01 00:39:...| > |2016-01-01 01:29:...| > ++ > scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), > "ts").show() > +---+ > | ts| > +---+ > +---+ > {code} > The textfile and json based data shows the same times, and can be joined > against each other, while the times from the parquet data have changed (and > obviously joins fail). > This is a big problem for any organization that may try to read the same data > (say in S3) with clusters in multiple timezones. 
It can also be a nasty > surprise as an organization tries to migrate file formats. Finally, its a > source of incompatibility between Hive, Impala, and Spark. > HIVE-12767 aims to fix this by introducing a table property which indicates > the "storage timezone" for the table. Spark should add the same to ensure > consistency between file formats, and with Hive & Impala. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012542#comment-16012542 ] Zoltan Ivanfi edited comment on SPARK-12297 at 5/16/17 3:16 PM: What I meant is that if a CSV file ("STORED AS TEXTFILE" in SQL terminology) contains a timestamp and you define the type of the column as TIMESTAMP, then SparkSQL interprets that timestamp as a local time value instead of a UTC-normalized one. So if you have such a table with some data and run a select in SparkSQL, then change the local timezone and run the same select again (using SparkSQL again), you will see the same timestamp. If you do the same with a Parquet table, you will see a different timestamp after changing the local timezone. I mentioned Avro as an example by mistake, as Avro-backed tables do not support the timestamp type at this moment. was (Author: zi): What I meant is that if a CSV file ("STORED AS TEXTFILE" in SQL terminology) contains a timestamp and you define the type of the column as TIMESTAMP, then SparkSQL interprets that timestamp as a local time value instead of a UTC-normalized one. So if you have such a table with some data and run a select in SparkSQL, then change the local timezone and run the same select again (using SparkSQL again), you will see the same timestamp. If you do the same with a Parquet table, you will see a different timestamp after changing the local timezone. I mentioned Avro as an example by mistake, as Avro-backed tables do not support the timestamp type at this moment. I may have been thinking about ORC. > Add work-around for Parquet/Hive int96 timestamp bug. > - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue > > Spark copied Hive's behavior for parquet, but this was inconsistent with > other file formats, and inconsistent with Impala (which is the original > source of putting a timestamp as an int96 in parquet, I believe). This made > timestamps in parquet act more like timestamps with timezones, while in other > file formats, timestamps have no time zone, they are a "floating time". > The easiest way to see this issue is to write out a table with timestamps in > multiple different formats from one timezone, then try to read them back in > another timezone. 
Eg., here I write out a few timestamps to parquet and > textfile hive tables, and also just as a json file, all in the > "America/Los_Angeles" timezone: > {code} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val tblPrefix = args(0) > val schema = new StructType().add("ts", TimestampType) > val rows = sc.parallelize(Seq( > "2015-12-31 23:50:59.123", > "2015-12-31 22:49:59.123", > "2016-01-01 00:39:59.123", > "2016-01-01 01:29:59.123" > ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) > val rawData = spark.createDataFrame(rows, schema).toDF() > rawData.show() > Seq("parquet", "textfile").foreach { format => > val tblName = s"${tblPrefix}_$format" > spark.sql(s"DROP TABLE IF EXISTS $tblName") > spark.sql( > raw"""CREATE TABLE $tblName ( > | ts timestamp > | ) > | STORED AS $format > """.stripMargin) > rawData.write.insertInto(tblName) > } > rawData.write.json(s"${tblPrefix}_json") > {code} > Then I start a spark-shell in "America/New_York" timezone, and read the data > back from each table: > {code} > scala> spark.sql("select * from la_parquet").collect().foreach{println} > [2016-01-01 02:50:59.123] > [2016-01-01 01:49:59.123] > [2016-01-01 03:39:59.123] > [2016-01-01 04:29:59.123] > scala> spark.sql("select * from la_textfile").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").join(spark.sql("select * from > la_textfile"), "ts").show() > ++ > | ts| > ++ > |2015-12-31 23:50:...| > |2015-12-31 22:49:...| > |2016-01-01 00:39:...| > |2016-01-01 01:29:...| > ++ > scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), > "ts").show() > +---+ > | ts| > +---+ > +---+ > {code} > The textfile and json based data shows the same times, and can be joined > against each other, while the times from the parquet data have changed (and > obviously joins fail). > This is a big problem for any organization that may try to read the same data > (say in S3) with clusters in multi
[jira] [Comment Edited] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012542#comment-16012542 ] Zoltan Ivanfi edited comment on SPARK-12297 at 5/16/17 3:11 PM: What I meant is that if a CSV file ("STORED AS TEXTFILE" in SQL terminology) contains a timestamp and you define the type of the column as TIMESTAMP, then SparkSQL interprets that timestamp as a local time value instead of a UTC-normalized one. So if you have such a table with some data and run a select in SparkSQL, then change the local timezone and run the same select again (using SparkSQL again), you will see the same timestamp. If you do the same with a Parquet table, you will see a different timestamp after changing the local timezone. I mentioned Avro as an example by mistake, as Avro-backed tables do not support the timestamp type at this moment. I may have been thinking about ORC. was (Author: zi): What I meant is that if a CSV file ("STORED AS TEXTFILE" in SQL terminology) contains a timestamp and you define the type of the column as TIMESTAMP, then SparkSQL interprets that timestamp as a local time value instead of a UTC-normalized one. So if you have such a table and insert a timestamp into it in SparkSQL, then change the local timezone and read the value back (using SparkSQL again), you will see the same timestamp. If you do the same with a Parquet table, you will see a different timestamp after changing the local timezone. I mentioned Avro as an example by mistake, as Avro-backed tables do not support the timestamp type at this moment. I may have been thinking about ORC. > Add work-around for Parquet/Hive int96 timestamp bug. > - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue > > Spark copied Hive's behavior for parquet, but this was inconsistent with > other file formats, and inconsistent with Impala (which is the original > source of putting a timestamp as an int96 in parquet, I believe). This made > timestamps in parquet act more like timestamps with timezones, while in other > file formats, timestamps have no time zone, they are a "floating time". > The easiest way to see this issue is to write out a table with timestamps in > multiple different formats from one timezone, then try to read them back in > another timezone. 
Eg., here I write out a few timestamps to parquet and > textfile hive tables, and also just as a json file, all in the > "America/Los_Angeles" timezone: > {code} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val tblPrefix = args(0) > val schema = new StructType().add("ts", TimestampType) > val rows = sc.parallelize(Seq( > "2015-12-31 23:50:59.123", > "2015-12-31 22:49:59.123", > "2016-01-01 00:39:59.123", > "2016-01-01 01:29:59.123" > ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) > val rawData = spark.createDataFrame(rows, schema).toDF() > rawData.show() > Seq("parquet", "textfile").foreach { format => > val tblName = s"${tblPrefix}_$format" > spark.sql(s"DROP TABLE IF EXISTS $tblName") > spark.sql( > raw"""CREATE TABLE $tblName ( > | ts timestamp > | ) > | STORED AS $format > """.stripMargin) > rawData.write.insertInto(tblName) > } > rawData.write.json(s"${tblPrefix}_json") > {code} > Then I start a spark-shell in "America/New_York" timezone, and read the data > back from each table: > {code} > scala> spark.sql("select * from la_parquet").collect().foreach{println} > [2016-01-01 02:50:59.123] > [2016-01-01 01:49:59.123] > [2016-01-01 03:39:59.123] > [2016-01-01 04:29:59.123] > scala> spark.sql("select * from la_textfile").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").join(spark.sql("select * from > la_textfile"), "ts").show() > ++ > | ts| > ++ > |2015-12-31 23:50:...| > |2015-12-31 22:49:...| > |2016-01-01 00:39:...| > |2016-01-01 01:29:...| > ++ > scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), > "ts").show() > +---+ > | ts| > +---+ > +---+ > {code} > The textfile and json based data shows the same times, and can be joined > against each other, while the times from the parquet data have changed (and > obviously joins fail). > This is a big problem for any organization that may try to read the same data > (say
[jira] [Commented] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012542#comment-16012542 ] Zoltan Ivanfi commented on SPARK-12297: --- What I meant is that if a CSV file ("STORED AS TEXTFILE" in SQL terminology) contains a timestamp and you define the type of the column as TIMESTAMP, then SparkSQL interprets that timestamp as a local time value instead of a UTC-normalized one. So if you have such a table and insert a timestamp into it in SparkSQL, then change the local timezone and read the value back (using SparkSQL again), you will see the same timestamp. If you do the same with a Parquet table, you will see a different timestamp after changing the local timezone. I mentioned Avro as an example by mistake, as Avro-backed tables do not support the timestamp type at this moment. I may have been thinking about ORC. > Add work-around for Parquet/Hive int96 timestamp bug. > - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue > > Spark copied Hive's behavior for parquet, but this was inconsistent with > other file formats, and inconsistent with Impala (which is the original > source of putting a timestamp as an int96 in parquet, I believe). This made > timestamps in parquet act more like timestamps with timezones, while in other > file formats, timestamps have no time zone, they are a "floating time". > The easiest way to see this issue is to write out a table with timestamps in > multiple different formats from one timezone, then try to read them back in > another timezone. Eg., here I write out a few timestamps to parquet and > textfile hive tables, and also just as a json file, all in the > "America/Los_Angeles" timezone: > {code} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val tblPrefix = args(0) > val schema = new StructType().add("ts", TimestampType) > val rows = sc.parallelize(Seq( > "2015-12-31 23:50:59.123", > "2015-12-31 22:49:59.123", > "2016-01-01 00:39:59.123", > "2016-01-01 01:29:59.123" > ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) > val rawData = spark.createDataFrame(rows, schema).toDF() > rawData.show() > Seq("parquet", "textfile").foreach { format => > val tblName = s"${tblPrefix}_$format" > spark.sql(s"DROP TABLE IF EXISTS $tblName") > spark.sql( > raw"""CREATE TABLE $tblName ( > | ts timestamp > | ) > | STORED AS $format > """.stripMargin) > rawData.write.insertInto(tblName) > } > rawData.write.json(s"${tblPrefix}_json") > {code} > Then I start a spark-shell in "America/New_York" timezone, and read the data > back from each table: > {code} > scala> spark.sql("select * from la_parquet").collect().foreach{println} > [2016-01-01 02:50:59.123] > [2016-01-01 01:49:59.123] > [2016-01-01 03:39:59.123] > [2016-01-01 04:29:59.123] > scala> spark.sql("select * from la_textfile").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").join(spark.sql("select * from > la_textfile"), "ts").show() > ++ > | ts| > ++ > |2015-12-31 23:50:...| > |2015-12-31 22:49:...| > |2016-01-01 00:39:...| > |2016-01-01 01:29:...| > ++ > scala> spark.read.json("la_json").join(spark.sql("select * 
from la_parquet"), > "ts").show() > +---+ > | ts| > +---+ > +---+ > {code} > The textfile and json based data shows the same times, and can be joined > against each other, while the times from the parquet data have changed (and > obviously joins fail). > This is a big problem for any organization that may try to read the same data > (say in S3) with clusters in multiple timezones. It can also be a nasty > surprise as an organization tries to migrate file formats. Finally, its a > source of incompatibility between Hive, Impala, and Spark. > HIVE-12767 aims to fix this by introducing a table property which indicates > the "storage timezone" for the table. Spark should add the same to ensure > consistency between file formats, and with Hive & Impala. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004607#comment-16004607 ] Zoltan Ivanfi edited comment on SPARK-12297 at 5/10/17 6:26 PM:
bq. It'd be great to consider this more holistically and think about alternatives in fixing them
As Ryan mentioned, the Parquet community discussed this timestamp incompatibility problem with the aim of avoiding similar problems in the future. It was decided that the specification needs to include two separate types with well-defined semantics: one for timezone-agnostic (aka. TIMESTAMP WITHOUT TIMEZONE) and one for UTC-normalized (aka. TIMESTAMP WITH TIMEZONE) timestamps. (Otherwise implementors would be tempted to misuse the single existing type for storing timestamps of different semantics, as it already happened with the int96 timestamp type). Using these two types, SQL engines will be able to unambiguously store their timestamp type regardless of its semantics. However, the TIMESTAMP type should follow TIMESTAMP WITHOUT TIMEZONE semantics for consistency with other SQL engines. The TIMESTAMP WITH TIMEZONE semantics should be implemented as a new SQL type with a matching name. While this is a nice and clean long-term solution, a short-term fix is also desired until the new types become widely supported and/or to allow dealing with existing data. The commit in question is a part of this short-term fix and it allows getting correct values when reading int96 timestamps, even for data written by other components.
bq. it completely changes the behavior of one of the most important data types.
A very important aspect of this fix is that it does not change SparkSQL's behavior unless the user sets a table property, so it's a completely safe and non-breaking change.
bq. One of the fundamental problem is that Spark treats timestamp as timestamp with timezone, whereas impala treats timestamp as timestamp without timezone. The parquet storage is only a small piece here.
The fix only addresses Parquet timestamps indeed. This, however, is intentional and is not a limitation, nor an inconsistency, as the problem seems to be specific to Parquet. My understanding is that for other file formats, SparkSQL follows timezone-agnostic (TIMESTAMP WITHOUT TIMEZONE) semantics and my experiments with the CSV and Avro formats seem to confirm this. So using UTC-normalized (TIMESTAMP WITH TIMEZONE) semantics in Parquet is not only incompatible with Impala but is also inconsistent within SparkSQL itself.
bq. Also this is not just a Parquet issue. The same issue could happen to all data formats. It is going to be really confusing to have something that only works for Parquet
The current behavior of SparkSQL already seems to be different for Parquet than for other formats. The fix allows the user to choose a consistent and less confusing behaviour instead. It also makes Impala, Hive and SparkSQL compatible with each other regarding int96 timestamps.
bq. It seems like the purpose of this patch can be accomplished by just setting the session local timezone to UTC?
Unfortunately that would not suffice. The problem has to be addressed in all SQL engines.
As of today, Hive and Impala already contains the changes that allow interoperability using the parquet.mr.int96.write.zone table property: * Hive: ** https://github.com/apache/hive/commit/84fdc1c7c8ff0922aa44f829dbfa9659935c503e ** https://github.com/apache/hive/commit/a1cbccb8dad1824f978205a1e93ec01e87ed8ed5 ** https://github.com/apache/hive/commit/2dfcea5a95b7d623484b8be50755b817fbc91ce0 ** https://github.com/apache/hive/commit/78e29fc70dacec498c35dc556dd7403e4c9f48fe * Impala: ** https://github.com/apache/incubator-impala/commit/5803a0b0744ddaee6830d4a1bc8dba8d3f2caa26 was (Author: zi): bq. It'd be great to consider this more holistically and think about alternatives in fixing them As Ryan mentioned, the Parquet community discussed this timestamp incompatibilty problem with the aim of avoiding similar problems in the future. It was decided that the specification needs to include two separate types with well-defined semantics: one for timezone-agnostic (aka. TIMESTAMP WITHOUT TIMEZONE) and one for UTC-normalized (aka. TIMESTAMP WITH TIMEZONE) timestamps. (Otherwise implementors would be tempted to misuse the single existing type for storing timestamps of different semantics, as it already happened with the int96 timestamp type). While this is a nice and clean long-term solution, a short-term fix is also desired until the new types become widely supported and/or to allow dealing with existing data. The commit in question is a part of this short-term fix and it allows getting correct values when reading int96 timestamps, even for data written by other components. bq. it completely changes the behavior of one of the most i
[jira] [Commented] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004607#comment-16004607 ] Zoltan Ivanfi commented on SPARK-12297: ---
bq. It'd be great to consider this more holistically and think about alternatives in fixing them
As Ryan mentioned, the Parquet community discussed this timestamp incompatibility problem with the aim of avoiding similar problems in the future. It was decided that the specification needs to include two separate types with well-defined semantics: one for timezone-agnostic (aka. TIMESTAMP WITHOUT TIMEZONE) and one for UTC-normalized (aka. TIMESTAMP WITH TIMEZONE) timestamps. (Otherwise implementors would be tempted to misuse the single existing type for storing timestamps of different semantics, as it already happened with the int96 timestamp type). While this is a nice and clean long-term solution, a short-term fix is also desired until the new types become widely supported and/or to allow dealing with existing data. The commit in question is a part of this short-term fix and it allows getting correct values when reading int96 timestamps, even for data written by other components.
bq. it completely changes the behavior of one of the most important data types.
A very important aspect of this fix is that it does not change SparkSQL's behavior unless the user sets a table property, so it's a completely safe and non-breaking change.
bq. One of the fundamental problem is that Spark treats timestamp as timestamp with timezone, whereas impala treats timestamp as timestamp without timezone. The parquet storage is only a small piece here.
The fix only addresses Parquet timestamps indeed. This, however, is intentional and is not a limitation, nor an inconsistency. The problem in fact is specific to Parquet. For other file formats (for example CSV or Avro), SparkSQL follows timezone-agnostic (TIMESTAMP WITHOUT TIMEZONE) semantics. So using UTC-normalized (TIMESTAMP WITH TIMEZONE) semantics in Parquet is not only incompatible with Impala but is also inconsistent within SparkSQL itself.
bq. Also this is not just a Parquet issue. The same issue could happen to all data formats. It is going to be really confusing to have something that only works for Parquet
In fact, the current behavior of SparkSQL is different for Parquet than for other formats. The fix allows the user to choose a consistent and less confusing behaviour instead. It also makes Impala, Hive and SparkSQL compatible with each other regarding int96 timestamps.
bq. It seems like the purpose of this patch can be accomplished by just setting the session local timezone to UTC?
Unfortunately that would not suffice. The problem has to be addressed in all SQL engines. As of today, Hive and Impala already contain the changes that allow interoperability using the parquet.mr.int96.write.zone table property:
* Hive:
** https://github.com/apache/hive/commit/84fdc1c7c8ff0922aa44f829dbfa9659935c503e
** https://github.com/apache/hive/commit/a1cbccb8dad1824f978205a1e93ec01e87ed8ed5
** https://github.com/apache/hive/commit/2dfcea5a95b7d623484b8be50755b817fbc91ce0
** https://github.com/apache/hive/commit/78e29fc70dacec498c35dc556dd7403e4c9f48fe
* Impala:
** https://github.com/apache/incubator-impala/commit/5803a0b0744ddaee6830d4a1bc8dba8d3f2caa26
> Add work-around for Parquet/Hive int96 timestamp bug.
> - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue > > Spark copied Hive's behavior for parquet, but this was inconsistent with > other file formats, and inconsistent with Impala (which is the original > source of putting a timestamp as an int96 in parquet, I believe). This made > timestamps in parquet act more like timestamps with timezones, while in other > file formats, timestamps have no time zone, they are a "floating time". > The easiest way to see this issue is to write out a table with timestamps in > multiple different formats from one timezone, then try to read them back in > another timezone. Eg., here I write out a few timestamps to parquet and > textfile hive tables, and also just as a json file, all in the > "America/Los_Angeles" timezone: > {code} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val tblPrefix = args(0) > val schema = new StructType().add("ts", TimestampType) > val rows = sc.parallelize(Seq( > "2015-12-31 23:50:59.123", > "2015-12-31 22:49:59.123", > "2016-01-01 00:39:59.123", > "2016-01-01 01:29:59.123" > ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) > val rawData = spark.createDataFrame(rows, schema).toDF() > rawData.show() > Seq("parquet", "textfile").foreach { forma