[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...
Github user seancxmao closed the pull request at: https://github.com/apache/spark/pull/22184

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user yucai commented on a diff in the pull request: https://github.com/apache/spark/pull/22184#discussion_r213581563

--- Diff: docs/sql-programming-guide.md ---
@@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
 - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
 - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
+## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
+
+ - In version 2.3.1 and earlier, when reading from a Parquet table, Spark always returns null for any column whose column names in Hive metastore schema and Parquet schema are in different letter cases, no matter whether `spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when `spark.sql.caseSensitive` is set to false, Spark does case insensitive column name resolution between Hive metastore schema and Parquet schema, so even column names are in different letter cases, Spark returns corresponding column values. An exception is thrown if there is ambiguity, i.e. more than one Parquet column is matched.
--- End diff --

The testing is based on `spark.sql.hive.convertMetastoreParquet` being set to false, so it uses the Hive serde reader instead of the Spark reader; sorry if that is confusing here. I guess you mean 1 and 3 :). I understand now. If we are not going to backport the PR to 2.3, should I also close SPARK-25206?
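The resolution rule described in the migration note can be sketched as a small standalone function. This is a minimal, illustrative sketch of the documented behavior, not Spark's actual implementation; `resolve_field` and its signature are made up for this example.

```python
def resolve_field(metastore_name, parquet_fields, case_sensitive):
    """Resolve a Hive metastore column name against physical Parquet field names.

    Illustrative sketch of the behavior in the migration note, not Spark code.
    """
    if case_sensitive:
        # Case-sensitive mode: only an exact-case match resolves.
        return metastore_name if metastore_name in parquet_fields else None
    matches = [f for f in parquet_fields if f.lower() == metastore_name.lower()]
    if len(matches) > 1:
        # Since 2.3.2: ambiguity raises instead of silently picking a column.
        raise RuntimeError(f'Found duplicate field(s) "{metastore_name}": {matches}')
    # No match reads as a null column (the pre-2.3.2 behavior for all
    # differently-cased names); a single match returns the physical field.
    return matches[0] if matches else None
```

For example, with physical Parquet fields `["a", "B"]`, the metastore column `b` resolves to `B` only in case-insensitive mode, while `["c", "C"]` is ambiguous for `c` and raises.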
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22184#discussion_r213569148

https://github.com/apache/spark/pull/22184#discussion_r212405373 already shows they are not consistent, right?
Github user yucai commented on a diff in the pull request: https://github.com/apache/spark/pull/22184#discussion_r213519348

@gatorsmile I think 1 and 2 are always consistent. They both use the Spark reader. Am I wrong?

> parquet table created by Spark (using parquet) read by Spark reader
> parquet table created by Spark (using hive) read by Spark reader
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22184#discussion_r213426988

BTW, the parquet table could be generated by our DataFrameWriter. Thus, the physical schema and logical schema could still have different cases.
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22184#discussion_r213426538

Making 1, 2 consistent is enough. : )
Github user yucai commented on a diff in the pull request: https://github.com/apache/spark/pull/22184#discussion_r213386126

> For Spark native parquet tables that were created by us, this is a bug fix because the previous work does not respect spark.sql.caseSensitive; for the parquet tables created by Hive, the field resolution should be consistent no matter whether it is using our reader or Hive parquet reader.

@gatorsmile, I need to confirm with you: regarding consistency, we have several kinds of tables.

1. parquet table created by Spark (using parquet) read by Spark reader
2. parquet table created by Spark (using hive) read by Spark reader
3. parquet table created by Spark (using hive) read by Hive reader
4. parquet table created by Hive read by Spark reader
5. parquet table created by Hive read by Hive reader

Do you want all of them to be consistent? Or is making 2, 3, 4, 5 consistent enough?
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22184#discussion_r213135626

For Hive tables, column resolution is always case insensitive. However, when `spark.sql.hive.convertMetastoreParquet` is true, users might face inconsistent behaviors when they use the native parquet reader to resolve the columns in case sensitive mode. We still introduce behavior changes. Better error messages sound good enough, instead of disabling `spark.sql.hive.convertMetastoreParquet` when the mode is case sensitive. cc @cloud-fan
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/22184#discussion_r213020789

As a followup to cloud-fan's point, I did a deep dive into the read path of parquet hive serde tables. Following is a rough invocation chain:

```
org.apache.spark.sql.hive.execution.HiveTableScanExec
org.apache.spark.sql.hive.HadoopTableReader (extends org.apache.spark.sql.hive.TableReader)
org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat (extends org.apache.hadoop.mapred.FileInputFormat)
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper (extends org.apache.hadoop.mapred.RecordReader)
parquet.hadoop.ParquetRecordReader
parquet.hadoop.InternalParquetRecordReader
org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport (extends parquet.hadoop.api.ReadSupport)
```

Finally, `DataWritableReadSupport#getFieldTypeIgnoreCase` is invoked: https://github.com/JoshRosen/hive/blob/release-1.2.1-spark2/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java#L79-L95

This is why parquet hive serde tables always do case-insensitive field resolution. However, this class lives inside `org.spark-project.hive:hive-exec:1.2.1.spark2`. I also found the related Hive JIRA ticket: [HIVE-7554: Parquet Hive should resolve column names in case insensitive manner](https://issues.apache.org/jira/browse/HIVE-7554)

BTW:
* org.apache.hadoop.hive.ql = org.spark-project.hive:hive-exec:1.2.1.spark2
* parquet.hadoop = com.twitter:parquet-hadoop-bundle:1.6.0
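The lookup that `getFieldTypeIgnoreCase` performs can be approximated as follows. This is a sketch of the effective behavior (first file-schema field matching ignoring case, with no case-sensitivity switch at all), not the real Hive code.

```python
def get_field_ignore_case(file_fields, name):
    """Approximation of Hive's always-case-insensitive field lookup:
    return the first field whose name matches ignoring case, else None."""
    for field in file_fields:
        if field.lower() == name.lower():
            return field
    return None
```

Note that with duplicate differently-cased fields (e.g. `c` and `C`), this silently returns the first one rather than raising, which is one source of the inconsistency discussed above.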
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/22184#discussion_r212894532

As a followup, I also did investigation about ORC. Below are some results. Just FYI.

* https://issues.apache.org/jira/browse/SPARK-25175?focusedCommentId=16593185&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16593185
* https://issues.apache.org/jira/browse/SPARK-25175?focusedCommentId=16593194&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16593194
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22184#discussion_r212849852

We rely on the hive parquet serde to read hive parquet tables, and I don't think we are able to change it. The only way I can think of to make it consistent between data source tables and hive tables is to make sure `spark.sql.hive.convertMetastoreParquet` always works.
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22184#discussion_r212834530

In general, my suggestion is to respect `spark.sql.caseSensitive` for both readers. Technically, is it possible?
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22184#discussion_r212834477

@cloud-fan We need to keep the behaviors consistent no matter whether we use the Hive serde reader or our native parquet reader. In the PR https://github.com/apache/spark/pull/22148, we already introduced a change for hive tables when `spark.sql.hive.convertMetastoreParquet` is set to true, right? For Spark native parquet tables that were created by us, this is a bug fix because the previous work does not respect `spark.sql.caseSensitive`; for the parquet tables created by Hive, the field resolution should be consistent no matter whether it is using our reader or the Hive parquet reader. Most end users do not know the difference between the Hive serde reader and the native parquet reader.
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22184#discussion_r212662840

First, we should not change the behavior of hive tables. It inherits many behaviors from Hive, and let's keep it as it was. Second, why do we treat it as a behavior change? I think it's a bug that we don't respect `spark.sql.caseSensitive` in field resolution. In general we should not add a config to restore a bug. I don't think this document is helpful. It explains a subtle and unreasonable behavior to users, which IMO just makes them confused.
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22184#discussion_r212533857

Could you add a test case for the one you did?
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22184#discussion_r212533706

We should respect `spark.sql.caseSensitive` in both modes, but also add a legacy SQLConf to enable users to revert back to the previous behavior.
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/22184#discussion_r212405373

Following your advice, I did a thorough comparison between data source table and hive serde table.
Parquet data and tables are created via the following code:

```
val data = spark.range(5).selectExpr("id as a", "id * 2 as B", "id * 3 as c", "id * 4 as C")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("parquet").mode("overwrite").save("/user/hive/warehouse/parquet_data")

CREATE TABLE parquet_data_source_lower (a LONG, b LONG, c LONG) USING parquet LOCATION '/user/hive/warehouse/parquet_data'
CREATE TABLE parquet_data_source_upper (A LONG, B LONG, C LONG) USING parquet LOCATION '/user/hive/warehouse/parquet_data'
CREATE TABLE parquet_hive_serde_lower (a LONG, b LONG, c LONG) STORED AS parquet LOCATION '/user/hive/warehouse/parquet_data'
CREATE TABLE parquet_hive_serde_upper (A LONG, B LONG, C LONG) STORED AS parquet LOCATION '/user/hive/warehouse/parquet_data'
```

`spark.sql.hive.convertMetastoreParquet` is set to false:

```
spark.conf.set("spark.sql.hive.convertMetastoreParquet", false)
```

Below are the comparison results both without #22148 and with #22148.

The comparison result without #22148:

|no.|caseSensitive|table columns|select column|parquet column (select via data source table)|parquet column (select via hive serde table)|consistent?|resolved by SPARK-25132|
|---|---|---|---|---|---|---|---|
|1|true|a, b, c|a|a|a|Y| |
|2| | |b|null|B|NG| |
|3| | |c|c|c|Y| |
|4| | |A|AnalysisException|AnalysisException|Y| |
|5| | |B|AnalysisException|AnalysisException|Y| |
|6| | |C|AnalysisException|AnalysisException|Y| |
|7| |A, B, C|a|AnalysisException|AnalysisException|Y| |
|8| | |b|AnalysisException|AnalysisException|Y| |
|9| | |c|AnalysisException|AnalysisException|Y| |
|10| | |A|null|a|NG| |
|11| | |B|B|B|Y| |
|12| | |C|C|c|NG| |
|13|false|a, b, c|a|a|a|Y| |
|14| | |b|null|B|NG|Y|
|15| | |c|c|c|Y| |
|16| | |A|a|a|Y| |
|17| | |B|null|B|NG|Y|
|18| | |C|c|c|Y| |
|19| |A, B, C|a|null|a|NG|Y|
|20| | |b|B|B|Y| |
|21| | |c|C|c|NG| |
|22| | |A|null|a|NG|Y|
|23| | |B|B|B|Y| |
|24| | |C|C|c|NG| |
The comparison result with #22148 applied:

|no.|caseSensitive|table columns|select column|parquet column (select via data source table)|parquet column (select via hive serde table)|consistent?|introduced by SPARK-25132|
|---|---|---|---|---|---|---|---|
|1|true|a, b, c|a|a|a|Y| |
|2| | |b|null|B|NG| |
|3| | |c|c|c|Y| |
|4| | |A|AnalysisException|AnalysisException|Y| |
|5| | |B|AnalysisException|AnalysisException|Y| |
|6| | |C|AnalysisException|AnalysisException|Y| |
|7| |A, B, C|a|AnalysisException|AnalysisException|Y| |
|8| | |b|AnalysisException|AnalysisException|Y| |
|9| | |c|AnalysisException|AnalysisException|Y| |
|10| | |A|null|a|NG| |
|11| | |B|B|B|Y| |
|12| | |C|C|c|NG| |
|13|false|a, b, c|a|a|a|Y| |
|14| | |b|B|B|Y| |
|15| | |c|RuntimeException|c|NG|Y|
|16| | |A|a|a|Y| |
|17| | |B|B|B|Y| |
|18| | |C|RuntimeException|c|NG|Y|
|19| |A, B, C|a|a|a|Y| |
|20| | |b|B|B|Y| |
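The RuntimeException rows in the table above come from ambiguous matches. A minimal sketch of that resolution rule (hypothetical code, not the Spark source; it models the lookup against the physical Parquet fields `a`, `B`, `c`, `C` written earlier):

```scala
// Hypothetical sketch of the case-(in)sensitive field resolution that
// SPARK-25132 applies between a requested column name and the physical
// Parquet columns. The Parquet file written above has fields a, B, c, C.
object FieldResolutionSketch {
  val parquetFields = Seq("a", "B", "c", "C")

  // Returns Right(field) on a unique match, Left(error) when zero or
  // multiple Parquet fields match the requested name.
  def resolve(requested: String, caseSensitive: Boolean): Either[String, String] = {
    val matches =
      if (caseSensitive) parquetFields.filter(_ == requested)
      else parquetFields.filter(_.equalsIgnoreCase(requested))
    matches match {
      case Seq(one) => Right(one)
      case Seq()    => Left(s"no Parquet field matches '$requested'")
      case many     => Left(s"ambiguous: '$requested' matches ${many.mkString(", ")}")
    }
  }

  def main(args: Array[String]): Unit = {
    // Unique case-insensitive match: b resolves to Parquet column B (row 14).
    println(resolve("b", caseSensitive = false))
    // Both c and C match case-insensitively -> error (rows 15 and 18).
    println(resolve("c", caseSensitive = false))
    // Exact match only under case sensitivity.
    println(resolve("c", caseSensitive = true))
  }
}
```

This is why rows 15 and 18 fail only in case-insensitive mode: the file genuinely contains two columns that differ only by letter case.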
[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22184#discussion_r212006137

--- Diff: docs/sql-programming-guide.md ---
@@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
 - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
 - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
+## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
+
+ - In version 2.3.1 and earlier, when reading from a Parquet table, Spark always returns null for any column whose column names in Hive metastore schema and Parquet schema are in different letter cases, no matter whether `spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when `spark.sql.caseSensitive` is set to false, Spark does case insensitive column name resolution between Hive metastore schema and Parquet schema, so even column names are in different letter cases, Spark returns corresponding column values. An exception is thrown if there is ambiguity, i.e. more than one Parquet column is matched.
--- End diff --

This is a behavior change. I am not sure whether we should backport it to 2.3.2. How about sending a note to the dev mailing list?

BTW, this only affects data source tables. How about Hive serde tables? Are they consistent? Could you add a test case? Create a table with syntax like `CREATE TABLE ... STORED AS PARQUET`. You also need to turn off `spark.sql.hive.convertMetastoreParquet` in the test case.

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
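The behavior change quoted in the migration note above can be exercised against the tables from the earlier comment. This is an illustrative fragment only, not from the PR; it assumes an active `SparkSession` bound to `spark` and the mixed-case Parquet data with physical columns `a`, `B`, `c`, `C`:

```scala
// Sketch only: requires a running SparkSession and the tables created earlier.
spark.conf.set("spark.sql.caseSensitive", "false")

// 2.3.1 and earlier: `b` in the metastore schema never matched Parquet
// column `B`, so this returned null for every row.
// Since 2.3.2 (SPARK-25132): `b` resolves case-insensitively to `B`.
spark.table("parquet_data_source_lower").select("b").show()

// Ambiguous case: the Parquet file contains both `c` and `C`, so a
// case-insensitive lookup of `c` matches two columns and Spark throws
// an exception instead of silently picking one.
spark.table("parquet_data_source_lower").select("c").show()
```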
[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...
GitHub user seancxmao opened a pull request: https://github.com/apache/spark/pull/22184

[SPARK-25132][SQL][DOC] Add migration doc for case-insensitive field resolution when reading from Parquet

## What changes were proposed in this pull request?

#22148 introduces a behavior change. We need to document it in the migration guide.

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/seancxmao/spark SPARK-25132-DOC

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22184.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22184

commit eae8a3c98f146765d25bbf529421ce3c7a92639b
Author: seancxmao
Date: 2018-08-22T09:17:55Z

    [SPARK-25132][SQL][DOC] Case-insensitive field resolution when reading from Parquet

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org