[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

2018-12-05 Thread seancxmao
Github user seancxmao closed the pull request at:

https://github.com/apache/spark/pull/22184


---




[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

2018-08-29 Thread yucai
Github user yucai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22184#discussion_r213581563
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
 
+## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
+
+  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark 
always returns null for any column whose column names in Hive metastore schema 
and Parquet schema are in different letter cases, no matter whether 
`spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when 
`spark.sql.caseSensitive` is set to false, Spark does case insensitive column 
name resolution between Hive metastore schema and Parquet schema, so even 
column names are in different letter cases, Spark returns corresponding column 
values. An exception is thrown if there is ambiguity, i.e. more than one 
Parquet column is matched.
--- End diff --

The testing is based on `spark.sql.hive.convertMetastoreParquet` being set to 
false, so it should use the Hive serde reader instead of the Spark reader; sorry 
if that is confusing.
I guess you mean 1 and 3 :). I understand now.

If we are not going to backport the PR to 2.3, can I also close SPARK-25206?


---




[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

2018-08-29 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22184#discussion_r213569148
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
 
+## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
+
+  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark 
always returns null for any column whose column names in Hive metastore schema 
and Parquet schema are in different letter cases, no matter whether 
`spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when 
`spark.sql.caseSensitive` is set to false, Spark does case insensitive column 
name resolution between Hive metastore schema and Parquet schema, so even 
column names are in different letter cases, Spark returns corresponding column 
values. An exception is thrown if there is ambiguity, i.e. more than one 
Parquet column is matched.
--- End diff --

https://github.com/apache/spark/pull/22184#discussion_r212405373 already 
shows they are not consistent, right?


---




[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

2018-08-28 Thread yucai
Github user yucai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22184#discussion_r213519348
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
 
+## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
+
+  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark 
always returns null for any column whose column names in Hive metastore schema 
and Parquet schema are in different letter cases, no matter whether 
`spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when 
`spark.sql.caseSensitive` is set to false, Spark does case insensitive column 
name resolution between Hive metastore schema and Parquet schema, so even 
column names are in different letter cases, Spark returns corresponding column 
values. An exception is thrown if there is ambiguity, i.e. more than one 
Parquet column is matched.
--- End diff --

@gatorsmile I think 1 and 2 are always consistent. They both use the Spark 
reader. Am I wrong?
> parquet table created by Spark (using parquet) read by Spark reader
> parquet table created by Spark (using hive) read by Spark reader


---




[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

2018-08-28 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22184#discussion_r213426988
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
 
+## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
+
+  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark 
always returns null for any column whose column names in Hive metastore schema 
and Parquet schema are in different letter cases, no matter whether 
`spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when 
`spark.sql.caseSensitive` is set to false, Spark does case insensitive column 
name resolution between Hive metastore schema and Parquet schema, so even 
column names are in different letter cases, Spark returns corresponding column 
values. An exception is thrown if there is ambiguity, i.e. more than one 
Parquet column is matched.
--- End diff --

BTW, the parquet table could be generated by our DataFrameWriter. Thus, the 
physical schema and logical schema could still have different cases. 
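A minimal sketch of that scenario (path and table name are illustrative): the 
Parquet footer keeps the mixed-case names that `DataFrameWriter` wrote, while 
the metastore schema declares lower-case columns.

```
// Physical (file) schema keeps the letter case used at write time: A, B.
spark.range(3).selectExpr("id AS A", "id * 2 AS B")
  .write.mode("overwrite").parquet("/tmp/df_written_parquet")
// Logical (metastore) schema declares lower-case names: a, b.
spark.sql("CREATE TABLE t_df (a LONG, b LONG) USING parquet " +
  "LOCATION '/tmp/df_written_parquet'")
```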


---




[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

2018-08-28 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22184#discussion_r213426538
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
 
+## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
+
+  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark 
always returns null for any column whose column names in Hive metastore schema 
and Parquet schema are in different letter cases, no matter whether 
`spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when 
`spark.sql.caseSensitive` is set to false, Spark does case insensitive column 
name resolution between Hive metastore schema and Parquet schema, so even 
column names are in different letter cases, Spark returns corresponding column 
values. An exception is thrown if there is ambiguity, i.e. more than one 
Parquet column is matched.
--- End diff --

Making 1, 2 consistent is enough. : )


---




[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

2018-08-28 Thread yucai
Github user yucai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22184#discussion_r213386126
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
 
+## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
+
+  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark 
always returns null for any column whose column names in Hive metastore schema 
and Parquet schema are in different letter cases, no matter whether 
`spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when 
`spark.sql.caseSensitive` is set to false, Spark does case insensitive column 
name resolution between Hive metastore schema and Parquet schema, so even 
column names are in different letter cases, Spark returns corresponding column 
values. An exception is thrown if there is ambiguity, i.e. more than one 
Parquet column is matched.
--- End diff --

> For Spark native parquet tables that were created by us, this is a bug 
fix because the previous work does not respect spark.sql.caseSensitive; for the 
parquet tables created by Hive, the field resolution should be consistent no 
matter whether it is using our reader or Hive parquet reader. 

@gatorsmile, I need to confirm with you: regarding consistency, we have several 
kinds of tables.

1. parquet table created by Spark (using parquet) read by Spark reader
2. parquet table created by Spark (using hive) read by Spark reader
3. parquet table created by Spark (using hive) read by Hive reader
4. parquet table created by Hive read by Spark reader
5. parquet table created by Hive read by Hive reader

Do you want all of them to be consistent? Or is it enough for 2, 3, 4, 5 to be consistent?
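
For concreteness, a sketch of how kinds 1-3 differ only in DDL and in the 
read-path config (table names are illustrative; kinds 4 and 5 would be created 
from Hive itself):

```
// Kind 1: data source table, always read by the Spark reader.
spark.sql("CREATE TABLE t_ds (a LONG) USING parquet")
// Kinds 2 and 3: Hive serde table; the reader depends on the conversion flag.
spark.sql("CREATE TABLE t_hive (a LONG) STORED AS PARQUET")
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")   // kind 2: Spark reader
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")  // kind 3: Hive reader
```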


---




[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

2018-08-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22184#discussion_r213135626
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
 
+## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
+
+  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark 
always returns null for any column whose column names in Hive metastore schema 
and Parquet schema are in different letter cases, no matter whether 
`spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when 
`spark.sql.caseSensitive` is set to false, Spark does case insensitive column 
name resolution between Hive metastore schema and Parquet schema, so even 
column names are in different letter cases, Spark returns corresponding column 
values. An exception is thrown if there is ambiguity, i.e. more than one 
Parquet column is matched.
--- End diff --

For Hive tables, column resolution is always case insensitive. However, when 
`spark.sql.hive.convertMetastoreParquet` is true, users might face inconsistent 
behaviors when the native parquet reader resolves the columns in case-sensitive 
mode. We still introduce behavior changes. Better error messages sound good 
enough, instead of disabling `spark.sql.hive.convertMetastoreParquet` when the 
mode is case sensitive. cc @cloud-fan
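
One concrete instance of the ambiguity the migration note mentions (a sketch; 
names and paths are illustrative, and the expected outcomes follow the 
comparison tables later in the thread):

```
spark.conf.set("spark.sql.caseSensitive", "true")   // allow both `c` and `C` at write time
spark.range(3).selectExpr("id AS c", "id * 2 AS C")
  .write.mode("overwrite").parquet("/tmp/ambiguous_case")
spark.sql("CREATE TABLE t_amb (c LONG) USING parquet LOCATION '/tmp/ambiguous_case'")
spark.conf.set("spark.sql.caseSensitive", "false")
// With #22148, the native reader throws instead of guessing which file column
// (`c` or `C`) should back the table column `c`.
spark.sql("SELECT c FROM t_amb").show()
```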


---




[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

2018-08-27 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22184#discussion_r213020789
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
 
+## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
+
+  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark 
always returns null for any column whose column names in Hive metastore schema 
and Parquet schema are in different letter cases, no matter whether 
`spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when 
`spark.sql.caseSensitive` is set to false, Spark does case insensitive column 
name resolution between Hive metastore schema and Parquet schema, so even 
column names are in different letter cases, Spark returns corresponding column 
values. An exception is thrown if there is ambiguity, i.e. more than one 
Parquet column is matched.
--- End diff --

As a follow-up to cloud-fan's point, I did a deep dive into the read path of 
parquet hive serde tables. The following is a rough invocation chain:

```
org.apache.spark.sql.hive.execution.HiveTableScanExec
org.apache.spark.sql.hive.HadoopTableReader (extends org.apache.spark.sql.hive.TableReader)
org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat (extends org.apache.hadoop.mapred.FileInputFormat)
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper (extends org.apache.hadoop.mapred.RecordReader)
parquet.hadoop.ParquetRecordReader
parquet.hadoop.InternalParquetRecordReader
org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport (extends parquet.hadoop.api.ReadSupport)
```

Finally, `DataWritableReadSupport#getFieldTypeIgnoreCase` is invoked. 


https://github.com/JoshRosen/hive/blob/release-1.2.1-spark2/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java#L79-L95

This is why parquet hive serde tables always do case-insensitive field 
resolution. Note, however, that this class lives inside 
`org.spark-project.hive:hive-exec:1.2.1.spark2`.
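
A rough paraphrase of what that method does (a Scala sketch based on the linked 
source, not the actual Hive Java code):

```
// Resolve a requested column against the file schema's field names,
// ignoring letter case; Hive's version returns the matched field's type.
def findFieldIgnoreCase(fileSchemaFields: Seq[String], name: String): Option[String] =
  fileSchemaFields.find(_.equalsIgnoreCase(name))
```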

I also found the related Hive JIRA ticket:
[HIVE-7554: Parquet Hive should resolve column names in case insensitive 
manner](https://issues.apache.org/jira/browse/HIVE-7554)

BTW:
* org.apache.hadoop.hive.ql = org.spark-project.hive:hive-exec:1.2.1.spark2
* parquet.hadoop = com.twitter:parquet-hadoop-bundle:1.6.0


---




[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

2018-08-27 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22184#discussion_r212894532
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
 
+## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
+
+  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark 
always returns null for any column whose column names in Hive metastore schema 
and Parquet schema are in different letter cases, no matter whether 
`spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when 
`spark.sql.caseSensitive` is set to false, Spark does case insensitive column 
name resolution between Hive metastore schema and Parquet schema, so even 
column names are in different letter cases, Spark returns corresponding column 
values. An exception is thrown if there is ambiguity, i.e. more than one 
Parquet column is matched.
--- End diff --

As a followup, I also did investigation about ORC. Below are some results. 
Just FYI.

* 
https://issues.apache.org/jira/browse/SPARK-25175?focusedCommentId=16593185&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16593185
* 
https://issues.apache.org/jira/browse/SPARK-25175?focusedCommentId=16593194&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16593194


---




[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

2018-08-26 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22184#discussion_r212849852
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
 
+## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
+
+  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark 
always returns null for any column whose column names in Hive metastore schema 
and Parquet schema are in different letter cases, no matter whether 
`spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when 
`spark.sql.caseSensitive` is set to false, Spark does case insensitive column 
name resolution between Hive metastore schema and Parquet schema, so even 
column names are in different letter cases, Spark returns corresponding column 
values. An exception is thrown if there is ambiguity, i.e. more than one 
Parquet column is matched.
--- End diff --

We rely on the hive parquet serde to read hive parquet tables, and I don't 
think we are able to change it. The only way I can think of to make it 
consistent between data source tables and hive tables is to make sure 
`spark.sql.hive.convertMetastoreParquet` always works.


---




[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

2018-08-26 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22184#discussion_r212834530
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
 
+## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
+
+  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark 
always returns null for any column whose column names in Hive metastore schema 
and Parquet schema are in different letter cases, no matter whether 
`spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when 
`spark.sql.caseSensitive` is set to false, Spark does case insensitive column 
name resolution between Hive metastore schema and Parquet schema, so even 
column names are in different letter cases, Spark returns corresponding column 
values. An exception is thrown if there is ambiguity, i.e. more than one 
Parquet column is matched.
--- End diff --

In general, my suggestion is to respect `spark.sql.caseSensitive` for both 
readers. Technically, is it possible? 


---




[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

2018-08-26 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22184#discussion_r212834477
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
 
+## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
+
+  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark 
always returns null for any column whose column names in Hive metastore schema 
and Parquet schema are in different letter cases, no matter whether 
`spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when 
`spark.sql.caseSensitive` is set to false, Spark does case insensitive column 
name resolution between Hive metastore schema and Parquet schema, so even 
column names are in different letter cases, Spark returns corresponding column 
values. An exception is thrown if there is ambiguity, i.e. more than one 
Parquet column is matched.
--- End diff --

@cloud-fan We need to keep the behaviors consistent no matter whether we use 
the Hive serde reader or our native parquet reader. In the PR 
https://github.com/apache/spark/pull/22148, we already introduced a change for 
hive tables when `spark.sql.hive.convertMetastoreParquet` is set to true, right?

For Spark native parquet tables that were created by us, this is a bug fix 
because the previous work does not respect `spark.sql.caseSensitive`; for the 
parquet tables created by Hive, the field resolution should be consistent no 
matter whether it is using our reader or the Hive parquet reader. Most end 
users do not know the difference between the Hive serde reader and the native 
parquet reader.


---




[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

2018-08-24 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22184#discussion_r212662840
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
 
+## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
+
+  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark 
always returns null for any column whose column names in Hive metastore schema 
and Parquet schema are in different letter cases, no matter whether 
`spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when 
`spark.sql.caseSensitive` is set to false, Spark does case insensitive column 
name resolution between Hive metastore schema and Parquet schema, so even 
column names are in different letter cases, Spark returns corresponding column 
values. An exception is thrown if there is ambiguity, i.e. more than one 
Parquet column is matched.
--- End diff --

First, we should not change the behavior of hive tables. They inherit many 
behaviors from Hive; let's keep them as they were.

Second, why do we treat it as a behavior change? I think it's a bug that we 
don't respect `spark.sql.caseSensitive` in field resolution. In general we 
should not add a config to restore a bug.

I don't think this document is helpful. It explains a subtle and unreasonable 
behavior to users, which IMO just makes them confused.


---




[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

2018-08-23 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22184#discussion_r212533857
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
 
+## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
+
+  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark 
always returns null for any column whose column names in Hive metastore schema 
and Parquet schema are in different letter cases, no matter whether 
`spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when 
`spark.sql.caseSensitive` is set to false, Spark does case insensitive column 
name resolution between Hive metastore schema and Parquet schema, so even 
column names are in different letter cases, Spark returns corresponding column 
values. An exception is thrown if there is ambiguity, i.e. more than one 
Parquet column is matched.
--- End diff --

Could you add a test case for the one you did?


---




[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

2018-08-23 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22184#discussion_r212533706
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
 
+## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
+
+  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark 
always returns null for any column whose column names in Hive metastore schema 
and Parquet schema are in different letter cases, no matter whether 
`spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when 
`spark.sql.caseSensitive` is set to false, Spark does case insensitive column 
name resolution between Hive metastore schema and Parquet schema, so even 
column names are in different letter cases, Spark returns corresponding column 
values. An exception is thrown if there is ambiguity, i.e. more than one 
Parquet column is matched.
--- End diff --

We should respect `spark.sql.caseSensitive` in both modes, but also add a 
legacy SQLConf to enable users to revert to the previous behavior.


---




[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

2018-08-23 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22184#discussion_r212405373
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
 
+## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
+
+  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark 
always returns null for any column whose column names in Hive metastore schema 
and Parquet schema are in different letter cases, no matter whether 
`spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when 
`spark.sql.caseSensitive` is set to false, Spark does case insensitive column 
name resolution between Hive metastore schema and Parquet schema, so even 
column names are in different letter cases, Spark returns corresponding column 
values. An exception is thrown if there is ambiguity, i.e. more than one 
Parquet column is matched.
--- End diff --

Following your advice, I did a thorough comparison between data source 
tables and hive serde tables.

Parquet data and tables are created via the following code:

```
val data = spark.range(5).selectExpr("id as a", "id * 2 as B", "id * 3 as c", "id * 4 as C")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("parquet").mode("overwrite").save("/user/hive/warehouse/parquet_data")

CREATE TABLE parquet_data_source_lower (a LONG, b LONG, c LONG) USING parquet LOCATION '/user/hive/warehouse/parquet_data'
CREATE TABLE parquet_data_source_upper (A LONG, B LONG, C LONG) USING parquet LOCATION '/user/hive/warehouse/parquet_data'
CREATE TABLE parquet_hive_serde_lower (a LONG, b LONG, c LONG) STORED AS parquet LOCATION '/user/hive/warehouse/parquet_data'
CREATE TABLE parquet_hive_serde_upper (A LONG, B LONG, C LONG) STORED AS parquet LOCATION '/user/hive/warehouse/parquet_data'
```

`spark.sql.hive.convertMetastoreParquet` is set to false:

```
spark.conf.set("spark.sql.hive.convertMetastoreParquet", false)
```

Below are the comparison results both without #22148 and with #22148.

The comparison result without #22148:

|no.|caseSensitive|table columns|select column|parquet column (select via data source table)|parquet column (select via hive serde table)|consistent?|resolved by SPARK-25132|
| - | - | - | - | - | - | - | - |
|1|true|a, b, c|a| a|a |Y | |
|2| | |b|null|B|NG| |
|3| | |c|c |c |Y | |
|4| | |A|AnalysisException|AnalysisException|Y | |
|5| | |B|AnalysisException|AnalysisException|Y | |
|6| | |C|AnalysisException|AnalysisException|Y | |
|7| |A, B, C|a|AnalysisException |AnalysisException|Y | |
|8| | |b|AnalysisException |AnalysisException |Y | |
|9| | |c|AnalysisException |AnalysisException |Y | |
|10| | |A|null |a |NG | |
|11| | |B|B |B|Y | |
|12| | |C|C |c |NG | |
|13|false|a, b, c|a|a |a |Y | |
|14| | |b|null |B |NG |Y|
|15| | |c|c |c |Y | |
|16| | |A|a |a |Y | |
|17| | |B|null |B |NG |Y|
|18| | |C|c |c |Y | |
|19| |A, B, C|a|null |a |NG |Y|
|20| | |b|B |B |Y | |
|21| | |c|C |c |NG | |
|22| | |A|null |a |NG |Y|
|23| | |B|B |B |Y | |
|24| | |C|C |c |NG | |

The comparison result with #22148 applied:

|no.|caseSensitive|table columns|select column|parquet column (select via data source table)|parquet column (select via hive serde table)|consistent?|introduced by SPARK-25132|
|---|---|---|---|---|---|---|---|
|1|true|a, b, c|a|a |a |Y | |
|2| | |b|null |B |NG | |
|3| | |c|c |c |Y | |
|4| | |A|AnalysisException |AnalysisException |Y | |
|5| | |B|AnalysisException |AnalysisException |Y | |
|6| | |C|AnalysisException |AnalysisException |Y | |
|7| |A, B, C|a|AnalysisException |AnalysisException |Y | |
|8| | |b|AnalysisException |AnalysisException |Y | |
|9| | |c|AnalysisException |AnalysisException |Y | |
|10| | |A|null |a |NG | |
|11| | |B|B |B |Y | |
|12| | |C|C |c |NG | |
|13|false|a, b, c|a|a |a |Y | |
|14| | |b|B |B |Y | |
|15| | |c|RuntimeException |c |NG |Y|
|16| | |A|a |a |Y | |
|17| | |B|B |B |Y | |
|18| | |C|RuntimeException |c |NG |Y|
|19| |A, B, C|a|a |a |Y | |
|20| | |b|B |B |Y | |

[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

2018-08-22 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22184#discussion_r212006137
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
 
+## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
+
+  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark 
always returns null for any column whose column names in Hive metastore schema 
and Parquet schema are in different letter cases, no matter whether 
`spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when 
`spark.sql.caseSensitive` is set to false, Spark does case insensitive column 
name resolution between Hive metastore schema and Parquet schema, so even 
column names are in different letter cases, Spark returns corresponding column 
values. An exception is thrown if there is ambiguity, i.e. more than one 
Parquet column is matched.
--- End diff --

This is a behavior change. I am not sure whether we should backport it to 
2.3.2. How about sending a note to the dev mailing list?

BTW, this only affects data source tables. How about hive serde tables? Are 
they consistent?

Could you add a test case? Create a table with syntax like `CREATE TABLE 
... STORED AS PARQUET`. You also need to turn off 
`spark.sql.hive.convertMetastoreParquet` in the test case.
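
Such a test might look roughly like this (a manual sketch for a Hive-enabled 
session; the table name and location are made up, and this is not the suite 
code that was eventually added):

```
// Write Parquet data with an upper-case column, expose it through a Hive serde
// table declared with the lower-case name, and read via the Hive serde path.
spark.range(5).selectExpr("id AS ID").write.mode("overwrite").parquet("/tmp/id_upper")
spark.sql("CREATE TABLE t_case (id LONG) STORED AS PARQUET LOCATION '/tmp/id_upper'")
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
spark.sql("SELECT id FROM t_case").show()  // Hive serde reader matches `id` to `ID` case-insensitively
```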


---




[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

2018-08-22 Thread seancxmao
GitHub user seancxmao opened a pull request:

https://github.com/apache/spark/pull/22184

[SPARK-25132][SQL][DOC] Add migration doc for case-insensitive field 
resolution when reading from Parquet

## What changes were proposed in this pull request?
#22148 introduces a behavior change. We need to document it in the 
migration guide.

## How was this patch tested?
N/A


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/seancxmao/spark SPARK-25132-DOC

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22184.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22184


commit eae8a3c98f146765d25bbf529421ce3c7a92639b
Author: seancxmao 
Date:   2018-08-22T09:17:55Z

[SPARK-25132][SQL][DOC] Case-insensitive field resolution when reading from 
Parquet




---
