[ 
https://issues.apache.org/jira/browse/SPARK-30559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30559:
----------------------------------
    Summary: spark.sql.hive.caseSensitiveInferenceMode does not work with Hive  
(was: Spark 2.4.4 - spark.sql.hive.caseSensitiveInferenceMode does not work 
with Hive)

> spark.sql.hive.caseSensitiveInferenceMode does not work with Hive
> -----------------------------------------------------------------
>
>                 Key: SPARK-30559
>                 URL: https://issues.apache.org/jira/browse/SPARK-30559
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.4
>         Environment: EMR 28.1 with Spark 2.4.4, Hadoop 2.8.5 and Hive 2.3.6
>            Reporter: Ori Popowski
>            Priority: Major
>
> In Spark SQL, the spark.sql.hive.caseSensitiveInferenceMode modes INFER_ONLY and 
> INFER_AND_SAVE do not work as intended. They are supposed to infer a 
> case-sensitive schema from the underlying data files, but:
>  # INFER_ONLY never works: it always uses the lowercase column names from the 
> Hive metastore schema
>  # INFER_AND_SAVE only works from the second {{spark.sql("SELECT …")}} call 
> onwards (the first call writes the inferred schema to TBLPROPERTIES in the 
> metastore, and subsequent calls read that schema, so they do work)
> h3. Expected behavior (according to SPARK-19611)
> INFER_ONLY - infer the schema from the underlying files
> INFER_AND_SAVE - infer the schema from the underlying files, save it to the 
> metastore, and read it from the metastore on any subsequent calls
> h2. Reproduce
> h3. Prepare the data
> h4. 1) Create a Parquet file
> {code:scala}
> scala> List(("a", 1), ("b", 2)).toDF("theString", 
> "theNumber").write.parquet("hdfs:///t"){code}
>  
> h4. 2) Inspect the Parquet files
> {code:sh}
> $ hadoop jar parquet-tools-1.11.0.jar cat -j 
> hdfs:///t/part-00000-….snappy.parquet
> {"theString":"a","theNumber":1}
> $ hadoop jar parquet-tools-1.11.0.jar cat -j 
> hdfs:///t/part-00001-….snappy.parquet
> {"theString":"b","theNumber":2}{code}
> We see that they are saved with camelCase column names.
> h4. 3) Create a Hive table 
> {code:sql}
> hive> CREATE EXTERNAL TABLE t(theString string, theNumber int)
>  > ROW FORMAT SERDE 
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
>  > STORED AS INPUTFORMAT 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
>  > OUTPUTFORMAT 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
>  > LOCATION 'hdfs:///t';{code}
>  
> h3. Reproduce INFER_ONLY bug
> h4. 1) Read the table in Spark using INFER_ONLY
> {code:sh}
> $ spark-shell --master local[*] --conf 
> spark.sql.hive.caseSensitiveInferenceMode=INFER_ONLY{code}
> {code:scala}
> scala> spark.sql("SELECT * FROM default.t").columns.foreach(println)
> thestring
> thenumber
> {code}
> h4. Conclusion
> When INFER_ONLY is set, column names are always lowercase.
> h3. Reproduce INFER_AND_SAVE bug
> h4. 1) Run for the first time
> {code:sh}
> $ spark-shell --master local[*] --conf 
> spark.sql.hive.caseSensitiveInferenceMode=INFER_AND_SAVE{code}
> {code:scala}
> scala> spark.sql("SELECT * FROM default.t").columns.foreach(println)
> thestring
> thenumber{code}
> We see that the column names are lowercase.
> h4. 2) Run for the second time
> {code:scala}
> scala> spark.sql("SELECT * FROM default.t").columns.foreach(println)
> theString
> theNumber{code}
> We see that the column names are camelCase.
> h4. Conclusion
> When INFER_AND_SAVE is set, column names are lowercase on the first call and 
> camelCase on subsequent calls.
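> The observed (buggy) INFER_AND_SAVE sequence can be modeled with a tiny in-memory 
> "metastore" (names here are hypothetical, for illustration only):
> {code:python}
# Sketch of the observed INFER_AND_SAVE behavior described above.

metastore = {"schema": ["thestring", "thenumber"], "saved": None}
file_schema = ["theString", "theNumber"]

def select_columns():
    # Subsequent calls read the schema saved in TBLPROPERTIES ...
    if metastore["saved"] is not None:
        return metastore["saved"]
    # ... but the first call, while saving the inferred schema, still
    # answers with the lowercase metastore schema (the bug).
    metastore["saved"] = file_schema
    return metastore["schema"]

print(select_columns())  # ['thestring', 'thenumber']  (first call)
print(select_columns())  # ['theString', 'theNumber']  (second call)
{code}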



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
