[ https://issues.apache.org/jira/browse/SPARK-30559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17028386#comment-17028386 ]
Ori Popowski commented on SPARK-30559:
--------------------------------------

Can someone please take a look at this?

Spark 2.4.4 - spark.sql.hive.caseSensitiveInferenceMode does not work with Hive
-------------------------------------------------------------------------------

                Key: SPARK-30559
                URL: https://issues.apache.org/jira/browse/SPARK-30559
            Project: Spark
         Issue Type: Bug
         Components: SQL
   Affects Versions: 2.4.4
        Environment: EMR 28.1 with Spark 2.4.4, Hadoop 2.8.5 and Hive 2.3.6
           Reporter: Ori Popowski
           Priority: Major

In Spark SQL, the spark.sql.hive.caseSensitiveInferenceMode settings INFER_ONLY and INFER_AND_SAVE do not work as intended. They are supposed to infer a case-sensitive schema from the underlying files, but they do not:
# INFER_ONLY never works: it always uses the lowercase column names from the Hive metastore schema.
# INFER_AND_SAVE only works the second time {{spark.sql("SELECT …")}} is called (the first call writes the inferred schema to TBLPROPERTIES in the metastore, and subsequent calls read that schema, so they do work).

h3. Expected behavior (according to SPARK-19611)

INFER_ONLY - infer the schema from the underlying files.
INFER_AND_SAVE - infer the schema from the underlying files, save it to the metastore, and read it from the metastore on all subsequent calls.

h2. Reproduce

h3. Prepare the data

h4. 1) Create a Parquet file

{code:scala}
scala> List(("a", 1), ("b", 2)).toDF("theString", "theNumber").write.parquet("hdfs:///t")
{code}

h4. 2) Inspect the Parquet files

{code:sh}
$ hadoop jar parquet-tools-1.11.0.jar cat -j hdfs:///t/part-00000-….snappy.parquet
{"theString":"a","theNumber":1}
$ hadoop jar parquet-tools-1.11.0.jar cat -j hdfs:///t/part-00001-….snappy.parquet
{"theString":"b","theNumber":2}
{code}

We see that the files are saved with camelCase column names.
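For context, the Hive metastore lowercases identifiers while the Parquet footers preserve the original case, so the INFER modes are expected to reconcile the two by matching names case-insensitively. The intended reconciliation can be sketched in plain Scala (a simplified illustration only, not Spark's actual code; the object and method names here are made up):

```scala
// Sketch of the case-insensitive reconciliation the INFER modes are meant
// to perform. Hypothetical names; Spark's real implementation differs.
object InferCaseSketch {

  // Resolve each (lowercased) metastore column against the file schema
  // case-insensitively, preferring the exact-case name found in the files.
  def inferExactCase(metastore: Seq[String], files: Seq[String]): Seq[String] = {
    // Index the file-schema names by their lowercase form.
    val byLower = files.map(c => c.toLowerCase -> c).toMap
    // Fall back to the metastore name if the files have no match.
    metastore.map(c => byLower.getOrElse(c.toLowerCase, c))
  }

  def main(args: Array[String]): Unit = {
    // The metastore stores "thestring"/"thenumber"; the Parquet footer
    // stores "theString"/"theNumber" as written above.
    val resolved = inferExactCase(
      Seq("thestring", "thenumber"),
      Seq("theString", "theNumber"))
    println(resolved.mkString(","))
  }
}
```

Under this reading, a correctly working INFER_ONLY/INFER_AND_SAVE query should surface the camelCase names from the files, which is exactly what the reproduction below shows failing.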
h4. 3) Create a Hive table

{code:sql}
hive> CREATE EXTERNAL TABLE t(theString string, theNumber int)
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    > STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
    > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    > LOCATION 'hdfs:///t';
{code}

h3. Reproduce INFER_ONLY bug

h4. 1) Read the table in Spark using INFER_ONLY

{code:sh}
$ spark-shell --master local[*] --conf spark.sql.hive.caseSensitiveInferenceMode=INFER_ONLY
{code}

{code:scala}
scala> spark.sql("SELECT * FROM default.t").columns.foreach(println)
thestring
thenumber
{code}

h4. Conclusion

When INFER_ONLY is set, column names are always lowercase.

h3. Reproduce INFER_AND_SAVE bug

h4. 1) Run for the first time

{code:sh}
$ spark-shell --master local[*] --conf spark.sql.hive.caseSensitiveInferenceMode=INFER_AND_SAVE
{code}

{code:scala}
scala> spark.sql("SELECT * FROM default.t").columns.foreach(println)
thestring
thenumber
{code}

We see that the column names are lowercase.

h4. 2) Run for the second time

{code:scala}
scala> spark.sql("SELECT * FROM default.t").columns.foreach(println)
theString
theNumber
{code}

We see that the column names are camelCase.

h4. Conclusion

When INFER_AND_SAVE is set, column names are lowercase on the first call and camelCase on subsequent calls.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)