[ https://issues.apache.org/jira/browse/SPARK-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15007739#comment-15007739 ]
Brian Lockwood commented on SPARK-10562:
----------------------------------------

It seems that I am running into this in version 1.5.1. Is the "Affects Version/s" field correct?

> .partitionBy() creates the metastore partition columns in all lowercase, but
> persists the data path as MixedCase resulting in an error when the data is
> later attempted to query.
> ------------------------------------------------------------------------
>
>                 Key: SPARK-10562
>                 URL: https://issues.apache.org/jira/browse/SPARK-10562
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.1
>            Reporter: Jason Pohl
>            Assignee: Wenchen Fan
>             Fix For: 1.6.0
>
>         Attachments: MixedCasePartitionBy.dbc
>
>
> When using DataFrame.write.partitionBy().saveAsTable(), Spark creates the
> partition columns in all lowercase in the metastore. However, it writes
> the data to the filesystem using mixed case.
> This causes an error when running a select against the table.
> {noformat}
> from pyspark.sql import Row
> # Create a data frame with mixed-case column names
> myRDD = sc.parallelize([Row(Name="John Terry", Goals=1, Year=2015),
>                         Row(Name="Frank Lampard", Goals=15, Year=2012)])
> myDF = sqlContext.createDataFrame(myRDD)
> # Write this data out to a parquet file and partition by Year (which is a mixed-case name)
> myDF.write.partitionBy("Year").saveAsTable("chelsea_goals")
> %sql show create table chelsea_goals;
> -- The metastore is showing a partition column name of all lowercase "year"
> # Verify that the data is written with appropriate partitions
> display(dbutils.fs.ls("/user/hive/warehouse/chelsea_goals"))
> {noformat}
> {code:sql}
> %sql -- Now try to run a query against this table
> select * from chelsea_goals
> {code}
> {noformat}
> Error in SQL statement: UncheckedExecutionException:
> java.lang.RuntimeException: Partition column year not found in schema
> StructType(StructField(Goals,LongType,true),
> StructField(Name,StringType,true), StructField(Year,LongType,true))
> {noformat}
> {noformat}
> # Now let's try this again using a lowercase column name
> myRDD2 = sc.parallelize([Row(Name="John Terry", Goals=1, year=2015),
>                          Row(Name="Frank Lampard", Goals=15, year=2012)])
> myDF2 = sqlContext.createDataFrame(myRDD2)
> myDF2.write.partitionBy("year").saveAsTable("chelsea_goals2")
> {noformat}
> {code:sql}
> %sql select * from chelsea_goals2;
> -- Now everything works
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
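
The second example in the report, where the partition column is already lowercase, suggests one possible workaround on affected versions: lowercase the column names before writing. The following is a minimal sketch and not part of the issue itself. `lowercase_columns` is a hypothetical helper, the table name `chelsea_goals_lower` is made up for illustration, and the commented-out lines assume the `myDF` DataFrame from the report together with PySpark's `DataFrame.toDF`.

```python
def lowercase_columns(names):
    """Return the column names lowercased, refusing to merge distinct columns."""
    lowered = [name.lower() for name in names]
    # Two names that differ only by case would collapse into one column.
    if len(set(lowered)) != len(lowered):
        raise ValueError("columns collide after lowercasing: %r" % (names,))
    return lowered


# Pure-Python check of the helper, using the column names from the report.
print(lowercase_columns(["Goals", "Name", "Year"]))  # ['goals', 'name', 'year']

# Hypothetical use with the DataFrame from the report (needs a Spark context):
# safeDF = myDF.toDF(*lowercase_columns(myDF.columns))
# safeDF.write.partitionBy("year").saveAsTable("chelsea_goals_lower")
```

Renaming every column, rather than only the partition column, keeps the stored schema and the metastore entry consistent with each other, at the cost of losing the original capitalization.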