[jira] [Commented] (SPARK-10562) .partitionBy() creates the metastore partition columns in all lowercase, but persists the data path as MixedCase resulting in an error when the data is later attempted to query.

Brian Lockwood (JIRA) Mon, 16 Nov 2015 17:05:32 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15007739#comment-15007739
 ]


Brian Lockwood commented on SPARK-10562:
----------------------------------------

It seems that I am running into this in version 1.5.1. Is the Affects 
Version/s field correct?

> .partitionBy() creates the metastore partition columns in all lowercase, but 
> persists the data path as MixedCase resulting in an error when the data is 
> later attempted to query.
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-10562
>                 URL: https://issues.apache.org/jira/browse/SPARK-10562
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.1
>            Reporter: Jason Pohl
>            Assignee: Wenchen Fan
>             Fix For: 1.6.0
>
>         Attachments: MixedCasePartitionBy.dbc
>
>
> When using DataFrame.write.partitionBy().saveAsTable(), the partition-by 
> columns are created in all lowercase in the metastore. However, the data is 
> written to the filesystem using mixed case.
> This causes an error when running a select against the table.
> {noformat}
> from pyspark.sql import Row
> # Create a data frame with mixed case column names
> myRDD = sc.parallelize([Row(Name="John Terry", Goals=1, Year=2015),
>                        Row(Name="Frank Lampard", Goals=15, Year=2012)])
> myDF = sqlContext.createDataFrame(myRDD)
> # Write this data out to a parquet file and partition by the Year
> # (which is a mixedCase name)
> myDF.write.partitionBy("Year").saveAsTable("chelsea_goals")
> %sql show create table chelsea_goals;
> --The metastore is showing a partition column name of all lowercase "year"
> # Verify that the data is written with appropriate partitions
> display(dbutils.fs.ls("/user/hive/warehouse/chelsea_goals"))
> {noformat}
> {code:sql}
> %sql -- Now try to run a query against this table
> select * from chelsea_goals
> {code}
> {noformat}
> Error in SQL statement: UncheckedExecutionException: 
> java.lang.RuntimeException: Partition column year not found in schema 
> StructType(StructField(Goals,LongType,true), 
> StructField(Name,StringType,true), StructField(Year,LongType,true))
> {noformat}
> {noformat}
> # Now let's try this again using a lowercase column name
> myRDD2 = sc.parallelize([Row(Name="John Terry", Goals=1, year=2015),
>                          Row(Name="Frank Lampard", Goals=15, year=2012)])
> myDF2 = sqlContext.createDataFrame(myRDD2)
> myDF2.write.partitionBy("year").saveAsTable("chelsea_goals2")
> {noformat}
> {code:sql}
> %sql select * from chelsea_goals2;
> --Now everything works
> {code}
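
The lowercase run quoted above suggests a possible workaround for affected versions: lowercase every column name before calling partitionBy(), so that the metastore and the written data agree. The following is only a sketch under that assumption and is not taken from the issue or from the eventual fix. The helper names lowercase_column_names and with_lowercase_columns are hypothetical; df.columns and df.toDF(*names) are the only PySpark DataFrame members it assumes.

```python
def lowercase_column_names(names):
    """Return the given column names lowercased.

    Raises ValueError if two names would become identical once lowercased,
    since renaming would then be ambiguous.
    """
    lowered = [name.lower() for name in names]
    if len(set(lowered)) != len(lowered):
        raise ValueError(
            "column names collide once lowercased: %r" % (list(names),))
    return lowered


def with_lowercase_columns(df):
    """Return a copy of df whose column names are all lowercase.

    Assumes df is a PySpark DataFrame: df.columns lists the current names
    and df.toDF(*names) returns a new DataFrame with the columns renamed.
    """
    return df.toDF(*lowercase_column_names(df.columns))


# Hypothetical usage with the DataFrame from the report (not executed here):
# with_lowercase_columns(myDF).write.partitionBy("year").saveAsTable("chelsea_goals")
```

The collision check is a design choice: if a DataFrame had both "Year" and "year", silently renaming would produce duplicate column names, so the sketch fails loudly instead.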



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

