I am puzzled by the behavior of column identifiers in Spark SQL. I don't find any guidance in the "Spark SQL and DataFrame Guide" at http://spark.apache.org/docs/latest/sql-programming-guide.html. I am seeing odd behavior related to case-sensitivity and to delimited (quoted) identifiers.
Consider the following declaration of a table in the Derby relational database, whose dialect hews closely to the SQL Standard: create table app.t( a int, "b" int, "c""d" int ); Now let's load that table into Spark like this: import org.apache.spark.sql._ import org.apache.spark.sql.types._ val df = sqlContext.read.format("jdbc").options( Map("url" -> "jdbc:derby:/Users/rhillegas/derby/databases/derby1", "dbtable" -> "app.t")).load() df.registerTempTable("test_data") The following query runs fine because the column name matches the normalized form in which it is stored in the metadata catalogs of the relational database: // normalized column names are recognized sqlContext.sql(s"""select A from test_data""").show But the following query fails during name resolution. This puzzles me because non-delimited identifiers are case-insensitive in the ANSI/ISO Standard. They are also supposed to be case-insensitive in HiveQL, at least according to section 2.3.1 of the QuotedIdentifier.html webpage attached to https://issues.apache.org/jira/browse/HIVE-6013: // ...unnormalized column names raise this error: org.apache.spark.sql.AnalysisException: cannot resolve 'a' given input columns A, b, c"d; sqlContext.sql("""select a from test_data""").show Delimited (quoted) identifiers are treated as string literals. Again, non-Standard behavior: // this returns rows consisting of the string literal "b" sqlContext.sql("""select "b" from test_data""").show Embedded quotes in delimited identifiers won't even parse: // embedded quotes raise this error: java.lang.RuntimeException: [1.11] failure: ``union'' expected but "d" found sqlContext.sql("""select "c""d" from test_data""").show This behavior is non-Standard and it strikes me as hard to describe to users concisely. Would the community support an effort to bring the handling of column identifiers into closer conformance with the Standard? Would backward compatibility concerns even allow us to do that? Thanks, -Rick