Thanks for that tip, Michael. I think that my sqlContext was a raw SQLContext originally. I have rebuilt Spark like so...
sbt/sbt -Phive assembly/assembly Now I see that my sqlContext is a HiveContext. That fixes one of the queries. Now unnormalized column names work: // ...unnormalized column names work now sqlContext.sql("""select a from test_data""").show However, quoted identifiers are still treated as string literals: // this still returns rows consisting of the string literal "b" sqlContext.sql("""select "b" from test_data""").show And embedded quotes inside quoted identifiers are swallowed up: // this now returns rows consisting of the string literal "cd" sqlContext.sql("""select "c""d" from test_data""").show Thanks, -Rick Michael Armbrust <mich...@databricks.com> wrote on 09/22/2015 10:58:36 AM: > From: Michael Armbrust <mich...@databricks.com> > To: Richard Hillegas/San Francisco/IBM@IBMUS > Cc: Dev <dev@spark.apache.org> > Date: 09/22/2015 10:59 AM > Subject: Re: column identifiers in Spark SQL > > Are you using a SQLContext or a HiveContext? The programming guide > suggests the latter, as the former is really only there because some > applications may have conflicts with Hive dependencies. SQLContext > is case sensitive by default where as the HiveContext is not. The > parser in HiveContext is also a lot better. > > On Tue, Sep 22, 2015 at 10:53 AM, Richard Hillegas <rhil...@us.ibm.com> wrote: > I am puzzled by the behavior of column identifiers in Spark SQL. I > don't find any guidance in the "Spark SQL and DataFrame Guide" at > http://spark.apache.org/docs/latest/sql-programming-guide.html. I am > seeing odd behavior related to case-sensitivity and to delimited > (quoted) identifiers. > > Consider the following declaration of a table in the Derby > relational database, whose dialect hews closely to the SQL Standard: > > create table app.t( a int, "b" int, "c""d" int ); > > Now let's load that table into Spark like this: > > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > > val df = sqlContext.read.format("jdbc").options( > Map("url" -> "jdbc:derby:/Users/rhillegas/derby/databases/derby1", > "dbtable" -> "app.t")).load() > df.registerTempTable("test_data") > > The following query runs fine because the column name matches the > normalized form in which it is stored in the metadata catalogs of > the relational database: > > // normalized column names are recognized > sqlContext.sql(s"""select A from test_data""").show > > But the following query fails during name resolution. This puzzles > me because non-delimited identifiers are case-insensitive in the > ANSI/ISO Standard. They are also supposed to be case-insensitive in > HiveQL, at least according to section 2.3.1 of the > QuotedIdentifier.html webpage attached to https://issues.apache.org/ > jira/browse/HIVE-6013: > > // ...unnormalized column names raise this error: > org.apache.spark.sql.AnalysisException: cannot resolve 'a' given > input columns A, b, c"d; > sqlContext.sql("""select a from test_data""").show > > Delimited (quoted) identifiers are treated as string literals. > Again, non-Standard behavior: > > // this returns rows consisting of the string literal "b" > sqlContext.sql("""select "b" from test_data""").show > > Embedded quotes in delimited identifiers won't even parse: > > // embedded quotes raise this error: java.lang.RuntimeException: > [1.11] failure: ``union'' expected but "d" found > sqlContext.sql("""select "c""d" from test_data""").show > > This behavior is non-Standard and it strikes me as hard to describe > to users concisely. Would the community support an effort to bring > the handling of column identifiers into closer conformance with the > Standard? Would backward compatibility concerns even allow us to do that? > > Thanks, > -Rick