Thanks for that additional tip, Michael. Backticks fix the problem query in which an identifier was transformed into a string literal. So this works now...
// now correctly resolves the unnormalized column id
sqlContext.sql("""select `b` from test_data""").show

Any suggestion about how to escape an embedded double quote?

// java.sql.SQLSyntaxErrorException: Syntax error: Encountered "\"" at line 1, column 12.
sqlContext.sql("""select `c"d` from test_data""").show

// org.apache.spark.sql.AnalysisException: cannot resolve 'c\"d' given input columns A, b, c"d; line 1 pos 7
sqlContext.sql("""select `c\"d` from test_data""").show

Thanks,
-Rick

Michael Armbrust <mich...@databricks.com> wrote on 09/22/2015 01:16:12 PM:

> From: Michael Armbrust <mich...@databricks.com>
> To: Richard Hillegas/San Francisco/IBM@IBMUS
> Cc: Dev <dev@spark.apache.org>
> Date: 09/22/2015 01:16 PM
> Subject: Re: column identifiers in Spark SQL
>
> HiveQL uses `backticks` for quoted identifiers.
>
> On Tue, Sep 22, 2015 at 1:06 PM, Richard Hillegas <rhil...@us.ibm.com> wrote:
> Thanks for that tip, Michael. I think that my sqlContext was a raw
> SQLContext originally. I have rebuilt Spark like so...
>
> sbt/sbt -Phive assembly/assembly
>
> Now I see that my sqlContext is a HiveContext. That fixes one of the
> queries.
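[Editor's note: a minimal sketch of the backtick convention discussed above. The helper name and the Python rendering are mine; the escaping rule — doubling an embedded backtick — comes from the quoted-identifier document attached to HIVE-6013.]

```python
def quote_hiveql(identifier: str) -> str:
    """Wrap a column name in HiveQL-style backticks.

    Per the quoted-identifier spec attached to HIVE-6013, a literal
    backtick inside the identifier is escaped by doubling it; other
    characters, including double quotes, are taken literally.
    """
    return "`" + identifier.replace("`", "``") + "`"

print(quote_hiveql("b"))     # `b`
print(quote_hiveql('c"d'))   # `c"d`
print(quote_hiveql("a`b"))   # `a``b`
```

Whether a given backend accepts the embedded double quote after JDBC push-down is a separate question, as the Derby syntax error above suggests.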
> Now unnormalized column names work:
>
> // ...unnormalized column names work now
> sqlContext.sql("""select a from test_data""").show
>
> However, quoted identifiers are still treated as string literals:
>
> // this still returns rows consisting of the string literal "b"
> sqlContext.sql("""select "b" from test_data""").show
>
> And embedded quotes inside quoted identifiers are swallowed up:
>
> // this now returns rows consisting of the string literal "cd"
> sqlContext.sql("""select "c""d" from test_data""").show
>
> Thanks,
> -Rick
>
> Michael Armbrust <mich...@databricks.com> wrote on 09/22/2015 10:58:36 AM:
>
> > From: Michael Armbrust <mich...@databricks.com>
> > To: Richard Hillegas/San Francisco/IBM@IBMUS
> > Cc: Dev <dev@spark.apache.org>
> > Date: 09/22/2015 10:59 AM
> > Subject: Re: column identifiers in Spark SQL
> >
> > Are you using a SQLContext or a HiveContext? The programming guide
> > suggests the latter, as the former is really only there because some
> > applications may have conflicts with Hive dependencies. SQLContext
> > is case sensitive by default whereas the HiveContext is not. The
> > parser in HiveContext is also a lot better.
> >
> > On Tue, Sep 22, 2015 at 10:53 AM, Richard Hillegas <rhil...@us.ibm.com> wrote:
> > I am puzzled by the behavior of column identifiers in Spark SQL. I
> > don't find any guidance in the "Spark SQL and DataFrame Guide" at
> > http://spark.apache.org/docs/latest/sql-programming-guide.html. I am
> > seeing odd behavior related to case-sensitivity and to delimited
> > (quoted) identifiers.
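[Editor's note: the Standard behavior Rick expects can be demonstrated outside Spark. The sketch below uses SQLite as a stand-in — my choice, not part of the thread — since it treats double quotes as identifier delimiters and doubles embedded quotes.]

```python
import sqlite3

# SQL-Standard style: double quotes delimit identifiers, and an
# embedded double quote is written by doubling it ("c""d" -> c"d).
conn = sqlite3.connect(":memory:")
conn.execute('create table t(a int, "b" int, "c""d" int)')
conn.execute("insert into t values (1, 2, 3)")

# "b" resolves to the column named b, not the string literal 'b'
assert conn.execute('select "b" from t').fetchone() == (2,)
# "c""d" resolves to the column whose name embeds a double quote
assert conn.execute('select "c""d" from t').fetchone() == (3,)
# unquoted identifiers are case-insensitive
assert conn.execute("select A from t").fetchone() == (1,)
```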
> >
> > Consider the following declaration of a table in the Derby
> > relational database, whose dialect hews closely to the SQL Standard:
> >
> > create table app.t( a int, "b" int, "c""d" int );
> >
> > Now let's load that table into Spark like this:
> >
> > import org.apache.spark.sql._
> > import org.apache.spark.sql.types._
> >
> > val df = sqlContext.read.format("jdbc").options(
> >   Map("url" -> "jdbc:derby:/Users/rhillegas/derby/databases/derby1",
> >       "dbtable" -> "app.t")).load()
> > df.registerTempTable("test_data")
> >
> > The following query runs fine because the column name matches the
> > normalized form in which it is stored in the metadata catalogs of
> > the relational database:
> >
> > // normalized column names are recognized
> > sqlContext.sql(s"""select A from test_data""").show
> >
> > But the following query fails during name resolution. This puzzles
> > me because non-delimited identifiers are case-insensitive in the
> > ANSI/ISO Standard. They are also supposed to be case-insensitive in
> > HiveQL, at least according to section 2.3.1 of the
> > QuotedIdentifier.html webpage attached to
> > https://issues.apache.org/jira/browse/HIVE-6013:
> >
> > // ...unnormalized column names raise this error:
> > // org.apache.spark.sql.AnalysisException: cannot resolve 'a' given
> > // input columns A, b, c"d;
> > sqlContext.sql("""select a from test_data""").show
> >
> > Delimited (quoted) identifiers are treated as string literals.
> > Again, non-Standard behavior:
> >
> > // this returns rows consisting of the string literal "b"
> > sqlContext.sql("""select "b" from test_data""").show
> >
> > Embedded quotes in delimited identifiers won't even parse:
> >
> > // embedded quotes raise this error: java.lang.RuntimeException:
> > // [1.11] failure: ``union'' expected but "d" found
> > sqlContext.sql("""select "c""d" from test_data""").show
> >
> > This behavior is non-Standard and it strikes me as hard to describe
> > to users concisely.
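[Editor's note: the catalog normalization Rick refers to can be sketched as below — illustrative only; the helper name is mine, and real parsers also validate characters. A Standard-compliant engine such as Derby folds unquoted identifiers to upper case and stores delimited ones verbatim with each doubled quote collapsed, which is why the error message lists the columns as A, b, and c"d.]

```python
def normalize_sql_identifier(token: str) -> str:
    """Fold an identifier the way a SQL-Standard catalog stores it:
    unquoted names go to upper case; delimited names keep their case,
    with each doubled embedded quote collapsed to a single one."""
    if token.startswith('"') and token.endswith('"') and len(token) >= 2:
        return token[1:-1].replace('""', '"')
    return token.upper()

# the three columns from: create table app.t( a int, "b" int, "c""d" int );
print([normalize_sql_identifier(t) for t in ['a', '"b"', '"c""d"']])
# → ['A', 'b', 'c"d']
```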
> > Would the community support an effort to bring the handling of
> > column identifiers into closer conformance with the Standard? Would
> > backward compatibility concerns even allow us to do that?
> >
> > Thanks,
> > -Rick