I am puzzled by the behavior of column identifiers in Spark SQL. I don't
find any guidance in the "Spark SQL and DataFrame Guide" at
http://spark.apache.org/docs/latest/sql-programming-guide.html. I am seeing
odd behavior related to case-sensitivity and to delimited (quoted)
identifiers.

Consider the following declaration of a table in the Derby relational
database, whose dialect hews closely to the SQL Standard:

   create table app.t( a int, "b" int, "c""d" int );

Now let's load that table into Spark like this:

  import org.apache.spark.sql._
  import org.apache.spark.sql.types._

  val df = sqlContext.read.format("jdbc").options(
    Map("url" -> "jdbc:derby:/Users/rhillegas/derby/databases/derby1",
    "dbtable" -> "app.t")).load()
  df.registerTempTable("test_data")

The following query runs fine because the column name matches the
normalized form in which it is stored in the metadata catalogs of the
relational database:

  // normalized column names are recognized
  sqlContext.sql(s"""select A from test_data""").show

But the following query fails during name resolution. This puzzles me
because non-delimited identifiers are case-insensitive in the ANSI/ISO
Standard. They are also supposed to be case-insensitive in HiveQL, at least
according to section 2.3.1 of the QuotedIdentifier.html webpage attached to
https://issues.apache.org/jira/browse/HIVE-6013:

  // ...unnormalized column names raise this error:
org.apache.spark.sql.AnalysisException: cannot resolve 'a' given input
columns A, b, c"d;
  sqlContext.sql("""select a from test_data""").show

Delimited (quoted) identifiers are treated as string literals. Again,
non-Standard behavior:

  // this returns rows consisting of the string literal "b"
  sqlContext.sql("""select "b" from test_data""").show

Embedded quotes in delimited identifiers won't even parse:

  // embedded quotes raise this error: java.lang.RuntimeException: [1.11]
failure: ``union'' expected but "d" found
  sqlContext.sql("""select "c""d" from test_data""").show

This behavior is non-Standard and it strikes me as hard to describe to
users concisely. Would the community support an effort to bring the
handling of column identifiers into closer conformance with the Standard?
Would backward compatibility concerns even allow us to do that?

Thanks,
-Rick

Reply via email to