Re: column identifiers in Spark SQL

Richard Hillegas Tue, 22 Sep 2015 13:10:20 -0700

Thanks for that tip, Michael. I think that my sqlContext was a raw
SQLContext originally. I have rebuilt Spark like so...


  sbt/sbt -Phive assembly/assembly

Now I see that my sqlContext is a HiveContext. That fixes one of the
queries. Now unnormalized column names work:

  // ...unnormalized column names work now
  sqlContext.sql("""select a from test_data""").show

However, quoted identifiers are still treated as string literals:

  // this still returns rows consisting of the string literal "b"
  sqlContext.sql("""select "b" from test_data""").show

And embedded quotes inside quoted identifiers are swallowed up:

  // this now returns rows consisting of the string literal "cd"
  sqlContext.sql("""select "c""d" from test_data""").show

Thanks,
-Rick

Michael Armbrust <mich...@databricks.com> wrote on 09/22/2015 10:58:36 AM:

> From: Michael Armbrust <mich...@databricks.com>
> To: Richard Hillegas/San Francisco/IBM@IBMUS
> Cc: Dev <dev@spark.apache.org>
> Date: 09/22/2015 10:59 AM
> Subject: Re: column identifiers in Spark SQL
>
> Are you using a SQLContext or a HiveContext?  The programming guide
> suggests the latter, as the former is really only there because some
> applications may have conflicts with Hive dependencies.  SQLContext
> is case sensitive by default where as the HiveContext is not.  The
> parser in HiveContext is also a lot better.
>
> On Tue, Sep 22, 2015 at 10:53 AM, Richard Hillegas <rhil...@us.ibm.com>
wrote:
> I am puzzled by the behavior of column identifiers in Spark SQL. I
> don't find any guidance in the "Spark SQL and DataFrame Guide" at
> http://spark.apache.org/docs/latest/sql-programming-guide.html. I am
> seeing odd behavior related to case-sensitivity and to delimited
> (quoted) identifiers.
>
> Consider the following declaration of a table in the Derby
> relational database, whose dialect hews closely to the SQL Standard:
>
>    create table app.t( a int, "b" int, "c""d" int );
>
> Now let's load that table into Spark like this:
>
>   import org.apache.spark.sql._
>   import org.apache.spark.sql.types._
>
>   val df = sqlContext.read.format("jdbc").options(
>     Map("url" -> "jdbc:derby:/Users/rhillegas/derby/databases/derby1",
>     "dbtable" -> "app.t")).load()
>   df.registerTempTable("test_data")
>
> The following query runs fine because the column name matches the
> normalized form in which it is stored in the metadata catalogs of
> the relational database:
>
>   // normalized column names are recognized
>   sqlContext.sql(s"""select A from test_data""").show
>
> But the following query fails during name resolution. This puzzles
> me because non-delimited identifiers are case-insensitive in the
> ANSI/ISO Standard. They are also supposed to be case-insensitive in
> HiveQL, at least according to section 2.3.1 of the
> QuotedIdentifier.html webpage attached to https://issues.apache.org/
> jira/browse/HIVE-6013:
>
>   // ...unnormalized column names raise this error:
> org.apache.spark.sql.AnalysisException: cannot resolve 'a' given
> input columns A, b, c"d;
>   sqlContext.sql("""select a from test_data""").show
>
> Delimited (quoted) identifiers are treated as string literals.
> Again, non-Standard behavior:
>
>   // this returns rows consisting of the string literal "b"
>   sqlContext.sql("""select "b" from test_data""").show
>
> Embedded quotes in delimited identifiers won't even parse:
>
>   // embedded quotes raise this error: java.lang.RuntimeException:
> [1.11] failure: ``union'' expected but "d" found
>   sqlContext.sql("""select "c""d" from test_data""").show
>
> This behavior is non-Standard and it strikes me as hard to describe
> to users concisely. Would the community support an effort to bring
> the handling of column identifiers into closer conformance with the
> Standard? Would backward compatibility concerns even allow us to do that?
>
> Thanks,
> -Rick

Re: column identifiers in Spark SQL

Reply via email to