Inconsistent Spark SQL behavior when column names contain dots

Budde, Adam Thu, 31 Jul 2014 11:17:25 -0700

I’m working with a dataset where each row is stored as a single-line flat JSON 
object. I want to leverage Spark SQL to run relational queries on this data. 
Many of the object keys in this dataset have dots in them, e.g.:


{ “key.number1”: “value1”, “key.number2”: “value2” … }

I can successfully load the data as an RDD in Spark and construct a Spark SQL 
table using the jsonRDD function. If I print the schema of the table, I see 
that Spark SQL infers the full object key, dot included, as the column name:

> sqlTable.printSchema()
root
|-- key.number1: StringType
|-- key.number2: StringType
…

However, when I try to use one of these column names in a query, it seems that 
the Spark SQL parser always assumes I’m trying to reference a nested attribute. 
The same thing happens when using HiveQL. If there’s a way to escape the dot in 
the column name, I haven’t found it:

> sqlContext.sql(“SELECT key.number1 FROM TABLE sql_table LIMIT 1”).first
== Query Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
attributes: ‘key.number1, tree
Project [‘key.number1]
…

This is not a critical issue by any means— it’s simple enough to use map() to 
transform the dots to underscores after loading the JSON data as text. I just 
wanted to reach out to the community for some guidance as to whether or not 
this issue warrants a bug report. To me, this behavior seems to be 
inconsistent— you can create a table with column names containing dots, but 
AFAICT you cannot include such columns in a query.

Also, I’d greatly appreciate it if anybody has any pointers as to where in the 
source I should be looking if I wanted to patch this issue in my local branch. 
I’ve taken a glance at some of the Spark SQL Catalyst code but I’m afraid I’m 
too much of a Scala novice to make much headway here.

For reference, I’m using Spark 1.0.1. Thanks for your input.

Adam

Inconsistent Spark SQL behavior when column names contain dots

Reply via email to