I’m working with a dataset where each row is stored as a single-line flat JSON 
object. I want to leverage Spark SQL to run relational queries on this data. 
Many of the object keys in this dataset have dots in them, e.g.:

{ "key.number1": "value1", "key.number2": "value2", ... }

I can successfully load the data as an RDD in Spark and construct a Spark SQL 
table using the jsonRDD function. If I print the schema of the table, I see 
that Spark SQL infers the full object key, dot included, as the column name:

> sqlTable.printSchema()
root
|-- key.number1: StringType
|-- key.number2: StringType
…
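
For reference, here's roughly how I'm building the table (a minimal sketch; the file path and table name are placeholders for my actual ones):

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  // Each line of the input file is one flat JSON object
  val jsonLines = sc.textFile("events.json")
  val sqlTable = sqlContext.jsonRDD(jsonLines)
  sqlTable.registerAsTable("sql_table")
  sqlTable.printSchema()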

However, when I try to use one of these column names in a query, it seems that 
the Spark SQL parser always assumes I’m trying to reference a nested attribute. 
The same thing happens when using HiveQL. If there’s a way to escape the dot in 
the column name, I haven’t found it:

> sqlContext.sql("SELECT key.number1 FROM sql_table LIMIT 1").first
== Query Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
attributes: 'key.number1, tree:
Project ['key.number1]
…

This is not a critical issue by any means; it's simple enough to use map() to 
transform the dots to underscores after loading the JSON data as text (rough 
sketch below). I just wanted to reach out to the community for some guidance as 
to whether or not this issue warrants a bug report. To me, this behavior seems 
inconsistent: you can create a table with column names containing dots, but 
AFAICT you cannot reference such columns in a query.
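
In case it helps anyone else, here's the workaround I have in mind (a rough sketch; the replaceAll pattern assumes all the dotted keys start with "key." and that no values contain that substring):

  // Rewrite the dotted key names to underscored ones before handing the
  // lines to jsonRDD. This only targets quoted keys beginning with "key.",
  // so it won't survive arbitrary data, but it works for mine.
  val cleaned = sc.textFile("events.json").map(_.replaceAll("\"key\\.", "\"key_"))
  val sqlTable = sqlContext.jsonRDD(cleaned)
  sqlTable.registerAsTable("sql_table")
  sqlContext.sql("SELECT key_number1 FROM sql_table LIMIT 1").first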

Also, I'd greatly appreciate any pointers as to where in the source I should be 
looking if I wanted to patch this issue in my local branch. 
I’ve taken a glance at some of the Spark SQL Catalyst code but I’m afraid I’m 
too much of a Scala novice to make much headway here.

For reference, I’m using Spark 1.0.1. Thanks for your input.

Adam
