I’m working with a dataset where each row is stored as a single-line flat JSON
object. I want to leverage Spark SQL to run relational queries on this data.
Many of the object keys in this dataset have dots in them, e.g.:
{ “key.number1”: “value1”, “key.number2”: “value2” … }
I can successfully load the data as an RDD in Spark and construct a Spark SQL
table using the jsonRDD function. If I print the schema of the table, I see
that Spark SQL infers the full object key, dot included, as the column name:
> sqlTable.printSchema()
root
|-- key.number1: StringType
|-- key.number2: StringType
…
However, when I try to use one of these column names in a query, it seems that
the Spark SQL parser always assumes I’m trying to reference a nested attribute.
The same thing happens when using HiveQL. If there’s a way to escape the dot in
the column name, I haven’t found it:
> sqlContext.sql(“SELECT key.number1 FROM TABLE sql_table LIMIT 1”).first
== Query Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
attributes: ‘key.number1, tree
Project [‘key.number1]
…
This is not a critical issue by any means— it’s simple enough to use map() to
transform the dots to underscores after loading the JSON data as text. I just
wanted to reach out to the community for some guidance as to whether or not
this issue warrants a bug report. To me, this behavior seems to be
inconsistent— you can create a table with column names containing dots, but
AFAICT you cannot include such columns in a query.
Also, I’d greatly appreciate it if anybody has any pointers as to where in the
source I should be looking if I wanted to patch this issue in my local branch.
I’ve taken a glance at some of the Spark SQL Catalyst code but I’m afraid I’m
too much of a Scala novice to make much headway here.
For reference, I’m using Spark 1.0.1. Thanks for your input.
Adam