I’m working with a dataset where each row is stored as a single-line flat JSON object. I want to leverage Spark SQL to run relational queries on this data. Many of the object keys in this dataset have dots in them, e.g.:
{ "key.number1": "value1", "key.number2": "value2", ... }

I can successfully load the data as an RDD in Spark and construct a Spark SQL table using the jsonRDD function. If I print the schema of the table, I see that Spark SQL infers the full object key, dot included, as the column name:

> sqlTable.printSchema()
root
 |-- key.number1: StringType
 |-- key.number2: StringType
 ...

However, when I try to use one of these column names in a query, the Spark SQL parser always assumes I'm trying to reference a nested attribute. The same thing happens when using HiveQL. If there's a way to escape the dot in the column name, I haven't found it:

> sqlContext.sql("SELECT key.number1 FROM sql_table LIMIT 1").first
== Query Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: 'key.number1, tree
Project ['key.number1]
 ...

This is not a critical issue by any means: it's simple enough to use map() to transform the dots to underscores after loading the JSON data as text (see the sketch at the end of this message). I just wanted to reach out to the community for guidance on whether this issue warrants a bug report. To me, the behavior seems inconsistent: you can create a table whose column names contain dots, but as far as I can tell you cannot reference those columns in a query.

Also, I'd greatly appreciate any pointers as to where in the source I should look if I wanted to patch this issue in my local branch. I've taken a glance at some of the Spark SQL Catalyst code, but I'm afraid I'm too much of a Scala novice to make much headway there.

For reference, I'm using Spark 1.0.1.

Thanks for your input.

Adam
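
P.S. In case it helps anyone hitting the same problem, here is a rough sketch of the workaround I mentioned above. It assumes that dots only ever appear in the object keys, never in the values (the replace is applied blindly to the whole line), and the file path and table name are just placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("DotsToUnderscores"))
val sqlContext = new SQLContext(sc)

// Each line of the input is one flat, single-line JSON object.
val raw = sc.textFile("hdfs:///path/to/data.json")

// Rewrite dots to underscores before Spark SQL ever sees the keys.
// NOTE: this replaces every dot on the line, so it is only safe if
// the values themselves never contain dots.
val cleaned = raw.map(_.replace('.', '_'))

val table = sqlContext.jsonRDD(cleaned)
table.registerAsTable("sql_table")

// The inferred columns are now key_number1, key_number2, ... and can
// be referenced without tripping the nested-attribute resolution.
sqlContext.sql("SELECT key_number1 FROM sql_table LIMIT 1").first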