I still see the same “Unresolved attributes” error when using hql + backticks.
Here’s a code snippet that replicates this behavior:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val sampleRDD = sc.parallelize(Array("""{"key.one": "value1", "key.two":
"value2"}"""))
val sampleTable = hiveContext.jsonRDD(sampleRDD)
sampleTable.registerAsTable("sample_table")
hiveContext.hql("SELECT `key.one` FROM sample_table")
From: Michael Armbrust <[email protected]<mailto:[email protected]>>
Reply-To: "[email protected]<mailto:[email protected]>"
<[email protected]<mailto:[email protected]>>
Date: Thursday, July 31, 2014 at 11:20 AM
To: "[email protected]<mailto:[email protected]>"
<[email protected]<mailto:[email protected]>>
Subject: Re: Inconsistent Spark SQL behavior when column names contain dots
Ideally you'd use backticks to reference columns that contain weird characters.
I don't believe this works in sql parser, but I'm curious if using the hql
parser in HiveContext would work for you?
If you wanted to add support for this in the sql parser I'd check out
SqlParser.scala. Thought it is likely we will abandon that code in the next
release for something more complete.
On Thu, Jul 31, 2014 at 11:16 AM, Budde, Adam
<[email protected]<mailto:[email protected]>> wrote:
I’m working with a dataset where each row is stored as a single-line flat JSON
object. I want to leverage Spark SQL to run relational queries on this data.
Many of the object keys in this dataset have dots in them, e.g.:
{ “key.number1”: “value1”, “key.number2”: “value2” … }
I can successfully load the data as an RDD in Spark and construct a Spark SQL
table using the jsonRDD function. If I print the schema of the table, I see
that Spark SQL infers the full object key, dot included, as the column name:
> sqlTable.printSchema()
root
|-- key.number1: StringType
|-- key.number2: StringType
…
However, when I try to use one of these column names in a query, it seems that
the Spark SQL parser always assumes I’m trying to reference a nested attribute.
The same thing happens when using HiveQL. If there’s a way to escape the dot in
the column name, I haven’t found it:
> sqlContext.sql(“SELECT key.number1 FROM TABLE sql_table LIMIT 1”).first
== Query Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
attributes: ‘key.number1, tree
Project [‘key.number1]
…
This is not a critical issue by any means— it’s simple enough to use map() to
transform the dots to underscores after loading the JSON data as text. I just
wanted to reach out to the community for some guidance as to whether or not
this issue warrants a bug report. To me, this behavior seems to be
inconsistent— you can create a table with column names containing dots, but
AFAICT you cannot include such columns in a query.
Also, I’d greatly appreciate it if anybody has any pointers as to where in the
source I should be looking if I wanted to patch this issue in my local branch.
I’ve taken a glance at some of the Spark SQL Catalyst code but I’m afraid I’m
too much of a Scala novice to make much headway here.
For reference, I’m using Spark 1.0.1. Thanks for your input.
Adam