I doubt it will work as expected.
Note that hiveContext.hql("select ...").registerAsTable("a") will create a
SchemaRDD and then register it with the (Hive) catalog, while
sqlContext.jsonFile("xxx").registerAsTable("b") will create a SchemaRDD and
then register it with the SparkSQL catalog (SimpleCatalog).
The logical plans of the two SchemaRDDs are of the same type, but the physical
plans are, and should be, different.
The issue is that the transformation of a logical plan into a physical plan is
controlled by the "strategies" of the context: SQLContext transforms a logical
plan into a physical plan suitable for executing the SchemaRDD against an
in-memory data source, while HiveContext transforms it into a physical plan
suitable for executing against a Hive data source. So
sqlContext.sql( a join b ) will generate a physical plan for the in-memory
data source for both a and b, and hiveContext.sql( a join b ) will generate a
physical plan for the Hive data source for both a and b.
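To make the mismatch concrete, here is a minimal Scala sketch of the
two-context setup under discussion (the Hive table name, JSON path, and join
keys are placeholders I made up):

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

val sqlContext  = new SQLContext(sc)
val hiveContext = new HiveContext(sc)

// "a" ends up in the HiveContext's (Hive) catalog ...
hiveContext.hql("SELECT * FROM some_hive_table").registerAsTable("a")

// ... while "b" ends up in the SQLContext's SimpleCatalog.
sqlContext.jsonFile("data.json").registerAsTable("b")

// Each context resolves table names against its own catalog and plans
// with its own strategies, so this cross-catalog join may not resolve
// or plan as intended.
sqlContext.sql("SELECT * FROM a JOIN b ON a.id = b.id")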
What's really needed is storage transparency at the semantic layer if SparkSQL
wants to go the data federation route.
If one could manage to create a SchemaRDD on Hive data through just the
SQLContext, not the HiveContext (which is a subclass of SQLContext), as
seemingly hinted by the SparkSQL web page https://spark.apache.org/sql/ in the
following code snippet:

sqlCtx.jsonFile("s3n://...")
  .registerAsTable("json")
schema_rdd = sqlCtx.sql("""
  SELECT *
  FROM hiveTable
  JOIN json ...""")

then he/she might be able to perform the join of data sets of different types.
I just have not tried it.
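If that hint holds, the single-context version would look roughly like the
following Scala sketch (untested, as said; hiveTable and the S3 path are
placeholders). It leans on HiveContext being a subclass of SQLContext, so one
catalog and one set of strategies see both tables:

import org.apache.spark.sql.hive.HiveContext

// One context, hence one catalog and one set of planning strategies.
val hiveContext = new HiveContext(sc)

// jsonFile is inherited from SQLContext, so the JSON data is registered
// in the same catalog that already resolves the Hive tables.
hiveContext.jsonFile("s3n://bucket/data.json").registerAsTable("json")

// Both sides of the join now resolve against the same catalog.
val joined = hiveContext.hql(
  "SELECT * FROM hiveTable JOIN json ON hiveTable.id = json.id")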
In terms of SQL-92 conformance, Presto might be better than HiveQL; in terms
of federation, though, Hive is actually very good.
-----Original Message-----
From: chutium [mailto:[email protected]]
Sent: Thursday, August 21, 2014 4:35 AM
To: [email protected]
Subject: Re: Spark SQL Query and join different data sources.
As far as I know, HQL queries try to find the schema info of all the tables in
the query from the Hive metastore, so it is not possible to join tables from
sqlContext using hiveContext.hql.
But this should work:

hiveContext.hql("select ...").registerAsTable("a")
sqlContext.jsonFile("xxx").registerAsTable("b")

then

sqlContext.sql( a join b )
I created a ticket, SPARK-2710, to add ResultSets from JDBC connections as a
new data source, but there is no predicate push-down yet, and it is not
available from HQL.
So, if you are looking for something that can query different data sources
with full SQL-92 syntax, Facebook Presto is still the only choice; they have
some kind of JDBC connector in development, and there are some unofficial
implementations...
But I am looking forward to seeing the progress of Spark SQL. After
SPARK-2179, SQLContext can handle any kind of structured data with a sequence
of DataTypes as the schema, although turning the data into Rows is still a
little bit tricky...
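For reference, a minimal sketch of that path as of SPARK-2179 (applySchema);
the input file, field names, and types here are made up for illustration:

import org.apache.spark.sql._

val sqlContext = new SQLContext(sc)

// A schema described as a sequence of DataTypes.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age",  IntegerType, nullable = true)))

// The "tricky" part: turning raw records into Rows by hand.
val rowRDD = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Row(p(0), p(1).trim.toInt))

// Attach the schema and register the result as a table.
sqlContext.applySchema(rowRDD, schema).registerAsTable("people")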