Re: Spark SQL Query and join different data sources.
Actually, with HiveContext, you can join Hive tables with registered temporary tables.

On Fri, Aug 22, 2014 at 9:07 PM, chutium teng@gmail.com wrote:

Oops, thanks Yan, you are right. I got:

scala> sqlContext.sql("select * from a join b").take(10)
java.lang.RuntimeException: Table Not Found: b
        at scala.sys.package$.error(package.scala:27)
        at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:90)
        at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:90)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:90)

and with hql:

scala> hiveContext.hql("select * from a join b").take(10)
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
14/08/22 14:48:45 INFO parse.ParseDriver: Parsing command: select * from a join b
14/08/22 14:48:45 INFO parse.ParseDriver: Parse Completed
14/08/22 14:48:45 ERROR metadata.Hive: NoSuchObjectException(message:default.a table not found)
        at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:27129)
        at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:27097)
        at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result.read(ThriftHiveMetastore.java:27028)
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
        at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table(ThriftHiveMetastore.java:936)
        at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table(ThriftHiveMetastore.java:922)
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:854)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:601)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
        at com.sun.proxy.$Proxy17.getTable(Unknown Source)
        at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950)
        at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:924)
        at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:59)

So sqlContext looks up tables via org.apache.spark.sql.catalyst.analysis.SimpleCatalog (Catalog.scala), while hiveContext looks them up via org.apache.spark.sql.hive.HiveMetastoreCatalog (HiveMetastoreCatalog.scala).

Maybe we could do something in sqlContext to register a Hive table as a Spark SQL table; we would need to read the column info, partition info, location, SerDe, Input/OutputFormat, and maybe the StorageHandler as well, from the Hive metastore...
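To make the working case from the reply above concrete: a minimal sketch, assuming a Spark 1.0/1.1-era shell where sc is the SparkContext, "a" is a table that already exists in the Hive metastore, and the JSON path and join column are invented for illustration. Registering the temporary table through the HiveContext, rather than through a separate SQLContext, puts both names in the same catalog so the join resolves:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// "a" already exists in the Hive metastore; register the JSON data with
// the same HiveContext so both tables live in one catalog.
hiveContext.jsonFile("/tmp/b.json").registerAsTable("b")  // hypothetical path

// Both names now resolve through the Hive-side catalog.
hiveContext.hql("SELECT * FROM a JOIN b ON a.id = b.id").take(10)  // hypothetical join column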
Re: Spark SQL Query and join different data sources.
As far as I know, HQL queries try to find the schema info of all the tables in the query in the Hive metastore, so it is not possible to join tables from sqlContext using hiveContext.hql. But this should work:

hiveContext.hql("select ...").registerAsTable("a")
sqlContext.jsonFile("xxx").registerAsTable("b")

and then:

sqlContext.sql("select * from a join b")

I created a ticket, SPARK-2710, to add ResultSets from a JDBC connection as a new data source, but there is no predicate push-down yet, and it is not available from HQL.

So if you are looking for something that can query different data sources with full SQL-92 syntax, Facebook Presto is still the only choice; they have some kind of JDBC connector in development, and there are some unofficial implementations...

But I am looking forward to seeing the progress of Spark SQL. Since SPARK-2179, SQLContext can handle any kind of structured data with a sequence of DataTypes as the schema, although turning the data into Rows is still a little bit tricky... (see the sketch below)
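On that last point, a rough sketch of the kind of thing SPARK-2179 enables, against the Spark 1.1-era API; the file, schema, and field names here are invented for illustration:

import org.apache.spark.sql._

// Schema expressed as a StructType over a sequence of typed fields.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)))

// The "tricky" part: every input record must be turned into a Row by hand.
val rowRDD = sc.textFile("/tmp/people.csv")   // hypothetical file
  .map(_.split(","))
  .map(p => Row(p(0).trim.toInt, p(1).trim))

// Attach the schema and register the result as a queryable table.
sqlContext.applySchema(rowRDD, schema).registerAsTable("people")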
RE: Spark SQL Query and join different data sources.
I doubt it will work as expected. Note that

hiveContext.hql("select ...").registerAsTable("a")

will create a SchemaRDD and then register that SchemaRDD with the (Hive) catalog, while

sqlContext.jsonFile("xxx").registerAsTable("b")

will create a SchemaRDD and then register that SchemaRDD with the Spark SQL catalog (SimpleCatalog). The logical plans of the two SchemaRDDs are of the same type, but the physical plans are, and should be, different. The issue is that the transformation of a logical plan into a physical plan is controlled by the strategies of the context: sqlContext transforms a logical plan into a physical plan suited to executing a SchemaRDD against an in-memory data source, while HiveContext transforms a logical plan into a physical plan suited to executing a SchemaRDD against a Hive data source. So sqlContext.sql("a join b") will generate a physical plan for the in-memory data source for both a and b, and hiveContext.sql("a join b") will generate a physical plan for the Hive data source for both a and b.

What is really needed is storage transparency at the semantic layer, if Spark SQL wants to go the data-federation route. If one could manage to create a SchemaRDD on Hive data through just the SQLContext, not the HiveContext (a subclass of SQLContext), as seemingly hinted by the Spark SQL web page https://spark.apache.org/sql/ in the following code snippet:

sqlCtx.jsonFile("s3n://...")
  .registerAsTable("json")
schema_rdd = sqlCtx.sql("SELECT * FROM hiveTable JOIN json ...")

then he/she might be able to perform the join of data sets of different types. I just have not tried it.

In terms of SQL-92 conformance, Presto might be better than HiveQL; in terms of federation, though, Hive is actually very good at it.
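A compact illustration of the catalog split described above, again with Spark 1.0/1.1-era calls and invented table/file names:

// registerAsTable binds a SchemaRDD to the catalog of the context that
// created it, so these two temporary tables land in different catalogs.
val fromHive = hiveContext.hql("SELECT * FROM some_hive_table")  // hypothetical table
fromHive.registerAsTable("a")     // lands in the Hive-side catalog

val fromJson = sqlContext.jsonFile("/tmp/b.json")                // hypothetical path
fromJson.registerAsTable("b")     // lands in SimpleCatalog

// Neither context can resolve both names, matching the stack traces
// earlier in the thread:
//   sqlContext.sql("SELECT * FROM a JOIN b")   // Table Not Found: a
//   hiveContext.hql("SELECT * FROM a JOIN b")  // default.b table not found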
RE: Spark SQL Query and join different data sources.
Presto is so far good at joining different sources/databases. I tried a simple join query in Spark SQL, and it fails with the following errors:

val a = cql("select test.a from test JOIN test1 on test.a = test1.a")
a: org.apache.spark.sql.SchemaRDD =
SchemaRDD[0] at RDD at SchemaRDD.scala:104
== Query Plan ==
Project [a#7]
 Filter (a#7 = a#21)
  CartesianProduct

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure: Exception failure in TID 3 on host 127.0.0.1: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: No function to evaluate expression. type: AttributeReference, tree: a#7
        org.apache.spark.sql.catalyst.expressions.AttributeReference.eval(namedExpressions.scala:158)
        org.apache.spark.sql.catalyst.expressions.EqualTo.eval(predicates.scala:146)
        org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:54)
        org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:54)
        scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
        scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        scala.collection.Iterator$class.foreach(Iterator.scala:727)
        scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
        scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
        scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
        scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
        scala.collection.AbstractIterator.to(Iterator.scala:1157)
        scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
        scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
        scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
        scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
        org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:731)
        org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:731)
        org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
        org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
        org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
        org.apache.spark.scheduler.Task.run(Task.scala:51)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:745)

It looks like Spark SQL still has a long way to go to be compatible with standard SQL.
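For what it's worth, one way to sidestep the failing SQL join at the time was to drop down to the plain RDD API and join by key. A rough, untested sketch; it assumes column a is the first projected field and a String, and that the pair-RDD implicits are in scope (they are imported automatically in the shell):

import org.apache.spark.SparkContext._  // pair-RDD implicits (pre-1.3 style)

// Key each result by the join column and use the plain RDD join,
// avoiding the CartesianProduct + Filter plan that fails above.
val left  = cql("SELECT a FROM test").map(row => (row.getString(0), row))
val right = cql("SELECT a FROM test1").map(row => (row.getString(0), row))

left.join(right).take(10)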