Re: Spark SQL Query and join different data sources.

2014-09-02 Thread Yin Huai
Actually, with HiveContext, you can join Hive tables with registered
temporary tables.
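
For example, a minimal sketch (assuming Spark 1.1, a Hive table named
hive_table already in the metastore, and a hypothetical people.json; on
1.0.x use registerAsTable instead of registerTempTable):

  import org.apache.spark.sql.hive.HiveContext

  val hiveContext = new HiveContext(sc)

  // register a temporary table from a non-Hive source in the HiveContext
  hiveContext.jsonFile("people.json").registerTempTable("json_people")

  // both names now resolve through the same context: hive_table via the
  // Hive metastore, json_people via the catalog of temporary tables
  hiveContext.sql(
    "SELECT h.key, j.name FROM hive_table h JOIN json_people j ON h.key = j.key"
  ).take(10)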


On Fri, Aug 22, 2014 at 9:07 PM, chutium teng@gmail.com wrote:

 Oops, thanks Yan, you are right. I got:

 scala> sqlContext.sql("select * from a join b").take(10)
 java.lang.RuntimeException: Table Not Found: b
 at scala.sys.package$.error(package.scala:27)
 at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:90)
 at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:90)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:90)

 and with hql

 scala> hiveContext.hql("select * from a join b").take(10)
 warning: there were 1 deprecation warning(s); re-run with -deprecation for
 details
 14/08/22 14:48:45 INFO parse.ParseDriver: Parsing command: select * from a
 join b
 14/08/22 14:48:45 INFO parse.ParseDriver: Parse Completed
 14/08/22 14:48:45 ERROR metadata.Hive:
 NoSuchObjectException(message:default.a table not found)
 at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:27129)
 at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:27097)
 at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result.read(ThriftHiveMetastore.java:27028)
 at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
 at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table(ThriftHiveMetastore.java:936)
 at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table(ThriftHiveMetastore.java:922)
 at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:854)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
 at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
 at com.sun.proxy.$Proxy17.getTable(Unknown Source)
 at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950)
 at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:924)
 at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:59)


 So sqlContext looks up tables in
 org.apache.spark.sql.catalyst.analysis.SimpleCatalog (Catalog.scala), while
 hiveContext looks them up in org.apache.spark.sql.hive.HiveMetastoreCatalog
 (HiveMetastoreCatalog.scala).

 Maybe we can do something in sqlContext to register a Hive table as a
 Spark SQL table; we would need to read the column info, partition info,
 location, SerDe, Input/OutputFormat, and maybe the StorageHandler too, from
 the Hive metastore...
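
 A rough sketch of the raw metastore lookup such a registration could start
 from (hedged: this is just the Hive metastore client API, not an actual
 integration; the HiveConf is assumed to point at the right metastore, and
 "default"/"a" are the names from the example above):

   import org.apache.hadoop.hive.conf.HiveConf
   import org.apache.hadoop.hive.metastore.HiveMetaStoreClient
   import scala.collection.JavaConverters._

   val client = new HiveMetaStoreClient(new HiveConf())
   val table  = client.getTable("default", "a")
   val sd     = table.getSd

   // the pieces a sqlContext-side registration would need
   val columns       = sd.getCols.asScala.map(f => (f.getName, f.getType))
   val location      = sd.getLocation
   val serde         = sd.getSerdeInfo.getSerializationLib
   val inputFormat   = sd.getInputFormat
   val outputFormat  = sd.getOutputFormat
   val partitionKeys = table.getPartitionKeys.asScala.map(_.getName)

   client.close()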








Re: Spark SQL Query and join different data sources.

2014-08-21 Thread chutium
As far as I know, HQL queries try to find the schema info for all the tables
in the query in the Hive metastore, so it is not possible to join tables
registered in sqlContext using hiveContext.hql.

But this should work:

hiveContext.hql("select ...").registerAsTable("a")
sqlContext.jsonFile("xxx").registerAsTable("b")

then

sqlContext.sql("... a join b ...")


I created a ticket, SPARK-2710, to add ResultSets from JDBC connections as a
new data source, but there is no predicate push-down yet, and it is not
available from HQL either.

So, if you are looking for something that can query different data sources
with full SQL-92 syntax, Facebook Presto is still the only choice; they have
some kind of JDBC connector in development, and there are some unofficial
implementations...

But I am looking forward to seeing the progress of Spark SQL: after
SPARK-2179, SQLContext can handle any kind of structured data with a
sequence of DataTypes as the schema, although turning the data into Rows is
still a little bit tricky...
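
For instance, a minimal sketch of that SPARK-2179 path on Spark 1.1 (the
file name and schema here are made up):

  import org.apache.spark.sql._

  val sqlContext = new SQLContext(sc)

  // describe the structure as a sequence of DataTypes
  val schema = StructType(Seq(
    StructField("id", IntegerType, nullable = false),
    StructField("name", StringType, nullable = true)))

  // the tricky part: turning each record into a Row by hand
  val rowRDD = sc.textFile("people.csv")
    .map(_.split(","))
    .map(p => Row(p(0).trim.toInt, p(1)))

  val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)
  peopleSchemaRDD.registerTempTable("people")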






RE: Spark SQL Query and join different data sources.

2014-08-21 Thread Yan Zhou.sc
I doubt it will work as expected.

Note that hiveContext.hql("select ...").registerAsTable("a") will create a
SchemaRDD before registering that SchemaRDD with the (Hive) catalog, while
sqlContext.jsonFile("xxx").registerAsTable("b") will create a SchemaRDD
before registering that SchemaRDD with the Spark SQL catalog (SimpleCatalog).
The logical plans of the two SchemaRDDs are of the same type, but the
physical plans are, and should be, different.
The issue is that the transformation from logical plan to physical plan is
controlled by the strategies of the contexts: the sqlContext transforms a
logical plan into a physical plan suitable for a SchemaRDD's execution
against an in-memory data source, while the HiveContext transforms a logical
plan into a physical plan suitable for a SchemaRDD's execution against a
Hive data source. So

sqlContext.sql("... a join b ...") will generate a physical plan against the
in-memory data source for both a and b, and hiveContext.sql("... a join b ...")
will generate a physical plan against the Hive data source for both a and b.

What's really needed is storage transparency at the semantic layer if
Spark SQL wants to go the data-federation route.


If one could manage to create a SchemaRDD on Hive data through just the
SQLContext, not the HiveContext (which is a subclass of SQLContext), as
seemingly hinted by the Spark SQL web page https://spark.apache.org/sql/ in
the following code snippet:

sqlCtx.jsonFile("s3n://...")
  .registerAsTable("json")
schema_rdd = sqlCtx.sql("""
  SELECT *
  FROM hiveTable
  JOIN json ...""")

he/she might be able to perform a join of data sets of different types. I
just have not tried it.


In terms of SQL-92 conformance, Presto might be better than HiveQL, while in
terms of federation, Hive is actually very good at it.







RE: Spark SQL Query and join different data sources.

2014-08-21 Thread alexliu68
So far, Presto is good at joining different sources/databases.

I tried a simple join query in Spark SQL, and it fails with the following
errors:

val a = cql("select test.a from test JOIN test1 on test.a = test1.a")
a: org.apache.spark.sql.SchemaRDD = 
SchemaRDD[0] at RDD at SchemaRDD.scala:104
== Query Plan ==
Project [a#7]
 Filter (a#7 = a#21)
  CartesianProduct 

org.apache.spark.SparkException: Job aborted due to stage failure: Task
0.0:0 failed 4 times, most recent failure: Exception failure in TID 3 on
host 127.0.0.1:
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: No function
to evaluate expression. type: AttributeReference, tree: a#7
    org.apache.spark.sql.catalyst.expressions.AttributeReference.eval(namedExpressions.scala:158)
    org.apache.spark.sql.catalyst.expressions.EqualTo.eval(predicates.scala:146)
    org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:54)
    org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:54)
    scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
    scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    scala.collection.Iterator$class.foreach(Iterator.scala:727)
    scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    scala.collection.AbstractIterator.to(Iterator.scala:1157)
    scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:731)
    org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:731)
    org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
    org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
    org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
    org.apache.spark.scheduler.Task.run(Task.scala:51)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:745)

It looks like Spark SQL has a long way to go to be compatible with standard SQL.
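
For what it's worth, a hedged way to see what the planner chose without
running the job (queryExecution is a developer API on SchemaRDD, and cql is
assumed to return a SchemaRDD as in the snippet above):

  // inspect the physical plan only; nothing is executed yet
  val a = cql("select test.a from test JOIN test1 on test.a = test1.a")
  println(a.queryExecution.executedPlan)
  // CartesianProduct + Filter here means the equality predicate was not
  // recognized as a join key by the planner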



