Re: Spark-SQL JDBC driver

2014-12-14 Thread Michael Armbrust
I'll add that there is an experimental method that allows you to start the
JDBC server with an existing HiveContext (which might have registered
temporary tables).

https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42
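
For reference, a minimal sketch of how that method might be wired up from a standalone application (assuming a Spark 1.2-era Hive-enabled build; the object, case class and table names below are made up):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object ThriftWithTempTables {
  case class Record(id: Int, name: String)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("thrift-with-temp-tables"))
    val hiveContext = new HiveContext(sc)
    import hiveContext.createSchemaRDD  // implicit RDD[Record] -> SchemaRDD tied to hiveContext

    // Temporary tables registered on this HiveContext live only in this process.
    sc.parallelize(Seq(Record(1, "a"), Record(2, "b"))).registerTempTable("demo")

    // Start the Thrift/JDBC server in-process, sharing the same HiveContext, so
    // JDBC clients such as beeline can see the temporary table registered above.
    HiveThriftServer2.startWithContext(hiveContext)

    // Keep the driver alive so the server keeps serving requests.
    Thread.sleep(Long.MaxValue)
  }
}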



RE: Spark-SQL JDBC driver

2014-12-11 Thread Anas Mosaad
Actually, I came to the conclusion that RDDs have to be persisted in Hive in
order to be accessible through Thrift.
I hope I didn't end up with an incorrect conclusion.
Please, someone correct me if I am wrong.

Re: Spark-SQL JDBC driver

2014-12-11 Thread Denny Lee
Yes, that is correct. A quick reference on this is the post
https://www.linkedin.com/pulse/20141007143323-732459-an-absolutely-unofficial-way-to-connect-tableau-to-sparksql-spark-1-1?_mSplash=1
with the pertinent section being:

It is important to note that when you create Spark tables (for example, via
.registerTempTable), these operate within the Spark environment, which resides
in a separate process from the Hive Metastore. This means that, currently,
tables created within the Spark context are not available through the Thrift
server. To work around this, save your temporary table into Hive from within
the Spark context - then the Spark Thrift Server will be able to see the table.

HTH!
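
As a concrete illustration of that last step, here is a rough spark-shell sketch (assuming a Spark 1.1/1.2 Hive-enabled build; the countries.csv path, case class and column names are hypothetical):

// In spark-shell, with a Hive-enabled Spark build:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext.createSchemaRDD   // implicit RDD[case class] -> SchemaRDD

case class Country(id: Int, isoCode: String, name: String)

val raw = sc.textFile("countries.csv")   // hypothetical input file
val countries = raw.map(_.split(",")).map(c => Country(c(0).toInt, c(1), c(2)))

// Visible only inside this spark-shell session:
countries.registerTempTable("countries_tmp")

// Persisted through the Hive metastore, and therefore visible to the Thrift server / beeline:
hiveContext.sql("SELECT * FROM countries_tmp").saveAsTable("countries")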


RE: Spark-SQL JDBC driver

2014-12-10 Thread Judy Nash
Looks like you are wondering why you cannot see, via Thrift, the RDD table you
have created?

Based on my own experience with Spark 1.1, an RDD created directly via Spark SQL
(i.e. Spark Shell or Spark-SQL.sh) is not visible over Thrift, since Thrift has
its own session containing its own RDDs.
Spark SQL experts on the forum can confirm this, though.


Re: Spark-SQL JDBC driver

2014-12-09 Thread Anas Mosaad
Thanks Judy, this is exactly what I'm looking for. However, and please forgive
me if it's a dumb question: it seems to me that Thrift is the same as the hive2
JDBC driver; does this mean that starting Thrift will start Hive on the server
as well?

-- 

*Best Regards/أطيب المنى,*

*Anas Mosaad*
*Incorta Inc.*
*+20-100-743-4510*


Re: Spark-SQL JDBC driver

2014-12-09 Thread Cheng Lian
Essentially, the Spark SQL JDBC Thrift server is just a Spark port of 
HiveServer2. You don't need to run Hive, but you do need a working 
Metastore.
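
In other words, even without a Hive installation something like the following should work from spark-shell (a sketch, assuming the default behaviour where a HiveContext with no hive-site.xml on the classpath creates a local embedded Derby metastore and warehouse under the current working directory):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// With no hive-site.xml configured, this uses a local embedded metastore
// (a metastore_db directory) created in the current working directory.
hiveContext.sql("CREATE TABLE IF NOT EXISTS smoke_test (key INT, value STRING)")
hiveContext.sql("SHOW TABLES").collect().foreach(println)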



Re: Spark-SQL JDBC driver

2014-12-09 Thread Anas Mosaad
Thanks Cheng,

I thought spark-sql was using the exact same metastore, right? However, it
didn't work as expected. Here's what I did:

In spark-shell, I loaded a CSV file and registered the table, say countries.
Started the Thrift server.
Connected using beeline. When I run show tables or !tables, I get an empty
list of tables, as follows:

0: jdbc:hive2://localhost:1> !tables
+------------+--------------+-------------+-------------+----------+
| TABLE_CAT  | TABLE_SCHEM  | TABLE_NAME  | TABLE_TYPE  | REMARKS  |
+------------+--------------+-------------+-------------+----------+
+------------+--------------+-------------+-------------+----------+

0: jdbc:hive2://localhost:1> show tables ;
+---------+
| result  |
+---------+
+---------+
No rows selected (0.106 seconds)

0: jdbc:hive2://localhost:1>



Kindly advise: what am I missing? I want to read the RDD using SQL from
outside spark-shell (i.e. like any other relational database).



-- 

*Best Regards/أطيب المنى,*

*Anas Mosaad*
*Incorta Inc.*
*+20-100-743-4510*


Re: Spark-SQL JDBC driver

2014-12-09 Thread Cheng Lian

How did you register the table under spark-shell? Two things to notice:

1. To interact with Hive, HiveContext instead of SQLContext must be used.
2. `registerTempTable` doesn't persist the table into the Hive metastore, 
and the table is lost after quitting spark-shell. Instead, you must use 
`saveAsTable`.
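
A minimal spark-shell contrast of the two calls (a sketch; the Country case class and the sample data are made up):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext.createSchemaRDD   // implicit RDD[case class] -> SchemaRDD

case class Country(id: Int, name: String)
val countries = sc.parallelize(Seq(Country(1, "Egypt"), Country(2, "France")))

// Session-only: the name exists just inside this HiveContext and is gone when
// spark-shell exits, so the Thrift server will not see it.
countries.registerTempTable("countries_tmp")

// Persisted: the table is recorded in the Hive metastore, so any client sharing
// the same metastore (e.g. beeline through the Thrift server) can query it later.
countries.saveAsTable("countries")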



Re: Spark-SQL JDBC driver

2014-12-09 Thread Anas Mosaad
Back to the first question: will this mandate that Hive is up and running?

When I try it, I get the following exception. The documentation says that
this method works only on a SchemaRDD. I thought that was the reason
countries.saveAsTable did not work, so I created a tmp that contains the
results from the registered temp table, which I could validate is a SchemaRDD,
as shown below.


*@Judy,* I really do appreciate your kind support; I want to understand and of
course don't want to waste your time. If you can direct me to the documentation
describing these details, that would be great.

scala> val tmp = sqlContext.sql("select * from countries")

tmp: org.apache.spark.sql.SchemaRDD =

SchemaRDD[12] at RDD at SchemaRDD.scala:108

== Query Plan ==

== Physical Plan ==

PhysicalRDD
[COUNTRY_ID#20,COUNTRY_ISO_CODE#21,COUNTRY_NAME#22,COUNTRY_SUBREGION#23,COUNTRY_SUBREGION_ID#24,COUNTRY_REGION#25,COUNTRY_REGION_ID#26,COUNTRY_TOTAL#27,COUNTRY_TOTAL_ID#28,COUNTRY_NAME_HIST#29],
MapPartitionsRDD[9] at mapPartitions at ExistingRDD.scala:36


scala> tmp.saveAsTable("Countries")

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
plan found, tree:

'CreateTableAsSelect None, Countries, false, None

 Project
[COUNTRY_ID#20,COUNTRY_ISO_CODE#21,COUNTRY_NAME#22,COUNTRY_SUBREGION#23,COUNTRY_SUBREGION_ID#24,COUNTRY_REGION#25,COUNTRY_REGION_ID#26,COUNTRY_TOTAL#27,COUNTRY_TOTAL_ID#28,COUNTRY_NAME_HIST#29]

  Subquery countries

   LogicalRDD
[COUNTRY_ID#20,COUNTRY_ISO_CODE#21,COUNTRY_NAME#22,COUNTRY_SUBREGION#23,COUNTRY_SUBREGION_ID#24,COUNTRY_REGION#25,COUNTRY_REGION_ID#26,COUNTRY_TOTAL#27,COUNTRY_TOTAL_ID#28,COUNTRY_NAME_HIST#29],
MapPartitionsRDD[9] at mapPartitions at ExistingRDD.scala:36


 at
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:83)

at
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:78)

at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)

at
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)

at
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:78)

at
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:76)

at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)

at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)

at
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)

at
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)

at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)

at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)

at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)

at scala.collection.immutable.List.foreach(List.scala:318)

at
org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)

at
org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)

at
org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)

at
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)

at
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)

at
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)

at
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)

at
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)

at
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)

at
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)

at
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)

at
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)

at
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)

at
org.apache.spark.sql.SchemaRDDLike$class.saveAsTable(SchemaRDDLike.scala:126)

at org.apache.spark.sql.SchemaRDD.saveAsTable(SchemaRDD.scala:108)

at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:22)

at $iwC$$iwC$$iwC$$iwC.<init>(<console>:27)

at $iwC$$iwC$$iwC.<init>(<console>:29)

at $iwC$$iwC.<init>(<console>:31)

at $iwC.<init>(<console>:33)

at <init>(<console>:35)

at .<init>(<console>:39)

at .<clinit>(<console>)

at .<init>(<console>:7)

at .<clinit>(<console>)

at $print(<console>)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at 

Re: Spark-SQL JDBC driver

2014-12-09 Thread Cheng Lian
According to the stacktrace, you were still using SQLContext rather than 
HiveContext. To interact with Hive, HiveContext *must* be used.


Please refer to this page 
http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
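
Concretely, in spark-shell that means building a HiveContext on top of the existing SparkContext and doing both the temp-table registration and the saveAsTable through it, e.g. (a sketch; rawCountries is a hypothetical stand-in for the RDD of case-class rows parsed from the CSV):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext.createSchemaRDD   // implicits tied to the HiveContext

// Temporary tables belong to the context they were registered with, so the data
// has to be registered against the HiveContext rather than the plain sqlContext.
rawCountries.registerTempTable("countries")   // rawCountries: hypothetical RDD of case-class rows

// saveAsTable now resolves as a Hive CREATE TABLE AS SELECT and persists the table.
hiveContext.sql("select * from countries").saveAsTable("Countries")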




RE: Spark-SQL JDBC driver

2014-12-08 Thread Judy Nash
You can use the Thrift server for this purpose and then test it with beeline.

See doc:
https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server


From: Anas Mosaad [mailto:anas.mos...@incorta.com]
Sent: Monday, December 8, 2014 11:01 AM
To: user@spark.apache.org
Subject: Spark-SQL JDBC driver

Hello Everyone,

I'm brand new to Spark and was wondering if there's a JDBC driver to access 
Spark SQL directly. I'm running Spark in standalone mode and don't have Hadoop 
in this environment.

--

Best Regards/أطيب المنى,

Anas Mosaad