Yes, that is correct. A quick reference on this is the post
https://www.linkedin.com/pulse/20141007143323-732459-an-absolutely-unofficial-way-to-connect-tableau-to-sparksql-spark-1-1?_mSplash=1
with the pertinent section being:

It is important to note that when you create Spark tables (for example, via
.registerTempTable), these operate within the Spark environment, which
resides in a separate process from the Hive Metastore. This means that,
currently, tables created within the Spark context are not available
through the Thrift server. To make them visible, save your temporary table
into Hive from within the Spark context - then the Spark Thrift Server
will be able to see the table.
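
A minimal sketch of that workflow in spark-shell (Spark 1.1/1.2-era API; the
CSV path and the Country case class are just placeholders for your own data):

  import org.apache.spark.sql.hive.HiveContext

  case class Country(id: String, name: String)

  val hiveContext = new HiveContext(sc)   // sc is provided by spark-shell
  import hiveContext.createSchemaRDD      // implicit RDD -> SchemaRDD conversion

  val countries = sc.textFile("countries.csv")
    .map(_.split(","))
    .map(f => Country(f(0), f(1)))

  // registerTempTable keeps the table local to this session only;
  // saveAsTable persists it through the Hive metastore, which is what
  // makes it visible to the Spark Thrift Server.
  countries.saveAsTable("countries")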

HTH!

On Thu, Dec 11, 2014 at 04:09 Anas Mosaad <anas.mos...@incorta.com> wrote:

> Actually, I came to the conclusion that RDDs have to be persisted in Hive in
> order to be accessible through Thrift.
> Hope I didn't end up with an incorrect conclusion.
> Please someone correct me if I am wrong.
> On Dec 11, 2014 8:53 AM, "Judy Nash" <judyn...@exchange.microsoft.com>
> wrote:
>
>>  It looks like you are wondering why you cannot see, via Thrift, the RDD
>> table you have created?
>>
>>
>>
>> Based on my own experience with Spark 1.1, an RDD table created directly via
>> Spark SQL (i.e., in spark-shell or spark-sql.sh) is not visible to Thrift,
>> since Thrift has its own session containing its own RDDs.
>>
>> Spark SQL experts on the list can confirm this, though.
>>
>>
>>
>> *From:* Cheng Lian [mailto:lian.cs....@gmail.com]
>> *Sent:* Tuesday, December 9, 2014 6:42 AM
>> *To:* Anas Mosaad
>> *Cc:* Judy Nash; user@spark.apache.org
>> *Subject:* Re: Spark-SQL JDBC driver
>>
>>
>>
>> According to the stacktrace, you were still using SQLContext rather than
>> HiveContext. To interact with Hive, HiveContext *must* be used.
>>
>> Please refer to this page
>> http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
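>>
>> For example (a minimal sketch; the variable names are illustrative):
>>
>>   val sqlContext  = new org.apache.spark.sql.SQLContext(sc)        // no Hive metastore access
>>   val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)  // needed for saveAsTable and Hive tables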
>>
>>  On 12/9/14 6:26 PM, Anas Mosaad wrote:
>>
>>  Back to the first question: does this mandate that Hive is up and running?
>>
>>
>>
>> When I try it, I get the following exception. The documentation says that
>> this method works only on a SchemaRDD. I thought that countries.saveAsTable
>> did not work for that reason, so I created a tmp RDD that contains the results
>> of the registered temp table, which I could validate is a SchemaRDD, as shown
>> below.
>>
>>
>>
>>
>> *@Judy,* I really appreciate your kind support. I want to understand this
>> and, of course, don't want to waste your time. If you can point me to the
>> documentation describing these details, that would be great.
>>
>>
>>
>> scala> val tmp = sqlContext.sql("select * from countries")
>> tmp: org.apache.spark.sql.SchemaRDD =
>> SchemaRDD[12] at RDD at SchemaRDD.scala:108
>> == Query Plan ==
>> == Physical Plan ==
>> PhysicalRDD [COUNTRY_ID#20,COUNTRY_ISO_CODE#21,COUNTRY_NAME#22,COUNTRY_SUBREGION#23,COUNTRY_SUBREGION_ID#24,COUNTRY_REGION#25,COUNTRY_REGION_ID#26,COUNTRY_TOTAL#27,COUNTRY_TOTAL_ID#28,COUNTRY_NAME_HIST#29], MapPartitionsRDD[9] at mapPartitions at ExistingRDD.scala:36
>>
>> scala> tmp.saveAsTable("Countries")
>>
>> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved plan found, tree:
>> 'CreateTableAsSelect None, Countries, false, None
>>  Project [COUNTRY_ID#20,COUNTRY_ISO_CODE#21,COUNTRY_NAME#22,COUNTRY_SUBREGION#23,COUNTRY_SUBREGION_ID#24,COUNTRY_REGION#25,COUNTRY_REGION_ID#26,COUNTRY_TOTAL#27,COUNTRY_TOTAL_ID#28,COUNTRY_NAME_HIST#29]
>>   Subquery countries
>>    LogicalRDD [COUNTRY_ID#20,COUNTRY_ISO_CODE#21,COUNTRY_NAME#22,COUNTRY_SUBREGION#23,COUNTRY_SUBREGION_ID#24,COUNTRY_REGION#25,COUNTRY_REGION_ID#26,COUNTRY_TOTAL#27,COUNTRY_TOTAL_ID#28,COUNTRY_NAME_HIST#29], MapPartitionsRDD[9] at mapPartitions at ExistingRDD.scala:36
>>   at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:83)
>>   at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:78)
>>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
>>   at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
>>   at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:78)
>>   at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:76)
>>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
>>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
>>   at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
>>   at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
>>   at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
>>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
>>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
>>   at scala.collection.immutable.List.foreach(List.scala:318)
>>   at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
>>   at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
>>   at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
>>   at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)
>>   at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)
>>   at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)
>>   at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)
>>   at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
>>   at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
>>   at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
>>   at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
>>   at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
>>   at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
>>   at org.apache.spark.sql.SchemaRDDLike$class.saveAsTable(SchemaRDDLike.scala:126)
>>   at org.apache.spark.sql.SchemaRDD.saveAsTable(SchemaRDD.scala:108)
>>   at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:22)
>>   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:27)
>>   at $iwC$$iwC$$iwC.<init>(<console>:29)
>>   at $iwC$$iwC.<init>(<console>:31)
>>   at $iwC.<init>(<console>:33)
>>   at <init>(<console>:35)
>>   at .<init>(<console>:39)
>>   at .<clinit>(<console>)
>>   at .<init>(<console>:7)
>>   at .<clinit>(<console>)
>>   at $print(<console>)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>   at java.lang.reflect.Method.invoke(Method.java:606)
>>   at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
>>   at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)
>>   at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)
>>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705)
>>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669)
>>   at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828)
>>   at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873)
>>   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785)
>>   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:628)
>>   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:636)
>>   at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:641)
>>   at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:968)
>>   at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)
>>   at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)
>>   at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>>   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:916)
>>   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1011)
>>   at org.apache.spark.repl.Main$.main(Main.scala:31)
>>   at org.apache.spark.repl.Main.main(Main.scala)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>   at java.lang.reflect.Method.invoke(Method.java:606)
>>   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:365)
>>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>>
>>
>>
>>
>>
>> On Tue, Dec 9, 2014 at 11:44 AM, Cheng Lian <lian.cs....@gmail.com>
>> wrote:
>>
>>  How did you register the table under spark-shell? Two things to notice:
>>
>> 1. To interact with Hive, HiveContext instead of SQLContext must be used.
>> 2. `registerTempTable` doesn't persist the table into the Hive metastore, and
>> the table is lost after quitting spark-shell. Instead, you must use
>> `saveAsTable`, as sketched below.
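>>
>> For example (a rough sketch, where `countries` is the SchemaRDD from your shell session):
>>
>>   countries.registerTempTable("countries")  // session-local only; invisible to the Thrift server
>>   countries.saveAsTable("countries")        // persisted via the Hive metastore (requires HiveContext)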
>>
>>
>>
>> On 12/9/14 5:27 PM, Anas Mosaad wrote:
>>
>>  Thanks Cheng,
>>
>>
>>
>> I thought spark-sql uses the exact same metastore, right? However, it
>> didn't work as expected. Here's what I did.
>>
>>
>>
>> In spark-shell, I loaded a CSV file and registered the table, say
>> countries.
>>
>> Started the Thrift server.
>>
>> Connected using Beeline. When I run show tables or !tables, I get an empty
>> list of tables, as follows:
>>
>>  0: jdbc:hive2://localhost:10000> !tables
>> +------------+--------------+-------------+-------------+----------+
>> | TABLE_CAT  | TABLE_SCHEM  | TABLE_NAME  | TABLE_TYPE  | REMARKS  |
>> +------------+--------------+-------------+-------------+----------+
>> +------------+--------------+-------------+-------------+----------+
>>
>> 0: jdbc:hive2://localhost:10000> show tables;
>> +---------+
>> | result  |
>> +---------+
>> +---------+
>> No rows selected (0.106 seconds)
>>
>> 0: jdbc:hive2://localhost:10000>
>>
>>
>>
>>
>>
>> Kindly advise: what am I missing? I want to read the RDD using SQL from
>> outside spark-shell (i.e., like any other relational database).
>>
>>
>>
>>
>>
>> On Tue, Dec 9, 2014 at 11:05 AM, Cheng Lian <lian.cs....@gmail.com>
>> wrote:
>>
>>  Essentially, the Spark SQL JDBC Thrift server is just a Spark port of
>> HiveServer2. You don't need to run Hive, but you do need a working
>> Metastore.
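>>
>> For example (a hedged sketch; the metastore URI is a placeholder):
>>
>>   ./sbin/start-thriftserver.sh --hiveconf hive.metastore.uris=thrift://metastore-host:9083
>>
>> Without a hive-site.xml in Spark's conf/ directory (or a --hiveconf override),
>> Spark falls back to an embedded Derby metastore created as metastore_db/ in the
>> working directory, which is enough for local testing.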
>>
>>
>>
>> On 12/9/14 3:59 PM, Anas Mosaad wrote:
>>
>>  Thanks Judy, this is exactly what I'm looking for. However, and please
>> forgive me if it's a dumb question: it seems to me that Thrift is the same
>> as the hive2 JDBC driver. Does this mean that starting Thrift will start
>> Hive as well on the server?
>>
>>
>>
>> On Mon, Dec 8, 2014 at 9:11 PM, Judy Nash <
>> judyn...@exchange.microsoft.com> wrote:
>>
>>  You can use the Thrift server for this purpose, then test it with Beeline.
>>
>>
>>
>> See doc:
>>
>>
>> https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server
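>>
>> For example, a quick end-to-end check (assuming the Thrift server listens on the default port 10000):
>>
>>   ./sbin/start-thriftserver.sh
>>   ./bin/beeline -u jdbc:hive2://localhost:10000
>>   0: jdbc:hive2://localhost:10000> show tables;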
>>
>>
>>
>>
>>
>> *From:* Anas Mosaad [mailto:anas.mos...@incorta.com]
>> *Sent:* Monday, December 8, 2014 11:01 AM
>> *To:* user@spark.apache.org
>> *Subject:* Spark-SQL JDBC driver
>>
>>
>>
>> Hello Everyone,
>>
>>
>>
>> I'm brand new to Spark and was wondering if there's a JDBC driver to
>> access Spark SQL directly. I'm running Spark in standalone mode and don't
>> have Hadoop in this environment.
>>
>>
>>
>> --
>>
>> *Best Regards/أطيب المنى,*
>>
>> *Anas Mosaad*
>>
>> *Incorta Inc.*
>>
>> *+20-100-743-4510*
>>
>
