Michael,

Thank you! Looks like the sbt build is broken for 1.3. I downloaded the source code for 1.3, but I get the following error a few minutes after I run "sbt/sbt publishLocal":
[error] (network-shuffle/*:update) sbt.ResolveException: unresolved dependency: org.apache.spark#spark-network-common_2.10;1.3.0: configuration not public in org.apache.spark#spark-network-common_2.10;1.3.0: 'test'. It was required from org.apache.spark#spark-network-shuffle_2.10;1.3.0 test
[error] Total time: 106 s, completed Apr 8, 2015 12:33:45 PM

Mohammed

From: Michael Armbrust <mich...@databricks.com>
Sent: Wednesday, April 8, 2015 11:54 AM
To: Mohammed Guller
Cc: Todd Nist; James Aley; user; Patrick Wendell
Subject: Re: Advice using Spark SQL and Thrift JDBC Server

Sorry guys. I didn't realize that https://issues.apache.org/jira/browse/SPARK-4925 was not fixed yet. You can publish locally in the meantime (sbt/sbt publishLocal).

On Wed, Apr 8, 2015 at 8:29 AM, Mohammed Guller <moham...@glassbeam.com> wrote:

+1

Interestingly, I ran into exactly the same issue yesterday. I couldn't find any documentation about which project to include as a dependency in build.sbt to use HiveThriftServer2. Would appreciate help.

Mohammed

From: Todd Nist <tsind...@gmail.com>
Sent: Wednesday, April 8, 2015 5:49 AM
To: James Aley
Cc: Michael Armbrust; user
Subject: Re: Advice using Spark SQL and Thrift JDBC Server

To use HiveThriftServer2.startWithContext, I thought one would use the following artifact in the build:

"org.apache.spark" %% "spark-hive-thriftserver" % "1.3.0"

But I am unable to resolve the artifact. I do not see it in Maven Central or any other repo. Do I need to build Spark and publish locally, or am I just missing something obvious here?

The basic class is like this:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveMetastoreTypes._
import org.apache.spark.sql.types._
import org.apache.spark.sql.hive.thriftserver._

object MyThriftServer {

  val sparkConf = new SparkConf()
    // master is passed to spark-submit, but could also be specified explicitly
    // .setMaster(sparkMaster)
    .setAppName("My ThriftServer")
    .set("spark.cores.max", "2")

  val sc = new SparkContext(sparkConf)
  val sparkContext = sc
  import sparkContext._

  val sqlContext = new HiveContext(sparkContext)
  import sqlContext._
  import sqlContext.implicits._

  // register temp tables here

  HiveThriftServer2.startWithContext(sqlContext)
}

The build has the following:

scalaVersion := "2.10.4"

val SPARK_VERSION = "1.3.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming-kafka" % SPARK_VERSION
    exclude("org.apache.spark", "spark-core_2.10")
    exclude("org.apache.spark", "spark-streaming_2.10")
    exclude("org.apache.spark", "spark-sql_2.10")
    exclude("javax.jms", "jms"),
  "org.apache.spark" %% "spark-core" % SPARK_VERSION % "provided",
  "org.apache.spark" %% "spark-streaming" % SPARK_VERSION % "provided",
  "org.apache.spark" %% "spark-sql" % SPARK_VERSION % "provided",
  "org.apache.spark" %% "spark-hive" % SPARK_VERSION % "provided",
  "org.apache.spark" %% "spark-hive-thriftserver" % SPARK_VERSION % "provided",
  "org.apache.kafka" %% "kafka" % "0.8.1.1"
    exclude("javax.jms", "jms")
    exclude("com.sun.jdmk", "jmxtools")
    exclude("com.sun.jmx", "jmxri"),
  "joda-time" % "joda-time" % "2.7",
  "log4j" % "log4j" % "1.2.14"
    exclude("com.sun.jdmk", "jmxtools")
    exclude("com.sun.jmx", "jmxri")
)

Appreciate the assistance.

-Todd

On Tue, Apr 7, 2015 at 4:09 PM, James Aley <james.a...@swiftkey.com> wrote:

Excellent, thanks for your help, I appreciate your advice!
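As a rough sketch of the build setup Mohammed and Todd are asking about above, assuming Spark 1.3.0 has first been published to the local Ivy repository with sbt/sbt publishLocal (sbt resolves artifacts from ~/.ivy2/local by default, so no extra resolver entry should be needed; the version value name is illustrative):

scalaVersion := "2.10.4"

val sparkVersion = "1.3.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-hive" % sparkVersion % "provided",
  // Not on Maven Central for 1.3.0 (SPARK-4925); resolved from the local Ivy
  // repository after running sbt/sbt publishLocal in the Spark source tree.
  "org.apache.spark" %% "spark-hive-thriftserver" % sparkVersion % "provided"
)

The list is trimmed to what HiveThriftServer2.startWithContext itself needs; any other dependencies from Todd's build above would be added alongside.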
On 7 Apr 2015 20:43, "Michael Armbrust" <mich...@databricks.com> wrote:

That should totally work. The other option would be to run a persistent metastore that multiple contexts can talk to, and periodically run a job that creates missing tables. The trade-off here would be more complexity, but less downtime due to the server restarting.

On Tue, Apr 7, 2015 at 12:34 PM, James Aley <james.a...@swiftkey.com> wrote:

Hi Michael,

Thanks so much for the reply - that really cleared a lot of things up for me!

Let me just check that I've interpreted one of your suggestions for (4) correctly... Would it make sense for me to write a small wrapper app that pulls in hive-thriftserver as a dependency, iterates my Parquet directory structure to discover "tables", and registers each as a temp table in some context, before calling HiveThriftServer2.startWithContext as you suggest?

This would mean that to add new content, all I need to do is restart that app, which presumably could also be avoided fairly trivially by periodically restarting the server with a new context internally. That certainly beats manual curation of Hive table definitions, if it will work?

Thanks again,

James.

On 7 April 2015 at 19:30, Michael Armbrust <mich...@databricks.com> wrote:

1) What exactly is the relationship between the thrift server and Hive? I'm guessing Spark is just making use of the Hive metastore to access table definitions, and maybe some other things, is that the case?

Underneath the covers, the Spark SQL thrift server is executing queries using a HiveContext. In this mode, nearly all computation is done with Spark SQL, but we try to maintain compatibility with Hive wherever possible. This means that you can write your queries in HiveQL, read tables from the Hive metastore, and use Hive UDFs, UDTFs, UDAFs, etc.

The one exception here is Hive DDL operations (CREATE TABLE, etc.). These are passed directly to Hive code and executed there. The Spark SQL DDL is sufficiently different that we always try to parse that first, and fall back to Hive when it does not parse.

One possibly confusing point here is that you can persist Spark SQL tables into the Hive metastore, but this is not the same as a Hive table. We only use the metastore as a repository for metadata; we do not use Hive's format for the data in this case (as we have data sources that Hive does not understand, including things like schema auto-discovery).

HiveQL DDL (run by Hive, but the table can be read by Spark SQL):

CREATE TABLE t (x INT) STORED AS PARQUET

Spark SQL DDL (run by Spark SQL and stored in the metastore, but the table cannot be read by Hive):

CREATE TABLE t USING parquet OPTIONS (path '/path/to/data')

2) Am I therefore right in thinking that SQL queries sent to the thrift server are still executed on the Spark cluster, using Spark SQL, and Hive plays no active part in computation of results?

Correct.

3) What SQL flavour is actually supported by the Thrift Server? Is it Spark SQL, Hive, or both? I'm confused, because I've seen it accepting Hive CREATE TABLE syntax, but Spark SQL seems to work too?

HiveQL++ (i.e., HiveQL with Spark SQL DDL). You can make it use our simple SQL parser by `SET spark.sql.dialect=sql`, but honestly you probably don't want to do this. The included SQL parser is mostly there for people who have dependency conflicts with Hive.
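As a small illustration of the two DDL flavours contrasted above, both statements can be submitted through a HiveContext, for example from a spark-shell built with Hive support (where sqlContext is already a HiveContext); the table names and path are the placeholders from the examples above:

// HiveQL DDL: handed off to Hive code; the resulting table is readable from both Hive and Spark SQL.
sqlContext.sql("CREATE TABLE t (x INT) STORED AS PARQUET")

// Spark SQL data source DDL: metadata goes into the metastore, but the table cannot be read by Hive itself.
sqlContext.sql("CREATE TABLE t2 USING parquet OPTIONS (path '/path/to/data')")

// Optional, per the note above: switch to the simple SQL parser (usually not recommended).
sqlContext.setConf("spark.sql.dialect", "sql")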
4) When I run SQL queries using the Scala or Python shells, Spark seems to figure out the schema by itself from my Parquet files very well, if I use registerTempTable on the DataFrame. It seems that when running the thrift server, I need to create a Hive table definition first? Is that the case, or did I miss something? If it is, is there some sensible way to automate this?

Temporary tables are only visible to the SQLContext that creates them. If you want a table to be visible to the server, you need to either start the thrift server with the same context your program is using (see HiveThriftServer2.startWithContext) or make a metastore table. The latter can be done using Spark SQL DDL:

CREATE TABLE t USING parquet OPTIONS (path '/path/to/data')

Michael
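To tie that answer to the wrapper approach James describes above, a minimal sketch using Spark 1.3 APIs: register each Parquet directory as a temporary table, then hand the same HiveContext to the thrift server. The /data/tables root and the one-dataset-per-subdirectory layout are assumptions for illustration only:

import java.io.File

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object ParquetTableServer {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Parquet ThriftServer"))
    val sqlContext = new HiveContext(sc)

    // Assumed layout: /data/tables/<tableName>/ holds one Parquet dataset each.
    val root = new File("/data/tables")
    for (dir <- Option(root.listFiles()).getOrElse(Array.empty[File]) if dir.isDirectory) {
      // Schema is discovered from the Parquet footers; no Hive DDL is needed.
      sqlContext.parquetFile(dir.getAbsolutePath).registerTempTable(dir.getName)
    }

    // Expose the registered temp tables over JDBC using the same context.
    HiveThriftServer2.startWithContext(sqlContext)
  }
}

Picking up new directories would then be a matter of restarting the app (or re-running the registration loop against a fresh context), as discussed earlier in the thread.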