Michael,

Thank you! Looks like the sbt build is broken for 1.3. I downloaded the source code for 1.3, but I get the following error a few minutes after I run "sbt/sbt publishLocal":
[error] (network-shuffle/*:update) sbt.ResolveException: unresolved dependency: org.apache.spark#spark-network-common_2.10;1.3.0: configuration not public in org.apache.spark#spark-network-common_2.10;1.3.0: 'test'. It was required from org.apache.spark#spark-network-shuffle_2.10;1.3.0 test
[error] Total time: 106 s, completed Apr 8, 2015 12:33:45 PM

Mohammed

From: Michael Armbrust <mich...@databricks.com>
Sent: Wednesday, April 8, 2015 11:54 AM
To: Mohammed Guller
Cc: Todd Nist; James Aley; user; Patrick Wendell
Subject: Re: Advice using Spark SQL and Thrift JDBC Server

Sorry guys. I didn't realize that https://issues.apache.org/jira/browse/SPARK-4925 was not fixed yet. You can publish locally in the meantime (sbt/sbt publishLocal).

On Wed, Apr 8, 2015 at 8:29 AM, Mohammed Guller <moham...@glassbeam.com> wrote:

+1

Interestingly, I ran into exactly the same issue yesterday. I couldn't find any documentation about which project to include as a dependency in build.sbt to use HiveThriftServer2. Would appreciate help.

Mohammed

From: Todd Nist <tsind...@gmail.com>
Sent: Wednesday, April 8, 2015 5:49 AM
To: James Aley
Cc: Michael Armbrust; user
Subject: Re: Advice using Spark SQL and Thrift JDBC Server

To use HiveThriftServer2.startWithContext, I thought one would use the following artifact in the build:

"org.apache.spark" %% "spark-hive-thriftserver" % "1.3.0"

But I am unable to resolve the artifact. I do not see it in Maven Central or any other repo. Do I need to build Spark and publish locally, or am I just missing something obvious here?

The basic class is like this:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveMetastoreTypes._
import org.apache.spark.sql.types._
import org.apache.spark.sql.hive.thriftserver._

object MyThriftServer {

  val sparkConf = new SparkConf()
    // master is passed to spark-submit, but could also be specified explicitly
    // .setMaster(sparkMaster)
    .setAppName("My ThriftServer")
    .set("spark.cores.max", "2")

  val sc = new SparkContext(sparkConf)
  val sparkContext = sc
  import sparkContext._

  val sqlContext = new HiveContext(sparkContext)
  import sqlContext._
  import sqlContext.implicits._

  // register temp tables here

  HiveThriftServer2.startWithContext(sqlContext)
}

The build has the following:

scalaVersion := "2.10.4"

val SPARK_VERSION = "1.3.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming-kafka" % SPARK_VERSION
    exclude("org.apache.spark", "spark-core_2.10")
    exclude("org.apache.spark", "spark-streaming_2.10")
    exclude("org.apache.spark", "spark-sql_2.10")
    exclude("javax.jms", "jms"),
  "org.apache.spark" %% "spark-core" % SPARK_VERSION % "provided",
  "org.apache.spark" %% "spark-streaming" % SPARK_VERSION % "provided",
  "org.apache.spark" %% "spark-sql" % SPARK_VERSION % "provided",
  "org.apache.spark" %% "spark-hive" % SPARK_VERSION % "provided",
  "org.apache.spark" %% "spark-hive-thriftserver" % SPARK_VERSION % "provided",
  "org.apache.kafka" %% "kafka" % "0.8.1.1"
    exclude("javax.jms", "jms")
    exclude("com.sun.jdmk", "jmxtools")
    exclude("com.sun.jmx", "jmxri"),
  "joda-time" % "joda-time" % "2.7",
  "log4j" % "log4j" % "1.2.14"
    exclude("com.sun.jdmk", "jmxtools")
    exclude("com.sun.jmx", "jmxri")
)

Appreciate the assistance.

-Todd

On Tue, Apr 7, 2015 at 4:09 PM, James Aley <james.a...@swiftkey.com> wrote:

Excellent, thanks for your help, I appreciate your advice!
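As a rough sketch of the build setup Mohammed and Todd are asking about above, assuming Spark 1.3.0 has first been published to the local Ivy repository with sbt/sbt publishLocal (sbt resolves artifacts from ~/.ivy2/local by default, so no extra resolver entry should be needed; the version value name is illustrative):

scalaVersion := "2.10.4"

val sparkVersion = "1.3.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-hive" % sparkVersion % "provided",
  // Not on Maven Central for 1.3.0 (SPARK-4925); resolved from the local Ivy
  // repository after running sbt/sbt publishLocal in the Spark source tree.
  "org.apache.spark" %% "spark-hive-thriftserver" % sparkVersion % "provided"
)

The list is trimmed to what HiveThriftServer2.startWithContext itself needs; any other dependencies from Todd's build above would be added alongside.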
On 7 Apr 2015 20:43, "Michael Armbrust" <mich...@databricks.com> wrote:

That should totally work. The other option would be to run a persistent metastore that multiple contexts can talk to, and periodically run a job that creates missing tables. The trade-off here would be more complexity, but less downtime due to the server restarting.

On Tue, Apr 7, 2015 at 12:34 PM, James Aley <james.a...@swiftkey.com> wrote:

Hi Michael,

Thanks so much for the reply - that really cleared a lot of things up for me!

Let me just check that I've interpreted one of your suggestions for (4) correctly... Would it make sense for me to write a small wrapper app that pulls in hive-thriftserver as a dependency, iterates my Parquet directory structure to discover "tables", and registers each as a temp table in some context, before calling HiveThriftServer2.startWithContext as you suggest?

This would mean that to add new content, all I need to do is restart that app, which presumably could also be avoided fairly trivially by periodically restarting the server with a new context internally. That certainly beats manual curation of Hive table definitions, if it will work?

Thanks again,

James.

On 7 April 2015 at 19:30, Michael Armbrust <mich...@databricks.com> wrote:

1) What exactly is the relationship between the thrift server and Hive? I'm guessing Spark is just making use of the Hive metastore to access table definitions, and maybe some other things, is that the case?

Underneath the covers, the Spark SQL thrift server is executing queries using a HiveContext. In this mode, nearly all computation is done with Spark SQL, but we try to maintain compatibility with Hive wherever possible. This means that you can write your queries in HiveQL, read tables from the Hive metastore, and use Hive UDFs, UDTFs, UDAFs, etc.

The one exception here is Hive DDL operations (CREATE TABLE, etc.). These are passed directly to Hive code and executed there. The Spark SQL DDL is sufficiently different that we always try to parse that first, and fall back to Hive when it does not parse.

One possibly confusing point here is that you can persist Spark SQL tables into the Hive metastore, but this is not the same as a Hive table. We only use the metastore as a repository for metadata; we do not use Hive's format for the data in this case (as we have data sources that Hive does not understand, including things like schema auto-discovery).

HiveQL DDL (run by Hive, but the table can be read by Spark SQL):

CREATE TABLE t (x INT) STORED AS PARQUET

Spark SQL DDL (run by Spark SQL and stored in the metastore, but the table cannot be read by Hive):

CREATE TABLE t USING parquet OPTIONS (path '/path/to/data')

2) Am I therefore right in thinking that SQL queries sent to the thrift server are still executed on the Spark cluster, using Spark SQL, and Hive plays no active part in computation of results?

Correct.

3) What SQL flavour is actually supported by the Thrift Server? Is it Spark SQL, Hive, or both? I'm confused, because I've seen it accepting Hive CREATE TABLE syntax, but Spark SQL seems to work too?

HiveQL++ (i.e., HiveQL with Spark SQL DDL). You can make it use our simple SQL parser by `SET spark.sql.dialect=sql`, but honestly you probably don't want to do this. The included SQL parser is mostly there for people who have dependency conflicts with Hive.
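As a small illustration of the two DDL flavours contrasted above, both statements can be submitted through a HiveContext, for example from a spark-shell built with Hive support (where sqlContext is already a HiveContext); the table names and path are the placeholders from the examples above:

// HiveQL DDL: handed off to Hive code; the resulting table is readable from both Hive and Spark SQL.
sqlContext.sql("CREATE TABLE t (x INT) STORED AS PARQUET")

// Spark SQL data source DDL: metadata goes into the metastore, but the table cannot be read by Hive itself.
sqlContext.sql("CREATE TABLE t2 USING parquet OPTIONS (path '/path/to/data')")

// Optional, per the note above: switch to the simple SQL parser (usually not recommended).
sqlContext.setConf("spark.sql.dialect", "sql")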
4) When I run SQL queries using the Scala or Python shells, Spark seems to figure out the schema by itself from my Parquet files very well, if I use registerTempTable on the DataFrame. It seems that when running the thrift server, I need to create a Hive table definition first? Is that the case, or did I miss something? If it is, is there some sensible way to automate this?

Temporary tables are only visible to the SQLContext that creates them. If you want a table to be visible to the server, you need to either start the thrift server with the same context your program is using (see HiveThriftServer2.startWithContext) or make a metastore table. The latter can be done using Spark SQL DDL:

CREATE TABLE t USING parquet OPTIONS (path '/path/to/data')

Michael
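To tie that answer to the wrapper approach James describes above, a minimal sketch using Spark 1.3 APIs: register each Parquet directory as a temporary table, then hand the same HiveContext to the thrift server. The /data/tables root and the one-dataset-per-subdirectory layout are assumptions for illustration only:

import java.io.File

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object ParquetTableServer {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Parquet ThriftServer"))
    val sqlContext = new HiveContext(sc)

    // Assumed layout: /data/tables/<tableName>/ holds one Parquet dataset each.
    val root = new File("/data/tables")
    for (dir <- Option(root.listFiles()).getOrElse(Array.empty[File]) if dir.isDirectory) {
      // Schema is discovered from the Parquet footers; no Hive DDL is needed.
      sqlContext.parquetFile(dir.getAbsolutePath).registerTempTable(dir.getName)
    }

    // Expose the registered temp tables over JDBC using the same context.
    HiveThriftServer2.startWithContext(sqlContext)
  }
}

Picking up new directories would then be a matter of restarting the app (or re-running the registration loop against a fresh context), as discussed earlier in the thread.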