Hi Mohammed, I think you just need to add -DskipTests to your build. Here is how I built it:
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests clean package install

build/sbt does, however, fail even when only doing package, which should skip tests. I am able to build the "MyThriftServer" above now.

Thanks Michael for the assistance.

-Todd
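A consolidated sketch of this workaround, assuming a Spark 1.3.0 source checkout (adjust the profiles for your Hadoop version). Note that sbt does not resolve from the local Maven repository by default, so the resolver line at the end is an assumption about the consuming project rather than something confirmed in this thread:

    # Build Spark and install all artifacts, including
    # spark-hive-thriftserver_2.10, into ~/.m2/repository:
    mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 \
        -Phive -Phive-thriftserver -DskipTests clean install

    # The sbt equivalent, which currently fails for some users with the
    # network-shuffle resolution error quoted below:
    # sbt/sbt publishLocal

Then, in the project that depends on spark-hive-thriftserver, the local Maven repository would likely need to be added as a resolver in build.sbt:

    resolvers += Resolver.mavenLocal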
On Wed, Apr 8, 2015 at 3:39 PM, Mohammed Guller <moham...@glassbeam.com> wrote:

> Michael,
>
> Thank you!
>
> Looks like the sbt build is broken for 1.3. I downloaded the source code
> for 1.3, but I get the following error a few minutes after I run
> "sbt/sbt publishLocal":
>
> [error] (network-shuffle/*:update) sbt.ResolveException: unresolved
> dependency: org.apache.spark#spark-network-common_2.10;1.3.0: configuration
> not public in org.apache.spark#spark-network-common_2.10;1.3.0: 'test'. It
> was required from org.apache.spark#spark-network-shuffle_2.10;1.3.0 test
> [error] Total time: 106 s, completed Apr 8, 2015 12:33:45 PM
>
> Mohammed
>
> *From:* Michael Armbrust [mailto:mich...@databricks.com]
> *Sent:* Wednesday, April 8, 2015 11:54 AM
> *To:* Mohammed Guller
> *Cc:* Todd Nist; James Aley; user; Patrick Wendell
> *Subject:* Re: Advice using Spark SQL and Thrift JDBC Server
>
> Sorry guys. I didn't realize that
> https://issues.apache.org/jira/browse/SPARK-4925 was not fixed yet.
>
> You can publish locally in the meantime (sbt/sbt publishLocal).
>
> On Wed, Apr 8, 2015 at 8:29 AM, Mohammed Guller <moham...@glassbeam.com> wrote:
>
> +1
>
> Interestingly, I ran into exactly the same issue yesterday. I couldn't
> find any documentation about which project to include as a dependency
> in build.sbt to use HiveThriftServer2. Would appreciate help.
>
> Mohammed
>
> *From:* Todd Nist [mailto:tsind...@gmail.com]
> *Sent:* Wednesday, April 8, 2015 5:49 AM
> *To:* James Aley
> *Cc:* Michael Armbrust; user
> *Subject:* Re: Advice using Spark SQL and Thrift JDBC Server
>
> To use HiveThriftServer2.startWithContext, I thought one would use the
> following artifact in the build:
>
> "org.apache.spark" %% "spark-hive-thriftserver" % "1.3.0"
>
> But I am unable to resolve the artifact. I do not see it in Maven Central
> or any other repo. Do I need to build Spark and publish locally, or am I
> just missing something obvious here?
>
> The basic class is like this:
>
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.sql.hive.HiveMetastoreTypes._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.hive.thriftserver._
>
> object MyThriftServer {
>
>   val sparkConf = new SparkConf()
>     // master is passed to spark-submit, but could also be specified explicitly
>     // .setMaster(sparkMaster)
>     .setAppName("My ThriftServer")
>     .set("spark.cores.max", "2")
>   val sc = new SparkContext(sparkConf)
>   val sparkContext = sc
>   import sparkContext._
>   val sqlContext = new HiveContext(sparkContext)
>   import sqlContext._
>   import sqlContext.implicits._
>
>   // register temp tables here
>   HiveThriftServer2.startWithContext(sqlContext)
> }
>
> The build has the following:
>
> scalaVersion := "2.10.4"
>
> val SPARK_VERSION = "1.3.0"
>
> libraryDependencies ++= Seq(
>   "org.apache.spark" %% "spark-streaming-kafka" % SPARK_VERSION
>     exclude("org.apache.spark", "spark-core_2.10")
>     exclude("org.apache.spark", "spark-streaming_2.10")
>     exclude("org.apache.spark", "spark-sql_2.10")
>     exclude("javax.jms", "jms"),
>   "org.apache.spark" %% "spark-core" % SPARK_VERSION % "provided",
>   "org.apache.spark" %% "spark-streaming" % SPARK_VERSION % "provided",
>   "org.apache.spark" %% "spark-sql" % SPARK_VERSION % "provided",
>   "org.apache.spark" %% "spark-hive" % SPARK_VERSION % "provided",
>   "org.apache.spark" %% "spark-hive-thriftserver" % SPARK_VERSION % "provided",
>   "org.apache.kafka" %% "kafka" % "0.8.1.1"
>     exclude("javax.jms", "jms")
>     exclude("com.sun.jdmk", "jmxtools")
>     exclude("com.sun.jmx", "jmxri"),
>   "joda-time" % "joda-time" % "2.7",
>   "log4j" % "log4j" % "1.2.14"
>     exclude("com.sun.jdmk", "jmxtools")
>     exclude("com.sun.jmx", "jmxri")
> )
>
> Appreciate the assistance.
>
> -Todd
>
> On Tue, Apr 7, 2015 at 4:09 PM, James Aley <james.a...@swiftkey.com> wrote:
>
> Excellent, thanks for your help, I appreciate your advice!
>
> On 7 Apr 2015 20:43, "Michael Armbrust" <mich...@databricks.com> wrote:
>
> That should totally work. The other option would be to run a persistent
> metastore that multiple contexts can talk to and periodically run a job
> that creates missing tables. The trade-off here would be more complexity,
> but less downtime due to server restarts.
>
> On Tue, Apr 7, 2015 at 12:34 PM, James Aley <james.a...@swiftkey.com> wrote:
>
> Hi Michael,
>
> Thanks so much for the reply - that really cleared a lot of things up for me!
>
> Let me just check that I've interpreted one of your suggestions for (4)
> correctly... Would it make sense for me to write a small wrapper app that
> pulls in hive-thriftserver as a dependency, iterates my Parquet directory
> structure to discover "tables" and registers each as a temp table in some
> context, before calling HiveThriftServer2.startWithContext as you suggest?
>
> This would mean that to add new content, all I need to do is restart that
> app, which presumably could also be avoided fairly trivially by
> periodically restarting the server with a new context internally. That
> certainly beats manual curation of Hive table definitions, if it will work?
>
> Thanks again,
>
> James.
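A rough sketch of the wrapper approach James describes above, extending the MyThriftServer example from earlier in the thread. The base path, the one-directory-per-table layout, and the table naming are illustrative assumptions; the calls used (parquetFile, registerTempTable, HiveThriftServer2.startWithContext) are the Spark 1.3 APIs discussed in this thread:

    import java.io.File

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    object ParquetThriftServer extends App {
      // master is supplied by spark-submit, as in MyThriftServer above
      val sc = new SparkContext(new SparkConf().setAppName("Parquet ThriftServer"))
      val sqlContext = new HiveContext(sc)

      // Assumed layout: one sub-directory per "table" under a base path,
      // e.g. /data/parquet/<tableName>/part-*.parquet
      val baseDir = new File("/data/parquet")
      val tableDirs = Option(baseDir.listFiles).getOrElse(Array.empty[File]).filter(_.isDirectory)

      for (dir <- tableDirs) {
        // Schema is inferred from the Parquet footers, no Hive DDL needed
        val df = sqlContext.parquetFile(dir.getAbsolutePath)
        // Temp tables are visible only to this context, which is why the
        // server below must be started with the same context
        df.registerTempTable(dir.getName)
      }

      // Expose the registered tables over the Thrift JDBC/ODBC interface
      HiveThriftServer2.startWithContext(sqlContext)
    }

Restarting this app (or re-running the registration loop against a fresh context) would pick up newly added directories, which matches the trade-off Michael describes above.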
> On 7 April 2015 at 19:30, Michael Armbrust <mich...@databricks.com> wrote:
>
> 1) What exactly is the relationship between the thrift server and Hive?
> I'm guessing Spark is just making use of the Hive metastore to access table
> definitions, and maybe some other things, is that the case?
>
> Underneath the covers, the Spark SQL thrift server is executing queries
> using a HiveContext. In this mode, nearly all computation is done with
> Spark SQL, but we try to maintain compatibility with Hive wherever
> possible. This means that you can write your queries in HiveQL, read
> tables from the Hive metastore, and use Hive UDFs, UDAFs, UDTFs, etc.
>
> The one exception here is Hive DDL operations (CREATE TABLE, etc). These
> are passed directly to Hive code and executed there. The Spark SQL DDL is
> sufficiently different that we always try to parse that first, and fall
> back to Hive when it does not parse.
>
> One possibly confusing point here is that you can persist Spark SQL
> tables into the Hive metastore, but this is not the same as a Hive table.
> We only use the metastore as a repository for metadata, and are not using
> Hive's format for the information in this case (as we have data sources
> that Hive does not understand, including things like schema auto-discovery).
>
> HiveQL DDL, run by Hive but can be read by Spark SQL:
> CREATE TABLE t (x INT) STORED AS PARQUET
>
> Spark SQL DDL, run by Spark SQL, stored in the metastore, cannot be read
> by Hive:
> CREATE TABLE t USING parquet (path '/path/to/data')
>
> 2) Am I therefore right in thinking that SQL queries sent to the thrift
> server are still executed on the Spark cluster, using Spark SQL, and Hive
> plays no active part in computation of results?
>
> Correct.
>
> 3) What SQL flavour is actually supported by the Thrift Server? Is it
> Spark SQL, Hive, or both? I'm confused, because I've seen it accepting
> Hive CREATE TABLE syntax, but Spark SQL seems to work too?
>
> HiveQL++ (with Spark SQL DDL). You can make it use our simple SQL parser
> with `SET spark.sql.dialect=sql`, but honestly you probably don't want to
> do this. The included SQL parser is mostly there for people who have
> dependency conflicts with Hive.
>
> 4) When I run SQL queries using the Scala or Python shells, Spark seems
> to figure out the schema by itself from my Parquet files very well, if I
> use registerTempTable on the DataFrame. It seems when running the thrift
> server, I need to create a Hive table definition first? Is that the case,
> or did I miss something? If it is, is there some sensible way to automate
> this?
>
> Temporary tables are only visible to the SQLContext that creates them. If
> you want them to be visible to the server, you need to either start the
> thrift server with the same context your program is using
> (see HiveThriftServer2.startWithContext) or make a metastore table. This
> can be done using Spark SQL DDL:
>
> CREATE TABLE t USING parquet (path '/path/to/data')
>
> Michael
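Once a server started this way is running, the registered or metastore-backed tables can be checked over JDBC with the beeline client that ships with Spark. A minimal sketch, assuming the server is listening on the default port 10000 and that a table named t exists:

    $ ./bin/beeline -u jdbc:hive2://localhost:10000
    0: jdbc:hive2://localhost:10000> SHOW TABLES;
    0: jdbc:hive2://localhost:10000> SELECT * FROM t LIMIT 10;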