Hi Mohammed,

Sorry, I guess I was not really clear in my response. Yes, sbt fails; the -DskipTests is for mvn, as I showed in the example of how I built it.
I do not believe that -DskipTests has any impact in sbt, but I could be wrong; sbt package should skip tests. I did not try to track down where the dependency was coming from. Based on Patrick's comments it sounds like this is now resolved. Sorry for the confusion.

-Todd

On Wed, Apr 8, 2015 at 4:38 PM, Todd Nist <tsind...@gmail.com> wrote:

> Hi Mohammed,
>
> I think you just need to add -DskipTests to your build. Here is how I built it:
>
> mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests clean package install
>
> build/sbt does, however, fail even if only doing package, which should skip tests.
>
> I am able to build the "MyThriftServer" above now.
>
> Thanks Michael for the assistance.
>
> -Todd
>
> On Wed, Apr 8, 2015 at 3:39 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
>
>> Michael,
>>
>> Thank you!
>>
>> Looks like the sbt build is broken for 1.3. I downloaded the source code for 1.3, but I get the following error a few minutes after I run "sbt/sbt publishLocal":
>>
>> [error] (network-shuffle/*:update) sbt.ResolveException: unresolved dependency: org.apache.spark#spark-network-common_2.10;1.3.0: configuration not public in org.apache.spark#spark-network-common_2.10;1.3.0: 'test'. It was required from org.apache.spark#spark-network-shuffle_2.10;1.3.0 test
>> [error] Total time: 106 s, completed Apr 8, 2015 12:33:45 PM
>>
>> Mohammed
>>
>> *From:* Michael Armbrust [mailto:mich...@databricks.com]
>> *Sent:* Wednesday, April 8, 2015 11:54 AM
>> *To:* Mohammed Guller
>> *Cc:* Todd Nist; James Aley; user; Patrick Wendell
>> *Subject:* Re: Advice using Spark SQL and Thrift JDBC Server
>>
>> Sorry guys. I didn't realize that https://issues.apache.org/jira/browse/SPARK-4925 was not fixed yet.
>>
>> You can publish locally in the meantime (sbt/sbt publishLocal).
>>
>> On Wed, Apr 8, 2015 at 8:29 AM, Mohammed Guller <moham...@glassbeam.com> wrote:
>>
>> +1
>>
>> Interestingly, I ran into exactly the same issue yesterday. I couldn't find any documentation about which project to include as a dependency in build.sbt to use HiveThriftServer2. Would appreciate help.
>>
>> Mohammed
>>
>> *From:* Todd Nist [mailto:tsind...@gmail.com]
>> *Sent:* Wednesday, April 8, 2015 5:49 AM
>> *To:* James Aley
>> *Cc:* Michael Armbrust; user
>> *Subject:* Re: Advice using Spark SQL and Thrift JDBC Server
>>
>> To use HiveThriftServer2.startWithContext, I thought one would use the following artifact in the build:
>>
>> "org.apache.spark" %% "spark-hive-thriftserver" % "1.3.0"
>>
>> But I am unable to resolve the artifact. I do not see it in Maven Central or any other repo. Do I need to build Spark and publish locally, or am I just missing something obvious here?
>> Basic class is like this:
>>
>> import org.apache.spark.{SparkConf, SparkContext}
>> import org.apache.spark.sql.hive.HiveContext
>> import org.apache.spark.sql.hive.HiveMetastoreTypes._
>> import org.apache.spark.sql.types._
>> import org.apache.spark.sql.hive.thriftserver._
>>
>> object MyThriftServer {
>>
>>   val sparkConf = new SparkConf()
>>     // master is passed to spark-submit, but could also be specified explicitly
>>     // .setMaster(sparkMaster)
>>     .setAppName("My ThriftServer")
>>     .set("spark.cores.max", "2")
>>   val sc = new SparkContext(sparkConf)
>>   val sparkContext = sc
>>   import sparkContext._
>>   val sqlContext = new HiveContext(sparkContext)
>>   import sqlContext._
>>   import sqlContext.implicits._
>>
>>   // register temp tables here
>>   HiveThriftServer2.startWithContext(sqlContext)
>> }
>>
>> Build has the following:
>>
>> scalaVersion := "2.10.4"
>>
>> val SPARK_VERSION = "1.3.0"
>>
>> libraryDependencies ++= Seq(
>>   "org.apache.spark" %% "spark-streaming-kafka" % SPARK_VERSION
>>     exclude("org.apache.spark", "spark-core_2.10")
>>     exclude("org.apache.spark", "spark-streaming_2.10")
>>     exclude("org.apache.spark", "spark-sql_2.10")
>>     exclude("javax.jms", "jms"),
>>   "org.apache.spark" %% "spark-core" % SPARK_VERSION % "provided",
>>   "org.apache.spark" %% "spark-streaming" % SPARK_VERSION % "provided",
>>   "org.apache.spark" %% "spark-sql" % SPARK_VERSION % "provided",
>>   "org.apache.spark" %% "spark-hive" % SPARK_VERSION % "provided",
>>   "org.apache.spark" %% "spark-hive-thriftserver" % SPARK_VERSION % "provided",
>>   "org.apache.kafka" %% "kafka" % "0.8.1.1"
>>     exclude("javax.jms", "jms")
>>     exclude("com.sun.jdmk", "jmxtools")
>>     exclude("com.sun.jmx", "jmxri"),
>>   "joda-time" % "joda-time" % "2.7",
>>   "log4j" % "log4j" % "1.2.14"
>>     exclude("com.sun.jdmk", "jmxtools")
>>     exclude("com.sun.jmx", "jmxri")
>> )
>>
>> Appreciate the assistance.
>>
>> -Todd
>>
>> On Tue, Apr 7, 2015 at 4:09 PM, James Aley <james.a...@swiftkey.com> wrote:
>>
>> Excellent, thanks for your help, I appreciate your advice!
>>
>> On 7 Apr 2015 20:43, "Michael Armbrust" <mich...@databricks.com> wrote:
>>
>> That should totally work. The other option would be to run a persistent metastore that multiple contexts can talk to and periodically run a job that creates missing tables. The trade-off here would be more complexity, but less downtime due to the server restarting.
>>
>> On Tue, Apr 7, 2015 at 12:34 PM, James Aley <james.a...@swiftkey.com> wrote:
>>
>> Hi Michael,
>>
>> Thanks so much for the reply - that really cleared a lot of things up for me!
>>
>> Let me just check that I've interpreted one of your suggestions for (4) correctly... Would it make sense for me to write a small wrapper app that pulls in hive-thriftserver as a dependency, iterates my Parquet directory structure to discover "tables", and registers each as a temp table in some context, before calling HiveThriftServer2.startWithContext as you suggest?
>>
>> This would mean that to add new content, all I need to do is restart that app, which presumably could also be avoided fairly trivially by periodically restarting the server with a new context internally. That certainly beats manual curation of Hive table definitions, if it will work?
>>
>> Thanks again,
>>
>> James.
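For concreteness, a rough sketch of that wrapper idea, assuming one Parquet dataset per sub-directory under a single base path (the base path, object name, and table-naming scheme below are purely illustrative):

import java.io.File

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object ParquetTableServer {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Parquet Thrift Server"))
    val sqlContext = new HiveContext(sc)

    // Assumed layout: one "table" per sub-directory under a single base path,
    // e.g. /data/tables/events, /data/tables/users, ...
    val baseDir = new File("/data/tables")
    for (dir <- Option(baseDir.listFiles()).getOrElse(Array.empty[File]) if dir.isDirectory) {
      // parquetFile() infers the schema from the Parquet files themselves (Spark 1.3 API).
      sqlContext.parquetFile(dir.getAbsolutePath).registerTempTable(dir.getName)
    }

    // Expose the temp tables registered above over JDBC/ODBC.
    HiveThriftServer2.startWithContext(sqlContext)
  }
}

Restarting the app (or rebuilding the context on a schedule) would then pick up any newly added directories.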
>> On 7 April 2015 at 19:30, Michael Armbrust <mich...@databricks.com> wrote:
>>
>> 1) What exactly is the relationship between the thrift server and Hive? I'm guessing Spark is just making use of the Hive metastore to access table definitions, and maybe some other things, is that the case?
>>
>> Underneath the covers, the Spark SQL thrift server is executing queries using a HiveContext. In this mode, nearly all computation is done with Spark SQL, but we try to maintain compatibility with Hive wherever possible. This means that you can write your queries in HiveQL, read tables from the Hive metastore, and use Hive UDFs, UDTFs, UDAFs, etc.
>>
>> The one exception here is Hive DDL operations (CREATE TABLE, etc.). These are passed directly to Hive code and executed there. The Spark SQL DDL is sufficiently different that we always try to parse that first, and fall back to Hive when it does not parse.
>>
>> One possibly confusing point here is that you can persist Spark SQL tables into the Hive metastore, but this is not the same as a Hive table. We only use the metastore as a repository for metadata, and are not using Hive's format for the data in this case (as we have data sources that Hive does not understand, including things like schema auto-discovery).
>>
>> HiveQL DDL (run by Hive, but can be read by Spark SQL): CREATE TABLE t (x INT) STORED AS PARQUET
>>
>> Spark SQL DDL (run by Spark SQL, stored in the metastore, cannot be read by Hive): CREATE TABLE t USING parquet (path '/path/to/data')
>>
>> 2) Am I therefore right in thinking that SQL queries sent to the thrift server are still executed on the Spark cluster, using Spark SQL, and Hive plays no active part in computation of results?
>>
>> Correct.
>>
>> 3) What SQL flavour is actually supported by the Thrift Server? Is it Spark SQL, Hive, or both? I'm confused, because I've seen it accepting Hive CREATE TABLE syntax, but Spark SQL seems to work too?
>>
>> HiveQL++ (with Spark SQL DDL). You can make it use our simple SQL parser by `SET spark.sql.dialect=sql`, but honestly you probably don't want to do this. The included SQL parser is mostly there for people who have dependency conflicts with Hive.
>>
>> 4) When I run SQL queries using the Scala or Python shells, Spark seems to figure out the schema by itself from my Parquet files very well, if I use registerTempTable on the DataFrame. It seems when running the thrift server, I need to create a Hive table definition first? Is that the case, or did I miss something? If it is, is there some sensible way to automate this?
>>
>> Temporary tables are only visible to the SQLContext that creates them. If you want a table to be visible to the server, you need to either start the thrift server with the same context your program is using (see HiveThriftServer2.startWithContext) or make a metastore table. This can be done using Spark SQL DDL:
>>
>> CREATE TABLE t USING parquet (path '/path/to/data')
>>
>> Michael
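To make that last route concrete, a minimal sketch of issuing the data source DDL from a HiveContext, so the table definition is persisted in the metastore and becomes queryable from a thrift server pointed at the same metastore (the table name and path are made up, and the OPTIONS keyword is included since the data source DDL generally expects it):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object RegisterParquetTable {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Register Parquet table"))
    val sqlContext = new HiveContext(sc)

    // Persist a data-source table definition in the Hive metastore. Any HiveContext
    // that uses the same metastore (including the thrift server) can then query the
    // table without a restart. Path and table name are illustrative only.
    sqlContext.sql(
      "CREATE TABLE events USING parquet OPTIONS (path '/path/to/data/events')")
  }
}

After that, the table should show up in SHOW TABLES from beeline like any other metastore-backed table.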