Hi Mohammed,

Sorry, I guess I was not really clear in my response.  Yes, sbt fails; the
-DskipTests flag is for mvn, as shown in the example of how I built it.

I do not believe that -DskipTests has any effect on sbt, but I could be
wrong; sbt package should skip tests anyway.  I did not try to track down
where the dependency was coming from.  Based on Patrick's comments, it
sounds like this is now resolved.

Sorry for the confusion.

-Todd

On Wed, Apr 8, 2015 at 4:38 PM, Todd Nist <tsind...@gmail.com> wrote:

> Hi Mohammed,
>
> I think you just need to add -DskipTests to your build.  Here is how I
> built it:
>
> mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver
> -DskipTests clean package install
>
> build/sbt does, however, fail even when only running package, which should
> skip tests.
>
> I am able to build the "MyThriftServer" above now.
>
> Thanks Michael for the assistance.
>
> -Todd
>
> On Wed, Apr 8, 2015 at 3:39 PM, Mohammed Guller <moham...@glassbeam.com>
> wrote:
>
>>  Michael,
>>
>> Thank you!
>>
>>
>>
>> Looks like the sbt build is broken for 1.3. I downloaded the source code
>> for 1.3, but I get the following error a few minutes after I run “sbt/sbt
>> publishLocal”
>>
>>
>>
>> [error] (network-shuffle/*:update) sbt.ResolveException: unresolved
>> dependency: org.apache.spark#spark-network-common_2.10;1.3.0: configuration
>> not public in org.apache.spark#spark-network-common_2.10;1.3.0: 'test'. It
>> was required from org.apache.spark#spark-network-shuffle_2.10;1.3.0 test
>>
>> [error] Total time: 106 s, completed Apr 8, 2015 12:33:45 PM
>>
>>
>>
>> Mohammed
>>
>>
>>
>> *From:* Michael Armbrust [mailto:mich...@databricks.com]
>> *Sent:* Wednesday, April 8, 2015 11:54 AM
>> *To:* Mohammed Guller
>> *Cc:* Todd Nist; James Aley; user; Patrick Wendell
>>
>> *Subject:* Re: Advice using Spark SQL and Thrift JDBC Server
>>
>>
>>
>> Sorry guys.  I didn't realize that
>> https://issues.apache.org/jira/browse/SPARK-4925 was not fixed yet.
>>
>>
>>
>> You can publish locally in the meantime (sbt/sbt publishLocal).
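>>
>> Once that completes, the build can point at the locally published
>> artifacts.  A minimal sketch of the sbt dependency line (assuming the
>> 1.3.0 version built above):
>>
>>     libraryDependencies += "org.apache.spark" %% "spark-hive-thriftserver" % "1.3.0"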
>>
>>
>>
>> On Wed, Apr 8, 2015 at 8:29 AM, Mohammed Guller <moham...@glassbeam.com>
>> wrote:
>>
>> +1
>>
>>
>>
>> Interestingly, I ran into exactly the same issue yesterday.  I couldn't
>> find any documentation about which project to include as a dependency in
>> build.sbt in order to use HiveThriftServer2.  I would appreciate any help.
>>
>>
>>
>> Mohammed
>>
>>
>>
>> *From:* Todd Nist [mailto:tsind...@gmail.com]
>> *Sent:* Wednesday, April 8, 2015 5:49 AM
>> *To:* James Aley
>> *Cc:* Michael Armbrust; user
>> *Subject:* Re: Advice using Spark SQL and Thrift JDBC Server
>>
>>
>>
>> To use HiveThriftServer2.startWithContext, I thought one would use the
>> following artifact in the build:
>>
>>
>>
>> "org.apache.spark"    %% "spark-hive-thriftserver"   % "1.3.0"
>>
>>
>>
>> But I am unable to resolve the artifact; I do not see it in Maven Central
>> or any other repo.  Do I need to build Spark and publish it locally, or am
>> I just missing something obvious here?
>>
>>
>>
>> Basic class is like this:
>>
>>
>>
>> import org.apache.spark.{SparkConf, SparkContext}
>> import org.apache.spark.sql.hive.HiveContext
>> import org.apache.spark.sql.hive.HiveMetastoreTypes._
>> import org.apache.spark.sql.types._
>> import org.apache.spark.sql.hive.thriftserver._
>>
>> object MyThriftServer {
>>
>>   val sparkConf = new SparkConf()
>>     // master is passed to spark-submit, but could also be specified explicitly
>>     // .setMaster(sparkMaster)
>>     .setAppName("My ThriftServer")
>>     .set("spark.cores.max", "2")
>>
>>   val sc = new SparkContext(sparkConf)
>>   val sparkContext = sc
>>   import sparkContext._
>>
>>   val sqlContext = new HiveContext(sparkContext)
>>   import sqlContext._
>>   import sqlContext.implicits._
>>
>>   // register temp tables here
>>   HiveThriftServer2.startWithContext(sqlContext)
>> }
>>
>>  Build has the following:
>>
>>
>>
>> scalaVersion := "2.10.4"
>>
>>
>>
>> val SPARK_VERSION = "1.3.0"
>>
>>
>>
>>
>>
>> libraryDependencies ++= Seq(
>>
>>     "org.apache.spark" %% "spark-streaming-kafka" % SPARK_VERSION
>>
>>       exclude("org.apache.spark", "spark-core_2.10")
>>
>>       exclude("org.apache.spark", "spark-streaming_2.10")
>>
>>       exclude("org.apache.spark", "spark-sql_2.10")
>>
>>       exclude("javax.jms", "jms"),
>>
>>     "org.apache.spark" %% "spark-core"  % SPARK_VERSION %  "provided",
>>
>>     "org.apache.spark" %% "spark-streaming" % SPARK_VERSION %  "provided",
>>
>>     "org.apache.spark"  %% "spark-sql"  % SPARK_VERSION % "provided",
>>
>>     "org.apache.spark"  %% "spark-hive" % SPARK_VERSION % "provided",
>>
>>     "org.apache.spark" %% "spark-hive-thriftserver"  % SPARK_VERSION   %
>> "provided",
>>
>>     "org.apache.kafka" %% "kafka" % "0.8.1.1"
>>
>>       exclude("javax.jms", "jms")
>>
>>       exclude("com.sun.jdmk", "jmxtools")
>>
>>       exclude("com.sun.jmx", "jmxri"),
>>
>>     "joda-time" % "joda-time" % "2.7",
>>
>>     "log4j" % "log4j" % "1.2.14"
>>
>>       exclude("com.sun.jdmk", "jmxtools")
>>
>>       exclude("com.sun.jmx", "jmxri")
>>
>>   )
>>
>>
>>
>> Appreciate the assistance.
>>
>>
>>
>> -Todd
>>
>>
>>
>> On Tue, Apr 7, 2015 at 4:09 PM, James Aley <james.a...@swiftkey.com>
>> wrote:
>>
>> Excellent, thanks for your help, I appreciate your advice!
>>
>> On 7 Apr 2015 20:43, "Michael Armbrust" <mich...@databricks.com> wrote:
>>
>> That should totally work.  The other option would be to run a persistent
>> metastore that multiple contexts can talk to and periodically run a job
>> that creates missing tables.  The trade-off here would be more complexity,
>> but less downtime due to the server restarting.
>>
>>
>>
>> On Tue, Apr 7, 2015 at 12:34 PM, James Aley <james.a...@swiftkey.com>
>> wrote:
>>
>> Hi Michael,
>>
>>
>>
>> Thanks so much for the reply - that really cleared a lot of things up for
>> me!
>>
>>
>>
>> Let me just check that I've interpreted one of your suggestions for (4)
>> correctly... Would it make sense for me to write a small wrapper app that
>> pulls in hive-thriftserver as a dependency, iterates over my Parquet
>> directory structure to discover "tables", and registers each as a temp
>> table in some context, before calling HiveThriftServer2.startWithContext
>> as you suggest?
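>>
>> Roughly what I have in mind is sketched below (just a sketch: the object
>> name, root path, and directory layout are made up for illustration, and
>> there is no error handling):
>>
>> import org.apache.hadoop.fs.{FileSystem, Path}
>> import org.apache.spark.{SparkConf, SparkContext}
>> import org.apache.spark.sql.hive.HiveContext
>> import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
>>
>> object ParquetTableServer {
>>   def main(args: Array[String]): Unit = {
>>     val sc = new SparkContext(new SparkConf().setAppName("ParquetTableServer"))
>>     val sqlContext = new HiveContext(sc)
>>
>>     // Assumed layout: every sub-directory of this root holds one Parquet "table".
>>     val root = new Path("/data/parquet")
>>     val fs = FileSystem.get(sc.hadoopConfiguration)
>>
>>     fs.listStatus(root).filter(_.isDirectory).foreach { dir =>
>>       // Register each directory as a temp table named after the directory.
>>       sqlContext.parquetFile(dir.getPath.toString).registerTempTable(dir.getPath.getName)
>>     }
>>
>>     // Expose the registered temp tables over JDBC/ODBC via the same context.
>>     HiveThriftServer2.startWithContext(sqlContext)
>>   }
>> }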
>>
>>
>>
>> This would mean that to add new content, all I need to do is restart that
>> app, and even that could presumably be avoided fairly trivially by
>> periodically restarting the server with a new context internally.  That
>> certainly beats manual curation of Hive table definitions, if it will work?
>>
>>
>>
>>
>>
>> Thanks again,
>>
>>
>>
>> James.
>>
>>
>>
>> On 7 April 2015 at 19:30, Michael Armbrust <mich...@databricks.com>
>> wrote:
>>
>>  1) What exactly is the relationship between the thrift server and Hive?
>> I'm guessing Spark is just making use of the Hive metastore to access table
>> definitions, and maybe some other things, is that the case?
>>
>>
>>
>> Underneath the covers, the Spark SQL thrift server executes queries using
>> a HiveContext.  In this mode, nearly all computation is done with Spark
>> SQL, but we try to maintain compatibility with Hive wherever possible.
>> This means that you can write your queries in HiveQL, read tables from the
>> Hive metastore, and use Hive UDFs, UDTFs, UDAFs, etc.
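>>
>> As a tiny illustration (a sketch; the table and column names here are
>> hypothetical), the same kind of HiveQL, including a built-in Hive UDAF,
>> runs directly against such a context:
>>
>> // percentile() is a Hive UDAF, resolved through the HiveContext.
>> val medians = sqlContext.sql(
>>   "SELECT name, percentile(age, 0.5) FROM people GROUP BY name")
>> medians.show()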
>>
>>
>>
>> The one exception here is Hive DDL operations (CREATE TABLE, etc).  These
>> are passed directly to Hive code and executed there.  The Spark SQL DDL is
>> sufficiently different that we always try to parse that first, and fall
>> back to Hive when it does not parse.
>>
>>
>>
>> One possibly confusing point here is that you can persist Spark SQL tables
>> into the Hive metastore, but this is not the same as a Hive table.  We only
>> use the metastore as a repository for metadata, and are not using Hive's
>> format for the data in this case (as we have data sources that Hive does
>> not understand, including things like schema auto-discovery).
>>
>>
>>
>> HiveQL DDL, run by Hive but readable by Spark SQL: CREATE TABLE t (x INT)
>> STORED AS PARQUET
>>
>> Spark SQL DDL, run by Spark SQL and stored in the metastore, but cannot be
>> read by Hive: CREATE TABLE t USING parquet (path '/path/to/data')
>>
>>
>>
>>  2) Am I therefore right in thinking that SQL queries sent to the thrift
>> server are still executed on the Spark cluster, using Spark SQL, and Hive
>> plays no active part in computation of results?
>>
>>
>>
>> Correct.
>>
>>
>>
>>  3) What SQL flavour is actually supported by the Thrift Server? Is it
>> Spark SQL, Hive, or both? I'm confused, because I've seen it accepting
>> Hive CREATE TABLE syntax, but Spark SQL seems to work too?
>>
>>
>>
>> HiveQL++ (with Spark SQL DDL).  You can make it use our simple SQL parser
>> by `SET spark.sql.dialect=sql`, but honestly you probably don't want to do
>> this.  The included SQL parser is mostly there for people who have
>> dependency conflicts with Hive.
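>>
>> For completeness, a sketch of switching the dialect from a program (the
>> setting is the one mentioned above; switching back uses the hiveql value):
>>
>> // Switch to the simple SQL parser (usually not recommended)...
>> sqlContext.sql("SET spark.sql.dialect=sql")
>> // ...and back to HiveQL.
>> sqlContext.sql("SET spark.sql.dialect=hiveql")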
>>
>>
>>
>>  4) When I run SQL queries using the Scala or Python shells, Spark seems
>> to figure out the schema by itself from my Parquet files very well, if I
>> use registerTempTable on the DataFrame.  It seems that when running the
>> thrift server, I need to create a Hive table definition first? Is that the
>> case, or did I miss something? If it is, is there some sensible way to
>> automate this?
>>
>>
>>
>> Temporary tables are only visible to the SQLContext that creates them.
>> If you want them to be visible to the server, you need to either start the
>> thrift server with the same context your program is using
>> (see HiveThriftServer2.startWithContext) or make a metastore table.  The
>> latter can be done using Spark SQL DDL:
>>
>>
>>
>> CREATE TABLE t USING parquet (path '/path/to/data')
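>>
>> From a program, that would look roughly like the following (a sketch; the
>> path is a placeholder, and I have included the OPTIONS keyword, which I
>> believe the 1.3 data source DDL parser expects before the parenthesized
>> clause):
>>
>> sqlContext.sql("CREATE TABLE t USING parquet OPTIONS (path '/path/to/data')")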
>>
>>
>>
>> Michael
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>
