Hi Silvio, I got this working today using your suggestion with the "Initial SQL" and a "Custom Query". See here for details:

http://stackoverflow.com/questions/28403664/connect-to-existing-hive-in-intellij-using-sbt-as-build/28608608#28608608

It is not ideal, as I need to write a custom query, but it does work for now. I also have it working by doing a saveAsTable on the ingested data, which stores the reference in the metastore for access via the thrift server.
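In case it is useful to anyone following along, the saveAsTable route is only a few lines. Roughly something like this (a sketch of what I did, not a verified snippet; the path and table name are just examples from my setup):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Ingest the JSON files as a SchemaRDD.
val events = hiveContext.jsonFile("/data/json")

// Persist it as a managed table. This registers the table in the metastore,
// so the thrift server (and therefore Tableau) can discover it in the schema.
events.saveAsTable("events")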
Thanks for the help.

-Todd

On Wed, Feb 11, 2015 at 8:41 PM, Silvio Fiorito <silvio.fior...@granturing.com> wrote:

> Hey Todd,
>
> I don't have an app to test against the thrift server; are you able to
> define custom SQL without using Tableau's schema query? I guess it's not
> possible to just use SparkSQL temp tables; you may have to use permanent
> Hive tables that are actually in the metastore so Tableau can discover them
> in the schema. In that case you will either have to generate the Hive
> tables externally from Spark, or use Spark to process the data and save
> them using a HiveContext.
>
>
> From: Todd Nist
> Date: Wednesday, February 11, 2015 at 7:53 PM
> To: Andrew Lee
> Cc: Arush Kharbanda, "user@spark.apache.org"
> Subject: Re: SparkSQL + Tableau Connector
>
> First, sorry for the long post. So, back to Tableau and Spark SQL: I'm
> still missing something.
>
> TL;DR
>
> To get a Spark SQL temp table associated with the metastore, are there
> additional steps required beyond doing the below?
>
> Initial SQL on connection:
>
> create temporary table test
> using org.apache.spark.sql.json
> options (path '/data/json/*');
>
> cache table test;
>
> I feel like I'm missing a step of associating the Spark SQL table with
> the metastore; do I need to actually save it in some fashion? I'm trying
> to avoid saving to Hive if possible.
>
> *Details:*
>
> I configured the hive-site.xml and placed it in $SPARK_HOME/conf.
> It looks like this, thanks Andrew and Arush for the assistance: > > <?xml version="1.0"?> > <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> > > <configuration> > <property> > <name>hive.semantic.analyzer.factory.impl</name> > <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value> > </property> > > <property> > <name>hive.metastore.sasl.enabled</name> > <value>false</value> > </property> > > <property> > <name>hive.server2.authentication</name> > <value>NONE</value> > </property> > > <property> > <name>hive.server2.enable.doAs</name> > <value>true</value> > </property> > > <!-- > <property> > <name>hive.metastore.uris</name> > <value>thrift://localhost:9083</value> > <description>IP address (or fully-qualified domain name) and port of > the metastore host</description> > </property> > --> > > <property> > <name>hive.warehouse.subdir.inherit.perms</name> > <value>true</value> > </property> > > <property> > <name>hive.metastore.schema.verification</name> > <value>false</value> > </property> > > <property> > <name>javax.jdo.option.ConnectionURL</name> > > <value>jdbc:mysql://localhost:3306/metastore_db?createDatabaseIfNotExist=true</value> > <description>metadata is stored in a MySQL server</description> > </property> > > <property> > <name>javax.jdo.option.ConnectionDriverName</name> > <value>com.mysql.jdbc.Driver</value> > <description>MySQL JDBC driver class</description> > </property> > > <property> > <name>javax.jdo.option.ConnectionUserName</name> > <value>hiveuser</value> > </property> > > <property> > <name>javax.jdo.option.ConnectionPassword</name> > <value>hiveuser</value> > </property> > > </configuration> > > When I start the server it looks fine: > > >$ ./sbin/start-thriftserver.sh --hiveconf > hive.server2.thrift.port=10001 --hiveconf hive.server2.thrift.bind.host > radtech.io --master spark://radtech.io:7077 --driver-class-path > /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar > starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging > to > /usr/local/spark-1.2.1-bin-hadoop2.4/logs/spark-tnist-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-radtech.io.out > radtech:spark tnist$ tail -f > logs/spark-tnist-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-radtech.io.out > 15/02/11 19:15:24 INFO SparkDeploySchedulerBackend: Granted executor ID > app-20150211191524-0008/1 on hostPort 192.168.1.2:50851 with 2 cores, > 512.0 MB RAM > 15/02/11 19:15:24 INFO AppClient$ClientActor: Executor updated: > app-20150211191524-0008/0 is now LOADING > 15/02/11 19:15:24 INFO AppClient$ClientActor: Executor updated: > app-20150211191524-0008/1 is now LOADING > 15/02/11 19:15:24 INFO AppClient$ClientActor: Executor updated: > app-20150211191524-0008/0 is now RUNNING > 15/02/11 19:15:24 INFO AppClient$ClientActor: Executor updated: > app-20150211191524-0008/1 is now RUNNING > 15/02/11 19:15:24 INFO NettyBlockTransferService: Server created on 50938 > 15/02/11 19:15:24 INFO BlockManagerMaster: Trying to register BlockManager > 15/02/11 19:15:24 INFO BlockManagerMasterActor: Registering block manager > 192.168.1.2:50938 with 265.1 MB RAM, BlockManagerId(<driver>, > 192.168.1.2, 50938) > 15/02/11 19:15:24 INFO BlockManagerMaster: Registered BlockManager > 15/02/11 19:15:25 INFO SparkDeploySchedulerBackend: SchedulerBackend is > ready for scheduling beginning after reached minRegisteredResourcesRatio: > 0.0 > 15/02/11 19:15:25 INFO HiveMetaStore: 0: Opening raw store with > implemenation 
class:org.apache.hadoop.hive.metastore.ObjectStore > 15/02/11 19:15:25 INFO ObjectStore: ObjectStore, initialize called > 15/02/11 19:15:26 INFO Persistence: Property > hive.metastore.integral.jdo.pushdown unknown - will be ignored > 15/02/11 19:15:26 INFO Persistence: Property datanucleus.cache.level2 > unknown - will be ignored > 15/02/11 19:15:26 WARN Connection: BoneCP specified but not present in > CLASSPATH (or one of dependencies) > 15/02/11 19:15:26 WARN Connection: BoneCP specified but not present in > CLASSPATH (or one of dependencies) > 15/02/11 19:15:27 INFO ObjectStore: Setting MetaStore object pin classes > with > hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order" > 15/02/11 19:15:28 INFO SparkDeploySchedulerBackend: Registered executor: > Actor[akka.tcp://sparkExecutor@192.168.1.2:50944/user/Executor#1008909571] > with ID 0 > 15/02/11 19:15:28 INFO Datastore: The class > "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as > "embedded-only" so does not have its own datastore table. > 15/02/11 19:15:28 INFO Datastore: The class > "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as > "embedded-only" so does not have its own datastore table. > 15/02/11 19:15:28 INFO SparkDeploySchedulerBackend: Registered executor: > Actor[akka.tcp://sparkExecutor@192.168.1.2:50948/user/Executor#-688434541] > with ID 1 > 15/02/11 19:15:28 INFO Datastore: The class > "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as > "embedded-only" so does not have its own datastore table. > 15/02/11 19:15:28 INFO Datastore: The class > "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as > "embedded-only" so does not have its own datastore table. > 15/02/11 19:15:28 INFO Query: Reading in results for query > "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used > is closing > 15/02/11 19:15:28 INFO ObjectStore: Initialized ObjectStore > 15/02/11 19:15:28 INFO BlockManagerMasterActor: Registering block manager > 192.168.1.2:50951 with 265.1 MB RAM, BlockManagerId(0, 192.168.1.2, 50951) > 15/02/11 19:15:28 INFO BlockManagerMasterActor: Registering block manager > 192.168.1.2:50952 with 265.1 MB RAM, BlockManagerId(1, 192.168.1.2, 50952) > 15/02/11 19:15:28 INFO HiveMetaStore: Added admin role in metastore > 15/02/11 19:15:28 INFO HiveMetaStore: Added public role in metastore > 15/02/11 19:15:29 INFO HiveMetaStore: No user is added in admin role, > since config is empty > 15/02/11 19:15:29 INFO SessionState: No Tez session required at this > point. hive.execution.engine=mr. > 15/02/11 19:15:29 INFO AbstractService: HiveServer2: Async execution pool > size 100 > 15/02/11 19:15:29 INFO AbstractService: Service:OperationManager is inited. > 15/02/11 19:15:29 INFO AbstractService: Service: SessionManager is inited. > 15/02/11 19:15:29 INFO AbstractService: Service: CLIService is inited. > 15/02/11 19:15:29 INFO AbstractService: Service:ThriftBinaryCLIService is > inited. > 15/02/11 19:15:29 INFO AbstractService: Service: HiveServer2 is inited. > 15/02/11 19:15:29 INFO AbstractService: Service:OperationManager is > started. > 15/02/11 19:15:29 INFO AbstractService: Service:SessionManager is started. > 15/02/11 19:15:29 INFO AbstractService: Service:CLIService is started. 
> 15/02/11 19:15:29 INFO HiveMetaStore: No user is added in admin role, > since config is empty > 15/02/11 19:15:29 INFO HiveMetaStore: 0: get_databases: default > 15/02/11 19:15:29 INFO audit: ugi=tnistip=unknown-ip-addr cmd=get_databases: > default > 15/02/11 19:15:29 INFO HiveMetaStore: 0: Opening raw store with > implemenation class:org.apache.hadoop.hive.metastore.ObjectStore > 15/02/11 19:15:29 INFO ObjectStore: ObjectStore, initialize called > 15/02/11 19:15:29 INFO Query: Reading in results for query > "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used > is closing > 15/02/11 19:15:29 INFO ObjectStore: Initialized ObjectStore > 15/02/11 19:15:29 INFO AbstractService: Service:ThriftBinaryCLIService is > started. > 15/02/11 19:15:29 INFO AbstractService: Service:HiveServer2 is started. > 15/02/11 19:15:29 INFO HiveThriftServer2: HiveThriftServer2 started > 15/02/11 19:15:29 INFO ThriftCLIService: ThriftBinaryCLIService listening > on radtech.io/192.168.1.2:10001 > > When I start Tableau and use the SparkSQL (Beta) connector I see that > the "initial SQL" is being executed: > > 15/02/11 19:25:35 INFO HiveMetaStore: 2: get_database: default > 15/02/11 19:25:35 INFO audit: ugi=anonymous ip=unknown-ip-addr > cmd=get_database: default > 15/02/11 19:25:35 INFO Driver: OK > 15/02/11 19:25:35 INFO SparkExecuteStatementOperation: Running query 'set > -v' > 15/02/11 19:25:35 INFO SparkExecuteStatementOperation: Result Schema: > List(#18) > 15/02/11 19:25:35 INFO SparkExecuteStatementOperation: Result Schema: > List(#21) > 15/02/11 19:25:35 INFO ThriftCLIService: Client protocol version: > HIVE_CLI_SERVICE_PROTOCOL_V6 > 15/02/11 19:25:35 INFO HiveMetaStore: No user is added in admin role, > since config is empty > 15/02/11 19:25:35 INFO SessionState: No Tez session required at this > point. hive.execution.engine=mr. > 15/02/11 19:25:35 INFO SessionState: No Tez session required at this > point. hive.execution.engine=mr. > 15/02/11 19:25:35 INFO SparkExecuteStatementOperation: Running query 'use > `default`' > 15/02/11 19:25:35 INFO ParseDriver: Parsing command: use `default` > 15/02/11 19:25:35 INFO ParseDriver: Parse Completed > 15/02/11 19:25:35 INFO SessionState: No Tez session required at this > point. hive.execution.engine=mr. 
> 15/02/11 19:25:35 INFO Driver: Concurrency mode is disabled, not creating
> a lock manager
> 15/02/11 19:25:35 INFO ParseDriver: Parsing command: use `default`
> 15/02/11 19:25:35 INFO ParseDriver: Parse Completed
> 15/02/11 19:25:35 INFO Driver: Semantic Analysis Completed
> 15/02/11 19:25:35 INFO Driver: Returning Hive schema:
> Schema(fieldSchemas:null, properties:null)
> 15/02/11 19:25:35 INFO Driver: Starting command: use `default`
> 15/02/11 19:25:35 INFO HiveMetaStore: 3: get_database: default
> 15/02/11 19:25:35 INFO audit: ugi=anonymous ip=unknown-ip-addr
> cmd=get_database: default
> 15/02/11 19:25:35 INFO HiveMetaStore: 3: Opening raw store with
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 15/02/11 19:25:35 INFO ObjectStore: ObjectStore, initialize called
> 15/02/11 19:25:36 INFO Query: Reading in results for query
> "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used
> is closing
> 15/02/11 19:25:36 INFO ObjectStore: Initialized ObjectStore
> 15/02/11 19:25:36 INFO HiveMetaStore: 3: get_database: default
> 15/02/11 19:25:36 INFO audit: ugi=anonymous ip=unknown-ip-addr
> cmd=get_database: default
> 15/02/11 19:25:36 INFO Driver: OK
>
> 15/02/11 19:25:36 INFO SparkExecuteStatementOperation: Running query
> 'create temporary table test
> using org.apache.spark.sql.json
> options (path '/data/json/*')'
>
> ....
>
> 15/02/11 19:25:38 INFO Driver: Starting command: use `default`
> 15/02/11 19:25:38 INFO HiveMetaStore: 4: get_database: default
> 15/02/11 19:25:38 INFO audit: ugi=anonymous ip=unknown-ip-addr
> cmd=get_database: default
> 15/02/11 19:25:38 INFO HiveMetaStore: 4: Opening raw store with
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 15/02/11 19:25:38 INFO ObjectStore: ObjectStore, initialize called
> 15/02/11 19:25:38 INFO Query: Reading in results for query
> "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used
> is closing
> 15/02/11 19:25:38 INFO ObjectStore: Initialized ObjectStore
> 15/02/11 19:25:38 INFO HiveMetaStore: 4: get_database: default
> 15/02/11 19:25:38 INFO audit: ugi=anonymous ip=unknown-ip-addr
> cmd=get_database: default
> 15/02/11 19:25:38 INFO Driver: OK
> 15/02/11 19:25:38 INFO SparkExecuteStatementOperation: Running query
> 'cache table test'
> 15/02/11 19:25:38 INFO MemoryStore: ensureFreeSpace(211383) called with
> curMem=101514, maxMem=278019440
> 15/02/11 19:25:38 INFO MemoryStore: Block broadcast_2 stored as values in
> memory (estimated size 206.4 KB, free 264.8 MB)
>
> I see no way in Tableau to see the cached table "test". I think I am
> missing a step of associating the generated temp table from Spark SQL with
> the metastore. Any guidance or insights on what I'm missing here?
>
> Thanks for the assistance.
>
> -Todd
>
> On Wed, Feb 11, 2015 at 3:20 PM, Andrew Lee <alee...@hotmail.com> wrote:
>
>> Sorry folks, it is executing Spark jobs instead of Hive jobs. I
>> misread the logs since there were other activities going on on the cluster.
>>
>> ------------------------------
>> From: alee...@hotmail.com
>> To: ar...@sigmoidanalytics.com; tsind...@gmail.com
>> CC: user@spark.apache.org
>> Subject: RE: SparkSQL + Tableau Connector
>> Date: Wed, 11 Feb 2015 11:56:44 -0800
>>
>> I'm using MySQL as the metastore DB with Spark 1.2. I simply copied the
>> hive-site.xml to /etc/spark/ and added the MySQL JDBC JAR to spark-env.sh
>> in /etc/spark/; everything works fine now.
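>>
>> The spark-env.sh change amounts to one line. Roughly (the jar path is
>> just an example from my machine, and SPARK_CLASSPATH is the old-style
>> setting, so treat this as a sketch rather than the canonical way):
>>
>> # /etc/spark/spark-env.sh
>> # Put the MySQL JDBC driver on the driver and executor classpath so the
>> # Hive metastore client can reach the MySQL-backed metastore.
>> export SPARK_CLASSPATH="$SPARK_CLASSPATH:/usr/share/java/mysql-connector-java-5.1.34-bin.jar"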
>>
>> My setup looks like this:
>>
>> Tableau => Spark ThriftServer2 => HiveServer2
>>
>> It's talking to Tableau Desktop 8.3. Interestingly, when I query a Hive
>> table, it still invokes Hive queries against HiveServer2, which is running
>> the MR or Tez engine. Is this expected?
>>
>> I thought it would at least use the Catalyst engine and talk to the
>> underlying HDFS, like the HiveContext API does, to pull the data into an
>> RDD. Did I misunderstand the purpose of Spark ThriftServer2?
>>
>>
>> ------------------------------
>> Date: Wed, 11 Feb 2015 16:07:40 +0530
>> Subject: Re: SparkSQL + Tableau Connector
>> From: ar...@sigmoidanalytics.com
>> To: tsind...@gmail.com
>> CC: user@spark.apache.org
>>
>> Hi
>>
>> I used this, though it's using an embedded driver and is not a good
>> approach. It works. You can configure some other metastore type as well. I
>> have not tried the metastore URIs.
>>
>> <configuration>
>>
>> <property>
>> <name>javax.jdo.option.ConnectionURL</name>
>> <value>jdbc:derby:;databaseName=/opt/bigdata/spark-1.2.0/metastore_db;create=true</value>
>> <description>URL for the DB</description>
>> </property>
>>
>> <property>
>> <name>javax.jdo.option.ConnectionDriverName</name>
>> <value>org.apache.derby.jdbc.EmbeddedDriver</value>
>> </property>
>>
>> <!-- <property>
>> <name>hive.metastore.uris</name>
>> <value>thrift://x.x.x.x:10000</value>
>> <description>IP address (or fully-qualified domain name) and port of
>> the metastore host</description>
>> </property> -->
>> </configuration>
>>
>> On Wed, Feb 11, 2015 at 3:59 PM, Todd Nist <tsind...@gmail.com> wrote:
>>
>> Hi Arush,
>>
>> So yes, I want to create the tables through Spark SQL. I have placed
>> the hive-site.xml file inside the $SPARK_HOME/conf directory; I thought
>> that was all I should need to do to have the thrift server use it. Perhaps
>> my hive-site.xml is wrong. It currently looks like this:
>>
>> <configuration>
>> <property>
>> <name>hive.metastore.uris</name>
>> <!-- Ensure that the following statement points to the Hive Metastore
>> URI in your cluster -->
>> <value>thrift://sandbox.hortonworks.com:9083</value>
>> <description>URI for client to contact metastore server</description>
>> </property>
>> </configuration>
>>
>> Which leads me to believe it is going to pull from the Hortonworks
>> metastore? I will go look at the docs to see if this is right; it is
>> what Hortonworks says to do. Do you have an example hive-site.xml by
>> chance that works with Spark SQL?
>>
>> I am using Tableau 8.3 with the SparkSQL connector.
>>
>> Thanks for the assistance.
>>
>> -Todd
>>
>> On Wed, Feb 11, 2015 at 2:34 AM, Arush Kharbanda <
>> ar...@sigmoidanalytics.com> wrote:
>>
>> BTW, what Tableau connector are you using?
>>
>> On Wed, Feb 11, 2015 at 12:55 PM, Arush Kharbanda <
>> ar...@sigmoidanalytics.com> wrote:
>>
>> I am a little confused here: why do you want to create the tables in
>> Hive? You want to create the tables in Spark SQL, right?
>>
>> If you are not able to find the same tables through Tableau, then thrift
>> is connecting to a different metastore than your spark-shell.
>>
>> One way to specify a metastore to thrift is to provide the path to
>> hive-site.xml while starting thrift, using --files hive-site.xml.
>>
>> Similarly, you can specify the same metastore to your spark-submit or
>> spark-shell using the same option; see the sketch below.
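>>
>> A rough sketch of both launches (the master URL and file paths here are
>> made up; adapt them to your cluster):
>>
>> # Ship an explicit hive-site.xml with the thrift server.
>> ./sbin/start-thriftserver.sh \
>>   --master spark://master-host:7077 \
>>   --files /path/to/hive-site.xml
>>
>> # Point spark-shell at the same metastore config.
>> ./bin/spark-shell \
>>   --master spark://master-host:7077 \
>>   --files /path/to/hive-site.xml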
>>
>> On Wed, Feb 11, 2015 at 5:23 AM, Todd Nist <tsind...@gmail.com> wrote:
>>
>> Arush,
>>
>> As for #2, do you mean something like this from the docs:
>>
>> // sc is an existing SparkContext.
>> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
>> sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
>> sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
>> // Queries are expressed in HiveQL
>> sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)
>>
>> Or did you have something else in mind?
>>
>> -Todd
>>
>>
>> On Tue, Feb 10, 2015 at 6:35 PM, Todd Nist <tsind...@gmail.com> wrote:
>>
>> Arush,
>>
>> Thank you, I will take a look at that approach in the morning. I sort of
>> figured the answer to #1 was NO and that I would need to do 2 and 3;
>> thanks for clarifying it for me.
>>
>> -Todd
>>
>> On Tue, Feb 10, 2015 at 5:24 PM, Arush Kharbanda <
>> ar...@sigmoidanalytics.com> wrote:
>>
>> 1. Can the connector fetch or query schemaRDDs saved to Parquet or JSON
>> files? NO.
>> 2. Do I need to do something to expose these via hive / metastore other
>> than creating a table in hive? Create a table in Spark SQL to expose it
>> via Spark SQL.
>> 3. Does the thriftserver need to be configured to expose these in some
>> fashion, sort of related to question 2? You would need to configure thrift
>> to read from the metastore you expect it to read from; by default it reads
>> from the metastore_db directory present in the directory used to launch
>> the thrift server.
>>
>> On 11 Feb 2015 01:35, "Todd Nist" <tsind...@gmail.com> wrote:
>>
>> Hi,
>>
>> I'm trying to understand what the Tableau connector to SparkSQL is able
>> to access, and how. My understanding is that it needs to connect to the
>> thrift server, and I am not sure whether it exposes Parquet, JSON, and
>> schemaRDDs, or only schemas defined in the metastore / Hive.
>>
>> For example, I do the following from the spark-shell, which generates a
>> schemaRDD from a csv file and saves it as a JSON file as well as a parquet
>> file.
>>
>> import org.apache.spark.sql.SQLContext
>> import com.databricks.spark.csv._
>>
>> val sqlContext = new SQLContext(sc)
>> val test = sqlContext.csvFile("/data/test.csv")
>> test.toJSON.saveAsTextFile("/data/out/json")
>> test.saveAsParquetFile("/data/out/parquet")
>>
>> When I connect from Tableau, the only thing I see is the "default"
>> schema and nothing in the tables section.
>>
>> So my questions are:
>>
>> 1. Can the connector fetch or query schemaRDDs saved to Parquet or JSON
>> files?
>> 2. Do I need to do something to expose these via hive / metastore other
>> than creating a table in hive?
>> 3. Does the thriftserver need to be configured to expose these in some
>> fashion, sort of related to question 2.
>>
>> TIA for the assistance.
>>
>> -Todd
>>
>>
>> --
>> *Arush Kharbanda* || Technical Teamlead
>> ar...@sigmoidanalytics.com || www.sigmoidanalytics.com