Hi Todd,

When I see /data/json appear in your log, I suspect it is picking up the default hive.metastore.warehouse.dir from hive-site.xml, where the value is /data/. Could you check that property and see if you can point it at the correct HDFS directory for your Hive table? Another thing to look at is the Hive metastore MySQL database, if you are using MySQL as the DB for the metastore.

Date: Wed, 11 Feb 2015 19:53:35 -0500
Subject: Re: SparkSQL + Tableau Connector
From: tsind...@gmail.com
To: alee...@hotmail.com
CC: ar...@sigmoidanalytics.com; user@spark.apache.org
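Andrew's check can be done directly against the config file; a minimal sketch, assuming the file lives at $SPARK_HOME/conf/hive-site.xml (override HIVE_SITE for your layout):

```shell
# Print the hive.metastore.warehouse.dir property and the value that follows
# it. The HIVE_SITE default below is an assumption about the local layout.
HIVE_SITE="${HIVE_SITE:-$SPARK_HOME/conf/hive-site.xml}"
grep -A1 '<name>hive.metastore.warehouse.dir</name>' "$HIVE_SITE" 2>/dev/null \
  || echo "property not found in $HIVE_SITE"
```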
First, sorry for the long post. So, back to Tableau and Spark SQL: I'm still missing something.

TL;DR: To get the Spark SQL temp table associated with the metastore, are there additional steps required beyond doing the below?

Initial SQL on connection:

    create temporary table test
    using org.apache.spark.sql.json
    options (path '/data/json/*');

    cache table test;

I feel like I'm missing a step of associating the Spark SQL table with the metastore. Do I need to actually save it in some fashion? I'm trying to avoid saving to Hive if possible.

Details: I configured the hive-site.xml and placed it in $SPARK_HOME/conf. It looks like this; thanks Andrew and Arush for the assistance:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.semantic.analyzer.factory.impl</name>
    <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value>
  </property>
  <property>
    <name>hive.metastore.sasl.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.server2.authentication</name>
    <value>NONE</value>
  </property>
  <property>
    <name>hive.server2.enable.doAs</name>
    <value>true</value>
  </property>
  <!--
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
    <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
  </property>
  -->
  <property>
    <name>hive.warehouse.subdir.inherit.perms</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore_db?createDatabaseIfNotExist=true</value>
    <description>metadata is stored in a MySQL server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>MySQL JDBC driver class</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hiveuser</value>
  </property>
</configuration>

When I start the server it looks fine:

$ ./sbin/start-thriftserver.sh \
    --hiveconf hive.server2.thrift.port=10001 \
    --hiveconf hive.server2.thrift.bind.host radtech.io \
    --master spark://radtech.io:7077 \
    --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to /usr/local/spark-1.2.1-bin-hadoop2.4/logs/spark-tnist-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-radtech.io.out

radtech:spark tnist$ tail -f logs/spark-tnist-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-radtech.io.out
15/02/11 19:15:24 INFO SparkDeploySchedulerBackend: Granted executor ID app-20150211191524-0008/1 on hostPort 192.168.1.2:50851 with 2 cores, 512.0 MB RAM
15/02/11 19:15:24 INFO AppClient$ClientActor: Executor updated: app-20150211191524-0008/0 is now LOADING
15/02/11 19:15:24 INFO AppClient$ClientActor: Executor updated: app-20150211191524-0008/1 is now LOADING
15/02/11 19:15:24 INFO AppClient$ClientActor: Executor updated: app-20150211191524-0008/0 is now RUNNING
15/02/11 19:15:24 INFO AppClient$ClientActor: Executor updated: app-20150211191524-0008/1 is now RUNNING
15/02/11 19:15:24 INFO NettyBlockTransferService: Server created on 50938
15/02/11 19:15:24 INFO BlockManagerMaster: Trying to register BlockManager
15/02/11 19:15:24 INFO BlockManagerMasterActor: Registering block manager 192.168.1.2:50938 with 265.1 MB RAM, BlockManagerId(<driver>, 192.168.1.2, 50938)
15/02/11 19:15:24 INFO BlockManagerMaster: Registered BlockManager
15/02/11 19:15:25 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
15/02/11 19:15:25 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/11 19:15:25 INFO ObjectStore: ObjectStore, initialize called
15/02/11 19:15:26 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/02/11 19:15:26 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
15/02/11 19:15:26 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/02/11 19:15:26 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/02/11 19:15:27 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
15/02/11 19:15:28 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@192.168.1.2:50944/user/Executor#1008909571] with ID 0
15/02/11 19:15:28 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/02/11 19:15:28 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/02/11 19:15:28 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@192.168.1.2:50948/user/Executor#-688434541] with ID 1
15/02/11 19:15:28 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/02/11 19:15:28 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/02/11 19:15:28 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
15/02/11 19:15:28 INFO ObjectStore: Initialized ObjectStore
15/02/11 19:15:28 INFO BlockManagerMasterActor: Registering block manager 192.168.1.2:50951 with 265.1 MB RAM, BlockManagerId(0, 192.168.1.2, 50951)
15/02/11 19:15:28 INFO BlockManagerMasterActor: Registering block manager 192.168.1.2:50952 with 265.1 MB RAM, BlockManagerId(1, 192.168.1.2, 50952)
15/02/11 19:15:28 INFO HiveMetaStore: Added admin role in metastore
15/02/11 19:15:28 INFO HiveMetaStore: Added public role in metastore
15/02/11 19:15:29 INFO HiveMetaStore: No user is added in admin role, since config is empty
15/02/11 19:15:29 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/02/11 19:15:29 INFO AbstractService: HiveServer2: Async execution pool size 100
15/02/11 19:15:29 INFO AbstractService: Service:OperationManager is inited.
15/02/11 19:15:29 INFO AbstractService: Service: SessionManager is inited.
15/02/11 19:15:29 INFO AbstractService: Service: CLIService is inited.
15/02/11 19:15:29 INFO AbstractService: Service:ThriftBinaryCLIService is inited.
15/02/11 19:15:29 INFO AbstractService: Service: HiveServer2 is inited.
15/02/11 19:15:29 INFO AbstractService: Service:OperationManager is started.
15/02/11 19:15:29 INFO AbstractService: Service:SessionManager is started.
15/02/11 19:15:29 INFO AbstractService: Service:CLIService is started.
15/02/11 19:15:29 INFO HiveMetaStore: No user is added in admin role, since config is empty
15/02/11 19:15:29 INFO HiveMetaStore: 0: get_databases: default
15/02/11 19:15:29 INFO audit: ugi=tnist ip=unknown-ip-addr cmd=get_databases: default
15/02/11 19:15:29 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/11 19:15:29 INFO ObjectStore: ObjectStore, initialize called
15/02/11 19:15:29 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
15/02/11 19:15:29 INFO ObjectStore: Initialized ObjectStore
15/02/11 19:15:29 INFO AbstractService: Service:ThriftBinaryCLIService is started.
15/02/11 19:15:29 INFO AbstractService: Service:HiveServer2 is started.
15/02/11 19:15:29 INFO HiveThriftServer2: HiveThriftServer2 started
15/02/11 19:15:29 INFO ThriftCLIService: ThriftBinaryCLIService listening on radtech.io/192.168.1.2:10001

When I start Tableau and use the SparkSQL (Beta) connector, I see that the "Initial SQL" is being executed:

15/02/11 19:25:35 INFO HiveMetaStore: 2: get_database: default
15/02/11 19:25:35 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_database: default
15/02/11 19:25:35 INFO Driver: OK
15/02/11 19:25:35 INFO SparkExecuteStatementOperation: Running query 'set -v'
15/02/11 19:25:35 INFO SparkExecuteStatementOperation: Result Schema: List(#18)
15/02/11 19:25:35 INFO SparkExecuteStatementOperation: Result Schema: List(#21)
15/02/11 19:25:35 INFO ThriftCLIService: Client protocol version: HIVE_CLI_SERVICE_PROTOCOL_V6
15/02/11 19:25:35 INFO HiveMetaStore: No user is added in admin role, since config is empty
15/02/11 19:25:35 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/02/11 19:25:35 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/02/11 19:25:35 INFO SparkExecuteStatementOperation: Running query 'use `default`'
15/02/11 19:25:35 INFO ParseDriver: Parsing command: use `default`
15/02/11 19:25:35 INFO ParseDriver: Parse Completed
15/02/11 19:25:35 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/02/11 19:25:35 INFO Driver: Concurrency mode is disabled, not creating a lock manager
15/02/11 19:25:35 INFO ParseDriver: Parsing command: use `default`
15/02/11 19:25:35 INFO ParseDriver: Parse Completed
15/02/11 19:25:35 INFO Driver: Semantic Analysis Completed
15/02/11 19:25:35 INFO Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null)
15/02/11 19:25:35 INFO Driver: Starting command: use `default`
15/02/11 19:25:35 INFO HiveMetaStore: 3: get_database: default
15/02/11 19:25:35 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_database: default
15/02/11 19:25:35 INFO HiveMetaStore: 3: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/11 19:25:35 INFO ObjectStore: ObjectStore, initialize called
15/02/11 19:25:36 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
15/02/11 19:25:36 INFO ObjectStore: Initialized ObjectStore
15/02/11 19:25:36 INFO HiveMetaStore: 3: get_database: default
15/02/11 19:25:36 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_database: default
15/02/11 19:25:36 INFO Driver: OK
15/02/11 19:25:36 INFO SparkExecuteStatementOperation: Running query 'create temporary table test using org.apache.spark.sql.json options (path '/data/json/*')'
....
15/02/11 19:25:38 INFO Driver: Starting command: use `default`
15/02/11 19:25:38 INFO HiveMetaStore: 4: get_database: default
15/02/11 19:25:38 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_database: default
15/02/11 19:25:38 INFO HiveMetaStore: 4: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/11 19:25:38 INFO ObjectStore: ObjectStore, initialize called
15/02/11 19:25:38 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
15/02/11 19:25:38 INFO ObjectStore: Initialized ObjectStore
15/02/11 19:25:38 INFO HiveMetaStore: 4: get_database: default
15/02/11 19:25:38 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_database: default
15/02/11 19:25:38 INFO Driver: OK
15/02/11 19:25:38 INFO SparkExecuteStatementOperation: Running query ' cache table test '
15/02/11 19:25:38 INFO MemoryStore: ensureFreeSpace(211383) called with curMem=101514, maxMem=278019440
15/02/11 19:25:38 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 206.4 KB, free 264.8 MB)

I see no way in Tableau to see the cached table "test". I think I am missing a step of associating the generated temp table from Spark SQL with the metastore. Any guidance or insights on what I'm missing here?

Thanks for the assistance.

-Todd

On Wed, Feb 11, 2015 at 3:20 PM, Andrew Lee <alee...@hotmail.com> wrote:

Sorry folks, it is executing Spark jobs instead of Hive jobs. I misread the logs, since there were other activities going on on the cluster.

From: alee...@hotmail.com
To: ar...@sigmoidanalytics.com; tsind...@gmail.com
CC: user@spark.apache.org
Subject: RE: SparkSQL + Tableau Connector
Date: Wed, 11 Feb 2015 11:56:44 -0800

I'm using MySQL as the metastore DB with Spark 1.2. I simply copied the hive-site.xml to /etc/spark/ and added the MySQL JDBC JAR to spark-env.sh in /etc/spark/; everything works fine now. My setup looks like this:
Tableau => Spark ThriftServer2 => HiveServer2

It's talking to Tableau Desktop 8.3. Interestingly, when I query a Hive table, it still invokes Hive queries against HiveServer2, which is running the MR or Tez engine. Is this expected? I thought it should at least use the Catalyst engine and talk to the underlying HDFS, the way the HiveContext API does to pull the data into an RDD. Did I misunderstand the purpose of Spark ThriftServer2?

Date: Wed, 11 Feb 2015 16:07:40 +0530
Subject: Re: SparkSQL + Tableau Connector
From: ar...@sigmoidanalytics.com
To: tsind...@gmail.com
CC: user@spark.apache.org

Hi,

I used this, though it's using an embedded driver and is not a good approach. It works. You can configure some other metastore type as well; I have not tried the metastore URIs.

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/opt/bigdata/spark-1.2.0/metastore_db;create=true</value>
    <description>URL for the DB</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  </property>
  <!--
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://x.x.x.x:10000</value>
    <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
  </property>
  -->
</configuration>

On Wed, Feb 11, 2015 at 3:59 PM, Todd Nist <tsind...@gmail.com> wrote:

Hi Arush,

So yes, I want to create the tables through Spark SQL. I have placed the hive-site.xml file inside the $SPARK_HOME/conf directory; I thought that was all I should need to do to have the thriftserver use it.
Perhaps my hive-site.xml is wrong; it currently looks like this:

<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <!-- Ensure that the following statement points to the Hive Metastore URI in your cluster -->
    <value>thrift://sandbox.hortonworks.com:9083</value>
    <description>URI for client to contact metastore server</description>
  </property>
</configuration>

Which leads me to believe it is going to pull from the Hortonworks thriftserver? I will go look at the docs to see if this is right; it is what Hortonworks says to do. Do you have an example hive-site.xml by chance that works with Spark SQL? I am using 8.3 of Tableau with the SparkSQL connector.

Thanks for the assistance.

-Todd

On Wed, Feb 11, 2015 at 2:34 AM, Arush Kharbanda <ar...@sigmoidanalytics.com> wrote:

BTW, what Tableau connector are you using?

On Wed, Feb 11, 2015 at 12:55 PM, Arush Kharbanda <ar...@sigmoidanalytics.com> wrote:

I am a little confused here: why do you want to create the tables in Hive? You want to create the tables in spark-sql, right? If you are not able to find the same tables through Tableau, then thrift is connecting to a different metastore than your spark-shell. One way to specify a metastore to thrift is to provide the path to hive-site.xml while starting thrift, using --files hive-site.xml. Similarly, you can specify the same metastore to your spark-submit or spark-shell using the same option.

On Wed, Feb 11, 2015 at 5:23 AM, Todd Nist <tsind...@gmail.com> wrote:

Arush,

As for #2, do you mean something like this from the docs:

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)

Or did you have something else in mind?
-Todd

On Tue, Feb 10, 2015 at 6:35 PM, Todd Nist <tsind...@gmail.com> wrote:

Arush,

Thank you, I will take a look at that approach in the morning. I sort of figured the answer to #1 was NO, and that I would need to do 2 and 3; thanks for clarifying it for me.

-Todd

On Tue, Feb 10, 2015 at 5:24 PM, Arush Kharbanda <ar...@sigmoidanalytics.com> wrote:

1. Can the connector fetch or query schemaRDDs saved to Parquet or JSON files? NO.
2. Do I need to do something to expose these via hive / metastore other than creating a table in hive? Create a table in Spark SQL to expose it via Spark SQL.
3. Does the thriftserver need to be configured to expose these in some fashion, sort of related to question 2? You would need to configure thrift to read from the metastore you expect it to read from - by default it reads from the metastore_db directory present in the directory used to launch the thrift server.

On 11 Feb 2015 01:35, "Todd Nist" <tsind...@gmail.com> wrote:

Hi,

I'm trying to understand how and what the Tableau connector to SparkSQL is able to access. My understanding is that it needs to connect to the thriftserver, and I am not sure how or if it exposes parquet, json, or schemaRDDs, or whether it only exposes schemas defined in the metastore / Hive.

For example, I do the following from the spark-shell, which generates a schemaRDD from a CSV file and saves it as a JSON file as well as a parquet file:

import org.apache.spark.sql.SQLContext
import com.databricks.spark.csv._

val sqlContext = new SQLContext(sc)
val test = sqlContext.csvFile("/data/test.csv")
test.toJSON.saveAsTextFile("/data/out")
test.saveAsParquetFile("/data/out")

When I connect from Tableau, the only thing I see is the "default" schema and nothing in the tables section.

So my questions are:

1. Can the connector fetch or query schemaRDDs saved to Parquet or JSON files?
2. Do I need to do something to expose these via hive / metastore other than creating a table in hive?
3. Does the thriftserver need to be configured to expose these in some fashion, sort of related to question 2?

TIA for the assistance.

-Todd

--
Arush Kharbanda || Technical Team Lead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
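Arush's --files suggestion, combined with Todd's original start command, can be sketched as below; this is only a sketch, and the master URL, conf path, and JAR path are taken from the thread as assumptions about the local layout:

```shell
# Ship an explicit hive-site.xml to the Thrift server so it resolves the
# same metastore as the spark-shell (all paths below are assumptions).
./sbin/start-thriftserver.sh \
  --master spark://radtech.io:7077 \
  --files /usr/local/spark/conf/hive-site.xml \
  --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar
```

Passing the same --files hive-site.xml to spark-shell or spark-submit should then point both at one metastore, which is the mismatch Arush suspects.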