Hi Todd,

When I see /data/json appear in your log, I suspect it is picking up the default hive.metastore.warehouse.dir from hive-site.xml, where the value is /data/. Could you check that property and see if you can point it at the correct HDFS directory for your Hive table? Another thing to look at is the Hive metastore MySQL database, if you are using MySQL as the DB for the metastore.

Date: Wed, 11 Feb 2015 19:53:35 -0500
Subject: Re: SparkSQL + Tableau Connector
From: tsind...@gmail.com
To: alee...@hotmail.com
CC: ar...@sigmoidanalytics.com; user@spark.apache.org
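Andrew's check can be done directly against the config file; a minimal sketch, assuming the file lives at $SPARK_HOME/conf/hive-site.xml (override HIVE_SITE for your layout):

```shell
# Print the hive.metastore.warehouse.dir property and the value that follows
# it. The HIVE_SITE default below is an assumption about the local layout.
HIVE_SITE="${HIVE_SITE:-$SPARK_HOME/conf/hive-site.xml}"
grep -A1 '<name>hive.metastore.warehouse.dir</name>' "$HIVE_SITE" 2>/dev/null \
  || echo "property not found in $HIVE_SITE"
```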
First, sorry for the long post. So, back to Tableau and Spark SQL: I'm still missing something.

TL;DR: To get the Spark SQL temp table associated with the metastore, are there additional steps required beyond doing the below?

Initial SQL on connection:

    create temporary table test
    using org.apache.spark.sql.json
    options (path '/data/json/*');

    cache table test;

I feel like I'm missing a step of associating the Spark SQL table with the metastore. Do I need to actually save it in some fashion? I'm trying to avoid saving to Hive if possible.

Details: I configured the hive-site.xml and placed it in $SPARK_HOME/conf. It looks like this; thanks Andrew and Arush for the assistance:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.semantic.analyzer.factory.impl</name>
    <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value>
  </property>
  <property>
    <name>hive.metastore.sasl.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.server2.authentication</name>
    <value>NONE</value>
  </property>
  <property>
    <name>hive.server2.enable.doAs</name>
    <value>true</value>
  </property>
  <!--
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
    <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
  </property>
  -->
  <property>
    <name>hive.warehouse.subdir.inherit.perms</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore_db?createDatabaseIfNotExist=true</value>
    <description>metadata is stored in a MySQL server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>MySQL JDBC driver class</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hiveuser</value>
  </property>
</configuration>

When I start the server it looks fine:

$ ./sbin/start-thriftserver.sh \
    --hiveconf hive.server2.thrift.port=10001 \
    --hiveconf hive.server2.thrift.bind.host radtech.io \
    --master spark://radtech.io:7077 \
    --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to /usr/local/spark-1.2.1-bin-hadoop2.4/logs/spark-tnist-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-radtech.io.out

radtech:spark tnist$ tail -f logs/spark-tnist-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-radtech.io.out
15/02/11 19:15:24 INFO SparkDeploySchedulerBackend: Granted executor ID app-20150211191524-0008/1 on hostPort 192.168.1.2:50851 with 2 cores, 512.0 MB RAM
15/02/11 19:15:24 INFO AppClient$ClientActor: Executor updated: app-20150211191524-0008/0 is now LOADING
15/02/11 19:15:24 INFO AppClient$ClientActor: Executor updated: app-20150211191524-0008/1 is now LOADING
15/02/11 19:15:24 INFO AppClient$ClientActor: Executor updated: app-20150211191524-0008/0 is now RUNNING
15/02/11 19:15:24 INFO AppClient$ClientActor: Executor updated: app-20150211191524-0008/1 is now RUNNING
15/02/11 19:15:24 INFO NettyBlockTransferService: Server created on 50938
15/02/11 19:15:24 INFO BlockManagerMaster: Trying to register BlockManager
15/02/11 19:15:24 INFO BlockManagerMasterActor: Registering block manager 192.168.1.2:50938 with 265.1 MB RAM, BlockManagerId(<driver>, 192.168.1.2, 50938)
15/02/11 19:15:24 INFO BlockManagerMaster: Registered BlockManager
15/02/11 19:15:25 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
15/02/11 19:15:25 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/11 19:15:25 INFO ObjectStore: ObjectStore, initialize called
15/02/11 19:15:26 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/02/11 19:15:26 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
15/02/11 19:15:26 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/02/11 19:15:26 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/02/11 19:15:27 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
15/02/11 19:15:28 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@192.168.1.2:50944/user/Executor#1008909571] with ID 0
15/02/11 19:15:28 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/02/11 19:15:28 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/02/11 19:15:28 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@192.168.1.2:50948/user/Executor#-688434541] with ID 1
15/02/11 19:15:28 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/02/11 19:15:28 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/02/11 19:15:28 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
15/02/11 19:15:28 INFO ObjectStore: Initialized ObjectStore
15/02/11 19:15:28 INFO BlockManagerMasterActor: Registering block manager 192.168.1.2:50951 with 265.1 MB RAM, BlockManagerId(0, 192.168.1.2, 50951)
15/02/11 19:15:28 INFO BlockManagerMasterActor: Registering block manager 192.168.1.2:50952 with 265.1 MB RAM, BlockManagerId(1, 192.168.1.2, 50952)
15/02/11 19:15:28 INFO HiveMetaStore: Added admin role in metastore
15/02/11 19:15:28 INFO HiveMetaStore: Added public role in metastore
15/02/11 19:15:29 INFO HiveMetaStore: No user is added in admin role, since config is empty
15/02/11 19:15:29 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/02/11 19:15:29 INFO AbstractService: HiveServer2: Async execution pool size 100
15/02/11 19:15:29 INFO AbstractService: Service:OperationManager is inited.
15/02/11 19:15:29 INFO AbstractService: Service: SessionManager is inited.
15/02/11 19:15:29 INFO AbstractService: Service: CLIService is inited.
15/02/11 19:15:29 INFO AbstractService: Service:ThriftBinaryCLIService is inited.
15/02/11 19:15:29 INFO AbstractService: Service: HiveServer2 is inited.
15/02/11 19:15:29 INFO AbstractService: Service:OperationManager is started.
15/02/11 19:15:29 INFO AbstractService: Service:SessionManager is started.
15/02/11 19:15:29 INFO AbstractService: Service:CLIService is started.
15/02/11 19:15:29 INFO HiveMetaStore: No user is added in admin role, since config is empty
15/02/11 19:15:29 INFO HiveMetaStore: 0: get_databases: default
15/02/11 19:15:29 INFO audit: ugi=tnist ip=unknown-ip-addr cmd=get_databases: default
15/02/11 19:15:29 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/11 19:15:29 INFO ObjectStore: ObjectStore, initialize called
15/02/11 19:15:29 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
15/02/11 19:15:29 INFO ObjectStore: Initialized ObjectStore
15/02/11 19:15:29 INFO AbstractService: Service:ThriftBinaryCLIService is started.
15/02/11 19:15:29 INFO AbstractService: Service:HiveServer2 is started.
15/02/11 19:15:29 INFO HiveThriftServer2: HiveThriftServer2 started
15/02/11 19:15:29 INFO ThriftCLIService: ThriftBinaryCLIService listening on radtech.io/192.168.1.2:10001

When I start Tableau and use the SparkSQL (Beta) connector, I see that the "Initial SQL" is being executed:

15/02/11 19:25:35 INFO HiveMetaStore: 2: get_database: default
15/02/11 19:25:35 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_database: default
15/02/11 19:25:35 INFO Driver: OK
15/02/11 19:25:35 INFO SparkExecuteStatementOperation: Running query 'set -v'
15/02/11 19:25:35 INFO SparkExecuteStatementOperation: Result Schema: List(#18)
15/02/11 19:25:35 INFO SparkExecuteStatementOperation: Result Schema: List(#21)
15/02/11 19:25:35 INFO ThriftCLIService: Client protocol version: HIVE_CLI_SERVICE_PROTOCOL_V6
15/02/11 19:25:35 INFO HiveMetaStore: No user is added in admin role, since config is empty
15/02/11 19:25:35 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/02/11 19:25:35 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/02/11 19:25:35 INFO SparkExecuteStatementOperation: Running query 'use `default`'
15/02/11 19:25:35 INFO ParseDriver: Parsing command: use `default`
15/02/11 19:25:35 INFO ParseDriver: Parse Completed
15/02/11 19:25:35 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/02/11 19:25:35 INFO Driver: Concurrency mode is disabled, not creating a lock manager
15/02/11 19:25:35 INFO ParseDriver: Parsing command: use `default`
15/02/11 19:25:35 INFO ParseDriver: Parse Completed
15/02/11 19:25:35 INFO Driver: Semantic Analysis Completed
15/02/11 19:25:35 INFO Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null)
15/02/11 19:25:35 INFO Driver: Starting command: use `default`
15/02/11 19:25:35 INFO HiveMetaStore: 3: get_database: default
15/02/11 19:25:35 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_database: default
15/02/11 19:25:35 INFO HiveMetaStore: 3: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/11 19:25:35 INFO ObjectStore: ObjectStore, initialize called
15/02/11 19:25:36 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
15/02/11 19:25:36 INFO ObjectStore: Initialized ObjectStore
15/02/11 19:25:36 INFO HiveMetaStore: 3: get_database: default
15/02/11 19:25:36 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_database: default
15/02/11 19:25:36 INFO Driver: OK
15/02/11 19:25:36 INFO SparkExecuteStatementOperation: Running query 'create temporary table test using org.apache.spark.sql.json options (path '/data/json/*')'
....
15/02/11 19:25:38 INFO Driver: Starting command: use `default`
15/02/11 19:25:38 INFO HiveMetaStore: 4: get_database: default
15/02/11 19:25:38 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_database: default
15/02/11 19:25:38 INFO HiveMetaStore: 4: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/11 19:25:38 INFO ObjectStore: ObjectStore, initialize called
15/02/11 19:25:38 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
15/02/11 19:25:38 INFO ObjectStore: Initialized ObjectStore
15/02/11 19:25:38 INFO HiveMetaStore: 4: get_database: default
15/02/11 19:25:38 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_database: default
15/02/11 19:25:38 INFO Driver: OK
15/02/11 19:25:38 INFO SparkExecuteStatementOperation: Running query ' cache table test '
15/02/11 19:25:38 INFO MemoryStore: ensureFreeSpace(211383) called with curMem=101514, maxMem=278019440
15/02/11 19:25:38 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 206.4 KB, free 264.8 MB)

I see no way in Tableau to see the cached table "test". I think I am missing a step of associating the generated temp table from Spark SQL with the metastore. Any guidance or insights on what I'm missing here?

Thanks for the assistance.

-Todd

On Wed, Feb 11, 2015 at 3:20 PM, Andrew Lee <alee...@hotmail.com> wrote:

Sorry folks, it is executing Spark jobs instead of Hive jobs. I misread the logs, since there were other activities going on on the cluster.

From: alee...@hotmail.com
To: ar...@sigmoidanalytics.com; tsind...@gmail.com
CC: user@spark.apache.org
Subject: RE: SparkSQL + Tableau Connector
Date: Wed, 11 Feb 2015 11:56:44 -0800

I'm using MySQL as the metastore DB with Spark 1.2. I simply copied the hive-site.xml to /etc/spark/ and added the MySQL JDBC JAR to spark-env.sh in /etc/spark/; everything works fine now. My setup looks like this:
Tableau => Spark ThriftServer2 => HiveServer2

It's talking to Tableau Desktop 8.3. Interestingly, when I query a Hive table, it still invokes Hive queries against HiveServer2, which is running the MR or Tez engine. Is this expected? I thought it should at least use the Catalyst engine and talk to the underlying HDFS, the way the HiveContext API does to pull the data into an RDD. Did I misunderstand the purpose of Spark ThriftServer2?

Date: Wed, 11 Feb 2015 16:07:40 +0530
Subject: Re: SparkSQL + Tableau Connector
From: ar...@sigmoidanalytics.com
To: tsind...@gmail.com
CC: user@spark.apache.org

Hi,

I used this, though it's using an embedded driver and is not a good approach. It works. You can configure some other metastore type as well; I have not tried the metastore URIs.

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/opt/bigdata/spark-1.2.0/metastore_db;create=true</value>
    <description>URL for the DB</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  </property>
  <!--
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://x.x.x.x:10000</value>
    <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
  </property>
  -->
</configuration>

On Wed, Feb 11, 2015 at 3:59 PM, Todd Nist <tsind...@gmail.com> wrote:

Hi Arush,

So yes, I want to create the tables through Spark SQL. I have placed the hive-site.xml file inside the $SPARK_HOME/conf directory; I thought that was all I should need to do to have the thriftserver use it.
Perhaps my hive-site.xml is wrong; it currently looks like this:

<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <!-- Ensure that the following statement points to the Hive Metastore URI in your cluster -->
    <value>thrift://sandbox.hortonworks.com:9083</value>
    <description>URI for client to contact metastore server</description>
  </property>
</configuration>

Which leads me to believe it is going to pull from the Hortonworks thriftserver? I will go look at the docs to see if this is right; it is what Hortonworks says to do. Do you have an example hive-site.xml by chance that works with Spark SQL? I am using 8.3 of Tableau with the SparkSQL connector.

Thanks for the assistance.

-Todd

On Wed, Feb 11, 2015 at 2:34 AM, Arush Kharbanda <ar...@sigmoidanalytics.com> wrote:

BTW, what Tableau connector are you using?

On Wed, Feb 11, 2015 at 12:55 PM, Arush Kharbanda <ar...@sigmoidanalytics.com> wrote:

I am a little confused here: why do you want to create the tables in Hive? You want to create the tables in spark-sql, right? If you are not able to find the same tables through Tableau, then thrift is connecting to a different metastore than your spark-shell. One way to specify a metastore to thrift is to provide the path to hive-site.xml while starting thrift, using --files hive-site.xml. Similarly, you can specify the same metastore to your spark-submit or spark-shell using the same option.

On Wed, Feb 11, 2015 at 5:23 AM, Todd Nist <tsind...@gmail.com> wrote:

Arush,

As for #2, do you mean something like this from the docs:

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)

Or did you have something else in mind?
-Todd

On Tue, Feb 10, 2015 at 6:35 PM, Todd Nist <tsind...@gmail.com> wrote:

Arush,

Thank you, I will take a look at that approach in the morning. I sort of figured the answer to #1 was NO, and that I would need to do 2 and 3; thanks for clarifying it for me.

-Todd

On Tue, Feb 10, 2015 at 5:24 PM, Arush Kharbanda <ar...@sigmoidanalytics.com> wrote:

1. Can the connector fetch or query schemaRDDs saved to Parquet or JSON files? NO.
2. Do I need to do something to expose these via hive / metastore other than creating a table in hive? Create a table in Spark SQL to expose it via Spark SQL.
3. Does the thriftserver need to be configured to expose these in some fashion, sort of related to question 2? You would need to configure thrift to read from the metastore you expect it to read from - by default it reads from the metastore_db directory present in the directory used to launch the thrift server.

On 11 Feb 2015 01:35, "Todd Nist" <tsind...@gmail.com> wrote:

Hi,

I'm trying to understand how and what the Tableau connector to SparkSQL is able to access. My understanding is that it needs to connect to the thriftserver, and I am not sure how or if it exposes parquet, json, or schemaRDDs, or whether it only exposes schemas defined in the metastore / Hive.

For example, I do the following from the spark-shell, which generates a schemaRDD from a CSV file and saves it as a JSON file as well as a parquet file:

import org.apache.spark.sql.SQLContext
import com.databricks.spark.csv._

val sqlContext = new SQLContext(sc)
val test = sqlContext.csvFile("/data/test.csv")
test.toJSON.saveAsTextFile("/data/out")
test.saveAsParquetFile("/data/out")

When I connect from Tableau, the only thing I see is the "default" schema and nothing in the tables section.

So my questions are:

1. Can the connector fetch or query schemaRDDs saved to Parquet or JSON files?
2. Do I need to do something to expose these via hive / metastore other than creating a table in hive?
3. Does the thriftserver need to be configured to expose these in some fashion, sort of related to question 2?

TIA for the assistance.

-Todd

--
Arush Kharbanda || Technical Team Lead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
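Arush's --files suggestion, combined with Todd's original start command, can be sketched as below; this is only a sketch, and the master URL, conf path, and JAR path are taken from the thread as assumptions about the local layout:

```shell
# Ship an explicit hive-site.xml to the Thrift server so it resolves the
# same metastore as the spark-shell (all paths below are assumptions).
./sbin/start-thriftserver.sh \
  --master spark://radtech.io:7077 \
  --files /usr/local/spark/conf/hive-site.xml \
  --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar
```

Passing the same --files hive-site.xml to spark-shell or spark-submit should then point both at one metastore, which is the mismatch Arush suspects.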