Re: SparkSQL + Tableau Connector
Great, glad it worked out!
Re: SparkSQL + Tableau Connector
Hi Silvio,

I got this working today using your suggestion with the Initial SQL and a Custom Query. See here for details:

http://stackoverflow.com/questions/28403664/connect-to-existing-hive-in-intellij-using-sbt-as-build/28608608#28608608

It is not ideal, as I need to write a custom query, but it does work for now. I also have it working by doing a saveAsTable on the ingested data, which stores the reference in the metastore for access via the thrift server.

Thanks for the help.

-Todd
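A minimal sketch of the saveAsTable route Todd describes, assuming Spark 1.2 with Hive support compiled in; the input path and table name here are examples, not taken from the thread:

import org.apache.spark.sql.hive.HiveContext

// A HiveContext records table metadata in the configured metastore,
// so the Thrift server (and therefore Tableau) can discover the table.
val hiveContext = new HiveContext(sc)
val events = hiveContext.jsonFile("/data/json/")  // hypothetical input path
events.saveAsTable("events")                      // persists the table definition in the metastore

Unlike a temporary table, a table saved this way survives the session and should show up in Tableau's schema browser.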
RE: SparkSQL + Tableau Connector
Hi Todd,

When I see /data/json appear in your log, I have a feeling that it is the default hive.metastore.warehouse.dir from hive-site.xml, where the value is /data/. Could you check that property and see if you can point it to the correct Hive table HDFS directory?

Another thing to look at is the Hive metastore MySQL database, if you are using MySQL as the DB for the metastore.
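For reference, the property Andrew mentions lives in hive-site.xml and would look something like the following; the warehouse path shown is the stock default, not a value from Todd's config:

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
  <description>Default location for managed Hive tables</description>
</property>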
RE: SparkSQL + Tableau Connector
Sorry folks, it is executing Spark jobs instead of Hive jobs. I misread the logs, since there were other activities going on on the cluster.

From: alee...@hotmail.com
To: ar...@sigmoidanalytics.com; tsind...@gmail.com
CC: user@spark.apache.org
Subject: RE: SparkSQL + Tableau Connector
Date: Wed, 11 Feb 2015 11:56:44 -0800

I'm using MySQL as the metastore DB with Spark 1.2. I simply copied hive-site.xml to /etc/spark/ and added the MySQL JDBC JAR to spark-env.sh in /etc/spark/; everything works fine now. My setup looks like this:

Tableau => Spark ThriftServer2 => HiveServer2

It's talking to Tableau Desktop 8.3. Interestingly, when I query a Hive table, it still invokes Hive queries to HiveServer2, which is running the MR or Tez engine. Is this expected? I thought it should at least use the Catalyst engine and talk to the underlying HDFS like the HiveContext API does to pull the data into an RDD. Did I misunderstand the purpose of Spark ThriftServer2?

Date: Wed, 11 Feb 2015 16:07:40 +0530
Subject: Re: SparkSQL + Tableau Connector
From: ar...@sigmoidanalytics.com
To: tsind...@gmail.com
CC: user@spark.apache.org

Hi,

I used this, though it uses an embedded driver and is not a good approach. It works. You can configure some other metastore type as well; I have not tried the metastore URIs.

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/opt/bigdata/spark-1.2.0/metastore_db;create=true</value>
    <description>URL for the DB</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  </property>
  <!--
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://x.x.x.x:1</value>
    <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
  </property>
  -->
</configuration>
Re: SparkSQL + Tableau Connector
Hey Todd,

I don't have an app to test against the thrift server. Are you able to define custom SQL without using Tableau's schema query? I guess it's not possible to just use SparkSQL temp tables; you may have to use permanent Hive tables that are actually in the metastore, so Tableau can discover them in the schema. In that case you will either have to generate the Hive tables externally from Spark, or use Spark to process the data and save them using a HiveContext.

From: Todd Nist
Date: Wednesday, February 11, 2015 at 7:53 PM
To: Andrew Lee
Cc: Arush Kharbanda, user@spark.apache.org
Subject: Re: SparkSQL + Tableau Connector

First, sorry for the long post. So, back to Tableau and Spark SQL: I'm still missing something.

TL;DR: To get the Spark SQL temp table associated with the metastore, are there additional steps required beyond doing the below?

Initial SQL on connection:

create temporary table test
using org.apache.spark.sql.json
options (path '/data/json/*');

cache table test;

I feel like I'm missing a step of associating the Spark SQL table with the metastore. Do I need to actually save it in some fashion? I'm trying to avoid saving to Hive if possible.

Details:

I configured hive-site.xml and placed it in $SPARK_HOME/conf. It looks like this (thanks Andrew and Arush for the assistance):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.semantic.analyzer.factory.impl</name>
    <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value>
  </property>
  <property>
    <name>hive.metastore.sasl.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.server2.authentication</name>
    <value>NONE</value>
  </property>
  <property>
    <name>hive.server2.enable.doAs</name>
    <value>true</value>
  </property>
  <!--
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
    <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
  </property>
  -->
  <property>
    <name>hive.warehouse.subdir.inherit.perms</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore_db?createDatabaseIfNotExist=true</value>
    <description>metadata is stored in a MySQL server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>MySQL JDBC driver class</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hiveuser</value>
  </property>
</configuration>

When I start the server it looks fine:

$ ./sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 \
    --hiveconf hive.server2.thrift.bind.host radtech.io \
    --master spark://radtech.io:7077 \
    --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to /usr/local/spark-1.2.1-bin-hadoop2.4/logs/spark-tnist-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-radtech.io.out

radtech:spark tnist$ tail -f logs/spark-tnist-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-radtech.io.out
15/02/11 19:15:24 INFO SparkDeploySchedulerBackend: Granted executor ID app-20150211191524-0008/1 on hostPort 192.168.1.2:50851 with 2 cores, 512.0 MB RAM
15/02/11 19:15:24 INFO AppClient$ClientActor: Executor updated: app-20150211191524-0008/0 is now LOADING
15/02/11 19:15:24 INFO AppClient$ClientActor: Executor updated: app-20150211191524-0008/1 is now LOADING
15/02/11 19:15:24 INFO AppClient$ClientActor: Executor updated: app-20150211191524-0008/0 is now RUNNING
15/02/11 19:15:24 INFO AppClient$ClientActor: Executor updated: app-20150211191524-0008/1 is now RUNNING
15/02/11 19:15:24 INFO NettyBlockTransferService: Server created on 50938
15/02/11 19:15:24 INFO BlockManagerMaster: Trying to register BlockManager
15/02/11 19:15:24 INFO BlockManagerMasterActor: Registering block manager 192.168.1.2:50938 with 265.1 MB RAM, BlockManagerId(driver, 192.168.1.2, 50938)
15/02/11 19:15:24 INFO BlockManagerMaster: Registered BlockManager
15/02/11 19:15:25 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
15/02/11 19:15:25 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/11 19:15:25 INFO ObjectStore: ObjectStore, initialize called
15/02/11 19:15:26 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/02/11 19:15:26 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
15/02/11 19:15:26 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)

...

15/02/11 19:25:35 INFO ObjectStore: ObjectStore, initialize called
15/02/11 19:25:36 INFO Query: Reading in results for query org.datanucleus.store.rdbms.query.SQLQuery@0 since the connection used is closing
15/02/11 19:25:36 INFO ObjectStore: Initialized ObjectStore
15/02/11 19:25:36 INFO HiveMetaStore: 3: get_database: default
15/02/11 19:25:36 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_database: default
15/02/11 19:25:36 INFO Driver: OK
15/02/11 19:25:36 INFO SparkExecuteStatementOperation: Running query 'create temporary table test using org.apache.spark.sql.json options (path '/data/json/*')'
15/02/11 19:25:38 INFO Driver: Starting command: use `default`
15/02/11 19:25:38 INFO HiveMetaStore: 4: get_database: default
15/02/11 19:25:38 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_database: default
15/02/11 19:25:38 INFO HiveMetaStore: 4: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/11 19:25:38 INFO ObjectStore: ObjectStore, initialize called
15/02/11 19:25:38 INFO Query: Reading in results for query org.datanucleus.store.rdbms.query.SQLQuery@0 since the connection used is closing
15/02/11 19:25:38 INFO ObjectStore: Initialized ObjectStore
15/02/11 19:25:38 INFO HiveMetaStore: 4: get_database: default
15/02/11 19:25:38 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_database: default
15/02/11 19:25:38 INFO Driver: OK
15/02/11 19:25:38 INFO SparkExecuteStatementOperation: Running query 'cache table test'
15/02/11 19:25:38 INFO MemoryStore: ensureFreeSpace(211383) called with curMem=101514, maxMem=278019440
15/02/11 19:25:38 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 206.4 KB, free 264.8 MB)

I see no way in Tableau to see the cached table test. I think I am missing a step of associating the generated temp table from Spark SQL with the metastore. Any guidance or insights on what I'm missing here?

Thanks for the assistance.

-Todd
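One way to verify what the Thrift server is actually exposing, independent of Tableau: connect with beeline, which ships with Spark. This is standard usage rather than something from the thread, and it assumes the port 10001 used above:

$ ./bin/beeline
beeline> !connect jdbc:hive2://localhost:10001
beeline> show tables;

Anything registered with the server's catalog should be listed here, which helps separate a metastore problem from a Tableau schema-discovery problem.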
Re: SparkSQL + Tableau Connector
Hi Arush,

So yes, I want to create the tables through Spark SQL. I have placed the hive-site.xml file inside the $SPARK_HOME/conf directory; I thought that was all I should need to do to have the thriftserver use it. Perhaps my hive-site.xml is wrong. It currently looks like this:

<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <!-- Ensure that the following statement points to the Hive Metastore URI in your cluster -->
    <value>thrift://sandbox.hortonworks.com:9083</value>
    <description>URI for client to contact metastore server</description>
  </property>
</configuration>

Which leads me to believe it is going to pull from the thrift server from Horton? I will go look at the docs to see if this is right; it is what Horton says to do. Do you have an example hive-site.xml by chance that works with Spark SQL?

I am using 8.3 of Tableau with the SparkSQL connector.

Thanks for the assistance.

-Todd
Re: SparkSQL + Tableau Connector
Hi Silvio,

So the Initial SQL is executing now; I did not have the *. Added that and it worked fine. FWIW, the * is not needed for the parquet files:

create temporary table test
using org.apache.spark.sql.json
options (path '/data/out/*');

cache table test;

select count(1) from test;

Unfortunately, while the table is created and cached (I can see the statements being executed in the log file of Spark), it is not associated with any schema, at least not one that is being picked up by the Tableau connector. So unless there is some way to associate it with a given schema, I think I'm at a dead end on this one. Anything I may be missing here?

Thanks for the help, it is much appreciated. I will give Arush's suggestion a try tomorrow.

-Todd
Re: SparkSQL + Tableau Connector
I am a little confused here: why do you want to create the tables in Hive? You want to create the tables in spark-sql, right?

If you are not able to find the same tables through Tableau, then thrift is connecting to a different metastore than your spark-shell. One way to specify a metastore to thrift is to provide the path to hive-site.xml while starting thrift, using --files hive-site.xml. Similarly, you can specify the same metastore to your spark-submit or spark-shell using the same option.

Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
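A sketch of what Arush describes, assuming the stock scripts (start-thriftserver.sh forwards spark-submit options); the paths and master URL are examples:

# Point the Thrift server and the shell at the same metastore definition
$ ./sbin/start-thriftserver.sh --master spark://radtech.io:7077 --files /path/to/hive-site.xml
$ ./bin/spark-shell --master spark://radtech.io:7077 --files /path/to/hive-site.xml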
Re: SparkSQL + Tableau Connector
BTW, what Tableau connector are you using?
Re: SparkSQL + Tableau Connector
1. Can the connector fetch or query schemaRDDs saved to Parquet or JSON files?

NO.

2. Do I need to do something to expose these via hive / metastore other than creating a table in hive?

Create a table in Spark SQL to expose it via Spark SQL.

3. Does the thriftserver need to be configured to expose these in some fashion, sort of related to question 2?

You would need to configure thrift to read from the metastore you expect it to read from - by default it reads from the metastore_db directory present in the directory used to launch the thrift server.

On 11 Feb 2015 01:35, Todd Nist tsind...@gmail.com wrote:

Hi,

I'm trying to understand how and what the Tableau connector to SparkSQL is able to access. My understanding is that it needs to connect to the thriftserver, and I am not sure how or if it exposes parquet, json, and schemaRDDs, or whether it only exposes schemas defined in the metastore / hive.

For example, I do the following from the spark-shell, which generates a schemaRDD from a csv file and saves it as a JSON file as well as a parquet file:

import org.apache.spark.sql.SQLContext
import com.databricks.spark.csv._

val sqlContext = new SQLContext(sc)
val test = sqlContext.csvFile("/data/test.csv")
test.toJSON.saveAsTextFile("/data/out")
test.saveAsParquetFile("/data/out")

When I connect from Tableau, the only thing I see is the "default" schema and nothing in the tables section.

So my questions are:

1. Can the connector fetch or query schemaRDDs saved to Parquet or JSON files?
2. Do I need to do something to expose these via hive / metastore other than creating a table in hive?
3. Does the thriftserver need to be configured to expose these in some fashion, sort of related to question 2?

TIA for the assistance.

-Todd
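To illustrate the default Arush mentions, a sketch (the directories are made up): without a hive-site.xml on its classpath, Spark SQL creates an embedded Derby metastore in the current working directory, so processes launched from different directories end up with different catalogs.

$ cd /home/tnist/run-a && spark-sql    # creates ./metastore_db under run-a
$ cd /home/tnist/run-b && spark-sql    # separate, empty ./metastore_db under run-b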
Re: SparkSQL + Tableau Connector
Hi Todd,

What you could do is run some SparkSQL commands immediately after the Thrift server starts up. Or does Tableau have some init SQL commands you could run?

You can actually load data using SQL, such as:

create temporary table people
using org.apache.spark.sql.json
options (path 'examples/src/main/resources/people.json');

cache table people;

create temporary table users
using org.apache.spark.sql.parquet
options (path 'examples/src/main/resources/users.parquet');

cache table users;
Re: SparkSQL + Tableau Connector
Todd,

I just tried it in the bin/spark-sql shell. I created a folder json and put 2 copies of the same people.json file in it. This is what I ran:

spark-sql> create temporary table people using org.apache.spark.sql.json options (path 'examples/src/main/resources/json/*');
Time taken: 0.34 seconds

spark-sql> select * from people;
NULL    Michael
30      Andy
19      Justin
NULL    Michael
30      Andy
19      Justin
Time taken: 0.576 seconds
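For context, the stock examples/src/main/resources/people.json in the Spark distribution contains the records below, which is why the age column shows NULL for Michael:

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}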
Re: SparkSQL + Tableau Connector
Arush,

As for #2, do you mean something like this from the docs:

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)

Or did you have something else in mind?

-Todd
Re: SparkSQL + Tableau Connector
Arush,

Thank you, I will take a look at that approach in the morning. I sort of figured the answer to #1 was NO and that I would need to do 2 and 3; thanks for clarifying it for me.

-Todd
Re: SparkSQL + Tableau Connector
Hi Silvio,

Ah, I like that. There is a section in Tableau for "Initial SQL" to be executed upon connecting; this would fit well there. I guess I will need to issue a collect(), coalesce(1, true).saveAsTextFile(...), or use repartition(1), as the file currently is being broken into multiple parts.

While this works in the spark-shell:

val test = sqlContext.jsonFile("/data/out/")  // returns all parts back as one

It seems to fail in spark-sql:

create temporary table test using org.apache.spark.sql.json options (path '/data/out/');
cache table test;

with:

[Simba][SparkODBC] (35) Error from Spark: error code: '0' error message: 'org.apache.spark.sql.hive.HiveQl$ParseException: Failed to parse: create temporary table test using org.apache.spark.sql.json options (path '/data/out/') cache table test'.

Initial SQL Error. Check that the syntax is correct and that you have access privileges to the requested database.

Thanks again for the suggestion; I will work with it a bit more tomorrow.

-Todd
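A sketch of the single-file idea Todd mentions, using the same spark-csv helper as his original snippet; the output path is an example, and coalesce(1, true) funnels all data through one task, so it only suits small outputs:

import org.apache.spark.sql.SQLContext
import com.databricks.spark.csv._   // spark-csv package, as in the original post

val sqlContext = new SQLContext(sc)
val test = sqlContext.csvFile("/data/test.csv")
// Write the JSON as a single part file so a non-glob path can read it back.
test.toJSON.coalesce(1, true).saveAsTextFile("/data/out-single")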