Spark SQL - JDBC Connectivity
Hi, I would like to know the steps to connect to Spark SQL from the Spring framework (web UI). Also, how do I run and deploy the web application?
Re: Spark SQL JDBC Connectivity
For the time being, we decided to take a different route. We created a REST API layer in our app and allowed SQL queries to be passed in via REST. Internally we pass the query to the Spark SQL layer on the RDD and return the results. With this, Spark SQL is now supported for our RDDs via the REST API. It was easy to do, took just a few hours, and it works for our use case.
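For anyone looking for a concrete starting point, here is a minimal sketch of such a REST layer (not the poster's actual code; the port, table registration, and serialization are all illustrative). It uses the JDK's built-in HTTP server against a Spark 1.x SQLContext and assumes the application's RDDs have already been registered as tables:

    import java.net.InetSocketAddress
    import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object SqlOverRest {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("sql-over-rest"))
        val sqlContext = new SQLContext(sc)
        // ... build the application's RDDs and register them as tables here
        // (registerTempTable in Spark 1.1+, registerAsTable in 1.0) ...

        val server = HttpServer.create(new InetSocketAddress(8080), 0)
        server.createContext("/sql", new HttpHandler {
          def handle(exchange: HttpExchange): Unit = {
            // The request body is the raw SQL string sent by the client.
            val query = scala.io.Source.fromInputStream(exchange.getRequestBody).mkString
            // Run the query against the tables registered in this SQLContext
            // and serialize the result rows naively, one row per line.
            val rows = sqlContext.sql(query).collect().mkString("\n")
            val bytes = rows.getBytes("UTF-8")
            exchange.sendResponseHeaders(200, bytes.length.toLong)
            exchange.getResponseBody.write(bytes)
            exchange.getResponseBody.close()
          }
        })
        server.start()
      }
    }

A client would then POST a SQL string to /sql and get rows back as text; a real deployment would want JSON serialization, error handling, and query validation on top of this.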
Re: Spark SQL JDBC Connectivity
Very cool. Glad you found a solution that works.

On Wed, Jul 30, 2014 at 1:04 PM, Venkat Subramanian vsubr...@gmail.com wrote:
> [snip]
Re: Spark SQL JDBC Connectivity and more
>> 1) If I have a standalone Spark application that has already built an RDD, how can SharkServer2, or for that matter Shark, access 'that' RDD and run queries on it? In all the examples I have seen for Shark, the RDDs (tables) are created within Shark's own SparkContext and processed there.
>
> This is not possible out of the box with Shark. If you look at the code for SharkServer2, though, you'll see that it's just a standard HiveContext under the covers. If you modify this startup code, any SchemaRDD you register as a table in this context will be exposed over JDBC.

[Venkat] Are you saying: pull the SharkServer2 code into my standalone Spark application (as part of the standalone application's process), pass the standalone app's SparkContext to SharkServer2's SparkContext at startup, and voilà, we get SQL/JDBC interfaces for the standalone app's RDDs, exposed as tables? Thanks for the clarification.
Re: Spark SQL JDBC Connectivity and more
> [Venkat] Are you saying: pull the SharkServer2 code into my standalone Spark application (as part of the standalone application's process), pass the standalone app's SparkContext to SharkServer2's SparkContext at startup, and voilà, we get SQL/JDBC interfaces for the standalone app's RDDs, exposed as tables? Thanks for the clarification.

Yeah, that should work, although it is pretty hacky and is not officially supported. It might be interesting to augment Shark to allow the user to invoke custom applications using the same SQLContext. If this is something you'd have time to implement, I'd be happy to discuss the design further.
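For reference, this pattern later became possible without forking Shark: Spark releases from 1.1 onward ship a HiveThriftServer2 that can be started against an existing HiveContext. A minimal sketch of the approach Michael describes, assuming Spark 1.1/1.2 and using illustrative class and table names:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    // Illustrative schema for the RDD we want to expose.
    case class Record(key: Int, value: String)

    object EmbeddedJdbcServer {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("embedded-jdbc"))
        val hiveContext = new HiveContext(sc)
        import hiveContext.createSchemaRDD  // implicit RDD[Product] -> SchemaRDD

        // Build the application's RDD as usual, then register it as a table
        // in the same HiveContext the JDBC server will use.
        val rdd = sc.parallelize(1 to 100).map(i => Record(i, s"val_$i"))
        rdd.registerTempTable("records")

        // Start the Thrift/JDBC server against this context; JDBC clients
        // now see the "records" table while the application keeps running.
        HiveThriftServer2.startWithContext(hiveContext)
      }
    }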
Spark SQL JDBC Connectivity
We are planning to use the latest Spark SQL on RDDs. If a third-party application wants to connect to Spark via JDBC, does Spark SQL have support for that? (We want to avoid going through the Shark/Hive JDBC layer, as we need good performance.)

BTW, we also want to do the same for Spark Streaming: will Spark SQL work on DStreams (since the underlying structure is an RDD anyway), and can we expose a streaming DStream's RDDs through JDBC via Spark SQL for real-time analytics? Any pointers on this will greatly help.

Regards,
Venkat
Re: Spark SQL JDBC Connectivity
On Wed, May 28, 2014 at 11:39 PM, Venkat Subramanian vsubr...@gmail.com wrote:
> We are planning to use the latest Spark SQL on RDDs. If a third-party application wants to connect to Spark via JDBC, does Spark SQL have support for that? (We want to avoid going through the Shark/Hive JDBC layer, as we need good performance.)

We don't have a full release yet, but there is a branch on the Shark GitHub repository that has a version of SharkServer2 that uses Spark SQL. We also plan to port the Shark CLI, but this is not yet finished. You can find this branch along with documentation here: https://github.com/amplab/shark/tree/sparkSql

Note that this version has not yet received much testing (outside of the integration tests that are run on Spark SQL). That said, I would love for people to test it out and report any problems or missing features. Any help here would be greatly appreciated!

> BTW, we also want to do the same for Spark Streaming: will Spark SQL work on DStreams (since the underlying structure is an RDD anyway), and can we expose a streaming DStream's RDDs through JDBC via Spark SQL for real-time analytics?

We have talked about doing this, but it is not currently on the near-term roadmap.
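Once a SharkServer2 (or, later, Spark Thrift server) instance is running, a third-party application connects with the standard Hive JDBC driver. A hedged sketch, assuming the server listens on the default HiveServer2 port 10000 and that hive-jdbc is on the client's classpath; the table name is illustrative:

    import java.sql.DriverManager

    object JdbcClientSketch {
      def main(args: Array[String]): Unit = {
        // The server speaks the HiveServer2 Thrift protocol, so the plain
        // Hive JDBC driver works; no Spark classes needed on the client.
        Class.forName("org.apache.hive.jdbc.HiveDriver")
        val conn = DriverManager.getConnection(
          "jdbc:hive2://localhost:10000/default", "", "")
        val stmt = conn.createStatement()
        val rs = stmt.executeQuery("SELECT key, value FROM records LIMIT 10")
        while (rs.next()) {
          println(s"${rs.getInt(1)}\t${rs.getString(2)}")
        }
        rs.close(); stmt.close(); conn.close()
      }
    }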
Re: Spark SQL JDBC Connectivity and more
Thanks, Michael. OK, will try SharkServer2. But I have some basic questions on a related area:

1) If I have a standalone Spark application that has already built an RDD, how can SharkServer2, or for that matter Shark, access 'that' RDD and run queries on it? In all the examples I have seen for Shark, the RDDs (tables) are created within Shark's own SparkContext and processed there.

I have stylized the real problem we have, which is: we have a standalone Spark application that processes input DStreams and produces output DStreams. I want to expose that near-real-time DStream data to a third-party app via JDBC and let the SharkServer2 CLI operate and query on the DStreams in real time, all from memory. Currently we write the output stream to Cassandra and expose it to the third-party app via JDBC from there, but we want to avoid that extra disk write, which increases latency. (See the sketch after this message for one way to keep streaming data queryable from memory.)

2) I have two applications: one that processes input and computes an output RDD, and another that post-processes the resultant RDD into multiple persistent stores, plus doing other things with it. These are intentionally split into separate processes. How do we share the output RDD from the first application with the second without writing to disk? (We are thinking of serializing the RDD and streaming it through Kafka, but then we lose time and all the fault tolerance that RDDs bring.) Is Tachyon the only other way? Are there other models/design patterns for applications that share RDDs, as this may be a very common use case?
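On question (1), one workaround sketch for keeping streaming data queryable from memory (not officially supported at the time of this thread): re-register each micro-batch of the output DStream under a fixed table name in a shared SQLContext, so SQL queries issued between batches always see the latest data. Assuming Spark 1.1+ (in 1.0, registerTempTable was registerAsTable); the socket source and all names are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Illustrative schema for the stream's records.
    case class Event(user: String, count: Int)

    object StreamingSqlSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("streaming-sql"))
        val ssc = new StreamingContext(sc, Seconds(10))
        val sqlContext = new SQLContext(sc)
        import sqlContext.createSchemaRDD  // implicit RDD[Product] -> SchemaRDD

        // Hypothetical input source; substitute the application's real stream.
        val lines = ssc.socketTextStream("localhost", 9999)
        val events = lines.map(_.split(",")).map(a => Event(a(0), a(1).toInt))

        // Re-register each micro-batch under a fixed table name, so SQL
        // queries issued between batches always see the latest batch.
        events.foreachRDD { rdd =>
          rdd.registerTempTable("events")
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }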