I’ve been puzzled by this lately. I too would like to use the Thrift server to provide JDBC-style access to datasets via Spark SQL. Is this possible? The examples show temp tables created during the lifetime of a SparkContext. I assume I can use Spark SQL to query those tables while the context is active, but what happens when the context is stopped? Presumably I can no longer query that table via the Thrift server. Do I need Hive in this scenario? I don’t want to rebuild the Spark distribution unless absolutely necessary.
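For concreteness, here is a rough sketch of what I am imagining (Scala, against the 1.1-era API). The HiveThriftServer2.startWithContext call, the file path, and the FeatureWeight layout are my own assumptions rather than something I have verified end to end, so please correct me if any of it is off:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// Hypothetical row layout, for illustration only.
case class FeatureWeight(itemId: Long, feature: String, weight: Double)

object ThriftServerSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("thrift-sketch"))
    val hiveContext = new HiveContext(sc)
    import hiveContext.createSchemaRDD

    // Load the data however is convenient; this path and parsing are made up.
    val rows = sc.textFile("hdfs:///data/feature_weights.csv").map { line =>
      val Array(id, f, w) = line.split(",")
      FeatureWeight(id.toLong, f, w.toDouble)
    }

    // The temp table lives only as long as this driver's HiveContext.
    rows.registerTempTable("feature_weights")
    hiveContext.cacheTable("feature_weights")

    // Expose the same context over JDBC/ODBC from inside the app,
    // instead of launching the stock start-thriftserver.sh script.
    HiveThriftServer2.startWithContext(hiveContext)

    // Keep the driver alive; once it exits, the temp table is gone.
    Thread.currentThread().join()
  }
}

If that is roughly right, then a JDBC client such as beeline should be able to connect to this driver and query feature_weights for as long as the process stays up, and the table disappears when the context stops, which is what prompted my question above.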
From the examples, it looks like Spark SQL is syntactic sugar for manipulating an RDD, but if I need external access to this data, I need a separate store outside of Spark (Mongo/Cassandra/HDFS/etc.). Am I correct here? (Below the quoted thread I have sketched, purely as a guess, how I imagine Ron's correlation self-join running against a cached table.)

Thanks,
mn

> On Oct 27, 2014, at 7:43 PM, Ron Ayoub <ronalday...@live.com> wrote:
>
> This does look like it provides a good way to allow other processes to access
> the contents of an RDD in a separate app. Is there any other general-purpose
> mechanism for serving up RDD data? I understand that the driver app and
> workers are all app-specific and run in separate executors, but it would be
> cool if there were some general way to create a server app based on Spark.
> Perhaps Spark SQL is that general way and I'll soon find out. Thanks.
>
> From: mich...@databricks.com
> Date: Mon, 27 Oct 2014 14:35:46 -0700
> Subject: Re: Spark to eliminate full-table scan latency
> To: ronalday...@live.com
> CC: user@spark.apache.org
>
> You can access cached data in Spark through the JDBC server:
>
> http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server
>
> On Mon, Oct 27, 2014 at 1:47 PM, Ron Ayoub <ronalday...@live.com> wrote:
> We have a table containing 25 features per item id along with feature
> weights. A correlation matrix can be constructed for every feature pair based
> on co-occurrence. If a user inputs a feature, they can find the features
> correlated with it via a self-join requiring a single full table scan. This
> results in high latency for big data (10+ seconds) due to the I/O involved in
> the full table scan. My idea is that, for this feature, the data can be loaded
> into an RDD and transformations and actions can be applied to find out, per
> query, what the correlated features are.
>
> I'm pretty sure Spark can do this sort of thing. Since I'm new, what I'm not
> sure about is: is Spark appropriate as a server application? For instance,
> the driver application would have to load the RDD and then listen for requests
> and return results, perhaps using a socket? Are there any libraries to
> facilitate this sort of Spark server app? I understand how Spark can be
> used to grab data, run algorithms, and put results back, but is it appropriate
> as the engine of a server app, and what are the general patterns involved?
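As promised above, here is how I picture Ron's correlated-features query once the feature_weights temp table from my earlier sketch is registered and cached. The table name, column names, and scoring formula are all my own invention for illustration; if I have the model right, the same SQL could presumably also be sent from any JDBC client (e.g. beeline) against the Thrift server:

// Assumes the HiveContext and cached "feature_weights" table from the
// sketch earlier in this mail. Columns and scoring are hypothetical.
val correlated = hiveContext.sql("""
  SELECT a.feature AS query_feature,
         b.feature AS correlated_feature,
         SUM(a.weight * b.weight) AS score
  FROM feature_weights a
  JOIN feature_weights b
    ON a.itemId = b.itemId AND a.feature <> b.feature
  WHERE a.feature = 'some_input_feature'
  GROUP BY a.feature, b.feature
  ORDER BY score DESC
""")
correlated.collect().foreach(println)

The appeal, if this works the way I hope, is that the scan runs against the cached in-memory table rather than hitting disk on every request.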