I’ve been puzzled by this lately. I too would like to use the Thrift server to provide JDBC-style access to datasets via Spark SQL. Is this possible? The examples show temp tables created during the lifetime of a SparkContext. I assume I can use Spark SQL to query those tables while the context is active, but what happens when the context is stopped? Presumably I can no longer query that table via the Thrift server. Do I need Hive in this scenario? I don’t want to rebuild the Spark distribution unless absolutely necessary.
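For concreteness, here is a rough sketch of what I am imagining (Scala, against the 1.1-era API). The HiveThriftServer2.startWithContext call, the file path, and the FeatureWeight layout are my own assumptions rather than something I have verified end to end, so please correct me if any of it is off:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// Hypothetical row layout, for illustration only.
case class FeatureWeight(itemId: Long, feature: String, weight: Double)

object ThriftServerSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("thrift-sketch"))
    val hiveContext = new HiveContext(sc)
    import hiveContext.createSchemaRDD

    // Load the data however is convenient; this path and parsing are made up.
    val rows = sc.textFile("hdfs:///data/feature_weights.csv").map { line =>
      val Array(id, f, w) = line.split(",")
      FeatureWeight(id.toLong, f, w.toDouble)
    }

    // The temp table lives only as long as this driver's HiveContext.
    rows.registerTempTable("feature_weights")
    hiveContext.cacheTable("feature_weights")

    // Expose the same context over JDBC/ODBC from inside the app,
    // instead of launching the stock start-thriftserver.sh script.
    HiveThriftServer2.startWithContext(hiveContext)

    // Keep the driver alive; once it exits, the temp table is gone.
    Thread.currentThread().join()
  }
}

If that is roughly right, then a JDBC client such as beeline should be able to connect to this driver and query feature_weights for as long as the process stays up, and the table disappears when the context stops, which is what prompted my question above.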
From the examples, it looks like Spark SQL is syntactic sugar for manipulating an RDD, but if I need external access to this data, I need a separate store outside of Spark (Mongo/Cassandra/HDFS/etc.). Am I correct here? (Below the quoted thread I have sketched, purely as a guess, how I imagine Ron's correlation self-join running against a cached table.)

Thanks,
mn

> On Oct 27, 2014, at 7:43 PM, Ron Ayoub <ronalday...@live.com> wrote:
>
> This does look like it provides a good way to allow other processes to access
> the contents of an RDD in a separate app. Is there any other general-purpose
> mechanism for serving up RDD data? I understand that the driver app and
> workers are all app-specific and run in separate executors, but it would be
> cool if there were some general way to create a server app based on Spark.
> Perhaps Spark SQL is that general way and I'll soon find out. Thanks.
>
> From: mich...@databricks.com
> Date: Mon, 27 Oct 2014 14:35:46 -0700
> Subject: Re: Spark to eliminate full-table scan latency
> To: ronalday...@live.com
> CC: user@spark.apache.org
>
> You can access cached data in Spark through the JDBC server:
>
> http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server
>
> On Mon, Oct 27, 2014 at 1:47 PM, Ron Ayoub <ronalday...@live.com> wrote:
> We have a table containing 25 features per item id along with feature
> weights. A correlation matrix can be constructed for every feature pair based
> on co-occurrence. If a user inputs a feature, they can find the features
> correlated with it via a self-join requiring a single full table scan. This
> results in high latency for big data (10+ seconds) due to the I/O involved in
> the full table scan. My idea is that, for this feature, the data can be loaded
> into an RDD and transformations and actions can be applied to find out, per
> query, what the correlated features are.
>
> I'm pretty sure Spark can do this sort of thing. Since I'm new, what I'm not
> sure about is: is Spark appropriate as a server application? For instance,
> the driver application would have to load the RDD and then listen for requests
> and return results, perhaps using a socket? Are there any libraries to
> facilitate this sort of Spark server app? I understand how Spark can be
> used to grab data, run algorithms, and put results back, but is it appropriate
> as the engine of a server app, and what are the general patterns involved?
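As promised above, here is how I picture Ron's correlated-features query once the feature_weights temp table from my earlier sketch is registered and cached. The table name, column names, and scoring formula are all my own invention for illustration; if I have the model right, the same SQL could presumably also be sent from any JDBC client (e.g. beeline) against the Thrift server:

// Assumes the HiveContext and cached "feature_weights" table from the
// sketch earlier in this mail. Columns and scoring are hypothetical.
val correlated = hiveContext.sql("""
  SELECT a.feature AS query_feature,
         b.feature AS correlated_feature,
         SUM(a.weight * b.weight) AS score
  FROM feature_weights a
  JOIN feature_weights b
    ON a.itemId = b.itemId AND a.feature <> b.feature
  WHERE a.feature = 'some_input_feature'
  GROUP BY a.feature, b.feature
  ORDER BY score DESC
""")
correlated.collect().foreach(println)

The appeal, if this works the way I hope, is that the scan runs against the cached in-memory table rather than hitting disk on every request.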