Hi Allan,

Where is the data stored right now? If it's in a relational database, and you 
are using Spark with Hadoop, I feel like it would make sense to import the 
data into HDFS first, just because Spark would be able to read it much faster 
from there. You could use Sqoop to do that.
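A rough sketch of what that import might look like (the JDBC URL, table name,
and target directory below are just placeholders for your setup):

  sqoop import \
    --connect jdbc:mysql://dbhost/mydb \
    --username myuser -P \
    --table customer_records \
    --target-dir /data/customer_records \
    --as-parquetfile \
    --num-mappers 8

Landing it as Parquet also plays nicely with Spark SQL later on.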

In terms of having a long running Spark context, you could look into the Spark 
job server:

https://github.com/spark-jobserver/spark-jobserver/blob/master/README.md

It would let you keep a long-lived SparkContext with all the data cached in 
memory, and then accept queries via REST API calls. You would have to refresh 
the cache whenever the data changes, but since you say the dataset is static 
that shouldn't really come up.
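Roughly, the flow with the job server looks like this (the jar name, context
name, and job class below are illustrative placeholders; the endpoints are the
ones described in the README above):

  # start a long-running context that holds the cached data
  curl -d "" 'localhost:8090/contexts/query-context?num-cpu-cores=4&memory-per-node=4g'

  # upload the jar containing your query job
  curl --data-binary @target/my-queries.jar localhost:8090/jars/my-queries

  # run a parameterized query synchronously against the cached context
  curl -d "customerId=12345" \
    'localhost:8090/jobs?appName=my-queries&classPath=com.example.CustomerQueryJob&context=query-context&sync=true'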

In terms of running the queries themselves, I would think you could use Spark 
SQL and the DataFrame/Dataset API, which is built into Spark. You will also 
have to think about the best way to partition the data; since your records 
are grouped per customer, partitioning on the customer key is probably a good 
starting point, as in the sketch below.
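Here is a rough Scala sketch of the caching side (the column names and HDFS
path are made up, just to show the shape of it):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("customer-queries")
    .getOrCreate()

  // Read the data that Sqoop landed in HDFS (path is a placeholder)
  val records = spark.read.parquet("hdfs:///data/customer_records")

  // Co-locate each customer's 1-5000 rows in the same partition,
  // then pin the whole dataset in cluster memory
  val cached = records.repartition(records("customer_id")).cache()
  cached.count()  // force the cache to materialize up front

  cached.createOrReplaceTempView("records")

  // Each incoming request then becomes a query against the cached view
  def runForCustomer(id: Long) =
    spark.sql(s"SELECT * FROM records WHERE customer_id = $id").collect()

With everything cached and partitioned by customer, each per-customer query 
only has to touch a small slice of the data, which is what you need to get 
into the sub-second range.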

Here is a link to the Spark SQL docs:

http://spark.apache.org/docs/latest/sql-programming-guide.html

I hope that helps, and I'm sure other folks will have some helpful advice as 
well.

Thanks,
Subhash 

Sent from my iPhone

> On Mar 5, 2017, at 3:49 PM, Allan Richards <allan.richa...@gmail.com> wrote:
> 
> Hi,
> 
> I am looking to use Spark to help execute queries against a reasonably large 
> dataset (1 billion rows). I'm a bit lost with all the different libraries / 
> add ons to Spark, and am looking for some direction as to what I should look 
> at / what may be helpful.
> 
> A couple of relevant points:
>  - The dataset doesn't change over time. 
>  - There are a small number of applications (or queries I guess, but it's 
> more complicated than a single SQL query) that I want to run against it, but 
> the parameters to those queries will change all the time.
>  - There is a logical grouping of the data per customer, which will generally 
> consist of 1-5000 rows.
> 
> I want each query to run as fast as possible (less than a second or two). So 
> ideally I want to keep all the records in memory, but distributed over the 
> different nodes in the cluster. Does this mean sharing a SparkContext between 
> queries, or is this where HDFS comes in, or is there something else that 
> would be better suited?
> 
> Or is there another overall approach I should look into for executing queries 
> in "real time" against a dataset this size?
> 
> Thanks,
> Allan.
