I agree with the others that a dedicated NoSQL datastore can make sense. You
should also look at the lambda architecture paradigm. Keep in mind that more
memory does not necessarily mean more performance; what matters is having the
right data structure for your users' queries.
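
To make the data-structure point concrete, here is a rough sketch in Spark
(Scala). The Parquet path, table name, and customer_id column are only
placeholders for whatever your data actually looks like. Since the dataset
never changes, you can load it once, cluster it by the customer key, and keep
it cached in one long-lived SparkSession that all your queries share:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("customer-queries").getOrCreate()

  // Load the static dataset once; it never changes, so read it a single time.
  val df = spark.read.parquet("hdfs:///data/customers.parquet")

  // Cluster rows by the customer key so that each per-customer query touches
  // only a few partitions, then pin the result in cluster memory.
  val byCustomer = df.repartition(df("customer_id")).cache()
  byCustomer.createOrReplaceTempView("customers")

  // Each application-level query reuses the same cached view; only the
  // parameters change between runs.
  def rowsFor(customerId: Long) =
    spark.sql(s"SELECT * FROM customers WHERE customer_id = $customerId")

With a layout like this, each parameterised query reads only the 1-5000 rows
of one customer instead of scanning the full billion.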
Additionally, if your queries run over the whole dataset and you want
response times of around two seconds, you should look at databases that
aggregate over samples of the data (cf.
https://jornfranke.wordpress.com/2015/06/28/big-data-what-is-next-oltp-olap-predictive-analytics-sampling-and-probabilistic-databases).
For example, Hive has had TABLESAMPLE functionality for a long time.
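
Spark itself can do something similar with the DataFrame sample()
transformation. Continuing the sketch above (the "amount" column is again
just a placeholder), an approximate aggregate over roughly 1% of the rows
might look like:

  import org.apache.spark.sql.functions.avg

  // Aggregate over a ~1% sample (drawn without replacement) instead of
  // all rows; "customers" is the cached view from the earlier sketch.
  val approxAvg = spark.table("customers")
    .sample(withReplacement = false, fraction = 0.01)
    .agg(avg("amount").as("approx_avg_amount"))
  approxAvg.show()

The answer is approximate, but the scan touches only about 1% of the data,
which is what makes second-level response times over a billion rows plausible.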

> On 5 Mar 2017, at 21:49, Allan Richards <allan.richa...@gmail.com> wrote:
> 
> Hi,
> 
> I am looking to use Spark to help execute queries against a reasonably large 
> dataset (1 billion rows). I'm a bit lost with all the different libraries / 
> add-ons to Spark, and am looking for some direction as to what I should look 
> at / what may be helpful.
> 
> A couple of relevant points:
>  - The dataset doesn't change over time. 
>  - There are a small number of applications (or queries I guess, but it's 
> more complicated than a single SQL query) that I want to run against it, but 
> the parameters to those queries will change all the time.
>  - There is a logical grouping of the data per customer, which will generally 
> consist of 1-5000 rows.
> 
> I want each query to run as fast as possible (less than a second or two). So 
> ideally I want to keep all the records in memory, but distributed over the 
> different nodes in the cluster. Does this mean sharing a SparkContext between 
> queries, or is this where HDFS comes in, or is there something else that 
> would be better suited?
> 
> Or is there another overall approach I should look into for executing queries 
> in "real time" against a dataset this size?
> 
> Thanks,
> Allan.
