Hi,

I am looking to use Spark to help execute queries against a reasonably
large dataset (1 billion rows). I'm a bit lost with all the different
libraries and add-ons for Spark, and am looking for some direction on what
I should look at and what may be helpful.

A couple of relevant points:
 - The dataset doesn't change over time.
 - There is a small number of applications (or queries, I guess, though
each is more complicated than a single SQL query) that I want to run
against it, but the parameters to those queries will change all the time.
 - The data is logically grouped per customer, and each group will
generally consist of 1-5000 rows.

I want each query to run as fast as possible (less than a second or two),
so ideally I want to keep all the records in memory, distributed over the
different nodes in the cluster. Does this mean sharing a SparkContext
between queries, or is this where HDFS comes in, or is there something else
that would be better suited?
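
Roughly what I have in mind is sketched below. This is just a guess at how
it might look with the DataFrame API; the Parquet path and the customer_id
column are placeholders I made up, and I may well be going about it wrong:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.col

  object CachedCustomerQueries {
    def main(args: Array[String]): Unit = {
      // One long-lived session/context shared by every incoming query
      val spark = SparkSession.builder()
        .appName("cached-customer-queries")
        .getOrCreate()

      // Placeholder source path and column name
      val rows = spark.read.parquet("hdfs:///data/big_dataset.parquet")
        .repartition(col("customer_id")) // co-locate each customer's 1-5000 rows
        .cache()                         // keep the whole dataset in executor memory

      rows.count() // force the cache to materialize up front

      // Each "query" would then be a filter plus the application logic,
      // parameterized by the customer id, run against the cached data.
      def runForCustomer(customerId: Long) =
        rows.filter(col("customer_id") === customerId).collect()

      runForCustomer(42L).foreach(println)

      spark.stop()
    }
  }

The idea would be that the same cached DataFrame and the same session serve
every parameterized query, so nothing is re-read from disk between requests.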

Or is there another overall approach I should look into for executing
queries in "real time" against a dataset this size?

Thanks,
Allan.
