Thanks for the feedback everyone. We've looked at different SQL-based
solutions and got good performance out of them, but some of the reports we
generate can't be expressed as a single SQL query. This is just an
investigation to see whether Spark is a viable alternative.

I've got another question (also asked on Stack Overflow
<http://stackoverflow.com/questions/42661350/spark-jobserver-very-large-task-size>).
Basically I'm seeing (proportionally) large task deserialisation times, and
am wondering why. I'm using jobserver and reusing an existing context and
RDD, so I believe all the data should already be cached on the executors. I
would have thought the serialised task contains little more than the query
to execute, the partition ID and the RDD ID (the jar should also have been
pushed across already?), so it should be very lightweight?
An example of timings:
Scheduler delay: 7ms
Task deserialization time: 19ms
Executor computing time: 4ms
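
For reference, this is roughly the shape of the job (a heavily simplified
sketch, not the real code: the job name, record type, RDD name and query
logic are placeholders, and it assumes the older jobserver SparkJob /
NamedRddSupport API):

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}

// Placeholder record type standing in for the real rows.
case class CustomerRow(customerId: Long, value: Double)

object ReportJob extends SparkJob with NamedRddSupport {

  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    // Query parameters arrive via the job config on each request.
    val customerId = config.getLong("customerId")

    // Re-use the RDD that an earlier job cached and named in this context;
    // "customer-rows" is a placeholder for the actual named RDD.
    val rows: RDD[CustomerRow] =
      namedRdds.get[CustomerRow]("customer-rows")
        .getOrElse(sys.error("cached RDD not found"))

    // The per-task work itself is tiny: filter one customer's rows and aggregate.
    rows.filter(_.customerId == customerId).map(_.value).sum()
  }
}

So the computation each task does is small, which is why the
deserialisation time stands out to me.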

Thanks,
Allan.

On Mon, Mar 6, 2017 at 6:05 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> I agree with the others that a dedicated NoSQL datastore can make sense.
> You should look at the lambda architecture paradigm. Keep in mind that more
> memory does not necessarily mean more performance; what matters is having
> the right data structure for the queries of your users. Additionally, if
> your queries are executed over the whole dataset and you want answer times
> of around 2 seconds, you should look at databases that do aggregations on
> samples of the data (cf.
> https://jornfranke.wordpress.com/2015/06/28/big-data-what-is-next-oltp-olap-predictive-analytics-sampling-and-probabilistic-databases).
> E.g. Hive has had TABLESAMPLE functionality for a long time.
>
> On 5 Mar 2017, at 21:49, Allan Richards <allan.richa...@gmail.com> wrote:
>
> Hi,
>
> I am looking to use Spark to help execute queries against a reasonably
> large dataset (1 billion rows). I'm a bit lost with all the different
> libraries / add-ons to Spark, and am looking for some direction as to what
> I should look at / what might be helpful.
>
> A few relevant points:
>  - The dataset doesn't change over time.
>  - There are a small number of applications (or queries I guess, but it's
> more complicated than a single SQL query) that I want to run against it,
> but the parameters to those queries will change all the time.
>  - There is a logical grouping of the data per customer, which will
> generally consist of 1-5000 rows.
>
> I want each query to run as fast as possible (less than a second or two).
> So ideally I want to keep all the records in memory, but distributed over
> the different nodes in the cluster. Does this mean sharing a SparkContext
> between queries, or is this where HDFS comes in, or is there something else
> that would be better suited?
>
> Or is there another overall approach I should look into for executing
> queries in "real time" against a dataset this size?
>
> Thanks,
> Allan.
>
>
