Hello
I'm new to Spark, so sorry if my question looks dumb.
I have a problem which I hope to solve using Spark. Here is a short
description:
1. I have a simple stream of ~600k tuples per minute. Each tuple is a
structured metric name and its value:
(a.b.c.d, value)
(a.b.x, value)
(g.b.x, value)
2. Each minute I have to calculate ~1k real-time aggregates on this stream
and send the results to a dashboard.
3. I have a quite straightforward solution (Fn and Rn are the filter and
reduce functions for the n-th aggregate; createMetricsStream below is just
a placeholder for however the (name, value) input DStream is built):
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 60)              # 60-second batch interval
metrics = createMetricsStream(ssc).cache()  # placeholder for the (name, value) DStream
aggregate1 = metrics.filter(F1).reduce(R1)
aggregate2 = metrics.filter(F2).reduce(R2)
...
...
aggregate1.pprint()
aggregate2.pprint()
...
...
ssc.start()
ssc.awaitTermination()
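For reference, here is roughly what the elided pieces could look like; the
socket source and the names createMetricsStream/F1/R1 are only illustrative,
not my real code:

def createMetricsStream(ssc):
    # Hypothetical input: text lines like "a.b.c.d 42.0" arriving on a socket,
    # parsed into (metric_name, value) tuples.
    def parse(line):
        name, value = line.split()
        return (name, float(value))
    return ssc.socketTextStream("localhost", 9999).map(parse)

# One of the ~1k filter/reduce pairs, e.g. the sum of everything under "a.b.":
F1 = lambda t: t[0].startswith("a.b.")
R1 = lambda x, y: (x[0], x[1] + y[1])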
Tests are run on a small cluster: a single master and two workers with 4
cores each. Each aggregate takes about 1 second to compute.
As far as I understand, each minute Spark starts a separate job for each
aggregate, so if each job takes ~1 second and I need to complete 1000 jobs
every 60 seconds, I need at least 1000/60 ~= 17 workers. Is that correct?
Will the reduces run in parallel? Is there a way to calculate all of these
aggregates in a single job?
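By "a single job" I mean something like the rough sketch below: fan each
tuple out to every aggregate whose filter it passes and combine everything
with one reduceByKey. The names aggregateIdsFor and REDUCERS are made up,
and I'm assuming each Rn can be expressed as an associative merge:

def tag(t):
    # aggregateIdsFor(t) would return the ids n for which Fn(t) is True;
    # the id is kept inside the value so combine() knows which Rn to apply.
    return [(n, (n, t)) for n in aggregateIdsFor(t)]

def combine(a, b):
    n, x = a
    _, y = b
    return (n, REDUCERS[n](x, y))   # REDUCERS maps aggregate id -> Rn

results = metrics.flatMap(tag).reduceByKey(combine)
results.pprint()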
Thanks!