Hello

I'm new to Spark, so I'm sorry if my question sounds dumb.
I have a problem that I hope to solve with Spark. Here is a short description:

1. I have a simple stream of 600k tuples per minute. Each tuple is a structured metric name and its value:
    (a.b.c.d, value)
    (a.b.x, value)
    (g.b.x, value)

2. Each minute I have to calculate ~1k real-time aggregates on this stream and send the results to a dashboard.

3. I have a fairly straightforward solution (Fn and Rn are the filter and reduce functions for the n-th aggregate):

    ssc = StreamingContext(sc, 60)     # 60-second batches
    metrics = ...                      # DStream of (name, value) tuples
    metrics.cache()
    aggregate1 = metrics.filter(F1).reduce(R1)
    aggregate2 = metrics.filter(F2).reduce(R2)
    ...
    aggregate1.pprint()
    aggregate2.pprint()
    ...
    ssc.start()
    ssc.awaitTermination()

Tests are run on a small cluster: a single master and two workers with 4 cores each. Each aggregate takes about 1 second to complete.

As far as I understand, each minute Spark starts a separate job for each aggregate, so to finish 1000 one-second jobs within 60 seconds I would need at least 1000/60 ≈ 17 workers. Is that correct? Will the reduces run in parallel?
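One thing I considered for the parallelism part is submitting the per-aggregate actions from separate threads inside foreachRDD, but I'm not sure it's the right approach. A rough sketch of what I mean (process_batch, filters_and_reducers and the pool size of 16 are just illustrative names/values; F1, R1, etc. are the functions from the snippet above):

    from concurrent.futures import ThreadPoolExecutor

    # Sketch: run the filter+reduce actions for a batch concurrently, so the
    # scheduler can overlap them instead of running one job after another.
    filters_and_reducers = [(F1, R1), (F2, R2)]  # ... up to ~1000 pairs

    def process_batch(rdd):
        rdd.cache()
        with ThreadPoolExecutor(max_workers=16) as pool:
            # default args bind f and r at submission time
            futures = [pool.submit(lambda f=f, r=r: rdd.filter(f).reduce(r))
                       for f, r in filters_and_reducers]
            # note: reduce() raises on an empty filtered RDD, ignored here
            results = [fut.result() for fut in futures]
        rdd.unpersist()
        print(results)  # or push to the dashboard

    metrics.foreachRDD(process_batch)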

Is there a way to calculate all of these aggregates in a single job?
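For example, I was wondering whether tagging each record with the ids of the aggregates it belongs to and then doing a single reduceByKey per batch would work. Just a sketch (specs, tag and combine are names I made up; it assumes each Fn is a per-record predicate and each Rn is an associative, commutative function of two records):

    # Sketch: emit one (aggregate_id, record) pair per matching filter, then
    # fold the whole batch with a single reduceByKey. The aggregate id is also
    # kept inside the value because the reduce function does not see the key.
    specs = {"agg1": (F1, R1), "agg2": (F2, R2)}  # ... ~1000 entries

    def tag(record):
        return [(agg_id, (agg_id, record))
                for agg_id, (flt, _) in specs.items()
                if flt(record)]

    def combine(a, b):
        agg_id, rec_a = a
        _, rec_b = b
        return (agg_id, specs[agg_id][1](rec_a, rec_b))

    aggregates = metrics.flatMap(tag).reduceByKey(combine)
    aggregates.pprint()

Would that turn the ~1000 jobs per batch into a single shuffle, or am I missing something?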

Thanks!
