Hello
I'm new to Spark, so sorry if my question looks dumb.
I have a problem which I hope to solve using Spark. Here is a short
description:
1. I have a simple stream of ~600k tuples per minute. Each tuple is a
structured metric name and its value:
(a.b.c.d, value)
(a.b.x, value)
(g.b.x, value)
2. Each minute I have to calculate ~1k real-time aggregates on this stream
and send the results to a dashboard.
3. I have a quite straightforward solution (Fn and Rn are the filter and
reduce functions for the n-th aggregate; createMetricsStream below is just
a placeholder for however the (name, value) input DStream is built):
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 60)              # 60-second batch interval
metrics = createMetricsStream(ssc).cache()  # placeholder for the (name, value) DStream
aggregate1 = metrics.filter(F1).reduce(R1)
aggregate2 = metrics.filter(F2).reduce(R2)
...
...
aggregate1.pprint()
aggregate2.pprint()
...
...
ssc.start()
ssc.awaitTermination()
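For reference, here is roughly what the elided pieces could look like; the
socket source and the names createMetricsStream/F1/R1 are only illustrative,
not my real code:

def createMetricsStream(ssc):
    # Hypothetical input: text lines like "a.b.c.d 42.0" arriving on a socket,
    # parsed into (metric_name, value) tuples.
    def parse(line):
        name, value = line.split()
        return (name, float(value))
    return ssc.socketTextStream("localhost", 9999).map(parse)

# One of the ~1k filter/reduce pairs, e.g. the sum of everything under "a.b.":
F1 = lambda t: t[0].startswith("a.b.")
R1 = lambda x, y: (x[0], x[1] + y[1])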
Tests are run on a small cluster: a single master and two workers with 4
cores each. Each aggregate takes about 1 second to compute.
As far as I understand, each minute Spark starts a separate job for each
aggregate, so if each job takes ~1 second and I need to complete 1000 jobs
every 60 seconds, I need at least 1000/60 ~= 17 workers. Is that correct?
Will the reduces run in parallel? Is there a way to calculate all of these
aggregates in a single job?
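By "a single job" I mean something like the rough sketch below: fan each
tuple out to every aggregate whose filter it passes and combine everything
with one reduceByKey. The names aggregateIdsFor and REDUCERS are made up,
and I'm assuming each Rn can be expressed as an associative merge:

def tag(t):
    # aggregateIdsFor(t) would return the ids n for which Fn(t) is True;
    # the id is kept inside the value so combine() knows which Rn to apply.
    return [(n, (n, t)) for n in aggregateIdsFor(t)]

def combine(a, b):
    n, x = a
    _, y = b
    return (n, REDUCERS[n](x, y))   # REDUCERS maps aggregate id -> Rn

results = metrics.flatMap(tag).reduceByKey(combine)
results.pprint()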
Thanks!