Hi all, I'm processing a large number of records through an expensive operation that I'm trying to optimize. Simplified, the pipeline looks something like this:
Data: (key1, count, time)

Source
  -> Map(x => (x, newKeyList(x.key1)))
  -> FlatMap(x => x._2.map(y => (y, x._1.count, x._1.time)))
  -> KeyBy(_.key1)
  -> TumblingWindow().apply(...)
  -> Sink

In the Map -> FlatMap stage, each key is mapped to a set of keys, and each of those becomes the new key. This effectively increases the size of the stream by 16x.

What I'm trying to figure out is how to set the parallelism of my operators. I've seen comments suggesting that the source, sink and aggregation should each have different parallelism, but I'm not clear on exactly why, or what this means for CPU utilization (see for example https://stackoverflow.com/questions/48317438/unable-to-achieve-high-cpu-utilization-with-flink-and-gelly ).

It also isn't clear to me what the best way is to handle this increase in data within the stream itself.

Thanks
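To make the fan-out concrete, here is a minimal plain-Scala sketch of the Map -> FlatMap step, outside of Flink. The names `Record` and the 16-way `newKeyList` are hypothetical stand-ins for my actual types; the point is only that every input record becomes 16 re-keyed records downstream of the FlatMap:

```scala
object KeyExpansionSketch {
  // Hypothetical record shape: (key1, count, time)
  case class Record(key1: Int, count: Long, time: Long)

  // Stand-in for newKeyList: fan each key out to 16 derived keys.
  def newKeyList(key: Int): Seq[Int] = (0 until 16).map(i => key * 16 + i)

  // Equivalent of the Map -> FlatMap pair: each record is duplicated
  // once per derived key, carrying its count and time along.
  def expand(records: Seq[Record]): Seq[Record] =
    records.flatMap(r => newKeyList(r.key1).map(k => Record(k, r.count, r.time)))

  def main(args: Array[String]): Unit = {
    val in  = Seq(Record(1, 10L, 100L), Record(2, 5L, 200L))
    val out = expand(in)
    // 2 inputs * 16 derived keys = 32 records after the FlatMap
    println(out.size)
  }
}
```

So the KeyBy and window operators see 16x the records the source emits, which is why I suspect they may need higher parallelism than the source.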