Hi Jason,
A job that reshuffles its data several times can still scale under normal
circumstances.
But we should be careful to avoid data skew: if the input stream is skewed
on one of the keys, adding more resources will not help, because all records
with the same key are still sent to the same subtask.
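
To make the skew point concrete, here is a minimal sketch in plain Java (no
Flink dependency). It assumes records are routed to a subtask by
hashCode modulo parallelism, which is a simplification of Flink's real
key-group assignment. The class name SkewSketch, the method maxSubtaskLoad
and the key counts are all made up for illustration.

```java
class SkewSketch {
    // Route each key to one subtask (simplified: hashCode modulo parallelism,
    // an assumption, not Flink's actual key-group hashing) and return the
    // record count of the busiest subtask.
    static long maxSubtaskLoad(String[] keys, long[] recordsPerKey, int parallelism) {
        long[] load = new long[parallelism];
        for (int i = 0; i < keys.length; i++) {
            int subtask = Math.floorMod(keys[i].hashCode(), parallelism);
            load[subtask] += recordsPerKey[i];
        }
        long max = 0;
        for (long l : load) {
            max = Math.max(max, l);
        }
        return max;
    }

    public static void main(String[] args) {
        // Hypothetical workload: one hot key with 900 records, 100 keys with 1 record each.
        String[] keys = new String[101];
        long[] counts = new long[101];
        keys[0] = "hot-key";
        counts[0] = 900;
        for (int i = 1; i <= 100; i++) {
            keys[i] = "key-" + i;
            counts[i] = 1;
        }
        // The busiest subtask always holds the hot key, so its load stays at
        // 900 or more no matter how far the parallelism is raised.
        for (int p : new int[] {4, 16, 64}) {
            System.out.println("parallelism=" + p + " maxLoad=" + maxSubtaskLoad(keys, counts, p));
        }
    }
}
```

Whatever the parallelism, the busiest subtask never drops below the size of
the hot key, which is why extra resources alone do not fix skew.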
Besides that, if the order of the functions can be adjusted, we could put
the keyed process function with the lowest selectivity first, so that the
later shuffles carry fewer records. Selectivity here means the ratio of
output records to input records: the lower that ratio, the lower the
selectivity.
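
As a rough illustration of the ordering idea, the sketch below (plain Java,
no Flink dependency) counts how many records pass through shuffles for a
given order of stages. It assumes the stages are independent, so they can be
reordered freely, and that each stage has a fixed selectivity. The class
name StageOrderSketch, its methods and the numbers are hypothetical.

```java
import java.util.Arrays;

class StageOrderSketch {
    // Total number of records shuffled when keyed stages run in the given order.
    // selectivities[i] = output records / input records of stage i.
    static double totalShuffled(double inputRecords, double[] selectivities) {
        double total = 0.0;
        double flowing = inputRecords;
        for (double s : selectivities) {
            total += flowing; // every record entering a keyed stage is shuffled once
            flowing *= s;     // only a fraction s continues to the next stage
        }
        return total;
    }

    // Order stages so that the lowest selectivity comes first.
    static double[] lowestSelectivityFirst(double[] selectivities) {
        double[] sorted = selectivities.clone();
        Arrays.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        double[] asWritten = {1.0, 0.5, 0.1}; // hypothetical selectivities
        double[] reordered = lowestSelectivityFirst(asWritten);
        System.out.println("as written: " + totalShuffled(1000, asWritten));
        System.out.println("reordered:  " + totalShuffled(1000, reordered));
    }
}
```

With these made-up numbers the reordered pipeline shuffles fewer than half
as many records, because the most selective stage trims the stream before
the remaining shuffles.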

Best,
JING ZHANG


Jason Liu <jasonli...@ucla.edu> wrote on Tue, Aug 31, 2021 at 8:12 AM:

> Hi there,
>
>     We have this use case where we need to have multiple keyBy operators,
> each with its own MapState and all with different keys, in a single Flink
> app. This obviously means we'll be reshuffling our data a lot.
>     Our TPS is around 1-2k, with ~2kb per event and we use Kinesis Data
> Analytics as the infrastructure (running roughly on ~128 KPU of hardware).
> I'm currently in the design phase of this system and just wondering if we
> can put the data through 4-5 keyed process functions all with different key
> bys and if it can be scalable with a large enough Flink cluster. I don't
> think we can get around this requirement much (other than replicating
> data). Alternatively, we can just run multiple small Flink clusters, each
> with its own unique keyBys but I'm not sure if or how much that'll help.
>      Thanks for any potential insights!
>
> -Jason
>
