[ https://issues.apache.org/jira/browse/FLINK-18203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17128954#comment-17128954 ]
Jiayi Liao commented on FLINK-18203:
------------------------------------

To let all executions share the same collection of {{OperatorStreamStateHandle}}, we should make {{OperatorStreamStateHandle}} immutable, in case someone modifies it elsewhere.

> Reduce objects usage in redistributing union states
> ---------------------------------------------------
>
>                 Key: FLINK-18203
>                 URL: https://issues.apache.org/jira/browse/FLINK-18203
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.10.1
>            Reporter: Jiayi Liao
>            Priority: Major
>
> {{RoundRobinOperatorStateRepartitioner}}#{{repartitionUnionState}} creates a
> new {{OperatorStreamStateHandle}} instance for every {{StreamStateHandle}}
> instance used in every execution, which brings the number of new
> {{OperatorStreamStateHandle}} instances up to m * n (job vertex parallelism *
> total count of {{StreamStateHandle}} across all executions).
> In fact, all executions can share the same collection of
> {{StreamStateHandle}}, and the number of {{OperatorStreamStateHandle}}
> instances can be reduced to the total count of {{StreamStateHandle}} across
> all executions.
> I ran into this problem in production while testing a job with
> parallelism=10k; the memory pressure is extremely severe when YARN
> containers die and the job starts failing over.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
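
The m * n versus n object counts described in the issue can be illustrated with a minimal, self-contained Java sketch. It does not use Flink's classes and is not the actual patch: {{Handle}}, {{repartitionNaive}} and {{repartitionShared}} are hypothetical stand-ins, and the underlying stream state handles are modelled as plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class UnionStateSharingSketch {

    /** Hypothetical stand-in for an operator-level state handle (not Flink's class). */
    static final class Handle {
        static int created = 0;          // counts constructed instances, for illustration only
        final String delegateId;         // immutable after construction

        Handle(String delegateId) {
            this.delegateId = delegateId;
            created++;
        }
    }

    /** Naive variant: every subtask gets freshly built handles, m * n objects in total. */
    static List<List<Handle>> repartitionNaive(List<String> delegates, int parallelism) {
        List<List<Handle>> result = new ArrayList<>(parallelism);
        for (int subtask = 0; subtask < parallelism; subtask++) {
            List<Handle> perSubtask = new ArrayList<>(delegates.size());
            for (String delegate : delegates) {
                perSubtask.add(new Handle(delegate));
            }
            result.add(perSubtask);
        }
        return result;
    }

    /** Shared variant: build n handles once and hand the same read-only list to every subtask. */
    static List<List<Handle>> repartitionShared(List<String> delegates, int parallelism) {
        List<Handle> handles = new ArrayList<>(delegates.size());
        for (String delegate : delegates) {
            handles.add(new Handle(delegate));
        }
        // Unmodifiable view, so no subtask can change what the others see.
        List<Handle> shared = Collections.unmodifiableList(handles);

        List<List<Handle>> result = new ArrayList<>(parallelism);
        for (int subtask = 0; subtask < parallelism; subtask++) {
            result.add(shared);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> delegates = Arrays.asList("h0", "h1", "h2");

        Handle.created = 0;
        repartitionNaive(delegates, 4);
        System.out.println("naive:  " + Handle.created);

        Handle.created = 0;
        repartitionShared(delegates, 4);
        System.out.println("shared: " + Handle.created);
    }
}
```

With parallelism 4 and 3 delegate handles, the naive variant constructs 12 {{Handle}} objects and the shared variant constructs 3. Sharing is only safe here because {{Handle}} has a final field and the list is wrapped as unmodifiable, which is the point of the comment above.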