[ https://issues.apache.org/jira/browse/FLINK-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
vinoyang closed FLINK-10413.
----------------------------
    Resolution: Duplicate

> requestPartitionState messages overwhelms JM RPC main thread
> ------------------------------------------------------------
>
>                 Key: FLINK-10413
>                 URL: https://issues.apache.org/jira/browse/FLINK-10413
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.7.0
>            Reporter: Zhu Zhu
>            Assignee: vinoyang
>            Priority: Major
>
> We tried to benchmark job scheduling performance with a 2000x2000
> ALL-to-ALL streaming (EAGER) job. The input data is empty, so the tasks
> finish soon after they start.
> In this case we see slow RPC responses, and the TM/RM heartbeats to the JM
> eventually time out.
> We find ~2,000,000 requestPartitionState messages triggered by
> triggerPartitionProducerStateCheck in a short time, which overwhelm the JM RPC
> main thread. This happens because downstream tasks can start earlier than
> upstream tasks under EAGER scheduling.
>
> We suggest dropping the partition producer state check to avoid this issue.
> The task can simply wait for a while and retry if the partition does not
> exist. There are two cases in which the partition does not exist:
> # the partition has not started yet
> # the partition has failed
> In case 1, retrying works. In case 2, a task failover will soon happen and
> cancel the downstream tasks as well.
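
The proposal in the description (retry locally rather than ask the JM about the producer's state) can be illustrated with a small sketch. This is not Flink code: `PartitionNotFound`, `request_with_retry`, `make_flaky_producer`, and the backoff parameters are hypothetical names chosen for illustration, and the exponential backoff is an assumption, since the issue only says the task should wait for a while and retry.

```python
import time
from typing import Callable, Optional


class PartitionNotFound(Exception):
    """Hypothetical error: the requested partition does not exist (yet)."""


def request_with_retry(
    request_partition: Callable[[], str],
    initial_backoff_s: float = 0.1,
    max_backoff_s: float = 10.0,
    max_attempts: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry a partition request locally, without sending any message to a coordinator.

    The backoff doubles after each failed attempt and is capped at max_backoff_s.
    If max_attempts is None, the loop retries until the request succeeds.
    """
    backoff = initial_backoff_s
    attempt = 0
    while True:
        attempt += 1
        try:
            return request_partition()
        except PartitionNotFound:
            if max_attempts is not None and attempt >= max_attempts:
                raise
            sleep(backoff)
            backoff = min(backoff * 2, max_backoff_s)


def make_flaky_producer(ready_after: int) -> Callable[[], str]:
    """Simulate an upstream task whose partition exists from call number `ready_after`."""
    calls = {"n": 0}

    def request() -> str:
        calls["n"] += 1
        if calls["n"] < ready_after:
            raise PartitionNotFound()
        return "partition-data"

    return request


if __name__ == "__main__":
    # Case 1 from the description: the producer has not started yet, so retrying succeeds.
    print(request_with_retry(make_flaky_producer(ready_after=3)))
```

Case 2 from the description (the partition has failed) is not modelled here. In that case the sketch would keep retrying until something external, such as the task failover mentioned above, cancels the waiting task, or until the optional `max_attempts` limit is reached.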