It might be related to this issue: https://issues.apache.org/jira/browse/FLINK-10774
On Tue, Mar 26, 2019 at 4:35 PM Fritz Budiyanto <fbudi...@icloud.com> wrote:
> Hi All,
>
> We're using Flink 1.4.2 and noticed many dangling connections to Kafka
> after job deletion/recreation. The trigger is job cancellation/failure
> caused by a network-down event, followed by job recreation.
>
> Our Flink job has checkpointing disabled, and upon job failure (due to
> network failure) the job was deleted and re-created. Network failure
> events disrupted communication between the task managers, and between
> the task managers and the job manager. Our custom job controller
> monitored this condition, tried to cancel the job, and then recreated
> it (after a minute or so).
>
> Because of the network failure, the above steps were repeated many
> times, and eventually the flink-docker-container's socket file
> descriptors were exhausted. There were many Kafka connections from
> flink-task-manager to the local Kafka broker:
>
> netstat -ntap | grep 9092 | grep java | wc -l
> 2235
>
> Is this a known issue that is already fixed in a later release? If so,
> could someone point me to the Jira link? If this is a new issue, could
> someone let me know how to move forward and debug it? It looks like the
> Kafka consumers were not cleaned up properly upon job cancellation.
>
> Thanks,
> Fritz
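For anyone reproducing the count above, here is a self-contained sketch of the same netstat pipeline run against a fabricated two-line sample (the sample output, port, and process name are assumptions for illustration; on a real host you would pipe `netstat -ntap` directly):

```shell
#!/bin/sh
# Fabricated netstat-style sample: two established connections from a
# java process to a broker on port 9092 (illustrative, not real output).
sample='tcp 0 0 10.0.0.1:45678 10.0.0.2:9092 ESTABLISHED 1234/java
tcp 0 0 10.0.0.1:45679 10.0.0.2:9092 ESTABLISHED 1234/java'

# Same filter chain as in the report: keep lines mentioning the broker
# port and the java process, then count them.
count=$(printf '%s\n' "$sample" | grep 9092 | grep java | wc -l)
echo "$count"
```

Tracking this number over repeated cancel/recreate cycles (rather than reading it once) is what makes the leak visible: a healthy job should hold a roughly constant number of broker sockets, while a leak grows monotonically.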