Could this be related to the issue reported here? https://issues.apache.org/jira/browse/FLINK-34063
Gyula On Wed, Jan 10, 2024 at 4:04 PM Yang LI <yang.hunter...@gmail.com> wrote: > Just to give more context, my setup uses Apache Flink 1.18 with the > adaptive scheduler enabled, issues happen randomly particularly > post-restart behaviors. > > After each restart, the system logs indicate "Adding split(s) to reader:", > signifying the reassignment of partitions across different TaskManagers. An > anomaly arises with specific partitions, for example, partition-10. This > partition does not appear in the logs immediately post-restart. It remains > unlogged for several hours, during which no data consumption from > partition-10 occurs. Subsequently, the logs display "Discovered new > partitions:", and only then does the consumption of data from partition-10 > recommence. > > Could you provide any insights or hypotheses regarding the underlying cause > of this delayed recognition and processing of certain partitions? > > Best regards, > Yang > > On Mon, 8 Jan 2024 at 16:24, Yang LI <yang.hunter...@gmail.com> wrote: > > > Dear Flink Community, > > > > I've encountered an issue during the testing of my Flink autoscaler. It > > appears that Flink is losing track of specific Kafka partitions, leading > to > > a persistent increase in lag on these partitions. The logs indicate a > > 'kafka connector metricGroup name collision exception.' Notably, the > > consumption on these Kafka partitions returns to normal after restarting > > the Kafka broker. For context, I have enabled in-place rescaling support > > with 'jobmanager.scheduler: Adaptive.' > > > > I suspect the problem may stem from: > > > > The in-place rescaling support triggering restarts of some taskmanagers. > > This process might not be restarting the metric groups registered by the > > Kafka source connector correctly, leading to a name collision exception > and > > preventing Flink from accurately reporting metrics related to Kafka > > consumption. > > A potential edge case in the metric for pending records, especially when > > different partitions exhibit varying lags. This discrepancy might be > > causing the pending record metric to malfunction. > > I would appreciate your insights on these observations. > > > > Best regards, > > Yang LI > > >