Re: Flink pending record metric weired after autoscaler rescaling

Gyula Fóra Fri, 12 Jan 2024 06:39:36 -0800

Could this be related to the issue reported here?
https://issues.apache.org/jira/browse/FLINK-34063


Gyula

On Wed, Jan 10, 2024 at 4:04 PM Yang LI <[email protected]> wrote:

> Just to give more context, my setup uses Apache Flink 1.18 with the
> adaptive scheduler enabled, issues happen randomly particularly
> post-restart behaviors.
>
> After each restart, the system logs indicate "Adding split(s) to reader:",
> signifying the reassignment of partitions across different TaskManagers. An
> anomaly arises with specific partitions, for example, partition-10. This
> partition does not appear in the logs immediately post-restart. It remains
> unlogged for several hours, during which no data consumption from
> partition-10 occurs. Subsequently, the logs display "Discovered new
> partitions:", and only then does the consumption of data from partition-10
> recommence.
>
> Could you provide any insights or hypotheses regarding the underlying cause
> of this delayed recognition and processing of certain partitions?
>
> Best regards,
> Yang
>
> On Mon, 8 Jan 2024 at 16:24, Yang LI <[email protected]> wrote:
>
> > Dear Flink Community,
> >
> > I've encountered an issue during the testing of my Flink autoscaler. It
> > appears that Flink is losing track of specific Kafka partitions, leading
> to
> > a persistent increase in lag on these partitions. The logs indicate a
> > 'kafka connector metricGroup name collision exception.' Notably, the
> > consumption on these Kafka partitions returns to normal after restarting
> > the Kafka broker. For context, I have enabled in-place rescaling support
> > with 'jobmanager.scheduler: Adaptive.'
> >
> > I suspect the problem may stem from:
> >
> > The in-place rescaling support triggering restarts of some taskmanagers.
> > This process might not be restarting the metric groups registered by the
> > Kafka source connector correctly, leading to a name collision exception
> and
> > preventing Flink from accurately reporting metrics related to Kafka
> > consumption.
> > A potential edge case in the metric for pending records, especially when
> > different partitions exhibit varying lags. This discrepancy might be
> > causing the pending record metric to malfunction.
> > I would appreciate your insights on these observations.
> >
> > Best regards,
> > Yang LI
> >
>

Re: Flink pending record metric weired after autoscaler rescaling

Reply via email to