Hi Jason,

Could you describe your topology? Are you writing to Kafka? Are you using
exactly-once semantics? Are you seeing any warnings?
If so, one thing that immediately comes to mind is
transaction.max.timeout.ms. If the producer's transaction timeout in Flink
(1 hour by default) is higher than what the Kafka brokers allow, the job may
run into indefinite restart loops in rare cases.

"Kafka brokers by default have transaction.max.timeout.ms set to 15
minutes. This property will not allow to set transaction timeouts for the
producers larger than its value. FlinkKafkaProducer011 by default sets the
transaction.timeout.ms property in producer config to 1 hour, thus
transaction.max.timeout.ms should be increased before using the
Semantic.EXACTLY_ONCE mode."
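
If raising the broker setting is not an option, the other direction should
also work: lower transaction.timeout.ms on the producer side so that it fits
within the broker limit. Below is a minimal sketch of that, assuming the
brokers still use the 15 minute default. The class name, the helper
producerProps, and the localhost:9092 address are made up for illustration;
only the two property keys are real Kafka settings.

```java
import java.util.Properties;

public class KafkaTxTimeoutSketch {
    // Assumption: brokers run with the default transaction.max.timeout.ms (15 minutes).
    static final long BROKER_MAX_TX_TIMEOUT_MS = 15 * 60 * 1000L;

    // Hypothetical helper: builds a producer config whose transaction timeout
    // is set explicitly instead of relying on Flink's 1 hour default.
    static Properties producerProps(String bootstrapServers, long txTimeoutMs) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("transaction.timeout.ms", Long.toString(txTimeoutMs));
        return props;
    }

    public static void main(String[] args) {
        Properties props = producerProps("localhost:9092", BROKER_MAX_TX_TIMEOUT_MS);
        System.out.println("transaction.timeout.ms = "
                + props.getProperty("transaction.timeout.ms"));
        // These properties would then be handed to the producer, roughly:
        //   new FlinkKafkaProducer011<>(topic, schema, props,
        //           FlinkKafkaProducer011.Semantic.EXACTLY_ONCE);
        // (topic and schema are placeholders for your own values.)
    }
}
```

Keep in mind that the transaction timeout should still comfortably exceed
your checkpoint interval plus the longest downtime you expect, otherwise
open transactions can expire before they are committed.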

Best,

Arvid

On Fri, Jan 24, 2020 at 2:47 AM Jason Kania <jason.ka...@ymail.com> wrote:

> I am attempting to migrate from 1.7.1 to 1.9.1 and I have hit a problem
> where previously working jobs can no longer launch after being submitted.
> In the UI, the submitted jobs show up as deploying for a period, then go
> into a run state before returning to the deploy state and this repeats
> regularly with the job bouncing between states. No exceptions or errors are
> visible in the logs. There is no data coming in for the job to process and
> the Kafka queues are empty.
>
> If I look at the thread activity of the task manager running the job in
> top, I see that the busiest threads are flink-akka threads, sometimes
> jumping to very high CPU numbers. That is all I have for info.
>
> Any suggestions on how to debug this? I can set breakpoints and connect
> a debugger if that helps; I am just not sure at this point where to start.
>
> Thanks,
>
> Jason
>
