Hi Jason, could you describe your topology? Are you writing to Kafka? Are you using exactly-once? Are you seeing any warnings? If so, one thing that immediately comes to mind is transaction.max.timeout.ms. If the value in Flink (by default 1 hour) is higher than what the Kafka brokers allow, the job may run into indefinite restart loops in rare cases.
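For illustration, aligning the two settings on the producer side might look like the sketch below. This is a minimal, hypothetical example: the property keys are the standard Kafka producer config names, and 900000 ms (15 minutes) is chosen to match the broker default for transaction.max.timeout.ms. The resulting Properties would be passed to the FlinkKafkaProducer011 constructor; alternatively, the broker-side limit can be raised in the brokers' server.properties instead.

```java
import java.util.Properties;

public class KafkaTxConfig {

    // Build producer properties whose transaction timeout fits within the
    // broker-side transaction.max.timeout.ms (hypothetical values; adjust
    // bootstrap.servers and the timeout to your environment).
    static Properties producerConfig() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        // Must not exceed the brokers' transaction.max.timeout.ms,
        // which defaults to 15 minutes (900000 ms).
        props.setProperty("transaction.timeout.ms", "900000");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(
            producerConfig().getProperty("transaction.timeout.ms"));
    }
}
```

If you instead keep Flink's 1-hour default, the brokers' transaction.max.timeout.ms must be raised to at least 3600000 ms before enabling Semantic.EXACTLY_ONCE.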
"Kafka brokers by default have transaction.max.timeout.ms set to 15 minutes. This property will not allow setting transaction timeouts for the producers larger than its value. FlinkKafkaProducer011 by default sets the transaction.timeout.ms property in producer config to 1 hour, thus transaction.max.timeout.ms should be increased before using the Semantic.EXACTLY_ONCE mode."

Best,
Arvid

On Fri, Jan 24, 2020 at 2:47 AM Jason Kania <jason.ka...@ymail.com> wrote:

> I am attempting to migrate from 1.7.1 to 1.9.1 and I have hit a problem
> where previously working jobs can no longer launch after being submitted.
> In the UI, the submitted jobs show up as deploying for a period, then go
> into a run state before returning to the deploy state, and this repeats
> regularly with the job bouncing between states. No exceptions or errors are
> visible in the logs. There is no data coming in for the job to process and
> the Kafka queues are empty.
>
> If I look at the thread activity of the task manager running the job in
> top, I see that the busiest threads are flink-akka threads, sometimes
> jumping to very high CPU numbers. That is all I have for info.
>
> Any suggestions on how to debug this? I can set break points and connect
> if that helps, just not sure at this point where to start.
>
> Thanks,
>
> Jason