Hi,
It would help with understanding the problem if you could share the
logs.

Thank you~

Xintong Song



On Wed, Jan 15, 2020 at 12:23 AM burgesschen <tchen...@bloomberg.net> wrote:

> Hi guys,
>
> Our team is observing a stability issue on our standalone Flink clusters.
>
> Background: The Kafka cluster our Flink jobs read from/write to has some
> issues, and every 10 to 15 minutes one of the partition leaders switches.
> This causes the jobs that read from/write to that topic to fail and
> restart. Usually this is not a problem, since the jobs can restart and
> work with the new partition leader. However, one of those restarts can put
> the jobs into a failing state from which they never recover.
>
> In the failing state, the JobManager reports this exception:
>
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Could not allocate all requires slots within timeout of 300000 ms. Slots
> required: 24, slots allocated: 12
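>
> For reference, the 300000 ms in that message appears to come from the
> slot.request.timeout setting (default 300000 ms). As a stopgap we can
> raise it in flink-conf.yaml, though that only buys more time and would not
> fix a real slot leak:
>
>     # flink-conf.yaml -- allow 10 minutes instead of 5 for slot requests
>     slot.request.timeout: 600000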
>
> During that time, 2 of the TaskManagers report that all the slots on them
> are occupied; however, the JobManager dashboard shows no job deployed to
> those 2 TaskManagers.
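>
> In case it helps, this is how we cross-check the two views, assuming the
> default REST port 8081 (the response lists slotsNumber and freeSlots per
> TaskManager in the version we run):
>
>     # compare each TaskManager's total vs. free slots as the JobManager sees them
>     curl -s http://<jobmanager-host>:8081/taskmanagers
>
> For the 2 TaskManagers in question, this shows 0 free slots even though
> nothing is running on them.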
>
> My guess is that, since the jobs restart fairly frequently, during one of
> those restarts the slots were not released properly when the jobs failed,
> leaving the JobManager falsely believing that the slots on those 2
> TaskManagers are still occupied.
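>
> We have been trying to confirm this from the TaskManager logs with roughly
> the grep below; the exact log wording is our assumption and may differ
> between versions:
>
>     # look for slot allocation/release activity around the failure time
>     grep -iE "slot.*(alloc|free|releas)" taskmanager.log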
>
> It does sound like the issue described in
> https://issues.apache.org/jira/browse/FLINK-9932,
> but we are using 1.6.2, and according to the JIRA ticket that bug was
> fixed in 1.6.2.
>
> Please let me know if you have any ideas on what causes this or how we
> can prevent it. Thank you so much!
>
