Hi Ori,

Thanks for reaching out! I'm afraid there's not much we can help with
here. As you mentioned, it looks like there's a network issue, which would
be on the Google side of things. I'm assuming the mentioned Flink version
corresponds to Flink 1.12 [1], which is no longer supported by the Flink
community. Are you restarting the job from a savepoint, or starting fresh
without any state?
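
If you do have a savepoint to restore from, resuming via the CLI would
roughly look like this (the GCS path and jar name are just placeholders):

  # resume the job from an existing savepoint, in detached mode
  ./bin/flink run -s gs://<bucket>/savepoints/savepoint-<id> -d your-job.jar

If the job's state layout has changed since the savepoint was taken, adding
-n/--allowNonRestoredState lets the restore skip state that can no longer
be mapped to an operator.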

Best regards,

Martijn

[1]
https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.0

On Sun, Oct 2, 2022 at 3:38 AM Ori Popowski <ori....@gmail.com> wrote:

> Hi,
>
> We're using Flink 2.10.2 on Google Dataproc.
>
> Lately we've been experiencing a very unusual problem: the job fails, and
> when it tries to recover we get this error:
>
> Slot request bulk is not fulfillable! Could not allocate the required slot
> within slot request timeout
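>
> If it helps, I believe the timeout in that message is Flink's
> slot.request.timeout, which in flink-conf.yaml is set with something like
> the following (300000 ms being, as far as I can tell, the default):
>
> slot.request.timeout: 300000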
>
> I investigated what happened and I saw that the failure is caused by a
> heartbeat timeout to one of the containers. I looked at the container's
> logs and I saw something unusual:
>
>    1. Eight minutes before the heartbeat timeout, the logs show connection
>    problems to the Confluent Kafka topic and also to Datadog, which suggests
>    a network issue affecting either the whole node or just that specific
>    container.
>    2. The container logs disappear at this point, but the node logs show
>    multiple Garbage Collection pauses, ranging from 10 seconds to 215 (!)
>    seconds.
>
> It looks like right after the network issue the node itself gets into an
> endless GC phase, and my theory is that the slot requests are not
> fulfillable because the node becomes unavailable while it's stuck in GC.
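>
> Would it make sense to enable explicit GC logging on the TaskManager JVMs
> to confirm this? Assuming the Java 8 runtime on the Dataproc image, I
> understand that would be roughly (the log path is just an example):
>
> env.java.opts.taskmanager: "-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/tmp/gc-tm.log"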
>
> I want to note that we've been running this job for months without any
> issues. The issues started a month ago out of nowhere, without a Flink
> version upgrade, a job code change, a change in the amount or type of data
> being processed, or a Dataproc image version change.
>
> Attached are the job manager logs, container logs, and node logs.
>
> How can we recover from this issue?
>
> Thanks!
>
>
