This might be related to FLINK-28481 [1], which is a bug in the fabric8io
Kubernetes client.

[1] https://issues.apache.org/jira/browse/FLINK-28481
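
On the timeout question further down the thread, here is a minimal sketch of flink-conf.yaml settings that could be tuned. It assumes the leader-election and retry options documented for Flink's Kubernetes HA services are available in the version in use. The values are purely illustrative, not recommendations, and whether they help with this particular restart pattern is untested.

```yaml
# Illustrative values only; verify each key against the configuration
# reference for the Flink version you actually run.
high-availability.kubernetes.leader-election.lease-duration: 30 s
high-availability.kubernetes.leader-election.renew-deadline: 30 s
high-availability.kubernetes.leader-election.retry-period: 10 s
# Number of retries for transactional ConfigMap operations
kubernetes.transactional-operation.max-retries: 10
```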

Best,
Yang

On Tue, Feb 6, 2024 at 12:30 PM Lavkesh Lahngir <lavk...@linux.com> wrote:

> Hi, Matthias, I was wondering if there are any timeout or heartbeat
> configurations for KubeHA available.
>
> Thanks.
>
> On Mon, 5 Feb 2024 at 8:58 PM, Matthias Pohl <matthias.p...@aiven.io.invalid> wrote:
>
> > That's stated in the Jira issue. I didn't have the time to investigate it
> > further.
> >
> > On Mon, Feb 5, 2024 at 1:55 PM Lavkesh Lahngir <lavk...@linux.com> wrote:
> >
> > > Hi Matthias,
> > > Thanks for the suggestion. Do we know which part of the code caused this
> > > issue and how it was fixed?
> > >
> > > Thanks!
> > >
> > > On Mon, 5 Feb 2024 at 18:06, Matthias Pohl <matthias.p...@aiven.io.invalid> wrote:
> > >
> > > > Hi Lavkesh,
> > > > FLINK-33998 [1] sounds quite similar to what you describe.
> > > >
> > > > The solution was to upgrade to Flink version 1.14.6. I didn't have the
> > > > capacity to look into the details considering that the mentioned Flink
> > > > version 1.14 is not officially supported by the community anymore and a
> > > > fix seems to have been provided with a newer version.
> > > >
> > > > Matthias
> > > >
> > > > [1] https://issues.apache.org/jira/browse/FLINK-33998
> > > >
> > > > On Mon, Feb 5, 2024 at 6:18 AM Lavkesh Lahngir <lavk...@linux.com> wrote:
> > > >
> > > > > Hi, a few more details:
> > > > > We are running GKE version 1.27.7-gke.1121002 and using Flink version
> > > > > 1.14.3.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > On Mon, 5 Feb 2024 at 12:05, Lavkesh Lahngir <lavk...@linux.com> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > We run a Flink operator on GKE, deploying one Flink job per JobManager.
> > > > > > We use
> > > > > > org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
> > > > > > for high availability. The JobManager uses ConfigMaps for checkpointing
> > > > > > and leader election. If, at any point, the Kube API server returns an
> > > > > > error (5xx or 4xx), the JM pod is restarted. This is sporadic, happening
> > > > > > every 1-2 days for some jobs among the 400 running in the same cluster,
> > > > > > each with its own JobManager pod.
> > > > > >
> > > > > > What might be causing these errors from the Kube API server? One
> > > > > > possibility is that when the JM writes the ConfigMap and attempts to
> > > > > > retrieve it immediately after, it could result in a 404 error.
> > > > > > Are there any heartbeat or timeout settings we could increase, in case
> > > > > > they are causing temporary disconnections from the Kube API server?
> > > > > >
> > > > > > Thank you!
