Re: Native Kubernetes Task Managers

2024-06-09 Thread Alexis Sarda-Espinosa
Hi,

sorry, I got sidetracked this week. Yes, I was thinking about the
possibility of changing the logic so that the names are stable, the way a
stateful set would do it but without needing an actual stateful set. When
I look at metrics, in particular those that are not exposed by Flink but
rather by other systems, stable names make it easier to track behavior,
for example memory utilization even after a pod restart. It might also
make restarts easier to detect, since some Task Manager metrics would be
missing for a short period; that could help narrow down the timestamps of
related logs so that potential OOM kills can be corroborated through the
resource manager's logs. I think this could be more helpful than exposing
the attempt ID as part of the pod name. What do you think?
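
To make this concrete, here is a minimal sketch of the kind of naming I
have in mind, purely as an illustration under assumed names (the format
string and helper below are made up, not the actual Flink code): the pod
name keeps a per-cluster index but drops the attempt number, so the same
slot keeps the same name across ResourceManager restarts.

    // Hypothetical sketch only; the format string and class are assumptions,
    // not Flink's actual implementation.
    public final class StablePodNames {

        // Stable scheme: <clusterId>-taskmanager-<podIndex>, with no attempt id.
        private static final String STABLE_TASK_MANAGER_POD_FORMAT = "%s-taskmanager-%d";

        private StablePodNames() {
        }

        static String stablePodName(String clusterId, int podIndex) {
            return String.format(STABLE_TASK_MANAGER_POD_FORMAT, clusterId, podIndex);
        }

        public static void main(String[] args) {
            // Prints "my-flink-cluster-taskmanager-3"; the name stays the same even
            // after a JobManager/ResourceManager failover, so metrics and alerts keyed
            // on the pod name keep a continuous series.
            System.out.println(stablePodName("my-flink-cluster", 3));
        }
    }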

Regards,
Alexis.

On Mon, Jun 3, 2024 at 08:35, spoon_lz wrote:

> Hi Alexis, Are you describing the logic of adjusting the following part?
>
>     final String podName = String.format(TASK_MANAGER_POD_FORMAT, clusterId,
>         currentMaxAttemptId, ++currentMaxPodId);
>
> `currentMaxAttemptId` relates to the number of rebuilds of the
> ResourceManager, and `currentMaxPodId` describes the pod index.
>
>
>
> Regards,
> Zhuo.
>  Replied Message 
> From: Alexis Sarda-Espinosa
> Date: 06/3/2024 14:19
> Subject: Re: Native Kubernetes Task Managers
> Ah no, I meant that I wouldn't use a stateful set, rather just adjust the
> names of the pods that are created/managed directly by the job manager.
>
> Regards,
> Alexis.
>
> On Mon, Jun 3, 2024 at 07:31, Xintong Song <
> tonysong...@gmail.com> wrote:
>
> I may not have understood what you mean by the naming scheme. I think the
> limitation "pods in a StatefulSet are always terminated in the reverse
> order as they are created" comes from Kubernetes and has nothing to do with
> the naming scheme.
>
> Best,
>
> Xintong
>
>
>
> On Mon, Jun 3, 2024 at 1:13 PM Alexis Sarda-Espinosa <
> sarda.espin...@gmail.com> wrote:
>
> Hi Xintong,
>
> After experimenting a bit, I came to roughly the same conclusion: cleanup
> is what's more or less incompatible if Kubernetes manages the pods. Then
> it
> might be better to just allow using a more stable pod naming scheme that
> doesn't depend on the attempt number and thus produces more stable task
> manager metrics. I'll explore that.
>
> Regards,
> Alexis.
>
> On Mon, 3 Jun 2024, 03:35 Xintong Song,  wrote:
>
> I think the reason we didn't choose StatefulSet when introducing the
> Native
> K8s Deployment is that, IIRC, we want Flink's ResourceManager to have
> full
> control of the individual pod lifecycles.
>
> E.g.,
> - Pods in a StatefulSet are always terminated in the reverse order as
> they
> are created. This prevents us from releasing a specific idle TM that was
> not necessarily created last.
> - If a pod is unexpectedly terminated, Flink's ResourceManager should
> decide whether to restart it or not according to the job status.
> (Technically, the same issue as above, that we may want pods to be
> terminated / deleted in a different order.)
>
> There might be some other reasons. I just cannot recall all the
> details.
>
> As for determining whether a pod is OOM killed, I think Flink does
> print
> diagnostics for terminated pods in JM logs, i.e. the `exitCode`,
> `reason`
> and `message` of the `Terminated` container state. In our production,
> it
> shows "(exitCode=137, reason=OOMKilled, message=null)". However, since
> the
> diagnostics are from K8s, I'm not 100% sure whether this behavior is the
> same for all K8s versions.
>
> Best,
>
> Xintong
>
>
>
> On Sun, Jun 2, 2024 at 7:35 PM Alexis Sarda-Espinosa <
> sarda.espin...@gmail.com> wrote:
>
> Hi devs,
>
> Some time ago I asked about the way Task Manager pods are handled by
> the
> native Kubernetes driver [1]. I have now looked a bit through the
> source
> code and I think it could be possible to deploy TMs with a stateful
> set,
> which could allow tracking OOM kills as I mentioned in my original
> email,
> and could also make it easier to track metrics and create alerts,
> since
> the
> labels wouldn't change as much.
>
> One challenge is probably the new elastic scaling features [2], since
> the
> driver would have to differentiate between new pod requests due to a
> TM
> terminating, and a request due to scaling. I'm also not sure where
> downscaling requests are currently handled.
>
> I would be interested in taking a look at this and seeing if I can
> get
> something working. I think it would be possible to make it
> configurable
> in
> a way that maintains backwards compatibility. Would it be ok if I
> enter a
> Jira ticket and try it out?
>
> Regards,
> Alexis.
>
> [1] https://lists.apache.org/thread/jysgdldv8swgf4fhqwqochgf6hq0qs52
> [2]
>
>
>
>
>
> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/deployment/elastic_scaling/
>
>
>
>
>


Re: Native Kubernetes Task Managers

2024-06-02 Thread spoon_lz
Hi Alexis, Are you describing the logic of adjusting the following part?

    final String podName = String.format(TASK_MANAGER_POD_FORMAT, clusterId,
        currentMaxAttemptId, ++currentMaxPodId);

`currentMaxAttemptId` relates to the number of rebuilds of the ResourceManager,
and `currentMaxPodId` describes the pod index.
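
As a hedged illustration of what this produces today (the exact value of
TASK_MANAGER_POD_FORMAT is not shown in this thread; the "%s-taskmanager-%d-%d"
pattern below is an assumption), a standalone sketch:

    // Standalone sketch, not Flink code; the format string is an assumed stand-in.
    public class CurrentPodNameExample {

        private static final String TASK_MANAGER_POD_FORMAT = "%s-taskmanager-%d-%d";

        public static void main(String[] args) {
            String clusterId = "my-flink-cluster"; // hypothetical cluster id
            int currentMaxAttemptId = 2;           // bumps when the ResourceManager is rebuilt
            int currentMaxPodId = 4;               // running pod index

            // Prints "my-flink-cluster-taskmanager-2-5": a new ResourceManager attempt
            // changes the middle number, which is why pod names (and the metric labels
            // derived from them) shift after a failover.
            System.out.println(String.format(
                    TASK_MANAGER_POD_FORMAT, clusterId, currentMaxAttemptId, ++currentMaxPodId));
        }
    }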



Regards,
Zhuo.
 Replied Message 
From: Alexis Sarda-Espinosa
Date: 06/3/2024 14:19
Subject: Re: Native Kubernetes Task Managers
Ah no, I meant that I wouldn't use a stateful set, rather just adjust the
names of the pods that are created/managed directly by the job manager.

Regards,
Alexis.

On Mon, Jun 3, 2024 at 07:31, Xintong Song <
tonysong...@gmail.com> wrote:

I may not have understood what you mean by the naming scheme. I think the
limitation "pods in a StatefulSet are always terminated in the reverse
order as they are created" comes from Kubernetes and has nothing to do with
the naming scheme.

Best,

Xintong



On Mon, Jun 3, 2024 at 1:13 PM Alexis Sarda-Espinosa <
sarda.espin...@gmail.com> wrote:

Hi Xintong,

After experimenting a bit, I came to roughly the same conclusion: cleanup
is what's more or less incompatible if Kubernetes manages the pods. Then
it
might be better to just allow using a more stable pod naming scheme that
doesn't depend on the attempt number and thus produces more stable task
manager metrics. I'll explore that.

Regards,
Alexis.

On Mon, 3 Jun 2024, 03:35 Xintong Song,  wrote:

I think the reason we didn't choose StatefulSet when introducing the
Native
K8s Deployment is that, IIRC, we want Flink's ResourceManager to have
full
control of the individual pod lifecycles.

E.g.,
- Pods in a StatefulSet are always terminated in the reverse order as
they
are created. This prevents us from releasing a specific idle TM that was
not necessarily created last.
- If a pod is unexpectedly terminated, Flink's ResourceManager should
decide whether to restart it or not according to the job status.
(Technically, the same issue as above, that we may want pods to be
terminated / deleted in a different order.)

There might be some other reasons. I just cannot recall all the
details.

As for determining whether a pod is OOM killed, I think Flink does
print
diagnostics for terminated pods in JM logs, i.e. the `exitCode`,
`reason`
and `message` of the `Terminated` container state. In our production,
it
shows "(exitCode=137, reason=OOMKilled, message=null)". However, since
the
diagnostics are from K8s, I'm not 100% sure whether this behavior is the
same for all K8s versions.

Best,

Xintong



On Sun, Jun 2, 2024 at 7:35 PM Alexis Sarda-Espinosa <
sarda.espin...@gmail.com> wrote:

Hi devs,

Some time ago I asked about the way Task Manager pods are handled by
the
native Kubernetes driver [1]. I have now looked a bit through the
source
code and I think it could be possible to deploy TMs with a stateful
set,
which could allow tracking OOM kills as I mentioned in my original
email,
and could also make it easier to track metrics and create alerts,
since
the
labels wouldn't change as much.

One challenge is probably the new elastic scaling features [2], since
the
driver would have to differentiate between new pod requests due to a
TM
terminating, and a request due to scaling. I'm also not sure where
downscaling requests are currently handled.

I would be interested in taking a look at this and seeing if I can
get
something working. I think it would be possible to make it
configurable
in
a way that maintains backwards compatibility. Would it be ok if I
enter a
Jira ticket and try it out?

Regards,
Alexis.

[1] https://lists.apache.org/thread/jysgdldv8swgf4fhqwqochgf6hq0qs52
[2]




https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/deployment/elastic_scaling/






Re: Native Kubernetes Task Managers

2024-06-02 Thread Alexis Sarda-Espinosa
Ah no, I meant that I wouldn't use a stateful set, rather just adjust the
names of the pods that are created/managed directly by the job manager.

Regards,
Alexis.

On Mon, Jun 3, 2024 at 07:31, Xintong Song <
tonysong...@gmail.com> wrote:

> I may not have understood what you mean by the naming scheme. I think the
> limitation "pods in a StatefulSet are always terminated in the reverse
> order as they are created" comes from Kubernetes and has nothing to do with
> the naming scheme.
>
> Best,
>
> Xintong
>
>
>
> On Mon, Jun 3, 2024 at 1:13 PM Alexis Sarda-Espinosa <
> sarda.espin...@gmail.com> wrote:
>
> > Hi Xintong,
> >
> > After experimenting a bit, I came to roughly the same conclusion: cleanup
> > is what's more or less incompatible if Kubernetes manages the pods. Then
> it
> > might be better to just allow using a more stable pod naming scheme that
> > doesn't depend on the attempt number and thus produces more stable task
> > manager metrics. I'll explore that.
> >
> > Regards,
> > Alexis.
> >
> > On Mon, 3 Jun 2024, 03:35 Xintong Song,  wrote:
> >
> > > I think the reason we didn't choose StatefulSet when introducing the
> > Native
> > > K8s Deployment is that, IIRC, we want Flink's ResourceManager to have
> > full
> > > control of the individual pod lifecycles.
> > >
> > > E.g.,
> > > - Pods in a StatefulSet are always terminated in the reverse order as
> > they
> > > are created. This prevents us from releasing a specific idle TM that was
> > > not necessarily created last.
> > > - If a pod is unexpectedly terminated, Flink's ResourceManager should
> > > decide whether to restart it or not according to the job status.
> > > (Technically, the same issue as above, that we may want pods to be
> > > terminated / deleted in a different order.)
> > >
> > > There might be some other reasons. I just cannot recall all the
> details.
> > >
> > > As for determining whether a pod is OOM killed, I think Flink does
> print
> > > diagnostics for terminated pods in JM logs, i.e. the `exitCode`,
> `reason`
> > > and `message` of the `Terminated` container state. In our production,
> it
> > > shows "(exitCode=137, reason=OOMKilled, message=null)". However, since
> > the
> > > diagnostics are from K8s, I'm not 100% sure whether this behavior is the
> > > same for all K8s versions.
> > >
> > > Best,
> > >
> > > Xintong
> > >
> > >
> > >
> > > On Sun, Jun 2, 2024 at 7:35 PM Alexis Sarda-Espinosa <
> > > sarda.espin...@gmail.com> wrote:
> > >
> > > > Hi devs,
> > > >
> > > > Some time ago I asked about the way Task Manager pods are handled by
> > the
> > > > native Kubernetes driver [1]. I have now looked a bit through the
> > source
> > > > code and I think it could be possible to deploy TMs with a stateful
> > set,
> > > > which could allow tracking OOM kills as I mentioned in my original
> > email,
> > > > and could also make it easier to track metrics and create alerts,
> since
> > > the
> > > > labels wouldn't change as much.
> > > >
> > > > One challenge is probably the new elastic scaling features [2], since
> > the
> > > > driver would have to differentiate between new pod requests due to a
> TM
> > > > terminating, and a request due to scaling. I'm also not sure where
> > > > downscaling requests are currently handled.
> > > >
> > > > I would be interested in taking a look at this and seeing if I can
> get
> > > > something working. I think it would be possible to make it
> configurable
> > > in
> > > > a way that maintains backwards compatibility. Would it be ok if I
> > enter a
> > > > Jira ticket and try it out?
> > > >
> > > > Regards,
> > > > Alexis.
> > > >
> > > > [1] https://lists.apache.org/thread/jysgdldv8swgf4fhqwqochgf6hq0qs52
> > > > [2]
> > > >
> > > >
> > >
> >
> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/deployment/elastic_scaling/
> > > >
> > >
> >
>


Re: Native Kubernetes Task Managers

2024-06-02 Thread Xintong Song
I may not have understood what you mean by the naming scheme. I think the
limitation "pods in a StatefulSet are always terminated in the reverse
order as they are created" comes from Kubernetes and has nothing to do with
the naming scheme.

Best,

Xintong



On Mon, Jun 3, 2024 at 1:13 PM Alexis Sarda-Espinosa <
sarda.espin...@gmail.com> wrote:

> Hi Xintong,
>
> After experimenting a bit, I came to roughly the same conclusion: cleanup
> is what's more or less incompatible if Kubernetes manages the pods. Then it
> might be better to just allow using a more stable pod naming scheme that
> doesn't depend on the attempt number and thus produces more stable task
> manager metrics. I'll explore that.
>
> Regards,
> Alexis.
>
> On Mon, 3 Jun 2024, 03:35 Xintong Song,  wrote:
>
> > I think the reason we didn't choose StatefulSet when introducing the
> Native
> > K8s Deployment is that, IIRC, we want Flink's ResourceManager to have
> full
> > control of the individual pod lifecycles.
> >
> > E.g.,
> > - Pods in a StatefulSet are always terminated in the reverse order as
> they
> > are created. This prevents us from releasing a specific idle TM that was
> > not necessarily created last.
> > - If a pod is unexpectedly terminated, Flink's ResourceManager should
> > decide whether to restart it or not according to the job status.
> > (Technically, the same issue as above, that we may want pods to be
> > terminated / deleted in a different order.)
> >
> > There might be some other reasons. I just cannot recall all the details.
> >
> > As for determining whether a pod is OOM killed, I think Flink does print
> > diagnostics for terminated pods in JM logs, i.e. the `exitCode`, `reason`
> > and `message` of the `Terminated` container state. In our production, it
> > shows "(exitCode=137, reason=OOMKilled, message=null)". However, since
> the
> > diagnostics are from K8s, I'm not 100% sure whether this behavior is the
> > same for all K8s versions.
> >
> > Best,
> >
> > Xintong
> >
> >
> >
> > On Sun, Jun 2, 2024 at 7:35 PM Alexis Sarda-Espinosa <
> > sarda.espin...@gmail.com> wrote:
> >
> > > Hi devs,
> > >
> > > Some time ago I asked about the way Task Manager pods are handled by
> the
> > > native Kubernetes driver [1]. I have now looked a bit through the
> source
> > > code and I think it could be possible to deploy TMs with a stateful
> set,
> > > which could allow tracking OOM kills as I mentioned in my original
> email,
> > > and could also make it easier to track metrics and create alerts, since
> > the
> > > labels wouldn't change as much.
> > >
> > > One challenge is probably the new elastic scaling features [2], since
> the
> > > driver would have to differentiate between new pod requests due to a TM
> > > terminating, and a request due to scaling. I'm also not sure where
> > > downscaling requests are currently handled.
> > >
> > > I would be interested in taking a look at this and seeing if I can get
> > > something working. I think it would be possible to make it configurable
> > in
> > > a way that maintains backwards compatibility. Would it be ok if I
> enter a
> > > Jira ticket and try it out?
> > >
> > > Regards,
> > > Alexis.
> > >
> > > [1] https://lists.apache.org/thread/jysgdldv8swgf4fhqwqochgf6hq0qs52
> > > [2]
> > >
> > >
> >
> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/deployment/elastic_scaling/
> > >
> >
>


Re: Native Kubernetes Task Managers

2024-06-02 Thread Alexis Sarda-Espinosa
Hi Xintong,

After experimenting a bit, I came to roughly the same conclusion: cleanup
is what's more or less incompatible if Kubernetes manages the pods. Then it
might be better to just allow using a more stable pod naming scheme that
doesn't depend on the attempt number and thus produces more stable task
manager metrics. I'll explore that.

Regards,
Alexis.

On Mon, 3 Jun 2024, 03:35 Xintong Song,  wrote:

> I think the reason we didn't choose StatefulSet when introducing the Native
> K8s Deployment is that, IIRC, we want Flink's ResourceManager to have full
> control of the individual pod lifecycles.
>
> E.g.,
> - Pods in a StatefulSet are always terminated in the reverse order as they
> are created. This prevents us from releasing a specific idle TM that was not
> necessarily created last.
> - If a pod is unexpectedly terminated, Flink's ResourceManager should
> decide whether to restart it or not according to the job status.
> (Technically, the same issue as above, that we may want pods to be
> terminated / deleted in a different order.)
>
> There might be some other reasons. I just cannot recall all the details.
>
> As for determining whether a pod is OOM killed, I think Flink does print
> diagnostics for terminated pods in JM logs, i.e. the `exitCode`, `reason`
> and `message` of the `Terminated` container state. In our production, it
> shows "(exitCode=137, reason=OOMKilled, message=null)". However, since the
> diagnostics are from K8s, I'm not 100% sure whether this behavior is the
> same for all K8s versions.
>
> Best,
>
> Xintong
>
>
>
> On Sun, Jun 2, 2024 at 7:35 PM Alexis Sarda-Espinosa <
> sarda.espin...@gmail.com> wrote:
>
> > Hi devs,
> >
> > Some time ago I asked about the way Task Manager pods are handled by the
> > native Kubernetes driver [1]. I have now looked a bit through the source
> > code and I think it could be possible to deploy TMs with a stateful set,
> > which could allow tracking OOM kills as I mentioned in my original email,
> > and could also make it easier to track metrics and create alerts, since
> the
> > labels wouldn't change as much.
> >
> > One challenge is probably the new elastic scaling features [2], since the
> > driver would have to differentiate between new pod requests due to a TM
> > terminating, and a request due to scaling. I'm also not sure where
> > downscaling requests are currently handled.
> >
> > I would be interested in taking a look at this and seeing if I can get
> > something working. I think it would be possible to make it configurable
> in
> > a way that maintains backwards compatibility. Would it be ok if I enter a
> > Jira ticket and try it out?
> >
> > Regards,
> > Alexis.
> >
> > [1] https://lists.apache.org/thread/jysgdldv8swgf4fhqwqochgf6hq0qs52
> > [2]
> >
> >
> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/deployment/elastic_scaling/
> >
>


Re: Native Kubernetes Task Managers

2024-06-02 Thread Xintong Song
I think the reason we didn't choose StatefulSet when introducing the Native
K8s Deployment is that, IIRC, we want Flink's ResourceManager to have full
control of the individual pod lifecycles.

E.g.,
- Pods in a StatefulSet are always terminated in the reverse order as they
are created. This prevents us from releasing a specific idle TM that was not
necessarily created last.
- If a pod is unexpectedly terminated, Flink's ResourceManager should
decide whether to restart it or not according to the job status.
(Technically, the same issue as above, that we may want pods to be
terminated / deleted in a different order.)

There might be some other reasons. I just cannot recall all the details.

As for determining whether a pod is OOM killed, I think Flink does print
diagnostics for terminated pods in JM logs, i.e. the `exitCode`, `reason`
and `message` of the `Terminated` container state. In our production, it
shows "(exitCode=137, reason=OOMKilled, message=null)". However, since the
diagnostics are from K8s, I'm not 100% sure whether this behavior is the
same for all K8s versions.
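
(For anyone who wants to check the same fields outside of the JM logs, here is
a small sketch using the fabric8 Kubernetes client; the client choice, the
namespace, and the pod name are placeholders/assumptions, not something
prescribed by Flink. Depending on whether the container has already been
restarted, the `Terminated` info sits in `lastState` or in `state`.)

    // Sketch only: reads the Terminated container state (exitCode/reason/message)
    // directly from the Kubernetes API. Namespace and pod name are placeholders.
    import io.fabric8.kubernetes.api.model.ContainerStateTerminated;
    import io.fabric8.kubernetes.api.model.ContainerStatus;
    import io.fabric8.kubernetes.api.model.Pod;
    import io.fabric8.kubernetes.client.KubernetesClient;
    import io.fabric8.kubernetes.client.KubernetesClientBuilder;

    public class TerminatedStateCheck {
        public static void main(String[] args) {
            String namespace = "flink";                  // placeholder
            String podName = "my-flink-taskmanager-1-1"; // placeholder

            try (KubernetesClient client = new KubernetesClientBuilder().build()) {
                Pod pod = client.pods().inNamespace(namespace).withName(podName).get();
                if (pod == null || pod.getStatus() == null) {
                    return;
                }
                for (ContainerStatus cs : pod.getStatus().getContainerStatuses()) {
                    // lastState carries the previous termination if the container restarted,
                    // state carries it if the container is currently terminated.
                    ContainerStateTerminated t =
                            cs.getLastState() != null ? cs.getLastState().getTerminated() : null;
                    if (t == null && cs.getState() != null) {
                        t = cs.getState().getTerminated();
                    }
                    if (t != null) {
                        System.out.printf("exitCode=%d, reason=%s, message=%s%n",
                                t.getExitCode(), t.getReason(), t.getMessage());
                    }
                }
            }
        }
    }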

Best,

Xintong



On Sun, Jun 2, 2024 at 7:35 PM Alexis Sarda-Espinosa <
sarda.espin...@gmail.com> wrote:

> Hi devs,
>
> Some time ago I asked about the way Task Manager pods are handled by the
> native Kubernetes driver [1]. I have now looked a bit through the source
> code and I think it could be possible to deploy TMs with a stateful set,
> which could allow tracking OOM kills as I mentioned in my original email,
> and could also make it easier to track metrics and create alerts, since the
> labels wouldn't change as much.
>
> One challenge is probably the new elastic scaling features [2], since the
> driver would have to differentiate between new pod requests due to a TM
> terminating, and a request due to scaling. I'm also not sure where
> downscaling requests are currently handled.
>
> I would be interested in taking a look at this and seeing if I can get
> something working. I think it would be possible to make it configurable in
> a way that maintains backwards compatibility. Would it be ok if I enter a
> Jira ticket and try it out?
>
> Regards,
> Alexis.
>
> [1] https://lists.apache.org/thread/jysgdldv8swgf4fhqwqochgf6hq0qs52
> [2]
>
> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/deployment/elastic_scaling/
>