Re: [ DISCUSS ] FLIP-XXX : [Plugin] Enhancing Flink Failure Management in Kubernetes with Dynamic Termination Log Integration

2024-04-25 Thread Swathi C
Thanks everyone for the feedback.
I will take a closer look at the pluggable enrichers to see how we can
incorporate the termination log using them.

Regards,
Swathi C

On Thu, Apr 25, 2024 at 12:33 PM Martijn Visser 
wrote:

> Hi Swathi C,
>
> Also including the Dev mailing list.
>
> If you have a good reason for not being able to use the pluggable enricher
> FLIP, you'll have to include that rationale in your own FLIP and explain
> it. You might get challenged for it in the Dev mailing list thread
> discussion, but that's the point.
>
> Regards,
>
> Martijn
>
> On Thu, Apr 25, 2024 at 8:51 AM Swathi C 
> wrote:
>
>> Hi Martijn and Ahmed,
>>
>> This proposed FLIP mainly focuses on the CRUD failures use case (and not
>> job failures) and might not be able to use the pluggable enricher FLIP (as
>> that mainly focuses on job failures). Hence, going forward as a new FLIP,
>> we might not be able to leverage the pluggable enricher FLIP for this use
>> case, so we might not be able to reformulate it for CRUD failures.
>>
>> So, is this new proposal OK? Or let us know if I'm missing anything, if it
>> is related to the pluggable enricher FLIP, or if there is any way we can
>> use the pluggable enricher FLIP here for CRUD failures.
>>
>> Regards,
>> Swathi C
>>
>> ---------- Forwarded message ---------
>> From: Martijn Visser 
>> Date: Thu, Apr 25, 2024 at 2:46 AM
>> Subject: Re: [ DISCUSS ] FLIP-XXX : [Plugin] Enhancing Flink Failure
>> Management in Kubernetes with Dynamic Termination Log Integration
>> To: 
>> Cc: , 
>>
>>
>> I would prefer a separate FLIP
>>
>> On Wed, Apr 24, 2024 at 3:25 PM Swathi C 
>> wrote:
>>
>> > Sure Ahmed and Martijn.
>> > Fetching the flink particular job related failure and adding this logic
>> to
>> > termination-log is definitely a sub-task of pluggable enricher as we can
>> > leverage pluggable enricher to achieve this.
>> > But for CRUD level failures, which is mainly used to notify if the job
>> > manager failed might not be using the pluggable enricher. So, let us
>> know
>> > if that needs to be there as a separate FLIP or we can combine that as
>> well
>> > under the pluggable enricher ( by adding another sub task ) ?
>> >
>> > Regards,
>> > Swathi C
>> >
>> > On Wed, Apr 24, 2024 at 3:46 PM Ahmed Hamdy 
>> wrote:
>> >
>> > > Hi,
>> > > I agree with the Martijn, We can reformulate the FLIP to introduce
>> > > termination log as supported pluggable enricher. If you believe the
>> scope
>> > > of work is a subset (Further implementation) we can just add a Jira
>> > ticket
>> > > for it. IMO this will also help with implementation taking the
>> existing
>> > > enrichers into reference.
>> > > Best Regards
>> > > Ahmed Hamdy
>> > >
>> > >
>> > > On Tue, 23 Apr 2024 at 15:23, Martijn Visser <
>> martijnvis...@apache.org>
>> > > wrote:
>> > >
>> > > > From a procedural point of view, we shouldn't make FLIPs sub-tasks
>> for
>> > > > existing FLIPs that have been voted/are released. That will only
>> cause
>> > > > confusion down the line. A new FLIP should take existing
>> functionality
>> > > > (like FLIP-304) into account, and propose how to improve on what
>> that
>> > > > original FLIP has introduced or how you're going to leverage what's
>> > > already
>> > > > there.
>> > > >
>> > > > On Tue, Apr 23, 2024 at 11:42 AM ramkrishna vasudevan <
>> > > > ramvasu.fl...@gmail.com> wrote:
>> > > >
>> > > > > Hi Gyula and Ahmed,
>> > > > >
>> > > > > I totally agree that there is an interlap in the final goal that
>> both
>> > > the
>> > > > > FLIPs are achieving here and infact FLIP-304 is more comprehensive
>> > for
>> > > > job
>> > > > > failures.
>> > > > >
>> > > > > But as a proposal to move forward can we make Swathi's FLIP/JIRA
>> as a
>> > > sub
>> > > > > task for FLIP-304 and continue with the PR since the main aim is
>> to
>> > get
>> > > > the
>> > > > > cluster failure pushed to the termination log for K8s based
>> > > deployments.

Re: [ DISCUSS ] FLIP-XXX : [Plugin] Enhancing Flink Failure Management in Kubernetes with Dynamic Termination Log Integration

2024-04-25 Thread Martijn Visser
Hi Swathi C,

Also including the Dev mailing list.

If you have a good reason for not being able to use the pluggable enricher
FLIP, you'll have to include that rationale in your own FLIP and explain
it. You might get challenged for it in the Dev mailing list thread
discussion, but that's the point.

Regards,

Martijn

On Thu, Apr 25, 2024 at 8:51 AM Swathi C  wrote:

> Hi Martijn and Ahmed,
>
> This proposed FLIP mainly focuses on the CRUD failures use case (and not
> job failures) and might not be able to use the pluggable enricher FLIP (as
> that mainly focuses on job failures). Hence, going forward as a new FLIP,
> we might not be able to leverage the pluggable enricher FLIP for this use
> case, so we might not be able to reformulate it for CRUD failures.
>
> So, is this new proposal OK? Or let us know if I'm missing anything, if it
> is related to the pluggable enricher FLIP, or if there is any way we can
> use the pluggable enricher FLIP here for CRUD failures.
>
> Regards,
> Swathi C
>
> ---------- Forwarded message ---------
> From: Martijn Visser 
> Date: Thu, Apr 25, 2024 at 2:46 AM
> Subject: Re: [ DISCUSS ] FLIP-XXX : [Plugin] Enhancing Flink Failure
> Management in Kubernetes with Dynamic Termination Log Integration
> To: 
> Cc: , 
>
>
> I would prefer a separate FLIP
>
> On Wed, Apr 24, 2024 at 3:25 PM Swathi C 
> wrote:
>
> > Sure Ahmed and Martijn.
> > Fetching the flink particular job related failure and adding this logic
> to
> > termination-log is definitely a sub-task of pluggable enricher as we can
> > leverage pluggable enricher to achieve this.
> > But for CRUD level failures, which is mainly used to notify if the job
> > manager failed might not be using the pluggable enricher. So, let us know
> > if that needs to be there as a separate FLIP or we can combine that as
> well
> > under the pluggable enricher ( by adding another sub task ) ?
> >
> > Regards,
> > Swathi C
> >
> > On Wed, Apr 24, 2024 at 3:46 PM Ahmed Hamdy 
> wrote:
> >
> > > Hi,
> > > I agree with the Martijn, We can reformulate the FLIP to introduce
> > > termination log as supported pluggable enricher. If you believe the
> scope
> > > of work is a subset (Further implementation) we can just add a Jira
> > ticket
> > > for it. IMO this will also help with implementation taking the existing
> > > enrichers into reference.
> > > Best Regards
> > > Ahmed Hamdy
> > >
> > >
> > > > On Tue, 23 Apr 2024 at 15:23, Martijn Visser
> > > wrote:
> > >
> > > > From a procedural point of view, we shouldn't make FLIPs sub-tasks
> for
> > > > existing FLIPs that have been voted/are released. That will only
> cause
> > > > confusion down the line. A new FLIP should take existing
> functionality
> > > > (like FLIP-304) into account, and propose how to improve on what that
> > > > original FLIP has introduced or how you're going to leverage what's
> > > already
> > > > there.
> > > >
> > > > On Tue, Apr 23, 2024 at 11:42 AM ramkrishna vasudevan <
> > > > ramvasu.fl...@gmail.com> wrote:
> > > >
> > > > > Hi Gyula and Ahmed,
> > > > >
> > > > > I totally agree that there is an interlap in the final goal that
> both
> > > the
> > > > > FLIPs are achieving here and infact FLIP-304 is more comprehensive
> > for
> > > > job
> > > > > failures.
> > > > >
> > > > > But as a proposal to move forward can we make Swathi's FLIP/JIRA
> as a
> > > sub
> > > > > task for FLIP-304 and continue with the PR since the main aim is to
> > get
> > > > the
> > > > > cluster failure pushed to the termination log for K8s based
> > > deployments.
> > > > > And once it is completed we can work to make FLIP-304 to support
> job
> > > > > failure propagation to termination log?
> > > > >
> > > > > Regards
> > > > > Ram
> > > > >
> > > > > On Thu, Apr 18, 2024 at 10:07 PM Swathi C <
> swathi.c.apa...@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > > > Hi Gyula and  Ahmed,
> > > > > >
> > > > > > Thanks for reviewing this.
> > > > > >
> > > > > > @gyula.f...@gmail.com  , currently since
> our
> > > aim
> > > > > as
>

Re: [ DISCUSS ] FLIP-XXX : [Plugin] Enhancing Flink Failure Management in Kubernetes with Dynamic Termination Log Integration

2024-04-24 Thread Martijn Visser
I would prefer a separate FLIP

On Wed, Apr 24, 2024 at 3:25 PM Swathi C  wrote:

> Sure Ahmed and Martijn.
> Fetching the flink particular job related failure and adding this logic to
> termination-log is definitely a sub-task of pluggable enricher as we can
> leverage pluggable enricher to achieve this.
> But for CRUD level failures, which is mainly used to notify if the job
> manager failed might not be using the pluggable enricher. So, let us know
> if that needs to be there as a separate FLIP or we can combine that as well
> under the pluggable enricher ( by adding another sub task ) ?
>
> Regards,
> Swathi C
>
> On Wed, Apr 24, 2024 at 3:46 PM Ahmed Hamdy  wrote:
>
> > Hi,
> > I agree with the Martijn, We can reformulate the FLIP to introduce
> > termination log as supported pluggable enricher. If you believe the scope
> > of work is a subset (Further implementation) we can just add a Jira
> ticket
> > for it. IMO this will also help with implementation taking the existing
> > enrichers into reference.
> > Best Regards
> > Ahmed Hamdy
> >
> >
> > On Tue, 23 Apr 2024 at 15:23, Martijn Visser 
> > wrote:
> >
> > > From a procedural point of view, we shouldn't make FLIPs sub-tasks for
> > > existing FLIPs that have been voted/are released. That will only cause
> > > confusion down the line. A new FLIP should take existing functionality
> > > (like FLIP-304) into account, and propose how to improve on what that
> > > original FLIP has introduced or how you're going to leverage what's
> > already
> > > there.
> > >
> > > On Tue, Apr 23, 2024 at 11:42 AM ramkrishna vasudevan <
> > > ramvasu.fl...@gmail.com> wrote:
> > >
> > > > Hi Gyula and Ahmed,
> > > >
> > > > I totally agree that there is an interlap in the final goal that both
> > the
> > > > FLIPs are achieving here and infact FLIP-304 is more comprehensive
> for
> > > job
> > > > failures.
> > > >
> > > > But as a proposal to move forward can we make Swathi's FLIP/JIRA as a
> > sub
> > > > task for FLIP-304 and continue with the PR since the main aim is to
> get
> > > the
> > > > cluster failure pushed to the termination log for K8s based
> > deployments.
> > > > And once it is completed we can work to make FLIP-304 to support job
> > > > failure propagation to termination log?
> > > >
> > > > Regards
> > > > Ram
> > > >
> > > > > On Thu, Apr 18, 2024 at 10:07 PM Swathi C
> > > > wrote:
> > > >
> > > > > Hi Gyula and  Ahmed,
> > > > >
> > > > > Thanks for reviewing this.
> > > > >
> > > > > @gyula.f...@gmail.com  , currently since our
> > aim
> > > > as
> > > > > part of this FLIP was only to fail the cluster when job
> manager/flink
> > > has
> > > > > issues such that the cluster would no longer be usable, hence, we
> > > > proposed
> > > > > only related to that.
> > > > > Your right, that it covers only job main class errors, job manager
> > run
> > > > time
> > > > > failures, if the Job manager wants to write any metadata to any
> other
> > > > > system ( ABFS, S3 , ... )  and the job failures will not be
> covered.
> > > > >
> > > > > FLIP-304 is mainly used to provide Failure enrichers for job
> > failures.
> > > > > Since, this FLIP is mainly for flink Job manager failures, let us
> > know
> > > if
> > > > > we can leverage the goodness of both and try to extend FLIP-304 and
> > add
> > > > our
> > > > > plugin implementation to cover the job level issues ( propagate
> this
> > > info
> > > > > to the /dev/termination-log such that, the container status reports
> > it
> > > > for
> > > > > flink on K8S by implementing Failure Enricher interface and
> > > > > processFailure() to do this ) and use this FLIP proposal for
> generic
> > > > flink
> > > > > cluster (Job manager/cluster ) failures.
> > > > >
> > > > > Regards,
> > > > > Swathi C
> > > > >
> > > > > On Thu, Apr 18, 2024 at 7:36 PM Ahmed Hamdy 
> > > > wrote:
> > > > >
> > > > > > Hi Swathi!
> > > > > > Thanks for the proposal.
> > > > > > Could you please elaborate what this FLIP offers more than
> > > Flip-304[1]?
> > > > > > Flip 304 proposes a Pluggable mechanism for enriching Job
> failures,
> > > If
> > > > I
> > > > > am
> > > > > > not mistaken this proposal looks like a subset of it.
> > > > > >
> > > > > > 1-
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-304%3A+Pluggable+Failure+Enrichers
> > > > > >
> > > > > > Best Regards
> > > > > > Ahmed Hamdy
> > > > > >
> > > > > >
> > > > > > On Thu, 18 Apr 2024 at 08:23, Gyula Fóra 
> > > wrote:
> > > > > >
> > > > > > > Hi Swathi!
> > > > > > >
> > > > > > > Thank you for creating this proposal. I really like the general
> > > idea
> > > > of
> > > > > > > increasing the K8s native observability of Flink job errors.
> > > > > > >
> > > > > > > I took a quick look at your reference PR, the termination log
> > > related
> > > > > > logic
> > > > > > > is contained completely in the ClusterEntrypoint. What type of
> > > errors
> > > > > > will
> > > > > > > this actuall

Re: [ DISCUSS ] FLIP-XXX : [Plugin] Enhancing Flink Failure Management in Kubernetes with Dynamic Termination Log Integration

2024-04-24 Thread Swathi C
Sure Ahmed and Martijn.
Capturing Flink job-specific failures and writing them to the
termination log is definitely a sub-task of the pluggable enricher work, as
we can leverage the pluggable enricher to achieve this.
But CRUD-level failures, which are mainly used to signal that the
JobManager has failed, might not go through the pluggable enricher. So,
please let us know whether that part needs to be a separate FLIP, or whether
we can combine it under the pluggable enricher as well (by adding another
sub-task).

Regards,
Swathi C

On Wed, Apr 24, 2024 at 3:46 PM Ahmed Hamdy  wrote:

> Hi,
> I agree with the Martijn, We can reformulate the FLIP to introduce
> termination log as supported pluggable enricher. If you believe the scope
> of work is a subset (Further implementation) we can just add a Jira ticket
> for it. IMO this will also help with implementation taking the existing
> enrichers into reference.
> Best Regards
> Ahmed Hamdy
>
>
> On Tue, 23 Apr 2024 at 15:23, Martijn Visser 
> wrote:
>
> > From a procedural point of view, we shouldn't make FLIPs sub-tasks for
> > existing FLIPs that have been voted/are released. That will only cause
> > confusion down the line. A new FLIP should take existing functionality
> > (like FLIP-304) into account, and propose how to improve on what that
> > original FLIP has introduced or how you're going to leverage what's
> already
> > there.
> >
> > On Tue, Apr 23, 2024 at 11:42 AM ramkrishna vasudevan <
> > ramvasu.fl...@gmail.com> wrote:
> >
> > > Hi Gyula and Ahmed,
> > >
> > > I totally agree that there is an interlap in the final goal that both
> the
> > > FLIPs are achieving here and infact FLIP-304 is more comprehensive for
> > job
> > > failures.
> > >
> > > But as a proposal to move forward can we make Swathi's FLIP/JIRA as a
> sub
> > > task for FLIP-304 and continue with the PR since the main aim is to get
> > the
> > > cluster failure pushed to the termination log for K8s based
> deployments.
> > > And once it is completed we can work to make FLIP-304 to support job
> > > failure propagation to termination log?
> > >
> > > Regards
> > > Ram
> > >
> > > On Thu, Apr 18, 2024 at 10:07 PM Swathi C 
> > > wrote:
> > >
> > > > Hi Gyula and  Ahmed,
> > > >
> > > > Thanks for reviewing this.
> > > >
> > > > @gyula.f...@gmail.com  , currently since our
> aim
> > > as
> > > > part of this FLIP was only to fail the cluster when job manager/flink
> > has
> > > > issues such that the cluster would no longer be usable, hence, we
> > > proposed
> > > > only related to that.
> > > > Your right, that it covers only job main class errors, job manager
> run
> > > time
> > > > failures, if the Job manager wants to write any metadata to any other
> > > > system ( ABFS, S3 , ... )  and the job failures will not be covered.
> > > >
> > > > FLIP-304 is mainly used to provide Failure enrichers for job
> failures.
> > > > Since, this FLIP is mainly for flink Job manager failures, let us
> know
> > if
> > > > we can leverage the goodness of both and try to extend FLIP-304 and
> add
> > > our
> > > > plugin implementation to cover the job level issues ( propagate this
> > info
> > > > to the /dev/termination-log such that, the container status reports
> it
> > > for
> > > > flink on K8S by implementing Failure Enricher interface and
> > > > processFailure() to do this ) and use this FLIP proposal for generic
> > > flink
> > > > cluster (Job manager/cluster ) failures.
> > > >
> > > > Regards,
> > > > Swathi C
> > > >
> > > > On Thu, Apr 18, 2024 at 7:36 PM Ahmed Hamdy 
> > > wrote:
> > > >
> > > > > Hi Swathi!
> > > > > Thanks for the proposal.
> > > > > Could you please elaborate what this FLIP offers more than
> > Flip-304[1]?
> > > > > Flip 304 proposes a Pluggable mechanism for enriching Job failures,
> > If
> > > I
> > > > am
> > > > > not mistaken this proposal looks like a subset of it.
> > > > >
> > > > > 1-
> > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-304%3A+Pluggable+Failure+Enrichers
> > > > >
> > > > > Best Regards
> > > > > Ahmed Hamdy
> > > > >
> > > > >
> > > > > On Thu, 18 Apr 2024 at 08:23, Gyula Fóra 
> > wrote:
> > > > >
> > > > > > Hi Swathi!
> > > > > >
> > > > > > Thank you for creating this proposal. I really like the general
> > idea
> > > of
> > > > > > increasing the K8s native observability of Flink job errors.
> > > > > >
> > > > > > I took a quick look at your reference PR, the termination log
> > related
> > > > > logic
> > > > > > is contained completely in the ClusterEntrypoint. What type of
> > errors
> > > > > will
> > > > > > this actually cover?
> > > > > >
> > > > > > To me this seems to cover only:
> > > > > >  - Job main class errors (ie startup errors)
> > > > > >  - JobManager failures
> > > > > >
> > > > > > Would regular job errors (that cause only job failover but not JM
> > > > errors)
> > > > > > be reported somehow with this plugin?
> > > > > >
> > > > > > Thanks
> > > > > > Gyula
> > > > > >
> > > > > >

Re: [ DISCUSS ] FLIP-XXX : [Plugin] Enhancing Flink Failure Management in Kubernetes with Dynamic Termination Log Integration

2024-04-24 Thread Ahmed Hamdy
Hi,
I agree with Martijn. We can reformulate the FLIP to introduce the
termination log as a supported pluggable enricher. If you believe the scope
of work is a subset (further implementation), we can just add a Jira ticket
for it. IMO this will also help with the implementation, since the existing
enrichers can be used as a reference.
Best Regards
Ahmed Hamdy


On Tue, 23 Apr 2024 at 15:23, Martijn Visser 
wrote:

> From a procedural point of view, we shouldn't make FLIPs sub-tasks for
> existing FLIPs that have been voted/are released. That will only cause
> confusion down the line. A new FLIP should take existing functionality
> (like FLIP-304) into account, and propose how to improve on what that
> original FLIP has introduced or how you're going to leverage what's already
> there.
>
> On Tue, Apr 23, 2024 at 11:42 AM ramkrishna vasudevan <
> ramvasu.fl...@gmail.com> wrote:
>
> > Hi Gyula and Ahmed,
> >
> > I totally agree that there is an interlap in the final goal that both the
> > FLIPs are achieving here and infact FLIP-304 is more comprehensive for
> job
> > failures.
> >
> > But as a proposal to move forward can we make Swathi's FLIP/JIRA as a sub
> > task for FLIP-304 and continue with the PR since the main aim is to get
> the
> > cluster failure pushed to the termination log for K8s based deployments.
> > And once it is completed we can work to make FLIP-304 to support job
> > failure propagation to termination log?
> >
> > Regards
> > Ram
> >
> > On Thu, Apr 18, 2024 at 10:07 PM Swathi C 
> > wrote:
> >
> > > Hi Gyula and  Ahmed,
> > >
> > > Thanks for reviewing this.
> > >
> > > @gyula.f...@gmail.com  , currently since our aim
> > as
> > > part of this FLIP was only to fail the cluster when job manager/flink
> has
> > > issues such that the cluster would no longer be usable, hence, we
> > proposed
> > > only related to that.
> > > Your right, that it covers only job main class errors, job manager run
> > time
> > > failures, if the Job manager wants to write any metadata to any other
> > > system ( ABFS, S3 , ... )  and the job failures will not be covered.
> > >
> > > FLIP-304 is mainly used to provide Failure enrichers for job failures.
> > > Since, this FLIP is mainly for flink Job manager failures, let us know
> if
> > > we can leverage the goodness of both and try to extend FLIP-304 and add
> > our
> > > plugin implementation to cover the job level issues ( propagate this
> info
> > > to the /dev/termination-log such that, the container status reports it
> > for
> > > flink on K8S by implementing Failure Enricher interface and
> > > processFailure() to do this ) and use this FLIP proposal for generic
> > flink
> > > cluster (Job manager/cluster ) failures.
> > >
> > > Regards,
> > > Swathi C
> > >
> > > On Thu, Apr 18, 2024 at 7:36 PM Ahmed Hamdy 
> > wrote:
> > >
> > > > Hi Swathi!
> > > > Thanks for the proposal.
> > > > Could you please elaborate what this FLIP offers more than
> Flip-304[1]?
> > > > Flip 304 proposes a Pluggable mechanism for enriching Job failures,
> If
> > I
> > > am
> > > > not mistaken this proposal looks like a subset of it.
> > > >
> > > > 1-
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-304%3A+Pluggable+Failure+Enrichers
> > > >
> > > > Best Regards
> > > > Ahmed Hamdy
> > > >
> > > >
> > > > On Thu, 18 Apr 2024 at 08:23, Gyula Fóra 
> wrote:
> > > >
> > > > > Hi Swathi!
> > > > >
> > > > > Thank you for creating this proposal. I really like the general
> idea
> > of
> > > > > increasing the K8s native observability of Flink job errors.
> > > > >
> > > > > I took a quick look at your reference PR, the termination log
> related
> > > > logic
> > > > > is contained completely in the ClusterEntrypoint. What type of
> errors
> > > > will
> > > > > this actually cover?
> > > > >
> > > > > To me this seems to cover only:
> > > > >  - Job main class errors (ie startup errors)
> > > > >  - JobManager failures
> > > > >
> > > > > Would regular job errors (that cause only job failover but not JM
> > > errors)
> > > > > be reported somehow with this plugin?
> > > > >
> > > > > Thanks
> > > > > Gyula
> > > > >
> > > > > On Tue, Apr 16, 2024 at 8:21 AM Swathi C <
> swathi.c.apa...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > I would like to start a discussion on FLIP-XXX : [Plugin]
> Enhancing
> > > > Flink
> > > > > > Failure Management in Kubernetes with Dynamic Termination Log
> > > > > Integration.
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1tWR0Fi3w7VQeD_9VUORh8EEOva3q-V0XhymTkNaXHOc/edit?usp=sharing
> > > > > >
> > > > > >
> > > > > > This FLIP proposes an improvement plugin and focuses mainly on
> > Flink
> > > on
> > > > > > K8S but can be used as a generic plugin and add further
> > enhancements.
> > > > > >
> > > > > > Looking forward to everyone's feedback and suggestions. Thank you
> > !!
> > > > > >
> > > > > > Best Regards,
> > > > > > Swathi Chandr

Re: [ DISCUSS ] FLIP-XXX : [Plugin] Enhancing Flink Failure Management in Kubernetes with Dynamic Termination Log Integration

2024-04-23 Thread Martijn Visser
From a procedural point of view, we shouldn't make FLIPs sub-tasks for
existing FLIPs that have been voted/are released. That will only cause
confusion down the line. A new FLIP should take existing functionality
(like FLIP-304) into account, and propose how to improve on what that
original FLIP has introduced or how you're going to leverage what's already
there.

On Tue, Apr 23, 2024 at 11:42 AM ramkrishna vasudevan <
ramvasu.fl...@gmail.com> wrote:

> Hi Gyula and Ahmed,
>
> I totally agree that there is an interlap in the final goal that both the
> FLIPs are achieving here and infact FLIP-304 is more comprehensive for job
> failures.
>
> But as a proposal to move forward can we make Swathi's FLIP/JIRA as a sub
> task for FLIP-304 and continue with the PR since the main aim is to get the
> cluster failure pushed to the termination log for K8s based deployments.
> And once it is completed we can work to make FLIP-304 to support job
> failure propagation to termination log?
>
> Regards
> Ram
>
> On Thu, Apr 18, 2024 at 10:07 PM Swathi C 
> wrote:
>
> > Hi Gyula and  Ahmed,
> >
> > Thanks for reviewing this.
> >
> > @gyula.f...@gmail.com  , currently since our aim
> as
> > part of this FLIP was only to fail the cluster when job manager/flink has
> > issues such that the cluster would no longer be usable, hence, we
> proposed
> > only related to that.
> > Your right, that it covers only job main class errors, job manager run
> time
> > failures, if the Job manager wants to write any metadata to any other
> > system ( ABFS, S3 , ... )  and the job failures will not be covered.
> >
> > FLIP-304 is mainly used to provide Failure enrichers for job failures.
> > Since, this FLIP is mainly for flink Job manager failures, let us know if
> > we can leverage the goodness of both and try to extend FLIP-304 and add
> our
> > plugin implementation to cover the job level issues ( propagate this info
> > to the /dev/termination-log such that, the container status reports it
> for
> > flink on K8S by implementing Failure Enricher interface and
> > processFailure() to do this ) and use this FLIP proposal for generic
> flink
> > cluster (Job manager/cluster ) failures.
> >
> > Regards,
> > Swathi C
> >
> > On Thu, Apr 18, 2024 at 7:36 PM Ahmed Hamdy 
> wrote:
> >
> > > Hi Swathi!
> > > Thanks for the proposal.
> > > Could you please elaborate what this FLIP offers more than Flip-304[1]?
> > > Flip 304 proposes a Pluggable mechanism for enriching Job failures, If
> I
> > am
> > > not mistaken this proposal looks like a subset of it.
> > >
> > > 1-
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-304%3A+Pluggable+Failure+Enrichers
> > >
> > > Best Regards
> > > Ahmed Hamdy
> > >
> > >
> > > On Thu, 18 Apr 2024 at 08:23, Gyula Fóra  wrote:
> > >
> > > > Hi Swathi!
> > > >
> > > > Thank you for creating this proposal. I really like the general idea
> of
> > > > increasing the K8s native observability of Flink job errors.
> > > >
> > > > I took a quick look at your reference PR, the termination log related
> > > logic
> > > > is contained completely in the ClusterEntrypoint. What type of errors
> > > will
> > > > this actually cover?
> > > >
> > > > To me this seems to cover only:
> > > >  - Job main class errors (ie startup errors)
> > > >  - JobManager failures
> > > >
> > > > Would regular job errors (that cause only job failover but not JM
> > errors)
> > > > be reported somehow with this plugin?
> > > >
> > > > Thanks
> > > > Gyula
> > > >
> > > > On Tue, Apr 16, 2024 at 8:21 AM Swathi C 
> > > > wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > I would like to start a discussion on FLIP-XXX : [Plugin] Enhancing
> > > Flink
> > > > > Failure Management in Kubernetes with Dynamic Termination Log
> > > > Integration.
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1tWR0Fi3w7VQeD_9VUORh8EEOva3q-V0XhymTkNaXHOc/edit?usp=sharing
> > > > >
> > > > >
> > > > > This FLIP proposes an improvement plugin and focuses mainly on
> Flink
> > on
> > > > > K8S but can be used as a generic plugin and add further
> enhancements.
> > > > >
> > > > > Looking forward to everyone's feedback and suggestions. Thank you
> !!
> > > > >
> > > > > Best Regards,
> > > > > Swathi Chandrashekar
> > > > >
> > > >
> > >
> >
>


Re: [ DISCUSS ] FLIP-XXX : [Plugin] Enhancing Flink Failure Management in Kubernetes with Dynamic Termination Log Integration

2024-04-23 Thread ramkrishna vasudevan
Hi Gyula and Ahmed,

I totally agree that there is an overlap in the final goal that both
FLIPs are trying to achieve, and in fact FLIP-304 is more comprehensive for
job failures.

But as a proposal to move forward, can we make Swathi's FLIP/JIRA a sub-task
of FLIP-304 and continue with the PR, since the main aim is to get the
cluster failure pushed to the termination log for K8s-based deployments?
And once that is completed, we can work on making FLIP-304 support job
failure propagation to the termination log.

Regards
Ram

On Thu, Apr 18, 2024 at 10:07 PM Swathi C  wrote:

> Hi Gyula and  Ahmed,
>
> Thanks for reviewing this.
>
> @gyula.f...@gmail.com  , currently since our aim as
> part of this FLIP was only to fail the cluster when job manager/flink has
> issues such that the cluster would no longer be usable, hence, we proposed
> only related to that.
> Your right, that it covers only job main class errors, job manager run time
> failures, if the Job manager wants to write any metadata to any other
> system ( ABFS, S3 , ... )  and the job failures will not be covered.
>
> FLIP-304 is mainly used to provide Failure enrichers for job failures.
> Since, this FLIP is mainly for flink Job manager failures, let us know if
> we can leverage the goodness of both and try to extend FLIP-304 and add our
> plugin implementation to cover the job level issues ( propagate this info
> to the /dev/termination-log such that, the container status reports it for
> flink on K8S by implementing Failure Enricher interface and
> processFailure() to do this ) and use this FLIP proposal for generic flink
> cluster (Job manager/cluster ) failures.
>
> Regards,
> Swathi C
>
> On Thu, Apr 18, 2024 at 7:36 PM Ahmed Hamdy  wrote:
>
> > Hi Swathi!
> > Thanks for the proposal.
> > Could you please elaborate what this FLIP offers more than Flip-304[1]?
> > Flip 304 proposes a Pluggable mechanism for enriching Job failures, If I
> am
> > not mistaken this proposal looks like a subset of it.
> >
> > 1-
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-304%3A+Pluggable+Failure+Enrichers
> >
> > Best Regards
> > Ahmed Hamdy
> >
> >
> > On Thu, 18 Apr 2024 at 08:23, Gyula Fóra  wrote:
> >
> > > Hi Swathi!
> > >
> > > Thank you for creating this proposal. I really like the general idea of
> > > increasing the K8s native observability of Flink job errors.
> > >
> > > I took a quick look at your reference PR, the termination log related
> > logic
> > > is contained completely in the ClusterEntrypoint. What type of errors
> > will
> > > this actually cover?
> > >
> > > To me this seems to cover only:
> > >  - Job main class errors (ie startup errors)
> > >  - JobManager failures
> > >
> > > Would regular job errors (that cause only job failover but not JM
> errors)
> > > be reported somehow with this plugin?
> > >
> > > Thanks
> > > Gyula
> > >
> > > On Tue, Apr 16, 2024 at 8:21 AM Swathi C 
> > > wrote:
> > >
> > > > Hi All,
> > > >
> > > > I would like to start a discussion on FLIP-XXX : [Plugin] Enhancing
> > Flink
> > > > Failure Management in Kubernetes with Dynamic Termination Log
> > > Integration.
> > > >
> > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1tWR0Fi3w7VQeD_9VUORh8EEOva3q-V0XhymTkNaXHOc/edit?usp=sharing
> > > >
> > > >
> > > > This FLIP proposes an improvement plugin and focuses mainly on Flink
> on
> > > > K8S but can be used as a generic plugin and add further enhancements.
> > > >
> > > > Looking forward to everyone's feedback and suggestions. Thank you !!
> > > >
> > > > Best Regards,
> > > > Swathi Chandrashekar
> > > >
> > >
> >
>


Re: [ DISCUSS ] FLIP-XXX : [Plugin] Enhancing Flink Failure Management in Kubernetes with Dynamic Termination Log Integration

2024-04-18 Thread Swathi C
Hi Gyula and Ahmed,

Thanks for reviewing this.

@gyula.f...@gmail.com, our aim in this FLIP was only to fail the cluster
when the JobManager/Flink has issues that leave the cluster unusable, so
the proposal is limited to that case.
You're right that it covers only job main class errors and JobManager
runtime failures, such as those that occur when the JobManager writes
metadata to another system (ABFS, S3, ...). Regular job failures are not
covered.

FLIP-304 mainly provides failure enrichers for job failures. Since this
FLIP mainly targets Flink JobManager failures, let us know if we can
leverage the best of both:
 - extend FLIP-304 with our plugin implementation to cover job-level
   issues, by implementing the FailureEnricher interface and its
   processFailure() method to propagate this information to
   /dev/termination-log, so that the container status reports it for
   Flink on K8s
 - use this FLIP proposal for generic Flink cluster (JobManager/cluster)
   failures
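
As a rough illustration of the "propagate this info to /dev/termination-log" step, here is a minimal sketch of the file-writing part only. It is not taken from the reference PR or from FLIP-304: the class TerminationLogWriter and its methods are hypothetical names, and it deliberately avoids Flink types so it compiles on its own. A FailureEnricher's processFailure() could, in principle, delegate to something like it. It assumes the Kubernetes default termination message path and the documented 4096-byte limit per container.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

/** Hypothetical helper (assumption): not part of Flink or of the reference PR. */
final class TerminationLogWriter {

    /** Kubernetes limits a container's termination message to 4096 bytes. */
    static final int MAX_BYTES = 4096;

    private TerminationLogWriter() {}

    /** Builds a one-line summary from the failure and, if different, its root cause. */
    static String summarize(Throwable failure) {
        Throwable root = failure;
        while (root.getCause() != null && root.getCause() != root) {
            root = root.getCause();
        }
        String summary = failure.getClass().getName() + ": " + failure.getMessage();
        if (root != failure) {
            summary += " (root cause: " + root.getClass().getName()
                    + ": " + root.getMessage() + ")";
        }
        return summary;
    }

    /** Encodes as UTF-8 and cuts to at most maxBytes without splitting a character. */
    static byte[] truncateUtf8(String text, int maxBytes) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxBytes) {
            return bytes;
        }
        int end = maxBytes;
        // bytes[end] is the first dropped byte; back up while it is a continuation byte.
        while (end > 0 && (bytes[end] & 0xC0) == 0x80) {
            end--;
        }
        return Arrays.copyOf(bytes, end);
    }

    /** Writes the truncated summary to the given path, replacing any previous content. */
    static void write(Path target, Throwable failure) throws IOException {
        Files.write(
                target,
                truncateUtf8(summarize(failure), MAX_BYTES),
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        // Inside a pod the target would be /dev/termination-log; a temp file is used
        // here so the sketch also runs outside a container.
        Path target = args.length > 0
                ? Path.of(args[0])
                : Files.createTempFile("termination", ".log");
        write(target, new IllegalStateException(
                "JobManager failed", new IOException("S3 unreachable")));
        System.out.println(Files.readString(target));
    }
}
```

In a pod, passing /dev/termination-log as the argument would make the summary show up in the container's last terminated state. Whether this belongs in a FailureEnricher or in the ClusterEntrypoint is exactly the open question in this thread.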

Regards,
Swathi C

On Thu, Apr 18, 2024 at 7:36 PM Ahmed Hamdy  wrote:

> Hi Swathi!
> Thanks for the proposal.
> Could you please elaborate what this FLIP offers more than Flip-304[1]?
> Flip 304 proposes a Pluggable mechanism for enriching Job failures, If I am
> not mistaken this proposal looks like a subset of it.
>
> 1-
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-304%3A+Pluggable+Failure+Enrichers
>
> Best Regards
> Ahmed Hamdy


Re: [ DISCUSS ] FLIP-XXX : [Plugin] Enhancing Flink Failure Management in Kubernetes with Dynamic Termination Log Integration

2024-04-18 Thread Ahmed Hamdy
Hi Swathi!
Thanks for the proposal.
Could you please elaborate on what this FLIP offers beyond FLIP-304 [1]?
FLIP-304 proposes a pluggable mechanism for enriching job failures. If I am
not mistaken, this proposal looks like a subset of it.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-304%3A+Pluggable+Failure+Enrichers

Best Regards
Ahmed Hamdy


On Thu, 18 Apr 2024 at 08:23, Gyula Fóra  wrote:

> Hi Swathi!
>
> Thank you for creating this proposal. I really like the general idea of
> increasing the K8s native observability of Flink job errors.
>
> I took a quick look at your reference PR, the termination log related logic
> is contained completely in the ClusterEntrypoint. What type of errors will
> this actually cover?
>
> To me this seems to cover only:
>  - Job main class errors (ie startup errors)
>  - JobManager failures
>
> Would regular job errors (that cause only job failover but not JM errors)
> be reported somehow with this plugin?
>
> Thanks
> Gyula


Re: [ DISCUSS ] FLIP-XXX : [Plugin] Enhancing Flink Failure Management in Kubernetes with Dynamic Termination Log Integration

2024-04-18 Thread Gyula Fóra
Hi Swathi!

Thank you for creating this proposal. I really like the general idea of
increasing the K8s-native observability of Flink job errors.

I took a quick look at your reference PR. The termination-log logic is
contained entirely in the ClusterEntrypoint. What types of errors will
this actually cover?

To me this seems to cover only:
 - Job main class errors (i.e. startup errors)
 - JobManager failures

Would regular job errors (those that cause only a job failover, not JM
errors) be reported somehow with this plugin?

Thanks
Gyula

On Tue, Apr 16, 2024 at 8:21 AM Swathi C  wrote:

> Hi All,
>
> I would like to start a discussion on FLIP-XXX : [Plugin] Enhancing Flink
> Failure Management in Kubernetes with Dynamic Termination Log Integration.
>
>
> https://docs.google.com/document/d/1tWR0Fi3w7VQeD_9VUORh8EEOva3q-V0XhymTkNaXHOc/edit?usp=sharing
>
>
> This FLIP proposes an improvement plugin. It focuses mainly on Flink on
> K8s, but it can be used as a generic plugin and extended with further
> enhancements.
>
> Looking forward to everyone's feedback and suggestions. Thank you !!
>
> Best Regards,
> Swathi Chandrashekar
>