Hi,

I agree with Martijn. We can reformulate the FLIP to introduce the termination log as a supported pluggable enricher. If you believe the scope of work is a subset (further implementation), we can just add a Jira ticket for it.

IMO this will also help with the implementation, since the existing enrichers can be used as a reference.

Best Regards
Ahmed Hamdy

On Tue, 23 Apr 2024 at 15:23, Martijn Visser <martijnvis...@apache.org> wrote:

> From a procedural point of view, we shouldn't make FLIPs sub-tasks for
> existing FLIPs that have been voted/are released. That will only cause
> confusion down the line. A new FLIP should take existing functionality
> (like FLIP-304) into account, and propose how to improve on what that
> original FLIP has introduced or how you're going to leverage what's
> already there.
>
> On Tue, Apr 23, 2024 at 11:42 AM ramkrishna vasudevan <ramvasu.fl...@gmail.com> wrote:
>
> > Hi Gyula and Ahmed,
> >
> > I totally agree that there is an overlap in the final goal that both
> > FLIPs are achieving here, and in fact FLIP-304 is more comprehensive
> > for job failures.
> >
> > But as a proposal to move forward, can we make Swathi's FLIP/JIRA a
> > sub-task of FLIP-304 and continue with the PR, since the main aim is
> > to get the cluster failure pushed to the termination log for K8s-based
> > deployments? And once it is completed, we can work to make FLIP-304
> > support job failure propagation to the termination log?
> >
> > Regards
> > Ram
> >
> > On Thu, Apr 18, 2024 at 10:07 PM Swathi C <swathi.c.apa...@gmail.com> wrote:
> >
> > > Hi Gyula and Ahmed,
> > >
> > > Thanks for reviewing this.
> > >
> > > @gyula.f...@gmail.com, our aim as part of this FLIP was only to fail
> > > the cluster when the job manager/Flink has issues such that the
> > > cluster would no longer be usable, so the proposal covers only that.
> > > You're right that it covers only job main class errors, job manager
> > > runtime failures, and the case where the job manager wants to write
> > > any metadata to another system (ABFS, S3, ...); job failures will
> > > not be covered.
> > >
> > > FLIP-304 is mainly used to provide failure enrichers for job failures.
> > > Since this FLIP is mainly for Flink job manager failures, let us know
> > > if we can leverage the goodness of both: extend FLIP-304 and add our
> > > plugin implementation to cover the job-level issues (propagate this
> > > info to /dev/termination-log, so that the container status reports it
> > > for Flink on K8s, by implementing the FailureEnricher interface and
> > > processFailure()), and use this FLIP proposal for generic Flink
> > > cluster (job manager/cluster) failures.
> > >
> > > Regards,
> > > Swathi C
> > >
> > > On Thu, Apr 18, 2024 at 7:36 PM Ahmed Hamdy <hamdy10...@gmail.com> wrote:
> > >
> > > > Hi Swathi!
> > > > Thanks for the proposal.
> > > > Could you please elaborate on what this FLIP offers beyond
> > > > FLIP-304 [1]? FLIP-304 proposes a pluggable mechanism for enriching
> > > > job failures. If I am not mistaken, this proposal looks like a
> > > > subset of it.
> > > >
> > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-304%3A+Pluggable+Failure+Enrichers
> > > >
> > > > Best Regards
> > > > Ahmed Hamdy
> > > >
> > > > On Thu, 18 Apr 2024 at 08:23, Gyula Fóra <gyula.f...@gmail.com> wrote:
> > > >
> > > > > Hi Swathi!
> > > > >
> > > > > Thank you for creating this proposal. I really like the general
> > > > > idea of increasing the K8s native observability of Flink job
> > > > > errors.
> > > > >
> > > > > I took a quick look at your reference PR; the termination log
> > > > > related logic is contained completely in the ClusterEntrypoint.
> > > > > What type of errors will this actually cover?
> > > > >
> > > > > To me this seems to cover only:
> > > > > - Job main class errors (i.e. startup errors)
> > > > > - JobManager failures
> > > > >
> > > > > Would regular job errors (that cause only job failover but not
> > > > > JM errors) be reported somehow with this plugin?
> > > > >
> > > > > Thanks
> > > > > Gyula
> > > > >
> > > > > On Tue, Apr 16, 2024 at 8:21 AM Swathi C <swathi.c.apa...@gmail.com> wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > I would like to start a discussion on FLIP-XXX : [Plugin]
> > > > > > Enhancing Flink Failure Management in Kubernetes with Dynamic
> > > > > > Termination Log Integration.
> > > > > >
> > > > > > https://docs.google.com/document/d/1tWR0Fi3w7VQeD_9VUORh8EEOva3q-V0XhymTkNaXHOc/edit?usp=sharing
> > > > > >
> > > > > > This FLIP proposes an improvement plugin and focuses mainly on
> > > > > > Flink on K8s, but it can be used as a generic plugin and extended
> > > > > > with further enhancements.
> > > > > >
> > > > > > Looking forward to everyone's feedback and suggestions. Thank
> > > > > > you!!
> > > > > >
> > > > > > Best Regards,
> > > > > > Swathi Chandrashekar
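
For illustration only, the enricher-style approach discussed in this thread (a hook with a processFailure() method that propagates the failure to /dev/termination-log) might look roughly like the sketch below. It is not taken from the FLIP or the reference PR. FailureHook is a hypothetical stand-in interface so the snippet compiles without Flink on the classpath; it is not Flink's actual FailureEnricher interface, and the class name TerminationLogHook is also an assumption. The 4096-byte cap reflects the Kubernetes limit on a container's termination message.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

/** Hypothetical stand-in for an enricher-style hook; NOT Flink's real interface. */
interface FailureHook {
    CompletableFuture<Map<String, String>> processFailure(Throwable cause);
}

/** Sketch: writes a short failure summary to the Kubernetes termination log. */
class TerminationLogHook implements FailureHook {
    /** Default Kubernetes termination message path, as mentioned in the thread. */
    static final Path DEFAULT_PATH = Paths.get("/dev/termination-log");
    /** Kubernetes truncates termination messages beyond 4096 bytes. */
    static final int MAX_BYTES = 4096;

    private final Path target;

    TerminationLogHook(Path target) {
        this.target = target;
    }

    /** Builds "<exception class>: <message>" and caps it at MAX_BYTES. */
    static byte[] summarize(Throwable cause) {
        String text = cause.getClass().getName() + ": " + cause.getMessage();
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        // A cut on a byte boundary may split a trailing multi-byte character;
        // that is acceptable for a diagnostic message in this sketch.
        return Arrays.copyOf(bytes, Math.min(bytes.length, MAX_BYTES));
    }

    @Override
    public CompletableFuture<Map<String, String>> processFailure(Throwable cause) {
        try {
            // Overwrites any previous content of the target file.
            Files.write(target, summarize(cause));
        } catch (IOException e) {
            // Assumption: failing to write the log should not mask the
            // original failure, so the error is only reported on stderr.
            System.err.println("Could not write termination log: " + e);
        }
        // This sketch attaches no labels to the failure.
        return CompletableFuture.completedFuture(Collections.emptyMap());
    }
}
```

On a pod, the hook would be constructed with TerminationLogHook.DEFAULT_PATH; taking the path as a constructor argument keeps the sketch testable outside Kubernetes.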