Thanks Wilfred for the proposal.

I agree with the overall approach. Combining this with what Weiwei said,
the desired responsibilities of a generic 3rd party app management plugin
(only the Spark operator plugin for now) can be summarized as follows (a
rough sketch follows the list):

* It will only react to lifecycle events (Add, Update, Delete, etc.) for
its CRD objects; it will not react to lifecycle events for pods, which are
left for the general plugin to take care of.
* For the same underlying workload (e.g. a Spark job), the 3rd party
plugin should see the same app ID as the general plugin (i.e. no two
plugins should think of the same workload as two different apps).
* The state of an application will be determined by the 3rd party plugin
(e.g. the Spark operator plugin, not the general plugin, will determine
the current state of a Spark job).
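
To make the split concrete, below is a rough, hypothetical sketch of these
responsibilities in Go. The type and method names (SparkApplication,
sparkOperatorPlugin, addApplication, etc.) are illustrative stand-ins, not
the real YuniKorn appmgmt interfaces; the point is only that the plugin
reacts to CRD events, shares the pods' app ID, and owns the job state:

package main

import "fmt"

// SparkApplication stands in for the Spark operator CRD object.
type SparkApplication struct {
    Name      string
    Namespace string
    // Assumption: the operator stamps the same ID on the CRD that it puts
    // on the driver and executor pods, so both plugins resolve to one app.
    AppID string
    State string // e.g. SUBMITTED, RUNNING, COMPLETED
}

// sparkOperatorPlugin reacts only to CRD lifecycle events; pod events are
// left to the general plugin.
type sparkOperatorPlugin struct {
    apps map[string]string // appID -> state, owned by this plugin
}

func (p *sparkOperatorPlugin) addApplication(app *SparkApplication) {
    p.apps[app.AppID] = app.State
    fmt.Printf("registered app %s in state %s\n", app.AppID, app.State)
}

func (p *sparkOperatorPlugin) updateApplication(_, updated *SparkApplication) {
    // the CRD, not the pods, is the source of truth for the job state
    p.apps[updated.AppID] = updated.State
}

func (p *sparkOperatorPlugin) deleteApplication(app *SparkApplication) {
    delete(p.apps, app.AppID)
}

func main() {
    p := &sparkOperatorPlugin{apps: map[string]string{}}
    app := &SparkApplication{Name: "spark-pi", Namespace: "default",
        AppID: "spark-000001", State: "SUBMITTED"}
    p.addApplication(app)

    running := *app
    running.State = "RUNNING"
    p.updateApplication(app, &running)
    fmt.Println("state owned by the plugin:", p.apps[app.AppID])
}

The general plugin would keep handling the pod add/update/delete events for
the same app ID, so both sides converge on a single application.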

Chaoran



On Mon, Mar 29, 2021 at 1:55 PM Weiwei Yang <w...@apache.org> wrote:

> Hi Wilfred
>
> The original idea was to have each app mgmt plugin, e.g. the Spark
> operator plugin, manage a certain type of app's lifecycle independently.
> That means each pod on K8s will only be seen and monitored by one app
> mgmt plugin. The problems we found earlier arose because this idea was
> violated: both the general and the Spark plugin reacted to the same set
> of Spark pods. This is a bit different from your proposal; could you
> please take a look?
>
>
> On Mon, Mar 29, 2021 at 3:34 AM Wilfred Spiegelenburg
> <wilfr...@apache.org> wrote:
>
> > Hi,
> >
> > Based on testing performed around gang scheduling and the Spark
> > operator by Bowen Li and Chaoran Yu, we found that the behaviour around
> > the operator was far from optimal. YUNIKORN-558
> > <https://issues.apache.org/jira/browse/YUNIKORN-558> was logged to help
> > with the integration.
> > We did not put any development or test time into making sure the
> > operator and gang scheduling worked. The behaviour that was observed
> > was not linked to gang scheduling but to the generic way the operator
> > implementation works in YuniKorn.
> >
> > The current Spark operator plugin, implemented in
> > pkg/appmgmt/sparkoperator/spark.go, listens to the Spark CRD
> > add/update/delete events. Each CRD is then converted into an
> > application inside YuniKorn and processed. The pods created by the
> > Spark operator form the other half of the application. However, the CRD
> > has its own application ID, and the application ID for the Spark pods
> > (drivers and executors) is different.
> >
> > This leaves us with two applications in the system: one without pods (CRD
> > based) and one with pods (the real workload). The real workload pods have
> > an owner reference set to the CRD. Having two applications for one
> > real workload is confusing: it does not display correctly in the UI and
> > causes all kinds of issues on completion and on recovery after a
> > restart.
> >
> > The proposal is now to merge the two objects into one application
> > inside YuniKorn. The CRD can still be used to track updates and provide
> > events for scheduling etc. The "ApplicationID" set on the driver and
> > executor pods should be used to track this application.
> > The owner reference allows linking the real pods back to the CRD. The
> > CRD will be used to provide the life cycle tracking and to act as an
> > event collector.
> >
> > All these changes do require rework on the app management side. I hope
> > the proposal sounds like the correct way forward. This same CRD-based
> > mechanism also seems to fit the way the Flink operator works.
> > Please provide some feedback on this proposal. Implementation would
> > require changes in app management and the related unit tests. Recovery
> > and gang scheduling tests should also be covered under this change.
> >
> > Wilfred
> >
>
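
For completeness, here is a small sketch of the owner-reference link that
Wilfred describes, i.e. resolving a driver or executor pod back to its CRD
so both end up under one application ID. The types and the "applicationId"
label key below are stand-ins chosen for illustration, not the real
Kubernetes or YuniKorn APIs:

package main

import "fmt"

// OwnerRef mimics the owner reference that a driver/executor pod carries.
type OwnerRef struct {
    Kind string
    Name string
}

// Pod is a minimal stand-in for a Spark driver or executor pod.
type Pod struct {
    Name     string
    Labels   map[string]string
    OwnerRef *OwnerRef
}

// resolveAppID returns the single application ID shared by the CRD and its
// pods. crdAppIDs maps CRD names to the app ID registered by the CRD plugin.
func resolveAppID(pod *Pod, crdAppIDs map[string]string) (string, bool) {
    if pod.OwnerRef != nil && pod.OwnerRef.Kind == "SparkApplication" {
        if id, ok := crdAppIDs[pod.OwnerRef.Name]; ok {
            return id, true // linked back to the CRD: one application
        }
    }
    // fall back to the (hypothetical) ID label set on the pod itself
    if id, ok := pod.Labels["applicationId"]; ok {
        return id, true
    }
    return "", false
}

func main() {
    crdAppIDs := map[string]string{"spark-pi": "spark-000001"}
    driver := &Pod{
        Name:     "spark-pi-driver",
        Labels:   map[string]string{"applicationId": "spark-000001"},
        OwnerRef: &OwnerRef{Kind: "SparkApplication", Name: "spark-pi"},
    }
    id, _ := resolveAppID(driver, crdAppIDs)
    fmt.Println("application ID:", id) // spark-000001
}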
