I did not realize there was a plan to deprecate anything in the REST API. The REST API is super important for tooling written in non-JVM languages that does not include a Flink client (like FlinkK8sOperator). The REST API should continue to support all job management operations, including job submission.
Thomas

On Sun, Sep 29, 2019 at 1:37 PM Konstantin Knauf <konstan...@ververica.com> wrote:

> Hi Zili,
>
> thanks for working on this topic. Just read through the FLIP, and I have two
> questions:
>
> * should we add "cancelWithSavepoint" to a new public API when we have
> deprecated the corresponding REST API/CLI methods? In my understanding,
> there is no reason to use it anymore.
> * should we call "stopWithSavepoint" simply "stop", as "stop" always
> performs a savepoint?
>
> Best,
>
> Konstantin
>
> On Fri, Sep 27, 2019 at 10:48 AM Aljoscha Krettek <aljos...@apache.org> wrote:
>
> > Hi Flavio,
> >
> > I agree that this would be good to have. But I also think that this is
> > outside the scope of FLIP-74; I think it is an orthogonal feature.
> >
> > Best,
> > Aljoscha
> >
> > > On 27. Sep 2019, at 10:31, Flavio Pompermaier <pomperma...@okkam.it> wrote:
> > >
> > > Hi all,
> > > just a remark about the Flink REST API (and its client as well): almost
> > > every time we need a way to dynamically know which jobs are contained in
> > > a jar file, and this could be exposed by the REST endpoint under
> > > /jars/:jarid/entry-points (a simple way to implement this would be to
> > > check the value of Main-Class or Main-Classes inside the Manifest of the
> > > jar, if they exist [1]).
> > >
> > > I understand that this is something that is not strictly required to
> > > execute Flink jobs, but IMHO it would ease A LOT the work of UI
> > > developers, who could have a way to show users all available jobs inside
> > > a jar plus their configurable parameters.
> > > For example, right now in the WebUI, you can upload a jar and then you
> > > have to set (without any autocomplete or UI support) the main class and
> > > its params (for example using a string like --param1 xx --param2 yy).
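[A JDK-only sketch of the Manifest check Flavio describes. Note that `Main-Classes` is the multi-entry attribute proposed in FLINK-10864, not a standard Manifest attribute, and `EntryPointScanner` is a made-up name, not actual Flink code:]

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

public class EntryPointScanner {

    /** Collects candidate entry-point classes from the jar's Manifest. */
    static List<String> entryPoints(String jarPath) throws IOException {
        List<String> result = new ArrayList<>();
        try (JarFile jar = new JarFile(jarPath)) {
            Manifest manifest = jar.getManifest();
            if (manifest == null) {
                return result; // no Manifest, nothing declared
            }
            Attributes attrs = manifest.getMainAttributes();
            // Standard single-entry attribute.
            String mainClass = attrs.getValue(Attributes.Name.MAIN_CLASS);
            if (mainClass != null) {
                result.add(mainClass.trim());
            }
            // Hypothetical multi-entry attribute from the FLINK-10864 proposal:
            // a comma- or whitespace-separated list of entry-point classes.
            String mainClasses = attrs.getValue("Main-Classes");
            if (mainClasses != null) {
                for (String cls : mainClasses.split("[,\\s]+")) {
                    if (!cls.isEmpty() && !result.contains(cls)) {
                        result.add(cls);
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway jar whose Manifest declares two entry points.
        Manifest m = new Manifest();
        m.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
        m.getMainAttributes().put(Attributes.Name.MAIN_CLASS, "com.example.WordCount");
        m.getMainAttributes().putValue("Main-Classes", "com.example.WordCount, com.example.Ping");
        File f = File.createTempFile("demo", ".jar");
        f.deleteOnExit();
        new JarOutputStream(new FileOutputStream(f), m).close();
        System.out.println(entryPoints(f.getPath()));
    }
}
```

[A REST handler behind /jars/:jarid/entry-points would only need to run such a scan on the uploaded jar and serialize the list; the typed-parameter part of the proposal would need additional metadata beyond the Manifest.]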
> > > Adding this functionality to the REST API and the respective client would
> > > enable the WebUI (and all UIs interacting with a Flink cluster) to prefill
> > > a dropdown list containing the list of entry-point classes (i.e. Flink
> > > jobs) and, once selected, their required (typed) parameters.
> > >
> > > Best,
> > > Flavio
> > >
> > > [1] https://issues.apache.org/jira/browse/FLINK-10864
> > >
> > > On Fri, Sep 27, 2019 at 9:16 AM Zili Chen <wander4...@gmail.com> wrote:
> > >
> > >> modify
> > >>
> > >> /we just shut down the cluster on the exit of the client running inside
> > >> the cluster/
> > >>
> > >> to
> > >>
> > >> /we just shut down the cluster on both the exit of the client running
> > >> inside the cluster and the finish of the job/.
> > >> Since the client runs inside the cluster, we can easily wait for both in
> > >> ClusterEntrypoint.
> > >>
> > >> Zili Chen <wander4...@gmail.com> wrote on Fri, Sep 27, 2019 at 3:13 PM:
> > >>
> > >>> About JobCluster
> > >>>
> > >>> Actually, I am not quite sure what we gain from the DETACHED
> > >>> configuration on the cluster side.
> > >>> We don't in fact have a NON-DETACHED JobCluster in our codebase, right?
> > >>>
> > >>> This brings up one major question we have to answer first:
> > >>>
> > >>> *What JobCluster conceptually is, exactly.*
> > >>>
> > >>> Related discussion can be found in the JIRA[1] and the mailing list[2].
> > >>> Stephan gives a nice description of JobCluster:
> > >>>
> > >>> Two things to add: - The job mode is very nice in the way that it runs
> > >>> the client inside the cluster (in the same image/process that is the JM)
> > >>> and thus unifies both applications and what the Spark world calls the
> > >>> "driver mode". - Another thing I would add is that during the FLIP-6
> > >>> design, we were thinking about setups where Dispatcher and JobManager
> > >>> are separate processes.
> > >>> A Yarn or Mesos Dispatcher of a session could run independently
> > >>> (even as privileged processes executing no code). Then the "per-job"
> > >>> mode could still be helpful: when a job is submitted to the dispatcher,
> > >>> it launches the JM again in a per-job mode, so that JM and TM processes
> > >>> are bound to the job only. For higher-security setups, it is important
> > >>> that processes are not reused across jobs.
> > >>>
> > >>> However, currently in "per-job" mode we generate the JobGraph on the
> > >>> client side, launch the JobCluster, and retrieve the JobGraph for
> > >>> execution. So actually, we don't "run the client inside the cluster".
> > >>>
> > >>> Besides, referring to the discussion with Till[1], it would be helpful
> > >>> if we followed the same process as session mode for "per-job" mode from
> > >>> the user's perspective, i.e. we don't use OptimizedPlanEnvironment to
> > >>> create the JobGraph, but directly deploy the Flink cluster in
> > >>> env.execute.
> > >>>
> > >>> Generally, 2 points:
> > >>>
> > >>> 1. Run the Flink job by invoking the user's main method and executing it
> > >>> throughout, instead of creating the JobGraph from the main class.
> > >>> 2. Run the client inside the cluster.
> > >>>
> > >>> If 1 and 2 are implemented, there is obviously no need for DETACHED mode
> > >>> on the cluster side, because we just shut down the cluster on the exit
> > >>> of the client running inside the cluster. Whether or not the result is
> > >>> delivered is up to user code.
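[The "shut down the cluster once the embedded client has exited" idea, as corrected earlier in the thread to wait for both the client's exit and the job's finish, can be sketched with two futures joined in the entrypoint. All names here are illustrative, not the actual ClusterEntrypoint code:]

```java
import java.util.concurrent.CompletableFuture;

public class EntrypointSketch {

    /**
     * Completes (triggering cluster shutdown) only after BOTH the embedded
     * client has exited and the job has reached a terminal state.
     */
    static CompletableFuture<String> shutdownWhenBothDone(
            CompletableFuture<?> clientExited, CompletableFuture<?> jobFinished) {
        return CompletableFuture.allOf(clientExited, jobFinished)
                .thenApply(ignored -> "cluster shut down");
    }

    public static void main(String[] args) {
        CompletableFuture<Void> clientExited = new CompletableFuture<>();
        CompletableFuture<Void> jobFinished = new CompletableFuture<>();

        CompletableFuture<String> shutdown = shutdownWhenBothDone(clientExited, jobFinished);

        clientExited.complete(null);           // user main() returned...
        System.out.println(shutdown.isDone()); // false: the job is still running
        jobFinished.complete(null);            // ...and now the job has finished
        System.out.println(shutdown.join());
    }
}
```

[Because the shutdown future waits on `allOf`, the order in which the client exits and the job finishes does not matter, which is exactly why no separate DETACHED flag is needed on the cluster side in this model.]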
> > >>>
> > >>> [1] https://issues.apache.org/jira/browse/FLINK-14051?focusedCommentId=16931388&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16931388
> > >>> [2] https://lists.apache.org/x/thread.html/e8f14a381be6c027e8945f884c3cfcb309ce49c1ba557d3749fca495@%3Cdev.flink.apache.org%3E
> > >>>
> > >>> Zili Chen <wander4...@gmail.com> wrote on Fri, Sep 27, 2019 at 2:13 PM:
> > >>>
> > >>>> Thanks for your replies, Kostas & Aljoscha!
> > >>>>
> > >>>> Below are replies point by point.
> > >>>>
> > >>>> 1. For DETACHED mode, what I said there is about the DETACHED mode on
> > >>>> the client side. There are two configurations that overload the item
> > >>>> DETACHED[1].
> > >>>>
> > >>>> On the client side, it means whether or not client.submitJob blocks on
> > >>>> the job execution result. Since client.submitJob returns
> > >>>> CompletableFuture<JobClient>, NON-DETACHED has no power at all. The
> > >>>> caller of submitJob decides whether or not to block to get the
> > >>>> JobClient and request the job execution result. If the client crashes,
> > >>>> it is a user-scope exception that should be handled in user code; if
> > >>>> the client loses its connection to the cluster, we have retry-times and
> > >>>> retry-interval configurations that retry automatically and throw a
> > >>>> user-scope exception when exceeded.
> > >>>>
> > >>>> Your comment about polling for the result, or the job result, sounds
> > >>>> like a concern on the cluster side.
> > >>>>
> > >>>> On the cluster side, DETACHED mode is alive only in JobCluster. If
> > >>>> DETACHED is configured, the JobCluster exits when the job finishes; if
> > >>>> NON-DETACHED is configured, the JobCluster exits when the job execution
> > >>>> result has been delivered. FLIP-74 doesn't propose changes in this
> > >>>> scope; it is just left as is.
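[The client-side point above, that a `submitJob` returning `CompletableFuture<JobClient>` makes a separate NON-DETACHED mode meaningless, can be sketched with plain JDK types. All names below are illustrative stand-ins, not the actual FLIP-74 signatures:]

```java
import java.util.UUID;
import java.util.concurrent.CompletableFuture;

public class SubmitSketch {

    /** Illustrative stand-in for the FLIP-74 JobClient (not the real interface). */
    interface JobClient {
        String jobId();
        CompletableFuture<String> requestJobResult();
    }

    /** Toy submission: always asynchronous, like the proposed client.submitJob. */
    static CompletableFuture<JobClient> submitJob(String jobName) {
        String id = UUID.randomUUID().toString();
        return CompletableFuture.completedFuture(new JobClient() {
            public String jobId() { return id; }
            public CompletableFuture<String> requestJobResult() {
                // A real client would poll or await the cluster here.
                return CompletableFuture.completedFuture(jobName + ": FINISHED");
            }
        });
    }

    public static void main(String[] args) {
        // "Detached" usage: take the JobClient and never touch the result.
        submitJob("job-a")
                .thenAccept(c -> System.out.println("submitted " + c.jobId()))
                .join();

        // "Attached" usage: the exact same API; the caller simply chooses to wait.
        String result = submitJob("job-b")
                .thenCompose(JobClient::requestJobResult)
                .join();
        System.out.println(result); // prints "job-b: FINISHED"
    }
}
```

[Both call sites use the same method, so attachment becomes a per-caller choice rather than a submission-time mode, which is the argument being made in the email above.]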
> > >>>>
> > >>>> However, it is an interesting part; we can revisit this implementation
> > >>>> a bit.
> > >>>>
> > >>>> <see the next email for a compact reply on this one>
> > >>>>
> > >>>> 2. The retrieval of JobClient is so important that if we don't have a
> > >>>> way to retrieve a JobClient, it is a dumb public user-facing interface
> > >>>> (what a strange state :P).
> > >>>>
> > >>>> About the retrieval of JobClient, as mentioned in the document, two
> > >>>> ways should be supported:
> > >>>>
> > >>>> (1). Retrieved as the return type of job submission.
> > >>>> (2). Retrieve a JobClient of an existing job (with job id).
> > >>>>
> > >>>> I highly respect your thoughts about how Executors should be, and your
> > >>>> thoughts on multi-layered clients. Although (2) is not supported by
> > >>>> public interfaces per the summary of the discussion above, we can
> > >>>> discuss a bit the place of Executors in multi-layered clients and find
> > >>>> a way to retrieve the JobClient of an existing job with the public
> > >>>> client API. I will comment in the FLIP-73 thread[2] since it is almost
> > >>>> all about Executors.
> > >>>>
> > >>>> Best,
> > >>>> tison.
> > >>>>
> > >>>> [1] https://docs.google.com/document/d/1E-8UjOLz4QPUTxetGWbU23OlsIH9VIdodpTsxwoQTs0/edit?disco=AAAADnLLvM8
> > >>>> [2] https://lists.apache.org/x/thread.html/dc3a541709f96906b43df4155373af1cd09e08c3f105b0bd0ba3fca2@%3Cdev.flink.apache.org%3E
> > >>>>
> > >>>> Kostas Kloudas <kklou...@gmail.com> wrote on Wed, Sep 25, 2019 at 9:29 PM:
> > >>>>
> > >>>>> Hi Tison,
> > >>>>>
> > >>>>> Thanks for the FLIP and launching the discussion!
> > >>>>>
> > >>>>> As a first note, big +1 on providing/exposing a JobClient to the users!
> > >>>>>
> > >>>>> Some points that would be nice to have clarified:
> > >>>>> 1) You mention that we can get rid of the DETACHED mode: I agree that,
> > >>>>> at a high level, given that everything will now be asynchronous, there
> > >>>>> is no need to keep the DETACHED mode, but I think we should specify
> > >>>>> some aspects. For example, without the explicit separation of the
> > >>>>> modes, what happens when the job finishes? Does the client always
> > >>>>> periodically poll for the result, or is the result pushed when in
> > >>>>> NON-DETACHED mode? What happens if the client disconnects and
> > >>>>> reconnects?
> > >>>>>
> > >>>>> 2) On "how to retrieve a JobClient for a running job", I think this is
> > >>>>> related to the other discussion you opened on the ML about
> > >>>>> multi-layered clients. First of all, I agree that exposing different
> > >>>>> "levels" of clients would be a nice addition, and actually there have
> > >>>>> been some discussions about doing so in the future. Now for this
> > >>>>> specific discussion:
> > >>>>>     i) I do not think that we should expose the
> > >>>>> ClusterDescriptor/ClusterSpecification to the user, as this ties us to
> > >>>>> a specific architecture which may change in the future.
> > >>>>>     ii) I do not think it should be the Executor that provides a
> > >>>>> JobClient for an already running job (only for the jobs that it
> > >>>>> submits). The job of the executor should just be to execute() a
> > >>>>> pipeline.
> > >>>>>     iii) I think a solution that respects the separation of concerns
> > >>>>> could be the addition of another component (in the future), something
> > >>>>> like a ClientFactory or ClusterFactory, that would have methods like:
> > >>>>> ClusterClient createCluster(Configuration), JobClient
> > >>>>> retrieveJobClient(Configuration, JobId), maybe even (although not
> > >>>>> sure) Executor getExecutor(Configuration), and maybe more.
> > >>>>> This component would be responsible for interacting with a cluster
> > >>>>> manager like Yarn and doing what is now being done by the
> > >>>>> ClusterDescriptor, plus some more stuff.
> > >>>>>
> > >>>>> Although under the hood all these abstractions (Environments,
> > >>>>> Executors, ...) use the same clients, I believe their jobs/existence
> > >>>>> are not contradictory: they simply hide some of the complexity from
> > >>>>> the user, and give us, as developers, some freedom to change some of
> > >>>>> the parts in the future. For example, the executor will take a
> > >>>>> Pipeline, create a JobGraph, and submit it, instead of requiring the
> > >>>>> user to do each step separately. This allows us to, for example, get
> > >>>>> rid of the Plan if in the future everything is DataStream.
> > >>>>> Essentially, I think of these as layers of an onion, with the clients
> > >>>>> close to the core. The higher you go, the more functionality is
> > >>>>> included and hidden from the public eye.
> > >>>>>
> > >>>>> Point iii), by the way, is just a thought and by no means final. I
> > >>>>> also like the idea of multi-layered clients, so this may spark up the
> > >>>>> discussion.
> > >>>>>
> > >>>>> Cheers,
> > >>>>> Kostas
> > >>>>>
> > >>>>> On Wed, Sep 25, 2019 at 2:21 PM Aljoscha Krettek <aljos...@apache.org> wrote:
> > >>>>>>
> > >>>>>> Hi Tison,
> > >>>>>>
> > >>>>>> Thanks for proposing the document! I had some comments on the document.
> > >>>>>>
> > >>>>>> I think the only complex thing that we still need to figure out is
> > >>>>>> how to get a JobClient for a job that is already running, as you
> > >>>>>> mentioned in the document. Currently I'm thinking that it's ok to add
> > >>>>>> a method to Executor for retrieving a JobClient for a running job by
> > >>>>>> providing an ID. Let's see what Kostas has to say on the topic.
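[Kostas's point iii) above, the separate factory component, could look roughly like the following sketch. `Configuration`, `ClusterClient`, and `JobClient` are stand-ins for the real Flink types, and none of these signatures are actual or final:]

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ClusterFactorySketch {

    // Illustrative stand-ins; these are NOT the real Flink classes.
    static class Configuration {}
    interface ClusterClient { String clusterId(); }
    interface JobClient { String jobId(); }

    /**
     * The component sketched in point iii): it owns cluster interaction, so
     * neither the Executor nor user code needs the ClusterDescriptor.
     */
    interface ClusterFactory {
        ClusterClient createCluster(Configuration config);
        JobClient retrieveJobClient(Configuration config, String jobId);
    }

    /** A toy in-memory implementation, only to show the shape of the API. */
    static class InMemoryClusterFactory implements ClusterFactory {
        private final Map<String, String> clusters = new ConcurrentHashMap<>();
        private int counter = 0;

        @Override
        public ClusterClient createCluster(Configuration config) {
            String id = "cluster-" + (++counter);
            clusters.put(id, "RUNNING");
            return () -> id;
        }

        @Override
        public JobClient retrieveJobClient(Configuration config, String jobId) {
            // A real implementation would first locate the cluster running jobId.
            return () -> jobId;
        }
    }

    public static void main(String[] args) {
        ClusterFactory factory = new InMemoryClusterFactory();
        ClusterClient cluster = factory.createCluster(new Configuration());
        JobClient job = factory.retrieveJobClient(new Configuration(), "job-42");
        System.out.println(cluster.clusterId() + " / " + job.jobId());
    }
}
```

[The design choice under discussion is visible in the interface itself: the Executor only ever produces JobClients for jobs it submits, while retrieval by ID lives in this separate component.]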
> > >>>>>>
> > >>>>>> Best,
> > >>>>>> Aljoscha
> > >>>>>>
> > >>>>>>> On 25. Sep 2019, at 12:31, Zili Chen <wander4...@gmail.com> wrote:
> > >>>>>>>
> > >>>>>>> Hi all,
> > >>>>>>>
> > >>>>>>> As a summary of the discussion about introducing a Flink JobClient
> > >>>>>>> API[1], we drafted FLIP-74[2] to gather thoughts and move towards
> > >>>>>>> standard, public, user-facing interfaces.
> > >>>>>>>
> > >>>>>>> This discussion thread aims at standardizing the job-level client
> > >>>>>>> API. But I'd like to emphasize that how to retrieve a JobClient
> > >>>>>>> possibly causes further discussion on the different levels of
> > >>>>>>> clients exposed by Flink, so a follow-up thread will be started
> > >>>>>>> later to coordinate FLIP-73 and FLIP-74 on the exposure issue.
> > >>>>>>>
> > >>>>>>> Looking forward to your opinions.
> > >>>>>>>
> > >>>>>>> Best,
> > >>>>>>> tison.
> > >>>>>>>
> > >>>>>>> [1] https://lists.apache.org/thread.html/ce99cba4a10b9dc40eb729d39910f315ae41d80ec74f09a356c73938@%3Cdev.flink.apache.org%3E
> > >>>>>>> [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-74%3A+Flink+JobClient+API
>
> --
> Konstantin Knauf | Solutions Architect
> +49 160 91394525
>
> Follow us @VervericaData Ververica <https://www.ververica.com/>
>
> --
> Join Flink Forward <https://flink-forward.org/> - The Apache Flink Conference
> Stream Processing | Event Driven | Real Time
>
> --
> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>
> --
> Ververica GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji (Tony) Cheng