Re: [DISCUSS] FLIP-74: Flink JobClient API

Zili Chen Fri, 27 Sep 2019 00:16:51 -0700

A correction to my previous email: please modify

/we just shut down the cluster on the exit of the client that is running
inside the cluster/


to

we just shut down the cluster on both the exit of the client that is running
inside the cluster and the finish of the job.
Since the client is running inside the cluster, we can easily wait for the
end of both in ClusterEntrypoint.
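
A minimal sketch of "wait for the end of both", using only JDK futures. The
class and field names below are hypothetical and are not taken from the Flink
codebase; how ClusterEntrypoint would actually obtain these two signals is an
assumption left open here.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration only: not Flink's ClusterEntrypoint.
class JobClusterShutdownSketch {

    // Completed when the user main method (the client running inside
    // the cluster) returns.
    final CompletableFuture<Void> clientExited = new CompletableFuture<>();

    // Completed when the job reaches a terminal state.
    final CompletableFuture<Void> jobFinished = new CompletableFuture<>();

    // Stands in for the real cluster shutdown routine.
    final AtomicBoolean shutDown = new AtomicBoolean(false);

    // Runs the shutdown action only after BOTH futures have completed.
    final CompletableFuture<Void> shutdownFuture =
            CompletableFuture.allOf(clientExited, jobFinished)
                    .thenRun(() -> shutDown.set(true));

    public static void main(String[] args) {
        JobClusterShutdownSketch cluster = new JobClusterShutdownSketch();

        // The client exits first: the cluster must stay up.
        cluster.clientExited.complete(null);
        System.out.println("after client exit, shut down: " + cluster.shutDown.get());

        // The job finishes afterwards: now the cluster may shut down.
        cluster.jobFinished.complete(null);
        cluster.shutdownFuture.join();
        System.out.println("after job finish, shut down: " + cluster.shutDown.get());
    }
}
```

The order of the two events does not matter in this sketch; the shutdown
action fires once the later of the two completes.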


On Fri, Sep 27, 2019 at 3:13 PM Zili Chen <wander4...@gmail.com> wrote:

> About JobCluster
>
> Actually, I am not quite sure what we gain from the DETACHED configuration
> on the cluster side.
> In fact, we don't have a NON-DETACHED JobCluster in our codebase, right?
>
> This brings up one major question we have to answer first.
>
> *What JobCluster conceptually is exactly*
>
> Related discussion can be found in JIRA[1] and on the mailing list[2].
> Stephan gave a nice description of JobCluster:
>
> Two things to add: - The job mode is very nice in the way that it runs the
> client inside the cluster (in the same image/process that is the JM) and
> thus unifies both applications and what the Spark world calls the "driver
> mode". - Another thing I would add is that during the FLIP-6 design, we
> were thinking about setups where Dispatcher and JobManager are separate
> processes. A Yarn or Mesos Dispatcher of a session could run independently
> (even as privileged processes executing no code). Then the "per-job"
> mode could still be helpful: when a job is submitted to the dispatcher, it
> launches the JM again in a per-job mode, so that JM and TM processes are
> bound to the job only. For higher security setups, it is important that
> processes are not reused across jobs.
>
> However, currently in "per-job" mode we generate the JobGraph on the client
> side, launch the JobCluster, and have it retrieve the JobGraph for
> execution. So actually, we don't "run the client inside the cluster".
>
> Besides, referring to the discussion with Till[1], it would be helpful if,
> from the user's perspective, "per-job" mode followed the same process as
> session mode: we would not use OptimizedPlanEnvironment to create the
> JobGraph, but directly deploy the Flink cluster in env.execute.
>
> Generally, there are 2 points:
>
> 1. Run the Flink job by invoking the user main method and executing it
> throughout, instead of creating the JobGraph from the main class.
> 2. Run the client inside the cluster.
>
> If 1 and 2 are implemented, there is obviously no need for DETACHED mode on
> the cluster side, because we just shut down the cluster on the exit of the
> client that is running inside the cluster. Whether or not the result is
> delivered is up to user code.
>
> [1]
> https://issues.apache.org/jira/browse/FLINK-14051?focusedCommentId=16931388&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16931388
> [2]
> https://lists.apache.org/x/thread.html/e8f14a381be6c027e8945f884c3cfcb309ce49c1ba557d3749fca495@%3Cdev.flink.apache.org%3E
>
>
> On Fri, Sep 27, 2019 at 2:13 PM Zili Chen <wander4...@gmail.com> wrote:
>
>> Thanks for your replies Kostas & Aljoscha!
>>
>> Below are replies point by point.
>>
>> 1. For DETACHED mode, what I said there is about the DETACHED mode on the
>> client side.
>> There are two configurations that overload the term DETACHED[1].
>>
>> On the client side, it means whether or not client.submitJob blocks until
>> the job execution result is available.
>> Since client.submitJob returns CompletableFuture<JobClient>, NON-DETACHED
>> has no power at all. The caller of submitJob decides whether or not to
>> block to get the JobClient and request the job execution result. If the
>> client crashes, it is a user-scope exception that should be handled in
>> user code; if the client loses its connection to the cluster, we have
>> retry-count and retry-interval configurations, so it retries automatically
>> and throws a user-scope exception once the retries are exhausted.
>>
>> Your comment about polling for the job result sounds like a concern on
>> the cluster side.
>>
>> On the cluster side, DETACHED mode exists only in JobCluster. If DETACHED
>> is configured, JobCluster exits when the job finishes; if NON-DETACHED is
>> configured, JobCluster exits when the job execution result has been
>> delivered. FLIP-74 doesn't cover changes in this scope; it is left as is.
>>
>> However, it is an interesting part, and we can revisit this
>> implementation a bit.
>>
>> <see the next email for a compact reply on this point>
>>
>> 2. The retrieval of JobClient is so important that, if we don't have a way
>> to retrieve a JobClient, it is a dumb public user-facing interface (what a
>> strange state :P).
>>
>> About the retrieval of JobClient, as mentioned in the document, two ways
>> should be supported.
>>
>> (1). Retrieve it as the return value of job submission.
>> (2). Retrieve a JobClient of an existing job (with job id).
>>
>> I highly respect your thoughts about how Executors should be and your
>> thoughts on multi-layered clients.
>> Although (2) is not supported by public interfaces according to the
>> summary of the discussion above, we can discuss a bit the place of
>> Executors among multi-layered clients and find a way to retrieve the
>> JobClient of an existing job with the public client API. I will comment in
>> the FLIP-73 thread[2] since it is mostly about Executors.
>>
>> Best,
>> tison.
>>
>> [1]
>> https://docs.google.com/document/d/1E-8UjOLz4QPUTxetGWbU23OlsIH9VIdodpTsxwoQTs0/edit?disco=AAAADnLLvM8
>> [2]
>> https://lists.apache.org/x/thread.html/dc3a541709f96906b43df4155373af1cd09e08c3f105b0bd0ba3fca2@%3Cdev.flink.apache.org%3E
>>
>>
>>
>>
>> On Wed, Sep 25, 2019 at 9:29 PM Kostas Kloudas <kklou...@gmail.com> wrote:
>>
>>> Hi Tison,
>>>
>>> Thanks for the FLIP and launching the discussion!
>>>
>>> As a first note, big +1 on providing/exposing a JobClient to the users!
>>>
>>> Some points that would be nice to be clarified:
>>> 1) You mention that we can get rid of the DETACHED mode: I agree that
>>> at a high level, given that everything will now be asynchronous, there
>>> is no need to keep the DETACHED mode but I think we should specify
>>> some aspects. For example, without the explicit separation of the
>>> modes, what happens when the job finishes? Does the client always
>>> periodically poll for the result, or is the result pushed when in
>>> NON-DETACHED mode? What happens if the client disconnects and
>>> reconnects?
>>>
>>> 2) On the "how to retrieve a JobClient for a running Job", I think
>>> this is related to the other discussion you opened in the ML about
>>> multi-layered clients. First of all, I agree that exposing different
>>> "levels" of clients would be a nice addition, and actually there have
>>> been some discussions about doing so in the future. Now for this
>>> specific discussion:
>>>       i) I do not think that we should expose the
>>> ClusterDescriptor/ClusterSpecification to the user, as this ties us to
>>> a specific architecture which may change in the future.
>>>      ii) I do not think it should be the Executor that will provide a
>>> JobClient for an already running job (only for the Jobs that it
>>> submits). The job of the executor should just be to execute() a
>>> pipeline.
>>>      iii) I think a solution that respects the separation of concerns
>>> could be the addition of another component (in the future), something
>>> like a ClientFactory, or ClusterFactory that will have methods like:
>>> ClusterClient createCluster(Configuration), JobClient
>>> retrieveJobClient(Configuration , JobId), maybe even (although not
>>> sure) Executor getExecutor(Configuration ) and maybe more. This
>>> component would be responsible to interact with a cluster manager like
>>> Yarn and do what is now being done by the ClusterDescriptor plus some
>>> more stuff.
>>>
>>> Although under the hood all these abstractions (Environments,
>>> Executors, ...) use the same clients, I believe their
>>> job/existence is not contradictory; they simply hide some of the
>>> complexity from the user and give us, as developers, some freedom to
>>> change some of the parts in the future. For example, the executor will
>>> take a Pipeline, create a JobGraph and submit it, instead of requiring
>>> the user to do each step separately. This allows us to, for example,
>>> get rid of the Plan if in the future everything is DataStream.
>>> Essentially, I think of these as layers of an onion with the clients
>>> being close to the core. The higher you go, the more functionality is
>>> included and hidden from the public eye.
>>>
>>> Point iii) by the way is just a thought and by no means final. I also
>>> like the idea of multi-layered clients so this may spark up the
>>> discussion.
>>>
>>> Cheers,
>>> Kostas
>>>
>>> On Wed, Sep 25, 2019 at 2:21 PM Aljoscha Krettek <aljos...@apache.org>
>>> wrote:
>>> >
>>> > Hi Tison,
>>> >
>>> > Thanks for proposing the document! I had some comments on the document.
>>> >
>>> > I think the only complex thing that we still need to figure out is how
>>> > to get a JobClient for a job that is already running, as you mentioned
>>> > in the document. Currently I'm thinking that it's ok to add a method to
>>> > Executor for retrieving a JobClient for a running job by providing an
>>> > ID. Let's see what Kostas has to say on the topic.
>>> >
>>> > Best,
>>> > Aljoscha
>>> >
>>> > > On 25. Sep 2019, at 12:31, Zili Chen <wander4...@gmail.com> wrote:
>>> > >
>>> > > Hi all,
>>> > >
>>> > > As a summary of the discussion about introducing a Flink JobClient
>>> > > API[1], we drafted FLIP-74[2] to gather thoughts and work towards a
>>> > > standard public user-facing interface.
>>> > >
>>> > > This discussion thread aims at standardizing the job-level client
>>> > > API. But I'd like to emphasize that how to retrieve a JobClient
>>> > > possibly leads to further discussion on the different levels of
>>> > > clients exposed by Flink, so a follow-up thread will be started later
>>> > > to coordinate FLIP-73 and FLIP-74 on the exposure issue.
>>> > >
>>> > > Looking forward to your opinions.
>>> > >
>>> > > Best,
>>> > > tison.
>>> > >
>>> > > [1]
>>> > > https://lists.apache.org/thread.html/ce99cba4a10b9dc40eb729d39910f315ae41d80ec74f09a356c73938@%3Cdev.flink.apache.org%3E
>>> > > [2]
>>> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-74%3A+Flink+JobClient+API
>>> >
>>>
>>
