Re: [DISCUSS] FLIP-241: Completed Jobs Information Enhancement

Xintong Song Fri, 24 Jun 2022 01:27:27 -0700

I see. So you are suggesting that the jobmanager support both /foo/bar and
/jobs/:jobid/foo/bar, while the history server supports only the latter.


I was initially thinking that having two APIs in the jobmanager serving the
exact same purpose is a bit tricky. Now I think it's a good point that these
two APIs, despite returning the same results today, can return different
things in the future.
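
To make the two URL shapes concrete, here is a minimal sketch in Python. It is
purely illustrative and not Flink code: the route tables and the `match`
helper are hypothetical names, and the environment endpoint stands in for
/foo/bar.

```python
import re

# Hypothetical route tables; the real handlers live in Flink's REST layer.
JOBMANAGER_ROUTES = ["/jobmanager/environment", "/jobs/:jobid/environment"]
HISTORY_SERVER_ROUTES = ["/jobs/:jobid/environment"]


def match(routes, path):
    """Return (route, params) for the first route template matching path, else None."""
    for route in routes:
        # Turn ":name" segments into named capture groups.
        pattern = re.sub(r":(\w+)", r"(?P<\1>[^/]+)", route)
        m = re.fullmatch(pattern, path)
        if m:
            return route, m.groupdict()
    return None
```

With these tables, the jobmanager resolves both forms, while the history
server resolves only the job-scoped one and returns None for
/jobmanager/environment.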

Junhan & Yangze, WDYT?

Best,

Xintong



On Fri, Jun 24, 2022 at 3:10 PM Chesnay Schepler <[email protected]> wrote:

> This is pretty simple to explain.
>
> "I want to know the environment the job ran in." ->
> /jobs/:jobid/environment
> "I want to know the environment the JM ran in." -> /jobmanager/environment
>
> It's less about the JobID being a parameter, and more of a way for them
> to better model the resource they are interested in.
>
> In the future we could consider the job environment endpoint to return
> not just the JM environment, but also those from the CLI/TMs.
>
> On 24/06/2022 06:37, Xintong Song wrote:
> > Whether the job ID is actually used in the end isn't visible after all.
> >
> > I'm not sure about this. E.g., for an empty session cluster, users have to
> > understand they don't need to provide an actual jobid for requesting
> > jobmanager information via rest.
> >
> > I believe both ways work. I think this is a trade off between a) explaining
> > to history server rest api users how the urls are different from jobmanager
> > and b) explaining to jobmanager rest api users why we need an unused jobid
> > for some of the cases. I'm leaning toward the current approach, because I'd
> > expect a smaller set of history server rest api users than (or even a
> > subset of) that of jobmanager.
> >
> > The plan is to document which URLs differ from the jobmanager's, and how,
> > in the history server page [1].
> >
> > Compatibility test indeed should be considered. Thanks for pointing it out.
> > Currently the compatibility of history server rest api is guaranteed by the
> > compatibility of jobmanager rest api. I think the only thing we need is to
> > make sure /foo/bar of jobmanager is identical to /jobs/:jobid/foo/bar of
> > history server. We can introduce an interface, as a subtype of JsonArchivist,
> > that archives the json with a path that includes the jobid. Then we can
> > test against all relevant handlers as implementations of this interface.
> >
> > WDYT?
> >
> > Best,
> >
> > Xintong
> >
> >
> > [1] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/advanced/historyserver/#available-requests
> >
> >
> >
> > On Thu, Jun 23, 2022 at 5:07 PM Chesnay Schepler <[email protected]> wrote:
> >
> >> The addition of the /jobs/:jobid/jobmanager/config / environment
> >> exclusively to the HS is a bit of a strange workaround.
> >> How do you intend to document those? (and test compatibility)?
> >>
> >> Why not just add a general /jobs/:jobid/environment endpoint that works
> >> just like jobmanager/environment.
> >> To me that seems like a cleaner solution.
> >> It is somewhat mentioned as an alternative in the FLIP, but I don't
> >> understand what is supposed to be confusing about it.
> >> Whether the job ID is actually used in the end isn't visible after all.
> >>
> >> /jobmanager/config could be integrated into /jobs/:jobid/config.
> >>
> >> The same approach could maybe be used for logs; not really sure yet (not
> >> a fan of displaying logs in the HS in the first place).
> >>
> >> On 23/06/2022 06:55, junhan yang wrote:
> >>> Hi all,
> >>>
> >>> Thank you all for your feedback. As far as I can see, the
> >>> discussion on this FLIP has converged.
> >>>
> >>> I will start a new vote thread now.
> >>>
> >>> Best regards,
> >>> Junhan
> >>>
> >>> On Fri, Jun 17, 2022 at 2:05 PM Yangze Guo <[email protected]> wrote:
> >>>
> >>>> Thanks for the input, Jiangang.
> >>>>
> >>>> I think it's a valid demand to distinguish completed jobs with the same
> >>>> name.
> >>>> - If they are different jobs, I think users need to give them
> >>>> different meaningful names respectively.
> >>>> - If they are exactly the same job, IIUC, what you need is to figure
> >>>> out the order. ApplicationId in Yarn might help. But in this case, you
> >>>> can just sort them with the start time.
> >>>>
> >>>> Best,
> >>>> Yangze Guo
> >>>>
> >>>> On Fri, Jun 17, 2022 at 12:13 PM Jiangang Liu <[email protected]> wrote:
> >>>>> Thanks for the FLIP. It is helpful to track detailed information for
> >>>>> completed jobs.
> >>>>>
> >>>>> I want to ask another question. In our environment, it is sometimes hard to
> >>>>> distinguish jobs, since the same job name may appear multiple times among
> >>>>> the completed jobs: a job may run multiple times, or different jobs may have
> >>>>> the same name. I wonder whether we can enhance the completed jobs
> >>>>> display with more information, such as the applicationId and application
> >>>>> name in Yarn. Identifying a job may work differently in k8s.
> >>>>>
> >>>>> Best
> >>>>> Jiangang Liu
> >>>>>
> >>>>> On Fri, Jun 17, 2022 at 11:40 AM Yangze Guo <[email protected]> wrote:
> >>>>>
> >>>>>> Thanks for the feedback, Aitozi and Jing.
> >>>>>>
> >>>>>>> Are each attempts of the TaskManager or JobManager pods (if failure
> >>>>>>> occurs) all be shown in the ui?
> >>>>>>
> >>>>>> The info of the prior execution attempts will be archived, you could
> >>>>>> refer to `ArchivedExecutionVertex$priorExecutions`.
> >>>>>>
> >>>>>>> It seems that most of these metrics are more interesting to batch jobs.
> >>>>>>> Does it make sense to calculate them for pure streaming jobs too?
> >>>>>>
> >>>>>> All the proposed metrics will be calculated no matter what the job type is.
> >>>>>>
> >>>>>>> Why "duration is less interesting" which is mentioned in the FLIP?
> >>>>>>
> >>>>>> As a first step, we mainly focus on the most interesting statuses during
> >>>>>> the job lifecycle. The duration of final states like FINISHED and
> >>>>>> CANCELED is meaningless, while abnormal conditions like CANCELING will
> >>>>>> not be included at the moment.
> >>>>>>
> >>>>>>> Could you share your thoughts on "accumulated-busy-time"? It should
> >>>>>>> describe the time while the task is working as expected, i.e. the happy
> >>>>>>> path. When do we need it for analytics or diagnosis?
> >>>>>>
> >>>>>> A task could be busy or idle while it is working. Users may adjust the
> >>>>>> parallelism or the partition key according to the ratio between them.
> >>>>>>
> >>>>>> Best,
> >>>>>> Yangze Guo
> >>>>>>
> >>>>>> On Fri, Jun 17, 2022 at 5:08 AM Jing Ge <[email protected]> wrote:
> >>>>>>> Hi Junhan
> >>>>>>>
> >>>>>>> This is must-have information for batch processing. Thanks for
> >>>>>>> bringing it up.
> >>>>>>>
> >>>>>>> I have some comments:
> >>>>>>>
> >>>>>>> 1. It seems that most of these metrics are more interesting to batch jobs.
> >>>>>>> Does it make sense to calculate them for pure streaming jobs too?
> >>>>>>> 2. Why "duration is less interesting" which is mentioned in the FLIP?
> >>>>>>> 3. Could you share your thoughts on "accumulated-busy-time"? It should
> >>>>>>> describe the time while the task is working as expected, i.e. the happy
> >>>>>>> path. When do we need it for analytics or diagnosis?
> >>>>>>>
> >>>>>>> BTW, you might want to optimize the format of the FLIP. Some text
> >>>>>>> runs past the right border of the wiki page.
> >>>>>>>
> >>>>>>> Best regards,
> >>>>>>> Jing
> >>>>>>>
> >>>>>>> On Thu, Jun 16, 2022 at 4:40 PM Aitozi <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Thanks Junhan for driving this. It is a great improvement for batch jobs.
> >>>>>>>> I'm looking forward to this feature in our internal use case. +1 for it.
> >>>>>>>>
> >>>>>>>> One more question:
> >>>>>>>>
> >>>>>>>> Will each attempt of the TaskManager or JobManager pods (if a failure
> >>>>>>>> occurs) be shown in the UI?
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Aitozi.
> >>>>>>>>
> >>>>>>>> On Thu, Jun 16, 2022 at 7:10 PM Yang Wang <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> Thanks Xintong for the explanation.
> >>>>>>>>>
> >>>>>>>>> It makes sense to leave the discussion about job result store in a
> >>>>>>>>> dedicated thread.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Yang
> >>>>>>>>>
> >>>>>>>>> On Thu, Jun 16, 2022 at 1:40 PM Xintong Song <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> My impression of JobResultStore is that it is more about fault tolerance
> >>>>>>>>>> and high availability. Using it for providing information to users sounds
> >>>>>>>>>> worth exploring. We probably need more time to think it through.
> >>>>>>>>>>
> >>>>>>>>>> Given that it doesn't conflict with what we have proposed in this FLIP,
> >>>>>>>>>> I'd suggest discussing it in a separate thread and excluding it from the
> >>>>>>>>>> scope of this one.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>>
> >>>>>>>>>> Xintong
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Jun 16, 2022 at 11:43 AM Yang Wang <[email protected]> wrote:
> >>>>>>>>>>> This is a very useful feature both for finished streaming and batch jobs.
> >>>>>>>>>>>
> >>>>>>>>>>> Besides the WebUI & REST API improvements, I am curious whether we could
> >>>>>>>>>>> also integrate some critical information (e.g. latest checkpoint) into the
> >>>>>>>>>>> job result store [1].
> >>>>>>>>>>> I just feel this is also somewhat related to "Completed Jobs
> >>>>>>>>>>> Information Enhancement".
> >>>>>>>>>>> And I think the history server is not necessary for all scenarios,
> >>>>>>>>>>> especially when users only want to check the job execution result.
> >>>>>>>>>>>
> >>>>>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-194%3A+Introduce+the+JobResultStore
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Yang
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Jun 15, 2022 at 3:37 PM Xintong Song <[email protected]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Thanks Junhan,
> >>>>>>>>>>>>
> >>>>>>>>>>>> +1 for the proposed improvements.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Xintong
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, Jun 15, 2022 at 3:16 PM Yangze Guo <[email protected]> wrote:
> >>>>>>>>>>>>> Thanks for driving this, Junhan.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I think it's a valuable usability improvement for both streaming and
> >>>>>>>>>>>>> batch users. Looking forward to the community feedback.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Best,
> >>>>>>>>>>>>> Yangze Guo
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, Jun 15, 2022 at 3:10 PM junhan yang <[email protected]> wrote:
> >>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I would like to open a discussion on FLIP-241: Completed Jobs
> >>>>>>>>>>>>>> Information Enhancement.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> As far as we can tell, streaming and batch users have different
> >>>>>>>>>>>>>> interests in probing a job. As Flink grows into a unified streaming &
> >>>>>>>>>>>>>> batch processor and is adopted by more and more batch users, the
> >>>>>>>>>>>>>> experience of inspecting completed jobs has become more and more
> >>>>>>>>>>>>>> important. After doing some market research, we have spotted several
> >>>>>>>>>>>>>> potential improvements.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The reason for this thread is that the improvements involve WebUI &
> >>>>>>>>>>>>>> REST API changes, which should be openly discussed and voted on as a
> >>>>>>>>>>>>>> FLIP.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> You can find more details in the FLIP-241 document [1]. Looking
> >>>>>>>>>>>>>> forward to your feedback.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> [1] https://cwiki.apache.org/confluence/x/dRD1D
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>> Junhan
> >>
> >>
>
>
