Hi everyone,
Disclaimer: I read the contribution guide on improvement requests (i.e. I
should really just open a Jira ticket), but I thought it would make sense
to run this through the mailing list first. After collecting some input, I
would then create the Jira ticket.
When accessing the Flink Web Dashboard (which I do almost every day to
check the status of a job), I recently felt that the information shown in
the top portion of the start page could be improved considerably. I created
a first mock-up by moving HTML elements around and wanted to share it now:
[image: image.png]
With the exception of the metrics (see below), none of this information
should be new; it is only reorganized to speed up investigation and
monitoring:
- A complete overview of the cluster status and health, without clicking
through a lot of pages:
   - Active and stand-by Job Managers. Their health is also shown as a
   color (as a first suggestion: green if the last heartbeat was received
   within heartbeat.timeout).
   - Currently registered Task Managers:
      - The little bar on the side indicates task slot usage. I did not
      color it, since a fully utilised Task Manager is not necessarily
      something bad.
      - The color indicates the health of the Task Manager (as a first
      suggestion: last heartbeat was received within heartbeat.timeout).
- An overview of some cluster metrics.
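To make the suggested health rule concrete, here is a minimal sketch of how the color and the slot usage bar could be derived. It is only an illustration: the `TaskManagerStatus` shape, the function names, and the two-color scheme are assumptions made for this example, and the 50-second timeout is just a placeholder for whatever heartbeat.timeout is set to on the cluster.

```python
from dataclasses import dataclass

# Placeholder value for illustration; in practice this would be read from
# the cluster's heartbeat.timeout setting.
HEARTBEAT_TIMEOUT_MS = 50_000


@dataclass
class TaskManagerStatus:
    """Hypothetical shape of the data the overview would need per Task Manager."""
    id: str
    last_heartbeat_ms: int  # timestamp of the last heartbeat, in epoch millis
    slots_total: int
    slots_free: int


def health_color(last_heartbeat_ms: int, now_ms: int,
                 timeout_ms: int = HEARTBEAT_TIMEOUT_MS) -> str:
    """Return "green" if the last heartbeat is within the timeout, else "red"."""
    return "green" if now_ms - last_heartbeat_ms <= timeout_ms else "red"


def slot_usage(tm: TaskManagerStatus) -> float:
    """Fraction of task slots in use (0.0 to 1.0), i.e. the length of the side bar."""
    if tm.slots_total == 0:
        return 0.0
    return (tm.slots_total - tm.slots_free) / tm.slots_total
```

The same `health_color` check would apply to Job Managers. Slot usage is deliberately kept separate from health, since a fully used Task Manager is not a problem by itself.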
Some points to note:
- All data in the screenshot is mock data; no number relates to any other
number. The colors, however, should already match the numbers they
indicate.
- All of this could also be done with whatever monitoring solution someone
already has in their company, by reading out the JMX metrics and plotting
them there (e.g. in Grafana). But an out-of-the-box solution would save
everyone from building it themselves, and they could trust the metrics
shown here.
- Some of the metrics can only be implemented once FLINK-7286
<https://issues.apache.org/jira/browse/FLINK-7286> is done. So I would
split the implementation into two parts (cluster overview and metrics) and
do them separately.
- This first mock-up is targeted at what we here at Zalando would like to
see at first glance, so it fits our use case very well. We mostly use
long-running session clusters.
- I'm more of a backend guy with some frontend expertise (mostly in React;
I have no experience with Angular 1, which the Flink Web Dashboard is
currently built with) and not at all a designer.
What do you think? I would be glad to get some feedback on this,
especially on whether it makes sense for the broader community. I will
implement this one way or another: if not in the Flink master branch, then
as an open-source project that anyone can deploy next to their Flink
clusters. But I first wanted to run it through here to see whether it
sparks any interest. Please also let me know if you already see
difficulties in implementing this; maybe I have overlooked something.
Can't wait for your input.
Cheers
--
*Fabian Wollert*
*Zalando SE*
E-Mail: [email protected]