Github user rmetzger commented on the pull request:

https://github.com/apache/flink/pull/421#issuecomment-75545711

Thanks everybody for the positive feedback!

> What does the OS load mean? It would be really awesome to show the CPU load, too. I think this is a helpful indicator.

On the OS load: http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages

I totally agree that the OS load is not a very good metric for our purposes. The reason I didn't try to get better metrics is that I didn't want to play "ugly tricks" to get them. My code gets the metrics only via the management beans. The `OperatingSystemMXBean` exposes only the load average and the number of processor cores: http://docs.oracle.com/javase/7/docs/api/java/lang/management/OperatingSystemMXBean.html#getSystemLoadAverage()

There is an extended variant of the `OperatingSystemMXBean` in `com.sun.management` (https://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html) which also exposes things like `getProcessCpuLoad()`. But the availability of this management bean depends on the JVM vendor and version.

Another way to get the CPU load of the process would be to parse the output of `ps` or `top`. But that also falls into the category of "ugly tricks".

I think we should aim to get those metrics into the system as well. Adding them is a matter of registering another Gauge in the TaskManager's metrics registry and visualizing the JSON output. I hope that these kinds of refinements will be done by external contributors. Once this PR has been merged, I'll file a JIRA to improve the CPU monitoring.

> What are the current options for showing the detailed metrics? I see a "show 3 TMs" and "show all TMs" button in the screenshot? Can you select which three to show?

No, you cannot choose which three TMs. I added these buttons because on a large Flink cluster (50+ nodes), updating all the charts puts quite some load on the browser. Usually it's sufficient to monitor the load of only a few TMs, because (ideally) they are all doing mostly the same work. But I agree that there is room for improvement.

> How about we open a document and sketch the design of the monitoring and create smaller PRs to get there step-by-step.

I totally agree that we should make small, incremental improvements. As I said in the PR description, the primary purpose of this PR is to get the basic monitoring infrastructure in place; how we present the data in the end is subject to further PRs.

I have started working on the "per-job" monitoring and found that I have to change some details of this PR as well. Depending on my progress on the "per-job" monitoring, I might contribute those changes here together with the "per-job" metrics. If I don't have enough time this week to open a PR for the per-job metrics, I'll merge this change to master.
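
For reference, here is a minimal sketch of reading the two management beans mentioned above. It is not the code in this PR: the class name `CpuMetricsSketch` and its method names are made up for illustration. It assumes the extended `com.sun.management` bean may be missing, so it looks it up reflectively and falls back to `-1` when it is not there.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.lang.reflect.Method;

// Hypothetical helper class, for illustration only.
class CpuMetricsSketch {

    /** System load average for the last minute; negative if the platform cannot report it. */
    static double systemLoadAverage() {
        return ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
    }

    /** Number of processors visible to the JVM. */
    static int availableProcessors() {
        return ManagementFactory.getOperatingSystemMXBean().getAvailableProcessors();
    }

    /**
     * CPU load of this JVM process, taken from the com.sun.management extension
     * if the running JVM provides it. Returns -1 when the extension is missing.
     */
    static double processCpuLoad() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        try {
            // Look the interface up by name so this class still loads on JVMs without it.
            Class<?> ext = Class.forName("com.sun.management.OperatingSystemMXBean");
            if (ext.isInstance(os)) {
                Method m = ext.getMethod("getProcessCpuLoad");
                return (Double) m.invoke(os);
            }
        } catch (ReflectiveOperationException | RuntimeException e) {
            // Extension not present or not accessible: fall through to the default.
        }
        return -1.0;
    }

    public static void main(String[] args) {
        System.out.println("load average:     " + systemLoadAverage());
        System.out.println("processors:       " + availableProcessors());
        System.out.println("process CPU load: " + processCpuLoad());
    }
}
```

Each of these methods returns a plain number, so any of them could back an additional gauge. How that gauge would be registered depends on the metrics registry used in the TaskManager, which is not shown here.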