Hello everyone,

I am trying to enhance Flink's monitoring capabilities along the lines of the GSoC 2014 proposal by Rajika Kumarasiri [1].
Short abstract: he suggested using the Java standard for this, the Java Management Extensions (JMX). The idea is to run an MBean server in the JobManager, so that the JobManager itself and all TaskManagers in the cluster can register their MBeans with this server via RMI. Different monitoring levels (none, standard, full) limit the impact on system performance. The JMX service should be accessible from an improved web component through a RESTful API.

He also suggested using the SIGAR [2] JNI library to gather system information. In my opinion this point is debatable. The JDK already ships Platform MXBeans [3] (part of java.lang.management since Java 5), which cover the basic system information, so in my eyes a JNI library might be a little overkill. But of course this depends on the intended depth of monitoring.

So the primary question: which parameters, system properties, utilizations, and workloads should be monitored, in your opinion?

Have a nice weekend!
Nils

[1] https://github.com/stratosphere/stratosphere/wiki/GSoC-2014-Project-Proposal-Draft-by-Rajika-Kumarasiri
[2] https://support.hyperic.com/display/SIGAR/Home
[3] https://docs.oracle.com/javase/7/docs/technotes/guides/management/overview.html
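
P.S. To make the Platform MXBeans point a bit more concrete, here is a minimal sketch of the kind of basic system information that can be read from the JVM without any JNI library. The class name PlatformMetricsSketch and the metric keys are made up for illustration and are not part of Flink; only the java.lang.management calls are standard JDK API. Anything beyond this (for example per-disk or per-NIC statistics) is not covered here and is where a library like SIGAR might still be needed.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.OperatingSystemMXBean;
import java.lang.management.RuntimeMXBean;
import java.lang.management.ThreadMXBean;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only: not an existing Flink class.
public class PlatformMetricsSketch {

    /** Collects a few basic OS and JVM metrics using only the standard Platform MXBeans. */
    public static Map<String, Object> snapshot() {
        Map<String, Object> metrics = new LinkedHashMap<String, Object>();

        // Operating system: name, architecture, CPU count, load average.
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        metrics.put("os.name", os.getName());
        metrics.put("os.arch", os.getArch());
        metrics.put("os.availableProcessors", os.getAvailableProcessors());
        // Negative if the platform cannot report a load average (e.g. Windows).
        metrics.put("os.systemLoadAverage", os.getSystemLoadAverage());

        // Heap usage of the current JVM.
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        metrics.put("heap.usedBytes", heap.getUsed());
        metrics.put("heap.committedBytes", heap.getCommitted());
        // -1 if the maximum heap size is undefined.
        metrics.put("heap.maxBytes", heap.getMax());

        // Number of live threads (daemon and non-daemon).
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        metrics.put("threads.live", threads.getThreadCount());

        // JVM uptime in milliseconds.
        RuntimeMXBean runtime = ManagementFactory.getRuntimeMXBean();
        metrics.put("jvm.uptimeMillis", runtime.getUptime());

        return metrics;
    }

    public static void main(String[] args) {
        // Print one "key = value" line per metric.
        for (Map.Entry<String, Object> entry : snapshot().entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

Since the platform MXBeans are already registered with the JVM's platform MBean server, the same values should also be visible to any JMX client (e.g. JConsole) without extra code; the snapshot map above is just one convenient way to feed them into a web component.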