Hi, Igniters! I'm working on some metrics related aspects and it seems that we have missed some points regarding the metrics management. There are at least two major problems from my point of view.
Problem definition. --------------------------- 1. There is no unified approach to adding metrics during development. It would be grate to have one common place where developer should add new metrics or can find answer for question "What kind of metrics do we have for smth?". I talk only about Ignite internal metrics, not user's metrics. Now the logic is spread throughout the code base and there is now any way to find proper place where particular metric could be added to the registry. Moreover, we can create registry in one place and add some set of metrics to the registry in another place and we have many examples of it. Of course, metrics API allows it and it is meaningful drawback from my point view. MetricRegistry is mutable and could be modified at any place of the code and any time of program execution. It allows developer follow the simplest way which is usually wrong. Also GridMetricManager.registry() method has two responsibilities: it creates and lookups registries. When I see registry() method call in the code I can't assume what exactly intention had developer (is it first creation of registry or modification of existing registry). 2. There is no unified approach to enabling/disabling metrics. Here I mean both problems: runtime enabling/disabling and code design that should allow do it in proper way. There is JIRA issue [1] about it. Now some metrics can be enabled/disabled in static configuration and at runtime (cache metrics for example) and other don't have such functionality. It should be unified. Any metrics set (eq. MetricRegistry in current implementation) should have one unified approach for enabling/disabling. Also we should provide infrastructure for it as part of metrics framework. Proposed solution. --------------------------- 1. Introduce new interface, e.g. MetriŃSource that is responsible for metrics life cycle for one metrics source (e.g. cache metrics, system metrics, etc.) Each component with metrics should register own implementation of this interface in metrics manager (GridMetricManager). Interface declares the following methods: - String name() - returns sources name that will be available in special MBean responsible for access for metrics manager. - MetricRegistry enable() - returns MetricRegistry instance with allready registered metrics. Metric registry has the same name as name returned by MetricSource.name() method. This method can be invoked on node start (if metrics are statically enabled) or as result of metric enabling via MBean. As result all instance of MetricSource will have all metric instances initialized and MetricRegistry must be added to the registries map by GridMetricManager. - void disable() - Should be invoked after removing corresponding MetricRegistry from GridMetricManager.registries. The responsibility is destroying of all metric instances (just assign null). This method can be invoked during stopping component (e.g. cache stop, index removing) or as result of metric disabling via MBean. - boolean enabled() - Actual state for current implemntation. NOTE 1: Adding/removing metric registries in GridMetricManager should be performed with copy-on-write semantic in order to avoid possible data races in related threads (e.g. exporters). It also allows avoid data copy during exporting (just use link to the registries). Also we can avoid using ConcurrentHashMap in such case. But of course additional copies will be created per each registries map modification. But we assume that it isn't frequent operation. NOTE 2: MetricSource interface specific implementation is also responsible for keeping direct references to metric instances and also for functionality related with metric values calculation (including checking whether metrics are enabled or not). NOTE 3: GridMetricManager is responsible for registration/deregistration of MBeans corresponding to each metrics source. GridMetricManager is responsible for proper actions sequence (e.g. we should first create metric registry and after it add it to registries collection). At this point we have: - Way to find out all metrics in the system just navigating throughout all implementations of MetricSource interface. - Unified approach to metric creation. - Unified approach to metric enabling/disabling at runtime. 2. Introduce MetricRegistryBuilder that returns immutable instance of MetricRegistry. It will force developer to write code related with metric creation at exactly one place. It also allows to avoid MetricRegistry modification at runtime. Also we should remove GridMetricRegistry.registry() method and introduce two new methods: register(MetricRegistry) and unregister(String registryName). Each method should assert that no old value for register() case or registry was deleted for unregister case (we have some problems with it now, e.g. for cache with near cache configuration). At this point we have some additional guaranties of proper metrics framework usage. Such design also will allow optimize metric exporters because it can sure that metric registry will not modified at runtime with out removing and conseqent adding registry to the GridMetricManager.registires map. 3. Introduce MetricManagementMBean that returns list of all metric sources with boolean flag that indicates whether metrics are enabled or disabled for this source. List of all sources should be supported by GridMetricManager (so metricSources collection should be added and accessors). This MBean also should provide methods for enabling and disabling metrics. At this point we have interface for enabling/disabling metrics via JMX and basis for implementation other tools (e.g. we can manage metrics via control.sh after additional development). All current metrics related code should be refactored accordingly this proposal and it is most complex and time consuming part. I'm ready to do all this things. Also anybody can implement and contribute this. But first I would like to hear your thoughts and ideas. Thanks! [1] https://issues.apache.org/jira/browse/IGNITE-11927