Hi Nils, I'm not sure if its a good idea to use JMX. I fear that we are overengineering something here for features that we don't really need. I don't know any tools that can evaluate these JMX information (I think thats the main argument for using JMX). Also doing this kind of monitoring (connecting to JVMs running somewhere in a cluster from a local client) is often really complicated (firewalls, port forwardings, ssh, ...) that its probably to impractical. So users will probably only use the web interface. I think the main reason why we agreed to use JMX for that proposal was, that the student who proposed the topic knew the JMX system very well.
What I would do is the following: a) Work on the Akka branch of Till until its merged ( https://github.com/apache/incubator-flink/pull/149). There is no point in changing the current RPC / JobManager / TaskManager infrastructure if we are going to replace it very soon (I think its a matter of days). b) Just "piggyback" on the heartbeat that the TaskManagers are sending to the JobManager and include metrics there. Then, collect them at the JobManager and expose them via the web interface. c) Metrics: I would start with Garbage Collection statistics. Then: - "bytes in" per operator - input splits processed for DataSources - Current Iteration number / Avg Iteration time - Disk IO / Network IO stats. Let me know if you need more information. Best, Robert On Sun, Nov 23, 2014 at 11:28 PM, Ufuk Celebi <u...@apache.org> wrote: > > On 23 Nov 2014, at 00:03, Fabian Hueske <fhue...@apache.org> wrote: > > > Hi Nils, > > > > Flink's current monitoring is quite limited and basically restricted to > > status updates of the parallel tasks (scheduled, started, finished, > > canceled, failed, etc.). > > There is also some code lying around to collect system stats such as CPU, > > memory, and network utilization. However, it is not used right now, > AFAIK. > > In case of a long running job, it is hard to figure out what is going on > > and whether a program makes progress or not. > > > > Having a monitoring infrastructure which allows to add, collect, and > query > > new metrics with low effort would be a great addition to Flink. > > From what I know, JMX was explicitly designed for this purpose and seems > to > > be a good fit. Since it is a Java standard, other tools can easily > connect > > and retrieve monitoring data. > > > > As a starting point, I would focus to get an early prototype that uses > JMX > > to collect a single metric such as number of tuples processed by a Map > > function. > > Having such a showcase, would help to have a good discussion about how to > > implement the monitoring infrastructure. > > The question of metrics to collect is orthogonal to that. If we have a > good > > system to collect and gather stats, these can be added one by one. > > +1 > > I don't have experience with JMX, but I agree with Fabian that the > architecture of this monitoring service is very important and should come > first. It should be flexible enough to easily support the collection of > metrics by any operator and the user. > > Every task manager needs expose this service to collect (and aggregate) > data, which then would be collected at a central instance (e.g. the > JobManager). I am not sure at this point, but it might be worthwhile to > think about separating this central monitoring service from the JobManager > in order to reduce JobManager load and have more flexibility, e.g. running > it as a central history server to monitor multiple JobManager instances > (for example in YARN setups). > > – Ufuk