It would be nice to integrate with existing monitoring tools, e.g. Ganglia [1]. These already cover system statistics (CPU, network, I/O, ...), and one can define custom metrics to monitor. Hadoop already integrates nicely with Ganglia.
[1] http://ganglia.sourceforge.net/

On Tue, Dec 2, 2014 at 9:37 PM, Fabian Hueske <[email protected]> wrote:

> I see mainly two use cases to locally collect data on TMs and send it (and
> aggregate it) on the JM.
>
> 1) Monitoring of the system and running jobs: This might include system
> stats (CPU, disk usage, network traffic & buffer usage, internal memory
> utilization, ...) but also progress information (number of processed
> elements, histogram of UDF in/out ratio, UDF exec times, etc.).
> 2) Statistics collection for optimization: Stats would include key counts &
> distributions, record count & sizes, UDF stats (in/out ratio, exec times,
> ...). Depending on the expertise of the user, this information could also
> be valuable monitoring information.
>
> In both cases, we need a service to ship collected data from the TMs to the
> JM and aggregate and store it there.
> Once this service is in place, the collection of metrics could be
> independently implemented.
>
> 2014-12-02 14:57 GMT+01:00 Alexander Alexandrov <[email protected]>:
>
> > This is another way to do it.
> >
> > I just created a JIRA issue for that:
> >
> > https://issues.apache.org/jira/browse/FLINK-1297
> >
> > If you can give me some pointers and suggest implementation strategies I
> > can try to prototype something in a feature branch over the weekend and
> > share it for review.
> >
> > 2014-12-02 14:43 GMT+01:00 Ufuk Celebi <[email protected]>:
> >
> > > Have you also thought about adding the statistics collection with the
> > > writers, i.e. the collector or record writer?
> > >
> > > If all you care about is the data that the user emits from her code,
> > > that should be fine.
> > >
> > > On Tue, Dec 2, 2014 at 2:33 PM, Robert Metzger <[email protected]>
> > > wrote:
> > >
> > > > Yes. I also got the impression that you are looking for something
> > > > slightly different.
> > > >
> > > > It is probably easier for you right now to "hack" something into
> > > > the system to get these statistics.
> > > >
> > > > On Tue, Dec 2, 2014 at 2:25 PM, Alexander Alexandrov
> > > > <[email protected]> wrote:
> > > >
> > > > > I checked the thread. I am not sure whether this is aligned with
> > > > > what I want to contribute.
> > > > >
> > > > > The discussion in the other thread seems to be going in the
> > > > > direction of general-purpose monitoring (you are talking about
> > > > > Disk + Network IO, input splits).
> > > > >
> > > > > I would like to have a very thin code base that can be (1)
> > > > > transparently injected in UDFs (if you can manipulate the AST),
> > > > > or (2) wrapped in identity mappers (if you cannot) in order to
> > > > > gather collection statistics (min, max, distinct, maybe some
> > > > > histograms) to facilitate incremental optimization.
> > > > >
> > > > > I agree that this should be based on existing infrastructure
> > > > > (Akka) and should not be over-engineered.
> > > > >
> > > > > I will announce this in the other branch and create a JIRA ticket
> > > > > to fix the parameters of what has to be done and the best way to
> > > > > implement it with the other contributors.
> > > > >
> > > > > 2014-12-02 14:12 GMT+01:00 Kostas Tzoumas <[email protected]>:
> > > > >
> > > > > > From the status of that thread and absence of a JIRA (as far
> > > > > > as I could tell), I would suggest that you start working on
> > > > > > this and announce it on the other thread, perhaps Nils would
> > > > > > be interested in jumping in.
> > > > > >
> > > > > > On Tue, Dec 2, 2014 at 2:06 PM, Ufuk Celebi <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > Very nice to hear :)
> > > > > > >
> > > > > > > See this thread:
> > > > > > >
> > > > > > > http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/Enhance-Flink-s-monitoring-capabilities-td2573.html
> > > > > > >
> > > > > > > On Tue, Dec 2, 2014 at 2:00 PM, Alexander Alexandrov
> > > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > > Just a quick shout to check whether somebody is already
> > > > > > > > working on a statistics collection component?
> > > > > > > >
> > > > > > > > If yes, can you point me to previous discussions in the
> > > > > > > > mailing list and a WIP branch -- I want to bring myself
> > > > > > > > up to date with the ongoing efforts.
> > > > > > > >
> > > > > > > > If not, I would like to start working on that component
> > > > > > > > and ideally integrate some parts of it in the 0.8
> > > > > > > > release.
> > > > > > > >
> > > > > > > > Cheers!
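
For illustration only, the identity-mapper idea quoted above (a thin wrapper that passes records through while gathering min, max and distinct counts) might look roughly like the sketch below. Everything here is an assumption made for the example: the class name StatsCollectingIdentity is hypothetical, plain java.util.function.Function stands in for a UDF, records are assumed non-null, and no Flink or Akka API is used.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

/**
 * Hypothetical identity "mapper": returns every record unchanged while
 * collecting simple statistics (count, min, max, distinct values).
 * Not a Flink API; Function<T, T> stands in for a UDF. Records must be non-null.
 */
final class StatsCollectingIdentity<T extends Comparable<T>> implements Function<T, T> {
    private long count;
    private T min;
    private T max;
    // Exact distinct tracking for simplicity; memory grows with cardinality.
    private final Set<T> distinct = new HashSet<>();

    @Override
    public T apply(T record) {
        count++;
        if (min == null || record.compareTo(min) < 0) min = record;
        if (max == null || record.compareTo(max) > 0) max = record;
        distinct.add(record);
        return record; // identity: the data flow is not altered
    }

    /** Merges statistics gathered by another instance, e.g. from another worker. */
    void merge(StatsCollectingIdentity<T> other) {
        if (other.count == 0) return; // nothing to merge
        count += other.count;
        if (min == null || other.min.compareTo(min) < 0) min = other.min;
        if (max == null || other.max.compareTo(max) > 0) max = other.max;
        distinct.addAll(other.distinct);
    }

    long count() { return count; }
    T min() { return min; }
    T max() { return max; }
    long distinctCount() { return distinct.size(); }
}
```

In the setup discussed in the thread, each TM would presumably keep one such collector per wrapped operator and ship it to the JM, where merge() combines the partial results. The exact HashSet would likely have to be replaced by an approximate structure for high-cardinality data.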
