Re: Statistics collection for optimization

2014-12-02 Thread Márton Balassi
It would be nice to have integration with the existing tools, e.g. Ganglia. [1] These already cover system statistics (CPU, network, I/O...) and one can define custom stats to monitor. Hadoop is nicely integrated with it. [1] http://ganglia.sourceforge.net/ On Tue, Dec 2, 2014 at 9:37 PM, Fabian Hu

Re: InputFormat API and current scanned row count

2014-12-02 Thread Fabian Hueske
Yes, sure. Tracking records per split and UDF exec time per call (min, max, avg, or histogram) would be valuable information when debugging the performance of a program. 2014-12-02 22:08 GMT+01:00 Flavio Pompermaier : > In my specific use case I was interested in understanding why the scans > o
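
As an illustration of the kind of per-call statistics mentioned above (record count plus min, max, and avg UDF execution time), here is a minimal sketch in plain Python. It is not Flink code and does not reflect any Flink API: the names `CallStats` and `timed_map`, and the injectable `clock` parameter, are hypothetical and chosen only for this example.

```python
import time
from dataclasses import dataclass


@dataclass
class CallStats:
    """Running statistics over UDF call durations (hypothetical helper)."""
    count: int = 0               # number of records processed
    total_s: float = 0.0         # summed duration of all calls, in seconds
    min_s: float = float("inf")  # shortest call seen so far
    max_s: float = 0.0           # longest call seen so far

    def record(self, elapsed_s):
        """Fold one call duration into the running statistics."""
        self.count += 1
        self.total_s += elapsed_s
        self.min_s = min(self.min_s, elapsed_s)
        self.max_s = max(self.max_s, elapsed_s)

    @property
    def avg_s(self):
        """Average call duration, or 0.0 if nothing was recorded."""
        return self.total_s / self.count if self.count else 0.0


def timed_map(udf, records, stats, clock=time.perf_counter):
    """Apply udf to each record lazily, timing every call into stats.

    The clock is a parameter (an assumption of this sketch) so that the
    timing can be replaced by a deterministic source when testing.
    """
    for rec in records:
        start = clock()
        result = udf(rec)
        stats.record(clock() - start)
        yield result
```

One `CallStats` instance per input split would give the per-split record count via `stats.count` once the split has been fully consumed; a histogram could be added by bucketing `elapsed_s` inside `record`.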

Re: InputFormat API and current scanned row count

2014-12-02 Thread Flavio Pompermaier
In my specific use case I was interested in understanding why the scans of the splits were taking a long time, so I was interested in getting statistics about the number of records contained in each split and the rate/speed of its reading. Do you think it could be something useful in general? On D

Re: InputFormat API and current scanned row count

2014-12-02 Thread Fabian Hueske
Hi Flavio, we have a few recently started efforts to implement the collection of monitoring and runtime/data statistics. Counting the number of elements emitted by an operator (or data source) will be included. Do you want to count the number of produced tuples for monitoring the progress or do y

Re: Statistics collection for optimization

2014-12-02 Thread Fabian Hueske
I see mainly two use cases for collecting data locally on the TMs and sending it to (and aggregating it on) the JM. 1) Monitoring of the system and running jobs: This might include system stats (CPU, disk usage, network traffic & buffer usage, internal memory utilization, ...) but also progress information (numbe

Re: [DISCUSS] Graduation of Flink from the Incubator

2014-12-02 Thread Alan Gates
+1. Some quotes from my email to the PPMC on this topic: The project has successfully made two [I stand corrected, three] Apache releases and added new committers. The community seems to be good at welcoming new contributors, based on the interactions on the mailing lists. The core team mem

[jira] [Created] (FLINK-1298) Consolidate Handling of User Code ClassLoader

2014-12-02 Thread Aljoscha Krettek (JIRA)
Aljoscha Krettek created FLINK-1298: --- Summary: Consolidate Handling of User Code ClassLoader Key: FLINK-1298 URL: https://issues.apache.org/jira/browse/FLINK-1298 Project: Flink Issue Type:

Re: Statistics collection for optimization

2014-12-02 Thread Alexander Alexandrov
This is another way to do it. I just created a JIRA issue for that: https://issues.apache.org/jira/browse/FLINK-1297 If you can give me some pointers and suggest implementation strategies I can try to prototype something in a feature branch over the weekend and share it for review. 2014-12-02

Re: Enhance Flink's monitoring capabilities

2014-12-02 Thread aalexandrov
Hello Nils, I am going to work on a similar issue related to tracking some basic statistics of the intermediate results produced by dataflows during execution. I just created a Jira issue here: https://issues.apache.org/jira/browse/FLINK-1297 If you already have some work done on extending the

[jira] [Created] (FLINK-1297) Add support for tracking statistics of intermediate results

2014-12-02 Thread Alexander Alexandrov (JIRA)
Alexander Alexandrov created FLINK-1297: --- Summary: Add support for tracking statistics of intermediate results Key: FLINK-1297 URL: https://issues.apache.org/jira/browse/FLINK-1297 Project: Flin

Re: [GitHub] incubator-flink pull request: Add support for Subclasses, Interfac...

2014-12-02 Thread Ufuk Celebi
Thanks for the update. :)

Re: Statistics collection for optimization

2014-12-02 Thread Ufuk Celebi
Have you also thought about adding the statistics collection with the writers, i.e. the collector or record writer? If all you care about is the data that the user emits from her code, that should be fine. On Tue, Dec 2, 2014 at 2:33 PM, Robert Metzger wrote: > Yes. I also got the impression th

Re: Statistics collection for optimization

2014-12-02 Thread Robert Metzger
Yes. I also got the impression that you are looking for something slightly different. It is probably easier for you right now to "hack" something into the system to get these statistics. On Tue, Dec 2, 2014 at 2:25 PM, Alexander Alexandrov < alexander.s.alexand...@gmail.com> wrote: > I checked t

Re: Statistics collection for optimization

2014-12-02 Thread Alexander Alexandrov
I checked the thread. I am not sure whether this is aligned with what I want to contribute. The discussion in the other thread seems to be going in the direction of general-purpose monitoring (you are talking about Disk + Network IO, input splits). I would like to have a very thin code base that

Re: Statistics collection for optimization

2014-12-02 Thread Robert Metzger
The thread mentioned by Ufuk is an ongoing discussion, that's why there is no JIRA yet. To my understanding, it's a student doing a project on Flink. Also, I would like to give you the same advice I already gave to Nils: I would highly recommend using Till's Akka branch for starting to work on that.

Re: Statistics collection for optimization

2014-12-02 Thread Kostas Tzoumas
From the status of that thread and absence of a JIRA (as far as I could tell), I would suggest that you start working on this and announce it on the other thread, perhaps Nils would be interested in jumping in. On Tue, Dec 2, 2014 at 2:06 PM, Ufuk Celebi wrote: > Very nice to hear :) > > See th

Re: Statistics collection for optimization

2014-12-02 Thread Ufuk Celebi
Very nice to hear :) See this thread: http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/Enhance-Flink-s-monitoring-capabilities-td2573.html On Tue, Dec 2, 2014 at 2:00 PM, Alexander Alexandrov < alexander.s.alexand...@gmail.com> wrote: > Just a quick shout to check whether

Statistics collection for optimization

2014-12-02 Thread Alexander Alexandrov
Just a quick shout to check whether somebody is already working on a statistics collection component? If yes, can you point me to previous discussions in the mailing list and a WIP branch -- I want to bring myself up to date with the ongoing efforts. If not, I would like to start working on that