Fwd: Enhance Flink's monitoring capabilities

Alexander Alexandrov Fri, 12 Dec 2014 05:56:23 -0800

I have created an issue for the related dataflow statistics tracking
feature here:


https://issues.apache.org/jira/browse/FLINK-1297

FLINK-456 seems to have some overlap with what I described. I suggest that
we either have three separate issues or at least work on resolving FLINK-1297
and FLINK-456 in three stages:

1. agree upon a design and implement the basic service architecture and the
model;
2. implement dataflow statistics tracking on top of (1): min, max, count,
count distinct;
3. implement runtime statistics tracking on top of (1): CPU, I/O load;
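
To make stage (2) a bit more concrete, here is a minimal sketch of a
per-column accumulator for min, max, count, and count distinct. The class
name LongColumnStats and its methods are hypothetical and do not exist in
Flink; the exact HashSet is only a placeholder for whatever distinct-count
technique the design document settles on.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical accumulator (not a Flink class): min, max, count, count distinct. */
final class LongColumnStats {
    // For an empty accumulator, min stays Long.MAX_VALUE and max stays Long.MIN_VALUE.
    private long min = Long.MAX_VALUE;
    private long max = Long.MIN_VALUE;
    private long count = 0;
    // Exact distinct tracking; assumption: memory is not a concern in this sketch.
    private final Set<Long> distinct = new HashSet<>();

    /** Records a single value seen by one parallel task. */
    void add(long value) {
        min = Math.min(min, value);
        max = Math.max(max, value);
        count++;
        distinct.add(value);
    }

    /** Merges the partial statistics of another parallel task into this one. */
    void merge(LongColumnStats other) {
        min = Math.min(min, other.min);
        max = Math.max(max, other.max);
        count += other.count;
        distinct.addAll(other.distinct);
    }

    long getMin() { return min; }
    long getMax() { return max; }
    long getCount() { return count; }
    long getCountDistinct() { return distinct.size(); }
}
```

Each parallel task would fill its own accumulator with add(), and the
partial results would be combined with merge() wherever the statistics are
collected.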

It makes sense to have a design document (probably Markdown) with some
figures to agree on the scope and implementation aspects of (1), as Henry
proposed in the "Statistics collection for optimization" thread, before we
start with the actual implementation.

Robert's prototype branch (
https://github.com/rmetzger/incubator-flink/tree/flink456) on top of the
latest version of Till's Akka rework seems to be a good starting point to
fork for the actual work on (1). I suggest that after that we somehow
divide and conquer (2) and (3).
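
For orientation, the heartbeat-based flow that Robert describes in his
quoted message (a report per TaskManager, the latest one kept on the
JobManager, JSON for the browser) could look roughly like the sketch below.
This is not the code from his branch: the class MetricsRegistry, all method
names, and the metric keys are assumptions made up for illustration, and
only JDK classes are used.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical snapshot of TaskManager metrics; the metric keys are illustrative. */
final class MetricsReport {
    private final Map<String, Long> gauges = new LinkedHashMap<>();

    /** Takes a snapshot of a few JVM-level numbers on the TaskManager side. */
    static MetricsReport snapshot() {
        Runtime rt = Runtime.getRuntime();
        MetricsReport report = new MetricsReport();
        report.gauges.put("heap.used", rt.totalMemory() - rt.freeMemory());
        report.gauges.put("heap.max", rt.maxMemory());
        report.gauges.put("cpu.cores", (long) rt.availableProcessors());
        return report;
    }

    /** Renders the report as a flat JSON object for the web frontend. */
    String toJson() {
        StringBuilder sb = new StringBuilder("{");
        for (Map.Entry<String, Long> e : gauges.entrySet()) {
            if (sb.length() > 1) {
                sb.append(',');
            }
            sb.append('"').append(e.getKey()).append("\":").append(e.getValue());
        }
        return sb.append('}').toString();
    }
}

/** Hypothetical JobManager-side registry: keeps only the latest report per TaskManager. */
final class MetricsRegistry {
    private final Map<String, MetricsReport> latest = new ConcurrentHashMap<>();

    /** Called when a heartbeat carrying a report arrives; overwrites the previous report. */
    void onHeartbeat(String taskManagerId, MetricsReport report) {
        latest.put(taskManagerId, report);
    }

    /** Returns the latest report as JSON, or "{}" if the TaskManager is unknown. */
    String latestJson(String taskManagerId) {
        MetricsReport report = latest.get(taskManagerId);
        return report == null ? "{}" : report.toJson();
    }
}
```

Because only the latest report is stored, this sketch has the same
"snapshot" limitation Robert mentions; keeping a history would need a
bounded buffer per TaskManager instead of a single map entry.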

Regards,
Alexander

---------- Forwarded message ----------
From: Henry Saputra <[email protected]>
Date: 2014-12-12 6:18 GMT+01:00
Subject: Re: Enhance Flink's monitoring capabilities
To: "[email protected]" <[email protected]>

Thanks Robert, looks like we could use this JIRA to do the work

- Henry

On Thu, Dec 11, 2014 at 9:25 AM, Robert Metzger <[email protected]> wrote:
> I think this (very old) issue describes the feature fairly closely:
> https://issues.apache.org/jira/browse/FLINK-456
>
>
>
> On Thu, Dec 11, 2014 at 8:32 AM, Henry Saputra <[email protected]>
> wrote:
>
>> Just curious, is there any JIRA filed for this or was it just in
>> preliminary proposal talk?
>>
>> - Henry
>>
>> On Sun, Dec 7, 2014 at 3:36 PM, Stephan Ewen <[email protected]> wrote:
>> > That actually sounds like a great idea. I discussed a bit with Robert
>> > offline on Friday, and it seems that Metrics has most of what we talked
>> > about.
>> >
>> > I also like the way they make it extensible, so people can capture their
>> > own metrics.
>> >
>> > On Sun, Dec 7, 2014 at 6:02 AM, Henry Saputra <[email protected]>
>> > wrote:
>> >
>> >> Hi Robert,
>> >>
>> >> From what I have seen so far, it is probably better and easier for Flink
>> >> to leverage the metrics library [1] for metrics collection rather than
>> >> building one organically.
>> >>
>> >> Several ASF projects like Spark [2] and Tajo [3] have used it with great
>> >> success.
>> >>
>> >> The main reasons are maintainability and the breadth of metric types
>> >> that could and should be collected.
>> >>
>> >> - Henry
>> >>
>> >> [1] https://dropwizard.github.io/metrics/3.1.0/getting-started/
>> >> [2] https://spark.apache.org/docs/1.0.1/monitoring.html
>> >> [3] https://issues.apache.org/jira/browse/TAJO-333
>> >>
>> >> On Sat, Dec 6, 2014 at 11:13 AM, Robert Metzger <[email protected]>
>> >> wrote:
>> >> > Hey Nils,
>> >> >
>> >> > I have played around a bit with a little prototype. You can find the code
>> >> > here: https://github.com/rmetzger/incubator-flink/tree/flink456 (it's
>> >> > another branch in my repo).
>> >> > You can see the changes that I applied on top of Till's Akka branch here:
>> >> > https://github.com/rmetzger/incubator-flink/compare/tillrohrmann:akka_scala...rmetzger:flink456?expand=1
>> >> >
>> >> > What the code does is collect statistics about each TaskManager in the
>> >> > system. These stats are assembled into a "MetricsReport", which is sent
>> >> > with the periodic heartbeat to the JobManager. The JobManager stores the
>> >> > latest MetricsReport for each TaskManager (in the Instance object for
>> >> > each TM).
>> >> > When the user accesses the TaskManager overview, the latest MetricsReport
>> >> > is sent as a JSONObject to the browser.
>> >> >
>> >> > To test my changes, check out the code and build it:
>> >> >  mvn clean package -DskipTests -Dcheckstyle.skip=true
>> >> > Go into the distribution directory:
>> >> >  cd flink-dist/target/flink-0.8-incubating-SNAPSHOT-bin/flink-0.8-incubating-SNAPSHOT/
>> >> > and start the web interface:
>> >> >  ./bin/start-local.sh
>> >> >
>> >> > Go to localhost:8081; in the "TaskManager" view, you can see some metrics.
>> >> > Here is a screenshot: http://img42.com/eNPve
>> >> >
>> >> > I named my branch after this issue, as it probably best describes what
>> >> > we're working on here: FLINK-456
>> >> > <https://issues.apache.org/jira/browse/FLINK-456>
>> >> >
>> >> > As I said in the beginning, it's really just a prototype. Let me know if
>> >> > you have any further questions.
>> >> > For the "per TaskManager" reports, we should probably integrate some more
>> >> > statistics. Also, the presentation of the numbers is very basic right
>> >> > now. I think there are many good libraries for visualizing these kinds of
>> >> > stats.
>> >> > Also, the numbers currently represent only a "snapshot"; however, some of
>> >> > the numbers can be accumulated (read/write bytes of the I/O manager).
>> >> > Another missing feature is storing a short history of numbers to visualize
>> >> > metrics over time.
>> >> >
>> >> > I'm trying to find time to look into "per job" metrics as well. They will
>> >> > require a bit more infrastructure to distinguish them on the JobManager
>> >> > side and to get them on the TaskManagers.
>> >> >
>> >> >
>> >> > Best,
>> >> > Robert
>> >> >
>> >> >
>> >> >
>> >> > On Tue, Dec 2, 2014 at 2:53 PM, aalexandrov <
>> >> > [email protected]> wrote:
>> >> >
>> >> >> Hello Nils,
>> >> >>
>> >> >> I am going to work on a similar issue related to tracking some basic
>> >> >> statistics of the intermediate results produced by dataflows during
>> >> >> execution.
>> >> >>
>> >> >> I just created a Jira issue here:
>> >> >>
>> >> >> https://issues.apache.org/jira/browse/FLINK-1297
>> >> >>
>> >> >> If you already have some work done on extending the monitoring capabilities
>> >> >> in a branch, it might be good to sync up the development in order to avoid
>> >> >> duplicated work (e.g. using the same communication channel used to send the
>> >> >> data from the task managers to the job manager).
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> View this message in context:
>> >> >> http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/Enhance-Flink-s-monitoring-capabilities-tp2573p2713.html
>> >> >> Sent from the Apache Flink (Incubator) Mailing List archive. mailing list
>> >> >> archive at Nabble.com.
>> >> >>
>> >>
>>
