[ 
https://issues.apache.org/jira/browse/FLINK-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158090#comment-15158090
 ] 

Jamie Grier commented on FLINK-1502:
------------------------------------

[~eastcirclek] Let's define our terms to make sure we're talking about the 
same thing.

*Session*: A single instance of a JobManager and some number of TaskManagers 
working together.  A session can be created "on-the-fly" for a single job or 
it can be a long-running thing.  Multiple jobs can start, run, and finish in 
the same session.  Think of the "yarn-session.sh" command.  This creates a 
session outside of any particular job.  This is also what I've meant when I've 
said "cluster".  A YARN session is a "cluster" that we've spun up for some 
length of time on YARN.  Another example of a cluster would be a standalone 
install of Flink on some number of machines.

*Job*: A single batch or streaming job that runs on a Flink cluster.

In the above scenario, and if your definition of a session agrees with mine, 
you would instead have the following.  Note that I've named each cluster 
according to the "session" name you've given, because in this case each session 
is really a different (ad-hoc) cluster.  When you run a job directly using just 
"flink run -ytm ..." on YARN you are spinning up an ad-hoc cluster for your job.

After Session 1 is finished, Node 1 would have the following metrics:

- cluster.session1.taskmanager.1.gc_time

After session 2 is finished, Node 1 would have the following metrics:

- cluster.session1.taskmanager.1.gc_time 
- cluster.session2.taskmanager.2.gc_time
- cluster.session3.taskmanager.3.gc_time

There are many metrics in this case because that's exactly what you want.  
These are JVM-scoped metrics we are talking about, and those are 3 different 
JVMs, not the same one, so it makes total sense for them to have these different 
names/scopes.  These metrics have nothing to do with each other and it doesn't 
matter which host they are from.  They are scoped to the cluster (or session) 
and logical TaskManager index, not the host.

The above should not be confused with any host-level metrics we want to report. 
Host-level metrics would be scoped simply by the hostname, so they wouldn't 
grow over time.

One more example, hopefully to clarify.  Let's say I spun up a long-running 
cluster (or session) using yarn-session.sh -tm 3.  Now we have a Flink cluster 
running on YARN with no jobs running and three TaskManagers.  We then run three 
different jobs one after another on this cluster.  The metrics would still 
simply be:

- cluster.yarn-session.taskmanager.1.gc_time
- cluster.yarn-session.taskmanager.2.gc_time
- cluster.yarn-session.taskmanager.3.gc_time

No matter how many jobs you ran, this list would not grow, which is natural 
because there have only been 3 TaskManagers.  Now if one of these TaskManagers 
were to fail and be restarted, it would assume the same name.  That's the point 
of using "logical" indexes, so the set of metric names in that case still would 
not be larger than the above.
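
To make the naming idea concrete, here is a rough sketch in Python.  It is 
not Flink code; the helper names (metric_name, names_for_cluster) and the job 
names are made up for illustration.  It only shows that names scoped by cluster 
name and logical TaskManager index stay the same across jobs and restarts.

```python
def metric_name(cluster, tm_index, metric):
    """Build a name scoped by cluster (session) and logical TaskManager index."""
    return f"cluster.{cluster}.taskmanager.{tm_index}.{metric}"


def names_for_cluster(cluster, num_taskmanagers, metric="gc_time"):
    """Names reported by a cluster with TaskManagers indexed 1..num_taskmanagers."""
    return {metric_name(cluster, i, metric) for i in range(1, num_taskmanagers + 1)}


seen = set()

# Three jobs run one after another on the same 3-TaskManager session.
# The job name is deliberately not part of the metric scope.
for _job in ("job-a", "job-b", "job-c"):
    seen |= names_for_cluster("yarn-session", 3)

# TaskManager 2 fails and is restarted; it takes logical index 2 again,
# so it reports under the name it had before.
seen.add(metric_name("yarn-session", 2, "gc_time"))

for name in sorted(seen):
    print(name)
```

The set ends up with exactly the three names listed above, because neither the 
job nor the physical host appears anywhere in the name.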

In the initial case you describe above, if you didn't want lots of different 
metrics over time you could also just give all of your sessions the same name.  
Your metrics are growing because you're spinning up many different clusters 
(sessions) over time with different names each time.  If you used the same name 
for the cluster (session) every time, this metrics namespace growth would not 
occur.
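
As a quick illustration of that last point, here is a small Python sketch (again 
not Flink code; distinct_names and the session name "my-cluster" are 
hypothetical).  It assumes one TaskManager per session and compares the number 
of distinct metric names when each session gets its own name versus when the 
name is reused.

```python
def metric_name(cluster, tm_index, metric="gc_time"):
    """Name scoped by cluster (session) name and logical TaskManager index."""
    return f"cluster.{cluster}.taskmanager.{tm_index}.{metric}"


def distinct_names(session_names, taskmanagers_per_session=1):
    """Collect every metric name produced by a sequence of sessions."""
    names = set()
    for session in session_names:
        for tm_index in range(1, taskmanagers_per_session + 1):
            names.add(metric_name(session, tm_index))
    return names


# A different name for every ad-hoc session: one new metric name per session.
unique_each_time = distinct_names(["session1", "session2", "session3"])

# The same name reused for every session: the namespace stays at one entry.
same_name = distinct_names(["my-cluster", "my-cluster", "my-cluster"])

print(len(unique_each_time), len(same_name))
```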

I hope any of that made sense ;)  This is getting a bit hard to describe this 
way.  We could also sync via Hangouts or something if that is easier.



> Expose metrics to graphite, ganglia and JMX.
> --------------------------------------------
>
>                 Key: FLINK-1502
>                 URL: https://issues.apache.org/jira/browse/FLINK-1502
>             Project: Flink
>          Issue Type: Sub-task
>          Components: JobManager, TaskManager
>    Affects Versions: 0.9
>            Reporter: Robert Metzger
>            Assignee: Dongwon Kim
>            Priority: Minor
>             Fix For: pre-apache
>
>
> The metrics library makes it easy to expose collected metrics to other 
> systems such as graphite, ganglia or Java's JVM (VisualVM).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
