This is an automated email from the ASF dual-hosted git repository.

fhueske pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/flink-web.git

commit 3314df9df136691743153dbd1fc11f011c9a7816
Author: Fabian Hueske <fhue...@apache.org>
AuthorDate: Mon Feb 25 15:48:15 2019 +0100

    Rebuild website.
---
 content/blog/feed.xml                              | 591 +++++++++++++++
 content/blog/index.html                            |  38 +-
 content/blog/page2/index.html                      |  40 +-
 content/blog/page3/index.html                      |  40 +-
 content/blog/page4/index.html                      |  40 +-
 content/blog/page5/index.html                      |  38 +-
 content/blog/page6/index.html                      |  36 +-
 content/blog/page7/index.html                      |  38 +-
 content/blog/page8/index.html                      |  25 +
 .../2019-02-21-monitoring-best-practices/fig-1.png | Bin 0 -> 19621 bytes
 .../2019-02-21-monitoring-best-practices/fig-2.png | Bin 0 -> 6637 bytes
 .../2019-02-21-monitoring-best-practices/fig-3.png | Bin 0 -> 28722 bytes
 .../2019-02-21-monitoring-best-practices/fig-4.png | Bin 0 -> 15780 bytes
 .../2019-02-21-monitoring-best-practices/fig-5.png | Bin 0 -> 37684 bytes
 .../2019-02-21-monitoring-best-practices/fig-6.png | Bin 0 -> 26912 bytes
 .../2019-02-21-monitoring-best-practices/fig-7.png | Bin 0 -> 25941 bytes
 .../2019-02-21-monitoring-best-practices/fig-8.png | Bin 0 -> 32185 bytes
 content/index.html                                 |   8 +-
 .../news/2019/02/25/monitoring-best-practices.html | 791 +++++++++++++++++++++
 content/zh/index.html                              |   8 +-
 20 files changed, 1582 insertions(+), 111 deletions(-)

diff --git a/content/blog/feed.xml b/content/blog/feed.xml
index fcaa25d..070ec94 100644
--- a/content/blog/feed.xml
+++ b/content/blog/feed.xml
@@ -7,6 +7,597 @@
 <atom:link href="https://flink.apache.org/blog/feed.xml"; rel="self" 
type="application/rss+xml" />
 
 <item>
+<title>Monitoring Apache Flink Applications 101</title>
+<description>&lt;!-- improve style of tables --&gt;
+&lt;style&gt;
+  table { border: 0px solid black; table-layout: auto; width: 800px; }
+  th, td { border: 1px solid black; padding: 5px; padding-left: 10px; 
padding-right: 10px; }
+  th { text-align: center }
+  td { vertical-align: top }
+&lt;/style&gt;
+
+&lt;p&gt;This blog post provides an introduction to Apache Flink’s built-in 
monitoring
+and metrics system, which allows developers to effectively monitor their Flink
+jobs. Oftentimes, the task of picking the relevant metrics to monitor a
+Flink application can be overwhelming for a DevOps team that is just starting
+with stream processing and Apache Flink. Having worked with many organizations
+that deploy Flink at scale, I would like to share my experience and some best
+practices with the community.&lt;/p&gt;
+
+&lt;p&gt;With business-critical applications running on Apache Flink, 
performance monitoring
+becomes an increasingly important part of a successful production deployment. 
It 
+ensures that any degradation or downtime is immediately identified and resolved
+as quickly as possible.&lt;/p&gt;
+
+&lt;p&gt;Monitoring goes hand-in-hand with observability, which is a 
prerequisite for
+troubleshooting and performance tuning. Nowadays, with the complexity of modern
+enterprise applications and the speed of delivery increasing, an engineering
+team must understand and have a complete overview of its applications’ status 
at
+any given point in time.&lt;/p&gt;
+
+&lt;h2 id=&quot;flinks-metrics-system&quot;&gt;Flink’s Metrics 
System&lt;/h2&gt;
+
+&lt;p&gt;The foundation for monitoring Flink jobs is its &lt;a 
href=&quot;https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/metrics.html&quot;&gt;metrics
+system&lt;/a&gt;
+which consists of two components: &lt;code&gt;Metrics&lt;/code&gt; and 
&lt;code&gt;MetricsReporters&lt;/code&gt;.&lt;/p&gt;
+
+&lt;h3 id=&quot;metrics&quot;&gt;Metrics&lt;/h3&gt;
+
+&lt;p&gt;Flink comes with a comprehensive set of built-in metrics such 
as:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Used JVM Heap / NonHeap / Direct Memory (per 
Task-/JobManager)&lt;/li&gt;
+  &lt;li&gt;Number of Job Restarts (per Job)&lt;/li&gt;
+  &lt;li&gt;Number of Records Per Second (per Operator)&lt;/li&gt;
+  &lt;li&gt;…&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;These metrics have different scopes and cover general aspects (e.g. JVM or
+operating system) as well as Flink-specific ones.&lt;/p&gt;
+
+&lt;p&gt;As a user, you can and should add application-specific metrics to your
+functions. Typically these include counters for the number of invalid records 
or
+the number of records temporarily buffered in managed state. Besides counters,
+Flink offers additional metric types like gauges and histograms. For
+instructions on how to register your own metrics with Flink’s metrics system
+please check out &lt;a 
href=&quot;https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/metrics.html#registering-metrics&quot;&gt;Flink’s
+documentation&lt;/a&gt;.
+In this blog post, we will focus on how to get the most out of Flink’s built-in
+metrics.&lt;/p&gt;
+
+&lt;h3 id=&quot;metricsreporters&quot;&gt;MetricsReporters&lt;/h3&gt;
+
+&lt;p&gt;All metrics can be queried via Flink’s REST API. However, users can 
configure
+MetricsReporters to send the metrics to external systems. Apache Flink provides
+reporters for the most common monitoring tools out-of-the-box, including JMX,
+Prometheus, Datadog, Graphite and InfluxDB. For information about how to
+configure a reporter check out Flink’s &lt;a 
href=&quot;https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/metrics.html#reporter&quot;&gt;MetricsReporter
+documentation&lt;/a&gt;.&lt;/p&gt;
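+
+&lt;p&gt;As a minimal sketch, assuming the Prometheus reporter jar has been placed on
+the classpath (the reporter name &lt;code&gt;prom&lt;/code&gt; and the port are arbitrary
+choices for this example), a reporter could be configured in
+&lt;code&gt;flink-conf.yaml&lt;/code&gt; roughly like this:&lt;/p&gt;
+
+&lt;pre&gt;&lt;code&gt;# sketch only; verify the options against the MetricsReporter documentation
+metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
+metrics.reporter.prom.port: 9249
+&lt;/code&gt;&lt;/pre&gt;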
+
+&lt;p&gt;In the remaining part of this blog post, we will go over some of the 
most
+important metrics to monitor your Apache Flink application.&lt;/p&gt;
+
+&lt;h2 id=&quot;monitoring-general-health&quot;&gt;Monitoring General 
Health&lt;/h2&gt;
+
+&lt;p&gt;The first thing you want to monitor is whether your job is actually 
in a &lt;em&gt;RUNNING&lt;/em&gt;
+state. In addition, it pays off to monitor the number of restarts and the time
+since the last restart.&lt;/p&gt;
+
+&lt;p&gt;Generally speaking, successful checkpointing is a strong indicator of 
the
+general health of your application. For each checkpoint, checkpoint barriers
+need to flow through the whole topology of your Flink job, and events and
+barriers cannot overtake each other. Therefore, a successful checkpoint shows
+that no channel is fully congested.&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Key Metrics&lt;/strong&gt;&lt;/p&gt;
+
+&lt;table&gt;
+  &lt;thead&gt;
+    &lt;tr&gt;
+      &lt;th&gt;Metric&lt;/th&gt;
+      &lt;th&gt;Scope&lt;/th&gt;
+      &lt;th&gt;Description&lt;/th&gt;
+    &lt;/tr&gt;
+  &lt;/thead&gt;
+  &lt;tbody&gt;
+    &lt;tr&gt;
+      &lt;td&gt;&lt;code&gt;uptime&lt;/code&gt;&lt;/td&gt;
+      &lt;td&gt;job&lt;/td&gt;
+      &lt;td&gt;The time that the job has been running without 
interruption.&lt;/td&gt;
+    &lt;/tr&gt;
+    &lt;tr&gt;
+      &lt;td&gt;&lt;code&gt;fullRestarts&lt;/code&gt;&lt;/td&gt;
+      &lt;td&gt;job&lt;/td&gt;
+      &lt;td&gt;The total number of full restarts since this job was 
submitted.&lt;/td&gt;
+    &lt;/tr&gt;
+    &lt;tr&gt;
+      
&lt;td&gt;&lt;code&gt;numberOfCompletedCheckpoints&lt;/code&gt;&lt;/td&gt;
+      &lt;td&gt;job&lt;/td&gt;
+      &lt;td&gt;The number of successfully completed checkpoints.&lt;/td&gt;
+    &lt;/tr&gt;
+    &lt;tr&gt;
+      &lt;td&gt;&lt;code&gt;numberOfFailedCheckpoints&lt;/code&gt;&lt;/td&gt;
+      &lt;td&gt;job&lt;/td&gt;
+      &lt;td&gt;The number of failed checkpoints.&lt;/td&gt;
+    &lt;/tr&gt;
+  &lt;/tbody&gt;
+&lt;/table&gt;
+
+&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Example Dashboard Panels&lt;/strong&gt;&lt;/p&gt;
+
+&lt;center&gt;
+&lt;img 
src=&quot;/img/blog/2019-02-21-monitoring-best-practices/fig-1.png&quot; 
width=&quot;800px&quot; alt=&quot;Uptime (35 minutes), Restarting Time (3 
milliseconds) and Number of Full Restarts (7)&quot; /&gt;
+&lt;br /&gt;
+&lt;i&gt;&lt;small&gt;Uptime (35 minutes), Restarting Time (3 milliseconds) 
and Number of Full Restarts (7)&lt;/small&gt;&lt;/i&gt;
+&lt;/center&gt;
+&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
+
+&lt;center&gt;
+&lt;img 
src=&quot;/img/blog/2019-02-21-monitoring-best-practices/fig-2.png&quot; 
width=&quot;800px&quot; alt=&quot;Completed Checkpoints (18336), Failed 
(14)&quot; /&gt;
+&lt;br /&gt;
+&lt;i&gt;&lt;small&gt;Completed Checkpoints (18336), Failed 
(14)&lt;/small&gt;&lt;/i&gt;
+&lt;/center&gt;
+&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Possible Alerts&lt;/strong&gt;&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;&lt;code&gt;ΔfullRestarts&lt;/code&gt; &amp;gt; 
&lt;code&gt;threshold&lt;/code&gt;&lt;/li&gt;
+  &lt;li&gt;&lt;code&gt;ΔnumberOfFailedCheckpoints&lt;/code&gt; &amp;gt; 
&lt;code&gt;threshold&lt;/code&gt;&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;h2 id=&quot;monitoring-progress--throughput&quot;&gt;Monitoring Progress 
&amp;amp; Throughput&lt;/h2&gt;
+
+&lt;p&gt;Knowing that your application is RUNNING and checkpointing is working 
fine is good,
+but it does not tell you whether the application is actually making progress 
and
+keeping up with the upstream systems.&lt;/p&gt;
+
+&lt;h3 id=&quot;throughput&quot;&gt;Throughput&lt;/h3&gt;
+
+&lt;p&gt;Flink provides multiple metrics to measure the throughput of our 
application.
+For each operator or task (remember: a task can contain multiple &lt;a 
href=&quot;https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/operators/#task-chaining-and-resource-groups&quot;&gt;chained
tasks&lt;/a&gt;),
+Flink counts the number of records and bytes going in and out. Out of those
+metrics, the rate of outgoing records per operator is often the most intuitive
+and easiest to reason about.&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Key Metrics&lt;/strong&gt;&lt;/p&gt;
+
+&lt;table&gt;
+  &lt;thead&gt;
+    &lt;tr&gt;
+      &lt;th&gt;Metric&lt;/th&gt;
+      &lt;th&gt;Scope&lt;/th&gt;
+      &lt;th&gt;Description&lt;/th&gt;
+    &lt;/tr&gt;
+  &lt;/thead&gt;
+  &lt;tbody&gt;
+    &lt;tr&gt;
+      &lt;td&gt;&lt;code&gt;numRecordsOutPerSecond&lt;/code&gt;&lt;/td&gt;
+      &lt;td&gt;task&lt;/td&gt;
+      &lt;td&gt;The number of records this operator/task sends per 
second.&lt;/td&gt;
+    &lt;/tr&gt;
+    &lt;tr&gt;
+      &lt;td&gt;&lt;code&gt;numRecordsOutPerSecond&lt;/code&gt;&lt;/td&gt;
+      &lt;td&gt;operator&lt;/td&gt;
+      &lt;td&gt;The number of records this operator sends per 
second.&lt;/td&gt;
+    &lt;/tr&gt;
+  &lt;/tbody&gt;
+&lt;/table&gt;
+
+&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Example Dashboard Panels&lt;/strong&gt;&lt;/p&gt;
+
+&lt;center&gt;
+&lt;img 
src=&quot;/img/blog/2019-02-21-monitoring-best-practices/fig-3.png&quot; 
width=&quot;800px&quot; alt=&quot;Mean Records Out per Second per 
Operator&quot; /&gt;
+&lt;br /&gt;
+&lt;i&gt;&lt;small&gt;Mean Records Out per Second per 
Operator&lt;/small&gt;&lt;/i&gt;
+&lt;/center&gt;
+&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Possible Alerts&lt;/strong&gt;&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;&lt;code&gt;numRecordsOutPerSecond&lt;/code&gt; = 
&lt;code&gt;0&lt;/code&gt; (for a non-Sink operator)&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: Source operators always have zero incoming 
records. Sink operators
+always have zero outgoing records because the metrics only count
+Flink-internal communication. There is a &lt;a 
href=&quot;https://issues.apache.org/jira/browse/FLINK-7286&quot;&gt;JIRA
+ticket&lt;/a&gt; to change this
+behavior.&lt;/p&gt;
+
+&lt;h3 id=&quot;progress&quot;&gt;Progress&lt;/h3&gt;
+
+&lt;p&gt;For applications that use event time semantics, it is important that watermarks
+progress over time. A watermark of time &lt;em&gt;t&lt;/em&gt; tells the framework that it
+should no longer expect to receive events with a timestamp earlier than &lt;em&gt;t&lt;/em&gt;,
+and in turn, to trigger all operations that were scheduled for a timestamp 
&amp;lt; &lt;em&gt;t&lt;/em&gt;.
+For example, an event time window that ends at &lt;em&gt;t&lt;/em&gt; = 30 
will be closed and
+evaluated once the watermark passes 30.&lt;/p&gt;
+
+&lt;p&gt;As a consequence, you should monitor the watermark at event 
time-sensitive
+operators in your application, such as process functions and windows. If the
+difference between the current processing time and the watermark, known as
+event-time skew, is unusually high, then it typically implies one of two issues.
+First, it could mean that you are simply processing old events, for example
+during catch-up after a downtime or when your job is simply not able to keep up
+and events are queuing up. Second, it could mean a single upstream sub-task has
+not sent a watermark for a long time (for example because it did not receive 
any
+events to base the watermark on), which also prevents the watermark of
+downstream operators from progressing. This &lt;a 
href=&quot;https://issues.apache.org/jira/browse/FLINK-5017&quot;&gt;JIRA
+ticket&lt;/a&gt; provides further
+information and a workaround for the latter.&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Key Metrics&lt;/strong&gt;&lt;/p&gt;
+
+&lt;table&gt;
+  &lt;thead&gt;
+    &lt;tr&gt;
+      &lt;th&gt;Metric&lt;/th&gt;
+      &lt;th&gt;Scope&lt;/th&gt;
+      &lt;th&gt;Description&lt;/th&gt;
+    &lt;/tr&gt;
+  &lt;/thead&gt;
+  &lt;tbody&gt;
+    &lt;tr&gt;
+      &lt;td&gt;&lt;code&gt;currentOutputWatermark&lt;/code&gt;&lt;/td&gt;
+      &lt;td&gt;operator&lt;/td&gt;
+      &lt;td&gt;The last watermark this operator has emitted.&lt;/td&gt;
+    &lt;/tr&gt;
+  &lt;/tbody&gt;
+&lt;/table&gt;
+
+&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Example Dashboard Panels&lt;/strong&gt;&lt;/p&gt;
+
+&lt;center&gt;
+&lt;img 
src=&quot;/img/blog/2019-02-21-monitoring-best-practices/fig-4.png&quot; 
width=&quot;800px&quot; alt=&quot;Event Time Lag per Subtask of a single 
operator in the topology. In this case, the watermark is lagging a few seconds 
behind for each subtask.&quot; /&gt;
+&lt;br /&gt;
+&lt;i&gt;&lt;small&gt;Event Time Lag per Subtask of a single operator in the 
topology. In this case, the watermark is lagging a few seconds behind for each 
subtask.&lt;/small&gt;&lt;/i&gt;
+&lt;/center&gt;
+&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Possible Alerts&lt;/strong&gt;&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;&lt;code&gt;currentProcessingTime - 
currentOutputWatermark&lt;/code&gt; &amp;gt; 
&lt;code&gt;threshold&lt;/code&gt;&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;h3 id=&quot;keeping-up&quot;&gt;“Keeping Up”&lt;/h3&gt;
+
+&lt;p&gt;When consuming from a message queue, there is often a direct way to 
monitor if
+your application is keeping up. By using connector-specific metrics you can
+monitor how far behind the head of the message queue your current consumer 
group
+is. Flink forwards the underlying metrics from most sources.&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Key Metrics&lt;/strong&gt;&lt;/p&gt;
+
+&lt;table&gt;
+  &lt;thead&gt;
+    &lt;tr&gt;
+      &lt;th&gt;Metric&lt;/th&gt;
+      &lt;th&gt;Scope&lt;/th&gt;
+      &lt;th&gt;Description&lt;/th&gt;
+    &lt;/tr&gt;
+  &lt;/thead&gt;
+  &lt;tbody&gt;
+    &lt;tr&gt;
+      &lt;td&gt;&lt;code&gt;records-lag-max&lt;/code&gt;&lt;/td&gt;
+      &lt;td&gt;user&lt;/td&gt;
+      &lt;td&gt;applies to &lt;code&gt;FlinkKafkaConsumer&lt;/code&gt;. The 
maximum lag in terms of the number of records for any partition in this window. 
An increasing value over time is your best indication that the consumer group 
is not keeping up with the producers.&lt;/td&gt;
+    &lt;/tr&gt;
+    &lt;tr&gt;
+      &lt;td&gt;&lt;code&gt;millisBehindLatest&lt;/code&gt;&lt;/td&gt;
+      &lt;td&gt;user&lt;/td&gt;
+      &lt;td&gt;applies to &lt;code&gt;FlinkKinesisConsumer&lt;/code&gt;. The 
number of milliseconds a consumer is behind the head of the stream. For any 
consumer and Kinesis shard, this indicates how far it is behind the current 
time.&lt;/td&gt;
+    &lt;/tr&gt;
+  &lt;/tbody&gt;
+&lt;/table&gt;
+
+&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Possible Alerts&lt;/strong&gt;&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;&lt;code&gt;records-lag-max&lt;/code&gt;  &amp;gt; 
&lt;code&gt;threshold&lt;/code&gt;&lt;/li&gt;
+  &lt;li&gt;&lt;code&gt;millisBehindLatest&lt;/code&gt; &amp;gt; 
&lt;code&gt;threshold&lt;/code&gt;&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;h2 id=&quot;monitoring-latency&quot;&gt;Monitoring Latency&lt;/h2&gt;
+
+&lt;p&gt;Generally speaking, latency is the delay between the creation of an 
event and
+the time at which results based on this event become visible. Once the event is
+created, it is usually stored in a persistent message queue before it is
+processed by Apache Flink, which then writes the results to a database or calls
+a downstream system. In such a pipeline, latency can be introduced at each 
stage
+and for various reasons including the following:&lt;/p&gt;
+
+&lt;ol&gt;
+  &lt;li&gt;It might take a varying amount of time until events are persisted 
in the
+message queue.&lt;/li&gt;
+  &lt;li&gt;During periods of high load or during recovery, events might spend 
some time
+in the message queue until they are processed by Flink (see previous 
section).&lt;/li&gt;
+  &lt;li&gt;Some operators in a streaming topology need to buffer events for 
some time
+(e.g. in a time window) for functional reasons.&lt;/li&gt;
+  &lt;li&gt;Each computation in your Flink topology (framework or user code), 
as well as
+each network shuffle, takes time and adds to latency.&lt;/li&gt;
+  &lt;li&gt;If the application emits through a transactional sink, the sink 
will only
+commit and publish transactions upon successful checkpoints of Flink, adding
+latency usually up to the checkpointing interval for each record.&lt;/li&gt;
+&lt;/ol&gt;
+
+&lt;p&gt;In practice, it has proven invaluable to add timestamps to your 
events at
+multiple stages (at least at creation, persistence, ingestion by Flink,
+publication by Flink, possibly sampling those to save bandwidth). The
+differences between these timestamps can be exposed as a user-defined metric in
+your Flink topology to derive the latency distribution of each stage.&lt;/p&gt;
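+
+&lt;p&gt;As an illustrative sketch (the &lt;code&gt;Event&lt;/code&gt; type, its timestamp accessor
+and the metric name &lt;code&gt;eventTimeLag&lt;/code&gt; are made up for this example), such a
+timestamp difference can be exposed as a gauge from within a rich function:&lt;/p&gt;
+
+&lt;pre&gt;&lt;code&gt;import org.apache.flink.api.common.functions.RichFlatMapFunction;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.metrics.Gauge;
+import org.apache.flink.util.Collector;
+
+// hypothetical sketch: report how far processed events lag behind processing time
+public class EventTimeLagReporter extends RichFlatMapFunction&amp;lt;Event, Event&amp;gt; {
+
+  private transient volatile long lastEventTimestamp;
+
+  @Override
+  public void open(Configuration parameters) {
+    getRuntimeContext().getMetricGroup().gauge(&quot;eventTimeLag&quot;, new Gauge&amp;lt;Long&amp;gt;() {
+      @Override
+      public Long getValue() {
+        return System.currentTimeMillis() - lastEventTimestamp;
+      }
+    });
+  }
+
+  @Override
+  public void flatMap(Event event, Collector&amp;lt;Event&amp;gt; out) {
+    // assumes the (made-up) Event type carries its creation timestamp
+    lastEventTimestamp = event.getCreationTimestamp();
+    out.collect(event);
+  }
+}
+&lt;/code&gt;&lt;/pre&gt;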
+
+&lt;p&gt;In the rest of this section, we will only consider latency that is 
introduced
+inside the Flink topology and cannot be attributed to transactional sinks or
+events being buffered for functional reasons (4.).&lt;/p&gt;
+
+&lt;p&gt;To this end, Flink comes with a feature called &lt;a 
href=&quot;https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/metrics.html#latency-tracking&quot;&gt;Latency
+Tracking&lt;/a&gt;.
+When enabled, Flink will insert so-called latency markers periodically at all
+sources. For each sub-task, a latency distribution from each source to this
+operator will be reported. The granularity of these histograms can be further
+controlled by setting &lt;em&gt;metrics.latency.granularity&lt;/em&gt; as 
desired.&lt;/p&gt;
+
+&lt;p&gt;Due to the potentially high number of histograms (in particular for
+&lt;em&gt;metrics.latency.granularity: subtask&lt;/em&gt;), enabling latency 
tracking can
+significantly impact the performance of the cluster. It is recommended to only
+enable it to locate sources of latency during debugging.&lt;/p&gt;
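+
+&lt;p&gt;As a rough sketch, latency tracking could be enabled via
+&lt;code&gt;flink-conf.yaml&lt;/code&gt;; the option names below are assumptions based on the
+Latency Tracking documentation linked above and should be double-checked for your
+Flink version:&lt;/p&gt;
+
+&lt;pre&gt;&lt;code&gt;# sketch only; see the Latency Tracking documentation for defaults
+metrics.latency.interval: 30000        # emit latency markers every 30 seconds
+metrics.latency.granularity: operator  # track latency per source/operator pair
+&lt;/code&gt;&lt;/pre&gt;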
+
+&lt;p&gt;&lt;strong&gt;Metrics&lt;/strong&gt;&lt;/p&gt;
+
+&lt;table&gt;
+  &lt;thead&gt;
+    &lt;tr&gt;
+      &lt;th&gt;Metric&lt;/th&gt;
+      &lt;th&gt;Scope&lt;/th&gt;
+      &lt;th&gt;Description&lt;/th&gt;
+    &lt;/tr&gt;
+  &lt;/thead&gt;
+  &lt;tbody&gt;
+    &lt;tr&gt;
+      &lt;td&gt;&lt;code&gt;latency&lt;/code&gt;&lt;/td&gt;
+      &lt;td&gt;operator&lt;/td&gt;
+      &lt;td&gt;The latency from the source operator to this 
operator.&lt;/td&gt;
+    &lt;/tr&gt;
+    &lt;tr&gt;
+      &lt;td&gt;&lt;code&gt;restartingTime&lt;/code&gt;&lt;/td&gt;
+      &lt;td&gt;job&lt;/td&gt;
+      &lt;td&gt;The time it took to restart the job, or how long the current 
restart has been in progress.&lt;/td&gt;
+    &lt;/tr&gt;
+  &lt;/tbody&gt;
+&lt;/table&gt;
+
+&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Example Dashboard Panel&lt;/strong&gt;&lt;/p&gt;
+
+&lt;center&gt;
+&lt;img 
src=&quot;/img/blog/2019-02-21-monitoring-best-practices/fig-5.png&quot; 
width=&quot;800px&quot; alt=&quot;Latency distribution between a source and a 
single sink subtask.&quot; /&gt;
+&lt;br /&gt;
+&lt;i&gt;&lt;small&gt;Latency distribution between a source and a single sink 
subtask.&lt;/small&gt;&lt;/i&gt;
+&lt;/center&gt;
+&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
+
+&lt;h2 id=&quot;jvm-metrics&quot;&gt;JVM Metrics&lt;/h2&gt;
+
+&lt;p&gt;So far we have only looked at Flink-specific metrics. As long as 
latency &amp;amp;
+throughput of your application are in line with your expectations and it is
+checkpointing consistently, this is probably everything you need. On the other
+hand, if your job’s performance is starting to degrade, among the first metrics you
+want to look at are memory consumption and CPU load of your Task- &amp;amp; 
JobManager
+JVMs.&lt;/p&gt;
+
+&lt;h3 id=&quot;memory&quot;&gt;Memory&lt;/h3&gt;
+
+&lt;p&gt;Flink reports the usage of Heap, NonHeap, Direct &amp;amp; Mapped 
memory for JobManagers
+and TaskManagers.&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;
+    &lt;p&gt;Heap memory - as with most JVM applications - is the most 
volatile and important
+metric to watch. This is especially true when using Flink’s filesystem
+statebackend as it keeps all state objects on the JVM Heap. If the size of
+long-living objects on the Heap increases significantly, this can usually be
+attributed to the size of your application state (check the 
+&lt;a 
href=&quot;https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/metrics.html#checkpointing&quot;&gt;checkpointing
 metrics&lt;/a&gt;
+for an estimated size of the on-heap state). The possible reasons for growing
+state are very application-specific. Typically, an increasing number of keys, a
+large event-time skew between different input streams or simply missing state
+cleanup may cause growing state.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;NonHeap memory is dominated by the metaspace, the size of which 
is unlimited by default
+and holds class metadata as well as static content. There is a 
+&lt;a 
href=&quot;https://issues.apache.org/jira/browse/FLINK-10317&quot;&gt;JIRA 
Ticket&lt;/a&gt; to limit the size
+to 250 megabytes by default.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;The biggest driver of Direct memory is by far the
+number of Flink’s network buffers, which can be
+&lt;a 
href=&quot;https://ci.apache.org/projects/flink/flink-docs-release-1.7/ops/config.html#configuring-the-network-buffers&quot;&gt;configured&lt;/a&gt;.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;Mapped memory is usually close to zero as Flink does not use 
memory-mapped files.&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;In a containerized environment you should additionally monitor the 
overall
+memory consumption of the Job- and TaskManager containers to ensure they don’t
+exceed their resource limits. This is particularly important when using the
+RocksDB statebackend, since RocksDB allocates a considerable amount of
+memory off heap. To understand how much memory RocksDB might use, you can
+check out &lt;a 
href=&quot;https://www.da-platform.com/blog/manage-rocksdb-memory-size-apache-flink&quot;&gt;this
 blog
+post&lt;/a&gt;
+by Stefan Richter.&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Key Metrics&lt;/strong&gt;&lt;/p&gt;
+
+&lt;table&gt;
+  &lt;thead&gt;
+    &lt;tr&gt;
+      &lt;th&gt;Metric&lt;/th&gt;
+      &lt;th&gt;Scope&lt;/th&gt;
+      &lt;th&gt;Description&lt;/th&gt;
+    &lt;/tr&gt;
+  &lt;/thead&gt;
+  &lt;tbody&gt;
+    &lt;tr&gt;
+      
&lt;td&gt;&lt;code&gt;Status.JVM.Memory.NonHeap.Committed&lt;/code&gt;&lt;/td&gt;
+      &lt;td&gt;job-/taskmanager&lt;/td&gt;
+      &lt;td&gt;The amount of non-heap memory guaranteed to be available to 
the JVM (in bytes).&lt;/td&gt;
+    &lt;/tr&gt;
+    &lt;tr&gt;
+      &lt;td&gt;&lt;code&gt;Status.JVM.Memory.Heap.Used&lt;/code&gt;&lt;/td&gt;
+      &lt;td&gt;job-/taskmanager&lt;/td&gt;
+      &lt;td&gt;The amount of heap memory currently used (in bytes).&lt;/td&gt;
+    &lt;/tr&gt;
+    &lt;tr&gt;
+      
&lt;td&gt;&lt;code&gt;Status.JVM.Memory.Heap.Committed&lt;/code&gt;&lt;/td&gt;
+      &lt;td&gt;job-/taskmanager&lt;/td&gt;
+      &lt;td&gt;The amount of heap memory guaranteed to be available to the 
JVM (in bytes).&lt;/td&gt;
+    &lt;/tr&gt;
+    &lt;tr&gt;
+      
&lt;td&gt;&lt;code&gt;Status.JVM.Memory.Direct.MemoryUsed&lt;/code&gt;&lt;/td&gt;
+      &lt;td&gt;job-/taskmanager&lt;/td&gt;
+      &lt;td&gt;The amount of memory used by the JVM for the direct buffer 
pool (in bytes).&lt;/td&gt;
+    &lt;/tr&gt;
+    &lt;tr&gt;
+      
&lt;td&gt;&lt;code&gt;Status.JVM.Memory.Mapped.MemoryUsed&lt;/code&gt;&lt;/td&gt;
+      &lt;td&gt;job-/taskmanager&lt;/td&gt;
+      &lt;td&gt;The amount of memory used by the JVM for the mapped buffer 
pool (in bytes).&lt;/td&gt;
+    &lt;/tr&gt;
+    &lt;tr&gt;
+      &lt;td&gt;&lt;code&gt;Status.JVM.GarbageCollector.G1 Young 
Generation.Time&lt;/code&gt;&lt;/td&gt;
+      &lt;td&gt;job-/taskmanager&lt;/td&gt;
+      &lt;td&gt;The total time spent performing G1 Young Generation garbage 
collection.&lt;/td&gt;
+    &lt;/tr&gt;
+    &lt;tr&gt;
+      &lt;td&gt;&lt;code&gt;Status.JVM.GarbageCollector.G1 Old 
Generation.Time&lt;/code&gt;&lt;/td&gt;
+      &lt;td&gt;job-/taskmanager&lt;/td&gt;
+      &lt;td&gt;The total time spent performing G1 Old Generation garbage 
collection.&lt;/td&gt;
+    &lt;/tr&gt;
+  &lt;/tbody&gt;
+&lt;/table&gt;
+
+&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Example Dashboard Panel&lt;/strong&gt;&lt;/p&gt;
+
+&lt;center&gt;
+&lt;img 
src=&quot;/img/blog/2019-02-21-monitoring-best-practices/fig-6.png&quot; 
width=&quot;800px&quot; alt=&quot;TaskManager memory consumption and garbage 
collection times.&quot; /&gt;
+&lt;br /&gt;
+&lt;i&gt;&lt;small&gt;TaskManager memory consumption and garbage collection 
times.&lt;/small&gt;&lt;/i&gt;
+&lt;/center&gt;
+&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
+
+&lt;center&gt;
+&lt;img 
src=&quot;/img/blog/2019-02-21-monitoring-best-practices/fig-7.png&quot; 
width=&quot;800px&quot; alt=&quot;JobManager memory consumption and garbage 
collection times.&quot; /&gt;
+&lt;br /&gt;
+&lt;i&gt;&lt;small&gt;JobManager memory consumption and garbage collection 
times.&lt;/small&gt;&lt;/i&gt;
+&lt;/center&gt;
+&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Possible Alerts&lt;/strong&gt;&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;&lt;code&gt;container memory limit&lt;/code&gt; &amp;lt; 
&lt;code&gt;container memory + safety margin&lt;/code&gt;&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;h3 id=&quot;cpu&quot;&gt;CPU&lt;/h3&gt;
+
+&lt;p&gt;Besides memory, you should also monitor the CPU load of the 
TaskManagers. If
+your TaskManagers are constantly under very high load, you might be able to
+improve the overall performance by decreasing the number of task slots per
+TaskManager (in case of a Standalone setup), by providing more resources to the
+TaskManager (in case of a containerized setup), or by providing more
+TaskManagers. In general, a system already running under very high load during
+normal operations will need much more time to catch up after recovering from a
+downtime. During this time you will see a much higher latency (event-time 
skew) than
+usual.&lt;/p&gt;
+
+&lt;p&gt;A sudden increase in the CPU load might also be attributed to high 
garbage
+collection pressure, which should be visible in the JVM memory metrics as 
well.&lt;/p&gt;
+
+&lt;p&gt;If one or a few TaskManagers are constantly under very high load, 
this can slow
+down the whole topology due to long checkpoint alignment times and increasing
+event-time skew. A common reason is skew in the partition key of the data, 
which
+can be mitigated by pre-aggregating before the shuffle or keying on a more
+evenly distributed key.&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Key Metrics&lt;/strong&gt;&lt;/p&gt;
+
+&lt;table&gt;
+  &lt;thead&gt;
+    &lt;tr&gt;
+      &lt;th&gt;Metric&lt;/th&gt;
+      &lt;th&gt;Scope&lt;/th&gt;
+      &lt;th&gt;Description&lt;/th&gt;
+    &lt;/tr&gt;
+  &lt;/thead&gt;
+  &lt;tbody&gt;
+    &lt;tr&gt;
+      &lt;td&gt;&lt;code&gt;Status.JVM.CPU.Load&lt;/code&gt;&lt;/td&gt;
+      &lt;td&gt;job-/taskmanager&lt;/td&gt;
+      &lt;td&gt;The recent CPU usage of the JVM.&lt;/td&gt;
+    &lt;/tr&gt;
+  &lt;/tbody&gt;
+&lt;/table&gt;
+
+&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Example Dashboard Panel&lt;/strong&gt;&lt;/p&gt;
+
+&lt;center&gt;
+&lt;img 
src=&quot;/img/blog/2019-02-21-monitoring-best-practices/fig-8.png&quot; 
width=&quot;800px&quot; alt=&quot;TaskManager &amp;amp; JobManager CPU 
load.&quot; /&gt;
+&lt;br /&gt;
+&lt;i&gt;&lt;small&gt;TaskManager &amp;amp; JobManager CPU 
load.&lt;/small&gt;&lt;/i&gt;
+&lt;/center&gt;
+&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
+
+&lt;h2 id=&quot;system-resources&quot;&gt;System Resources&lt;/h2&gt;
+
+&lt;p&gt;In addition to the JVM metrics above, it is also possible to use 
Flink’s metrics
+system to gather insights about system resources, i.e. memory, CPU &amp;amp;
+network-related metrics for the whole machine as opposed to the Flink processes
+alone. System resource monitoring is disabled by default and requires 
additional
+dependencies on the classpath. Please check out the 
+&lt;a 
href=&quot;https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/metrics.html#system-resources&quot;&gt;Flink
 system resource metrics documentation&lt;/a&gt; for
+additional guidance and details. System resource monitoring in Flink can be 
very
+helpful in setups without existing host monitoring capabilities.&lt;/p&gt;
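+
+&lt;p&gt;As a sketch, and assuming the optional dependencies mentioned in the linked
+documentation (e.g. &lt;code&gt;oshi-core&lt;/code&gt;) have been added to the classpath, the
+configuration could look roughly as follows (option names are assumptions to
+verify for your Flink version):&lt;/p&gt;
+
+&lt;pre&gt;&lt;code&gt;# sketch only; see the system resource metrics documentation
+metrics.system-resource: true
+metrics.system-resource-probing-interval: 5000
+&lt;/code&gt;&lt;/pre&gt;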
+
+&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;
+
+&lt;p&gt;This post tries to shed some light on Flink’s metrics and monitoring 
system. You
+can utilise it as a starting point when you first think about how to
+successfully monitor your Flink application. I highly recommend that you start
+monitoring your Flink application early on in the development phase. This way
+you will be able to improve your dashboards and alerts over time and, more
+importantly, observe the performance impact of the changes to your application
+throughout the development phase. By doing so, you can ask the right questions
+about the runtime behaviour of your application, and learn much more about
+Flink’s internals early on.&lt;/p&gt;
+
+&lt;p&gt;Last but not least, this post only scratches the surface of the 
overall metrics
+and monitoring capabilities of Apache Flink. I highly recommend going over
+&lt;a 
href=&quot;https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/metrics.html&quot;&gt;Flink’s
 metrics documentation&lt;/a&gt;
+for a full reference of Flink’s metrics system.&lt;/p&gt;
+</description>
+<pubDate>Mon, 25 Feb 2019 13:00:00 +0100</pubDate>
+<link>https://flink.apache.org/news/2019/02/25/monitoring-best-practices.html</link>
+<guid isPermaLink="true">/news/2019/02/25/monitoring-best-practices.html</guid>
+</item>
+
+<item>
 <title>Apache Flink 1.6.4 Released</title>
 <description>&lt;p&gt;The Apache Flink community released the fourth bugfix 
version of the Apache Flink 1.6 series.&lt;/p&gt;
 
diff --git a/content/blog/index.html b/content/blog/index.html
index ba5eaa4..1c7fad0 100644
--- a/content/blog/index.html
+++ b/content/blog/index.html
@@ -155,6 +155,19 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a 
href="/news/2019/02/25/monitoring-best-practices.html">Monitoring Apache Flink 
Applications 101</a></h2>
+
+      <p>25 Feb 2019
+       Konstantin Knauf (<a 
href="https://twitter.com/snntrable";>@snntrable</a>)</p>
+
+      <p>The monitoring of business-critical applications is a crucial aspect 
of a production deployment. It ensures that any degradation or downtime is 
immediately identified and can be resolved as quickly as possible. In this 
post, we discuss the most important metrics that indicate healthy Flink 
applications.</p>
+
+      <p><a href="/news/2019/02/25/monitoring-best-practices.html">Continue 
reading &raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a 
href="/news/2019/02/25/release-1.6.4.html">Apache Flink 1.6.4 Released</a></h2>
 
       <p>25 Feb 2019
@@ -289,21 +302,6 @@ Please check the <a 
href="https://issues.apache.org/jira/secure/ReleaseNote.jspa
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a 
href="/news/2018/09/20/release-1.6.1.html">Apache Flink 1.6.1 Released</a></h2>
-
-      <p>20 Sep 2018
-      </p>
-
-      <p><p>The Apache Flink community released the first bugfix version of 
the Apache Flink 1.6 series.</p>
-
-</p>
-
-      <p><a href="/news/2018/09/20/release-1.6.1.html">Continue reading 
&raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -336,6 +334,16 @@ Please check the <a 
href="https://issues.apache.org/jira/secure/ReleaseNote.jspa
 
     <ul id="markdown-toc">
       
+      <li><a href="/news/2019/02/25/monitoring-best-practices.html">Monitoring 
Apache Flink Applications 101</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2019/02/25/release-1.6.4.html">Apache Flink 1.6.4 
Released</a></li>
 
       
diff --git a/content/blog/page2/index.html b/content/blog/page2/index.html
index 1322666..ea5852d 100644
--- a/content/blog/page2/index.html
+++ b/content/blog/page2/index.html
@@ -155,6 +155,21 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a 
href="/news/2018/09/20/release-1.6.1.html">Apache Flink 1.6.1 Released</a></h2>
+
+      <p>20 Sep 2018
+      </p>
+
+      <p><p>The Apache Flink community released the first bugfix version of 
the Apache Flink 1.6 series.</p>
+
+</p>
+
+      <p><a href="/news/2018/09/20/release-1.6.1.html">Continue reading 
&raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a 
href="/news/2018/09/20/release-1.5.4.html">Apache Flink 1.5.4 Released</a></h2>
 
       <p>20 Sep 2018
@@ -287,21 +302,6 @@
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a 
href="/news/2018/02/15/release-1.4.1.html">Apache Flink 1.4.1 Released</a></h2>
-
-      <p>15 Feb 2018
-      </p>
-
-      <p><p>The Apache Flink community released the first bugfix version of 
the Apache Flink 1.4 series.</p>
-
-</p>
-
-      <p><a href="/news/2018/02/15/release-1.4.1.html">Continue reading 
&raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -334,6 +334,16 @@
 
     <ul id="markdown-toc">
       
+      <li><a href="/news/2019/02/25/monitoring-best-practices.html">Monitoring 
Apache Flink Applications 101</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2019/02/25/release-1.6.4.html">Apache Flink 1.6.4 
Released</a></li>
 
       
diff --git a/content/blog/page3/index.html b/content/blog/page3/index.html
index 9e71489..53897c2 100644
--- a/content/blog/page3/index.html
+++ b/content/blog/page3/index.html
@@ -155,6 +155,21 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a 
href="/news/2018/02/15/release-1.4.1.html">Apache Flink 1.4.1 Released</a></h2>
+
+      <p>15 Feb 2018
+      </p>
+
+      <p><p>The Apache Flink community released the first bugfix version of 
the Apache Flink 1.4 series.</p>
+
+</p>
+
+      <p><a href="/news/2018/02/15/release-1.4.1.html">Continue reading 
&raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a 
href="/features/2018/01/30/incremental-checkpointing.html">Managing Large State 
in Apache Flink: An Intro to Incremental Checkpointing</a></h2>
 
       <p>30 Jan 2018
@@ -288,21 +303,6 @@ what’s coming in Flink 1.4.0 as well as a preview of what 
the Flink community
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a 
href="/news/2017/04/26/release-1.2.1.html">Apache Flink 1.2.1 Released</a></h2>
-
-      <p>26 Apr 2017
-      </p>
-
-      <p><p>The Apache Flink community released the first bugfix version of 
the Apache Flink 1.2 series.</p>
-
-</p>
-
-      <p><a href="/news/2017/04/26/release-1.2.1.html">Continue reading 
&raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -335,6 +335,16 @@ what’s coming in Flink 1.4.0 as well as a preview of what 
the Flink community
 
     <ul id="markdown-toc">
       
+      <li><a href="/news/2019/02/25/monitoring-best-practices.html">Monitoring 
Apache Flink Applications 101</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2019/02/25/release-1.6.4.html">Apache Flink 1.6.4 
Released</a></li>
 
       
diff --git a/content/blog/page4/index.html b/content/blog/page4/index.html
index 9a063fa..49cbb44 100644
--- a/content/blog/page4/index.html
+++ b/content/blog/page4/index.html
@@ -155,6 +155,21 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a 
href="/news/2017/04/26/release-1.2.1.html">Apache Flink 1.2.1 Released</a></h2>
+
+      <p>26 Apr 2017
+      </p>
+
+      <p><p>The Apache Flink community released the first bugfix version of 
the Apache Flink 1.2 series.</p>
+
+</p>
+
+      <p><a href="/news/2017/04/26/release-1.2.1.html">Continue reading 
&raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a 
href="/news/2017/04/04/dynamic-tables.html">Continuous Queries on Dynamic 
Tables</a></h2>
 
       <p>04 Apr 2017 by Fabian Hueske, Shaoxuan Wang, and Xiaowei Jiang
@@ -282,21 +297,6 @@
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a 
href="/news/2016/08/11/release-1.1.1.html">Flink 1.1.1 Released</a></h2>
-
-      <p>11 Aug 2016
-      </p>
-
-      <p><p>Today, the Flink community released Flink version 1.1.1.</p>
-
-</p>
-
-      <p><a href="/news/2016/08/11/release-1.1.1.html">Continue reading 
&raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -329,6 +329,16 @@
 
     <ul id="markdown-toc">
       
+      <li><a href="/news/2019/02/25/monitoring-best-practices.html">Monitoring 
Apache Flink Applications 101</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2019/02/25/release-1.6.4.html">Apache Flink 1.6.4 
Released</a></li>
 
       
diff --git a/content/blog/page5/index.html b/content/blog/page5/index.html
index 1b53216b..ad877ae 100644
--- a/content/blog/page5/index.html
+++ b/content/blog/page5/index.html
@@ -155,6 +155,21 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a 
href="/news/2016/08/11/release-1.1.1.html">Flink 1.1.1 Released</a></h2>
+
+      <p>11 Aug 2016
+      </p>
+
+      <p><p>Today, the Flink community released Flink version 1.1.1.</p>
+
+</p>
+
+      <p><a href="/news/2016/08/11/release-1.1.1.html">Continue reading 
&raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a 
href="/news/2016/08/08/release-1.1.0.html">Announcing Apache Flink 
1.1.0</a></h2>
 
       <p>08 Aug 2016
@@ -286,19 +301,6 @@
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a 
href="/news/2015/12/18/a-year-in-review.html">Flink 2015: A year in review, and 
a lookout to 2016</a></h2>
-
-      <p>18 Dec 2015 by Robert Metzger (<a 
href="https://twitter.com/";>@rmetzger_</a>)
-      </p>
-
-      <p><p>With 2015 ending, we thought that this would be good time to 
reflect on the amazing work done by the Flink community over this past year, 
and how much this community has grown.</p></p>
-
-      <p><a href="/news/2015/12/18/a-year-in-review.html">Continue reading 
&raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -331,6 +333,16 @@
 
     <ul id="markdown-toc">
       
+      <li><a href="/news/2019/02/25/monitoring-best-practices.html">Monitoring 
Apache Flink Applications 101</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2019/02/25/release-1.6.4.html">Apache Flink 1.6.4 
Released</a></li>
 
       
diff --git a/content/blog/page6/index.html b/content/blog/page6/index.html
index c1dea47..1018774 100644
--- a/content/blog/page6/index.html
+++ b/content/blog/page6/index.html
@@ -155,6 +155,19 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a 
href="/news/2015/12/18/a-year-in-review.html">Flink 2015: A year in review, and 
a lookout to 2016</a></h2>
+
+      <p>18 Dec 2015 by Robert Metzger (<a 
href="https://twitter.com/";>@rmetzger_</a>)
+      </p>
+
+      <p><p>With 2015 ending, we thought that this would be good time to 
reflect on the amazing work done by the Flink community over this past year, 
and how much this community has grown.</p></p>
+
+      <p><a href="/news/2015/12/18/a-year-in-review.html">Continue reading 
&raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a 
href="/news/2015/12/11/storm-compatibility.html">Storm Compatibility in Apache 
Flink: How to run existing Storm topologies on Flink</a></h2>
 
       <p>11 Dec 2015 by Matthias J. Sax (<a 
href="https://twitter.com/";>@MatthiasJSax</a>)
@@ -292,19 +305,6 @@ vertex-centric or gather-sum-apply to Flink dataflows.</p>
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a 
href="/news/2015/05/14/Community-update-April.html">April 2015 in the Flink 
community</a></h2>
-
-      <p>14 May 2015 by Kostas Tzoumas (<a 
href="https://twitter.com/";>@kostas_tzoumas</a>)
-      </p>
-
-      <p><p>The monthly update from the Flink community. Including the 
availability of a new preview release, lots of meetups and conference talks and 
a great interview about Flink.</p></p>
-
-      <p><a href="/news/2015/05/14/Community-update-April.html">Continue 
reading &raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -337,6 +337,16 @@ vertex-centric or gather-sum-apply to Flink dataflows.</p>
 
     <ul id="markdown-toc">
       
+      <li><a href="/news/2019/02/25/monitoring-best-practices.html">Monitoring 
Apache Flink Applications 101</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2019/02/25/release-1.6.4.html">Apache Flink 1.6.4 
Released</a></li>
 
       
diff --git a/content/blog/page7/index.html b/content/blog/page7/index.html
index dcd4c3d..2ccdfed 100644
--- a/content/blog/page7/index.html
+++ b/content/blog/page7/index.html
@@ -155,6 +155,19 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a 
href="/news/2015/05/14/Community-update-April.html">April 2015 in the Flink 
community</a></h2>
+
+      <p>14 May 2015 by Kostas Tzoumas (<a 
href="https://twitter.com/";>@kostas_tzoumas</a>)
+      </p>
+
+      <p><p>The monthly update from the Flink community. Including the 
availability of a new preview release, lots of meetups and conference talks and 
a great interview about Flink.</p></p>
+
+      <p><a href="/news/2015/05/14/Community-update-April.html">Continue 
reading &raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a 
href="/news/2015/05/11/Juggling-with-Bits-and-Bytes.html">Juggling with Bits 
and Bytes</a></h2>
 
       <p>11 May 2015 by Fabian Hüske (<a 
href="https://twitter.com/";>@fhueske</a>)
@@ -298,21 +311,6 @@ and offers a new API including definition of flexible 
windows.</p>
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a 
href="/news/2014/11/18/hadoop-compatibility.html">Hadoop Compatibility in 
Flink</a></h2>
-
-      <p>18 Nov 2014 by Fabian Hüske (<a 
href="https://twitter.com/";>@fhueske</a>)
-      </p>
-
-      <p><p><a href="http://hadoop.apache.org";>Apache Hadoop</a> is an 
industry standard for scalable analytical data processing. Many data analysis 
applications have been implemented as Hadoop MapReduce jobs and run in clusters 
around the world. Apache Flink can be an alternative to MapReduce and improves 
it in many dimensions. Among other features, Flink provides much better 
performance and offers APIs in Java and Scala, which are very easy to use. 
Similar to Hadoop, Flink’s APIs provi [...]
-
-</p>
-
-      <p><a href="/news/2014/11/18/hadoop-compatibility.html">Continue reading 
&raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -345,6 +343,16 @@ and offers a new API including definition of flexible 
windows.</p>
 
     <ul id="markdown-toc">
       
+      <li><a href="/news/2019/02/25/monitoring-best-practices.html">Monitoring 
Apache Flink Applications 101</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2019/02/25/release-1.6.4.html">Apache Flink 1.6.4 
Released</a></li>
 
       
diff --git a/content/blog/page8/index.html b/content/blog/page8/index.html
index 494132b..d8f5bb7 100644
--- a/content/blog/page8/index.html
+++ b/content/blog/page8/index.html
@@ -155,6 +155,21 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a 
href="/news/2014/11/18/hadoop-compatibility.html">Hadoop Compatibility in 
Flink</a></h2>
+
+      <p>18 Nov 2014 by Fabian Hüske (<a 
href="https://twitter.com/";>@fhueske</a>)
+      </p>
+
+      <p><p><a href="http://hadoop.apache.org";>Apache Hadoop</a> is an 
industry standard for scalable analytical data processing. Many data analysis 
applications have been implemented as Hadoop MapReduce jobs and run in clusters 
around the world. Apache Flink can be an alternative to MapReduce and improves 
it in many dimensions. Among other features, Flink provides much better 
performance and offers APIs in Java and Scala, which are very easy to use. 
Similar to Hadoop, Flink’s APIs provi [...]
+
+</p>
+
+      <p><a href="/news/2014/11/18/hadoop-compatibility.html">Continue reading 
&raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a 
href="/news/2014/11/04/release-0.7.0.html">Apache Flink 0.7.0 available</a></h2>
 
       <p>04 Nov 2014
@@ -249,6 +264,16 @@ academic and open source project that Flink originates 
from.</p>
 
     <ul id="markdown-toc">
       
+      <li><a href="/news/2019/02/25/monitoring-best-practices.html">Monitoring 
Apache Flink Applications 101</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2019/02/25/release-1.6.4.html">Apache Flink 1.6.4 
Released</a></li>
 
       
diff --git a/content/img/blog/2019-02-21-monitoring-best-practices/fig-1.png 
b/content/img/blog/2019-02-21-monitoring-best-practices/fig-1.png
new file mode 100644
index 0000000..70659b7
Binary files /dev/null and 
b/content/img/blog/2019-02-21-monitoring-best-practices/fig-1.png differ
diff --git a/content/img/blog/2019-02-21-monitoring-best-practices/fig-2.png 
b/content/img/blog/2019-02-21-monitoring-best-practices/fig-2.png
new file mode 100644
index 0000000..06c0b7a
Binary files /dev/null and 
b/content/img/blog/2019-02-21-monitoring-best-practices/fig-2.png differ
diff --git a/content/img/blog/2019-02-21-monitoring-best-practices/fig-3.png 
b/content/img/blog/2019-02-21-monitoring-best-practices/fig-3.png
new file mode 100644
index 0000000..97513db
Binary files /dev/null and 
b/content/img/blog/2019-02-21-monitoring-best-practices/fig-3.png differ
diff --git a/content/img/blog/2019-02-21-monitoring-best-practices/fig-4.png 
b/content/img/blog/2019-02-21-monitoring-best-practices/fig-4.png
new file mode 100644
index 0000000..b536dbb
Binary files /dev/null and 
b/content/img/blog/2019-02-21-monitoring-best-practices/fig-4.png differ
diff --git a/content/img/blog/2019-02-21-monitoring-best-practices/fig-5.png 
b/content/img/blog/2019-02-21-monitoring-best-practices/fig-5.png
new file mode 100644
index 0000000..cbf29d7
Binary files /dev/null and 
b/content/img/blog/2019-02-21-monitoring-best-practices/fig-5.png differ
diff --git a/content/img/blog/2019-02-21-monitoring-best-practices/fig-6.png 
b/content/img/blog/2019-02-21-monitoring-best-practices/fig-6.png
new file mode 100644
index 0000000..8cdae36
Binary files /dev/null and 
b/content/img/blog/2019-02-21-monitoring-best-practices/fig-6.png differ
diff --git a/content/img/blog/2019-02-21-monitoring-best-practices/fig-7.png 
b/content/img/blog/2019-02-21-monitoring-best-practices/fig-7.png
new file mode 100644
index 0000000..b58e64e
Binary files /dev/null and 
b/content/img/blog/2019-02-21-monitoring-best-practices/fig-7.png differ
diff --git a/content/img/blog/2019-02-21-monitoring-best-practices/fig-8.png 
b/content/img/blog/2019-02-21-monitoring-best-practices/fig-8.png
new file mode 100644
index 0000000..414b577
Binary files /dev/null and 
b/content/img/blog/2019-02-21-monitoring-best-practices/fig-8.png differ
diff --git a/content/index.html b/content/index.html
index e63f823..99a20a3 100644
--- a/content/index.html
+++ b/content/index.html
@@ -438,6 +438,9 @@
 
   <dl>
       
+        <dt> <a 
href="/news/2019/02/25/monitoring-best-practices.html">Monitoring Apache Flink 
Applications 101</a></dt>
+        <dd>The monitoring of business-critical applications is a crucial 
aspect of a production deployment. It ensures that any degradation or downtime 
is immediately identified and can be resolved as quickly as possible. In this 
post, we discuss the most important metrics that indicate healthy Flink 
applications.</dd>
+      
         <dt> <a href="/news/2019/02/25/release-1.6.4.html">Apache Flink 1.6.4 
Released</a></dt>
         <dd><p>The Apache Flink community released the fourth bugfix version 
of the Apache Flink 1.6 series.</p>
 
@@ -455,11 +458,6 @@
         <dd><p>The Apache Flink community released the sixth and last bugfix 
version of the Apache Flink 1.5 series.</p>
 
 </dd>
-      
-        <dt> <a href="/news/2018/12/22/release-1.6.3.html">Apache Flink 1.6.3 
Released</a></dt>
-        <dd><p>The Apache Flink community released the third bugfix version of 
the Apache Flink 1.6 series.</p>
-
-</dd>
     
   </dl>
 
diff --git a/content/news/2019/02/25/monitoring-best-practices.html 
b/content/news/2019/02/25/monitoring-best-practices.html
new file mode 100644
index 0000000..859628e
--- /dev/null
+++ b/content/news/2019/02/25/monitoring-best-practices.html
@@ -0,0 +1,791 @@
+<!DOCTYPE html>
+<html lang="en">
+  <head>
+    <meta charset="utf-8">
+    <meta http-equiv="X-UA-Compatible" content="IE=edge">
+    <meta name="viewport" content="width=device-width, initial-scale=1">
+    <!-- The above 3 meta tags *must* come first in the head; any other head 
content must come *after* these tags -->
+    <title>Apache Flink: Monitoring Apache Flink Applications 101</title>
+    <link rel="shortcut icon" href="/favicon.ico" type="image/x-icon">
+    <link rel="icon" href="/favicon.ico" type="image/x-icon">
+
+    <!-- Bootstrap -->
+    <link rel="stylesheet" 
href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.4/css/bootstrap.min.css";>
+    <link rel="stylesheet" href="/css/flink.css">
+    <link rel="stylesheet" href="/css/syntax.css">
+
+    <!-- Blog RSS feed -->
+    <link href="/blog/feed.xml" rel="alternate" type="application/rss+xml" 
title="Apache Flink Blog: RSS feed" />
+
+    <!-- jQuery (necessary for Bootstrap's JavaScript plugins) -->
+    <!-- We need to load Jquery in the header for custom google analytics 
event tracking-->
+    <script src="/js/jquery.min.js"></script>
+
+    <!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media 
queries -->
+    <!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
+    <!--[if lt IE 9]>
+      <script 
src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js";></script>
+      <script 
src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js";></script>
+    <![endif]-->
+  </head>
+  <body>  
+    
+
+    <!-- Main content. -->
+    <div class="container">
+    <div class="row">
+
+      
+     <div id="sidebar" class="col-sm-3">
+        
+
+<!-- Top navbar. -->
+    <nav class="navbar navbar-default">
+        <!-- The logo. -->
+        <div class="navbar-header">
+          <button type="button" class="navbar-toggle collapsed" 
data-toggle="collapse" data-target="#bs-example-navbar-collapse-1">
+            <span class="icon-bar"></span>
+            <span class="icon-bar"></span>
+            <span class="icon-bar"></span>
+          </button>
+          <div class="navbar-logo">
+            <a href="/">
+              <img alt="Apache Flink" src="/img/flink-header-logo.svg" 
width="147px" height="73px">
+            </a>
+          </div>
+        </div><!-- /.navbar-header -->
+
+        <!-- The navigation links. -->
+        <div class="collapse navbar-collapse" 
id="bs-example-navbar-collapse-1">
+          <ul class="nav navbar-nav navbar-main">
+
+            <!-- First menu section explains visitors what Flink is -->
+
+            <!-- What is Stream Processing? -->
+            <!--
+            <li><a href="/streamprocessing1.html">What is Stream 
Processing?</a></li>
+            -->
+
+            <!-- What is Flink? -->
+            <li><a href="/flink-architecture.html">What is Apache 
Flink?</a></li>
+
+            <!-- Use cases -->
+            <li><a href="/usecases.html">Use Cases</a></li>
+
+            <!-- Powered by -->
+            <li><a href="/poweredby.html">Powered By</a></li>
+
+            <!-- FAQ -->
+            <li><a href="/faq.html">FAQ</a></li>
+
+            &nbsp;
+            <!-- Second menu section aims to support Flink users -->
+
+            <!-- Downloads -->
+            <li><a href="/downloads.html">Downloads</a></li>
+
+            <!-- Quickstart -->
+            <li>
+              <a 
href="https://ci.apache.org/projects/flink/flink-docs-release-1.7/quickstart/setup_quickstart.html";
 target="_blank">Tutorials <small><span class="glyphicon 
glyphicon-new-window"></span></small></a>
+            </li>
+
+            <!-- Documentation -->
+            <li class="dropdown">
+              <a class="dropdown-toggle" data-toggle="dropdown" 
href="#">Documentation<span class="caret"></span></a>
+              <ul class="dropdown-menu">
+                <li><a 
href="https://ci.apache.org/projects/flink/flink-docs-release-1.7"; 
target="_blank">1.7 (Latest stable release) <small><span class="glyphicon 
glyphicon-new-window"></span></small></a></li>
+                <li><a 
href="https://ci.apache.org/projects/flink/flink-docs-master"; 
target="_blank">1.8 (Snapshot) <small><span class="glyphicon 
glyphicon-new-window"></span></small></a></li>
+              </ul>
+            </li>
+
+            <!-- getting help -->
+            <li><a href="/gettinghelp.html">Getting Help</a></li>
+
+            <!-- Blog -->
+            <li class="active"><a href="/blog/"><b>Flink Blog</b></a></li>
+
+            &nbsp;
+
+            <!-- Third menu section aim to support community and contributors 
-->
+
+            <!-- Community -->
+            <li><a href="/community.html">Community &amp; Project Info</a></li>
+
+            <!-- Contribute -->
+            <li><a href="/how-to-contribute.html">How to Contribute</a></li>
+
+            <!-- GitHub -->
+            <li>
+              <a href="https://github.com/apache/flink"; target="_blank">Flink 
on GitHub <small><span class="glyphicon 
glyphicon-new-window"></span></small></a>
+            </li>
+
+            &nbsp;
+
+            <!-- Language Switcher -->
+            <li>
+              
+                 
+                  <a 
href="/zh/news/2019/02/25/monitoring-best-practices.html">中文版</a>   
+                
+              
+            </li>
+
+          </ul>
+
+          <ul class="nav navbar-nav navbar-bottom">
+          <hr />
+
+            <!-- Twitter -->
+            <li><a href="https://twitter.com/apacheflink"; 
target="_blank">@ApacheFlink <small><span class="glyphicon 
glyphicon-new-window"></span></small></a></li>
+
+            <!-- Visualizer -->
+            <li class=" hidden-md hidden-sm"><a href="/visualizer/" 
target="_blank">Plan Visualizer <small><span class="glyphicon 
glyphicon-new-window"></span></small></a></li>
+
+          </ul>
+        </div><!-- /.navbar-collapse -->
+    </nav>
+      </div>
+      <div class="col-sm-9">
+      <div class="row-fluid">
+  <div class="col-sm-12">
+    <div class="row">
+      <h1>Monitoring Apache Flink Applications 101</h1>
+
+      <article>
+        <p>25 Feb 2019 Konstantin Knauf (<a 
href="https://twitter.com/snntrable";>@snntrable</a>)</p>
+
+<!-- improve style of tables -->
+<style>
+  table { border: 0px solid black; table-layout: auto; width: 800px; }
+  th, td { border: 1px solid black; padding: 5px; padding-left: 10px; 
padding-right: 10px; }
+  th { text-align: center }
+  td { vertical-align: top }
+</style>
+
+<p>This blog post provides an introduction to Apache Flink’s built-in 
monitoring
+and metrics system, which allows developers to effectively monitor their Flink
+jobs. Oftentimes, the task of picking the relevant metrics to monitor a
+Flink application can be overwhelming for a DevOps team that is just starting
+with stream processing and Apache Flink. Having worked with many organizations
+that deploy Flink at scale, I would like to share my experience and some best
+practices with the community.</p>
+
+<p>With business-critical applications running on Apache Flink, performance 
monitoring
+becomes an increasingly important part of a successful production deployment. 
It 
+ensures that any degradation or downtime is immediately identified and resolved
+as quickly as possible.</p>
+
+<p>Monitoring goes hand-in-hand with observability, which is a prerequisite for
+troubleshooting and performance tuning. Nowadays, with the complexity of modern
+enterprise applications and the speed of delivery increasing, an engineering
+team must understand and have a complete overview of its applications’ status 
at
+any given point in time.</p>
+
+<h2 id="flinks-metrics-system">Flink’s Metrics System</h2>
+
+<p>The foundation for monitoring Flink jobs is its <a 
href="https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/metrics.html";>metrics
+system</a>
+which consists of two components: <code>Metrics</code> and <code>MetricsReporters</code>.</p>
+
+<h3 id="metrics">Metrics</h3>
+
+<p>Flink comes with a comprehensive set of built-in metrics such as:</p>
+
+<ul>
+  <li>Used JVM Heap / NonHeap / Direct Memory (per Task-/JobManager)</li>
+  <li>Number of Job Restarts (per Job)</li>
+  <li>Number of Records Per Second (per Operator)</li>
+  <li>…</li>
+</ul>
+
+<p>These metrics have different scopes and measure general aspects (e.g. of the JVM or
+operating system) as well as Flink-specific aspects.</p>
+
+<p>As a user, you can and should add application-specific metrics to your
+functions. Typically these include counters for the number of invalid records 
or
+the number of records temporarily buffered in managed state. Besides counters,
+Flink offers additional metrics types like gauges and histograms. For
+instructions on how to register your own metrics with Flink’s metrics system
+please check out <a 
href="https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/metrics.html#registering-metrics";>Flink’s
+documentation</a>.
+In this blog post, we will focus on how to get the most out of Flink’s built-in
+metrics.</p>
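+
+<p>For illustration, here is a minimal sketch of such an application-specific
+metric: a counter for invalid records registered in a
+<code>RichFlatMapFunction</code>. The metric name and the validation logic are
+made up for this example.</p>
+
+<div class="highlight"><pre><code class="language-java">import org.apache.flink.api.common.functions.RichFlatMapFunction;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.metrics.Counter;
+import org.apache.flink.util.Collector;
+
+public class ValidatingFlatMap extends RichFlatMapFunction&lt;String, String&gt; {
+
+    private transient Counter invalidRecords;
+
+    @Override
+    public void open(Configuration parameters) {
+        // Register a custom counter under this operator's metric group.
+        invalidRecords = getRuntimeContext()
+            .getMetricGroup()
+            .counter("invalidRecords");
+    }
+
+    @Override
+    public void flatMap(String value, Collector&lt;String&gt; out) {
+        if (value == null || value.isEmpty()) {
+            // Count records that fail validation instead of emitting them.
+            invalidRecords.inc();
+        } else {
+            out.collect(value);
+        }
+    }
+}
+</code></pre></div>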
+
+<h3 id="metricsreporters">MetricsReporters</h3>
+
+<p>All metrics can be queried via Flink’s REST API. However, users can 
configure
+MetricsReporters to send the metrics to external systems. Apache Flink provides
+reporters for the most common monitoring tools out-of-the-box, including JMX,
+Prometheus, Datadog, Graphite and InfluxDB. For information about how to
+configure a reporter check out Flink’s <a 
href="https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/metrics.html#reporter";>MetricsReporter
+documentation</a>.</p>
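+
+<p>As an example, a Prometheus reporter could be configured in
+<code>flink-conf.yaml</code> roughly as follows. This is only a sketch; it
+assumes that the <code>flink-metrics-prometheus</code> jar is on the classpath
+(e.g. copied from <code>opt/</code> to <code>lib/</code>) and that the chosen
+port range is available on all Job- and TaskManager hosts.</p>
+
+<div class="highlight"><pre><code class="language-yaml"># flink-conf.yaml (sketch): expose all metrics via the Prometheus reporter
+metrics.reporters: prom
+metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
+# Port (or port range) the reporter listens on; example value.
+metrics.reporter.prom.port: 9249-9260
+</code></pre></div>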
+
+<p>In the remaining part of this blog post, we will go over some of the most
+important metrics to monitor your Apache Flink application.</p>
+
+<h2 id="monitoring-general-health">Monitoring General Health</h2>
+
+<p>The first thing you want to monitor is whether your job is actually in a 
<em>RUNNING</em>
+state. In addition, it pays off to monitor the number of restarts and the time
+since the last restart.</p>
+
+<p>Generally speaking, successful checkpointing is a strong indicator of the
+general health of your application. For each checkpoint, checkpoint barriers
+need to flow through the whole topology of your Flink job, and events and
+barriers cannot overtake each other. Therefore, a successful checkpoint shows
+that no channel is fully congested.</p>
+
+<p><strong>Key Metrics</strong></p>
+
+<table>
+  <thead>
+    <tr>
+      <th>Metric</th>
+      <th>Scope</th>
+      <th>Description</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td><code>uptime</code></td>
+      <td>job</td>
+      <td>The time that the job has been running without interruption.</td>
+    </tr>
+    <tr>
+      <td><code>fullRestarts</code></td>
+      <td>job</td>
+      <td>The total number of full restarts since this job was submitted.</td>
+    </tr>
+    <tr>
+      <td><code>numberOfCompletedCheckpoints</code></td>
+      <td>job</td>
+      <td>The number of successfully completed checkpoints.</td>
+    </tr>
+    <tr>
+      <td><code>numberOfFailedCheckpoints</code></td>
+      <td>job</td>
+      <td>The number of failed checkpoints.</td>
+    </tr>
+  </tbody>
+</table>
+
+<p><br /></p>
+
+<p><strong>Example Dashboard Panels</strong></p>
+
+<center>
+<img src="/img/blog/2019-02-21-monitoring-best-practices/fig-1.png" 
width="800px" alt="Uptime (35 minutes), Restarting Time (3 milliseconds) and 
Number of Full Restarts (7)" />
+<br />
+<i><small>Uptime (35 minutes), Restarting Time (3 milliseconds) and Number of 
Full Restarts (7)</small></i>
+</center>
+<p><br /></p>
+
+<center>
+<img src="/img/blog/2019-02-21-monitoring-best-practices/fig-2.png" 
width="800px" alt="Completed Checkpoints (18336), Failed (14)" />
+<br />
+<i><small>Completed Checkpoints (18336), Failed (14)</small></i>
+</center>
+<p><br /></p>
+
+<p><strong>Possible Alerts</strong></p>
+
+<ul>
+  <li><code>ΔfullRestarts</code> &gt; <code>threshold</code></li>
+  <li><code>ΔnumberOfFailedCheckpoints</code> &gt; <code>threshold</code></li>
+</ul>
+
+<h2 id="monitoring-progress--throughput">Monitoring Progress &amp; 
Throughput</h2>
+
+<p>Knowing that your application is RUNNING and checkpointing is working fine 
is good,
+but it does not tell you whether the application is actually making progress 
and
+keeping up with the upstream systems.</p>
+
+<h3 id="throughput">Throughput</h3>
+
+<p>Flink provides multiple metrics to measure the throughput of our 
application.
+For each operator or task (remember: a task can contain multiple <a 
href="https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/operators/#task-chaining-and-resource-groups";>chained
+tasks</a>),
+Flink counts the number of records and bytes going in and out. Out of those
+metrics, the rate of outgoing records per operator is often the most intuitive
+and easiest to reason about.</p>
+
+<p><strong>Key Metrics</strong></p>
+
+<table>
+  <thead>
+    <tr>
+      <th>Metric</th>
+      <th>Scope</th>
+      <th>Description</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td><code>numRecordsOutPerSecond</code></td>
+      <td>task</td>
+      <td>The number of records this operator/task sends per second.</td>
+    </tr>
+    <tr>
+      <td><code>numRecordsOutPerSecond</code></td>
+      <td>operator</td>
+      <td>The number of records this operator sends per second.</td>
+    </tr>
+  </tbody>
+</table>
+
+<p><br /></p>
+
+<p><strong>Example Dashboard Panels</strong></p>
+
+<center>
+<img src="/img/blog/2019-02-21-monitoring-best-practices/fig-3.png" 
width="800px" alt="Mean Records Out per Second per Operator" />
+<br />
+<i><small>Mean Records Out per Second per Operator</small></i>
+</center>
+<p><br /></p>
+
+<p><strong>Possible Alerts</strong></p>
+
+<ul>
+  <li><code>recordsOutPerSecond</code> = <code>0</code> (for a non-Sink 
operator)</li>
+</ul>
+
+<p><em>Note</em>: Source operators always have zero incoming records. Sink 
operators
+always have zero outgoing records because the metrics only count
+Flink-internal communication. There is a <a 
href="https://issues.apache.org/jira/browse/FLINK-7286";>JIRA
+ticket</a> to change this
+behavior.</p>
+
+<h3 id="progress">Progress</h3>
+
+<p>For applications that use event time semantics, it is important that watermarks
+progress over time. A watermark of time <em>t</em> tells the framework that it
+should no longer expect to receive events with a timestamp earlier than <em>t</em>
+and that it should, in turn, trigger all operations that were scheduled for a timestamp &lt; <em>t</em>.
+For example, an event time window that ends at <em>t</em> = 30 will be closed 
and
+evaluated once the watermark passes 30.</p>
+
+<p>As a consequence, you should monitor the watermark at event time-sensitive
+operators in your application, such as process functions and windows. If the
+difference between the current processing time and the watermark, known as
+event-time skew, is unusually high, then it typically implies one of two issues.
+First, it could mean that you are simply processing old events, for example
+during catch-up after a downtime or when your job is simply not able to keep up
+and events are queuing up. Second, it could mean a single upstream sub-task has
+not sent a watermark for a long time (for example because it did not receive any
+events to base the watermark on), which also prevents the watermark in
+downstream operators from progressing. This <a href="https://issues.apache.org/jira/browse/FLINK-5017";>JIRA
+ticket</a> provides further
+information and a workaround for the latter.</p>
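+
+<p>If you want to make this event-time skew visible for one of your own
+operators, you can expose it as a user-defined gauge. The following is only a
+sketch (the class and metric names are made up) that reports the difference
+between the current processing time and the current watermark from within a
+<code>ProcessFunction</code>:</p>
+
+<div class="highlight"><pre><code class="language-java">import org.apache.flink.configuration.Configuration;
+import org.apache.flink.metrics.Gauge;
+import org.apache.flink.streaming.api.functions.ProcessFunction;
+import org.apache.flink.util.Collector;
+
+public class EventTimeLagFunction extends ProcessFunction&lt;String, String&gt; {
+
+    // Last observed difference between processing time and watermark (in ms).
+    private transient volatile long eventTimeLag;
+
+    @Override
+    public void open(Configuration parameters) {
+        // Expose the last observed lag as a gauge named "eventTimeLag".
+        getRuntimeContext()
+            .getMetricGroup()
+            .gauge("eventTimeLag", (Gauge&lt;Long&gt;) () -&gt; eventTimeLag);
+    }
+
+    @Override
+    public void processElement(String value, Context ctx, Collector&lt;String&gt; out) {
+        // Update the lag whenever an event is processed.
+        eventTimeLag = ctx.timerService().currentProcessingTime()
+            - ctx.timerService().currentWatermark();
+        out.collect(value);
+    }
+}
+</code></pre></div>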
+
+<p><strong>Key Metrics</strong></p>
+
+<table>
+  <thead>
+    <tr>
+      <th>Metric</th>
+      <th>Scope</th>
+      <th>Description</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td><code>currentOutputWatermark</code></td>
+      <td>operator</td>
+      <td>The last watermark this operator has emitted.</td>
+    </tr>
+  </tbody>
+</table>
+
+<p><br /></p>
+
+<p><strong>Example Dashboard Panels</strong></p>
+
+<center>
+<img src="/img/blog/2019-02-21-monitoring-best-practices/fig-4.png" 
width="800px" alt="Event Time Lag per Subtask of a single operator in the 
topology. In this case, the watermark is lagging a few seconds behind for each 
subtask." />
+<br />
+<i><small>Event Time Lag per Subtask of a single operator in the topology. In 
this case, the watermark is lagging a few seconds behind for each 
subtask.</small></i>
+</center>
+<p><br /></p>
+
+<p><strong>Possible Alerts</strong></p>
+
+<ul>
+  <li><code>currentProcessingTime - currentOutputWatermark</code> &gt; 
<code>threshold</code></li>
+</ul>
+
+<h3 id="keeping-up">“Keeping Up”</h3>
+
+<p>When consuming from a message queue, there is often a direct way to monitor 
if
+your application is keeping up. By using connector-specific metrics you can
+monitor how far behind the head of the message queue your current consumer 
group
+is. Flink forwards the underlying metrics from most sources.</p>
+
+<p><strong>Key Metrics</strong></p>
+
+<table>
+  <thead>
+    <tr>
+      <th>Metric</th>
+      <th>Scope</th>
+      <th>Description</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td><code>records-lag-max</code></td>
+      <td>user</td>
+      <td>applies to <code>FlinkKafkaConsumer</code>. The maximum lag in terms 
of the number of records for any partition in this window. An increasing value 
over time is your best indication that the consumer group is not keeping up 
with the producers.</td>
+    </tr>
+    <tr>
+      <td><code>millisBehindLatest</code></td>
+      <td>user</td>
+      <td>applies to <code>FlinkKinesisConsumer</code>. The number of 
milliseconds a consumer is behind the head of the stream. For any consumer and 
Kinesis shard, this indicates how far it is behind the current time.</td>
+    </tr>
+  </tbody>
+</table>
+
+<p><br /></p>
+
+<p><strong>Possible Alerts</strong></p>
+
+<ul>
+  <li><code>records-lag-max</code>  &gt; <code>threshold</code></li>
+  <li><code>millisBehindLatest</code> &gt; <code>threshold</code></li>
+</ul>
+
+<h2 id="monitoring-latency">Monitoring Latency</h2>
+
+<p>Generally speaking, latency is the delay between the creation of an event 
and
+the time at which results based on this event become visible. Once the event is
+created it is usually stored in a persistent message queue, before it is
+processed by Apache Flink, which then writes the results to a database or calls
+a downstream system. In such a pipeline, latency can be introduced at each 
stage
+and for various reasons including the following:</p>
+
+<ol>
+  <li>It might take a varying amount of time until events are persisted in the
+message queue.</li>
+  <li>During periods of high load or during recovery, events might spend some 
time
+in the message queue until they are processed by Flink (see previous 
section).</li>
+  <li>Some operators in a streaming topology need to buffer events for some 
time
+(e.g. in a time window) for functional reasons.</li>
+  <li>Each computation in your Flink topology (framework or user code), as 
well as
+each network shuffle, takes time and adds to latency.</li>
+  <li>If the application emits through a transactional sink, the sink will only
+commit and publish transactions upon successful checkpoints of Flink, adding
+latency usually up to the checkpointing interval for each record.</li>
+</ol>
+
+<p>In practice, it has proven invaluable to add timestamps to your events at
+multiple stages (at least at creation, persistence, ingestion by Flink,
+publication by Flink, possibly sampling those to save bandwidth). The
+differences between these timestamps can be exposed as a user-defined metric in
+your Flink topology to derive the latency distribution of each stage.</p>
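+
+<p>As a sketch of this approach, assuming each event carries its creation
+timestamp as the first field of a <code>Tuple2</code> and using the Dropwizard
+histogram wrapper from <code>flink-metrics-dropwizard</code>, the latency of a
+single stage could be exposed like this:</p>
+
+<div class="highlight"><pre><code class="language-java">import com.codahale.metrics.SlidingWindowReservoir;
+import org.apache.flink.api.common.functions.RichMapFunction;
+import org.apache.flink.api.java.tuple.Tuple2;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.dropwizard.metrics.DropwizardHistogramWrapper;
+import org.apache.flink.metrics.Histogram;
+
+// Events are modeled as (creationTimestampMillis, payload) tuples for this sketch.
+public class IngestionLatencyFunction
+        extends RichMapFunction&lt;Tuple2&lt;Long, String&gt;, Tuple2&lt;Long, String&gt;&gt; {
+
+    private transient Histogram ingestionLatency;
+
+    @Override
+    public void open(Configuration parameters) {
+        // Wrap a Dropwizard histogram so it can be registered as a Flink metric.
+        ingestionLatency = getRuntimeContext()
+            .getMetricGroup()
+            .histogram("ingestionLatencyMs", new DropwizardHistogramWrapper(
+                new com.codahale.metrics.Histogram(new SlidingWindowReservoir(500))));
+    }
+
+    @Override
+    public Tuple2&lt;Long, String&gt; map(Tuple2&lt;Long, String&gt; event) {
+        // Difference between "now" and the creation timestamp carried by the event.
+        ingestionLatency.update(System.currentTimeMillis() - event.f0);
+        return event;
+    }
+}
+</code></pre></div>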
+
+<p>In the rest of this section, we will only consider latency that is introduced
+inside the Flink topology and cannot be attributed to transactional sinks (5.) or
+events being buffered for functional reasons (3.).</p>
+
+<p>To this end, Flink comes with a feature called <a 
href="https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/metrics.html#latency-tracking";>Latency
+Tracking</a>.
+When enabled, Flink will insert so-called latency markers periodically at all
+sources. For each sub-task, a latency distribution from each source to this
+operator will be reported. The granularity of these histograms can be further
+controlled by setting <em>metrics.latency.granularity</em> as desired.</p>
+
+<p>Due to the potentially high number of histograms (in particular for
+<em>metrics.latency.granularity: subtask</em>), enabling latency tracking can
+significantly impact the performance of the cluster. It is recommended to only
+enable it to locate sources of latency during debugging.</p>
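+
+<p>If you do want to enable it temporarily, for example while debugging, one
+minimal way to do so is via the <code>ExecutionConfig</code>; the interval
+below is just an example value:</p>
+
+<div class="highlight"><pre><code class="language-java">import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
+
+public class EnableLatencyTracking {
+    public static void main(String[] args) {
+        StreamExecutionEnvironment env =
+            StreamExecutionEnvironment.getExecutionEnvironment();
+
+        // Emit a latency marker from every source every 30 seconds (example value).
+        env.getConfig().setLatencyTrackingInterval(30_000L);
+
+        // ... define and execute your topology as usual ...
+    }
+}
+</code></pre></div>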
+
+<p><strong>Metrics</strong></p>
+
+<table>
+  <thead>
+    <tr>
+      <th>Metric</th>
+      <th>Scope</th>
+      <th>Description</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td><code>latency</code></td>
+      <td>operator</td>
+      <td>The latency from the source operator to this operator.</td>
+    </tr>
+    <tr>
+      <td><code>restartingTime</code></td>
+      <td>job</td>
+      <td>The time it took to restart the job, or how long the current restart 
has been in progress.</td>
+    </tr>
+  </tbody>
+</table>
+
+<p><br /></p>
+
+<p><strong>Example Dashboard Panel</strong></p>
+
+<center>
+<img src="/img/blog/2019-02-21-monitoring-best-practices/fig-5.png" 
width="800px" alt="Latency distribution between a source and a single sink 
subtask." />
+<br />
+<i><small>Latency distribution between a source and a single sink 
subtask.</small></i>
+</center>
+<p><br /></p>
+
+<h2 id="jvm-metrics">JVM Metrics</h2>
+
+<p>So far we have only looked at Flink-specific metrics. As long as latency 
&amp;
+throughput of your application are in line with your expectations and it is
+checkpointing consistently, this is probably everything you need. On the other
+hand, if your job’s performance is starting to degrade, among the first metrics you
+want to look at are memory consumption and CPU load of your Task- &amp; 
JobManager
+JVMs.</p>
+
+<h3 id="memory">Memory</h3>
+
+<p>Flink reports the usage of Heap, NonHeap, Direct &amp; Mapped memory for 
JobManagers
+and TaskManagers.</p>
+
+<ul>
+  <li>
+    <p>Heap memory - as with most JVM applications - is the most volatile and 
important
+metric to watch. This is especially true when using Flink’s filesystem
+statebackend as it keeps all state objects on the JVM Heap. If the size of
+long-living objects on the Heap increases significantly, this can usually be
+attributed to the size of your application state (check the 
+<a 
href="https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/metrics.html#checkpointing";>checkpointing
 metrics</a>
+for an estimated size of the on-heap state). The possible reasons for growing
+state are very application-specific. Typically, an increasing number of keys, a
+large event-time skew between different input streams or simply missing state
+cleanup may cause growing state.</p>
+  </li>
+  <li>
+    <p>NonHeap memory is dominated by the metaspace, the size of which is 
unlimited by default
+and holds class metadata as well as static content. There is a 
+<a href="https://issues.apache.org/jira/browse/FLINK-10317";>JIRA Ticket</a> to 
limit the size
+to 250 megabytes by default.</p>
+  </li>
+  <li>
+    <p>The biggest driver of Direct memory is by far the
+number of Flink’s network buffers, which can be
+<a 
href="https://ci.apache.org/projects/flink/flink-docs-release-1.7/ops/config.html#configuring-the-network-buffers";>configured</a>.</p>
+  </li>
+  <li>
+    <p>Mapped memory is usually close to zero as Flink does not use 
memory-mapped files.</p>
+  </li>
+</ul>
+
+<p>In a containerized environment you should additionally monitor the overall
+memory consumption of the Job- and TaskManager containers to ensure they don’t
+exceed their resource limits. This is particularly important when using the
+RocksDB statebackend, since RocksDB allocates a considerable amount of
+memory off heap. To understand how much memory RocksDB might use, you can
+check out <a
href="https://www.da-platform.com/blog/manage-rocksdb-memory-size-apache-flink";>this
 blog
+post</a>
+by Stefan Richter.</p>
+
+<p><strong>Key Metrics</strong></p>
+
+<table>
+  <thead>
+    <tr>
+      <th>Metric</th>
+      <th>Scope</th>
+      <th>Description</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td><code>Status.JVM.Memory.NonHeap.Committed</code></td>
+      <td>job-/taskmanager</td>
+      <td>The amount of non-heap memory guaranteed to be available to the JVM 
(in bytes).</td>
+    </tr>
+    <tr>
+      <td><code>Status.JVM.Memory.Heap.Used</code></td>
+      <td>job-/taskmanager</td>
+      <td>The amount of heap memory currently used (in bytes).</td>
+    </tr>
+    <tr>
+      <td><code>Status.JVM.Memory.Heap.Committed</code></td>
+      <td>job-/taskmanager</td>
+      <td>The amount of heap memory guaranteed to be available to the JVM (in 
bytes).</td>
+    </tr>
+    <tr>
+      <td><code>Status.JVM.Memory.Direct.MemoryUsed</code></td>
+      <td>job-/taskmanager</td>
+      <td>The amount of memory used by the JVM for the direct buffer pool (in 
bytes).</td>
+    </tr>
+    <tr>
+      <td><code>Status.JVM.Memory.Mapped.MemoryUsed</code></td>
+      <td>job-/taskmanager</td>
+      <td>The amount of memory used by the JVM for the mapped buffer pool (in 
bytes).</td>
+    </tr>
+    <tr>
+      <td><code>Status.JVM.GarbageCollector.G1 Young 
Generation.Time</code></td>
+      <td>job-/taskmanager</td>
+      <td>The total time spent performing G1 Young Generation garbage 
collection.</td>
+    </tr>
+    <tr>
+      <td><code>Status.JVM.GarbageCollector.G1 Old Generation.Time</code></td>
+      <td>job-/taskmanager</td>
+      <td>The total time spent performing G1 Old Generation garbage 
collection.</td>
+    </tr>
+  </tbody>
+</table>
+
+<p><br /></p>
+
+<p><strong>Example Dashboard Panel</strong></p>
+
+<center>
+<img src="/img/blog/2019-02-21-monitoring-best-practices/fig-6.png" 
width="800px" alt="TaskManager memory consumption and garbage collection 
times." />
+<br />
+<i><small>TaskManager memory consumption and garbage collection 
times.</small></i>
+</center>
+<p><br /></p>
+
+<center>
+<img src="/img/blog/2019-02-21-monitoring-best-practices/fig-7.png" 
width="800px" alt="JobManager memory consumption and garbage collection times." 
/>
+<br />
+<i><small>JobManager memory consumption and garbage collection 
times.</small></i>
+</center>
+<p><br /></p>
+
+<p><strong>Possible Alerts</strong></p>
+
+<ul>
+  <li><code>container memory limit</code> &lt; <code>container memory + safety 
margin</code></li>
+</ul>
+
+<h3 id="cpu">CPU</h3>
+
+<p>Besides memory, you should also monitor the CPU load of the TaskManagers. If
+your TaskManagers are constantly under very high load, you might be able to
+improve the overall performance by decreasing the number of task slots per
+TaskManager (in case of a Standalone setup), by providing more resources to the
+TaskManager (in case of a containerized setup), or by providing more
+TaskManagers. In general, a system already running under very high load during
+normal operations will need much more time to catch up after recovering from a
+downtime. During this time you will see a much higher latency (event-time 
skew) than
+usual.</p>
+
+<p>A sudden increase in the CPU load might also be attributed to high garbage
+collection pressure, which should be visible in the JVM memory metrics as 
well.</p>
+
+<p>If one or a few TaskManagers are constantly under very high load, this can 
slow
+down the whole topology due to long checkpoint alignment times and increasing
+event-time skew. A common reason is skew in the partition key of the data, 
which
+can be mitigated by pre-aggregating before the shuffle or keying on a more
+evenly distributed key.</p>
+
+<p><strong>Key Metrics</strong></p>
+
+<table>
+  <thead>
+    <tr>
+      <th>Metric</th>
+      <th>Scope</th>
+      <th>Description</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td><code>Status.JVM.CPU.Load</code></td>
+      <td>job-/taskmanager</td>
+      <td>The recent CPU usage of the JVM.</td>
+    </tr>
+  </tbody>
+</table>
+
+<p><br /></p>
+
+<p><strong>Example Dashboard Panel</strong></p>
+
+<center>
+<img src="/img/blog/2019-02-21-monitoring-best-practices/fig-8.png" 
width="800px" alt="TaskManager &amp; JobManager CPU load." />
+<br />
+<i><small>TaskManager &amp; JobManager CPU load.</small></i>
+</center>
+<p><br /></p>
+
+<h2 id="system-resources">System Resources</h2>
+
+<p>In addition to the JVM metrics above, it is also possible to use Flink’s 
metrics
+system to gather insights about system resources, i.e. memory, CPU &amp;
+network-related metrics for the whole machine as opposed to the Flink processes
+alone. System resource monitoring is disabled by default and requires 
additional
+dependencies on the classpath. Please check out the 
+<a 
href="https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/metrics.html#system-resources";>Flink
 system resource metrics documentation</a> for
+additional guidance and details. System resource monitoring in Flink can be 
very
+helpful in setups without existing host monitoring capabilities.</p>
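+
+<p>For illustration, enabling it boils down to a flag in
+<code>flink-conf.yaml</code> plus the optional dependencies described in the
+documentation; the probing interval below is just an example value.</p>
+
+<div class="highlight"><pre><code class="language-yaml"># flink-conf.yaml (sketch): enable system resource metrics
+# (requires the optional dependencies listed in the documentation on the classpath)
+metrics.system-resource: true
+# How often system resources are probed, in milliseconds (example value).
+metrics.system-resource-probing-interval: 5000
+</code></pre></div>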
+
+<h2 id="conclusion">Conclusion</h2>
+
+<p>This post tries to shed some light on Flink’s metrics and monitoring 
system. You
+can utilise it as a starting point when you first think about how to
+successfully monitor your Flink application. I highly recommend that you start
+monitoring your Flink application early on in the development phase. This way
+you will be able to improve your dashboards and alerts over time and, more
+importantly, observe the performance impact of the changes to your application
+throughout the development phase. By doing so, you can ask the right questions
+about the runtime behaviour of your application, and learn much more about
+Flink’s internals early on.</p>
+
+<p>Last but not least, this post only scratches the surface of the overall 
metrics
+and monitoring capabilities of Apache Flink. I highly recommend going over
+<a 
href="https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/metrics.html";>Flink’s
 metrics documentation</a>
+for a full reference of Flink’s metrics system.</p>
+
+      </article>
+    </div>
+
+    <div class="row">
+      <div id="disqus_thread"></div>
+      <script type="text/javascript">
+        /* * * CONFIGURATION VARIABLES: EDIT BEFORE PASTING INTO YOUR WEBPAGE 
* * */
+        var disqus_shortname = 'stratosphere-eu'; // required: replace example 
with your forum shortname
+
+        /* * * DON'T EDIT BELOW THIS LINE * * */
+        (function() {
+            var dsq = document.createElement('script'); dsq.type = 
'text/javascript'; dsq.async = true;
+            dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
+             (document.getElementsByTagName('head')[0] || 
document.getElementsByTagName('body')[0]).appendChild(dsq);
+        })();
+      </script>
+    </div>
+  </div>
+</div>
+      </div>
+    </div>
+
+    <hr />
+
+    <div class="row">
+      <div class="footer text-center col-sm-12">
+        <p>Copyright © 2014-2019 <a href="http://apache.org";>The Apache 
Software Foundation</a>. All Rights Reserved.</p>
+        <p>Apache Flink, Flink®, Apache®, the squirrel logo, and the Apache 
feather logo are either registered trademarks or trademarks of The Apache 
Software Foundation.</p>
+        <p><a href="/privacy-policy.html">Privacy Policy</a> &middot; <a 
href="/blog/feed.xml">RSS feed</a></p>
+      </div>
+    </div>
+    </div><!-- /.container -->
+
+    <!-- Include all compiled plugins (below), or include individual files as 
needed -->
+    <script 
src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.4/js/bootstrap.min.js";></script>
+    <script 
src="https://cdnjs.cloudflare.com/ajax/libs/jquery.matchHeight/0.7.0/jquery.matchHeight-min.js";></script>
+    <script src="/js/codetabs.js"></script>
+    <script src="/js/stickysidebar.js"></script>
+
+    <!-- Google Analytics -->
+    <script>
+      
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
+      (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new 
Date();a=s.createElement(o),
+      
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
+      
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
+
+      ga('create', 'UA-52545728-1', 'auto');
+      ga('send', 'pageview');
+    </script>
+  </body>
+</html>
diff --git a/content/zh/index.html b/content/zh/index.html
index af36744..03d69a2 100644
--- a/content/zh/index.html
+++ b/content/zh/index.html
@@ -437,6 +437,9 @@
 
   <dl>
       
+        <dt> <a 
href="/news/2019/02/25/monitoring-best-practices.html">Monitoring Apache Flink 
Applications 101</a></dt>
+        <dd>The monitoring of business-critical applications is a crucial 
aspect of a production deployment. It ensures that any degradation or downtime 
is immediately identified and can be resolved as quickly as possible. In this 
post, we discuss the most important metrics that indicate healthy Flink 
applications.</dd>
+      
         <dt> <a href="/news/2019/02/25/release-1.6.4.html">Apache Flink 1.6.4 
Released</a></dt>
         <dd><p>The Apache Flink community released the fourth bugfix version 
of the Apache Flink 1.6 series.</p>
 
@@ -454,11 +457,6 @@
         <dd><p>The Apache Flink community released the sixth and last bugfix 
version of the Apache Flink 1.5 series.</p>
 
 </dd>
-      
-        <dt> <a href="/news/2018/12/22/release-1.6.3.html">Apache Flink 1.6.3 
Released</a></dt>
-        <dd><p>The Apache Flink community released the third bugfix version of 
the Apache Flink 1.6 series.</p>
-
-</dd>
     
   </dl>
 
