This is an automated email from the ASF dual-hosted git repository. fhueske pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/flink-web.git
commit 531512023b182312e42a3fe87c25d4ea6044f1da Author: Maximilian Bode <maximilian.b...@tngtech.com> AuthorDate: Thu Mar 7 17:49:23 2019 +0100 Add blog post 'Flink and Prometheus' This closes #184. --- _posts/2019-03-11-prometheus-monitoring.md | 128 +++++++++++++++++++++ .../prometheus.png | Bin 0 -> 56449 bytes .../prometheusalerts.png | Bin 0 -> 80256 bytes .../prometheusexamplejob.png | Bin 0 -> 60456 bytes 4 files changed, 128 insertions(+) diff --git a/_posts/2019-03-11-prometheus-monitoring.md b/_posts/2019-03-11-prometheus-monitoring.md new file mode 100644 index 0000000..dc47eb9 --- /dev/null +++ b/_posts/2019-03-11-prometheus-monitoring.md @@ -0,0 +1,128 @@ +--- +layout: post +title: "Flink and Prometheus: Cloud-native monitoring of streaming applications" +date: 2019-03-11T12:00:00.000Z +authors: +- max: + name: "Maximilian Bode, TNG Technology Consulting" + twitter: "mxpbode" +category: features +excerpt: This blog post describes how developers can leverage Apache Flink's built-in metrics system together with Prometheus to observe and monitor streaming applications in an effective way. +--- + +This blog post describes how developers can leverage Apache Flink's built-in [metrics system](https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/metrics.html) together with [Prometheus](https://prometheus.io/) to observe and monitor streaming applications in an effective way. This is a follow-up post from my [Flink Forward](https://flink-forward.org/) Berlin 2018 talk ([slides](https://www.slideshare.net/MaximilianBode1/monitoring-flink-with-prometheus), [video](https [...] + +## Why Prometheus? + +Prometheus is a metrics-based monitoring system that was originally created in 2012. The system is completely open-source (under the Apache License 2) with a vibrant community behind it and it has graduated from the Cloud Native Foundation last year – a sign of maturity, stability and production-readiness. As we mentioned, the system is based on metrics and it is designed to measure the overall health, behavior and performance of a service. Prometheus features a multi-dimensional data mo [...] + +* **Metrics:** Prometheus defines metrics as floats of information that change in time. These time series have millisecond precision. + +* **Labels** are the key-value pairs associated with time series that support Prometheus' flexible and powerful data model – in contrast to hierarchical data structures that one might experience with traditional metrics systems. + +* **Scrape:** Prometheus is a pull-based system and fetches ("scrapes") metrics data from specified sources that expose HTTP endpoints with a text-based format. + +* **PromQL** is Prometheus' [query language](https://prometheus.io/docs/prometheus/latest/querying/basics/). It can be used for both building dashboards and setting up alert rules that will trigger when specific conditions are met. + +When considering metrics and monitoring systems for your Flink jobs, there are many [options](https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/metrics.html). Flink offers native support for exposing data to Prometheus via the `PrometheusReporter` configuration. Setting up this integration is very easy. + +Prometheus is a great choice as usually Flink jobs are not running in isolation but in a greater context of microservices. For making metrics available to Prometheus from other parts of a larger system, there are two options: There exist [libraries for all major languages](https://prometheus.io/docs/instrumenting/clientlibs/) to instrument other applications. Additionally, there is a wide variety of [exporters](https://prometheus.io/docs/instrumenting/exporters/), which are tools that ex [...] + +## Prometheus and Flink in Action + +We have provided a [GitHub repository](https://github.com/mbode/flink-prometheus-example) that demonstrates the integration described above. To have a look, clone the repository, make sure [Docker](https://docs.docker.com/install/) is installed and run: + +``` +./gradlew composeUp +``` + +This builds a Flink job using the build tool [Gradle](https://gradle.org/) and starts up a local environment based on [Docker Compose](https://docs.docker.com/compose/) running the job in a [Flink job cluster](https://ci.apache.org/projects/flink/flink-docs-release-1.7/ops/deployment/docker.html#flink-job-cluster) (reachable at [http://localhost:8081](http://localhost:8081/)) as well as a Prometheus instance ([http://localhost:9090](http://localhost:9090/)). + +<center> +<img src="{{ site.baseurl }}/img/blog/2019-03-11-prometheus-monitoring/prometheusexamplejob.png" width="600px" alt="PrometheusExampleJob in Flink Web UI"/> +<br/> +<i><small>Job graph and custom metric for example job in Flink web interface.</small></i> +</center> +<br/> + +The `PrometheusExampleJob` has three operators: Random numbers up to 10,000 are generated, then a map counts the events and creates a histogram of the values passed through. Finally, the events are discarded without further output. The very simple code below is from the second operator. It illustrates how easy it is to add custom metrics relevant to your business logic into your Flink job. + +```java +class FlinkMetricsExposingMapFunction extends RichMapFunction<Integer, Integer> { + private transient Counter eventCounter; + + @Override + public void open(Configuration parameters) { + eventCounter = getRuntimeContext().getMetricGroup().counter("events"); + } + + @Override + public Integer map(Integer value) { + eventCounter.inc(); + return value; + } +} +``` +<center><i><small>Excerpt from <a href="https://github.com/mbode/flink-prometheus-example/blob/master/src/main/java/com/github/mbode/flink_prometheus_example/FlinkMetricsExposingMapFunction.java">FlinkMetricsExposingMapFunction.java</a> demonstrating custom Flink metric.</small></i></center> + +## Configuring Prometheus with Flink + +To start monitoring Flink with Prometheus, the following steps are necessary: + +1. Make the `PrometheusReporter` jar available to the classpath of the Flink cluster (it comes with the Flink distribution): + + cp /opt/flink/opt/flink-metrics-prometheus-1.7.2.jar /opt/flink/lib + +2. [Configure the reporter](https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/metrics.html#reporter) in Flink's _flink-conf.yaml_. All job managers and task managers will expose the metrics on the configured port. + + metrics.reporters: prom + metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter + metrics.reporter.prom.port: 9999 + +3. Prometheus needs to know where to scrape metrics. In a static scenario, you can simply [configure Prometheus](https://prometheus.io/docs/prometheus/latest/configuration/configuration/) in _prometheus.yml_ with the following: + + scrape_configs: + - job_name: 'flink' + static_configs: + - targets: ['job-cluster:9999', 'taskmanager1:9999', 'taskmanager2:9999'] + + In more dynamic scenarios we recommend using Prometheus' service discovery support for different platforms such as Kubernetes, AWS EC2 and more. + +Both custom metrics are now available in Prometheus: + +<center> +<img src="{{ site.baseurl }}/img/blog/2019-03-11-prometheus-monitoring/prometheus.png" width="600px" alt="Prometheus web UI with example metric"/> +<br/> +<i><small>Example metric in Prometheus web UI.</small></i> +</center> +<br/> + +More technical metrics from the Flink cluster (like checkpoint sizes or duration, Kafka offsets or resource consumption) are also available. If you are interested, you can check out the HTTP endpoints exposing all Prometheus metrics for the job managers and the two task managers on [http://localhost:9249](http://localhost:9249/metrics), [http://localhost:9250](http://localhost:9250/metrics) and [http://localhost:9251](http://localhost:9251/metrics), respectively. + +To test Prometheus' alerting feature, kill one of the Flink task managers via + +``` +docker kill taskmanager1 +``` + +Our Flink job can recover from this partial failure via the mechanism of [Checkpointing](https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/state/checkpointing.html). Nevertheless, after roughly one minute (as configured in the alert rule) the following alert will fire: + +<center> +<img src="{{ site.baseurl }}/img/blog/2019-03-11-prometheus-monitoring/prometheusalerts.png" width="600px" alt="Prometheus web UI with example alert"/> +<br/> +<i><small>Example alert in Prometheus web UI.</small></i> +</center> +<br/> + +In real-world situations alerts like this one can be routed through a component called [Alertmanager](https://prometheus.io/docs/alerting/alertmanager/) and be grouped into notifications to systems like email, PagerDuty or Slack. + +Go ahead and play around with the setup, and check out the [Grafana](https://grafana.com/grafana) instance reachable at [http://localhost:3000](http://localhost:3000/) (credentials _admin:flink_) for visualizing Prometheus metrics. If there are any questions or problems, feel free to [create an issue](https://github.com/mbode/flink-prometheus-example/issues). Once finished, do not forget to tear down the setup via + +``` +./gradlew composeDown +``` +<br/> + +## Conclusion + +Using Prometheus together with Flink provides an easy way for effective monitoring and alerting of your Flink jobs. Both projects have exciting and vibrant communities behind them with new developments and additions scheduled for upcoming releases. We encourage you to try the two technologies together as it has immensely improved our insights into Flink jobs running in production. diff --git a/img/blog/2019-03-11-prometheus-monitoring/prometheus.png b/img/blog/2019-03-11-prometheus-monitoring/prometheus.png new file mode 100644 index 0000000..a8af061 Binary files /dev/null and b/img/blog/2019-03-11-prometheus-monitoring/prometheus.png differ diff --git a/img/blog/2019-03-11-prometheus-monitoring/prometheusalerts.png b/img/blog/2019-03-11-prometheus-monitoring/prometheusalerts.png new file mode 100755 index 0000000..0b0ce14 Binary files /dev/null and b/img/blog/2019-03-11-prometheus-monitoring/prometheusalerts.png differ diff --git a/img/blog/2019-03-11-prometheus-monitoring/prometheusexamplejob.png b/img/blog/2019-03-11-prometheus-monitoring/prometheusexamplejob.png new file mode 100755 index 0000000..f4bc7ac Binary files /dev/null and b/img/blog/2019-03-11-prometheus-monitoring/prometheusexamplejob.png differ