This is an automated email from the ASF dual-hosted git repository.

wusheng pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/skywalking.git
The following commit(s) were added to refs/heads/master by this push:
     new 1e9766a96c SWIP-9, support flink monitoring (#13167)
1e9766a96c is described below

commit 1e9766a96c8d37b3a2e51d47d7bcfec7d9935d93
Author: peachisai <2581009...@qq.com>
AuthorDate: Mon Apr 14 15:37:13 2025 +0800

    SWIP-9, support flink monitoring (#13167)
---
 docs/en/changes/changes.md |  1 +
 docs/en/swip/SWIP-9.md     | 94 ++++++++++++++++++++++++++++++++++++++++++++++
 docs/en/swip/readme.md     |  3 +-
 3 files changed, 97 insertions(+), 1 deletion(-)

diff --git a/docs/en/changes/changes.md b/docs/en/changes/changes.md
index 35aa82b436..e94beef5bd 100644
--- a/docs/en/changes/changes.md
+++ b/docs/en/changes/changes.md
@@ -14,6 +14,7 @@
 #### Documentation
 * BanyanDB: Add `Data Lifecycle Stages(Hot/Warm/Cold)` documentation.
+* Add `SWIP-9 Support flink monitoring`.
 All issues and pull requests are [here](https://github.com/apache/skywalking/milestone/230?closed=1)
diff --git a/docs/en/swip/SWIP-9.md b/docs/en/swip/SWIP-9.md
new file mode 100644
index 0000000000..53766e17a1
--- /dev/null
+++ b/docs/en/swip/SWIP-9.md
@@ -0,0 +1,94 @@
+# Support Flink Monitoring
+## Motivation
+Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Now that Skywalking can monitor OpenTelemetry metrics, I want to add Flink monitoring via the OpenTelemetry Collector, which fetches metrics from its own Http Endpoint
+to expose metrics data for Prometheus.
+
+## Architecture Graph
+There is no significant architecture-level change.
+
+## Proposed Changes
+Flink expose its metrics via HTTP endpoint to OpenTelemetry collector, using SkyWalking openTelemetry receiver to receive these metrics。
+Provide cluster, instance, and endpoint dimensions monitoring.
+
+### Flink Cluster Supported Metrics
+
+| Monitoring Panel | Unit | Metric Name | Description | Data Source |
+|-------------------------------|-------|-------------------------------------------------------|---------------------------------------------------------------------------------------------------|------------------|
+| Running Jobs | Count | meter_flink_jobManager_running_job_number | The number of running jobs. | Flink JobManager |
+| TaskManagers | Count | meter_flink_jobManager_taskManagers_registered_number | The number of taskManagers. | Flink JobManager |
+| JVM CPU Load | % | meter_flink_jobManager_jvm_cpu_load | The number of the jobManager JVM CPU load. | Flink JobManager |
+| JVM thread count | Count | meter_flink_jobManager_jvm_thread_count | The total number of the jobManager JVM threads. | Flink JobManager |
+| JVM Memory Heap Used | MB | meter_flink_jobManager_jvm_memory_heap_used | The amount of the jobManager JVM memory heap used. | Flink JobManager |
+| JVM Memory NonHeap Used | MB | meter_flink_jobManager_jvm_memory_NonHeap_used | The amount of the jobManager JVM nonHeap memory used. | Flink JobManager |
+| Task Managers Slots Total | Count | meter_flink_jobManager_taskManagers_slots_total | The number of total slots. | Flink JobManager |
+| Task Managers Slots Available | Count | meter_flink_jobManager_taskManagers_slots_available | The number of available slots. | Flink JobManager |
+| JVM CPU Time | ms | meter_flink_jobManager_jvm_cpu_time | The jobManager CPU time used by the JVM. | Flink JobManager |
+| JVM Memory Heap Available | MB | meter_flink_jobManager_jvm_memory_heap_available | The amount of the jobManager available JVM memory Heap. | Flink JobManager |
+| JVM Memory NoHeap Available | MB | meter_flink_jobManager_jvm_memory_nonHeap_available | The amount of the jobManager available JVM memory noHeap. | Flink JobManager |
+| JVM Memory Metaspace Used | MB | meter_flink_jobManager_jvm_memory_metaspace_used | The amount of the jobManager Used JVM metaspace memory. | Flink JobManager |
+| JVM Metaspace Available | MB | meter_flink_jobManager_jvm_memory_metaspace_available | The amount of the jobManager available JVM Metaspace Memory. | Flink JobManager |
+| JVM G1 Young Generation Count | Count | meter_flink_jobManager_jvm_g1_young_generation_count | The number of the jobManager JVM g1 young generation count. | Flink JobManager |
+| JVM G1 Old Generation Count | Count | meter_flink_jobManager_jvm_g1_old_generation_count | The number of the jobManager JVM g1 old generation count. | Flink JobManager |
+| JVM G1 Young Generation Time | Count | meter_flink_jobManager_jvm_g1_young_generation_time | The time of the jobManager JVM g1 young generation. | Flink JobManager |
+| JVM G1 Old Generation Time | ms | meter_flink_jobManager_jvm_g1_old_generation_time | The time of JVM g1 old generation. | Flink JobManager |
+| JVM G1 Old Generation Count | Count | meter_flink_jobManager_jvm_all_garbageCollector_count | The number of the jobManager JVM all garbageCollector count. | Flink JobManager |
+| JVM All GarbageCollector Time | ms | meter_flink_jobManager_jvm_all_garbageCollector_time | The time spent performing garbage collection for the given (or all) collector for the jobManager. | Flink JobManager |
+
+
+### Flink taskManager Supported Metrics
+
+| Monitoring Panel | Unit | Metric Name | Description | Data Source |
+|----------------------------------|---------|----------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------|
+| JVM CPU Load | % | meter_flink_taskManager_jvm_cpu_load | The number of the JVM CPU load. | Flink TaskManager |
+| JVM Thread Count | Count | meter_flink_taskManager_jvm_thread_count | The total number of JVM threads. | Flink TaskManager |
+| JVM Memory Heap Used | MB | meter_flink_taskManager_jvm_memory_heap_used | The amount of JVM memory heap used. | Flink TaskManager |
+| JVM Memory NonHeap Used | MB | meter_flink_taskManager_jvm_memory_nonHeap_used | The amount of JVM nonHeap memory used. | Flink TaskManager |
+| JVM CPU Time | ms | meter_flink_taskManager_jvm_cpu_time | The CPU time used by the JVM. | Flink TaskManager |
+| JVM Memory Heap Available | MB | meter_flink_taskManager_jvm_memory_heap_available | The amount of available JVM memory Heap. | Flink TaskManager |
+| JVM Memory NonHeap Available | MB | meter_flink_taskManager_jvm_memory_nonHeap_available | The amount of available JVM memory nonHeap. | Flink TaskManager |
+| JVM Memory Metaspace Used | MB | meter_flink_taskManager_jvm_memory_metaspace_used | The amount of Used JVM metaspace memory. | Flink TaskManager |
+| JVM Metaspace Available | MB | meter_flink_taskManager_jvm_memory_metaspace_available | The amount of Available JVM Metaspace Memory. | Flink TaskManager |
+| NumRecordsIn | Count | meter_flink_taskManager_numRecordsIn | The total number of records this task has received. | Flink TaskManager |
+| NumRecordsOut | Count | meter_flink_taskManager_numRecordsOut | The total number of records this task has emitted. | Flink TaskManager |
+| NumBytesInPerSecond | Bytes/s | meter_flink_taskManager_numBytesInPerSecond | The number of bytes received per second. | Flink TaskManager |
+| NumBytesOutPerSecond | Bytes/s | meter_flink_taskManager_numBytesOutPerSecond | The number of bytes this task emits per second. | Flink TaskManager |
+| Netty UsedMemory | MB | meter_flink_taskManager_netty_usedMemory | The amount of used netty memory. | Flink TaskManager |
+| Netty AvailableMemory | MB | meter_flink_taskManager_netty_availableMemory | The amount of available netty memory. | Flink TaskManager |
+| IsBackPressured | Count | meter_flink_taskManager_isBackPressured | Whether the task is back-pressured. | Flink TaskManager |
+| InPoolUsage | % | meter_flink_taskManager_inPoolUsage | An estimate of the input buffers usage. (ignores LocalInputChannels). | Flink TaskManager |
+| OutPoolUsage | % | meter_flink_taskManager_outPoolUsage | An estimate of the output buffers usage. The pool usage can be > 100% if overdraft buffers are being used. | Flink TaskManager |
+| SoftBackPressuredTimeMsPerSecond | ms | meter_flink_taskManager_softBackPressuredTimeMsPerSecond | The time this task is softly back pressured per second.Softly back pressured task will be still responsive and capable of for example triggering unaligned checkpoints. | Flink TaskManager |
+| HardBackPressuredTimeMsPerSecond | ms | meter_flink_taskManager_hardBackPressuredTimeMsPerSecond | The time this task is back pressured in a hard way per second.During hard back pressured task is completely blocked and unresponsive preventing for example unaligned checkpoints from triggering. | Flink TaskManager |
+| IdleTimeMsPerSecond | ms | meter_flink_taskManager_idleTimeMsPerSecond | The time this task is idle (has no data to process) per second. Idle time excludes back pressured time, so if the task is back pressured it is not idle. | Flink TaskManager |
+| BusyTimeMsPerSecond | ms | meter_flink_taskManager_busyTimeMsPerSecond | The time this task is busy (neither idle nor back pressured) per second. Can be NaN, if the value could not be calculated. | Flink TaskManager |
+
+
+### Flink Job Supported Metrics
+
+| Monitoring Panel | Unit | Metric Name | Description | Data Source |
+|-------------------------|---------|-----------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------|
+| Job RunningTime | min | meter_flink_job_runningTime | The job running time. | Flink JobManager |
+| Job Restart Number | Count | meter_flink_job_restart_number | The number of job restart. | Flink JobManager |
+| Job RestartingTime | min | meter_flink_job_restartingTime | The job restarting Time. | Flink JobManager |
+| Job CancellingTime | min | meter_flink_job_cancellingTime | The job cancelling time. | Flink JobManager |
+| Checkpoints Total | Count | meter_flink_job_checkpoints_total | The total number of checkpoints. | Flink JobManager |
+| Checkpoints Failed | Count | meter_flink_job_checkpoints_failed | The number of failed checkpoints. | Flink JobManager |
+| Checkpoints Completed | Count | meter_flink_job_checkpoints_completed | The number of completed checkpoints. | Flink JobManager |
+| Checkpoints InProgress | Count | meter_flink_job_checkpoints_inProgress | The number of inProgress checkpoints. | Flink JobManager |
+| CurrentEmitEventTimeLag | ms | meter_flink_job_currentEmitEventTimeLag | The latency between a data record's event time and its emission time from the source. | Flink TaskManager |
+| NumRecordsIn | Count | meter_flink_job_numRecordsIn | The total number of records this operator/task has received. | Flink TaskManager |
+| NumRecordsOut | Count | meter_flink_job_numRecordsOut | The total number of records this operator/task has emitted. | Flink TaskManager |
+| NumBytesInPerSecond | Bytes/s | meter_flink_job_numBytesInPerSecond | The number of bytes this task received per second. | Flink TaskManager |
+| NumBytesOutPerSecond | Bytes/s | meter_flink_job_numBytesOutPerSecond | The number of bytes this task emits per second. | Flink TaskManager |
+| LastCheckpointSize | Bytes | meter_flink_job_lastCheckpointSize | The checkPointed size of the last checkpoint (in bytes), this metric could be different from lastCheckpointFullSize if incremental checkpoint or changelog is enabled. | Flink JobManager |
+| LastCheckpointDuration | ms | meter_flink_job_lastCheckpointDuration | The time it took to complete the last checkpoint. | Flink JobManager |
+
+## Imported Dependencies libs and their licenses.
+No new dependency.
+
+## Compatibility
+no breaking changes.
+
+## General usage docs
+
+This feature is out of the box.
diff --git a/docs/en/swip/readme.md b/docs/en/swip/readme.md
index 72fdfbaa85..0cf9f8cc43 100644
--- a/docs/en/swip/readme.md
+++ b/docs/en/swip/readme.md
@@ -68,10 +68,11 @@ All accepted and proposed SWIPs can be found in [here](https://github.com/apache
 ## Known SWIPs
-Next SWIP Number: 9
+Next SWIP Number: 10
 ### Accepted SWIPs
+- [SWIP-9 Support Flink Monitoring](SWIP-9.md)
 - [SWIP-8 Support Kong Monitoring](SWIP-8.md)
 - [SWIP-6 Support ActiveMQ Monitoring](SWIP-6.md)
 - [SWIP-5 Support ClickHouse Monitoring](SWIP-5.md)
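
Editorial note: the "Proposed Changes" section of SWIP-9 in the diff above says Flink exposes its metrics over an HTTP endpoint in a form suitable for Prometheus, and that the OpenTelemetry Collector forwards them to SkyWalking. As a rough illustration of what travels over that endpoint, the sketch below parses a few lines of the Prometheus text exposition format and converts a byte gauge to MB, the unit used in the panels. It is not part of the commit. The Flink-side metric names in the demo (`flink_jobmanager_numRunningJobs`, `flink_jobmanager_Status_JVM_Memory_Heap_Used`), the `host` label, and the helper names are assumptions for illustration only; the actual rules that produce the `meter_flink_*` names live in SkyWalking and are not shown in this email.

```python
import re
from typing import Dict, List, Tuple

# One sample line of the Prometheus text exposition format looks like:
#   metric_name{label="value",...} 12.5 [optional_timestamp]
SAMPLE_RE = re.compile(
    r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')

Sample = Tuple[str, Dict[str, str], float]


def parse_exposition(text: str) -> List[Sample]:
    """Parse sample lines into (name, labels, value); skip comments and blanks.

    Simplified on purpose: escape sequences inside label values are kept
    as written, and any timestamp after the value is ignored.
    """
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # "# HELP" / "# TYPE" metadata or an empty line
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a sample line this sketch understands
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


def bytes_to_mb(value: float) -> float:
    """Convert a byte count to MB (1 MB = 1024 * 1024 bytes here)."""
    return value / (1024 * 1024)


if __name__ == "__main__":
    # Hypothetical scrape output; the metric names and label are assumptions.
    demo = (
        "# TYPE flink_jobmanager_numRunningJobs gauge\n"
        'flink_jobmanager_numRunningJobs{host="jm_1"} 3.0\n'
        'flink_jobmanager_Status_JVM_Memory_Heap_Used{host="jm_1"} 104857600\n'
    )
    for name, labels, value in parse_exposition(demo):
        print(name, labels, value)
```

In a real deployment the Collector does this parsing itself; the sketch is only meant to make the data flow concrete.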