Re: Prometheus metrics does not work in 1.15.0 taskmanager

Peter Schrott Wed, 04 May 2022 00:59:19 -0700

Sorry for the spam!

Just after jumping into the debug session I noticed that exceptions are
indeed thrown when fetching the metrics on port 9200:


13657 INFO  [ScalaTest-run] com.sun.net.httpserver  - HttpServer
created http 0.0.0.0/0.0.0.0:9200
13658 INFO  [ScalaTest-run] com.sun.net.httpserver  - context created: /
13658 INFO  [ScalaTest-run] com.sun.net.httpserver  - context created: /metrics
13659 INFO  [ScalaTest-run] o.a.f.m.p.PrometheusReporter  - Started
PrometheusReporter HTTP server on port 9200.
13745 DEBUG [prometheus-http-1-1] com.sun.net.httpserver  - GET /
HTTP/1.1 [200  OK] ()
14028 DEBUG [prometheus-http-1-2] com.sun.net.httpserver  - GET /
HTTP/1.1 [200  OK] ()
14998 DEBUG [prometheus-http-1-3] com.sun.net.httpserver  - GET /
HTTP/1.1 [200  OK] ()
15580 DEBUG [prometheus-http-1-4] com.sun.net.httpserver  - GET /
HTTP/1.1 [200  OK] ()
16022 DEBUG [prometheus-http-1-5] com.sun.net.httpserver  - GET /
HTTP/1.1 [200  OK] ()
16458 DEBUG [prometheus-http-1-1] com.sun.net.httpserver  - GET /
HTTP/1.1 [200  OK] ()
16885 DEBUG [prometheus-http-1-2] com.sun.net.httpserver  - GET /
HTTP/1.1 [200  OK] ()
17381 DEBUG [prometheus-http-1-3] com.sun.net.httpserver  - GET /
HTTP/1.1 [200  OK] ()
17809 DEBUG [prometheus-http-1-4] com.sun.net.httpserver  - GET /
HTTP/1.1 [200  OK] ()
18259 DEBUG [prometheus-http-1-5] com.sun.net.httpserver  - GET /
HTTP/1.1 [200  OK] ()
18695 DEBUG [prometheus-http-1-1] com.sun.net.httpserver  - GET /
HTTP/1.1 [200  OK] ()
19159 DEBUG [prometheus-http-1-2] com.sun.net.httpserver  - GET /
HTTP/1.1 [200  OK] ()
19758 DEBUG [prometheus-http-1-3] com.sun.net.httpserver  - GET /
HTTP/1.1 [200  OK] ()
20112 DEBUG [prometheus-http-1-4] com.sun.net.httpserver  - GET /
HTTP/1.1 [200  OK] ()
20544 DEBUG [prometheus-http-1-5] com.sun.net.httpserver  - GET /
HTTP/1.1 [200  OK] ()
20989 DEBUG [prometheus-http-1-1] com.sun.net.httpserver  - GET /
HTTP/1.1 [200  OK] ()
21419 DEBUG [prometheus-http-1-2] o.a.f.m.p.PrometheusReporter  -
Invalid type for Gauge
org.apache.flink.runtime.checkpoint.CheckpointStatsTracker$LatestCompletedCheckpointExternalPathGauge@3fae55e7:
java.lang.String, only number types and booleans are supported by this
reporter.
21421 DEBUG [prometheus-http-1-2] com.sun.net.httpserver  - GET /
HTTP/1.1 [200  OK] ()
21847 DEBUG [prometheus-http-1-3] o.a.f.m.p.PrometheusReporter  -
Invalid type for Gauge
org.apache.flink.connector.kafka.source.metrics.KafkaSourceReaderMetrics$$Lambda$4076/388206242@78846648:
java.lang.String, only number types and booleans are supported by this
reporter.
21851 DEBUG [prometheus-http-1-3] com.sun.net.httpserver  -
ServerImpl.Exchange (2)
java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Double
   at org.apache.flink.streaming.connectors.kafka.internals.metrics.KafkaMetricMutableWrapper.getValue(KafkaMetricMutableWrapper.java:37)
   at org.apache.flink.streaming.connectors.kafka.internals.metrics.KafkaMetricMutableWrapper.getValue(KafkaMetricMutableWrapper.java:27)
   at org.apache.flink.metrics.prometheus.AbstractPrometheusReporter$2.get(AbstractPrometheusReporter.java:262)
   at io.prometheus.client.Gauge.collect(Gauge.java:317)
   at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.findNextElement(CollectorRegistry.java:190)
   at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.nextElement(CollectorRegistry.java:223)
   at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.nextElement(CollectorRegistry.java:144)
   at io.prometheus.client.exporter.common.TextFormat.write004(TextFormat.java:22)
   at io.prometheus.client.exporter.HTTPServer$HTTPMetricHandler.handle(HTTPServer.java:60)
   at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:79)
   at sun.net.httpserver.AuthFilter.doFilter(AuthFilter.java:83)
   at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:82)
   at sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(ServerImpl.java:675)
   at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:79)
   at sun.net.httpserver.ServerImpl$Exchange.run(ServerImpl.java:647)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   at java.lang.Thread.run(Thread.java:748)
21851 TRACE [prometheus-http-1-3] com.sun.net.httpserver  - Closing
connection: java.nio.channels.SocketChannel[connected
local=/127.0.0.1:9200 remote=/127.0.0.1:50508]
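
To illustrate what the trace points at: the value behind the Kafka metric
is apparently a java.lang.Long that gets cast to java.lang.Double, and the
exception escapes the HTTP handler, after which the connection is closed.
Below is a minimal sketch of the failing pattern and of a more defensive
conversion. It is NOT the actual Flink or Kafka connector code; the class
and method names (GaugeValueSketch, toDouble, blindCast) are made up for
illustration, and mapping booleans to 1/0 is an assumption based on the
"only number types and booleans are supported" message.

```java
import java.util.OptionalDouble;

final class GaugeValueSketch {

    /** Hypothetical helper: converts a gauge value without assuming its boxed type. */
    static OptionalDouble toDouble(Object value) {
        if (value instanceof Number) {
            // Works for Long, Integer, Double, ... alike.
            return OptionalDouble.of(((Number) value).doubleValue());
        }
        if (value instanceof Boolean) {
            // Assumption: booleans are exported as 1 or 0.
            return OptionalDouble.of(((Boolean) value) ? 1.0 : 0.0);
        }
        // Strings and other types cannot be exported as a number; skip them.
        return OptionalDouble.empty();
    }

    /** Mimics the failing pattern: a blind cast of an Object to Double. */
    static double blindCast(Object value) {
        return (Double) value;
    }

    public static void main(String[] args) {
        Object metricValue = 42L; // boxed as java.lang.Long

        try {
            blindCast(metricValue);
        } catch (ClassCastException e) {
            System.out.println("blind cast failed: " + e.getMessage());
        }

        System.out.println("Long    -> " + toDouble(metricValue));
        System.out.println("Boolean -> " + toDouble(Boolean.TRUE));
        System.out.println("String  -> " + toDouble("s3://some/path"));
    }
}
```

Going through Number.doubleValue() sidesteps the question of which boxed
type the underlying metric happens to return.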


In my defence: this jul -> slf4j -> logback setup is really nasty :O
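
For anyone fighting the same setup, here is a JDK-only sketch of the jul
half in isolation. With the stock JDK logging configuration, FINEST
records are dropped at the logger unless its level is lowered, so nothing
reaches any handler (bridge or otherwise) until that is done. The
CollectingHandler below is a hypothetical stand-in for a bridge handler
such as SLF4JBridgeHandler: it only collects records and does not forward
them to SLF4J. The logger name "demo.httpserver" is made up as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

final class JulCaptureSketch {

    /** Stand-in for a bridge handler: collects records instead of forwarding them. */
    static final class CollectingHandler extends Handler {
        final List<String> lines = new ArrayList<>();

        @Override
        public void publish(LogRecord record) {
            lines.add(record.getLevel() + " " + record.getLoggerName() + " - " + record.getMessage());
        }

        @Override
        public void flush() {}

        @Override
        public void close() {}
    }

    /** Runs the action with the named jul logger set to the given level and returns what was logged. */
    static List<String> capture(String loggerName, Level level, Runnable action) {
        Logger logger = Logger.getLogger(loggerName); // keep a strong reference while capturing
        Level oldLevel = logger.getLevel();
        boolean oldUseParent = logger.getUseParentHandlers();
        CollectingHandler handler = new CollectingHandler();

        logger.setLevel(level);
        logger.setUseParentHandlers(false); // keep the console quiet for this demo
        logger.addHandler(handler);
        try {
            action.run();
        } finally {
            // Restore the previous logger state.
            logger.removeHandler(handler);
            logger.setUseParentHandlers(oldUseParent);
            logger.setLevel(oldLevel);
        }
        return handler.lines;
    }

    public static void main(String[] args) {
        String name = "demo.httpserver";
        Runnable logSomething = () -> Logger.getLogger(name).finest("GET / HTTP/1.1 [200 OK]");

        System.out.println("at INFO:   " + capture(name, Level.INFO, logSomething));
        System.out.println("at FINEST: " + capture(name, Level.FINEST, logSomething));
    }
}
```

The first capture comes back empty, the second contains the record, which
is the same effect as forgetting versus setting the jul level before
installing the bridge.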

Best, Peter



On Wed, May 4, 2022 at 9:47 AM Peter Schrott <pe...@bluerootlabs.io> wrote:

> Hi Chesnay,
>
> Thanks for that support! Just to summarize: Running the "Problem-Job"
> locally as a test in IntelliJ (as Chesnay suggested above) reproduces the
> described problem:
>
> ➜  ~ curl localhost:9200
> curl: (52) Empty reply from server
>
> Doing the same with other jobs, the metrics are available on localhost:9200.
>
> One other thing I noticed yesterday in the cluster is that
> job/task-specific metrics are available only for a very short time after
> the job is started (a few seconds). E.g.:
>
> # HELP flink_taskmanager_job_task_backPressuredTimeMsPerSecond backPressuredTimeMsPerSecond (scope: taskmanager_job_task)
>
> After all tasks are "green" in the webui, the "empty reply from server" is
> back.
>
> 1)
> I changed the prometheus config in my cluster, but as you said, it does
> not have any impact.
>
> 2)
> For the logging in a test scenario, I also had to add the following lines
> in my test class:
>
> SLF4JBridgeHandler.removeHandlersForRootLogger()
> SLF4JBridgeHandler.install()
>
> (source:
> https://www.slf4j.org/api/org/slf4j/bridge/SLF4JBridgeHandler.html)
> I also had to reset the log levels for jul in my logback.xml:
>
> <contextListener class="ch.qos.logback.classic.jul.LevelChangePropagator">
>     <resetJUL>true</resetJUL>
> </contextListener>
>
> This info is just for completeness, in case someone else stumbles upon this.
>
> I set the following loggers to level TRACE:
>
> <logger name="com.sun.net.httpserver" level="TRACE" additive="false">
>     <appender-ref ref="ASYNC_FILE" />
> </logger>
>
> <logger name="org.apache.flink.metrics.prometheus" level="TRACE" 
> additive="false">
>     <appender-ref ref="ASYNC_FILE" />
> </logger>
>
> <logger name="io.prometheus.client" level="TRACE" additive="false">
>     <appender-ref ref="ASYNC_FILE" />
> </logger>
>
> When running the job in a local test as suggested above I get the
> following log messages:
>
> 12701 INFO  [ScalaTest-run] com.sun.net.httpserver  - HttpServer created http 
> 0.0.0.0/0.0.0.0:9200
> 12703 INFO  [ScalaTest-run] com.sun.net.httpserver  - context created: /
> 12703 INFO  [ScalaTest-run] com.sun.net.httpserver  - context created: 
> /metrics
> 12704 INFO  [ScalaTest-run] o.a.f.m.p.PrometheusReporter  - Started 
> PrometheusReporter HTTP server on port 9200.
>
>
> 3)
> I have not tried to reproduce in a local cluster yet, as the issue is also
> reproducible in the test environment. But thanks for the hint - could be
> very helpful!
>
>  __
>
> From the observations it does not seem like there is a problem with the
> http server itself. I am just making assumptions: It feels like there is a
> problem with reading and providing the metrics. As the issue is
> reproducible in the local setup, I have the comfy option to debug in
> IntelliJ now - I'll spend my day on this if no other hints or ideas arise.
>
> Thanks & Best, Peter
>
> On Tue, May 3, 2022 at 4:01 PM Chesnay Schepler <ches...@apache.org>
> wrote:
>
>> > I noticed that my config of the PrometheusReporter is different here. I
>> have: `metrics.reporter.prom.class:
>> org.apache.flink.metrics.prometheus.PrometheusReporter`. I will investigate
>> if this is a problem.
>>
>> That's not a problem.
>>
>> > Which trace logs are interesting?
>>
>> The logging config I provided should highlight the relevant bits
>> (com.sun.net.httpserver).
>> At least in my local tests this is where any interesting things were
>> logged.
>> Note that this part of the code uses java.util.logging, not slf4j/log4j.
>>
>> > When running a local flink (start-cluster.sh), I do not have a certain
>> url/port to access the taskmanager, right?
>>
>> If you configure a port range it should be as simple as curl
>> localhost:<port>.
>> You can find the used port in the taskmanager logs.
>> Or just try the first N ports in the range ;)
>>
>> On 03/05/2022 14:11, Peter Schrott wrote:
>>
>> Hi Chesnay,
>>
>> Thanks for the code snippet. Which trace logs are interesting? Those of "
>> org.apache.flink.metrics.prometheus.PrometheusReporter"?
>> I could also add these logger settings in the environment where the
>> problem is present.
>>
>> Other than that, I am not sure how to reproduce this issue in a local
>> setup. In the cluster where the metrics are missing I navigate to the
>> specific taskmanager and try to access the metrics via the configured
>> prometheus port. When running a local flink (start-cluster.sh), I do not
>> have a certain url/port to access the taskmanager, right?
>>
>> I noticed that my config of the PrometheusReporter is different here. I
>> have: `metrics.reporter.prom.class:
>> org.apache.flink.metrics.prometheus.PrometheusReporter`. I will investigate
>> if this is a problem.
>>
>> Unfortunately I can not provide my job at the moment. It
>> contains business logic and it is tightly coupled with our Kafka systems. I
>> will check the option of creating a sample job to reproduce the problem.
>>
>> Best, Peter
>>
>> On Tue, May 3, 2022 at 12:48 PM Chesnay Schepler <ches...@apache.org>
>> wrote:
>>
>>> You'd help me out greatly if you could provide me with a sample job that
>>> runs into the issue.
>>>
>>> So far I wasn't able to reproduce the issue,
>>> but it should be clear that there is some issue, given 3 separate reports,
>>> although it is strange that so far it was only reported for Prometheus.
>>>
>>> If one of you is able to reproduce the issue within a Test and is
>>> feeling adventurous,
>>> then you might be able to get more information by forwarding the
>>> java.util.logging
>>> to SLF4J. Below is some code to get you started.
>>>
>>> DebuggingTest.java:
>>>
>>> class DebuggingTest {
>>>
>>>     static {
>>>         LogManager.getLogManager().getLogger("").setLevel(Level.FINEST);
>>>         SLF4JBridgeHandler.removeHandlersForRootLogger();
>>>         SLF4JBridgeHandler.install();
>>>         miniClusterExtension =
>>>                 new MiniClusterExtension(
>>>                         new MiniClusterResourceConfiguration.Builder()
>>>                                 .setConfiguration(getConfiguration())
>>>                                 .setNumberSlotsPerTaskManager(1)
>>>                                 .build());
>>>     }
>>>
>>>     @RegisterExtension
>>>     private static final MiniClusterExtension miniClusterExtension;
>>>
>>>     private static Configuration getConfiguration() {
>>>         final Configuration configuration = new Configuration();
>>>
>>>         configuration.setString(
>>>                 "metrics.reporter.prom.factory.class",
>>>                 PrometheusReporterFactory.class.getName());
>>>         configuration.setString("metrics.reporter.prom.port", "9200-9300");
>>>
>>>         return configuration;
>>>     }
>>>
>>>     @Test
>>>     void runJob() throws Exception {
>>>         <run job>
>>>     }
>>> }
>>>
>>>
>>> pom.xml:
>>>
>>> <dependency>
>>>    <groupId>org.slf4j</groupId>
>>>    <artifactId>jul-to-slf4j</artifactId>
>>>    <version>1.7.32</version>
>>> </dependency>
>>>
>>> log4j2-test.properties:
>>>
>>> rootLogger.level = off
>>> rootLogger.appenderRef.test.ref = TestLogger
>>> logger.http.name = com.sun.net.httpserver
>>> logger.http.level = trace
>>> appender.testlogger.name = TestLogger
>>> appender.testlogger.type = CONSOLE
>>> appender.testlogger.target = SYSTEM_ERR
>>> appender.testlogger.layout.type = PatternLayout
>>> appender.testlogger.layout.pattern = %-4r [%t] %-5p %c %x - %m%n
>>>
>>> On 03/05/2022 10:41, ChangZhuo Chen (陳昌倬) wrote:
>>>
>>> On Tue, May 03, 2022 at 10:32:03AM +0200, Peter Schrott wrote:
>>>
>>> Hi!
>>>
>>> I also discovered problems with the PrometheusReporter on Flink 1.15.0,
>>> coming from 1.14.4. I already consulted the mailing list:
>>> https://lists.apache.org/thread/m8ohrfkrq1tqgq7lowr9p226z3yc0fgc
>>> I have not found the underlying problem or a solution to it.
>>>
>>> Actually, after re-checking, I see the same log WARNINGS as
>>> ChangZhuo described.
>>>
>>> As I described, it seems to be an issue with my job. If no job, or an
>>> example job, runs on the taskmanager, the basic metrics work just fine.
>>> Maybe ChangZhuo can confirm this?
>>>
>>> @ChangZhuo what's your job setup? I am running a streaming SQL job, but
>>> also using the DataStream API to create the streaming environment and from
>>> that the table environment, and finally using a StatementSet to execute
>>> multiple SQL statements in one job.
>>>
>>> We are running a streaming application using the low-level API, deployed
>>> via the Kubernetes operator (FlinkDeployment).
>>>
>>>
>>>
>>>
>>>
>>
