Yes, that looks like a new bug in 1.15.
The migration to the new non-deprecated Kafka API in the KafkaMetricMutableWrapper was done incorrectly.

This should affect every job that uses the new Kafka connector.

Thank you for debugging the issue!

I will create a ticket.

As the stack trace says, the class cast exception occurs here:

I found the following metrics to be affected (might be more):
MetricName [name=version, group=app-info, description=Metric indicating version, tags={client-id=producer-3}]
-> value: "6.2.2-ccs" (String)

MetricName [name=start-time-ms, group=app-info, description=Metric indicating start-time-ms, tags={client-id=producer-3}]
-> value: 1651654724987 (Long)

MetricName [name=commit-id, group=app-info, description=Metric indicating commit-id, tags={client-id=producer-3}]
-> value: "2ceb5dc7891720b7" (String)
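The three values above show why an unconditional cast to Double cannot work: the value behind a Kafka metric may be a String or a Long as well as a Double (Kafka's non-deprecated `Metric.metricValue()` returns a plain `Object`). Below is a minimal Java sketch of the failure and of one defensive conversion. It is not the actual Flink code or fix; the class and method names (`MetricValueSketch`, `unsafeCast`, `toDoubleOrNull`) are hypothetical, and returning null for non-numeric values is only one possible fallback.

```java
/** Hypothetical helper for illustration; not part of Flink or Kafka. */
final class MetricValueSketch {

    private MetricValueSketch() {}

    /**
     * Mirrors the failure mode from the stack trace: the cast throws a
     * ClassCastException when the value is a Long or a String.
     */
    static double unsafeCast(Object metricValue) {
        return (Double) metricValue;
    }

    /**
     * Converts any numeric value (Long, Double, ...) to a Double and returns
     * null for everything else, e.g. the String-valued "version" metric.
     */
    static Double toDoubleOrNull(Object metricValue) {
        if (metricValue instanceof Number) {
            return ((Number) metricValue).doubleValue();
        }
        return null;
    }
}
```

With such a conversion, a caller could skip metrics for which the result is null instead of failing the whole scrape; whether that is the right behaviour for the wrapper is a design decision for the ticket.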

The problematic code seems to have been introduced with "Bump Kafka version to 2.8":

Is this a potential bug introduced in 1.15.0?

    Just after jumping into the debug-session I noticed that there are
    indeed exceptions thrown when fetching the metrics on port 9200:

    21419 DEBUG  [prometheus-http-1-2] o.a.f.m.p.PrometheusReporter   - Invalid type for  only number types and booleans are supported by this reporter.
    21421 DEBUG  [prometheus-http-1-2]   - GET / HTTP/1.1 [200   OK] ()
    21847 DEBUG  [prometheus-http-1-3] o.a.f.m.p.PrometheusReporter   - Invalid type for  only number types and booleans are supported by this reporter.
    21851 DEBUG  [prometheus-http-1-3]   - ServerImpl.Exchange  (2)
    java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Double
        at
        at io.prometheus.client.Gauge.collect(
        at
        at $Chain.doFilter(
        at
        at $Chain.doFilter(
        at $Exchange$LinkHandler.handle(
        at $Chain.doFilter(
        at $
        at
        at
    21851 TRACE  [prometheus-http-1-3]   - Closing connection: java.nio.channels.SocketChannel[connected local=/

        Hi Chesnay,

        Thanks for that support! Just for completeness: Running the
        "Problem-Job" locally as a test in IntelliJ (as Chesnay
        suggested above) reproduces the described problem:

        ➜  ~ curl localhost:9200
        curl: (52) Empty reply from server

        Doing the same with other jobs, metrics are available on

        One other thing I noticed yesterday in the cluster is that
        job/task specific metrics are available for a very short time
        after the job is started (for around a few seconds). E.g:

        # HELP flink_taskmanager_job_task_backPressuredTimeMsPerSecond backPressuredTimeMsPerSecond (scope: taskmanager_job_task)

        After all tasks are "green" in the web UI, the "empty reply
        from server" is back.

        I changed the Prometheus config in my cluster, but as you
        said, it does not have any impact.

        For the logging in a test scenario, I also had to add the
        following lines in my test class:


         As well as resetting log levels for jul in my logback.xml:

        <resetJUL>true</resetJUL> </contextListener>

        This info is just for completeness, in case someone else stumbles upon this.

        I set the following loggers to level TRACE:

        <logger name="" level="TRACE" additive="false">
            <appender-ref ref="ASYNC_FILE" />
        </logger>
        <logger name="org.apache.flink.metrics.prometheus" level="TRACE" additive="false">
            <appender-ref ref="ASYNC_FILE" />
        </logger>
        <logger name="io.prometheus.client" level="TRACE" additive="false">
            <appender-ref ref="ASYNC_FILE" />
        </logger>

        When running the job in a local test as suggested above I get
        the following log messages:

        12701 INFO   [ScalaTest-run]   - HttpServer created http 0.0.0.0/
        12703 INFO   [ScalaTest-run]   - context created: /
        12703 INFO   [ScalaTest-run]   - context created:
        12704 INFO   [ScalaTest-run] o.a.f.m.p.PrometheusReporter   - Started PrometheusReporter HTTP server on port 9200.

        I have not tried to reproduce in a local cluster yet, as the
        issue is also reproducible in the test environment. But thanks
        for the hint - could be very helpful!


        From the observations it does not seem like there is a problem
        with the http server itself. I am just making assumptions: It
        feels like there is a problem with reading and providing the
        metrics. As the issue is reproducible in the local setup, I have
        the comfy option to debug in IntelliJ now - I'll spend my day
        with this if no other hints or ideas arise.

            > I noticed that my config of the PrometheusReporter is
            different here. I have: `metrics.reporter.prom.class:
            org.apache.flink.metrics.prometheus.PrometheusReporter`. I
            will investigate if this is a problem.

            That's not a problem.

            > Which trace logs are interesting?

            The logging config I provided should highlight the
            relevant bits (
            At least in my local tests this is where any interesting
            things were logged.
            Note that this part of the code uses java.util.logging,
            not slf4j/log4j.
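Because that code logs through java.util.logging, its messages stay invisible until a JUL logger and handler are set to a fine enough level. As a JDK-only alternative to bridging into SLF4J, the following Java sketch raises one JUL logger to FINEST and prints its records to stderr. The class and method names (`JulTraceSketch`, `enableFinest`) are hypothetical, and the logger name is left as a parameter because the relevant name is not preserved in this thread.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper for illustration; uses only java.util.logging. */
final class JulTraceSketch {

    private JulTraceSketch() {}

    /**
     * Sets the named JUL logger to FINEST and attaches a console handler
     * that lets all levels through. The logger is returned so the caller
     * can keep a strong reference; JUL may otherwise garbage-collect it
     * and lose the configuration.
     */
    static Logger enableFinest(String loggerName) {
        Logger logger = Logger.getLogger(loggerName);
        logger.setLevel(Level.FINEST);

        Handler handler = new ConsoleHandler(); // writes to System.err
        handler.setLevel(Level.ALL);            // ConsoleHandler defaults to INFO
        logger.addHandler(handler);

        // Avoid duplicate output through the root logger's handlers.
        logger.setUseParentHandlers(false);
        return logger;
    }
}
```

Calling `enableFinest(...)` once in a static initializer of the test class, and storing the result in a static field, should be enough for a local reproduction.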

            > When running a local flink (, I do not
            have a certain url/port to access the taskmanager, right?

            If you configure a port range it should be as simple as
            curl localhost:<port>.
            You can find the used port in the taskmanager logs.
            Or just try the first N ports in the range ;)
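Trying the first N ports of the range can also be scripted. The Java sketch below sends a GET request to each port in turn and returns the first one that answers with HTTP 200. It is only an illustration: `ReporterPortProbe` and `findRespondingPort` are hypothetical names, the timeouts are arbitrary, and a port that gives an empty reply (as described in this thread) is treated the same as a closed one.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.OptionalInt;

/** Hypothetical helper for illustration; uses only the JDK. */
final class ReporterPortProbe {

    private ReporterPortProbe() {}

    /**
     * Probes http://host:port/ for every port in [firstPort, lastPort] and
     * returns the first port that responds with status 200, if any.
     */
    static OptionalInt findRespondingPort(String host, int firstPort, int lastPort) {
        for (int port = firstPort; port <= lastPort; port++) {
            try {
                HttpURLConnection connection = (HttpURLConnection)
                        URI.create("http://" + host + ":" + port + "/").toURL().openConnection();
                connection.setConnectTimeout(500);
                connection.setReadTimeout(2000);
                connection.setRequestMethod("GET");
                int status = connection.getResponseCode();
                connection.disconnect();
                if (status == 200) {
                    return OptionalInt.of(port);
                }
            } catch (IOException e) {
                // Nothing listening, timeout, or empty reply: try the next port.
            }
        }
        return OptionalInt.empty();
    }
}
```

For a configured range such as 9200-9300 one would call `findRespondingPort("localhost", 9200, 9300)`; checking the taskmanager logs for the port remains the more reliable option.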

            Hi Chesnay,

            Thanks for the code snippet. Which trace logs are
            interesting? Of
            I could also add these logger settings in the environment
            where the problem is present.

            Other than that, I am not sure how to reproduce this
            issue in a local setup. In the cluster where the metrics
            are missing, I navigate to the specific taskmanager
            and try to access the metrics via the configured
            Prometheus port. When running a local flink
            (, I do not have a certain url/port to
            access the taskmanager, right?

            I noticed that my config of the PrometheusReporter is
            different here. I have: `metrics.reporter.prom.class:
            org.apache.flink.metrics.prometheus.PrometheusReporter`.
            I will investigate if this is a problem.

            Unfortunately I cannot provide my job at the moment. It
            contains business logic and it is tightly coupled with
            our Kafka systems. I will check the option of creating a
            sample job to reproduce the problem.

                You'd help me out greatly if you could provide me
                with a sample job that runs into the issue.

                So far I wasn't able to reproduce the issue,
                but it should be clear that there is some issue, given 3
                separate reports,
                although it is strange that so far it was only
                reported for Prometheus.

                If one of you is able to reproduce the issue within a
                Test and is feeling adventurous,
                then you might be able to get more information by
                forwarding the java.util.logging
                to SLF4J. Below is some code to get you started.


                class DebuggingTest {

                     static {
                         miniClusterExtension =
                                 new MiniClusterExtension(

                     @RegisterExtension private static final MiniClusterExtension miniClusterExtension;

                     private static Configuration getConfiguration() {
                         final Configuration configuration = new 


                         return configuration;

                     void runJob() throws Exception {
                         <run job>



                rootLogger.level = off
                rootLogger.appenderRef.test.ref = TestLogger

                <>  =
                logger.http.level = trace

                <>  =
                appender.testlogger.type = CONSOLE
                = SYSTEM_ERR
                appender.testlogger.layout.type = PatternLayout
                appender.testlogger.layout.pattern = %-4r [%t] %-5p %c %x - %m%n

                On Tue, May 03, 2022 at 10:32:03AM +0200, Peter Schrott wrote:

                I also discovered problems with the PrometheusReporter on Flink 
                coming from 1.14.4. I already consulted the mailing list:
                I have not found the underlying problem or a solution to it.

                Actually, after re-checking, I see the same log WARNINGS as
                ChangZhou described.

                As I described, it seems to be an issue with my job. If no job, or an
                example job, runs on the taskmanager, the basic metrics work just fine. Maybe
                ChangZhou can confirm this?

                @ChangZhou what's your job setup? I am running a streaming SQL job, but
                also using the DataStream API to create the streaming environment and from
                that the table environment, and finally using a StatementSet to execute
                multiple SQL statements in one job.
                We are running a streaming application with low level API with
                Kubernetes operator FlinkDeployment.

