[ 
https://issues.apache.org/jira/browse/FLINK-40069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-40069:
-----------------------------------
    Labels: pull-request-available test-stability  (was: test-stability)

> OpenTelemetryMetricReporterProtocolTest fails with "No content to map due to 
> end-of-input" due to OTel collector readiness race
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-40069
>                 URL: https://issues.apache.org/jira/browse/FLINK-40069
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Metrics, Tests
>    Affects Versions: 2.4.0
>            Reporter: Martijn Visser
>            Assignee: Martijn Visser
>            Priority: Major
>              Labels: pull-request-available, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=76609&view=results
>  (leg: test_cron_jdk21_connect)
> {{testGzipCompressionGrpc}} and {{testGzipCompressionHttp}} each failed after 
> exhausting the full 2-minute {{eventually()}} budget:
> {noformat}
> org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.exc.MismatchedInputException:
>  No content to map due to end-of-input
>       at 
> org.apache.flink.metrics.otel.OpenTelemetryTestBase.lambda$eventuallyConsumeJson$1(OpenTelemetryTestBase.java:114)
>       at 
> org.apache.flink.metrics.otel.OpenTelemetryMetricReporterProtocolTest.assertReported(OpenTelemetryMetricReporterProtocolTest.java:66)
>       at 
> org.apache.flink.metrics.otel.AbstractOpenTelemetryReporterProtocolTest.testGzipCompressionGrpc(AbstractOpenTelemetryReporterProtocolTest.java:88)
> {noformat}
> Root cause: {{OtelTestContainer}} sets no wait strategy, so Testcontainers 
> falls back to the default {{HostPortWaitStrategy}}. The shell-less 
> {{otel/opentelemetry-collector}} image cannot run the internal exec check 
> ("/bin/sh: no such file or directory"), which leaves only the external 
> host-port probe. That probe reports the mapped port as ready before the OTLP 
> receiver accepts connections. The test's one-shot {{report()}} was invoked 
> before the collector logged readiness, so the export failed ({{Metric export 
> completed with issues: ... successfulBatches=0, failedBatches=1}}). Because 
> the assertion loop only re-reads the collector output file and never 
> re-exports, the file stayed empty for the whole timeout. Gzip is incidental: 
> those two tests simply ran first against the cold container.
> Proposed fix: (1) wait for the collector's "Everything is ready. Begin 
> running and processing data." log line (emitted only after all components, 
> including the OTLP receivers, have started) with a bounded startup timeout; 
> (2) re-invoke {{report()}} inside the retry loop so that a single failed 
> export (including a transient HTTP 404) cannot doom the whole budget.
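
Part (2) of the proposed fix can be illustrated with a minimal, dependency-free sketch. It is not the Flink test code: {{FakeCollector}}, this {{eventually()}} helper, and the attempt counts are hypothetical stand-ins, chosen only to show why a loop that merely re-reads the output can never recover from a lost export, while a loop that re-exports on every attempt can.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical illustration only; none of these names come from the Flink code base.
final class ReexportingRetry {

    /** Simulated collector that silently drops exports received before it is ready. */
    static final class FakeCollector {
        private boolean ready = false;
        private final List<String> outputFile = new ArrayList<>();

        void markReady() {
            ready = true;
        }

        void export(String payload) {
            if (ready) {
                outputFile.add(payload);
            }
            // An export sent to a cold collector is lost and never replayed.
        }

        boolean hasOutput() {
            return !outputFile.isEmpty();
        }
    }

    /** Runs beforeEachCheck, then check, until check passes or attempts run out. */
    static boolean eventually(
            int maxAttempts, Duration pause, Runnable beforeEachCheck, BooleanSupplier check) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            beforeEachCheck.run();
            if (check.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(pause.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    /** Reproduces the race: one export hits the cold collector, which then becomes ready. */
    static FakeCollector coldCollectorAfterFailedExport() {
        FakeCollector collector = new FakeCollector();
        collector.export("metrics"); // dropped: collector not ready yet
        collector.markReady();
        return collector;
    }

    /** Failing pattern: the loop only re-reads the output and never re-exports. */
    static boolean pollReadOnly() {
        FakeCollector collector = coldCollectorAfterFailedExport();
        return eventually(5, Duration.ofMillis(10), () -> {}, collector::hasOutput);
    }

    /** Proposed pattern: every attempt re-exports before checking the output. */
    static boolean pollWithReexport() {
        FakeCollector collector = coldCollectorAfterFailedExport();
        return eventually(
                5, Duration.ofMillis(10), () -> collector.export("metrics"), collector::hasOutput);
    }

    public static void main(String[] args) {
        System.out.println("read-only polling succeeded: " + pollReadOnly());
        System.out.println("re-exporting polling succeeded: " + pollWithReexport());
    }
}
```

In this sketch the read-only loop exhausts its attempts with an empty output, while the re-exporting loop succeeds on its first attempt. Part (1) is not shown here; with Testcontainers it would presumably be a log-message wait strategy (for example via {{Wait.forLogMessage}}) combined with a startup timeout on the container.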



--
This message was sent by Atlassian Jira
(v8.20.10#820010)