[ https://issues.apache.org/jira/browse/FLINK-40069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated FLINK-40069:
-----------------------------------
Labels: pull-request-available test-stability (was: test-stability)
> OpenTelemetryMetricReporterProtocolTest fails with "No content to map due to
> end-of-input" due to OTel collector readiness race
> -------------------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-40069
> URL: https://issues.apache.org/jira/browse/FLINK-40069
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Metrics, Tests
> Affects Versions: 2.4.0
> Reporter: Martijn Visser
> Assignee: Martijn Visser
> Priority: Major
> Labels: pull-request-available, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=76609&view=results
> (leg: test_cron_jdk21_connect)
> {{testGzipCompressionGrpc}} and {{testGzipCompressionHttp}} each failed after
> exhausting the full 2-minute {{eventually()}} budget:
> {noformat}
> org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.exc.MismatchedInputException:
> No content to map due to end-of-input
> at
> org.apache.flink.metrics.otel.OpenTelemetryTestBase.lambda$eventuallyConsumeJson$1(OpenTelemetryTestBase.java:114)
> at
> org.apache.flink.metrics.otel.OpenTelemetryMetricReporterProtocolTest.assertReported(OpenTelemetryMetricReporterProtocolTest.java:66)
> at
> org.apache.flink.metrics.otel.AbstractOpenTelemetryReporterProtocolTest.testGzipCompressionGrpc(AbstractOpenTelemetryReporterProtocolTest.java:88)
> {noformat}
> Root cause: {{OtelTestContainer}} sets no wait strategy, so Testcontainers
> falls back to the default {{HostPortWaitStrategy}}. The shell-less
> {{otel/opentelemetry-collector}} image cannot run the internal exec check
> ("/bin/sh: no such file or directory"), leaving only the external host-port
> probe, which reports the mapped port ready before the OTLP receiver accepts
> connections. The test calls {{report()}} exactly once, and that call ran
> before the collector logged readiness, so the export failed ({{Metric export
> completed with issues: ... successfulBatches=0, failedBatches=1}}). Because
> the assertion loop only re-reads the collector output file and never
> re-exports, the file stayed empty for the whole timeout. Gzip is incidental:
> those two
> tests simply ran first against the cold container.
> Proposed fix: (1) wait for the collector's "Everything is ready. Begin
> running and processing data." log line (emitted only after all components,
> including the OTLP receivers, have started) with a bounded startup timeout;
> (2) re-invoke {{report()}} inside the retry loop so that a single failed
> export (including a transient HTTP 404) cannot doom the whole budget.
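
Part (2) of the proposed fix can be illustrated with a minimal, self-contained sketch. It is not the actual Flink test code: the class name {{RetryingExportSketch}}, the {{eventually}} signature shown here, and the simulated collector in {{main}} are all hypothetical and use only the JDK. The point is only that the trigger (standing in for {{report()}}) runs on every attempt, so an export that failed against a cold collector gets repeated instead of leaving the output empty for the whole budget.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical sketch; names and signatures are assumptions, not Flink's test utilities. */
public final class RetryingExportSketch {

    /**
     * Runs trigger, then read, and repeats both until read yields a value
     * or the time budget is exhausted.
     */
    public static <T> T eventually(
            Runnable trigger,
            Supplier<Optional<T>> read,
            Duration budget,
            Duration pause) {
        long deadline = System.nanoTime() + budget.toNanos();
        int attempts = 0;
        while (true) {
            attempts++;
            trigger.run(); // re-export on every attempt, not just once up front
            Optional<T> result = read.get();
            if (result.isPresent()) {
                return result.get();
            }
            if (System.nanoTime() - deadline >= 0) {
                throw new AssertionError(
                        "No output after " + attempts + " attempts within " + budget);
            }
            try {
                Thread.sleep(pause.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new AssertionError("Interrupted while waiting for output", e);
            }
        }
    }

    public static void main(String[] args) {
        // Simulated collector (assumption): drops the first two exports, accepts the third.
        AtomicInteger exports = new AtomicInteger();
        StringBuilder outputFile = new StringBuilder();
        Runnable report = () -> {
            if (exports.incrementAndGet() >= 3) {
                outputFile.append("{\"ok\":true}");
            }
        };
        Supplier<Optional<String>> readOutput = () ->
                outputFile.length() == 0
                        ? Optional.empty()
                        : Optional.of(outputFile.toString());

        String json = eventually(report, readOutput,
                Duration.ofSeconds(5), Duration.ofMillis(10));
        System.out.println(json + " after " + exports.get() + " exports");
    }
}
```

With a one-shot export placed before the loop, the simulated collector above would never receive a third export and the read would stay empty until the budget expired, which mirrors the failure described in this issue.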
--
This message was sent by Atlassian Jira
(v8.20.10#820010)