Martijn Visser created FLINK-40069:
--------------------------------------
Summary: OpenTelemetryMetricReporterProtocolTest fails with "No
content to map due to end-of-input" due to OTel collector readiness race
Key: FLINK-40069
URL: https://issues.apache.org/jira/browse/FLINK-40069
Project: Flink
Issue Type: Bug
Components: Runtime / Metrics, Tests
Affects Versions: 2.4.0
Reporter: Martijn Visser
Assignee: Martijn Visser
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=76609&view=results
(leg: test_cron_jdk21_connect)
{{testGzipCompressionGrpc}} and {{testGzipCompressionHttp}} each failed after
exhausting the full 2-minute {{eventually()}} budget:
{noformat}
org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.exc.MismatchedInputException:
No content to map due to end-of-input
at
org.apache.flink.metrics.otel.OpenTelemetryTestBase.lambda$eventuallyConsumeJson$1(OpenTelemetryTestBase.java:114)
at
org.apache.flink.metrics.otel.OpenTelemetryMetricReporterProtocolTest.assertReported(OpenTelemetryMetricReporterProtocolTest.java:66)
at
org.apache.flink.metrics.otel.AbstractOpenTelemetryReporterProtocolTest.testGzipCompressionGrpc(AbstractOpenTelemetryReporterProtocolTest.java:88)
{noformat}
Root cause: {{OtelTestContainer}} sets no wait strategy, so testcontainers uses
the default {{HostPortWaitStrategy}}. The shell-less
{{otel/opentelemetry-collector}} image cannot run the internal exec check
("/bin/sh: no such file or directory"), leaving only the external host-port
probe, which reports the mapped port ready before the OTLP receiver accepts
connections. The test's one-shot {{report()}} ran before the
collector logged readiness, so the export failed ({{Metric export completed
with issues: ... successfulBatches=0, failedBatches=1}}). Because the
assertion loop only re-reads the collector output file and never re-exports,
the file stayed empty for the whole timeout. Gzip is incidental: those two
tests simply ran first against the cold container.
Proposed fix: (1) wait for the collector's "Everything is ready. Begin running
and processing data." log line (emitted only after all components, including the
OTLP receivers, have started) with a bounded startup timeout; (2) re-invoke
{{report()}} inside the retry loop so a single failed export (including a
transient HTTP 404) cannot doom the whole budget.
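To illustrate part (2), here is a minimal, self-contained sketch of the difference between a retry loop that only re-reads the output and one that re-exports on every attempt. It is not the actual test code: {{FakeCollector}}, {{eventually}}, {{readOnlyLoop}} and {{reExportingLoop}} are hypothetical stand-ins for the collector container, the test's {{eventually()}} helper, and the two loop shapes, and the "ready after N attempts" behaviour is an assumption made purely for the simulation.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

class RetryingExportSketch {

    /** Hypothetical stand-in for the collector: rejects exports until it is "ready". */
    static final class FakeCollector {
        private final int rejectedAttempts;
        private int attempts = 0;
        private final List<String> outputFile = new ArrayList<>();

        FakeCollector(int rejectedAttempts) {
            this.rejectedAttempts = rejectedAttempts;
        }

        /** Returns true only if the export was accepted and written to the output. */
        boolean export(String payload) {
            attempts++;
            if (attempts <= rejectedAttempts) {
                return false; // receiver not accepting connections yet
            }
            outputFile.add(payload);
            return true;
        }

        List<String> readOutput() {
            return List.copyOf(outputFile);
        }
    }

    /** Hypothetical retry helper: re-runs the attempt until it passes or the budget expires. */
    static <T> T eventually(Duration budget, Supplier<T> attempt) {
        long deadline = System.nanoTime() + budget.toNanos();
        AssertionError last;
        do {
            try {
                return attempt.get();
            } catch (AssertionError e) {
                last = e; // remember the failure and try again
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        } while (System.nanoTime() < deadline);
        throw last;
    }

    private static List<String> requireNonEmpty(List<String> output) {
        if (output.isEmpty()) {
            throw new AssertionError("No content to map due to end-of-input");
        }
        return output;
    }

    /** Failing shape: export once up front, then only re-read the output. */
    static List<String> readOnlyLoop(FakeCollector collector, Duration budget) {
        collector.export("metric");
        return eventually(budget, () -> requireNonEmpty(collector.readOutput()));
    }

    /** Proposed shape: re-export inside the loop, so an early failed export is retried. */
    static List<String> reExportingLoop(FakeCollector collector, Duration budget) {
        return eventually(budget, () -> {
            collector.export("metric");
            return requireNonEmpty(collector.readOutput());
        });
    }

    public static void main(String[] args) {
        List<String> output = reExportingLoop(new FakeCollector(2), Duration.ofMillis(500));
        System.out.println("exports written: " + output.size());
        try {
            readOnlyLoop(new FakeCollector(2), Duration.ofMillis(50));
        } catch (AssertionError e) {
            System.out.println("read-only loop failed: " + e.getMessage());
        }
    }
}
```

In this simulation the read-only loop spends its entire budget on an output that can never change, while the re-exporting loop succeeds as soon as the simulated collector starts accepting exports. The log-line wait in part (1) should still make the first export succeed in the common case; re-exporting is the safety net.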
--
This message was sent by Atlassian Jira
(v8.20.10#820010)