Martijn Visser created FLINK-40069:
--------------------------------------

             Summary: OpenTelemetryMetricReporterProtocolTest fails with "No 
content to map due to end-of-input" due to OTel collector readiness race
                 Key: FLINK-40069
                 URL: https://issues.apache.org/jira/browse/FLINK-40069
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Metrics, Tests
    Affects Versions: 2.4.0
            Reporter: Martijn Visser
            Assignee: Martijn Visser


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=76609&view=results
 (leg: test_cron_jdk21_connect)

{{testGzipCompressionGrpc}} and {{testGzipCompressionHttp}} each failed after 
exhausting the full 2-minute {{eventually()}} budget:

{noformat}
org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.exc.MismatchedInputException: No content to map due to end-of-input
      at org.apache.flink.metrics.otel.OpenTelemetryTestBase.lambda$eventuallyConsumeJson$1(OpenTelemetryTestBase.java:114)
      at org.apache.flink.metrics.otel.OpenTelemetryMetricReporterProtocolTest.assertReported(OpenTelemetryMetricReporterProtocolTest.java:66)
      at org.apache.flink.metrics.otel.AbstractOpenTelemetryReporterProtocolTest.testGzipCompressionGrpc(AbstractOpenTelemetryReporterProtocolTest.java:88)
{noformat}

Root cause: {{OtelTestContainer}} sets no wait strategy, so Testcontainers uses 
the default {{HostPortWaitStrategy}}. The shell-less 
{{otel/opentelemetry-collector}} image cannot run the internal exec check 
("/bin/sh: no such file or directory"), leaving only the external host-port 
probe, which reports the mapped port ready before the OTLP receiver accepts 
connections. The test's single one-shot {{report()}} was invoked before the 
collector logged readiness, so the export failed ({{Metric export completed 
with issues: ... successfulBatches=0, failedBatches=1}}) and, because the 
assertion loop only re-reads the collector output file and never re-exports, 
the file stayed empty for the whole timeout. Gzip is incidental: those two 
tests simply ran first against the cold container.

Proposed fix: (1) wait for the collector's "Everything is ready. Begin running 
and processing data." log line (emitted only after all components, including 
the OTLP receivers, have started) with a bounded startup timeout; (2) re-invoke 
{{report()}} inside the retry loop so that a single failed export (including a 
transient HTTP 404) cannot consume the whole budget.
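
A minimal, JDK-only sketch of part (2), assuming the retry helper takes the 
export and the read as separate callbacks. The class {{ReExportingRetry}}, the 
method {{eventuallyReported}} and the simulated collector are hypothetical 
names for illustration; they are not the actual {{OpenTelemetryTestBase}} or 
{{eventually()}} code. The point is only that the export runs on every attempt 
instead of once before the loop:

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical sketch; none of these names come from the Flink code base. */
final class ReExportingRetry {

    /**
     * Runs export, then read, until read yields non-empty content
     * or the time budget is exhausted.
     */
    static String eventuallyReported(
            Runnable export, Supplier<String> read, Duration budget, Duration pause) {
        final long deadline = System.nanoTime() + budget.toNanos();
        while (true) {
            // Re-export on every attempt, so one failed export is not fatal.
            export.run();
            String content = read.get();
            if (content != null && !content.isEmpty()) {
                return content;
            }
            if (System.nanoTime() >= deadline) {
                throw new IllegalStateException("No content within " + budget);
            }
            try {
                Thread.sleep(pause.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting", e);
            }
        }
    }

    public static void main(String[] args) {
        // Simulated collector (assumption): drops the first two exports,
        // as a cold container would, and accepts the third.
        AtomicInteger attempts = new AtomicInteger();
        StringBuilder collectorFile = new StringBuilder();
        Runnable export = () -> {
            if (attempts.incrementAndGet() >= 3) {
                collectorFile.append("{\"resourceMetrics\":[]}");
            }
        };
        String json = eventuallyReported(
                export, collectorFile::toString,
                Duration.ofSeconds(5), Duration.ofMillis(10));
        System.out.println(attempts.get() + " attempts, content: " + json);
    }
}
```

With a one-shot export before the loop, the simulated collector above would 
never receive a third export and the read would stay empty until the budget 
ran out, which mirrors the failure described in this ticket.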



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
