MartijnVisser opened a new pull request, #28635:
URL: https://github.com/apache/flink/pull/28635
## What is the purpose of the change
Fixes intermittent failures of
`OpenTelemetryMetricReporterProtocolTest.testGzipCompressionGrpc/Http`
(`MismatchedInputException: No content to map due to end-of-input` after
exhausting the full 2-minute retry budget; Azure build 76609, leg
`test_cron_jdk21_connect`). Two issues combined. (1) `OtelTestContainer` used
the default `HostPortWaitStrategy`, which falsely reports readiness for the
shell-less collector image because its internal exec check cannot run. As a
result, the reporter's export was invoked ~160ms before the collector logged
readiness, hit an OTLP receiver that was not yet listening, and failed. (2) The
test exported exactly once before polling, so that single failed export left
the collector output file empty for the entire timeout. Gzip is incidental:
those two tests simply ran first against the cold container.
## Brief change log
- `OtelTestContainer`: wait for the collector's "Everything is ready.
Begin running and processing data." log line (emitted only after all components
including the OTLP receivers have started, unlike the health_check extension
which can report healthy independently of receiver readiness) with a bounded
1-minute startup timeout.
- `OpenTelemetryTestBase`: add an `eventuallyConsumeJson` overload with a
pre-attempt hook; the existing single-arg method delegates with a no-op.
- `OpenTelemetryMetricReporterProtocolTest`: re-export (`report()` +
`waitForLastReportToComplete()`) inside the retry loop rather than once
beforehand; this also covers transient export failures such as an observed
one-off HTTP 404.
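
For reviewers, the pre-attempt hook pattern described above can be sketched roughly as follows. This is an illustrative, self-contained approximation, not the code in this PR: the class name `RetryWithPreAttemptHook`, the method name `eventually`, and its parameters are hypothetical and do not reflect the actual `eventuallyConsumeJson` signature. The hook stands in for `report()` + `waitForLastReportToComplete()`, and the supplier stands in for reading the collector output file.

```java
import java.time.Duration;
import java.util.function.Supplier;

// Hypothetical sketch: names and signatures are assumptions, not Flink's test API.
final class RetryWithPreAttemptHook {

    /**
     * Runs preAttempt and then poll on every iteration until poll returns a
     * non-null value or the timeout elapses. Runtime failures in either step
     * (for example a transient export error) are remembered and retried.
     */
    static <T> T eventually(
            Runnable preAttempt, Supplier<T> poll, Duration timeout, Duration interval) {
        final long deadline = System.nanoTime() + timeout.toNanos();
        RuntimeException lastFailure = null;
        while (true) {
            try {
                preAttempt.run(); // re-export before each poll
                T result = poll.get();
                if (result != null) {
                    return result;
                }
            } catch (RuntimeException e) {
                lastFailure = e;
            }
            if (System.nanoTime() - deadline >= 0) {
                throw new IllegalStateException(
                        "Condition not met within " + timeout, lastFailure);
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while retrying", e);
            }
        }
    }

    /** Overload without a hook; delegates with a no-op, mirroring the single-arg method. */
    static <T> T eventually(Supplier<T> poll, Duration timeout, Duration interval) {
        return eventually(() -> {}, poll, timeout, interval);
    }
}
```

With the export inside the loop, one export that fails against a cold receiver no longer leaves the output empty for the whole retry budget, because the next attempt exports again.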
## Verifying this change
This change is already covered by existing tests. Verified locally with
Docker: the three protocol classes (metrics/events/traces) and both ITCases
pass twice in a row (29 tests, 0 failures), with the metrics protocol class
completing in seconds versus the 256s two-error CI failure.
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with
`@Public(Evolving)`: no
- The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
- The S3 file system connector: no
## Documentation
- Does this pull request introduce a new feature? no
---
##### Was generative AI tooling used to co-author this PR?
- [X] Yes (Claude Opus 4.8, via Claude Code)
Generated-by: Claude Opus 4.8 (1M context)