MartijnVisser opened a new pull request, #28635:
URL: https://github.com/apache/flink/pull/28635

   ## What is the purpose of the change
   
   Fixes intermittent failures of 
`OpenTelemetryMetricReporterProtocolTest.testGzipCompressionGrpc/Http`, which 
fail with `MismatchedInputException: No content to map due to end-of-input` 
after exhausting the full 2-minute retry budget (Azure build 76609, leg 
`test_cron_jdk21_connect`). Two issues combined. (1) `OtelTestContainer` used 
the default `HostPortWaitStrategy`, which falsely reports readiness for the 
shell-less collector image because its internal exec check cannot run. The 
reporter's export was invoked ~160ms before the collector logged readiness, so 
it ran against a not-yet-listening OTLP receiver and failed. (2) The test 
exported exactly once before polling, so that single failed export left the 
collector output file empty for the entire timeout. Gzip is incidental: those 
two tests simply ran first against the cold container.
   
   ## Brief change log
   
     - `OtelTestContainer`: wait for the collector's "Everything is ready. 
Begin running and processing data." log line, with a bounded 1-minute startup 
timeout. That line is emitted only after all components, including the OTLP 
receivers, have started, whereas the `health_check` extension can report 
healthy independently of receiver readiness.
     - `OpenTelemetryTestBase`: add an `eventuallyConsumeJson` overload with a 
pre-attempt hook; the existing single-arg method delegates with a no-op.
     - `OpenTelemetryMetricReporterProtocolTest`: re-export (`report()` + 
`waitForLastReportToComplete()`) inside the retry loop rather than once 
beforehand; this also covers transient export failures such as an observed 
one-off HTTP 404.
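
   The re-export change can be illustrated with a minimal, self-contained 
sketch. It is not the code in this PR: `ReExportRetrySketch`, `FakeCollector`, 
and `eventually` are hypothetical stand-ins for the test base, the collector 
container, and `eventuallyConsumeJson`, and the timeouts are shortened for 
illustration. It assumes only that an export sent before the receiver is 
listening is lost, and that reading an empty output file throws.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class ReExportRetrySketch {

    /** Hypothetical stand-in for the collector: drops exports until it is "listening". */
    static final class FakeCollector {
        private final int exportsDroppedBeforeReady;
        private int exportsSeen = 0;
        final List<String> output = new ArrayList<>();

        FakeCollector(int exportsDroppedBeforeReady) {
            this.exportsDroppedBeforeReady = exportsDroppedBeforeReady;
        }

        /** Models report() + waitForLastReportToComplete(): lost if the receiver is not up yet. */
        void export(String json) {
            exportsSeen++;
            if (exportsSeen > exportsDroppedBeforeReady) {
                output.add(json);
            }
        }

        /** Models reading the collector output file; throws while it is still empty. */
        String readLastLine() {
            if (output.isEmpty()) {
                throw new IllegalStateException("No content to map due to end-of-input");
            }
            return output.get(output.size() - 1);
        }
    }

    /** Retries the attempt until it stops throwing or the timeout elapses; the hook runs before each try. */
    static <T> T eventually(
            Runnable preAttempt, Supplier<T> attempt, Duration timeout, Duration pause) {
        long deadline = System.nanoTime() + timeout.toNanos();
        RuntimeException last = null;
        do {
            preAttempt.run();
            try {
                return attempt.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
            try {
                Thread.sleep(pause.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while retrying", e);
            }
        } while (System.nanoTime() < deadline);
        throw new IllegalStateException("Timed out waiting for a successful attempt", last);
    }

    /** Overload without a hook: delegates with a no-op, as the change log describes. */
    static <T> T eventually(Supplier<T> attempt, Duration timeout, Duration pause) {
        return eventually(() -> {}, attempt, timeout, pause);
    }

    public static void main(String[] args) {
        Duration timeout = Duration.ofMillis(200);
        Duration pause = Duration.ofMillis(10);

        // Old shape: export once, then poll. The dropped export leaves the output empty.
        FakeCollector cold = new FakeCollector(2);
        cold.export("{\"metric\":1}");
        try {
            eventually(cold::readLastLine, timeout, pause);
        } catch (IllegalStateException e) {
            System.out.println("single export: " + e.getMessage());
        }

        // New shape: re-export before every attempt, so a later export lands.
        FakeCollector retried = new FakeCollector(2);
        String json = eventually(
                () -> retried.export("{\"metric\":1}"), retried::readLastLine, timeout, pause);
        System.out.println("re-export: " + json);
    }
}
```

   In this sketch the hook-less overload can only succeed if the one export 
made beforehand was accepted, whereas the hook variant recovers as soon as any 
later export is accepted.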
   
   ## Verifying this change
   
   This change is already covered by existing tests. Verified locally with 
Docker: the three protocol classes (metrics/events/traces) and both ITCases 
pass twice in a row (29 tests, 0 failures), and the metrics protocol class 
completes in seconds, versus 256s in the CI run that failed with two errors.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
   
   ---
   
   ##### Was generative AI tooling used to co-author this PR?
   
   - [X] Yes (Claude Opus 4.8, via Claude Code)
   
   Generated-by: Claude Opus 4.8 (1M context)
   

