robobario opened a new pull request, #24491: URL: https://github.com/apache/flink/pull/24491
## What is the purpose of the change This pull request aims to make end-to-end test scripts that source `common_s3_operations.sh` fail fast if the aws cli container fails to start. It also adds a single naive retry aiming to recover from a transient network failure. [FLINK-34569](https://issues.apache.org/jira/browse/FLINK-34569) describes an issue where an end-to-end test run took 15 minutes to fail after the aws cli container failed to start. From the test logs: ``` 2024-03-02T04:10:55.5496990Z Unable to find image 'banst/awscli:latest' locally 2024-03-02T04:10:56.3857380Z docker: Error response from daemon: Head "https://registry-1.docker.io/v2/banst/awscli/manifests/latest": read tcp 10.1.0.97:33016->54.236.113.205:443: read: connection reset by peer. 2024-03-02T04:10:56.3857877Z See 'docker run --help'. 2024-03-02T04:10:56.4586492Z Error: No such object: ``` This failure isn't handled and so later we were stuck in a loop trying to docker exec commands like `docker exec -t "" command`. To test it locally I've been provoking docker run failures by changing the image name to something non-existent. ## Brief change log - *Fail fast if aws cli container fails to run* - *Add naive retry when creating aws cli container* - *Add --rm to jq docker run commands to remove them on exit* ## Verifying this change This change is a trivial rework / code cleanup without any test coverage. I verified that it fails fast by modifying the awscli image to have a non-existant name, to provoke a `docker run` failure, causing it to fail like: ``` ============================================================================== Running 'test-scripts/test_file_sink.sh s3 StreamingFileSink skip_check_exceptions' ============================================================================== TEST_DATA_DIR: /home/roby/development/redhat-managed-kafka/upstream/flink/flink-end-to-end-tests/test-scripts/temp-test-directory-53909550201 Flink dist directory: /home/roby/development/redhat-managed-kafka/upstream/flink/flink-dist/target/flink-1.20-SNAPSHOT-bin/flink-1.20-SNAPSHOT Found AWS bucket robeyoun-testing-flink-13-03-2024, running the e2e test. Found AWS access key, running the e2e test. Found AWS secret key, running the e2e test. Unable to find image 'banstz/awscli:latest' locally docker: Error response from daemon: pull access denied for banstz/awscli, repository does not exist or may require 'docker login': denied: requested access to the resource is denied. See 'docker run --help'. running aws cli container failed Unable to find image 'banstz/awscli:latest' locally docker: Error response from daemon: pull access denied for banstz/awscli, repository does not exist or may require 'docker login': denied: requested access to the resource is denied. See 'docker run --help'. running aws cli container failed running the aws cli container failed [FAIL] Test script contains errors. Checking for errors... No errors in log files. Checking for exceptions... No exceptions in log files. Checking for non-empty .out files... grep: /home/roby/development/redhat-managed-kafka/upstream/flink/build-target/log/*.out: No such file or directory No non-empty .out files. [FAIL] 'test-scripts/test_file_sink.sh s3 StreamingFileSink skip_check_exceptions' failed after 0 minutes and 6 seconds! Test exited with exit code 1 ``` ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): no - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no - The serializers:no - The runtime per-record code paths (performance sensitive): no - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no - The S3 file system connector: no ## Documentation - Does this pull request introduce a new feature? no - If yes, how is the feature documented? not applicable -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org