robobario opened a new pull request, #24491:
URL: https://github.com/apache/flink/pull/24491

   ## What is the purpose of the change
   
   This pull request aims to make end-to-end test scripts that source 
`common_s3_operations.sh` fail fast if the aws cli container fails to start. It 
also adds a single naive retry aiming to recover from a transient network 
failure.
   
   [FLINK-34569](https://issues.apache.org/jira/browse/FLINK-34569) describes 
an issue where an end-to-end test run took 15 minutes to fail after the aws cli 
container failed to start. From the test logs:
   
   ```
   2024-03-02T04:10:55.5496990Z Unable to find image 'banst/awscli:latest' locally
   2024-03-02T04:10:56.3857380Z docker: Error response from daemon: Head "https://registry-1.docker.io/v2/banst/awscli/manifests/latest": read tcp 10.1.0.97:33016->54.236.113.205:443: read: connection reset by peer.
   2024-03-02T04:10:56.3857877Z See 'docker run --help'.
   2024-03-02T04:10:56.4586492Z Error: No such object:
   ```
   
   This failure isn't handled, so the script later got stuck in a loop trying 
to run commands against an empty container ID, like 
`docker exec -t "" command`.
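
As a minimal sketch of the fail-fast idea (not the code in this PR), a script could check the captured container ID before using it. The function name `require_container_id` is hypothetical and chosen only for illustration:

```shell
#!/usr/bin/env bash

# Hypothetical helper: refuse to continue when no container ID was captured,
# instead of later calling `docker exec` with an empty ID.
require_container_id() {
  local container_id="$1"
  if [[ -z "${container_id}" ]]; then
    echo "running aws cli container failed" >&2
    return 1
  fi
}
```

A caller would run this right after `docker run` and exit on a non-zero status, so that no later `docker exec` is attempted against `""`.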
   
   To test this locally, I provoked `docker run` failures by changing the 
image name to a non-existent one.
   
   ## Brief change log
   
     - *Fail fast if aws cli container fails to run*
     - *Add naive retry when creating aws cli container*
     - *Add --rm to jq docker run commands to remove them on exit*
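
The naive retry in the change log could look roughly like the sketch below. It assumes a generic wrapper, here called `run_with_single_retry` (a hypothetical name, not taken from the PR), that runs a command and tries exactly once more if the first attempt fails:

```shell
#!/usr/bin/env bash

# Hypothetical helper: run the given command; if it fails, retry exactly once.
# The exit status is that of the last attempt.
run_with_single_retry() {
  if "$@"; then
    return 0
  fi
  echo "command failed, retrying once: $*" >&2
  "$@"
}

# Hypothetical usage (image name and container command are placeholders):
#   container_id=$(run_with_single_retry docker run -d some/image sleep infinity) || exit 1
```

A single retry keeps the worst-case delay small while still covering a one-off network error such as the connection reset shown above. A persistent failure, such as a missing image, still fails quickly.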
   
   ## Verifying this change
   
   This change is a trivial rework / code cleanup without any test coverage.
   
   I verified the fail-fast behaviour by changing the awscli image to a 
non-existent name, which provokes a `docker run` failure. The test then fails 
like this:
   
   ```
   
==============================================================================
   Running 'test-scripts/test_file_sink.sh s3 StreamingFileSink 
skip_check_exceptions'
   
==============================================================================
   TEST_DATA_DIR: 
/home/roby/development/redhat-managed-kafka/upstream/flink/flink-end-to-end-tests/test-scripts/temp-test-directory-53909550201
   Flink dist directory: 
/home/roby/development/redhat-managed-kafka/upstream/flink/flink-dist/target/flink-1.20-SNAPSHOT-bin/flink-1.20-SNAPSHOT
   Found AWS bucket robeyoun-testing-flink-13-03-2024, running the e2e test.
   Found AWS access key, running the e2e test.
   Found AWS secret key, running the e2e test.
   Unable to find image 'banstz/awscli:latest' locally
   docker: Error response from daemon: pull access denied for banstz/awscli, 
repository does not exist or may require 'docker login': denied: requested 
access to the resource is denied.
   See 'docker run --help'.
   running aws cli container failed
   Unable to find image 'banstz/awscli:latest' locally
   docker: Error response from daemon: pull access denied for banstz/awscli, 
repository does not exist or may require 'docker login': denied: requested 
access to the resource is denied.
   See 'docker run --help'.
   running aws cli container failed
   running the aws cli container failed
   [FAIL] Test script contains errors.
   Checking for errors...
   No errors in log files.
   Checking for exceptions...
   No exceptions in log files.
   Checking for non-empty .out files...
   grep: 
/home/roby/development/redhat-managed-kafka/upstream/flink/build-target/log/*.out:
 No such file or directory
   No non-empty .out files.
   
   [FAIL] 'test-scripts/test_file_sink.sh s3 StreamingFileSink 
skip_check_exceptions' failed after 0 minutes and 6 seconds! Test exited with 
exit code 1
   ```
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? not applicable
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
