aljoscha commented on a change in pull request #6965: [FLINK-10368][e2e] Hardened kerberized yarn e2e test
URL: https://github.com/apache/flink/pull/6965#discussion_r230008139

 ##########
 File path: flink-end-to-end-tests/test-scripts/test_yarn_kerberos_docker.sh
 ##########
 @@ -60,19 +64,41 @@ function cluster_shutdown {
  trap cluster_shutdown INT
  trap cluster_shutdown EXIT
 
 -until docker cp $FLINK_TARBALL_DIR/$FLINK_TARBALL master:/home/hadoop-user/; do
 -    # we're retrying this one because we don't know yet if the container is ready
 -    echo "Uploading Flink tarball to docker master failed, retrying ..."
 -    sleep 5
 +# wait for kerberos to be set up
 +start_time=$(date +%s)
 +until docker logs master 2>&1 | grep -q "Finished master initialization"; do
 +    current_time=$(date +%s)
 +    time_diff=$((current_time - start_time))
 +
 +    if [ $time_diff -ge $MAX_RETRY_SECONDS ]; then
 +        echo "ERROR: Could not start hadoop cluster. Aborting..."
 +        exit 0
 +    else
 +        echo "Waiting for hadoop cluster to come up. We have been trying for $time_diff seconds, retrying ..."
 +        sleep 10
 +    fi
  done
 
 +# perform health checks
 +if ! { [ $(docker inspect -f '{{.State.Running}}' master 2>&1) = 'true' ] &&
 +       [ $(docker inspect -f '{{.State.Running}}' slave1 2>&1) = 'true' ] &&
 +       [ $(docker inspect -f '{{.State.Running}}' slave2 2>&1) = 'true' ] &&
 +       [ $(docker inspect -f '{{.State.Running}}' kdc 2>&1) = 'true' ]; };
 +then
 +    echo "ERROR: Could not start hadoop cluster. At least one of the containers failed. Aborting..."
 +    exit 0

Review comment:
Isn't exit code `1` the exit code for failure?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services
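
As background to the review comment: by shell convention an exit status of 0 signals success and any non-zero status signals failure, so a test script that ends with `exit 0` after a timeout would most likely be reported as passing. The following is only a minimal sketch of the timeout loop from the diff with a failing status on the error path. The helper name `wait_for` and its arguments are hypothetical, introduced here for illustration, and are not part of the Flink test scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the Flink scripts): poll a command until it
# succeeds or a time budget runs out.
# Usage: wait_for MAX_SECONDS SLEEP_SECONDS COMMAND [ARGS...]
wait_for() {
    local max_seconds=$1 sleep_seconds=$2
    shift 2
    local start_time current_time time_diff
    start_time=$(date +%s)
    until "$@"; do
        current_time=$(date +%s)
        time_diff=$((current_time - start_time))
        if [ "$time_diff" -ge "$max_seconds" ]; then
            echo "ERROR: giving up after $time_diff seconds." >&2
            return 1   # non-zero status tells the caller this failed
        fi
        sleep "$sleep_seconds"
    done
    return 0           # zero status: the probe command succeeded in time
}
```

Under these assumptions, the wait in the diff could be written as `wait_for "$MAX_RETRY_SECONDS" 10 sh -c 'docker logs master 2>&1 | grep -q "Finished master initialization"' || exit 1`, so that a timeout makes the test script itself exit with a failure status.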