This was originally an internal message and may refer to some of our projects, but the background information should be useful, so I have left these references in.
We are having an issue with Google App Engine preventing us from making new deployments. The error message is:

```
ERROR: (gcloud.app.deploy) Error Response: [4] Your deployment has failed to become healthy in the allotted time and therefore was rolled back. If you believe this was an error, try adjusting the 'app_start_timeout_sec' setting in the 'readiness_check' section.
```

This is a surprising error, especially as we haven't had issues with this until recently. It appears our changes earlier this year to prepare for the new Google App Engine split health checks didn't actually work, so when the legacy health check system was retired on September 15th (mentioned here: https://cloud.google.com/appengine/docs/flexible/custom-runtimes/migrating-to-split-health-checks), no deployments worked from that point on.

The health check specification is listed here: https://cloud.google.com/appengine/docs/flexible/python/reference/app-yaml#liveness_path. The error message references the `app_start_timeout_sec` setting; more details about it are found here: https://cloud.google.com/endpoints/docs/openapi/troubleshoot-aeflex-deployment.

I didn't think it was a timeout issue, since our system boots fairly quickly (less than the default of 5 minutes), so I investigated the logs of a version of the app (from now on I'm talking about the codeWOF production system unless specified otherwise). The versions page only listed the 'working' versions, but when I looked in the Logs Viewer, all the different versions were listed, including those that had failed.

With the following `app.yaml`:

```yaml
liveness_check:
  path: "/gae/liveness_check"
readiness_check:
  path: "/gae/readiness_check"
```

the logs were showing this:

```
Ready for new connections
Compiling message files
Starting gunicorn 19.9.0
Listening at: http://0.0.0.0:8080 (13)
Using worker: gevent
Booting worker with pid: 16
Booting worker with pid: 17
Booting worker with pid: 18
GET 301 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 301 0 B 3 ms GoogleHC/1.0 /liveness_check
```

This confirmed that the system had booted successfully and that the checks were getting through, but they were returning the wrong code: a 301 redirect instead of a 200. It also showed that the checks were going to the wrong URL: the configured `/gae/` prefix was missing. I believed the redirect was caused by either the `APPEND_SLASH` setting or the HTTP-to-HTTPS redirect.

I tried the following configuration:

```yaml
liveness_check:
  path: "/liveness_check/"
readiness_check:
  path: "/readiness_check/"
```

and got the following:

```
GET 301 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 301 0 B 3 ms GoogleHC/1.0 /liveness_check
```

Same error as above, so it appears that setting the custom path does not affect where the health check is sent. Searching for the custom path in all logging messages returns exactly one message (summary below):

```
2019-11-06 16:24:14.288 NZDT  App Engine  Create Version  default: 20191106t032141
livenessCheck: { path: "/liveness_check/" }
readinessCheck: { path: "/readiness_check/" }
Resources: { cpu: 1 memoryGb: 3.75 }
```

So the first thing to look into is whether the custom path is being set correctly; I couldn't get this to change.

I read all the Stack Overflow posts talking about App Engine and split health checks (there were fewer than 10) and tried all the suggested fixes. These included:

- Checking the split health checks were set correctly using `gcloud app describe --project codewof`.
- Setting the split health checks (again) with `gcloud app update --split-health-checks --project codewof`.
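For reference, here is a minimal sketch (assuming standard Django routing; the view name and file path are illustrative, not our actual code) of how the endpoints could be registered at the exact no-slash paths the checker requests, so `APPEND_SLASH` never issues a 301. The HTTP-to-HTTPS redirect would still need the `SECURE_REDIRECT_EXEMPT` entries shown further down:

```python
# Hypothetical urls.py sketch - names are illustrative.
from django.http import HttpResponse
from django.urls import path


def health_check(request):
    """Return a bare 200 OK for GoogleHC probes."""
    return HttpResponse(status=200)


urlpatterns = [
    # Register the exact paths the checker requests (no trailing
    # slash), so APPEND_SLASH has no reason to issue a 301 redirect.
    path("liveness_check", health_check),
    path("readiness_check", health_check),
]
```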
The last thing I tried resulted in something quite interesting: I deleted all health check settings from the `app.yaml` files. The documentation (https://cloud.google.com/appengine/docs/flexible/custom-runtimes/configuring-your-app-with-app-yaml#updated_health_checks) states the following:

> By default, HTTP requests from health checks are not forwarded to your application container. If you want to extend health checks to your application, then specify a path for liveness checks or readiness checks. A customized health check to your application is considered successful if it returns a 200 OK response code.

This sounded like the overall VM was being checked, rather than the Docker image running inside it, and the deployment worked!

```
GET 200 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 200 0 B 3 ms GoogleHC/1.0 /liveness_check
```

But if the Docker container fails for some reason, Google App Engine wouldn't know there is an issue. We need to look into this scenario and see what it actually means; I couldn't find anything specifying it exactly. However, this allows us to do urgent deployments.

I also tested the following to skip the HTTPS redirects.

`settings/production.py`:

```python
SECURE_REDIRECT_EXEMPT = [
    # SecurityMiddleware matches these against the path with the
    # leading slash stripped, hence the optional '/?'.
    r'^/?cron/.*',
    r'^/?liveness_check/?$',
    r'^/?readiness_check/?$',
]
```

`app.yaml`:

```yaml
liveness_check:
  path: "/liveness_check/"
readiness_check:
  path: "/readiness_check/"
```

```
GET 301 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 301 0 B 3 ms GoogleHC/1.0 /liveness_check
```

Still a 301, so exempting the HTTPS redirect alone doesn't fix it; that suggests `APPEND_SLASH` is the likely source, since the checker requests the path without a trailing slash while our URLs expect one.

The last confusing thing I discovered was that the `codewof-dev` website's behaviour conflicts with documentation I had read. I can't find the documentation again, but I'm pretty sure it said that an App Engine instance will run either the old legacy health checks or the new split health checks. But the `codewof-dev` website is running both!

```
GET 200 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 200 0 B 3 ms GoogleHC/1.0 /liveness_check
```

Last discovery: I tested this morning by deleting all the health check configurations in the `app.yaml` files (as I had done previously), but this time also deleting all the custom health check URLs in our config URL routing file. The system deployed successfully with the following health checks:

```
GET 200 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 200 0 B 3 ms GoogleHC/1.0 /liveness_check
```

This seems to show that the App Engine VM instance has its own check, and it's not entering our Docker container. This would be fine for most GAE flexible instances, but not for the custom runtime option we are using.
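If we want to confirm whether the checks ever reach the application inside the container, one option (a sketch only; the class name is hypothetical and this isn't in our codebase) is a small Django middleware that logs every request from the GoogleHC user agent, so the Logs Viewer would show definitively whether checks enter the container:

```python
# Hypothetical diagnostic middleware - add to MIDDLEWARE in settings
# to log whether GoogleHC requests actually reach the Django app.
import logging

logger = logging.getLogger(__name__)


class HealthCheckLoggingMiddleware:
    """Log any request made by the GoogleHC health checker."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        user_agent = request.META.get("HTTP_USER_AGENT", "")
        if user_agent.startswith("GoogleHC"):
            logger.info("Health check reached the app: %s %s",
                        request.method, request.path)
        return self.get_response(request)
```

If the deployment succeeds with 200s in the health check logs but nothing appears from this logger, that would back up the conclusion that the checks stop at the VM.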