This was originally an internal message and may refer to some of our 
projects, but the background information should be useful, so I have 
left these references in.

We are having an issue with Google App Engine that is preventing us from 
making new deployments.

The error message is:

ERROR: (gcloud.app.deploy) Error Response: [4] Your deployment has failed 
to become healthy in the allotted time and therefore was rolled back. If 
you believe this was an error, try adjusting the 'app_start_timeout_sec' 
setting in the 'readiness_check' section.


This is a surprising error, especially as we hadn't had issues with this 
until recently. It appears our changes earlier this year to prepare for 
the new Google App Engine split health checks didn't actually work, so 
when the legacy system was deprecated on September 15th (mentioned here: 
https://cloud.google.com/appengine/docs/flexible/custom-runtimes/migrating-to-split-health-checks), 
no deployments have worked from that point on. The health check 
specification is listed here: 
https://cloud.google.com/appengine/docs/flexible/python/reference/app-yaml#liveness_path.
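
For reference, the full set of split health check settings that 
`app.yaml` accepts looks roughly like this. The values shown are the 
documented defaults as I read them from the reference above, so treat 
this as a sketch rather than our exact configuration:

liveness_check:
    path: "/liveness_check"
    check_interval_sec: 30
    timeout_sec: 4
    failure_threshold: 4
    success_threshold: 2
    initial_delay_sec: 300

readiness_check:
    path: "/readiness_check"
    check_interval_sec: 5
    timeout_sec: 4
    failure_threshold: 2
    success_threshold: 2
    app_start_timeout_sec: 300  # the 5 minute default mentioned below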

The error message references the `app_start_timeout_sec` setting; more 
details about it are available here: 
https://cloud.google.com/endpoints/docs/openapi/troubleshoot-aeflex-deployment. 
I didn’t think it was a timeout issue, since our system boots fairly 
quickly (well under the 5 minute default), so I investigated the logs of 
a version of the app (from now on I’m talking about the codeWOF 
production system unless specified otherwise). The versions page only 
listed the ‘working’ versions, but when I looked in the Logs Viewer, all 
the different versions were listed, including those that had failed.

With the following `app.yaml` the logs were showing this error:


liveness_check:
    path: "/gae/liveness_check"

readiness_check:
    path: "/gae/readiness_check"


Ready for new connections
Compiling message files
Starting gunicorn 19.9.0
Listening at: http://0.0.0.0:8080 (13)
Using worker: gevent
Booting worker with pid: 16
Booting worker with pid: 17
Booting worker with pid: 18
GET 301 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 301 0 B 3 ms GoogleHC/1.0 /liveness_check


This confirmed that the system had booted successfully and the checks 
were getting through, but they were returning the wrong code: a 301 
redirect instead of a 200. It also showed that the checks were going to 
the wrong URL, with no `/gae/` prefix.
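
In a standard Django setup there are two settings that can each turn one 
of these requests into a permanent redirect. Both names below are 
standard Django settings; I’m assuming our production config enables 
both:

# settings/production.py
# CommonMiddleware: a request for /liveness_check gets a 301 to
# /liveness_check/ when only the slashed URL pattern exists.
APPEND_SLASH = True

# SecurityMiddleware: any http:// request gets a 301 to https://.
SECURE_SSL_REDIRECT = True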

I believed the redirect was caused by either the `APPEND_SLASH` setting 
or the HTTP to HTTPS redirect. I tried the following configuration and 
got these results:

liveness_check:
    path: "/liveness_check/"

readiness_check:
    path: "/readiness_check/"


GET 301 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 301 0 B 3 ms GoogleHC/1.0 /liveness_check

Same error as above, so it appears that setting the custom path does not 
affect where the health check is sent. Searching for the custom path in all 
logging messages returns exactly one message (summary below):


2019-11-06 16:24:14.288 NZDT App Engine Create Version default:
20191106t032141
livenessCheck: { path: "/liveness_check/" }
readinessCheck: { path: "/readiness_check/" }
Resources: { cpu: 1 memoryGb: 3.75 }

So the first thing to look into is setting the custom path correctly; I 
couldn’t get this to change.
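
The deployed settings can also be checked from the command line. I’d 
expect `gcloud app versions describe` to report the same `livenessCheck` 
and `readinessCheck` fields as the Create Version entry above (output 
trimmed):

gcloud app versions describe 20191106t032141 --service=default --project=codewof

livenessCheck:
  path: /liveness_check/
readinessCheck:
  path: /readiness_check/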

I read all the StackOverflow posts talking about App Engine and split 
health checks (there were fewer than 10) and tried all the suggested 
fixes. These included:

   - Checking the split health check was set correctly using `gcloud app 
   describe --project codewof`.

   - Setting the split health checks (again) with `gcloud app update 
   --split-health-checks --project codewof`.


The last thing I tried resulted in something quite interesting: I 
deleted all the health check settings from the `app.yaml` files.

The documentation 
(https://cloud.google.com/appengine/docs/flexible/custom-runtimes/configuring-your-app-with-app-yaml#updated_health_checks) 
states the following:

> By default, HTTP requests from health checks are not forwarded to your 
> application container. If you want to extend health checks to your 
> application, then specify a path for liveness checks or readiness 
> checks. A customized health check to your application is considered 
> successful if it returns a 200 OK response code.


This sounded like the overall VM was being checked rather than the 
Docker container running inside of it, and indeed the deployment worked!


GET 200 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 200 0 B 3 ms GoogleHC/1.0 /liveness_check


But if the Docker container fails for some reason, Google App Engine 
won’t know there is an issue. We need to look into this scenario and see 
what it actually means; I couldn’t find anything specifying it exactly. 
However, this allows us to do urgent deployments.
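
In the meantime, the responses can be probed by hand. This is a sketch: 
the appspot domain is my assumption, and it won’t reproduce GoogleHC 
exactly, but the path and user agent match what appears in the logs:

# Print just the status code returned for the health check path.
curl -s -o /dev/null -w "%{http_code}\n" \
    -A "GoogleHC/1.0" http://codewof.appspot.com/liveness_check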

I also tested the following to exempt the health check paths from the 
HTTPS redirect.

`settings/production.py`


# Paths listed here are exempt from the SecurityMiddleware HTTPS
# redirect. Django matches these regexes against the path with its
# leading slash stripped, so the optional '/?' prefixes keep the
# patterns safe either way.
SECURE_REDIRECT_EXEMPT = [
    r'^/?cron/.*',
    r'^/?liveness_check/?$',
    r'^/?readiness_check/?$',
]



liveness_check:
    path: "/liveness_check/"

readiness_check:
    path: "/readiness_check/"



GET 301 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 301 0 B 3 ms GoogleHC/1.0 /liveness_check
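
Since the checks keep arriving at the slashless paths no matter what the 
`app.yaml` says, another angle (not yet tried, just a sketch assuming a 
standard Django URLconf and a hypothetical view) is to register 
slash-optional routes, so `APPEND_SLASH` never has to redirect the 
requests GoogleHC sends:

# urls.py
from django.http import HttpResponse
from django.urls import re_path

def health_check(request):
    return HttpResponse("ok")  # a plain 200 is all GoogleHC needs

urlpatterns = [
    # Match both /liveness_check and /liveness_check/ directly, so
    # CommonMiddleware never issues a 301 for the slashless form.
    re_path(r"^liveness_check/?$", health_check),
    re_path(r"^readiness_check/?$", health_check),
]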



The last confusing thing I discovered was that the `codewof-dev` 
website’s behaviour conflicts with documentation I had read. I can’t 
find the documentation again, but I’m pretty sure it said that an App 
Engine instance will run either the old legacy health checks or the new 
split health checks. Yet the `codewof-dev` website is running both!


GET 200 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 200 0 B 3 ms GoogleHC/1.0 /liveness_check
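
One way to confirm which mode a project is actually in should be the 
`featureSettings` block of the app description, though I’m recalling the 
exact output shape from memory:

gcloud app describe --project codewof-dev

featureSettings:
  splitHealthChecks: true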

Last discovery: I tested this morning by deleting all the health check 
configurations in the `app.yaml` files (as I had done previously), but 
this time I also deleted all the custom health check URLs from our URL 
routing configuration. The system deployed successfully with the 
following health checks:

GET 200 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 200 0 B 3 ms GoogleHC/1.0 /liveness_check

This seems to show that the App Engine VM instance has its own check, 
and it’s not entering our Docker container. That would be fine for most 
GAE flexible instances, but not for the custom runtime option we are 
using.
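
If we do reinstate container-level checks, the readiness view should 
exercise something real inside the container. A minimal sketch, with a 
hypothetical view name and assuming Django’s default database 
connection:

# A readiness view that actually exercises the container: if the app or
# its database connection is broken, the check fails with a 503 instead
# of reporting a misleading 200.
from django.db import connection
from django.http import HttpResponse

def readiness_check(request):
    try:
        connection.ensure_connection()  # cheap round-trip to the database
    except Exception:
        return HttpResponse(status=503)  # not ready
    return HttpResponse("ok")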
