Hello ladies and gentlemen, I am here to hopefully draw on some collective knowledge about App Engine and its intricacies.
For the last two weeks our company's site has been experiencing very odd latency issues, and having now tried about seven different approaches to solving it, we are exactly where we began: rising costs with performance much lower than before.

<https://lh3.googleusercontent.com/-THNMrlceFvM/V3TMvgEHvKI/AAAAAAAAQVk/FSZj42sZiCcRHRHmQuvSwM_mGYnbmsrYACLcB/s1600/search-console-latency.png>

Essentially, roughly 50-60% of our requests are served normally, but the remainder have extremely long "pauses" in the middle of the trace, during the "processing" phase of the backend handling (after the datastore and memcache data has been retrieved).

Here is an example of a single page that, within the space of an hour, had wildly different loading times for users. The vast majority of these requests did the same thing: grab three things from memcache and spit out the HTML retrieved from memcache. That's it.

<https://lh3.googleusercontent.com/-MAFuZARJRh4/V3TG1qjbBbI/AAAAAAAAQT0/xTVp4xN-VAkEpV4d12xzxf9q3UeUFjsQwCLcB/s1600/game-all-latencies.png>

And some individual traces to show what is happening:

<https://lh3.googleusercontent.com/-HA1u3SS8Y24/V3THA0kDf9I/AAAAAAAAQT8/eI6G77L6uOEU90Ahc0h2DTWVAiFO6zgtgCLcB/s1600/game-trace-1.png>
<https://lh3.googleusercontent.com/-qUoJUEVI8Fk/V3THDM4PbZI/AAAAAAAAQUE/vidZSsIRjvAXjPzicg1CKH8s9RQ03g0XwCLcB/s1600/game-trace-2.png>
<https://lh3.googleusercontent.com/-87zfyzosAU8/V3THJTYTUPI/AAAAAAAAQUU/DgjHOalOmcUipa_pGkY20F8eVwYa89m0QCLcB/s1600/game-trace-4.png>
<https://lh3.googleusercontent.com/-8qUv1v0IJ-U/V3THGAOd_iI/AAAAAAAAQUM/mwtbr1Ona_o0wWX1k-abL4TJzxiiGC8HgCLcB/s1600/game-trace-3.png>

These are the troubleshooting steps we took to figure out what was going wrong:

- Checked all code deployed in the week preceding and following the latency spike to ensure we hadn't let some truly horrendous, heavy code slip through the review process. Everything deployed around that period was rather light: backend/CMS updates, hardly anything touching customer-facing requests.
- Appstats, obviously. On the development server (and even on unloaded test versions on the production server) this behavior is not seen. Didn't help.
- Reduced unnecessary requests (figure 1). We noticed some of our ajax-loaded content was creating 2-3 additional, separate requests per user page load, so we refactored the code to call those things only when absolutely necessary, and eliminated one altogether. For the most part, a page load now equals one request. This had no effect on the latency spikes.
- Created a separate system that cut our backend task-based processing by 90%, so the average instance load dropped significantly. This had the opposite effect: average latency actually climbed, I suspect because of the extensive memcache use with large chunks of data (tracking what should be updated by the backend tasks).
- Separated the frontend and the task backend into modules/services so that frontend requests could have 100% instance attention. This had a small effect, but the spikes still happen regularly (as seen in the traces above).
- Played with max_idle_instances. This had the wonderful effect of *halving* our daily instance costs, with almost no effect on latency. When this is set to automatic, we get charged for a huge number of unused instances; it borders on ludicrous (figure 2).
- Played with max_concurrent_requests (8 -> 16 -> 10), which only made the latency issues worse.
- Hours and hours poring over logs, traces, and dashboard graphs.
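For reference, the hot path described above (grab three things from memcache, spit out the cached HTML) can be sketched roughly as below. The key names and the cache-client parameter are assumptions for illustration; on App Engine the client would be `google.appengine.api.memcache`, whose `get_multi` has this shape. Batching the three lookups into one `get_multi` call keeps it to a single RPC, which is why serial cache round trips seem an unlikely source of the pause.

```python
# Sketch of the hot path: three cached fragments -> one HTML response.
# Key names are hypothetical; `cache` stands in for
# google.appengine.api.memcache, which provides get_multi().

PAGE_KEYS = ["game:header", "game:body", "game:footer"]  # assumed names

def render_cached_page(cache, keys=PAGE_KEYS):
    """Fetch all fragments in one batched lookup and join them.

    Returns (html, missing_keys); missing_keys is non-empty when the
    cache has evicted something and a datastore rebuild is needed.
    """
    found = cache.get_multi(keys)      # one RPC instead of len(keys)
    missing = [k for k in keys if k not in found]
    if missing:
        return None, missing           # caller falls back to the datastore
    return "".join(found[k] for k in keys), []
```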
*Figure 1 (since the latency spike on June 5th, we have worked to reduce meaningless requests through API calls or task queuing)*

<https://lh3.googleusercontent.com/-IQpPZuv89HE/V3TLipqGFWI/AAAAAAAAQVI/Ud7Y3JURuUAZxkHIxrqTRO5fvwYfHkz7gCLcB/s1600/requests-trend.png>

*Figure 2 (14:40 is when the auto-scaling setting was deployed)*

<https://lh3.googleusercontent.com/-PXr6eiEz28E/V3TJ1cXXxcI/AAAAAAAAQUs/PRdTbdBb_uUYBH8AqQNGJE3xGvbw3J50ACLcB/s1600/instances-deploy-time.png>

What I have noticed is that when the CPU cycles spike, *so does the latency*. That would suggest our requests are starved for CPU time; however, now that we have deployed instance auto-scaling (and are paying for an average of around 8 instances vs. 4-5 previously), latency has not improved, which confuses me considerably. If all requests had slowed down, our code would clearly need optimization. If the rise in latency coincided with a change in our frontend processing, it would make sense, but only very light backend changes were deployed within +/- 2 days of the first latency spike (figure 3).

*Figure 3 (latency started rising on June 5th)*

<https://lh3.googleusercontent.com/-ndJeBJmfltI/V3TK6pWWA8I/AAAAAAAAQU8/DzLB6vH_EgwQG4FEvvejqw8w0uYDO4nVQCLcB/s1600/request-latency-spike-date.png>

Some other images that may assist in understanding the issue:

CPU cycles (today)
<https://lh3.googleusercontent.com/-Tl8nvTKddjk/V3TMWg6v02I/AAAAAAAAQVQ/ZEDNMzpteYo5kh6X10Woo0BvqoZgrVk0ACLcB/s1600/cpu-cycles-deploy-time.png>

CPU cycles (2 months)
<https://lh3.googleusercontent.com/-ynMp9wN2y0s/V3TMgrBeH3I/AAAAAAAAQVc/lLNMBrVvp_UUE2b_gT0txgG_tsnYoFpfACLcB/s1600/cycle-per-sec-no-module.png>

Can anyone out there offer some advice on where to poke, prod, or peer next? I have only been using App Engine for 1.5 years, but this company has been on the platform for about 4 years without these kinds of issues.
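For anyone wanting to reproduce the scaling experiments mentioned above: the knobs live in the `automatic_scaling` block of the service's `app.yaml`. The fragment below is only illustrative; `max_concurrent_requests: 10` matches the last value tried, while the `max_idle_instances` cap is a placeholder since the exact figure used isn't stated here.

```yaml
# Illustrative automatic_scaling block for the frontend service.
# Values are examples, not the exact settings deployed.
automatic_scaling:
  max_idle_instances: 3        # capping this (vs. automatic) halved daily instance cost
  max_concurrent_requests: 10  # tried 8 -> 16 -> 10; higher values made latency worse
```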
--
You received this message because you are subscribed to the Google Groups "Google App Engine" group. To view this discussion on the web visit https://groups.google.com/d/msgid/google-appengine/c4b79bc0-534e-48d4-8284-5a6e136e4350%40googlegroups.com.