[google-appengine] GAE Latency & Instance issues

Trevor Chinn Thu, 30 Jun 2016 00:46:23 -0700

Hello ladies and gentlemen, I am here to hopefully draw on some collective 
knowledge about App Engine and its intricacies.


For the last two weeks my company's site has been experiencing very odd 
latency issues. Having now tried about 7 different methods of solving 
them, we are left exactly where we began: rising costs, with performance 
that is much lower than before. 

<https://lh3.googleusercontent.com/-THNMrlceFvM/V3TMvgEHvKI/AAAAAAAAQVk/FSZj42sZiCcRHRHmQuvSwM_mGYnbmsrYACLcB/s1600/search-console-latency.png>


Essentially, roughly 50-60% of our requests are served normally, but the 
remainder have extremely long "pauses" in the middle of the trace, during 
the "processing" phase of the backend handling (after the datastore & 
memcache data has been retrieved). Here is an example of a single page 
that, in the space of an hour, had wildly different loading times for 
users. The vast majority of these requests did the same thing: grab 3 
things from memcache and spit out the HTML retrieved from memcache. 
That's it... 
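
One way to narrow down where such a pause sits is to time each phase of the handler explicitly and log the result per request. The following is only a minimal sketch, assuming a Python handler; `PhaseTimer`, `phase` and `slowest` are hypothetical names for illustration and are not part of App Engine or Appstats.

```python
import time
from contextlib import contextmanager


class PhaseTimer:
    """Collects wall-clock durations for named phases of one request."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock      # injectable so the timer can be tested
        self.durations = {}      # phase name -> seconds spent

    @contextmanager
    def phase(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Accumulate, so a phase entered twice is summed, not overwritten.
            elapsed = self._clock() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def slowest(self):
        """Return (name, seconds) for the longest phase, or None if empty."""
        if not self.durations:
            return None
        return max(self.durations.items(), key=lambda item: item[1])
```

In a handler, the memcache fetch could be wrapped in `with timer.phase("memcache"):` and the HTML output in `with timer.phase("render"):`, with `timer.durations` written to the request log at the end. If the logged phases add up to far less than the total request time, the pause is happening outside the application code that was timed.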

<https://lh3.googleusercontent.com/-MAFuZARJRh4/V3TG1qjbBbI/AAAAAAAAQT0/xTVp4xN-VAkEpV4d12xzxf9q3UeUFjsQwCLcB/s1600/game-all-latencies.png>

And here are some individual traces to show what is happening:

<https://lh3.googleusercontent.com/-HA1u3SS8Y24/V3THA0kDf9I/AAAAAAAAQT8/eI6G77L6uOEU90Ahc0h2DTWVAiFO6zgtgCLcB/s1600/game-trace-1.png>

<https://lh3.googleusercontent.com/-qUoJUEVI8Fk/V3THDM4PbZI/AAAAAAAAQUE/vidZSsIRjvAXjPzicg1CKH8s9RQ03g0XwCLcB/s1600/game-trace-2.png>

<https://lh3.googleusercontent.com/-87zfyzosAU8/V3THJTYTUPI/AAAAAAAAQUU/DgjHOalOmcUipa_pGkY20F8eVwYa89m0QCLcB/s1600/game-trace-4.png>

<https://lh3.googleusercontent.com/-8qUv1v0IJ-U/V3THGAOd_iI/AAAAAAAAQUM/mwtbr1Ona_o0wWX1k-abL4TJzxiiGC8HgCLcB/s1600/game-trace-3.png>




These are the troubleshooting steps we took to figure out what was 
going wrong: 

   - Checked all deployed code over the week preceding and following the 
   latency spike to ensure we hadn't let some truly horrendous, heavy code 
   slip through the review process. Everything deployed around that period was 
   rather light, backend/cms based updates, hardly anything touching 
   customer-facing requests. 
   - Used Appstats, obviously. The behavior does not appear on the 
   development server (or even on unloaded test versions on the production 
   server), so this didn't help. 
   - Reduced unnecessary requests (figure 1). We noticed some of our 
   AJAX-loaded content was creating 2-3 additional, separate requests per 
   user page load, so we refactored the code to make those calls only 
   when absolutely necessary, and eliminated one altogether. For the most 
   part, a page load now equals one request. This had no effect on the 
   latency spikes. 
   - Built a separate system that cut our backend task-based processing 
   by 90%, which significantly reduced the average instance load. This had 
   the opposite of the intended effect: average latency actually climbed, 
   I suspect because of the extensive memcache use with large chunks of 
   data (tracking what things should be updated by the backend tasks). 
   - Separated the front end and tasks-back-end into modules/services so 
   that frontend requests could have 100% instance attention. This had a small 
   effect, but the spikes are still regularly happening (as seen in the above 
   traces). 
   - Played with max_idle_instances. This had the wonderful effect of 
   *halving* our daily instance costs, with almost no effect on latency. 
   When this is set to automatic, we get charged for a huge number of unused 
   instances; it actually borders on the ludicrous (figure 2). 
   - Played with max_concurrent_requests (8->16->10), which only served to 
   make the latency issues worse. 
   - Spent hours and hours poring over logs, traces and dashboard graphs. 


*Figure 1 (Since the latency spike on June 5th, we have worked to reduce 
meaningless requests through API calls or task queuing)*

<https://lh3.googleusercontent.com/-IQpPZuv89HE/V3TLipqGFWI/AAAAAAAAQVI/Ud7Y3JURuUAZxkHIxrqTRO5fvwYfHkz7gCLcB/s1600/requests-trend.png>

*Figure 2 (14:40 is when the auto-scaling setting was deployed)*

<https://lh3.googleusercontent.com/-PXr6eiEz28E/V3TJ1cXXxcI/AAAAAAAAQUs/PRdTbdBb_uUYBH8AqQNGJE3xGvbw3J50ACLcB/s1600/instances-deploy-time.png>

What I have noticed is that when the CPU cycles spike, *so does the latency*. 
That would suggest our requests are starved for CPU time. However, now that 
we have deployed the instance auto scaling (and are paying for an average 
of around 8 instances vs 4-5 previously), latency has not improved, which 
confuses me considerably. 
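
To put a number on how closely the two graphs move together, rather than judging it by eye, the per-minute CPU and latency series could be exported and correlated. This is a rough sketch only: the `pearson` helper is written out by hand, and the sample values are made up for illustration and are not data from this app.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


# Made-up per-minute samples, for illustration only.
cpu_cycles = [1.0, 1.2, 3.5, 1.1, 3.8, 1.0]
latency_ms = [120, 130, 900, 125, 1100, 118]
r = pearson(cpu_cycles, latency_ms)
```

A value of `r` close to 1 would confirm that the spikes coincide, but it would not say which causes which: slow requests that sit busy for longer would also show up as extra CPU cycles.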

If it were all requests that had slowed down, our code would clearly need 
optimization. If the rise in latency coincided with a change in our 
frontend processing, it would make sense, but there were only very light 
backend changes deployed within +/- 2 days of the first latency spike 
(figure 3).

*Figure 3 - Latency started rising on June 5th*

<https://lh3.googleusercontent.com/-ndJeBJmfltI/V3TK6pWWA8I/AAAAAAAAQU8/DzLB6vH_EgwQG4FEvvejqw8w0uYDO4nVQCLcB/s1600/request-latency-spike-date.png>
 
Some other images that may assist in understanding the issue:

CPU Cycles (today)

<https://lh3.googleusercontent.com/-Tl8nvTKddjk/V3TMWg6v02I/AAAAAAAAQVQ/ZEDNMzpteYo5kh6X10Woo0BvqoZgrVk0ACLcB/s1600/cpu-cycles-deploy-time.png>

CPU Cycles (2 month)

<https://lh3.googleusercontent.com/-ynMp9wN2y0s/V3TMgrBeH3I/AAAAAAAAQVc/lLNMBrVvp_UUE2b_gT0txgG_tsnYoFpfACLcB/s1600/cycle-per-sec-no-module.png>


Is there anyone out there who can proffer some advice on where to poke, 
prod or peer next? I have only been using App Engine for 1.5 years now, but 
this company has been on the platform for about 4 years without these kinds 
of issues.

