Thanks Brandon. Many of the DeadlineExceededErrors occurred during warmup requests; according to the stack traces, they were raised inside Python import statements. I upped the number of idle instances in an attempt to mitigate this sort of thrashing, and your advice makes sense for this case. Our pending latency is set to 'Automatic' on both ends.
I'm attaching some graphs from the period when this was at its worst.

Instances: <https://lh4.googleusercontent.com/--AtYMbWJ4ek/TxGNT3nfp0I/AAAAAAAAUuE/hTlZm78Mc08/s1600/Screen%252520Shot%2525202012-01-14%252520at%2525209.08.59%252520AM.png>

Requests per second: <https://lh6.googleusercontent.com/-LoIlwGhvLrA/TxGOnvzGmSI/AAAAAAAAUuc/Sg07YssPK_4/s1600/Screen%252520Shot%2525202012-01-14%252520at%2525209.17.39%252520AM.png>

Milliseconds per request: <https://lh5.googleusercontent.com/-A76zVs8CCEo/TxGNZ9kcpfI/AAAAAAAAUuQ/w20AuPvgw50/s1600/Screen%252520Shot%2525202012-01-14%252520at%2525209.09.41%252520AM.png>

These suggest that some higher-latency handlers were hit (some people were editing content) and tied up the existing front-end instances, after which GAE tried to spin up dynamic instances to serve the other requests. But during warmup there were DeadlineExceededErrors during file imports, suggesting that the dynamic instances aren't being given enough time to warm up. Increasing the idle instances helps.

So perhaps the revised question, at least for our particular situation, is: why, under load, do the dynamic instances time out during warmup? That seems to compound the problem: the dynamic instances can't serve the requests that are backed up, which leads to user-visible 500 errors and still more attempts to start dynamic instances.

Does my theory have any holes? Is relying on dynamic instances to handle spikes without 500 errors unrealistic? I know the docs state, "A smaller number of idle Instances means your application costs less to run, but may encounter more startup latency during load spikes." but thrashing on DeadlineExceededErrors during warmup seems to indicate that dynamic instances can't be relied upon for load spikes at all right now.