[google-appengine] Re: 502 Server Errors after upgrade?

Brett Slatkin Tue, 10 Mar 2009 17:06:01 -0700

Hi Nick,

Sorry for the delay here. Been working hard these last few days to
resolve these latency issues, which are looking good now. I'd like to
help you figure out the issue here, so please bear with me since I'm
not looking at your app's source code. =)

I believe Marzia stated elsewhere that the once you get into the
100ms+ range for CPU time (runtime only, not including APIs) you will
begin to see this prioritization and additional latency at the request
level. This could account for some of the variability you have seen.
Another thing to keep in mind is the active dynamic requests limit
explained here:

http://code.google.com/appengine/docs/quotas.html#Request_Limits

"An application operating entirely within the free quotas can process
around 30 active dynamic requests simultaneously. This means that an
app whose average server-side request processing time is 75
milliseconds can serve up to (1000 ms/second / 75 ms/request) * 30 =
400 requests/second, independent of the quota system, without
incurring any additional latency. Requests for static files are not
affected by this limit. Applications that are heavily CPU-bound, on
the other hand, may incur some additional latency in long-running
requests in order to make room for other apps sharing the same
servers."

We maximize the throughput of these dynamic requests as much as
possible, taking advantage of App Caching
(http://code.google.com/appengine/docs/python/runtime.html#App_Caching).
This is what enables high load applications to serve at 75ms of
runtime CPU per request. If you're not using App Caching, you could
see a significant impact on your application's latency.

Another thing to think about is how API call latency affects overall
throughput. For example, if you execute a series of Datastore queries
to retrieve many entities, you may be looking at 200-400ms of latency
for a query (or more, depending on the shape of your data). Now
connect this number with the runtime CPU and wall-clock time and
you're looking at 450-650ms of wall-clock time minimum to execute a
single request. Doing simple math you can get your maximum throughput
per instance per second: 1 request/650ms = ~1.5 requests per second
per instance. What if you're doing three of these queries per request?
Then we're looking at a throughput of less than 1 request per second
per instance.

Going back to it: There are around 30 active instances available to
your application. We do our best to maximize your use of these
depending on your load and your application's throughput. But one
thing to remember is the faster your app is (in wall-clock latency),
the better we can spread its load across these 30 instances. With
requests that may take 1 second or more to complete, you may see more
variability in your app's latency and throughput. I believe that's
part of what's going on here.

The reason is that we scale your instances to match your sustained
throughput. If your average request takes 500ms and you're sustaining
10 requests per second, then you can serve this load easily using only
5 instances. However, if ever so often you get a request that takes
2-5 seconds, then you're going to see some extra latency as the faster
500ms requests slow down for the big guy to go through. If bigger
requests keep coming through, things should balance back out in a
short amount of time; but if the bigger workload is relatively
infrequent this could show up as latency spikes.

There are a few solutions to this. One big one is caching. I assume
you're already doing that, but if you can get any more to reduce
latency, that will definitely help. Another is to profile your
application to find where you spend the majority of your wall-clock
latency (see http://code.google.com/appengine/kb/commontasks.html#profiling).
Precomputing as much information always helps (that's the App Engine
way). But another thing that could help a lot is using background
processing (when that feature is ready) to do the precomputing in the
background; this could isolate you from increased latency you'll see
when more expensive requests go through the system.

Hopefully this information has helped a bit. Please let me know if you
have any questions. I still would very much appreciate your help in
tracking down the latency you're experiencing so we can figure out the
root cause. Thanks,

-Brett

On Tue, Mar 3, 2009 at 12:05 PM, Nick Winter <livel...@gmail.com> wrote:
>
> Yes, I'm still seeing problems. We've been averaging 4 seconds per
> request for the past six hours, roughly correlated to the elevated
> latency that you say is within tolerance and is averaging 250 ms per
> request. Some of that is due to a recently identified performance
> issue with one of our common handlers being identified as a "high CPU"
> handler and getting sidelined to make room for more performant apps'
> handlers. We haven't been able to fix that yet, but it's not just that
> handler: it's every dynamic handler.
>
> This isn't just an issue since last night, though; it's been on and
> off for two months. We're still seeing widely variable response times
> on almost every request. In the admin console, for example, most
> requests take a normal second or so, but a fifth of them or so take
> between 7-20 seconds to respond. This is stuff like expanding one log
> entry with no log messages in it. I know that's not in my code. Is it
> instance startup costs? I think so. From my logs, I'll load something
> simple like a FAQ page, and it's fine, whenever I have to start
> another instance and import Django and some views, well, that's 7-14
> seconds to do it. Have we changed anything in the past two months? A
> couple very small things, but mostly our development has been on
> another branch. It goes between "working just fine on every request"
> and "several seconds to spawn an instance running the same code as
> usual" from day to day and time to time. It's just been getting more
> frequent and worse as time goes on.
>
> The simplest handler we have, which imports our models, reads one
> thing at random from the datastore, and returns it, with no web
> framework, normally runs in 50-250 ms and takes 50-80 ms of cpu, but
> sometimes (new instance?) takes 1300-2000 ms and 130-300 ms cpu. But I
> made an even simpler handler as a test, with no import, that just
> prints a random number. Starting a new instance for that only takes a
> 70-250 ms response time. And if I make that import our models every
> time, it doesn't change anything. It's almost like the cost of
> starting a new instance is multiplied by the cost of the instantiation
> code is multiplied by whether App Engine is feeling cheerful that
> hour.
>
> It could be our code, I could see that. If there's a reason why the
> admin console would be so slow at the same times and in the same way
> as the rest of our site, maybe it would shed some light. If we're
> doing something subtly wrong when we import Django (0.96) and our
> models, I could see that, but only if there's a reason why we wouldn't
> see it sometimes and would at other times, and why it would have been
> getting worse without any code changes.
>
> It would also help a lot to know exactly how this works (from the
> Quotas page):
> "Applications that are heavily cpu-bound, on the other hand, may incur
> some additional latency in long-running requests in order to make room
> for other apps sharing the same servers."
> Marzia has said that this is per-request, so I'm assuming that certain
> handlers will get sidelined, but this is happening site-wide --
> otherwise, it would make perfect sense. How is a handler identified as
> high-cpu? How long does that classification last?
>
> Thanks,
> --Nick (app id: skrit)
>
>
> On Mar 3, 1:25 pm, Brett Slatkin <brett-appeng...@google.com> wrote:
>> Hi Nick,
>>
>> On Tue, Mar 3, 2009 at 8:55 AM, Nick Winter <livel...@gmail.com> wrote:
>>
>> >http://code.google.com/status/appengine/detail/serving/2009/03/03#ae-...
>>
>> > Just about every day for the past several weeks, there's been elevated
>> > latency like this, usually at similar times of day. It was unfortunate
>> > and frustrating before, but since last night our development is
>> > stalled because every part of App Engine is too slow to do any testing
>> > or data manipulation right now. 5 seconds per request?
>>
>> > I'm confident that the App Engine team will get a handle on the
>> > performance and everything will be shiny once more, but it'd be nice
>> > to hear some word as to what's going on. Are the servers just
>> > overloaded? Did something go wrong with the maintenance last night? Is
>> > anomaly-yellow serving to be expected?
>>
>> We had some unexpected issues during the maintenance last night which
>> caused elevated latencies and errors for all applications. We resolved
>> the issue around 8:45pm last night and things have returned to normal
>> since. Please let me know if you're still seeing any problems.
>>
>> As for the elevated latency for the dynamic request metric (that you
>> linked to), this is primarily a product of alert tolerances. We're
>> still tuning our status site metrics to match real-world expectations
>> of App Engine performance. You'll notice today that we've raised some
>> of these tolerances by a little bit, causing many of the lines to go
>> back to a blue color (i.e., everything OK).
>>
>> -Brett
> >
>

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To post to this group, send email to google-appengine@googlegroups.com
To unsubscribe from this group, send email to 
google-appengine+unsubscr...@googlegroups.com
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en
-~----------~----~----~----~------~----~------~--~---

[google-appengine] Re: 502 Server Errors after upgrade?

Reply via email to