Lest you think it's something specific to deferred, I get bursts on other URLs as well. My examples often have lots of deferred in them because that is by far our most common URL. Here is another example:
09-23 01:25PM 27.093 /mgmt/search/start/ 500 29684ms
09-23 01:23PM 55.101 /mgmt/search/start/ 500 29973ms
09-23 01:22PM 25.100 /mgmt/search/start/ 500 29746ms
09-23 01:20PM 44.929 /mgmt/search/start/ 500 40098ms
09-23 01:19PM 15.158 /mgmt/search/start/ 500 29602ms

Here's an example from a test deployment of the same app, with much lower QPS (appid: steprep-demo). Note the blend of 30-s and 10-s timeouts.

09-23 01:27PM 27.743 /_ah/queue/deferred 500 37833ms
09-23 01:27PM 42.985 /_ah/queue/deferred 500 10227ms
09-23 01:27PM 42.972 /exec-reports/queue/increment-daily-data-counter 500 10237ms
09-23 01:27PM 37.157 /exec-reports/queue/increment-all-time-data-counter 500 10052ms
09-23 01:27PM 36.758 /mgmt/jobs/put/deferred/?jobId=478f5f7f-fb3f-4974-b033-18d1ef75902a 500 10102ms
09-23 01:27PM 36.187 /exec-reports/queue/increment-daily-data-counter 500 10087ms
09-23 01:27PM 35.626 /mgmt/jobs/put/deferred/?jobId=2efd0717-a59f-446c-b1f9-4a2200420f95 500 10002ms
09-23 01:27PM 26.337 /_ah/queue/deferred 500 10077ms
09-23 01:27PM 26.146 /exec-reports/queue/increment-daily-data-counter 500 10006ms

/mgmt/jobs/put/deferred/, in particular, does 3 db.get(), 2 db.put(), and 1 db.delete(), all with single entities/keys - no other APIs are used. Though in this example, that's somewhat irrelevant because it's hitting the 10-s boundary, not the 30-s.
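For reference, the whole handler is roughly this shape. This is only a minimal sketch: the model names, properties, and handler class below are made up for illustration, not our actual code.

    from google.appengine.ext import db, webapp
    from google.appengine.ext.webapp.util import run_wsgi_app

    class Job(db.Model):          # placeholder models; the real kinds/properties differ
        state = db.StringProperty()

    class Counter(db.Model):
        count = db.IntegerProperty(default=0)

    class Lock(db.Model):
        pass

    class PutDeferredJobHandler(webapp.RequestHandler):
        def post(self):
            job_id = self.request.get('jobId')

            # 3 db.get() calls, each against a single key
            job = db.get(db.Key.from_path('Job', job_id))
            counter = db.get(db.Key.from_path('Counter', 'daily'))
            lock = db.get(db.Key.from_path('Lock', job_id))

            # 2 db.put() calls, each with a single entity
            job.state = 'queued'
            db.put(job)
            counter.count += 1
            db.put(counter)

            # 1 db.delete() on a single entity/key
            db.delete(lock)

    application = webapp.WSGIApplication(
        [('/mgmt/jobs/put/deferred/', PutDeferredJobHandler)])

    def main():
        run_wsgi_app(application)

    if __name__ == '__main__':
        main()

The point is just that every datastore call touches a single entity or key; there are no queries and no other APIs in this handler.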
j

On Sep 23, 2:25 pm, Jason C <jason.a.coll...@gmail.com> wrote:
> Yes, in the /_ah/queue/deferred requests, we are almost certainly
> doing datastore/memcache work and very likely doing urlfetch work (I'm
> referring to this probabilistically, because we use deferred for many
> things).
>
> BTW, appid steprep is Python.
>
> j
>
> On Sep 23, 2:21 pm, "Ikai Lan (Google)" <ikai.l+gro...@google.com> wrote:
> > Jason, are you calling URLFetch/datastore/Memcache in that action?
> >
> > So far there are no reports of Java applications being impacted. We usually
> > have a bit more flexibility when it comes to tearing down and spinning up
> > Python instances, so I'm wondering if this is related.
> >
> > --
> > Ikai Lan
> > Developer Programs Engineer, Google App Engine
> > Blogger: http://googleappengine.blogspot.com
> > Reddit: http://www.reddit.com/r/appengine
> > Twitter: http://twitter.com/app_engine
> >
> > On Thu, Sep 23, 2010 at 4:17 PM, Jason C <jason.a.coll...@gmail.com> wrote:
> > > When I refer to bursts of errors, this is an example of what I mean.
> > > This is a query for any 500-series errors against appid: steprep. Note
> > > how close together the log entries are (i.e., the bursty-ness). These
> > > requests operate very quickly when App Engine is performing well.
> > >
> > > 09-23 01:07PM 42.459 /_ah/queue/deferred 500 29768ms
> > > 09-23 01:07PM 42.084 /_ah/queue/deferred 500 29485ms
> > > 09-23 01:07PM 38.017 /_ah/queue/deferred 500 29523ms
> > > 09-23 01:07PM 33.636 /_ah/queue/deferred 500 29417ms
> > > 09-23 01:07PM 32.176 /_ah/queue/deferred 500 29787ms
> > > 09-23 01:07PM 30.812 /_ah/queue/deferred 500 29451ms
> > > 09-23 01:07PM 25.477 /_ah/queue/deferred 500 29626ms
> > > 09-23 01:07PM 24.605 /_ah/queue/deferred 500 30025ms
> > > 09-23 01:07PM 14.422 /mgmt/search/start/ 500 40086ms
> > > 09-23 01:07PM 24.204 /_ah/queue/deferred 500 29768ms
> > > 09-23 01:07PM 20.746 /_ah/queue/deferred 500 29968ms
> > > 09-23 01:07PM 30.383 /exec-reports/queue/increment-all-time-data-counter 500 10003ms
> > > 09-23 01:05PM 47.180 /_ah/queue/deferred 500 30070ms
> > > 09-23 01:05PM 42.770 /mgmt/search/start/ 500 30053ms
> > > 09-23 01:04PM 55.080 /_ah/queue/deferred 500 29404ms
> > > 09-23 01:04PM 49.054 /_ah/queue/deferred 500 29589ms
> > >
> > > Anything around the 30-s mark is a DeadlineExceededError; anything
> > > around the 10-s mark is the "Request was aborted after waiting too
> > > long to attempt to service your request...." error.
> > >
> > > These bursts are _NOT_ rare; I can find them at will, including
> > > similar bursts of the 10-s style:
> > >
> > > 09-23 12:52PM 07.445 /_ah/queue/deferred 500 10006ms
> > > 09-23 12:52PM 06.510 /_ah/queue/deferred 500 10016ms
> > > 09-23 12:52PM 06.303 /_ah/queue/deferred 500 10003ms
> > > 09-23 12:52PM 05.902 /_ah/queue/deferred 500 10017ms
> > > 09-23 12:52PM 05.641 /_ah/queue/deferred 500 10017ms
> > > 09-23 12:52PM 04.872 /_ah/queue/deferred 500 10002ms
> > > 09-23 12:52PM 04.211 /_ah/queue/deferred 500 10030ms
> > > 09-23 12:52PM 03.759 /_ah/queue/deferred 500 10029ms
> > > 09-23 12:52PM 03.652 /_ah/queue/deferred 500 10021ms
> > >
> > > I am running out of ideas of what I can do at my end to address this
> > > issue. Does anyone from Google have suggestions? Is this an
> > > infrastructure / autoscaling problem, or is it a problem at my end?
> > >
> > > On Sep 23, 11:06 am, Jason C <jason.a.coll...@gmail.com> wrote:
> > > > We get lots of bursts of loading failures. Most recently: 9.45a to
> > > > 9.47a log time (2010-09-23). Appid: steprep
> > > >
> > > > We also get lots of ongoing 10-second timeouts. Using the logs, it
> > > > seems to me that our user-facing requests are below the 1000ms
> > > > threshold, so I don't know why we get so many 10-second timeouts
> > > > (i.e., the "Request was aborted after waiting too long to attempt to
> > > > service your request. This may happen sporadically when the App Engine
> > > > serving cluster is under unexpectedly high or uneven load. If you see
> > > > this message frequently, please contact the App Engine team." error).
> > > >
> > > > Our QPS varies throughout the day from 1 to around 12, though we have
> > > > jobs that introduce short spikes of 50-60.
> > > >
> > > > j
> > > >
> > > > On Sep 23, 10:39 am, Eli Jones <eli.jo...@gmail.com> wrote:
> > > > > My app is extremely lightweight.
> > > > >
> > > > > The average CPU usage for a cold start falls in these ranges:
> > > > >
> > > > > CPU_MS: 1030 - 1230
> > > > > API_CPU_MS: 408
> > > > >
> > > > > The rough outline of the code that runs is:
> > > > >
> > > > > 1. Pull credentials from memcache and decrypt.
> > > > > 2. Connect to the gdata service, and get some data.
> > > > > 3. Create 4 new entities (of the same Model type) containing this fetched data
> > > > > and append them to a list.
> > > > > 4. Stick those entities in memcache as protobufs.
> > > > > 5. Create a new entity (for a different Model) and append it to the list used in
> > > > > Step 3.
> > > > > 6. db.put(list) to the datastore (or keep trying until success).
> > > > > 7. If necessary, add a task to run at the next 5-minute mark.
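> > > > > In code, that outline looks roughly like the sketch below. It is only a sketch:
> > > > > the model names, memcache keys, URL, and the bare urlfetch standing in for the
> > > > > gdata client call are all placeholders, not my actual code.
> > > > >
> > > > >     import datetime
> > > > >
> > > > >     from google.appengine.api import memcache, urlfetch
> > > > >     from google.appengine.api.labs import taskqueue  # taskqueue is under labs in current SDKs
> > > > >     from google.appengine.ext import db
> > > > >
> > > > >     class Quote(db.Model):            # placeholder model for the 4 fetched entities
> > > > >         symbol = db.StringProperty()
> > > > >         price = db.FloatProperty()
> > > > >
> > > > >     class DailySummary(db.Model):     # placeholder for the one entity of a different model
> > > > >         day = db.DateProperty()
> > > > >         note = db.StringProperty()
> > > > >
> > > > >     def run_chained_task():
> > > > >         # 1. Pull credentials from memcache (decryption omitted in this sketch).
> > > > >         credentials = memcache.get('gdata_credentials')
> > > > >
> > > > >         # 2. Get some data from the external service; a bare urlfetch stands in
> > > > >         #    for the real gdata client call.
> > > > >         result = urlfetch.fetch('https://example.com/feed?token=%s' % credentials,
> > > > >                                 deadline=5)
> > > > >
> > > > >         # 3. Create 4 new entities of the same model from the fetched data.
> > > > >         entities = [Quote(symbol='SYM%d' % i, price=float(i)) for i in range(4)]
> > > > >
> > > > >         # 4. Stick those entities in memcache as protobufs.
> > > > >         for e in entities:
> > > > >             memcache.set('quote:%s' % e.symbol, db.model_to_protobuf(e).Encode())
> > > > >
> > > > >         # 5. Create one entity of a different model and append it to the same list.
> > > > >         entities.append(DailySummary(day=datetime.date.today(),
> > > > >                                      note=str(result.status_code)))
> > > > >
> > > > >         # 6. Put the whole list in one call, retrying until it succeeds.
> > > > >         while True:
> > > > >             try:
> > > > >                 db.put(entities)
> > > > >                 break
> > > > >             except db.Error:
> > > > >                 pass
> > > > >
> > > > >         # 7. If necessary, enqueue the next run at roughly the next 5-minute mark.
> > > > >         taskqueue.add(url='/tasks/chained', countdown=300)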
> > > > > So.. as long as memcache is working and the encrypted credentials are
> > > > > there.. there are not any Gets from the datastore. There is just the Put
> > > > > that is putting 4 entities of one model type and 1 entity of another model
> > > > > type all at once.
> > > > >
> > > > > timezone from pytz is used for correct timezone calculations at the start of
> > > > > the day.. to make sure that the initial task starts at the correct time.
> > > > > (At this point I could move the "from pytz import timezone" down to just
> > > > > the function that calculates the initial start delay.. but I avoid making
> > > > > changes to this code unless it's really necessary; timezone was used in more
> > > > > than one place in the past.. so it was imported at the top of the file.)
> > > > >
> > > > > On Thu, Sep 23, 2010 at 12:20 PM, Ikai Lan (Google) <ikai.l+gro...@google.com> wrote:
> > > > > > It sounds like these are impacting Python developers - I haven't heard any
> > > > > > Java app developers chime in yet.
> > > > > >
> > > > > > What's the ballpark of the QPS you guys are doing? <1qps, 10qps, 100qps,
> > > > > > etc?
> > > > > >
> > > > > > --
> > > > > > Ikai Lan
> > > > > > Developer Programs Engineer, Google App Engine
> > > > > > Blogger: http://googleappengine.blogspot.com
> > > > > > Reddit: http://www.reddit.com/r/appengine
> > > > > > Twitter: http://twitter.com/app_engine
> > > > > >
> > > > > > On Thu, Sep 23, 2010 at 12:15 PM, Eli Jones <eli.jo...@gmail.com> wrote:
> > > > > >> Also, a cold start for this task can take as little as 960 ms (where it
> > > > > >> shows the log detail: "This request caused a new process to be started for
> > > > > >> your application...").
> > > > > >>
> > > > > >> And, the task is doing a URLFetch, so.. I'm sure that is taking up a chunk
> > > > > >> of the running time.
> > > > > >>
> > > > > >> One of my imports is pytz (no django or anything like that though).. so
> > > > > >> that could also be of interest.
> > > > > >>
> > > > > >> On Thu, Sep 23, 2010 at 12:10 PM, Eli Jones <eli.jo...@gmail.com> wrote:
> > > > > >>> My app is in Python. The only thing it does between 6:30 AM PST and 1:00
> > > > > >>> PM PST is run a single chained task. The task just runs once every 5
> > > > > >>> minutes. (There are no other backend processes or users.)
> > > > > >>>
> > > > > >>> Yesterday at 11:00 AM PST and 11:02 AM PST, the task experienced the
> > > > > >>> 10-second, 500 error (with 0 CPU_MS used).. then the task successfully
> > > > > >>> retried at 11:02:55 AM PST.
> > > > > >>>
> > > > > >>> There was another one (not mentioned in your list of times for yesterday)
> > > > > >>> at 7:45 AM PST. Failed after 10 seconds with a 500 error and 0 CPU_MS used.
> > > > > >>> The task successfully retried at 7:47 AM PST. (For some reason, the tasks in
> > > > > >>> my app wait 120 seconds for the first retry.)
> > > > > >>>
> > > > > >>> For some reason, the log doesn't contain the message that usually goes
> > > > > >>> with this error (there is no detailed log message):
> > > > > >>>
> > > > > >>> "Request was aborted after waiting too long to attempt to service your
> > > > > >>> request. This may happen sporadically when the App Engine serving cluster is
> > > > > >>> under unexpectedly high or uneven load. If you see this message frequently,
> > > > > >>> please contact the App Engine team."
> > > > > >>>
> > > > > >>> But, it has the exact same fingerprint. (It just seems that the logs are
> > > > > >>> trimming the detailed messages faster than usual.)
> > > > > >>>
> > > > > >>> This chained task runs 79 times each day (this would be 82 requests
> > > > > >>> counting these 3 errors).. so these errors accounted for 3.66% of the
> > > > > >>> requests during the day yesterday.
> > > > > >>>
> > > > > >>> On Thu, Sep 23, 2010 at 11:19 AM, Ikai Lan (Google) <ikai.l+gro...@google.com> wrote:
> > > > > >>>> Hey guys,
> > > > > >>>>
> > > > > >>>> We've been tracking latency spikes and intermittent batches of aborted
> > > > > >>>> requests over the last 24-hour period. There were at least 3 periods of
> > > > > >>>> spikes yesterday:
> > > > > >>>>
> > > > > >>>> ~1100 Pacific (moderate, brief period of high)
> > > > > >>>> ~1300 Pacific (small)
> > > > > >>>> ~1530 Pacific (small)
> > > > > >>>>
> > > > > >>>> Community reports seem to indicate that these are related to loading
> > > > > >>>> requests. If you are being affected, can you report your environment,
> > > > > >>>> language and if these times correlate with your experiences in this thread?
> > > > > >>>> Are there periods that this particular report did not cover? Please provide
> > > > > >>>> measurements, times and application IDs.
> > > > > >>>>
> > > > > >>>> --
> > > > > >>>> Ikai Lan
> > > > > >>>> Developer Programs Engineer, Google App Engine...