Lest you think it's something specific to deferred, I get bursts on other URLs as well. My examples often have lots of deferred in them because that is by far our most common URL. Here is another example:
09-23 01:25PM 27.093 /mgmt/search/start/ 500 29684ms
09-23 01:23PM 55.101 /mgmt/search/start/ 500 29973ms
09-23 01:22PM 25.100 /mgmt/search/start/ 500 29746ms
09-23 01:20PM 44.929 /mgmt/search/start/ 500 40098ms
09-23 01:19PM 15.158 /mgmt/search/start/ 500 29602ms

Here's an example from a test deployment of the same app, with much lower QPS (appid: steprep-demo). Note the blend of 30-s and 10-s timeouts.

09-23 01:27PM 27.743 /_ah/queue/deferred 500 37833ms
09-23 01:27PM 42.985 /_ah/queue/deferred 500 10227ms
09-23 01:27PM 42.972 /exec-reports/queue/increment-daily-data-counter 500 10237ms
09-23 01:27PM 37.157 /exec-reports/queue/increment-all-time-data-counter 500 10052ms
09-23 01:27PM 36.758 /mgmt/jobs/put/deferred/?jobId=478f5f7f-fb3f-4974-b033-18d1ef75902a 500 10102ms
09-23 01:27PM 36.187 /exec-reports/queue/increment-daily-data-counter 500 10087ms
09-23 01:27PM 35.626 /mgmt/jobs/put/deferred/?jobId=2efd0717-a59f-446c-b1f9-4a2200420f95 500 10002ms
09-23 01:27PM 26.337 /_ah/queue/deferred 500 10077ms
09-23 01:27PM 26.146 /exec-reports/queue/increment-daily-data-counter 500 10006ms

/mgmt/jobs/put/deferred/, in particular, does 3 db.get(), 2 db.put(), and 1 db.delete(), all with single entities/keys - no other APIs are used. Though in this example, that's somewhat irrelevant because it's hitting the 10-s boundary, not the 30-s.
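For reference, the whole handler is roughly this shape. This is only a minimal sketch: the model names, properties, and handler class below are made up for illustration, not our actual code.

    from google.appengine.ext import db, webapp
    from google.appengine.ext.webapp.util import run_wsgi_app

    class Job(db.Model):          # placeholder models; the real kinds/properties differ
        state = db.StringProperty()

    class Counter(db.Model):
        count = db.IntegerProperty(default=0)

    class Lock(db.Model):
        pass

    class PutDeferredJobHandler(webapp.RequestHandler):
        def post(self):
            job_id = self.request.get('jobId')

            # 3 db.get() calls, each against a single key
            job = db.get(db.Key.from_path('Job', job_id))
            counter = db.get(db.Key.from_path('Counter', 'daily'))
            lock = db.get(db.Key.from_path('Lock', job_id))

            # 2 db.put() calls, each with a single entity
            job.state = 'queued'
            db.put(job)
            counter.count += 1
            db.put(counter)

            # 1 db.delete() on a single entity/key
            db.delete(lock)

    application = webapp.WSGIApplication(
        [('/mgmt/jobs/put/deferred/', PutDeferredJobHandler)])

    def main():
        run_wsgi_app(application)

    if __name__ == '__main__':
        main()

The point is just that every datastore call touches a single entity or key; there are no queries and no other APIs in this handler.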
j

On Sep 23, 2:25 pm, Jason C <jason.a.coll...@gmail.com> wrote:
> Yes, in the /_ah/queue/deferred requests, we are almost certainly
> doing datastore/memcache work and very likely doing urlfetch work (I'm
> referring to this probabilistically, because we use deferred for many
> things).
>
> BTW, appid steprep is Python.
>
> j
>
> On Sep 23, 2:21 pm, "Ikai Lan (Google)" <ikai.l+gro...@google.com> wrote:
> > Jason, are you calling URLFetch/datastore/Memcache in that action?
> >
> > So far there are no reports of Java applications being impacted. We usually
> > have a bit more flexibility when it comes to tearing down and spinning up
> > Python instances, so I'm wondering if this is related.
> >
> > --
> > Ikai Lan
> > Developer Programs Engineer, Google App Engine
> > Blogger: http://googleappengine.blogspot.com
> > Reddit: http://www.reddit.com/r/appengine
> > Twitter: http://twitter.com/app_engine
> >
> > On Thu, Sep 23, 2010 at 4:17 PM, Jason C <jason.a.coll...@gmail.com> wrote:
> > > When I refer to bursts of errors, this is an example of what I mean.
> > > This is a query for any 500-series errors against appid: steprep. Note
> > > how close together the log entries are (i.e., the bursty-ness). These
> > > requests operate very quickly when App Engine is performing well.
> > >
> > > 09-23 01:07PM 42.459 /_ah/queue/deferred 500 29768ms
> > > 09-23 01:07PM 42.084 /_ah/queue/deferred 500 29485ms
> > > 09-23 01:07PM 38.017 /_ah/queue/deferred 500 29523ms
> > > 09-23 01:07PM 33.636 /_ah/queue/deferred 500 29417ms
> > > 09-23 01:07PM 32.176 /_ah/queue/deferred 500 29787ms
> > > 09-23 01:07PM 30.812 /_ah/queue/deferred 500 29451ms
> > > 09-23 01:07PM 25.477 /_ah/queue/deferred 500 29626ms
> > > 09-23 01:07PM 24.605 /_ah/queue/deferred 500 30025ms
> > > 09-23 01:07PM 14.422 /mgmt/search/start/ 500 40086ms
> > > 09-23 01:07PM 24.204 /_ah/queue/deferred 500 29768ms
> > > 09-23 01:07PM 20.746 /_ah/queue/deferred 500 29968ms
> > > 09-23 01:07PM 30.383 /exec-reports/queue/increment-all-time-data-counter 500 10003ms
> > > 09-23 01:05PM 47.180 /_ah/queue/deferred 500 30070ms
> > > 09-23 01:05PM 42.770 /mgmt/search/start/ 500 30053ms
> > > 09-23 01:04PM 55.080 /_ah/queue/deferred 500 29404ms
> > > 09-23 01:04PM 49.054 /_ah/queue/deferred 500 29589ms
> > >
> > > Anything around the 30-s mark is a DeadlineExceededError; anything
> > > around the 10-s mark is the "Request was aborted after waiting too
> > > long to attempt to service your request...." error.
> > >
> > > These bursts are _NOT_ rare; I can find them at will, including
> > > similar bursts of the 10-s style:
> > >
> > > 09-23 12:52PM 07.445 /_ah/queue/deferred 500 10006ms
> > > 09-23 12:52PM 06.510 /_ah/queue/deferred 500 10016ms
> > > 09-23 12:52PM 06.303 /_ah/queue/deferred 500 10003ms
> > > 09-23 12:52PM 05.902 /_ah/queue/deferred 500 10017ms
> > > 09-23 12:52PM 05.641 /_ah/queue/deferred 500 10017ms
> > > 09-23 12:52PM 04.872 /_ah/queue/deferred 500 10002ms
> > > 09-23 12:52PM 04.211 /_ah/queue/deferred 500 10030ms
> > > 09-23 12:52PM 03.759 /_ah/queue/deferred 500 10029ms
> > > 09-23 12:52PM 03.652 /_ah/queue/deferred 500 10021ms
> > >
> > > I am running out of ideas of what I can do at my end to address this
> > > issue. Does anyone from Google have suggestions? Is this an
> > > infrastructure / autoscaling problem, or is it a problem at my end?
> > >
> > > On Sep 23, 11:06 am, Jason C <jason.a.coll...@gmail.com> wrote:
> > > > We get lots of bursts of loading failures. Most recently: 9.45a to
> > > > 9.47a log time (2010-09-23). Appid: steprep
> > > >
> > > > We also get lots of ongoing 10-second timeouts. Using the logs, it
> > > > seems to me that our user-facing requests are below the 1000ms
> > > > threshold, so I don't know why we get so many 10-second timeouts
> > > > (i.e., the "Request was aborted after waiting too long to attempt to
> > > > service your request. This may happen sporadically when the App Engine
> > > > serving cluster is under unexpectedly high or uneven load. If you see
> > > > this message frequently, please contact the App Engine team." error).
> > > >
> > > > Our QPS varies throughout the day from 1 to around 12, though we have
> > > > jobs that introduce short spikes of 50-60.
> > > >
> > > > j
> > > >
> > > > On Sep 23, 10:39 am, Eli Jones <eli.jo...@gmail.com> wrote:
> > > > > My app is extremely lightweight.
> > > > >
> > > > > The average CPU usage for a cold start falls in these ranges:
> > > > >
> > > > > CPU_MS: 1030 - 1230
> > > > > API_CPU_MS: 408
> > > > >
> > > > > The rough outline of the code that runs is:
> > > > >
> > > > > 1. Pull credentials from memcache and decrypt.
> > > > > 2. Connect to the gdata service, and get some data.
> > > > > 3. Create 4 new entities (of the same Model type) containing this fetched data
> > > > > and append them to a list.
> > > > > 4. Stick those entities in memcache as protobufs.
> > > > > 5. Create a new entity (for a different Model) and append it to the list used in
> > > > > Step 3.
> > > > > 6. db.put(list) to the datastore (or keep trying until success).
> > > > > 7. If necessary, add a task to run at the next 5-minute mark.
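> > > > > In code, that outline looks roughly like the sketch below. It is only a sketch:
> > > > > the model names, memcache keys, URL, and the bare urlfetch standing in for the
> > > > > gdata client call are all placeholders, not my actual code.
> > > > >
> > > > >     import datetime
> > > > >
> > > > >     from google.appengine.api import memcache, urlfetch
> > > > >     from google.appengine.api.labs import taskqueue  # taskqueue is under labs in current SDKs
> > > > >     from google.appengine.ext import db
> > > > >
> > > > >     class Quote(db.Model):            # placeholder model for the 4 fetched entities
> > > > >         symbol = db.StringProperty()
> > > > >         price = db.FloatProperty()
> > > > >
> > > > >     class DailySummary(db.Model):     # placeholder for the one entity of a different model
> > > > >         day = db.DateProperty()
> > > > >         note = db.StringProperty()
> > > > >
> > > > >     def run_chained_task():
> > > > >         # 1. Pull credentials from memcache (decryption omitted in this sketch).
> > > > >         credentials = memcache.get('gdata_credentials')
> > > > >
> > > > >         # 2. Get some data from the external service; a bare urlfetch stands in
> > > > >         #    for the real gdata client call.
> > > > >         result = urlfetch.fetch('https://example.com/feed?token=%s' % credentials,
> > > > >                                 deadline=5)
> > > > >
> > > > >         # 3. Create 4 new entities of the same model from the fetched data.
> > > > >         entities = [Quote(symbol='SYM%d' % i, price=float(i)) for i in range(4)]
> > > > >
> > > > >         # 4. Stick those entities in memcache as protobufs.
> > > > >         for e in entities:
> > > > >             memcache.set('quote:%s' % e.symbol, db.model_to_protobuf(e).Encode())
> > > > >
> > > > >         # 5. Create one entity of a different model and append it to the same list.
> > > > >         entities.append(DailySummary(day=datetime.date.today(),
> > > > >                                      note=str(result.status_code)))
> > > > >
> > > > >         # 6. Put the whole list in one call, retrying until it succeeds.
> > > > >         while True:
> > > > >             try:
> > > > >                 db.put(entities)
> > > > >                 break
> > > > >             except db.Error:
> > > > >                 pass
> > > > >
> > > > >         # 7. If necessary, enqueue the next run at roughly the next 5-minute mark.
> > > > >         taskqueue.add(url='/tasks/chained', countdown=300)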
> > > > > So.. as long as memcache is working and the encrypted credentials are
> > > > > there.. there are not any Gets from the datastore. There is just the Put
> > > > > that is putting 4 entities of one model type and 1 entity of another model
> > > > > type all at once.
> > > > >
> > > > > timezone from pytz is used for correct timezone calculations at the start of
> > > > > the day.. to make sure that the initial task starts at the correct time.
> > > > > (At this point I could move the "from pytz import timezone" down to just
> > > > > the function that calculates the initial start delay.. but I avoid making
> > > > > changes to this code unless it's really necessary; timezone was used in more
> > > > > than one place in the past.. so it was imported at the top of the file.)
> > > > >
> > > > > On Thu, Sep 23, 2010 at 12:20 PM, Ikai Lan (Google) <ikai.l+gro...@google.com> wrote:
> > > > > > It sounds like these are impacting Python developers - I haven't heard any
> > > > > > Java app developers chime in yet.
> > > > > >
> > > > > > What's the ballpark of the QPS you guys are doing? <1qps, 10qps, 100qps,
> > > > > > etc?
> > > > > >
> > > > > > --
> > > > > > Ikai Lan
> > > > > > Developer Programs Engineer, Google App Engine
> > > > > > Blogger: http://googleappengine.blogspot.com
> > > > > > Reddit: http://www.reddit.com/r/appengine
> > > > > > Twitter: http://twitter.com/app_engine
> > > > > >
> > > > > > On Thu, Sep 23, 2010 at 12:15 PM, Eli Jones <eli.jo...@gmail.com> wrote:
> > > > > >> Also, a cold start for this task can take as little as 960 ms (where it
> > > > > >> shows the log detail: "This request caused a new process to be started for
> > > > > >> your application...").
> > > > > >>
> > > > > >> And, the task is doing a URLFetch, so.. I'm sure that is taking up a chunk
> > > > > >> of the running time.
> > > > > >>
> > > > > >> One of my imports is pytz (no django or anything like that though).. so
> > > > > >> that could also be of interest.
> > > > > >>
> > > > > >> On Thu, Sep 23, 2010 at 12:10 PM, Eli Jones <eli.jo...@gmail.com> wrote:
> > > > > >>> My app is in Python. The only thing it does between 6:30 AM PST and 1:00
> > > > > >>> PM PST is run a single chained task. The task just runs once every 5
> > > > > >>> minutes. (There are no other backend processes or users.)
> > > > > >>>
> > > > > >>> Yesterday at 11:00 AM PST and 11:02 AM PST, the task experienced the
> > > > > >>> 10-second, 500 error (with 0 CPU_MS used).. then the task successfully
> > > > > >>> retried at 11:02:55 AM PST.
> > > > > >>>
> > > > > >>> There was another one (not mentioned in your list of times for yesterday)
> > > > > >>> at 7:45 AM PST. Failed after 10 seconds with a 500 error and 0 CPU_MS used.
> > > > > >>> The task successfully retried at 7:47 AM PST. (For some reason, the tasks in
> > > > > >>> my app wait 120 seconds for the first retry.)
> > > > > >>>
> > > > > >>> For some reason, the log doesn't contain the message that usually goes
> > > > > >>> with this error (there is no detailed log message):
> > > > > >>>
> > > > > >>> "Request was aborted after waiting too long to attempt to service your
> > > > > >>> request. This may happen sporadically when the App Engine serving cluster is
> > > > > >>> under unexpectedly high or uneven load. If you see this message frequently,
> > > > > >>> please contact the App Engine team."
> > > > > >>>
> > > > > >>> But, it has the exact same fingerprint. (It just seems that the logs are
> > > > > >>> trimming the detailed messages faster than usual.)
> > > > > >>>
> > > > > >>> This chained task runs 79 times each day (this would be 82 requests
> > > > > >>> counting these 3 errors).. so these errors accounted for 3.66% of the
> > > > > >>> requests during the day yesterday.
> > > > > >>>
> > > > > >>> On Thu, Sep 23, 2010 at 11:19 AM, Ikai Lan (Google) <ikai.l+gro...@google.com> wrote:
> > > > > >>>> Hey guys,
> > > > > >>>>
> > > > > >>>> We've been tracking latency spikes and intermittent batches of aborted
> > > > > >>>> requests over the last 24-hour period. There were at least 3 periods of
> > > > > >>>> spikes yesterday:
> > > > > >>>>
> > > > > >>>> ~1100 Pacific (moderate, brief period of high)
> > > > > >>>> ~1300 Pacific (small)
> > > > > >>>> ~1530 Pacific (small)
> > > > > >>>>
> > > > > >>>> Community reports seem to indicate that these are related to loading
> > > > > >>>> requests. If you are being affected, can you report your environment,
> > > > > >>>> language and if these times correlate with your experiences in this thread?
> > > > > >>>> Are there periods that this particular report did not cover? Please provide
> > > > > >>>> measurements, times and application IDs.
> > > > > >>>>
> > > > > >>>> --
> > > > > >>>> Ikai Lan
> > > > > >>>> Developer Programs Engineer, Google App Engine...