(What is the etiquette for replying to mixed top-post/bottom-post
threads? Google Wave to the rescue!?!)

I am also suffering from the logging circular buffer. Part of me feels
like if a customer is paying triple-digit dollars per day to Google he
should get more than a few gigs of logs! But to discuss a technical
solution...

Part suggestion from my experience, and part request to any
Googlers: what I did was subclass RequestHandler into
LoggingRequestHandler (the site is a webapp app), which creates its
own logger object and sets its default level to WARNING. All logging
then happens through that logger instead of the root logger in the
logging module, so by default there are no debug or info lines; they
are re-enabled only under special circumstances. To re-activate them,
LoggingRequestHandler pulls a very simple serialized dict from what is
basically a singleton entity in the datastore (or memcache), in the
following format:

{
  "Name 1 (usually an issue id from our tracker)": [
    ["HandlerClass1", "http_method", "get_or_post_var", "trigger
value"],
    ["HandlerClass2", "http_method", "get_or_post_var", "trigger
value"]
  ],

  # More concrete example below
  "Issue 12345": [
    ["GetUsers", "get", "username", "m...@example.com"],
    ["PostUser", "post", "username", "m...@example.com"]
  ]
}
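
For concreteness, here is a rough sketch of how that singleton gets
pulled. It is simplified rather than our exact code: the model name,
the memcache key, and the use of simplejson are stand-ins, and the "#"
comment in the example above is just annotation for this post, not
part of the stored payload.

from google.appengine.api import memcache
from google.appengine.ext import db
from django.utils import simplejson  # bundled with the Python runtime

TRIGGER_CONFIG_KEY = 'logging-trigger-config'  # illustrative key name

class LogTriggerConfig(db.Model):
    # Singleton entity (key_name == TRIGGER_CONFIG_KEY) whose payload
    # holds the serialized trigger dict shown above.
    payload = db.TextProperty(default='{}')

def get_trigger_config():
    # Prefer memcache; fall back to the datastore and repopulate the cache.
    payload = memcache.get(TRIGGER_CONFIG_KEY)
    if payload is None:
        entity = LogTriggerConfig.get_or_insert(TRIGGER_CONFIG_KEY)
        payload = entity.payload
        memcache.set(TRIGGER_CONFIG_KEY, payload, time=60)  # re-check each minute
    return simplejson.loads(payload)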

LoggingRequestHandler wraps self.get(), self.put(), etc., first
checking whether its own class name, the HTTP method, and the query
parameters and their values match an entry in that data structure.
The strings are treated as regexes, and there are a few aliased
"variables" like "__url__" which let me match against the entire URL.
If any entry matches, the logger's level is set to DEBUG. The
performance impact is negligible; the main drawback is the extra
round-trip with the user: we have to tell them "logging is now
activated" and ask them to reproduce their problem.
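
In case it helps anyone (or prompts a Googler to suggest something
better), here is roughly what that looks like. Again a sketch rather
than our exact code: I've folded the check into initialize() instead
of wrapping each of self.get()/self.put() individually, the names are
simplified, and error handling is left out. get_trigger_config() is
the fetch helper sketched above.

import logging
import re

from google.appengine.ext import webapp

class LoggingRequestHandler(webapp.RequestHandler):

    def initialize(self, request, response):
        webapp.RequestHandler.initialize(self, request, response)
        self.logger = logging.getLogger(self.__class__.__name__)
        self.logger.setLevel(logging.WARNING)  # debug/info off by default
        if self._matches_trigger(request.method.lower()):
            self.logger.setLevel(logging.DEBUG)

    def _matches_trigger(self, http_method):
        # Each trigger row is [handler_re, method_re, var, value_re];
        # every string is treated as a regex.
        for rows in get_trigger_config().values():  # sketched above
            for handler_re, method_re, var, value_re in rows:
                if not re.match(handler_re, self.__class__.__name__):
                    continue
                if not re.match(method_re, http_method):
                    continue
                # "__url__" is an aliased variable matching the full URL.
                actual = (self.request.url if var == '__url__'
                          else self.request.get(var))
                if re.search(value_re, actual):
                    return True
        return False

Because logging.getLogger() hands back per-name singletons, the level
is reset to WARNING at the top of every request so a DEBUG setting
cannot leak from one request into the next within the same runtime
instance.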

Now to my question. I suspect that all the work I did to implement
this was for nothing, because our logs are still rotating out into
nothing after only a day. Is this because the request volume is still
so high (80 requests/sec, the vast majority with no app logs at all)?
Back in the bad old days we were logging over 1KB of data per request
and I thought that was reasonable; in hindsight, at 80 requests/sec
that works out to roughly 7GB of log data per day, so my expectation
of having those logs (all of which I'm interested in, by the way)
linger for many days turned out to be totally wrong. If anybody has
any advice about this I would love to hear it.

In the meantime, I think I will have to do what I probably should have
done from the start: regularly fetch the logs to a separate system and
search through them there.
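
Concretely that just means running appcfg.py's request_logs command
from cron on an external box, roughly like the sketch below. The
paths are placeholders and the flag details are from memory, so check
"appcfg.py help request_logs" before relying on them.

# Cron-driven log fetch, sketched in Python for clarity.
import subprocess

subprocess.check_call([
    'appcfg.py', 'request_logs',
    '--num_days=1',   # only pull the most recent day's worth
    '--append',       # add to the existing file instead of overwriting
    '--severity=0',   # include app-level logs from DEBUG up
    '/path/to/app',   # directory containing app.yaml (placeholder)
    '/path/to/appengine-requests.log',  # output file (placeholder)
])

From there it's grep or whatever search tooling already lives on that
box.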

On Oct 6, 4:47 pm, "Nick Johnson (Google)" <nick.john...@google.com>
wrote:
> Hi Neil,
>
>
>
> On Tue, Oct 6, 2009 at 1:00 AM, neil souza <nr...@foulplaymedia.com> wrote:
>
> > thanks nick, responses in line (hope they come through right, i don't
> > really know how to use groups)
>
> > On Sep 30, 2:57 am, "Nick Johnson (Google)" <nick.john...@google.com>
> > wrote:
> > > Hi Neil,
>
> > > Sorry for the delay responding. Responses inline.
>
> > > On Sat, Sep 26, 2009 at 1:40 PM, neil souza <nr...@foulplaymedia.com>
> > wrote:
>
> > > > the issue: it looks like we may not be getting all of our log entries
> > > > when we pull the logs from app engine.
>
> > > > first, a little context. there's a lot here, so bear with me.
>
> > > > we need to record event lines for metrics. normally, we would write
> > > > the lines to a local file on each app server and then pull those logs
> > > > every few minutes from the metrics system. we found this to be the
> > > > most stable and scalable architecture.
>
> > > > however, in app engine land, we can't write to a file. so we wrote the
> > > > event lines to the logs, set up a script to pull them in 10 minute
> > > > intervals, and loaded them into the stats system.
>
> > > > to be clear, the process goes like this:
>
> > > > 1.) an event happens on the server that we'd like to record. we write
> > > > a line to the log using logging.info(...) in python
>
> > > > 2.) every 10 minutes, a job starts on a metrics server, which requests
> > > > the next batch of logs by calling appcfg.py. the last log in the new
> > > > batch is kept in an append file to use as the 'sentinel' for the next
> > > > fetch.
>
> > > > 3.) the new log file is parsed for event lines, which are written to
> > > > another 'event' file.
>
> > > > 4.) a few minutes later, another job grabs new event files and loads
> > > > the events into the metrics system.
>
> > > > everything seemed to work great. until we realized that we were
> > > > missing events. a lot of them. between 20-50%.
>
> > > > there are some events that need to be shared with other systems. for
> > > > one of those event types, i was feeling lazy, so i just fired http
> > > > hits at the other system as the events happened. at some point, we
> > > > compared these numbers - and found them to be drastically different.
>
> > > > i ran tests today comparing the number of events recorded 'through'
> > > > the logs system and the same events recorded by http hit during
> > > > runtime. the percent of 'missing' events ranged from 18-56%, and the
> > > > percent missing appeared to be significantly higher when the frequency
> > > > of events was higher (during peak).
>
> > > > i've done a significant amount of work that points to the logs being
> > > > missing by the point that appcfg.py records them. i've reasonably
> > > > verified that all the event lines that appcfg.py pulls down make it
> > > > into the metrics system. oh, and all the numbers are being run on
> > > > unique user counts, so there's no way that the counts could be
> > > > mistakenly large (accidentally reading an event twice does not produce
> > > > a new unique user id).
>
> > > > my questions / issues:
>
> > > > 1.) should we be getting all of our logged lines from appcfg.py's
> > > > request_logs command? is this a known behavior? recall that we are
> > > > missing 20-50% of events - this is not a small discrepancy.
>
> > > App Engine has a fixed amount of space available for logs; it's
> > essentially
> > > a circular buffer. When it runs out of space, it starts replacing older
> > > logs.
>
> > well, i'm going to guess that's the culprit.
>
> > > > 2.) we're pulling our logs every 10 minutes. seeing as the
> > > > request_logs command lets you specify the time you want in days, i
> > > > imagine this is more frequent than intended. could this be causing an
> > > > issue?
>
> > > How much traffic are you getting? What's the size of 10 minutes' worth of
> > > logs?
>
> > we're at maybe avg. 200 requests/sec and it looks like we're recording 1.25
> > events per request, so perhaps 250 log lines / sec? that's in addition
> > to all the other junk getting spilled out there - i didn't know that
> > space was limited, there's prob some debug output, then the
> > exceptions, etc...
>
> There's a limit on the total amount of logs stored - the logs are
> effectively stored in a circular buffer, so old data is overwritten as
> needed to make space for new data.
>
>
>
>
>
> > > > 3.) we switch major versions of the app every time we push, which can
> > > > be several times each day. this doesn't make sense as an issue since
> > > > the numbers are known to be wrong over periods where there have been no
> > > > version changes, but i wanted to mention it.
>
> > > > 4.) can you suggest a better solution for getting data to offline
> > > > processing? right now we're getting the correct numbers using the async
> > > > http requests without ever calling get_result() or the like as a 'fire-
> > > > and-forget' http hit (not even sure if we're supposed to use it like
> > > > this, but seems to work). however, this approach has serious
> > > > drawbacks:
>
> > > You could log to the datastore, and read and delete old entries using
> > > remote_api.
>
> > this just doesn't seem like the right job for the datastore - we're
> > only inserting at 250 events / sec right now, but need to be able to
> > scale that. if we're inserting a few thousand events per second, and
> > can only fetch or delete 1K at a time, that seems like a potential
> > problem. we can batch up the events in each request, but that still only
> > limits the inserts per second to the app's requests per second, which
> > can have the same issue. just doesn't sound fun.
>
> > maybe http requests are the best way to do it for now, unless we can
> > get the log space significantly expanded or find another solution.
>
> If you're doing 250 events a second, HTTP requests are going to be equally
> problematic - you'll be doing 250 of _those_ per second. 1000 events is
> still 4 seconds worth - so instead of doing 250QPS of outgoing traffic, you
> could be doing only 1/4 of a QPS of incoming traffic!
>
> -Nick
>
>
>
>
>
> > > > a.) http requests are very slow and expensive for something that does
> > > > not need to happen immediately.
>
> > > > b.) if the metrics system endpoint becomes unavailable, then, at best,
> > > > the data gets lost. at worst the issue dominoes back up into the http
> > > > servers and takes the app down as well. (each http request has to
> > > > timeout, which takes significantly longer, spiking the concurrent
> > > > connections. this has screwed me multiple times. maybe you google guys
> > > > mark an endpoint as down system-wide so that subsequent requests never
> > > > attempt the connection, but we were never that smart).
>
> > > > thanks in advance, neil.
>
> > > --
> > > Nick Johnson, Developer Programs Engineer, App Engine
> > > Google Ireland Ltd. :: Registered in Dublin, Ireland, Registration
> > Number:
> > > 368047
>
> --
> Nick Johnson, Developer Programs Engineer, App Engine
> Google Ireland Ltd. :: Registered in Dublin, Ireland, Registration Number:
> 368047