Parsing log files for commonly needed performance data is no fun. As a next step toward eliminating that, I've been working on how to approach this similarly to how pg_stat_statements can cut down on query log processing. I thought it would be novel to outline this for design review before coding any more of it. I keep switching between calling the sort of things people want to know about here "Timing Events" and "Performance Events". No strong preference; I'll use the former because it's shorter.

Modeling this on pg_stat_statements includes the hope of packaging it as an extension with minor core hooks, and the idea that there would be a fixed-size list of timing events available at any time. Consider it a FIFO queue built using a circular buffer: dump events in there and provide a way to view them. If some fall out before they're analyzed or saved, that doesn't seem like a problem for now; if you want 100% durability, the log is still available. Eventually I'd like an optional consumer for these that writes them to a history table, but that seems a second priority after getting them into memory in the first place. I'd like that sort of consumer for pg_stat_statements too, but so far we're getting by without it. It seems like something that might benefit from the in-core queuing work one day as well.

= Initial event list =

There are three particular events I have been thinking about, hoping that covering them will be enough to make sure the structure is general enough to handle more later:

-log_lock_waits output. This is my #1 goal. Lock waits are an incredibly common issue in larger systems, and the process for tracking them down right now is very painful. Deadlocks should probably be covered too.

-log_temp_files output. Everyone wants to monitor this, but it won't be as useful unless query tying (below) works.

-log_checkpoints. The fight to make this data available via pg_stat_bgwriter has been useful. You really need a trending system to do anything useful with the data that way, though. Being able to write a simple timing events query that looks for checkpoint spikes would be great for alerting, which is often done with something other than a trending package. To give a specific example, the graphs Munin makes from pg_stat_bgwriter data work, but they're not much help if what you really want is to trigger a Nagios alert when checkpoints are slow. A Timing Event entry for a slow checkpoint would be easy to trigger alerts from.
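
As a rough illustration of that alerting idea, a Nagios-style check could boil down to a query shaped roughly like this. Every name here is hypothetical (none of this exists yet), and exactly where the checkpoint duration lives depends on how event details end up stored (see below); assume here it can be exposed as a total_seconds value:

-- Hypothetical names; count checkpoints in the last hour that took over 5 minutes
SELECT count(*)
FROM pg_timing_events
WHERE event_type = 'checkpoint'   -- or whatever code checkpoints end up getting
  AND event_time > now() - interval '1 hour'
  AND total_seconds > 300;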

= Data to save =

There are a few obvious fields that could be explicitly saved as regular columns:

-timestamp
-user
-database
-event type code. Presumably an integer that's decoded via a master list of these, in some header file.
-running process id
-running process details (below)
-referenced process id. The one causing the problem in situations like a lock wait.
-referenced process details (below)
-serial number. People who want to poll and store this data may want a better way to do that than just a timestamp. It would also let you know when you've lost some events, so you'd know to consider increasing the buffer size or decreasing the polling interval.
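
To make that list concrete, here is a rough sketch of the view the extension might expose. Every name is a placeholder rather than a settled interface:

-- Hypothetical view over the in-memory circular buffer
CREATE VIEW pg_timing_events AS
    SELECT
        serial,       -- increasing counter; lets pollers detect dropped events
        event_time,   -- timestamp
        usename,      -- user
        datname,      -- database
        event_type,   -- integer code decoded against a master list
        pid,          -- running process id
        details,      -- running process details (see below)
        ref_pid,      -- referenced process id, e.g. the lock holder
        ref_details   -- referenced process details
    FROM pg_timing_events_buffer();  -- set-returning function in the extension

pg_stat_statements presents its fixed-size data the same general way: a set-returning C function wrapped in a view.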

= Storing event details =

When I think about how people are going to want to use this data, it should be easy to tear apart without falling back into writing a complicated parser.

My first idea with that in mind was to consider this as a series of key/value pairs. If you look at the log_checkpoints output, you can see it approaches that form already at the end, with "=" signs as the delimiters. Taking an example from a pre-9.1 server (so no per-file sync times, but those obviously fit the style already):

checkpoint complete: wrote 128770 buffers (12.3%); 0 transaction log file(s) added, 979 removed, 129 recycled; write=431.784 s, sync=6048.384 s, total=6484.669 s

You can imagine that instead as:

wrote=128770 wrote_pct=12.3 log_added=0 log_removed=979 log_recycled=129 write=431.784 sync=6048.384 total=6484.669

Even if the timing event data were stored exactly like that--a simple text string, perhaps with some escaping of "=" signs--I think it would be completely usable to people. log_temp_files output fits this form easily enough too.
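
To show how little parsing that text form would take: assuming a hypothetical pg_timing_events view with a text details column holding the string above, pulling one value back out only needs the built-in string functions, something like:

-- Hypothetical names; extract the sync time from key/value details text
SELECT event_time,
       substring(details from 'sync=([0-9.]+)')::numeric AS sync_seconds
FROM pg_timing_events
WHERE details LIKE '%sync=%';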

Another way this data might be stored is with a key/value composite type, making each timing event entry an array of those.
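
A minimal sketch of that idea, again with hypothetical names:

-- One key/value pair per detail field
CREATE TYPE timing_event_detail AS (key text, value text);

-- Each event row would then carry a column like:  details timing_event_detail[]
-- and individual values come back out by expanding the array:
SELECT event_time, (d).value AS sync_seconds
FROM (SELECT event_time, unnest(details) AS d
      FROM pg_timing_events) expanded
WHERE (d).key = 'sync';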

I'm open to other ideas for how to store this. If all the events go into one place, the different data each event type saves turns into the classic "flexible schema" problem.

= Query text =

For the often multi-line query text that's dumped by log_lock_waits, I've thought of two possibilities so far:

-Mash them into a single text line using "\n" between them. With the right escaping, you're then back to where the entire timing event detail part could still collapse into a single text string. That would kill the idea of simple parsing of key/value data stuffed in there, though.

-Create an array of lines to hold the two statements. The view that shows the data can always unnest it. It seems like this would compel putting the two query description arrays into their own fields, though.

I don't really like either of these. The second should result in a better UI in the end though, at the expense of bloating the width of the event table some.
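
To sketch the second option with hypothetical names: each statement becomes a text[] column such as query_lines, and the view that shows the data can either unnest it or glue it back together for display:

-- One row per line of the waiting query
SELECT serial, unnest(query_lines) AS query_line
FROM pg_timing_events;

-- Or reassemble the original multi-line text
SELECT serial, array_to_string(query_lines, E'\n') AS query
FROM pg_timing_events;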

= Query Tying =

I'd also like a way to reference the normalized version of the active query or queries at the time, when that's appropriate. That seems like it should be a key that's joinable with the normalized version in pg_stat_statements. One of the ideas bounced around in the "Hash id in pg_stat_statements" thread, like assigning a numeric code for each query, would seem fine for that. This makes the feature dependent on one of those ideas being committed.
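
Assuming such a key were exposed on both sides (calling it query_id here purely as a placeholder, since the actual column depends on what gets committed from that thread), the join is the obvious one:

-- Hypothetical: tie a timing event back to its normalized query text
SELECT e.event_time, e.details, s.query
FROM pg_timing_events e
JOIN pg_stat_statements s ON s.query_id = e.query_id;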

I have not yet confirmed how feasible it is to extract a query hash key at the point where these log messages show up. ProcSleep logs the lock waiting, and it grabs the full query text; presumably that means it could grab the hash key instead.

FileClose is what logs temp file data. That seems pretty disconnected from what the executor is doing at the time.

If it's not possible to tie things to specific queries cleanly in every case, I'd rather dump that requirement for now than make it a blocker. That can always be added onto the basic timing event plumbing later.

= Limitations of scope =

I was thinking the act of loading the extension would turn this logging on, and that existing GUCs like log_lock_waits would also apply. It seems useful and easy to have a load-time option for the module that disables all the replaced log file writes if you're using the timing event mechanism. It would be nice to have a granular way to ask for only some timing events: "I want to see lock waits but not temp files". But that seems like scope creep that isn't necessary to make this useful.

All of these initial targets are events that don't occur very often, so I'm not too concerned about the performance of this implementation. Lock waits are only logged after deadlock_timeout, so they come out at most about once per second per connection; temp files shouldn't be that common; and checkpoints are infrequent. For now, I hope that targeting only infrequent events means dumping the timing event into memory will add little overhead. Eventually I'd like a much finer-grained timing events capability, more along the lines of what DTrace offers. Getting the overhead low enough to go there seems too hard to attack at first, though, while simultaneously dealing with what I'm trying to extract without log file parsing.

Alternatively, think about the overhead of these initial targets like this: right now, all of these things are being written to the log file. If you can afford to do that, you can certainly afford to write a copy to memory instead, and probably in addition to the log as well.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

