Parsing log files for commonly needed performance data is no fun. As a next step toward eliminating that, I've been working on how to approach this similarly to how pg_stat_statements can cut down on query log processing. I thought it would be novel to outline this for design review before coding any more of it. I keep switching between calling the sort of things people want to know about here "Timing Events" and "Performance Events". No strong preference; I'll use the former because it's shorter.

Modeling this on pg_stat_statements includes the hope of packaging it as an extension with minor core hooks, and the idea that there would be a fixed-size list of timing events available at any time. Consider it a FIFO queue built using a circular buffer: dump events in there and provide a way to view them. If some fall out before they're analyzed or saved, that doesn't seem like a problem for now; if you want 100% durability, the log is still available. Eventually I'd like an optional consumer for these that writes them to a history table, but that seems a second priority after getting them into memory in the first place. I'd like that sort of consumer for pg_stat_statements too, but so far we're getting by without it. It seems like something that might benefit from the in-core queuing work one day as well.

= Initial event list =

There are three particular events I have been thinking about, hoping that covering them will be enough to make sure the structure is general enough to handle more later:

-log_lock_waits output. This is my #1 goal. Lock waits are an incredibly common issue in larger systems, and the process for tracking them down right now is very painful. Deadlocks should probably be covered too.

-log_temp_files output. Everyone wants to monitor this, but it won't be as useful unless query tying (below) works.

-log_checkpoints. The fight to make this data available via pg_stat_bgwriter has been useful. You really need a trending system to do anything useful with the data that way, though. Being able to write a simple timing events query that looks for checkpoint spikes would be great for alerting, which is often done with something other than a trending package. To give a specific example, the graphs Munin makes from pg_stat_bgwriter data work, but they're not much help if what you really want is to trigger a Nagios alert when checkpoints are slow. A Timing Event entry for a slow checkpoint would be easy to trigger alerts from.
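
As a rough illustration of that alerting idea, a Nagios-style check could boil down to a query shaped roughly like this. Every name here is hypothetical (none of this exists yet), and exactly where the checkpoint duration lives depends on how event details end up stored (see below); assume here it can be exposed as a total_seconds value:

-- Hypothetical names; count checkpoints in the last hour that took over 5 minutes
SELECT count(*)
FROM pg_timing_events
WHERE event_type = 'checkpoint'   -- or whatever code checkpoints end up getting
  AND event_time > now() - interval '1 hour'
  AND total_seconds > 300;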

= Data to save =

There are a few obvious fields that could be explicitly saved as regular columns:

-timestamp
-user
-database
-event type code. Presumably an integer that's decoded via a master list of these, in some header file.
-running process id
-running process details (below)
-referenced process id. The one causing the problem in situations like a lock wait.
-referenced process details (below)
-serial number. People who want to poll and store this data may want a better way to do that than just a timestamp. It would also let you know when you've lost some events, so you'd know to consider increasing the buffer size or decreasing the polling interval.
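
To make that list concrete, here is a rough sketch of the view the extension might expose. Every name is a placeholder rather than a settled interface:

-- Hypothetical view over the in-memory circular buffer
CREATE VIEW pg_timing_events AS
    SELECT
        serial,       -- increasing counter; lets pollers detect dropped events
        event_time,   -- timestamp
        usename,      -- user
        datname,      -- database
        event_type,   -- integer code decoded against a master list
        pid,          -- running process id
        details,      -- running process details (see below)
        ref_pid,      -- referenced process id, e.g. the lock holder
        ref_details   -- referenced process details
    FROM pg_timing_events_buffer();  -- set-returning function in the extension

pg_stat_statements presents its fixed-size data the same general way: a set-returning C function wrapped in a view.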

= Storing event details =

When I think about how people are going to want to use this data, it should be easy to tear apart without falling back into writing a complicated parser.

My first idea with that in mind was to consider this as a series of key/value pairs. If you look at the log_checkpoints output, you can see it approaches that form already at the end, with "=" signs as the delimiters. Taking an example from a pre-9.1 server (so no per-file sync times, but those obviously fit the style already):

checkpoint complete: wrote 128770 buffers (12.3%); 0 transaction log file(s) added, 979 removed, 129 recycled; write=431.784 s, sync=6048.384 s, total=6484.669 s

You can imagine that instead as:

wrote=128770 wrote_pct=12.3 log_added=0 log_removed=979 log_recycled=129 write=431.784 sync=6048.384 total=6484.669

Even if the timing event data were stored exactly like that--a simple text string, perhaps with some escaping of "=" signs--I think it would be completely usable to people. log_temp_files output fits this form easily enough too.
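
To show how little parsing that text form would take: assuming a hypothetical pg_timing_events view with a text details column holding the string above, pulling one value back out only needs the built-in string functions, something like:

-- Hypothetical names; extract the sync time from key/value details text
SELECT event_time,
       substring(details from 'sync=([0-9.]+)')::numeric AS sync_seconds
FROM pg_timing_events
WHERE details LIKE '%sync=%';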

Another way this data might be stored is with a key/value composite type, making each timing event entry an array of those.
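
A minimal sketch of that idea, again with hypothetical names:

-- One key/value pair per detail field
CREATE TYPE timing_event_detail AS (key text, value text);

-- Each event row would then carry a column like:  details timing_event_detail[]
-- and individual values come back out by expanding the array:
SELECT event_time, (d).value AS sync_seconds
FROM (SELECT event_time, unnest(details) AS d
      FROM pg_timing_events) expanded
WHERE (d).key = 'sync';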

I'm open to other ideas for how to store this. If all the events go into one place, the different data each event type saves turns into the classic "flexible schema" problem.

= Query text =

For the often multi-line query text that's dumped by log_lock_waits, I've thought of two possibilities so far:

-Mash them into a single text line using "\n" between them. With the right escaping, you're then back to where the entire timing event detail part could still collapse into a single text string. That would kill the idea of simple parsing of key/value data stuffed in there, though.

-Create an array of lines to hold the two statements. The view that shows the data can always unnest it. It seems like this would compel putting the two query description arrays into their own fields, though.

I don't really like either of these. The second should result in a better UI in the end though, at the expense of bloating the width of the event table some.
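
To sketch the second option with hypothetical names: each statement becomes a text[] column such as query_lines, and the view that shows the data can either unnest it or glue it back together for display:

-- One row per line of the waiting query
SELECT serial, unnest(query_lines) AS query_line
FROM pg_timing_events;

-- Or reassemble the original multi-line text
SELECT serial, array_to_string(query_lines, E'\n') AS query
FROM pg_timing_events;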

= Query Tying =

I'd also like a way to reference the normalized version of the active query or queries at the time, when that's appropriate. That seems like it should be a key that's joinable with the normalized version in pg_stat_statements. One of the ideas bounced around in the "Hash id in pg_stat_statements" thread, like assigning a numeric code for each query, would seem fine for that. This makes the feature dependent on one of those ideas being committed.
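
Assuming such a key were exposed on both sides (calling it query_id here purely as a placeholder, since the actual column depends on what gets committed from that thread), the join is the obvious one:

-- Hypothetical: tie a timing event back to its normalized query text
SELECT e.event_time, e.details, s.query
FROM pg_timing_events e
JOIN pg_stat_statements s ON s.query_id = e.query_id;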

I have not yet confirmed how feasible it is to extract a query hash key at the point where these log messages show up. ProcSleep logs the lock waiting, and it grabs the full query text; presumably that means it could grab the hash key instead.

FileClose is what logs temp file data. That seems pretty disconnected from what the executor is doing at the time.

If it's not possible to tie things to specific queries cleanly in every case, I'd rather dump that requirement for now than make it a blocker. That can always be added onto the basic timing event plumbing later.

= Limitations of scope =

I was thinking the act of loading the extension would turn this logging on, and that existing GUCs like log_lock_waits would also apply. It seems useful and easy to have a load-time option for the module that disables all the replaced log file writes if you're using the timing event mechanism. It would be nice to have a granular way to ask for only some timing events: "I want to see lock waits but not temp files". But that seems like scope creep that isn't necessary to make this useful.

All of these initial targets are events that don't occur very often, so I'm not too concerned about the performance of this implementation. Lock waits are only logged after deadlock_timeout, so they come out at most about once per second per connection; temp files shouldn't be that common; and checkpoints are infrequent. For now, I hope that targeting only infrequent events means dumping the timing event into memory will add little overhead. Eventually I'd like a much finer-grained timing events capability, more along the lines of what DTrace offers. Getting the overhead low enough to go there seems too hard to attack at first, though, while simultaneously dealing with what I'm trying to extract without log file parsing.

Alternatively, think about the overhead of these initial targets like this: right now, all of these things are being written to the log file. If you can afford to do that, you can certainly afford to write a copy to memory instead, and probably in addition to the log as well.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

