Re: [HACKERS] Instrument checkpoint sync calls

Cédric Villemain Tue, 30 Nov 2010 10:33:09 -0800

2010/11/30 Greg Smith <[email protected]>:
> Jeff Janes wrote:
>>
>> For the individual file sync times emitted under debug1, it would be
>> very handy if the file being synced was identified, for example
>> "relation base/16384/16523". Rather than being numbered sequentially
>> within a given checkpoint.
>>
>
> I was numbering them sequentially so that it's straightforward to graph the
> sync times in an external analysis tool, but the relation data is helpful
> too.  New patch reflecting all upthread suggestions is attached.  The output
> looks like this now at DEBUG1:
>
> LOG:  checkpoint starting: xlog
> DEBUG:  checkpoint sync: number=1 file=base/16424/11645 time=11589.549000
> msec
> DEBUG:  checkpoint sync: number=2 file=base/16424/16438 time=16.148000 msec
> DEBUG:  checkpoint sync: number=3 file=base/16424/16437 time=53.530000 msec
> DEBUG:  checkpoint sync: number=4 file=base/16424/16447 time=10.214000 msec
> DEBUG:  checkpoint sync: number=5 file=base/16424/11607 time=1.499000 msec
> DEBUG:  checkpoint sync: number=6 file=base/16424/16425_fsm time=2.921000
> msec
> DEBUG:  checkpoint sync: number=7 file=base/16424/16437.1 time=4.237000 msec
> DEBUG:  checkpoint sync: number=8 file=base/16424/16428_fsm time=1.654000
> msec
> DEBUG:  checkpoint sync: number=9 file=base/16424/16442 time=7.920000 msec
> DEBUG:  checkpoint sync: number=10 file=base/16424/16428_vm time=2.613000
> msec
> DEBUG:  checkpoint sync: number=11 file=base/16424/11618 time=1.468000 msec
> DEBUG:  checkpoint sync: number=12 file=base/16424/16437_fsm time=2.638000
> msec
> DEBUG:  checkpoint sync: number=13 file=base/16424/16428 time=2.883000 msec
> DEBUG:  checkpoint sync: number=14 file=base/16424/16425 time=3.369000 msec
> DEBUG:  checkpoint sync: number=15 file=base/16424/16437_vm time=8.686000
> msec
> DEBUG:  checkpoint sync: number=16 file=base/16424/16425_vm time=5.984000
> msec
> LOG:  checkpoint complete: wrote 2074 buffers (50.6%); 0 transaction log
> file(s) added, 0 removed, 3 recycled; write=0.617 s, sync=11.715 s,
> total=22.167 s; sync files=16, longest=11.589 s, average=0.724 s
>
> I kept the units for the DEBUG level ones in msec because that's a better
> scale for the common really short syncs during good behavior.  But the
> summary info in seconds now appears at the end of the existing "checkpoint
> complete" message, so only one line to parse for those looking to analyze
> the gross checkpoint data.  That looks to work well enough for finding
> situations like the big ext3 spikes.  You can easily see one in this example
> by the fact that "longest=11.589 s" is almost the entirety of "sync=11.715
> s".  That's the really key thing there's currently no visibility into,
> that's made obvious with this patch.


wonderfull.

>
> This might be ready for some proper review now.  I know there's at least one
> blatant bug still in here I haven't found yet, related to how the averages
> are computed.  I saw this once:
>
> LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s)
> added, 0 removed, 1 recycled; write=0.000 s, sync=0.000 s, total=0.001 s;
> sync files=0, longest=0.000 s, average=-9223372036854775808.-2147483 s
>
> After an immediate checkpoint, so at least one path not quite right yet.
>
> --
> Greg Smith   2ndQuadrant US    [email protected]   Baltimore, MD
>
>
>
>
> --
> Sent via pgsql-hackers mailing list ([email protected])
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
>



-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Instrument checkpoint sync calls

Reply via email to