Re: [HACKERS] Spread checkpoint sync

2011-02-10 Thread Greg Smith
Looks like it's time to close the book on this one for 9.1 
development...the unfortunate results are at 
http://www.2ndquadrant.us/pgbench-results/index.htm  Test set #12 is the 
one with spread sync I was hoping would turn out better than #9, the 
reference I was trying to improve on.  TPS is about 5% slower on the 
scale=500 and 15% slower on the scale=1000 tests with sync spread out.  
Even worse, maximum latency went up a lot. 


I am convinced of a couple of things now:

1) Most of the benefit we were seeing from the original patch I 
submitted was simply from doing much better at absorbing fsync requests 
from backends while the checkpoint sync was running.  The already 
committed fsync compaction patch effectively removes that problem 
though, to the extent it's possible to do so, making the remaining 
pieces here not as useful in its wake.


2) I need to start over testing here with something that isn't 100% 
write all of the time the way pgbench is.  It's really hard to isolate 
out latency improvements when the test program guarantees all associated 
write caches will be completely filled at every moment.  Also, I can't 
see any benefit with it from changes that improve performance only for 
readers, and an all-write load is quite unrealistic relative to 
real-world workloads.


3) The existing write spreading code in the background writer needs to 
be overhauled, too, before spreading the syncs around is going to give 
the benefits I was hoping for.


Given all that, I'm going to take my feedback and give the test server a 
much deserved break.  I'm happy that the fsync compaction patch has made 
9.1 much more tolerant of write-heavy loads than earlier versions, so 
it's not like no progress was made in this release.


For anyone who wants more details here...the news on this spread sync 
implementation is not all bad.  If you compare this result from HEAD, 
with scale=1000 and clients=256:


http://www.2ndquadrant.us/pgbench-results/611/index.html

Against its identically configured result with spread sync:

http://www.2ndquadrant.us/pgbench-results/708/index.html

There are actually significantly fewer points in the 2000 ms latency 
area.  That shows up as a reduction in the 90th percentile latency 
figures I compute, and you can see it in the graph if you look at how 
much denser the points are in the 2000 - 4000 ms area on #611.  But 
that's a pretty weak change.
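For anyone curious how a figure like that gets computed, a percentile 
summary is just an order statistic over the per-transaction latencies.  
A minimal sketch, using a simple nearest-rank index convention (an 
assumption here, not necessarily the exact method behind these reports):

```c
#include <stdlib.h>

/* Comparison callback for qsort over doubles */
static int
cmp_double(const void *a, const void *b)
{
    double x = *(const double *) a;
    double y = *(const double *) b;
    return (x > y) - (x < y);
}

/* Sort the latency samples and pick the value at the requested
 * percentile, e.g. pct = 0.9 for a 90th percentile figure. */
static double
latency_percentile(double *samples_ms, int n, double pct)
{
    qsort(samples_ms, n, sizeof(double), cmp_double);
    return samples_ms[(int) (pct * (n - 1))];
}
```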


But the most disappointing part here relative to what I was hoping is 
what happens with bigger buffer caches.  The main idea driving this 
approach was that it would enable larger values of shared_buffers 
without the checkpoint spikes being as bad.  Test set #13 tries that 
out, by increasing shared_buffers from 256MB to 4GB, along with a big 
enough increase in checkpoint_segments to make most checkpoints time 
based.  Not only did the smaller scale TPS drop by half, but all kinds 
of bad things happened to latency as well.  Here's a sample of the sort 
of dysfunctional checkpoints that came out of that:


2011-02-10 02:41:17 EST: LOG:  checkpoint starting: xlog
2011-02-10 02:53:15 EST: DEBUG:  checkpoint sync:  estimated segments=22
2011-02-10 02:53:15 EST: DEBUG:  checkpoint sync: number=1 
file=base/16384/16768 time=150.008 msec
2011-02-10 02:53:15 EST: DEBUG:  checkpoint sync: number=2 
file=base/16384/16749 time=0.002 msec
2011-02-10 02:53:15 EST: DEBUG:  checkpoint sync: number=3 
file=base/16384/16749_fsm time=0.001 msec
2011-02-10 02:53:23 EST: DEBUG:  checkpoint sync: number=4 
file=base/16384/16761 time=8014.102 msec
2011-02-10 02:53:23 EST: DEBUG:  checkpoint sync: number=5 
file=base/16384/16752_vm time=0.002 msec
2011-02-10 02:53:35 EST: DEBUG:  checkpoint sync: number=6 
file=base/16384/16761.5 time=11739.038 msec
2011-02-10 02:53:37 EST: DEBUG:  checkpoint sync: number=7 
file=base/16384/16761.6 time=2205.721 msec
2011-02-10 02:53:45 EST: DEBUG:  checkpoint sync: number=8 
file=base/16384/16761.2 time=8273.849 msec
2011-02-10 02:54:06 EST: DEBUG:  checkpoint sync: number=9 
file=base/16384/16766 time=20874.167 msec
2011-02-10 02:54:06 EST: DEBUG:  checkpoint sync: number=10 
file=base/16384/16762 time=0.002 msec
2011-02-10 02:54:08 EST: DEBUG:  checkpoint sync: number=11 
file=base/16384/16761.3 time=2440.441 msec
2011-02-10 02:54:09 EST: DEBUG:  checkpoint sync: number=12 
file=base/16384/16766.1 time=635.839 msec
2011-02-10 02:54:09 EST: DEBUG:  checkpoint sync: number=13 
file=base/16384/16752_fsm time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG:  checkpoint sync: number=14 
file=base/16384/16764 time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG:  checkpoint sync: number=15 
file=base/16384/16768_fsm time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG:  checkpoint sync: number=16 
file=base/16384/16761_vm time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG:  checkpoint sync: number=17 
file=base/16384/16761.4 time=150.702 msec
2011-02-10 02:54:09 EST: DEBUG:  checkpoint sync: number=18 
file=base/16384/16752 time=0.002 msec
2011-02-10 02:54:09 EST: 

Re: [HACKERS] Spread checkpoint sync

2011-02-10 Thread Robert Haas
On Thu, Feb 10, 2011 at 10:30 PM, Greg Smith g...@2ndquadrant.com wrote:
 3) The existing write spreading code in the background writer needs to be
 overhauled, too, before spreading the syncs around is going to give the
 benefits I was hoping for.

I've been thinking about this problem a bit.  It strikes me that the
whole notion of a background writer delay is probably wrong-headed.
Instead of having fixed-length cycles, we might want to make the delay
dependent on whether we're actually keeping up.  So during each cycle,
we decide how many buffers we want to clean, and we write 'em.  Then
we go to sleep.  When we wake up again, we figure out whether we kept
up.  If the number of buffers we wrote during the prior cycle was more
than the required number, then we'll sleep longer the next time, up to
some maximum; if we didn't write enough, we'll reduce the sleep.

Along with this, we'd want to change the minimum rate of writing
checkpoint buffers from 1 per cycle to 1 for every 200 ms, or
something like that.
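A rough sketch of that feedback loop; the function name, step size, and 
bounds below are made up for illustration, not actual PostgreSQL code:

```c
/* Illustrative self-tuning delay: lengthen the sleep when the prior
 * cycle met its cleaning target, shorten it when it fell behind. */
#define MIN_DELAY_MS    10
#define MAX_DELAY_MS  1000
#define DELAY_STEP_MS   10

static int
next_bgwriter_delay(int delay_ms, int buffers_written, int buffers_required)
{
    if (buffers_written >= buffers_required)
    {
        /* Kept up: sleep longer next time, up to some maximum */
        delay_ms += DELAY_STEP_MS;
        if (delay_ms > MAX_DELAY_MS)
            delay_ms = MAX_DELAY_MS;
    }
    else
    {
        /* Fell behind: wake up sooner next time */
        delay_ms -= DELAY_STEP_MS;
        if (delay_ms < MIN_DELAY_MS)
            delay_ms = MIN_DELAY_MS;
    }
    return delay_ms;
}
```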

We could even possibly have a system where backends wake the
background writer up early if they notice that it's not keeping up,
although it's not exactly clear what a good algorithm would be.
Another thing that would be really nice is if backends could somehow
let the background writer know when they're using a
BufferAccessStrategy, and somehow convince the background writer to
write those buffers out to the OS at top speed.

 I want to make this problem go away, but as you can see spreading the sync
 calls around isn't enough.  I think the main write loop needs to get spread
 out more, too, so that the background writer is trying to work at a more
 reasonable pace.  I am pleased I've been able to reproduce this painful
 behavior at home using test data, because that much improves my odds of
 being able to isolate its cause and test solutions.  But it's a tricky
 problem, and I'm certainly not going to fix it in the next week.

Thanks for working on this.  I hope we get a better handle on it for 9.2.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Spread checkpoint sync

2011-02-07 Thread Cédric Villemain
2011/2/7 Greg Smith g...@2ndquadrant.com:
 Robert Haas wrote:

 With the fsync queue compaction patch applied, I think most of this is
 now not needed.  Attached please find an attempt to isolate the
 portion that looks like it might still be useful.  The basic idea of
 what remains here is to make the background writer still do its normal
 stuff even when it's checkpointing.  In particular, with this patch
 applied, PG will:

 1. Absorb fsync requests a lot more often during the sync phase.
 2. Still try to run the cleaning scan during the sync phase.
 3. Pause for 3 seconds after every fsync.


 Yes, the bits you extracted were the remaining useful parts from the
 original patch.  Today was quiet here because there were sports on or
 something, and I added full auto-tuning magic to the attached update.  I
 need to kick off benchmarks and report back tomorrow to see how well this
 does, but any additional patch here would only be code cleanup on the messy
 stuff I did in here (plus proper implementation of the pair of GUCs).  This
 has finally gotten to the exact logic I've been meaning to complete as
 spread sync since the idea was first postponed in 8.3, with the benefit of
 some fsync absorption improvements along the way, too.

 The automatic timing is modeled on the existing checkpoint_completion_target
 concept, except with a new tunable (not yet added as a GUC) currently called
 CheckPointSyncTarget, set to 0.8 right now.  What I think I want to do is
 make the existing checkpoint_completion_target now be the target for the end
 of the sync phase, matching its name; people who bumped it up won't
 necessarily even have to change anything.  Then the new guc can be
 checkpoint_write_target, representing the target that is in there right now.

Is it worth starting a new thread about the different I/O improvements
done so far or ongoing, and how we might add new GUCs (if required!) with
intelligence between those patches?  (For instance, the hint bit I/O limit
probably needs a tunable to define something similar to
hint_write_completion_target and/or an I/O throttling strategy, ...items
which are still in gestation...)


 I tossed the earlier idea of counting relations to sync based on the write
 phase data as too inaccurate after testing, and with it for now goes
 checkpoint sorting.  Instead, I just take a first pass over pendingOpsTable
 to get a total number of things to sync, which will always match the real
 count barring strange circumstances (like dropping a table).

 As for the automatically determining the interval, I take the number of
 syncs that have finished so far, divide by the total, and get a number
 between 0.0 and 1.0 that represents progress on the sync phase.  I then use
 the same basic CheckpointWriteDelay logic that is there for spreading writes
 out, except with the new sync target.  I realized that if we assume the
 checkpoint writes should have finished in CheckPointCompletionTarget worth
 of time or segments, we can compute a new progress metric with the formula:

 progress = CheckPointCompletionTarget + (1.0 - CheckPointCompletionTarget) *
 finished / goal;

 Where finished is the number of segments written out, while goal is the
 total.  To turn this into an example, let's say the default parameters are
 set, we've finished the writes, and  finished 1 out of 4 syncs; that much
 work will be considered:

 progress = 0.5 + (1.0 - 0.5) * 1 / 4 = 0.625

 On a scale that effectively aims to have sync work finished by 0.8.

 I don't use quite the same logic as the CheckpointWriteDelay though.  It
 turns out the existing checkpoint_completion implementation doesn't always
 work like I thought it did, which provides some very interesting insight into
 why my attempts to work around checkpoint problems haven't worked as well as
 expected the last few years.  I thought that what it did was wait until an
 amount of time determined by the target was reached before it did the next
 write.  That's not quite it; what it actually does is check progress against
 the target, then sleep exactly one nap interval if it is ahead of
 schedule.  That is only the same thing if you have a lot of buffers to write
 relative to the amount of time involved.  There's some alternative logic if
 you don't have bgwriter_lru_maxpages set, but in the normal situation
 effectively it means that:

 maximum write spread time = bgwriter_delay * checkpoint dirty blocks

 No matter how far apart you try to spread the checkpoints.  Now, typically,
 when people run into these checkpoint spikes in production, reducing
 shared_buffers improves that.  But I now realize that doing so will then
 reduce the average number of dirty blocks participating in the checkpoint,
 and therefore potentially pull the spread down at the same time!  Also, if
 you try and tune bgwriter_delay down to get better background cleaning,
 you're also reducing the maximum spread.  Between this issue and the bad
 behavior when the fsync queue fills, no wonder this 

Re: [HACKERS] Spread checkpoint sync

2011-02-07 Thread Greg Smith

Cédric Villemain wrote:

Is it worth starting a new thread about the different I/O improvements
done so far or ongoing, and how we might add new GUCs (if required!) with
intelligence between those patches?  (For instance, the hint bit I/O limit
probably needs a tunable to define something similar to
hint_write_completion_target and/or an I/O throttling strategy, ...items
which are still in gestation...)
  


Maybe, but I wouldn't bring all that up right now.  Trying to wrap up 
the CommitFest, too distracting, etc.


As a larger statement on this topic, I'm never very excited about 
redesigning here starting from any point other than "saw a bottleneck 
doing x on a production system".  There's a long list of such things 
already around waiting to be addressed, and I've never seen any good 
evidence of work related to hint bits being on it.  Please correct me if 
you know of some--I suspect you do from the way you're bringing this up.  
If we were to consider kicking off some larger work here, I would drive 
that by asking where the data supporting that work being necessary is.  
It's hard enough to fix a bottleneck that's staring right at you; trying 
to address one that's just theorized is impossible.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books




Re: [HACKERS] Spread checkpoint sync

2011-02-07 Thread Kevin Grittner
Greg Smith g...@2ndquadrant.com wrote:

 As a larger statement on this topic, I'm never very excited about
 redesigning here starting from any point other than saw a
 bottleneck doing x on a production system.  There's a long list
 of such things already around waiting to be addressed, and I've
 never seen any good evidence of work related to hint bits being on
 it.  Please correct me if you know of some--I suspect you do from
 the way you're bringing this up.

There are occasional posts from those wondering why their read-only
queries are so slow after a bulk load, and why they are doing heavy
writes.  (I remember when I posted about that, as a relative newbie,
and I know I've seen others.)

I think worst case is probably:

- Bulk load data.
- Analyze (but don't vacuum) the new data.
- Start a workload with a lot of small, concurrent random reads.
- Watch performance tank when the write cache gluts.

This pattern is why we've adopted a pretty strict rule in our shop
that we run VACUUM FREEZE ANALYZE between a bulk load and putting
the database back into production.  It's probably a bigger issue for
those who can't do that.

-Kevin



Re: [HACKERS] Spread checkpoint sync

2011-02-07 Thread Greg Smith

Kevin Grittner wrote:

There are occasional posts from those wondering why their read-only
queries are so slow after a bulk load, and why they are doing heavy
writes.  (I remember when I posted about that, as a relative newbie,
and I know I've seen others.)
  


Sure; I created http://wiki.postgresql.org/wiki/Hint_Bits a while back 
specifically to have a resource to explain that mystery to offer 
people.  But there's a difference between having a performance issue 
that people don't understand, and having a real bottleneck you can't get 
rid of.  My experience is that people who have hint bit issues run into 
them as a minor side-effect of a larger vacuum issue, and that if you 
get that under control they're only a minor detail in comparison.  Makes 
it hard to get too excited about optimizing them.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books




Re: [HACKERS] Spread checkpoint sync

2011-02-06 Thread Greg Smith

Robert Haas wrote:

With the fsync queue compaction patch applied, I think most of this is
now not needed.  Attached please find an attempt to isolate the
portion that looks like it might still be useful.  The basic idea of
what remains here is to make the background writer still do its normal
stuff even when it's checkpointing.  In particular, with this patch
applied, PG will:

1. Absorb fsync requests a lot more often during the sync phase.
2. Still try to run the cleaning scan during the sync phase.
3. Pause for 3 seconds after every fsync.
  


Yes, the bits you extracted were the remaining useful parts from the 
original patch.  Today was quiet here because there were sports on or 
something, and I added full auto-tuning magic to the attached update.  I 
need to kick off benchmarks and report back tomorrow to see how well 
this does, but any additional patch here would only be code cleanup on 
the messy stuff I did in here (plus proper implementation of the pair of 
GUCs).  This has finally gotten to the exact logic I've been meaning to 
complete as spread sync since the idea was first postponed in 8.3, with 
the benefit of some fsync absorption improvements along the way, too.


The automatic timing is modeled on the existing 
checkpoint_completion_target concept, except with a new tunable (not yet 
added as a GUC) currently called CheckPointSyncTarget, set to 0.8 right 
now.  What I think I want to do is make the existing 
checkpoint_completion_target now be the target for the end of the sync 
phase, matching its name; people who bumped it up won't necessarily even 
have to change anything.  Then the new guc can be 
checkpoint_write_target, representing the target that is in there right now.


I tossed the earlier idea of counting relations to sync based on the 
write phase data as too inaccurate after testing, and with it for now 
goes checkpoint sorting.  Instead, I just take a first pass over 
pendingOpsTable to get a total number of things to sync, which will 
always match the real count barring strange circumstances (like dropping 
a table).


As for the automatically determining the interval, I take the number of 
syncs that have finished so far, divide by the total, and get a number 
between 0.0 and 1.0 that represents progress on the sync phase.  I then 
use the same basic CheckpointWriteDelay logic that is there for 
spreading writes out, except with the new sync target.  I realized that 
if we assume the checkpoint writes should have finished in 
CheckPointCompletionTarget worth of time or segments, we can compute a 
new progress metric with the formula:


progress = CheckPointCompletionTarget + (1.0 - 
CheckPointCompletionTarget) * finished / goal;


Where "finished" is the number of segments written out, while "goal" is 
the total.  To turn this into an example, let's say the default 
parameters are set, we've finished the writes, and finished 1 out of 4 
syncs; that much work will be considered:


progress = 0.5 + (1.0 - 0.5) * 1 / 4 = 0.625

On a scale that effectively aims to have sync work finished by 0.8.
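That formula and the worked example translate directly into code; the 
names here just mirror the text:

```c
/* progress = target + (1 - target) * finished / goal, where "finished"
 * and "goal" count sync operations.  With the default
 * checkpoint_completion_target of 0.5 and 1 of 4 syncs done, this
 * yields the 0.625 from the example above. */
static double
sync_progress(double completion_target, int finished, int goal)
{
    return completion_target +
           (1.0 - completion_target) * finished / (double) goal;
}
```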

I don't use quite the same logic as the CheckpointWriteDelay though.  It 
turns out the existing checkpoint_completion implementation doesn't 
always work like I thought it did, which provides some very interesting 
insight into why my attempts to work around checkpoint problems haven't 
worked as well as expected the last few years.  I thought that what it 
did was wait until an amount of time determined by the target was 
reached before it did the next write.  That's not quite it; what it 
actually does is check progress against the target, then sleep exactly 
one nap interval if it is ahead of schedule.  That is only the same 
thing if you have a lot of buffers to write relative to the amount of 
time involved.  There's some alternative logic if you don't have 
bgwriter_lru_maxpages set, but in the normal situation effectively it 
means that:


maximum write spread time = bgwriter_delay * checkpoint dirty blocks

No matter how far apart you try to spread the checkpoints.  Now, 
typically, when people run into these checkpoint spikes in production, 
reducing shared_buffers improves that.  But I now realize that doing so 
will then reduce the average number of dirty blocks participating in the 
checkpoint, and therefore potentially pull the spread down at the same 
time!  Also, if you try and tune bgwriter_delay down to get better 
background cleaning, you're also reducing the maximum spread.  Between 
this issue and the bad behavior when the fsync queue fills, no wonder 
this has been so hard to tune out of production systems.  At some point, 
the reduction in spread defeats further attempts to reduce the size of 
what's written at checkpoint time, by lowering the amount of data involved.
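To put a number on that cap: the writer takes at most one bgwriter_delay 
nap per dirty buffer while it's ahead of schedule, so the spread window 
works out as below (illustrative helper, not PostgreSQL code):

```c
/* Effective upper bound on checkpoint write spreading: one
 * bgwriter_delay nap per dirty block.  With the default 200 ms delay
 * and 1000 dirty blocks, the writes can be spread over at most
 * 200 seconds, no matter how far apart the checkpoints are. */
static double
max_write_spread_seconds(int bgwriter_delay_ms, int checkpoint_dirty_blocks)
{
    return (double) bgwriter_delay_ms * checkpoint_dirty_blocks / 1000.0;
}
```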


What I do instead is nap until just after the planned schedule, then 
execute the sync.  What ends up happening then is that there can be a 
long pause between the end of the write phase and 

Re: [HACKERS] Spread checkpoint sync

2011-02-04 Thread Greg Smith

Michael Banck wrote:

On Sat, Jan 15, 2011 at 05:47:24AM -0500, Greg Smith wrote:
  

For example, the pre-release Squeeze numbers we're seeing are awful so
far, but it's not really done yet either. 



Unfortunately, it does not look like Debian squeeze will change any more
(or has changed much since your post) at this point, except for maybe
further stable kernel updates.  


Which file system did you see those awful numbers on and could you maybe
go into some more detail?
  


Once the release comes out any day now I'll see if I can duplicate them 
on hardware I can talk about fully, and share the ZCAV graphs if it's 
still there.  The server I've been running all of the extended pgbench 
tests in this thread on is running Ubuntu simply as a temporary way to 
get 2.6.32 before Squeeze ships.  Last time I tried installing one of 
the Squeeze betas I didn't get anywhere; hoping the installer bug I ran 
into has been sorted when I try again.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books



Re: [HACKERS] Spread checkpoint sync

2011-02-04 Thread Greg Smith
As already mentioned in the broader discussion at 
http://archives.postgresql.org/message-id/4d4c4610.1030...@2ndquadrant.com 
, I'm seeing no solid performance swing in the checkpoint sorting code 
itself.  Better sometimes, worse others, but never by a large amount.


Here's what the statistics part derived from the sorted data looks like 
on a real checkpoint spike:


2011-02-04 07:02:51 EST: LOG:  checkpoint starting: xlog
2011-02-04 07:02:51 EST: DEBUG:  BufferSync 10 dirty blocks in 
relation.segment_fork 17216.0_2
2011-02-04 07:02:51 EST: DEBUG:  BufferSync 159 dirty blocks in 
relation.segment_fork 17216.0_1
2011-02-04 07:02:51 EST: DEBUG:  BufferSync 10 dirty blocks in 
relation.segment_fork 17216.3_0
2011-02-04 07:02:51 EST: DEBUG:  BufferSync 548 dirty blocks in 
relation.segment_fork 17216.4_0
2011-02-04 07:02:51 EST: DEBUG:  BufferSync 808 dirty blocks in 
relation.segment_fork 17216.5_0
2011-02-04 07:02:51 EST: DEBUG:  BufferSync 799 dirty blocks in 
relation.segment_fork 17216.6_0
2011-02-04 07:02:51 EST: DEBUG:  BufferSync 807 dirty blocks in 
relation.segment_fork 17216.7_0
2011-02-04 07:02:51 EST: DEBUG:  BufferSync 716 dirty blocks in 
relation.segment_fork 17216.8_0
2011-02-04 07:02:51 EST: DEBUG:  BufferSync 3857 buffers to write, 8 
total dirty segment file(s) expected to need sync
2011-02-04 07:03:31 EST: DEBUG:  checkpoint sync: number=1 
file=base/16384/17216.5 time=1324.614 msec
2011-02-04 07:03:31 EST: DEBUG:  checkpoint sync: number=2 
file=base/16384/17216.4 time=0.002 msec
2011-02-04 07:03:31 EST: DEBUG:  checkpoint sync: number=3 
file=base/16384/17216_fsm time=0.001 msec
2011-02-04 07:03:47 EST: DEBUG:  checkpoint sync: number=4 
file=base/16384/17216.10 time=16446.753 msec
2011-02-04 07:03:53 EST: DEBUG:  checkpoint sync: number=5 
file=base/16384/17216.8 time=5804.252 msec
2011-02-04 07:03:53 EST: DEBUG:  checkpoint sync: number=6 
file=base/16384/17216.7 time=0.001 msec
2011-02-04 07:03:54 EST: DEBUG:  compacted fsync request queue from 
32768 entries to 2 entries
2011-02-04 07:03:54 EST: CONTEXT:  writing block 1642223 of relation 
base/16384/17216
2011-02-04 07:04:00 EST: DEBUG:  checkpoint sync: number=7 
file=base/16384/17216.11 time=6350.577 msec
2011-02-04 07:04:00 EST: DEBUG:  checkpoint sync: number=8 
file=base/16384/17216.9 time=0.001 msec
2011-02-04 07:04:00 EST: DEBUG:  checkpoint sync: number=9 
file=base/16384/17216.6 time=0.001 msec
2011-02-04 07:04:00 EST: DEBUG:  checkpoint sync: number=10 
file=base/16384/17216.3 time=0.001 msec
2011-02-04 07:04:00 EST: DEBUG:  checkpoint sync: number=11 
file=base/16384/17216_vm time=0.001 msec
2011-02-04 07:04:00 EST: LOG:  checkpoint complete: wrote 3813 buffers 
(11.6%); 0 transaction log file(s) added, 0 removed, 64 recycled; 
write=39.073 s, sync=29.926 s, total=69.003 s; sync files=11, 
longest=16.446 s, average=2.720 s


You can see that it ran out of fsync absorption space in the middle of 
the sync phase, which is usually when compaction is needed, but the 
recent patch to fix that kicked in and did its thing.


Couple of observations:

-The total number of buffers I'm computing based on the checkpoint 
writes being sorted is not a perfect match to the number reported by the 
checkpoint complete status line.  Sometimes they are the same, 
sometimes not.  Not sure why yet.


-The estimate for "expected to need sync" computed as a by-product of 
the checkpoint sorting is not completely accurate either.  This 
particular one has a fairly large error in it, percentage-wise, being 
off by 3 with a total of 11.  Presumably these are absorbed fsync 
requests that were already queued up before the checkpoint even 
started.  So any time estimate I derive from this count is only 
going to be approximate.


-The order in which the sync phase processes files is unrelated to the 
order in which they are written out.  Note that 17216.10 here, the 
biggest victim (cause?) of the I/O spike, isn't even listed among the 
checkpoint writes!


The fuzziness here is a bit disconcerting, and I'll keep digging for why 
it happens.  But I don't see any reason not to continue forward using 
the rough count here to derive a nap time from, which I can then feed 
into the useful leftovers patch that Robert already refactored here.  
I can always sharpen up that estimate later; next I need to get some 
solid results I can share on what the delay time does to the 
throughput/latency pattern.
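As a hypothetical sketch of how a rough count could be turned into a nap 
time (the idea, not necessarily what the patch actually does):

```c
/* Divide the time budgeted for the sync phase evenly across the
 * segments expected to need sync, yielding a per-fsync nap. */
static double
sync_nap_seconds(double sync_phase_seconds, int segments_to_sync)
{
    if (segments_to_sync <= 0)
        return 0.0;
    return sync_phase_seconds / segments_to_sync;
}
```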


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books




Re: [HACKERS] Spread checkpoint sync

2011-02-04 Thread Robert Haas
On Fri, Feb 4, 2011 at 2:08 PM, Greg Smith g...@2ndquadrant.com wrote:
 -The total number of buffers I'm computing based on the checkpoint writes
 being sorted is not a perfect match to the number reported by the
 checkpoint complete status line.  Sometimes they are the same, sometimes
 not.  Not sure why yet.

My first guess would be that in the cases where it's not the same,
some backend evicted the buffer before the background writer got to
it.  That's expected under heavy contention for shared_buffers.

 -The estimate for expected to need sync computed as a by-product of the
 checkpoint sorting is not completely accurate either.  This particular one
 has a fairly large error in it, percentage-wise, being off by 3 with a total
 of 11.  Presumably these are absorbed fsync requests that were already
 queued up before the checkpoint even started.  So any time estimate I drive
 based off of this count is only going to be approximate.

As previously noted, I wonder if we ought to sync queued-up requests that
don't require writes before beginning the write phase.

 -The order in which the sync phase processes files is unrelated to the order
 in which they are written out.  Note that 17216.10 here, the biggest victim
 (cause?) of the I/O spike, isn't even listed among the checkpoint writes!

That's awful.  If more than 50% of the I/O is going to happen from one
fsync() call, that seems to put a pretty pessimal bound on how much
improvement we can hope to achieve here.  Or am I missing something?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Spread checkpoint sync

2011-02-03 Thread Michael Banck
On Sat, Jan 15, 2011 at 05:47:24AM -0500, Greg Smith wrote:
 For example, the pre-release Squeeze numbers we're seeing are awful so
 far, but it's not really done yet either. 

Unfortunately, it does not look like Debian squeeze will change any more
(or has changed much since your post) at this point, except for maybe
further stable kernel updates.  

Which file system did you see those awful numbers on and could you maybe
go into some more detail?


Thanks,

Michael

-- 
marco_g I did send an email to propose multithreading to
grub-devel on the first of april.
marco_g Unfortunately everyone thought I was serious ;-)



Re: [HACKERS] Spread checkpoint sync

2011-02-01 Thread Greg Smith

Greg Smith wrote:
I think the right way to compute relations to sync is to finish the 
sorted writes patch; I sent over a not-quite-right-yet update to it already


Attached update now makes much more sense than the misguided patch I 
submitted two weeks ago.  This takes the original sorted write code, 
first adjusting it so it only allocates the memory its tag structure is 
stored in once (in a kind of lazy way I can improve on right now).  It 
then computes a bunch of derived statistics from a single walk of the 
sorted data on each pass through.  Here's an example of what comes out:


DEBUG:  BufferSync 1 dirty blocks in relation.segment_fork 11809.0_0
DEBUG:  BufferSync 2 dirty blocks in relation.segment_fork 11811.0_0
DEBUG:  BufferSync 3 dirty blocks in relation.segment_fork 11812.0_0
DEBUG:  BufferSync 3 dirty blocks in relation.segment_fork 16496.0_0
DEBUG:  BufferSync 28 dirty blocks in relation.segment_fork 16499.0_0
DEBUG:  BufferSync 1 dirty blocks in relation.segment_fork 11638.0_0
DEBUG:  BufferSync 1 dirty blocks in relation.segment_fork 11640.0_0
DEBUG:  BufferSync 2 dirty blocks in relation.segment_fork 11641.0_0
DEBUG:  BufferSync 1 dirty blocks in relation.segment_fork 11642.0_0
DEBUG:  BufferSync 1 dirty blocks in relation.segment_fork 11644.0_0
DEBUG:  BufferSync 2048 dirty blocks in relation.segment_fork 16508.0_0
DEBUG:  BufferSync 1 dirty blocks in relation.segment_fork 11645.0_0
DEBUG:  BufferSync 1 dirty blocks in relation.segment_fork 11661.0_0
DEBUG:  BufferSync 1 dirty blocks in relation.segment_fork 11663.0_0
DEBUG:  BufferSync 1 dirty blocks in relation.segment_fork 11664.0_0
DEBUG:  BufferSync 1 dirty blocks in relation.segment_fork 11672.0_0
DEBUG:  BufferSync 1 dirty blocks in relation.segment_fork 11685.0_0
DEBUG:  BufferSync 2097 buffers to write, 17 total dirty segment file(s) 
expected to need sync


This is the first checkpoint after starting to populate a new pgbench 
database.  The next four show it extending into new segments:


DEBUG:  BufferSync 2048 dirty blocks in relation.segment_fork 16508.1_0
DEBUG:  BufferSync 2048 buffers to write, 1 total dirty segment file(s) 
expected to need sync


DEBUG:  BufferSync 2048 dirty blocks in relation.segment_fork 16508.2_0
DEBUG:  BufferSync 2048 buffers to write, 1 total dirty segment file(s) 
expected to need sync


DEBUG:  BufferSync 2048 dirty blocks in relation.segment_fork 16508.3_0
DEBUG:  BufferSync 2048 buffers to write, 1 total dirty segment file(s) 
expected to need sync


DEBUG:  BufferSync 2048 dirty blocks in relation.segment_fork 16508.4_0
DEBUG:  BufferSync 2048 buffers to write, 1 total dirty segment file(s) 
expected to need sync


The fact that it's always showing 2048 dirty blocks on these makes me 
think I'm computing something wrong still, but the general idea here is 
working now.  I had to use some magic from the md layer to let bufmgr.c 
know how its writes were going to get mapped into file segments, and 
correspondingly into fsync calls later.  Not happy about breaking the API 
encapsulation there, but don't see an easy way to compute that data at 
the per-segment level--and it's not like that's going to change in the 
near future anyway.


I like this approach for providing a map of how to spread syncs out 
for a couple of reasons:


-It computes data that could be used to drive sync spread timing in a 
relatively short amount of simple code.


-You get write sorting at the database level helping out the OS.  
Everything I've been seeing recently on benchmarks says Linux at least 
needs all the help it can get in that regard, even if block order 
doesn't necessarily align perfectly with disk order.


-It's obvious how to take this same data and build a future model where 
the time allocated for fsyncs was proportional to how much that 
particular relation was touched.


Benchmarks of just the impact of the sorting step and continued bug 
swatting to follow.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1f89e52..ef9df7d 100644
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***
*** 48,53 
--- 48,63 
  #include "utils/rel.h"
  #include "utils/resowner.h"
  
+ /*
+  * Checkpoint time mapping between the buffer id values and the associated
+  * buffer tags of dirty buffers to write
+  */
+ typedef struct BufAndTag
+ {
+ int buf_id;
+ BufferTag   tag;
+ 	BlockNumber	segNum;
+ } BufAndTag;
  
  /* Note: these two macros only work on shared buffers, not local ones! */
#define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
*** int			target_prefetch_pages = 0;
*** 78,83 
--- 88,96 
  static volatile BufferDesc *InProgressBuf = NULL;
  

Re: [HACKERS] Spread checkpoint sync

2011-02-01 Thread Robert Haas
On Mon, Jan 31, 2011 at 4:28 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 Back to the idea at hand - I proposed something a bit along these
 lines upthread, but my idea was to proactively perform the fsyncs on
 the relations that had gone the longest without a write, rather than
 the ones with the most dirty data.

 Yeah.  What I meant to suggest, but evidently didn't explain well, was
 to use that or something much like it as the rule for deciding *what* to
 fsync next, but to use amount-of-unsynced-data-versus-threshold as the
 method for deciding *when* to do the next fsync.

Oh, I see.  Yeah, that could be a good algorithm.

I also think Bruce's idea of calling fsync() on each relation just
*before* we start writing the pages from that relation might have some
merit.  (I'm assuming here that we are sorting the writes.)  That
should tend to result in the end-of-checkpoint fsyncs being quite
fast, because we'll only have as much dirty data floating around as we
actually wrote during the checkpoint, which according to Greg Smith is
usually a small fraction of the total data in need of flushing.  Also,
if one of the pre-write fsyncs takes a long time, then that'll get
factored into our calculations of how fast we need to write the
remaining data to finish the checkpoint on schedule.  Of course
there's still the possibility that the I/O system literally can't
finish a checkpoint in X minutes, but even in that case, the I/O
saturation will hopefully be more spread out across the entire
checkpoint instead of falling like a hammer at the very end.

Back to your idea: One problem with trying to bound the unflushed data
is that it's not clear what the bound should be.  I've had this mental
model where we want the OS to write out pages to disk, but that's not
always true, per Greg Smith's recent posts about Linux kernel tuning
slowing down VACUUM.  A possible advantage of the Momjian algorithm
(as it's known in the literature) is that we don't actually start
forcing anything out to disk until we have a reason to do so - namely,
an impending checkpoint.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Spread checkpoint sync

2011-02-01 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 
 I also think Bruce's idea of calling fsync() on each relation just
 *before* we start writing the pages from that relation might have
 some merit.
 
What bothers me about that is that you may have a lot of the same
dirty pages in the OS cache as the PostgreSQL cache, and you've just
ensured that the OS will write those *twice*.  I'm pretty sure that
the reason the aggressive background writer settings we use have not
caused any noticeable increase in OS disk writes is that many
PostgreSQL writes of the same buffer keep an OS buffer page from
becoming stale enough to get flushed until PostgreSQL writes to it
taper off.  Calling fsync() right before doing one last push of
the data could be really pessimal for some workloads.
 
-Kevin



Re: [HACKERS] Spread checkpoint sync

2011-02-01 Thread Bruce Momjian
Robert Haas wrote:
 Back to your idea: One problem with trying to bound the unflushed data
 is that it's not clear what the bound should be.  I've had this mental
 model where we want the OS to write out pages to disk, but that's not
 always true, per Greg Smith's recent posts about Linux kernel tuning
 slowing down VACUUM.  A possible advantage of the Momjian algorithm
 (as it's known in the literature) is that we don't actually start
 forcing anything out to disk until we have a reason to do so - namely,
 an impending checkpoint.

My trivial idea was:  let's assume we checkpoint every 10 minutes, and
it takes 5 minutes for us to write the data to the kernel.   If no one
else is writing to those files, we can safely wait maybe 5 more minutes
before issuing the fsync.  If, however, hundreds of writes are coming in
for the same files in those final 5 minutes, we should fsync right away.

My idea is that our delay between writes and fsync should somehow be
controlled by how many writes to the same files are coming to the kernel
while we are considering waiting because the only downside to delay is
the accumulation of non-critical writes coming into the kernel for the
same files we are going to fsync later.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Spread checkpoint sync

2011-02-01 Thread Bruce Momjian
Greg Smith wrote:
 Greg Smith wrote:
  I think the right way to compute relations to sync is to finish the 
  sorted writes patch I sent over a not quite right yet update to already
 
 Attached update now makes much more sense than the misguided patch I 
 submitted two weeks ago.  This takes the original sorted write code, 
 first adjusting it so it only allocates the memory its tag structure is 
 stored in once (in a kind of lazy way I can improve on right now).  It 
 then computes a bunch of derived statistics from a single walk of the 
 sorted data on each pass through.  Here's an example of what comes out:

In that patch, I would like to see a meta-comment explaining why the
sorting is happening and what we hope to gain.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Spread checkpoint sync

2011-02-01 Thread Robert Haas
On Tue, Feb 1, 2011 at 12:58 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Robert Haas robertmh...@gmail.com wrote:

 I also think Bruce's idea of calling fsync() on each relation just
 *before* we start writing the pages from that relation might have
 some merit.

 What bothers me about that is that you may have a lot of the same
 dirty pages in the OS cache as the PostgreSQL cache, and you've just
 ensured that the OS will write those *twice*.  I'm pretty sure that
 the reason the aggressive background writer settings we use have not
 caused any noticeable increase in OS disk writes is that many
 PostgreSQL writes of the same buffer keep an OS buffer page from
 becoming stale enough to get flushed until PostgreSQL writes to it
 taper off.  Calling fsync() right before doing one last push of
 the data could be really pessimal for some workloads.

I was thinking about what Greg reported here:

http://archives.postgresql.org/pgsql-hackers/2010-11/msg01387.php

If the amount of pre-checkpoint dirty data is 3GB and the checkpoint
is writing 250MB, then you shouldn't have all that many extra
writes... but you might have some, and that might be enough to send
the whole thing down the tubes.

InnoDB apparently handles this problem by advancing the redo pointer
in small steps instead of in large jumps.  AIUI, in addition to
tracking the LSN of each page, they also track the first-dirtied LSN.
That lets you checkpoint to an arbitrary LSN by flushing just the
pages with an older first-dirtied LSN.  So instead of doing a
checkpoint every hour, you might do a mini-checkpoint every 10
minutes.  Since the mini-checkpoints each need to flush less data,
they should be less disruptive than a full checkpoint.  But that, too,
will generate some extra writes.  Basically, any idea that involves
calling fsync() more often is going to tend to smooth out the I/O load
at the cost of some increase in the total number of writes.

If we don't want any increase at all in the number of writes,
spreading out the fsync() calls is pretty much the only other option.
I'm worried that even with good tuning that won't be enough to tamp
down the latency spikes.  But maybe it will be...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Spread checkpoint sync

2011-02-01 Thread Bruce Momjian
Kevin Grittner wrote:
 Robert Haas robertmh...@gmail.com wrote:
  
  I also think Bruce's idea of calling fsync() on each relation just
  *before* we start writing the pages from that relation might have
  some merit.
  
 What bothers me about that is that you may have a lot of the same
 dirty pages in the OS cache as the PostgreSQL cache, and you've just
 ensured that the OS will write those *twice*.  I'm pretty sure that
 the reason the aggressive background writer settings we use have not
 caused any noticeable increase in OS disk writes is that many
 PostgreSQL writes of the same buffer keep an OS buffer page from
 becoming stale enough to get flushed until PostgreSQL writes to it
 taper off.  Calling fsync() right before doing one last push of
 the data could be really pessimal for some workloads.

OK, maybe my idea needs to be adjusted and we should trigger an early
fsync if non-fsync writes are coming in for blocks _other_ than the ones
we already wrote for that checkpoint.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Spread checkpoint sync

2011-02-01 Thread Tom Lane
Bruce Momjian br...@momjian.us writes:
 My trivial idea was:  let's assume we checkpoint every 10 minutes, and
 it takes 5 minutes for us to write the data to the kernel.   If no one
 else is writing to those files, we can safely wait maybe 5 more minutes
 before issuing the fsync.  If, however, hundreds of writes are coming in
 for the same files in those final 5 minutes, we should fsync right away.

Huh?  I would surely hope we could assume that nobody but Postgres is
writing the database files?  Or are you considering that the bgwriter
doesn't know exactly what the backends are doing?  That's true, but
I still maintain that we should design the bgwriter's behavior on the
assumption that writes from backends are negligible.  Certainly the
backends aren't issuing fsyncs.

regards, tom lane



Re: [HACKERS] Spread checkpoint sync

2011-02-01 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian br...@momjian.us writes:
  My trivial idea was:  let's assume we checkpoint every 10 minutes, and
  it takes 5 minutes for us to write the data to the kernel.   If no one
  else is writing to those files, we can safely wait maybe 5 more minutes
  before issuing the fsync.  If, however, hundreds of writes are coming in
  for the same files in those final 5 minutes, we should fsync right away.
 
 Huh?  I would surely hope we could assume that nobody but Postgres is
 writing the database files?  Or are you considering that the bgwriter
 doesn't know exactly what the backends are doing?  That's true, but
 I still maintain that we should design the bgwriter's behavior on the
 assumption that writes from backends are negligible.  Certainly the
 backends aren't issuing fsyncs.

Right, no one else is writing but us.  When I said "no one else" I meant
no other bgwriter writes are going to the files we wrote as part of the
checkpoint, but have not fsync'ed yet.  I assume we have two write
streams --- the checkpoint writes, which we know at the start of the
checkpoint, and the bgwriter writes that are happening in an
unpredictable way based on database activity.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Spread checkpoint sync

2011-01-31 Thread Itagaki Takahiro
On Mon, Jan 31, 2011 at 13:41, Robert Haas robertmh...@gmail.com wrote:
 1. Absorb fsync requests a lot more often during the sync phase.
 2. Still try to run the cleaning scan during the sync phase.
 3. Pause for 3 seconds after every fsync.

 So if we want the checkpoint
 to finish in, say, 20 minutes, we can't know whether the write phase
 needs to be finished by minute 10 or 15 or 16 or 19 or only by 19:59.

We probably need deadline-based scheduling, as is being used in the
write() phase.  If we want to sync 100 files in 20 minutes, each file
should be sync'ed in 12 seconds if we assume each fsync takes the same
time.  If we had a better estimation algorithm (file size? dirty ratio?),
each fsync could have some weight factor.  But deadline-based scheduling
is still needed then.

BTW, we should not sleep in a full-speed checkpoint.  The CHECKPOINT
command, shutdown, pg_start_backup(), and some of the checkpoints during
recovery might not want to sleep.

-- 
Itagaki Takahiro



Re: [HACKERS] Spread checkpoint sync

2011-01-31 Thread Robert Haas
On Mon, Jan 31, 2011 at 3:04 AM, Itagaki Takahiro
itagaki.takah...@gmail.com wrote:
 On Mon, Jan 31, 2011 at 13:41, Robert Haas robertmh...@gmail.com wrote:
 1. Absorb fsync requests a lot more often during the sync phase.
 2. Still try to run the cleaning scan during the sync phase.
 3. Pause for 3 seconds after every fsync.

 So if we want the checkpoint
 to finish in, say, 20 minutes, we can't know whether the write phase
 needs to be finished by minute 10 or 15 or 16 or 19 or only by 19:59.

 We probably need deadline-based scheduling, as is being used in the
 write() phase.  If we want to sync 100 files in 20 minutes, each file
 should be sync'ed in 12 seconds if we assume each fsync takes the same
 time.  If we had a better estimation algorithm (file size? dirty ratio?),
 each fsync could have some weight factor.  But deadline-based scheduling
 is still needed then.

Right.  I think the problem is balancing the write and sync phases.
For example, if your operating system is very aggressively writing out
dirty pages to disk, then you want the write phase to be as long as
possible and the sync phase can be very short because there won't be
much work to do.  But if your operating system is caching lots of
stuff in memory and writing dirty pages out to disk only when
absolutely necessary, then the write phase could be relatively quick
without much hurting anything, but the sync phase will need to be long
to keep from crushing the I/O system.  The trouble is, we don't really
have an a priori way to know which it's doing.  Maybe we could try to
tune based on the behavior of previous checkpoints, but I'm wondering
if we oughtn't to take the cheesy path first and split
checkpoint_completion_target into checkpoint_write_target and
checkpoint_sync_target.  That's another parameter to set, but I'd
rather add a parameter that people have to play with to find the right
value than impose an arbitrary rule that creates unavoidable bad
performance in certain environments.

 BTW, we should not sleep in a full-speed checkpoint.  The CHECKPOINT
 command, shutdown, pg_start_backup(), and some of the checkpoints during
 recovery might not want to sleep.

Yeah, I think that's understood.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Spread checkpoint sync

2011-01-31 Thread Heikki Linnakangas

On 31.01.2011 16:44, Robert Haas wrote:

On Mon, Jan 31, 2011 at 3:04 AM, Itagaki Takahiro
itagaki.takah...@gmail.com  wrote:

On Mon, Jan 31, 2011 at 13:41, Robert Haasrobertmh...@gmail.com  wrote:

1. Absorb fsync requests a lot more often during the sync phase.
2. Still try to run the cleaning scan during the sync phase.
3. Pause for 3 seconds after every fsync.

So if we want the checkpoint
to finish in, say, 20 minutes, we can't know whether the write phase
needs to be finished by minute 10 or 15 or 16 or 19 or only by 19:59.


We probably need deadline-based scheduling, as is being used in the
write() phase.  If we want to sync 100 files in 20 minutes, each file
should be sync'ed in 12 seconds if we assume each fsync takes the same
time.  If we had a better estimation algorithm (file size? dirty ratio?),
each fsync could have some weight factor.  But deadline-based scheduling
is still needed then.


Right.  I think the problem is balancing the write and sync phases.
For example, if your operating system is very aggressively writing out
dirty pages to disk, then you want the write phase to be as long as
possible and the sync phase can be very short because there won't be
much work to do.  But if your operating system is caching lots of
stuff in memory and writing dirty pages out to disk only when
absolutely necessary, then the write phase could be relatively quick
without much hurting anything, but the sync phase will need to be long
to keep from crushing the I/O system.  The trouble is, we don't really
have an a priori way to know which it's doing.  Maybe we could try to
tune based on the behavior of previous checkpoints, ...


IMHO we should re-consider the patch to sort the writes.  Not so much 
because of the performance gain it gives, but because we can then 
re-arrange the fsyncs so that you write one file, then fsync it, then 
write the next file, and so on.  That way the time taken by the fsyncs 
is distributed between the writes, so we don't need to accurately 
estimate how long each will take.  If one fsync takes a long time, the 
writes that follow will just be done a bit faster to catch up.



... but I'm wondering
if we oughtn't to take the cheesy path first and split
checkpoint_completion_target into checkpoint_write_target and
checkpoint_sync_target.  That's another parameter to set, but I'd
rather add a parameter that people have to play with to find the right
value than impose an arbitrary rule that creates unavoidable bad
performance in certain environments.


That is of course simpler..

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Spread checkpoint sync

2011-01-31 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 IMHO we should re-consider the patch to sort the writes. Not so much 
 because of the performance gain that gives, but because we can then 
 re-arrange the fsyncs so that you write one file, then fsync it, then 
 write the next file and so on.

Isn't that going to make performance worse not better?  Generally you
want to give the kernel as much scheduling flexibility as possible,
which you do by issuing the write as far before the fsync as you can.
An arrangement like the above removes all cross-file scheduling freedom.
For example, if two files are on different spindles, you've just
guaranteed that no I/O overlap is possible.

 That way the time taken by the fsyncs
 is distributed between the writes,

That sounds like you have an entirely wrong mental model of where the
cost comes from.  Those times are not independent.

regards, tom lane



Re: [HACKERS] Spread checkpoint sync

2011-01-31 Thread Robert Haas
On Mon, Jan 31, 2011 at 11:29 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 IMHO we should re-consider the patch to sort the writes. Not so much
 because of the performance gain that gives, but because we can then
 re-arrange the fsyncs so that you write one file, then fsync it, then
 write the next file and so on.

 Isn't that going to make performance worse not better?  Generally you
 want to give the kernel as much scheduling flexibility as possible,
 which you do by issuing the write as far before the fsync as you can.
 An arrangement like the above removes all cross-file scheduling freedom.
 For example, if two files are on different spindles, you've just
 guaranteed that no I/O overlap is possible.

 That way the time taken by the fsyncs
 is distributed between the writes,

 That sounds like you have an entirely wrong mental model of where the
 cost comes from.  Those times are not independent.

Yeah, Greg Smith made the same point a week or three ago.  But it
seems to me that there is potential value in overlaying the write and
sync phases to some degree.  For example, if the write phase is spread
over 15 minutes and you have 30 files, then by, say, minute 7, it's
probably OK to flush the file you wrote first.  Waiting longer isn't
necessarily going to help - the kernel has probably written what it is
going to write without prodding.

In fact, it might be that on a busy system, you could lose by waiting
*too long* to perform the fsync.  The cleaning scan and/or backends
may kick out additional dirty buffers that will now have to get forced
down to disk, even though you don't really care about them (because
they were dirtied after the checkpoint write had already been done).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Spread checkpoint sync

2011-01-31 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Mon, Jan 31, 2011 at 11:29 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 That sounds like you have an entirely wrong mental model of where the
 cost comes from.  Those times are not independent.

 Yeah, Greg Smith made the same point a week or three ago.  But it
 seems to me that there is potential value in overlaying the write and
 sync phases to some degree.  For example, if the write phase is spread
 over 15 minutes and you have 30 files, then by, say, minute 7, it's
 probably OK to flush the file you wrote first.

Yeah, probably, but we can't do anything as stupid as file-by-file.

I wonder whether it'd be useful to keep track of the total amount of
data written-and-not-yet-synced, and to issue fsyncs often enough to
keep that below some parameter; the idea being that the parameter would
limit how much dirty kernel disk cache there is.  Of course, ideally the
kernel would have a similar tunable and this would be a waste of effort
on our part...

regards, tom lane



Re: [HACKERS] Spread checkpoint sync

2011-01-31 Thread Robert Haas
On Mon, Jan 31, 2011 at 11:51 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Mon, Jan 31, 2011 at 11:29 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 That sounds like you have an entirely wrong mental model of where the
 cost comes from.  Those times are not independent.

 Yeah, Greg Smith made the same point a week or three ago.  But it
 seems to me that there is potential value in overlaying the write and
 sync phases to some degree.  For example, if the write phase is spread
 over 15 minutes and you have 30 files, then by, say, minute 7, it's
 probably OK to flush the file you wrote first.

 Yeah, probably, but we can't do anything as stupid as file-by-file.

Eh?

 I wonder whether it'd be useful to keep track of the total amount of
 data written-and-not-yet-synced, and to issue fsyncs often enough to
 keep that below some parameter; the idea being that the parameter would
 limit how much dirty kernel disk cache there is.  Of course, ideally the
 kernel would have a similar tunable and this would be a waste of effort
 on our part...

It's not clear to me how you'd maintain that information without it
turning into a contention bottleneck.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Spread checkpoint sync

2011-01-31 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 3. Pause for 3 seconds after every fsync.

 I think something along the lines of #3 is probably a good idea,

Really?  Any particular delay is guaranteed wrong.

regards, tom lane



Re: [HACKERS] Spread checkpoint sync

2011-01-31 Thread Robert Haas
On Mon, Jan 31, 2011 at 12:01 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 3. Pause for 3 seconds after every fsync.

 I think something along the lines of #3 is probably a good idea,

 Really?  Any particular delay is guaranteed wrong.

What I was getting at was - I think it's probably a good idea not to
do the fsyncs at top speed, but I'm not too sure how they should be
spaced out.  I agree a fixed delay isn't necessarily right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Spread checkpoint sync

2011-01-31 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Mon, Jan 31, 2011 at 11:51 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 I wonder whether it'd be useful to keep track of the total amount of
 data written-and-not-yet-synced, and to issue fsyncs often enough to
 keep that below some parameter; the idea being that the parameter would
 limit how much dirty kernel disk cache there is.  Of course, ideally the
 kernel would have a similar tunable and this would be a waste of effort
 on our part...

 It's not clear to me how you'd maintain that information without it
 turning into a contention bottleneck.

What contention bottleneck?  I was just visualizing the bgwriter process
locally tracking how many writes it'd issued.  Backend-issued writes
should happen seldom enough to be ignorable for this purpose.

regards, tom lane



Re: [HACKERS] Spread checkpoint sync

2011-01-31 Thread Robert Haas
On Mon, Jan 31, 2011 at 12:11 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Mon, Jan 31, 2011 at 11:51 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 I wonder whether it'd be useful to keep track of the total amount of
 data written-and-not-yet-synced, and to issue fsyncs often enough to
 keep that below some parameter; the idea being that the parameter would
 limit how much dirty kernel disk cache there is.  Of course, ideally the
 kernel would have a similar tunable and this would be a waste of effort
 on our part...

 It's not clear to me how you'd maintain that information without it
 turning into a contention bottleneck.

 What contention bottleneck?  I was just visualizing the bgwriter process
 locally tracking how many writes it'd issued.  Backend-issued writes
 should happen seldom enough to be ignorable for this purpose.

Ah.  Well, if you ignore backend writes, then yes, there's no
contention bottleneck.  However, I seem to recall Greg Smith showing a
system at PGCon last year with a pretty respectable volume of backend
writes (30%?) and saying "OK, so here's a healthy system."  Perhaps
I'm misremembering.  But at any rate any backend that is using a
BufferAccessStrategy figures to do a lot of its own writes.  This is
probably an area for improvement in future releases, if we can figure
out how to do it: if we're doing a bulk load into a system with 4GB of
shared_buffers using a 16MB ring buffer, we'd ideally like the
background writer - or somebody other than the foreground process - to
go nuts on those buffers, writing them out as fast as it possibly can
- rather than letting the backend do it when the ring wraps around.

Back to the idea at hand - I proposed something a bit along these
lines upthread, but my idea was to proactively perform the fsyncs on
the relations that had gone the longest without a write, rather than
the ones with the most dirty data.  I'm not sure which is better.
Obviously, doing the ones that have gone idle gives the OS more time
to write out the data, but OTOH it might not succeed in purging much
dirty data.  Doing the ones with the most dirty data will definitely
reduce the size of the final checkpoint, but might also cause a
latency spike if it's triggered immediately after heavy write activity
on that file.
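The two policies can be sketched against the same hypothetical per-relation bookkeeping (these structures are invented for illustration, not taken from PostgreSQL); each policy just picks a different extreme:

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical per-relation pending-sync bookkeeping; illustrative only. */
typedef struct
{
	int			relfilenode;	/* which relation */
	uint64_t	dirty_bytes;	/* written but not yet synced */
	time_t		last_write;		/* when it was last written to */
} PendingSync;

/* Policy 1: sync the relation that has gone longest without a write,
 * giving the OS the most time to have flushed its data already. */
static int
pick_longest_idle(const PendingSync *p, size_t n)
{
	size_t		best = 0;

	for (size_t i = 1; i < n; i++)
		if (p[i].last_write < p[best].last_write)
			best = i;
	return p[best].relfilenode;
}

/* Policy 2: sync the relation with the most unsynced data, shrinking
 * the final checkpoint the most but risking a latency spike. */
static int
pick_most_dirty(const PendingSync *p, size_t n)
{
	size_t		best = 0;

	for (size_t i = 1; i < n; i++)
		if (p[i].dirty_bytes > p[best].dirty_bytes)
			best = i;
	return p[best].relfilenode;
}
```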

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Spread checkpoint sync

2011-01-31 Thread Bruce Momjian
Robert Haas wrote:
 Back to the idea at hand - I proposed something a bit along these
 lines upthread, but my idea was to proactively perform the fsyncs on
 the relations that had gone the longest without a write, rather than
 the ones with the most dirty data.  I'm not sure which is better.
 Obviously, doing the ones that have gone idle gives the OS more time
 to write out the data, but OTOH it might not succeed in purging much
 dirty data.  Doing the ones with the most dirty data will definitely
 reduce the size of the final checkpoint, but might also cause a
 latency spike if it's triggered immediately after heavy write activity
 on that file.

Crazy idea #2 --- it would be interesting if you issued an fsync
_before_ you wrote out data to a file that needed an fsync.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Spread checkpoint sync

2011-01-31 Thread Greg Smith

Tom Lane wrote:
 I wonder whether it'd be useful to keep track of the total amount of
 data written-and-not-yet-synced, and to issue fsyncs often enough to
 keep that below some parameter; the idea being that the parameter would
 limit how much dirty kernel disk cache there is.  Of course, ideally the
 kernel would have a similar tunable and this would be a waste of effort
 on our part...

I wanted to run the tests again before reporting in detail here, because 
the results are so bad, but I threw out an initial report about trying 
to push this down to be the kernel's job at 
http://blog.2ndquadrant.com/en/2011/01/tuning-linux-for-low-postgresq.html


So far it looks like the newish Linux dirty_bytes parameter works well 
at reducing latency by limiting how much dirty data can pile up before 
it gets nudged heavily toward disk.  But the throughput drop you pay on 
VACUUM in particular is brutal, I'm seeing over a 50% slowdown in some 
cases.  I suspect we need to let the regular cleaner and backend writes 
queue up in the largest possible cache for VACUUM, so it benefits as 
much as possible from elevator sorting of writes.  I suspect VACUUM being 
the worst case for a tightly controlled write cache is an unintended 
side-effect of the ring buffer implementation it uses now.


Right now I'm running the same tests on XFS instead of ext3, and those 
are just way more sensible all around; I'll revisit this on that 
filesystem and ext4.  The scale=500 tests I've been running lots of lately 
are a full 3X TPS faster on XFS relative to ext3, with about 1/8 as much 
worst-case latency.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books




Re: [HACKERS] Spread checkpoint sync

2011-01-31 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Back to the idea at hand - I proposed something a bit along these
 lines upthread, but my idea was to proactively perform the fsyncs on
 the relations that had gone the longest without a write, rather than
 the ones with the most dirty data.

Yeah.  What I meant to suggest, but evidently didn't explain well, was
to use that or something much like it as the rule for deciding *what* to
fsync next, but to use amount-of-unsynced-data-versus-threshold as the
method for deciding *when* to do the next fsync.

regards, tom lane



Re: [HACKERS] Spread checkpoint sync

2011-01-31 Thread Greg Smith

Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  3. Pause for 3 seconds after every fsync.
 
  I think something along the lines of #3 is probably a good idea,

 Really?  Any particular delay is guaranteed wrong.

"3 seconds" is just a placeholder for whatever comes out of a "total 
time scheduled to sync / relations to sync" computation.  (Still doing 
all my thinking in terms of time, although I recognize a showdown with 
segment-based checkpoints is coming too)


I think the right way to compute "relations to sync" is to finish the 
sorted writes patch (I already sent over a not quite right yet update), 
which is my next thing to work on here.  I remain pessimistic that any 
attempt to issue fsync calls without the maximum possible delay after 
asking the kernel to write things out first will work out well.  My recent 
tests with low values of dirty_bytes on Linux just reinforce how bad 
that can turn out.  In addition to computing the relation count while 
sorting them, placing writes in-order by relation and then doing all 
writes followed by all syncs should place the database right in the 
middle of the throughput/latency trade-off here.  It will have had the 
maximum amount of time we can give it to sort and flush writes for any 
given relation before it is asked to sync it.  I don't want to try and 
be any smarter than that without trying to be a *lot* smarter--timing 
individual sync calls, feedback loops on time estimation, etc.


At this point I have to agree with Robert's observation that splitting 
checkpoints into checkpoint_write_target and checkpoint_sync_target is 
the only reasonable thing left that might be possible to complete in a 
short period.  So that's how this can compute the total time numerator here.
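A sketch of that "total time scheduled to sync / relations to sync" computation (parameter names are illustrative, not actual GUCs): the sync phase gets the slice of the checkpoint interval between the write target and the sync target, divided evenly among the relations awaiting fsync.

```c
/* Illustrative only: compute the pause between fsync calls, given the
 * fraction of the checkpoint interval reserved for the sync phase. */
static double
sync_pause_secs(double checkpoint_timeout,	/* e.g. 300.0 seconds */
				double write_target,		/* e.g. 0.5: writes done by here */
				double sync_target,			/* e.g. 0.9: syncs done by here */
				int relations_to_sync)
{
	double		budget = checkpoint_timeout * (sync_target - write_target);

	if (relations_to_sync <= 0)
		return 0.0;
	return budget / relations_to_sync;
}
```

With a 5-minute checkpoint, targets of 0.5 and 0.9, and 40 relations to sync, this lands right at the 3-second placeholder.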


The main thing I will warn about in relation to discussion today is the 
danger of true dead-line oriented scheduling in this area.  The 
checkpoint process may discover the sync phase is falling behind 
expectations because the individual sync calls are taking longer than 
expected.  If that happens, aiming for the "finish on target anyway" 
goal puts you right back to a guaranteed nasty write spike again.  I 
think many people would prefer logging the overrun as tuning feedback 
for the DBA rather than to accelerate, which is likely to make the 
problem even worse if the checkpoint is falling behind.  But since 
ultimately the feedback for this will be "make the checkpoints longer or 
increase checkpoint_sync_target", sync acceleration to meet the deadline 
isn't unacceptable; the DBA can try both of those themselves if seeing spikes.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books



Re: [HACKERS] Spread checkpoint sync

2011-01-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith g...@2ndquadrant.com wrote:
 I've attached an updated version of the initial sync spreading patch here,
 one that applies cleanly on top of HEAD and over top of the sync
 instrumentation patch too.  The conflict that made that hard before is gone
 now.

With the fsync queue compaction patch applied, I think most of this is
now not needed.  Attached please find an attempt to isolate the
portion that looks like it might still be useful.  The basic idea of
what remains here is to make the background writer still do its normal
stuff even when it's checkpointing.  In particular, with this patch
applied, PG will:

1. Absorb fsync requests a lot more often during the sync phase.
2. Still try to run the cleaning scan during the sync phase.
3. Pause for 3 seconds after every fsync.

I suspect that #1 is probably a good idea.  It seems pretty clear
based on your previous testing that the fsync compaction patch should
be sufficient to prevent us from hitting the wall, but if we're going
to any kind of nontrivial work here then cleaning the queue is a
sensible thing to do along the way, and there's little downside.

I also suspect #2 is a good idea.  The fact that we're checkpointing
doesn't mean that the system suddenly doesn't require clean buffers,
and the experimentation I've done recently (see: limiting hint bit
I/O) convinces me that it's pretty expensive from a performance
standpoint when backends have to start writing out their own buffers,
so continuing to do that work during the sync phase of a checkpoint,
just as we do during the write phase, seems pretty sensible.

I think something along the lines of #3 is probably a good idea, but
the current coding doesn't take checkpoint_completion_target into
account.  The underlying problem here is that it's at least somewhat
reasonable to assume that if we write() a whole bunch of blocks, each
write() will take approximately the same amount of time.  But this is
not true at all with respect to fsync() - they neither take the same
amount of time as each other, nor is there any fixed ratio between
write() time and fsync() time to go by.  So if we want the checkpoint
to finish in, say, 20 minutes, we can't know whether the write phase
needs to be finished by minute 10 or 15 or 16 or 19 or only by 19:59.

One idea I have is to try to get some of the fsyncs out of the queue
at times other than end-of-checkpoint.  Even if this resulted in some
modest increase in the total number of fsync() calls, it might improve
performance by causing data to be flushed to disk in smaller chunks.
For example, suppose we kept an LRU list of pending fsync requests -
every time we remember an fsync request for a particular relation, we
move it to the head (hot end) of the LRU.  And periodically we pull
the tail entry off the list and fsync it - say, after
checkpoint_timeout / (# of items in the list).  That way, when we
arrive at the end of the checkpoint and starting syncing everything,
the syncs hopefully complete more quickly because we've already forced
a bunch of the data down to disk.  That algorithm may well be too
stupid or just not work in real life, but perhaps there's some
variation that would be sensible.  The point is: instead of or in
addition to trying to spread out the sync phase, we might want to
investigate whether it's possible to reduce its size.
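The LRU bookkeeping described above can be sketched with a small array-backed list (hypothetical code, not the actual fsync request machinery): remembering a request moves its relation to the hot end, and the periodic early-fsync step pops the cold end.

```c
#include <string.h>

#define MAX_PENDING 16

/* Hypothetical LRU of relations with pending fsyncs, hottest first;
 * illustrative only.  In the proposal, pull_coldest() would be called
 * roughly every checkpoint_timeout / npending seconds. */
static int	pending[MAX_PENDING];
static int	npending = 0;

/* Record an fsync request, moving the relation to the head of the LRU. */
static void
remember_fsync(int relfilenode)
{
	int			i;

	/* If already present, shift it out so it can be re-inserted at head. */
	for (i = 0; i < npending; i++)
		if (pending[i] == relfilenode)
			break;
	if (i == npending)
		npending++;				/* new entry; sketch assumes there is room */
	memmove(&pending[1], &pending[0], i * sizeof(int));
	pending[0] = relfilenode;
}

/* Pop the relation that has gone longest without a new request, or -1. */
static int
pull_coldest(void)
{
	if (npending == 0)
		return -1;
	return pending[--npending];
}
```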

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 4df69c2..36da084 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -726,6 +726,53 @@ CheckpointWriteDelay(int flags, double progress)
 }
 
 /*
+ * CheckpointSyncDelay -- yield control to bgwriter during a checkpoint
+ *
+ * This function is called after each file sync performed by mdsync().
+ * It is responsible for keeping the bgwriter's normal activities in
+ * progress during a long checkpoint.
+ */
+void
+CheckpointSyncDelay(void)
+{
+	pg_time_t	now;
+ 	pg_time_t	sync_start_time;
+ 	int			sync_delay_secs;
+ 
+ 	/*
+ 	 * Delay after each sync, in seconds.  This could be a parameter.  But
+ 	 * since ideally this will be auto-tuning in the near future, not
+	 * assigning it a GUC setting yet.
+ 	 */
+#define EXTRA_SYNC_DELAY	3
+
+	/* Do nothing if checkpoint is being executed by non-bgwriter process */
+	if (!am_bg_writer)
+		return;
+
+ 	sync_start_time = (pg_time_t) time(NULL);
+
+	/*
+	 * Perform the usual bgwriter duties.
+	 */
+ 	for (;;)
+ 	{
+		AbsorbFsyncRequests();
+ 		BgBufferSync();
+ 		CheckArchiveTimeout();
+ 		BgWriterNap();
+ 
+ 		/*
+ 		 * Are we there yet?
+ 		 */
+ 		now = (pg_time_t) time(NULL);
+ 		sync_delay_secs = now - sync_start_time;
+ 		if (sync_delay_secs >= EXTRA_SYNC_DELAY)
+			break;
+	}
+}
+
+/*
  * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint
  *		 in time?
  *
diff --git a/src/backend/storage/smgr/md.c 

Re: [HACKERS] Spread checkpoint sync

2011-01-29 Thread Robert Haas
On Fri, Jan 28, 2011 at 12:53 AM, Greg Smith g...@2ndquadrant.com wrote:
 While there are still very ugly maximum latency figures here in every case,
 these periods just aren't as wide with the patch in place.

OK, committed the patch, with some additional commenting, and after
fixing the compiler warning Chris Browne noticed.

 P.S. Yes, I know I have other review work to do as well.  Starting on the
 rest of that tomorrow.

*cracks whip*

Man, this thing doesn't work at all.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Spread checkpoint sync

2011-01-27 Thread Greg Smith

Greg Smith wrote:
 I think a helpful next step here would be to put Robert's fsync 
 compaction patch into here and see if that helps.  There are enough 
 backend syncs showing up in the difficult workloads (scale=1000, 
 clients >= 32) that its impact should be obvious.


Initial tests show everything expected from this change and more.  This 
took me a while to isolate because of issues where the filesystem 
involved degraded over time, giving a heavy bias toward a faster first 
test run, before anything was fragmented.  I just had to do a whole new 
mkfs on the database/xlog disks when switching between test sets in 
order to eliminate that.


At a scale of 500, I see the following average behavior:

Clients TPS backend-fsync
16 557 155
32 587 572
64 628 843
128 621 1442
256 632 2504

On one run through with the fsync compaction patch applied this turned into:

Clients TPS backend-fsync
16 637 0
32 621 0
64 721 0
128 716 0
256 841 0

So not only are all the backend fsyncs gone, there is a very clear TPS 
improvement too.  The change in results at >= 64 clients is well above 
the usual noise threshold in these tests. 

The problem where individual fsync calls during checkpoints can take a 
long time is not appreciably better.  But I think this will greatly 
reduce the odds of running into the truly dysfunctional breakdown, where 
checkpoint and backend fsync calls compete with one another, that caused 
the worst-case situation kicking off this whole line of research here.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books




Re: [HACKERS] Spread checkpoint sync

2011-01-27 Thread Robert Haas
On Thu, Jan 27, 2011 at 12:18 PM, Greg Smith g...@2ndquadrant.com wrote:
 Greg Smith wrote:

 I think a helpful next step here would be to put Robert's fsync compaction
 patch into here and see if that helps.  There are enough backend syncs
 showing up in the difficult workloads (scale=1000, clients >= 32) that its
 impact should be obvious.

 Initial tests show everything expected from this change and more.  This took
 me a while to isolate because of issues where the filesystem involved
 degraded over time, giving a heavy bias toward a faster first test run,
 before anything was fragmented.  I just had to do a whole new mkfs on the
 database/xlog disks when switching between test sets in order to eliminate
 that.

 At a scale of 500, I see the following average behavior:

 Clients TPS backend-fsync
 16 557 155
 32 587 572
 64 628 843
 128 621 1442
 256 632 2504

 On one run through with the fsync compaction patch applied this turned into:

 Clients TPS backend-fsync
 16 637 0
 32 621 0
 64 721 0
 128 716 0
 256 841 0

 So not only are all the backend fsyncs gone, there is a very clear TPS
 improvement too.  The change in results at >= 64 clients is well above the
 usual noise threshold in these tests.
 The problem where individual fsync calls during checkpoints can take a long
 time is not appreciably better.  But I think this will greatly reduce the
 odds of running into the truly dysfunctional breakdown, where checkpoint and
 backend fsync calls compete with one another, that caused the worst-case
 situation kicking off this whole line of research here.

Dude!  That's pretty cool.  Thanks for doing that measurement work -
that's really awesome.

Barring objections, I'll go ahead and commit my patch.

Based on what I saw looking at this, I'm thinking that the backend
fsyncs probably happen in clusters - IOW, it's not 2504 backend fsyncs
spread uniformly throughout the test, but clusters of 100 or more that
happen in very quick succession, followed by relief when the
background writer gets around to emptying the queue.  During each
cluster, the system probably slows way down, and then recovers when
the queue is emptied.  So the TPS improvement isn't at all a uniform
speedup, but simply relief from the stall that would otherwise result
from a full queue.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Spread checkpoint sync

2011-01-27 Thread Greg Smith

Robert Haas wrote:
 Based on what I saw looking at this, I'm thinking that the backend
 fsyncs probably happen in clusters - IOW, it's not 2504 backend fsyncs
 spread uniformly throughout the test, but clusters of 100 or more that
 happen in very quick succession, followed by relief when the
 background writer gets around to emptying the queue.

That's exactly the case.  You'll be running along fine, the queue will 
fill, and then hundreds of them can pile up in seconds.  Since the worst 
of that seemed to be during the sync phase of the checkpoint, adding 
additional queue management logic there is where we started.  I 
thought this compaction idea would be more difficult to implement than 
your patch proved to be though, so doing this first is working out quite 
well instead.


This is what all the log messages from the patch look like here, at 
scale=500 and shared_buffers=256MB:


DEBUG:  compacted fsync request queue from 32768 entries to 11 entries

That's an 8GB database, and from looking at the relative sizes I'm 
guessing 7 entries refer to the 1GB segments of the accounts table, 2 to 
its main index, and the other 2 are likely branches/tellers data.  Since 
I know the production system I ran into this on has about 400 file 
segments on it regularly dirtied and a higher shared_buffers than that, I 
expect this will demolish this class of problem on it, too.
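The effect of the compaction can be illustrated by collapsing duplicate (relation, segment) requests down to one each; this quadratic scan is a sketch of the outcome only, as the actual patch deduplicates the shared queue differently:

```c
#include <stddef.h>

/* Hypothetical fsync request: which relation segment needs syncing. */
typedef struct
{
	int			relfilenode;
	int			segno;
} FsyncRequest;

/* Compact the queue in place by dropping duplicate (relation, segment)
 * requests, keeping the first of each; returns the new length.  This is
 * how 32768 queued entries can collapse to a handful when a test only
 * touches a few files. */
static size_t
compact_requests(FsyncRequest *q, size_t n)
{
	size_t		out = 0;

	for (size_t i = 0; i < n; i++)
	{
		size_t		j;

		for (j = 0; j < out; j++)
			if (q[j].relfilenode == q[i].relfilenode &&
				q[j].segno == q[i].segno)
				break;
		if (j == out)
			q[out++] = q[i];
	}
	return out;
}
```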


I'll have all the TPS over time graphs available to publish by the end 
of my day here, including tests at a scale of 1000 as well.  Those 
should give a little more insight into how the patch is actually 
impacting high-level performance.  I don't dare disturb the ongoing 
tests by copying all that data out of there until they're finished, will 
be a few hours yet.


My only potential concern over committing this is that I haven't done a 
sanity check over whether it impacts the fsync mechanics in a way that 
might cause an issue.  Your assumptions there are documented and look 
reasonable on quick review; I just haven't had much time yet to look for 
flaws in them.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books




Re: [HACKERS] Spread checkpoint sync

2011-01-27 Thread Greg Smith

Robert Haas wrote:
 During each cluster, the system probably slows way down, and then recovers when
 the queue is emptied.  So the TPS improvement isn't at all a uniform
 speedup, but simply relief from the stall that would otherwise result
 from a full queue.

That does seem to be the case here.  
http://www.2ndquadrant.us/pgbench-results/index.htm now has results from 
a long test series, at two database scales that caused many backend 
fsyncs during earlier tests.  Set #5 is the existing server code, #6 is 
with the patch applied.  There are zero backend fsync calls with the 
patch applied, which isn't surprising given how simple the schema is on 
this test case.  An average TPS gain of 14% appears at a scale of 500 
and an 8% one at 1000; the attached CSV file summarizes the average 
figures for the archives.  The gains do appear to be from smoothing out 
the dead period that normally occur during the sync phase of the checkpoint.


For example, here are the fastest runs at scale=1000/clients=256 with 
and without the patch:


http://www.2ndquadrant.us/pgbench-results/436/index.html (tps=361)
http://www.2ndquadrant.us/pgbench-results/486/index.html (tps=380)

Here the difference in how much less of a slowdown there is around the 
checkpoint end points is really obvious, and obviously an improvement.  
You can see the same thing to a lesser extent at the other end of the 
scale; here's the fastest runs at scale=500/clients=16:


http://www.2ndquadrant.us/pgbench-results/402/index.html (tps=590)
http://www.2ndquadrant.us/pgbench-results/462/index.html (tps=643)

While there are still very ugly maximum latency figures here in every 
case, these periods just aren't as wide with the patch in place.


I'm moving onto some brief testing some of the newer kernel behavior 
here, then returning to testing the other checkpoint spreading ideas on 
top of this compaction patch, presuming something like it will end up 
being committed first.  I think it's safe to say I can throw away the 
changes to try and alter the fsync absorption code present in what I 
submitted before, as this scheme does a much better job of avoiding that 
problem than those earlier queue alteration ideas.  I'm glad Robert 
grabbed the right one from the pile of ideas I threw out for what else 
might help here.


P.S. Yes, I know I have other review work to do as well.  Starting on 
the rest of that tomorrow.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books

,,Unmodified,,Compacted Fsync,,,
scale,clients,tps,max_latency,tps,max_latency,TPS Gain,% Gain
500,16,557,17963.41,631,17116.31,74,13.3%
500,32,587,25838.8,655,24311.54,68,11.6%
500,64,628,35198.39,727,38040.39,99,15.8%
500,128,621,41001.91,687,48195.77,66,10.6%
500,256,632,49610.39,747,46799.48,115,18.2%
,,,
1000,16,306,39298.95,321,40826.58,15,4.9%
1000,32,314,40120.35,345,27910.51,31,9.9%
1000,64,334,46244.86,358,45138.1,24,7.2%
1000,128,343,72501.57,372,47125.46,29,8.5%
1000,256,321,80588.63,350,83232.14,29,9.0%



Re: [HACKERS] Spread checkpoint sync

2011-01-18 Thread Cédric Villemain
2011/1/18 Greg Smith g...@2ndquadrant.com:
 Bruce Momjian wrote:

 Should we be writing until 2:30 then sleep 30 seconds and fsync at 3:00?


 The idea of having a dead period doing no work at all between write phase
 and sync phase may have some merit.  I don't have enough test data yet on
 some more fundamental issues in this area to comment on whether that smaller
 optimization would be valuable.  It may be a worthwhile concept to throw
 into the sequencing.

Are we able to have some pause without strict rules like 'stop for 30
sec'?  (Case: my hardware is very good and I can write 400MB/sec with
no interruption, XXX IOPS.)

I wonder if we are not going to have issues with RAID firmware + BBU
+ Linux scheduler because we are adding 'unexpected' behavior in the
middle.

-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] Spread checkpoint sync

2011-01-18 Thread Greg Smith

Robert Haas wrote:
 Idea #4: For ext3 filesystems that like to dump the entire buffer
 cache instead of only the requested file, write a little daemon that
 runs alongside of (and completely independently of) PostgreSQL.  Every
 30 s, it opens a 1-byte file, changes the byte, fsyncs the file, and
 closes the file, thus dumping the cache and preventing a ridiculous
 growth in the amount of data to be sync'd at checkpoint time.

Today's data suggests this problem has been resolved in the latest 
kernels.  I saw the giant flush/series of small flushes pattern quite 
easily on the CentOS5 system I last did heavy pgbench testing on.  The 
one I'm testing now has kernel 2.6.32 (Ubuntu 10.04), and it doesn't 
show it at all.


Here's what a bad checkpoint looks like on this system:

LOG:  checkpoint starting: xlog
DEBUG:  checkpoint sync: number=1 file=base/24746/36596.8 time=7651.601 msec
DEBUG:  checkpoint sync: number=2 file=base/24746/36506 time=0.001 msec
DEBUG:  checkpoint sync: number=3 file=base/24746/36596.2 time=1891.695 msec
DEBUG:  checkpoint sync: number=4 file=base/24746/36596.4 time=7431.441 msec
DEBUG:  checkpoint sync: number=5 file=base/24746/36515 time=0.216 msec
DEBUG:  checkpoint sync: number=6 file=base/24746/36596.9 time=4422.892 msec
DEBUG:  checkpoint sync: number=7 file=base/24746/36596.12 time=954.242 msec
DEBUG:  checkpoint sync: number=8 file=base/24746/36237_fsm time=0.002 msec
DEBUG:  checkpoint sync: number=9 file=base/24746/36503 time=0.001 msec
DEBUG:  checkpoint sync: number=10 file=base/24746/36584 time=41.401 msec
DEBUG:  checkpoint sync: number=11 file=base/24746/36596.7 time=885.921 msec
DEBUG:  checkpoint sync: number=12 file=base/24813/30774 time=0.002 msec
DEBUG:  checkpoint sync: number=13 file=base/24813/24822 time=0.005 msec
DEBUG:  checkpoint sync: number=14 file=base/24746/36801 time=49.801 msec
DEBUG:  checkpoint sync: number=15 file=base/24746/36601.2 time=610.996 msec
DEBUG:  checkpoint sync: number=16 file=base/24746/36596 time=16154.361 msec
DEBUG:  checkpoint sync: number=17 file=base/24746/36503_vm time=0.001 msec
DEBUG:  checkpoint sync: number=18 file=base/24746/36508 time=0.000 msec
DEBUG:  checkpoint sync: number=19 file=base/24746/36596.10 time=9759.898 msec
DEBUG:  checkpoint sync: number=20 file=base/24746/36596.3 time=3392.727 msec

DEBUG:  checkpoint sync: number=21 file=base/24746/36237 time=0.150 msec
DEBUG:  checkpoint sync: number=22 file=base/24746/36596.11 time=9153.437 msec

DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 1057833 of relation base/24746/36596

[800 more of these]

DEBUG:  checkpoint sync: number=23 file=base/24746/36601.1 time=48697.179 msec

DEBUG:  could not forward fsync request because request queue is full
DEBUG:  checkpoint sync: number=24 file=base/24746/36597 time=0.080 msec
DEBUG:  checkpoint sync: number=25 file=base/24746/36237_vm time=0.001 msec
DEBUG:  checkpoint sync: number=26 file=base/24813/24822_fsm time=0.001 msec
DEBUG:  checkpoint sync: number=27 file=base/24746/36503_fsm time=0.000 msec
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 20619 of relation base/24746/36601
DEBUG:  checkpoint sync: number=28 file=base/24746/36506_fsm time=0.000 msec
DEBUG:  checkpoint sync: number=29 file=base/24746/36596_vm time=0.040 msec
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 278967 of relation base/24746/36596
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 1582400 of relation base/24746/36596
DEBUG:  checkpoint sync: number=30 file=base/24746/36596.6 time=0.002 msec
DEBUG:  checkpoint sync: number=31 file=base/24813/11647 time=0.004 msec
DEBUG:  checkpoint sync: number=32 file=base/24746/36601 time=201.632 msec
DEBUG:  checkpoint sync: number=33 file=base/24746/36801_fsm time=0.001 msec
DEBUG:  checkpoint sync: number=34 file=base/24746/36596.5 time=0.001 msec
DEBUG:  checkpoint sync: number=35 file=base/24746/36599 time=0.000 msec
DEBUG:  checkpoint sync: number=36 file=base/24746/36587 time=0.005 msec
DEBUG:  checkpoint sync: number=37 file=base/24746/36596_fsm time=0.001 msec
DEBUG:  checkpoint sync: number=38 file=base/24746/36596.1 time=0.001 msec
DEBUG:  checkpoint sync: number=39 file=base/24746/36801_vm time=0.001 msec
LOG:  checkpoint complete: wrote 9515 buffers (29.0%); 0 transaction log 
file(s) added, 0 removed, 64 recycled; write=32.409 s, sync=111.615 s, 
total=144.052 s; sync files=39, longest=48.697 s, average=2.853 s


Here the file that's been brutally delayed via backend contention is 
#23, after already seeing quite long delays on the earlier ones.  That 
I've never seen with earlier kernels running ext3.


This is good in that it makes it more likely a spread sync approach that 
works on XFS will also work on these newer kernels with ext4.  Then the 
only group we wouldn't be able to help if that works the ext3 

Re: [HACKERS] Spread checkpoint sync

2011-01-18 Thread Josh Berkus

 To be frank, I really don't care about fixing this behavior on ext3,
 especially in the context of that sort of hack.  That filesystem is not
 the future, it's not possible to ever really make it work right, and
 every minute spent on pandering to its limitations would be better spent
 elsewhere IMHO.  I'm starting with the ext3 benchmarks just to provide
 some proper context for the worst-case behavior people can see right
 now, and to make sure refactoring here doesn't make things worse on it. 
 My target is same or slightly better on ext3, much better on XFS and ext4.

Please don't forget that we need to avoid performance regressions on
NTFS and ZFS as well.  They don't need to improve, but we can't let them
regress.  I think we can ignore BSD/UFS and Solaris/UFS, as well as
HFS+, though.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Spread checkpoint sync

2011-01-17 Thread Bruce Momjian
Greg Smith wrote:
 One of the components to the write queue is some notion that writes that 
 have been waiting longest should eventually be flushed out.  Linux has 
 this number called dirty_expire_centisecs, which suggests it enforces 
 just that, set to a default of 30 seconds.  This is why some 5-minute 
 interval checkpoints with default parameters, effectively spreading the 
 checkpoint over 2.5 minutes, can work under the current design.  
 Anything you wrote at T+0 to T+2:00 *should* have been written out 
 already when you reach T+2:30 and sync.  Unfortunately, when the system 
 gets busy, there is this congestion control logic that basically 
 throws out any guarantee of writes starting shortly after the expiration 
 time.
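The timeline described above can be sketched numerically; this is a toy model assuming the default checkpoint_timeout of 5 minutes, checkpoint_completion_target of 0.5, and the kernel's default 30-second dirty-writeback expiration:

```python
# Toy timeline for the paragraph above, using default settings
# (checkpoint_timeout = 300 s, checkpoint_completion_target = 0.5) and
# Linux's default 30 s dirty-writeback expiration.
checkpoint_timeout = 300.0   # seconds between checkpoint starts
completion_target = 0.5      # fraction of the interval spent writing
dirty_expire = 30.0          # kernel writes back pages older than this

write_phase = checkpoint_timeout * completion_target  # T+2:30, sync begins
covered_until = write_phase - dirty_expire            # T+2:00

# Anything dirtied between T+0 and covered_until *should* already be on
# disk when the sync phase begins at write_phase -- unless congestion
# control has deferred the expired writebacks, as described above.
print(write_phase, covered_until)  # 150.0 120.0
```

The 120-second figure matches the T+0 to T+2:00 window in the text: only pages dirtied in the last 30 seconds of the write phase should still be pending when the sync starts.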

Should we be writing until 2:30 then sleep 30 seconds and fsync at 3:00?

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Spread checkpoint sync

2011-01-17 Thread Jeff Janes
On Sun, Jan 16, 2011 at 7:13 PM, Greg Smith g...@2ndquadrant.com wrote:
 I have finished a first run of benchmarking the current 9.1 code at various
 sizes.  See http://www.2ndquadrant.us/pgbench-results/index.htm for many
 details.  The interesting stuff is in Test Set 3, near the bottom.  That's
 the first one that includes buffer_backend_fsync data.  This is all on ext3 so
 far, but is using a newer 2.6.32 kernel, the one from Ubuntu 10.04.

 The results are classic Linux in 2010:  latency pauses from checkpoint sync
 will easily leave the system at a dead halt for a minute, with the worst one
 observed this time standing still for 108 seconds.  That one is weird, but
 these two are completely average cases:

 http://www.2ndquadrant.us/pgbench-results/210/index.html
 http://www.2ndquadrant.us/pgbench-results/215/index.html

 I think a helpful next step here would be to put Robert's fsync compaction
 patch into here and see if that helps.  There are enough backend syncs
 showing up in the difficult workloads (scale=1000, clients >= 32) that its
 impact should be obvious.

Have you ever tested Robert's other idea of having a metronome process
do a periodic fsync on a dummy file which is located on the same ext3fs
as the table files?  I think that that would be interesting to see.

Cheers,

Jeff



Re: [HACKERS] Spread checkpoint sync

2011-01-17 Thread Greg Smith

Jeff Janes wrote:

Have you ever tested Robert's other idea of having a metronome process
do a periodic fsync on a dummy file which is located on the same ext3fs
as the table files?  I think that that would be interesting to see.
  


To be frank, I really don't care about fixing this behavior on ext3, 
especially in the context of that sort of hack.  That filesystem is not 
the future, it's not possible to ever really make it work right, and 
every minute spent on pandering to its limitations would be better spent 
elsewhere IMHO.  I'm starting with the ext3 benchmarks just to provide 
some proper context for the worst-case behavior people can see right 
now, and to make sure refactoring here doesn't make things worse on it.  
My target is same or slightly better on ext3, much better on XFS and ext4.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books




Re: [HACKERS] Spread checkpoint sync

2011-01-17 Thread Jim Nasby
On Jan 15, 2011, at 8:15 AM, Robert Haas wrote:
 Well, the point of this is not to save time in the bgwriter - I'm not
 surprised to hear that wasn't noticeable.  The point is that when the
 fsync request queue fills up, backends start performing an fsync *for
 every block they write*, and that's about as bad for performance as
 it's possible to be.  So it's worth going to a little bit of trouble
 to try to make sure it doesn't happen.  It didn't happen *terribly*
 frequently before, but it does seem to be common enough to worry about
 - e.g. on one occasion, I was able to reproduce it just by running
 pgbench -i -s 25 or something like that on a laptop.

Wow, that's the kind of thing that would be incredibly difficult to figure out, 
especially while your production system is in flames... Can we change the ereport 
that happens in that case from DEBUG1 to WARNING? Or provide some other means 
to track it?
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] Spread checkpoint sync

2011-01-17 Thread Robert Haas
On Mon, Jan 17, 2011 at 6:07 PM, Jim Nasby j...@nasby.net wrote:
 On Jan 15, 2011, at 8:15 AM, Robert Haas wrote:
 Well, the point of this is not to save time in the bgwriter - I'm not
 surprised to hear that wasn't noticeable.  The point is that when the
 fsync request queue fills up, backends start performing an fsync *for
 every block they write*, and that's about as bad for performance as
 it's possible to be.  So it's worth going to a little bit of trouble
 to try to make sure it doesn't happen.  It didn't happen *terribly*
 frequently before, but it does seem to be common enough to worry about
 - e.g. on one occasion, I was able to reproduce it just by running
 pgbench -i -s 25 or something like that on a laptop.

 Wow, that's the kind of thing that would be incredibly difficult to figure 
 out, especially while your production system is in flames... Can we change 
 the ereport that happens in that case from DEBUG1 to WARNING? Or provide some 
 other means to track it?

Something like this?

http://git.postgresql.org/gitweb?p=postgresql.git;a=commit;h=3134d8863e8473e3ed791e27d484f9e548220411

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Spread checkpoint sync

2011-01-17 Thread Greg Smith

Jim Nasby wrote:

Wow, that's the kind of thing that would be incredibly difficult to figure out, 
especially while your production system is in flames... Can we change the ereport 
that happens in that case from DEBUG1 to WARNING? Or provide some other means 
to track it?


That's why we already added pg_stat_bgwriter.buffers_backend_fsync to 
track the problem before trying to improve it.  It was driving me crazy 
on a production server not having any visibility into when it happened.  
I haven't seen that we need anything beyond that so far.  In the context 
of this new patch for example, if you get to where a backend does its 
own sync, you'll know it did a compaction as part of that.  The existing 
statistic would tell you enough.


There's now enough data in test set 3 at 
http://www.2ndquadrant.us/pgbench-results/index.htm to start to see how 
this breaks down on a moderately big system (well, by most people's 
standards, but not Jim for whom this is still a toy).  Note the 
backend_sync column on the right, very end of the page; that's the 
relevant counter I'm commenting on:


scale=175:  Some backend fsync with 64 clients, 2/3 runs.
scale=250:  Significant backend fsync with 32 and 64 clients, every run.
scale=500:  Moderate to large backend fsync at any client count >= 16.  
This seems to be the worst spot of those mapped.  Above here, I would guess 
the TPS numbers start slowing enough that the fsync request queue 
activity drops, too.

scale=1000:  Backend fsync starting at 8 clients.
scale=2000:  Backend fsync starting at 16 clients.  By here I think the 
TPS volumes are getting low enough that clients are stuck significantly 
more often waiting for seeks rather than fsync.


Looks like the most effective spot for me to focus testing on with this 
server is scales of 500 and 1000, with 16 to 64 clients.  Now that I've 
got the scale fine tuned better, I may crank up the client counts too 
and see what that does.  I'm glad these are appearing in reasonable 
volume here though, was starting to get nervous about only having NDA 
restricted results to work against.  Some days you just have to cough up 
for your own hardware.


I just tagged pgbench-tools-0.6.0 and pushed to 
GitHub/git.postgresql.org with the changes that track and report on 
buffers_backend_fsync if anyone else wants to try this out.  It includes 
those numbers if you have a 9.1 with them, otherwise just reports 0 for 
it all the time; detection of the feature wasn't hard to add.  The end 
portion of a config file for the program (the first part specifies 
host/username info and the like) that would replicate the third test set 
here is:


MAX_WORKERS=4
SCRIPT=tpc-b.sql
SCALES=1 10 100 175 250 500 1000 2000
SETCLIENTS=4 8 16 32 64
SETTIMES=3
RUNTIME=600
TOTTRANS=

--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books




Re: [HACKERS] Spread checkpoint sync

2011-01-17 Thread Greg Smith

Bruce Momjian wrote:

Should we be writing until 2:30 then sleep 30 seconds and fsync at 3:00?
  


The idea of having a dead period doing no work at all between write 
phase and sync phase may have some merit.  I don't have enough test data 
yet on some more fundamental issues in this area to comment on whether 
that smaller optimization would be valuable.  It may be a worthwhile 
concept to throw into the sequencing.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books




Re: [HACKERS] Spread checkpoint sync

2011-01-16 Thread Jeff Janes
On Tue, Jan 11, 2011 at 5:27 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith g...@2ndquadrant.com wrote:
 One of the ideas Simon and I had been considering at one point was adding
 some better de-duplication logic to the fsync absorb code, which I'm
 reminded by the pattern here might be helpful independently of other
 improvements.

 Hopefully I'm not stepping on any toes here, but I thought this was an
 awfully good idea and had a chance to take a look at how hard it would
 be today while en route from point A to point B.  The answer turned
 out to be not very, so PFA a patch that seems to work.  I tested it
 by attaching gdb to the background writer while running pgbench, and
 it eliminated the backend fsyncs without even breaking a sweat.

I had been concerned about how long the lock would be held, and I was
pondering ways to do only partial deduplication to reduce the time.

But since you already wrote a patch to do the whole thing, I figured
I'd time it.

I arranged to test an instrumented version of your patch under large
shared_buffers of 4GB, conditions that would maximize the opportunity
for it to take a long time.  Running your compaction to go from 524288
to a handful (14 to 29, depending on run) took between 36 and 39
milliseconds.

For comparison, doing just the memcpy part of AbsorbFsyncRequest on
a full queue took from 24 to 27 milliseconds.

They are close enough to each other that I am no longer interested in
partial deduplication.  But both are long enough that I wonder if
having a hash table in shared memory that is kept unique automatically
at each update might not be worthwhile.
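The compaction being timed here can be modeled as a simple de-duplication pass; this is a toy sketch only, with tuples standing in for the real fsync request structs that the actual patch processes in the shared-memory queue under BgWriterCommLock:

```python
# Toy model of compacting a full fsync request queue: collapse duplicate
# requests while preserving first-seen order. The (rel, segment) tuples
# are stand-ins for the real request entries.
def compact(queue):
    seen = set()
    out = []
    for req in queue:
        if req not in seen:
            seen.add(req)
            out.append(req)
    return out

# A full queue whose 524288 entries touch only 14 distinct segments,
# mimicking the "524288 down to a handful" compaction measured above.
queue = [(i % 14, 0) for i in range(524288)]
compacted = compact(queue)
print(len(queue), len(compacted))  # 524288 14
```

A shared hash table, as suggested here, would keep the structure unique on every insert instead of paying for a full pass like this when the queue fills.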

Cheers,

Jeff



Re: [HACKERS] Spread checkpoint sync

2011-01-16 Thread Robert Haas
On Sun, Jan 16, 2011 at 7:32 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 But since you already wrote a patch to do the whole thing, I figured
 I'd time it.

Thanks!

 I arranged to test an instrumented version of your patch under large
 shared_buffers of 4GB, conditions that would maximize the opportunity
 for it to take a long time.  Running your compaction to go from 524288
 to a handful (14 to 29, depending on run) took between 36 and 39
 milliseconds.

 For comparison, doing just the memcpy part of AbsorbFsyncRequest on
 a full queue took from 24 to 27 milliseconds.

 They are close enough to each other that I am no longer interested in
 partial deduplication.  But both are long enough that I wonder if
 having a hash table in shared memory that is kept unique automatically
 at each update might not be worthwhile.

There are basically three operations that we care about here: (1) time
to add an fsync request to the queue, (2) time to absorb requests from
the queue, and (3) time to compact the queue.  The first is by far the
most common, and at least in any situation that anyone's analyzed so
far, the second will be far more common than the third.  Therefore, it
seems unwise to accept any slowdown in #1 to speed up either #2 or #3,
and a hash table probe is definitely going to be slower than what's
required to add an element under the status quo.

We could perhaps mitigate this by partitioning the hash table.
Alternatively, we could split the queue in half and maintain a global
variable - protected by the same lock - indicating which half is
currently open for insertions.  The background writer would grab the
lock, flip the global, release the lock, and then drain the half not
currently open to insertions; the next iteration would flush the other
half.  However, it's unclear to me that either of these things has any
value.  I can't remember any reports of contention on the
BgWriterCommLock, so it seems like changing the logic as minimally as
possible as the way to go.

(In contrast, note that the WAL insert lock, proc array lock, and lock
manager/buffer manager partition locks are all known to be heavily
contended in certain workloads.)
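The split-queue idea can be sketched minimally as follows; a Python lock stands in for BgWriterCommLock, and all names and structure here are illustrative rather than proposed code:

```python
# Sketch of the "two halves" idea: a flag protected by the same lock says
# which half accepts inserts; the drainer flips the flag under the lock,
# then empties the closed half without holding the lock.
import threading

class SplitQueue:
    def __init__(self):
        self.halves = ([], [])
        self.open_half = 0            # index of the half accepting inserts
        self.lock = threading.Lock()  # stands in for BgWriterCommLock

    def insert(self, req):
        with self.lock:
            self.halves[self.open_half].append(req)

    def drain(self):
        with self.lock:               # brief: just flip the flag
            closed = self.open_half
            self.open_half = 1 - self.open_half
        drained = list(self.halves[closed])  # drain without the lock
        self.halves[closed].clear()
        return drained

q = SplitQueue()
q.insert("A"); q.insert("B")
first = q.drain()     # ["A", "B"]; new inserts now go to the other half
q.insert("C")
second = q.drain()    # ["C"]
print(first, second)
```

The point of the design is that inserters only ever contend on the brief flag-flip, never on the drain itself.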

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Spread checkpoint sync

2011-01-16 Thread Greg Smith
I have finished a first run of benchmarking the current 9.1 code at 
various sizes.  See http://www.2ndquadrant.us/pgbench-results/index.htm 
for many details.  The interesting stuff is in Test Set 3, near the 
bottom.  That's the first one that includes buffer_backend_fsync data.  
This is all on ext3 so far, but is using a newer 2.6.32 kernel, the one 
from Ubuntu 10.04.


The results are classic Linux in 2010:  latency pauses from checkpoint 
sync will easily leave the system at a dead halt for a minute, with the 
worst one observed this time standing still for 108 seconds.  That one 
is weird, but these two are completely average cases:


http://www.2ndquadrant.us/pgbench-results/210/index.html
http://www.2ndquadrant.us/pgbench-results/215/index.html

I think a helpful next step here would be to put Robert's fsync 
compaction patch into here and see if that helps.  There are enough 
backend syncs showing up in the difficult workloads (scale=1000, 
clients >= 32) that its impact should be obvious.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books




Re: [HACKERS] Spread checkpoint sync

2011-01-16 Thread Robert Haas
On Sun, Jan 16, 2011 at 10:13 PM, Greg Smith g...@2ndquadrant.com wrote:
 I have finished a first run of benchmarking the current 9.1 code at various
 sizes.  See http://www.2ndquadrant.us/pgbench-results/index.htm for many
 details.  The interesting stuff is in Test Set 3, near the bottom.  That's
 the first one that includes buffer_backend_fsync data.  This is all on ext3 so
 far, but is using a newer 2.6.32 kernel, the one from Ubuntu 10.04.

 The results are classic Linux in 2010:  latency pauses from checkpoint sync
 will easily leave the system at a dead halt for a minute, with the worst one
 observed this time standing still for 108 seconds.

I wish I understood better what makes Linux systems freeze up under
heavy I/O load.  Linux - like other UNIX-like systems - generally has
reasonably effective mechanisms for preventing a single task from
monopolizing the (or a) CPU in the presence of other processes that
also wish to be time-sliced, but the same thing doesn't appear to be
true of I/O.

 I think a helpful next step here would be to put Robert's fsync compaction
 patch into here and see if that helps.  There are enough backend syncs
 showing up in the difficult workloads (scale=1000, clients >= 32) that its
 impact should be obvious.

Thanks for doing this work.  I look forward to the results.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Spread checkpoint sync

2011-01-15 Thread Greg Smith

Robert Haas wrote:

On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith g...@2ndquadrant.com wrote:
  

One of the ideas Simon and I had been considering at one point was adding
some better de-duplication logic to the fsync absorb code, which I'm
reminded by the pattern here might be helpful independently of other
improvements.



Hopefully I'm not stepping on any toes here, but I thought this was an
awfully good idea and had a chance to take a look at how hard it would
be today while en route from point A to point B.  The answer turned
out to be not very, so PFA a patch that seems to work.  I tested it
by attaching gdb to the background writer while running pgbench, and
it eliminated the backend fsyncs without even breaking a sweat.
  


No toe damage, this is great, I hadn't gotten to coding for this angle 
yet at all.  Suffering from an overload of ideas and (mostly wasted) 
test data, so thanks for exploring this concept and proving it works.


I'm not sure what to do with the rest of the work I've been doing in 
this area here, so I'm tempted to just combine this new bit from you 
with the older patch I submitted, streamline, and see if that makes 
sense.  My expectation of being there already, then spending just 5 minutes 
first checking out that autovacuum lock patch again, turned out to be a 
wild underestimate.


Part of the problem is that it's become obvious to me the last month 
that right now is a terrible time to be doing Linux benchmarks that 
impact filesystem sync behavior.  The recent kernel changes that are 
showing in the next rev of the enterprise distributions--like RHEL6 and 
Debian Squeeze both working to get a stable 2.6.32--have made testing a 
nightmare.  I don't want to dump a lot of time into optimizing for 
2.6.32 if this problem changes its form in newer kernels, but the 
distributions built around newer kernels are just not fully baked enough 
yet to tell.  For example, the pre-release Squeeze numbers we're seeing 
are awful so far, but it's not really done yet either.  I expect 3-6 
months from today, that all will have settled down enough that I can 
make some sense of it.  Lately my work with the latest distributions has 
just been burning time installing stuff that doesn't work quite right yet.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books



Re: [HACKERS] Spread checkpoint sync

2011-01-15 Thread Robert Haas
On Sat, Jan 15, 2011 at 5:47 AM, Greg Smith g...@2ndquadrant.com wrote:
 No toe damage, this is great, I hadn't gotten to coding for this angle yet
 at all.  Suffering from an overload of ideas and (mostly wasted) test data,
 so thanks for exploring this concept and proving it works.

Yeah - obviously I want to make sure that someone reviews the logic
carefully, since a loss of fsyncs or a corruption of the request queue
could affect system stability, but only very rarely, since you'd need
full fsync queue + crash.  But the code is pretty simple, so it should
be possible to convince ourselves as to its correctness (or
otherwise).  Obviously, major credit to you and Simon for identifying
the problem and coming up with a proposed fix.

 I'm not sure what to do with the rest of the work I've been doing in this
 area here, so I'm tempted to just combine this new bit from you with the
 older patch I submitted, streamline, and see if that makes sense.  My
 expectation of being there already, then spending just 5 minutes first checking
 out that autovacuum lock patch again, turned out to be a wild underestimate.

I'd rather not combine the patches, because this one is pretty simple
and just does one thing, but feel free to write something that applies
over top of it.  Looking through your old patch (sync-spread-v3),
there seem to be a couple of components there:

- Compact the fsync queue based on percentage fill rather than number
of writes per absorb.  If we apply my queue-compacting logic, do we
still need this?  The queue compaction may hold the BgWriterCommLock
for slightly longer than AbsorbFsyncRequests() would, but I'm not
inclined to jump to the conclusion that this is worth getting excited
about.  The whole idea of accessing BgWriterShmem->num_requests
without the lock gives me the willies anyway - sure, it'll probably
work OK most of the time, especially on x86, but it seems hard to
predict whether there will be occasional bad behavior on platforms
with weak memory ordering.

- Call pgstat_send_bgwriter() at the end of AbsorbFsyncRequests().
Not sure what the motivation for this is.

- CheckpointSyncDelay(), to make sure that we absorb fsync requests
and free up buffers during a long checkpoint.  I think this part is
clearly valuable, although I'm not sure we've satisfactorily solved
the problem of how to spread out the fsyncs and still complete the
checkpoint on schedule.

As to that, I have a couple of half-baked ideas I'll throw out so you
can laugh at them.  Some of these may be recycled versions of ideas
you've already had/mentioned, so, again, credit to you for getting the
ball rolling.

Idea #1: When we absorb fsync requests, don't just remember that there
was an fsync request; also remember the time of said fsync request.
If a new fsync request arrives for a segment for which we're already
remembering an fsync request, update the timestamp on the request.
Periodically scan the fsync request queue for requests older than,
say, 30 s, and perform one such request.   The idea is - if we wrote a
bunch of data to a relation and then haven't touched it for a while,
force it out to disk before the checkpoint actually starts so that the
volume of work required by the checkpoint is lessened.
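Idea #1 could be modeled roughly like this; the names, the dictionary representation, and the 30 s threshold are all illustrative, not the real absorb code:

```python
# Sketch of Idea #1: remember the time of each fsync request, refresh the
# timestamp when a duplicate arrives, and periodically flush requests that
# have gone untouched for longer than a threshold.
AGE_LIMIT = 30.0  # seconds; the "30 s" suggested above

pending = {}  # segment -> timestamp of most recent request

def absorb(segment, now):
    pending[segment] = now        # new request, or refresh of an old one

def flush_stale(now, do_fsync):
    for seg, t in list(pending.items()):
        if now - t >= AGE_LIMIT:
            do_fsync(seg)         # push it out before the checkpoint
            del pending[seg]

synced = []
absorb("rel_a.1", now=0.0)
absorb("rel_b.0", now=5.0)
absorb("rel_a.1", now=20.0)       # duplicate refreshes the timestamp
flush_stale(now=36.0, do_fsync=synced.append)
print(synced)  # ['rel_b.0']  (rel_a.1 was refreshed at t=20, only 16 s old)
```

The refresh-on-duplicate step is what keeps hot segments out of the early flush, so only relations that have truly gone quiet get synced ahead of the checkpoint.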

Idea #2: At the beginning of a checkpoint when we scan all the
buffers, count the number of buffers that need to be synced for each
relation.  Use the same hashtable that we use for tracking pending
fsync requests.  Then, interleave the writes and the fsyncs.  Start by
performing any fsyncs that need to happen but have no buffers to sync
(i.e. everything that must be written to that relation has already
been written).  Then, start performing the writes, decrementing the
pending-write counters as you go.  If the pending-write count for a
relation hits zero, you can add it to the list of fsyncs that can be
performed before the writes are finished.  If it doesn't hit zero
(perhaps because a non-bgwriter process wrote a buffer that we were
going to write anyway), then we'll do it at the end.  One problem with
this - aside from complexity - is that most likely most fsyncs would
either happen at the beginning or very near the end, because there's
no reason to assume that buffers for the same relation would be
clustered together in shared_buffers.  But I'm inclined to think that
at least the idea of performing fsyncs for which no dirty buffers
remain in shared_buffers at the beginning of the checkpoint rather
than at the end might have some value.
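A rough model of Idea #2's bookkeeping; the names are illustrative, and the real version would use the pending-fsync hashtable mentioned above rather than a Counter:

```python
# Sketch of Idea #2: count dirty buffers per relation at checkpoint start,
# fsync a relation as soon as its pending-write count hits zero, and sync
# relations with no remaining dirty buffers first.
from collections import Counter

def checkpoint(needs_fsync, dirty_buffers, write, fsync):
    pending = Counter(rel for rel, _ in dirty_buffers)
    # Relations needing an fsync but with nothing left to write go first.
    for rel in sorted(needs_fsync - set(pending)):
        fsync(rel)
    for rel, buf in dirty_buffers:
        write(rel, buf)
        pending[rel] -= 1
        if pending[rel] == 0:
            fsync(rel)            # interleaved with the remaining writes

log = []
checkpoint(
    needs_fsync={"a", "b", "c"},             # "c" already fully written
    dirty_buffers=[("a", 1), ("b", 1), ("a", 2)],
    write=lambda rel, buf: log.append(("write", rel, buf)),
    fsync=lambda rel: log.append(("fsync", rel)),
)
print(log)
# [('fsync', 'c'), ('write', 'a', 1), ('write', 'b', 1),
#  ('fsync', 'b'), ('write', 'a', 2), ('fsync', 'a')]
```

As the text notes, with buffers for one relation scattered through shared_buffers, most counts would only hit zero near the end, which is the weakness of this scheme.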

Idea #3: Stick with the idea of a fixed delay between fsyncs, but
compute how many fsyncs you think you're ultimately going to need at
the start of the checkpoint, and back up the target completion time by
3 s per fsync from the get-go, so that the checkpoint still finishes
on schedule.
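Idea #3 is just arithmetic; a sketch, assuming the 3 s per-fsync allowance suggested above:

```python
# Sketch of Idea #3: estimate the fsync count at checkpoint start and pull
# the write-phase completion target forward by a fixed allowance per fsync,
# so writes plus spread-out syncs still finish on schedule.
def write_phase_deadline(start, interval, completion_target,
                         n_fsyncs, per_fsync=3.0):
    nominal = start + interval * completion_target
    return nominal - n_fsyncs * per_fsync

# 300 s interval, target 0.5, 20 files to sync at 3 s each:
# the write phase must end 60 s early, at t = start + 90 s.
print(write_phase_deadline(0.0, 300.0, 0.5, 20))  # 90.0
```

The obvious cost is that a large fsync count eats directly into the write-spreading window, which is presumably why this is offered as a half-baked idea.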

Idea #4: For ext3 filesystems that like to dump the entire buffer
cache instead of only the requested file, write a little daemon that
runs alongside of (and completely independently of) PostgreSQL.  

Re: [HACKERS] Spread checkpoint sync

2011-01-15 Thread Simon Riggs
On Sat, 2011-01-15 at 05:47 -0500, Greg Smith wrote:
 Robert Haas wrote: 
  On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith g...@2ndquadrant.com wrote:

   One of the ideas Simon and I had been considering at one point was adding
   some better de-duplication logic to the fsync absorb code, which I'm
   reminded by the pattern here might be helpful independently of other
   improvements.
   
  
  Hopefully I'm not stepping on any toes here, but I thought this was an
  awfully good idea and had a chance to take a look at how hard it would
  be today while en route from point A to point B.  The answer turned
  out to be not very, so PFA a patch that seems to work.  I tested it
  by attaching gdb to the background writer while running pgbench, and
  it eliminated the backend fsyncs without even breaking a sweat.

 
 No toe damage, this is great, I hadn't gotten to coding for this angle
 yet at all.  Suffering from an overload of ideas and (mostly wasted)
 test data, so thanks for exploring this concept and proving it works.

No toe damage either, but are we sure we want the de-duplication logic,
and in this place?

I was originally of the opinion that de-duplicating the list would save
time in the bgwriter, but that guess was wrong by about two orders of
magnitude, IIRC. The extra time in the bgwriter wasn't even noticeable.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 




Re: [HACKERS] Spread checkpoint sync

2011-01-15 Thread Robert Haas
On Sat, Jan 15, 2011 at 8:55 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Sat, 2011-01-15 at 05:47 -0500, Greg Smith wrote:
 Robert Haas wrote:
  On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith g...@2ndquadrant.com wrote:
 
   One of the ideas Simon and I had been considering at one point was adding
   some better de-duplication logic to the fsync absorb code, which I'm
   reminded by the pattern here might be helpful independently of other
   improvements.
  
 
  Hopefully I'm not stepping on any toes here, but I thought this was an
  awfully good idea and had a chance to take a look at how hard it would
  be today while en route from point A to point B.  The answer turned
  out to be not very, so PFA a patch that seems to work.  I tested it
  by attaching gdb to the background writer while running pgbench, and
  it eliminated the backend fsyncs without even breaking a sweat.
 

 No toe damage, this is great, I hadn't gotten to coding for this angle
 yet at all.  Suffering from an overload of ideas and (mostly wasted)
 test data, so thanks for exploring this concept and proving it works.

 No toe damage either, but are we sure we want the de-duplication logic,
 and in this place?

 I was originally of the opinion that de-duplicating the list would save
 time in the bgwriter, but that guess was wrong by about two orders of
 magnitude, IIRC. The extra time in the bgwriter wasn't even noticeable.

Well, the point of this is not to save time in the bgwriter - I'm not
surprised to hear that wasn't noticeable.  The point is that when the
fsync request queue fills up, backends start performing an fsync *for
every block they write*, and that's about as bad for performance as
it's possible to be.  So it's worth going to a little bit of trouble
to try to make sure it doesn't happen.  It didn't happen *terribly*
frequently before, but it does seem to be common enough to worry about
- e.g. on one occasion, I was able to reproduce it just by running
pgbench -i -s 25 or something like that on a laptop.

With this patch applied, there's no performance impact vs. current
code in the very, very common case where space remains in the queue -
999 times out of 1000, writing to the fsync queue will be just as fast
as ever.  But in the unusual case where the queue has been filled up,
compacting the queue is much much faster than performing an fsync, and
the best part is that the compaction is generally massive.  I was
seeing things like 4096 entries compressed to 14.  So clearly even
if the compaction took as long as the fsync itself it would be worth
it, because the next 4000+ guys who come along again go through the
fast path.  But in fact I think it's much faster than an fsync.

In order to get pathological behavior even with this patch applied,
you'd need to have NBuffers pending fsync requests and they'd all have
to be different.  I don't think that's theoretically impossible, but
Greg's research seems to indicate that even on busy systems we don't
come even a little bit close to the circumstances that would cause it
to occur in practice.  Every other change we might make in this area
will further improve this case, too: for example, doing an absorb
after each fsync would presumably help, as would the more drastic step
of splitting the bgwriter into two background processes (one to do
background page cleaning, and the other to do checkpoints, for
example).  But even without those sorts of changes, I think this is
enough to effectively eliminate the full fsync queue problem in
practice, which seems worth doing independently of anything else.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Spread checkpoint sync

2011-01-15 Thread Greg Smith

Robert Haas wrote:

Idea #2: At the beginning of a checkpoint when we scan all the
buffers, count the number of buffers that need to be synced for each
relation.  Use the same hashtable that we use for tracking pending
fsync requests.  Then, interleave the writes and the fsyncs...

Idea #3: Stick with the idea of a fixed delay between fsyncs, but
compute how many fsyncs you think you're ultimately going to need at
the start of the checkpoint, and back up the target completion time by
3 s per fsync from the get-go, so that the checkpoint still finishes
on schedule.
  


What I've been working on is something halfway between these two ideas.  
I have a patch, and it doesn't work right yet because I just broke it, 
but since I have some faint hope this will all come together any minute 
now I'm going to share it before someone announces a deadline has passed 
or something.  (whistling).  I'm going to add this messy thing and the 
patch you submitted upthread to the CF list; I'll review yours, I'll 
either fix the remaining problem in this one myself or rewrite to one of 
your ideas, and then it's onto a round of benchmarking.


Once upon a time we got a patch from Itagaki Takahiro whose purpose was 
to sort writes before sending them out:


http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php

This didn't work reliably for everyone because of the now well 
understood ext3 issues--I never replicated that speedup at the time, for 
example.  And this was before the spread checkpoint code was in 8.3.  
The hope was that it wouldn't really be necessary after that anyway.


Back to today...instead of something complicated, it struck me that if I 
just had a count of exactly how many files were involved in each 
checkpoint, that would be helpful.  I could keep the idea of a fixed 
delay between fsyncs, but just auto-tune that delay amount based on the 
count.  And how do you count the number of unique things in a list?  
Well, you can always sort them.  I thought that if the sorted writes 
patch got back to functional again, it could serve two purposes.  It 
would group all of the writes for a file together, and if you did the 
syncs in the same sorted order they would have the maximum odds of 
discovering the data was already written.  So rather than this possible 
order:


table block
a 1
b 1
c 1
c 2
b 2
a 2
sync a
sync b
sync c

Which has very low odds of the sync on "a" finishing quickly, we'd get 
this one:


table block
a 1
a 2
b 1
b 2
c 1
c 2
sync a
sync b
sync c

Which sure seems like a reasonable way to improve the odds data has been 
written before the associated sync comes along.


Also, I could just traverse the sorted list with some simple logic to 
count the number of unique files, and then set the delay between fsync 
writes based on it.  In the above, once the list was sorted, easy to 
just see how many times the table name changes on a linear scan of the 
sorted data.  3 files, so if the checkpoint target gives me, say, a 
minute of time to sync them, I can delay 20 seconds between.  Simple 
math, and exactly the sort I used to get reasonable behavior on the busy 
production system this all started on.  There's some unresolved 
trickiness in the segment-driven checkpoint case, but one thing at a time.


So I fixed the bitrot on the old sorted patch, which was fun as it came 
from before the 8.3 changes.  It seemed to work.  I then moved the 
structure it uses to hold the list of buffers to write, the thing that's 
sorted, into shared memory.  It's got a predictable maximum size, 
relying on palloc in the middle of the checkpoint code seems bad, and 
there's some potential gain from not reallocating it every time through.


Somewhere along the way, it started doing this instead of what I wanted:

BadArgument(!(((header->context) != ((void *) 0) && 
(((Node *) (header->context))->type) == T_AllocSetContext)), File: 
"mcxt.c", Line: 589)


(that's from initdb, not a good sign)

And it's left me wondering whether this whole idea is a dead end I used 
up my window of time wandering down.


There's good bits in the patch I submitted for the last CF and in the 
patch you wrote earlier this week.  This unfinished patch may be a 
valuable idea to fit in there too once I fix it, or maybe it's 
fundamentally flawed and one of the other ideas you suggested (or I have 
sitting on the potential design list) will work better.  There's a patch 
integration problem that needs to be solved here, but I think almost all 
the individual pieces are available.  I'd hate to see this fail to get 
integrated now just for lack of time, considering the problem is so 
serious when you run into it.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books

diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 

Re: [HACKERS] Spread checkpoint sync

2011-01-15 Thread Robert Haas
On Sat, Jan 15, 2011 at 9:25 AM, Greg Smith g...@2ndquadrant.com wrote:
 Once upon a time we got a patch from Itagaki Takahiro whose purpose was to
 sort writes before sending them out:

 http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php

Ah, a fine idea!

 Which has very low odds of the sync on "a" finishing quickly, we'd get this
 one:

 table block
 a 1
 a 2
 b 1
 b 2
 c 1
 c 2
 sync a
 sync b
 sync c

 Which sure seems like a reasonable way to improve the odds data has been
 written before the associated sync comes along.

I'll believe it when I see it.  How about this:

a 1
a 2
sync a
b 1
b 2
sync b
c 1
c 2
sync c

Or maybe some variant, where we become willing to fsync a file a
certain number of seconds after writing the last block, or when all
the writes are done, whichever comes first.  It seems to me that it's
going to be a bear to figure out what fraction of the checkpoint
you've completed if you put all of the syncs at the end, and this
whole problem appears to be predicated on the assumption that the OS
*isn't* writing out in a timely fashion.  Are we sure that postponing
the fsync relative to the writes is anything more than wishful
thinking?

 Also, I could just traverse the sorted list with some simple logic to count
 the number of unique files, and then set the delay between fsync writes
 based on it.  In the above, once the list was sorted, easy to just see how
 many times the table name changes on a linear scan of the sorted data.  3
 files, so if the checkpoint target gives me, say, a minute of time to sync
 them, I can delay 20 seconds between.  Simple math, and exactly the sort I

How does the checkpoint target give you any time to sync them?  Unless
you squeeze the writes together more tightly, but that seems sketchy.

 So I fixed the bitrot on the old sorted patch, which was fun as it came from
 before the 8.3 changes.  It seemed to work.  I then moved the structure it
 uses to hold the list of buffers to write, the thing that's sorted, into
 shared memory.  It's got a predictable maximum size, relying on palloc in
 the middle of the checkpoint code seems bad, and there's some potential gain
 from not reallocating it every time through.

Well you don't have to put it in shared memory on account of any of
that.  You can just hang it on a global variable.

 There's good bits in the patch I submitted for the last CF and in the patch
 you wrote earlier this week.  This unfinished patch may be a valuable idea
 to fit in there too once I fix it, or maybe it's fundamentally flawed and
 one of the other ideas you suggested (or I have sitting on the potential
 design list) will work better.  There's a patch integration problem that
 needs to be solved here, but I think almost all the individual pieces are
 available.  I'd hate to see this fail to get integrated now just for lack of
 time, considering the problem is so serious when you run into it.

Likewise, but committing something half-baked is no good either.  I
think we're in a position to crush the full-fsync-queue problem flat
(my patch should do that, and there are several other obvious things
we can do for extra certainty) but the problem of spreading out the
fsyncs looks to me like something we don't completely know how to
solve.  If we can find something that's a modest improvement on the
status quo and we can be confident in quickly, good, but I'd rather
have 9.1 go out the door on time without fully fixing this than delay
the release.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Spread checkpoint sync

2011-01-15 Thread Greg Smith

Robert Haas wrote:

I'll believe it when I see it.  How about this:

a 1
a 2
sync a
b 1
b 2
sync b
c 1
c 2
sync c

Or maybe some variant, where we become willing to fsync a file a
certain number of seconds after writing the last block, or when all
the writes are done, whichever comes first.


That's going to give worse performance than the current code in some 
cases.  The goal of what's in there now is that you get a sequence like 
this:


a1
b1
a2
[Filesystem writes a1]
b2
[Filesystem writes b1]
sync a [Only has to write a2]
sync b [Only has to write b2]

This idea works until you get to where the filesystem write cache is so 
large that it becomes lazier about writing things.  The fundamental 
idea--push writes out some time before the sync, in hopes the filesystem 
will get to them before the sync arrives--is not unsound.  On some systems, 
doing the sync more aggressively than that will be a regression.  This 
approach just breaks down in some cases, and those cases are happening 
more now because their likelihood scales with total RAM.  I don't want 
to screw the people with smaller systems, who may be getting 
considerable benefit from the existing sequence.  Today's little 
systems--which are very similar to the high-end ones the spread 
checkpoint stuff was developed on during 8.3--do get some benefit from 
it as far as I know.


Anyway, now that the ability to get logging on all this stuff went in 
during the last CF, it's way easier to just set up a random system to run 
tests in this area than it used to be.  Whatever testing does happen 
should include, say, a 2GB laptop with a single hard drive in it.  I 
think that's the bottom of what is reasonable to consider a target for 
tweaking write performance on, given the hardware 9.1 is likely 
to be deployed on.



How does the checkpoint target give you any time to sync them?  Unless
you squeeze the writes together more tightly, but that seems sketchy.
  


Obviously the checkpoint target idea needs to be shuffled around some 
too.  I was thinking of making the new default 0.8, and having it split 
the time in half for write and sync.  That will make the write phase 
close to the speed people are seeing now, at the default of 0.5, while 
giving some window for spread sync too.  The exact way to redistribute 
that around I'm not so concerned about yet.  When I get to where that's 
the most uncertain thing left I'll benchmark the TPS vs. latency 
trade-off and see what happens.  If the rest of the code is good enough 
but this just needs to be tweaked, that's a perfect thing to get beta 
feedback to finalize.



Well you don't have to put it in shared memory on account of any of
that.  You can just hang it on a global variable.
  


Hmm.  Because it's so similar to other things being allocated in shared 
memory, I just automatically pushed it over to there.  But you're right; 
it doesn't need to be that complicated.  Nobody is touching it but the 
background writer.



If we can find something that's a modest improvement on the
status quo and we can be confident in quickly, good, but I'd rather
have 9.1 go out the door on time without fully fixing this than delay
the release.
  


I'm not somebody who needs to be convinced of that.  There are two 
near-commit-quality pieces of this out there now:


1) Keep some BGW cleaning and fsync absorption going while sync is 
happening, rather than starting it and ignoring everything else until 
it's done.


2) Compact fsync requests when the queue fills

If that's all we can get for 9.1, it will still be a major improvement.  
I realize I only have a very short period of time to complete a major 
integration breakthrough on the pieces floating around before the goal 
here has to drop to something less ambitious.  I head to the West Coast 
for a week on the 23rd; I'll be forced to throw in the towel at that 
point if I can't get the better ideas we have in pieces here all 
assembled well by then.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books




Re: [HACKERS] Spread checkpoint sync

2011-01-15 Thread Simon Riggs
On Sat, 2011-01-15 at 09:15 -0500, Robert Haas wrote:
 On Sat, Jan 15, 2011 at 8:55 AM, Simon Riggs si...@2ndquadrant.com wrote:
  On Sat, 2011-01-15 at 05:47 -0500, Greg Smith wrote:
  Robert Haas wrote:
   On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith g...@2ndquadrant.com wrote:
  
One of the ideas Simon and I had been considering at one point was 
adding
some better de-duplication logic to the fsync absorb code, which I'm
reminded by the pattern here might be helpful independently of other
improvements.
   
  
   Hopefully I'm not stepping on any toes here, but I thought this was an
   awfully good idea and had a chance to take a look at how hard it would
   be today while en route from point A to point B.  The answer turned
   out to be not very, so PFA a patch that seems to work.  I tested it
   by attaching gdb to the background writer while running pgbench, and
   it eliminate the backend fsyncs without even breaking a sweat.
  
 
  No toe damage, this is great, I hadn't gotten to coding for this angle
  yet at all.  Suffering from an overload of ideas and (mostly wasted)
  test data, so thanks for exploring this concept and proving it works.
 
  No toe damage either, but are we sure we want the de-duplication logic
  and in this place?
 
  I was originally of the opinion that de-duplicating the list would save
  time in the bgwriter, but that guess was wrong by about two orders of
  magnitude, IIRC. The extra time in the bgwriter wasn't even noticeable.
 
 Well, the point of this is not to save time in the bgwriter - I'm not
 surprised to hear that wasn't noticeable.  The point is that when the
 fsync request queue fills up, backends start performing an fsync *for
 every block they write*, and that's about as bad for performance as
 it's possible to be.  So it's worth going to a little bit of trouble
 to try to make sure it doesn't happen.  It didn't happen *terribly*
 frequently before, but it does seem to be common enough to worry about
 - e.g. on one occasion, I was able to reproduce it just by running
 pgbench -i -s 25 or something like that on a laptop.
 
 With this patch applied, there's no performance impact vs. current
 code in the very, very common case where space remains in the queue -
 999 times out of 1000, writing to the fsync queue will be just as fast
 as ever.  But in the unusual case where the queue has been filled up,
 compacting the queue is much much faster than performing an fsync, and
 the best part is that the compaction is generally massive.  I was
 seeing things like 4096 entries compressed to 14.  So clearly even
 if the compaction took as long as the fsync itself it would be worth
 it, because the next 4000+ guys who come along again go through the
 fast path.  But in fact I think it's much faster than an fsync.
 
 In order to get pathological behavior even with this patch applied,
 you'd need to have NBuffers pending fsync requests and they'd all have
 to be different.  I don't think that's theoretically impossible, but
 Greg's research seems to indicate that even on busy systems we don't
 come even a little bit close to the circumstances that would cause it
 to occur in practice.  Every other change we might make in this area
 will further improve this case, too: for example, doing an absorb
 after each fsync would presumably help, as would the more drastic step
 of splitting the bgwriter into two background processes (one to do
 background page cleaning, and the other to do checkpoints, for
 example).  But even without those sorts of changes, I think this is
 enough to effectively eliminate the full fsync queue problem in
 practice, which seems worth doing independently of anything else.

You've persuaded me.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 




Re: [HACKERS] Spread checkpoint sync

2011-01-15 Thread Robert Haas
On Sat, Jan 15, 2011 at 10:31 AM, Greg Smith g...@2ndquadrant.com wrote:
 That's going to give worse performance than the current code in some cases.

OK.

 How does the checkpoint target give you any time to sync them?  Unless
 you squeeze the writes together more tightly, but that seems sketchy.

 Obviously the checkpoint target idea needs to be shuffled around some too.
  I was thinking of making the new default 0.8, and having it split the time
 in half for write and sync.  That will make the write phase close to the
 speed people are seeing now, at the default of 0.5, while giving some window
 for spread sync too.  The exact way to redistribute that around I'm not so
 concerned about yet.  When I get to where that's the most uncertain thing
 left I'll benchmark the TPS vs. latency trade-off and see what happens.  If
 the rest of the code is good enough but this just needs to be tweaked,
 that's a perfect thing to get beta feedback to finalize.

That seems like a bad idea - don't we routinely recommend that people
crank this up to 0.9?  You'd be effectively bounding the upper range
of this setting to a value less than the lowest value we
recommend anyone use today.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Spread checkpoint sync

2011-01-15 Thread Greg Smith

Robert Haas wrote:

That seems like a bad idea - don't we routinely recommend that people
crank this up to 0.9?  You'd be effectively bounding the upper range
of this setting to a value less than the lowest value we
recommend anyone use today.
  


I was just giving an example of how I might do an initial split.  
There's a checkpoint happening now at time T; we have a rough idea that 
it needs to be finished before some upcoming time T+D.  Currently with 
default parameters this becomes:


Write:  0.5 * D; Sync:  0

Even though Sync obviously doesn't take zero.  The slop here is enough 
that it usually works anyway.


I was suggesting that a quick reshuffling to:

Write:  0.4 * D; Sync:  0.4 * D

Might be a good first candidate for how to split the time up better.  
The fact that this gives less writing time than the current biggest 
spread possible:


Write:  0.9 * D; Sync: 0

Is true.  It's also true that in the case where sync time really is 
zero, this new default would spread writes less than the current 
default.  I think that's optimistic, but it could happen if checkpoints 
are small and you have a good write cache.


Step back from that a second though.  Ultimately, the person who is 
getting checkpoints at a 5 minute interval, and is being nailed by 
spikes, should have the option of just increasing the parameters to make 
that interval bigger.  First you increase the measly default segments to 
a reasonable range, then checkpoint_completion_target is the second one 
you can try.  But from there, you quickly move onto making 
checkpoint_timeout longer.  At some point, there is no option but to 
give up checkpoints every 5 minutes as being practical, and make the 
average interval longer.


Whether or not a refactoring here makes things slightly worse for cases 
closer to the default doesn't bother me too much.  What bothers me is 
the way trying to stretch checkpoints out further fails to deliver as 
well as it should.  I'd be OK with saying to get the exact same spread 
situation as in older versions, you may need to retarget for checkpoints 
every 6 minutes *if* in the process I get a much better sync latency 
distribution in most cases.


Here's an interesting data point from the customer site this all started 
at, one I don't think they'll mind sharing since it helps make the 
situation more clear to the community.  After applying this code to 
spread sync out, in order to get their server back to functional we had 
to move all the parameters involved up to where checkpoints were spaced 
35 minutes apart.  It just wasn't possible to write any faster than that 
without disrupting foreground activity. 

The whole current model where people think of this stuff in terms of 
segments and completion targets is a UI disaster.  The direction I want 
to go in is where users can say "make sure checkpoints happen every N 
minutes", and something reasonable happens without additional parameter 
fiddling.  And if the resulting checkpoint I/O spike is too big, they 
just increase the timeout to N+1 or N*2 to spread the checkpoint 
further.  Getting too bogged down thinking in terms of the current, 
really terrible interface is something I'm trying to break myself of.  
Long-term, I want there to be checkpoint_timeout, and all the other 
parameters are gone, replaced by an internal implementation of the best 
practices proven to work even on busy systems.  I don't have as much 
clarity on exactly what that best practice is the way that, say, I just 
suggested exactly how to eliminate wal_buffers as an important thing to 
manually set.  But that's the dream UI:  you shoot for a checkpoint 
interval, and something reasonable happens; if that's too intense, you 
just increase the interval to spread further.  There probably will be 
small performance regression possible vs. the current code with 
parameter combination that happen to work well on it.  Preserving every 
one of those is something that's not as important to me as making the 
tuning interface simple and clear.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Spread checkpoint sync

2011-01-15 Thread Marti Raudsepp
On Sat, Jan 15, 2011 at 14:05, Robert Haas robertmh...@gmail.com wrote:
 Idea #4: For ext3 filesystems that like to dump the entire buffer
 cache instead of only the requested file, write a little daemon that
 runs alongside of (and completely independently of) PostgreSQL.  Every
 30 s, it opens a 1-byte file, changes the byte, fsyncs the file, and
 closes the file, thus dumping the cache and preventing a ridiculous
 growth in the amount of data to be sync'd at checkpoint time.

Wouldn't it be easier to just mount in data=writeback mode? This
provides a similar level of journaling as most other file systems.

Regards,
Marti



Re: [HACKERS] Spread checkpoint sync

2011-01-15 Thread Robert Haas
On Sat, Jan 15, 2011 at 5:57 PM, Greg Smith g...@2ndquadrant.com wrote:
 I was just giving an example of how I might do an initial split.  There's a
 checkpoint happening now at time T; we have a rough idea that it needs to be
 finished before some upcoming time T+D.  Currently with default parameters
 this becomes:

 Write:  0.5 * D; Sync:  0

 Even though Sync obviously doesn't take zero.  The slop here is enough that
 it usually works anyway.

 I was suggesting that a quick reshuffling to:

 Write:  0.4 * D; Sync:  0.4 * D

 Might be a good first candidate for how to split the time up better.

What is the basis for thinking that the sync should get the same
amount of time as the writes?  That seems pretty arbitrary.  Right
now, you're allowing 3 seconds per fsync, which could be a lot more or
a lot less than 40% of the total checkpoint time, but I have a pretty
clear sense of why that's a sensible thing to try: you give the rest
of the system a moment or two to get some I/O done for something other
than the checkpoint before flushing the next batch of buffers.  But
the checkpoint activity is always going to be spikey if it does
anything at all, so spacing it out *more* isn't obviously useful.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Spread checkpoint sync

2011-01-15 Thread Greg Smith

Robert Haas wrote:

What is the basis for thinking that the sync should get the same
amount of time as the writes?  That seems pretty arbitrary.  Right
now, you're allowing 3 seconds per fsync, which could be a lot more or
a lot less than 40% of the total checkpoint time...


Just that it's where I ended up at when fighting with this for a month 
on the system I've seen the most problems at.  The 3 second number was 
reverse-engineered from a computation that said aim for an interval of X 
minutes; we have Y relations on average involved in the checkpoint.  The 
direction my latest patch is struggling to go is computing a reasonable 
time automatically in the same way--count the relations, do a time 
estimate, add enough delay so the sync calls should be spread linearly 
over the given time range.




the checkpoint activity is always going to be spikey if it does
anything at all, so spacing it out *more* isn't obviously useful.
  


One of the components to the write queue is some notion that writes that 
have been waiting longest should eventually be flushed out.  Linux has 
this number called dirty_expire_centisecs which suggests it enforces 
just that, set to a default of 30 seconds.  This is why some 5-minute 
interval checkpoints with default parameters, effectively spreading the 
checkpoint over 2.5 minutes, can work under the current design.  
Anything you wrote at T+0 to T+2:00 *should* have been written out 
already when you reach T+2:30 and sync.  Unfortunately, when the system 
gets busy, there is this congestion control logic that basically 
throws out any guarantee of writes starting shortly after the expiration 
time.


It turns out that the only thing that really works are the tunables that 
block new writes from happening once the queue is full, but they can't 
be set low enough to work well in earlier kernels when combined with 
lots of RAM.  Using the terminology of 
http://www.mjmwired.net/kernel/Documentation/sysctl/vm.txt at some point 
you hit a point where a process generating disk writes will itself 
start writeback.  This is analogous to the PostgreSQL situation where 
backends do their own fsync calls.  The kernel will eventually move to 
where those trying to write new data are instead recruited into being 
additional sources of write flushing.  That's the part you just can't 
make aggressive enough on older kernels; dirty writers can always win.  
Ideally, the system never digs itself into a hole larger than you can 
afford to wait to write out.  It's a transaction speed vs. latency thing 
though, and the older kernels just don't consider the latency side well 
enough.


There is new mechanism in the latest kernels to control this much 
better:  dirty_bytes and dirty_background_bytes are the tunables.  I 
haven't had a chance to test yet.  As mentioned upthread, some of the 
bleeding-edge kernels that have this feature available are showing 
such large general performance regressions in our tests, compared to the 
boring old RHEL5 kernel, that whether this feature works or not is 
irrelevant.  I haven't tracked down which new kernel distributions work 
well performance-wise and which don't yet for PostgreSQL.


I'm hoping that when I get there, I'll see results like 
http://serverfault.com/questions/126413/limit-linux-background-flush-dirty-pages 
, where the ideal setting for dirty_bytes to keep latency under control 
with a BBWC was 15MB.  To put that into perspective, the lowest useful 
setting you can set dirty_ratio to is 5% of RAM.  That's 410MB on my 
measly 8GB desktop, and 3.3GB on the 64GB production server I've been 
trying to tune.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books




Re: [HACKERS] Spread checkpoint sync

2011-01-11 Thread Robert Haas
On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith g...@2ndquadrant.com wrote:
 Having the pg_stat_bgwriter.buffers_backend_fsync patch available all the
 time now has made me reconsider how important one potential bit of
 refactoring here would be.  I managed to catch one of the situations where
 really popular relations were being heavily updated in a way that was
 competing with the checkpoint on my test system (which I can happily share
 the logs of), with the instrumentation patch applied but not the spread sync
 one:

 LOG:  checkpoint starting: xlog
 DEBUG:  could not forward fsync request because request queue is full
 CONTEXT:  writing block 7747 of relation base/16424/16442
 DEBUG:  could not forward fsync request because request queue is full
 CONTEXT:  writing block 42688 of relation base/16424/16437
 DEBUG:  could not forward fsync request because request queue is full
 CONTEXT:  writing block 9723 of relation base/16424/16442
 DEBUG:  could not forward fsync request because request queue is full
 CONTEXT:  writing block 58117 of relation base/16424/16437
 DEBUG:  could not forward fsync request because request queue is full
 CONTEXT:  writing block 165128 of relation base/16424/16437
 [330 of these total, all referring to the same two relations]

 DEBUG:  checkpoint sync: number=1 file=base/16424/16448_fsm
 time=10132.83 msec
 DEBUG:  checkpoint sync: number=2 file=base/16424/11645 time=0.001000 msec
 DEBUG:  checkpoint sync: number=3 file=base/16424/16437 time=7.796000 msec
 DEBUG:  checkpoint sync: number=4 file=base/16424/16448 time=4.679000 msec
 DEBUG:  checkpoint sync: number=5 file=base/16424/11607 time=0.001000 msec
 DEBUG:  checkpoint sync: number=6 file=base/16424/16437.1 time=3.101000 msec
 DEBUG:  checkpoint sync: number=7 file=base/16424/16442 time=4.172000 msec
 DEBUG:  checkpoint sync: number=8 file=base/16424/16428_vm time=0.001000
 msec
 DEBUG:  checkpoint sync: number=9 file=base/16424/16437_fsm time=0.001000
 msec
 DEBUG:  checkpoint sync: number=10 file=base/16424/16428 time=0.001000 msec
 DEBUG:  checkpoint sync: number=11 file=base/16424/16425 time=0.00 msec
 DEBUG:  checkpoint sync: number=12 file=base/16424/16437_vm time=0.001000
 msec
 DEBUG:  checkpoint sync: number=13 file=base/16424/16425_vm time=0.001000
 msec
 LOG:  checkpoint complete: wrote 3032 buffers (74.0%); 0 transaction log
 file(s) added, 0 removed, 0 recycled; write=1.742 s, sync=10.153 s,
 total=37.654 s; sync files=13, longest=10.132 s, average=0.779 s

 Note here how the checkpoint was hung on trying to get 16448_fsm written
 out, but the backends were issuing constant competing fsync calls to these
 other relations.  This is very similar to the production case this patch was
 written to address, which I hadn't been able to share a good example of yet.
  That's essentially what it looks like, except with the contention going on
 for minutes instead of seconds.

 One of the ideas Simon and I had been considering at one point was adding
 some better de-duplication logic to the fsync absorb code, which I'm
 reminded by the pattern here might be helpful independently of other
 improvements.

Hopefully I'm not stepping on any toes here, but I thought this was an
awfully good idea and had a chance to take a look at how hard it would
be today while en route from point A to point B.  The answer turned
out to be not very, so PFA a patch that seems to work.  I tested it
by attaching gdb to the background writer while running pgbench, and
it eliminated the backend fsyncs without even breaking a sweat.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 4457df6..f6cd8dc 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -182,6 +182,7 @@ static void CheckArchiveTimeout(void);
 static void BgWriterNap(void);
 static bool IsCheckpointOnSchedule(double progress);
 static bool ImmediateCheckpointRequested(void);
+static bool CompactBgwriterRequestQueue(void);
 
 /* Signal handlers */
 
@@ -1090,10 +1091,20 @@ ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
 	/* Count all backend writes regardless of if they fit in the queue */
 	BgWriterShmem->num_backend_writes++;
 
+	/*
+	 * If the background writer isn't running or the request queue is full,
+	 * the backend will have to perform its own fsync request.  But before
+	 * forcing that to happen, we can try to compact the background writer
+	 * request queue.
+	 */
 	if (BgWriterShmem->bgwriter_pid == 0 ||
-		BgWriterShmem->num_requests >= BgWriterShmem->max_requests)
+		(BgWriterShmem->num_requests >= BgWriterShmem->max_requests &&
+		 !CompactBgwriterRequestQueue()))
 	{
-		/* Also count the subset where backends have to do their own fsync */
+		/*
+		 * Count the subset of writes where backends have to do their own
+		 * fsync
+		 */
 		BgWriterShmem->num_backend_fsync++;
 		

Re: [HACKERS] Spread checkpoint sync

2010-12-08 Thread Simon Riggs
On Mon, 2010-12-06 at 23:26 -0300, Alvaro Herrera wrote:

 Why would multiple bgwriter processes worry you?

Because it complicates the tracking of files requiring fsync.

As Greg says, the last attempt to do that was a lot of code.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Spread checkpoint sync

2010-12-07 Thread Greg Smith

Alvaro Herrera wrote:

Why would multiple bgwriter processes worry you?

Of course, it wouldn't work to have multiple processes trying to execute
a checkpoint simultaneously, but what if we separated the tasks so that
one process is in charge of checkpoints, and another one is in charge of
the LRU scan?
  


I was commenting more in the context of development resource 
allocation.  Moving toward that design would be helpful, but it alone 
isn't enough to improve the checkpoint sync issues.  My concern is that 
putting work into that area will be a distraction from making progress 
on those.  If individual syncs take so long that the background writer 
gets lost for a while executing them, and therefore doesn't do LRU 
cleanup, you've got a problem that LRU-related improvements probably 
aren't enough to solve.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support   www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books




Re: [HACKERS] Spread checkpoint sync

2010-12-06 Thread Alvaro Herrera
Excerpts from Greg Smith's message of Sun Dec 05 20:02:48 -0300 2010:

 What ends up happening if you push toward fully sync I/O is the design 
 you see in some other databases, where you need multiple writer 
 processes.  Then requests for new pages can continue to allocate as 
 needed, while keeping any one write from blocking things.  That's one 
 sort of a way to simulate asynchronous I/O, and you can substitute true 
 async I/O instead in many of those implementations.  We didn't have much 
 luck with portability on async I/O when that was last experimented with, 
 and having multiple background writer processes seems like overkill; 
 that whole direction worries me.

Why would multiple bgwriter processes worry you?

Of course, it wouldn't work to have multiple processes trying to execute
a checkpoint simultaneously, but what if we separated the tasks so that
one process is in charge of checkpoints, and another one is in charge of
the LRU scan?

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Spread checkpoint sync

2010-12-05 Thread Greg Smith

Heikki Linnakangas wrote:
If you fsync() a file with one dirty page in it, it's going to return 
very quickly, but a 1GB file will take a while. That could be 
problematic if you have a thousand small files and a couple of big 
ones, as you would want to reserve more time for the big ones. I'm not 
sure what to do about it, maybe it's not a problem in practice.


It's a problem in practice all right, with the bulk-loading situation 
being the main one you'll hit it.  If somebody is running a giant COPY 
to populate a table at the time the checkpoint starts, there's probably 
a 1GB file of dirty data that's unsynced around there somewhere.  I 
think doing anything about that situation requires an additional leap in 
thinking about buffer cache eviction and fsync absorption though.  
Ultimately I think we'll end up doing sync calls for relations that have 
gone cold for a while all the time as part of BGW activity, not just 
at checkpoint time, to try and avoid this whole area better.  That's a 
lot more than I'm trying to do in my first pass of improvements though.


In the interest of cutting the number of messy items left in the 
official CommitFest, I'm going to mark my patch here "Returned with 
Feedback" and continue working in the general direction I was already 
going.  Concept shared, underlying patches continue to advance, good 
discussion around it; those were my goals for this CF and I think we're 
there.


I have a good idea how to autotune the sync spread that's hardcoded in 
the current patch.  I'll work on finishing that up and organizing some 
more extensive performance tests.  Right now I'm more concerned about 
finishing the tests around the wal_sync_method issues, which are related 
to this and need to get sorted out a bit more urgently.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support   www.2ndQuadrant.us





Re: [HACKERS] Spread checkpoint sync

2010-12-05 Thread Rob Wultsch
On Sun, Dec 5, 2010 at 2:53 PM, Greg Smith g...@2ndquadrant.com wrote:
 Heikki Linnakangas wrote:

 If you fsync() a file with one dirty page in it, it's going to return very
 quickly, but a 1GB file will take a while. That could be problematic if you
 have a thousand small files and a couple of big ones, as you would want to
 reserve more time for the big ones. I'm not sure what to do about it, maybe
 it's not a problem in practice.

 It's a problem in practice all right, with the bulk-loading situation being
 the main one you'll hit it.  If somebody is running a giant COPY to populate
 a table at the time the checkpoint starts, there's probably a 1GB file of
 dirty data that's unsynced around there somewhere.  I think doing anything
 about that situation requires an additional leap in thinking about buffer
 cache eviction and fsync absorption though.  Ultimately I think we'll end
 up doing sync calls for relations that have gone cold for a while all the
 time as part of BGW activity, not just at checkpoint time, to try and avoid
 this whole area better.  That's a lot more than I'm trying to do in my first
 pass of improvements though.

 In the interest of cutting the number of messy items left in the official
 CommitFest, I'm going to mark my patch here "Returned with Feedback" and
 continue working in the general direction I was already going.  Concept
 shared, underlying patches continue to advance, good discussion around it;
 those were my goals for this CF and I think we're there.

 I have a good idea how to autotune the sync spread that's hardcoded in the
 current patch.  I'll work on finishing that up and organizing some more
 extensive performance tests.  Right now I'm more concerned about finishing
 the tests around the wal_sync_method issues, which are related to this and
 need to get sorted out a bit more urgently.

 --
 Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
 PostgreSQL Training, Services and Support        www.2ndQuadrant.us


Forgive me, but is all of this a step on the slippery slope to
direct io? And is this a bad thing?


-- 
Rob Wultsch
wult...@gmail.com



Re: [HACKERS] Spread checkpoint sync

2010-12-05 Thread Greg Smith

Rob Wultsch wrote:

Forgive me, but is all of this a step on the slippery slope to
direct io? And is this a bad thing?


I don't really think so.  There's an important difference in my head 
between direct I/O, where the kernel is told "write this immediately!", 
and what I'm trying to achieve.  I want to give the kernel an opportunity 
to write blocks out in an efficient way, so that it can take advantage 
of elevator sorting, write combining, and similar tricks.  But, 
eventually, those writes have to make it out to disk.  Linux claims to 
have concepts like a deadline for I/O to happen, but they turn out to 
not be so effective once the system gets backed up with enough writes.  
Since fsync time is the only effective deadline, I'm progressing from 
the standpoint that adjusting when it happens relative to the write will 
help, while still allowing the kernel an opportunity to get the writes 
out on its own schedule.


What ends up happening if you push toward fully sync I/O is the design 
you see in some other databases, where you need multiple writer 
processes.  Then requests for new pages can continue to allocate as 
needed, while keeping any one write from blocking things.  That's one 
sort of a way to simulate asynchronous I/O, and you can substitute true 
async I/O instead in many of those implementations.  We didn't have much 
luck with portability on async I/O when that was last experimented with, 
and having multiple background writer processes seems like overkill; 
that whole direction worries me.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us





Re: [HACKERS] Spread checkpoint sync

2010-12-04 Thread Greg Smith

Greg Stark wrote:

Using sync_file_range you can specify the set of blocks to sync and
then block on them only after some time has passed. But there's no
documentation on how this relates to the I/O scheduler so it's not
clear it would have any effect on the problem. 


I believe this is the exact spot we're stalled at in regards to getting 
this improved on the Linux side, as I understand it at least.  *The* 
answer for this class of problem on Linux is to use sync_file_range, and 
I don't think we'll ever get any sympathy from those kernel developers 
until we do.  But that's a Linux specific call, so doing that is going 
to add a write path fork with platform-specific code into the database.  
If I thought sync_file_range was a silver bullet guaranteed to make this 
better, maybe I'd go for that.  I think there's some relatively 
low-hanging fruit on the database side that would do better before going 
to that extreme though, thus the patch.



We might still have to delay the beginning of the sync to allow the dirty blocks 
to be synced
naturally and then when we issue it still end up catching a lot of
other i/o as well.
  


Whether it's lots or not is really workload dependent.  I work from 
the assumption that the blocks being written out by the checkpoint are 
the most popular ones in the database, the ones that accumulate a high 
usage count and stay there.  If that's true, my guess is that the writes 
being done while the checkpoint is executing are a bit less likely to be 
touching the same files.  You raise a valid concern, I just haven't seen 
that actually happen in practice yet.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us





Re: [HACKERS] Spread checkpoint sync

2010-12-02 Thread Greg Stark
On Wed, Dec 1, 2010 at 4:25 AM, Greg Smith g...@2ndquadrant.com wrote:
 I ask because I don't have a mental model of how the pause can help.
 Given that this dirty data has been hanging around for many minutes
 already, what is a 3 second pause going to heal?


 The difference is that once an fsync call is made, dirty data is much more
 likely to be forced out.  It's the one thing that bypasses all other ways
 the kernel might try to avoid writing the data

I had always assumed the problem was that other I/O had been done to
the files in the meantime. I.e. the fsync is not just syncing the
checkpoint but any other blocks that had been flushed since the
checkpoint started. The longer the checkpoint is spread over the more
other I/O is included as well.

Using sync_file_range you can specify the set of blocks to sync and
then block on them only after some time has passed. But there's no
documentation on how this relates to the I/O scheduler so it's not
clear it would have any effect on the problem. We might still have to
delay the beginning of the sync to allow the dirty blocks to be synced
naturally and then when we issue it still end up catching a lot of
other i/o as well.




-- 
greg



Re: [HACKERS] Spread checkpoint sync

2010-12-02 Thread Josh Berkus

 Using sync_file_range you can specify the set of blocks to sync and
 then block on them only after some time has passed. But there's no
 documentation on how this relates to the I/O scheduler so it's not
 clear it would have any effect on the problem. We might still have to
 delay the beginning of the sync to allow the dirty blocks to be synced
 naturally and then when we issue it still end up catching a lot of
 other i/o as well.

This *really* sounds like we should be working with the FS geeks on
making the OS do this work for us.  Greg, you wanna go to LinuxCon next
year?

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com



Re: [HACKERS] Spread checkpoint sync

2010-12-02 Thread Robert Haas
On Thu, Dec 2, 2010 at 2:24 PM, Greg Stark gsst...@mit.edu wrote:
 On Wed, Dec 1, 2010 at 4:25 AM, Greg Smith g...@2ndquadrant.com wrote:
 I ask because I don't have a mental model of how the pause can help.
 Given that this dirty data has been hanging around for many minutes
 already, what is a 3 second pause going to heal?


 The difference is that once an fsync call is made, dirty data is much more
 likely to be forced out.  It's the one thing that bypasses all other ways
 the kernel might try to avoid writing the data

 I had always assumed the problem was that other I/O had been done to
 the files in the meantime. I.e. the fsync is not just syncing the
 checkpoint but any other blocks that had been flushed since the
 checkpoint started.

It strikes me that we might start the syncs of the files that the
checkpoint isn't going to dirty further at the start of the
checkpoint, and do the rest at the end.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Spread checkpoint sync

2010-12-01 Thread Heikki Linnakangas

On 01.12.2010 06:25, Greg Smith wrote:

Jeff Janes wrote:

I ask because I don't have a mental model of how the pause can help.
Given that this dirty data has been hanging around for many minutes
already, what is a 3 second pause going to heal?


The difference is that once an fsync call is made, dirty data is much
more likely to be forced out. It's the one thing that bypasses all other
ways the kernel might try to avoid writing the data--both the dirty
ratio guidelines and the congestion control logic--and forces those
writes to happen as soon as they can be scheduled. If you graph the
amount of data shown as "Dirty:" by /proc/meminfo over time, once the sync
calls start happening it's like a descending staircase pattern, dropping
a little bit as each sync fires.


Do you have any idea how to autotune the delay between fsyncs?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Spread checkpoint sync

2010-12-01 Thread Greg Smith

Heikki Linnakangas wrote:

Do you have any idea how to autotune the delay between fsyncs?


I'm thinking to start by counting the number of relations that need them 
at the beginning of the checkpoint.  Then use the same basic math that 
drives the spread writes, where you assess whether you're on schedule or 
not based on segment/time progress relative to how many have been sync'd 
out of that total.  At a high level I think that idea translates over 
almost directly into the existing write spread code.  Was hoping for a 
sanity check from you in particular about whether that seems reasonable 
or not before diving into the coding.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books




Re: [HACKERS] Spread checkpoint sync

2010-12-01 Thread Heikki Linnakangas

On 01.12.2010 23:30, Greg Smith wrote:

Heikki Linnakangas wrote:

Do you have any idea how to autotune the delay between fsyncs?


I'm thinking to start by counting the number of relations that need them
at the beginning of the checkpoint. Then use the same basic math that
drives the spread writes, where you assess whether you're on schedule or
not based on segment/time progress relative to how many have been sync'd
out of that total. At a high level I think that idea translates over
almost directly into the existing write spread code. Was hoping for a
sanity check from you in particular about whether that seems reasonable
or not before diving into the coding.


Sounds reasonable to me. fsync()s are a lot less uniform than write()s, 
though. If you fsync() a file with one dirty page in it, it's going to 
return very quickly, but a 1GB file will take a while. That could be 
problematic if you have a thousand small files and a couple of big ones, 
as you would want to reserve more time for the big ones. I'm not sure 
what to do about it, maybe it's not a problem in practice.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Spread checkpoint sync

2010-11-30 Thread Greg Smith

Ron Mayer wrote:

Might smoother checkpoints be better solved by talking
to the OS vendors & virtual-memory-tuning-knob authors
to work with them on exposing the ideal knobs; rather than
saying that our only tool is a hammer (fsync) so the problem
must be handled as a nail.
  


Maybe, but it's hard to argue that the current implementation--just 
doing all of the sync calls as fast as possible, one after the other--is 
going to produce worst-case behavior in a lot of situations.  Given that 
it's not a huge amount of code to do better, I'd rather do some work in 
that direction, instead of presuming the kernel authors will ever make 
this go away.  Spreading the writes out as part of the checkpoint rework 
in 8.3 worked better than any kernel changes I've tested since then, and 
I'm not real optimistic about this getting resolved at the system level.  
So long as the database changes aren't antagonistic toward kernel 
improvements, I'd prefer to have some options here that become effective 
as soon as the database code is done.


I've attached an updated version of the initial sync spreading patch 
here, one that applies cleanly on top of HEAD and over top of the sync 
instrumentation patch too.  The conflict that made that hard before is 
gone now.


Having the pg_stat_bgwriter.buffers_backend_fsync patch available all 
the time now has made me reconsider how important one potential bit of 
refactoring here would be.  I managed to catch one of the situations 
where really popular relations were being heavily updated in a way that 
was competing with the checkpoint on my test system (which I can happily 
share the logs of), with the instrumentation patch applied but not the 
spread sync one:


LOG:  checkpoint starting: xlog
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 7747 of relation base/16424/16442
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 42688 of relation base/16424/16437
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 9723 of relation base/16424/16442
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 58117 of relation base/16424/16437
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 165128 of relation base/16424/16437
[330 of these total, all referring to the same two relations]

DEBUG:  checkpoint sync: number=1 file=base/16424/16448_fsm time=10132.83 msec

DEBUG:  checkpoint sync: number=2 file=base/16424/11645 time=0.001000 msec
DEBUG:  checkpoint sync: number=3 file=base/16424/16437 time=7.796000 msec
DEBUG:  checkpoint sync: number=4 file=base/16424/16448 time=4.679000 msec
DEBUG:  checkpoint sync: number=5 file=base/16424/11607 time=0.001000 msec
DEBUG:  checkpoint sync: number=6 file=base/16424/16437.1 time=3.101000 msec
DEBUG:  checkpoint sync: number=7 file=base/16424/16442 time=4.172000 msec
DEBUG:  checkpoint sync: number=8 file=base/16424/16428_vm time=0.001000 msec
DEBUG:  checkpoint sync: number=9 file=base/16424/16437_fsm time=0.001000 msec

DEBUG:  checkpoint sync: number=10 file=base/16424/16428 time=0.001000 msec
DEBUG:  checkpoint sync: number=11 file=base/16424/16425 time=0.00 msec
DEBUG:  checkpoint sync: number=12 file=base/16424/16437_vm time=0.001000 msec
DEBUG:  checkpoint sync: number=13 file=base/16424/16425_vm time=0.001000 msec
LOG:  checkpoint complete: wrote 3032 buffers (74.0%); 0 transaction log 
file(s) added, 0 removed, 0 recycled; write=1.742 s, sync=10.153 s, 
total=37.654 s; sync files=13, longest=10.132 s, average=0.779 s


Note here how the checkpoint was hung on trying to get 16448_fsm written 
out, but the backends were issuing constant competing fsync calls to 
these other relations.  This is very similar to the production case this 
patch was written to address, which I hadn't been able to share a good 
example of yet.  That's essentially what it looks like, except with the 
contention going on for minutes instead of seconds.


One of the ideas Simon and I had been considering at one point was 
adding some better de-duplication logic to the fsync absorb code, which 
I'm reminded by the pattern here might be helpful independently of other 
improvements.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 620b197..501cab8 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -143,8 +143,8 @@ typedef struct
 
 static BgWriterShmemStruct *BgWriterShmem;
 
-/* interval for calling AbsorbFsyncRequests in CheckpointWriteDelay */
-#define WRITES_PER_ABSORB		1000
+/* Fraction of fsync absorb queue that needs to be filled before acting */

Re: [HACKERS] Spread checkpoint sync

2010-11-30 Thread Josh Berkus

 Maybe, but it's hard to argue that the current implementation--just
 doing all of the sync calls as fast as possible, one after the other--is
 going to produce worst-case behavior in a lot of situations.  Given that
 it's not a huge amount of code to do better, I'd rather do some work in
 that direction, instead of presuming the kernel authors will ever make
 this go away.  Spreading the writes out as part of the checkpoint rework
 in 8.3 worked better than any kernel changes I've tested since then, and
 I'm not real optimistic about this getting resolved at the system level. 
 So long as the database changes aren't antagonistic toward kernel
 improvements, I'd prefer to have some options here that become effective
 as soon as the database code is done.

Besides, even if kernel/FS authors did improve things, the improvements
would not be available on production platforms for years.  And, for that
matter, while Linux and BSD are pretty responsive to our feedback,
Apple, Microsoft and Oracle are most definitely not.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com



Re: [HACKERS] Spread checkpoint sync

2010-11-30 Thread Jeff Janes
On Sun, Nov 14, 2010 at 3:48 PM, Greg Smith g...@2ndquadrant.com wrote:

...

 One change that turned out be necessary rather than optional--to get good
 performance from the system under tuning--was to make regular background
 writer activity, including fsync absorb checks, happen during these sync
 pauses.  The existing code ran the checkpoint sync work in a pretty tight
 loop, which as I alluded to in an earlier patch today can lead to the
 backends competing with the background writer to get their sync calls
 executed.  This squashes that problem if the background writer is setup
 properly.

Have you tested out this "absorb during syncing phase" code without
the sleep between the syncs?
I.e. so that it is still a tight loop, but the loop alternates between
sync and absorb, with no intentional pause?

I wonder if all the improvement you see might not be due entirely to
the absorb between syncs, and none or very little from
the sleep itself.

I ask because I don't have a mental model of how the pause can help.
Given that this dirty data has been hanging around for many minutes
already, what is a 3 second pause going to heal?

The healing power of clearing out the absorb queue seems much more obvious.

Cheers,

Jeff



Re: [HACKERS] Spread checkpoint sync

2010-11-30 Thread Greg Smith

Jeff Janes wrote:

Have you tested out this "absorb during syncing phase" code without
the sleep between the syncs?
I.e. so that it is still a tight loop, but the loop alternates between
sync and absorb, with no intentional pause?
  


Yes; that's how it was developed.  It helped to have just the extra 
absorb work without the pauses, but that alone wasn't enough to really 
improve things on the server we ran into this problem badly on.



I ask because I don't have a mental model of how the pause can help.
Given that this dirty data has been hanging around for many minutes
already, what is a 3 second pause going to heal?
  


The difference is that once an fsync call is made, dirty data is much 
more likely to be forced out.  It's the one thing that bypasses all 
other ways the kernel might try to avoid writing the data--both the 
dirty ratio guidelines and the congestion control logic--and forces 
those writes to happen as soon as they can be scheduled.  If you graph 
the amount of data shown as "Dirty:" by /proc/meminfo over time, once the 
sync calls start happening it's like a descending staircase pattern, 
dropping a little bit as each sync fires. 


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books




Re: [HACKERS] Spread checkpoint sync

2010-11-26 Thread Ron Mayer
Josh Berkus wrote:
 On 11/20/10 6:11 PM, Jeff Janes wrote:
 True, but I think that changing these from their defaults is not
 considered to be a dark art reserved for kernel hackers, i.e they are
 something that sysadmins are expected to tweak to suit their
 workload, just like the shmmax and such. 
 
 I disagree.  Linux kernel hackers know about these kinds of parameters,
 and I suppose that Linux performance experts do.  But very few
 sysadmins, in my experience, have any idea.

To me, a lot of this conversation feels parallel to the
arguments that occasionally come up debating writing directly
to raw disks bypassing the filesystems altogether.

Might smoother checkpoints be better solved by talking
to the OS vendors & virtual-memory-tuning-knob authors
to work with them on exposing the ideal knobs; rather than
saying that our only tool is a hammer (fsync) so the problem
must be handled as a nail.


Hypothetically - what would the ideal knobs be?

Something like madvise WONTNEED but that leaves pages
in the OS's cache after writing them?




Re: [HACKERS] Spread checkpoint sync

2010-11-23 Thread Cédric Villemain
2010/11/21 Andres Freund and...@anarazel.de:
 On Sunday 21 November 2010 23:19:30 Martijn van Oosterhout wrote:
 For a similar problem we had (kernel buffering too much) we had success
 using the fadvise and madvise WONTNEED syscalls to force the data to
 exit the cache much sooner than it would otherwise. This was on Linux
 and it had the side-effect that the data was deleted from the kernel
 cache, which we wanted, but probably isn't appropriate here.
 Yep, works fine. Although it has the issue that the data will get read again
 if archiving/SR is enabled.

mmhh... the current code does call DONTNEED or WILLNEED for WAL,
depending on whether archiving is off or on.

This matters *only* once the data is written (fsync, fdatasync); before
that it should not have an effect.


 There is also sync_file_range, but that's linux specific, although
 close to what you want I think. It would allow you to work with blocks
 smaller than 1GB.
 Unfortunately that puts the data under quite high write-out pressure inside
 the kernel - which is not what you actually want because it limits reordering
 and such significantly.

 It would be nicer if you could get a mix of both semantics (looking at it,
 depending on the approach that seems to be about a 10 line patch to the
 kernel). I.e. indicate that you want to write the pages soonish, but don't put
 it on the head of the writeout queue.

 Andres





-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] Spread checkpoint sync

2010-11-21 Thread Greg Smith

Jeff Janes wrote:

And for very large memory
systems, even 1% may be too much to cache (dirty*_ratio can only be
set in integer percent points), so recent kernels introduced
dirty*_bytes parameters.  I like these better because they do what
they say.  With the dirty*_ratio, I could never figure out what it was
a ratio of, and the results were unpredictable without extensive
experimentation.
  


Right, you can't set dirty_background_ratio low enough to make this 
problem go away.  Even attempts to set it to 1%, back when that was 
the right size for it, seem to be defeated by other mechanisms within 
the kernel.  Last time I looked at the related source code, the 
congestion control logic that kicks in to throttle writes seemed a 
likely suspect.  This is why I'm not real optimistic about newer 
mechanisms like dirty_background_bytes, added in 2.6.29, helping here: 
that just gives a way to set lower values, but the same basic logic 
is under the hood.


Like Jeff, I've never seen dirty_expire_centisecs help at all, possibly 
due to the same congestion mechanism. 


Yes, but how much work do we want to put into redoing the checkpoint
logic so that the sysadmin on a particular OS and configuration and FS
can avoid having to change the kernel parameters away from their
defaults?  (Assuming of course I am correctly understanding the
problem, always a dangerous assumption.)
  


I've been trying to make this problem go away using just the kernel 
tunables available since 2006.  I adjusted them carefully on the server 
that ran into this problem so badly that it motivated the submitted 
patch, months before this issue got bad.  It didn't help.  Maybe if they 
were running a later kernel that supported dirty_background_bytes that 
would have worked better.  During the last few years, the only thing 
that has consistently helped in every case is the checkpoint spreading 
logic that went into 8.3.  I no longer expect that the kernel developers 
will ever make this problem go away the way checkpoints are written out 
right now, whereas the last good PostgreSQL work in this area definitely 
helped.


The basic premise of the current checkpoint code is that if you write 
all of the buffers out early enough, by the time the syncs execute, 
enough of the data should have gone out that those don't take very long 
to process.  That was usually true for the last few years, on systems 
with a battery-backed cache; the amount of memory cached by the OS was 
small relative to the RAID cache size.  That's not the case 
anymore, and the divergence is growing.


The idea that the checkpoint sync code can run in a relatively tight 
loop, without stopping to do the normal background writer cleanup work, 
is also busted by that observation.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support   www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books




Re: [HACKERS] Spread checkpoint sync

2010-11-21 Thread Greg Smith

Robert Haas wrote:

Doing all the writes and then all the fsyncs meets this requirement
trivially, but I'm not so sure that's a good idea.  For example, given
files F1 ... Fn with dirty pages needing checkpoint writes, we could
do the following: first, do any pending fsyncs for files not among F1
.. Fn; then, write all pages for F1 and fsync, write all pages for F2
and fsync, write all pages for F3 and fsync, etc.  This might seem
dumb because we're not really giving the OS a chance to write anything
out before we fsync, but think about the ext3 case where the whole
filesystem cache gets flushed anyway.


I'm not horribly interested in optimizing for the ext3 case per se, as I 
consider that filesystem fundamentally broken from the perspective of 
its ability to deliver low latency here.  I wouldn't want a patch that 
improved behavior on filesystems with granular fsync to make the ext3 
situation worse; that's as much as I'd want the design to lean toward 
considering its quirks.  Jeff Janes made a case downthread for making it 
the admin/OS's job to worry about this.  In cases where there 
is a reasonable solution available, in the form of switching to XFS or 
ext4, I'm happy to take that approach.


Let me throw some numbers out to give a better idea of the shape and 
magnitude of the problem case I've been working on here.  In the 
situation that leads to the near hour-long sync phase I've seen, 
checkpoints will start with about a 3GB backlog of data in the kernel 
write cache to deal with.  That's about 4% of RAM, just under the 5% 
threshold set by dirty_background_ratio.  Whether or not the 256MB write 
cache on the controller is also filled is a relatively minor detail I 
can't monitor easily.  The checkpoint itself?  250MB each time.

This proportion is why I didn't think to follow the alternate path of 
worrying about spacing the write and fsync calls out differently.  I 
shrunk shared_buffers down to make the actual checkpoints smaller, which 
helped to some degree; that's what got them down to smaller than the 
RAID cache size.  But the amount of data cached by the operating system 
is the real driver of total sync time here.  Whether or not you include 
all of the writes from the checkpoint itself before you start calling 
fsync didn't actually matter very much; in the case I've been chasing, 
those are getting cached anyway.  The write storm from the fsync calls 
themselves forcing things out seems to be the driver on I/O spikes, 
which is why I started with spacing those out.


Writes go out at a rate of around 5MB/s, so clearing the 3GB backlog 
takes a minimum of 10 minutes of real time.  There are about 300 1GB 
relation files involved in the case I've been chasing.  This is where 
the 3 second delay number came from; 300 files, 3 seconds each, 900 
seconds = 15 minutes of sync spread.  You can turn that math around to 
figure out how much delay per relation you can afford while still 
keeping checkpoints to a planned end time, which isn't done in the patch 
I submitted yet.


Ultimately what I want to do here is some sort of smarter write-behind 
sync operation, perhaps with a LRU on relations with pending fsync 
requests.  The idea would be to sync relations that haven't been touched 
in a while in advance of the checkpoint even.  I think that's similar to 
the general idea Robert is suggesting here, to get some sync calls 
flowing before all of the checkpoint writes have happened.  I think that 
the final sync calls will need to get spread out regardless, and since 
doing that requires a fairly small amount of code too that's why we 
started with that.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support   www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books




Re: [HACKERS] Spread checkpoint sync

2010-11-21 Thread Martijn van Oosterhout
On Sun, Nov 21, 2010 at 04:54:00PM -0500, Greg Smith wrote:
 Ultimately what I want to do here is some sort of smarter write-behind  
 sync operation, perhaps with a LRU on relations with pending fsync  
 requests.  The idea would be to sync relations that haven't been touched  
 in a while in advance of the checkpoint even.  I think that's similar to  
 the general idea Robert is suggesting here, to get some sync calls  
 flowing before all of the checkpoint writes have happened.  I think that  
 the final sync calls will need to get spread out regardless, and since  
 doing that requires a fairly small amount of code too that's why we  
 started with that.

For a similar problem we had (kernel buffering too much) we had success
using the fadvise and madvise WONTNEED syscalls to force the data to
exit the cache much sooner than it would otherwise. This was on Linux
and it had the side-effect that the data was deleted from the kernel
cache, which we wanted, but probably isn't appropriate here.

There is also sync_file_range, but that's linux specific, although
close to what you want I think. It would allow you to work with blocks
smaller than 1GB.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 Patriotism is when love of your own people comes first; nationalism,
 when hate for people other than your own comes first. 
   - Charles de Gaulle




Re: [HACKERS] Spread checkpoint sync

2010-11-21 Thread Andres Freund
On Sunday 21 November 2010 23:19:30 Martijn van Oosterhout wrote:
 For a similar problem we had (kernel buffering too much) we had success
 using the fadvise and madvise WONTNEED syscalls to force the data to
 exit the cache much sooner than it would otherwise. This was on Linux
 and it had the side-effect that the data was deleted from the kernel
 cache, which we wanted, but probably isn't appropriate here.
Yep, works fine. Although it has the issue that the data will get read again if 
archiving/SR is enabled.

 There is also sync_file_range, but that's linux specific, although
 close to what you want I think. It would allow you to work with blocks
 smaller than 1GB.
Unfortunately that puts the data under quite high write-out pressure inside 
the kernel - which is not what you actually want because it limits reordering 
and such significantly.

It would be nicer if you could get a mix of both semantics (looking at it, 
depending on the approach that seems to be about a 10 line patch to the 
kernel). I.e. indicate that you want to write the pages soonish, but don't put 
it on the head of the writeout queue.

Andres



Re: [HACKERS] Spread checkpoint sync

2010-11-21 Thread Josh Berkus
On 11/20/10 6:11 PM, Jeff Janes wrote:
 True, but I think that changing these from their defaults is not
 considered to be a dark art reserved for kernel hackers, i.e they are
 something that sysadmins are expected to tweak to suite their work
 load, just like the shmmax and such. 

I disagree.  Linux kernel hackers know about these kinds of parameters,
and I suppose that Linux performance experts do.  But very few
sysadmins, in my experience, have any idea.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com



Re: [HACKERS] Spread checkpoint sync

2010-11-21 Thread Robert Haas
On Sun, Nov 21, 2010 at 4:54 PM, Greg Smith g...@2ndquadrant.com wrote:
 Let me throw some numbers out [...]

Interesting.

 Ultimately what I want to do here is some sort of smarter write-behind sync
 operation, perhaps with a LRU on relations with pending fsync requests.  The
 idea would be to sync relations that haven't been touched in a while in
 advance of the checkpoint even.  I think that's similar to the general idea
 Robert is suggesting here, to get some sync calls flowing before all of the
 checkpoint writes have happened.  I think that the final sync calls will
 need to get spread out regardless, and since doing that requires a fairly
 small amount of code too that's why we started with that.

Doing some kind of background fsync-ing might indeed be sensible, but
I agree that's secondary to trying to spread out the fsyncs during the
checkpoint itself.  I guess the question is what we can do there
sensibly without an unreasonable amount of new code.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Spread checkpoint sync

2010-11-20 Thread Jeff Janes
On Mon, Nov 15, 2010 at 6:15 PM, Robert Haas robertmh...@gmail.com wrote:
 On Sun, Nov 14, 2010 at 6:48 PM, Greg Smith g...@2ndquadrant.com wrote:
 The second issue is that the delay between sync calls is currently
 hard-coded, at 3 seconds.  I believe the right path here is to consider the
 current checkpoint_completion_target to still be valid, then work back from
 there.  That raises the question of what percentage of the time writes
 should now be compressed into relative to that, to leave some time to spread
 the sync calls.  If we're willing to say writes finish in first 1/2 of
 target, syncs execute in second 1/2, that I could implement that here.
  Maybe that ratio needs to be another tunable.  Still thinking about that
 part, and it's certainly open to community debate.

I would speculate that the answer is likely to be nearly binary.  The
best option would either be to do the writes as fast as possible and
spread out the fsyncs, or spread out the writes and do the fsyncs as
fast as possible, depending on the system setup.


 The thing to realize
 that complicates the design is that the actual sync execution may take a
 considerable period of time.  It's much more likely for that to happen than
 in the case of an individual write, as the current spread checkpoint does,
 because those are usually cached.  In the spread sync case, it's easy for
 one slow sync to make the rest turn into ones that fire in quick succession,
 to make up for lost time.

 I think the behavior of file systems and operating systems is highly
 relevant here.  We seem to have a theory that allowing a delay between
 the write and the fsync should give the OS a chance to start writing
 the data out,

I thought that the theory was that doing too many fsyncs in short order
can lead to some kind of starvation of other IO.

If the theory is that we want to wait between writes and fsyncs, then
the current behavior is probably the best: spreading out the writes
and then doing all the syncs at the end gives the longest delay
between an average write and the sync of the file it was written to.  Or,
spread the writes out over 150 seconds, sleep for 140 seconds, then do
the fsyncs.  But I don't think that that is the theory.


 but do we have any evidence indicating whether and under
 what circumstances that actually occurs?  For example, if we knew that
 it's important to wait at least 30 s but waiting 60 s is no better,
 that would be useful information.

 Another question I have is about how we're actually going to know when
 any given fsync can be performed.  For any given segment, there are a
 certain number of pages A that are already dirty at the start of the
 checkpoint.

Dirty in the shared pool, or dirty in the OS cache?

 Then there are a certain number of additional pages B
 that are going to be written out during the checkpoint.  If it so
 happens that B = 0, we can call fsync() at the beginning of the
 checkpoint without losing anything (in fact, we gain something: any
 pages dirtied by cleaning scans or backend writes during the
 checkpoint won't need to hit the disk;

Aren't those pages written out by cleaning scans and backend writes
while the checkpoint is occurring exactly what you defined to be page
set B, and then to be zero?

 and if the filesystem dumps
 more of its cache than necessary on fsync, we may as well take that
 hit before dirtying a bunch more stuff).  But if B > 0, then we shouldn't
 attempt the fsync() until we've written them all; otherwise we'll end
 up having to fsync() that segment twice.

 Doing all the writes and then all the fsyncs meets this requirement
 trivially, but I'm not so sure that's a good idea.  For example, given
 files F1 ... Fn with dirty pages needing checkpoint writes, we could
 do the following: first, do any pending fsyncs for files not among F1
 .. Fn; then, write all pages for F1 and fsync, write all pages for F2
 and fsync, write all pages for F3 and fsync, etc.  This might seem
 dumb because we're not really giving the OS a chance to write anything
 out before we fsync, but think about the ext3 case where the whole
 filesystem cache gets flushed anyway.  It's much better to dump the
 cache at the beginning of the checkpoint and then again after every
 file than it is to spew many GB of dirty stuff into the cache and then
 drop the hammer.

But the kernel has knobs to prevent that from happening.
dirty_background_ratio, dirty_ratio, dirty_background_bytes (on newer
kernels), dirty_expire_centisecs.  Don't these knobs work?  Also, ext3
is supposed to do a journal commit every 5 seconds under default mount
conditions.

Cheers,

Jeff



Re: [HACKERS] Spread checkpoint sync

2010-11-20 Thread Robert Haas
On Sat, Nov 20, 2010 at 6:21 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 The thing to realize
 that complicates the design is that the actual sync execution may take a
 considerable period of time.  It's much more likely for that to happen than
 in the case of an individual write, as the current spread checkpoint does,
 because those are usually cached.  In the spread sync case, it's easy for
 one slow sync to make the rest turn into ones that fire in quick succession,
 to make up for lost time.

 I think the behavior of file systems and operating systems is highly
 relevant here.  We seem to have a theory that allowing a delay between
 the write and the fsync should give the OS a chance to start writing
 the data out,

 I thought that the theory was that doing too many fsyncs in short order
 can lead to some kind of starvation of other IO.

 If the theory is that we want to wait between writes and fsyncs, then
 the current behavior is probably the best: spreading out the writes
 and then doing all the syncs at the end gives the longest delay
 between an average write and the sync of the file it was written to.  Or,
 spread the writes out over 150 seconds, sleep for 140 seconds, then do
 the fsyncs.  But I don't think that that is the theory.

Well, I've heard Bruce and, I think, possibly also Greg talk about
wanting to wait after doing the writes in the hopes that the kernel
will start to flush the dirty pages, but I'm wondering whether it
wouldn't be better to just give up on that and do: small batch of
writes -> fsync those writes -> another small batch of writes -> fsync
that batch -> etc.

 but do we have any evidence indicating whether and under
 what circumstances that actually occurs?  For example, if we knew that
 it's important to wait at least 30 s but waiting 60 s is no better,
 that would be useful information.

 Another question I have is about how we're actually going to know when
 any given fsync can be performed.  For any given segment, there are a
 certain number of pages A that are already dirty at the start of the
 checkpoint.

 Dirty in the shared pool, or dirty in the OS cache?

OS cache, sorry.

 Then there are a certain number of additional pages B
 that are going to be written out during the checkpoint.  If it so
 happens that B = 0, we can call fsync() at the beginning of the
 checkpoint without losing anything (in fact, we gain something: any
 pages dirtied by cleaning scans or backend writes during the
 checkpoint won't need to hit the disk;

 Aren't those pages written out by cleaning scans and backend writes
 while the checkpoint is occurring exactly what you defined to be page
 set B, and then to be zero?

No, sorry, I'm referring to cases where all the dirty pages in a
segment have been written out the OS but we have not yet issued the
necessary fsync.

 and if the filesystem dumps
 more of its cache than necessary on fsync, we may as well take that
 hit before dirtying a bunch more stuff).  But if B > 0, then we shouldn't
 attempt the fsync() until we've written them all; otherwise we'll end
 up having to fsync() that segment twice.

 Doing all the writes and then all the fsyncs meets this requirement
 trivially, but I'm not so sure that's a good idea.  For example, given
 files F1 ... Fn with dirty pages needing checkpoint writes, we could
 do the following: first, do any pending fsyncs for files not among F1
 .. Fn; then, write all pages for F1 and fsync, write all pages for F2
 and fsync, write all pages for F3 and fsync, etc.  This might seem
 dumb because we're not really giving the OS a chance to write anything
 out before we fsync, but think about the ext3 case where the whole
 filesystem cache gets flushed anyway.  It's much better to dump the
 cache at the beginning of the checkpoint and then again after every
 file than it is to spew many GB of dirty stuff into the cache and then
 drop the hammer.

 But the kernel has knobs to prevent that from happening.
 dirty_background_ratio, dirty_ratio, dirty_background_bytes (on newer
 kernels), dirty_expire_centisecs.  Don't these knobs work?  Also, ext3
 is supposed to do a journal commit every 5 seconds under default mount
 conditions.

I don't know in detail.  dirty_expire_centisecs sounds useful; I think
the problem with dirty_background_ratio and dirty_ratio is that the
default ratios are large enough that on systems with a huge pile of
memory, they allow more dirty data to accumulate than can be flushed
without causing an I/O storm.  I believe Greg Smith made a comment
along the lines of: memory sizes are growing faster than I/O speeds;
therefore a ratio that is OK for a low-end system with a modest amount
of memory causes problems on a high-end system that has faster I/O but
MUCH more memory.

As a kernel developer, I suspect the tendency is to try to set the
ratio so that you keep enough free memory around to service future
allocation requests.  Optimizing for possible future fsyncs is
probably not the top priority...


Re: [HACKERS] Spread checkpoint sync

2010-11-20 Thread Jeff Janes
On Sat, Nov 20, 2010 at 5:17 PM, Robert Haas robertmh...@gmail.com wrote:
 On Sat, Nov 20, 2010 at 6:21 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 Doing all the writes and then all the fsyncs meets this requirement
 trivially, but I'm not so sure that's a good idea.  For example, given
 files F1 ... Fn with dirty pages needing checkpoint writes, we could
 do the following: first, do any pending fsyncs for files not among F1
 .. Fn; then, write all pages for F1 and fsync, write all pages for F2
 and fsync, write all pages for F3 and fsync, etc.  This might seem
 dumb because we're not really giving the OS a chance to write anything
 out before we fsync, but think about the ext3 case where the whole
 filesystem cache gets flushed anyway.  It's much better to dump the
 cache at the beginning of the checkpoint and then again after every
 file than it is to spew many GB of dirty stuff into the cache and then
 drop the hammer.

 But the kernel has knobs to prevent that from happening.
 dirty_background_ratio, dirty_ratio, dirty_background_bytes (on newer
 kernels), dirty_expire_centisecs.  Don't these knobs work?  Also, ext3
 is supposed to do a journal commit every 5 seconds under default mount
 conditions.

 I don't know in detail.  dirty_expire_centisecs sounds useful; I think
 the problem with dirty_background_ratio and dirty_ratio is that the
 default ratios are large enough that on systems with a huge pile of
 memory, they allow more dirty data to accumulate than can be flushed
 without causing an I/O storm.

True, but I think that changing these from their defaults is not
considered to be a dark art reserved for kernel hackers; i.e., they are
something that sysadmins are expected to tweak to suit their
workload, just like shmmax and such.  And for very large memory
systems, even 1% may be too much to cache (dirty*_ratio can only be
set in integer percent points), so recent kernels introduced the
dirty*_bytes parameters.  I like these better because they do what
they say.  With dirty*_ratio, I could never figure out what it was
a ratio of, and the results were unpredictable without extensive
experimentation.

 I believe Greg Smith made a comment
 along the lines of - memory sizes are grow faster than I/O speeds;
 therefore a ratio that is OK for a low-end system with a modest amount
 of memory causes problems on a high-end system that has faster I/O but
 MUCH more memory.

Yes, but how much work do we want to put into redoing the checkpoint
logic so that the sysadmin on a particular OS and configuration and FS
can avoid having to change the kernel parameters away from their
defaults?  (Assuming of course I am correctly understanding the
problem, always a dangerous assumption.)

Some experiments I have just done show that dirty_expire_centisecs
does not seem reliable on ext3, but the dirty*_ratio and dirty*_bytes
knobs seem reliable on ext2, ext3, and ext4.

But that may not apply to RAID; I don't have one I can test.


Cheers,

Jeff



Re: [HACKERS] Spread checkpoint sync

2010-11-15 Thread Robert Haas
On Sun, Nov 14, 2010 at 6:48 PM, Greg Smith g...@2ndquadrant.com wrote:
 The second issue is that the delay between sync calls is currently
 hard-coded, at 3 seconds.  I believe the right path here is to consider the
 current checkpoint_completion_target to still be valid, then work back from
 there.  That raises the question of what percentage of the time writes
 should now be compressed into relative to that, to leave some time to spread
 the sync calls.  If we're willing to say writes finish in first 1/2 of
 target, syncs execute in second 1/2, that I could implement that here.
  Maybe that ratio needs to be another tunable.  Still thinking about that
 part, and it's certainly open to community debate.  The thing to realize
 that complicates the design is that the actual sync execution may take a
 considerable period of time.  It's much more likely for that to happen than
 in the case of an individual write, as the current spread checkpoint does,
 because those are usually cached.  In the spread sync case, it's easy for
 one slow sync to make the rest turn into ones that fire in quick succession,
 to make up for lost time.

I think the behavior of file systems and operating systems is highly
relevant here.  We seem to have a theory that allowing a delay between
the write and the fsync should give the OS a chance to start writing
the data out, but do we have any evidence indicating whether and under
what circumstances that actually occurs?  For example, if we knew that
it's important to wait at least 30 s but waiting 60 s is no better,
that would be useful information.

Another question I have is about how we're actually going to know when
any given fsync can be performed.  For any given segment, there are a
certain number of pages A that are already dirty at the start of the
checkpoint.  Then there are a certain number of additional pages B
that are going to be written out during the checkpoint.  If it so
happens that B = 0, we can call fsync() at the beginning of the
checkpoint without losing anything (in fact, we gain something: any
pages dirtied by cleaning scans or backend writes during the
checkpoint won't need to hit the disk; and if the filesystem dumps
more of its cache than necessary on fsync, we may as well take that
hit before dirtying a bunch more stuff).  But if B > 0, then we shouldn't
attempt the fsync() until we've written them all; otherwise we'll end
up having to fsync() that segment twice.

Doing all the writes and then all the fsyncs meets this requirement
trivially, but I'm not so sure that's a good idea.  For example, given
files F1 ... Fn with dirty pages needing checkpoint writes, we could
do the following: first, do any pending fsyncs for files not among F1
.. Fn; then, write all pages for F1 and fsync, write all pages for F2
and fsync, write all pages for F3 and fsync, etc.  This might seem
dumb because we're not really giving the OS a chance to write anything
out before we fsync, but think about the ext3 case where the whole
filesystem cache gets flushed anyway.  It's much better to dump the
cache at the beginning of the checkpoint and then again after every
file than it is to spew many GB of dirty stuff into the cache and then
drop the hammer.

I'm just brainstorming here; feel free to tell me I'm all wet.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
