Re: [HACKERS] Bgwriter strategies

2007-07-11 Thread Tom Lane
"Pavan Deolasee" <[EMAIL PROTECTED]> writes:
> I think you are assuming that the next write of the same block won't
> use another OS cache block. I doubt that's the way writes are handled
> by the kernel. Each write would typically end up being queued in the
> kernel, where each write will have its own copy of the block to be
> written. Isn't it?

A kernel that worked like that would have a problem doing read(), ie,
it'd have to search to find the latest version of the block.  So I'd
expect that most systems would prefer to keep only one in-memory copy
of any given block and overwrite it at write() time.  No sane kernel
designer will optimize write() at the expense of read() performance,
especially when you consider that a design as above really pessimizes
write() too --- it does more I/O than is necessary when the same block
is modified repeatedly in a short time.

regards, tom lane



Re: [HACKERS] Bgwriter strategies

2007-07-11 Thread Alvaro Herrera
Pavan Deolasee wrote:
> On 7/11/07, Heikki Linnakangas <[EMAIL PROTECTED]> wrote:
> >
> >I was able
> >to reproduce the phenomenon with a simple C program that writes 8k
> >blocks in random order to a fixed size file. I've attached it along with
> >output of running it on my test server. The output shows how the writes
> >start to periodically block after a while. I was able to reproduce the
> >problem on my laptop as well. Can anyone explain what's going on?
> >
> I think you are assuming that the next write of the same block won't
> use another OS cache block. I doubt that's the way writes are handled
> by the kernel. Each write would typically end up being queued in the
> kernel, where each write will have its own copy of the block to be
> written. Isn't it?

I don't think so -- at least not on Linux.  See
https://ols2006.108.redhat.com/2007/Reprints/zijlstra-Reprint.pdf
where he talks about a patch to the page cache.  He describes the
current page cache there; each page is kept on a tree, so a second write
to the same page would "overwrite" the page of the original write.

-- 
Alvaro Herrera http://www.amazon.com/gp/registry/CTMLCN8V17R4
"Las mujeres son como hondas:  mientras más resistencia tienen,
 más lejos puedes llegar con ellas"  (Jonas Nightingale, Leap of Faith)



Re: [HACKERS] Bgwriter strategies

2007-07-11 Thread Pavan Deolasee

On 7/11/07, Heikki Linnakangas <[EMAIL PROTECTED]> wrote:
>
> I was able
> to reproduce the phenomenon with a simple C program that writes 8k
> blocks in random order to a fixed size file. I've attached it along with
> output of running it on my test server. The output shows how the writes
> start to periodically block after a while. I was able to reproduce the
> problem on my laptop as well. Can anyone explain what's going on?

I think you are assuming that the next write of the same block won't
use another OS cache block. I doubt that's the way writes are handled
by the kernel. Each write would typically end up being queued in the
kernel, where each write will have its own copy of the block to be
written. Isn't it?


Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [HACKERS] Bgwriter strategies

2007-07-11 Thread Heikki Linnakangas
In the last couple of days, I've been running a lot of DBT-2 tests and 
smaller microbenchmarks with different bgwriter settings and 
experimental patches, but I have not been able to produce a repeatable 
test case where any of the bgwriter configurations perform better than 
not having bgwriter at all.


I encountered a strange phenomenon that I don't understand. I ran a 
small test case with DELETEs in random order, using an index, on a 
~300MB table, with shared_buffers smaller than that. I expected that to 
be dominated by the speed postgres can swap pages in and out of the 
shared buffer cache, but surprisingly the test starts to block on the 
write I/O, even though the table fits completely in OS cache. I was able 
to reproduce the phenomenon with a simple C program that writes 8k 
blocks in random order to a fixed size file. I've attached it along with 
output of running it on my test server. The output shows how the writes 
start to periodically block after a while. I was able to reproduce the 
problem on my laptop as well. Can anyone explain what's going on?


Anyone out there have a repeatable test case where bgwriter helps?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
  int fd;
  off_t len;
  char buf[8192];
  int i;
  int size;
  struct timeval begin_t;

  if (argc != 3)
  {
    printf("Usage: writetest <filename> <size in MB>\n");
    exit(1);
  }

  fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, S_IWUSR | S_IRUSR);
  if (fd == -1)
  {
    perror(NULL);
    exit(1);
  }
  size = atoi(argv[2]) * 1024 * 1024;

  /* fill the file sequentially with 8k blocks up to the requested size */
  for (i = 0; i < size;)
    i += write(fd, buf, sizeof(buf));

  len = i;

  fsync(fd);

  gettimeofday(&begin_t, NULL);

  /* rewrite blocks at random offsets within the file, reporting the
   * elapsed time after every 4th write */
  for (i = 0; i < 1000; i++)
  {
    lseek(fd, (random() % (len / sizeof(buf))) * sizeof(buf), SEEK_SET);
    write(fd, buf, sizeof(buf));
    if (i % 4 == 0)
    {
      struct timeval t;
      long msecs;

      gettimeofday(&t, NULL);
      msecs = (t.tv_sec - begin_t.tv_sec) * 1000 + (t.tv_usec - begin_t.tv_usec) / 1000;
      printf("%d blocks written, time=%ld ms\n", i, msecs);
      begin_t = t;
    }
  }
  return 0;
}
./writetest /mnt/data/writetest-data 80
0 blocks written, time=0 ms
4 blocks written, time=251 ms
8 blocks written, time=241 ms
12 blocks written, time=241 ms
16 blocks written, time=241 ms
20 blocks written, time=242 ms
24 blocks written, time=242 ms
28 blocks written, time=241 ms
32 blocks written, time=241 ms
36 blocks written, time=242 ms
40 blocks written, time=241 ms
44 blocks written, time=241 ms
48 blocks written, time=241 ms
52 blocks written, time=242 ms
56 blocks written, time=241 ms
60 blocks written, time=241 ms
64 blocks written, time=242 ms
68 blocks written, time=242 ms
72 blocks written, time=242 ms
76 blocks written, time=241 ms
80 blocks written, time=242 ms
84 blocks written, time=4579 ms
88 blocks written, time=244 ms
92 blocks written, time=242 ms
96 blocks written, time=4752 ms
100 blocks written, time=241 ms
104 blocks written, time=4618 ms
108 blocks written, time=242 ms
112 blocks written, time=4614 ms
116 blocks written, time=246 ms
120 blocks written, time=243 ms
124 blocks written, time=4619 ms
128 blocks written, time=242 ms
132 blocks written, time=242 ms
136 blocks written, time=4605 ms
140 blocks written, time=242 ms




Re: [HACKERS] Bgwriter strategies

2007-07-09 Thread Simon Riggs
On Fri, 2007-07-06 at 10:55 +0100, Heikki Linnakangas wrote:

> We need to get the requirements straight.
> 
> One goal of bgwriter is clearly to keep just enough buffers clean in 
> front of the clock hand so that backends don't need to do writes 
> themselves until the next bgwriter iteration. But not any more than 
> that, otherwise we might end up doing more writes than necessary if some 
> of the buffers are redirtied.

The purpose of the WAL/shared buffer cache is to avoid having to write
all of the data blocks touched by a transaction to disk before end of
transaction, thus increasing request response time. That purpose is only
fulfilled iff using the shared buffer cache does not require us to write
out someone else's dirty buffers, while avoiding our own. The bgwriter
exists specifically to clean the dirty buffers, so that users do not
have to clean theirs or anybody else's dirty buffers.

> To deal with bursty workloads, for example a batch of 2 GB worth of 
> inserts coming in every 10 minutes, it seems we want to keep doing a 
> little bit of cleaning even when the system is idle, to prepare for the 
> next burst. The idea is to smoothen the physical I/O bursts; if we don't 
> clean the dirty buffers left over from the previous burst during the 
> idle period, the I/O system will be bottlenecked during the bursts, and 
> sit idle otherwise.

In short, bursty workloads are the normal situation.

When capacity is not saturated the bgwriter can utilise the additional
capacity to reduce statement response times.

It is standard industry practice to avoid running a system at peak
throughput for long periods of time, so DBT-2 does not represent a
normal situation. This is because response times are only
predictable on a non-saturated system, and most apps have some implicit
or explicit service level objective. 

However, the server needs to cope with periods of saturation, so must be
able to perform efficiently during those times. 

So I see there are two modes of operation:

i) dirty block write offload when capacity is available
ii) efficient operation when the server is saturated.

DBT-2 represents only the second mode of operation; the two modes are
equally important, yet mode i) is the ideal situation. 

> To strike a balance between cleaning buffers ahead of possible bursts in 
> the future and not doing unnecessary I/O when no such bursts come, I 
> think a reasonable strategy is to write buffers with usage_count=0 at a 
> slow pace when there's no buffer allocations happening.

Agreed.

-- 
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com




Re: [HACKERS] Bgwriter strategies

2007-07-09 Thread Simon Riggs
On Thu, 2007-07-05 at 21:50 +0100, Heikki Linnakangas wrote:

> All test runs were also patched to count the # of buffer allocations, 
> and # of buffer flushes performed by bgwriter and backends. Here are those 
> results (I hope the indentation gets through properly):
> 
>                        imola-336   imola-337   imola-340
> writes by checkpoint       38302       30410       39529
> writes by bgwriter        350113     2205782     1418672
> writes by backends       1834333      265755      787633
> writes total             2222748     2501947     2245834
> allocations              2683170     2657896     2699974

These results suggest that even the minimum bgwriter_delay of 10ms may be
too large for these workloads: whatever strategy is used, the bgwriter
spends too much time sleeping when it should be working.

-- 
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com




Re: [HACKERS] Bgwriter strategies

2007-07-07 Thread Greg Smith

On Fri, 6 Jul 2007, Heikki Linnakangas wrote:

> To strike a balance between cleaning buffers ahead of possible bursts in the 
> future and not doing unnecessary I/O when no such bursts come, I think a 
> reasonable strategy is to write buffers with usage_count=0 at a slow pace 
> when there's no buffer allocations happening.


One idea I had there was to always scan max_pages buffers each time, even 
if fewer allocations than that occurred.  That number is 
usually relatively small compared to the size of the buffer cache, so it 
would creep through the buffer cache at a bounded pace during idle 
periods.  It's actually nice to watch the LRU cleaner get so far ahead 
during idle spots that it catches the strategy point, so that when the 
next burst comes, it doesn't have to do anything until there's a full lap 
by the clock sweep.
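
For illustration, here's a minimal sketch of that rule (the function and 
variable names are hypothetical, not from any patch): the per-round scan 
quota is floored at max_pages, so idle periods still advance the cleaning 
hand at a bounded pace.

/*
 * Hypothetical sketch: even when recent allocations would require scanning
 * fewer buffers, scan at least max_pages per round, so the cleaner creeps
 * through the buffer cache at a bounded rate while the system is idle.
 */
static int
pages_to_scan_this_round(int pages_needed_for_allocs, int max_pages)
{
  return (pages_needed_for_allocs > max_pages)
    ? pages_needed_for_allocs
    : max_pages;
}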


Anyway, completely with you on the rest of this post, everything you said 
matches the direction I've been trudging toward.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Bgwriter strategies

2007-07-06 Thread Gregory Stark
"Tom Lane" <[EMAIL PROTECTED]> writes:

>> That would be overly aggressive on a workload that's steady on average, 
>> but consists of small bursts. Like this: 0 0 0 0 100 0 0 0 0 100 0 0 0 0 
>> 100. You'd end up writing ~100 pages on every bgwriter round, but you 
>> only need an average of 20 pages per round.
>
> No, you wouldn't be *writing* that many, you'd only be keeping that many
> *clean*; which only costs more work if any of them get re-dirtied
> between writing and use.  Which is a fairly small probability if we're
> talking about a small difference in the number of buffers to keep clean.
> So I think the average number of writes is hardly different, it's just
> that the backends are far less likely to have to do any of them.

Well, Postgres's hint bits tend to redirty pages precisely once, at just about
the time when they're ready to be paged out. But I think there are things we
can do to tackle that head-on. 

Bgwriter could try to set hint bits before cleaning these pages for example.
Or we could elect in selected circumstances not to write out a page that is
hint-bit-dirty-only. Or some combination of those options depending on the
circumstances. Figuring out the circumstances is the hard part.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com




Re: [HACKERS] Bgwriter strategies

2007-07-06 Thread Greg Smith

On Fri, 6 Jul 2007, Heikki Linnakangas wrote:

> I've been running these tests with bgwriter_delay of 10 ms, which is probably 
> too aggressive.


Even on relatively high-end hardware, I've found it hard to get good 
results out of the BGW with the delay under 50ms--particularly when trying 
to do some more complicated smoothing.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Bgwriter strategies

2007-07-06 Thread Heikki Linnakangas

Heikki Linnakangas wrote:
> I scheduled a test with the moving average method as well, we'll see how 
> that fares.


Not too well :(.

Strange. The total # of writes is on par with having bgwriter disabled, 
but the physical I/O graphs show more I/O (on par with the aggressive 
bgwriter), and the response times are higher.


I just noticed that on the tests with the moving average, or the simple 
"just enough" method, there's a small bump in the CPU usage during the 
ramp up period. I believe that's because bgwriter scans through the 
whole buffer cache without finding enough buffers to clean. I ran some 
tests earlier with unpatched bgwriter tuned to the maximum, and it used 
~10% of CPU, which is the same level that the bump rises to. 
Unfortunately I haven't been taking pg_buffercache snapshots until after 
the ramp up; it should've shown up there.


I've been running these tests with bgwriter_delay of 10 ms, which is 
probably too aggressive. I used that to test the idea of starting the 
scan from where it left off, instead of always starting from clock hand.


If someone wants to have a look, the # of writes are collected to a 
separate log file in /server/buf_alloc_stats.log. There's 
no link to it from the html files. There's also summary snapshots of 
pg_buffercache every 30 seconds in /server/bufcache.log.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Bgwriter strategies

2007-07-06 Thread Greg Smith

On Fri, 6 Jul 2007, Tom Lane wrote:


> The problem is that it'd be very hard to track how far ahead of the
> recycling sweep hand we are, because that number has to be measured
> in usage-count-zero pages.  I see no good way to know how many of the
> pages we scanned before have been touched (and given nonzero usage
> counts) unless we rescan them.


I've actually been working on how to address that specific problem without 
expressly tracking the contents of the buffer cache.  When the background 
writer is called, it finds out how many buffers were allocated and how far 
the sweep point moved since the last call.  From that, you can calculate 
how many buffers on average need to be scanned per allocation, which tells 
you something about the recently encountered density of 0-usage count 
buffers.  My thought was to use that as an input to the computation for 
how far ahead to stay.
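
As a sketch of that computation (all names here are hypothetical), buffers 
scanned per buffer allocated approximates the inverse density of reusable 
usage_count=0 buffers near the sweep point, which could then scale the 
look-ahead target:

/*
 * Hypothetical sketch: if on average S buffers must be scanned to find one
 * reusable (usage_count == 0) buffer, then staying N allocations ahead of
 * demand requires keeping roughly N * S total buffers clean ahead of the
 * sweep point.
 */
static double
scans_per_allocation(long buffers_scanned, long buffers_allocated)
{
  if (buffers_allocated <= 0)
    return 1.0;   /* idle interval: assume zero-count buffers are dense */
  return (double) buffers_scanned / (double) buffers_allocated;
}

static long
lookahead_target(long target_allocations, double scans_per_alloc)
{
  return (long) (target_allocations * scans_per_alloc + 0.5);
}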



> I've been doing moving averages for years and years, and I find that the
> multiplication approach works at least as well as explicitly storing the
> last K observations.  It takes a lot less storage and arithmetic too.


I was simplifying the description just to comment on the range for K; I 
was using a multiplication approach for the computation.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Bgwriter strategies

2007-07-06 Thread Heikki Linnakangas

Tom Lane wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
>>                        imola-336   imola-337   imola-340
>> writes by checkpoint       38302       30410       39529
>> writes by bgwriter        350113     2205782     1418672
>> writes by backends       1834333      265755      787633
>> writes total             2222748     2501947     2245834
>> allocations              2683170     2657896     2699974
>>
>> It looks like Tom's idea is not a winner; it leads to more writes than 
>> necessary.
>
> The incremental number of writes is not that large; only about 10% more.
> The interesting thing is that those "extra" writes must represent
> buffers that were re-touched after their usage_count went to zero, but
> before they could be recycled by the clock sweep.  While you'd certainly
> expect some of that, I'm surprised it is as much as 10%.  Maybe we need
> to play with the buffer allocation strategy some more.
>
> The very small difference in NOTPM among the three runs says that either
> this whole area is unimportant, or DBT2 isn't a good test case for it;
> or maybe that there's something wrong with the patches?


The small difference in NOTPM is because the I/O still wasn't saturated 
even with 10% extra writes.


I ran more tests with a higher number of warehouses, and the extra 
writes start to show in the response times. See tests 341-344: 
http://community.enterprisedb.com/bgwriter/.


I scheduled a test with the moving average method as well, we'll see how 
that fares.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Bgwriter strategies

2007-07-06 Thread Tom Lane
Greg Smith <[EMAIL PROTECTED]> writes:
> On Thu, 5 Jul 2007, Tom Lane wrote:
>> This would give us a safety margin such that buffers_to_clean is not 
>> less than the largest demand observed in the last 100 iterations...and 
>> it takes quite a while for the memory of a demand spike to be forgotten 
>> completely.

> If you tested this strategy even on a steady load, I'd expect you'll find 
> there are large spikes in allocations during the occasional period where 
> everything is just right to pull a bunch of buffers in, and if you let 
> that max linger around for 100 iterations you'll write a large number of 
> buffers more than you need.

You seem to have the same misunderstanding as Heikki.  What I was
proposing was not a target for how many to *write* on each cycle, but
a target for how far ahead of the clock sweep hand to look.  If say
the target is 100, we'll scan forward from the sweep until we have seen
100 clean zero-usage-count buffers; but we only have to write whichever
of them weren't already clean.

This is actually not so different from my previous proposal, in that the
idea is to keep ahead of the sweep by a particular distance.  The
previous idea was that that distance was "all the buffers", whereas this
idea is "a moving average of the actual demand rate".  The excess writes
created by the previous proposal were because of the probability of
re-dirtying buffers between cleaning and recycling.  We reduce that
probability by not trying to keep so many of 'em clean.  But I think
that we can meet the goal of having backends do hardly any of the writes
with a relatively small increase in the target distance, and thus a
relatively small differential in the number of wasted writes.  Heikki's
test showed that Itagaki-san's patch wasn't doing that well in
eliminating writes by backends, so we need a more aggressive target for
how many buffers to keep clean than it has; but I think not a huge
amount more, and thus my proposal.

BTW, somewhere upthread you suggested combining the target-distance
idea with the idea that the cleaning work uses a separate sweep hand and
thus doesn't re-examine the same buffers on every bgwriter iteration.
The problem is that it'd be very hard to track how far ahead of the
recycling sweep hand we are, because that number has to be measured
in usage-count-zero pages.  I see no good way to know how many of the
pages we scanned before have been touched (and given nonzero usage
counts) unless we rescan them.

We could approximate it maybe: try to keep the cleaning hand N total
buffers ahead of the recycling hand, where N is the target number of
clean usage-count-zero buffers scaled by the average fraction of
count-zero buffers (which we can track a moving average of as we advance
the recycling hand).  However I'm not sure the complexity and
uncertainty is worth it.  What I took away from Heikki's experiment is
that trying to stay a large distance in front of the recycle sweep
isn't actually so useful because you get too many wasted writes due
to re-dirtying.  So restructuring the algorithm to make it cheap
CPU-wise to stay well ahead is not so useful either.
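
A rough sketch of that approximation (variable names hypothetical), keeping 
a decaying estimate of the zero-count fraction as the recycling hand 
advances, and converting the clean-buffer target into a total-buffer 
distance:

static double zero_count_fraction = 0.25;   /* running estimate */

/* update a decaying average of how often the recycling hand finds a
 * usage-count-zero buffer */
static void
note_recycled_buffer(int usage_count_was_zero)
{
  zero_count_fraction += 0.01 * ((double) usage_count_was_zero - zero_count_fraction);
}

/* If only a fraction f of buffers have usage_count == 0, the cleaning hand
 * must stay roughly target / f total buffers ahead of the recycling hand
 * to cover the target measured in zero-count pages. */
static long
cleaning_hand_lead(long target_clean_zero_buffers)
{
  if (zero_count_fraction < 0.01)           /* avoid division blow-up */
    return target_clean_zero_buffers * 100;
  return (long) (target_clean_zero_buffers / zero_count_fraction);
}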

> I ended up settling on max(moving average of the last 16, most recent 
> allocation), and that seemed to work pretty well without being too 
> wasteful from excessive writes.

I've been doing moving averages for years and years, and I find that the
multiplication approach works at least as well as explicitly storing the
last K observations.  It takes a lot less storage and arithmetic too.
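
For what it's worth, a tiny sketch of that multiplication approach (names 
hypothetical): a single state variable updated with one multiply-add 
replaces an explicit array of the last K observations.

#include <stdio.h>

/* Exponentially weighted moving average: alpha ~ 2/(K+1) gives smoothing
 * roughly comparable to averaging the last K samples explicitly. */
static double
ewma_update(double avg, double sample, double alpha)
{
  return avg + alpha * (sample - avg);
}

int main(void)
{
  /* the bursty demand sequence discussed upthread */
  int samples[] = {0, 0, 0, 0, 100, 0, 0, 0, 0, 100};
  double avg = 0.0;
  int i;

  for (i = 0; i < 10; i++)
  {
    avg = ewma_update(avg, samples[i], 2.0 / 17.0);  /* K = 16 */
    printf("sample=%3d  avg=%.2f\n", samples[i], avg);
  }
  return 0;
}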

regards, tom lane



Re: [HACKERS] Bgwriter strategies

2007-07-06 Thread Heikki Linnakangas

Tom Lane wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
>> Tom Lane wrote:
>>> buffers_to_clean = Max(buffers_used * 1.1,
>>>                        buffers_to_clean * 0.999);
>>
>> That would be overly aggressive on a workload that's steady on average, 
>> but consists of small bursts. Like this: 0 0 0 0 100 0 0 0 0 100 0 0 0 0 
>> 100. You'd end up writing ~100 pages on every bgwriter round, but you 
>> only need an average of 20 pages per round.
>
> No, you wouldn't be *writing* that many, you'd only be keeping that many
> *clean*; which only costs more work if any of them get re-dirtied
> between writing and use.  Which is a fairly small probability if we're
> talking about a small difference in the number of buffers to keep clean.
> So I think the average number of writes is hardly different, it's just
> that the backends are far less likely to have to do any of them.


Ah, ok, I misunderstood what you were proposing. Yes, that seems like a 
good algorithm then.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Bgwriter strategies

2007-07-06 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> buffers_to_clean = Max(buffers_used * 1.1,
>> buffers_to_clean * 0.999);

> That would be overly aggressive on a workload that's steady on average, 
> but consists of small bursts. Like this: 0 0 0 0 100 0 0 0 0 100 0 0 0 0 
> 100. You'd end up writing ~100 pages on every bgwriter round, but you 
> only need an average of 20 pages per round.

No, you wouldn't be *writing* that many, you'd only be keeping that many
*clean*; which only costs more work if any of them get re-dirtied
between writing and use.  Which is a fairly small probability if we're
talking about a small difference in the number of buffers to keep clean.
So I think the average number of writes is hardly different, it's just
that the backends are far less likely to have to do any of them.

regards, tom lane



Re: [HACKERS] Bgwriter strategies

2007-07-06 Thread Heikki Linnakangas

Greg Smith wrote:
> On Fri, 6 Jul 2007, Heikki Linnakangas wrote:
>> There's something wrong with that. The number of buffer allocations 
>> shouldn't depend on the bgwriter strategy at all.
>
> I was seeing a smaller (closer to 5%) increase in buffer allocations 
> switching from no background writer to using the stock one before I did 
> any code tinkering, so it didn't strike me as odd.  I believe it's 
> related to the TPS numbers.  When there are more transactions being 
> executed per unit time, it's more likely the useful blocks will stay in 
> memory because their usage_count is getting tickled faster, and 
> therefore there's less of the most useful blocks being swapped out only 
> to be re-allocated again later.


Did you run the test for a constant number of transactions? If you did, 
the access pattern and the number of allocations should be *exactly* the 
same with 1 client, assuming the initial state and the seed used for the 
random number generator are the same.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Bgwriter strategies

2007-07-06 Thread Greg Smith

On Fri, 6 Jul 2007, Heikki Linnakangas wrote:

> There's something wrong with that. The number of buffer allocations shouldn't 
> depend on the bgwriter strategy at all.


I was seeing a smaller (closer to 5%) increase in buffer allocations 
switching from no background writer to using the stock one before I did 
any code tinkering, so it didn't strike me as odd.  I believe it's related 
to the TPS numbers.  When there are more transactions being executed per 
unit time, it's more likely the useful blocks will stay in memory because 
their usage_count is getting tickled faster, and therefore there's less of 
the most useful blocks being swapped out only to be re-allocated again 
later.


Since the bad bgwriter tunings reduce TPS, I believe that's the mechanism 
by which there are more allocations needed.  I'll try to keep an eye on 
this now that you've brought it up.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Bgwriter strategies

2007-07-06 Thread Heikki Linnakangas

Greg Smith wrote:
> As you can see, I achieved the goal of almost never having a backend 
> write its own buffer, so yay for that.  That's the only good thing I 
> can say about it though.  The TPS results take a moderate dive, and 
> there's about 10% more buffer allocations.  The big and obvious issue 
> is that I'm writing almost 75% more buffers this way--way worse even 
> than the 10% extra overhead Heikki was seeing.  But since I've gone out 
> of my way to find a worst-case for this code, I consider mission 
> accomplished there.


There's something wrong with that. The number of buffer allocations 
shouldn't depend on the bgwriter strategy at all.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Bgwriter strategies

2007-07-06 Thread Greg Smith
I just got my own first set of useful tests of the new "remember 
where you last scanned to" BGW implementation suggested by Tom.  What I 
did was keep the existing % to scan, but cut back the number to scan when 
so close to a complete lap ahead of the strategy point that I'd cross it 
if I scanned that much.  So when the system was idle, it would very 
quickly catch up with the strategy point, but if the %/max numbers were 
low it's possible for it to fall behind.


My workload was just the UPDATE statement out of pgbench with a database 
of scale 25 (~400MB, picked so most operations were in memory), which 
pushes lots of things in and out of the buffer cache as fast as possible.


Here's some data with no background writer at all:

clients   tps   buf_clean   buf_backend   buf_alloc
      1  1340           0         72554       96846
      2  1421           0         73969       88879
      3  1418           0         71452       86339
      4  1344           0         75184       90187
      8  1361           0         73063       88099
     15  1348           0         71861       86923

And here's what I got with the new approach, using 10% for the scan 
percentage and a maximum of 200 buffers written out.  I picked those 
numbers after some experimentation because they were the first I found 
where the background writer was almost always riding right behind the 
strategy point; with lower numbers, when the background writer woke up it 
often found it had already fallen behind the strategy point and had to 
start cleaning forward the old way instead, which wasn't what I wanted to 
test.


clients   tps   buf_clean   buf_backend   buf_alloc
      1  1261      122917           150      105655
      2  1186      126663            26       97586
      3  1154      127780            21       98077
      4  1181      127685            19       98068
      8  1076      128597             2       98229
     15  1065      128399             5       98143

As you can see, I achieved the goal of almost never having a backend write 
its own buffer, so yay for that.  That's the only good thing I can say 
about it though.  The TPS results take a moderate dive, and there's about 
10% more buffer allocations.  The big and obvious issue is that I'm 
writing almost 75% more buffers this way--way worse even than the 10% 
extra overhead Heikki was seeing.  But since I've gone out of my way to 
find a worst-case for this code, I consider mission accomplished there.


Anyway, will have more detailed reports to post after I collect some more 
data; for now I just wanted to join Heikki in confirming that the strategy 
of trying to get the LRU cleaner to ride right behind the strategy point 
can really waste a whole lot of writes.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Bgwriter strategies

2007-07-06 Thread Heikki Linnakangas

Tom Lane wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
>>                        imola-336   imola-337   imola-340
>> writes by checkpoint       38302       30410       39529
>> writes by bgwriter        350113     2205782     1418672
>> writes by backends       1834333      265755      787633
>> writes total             2222748     2501947     2245834
>> allocations              2683170     2657896     2699974
>>
>> It looks like Tom's idea is not a winner; it leads to more writes than 
>> necessary.
>
> The incremental number of writes is not that large; only about 10% more.
> The interesting thing is that those "extra" writes must represent
> buffers that were re-touched after their usage_count went to zero, but
> before they could be recycled by the clock sweep.  While you'd certainly
> expect some of that, I'm surprised it is as much as 10%.  Maybe we need
> to play with the buffer allocation strategy some more.
>
> The very small difference in NOTPM among the three runs says that either
> this whole area is unimportant, or DBT2 isn't a good test case for it;
> or maybe that there's something wrong with the patches?
>
>> On imola-340, there's still a significant amount of backend writes. I'm 
>> still not sure what we should be aiming at. Is 0 backend writes our goal?
>
> Well, the lower the better, but not at the cost of a very large increase
> in total writes.
>
>> Imola-340 was with a patch along the lines of 
>> Itagaki's original patch, ensuring that there's as many clean pages in 
>> front of the clock head as were consumed by backends since last bgwriter 
>> iteration.
>
> This seems intuitively wrong, since in the presence of bursty request
> behavior it'll constantly be getting caught short of buffers.  I think
> you need a safety margin and a moving-average decay factor.  Possibly
> something like
>
> buffers_to_clean = Max(buffers_used * 1.1,
>                        buffers_to_clean * 0.999);
>
> where buffers_used is the current observation of demand.  This would
> give us a safety margin such that buffers_to_clean is not less than
> the largest demand observed in the last 100 iterations (0.999 ^ 100
> is about 0.90, cancelling out the initial 10% safety margin), and it
> takes quite a while for the memory of a demand spike to be forgotten
> completely.


That would be overly aggressive on a workload that's steady on average, 
but consists of small bursts. Like this: 0 0 0 0 100 0 0 0 0 100 0 0 0 0 
100. You'd end up writing ~100 pages on every bgwriter round, but you 
only need an average of 20 pages per round. That'd be effectively the 
same as keeping all buffers with usage_count=0 clean.


BTW, I believe that kind of workload is actually very common. That's 
what you get if one transaction causes say 10-100 buffer allocations, 
and you execute one such transaction every few seconds.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Bgwriter strategies

2007-07-06 Thread Heikki Linnakangas

Greg Smith wrote:
> On Thu, 5 Jul 2007, Heikki Linnakangas wrote:
>> It looks like Tom's idea is not a winner; it leads to more writes than 
>> necessary.
>
> What I came away with as the core of Tom's idea is that the cleaning/LRU 
> writer shouldn't ever scan the same section of the buffer cache twice, 
> because anything that resulted in a new dirty buffer will be unwritable 
> by it until the clock sweep passes over it.  I never took that to mean 
> that idea necessarily had to be implemented as "trying to aggressively 
> keep all pages with usage_count=0 clean".
>
> I've been making slow progress on this myself, and the question I've 
> been trying to answer is whether this fundamental idea really matters or 
> not. One clear benefit of that alternate implementation is that it should 
> allow setting a lower value for the interval without being as concerned 
> that you're wasting resources by doing so, which I've found to be a 
> problem with the current implementation--it will consume a lot of CPU 
> scanning the same section right now if you lower that too much.


Yes; in fact, ignoring the CPU overhead of scanning the same section over 
and over again, Tom's proposal is the same as setting both 
bgwriter_lru_* settings all the way up to the max. I ran a DBT-2 
test like that as well, and the # of writes was indeed the same, just 
with much higher CPU usage. It's clear that scanning the same section 
over and over again has been a waste of time in previous releases.


As a further data point, I constructed a smaller test case that performs 
random DELETEs on a table using an index. I varied the # of 
shared_buffers, and ran the test with bgwriter disabled, or tuned all 
the way up to the maximum. Here's the results from that:


 shared_buffers | writes (bgwriter off) | writes (bgwriter max) |    writes_ratio
----------------+-----------------------+-----------------------+-------------------
           2560 |                 86936 |                 88023 |  1.01250345081439
           5120 |                 81207 |                 84551 |  1.04117871612053
           7680 |                 75367 |                 80603 |  1.06947337694216
          10240 |                 69772 |                 74533 |  1.06823654187926
          12800 |                 64281 |                 69237 |  1.07709898725907
          15360 |                 58515 |                 64735 |  1.10629753054772
          17920 |                 53231 |                 58635 |  1.10151979109917
          20480 |                 48128 |                 54403 |  1.13038148271277
          23040 |                 43087 |                 49949 |  1.15925917330053
          25600 |                 39062 |                 46477 |  1.1898264297783
          28160 |                 35391 |                 43739 |  1.23587917832217
          30720 |                 32713 |                 37480 |  1.14572188426619
          33280 |                 31634 |                 31677 |  1.00135929695897
          35840 |                 31668 |                 31717 |  1.00154730327144
          38400 |                 31696 |                 31693 |  0.999905350832913
          40960 |                 31685 |                 31730 |  1.00142023039293
          43520 |                 31694 |                 31650 |  0.998611724616647
          46080 |                 31661 |                 31650 |  0.999652569407157

The first writes column is the # of writes with bgwriter disabled, the 
second is with the aggressive bgwriter. The table is roughly 33,000 pages, 
so once shared_buffers exceeds that, the table fits in cache and the 
bgwriter strategy makes no difference.



> As far as your results, first off I'm really glad to see someone else 
> comparing checkpoint/backend/bgwriter writes the same way I've been doing, 
> so I finally have someone else's results to compare against.  I expect that 
> the optimal approach here is a hybrid one that structures scanning the 
> buffer cache the new way Tom suggests, but limits the number of writes 
> to "just enough".  I happen to be fond of the "just enough" computation 
> based on a weighted moving average I wrote before, but there's certainly 
> room for multiple implementations of that part of the code to evolve.


We need to get the requirements straight.

One goal of bgwriter is clearly to keep just enough buffers clean in 
front of the clock hand so that backends don't need to do writes 
themselves until the next bgwriter iteration. But not any more than 
that, otherwise we might end up doing more writes than necessary if some 
of the buffers are redirtied.


To deal with bursty workloads, for example a batch of 2 GB worth of 
inserts coming in every 10 minutes, it seems we want to keep doing a 
little bit of cleaning even when the system is idle, to prepare for the 
next burst. The idea is to smoothen the physical I/O bursts; if we don't 
clean the dirty buffers left over from the previous burst during the 
idle period, the I/O system will be bottlenecked during the bursts, and 
sit idle otherwise.


To strike a balance between cleaning buffers ahead of possible bursts in 
the future and not doing unnecessary I/O when no such bursts come, I 
think a reasonable strategy is to write buffers with usage_count=0 at a 
slow pace when there's no buffer allocations happening.


To smoothen the small variations on a relatively steady workload, the 
weighted average sounds good.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Bgwriter strategies

2007-07-06 Thread Greg Smith

On Thu, 5 Jul 2007, Tom Lane wrote:

> This would give us a safety margin such that buffers_to_clean is not 
> less than the largest demand observed in the last 100 iterations...and 
> it takes quite a while for the memory of a demand spike to be forgotten 
> completely.


If you tested this strategy even on a steady load, I'd expect you'll find 
there are large spikes in allocations during the occasional period where 
everything is just right to pull a bunch of buffers in, and if you let 
that max linger around for 100 iterations you'll write a large number of 
buffers more than you need.  That's what I saw when I tried to remember 
too much information about allocation history in the version of the auto 
LRU tuner I worked on.  For example, with 32000 buffers, with pgbench 
trying to UPDATE as fast as possible, I sometimes hit 1500 allocations 
in an interval, but the steady-state allocation level was closer to 500.

I ended up settling on max(moving average of the last 16, most recent 
allocation), and that seemed to work pretty well without being too 
wasteful from excessive writes.  Playing with multiples of 2, 8 was 
definitely not enough memory to smooth usefully, while 32 seemed a little 
sluggish on the entry and wasteful on the exit ends.


At the default interval, 16 iterations is looking back at the previous 3.2 
seconds.  I have a feeling the proper tuning for this should be 
time-based, where you would decide how long ago to consider looking back 
for and compute the iterations based on that.
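
A sketch of that smoothing rule (names hypothetical; the 16-sample window 
and the max() guard are as described above): the moving average decays 
bursts gradually, while the floor at the most recent allocation count 
reacts instantly to a new spike.

#define SMOOTHING_SAMPLES 16

/* Weighted moving average with a floor at the most recent observation:
 * smooths demand on the way down, but never under-estimates a fresh burst. */
static long
buffers_to_clean_estimate(double *moving_avg, long recent_allocations)
{
  *moving_avg += ((double) recent_allocations - *moving_avg) / SMOOTHING_SAMPLES;
  if ((double) recent_allocations > *moving_avg)
    return recent_allocations;
  return (long) *moving_avg;
}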


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Bgwriter strategies

2007-07-05 Thread Greg Smith

On Thu, 5 Jul 2007, Heikki Linnakangas wrote:

> It looks like Tom's idea is not a winner; it leads to more writes than 
> necessary.


What I came away with as the core of Tom's idea is that the cleaning/LRU 
writer shouldn't ever scan the same section of the buffer cache twice, 
because anything that resulted in a new dirty buffer will be unwritable by 
it until the clock sweep passes over it.  I never took that to mean that 
idea necessarily had to be implemented as "trying to aggressively keep all 
pages with usage_count=0 clean".


I've been making slow progress on this myself, and the question I've been 
trying to answer is whether this fundamental idea really matters or not. 
One clear benefit of that alternate implementation is that it should allow 
setting a lower value for the interval without being as concerned that 
you're wasting resources by doing so, which I've found to be a problem 
with the current implementation--it will consume a lot of CPU scanning 
the same section right now if you lower that too much.


As far as your results, first off I'm really glad to see someone else 
comparing checkpoint/backend/bgwriter writes the same way I've been doing, 
so I finally have someone else's results to compare against.  I expect that the 
optimal approach here is a hybrid one that structures scanning the buffer 
cache the new way Tom suggests, but limits the number of writes to "just 
enough".  I happen to be fond of the "just enough" computation based on a 
weighted moving average I wrote before, but there's certainly room for 
multiple implementations of that part of the code to evolve.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Bgwriter strategies

2007-07-05 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
>                        imola-336   imola-337   imola-340
> writes by checkpoint       38302       30410       39529
> writes by bgwriter        350113     2205782     1418672
> writes by backends       1834333      265755      787633
> writes total             2222748     2501947     2245834
> allocations              2683170     2657896     2699974

> It looks like Tom's idea is not a winner; it leads to more writes than 
> necessary.

The incremental number of writes is not that large; only about 10% more.
The interesting thing is that those "extra" writes must represent
buffers that were re-touched after their usage_count went to zero, but
before they could be recycled by the clock sweep.  While you'd certainly
expect some of that, I'm surprised it is as much as 10%.  Maybe we need
to play with the buffer allocation strategy some more.

The very small difference in NOTPM among the three runs says that either
this whole area is unimportant, or DBT2 isn't a good test case for it;
or maybe that there's something wrong with the patches?

> On imola-340, there's still a significant amount of backend writes. I'm 
> still not sure what we should be aiming at. Is 0 backend writes our goal?

Well, the lower the better, but not at the cost of a very large increase
in total writes.

> Imola-340 was with a patch along the lines of 
> Itagaki's original patch, ensuring that there's as many clean pages in 
> front of the clock head as were consumed by backends since last bgwriter 
> iteration.

This seems intuitively wrong, since in the presence of bursty request
behavior it'll constantly be getting caught short of buffers.  I think
you need a safety margin and a moving-average decay factor.  Possibly
something like

buffers_to_clean = Max(buffers_used * 1.1,
   buffers_to_clean * 0.999);

where buffers_used is the current observation of demand.  This would
give us a safety margin such that buffers_to_clean is not less than
the largest demand observed in the last 100 iterations (0.999 ^ 100
is about 0.90, cancelling out the initial 10% safety margin), and it
takes quite a while for the memory of a demand spike to be forgotten
completely.
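
As a quick check of that claim, here's a minimal C sketch of the decay rule 
(variable names hypothetical; the 1.1 margin and 0.999 factor are from the 
formula above), showing that a single demand spike is still almost fully 
remembered 100 iterations later:

#include <stdio.h>

#define Max(a, b) ((a) > (b) ? (a) : (b))

int main(void)
{
  double buffers_to_clean = 0.0;
  int i;

  for (i = 0; i <= 100; i++)
  {
    /* one demand spike of 1000 buffers at iteration 0, then idle */
    double buffers_used = (i == 0) ? 1000.0 : 0.0;

    buffers_to_clean = Max(buffers_used * 1.1,
                           buffers_to_clean * 0.999);

    if (i % 25 == 0)
      printf("iter %3d: buffers_to_clean = %.1f\n", i, buffers_to_clean);
  }
  /* at iteration 100: 1100 * 0.999^100 ~= 995, i.e. the target is still
   * about the size of the original spike */
  return 0;
}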

regards, tom lane



[HACKERS] Bgwriter strategies

2007-07-05 Thread Heikki Linnakangas

I ran some DBT-2 tests to compare different bgwriter strategies:

http://community.enterprisedb.com/bgwriter/

imola-336 was run with minimal bgwriter settings, so that most writes 
are done by backends. imola-337 was patched with an implementation of 
Tom's bgwriter idea, trying to aggressively keep all pages with 
usage_count=0 clean. Imola-340 was with a patch along the lines of 
Itagaki's original patch, ensuring that there's as many clean pages in 
front of the clock head as were consumed by backends since last bgwriter 
iteration.


All test runs were also patched to count the # of buffer allocations, 
and # of buffer flushes performed by bgwriter and backends. Here are those 
results (I hope the indentation gets through properly):


                       imola-336   imola-337   imola-340
writes by checkpoint       38302       30410       39529
writes by bgwriter        350113     2205782     1418672
writes by backends       1834333      265755      787633
writes total             2222748     2501947     2245834
allocations              2683170     2657896     2699974

It looks like Tom's idea is not a winner; it leads to more writes than 
necessary. But the OS caches the writes, so let's look at the actual I/O 
performed to be sure, from iostat:


http://community.enterprisedb.com/bgwriter/writes-336-337-340.jpg

The graph shows that on imola-337, there was indeed more write traffic 
than on the other two test runs.


On imola-340, there's still a significant amount of backend writes. I'm 
still not sure what we should be aiming at. Is 0 backend writes our goal?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
