[HACKERS] Checkpoints and buffers that are hint-bit-dirty

2007-07-06 Thread Gregory Stark

When we checkpoint we write out all dirty buffers. But ISTM we don't really
need to write out buffers which are dirty but which have an LSN older than the
previous checkpoint. Those represent buffers which were dirtied by a
non-wal-logged modification, ie, hint bit setting. The other non-wal-logged
operations will sync the buffer themselves when they're done.

I guess it doesn't really buy much, probably just a slight delay in writing
out the page until the bgwriter gets around to it. Conceivably you could have
a hot buffer with many missing hint bits which gets written out on several
checkpoints, but how many of those can you have? And extending the checkpoint
doesn't seem like much of a concern. On the other hand, it wouldn't be hard
to check, would it?
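As a sanity check of the idea, here is a toy model of the proposed rule
(illustrative Python, not the actual bufmgr code; all names are invented):

```python
def checkpoint_must_write(is_dirty, page_lsn, prev_checkpoint_redo):
    """Would the checkpoint need to flush this buffer under the proposal?"""
    if not is_dirty:
        return False
    # Any WAL-logged modification since the previous checkpoint advanced
    # the page LSN past that checkpoint's redo pointer.  An older LSN
    # therefore implies the dirt came from a non-WAL-logged change,
    # i.e. hint-bit setting, and the write could be skipped.
    return page_lsn >= prev_checkpoint_redo

# Hint-bit-only dirty page (LSN predates the previous checkpoint): skippable.
assert checkpoint_must_write(True, page_lsn=90, prev_checkpoint_redo=100) is False
# Page with a WAL-logged change since then: must be written.
assert checkpoint_must_write(True, page_lsn=150, prev_checkpoint_redo=100) is True
```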

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com


---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] usleep feature for pgbench

2007-07-06 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> Jan Wieck wrote:
>> You mean as a second, optional argument? Good idea.
>> 
>> us = microseconds
>> ms = milliseconds
>> s  = seconds (default)
>> 
>> \sleep {value|:variable} [us|ms|s]
>> 
>> Is that okay with everyone?

> I won't object, but is it really worth the trouble?

If we don't get it right now, we'll be stuck with the API later.
+1 for the definition suggested above.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Bgwriter strategies

2007-07-06 Thread Gregory Stark
"Tom Lane" <[EMAIL PROTECTED]> writes:

>> That would be overly aggressive on a workload that's steady on average, 
>> but consists of small bursts. Like this: 0 0 0 0 100 0 0 0 0 100 0 0 0 0 
>> 100. You'd end up writing ~100 pages on every bgwriter round, but you 
>> only need an average of 20 pages per round.
>
> No, you wouldn't be *writing* that many, you'd only be keeping that many
> *clean*; which only costs more work if any of them get re-dirtied
> between writing and use.  Which is a fairly small probability if we're
> talking about a small difference in the number of buffers to keep clean.
> So I think the average number of writes is hardly different, it's just
> that the backends are far less likely to have to do any of them.

Well, Postgres's hint bits tend to redirty pages precisely once, at just about
the time when they're ready to be paged out. But I think there are things we
can do to tackle that head-on.

Bgwriter could try to set hint bits before cleaning these pages, for example.
Or we could elect, in selected circumstances, not to write out a page that is
hint-bit-dirty only. Or some combination of those options depending on the
circumstances. Figuring out the circumstances is the hard part.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Updated tsearch documentation

2007-07-06 Thread Bruce Momjian
FYI, I have massively reorganized the text search documentation and it
is getting closer to something I am happy with:

http://momjian.us/expire/fulltext/HTML/textsearch.html

-- 
  Bruce Momjian  <[EMAIL PROTECTED]>  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Bgwriter strategies

2007-07-06 Thread Greg Smith

On Fri, 6 Jul 2007, Heikki Linnakangas wrote:

> I've been running these tests with bgwriter_delay of 10 ms, which is
> probably too aggressive.


Even on relatively high-end hardware, I've found it hard to get good 
results out of the BGW with the delay under 50ms--particularly when trying 
to do some more complicated smoothing.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Still recommending daily vacuum...

2007-07-06 Thread Alvaro Herrera
Kevin Grittner wrote:

> This all started with the question about whether the documentation should
> say anything about vacuum schedules other than "enable autovacuum."
> My point was that I have a use case where I think that a scheduled vacuum
> will be better than leaving everything to autovacuum.  I may not be the only
> one, so I'm thinking the documentation should discuss where an explicit
> schedule might be useful.

Fair enough, you are correct that this scenario may be a useful one to
describe in the docs.

There are plans to add some sort of scheduling to autovacuum so that it
chooses different settings based on time/date.  It may be useful for you
as well (though of course that is 8.4 material at best ...), and we may
consider shrinking the docs (or at least reshaping them) a bit at that
time.


> The changes sound good, but I don't see the point of having any vacuum
> activity during the work week on the big tables in the database I was
> describing.

Of course.

-- 
Alvaro Herrera   Valdivia, Chile   ICBM: S 39º 49' 18.1", W 73º 13' 56.4"
"A wizard is never late, Frodo Baggins, nor is he early.
 He arrives precisely when he means to."  (Gandalf, en LoTR FoTR)

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Still recommending daily vacuum...

2007-07-06 Thread Kevin Grittner
>>> On Fri, Jul 6, 2007 at  2:19 PM, in message
<[EMAIL PROTECTED]>, Alvaro Herrera
<[EMAIL PROTECTED]> wrote: 
> Kevin Grittner wrote:

> 2. The point of autovacuum is to get rid of maintenance burden, not add
> to it.  If you know which tables are small and frequently updated, then
> configure those to specific settings that you've found to be optimal,
> and then you don't need to worry about vacuuming them any longer.
 
We have 72 counties using the same schema, which has over 300 tables.
(Each county has its own server, located in that county, with its own data.)
Rather than trying to fine-tune autovacuum for each table in all locations,
we find it more convenient to use general settings which are aggressive
enough for the small, high-update tables, but lax enough to let the big
ones go until a nightly database vacuum.  (That time is pretty slack
anyway, so why not off-load the overhead of the vacuum to those hours?)
 
> If you didn't tune it to match specific tables,
> most likely your biggest tables never met the formula's condition, which
> is why you were seeing it affecting only the small tables (which met the
> condition under the values you configured server-wide).
> 
> The extra I/O I was talking about would come from vacuuming one of your
> biggest tables, which could cause the amount of I/O to swamp everything
> else the server was doing at the time.
 
Is there something better about having the autovacuum compete with load
during the week, rather than doing a database vacuum during otherwise
idle weekend hours, immediately after the weekly delete of almost 2% of
the rows?  At the time we run the database vacuum analyze, there is
nothing else running to be swamped.
 
>> Our tables tend to fall into one of four categories, small tables with high
>> update rates, medium tables (millions or tens of millions of rows) with
>> thousands or tens of thousands of updates per day, static tables of various
>> sizes that are only modified as part of a software release, and big honking
>> tables (100s of GB) which are either insert-only or are insert with
>> periodic purge of old rows.  Only the first group has a chance of being
>> autovacuumed in normal operations.  Even the purges don't cause it to kick
>> in.
> 
> It could certainly vacuum all your tables.
 
Well, sure, if we weren't doing a nightly database vacuum.  (Weekly for the
database containing the largest tables, mentioned above.)
 
This all started with the question about whether the documentation should
say anything about vacuum schedules other than "enable autovacuum."
My point was that I have a use case where I think that a scheduled vacuum
will be better than leaving everything to autovacuum.  I may not be the only
one, so I'm thinking the documentation should discuss where an explicit
schedule might be useful.
 
The changes sound good, but I don't see the point of having any vacuum
activity during the work week on the big tables in the database I was
describing.  It seems to me that it would result in at least some
performance degradation for the interactive users, and bloat the table,
since we might start inserting before the post-delete vacuum.
 
-Kevin
 


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Still recommending daily vacuum...

2007-07-06 Thread Matthew T. O'Connor

Alvaro Herrera wrote:
> Matthew T. O'Connor wrote:
>> [...]
>
> Well, if a table has 10 rows, and we keep the current threshold of 1000
> rows, then this table must have 1002 dead tuples (99% dead tuples: 1002
> dead + 10 live) before being vacuumed.  This seems wasteful: long before
> that point it might have, say, 500 dead tuples against only 10 live
> tuples, and each scan must wade through all of the dead tuples.
>
> Another small table with 100 tuples will be vacuumed on every iteration
> as well, even if there are just two dead tuples.  So you are right --
> maybe dropping it all the way to 0 is too much.  But is a small value of
> 10 reasonable?  That would make the 10-tuple table be vacuumed when
> there are 10 dead tuples (50% dead) and the 100-tuple table when there
> are 11 (11% dead).  It decreases quickly toward the scale factor (2%,
> or do we want to decrease it to 1%?).


I think it's probably fine.  I suspect the optimal number for the
base_threshold depends on the width of the row: for a very narrow row,
where many fit on the same page, 20 or 50 might be right, while for a
very wide table a smaller number might be optimal.  However, I think it
probably doesn't matter much either way.


Reducing the default to 10 seems fine, and perhaps we could even remove it
as a tuning knob.  I think there are too many autovacuum knobs and they
confuse people.  Is it too late to remove this GUC altogether?



---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [HACKERS] Bgwriter strategies

2007-07-06 Thread Heikki Linnakangas

Heikki Linnakangas wrote:
> I scheduled a test with the moving average method as well, we'll see how
> that fares.

Not too well :(.

Strange. The total # of writes is on par with having bgwriter disabled, 
but the physical I/O graphs show more I/O (on par with the aggressive 
bgwriter), and the response times are higher.


I just noticed that on the tests with the moving average, or the simple 
"just enough" method, there's a small bump in the CPU usage during the 
ramp up period. I believe that's because bgwriter scans through the 
whole buffer cache without finding enough buffers to clean. I ran some 
tests earlier with unpatched bgwriter tuned to the maximum, and it used 
~10% of CPU, which is the same level that the bump rises to. 
Unfortunately I haven't been taking pg_buffercache snapshots until after 
the ramp up; it should've shown up there.


I've been running these tests with a bgwriter_delay of 10 ms, which is
probably too aggressive. I used that to test the idea of starting the
scan from where it left off, instead of always starting from the clock hand.


If someone wants to have a look, the # of writes are collected to a 
separate log file in /server/buf_alloc_stats.log. There's 
no link to it from the html files. There's also summary snapshots of 
pg_buffercache every 30 seconds in /server/bufcache.log.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


[HACKERS] CurrentMemoryContext is NULL

2007-07-06 Thread Guan Wang

Hi guys,

I've been playing with PostgreSQL for a couple of weeks.  Currently I'm
trying to develop an extension on Windows with VC++ 2005.  I installed the
full PostgreSQL 8.3 package with include files and libraries.  The code
compiles fine in VC2005, aside from some warnings, but when I try to run my
function in psql I always get a runtime error.  After getting the PostgreSQL
source code, I compiled my own libpostgres.lib and linked my code against
that instead; running it gives the same error.  I found that
CurrentMemoryContext is NULL, even though it is supposed to be initialized
by the backend itself.  Can anyone tell me what is happening here?  Is it
just because it's Windows, or should I use Cygwin?

I noticed CurrentMemoryContext is supposed to be imported into my DLL by the
DLLIMPORT macro.  It seems like PostgreSQL can't pass this variable to my
DLL...

Thank you for helping me!

The snippet of my code looks like this:

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;                 /* required for loadable modules since 8.2 */

PG_FUNCTION_INFO_V1(foobar);

Datum
foobar(PG_FUNCTION_ARGS)
{
    /* the runtime error occurs here: CurrentMemoryContext is NULL */
    void *ptr = palloc(10);      /* or pgport_palloc(10) */

    PG_RETURN_NULL();
}


Re: [HACKERS] Still recommending daily vacuum...

2007-07-06 Thread Alvaro Herrera
Kevin Grittner wrote:
> >>> On Tue, Jul 3, 2007 at  5:34 PM, in message
> <[EMAIL PROTECTED]>, Alvaro Herrera
> <[EMAIL PROTECTED]> wrote: 
> > Kevin Grittner wrote:
> > 
> >> Autovacuum is enabled with very aggressive settings, to cover small
> >> tables, including one with about 75 rows that can be updated 100 or more
> >> times per second.  Even with these settings there is zero chance of any
> >> table of even moderate size hitting the autovacuum threshold between our
> >> scheduled vacuums.
> > 
> > Sounds like you would be served by setting those specific tables to a
> > lower vacuum scale factor (keeping a more normal default for the rest of
> > the tables), and having a non-zero vacuum delay setting (to avoid
> > excessive I/O consumption).  Have you tried that?
>  
> I did play with that, but it doesn't seem to make sense in our environment.
> We have about 100 databases, most of them scattered around the state, and
> any extra maintenance like that has a cost, particularly with the daily
> cluster changing the oid.  Both from doing the math and from experience,
> I can say that the autovacuum only affects the small, frequently updated
> tables, so I could see no benefit.  Am I missing something?  (I can't see
> where this causes any extra I/O.)

There seems to be a misunderstanding here.

1. Cluster does not change the OID.  It only changes the relfilenode.
The file on disk is named something else, but the OID used in the
database remains unchanged.  So if you insert something into
pg_autovacuum it continues to work after a CLUSTER, you don't need to
update the OID.

2. The point of autovacuum is to get rid of maintenance burden, not add
to it.  If you know which tables are small and frequently updated, then
configure those to specific settings that you've found to be optimal,
and then you don't need to worry about vacuuming them any longer.

You already know this but: autovacuum uses a formula to determine which
tables to vacuum.  The formula is based on the number of dead tuples,
the size of the table and two factors that you can configure per table
as well as globally.  If you didn't tune it to match specific tables,
most likely your biggest tables never met the formula's condition, which
is why you were seeing it affecting only the small tables (which met the
condition under the values you configured server-wide).
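A hedged sketch of that per-table condition (the numbers mirror this
thread's discussion — base threshold 1000, scale factor 0.2 — rather than
any particular release's defaults; the real server reads them from
autovacuum_vacuum_threshold / autovacuum_vacuum_scale_factor, overridable
per table via pg_autovacuum):

```python
def autovacuum_would_vacuum(n_dead_tuples, reltuples,
                            base_threshold=1000, scale_factor=0.2):
    # Model of autovacuum's trigger condition: vacuum once the dead-tuple
    # count exceeds a fixed base plus a fraction of the table size.
    return n_dead_tuples > base_threshold + scale_factor * reltuples

# A 10-row table is only vacuumed once it has more than 1002 dead tuples:
assert not autovacuum_would_vacuum(1002, 10)
assert autovacuum_would_vacuum(1003, 10)

# A 10-million-row table needs over 2,001,000 dead tuples first, which is
# why big tables may never hit the condition between scheduled vacuums.
assert not autovacuum_would_vacuum(2_000_000, 10_000_000)
```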

The extra I/O I was talking about would come from vacuuming one of your
biggest tables, which could cause the amount of I/O to swamp everything
else the server was doing at the time.  Since it never actually touched
the big tables this hasn't happened to you yet.  Do note that autovacuum
uses the vacuum_cost_delay if autovacuum_vacuum_cost_delay is set to the
default value of -1.

> Our tables tend to fall into one of four categories, small tables with high
> update rates, medium tables (millions or tens of millions of rows) with
> thousands or tens of thousands of updates per day, static tables of various
> sizes that are only modified as part of a software release, and big honking
> tables (100s of GB) which are either insert-only or are insert with
> periodic purge of old rows.  Only the first group has a chance of being
> autovacuumed in normal operations.  Even the purges don't cause it to kick
> in.

It could certainly vacuum all your tables.  But one thing to keep in
mind is that, as of 8.2, only one autovacuum process can be running.  So
if and when it decides to vacuum the big tables, it will be a long time
before it is able to go back and check the small tables.

This is fixed in 8.3.


> >> Oh, the tiny, high-update tables occasionally bloat to hundreds or
> >> thousands of pages because of long-running transactions, so we schedule
> >> a daily cluster on those, just to keep things tidy.
> > 
> > If you can afford the cluster then there's no problem.  I don't expect
> > that to change in 8.3.
>  
> Here also we're talking 10 to 20 milliseconds.  I understand that in 8.2
> that leaves a chance of an error, but we seem to have dodged that bullet
> so far.  Has that gotten any safer in 8.3?

Yes, it did, assuming I understood which error you are talking about
(CLUSTER not preserving dead tuples that might still be visible to
concurrent transactions).

-- 
Alvaro Herrera                               http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Still recommending daily vacuum...

2007-07-06 Thread Alvaro Herrera
Matthew T. O'Connor wrote:
> Alvaro Herrera wrote:
> >Jim C. Nasby wrote:
> >>FWIW, I normally go with the 8.2 defaults, though I could see dropping
> >>vacuum_scale_factor down to 0.1 or 0.15. I also think the thresholds
> >>could be decreased further, maybe divide by 10.
> >
> >How about pushing thresholds all the way down to 0?
> 
> As long as it handles small (or zero-row) tables ok then yes.  The 
> base_threshold in the original contrib autovacuum was just an easy way 
> to not vacuum really small tables too often.  If a table has only 10 
> rows, it's going to get vacuumed every time one row is updated.  I guess 
> that's not a big problem with a table that small, but it still seems
> excessive.

Well, if a table has 10 rows, and we keep the current threshold of 1000
rows, then this table must have 1002 dead tuples (99% dead tuples: 1002
dead + 10 live) before being vacuumed.  This seems wasteful: long before
that point it might have, say, 500 dead tuples against only 10 live
tuples, and each scan must wade through all of the dead tuples.

Another small table with 100 tuples will be vacuumed on every iteration
as well, even if there are just two dead tuples.  So you are right --
maybe dropping it all the way to 0 is too much.  But is a small value of
10 reasonable?  That would make the 10-tuple table be vacuumed when there
are 10 dead tuples (50% dead) and the 100-tuple table when there are 11
(11% dead).  It decreases quickly toward the scale factor (2%, or do we
want to decrease it to 1%?).

Does this sound acceptable?
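For a quick sanity check of the arithmetic above (note the quoted
10-and-11 figures come out if the scale factor is taken as 1%):

```python
def trigger_point(reltuples, base_threshold, scale_factor):
    # Dead tuples needed before autovacuum touches the table.
    return base_threshold + scale_factor * reltuples

# Current values from the thread: base 1000, scale 0.2 -> tiny tables starve.
assert round(trigger_point(10, 1000, 0.2)) == 1002

# Proposed base of 10, with a 1% scale factor:
assert round(trigger_point(10, 10, 0.01)) == 10    # ~50% of the table dead
assert round(trigger_point(100, 10, 0.01)) == 11   # 11% of the table dead
```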

-- 
Alvaro Herrera http://www.amazon.com/gp/registry/DXLWNGRJD34J
"I am amazed at [the pgsql-sql] mailing list for the wonderful support, and
lack of hesitasion in answering a lost soul's question, I just wished the rest
of the mailing list could be like this."   (Fotis)
   (http://archives.postgresql.org/pgsql-sql/2006-06/msg00265.php)

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Bgwriter strategies

2007-07-06 Thread Greg Smith

On Fri, 6 Jul 2007, Tom Lane wrote:


> The problem is that it'd be very hard to track how far ahead of the
> recycling sweep hand we are, because that number has to be measured
> in usage-count-zero pages.  I see no good way to know how many of the
> pages we scanned before have been touched (and given nonzero usage
> counts) unless we rescan them.


I've actually been working on how to address that specific problem without 
expressly tracking the contents of the buffer cache.  When the background 
writer is called, it finds out how many buffers were allocated and how far 
the sweep point moved since the last call.  From that, you can calculate 
how many buffers on average need to be scanned per allocation, which tells 
you something about the recently encountered density of 0-usage count 
buffers.  My thought was to use that as an input to the computation for 
how far ahead to stay.



> I've been doing moving averages for years and years, and I find that the
> multiplication approach works at least as well as explicitly storing the
> last K observations.  It takes a lot less storage and arithmetic too.


I was simplifying the description just to comment on the range for K; I 
was using a multiplication approach for the computation.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] usleep feature for pgbench

2007-07-06 Thread Jan Wieck

On 7/6/2007 1:32 PM, Heikki Linnakangas wrote:

> Jan Wieck wrote:
>> On 7/6/2007 10:44 AM, Peter Eisentraut wrote:
>>> On Thursday, 5 July 2007 21:12, Jan Wieck wrote:
>>>> To test some changes in Slony I needed a
>>>>
>>>>      \usleep [microseconds|:variable]
>>>>
>>>> in pgbench's scripting language to be able to have hundreds of
>>>> concurrently running transactions without totally swamping the system.
>>>> I was wondering if anyone would object to permanently adding this to
>>>> the pgbench code?
>>>
>>> Or maybe a \sleep command that takes units, if it's not too much work.
>>
>> You mean as a second, optional argument? Good idea.
>>
>>    us = microseconds
>>    ms = milliseconds
>>    s  = seconds (default)
>>
>>    \sleep {value|:variable} [us|ms|s]
>>
>> Is that okay with everyone?
>
> I won't object, but is it really worth the trouble?
>
> Can you do microsecond precision sleeps, and on what platforms? How much
> overhead is there? IIRC, on Linux the minimum time you can sleep depends
> on CONFIG_HZ, and the default was 10 ms until recently.

It is what you tell select(2) in the struct timeval.  Why limit it to
some arbitrary precision?



Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] Bgwriter strategies

2007-07-06 Thread Heikki Linnakangas

Tom Lane wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
>>                         imola-336   imola-337   imola-340
>> writes by checkpoint        38302       30410       39529
>> writes by bgwriter         350113     2205782     1418672
>> writes by backends        1834333      265755      787633
>> writes total              2222748     2501947     2245834
>> allocations               2683170     2657896     2699974
>>
>> It looks like Tom's idea is not a winner; it leads to more writes than
>> necessary.
>
> The incremental number of writes is not that large; only about 10% more.
> The interesting thing is that those "extra" writes must represent
> buffers that were re-touched after their usage_count went to zero, but
> before they could be recycled by the clock sweep.  While you'd certainly
> expect some of that, I'm surprised it is as much as 10%.  Maybe we need
> to play with the buffer allocation strategy some more.
>
> The very small difference in NOTPM among the three runs says that either
> this whole area is unimportant, or DBT2 isn't a good test case for it;
> or maybe that there's something wrong with the patches?


The small difference in NOTPM is because the I/O still wasn't saturated 
even with 10% extra writes.


I ran more tests with a higher number of warehouses, and the extra 
writes start to show in the response times. See tests 341-344: 
http://community.enterprisedb.com/bgwriter/.


I scheduled a test with the moving average method as well, we'll see how 
that fares.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] usleep feature for pgbench

2007-07-06 Thread Heikki Linnakangas

Jan Wieck wrote:
> On 7/6/2007 10:44 AM, Peter Eisentraut wrote:
>> On Thursday, 5 July 2007 21:12, Jan Wieck wrote:
>>> To test some changes in Slony I needed a
>>>
>>>      \usleep [microseconds|:variable]
>>>
>>> in pgbench's scripting language to be able to have hundreds of
>>> concurrently running transactions without totally swamping the system.
>>> I was wondering if anyone would object to permanently adding this to
>>> the pgbench code?
>>
>> Or maybe a \sleep command that takes units, if it's not too much work.
>
> You mean as a second, optional argument? Good idea.
>
>    us = microseconds
>    ms = milliseconds
>    s  = seconds (default)
>
>    \sleep {value|:variable} [us|ms|s]
>
> Is that okay with everyone?

I won't object, but is it really worth the trouble?

Can you do microsecond precision sleeps, and on what platforms? How much
overhead is there? IIRC, on Linux the minimum time you can sleep depends
on CONFIG_HZ, and the default was 10 ms until recently.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] usleep feature for pgbench

2007-07-06 Thread Jan Wieck

On 7/6/2007 10:44 AM, Peter Eisentraut wrote:
> On Thursday, 5 July 2007 21:12, Jan Wieck wrote:
>> To test some changes in Slony I needed a
>>
>>      \usleep [microseconds|:variable]
>>
>> in pgbench's scripting language to be able to have hundreds of
>> concurrently running transactions without totally swamping the system.
>> I was wondering if anyone would object to permanently adding this to
>> the pgbench code?
>
> Or maybe a \sleep command that takes units, if it's not too much work.

You mean as a second, optional argument? Good idea.

   us = microseconds
   ms = milliseconds
   s  = seconds (default)

   \sleep {value|:variable} [us|ms|s]

Is that okay with everyone?
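If the syntax above is adopted, a custom pgbench script might use it like
this (a hypothetical sketch; the table, column, and variable names are made
up):

```sql
\set naccounts 100000
\set think_ms 500
\setrandom aid 1 :naccounts
BEGIN;
UPDATE accounts SET abalance = abalance + 1 WHERE aid = :aid;
\sleep :think_ms ms
SELECT abalance FROM accounts WHERE aid = :aid;
\sleep 250 us
COMMIT;
```

Holding a transaction open across the sleeps is what lets hundreds of
concurrent transactions coexist without swamping the system.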


Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] pg_autovacuum -> pg_class.reloptions?

2007-07-06 Thread Tom Lane
Alvaro Herrera <[EMAIL PROTECTED]> writes:
> A long time ago, Tom proposed moving the pg_autovacuum settings into
> reloptions.  I know it's really late in the devel cycle but I wonder if
> such a move would be acceptable at this time?

I think it's too late to be considering essentially-cosmetic changes
for 8.3, especially when you've not even started on a patch.


If we don't start saying "no" on a regular basis, we'll never get this
thing out the door.  We are already more than two months behind the
intended schedule, and I see absolutely nothing getting done on several
of the major patches.


regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Bgwriter strategies

2007-07-06 Thread Tom Lane
Greg Smith <[EMAIL PROTECTED]> writes:
> On Thu, 5 Jul 2007, Tom Lane wrote:
>> This would give us a safety margin such that buffers_to_clean is not 
>> less than the largest demand observed in the last 100 iterations...and 
>> it takes quite a while for the memory of a demand spike to be forgotten 
>> completely.

> If you tested this strategy even on a steady load, I'd expect you'll find 
> there are large spikes in allocations during the occasional period where 
> everything is just right to pull a bunch of buffers in, and if you let 
> that max linger around for 100 iterations you'll write a large number of 
> buffers more than you need.

You seem to have the same misunderstanding as Heikki.  What I was
proposing was not a target for how many to *write* on each cycle, but
a target for how far ahead of the clock sweep hand to look.  If say
the target is 100, we'll scan forward from the sweep until we have seen
100 clean zero-usage-count buffers; but we only have to write whichever
of them weren't already clean.

This is actually not so different from my previous proposal, in that the
idea is to keep ahead of the sweep by a particular distance.  The
previous idea was that that distance was "all the buffers", whereas this
idea is "a moving average of the actual demand rate".  The excess writes
created by the previous proposal were because of the probability of
re-dirtying buffers between cleaning and recycling.  We reduce that
probability by not trying to keep so many of 'em clean.  But I think
that we can meet the goal of having backends do hardly any of the writes
with a relatively small increase in the target distance, and thus a
relatively small differential in the number of wasted writes.  Heikki's
test showed that Itagaki-san's patch wasn't doing that well in
eliminating writes by backends, so we need a more aggressive target for
how many buffers to keep clean than it has; but I think not a huge
amount more, and thus my proposal.

BTW, somewhere upthread you suggested combining the target-distance
idea with the idea that the cleaning work uses a separate sweep hand and
thus doesn't re-examine the same buffers on every bgwriter iteration.
The problem is that it'd be very hard to track how far ahead of the
recycling sweep hand we are, because that number has to be measured
in usage-count-zero pages.  I see no good way to know how many of the
pages we scanned before have been touched (and given nonzero usage
counts) unless we rescan them.

We could approximate it maybe: try to keep the cleaning hand N total
buffers ahead of the recycling hand, where N is the target number of
clean usage-count-zero buffers scaled by the average fraction of
count-zero buffers (which we can track a moving average of as we advance
the recycling hand).  However I'm not sure the complexity and
uncertainty is worth it.  What I took away from Heikki's experiment is
that trying to stay a large distance in front of the recycle sweep
isn't actually so useful because you get too many wasted writes due
to re-dirtying.  So restructuring the algorithm to make it cheap
CPU-wise to stay well ahead is not so useful either.

> I ended up settling on max(moving average of the last 16,most recent 
> allocation), and that seemed to work pretty well without being too 
> wasteful from excessive writes.

I've been doing moving averages for years and years, and I find that the
multiplication approach works at least as well as explicitly storing the
last K observations.  It takes a lot less storage and arithmetic too.
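For the record, the "multiplication approach" here is presumably an
exponentially weighted moving average, along these lines (illustrative
only; alpha = 1/16 roughly corresponds to averaging the last 16
observations):

```python
def ema_update(avg, observation, alpha=1.0 / 16):
    # One multiply-add per observation; no buffer of the last K values.
    return avg + alpha * (observation - avg)

# Starting from 0, a single 100-buffer spike pulls the average up by 1/16:
assert ema_update(0.0, 100.0) == 6.25

# Over a long steady stream the average converges to the true mean:
avg = 0.0
for _ in range(200):
    avg = ema_update(avg, 20.0)
assert abs(avg - 20.0) < 0.01
```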

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


[HACKERS] pg_autovacuum -> pg_class.reloptions?

2007-07-06 Thread Alvaro Herrera
Hi,

A long time ago, Tom proposed moving the pg_autovacuum settings into
reloptions.  I know it's really late in the devel cycle but I wonder if
such a move would be acceptable at this time?  I feel it would be a good
move, for example per
http://thread.gmane.org/gmane.comp.db.postgresql.general/92643/
and the third item in
http://thread.gmane.org/gmane.comp.db.postgresql.bugs/14175

I admit I haven't even tried yet, so I would need to start working on a
patch.  This would have the advantage that

1. users would not need to come up with "disable" values for variables
they are not interested in touching, and

2. we could have more control in detecting invalid values

Opinions?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Bgwriter strategies

2007-07-06 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas <[EMAIL PROTECTED]> writes:

Tom Lane wrote:

buffers_to_clean = Max(buffers_used * 1.1,
buffers_to_clean * 0.999);


That would be overly aggressive on a workload that's steady on average, 
but consists of small bursts. Like this: 0 0 0 0 100 0 0 0 0 100 0 0 0 0 
100. You'd end up writing ~100 pages on every bgwriter round, but you 
only need an average of 20 pages per round.


No, you wouldn't be *writing* that many, you'd only be keeping that many
*clean*; which only costs more work if any of them get re-dirtied
between writing and use.  Which is a fairly small probability if we're
talking about a small difference in the number of buffers to keep clean.
So I think the average number of writes is hardly different, it's just
that the backends are far less likely to have to do any of them.


Ah, ok, I misunderstood what you were proposing. Yes, that seems like a 
good algorithm then.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Bgwriter strategies

2007-07-06 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> buffers_to_clean = Max(buffers_used * 1.1,
>> buffers_to_clean * 0.999);

> That would be overly aggressive on a workload that's steady on average, 
> but consists of small bursts. Like this: 0 0 0 0 100 0 0 0 0 100 0 0 0 0 
> 100. You'd end up writing ~100 pages on every bgwriter round, but you 
> only need an average of 20 pages per round.

No, you wouldn't be *writing* that many, you'd only be keeping that many
*clean*; which only costs more work if any of them get re-dirtied
between writing and use.  Which is a fairly small probability if we're
talking about a small difference in the number of buffers to keep clean.
So I think the average number of writes is hardly different, it's just
that the backends are far less likely to have to do any of them.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] usleep feature for pgbench

2007-07-06 Thread Peter Eisentraut
Am Donnerstag, 5. Juli 2007 21:12 schrieb Jan Wieck:
> To test some changes in Slony I needed a
>
>  \usleep [microseconds|:variable]
>
> in pgbench's scripting language to be able to have hundreds of
> concurrent running transactions without totally swamping the system. I
> was wondering if anyone would object to permanently adding this to the
> pgbench code?

Or maybe a \sleep command that takes units, if it's not too much work.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Bgwriter strategies

2007-07-06 Thread Heikki Linnakangas

Greg Smith wrote:

On Fri, 6 Jul 2007, Heikki Linnakangas wrote:

There's something wrong with that. The number of buffer allocations 
shouldn't depend on the bgwriter strategy at all.


I was seeing a smaller (closer to 5%) increase in buffer allocations 
switching from no background writer to using the stock one before I did 
any code tinkering, so it didn't strike me as odd.  I believe it's 
related to the TPS numbers.  When there are more transactions being 
executed per unit time, it's more likely the useful blocks will stay in 
memory because their usage_count is getting tickled faster, and 
therefore there's less of the most useful blocks being swapped out only 
to be re-allocated again later.


Did you run the test for a constant number of transactions? If you did, 
the access pattern and the number of allocations should be *exactly* the 
same with 1 client, assuming the initial state and the seed used for the 
random number generator is the same.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Bgwriter strategies

2007-07-06 Thread Greg Smith

On Fri, 6 Jul 2007, Heikki Linnakangas wrote:

There's something wrong with that. The number of buffer allocations shouldn't 
depend on the bgwriter strategy at all.


I was seeing a smaller (closer to 5%) increase in buffer allocations 
switching from no background writer to using the stock one before I did 
any code tinkering, so it didn't strike me as odd.  I believe it's related 
to the TPS numbers.  When there are more transactions being executed per 
unit time, it's more likely the useful blocks will stay in memory because 
their usage_count is getting tickled faster, and therefore there's less of 
the most useful blocks being swapped out only to be re-allocated again 
later.


Since the bad bgwriter tunings reduce TPS, I believe that's the mechanism 
by which there are more allocations needed.  I'll try to keep an eye on 
this now that you've brought it up.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [HACKERS] Bgwriter strategies

2007-07-06 Thread Heikki Linnakangas

Greg Smith wrote:
As you can see, I achieved the goal of almost never having a backend 
write its own buffer, so yay for that.  That's the only good thing I 
can say about it though.  The TPS results take a moderate dive, and 
there's about 10% more buffer allocations.  The big and obvious issue 
is that I'm writing almost 75% more buffers this way--way worse even 
than the 10% extra overhead Heikki was seeing.  But since I've gone out 
of my way to find a worst-case for this code, I consider the mission 
accomplished there.


There's something wrong with that. The number of buffer allocations 
shouldn't depend on the bgwriter strategy at all.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Bgwriter strategies

2007-07-06 Thread Greg Smith
I just got my own first set of useful tests of using the new "remember 
where you last scanned to" BGW implementation suggested by Tom.  What I 
did was keep the existing % to scan, but cut back the number to scan when 
so close to a complete lap ahead of the strategy point that I'd cross it 
if I scanned that much.  So when the system was idle, it would very 
quickly catch up with the strategy point, but if the %/max numbers were 
low it's possible for it to fall behind.
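
A sketch of that clamping logic, in Python for illustration (the names
and the modular-distance arithmetic are my assumptions, not the actual
patch):

```python
def buffers_to_scan(scan_pct, nbuffers, clean_hand, strategy_point):
    """Scan the usual percentage of the buffer pool per round, but never
    so far that the cleaning hand would cross the strategy (recycling)
    point after lapping it.  Positions are clock-sweep indexes in
    [0, nbuffers)."""
    want = int(nbuffers * scan_pct / 100.0)
    # room: buffers remaining before the cleaning hand would catch up
    # to a full lap ahead of the strategy point
    room = (strategy_point - clean_hand - 1) % nbuffers
    return min(want, room)
```

With 10% of 1000 buffers the hand normally advances 100 per round, but
if the strategy point is only 50 buffers ahead (mod a lap), the scan is
cut back so it stops just short of it.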


My workload was just the UPDATE statement out of pgbench with a database 
of scale 25 (~400MB, picked so most operations were in memory), which 
pushes lots of things in and out of the buffer cache as fast as possible.


Here's some data with no background writer at all:

clients tps buf_clean buf_backend buf_alloc
1       1340    0       72554   96846
2       1421    0       73969   88879
3       1418    0       71452   86339
4       1344    0       75184   90187
8       1361    0       73063   88099
15      1348    0       71861   86923

And here's what I got with the new approach, using 10% for the scan 
percentage and a maximum of 200 buffers written out.  I picked those 
numbers after some experimentation because they were the first I found 
where the background writer was almost always riding right behind the 
strategy point; with lower numbers, when the background writer woke up it 
often found it had already fallen behind the strategy point and had to 
start cleaning forward the old way instead, which wasn't what I wanted to 
test.


clients tps buf_clean buf_backend buf_alloc
1   1261122917  150 105655
2   1186126663  26  97586
3   1154127780  21  98077
4   1181127685  19  98068
8   1076128597  2   98229
15  1065128399  5   98143

As you can see, I achieved the goal of almost never having a backend write 
its own buffer, so yay for that.  That's the only good thing I can say 
about it though.  The TPS results take a moderate dive, and there's about 
10% more buffer allocations.  The big and obvious issue is that I'm 
writing almost 75% more buffers this way--way worse even than the 10% 
extra overhead Heikki was seeing.  But since I've gone out of my way to 
find a worst-case for this code, I consider the mission accomplished there.


Anyway, will have more detailed reports to post after I collect some more 
data; for now I just wanted to join Heikki in confirming that the strategy 
of trying to get the LRU cleaner to ride right behind the strategy point 
can really waste a whole lot of writes.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] Bgwriter strategies

2007-07-06 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas <[EMAIL PROTECTED]> writes:

imola-336   imola-337   imola-340
writes by checkpoint    38302   30410   39529
writes by bgwriter     350113 2205782 1418672
writes by backends    1834333  265755  787633
writes total          2222748 2501947 2245834
allocations           2683170 2657896 2699974


It looks like Tom's idea is not a winner; it leads to more writes than 
necessary.


The incremental number of writes is not that large; only about 10% more.
The interesting thing is that those "extra" writes must represent
buffers that were re-touched after their usage_count went to zero, but
before they could be recycled by the clock sweep.  While you'd certainly
expect some of that, I'm surprised it is as much as 10%.  Maybe we need
to play with the buffer allocation strategy some more.

The very small difference in NOTPM among the three runs says that either
this whole area is unimportant, or DBT2 isn't a good test case for it;
or maybe that there's something wrong with the patches?

On imola-340, there's still a significant amount of backend writes. I'm 
still not sure what we should be aiming at. Is 0 backend writes our goal?


Well, the lower the better, but not at the cost of a very large increase
in total writes.

Imola-340 was with a patch along the lines of 
Itagaki's original patch, ensuring that there's as many clean pages in 
front of the clock head as were consumed by backends since last bgwriter 
iteration.


This seems intuitively wrong, since in the presence of bursty request
behavior it'll constantly be getting caught short of buffers.  I think
you need a safety margin and a moving-average decay factor.  Possibly
something like

buffers_to_clean = Max(buffers_used * 1.1,
   buffers_to_clean * 0.999);

where buffers_used is the current observation of demand.  This would
give us a safety margin such that buffers_to_clean is not less than
the largest demand observed in the last 100 iterations (0.999 ^ 100
is about 0.90, cancelling out the initial 10% safety margin), and it
takes quite a while for the memory of a demand spike to be forgotten
completely.


That would be overly aggressive on a workload that's steady on average, 
but consists of small bursts. Like this: 0 0 0 0 100 0 0 0 0 100 0 0 0 0 
100. You'd end up writing ~100 pages on every bgwriter round, but you 
only need an average of 20 pages per round. That'd be effectively the 
same as keeping all buffers with usage_count=0 clean.


BTW, I believe that kind of workload is actually very common. That's 
what you get if one transaction causes say 10-100 buffer allocations, 
and you execute one such transaction every few seconds.
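
That objection is easy to check numerically; here is a Python sketch of
Tom's proposed update rule applied to exactly that bursty demand
pattern (illustration only, not PostgreSQL code):

```python
def next_target(target, used):
    # Tom's proposal: a 10% safety margin over current demand, with a
    # slow (0.1%-per-round) decay of the remembered peak.
    return max(used * 1.1, target * 0.999)

# Bursty but steady on average: mean demand is 20 pages per round.
demand = [0, 0, 0, 0, 100] * 40
target = 0.0
for used in demand:
    target = next_target(target, used)
```

The remembered target settles at about 110 pages per round even though
mean demand is only 20, which is the over-aggressiveness Heikki
describes.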


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] Bgwriter strategies

2007-07-06 Thread Heikki Linnakangas

Greg Smith wrote:

On Thu, 5 Jul 2007, Heikki Linnakangas wrote:

It looks like Tom's idea is not a winner; it leads to more writes than 
necessary.


What I came away with as the core of Tom's idea is that the cleaning/LRU 
writer shouldn't ever scan the same section of the buffer cache twice, 
because anything that resulted in a new dirty buffer will be unwritable 
by it until the clock sweep passes over it.  I never took that to mean 
that idea necessarily had to be implemented as "trying to aggressively 
keep all pages with usage_count=0 clean".


I've been making slow progress on this myself, and the question I've 
been trying to answer is whether this fundamental idea really matters or 
not. One clear benefit of that alternate implementation is that it should 
allow setting a lower value for the interval without being as concerned 
that you're wasting resources by doing so, which I've found to be a 
problem with the current implementation--it will consume a lot of CPU 
scanning the same section right now if you lower that too much.


Yes, in fact ignoring the CPU overhead of scanning the same section over 
and over again, Tom's proposal is the same as setting both 
bgwriter_lru_* settings all the way up to the max. In fact I ran a DBT-2 
test like that as well, and the # of writes was indeed the same, just 
with a much higher CPU usage. It's clear that scanning the same section 
over and over again has been a waste of time in previous releases.


As a further data point, I constructed a smaller test case that performs 
random DELETEs on a table using an index. I varied the # of 
shared_buffers, and ran the test with bgwriter disabled, or tuned all 
the way up to the maximum. Here's the results from that:


 shared_buffers | writes | writes |   writes_ratio
----------------+--------+--------+-------------------
 2560   |  86936 |  88023 |  1.01250345081439
 5120   |  81207 |  84551 |  1.04117871612053
 7680   |  75367 |  80603 |  1.06947337694216
 10240  |  69772 |  74533 |  1.06823654187926
 12800  |  64281 |  69237 |  1.07709898725907
 15360  |  58515 |  64735 |  1.10629753054772
 17920  |  53231 |  58635 |  1.10151979109917
 20480  |  48128 |  54403 |  1.13038148271277
 23040  |  43087 |  49949 |  1.15925917330053
 25600  |  39062 |  46477 |   1.1898264297783
 28160  |  35391 |  43739 |  1.23587917832217
 30720  |  32713 |  37480 |  1.14572188426619
 33280  |  31634 |  31677 |  1.00135929695897
 35840  |  31668 |  31717 |  1.00154730327144
 38400  |  31696 |  31693 | 0.999905350832913
 40960  |  31685 |  31730 |  1.00142023039293
 43520  |  31694 |  31650 | 0.998611724616647
 46080  |  31661 |  31650 | 0.999652569407157

The first writes-column is the # of writes with bgwriter disabled, 2nd 
column is with the aggressive bgwriter. The table size is roughly 33000 
pages, so once shared_buffers exceeds that the table fits in cache and 
the bgwriter strategy makes no difference.



As far as your results, first off I'm really glad to see someone else 
comparing checkpoint/backend/bgwriter writes the same I've been doing so 
I finally have someone else's results to compare against.  I expect that 
the optimal approach here is a hybrid one that structures scanning the 
buffer cache the new way Tom suggests, but limits the number of writes 
to "just enough".  I happen to be fond of the "just enough" computation 
based on a weighted moving average I wrote before, but there's certainly 
room for multiple implementations of that part of the code to evolve.


We need to get the requirements straight.

One goal of bgwriter is clearly to keep just enough buffers clean in 
front of the clock hand so that backends don't need to do writes 
themselves until the next bgwriter iteration. But not any more than 
that, otherwise we might end up doing more writes than necessary if some 
of the buffers are redirtied.


To deal with bursty workloads, for example a batch of 2 GB worth of 
inserts coming in every 10 minutes, it seems we want to keep doing a 
little bit of cleaning even when the system is idle, to prepare for the 
next burst. The idea is to smoothen the physical I/O bursts; if we don't 
clean the dirty buffers left over from the previous burst during the 
idle period, the I/O system will be bottlenecked during the bursts, and 
sit idle otherwise.


To strike a balance between cleaning buffers ahead of possible bursts in 
the future and not doing unnecessary I/O when no such bursts come, I 
think a reasonable strategy is to write buffers with usage_count=0 at a 
slow pace when there's no buffer allocations happening.


To smoothen the small variations on a relatively steady workload, the 
weighted average sounds good.
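
Taken together, those requirements suggest a per-round target along
these lines; a Python sketch with invented names and constants
(idle_trickle in particular is an arbitrary assumption):

```python
def bgwriter_target(recent_alloc, avg_alloc, idle_trickle=5):
    """Clean just enough buffers for the expected demand, smoothing
    small variations with a weighted average of past allocations, but
    keep trickling a few writes per round when the system is idle so
    dirty buffers left by a burst get cleaned before the next burst."""
    if recent_alloc == 0 and avg_alloc < 1:
        return idle_trickle
    return int(max(recent_alloc, avg_alloc))
```

An allocation spike raises the target immediately; during quiet rounds
the weighted average keeps it from collapsing to zero, and a truly idle
system still cleans a trickle of count-zero buffers.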





--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Bgwriter strategies

2007-07-06 Thread Greg Smith

On Thu, 5 Jul 2007, Tom Lane wrote:

This would give us a safety margin such that buffers_to_clean is not 
less than the largest demand observed in the last 100 iterations...and 
it takes quite a while for the memory of a demand spike to be forgotten 
completely.


If you tested this strategy even on a steady load, I'd expect you'll find 
there are large spikes in allocations during the occasional period where 
everything is just right to pull a bunch of buffers in, and if you let 
that max linger around for 100 iterations you'll write a large number of 
buffers more than you need.  That's what I saw when I tried to remember 
too much information about allocation history in the version of the auto 
LRU tuner I worked on.  For example, with 32000 buffers, with pgbench 
trying to UPDATE as fast as possible I sometimes hit >1500 allocations in 
an interval, but the steady-state allocation level was closer to 500.

I ended up settling on max(moving average of the last 16, most recent 
allocation), and that seemed to work pretty well without being too 
wasteful from excessive writes.  Playing with multiples of 2, 8 was 
definitely not enough memory to smooth usefully, while 32 seemed a little 
sluggish on the entry and wasteful on the exit ends.


At the default interval, 16 iterations is looking back at the previous 3.2 
seconds.  I have a feeling the proper tuning for this should be 
time-based, where you would decide how long ago to consider looking back 
for and compute the iterations based on that.
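
Deriving the window from a look-back time rather than a fixed iteration
count is a one-liner; a Python sketch, assuming the default 200 ms
bgwriter_delay:

```python
def smoothing_window(lookback_ms, bgwriter_delay_ms=200):
    """Number of bgwriter rounds covered by a desired look-back period.
    With the default 200 ms delay, a 3.2 s look-back gives the 16
    iterations mentioned above; halving the delay would double the
    window automatically."""
    return max(1, int(round(lookback_ms / bgwriter_delay_ms)))
```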


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] usleep feature for pgbench

2007-07-06 Thread Greg Smith

On Thu, 5 Jul 2007, Jan Wieck wrote:

Original pgbench reported 39, 37 and 33 TPS. Having my patch applied it 
reported 40, 38 and 33 TPS. Inserting a "\usleep 1" after the update to 
accounts of a default equivalent script changed those numbers to 40, 37 and 
33. I interpret that as "does not change observed performance".


Tell you what:  put your work into a patch, send it to the list, and I'll 
test that it doesn't degrade results for you.  If your pgbench results are 
in the <40 TPS range even with that low of a scale, you're not in a 
position to tell whether it has a negative performance impact.  That 
select statement you're fiddling with can turn into a bottleneck at high 
client loads, and from your description I can't tell if you've made that 
worse, but you'll never see it unless you're pushing, say, >1000 TPS and 
>50 clients.  Also:  3 pgbench results at one client load is quite a bit 
short of proving no impact on performance; I'll queue up 50 or so, which 
is where I start to trust results from that unruly tool.


This is actually a feature I'd be kind of interested to have, because it 
would allow you to pass two (or more) script files to pgbench and adjust 
the transaction mix.  What happens when you do that right now is that 
inevitably all the clients get blocked at once on whatever the hardest to 
execute transaction is, and the results are kind of deceptive as a result.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] tsearch2: language or encoding

2007-07-06 Thread Tom Lane
Tatsuo Ishii <[EMAIL PROTECTED]> writes:
> I'm wondering if a tsearch's configuration is bound to a language or
> an encoding. If it's bound to a language, there's a serious design
> problem, I would think. An encoding or charset is not necessarily
> bound to a single language. We can find such examples everywhere (I'm
> not talking about Unicode here). LATIN1 includes English and several
> European languages. EUC-JP includes English and Japanese, etc. And since
> we specify encoding as a property of char, not language, I would say the
> configuration should be bound to an encoding.

Surely not, because then what do you do with utf8, which (allegedly)
represents every language on earth?

As far as the word-stemming part goes, that is very clearly bound
to a language not an encoding.  There may be some other parts of
the code that really are better attached to an encoding --- Oleg,
Teodor, your thoughts?

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings