Re: [HACKERS] is_array_type vs type_is_array
On Thu, 7 Jun 2007, Tom Lane wrote: Jeremy Drake [EMAIL PROTECTED] writes: Was there some change-in-functionality reason for renaming is_array_type to type_is_array? Just to sync style with type_is_enum ... there were more of the latter than the former. OK, so it is safe to just #define one to the other, right? It broke compilation of fulldisjunctions. Sorry, but we change internal APIs every day, and twice on Sundays. Deal with it. This is why I build fulldisjunctions in my sandbox: when I decided to try it out one time, I could not get it to compile. After much effort getting it happy with HEAD and sending the changes back to the maintainer of fulldisjunctions, I thought a good thing for me to contribute would be to make sure it continues to compile cleanly against HEAD and to send patches when it breaks. I just wanted to make sure that the functionality of this function did not change in some way that I did not see before sending a patch to the maintainer of fulldisjunctions. "Deal with it" was not the response I was expecting, as that is exactly what I am trying to do ;) -- It is generally agreed that "Hello" is an appropriate greeting because if you entered a room and said "Goodbye", it could confuse a lot of people. -- Dolph Sharp, "I'm O.K., You're Not So Hot" ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
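The "#define one to the other" idea above can be sketched like this. The stand-in body for type_is_array is purely illustrative (the real backend function inspects pg_type); only the alias pattern itself reflects the rename being discussed:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch: stand-in for the backend's type_is_array(); the real function
 * looks the type up in pg_type. Placeholder logic only. */
static bool
type_is_array(unsigned int typid)
{
    return typid != 0;          /* illustrative, not real catalog logic */
}

/* Keep the old spelling as an alias so third-party code written against
 * the previous API (is_array_type) still compiles against newer headers. */
#define is_array_type(typid) type_is_array(typid)
```

A module like fulldisjunctions could carry such a shim in its own compatibility header, guarded by a version check.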
Re: [HACKERS] Controlling Load Distributed Checkpoints
Greg Smith wrote: On Wed, 6 Jun 2007, Heikki Linnakangas wrote: The original patch uses bgwriter_all_max_pages to set the minimum rate. I think we should have a separate variable, checkpoint_write_min_rate, in KB/s, instead. Completely agreed. There shouldn't be any coupling with the background writer parameters, which may be set for a completely different set of priorities than the checkpoint has. I have to look at this code again to see why it's a min_rate instead of a max, that seems a little weird. It's a min rate because it never writes slower than that, and it can write faster if the next checkpoint is due so soon that we wouldn't otherwise finish before it's time to start the next one. (Or to be precise, before the next checkpoint is closer than 100-(checkpoint_write_percent)% of the checkpoint interval.) Nap phase: We should therefore give the delay as a number of seconds instead of as a percentage of checkpoint interval. Again, the setting here should be completely decoupled from another GUC like the interval. My main complaint with the original form of this patch was how much it tried to synchronize the process with the interval; since I don't even have a system where that value is set to something, because it's all segment based instead, that whole idea was incompatible. checkpoint_segments is taken into account as well as checkpoint_timeout. I used the term "checkpoint interval" to mean the real interval at which the checkpoints occur, whether it's because of segments or timeout. The original patch tried to spread the load out as evenly as possible over the time available. I much prefer thinking in terms of getting it done as quickly as possible while trying to bound the I/O storm. Yeah, the checkpoint_min_rate allows you to do that. So there's two extreme ways you can use LDC: 1. Finish the checkpoint as soon as possible, without disturbing other activity too much. Set checkpoint_write_percent to a high number, and set checkpoint_min_rate to define "too much". 2. 
Disturb other activity as little as possible, as long as the checkpoint finishes in a reasonable time. Set checkpoint_min_rate to a low number, and checkpoint_write_percent to define "a reasonable time". Are both interesting use cases, or is it enough to cater for just one of them? I think 2 is easier to tune. Defining the min_rate properly can be difficult and depends a lot on your hardware and application, but a default value of say 50% for checkpoint_write_percent to tune for use case 2 should work pretty well for most people. In any case, the checkpoint had better finish before it's time to start another one. Or would you rather delay the next checkpoint, and let the checkpoint take as long as it takes to finish at the min_rate? And we don't know how much work an fsync performs. The patch uses the file size as a measure of that, but as we discussed that doesn't necessarily have anything to do with reality. fsyncing a 1GB file with one dirty block isn't any more expensive than fsyncing a file with a single block. On top of that, if you have a system with a write cache, the time an fsync takes can greatly depend on how full it is at the time, which there is no way to measure or even model easily. Is there any way to track how many dirty blocks went into each file during the checkpoint write? That's your best bet for guessing how long the fsync will take. I suppose it's possible, but the OS has hopefully started flushing them to disk almost as soon as we started the writes, so even that isn't a very good measure. On a Linux system, one way to model it is that the OS flushes dirty buffers to disk at the same rate as we write them, but delayed by dirty_expire_centisecs. That should hold if the writes are spread out enough. Then the amount of dirty buffers in the OS cache at the end of the write phase is roughly constant, as long as the write phase lasts longer than dirty_expire_centisecs. 
If we take a nap of dirty_expire_centisecs after the write phase, the fsyncs should be effectively no-ops, except that they will flush any other writes the bgwriter lru-sweep and other backends performed during the nap. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
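The min-rate behavior described in this exchange can be sketched as follows. The function name and parameters are illustrative, not the actual patch code; the point is just that the effective rate is the larger of the configured floor and the rate needed to finish within the write_percent share of the checkpoint interval:

```c
#include <assert.h>

/* Sketch of the LDC write-rate rule: never slower than min_rate_kbps,
 * but faster if that's what it takes to finish writing dirty_kb within
 * write_percent % of the checkpoint interval. Illustrative names only. */
static double
checkpoint_write_rate(double dirty_kb, double interval_secs,
                      double write_percent, double min_rate_kbps)
{
    double budget_secs = interval_secs * (write_percent / 100.0);
    double needed_kbps = dirty_kb / budget_secs;  /* rate to finish in time */

    return needed_kbps > min_rate_kbps ? needed_kbps : min_rate_kbps;
}
```

With a 300 s interval, write_percent = 50, and min rate 1000 KB/s, a 100 MB backlog writes at the 1000 KB/s floor, while a 1 GB backlog forces the rate up so the write phase still ends on time.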
Re: [HACKERS] Controlling Load Distributed Checkpoints
Ühel kenal päeval, K, 2007-06-06 kell 11:03, kirjutas Tom Lane: Heikki Linnakangas [EMAIL PROTECTED] writes: GUC summary and suggested default values:

checkpoint_write_percent = 50     # % of checkpoint interval to spread out writes
checkpoint_write_min_rate = 1000  # minimum I/O rate to write dirty buffers at checkpoint (KB/s)
checkpoint_nap_duration = 2       # delay between write and sync phase, in seconds
checkpoint_fsync_period = 30      # duration of the sync phase, in seconds
checkpoint_fsync_delay = 500      # max. delay between fsyncs

I don't like adding that many GUC variables, but I don't really see a way to tune them automatically. If we don't know how to tune them, how will the users know? He talked about doing it _automatically_. If the knobs are available, it will be possible to determine good values even by brute-force performance testing, given that enough time and manpower are available. Having to add that many variables to control one feature says to me that we don't understand the feature. The feature has lots of complex dependencies on things outside postgres, so learning to understand it takes time. Having the knobs available helps, as more people are willing to do turn-the-knobs-and-test vs. recompile-and-test. Perhaps what we need is to think about how it can auto-tune itself. Sure. --- Hannu Krosing ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] [RFC] GSoC Work on readonly queries done so far
Florian G. Pflug wrote: Jeff Davis wrote: Are you referring to the size of the xip array being a problem? Would it help to tie the size of the xip array to max_connections? I understand that max_connections might be greater on the master, but maybe something similar? That's what I currently do - the xip array on the slave is sized to hold max_connections entries (actually, it's max_connections + max_prepared_xacts, I think). The problem occurs exactly if those values are set too small on the slave - and since shared mem objects are not resizeable, I don't see how the slave can handle an xip overflow gracefully other than by not publishing the information in shared memory as long as it doesn't fit there. You could store the value of max_connections in the checkpoint xlog record, and read it from there in the slave. Though one could still change it on the master and restart without restarting the slave as well. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
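The sizing rule and the overflow fallback described above can be sketched like this. The struct and function names are hypothetical, not the GSoC patch's actual code; the only claims taken from the thread are the capacity formula and the "stop publishing on overflow" behavior:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the slave-side xip bookkeeping discussed above. */
typedef struct
{
    int  capacity;   /* max_connections + max_prepared_xacts */
    int  used;
    bool published;  /* false once the fixed-size array has overflowed */
} XipArray;

static XipArray
xip_init(int max_connections, int max_prepared_xacts)
{
    XipArray a;

    a.capacity = max_connections + max_prepared_xacts;
    a.used = 0;
    a.published = true;
    return a;
}

static void
xip_add(XipArray *a)
{
    if (a->used >= a->capacity)
        a->published = false;   /* shared memory can't grow: unpublish */
    else
        a->used++;
}
```

This also shows why recording the master's max_connections in the checkpoint record helps: it lets the slave pick a large-enough capacity up front instead of discovering the overflow at runtime.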
Re: [HACKERS] How do we create the releases?
On Wed, Jun 06, 2007 at 07:56:25PM -0300, Marc G. Fournier wrote: --On Tuesday, June 05, 2007 10:28:58 +0300 Devrim GÜNDÜZ [EMAIL PROTECTED] wrote: Hi Marc, Is there a written procedure about creating tarballs? I'd like to start working on 8.3 RPMs and I want to know what I should do to create a tarball. Just a script ... relevant bits: snip Is this script in CVS somewhere? I know it's not in the main backend repo. But we should keep it either there or somewhere else - perhaps the pmt repository? //Magnus ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] How do we create the releases?
Magnus Hagander wrote: Is this script in CVS somewhere? I know it's not in the main backend repo. Most of what these scripts have done in the past has been systematized and folded into the "make dist" target in the meantime, and AFAICT we could integrate the documentation preparation as well, now that it works automatically in a reliable fashion. That would reduce the answer to the original question to something like:

./configure
make dist[check]

-- Peter Eisentraut http://developer.postgresql.org/~petere/ ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] How do we create the releases?
On Thu, Jun 07, 2007 at 01:12:06PM +0200, Peter Eisentraut wrote: Magnus Hagander wrote: Is this script in CVS somewhere? I know it's not in the main backend repo. Most of what these scripts have done in the past has been systematized and folded into the make dist target in the mean time, and AFAICT, we could integrate the documentation preparation as well, now that it works automatically in a reliable fashion. That would reduce the answer to the original question to something like ./configure make dist[check] If we can do that, that makes it even better. But is it something we can reasonably backpatch? Otherwise we still need both documentation and scripts to deal with back branches. //Magnus ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] How do we create the releases?
Hi, On Thu, 2007-06-07 at 13:12 +0200, Peter Eisentraut wrote: make dist[check] AFAIK make dist does not prepare postgres.tar.gz. Regards, -- Devrim GÜNDÜZ PostgreSQL Replication, Consulting, Custom Development, 24x7 support Managed Services, Shared and Dedicated Hosting Co-Authors: plPHP, ODBCng - http://www.commandprompt.com/
Re: [HACKERS] Controlling Load Distributed Checkpoints
Thinking about this whole idea a bit more, it occurred to me that the current approach of "write all, then fsync all" is really a historical artifact of the fact that we used to use the system-wide sync call instead of fsyncs to flush the pages to disk. That might not be the best way to do things in the new load-distributed-checkpoint world. How about interleaving the writes with the fsyncs?

1. Scan all shared buffers, and build a list of all files with dirty pages, and the buffers belonging to them

2. foreach(file in list)
   {
       foreach(buffer belonging to file)
       {
           write();
           sleep();   /* to throttle the I/O rate */
       }
       sleep();       /* to give the OS a chance to flush the writes at its own pace */
       fsync();
   }

This would spread out the fsyncs in a natural way, making the knob to control the duration of the sync phase unnecessary. At some point we'll also need to fsync all files that have been modified since the last checkpoint, but don't have any dirty buffers in the buffer cache. I think it's a reasonable assumption that fsyncing those files doesn't generate a lot of I/O. Since the writes were made some time ago, the OS has likely already flushed them to disk. Doing the 1st phase of just scanning the buffers to see which ones are dirty also effectively implements the optimization of not writing buffers that were dirtied after the checkpoint start. And grouping the writes per file gives the OS a better chance to group the physical writes. One problem is that currently the segmentation of relations into 1GB files is handled at a low level inside md.c, and we don't really have any visibility into that in the buffer manager. ISTM that some changes to the smgr interfaces would be needed for this to work well, though just doing it on a relation-per-relation basis would also be better than the current approach. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
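The per-file interleaving proposed above can be sketched as a self-contained C model. do_write/do_sleep/do_fsync are stand-ins for the real smgr/md.c calls (this is a shape-of-the-algorithm sketch, not backend code); counters make the ordering testable:

```c
#include <assert.h>

/* One entry per file that has dirty pages after the scan phase. */
typedef struct
{
    int nbuffers;            /* dirty buffers belonging to this file */
} DirtyFile;

static int writes_issued;
static int fsyncs_issued;

static void do_write(void)  { writes_issued++; }   /* stand-in for smgrwrite */
static void do_fsync(void)  { fsyncs_issued++; }   /* stand-in for smgrsync  */
static void do_sleep(void)  { /* throttling delay; elided in the sketch */ }

/* Write all of a file's dirty buffers, then fsync that file, then move on. */
static void
checkpoint_interleaved(const DirtyFile *files, int nfiles)
{
    for (int f = 0; f < nfiles; f++)
    {
        for (int b = 0; b < files[f].nbuffers; b++)
        {
            do_write();
            do_sleep();      /* throttle the I/O rate */
        }
        do_sleep();          /* let the OS flush at its own pace */
        do_fsync();          /* one fsync per file, right after its writes */
    }
}
```

Running this over two files with 3 and 2 dirty buffers issues 5 writes and exactly one fsync per file, which is the "no extra fsyncs, different ordering" point made later in the thread.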
Re: [HACKERS] Controlling Load Distributed Checkpoints
Heikki Linnakangas [EMAIL PROTECTED] writes: Thinking about this whole idea a bit more, it occured to me that the current approach to write all, then fsync all is really a historical artifact of the fact that we used to use the system-wide sync call instead of fsyncs to flush the pages to disk. That might not be the best way to do things in the new load-distributed-checkpoint world. How about interleaving the writes with the fsyncs? I don't think it's a historical artifact at all: it's a valid reflection of the fact that we don't know enough about disk layout to do low-level I/O scheduling. Issuing more fsyncs than necessary will do little except guarantee a less-than-optimal scheduling of the writes. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] How do we create the releases?
--On Thursday, June 07, 2007 14:22:22 +0300 Devrim GÜNDÜZ [EMAIL PROTECTED] wrote: Hi, On Thu, 2007-06-07 at 13:12 +0200, Peter Eisentraut wrote: make dist[check] AFAIK make dist does not prepare postgres.tar.gz. right, as peter originally said, 'make dist' *could* incorporate the steps that right now lead up to 'make dist' (ie. build the docs) ... - Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email . [EMAIL PROTECTED] MSN . [EMAIL PROTECTED] Yahoo . yscrappy Skype: hub.org ICQ . 7615664 ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
[HACKERS] little PITR annoyance
Hi all, I've been bitten twice by this one: I enabled archiving on both my pgsql servers. The archive command was a simple gzip to an nfs-mounted dir. Then the nfs server crashed. Both WAL spaces filled up until postgresql shut itself down because of no more space for WAL. That's perfectly normal and expected. What I didn't expect and don't understand is that postgresql refused to start up after the nfs server was up and running again until I added some more space on the WAL fs, although if it had started archiving, space would have been there. I wonder if archiving could start before postgresql, at least to make a little room for the database engine to really start. Hope I'm clear... Thank you for your attention. Best Regards -- Olivier PRENANT Tel: +33-5-61-50-97-00 (Work) 15, Chemin des Monges +33-5-61-50-97-01 (Fax) 31190 AUTERIVE +33-6-07-63-80-64 (GSM) FRANCE Email: [EMAIL PROTECTED] -- Make your life a dream, make your dream a reality. (St Exupery) ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Vacuuming anything zeroes shared table stats
Michael Fuhr wrote: Is vacuuming any table supposed to zero the statistics for all shared tables? Doesn't that have implications for autovacuum? The example below is in 8.2.4 but I'm seeing similar behavior in 8.1.9 and 8.3devel. The problem is that the database hash is cleared of databases that no longer exist, and the database list is constructed by scanning pg_database. Since no entry exists for the database we use for shared tables (InvalidOid), its hash table entry is dropped. The attached patch fixes this. Additionally, in 8.3devel doing anything that queries or modifies a shared table seems to zero the statistics for all shared tables. I'm not sure if this is fixed by the patch; can you verify, or provide a more specific description of the problem? -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Index: src/backend/postmaster/pgstat.c
===
RCS file: /home/alvherre/Code/cvs/pgsql/src/backend/postmaster/pgstat.c,v
retrieving revision 1.158
diff -c -p -r1.158 pgstat.c
*** src/backend/postmaster/pgstat.c	27 May 2007 17:28:35 -	1.158
--- src/backend/postmaster/pgstat.c	7 Jun 2007 15:25:30 -
*** pgstat_vacuum_tabstat(void)
*** 897,902 ****
--- 897,906 ----
   * Collect the OIDs of either all databases or all tables, according to
   * the parameter, into a temporary hash table.  Caller should hash_destroy
   * the result when done with it.
+  *
+  * NB -- this function adds an entry with InvalidOid if asked for a list of
+  * databases.  This is because we use a pseudo-database with that OID to
+  * keep stats for shared tables.
   * --
   */
  static HTAB *
*** pgstat_collect_oids(Oid catalogid)
*** 930,935 ****
--- 934,950 ----
  	heap_endscan(scan);
  	heap_close(rel, AccessShareLock);
  
+ 	/*
+ 	 * If we were asked for databases, add an entry for the pseudo-database
+ 	 * with InvalidOid, where we store shared tables.
+ 	 */
+ 	if (catalogid == DatabaseRelationId)
+ 	{
+ 		Oid		invalid = InvalidOid;
+ 
+ 		hash_search(htab, (void *) &invalid, HASH_ENTER, NULL);
+ 	}
+ 
  	return htab;
  }

---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Vacuuming anything zeroes shared table stats
Alvaro Herrera [EMAIL PROTECTED] writes: The problem is that the database hash is cleared of databases that no longer exist, and the database list is constructed by scanning pg_database. Since no entry exists for the database we use for shared tables (InvalidOid), its hash table entry is dropped. Doh ... The attached patch fixes this. Wouldn't it be easier to just special-case the shared DB in pgstat_vacuum_tabstat?

  	while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
  	{
  		Oid		dbid = dbentry->databaseid;
  
  		CHECK_FOR_INTERRUPTS();
  
- 		if (hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
+ 		/* ignore the DB entry for shared tables ... they never go away */
+ 		if (OidIsValid(dbid) &&
+ 			hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
  			pgstat_drop_database(dbid);
  	}

Additionally, in 8.3devel doing anything that queries or modifies a shared table seems to zero the statistics for all shared tables. I'm not sure if this is fixed by the patch; can you verify, or provide a more specific description of the problem? Seems unlikely that this bug would explain a behavior like that. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] [COMMITTERS] pgsql: Create a GUC parametertemp_tablespacesthat allows selection of
Jaime Casanova [EMAIL PROTECTED] writes: On 6/4/07, Tom Lane [EMAIL PROTECTED] wrote: Perhaps a reasonable compromise could work like this: at the first point in a transaction where a temp file is created, choose a random list element, and thereafter advance cyclically for the duration of that transaction. ok. are you doing this? or can i prepare a patch that implements this? i guess we can allocate the memory for the list in TopTransactionContext. I'll work on it ... I want to rejigger the API between fd.c and tablespace.c anyway. (fd.c still shouldn't be calling tablespace.c ...) regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
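The compromise Tom describes (random starting element at a transaction's first temp file, then cyclic advance for the rest of the transaction) can be sketched as follows. The names are illustrative, not the eventual fd.c/tablespace.c code:

```c
#include <assert.h>
#include <stdlib.h>

/* -1 means "no temp tablespace chosen yet in this transaction". */
static int next_temp_tablespace = -1;

/* Pick a random list element on first use in a transaction, then
 * advance cyclically through the list for the transaction's duration. */
static int
choose_temp_tablespace(int ntablespaces)
{
    int result;

    if (next_temp_tablespace < 0)               /* first temp file this xact */
        next_temp_tablespace = rand() % ntablespaces;

    result = next_temp_tablespace;
    next_temp_tablespace = (next_temp_tablespace + 1) % ntablespaces;
    return result;
}

/* At commit or abort, forget the cursor so the next transaction
 * starts from a fresh random element. */
static void
at_transaction_end(void)
{
    next_temp_tablespace = -1;
}
```

This spreads concurrent transactions across the list (via the random start) while keeping one transaction's temp files distributed over all tablespaces (via the cyclic advance).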
Re: [HACKERS] Autovacuum launcher doesn't notice death of postmaster immediately
On Mon, Jun 04, 2007 at 11:04:26AM -0400, Alvaro Herrera wrote: The launcher is set up to wake up in autovacuum_naptime seconds at most. So if the user configures a ridiculous time (for example 86400 seconds, which I've seen) then the launcher would not detect the postmaster death. Yeah, I've seen people set that up with the intention of "now autovacuum will only run during our slow time!". I'm thinking it'd be worth mentioning in the docs that this won't work, and instead suggesting that they run vacuumdb -a or equivalent at that time instead. Thoughts? -- Jim Nasby [EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Re: [HACKERS] [PATCHES] [BUGS] BUG #3326: Invalid lower bound of autovacuum_cost_limit
Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: But this is misleading (started postmaster with good value, then edited postgresql.conf and entered -2): 17903 LOG: received SIGHUP, reloading configuration files 17903 LOG: -2 is outside the valid range for parameter autovacuum_vacuum_cost_limit (-1 .. 1000) Note how it still says the range is -1 .. 1000. Can we redefine things to make zero be the disabled value, thus keeping the range of valid values contiguous? That would be another solution ... though it would be different from the valid value for autovacuum_vacuum_cost_delay (on which 0 is a valid value). Also it would be a different value from previous versions. I don't think either of these is a showstopper, so let's go for that if nobody objects. -- Alvaro Herrera Developer, http://www.PostgreSQL.org/ inflex really, I see PHP as like a strange amalgamation of C, Perl, Shell crab inflex: you know that amalgam means mixture with mercury, more or less, right? crab i.e., deadly poison ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] [PATCHES] [BUGS] BUG #3326: Invalid lower bound of autovacuum_cost_limit
Alvaro Herrera wrote: Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: But this is misleading (started postmaster with good value, then edited postgresql.conf and entered -2): 17903 LOG: received SIGHUP, reloading configuration files 17903 LOG: -2 is outside the valid range for parameter autovacuum_vacuum_cost_limit (-1 .. 1000) Note how it still says the range is -1 .. 1000. Can we redefine things to make zero be the disabled value, thus keeping the range of valid values contiguous? That would be another solution ... though it would be different from the valid value for autovacuum_vacuum_cost_delay (on which 0 is a valid value). Also it would be a different value from previous versions. I don't think either of these is a showstopper, so let's go for that if nobody objects. Can you make 0 and -1 both valid disabled values? That way it will be compatible with previous releases. ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
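The "accept both 0 and -1 as disabled" idea can be sketched like this. The function and parameter names are hypothetical, not the actual GUC code; per the autovacuum docs semantics, "disabled" means the per-autovacuum setting falls back to the global vacuum_cost_limit:

```c
#include <assert.h>

/* Sketch: resolve the effective cost limit, treating both 0 (new style)
 * and -1 (pre-existing style) as "disabled", i.e. inherit the global
 * vacuum_cost_limit. Illustrative names only. */
static int
effective_cost_limit(int autovacuum_value, int global_value)
{
    if (autovacuum_value == 0 || autovacuum_value == -1)
        return global_value;     /* disabled: fall back to global setting */
    return autovacuum_value;
}
```

Making both spellings valid keeps old postgresql.conf files working while letting the documented range stay contiguous at 0 .. 1000 (with -1 as a grandfathered alias).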
Re: [HACKERS] Controlling Load Distributed Checkpoints
Tom Lane wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: Thinking about this whole idea a bit more, it occured to me that the current approach to write all, then fsync all is really a historical artifact of the fact that we used to use the system-wide sync call instead of fsyncs to flush the pages to disk. That might not be the best way to do things in the new load-distributed-checkpoint world. How about interleaving the writes with the fsyncs? I don't think it's a historical artifact at all: it's a valid reflection of the fact that we don't know enough about disk layout to do low-level I/O scheduling. Issuing more fsyncs than necessary will do little except guarantee a less-than-optimal scheduling of the writes. I'm not proposing to issue any more fsyncs. I'm proposing to change the ordering so that instead of first writing all dirty buffers and then fsyncing all files, we'd write all buffers belonging to a file, fsync that file only, then write all buffers belonging to next file, fsync, and so forth. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Controlling Load Distributed Checkpoints
Heikki Linnakangas [EMAIL PROTECTED] writes: Tom Lane wrote: I don't think it's a historical artifact at all: it's a valid reflection of the fact that we don't know enough about disk layout to do low-level I/O scheduling. Issuing more fsyncs than necessary will do little except guarantee a less-than-optimal scheduling of the writes. I'm not proposing to issue any more fsyncs. I'm proposing to change the ordering so that instead of first writing all dirty buffers and then fsyncing all files, we'd write all buffers belonging to a file, fsync that file only, then write all buffers belonging to next file, fsync, and so forth. But that means that the I/O to different files cannot be overlapped by the kernel, even if it would be more efficient to do so. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Controlling Load Distributed Checkpoints
Tom Lane wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: Tom Lane wrote: I don't think it's a historical artifact at all: it's a valid reflection of the fact that we don't know enough about disk layout to do low-level I/O scheduling. Issuing more fsyncs than necessary will do little except guarantee a less-than-optimal scheduling of the writes. I'm not proposing to issue any more fsyncs. I'm proposing to change the ordering so that instead of first writing all dirty buffers and then fsyncing all files, we'd write all buffers belonging to a file, fsync that file only, then write all buffers belonging to the next file, fsync, and so forth. But that means that the I/O to different files cannot be overlapped by the kernel, even if it would be more efficient to do so. True. On the other hand, if we issue writes in essentially random order, we might fill the kernel buffers with random blocks and the kernel needs to flush them to disk as almost random I/O. If we did the writes in groups, the kernel has a better chance at coalescing them. I tend to agree that if the goal is to finish the checkpoint as quickly as possible, the current approach is better. In the context of load distributed checkpoints, however, it's unlikely the kernel can do any significant overlapping since we're trickling the writes anyway. Do we need both strategies? I'm starting to feel we should give up on smoothing the fsyncs and distribute the writes only, for 8.3. As we get more experience with that and its shortcomings, we can enhance our checkpoints further in 8.4. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Vacuuming anything zeroes shared table stats
On Thu, Jun 07, 2007 at 11:41:56AM -0400, Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: The attached patch fixes this. Wouldn't it be easier to just special-case the shared DB in pgstat_vacuum_tabstat? Thanks; I'll test these patches when I get a chance. Additionally, in 8.3devel doing anything that queries or modifies a shared table seems to zero the statistics for all shared tables. I'm not sure if this is fixed by the patch; can you verify, or provide a more specific description of the problem? Seems unlikely that this bug would explain a behavior like that. Further investigation shows that what really seems to be happening in 8.3devel is that the statistics for shared tables are reset every 15 seconds when autovacuum is enabled, which it is by default. I don't observe this phenomenon when autovacuum is disabled. Here's a test case:

select * from pg_pltemplate;
select seq_scan, idx_scan, now() from pg_stat_all_tables where relname in ('pg_pltemplate');

Repeat the second select until the statistics for pg_pltemplate become zero, which should be within 15 seconds. Repeating the experiment should reveal a 15-second cycle of statistics resets. In case this behavior is platform-dependent: I'm testing on Solaris 9 sparc. -- Michael Fuhr ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Vacuuming anything zeroes shared table stats
Michael Fuhr [EMAIL PROTECTED] writes: Further investigation shows that what really seems to be happening in 8.3devel is that the statistics for shared tables are reset every 15 seconds when autovacuum is enabled, which it is by default. I don't observe this phenomenon when autovacuum is disabled. OK, so it's just that pgstat_vacuum_tabstat() gets run by autovacuum. So either form of Alvaro's patch should fix it. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Controlling Load Distributed Checkpoints
On Thu, 7 Jun 2007, Heikki Linnakangas wrote: So there's two extreme ways you can use LDC: 1. Finish the checkpoint as soon as possible, without disturbing other activity too much 2. Disturb other activity as little as possible, as long as the checkpoint finishes in a reasonable time. Are both interesting use cases, or is it enough to cater for just one of them? I think 2 is easier to tune. The motivation for the (1) case is that you've got a system that's dirtying the buffer cache very fast in normal use, where even the background writer is hard pressed to keep the buffer pool clean. The checkpoint is the most powerful and efficient way to clean up many dirty buffers out of such a buffer cache in a short period of time so that you're back to having room to work in again. In that situation, since there are many buffers to write out, you'll also be suffering greatly from fsync pauses. Being able to synchronize writes a little better with the underlying OS to smooth those out is a huge help. I'm completely biased because of the workloads I've been dealing with recently, but I consider (2) so much easier to tune for that it's barely worth worrying about. If your system is so underloaded that you can let the checkpoints take their own sweet time, I'd ask if you have enough going on that you're suffering very much from checkpoint performance issues anyway. I'm used to being in a situation where if you don't push out checkpoint data as fast as physically possible, you end up fighting with the client backends for write bandwidth once the LRU point moves past where the checkpoint has written out to already. I'm not sure how much always running the LRU background writer will improve that situation. On a Linux system, one way to model it is that the OS flushes dirty buffers to disk at the same rate as we write them, but delayed by dirty_expire_centisecs. That should hold if the writes are spread out enough. If they're really spread out, sure. 
There is congestion avoidance code inside the Linux kernel that makes dirty_expire_centisecs not quite work the way it is described under load. All you can say in the general case is that when dirty_expire_centisecs has passed, the kernel badly wants to write the buffers out as quickly as possible; that could still be many seconds after the expiration time on a busy system, or on one with slow I/O. On every system I've ever played with Postgres write performance on, I discovered that the memory-based parameters like dirty_background_ratio were really driving write behavior, and I almost ignore the expire timeout now. Plotting the Dirty: value in /proc/meminfo as you're running tests is extremely informative for figuring out what Linux is really doing underneath the database writes. The influence of the congestion code is why I made the comment about watching how long writes are taking to gauge how fast you can dump data onto the disks. When you're suffering from one of the congestion mechanisms, the initial writes start blocking, even before the fsync. That behavior is almost undocumented outside of the relevant kernel source code. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
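The suggestion above — plot the Dirty: value from /proc/meminfo while a test runs — needs only a trivial parser. This sketch extracts the figure from one meminfo line; the parse function is shown standalone so it runs anywhere, with the Linux-only file-reading loop left as a comment (the sample values are made up for illustration):

```c
#include <stdio.h>

/* Parse one /proc/meminfo line; return the kB figure if it is the
 * "Dirty:" line, or -1 otherwise. */
static long
parse_dirty_kb(const char *line)
{
    long kb;

    if (sscanf(line, "Dirty: %ld kB", &kb) == 1)
        return kb;
    return -1;
}

/*
 * On Linux, sample it once a second while the benchmark runs:
 *
 *   FILE *f = fopen("/proc/meminfo", "r");
 *   char line[128];
 *   while (fgets(line, sizeof(line), f))
 *       if (parse_dirty_kb(line) >= 0)
 *           printf("dirty: %ld kB\n", parse_dirty_kb(line));
 *   fclose(f);
 */
```

Graphing those samples against checkpoint start/end times shows whether dirty_background_ratio or dirty_expire_centisecs is actually driving the flushing, which is the diagnosis Greg describes.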
Re: [HACKERS] Vacuuming anything zeroes shared table stats
Tom Lane wrote: Michael Fuhr [EMAIL PROTECTED] writes: Further investigation shows that what really seems to be happening in 8.3devel is that the statistics for shared tables are reset every 15 seconds when autovacuum is enabled, which it is by default. I don't observe this phenomenon when autovacuum is disabled. OK, so it's just that pgstat_vacuum_tabstat() gets run by autovacuum. So either form of Alvaro's patch should fix it. Right. Committed. Thanks for the report! -- Alvaro Herrera Developer, http://www.PostgreSQL.org/ El miedo atento y previsor es la madre de la seguridad (E. Burke) ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] [PATCHES] [BUGS] BUG #3326: Invalid lower bound of autovacuum_cost_limit
Matthew T. O'Connor wrote: Alvaro Herrera wrote: Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: But this is misleading (started postmaster with good value, then edited postgresql.conf and entered -2): 17903 LOG: received SIGHUP, reloading configuration files 17903 LOG: -2 is outside the valid range for parameter autovacuum_vacuum_cost_limit (-1 .. 1000) Note how it still says the range is -1 .. 1000. Can we redefine things to make zero be the disabled value, thus keeping the range of valid values contiguous? That would be another solution ... though it would be different from the valid value for autovacuum_vacuum_cost_delay (on which 0 is a valid value). Also it would be a different value from previous versions. I don't think either of these is a showstopper, so let's go for that if nobody objects. Can you make 0 and -1 both valid disabled values? That way it will be compatible with previous releases. Heh, sure, we can do that too and it doesn't seem like anybody would object. I will patch the documentation so that the disabled value is zero, and still allow -1. That way it doesn't seem like there should be any objection. -- Alvaro Herrera http://www.flickr.com/photos/alvherre/ Listen and you will forget; see and you will remember; do and you will understand (Confucius) ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Autovacuum launcher doesn't notice death of postmaster immediately
On 6/7/07, Jim C. Nasby [EMAIL PROTECTED] wrote: On Mon, Jun 04, 2007 at 11:04:26AM -0400, Alvaro Herrera wrote: The launcher is set up to wake up in autovacuum_naptime seconds at most. So if the user configures a ridiculous time (for example 86400 seconds, which I've seen) then the launcher would not detect the postmaster death Is there some threshold after which we should have PostgreSQL emit a warning to the effect of "autovacuum_naptime is very large. Are you sure you know what you're doing?" Yeah, I've seen people set that up with the intention of "now autovacuum will only run during our slow time!". I'm thinking it'd be worth mentioning in the docs that this won't work, and instead suggesting that they run vacuumdb -a or equivalent at that time instead. Thoughts? Hmmm... it seems to me that points new users towards not using autovacuum, which doesn't seem like the best idea. I think it'd be better to say that setting the naptime really high is a Bad Idea. Instead, if they want to shift maintenance to off hours they should consider using a cron job that bonks around the pg_autovacuum.vac_base_thresh or vac_scale_factor values for tables they don't want vacuumed during operational hours (set them really high at the start of operational hours, then to normal during off hours). Tweaking the enabled column would work too, but they presumably don't want to disable ANALYZE, although it's entirely likely that new users don't know what ANALYZE does, in which case they _really_ don't want to disable it. This should probably be very close to a section that says something about how insufficient maintenance can be expected to lead to greater performance issues than using autovacuum with default settings. Assuming we believe that to be the case, which I think is reasonable given that we are now defaulting to having autovacuum enabled.
Andrew ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
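The cron-based scheme Jim sketches above could generate its SQL like this. A hypothetical helper, not a tested recipe: the threshold values and the vacrelid OID are illustrative, the column names come from the 8.2-era pg_autovacuum catalog mentioned in the thread, and the statements would be piped through psql from cron at the boundaries of business hours:

```python
# Raise a table's vacuum threshold sky-high at the start of operational
# hours, then restore it afterwards, as suggested above.
BUSINESS_HOURS_THRESH = 1000000000   # effectively "never vacuum"
OFF_HOURS_THRESH = 500               # back to a normal threshold

def thresh_update_sql(vacrelid, thresh):
    """Build the UPDATE a cron job would run against pg_autovacuum.
    vacrelid is the OID of the target table's pg_class row."""
    return ("UPDATE pg_autovacuum SET vac_base_thresh = %d "
            "WHERE vacrelid = %d;" % (thresh, vacrelid))
```

Note this deliberately leaves the enabled column (and hence ANALYZE) alone, for the reasons given above.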
[HACKERS] pqlib suggestion
Clients using pqlib can get some boost by increasing the TCP/IP window size. It might be good to make it a settable parameter. Anyway, maybe something like this:

/* --
 * connectNoDelay -
 *     Sets the TCP_NODELAY socket option.
 *     Returns 1 if successful, 0 if not.
 * --
 */
static int
connectNoDelay(PGconn *conn)
{
	int		on = 1;

#ifdef TCP_NODELAY
	if (setsockopt(conn->sock, IPPROTO_TCP, TCP_NODELAY,
				   (char *) &on, sizeof(on)) < 0)
	{
		char	sebuf[256];

		printfPQExpBuffer(&conn->errorMessage,
						  libpq_gettext("could not set socket to TCP no delay mode: %s\n"),
						  SOCK_STRERROR(SOCK_ERRNO, sebuf, sizeof(sebuf)));
		return 0;
	}
#endif

	on = 65535;
	if (setsockopt(conn->sock, SOL_SOCKET, SO_RCVBUF,
				   (char *) &on, sizeof(on)) < 0)
	{
		printfPQExpBuffer(&conn->errorMessage,
						  "connectNoDelay() -- setsockopt SO_RCVBUF failed: errno=%d\n%s\n",
						  errno, strerror(errno));
#ifdef WIN32
		printf("Winsock error: %i\n", WSAGetLastError());
#endif
		return 0;
	}

	on = 65535;
	if (setsockopt(conn->sock, SOL_SOCKET, SO_SNDBUF,
				   (char *) &on, sizeof(on)) < 0)
	{
		printfPQExpBuffer(&conn->errorMessage,
						  "connectNoDelay() -- setsockopt SO_SNDBUF failed: errno=%d\n%s\n",
						  errno, strerror(errno));
#ifdef WIN32
		printf("Winsock error: %i\n", WSAGetLastError());
#endif
		return 0;
	}

	return 1;
}
Re: [HACKERS] Autovacuum launcher doesn't notice death of postmaster immediately
Andrew Hammond [EMAIL PROTECTED] writes: Hmmm... it seems to me that points new users towards not using autovacuum, which doesn't seem like the best idea. I think it'd be better to say that setting the naptime really high is a Bad Idea. It seems like we should have an upper limit on the GUC variable that's less than INT_MAX ;-). Would an hour be sane? 10 minutes? This is independent of the problem at hand, though, which is that we probably want the launcher to notice postmaster death in less time than autovacuum_naptime, for reasonable values of same. regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Controlling Load Distributed Checkpoints
Greg Smith [EMAIL PROTECTED] writes: I'm completely biased because of the workloads I've been dealing with recently, but I consider (2) so much easier to tune for that it's barely worth worrying about. If your system is so underloaded that you can let the checkpoints take their own sweet time, I'd ask if you have enough going on that you're suffering very much from checkpoint performance issues anyway. I'm used to being in a situation where if you don't push out checkpoint data as fast as physically possible, you end up fighting with the client backends for write bandwidth once the LRU point moves past where the checkpoint has written out to already. I'm not sure how much always running the LRU background writer will improve that situation. I think you're working from a faulty premise. There's no relationship between the volume of writes and how important the speed of checkpoint is. In either scenario you should assume a system that is close to the max i/o bandwidth. The only question is which task the admin would prefer take the hit for maxing out the bandwidth, the transactions or the checkpoint. You seem to have imagined that letting the checkpoint take longer will slow down transactions. In fact that's precisely the effect we're trying to avoid. Right now we're seeing tests where Postgres stops handling *any* transactions for up to a minute. In virtually any real world scenario that would simply be unacceptable. That one-minute outage is a direct consequence of trying to finish the checkpoint as quick as possible. If we spread it out then it might increase the average i/o load if you sum it up over time, but then you just need a faster i/o controller. The only scenario where you would prefer the absolute lowest i/o rate summed over time would be if you were close to maxing out your i/o bandwidth, couldn't buy a faster controller, and response time was not a factor, only sheer volume of transactions processed mattered. 
That's a much less common scenario than caring about response time. And on the flip side, when you do have to worry about response time, buying a faster controller doesn't even help. It would shorten the duration of the checkpoint but not eliminate it. A 30-second outage every half hour is just as unacceptable as a 1-minute outage every half hour. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Autovacuum launcher doesn't notice death of postmaster immediately
Tom Lane wrote: Andrew Hammond [EMAIL PROTECTED] writes: Hmmm... it seems to me that points new users towards not using autovacuum, which doesn't seem like the best idea. I think it'd be better to say that setting the naptime really high is a Bad Idea. It seems like we should have an upper limit on the GUC variable that's less than INT_MAX ;-). Would an hour be sane? 10 minutes? This is independent of the problem at hand, though, which is that we probably want the launcher to notice postmaster death in less time than autovacuum_naptime, for reasonable values of same. Do we need a configurable autovacuum naptime at all? I know I put it in the original contrib autovacuum because I had no idea what knobs might be needed. I can't see a good reason to ever have a naptime longer than the default 60 seconds, but I suppose one might want a smaller naptime for a very active system? ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Controlling Load Distributed Checkpoints
On Thu, 7 Jun 2007, Gregory Stark wrote: You seem to have imagined that letting the checkpoint take longer will slow down transactions. And you seem to have imagined that I have so much spare time that I'm just making stuff up to entertain myself and sow confusion. I observed some situations where delaying checkpoints too long ends up slowing down both transaction rate and response time, using earlier variants of the LDC patch and code with similar principles I wrote. I'm trying to keep the approach used here out of the worst of the corner cases I ran into, or at least to make it possible for people in those situations to have some ability to tune out of the bad spots. I am unfortunately not free to disclose all those test results, and since that project is over I can't see how the current LDC compares to what I tested at the time. I plainly stated I had a bias here, one that's not even close to the average case. My concern here was that Heikki would end up optimizing in a direction where a really wide spread across the active checkpoint interval was strongly preferred. I wanted to offer some suggestions on the type of situation where that might not be true, but where a different tuning of LDC would still be an improvement over the current behavior. There are some tuning knobs there that I don't want to see go away until there's been a wider range of tests to prove they aren't effective. Right now we're seeing tests where Postgres stops handling *any* transactions for up to a minute. In virtually any real world scenario that would simply be unacceptable. No doubt; I've seen things get close to that bad myself, both on the high and low end. I collided with the issue in a "maxing out your i/o bandwidth, couldn't buy a faster controller" situation at one point, which is what kicked off my working in this area. It turned out there were still some software tunables left that pulled the worst case down to the 2-5 second range instead.
With more checkpoint_segments to decrease the frequency, that was just enough to make the problem annoying rather than crippling. But after that, I could easily imagine a different application scenario where the behavior you describe is the best case. This is really a serious issue with the current design of the database, one that merely changes instead of going away completely if you throw more hardware at it. I'm perversely glad to hear this is torturing more people than just me as it improves the odds the situation will improve. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] [RFC] GSoC Work on readonly queries done so far
Simon Riggs wrote: On Wed, 2007-06-06 at 16:11 +0200, Florian G. Pflug wrote: .) Added a new GUC operational_mode, which can be set to either readwrite or readonly. If it is set to readwrite (the default), postgres behaves as usual. All the following changes are only in effect if operational_mode is set to readonly. Do we need this? We are already InArchiveRecovery. If I understand you correctly, you suggest that readonly queries are always allowed during archive recovery - so upon startup postgres steps through these states: .) Initial recovery (until we reach a consistent state) .) Allow readonly queries .) Finish recovery in the background (might mean recovering forever on a PITR slave) .) Allow readwrite queries My plan was to have a global switch, which lets you choose between .) All queries are readonly (until the next postmaster restart at least), but you get background replay .) No background replay, but once replay is done, readwrite queries can be executed (just what PG does now). The main reason why I invented that global switch operational_mode was to remove the need to switch between readonly mode and readwrite mode on the fly. .) Created a macro ASSUME_OPMODE_READWRITE that does elog(ERROR) if postgres is not in readwrite mode. This macro protects the following functions to make sure that no writes occur in readonly mode: SimpleLruWritePage, SLruPhysicalWritePage, EndPrepare, FinishPreparedTransaction, XLogInsert, XLogWrite, ShutdownXLog, CreateCheckpoint, MarkBufferDirty. These are Asserts? The macro ASSUME_OPMODE_READWRITE just does if (!OperationalModeReadWrite) elog(ERROR, ...) .) Don't start autovacuum and bgwriter. Instead of bgwriter, bgreplay is started, and it takes over the role that bgwriter plays in the shutdown process. Autovacuum - understood. What does bgreplay do? Why not just start bgwriter earlier and disable some of its other functionality while InRecovery? See above - it seemed simpler to clearly separate .)
Transactions are assigned a dummy xid ReadOnlyTransactionId, that is considered to be later than any other xid. So you are bumping FirstNormalTransactionId up by one for this? In fact I changed MaxTransactionId to 0xFFFFFFFE, and set ReadOnlyTransactionId to 0xFFFFFFFF. Additionally, I changed IsNormalTransactionId to test not only for >= FirstNormalTransactionId, but also for <= MaxTransactionId. You're assuming then that we will freeze replay while we run a query? No. My plan is to first get to a point where replay freezes while queries are running, and to then figure out a more intelligent way to do this. I already have a few ideas how to do this, but I want to complete the simple version before I start with that work. Otherwise doing this will mean the snapshot changes as a query executes. Why? It's only the xid of the transaction, not its xmin and xmax, that is set to ReadOnlyTransactionId. .) A global ReadOnlySnapshot is maintained in shared memory. This is copied into backend local memory by GetReadonlySnapshotData (which replaces GetSnapshotData in readonly mode). .) Crash recovery is not performed in readonly mode - instead, postgres PANICs, and tells the DBA to restart in readwrite mode. Archive recovery of course *will* be allowed, but I'm not that far yet. This is the very heart of the matter. This isn't just a technical issue, it goes to the heart of the use case for this feature. Can we recover while running queries? Yes. My comment applies only to crash recovery - i.e., recovery that happens *without* a recovery.conf present, after a crash. It only really matters if you do the following: .) Start pg in readwrite mode. .) Kill it / It crashes .) Restart in readonly mode. The main difference between crash recovery and recovery from a filesystem-level backup is the additional information that the backup label gives us in the second case - more specifically, the minRecoveryLoc that we read from the backup label.
Only with that knowledge is recovering until we reach a consistent state a well-defined operation. And readonly queries can only be executed *after* we did this minimal recovery. So if there is crash recovery to be done, the best we could do is to recover, and then start in readonly mode. If this is *really* what the DBA wants, he can just start in readwrite mode first, then cleanly shut PG down, and restart in readonly mode. If not, how much time will we spend in replay mode v query mode? Will we be able to run long running queries *and* maintain a reasonable time to recover? Is this a mechanism for providing HA and additional query capacity, or is it just a mechanism for additional query capacity only? Those are open questions to which I don't have any answers yet myself. My goal is to allow replay and queries to run concurrently, at least as long as only inserts, updates and
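The xid-range change Florian describes can be modelled in a few lines. This is a sketch of the described constants, not the actual C code; the values assume the usual 32-bit TransactionId, and FIRST_NORMAL_XID follows PostgreSQL's convention of reserving the lowest ids for invalid/bootstrap/frozen xids:

```python
FIRST_NORMAL_XID = 3        # ids below this are reserved (invalid/bootstrap/frozen)
MAX_XID = 0xFFFFFFFE        # MaxTransactionId, pulled down by one by the patch
READONLY_XID = 0xFFFFFFFF   # ReadOnlyTransactionId: later than any other xid

def is_normal_xid(xid):
    """IsNormalTransactionId now checks both ends of the range, so the
    ReadOnlyTransactionId at the very top is never treated as normal."""
    return FIRST_NORMAL_XID <= xid <= MAX_XID
```

With this check, a readonly backend carrying the dummy xid can never be mistaken for a regular transaction by range-based tests.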
Re: [HACKERS] Controlling Load Distributed Checkpoints
This is really a serious issue with the current design of the database, one that merely changes instead of going away completely if you throw more hardware at it. I'm perversely glad to hear this is torturing more people than just me as it improves the odds the situation will improve. It tortures pretty much any high velocity postgresql db of which there are more and more every day. Joshua D. Drake -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/ ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Minor changes to Recovery related code
On Sat, 2007-03-31 at 00:51 +0200, Florian G. Pflug wrote: Simon Riggs wrote: On Fri, 2007-03-30 at 16:34 -0400, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: 2. pg_stop_backup() should wait until all archive files are safely archived before returning Not sure I agree with that one. If it fails, you can't tell whether the action is done and it failed while waiting for the archiver, or if you need to redo it. There's a slight delay between pg_stop_backup() completing and the archiver doing its stuff. Currently if somebody does a -m fast straight after the pg_stop_backup() the backup may be unusable. We need a way to plug that small hole. I suggest that pg_stop_backup() polls once per second until pg_xlog/archive_status/LOG.ready disappears, in which case it ends successfully. If it does this for more than 60 seconds it ends successfully but produces a WARNING. I fear that ending successfully despite having not archived all WALs will make this feature less worthwhile. If a DBA knows what he is doing, he can code a perfectly safe backup script using 8.2 too. He'll just have to check the current WAL position after pg_stop_backup() (there is a function for that, right?), and wait until the corresponding WAL was archived. In reality, however, I fear that most people will just create a script that does 'echo select pg_stop_backup | psql' or something similar. If they're a bit more careful, they will enable ON_ERROR_STOP, and check the return value of psql. I believe that those are the people who would really benefit from a pg_stop_backup() that waits for archiving to complete. But they probably won't check for WARNINGs. Maybe doing it the other way round would be an option? pg_stop_backup() could wait for the archiver to complete forever, but spit out a warning every 60 seconds or so: WARNING: Still waiting for wal archiving of wal ??? to complete. If someone really wants a 60-second timeout, he can just use statement_timeout.
I've just come up against this problem again, so I think it is a must-fix for this release. Other problems exist also, mentioned on separate threads. We have a number of problems surrounding pg_stop_backup/shutdown: 1. pg_stop_backup() currently returns before the WAL file containing the last change is correctly archived. That is a small hole, but one that is exposed when people write test scripts that immediately shut down the database after issuing pg_stop_backup(). It doesn't make much sense to shut down immediately after a hot backup, but it should still work sensibly. 2. We've also had problems caused by making the archiver wait until all WAL files are archived. If there is a backlog for some reason and the DBA issues a restart (i.e. stop and immediate restart) then making the archiver loop while it tries (and possibly fails) to archive all files would cause an outage. Avoiding this is why we do the current get-out-fast approach. There are some sub-scenarios: a) there is a backlog of WAL files, but no error has occurred on the *last* file (we might have just fixed a problem). b) there is a backlog of WAL files, but an error is causing a retry of the last file. My proposal is for us to record somewhere other than the logs that a failure to archive has occurred and is being retried. Failure to archive will be recorded in the archive_status directory as an additional file called archive_error, which will be deleted in the case of archive success and created in the case of archive error. This maintains the archiver's lack of attachment to shared memory and general simplicity of design. - pg_stop_backup() will wait until the WAL file that ends the backup is safely archived, even if a failure to archive occurs. This is a change to current behaviour, but since it implements the originally *expected* behaviour IMHO it should be the default.
- new function: pg_stop_backup_nowait() return immediately without waiting for archive, the same as the current pg_stop_backup() - new function: pg_stop_backup_wait(int seconds) wait until either an archival fails or the ending WAL file is archived, with a max wait as specified. wait=0 means wait until archive errors are resolved. Alternatives? -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] [RFC] GSoC Work on readonly queries done so far
Heikki Linnakangas wrote: Florian G. Pflug wrote: Jeff Davis wrote: Are you referring to the size of the xip array being a problem? Would it help to tie the size of the xip array to max_connections? I understand that max_connections might be greater on the master, but maybe something similar? That's what I currently do - the xip array on the slave is sized to hold max_connections entries (actually, it's max_connections + max_prepared_xacts I think). The problem occurs exactly if those values are set too small on the slave - and since shared mem objects are not resizeable, I don't see how the slave can handle an xip overflow gracefully other than by not publishing the information in shared memory as long as it doesn't fit there. You could store the value of max_connections in the checkpoint xlog record, and read it from there in the slave. Though one could still change it on the master and restart without restarting the slave as well. But AFAIK shmem allocation happens before recovery starts... Even if this was solved, it would only be a partial solution since, as you note, the master might be restarted while the slave keeps running. So I think it's better not to add too much complexity, and just tell the DBA to increase max_connections on the slave, together with a comment in the documentation never to set max_connections smaller on the slave than on the master. greetings, Florian Pflug ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] [COMMITTERS] pgsql: Avoid losing track of data for shared tables in pgstats.
Alvaro Herrera [EMAIL PROTECTED] writes: Log Message: --- Avoid losing track of data for shared tables in pgstats. Report by Michael Fuhr, patch from Tom Lane after a messier suggestion by me. When does this bug date to? is it possible it's related to the performance drop immediately following a vacuum analyze we've been seeing? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] [RFC] GSoC Work on readonly queries done so far
Jeff Davis wrote: On Wed, 2007-06-06 at 22:36 +0100, Simon Riggs wrote: .) Transactions are assigned a dummy xid ReadOnlyTransactionId, that is considered to be later than any other xid. So you are bumping FirstNormalTransactionId up by one for this? You're assuming then that we will freeze replay while we run a query? Otherwise doing this will mean the snapshot changes as a query executes. Is it possible to put a normal xmax for the snapshot? It wouldn't be a real transaction on the slave, and also the master will use that ID for a real transaction itself. However, I don't see a real problem on the slave because it would only be used for the purpose of the snapshot we need at that moment. My plan is the following: .) Initially, queries and recovery will run interleaved, but not concurrently. For that, an empty snapshot is sufficient, with xmin=xid=xmax=ReadOnlyTransactionId. .) Then, I'll work on running them concurrently. The replay process will publish a current snapshot in shared memory, using real xmin and xmax values it generates by maintaining a list of currently active (as in: running when the wal was written on the master) transactions. In that case, only xid is set to ReadOnlyTransactionId. greetings, Florian Pflug ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
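Florian's two snapshot flavours can be expressed as a toy model. This is a hypothetical Python sketch, not the actual SnapshotData struct; the constant assumes 32-bit transaction ids, and the field names are simplified:

```python
from collections import namedtuple

READONLY_XID = 0xFFFFFFFF   # dummy xid from the proposal above

Snapshot = namedtuple("Snapshot", "xmin xmax xip")

def empty_readonly_snapshot():
    """Phase 1: queries and replay interleave but never run concurrently,
    so an empty snapshot with xmin = xmax = ReadOnlyTransactionId suffices:
    everything committed is visible, nothing is in progress."""
    return Snapshot(xmin=READONLY_XID, xmax=READONLY_XID, xip=())

def replay_snapshot(active_xids):
    """Phase 2: the replay process publishes real xmin/xmax built from the
    transactions that were running on the master when the WAL was written;
    only the readonly query's own xid stays ReadOnlyTransactionId."""
    if not active_xids:
        return empty_readonly_snapshot()
    return Snapshot(xmin=min(active_xids),
                    xmax=max(active_xids) + 1,
                    xip=tuple(sorted(active_xids)))
```

The point of the model is that a row version is invisible if its xid is >= xmax or appears in xip, exactly as with a normal MVCC snapshot; the dummy xid never enters that calculation.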
Re: [HACKERS] [ADMIN] Attempt to re-archive existing WAL logsafterrestoringfrom backup
On Thu, 2007-06-07 at 22:20 +0100, Simon Riggs wrote: On Sat, 2007-06-02 at 01:07 +0100, Jon Colverson wrote: Simon Riggs wrote: I've spent some time re-examining the code and I can't see a route for the discussed problem to occur. All the main code paths do not exhibit this problem in 8.2, so I'm not sure where to go from here. If you can help further, please give me a shout. Thanks a lot for looking into this. I've attached a self-contained minimal shell script which sets up a cluster and demonstrates the problem. I hope that will provide an unambiguous description. I'm running it on Ubuntu 7.04 (i386) and the script includes all the commands to build PostgreSQL from scratch, but obviously you can snip that stuff out and change the paths to suit your environment. The script ends with the server log showing the archiver attempting to re-archive the last log that was restored in the recovery procedure. Jon, Your script is correct and there is a problem, as you demonstrate. Thank you for bringing this to the list. You have followed the manual in full. In Step 6 of the recovery procedure outlined here http://developer.postgresql.org/pgdocs/postgres/continuous-archiving.html#BACKUP-PITR-RECOVERY we say that if you have partially completed WAL logs these should be copied back into pg_xlog. This is correct and if we do this we avoid the error you have highlighted. In the case of a true disaster recovery, as you have attempted to test, no files are available and so an attempt is made to archive the last restored file. That attempt fails iff you have done as the manual suggests and used an archive_command that prevents duplicate file insertions. I'll repost to pgsql-hackers to discuss solutions. Fixes are simple, but require some discussion. Currently if we perform an archive recovery and the end of logs occur during or at the end of an xlog that has already been archived then we will fail and keep failing to archive anything. 
That's not much fun and it appears to me that this problem goes back to the early days of PITR and may require backpatching. Currently if an xlog file has a missing archive_status file then we create a .ready status, which leads to the command being retried, which might fail or might not. Since the last file on an archive recovery seldom has an archive status this leads to failure situations when the archive_command prevents duplicate files from existing in the archive. AFAICS we have four options: 1. We can assume that if an xlog file has no archive status it must have been archived. Hmmm, sounds like a poor assumption. 2. We can relax the recommendation to have an archive_command that prevents duplicate logs. Although I haven't in recent times seen this as too strong a requirement because we have timelines, I don't think we should relax this. 3. We craft a special archive_status file for the last xlog file in an archive, so that it never gets archived. 4. As Warren Little recently regretted, if we have a corrupt xlog file then end of logs happens much earlier than anticipated. When this occurs many subsequent xlogs would be named similarly to files already in the archive, yet they would be a separate timeline. We could solve this problem as well by simply bumping the TLI each time we complete a recovery, whatever the reason. Presumably the archives will not stretch back as far as 2 billion recoveries. Option 3 is straightforward and fixes the issue directly; option 4 solves many problems in a bold sweep of the design pen, with no obvious downside, AFAICS. Both are simple and can be backpatched easily. Initially I thought you'd fallen foul of another problem which is known to exist, which is caused by immediately shutting down a server after having taken a hot backup. This isn't a problem in real usage though it is a problem in a scripted test. I've already proposed solutions there and will pick up that thread again also. On another thread on -hackers.
-- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] [COMMITTERS] pgsql: Avoid losing track of data for shared tables in pgstats.
Gregory Stark wrote: Alvaro Herrera [EMAIL PROTECTED] writes: Log Message: --- Avoid losing track of data for shared tables in pgstats. Report by Michael Fuhr, patch from Tom Lane after a messier suggestion by me. When does this bug date to? It was in 8.1. I didn't verify whether it affects 8.0; I think we separated the shared tables in pgstats in 8.1, with the advent of autovacuum, so I assume it doesn't. The patch doesn't apply cleanly to 8.0 anyway, and I decided not to spend much time on it seeing that nobody has noticed it in years. is it possible it's related to the performance drop immediately following a vacuum analyze we've been seeing? I don't think so, unless you were counting on pgstats data of shared tables for something. The optimizer, for one, doesn't, so I doubt it would affect query planning. And it would only affect you if your queries were using shared tables, which I very much doubt ... -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] little PITR annoyance
On Thu, 2007-06-07 at 17:01 +0200, [EMAIL PROTECTED] wrote: I've been bitten twice by this one: I enabled archiving on both my pgsql servers. The archive command was a simple gzip to an nfs mounted dir. Now the nfs server crashed. Both WAL spaces filled up until postgresql shut itself down because of no more space for WAL. That's perfectly normal and expected. What I didn't expect and don't understand is that postgresql refused to start up after the nfs server was up and running, until I added some more space on the WAL fs, although if it had started archiving, space would have been there. I wonder if archiving could start before postgresql, at least to make a little room for the database engine to really start. gzip writes a new file and then deletes the old, doesn't it? So it must require space on the xlog drive. Does it still fail if you avoid using gzip and just use scp? -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
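Simon's scratch-space concern can also be dodged without giving up compression: gzip -c writes only to stdout, so the compressed output can be redirected straight into the (NFS-mounted) archive and nothing extra lands on the WAL volume. A hedged sketch, with the function name and directories invented:

```shell
#!/bin/sh
# Sketch: compress directly into the archive directory instead of
# gzip-ing in place.  No temporary file is created next to the WAL
# segment, so a full pg_xlog filesystem does not block archiving.
archive_gz() {             # archive_gz SRC_PATH ARCHIVE_DIR FILE_NAME
    src=$1; dst=$2/$3.gz
    test ! -f "$dst" || return 1       # still refuse duplicates
    gzip -c "$src" > "$dst"
}
```

A production version would want to write to a temporary name on the archive side and rename into place, so a crash mid-compress doesn't leave a truncated .gz that the duplicate check then refuses to overwrite.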
Best Practice for running vacuums during off hours WAS Re: [HACKERS] Autovacuum launcher doesn't notice death of postmaster immediately
On 6/7/07, Matthew T. O'Connor wrote: Tom Lane wrote: Andrew Hammond writes: Hmmm... it seems to me that points new users towards not using autovacuum, which doesn't seem like the best idea. I think it'd be better to say that setting the naptime really high is a Bad Idea. It seems like we should have an upper limit on the GUC variable that's less than INT_MAX ;-). Would an hour be sane? 10 minutes? This is independent of the problem at hand, though, which is that we probably want the launcher to notice postmaster death in less time than autovacuum_naptime, for reasonable values of same. Do we need a configurable autovacuum naptime at all? I know I put it in the original contrib autovacuum because I had no idea what knobs might be needed. I can't see a good reason to ever have a naptime longer than the default 60 seconds, but I suppose one might want a smaller naptime for a very active system? That's a good question. I can't see any reason for a naptime longer than 60 seconds either. I think very large naptime settings are a symptom of another issue: what's the Right Way to defer vacuums until off hours? Is that even a desirable thing anymore? I don't think it is in the majority of cases. I originally thought that this was more of a Best Practices issue (ie, fix in the docs, not the code), but now I'm wondering if there's much call for supporting the idea of being more aggressive with vacuums at different times of the day / week / month. Anyone? ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: Best Practice for running vacuums during off hours WAS Re: [HACKERS] Autovacuum launcher doesn't notice death of postmaster immediately
Andrew Hammond wrote: That's a good question. I can't see any reason for a naptime longer than 60 seconds either. I think very large naptime settings are a symptom of another issue: what's the Right Way to defer vacuums until off hours? Is that even a desirable thing anymore? I don't think it is in the majority of cases. I originally thought that this was more of a Best Practices issue (ie, fix in the docs, not the code), but now I'm wondering if there's much call for supporting the idea of being more aggressive with vacuums at different times of the day / week / month. Anyone? That's why the intention is to have the autovacuum scheduling feature be a mechanism for changing the autovac parameters according to date and time. We even have a Google Summer of Code project about that. -- Alvaro Herrera http://www.amazon.com/gp/registry/DXLWNGRJD34J Virtue is the just mean between two defects (Aristotle) ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
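The "defer vacuums to off hours" practice the thread keeps circling can be expressed today without abusing autovacuum_naptime: an external cron job that only does work inside a maintenance window. A sketch, with the window bounds and database name entirely made up:

```shell
#!/bin/sh
# Sketch of off-hours vacuuming via cron rather than a huge naptime.
# in_window tests whether hour $1 falls in [start, end), handling
# windows that cross midnight (e.g. 22:00-06:00).
in_window() {              # in_window HOUR START END
    hour=$1; start=$2; end=$3
    if [ "$start" -le "$end" ]; then
        [ "$hour" -ge "$start" ] && [ "$hour" -lt "$end" ]
    else
        [ "$hour" -ge "$start" ] || [ "$hour" -lt "$end" ]
    fi
}

# Cron would run this hourly; vacuuming happens only in the 22:00-06:00
# window (database name "mydb" is hypothetical):
#   in_window "$(date +%H)" 22 6 && vacuumdb --analyze mydb
```

This keeps autovacuum free to run with its normal 60-second naptime for emergencies, while the heavy scheduled work lands in the quiet hours.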
Re: [HACKERS] [COMMITTERS] pgsql: Avoid losing track of data for shared tables in pgstats.
Alvaro Herrera [EMAIL PROTECTED] writes: Gregory Stark wrote: is it possible it's related to the performance drop immediately following a vacuum analyze we've been seeing? I don't think so, unless you were counting on pgstats data of shared tables for something. The optimizer, for one, doesn't, so I doubt it would affect query planning. And it would only affect you if your queries were using shared tables, which I very much doubt ... Does anything use the pgstats data for anything other than presenting feedback to users? Autovacuum uses it to estimate when tables should be vacuumed right? This wouldn't have caused autovacuum to go nuts vacuuming these tables would it? But I doubt even then that it could consume much i/o bandwidth. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] TOAST usage setting
Gregory Stark wrote: Bruce Momjian [EMAIL PROTECTED] writes: Well, it is summarized here: http://momjian.us/expire/TOAST/SUMMARY.html It made non-TOAST access 2x faster, but TOAST 7x slower, and that seemed like a good compromise. Is this still testing with all data fitting in RAM? Yes. Having things out of RAM is going to make access even slower, but it is going to allow the heap to be in RAM more often. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] .conf File Organization WAS: Controlling Load Distributed Checkpoints
All, This brings up another point. With the increased number of .conf options, the file is getting hard to read again. I'd like to do another reorganization, but I don't really want to break people's diff scripts. Should I worry about that? --Josh ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] .conf File Organization WAS: Controlling Load Distributed Checkpoints
Josh Berkus wrote: All, This brings up another point. With the increased number of .conf options, the file is getting hard to read again. I'd like to do another reorganization, but I don't really want to break people's diff scripts. Should I worry about that? As a point of feedback, autovacuum and vacuum should be together. Joshua D. Drake --Josh ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/ ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] [COMMITTERS] pgsql: Avoid losing track of data for shared tables in pgstats.
Gregory Stark wrote: Alvaro Herrera [EMAIL PROTECTED] writes: Gregory Stark wrote: is it possible it's related to the performance drop immediately following a vacuum analyze we've been seeing? I don't think so, unless you were counting on pgstats data of shared tables for something. The optimizer, for one, doesn't, so I doubt it would affect query planning. And it would only affect you if your queries were using shared tables, which I very much doubt ... Does anything use the pgstats data for anything other than presenting feedback to users? Not that I know of. Autovacuum uses it to estimate when tables should be vacuumed right? Yep This wouldn't have caused autovacuum to go nuts vacuuming these tables would it? But I doubt even then that it could consume much i/o bandwidth. Yes, but keep in mind that these are only the shared tables: pg_database, pg_authid, pg_shdepend, etc. Those are not tables that you're going to use regularly, much less _bloat_ regularly enough that they need frequent vacuuming. Maybe pg_shdepend, because it would be used when creating temp tables. -- Alvaro Herrera http://www.flickr.com/photos/alvherre/ Postgres is bloatware by design: it was built to house PhD theses. (Joey Hellerstein, SIGMOD annual conference 2002) ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] .conf File Organization WAS: Controlling Load Distributed Checkpoints
Josh Berkus [EMAIL PROTECTED] writes: This brings up another point. With the increased number of .conf options, the file is getting hard to read again. I'd like to do another reorganization, but I don't really want to break people's diff scripts. Do you have a better organizing principle than what's there now? regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] [COMMITTERS] pgsql: Avoid losing track of data for shared tables in pgstats.
On Thu, Jun 07, 2007 at 11:46:17PM +0100, Gregory Stark wrote: Does anything use the pgstats data for anything other than presenting feedback to users? Autovacuum uses it to estimate when tables should be vacuumed right? This wouldn't have caused autovacuum to go nuts vacuuming these tables would it? But I doubt even then that it could consume much i/o bandwidth. I discovered this problem after noticing that pg_shdepend had gotten horribly bloated -- apparently due to heavy use of temporary tables by an application whose performance I was investigating -- despite autovacuum being enabled. When I looked at the statistics for pg_shdepend the values for n_tup_{ins,upd,del} were much lower than I expected. After watching the numbers grow for a few minutes I saw them reset; after observing this behavior several times I was able to correlate the resets with vacuums of other tables. Since the statistics for pg_shdepend rarely got high enough to trigger an autovacuum that table was almost never being vacuumed. I suggested to the DBA that he vacuum it manually; after five minutes the vacuum completed and the application's performance improved immediately. INFO: pg_shdepend: found 8475403 removable, 3907 nonremovable row versions in 76783 pages -- Michael Fuhr ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
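The mechanism behind Michael's observation: autovacuum vacuums a table once the dead-tuple count reported by pgstats exceeds autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples, so if the shared-table counters keep getting reset (the bug being fixed), pg_shdepend never clears the bar. A sketch of that arithmetic (the helper is mine; parameter values below are illustrative, not any release's defaults):

```shell
#!/bin/sh
# Sketch of autovacuum's trigger arithmetic:
#   dead_tuples > threshold + (scale_factor) * reltuples
# SCALE_PCT is the scale factor expressed as a percentage, to keep the
# shell interface integer-only; awk does the floating-point comparison.
vacuum_due() {             # vacuum_due DEAD_TUPLES RELTUPLES BASE SCALE_PCT
    awk -v d="$1" -v n="$2" -v b="$3" -v s="$4" \
        'BEGIN { exit !(d > b + (s / 100) * n) }'
}
```

With 8.4 million dead rows (the number from the VACUUM output above) the trigger fires easily; with counters reset to a few hundred, it never does, and the table bloats exactly as described.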
Re: [HACKERS] [COMMITTERS] pgsql: Avoid losing track of data for shared tables in pgstats.
On Thu, Jun 07, 2007 at 06:09:36PM -0400, Alvaro Herrera wrote: Gregory Stark wrote: When does this bug date to? It was in 8.1. I didn't verify whether it affects on 8.0; I think we separated the shared tables in pgstats in 8.1, with the advent of autovacuum, so I assume it doesn't. Doesn't appear to, at least not using the test case I found for 8.1 and later. -- Michael Fuhr ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] [COMMITTERS] pgsql: Avoid losing track of data for shared tables in pgstats.
Michael Fuhr [EMAIL PROTECTED] writes: On Thu, Jun 07, 2007 at 06:09:36PM -0400, Alvaro Herrera wrote: Gregory Stark wrote: When does this bug date to? It was in 8.1. I didn't verify whether it affects on 8.0; I think we separated the shared tables in pgstats in 8.1, with the advent of autovacuum, so I assume it doesn't. Doesn't appear to, at least not using the test case I found for 8.1 and later. Yeah, the separate hashtable for shared rels was added for 8.1: http://archives.postgresql.org/pgsql-committers/2005-07/msg00627.php regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] TOAST usage setting
Bruce Momjian [EMAIL PROTECTED] writes: Gregory Stark wrote: Is this still testing with all data fitting in RAM? Yes. Having things out of RAM is going to make access even slower, but it is going to allow the heap to be in RAM more often. It would let us measure the actual impact of TOAST, the largest negative effect of which is to destroy the sequential access pattern, while the positive effect is, as you say, to increase cache effectiveness on non-toasted data. The CPU costs are insignificant, so testing in-memory cases misses the point. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster