Re: [HACKERS] is_array_type vs type_is_array
On Thu, 7 Jun 2007, Tom Lane wrote: Jeremy Drake [EMAIL PROTECTED] writes: Was there some change-in-functionality reason for renaming is_array_type to type_is_array? Just to sync style with type_is_enum ... there were more of the latter than the former. OK, so it is safe to just #define one to the other, right? It broke compilation of fulldisjunctions. Sorry, but we change internal APIs every day, and twice on Sundays. Deal with it. This is why I build fulldisjunctions in my sandbox: when I decided to try it out one time, I could not get it to compile. After much effort getting it happy with HEAD and sending the changes back to the maintainer of fulldisjunctions, I thought a good thing for me to contribute would be to make sure it continues to compile cleanly against HEAD and to send patches when it breaks. I just wanted to make sure that the functionality of this function did not change in some way that I did not see before sending a patch to the maintainer of fulldisjunctions. "Deal with it" was not the response I was expecting, as that is exactly what I am trying to do ;) -- It is generally agreed that "Hello" is an appropriate greeting because if you entered a room and said "Goodbye", it could confuse a lot of people. -- Dolph Sharp, "I'm O.K., You're Not So Hot" ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
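The "#define one to the other" idea above can be sketched like this. The stand-in body for type_is_array is purely illustrative (the real backend function inspects pg_type); only the alias pattern itself reflects the rename being discussed:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch: stand-in for the backend's type_is_array(); the real function
 * looks the type up in pg_type. Placeholder logic only. */
static bool
type_is_array(unsigned int typid)
{
    return typid != 0;          /* illustrative, not real catalog logic */
}

/* Keep the old spelling as an alias so third-party code written against
 * the previous API (is_array_type) still compiles against newer headers. */
#define is_array_type(typid) type_is_array(typid)
```

A module like fulldisjunctions could carry such a shim in its own compatibility header, guarded by a version check.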
Re: [HACKERS] Controlling Load Distributed Checkpoints
Greg Smith wrote: On Wed, 6 Jun 2007, Heikki Linnakangas wrote: The original patch uses bgwriter_all_max_pages to set the minimum rate. I think we should have a separate variable, checkpoint_write_min_rate, in KB/s, instead. Completely agreed. There shouldn't be any coupling with the background writer parameters, which may be set for a completely different set of priorities than the checkpoint has. I have to look at this code again to see why it's a min_rate instead of a max, that seems a little weird. It's a min rate because it never writes slower than that, and it can write faster if the next checkpoint is due so soon that we wouldn't otherwise finish before it's time to start the next one. (Or to be precise, before the next checkpoint is closer than 100-(checkpoint_write_percent)% of the checkpoint interval.) Nap phase: We should therefore give the delay as a number of seconds instead of as a percentage of checkpoint interval. Again, the setting here should be completely decoupled from another GUC like the interval. My main complaint with the original form of this patch was how much it tried to synchronize the process with the interval; since I don't even have a system where that value is set to something, because it's all segment based instead, that whole idea was incompatible. checkpoint_segments is taken into account as well as checkpoint_timeout. I used the term "checkpoint interval" to mean the real interval at which the checkpoints occur, whether it's because of segments or timeout. The original patch tried to spread the load out as evenly as possible over the time available. I much prefer thinking in terms of getting it done as quickly as possible while trying to bound the I/O storm. Yeah, the checkpoint_min_rate allows you to do that. So there's two extreme ways you can use LDC: 1. Finish the checkpoint as soon as possible, without disturbing other activity too much. Set checkpoint_write_percent to a high number, and set checkpoint_min_rate to define "too much". 2. 
Disturb other activity as little as possible, as long as the checkpoint finishes in a reasonable time. Set checkpoint_min_rate to a low number, and checkpoint_write_percent to define "a reasonable time". Are both interesting use cases, or is it enough to cater for just one of them? I think 2 is easier to tune. Defining the min_rate properly can be difficult and depends a lot on your hardware and application, but a default value of say 50% for checkpoint_write_percent to tune for use case 2 should work pretty well for most people. In any case, the checkpoint had better finish before it's time to start another one. Or would you rather delay the next checkpoint, and let the checkpoint take as long as it takes to finish at the min_rate? And we don't know how much work an fsync performs. The patch uses the file size as a measure of that, but as we discussed that doesn't necessarily have anything to do with reality. fsyncing a 1GB file with one dirty block isn't any more expensive than fsyncing a file with a single block. On top of that, if you have a system with a write cache, the time an fsync takes can greatly depend on how full it is at the time, which there is no way to measure or even model easily. Is there any way to track how many dirty blocks went into each file during the checkpoint write? That's your best bet for guessing how long the fsync will take. I suppose it's possible, but the OS has hopefully started flushing them to disk almost as soon as we started the writes, so even that isn't a very good measure. On a Linux system, one way to model it is that the OS flushes dirty buffers to disk at the same rate as we write them, but delayed by dirty_expire_centisecs. That should hold if the writes are spread out enough. Then the amount of dirty buffers in the OS cache at the end of the write phase is roughly constant, as long as the write phase lasts longer than dirty_expire_centisecs. 
If we take a nap of dirty_expire_centisecs after the write phase, the fsyncs should be effectively no-ops, except that they will flush any other writes the bgwriter lru-sweep and other backends performed during the nap. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
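The min-rate behavior described in this exchange can be sketched as follows. The function name and parameters are illustrative, not the actual patch code; the point is just that the effective rate is the larger of the configured floor and the rate needed to finish within the write_percent share of the checkpoint interval:

```c
#include <assert.h>

/* Sketch of the LDC write-rate rule: never slower than min_rate_kbps,
 * but faster if that's what it takes to finish writing dirty_kb within
 * write_percent % of the checkpoint interval. Illustrative names only. */
static double
checkpoint_write_rate(double dirty_kb, double interval_secs,
                      double write_percent, double min_rate_kbps)
{
    double budget_secs = interval_secs * (write_percent / 100.0);
    double needed_kbps = dirty_kb / budget_secs;  /* rate to finish in time */

    return needed_kbps > min_rate_kbps ? needed_kbps : min_rate_kbps;
}
```

With a 300 s interval, write_percent = 50, and min rate 1000 KB/s, a 100 MB backlog writes at the 1000 KB/s floor, while a 1 GB backlog forces the rate up so the write phase still ends on time.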
Re: [HACKERS] Controlling Load Distributed Checkpoints
Ühel kenal päeval, K, 2007-06-06 kell 11:03, kirjutas Tom Lane: Heikki Linnakangas [EMAIL PROTECTED] writes: GUC summary and suggested default values:

checkpoint_write_percent = 50     # % of checkpoint interval to spread out writes
checkpoint_write_min_rate = 1000  # minimum I/O rate to write dirty buffers at checkpoint (KB/s)
checkpoint_nap_duration = 2       # delay between write and sync phase, in seconds
checkpoint_fsync_period = 30      # duration of the sync phase, in seconds
checkpoint_fsync_delay = 500      # max. delay between fsyncs

I don't like adding that many GUC variables, but I don't really see a way to tune them automatically. If we don't know how to tune them, how will the users know? He talked about doing it _automatically_. If the knobs are available, it will be possible to determine good values even by brute-force performance testing, given that enough time and manpower are available. Having to add that many variables to control one feature says to me that we don't understand the feature. The feature has lots of complex dependencies on things outside postgres, so learning to understand it takes time. Having the knobs available helps, as more people are willing to do turn-the-knobs-and-test vs. recompile-and-test. Perhaps what we need is to think about how it can auto-tune itself. Sure. --- Hannu Krosing ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] [RFC] GSoC Work on readonly queries done so far
Florian G. Pflug wrote: Jeff Davis wrote: Are you referring to the size of the xip array being a problem? Would it help to tie the size of the xip array to max_connections? I understand that max_connections might be greater on the master, but maybe something similar? That's what I currently do - the xip array on the slave is sized to hold max_connections entries (actually, it's max_connections + max_prepared_xacts, I think). The problem occurs exactly if those values are set too small on the slave - and since shared mem objects are not resizeable, I don't see how the slave can handle an xip overflow gracefully other than by not publishing the information in shared memory as long as it doesn't fit there. You could store the value of max_connections in the checkpoint xlog record, and read it from there in the slave. Though one could still change it on the master and restart without restarting the slave as well. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
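The sizing rule and the overflow fallback described above can be sketched like this. The struct and function names are hypothetical, not the GSoC patch's actual code; the only claims taken from the thread are the capacity formula and the "stop publishing on overflow" behavior:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the slave-side xip bookkeeping discussed above. */
typedef struct
{
    int  capacity;   /* max_connections + max_prepared_xacts */
    int  used;
    bool published;  /* false once the fixed-size array has overflowed */
} XipArray;

static XipArray
xip_init(int max_connections, int max_prepared_xacts)
{
    XipArray a;

    a.capacity = max_connections + max_prepared_xacts;
    a.used = 0;
    a.published = true;
    return a;
}

static void
xip_add(XipArray *a)
{
    if (a->used >= a->capacity)
        a->published = false;   /* shared memory can't grow: unpublish */
    else
        a->used++;
}
```

This also shows why recording the master's max_connections in the checkpoint record helps: it lets the slave pick a large-enough capacity up front instead of discovering the overflow at runtime.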
Re: [HACKERS] How do we create the releases?
On Wed, Jun 06, 2007 at 07:56:25PM -0300, Marc G. Fournier wrote: --On Tuesday, June 05, 2007 10:28:58 +0300 Devrim GÜNDÜZ [EMAIL PROTECTED] wrote: Hi Marc, Is there a written procedure about creating tarballs? I'd like to start working on 8.3 RPMs and I want to know what I should do to create a tarball. Just a script ... relevant bits: snip Is this script in CVS somewhere? I know it's not in the main backend repo. But we should keep it either there or somewhere else - perhaps the pmt repository? //Magnus ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] How do we create the releases?
Magnus Hagander wrote: Is this script in CVS somewhere? I know it's not in the main backend repo. Most of what these scripts have done in the past has been systematized and folded into the "make dist" target in the meantime, and AFAICT we could integrate the documentation preparation as well, now that it works automatically in a reliable fashion. That would reduce the answer to the original question to something like:

./configure
make dist[check]

-- Peter Eisentraut http://developer.postgresql.org/~petere/ ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] How do we create the releases?
On Thu, Jun 07, 2007 at 01:12:06PM +0200, Peter Eisentraut wrote: Magnus Hagander wrote: Is this script in CVS somewhere? I know it's not in the main backend repo. Most of what these scripts have done in the past has been systematized and folded into the make dist target in the mean time, and AFAICT, we could integrate the documentation preparation as well, now that it works automatically in a reliable fashion. That would reduce the answer to the original question to something like ./configure make dist[check] If we can do that, that makes it even better. But is it something we can reasonably backpatch? Otherwise we still need both documentation and scripts to deal with back branches. //Magnus ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] How do we create the releases?
Hi, On Thu, 2007-06-07 at 13:12 +0200, Peter Eisentraut wrote: make dist[check] AFAIK make dist does not prepare postgres.tar.gz. Regards, -- Devrim GÜNDÜZ PostgreSQL Replication, Consulting, Custom Development, 24x7 support Managed Services, Shared and Dedicated Hosting Co-Authors: plPHP, ODBCng - http://www.commandprompt.com/
Re: [HACKERS] Controlling Load Distributed Checkpoints
Thinking about this whole idea a bit more, it occurred to me that the current approach of "write all, then fsync all" is really a historical artifact of the fact that we used to use the system-wide sync call instead of fsyncs to flush the pages to disk. That might not be the best way to do things in the new load-distributed-checkpoint world. How about interleaving the writes with the fsyncs?

1. Scan all shared buffers, and build a list of all files with dirty pages, and the buffers belonging to them

2. foreach(file in list)
   {
       foreach(buffer belonging to file)
       {
           write();
           sleep();   /* to throttle the I/O rate */
       }
       sleep();       /* to give the OS a chance to flush the writes at its own pace */
       fsync();
   }

This would spread out the fsyncs in a natural way, making the knob to control the duration of the sync phase unnecessary. At some point we'll also need to fsync all files that have been modified since the last checkpoint, but don't have any dirty buffers in the buffer cache. I think it's a reasonable assumption that fsyncing those files doesn't generate a lot of I/O. Since the writes were made some time ago, the OS has likely already flushed them to disk. Doing the 1st phase of just scanning the buffers to see which ones are dirty also effectively implements the optimization of not writing buffers that were dirtied after the checkpoint start. And grouping the writes per file gives the OS a better chance to group the physical writes. One problem is that currently the segmentation of relations into 1GB files is handled at a low level inside md.c, and we don't really have any visibility into that in the buffer manager. ISTM that some changes to the smgr interfaces would be needed for this to work well, though just doing it on a relation-per-relation basis would also be better than the current approach. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
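The per-file interleaving proposed above can be sketched as a self-contained C model. do_write/do_sleep/do_fsync are stand-ins for the real smgr/md.c calls (this is a shape-of-the-algorithm sketch, not backend code); counters make the ordering testable:

```c
#include <assert.h>

/* One entry per file that has dirty pages after the scan phase. */
typedef struct
{
    int nbuffers;            /* dirty buffers belonging to this file */
} DirtyFile;

static int writes_issued;
static int fsyncs_issued;

static void do_write(void)  { writes_issued++; }   /* stand-in for smgrwrite */
static void do_fsync(void)  { fsyncs_issued++; }   /* stand-in for smgrsync  */
static void do_sleep(void)  { /* throttling delay; elided in the sketch */ }

/* Write all of a file's dirty buffers, then fsync that file, then move on. */
static void
checkpoint_interleaved(const DirtyFile *files, int nfiles)
{
    for (int f = 0; f < nfiles; f++)
    {
        for (int b = 0; b < files[f].nbuffers; b++)
        {
            do_write();
            do_sleep();      /* throttle the I/O rate */
        }
        do_sleep();          /* let the OS flush at its own pace */
        do_fsync();          /* one fsync per file, right after its writes */
    }
}
```

Running this over two files with 3 and 2 dirty buffers issues 5 writes and exactly one fsync per file, which is the "no extra fsyncs, different ordering" point made later in the thread.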
Re: [HACKERS] Controlling Load Distributed Checkpoints
Heikki Linnakangas [EMAIL PROTECTED] writes: Thinking about this whole idea a bit more, it occured to me that the current approach to write all, then fsync all is really a historical artifact of the fact that we used to use the system-wide sync call instead of fsyncs to flush the pages to disk. That might not be the best way to do things in the new load-distributed-checkpoint world. How about interleaving the writes with the fsyncs? I don't think it's a historical artifact at all: it's a valid reflection of the fact that we don't know enough about disk layout to do low-level I/O scheduling. Issuing more fsyncs than necessary will do little except guarantee a less-than-optimal scheduling of the writes. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] How do we create the releases?
--On Thursday, June 07, 2007 14:22:22 +0300 Devrim GÜNDÜZ [EMAIL PROTECTED] wrote: Hi, On Thu, 2007-06-07 at 13:12 +0200, Peter Eisentraut wrote: make dist[check] AFAIK make dist does not prepare postgres.tar.gz. right, as peter originally said, 'make dist' *could* incorporate the steps that right now lead up to 'make dist' (ie. build the docs) ... - Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email . [EMAIL PROTECTED] MSN . [EMAIL PROTECTED] Yahoo . yscrappy Skype: hub.org ICQ . 7615664 ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
[HACKERS] little PITR annoyance
Hi all, I've been bitten twice by this one: I enabled archiving on both my pgsql servers. The archive command was a simple gzip to an nfs-mounted dir. Then the nfs server crashed. Both WAL spaces filled up until postgresql shut itself down because of no more space for WAL. That's perfectly normal and expected. What I didn't expect and don't understand is that postgresql refused to start up after the nfs server was up and running again until I added some more space on the WAL fs, although if it had started archiving, space would have been there. I wonder if archiving could start before postgresql, at least to make a little room for the database engine to really start. Hope I'm clear... Thank you for your attention. Best Regards -- Olivier PRENANT Tel: +33-5-61-50-97-00 (Work) 15, Chemin des Monges +33-5-61-50-97-01 (Fax) 31190 AUTERIVE +33-6-07-63-80-64 (GSM) FRANCE Email: [EMAIL PROTECTED] -- Make your life a dream, make your dream a reality. (St Exupery) ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Vacuuming anything zeroes shared table stats
Michael Fuhr wrote: Is vacuuming any table supposed to zero the statistics for all shared tables? Doesn't that have implications for autovacuum? The example below is in 8.2.4 but I'm seeing similar behavior in 8.1.9 and 8.3devel. The problem is that the database hash is cleared of databases that no longer exist, and the database list is constructed by scanning pg_database. Since no entry exists for the database we use for shared tables (InvalidOid), its hash table entry is dropped. The attached patch fixes this. Additionally, in 8.3devel doing anything that queries or modifies a shared table seems to zero the statistics for all shared tables. I'm not sure if this is fixed by the patch; can you verify, or provide a more specific description of the problem? -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Index: src/backend/postmaster/pgstat.c
===
RCS file: /home/alvherre/Code/cvs/pgsql/src/backend/postmaster/pgstat.c,v
retrieving revision 1.158
diff -c -p -r1.158 pgstat.c
*** src/backend/postmaster/pgstat.c	27 May 2007 17:28:35 -	1.158
--- src/backend/postmaster/pgstat.c	7 Jun 2007 15:25:30 -
*** pgstat_vacuum_tabstat(void)
*** 897,902 ****
--- 897,906 ----
   * Collect the OIDs of either all databases or all tables, according to
   * the parameter, into a temporary hash table.  Caller should hash_destroy
   * the result when done with it.
+  *
+  * NB -- this function adds an entry with InvalidOid if asked for a list of
+  * databases.  This is because we use a pseudo-database with that OID to
+  * keep stats for shared tables.
   * --
   */
  static HTAB *
*** pgstat_collect_oids(Oid catalogid)
*** 930,935 ****
--- 934,950 ----
  	heap_endscan(scan);
  	heap_close(rel, AccessShareLock);
  
+ 	/*
+ 	 * If we were asked for databases, add an entry for the pseudo-database
+ 	 * with InvalidOid, where we store shared tables.
+ 	 */
+ 	if (catalogid == DatabaseRelationId)
+ 	{
+ 		Oid		invalid = InvalidOid;
+ 
+ 		hash_search(htab, (void *) &invalid, HASH_ENTER, NULL);
+ 	}
+ 
  	return htab;
  }

---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Vacuuming anything zeroes shared table stats
Alvaro Herrera [EMAIL PROTECTED] writes: The problem is that the database hash is cleared of databases that no longer exist, and the database list is constructed by scanning pg_database. Since no entry exists for the database we use for shared tables (InvalidOid), its hash table entry is dropped. Doh ... The attached patch fixes this. Wouldn't it be easier to just special-case the shared DB in pgstat_vacuum_tabstat?

  	while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
  	{
  		Oid		dbid = dbentry->databaseid;
  
  		CHECK_FOR_INTERRUPTS();
  
- 		if (hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
+ 		/* ignore the DB entry for shared tables ... they never go away */
+ 		if (OidIsValid(dbid) &&
+ 			hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
  			pgstat_drop_database(dbid);
  	}

Additionally, in 8.3devel doing anything that queries or modifies a shared table seems to zero the statistics for all shared tables. I'm not sure if this is fixed by the patch; can you verify, or provide a more specific description of the problem? Seems unlikely that this bug would explain a behavior like that. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] [COMMITTERS] pgsql: Create a GUC parametertemp_tablespacesthat allows selection of
Jaime Casanova [EMAIL PROTECTED] writes: On 6/4/07, Tom Lane [EMAIL PROTECTED] wrote: Perhaps a reasonable compromise could work like this: at the first point in a transaction where a temp file is created, choose a random list element, and thereafter advance cyclically for the duration of that transaction. ok. are you doing this? or can i prepare a patch that implements this? i guess we can allocate the memory for the list in TopTransactionContext. I'll work on it ... I want to rejigger the API between fd.c and tablespace.c anyway. (fd.c still shouldn't be calling tablespace.c ...) regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
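The compromise Tom describes (random starting element at a transaction's first temp file, then cyclic advance for the rest of the transaction) can be sketched as follows. The names are illustrative, not the eventual fd.c/tablespace.c code:

```c
#include <assert.h>
#include <stdlib.h>

/* -1 means "no temp tablespace chosen yet in this transaction". */
static int next_temp_tablespace = -1;

/* Pick a random list element on first use in a transaction, then
 * advance cyclically through the list for the transaction's duration. */
static int
choose_temp_tablespace(int ntablespaces)
{
    int result;

    if (next_temp_tablespace < 0)               /* first temp file this xact */
        next_temp_tablespace = rand() % ntablespaces;

    result = next_temp_tablespace;
    next_temp_tablespace = (next_temp_tablespace + 1) % ntablespaces;
    return result;
}

/* At commit or abort, forget the cursor so the next transaction
 * starts from a fresh random element. */
static void
at_transaction_end(void)
{
    next_temp_tablespace = -1;
}
```

This spreads concurrent transactions across the list (via the random start) while keeping one transaction's temp files distributed over all tablespaces (via the cyclic advance).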
Re: [HACKERS] Autovacuum launcher doesn't notice death of postmaster immediately
On Mon, Jun 04, 2007 at 11:04:26AM -0400, Alvaro Herrera wrote: The launcher is set up to wake up in autovacuum_naptime seconds at most. So if the user configures a ridiculous time (for example 86400 seconds, which I've seen) then the launcher would not detect the postmaster death. Yeah, I've seen people set that up with the intention of "now autovacuum will only run during our slow time!". I'm thinking it'd be worth mentioning in the docs that this won't work, and instead suggesting that they run vacuumdb -a or equivalent at that time instead. Thoughts? -- Jim Nasby [EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Re: [HACKERS] [PATCHES] [BUGS] BUG #3326: Invalid lower bound of autovacuum_cost_limit
Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: But this is misleading (started postmaster with good value, then edited postgresql.conf and entered -2): 17903 LOG: received SIGHUP, reloading configuration files 17903 LOG: -2 is outside the valid range for parameter autovacuum_vacuum_cost_limit (-1 .. 1000) Note how it still says the range is -1 .. 1000. Can we redefine things to make zero be the disabled value, thus keeping the range of valid values contiguous? That would be another solution ... though it would be different from the valid value for autovacuum_vacuum_cost_delay (on which 0 is a valid value). Also it would be a different value from previous versions. I don't think either of these is a showstopper, so let's go for that if nobody objects. -- Alvaro Herrera Developer, http://www.PostgreSQL.org/ inflex really, I see PHP as like a strange amalgamation of C, Perl, Shell crab inflex: you know that amalgam means mixture with mercury, more or less, right? crab i.e., deadly poison ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] [PATCHES] [BUGS] BUG #3326: Invalid lower bound of autovacuum_cost_limit
Alvaro Herrera wrote: Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: But this is misleading (started postmaster with good value, then edited postgresql.conf and entered -2): 17903 LOG: received SIGHUP, reloading configuration files 17903 LOG: -2 is outside the valid range for parameter autovacuum_vacuum_cost_limit (-1 .. 1000) Note how it still says the range is -1 .. 1000. Can we redefine things to make zero be the disabled value, thus keeping the range of valid values contiguous? That would be another solution ... though it would be different from the valid value for autovacuum_vacuum_cost_delay (on which 0 is a valid value). Also it would be a different value from previous versions. I don't think either of these is a showstopper, so let's go for that if nobody objects. Can you make 0 and -1 both valid disabled values? That way it will be compatible with previous releases. ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
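The "accept both 0 and -1 as disabled" idea can be sketched like this. The function and parameter names are hypothetical, not the actual GUC code; per the autovacuum docs semantics, "disabled" means the per-autovacuum setting falls back to the global vacuum_cost_limit:

```c
#include <assert.h>

/* Sketch: resolve the effective cost limit, treating both 0 (new style)
 * and -1 (pre-existing style) as "disabled", i.e. inherit the global
 * vacuum_cost_limit. Illustrative names only. */
static int
effective_cost_limit(int autovacuum_value, int global_value)
{
    if (autovacuum_value == 0 || autovacuum_value == -1)
        return global_value;     /* disabled: fall back to global setting */
    return autovacuum_value;
}
```

Making both spellings valid keeps old postgresql.conf files working while letting the documented range stay contiguous at 0 .. 1000 (with -1 as a grandfathered alias).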
Re: [HACKERS] Controlling Load Distributed Checkpoints
Tom Lane wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: Thinking about this whole idea a bit more, it occured to me that the current approach to write all, then fsync all is really a historical artifact of the fact that we used to use the system-wide sync call instead of fsyncs to flush the pages to disk. That might not be the best way to do things in the new load-distributed-checkpoint world. How about interleaving the writes with the fsyncs? I don't think it's a historical artifact at all: it's a valid reflection of the fact that we don't know enough about disk layout to do low-level I/O scheduling. Issuing more fsyncs than necessary will do little except guarantee a less-than-optimal scheduling of the writes. I'm not proposing to issue any more fsyncs. I'm proposing to change the ordering so that instead of first writing all dirty buffers and then fsyncing all files, we'd write all buffers belonging to a file, fsync that file only, then write all buffers belonging to next file, fsync, and so forth. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Controlling Load Distributed Checkpoints
Heikki Linnakangas [EMAIL PROTECTED] writes: Tom Lane wrote: I don't think it's a historical artifact at all: it's a valid reflection of the fact that we don't know enough about disk layout to do low-level I/O scheduling. Issuing more fsyncs than necessary will do little except guarantee a less-than-optimal scheduling of the writes. I'm not proposing to issue any more fsyncs. I'm proposing to change the ordering so that instead of first writing all dirty buffers and then fsyncing all files, we'd write all buffers belonging to a file, fsync that file only, then write all buffers belonging to next file, fsync, and so forth. But that means that the I/O to different files cannot be overlapped by the kernel, even if it would be more efficient to do so. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Controlling Load Distributed Checkpoints
Tom Lane wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: Tom Lane wrote: I don't think it's a historical artifact at all: it's a valid reflection of the fact that we don't know enough about disk layout to do low-level I/O scheduling. Issuing more fsyncs than necessary will do little except guarantee a less-than-optimal scheduling of the writes. I'm not proposing to issue any more fsyncs. I'm proposing to change the ordering so that instead of first writing all dirty buffers and then fsyncing all files, we'd write all buffers belonging to a file, fsync that file only, then write all buffers belonging to the next file, fsync, and so forth. But that means that the I/O to different files cannot be overlapped by the kernel, even if it would be more efficient to do so. True. On the other hand, if we issue writes in essentially random order, we might fill the kernel buffers with random blocks and the kernel needs to flush them to disk as almost random I/O. If we did the writes in groups, the kernel has a better chance at coalescing them. I tend to agree that if the goal is to finish the checkpoint as quickly as possible, the current approach is better. In the context of load distributed checkpoints, however, it's unlikely the kernel can do any significant overlapping since we're trickling the writes anyway. Do we need both strategies? I'm starting to feel we should give up on smoothing the fsyncs and distribute the writes only, for 8.3. As we get more experience with that and its shortcomings, we can enhance our checkpoints further in 8.4. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Vacuuming anything zeroes shared table stats
On Thu, Jun 07, 2007 at 11:41:56AM -0400, Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: The attached patch fixes this. Wouldn't it be easier to just special-case the shared DB in pgstat_vacuum_tabstat? Thanks; I'll test these patches when I get a chance. Additionally, in 8.3devel doing anything that queries or modifies a shared table seems to zero the statistics for all shared tables. I'm not sure if this is fixed by the patch; can you verify, or provide a more specific description of the problem? Seems unlikely that this bug would explain a behavior like that. Further investigation shows that what really seems to be happening in 8.3devel is that the statistics for shared tables are reset every 15 seconds when autovacuum is enabled, which it is by default. I don't observe this phenomenon when autovacuum is disabled. Here's a test case:

select * from pg_pltemplate;
select seq_scan, idx_scan, now() from pg_stat_all_tables where relname in ('pg_pltemplate');

Repeat the second select until the statistics for pg_pltemplate become zero, which should be within 15 seconds. Repeating the experiment should reveal a 15-second cycle of statistics resets. In case this behavior is platform-dependent: I'm testing on Solaris 9 sparc. -- Michael Fuhr ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Vacuuming anything zeroes shared table stats
Michael Fuhr [EMAIL PROTECTED] writes: Further investigation shows that what really seems to be happening in 8.3devel is that the statistics for shared tables are reset every 15 seconds when autovacuum is enabled, which it is by default. I don't observe this phenomenon when autovacuum is disabled. OK, so it's just that pgstat_vacuum_tabstat() gets run by autovacuum. So either form of Alvaro's patch should fix it. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Controlling Load Distributed Checkpoints
On Thu, 7 Jun 2007, Heikki Linnakangas wrote: So there's two extreme ways you can use LDC: 1. Finish the checkpoint as soon as possible, without disturbing other activity too much 2. Disturb other activity as little as possible, as long as the checkpoint finishes in a reasonable time. Are both interesting use cases, or is it enough to cater for just one of them? I think 2 is easier to tune. The motivation for the (1) case is that you've got a system that's dirtying the buffer cache very fast in normal use, where even the background writer is hard pressed to keep the buffer pool clean. The checkpoint is the most powerful and efficient way to clean up many dirty buffers out of such a buffer cache in a short period of time so that you're back to having room to work in again. In that situation, since there are many buffers to write out, you'll also be suffering greatly from fsync pauses. Being able to synchronize writes a little better with the underlying OS to smooth those out is a huge help. I'm completely biased because of the workloads I've been dealing with recently, but I consider (2) so much easier to tune for that it's barely worth worrying about. If your system is so underloaded that you can let the checkpoints take their own sweet time, I'd ask if you have enough going on that you're suffering very much from checkpoint performance issues anyway. I'm used to being in a situation where if you don't push out checkpoint data as fast as physically possible, you end up fighting with the client backends for write bandwidth once the LRU point moves past where the checkpoint has written out to already. I'm not sure how much always running the LRU background writer will improve that situation. On a Linux system, one way to model it is that the OS flushes dirty buffers to disk at the same rate as we write them, but delayed by dirty_expire_centisecs. That should hold if the writes are spread out enough. If they're really spread out, sure. 
There is congestion avoidance code inside the Linux kernel that makes dirty_expire_centisecs not quite work the way it is described under load. All you can say in the general case is that when dirty_expire_centisecs has passed, the kernel badly wants to write the buffers out as quickly as possible; that could still be many seconds after the expiration time on a busy system, or on one with slow I/O. On every system I've ever played with Postgres write performance on, I discovered that the memory-based parameters like dirty_background_ratio were really driving write behavior, and I almost ignore the expire timeout now. Plotting the Dirty: value in /proc/meminfo as you're running tests is extremely informative for figuring out what Linux is really doing underneath the database writes. The influence of the congestion code is why I made the comment about watching how long writes are taking to gauge how fast you can dump data onto the disks. When you're suffering from one of the congestion mechanisms, the initial writes start blocking, even before the fsync. That behavior is almost undocumented outside of the relevant kernel source code. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
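The suggestion above — plot the Dirty: value from /proc/meminfo while a test runs — needs only a trivial parser. This sketch extracts the figure from one meminfo line; the parse function is shown standalone so it runs anywhere, with the Linux-only file-reading loop left as a comment (the sample values are made up for illustration):

```c
#include <stdio.h>

/* Parse one /proc/meminfo line; return the kB figure if it is the
 * "Dirty:" line, or -1 otherwise. */
static long
parse_dirty_kb(const char *line)
{
    long kb;

    if (sscanf(line, "Dirty: %ld kB", &kb) == 1)
        return kb;
    return -1;
}

/*
 * On Linux, sample it once a second while the benchmark runs:
 *
 *   FILE *f = fopen("/proc/meminfo", "r");
 *   char line[128];
 *   while (fgets(line, sizeof(line), f))
 *       if (parse_dirty_kb(line) >= 0)
 *           printf("dirty: %ld kB\n", parse_dirty_kb(line));
 *   fclose(f);
 */
```

Graphing those samples against checkpoint start/end times shows whether dirty_background_ratio or dirty_expire_centisecs is actually driving the flushing, which is the diagnosis Greg describes.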
Re: [HACKERS] Vacuuming anything zeroes shared table stats
Tom Lane wrote: Michael Fuhr [EMAIL PROTECTED] writes: Further investigation shows that what really seems to be happening in 8.3devel is that the statistics for shared tables are reset every 15 seconds when autovacuum is enabled, which it is by default. I don't observe this phenomenon when autovacuum is disabled. OK, so it's just that pgstat_vacuum_tabstat() gets run by autovacuum. So either form of Alvaro's patch should fix it. Right. Committed. Thanks for the report! -- Alvaro Herrera Developer, http://www.PostgreSQL.org/ El miedo atento y previsor es la madre de la seguridad (E. Burke) ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] [PATCHES] [BUGS] BUG #3326: Invalid lower bound of autovacuum_cost_limit
Matthew T. O'Connor wrote: Alvaro Herrera wrote: Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: But this is misleading (started postmaster with good value, then edited postgresql.conf and entered -2): 17903 LOG: received SIGHUP, reloading configuration files 17903 LOG: -2 is outside the valid range for parameter autovacuum_vacuum_cost_limit (-1 .. 1000) Note how it still says the range is -1 .. 1000. Can we redefine things to make zero be the disabled value, thus keeping the range of valid values contiguous? That would be another solution ... though it would be different from the valid value for autovacuum_vacuum_cost_delay (on which 0 is a valid value). Also it would be a different value from previous versions. I don't think either of these is a showstopper, so let's go for that if nobody objects. Can you make 0 and -1 both valid disabled values? That way it will be compatible with previous releases. Heh, sure, we can do that too and it doesn't seem like anybody would object. I will patch the documentation so that the disabled value is zero, and still allow -1. That way it doesn't seem like there should be any objection. -- Alvaro Herrera http://www.flickr.com/photos/alvherre/ Listen and you will forget; see and you will remember; do and you will understand (Confucius) ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Autovacuum launcher doesn't notice death of postmaster immediately
On 6/7/07, Jim C. Nasby [EMAIL PROTECTED] wrote: On Mon, Jun 04, 2007 at 11:04:26AM -0400, Alvaro Herrera wrote: The launcher is set up to wake up in autovacuum_naptime seconds at most. So if the user configures a ridiculous time (for example 86400 seconds, which I've seen) then the launcher would not detect the postmaster death Is there some threshold after which we should have PostgreSQL emit a warning to the effect of "autovacuum_naptime is very large. Are you sure you know what you're doing?" Yeah, I've seen people set that up with the intention of "now autovacuum will only run during our slow time!". I'm thinking it'd be worth mentioning in the docs that this won't work, and instead suggesting that they run vacuumdb -a or equivalent at that time instead. Thoughts? Hmmm... it seems to me that points new users towards not using autovacuum, which doesn't seem like the best idea. I think it'd be better to say that setting the naptime really high is a Bad Idea. Instead, if they want to shift maintenance to off hours they should consider using a cron job that bonks around the pg_autovacuum.vac_base_thresh or vac_scale_factor values for tables they don't want vacuumed during operational hours (set them really high at the start of operational hours, then to normal during off hours). Tweaking the enabled column would work too, but they presumably don't want to disable ANALYZE, although it's entirely likely that new users don't know what ANALYZE does, in which case they _really_ don't want to disable it. This should probably be very close to a section that says something about how insufficient maintenance can be expected to lead to greater performance issues than using autovacuum with default settings. Assuming we believe that to be the case, which I think is reasonable given that we are now defaulting to having autovacuum enabled.
Andrew ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
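The cron-based scheme Jim sketches above could generate its SQL like this. A hypothetical helper, not a tested recipe: the threshold values and the vacrelid OID are illustrative, the column names come from the 8.2-era pg_autovacuum catalog mentioned in the thread, and the statements would be piped through psql from cron at the boundaries of business hours:

```python
# Raise a table's vacuum threshold sky-high at the start of operational
# hours, then restore it afterwards, as suggested above.
BUSINESS_HOURS_THRESH = 1000000000   # effectively "never vacuum"
OFF_HOURS_THRESH = 500               # back to a normal threshold

def thresh_update_sql(vacrelid, thresh):
    """Build the UPDATE a cron job would run against pg_autovacuum.
    vacrelid is the OID of the target table's pg_class row."""
    return ("UPDATE pg_autovacuum SET vac_base_thresh = %d "
            "WHERE vacrelid = %d;" % (thresh, vacrelid))
```

Note this deliberately leaves the enabled column (and hence ANALYZE) alone, for the reasons given above.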
[HACKERS] pqlib suggestion
Clients using pqlib can get some boost by increasing the TCP/IP window size. It might be good to make it a settable parameter. Anyway, maybe something like this:

/* --
 * connectNoDelay -
 *     Sets the TCP_NODELAY socket option.
 *     Returns 1 if successful, 0 if not.
 * --
 */
static int
connectNoDelay(PGconn *conn)
{
	int		on = 1;

#ifdef TCP_NODELAY
	if (setsockopt(conn->sock, IPPROTO_TCP, TCP_NODELAY,
				   (char *) &on, sizeof(on)) < 0)
	{
		char	sebuf[256];

		printfPQExpBuffer(&conn->errorMessage,
						  libpq_gettext("could not set socket to TCP no delay mode: %s\n"),
						  SOCK_STRERROR(SOCK_ERRNO, sebuf, sizeof(sebuf)));
		return 0;
	}
#endif

	on = 65535;
	if (setsockopt(conn->sock, SOL_SOCKET, SO_RCVBUF,
				   (char *) &on, sizeof(on)) < 0)
	{
		printfPQExpBuffer(&conn->errorMessage,
						  "connectNoDelay() -- setsockopt SO_RCVBUF failed: errno=%d\n%s\n",
						  errno, strerror(errno));
#ifdef WIN32
		printf("Winsock error: %i\n", WSAGetLastError());
#endif
		return 0;
	}

	on = 65535;
	if (setsockopt(conn->sock, SOL_SOCKET, SO_SNDBUF,
				   (char *) &on, sizeof(on)) < 0)
	{
		printfPQExpBuffer(&conn->errorMessage,
						  "connectNoDelay() -- setsockopt SO_SNDBUF failed: errno=%d\n%s\n",
						  errno, strerror(errno));
#ifdef WIN32
		printf("Winsock error: %i\n", WSAGetLastError());
#endif
		return 0;
	}

	return 1;
}
Re: [HACKERS] Autovacuum launcher doesn't notice death of postmaster immediately
Andrew Hammond [EMAIL PROTECTED] writes: Hmmm... it seems to me that points new users towards not using autovacuum, which doesn't seem like the best idea. I think it'd be better to say that setting the naptime really high is a Bad Idea. It seems like we should have an upper limit on the GUC variable that's less than INT_MAX ;-). Would an hour be sane? 10 minutes? This is independent of the problem at hand, though, which is that we probably want the launcher to notice postmaster death in less time than autovacuum_naptime, for reasonable values of same. regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Controlling Load Distributed Checkpoints
Greg Smith [EMAIL PROTECTED] writes: I'm completely biased because of the workloads I've been dealing with recently, but I consider (2) so much easier to tune for that it's barely worth worrying about. If your system is so underloaded that you can let the checkpoints take their own sweet time, I'd ask if you have enough going on that you're suffering very much from checkpoint performance issues anyway. I'm used to being in a situation where if you don't push out checkpoint data as fast as physically possible, you end up fighting with the client backends for write bandwidth once the LRU point moves past where the checkpoint has written out to already. I'm not sure how much always running the LRU background writer will improve that situation. I think you're working from a faulty premise. There's no relationship between the volume of writes and how important the speed of checkpoint is. In either scenario you should assume a system that is close to the max i/o bandwidth. The only question is which task the admin would prefer take the hit for maxing out the bandwidth, the transactions or the checkpoint. You seem to have imagined that letting the checkpoint take longer will slow down transactions. In fact that's precisely the effect we're trying to avoid. Right now we're seeing tests where Postgres stops handling *any* transactions for up to a minute. In virtually any real world scenario that would simply be unacceptable. That one-minute outage is a direct consequence of trying to finish the checkpoint as quick as possible. If we spread it out then it might increase the average i/o load if you sum it up over time, but then you just need a faster i/o controller. The only scenario where you would prefer the absolute lowest i/o rate summed over time would be if you were close to maxing out your i/o bandwidth, couldn't buy a faster controller, and response time was not a factor, only sheer volume of transactions processed mattered. 
That's a much less common scenario than caring about response time. And on the flip side, when you do have to worry about response time, buying a faster controller doesn't even help. It would shorten the duration of the checkpoint but not eliminate it. A 30-second outage every half hour is just as unacceptable as a 1-minute outage every half hour. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Autovacuum launcher doesn't notice death of postmaster immediately
Tom Lane wrote: Andrew Hammond [EMAIL PROTECTED] writes: Hmmm... it seems to me that points new users towards not using autovacuum, which doesn't seem like the best idea. I think it'd be better to say that setting the naptime really high is a Bad Idea. It seems like we should have an upper limit on the GUC variable that's less than INT_MAX ;-). Would an hour be sane? 10 minutes? This is independent of the problem at hand, though, which is that we probably want the launcher to notice postmaster death in less time than autovacuum_naptime, for reasonable values of same. Do we need a configurable autovacuum naptime at all? I know I put it in the original contrib autovacuum because I had no idea what knobs might be needed. I can't see a good reason to ever have a naptime longer than the default 60 seconds, but I suppose one might want a smaller naptime for a very active system? ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Controlling Load Distributed Checkpoints
On Thu, 7 Jun 2007, Gregory Stark wrote: You seem to have imagined that letting the checkpoint take longer will slow down transactions. And you seem to have imagined that I have so much spare time that I'm just making stuff up to entertain myself and sow confusion. I observed some situations where delaying checkpoints too long ends up slowing down both transaction rate and response time, using earlier variants of the LDC patch and code with similar principles I wrote. I'm trying to keep the approach used here out of the worst of the corner cases I ran into, or at least to make it possible for people in those situations to have some ability to tune out of the bad spots. I am unfortunately not free to disclose all those test results, and since that project is over I can't see how the current LDC compares to what I tested at the time. I plainly stated I had a bias here, one that's not even close to the average case. My concern here was that Heikki would end up optimizing in a direction where a really wide spread across the active checkpoint interval was strongly preferred. I wanted to offer some suggestions on the type of situation where that might not be true, but where a different tuning of LDC would still be an improvement over the current behavior. There are some tuning knobs there that I don't want to see go away until there's been a wider range of tests to prove they aren't effective. Right now we're seeing tests where Postgres stops handling *any* transactions for up to a minute. In virtually any real world scenario that would simply be unacceptable. No doubt; I've seen things get close to that bad myself, both on the high and low end. I collided with the issue in a "maxing out your i/o bandwidth, couldn't buy a faster controller" situation at one point, which is what kicked off my working in this area. It turned out there were still some software tunables left that pulled the worst case down to the 2-5 second range instead.
With more checkpoint_segments to decrease the frequency, that was just enough to make the problem annoying rather than crippling. But after that, I could easily imagine a different application scenario where the behavior you describe is the best case. This is really a serious issue with the current design of the database, one that merely changes instead of going away completely if you throw more hardware at it. I'm perversely glad to hear this is torturing more people than just me as it improves the odds the situation will improve. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] [RFC] GSoC Work on readonly queries done so far
Simon Riggs wrote: On Wed, 2007-06-06 at 16:11 +0200, Florian G. Pflug wrote: .) Added a new GUC operational_mode, which can be set to either readwrite or readonly. If it is set to readwrite (the default), postgres behaves as usual. All the following changes are only in effect if operational_mode is set to readonly. Do we need this? We are already InArchiveRecovery. If I understand you correctly, you suggest that readonly queries are always allowed during archive recovery - so upon startup postgres steps through these states: .) Initial recovery (until we reach a consistent state) .) Allow readonly queries .) Finish recovery in the background (might mean recovering forever on a PITR slave) .) Allow readwrite queries My plan was to have a global switch, which lets you choose between .) All queries are readonly (until the next postmaster restart at least), but you get background replay .) No background replay, but once replay is done, readwrite queries can be executed (just what PG does now). The main reason why I invented that global switch operational_mode was to remove the need to switch between readonly mode and readwrite mode on the fly. .) Created a macro ASSUME_OPMODE_READWRITE that does elog(ERROR) if postgres is not in readwrite mode. This macro protects the following functions to make sure that no writes occur in readonly mode: SimpleLruWritePage, SLruPhysicalWritePage, EndPrepare, FinishPreparedTransaction, XLogInsert, XLogWrite, ShutdownXLog, CreateCheckpoint, MarkBufferDirty. These are Asserts? The macro ASSUME_OPMODE_READWRITE just does if (!OperationalModeReadWrite) elog(ERROR, ...) .) Don't start autovacuum and bgwriter. Instead of bgwriter, bgreplay is started, and it takes over the role that bgwriter plays in the shutdown process. Autovacuum - understood. What does bgreplay do? Why not just start bgwriter earlier and disable some of its other functionality while InRecovery? See above - it seemed simpler to clearly separate .)
Transactions are assigned a dummy xid ReadOnlyTransactionId, that is considered to be later than any other xid. So you are bumping FirstNormalTransactionId up by one for this? In fact I changed MaxTransactionId to 0xFFFFFFFE, and set ReadOnlyTransactionId to 0xFFFFFFFF. Additionally, I changed IsNormalTransactionId to test not only for >= FirstNormalTransactionId, but also for <= MaxTransactionId. You're assuming then that we will freeze replay while we run a query? No. My plan is to first get to a point where replay freezes while queries are running, and to then figure out a more intelligent way to do this. I already have a few ideas how to do this, but I want to complete the simple version before I start with that work. Otherwise doing this will mean the snapshot changes as a query executes. Why? It's only the xid of the transaction, not its xmin and xmax, that is set to ReadOnlyTransactionId. .) A global ReadOnlySnapshot is maintained in shared memory. This is copied into backend local memory by GetReadonlySnapshotData (which replaces GetSnapshotData in readonly mode). .) Crash recovery is not performed in readonly mode - instead, postgres PANICs, and tells the DBA to restart in readwrite mode. Archive recovery of course *will* be allowed, but I'm not that far yet. This is the very heart of the matter. This isn't just a technical issue, it goes to the heart of the use case for this feature. Can we recover while running queries? Yes. My comment applies only to crash recovery - i.e., recovery that happens *without* a recovery.conf present, after a crash. It only really matters if you do the following: .) Start pg in readwrite mode. .) Kill it / It crashes .) Restart in readonly mode. The main difference between crash recovery and recovery from a filesystem-level backup is the additional information that the backup label gives us in the second case - more specifically, the minRecoveryLoc that we read from the backup label.
Only with that knowledge is recovering until we reach a consistent state a well-defined operation. And readonly queries can only be executed *after* we did this minimal recovery. So if there is crash recovery to be done, the best we could do is to recover, and then start in readonly mode. If this is *really* what the DBA wants, he can just start in readwrite mode first, then cleanly shut PG down, and restart in readonly mode. If not, how much time will we spend in replay mode v query mode? Will we be able to run long running queries *and* maintain a reasonable time to recover? Is this a mechanism for providing HA and additional query capacity, or is it just a mechanism for additional query capacity only? Those are open questions to which I don't have any answers yet myself. My goal is to allow replay and queries to run concurrently, at least as long as only inserts, updates and
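The xid-range change Florian describes can be modelled in a few lines. This is a sketch of the described constants, not the actual C code; the values assume the usual 32-bit TransactionId, and FIRST_NORMAL_XID follows PostgreSQL's convention of reserving the lowest ids for invalid/bootstrap/frozen xids:

```python
FIRST_NORMAL_XID = 3        # ids below this are reserved (invalid/bootstrap/frozen)
MAX_XID = 0xFFFFFFFE        # MaxTransactionId, pulled down by one by the patch
READONLY_XID = 0xFFFFFFFF   # ReadOnlyTransactionId: later than any other xid

def is_normal_xid(xid):
    """IsNormalTransactionId now checks both ends of the range, so the
    ReadOnlyTransactionId at the very top is never treated as normal."""
    return FIRST_NORMAL_XID <= xid <= MAX_XID
```

With this check, a readonly backend carrying the dummy xid can never be mistaken for a regular transaction by range-based tests.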
Re: [HACKERS] Controlling Load Distributed Checkpoints
This is really a serious issue with the current design of the database, one that merely changes instead of going away completely if you throw more hardware at it. I'm perversely glad to hear this is torturing more people than just me as it improves the odds the situation will improve. It tortures pretty much any high velocity postgresql db of which there are more and more every day. Joshua D. Drake -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/ ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Minor changes to Recovery related code
On Sat, 2007-03-31 at 00:51 +0200, Florian G. Pflug wrote: Simon Riggs wrote: On Fri, 2007-03-30 at 16:34 -0400, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: 2. pg_stop_backup() should wait until all archive files are safely archived before returning Not sure I agree with that one. If it fails, you can't tell whether the action is done and it failed while waiting for the archiver, or if you need to redo it. There's a slight delay between pg_stop_backup() completing and the archiver doing its stuff. Currently if somebody does a -m fast straight after the pg_stop_backup() the backup may be unusable. We need a way to plug that small hole. I suggest that pg_stop_backup() polls once per second until pg_xlog/archive_status/LOG.ready disappears, in which case it ends successfully. If it does this for more than 60 seconds it ends successfully but produces a WARNING. I fear that ending successfully despite having not archived all WALs will make this feature less worthwhile. If a DBA knows what he is doing, he can code a perfectly safe backup script using 8.2 too. He'll just have to check the current WAL position after pg_stop_backup() (there is a function for that, right?), and wait until the corresponding WAL was archived. In reality, however, I fear that most people will just create a script that does 'echo select pg_stop_backup | psql' or something similar. If they're a bit more careful, they will enable ON_ERROR_STOP, and check the return value of psql. I believe that those are the people who would really benefit from a pg_stop_backup() that waits for archiving to complete. But they probably won't check for WARNINGs. Maybe doing it the other way round would be an option? pg_stop_backup() could wait for the archiver to complete forever, but spit out a warning every 60 seconds or so: WARNING: Still waiting for wal archiving of wal ??? to complete. If someone really wants a 60-second timeout, he can just use statement_timeout.
I've just come up against this problem again, so I think it is a must-fix for this release. Other problems exist also, mentioned on separate threads. We have a number of problems surrounding pg_stop_backup/shutdown: 1. pg_stop_backup() currently returns before the WAL file containing the last change is correctly archived. That is a small hole, but one that is exposed when people write test scripts that immediately shut down the database after issuing pg_stop_backup(). It doesn't make much sense to shut down immediately after a hot backup, but it should still work sensibly. 2. We've also had problems caused by making the archiver wait until all WAL files are archived. If there is a backlog for some reason and the DBA issues a restart (i.e. stop and immediate restart) then making the archiver loop while it tries (and possibly fails) to archive all files would cause an outage. Avoiding this is why we do the current get-out-fast approach. There are some sub-scenarios: a) there is a backlog of WAL files, but no error has occurred on the *last* file (we might have just fixed a problem). b) there is a backlog of WAL files, but an error is causing a retry of the last file. My proposal is for us to record somewhere other than the logs that a failure to archive has occurred and is being retried. Failure to archive will be recorded in the archive_status directory as an additional file called archive_error, which will be deleted in the case of archive success and created in the case of archive error. This maintains the archiver's lack of attachment to shared memory and general simplicity of design. - pg_stop_backup() will wait until the WAL file that ends the backup is safely archived, even if a failure to archive occurs. This is a change to current behaviour, but since it implements the originally *expected* behaviour IMHO it should be the default.
- new function: pg_stop_backup_nowait() return immediately without waiting for archive, the same as the current pg_stop_backup() - new function: pg_stop_backup_wait(int seconds) wait until either an archival fails or the ending WAL file is archived, with a max wait as specified. wait=0 means wait until archive errors are resolved. Alternatives? -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] [RFC] GSoC Work on readonly queries done so far
Heikki Linnakangas wrote: Florian G. Pflug wrote: Jeff Davis wrote: Are you referring to the size of the xip array being a problem? Would it help to tie the size of the xip array to max_connections? I understand that max_connections might be greater on the master, but maybe something similar? That's what I currently do - the xip array on the slave is sized to hold max_connections entries (actually, it's max_connections + max_prepared_xacts I think). The problem occurs exactly if those values are set too small on the slave - and since shared mem objects are not resizeable, I don't see how the slave can handle an xip overflow gracefully other than by not publishing the information in shared memory as long as it doesn't fit there. You could store the value of max_connections in the checkpoint xlog record, and read it from there in the slave. Though one could still change it on the master and restart without restarting the slave as well. But AFAIK shmem allocation happens before recovery starts... Even if this was solved, it would only be a partial solution since, as you note, the master might be restarted while the slave keeps running. So I think it's better not to add too much complexity, and just tell the DBA to increase max_connections on the slave, together with a comment in the documentation never to set max_connections smaller on the slave than on the master. greetings, Florian Pflug ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] [COMMITTERS] pgsql: Avoid losing track of data for shared tables in pgstats.
Alvaro Herrera [EMAIL PROTECTED] writes: Log Message: --- Avoid losing track of data for shared tables in pgstats. Report by Michael Fuhr, patch from Tom Lane after a messier suggestion by me. When does this bug date to? is it possible it's related to the performance drop immediately following a vacuum analyze we've been seeing? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] [RFC] GSoC Work on readonly queries done so far
Jeff Davis wrote: On Wed, 2007-06-06 at 22:36 +0100, Simon Riggs wrote: .) Transactions are assigned a dummy xid ReadOnlyTransactionId, that is considered to be later than any other xid. So you are bumping FirstNormalTransactionId up by one for this? You're assuming then that we will freeze replay while we run a query? Otherwise doing this will mean the snapshot changes as a query executes. Is it possible to put a normal xmax for the snapshot? It wouldn't be a real transaction on the slave, and also the master will use that ID for a real transaction itself. However, I don't see a real problem on the slave because it would only be used for the purpose of the snapshot we need at that moment. My plan is the following: .) Initially, queries and recovery will run interleaved, but not concurrently. For that, an empty snapshot is sufficient, with xmin=xid=xmax=ReadOnlyTransactionId. .) Then, I'll work on running them concurrently. The replay process will publish a current snapshot in shared memory, using real xmin and xmax values it generates by maintaining a list of currently active (as in: running when the wal was written on the master) transactions. In that case, only xid is set to ReadOnlyTransactionId. greetings, Florian Pflug ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
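Florian's two snapshot flavours can be expressed as a toy model. This is a hypothetical Python sketch, not the actual SnapshotData struct; the constant assumes 32-bit transaction ids, and the field names are simplified:

```python
from collections import namedtuple

READONLY_XID = 0xFFFFFFFF   # dummy xid from the proposal above

Snapshot = namedtuple("Snapshot", "xmin xmax xip")

def empty_readonly_snapshot():
    """Phase 1: queries and replay interleave but never run concurrently,
    so an empty snapshot with xmin = xmax = ReadOnlyTransactionId suffices:
    everything committed is visible, nothing is in progress."""
    return Snapshot(xmin=READONLY_XID, xmax=READONLY_XID, xip=())

def replay_snapshot(active_xids):
    """Phase 2: the replay process publishes real xmin/xmax built from the
    transactions that were running on the master when the WAL was written;
    only the readonly query's own xid stays ReadOnlyTransactionId."""
    if not active_xids:
        return empty_readonly_snapshot()
    return Snapshot(xmin=min(active_xids),
                    xmax=max(active_xids) + 1,
                    xip=tuple(sorted(active_xids)))
```

The point of the model is that a row version is invisible if its xid is >= xmax or appears in xip, exactly as with a normal MVCC snapshot; the dummy xid never enters that calculation.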
Re: [HACKERS] [ADMIN] Attempt to re-archive existing WAL logsafterrestoringfrom backup
On Thu, 2007-06-07 at 22:20 +0100, Simon Riggs wrote: On Sat, 2007-06-02 at 01:07 +0100, Jon Colverson wrote: Simon Riggs wrote: I've spent some time re-examining the code and I can't see a route for the discussed problem to occur. All the main code paths do not exhibit this problem in 8.2, so I'm not sure where to go from here. If you can help further, please give me a shout. Thanks a lot for looking into this. I've attached a self-contained minimal shell script which sets up a cluster and demonstrates the problem. I hope that will provide an unambiguous description. I'm running it on Ubuntu 7.04 (i386) and the script includes all the commands to build PostgreSQL from scratch, but obviously you can snip that stuff out and change the paths to suit your environment. The script ends with the server log showing the archiver attempting to re-archive the last log that was restored in the recovery procedure. Jon, Your script is correct and there is a problem, as you demonstrate. Thank you for bringing this to the list. You have followed the manual in full. In Step 6 of the recovery procedure outlined here http://developer.postgresql.org/pgdocs/postgres/continuous-archiving.html#BACKUP-PITR-RECOVERY we say that if you have partially completed WAL logs these should be copied back into pg_xlog. This is correct and if we do this we avoid the error you have highlighted. In the case of a true disaster recovery, as you have attempted to test, no files are available and so an attempt is made to archive the last restored file. That attempt fails iff you have done as the manual suggests and used an archive_command that prevents duplicate file insertions. I'll repost to pgsql-hackers to discuss solutions. Fixes are simple, but require some discussion. Currently if we perform an archive recovery and the end of logs occur during or at the end of an xlog that has already been archived then we will fail and keep failing to archive anything. 
That's not much fun and it appears to me that this problem goes back to the early days of PITR and may require backpatching. Currently if an xlog file has a missing archive_status file then we create a .ready status, which leads to the command being retried, which might fail or might not. Since the last file on an archive recovery seldom has an archive status this leads to failure situations when the archive_command prevents duplicate files from existing in the archive. AFAICS we have four options: 1. We can assume that if an xlog file has no archive status it must have been archived. Hmmm, sounds like a poor assumption. 2. We can relax the recommendation to have an archive_command that prevents duplicate logs. Although I haven't in recent times seen this as too strong a requirement because we have timelines, I don't think we should relax this. 3. We craft a special archive_status file for the last xlog file in an archive, so that it never gets archived. 4. As Warren Little recently regretted, if we have a corrupt xlog file then end of logs happens much earlier than anticipated. When this occurs many subsequent xlogs would be named similarly to files already in the archive, yet they would be a separate timeline. We could solve this problem as well by simply bumping the TLI each time we complete a recovery, whatever the reason. Presumably the archives will not stretch back as far as 2 billion recoveries. Option 3 is straightforward and fixes the issue directly; option 4 solves many problems in a bold sweep of the design pen, with no obvious downside, AFAICS. Both are simple and can be backpatched easily. Initially I thought you'd fallen foul of another problem which is known to exist, which is caused by immediately shutting down a server after having taken a hot backup. This isn't a problem in real usage though it is a problem in a scripted test. I've already proposed solutions there and will pick up that thread again also. On another thread on -hackers.
-- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] [COMMITTERS] pgsql: Avoid losing track of data for shared tables in pgstats.
Gregory Stark wrote: Alvaro Herrera [EMAIL PROTECTED] writes: Log Message: --- Avoid losing track of data for shared tables in pgstats. Report by Michael Fuhr, patch from Tom Lane after a messier suggestion by me. When does this bug date to? It was in 8.1. I didn't verify whether it affects 8.0; I think we separated the shared tables in pgstats in 8.1, with the advent of autovacuum, so I assume it doesn't. The patch doesn't apply cleanly to 8.0 anyway, and I decided not to spend much time on it seeing that nobody has noticed it in years. is it possible it's related to the performance drop immediately following a vacuum analyze we've been seeing? I don't think so, unless you were counting on pgstats data of shared tables for something. The optimizer, for one, doesn't, so I doubt it would affect query planning. And it would only affect you if your queries were using shared tables, which I very much doubt ... -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] little PITR annoyance
On Thu, 2007-06-07 at 17:01 +0200, [EMAIL PROTECTED] wrote: I've been bitten twice by this one: I enabled archiving on both my pgsql servers. The archive command was a simple gzip to an nfs mounted dir. Now the nfs server crashed. Both WAL spaces filled up until postgresql shut itself down because of no more space for WAL. That's perfectly normal and expected. What I didn't expect and don't understand is that postgresql refused to start up after the nfs server was up and running, until I added some more space on the WAL fs, although if it had started archiving, space would have been there. I wonder if archiving could start before postgresql, at least to make a little room for the database engine to really start. gzip writes a new file and then deletes the old, doesn't it? So it must require space on the xlog drive. Does it still fail if you avoid using gzip and just use scp? -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
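Simon's scratch-space concern can also be dodged without giving up compression: gzip -c writes only to stdout, so the compressed output can be redirected straight into the (NFS-mounted) archive and nothing extra lands on the WAL volume. A hedged sketch, with the function name and directories invented:

```shell
#!/bin/sh
# Sketch: compress directly into the archive directory instead of
# gzip-ing in place.  No temporary file is created next to the WAL
# segment, so a full pg_xlog filesystem does not block archiving.
archive_gz() {             # archive_gz SRC_PATH ARCHIVE_DIR FILE_NAME
    src=$1; dst=$2/$3.gz
    test ! -f "$dst" || return 1       # still refuse duplicates
    gzip -c "$src" > "$dst"
}
```

A production version would want to write to a temporary name on the archive side and rename into place, so a crash mid-compress doesn't leave a truncated .gz that the duplicate check then refuses to overwrite.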
Best Practice for running vacuums during off hours WAS Re: [HACKERS] Autovacuum launcher doesn't notice death of postmaster immediately
On 6/7/07, Matthew T. O'Connor wrote: Tom Lane wrote: Andrew Hammond writes: Hmmm... it seems to me that points new users towards not using autovacuum, which doesn't seem like the best idea. I think it'd be better to say that setting the naptime really high is a Bad Idea. It seems like we should have an upper limit on the GUC variable that's less than INT_MAX ;-). Would an hour be sane? 10 minutes? This is independent of the problem at hand, though, which is that we probably want the launcher to notice postmaster death in less time than autovacuum_naptime, for reasonable values of same. Do we need a configurable autovacuum naptime at all? I know I put it in the original contrib autovacuum because I had no idea what knobs might be needed. I can't see a good reason to ever have a naptime longer than the default 60 seconds, but I suppose one might want a smaller naptime for a very active system? That's a good question. I can't see any reason for a naptime longer than 60 seconds either. I think very large naptime settings are a symptom of another issue: what's the Right Way to defer vacuums until off hours? Is that even a desirable thing anymore? I don't think it is in the majority of cases. I originally thought that this was more of a Best Practices issue (ie, fix in the docs, not the code), but now I'm wondering if there's much call for supporting the idea of being more aggressive with vacuums at different times of the day / week / month. Anyone? ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: Best Practice for running vacuums during off hours WAS Re: [HACKERS] Autovacuum launcher doesn't notice death of postmaster immediately
Andrew Hammond wrote: That's a good question. I can't see any reason for a naptime longer than 60 seconds either. I think very large naptime settings are a symptom of another issue: what's the Right Way to defer vacuums until off hours? Is that even a desirable thing anymore? I don't think it is in the majority of cases. I originally thought that this was more of a Best Practices issue (ie, fix in the docs, not the code), but now I'm wondering if there's much call for supporting the idea of being more aggressive with vacuums at different times of the day / week / month. Anyone? That's why the intention is to have the autovacuum scheduling feature be a mechanism for changing the autovac parameters according to date and time. We even have a Google Summer of Code project about that. -- Alvaro Herrera http://www.amazon.com/gp/registry/DXLWNGRJD34J Virtue is the just mean between two defects (Aristotle) ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
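The "defer vacuums to off hours" practice the thread keeps circling can be expressed today without abusing autovacuum_naptime: an external cron job that only does work inside a maintenance window. A sketch, with the window bounds and database name entirely made up:

```shell
#!/bin/sh
# Sketch of off-hours vacuuming via cron rather than a huge naptime.
# in_window tests whether hour $1 falls in [start, end), handling
# windows that cross midnight (e.g. 22:00-06:00).
in_window() {              # in_window HOUR START END
    hour=$1; start=$2; end=$3
    if [ "$start" -le "$end" ]; then
        [ "$hour" -ge "$start" ] && [ "$hour" -lt "$end" ]
    else
        [ "$hour" -ge "$start" ] || [ "$hour" -lt "$end" ]
    fi
}

# Cron would run this hourly; vacuuming happens only in the 22:00-06:00
# window (database name "mydb" is hypothetical):
#   in_window "$(date +%H)" 22 6 && vacuumdb --analyze mydb
```

This keeps autovacuum free to run with its normal 60-second naptime for emergencies, while the heavy scheduled work lands in the quiet hours.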
Re: [HACKERS] [COMMITTERS] pgsql: Avoid losing track of data for shared tables in pgstats.
Alvaro Herrera [EMAIL PROTECTED] writes: Gregory Stark wrote: is it possible it's related to the performance drop immediately following a vacuum analyze we've been seeing? I don't think so, unless you were counting on pgstats data of shared tables for something. The optimizer, for one, doesn't, so I doubt it would affect query planning. And it would only affect you if your queries were using shared tables, which I very much doubt ... Does anything use the pgstats data for anything other than presenting feedback to users? Autovacuum uses it to estimate when tables should be vacuumed right? This wouldn't have caused autovacuum to go nuts vacuuming these tables would it? But I doubt even then that it could consume much i/o bandwidth. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] TOAST usage setting
Gregory Stark wrote: Bruce Momjian [EMAIL PROTECTED] writes: Well, it is summarized here: http://momjian.us/expire/TOAST/SUMMARY.html It made non-TOAST access 2x faster, but TOAST 7x slower, and that seemed like a good compromise. Is this still testing with all data fitting in RAM? Yes. Having things out of RAM is going to make access even slower, but it is going to allow the heap to be in RAM more often. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] .conf File Organization WAS: Controlling Load Distributed Checkpoints
All, This brings up another point. With the increased number of .conf options, the file is getting hard to read again. I'd like to do another reorganization, but I don't really want to break people's diff scripts. Should I worry about that? --Josh ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] .conf File Organization WAS: Controlling Load Distributed Checkpoints
Josh Berkus wrote: All, This brings up another point. With the increased number of .conf options, the file is getting hard to read again. I'd like to do another reorganization, but I don't really want to break people's diff scripts. Should I worry about that? As a point of feedback, autovacuum and vacuum should be together. Joshua D. Drake --Josh ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/ ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] [COMMITTERS] pgsql: Avoid losing track of data for shared tables in pgstats.
Gregory Stark wrote: Alvaro Herrera [EMAIL PROTECTED] writes: Gregory Stark wrote: is it possible it's related to the performance drop immediately following a vacuum analyze we've been seeing? I don't think so, unless you were counting on pgstats data of shared tables for something. The optimizer, for one, doesn't, so I doubt it would affect query planning. And it would only affect you if your queries were using shared tables, which I very much doubt ... Does anything use the pgstats data for anything other than presenting feedback to users? Not that I know of. Autovacuum uses it to estimate when tables should be vacuumed right? Yep This wouldn't have caused autovacuum to go nuts vacuuming these tables would it? But I doubt even then that it could consume much i/o bandwidth. Yes, but keep in mind that these are only the shared tables: pg_database, pg_authid, pg_shdepend, etc. Those are not tables that you're going to use regularly, much less _bloat_ regularly enough that they need frequent vacuuming. Maybe pg_shdepend, because it would be used when creating temp tables. -- Alvaro Herrera http://www.flickr.com/photos/alvherre/ Postgres is bloatware by design: it was built to house PhD theses. (Joey Hellerstein, SIGMOD annual conference 2002) ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] .conf File Organization WAS: Controlling Load Distributed Checkpoints
Josh Berkus [EMAIL PROTECTED] writes: This brings up another point. With the increased number of .conf options, the file is getting hard to read again. I'd like to do another reorganization, but I don't really want to break people's diff scripts. Do you have a better organizing principle than what's there now? regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] [COMMITTERS] pgsql: Avoid losing track of data for shared tables in pgstats.
On Thu, Jun 07, 2007 at 11:46:17PM +0100, Gregory Stark wrote: Does anything use the pgstats data for anything other than presenting feedback to users? Autovacuum uses it to estimate when tables should be vacuumed right? This wouldn't have caused autovacuum to go nuts vacuuming these tables would it? But I doubt even then that it could consume much i/o bandwidth. I discovered this problem after noticing that pg_shdepend had gotten horribly bloated -- apparently due to heavy use of temporary tables by an application whose performance I was investigating -- despite autovacuum being enabled. When I looked at the statistics for pg_shdepend the values for n_tup_{ins,upd,del} were much lower than I expected. After watching the numbers grow for a few minutes I saw them reset; after observing this behavior several times I was able to correlate the resets with vacuums of other tables. Since the statistics for pg_shdepend rarely got high enough to trigger an autovacuum that table was almost never being vacuumed. I suggested to the DBA that he vacuum it manually; after five minutes the vacuum completed and the application's performance improved immediately. INFO: pg_shdepend: found 8475403 removable, 3907 nonremovable row versions in 76783 pages -- Michael Fuhr ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
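The mechanism behind Michael's observation: autovacuum vacuums a table once the dead-tuple count reported by pgstats exceeds autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples, so if the shared-table counters keep getting reset (the bug being fixed), pg_shdepend never clears the bar. A sketch of that arithmetic (the helper is mine; parameter values below are illustrative, not any release's defaults):

```shell
#!/bin/sh
# Sketch of autovacuum's trigger arithmetic:
#   dead_tuples > threshold + (scale_factor) * reltuples
# SCALE_PCT is the scale factor expressed as a percentage, to keep the
# shell interface integer-only; awk does the floating-point comparison.
vacuum_due() {             # vacuum_due DEAD_TUPLES RELTUPLES BASE SCALE_PCT
    awk -v d="$1" -v n="$2" -v b="$3" -v s="$4" \
        'BEGIN { exit !(d > b + (s / 100) * n) }'
}
```

With 8.4 million dead rows (the number from the VACUUM output above) the trigger fires easily; with counters reset to a few hundred, it never does, and the table bloats exactly as described.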
Re: [HACKERS] [COMMITTERS] pgsql: Avoid losing track of data for shared tables in pgstats.
On Thu, Jun 07, 2007 at 06:09:36PM -0400, Alvaro Herrera wrote: Gregory Stark wrote: When does this bug date to? It was in 8.1. I didn't verify whether it affects on 8.0; I think we separated the shared tables in pgstats in 8.1, with the advent of autovacuum, so I assume it doesn't. Doesn't appear to, at least not using the test case I found for 8.1 and later. -- Michael Fuhr ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] [COMMITTERS] pgsql: Avoid losing track of data for shared tables in pgstats.
Michael Fuhr [EMAIL PROTECTED] writes: On Thu, Jun 07, 2007 at 06:09:36PM -0400, Alvaro Herrera wrote: Gregory Stark wrote: When does this bug date to? It was in 8.1. I didn't verify whether it affects on 8.0; I think we separated the shared tables in pgstats in 8.1, with the advent of autovacuum, so I assume it doesn't. Doesn't appear to, at least not using the test case I found for 8.1 and later. Yeah, the separate hashtable for shared rels was added for 8.1: http://archives.postgresql.org/pgsql-committers/2005-07/msg00627.php regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] TOAST usage setting
Bruce Momjian [EMAIL PROTECTED] writes: Gregory Stark wrote: Is this still testing with all data fitting in RAM? Yes. Having things out of RAM is going to make access even slower, but it is going to allow the heap to be in RAM more often. It would let us measure the actual impact of TOAST, the largest negative effect of which is to destroy the sequential access pattern, while the positive effect is, as you say, to increase cache effectiveness on non-toasted data. The CPU costs are insignificant, so testing in-memory cases misses the point. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster