Re: [HACKERS] Performance problem in PLPgSQL
On Wednesday, July 24, 2013 4:23 AM Tom Lane wrote: I wrote: Marc Cousin cousinm...@gmail.com writes: The example below is of course extremely simplified, and obviously not what we are really doing in the database, but it exhibits the slowdown between 9.1.9 and 9.2.4. Hm. Some experimentation shows that the plan cache is failing to switch to a generic plan, but I'm not sure why the cast would have anything to do with that ... Hah, I see why: (gdb) s 1009if (plansource-generic_cost avg_custom_cost * 1.1) (gdb) p plansource-generic_cost $18 = 0.012501 (gdb) p avg_custom_cost $19 = 0.01 (gdb) p avg_custom_cost * 1.1 $20 = 0.011001 That is, the estimated cost of the custom plan is just the evaluation time for a simple constant, while the estimated cost of the generic plan includes a charge for evaluation of int4_numeric(). That makes the latter more than ten percent more expensive, and since this logic isn't considering anything else at all (particularly not the cost of planning), it thinks that makes the custom plan worth picking. We've been around on this before, but not yet thought of a reasonable way to estimate planning cost, let alone compare it to estimated execution costs. Up to now I hadn't thought that was a particularly urgent issue, but this example shows it's worth worrying about. One thing that was suggested in the previous discussion is to base the planning cost estimate on the length of the rangetable. We could do something trivial like add 10 * (length(plan-rangetable) + 1) in this comparison. Another thing that would fix this particular issue, though perhaps not related ones, is to start charging something nonzero for ModifyTable nodes, say along the lines of one seq_page_cost per inserted/modified row. That would knock the total estimated cost for an INSERT up enough that the ten percent threshold wouldn't be exceeded. Shouldn't it consider new value of boundparam to decide whether a new custom plan is needed, as that can be one of the main reason why it would need different plan. Current behavior is either it will choose generic plan or build a new custom plan with new parameters based on Choose_custom_plan(). Shouldn't the behavior of this be as below: a. choose generic plan b. choose one of existing custom plan c. create new custom plan The choice can be made based on the new value of bound parameter. With Regards, Amit Kapila. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Backup throttling
Hello, the purpose of this patch is to limit impact of pg_backup on running server. Feedback is appreciated. // Antonin Houska (Tony) diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml index eb0c1d6..3b7ecfd 100644 --- a/doc/src/sgml/ref/pg_basebackup.sgml +++ b/doc/src/sgml/ref/pg_basebackup.sgml @@ -189,6 +189,22 @@ PostgreSQL documentation /varlistentry varlistentry + termoption-r/option/term + termoption--max-rate/option/term + listitem + para + The maximum amount of data transferred from server per second. + The purpose is to limit impact of + applicationpg_basebackup/application on a running master server. + /para + para + Suffixes literalk/literal (kilobytes) and literalm/literal + (megabytes) are accepted. For example: literal10m/literal + /para + /listitem + /varlistentry + + varlistentry termoption-R/option/term termoption--write-recovery-conf/option/term listitem diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c index a1e12a8..7fb2d78 100644 --- a/src/bin/pg_basebackup/pg_basebackup.c +++ b/src/bin/pg_basebackup/pg_basebackup.c @@ -45,11 +45,23 @@ bool streamwal = false; bool fastcheckpoint = false; bool writerecoveryconf = false; int standby_message_timeout = 10 * 1000; /* 10 sec = default */ +double max_rate = 0; /* Maximum bytes per second. 0 implies + * unlimited rate. */ + +#define MAX_RATE_LOWER 0x8000 /* 32 kB */ +#define MAX_RATE_UPPER 0x4000 /* 1 GB */ +/* + * We shouldn't measure time whenever a small piece of data (e.g. TAR file + * header) has arrived. That would introduce high CPU overhead. + */ +#define RATE_MIN_SAMPLE 32768 /* Progress counters */ static uint64 totalsize; static uint64 totaldone; +static uint64 last_measured; static int tablespacecount; +static int64 last_measured_time; /* Pipe to communicate with background wal receiver process */ #ifndef WIN32 @@ -72,9 +84,14 @@ static volatile LONG has_xlogendptr = 0; static PQExpBuffer recoveryconfcontents = NULL; /* Function headers */ + + + static void usage(void); static void verify_dir_is_empty_or_create(char *dirname); static void progress_report(int tablespacenum, const char *filename); +static double parse_max_rate(char *src); +static void enforce_max_rate(); static void ReceiveTarFile(PGconn *conn, PGresult *res, int rownum); static void ReceiveAndUnpackTarFile(PGconn *conn, PGresult *res, int rownum); @@ -110,6 +127,7 @@ usage(void) printf(_(\nOptions controlling the output:\n)); printf(_( -D, --pgdata=DIRECTORY receive base backup into directory\n)); printf(_( -F, --format=p|t output format (plain (default), tar)\n)); + printf(_( -r, --max-rate maximum transfer rate between server and client\n)); printf(_( -R, --write-recovery-conf\n write recovery.conf after backup\n)); printf(_( -x, --xlog include required WAL files in backup (fetch mode)\n)); @@ -473,6 +491,113 @@ progress_report(int tablespacenum, const char *filename) fprintf(stderr, \r); } +static double +parse_max_rate(char *src) +{ + int factor; + char *after_num; + double result; + + result = strtod(src, after_num); + if (src == after_num) + { + fprintf(stderr, _(%s: invalid transfer rate \%s\\n), progname, src); + exit(1); + } + + /* + * Evaluate (optional) suffix. + * + * after_num should now be right behind the numeric value. + */ + factor = 1; + switch (tolower(*after_num)) + { + /* + * Only the following suffixes are allowed. It's not too useful to + * restrict the rate to gigabytes: such a rate will probably bring + * significant impact on the master anyway, so the throttling + * won't help much. + */ + case 'm': + factor = 10; + case 'k': + factor = 10; + after_num++; + break; + + default: + + /* + * If there's no suffix, byte is the unit. Possible invalid letter + * will make conversion to integer fail. + */ + break; + } + + /* The rest can only consist of white space. */ + while (*after_num != '\0') + { + if (!isspace(*after_num)) + { + fprintf(stderr, _(%s: invalid transfer rate \%s\\n), progname, src); + exit(1); + } + after_num++; + } + + result *= factor; + if (result MAX_RATE_LOWER || result MAX_RATE_UPPER) + { + fprintf(stderr, _(%s: transfer rate out of range \%s\\n), progname, src); + exit(1); + } + return result; +} + +/* + * If the progress is more than what max_rate allows, sleep. + * + * Do not call if max_rate == 0. + */ +static void +enforce_max_rate() +{ + int64 min_elapsed, +now, +elapsed, +to_sleep; + + int64 last_chunk; + + last_chunk = totaldone - last_measured; + + /* The measurements shouldn't be more frequent then necessary. */ + if (last_chunk RATE_MIN_SAMPLE) + return; + + + now = localGetCurrentTimestamp(); + elapsed = now - last_measured_time; + + /* + *
Re: [HACKERS] Performance problem in PLPgSQL
On Wednesday, July 24, 2013 11:38 AM Amit Kapila wrote: On Wednesday, July 24, 2013 4:23 AM Tom Lane wrote: I wrote: Marc Cousin cousinm...@gmail.com writes: The example below is of course extremely simplified, and obviously not what we are really doing in the database, but it exhibits the slowdown between 9.1.9 and 9.2.4. Hm. Some experimentation shows that the plan cache is failing to switch to a generic plan, but I'm not sure why the cast would have anything to do with that ... Hah, I see why: (gdb) s 1009if (plansource-generic_cost avg_custom_cost * 1.1) (gdb) p plansource-generic_cost $18 = 0.012501 (gdb) p avg_custom_cost $19 = 0.01 (gdb) p avg_custom_cost * 1.1 $20 = 0.011001 That is, the estimated cost of the custom plan is just the evaluation time for a simple constant, while the estimated cost of the generic plan includes a charge for evaluation of int4_numeric(). That makes the latter more than ten percent more expensive, and since this logic isn't considering anything else at all (particularly not the cost of planning), it thinks that makes the custom plan worth picking. We've been around on this before, but not yet thought of a reasonable way to estimate planning cost, let alone compare it to estimated execution costs. Up to now I hadn't thought that was a particularly urgent issue, but this example shows it's worth worrying about. One thing that was suggested in the previous discussion is to base the planning cost estimate on the length of the rangetable. We could do something trivial like add 10 * (length(plan-rangetable) + 1) in this comparison. Another thing that would fix this particular issue, though perhaps not related ones, is to start charging something nonzero for ModifyTable nodes, say along the lines of one seq_page_cost per inserted/modified row. That would knock the total estimated cost for an INSERT up enough that the ten percent threshold wouldn't be exceeded. Shouldn't it consider new value of boundparam to decide whether a new custom plan is needed, as that can be one of the main reason why it would need different plan. Current behavior is either it will choose generic plan or build a new custom plan with new parameters based on Choose_custom_plan(). Shouldn't the behavior of this be as below: a. choose generic plan b. choose one of existing custom plan c. create new custom plan The choice can be made based on the new value of bound parameter. For the case of Insert where no scan is involved (as in this case), do we anytime need different plan? Can we always use generic plan for such cases? With Regards, Amit Kapila. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] [bug fix] PITR corrupts the database cluster
Hello, I've encountered a bug of PITR that corrupts the database. I'm willing to submit the patch to fix it, but I'm wondering what approach is appropriate. Could you give me your opinions? [Problem] I cannot connect to the database after performing the following steps: 1. CREATE DATABASE mydb; 2. Take a base backup with pg_basebackup. 3. DROP DATABASE mydb; 4. Shutdown the database server with pg_ctl stop. 5. Recover the database cluster to the point where the base backup completed, i.e., before dropping mydb. The contents of recovery.conf is: restore_command = 'cp /arc/dir/%f %p' recovery_target_timeline = 'latest' recovery_target_time = 'STOP TIME recorded in the backup history file which was created during base backup' I expected to be able to connect to mydb because I recovered to the point before dropping mydb. However, I cannot connect to mydb because the directory for mydb does not exist. The entry for mydb exists in pg_database. [Cause] DROP DATABASE emits the below WAL records: 1. System catalog changes including deletion of a tuple for mydb in pg_database 2. Deletion of directories for the database 3. Transaction commit During recovery, postgres replays 1 and 2. It ends the recovery when it notices that the time recorded in commit record (3 above) is later than the recovery target time. The commit record is not replayed, thus the system catalog changes are virtually undone. The problem is that 2 is replayed. This deletes the directory for the database although the transaction is not committed. [How to fix] There are two approaches. Which do you think is the way to go? Approach 1 During recovery, when the WAL record for directory deletion is found, just record that fact for later replay (in a hash table keyed by xid). When the corresponding transaction commit record is found, replay the directory deletion record. Approach 2 Like the DROP TABLE/INDEX case, piggyback the directory deletion record on the transaction commit record, and eliminate the directory deletion record altogether. I think we need to take approach 1 even when we also does 2, because 1 is necessary when the backup and archive WAL are already taken with the current PostgreSQL anyway. Regards MauMau -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [bug fix] PITR corrupts the database cluster
On 2013-07-24 19:30:09 +0900, MauMau wrote: I've encountered a bug of PITR that corrupts the database. I'm willing to submit the patch to fix it, but I'm wondering what approach is appropriate. Could you give me your opinions? [Problem] I cannot connect to the database after performing the following steps: 1. CREATE DATABASE mydb; 2. Take a base backup with pg_basebackup. 3. DROP DATABASE mydb; 4. Shutdown the database server with pg_ctl stop. 5. Recover the database cluster to the point where the base backup completed, i.e., before dropping mydb. The contents of recovery.conf is: restore_command = 'cp /arc/dir/%f %p' recovery_target_timeline = 'latest' recovery_target_time = 'STOP TIME recorded in the backup history file which was created during base backup' For a second I wanted to say that it's a user error because you should just set recovery_target_lsn based on the END WAL location in the backup history file. Unfortunately recovery_target_lsn doesn't exist. It should. I expected to be able to connect to mydb because I recovered to the point before dropping mydb. However, I cannot connect to mydb because the directory for mydb does not exist. The entry for mydb exists in pg_database. [Cause] DROP DATABASE emits the below WAL records: 1. System catalog changes including deletion of a tuple for mydb in pg_database 2. Deletion of directories for the database 3. Transaction commit Approach 1 During recovery, when the WAL record for directory deletion is found, just record that fact for later replay (in a hash table keyed by xid). When the corresponding transaction commit record is found, replay the directory deletion record. I think that's too much of a special case implementation. Approach 2 Like the DROP TABLE/INDEX case, piggyback the directory deletion record on the transaction commit record, and eliminate the directory deletion record altogether. I don't think burdening commit records with that makes sense. It's just not a common enough case. What we imo could do would be to drop the tablespaces in a *separate* transaction *after* the transaction that removed the pg_tablespace entry. Then an incomplete actions logic similar to btree and gin could be used to remove the database directory if we crashed between the two transactions. SO: TXN1 does: * remove catalog entries * drop buffers * XLogInsert(XLOG_DBASE_DROP_BEGIN) TXN2: * remove_dbtablespaces * XLogInsert(XLOG_DBASE_DROP_FINISH) The RM_DBASE_ID resource manager would then grow a rm_cleanup callback (which would perform TXN2 if we failed inbetween) and a rm_safe_restartpoint which would prevent restartpoints from occuring on standby between both. The same should probably done for CREATE DATABASE because that currently can result in partially copied databases lying around. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [bug fix] PITR corrupts the database cluster
On 2013-07-24 12:59:43 +0200, Andres Freund wrote: Approach 2 Like the DROP TABLE/INDEX case, piggyback the directory deletion record on the transaction commit record, and eliminate the directory deletion record altogether. I don't think burdening commit records with that makes sense. It's just not a common enough case. What we imo could do would be to drop the tablespaces in a *separate* transaction *after* the transaction that removed the pg_tablespace entry. Then an incomplete actions logic similar to btree and gin could be used to remove the database directory if we crashed between the two transactions. SO: TXN1 does: * remove catalog entries * drop buffers * XLogInsert(XLOG_DBASE_DROP_BEGIN) TXN2: * remove_dbtablespaces * XLogInsert(XLOG_DBASE_DROP_FINISH) The RM_DBASE_ID resource manager would then grow a rm_cleanup callback (which would perform TXN2 if we failed inbetween) and a rm_safe_restartpoint which would prevent restartpoints from occuring on standby between both. The same should probably done for CREATE DATABASE because that currently can result in partially copied databases lying around. And CREATE/DROP TABLESPACE. Not really related, but CREATE DATABASE's implementation makes me itch everytime I read parts of it... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Backup throttling
On Wed, Jul 24, 2013 at 4:20 PM, Antonin Houska antonin.hou...@gmail.com wrote: Hello, the purpose of this patch is to limit impact of pg_backup on running server. Feedback is appreciated. Interesting. Please add this patch to the next commitfeat. https://commitfest.postgresql.org/action/commitfest_view?id=19 Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] DATE type output does not follow datestyle parameter
Hello, The description of datestyle parameter does not seem to match the actual behavior. Is this a bug to be fixed? Which do you think should be corrected, the program or the manual? The manual says: DateStyle (string) Sets the display format for date and time values, as well as the rules for interpreting ambiguous date input values. For historical reasons, this variable contains two independent components: the output format specification (ISO, Postgres, SQL, or German) and the input/output specification for year/month/day ordering (DMY, MDY, or YMD). ... And says: http://www.postgresql.org/docs/current/static/datatype-datetime.html 8.5.2. Date/Time Output The output of the date and time types is of course only the date or time part in accordance with the given examples. After doing SET datestyle = 'Postgres, MDY' on the psql prompt, I did the following things on the same psql session: 1. SELECT current_timestamp; now -- Wed Jul 24 10:51:00.217 2013 GMT (1 行) This is exactly as I expected. 2. SELECT current_date; I expected the output Wed Jul 24 2013 or Jul 24 2013, but I got: date 07-24-2013 (1 行) This does not follow the above statement in 8.5.2. This output is created by EncodeDateOnly() in src/backend/utils/adt/datetime.c. Regards MauMau -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [bug fix] PITR corrupts the database cluster
Andres Freund and...@2ndquadrant.com wrote: On 2013-07-24 12:59:43 +0200, Andres Freund wrote: Approach 2 Like the DROP TABLE/INDEX case, piggyback the directory deletion record on the transaction commit record, and eliminate the directory deletion record altogether. I don't think burdening commit records with that makes sense. It's just not a common enough case. What we imo could do would be to drop the tablespaces in a *separate* transaction *after* the transaction that removed the pg_tablespace entry. Then an incomplete actions logic similar to btree and gin could be used to remove the database directory if we crashed between the two transactions. SO: TXN1 does: * remove catalog entries * drop buffers * XLogInsert(XLOG_DBASE_DROP_BEGIN) TXN2: * remove_dbtablespaces * XLogInsert(XLOG_DBASE_DROP_FINISH) The RM_DBASE_ID resource manager would then grow a rm_cleanup callback (which would perform TXN2 if we failed inbetween) and a rm_safe_restartpoint which would prevent restartpoints from occuring on standby between both. The same should probably done for CREATE DATABASE because that currently can result in partially copied databases lying around. And CREATE/DROP TABLESPACE. Not really related, but CREATE DATABASE's implementation makes me itch everytime I read parts of it... I've been hoping that we could get rid of the rm_cleanup mechanism entirely. I eliminated it for gist a while back, and I've been thinking of doing the same for gin and btree. The way it works currently is buggy - while we have rm_safe_restartpoint to avoid creating a restartpoint at a bad moment, there is nothing to stop you from running a checkpoint while incomplete actions are pending. It's possible that there are page locks or something that prevent it in practice, but it feels shaky. So I'd prefer a solution that doesn't rely on rm_cleanup. Piggybacking on commit record seems ok to me, though if we're going to have a lot of different things to attach there, maybe we need to generalize it somehow. Like, allow resource managers to attach arbitrary payload to the commit record, and provide a new rm_redo_commit function to replay them. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [bug fix] PITR corrupts the database cluster
On 2013-07-24 15:45:52 +0300, Heikki Linnakangas wrote: Andres Freund and...@2ndquadrant.com wrote: On 2013-07-24 12:59:43 +0200, Andres Freund wrote: Approach 2 What we imo could do would be to drop the tablespaces in a *separate* transaction *after* the transaction that removed the pg_tablespace entry. Then an incomplete actions logic similar to btree and gin could be used to remove the database directory if we crashed between the two transactions. SO: TXN1 does: * remove catalog entries * drop buffers * XLogInsert(XLOG_DBASE_DROP_BEGIN) TXN2: * remove_dbtablespaces * XLogInsert(XLOG_DBASE_DROP_FINISH) The RM_DBASE_ID resource manager would then grow a rm_cleanup callback (which would perform TXN2 if we failed inbetween) and a rm_safe_restartpoint which would prevent restartpoints from occuring on standby between both. The same should probably done for CREATE DATABASE because that currently can result in partially copied databases lying around. And CREATE/DROP TABLESPACE. Not really related, but CREATE DATABASE's implementation makes me itch everytime I read parts of it... I've been hoping that we could get rid of the rm_cleanup mechanism entirely. I eliminated it for gist a while back, and I've been thinking of doing the same for gin and btree. The way it works currently is buggy - while we have rm_safe_restartpoint to avoid creating a restartpoint at a bad moment, there is nothing to stop you from running a checkpoint while incomplete actions are pending. It's possible that there are page locks or something that prevent it in practice, but it feels shaky. So I'd prefer a solution that doesn't rely on rm_cleanup. Piggybacking on commit record seems ok to me, though if we're going to have a lot of different things to attach there, maybe we need to generalize it somehow. Like, allow resource managers to attach arbitrary payload to the commit record, and provide a new rm_redo_commit function to replay them. The problem is that piggybacking on the commit record doesn't really fix the problem that we end up with a bad state if we crash in a bad moment. For CREATE DATABASE you will have to copy the template database *before* you commit the pg_database insert. Which means if we abort before that we have old data in the datadir. For DROP DATABASE, without something like incomplete actions, piggybacking on the commit record doesn't solve the issue of CHECKPOINTS either, because the commit record you piggybacked on could have committed before a checkpoint, while you still were busy deleting all the files. Similar things go for CREATE/DROP TABLESPACE. I think getting rid of something like incomplete actions entirely isn't really realistic. There are several places were we probably should use them but don't. I think we probably will have change the logic around it so there's something that determines whether a normal checkpoint is safe atm. A simple version of this basically exists via GetVirtualXIDsDelayingChkpt() which is now also used for checksums but we probably want to extend that at some point. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [bug fix] PITR corrupts the database cluster
Andres Freund and...@2ndquadrant.com wrote: On 2013-07-24 15:45:52 +0300, Heikki Linnakangas wrote: Andres Freund and...@2ndquadrant.com wrote: On 2013-07-24 12:59:43 +0200, Andres Freund wrote: Approach 2 What we imo could do would be to drop the tablespaces in a *separate* transaction *after* the transaction that removed the pg_tablespace entry. Then an incomplete actions logic similar to btree and gin could be used to remove the database directory if we crashed between the two transactions. SO: TXN1 does: * remove catalog entries * drop buffers * XLogInsert(XLOG_DBASE_DROP_BEGIN) TXN2: * remove_dbtablespaces * XLogInsert(XLOG_DBASE_DROP_FINISH) The RM_DBASE_ID resource manager would then grow a rm_cleanup callback (which would perform TXN2 if we failed inbetween) and a rm_safe_restartpoint which would prevent restartpoints from occuring on standby between both. The same should probably done for CREATE DATABASE because that currently can result in partially copied databases lying around. And CREATE/DROP TABLESPACE. Not really related, but CREATE DATABASE's implementation makes me itch everytime I read parts of it... I've been hoping that we could get rid of the rm_cleanup mechanism entirely. I eliminated it for gist a while back, and I've been thinking of doing the same for gin and btree. The way it works currently is buggy - while we have rm_safe_restartpoint to avoid creating a restartpoint at a bad moment, there is nothing to stop you from running a checkpoint while incomplete actions are pending. It's possible that there are page locks or something that prevent it in practice, but it feels shaky. So I'd prefer a solution that doesn't rely on rm_cleanup. Piggybacking on commit record seems ok to me, though if we're going to have a lot of different things to attach there, maybe we need to generalize it somehow. Like, allow resource managers to attach arbitrary payload to the commit record, and provide a new rm_redo_commit function to replay them. The problem is that piggybacking on the commit record doesn't really fix the problem that we end up with a bad state if we crash in a bad moment. For CREATE DATABASE you will have to copy the template database *before* you commit the pg_database insert. Which means if we abort before that we have old data in the datadir. For DROP DATABASE, without something like incomplete actions, piggybacking on the commit record doesn't solve the issue of CHECKPOINTS either, because the commit record you piggybacked on could have committed before a checkpoint, while you still were busy deleting all the files. That's no different from CREATE TABLE / INDEX and DROP TABLE / INDEX. E.g. If you crash after CREATE TABLE but before COMMIT, the file is leaked. But it's just a waste of space, everything still works. It would be nice to fix that leak, for tables and indexes too... - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Suggestion for concurrent index creation using a single full scan operation
Wow.. thanks guys, really appreciate the detailed analysis. Tim On Wed, Jul 24, 2013 at 4:08 AM, Noah Misch n...@leadboat.com wrote: On Tue, Jul 23, 2013 at 01:06:26PM +0100, Tim Kane wrote: I haven't given this a lot of thought, but it struck me that when rebuilding tables (be it for a restore process, or some other operational activity) - there is more often than not a need to build an index or two, sometimes many indexes, against the same relation. It strikes me that in order to build just one index, we probably need to perform a full table scan (in a lot of cases). If we are building multiple indexes sequentially against that same table, then we're probably performing multiple sequential scans in succession, once for each index. Check. Could we architect a mechanism that allowed multiple index creation statements to execute concurrently, with all of their inputs fed directly from a single sequential scan against the full relation? From a language construct point of view, this may not be trivial to implement for raw/interactive SQL - but possibly this is a candidate for the custom format restore? As Greg Stark mentioned, pg_restore can already issue index build commands in parallel. Where applicable, that's probably superior to having one backend build multiple indexes during a single heap scan. Index builds are CPU-intensive, and the pg_restore approach takes advantage of additional CPU cores in addition to possibly saving I/O. However, the pg_restore method is not applicable if you want CREATE INDEX CONCURRENTLY, and it's not applicable for implicit index building such as happens for ALTER TABLE rewrites and for VACUUM FULL. Backend-managed concurrent index builds could shine there. I presume this would substantially increase the memory overhead required to build those indexes, though the performance gains may be advantageous. The multi-index-build should respect maintenance_work_mem overall. Avoiding cases where that makes concurrent builds slower than sequential builds is a key challenge for such a project: - If the index builds each fit in maintenance_work_mem when run sequentially and some spill to disk when run concurrently, expect concurrency to lose. - If the heap is small enough to stay in cache from one index build to the next, performing the builds concurrently is probably a wash or a loss. - Concurrency should help when a wide-row table large enough to exhaust OS cache has narrow indexes that all fit in maintenance_work_mem. I don't know whether concurrency would help for a huge-table scenario where the indexes do overspill maintenance_work_mem. You would have N indexes worth of external merge files competing for disk bandwidth; that could cancel out heap I/O savings. Overall, it's easy to end up with a loss. We could punt by having an index_build_concurrency GUC, much like pg_restore relies on the user to discover a good -j value. But if finding cases where concurrency helps is too hard, leaving the GUC at one would become the standard advice. Apologies in advance if this is not the correct forum for suggestions.. It's the right forum. Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] [bug fix] PITR corrupts the database cluster
Heikki Linnakangas hlinnakan...@vmware.com writes: That's no different from CREATE TABLE / INDEX and DROP TABLE / INDEX. E.g. If you crash after CREATE TABLE but before COMMIT, the file is leaked. But it's just a waste of space, everything still works. Well, it is different, because if you crash partway through dropping a tablespace or database, you have inconsistent state. It would be nice to fix that leak, for tables and indexes too... I'm inclined to think that this wouldn't be a good use of resources, at least not at the individual table/index level. We'd surely be adding some significant amount of overhead to normal operating paths, in order to cover a case that really shouldn't happen in practice. The only thing here that really bothers me is that a crash during DROP DATABASE/TABLESPACE could leave us with a partially populated db/ts that's still accessible through the system catalogs. We could however do something to ensure that the db/ts is atomically removed from use before we start dropping individual files. Then, if you get a crash, there'd still be system catalog entries but they'd be pointing at nothing, so the behavior would be clean and understandable whereas right now it's not. In the case of DROP TABLESPACE this seems relatively easy: drop or rename the symlink before we start flushing individual files. I'm not quite sure how to do it for DROP DATABASE though --- I thought of renaming the database directory, say from 12345 to 12345.dead, but if there are tablespaces in use then we might have a database subdirectory in each one, so we couldn't rename them all atomically. I guess one thing we could do is create a flag file, say dead.dont.use, in the database's default-tablespace directory, and make new backends check for that before being willing to start up in that database; then make sure that removal of that file is the last step in DROP DATABASE. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [bug fix] PITR corrupts the database cluster
On 2013-07-24 16:15:59 +0300, Heikki Linnakangas wrote: For DROP DATABASE, without something like incomplete actions, piggybacking on the commit record doesn't solve the issue of CHECKPOINTS either, because the commit record you piggybacked on could have committed before a checkpoint, while you still were busy deleting all the files. That's no different from CREATE TABLE / INDEX and DROP TABLE / INDEX. E.g. If you crash after CREATE TABLE but before COMMIT, the file is leaked. But it's just a waste of space, everything still works. Imo it's not really the same. For one, spurious files left over by CREATE TABLE are handled when assigning future relfilenodes (see GetNewRelFileNode()), we don't do the same for databases and tablespaces afaik. Furthermore, I don't think there's such a big issue with DROP TABLE unless maybe you've created the relation in the same transaction that you're dropping it. Sure, there's the issue with a checkpoint in the wrong place, but other than that we're going to remove the file when replaying the commit record. We also should never end up with an inconsistent state between catalog and filesystem. It would be nice to fix that leak, for tables and indexes too... Agreed. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [GENERAL] Insert result does not match record count
On 07/22/2013 06:20 PM, Jeff Janes wrote: On Fri, Jul 19, 2013 at 3:20 PM, Natalie Wenz nataliew...@ebureau.com wrote: Hi all, I am moving some data from one table to another in 9.2.4, and keep seeing this strange scenario: insert into newtable select data from oldtable where proc_date = x and proc_date y; INSERT 0 78551642 select count(*) from newtable where proc_date = x and proc_date y; count --- 4373518938 It looks to me like the status report is 32 bits and overflowed. 4,373,518,938 - 2^32 = 78,551,642 Attached is a small patch that should fix the problem. Vik *** a/src/backend/commands/createas.c --- b/src/backend/commands/createas.c *** *** 172,178 ExecCreateTableAs(CreateTableAsStmt *stmt, const char *queryString, /* save the rowcount if we're given a completionTag to fill */ if (completionTag) snprintf(completionTag, COMPLETION_TAG_BUFSIZE, ! SELECT %u, queryDesc-estate-es_processed); /* and clean up */ ExecutorFinish(queryDesc); --- 172,178 /* save the rowcount if we're given a completionTag to fill */ if (completionTag) snprintf(completionTag, COMPLETION_TAG_BUFSIZE, ! SELECT UINT64_FORMAT, queryDesc-estate-es_processed); /* and clean up */ ExecutorFinish(queryDesc); *** a/src/backend/tcop/pquery.c --- b/src/backend/tcop/pquery.c *** *** 195,201 ProcessQuery(PlannedStmt *plan, { case CMD_SELECT: snprintf(completionTag, COMPLETION_TAG_BUFSIZE, ! SELECT %u, queryDesc-estate-es_processed); break; case CMD_INSERT: if (queryDesc-estate-es_processed == 1) --- 195,201 { case CMD_SELECT: snprintf(completionTag, COMPLETION_TAG_BUFSIZE, ! SELECT UINT64_FORMAT, queryDesc-estate-es_processed); break; case CMD_INSERT: if (queryDesc-estate-es_processed == 1) *** *** 203,217 ProcessQuery(PlannedStmt *plan, else lastOid = InvalidOid; snprintf(completionTag, COMPLETION_TAG_BUFSIZE, ! INSERT %u %u, lastOid, queryDesc-estate-es_processed); break; case CMD_UPDATE: snprintf(completionTag, COMPLETION_TAG_BUFSIZE, ! UPDATE %u, queryDesc-estate-es_processed); break; case CMD_DELETE: snprintf(completionTag, COMPLETION_TAG_BUFSIZE, ! DELETE %u, queryDesc-estate-es_processed); break; default: strcpy(completionTag, ???); --- 203,217 else lastOid = InvalidOid; snprintf(completionTag, COMPLETION_TAG_BUFSIZE, ! INSERT %u UINT64_FORMAT, lastOid, queryDesc-estate-es_processed); break; case CMD_UPDATE: snprintf(completionTag, COMPLETION_TAG_BUFSIZE, ! UPDATE UINT64_FORMAT, queryDesc-estate-es_processed); break; case CMD_DELETE: snprintf(completionTag, COMPLETION_TAG_BUFSIZE, ! DELETE UINT64_FORMAT, queryDesc-estate-es_processed); break; default: strcpy(completionTag, ???); *** a/src/include/nodes/execnodes.h --- b/src/include/nodes/execnodes.h *** *** 375,381 typedef struct EState List *es_rowMarks; /* List of ExecRowMarks */ ! uint32 es_processed; /* # of tuples processed */ Oid es_lastoid; /* last oid processed (by INSERT) */ int es_top_eflags; /* eflags passed to ExecutorStart */ --- 375,381 List *es_rowMarks; /* List of ExecRowMarks */ ! uint64 es_processed; /* # of tuples processed */ Oid es_lastoid; /* last oid processed (by INSERT) */ int es_top_eflags; /* eflags passed to ExecutorStart */ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [bug fix] PITR corrupts the database cluster
I wrote: The only thing here that really bothers me is that a crash during DROP DATABASE/TABLESPACE could leave us with a partially populated db/ts that's still accessible through the system catalogs. ... I guess one thing we could do is create a flag file, say dead.dont.use, in the database's default-tablespace directory, and make new backends check for that before being willing to start up in that database; then make sure that removal of that file is the last step in DROP DATABASE. After a bit of experimentation, it seems there's a pre-existing way that we could do this: just remove PG_VERSION from the database's default directory as the first filesystem action in DROP DATABASE. If we crash before committing, subsequent attempts to connect to that database would fail like this: $ psql bogus psql: FATAL: base/176774 is not a valid data directory DETAIL: File base/176774/PG_VERSION is missing. which is probably already good enough, though maybe we could add a HINT suggesting that the DB was incompletely dropped. To ensure this is replayed properly on slave servers, I'd be inclined to mechanize it by (1) changing remove_dbtablespaces to ensure that the DB's default tablespace is the first one dropped, and (2) incorporating remove-PG_VERSION-first into rmtree(). regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [bug fix] PITR corrupts the database cluster
On 2013-07-24 10:57:14 -0400, Tom Lane wrote: I wrote: The only thing here that really bothers me is that a crash during DROP DATABASE/TABLESPACE could leave us with a partially populated db/ts that's still accessible through the system catalogs. ... I guess one thing we could do is create a flag file, say dead.dont.use, in the database's default-tablespace directory, and make new backends check for that before being willing to start up in that database; then make sure that removal of that file is the last step in DROP DATABASE. After a bit of experimentation, it seems there's a pre-existing way that we could do this: just remove PG_VERSION from the database's default directory as the first filesystem action in DROP DATABASE. If we crash before committing, subsequent attempts to connect to that database would fail like this: Neat idea, especially as it would work on the back branches without problems. It doesn't address MauMau's original problem of not being able to set a correct stop point because the datadir is being removed before the commit record (which is the only place we currently can set a recovery_target on in a basebackup scenario) though. Wouldn't it make more sense to simply commit the pg_database entry removal separately from the actual removal of the datadir? DROP DATABASE already can't run in a transaction, so that should be easy. $ psql bogus psql: FATAL: base/176774 is not a valid data directory DETAIL: File base/176774/PG_VERSION is missing. which is probably already good enough, though maybe we could add a HINT suggesting that the DB was incompletely dropped. We could rename/fsync() the file to PG_BEING_DROPPED atomically. Then we could give a useful error instead of an error message with a maybe hint. That would require to remove the file last though, so I am not sure if it's worth the bother. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] ilist.h is not useful as-is
So I went off to implement the SPITupleTable tracking discussed in 6553.1374424...@sss.pgh.pa.us, and thought it would be cool to use the slist infrastructure defined in lib/ilist.h rather than creating a separate List node for each SPITupleTable struct. However, I soon ran into a problem: there's no real support for remove the current element of an slist while we're scanning it, which is really the only list manipulation I need. The only way to remove an element is slist_delete(), which will iterate over the list *again* and thus create an O(N^2) penalty. Or I could use a dlist, but two pointers per struct seem pretty silly. So I'm going to end up hand-implementing the same kind of manipulation we frequently use with traditional Lists, namely keep a second variable that's the preceding list element (not the next one) so I can unlink and delete the target element when I find it. ilist.h is not offering me any useful support at all for this scenario. Seems like we're missing a bet here. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ilist.h is not useful as-is
On 2013-07-24 11:26:00 -0400, Tom Lane wrote: So I went off to implement the SPITupleTable tracking discussed in 6553.1374424...@sss.pgh.pa.us, and thought it would be cool to use the slist infrastructure defined in lib/ilist.h rather than creating a separate List node for each SPITupleTable struct. However, I soon ran into a problem: there's no real support for remove the current element of an slist while we're scanning it, which is really the only list manipulation I need. The only way to remove an element is slist_delete(), which will iterate over the list *again* and thus create an O(N^2) penalty. Or I could use a dlist, but two pointers per struct seem pretty silly. So I'm going to end up hand-implementing the same kind of manipulation we frequently use with traditional Lists, namely keep a second variable that's the preceding list element (not the next one) so I can unlink and delete the target element when I find it. ilist.h is not offering me any useful support at all for this scenario. Seems like we're missing a bet here. Hm. Yes. This should be added. I didn't need it so far, but I definitely can see usecases. slist_delete_current(slist_mutable_iter *)? I am willing to cough up a patch if you want. This will require another member variable in slist_mutable_iter which obviously will need to be maintained, but that seems fine to me since it will reduce the cost of actually deleting noticeably. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [GENERAL] Insert result does not match record count
On 07/24/2013 04:04 PM, Vik Fearing wrote: On 07/22/2013 06:20 PM, Jeff Janes wrote: On Fri, Jul 19, 2013 at 3:20 PM, Natalie Wenz nataliew...@ebureau.com wrote: Hi all, I am moving some data from one table to another in 9.2.4, and keep seeing this strange scenario: insert into newtable select data from oldtable where proc_date = x and proc_date y; INSERT 0 78551642 select count(*) from newtable where proc_date = x and proc_date y; count --- 4373518938 It looks to me like the status report is 32 bits and overflowed. 4,373,518,938 - 2^32 = 78,551,642 Attached is a small patch that should fix the problem. It would seem this was not as thorough as it should have been and there's a lot more to do. I'm working on a better patch. Also worth mentioning is bug #7766. http://www.postgresql.org/message-id/e1tlli5-0007tr...@wrigleys.postgresql.org Vik -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ilist.h is not useful as-is
Andres Freund and...@2ndquadrant.com writes: On 2013-07-24 11:26:00 -0400, Tom Lane wrote: So I'm going to end up hand-implementing the same kind of manipulation we frequently use with traditional Lists, namely keep a second variable that's the preceding list element (not the next one) so I can unlink and delete the target element when I find it. ilist.h is not offering me any useful support at all for this scenario. Seems like we're missing a bet here. Hm. Yes. This should be added. I didn't need it so far, but I definitely can see usecases. slist_delete_current(slist_mutable_iter *)? I am willing to cough up a patch if you want. Please. This will require another member variable in slist_mutable_iter which obviously will need to be maintained, but that seems fine to me since it will reduce the cost of actually deleting noticeably. I think that's all right. Conceivably we could introduce two forms of iterator depending on whether you want to delete or not, but that seems like overkill to me. In fact, now that I think about it, the distinction between slist_iter and slist_mutable_iter is really misstated in the comments, isn't it? The *only* action on the list that's unsafe with an active slist_iter is to delete the current element (and then continue iterating). If the value of slist_mutable_iter is to support deletion of the current element, then it seems obvious that it should support doing so efficiently, ie it should carry both prev and next. Also, if it's carrying both of those, then use of slist_mutable_iter really doesn't allow any action other than deleting the current element --- adding new nodes ahead of the current element isn't safe. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ilist.h is not useful as-is
Andres Freund wrote: On 2013-07-24 11:26:00 -0400, Tom Lane wrote: So I'm going to end up hand-implementing the same kind of manipulation we frequently use with traditional Lists, namely keep a second variable that's the preceding list element (not the next one) so I can unlink and delete the target element when I find it. ilist.h is not offering me any useful support at all for this scenario. Seems like we're missing a bet here. Hm. Yes. This should be added. I didn't need it so far, but I definitely can see usecases. slist_delete_current(slist_mutable_iter *)? I am willing to cough up a patch if you want. This will require another member variable in slist_mutable_iter which obviously will need to be maintained, but that seems fine to me since it will reduce the cost of actually deleting noticeably. Amusingly, this will mean ForgetBackgroundWorker() will need to have a slist_mutable_iter * argument if it wants to use the new function ... -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ilist.h is not useful as-is
On 2013-07-24 11:53:39 -0400, Alvaro Herrera wrote: Andres Freund wrote: On 2013-07-24 11:26:00 -0400, Tom Lane wrote: So I'm going to end up hand-implementing the same kind of manipulation we frequently use with traditional Lists, namely keep a second variable that's the preceding list element (not the next one) so I can unlink and delete the target element when I find it. ilist.h is not offering me any useful support at all for this scenario. Seems like we're missing a bet here. Hm. Yes. This should be added. I didn't need it so far, but I definitely can see usecases. slist_delete_current(slist_mutable_iter *)? I am willing to cough up a patch if you want. This will require another member variable in slist_mutable_iter which obviously will need to be maintained, but that seems fine to me since it will reduce the cost of actually deleting noticeably. Amusingly, this will mean ForgetBackgroundWorker() will need to have a slist_mutable_iter * argument if it wants to use the new function ... The only alternative I see would be to pass both the prev, and the cur elements which seems to be less expressive from my POV. I am open to other suggestions? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ilist.h is not useful as-is
Andres Freund wrote: On 2013-07-24 11:53:39 -0400, Alvaro Herrera wrote: Andres Freund wrote: Amusingly, this will mean ForgetBackgroundWorker() will need to have a slist_mutable_iter * argument if it wants to use the new function ... The only alternative I see would be to pass both the prev, and the cur elements which seems to be less expressive from my POV. I am open to other suggestions? In the grand scheme of things, I think it's fine. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] dynamic background workers, round two
The dynamic background workers patch that I submitted for CF1 was generally well-received, but several people commented on a significant limitation: there's currently no way for a backend that requests a new background worker to know whether that background worker was successfully started. If you're using the background worker mechanism to run daemon processes of some sort, this is probably not a huge problem. You most likely don't start and stop those processes very frequently, and when you do manually start them, you can always examine the server log to see whether everything worked as planned. Maybe not ideal, but not unbearably awful, either. However, the goal I'm working towards here is parallel query, and for that application, and some others, things don't look so good. In that case, you are probably launching a background worker flagged as BGW_NEVER_RESTART, and so the postmaster is going to try to start it just once, and if it doesn't work, you really need some way of knowing that. Of course, if the worker is launched successfully, you can have it notify the process that started it via any mechanism you choose: creating a sentinel file, inserting data into a table, setting the process latch. Sky's the limit. However, if the worker isn't launched successfully (fork fails, or the process crashes before it reaches your code) you have no way of knowing. If you don't receive the agreed-upon notification from the child, it means that EITHER the process was never started in the first place OR the postmaster just hasn't gotten around to starting it yet. Of course, you could wait for a long time (5s ?) and then give up, but there's no good way to set the wait time. If you make it long, then it takes a long time for you to report errors to the client even when those errors happen quickly. If you make it short, you may time out spuriously on a busy system. The attached patch attempts to remedy this problem. When you register a background worker, you can obtain a handle that can subsequently be used to query for the worker's PID. If you additionally initialize bgw_notify_pid = MyProcPid, then the postmaster will send you SIGUSR1 when worker startup has been attempted (successfully or not). You can wait for this signal by passing your handle to WaitForBackgroundWorkerStartup(), which will return only when either (1) an attempt has been made to start the worker or (2) the postmaster is determined to have died. This interface seems adequate for something like worker_spi, where it's useful to know whether the child was started or not (and return the PID if so) but that's about all that's really needed. More complex notification interfaces can also be coded using the primitives introduced by this patch. Setting bgw_notify_pid will cause the postmaster to send SIGUSR1 every time it starts the worker and every time the worker dies. Every time you receive that signal, you can call GetBackgroundWorkerPid() for each background worker to find out which ones have started or terminated. The only slight difficulty is in waiting for receipt of that signal, which I implemented by adding a new global called set_latch_on_sigusr1. WaitForBackgroundWorkerStartup() uses that flag internally, but there's nothing to prevent a caller from using it as part of their own event loop. I find the set_latch_on_sigusr1 flag to be slightly ugly, but it seems better and far safer than having the postmaster try to actually set the latch itself - for example, it'd obviously be unacceptable to pass a Latch * to the postmaster and have the postmaster assume that pointer valid. I'm hopeful that this is about as much fiddling with the background worker mechanism per se as will be needed for parallel query. Once we have this, I think the next hurdle will be designing suitable IPC mechanisms, so that there's a relatively easy way for the parent process to pass information to its background workers and visca versa. I expect the main tool for that to be a dynamic shared memory facility; but I'm hoping that won't be closely tied to background workers, though they may be heavy users of it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company bgworker-feedback-v1.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Design proposal: fsync absorb linear slider
On Tue, Jul 23, 2013 at 12:13 PM, Greg Smith g...@2ndquadrant.com wrote: On 7/23/13 10:56 AM, Robert Haas wrote: On Mon, Jul 22, 2013 at 11:48 PM, Greg Smith g...@2ndquadrant.com wrote: We know that a 1GB relation segment can take a really long time to write out. That could include up to 128 changed 8K pages, and we allow all of them to get dirty before any are forced to disk with fsync. By my count, it can include up to 131,072 changed 8K pages. Even better! I can pinpoint exactly what time last night I got tired enough to start making trivial mistakes. Everywhere I said 128 it's actually 131,072, which just changes the range of the GUC I proposed. Getting the number right really highlights just how bad the current situation is. Would you expect the database to dump up to 128K writes into a file and then have low latency when it's flushed to disk with fsync? Of course not. But that's the job the checkpointer process is trying to do right now. And it's doing it blind--it has no idea how many dirty pages might have accumulated before it started. I'm not exactly sure how best to use the information collected. fsync every N writes is one approach. Another is to use accumulated writes to predict how long fsync on that relation should take. Whenever I tried to spread fsync calls out before, the scale of the piled up writes from backends was the input I really wanted available. The segment write count gives an alternate way to sort the blocks too, you might start with the heaviest hit ones. In all these cases, the fundamental I keep coming back to is wanting to cue off past write statistics. If you want to predict relative I/O delay times with any hope of accuracy, you have to start the checkpoint knowing something about the backend and background writer activity since the last one. So, I don't think this is a bad idea; in fact, I think it'd be a good thing to explore. The hard part is likely to be convincing ourselves of anything about how well or poorly it works on arbitrary hardware under arbitrary workloads, but we've got to keep trying things until we find something that works well, so why not this? One general observation is that there are two bad things that happen when we checkpoint. One is that we force all of the data in RAM out to disk, and the other is that we start doing lots of FPIs. Both of these things harm throughput. Your proposal allows the user to make the first of those behaviors more frequent without making the second one more frequent. That idea seems promising, and it also seems to admit of many variations. For example, instead of issuing an fsync when after N OS writes to a particular file, we could fsync the file with the most writes every K seconds. That way, if the system has busy and idle periods, we'll effectively catch up on our fsyncs when the system isn't that busy, and we won't bunch them up too much if there's a sudden surge of activity. Now that's just a shot in the dark and there might be reasons why it's terrible, but I just generally offer it as food for thought that the triggering event for the extra fsyncs could be chosen via a multitude of different algorithms, and as you hack through this it might be worth trying a few different possibilities. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Review: UNNEST (and other functions) WITH ORDINALITY
On Tue, Jul 23, 2013 at 9:38 PM, Tom Lane t...@sss.pgh.pa.us wrote: If it weren't that we've been speculating for years about deprecating SRFs-in-tlists once we had LATERAL, I would personally consider this patch DOA in this form. If we do think we'll probably deprecate that feature, then not extending WITH ORDINALITY to such cases is at least defensible. On the other hand, considering that we've yet to ship a release supporting LATERAL, it's probably premature to commit to such deprecation --- we don't really know whether people will find LATERAL to be a convenient and performant substitute. I guess I'd sort of assumed that the plan was to continue accepting SRFs in tlists but rewrite them as lateral joins, rather than getting rid of them altogether. IIUC that would simplify some things inside the executor. I'd be a bit more reluctant to just ban SRFs in target lists outright. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Review: UNNEST (and other functions) WITH ORDINALITY
Robert Haas robertmh...@gmail.com writes: On Tue, Jul 23, 2013 at 9:38 PM, Tom Lane t...@sss.pgh.pa.us wrote: If it weren't that we've been speculating for years about deprecating SRFs-in-tlists once we had LATERAL, I would personally consider this patch DOA in this form. I guess I'd sort of assumed that the plan was to continue accepting SRFs in tlists but rewrite them as lateral joins, rather than getting rid of them altogether. That seems to me to be unlikely to happen, because it would be impossible to preserve the current (admittedly bad) semantics. If we're going to change the behavior at all we might as well just drop the feature, IMO. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Review: UNNEST (and other functions) WITH ORDINALITY
On Fri, Jul 19, 2013 at 1:50 PM, Greg Stark st...@mit.edu wrote: My only reservation with this patch is whether the WITH_ORDINALITY parser hack is the way we want to go. The precedent was already set with WITH TIME ZONE though and I think this was the best option. I share this reservation. Lexer hacks are reasonable ways of getting LALR(2)-ish behavior in very simple cases, but it doesn't take much to get into trouble. I think the with ordinality as (select 1) select * from ordinality example you posted is precisely on point. Currently, we will have four classes of keywords: unreserved, column-name, type-function, and reserved. There are rules for when each of those types of keywords needs to be quoted, and those rules are relatively well-understood. This patch will introduce, without documentation, a fifth class of keyword. ORDINALITY will need to be quoted when, and only when, it immediately follows WITH. Without some change to our deparsing code, this is a dump/restore hazard; and with some such change it's still probably not a good idea. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Review: UNNEST (and other functions) WITH ORDINALITY
On Wed, Jul 24, 2013 at 1:36 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Tue, Jul 23, 2013 at 9:38 PM, Tom Lane t...@sss.pgh.pa.us wrote: If it weren't that we've been speculating for years about deprecating SRFs-in-tlists once we had LATERAL, I would personally consider this patch DOA in this form. I guess I'd sort of assumed that the plan was to continue accepting SRFs in tlists but rewrite them as lateral joins, rather than getting rid of them altogether. That seems to me to be unlikely to happen, because it would be impossible to preserve the current (admittedly bad) semantics. If we're going to change the behavior at all we might as well just drop the feature, IMO. Maybe. I'd be kind of sad to lose some of the simple cases that work now, like SELECT srf(), in favor of having to write SELECT * FROM srf(). I'd probably get over it, but I'm sure a lot of people would be mildly annoyed at having to change their working application code. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Review: UNNEST (and other functions) WITH ORDINALITY
On Wed, Jul 24, 2013 at 6:36 PM, Tom Lane t...@sss.pgh.pa.us wrote: That seems to me to be unlikely to happen, because it would be impossible to preserve the current (admittedly bad) semantics. If we're going to change the behavior at all we might as well just drop the feature, IMO. It would be nice to support a single SRF in the target list. That would side-step the bad semantics and also make it easier to implement. But I'm not sure how easy it would be in practice because I've learned not to underestimate the difficulty of making seemingly small changes to the planner. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [GENERAL] Insert result does not match record count
Vik Fearing vik.fear...@dalibo.com writes: Also worth mentioning is bug #7766. http://www.postgresql.org/message-id/e1tlli5-0007tr...@wrigleys.postgresql.org Yeah, did you read that whole thread? The real issue here is going to be whether client-side code falls over on wider-than-32-bit counts. We can fix the backend and be pretty sure that we've found all the relevant places inside it, but we'll just be exporting the issue. I did look at libpq and noted that it doesn't seem to have any internal problem, because it returns the count to callers as a string (!). But what do you think are the odds that callers are using code that won't overflow? I'd bet on finding atoi() or suchlike in a lot of callers. Even if they thought to use strtoul(), unsigned long is not necessarily 64 bits wide. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch to add support of IF NOT EXISTS to others CREATE statements
Hello, patch works fine but is there any reason to comparing each ifNotExists in different way? i.e. ProcedureCreate if (!ifNotExists) ... else { ... return } TypeCreate if (ifNotExists) { ... return } ... --- Shouldn't it be more consistent? Regards, Karol -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Review: UNNEST (and other functions) WITH ORDINALITY
On Wed, Jul 24, 2013 at 6:39 PM, Robert Haas robertmh...@gmail.com wrote: This patch will introduce, without documentation, a fifth class of keyword. ORDINALITY will need to be quoted when, and only when, it immediately follows WITH. Without some change to our deparsing code, this is a dump/restore hazard; and with some such change it's still probably not a good idea. Strictly speaking this patc doesn't introduce this fifth class of keyword. We already had TIME in that category (and also FIRST and LAST in a similar category following NULLS). If we have a solution for WITH keyword then presumably we would implement it for WITH TIME and WITH ORDINALITY at the same time. In the interim I suppose we could teach pg_dump to quote any keyword that follows WITH or NULLS pretty easily. Or just quote those four words unconditionally. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Adding Zigzag Merge Join to Index Nested Loops Join
On Tue, Jul 23, 2013 at 3:40 PM, tubadzin tubad...@o2.pl wrote: I want add Zigzag Merge join to Index Nested Loops Join alghoritm. http://www.cs.berkeley.edu/~fox/summaries/database/query_eval_5-8.html Which files are responsible for Index nested loops join ? (not simple nested loops join which has double foreach in nodeNestloop.c) nodeIndexscan.c? nodeIndexonlyscan.c? Best Regards nodeNestloop.c handles all execution of all nested loops, regardless of whether there's an index scan involved. Of course, if there is an index scan involved, then nodeIndexscan.c will also be involved; if there is an index-only scan, then nodeIndexonlyscan.c, and so on. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ilist.h is not useful as-is
Hi, On 2013-07-24 11:49:20 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: This will require another member variable in slist_mutable_iter which obviously will need to be maintained, but that seems fine to me since it will reduce the cost of actually deleting noticeably. I think that's all right. Conceivably we could introduce two forms of iterator depending on whether you want to delete or not, but that seems like overkill to me. Agreed. Especially as you point out there's no real point to mutable_iter as committed. In fact, now that I think about it, the distinction between slist_iter and slist_mutable_iter is really misstated in the comments, isn't it? The *only* action on the list that's unsafe with an active slist_iter is to delete the current element (and then continue iterating). If the value of slist_mutable_iter is to support deletion of the current element, then it seems obvious that it should support doing so efficiently, ie it should carry both prev and next. Also, if it's carrying both of those, then use of slist_mutable_iter really doesn't allow any action other than deleting the current element --- adding new nodes ahead of the current element isn't safe. True. I think I mentally copypasted to much logic from the doubly linked list case. The first attached patch adds slist_delete_current(), updates the comments addressing your points and converts the bgworker code to pass the iterator around (it's more efficient which might actually matter with a few hundred bgworkers). I found the added newlines in slist_foreach_modify useful, but maybe they should be removed again. I think this should be included in 9.3 once reviewed. The second patch adds a regression test for background workers via worker_spi which I used to test slist_delete_current() addition. It's not 100% as it, but I thought it worthwile to post it anyway a) only tests dynamically registered workers, it should start it's own regression test starting some bgworkers statically b) disables installcheck harshly causing a warning from make: /home/andres/src/postgresql/src/makefiles/pgxs.mk:297: warning: ignoring old commands for target `installcheck' Manually defining a pg_regress in 'check:' should fix it probably because that part of pgxs.mk is dependant on REGRESS = being set. c) it only tests BGW_NEVER_RESTART Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services From 0dfa542f457ebd71640830a993e3a428720b4179 Mon Sep 17 00:00:00 2001 From: Andres Freund and...@anarazel.de Date: Wed, 24 Jul 2013 19:34:10 +0200 Subject: [PATCH 1/2] Add slist_delete_current() --- src/backend/lib/ilist.c | 2 +- src/backend/postmaster/bgworker.c | 5 - src/backend/postmaster/postmaster.c | 4 ++-- src/include/lib/ilist.h | 32 + src/include/postmaster/bgworker_internals.h | 2 +- 5 files changed, 36 insertions(+), 9 deletions(-) diff --git a/src/backend/lib/ilist.c b/src/backend/lib/ilist.c index 957a118..dddb6eb 100644 --- a/src/backend/lib/ilist.c +++ b/src/backend/lib/ilist.c @@ -28,7 +28,7 @@ * * It is not allowed to delete a 'node' which is is not in the list 'head' * - * Caution: this is O(n) + * Caution: this is O(n), use slist_delete_current() while iterating. */ void slist_delete(slist_head *head, slist_node *node) diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c index ada24e9..9967b69 100644 --- a/src/backend/postmaster/bgworker.c +++ b/src/backend/postmaster/bgworker.c @@ -272,10 +272,13 @@ BackgroundWorkerStateChange(void) * the postmaster. */ void -ForgetBackgroundWorker(RegisteredBgWorker *rw) +ForgetBackgroundWorker(slist_mutable_iter *cur) { + RegisteredBgWorker *rw; BackgroundWorkerSlot *slot; + rw = slist_container(RegisteredBgWorker, rw_lnode, cur-cur); + Assert(rw-rw_shmem_slot max_worker_processes); slot = BackgroundWorkerData-slot[rw-rw_shmem_slot]; slot-in_use = false; diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index b6b8111..0342be7 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -1452,7 +1452,7 @@ DetermineSleepTime(struct timeval * timeout) if (rw-rw_worker.bgw_restart_time == BGW_NEVER_RESTART) { -ForgetBackgroundWorker(rw); +ForgetBackgroundWorker(siter); continue; } @@ -5643,7 +5643,7 @@ StartOneBackgroundWorker(void) { if (rw-rw_worker.bgw_restart_time == BGW_NEVER_RESTART) { -ForgetBackgroundWorker(rw); +ForgetBackgroundWorker(iter); continue; } diff --git a/src/include/lib/ilist.h b/src/include/lib/ilist.h index 73f4115..f8f0458 100644 --- a/src/include/lib/ilist.h +++ b/src/include/lib/ilist.h @@ -211,7 +211,9 @@ typedef struct slist_head * Used as state in
Re: [HACKERS] Review: UNNEST (and other functions) WITH ORDINALITY
Greg Stark st...@mit.edu writes: On Wed, Jul 24, 2013 at 6:39 PM, Robert Haas robertmh...@gmail.com wrote: This patch will introduce, without documentation, a fifth class of keyword. ORDINALITY will need to be quoted when, and only when, it immediately follows WITH. Without some change to our deparsing code, this is a dump/restore hazard; and with some such change it's still probably not a good idea. Strictly speaking this patc doesn't introduce this fifth class of keyword. We already had TIME in that category (and also FIRST and LAST in a similar category following NULLS). If we have a solution for WITH keyword then presumably we would implement it for WITH TIME and WITH ORDINALITY at the same time. It strikes me that we could hack the grammar for CTEs so that it allows WITH_ORDINALITY and WITH_TIME as the first token, with appropriate insertion of the proper identifier value. I'm not sure if there are any other places where WITH can be followed by a random identifier. I don't see any workable fix that doesn't involve the funny token, though. Consider CREATE VIEW v AS SELECT ... FROM UNNEST(...) WITH ORDINALITY; CREATE VIEW v AS SELECT ... FROM UNNEST(...) WITH NO DATA; WITH ORDINALITY really needs to be parsed as part of the FROM clause. WITH NO DATA really needs to *not* be parsed as part of the FROM clause. Without looking ahead more than one token, there is absolutely no way for the grammar to decide if it's got the whole FROM clause or not at the point where WITH is the next token. So our choices are to have two-token lookahead at the lexer level, or to give up on bison and find something that can implement a parsing algorithm better than LALR(1). I know which one seems more likely to get done in the foreseeable future. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Review: UNNEST (and other functions) WITH ORDINALITY
On 2013-07-24 13:36:39 -0400, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: On Tue, Jul 23, 2013 at 9:38 PM, Tom Lane t...@sss.pgh.pa.us wrote: If it weren't that we've been speculating for years about deprecating SRFs-in-tlists once we had LATERAL, I would personally consider this patch DOA in this form. I guess I'd sort of assumed that the plan was to continue accepting SRFs in tlists but rewrite them as lateral joins, rather than getting rid of them altogether. That seems to me to be unlikely to happen, because it would be impossible to preserve the current (admittedly bad) semantics. If we're going to change the behavior at all we might as well just drop the feature, IMO. I think removing the feature will be a rather painful procedure for users and thus will need a rather long deprecation period. The amount of code using SRFs in targetlists is quite huge if my experience is anything to go by. And much of that can trivially/centrally be rewritten to LATERAL, not to speak of the cross-version compatibility problem. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [GENERAL] Insert result does not match record count
On 2013-07-24 13:48:23 -0400, Tom Lane wrote: Vik Fearing vik.fear...@dalibo.com writes: Also worth mentioning is bug #7766. http://www.postgresql.org/message-id/e1tlli5-0007tr...@wrigleys.postgresql.org Yeah, did you read that whole thread? The real issue here is going to be whether client-side code falls over on wider-than-32-bit counts. We can fix the backend and be pretty sure that we've found all the relevant places inside it, but we'll just be exporting the issue. I did look at libpq and noted that it doesn't seem to have any internal problem, because it returns the count to callers as a string (!). But what do you think are the odds that callers are using code that won't overflow? I'd bet on finding atoi() or suchlike in a lot of callers. Even if they thought to use strtoul(), unsigned long is not necessarily 64 bits wide. Application code that relies on the values already has problems though since the returned values are pretty bogus now. Including the fact that it can return 0 as the number of modified rows which is checked for more frequently than the actual number IME... So I think client code that uses simplistic stuff like atoi isn't worse off afterwards since the values will be about as bogus. I am more worried about code that does range checks like java's string conversion routines... I think fixing this for 9.4 is fine, but due to the compat issues I think it's to late for 9.3. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ilist.h is not useful as-is
Andres Freund and...@2ndquadrant.com writes: The first attached patch adds slist_delete_current(), updates the comments addressing your points and converts the bgworker code to pass the iterator around (it's more efficient which might actually matter with a few hundred bgworkers). I found the added newlines in slist_foreach_modify useful, but maybe they should be removed again. I think this should be included in 9.3 once reviewed. Agreed; since we have not shipped ilist.h in a release yet, the benefits of having it behave the same in all branches should outweigh any pain from changing this post-beta. The second patch adds a regression test for background workers via worker_spi which I used to test slist_delete_current() addition. It's not 100% as it, but I thought it worthwile to post it anyway Hm. I'll review and commit the ilist changes, but Alvaro or somebody should decide if such a test is sensible. I'd be a bit worried about it in a make installcheck context ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [GENERAL] Insert result does not match record count
Andres Freund and...@2ndquadrant.com writes: I think fixing this for 9.4 is fine, but due to the compat issues I think it's to late for 9.3. Agreed -- this is effectively a protocol change, albeit a pretty minor one, so I can't see back-patching it. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ilist.h is not useful as-is
Andres Freund and...@2ndquadrant.com writes: The first attached patch adds slist_delete_current(), updates the comments addressing your points and converts the bgworker code to pass the iterator around (it's more efficient which might actually matter with a few hundred bgworkers). I found the added newlines in slist_foreach_modify useful, but maybe they should be removed again. I think this should be included in 9.3 once reviewed. Agreed; since we have not shipped ilist.h in a release yet, the benefits of having it behave the same in all branches should outweigh any pain from changing this post-beta. Especially as it shouldn't break any existing working code since the old API is still there. Obviously it breaks the ABI, but ... On 2013-07-24 14:29:42 -0400, Tom Lane wrote: The second patch adds a regression test for background workers via worker_spi which I used to test slist_delete_current() addition. It's not 100% as it, but I thought it worthwile to post it anyway Hm. I'll review and commit the ilist changes, but Alvaro or somebody should decide if such a test is sensible. I'd be a bit worried about it in a make installcheck context ... I've disabled installcheck for that reason. Is there any way to override a makefile target in gnu make without a warning? If we want coverage for statically registered workers - which seems like a good idea, we need our own postgresql.conf stanza and thus a manual pg_regress invocation anyway. Which should fix that error. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dynamic background workers, round two
Hi, On 2013-07-24 12:46:21 -0400, Robert Haas wrote: The attached patch attempts to remedy this problem. When you register a background worker, you can obtain a handle that can subsequently be used to query for the worker's PID. If you additionally initialize bgw_notify_pid = MyProcPid, then the postmaster will send you SIGUSR1 when worker startup has been attempted (successfully or not). You can wait for this signal by passing your handle to WaitForBackgroundWorkerStartup(), which will return only when either (1) an attempt has been made to start the worker or (2) the postmaster is determined to have died. This interface seems adequate for something like worker_spi, where it's useful to know whether the child was started or not (and return the PID if so) but that's about all that's really needed. This seems like a sensible idea to me. But, in the context of dynamic query, don't we also need the reverse infrastructure of notifying a bgworker that the client, that requested it to be started, has died? Ending up with a dozen bgworkers running because the normal backend FATALed doesn't seem nice. More complex notification interfaces can also be coded using the primitives introduced by this patch. Setting bgw_notify_pid will cause the postmaster to send SIGUSR1 every time it starts the worker and every time the worker dies. Every time you receive that signal, you can call GetBackgroundWorkerPid() for each background worker to find out which ones have started or terminated. The only slight difficulty is in waiting for receipt of that signal, which I implemented by adding a new global called set_latch_on_sigusr1. WaitForBackgroundWorkerStartup() uses that flag internally, but there's nothing to prevent a caller from using it as part of their own event loop. I find the set_latch_on_sigusr1 flag to be slightly ugly, but it seems better and far safer than having the postmaster try to actually set the latch itself - for example, it'd obviously be unacceptable to pass a Latch * to the postmaster and have the postmaster assume that pointer valid. I am with you with finding it somewhat ugly. --- a/contrib/worker_spi/worker_spi.c +++ b/contrib/worker_spi/worker_spi.c Btw, I've posted a minimal regression test for bworkers/worker_spi in 20130724175742.gd10...@alap2.anarazel.de - I'd like to see some coverage of this... +int bgw_notify_pid; It really should be pid_t even though we're not consistently doing that in the code. diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c index f7ebd1a..6414291 100644 --- a/src/backend/commands/async.c +++ b/src/backend/commands/async.c @@ -207,8 +207,6 @@ typedef struct QueueBackendStatus QueuePosition pos; /* backend has read queue up to here */ } QueueBackendStatus; -#define InvalidPid (-1) - diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h index edced29..0aa540a 100644 --- a/src/include/miscadmin.h +++ b/src/include/miscadmin.h @@ -28,6 +28,8 @@ #define PG_BACKEND_VERSIONSTR postgres (PostgreSQL) PG_VERSION \n +#define InvalidPid (-1) + That value strikes me as a rather bad idea. As a pid that represents init's processgroup. I.e. kill(InvalidPid) will try to kill init... I'd rather not spread that outside async.c. +/* + * Report the PID of a newly-launched background worker in shared memory. + * + * This function should only be called from the postmaster. + */ +void +ReportBackgroundWorkerPID(RegisteredBgWorker *rw) +{ + BackgroundWorkerSlot *slot; + + Assert(rw-rw_shmem_slot max_worker_processes); + slot = BackgroundWorkerData-slot[rw-rw_shmem_slot]; + slot-pid = rw-rw_pid; + + if (rw-rw_worker.bgw_notify_pid != 0) + kill(rw-rw_worker.bgw_notify_pid, SIGUSR1); +} Any reason not to use SendProcSignal() properly here? Just that you don't know the BackendId? I seems unclean to reuse SIGUSR1 without using the infrastructure built for that. I first thought you didn't want to do it because you it touches shmem, but given that ReportBackgroundWorkerPID() does so itself... + * If handle != NULL, we'll set *handle to a pointer that can subsequently + * be used as an argument to GetBackgroundWorkerPid(). The caller can + * free this pointer using pfree(), if desired. */ bool -RegisterDynamicBackgroundWorker(BackgroundWorker *worker) +RegisterDynamicBackgroundWorker(BackgroundWorker *worker, + BackgroundWorkerHandle **handle) I'd just let the caller provide the memory, but whatever ;) +/* + * Wait for a background worker to start up. + * + * The return value is the same as for GetBackgroundWorkerPID, except that + * we only return InvalidPid if the postmaster has died (and therefore we + * know that the worker will never be started). + */ +int +WaitForBackgroundWorkerStartup(BackgroundWorkerHandle *handle)
Re: [HACKERS] Review: UNNEST (and other functions) WITH ORDINALITY
Tom Lane said: If we did it with a WithOrdinality expression node, the result would always be of type RECORD, and we'd have to use blessed tuple descriptors to keep track of exactly which record type a particular instance emits. This is certainly do-able (see RowExpr for precedent). Maybe RowExpr is a precedent for something, but it has this long-standing problem that makes it very hard to use usefully: postgres=# select (r).* from (select row(a,b) as r from (values (1,2)) v(a,b)) s; ERROR: record type has not been registered It seems way too short on comments. [...] This can certainly be addressed. but it sure looks like it flat out removed several existing regression-test cases Here's why, in rangefuncs.sql: --invokes ExecReScanFunctionScan SELECT * FROM foorescan f WHERE f.fooid IN (SELECT fooid FROM foorescan(5002,5004)) ORDER BY 1,2; I don't think that has invoked ExecReScanFunctionScan since 7.4 or so. It certainly does not do so now (confirmed by gdb as well as by the query plan). By all means keep the old tests if you want a never-remove-tests-for-any-reason policy, but having added tests that actually _do_ invoke ExecReScanFunctionScan, I figured the old ones were redundant. (Also, these kinds of tests can be done a bit better now with values and lateral rather than creating and dropping tables just for the one test.) and a few existing comments as well. I've double-checked, and I don't see any existing comments removed. FWIW, I concur with the gripe I remember seeing upthread that the default name of the added column ought not be ?column?. This seems to be a common complaint, but gives rise to two questions: 1) what should the name be? 2) should users be depending on it? I've yet to find another db that actually documents a specific column name for the ordinality column; it's always taken for granted that the user should always be supplying an alias. (Admittedly there are not many dbs that support it at all; DB2 does, and I believe Teradata.) -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ilist.h is not useful as-is
Andres Freund and...@2ndquadrant.com writes: The first attached patch adds slist_delete_current(), updates the comments addressing your points and converts the bgworker code to pass the iterator around (it's more efficient which might actually matter with a few hundred bgworkers). Applied with fixes. The slist_delete_current() logic wouldn't actually work for deleting multiple adjacent list elements, because the foreach macro would advance prev to point at the just-deleted element. Compare the places where we use list_delete_cell() in a loop --- we're careful to advance prev only when we *don't* delete the current element. I dealt with that by making slist_delete_current() back up the iterator's cur pointer to equal prev, so that the loop macro's next copying of cur to prev is a no-op. This means callers can't rely on cur anymore after the deletion, but in typical usage they'd have computed a pointer to their struct already so this shouldn't be a problem. Another fine point is you can't use both slist_delete and slist_delete_current within the same loop, since the former wouldn't apply this correction. Also, I fixed the slist_foreach_modify macro to not doubly evaluate its lhead argument, since I think we agreed that was a bad idea, and did some more work on the comments. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dynamic background workers, round two
On Thu, Jul 25, 2013 at 6:34 AM, Andres Freund and...@2ndquadrant.comwrote: Hi, On 2013-07-24 12:46:21 -0400, Robert Haas wrote: The attached patch attempts to remedy this problem. When you register a background worker, you can obtain a handle that can subsequently be used to query for the worker's PID. If you additionally initialize bgw_notify_pid = MyProcPid, then the postmaster will send you SIGUSR1 when worker startup has been attempted (successfully or not). You can wait for this signal by passing your handle to WaitForBackgroundWorkerStartup(), which will return only when either (1) an attempt has been made to start the worker or (2) the postmaster is determined to have died. This interface seems adequate for something like worker_spi, where it's useful to know whether the child was started or not (and return the PID if so) but that's about all that's really needed. This seems like a sensible idea to me. But, in the context of dynamic query, don't we also need the reverse infrastructure of notifying a bgworker that the client, that requested it to be started, has died? Ending up with a dozen bgworkers running because the normal backend FATALed doesn't seem nice. Yes, this would be something necessary to have, but definitely it should be a separate patch. In order to satisfy that, you could register for each worker the PID of the parent process that started it, if it was started by an an other bgworker, and have postmaster scan the list of bgworkers that need to be stopped after the death of their parent if it was a bgworker. By the way, the PID registered in this case should not be bgw_notify_pid but something saved in BackgroundWorkerSlot. My 2c on that though, feel free to comment. More complex notification interfaces can also be coded using the primitives introduced by this patch. Setting bgw_notify_pid will cause the postmaster to send SIGUSR1 every time it starts the worker and every time the worker dies. Every time you receive that signal, you can call GetBackgroundWorkerPid() for each background worker to find out which ones have started or terminated. The only slight difficulty is in waiting for receipt of that signal, which I implemented by adding a new global called set_latch_on_sigusr1. WaitForBackgroundWorkerStartup() uses that flag internally, but there's nothing to prevent a caller from using it as part of their own event loop. I find the set_latch_on_sigusr1 flag to be slightly ugly, but it seems better and far safer than having the postmaster try to actually set the latch itself - for example, it'd obviously be unacceptable to pass a Latch * to the postmaster and have the postmaster assume that pointer valid. I am with you with finding it somewhat ugly. After a quick look at the code, yeah using a boolean here looks like an overkill. -- Michael
Re: [HACKERS] dynamic background workers, round two
On Thu, Jul 25, 2013 at 6:34 AM, Andres Freund and...@2ndquadrant.comwrote: --- a/contrib/worker_spi/worker_spi.c +++ b/contrib/worker_spi/worker_spi.c Btw, I've posted a minimal regression test for bworkers/worker_spi in 20130724175742.gd10...@alap2.anarazel.de - I'd like to see some coverage of this... I could not find this email ID in the archives. A link perhaps? -- Michael
Re: [HACKERS] dynamic background workers, round two
On 2013-07-25 08:03:17 +0900, Michael Paquier wrote: On Thu, Jul 25, 2013 at 6:34 AM, Andres Freund and...@2ndquadrant.comwrote: --- a/contrib/worker_spi/worker_spi.c +++ b/contrib/worker_spi/worker_spi.c Btw, I've posted a minimal regression test for bworkers/worker_spi in 20130724175742.gd10...@alap2.anarazel.de - I'd like to see some coverage of this... I could not find this email ID in the archives. A link perhaps? Hm. Works for me: http://www.postgresql.org/message-id/20130724175742.gd10...@alap2.anarazel.de Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: proposal 9.4 plpgsql: allows access to call stack from GET DIAGNOSTICS statement
All, * Rushabh Lathia (rushabh.lat...@gmail.com) wrote: Latest patch looks good to me. I've committed this, with some pretty serious clean-up. In the future, please don't simply copy/paste code w/o any updates to the comments included, particularly when the comments no longer make sense in the new place. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] Preventing tuple-table leakage in plpgsql
I wrote: Hmm ... good point. The other plan I'd been considering was to add explicit tracking inside spi.c of all tuple tables created within the current procedure, and then have AtEOSubXact_SPI flush any that were created inside a failed subxact. The tables would still be children of the procCxt and thus could not be leaked past SPI_finish. When you suggested attaching to subtransaction contexts I thought that would let us get away without this additional bookkeeping logic, but maybe we should bite the bullet and add the extra logic. A change that's meant to remove leak risks really shouldn't be introducing other, new leak risks. (An additional advantage is we could detect attempts to free the same tuptable more than once, which would be a good thing ...) Here's a draft cut at that. I've checked that this fixes the reported memory leak. This uses ilist.h, so it couldn't easily be backported to before 9.3, but we weren't going to do that anyway. Worth noting is that SPI_freetuptable() is now much more picky about what it's being passed: it won't free anything that is not a tuptable of the currently connected SPI procedure. This doesn't appear to be a problem for anything in core or contrib, but I wonder whether anyone thinks we need to relax that? If so, we could allow it to search down the SPI context stack, but I'm not sure I see a use-case for allowing an inner SPI procedure to free a tuple table made by an outer one. In general, I think this version of SPI_freetuptable() is a lot safer than what we had before, and possibly likely to find bugs in calling code. Another point worth making is that this version of the patch deletes the tuple tables during AtEOSubXact_SPI(), earlier in cleanup than would happen with the prior version. That increases the risk that external code might try to delete an already-deleted tuple table, if it tries to call SPI_freetuptable() during subxact cleanup. The new code won't crash, although come to think of it it will probably throw an error because you're not connected anymore. (Maybe this is a reason to not insist on being connected, but just silently search whatever the top stack context is?) This still lacks doc changes, too. regards, tom lane diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c index f30b2cd93330d72e0a412d6517aec08c96cc753e..f179922dd4bd4eb36212ebd91413ba28de44b2fc 100644 *** a/src/backend/executor/spi.c --- b/src/backend/executor/spi.c *** SPI_connect(void) *** 126,131 --- 126,132 _SPI_current-processed = 0; _SPI_current-lastoid = InvalidOid; _SPI_current-tuptable = NULL; + slist_init(_SPI_current-tuptables); _SPI_current-procCxt = NULL; /* in case we fail to create 'em */ _SPI_current-execCxt = NULL; _SPI_current-connectSubid = GetCurrentSubTransactionId(); *** SPI_finish(void) *** 166,172 /* Restore memory context as it was before procedure call */ MemoryContextSwitchTo(_SPI_current-savedcxt); ! /* Release memory used in procedure call */ MemoryContextDelete(_SPI_current-execCxt); _SPI_current-execCxt = NULL; MemoryContextDelete(_SPI_current-procCxt); --- 167,173 /* Restore memory context as it was before procedure call */ MemoryContextSwitchTo(_SPI_current-savedcxt); ! /* Release memory used in procedure call (including tuptables) */ MemoryContextDelete(_SPI_current-execCxt); _SPI_current-execCxt = NULL; MemoryContextDelete(_SPI_current-procCxt); *** AtEOSubXact_SPI(bool isCommit, SubTransa *** 282,292 */ if (_SPI_current !isCommit) { /* free Executor memory the same as _SPI_end_call would do */ MemoryContextResetAndDeleteChildren(_SPI_current-execCxt); ! /* throw away any partially created tuple-table */ ! SPI_freetuptable(_SPI_current-tuptable); ! _SPI_current-tuptable = NULL; } } --- 283,316 */ if (_SPI_current !isCommit) { + slist_mutable_iter siter; + /* free Executor memory the same as _SPI_end_call would do */ MemoryContextResetAndDeleteChildren(_SPI_current-execCxt); ! ! /* throw away any tuple tables created within current subxact */ ! slist_foreach_modify(siter, _SPI_current-tuptables) ! { ! SPITupleTable *tuptable; ! ! tuptable = slist_container(SPITupleTable, next, siter.cur); ! if (tuptable-subid = mySubid) ! { ! /* ! * If we used SPI_freetuptable() here, its internal search of ! * the tuptables list would make this operation O(N^2). ! * Instead, just free the tuptable manually. ! */ ! slist_delete_current(siter); ! if (tuptable == _SPI_current-tuptable) ! _SPI_current-tuptable = NULL; ! if (tuptable == SPI_tuptable) ! SPI_tuptable = NULL; ! MemoryContextDelete(tuptable-tuptabcxt); ! } ! } ! /* in particular we should have gotten rid of any in-progress table */ ! Assert(_SPI_current-tuptable == NULL); } }
Re: [HACKERS] dynamic background workers, round two
On Thu, Jul 25, 2013 at 8:05 AM, Andres Freund and...@2ndquadrant.comwrote: On 2013-07-25 08:03:17 +0900, Michael Paquier wrote: On Thu, Jul 25, 2013 at 6:34 AM, Andres Freund and...@2ndquadrant.com wrote: --- a/contrib/worker_spi/worker_spi.c +++ b/contrib/worker_spi/worker_spi.c Btw, I've posted a minimal regression test for bworkers/worker_spi in 20130724175742.gd10...@alap2.anarazel.de - I'd like to see some coverage of this... I could not find this email ID in the archives. A link perhaps? Hm. Works for me: http://www.postgresql.org/message-id/20130724175742.gd10...@alap2.anarazel.de Oops. Thanks, the search field does not like if spaces are added with the ID. -- Michael
[HACKERS] Re: proposal 9.4 plpgsql: allows access to call stack from GET DIAGNOSTICS statement
Hello 2013/7/25 Stephen Frost sfr...@snowman.net: All, * Rushabh Lathia (rushabh.lat...@gmail.com) wrote: Latest patch looks good to me. I've committed this, with some pretty serious clean-up. In the future, please don't simply copy/paste code w/o any updates to the comments included, particularly when the comments no longer make sense in the new place. Thank you very much Regards Pavel Thanks, Stephen -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers