Re: [HACKERS] [PATCH] add --progress option to pgbench (submission 3)
New submission which puts the option help in alphabetical position, as per Peter Eisentraut's commit f0ed3a8a99b052d2d5e0b6153a8907b90c486636. This is for reference to the next commitfest. Patch updated after a conflict induced by pg-indentation, for the next commitfest.

-- Fabien.

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 8ff6623..c583f39 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -164,6 +164,7 @@
 bool		use_log;		/* log transaction latencies to a file */
 bool		use_quiet;		/* quiet logging onto stderr */
 int			agg_interval;	/* log aggregates instead of individual
							 * transactions */
+int			progress = 0;	/* thread progress report every this seconds */
 bool		is_connect;		/* establish connection for each transaction */
 bool		is_latencies;	/* report per-command latencies */
 int			main_pid;		/* main process id used in log filename */
@@ -354,6 +355,8 @@ usage(void)
 		 "               protocol for submitting queries to server (default: simple)\n"
 		 "  -n           do not run VACUUM before tests\n"
 		 "  -N           do not update tables \"pgbench_tellers\" and \"pgbench_branches\"\n"
+		 "  -P SEC, --progress SEC\n"
+		 "               show thread progress report about every SEC seconds\n"
 		 "  -r           report average latency per command\n"
 		 "  -s NUM       report this scale factor in output\n"
 		 "  -S           perform SELECT-only transactions\n"
@@ -2112,6 +2115,7 @@ main(int argc, char **argv)
 		{"unlogged-tables", no_argument, &unlogged_tables, 1},
 		{"sampling-rate", required_argument, NULL, 4},
 		{"aggregate-interval", required_argument, NULL, 5},
+		{"progress", required_argument, NULL, 'P'},
 		{NULL, 0, NULL, 0}
 	};

@@ -2178,7 +2182,7 @@ main(int argc, char **argv)
 	state = (CState *) pg_malloc(sizeof(CState));
 	memset(state, 0, sizeof(CState));

-	while ((c = getopt_long(argc, argv, "ih:nvp:dqSNc:j:Crs:t:T:U:lf:D:F:M:", long_options, &optindex)) != -1)
+	while ((c = getopt_long(argc, argv, "ih:nvp:dqSNc:j:Crs:t:T:U:lf:D:F:M:P:", long_options, &optindex)) != -1)
 	{
 		switch (c)
 		{
@@ -2333,6 +2337,16 @@ main(int argc, char **argv)
 					exit(1);
 				}
 				break;
+			case 'P':
+				progress = atoi(optarg);
+				if (progress <= 0)
+				{
+					fprintf(stderr,
							"thread progress delay (-P) must not be negative (%s)\n",
+							optarg);
+					exit(1);
+				}
+				break;
 			case 0:
 				/* This covers long options which take no argument. */
 				break;
@@ -2695,6 +2709,9 @@ threadRun(void *arg)
 	int			nstate = thread->nstate;
 	int			remains = nstate;	/* number of remaining clients */
 	int			i;
+	/* for reporting progress: */
+	int64		last_report = INSTR_TIME_GET_MICROSEC(thread->start_time);
+	int64		last_count = 0;

 	AggVals		aggs;
@@ -2858,6 +2875,29 @@ threadRun(void *arg)
 				st->con = NULL;
 			}
 		}
+
+		/* per thread progress report, about every "progress" seconds */
+		if (progress)
+		{
+			instr_time	now_time;
+			int64		now, run;
+			INSTR_TIME_SET_CURRENT(now_time);
+			now = INSTR_TIME_GET_MICROSEC(now_time);
+			run = now - last_report;
+			if (run >= progress * 1000000)
+			{
+				/* generate and show report */
+				int64		count = 0;
+				for (i = 0; i < nstate; i++)
+					count += state[i].cnt;
+				fprintf(stderr, "thread %d running at %f tps after %.1f s\n",
+						thread->tid, 1000000.0 * (count - last_count) / run,
+						(now - INSTR_TIME_GET_MICROSEC(thread->start_time)) /
+						1000000.0);
+				last_count = count;
+				last_report = now;
+			}
+		}
 	}

 done:
diff --git a/doc/src/sgml/pgbench.sgml b/doc/src/sgml/pgbench.sgml
index e9900d3..e58ea58 100644
--- a/doc/src/sgml/pgbench.sgml
+++ b/doc/src/sgml/pgbench.sgml
@@ -392,6 +392,16 @@
    </varlistentry>

    <varlistentry>
+    <term><option>-P</option> <replaceable>sec</replaceable></term>
+    <term><option>--progress</option> <replaceable>sec</replaceable></term>
+    <listitem>
+     <para>
+      Show thread progress report about every <replaceable>sec</replaceable> seconds.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
     <term><option>-s</option> <replaceable>scale_factor</replaceable></term>
     <listitem>
      <para>
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] detecting binary backup in progress
On Sat, Jun 1, 2013 at 4:55 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Heikki Linnakangas wrote: The manual says: pg_is_in_backup() returns bool: "True if an on-line exclusive backup is still in progress." So clearly that is intentional. That could use some rephrasing, though; a layman won't know what an exclusive backup is. Heck, I don't understand what it is either. Same here. Does it mean taking a backup not with pg_basebackup but by executing external operations yourself between the pg_start/stop_backup calls? -- Michael
Re: [HACKERS] detecting binary backup in progress
On 2013-06-01 17:05:57 +0900, Michael Paquier wrote: On Sat, Jun 1, 2013 at 4:55 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Heikki Linnakangas wrote: The manual says: pg_is_in_backup() returns bool: "True if an on-line exclusive backup is still in progress." So clearly that is intentional. That could use some rephrasing, though; a layman won't know what an exclusive backup is. Heck, I don't understand what it is either. Same here. Does it mean taking a backup not with pg_basebackup but by executing external operations yourself between the pg_start/stop_backup calls? Basically yes. pg_start/stop_backup places the backup label into the data directory itself, so there can only be one of them at a time, since it has to have a fixed name. With the streaming protocol the backup label is only added to the streamed data, so there can be multiple ones at the same time. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
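Elsewhere in this thread it is noted that pg_is_in_backup() merely reflects the presence of the backup_label file in the data directory. A minimal external equivalent of that check (a sketch; the function name is invented and the pgdata path is whatever your installation uses) makes the limitation concrete:

```python
import os

def is_in_exclusive_backup(pgdata: str) -> bool:
    # Mirrors what pg_is_in_backup() reportedly does: true only while
    # $PGDATA/backup_label exists, i.e. between pg_start_backup() and
    # pg_stop_backup() of an exclusive backup.
    return os.path.exists(os.path.join(pgdata, "backup_label"))
```

A pg_basebackup run over the replication protocol never writes this file on the server, so such a check stays false during streamed backups, which is exactly the blind spot being discussed.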
[HACKERS] Combo xids
Currently, we hold xmin and xmax for each tuple. For xmax, we have the multixact mechanism that allows us to represent an array of xids with just a single pseudo xid. So why not hold xmin and xmax as part of a multixact? Instead of caching two xids on each tuple, why not hold just one and allow access to the array of xids via multixact? An idea very similar to combo cids, hence the name combo xids. When would this make sense? Frequently. Most of the time a tuple needs only one xid set. In most cases, we set xmin and xmax a long time apart. Very few cases end with both of them set inside the *same* xmin horizon. In a heavy transactional environment, the horizon moves forwards quickly, on the order of a few seconds. Very few rows get inserted and then updated/deleted that quickly. With long reporting queries, data tends to be updated less, so again the rows aren't touched within the same horizon. As a result, we hardly ever need both xmin and xmax at the same time - when we need to set xmax, xmin is already committed/cleaned. What is the benefit? Merging xmin/xmax would save 4 bytes per row. On servers with 8 byte word length, that means that we'd save 8 bytes per row for tables that have between 9 and 40 columns. Which is the majority of tables. How? Clearly this would require changes to the tuple format and is therefore not something I would suggest for the year ahead, but it seems worth getting the idea down, even if it gets ruled out. Multixact is now persistent, so this change wouldn't take much to implement. Given that it would require a change in on-disk format, we might also consider that it could be possible to cache multixactid data on the disk blocks that need it. A list of multixactids could be stored amongst the tuples on the block, with the head of the list being a 2 byte addition to the block header. Which could be used to reduce the overhead of multixactid use.
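The 8-bytes-per-row figure for 9 to 40 column tables falls out of MAXALIGN padding: the fixed tuple header is 23 bytes today, a merged combo-xid header would be 19, and the null bitmap adds one bit per column. A quick model of that arithmetic (a sketch; the 19-byte figure is the hypothetical merged header, not anything in the tree):

```python
MAXALIGN = 8  # 8-byte word length, as in the mail

def header_size(fixed_bytes: int, natts: int) -> int:
    # Fixed header plus a 1-bit-per-column null bitmap, rounded up to MAXALIGN.
    raw = fixed_bytes + (natts + 7) // 8
    return -(-raw // MAXALIGN) * MAXALIGN

# Column counts where dropping one 4-byte xid (23 -> 19) saves a full 8 bytes:
saved = [n for n in range(1, 65) if header_size(23, n) - header_size(19, n) == 8]
print(saved[0], saved[-1])  # 9 40
```

Outside that 9..40 window the two headers pad to the same aligned size, so the saving disappears, matching the claim above.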
-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
[HACKERS] Deferring transaction wraparound
As we get faster, we'll need to wrap the xid counter more quickly. At current full speed, we could wrap the xid counter every 72 hours. This is a concern for very large tables, since currently we have to rescan the whole table. This has my interest for change in the next release. We could solve that with a freeze map. Avoiding wraparound vacuums will get harder and harder for us, and an approach which is a minor improvement on existing freezing won't help for long. People often ask about 8 byte xids directly, to which I reply: been discussed, major downsides. So I've been trying to rethink this from first principles to come up with a better idea. A much better idea is to hold the xmin epoch on the tuple, in addition to the xid, if there were a good place to hold it. CommandId is 32 bits, but very few transactions need 2 billion commands. So the suggestion is that we optionally store the xid epoch in 16 bits of the cmdid field. We would still allow very large commandIds and combocids, but we optimise the common usage. By doing this, we will be able to reduce xid wraparounds from every 72 hours to once every 538 years, at current transaction rates, in most cases. And in much less busy databases, this will drop to essentially never. In detail: Whenever we write a tuple with a t_cid of less than 65536 we set a new flag, infomask2 HEAP_HAS_EPOCH, and write the xid epoch to the remaining 16 bits (or at least, the epoch modulo 65536). So this approach is binary compatible with the current storage format (even though I was happy to break it if necessary, I don't think that is helpful). We hold the relnextfreezeepoch in pg_class, which starts at the epoch when the table is created, minus 1 (or 65535 at bootstrap); i.e. we next do a freeze scan in 65536 epochs. Any time we write a t_cid greater than 65535 we reset the relnextfreezeepoch for the table to currentepoch+1 using a non-transactional update.
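The packing described above comes down to bit twiddling on a 32-bit field. A sketch with invented helper names (the real patch would set HEAP_HAS_EPOCH in infomask2 on the tuple header):

```python
CID_BITS = 16
CID_MASK = (1 << CID_BITS) - 1  # 65535

def pack(cmdid: int, epoch: int):
    # Common case: cmdid fits in 16 bits, so the upper 16 bits can carry
    # the xid epoch modulo 65536; the HEAP_HAS_EPOCH flag is set.
    if cmdid <= CID_MASK:
        return True, ((epoch & CID_MASK) << CID_BITS) | cmdid
    # Rare case: a huge command id keeps the whole 32-bit field, no epoch.
    return False, cmdid & 0xFFFFFFFF

def unpack(has_epoch: bool, field: int):
    if has_epoch:
        return field & CID_MASK, (field >> CID_BITS) & CID_MASK
    return field, None

print(unpack(*pack(42, 70000)))  # (42, 4464) -- epoch stored modulo 65536
```

The modulo is why relnextfreezeepoch still matters: after 65536 epochs the stored value becomes ambiguous and a freeze scan is needed.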
Autovacuum looks at the relnextfreezeepoch when deciding whether to kick off a wraparound freeze vacuum. Setting relnextfreezeepoch needs good interlocks to avoid resetting it while a long running transaction is running. Various designs are possible, but I suggest using something similar to the vacuum cycleid for when a long running transaction and a wraparound freeze vacuum occur concurrently. That way we only need to take special care when we have a very long transaction running. Very long transactions, that is transactions with more than 65536 commands, will need freezing within 1 cycle just as we currently do, but only for changes made in the later part of the transaction. However, those are typically pg_dump reloads, and we should be able to use COPY FREEZE with those so they get loaded frozen and don't then need later freezing at all. Hang on! Which xid is the epoch for? Well, for xmin of course. If xmax is set, it's either a lock or a delete, and either way we get to clean things up with regular vacuums, which we would still need to do each cycle. It's only the xmin that is annoying, because there isn't much other need to revisit those data blocks. If we don't like that, we could use a few extra bits from the epoch field to determine which xid it applies to, but I think that's too much. This seemed a bit radical when I first thought of it, but it still seems solid now. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
[HACKERS] Optimising Foreign Key checks
FK checks can be expensive, especially when loading large volumes of data into an existing table or partition. A couple of ideas for improving performance are discussed here: 1. Use Case: Bulk loading COPY pgbench_accounts; -- references pgbench_branches with many repeated values Proposal: Transactions that need multiple checks can be optimised by simply LOCKing the whole referenced table, once. We can then hold the referenced table as a Hash, like we do with a Hash Join (it's almost exactly the same thing). This works in two ways: it speeds up checks and it also reduces the locking overhead. This would require explicit permission of the user, which would be given by a new table parameter, set on the referenced table. WITH (foreign_key_lock_level = row | table) Setting this would lock out changes on that table, so it would only be suitable for read-mostly tables. But that is exactly the most frequently referenced kind of table in a FK anyway, reference tables, so the optimisation is appropriate in probably the majority of cases. 2. Use Case: Transactional repetition BEGIN; INSERT INTO order VALUES (ordid, ) INSERT INTO order_line VALUES (ordid, 1, .) INSERT INTO order_line VALUES (ordid, 2, .) INSERT INTO order_line VALUES (ordid, 3, .) INSERT INTO order_line VALUES (ordid, 4, .) ... COMMIT; The inserts into order_line repeatedly execute checks against the same ordid. Deferring and then de-duplicating the checks would optimise the transaction. Proposal: De-duplicate multiple checks against the same value. This would be implemented by keeping a hash of rows that we had already either inserted and/or locked as the transaction progresses, so we can use the hash to avoid queuing up after triggers. We could also use this technique to de-duplicate checks within a single statement. In both cases we are building up a local hash table with values and then using those values to avoid queuing constraint triggers. So the code is similar for both. Thoughts?
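For use case 2, the de-duplication could be as simple as a per-transaction hash of referenced keys already checked, consulted before queuing an after-trigger. A sketch with invented names (nothing here mirrors actual trigger-queue code):

```python
class FkCheckDeduper:
    """Skip queuing an FK check when the same referenced key was already
    inserted or locked earlier in the transaction."""

    def __init__(self):
        self.seen = set()   # (referenced_table, key) pairs already verified
        self.queue = []     # checks that still have to run

    def maybe_queue(self, ref_table: str, key) -> bool:
        if (ref_table, key) in self.seen:
            return False            # duplicate: no after-trigger needed
        self.seen.add((ref_table, key))
        self.queue.append((ref_table, key))
        return True

d = FkCheckDeduper()
# Four order_line inserts against the same ordid queue only one check:
results = [d.maybe_queue("order", 17) for _ in range(4)]
print(results, len(d.queue))  # [True, False, False, False] 1
```

The same structure serves both proposals: in the bulk-load case the set is pre-populated from the hashed referenced table, in the transactional case it grows as rows are inserted or locked.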
-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
New submission for the next commit fest. This new version also reports the average lag time, i.e. the delay between scheduled and actual transaction start times. This may help detect whether things went smoothly, or if at some time some delay was introduced because of the load and some catch-up was done afterwards. Question 1: should it report the maximum lag encountered? Question 2: the next step would be to have the current lag shown under option --progress, but that would mean having a combined --throttle --progress patch submission, or maybe dependencies between patches.

-- Fabien.

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 8ff6623..9b5adc2 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -137,6 +137,12 @@
 int			unlogged_tables = 0;
 double		sample_rate = 0.0;

 /*
+ * whether clients are throttled to a given rate, expressed as a delay in us.
+ * 0, the default, means no throttling.
+ */
+int64		throttle = 0;
+
+/*
  * tablespace selection
  */
 char	   *tablespace = NULL;
@@ -205,6 +211,9 @@
 	int			nvariables;
 	instr_time	txn_begin;		/* used for measuring transaction latencies */
 	instr_time	stmt_begin;		/* used for measuring statement latencies */
+	int64		trigger;		/* previous/next throttling (us) */
+	bool		throttled;		/* whether current transaction was throttled */
+	int64		throttle_lag;	/* transaction lag behind throttling */
 	int			use_file;		/* index in sql_files for this client */
 	bool		prepared[MAX_FILES];
 } CState;
@@ -348,6 +357,9 @@ usage(void)
 		 "  -D VARNAME=VALUE\n"
 		 "               define variable for use by custom script\n"
 		 "  -f FILENAME  read transaction script from FILENAME\n"
+		 "  -H SPEC, --throttle SPEC\n"
+		 "               delay in seconds to throttle each client\n"
+		 "               sample specs: 0.025 40tps 25ms 25000us\n"
 		 "  -j NUM       number of threads (default: 1)\n"
 		 "  -l           write transaction times to log file\n"
 		 "  -M simple|extended|prepared\n"
@@ -902,13 +914,40 @@ doCustom(TState *thread, CState *st, instr_time *conn_time, FILE *logfile, AggVals *agg)
 top:
 	commands = sql_files[st->use_file];

+	/*
+	 * Handle throttling once per transaction by inserting a sleep.
+	 * This is simpler than doing it at the end.
+	 */
+	if (throttle && !st->throttled)
+	{
+		/*
+		 * Compute delay to approximate a Poisson distribution
+		 * (1000000 => 13.8 .. 0 multiplier). If transactions are too slow
+		 * or a given wait is shorter than a transaction, the next
+		 * transaction will start right away.
+		 */
+		int64		wait = (int64)
+			(throttle * -log(getrand(thread, 1, 1000000) / 1000000.0));
+		st->trigger += wait;
+		st->sleeping = 1;
+		st->until = st->trigger;
+		st->throttled = true;
+		if (debug)
+			fprintf(stderr, "client %d throttling " INT64_FORMAT " us\n",
+					st->id, wait);
+	}
+
 	if (st->sleeping)
 	{							/* are we sleeping? */
 		instr_time	now;
-
+		int64		now_us;
 		INSTR_TIME_SET_CURRENT(now);
-		if (st->until <= INSTR_TIME_GET_MICROSEC(now))
+		now_us = INSTR_TIME_GET_MICROSEC(now);
+		if (st->until <= now_us)
+		{
 			st->sleeping = 0;	/* Done sleeping, go ahead with next command */
+			if (throttle && st->state == 0)
+				/* measure lag of throttled transaction */
+				st->throttle_lag += (now_us - st->until);
+		}
 		else
 			return true;		/* Still sleeping, nothing to do here */
 	}
@@ -1091,6 +1130,7 @@ top:
 			st->state = 0;
 			st->use_file = (int) getrand(thread, 0, num_files - 1);
 			commands = sql_files[st->use_file];
+			st->throttled = false;
 		}
 	}
@@ -2012,7 +2052,8 @@ process_builtin(char *tb)
 /* print out results */
 static void
-printResults(int ttype, int normal_xacts, int nclients,
+printResults(int ttype, int normal_xacts,
+			 CState *clients, int nclients,
 			 TState *threads, int nthreads,
 			 instr_time total_time, instr_time conn_total_time)
 {
@@ -2052,6 +2093,23 @@ printResults
 		printf("number of transactions actually processed: %d\n",
 			   normal_xacts);
 	}
+
+	if (throttle)
+	{
+		/*
+		 * Report average transaction lag under throttling, i.e. the delay
+		 * between scheduled and actual start times for the transaction.
+		 * The measured lag may be linked to the thread/client load,
+		 * the database load, or the Poisson throttling process.
+		 * Should it report the maximum encountered lag?
+		 */
+		int64		throttle_lag = 0;
+		int			c;
+		for (c = 0; c < nclients; c++)
+			throttle_lag += clients[c].throttle_lag;
+		printf("average transaction lag: %.3f ms\n",
+			   0.001 * throttle_lag / normal_xacts);
+	}
+
 	printf("tps = %f (including connections establishing)\n", tps_include);
 	printf("tps = %f (excluding connections establishing)\n", tps_exclude);
@@ -2112,6 +2170,7 @@ main(int argc, char **argv)
 		{"unlogged-tables", no_argument, &unlogged_tables, 1},
 		{"sampling-rate", required_argument, NULL, 4},
 		{"aggregate-interval", required_argument, NULL, 5},
+		{"throttle", required_argument, NULL, 'H'},
 		{NULL, 0,
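The throttling delay in this patch is drawn as throttle * -log(u) with u uniform in (0, 1], which gives exponentially distributed waits, i.e. Poisson-process transaction starts whose mean inter-arrival time equals the configured delay. A standalone sketch of that draw (using Python's RNG in place of pgbench's getrand):

```python
import math
import random

def throttle_delay(throttle_us: float, rng: random.Random) -> float:
    # Uniform in (0, 1], never 0, so log() stays finite; -log(u) is
    # an Exponential(1) variate, scaled by the configured delay.
    u = rng.randint(1, 1000000) / 1000000.0
    return throttle_us * -math.log(u)

rng = random.Random(42)
delays = [throttle_delay(25000.0, rng) for _ in range(200000)]
print(round(sum(delays) / len(delays)))  # close to 25000 (the mean delay)
```

Because E[-log(U)] = 1, the average wait converges to throttle_us, while individual waits vary widely, which is what the reported scheduling lag then measures against actual start times.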
Re: [HACKERS] Combo xids
On Sat, Jun 01, 2013 at 09:22:05AM +0100, Simon Riggs wrote: When would this make sense? Frequently. Most of the time a tuple needs only one xid set. In most cases, we set xmin and xmax a long time apart. Very few cases end with both of them set inside the *same* xmin horizon. In a heavy transactional environment, the horizon moves forwards quickly, on the order of a few seconds. Very few rows get inserted and then updated/deleted that quickly. With long reporting queries, data tends to be updated less, so again the rows aren't touched within the same horizon. As a result, we hardly ever need both xmin and xmax at the same time - when we need to set xmax, xmin is already committed/cleaned. Is this really true? Consider a long running query A and a tuple created by B after A. If another transaction comes to update B's tuple, you can't throw away the xmin, because you need it to prove that A can't see the tuple. Or is the idea to create multixacts for each combination of xmin/xmax encountered? And the assumption is that there aren't that many? That could be measured. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] detecting binary backup in progress
## Joe Conway (m...@joeconway.com): However there is a period of time after pg_start_backup() is first executed to when it completes, during which backup_label file does not exist yet, but the backup has essentially been started. Is there any way to detect this state? When I did some research on the very same question just a few weeks ago, I settled for external locks (lockfile, pg_advisory_lock(), ..., depending on your exact situation) around the backup-taking code. Regards, Christoph -- Spare Space
[HACKERS] Which table stored in which file in PGDATA/base/[db-oid]
Dear Hackers, I've created a new DB, and a bunch of files were created in base/12054; 12054 is the oid of the new DB. I want to find out which table is stored in each file. BTW, I read this: http://www.postgresql.org/docs/9.2/interactive/storage-file-layout.html I have 156 files with numerical names; vm and fsm files are ignored. 107 files match pg_class.reltoastrelid, so I have 49 files for which I do not know what tables are stored in them. Any idea how to find out? Thanks, Soroosh
Re: [HACKERS] Which table stored in which file in PGDATA/base/[db-oid]
On Sat, Jun 01, 2013 at 03:11:50PM +0430, Soroosh Sardari wrote: Dear Hackers, I've created a new DB, and a bunch of files were created in base/12054; 12054 is the oid of the new DB. I want to find out which table is stored in each file. BTW, I read this: http://www.postgresql.org/docs/9.2/interactive/storage-file-layout.html I have 156 files with numerical names; vm and fsm files are ignored. 107 files match pg_class.reltoastrelid, so I have 49 files for which I do not know what tables are stored in them. Any idea how to find out? From that page: "Each table and index is stored in a separate file. For ordinary relations, these files are named after the table or index's filenode number, which can be found in pg_class.relfilenode." Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] Which table stored in which file in PGDATA/base/[db-oid]
Yes, I have some files which are not in pg_class.relfilenode of any table or index. I want to know which table or index is stored in such files. From that page: "Each table and index is stored in a separate file. For ordinary relations, these files are named after the table or index's filenode number, which can be found in pg_class.relfilenode." Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] Which table stored in which file in PGDATA/base/[db-oid]
On Sat, Jun 01, 2013 at 03:27:40PM +0430, Soroosh Sardari wrote: Yes, I have some files which are not in pg_class.relfilenode of any table or index. I want to know which table or index is stored in such files. That shouldn't happen. Are you sure you're looking in the right database? Can you list the filenames? Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] Which table stored in which file in PGDATA/base/[db-oid]
2013/6/1 Martijn van Oosterhout klep...@svana.org On Sat, Jun 01, 2013 at 03:27:40PM +0430, Soroosh Sardari wrote: Yes, I have some files which are not in pg_class.relfilenode of any table or index. I want to know which table or index is stored in such files. That shouldn't happen. Are you sure you're looking in the right database? Can you list the filenames? According to the docs, this is possible if there have been operations on the table that change the filenode, like TRUNCATE. Also, some relations have a relfilenode of 0, like the pg_class catalogue table. Check more here: http://www.postgresql.org/docs/current/interactive/storage-file-layout.html It is recommended to use the pg_relation_filenode() function: http://www.postgresql.org/docs/current/interactive/functions-admin.html#FUNCTIONS-ADMIN-DBLOCATION -- Victor Y. Yegorov
Re: [HACKERS] Which table stored in which file in PGDATA/base/[db-oid]
On 2013-06-01 13:04:55 +0200, Martijn van Oosterhout wrote: On Sat, Jun 01, 2013 at 03:27:40PM +0430, Soroosh Sardari wrote: Yes, I have some files which are not in pg_class.relfilenode of any table or index. I want to know which table or index is stored in such files. That shouldn't happen. Are you sure you're looking in the right database? Can you list the filenames? It's actually entirely normal. For some system tables the actual relfilenode isn't stored in the system catalog but in the relation mapper. Those are a) tables needed to access the catalogs themselves, like pg_class, pg_attribute, pg_proc, ... b) shared tables, where we otherwise couldn't change the relfilenode from another database. To get the actual relfilenode you need to do something like: SELECT relname, pg_relation_filenode(pg_class.oid) FROM pg_class; Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Which table stored in which file in PGDATA/base/[db-oid]
Yes, I'm sure. The difference between filenodes and new files changed w.r.t. my first mail, because I added a table. I attached 3 files: newfile.pg : list of numerical files in base/[db-oid], ls | grep '[[:digit:]]\' filenode.pg : select distinct relfilenode from pg_class newfile-filenode.pg : set of oids which exist in newfile.pg and not in filenode.pg On Sat, Jun 1, 2013 at 3:34 PM, Martijn van Oosterhout klep...@svana.org wrote: On Sat, Jun 01, 2013 at 03:27:40PM +0430, Soroosh Sardari wrote: Yes, I have some files which are not in pg_class.relfilenode of any table or index. I want to know which table or index is stored in such files. That shouldn't happen. Are you sure you're looking in the right database? Can you list the filenames? Have a nice day,
Re: [HACKERS] detecting binary backup in progress
On 31 May 2013 21:06, Andres Freund and...@2ndquadrant.com wrote: On 2013-05-31 22:53:14 +0300, Heikki Linnakangas wrote: On 31.05.2013 22:36, Andres Freund wrote: On 2013-05-31 22:29:45 +0300, Heikki Linnakangas wrote: Note that pg_is_in_backup() just checks for the presence of $PGDATA/backup_label. Also note that pg_basebackup doesn't create backup_label in the server. It's included in the backup that's sent to the client, but it's never written to disk in the server. So checking for backup_label manually or with pg_is_in_backup() will return false even if pg_basebackup is running. Whoa. You are right, but I'd call that a bug. I don't understand why we aren't just checking XLogCtl->Insert.(nonExclusiveBackups || exclusiveBackup)? Well, it depends on what you imagine the function is used for. If you think of it as "will pg_start_backup() throw an error if I call it now, or do I need to call pg_stop_backup()", then the current behavior is correct. The manual says: pg_is_in_backup() returns bool: "True if an on-line exclusive backup is still in progress." So clearly that is intentional. Well, just because it's intentional doesn't mean it's a good idea ;). There very well are reasons to check for in-progress non-exclusive backups as well. You e.g. wouldn't want to restart the database while the weekly base backup of your 1TB database is in progress, just because it's done via the replication protocol. If we weren't in beta 1 already I'd vote for making it into: pg_backup_in_progress(OUT bool exclusive, OUT int non_exclusive) or similar. Perhaps we should do that anyway? That could use some rephrasing, though; a layman won't know what an exclusive backup is. True. Although I have to admit I can't come up with a succinct name for it right now. I see that this exact discussion has happened once before, after the initial commit. AFAICS nobody likes the fact that pg_is_in_backup() only covers exclusive backups. The problem seems to be that we can't find a better term.
But the problem remains that having a function called pg_is_in_backup() that *clearly* does *not* do what it says, is a problem. Yes, few people will understand what an exclusive backup is, but that is a very good reason to not split hairs in the definition. The way to resolve this is to have two functions: pg_is_in_backup() - which covers both/all kinds of backup pg_is_in_exclusive_backup() - which covers just the exclusive backup mode and some clear documentation that explains why the two functions are necessary. Any objections to me committing those changes? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Freezing without write I/O
On 30 May 2013 14:33, Heikki Linnakangas hlinnakan...@vmware.com wrote: Since we're bashing around ideas around freezing, let me write down the idea I've been pondering and discussing with various people for years. I don't think I invented this myself, apologies to whoever did for not giving credit. The reason we have to freeze is that otherwise our 32-bit XIDs wrap around and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but that would waste a lot of space. The trick is to add a field to the page header indicating the 'epoch' of the XID, while keeping the XIDs in the tuple header 32-bit wide (*). The other reason we freeze is to truncate the clog. But with 64-bit XIDs, we wouldn't actually need to change old XIDs on disk to FrozenXid. Instead, we could implicitly treat anything older than relfrozenxid as frozen. That's the basic idea. Vacuum freeze only needs to remove dead tuples, but doesn't need to dirty pages that contain no dead tuples. I have to say this is pretty spooky. I'd not read hackers all week, so I had no idea so many other people were thinking about freezing as well. This idea is damn near identical to what I've suggested. My suggestion came because I was looking to get rid of fields in the tuple header, which didn't come to much. The good news is that this is complete chance, so it must mean we're on the right track. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Freezing without write I/O
On 30 May 2013 19:39, Robert Haas robertmh...@gmail.com wrote: On Thu, May 30, 2013 at 9:33 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: The reason we have to freeze is that otherwise our 32-bit XIDs wrap around and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but that would waste a lot space. The trick is to add a field to the page header indicating the 'epoch' of the XID, while keeping the XIDs in tuple header 32-bit wide (*). Check. The other reason we freeze is to truncate the clog. But with 64-bit XIDs, we wouldn't actually need to change old XIDs on disk to FrozenXid. Instead, we could implicitly treat anything older than relfrozenxid as frozen. Check. That's the basic idea. Vacuum freeze only needs to remove dead tuples, but doesn't need to dirty pages that contain no dead tuples. Check. Yes, this is the critical point. Large insert-only tables don't need to be completely re-written twice. Since we're not storing 64-bit wide XIDs on every tuple, we'd still need to replace the XIDs with FrozenXid whenever the difference between the smallest and largest XID on a page exceeds 2^31. But that would only happen when you're updating the page, in which case the page is dirtied anyway, so it wouldn't cause any extra I/O. It would cause some extra WAL activity, but it wouldn't dirty the page an extra time. This would also be the first step in allowing the clog to grow larger than 2 billion transactions, eliminating the need for anti-wraparound freezing altogether. You'd still want to truncate the clog eventually, but it would be nice to not be pressed against the wall with run vacuum freeze now, or the system will shut down. Interesting. That seems like a major advantage. (*) Adding an epoch is inaccurate, but I like to use that as my mental model. If you just add a 32-bit epoch field, then you cannot have xids from different epochs on the page, which would be a problem. 
In reality, you would store one 64-bit XID value in the page header, and use that as the reference point for all the 32-bit XIDs on the tuples. See the existing convert_txid() function for how that works. Another method is to store the 32-bit xid values in tuple headers as offsets from the per-page 64-bit value, but then you'd always need to have the 64-bit value at hand when interpreting the XIDs, even if they're all recent. As I see it, the main downsides of this approach are: (1) It breaks binary compatibility (unless you do something to provide for it, like put the epoch in the special space). (2) It consumes 8 bytes per page. I think it would be possible to get this down to say 5 bytes per page pretty easily; we'd simply decide that the low-order 3 bytes of the reference XID must always be 0. Possibly you could even do with 4 bytes, or 4 bytes plus some number of extra bits. Yes, the idea of having a base XID on every page is complicated and breaks compatibility. The same idea can work well if we do this via tuple headers. (3) You still need to periodically scan the entire relation, or else have a freeze map as Simon and Josh suggested. I don't think that is needed with this approach. (The freeze map was Andres' idea, not mine. I just accepted it as what I thought was the only way forwards. Now I see other ways.) The upsides of this approach as compared with what Andres and I are proposing are: (1) It provides a stepping stone towards allowing indefinite expansion of CLOG, which is quite appealing as an alternative to a hard shut-down. I would be against expansion of the CLOG beyond its current size. If we have removed all aborted rows and marked hints, then we don't need the CLOG values and can trim that down. I don't mind the hints, it's the freezing we don't need.
Another method is to store the 32-bit xid values in tuple headers as offsets from the per-page 64-bit value, but then you'd always need to have the 64-bit value at hand when interpreting the XIDs, even if they're all recent. You've touched here on the idea of putting the epoch in the tuple header, which is where what I posted comes together. We don't need anything at page level, we just need something on each tuple. Please can you look at my recent post on how to put this in the tuple header? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
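A convert_txid()-style reconstruction of a full XID from a page's 64-bit reference value and a tuple's stored 32-bit XID might look like the sketch below. The names are illustrative (the real convert_txid() works from an epoch/xid pair rather than a per-page reference), and it assumes the tuple XID lies within 2^31 of the reference, which the page-rebase rule is meant to guarantee:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: rebuild a 64-bit XID from a 64-bit per-page reference
 * XID and a tuple's 32-bit XID.  The signed 32-bit difference handles
 * values on either side of the reference, including across a 2^32
 * wraparound boundary. */
static uint64_t
full_xid(uint64_t page_ref_xid, uint32_t tuple_xid)
{
    int32_t diff = (int32_t) (tuple_xid - (uint32_t) page_ref_xid);
    return page_ref_xid + diff;
}
```

Note that with this scheme a reader only needs the page header's reference value, never a global epoch counter, to interpret any tuple XID on the page.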
Re: [HACKERS] Running pgindent
* Bruce Momjian (br...@momjian.us) wrote: OK. Done. Thanks, Stephen
Re: [HACKERS] detecting binary backup in progress
Simon Riggs si...@2ndquadrant.com writes: The way to resolve this is to have two functions: pg_is_in_backup() - which covers both/all kinds of backup pg_is_in_exclusive_backup() - which covers just the exclusive backup mode What will you do with pg_backup_start_time()? regards, tom lane
[HACKERS] Placing hints in line pointers
Notes on a longer term idea... An item pointer (also called line pointer) is used to allow an external pointer to an item, while allowing us to place the tuple anywhere on the page. An ItemId is 4 bytes long and currently consists of (see src/include/storage/itemid.h)...

typedef struct ItemIdData
{
    unsigned    lp_off:15,      /* offset to tuple (from start of page) */
                lp_flags:2,     /* state of item pointer, see below */
                lp_len:15;      /* byte length of tuple */
} ItemIdData;

The offset to the tuple is 15 bits, which is sufficient to point to 32768 separate byte positions, and hence why we limit ourselves to 32kB blocks. If we use 4-byte alignment for tuples, then we wouldn't ever use the lower 2 bits of lp_off, nor would we use the lower 2 bits of lp_len. They are always set to zero. (Obviously, with 8-byte alignment we would have 3 spare bits in each, but I'm looking for something that works the same on various architectures, for simplicity.) So my suggestion is to make lp_off and lp_len store their values in terms of 4-byte chunks, which would allow us to rework the data structure like this...

typedef struct ItemIdData
{
    unsigned    lp_off:13,      /* offset to tuple (from start of page), in 4-byte chunks */
                lp_xmin_hint:2, /* committed and invalid hints for xmin */
                lp_flags:2,     /* state of item pointer, see below */
                lp_len:13,      /* byte length of tuple, in 4-byte chunks */
                lp_xmax_hint:2; /* committed and invalid hints for xmax */
} ItemIdData;

i.e. we have room for 4 additional bits, and we use those to hold the tuple hints for xmin and xmax. Doing this would have two purposes: * We wouldn't need to follow the pointer if the row is marked aborted. This would save a random memory access for that tuple. * It would isolate the tuple hint values into a smaller area of the block, so we would be able to avoid the annoyance of recalculating the checksums for the whole block when a single bit changes.
We wouldn't need to do a FPW when a hint changes, we would only need to take a copy of the ItemId array, which is much smaller. And it could be protected by its own checksum. (In addition, if we wanted, this could be used to extend block size to 64kB if we used 8-byte alignment for tuples) -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
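A self-contained sketch of the proposed encoding (the struct and helper names here are mine, not from the proposal): storing offsets and lengths as counts of 4-byte chunks keeps the ItemId at 32 bits while freeing four bits for the xmin/xmax hints, and still addresses every 4-byte-aligned position in a 32kB page.

```c
#include <assert.h>

/* Hypothetical layout: 13+2+2+13+2 = 32 bits, the same size as
 * today's ItemIdData. */
typedef struct ChunkItemId
{
    unsigned    lp_off:13,      /* offset in 4-byte chunks */
                lp_xmin_hint:2, /* committed/invalid hints for xmin */
                lp_flags:2,     /* state of item pointer */
                lp_len:13,      /* length in 4-byte chunks */
                lp_xmax_hint:2; /* committed/invalid hints for xmax */
} ChunkItemId;

#define CHUNK_SIZE 4

/* Convert the stored chunk counts back to byte values. */
static unsigned item_offset_bytes(ChunkItemId id) { return id.lp_off * CHUNK_SIZE; }
static unsigned item_length_bytes(ChunkItemId id) { return id.lp_len * CHUNK_SIZE; }
```

With 13 bits the largest encodable offset is 8191 * 4 = 32764 bytes, just inside a 32kB page; the same trick with 8-byte chunks is what would allow the 64kB blocks mentioned above.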
Re: [HACKERS] pg_dump with postgis extension dumps rules separately
On 05/31/2013 08:46 PM, Robert Haas wrote: On Wed, May 29, 2013 at 6:55 PM, Joe Conway m...@joeconway.com wrote: OK, simple enough. New patch attached. I still need to do some testing to verify this does not break anything, but other than that, any complaints (including the notion of backpatching this back to 9.1)? Here's a cleaned up version, which also includes documentation. I'll commit back to 9.1 in a day or two unless there are any objections. Changing SQL syntax in the back-branches isn't normally something we do, but I confess I don't see any real reason not to do it in this case. That was part of my hesitation, but I don't see any better way to fix existing installations and this is pretty well self-contained. Any other opinions out there? Joe -- Joe Conway credativ LLC: http://www.credativ.us Linux, PostgreSQL, and general Open Source Training, Service, Consulting, 24x7 Support
Re: [HACKERS] pg_dump with postgis extension dumps rules separately
Joe Conway m...@joeconway.com writes: On 05/31/2013 08:46 PM, Robert Haas wrote: Changing SQL syntax in the back-branches isn't normally something we do, but I confess I don't see any real reason not to do it in this case. That was part of my hesitation, but I don't see any better way to fix existing installations and this is pretty well self-contained. Any other opinions out there? I don't like this approach much. 1. It does nothing to fix the issue in *existing* databases, which won't have pg_depend entries like this. 2. In general, we have assumed that properties of tables, such as indexes and constraints, cannot be independent members of extensions. It's not clear to me why rules should be different. regards, tom lane
Re: [HACKERS] detecting binary backup in progress
On 06/01/2013 02:43 AM, Christoph Moench-Tegeder wrote: ## Joe Conway (m...@joeconway.com): However there is a period of time after pg_start_backup() is first executed to when it completes, during which the backup_label file does not exist yet, but the backup has essentially been started. Is there any way to detect this state? When I did some research on the very same question just a few weeks ago, I settled for external locks (lockfile, pg_advisory_lock(), ..., depending on your exact situation) around the backup-taking code. Right, and an external lockfile is good except there is a race condition. Proc1, t0) SELECT pg_start_backup(...) -- command starts to execute Proc2, t1) lay down a lockfile (or other mechanism) Proc2, t2) check for running backup by looking for backup_label Proc1, t3) SELECT pg_start_backup(...) -- command finishes, -- backup_label created So you are forced to sleep for some arbitrary time just in case pg_start_backup() has started but not completed at the point when you try to lock out the backup. Or better (at least my solution) is to check pg_stat_activity for the presence of a running SELECT pg_start_backup(...) command prior to checking for backup_label. I imagine there is still a small race condition there, but at least it is much smaller.
Joe -- Joe Conway credativ LLC: http://www.credativ.us Linux, PostgreSQL, and general Open Source Training, Service, Consulting, 24x7 Support
Re: [HACKERS] detecting binary backup in progress
On 2013-06-01 08:11:26 -0700, Joe Conway wrote: On 06/01/2013 02:43 AM, Christoph Moench-Tegeder wrote: ## Joe Conway (m...@joeconway.com): However there is a period of time after pg_start_backup() is first executed to when it completes, during which backup_label file does not exist yet, but the backup has essentially been started. Is there any way to detect this state? When I did some research on the very same question just a few weeks ago, I settled for external locks (lockfile, pg_advisory_lock(), ..., depending on your exact situation) around the backup-taking code. Right, and an external lockfile is good except there is a race condition. Proc1, t0) SELECT pg_start_backup(...) -- command starts to execute Proc2, t1) lay down a lockfile (or other mechanism) Proc2, t2) check for running backup by looking for backup_label Proc1, t3) SELECT pg_start_backup(...) -- command finishes, -- backup_label created So you are forced to sleep for some arbitrary time just in case pg_start_backup() has started but not completed at the point when you try to lock out the backup. Uh. Why would you do the lock(file) thingy *after* calling pg_start_backup? You should do lock before calling start backup and remove the lock after calling stop backup. In that case I don't see where the race condition is? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
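The ordering Andres describes — take the lock before pg_start_backup() and release it only after pg_stop_backup() — only avoids the race when every backup job cooperates, but the locking itself is simple. A minimal sketch using an O_EXCL lockfile (the path and function names are illustrative, not from any PostgreSQL tool):

```c
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* Acquire the lock *before* calling pg_start_backup(); release it only
 * after pg_stop_backup() has returned.  O_CREAT|O_EXCL makes file
 * creation atomic, so two cooperating jobs cannot both hold the lock. */
static bool
acquire_backup_lock(const char *path)
{
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd < 0)
        return false;           /* another backup is in progress */
    close(fd);
    return true;
}

static void
release_backup_lock(const char *path)
{
    unlink(path);
}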
Re: [HACKERS] pg_dump with postgis extension dumps rules separately
On 2013-06-01 11:07:53 -0400, Tom Lane wrote: Joe Conway m...@joeconway.com writes: On 05/31/2013 08:46 PM, Robert Haas wrote: Changing SQL syntax in the back-branches isn't normally something we do, but I confess I don't see any real reason not to do it in this case. That was part of my hesitation, but I don't see any better way to fix existing installations and this is pretty well self-contained. Any other opinions out there? I don't like this approach much. 1. It does nothing to fix the issue in *existing* databases, which won't have pg_depend entries like this. Well, you can now write an extension upgrade script that adds the missing dependencies. To me that sounds better than letting it fiddle with pg_depend. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] pg_dump with postgis extension dumps rules separately
On 06/01/2013 08:07 AM, Tom Lane wrote: Joe Conway m...@joeconway.com writes: On 05/31/2013 08:46 PM, Robert Haas wrote: Changing SQL syntax in the back-branches isn't normally something we do, but I confess I don't see any real reason not to do it in this case. That was part of my hesitation, but I don't see any better way to fix existing installations and this is pretty well self-contained. Any other opinions out there? I don't like this approach much. 1. It does nothing to fix the issue in *existing* databases, which won't have pg_depend entries like this. The grammar change does allow the pg_depend entries to be added through ALTER EXTENSION. It works perfectly. 2. In general, we have assumed that properties of tables, such as indexes and constraints, cannot be independent members of extensions. It's not clear to me why rules should be different. I can look at having pg_dump ignore these entries, but I suspect that will be quite a bit more invasive.
Joe -- Joe Conway credativ LLC: http://www.credativ.us Linux, PostgreSQL, and general Open Source Training, Service, Consulting, 24x7 Support
Re: [HACKERS] detecting binary backup in progress
On 06/01/2013 08:14 AM, Andres Freund wrote: On 2013-06-01 08:11:26 -0700, Joe Conway wrote: On 06/01/2013 02:43 AM, Christoph Moench-Tegeder wrote: ## Joe Conway (m...@joeconway.com): However there is a period of time after pg_start_backup() is first executed to when it completes, during which backup_label file does not exist yet, but the backup has essentially been started. Is there any way to detect this state? When I did some research on the very same question just a few weeks ago, I settled for external locks (lockfile, pg_advisory_lock(), ..., depending on your exact situation) around the backup-taking code. Right, and an external lockfile is good except there is a race condition. Proc1, t0) SELECT pg_start_backup(...) -- command starts to execute Proc2, t1) lay down a lockfile (or other mechanism) Proc2, t2) check for running backup by looking for backup_label Proc1, t3) SELECT pg_start_backup(...) -- command finishes, -- backup_label created So you are forced to sleep for some arbitrary time just in case pg_start_backup() has started but not completed at the point when you try to lock out the backup. Uh. Why would you do the lock(file) thingy *after* calling pg_start_backup? You should do lock before calling start backup and remove the lock after calling stop backup. In that case I don't see where the race condition is? No, the point is I have no control over Proc1. I am trying to prevent Proc2 from running concurrently with a binary backup (Proc1). So I need to lock out Proc1, or detect it is running and wait for it to finish. The problem is that in between t0 and t3 there is no good way to determine that Proc1 has started its backup because backup_label does not yet exist, and there is no other evidence besides pg_stat_activity (which as I said is better than nothing but probably not perfect).
Joe -- Joe Conway credativ LLC: http://www.credativ.us Linux, PostgreSQL, and general Open Source Training, Service, Consulting, 24x7 Support
Re: [HACKERS] detecting binary backup in progress
On 2013-06-01 08:27:42 -0700, Joe Conway wrote: Uh. Why would you do the lock(file) thingy *after* calling pg_start_backup? You should do lock before calling start backup and remove the lock after calling stop backup. In that case I don't see where the race condition is? No, the point is I have no control over Proc1. I am trying to prevent Proc2 from running concurrently with a binary backup (Proc1). So I need to lock out Proc1, or detect it is running and wait for it to finish. Backups over which you don't have control sound a bit scary ;). I think at that point you have a race condition no matter what, since the backup could be started between your check and when you call pg_start_backup anyway. So just calling pg_start_backup and handling the error properly sounds like the way to go in that case. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] pg_dump with postgis extension dumps rules separately
Andres Freund and...@2ndquadrant.com writes: On 2013-06-01 11:07:53 -0400, Tom Lane wrote: I don't like this approach much. 1. It does nothing to fix the issue in *existing* databases, which won't have pg_depend entries like this. Well, you can now write an extension upgrade script that adds the missing dependencies. To me that sounds better than letting it fiddle with pg_depend. Per my point #2, that would be the wrong solution, quite aside from the wrongness of dumping the fixup burden on the extension author. For one thing, the extension author has no idea whether his script is being loaded into a database that has this patch. If it doesn't, adding a command like this would cause the script to fail outright. If it does, then the command is unnecessary, since the patch also includes a code change that adds the dependency. But in any case, making rules act differently from other table properties for this purpose seems seriously wrong. regards, tom lane
Re: [HACKERS] pg_dump with postgis extension dumps rules separately
On 2013-06-01 11:31:05 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2013-06-01 11:07:53 -0400, Tom Lane wrote: I don't like this approach much. 1. It does nothing to fix the issue in *existing* databases, which won't have pg_depend entries like this. Well, you can now write an extension upgrade script that adds the missing dependencies. To me that sounds better than letting it fiddle with pg_depend. Per my point #2, that would be the wrong solution, quite aside from the wrongness of dumping the fixup burden on the extension author. For one thing, the extension author has no idea whether his script is being loaded into a database that has this patch. If it doesn't, adding a command like this would cause the script to fail outright. If it does, then the command is unnecessary, since the patch also includes a code change that adds the dependency. But in any case, making rules act differently from other table properties for this purpose seems seriously wrong. What's your proposal to fix this situation then? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] pg_dump with postgis extension dumps rules separately
On 06/01/2013 08:32 AM, Andres Freund wrote: On 2013-06-01 11:31:05 -0400, Tom Lane wrote: But in any case, making rules act differently from other table properties for this purpose seems seriously wrong. What's your proposal to fix this situation then? I gather Tom would rather see this handled by filtering in pg_dump. Joe -- Joe Conway credativ LLC: http://www.credativ.us Linux, PostgreSQL, and general Open Source Training, Service, Consulting, 24x7 Support
Re: [HACKERS] pg_dump with postgis extension dumps rules separately
Joe Conway m...@joeconway.com writes: I can look at having pg_dump ignore these entries, but I suspect that will be quite a bit more invasive. Actually, I believe the answer is just that getSchemaData() is doing things in the wrong order:

    if (g_verbose)
        write_msg(NULL, "reading rewrite rules\n");
    getRules(fout, &numRules);

    /*
     * Identify extension member objects and mark them as not to be dumped.
     * This must happen after reading all objects that can be direct members
     * of extensions, but before we begin to process table subsidiary objects.
     */
    if (g_verbose)
        write_msg(NULL, "finding extension members\n");
    getExtensionMembership(fout, extinfo, numExtensions);

Per that comment, getRules() should be called down where indexes, constraints, and triggers are processed. regards, tom lane
Re: [HACKERS] pg_dump with postgis extension dumps rules separately
I wrote: Actually, I believe the answer is just that getSchemaData() is doing things in the wrong order: BTW, I'm inclined to think it's also wrong that the getEventTriggers() call was just added at the end; those things are certainly not table subsidiary objects. I don't know if we allow event triggers to be extension members, but if we did, that call would have to occur before getExtensionMembership(). regards, tom lane
Re: [HACKERS] pg_dump with postgis extension dumps rules separately
On 06/01/2013 09:39 AM, Tom Lane wrote: I wrote: Actually, I believe the answer is just that getSchemaData() is doing things in the wrong order: BTW, I'm inclined to think it's also wrong that the getEventTriggers() call was just added at the end; those things are certainly not table subsidiary objects. I don't know if we allow event triggers to be extension members, but if we did, that call would have to occur before getExtensionMembership(). Thanks! I'll have a look. Joe -- Joe Conway credativ LLC: http://www.credativ.us Linux, PostgreSQL, and general Open Source Training, Service, Consulting, 24x7 Support
Re: [HACKERS] Vacuum, Freeze and Analyze: the big picture
On 30 May 2013 19:48, Josh Berkus j...@agliodbs.com wrote: There's currently some great ideas bouncing around about eliminating the overhead associated with FREEZE. However, I wanted to take a step back and take a look at the big picture for VACUUM, FREEZE and ANALYZE. That is a very commendable approach. We should do that more often. The big, big picture is this: 90% of our users need to think about VACUUM/ANALYZE at least 10% of the time, and 10% of our users need to think about it almost 90% of the time. When you say stuff like that, you should add "speculating from my personal experience". People might get the impression you'd measured this somehow, and it could confuse the issue if you try to assemble a high-level viewpoint and then add in factoids that are just opinions. We should strive to measure such things. That's considerably better than was the case 5 years ago, when vacuum management was a daily or weekly responsibility for nearly 100% of our users, but it's still not good enough. Our target should be that only those with really unusual setups should have to *ever* think about vacuum and analyze. I think that's where we already are, given that 1000s of users have quite small databases. The problem increases with scale. Larger databases have bigger problems and make it easier to notice things are happening. I think you should mention that the evidence for these issues is anecdotal and take careful notes of the backgrounds in which they occurred. Saying things occur in all cases wouldn't be accurate or helpful to their resolution. We should be seeking to contrast this against other databases to see if we are better or worse than other systems. For example, recording the moans of someone who is currently managing a 1 TB database, but hasn't ever managed anything else that big, is less valuable than a balanced, experienced viewpoint (if such exists). Anyway, I support this approach, just wanted to make sure we do it in sufficient detail to be useful.
-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] getting rid of freezing
On 28 May 2013 15:15, Robert Haas robertmh...@gmail.com wrote: On Sat, May 25, 2013 at 6:14 AM, Simon Riggs si...@2ndquadrant.com wrote: I think the right way is actually to rethink and simplify all this complexity of Freezing/Pruning/Hinting/Visibility I agree, but I think that's likely to have to wait until we get a pluggable storage API, and then a few years beyond that for someone to develop the technology to enable the new and better way. In the meantime, if we can eliminate or even reduce the impact of freezing in the near term, I think that's worth doing. I think we can do better more quickly than that. Andres' basic idea of skipping freeze completely was a valuable one and is the right way forwards. And it looks like the epoch-based approach that Heikki and I have come up with seems likely to end up somewhere workable. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] detecting binary backup in progress
On 1 June 2013 15:45, Tom Lane t...@sss.pgh.pa.us wrote: Simon Riggs si...@2ndquadrant.com writes: The way to resolve this is to have two functions: pg_is_in_backup() - which covers both/all kinds of backup pg_is_in_exclusive_backup() - which covers just the exclusive backup mode What will you do with pg_backup_start_time()? Hmm, at least all of those functions use the backup name consistently. I guess I wasn't suggesting we rename pg_start_backup() to pg_start_exclusive_backup(), so maybe it makes sense. pg_start_backup() talks about an online backup, while pg_is_in_backup() talks about an exclusive backup. The minimum change here would be to make pg_start_backup() talk about an exclusive backup also. What we need is a function that says whether it is possible to shut down because of a backup, or not. pg_basebackup is an active task, whereas an exclusive backup never is. So we need a function to tell us that, although nothing else is running, we are still running an exclusive backup. So changing pg_is_in_backup() to refer to all kinds of backup still allows it to be used for its primary purpose - to tell whether it's OK to shut down or not - but it also makes it clearer. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Combo xids
On 01.06.2013 11:22, Simon Riggs wrote: What is the benefit? Merging xmin/xmax would save 4 bytes per row. On servers with 8-byte word length, that means that we'd save 8 bytes per row for tables that have between 9 and 40 columns. Which is the majority of tables. Hmm, it would probably be much easier to squeeze, say, one byte from the tuple header than a full four bytes. Then you could store a null bitmap for up to 16 columns without crossing the 24-byte mark. That would already get you much of the benefit. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Combo xids
On 1 June 2013 19:25, Heikki Linnakangas hlinnakan...@vmware.com wrote: Hmm, it would probably be much easier to squeeze, say, one byte from the tuple header than a full four bytes. Then you could store a null bitmap for up to 16 columns without crossing the 24-byte mark. That would already get you much of the benefit. It seemed worth recording the idea, though I agree that now it's recorded it's not the best idea for reducing table size. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [BUGS] COPY .... (FORMAT binary) syntax doesn't work
On 27 May 2013 15:31, Tom Lane t...@sss.pgh.pa.us wrote: Simon Riggs si...@2ndquadrant.com writes: On 26 May 2013 17:10, Tom Lane t...@sss.pgh.pa.us wrote: More readable would be to invent an intermediate nonterminal falling between ColId and ColLabel, whose expansion would be IDENT | unreserved_keyword | col_name_keyword | type_func_name_keyword, and then replace ColId_or_Sconst with whatever-we-call-that_or_Sconst. Any thoughts about a name for that new nonterminal? Do you think complicating the parser in that way is worth the trouble for this case? Could that slow down parsing? It makes the grammar tables a bit larger (1% or so IIRC). There would be some distributed penalty for that, but probably not much. Of course there's always the slippery-slope argument about that. We don't actually have to fix it; clearly not too many people are bothered, since no complaints in 3 years. Documenting 'binary' seems better. Well, my thought is there are other cases. For instance:

regression=# create role binary;
ERROR:  syntax error at or near "binary"
LINE 1: create role binary;
                    ^
regression=# create user cross;
ERROR:  syntax error at or near "cross"
LINE 1: create user cross;
                    ^

If we don't have to treat type_func_name_keywords as reserved in these situations, shouldn't we avoid doing so? Seems a reasonable argument, so +1. Sorry for the delayed reply. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Freezing without write I/O
On 31.05.2013 06:02, Robert Haas wrote: On Thu, May 30, 2013 at 2:39 PM, Robert Haasrobertmh...@gmail.com wrote: Random thought: Could you compute the reference XID based on the page LSN? That would eliminate the storage overhead. After mulling this over a bit, I think this is definitely possible. We begin a new half-epoch every 2 billion transactions. We remember the LSN at which the current half-epoch began and the LSN at which the previous half-epoch began. When a new half-epoch begins, the first backend that wants to stamp a tuple with an XID from the new half-epoch must first emit a new half-epoch WAL record, which becomes the starting LSN for the new half-epoch. Clever! Pages in unlogged tables need some extra treatment, as they don't normally have a valid LSN, but that shouldn't be too hard. We define a new page-level bit, something like PD_RECENTLY_FROZEN. When this bit is set, it means there are no unfrozen tuples on the page with XIDs that predate the current half-epoch. Whenever we know this to be true, we set the bit. If the page LSN crosses more than one half-epoch boundary at a time, we freeze the page and set the bit. If the page LSN crosses exactly one half-epoch boundary, then (1) if the bit is set, we clear it and (2) if the bit is not set, we freeze the page and set the bit. Yep, I think that would work. Want to write the patch, or should I? ;-) - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Freezing without write I/O
On 1 June 2013 19:48, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 31.05.2013 06:02, Robert Haas wrote: On Thu, May 30, 2013 at 2:39 PM, Robert Haasrobertmh...@gmail.com wrote: Random thought: Could you compute the reference XID based on the page LSN? That would eliminate the storage overhead. After mulling this over a bit, I think this is definitely possible. We begin a new half-epoch every 2 billion transactions. We remember the LSN at which the current half-epoch began and the LSN at which the previous half-epoch began. When a new half-epoch begins, the first backend that wants to stamp a tuple with an XID from the new half-epoch must first emit a new half-epoch WAL record, which becomes the starting LSN for the new half-epoch. Clever! Pages in unlogged tables need some extra treatment, as they don't normally have a valid LSN, but that shouldn't be too hard. I like the idea of using the LSN to indicate the epoch. It saves any other work we might consider, such as setting page or tuple level epochs. We define a new page-level bit, something like PD_RECENTLY_FROZEN. When this bit is set, it means there are no unfrozen tuples on the page with XIDs that predate the current half-epoch. Whenever we know this to be true, we set the bit. If the page LSN crosses more than one half-epoch boundary at a time, we freeze the page and set the bit. If the page LSN crosses exactly one half-epoch boundary, then (1) if the bit is set, we clear it and (2) if the bit is not set, we freeze the page and set the bit. Yep, I think that would work. Want to write the patch, or should I? ;-) If we set a bit, surely we need to write the page. Isn't that what we were trying to avoid? Why set a bit at all? If we know the LSN of the page and we know the epoch boundaries, then we can work out when the page was last written to and infer that the page is virtually frozen. As soon as we make a change to a virtually frozen page, we can actually freeze it and then make the change. 
But we still have the problem of knowing which pages have been frozen and which haven't. Can we clear up those points first? Or at least my understanding of them. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Freezing without write I/O
On Sat, Jun 1, 2013 at 2:48 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: We define a new page-level bit, something like PD_RECENTLY_FROZEN. When this bit is set, it means there are no unfrozen tuples on the page with XIDs that predate the current half-epoch. Whenever we know this to be true, we set the bit. If the page LSN crosses more than one half-epoch boundary at a time, we freeze the page and set the bit. If the page LSN crosses exactly one half-epoch boundary, then (1) if the bit is set, we clear it and (2) if the bit is not set, we freeze the page and set the bit. Yep, I think that would work. Want to write the patch, or should I? ;-) Have at it. I think the tricky part is going to be figuring out the synchronization around half-epoch boundaries. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Freezing without write I/O
On Sat, Jun 1, 2013 at 3:22 PM, Simon Riggs si...@2ndquadrant.com wrote: If we set a bit, surely we need to write the page. Isn't that what we were trying to avoid? No, the bit only gets set in situations when we were going to dirty the page for some other reason anyway. Specifically, if a page modification discovers that we've switched epochs (but just once) and the bit isn't already set, we can set it in lieu of scanning the entire page for tuples that need freezing. Under this proposal, pages that don't contain any dead tuples needn't be dirtied for freezing, ever. Smells like awesome. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Optimising Foreign Key checks
On Sat, Jun 01, 2013 at 09:41:13AM +0100, Simon Riggs wrote: FK checks can be expensive, especially when loading large volumes of data into an existing table or partition. A couple of ideas for improving performance are discussed here:

1. Use Case: Bulk loading

COPY pgbench_accounts; -- references pgbench_branches with many repeated values

Proposal: Transactions that need multiple checks can be optimised by simply LOCKing the whole referenced table, once. We can then hold the referenced table as a Hash, like we do with a Hash Join (it's almost exactly the same thing). This works in two ways: it speeds up checks and it also reduces the locking overhead.

2. Use Case: Transactional repetition

BEGIN;
INSERT INTO order VALUES (ordid, )
INSERT INTO order_line VALUES (ordid, 1, .)
INSERT INTO order_line VALUES (ordid, 2, .)
INSERT INTO order_line VALUES (ordid, 3, .)
INSERT INTO order_line VALUES (ordid, 4, .)

Proposal: De-duplicate multiple checks against the same value. This would be implemented by keeping a hash of rows that we had already either inserted and/or locked as the transaction progresses, so we can use the hash to avoid queuing up after triggers.

I find (2) more promising than (1). It helps automatically, and it helps in more cases. The main advantage of (1) is avoiding the writes of tuple locks onto the PK table. Therefore, I expect (1) to distinguish itself over (2) when the referenced values are _not_ repeated too much. If referenced values are repeated, tuple locking costs would tend to disappear into the noise after the deduplication of (2).

In both cases we are building up a local hash table with values and then using those values to avoid queuing constraint triggers. So code is similar for both. Thoughts? Will this add too much cost where it doesn't help? I don't know what to predict there.
There's the obvious case of trivial transactions with no more than one referential integrity check per FK, but there's also the case of a transaction with many FK checks all searching different keys. If the hash hit rate (key duplication rate) is low, the hash can consume considerably more memory than the trigger queue without preventing many RI queries. What sort of heuristic could we use to avoid pessimizing such cases? Same-transaction UPDATE or DELETE of the PK table, as well as subtransaction abort, potentially invalidates hash entries. I recommend thinking relatively early about how best to handle that. Before deciding what to think overall, I needed to see a benchmark. I ran this simple one based on your scenarios:

BEGIN;
CREATE TABLE order (orderid int PRIMARY KEY);
CREATE TABLE order_line (
    orderid int,
    lineno int,
    PRIMARY KEY (orderid, lineno),
    FOREIGN KEY (orderid) REFERENCES order
);
INSERT INTO order VALUES (1);
INSERT INTO order_line SELECT 1, n FROM generate_series(1,100) t(n);
ROLLBACK;

See attached output from perf report -s parent -g graph,5,caller; I suggest browsing under less -S. It confirms that the expensive part is something your proposals would address. A different way to help the bulk loading case would be to lock more keys with a single query. Currently, we issue a query like this for every INSERTed row:

SELECT 1 FROM ONLY pktable WHERE pkcol = $1 FOR KEY SHARE OF x

We could instead consider queued tuples in batches:

SELECT 1 FROM ONLY pktable WHERE pkcol = ANY (ARRAY[$1,$2,...,$100]) FOR KEY SHARE OF x

This would have the advantage of not adding a long-lived, potentially-large data structure and not depending on the rate of referenced key duplication. But it would not help non-bulk scenarios. (However, the user could make your example for (2) become a bulk scenario by deferring the FK constraint.)

Thanks,
nm

--
Noah Misch
EnterpriseDB                     http://www.enterprisedb.com

# Events: 20K cycles
#
# Overhead  Parent symbol
# ........  .............
#
   100.00%  [other]
            |
            --- (nil)
               |
                --96.88%-- __libc_start_main
                           generic_start_main.isra.0
                           |
                            --96.77%-- main
                                       PostmasterMain
                                       |
                                        --95.62%-- PostgresMain
                                                   |
                                                    --95.51%-- PortalRun
                                                               PortalRunMulti
                                                               |
                                                                --95.50%-- ProcessQuery
                                                                           |
                                                                           |--71.37%-- standard_ExecutorFinish
Re: [HACKERS] Vacuum, Freeze and Analyze: the big picture
On Sat, Jun 1, 2013 at 1:49 PM, Simon Riggs si...@2ndquadrant.com wrote: On 30 May 2013 19:48, Josh Berkus j...@agliodbs.com wrote: There's currently some great ideas bouncing around about eliminating the overhead associated with FREEZE. However, I wanted to take a step back and take a look at the big picture for VACUUM, FREEZE and ANALYZE. That is a very commendable approach. We should do that more often. The big, big picture is this: 90% of our users need to think about VACUUM/ANALYZE at least 10% of the time and 10% of our users need to think about it almost 90% of the time. When you say stuff like that, you should add speculating from my personal experience. People might get the impression you'd measured this somehow and it could confuse the issue if you try to assemble a high level viewpoint and then add in factoids that are just opinions. We should strive to measure such things. That's considerably better than was the case 5 years ago, when vacuum management was a daily or weekly responsibility for nearly 100% of our users, but it's still not good enough. Our target should be that only those with really unusual setups should have to *ever* think about vacuum and analyze. I think that's where we already are given that 1000s of users have quite small databases. The problem increases with scale. Larger databases have bigger problems and make it easier to notice things are happening. I think you should mention that the evidence for these issues is anecdotal and take careful notes of the backgrounds in which they occurred. Saying things occur in all cases wouldn't be accurate or helpful to their resolution. We should be seeking to contrast this against other databases to see if we are better or worse than other systems. For example, recording the moans of someone who is currently managing a 1 TB database, but yet hasn't ever managed anything else that big is less valuable than a balanced, experienced viewpoint (if such exists). 
Anyway, I support this approach, just wanted to make sure we do it in sufficient detail to be useful. I agree with all that. I don't have any data either, but I agree that AFAICT it seems to mostly be a problem for large (terabyte-scale) databases, or ones that are dreadfully short of I/O bandwidth. AWS, I'm looking at you. It would be interesting to make a list of what other issues people have seen using PostgreSQL on very large data sets. Complaints I've heard include: 1. Inexplicable failure of the planner to use indexes on very large tables, preferring an obviously-stupid sequential scan. This might be fixed by the latest index-size fudge factor work. 2. Lack of concurrent DDL. On VACUUM and ANALYZE specifically, I'd have to say that the most common problems I encounter are (a) anti-wraparound vacuums kicking in at inconvenient times and eating up too many resources and (b) users making ridiculous settings changes to avoid the problems caused by anti-wraparound vacuums kicking in at inconvenient times and eating up too many resources. The changes we've been discussing elsewhere may not completely solve this problem, because we'll still have to read all pages that aren't yet all-visible... but they should surely help. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Freezing without write I/O
On 1 June 2013 21:26, Robert Haas robertmh...@gmail.com wrote: On Sat, Jun 1, 2013 at 3:22 PM, Simon Riggs si...@2ndquadrant.com wrote: If we set a bit, surely we need to write the page. Isn't that what we were trying to avoid? No, the bit only gets set in situations when we were going to dirty the page for some other reason anyway. Specifically, if a page modification discovers that we've switched epochs (but just once) and the bit isn't already set, we can set it in lieu of scanning the entire page for tuples that need freezing. Under this proposal, pages that don't contain any dead tuples needn't be dirtied for freezing, ever. Smells like awesome. Agreed, well done both. What I especially like about it is how little logic it will require, and no page format changes. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers