Re: [HACKERS] [WIP] Better partial index-only scans
Here's a SQL script that (1) demonstrates the new index only scan functionality, and (2) at least on my machine, has a consistently higher planning time for the version with my change than without it. On Sun, Mar 16, 2014 at 5:08 AM, Joshua Yanovski pythones...@gmail.com wrote: Proof of concept initial patch for enabling index only scans for partial indices even when an attribute is not in the target list, as long as it is only used in restriction clauses that can be proved by the index predicate. This also works for index quals, though they still can't be used in the target list. However, this patch may be inefficient since it duplicates effort that is currently delayed until after the best plan is chosen. The patch works by basically repeating the logic from create_indexscan_plan in createplan.c that determines which clauses can't be discarded, instead of the current approach, which just assumes that any attribute referenced anywhere in a restriction clause has to be a column in the relevant index. It should build against master and passes make check for me. It also includes a minor fix in the same code in createplan.c to make sure we're explicitly comparing an empty list to NIL, but I can take that out if that's not considered in scope. If this were the final patch I'd probably coalesce the code used in both places into a single function, but since I'm not certain that the implementation in check_index_only won't change substantially I held off on that. Since the original comment suggested that this was not done due to planner performance concerns, I assume the performance of this approach is unacceptable (though I did a few benchmarks and wasn't able to detect a consistent difference--what would be a good test for this?). As such, this is intended as more of a first pass that I can build on, but I wanted to get feedback at this stage on where we can improve (particularly if there were already ideas on how this might be done, as the comment hints).
Index only scans cost less than regular index scans so I don't think we can get away with waiting until we've chosen the best plan before we do the work described above. That said, as I see it performance could improve in any combination of five ways:

* Improve the performance of determining which clauses can't be discarded (e.g. precompute some information about equivalence classes for index predicates, mess around with the order in which we check the clauses to make it fail faster, switch to real union-find data structures for equivalence classes).
* Find a cleverer way of figuring out whether a partial index can be used than just checking which clauses can't be discarded.
* Use a simpler heuristic (that doesn't match what we use to determine which clauses can be discarded, but still matches more than we do now).
* Take advantage of work we do here to speed things up elsewhere (e.g. if this does get chosen as the best plan we don't need to recompute the same information in create_indexscan_plan).
* Delay determining whether to use an index scan or index only scan until after cost analysis somehow. I'm not sure exactly what this would entail.

Since this is my first real work with the codebase, I'd really appreciate it if people could help me figure out the best approach here (and, more importantly, if one is necessary based on benchmarks). And while this should go without saying, if this patch doesn't actually work then please let me know, since all the above is based on the assumption that what's there is enough :) Thanks, Joshua Yanovski -- Josh test_indexscan.sql Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
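The clause-discarding test described above can be illustrated with a toy model. This is not PostgreSQL's predicate_implied_by machinery, just a sketch of the idea under simplified assumptions: clauses are (column, operator, constant) triples, and only a few trivial implications are handled.

```python
# Toy model of proving restriction clauses from a partial-index
# predicate.  A clause that the predicate implies can be discarded,
# so its column need not be stored in the index for an index-only
# scan to remain possible.  (Hypothetical helper names; not the
# PostgreSQL implementation.)

def implies(pred, clause):
    """Return True if the predicate triple trivially implies the
    clause triple (same column, compatible operator and constant)."""
    col_p, op_p, val_p = pred
    col_c, op_c, val_c = clause
    if col_p != col_c:
        return False
    if (op_p, op_c) == ('=', '='):
        return val_p == val_c
    if op_p == '=' and op_c == '>':
        return val_p > val_c
    if op_p == '=' and op_c == '<':
        return val_p < val_c
    if (op_p, op_c) == ('>', '>'):
        return val_p >= val_c
    if (op_p, op_c) == ('<', '<'):
        return val_p <= val_c
    return False

def discardable_clauses(index_predicate, restriction_clauses):
    """Clauses provable from the index predicate need not be
    rechecked at scan time."""
    return [c for c in restriction_clauses
            if any(implies(p, c) for p in index_predicate)]
```

For example, with an index predicate `b = 1`, the restriction clause `b = 1` is discardable while `a > 0` is not, so only `a` needs to be available from the index itself.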
Re: [HACKERS] Fix typo in nbtree.h introduced by efada2b
On Mon, Mar 17, 2014 at 6:06 AM, Michael Paquier michael.paqu...@gmail.com wrote: Hi, I found a small typo in nbtree.h, introduced by commit efada2b. Patch is attached. Applied, thanks. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Re: [HACKERS] inherit support for foreign tables
Hi Fujita-san, Thank you for working on this patch! No problem, but my point seems always out of the main target a bit:( | =# alter table passwd add column added int, add column added2 int; | NOTICE: This command affects foreign relation cf1 | NOTICE: This command affects foreign relation cf1 | ALTER TABLE | =# select * from passwd; | ERROR: missing data for column added | CONTEXT: COPY cf1, line 1: root:x:0:0:root:/root:/bin/bash | =# This seems far better than silently performing the command, except for the duplicated message:( A new bitmap might be required to avoid the duplication.. As I said before, I think it would be better to show this kind of information on each of the affected tables whether or not the affected one is foreign. I also think it would be better to show it only when the user has specified an option to do so, similar to a VERBOSE option of other commands. ISTM this should be implemented as a separate patch. Hmm. I *wish* this kind of indication to be with this patch even only for foreign tables which would have inconsistency potentially. Expanding to other objects and/or knobs are no problem to be added later. Hmm. I tried a minimal implementation to do that. This omits cost recalculation but seems to work as expected. This seems enough if cost recalc is trivial here. I think we should redo the cost/size estimate, because for example, greater parameterization leads to a smaller rowcount estimate, if I understand correctly. In addition, I think this reparameterization should be done by the FDW itself, because the FDW has more knowledge about it than the PG core. So, I think we should introduce a new FDW routine for that, say ReparameterizeForeignPath(), as proposed in [1]. Attached is an updated version of the patch. Due to the above reason, I removed from the patch the message displaying function you added. Sorry for the delay. [1] http://www.postgresql.org/message-id/530c7464.6020...@lab.ntt.co.jp I had a rough look at the foreign reparameterization stuff.
It seems ok generally. By the way, can I have a simple script to build an environment to run this on? regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] gaussian distribution pgbench
Hi Heikki-san, (2014/03/17 14:39), KONDO Mitsumasa wrote: (2014/03/15 15:53), Fabien COELHO wrote: Hello Heikki, A couple of comments: * There should be an explicit \setrandom ... uniform option too, even though you get that implicitly if you don't specify the distribution Fixed. We can use \setrandom val min max uniform without error messages. * What exactly does the threshold mean? The docs informally explain that the larger the threshold, the more frequent values close to the middle of the interval are drawn, but that's pretty vague. There are explanations and computations as comments in the code. If it is about the documentation, I'm not sure that a very precise mathematical definition will help a lot of people, and might rather hinder understanding, so the doc focuses on an intuitive explanation instead. Added more detailed information to the document. Is it OK? Please confirm it. * Does min and max really make sense for gaussian and exponential distributions? For gaussian, I would expect mean and standard deviation as the parameters, not min/max/threshold. Yes... and no:-) The aim is to draw an integer primary key from a table, so it must be in a specified range. This is approximated by drawing a double value with the expected distribution (gaussian or exponential) and projecting it carefully onto integers. If it is out of range, there is a loop and another value is drawn. The minimal threshold constraint (2.0) ensures that the probability of looping is low. It makes sense. Please see the picture attached the other day. * How about setting the variable as a float instead of integer? Would seem more natural to me. At least as an option. Which variable? The values set by setrandom are mostly used for primary keys. We really want integers in a range. Oh, I see. He was talking about the documentation. The document was mistaken: the threshold parameter must be a double, and I fixed the document. By the way, you seem to want to remove --gaussian=NUM and --exponential=NUM command options.
Can you tell me the objective reason? I think pgbench is the benchmark test on PostgreSQL and default benchmark is TPC-B-like benchmark. It is written in documents, and default benchmark wasn't changed by my patch. So we need not remove command options, and they are one of the variety of benchmark options. Maybe you have something misunderstanding about my patch... Regards, -- Mitsumasa KONDO NTT Open Source Software Center *** a/contrib/pgbench/pgbench.c --- b/contrib/pgbench/pgbench.c *** *** 98,103 static int pthread_join(pthread_t th, void **thread_return); --- 98,106 #define LOG_STEP_SECONDS 5 /* seconds between log messages */ #define DEFAULT_NXACTS 10 /* default nxacts */ + #define MIN_GAUSSIAN_THRESHOLD 2.0 /* minimum threshold for gauss */ + #define MIN_EXPONENTIAL_THRESHOLD 2.0 /* minimum threshold for exp */ + int nxacts = 0; /* number of transactions per client */ int duration = 0; /* duration in seconds */ *** *** 169,174 bool is_connect; /* establish connection for each transaction */ --- 172,185 bool is_latencies; /* report per-command latencies */ int main_pid; /* main process id used in log filename */ + /* gaussian distribution tests: */ + double stdev_threshold; /* standard deviation threshold */ + booluse_gaussian = false; + + /* exponential distribution tests: */ + double exp_threshold; /* threshold for exponential */ + bool use_exponential = false; + char *pghost = ; char *pgport = ; char *login = NULL; *** *** 330,335 static char *select_only = { --- 341,428 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n }; + /* --exponential case */ + static char *exponential_tpc_b = { + \\set nbranches CppAsString2(nbranches) * :scale\n + \\set ntellers CppAsString2(ntellers) * :scale\n + \\set naccounts CppAsString2(naccounts) * :scale\n + \\setrandom aid 1 :naccounts exponential :exp_threshold\n + \\setrandom bid 1 :nbranches\n + \\setrandom tid 1 :ntellers\n + \\setrandom delta -5000 5000\n + BEGIN;\n + UPDATE pgbench_accounts SET abalance = 
abalance + :delta WHERE aid = :aid;\n + SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n + UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n + UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n + INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n + END;\n + }; + + /* --exponential with -N case */ + static char *exponential_simple_update = { + \\set nbranches CppAsString2(nbranches) * :scale\n + \\set ntellers CppAsString2(ntellers) * :scale\n + \\set naccounts CppAsString2(naccounts) * :scale\n + \\setrandom aid 1 :naccounts exponential :exp_threshold\n + \\setrandom bid 1 :nbranches\n + \\setrandom tid 1 :ntellers\n + \\setrandom delta -5000 5000\n + BEGIN;\n + UPDATE
Re: [HACKERS] gaussian distribution pgbench
On 03/15/2014 08:53 AM, Fabien COELHO wrote: * Does min and max really make sense for gaussian and exponential distributions? For gaussian, I would expect mean and standard deviation as the parameters, not min/max/threshold. Yes... and no:-) The aim is to draw an integer primary key from a table, so it must be in a specified range. Well, I don't agree with that aim. It's useful for choosing a primary key, as in the pgbench TPC-B workload, but a gaussian distributed random number could be used for many other things too. For example: \setrandom foo ... gaussian select * from cheese where weight < :foo And :foo should be a float, not an integer. That's what I was trying to say earlier, when I said that the variable should be a float. If you need an integer, just cast or round it in the query. I realize that the current \setrandom sets the variable to an integer, so gaussian/exponential would be different. But so what? An option to generate uniformly distributed floats would be handy too, though. This is approximated by drawing a double value with the expected distribution (gaussian or exponential) and projecting it carefully onto integers. If it is out of range, there is a loop and another value is drawn. The minimal threshold constraint (2.0) ensures that the probability of looping is low. Well, that's one way to constrain it to the given range, but there are many other ways to do it. Like, clamp it to the min/max if it's out of range. I don't think we need to choose any particular method, you can handle that in the test script. - Heikki
Re: [HACKERS] gaussian distribution pgbench
On 03/17/2014 10:40 AM, KONDO Mitsumasa wrote: By the way, you seem to want to remove --gaussian=NUM and --exponential=NUM command options. Can you tell me the objective reason? I think pgbench is the benchmark test on PostgreSQL and default benchmark is TPC-B-like benchmark. It is written in documents, and default benchmark wasn't changed by my patch. So we need not remove command options, and they are one of the variety of benchmark options. Maybe you have something misunderstanding about my patch... There is an infinite number of variants of the TPC-B test that we could include in pgbench. If we start adding every one of them, we're quickly going to have hundreds of options to choose the workload. I'd like to keep pgbench simple. These two new test variants, gaussian and exponential, are not that special that they'd deserve to be included in the program itself. pgbench already has a mechanism for running custom scripts, in which you can specify whatever workload you want. Let's use that. If it's missing something you need to specify the workload you want, let's enhance the script language. The feature we're missing, which makes it difficult to write the gaussian and exponential variants as custom scripts, is the capability to create random numbers with a non-uniform distribution. That's the feature we should include in pgbench. (Actually, you could do the Box-Muller transformation as part of the query, to convert the uniform random variable to a gaussian one. Then you wouldn't need any changes to pgbench. But I agree that \setrandom ... gaussian would be quite handy) - Heikki
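The Box-Muller transformation Heikki mentions can be sketched as follows (plain Python rather than SQL, just to show the math): two independent uniform(0, 1] draws are mapped to two independent standard-normal draws, so any source of uniform randomness can be turned into a gaussian one.

```python
import math
import random

def box_muller(u1, u2):
    """Map two independent uniform(0, 1] draws to two independent
    standard-normal draws (the basic Box-Muller transform)."""
    r = math.sqrt(-2.0 * math.log(u1))
    return (r * math.cos(2.0 * math.pi * u2),
            r * math.sin(2.0 * math.pi * u2))

def gaussian_stream(rng):
    """Yield standard-normal values built from a uniform source."""
    while True:
        # 1.0 - random() keeps u1 in (0, 1] so log() is defined
        z0, z1 = box_muller(1.0 - rng.random(), rng.random())
        yield z0
        yield z1
```

The same formula could in principle be written inline in a query with sqrt(), ln() and cos(), which is the workaround Heikki alludes to.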
Re: [HACKERS] gaussian distribution pgbench
(2014/03/17 17:46), Heikki Linnakangas wrote: On 03/15/2014 08:53 AM, Fabien COELHO wrote: * Does min and max really make sense for gaussian and exponential distributions? For gaussian, I would expect mean and standard deviation as the parameters, not min/max/threshold. Yes... and no:-) The aim is to draw an integer primary key from a table, so it must be in a specified range. Well, I don't agree with that aim. It's useful for choosing a primary key, as in the pgbench TPC-B workload, but a gaussian distributed random number could be used for many other things too. For example: \setrandom foo ... gaussian select * from cheese where weight < :foo And :foo should be a float, not an integer. That's what I was trying to say earlier, when I said that the variable should be a float. If you need an integer, just cast or round it in the query. I realize that the current \setrandom sets the variable to an integer, so gaussian/exponential would be different. But so what? An option to generate uniformly distributed floats would be handy too, though. Well, that seems like a new feature. If you want to realise it as a double, add '\setrandomd' as a double random generator in pgbench. I will agree with that. This is approximated by drawing a double value with the expected distribution (gaussian or exponential) and projecting it carefully onto integers. If it is out of range, there is a loop and another value is drawn. The minimal threshold constraint (2.0) ensures that the probability of looping is low. Well, that's one way to constrain it to the given range, but there are many other ways to do it. Like, clamp it to the min/max if it's out of range. That's too heavy a method.. Client-side calculation must be light. I don't think we need to choose any particular method, you can handle that in the test script. I think our implementation is the best way to realize it. It is fast and robust, as the probability of looping is low. If you have a better idea, please teach us.
Regards, -- Mitsumasa KONDO NTT Open Source Software Center
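The draw-and-retry scheme discussed in this thread (draw a double with the target distribution, loop if it falls outside the allowed band, then project onto the integer range) can be sketched like this. It is a simplified model, not pgbench's actual code, and the projection details are assumptions.

```python
import random

def gaussian_in_range(lo, hi, threshold, rng=random.Random(0)):
    """Draw an integer in [lo, hi] with a roughly gaussian shape.

    A standard-normal draw is rejected while |z| > threshold, then
    mapped linearly from [-threshold, threshold] onto [lo, hi].
    With threshold >= 2.0, roughly 95% of draws fall inside the
    band, so the rejection loop rarely iterates more than once or
    twice -- which is the point of the minimal threshold constraint.
    """
    while True:
        z = rng.gauss(0.0, 1.0)
        if abs(z) <= threshold:
            break
    # project [-threshold, threshold] onto the integer range
    frac = (z + threshold) / (2.0 * threshold)
    return min(hi, lo + int(frac * (hi - lo + 1)))
```

Clamping out-of-range draws to min/max (Heikki's alternative) would avoid the loop entirely, at the cost of piling extra probability mass onto the two endpoints.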
Re: [HACKERS] gaussian distribution pgbench
(2014/03/17 18:02), Heikki Linnakangas wrote: On 03/17/2014 10:40 AM, KONDO Mitsumasa wrote: By the way, you seem to want to remove --gaussian=NUM and --exponential=NUM command options. Can you tell me the objective reason? I think pgbench is the benchmark test on PostgreSQL and default benchmark is TPC-B-like benchmark. It is written in documents, and default benchmark wasn't changed by my patch. So we need not remove command options, and they are one of the variety of benchmark options. Maybe you have something misunderstanding about my patch... There is an infinite number of variants of the TPC-B test that we could include in pgbench. If we start adding every one of them, we're quickly going to have hundreds of options to choose the workload. I'd like to keep pgbench simple. These two new test variants, gaussian and exponential, are not that special that they'd deserve to be included in the program itself. Well, I added only two options, and they are major distributions, seen in real database systems more often than the uniform distribution. I'm afraid, I think you are too worried and it will not grow to hundreds of options. And pgbench is still simple. pgbench already has a mechanism for running custom scripts, in which you can specify whatever workload you want. Let's use that. If it's missing something you need to specify the workload you want, let's enhance the script language. I have not seen many users using pgbench custom scripts. And gaussian and exponential distributions are much better for measuring real system performance, so I'd like to have them as command options. In pgbench as it is now, we can only vary the database size, but that isn't a realistic situation. We want to forecast the required system by calculating the size of the hot spot or the distribution of access patterns. I'd really like to include it, from my heart:) Please...
Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] gaussian distribution pgbench
On Mon, Mar 17, 2014 at 7:07 PM, KONDO Mitsumasa kondo.mitsum...@lab.ntt.co.jp wrote: (2014/03/17 18:02), Heikki Linnakangas wrote: On 03/17/2014 10:40 AM, KONDO Mitsumasa wrote: By the way, you seem to want to remove --gaussian=NUM and --exponential=NUM command options. Can you tell me the objective reason? I think pgbench is the benchmark test on PostgreSQL and default benchmark is TPC-B-like benchmark. It is written in documents, and default benchmark wasn't changed by my patch. So we need not remove command options, and they are one of the variety of benchmark options. Maybe you have something misunderstanding about my patch... There is an infinite number of variants of the TPC-B test that we could include in pgbench. If we start adding every one of them, we're quickly going to have hundreds of options to choose the workload. I'd like to keep pgbench simple. These two new test variants, gaussian and exponential, are not that special that they'd deserve to be included in the program itself. Well, I add only two options, and they are major distribution that are seen in real database system than uniform distiribution. I'm afraid, I think you are too worried and it will not be added hundreds of options. And pgbench is still simple. pgbench already has a mechanism for running custom scripts, in which you can specify whatever workload you want. Let's use that. If it's missing something you need to specify the workload you want, let's enhance the script language. I have not seen user who is using pgbench custom script very much. And gaussian and exponential distribution are much better to measure the real system perfomance, so I'd like to use it command option. In now pgbench, we can only measure about database size, but it isn't realistic situation. We want to forcast the required system from calculating the size of hot spot or distirbution of access pettern. I'd realy like to include it on my heart:) Please... 
I have no strong opinion about the command-line option for gaussian, but I think that we should focus on \setrandom gaussian first. Even after that's committed, we can implement that command-line option later if many people think that's necessary. Regards, -- Fujii Masao
Re: [HACKERS] Patch: show relation and tuple infos of a lock to acquire
Hi Amit, I've been ill the last few days, so sorry for my late response. I have updated the patch to pass TID and operation information in error context and changed some of the comments in code. Let me know if the added operation information is useful, else we can use better generic message in context. I don't think that this fixes the translation guideline issues brought up by Robert. This produces differing strings for the different cases as well and it also passes in altering data directly to the error string. It also may produce error messages that are really weird. You initialize the string with "while attempting to ". The remaining part of the function is covered by if()s which may lead to error messages like this: "while attempting to ", "while attempting to in relation public.xyz of database abc", "while attempting to in database abc". Although this may not be very likely (because ItemPointerIsValid(&(info->ctid)) should in this case not return false). Attached you will find a new version of this patch; it slightly violates the translation guidelines as well: it assembles an error string (but it doesn't pass in altering data like ctid or things like that). I simply couldn't think of a nice solution without doing so, and after looking through the code there are a few cases (e.g. CheckTableNotInUse()) where this is done, too. If we insist on having complete strings in this case we would have to have 6 * 3 * 2 error strings in the code.
Best regards, -- Christian Kruse http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index e2337ac..c0f5881 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -107,6 +107,8 @@ static TransactionId MultiXactIdGetUpdateXid(TransactionId xmax, uint16 t_infomask); static void MultiXactIdWait(MultiXactId multi, MultiXactStatus status, int *remaining, uint16 infomask); +static void MultiXactIdWaitExtended(Relation rel, ItemPointerData ctid, const char *opr_name, + MultiXactId multi, MultiXactStatus status, int *remaining, uint16 infomask); static bool ConditionalMultiXactIdWait(MultiXactId multi, MultiXactStatus status, int *remaining, uint16 infomask); @@ -2714,8 +2716,9 @@ l1: if (infomask HEAP_XMAX_IS_MULTI) { /* wait for multixact */ - MultiXactIdWait((MultiXactId) xwait, MultiXactStatusUpdate, - NULL, infomask); + MultiXactIdWaitExtended(relation, tp.t_data-t_ctid, delete, + (MultiXactId) xwait, + MultiXactStatusUpdate, NULL, infomask); LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); /* @@ -2741,7 +2744,8 @@ l1: else { /* wait for regular transaction to end */ - XactLockTableWait(xwait); + XactLockTableWaitExtended(relation, tp.t_data-t_ctid, + delete, xwait); LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); /* @@ -3266,8 +3270,8 @@ l2: int remain; /* wait for multixact */ - MultiXactIdWait((MultiXactId) xwait, mxact_status, remain, - infomask); + MultiXactIdWaitExtended(relation, oldtup.t_data-t_ctid, update, + (MultiXactId) xwait, mxact_status, remain, infomask); LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); /* @@ -3306,7 +3310,7 @@ l2: /* * There was no UPDATE in the MultiXact; or it aborted. No * TransactionIdIsInProgress() call needed here, since we called - * MultiXactIdWait() above. + * MultiXactIdWaitExtended() above. 
*/ if (!TransactionIdIsValid(update_xact) || TransactionIdDidAbort(update_xact)) @@ -3341,7 +3345,8 @@ l2: else { /* wait for regular transaction to end */ -XactLockTableWait(xwait); +XactLockTableWaitExtended(relation, oldtup.t_data-t_ctid, + update, xwait); LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); /* @@ -4409,7 +4414,10 @@ l3: RelationGetRelationName(relation; } else - MultiXactIdWait((MultiXactId) xwait, status, NULL, infomask); + MultiXactIdWaitExtended(relation, tuple-t_data-t_ctid, + lock, (MultiXactId) xwait, + status, NULL, + infomask); /* if there are updates, follow the update chain */ if (follow_updates @@ -4464,7 +4472,8 @@ l3: RelationGetRelationName(relation; } else - XactLockTableWait(xwait); + XactLockTableWaitExtended(relation, tuple-t_data-t_ctid, + lock, xwait); /* if there are updates, follow the update chain */ if (follow_updates @@ -5151,7 +5160,9 @@ l4: if (needwait) { LockBuffer(buf, BUFFER_LOCK_UNLOCK); - XactLockTableWait(members[i].xid); + XactLockTableWaitExtended(rel, mytup.t_data-t_ctid, + lock updated, + members[i].xid); pfree(members); goto l4; } @@ -5211,7 +5222,8 @@ l4: if (needwait) { LockBuffer(buf,
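The string-assembly approach under discussion (one base string plus optional relation/database parts, instead of 6 * 3 * 2 fully spelled-out messages) can be sketched in Python. The function and argument names here are hypothetical, and the sketch also shows exactly why this clashes with the translation guidelines: the assembled fragments cannot be translated as whole sentences.

```python
def lock_wait_context(operation, relname=None, dbname=None):
    """Assemble an error-context line from optional parts
    (hypothetical helper mirroring the patch's approach)."""
    msg = "while attempting to %s" % operation
    if relname is not None:
        msg += " in relation %s" % relname
        if dbname is not None:
            msg += " of database %s" % dbname
    elif dbname is not None:
        msg += " in database %s" % dbname
    return msg
```

Each combination of present/absent parts yields a different final string, which is compact in code but leaves translators with sentence fragments rather than complete messages.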
[HACKERS] Various typos
Attached is a small patch to fix various typos. Regards Thom diff --git a/contrib/pgcrypto/openssl.c b/contrib/pgcrypto/openssl.c index ad7fb9e..86c0fb0 100644 --- a/contrib/pgcrypto/openssl.c +++ b/contrib/pgcrypto/openssl.c @@ -429,7 +429,7 @@ bf_init(PX_Cipher *c, const uint8 *key, unsigned klen, const uint8 *iv) /* * Test if key len is supported. BF_set_key silently cut large keys and it - * could be be a problem when user transfer crypted data from one server + * could be a problem when user transfer crypted data from one server * to another. */ diff --git a/doc/src/sgml/gin.sgml b/doc/src/sgml/gin.sgml index 80fb4f3..561608f 100644 --- a/doc/src/sgml/gin.sgml +++ b/doc/src/sgml/gin.sgml @@ -270,7 +270,7 @@ contains the corresponding query keys. Likewise, the function must return GIN_FALSE only if the item does not match, whether or not it contains the GIN_MAYBE keys. If the result depends on the GIN_MAYBE - entries, ie. the match cannot be confirmed or refuted based on the + entries, i.e. the match cannot be confirmed or refuted based on the known query keys, the function must return GIN_MAYBE. /para para diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c index e24ff18..9d96bf5 100644 --- a/src/backend/access/gin/ginget.c +++ b/src/backend/access/gin/ginget.c @@ -933,7 +933,7 @@ keyGetItem(GinState *ginstate, MemoryContext tempCtx, GinScanKey key, /* * Ok, we now know that there are no matches minItem. * - * If minItem is lossy, it means that there there were no exact items on + * If minItem is lossy, it means that there were no exact items on * the page among requiredEntries, because lossy pointers sort after exact * items. However, there might be exact items for the same page among * additionalEntries, so we mustn't advance past them. 
diff --git a/src/backend/access/gin/ginlogic.c b/src/backend/access/gin/ginlogic.c index 4c8d706..b13ce4c 100644 --- a/src/backend/access/gin/ginlogic.c +++ b/src/backend/access/gin/ginlogic.c @@ -10,7 +10,7 @@ * a GIN scan can apply various optimizations, if it can determine that an * item matches or doesn't match, even if it doesn't know if some of the keys * are present or not. Hence, it's useful to have a ternary-logic consistent - * function, where where each key can be TRUE (present), FALSE (not present), + * function, where each key can be TRUE (present), FALSE (not present), * or MAYBE (don't know if present). This file provides such a ternary-logic * consistent function, implemented by calling the regular boolean consistent * function many times, with all the MAYBE arguments set to all combinations diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c index 239c7da..4cf07ea 100644 --- a/src/backend/access/heap/rewriteheap.c +++ b/src/backend/access/heap/rewriteheap.c @@ -783,9 +783,9 @@ raw_heap_insert(RewriteState state, HeapTuple tup) * deal with WAL logging at all - an fsync() at the end of a rewrite would be * sufficient for crash safety. Any mapping that hasn't been safely flushed to * disk has to be by an aborted (explicitly or via a crash) transaction and is - * ignored by virtue of the xid in it's name being subject to a + * ignored by virtue of the xid in its name being subject to a * TransactionDidCommit() check. But we want to support having standbys via - * physical replication, both for availability and to to do logical decoding + * physical replication, both for availability and to do logical decoding * there. * */ @@ -1046,7 +1046,7 @@ logical_rewrite_log_mapping(RewriteState state, TransactionId xid, /* * Perform logical remapping for a tuple that's mapped from old_tid to - * new_tuple-t_self by rewrite_heap_tuple() iff necessary for the tuple. 
+ * new_tuple-t_self by rewrite_heap_tuple() if necessary for the tuple. */ static void logical_rewrite_heap_tuple(RewriteState state, ItemPointerData old_tid, diff --git a/src/backend/lib/ilist.c b/src/backend/lib/ilist.c index 58c5810..73a6473 100644 --- a/src/backend/lib/ilist.c +++ b/src/backend/lib/ilist.c @@ -26,7 +26,7 @@ /* * Delete 'node' from list. * - * It is not allowed to delete a 'node' which is is not in the list 'head' + * It is not allowed to delete a 'node' which is not in the list 'head' * * Caution: this is O(n); consider using slist_delete_current() instead. */ diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c index 04e407a..8c6c6c2 100644 --- a/src/backend/replication/logical/logical.c +++ b/src/backend/replication/logical/logical.c @@ -293,7 +293,7 @@ CreateInitDecodingContext(char *plugin, * So we have to acquire the ProcArrayLock to prevent computation of new * xmin horizons by other backends, get the safe decoding xid, and inform * the slot
Re: [HACKERS] Changeset Extraction v7.9.1
On Wed, Mar 12, 2014 at 2:06 PM, Andres Freund and...@2ndquadrant.com wrote: Attached are the collected remaining patches. The docs might need further additions, but it seems better to add them now. A few questions about pg_recvlogical: - There doesn't seem to be any provision for this tool to ever switch from one output file to the next. That seems like a practical need. One idea would be to have it respond to SIGHUP by reopening the originally-named output file. Another would be to switch, after so many bytes, to filename.1, then filename.2, etc. - It confirms the write and flush positions, but doesn't appear to actually flush anywhere. - While I quite agree with your desire for stringinfo in src/common, couldn't you use the roughly-equivalent PQExpBuffer facilities in libpq instead? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Changeset Extraction v7.9.1
On 2014-03-17 06:55:28 -0400, Robert Haas wrote: On Wed, Mar 12, 2014 at 2:06 PM, Andres Freund and...@2ndquadrant.com wrote: Attached are the collected remaining patches. The docs might need further additions, but it seems better to add them now. A few questions about pg_recvlogical: - There doesn't seem to be any provision for this tool to ever switch from one output file to the next. That seems like a practical need. One idea would be to have it respond to SIGHUP by reopening the originally-named output file. Another would be to switch, after so many bytes, to filename.1, then filename.2, etc. Hm. So far I haven't had the need, but you're right, it would be useful. I don't like the .n notion, but SIGHUP would be fine with me. I'll add that. - It confirms the write and flush positions, but doesn't appear to actually flush anywhere. Yea. The reason it reports the flush position is that it allows to test sync rep. I don't think other usecases will appreciate frequent fsyncs... Maybe make it optional? - While I quite agree with your desire for stringinfo in src/common, couldn't you use the roughly-equivalent PQExpBuffer facilities in libpq instead? Yes. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
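The SIGHUP approach Andres agrees to add could look something like the following minimal sketch. This is illustrative only, not pg_recvlogical's actual code; `got_SIGHUP`, `maybe_reopen`, and the file names are invented for this example. The handler only sets a flag (the async-signal-safe pattern), and the receive loop reopens the originally-named file when it notices the flag.

```c
/* Sketch of the SIGHUP-reopens-output-file idea (illustrative names). */
#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t got_SIGHUP = 0;

static void
sighup_handler(int signum)
{
	(void) signum;
	got_SIGHUP = 1;				/* async-signal-safe: just set a flag */
}

/*
 * Called from the main receive loop between writes.  A log rotation tool
 * would rename the old file away, then send SIGHUP; we reopen by the
 * original name, creating a fresh file.
 */
static FILE *
maybe_reopen(FILE *outfile, const char *outfname)
{
	if (got_SIGHUP)
	{
		got_SIGHUP = 0;
		fclose(outfile);
		outfile = fopen(outfname, "a");
	}
	return outfile;
}
```

The key design point is that the handler does no I/O itself, which sidesteps the async-signal-safety concerns raised later in the thread.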
[HACKERS] pg_dump without explicit table locking
Hi, at work at my company I inherited responsibility for a large PG 8.1 DB, with an extreme number of tables (~30). Surprisingly this is working quite well, except for maintenance and backup. I am tasked with finding a way to do dump/restore to 9.3 with as little downtime as possible. Using 9.3's pg_dump with -j12 I found out that pg_dump takes 6 hours to lock tables using a single thread, then does the data dump in 1 more hour using 12 workers. However if I patch out the explicit LOCK TABLE statements this only takes 1 hour total. Of course no one else is using the DB at this time. In a pathological test case scenario in a staging environment the dump time decreased from 5 hours to 5 minutes. I've googled the problem and there seem to be more people with similar problems, so I made this a command line option --no-table-locks and wrapped it up in as nice a patch against github/master as I can manage (and I didn't use C for a long time). I hope you find it useful.

regards,
Jürgen Strobel

commit 393d47cf6b5ce77297e52e7fc390aa862fd2c2fd
Author: Jürgen Strobel juergen+git...@strobel.info
Date: Fri Mar 7 16:54:22 2014 +0100

    Add option --no-table-locks to pg_dump.

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 1f0d4de..5725c46 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -772,6 +772,25 @@ PostgreSQL documentation
      </varlistentry>

      <varlistentry>
+      <term><option>--no-table-locks</option></term>
+      <listitem>
+       <para>
+        If this option is specified, tables to be dumped will not be locked explicitly
+        at the start of pg_dump. It implies <option>--no-synchronized-snapshots</option>.
+        This is potentially dangerous and should only be used by experts,
+        and only while there is no other activity in the database.
+       </para>
+       <para>
+        In the presence of an extreme number of tables pg_dump can exhibit
+        bad startup performance because it needs to issue LOCK TABLE statements for all
+        tables. This is intended as a last resort option for dump and restore version
+        upgrades, in order to make downtime manageable. In an especially bad case
+        with ~30 tables this reduced dump time from 6 hours to 1 hour.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>--no-tablespaces</option></term>
       <listitem>
       <para>
diff --git a/src/bin/pg_dump/parallel.c b/src/bin/pg_dump/parallel.c
index 6f2634b..ac05356 100644
--- a/src/bin/pg_dump/parallel.c
+++ b/src/bin/pg_dump/parallel.c
@@ -32,6 +32,9 @@
 #define PIPE_READ 0
 #define PIPE_WRITE 1

+/* flag for the no-table-locks command-line long option */
+int		no_table_locks = 0;
+
 /* file-scope variables */
 #ifdef WIN32
 static unsigned int tMasterThreadId = 0;
@@ -886,7 +889,7 @@ WaitForCommands(ArchiveHandle *AH, int pipefd[2])
 	 * meantime. lockTableNoWait dies in this case to prevent a
 	 * deadlock.
 	 */
-	if (strcmp(te->desc, "BLOBS") != 0)
+	if (!no_table_locks && strcmp(te->desc, "BLOBS") != 0)
 		lockTableNoWait(AH, te);

 	/*
diff --git a/src/bin/pg_dump/parallel.h b/src/bin/pg_dump/parallel.h
index 7a32a9b..f95a6bb 100644
--- a/src/bin/pg_dump/parallel.h
+++ b/src/bin/pg_dump/parallel.h
@@ -70,6 +70,9 @@ extern bool parallel_init_done;
 extern DWORD mainThreadId;
 #endif

+/* flag for the no-table-locks command-line long option */
+extern int no_table_locks;
+
 extern void init_parallel_dump_utils(void);
 extern int GetIdleWorker(ParallelState *pstate);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 17bb846..e1fe047 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -357,6 +357,7 @@ main(int argc, char **argv)
 		{"use-set-session-authorization", no_argument, &use_setsessauth, 1},
 		{"no-security-labels", no_argument, &no_security_labels, 1},
 		{"no-synchronized-snapshots", no_argument, &no_synchronized_snapshots, 1},
+		{"no-table-locks", no_argument, &no_table_locks, 1},
 		{"no-unlogged-table-data", no_argument, &no_unlogged_table_data, 1},

 		{NULL, 0, NULL, 0}
@@ -562,6 +563,10 @@ main(int argc, char **argv)
 	if (column_inserts)
 		dump_inserts = 1;

+	/* --no-table-locks implies --no-synchronized-snapshots */
+	if (no_table_locks)
+		no_synchronized_snapshots = 1;
+
 	if (dataOnly && schemaOnly)
 	{
 		write_msg(NULL, "options -s/--schema-only and -a/--data-only cannot be used together\n");
@@ -902,6 +907,7 @@ help(const char *progname)
 	printf(_("  --inserts                    dump data as INSERT commands,
Re: [HACKERS] Various typos
On Mon, Mar 17, 2014 at 7:52 PM, Thom Brown t...@linux.com wrote: Attached is a small patch to fix various typos. Thanks! Committed. Regards, -- Fujii Masao
Re: [HACKERS] Changeset Extraction v7.9.1
On Mon, Mar 17, 2014 at 7:27 AM, Andres Freund and...@2ndquadrant.com wrote: - There doesn't seem to be any provision for this tool to ever switch from one output file to the next. That seems like a practical need. One idea would be to have it respond to SIGHUP by reopening the originally-named output file. Another would be to switch, after so many bytes, to filename.1, then filename.2, etc. Hm. So far I haven't had the need, but you're right, it would be useful. I don't like the .n notion, but SIGHUP would be fine with me. I'll add that. Cool. - It confirms the write and flush positions, but doesn't appear to actually flush anywhere. Yea. The reason it reports the flush position is that it allows to test sync rep. I don't think other usecases will appreciate frequent fsyncs... Maybe make it optional? Well, as I'm sure you recognize, if you're actually trying to build a replication solution with this tool, you can't let the database throw away the state required to suck changes out of the database unless you've got those changes safely stored away somewhere else. Now, of course, if you don't acknowledge to the database that the stuff is on disk, you're going to get data file bloat and excess WAL retention, unlucky you. But acknowledging that you've got the changes when they're not actually on disk doesn't actually provide the guarantees you went to so much trouble to build in to the mechanism. So the no-flush version really can ONLY ever be useful for testing, AFAICS, or if you really don't care that much whether it can survive a server crash. Perhaps there could be a switch for an fsync interval, or something like that. The default could be, say, to fsync every 10 seconds. And if you want to change it, then go ahead; 0 disables. Writing to standard output would be documented as unreliable. Other ideas welcome. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] pg_dump without explicit table locking
On Mon, Mar 17, 2014 at 7:52 AM, Jürgen Strobel juergen...@strobel.info wrote: at work at my company I inherited responsibility for a large PG 8.1 DB, with an extreme number of tables (~30). Surprisingly this is working quite well, except for maintenance and backup. I am tasked with finding a way to do dump/restore to 9.3 with as little downtime as possible. Using 9.3's pg_dump with -j12 I found out that pg_dump takes 6 hours to lock tables using a single thread, then does the data dump in 1 more hour using 12 workers. However if I patch out the explicit LOCK TABLE statements this only takes 1 hour total. Of course no one else is using the DB at this time. In a pathological test case scenario in a staging environment the dump time decreased from 5 hours to 5 minutes. I've googled the problem and there seem to be more people with similar problems, so I made this a command line option --no-table-locks and wrapped it up in as nice a patch against github/master as I can manage (and I didn't use C for a long time). I hope you find it useful. Fascinating report. Whether we use your patch or not, that's interesting to know about. Please add your patch here so we don't forget about it: https://commitfest.postgresql.org/action/commitfest_view/open See also https://wiki.postgresql.org/wiki/Submitting_a_Patch -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [RFC] What should we do for reliable WAL archiving?
On Mon, Mar 17, 2014 at 10:20 AM, Robert Haas robertmh...@gmail.com wrote: On Sun, Mar 16, 2014 at 6:23 AM, MauMau maumau...@gmail.com wrote: The PostgreSQL documentation describes cp (on UNIX/Linux) or copy (on Windows) as an example for archive_command. However, cp/copy does not sync the copied data to disk. As a result, the completed WAL segments would be lost in the following sequence: 1. A WAL segment fills up. 2. The archiver process archives the just filled WAL segment using archive_command. That is, cp/copy reads the WAL segment file from pg_xlog/ and writes to the archive area. At this point, the WAL file is not persisted to the archive area yet, because cp/copy doesn't sync the writes. 3. The checkpoint processing removes the WAL segment file from pg_xlog/. 4. The OS crashes. The filled WAL segment doesn't exist anywhere any more. Considering the reliable image of PostgreSQL and widespread use in enterprise systems, I think something should be done. Could you give me your opinions on the right direction? Although the doc certainly escapes by saying (This is an example, not a recommendation, and might not work on all platforms.), it seems from pgsql-xxx MLs that many people are following this example. * Improve the example in the documentation. But what command can we use to reliably sync just one file? * Provide some command, say pg_copy, which copies a file synchronously by using fsync(), and describes in the doc something like for simple use cases, you can use pg_copy as the standard reliable copy command. +1. This won't obviate the need for tools to manage replication, but it would make it possible to get the simplest case right without guessing. +1, too. And, what about making pg_copy call posix_fadvise(DONT_NEED) against the archived file after the copy? Also, it might be a good idea to support direct copy of the file to avoid wasting the file cache.
Regards, -- Fujii Masao
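The proposed pg_copy does not exist yet; a minimal sketch of what its core could look like follows (`copy_file_sync` is an invented name, and real code would also need proper error cleanup and probably an fsync of the target directory so the new directory entry is durable too):

```c
/*
 * Hypothetical "pg_copy"-style helper: copy src to dst and fsync the
 * result, so a filled WAL segment is durably archived before
 * archive_command reports success.  Error handling is abbreviated.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int
copy_file_sync(const char *src, const char *dst)
{
	char		buf[65536];
	ssize_t		n;
	int			in = open(src, O_RDONLY);
	int			out = open(dst, O_WRONLY | O_CREAT | O_EXCL, 0600);

	if (in < 0 || out < 0)
		return -1;

	while ((n = read(in, buf, sizeof(buf))) > 0)
		if (write(out, buf, n) != n)
			return -1;

	/* The crucial step cp/copy skips: force the data to stable storage. */
	if (n < 0 || fsync(out) != 0)
		return -1;

	close(in);
	return close(out);
}
```

O_EXCL is used deliberately: an archive_command must refuse to overwrite an existing segment of the same name.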
Re: [HACKERS] pg_dump without explicit table locking
2014-03-17 12:52 GMT+01:00 Jürgen Strobel juergen...@strobel.info: Hi, at work at my company I inherited responsibility for a large PG 8.1 DB, with an extreme number of tables (~30). Surprisingly this is working quite well, except for maintenance and backup. I am tasked with finding a way to do dump/restore to 9.3 with as little downtime as possible. Using 9.3's pg_dump with -j12 I found out that pg_dump takes 6 hours to lock tables using a single thread, then does the data dump in 1 more hour using 12 workers. However if I patch out the explicit LOCK TABLE statements this only takes 1 hour total. Of course no one else is using the DB at this time. In a pathological test case scenario in a staging environment the dump time decreased from 5 hours to 5 minutes. I've googled the problem and there seem to be more people with similar problems, so I made this a command line option --no-table-locks and wrapped it up in as nice a patch against github/master as I can manage (and I didn't use C for a long time). I hope you find it useful. Joe Conway sent me a tip: commit eeb6f37d89fc60c6449ca12ef9e91491069369cb significantly decreases the time necessary for locking. So it can help too. I am not sure if the missing lock is fully correct. In the same situation I thought about some form of database-level lock, so you can get protected access with one statement. Regards Pavel Stehule regards, Jürgen Strobel
Re: [HACKERS] Changeset Extraction v7.9.1
On 2014-03-17 08:00:22 -0400, Robert Haas wrote: Yea. The reason it reports the flush position is that it allows to test sync rep. I don't think other usecases will appreciate frequent fsyncs... Maybe make it optional? Well, as I'm sure you recognize, if you're actually trying to build a replication solution with this tool, you can't let the database throw away the state required to suck changes out of the database unless you've got those changes safely stored away somewhere else. Hm. I don't think a replication tool will use pg_recvlogical, do you really foresee that? The main use cases for it that I can see are testing/development of output plugins and logging/auditing. That's not to say a safe writing method isn't interesting, though. Perhaps there could be a switch for an fsync interval, or something like that. The default could be, say, to fsync every 10 seconds. And if you want to change it, then go ahead; 0 disables. Writing to standard output would be documented as unreliable. Other ideas welcome. Hm. That'll be a bit nasty. fsync() is async signal safe, but it's still forbidden to be called from a signal on windows IIRC. I guess we can couple it with the standby_message_timeout stuff. Unless you have a better idea? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Changeset Extraction v7.9.1
On 2014-03-17 08:00:22 -0400, Robert Haas wrote: On Mon, Mar 17, 2014 at 7:27 AM, Andres Freund and...@2ndquadrant.com wrote: - There doesn't seem to be any provision for this tool to ever switch from one output file to the next. That seems like a practical need. One idea would be to have it respond to SIGHUP by reopening the originally-named output file. Another would be to switch, after so many bytes, to filename.1, then filename.2, etc. Hm. So far I haven't had the need, but you're right, it would be useful. I don't like the .n notion, but SIGHUP would be fine with me. I'll add that. Cool. So, I've implemented this, but it won't work on windows afaics. There's no SIGHUP on windows, and the signal emulation code used in the backend is backend only... I'll be happy enough to declare this a known limitation for now. Arguments to the contrary, best complemented with a solution? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Changeset Extraction v7.9.1
On Mon, Mar 17, 2014 at 8:29 AM, Andres Freund and...@2ndquadrant.com wrote: Perhaps there could be a switch for an fsync interval, or something like that. The default could be, say, to fsync every 10 seconds. And if you want to change it, then go ahead; 0 disables. Writing to standard output would be documented as unreliable. Other ideas welcome. Hm. That'll be a bit nasty. fsync() is async signal safe, but it's still forbidden to be called from a signal on windows IIRC. I guess we can couple it with the standby_message_timeout stuff. Eh... I don't see any need to involve signals. I'd just check after each write() whether enough time has passed, or something like that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
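Robert's check-after-each-write idea could be sketched like this (`flush_if_due`, `fsync_interval`, and `last_fsync` are invented names; the 10-second default and "0 disables" follow the semantics suggested above, and no signals are involved):

```c
/*
 * Sketch: after each write to the output file, fsync only if
 * fsync_interval seconds have elapsed since the last flush.
 * A value of 0 disables flushing entirely.
 */
#include <time.h>
#include <unistd.h>

static time_t last_fsync = 0;
static int	fsync_interval = 10;	/* seconds; suggested default */

static int
flush_if_due(int fd)
{
	time_t		now = time(NULL);

	if (fsync_interval <= 0)
		return 0;				/* flushing disabled */

	if (now - last_fsync >= fsync_interval)
	{
		if (fsync(fd) != 0)
			return -1;
		last_fsync = now;		/* remember when we last flushed */
	}
	return 0;
}
```

Andres's follow-up point still applies: if no writes arrive, nothing calls this, so the confirmed flush position would go stale; hooking the same check into the periodic feedback path closes that gap.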
Re: [HACKERS] Changeset Extraction v7.9.1
On Mon, Mar 17, 2014 at 8:58 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-03-17 08:00:22 -0400, Robert Haas wrote: On Mon, Mar 17, 2014 at 7:27 AM, Andres Freund and...@2ndquadrant.com wrote: - There doesn't seem to be any provision for this tool to ever switch from one output file to the next. That seems like a practical need. One idea would be to have it respond to SIGHUP by reopening the originally-named output file. Another would be to switch, after so many bytes, to filename.1, then filename.2, etc. Hm. So far I haven't had the need, but you're right, it would be useful. I don't like the .n notion, but SIGHUP would be fine with me. I'll add that. Cool. So, I've implemented this, but it won't work on windows afaics. There's no SIGHUP on windows, and the signal emulation code used in the backend is backend only... I'll be happy enough to declare this a known limitation for now. Arguments to the contrary, best complemented with a solution? Blarg. I don't really like that, but I admit I don't have a better idea, unless it's to go back to the suffix idea, with something like --file-size-limit=XXX to trigger the switch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Changeset Extraction v7.9.1
On 2014-03-17 09:13:38 -0400, Robert Haas wrote: On Mon, Mar 17, 2014 at 8:29 AM, Andres Freund and...@2ndquadrant.com wrote: Perhaps there could be a switch for an fsync interval, or something like that. The default could be, say, to fsync every 10 seconds. And if you want to change it, then go ahead; 0 disables. Writing to standard output would be documented as unreliable. Other ideas welcome. Hm. That'll be a bit nasty. fsync() is async signal safe, but it's still forbidden to be called from a signal on windows IIRC. I guess we can couple it with the standby_message_timeout stuff. Eh... I don't see any need to involve signals. I'd just check after each write() whether enough time has passed, or something like that. What if no new writes are needed? Because e.g. there's either no write activity on the primary or all writes are in another database or somesuch? I think checking in sendFeedback() is the best bet... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Changeset Extraction v7.9.1
On 2014-03-17 09:14:51 -0400, Robert Haas wrote: On Mon, Mar 17, 2014 at 8:58 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-03-17 08:00:22 -0400, Robert Haas wrote: On Mon, Mar 17, 2014 at 7:27 AM, Andres Freund and...@2ndquadrant.com wrote: - There doesn't seem to be any provision for this tool to ever switch from one output file to the next. That seems like a practical need. One idea would be to have it respond to SIGHUP by reopening the originally-named output file. Another would be to switch, after so many bytes, to filename.1, then filename.2, etc. Hm. So far I haven't had the need, but you're right, it would be useful. I don't like the .n notion, but SIGHUP would be fine with me. I'll add that. Cool. So, I've implemented this, but it won't work on windows afaics. There's no SIGHUP on windows, and the signal emulation code used in the backend is backend only... I'll be happy enough to declare this a known limitation for now. Arguments to the contrary, best complemented with a solution? Blarg. I don't really like that, but I admit I don't have a better idea, unless it's to go back to the suffix idea, with something like --file-size-limit=XXX to trigger the switch. I think the SIGHUP support can be useful independently of --file-size-limit, so adding it seems like a good idea anyway. I think anything more advanced than the SIGHUP stuff is for another day. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index
On Sun, Mar 16, 2014 at 7:15 AM, Alexander Korotkov aekorot...@gmail.com wrote: On Sat, Mar 15, 2014 at 11:27 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 03/15/2014 08:40 PM, Fujii Masao wrote: Hi, I executed the following statements in HEAD and 9.3, and compared the size of WAL which were generated by data insertion in GIN index. - CREATE EXTENSION pg_trgm; CREATE TABLE hoge (col1 text); CREATE INDEX hogeidx ON hoge USING gin (col1 gin_trgm_ops) WITH (FASTUPDATE = off); CHECKPOINT; SELECT pg_switch_xlog(); SELECT pg_switch_xlog(); SELECT pg_current_xlog_location(); INSERT INTO hoge SELECT 'POSTGRESQL' FROM generate_series(1, 100); SELECT pg_current_xlog_location(); - The results of WAL size are 960 MB (9.3) 2113 MB (HEAD) The WAL size in HEAD was more than two times bigger than that in 9.3. Recently the source code of GIN index has been changed dramatically. Is the increase in GIN-related WAL intentional or a bug? It was somewhat expected. Updating individual items on the new-format GIN pages requires decompressing and recompressing the page, and the recompressed posting lists need to be WAL-logged. Which generates much larger WAL records. That said, I didn't expect the difference to be quite that big when you're appending to the end of the table. When the new entries go to the end of the posting lists, you only need to recompress and WAL-log the last posting list, which is max 256 bytes long. But I guess that's still a lot more WAL than in the old format. I ran pg_xlogdump | grep Gin and checked the size of GIN-related WAL, and then found its max seems more than 256B. Am I missing something? What I observed is [In HEAD] At first, the size of GIN-related WAL is gradually increasing up to about 1400B. 
rmgr: Gin len (rec/tot): 48/80, tx: 1813, lsn: 0/020020D8, prev 0/0270, bkp: , desc: Insert item, node: 1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: F rmgr: Gin len (rec/tot): 56/88, tx: 1813, lsn: 0/02002440, prev 0/020023F8, bkp: , desc: Insert item, node: 1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: T rmgr: Gin len (rec/tot): 64/96, tx: 1813, lsn: 0/020044D8, prev 0/02004490, bkp: , desc: Insert item, node: 1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: T ... rmgr: Gin len (rec/tot): 1376/ 1408, tx: 1813, lsn: 0/02A7EE90, prev 0/02A7E910, bkp: , desc: Insert item, node: 1663/12945/16441 blkno: 2 isdata: F isleaf: T isdelete: T rmgr: Gin len (rec/tot): 1392/ 1424, tx: 1813, lsn: 0/02A7F458, prev 0/02A7F410, bkp: , desc: Create posting tree, node: 1663/12945/16441 blkno: 4 Then the size decreases to about 100B and is gradually increasing again up to 320B. rmgr: Gin len (rec/tot):116/ 148, tx: 1813, lsn: 0/02A7F9E8, prev 0/02A7F458, bkp: , desc: Insert item, node: 1663/12945/16441 blkno: 4 isdata: T isleaf: T unmodified: 1280 length: 1372 (compressed) rmgr: Gin len (rec/tot): 40/72, tx: 1813, lsn: 0/02A7FA80, prev 0/02A7F9E8, bkp: , desc: Insert item, node: 1663/12945/16441 blkno: 3 isdata: F isleaf: T isdelete: T ... rmgr: Gin len (rec/tot):118/ 150, tx: 1813, lsn: 0/02A83BA0, prev 0/02A83B58, bkp: , desc: Insert item, node: 1663/12945/16441 blkno: 4 isdata: T isleaf: T unmodified: 1280 length: 1374 (compressed) ... rmgr: Gin len (rec/tot):288/ 320, tx: 1813, lsn: 0/02AEDE28, prev 0/02AEDCE8, bkp: , desc: Insert item, node: 1663/12945/16441 blkno: 14 isdata: T isleaf: T unmodified: 1280 length: 1544 (compressed) Then the size decreases to 66B and is gradually increasing again up to 320B. This increase and decrease of WAL size seems to continue. [In 9.3] At first, the size of GIN-related WAL is gradually increasing up to about 2700B. 
rmgr: Gin len (rec/tot): 52/84, tx: 1812, lsn: 0/02000430, prev 0/020003D8, bkp: , desc: Insert item, node: 1663/12896/16441 blkno: 1 offset: 11 nitem: 1 isdata: F isleaf T isdelete F updateBlkno:4294967295 rmgr: Gin len (rec/tot): 60/92, tx: 1812, lsn: 0/020004D0, prev 0/02000488, bkp: , desc: Insert item, node: 1663/12896/16441 blkno: 1 offset: 1 nitem: 1 isdata: F isleaf T isdelete T updateBlkno:4294967295 ... rmgr: Gin len (rec/tot): 2740/ 2772, tx: 1812, lsn: 0/026D1670, prev 0/026D0B98, bkp: , desc: Insert item, node: 1663/12896/16441 blkno: 5 offset: 2 nitem: 1 isdata: F isleaf T isdelete T updateBlkno:4294967295 rmgr: Gin len (rec/tot): 2714/ 2746, tx: 1812, lsn: 0/026D21A8, prev 0/026D2160, bkp: , desc: Create posting tree, node: 1663/12896/16441 blkno: 6 The size decreases to 66B and then is never changed. rmgr: Gin len (rec/tot):
Re: [HACKERS] gaussian distribution pgbench
KONDO Mitsumasa kondo.mitsum...@lab.ntt.co.jp writes: (2014/03/17 18:02), Heikki Linnakangas wrote: On 03/17/2014 10:40 AM, KONDO Mitsumasa wrote: There is an infinite number of variants of the TPC-B test that we could include in pgbench. If we start adding every one of them, we're quickly going to have hundreds of options to choose the workload. I'd like to keep pgbench simple. These two new test variants, gaussian and exponential, are not that special that they'd deserve to be included in the program itself. Well, I add only two options, and they are major distributions that are seen in real database systems more than the uniform distribution. I'm afraid I think you are too worried, and it will not grow to hundreds of options. And pgbench is still simple. FWIW, I concur with Heikki on this. Adding new versions of \setrandom is useful functionality. Embedding them in the standard test is not, because that just makes it (even) less standard. And pgbench has too darn many switches already. regards, tom lane
Re: [HACKERS] [RFC] What should we do for reliable WAL archiving?
2014-03-17 21:12 GMT+09:00 Fujii Masao masao.fu...@gmail.com: On Mon, Mar 17, 2014 at 10:20 AM, Robert Haas robertmh...@gmail.com wrote: On Sun, Mar 16, 2014 at 6:23 AM, MauMau maumau...@gmail.com wrote: The PostgreSQL documentation describes cp (on UNIX/Linux) or copy (on Windows) as an example for archive_command. However, cp/copy does not sync the copied data to disk. As a result, the completed WAL segments would be lost in the following sequence: 1. A WAL segment fills up. 2. The archiver process archives the just filled WAL segment using archive_command. That is, cp/copy reads the WAL segment file from pg_xlog/ and writes to the archive area. At this point, the WAL file is not persisted to the archive area yet, because cp/copy doesn't sync the writes. 3. The checkpoint processing removes the WAL segment file from pg_xlog/. 4. The OS crashes. The filled WAL segment doesn't exist anywhere any more. Considering the reliable image of PostgreSQL and widespread use in enterprise systems, I think something should be done. Could you give me your opinions on the right direction? Although the doc certainly escapes by saying (This is an example, not a recommendation, and might not work on all platforms.), it seems from pgsql-xxx MLs that many people are following this example. * Improve the example in the documentation. But what command can we use to reliably sync just one file? * Provide some command, say pg_copy, which copies a file synchronously by using fsync(), and describes in the doc something like for simple use cases, you can use pg_copy as the standard reliable copy command. +1. This won't obviate the need for tools to manage replication, but it would make it possible to get the simplest case right without guessing. +1, too. And, what about making pg_copy call posix_fadvise(DONT_NEED) against the archived file after the copy? Also It might be good idea to support the direct copy of the file to avoid wasting the file cache. Use direct_cp. 
http://directcp.sourceforge.net/direct_cp.html Regards, -- Mitsumasa KONDO NTT Open Source Software Center
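Fujii's posix_fadvise(DONT_NEED) suggestion, sketched minimally (`drop_file_cache` is an invented name; the hint is purely advisory and may be a no-op on some platforms, and it should only be issued after the copy has been fsync'ed, since it makes no sense to drop dirty pages):

```c
/*
 * Sketch: after the archive copy completes and is fsync'ed, hint the
 * kernel to drop the cached pages so a large archiving run does not
 * evict more useful data from the page cache.
 */
#define _XOPEN_SOURCE 600		/* for posix_fadvise() */
#include <fcntl.h>

static int
drop_file_cache(int fd)
{
	/* offset 0, len 0 means "the whole file"; returns 0 on success */
	return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}
```

direct_cp, linked above, achieves a similar effect by bypassing the page cache with O_DIRECT instead of dropping pages after the fact.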
Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index
Fujii Masao escribió: On Sun, Mar 16, 2014 at 7:15 AM, Alexander Korotkov aekorot...@gmail.com wrote: That could be optimized, but I figured we can live with it, thanks to the fastupdate feature. Fastupdate allows amortizing that cost over several insertions. But of course, you explicitly disabled that... Let me know if you want me to write patch addressing this issue. Yeah, I really want you to address this problem! That's definitely useful for every user disabling the FASTUPDATE option for some reason. Users that disable FASTUPDATE, in my experience, do so because their stock work_mem is way too high and GIN searches become too slow due to having to scan too large a list. I think it might make sense to invest a modest amount of time in getting FASTUPDATE to be sized completely differently from today -- perhaps base it on a hardcoded factor of BLCKSZ, rather than work_mem. Or, if we really need to make it configurable, then let it have its own parameter. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] pg_dump without explicit table locking
Pavel Stehule pavel.steh...@gmail.com writes: 2014-03-17 12:52 GMT+01:00 Jürgen Strobel juergen...@strobel.info: I've googled the problem and there seem to be more people with similar problems, so I made this a command line option --no-table-locks and wrapped it up in as nice a patch against github/master as I can manage (and I didn't use C for a long time). I hope you find it useful. Joe Conway sent me a tip: commit eeb6f37d89fc60c6449ca12ef9e91491069369cb significantly decreases the time necessary for locking. So it can help too. Indeed. I think there's zero chance that we'd accept the patch as proposed. If there's still a performance problem in HEAD, we'd look for some other way to improve matters more. (Note that this is only one of assorted O(N^2) behaviors in older versions of pg_dump; we've gradually stamped them out over time.) regards, tom lane
Re: [HACKERS] Archive recovery won't be completed on some situation.
On 03/15/2014 05:59 PM, Fujii Masao wrote: What about adding a new option to pg_resetxlog so that we can reset the pg_control's backup start location? Even after we've accidentally entered into the situation that you described, we can exit from that by resetting the backup start location in pg_control. Also this option seems helpful to salvage the data as a last resort from the corrupted backup. Yeah, seems reasonable. After you run pg_resetxlog, there's no hope that the backup end record would arrive any time later. And if it does, it won't really do much good after you've reset the WAL. We probably should just clear out the backup start/stop location always when you run pg_resetxlog. Your database is potentially broken if you reset the WAL before reaching consistency, but if you forcibly do that with pg_resetxlog -f, you've been warned. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch: show relation and tuple infos of a lock to acquire
On Mon, Mar 17, 2014 at 4:20 PM, Christian Kruse christ...@2ndquadrant.com wrote: Hi Amit, I've been ill the last few days, so sorry for my late response. Sorry to hear and no problem for the delay. I don't think that this fixes the translation guideline issues brought up by Robert. This produces differing strings for the different cases as well, and it also passes in altering data directly to the error string. It also may produce error messages that are really weird. You initialize the string with "while attempting to ". The remaining part of the function is covered by if()s, which may lead to error messages like these: "while attempting to ", "while attempting to in relation public.xyz of database abc", "while attempting to in database abc". Although this may not be very likely (because ItemPointerIsValid(&(info->ctid)) should in this case not return false). Yes, this is a good point and even I have thought of it before sending the patch. The reason why it seems not good to duplicate the message is that I am not aware of even a single case where the tuple info (ctid, relinfo) will not be available by the time it comes to wait on a tuple in the new call (XactLockTableWaitExtended); do you think there exists any such case, or is it just based on apprehension? If there exists any such case, then isn't it better to use the earlier version of the API for that call site, as is the case we have left for SnapBuildFindSnapshot? If it's just apprehension, then I think it's better to have a check such that if any of the tuple info is missing then don't add the context, and in future if we have any such usecase then we can have multiple messages. Attached you will find a new version of this patch; it slightly violates the translation guidelines as well: it assembles an error string (but it doesn't pass in altering data like ctid or things like that). I simply couldn't think of a nice solution without doing so, and after looking through the code there are a few cases (e.g. CheckTableNotInUse()) where this is done, too. 
I could not see any problem in the modifications done by you, but if the function is generic enough that it doesn't assume what info the caller has set, then why assume that opr_name will always be filled? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index
On 03/17/2014 03:20 PM, Fujii Masao wrote: On Sun, Mar 16, 2014 at 7:15 AM, Alexander Korotkov aekorot...@gmail.com wrote: On Sat, Mar 15, 2014 at 11:27 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: I ran pg_xlogdump | grep Gin and checked the size of GIN-related WAL, and then found its max seems more than 256B. Am I missing something? What I observed is [In HEAD] At first, the size of GIN-related WAL is gradually increasing up to about 1400B. rmgr: Gin len (rec/tot): 48/80, tx: 1813, lsn: 0/020020D8, prev 0/0270, bkp: , desc: Insert item, node: 1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: F rmgr: Gin len (rec/tot): 56/88, tx: 1813, lsn: 0/02002440, prev 0/020023F8, bkp: , desc: Insert item, node: 1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: T rmgr: Gin len (rec/tot): 64/96, tx: 1813, lsn: 0/020044D8, prev 0/02004490, bkp: , desc: Insert item, node: 1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: T ... rmgr: Gin len (rec/tot): 1376/ 1408, tx: 1813, lsn: 0/02A7EE90, prev 0/02A7E910, bkp: , desc: Insert item, node: 1663/12945/16441 blkno: 2 isdata: F isleaf: T isdelete: T rmgr: Gin len (rec/tot): 1392/ 1424, tx: 1813, lsn: 0/02A7F458, prev 0/02A7F410, bkp: , desc: Create posting tree, node: 1663/12945/16441 blkno: 4 This corresponds to the stage where the items are stored in-line in the entry-tree. After it reaches a certain size, a posting tree is created. Then the size decreases to about 100B and is gradually increasing again up to 320B. rmgr: Gin len (rec/tot):116/ 148, tx: 1813, lsn: 0/02A7F9E8, prev 0/02A7F458, bkp: , desc: Insert item, node: 1663/12945/16441 blkno: 4 isdata: T isleaf: T unmodified: 1280 length: 1372 (compressed) rmgr: Gin len (rec/tot): 40/72, tx: 1813, lsn: 0/02A7FA80, prev 0/02A7F9E8, bkp: , desc: Insert item, node: 1663/12945/16441 blkno: 3 isdata: F isleaf: T isdelete: T ... 
rmgr: Gin len (rec/tot):118/ 150, tx: 1813, lsn: 0/02A83BA0, prev 0/02A83B58, bkp: , desc: Insert item, node: 1663/12945/16441 blkno: 4 isdata: T isleaf: T unmodified: 1280 length: 1374 (compressed) ... rmgr: Gin len (rec/tot):288/ 320, tx: 1813, lsn: 0/02AEDE28, prev 0/02AEDCE8, bkp: , desc: Insert item, node: 1663/12945/16441 blkno: 14 isdata: T isleaf: T unmodified: 1280 length: 1544 (compressed) Then the size decreases to 66B and is gradually increasing again up to 320B. This increase and decrease of WAL size seems to continue. Here the new items are appended to posting tree pages. This is where the maximum of 256 bytes I mentioned applies. 256 bytes is the max size of one compressed posting list, the WAL record containing it includes some other stuff too, which adds up to that 320 bytes. [In 9.3] At first, the size of GIN-related WAL is gradually increasing up to about 2700B. rmgr: Gin len (rec/tot): 52/84, tx: 1812, lsn: 0/02000430, prev 0/020003D8, bkp: , desc: Insert item, node: 1663/12896/16441 blkno: 1 offset: 11 nitem: 1 isdata: F isleaf T isdelete F updateBlkno:4294967295 rmgr: Gin len (rec/tot): 60/92, tx: 1812, lsn: 0/020004D0, prev 0/02000488, bkp: , desc: Insert item, node: 1663/12896/16441 blkno: 1 offset: 1 nitem: 1 isdata: F isleaf T isdelete T updateBlkno:4294967295 ... rmgr: Gin len (rec/tot): 2740/ 2772, tx: 1812, lsn: 0/026D1670, prev 0/026D0B98, bkp: , desc: Insert item, node: 1663/12896/16441 blkno: 5 offset: 2 nitem: 1 isdata: F isleaf T isdelete T updateBlkno:4294967295 rmgr: Gin len (rec/tot): 2714/ 2746, tx: 1812, lsn: 0/026D21A8, prev 0/026D2160, bkp: , desc: Create posting tree, node: 1663/12896/16441 blkno: 6 The size decreases to 66B and then is never changed. Same mechanism on 9.3, but the insertions to the posting tree pages are constant size. That could be optimized, but I figured we can live with it, thanks to the fastupdate feature. Fastupdate allows amortizing that cost over several insertions. 
But of course, you explicitly disabled that... Let me know if you want me to write a patch addressing this issue. Yeah, I really want you to address this problem! That's definitely useful for every user who disables the FASTUPDATE option for some reason. Ok, let's think about it a little bit. I think there are three fairly simple ways to address this: 1. The GIN data leaf recompress record contains an offset called unmodifiedlength, and the data that comes after that offset. Currently, the record is written so that unmodifiedlength points to the end of the last compressed posting list stored on the page that was not modified, followed by all the modified ones. The straightforward way to cut down the WAL record
Re: [HACKERS] gaussian distribution pgbench
On Sat, Mar 15, 2014 at 4:50 AM, Mitsumasa KONDO kondo.mitsum...@gmail.com wrote: There are explanations and computations as comments in the code. If it is about the documentation, I'm not sure that a very precise mathematical definition will help a lot of people, and might rather hinder understanding, so the doc focuses on an intuitive explanation instead. Yeah, I think that we had better explain only the necessary information for using this feature. If we add mathematical theory to the docs, it will be too difficult for users. And it's a waste. Well, if you *don't* include at least *some* mathematical description of what the feature does in the documentation, then users who need to understand it will have to read the source code to figure it out, which is going to be even more difficult. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index
Heikki Linnakangas hlinnakan...@vmware.com writes: 2. Instead of storing the new compressed posting list in the WAL record, store only the new item pointers added to the page. WAL replay would then have to duplicate the work done in the main insertion code path: find the right posting lists to insert to, decode them, add the new items, and re-encode. That sounds fairly dangerous ... is any user-defined code involved in those decisions? This record format would be higher-level, in the sense that we would not store the physical copy of the compressed posting list as it was formed originally. The same work would be done at WAL replay. As the code stands, it will produce exactly the same result, but that's not guaranteed if we make bugfixes to the code later, and a master and standby are running different minor versions. There's not necessarily anything wrong with that, but it's something to keep in mind. Version skew would be a hazard too, all right. I think it's important that WAL replay be a pretty mechanical, predictable process. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] on_exit_reset fails to clear DSM-related exit actions
On Mon, Mar 10, 2014 at 11:48 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Fri, Mar 7, 2014 at 2:40 PM, Robert Haas robertmh...@gmail.com wrote: Hmm. So the problematic sequence of events is where a postmaster child forks, and then exits without exec-ing, perhaps because e.g. exec fails? I've attempted a fix for this case. The attached patch makes test_shm_mq fork() a child process that calls on_exit_reset() and then exits. Without the fix I just pushed, this causes the tests to fail; with this fix, this does not cause the tests to fail. I'm not entirely sure that this is exactly right, but I think it's an improvement. This looks generally sane to me, but I wonder whether you should also teach PGSharedMemoryDetach() to physically detach from DSM segments not just the main shmem seg. I looked at this a little more. It seems dsm_backend_shutdown() already does almost the right thing; it fires all the on-detach callbacks and unmaps the corresponding segments. It does NOT unmap the control segment, though, and processes that are unmapping the main shared memory segment for safety should really be getting rid of the control segment too (even though the consequences of corrupting the control segment are much less bad). One option is to just change that function to also unmap the control segment, and maybe rename it to dsm_detach_all(), and then use that everywhere. The problem is that I'm not sure we really want to incur the overhead of an extra munmap() during every backend exit, just to get rid of a control segment which was going to be unmapped anyway by process termination. For that matter, I'm not sure why we bother arranging that for the main shared memory segment, either: surely whatever function the shmdt() and munmap() calls in IpcMemoryDetach may have will be equally well-served by the forthcoming exit()? 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index
On 03/17/2014 04:33 PM, Tom Lane wrote: Heikki Linnakangas hlinnakan...@vmware.com writes: 2. Instead of storing the new compressed posting list in the WAL record, store only the new item pointers added to the page. WAL replay would then have to duplicate the work done in the main insertion code path: find the right posting lists to insert to, decode them, add the new items, and re-encode. That sounds fairly dangerous ... is any user-defined code involved in those decisions? No. This record format would be higher-level, in the sense that we would not store the physical copy of the compressed posting list as it was formed originally. The same work would be done at WAL replay. As the code stands, it will produce exactly the same result, but that's not guaranteed if we make bugfixes to the code later, and a master and standby are running different minor versions. There's not necessarily anything wrong with that, but it's something to keep in mind. Version skew would be a hazard too, all right. I think it's important that WAL replay be a pretty mechanical, predictable process. Yeah. One particular point to note is that if in one place we do the more high-level thing and have WAL replay re-encode the page as it sees fit, then we can *not* rely on the page being byte-by-byte identical in other places. Like, in vacuum, where items are deleted. Heap and B-tree WAL records also rely on PageAddItem etc. to reconstruct the page, instead of making a physical copy of the modified parts. And _bt_restore_page even inserts the items physically in a different order than the normal codepath does. So for good or bad, there is some precedent for this. The imminent danger I see is if we change the logic on how the items are divided into posting lists, and end up in a situation where a master server adds an item to a page, and it just fits, but with the compression logic the standby version has, it cannot make it fit. 
As an escape hatch for that, we could have the WAL replay code try the compression again, with a larger max. posting list size, if it doesn't fit at first. And/or always leave something like 10 bytes of free space on every data page to make up for small differences in the logic. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index
On Mon, Mar 17, 2014 at 10:54 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Heap and B-tree WAL records also rely on PageAddItem etc. to reconstruct the page, instead of making a physical copy of the modified parts. And _bt_restore_page even inserts the items physically in a different order than the normal codepath does. So for good or bad, there is some precedent for this. Yikes. The imminent danger I see is if we change the logic on how the items are divided into posting lists, and end up in a situation where a master server adds an item to a page, and it just fits, but with the compression logic the standby version has, it cannot make it fit. As an escape hatch for that, we could have the WAL replay code try the compression again, with a larger max. posting list size, if it doesn't fit at first. And/or always leave something like 10 bytes of free space on every data page to make up for small differences in the logic. That scares the crap out of me. I don't see any intrinsic problem with relying on the existing page contents to figure out how to roll forward, as PageAddItem does; after all, we do FPIs precisely so that the page is in a known good state when we start. However, I really think we ought to try hard to make this deterministic in terms of what the resulting state of the page is; anything else seems like it's playing with fire, and I bet we'll get burned sooner rather than later. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] First-draft release notes for next week's releases
Greg Stark st...@mit.edu writes: This is not really accurate: This error allowed multiple versions of the same row to become visible to queries, resulting in apparent duplicates. Since the error is in WAL replay, it would only manifest during crash recovery or on standby servers. I think the idea is coming from what the second sentence below is getting at but it may be too complex to explain in a release note: The error causes some rows to disappear from indexes resulting in inconsistent query results on a hot standby depending on whether indexes are used. If the standby is subsequently activated or if it occurs during recovery after a crash or backup restore it could result in unique constraint violations as well. Hm ... rows disappearing from indexes might make people think that they could fix or mitigate the damage via REINDEX. That's not really true though is it? It looks to me like IndexBuildHeapScan will suffer an Assert failure in an assert-enabled build, or build a bogus index if not assert-enabled, when it comes across a heap-only tuple that has no parent. I'm thinking we'd better promote that Assert to a normal runtime elog. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] on_exit_reset fails to clear DSM-related exit actions
Robert Haas robertmh...@gmail.com writes: One option is to just change that function to also unmap the control segment, and maybe rename it to dsm_detach_all(), and then use that everywhere. The problem is that I'm not sure we really want to incur the overhead of an extra munmap() during every backend exit, just to get rid of a control segment which was going to be unmapped anyway by process termination. For that matter, I'm not sure why we bother arranging that for the main shared memory segment, either: surely whatever function the shmdt() and munmap() calls in IpcMemoryDetach may have will be equally well-served by the forthcoming exit()? Before you lobotomize that code too much, consider the postmaster crash-recovery case. That path does need to detach from the old shmem segment. Also, I might be wrong, but I think IpcMemoryDetach is a *postmaster* on_shmem_exit routine; it isn't called during backend exit. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index
Robert Haas robertmh...@gmail.com writes: On Mon, Mar 17, 2014 at 10:54 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Heap and B-tree WAL records also rely on PageAddItem etc. to reconstruct the page, instead of making a physical copy of the modified parts. And _bt_restore_page even inserts the items physically in a different order than the normal codepath does. So for good or bad, there is some precedent for this. Yikes. Yeah. I think it's arguably a bug that _bt_restore_page works like that, even though we've not been burnt up to now. The imminent danger I see is if we change the logic on how the items are divided into posting lists, and end up in a situation where a master server adds an item to a page, and it just fits, but with the compression logic the standby version has, it cannot make it fit. As an escape hatch for that, we could have the WAL replay code try the compression again, with a larger max. posting list size, if it doesn't fit at first. And/or always leave something like 10 bytes of free space on every data page to make up for small differences in the logic. That scares the crap out of me. Likewise. Saving some WAL space is *not* worth this kind of risk. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Planner hints in Postgresql
I am implementing Planner hints in PostgreSQL to force the optimizer to select a particular plan for a query on request from SQL input. I am having trouble modifying the planner code. I want to create a path node for the hinted plan and make it the plan used by the executor. How do I enforce this? Should I create a new Plan for this? How do I create a plan node which can then be given directly to the executor for a particular query?
Re: [HACKERS] Planner hints in Postgresql
On Mon, Mar 17, 2014 at 9:22 PM, Rajmohan C csrajmo...@gmail.com wrote: I am implementing Planner hints in PostgreSQL to force the optimizer to select a particular plan for a query on request from SQL input. I am having trouble modifying the planner code. I want to create a path node for the hinted plan and make it the plan used by the executor. How do I enforce this? Should I create a new Plan for this? How do I create a plan node which can then be given directly to the executor for a particular query? Planner hints have been discussed a lot before as well and AFAIK there is a wiki page that says why we shouldn't implement them. Have you referred to them? Please share if you have any new points on the same. Regards, Atri
Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index
On 03/17/2014 05:35 PM, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: On Mon, Mar 17, 2014 at 10:54 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: The imminent danger I see is if we change the logic on how the items are divided into posting lists, and end up in a situation where a master server adds an item to a page, and it just fits, but with the compression logic the standby version has, it cannot make it fit. As an escape hatch for that, we could have the WAL replay code try the compression again, with a larger max. posting list size, if it doesn't fit at first. And/or always leave something like 10 bytes of free space on every data page to make up for small differences in the logic. That scares the crap out of me. Likewise. Saving some WAL space is *not* worth this kind of risk. One fairly good compromise would be to only include the new items, not the whole modified compression lists, and let the replay logic do the re-encoding of the posting lists. But also include the cutoff points of each posting list in the WAL record. That way the replay code would have no freedom in how it decides to split the items into compressed lists, that would be fully specified by the WAL record. Here's a refresher for those who didn't follow the development of the new page format: The data page basically contains a list of ItemPointers. The items are compressed, to save disk space. However, to make random access faster, all the items on the page are not compressed as one big list. Instead, the big array of items is split into roughly equal chunks, and each chunk is compressed separately. The chunks are stored on the page one after each other. (The chunks are called posting lists in the code, the struct is called GinPostingListData) The compression is completely deterministic (each item is stored as a varbyte-encoded delta from the previous item), but there are no hard rules on how the items on the page ought to be divided into the posting lists. 
Currently, the code tries to maintain a max size of 256 bytes per list - but it will cope with any size it finds on disk. This is where the danger lies, where we could end up with a different physical page after WAL replay, if we just include the new items in the WAL record. The WAL replay might decide to split the items into posting lists differently than was originally done. (As the code stands, it would always make the same decision, completely deterministically, but that might change in a minor version if we're not careful.) We can tie WAL replay's hands about that, if we include a list of the items that form the posting lists in the WAL record. That adds some bloat, compared to only including the new items, but not too much. (And we still only need to do that for posting lists following the first modified one.) Alexander, would you like to give that a shot, or will I? - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Portability issues in shm_mq
On Sun, Mar 16, 2014 at 10:45 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: But I think there's another possible problem here. In order for reads from the buffer not to suffer alignment problems, the chunk size for reads and writes from the buffer needs to be MAXIMUM_ALIGNOF (or some multiple of it). And in order to avoid a great deal of additional and unwarranted complexity, the size of the message word also needs to be MAXIMUM_ALIGNOF (or some multiple of it). So the message word can only be of size 4 if MAXIMUM_ALIGNOF is also 4. IOW, I think your approach is going to run into trouble on any system where sizeof(Size)==4 but MAXIMUM_ALIGNOF==8. Well, it will result in padding space when you maxalign the length word, but I don't see why it wouldn't work; and it would certainly be no less efficient than what's there today. Well, the problem is with this: /* Write the message length into the buffer. */ if (!mqh->mqh_did_length_word) { res = shm_mq_send_bytes(mqh, sizeof(uint64), &nbytes, nowait, &bytes_written); If I change nbytes to be of type Size, and the second argument to sizeof(Size), then it's wrong whenever sizeof(Size) % MAXIMUM_ALIGNOF != 0. I could do something like: union { char pad[MAXIMUM_ALIGNOF]; Size val; } padded_size; padded_size.val = nbytes; res = shm_mq_send_bytes(mqh, sizeof(padded_size), &padded_size, nowait, &bytes_written); ...but that probably *is* less efficient, and it's certainly a lot uglier; a similar hack will be needed when extracting the length word, and assertions elsewhere will need adjustment. I wonder if it wouldn't be better to adjust the external API to use Size just for consistency but, internally to the module, keep using 8-byte sizes within the buffer. Really, I think that would boil down to going through and making sure that we use TYPEALIGN(...,sizeof(uint64)) everywhere instead of MAXALIGN(), which doesn't seem like a big deal. 
On the third hand, maybe there are or will be platforms out there where MAXIMUM_ALIGNOF > 8. If so, it's probably best to bite the bullet and allow for padding now, so we don't have to monkey with this again later. Sorry for belaboring this point, but I want to make sure I only need to fix this once. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Portability issues in shm_mq
Robert Haas robertmh...@gmail.com writes: On Sun, Mar 16, 2014 at 10:45 PM, Tom Lane t...@sss.pgh.pa.us wrote: Well, it will result in padding space when you maxalign the length word, but I don't see why it wouldn't work; and it would certainly be no less efficient than what's there today. Well, the problem is with this: /* Write the message length into the buffer. */ if (!mqh->mqh_did_length_word) { res = shm_mq_send_bytes(mqh, sizeof(uint64), &nbytes, nowait, &bytes_written); If I change nbytes to be of type Size, and the second argument to sizeof(Size), then it's wrong whenever sizeof(Size) % MAXIMUM_ALIGNOF != 0. Well, you need to maxalign the number of bytes physically inserted into the queue. Doesn't shm_mq_send_bytes do that? Where do you do the maxaligning of the message payload data, when the payload is odd-length? I would have expected some logic like "copy N bytes but then advance the pointer by maxalign(N)". regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Planner hints in Postgresql
Atri Sharma wrote: On Mon, Mar 17, 2014 at 9:22 PM, Rajmohan C <csrajmohan@...> wrote: I am implementing Planner hints in PostgreSQL to force the optimizer to select a particular plan for a query on request from SQL input. I am having trouble modifying the planner code. I want to create a path node for the hinted plan and make it the plan used by the executor. How do I enforce this? Should I create a new Plan for this? How do I create a plan node which can then be given directly to the executor for a particular query? Planner hints have been discussed a lot before as well and AFAIK there is a wiki page that says why we shouldn't implement them. Have you referred to them? Please share if you have any new points on the same. Regards, Atri http://wiki.postgresql.org/wiki/Todo (I got to it via the FAQ link on the homepage and the Developer FAQ section therein. You should make sure you've scanned that as well.) Note the final section titled: Features We Do Not Want. Also, you need to consider what you are doing when you cross-post (a bad thing generally) to -hackers and -novice. As there is, rightly IMO, no -novice-hackers list, you should probably have just hit up -general. We need to discuss the general why before any meaningful help on the how is going to be considered by hackers. David J. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Planner hints in Postgresql
David Johnston pol...@yahoo.com writes: Need to discuss the general why before any meaningful help on the how is going to be considered by hackers. Possibly worth noting is that in past discussions, we've concluded that the most sensible type of hint would not be use this plan at all, but here's what to assume about the selectivity of this WHERE clause. That seems considerably less likely to break than any attempt to directly specify plan details. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] warning when compiling utils/tqual.h
I noticed (by running "cd src/include; make check" with the attached patch applied) that since commit b89e151054 (Introduce logical decoding.) tqual.h now emits this warning when compiled standalone: /pgsql/source/HEAD/src/include/utils/tqual.h:101:13: warning: ‘struct HTAB’ declared inside parameter list [enabled by default] /pgsql/source/HEAD/src/include/utils/tqual.h:101:13: warning: its scope is only this definition or declaration, which is probably not what you want [enabled by default] The prototype in question is: /* * To avoid leaking too much knowledge about reorderbuffer implementation * details this is implemented in reorderbuffer.c not tqual.c. */ extern bool ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data, Snapshot snapshot, HeapTuple htup, Buffer buffer, CommandId *cmin, CommandId *cmax); Now there are two easy ways to silence this: 1. add "struct HTAB;" to the previous line in tqual.h, i.e. a forward declaration. We have very few of these; most of the time ISTM we prefer a solution like the next one: 2. add #include "utils/hsearch.h" at the top of tqual.h. Now my inclination is to use the second one, because it seems cleaner to me and avoids having a forward struct decl in an unrelated header, which will later bleed into other source files. Do we have a project guideline saying that we prefer option two, and does that explain why we have so few declarations as in option one? There is of course a third choice, which is to dictate that this function ought to be declared in reorderbuffer.h; but that would have the unpleasant side-effect that tqual.c would need to #include that. Opinions please? I would like to have a guideline about this so I can reason in a principled way in future cases in which this kind of question arises. 
-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] warning when compiling utils/tqual.h
Hi, On 2014-03-17 13:40:53 -0300, Alvaro Herrera wrote: I noticed (by running cd src/include ; make check with the attached patch applied) that since commit b89e151054 (Introduce logical decoding.) tqual.h now emits this warning when compiled standalone: I think we should add such a check (refined to not rely on precompiled headers) to check-world or something... 1. add struct HTAB; to the previous line in tqual.h, i.e. a forward declaration. We have very few of these; most of the time ISTM we prefer a solution like the next one: That's what I am for. We don't have *that* few of those. There are also several cases where struct pointers are members in other structs. There you don't even need a forward declaration (like e.g. struct HTAB* in PlannerInfo). There is of course a third choice which is to dictate that this function ought to be declared in reorderbuffer.h; but that would have the unpleasant side-effect that tqual.c would need to #include that. I am pretty clearly against this. Opinions please? I would like to have a guideline about this so I can reason in a principled way in future cases in which this kind of question arises. I'd welcome a clearly stated policy as well. I think forward declarations are a sensible tool if used sparingly, to decouple things that are primarily related on the implementation level. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] warning when compiling utils/tqual.h
Alvaro Herrera alvhe...@2ndquadrant.com writes: I noticed (by running cd src/include ; make check with the attached patch applied) that since commit b89e151054 (Introduce logical decoding.) tqual.h now emits this warning when compiled standalone: /pgsql/source/HEAD/src/include/utils/tqual.h:101:13: warning: ‘struct HTAB’ declared inside parameter list [enabled by default] /pgsql/source/HEAD/src/include/utils/tqual.h:101:13: warning: its scope is only this definition or declaration, which is probably not what you want [enabled by default] The prototype in question is: /* * To avoid leaking to much knowledge about reorderbuffer implementation * details this is implemented in reorderbuffer.c not tqual.c. */ extern bool ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data, Snapshot snapshot, HeapTuple htup, Buffer buffer, CommandId *cmin, CommandId *cmax); I guess the real question is why such a prototype is in tqual.h in the first place. ISTM this should be pushed somewhere specific to reorderbuffer.c. I'm -1 on having struct HTAB bleed into tqual.h via either of the methods you suggest. regards, tom lane
Re: [HACKERS] warning when compiling utils/tqual.h
On 2014-03-17 12:50:37 -0400, Tom Lane wrote: Alvaro Herrera alvhe...@2ndquadrant.com writes: I noticed (by running cd src/include ; make check with the attached patch applied) that since commit b89e151054 (Introduce logical decoding.) tqual.h now emits this warning when compiled standalone: /pgsql/source/HEAD/src/include/utils/tqual.h:101:13: warning: ‘struct HTAB’ declared inside parameter list [enabled by default] /pgsql/source/HEAD/src/include/utils/tqual.h:101:13: warning: its scope is only this definition or declaration, which is probably not what you want [enabled by default] The prototype in question is: /* * To avoid leaking to much knowledge about reorderbuffer implementation * details this is implemented in reorderbuffer.c not tqual.c. */ extern bool ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data, Snapshot snapshot, HeapTuple htup, Buffer buffer, CommandId *cmin, CommandId *cmax); I guess the real question is why such a prototype is in tqual.h in the first place. ISTM this should be pushed somewhere specific to reorderbuffer.c. I'm -1 on having struct HTAB bleed into tqual.h via either of the methods you suggest. It's the only logical decoding specific routine that's called by tqual.c which doesn't need anything else from reorderbuffer. I'd like to avoid more stuff bleeding in there... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] warning when compiling utils/tqual.h
Andres Freund and...@2ndquadrant.com writes: On 2014-03-17 13:40:53 -0300, Alvaro Herrera wrote: There is of course a third choice which is to dictate that this function ought to be declared in reorderbuffer.h; but that would have the unpleasant side-effect that tqual.c would need to #include that. I am pretty clearly against this. Let me get this straight. reorderbuffer.c exports a function that needs to be used by tqual.c. The obvious method to do this is to declare the function in reorderbuffer.h and have tqual.c #include that. Apparently you think it's better to have tqual.h declare the function. How is that not 100% backwards? Even worse, it requires more header-inclusion bloat for some functionality entirely unrelated to snapshots. That sounds borderline insane from here. You need a whole lot more justification than "I'm against it". regards, tom lane
Re: [HACKERS] warning when compiling utils/tqual.h
Andres Freund and...@2ndquadrant.com writes: On 2014-03-17 12:50:37 -0400, Tom Lane wrote: I guess the real question is why such a prototype is in tqual.h in the first place. ISTM this should be pushed somewhere specific to reorderbuffer.c. I'm -1 on having struct HTAB bleed into tqual.h via either of the methods you suggest. It's the only logical decoding specific routine that's called by tqual.c which doesn't need anything else from reorderbuffer. I'd like to avoid more stuff bleeding in there... Instead, you'd like HTAB to bleed into everything that uses tqual.h? regards, tom lane
Re: [HACKERS] warning when compiling utils/tqual.h
On 2014-03-17 12:57:15 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-03-17 12:50:37 -0400, Tom Lane wrote: I guess the real question is why such a prototype is in tqual.h in the first place. ISTM this should be pushed somewhere specific to reorderbuffer.c. I'm -1 on having struct HTAB bleed into tqual.h via either of the methods you suggest. It's the only logical decoding specific routine that's called by tqual.c which doesn't need anything else from reorderbuffer. I'd like to avoid more stuff bleeding in there... Instead, you'd like HTAB to bleed into everything that uses tqual.h? Meh. A struct forward declaration isn't exactly a significant thing to expose. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] warning when compiling utils/tqual.h
On 2014-03-17 12:56:12 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-03-17 13:40:53 -0300, Alvaro Herrera wrote: There is of course a third choice which is to dictate that this function ought to be declared in reorderbuffer.h; but that would have the unpleasant side-effect that tqual.c would need to #include that. I am pretty clearly against this. Let me get this straight. reorderbuffer.c exports a function that needs to be used by tqual.c. The obvious method to do this is to declare the function in reorderbuffer.h and have tqual.c #include that. Apparently you think it's better to have tqual.h declare the function. How is that not 100% backwards? Even worse that it requires more header-inclusion bloat for some functionality entirely unrelated to snapshots? I don't think cmin/cmax are entirely unrelated to tuple visibility. If it didn't require exposing reorderbuffer.c internals, its implementation should have been in tqual.c. And a struct HTAB; wouldn't require including a header. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] First-draft release notes for next week's releases
On 03/17/2014 08:28 AM, Tom Lane wrote: Greg Stark st...@mit.edu writes: The error causes some rows to disappear from indexes resulting in inconsistent query results on a hot standby depending on whether indexes are used. If the standby is subsequently activated or if it occurs during recovery after a crash or backup restore it could result in unique constraint violations as well. Hm ... rows disappearing from indexes might make people think that they could fix or mitigate the damage via REINDEX. That's not really true though is it? It looks to me like IndexBuildHeapScan will suffer an Assert failure in an assert-enabled build, or build a bogus index if not assert-enabled, when it comes across a heap-only tuple that has no parent. First, see suggested text in my first-draft release announcement. Second, if a user has encountered this kind of data corruption on their master (due to crash recovery), how exactly *do* they fix it? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] First-draft release notes for next week's releases
On 2014-03-17 10:03:52 -0700, Josh Berkus wrote: On 03/17/2014 08:28 AM, Tom Lane wrote: Greg Stark st...@mit.edu writes: The error causes some rows to disappear from indexes resulting in inconsistent query results on a hot standby depending on whether indexes are used. If the standby is subsequently activated or if it occurs during recovery after a crash or backup restore it could result in unique constraint violations as well. Hm ... rows disappearing from indexes might make people think that they could fix or mitigate the damage via REINDEX. That's not really true though is it? It looks to me like IndexBuildHeapScan will suffer an Assert failure in an assert-enabled build, or build a bogus index if not assert-enabled, when it comes across a heap-only tuple that has no parent. First, see suggested text in my first-draft release announcement. I don't think that text is any better; it's imo even wrong: The bug causes rows to vanish from indexes during recovery due to simultaneous updates of rows on both sides of a foreign key. Neither a foreign key, nor simultaneous updates, nor both sides is a prerequisite. Second, if a user has encountered this kind of data corruption on their master (due to crash recovery), how exactly *do* they fix it? Dump/restore is the most obvious candidate. The next best thing I can think of is a no-op rewriting ALTER TABLE; that doesn't deal with ctid chains IIRC, in contrast to CLUSTER/VACUUM FULL. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Planner hints in Postgresql
On Mon, Mar 17, 2014 at 9:45 PM, Tom Lane t...@sss.pgh.pa.us wrote: David Johnston pol...@yahoo.com writes: Need to discuss the general why before any meaningful help on the how is going to be considered by hackers. Possibly worth noting is that in past discussions, we've concluded that the most sensible type of hint would not be use this plan at all, but here's what to assume about the selectivity of this WHERE clause. That seems considerably less likely to break than any attempt to directly specify plan details. Isn't using a user-given value for selectivity a pretty risky situation, as it can horribly screw up the plan selection? Why not allow the user to specify an alternate plan and have the planner assign a higher preference to it during plan evaluation? This would allow us to still have a fair evaluation of all possible plans as we do right now and yet have a higher preference for the user-given plan during evaluation. Regards, Atri -- Regards, Atri *l'apprenant*
Re: [HACKERS] on_exit_reset fails to clear DSM-related exit actions
On Mon, Mar 17, 2014 at 11:32 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: One option is to just change that function to also unmap the control segment, and maybe rename it to dsm_detach_all(), and then use that everywhere. The problem is that I'm not sure we really want to incur the overhead of an extra munmap() during every backend exit, just to get rid of a control segment which was going to be unmapped anyway by process termination. For that matter, I'm not sure why we bother arranging that for the main shared memory segment, either: surely whatever function the shmdt() and munmap() calls in IpcMemoryDetach may have will be equally well-served by the forthcoming exit()? Before you lobotomize that code too much, consider the postmaster crash-recovery case. That path does need to detach from the old shmem segment. Also, I might be wrong, but I think IpcMemoryDetach is a *postmaster* on_shmem_exit routine; it isn't called during backend exit. Ah, right. I verified using strace that, at least on Linux 2.6.32-279.el6.x86_64, the only system call made on disconnecting a psql session is exit_group(0). I also tried setting breakpoints on PGSharedMemoryDetach and IpcMemoryDetach and they do not fire. After mulling over a few possible approaches, I came up with the attached, which seems short and to the point. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c index cf2ce46..8153160 100644 --- a/src/backend/postmaster/pgarch.c +++ b/src/backend/postmaster/pgarch.c @@ -40,6 +40,7 @@ #include postmaster/fork_process.h #include postmaster/pgarch.h #include postmaster/postmaster.h +#include storage/dsm.h #include storage/fd.h #include storage/ipc.h #include storage/latch.h @@ -163,6 +164,7 @@ pgarch_start(void) on_exit_reset(); /* Drop our connection to postmaster's shared memory, as well */ + dsm_detach_all(); PGSharedMemoryDetach(); PgArchiverMain(0, NULL); diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 1ca5d13..3dc280a 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -50,6 +50,7 @@ #include postmaster/postmaster.h #include storage/proc.h #include storage/backendid.h +#include storage/dsm.h #include storage/fd.h #include storage/ipc.h #include storage/latch.h @@ -709,6 +710,7 @@ pgstat_start(void) on_exit_reset(); /* Drop our connection to postmaster's shared memory, as well */ + dsm_detach_all(); PGSharedMemoryDetach(); PgstatCollectorMain(0, NULL); diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c index e277a9a..4731ab7 100644 --- a/src/backend/postmaster/syslogger.c +++ b/src/backend/postmaster/syslogger.c @@ -39,6 +39,7 @@ #include postmaster/fork_process.h #include postmaster/postmaster.h #include postmaster/syslogger.h +#include storage/dsm.h #include storage/ipc.h #include storage/latch.h #include storage/pg_shmem.h @@ -626,6 +627,7 @@ SysLogger_Start(void) on_exit_reset(); /* Drop our connection to postmaster's shared memory, as well */ + dsm_detach_all(); PGSharedMemoryDetach(); /* do the work */ diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c index 232c099..c967177 100644 --- a/src/backend/storage/ipc/dsm.c +++ 
b/src/backend/storage/ipc/dsm.c @@ -722,6 +722,8 @@ dsm_attach(dsm_handle h) /* * At backend shutdown time, detach any segments that are still attached. + * (This is similar to dsm_detach_all, except that there's no reason to + * unmap the control segment before exiting, so we don't bother.) */ void dsm_backend_shutdown(void) @@ -736,6 +738,31 @@ dsm_backend_shutdown(void) } /* + * Detach all shared memory segments, including the control segments. This + * should be called, along with PGSharedMemoryDetach, in processes that + * might inherit mappings but are not intended to be connected to dynamic + * shared memory. + */ +void +dsm_detach_all(void) +{ + void *control_address = dsm_control; + + while (!dlist_is_empty(dsm_segment_list)) + { + dsm_segment *seg; + + seg = dlist_head_element(dsm_segment, node, dsm_segment_list); + dsm_detach(seg); + } + + if (control_address != NULL) + dsm_impl_op(DSM_OP_DETACH, dsm_control_handle, 0, + dsm_control_impl_private, control_address, + dsm_control_mapped_size, ERROR); +} + +/* * Resize an existing shared memory segment. * * This may cause the shared memory segment to be remapped at a different diff --git a/src/include/storage/dsm.h b/src/include/storage/dsm.h index 03dd767..e4669a1 100644 --- a/src/include/storage/dsm.h +++ b/src/include/storage/dsm.h @@ -20,6 +20,7 @@ typedef struct dsm_segment dsm_segment; /* Startup and shutdown functions. */ extern void dsm_postmaster_startup(void); extern void
Re: [HACKERS] Planner hints in Postgresql
Atri Sharma atri.j...@gmail.com writes: On Mon, Mar 17, 2014 at 9:45 PM, Tom Lane t...@sss.pgh.pa.us wrote: Possibly worth noting is that in past discussions, we've concluded that the most sensible type of hint would not be use this plan at all, but here's what to assume about the selectivity of this WHERE clause. That seems considerably less likely to break than any attempt to directly specify plan details. Isnt using a user given value for selectivity a pretty risky situation as it can horribly screw up the plan selection? And forcing a plan to be used *isn't* that? Please re-read the older threads, since you evidently have not. regards, tom lane
Re: [HACKERS] Portability issues in shm_mq
On Mon, Mar 17, 2014 at 12:03 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Sun, Mar 16, 2014 at 10:45 PM, Tom Lane t...@sss.pgh.pa.us wrote: Well, it will result in padding space when you maxalign the length word, but I don't see why it wouldn't work; and it would certainly be no less efficient than what's there today. Well, the problem is with this: /* Write the message length into the buffer. */ if (!mqh->mqh_did_length_word) { res = shm_mq_send_bytes(mqh, sizeof(uint64), nbytes, nowait, bytes_written); If I change nbytes to be of type Size, and the second argument to sizeof(Size), then it's wrong whenever sizeof(Size) % MAXIMUM_ALIGNOF != 0. Well, you need to maxalign the number of bytes physically inserted into the queue. Doesn't shm_mq_send_bytes do that? Where do you do the maxaligning of the message payload data, when the payload is odd-length? I would have expected some logic like copy N bytes but then advance the pointer by maxalign(N). Oh, yeah. Duh. Clearly my brain isn't working today. Hmm, so maybe this will be fairly simple... will try it out. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Planner hints in Postgresql
* Atri Sharma (atri.j...@gmail.com) wrote: Isnt using a user given value for selectivity a pretty risky situation as it can horribly screw up the plan selection? Why not allow the user to specify an alternate plan and have the planner Uh, you're worried about the user giving us a garbage selectivity, but they're going to get a full-blown plan perfect? assign a higher preference to it during plan evaluation? This shall allow us to still have a fair evaluation of all possible plans as we do right now and yet have a higher preference for the user given plan during evaluation? What exactly would such a preference look like? A cost modifier? We'd almost certainly have to make that into a GUC or a value passed in as part of the query, with a high likelihood of users figuring out how to use it to say "use my plan forever and always".. Thanks, Stephen
Re: [HACKERS] Planner hints in Postgresql
Atri Sharma wrote: On Mon, Mar 17, 2014 at 9:45 PM, Tom Lane wrote: David Johnston writes: Need to discuss the general why before any meaningful help on the how is going to be considered by hackers. Possibly worth noting is that in past discussions, we've concluded that the most sensible type of hint would not be use this plan at all, but here's what to assume about the selectivity of this WHERE clause. That seems considerably less likely to break than any attempt to directly specify plan details. Isnt using a user given value for selectivity a pretty risky situation as it can horribly screw up the plan selection? Why not allow the user to specify an alternate plan and have the planner assign a higher preference to it during plan evaluation? This shall allow us to still have a fair evaluation of all possible plans as we do right now and yet have a higher preference for the user given plan during evaluation? The larger question to answer first is whether we want to implement something that is deterministic... How about just dropping the whole concept of hinting and provide a way for someone to say use this plan, or die trying. Maybe require it be used in conjunction with named PREPAREd statements: PREPARE s1 (USING /path/to/plan_def_on_server_or_something_similar) AS SELECT ...; Aside from whole-plan specification I can definitely see where join/where specification could be useful if it can overcome the current limitation of not being able to calculate inter-table estimations. David J.
Re: [HACKERS] First-draft release notes for next week's releases
On 2014-03-15 16:02:19 -0400, Tom Lane wrote: First-draft release notes are committed, and should be visible at http://www.postgresql.org/docs/devel/static/release-9-3-4.html once guaibasaurus does its next buildfarm run a few minutes from now. Any suggestions? So, the current text is: This error allowed multiple versions of the same row to become visible to queries, resulting in apparent duplicates. Since the error is in WAL replay, it would only manifest during crash recovery or on standby servers. What about: The most prominent consequence of this bug is that rows can appear to not exist when accessed via an index, while still being visible in sequential scans. This in turn can lead to constraints, including unique and foreign key ones, being violated later on. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Planner hints in Postgresql
On Mon, Mar 17, 2014 at 10:58 PM, Stephen Frost sfr...@snowman.net wrote: * Atri Sharma (atri.j...@gmail.com) wrote: Isnt using a user given value for selectivity a pretty risky situation as it can horribly screw up the plan selection? Why not allow the user to specify an alternate plan and have the planner Uh, you're worried about the user given us a garbage selectivity, but they're going to get a full-blown plan perfect? I never said that the user plan would be perfect. The entire point of planner hints is based on the assumption that the user knows more about the data than the planner does, hence the user's ideas about the plan should be given a preference. Garbage selectivity can screw up the cost estimation of *all* our possible plans, and we could end up preferring a sequential scan over an index-only scan, for example. I am trying to think of ways that give some preference to a user plan but do not interfere with the cost estimation of our other potential plans. What exactly would such a preference look like? A cost modifier? We'd almost certainly have to make that into a GUC or a value passed in as part of the query, with a high likelihood of users figuring out how to use it to say use my plan forever and always.. A factor that we experimentally determine, by which we decrease the cost of the user-specified plan so that it gets a higher preference in the plan evaluation. Of course, this is not a nice hack. Specifically after our discussion on IRC the other day, I am against planner hints, but if we are just discussing how it could be done, I could think of some ways, which I listed. Regards, Atri -- Regards, Atri *l'apprenant*
Re: [HACKERS] on_exit_reset fails to clear DSM-related exit actions
Robert Haas robertmh...@gmail.com writes: After mulling over a few possible approaches, I came up with the attached, which seems short and to the point. Looks reasonable in principle. I didn't run through all the existing PGSharedMemoryDetach calls to see if there are any other places to call dsm_detach_all, but it's an easy fix if there are any. regards, tom lane
Re: [HACKERS] Planner hints in Postgresql
* Atri Sharma (atri.j...@gmail.com) wrote: Of course, this is not a nice hack. Specifically after our discussion on IRC the other day, I am against planner hints, but if we are just discussing how it could be done, I could think of some ways which I listed. There's lots of ways to implement planner hints, but I fail to see the point in discussing how to implement something we actively don't want. Thanks, Stephen
Re: [HACKERS] First-draft release notes for next week's releases
Andres Freund and...@2ndquadrant.com writes: On 2014-03-17 10:03:52 -0700, Josh Berkus wrote: First, see suggested text in my first-draft release announcement. I don't think that text is any better, it's imo even wrong: The bug causes rows to vanish from indexes during recovery due to simultaneous updates of rows on both sides of a foreign key. Neither is a foreign key, nor simultaneous updates, nor both sides a prerequisite. What I've got at the moment is This error caused updated rows to disappear from indexes, resulting in inconsistent query results depending on whether an index scan was used. Subsequent processing could result in unique-key violations, since the previously updated row would not be found by later index searches. Since this error is in WAL replay, it would only manifest during crash recovery or on standby servers. The improperly-replayed case can arise when a table row that is referenced by a foreign-key constraint is updated concurrently with creation of a referencing row. OK, or not? The time window for bikeshedding this is dwindling rapidly. regards, tom lane
Re: [HACKERS] Planner hints in Postgresql
The larger question to answer first is whether we want to implement something that is deterministic... How about just dropping the whole concept of hinting and provide a way for someone to say use this plan, or die trying. Maybe require it be used in conjunction with named PREPAREd statements: You mean taking away the entire concept of query planning and cost estimation? That's like replacing the optimizer with DBA decisions, and I am not at all comfortable with that idea. Those are only my thoughts, though. PREPARE s1 (USING /path/to/plan_def_on_server_or_something_similar) AS SELECT ...; Aside from whole-plan specification I can definitely see where join/where specification could be useful if it can overcome the current limitation of not being able to calculate inter-table estimations. Prepared plans use a generic plan for the execution. Replacing it with a totally user-defined plan does not seem to be clean. The crux is that IMHO planner hints are a bad way of trying to circumvent the need for cross-column statistics. We should get cross-column statistics done and ignore planner hints completely. Regards, Atri -- Regards, Atri *l'apprenant*
Re: [HACKERS] Planner hints in Postgresql
There's lots of ways to implement planner hints, but I fail to see the point in discussing how to implement something we actively don't want. +1. The original poster wanted a way to implement it as a personal project or something (I think he only replied to me, not the entire list). Planner hints should be ignored :) Regards, Atri -- Regards, Atri *l'apprenant*
Re: [HACKERS] jsonb status
On 03/16/2014 04:10 AM, Peter Geoghegan wrote: On Thu, Mar 13, 2014 at 2:00 PM, Andrew Dunstan and...@dunslane.net wrote: I'll be travelling a good bit of tomorrow (Friday), but I hope Peter has finished by the time I am back on deck late tomorrow and that I am able to commit this on Saturday. I asked Andrew to hold off on committing this today. It was agreed that we weren't quite ready, because there were one or two remaining bugs (since fixed), but also because I felt that it would be useful to first hear the opinions of more people before proceeding. I think that we're not that far from having something committed. Obviously I hope to get this into 9.4, and attach a lot of strategic importance to having the feature, which is why I made a large effort to help land it. Attached patch has a number of notable revisions. Throughout, it has been possible for anyone to follow our progress here: https://github.com/feodor/postgres/commits/jsonb_and_hstore * In general, the file jsonb_support.c (renamed to jsonb_utils.c) is vastly better commented, and has a much clearer structure. This was not something I did much with in the previous revision, and so it has been a definite focus of this one. * Hashing is refactored to not use CRC32 anymore. I felt this was a questionable method of hashing, both within jsonb_hash(), as well as the jsonb_hash_ops GIN operator class. * Dead code elimination. * I got around to fixing the memory leaks in B-Tree support function one. * Andrew added hstore_to_jsonb, hstore_to_jsonb_loose functions and a cast. One goal of this effort is to preserve a parallel set of facilities for the json and jsonb types, and that includes hstore-related features. * A fix from Alexander for the jsonb_hash_ops @> operator issue I complained about during the last submission was merged. * There is no longer any GiST opclass. That just leaves B-Tree, hash, GIN (default) and GIN jsonb_hash_ops opclasses. 
My outstanding concerns are: * Have we got things right with GIN indexing, containment semantics, etc? See my remarks in the patch, by grepping "contain" within jsonb_util.c. Is the GIN text storage serialization format appropriate and correct? * General design concerns. By far the largest source of these is the file jsonb_util.c. * Is the on-disk format that we propose to tie Postgres to as good as it could be? I've been working through all the changes and fixes that Peter and others have made, and they look pretty good to me. There are a few mostly cosmetic changes I want to make, but nothing that would be worth holding up committing this for. I'm fairly keen to get this committed, get some buildfarm coverage and get more people playing with it and testing it. Like Peter, I would like to see more comments from people on the GIN support, especially. The one outstanding significant question of substance I have is this: given the commit 5 days ago of provision for triConsistent functions for GIN opclasses, should we be adding these to the two GIN opclasses we are providing, and what should they look like? Again, this isn't an issue that I think needs to hold up committing what we have now. Regarding Peter's last question, if we're not satisfied with the on-disk format proposed it would mean throwing the whole effort out and starting again. The only thing I have thought of as an alternative would be to store the structure and values separately rather than with values inline with the structure. That way you could have a hash of values more or less, which would eliminate redundancy of storage of things like object field names. But such a structure might well involve at least as much computational overhead as the current structure. And nobody's been saying all along "hold on, we can do better than this". So I'm pretty inclined to go with what we have.
cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
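Andrew's alternative layout idea above (structure stored separately from a deduplicated pool of values) can be sketched in a few lines. This is purely a hypothetical illustration of that design point, not the jsonb on-disk format as committed; the `serialize` helper and its layout are invented for the example. It shows how repeated object field names would be stored only once:

```python
# Hypothetical sketch of the "structure separate from values" layout
# Andrew describes: repeated object keys land in a shared value pool
# exactly once, and structure records reference pool indexes.
# Not the actual jsonb format, which keeps values inline.

def serialize(docs):
    """Split JSON-like dicts into a deduplicated value pool plus
    structure records that reference pool indexes."""
    pool, index = [], {}

    def intern(v):
        # Store each distinct value once; return its pool position.
        if v not in index:
            index[v] = len(pool)
            pool.append(v)
        return index[v]

    structures = [{intern(k): intern(v) for k, v in d.items()} for d in docs]
    return pool, structures

pool, structs = serialize([{"name": "a", "size": 1},
                           {"name": "b", "size": 1}])
print(pool)  # ['name', 'a', 'size', 1, 'b'] -- 'name'/'size' stored once
```

As the thread notes, the space saved on redundant field names would be traded for extra indirection on every access, which is why the inline format won out.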
Re: [HACKERS] Planner hints in Postgresql
On Mon, Mar 17, 2014 at 10:54 PM, Tom Lane t...@sss.pgh.pa.us wrote: Atri Sharma atri.j...@gmail.com writes: On Mon, Mar 17, 2014 at 9:45 PM, Tom Lane t...@sss.pgh.pa.us wrote: Possibly worth noting is that in past discussions, we've concluded that the most sensible type of hint would not be "use this plan at all", but "here's what to assume about the selectivity of this WHERE clause". That seems considerably less likely to break than any attempt to directly specify plan details. Isn't using a user-given value for selectivity a pretty risky situation, as it can horribly screw up the plan selection? And forcing a plan to be used *isn't* that? Please re-read the older threads, since you evidently have not. I never said that we force a plan to be used. I just said that we should increase the preference for a user-given plan and not interfere in the cost estimation of the other potential plans and the evaluation of the final selected plan. Regards, Atri -- Regards, Atri *l'apprenant*
Re: [HACKERS] Planner hints in Postgresql
On Mon, Mar 17, 2014 at 11:15 AM, Tom Lane t...@sss.pgh.pa.us wrote: David Johnston pol...@yahoo.com writes: Need to discuss the general "why" before any meaningful help on the "how" is going to be considered by hackers. Possibly worth noting is that in past discussions, we've concluded that the most sensible type of hint would not be "use this plan at all", but "here's what to assume about the selectivity of this WHERE clause". That seems considerably less likely to break than any attempt to directly specify plan details. Yeah -- the most common case I see is outlier culling, where several repeated low non-deterministic selectivity quals stack, reducing the row count estimate to 1. For example: SELECT * FROM foo WHERE length(bar) <= 1000 AND length(bar) >= 2; The user may have special knowledge that the above is very (or very un-) selective that is difficult or not cost effective to gather in the general case. IIRC in the archives (heh) there is a special workaround using indexes and some discussion regarding how a hypothetical feature involving user input selectivity estimates might look. I don't think that discussion is complete: the syntax for user input selectivity is an unsolved problem. There's a big difference between saying to the planner, "Use plan X" vs "Here's some information describing the data supporting choosing plan X intelligently." The latter allows for better plans in the face of varied/changing data, integrates with the planner in a natural way, and encourages users to understand how the planner works. merlin
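Merlin's "stacked quals" failure mode can be illustrated numerically. The sketch below is illustrative arithmetic only, not planner code: when per-qual selectivities are multiplied as if independent, a few quals that are actually correlated in the data collapse the row estimate to one row.

```python
# Illustrative sketch (not PostgreSQL planner code): multiplying
# per-qual selectivities as if independent, with the estimate
# clamped so it never drops below one row.

def estimated_rows(total_rows, selectivities):
    """Combine per-qual selectivities under an independence
    assumption and clamp the result to >= 1 row."""
    est = float(total_rows)
    for s in selectivities:
        est *= s
    return max(1, int(est))

# Three stacked quals, each guessed at 0.5% selectivity, collapse a
# million-row table to an estimate of a single row -- even if the quals
# are correlated and most rows actually qualify.
print(estimated_rows(1_000_000, [0.005, 0.005, 0.005]))  # 1
print(estimated_rows(1_000_000, [0.5, 0.5]))             # 250000
```

An estimate of one row is exactly what pushes the planner toward a nested loop that then runs forever when thousands of rows come back, which is the scenario Stephen describes below.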
Re: [HACKERS] First-draft release notes for next week's releases
On 2014-03-17 13:42:59 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-03-17 10:03:52 -0700, Josh Berkus wrote: First, see suggested text in my first-draft release announcement. I don't think that text is any better; it's imo even wrong: "The bug causes rows to vanish from indexes during recovery due to simultaneous updates of rows on both sides of a foreign key." Neither a foreign key, nor simultaneous updates, nor "both sides" is a prerequisite. What I've got at the moment is: This error caused updated rows to disappear from indexes, resulting in inconsistent query results depending on whether an index scan was used. Subsequent processing could result in unique-key violations, since the previously updated row would not be found by later index searches. Since this error is in WAL replay, it would only manifest during crash recovery or on standby servers. The improperly-replayed case can arise when a table row that is referenced by a foreign-key constraint is updated concurrently with creation of a referencing row. OK, or not? The time window for bikeshedding this is dwindling rapidly. That's much better, yes. Two things: * I'd change the warning about unique key violations into a more general one about constraints. Foreign key and exclusion constraints are also affected... * I wonder if we should make the possible origins a bit more general, as it's perfectly possible to trigger the problem without foreign keys. Maybe: "can arise when a table row that has been updated is row locked; that can e.g. happen when foreign keys are used." Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] First-draft release notes for next week's releases
On 2014-03-17 11:28:45 -0400, Tom Lane wrote: Hm ... rows disappearing from indexes might make people think that they could fix or mitigate the damage via REINDEX. Good point. I guess in some cases it will end up working because VACUUM/hot pruning have cleaned up the mess, but that's certainly not something I'd want to rely upon. They very well could have messed up things when presented with bogus input data. That's not really true though, is it? It looks to me like IndexBuildHeapScan will suffer an Assert failure in an assert-enabled build, or build a bogus index if not assert-enabled, when it comes across a heap-only tuple that has no parent. I'm thinking we'd better promote that Assert to a normal runtime elog. I wonder if we also should make rewriteheap.c warn about such things, although I don't immediately see a trivial way to do so. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Planner hints in Postgresql
* Merlin Moncure (mmonc...@gmail.com) wrote: Yeah -- the most common case I see is outlier culling, where several repeated low non-deterministic selectivity quals stack, reducing the row count estimate to 1. For example: SELECT * FROM foo WHERE length(bar) <= 1000 AND length(bar) >= 2; This is exactly the issue that I've seen also -- where we end up picking a Nested Loop because we think only one row is going to be returned, and instead we end up getting a bunch and it takes forever. There was also some speculation on trying to change plans mid-stream to address a situation like that, once we realize what's happening. Not sure that's really practical, but it would be nice to find some solution. Thanks, Stephen
Re: [HACKERS] Planner hints in Postgresql
There's a big difference between saying to the planner, "Use plan X" vs "Here's some information describing the data supporting choosing plan X intelligently." The latter allows for better plans in the face of varied/changing data, integrates with the planner in a natural way, and encourages users to understand how the planner works. +1 I was thinking of varying the 'weight' of a user-defined plan by a fixed experimental factor to tell the planner to give higher/lower preference to this plan, but after your idea above, I think Stephen's point of introducing a GUC for the factor is the only way possible, and I agree with him on the point that eventually the user will figure out a way to force usage of his plan using the GUC. Regards, Atri -- Regards, Atri *l'apprenant*
Re: [HACKERS] First-draft release notes for next week's releases
Andres Freund and...@2ndquadrant.com writes: That's much better, yes. Two things: * I'd change the warning about unique key violations into a more general one about constraints. Foreign key and exclusion constraint are also affected... I'll see what I can do. * I wonder if we should make the possible origins a bit more general as it's perfectly possible to trigger the problem without foreign keys. Maybe: can arise when a table row that has been updated is row locked; that can e.g. happen when foreign keys are used. IIUC, this case only occurs when using the new-in-9.3 types of nonexclusive row locks. I'm willing to bet that the number of applications using those is negligible; so I think it's all right to not mention that case explicitly, as long as the wording doesn't say that foreign keys are the *only* cause (which I didn't). regards, tom lane
Re: [HACKERS] First-draft release notes for next week's releases
On 2014-03-17 14:01:03 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: * I wonder if we should make the possible origins a bit more general as it's perfectly possible to trigger the problem without foreign keys. Maybe: can arise when a table row that has been updated is row locked; that can e.g. happen when foreign keys are used. IIUC, this case only occurs when using the new-in-9.3 types of nonexclusive row locks. I'm willing to bet that the number of applications using those is negligible; so I think it's all right to not mention that case explicitly, as long as the wording doesn't say that foreign keys are the *only* cause (which I didn't). I actually think the issue could also occur with row locks of other severities (is that the correct term?). Alvaro probably knows better, but if I see correctly it's also triggerable if a backend waits for an updating transaction to finish and follow_updates = true is passed to heap_lock_tuple(). Which e.g. nodeLockRows.c does... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] First-draft release notes for next week's releases
Andres Freund and...@2ndquadrant.com writes: On 2014-03-17 14:01:03 -0400, Tom Lane wrote: IIUC, this case only occurs when using the new-in-9.3 types of nonexclusive row locks. I'm willing to bet that the number of applications using those is negligible; so I think it's all right to not mention that case explicitly, as long as the wording doesn't say that foreign keys are the *only* cause (which I didn't). I actually think the issue could also occur with row locks of other severities (is that the correct term?). The commit log entry says We were resetting the tuple's HEAP_HOT_UPDATED flag as well as t_ctid on WAL replay of a tuple-lock operation, which is incorrect when the tuple is already updated. Back-patch to 9.3. The clearing of both header elements was there previously, but since no update could be present on a tuple that was being locked, it was harmless. which I read to mean that the case can't occur with the types of row locks that were allowed pre-9.3. but if I see correctly it's also triggerable if a backend waits for an updating transaction to finish and follow_updates = true is passed to heap_lock_tuple(). Which e.g. nodeLockRows.c does... That sounds backwards. nodeLockRows locks the latest tuple in the chain, so it can't be subject to this. regards, tom lane
Re: [HACKERS] jsonb status
Alexander will take a look at the triConsistent function. On Mon, Mar 17, 2014 at 9:48 PM, Andrew Dunstan and...@dunslane.net wrote: [...] The one outstanding significant question of substance I have is this: given the commit 5 days ago of provision for triConsistent functions for GIN opclasses, should we be adding these to the two GIN opclasses we are providing, and what should they look like? Again, this isn't an issue that I think needs to hold up committing what we have now. [...] cheers andrew
Re: [HACKERS] Planner hints in Postgresql
On Mon, Mar 17, 2014 at 12:57 PM, Atri Sharma atri.j...@gmail.com wrote: There's a big difference between saying to the planner, "Use plan X" vs "Here's some information describing the data supporting choosing plan X intelligently." The latter allows for better plans in the face of varied/changing data, integrates with the planner in a natural way, and encourages users to understand how the planner works. +1 I was thinking of varying the 'weight' of a user-defined plan by a fixed experimental factor to tell the planner to give higher/lower preference to this plan, but after your idea above, I think Stephen's point of introducing a GUC for the factor is the only way possible and I agree with him on the point that eventually the user will figure out a way to force usage of his plan using the GUC. A GUC is not the answer beyond the broad-brush, mostly debugging-level features they already support. What do you do if your plan simultaneously needs and does not need nestloops? A query plan is a complicated thing that is the result of detailed analysis of the data. I bet there are fewer than 100 users on the planet with the architectural knowledge of the planner to submit a 'plan'. What users do have is knowledge of the data that the database can't effectively gather for some reason. Looking at my query above, what it would need (assuming the planner could not be made to look through length()) would be something like: SELECT * FROM foo WHERE length(bar) <= 1000 WITH SELECTIVITY 0.999 AND length(bar) >= 2 WITH SELECTIVITY 0.999; Note, that's a trivial treatment of the syntax challenges. Ultimately it'd probably look different and/or be hooked in a different way (say, via the function call). merlin
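To make the proposal concrete, here is a hypothetical sketch of how a user-supplied selectivity hint might be consumed during estimation. Note that WITH SELECTIVITY is only a syntax idea from this thread, not an existing PostgreSQL feature, and `qual_selectivity` with its fallback chain is invented purely for illustration:

```python
# Hypothetical sketch (not PostgreSQL code): per-qual selectivity
# resolution that prefers an explicit user hint, then a
# statistics-based estimate, then a hard-coded default guess.

DEFAULT_SELECTIVITY = 0.005  # stand-in for a default guess on opaque exprs

def qual_selectivity(stats_estimate=None, user_hint=None):
    """Resolve one qual's selectivity. Hints are clamped into the
    valid (0, 1] range so a bad hint can't produce a nonsense value."""
    if user_hint is not None:
        return min(max(user_hint, 1e-10), 1.0)
    if stats_estimate is not None:
        return stats_estimate
    return DEFAULT_SELECTIVITY

print(qual_selectivity(user_hint=0.999))      # 0.999 (hint wins)
print(qual_selectivity(stats_estimate=0.25))  # 0.25
print(qual_selectivity())                     # 0.005 (default guess)
```

The key design property Merlin argues for is visible here: the hint feeds the estimator rather than dictating a plan, so the planner is still free to weigh join orders and access paths normally.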
Re: [HACKERS] First-draft release notes for next week's releases
On 2014-03-17 14:16:41 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-03-17 14:01:03 -0400, Tom Lane wrote: IIUC, this case only occurs when using the new-in-9.3 types of nonexclusive row locks. I'm willing to bet that the number of applications using those is negligible; so I think it's all right to not mention that case explicitly, as long as the wording doesn't say that foreign keys are the *only* cause (which I didn't). I actually think the issue could also occur with row locks of other severities (is that the correct term?). The commit log entry says We were resetting the tuple's HEAP_HOT_UPDATED flag as well as t_ctid on WAL replay of a tuple-lock operation, which is incorrect when the tuple is already updated. Back-patch to 9.3. The clearing of both header elements was there previously, but since no update could be present on a tuple that was being locked, it was harmless. which I read to mean that the case can't occur with the types of row locks that were allowed pre-9.3. That's not an unreasonable interpretation of the commit message, but I don't think it's correct with respect to the code :( but if I see correctly it's also triggerable if a backend waits for an updating transaction to finish and follow_updates = true is passed to heap_lock_tuple(). Which e.g. nodeLockRows.c does... That sounds backwards. nodeLockRows locks the latest tuple in the chain, so it can't be subject to this. Hm, I don't see anything in the code preventing it; that's the heap_lock_tuple() before the EPQ stuff... in ExecLockRows():

    foreach(lc, node->lr_arowMarks)
    {
        ...
        test = heap_lock_tuple(erm->relation, &tuple,
                               estate->es_output_cid,
                               lockmode, erm->noWait, true,
                               &buffer, &hufd);
        ReleaseBuffer(buffer);
        switch (test)
        {
            case HeapTupleSelfUpdated:
        ...

The true passed to heap_lock_tuple() is the follow_updates parameter. And then in heap_lock_tuple():

    if (require_sleep)
    {
        if (infomask & HEAP_XMAX_IS_MULTI)
        {
            ...
            /* if there are updates, follow the update chain */
            if (follow_updates && !HEAP_XMAX_IS_LOCKED_ONLY(infomask))
            {
                HTSU_Result res;

                res = heap_lock_updated_tuple(relation, tuple, &t_ctid,
                                              GetCurrentTransactionId(),
            ...
        }
        else
        {
            /* wait for regular transaction to end */
            if (nowait)
            {
                if (!ConditionalXactLockTableWait(xwait))
                    ereport(ERROR,
                            (errcode(ERRCODE_LOCK_NOT_AVAILABLE),
                             errmsg("could not obtain lock on row in relation \"%s\"",
                                    RelationGetRelationName(relation))));
            }
            else
                XactLockTableWait(xwait);

            /* if there are updates, follow the update chain */
            if (follow_updates && !HEAP_XMAX_IS_LOCKED_ONLY(infomask))
            ...

    if (RelationNeedsWAL(relation))
    {
        xl_heap_lock xlrec;
        XLogRecPtr recptr;
        XLogRecData rdata[2];

        xlrec.target.node = relation->rd_node;
        xlrec.target.tid = tuple->t_self;
        ...

To me that looks sufficient to trigger the bug, because we're issuing a WAL record about the row that was passed to heap_lock_tuple(), not the latest one in the ctid chain. When replaying that record, it will reset the t_ctid field, thus breaking the chain. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] First-draft release notes for next week's releases
Andres Freund and...@2ndquadrant.com writes: To me that looks sufficient to trigger the bug, because we're issuing a wal record about the row that was passed to heap_lock_update(), not the latest one in the ctid chain. When replaying that record, it will reset the t_ctid field, thus breaking the chain. [ scratches head ... ] If that's what's happening, isn't it a bug in itself? Surely the WAL record ought to point at the tuple that was locked. regards, tom lane
Re: [HACKERS] First-draft release notes for next week's releases
On 2014-03-17 14:29:56 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: To me that looks sufficient to trigger the bug, because we're issuing a wal record about the row that was passed to heap_lock_update(), not the latest one in the ctid chain. When replaying that record, it will reset the t_ctid field, thus breaking the chain. [ scratches head ... ] If that's what's happening, isn't it a bug in itself? Surely the WAL record ought to point at the tuple that was locked. There's a separate XLOG_HEAP2_LOCK_UPDATED record, for every later tuple version, emitted by heap_lock_updated_tuple_rec(). This really is mind bendingly complex :(. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Planner hints in Postgresql
On Mon, Mar 17, 2014 at 11:50 PM, Merlin Moncure mmonc...@gmail.com wrote: On Mon, Mar 17, 2014 at 12:57 PM, Atri Sharma atri.j...@gmail.com wrote: There's a big difference between saying to the planner, "Use plan X" vs "Here's some information describing the data supporting choosing plan X intelligently." The latter allows for better plans in the face of varied/changing data, integrates with the planner in a natural way, and encourages users to understand how the planner works. +1 I was thinking of varying the 'weight' of a user-defined plan by a fixed experimental factor to tell the planner to give higher/lower preference to this plan, but after your idea above, I think Stephen's point of introducing a GUC for the factor is the only way possible and I agree with him on the point that eventually the user will figure out a way to force usage of his plan using the GUC. A GUC is not the answer beyond the broad-brush, mostly debugging-level features they already support. What do you do if your plan simultaneously needs and does not need nestloops? A query plan is a complicated thing that is the result of detailed analysis of the data. I bet there are fewer than 100 users on the planet with the architectural knowledge of the planner to submit a 'plan'. What users do have is knowledge of the data that the database can't effectively gather for some reason. Looking at my query above, what it would need (assuming the planner could not be made to look through length()) would be something like: SELECT * FROM foo WHERE length(bar) <= 1000 WITH SELECTIVITY 0.999 AND length(bar) >= 2 WITH SELECTIVITY 0.999; Won't this have scaling issues and issues over time as the data in the table changes? Suppose I make a view with the above query. With time, as the data in the table changes, the selectivity values won't be good for planning.
This may potentially lead to a lot of changes in the view definition and other places where this query was used. In general, I think I'll step back from my point that specifying the selectivity is a bad idea. Could this also work (for the time being) for cross-column statistics? Regards, Atri -- Regards, Atri *l'apprenant*
Re: [HACKERS] Planner hints in Postgresql
2014-03-17 19:35 GMT+01:00 Atri Sharma atri.j...@gmail.com: [...] Won't this have scaling issues and issues over time as the data in the table changes? Suppose I make a view with the above query. With time, as the data in the table changes, the selectivity values won't be good for planning. This may potentially lead to a lot of changes in the view definition and other places where this query was used. Could this also work (for the time being) for cross-column statistics? It is a separate issue. I don't believe SELECTIVITY can work well either. Slow queries are usually related to some strange points in the data. I think a good concept should be based on the validity of estimations. Some plans are based on totally wrong estimations but are still fast because they are less sensitive to bad estimations. So a good concept penalizes risky plans, or uses brute force the way a COLUMN store engine does: its plans are usually simple and tolerant of bad estimations. Pavel
Re: [HACKERS] First-draft release notes for next week's releases
On 2014-03-17 14:52:25 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-03-17 14:29:56 -0400, Tom Lane wrote: [ scratches head ... ] If that's what's happening, isn't it a bug in itself? Surely the WAL record ought to point at the tuple that was locked. There's a separate XLOG_HEAP2_LOCK_UPDATED record, for every later tuple version, emitted by heap_lock_updated_tuple_rec(). This really is mind bendingly complex :(. Ah, I see; so only the original tuple in the chain is at risk? Depending on what you define "the original tuple in the chain" to be. No, if you happen to mean the root tuple of a ctid chain or similar; which I guess you didn't. Yes, if you mean the tuple passed to heap_lock_tuple(). heap_xlog_lock_updated() looks (and has looked) correct. How about this: Sounds good to me. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] First-draft release notes for next week's releases
Andres Freund and...@2ndquadrant.com writes: On 2014-03-17 14:29:56 -0400, Tom Lane wrote: [ scratches head ... ] If that's what's happening, isn't it a bug in itself? Surely the WAL record ought to point at the tuple that was locked. There's a separate XLOG_HEAP2_LOCK_UPDATED record, for every later tuple version, emitted by heap_lock_updated_tuple_rec(). This really is mind bendingly complex :(. Ah, I see; so only the original tuple in the chain is at risk? How about this: This error caused updated rows to not be found by index scans, resulting in inconsistent query results depending on whether an index scan was used. Subsequent processing could result in constraint violations, since the previously updated row would not be found by later index searches, thus possibly allowing conflicting rows to be inserted. Since this error is in WAL replay, it would only manifest during crash recovery or on standby servers. The improperly-replayed case most commonly arises when a table row that is referenced by a foreign-key constraint is updated concurrently with creation of a referencing row; but it can also occur when any variant of SELECT FOR UPDATE is applied to a row that is being concurrently updated. regards, tom lane
Re: [HACKERS] Planner hints in Postgresql
Atri Sharma atri.j...@gmail.com writes: Won't this have scaling issues, and issues over time as the data in the table changes? It can't possibly have worse problems of that sort than explicitly specifying a plan does. regards, tom lane
Re: [HACKERS] Planner hints in Postgresql
On Mon, Mar 17, 2014 at 1:45 PM, Pavel Stehule pavel.steh...@gmail.com wrote: I don't believe SELECTIVITY can work well either. Slow queries are usually related to some strange points in the data. I think a good concept should be based on the validity of estimations. Some plans are based on totally wrong estimations but are still fast because they are less sensitive to bad estimations. So a good concept would penalize risky plans - or use brute force, like a COLUMN store engine does. Their plans are usually simple and tolerant of bad estimations. Disagree. There is a special case of slow query where the problem is not with the data but with the expression over the data; something in the query defeats sampled selectivity. Common culprits are: *) CASE expressions *) COALESCE *) casts *) simple transformational expressions *) predicate string concatenation When using those expressions, you often end up with default selectivity assumptions, and if they are way off -- watch out. Plan risk analysis solves a different problem: small changes in the data mean big changes in the execution runtime. It probably wouldn't even help cases where the server thinks there is one row and you actually have thousands or millions, unless you want to implement a selectivity range with perhaps a risk coefficient. This was also suggested some time back and was also met with some skepticism (but it'd be interesting to see!). merlin
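Merlin's point about default selectivity assumptions can be illustrated with a small model. This is a sketch, not PostgreSQL's actual costing code; the constant 0.3333 matches the planner's DEFAULT_INEQ_SEL fallback for inequalities it has no statistics for, and the plain multiplication mirrors the planner's independence assumption when stacking quals:

```python
# Sketch: how stacked default selectivities collapse a row estimate.
# When the planner cannot see through an expression qual (e.g.
# length(bar) <= 1000), it falls back on a hard-wired constant such as
# DEFAULT_INEQ_SEL = 0.3333 and multiplies the quals as if independent.
DEFAULT_INEQ_SEL = 0.3333

def estimated_rows(table_rows, n_default_quals):
    """Row estimate after n quals that all get the default selectivity."""
    sel = DEFAULT_INEQ_SEL ** n_default_quals
    return max(1, round(table_rows * sel))

# On a 1,000,000-row table, each extra "unknowable" qual cuts the
# estimate by roughly 3x, regardless of what the data actually looks like.
for n in (1, 2, 4, 8):
    print(n, estimated_rows(1_000_000, n))
```

With enough stacked quals the estimate bottoms out at 1 row, which is exactly the "outlier culling" failure mode described above: the planner then confidently picks a nested loop that turns out to be disastrous.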
Re: [HACKERS] First-draft release notes for next week's releases
Andres Freund wrote: On 2014-03-17 14:01:03 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: * I wonder if we should make the possible origins a bit more general, as it's perfectly possible to trigger the problem without foreign keys. Maybe: can arise when a table row that has been updated is row locked; that can e.g. happen when foreign keys are used. IIUC, this case only occurs when using the new-in-9.3 types of nonexclusive row locks. I'm willing to bet that the number of applications using those is negligible; so I think it's all right to not mention that case explicitly, as long as the wording doesn't say that foreign keys are the *only* cause (which I didn't). I actually think the issue could also occur with row locks of other severities (is that the correct term?). Alvaro probably knows better, but if I see correctly it's also triggerable if a backend waits for an updating transaction to finish and follow_updates = true is passed to heap_lock_tuple(). Which e.g. nodeLockRows.c does... Uhm. But at the bottom of that block, right above the failed: label (heapam.c line 4527 in current master), we recheck the tuple for locked-only-ness; and fail the whole operation by returning HeapTupleUpdated, if it's not locked-only, no? Which would cause ExecLockRows to grab the next version via EvalPlanQualFetch. Essentially that check is a lock-conflict test, and the only thing that does not conflict with an update is a FOR KEY SHARE lock. Note the only way to pass that test is that either the tuple is locked-only (spelled in three different ways), or !require_sleep. Am I completely misunderstanding what's being said here? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] First-draft release notes for next week's releases
Alvaro Herrera alvhe...@2ndquadrant.com writes: Uhm. But at the bottom of that block, right above the failed: label (heapam.c line 4527 in current master), we recheck the tuple for locked-only-ness; and fail the whole operation by returning HeapTupleUpdated, if it's not locked-only, no? Which would cause ExecLockRows to grab the next version via EvalPlanQualFetch. Essentially that check is a lock-conflict test, and the only thing that does not conflict with an update is a FOR KEY SHARE lock. Right, the row-lock acquisition has to succeed for there to be a risk. I wasn't sure whether 9.3 had introduced any such cases for existing row lock types. regards, tom lane
Re: [HACKERS] Planner hints in Postgresql
On 3/17/14, 12:58 PM, Stephen Frost wrote: * Merlin Moncure (mmonc...@gmail.com) wrote: Yeah -- the most common case I see is outlier culling, where several repeated low non-deterministic selectivity quals stack, reducing the row count estimate to 1. For example: SELECT * FROM foo WHERE length(bar) <= 1000 AND length(bar) >= 2; This is exactly the issue that I've seen also - where we end up picking a Nested Loop because we think only one row is going to be returned and instead we end up getting a bunch and it takes forever. FWIW, I've also seen problems with merge and hash joins at work, but I don't have any concrete examples handy. :( There was also some speculation on trying to change plans mid-stream to address a situation like that, once we realize what's happening. Not sure that's really practical but it would be nice to find some solution. Just being able to detect that something has possibly gone wrong would be useful. We could log that to alert the DBA/user of a potential bad plan. We could even format this in such a fashion that it's suitable for emailing the community with; the query, the plan, the stats, etc. That might make it easier for us to fix the planner (although at this point it seems like we're hitting statistics-gathering problems that we simply don't know how to solve). There is another aspect of this though: plan stability. There are lots of cases where users couldn't care less about getting an optimal plan, but they care *greatly* about not getting a brain-dead plan. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] Planner hints in Postgresql
On Mon, Mar 17, 2014 at 2:02 PM, Jim Nasby j...@nasby.net wrote: Just being able to detect that something has possibly gone wrong would be useful. We could log that to alert the DBA/user of a potential bad plan. We could even format this in such a fashion that it's suitable for emailing the community with; the query, the plan, the stats, etc. That might make it easier for us to fix the planner (although at this point it seems like we're hitting statistics-gathering problems that we simply don't know how to solve). Again, that's not the case here. The problem is that the server is using hard-wired assumptions (like, 10% selective) *instead* of statistics -- at least in the case discussed above. That being said, I think you're on to something: EXPLAIN ANALYZE rowcounts don't indicate whether the row count was generated from data-based assumptions or SWAGs. So maybe you could decorate the plan description with an indicator that suggests when default selectivity rules were hit. There is another aspect of this though: plan stability. There are lots of cases where users couldn't care less about getting an optimal plan, but they care *greatly* about not getting a brain-dead plan. Except for cases I noted above, I don't understand how you could flag 'sub-optimal' or 'brain-dead' plans. The server always picks the best plan it can. The trick is to (in a very simple and CPU-unintensive way) indicate when there isn't a lot of confidence in the plan -- but that's not the same thing. merlin
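Merlin's "decorate the plan" suggestion can be sketched as a toy model. This is hypothetical (no such EXPLAIN feature exists); the `Qual` record, `from_stats` flag, and `explain_scan` printer are all invented here purely to show the idea of flagging estimates that rest on hard-wired defaults rather than gathered statistics:

```python
# Hypothetical sketch: tag each qual's selectivity with whether it came
# from real statistics, and have an EXPLAIN-style printer flag any row
# estimate that depends on a hard-wired default (a "SWAG").
from dataclasses import dataclass

@dataclass
class Qual:
    text: str
    selectivity: float
    from_stats: bool  # False => hard-wired default was used

def explain_scan(table_rows, quals):
    rows = table_rows
    used_default = False
    for q in quals:
        rows *= q.selectivity
        used_default = used_default or not q.from_stats
    marker = " (default selectivity used)" if used_default else ""
    return f"Seq Scan  rows={max(1, round(rows))}{marker}"

print(explain_scan(100_000, [
    Qual("x = 5", 0.01, True),              # selectivity from pg_statistic
    Qual("length(bar) <= 1000", 0.3333, False),  # default fallback
]))
```

The point is only that the information "a default rule fired here" already exists at planning time; surfacing it would let users distinguish data-backed estimates from guesses without any change to plan selection.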
Re: [HACKERS] First-draft release notes for next week's releases
On 2014-03-17 16:17:35 -0300, Alvaro Herrera wrote: Andres Freund wrote: On 2014-03-17 14:01:03 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: * I wonder if we should make the possible origins a bit more general as it's perfectly possible to trigger the problem without foreign keys. Maybe: can arise when a table row that has been updated is row locked; that can e.g. happen when foreign keys are used. IIUC, this case only occurs when using the new-in-9.3 types of nonexclusive row locks. I'm willing to bet that the number of applications using those is negligible; so I think it's all right to not mention that case explicitly, as long as the wording doesn't say that foreign keys are the *only* cause (which I didn't). I actually think the issue could also occur with row locks of other severities (is that the correct term?). Alvaro probably knows better, but if I see correctly it's also triggerable if a backend waits for an updating transaction to finish and follow_updates = true is passed to heap_lock_tuple(). Which e.g. nodeLockRows.c does... Uhm. But at the bottom of that block, right above the failed: label (heapam.c line 4527 in current master), we recheck the tuple for locked-only-ness; and fail the whole operation by returning HeapTupleUpdated, if it's not locked-only, no? Which would cause ExecLockRows to grab the next version via EvalPlanQualFetch. Essentially that check is a lock-conflict test, and the only thing that does not conflict with an update is a FOR KEY SHARE lock. What I was thinking of is the case where heap_lock_tuple() notices it needs to sleep and then in the if (require_sleep) block does a lock_updated_tuple(). If the updating transaction aborts while waiting lock_updated_tuple_rec() will issue a XLOG_HEAP2_LOCK_UPDATED for that row and then return MayBeUpdated. Which will make heap_lock_tuple() successfully lock the row, thereby resetting t_ctid during replay. 
What I missed is that in that case resetting the ctid chain is perfectly fine, since the pointed-to tuple is actually dead. I was just confused by the fact that we do actually issue an XLOG_HEAP2_LOCK_UPDATED for a dead row. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] bpchar functinos
On Sat, Mar 15, 2014 at 05:02:44PM +0330, Mohsen SM wrote: I want to find out when these functions are used (what query causes them to be called): char_bpchar(), bpchar_name(), name_bpchar(). They implement casts. For example, select 'foo'::character(10)::name calls bpchar_name(). -- Noah Misch EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Planner hints in Postgresql
On Mon, Mar 17, 2014 at 01:20:47PM -0500, Merlin Moncure wrote: A query plan is a complicated thing that is the result of detailed analysis of the data. I bet there are fewer than 100 users on the planet with the architectural knowledge of the planner to submit a 'plan'. What users do have is knowledge of the data that the database can't effectively gather for some reason. Looking at my query above, what it would need (assuming the planner could not be made to look through length()) would be something like: SELECT * FROM foo WHERE length(bar) <= 1000 WITH SELECTIVITY 0.999 AND length(bar) >= 2 WITH SELECTIVITY 0.999; A small issue with selectivity is that the selectivity is probably not what users are expecting anyway, since many will relate to conditional selectivities. PostgreSQL is pretty good at single-column statistics; it just sometimes screws up on cross-column correlations. This ties in with alerting about a bad plan: if the EXPLAIN output could list, for each condition, what the actual selectivity was, it might give users a way of understanding the problem. So the example given might lead to output like:

clause                                    selectivity   estimated
length(bar) >= 2                          0.50          0.50
length(bar) <= 1000 | length(bar) >= 2    0.50          0.25

The execution engine can only output conditional selectivities because of the order of execution. But this would at least give users a handle on the problem. Note that a first cut of the problem might simply be something like likely()/unlikely() as in gcc. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
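Martijn's observation that the executor can only report conditional selectivities can be made concrete with a tiny sketch. This is not an existing EXPLAIN feature; the function below is invented to show how per-clause conditional selectivities would fall out of row counts the executor already observes, since each clause only ever sees the rows that survived the clauses before it:

```python
# Sketch: derive per-clause conditional selectivities from the row counts
# observed during execution. Clause i's measured selectivity is
# P(clause i | clauses 0..i-1), because that is all the executor can see.
def conditional_selectivities(rows_in, rows_after_each_clause):
    """rows_after_each_clause[i] = rows surviving clauses 0..i, in order."""
    sels = []
    prev = rows_in
    for surviving in rows_after_each_clause:
        sels.append(surviving / prev if prev else 0.0)
        prev = surviving
    return sels

# 1000 rows scanned; 500 pass the first clause, 250 pass both:
# each clause is 0.50 selective *given the earlier ones*, for a joint
# selectivity of 0.25 -- matching the shape of the table above.
print(conditional_selectivities(1000, [500, 250]))
```

Comparing these measured conditional values against the planner's estimates, clause by clause, is exactly the kind of diagnostic output that would let a user spot a cross-column correlation the single-column statistics missed.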
Re: [HACKERS] First-draft release notes for next week's releases
On Mon, Mar 17, 2014 at 3:28 PM, Tom Lane t...@sss.pgh.pa.us wrote: I'm thinking we'd better promote that Assert to a normal runtime elog. I wasn't sure about this, but on further thought I think it's a really good idea and should be mentioned in the release notes. One of the things that's been bothering me about this bug is that it's really hard to know whether your standby has suffered from the problem, and if you've promoted it you don't know if your database has a problem. That said, it would be nice to actually fix the problem, not just detect it. Eventually vacuum would fix the problem. I think. I'm not really sure what will happen actually. -- greg