Re: [HACKERS] array_length(anyarray)
On 10 January 2014 00:36, Marko Tiikkaja ma...@joh.to wrote: On 1/10/14, 1:20 AM, Merlin Moncure wrote: I'm piling on: it's not clear at all to me why you've special cased this to lower_bound=1. First of all, there are other reasons to check length than iteration. Yes, I agree. A length function that returned 0 for empty arrays would be far from useless. Can you point me to some examples? The example I see all the time is code like if array_length(nodes, 1) > 5 then ... do something ... then you realise (or not, as the case may be) that this doesn't work for empty arrays, and have to remember to wrap it in a coalesce call. Simply being able to write if cardinality(nodes) > 5 then ... do something ... is not just shorter, easier to type and easier to read, it is far less likely to be the source of subtle bugs. Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
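Dean's failure mode is easy to demonstrate (a sketch; note that cardinality() is the function being proposed in this thread, so the last query assumes the proposal is adopted):

```sql
-- array_length() returns NULL, not 0, for an empty array, so a
-- comparison against it yields NULL and the branch is silently skipped:
SELECT array_length('{}'::int[], 1);               -- NULL
SELECT coalesce(array_length('{}'::int[], 1), 0);  -- 0, the usual workaround
-- the proposed cardinality() would return 0 directly:
SELECT cardinality('{}'::int[]);                   -- 0 (proposed behaviour)
```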
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Fri, Jan 10, 2014 at 4:09 AM, Dean Rasheed dean.a.rash...@gmail.com wrote: Hi, Reading over this, I realised that there is a problem with NaN handling --- once the state becomes NaN, it can never recover. So the results using the inverse transition function don't match HEAD in cases like this:

create table t(a int, b numeric);
insert into t values(1,1),(2,2),(3,'NaN'),(4,3),(5,4);
select a, b, sum(b) over(order by a rows between 1 preceding and current row) from t;

which in HEAD produces:

 a |  b  | sum
---+-----+-----
 1 |   1 |   1
 2 |   2 |   3
 3 | NaN | NaN
 4 |   3 | NaN
 5 |   4 |   7
(5 rows)

but with this patch produces:

 a |  b  | sum
---+-----+-----
 1 |   1 |   1
 2 |   2 |   3
 3 | NaN | NaN
 4 |   3 | NaN
 5 |   4 | NaN
(5 rows)

Nice catch! Thanks for having a look at the patch. Ok, so I thought about this and I don't think it's too big a problem to fix at all. I think it can be handled very similarly to how I'm taking care of NULL values in the window frame. For those, I simply keep a count of them in an int64, and when the last one leaves the aggregate context things can continue as normal. Lucky for us that all numeric aggregation (and now inverse aggregation) goes through 2 functions: do_numeric_accum() and its new inverse version do_numeric_discard(). Both of these functions operate on a NumericAggState, in which (in the attached) I've changed the isNaN bool field to a NaNCount int64 field. I'm just doing NaNCount++ when we get a NaN value in do_numeric_accum() and NaNCount-- in do_numeric_discard(); in the final functions I'm just checking if NaNCount > 0. Though this implementation does fix the reported problem, unfortunately it may have an undesired performance impact for numeric aggregate functions when not used in the context of a window.
Let me explain what I mean: Previously there was some code in do_numeric_accum() which did:

if (state->isNaN || NUMERIC_IS_NAN(newval))
{
    state->isNaN = true;
    return;
}

which meant that it didn't bother adding new, perfectly valid numerics to the aggregate totals when a NaN had been encountered previously. I had to change this to continue on regardless, as we still need to keep the totals just in case all the NaN values are removed and the totals are required once again. This means that the non-window versions of SUM(numeric) and AVG(numeric) and the stddev aggregates for numeric pay a price and have to keep on totaling after encountering NaN values. :( If there was a way to know whether the function was being called in a window context or a normal aggregate context, then we could probably almost completely restore that possible performance regression just by skipping the totaling when not in a window context. I really don't know how common NaN values are in the real world to know if this matters too much. I'd hazard a guess that more people would benefit from inverse transitions on numeric types, but I have nothing to back that up. I've attached version 2 of the patch which fixes the NaN problem and adds a regression test to cover it. Thanks again for testing this and finding the problem. Regards David Rowley Regards, Dean inverse_transition_functions_v2.0.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Fri, Jan 10, 2014 at 5:15 AM, Tom Lane t...@sss.pgh.pa.us wrote: Dean Rasheed dean.a.rash...@gmail.com writes: Reading over this, I realised that there is a problem with NaN handling --- once the state becomes NaN, it can never recover. So the results using the inverse transition function don't match HEAD in cases like this: Ouch! That takes out numeric, float4, and float8 in one fell swoop. Given the relative infrequency of NaNs in most data, it seems like it might still be possible to get a speedup if we could use inverse transitions until we hit a NaN, then do it the hard way until the NaN is outside the window, then go back to inverse transitions. I'm not sure though if this is at all practical from an implementation standpoint. We certainly don't want the core code knowing about anything as datatype-specific as a NaN, but maybe the inverse transition function could have an API that allows reporting "I can't do it here, fall back to the hard way". I had thought about that API, not for numeric (I think I've managed to find another solution there), but for MAX and MIN. I posted an idea about it here: http://www.postgresql.org/message-id/caaphdvqu+ygw0vbpbb+yxhrpg5vcy_kifyi8xmxfo8kyocz...@mail.gmail.com but it didn't generate much interest at the time, and I didn't have any ideas on how the inverse aggregate functions would communicate this inability to remove the value to the caller. Perhaps it would be an idea still, but I had put it to the back of my mind in favour of tuplestore indexes that could be created on the fly based on the row position within the frame and the aggregate's sort operator on the aggregate value. This would mean that MAX and MIN values could be found quickly all the time, rather than just when the value being removed happened not to affect the current maximum or minimum. It's not something I have planned for this patch though, and I'd have lots of questions around memory allocation before I'd want to start any work on it.
Regards David Rowley regards, tom lane
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
bool_or(): FALSE can be removed; removing TRUE requires a rescan. Could be made fully invertible by counting the number of TRUE and FALSE values, similar to my suggestion for how to handle NaN for sum(numeric). The same works for bool_and(). bit_or(): Like bool_or(), 0 can be removed; everything else requires a rescan. The same works for bit_and(). Interesting, I'd not thought of any way to optimise these ones, but I had originally thought about allowing the inverse transition functions to report whether they could perform the inverse transition based on the value they received, and if they reported failure, then perform the rescan. I just don't quite know the best way for the inverse transition function to communicate this to the caller yet. If you have any ideas on the best way to do this then I'd really like to hear them. Regards David Rowley
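The counting scheme suggested above can be illustrated in plain SQL (a sketch, not the patch's implementation): bool_or() is equivalent to keeping a running count of TRUE inputs and testing it against zero, and a count can be decremented as a row leaves the window frame.

```sql
-- bool_or() restated as an invertible count of TRUE inputs
SELECT bool_or(b) AS direct,
       sum(CASE WHEN b THEN 1 ELSE 0 END) > 0 AS via_count
FROM (VALUES (true), (false), (false)) AS t(b);
-- both columns yield true
```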
Re: [HACKERS] array_length(anyarray)
On 1/10/14, 9:04 AM, Dean Rasheed wrote: On 10 January 2014 00:36, Marko Tiikkaja ma...@joh.to wrote: Can you point me to some examples? The example I see all the time is code like if array_length(nodes, 1) > 5 then ... do something ... then you realise (or not, as the case may be) that this doesn't work for empty arrays, and have to remember to wrap it in a coalesce call. Simply being able to write if cardinality(nodes) > 5 then ... do something ... is not just shorter, easier to type and easier to read, it is far less likely to be the source of subtle bugs. But this is what I don't understand: why do you care whether there are fewer than 5 elements in the array, but you don't care about how they're organized? '[2:3]={1,2}'::int[] and '{{1},{2}}'::int[] both give the same result when unnest()ed, sure, but why do you want to accept such crap as input if you just want a list of elements? I guess what I truly want is a less generic type that's like an array, but always one-dimensional with a lower bound of 1. There's too much garbage that can be passed to a function taking an array as an input right now. Regards, Marko Tiikkaja
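Marko's point is that two structurally different arrays unnest() identically (illustrative queries):

```sql
-- a 1-D array with lower bound 2, and a 2-D array: same elements...
SELECT unnest('[2:3]={1,2}'::int[]);      -- 1, 2
SELECT unnest('{{1},{2}}'::int[]);        -- 1, 2
-- ...but very different shapes:
SELECT array_dims('[2:3]={1,2}'::int[]);  -- [2:3]
SELECT array_dims('{{1},{2}}'::int[]);    -- [1:2][1:1]
```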
Re: [HACKERS] array_length(anyarray)
On Fri, Jan 10, 2014 at 2:04 AM, Dean Rasheed dean.a.rash...@gmail.com wrote: On 10 January 2014 00:36, Marko Tiikkaja ma...@joh.to wrote: On 1/10/14, 1:20 AM, Merlin Moncure wrote: I'm piling on: it's not clear at all to me why you've special cased this to lower_bound=1. First of all, there are other reasons to check length than iteration. Yes, I agree. A length function that returned 0 for empty arrays would be far from useless. Can you point me to some examples? The example I see all the time is code like if array_length(nodes, 1) > 5 then ... do something ... then you realise (or not, as the case may be) that this doesn't work for empty arrays, and have to remember to wrap it in a coalesce call. Simply being able to write if cardinality(nodes) > 5 then ... do something ... is not just shorter, easier to type and easier to read, it is far less likely to be the source of subtle bugs. right -- exactly. or, 'ORDER BY cardinality(nodes)', etc etc. Furthermore, we already have pretty good support for iteration with arrays via unnest(). What's needed for better iteration support (IMO) is a function that does what unnest does but returns an array of indexes (one per dimension) -- a generalization of the _pg_expandarray function. Let's say 'unnest_dims'. 'unnest_dims' is non-trivial to code in user land, while 'array_length' is an extremely trivial wrapper over array_upper(). cardinality() (which is a much better name for the function, IMSNHO) gives a*b*c values (say, for a 3-D array) and also does something non-trivial *particularly in the case of offset arrays*. On Fri, Jan 10, 2014 at 3:36 AM, Marko Tiikkaja ma...@joh.to wrote: I guess what I truly want is a less generic type that's like an array, but always one-dimensional with a lower bound of 1. Your function would be the only one in the array API that implemented special behaviors like that. That suggests to me that the less generic function belongs in user land, not in the core array API.
merlin
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On 10 January 2014 08:12, David Rowley dgrowle...@gmail.com wrote: On Fri, Jan 10, 2014 at 4:09 AM, Dean Rasheed dean.a.rash...@gmail.com wrote: Hi, Reading over this, I realised that there is a problem with NaN handling --- once the state becomes NaN, it can never recover. So the results using the inverse transition function don't match HEAD in cases like this:

create table t(a int, b numeric);
insert into t values(1,1),(2,2),(3,'NaN'),(4,3),(5,4);
select a, b, sum(b) over(order by a rows between 1 preceding and current row) from t;

which in HEAD produces:

 a |  b  | sum
---+-----+-----
 1 |   1 |   1
 2 |   2 |   3
 3 | NaN | NaN
 4 |   3 | NaN
 5 |   4 |   7
(5 rows)

but with this patch produces:

 a |  b  | sum
---+-----+-----
 1 |   1 |   1
 2 |   2 |   3
 3 | NaN | NaN
 4 |   3 | NaN
 5 |   4 | NaN
(5 rows)

Nice catch! Thanks for having a look at the patch. Ok, so I thought about this and I don't think it's too big a problem to fix at all. I think it can be handled very similarly to how I'm taking care of NULL values in the window frame. For those, I simply keep a count of them in an int64, and when the last one leaves the aggregate context things can continue as normal. Lucky for us that all numeric aggregation (and now inverse aggregation) goes through 2 functions: do_numeric_accum() and its new inverse version do_numeric_discard(). Both of these functions operate on a NumericAggState, in which (in the attached) I've changed the isNaN bool field to a NaNCount int64 field. I'm just doing NaNCount++ when we get a NaN value in do_numeric_accum() and NaNCount-- in do_numeric_discard(); in the final functions I'm just checking if NaNCount > 0. Cool, that sounds like a neat fix. Though this implementation does fix the reported problem, unfortunately it may have an undesired performance impact for numeric aggregate functions when not used in the context of a window.
Let me explain what I mean: Previously there was some code in do_numeric_accum() which did:

if (state->isNaN || NUMERIC_IS_NAN(newval))
{
    state->isNaN = true;
    return;
}

which meant that it didn't bother adding new, perfectly valid numerics to the aggregate totals when a NaN had been encountered previously. I had to change this to continue on regardless, as we still need to keep the totals just in case all the NaN values are removed and the totals are required once again. This means that the non-window versions of SUM(numeric) and AVG(numeric) and the stddev aggregates for numeric pay a price and have to keep on totaling after encountering NaN values. :( I suspect that NaNs almost never occur in practice, so the fact that it might now take longer to tell you that the sum is NaN doesn't worry me. More important is that it always gives the right answer. Note, if anyone ever did this for floats, +/- Infinity would also need to be handled, so you'd have to maintain 3 counts and deal with logic like Infinity - Infinity = NaN. Regards, Dean
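Dean's point about floats can be checked directly: PostgreSQL's float8 follows IEEE 754, where infinities are absorbing for addition but their difference is NaN, so a single NaN counter would not be enough (illustrative queries):

```sql
SELECT 'Infinity'::float8 + 'Infinity'::float8;  -- Infinity
SELECT 'Infinity'::float8 - 'Infinity'::float8;  -- NaN
```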
Re: [HACKERS] array_length(anyarray)
On 1/10/14, 10:41 AM, Merlin Moncure wrote: What's needed for better iteration support (IMO) is a function that does what unnest does but returns an array of indexes (one per dimension) -- a generalization of the _pg_expandarray function. Let's say 'unnest_dims'. So unnest_dims('{{1,2},{3,4}}'::int[]) would return VALUES (1, '{1,2}'::int[]), (2, '{3,4}'::int[])? If so, then yes, that's a functionality I've considered us to have been missing for a long time. Regards, Marko Tiikkaja
Re: [HACKERS] array_length(anyarray)
On Fri, Jan 10, 2014 at 3:52 AM, Marko Tiikkaja ma...@joh.to wrote: On 1/10/14, 10:41 AM, Merlin Moncure wrote: What's needed for better iteration support (IMO) is a function that does what unnest does but returns an array of indexes (one per dimension) -- a generalization of the _pg_expandarray function. Let's say 'unnest_dims'. So unnest_dims('{{1,2},{3,4}}'::int[]) would return VALUES (1, '{1,2}'::int[]), (2, '{3,4}'::int[])? If so, then yes, that's a functionality I've considered us to have been missing for a long time. not quite. it returns int[], anyelement: so, using your example, you'd get:

[1,1], 1
[1,2], 2
[2,1], 3
[2,2], 4

like unnest() it would fully decompose the array to individual elements. what you have above slices the array, which is useful, but probably shouldn't live under the 'unnest' name -- perhaps 'slice'. Pavel added it to pl/pgsql under the FOREACH syntax (FYI). merlin
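For reference, something close to the proposed unnest_dims() can already be written with generate_subscripts() (a sketch; 'unnest_dims' itself is hypothetical):

```sql
-- enumerate each element of a 2-D array together with its coordinates;
-- the function calls can refer to s.a because LATERAL is implicit for
-- set-returning functions in FROM (PostgreSQL 9.3+)
SELECT ARRAY[i, j] AS dims, a[i][j] AS value
FROM (SELECT '{{1,2},{3,4}}'::int[] AS a) AS s,
     generate_subscripts(a, 1) AS i,
     generate_subscripts(a, 2) AS j
ORDER BY i, j;
--  dims  | value
-- {1,1}  |     1
-- {1,2}  |     2
-- {2,1}  |     3
-- {2,2}  |     4
```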
Re: [HACKERS] [PATCH] Relocation of tablespaces in pg_basebackup
Hi Steeve, On 09/01/14 22:38, Steeve Lennmark wrote: I'm a barman user myself so that was actually my initial thought. Ah! Very good! If there isn't some kind of hidden internal that I've missed, I don't see a way to convert an OID (I only have the OID and path at this stage) to a tablespace name. This solution, even though not optimal, is a lot better than my initial one where I used the OID directly. Try: SELECT spcname, oid, pg_tablespace_location(oid) FROM pg_tablespace Thanks, Gabriele -- Gabriele Bartolini - 2ndQuadrant Italia PostgreSQL Training, Services and Support gabriele.bartol...@2ndquadrant.it | www.2ndQuadrant.it
Re: [HACKERS] [PATCH] Relocation of tablespaces in pg_basebackup
On Fri, Jan 10, 2014 at 12:25 PM, Gabriele Bartolini gabriele.bartol...@2ndquadrant.it wrote: Hi Steeve, On 09/01/14 22:38, Steeve Lennmark wrote: I'm a barman user myself so that was actually my initial thought. Ah! Very good! If there isn't some kind of hidden internal that I've missed, I don't see a way to convert an OID (only have OID and path at this stage) to a tablespace name. This solution, even though not optimal, is a lot better than my initial one where I used the OID directly. Try: SELECT spcname, oid, pg_tablespace_location(oid) FROM pg_tablespace That would require a second connection to the database. You cannot run that query from the walsender session. And that's exactly the issue that Steeve pointed out in his first email. I think it's better to let pg_basebackup work at the lower level, and then leave it to higher level tools to be able to do the mapping to names. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Re: [HACKERS] [PATCH] Relocation of tablespaces in pg_basebackup
On 2014-01-10 12:27:23 +0100, Magnus Hagander wrote: On Fri, Jan 10, 2014 at 12:25 PM, Gabriele Bartolini gabriele.bartol...@2ndquadrant.it wrote: Hi Steeve, On 09/01/14 22:38, Steeve Lennmark wrote: I'm a barman user myself so that was actually my initial thought. Ah! Very good! If there isn't some kind of hidden internal that I've missed, I don't see a way to convert an OID (only have OID and path at this stage) to a tablespace name. This solution, even though not optimal, is a lot better than my initial one where I used the OID directly. Try: SELECT spcname, oid, pg_tablespace_location(oid) FROM pg_tablespace That would require a second connection to the database. You cannot run that query from the walsender session. And that's exactly the issue that Steeve pointed out in his first email. Theoretically nothing is stopping us from providing a command outputting that information - it's a global catalog, so we can access it without problems. I think it's better to let pg_basebackup work at the lower level, and then leave it to higher level tools to be able to do the mapping to names. That doesn't negate this argument though. Not really convinced either way yet. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] array_length(anyarray)
On Jan10, 2014, at 11:00 , Merlin Moncure mmonc...@gmail.com wrote: On Fri, Jan 10, 2014 at 3:52 AM, Marko Tiikkaja ma...@joh.to wrote: On 1/10/14, 10:41 AM, Merlin Moncure wrote: What's needed for better iteration support (IMO) is a function that does what unnest does but returns an array of indexes (one per dimension) -- a generalization of the _pg_expandarray function. Let's say 'unnest_dims'. So unnest_dims('{{1,2},{3,4}}'::int[]) would return VALUES (1, '{1,2}'::int[]), (2, '{3,4}'::int[])? If so, then yes, that's a functionality I've considered us to have been missing for a long time. not quite. it returns int[], anyelement: so, using your example, you'd get:

[1,1], 1
[1,2], 2
[2,1], 3
[2,2], 4

Now that we have WITH ORDINALITY, it'd be sufficient to have a variant of array_dims() that returns int[][] instead of text, say array_dimsarray(). Your unnest_dims could then be written as unnest(array_dimsarray(array)) with ordinality. best regards, florian pflug
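For context, WITH ORDINALITY (new in the 9.4 development cycle) numbers the rows a set-returning function emits; Florian's array_dimsarray() is hypothetical, but the ordinality half of his suggestion looks like this:

```sql
SELECT * FROM unnest('{{1,2},{3,4}}'::int[]) WITH ORDINALITY AS u(elem, ord);
-- elem | ord
--    1 |   1
--    2 |   2
--    3 |   3
--    4 |   4
```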
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Jan10, 2014, at 09:34 , David Rowley dgrowle...@gmail.com wrote: I just don't quite know the best way for the inverse transition function to communicate this to the caller yet. If you have any ideas on the best way to do this then I'd really like to hear them. Could they maybe just return NULL as the new state? It would mean that aggregates that do want to provide an inverse transition function couldn't use NULL as a valid aggregate state, but do we need to support that? Looking at the code it seems that for quite a few existing aggregates, the state remains NULL until the first non-NULL input is processed. But that doesn't hurt much - those aggregates can remain as they are until someone wants to add an inverse transition function. Once that happens, there's a choice between living with needless rescans on trailing NULL inputs or changing the state type. This solution isn't particularly pretty, but I don't currently see a good alternative that allows implementing inverse transition functions in something other than C while avoiding needless overhead for those which are written in C. best regards, Florian Pflug
Re: [HACKERS] array_length(anyarray)
On Fri, Jan 10, 2014 at 6:00 AM, Florian Pflug f...@phlo.org wrote: On Jan10, 2014, at 11:00 , Merlin Moncure mmonc...@gmail.com wrote: On Fri, Jan 10, 2014 at 3:52 AM, Marko Tiikkaja ma...@joh.to wrote: On 1/10/14, 10:41 AM, Merlin Moncure wrote: What's needed for better iteration support (IMO) is a function that does what unnest does but returns an array of indexes (one per dimension) -- a generalization of the _pg_expandarray function. Let's say 'unnest_dims'. So unnest_dims('{{1,2},{3,4}}'::int[]) would return VALUES (1, '{1,2}'::int[]), (2, '{3,4}'::int[])? If so, then yes, that's a functionality I've considered us to have been missing for a long time. not quite. it returns int[], anyelement: so, using your example, you'd get:

[1,1], 1
[1,2], 2
[2,1], 3
[2,2], 4

Now that we have WITH ORDINALITY, it'd be sufficient to have a variant of array_dims() that returns int[][] instead of text, say array_dimsarray(). Your unnest_dims could then be written as unnest(array_dimsarray(array)) with ordinality. hm, not quite following that. maybe an example? my issue with 'WITH ORDINALITY' (while it's pretty neat) is that it doesn't give you the dimension coordinate of each datum, so you can't really use it to slice. with unnest_dims(), you can slice, say via:

select array_agg(value) from unnest_dims('{{1,2},{3,4}}'::int[]) group by dims[1];

or

select array_agg(value) from unnest_dims('{{1,2},{3,4}}'::int[]) where dims[1] = 2;

not super elegant, but good enough for most uses I think. anyways, getting back on topic, the question on the table is cardinality() vs array_length, right? merlin
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
Florian Pflug f...@phlo.org writes: On Jan10, 2014, at 09:34 , David Rowley dgrowle...@gmail.com wrote: I just don't quite know the best way for the inverse transition function to communicate this to the caller yet. If you have any ideas on the best way to do this then I'd really like to hear them. Could they maybe just return NULL as the new state? It would mean that aggregates that do want to provide an inverse transition function couldn't use NULL as a valid aggregate state, but do we need to support that? Yeah, I was going to suggest the same. Seems like it wouldn't be that difficult to come up with some alternative representation for "everything seen so far was NULL", if you needed to. Looking at the code it seems that for quite a few existing aggregates, the state remains NULL until the first non-NULL input is processed. But that doesn't hurt much - those aggregates can remain as they are until someone wants to add an inverse transition function. Once that happens, there's a choice between living with needless rescans on trailing NULL inputs or changing the state type. Also, it might be reasonable for both the regular and the inverse transition functions to be strict. If a null entering the window does not matter, then a null exiting the window doesn't either, no? regards, tom lane
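Tom's observation relies on how strict transition functions already behave: the executor skips NULL inputs entirely, so such an aggregate never sees them going in, and symmetrically would never need to remove them. A minimal illustration (my_sum is a made-up name):

```sql
-- int4pl is strict, so NULL inputs are skipped by the aggregate machinery;
-- with no initcond, the first non-NULL input becomes the initial state
CREATE AGGREGATE my_sum(int) (
    sfunc = int4pl,
    stype = int
);
SELECT my_sum(x) FROM (VALUES (1), (NULL), (2)) AS t(x);  -- 3
```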
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On 01/10/2014 05:36 AM, Peter Geoghegan wrote: While I realize that everyone is busy, I'm concerned about the lack of discussion here. It's been 6 full days since I posted my benchmark, which I expected to quickly clear some things up, or at least garner interest, and yet no one has commented here since. Nah, that's nothing. I have a patch in the January commitfest that was already posted for the previous commitfest. It received zero review back then, and still has no reviewer signed up, let alone anyone actually reviewing it. And arguably it's a bug fix! http://www.postgresql.org/message-id/5285071b.1040...@vmware.com Wink wink, if you're looking for patches to review... ;-) The alternative exclusion* patch still deadlocks in an unprincipled fashion, when simple, idiomatic usage encounters contention. Heikki intends to produce a revision that fixes the problem, though having considered it carefully myself, I don't know what mechanism he has in mind, and frankly I'm skeptical. Here's an updated patch. Hope it works now... This is based on an older version, and doesn't include any fixes from your latest btreelock_insert_on_dup.v7.2014_01_07.patch. Please check the common parts, and copy over any relevant changes. The fix for the deadlocking issue consists of a few parts. First, there's a new heavy-weight lock type, a speculative insertion lock, which works somewhat similarly to XactLockTableWait(), but is only held for the duration of a single speculative insertion. When a backend is about to begin a speculative insertion, it first acquires the speculative insertion lock. When it's done with the insertion, meaning it has either cancelled it by killing the already-inserted tuple or decided that it's going to go ahead with it, the lock is released. The speculative insertion lock is keyed by Xid and token. The lock can be taken many times in the same transaction, and the token's purpose is to distinguish which insertion is currently in progress.
The token is simply a backend-local counter, incremented each time the lock is taken. In addition to the heavy-weight lock, there are new fields in PGPROC to indicate which tuple the backend is currently inserting. When the tuple is inserted, the backend fills in the relation's relfilenode and item pointer in the MyProc->specInsert* fields, while still holding the buffer lock. The current speculative insertion token is also stored there. With that mechanism, when another backend sees a tuple whose xmin is still in progress, it can check if the insertion is a speculative insertion. To do that, scan the proc array, and find the backend with the given xid. Then, check that the relfilenode and itempointer in that backend's PGPROC slot match the tuple, and make note of the token the backend had advertised. HeapTupleSatisfiesDirty() does the proc array check, and returns the token in the snapshot, alongside snapshot->xmin. The caller can then use that information in place of XactLockTableWait(). There would be other ways to skin the cat, but this seemed like the quickest to implement. One more straightforward approach would be to use the tuple's TID directly in the speculative insertion lock's key, instead of Xid+token, but then the inserter would have to grab the heavy-weight lock while holding the buffer lock, which seems dangerous. Another alternative would be to store the token in the heap tuple header, instead of PGPROC; a tuple that's still being speculatively inserted has no xmax, so it could be placed in that field. Or ctid. More importantly, I have to question whether we should continue to pursue that alternative approach, given what we now know about its performance characteristics. Yes. It could be improved, but not by terribly much, particularly for the case where there is plenty of update contention, which was shown in [1] to be approximately 2-3 times slower than extended page locking (*and* it's already looking for would-be duplicates *first*).
I'm trying to be as fair as possible, and yet the difference is huge. *shrug*. I'm not too concerned about performance during contention. But let's see how this fixed version performs. Could you repeat the tests you did with this? Any guesses what the bottleneck is? At a quick glance at a profile of a pgbench run with this patch, I didn't see anything out of the ordinary, so I'm guessing it's lock contention somewhere. - Heikki speculative-insertions-2014_01_10.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On 01/08/2014 06:46 AM, Peter Geoghegan wrote: A new revision of my patch is attached. I'm getting deadlocks with this patch, using the test script you posted earlier in http://www.postgresql.org/message-id/CAM3SWZQh=8xnvgbbzyhjexujbhwznjutjez9t-dbo9t_mx_...@mail.gmail.com. Am I doing something wrong, or is that a regression? - Heikki
[HACKERS] getenv used in libpq caused missing values under Windows
Hello, pgsql-hackers. As you probably know, dealing with Windows MSVCRT is some kind of hell. That's why we have src/port/win32env.c, particularly because there may be several CRTs loaded at the same time. libpq unfortunately uses the standard getenv function call to fill connection parameters inside conninfo_add_defaults(), see http://www.postgresql.org/docs/9.3/interactive/libpq-envars.html This happens because MSVCRT has its own local copy of the environment variables table, and this table is not updated during execution. So if one used the SetEnvironmentVariable() Windows API to set any variable, e.g. PGHOST, then this change will not be visible inside libpq's conninfo_add_defaults() function where all default values are obtained. This situation is especially unpleasant for non-C developers, since there is no opportunity to use the standard putenv function or the ported pgwin32_putenv. My proposal is to implement a pgwin32_getenv function which will call GetEnvironmentVariable first and, if that returns NULL, call MSVCRT's getenv, in the same way as pgwin32_putenv does. So now the bad scenario is: 1. SetEnvironmentVariable('PGHOST=192.188.9.9') 2. PQconnectdbParams without specifying the host parameter will fail -- With best wishes, Pavel mailto:pa...@gf.microolap.com
Re: [HACKERS] Add CREATE support to event triggers
On 8 January 2014 20:42, Alvaro Herrera alvhe...@2ndquadrant.com wrote: CREATE SCHEMA IF NOT EXISTS some schema AUTHORIZATION some guy; Hmm, given that in 9.3 it was OK to have only DROP event triggers, I think it should be equally acceptable to have just CREATE, but without every option on CREATE. CREATE SCHEMA is easily the most complex thing here and would be the command/event to deprioritise if we had any issues getting this done/agreeing on something for 9.4. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
2. Provide a new reloption to specify WAL compression for update operations on a table: Create table tbl(c1 char(100)) With (compress_wal = true); Alternative options: a. compress_wal can take an operation as input, e.g. 'insert', 'update'; b. use alternate syntax: Create table tbl(c1 char(100)) Compress Wal For Update; c. anything better? I think WITH (compress_wal = true) is pretty good. I don't understand your point about taking the operation as input, because this only applies to updates. But we could try to work "update" into the name of the setting somehow, so as to be less likely to conflict with future additions, like maybe wal_compress_update. I think the alternate syntax you propose is clearly worse, because it would involve adding new keywords, something we try to avoid. The only possible enhancement I can think of here is to make the setting an integer rather than a Boolean, defined as the minimum acceptable compression ratio. A setting of 0 means always compress; a setting of 100 means never compress; intermediate values define the least acceptable ratio. But to be honest, I think that's overkill; I'd be inclined to hard-code the default value of 25 in the patch and make it a #define. The only real advantage of requiring a minimum 25% compression percentage is that we can bail out of compression three-quarters of the way through the tuple if we're getting nowhere. That's fine for what it is, but the idea that users are going to see much benefit from twiddling that number seems very dubious to me. Points to consider - 1. As the current algorithm stores the entry for identical chunks at the head of the list, it will always find the last-but-one chunk (we don't store the last 4 bytes) for a long matching string during the match phase of encoding (pgrb_delta_encode). We can improve this either by storing identical chunks at the end of the list instead of at the head, or by trying the good_match technique used in the LZ algorithm.
Finding a good_match technique can have overhead in some of the cases where there is actually no match. I don't see what the good_match thing has to do with anything in the Rabin algorithm. But I do think there might be a bug here, which is that, unless I'm misinterpreting something, hp is NOT the end of the chunk. After calling pgrb_hash_init(), we've looked at the first FOUR bytes of the input. If we find that we have a zero hash value at that point, shouldn't the chunk size be 4, not 1? And similarly, if we find it after sucking in one more byte, shouldn't the chunk size be 5, not 2? Right now, we're deciding where the chunks should end based on the data in the chunk plus the following 3 bytes, and that seems wonky. I would expect us to include all of those bytes in the chunk. 2. Another optimization that we can do in pgrb_find_match() is that currently, if it doesn't find the first chunk (the chunk found via the hash index) matching, it continues to look for a match in the other chunks. I am not sure there is any benefit in searching the other chunks if the first one doesn't match. Well, if you took that out, I suspect it would hurt the compression ratio. Unless the CPU savings are substantial, I'd leave it alone. 3. We can move the code from pg_lzcompress.c to some new file pg_rbcompress.c; if we want to move it, then we need to either duplicate some common macros like pglz_out_tag or keep them common, possibly with changed names. +1 for a new file. 4. Decide on the min and max chunk size (currently kept as 2 and 4 respectively). The point to consider is that if we keep bigger chunk sizes, it can save us CPU cycles but give less reduction in WAL; on the other side, if we keep them small, we get better WAL reduction but consume more CPU cycles. Whoa. That seems way too small. Since PGRB_PATTERN_AFTER_BITS is 4, the average length of a chunk is about 16 bytes. It makes little sense to have the maximum chunk size be 25% of the expected chunk length.
I'd recommend making the maximum chunk length something like 4 * PGRB_CONST_NUM, and the minimum chunk length maybe something like 4. 5. Kept a GUC variable 'wal_update_compression_ratio' for testing purposes; we can remove it before commit. Let's remove it now. 7. Docs need to be updated; tab completion needs some work. Tab completion can be skipped for now, but documentation is important. 8. We can extend ALTER TABLE to set the compress option for a table. I don't understand what you have in mind here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add CREATE support to event triggers
On Thu, Jan 9, 2014 at 5:17 PM, Jim Nasby j...@nasby.net wrote: On 1/9/14, 11:58 AM, Alvaro Herrera wrote: Robert Haas escribió: On Wed, Jan 8, 2014 at 10:27 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Hmm. This seems like a reasonable thing to do, except that I would like the output to always be the constant, and have some other way to enable the clause or disable it. With your present boolean: so if_not_exists: {output: IF NOT EXISTS, present: true/false} Why not: if_not_exists: true/false Yeah, that's another option. If we do this, though, the expansion function would have to know that an if_not_exist element expands to IF NOT EXISTS. Maybe that's okay. Right now, the expansion function is pretty stupid, which is nice. Yeah, the source side of this will always have to understand the nuances of every command; it'd be really nice to not burden the other side with that as well. The only downside I see is a larger JSON output, but meh. Another advantage is if you really wanted to you could modify the output formatting in the JSON doc to do something radically different if so inclined... Yeah. I wasn't necessarily objecting to the way Alvaro did it, just asking why he did it that way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Standalone synchronous master
On Fri, Jan 10, 2014 at 10:21:42AM +0530, Amit Kapila wrote: On Thu, Jan 9, 2014 at 10:45 PM, Bruce Momjian br...@momjian.us wrote: I think RAID-1 is a very good comparison because it is a successful technology and has similar issues. RAID-1 is like Postgres synchronous_standby_names mode in the sense that the RAID-1 controller will not return success until writes have happened on both mirrors, but it is unlike synchronous_standby_names in that it will degrade and continue writes even when it can't write to both mirrors. What is being discussed is to allow the RAID-1 behavior in Postgres. One issue that came up in discussions is the insufficiency of writing a degrade notice in a server log file, because the log file isn't durable across server failures, meaning you don't know if a fail-over to the slave lost commits. The degrade message has to be stored durably against a server failure, e.g. on a pager, probably using a command like we do for archive_command, and has to return success before the server continues in degrade mode. I assume degraded RAID-1 controllers inform administrators in the same way. Here I think that if the user is aware from the beginning that this is the behaviour, then maybe the importance of the message is not very high. What I want to say is that we could provide a UI in such a way that the user decides, during server setup, the behaviour he requires. For example, we could provide a new parameter available_synchronous_standby_names along with the current parameter, and ask the user to use this new parameter if he wishes to commit transactions synchronously on another server when it is available; otherwise it will operate as a standalone sync master. I know there was a desire to remove this TODO item, but I think we have brought up enough new issues that we can keep it to see if we can come up with a solution. I have added a link to this discussion on the TODO item.
I think we will need at least four new GUC variables:

* timeout control for degraded mode
* command to run during switch to degraded mode
* command to run during switch from degraded mode
* read-only variable to report degraded mode

-- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add CREATE support to event triggers
On Fri, Jan 10, 2014 at 10:36 AM, Simon Riggs si...@2ndquadrant.com wrote: On 8 January 2014 20:42, Alvaro Herrera alvhe...@2ndquadrant.com wrote: CREATE SCHEMA IF NOT EXISTS some schema AUTHORIZATION some guy; Hmm, given in 9.3 it was OK to have only DROP event triggers, I think it should be equally acceptable to have just CREATE, but without every option on CREATE. CREATE SCHEMA is easily the most complex thing here and would be the command/event to deprioritise if we had any issues getting this done/agreeing something for 9.4. I don't know that I agree with that, but I guess we can cross that bridge when we come to it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add CREATE support to event triggers
On 10 January 2014 15:48, Robert Haas robertmh...@gmail.com wrote: On Fri, Jan 10, 2014 at 10:36 AM, Simon Riggs si...@2ndquadrant.com wrote: On 8 January 2014 20:42, Alvaro Herrera alvhe...@2ndquadrant.com wrote: CREATE SCHEMA IF NOT EXISTS some schema AUTHORIZATION some guy; Hmm, given in 9.3 it was OK to have only DROP event triggers, I think it should be equally acceptable to have just CREATE, but without every option on CREATE. CREATE SCHEMA is easily the most complex thing here and would be the command/event to deprioritise if we had any issues getting this done/agreeing something for 9.4. I don't know that I agree with that, but I guess we can cross that bridge when we come to it. We've come to it... You would prefer either everything or nothing?? On what grounds? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Standalone synchronous master
On 10 January 2014 15:47, Bruce Momjian br...@momjian.us wrote: I know there was a desire to remove this TODO item, but I think we have brought up enough new issues that we can keep it to see if we can come up with a solution. Can you summarise what you think the new issues are? All I see is some further rehashing of old discussions. There is already a solution to the problem because the docs are already very clear that you need multiple standbys to achieve commit guarantees AND high availability. RTFM is usually used as some form of put down, but that is what needs to happen here. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [ADMIN] Column missing from pg_statistics
Kadri Raudsepp kadri.rauds...@nordicgaming.com writes: I have set up a cron-job that sends me daily reports on bloat amount in tables and indices, which I calculate using pg_stats, not pgstattuple, for performance and I/O reasons. If the bloat amount or percentage are big enough, I use pg_repack to get rid of it. At some point I noticed, that some tables keep showing up in the reports with the same amount of bloat, which pg_repack was seemingly unable to remove. Investigation showed that pgstattuple gave very different results than my bloat-finding query. Reason - for some tables there are some columns that never show up in pg_statistics. Hmm. Eyeballing the ANALYZE code, I note that it will decide that it hasn't got any valid statistics for a column if (1) it finds no NULL values and (2) every single sampled value in the column is too wide (more than WIDTH_THRESHOLD = 1024 bytes wide). Does this describe your problematic column? It seems like the code is being too conservative here --- it could at least generate valid values for stanullfrac and stawidth. I'm inclined to think maybe it should also set stadistinct = -1 (unique) in this case, since the basic assumption that validates ignoring very wide values is that they aren't duplicates. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fixing pg_basebackup with tablespaces found in $PGDATA
On Thu, Jan 2, 2014 at 03:34:04PM +0100, Bernd Helmle wrote: --On 1. Januar 2014 23:53:46 +0100 Dimitri Fontaine dimi...@2ndquadrant.fr wrote: Hi, As much as I've seen people frown upon $subject, it still happens in the wild, and Magnus seems to agree that the current failure mode of our pg_basebackup tool when confronted to the situation is a bug. So here's a fix, attached. I've seen having tablespaces under PGDATA as a policy within several data centres in the past. The main reasoning behind this seems that they strictly separate platform and database administration and for database inventory reasons. They are regularly surprised if you tell them to not use tablespaces in such a way, since they absorbed this practice over the years from other database systems. So +1 for fixing this. FYI, this setup also causes problems for pg_upgrade. There is a recent thread about that that I will reply to. The problem is that pre-9.2 servers get a mismatch between the symlink and the pg_tablespace path when they rename the old cluster. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Jan10, 2014, at 15:49 , Tom Lane t...@sss.pgh.pa.us wrote: Florian Pflug f...@phlo.org writes: Looking at the code it seems that for quite a few existing aggregates, the state remains NULL until the first non-NULL input is processed. But that doesn't hurt much - those aggregates can remain as they are until someone wants to add an inverse transition function. Once that happens, there's a choice between living with needless rescans on trailing NULL inputs or changing the state type. Also, it might be reasonable for both the regular and the inverse transition functions to be strict. If a null entering the window does not matter, then a null exiting the window doesn't either, no? That's not true, I think, unless we're special-casing strict transition functions somewhere. AFAICS, an aggregate with a strict transition function will produce the state NULL whenever any of the inputs was NULL, i.e. we won't ever transition out of the NULL state once we got there. The inverse transition function, however, would *have* to be able to transition away from the NULL state, which requires it to be non-strict. Otherwise, how would the state ever become non-NULL after the last NULL input leaves the window? best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
Florian Pflug f...@phlo.org writes: On Jan10, 2014, at 15:49 , Tom Lane t...@sss.pgh.pa.us wrote: Also, it might be reasonable for both the regular and the inverse transition functions to be strict. If a null entering the window does not matter, then a null exiting the window doesn't either, no? That's not true, I think, unless we're special-casing strict transition functions somewhere. AFAICS, an aggregate with a strict transition function will produce the state NULL whenever any of the inputs was NULL, i.e. we won't ever transition out of the NULL state once we got there. Nope, not the case; read xaggr.sgml and/or the CREATE AGGREGATE reference page. An aggregate with a strict transition function essentially just ignores null input rows. I suspect the inverse transition function could just be made strict with a similar special-case rule (viz, keep the old transition value when deleting a null input from the window); but maybe I'm missing something and it has to work harder than that anyway. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add CREATE support to event triggers
On Fri, Jan 10, 2014 at 10:55 AM, Simon Riggs si...@2ndquadrant.com wrote: On 10 January 2014 15:48, Robert Haas robertmh...@gmail.com wrote: On Fri, Jan 10, 2014 at 10:36 AM, Simon Riggs si...@2ndquadrant.com wrote: On 8 January 2014 20:42, Alvaro Herrera alvhe...@2ndquadrant.com wrote: CREATE SCHEMA IF NOT EXISTS some schema AUTHORIZATION some guy; Hmm, given in 9.3 it was OK to have only DROP event triggers, I think it should be equally acceptable to have just CREATE, but without every option on CREATE. CREATE SCHEMA is easily the most complex thing here and would be the command/event to deprioritise if we had any issues getting this done/agreeing something for 9.4. I don't know that I agree with that, but I guess we can cross that bridge when we come to it. We've come to it... You would prefer either everything or nothing?? On what grounds? I hardly think I need to justify that position. That's project policy and always has been. When somebody implements 50% of a feature, or worse yet 95% of a feature, it violates the POLA for users and doesn't always subsequently get completed, leaving us with long-term warts that are hard to eliminate. It's perfectly fine to implement a feature incrementally if the pieces are individually self-consistent and ideally even useful, but deciding to support every command except one because the last one is hard to implement doesn't seem like a principled approach to anything. It's not even obvious to me that CREATE SCHEMA is all that much harder than anything else and Alvaro has not said that that's the only thing he can't implement (or why) so I think it's entirely premature to make the decision now about which way to proceed - but, OK, sure, if you want to force the issue now, then yeah, I think it's better to have everything or nothing than to have support for only some things justified by nothing more than implementation complexity. 
Aside from the general issue, in this particular case, I have previously and repeatedly expressed concerns about regression test coverage and suggested a path that would guarantee thorough regression testing but which would require that support be complete for everything present in our regression tests. Although there may be some other plan for guaranteeing thorough regression testing not only now but going forward, I have not seen it proposed here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
Florian Pflug f...@phlo.org wrote: Tom Lane t...@sss.pgh.pa.us wrote: Florian Pflug f...@phlo.org writes: For float4 and float8, wasn't the consensus that the potential lossy-ness of addition makes this impossible anyway, even without the NaN issue? But... Well, that was my opinion, I'm not sure if it was consensus ;-). I'd say your example showing how it could produce completely bogus results was pretty convincing... Aggregates on approximate (floating-point) numbers are not nearly as consistent as many people probably assume. Picture for a minute a table where a column contains positive floating point numbers that happen to be located in the heap in increasing order, perhaps as the result of a CLUSTER on an index on the column. SELECT sum(colname) FROM tablename; would tend to give the most accurate answer possible when a sequence scan was run -- unless there happened to be a seqscan already half-way through the heap. Then the result would be different. So the exact same query against the exact same data, with no intervening modifications or maintenance activity, could give one answer most of the time, and give various other answers depending on concurrent SELECT queries. Given that this is already the case with aggregates on floating point approximate numbers, why should we rule out an optimization which only makes rounding errors more likely to be visible? The real issue here is that if you are using an approximate data type and expecting exact answers, you will have problems. That's not to say that approximations are useless. If you represent the circumference of the earth with a double precision number, you're dealing with an expected rounding error of about a foot. That's close enough for many purposes. The mistake is assuming that it will be exact or that rounding errors cannot accumulate.
In situations where SQL does not promise particular ordering of operations, it should not be assumed; so any expectation of a specific or repeatable result from a sum or average of approximate numbers is misplaced. But NaN is an orthogonal problem I think. Agreed. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Standalone synchronous master
On 01/10/2014 05:09 PM, Simon Riggs wrote: On 10 January 2014 15:47, Bruce Momjian br...@momjian.us wrote: I know there was a desire to remove this TODO item, but I think we have brought up enough new issues that we can keep it to see if we can come up with a solution. Can you summarise what you think the new issues are? All I see is some further rehashing of old discussions. There is already a solution to the problem because the docs are already very clear that you need multiple standbys to achieve commit guarantees AND high availability. RTFM is usually used as some form of put down, but that is what needs to happen here. If we want to get the guarantees that often come up in sync rep discussions - namely that you can assume that your change is applied on the standby when commit returns - then we could implement this by returning the LSN from commit at the protocol level, and by having an option in queries on the standby to wait for this LSN (again, passed on the wire below the level of the query) to be applied. This can be mostly hidden in drivers and would need very little effort from the end user: basically, you tell the driver that one connection is bound as the slave of another, and the driver can manage using the right LSNs. That is, the last LSN received from the master is always attached to queries on the slaves. Cheers -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
Kevin Grittner kgri...@ymail.com writes: Aggregates on approximate (floating-point) numbers are not nearly as consistent as many people probably assume. Picture for a minute a table where a column contains positive floating point numbers happen to be located in the heap in increasing order, perhaps as the result of a CLUSTER on an index on the column. SELECT sum(colname) FROM tablename; would tend to give the most accurate answer possible when a sequence scan was run -- unless there happened to be a seqscan already half-way through the heap. Then the result would be different. I don't think that argument holds any water. In the first place, somebody could turn off synchronize_seqscans if they needed to have the calculation done the same way every time (and I recall questions from users who ended up doing exactly that, shortly after we introduced synchronize_seqscans). In the second place, for most use-cases it'd be pretty foolish to rely on physical heap order, so somebody who was really trying to sum float8s accurately would likely do select sum(x order by x) from ... This is a well-defined, numerically stable calculation, and I don't want to see us put in non-defeatable optimizations that break it. The real issue here is that if you are using an approximate data type and expecting exact answers, you will have problems. That's a canard. People who know what they're doing (admittedly a minority) do not expect exact answers, but they do expect to be able to specify how to do the calculation in a way that minimizes roundoff errors. The inverse-transition-function approach breaks that, and it does so at a level where the user can't work around it, short of building his own aggregates. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] new json funcs
Is it just me, or is the json_array_element(json, int) function not documented? (Not a bug in this patch, I think ...) -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add CREATE support to event triggers
On 10 January 2014 17:07, Robert Haas robertmh...@gmail.com wrote: On Fri, Jan 10, 2014 at 10:55 AM, Simon Riggs si...@2ndquadrant.com wrote: On 10 January 2014 15:48, Robert Haas robertmh...@gmail.com wrote: On Fri, Jan 10, 2014 at 10:36 AM, Simon Riggs si...@2ndquadrant.com wrote: On 8 January 2014 20:42, Alvaro Herrera alvhe...@2ndquadrant.com wrote: CREATE SCHEMA IF NOT EXISTS some schema AUTHORIZATION some guy; Hmm, given in 9.3 it was OK to have only DROP event triggers, I think it should be equally acceptable to have just CREATE, but without every option on CREATE. CREATE SCHEMA is easily the most complex thing here and would be the command/event to deprioritise if we had any issues getting this done/agreeing something for 9.4. I don't know that I agree with that, but I guess we can cross that bridge when we come to it. We've come to it... You would prefer either everything or nothing?? On what grounds? I hardly think I need to justify that position. Yeh, you do. Everybody does. That's project policy and always has been. When somebody implements 50% of a feature, or worse yet 95% of a feature, it violates the POLA for users and doesn't always subsequently get completed, leaving us with long-term warts that are hard to eliminate. So why was project policy violated when we released 9.3 with only DROP event support? Surely that was a worse violation of POLA than my suggestion? It's not reasonable to do something yourself and then object when others suggest doing the same thing. After 3 years we need something useful. I think the perfect being the enemy of the good argument applies here after this length of time. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] new json funcs
On 01/10/2014 12:42 PM, Alvaro Herrera wrote: Is it just me, or is the json_array_element(json, int) function not documented? (Not a bug in this patch, I think ...) As discussed at the time, we didn't document the functions underlying the json operators, just the operators themselves. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Jan10, 2014, at 18:14 , Kevin Grittner kgri...@ymail.com wrote: Given that this is already the case with aggregates on floating point approximate numbers, why should we rule out an optimization which only makes rounding errors more likely to be visible? The real issue here is that if you are using an approximate data type and expecting exact answers, you will have problems. Because without the optimization, only the values which you *actually* process for a given result determine whether you lose precision or not. With the optimization, OTOH, values which have *nothing* to do with the result in question can nevertheless make it completely bogus. SUM() is a good example. As long as all your values are positive, the amount of precision you lose is bounded by the number of input values. If I sum over 10 values, the worst that can happen is that the first value is large enough to prevent the other 9 values from influencing the result. That limits the relative error to something like 9*epsilon, where epsilon is the relative precision of the floating point type, i.e. 1e-15 or so for double. In other words, as long as your frames are less than 1e13 rows long, the relative error will stay below 1%. But with the optimization, that is no longer true. If you sum from, say, CURRENT ROW to UNBOUNDED FOLLOWING, the relative error of the result in one row now depends on the magnitude of values *preceding* that row, even though those values aren't in the frame. And since we now internally subtract, not only add, the relative error is no longer bounded by the number of rows in the frame.
Here's the corresponding SELECT (which is basically the same as Tom's example upthread):

select n, x::float,
       sum(x::float) over (order by n
                           rows between current row and unbounded following)
from (values (1, 1e20), (2, 1), (3, 2)) as t(n, x)
order by n;

Currently that returns

 n |   x   |  sum
---+-------+-------
 1 | 1e+20 | 1e+20
 2 |     1 |     3
 3 |     2 |     2

but with an inverse transition function, it may very well return

 n |   x   |  sum
---+-------+-------
 1 | 1e+20 | 1e+20
 2 |     1 |     0
 3 |     2 |    -1

That's not to say that approximations are useless. If you represent the circumference of the earth with a double precision number you're dealing with an expected rounding error of about a foot. That's close enough for many purposes. The mistake is assuming that it will be exact or that rounding errors cannot accumulate. In situations where SQL does not promise particular ordering of operations, it should not be assumed; so any expectation of a specific or repeatable result from a sum or average of approximate numbers is misplaced. But this isn't about ordering, it's replacing one computation with a completely different one that just happens to be equivalent *algebraically*. To me, the proposed optimization for float is akin to a C compiler which decided to evaluate a + b + c + … + z as -a + (2a - b) + (2b - c) + … + (2y - z) + 2z. Algebraically, these are the same, but it'd still be insane to do that. best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
I wrote: Kevin Grittner kgri...@ymail.com writes: The real issue here is that if you are using an approximate data type and expecting exact answers, you will have problems. That's a canard. People who know what they're doing (admittedly a minority) do not expect exact answers, but they do expect to be able to specify how to do the calculation in a way that minimizes roundoff errors. The inverse-transition-function approach breaks that, and it does so at a level where the user can't work around it, short of building his own aggregates. Although, having said that ... maybe build your own aggregate would be a reasonable suggestion for people who need this? I grant that it's going to be a minority requirement, maybe even a small minority requirement. People who have the chops to get this sort of thing right can probably manage a custom aggregate definition. The constraint this would pose on the float4 and float8 implementations is that it be possible to use their transition and final functions in a custom aggregate declaration while leaving off the inverse function; or, if that combination doesn't work for some reason, we have to continue to provide the previous transition/final functions for use in user aggregates. Suitable documentation would be needed too, of course. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add CREATE support to event triggers
On Fri, Jan 10, 2014 at 12:59 PM, Simon Riggs si...@2ndquadrant.com wrote: That's project policy and always has been. When somebody implements 50% of a feature, or worse yet 95% of a feature, it violates the POLA for users and doesn't always subsequently get completed, leaving us with long-term warts that are hard to eliminate. So why was project policy violated when we released 9.3 with only DROP event support? Surely that was a worse violation of POLA than my suggestion? Well, obviously I didn't think so at the time, or I would have objected. I felt, and still feel, that implementing one kind of event trigger (drop) does not necessarily require implementing another kind (create). I think that's clearly different from implementing either one for only some object types. "This event trigger will be called whenever an object is dropped" is a reasonable contract with the user. "This other event trigger will be called whenever an object is created, unless it happens to be a schema" is much less reasonable. At least in my opinion.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL
On Thu, Jan 9, 2014 at 12:46 PM, Claudio Freire klaussfre...@gmail.com wrote: On Thu, Jan 9, 2014 at 2:22 PM, Robert Haas robertmh...@gmail.com wrote: It would be nice to have better operating system support for this. For example, IIUC, 64-bit Linux has 128TB of address space available for user processes. When you clone(), it can either share the entire address space (i.e. it's a thread) or none of it (i.e. it's a process). There's no option to, say, share 64TB and not the other 64TB, which would be ideal for us. We could then map dynamic shared memory segments into the shared portion of the address space and do backend-private allocations in the unshared part. Of course, even if we had that, it wouldn't be portable, so who knows how much good it would do. But it would be awfully nice to have the option. You can map a segment at fork time, and unmap it after forking. That doesn't really use RAM, since it's supposed to be lazily allocated (it can be forced to be so, I believe, with PROT_NONE and MAP_NORESERVE, but I don't think that's portable). That guarantees it's free. It guarantees that it is free as of the moment you unmap it, but it doesn't guarantee that future memory allocations or shared library loads couldn't stomp on the space. Also, that not-portable thing is a bit of a problem. I've got no problem with the idea that third-party code may be platform-specific, but I think the stuff we ship in core has got to work on more or less all reasonably modern systems. Next, you can map shared memory at explicit addresses (linux's mmap has support for that, and I seem to recall Windows did too). All you have to do, is some book-keeping in shared memory (so all processes can coordinate new mappings). I did something like this back in 1998 or 1999 at the operating system level, and it turned out not to work very well. 
I was working on an experimental research operating system kernel, and we wanted to add support for mmap(), so we set aside a portion of the virtual address space for file mappings. That region was shared across all processes in the system. One problem is that there's no guarantee the space is big enough for whatever you want to map; and the other problem is that it can easily get fragmented. Now, 64-bit address spaces go some way to ameliorating these concerns, so maybe it can be made to work, but I would be a teeny bit cautious about using the word "just" to describe the complexity involved.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL
On Thu, Jan 9, 2014 at 2:09 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Thu, Jan 9, 2014 at 12:21 AM, Robert Haas robertmh...@gmail.com wrote: On Tue, Jan 7, 2014 at 10:20 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Tue, Jan 7, 2014 at 2:46 AM, Robert Haas robertmh...@gmail.com wrote: Well, right now we just reopen the same object from all of the processes, which seems to work fine and doesn't require any of this complexity. The only problem I don't know how to solve is how to make a segment stick around for the whole postmaster lifetime. If duplicating the handle into the postmaster without its knowledge gets us there, it may be worth considering, but that doesn't seem like a good reason to rework the rest of the existing mechanism. I think one has to try this to see if it works as per the need. If it's not urgent, I can try this early next week? Anything we want to get into 9.4 has to be submitted by next Tuesday, but I don't know that we're going to get this into 9.4. Using DuplicateHandle(), we can make a segment stick around for the postmaster's lifetime. I have used the test below (with the dsm_demo module) to verify:

Session 1:
  select dsm_demo_create('this message is from session-1');
   dsm_demo_create
  -----------------
   82712

Session 2:
  select dsm_demo_read(82712);
   dsm_demo_read
  --------------------------------
   this message is from session-1
  (1 row)

Session 1:
  \q    -- up to this point it works without DuplicateHandle as well

Session 2:
  select dsm_demo_read(82712);
   dsm_demo_read
  --------------------------------
   this message is from session-1
  (1 row)
  \q

Session 3:
  select dsm_demo_read(82712);
   dsm_demo_read
  --------------------------------
   this message is from session-1
  (1 row)

The above shows that the handle stays around. Note - currently I have to bypass the code below in dsm_attach(), as it assumes the segment will not stay around once it's removed from the control file.

/*
 * If we didn't find the handle we're looking for in the control
 * segment, it probably means that everyone else who had it mapped,
 * including the original creator, died before we got to this point.
 * It's up to the caller to decide what to do about that.
 */
if (seg->control_slot == INVALID_CONTROL_SLOT)
{
    dsm_detach(seg);
    return NULL;
}

Could you let me know what exactly you are expecting in the patch: just a call to DuplicateHandle() after CreateFileMapping(), or something else as well? Well, I guess what I was thinking is that we could have a call dsm_keep_segment() which would be invoked on an already-created dsm_segment *. On Linux, that would just bump the reference count in the control segment up by one so that it doesn't get destroyed until postmaster shutdown. On Windows it may as well still do that for consistency, but it will also need to do this DuplicateHandle() trick.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] new json funcs
Andrew Dunstan and...@dunslane.net writes: On 01/10/2014 12:42 PM, Alvaro Herrera wrote: Is it just me, or is the json_array_element(json, int) function not documented? As discussed at the time, we didn't document the functions underlying the json operators, just the operators themselves. I see though that json_array_element has a DESCR comment. I believe project policy is that if a function is not meant to be invoked by name but only through an operator, its pg_description entry should just be "implementation of xyz operator", with the real comment attached only to the operator. Otherwise \df users are likely to be misled into using the function when they're not really supposed to; and at the very least they will bitch about its lack of documentation. See commits 94133a935414407920a47d06a6e22734c974c3b8 and 908ab80286401bb20a519fa7dc7a837631f20369.

regards, tom lane
Re: [HACKERS] nested hstore patch
On Thu, Jan 9, 2014 at 5:08 PM, Andrew Dunstan and...@dunslane.net wrote: * I have replicated all the json processing functions for jsonb (although not the json generating functions, such as to_json). Most of these currently work by turning the jsonb back into json and then processing as before. I am sorting out some technical issues and hope to have all of these rewritten to use the native jsonb API in a few days time. * We still need to document jsonb. That too I hope will be done quite shortly. * The jsonb regression test currently contains U+ABCD - I guess we'd better use some hex encoding or whatever for that - unlike json, the jsonb de-serializer dissolves unicode escapes. How does that work if the server encoding isn't UTF-8? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL
On Fri, Jan 10, 2014 at 3:23 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Jan 9, 2014 at 12:46 PM, Claudio Freire klaussfre...@gmail.com wrote: On Thu, Jan 9, 2014 at 2:22 PM, Robert Haas robertmh...@gmail.com wrote: It would be nice to have better operating system support for this. For example, IIUC, 64-bit Linux has 128TB of address space available for user processes. When you clone(), it can either share the entire address space (i.e. it's a thread) or none of it (i.e. it's a process). There's no option to, say, share 64TB and not the other 64TB, which would be ideal for us. We could then map dynamic shared memory segments into the shared portion of the address space and do backend-private allocations in the unshared part. Of course, even if we had that, it wouldn't be portable, so who knows how much good it would do. But it would be awfully nice to have the option. You can map a segment at fork time, and unmap it after forking. That doesn't really use RAM, since it's supposed to be lazily allocated (it can be forced to be so, I believe, with PROT_NONE and MAP_NORESERVE, but I don't think that's portable). That guarantees it's free. It guarantees that it is free as of the moment you unmap it, but it doesn't guarantee that future memory allocations or shared library loads couldn't stomp on the space. You would only unmap prior to remapping, only the to-be-mapped portion, so I don't see a problem. Also, that not-portable thing is a bit of a problem. I've got no problem with the idea that third-party code may be platform-specific, but I think the stuff we ship in core has got to work on more or less all reasonably modern systems. Next, you can map shared memory at explicit addresses (linux's mmap has support for that, and I seem to recall Windows did too). All you have to do, is some book-keeping in shared memory (so all processes can coordinate new mappings). 
I did something like this back in 1998 or 1999 at the operating system level, and it turned out not to work very well. I was working on an experimental research operating system kernel, and we wanted to add support for mmap(), so we set aside a portion of the virtual address space for file mappings. That region was shared across all processes in the system. One problem is that there's no guarantee the space is big enough for whatever you want to map; and the other problem is that it can easily get fragmented. Now, 64-bit address spaces go some way to ameliorating these concerns, so maybe it can be made to work, but I would be a teeny bit cautious about using the word "just" to describe the complexity involved. Ok, yes, fragmentation could be an issue if the address range is not humongous enough.
Re: [HACKERS] nested hstore patch
On 01/10/2014 01:29 PM, Robert Haas wrote: On Thu, Jan 9, 2014 at 5:08 PM, Andrew Dunstan and...@dunslane.net wrote: * The jsonb regression test currently contains U+ABCD - I guess we'd better use some hex encoding or whatever for that - unlike json, the jsonb de-serializer dissolves unicode escapes. How does that work if the server encoding isn't UTF-8? There is a jsonb_1.out file for the non-utf8 case, just as there is a json_1.out for the same case. Unicode escapes for non-ascii characters are forbidden in jsonb as they are in json, if the encoding isn't utf8. FYI, we are actually using the json lexing and parsing mechanism, so that these types will accept exactly the same inputs. However, since we're not storing json text in jsonb, but instead the decomposed elements, the unicode escapes are resolved in the stored values. I already have a fix for the point above (see https://github.com/feodor/postgres/commit/7d5b8f12747b4a75e8b32914340d07617f1af302) and it will be included in the next version of the patch. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On Fri, Jan 10, 2014 at 7:12 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: I'm getting deadlocks with this patch, using the test script you posted earlier in http://www.postgresql.org/message-id/CAM3SWZQh=8xnvgbbzyhjexujbhwznjutjez9t-dbo9t_mx_...@mail.gmail.com. Am I doing something wrong, or is that a regression? Yes. The point of that test case was that it made your V1 livelock (which you fixed), not deadlock in a way detected by the deadlock detector, which is the correct behavior. This test case was the one that showed up *unprincipled* deadlocking: http://www.postgresql.org/message-id/cam3swzshbe29kpod44cvc3vpzjgmder6k_6fghiszeozgmt...@mail.gmail.com I'd focus on that test case.

-- Peter Geoghegan
Re: [HACKERS] new json funcs
On 01/10/2014 01:27 PM, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: On 01/10/2014 12:42 PM, Alvaro Herrera wrote: Is it just me, or is the json_array_element(json, int) function not documented? As discussed at the time, we didn't document the functions underlying the json operators, just the operators themselves. I see though that json_array_element has a DESCR comment. I believe project policy is that if a function is not meant to be invoked by name but only through an operator, its pg_description entry should just be implementation of xyz operator, with the real comment attached only to the operator. Otherwise \df users are likely to be misled into using the function when they're not really supposed to; and at the very least they will bitch about its lack of documentation. See commits 94133a935414407920a47d06a6e22734c974c3b8 and 908ab80286401bb20a519fa7dc7a837631f20369. OK, I can fix that I guess. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL
On Fri, Jan 10, 2014 at 1:35 PM, Claudio Freire klaussfre...@gmail.com wrote: You can map a segment at fork time, and unmap it after forking. That doesn't really use RAM, since it's supposed to be lazily allocated (it can be forced to be so, I believe, with PROT_NONE and MAP_NORESERVE, but I don't think that's portable). That guarantees it's free. It guarantees that it is free as of the moment you unmap it, but it doesn't guarantee that future memory allocations or shared library loads couldn't stomp on the space. You would only unmap prior to remapping, only the to-be-mapped portion, so I don't see a problem. OK, yeah, that way works. That's more or less what Noah proposed before. But I was skeptical it would work well everywhere. I suppose we won't know until somebody tries it. (I didn't.) Ok, yes, fragmentation could be an issue if the address range is not humongus enough. I've often thought that 64-bit machines are so capable that there's no reason to go any higher. But lately I've started to wonder. There are already machines out there with 2^40 bytes of physical memory, and the number just keeps creeping up. When you reserve a couple of bits to indicate user or kernel space, and then consider that virtual address space can be many times larger than physical memory, it starts not to seem like that much. But I'm not that excited about the amount of additional memory we'll eat when somebody decides to make a pointer 16 bytes. Ugh. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Jan10, 2014, at 19:08 , Tom Lane t...@sss.pgh.pa.us wrote: I wrote: Kevin Grittner kgri...@ymail.com writes: The real issue here is that if you are using an approximate data type and expecting exact answers, you will have problems. That's a canard. People who know what they're doing (admittedly a minority) do not expect exact answers, but they do expect to be able to specify how to do the calculation in a way that minimizes roundoff errors. The inverse-transition-function approach breaks that, and it does so at a level where the user can't work around it, short of building his own aggregates. Although, having said that ... maybe build your own aggregate would be a reasonable suggestion for people who need this? I grant that it's going to be a minority requirement, maybe even a small minority requirement. People who have the chops to get this sort of thing right can probably manage a custom aggregate definition. So we'd put a footgun into the hands of people who don't know what they're doing, to be fired for performance's sake, and leave it to the people who know what they are doing to put the safety on? best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] new json funcs
Andrew Dunstan and...@dunslane.net writes: On 01/10/2014 01:27 PM, Tom Lane wrote: See commits 94133a935414407920a47d06a6e22734c974c3b8 and 908ab80286401bb20a519fa7dc7a837631f20369. OK, I can fix that I guess. Sure, just remove the DESCR comments for the functions that aren't meant to be used directly. I don't think this is back-patchable, but it's a minor point, so at least for me a fix in HEAD is sufficient. I wonder whether we should add an opr_sanity test verifying that operator implementation functions don't have their own comments? The trouble is that there are a few that are supposed to, but maybe that list is stable enough that it'd be okay to memorialize in the expected output. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Standalone synchronous master
On 01/10/2014 07:47 AM, Bruce Momjian wrote: I know there was a desire to remove this TODO item, but I think we have brought up enough new issues that we can keep it to see if we can come up with a solution. I have added a link to this discussion on the TODO item. I think we will need at least four new GUC variables:

* timeout control for degraded mode
* command to run during switch to degraded mode
* command to run during switch from degraded mode
* read-only variable to report degraded mode

I know I am the one that instigated all of this, so I want to be very clear on what I, and I am confident my customers, would expect. If a synchronous slave goes down, the master continues to operate. That is all. I don't care if it is configurable (I would be fine with that). I don't care if it is not automatic (e.g., the slave goes down and we have to tell the master to continue). I have read through this thread more than once, and I have also gone back to the docs. I understand why we do it the way we do it. I also understand that, for the business requirements of 99% of CMD's customers, it's wrong. At least in the sense of providing continuity of service.

Sincerely,

JD

-- Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc "In a time of universal deceit - telling the truth is a revolutionary act." - George Orwell
Re: [HACKERS] new json funcs
Tom Lane wrote: I wonder whether we should add an opr_sanity test verifying that operator implementation functions don't have their own comments? The trouble is that there are a few that are supposed to, but maybe that list is stable enough that it'd be okay to memorialize in the expected output. +1. It's an easy rule to overlook.

-- Álvaro Herrera  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] new json funcs
Andrew Dunstan wrote: On 01/10/2014 12:42 PM, Alvaro Herrera wrote: Is it just me, or is the json_array_element(json, int) function not documented? As discussed at the time, we didn't document the functions underlying the json operators, just the operators themselves. Oh, I see. That's fine with me. From the source code it's hard to see when a SQL-callable function is only there to implement an operator, though (and it seems a bit far-fetched to suppose that the developer will think, upon seeing an undocumented function, "oh, this must implement some operator, I will look it up in pg_proc.h"). I think the operator(s) should be mentioned in the comment on top of the function.

-- Álvaro Herrera  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL
Robert Haas robertmh...@gmail.com writes: I've often thought that 64-bit machines are so capable that there's no reason to go any higher. But lately I've started to wonder. There are already machines out there with 2^40 bytes of physical memory, and the number just keeps creeping up. When you reserve a couple of bits to indicate user or kernel space, and then consider that virtual address space can be many times larger than physical memory, it starts not to seem like that much. But I'm not that excited about the amount of additional memory we'll eat when somebody decides to make a pointer 16 bytes. Ugh. Once you really need that, you're not going to care about doubling the size of pointers. At worst, you're giving up 1 bit of address space to gain 64 more. (Still, I rather doubt it'll happen in my lifetime.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
Florian Pflug f...@phlo.org writes: On Jan 10, 2014, at 19:08, Tom Lane t...@sss.pgh.pa.us wrote: Although, having said that ... maybe "build your own aggregate" would be a reasonable suggestion for people who need this? I grant that it's going to be a minority requirement, maybe even a small minority requirement. People who have the chops to get this sort of thing right can probably manage a custom aggregate definition. So we'd put a footgun into the hands of people who don't know what they're doing, to be fired for performance's sake, and leave it to the people who know what they are doing to put the safety on? If I may put words in Kevin's mouth, I think his point is that having float8 sum() at all is a foot-gun, and that's hard to deny. You need to know how to use it safely. A compromise might be to provide these alternative safer aggregates as built-ins. Or, depending on what color you like your bikeshed, leave the standard aggregates alone and define fast_sum etc. for the less safe versions. In any case it'd be incumbent on us to document the tradeoffs.

regards, tom lane
Re: [HACKERS] new json funcs
On 01/10/2014 01:58 PM, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: On 01/10/2014 01:27 PM, Tom Lane wrote: See commits 94133a935414407920a47d06a6e22734c974c3b8 and 908ab80286401bb20a519fa7dc7a837631f20369. OK, I can fix that I guess. Sure, just remove the DESCR comments for the functions that aren't meant to be used directly. I don't think this is back-patchable, but it's a minor point, so at least for me a fix in HEAD is sufficient. I wonder whether we should add an opr_sanity test verifying that operator implementation functions don't have their own comments? The trouble is that there are a few that are supposed to, but maybe that list is stable enough that it'd be okay to memorialize in the expected output. Well, that would be OK as long as there was a comment in the file so that developers don't just think it's OK to extend the list (it's a bit like the reason we don't allow shift/reduce conflicts - if we allowed them, people would just keep adding more, and they wouldn't stick out like a sore thumb). The comment in the current test says:

-- Check that operators' underlying functions have suitable comments,
-- namely 'implementation of XXX operator'. In some cases involving legacy
-- names for operators, there are multiple operators referencing the same
-- pg_proc entry, so ignore operators whose comments say they are deprecated.
-- We also have a few functions that are both operator support and meant to
-- be called directly; those should have comments matching their operator.

The history here is that originally I was intending to have these functions documented, and so the descriptions were made to match the operator descriptions, so that we didn't get a failure on this test. Later we decided not to document them as part of last release's bike-shedding, but the function descriptions didn't get changed or removed.
cheers andrew
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On 01/10/2014 08:37 PM, Peter Geoghegan wrote: On Fri, Jan 10, 2014 at 7:12 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: I'm getting deadlocks with this patch, using the test script you posted earlier in http://www.postgresql.org/message-id/CAM3SWZQh=8xnvgbbzyhjexujbhwznjutjez9t-dbo9t_mx_...@mail.gmail.com. Am I doing something wrong, or is that a regression? Yes. The point of that test case was that it made your V1 livelock (which you fixed), not deadlock in a way detected by the deadlock detector, which is the correct behavior. Oh, ok. Interesting. With the patch version I posted today, I'm not getting deadlocks. I'm not getting duplicates in the table either, so it looks like the promise tuple approach somehow avoids the deadlocks, while the btreelock patch does not. Why does it deadlock with the btreelock patch? I don't see why it should. If you have two backends inserting a single tuple, and they conflict, one of them should succeed in inserting, and the other one should update.

- Heikki
Re: [HACKERS] Time to do our Triage for 9.4
All, To make this easier for everyone to participate in, I've created a wiki page: https://wiki.postgresql.org/wiki/9.4CF4Triage Please add the patches you know well to the appropriate list, thanks!

-- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] new json funcs
Alvaro Herrera alvhe...@2ndquadrant.com writes: Oh, I see. That's fine with me. From the source code it's hard to see when a SQL-callable function is only there to implement an operator, though (and it seems a bit far-fetched to suppose that the developer will think, upon seeing an undocumented function, oh this must implement some operator, I will look it up at pg_proc.h). I think the operator(s) should be mentioned in the comment on top of the function. Oh, you're complaining about the lack of any header comment for the function in the source code. That's a different matter from the user-visible docs, but I agree that it's poor practice to not have anything. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] new json funcs
Andrew Dunstan and...@dunslane.net writes: The history here is that originally I was intending to have these functions documented, and so the descriptions were made to match the operator descriptions, so that we didn't get a failure on this test. Later we decided not to document them as part of last release's bike-shedding, but the function descriptions didn't get changed / removed. Ah. I suppose there's no way to cross-check the state of the function's pg_description comment against whether it has SGML documentation :-( regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] new json funcs
On Fri, Jan 10, 2014 at 02:39:12PM -0500, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: The history here is that originally I was intending to have these functions documented, and so the descriptions were made to match the operator descriptions, so that we didn't get a failure on this test. Later we decided not to document them as part of last release's bike-shedding, but the function descriptions didn't get changed / removed. Ah. I suppose there's no way to cross-check the state of the function's pg_description comment against whether it has SGML documentation :-( FDWs to the rescue! Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On Fri, Jan 10, 2014 at 11:28 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Why does it deadlock with the btreelock patch? I don't see why it should. If you have two backends inserting a single tuple, and they conflict, one of them should succeed to insert, and the other one should update. Are you sure that it doesn't make your patch deadlock too, with enough pressure? I've made that mistake myself. That test-case made my patch deadlock (in a detected fashion) when it used buffer locks as a value locking prototype - I say as much right there in the November mail you linked to. I think that's acceptable, because it's non-sensible use of the feature (my point was only that it shouldn't livelock). The test case is naively locking a row without knowing ahead of time (or pro-actively checking) if the conflict is on the first or second unique index. So before too long, you're updating the wrong row (no existing lock is really held), based on the 'a' column's projected value, when in actuality the conflict was on the 'b' column's projected value. Conditions are right for deadlock, because two rows are locked, not one. Although I have not yet properly considered your most recent revision, I can't imagine why the same would not apply there, since the row locking component is (probably) still identical. Granted, that distinction between row locking and value locking is a bit fuzzy in your approach, but if you happened to not insert any rows in any previous iterations (i.e. there were no unfilled promise tuples), and you happened to perform conflict handling first, it could still happen, albeit with lower probability, no? -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
FYI, I'm using the verb rewind to talk about using the negative transition aggregation function to get a prior value. I don't know if this is the right verb.

Conceptually, when aggregating over floating point numbers, there is some infinitely precise theoretical value, and the computation is approximating it. Call the infinitely precise value 'r'. Call the computed value 'c', which is the result of the aggregation function. (For float4_agg and float8_agg, typeof(c) == float8.)

The problem you have with rewinding an aggregate is that you don't know if you are getting the same value of c that you would have gotten from a rescan. But if you have a type that tracks a margin [min,max] where typeof(min) == typeof(max) is higher precision than typeof(c), then you can track:

    min <= r <= max

by setting the rounding mode down, then up, when computing the next value of min and max, respectively. (Extra flag bits or booleans could track whether you've encountered +inf, -inf, NaN, and any other oddball cases, with corresponding special logic that has been discussed already upthread.)

In many but not all cases:

    min != max, but (typeof(c))min == (typeof(c))max

because the margin of error is small enough not to render different values when cast to the lower precision typeof(c). You could rewind the aggregation whenever this second case holds, and only force a rescan when it does not. This would render the same results for queries whether they were performed with rewinds or with rescans. The results might differ from older versions of postgres, but only in that they might be more accurate, with less accumulated rounding errors, owing to the higher precision state transition variable.

For many modern platforms, typeof(min) could be __float128 using libquadmath, or something similar to that. If not available at compile time, it could be float64 instead.
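The margin-tracking scheme above can be sketched in Python (this is my illustration, not Mark's code): decimal contexts with directed rounding stand in for the FPU rounding modes, and a 30-digit precision plays the role of the higher-precision (__float128) state, while float plays the role of typeof(c).

```python
from decimal import Decimal, Context, ROUND_FLOOR, ROUND_CEILING

# Directed-rounding contexts standing in for "set the rounding mode down,
# then up"; 30 digits is the higher-precision transition state.
DOWN = Context(prec=30, rounding=ROUND_FLOOR)
UP = Context(prec=30, rounding=ROUND_CEILING)

class MarginState:
    """Hypothetical transition state: [lo, hi] always brackets the exact sum r."""
    def __init__(self):
        self.lo = Decimal(0)   # min: guaranteed <= r
        self.hi = Decimal(0)   # max: guaranteed >= r

    def add(self, x):          # forward transition
        d = Decimal(x)
        self.lo = DOWN.add(self.lo, d)
        self.hi = UP.add(self.hi, d)

    def discard(self, x):      # inverse transition, i.e. the "rewind"
        d = Decimal(x)
        self.lo = DOWN.subtract(self.lo, d)
        self.hi = UP.subtract(self.hi, d)

    def result_or_none(self):
        # The rewound value is trustworthy only when both bounds render the
        # same value at the result's precision; otherwise force a rescan.
        lo_c, hi_c = float(self.lo), float(self.hi)
        return lo_c if lo_c == hi_c else None
```

A caller would use the aggregate result only when `result_or_none()` is not None, falling back to recomputing the frame otherwise.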
Even then, you'd still know that rewinding was possible when min == max and not otherwise, which is useful for cases of aggregation over exact values. I admit I've done a bit of handwaving on the computation of the margin and the handling of floating-point rounding issues, but I believe the implementation details are tractable.

mark

On Friday, January 10, 2014 10:10 AM, Florian Pflug f...@phlo.org wrote: On Jan 10, 2014, at 18:14, Kevin Grittner kgri...@ymail.com wrote: Given that this is already the case with aggregates on floating point approximate numbers, why should we rule out an optimization which only makes rounding errors more likely to be visible? The real issue here is that if you are using an approximate data type and expecting exact answers, you will have problems.

Because without the optimization, only the values which you *actually* process for a given result determine whether you lose precision or not. With the optimization, OTOH, values which have *nothing* to do with the result in question can nevertheless make it completely bogus.

SUM() is a good example. As long as all your values are positive, the amount of precision you lose is bound by the number of input values. If I sum over 10 values, the worst that can happen is that the first value is large enough to prevent the other 9 values from influencing the result. That limits the relative error to something like 9*epsilon, where epsilon is the relative precision of the floating point type, i.e. 1e-15 or so for double. In other words, as long as your frames are less than 10^13 rows long, the relative error will stay below 1%.

But with the optimization, that is no longer true. If you sum from, say, CURRENT ROW to UNBOUNDED FOLLOWING, the relative error of the result in one row now depends on the magnitude of values *preceding* that row, even though that value isn't in the frame.
And since we now internally subtract, not only add, the relative error is no longer bound by the number of rows in the frame. Here's the corresponding SELECT (which is basically the same as Tom's example upthread):

    select n, x::float, sum(x::float) over (
      order by n rows between current row and unbounded following
    )
    from (values (1, 1e20), (2, 1), (3, 2)) as t(n, x)
    order by n;

Currently that returns

     n |   x   |  sum
    ---+-------+-------
     1 | 1e+20 | 1e+20
     2 |     1 |     3
     3 |     2 |     2

but with an inverse transfer function, it may very well return

     n |   x   |  sum
    ---+-------+-------
     1 | 1e+20 | 1e+20
     2 |     1 |     0
     3 |     2 |    -1

That's not to say that approximations are useless. If you represent the circumference of the earth with a double precision number you're dealing with an expected rounding error of about a foot. That's close enough for many purposes. The mistake is assuming that it will be exact or that rounding errors cannot accumulate. In situations where SQL does not promise particular ordering of operations, it should not be assumed; so any expectations of a specific
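The effect is easy to reproduce outside SQL. This Python sketch (my illustration of the example above, not part of the patch) computes the same running window both ways: once by starting from the total and "discarding" rows with an inverse transition, and once by rescanning each frame from scratch:

```python
# Window: sum(x) OVER (ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
rows = [1e20, 1.0, 2.0]

# Inverse-transition style: start with the grand total, then subtract each
# row as it leaves the frame.  1.0 and 2.0 are absorbed into 1e20 up front.
state = sum(rows)                # 1e20 + 1 + 2 == 1e20 in double precision
inverse = []
for x in rows:
    inverse.append(state)
    state -= x                   # inverse transition: forget row x

# Rescan style: recompute each frame independently.
rescan = [sum(rows[i:]) for i in range(len(rows))]

print(inverse)   # [1e+20, 0.0, -1.0] -- bogus, poisoned by the discarded 1e20
print(rescan)    # [1e+20, 3.0, 2.0]
```

The second and third frames never contain 1e20, yet the inverse-transition results for them are dominated by its rounding error, which is exactly Florian's point.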
Re: [HACKERS] [COMMITTERS] pgsql: Upgrade to Autoconf 2.69
On Tue, Jan 7, 2014 at 08:19:49AM -0500, Peter Eisentraut wrote: That was probably me. I'll look into it. On Jan 6, 2014, at 11:40 PM, Tom Lane t...@sss.pgh.pa.us wrote: Bruce Momjian br...@momjian.us writes: On Sun, Dec 29, 2013 at 02:48:21AM -0500, Tom Lane wrote: 3. pg_upgrade ignores the fact that pg_resetxlog failed, and keeps going. Does pg_resetxlog return a non-zero exit status? If so, pg_upgrade should have caught that and exited. It certainly does:

    if (errno)
    {
        fprintf(stderr, _("%s: could not read from directory \"%s\": %s\n"),
                progname, XLOGDIR, strerror(errno));
        exit(1);
    }

The bug is that pg_upgrade appears to assume (in many places, not just this one) that exec_prog() will abort if the called program fails, but *it doesn't*, contrary to the claim in its own header comment. This is because pg_log(FATAL, ...) doesn't call exit(). pg_fatal() does, but that's not what's being called in the throw_error case. I imagine that this used to work correctly and got broken by some ill-advised refactoring, but whatever the origin, it's 100% broken today.

I know Peter is looking at this, but I looked at it and I can't see the problem. Every call of exec_prog() that uses pg_resetxlog has throw_error = true, and the test there is:

    result = system(cmd);
    if (result != 0)
        ...
        pg_log(FATAL, ...)

and in pg_log_v() I see:

    switch (type)
    ...
    case PG_FATAL:
        printf("\n%s", _(message));
        printf("Failure, exiting\n");
        exit(1);
        break;

so I am not clear how you are seeing the return status of pg_resetxlog ignored. I tried the attached patch, which causes pg_resetxlog -f to return -1, and got the proper error from pg_upgrade in git head:

    Performing Upgrade
    ------------------
    Analyzing all rows in the new cluster               ok
    Freezing all rows on the new cluster                ok
    Deleting files from new pg_clog                     ok
    Copying old pg_clog to new server                   ok
    Setting next transaction ID for new cluster         *failure*

    Consult the last few lines of pg_upgrade_utility.log for the probable cause of the failure.
    Failure, exiting

and the last line in pg_upgrade_utility.log is:

    command: /u/pgsql/bin/pg_resetxlog -f -x 683 /u/pgsql/data >> pg_upgrade_utility.log 2>&1

-- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +

    diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
    new file mode 100644
    index 03f2fad..3e67630
    *** a/src/bin/pg_resetxlog/pg_resetxlog.c
    --- b/src/bin/pg_resetxlog/pg_resetxlog.c
    *** main(int argc, char *argv[])
    *** 121,126 ****
    --- 121,127 ----
      {
                case 'f':
                    force = true;
    +               exit(1);
                    break;

                case 'n':
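For readers following along, the contract exec_prog() is supposed to honor — run a command and, when throw_error is set, abort the whole program on a non-zero exit status — can be sketched in Python (a hypothetical illustration of the behavior under discussion, not the pg_upgrade C source):

```python
import subprocess
import sys

def exec_prog(cmd, throw_error=True):
    """Run cmd via the shell; on failure either abort (the pg_log(FATAL, ...)
    path, which must end in exit(1)) or report failure to the caller."""
    result = subprocess.run(cmd, shell=True).returncode
    if result != 0:
        if throw_error:
            sys.stderr.write("Failure, exiting\n")
            sys.exit(1)      # the exit(1) Bruce traced inside pg_log_v()
        return False
    return True
```

The bug Tom describes corresponds to the `throw_error` branch failing to reach `sys.exit(1)`: the failure gets logged but execution continues as if the command had succeeded.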
Re: [HACKERS] [COMMITTERS] pgsql: Upgrade to Autoconf 2.69
Bruce Momjian br...@momjian.us writes: On Tue, Jan 7, 2014 at 08:19:49AM -0500, Peter Eisentraut wrote: That was probably me. I'll look into it. and in pg_log_v() I see:

    switch (type)
    ...
    case PG_FATAL:
        printf("\n%s", _(message));
        printf("Failure, exiting\n");
        exit(1);
        break;

Peter just fixed that; see commit ca607b155e86ce529fc9ac322a232f264cda9ab6

regards, tom lane
Re: [HACKERS] [COMMITTERS] pgsql: Upgrade to Autoconf 2.69
Bruce Momjian wrote: I know Peter is looking at this, but I looked at it and I can't see the problem. Every call of exec_prog() that uses pg_resetxlog has throw_error = true, and the test there is:

    result = system(cmd);
    if (result != 0)
        ...
        pg_log(FATAL, ...)

and in pg_log_v() I see:

    switch (type)
    ...
    case PG_FATAL:
        printf("\n%s", _(message));
        printf("Failure, exiting\n");
        exit(1);
        break;

This was fixed by Peter two days ago in commit http://git.postgresql.org/pg/commitdiff/ca607b155e86ce529fc9ac322a232f264cda9ab6

-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] [COMMITTERS] pgsql: Upgrade to Autoconf 2.69
On Fri, Jan 10, 2014 at 04:06:11PM -0500, Tom Lane wrote: Bruce Momjian br...@momjian.us writes: On Tue, Jan 7, 2014 at 08:19:49AM -0500, Peter Eisentraut wrote: That was probably me. I'll look into it. and in pg_log_v() I see: switch (type) ... case PG_FATAL: printf("\n%s", _(message)); printf("Failure, exiting\n"); exit(1); break; Peter just fixed that; see commit ca607b155e86ce529fc9ac322a232f264cda9ab6

Oh, I guess I checked git log on the wrong file then. Thanks.

-- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On 01/10/2014 10:00 PM, Peter Geoghegan wrote: On Fri, Jan 10, 2014 at 11:28 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Why does it deadlock with the btreelock patch? I don't see why it should. If you have two backends inserting a single tuple, and they conflict, one of them should succeed to insert, and the other one should update. Are you sure that it doesn't make your patch deadlock too, with enough pressure? I've made that mistake myself. That test-case made my patch deadlock (in a detected fashion) when it used buffer locks as a value locking prototype - I say as much right there in the November mail you linked to. I think that's acceptable, because it's non-sensible use of the feature (my point was only that it shouldn't livelock). The test case is naively locking a row without knowing ahead of time (or pro-actively checking) if the conflict is on the first or second unique index. So before too long, you're updating the wrong row (no existing lock is really held), based on the 'a' column's projected value, when in actuality the conflict was on the 'b' column's projected value. Conditions are right for deadlock, because two rows are locked, not one.

I see. Yeah, I also get deadlocks when I change the update statement to use foo.b = rej.b instead of foo.a = rej.a. I think it comes down to the order the indexes are processed in, ie. which conflict you see first.

This is pretty much the same issue we discussed wrt. exclusion constraints. If the tuple being inserted conflicts with several existing tuples, what to do? I think the best answer would be to return and lock them all. It could still deadlock, but it's nevertheless less surprising behavior than returning one of the tuples at random. Actually, we could even avoid the deadlock by always locking the tuples in a certain order, although I'm not sure if it's worth the trouble.
- Heikki
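Heikki's last point — always locking the conflicting tuples in a canonical order — is the classic lock-ordering cure for deadlock. A small Python sketch (hypothetical, not the patch) shows two "backends" that discovered the same pair of conflicting tuples in opposite index orders, yet cannot deadlock because both sort before acquiring:

```python
import threading

# Stand-ins for row locks keyed by tuple identifier.
locks = {tid: threading.Lock() for tid in (101, 205)}

def lock_conflicting_tuples(tids):
    """Lock every conflicting tuple, but in a canonical (sorted) order."""
    acquired = []
    for tid in sorted(tids):     # the canonical order is the whole trick
        locks[tid].acquire()
        acquired.append(tid)
    return acquired

def unlock(tids):
    for tid in tids:
        locks[tid].release()

results = []
def backend(discovery_order):
    # One backend saw the conflict on index 'a' first, the other on 'b';
    # sorting makes both take 101 before 205 regardless.
    got = lock_conflicting_tuples(discovery_order)
    results.append(got)
    unlock(got)

t1 = threading.Thread(target=backend, args=([101, 205],))
t2 = threading.Thread(target=backend, args=([205, 101],))
t1.start(); t2.start()
t1.join(); t2.join()
```

Without the `sorted()`, the two threads could each hold one lock while waiting on the other — the deadlock observed in the test case.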
Re: [HACKERS] Disallow arrays with non-standard lower bounds
On 1/9/14, 10:58 PM, Tom Lane wrote: Jim Nasby j...@nasby.net writes: ISTM that allowing users to pick arbitrary lower array bounds was a huge mistake. I've never seen anyone make use of it, can't think of any legitimate use cases for it, and hate the stupendous amount of extra code needed to deal with it. You lack imagination, sir.

Considering what you'd normally want to do in SQL, the only example I can think of is to not have the argument over 0 vs 1 based. Actually, I was thinking there might be some computational problems where changing the lower bound would be nice, but then again, what other languages actually support this?

-- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Sat, Jan 11, 2014 at 7:08 AM, Tom Lane t...@sss.pgh.pa.us wrote: Although, having said that ... maybe build your own aggregate would be a reasonable suggestion for people who need this? I grant that it's going to be a minority requirement, maybe even a small minority requirement. People who have the chops to get this sort of thing right can probably manage a custom aggregate definition.

I more or less wrote off the idea of inverse transition functions for floats after your example upthread. I had thought that perhaps if we could get inverse transitions in there for SUM(numeric), then people who need more speed could just cast their value to numeric, then back to float or double precision after aggregation takes place.

I had to delay writing any documentation around that, as I'm still not sure if we can have sum(numeric) use an inverse transition function, due to the fact that it can introduce extra zeros after the decimal point. As the patch stands at the moment, I currently have a regression test which fails due to these extra zeros after the decimal point:

    -- This test currently fails due to extra trailing 0 digits.
    SELECT SUM(n::numeric) OVER (ORDER BY i ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
    FROM (VALUES(1,1.01),(2,2),(3,3)) v(i,n);

Patched produces:

    6.01
    5.00
    3.00

Unpatched produces:

    6.01
    5
    3

With inverse transitions this query still produces correct results, it just does not produce the numeric in the same format as it does without performing inverse transitions.

Personally I'd rather focus on trying to get SUM(numeric) in there for 9.4 and maybe look at the floating point stuff at a later date, as casting to numeric can be the workaround for users who complain about the speed. Or if they really want, they can create their own aggregate, using an existing built-in function as the inverse transition, like float8_mi.
There's certain things that currently seem a bit magical to me when it comes to numeric; for example, I've no idea why the following query produces 20 0's after the decimal point for 1 and only 16 for 2.

    select n::numeric / 1 from generate_series(1,2) g(n);

To me it does not look very consistent at all, and I'm really wondering if there is some special reason why we bother including the useless zeros at the end at all. I've written a patch which gets rid of them in numeric_out, but I had not planned on posting it here in case it gets laughed off stage due to some special reason we have for keeping those zeros that I don't know about. Can anyone explain to me why we have these unneeded zeros in numeric when the precision is not supplied?

Regards

David Rowley

The constraint this would pose on the float4 and float8 implementations is that it be possible to use their transition and final functions in a custom aggregate declaration while leaving off the inverse function; or, if that combination doesn't work for some reason, we have to continue to provide the previous transition/final functions for use in user aggregates. Suitable documentation would be needed too, of course.

regards, tom lane
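The trailing-zero behavior in the failing regression test falls out of decimal arithmetic keeping the larger scale of its operands, which Python's decimal module mirrors closely (my illustration, not PostgreSQL's numeric code). Running the test's values forward and then discarding 1.01 again leaves the extra scale behind:

```python
from decimal import Decimal

# The values from the regression test: (1.01, 2, 3).
values = [Decimal("1.01"), Decimal("2"), Decimal("3")]

total = sum(values)                       # scale 2, from the 1.01 input
after_discard = total - Decimal("1.01")   # subtraction keeps the larger scale

print(total)                              # 6.01
print(after_discard)                      # 5.00  <- not 5: the inverse
                                          #    transition "remembers" scale 2
print(Decimal("2") + Decimal("3"))        # 5     <- a rescan of the frame
                                          #    never sees 1.01 at all
```

That is exactly why the patched output shows 5.00 and 3.00 where the unpatched rescan shows 5 and 3: the discarded row's display scale survives in the running state.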
Re: [HACKERS] Time to do our Triage for 9.4
On Sat, Jan 11, 2014 at 8:28 AM, Josh Berkus j...@agliodbs.com wrote: All, To make this easier for everyone to participate in, I've created a wiki page: https://wiki.postgresql.org/wiki/9.4CF4Triage Please add the patches you know well to the appropriate list, thanks!

I know my own patch pretty well, and from my own point of view it's very close to being ready to go, but a good review may change that. Should we be waiting for second opinions, or can patch authors decide for themselves? Or were you talking only to committers?

Regards

David Rowley

-- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] Standalone synchronous master
On 1/10/14, 12:59 PM, Joshua D. Drake wrote: I know I am the one that instigated all of this, so I want to be very clear on what I expect and what I am confident that my customers would expect. If a synchronous slave goes down, the master continues to operate. That is all. I don't care if it is configurable (I would be fine with that). I don't care if it is not automatic (e.g. slave goes down and we have to tell the master to continue). I have read through this thread more than once, and I have also gone back to the docs. I understand why we do it the way we do it. I also understand that, as a business requirement for 99% of CMD's customers, it's wrong. At least in the sense of providing continuity of service.

+1

I understand that this is a degradation of full-on sync rep. But there is definite value added with sync rep that can automatically (or at least easily) degrade to async; it protects you from single failures. I fully understand that it will not protect you from a double failure. That's OK in many cases.

Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] Add CREATE support to event triggers
On 10 January 2014 18:17, Robert Haas robertmh...@gmail.com wrote: On Fri, Jan 10, 2014 at 12:59 PM, Simon Riggs si...@2ndquadrant.com wrote: That's project policy and always has been. When somebody implements 50% of a feature, or worse yet 95% of a feature, it violates the POLA for users and doesn't always subsequently get completed, leaving us with long-term warts that are hard to eliminate. So why was project policy violated when we released 9.3 with only DROP event support? Surely that was a worse violation of POLA than my suggestion? Well, obviously I didn't think so at the time, or I would have objected. I felt, and still feel, that implementing one kind of event trigger (drop) does not necessarily require implementing another kind (create). I think that's clearly different from implementing either one for only some object types. "This event trigger will be called whenever an object is dropped" is a reasonable contract with the user. "This other event trigger will be called whenever an object is created, unless it happens to be a schema" is much less reasonable. At least in my opinion.

In the fullness of time, I agree that is not a restriction we should maintain. Given that CREATE SCHEMA with multiple objects is less well used, it's a reasonable restriction to accept for one release, if the alternative is to implement nothing at all of value. Especially since we are now in the third year of development of this set of features, it is time to reduce the scope to ensure delivery. There may be other ways to ensure something of value is added; this was just one suggestion.

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Disallow arrays with non-standard lower bounds
On 01/10/2014 04:26 PM, Jim Nasby wrote: On 1/9/14, 10:58 PM, Tom Lane wrote: Jim Nasby j...@nasby.net writes: ISTM that allowing users to pick arbitrary lower array bounds was a huge mistake. I've never seen anyone make use of it, can't think of any legitimate use cases for it, and hate the stupendous amount of extra code needed to deal with it. You lack imagination, sir. Considering what you'd normally want to do in SQL, the only example I can think of is to not have the argument over 0 vs 1 based. Actually, I was thinking there might be some computational problems where changing lower bound would be nice, but then again, what other languages actually support this?

Ada, for one. In fact, in Ada the index doesn't need to be an integer, just an enumerable type (e.g. an enum). You can iterate over one-dimensional arrays by saying:

    FOR i IN [REVERSE] my_array'range LOOP ...

cheers

andrew (who sadly hasn't used Ada in anger for about 20 years)
Re: [HACKERS] Standalone synchronous master
On 2014-01-10 10:59:23 -0800, Joshua D. Drake wrote: On 01/10/2014 07:47 AM, Bruce Momjian wrote: I know there was a desire to remove this TODO item, but I think we have brought up enough new issues that we can keep it to see if we can come up with a solution. I have added a link to this discussion on the TODO item. I think we will need at least four new GUC variables:

    * timeout control for degraded mode
    * command to run during switch to degraded mode
    * command to run during switch from degraded mode
    * read-only variable to report degraded mode

I know I am the one that instigated all of this, so I want to be very clear on what I expect and what I am confident that my customers would expect. If a synchronous slave goes down, the master continues to operate. That is all. I don't care if it is configurable (I would be fine with that). I don't care if it is not automatic (e.g. slave goes down and we have to tell the master to continue).

Would you please explain, as precisely as possible, what the advantages of using a synchronous standby would be in such a scenario?

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Standalone synchronous master
* Andres Freund (and...@2ndquadrant.com) wrote: On 2014-01-10 10:59:23 -0800, Joshua D. Drake wrote: If a synchronous slave goes down, the master continues to operate. That is all. I don't care if it is configurable (I would be fine with that). I don't care if it is not automatic (e.g. slave goes down and we have to tell the master to continue). Would you please explain, as precisely as possible, what the advantages of using a synchronous standby would be in such a scenario?

In a degraded/failure state, things continue to *work*. In a non-degraded/failure state, you're able to handle a system failure and know that you didn't lose any transactions. Tom's point is correct, that you will fail on the "have two copies of everything" guarantee in this mode, but that could certainly be acceptable in the case where there is a system failure. As pointed out by someone previously, that's how RAID-1 works (which I imagine quite a few of us use). I've been thinking about this a fair bit and I've come to like the RAID-1 analogy. Stinks that we can't keep things going (automatically) if either side fails, but perhaps we will one day...

Thanks, Stephen
Re: [HACKERS] Disallow arrays with non-standard lower bounds
On Fri, Jan 10, 2014 at 03:26:04PM -0600, Jim Nasby wrote: On 1/9/14, 10:58 PM, Tom Lane wrote: Jim Nasby j...@nasby.net writes: ISTM that allowing users to pick arbitrary lower array bounds was a huge mistake. I've never seen anyone make use of it, can't think of any legitimate use cases for it, and hate the stupendous amount of extra code needed to deal with it. You lack imagination, sir. Considering what you'd normally want to do in SQL, the only example I can think of is to not have the argument over 0 vs 1 based. Actually, I was thinking there might be some computational problems where changing lower bound would be nice, but then again, what other languages actually support this?

Well, there's Perl, but that's not an argument *for* doing this, and I say that as a long-time Perl user (who has never seen this feature used in code worth not scrapping, by the way).

Cheers, David.
Re: [HACKERS] Disallow arrays with non-standard lower bounds
Gavin Flower gavinflo...@archidevsys.co.nz wrote: Starting arrays at zero makes the most sense, as then you can calculate the displacement simply as (index) * (size of entry), and not have to subtract one from the index first. This would be my preference.

The SQL standard explicitly specifies that array positions range from 1 to the cardinality of the array, with individual elements referenced by position. When implementing a language for which there is an international standard, my preference is to conform to the standard. I don't have a problem with extensions to the language, and a variable low bound is workable as an extension as long as the standard ways to create an array default to a low bound of 1.

A bigger problem with our array implementation is that it is really a multidimensional matrix, rather than an array which can contain nested arrays. That is both non-standard and limiting. That said, I think it would be nice to have better support for arrays defined with a single dimension and a low bound of 1, as the standard requires. Functions which throw an error when passed a non-conforming parameter and can provide a simplified API as a result would probably get used by me.

-- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Disallow arrays with non-standard lower bounds
On Fri, Jan 10, 2014 at 1:26 PM, Jim Nasby j...@nasby.net wrote: On 1/9/14, 10:58 PM, Tom Lane wrote: Jim Nasby j...@nasby.net writes: ISTM that allowing users to pick arbitrary lower array bounds was a huge mistake. I've never seen anyone make use of it, can't think of any legitimate use cases for it, and hate the stupendous amount of extra code needed to deal with it. You lack imagination, sir. Considering what you'd normally want to do in SQL, the only example I can think of is to not have the argument over 0 vs 1 based. Actually, I was thinking there might be some computational problems where changing lower bound would be nice, but then again, what other languages actually support this? Perl does, though they regret it bitterly. Cheers, Jeff
Re: [HACKERS] Disallow arrays with non-standard lower bounds
On Fri, Jan 10, 2014 at 4:10 PM, Jeff Janes jeff.ja...@gmail.com wrote: On Fri, Jan 10, 2014 at 1:26 PM, Jim Nasby j...@nasby.net wrote: On 1/9/14, 10:58 PM, Tom Lane wrote: Jim Nasby j...@nasby.net writes: ISTM that allowing users to pick arbitrary lower array bounds was a huge mistake. I've never seen anyone make use of it, can't think of any legitimate use cases for it, and hate the stupendous amount of extra code needed to deal with it. You lack imagination, sir. Considering what you'd normally want to do in SQL, the only example I can think of is to not have the argument over 0 vs 1 based. Actually, I was thinking there might be some computational problems where changing lower bound would be nice, but then again, what other languages actually support this? Perl does, though they regret it bitterly.

What does it matter? Our arrays have had the capability for years and years, and "because it's cleaner" is simply not justification to break people's applications. Why are we even considering this?

merlin
Re: [HACKERS] Standalone synchronous master
On 2014-01-10 17:02:08 -0500, Stephen Frost wrote: * Andres Freund (and...@2ndquadrant.com) wrote: On 2014-01-10 10:59:23 -0800, Joshua D. Drake wrote: If a synchronous slave goes down, the master continues to operate. That is all. I don't care if it is configurable (I would be fine with that). I don't care if it is not automatic (e.g. slave goes down and we have to tell the master to continue). Would you please explain, as precisely as possible, what the advantages of using a synchronous standby would be in such a scenario? In a degraded/failure state, things continue to *work*. In a non-degraded/failure state, you're able to handle a system failure and know that you didn't lose any transactions.

Why do you know that you didn't lose any transactions? Trivial network hiccups, a restart of a standby, IO overload on the standby all can cause very short interruptions in the walsender connection - leading to degradation.

As pointed out by someone previously, that's how RAID-1 works (which I imagine quite a few of us use).

I don't think that argument makes much sense. RAID-1 isn't safe as-is. It's only safe if you use some sort of journaling or similar on top. If you issued a write during a crash, you normally will just get either the version from before or the version after the last write back, depending on the state of the individual disks and which disk is treated as authoritative by the raid software. And even if you disregard that, there's not much outside influence that can lead to losing the connection to a disk drive inside a raid, outside an actually broken drive. Any network connection is normally kept *outside* the level at which you build raids.

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Standalone synchronous master
Andres, On Friday, January 10, 2014, Andres Freund wrote: On 2014-01-10 17:02:08 -0500, Stephen Frost wrote: * Andres Freund (and...@2ndquadrant.com) wrote: On 2014-01-10 10:59:23 -0800, Joshua D. Drake wrote: If a synchronous slave goes down, the master continues to operate. That is all. I don't care if it is configurable (I would be fine with that). I don't care if it is not automatic (e.g. slave goes down and we have to tell the master to continue). Would you please explain, as precisely as possible, what the advantages of using a synchronous standby would be in such a scenario? In a degraded/failure state, things continue to *work*. In a non-degraded/failure state, you're able to handle a system failure and know that you didn't lose any transactions. Why do you know that you didn't lose any transactions? Trivial network hiccups, a restart of a standby, IO overload on the standby all can cause very short interruptions in the walsender connection - leading to degradation.

You know that you haven't *lost* any by virtue of the master still being up. The case you describe is a double-failure scenario: the link between the master and slave has to go away AND the master must accept a transaction and then fail independently.

As pointed out by someone previously, that's how RAID-1 works (which I imagine quite a few of us use). I don't think that argument makes much sense. RAID-1 isn't safe as-is. It's only safe if you use some sort of journaling or similar on top. If you issued a write during a crash you normally will just get either the version from before or the version after the last write back, depending on the state on the individual disks and which disk is treated as authoritative by the raid software.

Uh, you need a decent raid controller then, and we're talking about after a transaction commit/sync.
And even if you disregard that, there's not much outside influence that can lead to losing the connection to a disk drive inside a RAID other than an actually broken drive. Any network connection is normally kept *outside* the level at which you build RAIDs.

This is a fair point, and perhaps we should have the timeout or jitter GUC which was proposed elsewhere, but the notion that this configuration is completely unreasonable is not accurate, and therefore having it would be a benefit overall. Thanks, Stephen
Re: [HACKERS] Standalone synchronous master
On 01/10/2014 01:49 PM, Andres Freund wrote: I know I am the one that instigated all of this, so I want to be very clear on what I expect and what I am confident that my customers would expect. If a synchronous slave goes down, the master continues to operate. That is all. I don't care if it is configurable (I would be fine with that). I don't care if it is not automatic (e.g. slave goes down and we have to tell the master to continue). Would you please explain, as precisely as possible, what the advantages of using a synchronous standby would be in such a scenario?

Current behavior: db01 -> (sync) -> db02. Transactions are happening. Everything is happy. Website is up. Orders are being made. db02 goes down. It doesn't matter why. It is down. Because it is down, db01 for all intents and purposes is also down, because we are using sync replication. We have just lost continuity of service: we can no longer accept orders, we can no longer allow people to log into the website, we can no longer service accounts. In short, we are out of business.

Proposed behavior: db01 -> (sync) -> db02. Transactions are happening. Everything is happy. Website is up. Orders are being made. db02 goes down. It doesn't matter why. It is down. db01 continues to accept orders, allow people to log into the website, and we can still service accounts. Continuity of service continues. Yes, there are all kinds of things that need to be considered when that happens; that isn't the point. The point is, PostgreSQL continues its uptime guarantee and allows the business to continue to function as if nothing has happened.

For many, and I dare say the majority of, businesses, this is enough. They know that if the slave goes down they can continue to operate. They know if the master goes down they can fail over. They know that while both are up they are using sync rep (with various caveats). They are happy. They like that it is simple and just works. They continue to use PostgreSQL. Sincerely, JD -- Command Prompt, Inc.
- http://www.commandprompt.com/ 509-416-6579 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc In a time of universal deceit - telling the truth is a revolutionary act., George Orwell
Re: [HACKERS] Standalone synchronous master
On 2014-01-10 14:29:58 -0800, Joshua D. Drake wrote: db02 goes down. It doesn't matter why. It is down. db01 continues to accept orders, allow people to log into the website and we can still service accounts. The continuity of service continues.

The question is why that configuration is advantageous over an async configuration. Why, with those requirements, are you using a synchronous standby at all? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On Fri, Jan 10, 2014 at 1:25 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: This is pretty much the same issue we discussed wrt. exclusion constraints. If the tuple being inserted conflicts with several existing tuples, what to do? I think the best answer would be to return and lock them all. It could still deadlock, but it's nevertheless less surprising behavior than returning one of the tuples at random. Actually, we could even avoid the deadlock by always locking the tuples in a certain order, although I'm not sure if it's worth the trouble.

I understand and accept that as long as we're intent on locking more than one row per transaction, that action could deadlock with another session doing something similar. Actually, I've even encountered people giving advice in relation to proprietary systems along the lines of: if your big SQL MERGE statement is deadlocking excessively, you might try hinting to make sure a nested loop join is used. I think that this kind of ugly compromise is unavoidable in those scenarios (in reality the most popular strategy is probably "cross your fingers"). But as everyone agrees, the common case where an xact only upserts one row should never deadlock with another, similar xact. So *that* isn't a problem I have with making row locking work for exclusion constraints.

My problem is that in general I'm not sold on the actual utility of making this kind of row locking work with exclusion constraints. I'm sincerely having a hard time thinking of a practical use-case (although, as I've said, I want to make it work with IGNORE). Even if you work all this row locking stuff out, and the spill-to-disk aspect out, the interface is still wrong, because you need to figure out a way to project more than one reject per slot. Maybe I lack imagination around how to make that work, but there are a lot of ifs and buts either way.
-- Peter Geoghegan
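The multi-conflict case Heikki describes is easy to reproduce with a range-overlap exclusion constraint. A hedged illustration (the table and values are hypothetical, chosen only to show the ambiguity):

```sql
-- A single inserted row can conflict with SEVERAL existing rows at
-- once under an exclusion constraint, so "return and lock the
-- conflicting tuple" is ambiguous for upsert-style semantics.
CREATE EXTENSION IF NOT EXISTS btree_gist;  -- needed for "room WITH ="

CREATE TABLE reservations (
    room   int,
    during tsrange,
    EXCLUDE USING gist (room WITH =, during WITH &&)
);

INSERT INTO reservations VALUES
    (1, '[2014-01-01 10:00, 2014-01-01 11:00)'),
    (1, '[2014-01-01 11:00, 2014-01-01 12:00)');

-- This row overlaps BOTH existing reservations; there is no single
-- "conflicting tuple" to return and lock:
INSERT INTO reservations VALUES
    (1, '[2014-01-01 10:30, 2014-01-01 11:30)');
```

With a plain unique index, by contrast, an inserted row can conflict with at most one existing row, which is why the unique-index upsert case avoids this question entirely.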
Re: [HACKERS] Standalone synchronous master
On 01/10/2014 02:33 PM, Andres Freund wrote: On 2014-01-10 14:29:58 -0800, Joshua D. Drake wrote: db02 goes down. It doesn't matter why. It is down. db01 continues to accept orders, allow people to log into the website and we can still service accounts. The continuity of service continues. Why is that configuration advantageous over an async configuration is the question. Why, with those requirements, are you using a synchronous standby at all?

+1

-- Adrian Klaver adrian.kla...@gmail.com
Re: [HACKERS] Standalone synchronous master
On 01/10/2014 02:33 PM, Andres Freund wrote: On 2014-01-10 14:29:58 -0800, Joshua D. Drake wrote: db02 goes down. It doesn't matter why. It is down. db01 continues to accept orders, allow people to log into the website and we can still service accounts. The continuity of service continues. Why is that configuration advantageous over an async configuration is the question. Why, with those requirements, are you using a synchronous standby at all?

If the master goes down, I can fail over knowing that as many of my transactions as possible have been replicated. JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc In a time of universal deceit - telling the truth is a revolutionary act., George Orwell
Re: [HACKERS] Standalone synchronous master
Hi, On 2014-01-10 17:28:55 -0500, Stephen Frost wrote: How do you know that you didn't lose any transactions? Trivial network hiccups, a restart of a standby, IO overload on the standby all can cause very short interruptions in the walsender connection - leading to degradation. You know that you haven't *lost* any by virtue of the master still being up. The case you describe is a double-failure scenario - the link between the master and slave has to go away AND the master must accept a transaction and then fail independently.

Unfortunately, network outages do correlate with other system faults. What you're wishing for really is the "I like the world to be friendly to me" mode. Even if you have only disk problems, quite often when your disks die you can continue to write (especially with a BBU), but uncached reads fail. So the walsender connection errors out because a read failed, and you're degrading into async mode - *because* your primary is about to die.

As pointed out by someone previously, that's how RAID-1 works (which I imagine quite a few of us use). I don't think that argument makes much sense. RAID-1 isn't safe as-is. It's only safe if you use some sort of journaling or similar on top. If you issued a write during a crash you will normally get back either the version from before or the version after the last write, depending on the state of the individual disks and which disk is treated as authoritative by the RAID software. Uh, you need a decent RAID controller then and we're talking about after a transaction commit/sync.

Yes, if you have a BBU, that memory is authoritative in most cases. But in that case the argument of having two disks is pretty much pointless; the SPOF suddenly became the battery + RAM.
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Standalone synchronous master
On Fri, Jan 10, 2014 at 2:33 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-01-10 14:29:58 -0800, Joshua D. Drake wrote: db02 goes down. It doesn't matter why. It is down. db01 continues to accept orders, allow people to log into the website and we can still service accounts. The continuity of service continues. Why is that configuration advantageous over an async configuration is the question.

Because it is orders of magnitude less likely to lose transactions that were reported to have been committed. A permanent failure of the master is almost guaranteed to lose transactions with async. With auto-degrade, a permanent failure of the master only loses reported-committed transactions if it co-occurs with a temporary failure of the replica or the network lasting longer than the timeout period.

Why, with those requirements, are you using a synchronous standby at all?

They aren't using a synchronous standby; they are using an asynchronous standby because we fail to provide the choice they prefer, which is a compromise between the two. Cheers, Jeff
Re: [HACKERS] Standalone synchronous master
On 2014-01-10 14:44:28 -0800, Joshua D. Drake wrote: On 01/10/2014 02:33 PM, Andres Freund wrote: On 2014-01-10 14:29:58 -0800, Joshua D. Drake wrote: db02 goes down. It doesn't matter why. It is down. db01 continues to accept orders, allow people to log into the website and we can still service accounts. The continuity of service continues. Why is that configuration advantageous over an async configuration is the question. Why, with those requirements, are you using a synchronous standby at all? If the master goes down, I can fail over knowing that as many of my transactions as possible have been replicated.

It's not like async replication mode delays sending data to the standby in any way. Really, the commits themselves are sent to the server at exactly the same speed independent of sync/async. The only thing that's delayed is the *notification* of the client that sent the commit. Not the commit itself. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
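Andres's point is visible in the primary's configuration: sync and async ship the same WAL stream, and the only difference is whether COMMIT waits for the standby's acknowledgment. A minimal sketch (the standby name `standby1` is illustrative):

```
# postgresql.conf on the primary -- illustrative values
synchronous_standby_names = 'standby1'  # sync: COMMIT waits for this standby's ack
# synchronous_standby_names = ''        # async: identical WAL stream, no wait
synchronous_commit = on
```

So failing over from an async standby loses at most the commits whose WAL had not yet been streamed, not commits that were somehow never sent.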
Re: [HACKERS] Standalone synchronous master
Greetings, On Friday, January 10, 2014, Andres Freund wrote: Hi, On 2014-01-10 17:28:55 -0500, Stephen Frost wrote: How do you know that you didn't lose any transactions? Trivial network hiccups, a restart of a standby, IO overload on the standby all can cause very short interruptions in the walsender connection - leading to degradation. You know that you haven't *lost* any by virtue of the master still being up. The case you describe is a double-failure scenario - the link between the master and slave has to go away AND the master must accept a transaction and then fail independently. Unfortunately, network outages do correlate with other system faults. What you're wishing for really is the "I like the world to be friendly to me" mode. Even if you have only disk problems, quite often when your disks die you can continue to write (especially with a BBU), but uncached reads fail. So the walsender connection errors out because a read failed, and you're degrading into async mode - *because* your primary is about to die.

That can happen, sure, but I don't agree that people using a single drive with a BBU, or having two drives in a RAID-1 die at the same time, are reasonable arguments against this option. Not to mention that, today, if the master has an issue then we're SOL anyway. Also, if the network fails then likely there aren't any new transactions happening.

As pointed out by someone previously, that's how RAID-1 works (which I imagine quite a few of us use). I don't think that argument makes much sense. RAID-1 isn't safe as-is. It's only safe if you use some sort of journaling or similar on top. If you issued a write during a crash you will normally get back either the version from before or the version after the last write, depending on the state of the individual disks and which disk is treated as authoritative by the RAID software. Uh, you need a decent RAID controller then and we're talking about after a transaction commit/sync.
Yes, if you have a BBU, that memory is authoritative in most cases. But in that case the argument of having two disks is pretty much pointless; the SPOF suddenly became the battery + RAM. If that is a concern then use multiple controllers. Certainly not unheard of - look at SANs... Thanks, Stephen
Re: [HACKERS] Standalone synchronous master
On 01/10/2014 02:47 PM, Andres Freund wrote: Really, the commits themselves are sent to the server at exactly the same speed independent of sync/async. The only thing that's delayed is the *notification* of the client that sent the commit. Not the commit itself.

Which is irrelevant to the point that if the standby goes down, we are now out of business. Any continuous replication should not be a SPOF. The current behavior guarantees that a two-node sync cluster is a SPOF. The proposed behavior removes that. Sincerely, Joshua D. Drake -- Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc In a time of universal deceit - telling the truth is a revolutionary act., George Orwell
Re: [HACKERS] Standalone synchronous master
On 01/10/2014 02:57 PM, Stephen Frost wrote: Yes, if you have a BBU that memory is authoritative in most cases. But in that case the argument of having two disks is pretty much pointless, the SPOF suddenly became the battery + ram. If that is a concern then use multiple controllers. Certainly not unheard of- look at SANs... And in PostgreSQL we obviously have the option of having a third or fourth standby but that isn't the problem we are trying to solve. JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc In a time of universal deceit - telling the truth is a revolutionary act., George Orwell
Re: [HACKERS] Standalone synchronous master
On 01/10/2014 11:59 PM, Joshua D. Drake wrote: On 01/10/2014 02:57 PM, Stephen Frost wrote: Yes, if you have a BBU that memory is authoritative in most cases. But in that case the argument of having two disks is pretty much pointless, the SPOF suddenly became the battery + ram. If that is a concern then use multiple controllers. Certainly not unheard of- look at SANs... And in PostgreSQL we obviously have the option of having a third or fourth standby but that isn't the problem we are trying to solve.

The problem you are trying to solve is a controller with enough battery-backed cache RAM to cache the entire database, but with write-through mode. And you want it to degrade to write-back in case of disk failure so that you can continue while the disk is broken. People here are telling you that it would not be safe; use at least RAID-1 if you want availability. Cheers -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ
Re: [HACKERS] Standalone synchronous master
On 01/10/2014 02:59 PM, Joshua D. Drake wrote: On 01/10/2014 02:47 PM, Andres Freund wrote: Really, the commits themselves are sent to the server at exactly the same speed independent of sync/async. The only thing that's delayed is the *notification* of the client that sent the commit. Not the commit itself. Which is irrelevant to the point that if the standby goes down, we are now out of business. Any continuous replication should not be a SPOF. The current behavior guarantees that a two-node sync cluster is a SPOF. The proposed behavior removes that.

Again, if that's your goal, then use async replication. I really don't understand the use-case here. The purpose of sync rep is to know definitively whether or not you have lost data when disaster strikes. If knowing for certain isn't important to you, then use async.

BTW, people are using RAID-1 as an analogy to 2-node sync replication. That's a very bad analogy, because in RAID-1 you have a *single* controller which is capable of determining if the disks are in a failed state or not, and this is all happening on a single node where things like network outages aren't a consideration. It's really not the same situation at all. Also, frankly, I absolutely can't count the number of times I've had to rescue a customer or family member who had RAID-1 but wasn't monitoring syslog, and so one of their disks had been down for months without them knowing it. Heck, I've done this myself.

So ... the filesystem geeks have already been through this. Filesystem clustering started out with systems like DRBD, which includes an auto-degrade option. However, DRBD with auto-degrade is widely considered untrustworthy, and that is a significant portion of why DRBD isn't trusted today. From here, clustered filesystems went in two directions: RHCS added layers of monitoring and management to make auto-degrade a safer option than it is with DRBD (and still not the default option).
Scalable clustered filesystems added N(M) quorum commit in order to support more than 2 nodes. Either of these courses is reasonable for us to pursue. What's a bad idea is adding an auto-degrade option without any tools to manage and monitor it, which is what this patch does by my reading. If I'm wrong, then someone can point it out to me. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
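The kind of monitoring Josh is asking for is at least partially possible from the primary today. An external framework could, for example, poll `pg_stat_replication` to help distinguish "replica down" from "replica merely slow" before deciding to degrade. A sketch, not a complete monitoring solution (column names are from the 9.x-era view):

```sql
-- Run on the primary: one row per connected walsender.
-- A missing row means the standby is disconnected; a growing gap
-- between sent_location and replay_location means it is lagging.
SELECT application_name, client_addr, state, sync_state,
       sent_location, flush_location, replay_location
FROM pg_stat_replication;
```

Deciding how long a standby may be absent or lagging before degrading is exactly the policy question such a framework would have to answer.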
Re: [HACKERS] Add CREATE support to event triggers
Can we please stop arguing over a problem I don't have? I started with CREATE SCHEMA because it is one of the easy cases, not because it was the most difficult case: we only need to deparse the bits of it that don't involve the objects within, because those are reported by the event trigger as separate commands. Each object gets its own creation command, which works pretty nicely. Of course, the deparsed version of the command will not look very much like what was submitted by the user, but that doesn't matter -- what does matter is that the objects created by running the commands reported in the event trigger will be (or should be) the same as those created by the original command. I was showing CREATE SCHEMA as a way to discuss the fine details: how to report identifiers that might need quoting, what to do with optional clauses (AUTHORIZATION), etc. I am past that now. On the subject of testing the triggers, Robert Haas wrote: Here's one idea: create a contrib module that (somehow, via APIs to be invented) runs every DDL command that gets executed through the deparsing code, and then parses the result and executes *that* instead of the original command. Then, add a build target that runs the regression test suite in that mode, and get the buildfarm configured to run that build target regularly on at least some machines. That way, adding syntax to the regular regression test suite also serves to test that the deparsing logic for that syntax is working. If we do this, there's still some maintenance burden associated with having DDL deparsing code, but at least our chances of noticing when we've failed to maintain it should be pretty good. I gave this some more thought and hit a snag. The problem here is that by the time the event trigger runs, the original object has already been created. At that point, we can't simply replace the created objects with objects that would hypothetically be created by a command trigger. A couple of very hand-wavy ideas: 1. 
in the event trigger, DROP the original object and CREATE it as reported by the creation_commands SRF. 2. Have ddl_command_start open a savepoint, and then roll it back in ddl_command_end, then create the object again. Not sure this is doable because of the whole SPI nesting issue .. maybe with C-language event trigger functions? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
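For readers following along, the testing idea might look roughly like the sketch below. This is written against the interface proposed in this thread: the SRF name `pg_event_trigger_get_creation_commands()` and a per-row deparsed `command` field are assumptions taken from the patch under discussion, not a committed API, and may well differ in the final version.

```sql
-- Hedged sketch: log each deparsed creation command from an event
-- trigger, so a test harness could re-execute and compare results.
CREATE FUNCTION log_creation_commands() RETURNS event_trigger
LANGUAGE plpgsql AS $$
DECLARE
    r record;
BEGIN
    -- Assumed patch API: one row per command reported by the trigger.
    FOR r IN SELECT * FROM pg_event_trigger_get_creation_commands()
    LOOP
        RAISE NOTICE 'deparsed: %', r.command;
    END LOOP;
END;
$$;

CREATE EVENT TRIGGER show_creates
    ON ddl_command_end
    EXECUTE PROCEDURE log_creation_commands();
```

The snag Álvaro describes is visible here: by the time `ddl_command_end` fires, the original object already exists, so the harness cannot simply execute the deparsed command in its place without first dropping or rolling back the original.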
Re: [HACKERS] Standalone synchronous master
On 01/10/2014 01:49 PM, Andres Freund wrote: On 2014-01-10 10:59:23 -0800, Joshua D. Drake wrote: On 01/10/2014 07:47 AM, Bruce Momjian wrote: I know there was a desire to remove this TODO item, but I think we have brought up enough new issues that we can keep it to see if we can come up with a solution. I have added a link to this discussion on the TODO item. I think we will need at least four new GUC variables: * timeout control for degraded mode * command to run during switch to degraded mode * command to run during switch from degraded mode * read-only variable to report degraded mode

I would argue that we don't need the first. We just want a command to switch between synchronous and degraded, and a variable (or function) to report on degraded mode. If we have those things, then it becomes completely possible to have an external monitoring framework which is capable of answering questions like "is the replica down or just slow?" and of controlling degradation.

Oh, wait! We DO have such a command. It's called ALTER SYSTEM SET, recently committed. So this is really a solvable issue if one is willing to use an external utility. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
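Concretely, the external-utility approach Josh describes could work along these lines (the standby name `standby1` is hypothetical; the point is only the mechanism):

```sql
-- An external monitor, having decided the sync standby is really down
-- rather than merely slow, degrades the master to async:
ALTER SYSTEM SET synchronous_standby_names = '';
SELECT pg_reload_conf();

-- Later, once the standby has reconnected and caught up, it restores
-- synchronous replication:
ALTER SYSTEM SET synchronous_standby_names = 'standby1';
SELECT pg_reload_conf();
```

Since `synchronous_standby_names` only requires a reload, not a restart, the switch in either direction does not interrupt running sessions.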
Re: [HACKERS] Time to do our Triage for 9.4
On 01/10/2014 01:34 PM, David Rowley wrote: On Sat, Jan 11, 2014 at 8:28 AM, Josh Berkus j...@agliodbs.com wrote: All, To make this easier for everyone to participate in, I've created a wiki page: https://wiki.postgresql.org/wiki/9.4CF4Triage Please add the patches you know well to the appropriate list, thanks! I know my own patch pretty well, and from my own point of view it's very close to being ready to go, but a good review may change that. Should we be waiting for 2nd opinions, or can patch authors decide for themselves? Or were you talking only to committers?

Well, I'd prefer that someone other than the patch author assess the patch state; the author is going to be characteristically optimistic. However, it's a wiki. If you put it under "good to go", someone else who disagrees can move it. IMHO, if the patch hasn't had at least one review yet (in a prior CF), though, I'd put it under "Nearly Ready". -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] Standalone synchronous master
On 01/10/2014 03:17 PM, Josh Berkus wrote: Any continuous replication should not be a SPOF. The current behavior guarantees that a two node sync cluster is a SPOF. The proposed behavior removes that. Again, if that's your goal, then use async replication.

I think I have gone about this the wrong way. Async does not meet the technical or business requirements that I have. Sync does, except that it increases the possibility of an outage. That is the requirement I am trying to address.

The purpose of sync rep is to know definitively whether or not you have lost data when disaster strikes. If knowing for certain isn't important to you, then use async.

PostgreSQL sync replication increases the possibility of an outage. That is incorrect behavior. I want sync because, on the chance that the master goes down, I have as much data as possible to fail over to. However, I can't use sync because it increases the possibility that my business will not be able to function on the chance that the standby goes down.

What's a bad idea is adding an auto-degrade option without any tools to manage and monitor it, which is what this patch does by my reading.

This we absolutely agree on. JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc In a time of universal deceit - telling the truth is a revolutionary act., George Orwell