Re: [HACKERS] tracking commit timestamps
On Tue, 2014-11-04 at 23:43 -0600, Jim Nasby wrote: I'm worried about 2 commits in the same microsecond on the same system, not on 2 different systems. Or, put another way, if we're going to expose this I think it should also provide a guaranteed unique commit ordering for a single cluster. Presumably, this shouldn't be that hard since we do know the exact order in which things committed. Addition of LSN when the timestamps for two transactions are exactly the same isn't enough. There isn't any guarantee that a later commit gets a later timestamp than an earlier commit. In addition, I wonder if this feature would be misused. Record transaction ids to a table to find out commit order (use case could be storing historical row versions for example). Do a dump and restore on another cluster, and all the transaction ids are completely meaningless to the system. Having the ability to record commit order into an audit table would be extremely welcome, but as is, this feature doesn't provide it. - Anssi -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On 2014-11-04 18:33:34 +0200, Heikki Linnakangas wrote: On 10/30/2014 06:02 PM, Andres Freund wrote: On 2014-10-29 10:24:20 +0200, Heikki Linnakangas wrote: On 10/06/2014 02:29 PM, Andres Freund wrote: I've not yet really looked, but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't make me happy... Ok.. Can you elaborate? To me the split between xloginsert.c doing some of the record assembly, and xlog.c doing the lower level part of the assembly is just wrong. I moved the checks for bootstrap mode and xl_len == 0, from the current XLogInsert to the new XLogInsert() function. I don't imagine that to be enough to address your feelings about the split between xlog.c and xloginsert.c, but it makes it a tiny bit clearer IMHO. I don't know what else to do about it, as it feels very sensible to me as it is now. So unless you object more loudly, I'm going to just go ahead with this and commit, probably tomorrow. I'm still not particularly happy with the split. But I think this patch is enough of a win anyhow. So go ahead. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] tracking commit timestamps
On Wed, Nov 5, 2014 at 5:24 PM, Anssi Kääriäinen anssi.kaariai...@thl.fi wrote: On Tue, 2014-11-04 at 23:43 -0600, Jim Nasby wrote: I'm worried about 2 commits in the same microsecond on the same system, not on 2 different systems. Or, put another way, if we're going to expose this I think it should also provide a guaranteed unique commit ordering for a single cluster. Presumably, this shouldn't be that hard since we do know the exact order in which things committed. Addition of LSN when the timestamps for two transactions are exactly same isn't enough. There isn't any guarantee that a later commit gets a later timestamp than an earlier commit. True if the WAL record ID is not globally consistent. Two-level commit ordering can be done with (timestamp or LSN, nodeID). At equal timestamps, we could say as well that the node with the lowest systemID wins, for example. That's not something... In addition, I wonder if this feature would be misused. Record transaction ids to a table to find out commit order (use case could be storing historical row versions for example). Do a dump and restore on another cluster, and all the transaction ids are completely meaningless to the system. I think you are forgetting that you can take a consistent dump using an exported snapshot. In this case the commit order may not be that meaningless.. Having the ability to record commit order into an audit table would be extremely welcome, but as is, this feature doesn't provide it. That's something that can actually be achieved with this feature if the SQL interface is able to query all the timestamps in an xid range, with for example a background worker that tracks this data periodically. Now the thing is as well: how much timestamp history do we want to keep? The patch truncating SLRU files with frozenID may cover a sufficient range... -- Michael
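The two-level ordering sketched above can be illustrated in a few lines. This is only a model of the proposal, not anything in the patch; the timestamps and systemID values are invented for the example:

```python
# Model of two-level commit ordering: sort by commit timestamp first,
# and break exact timestamp ties with a node identifier (here the
# lowest systemID wins). All values are made up for illustration.
commits = [
    ("tx1", 1_000_001, 7001),  # (label, commit timestamp in µs, systemID)
    ("tx2", 1_000_000, 7002),
    ("tx3", 1_000_000, 7001),  # same microsecond as tx2
]

def commit_order(c):
    _, ts, node = c
    return (ts, node)  # tuple comparison: timestamp, then systemID

ordered = [name for name, _, _ in sorted(commits, key=commit_order)]
print(ordered)  # tx3 sorts before tx2: equal timestamps, lower systemID wins
```

The point of the tuple key is that it yields a total order even when timestamps collide, as long as no two nodes share a systemID.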
Re: [HACKERS] Sequence Access Method WIP
On 11/04/2014 11:01 PM, Petr Jelinek wrote: On 04/11/14 13:11, Heikki Linnakangas wrote: On 10/13/2014 01:01 PM, Petr Jelinek wrote: Only the alloc and reloptions methods are required (and implemented by the local AM). The caching, xlog writing, updating the page, etc is handled by backend, the AM does not see the tuple at all. I decided to not pass even the struct around and just pass the relevant options because I think if we want to abstract the storage properly then the AM should not care about how the pg_sequence looks like at all, even if it means that the sequence_alloc parameter list is bit long. Hmm. The division of labour between the seqam and commands/sequence.c still feels a bit funny. sequence.c keeps track of how many values have been WAL-logged, and thus usable immediately, but we still call sequence_alloc even when using up those already WAL-logged values. If you think of using this for something like a centralized sequence server in a replication cluster, you certainly don't want to make a call to the remote server for every value - you'll want to cache them. With the local seqam, there are two levels of caching. Each backend caches some values (per the CACHE value option in CREATE SEQUENCE). In addition to that, the server WAL-logs 32 values at a time. If you have a remote seqam, it would most likely add a third cache, but it would interact in strange ways with the second cache. Considering a non-local seqam, the locking is also a bit strange. The server keeps the sequence page locked throughout nextval(). But if the actual state of the sequence is maintained elsewhere, there's no need to serialize the calls to the remote allocator, i.e. the sequence_alloc() calls. I'm not exactly sure what to do about that. One option is to completely move the maintenance of the current value, i.e. sequence.last_value, to the seqam. That makes sense from an abstraction point of view. 
For example with a remote server managing the sequence, storing the current value in the local catalog table makes no sense as it's always going to be out-of-date. The local seqam would store it as part of the am-private data. However, you would need to move the responsibility of locking and WAL-logging to the seqam. Maybe that's OK, but we'll need to provide an API that the seqam can call to do that. Perhaps just let the seqam call heap_inplace_update on the sequence relation. My idea of how this works is - sequence_next handles the allocation and the level2 caching (WAL logged cache) via amdata if it supports it, or returns single value if it doesn't - then the WAL will always just write the 1 value and there will basically be no level2 cache, since it is the sequence_next who controls how much will be WAL-logged, what backend asks for is just suggestion. Hmm, so the AM might return an nallocated value less than the fetch value that sequence.c requested? As the patch stands, wouldn't that make sequence.c write a WAL record more often? In fact, if the seqam manages the current value outside the database (e.g. a remote seqam that gets the value from another server), nextval() never needs to write a WAL record. The third level caching as you say should be solved by the sequence_request_update and sequence_update callback - that will enable the sequence AM to handle this kind of caching asynchronously without blocking the sequence_next unnecessarily. That way it's possible to implement many different strategies, from no cache at all, to three levels of caching, with and without blocking of calls to sequence_next. It's nice that asynchronous operation is possible, but that seems like a more advanced feature. Surely an approach where you fetch a bunch of values when needed is going to be more common. Let's focus on the simple synchronous case first, and make sure the API works well for that. 
For the amdata handling (which is the AM's private data variable) the API assumes that (Datum) 0 is NULL; this seems to work well for reloptions so it should work here also, and it simplifies things a little compared to passing pointers to pointers around and making sure everything is allocated, etc. Sadly, the fact that amdata is not fixed size and can be NULL made the page updates of the sequence relation quite a bit more complex than they used to be. It would be nice if the seqam could define exactly the columns it needs, with any datatypes. There would be a set of common attributes: sequence_name, start_value, cache_value, increment_by, max_value, min_value, is_cycled. The local seqam would add last_value, log_cnt and is_called to that. A remote seqam that calls out to some other server might store the remote server's hostname etc. There could be a seqam function that returns a TupleDesc with the required columns, for example. Wouldn't that somewhat bloat the catalog if we had a new catalog table for each sequence AM? No, that's not what I meant. The number of catalog tables would be the same as today. Sequences
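The two caching levels described in this thread (a per-backend CACHE, plus the server WAL-logging a batch of 32 values at a time) can be modelled as a toy. Everything below is an invented sketch of the idea, not the actual sequence.c logic:

```python
# Toy model of the WAL-logged caching level: the server "logs" a batch
# of 32 values at a time, so most nextval() calls consume pre-logged
# values and only every 32nd call has to write a new "WAL record".
# All names here are invented for illustration.
WAL_BATCH = 32

class ToySequence:
    def __init__(self):
        self.logged_up_to = 0  # highest value covered by a "WAL record"
        self.current = 0       # last value actually handed out
        self.wal_records = 0   # how many times we had to "log"

    def nextval(self):
        if self.current >= self.logged_up_to:
            # pre-logged range exhausted: "log" another batch
            self.logged_up_to += WAL_BATCH
            self.wal_records += 1
        self.current += 1
        return self.current

seq = ToySequence()
values = [seq.nextval() for _ in range(64)]
print(seq.wal_records)  # only 2 "WAL records" for 64 nextval() calls
```

This also illustrates the concern raised above: if an AM returns fewer values per allocation than the batch size, the "logging" step runs more often.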
Re: [HACKERS] WAL replay bugs
Michael Paquier wrote: Now, do we really want this feature in-core? That's somewhat a duplicate of what is mentioned here: http://www.postgresql.org/message-id/CAB7nPqQMq=4ejak317mxz4has0i+1rslbqu29zx18jwlb2j...@mail.gmail.com Of course both things do not have the same coverage, as the former is for buildfarm and dev, while the latter is dedicated to production systems, but could be used for development as well. Oh, I had forgotten that other patch. The patch sent there is a bit outdated, but a potential implementation gets simpler with XLogReadBufferForRedo able to return flags about each block state during redo. I am still planning to come back to it for this cycle, though I stopped for now, waiting for the WAL format patches to finish shaping the APIs this feature would rely on. I agree it makes sense to wait until the WAL reworks are done -- glad to hear you're putting some time in this area. Thanks, -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
[HACKERS] Representing a SRF return column in catalogs
Hi, I am working on something that requires representing a SRF return column in pg_proc and being able to retrieve it, retrieve the name of the column, and make a ColumnRef node from it. The catch here is aliases: SELECT generate_series(1,100) AS a ORDER BY a; I need to know that the return column of generate_series is being referred to as a in this query (hence tlist, sortClause will have 'c'). I discussed and think that a way can be to have the position of the column rather than the actual name, but that raises the question of the functionality needed for back-resolving position to name (to make a ColumnRef node) and then inferring that the column is being referred to by an alias in this query. Any pointers will be really helpful. Regards, Atri
Re: [HACKERS] get_cast_func syscache utility function
On 11/04/2014 01:45 PM, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: here's a patch for a utility function to look up the cast function for a from/to pair of types, as recently suggested by Alvaro. Although it only contains one use (in json.c), the upcoming jsonb generators would also use it twice. I'd like to get this committed fairly quickly so I can prepare an updated patch for the jsonb generators. I'm not exactly convinced that this is an appropriate utility function, because casts for which a COERCION_METHOD_FUNCTION entry exists in pg_cast are by no means the whole universe of casts. I'm concerned that creating this function will encourage patch authors to blow off other possibilities when they should not. In the particular context at hand, it seems like you might be better advised to think about how to refactor json_categorize_type to be more helpful to your new use-case. A concrete example of what I'm worried about is that json_categorize_type already ignores the possibility of binary-compatible (WITHOUT FUNCTION) casts to json. Since it's eliminated the domain case earlier, that's perhaps not too horridly broken; but it'd be very easy for new uses of this get_cast_func() function to overlook such considerations. In short, I'd rather see this addressed through functions with slightly higher-level APIs that are capable of covering more cases. In most cases it'd be best if callers were using find_coercion_pathway() rather than taking shortcuts. Well, then, do we really need a wrapper at all? Should we just be doing something like this? 
if (typoid >= FirstNormalObjectId)
{
    Oid castfunc;
    CoercionPathType ctype;

    ctype = find_coercion_pathway(JSONOID, typoid, COERCION_EXPLICIT,
                                  &castfunc);
    if (ctype == COERCION_PATH_FUNC && OidIsValid(castfunc))
    {
        *tcategory = JSONTYPE_CAST;
        *outfuncoid = castfunc;
    }
}

cheers andrew
Re: [HACKERS] [REVIEW] Re: Fix xpath() to return namespace definitions
2014-11-04 22:16 GMT+07:00 Peter Eisentraut pete...@gmx.net: On 10/6/14 10:24 PM, Ali Akbar wrote: While reviewing the patch myself, i spotted some formatting problems in the code. Fixed in this v5 patch. Also, this patch uses context patch format (in first versions, because of the small modification, context patch format obfuscates the changes. After reimplementation this isn't the case anymore) I think the problem this patch is addressing is real, and your approach is sound, but I'd ask you to go back to the xmlCopyNode() version, and try to add a test case for why the second argument = 1 is necessary. I don't see any other problems. OK. Because upstream code is fixed in the current version, i'll revert to the previous version. Test case added to regression test. With the =1 argument, the result is correct:

    <local:piece xmlns:local="http://127.0.0.1" xmlns="http://127.0.0.2" id="1"><internal>number one</internal><internal2/></local:piece>

Without the argument, the result is not correct: all children will be lost. Because of that, the other regression test will fail too, because the children are not copied:

*** 584,593 ***
  -- Text XPath expressions evaluation
  SELECT xpath('/value', data) FROM xmltest;
!        xpath
! ----------------------
!  {<value>one</value>}
!  {<value>two</value>}
  (2 rows)
  SELECT xpath(NULL, NULL) IS NULL FROM xmltest;
--- 584,593 ----
  -- Text XPath expressions evaluation
  SELECT xpath('/value', data) FROM xmltest;
!    xpath
! ------------
!  {<value/>}
!  {<value/>}
  (2 rows)
  SELECT xpath(NULL, NULL) IS NULL FROM xmltest;
*** ... cut

updated patch attached. I noticed a somewhat big performance regression if the xml is large (i use the PRODML Product Volume sample from energistics.org):

* Without patch (tested 3 times):

select unnest(xpath('//a:flow', x, ARRAY[['a','http://www.prodml.org/schemas/1series']])) from u;
        unnest
-----------------------
 <flow>               +
 <kind>gas lift</kind>+
 ...

Time: 84,012 ms
Time: 85,683 ms
Time: 88,766 ms

* With latest v6 patch (notice the correct result with namespace definition):

select unnest(xpath('//a:flow', x, ARRAY[['a','http://www.prodml.org/schemas/1series']])) from u;
                      unnest
--------------------------------------------------
 <flow xmlns="http://www.prodml.org/schemas/1series">+
 ...

Time: 108,912 ms
Time: 108,267 ms
Time: 114,848 ms

It's a 23% performance regression.

* Just curious, i'm also testing v5 patch performance (notice the namespace in the result):

select unnest(xpath('//a:flow', x, ARRAY[['a','http://www.prodml.org/schemas/1series']])) from u;
                      unnest
--------------------------------------------------
 <flow xmlns="http://www.prodml.org/schemas/1series">+
 <kind>gas lift</kind>                               +
 ...

Time: 92,552 ms
Time: 97,440 ms
Time: 99,309 ms

The regression is only 13%. I know the xmlCopyNode() version (v6 patch) is much cleaner than the v5 patch; should we consider the performance benefit? Anyway, thanks for the review! :) Regards, -- Ali Akbar
[HACKERS] Time to remove dummy autocommit GUC?
There's a thread over in pgsql-admin http://www.postgresql.org/message-id/flat/1317404836.812502.1415174912912.javamail.ya...@jws11120.mail.ir2.yahoo.com suggesting that the server-side autocommit GUC causes more confusion than it's worth, because newbies think it should reflect the state of client-side autocommit behavior. We left that variable in there back in 7.4, on backward-compatibility grounds that were a bit debatable even then. ISTM that by now we could just flat-out remove it. Surely no client-side code is still looking at it. Thoughts, objections? regards, tom lane
Re: [HACKERS] Sequence Access Method WIP
On 05/11/14 13:45, Heikki Linnakangas wrote: On 11/04/2014 11:01 PM, Petr Jelinek wrote: On 04/11/14 13:11, Heikki Linnakangas wrote: On 10/13/2014 01:01 PM, Petr Jelinek wrote: Only the alloc and reloptions methods are required (and implemented by the local AM). The caching, xlog writing, updating the page, etc is handled by backend, the AM does not see the tuple at all. I decided to not pass even the struct around and just pass the relevant options because I think if we want to abstract the storage properly then the AM should not care about how the pg_sequence looks like at all, even if it means that the sequence_alloc parameter list is bit long. Hmm. The division of labour between the seqam and commands/sequence.c still feels a bit funny. sequence.c keeps track of how many values have been WAL-logged, and thus usable immediately, but we still call sequence_alloc even when using up those already WAL-logged values. If you think of using this for something like a centralized sequence server in a replication cluster, you certainly don't want to make a call to the remote server for every value - you'll want to cache them. With the local seqam, there are two levels of caching. Each backend caches some values (per the CACHE value option in CREATE SEQUENCE). In addition to that, the server WAL-logs 32 values at a time. If you have a remote seqam, it would most likely add a third cache, but it would interact in strange ways with the second cache. Considering a non-local seqam, the locking is also a bit strange. The server keeps the sequence page locked throughout nextval(). But if the actual state of the sequence is maintained elsewhere, there's no need to serialize the calls to the remote allocator, i.e. the sequence_alloc() calls. I'm not exactly sure what to do about that. One option is to completely move the maintenance of the current value, i.e. sequence.last_value, to the seqam. That makes sense from an abstraction point of view. 
For example with a remote server managing the sequence, storing the current value in the local catalog table makes no sense as it's always going to be out-of-date. The local seqam would store it as part of the am-private data. However, you would need to move the responsibility of locking and WAL-logging to the seqam. Maybe that's OK, but we'll need to provide an API that the seqam can call to do that. Perhaps just let the seqam call heap_inplace_update on the sequence relation. My idea of how this works is - sequence_next handles the allocation and the level2 caching (WAL logged cache) via amdata if it supports it, or returns single value if it doesn't - then the WAL will always just write the 1 value and there will basically be no level2 cache, since it is the sequence_next who controls how much will be WAL-logged, what backend asks for is just suggestion. Hmm, so the AM might return an nallocated value less than the fetch value that sequence.c requested? As the patch stands, wouldn't that make sequence.c write a WAL record more often? That's correct, that's also why you usually want to have some form of local caching if possible. In fact, if the seqam manages the current value outside the database (e.g. a remote seqam that gets the value from another server), nextval() never needs to write a WAL record. Sure it does, you need to keep the current state in Postgres also, at least the current value so that you can pass correct input to sequence_alloc(). And you need to do this in crash-safe way so WAL is necessary. I think sequences will cache in amdata if possible for that type of sequence and in case where it's not and it's caching on some external server then you'll most likely get bigger overhead from network than WAL anyway... 
For the amdata handling (which is the AM's private data variable) the API assumes that (Datum) 0 is NULL, this seems to work well for reloptions so should work here also and it simplifies things a little compared to passing pointers to pointers around and making sure everything is allocated, etc. Sadly the fact that amdata is not fixed size and can be NULL made the page updates of the sequence relation quite more complex that it used to be. It would be nice if the seqam could define exactly the columns it needs, with any datatypes. There would be a set of common attributes: sequence_name, start_value, cache_value, increment_by, max_value, min_value, is_cycled. The local seqam would add last_value, log_cnt and is_called to that. A remote seqam that calls out to some other server might store the remote server's hostname etc. There could be a seqam function that returns a TupleDesc with the required columns, for example. Wouldn't that somewhat bloat catalog if we had new catalog table for each sequence AM? No, that's not what I meant. The number of catalog tables would be the same as today. Sequences look much like any other relation, with entries in pg_attribute catalog table for all the attributes for each
Re: [HACKERS] get_cast_func syscache utility function
Andrew Dunstan and...@dunslane.net writes: On 11/04/2014 01:45 PM, Tom Lane wrote: In short, I'd rather see this addressed through functions with slightly higher-level APIs that are capable of covering more cases. In most cases it'd be best if callers were using find_coercion_pathway() rather than taking shortcuts. Well, then, do we really need a wrapper at all? Should we just be doing something like this?

if (typoid >= FirstNormalObjectId)
{
    Oid castfunc;
    CoercionPathType ctype;

    ctype = find_coercion_pathway(JSONOID, typoid, COERCION_EXPLICIT,
                                  &castfunc);
    if (ctype == COERCION_PATH_FUNC && OidIsValid(castfunc))
    {
        *tcategory = JSONTYPE_CAST;
        *outfuncoid = castfunc;
    }
}

Well, of course, the question that immediately raises is why isn't this code handling the other possible CoercionPathTypes ;-). But at least it's pretty obvious from the code that you are ignoring such cases, so yes I think this is better than what's there now. regards, tom lane
[HACKERS] Order of views in stats docs
http://www.postgresql.org/docs/9.3/static/monitoring-stats.html, table 27-1. Can somebody find or explain the order of the views in there? It's not actually alphabetical, but it's also not logical. In particular, what is pg_stat_replication doing second to last? I would suggest we move pg_stat_replication up to directly under pg_stat_activity, and move pg_stat_database_conflicts up to directly under pg_stat_database. I think the rest makes reasonable sense. Any objections to this? Can anybody spot a reason for why they are where they are other than that it was just appended to the end of the table without realizing the order that I'm missing now and am about to break? -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Re: [HACKERS] Order of views in stats docs
Magnus Hagander mag...@hagander.net writes: http://www.postgresql.org/docs/9.3/static/monitoring-stats.html, table 27-1. Can somebody find or explain the order of the views in there? It's not actually alphabetical, but it's also not logical. In particular, what is pg_stat_replication doing second to last? I would suggest we move pg_stat_replication up to directly under pg_stat_activity, and move pg_stat_database_conflicts up to directly under pg_stat_database. I think the rest makes reasonable sense. Any objections to this? Can anybody spot a reason for why they are where they are other than that it was just appended to the end of the table without realizing the order that I'm missing now and am about to break? I agree that the last two items seem to be suffering from blindly-add-it-to-the-end syndrome, which is a disease that runs rampant around here. However, should we consider the possibility of changing the table to straight alphabetical ordering? I'm not as much in love with that approach as some folks, but it does have the merit that it's always clear where you ought to put a new item. This would result in grouping the all, sys, and user views separately, rather than grouping those variants of a view together ... but on reflection I'm not sure that that'd be totally horrible. regards, tom lane
Re: [HACKERS] tracking commit timestamps
On 11/5/14, 6:10 AM, Michael Paquier wrote: In addition, I wonder if this feature would be misused. Record transaction ids to a table to find out commit order (use case could be storing historical row versions for example). Do a dump and restore on another cluster, and all the transaction ids are completely meaningless to the system. I think you are forgetting the fact to be able to take a consistent dump using an exported snapshot. In this case the commit order may not be that meaningless.. Anssi's point is that you can't use xmin because it can change, but I think anyone working with this feature would understand that. Having the ability to record commit order into an audit table would be extremely welcome, but as is, this feature doesn't provide it. That's something that can actually be achieved with this feature if the SQL interface is able to query all the timestamps in a xid range with for example a background worker that tracks this data periodically. Now the thing is as well: how much timestamp history do we want to keep? The patch truncating SLRU files with frozenID may cover a sufficient range... Except that commit time is not guaranteed unique *even on a single system*. That's my whole point. If we're going to bother with all the commit time machinery it seems really silly to not provide a way to uniquely order every commit. Clearly trying to uniquely order commits across multiple systems is a far larger problem, and I'm not suggesting we attempt that. But for a single system AIUI all we need to do is expose the LSN of each commit record and that will give you the exact and unique order in which transactions committed. This isn't a hypothetical feature either; if we had this, logical replication systems wouldn't have to try and fake this via batches. You could actually recreate exactly what data was visible at what time to all transactions, not just repeatable read ones (as long as you kept snapshot data as well, which isn't hard).
As for how much data to keep, if you have a process that's doing something to record this information permanently all it needs to do is keep an old enough snapshot around. That's not that hard to do, even from user space. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
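The argument above — commit timestamps can collide at microsecond resolution, while each commit record's LSN (a byte position in the WAL) is unique and strictly increasing on a single system — can be sketched like this. All concrete values are invented for the illustration:

```python
# Sketch: timestamps alone cannot always distinguish two commits,
# but commit-record LSNs always give an exact, unique order on one
# system. The xids, timestamps, and LSNs below are made up.
commits = [
    {"xid": 100, "commit_ts": 1_000_000, "lsn": 0x16B3768},
    {"xid": 101, "commit_ts": 1_000_000, "lsn": 0x16B37F0},  # same µs as xid 100
    {"xid": 102, "commit_ts": 1_000_002, "lsn": 0x16B3888},
]

# Do timestamps collide? Do LSNs?
ts_ties = len({c["commit_ts"] for c in commits}) < len(commits)
lsn_ties = len({c["lsn"] for c in commits}) < len(commits)

# Ordering by LSN recovers the exact commit order.
by_lsn = [c["xid"] for c in sorted(commits, key=lambda c: c["lsn"])]
print(ts_ties, lsn_ties, by_lsn)  # True False [100, 101, 102]
```

This is only a model of the property being discussed, not of any proposed SQL interface.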
Re: [HACKERS] Time to remove dummy autocommit GUC?
Tom Lane t...@sss.pgh.pa.us wrote: ISTM that by now we could just flat-out remove it. +1 -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] tracking commit timestamps
On 11/5/14, 10:30 AM, Andres Freund wrote: Except that commit time is not guaranteed unique *even on a single system*. That's my whole point. If we're going to bother with all the commit time machinery it seems really silly to provide a way to uniquely order every commit. Well. I think that's the misunderstanding here. That's absolutely not what committs is supposed to be used for. For the replication stream you'd hopefully use logical decoding. That gives you the transaction data exactly in commit order. So presumably you'd want to use logical decoding to insert into a table with a sequence on it, or similar? I agree, that sounds like a better way to handle this. I think it's worth mentioning in the docs for commit_ts, because people WILL mistakenly try and use it to determine commit ordering. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] tracking commit timestamps
On 2014-11-05 10:23:15 -0600, Jim Nasby wrote: On 11/5/14, 6:10 AM, Michael Paquier wrote: In addition, I wonder if this feature would be misused. Record transaction ids to a table to find out commit order (use case could be storing historical row versions for example). Do a dump and restore on another cluster, and all the transaction ids are completely meaningless to the system. I think you are forgetting the fact to be able to take a consistent dump using an exported snapshot. In this case the commit order may not be that meaningless.. Anssi's point is that you can't use xmin because it can change, but I think anyone working with this feature would understand that. Having the ability to record commit order into an audit table would be extremely welcome, but as is, this feature doesn't provide it. That's something that can actually be achieved with this feature if the SQL interface is able to query all the timestamps in a xid range with for example a background worker that tracks this data periodically. Now the thing is as well: how much timestamp history do we want to keep? The patch truncating SLRU files with frozenID may cover a sufficient range... Except that commit time is not guaranteed unique *even on a single system*. That's my whole point. If we're going to bother with all the commit time machinery it seems really silly to provide a way to uniquely order every commit. Well. I think that's the misunderstanding here. That's absolutely not what committs is supposed to be used for. For the replication stream you'd hopefully use logical decoding. That gives you the transaction data exactly in commit order. Clearly trying to uniquely order commits across multiple systems is a far larger problem, and I'm not suggesting we attempt that. But for a single system AIUI all we need to do is expose the LSN of each commit record and that will give you the exact and unique order in which transactions committed. I don't think that's something you should attempt. 
That's what logical decoding is for. Hence I see little point in exposing the LSN that way. Where I think committs is useful is as a method for analyzing and resolving conflicts between multiple systems. In that case you likely can't use the LSN for anything, as it won't be very meaningful. If you get conflicts below the accuracy of the timestamps you'd better use another deterministic method of resolving them - BDR e.g. compares the system identifier, timeline id, database oid, and a user defined name, while enforcing that those aren't the same between systems. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
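The BDR-style tie-break described above can be sketched as a tuple comparison: compare commit timestamps first, and fall back to identifiers that are guaranteed to differ between nodes. This is only an illustration of the idea, not BDR's actual code, and every concrete value below is invented:

```python
# Sketch of deterministic last-update-wins conflict resolution:
# the resolution key is (commit timestamp, system identifier,
# timeline id, database oid, node name), so even with identical
# timestamps every node computes the same winner.
def resolution_key(update):
    return (
        update["commit_ts"],
        update["sysid"],
        update["timeline"],
        update["dboid"],
        update["node_name"],
    )

a = {"commit_ts": 500, "sysid": 6011, "timeline": 1, "dboid": 16384, "node_name": "nodeA"}
b = {"commit_ts": 500, "sysid": 6012, "timeline": 1, "dboid": 16384, "node_name": "nodeB"}

# Equal timestamps: the tuple comparison still yields a total order,
# so both sides deterministically pick the same winner.
winner = max((a, b), key=resolution_key)
print(winner["node_name"])  # nodeB
```

The design point is that the fallback fields are never equal across nodes, so the comparison can never declare a tie.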
Re: [HACKERS] B-Tree index builds, CLUSTER, and sortsupport
On 10/10/14, 7:26 PM, Peter Geoghegan wrote: Both Robert and Heikki expressed dissatisfaction with the fact that B-Tree index builds don't use sortsupport. Because B-Tree index builds cannot really use the onlyKey optimization, the historic oversight of not supporting B-Tree builds (and CLUSTER) might have been at least a little understandable [1]. But with the recent work on sortsupport for text, it's likely that B-Tree index builds will be missing out on rather a lot by not taking advantage of sortsupport. Attached patch modifies tuplesort to correct this oversight. What's really nice about it is that there is actually a net negative code footprint: Did anything ever happen with this? -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tracking commit timestamps
On 2014-11-05 10:34:40 -0600, Jim Nasby wrote: On 11/5/14, 10:30 AM, Andres Freund wrote: Except that commit time is not guaranteed unique *even on a single system*. That's my whole point. If we're going to bother with all the commit time machinery it seems really silly to not provide a way to uniquely order every commit. Well. I think that's the misunderstanding here. That's absolutely not what committs is supposed to be used for. For the replication stream you'd hopefully use logical decoding. That gives you the transaction data exactly in commit order. So presumably you'd want to use logical decoding to insert into a table with a sequence on it, or similar? I'm not following. I'd use logical decoding to replicate the data to another system, thereby guaranteeing its done in commit order. Then, when applying the data on the other side, I can detect/resolve some forms of conflicts by looking at the timestamps of rows via committs. I agree, that sounds like a better way to handle this. I think it's worth mentioning in the docs for commit_ts, because people WILL mistakenly try and use it to determine commit ordering. Ok, sounds sensible. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Repeatable read and serializable transactions see data committed after tx start
On 11/4/14, 6:11 PM, Álvaro Hernández Tortosa wrote: Should we then improve the docs to state this more clearly? Any objection to doing this? If we go that route we should also mention that now() will no longer be doing what you probably hope it would (AFAIK it's driven by BEGIN and not the first snapshot). Perhaps we should change how now() works, but I'm worried about what that might do to existing applications... -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] get_cast_func syscache utility function
On 11/05/2014 10:10 AM, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: On 11/04/2014 01:45 PM, Tom Lane wrote: In short, I'd rather see this addressed through functions with slightly higher-level APIs that are capable of covering more cases. In most cases it'd be best if callers were using find_coercion_pathway() rather than taking shortcuts. Well, then, do we really need a wrapper at all? Should we just be doing something like this?

    if (typoid >= FirstNormalObjectId)
    {
        Oid castfunc;
        CoercionPathType ctype;

        ctype = find_coercion_pathway(JSONOID, typoid,
                                      COERCION_EXPLICIT,
                                      &castfunc);
        if (ctype == COERCION_PATH_FUNC && OidIsValid(castfunc))
        {
            *tcategory = JSONTYPE_CAST;
            *outfuncoid = castfunc;
        }
    }

Well, of course, the question that immediately raises is why isn't this code handling the other possible CoercionPathTypes ;-). But at least it's pretty obvious from the code that you are ignoring such cases, so yes I think this is better than what's there now. Also, it's equivalent to what's there now, I think. I wasn't intending to change the behaviour - if someone wants to do that they can submit a separate patch. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Time to remove dummy autocommit GUC?
On Nov 5, 2014 5:27 PM, Kevin Grittner kgri...@ymail.com wrote: Tom Lane t...@sss.pgh.pa.us wrote: ISTM that by now we could just flat-out remove it. +1 +1. I thought we had already removed it years ago :-) /Magnus
Re: [HACKERS] BRIN indexes - TRAP: BadArgument
On Tue, Nov 4, 2014 at 2:28 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: I was clearly too careless about testing the xlog code --- it had numerous bugs. This version should be a lot better, but there might be problems lurking still as I don't think I covered it all. Let me know if you see anything wrong. At line 252 of brin_xlog.c, should the UnlockReleaseBuffer(metabuf) be protected by a BufferIsValid? XLogReadBufferForRedo says that it might return an invalid buffer under some situations. Perhaps it is known that those situations can't apply here? Now I am getting segfaults during normal (i.e. no intentional crashes) operations. I think I was seeing them sometimes before as well, I just wasn't looking for them. The attached script invokes the segfault within a few minutes. A lot of the stuff in the script is probably not necessary, I just didn't spend the time to pare it down to the essentials. It does not need to be in parallel, I get the crash when invoked with only one job (perl ~/brin_crash.pl 1). I think this is related to having block ranges which have no tuples in them when they are first summarized.
If I take out the "with t as (delete from foo returning *) insert into foo select * from t", then I don't see the crashes.

#0  0x0089ed3e in pg_detoast_datum_packed (datum=0x0) at fmgr.c:2270
#1  0x00869be9 in text_le (fcinfo=0x7fff1bf6b9f0) at varlena.c:1661
#2  0x0089cfc7 in FunctionCall2Coll (flinfo=0x297e640, collation=100, arg1=0, arg2=43488216) at fmgr.c:1324
#3  0x004678f8 in minmaxConsistent (fcinfo=0x7fff1bf6be40) at brin_minmax.c:213
#4  0x0089d0c9 in FunctionCall3Coll (flinfo=0x297b830, collation=100, arg1=43509512, arg2=43510296, arg3=43495856) at fmgr.c:1349
#5  0x00462484 in bringetbitmap (fcinfo=0x7fff1bf6c310) at brin.c:469
#6  0x0089cfc7 in FunctionCall2Coll (flinfo=0x28f2440, collation=0, arg1=43495712, arg2=43497376) at fmgr.c:1324
#7  0x004b3fc9 in index_getbitmap (scan=0x297b120, bitmap=0x297b7a0) at indexam.c:651
#8  0x0062ece0 in MultiExecBitmapIndexScan (node=0x297af30) at nodeBitmapIndexscan.c:89
#9  0x00619783 in MultiExecProcNode (node=0x297af30) at execProcnode.c:550
#10 0x0062dea2 in BitmapHeapNext (node=0x2974750) at nodeBitmapHeapscan.c:104

Cheers, Jeff

brin_crash.pl Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
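[Editor's note: the backtrace shows minmaxConsistent handing a NULL datum to text_le, apparently for block ranges that were summarized while they contained no tuples. A toy Python model of why a min/max consistency check must special-case such ranges rather than compare their (nonexistent) bounds — purely illustrative, not the brin_minmax.c logic:]

```python
# Toy model of a BRIN min/max block-range summary.  A range summarized
# while it contained no tuples has no min/max values; the consistency
# check must not compare against them (the crash above is exactly that:
# a comparison function called on a NULL datum).

class RangeSummary:
    def __init__(self, minval=None, maxval=None, has_data=False):
        self.minval = minval
        self.maxval = maxval
        self.has_data = has_data

def range_consistent(summary, key):
    """Could any tuple in this summarized block range equal 'key'?"""
    if not summary.has_data:
        # Summarized-while-empty range: there is nothing to compare, and
        # it cannot contain a match, so skip it -- never dereference the
        # missing min/max.
        return False
    return summary.minval <= key <= summary.maxval
```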
Re: [HACKERS] Sequence Access Method WIP
On 11/05/2014 05:07 PM, Petr Jelinek wrote: On 05/11/14 13:45, Heikki Linnakangas wrote: In fact, if the seqam manages the current value outside the database (e.g. a remote seqam that gets the value from another server), nextval() never needs to write a WAL record. Sure it does, you need to keep the current state in Postgres also, at least the current value so that you can pass correct input to sequence_alloc(). And you need to do this in a crash-safe way, so WAL is necessary. Why does sequence_alloc need the current value? If it's a remote seqam, the current value is kept in the remote server, and the last value that was given to this PostgreSQL server is irrelevant. That irks me with this API. The method for acquiring a new value isn't fully abstracted behind the AM interface, as sequence.c still needs to track it itself. That's useful for the local AM, of course, and maybe some others, but for others it's totally useless. For the amdata handling (which is the AM's private data variable) the API assumes that (Datum) 0 is NULL, this seems to work well for reloptions so should work here also and it simplifies things a little compared to passing pointers to pointers around and making sure everything is allocated, etc. Sadly the fact that amdata is not fixed size and can be NULL made the page updates of the sequence relation quite a bit more complex than they used to be. It would be nice if the seqam could define exactly the columns it needs, with any datatypes. There would be a set of common attributes: sequence_name, start_value, cache_value, increment_by, max_value, min_value, is_cycled. The local seqam would add last_value, log_cnt and is_called to that. A remote seqam that calls out to some other server might store the remote server's hostname etc. There could be a seqam function that returns a TupleDesc with the required columns, for example. Wouldn't that somewhat bloat the catalog if we had a new catalog table for each sequence AM? No, that's not what I meant.
The number of catalog tables would be the same as today. Sequences look much like any other relation, with entries in the pg_attribute catalog table for all the attributes for each sequence. Currently, all sequences have the same set of attributes, sequence_name, last_value and so forth. What I'm proposing is that there would be a set of attributes that are common to all sequences, but in addition to that there could be any number of AM-specific attributes. Oh, that's an interesting idea, so the AM interfaces would basically return an updated tuple and there would be some description function that returns the tupledesc. Yeah, something like that. I am a bit worried that this would kill any possibility of ALTER SEQUENCE USING access_method. Plus I don't think it actually solves any real problem - serializing the internal C structs into bytea is not any harder than serializing them into a tuple IMHO. I agree that serialization to bytea isn't that difficult, but it's still nicer to work directly with the correct data types. And it makes the internal state easily accessible for monitoring and debugging purposes. It also does not really solve the amdata being dynamic size issue. Yes it would. There would not be a single amdata attribute, but the AM could specify any number of custom attributes, which could be fixed size or varlen. It would be solely the AM's responsibility to set the values of those attributes. That's not the issue I was referring to, I was talking about the page replacement code, which is not as simple now that we have a potentially dynamic-size tuple, and if tuples were different for different AMs the code would still have to be able to handle that case. Setting the values in the tuple itself is not too complicated. I don't see the problem with that. We deal with variable-sized tuples in heap pages all the time. The max size of amdata (or the extra AM-specific columns) is going to be determined by the block size, though.
- Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
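[Editor's note: Heikki's point — that for a "remote" sequence AM the locally tracked last value is irrelevant, while a local AM must persist its state after every allocation — can be sketched as below. The class and method names (alloc, fetch) are invented for this example; they are not the patch's actual C API.]

```python
# Illustrative sketch of the two sequence-AM shapes discussed above.

class LocalSeqAM:
    """Classic local sequence: state (last_value etc.) lives in the
    sequence relation and must be WAL-logged after each allocation."""
    def __init__(self, start=1, increment=1):
        self.last_value = start - increment
        self.increment = increment

    def alloc(self):
        self.last_value += self.increment
        return self.last_value  # caller persists (WAL-logs) the new state

class RemoteSeqAM:
    """Remote sequence: values come from another node, so there is no
    meaningful local 'current value' to track or WAL-log."""
    def __init__(self, fetch):
        self.fetch = fetch  # callable standing in for a network round-trip

    def alloc(self):
        return self.fetch()  # nothing local to persist
```

The asymmetry is the design tension in the thread: sequence.c tracking the current value is essential for LocalSeqAM and dead weight for RemoteSeqAM.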
Re: [HACKERS] B-Tree index builds, CLUSTER, and sortsupport
On Wed, Nov 5, 2014 at 8:36 AM, Jim Nasby jim.na...@bluetreble.com wrote: Did anything ever happen with this? I consider this to be related to the abbreviated keys stuff, although it is clearly independently valuable (so it could be reviewed independently). Since Robert ran out of time to work on abbreviated keys (hopefully temporarily), it's not all that surprising that no one has paid attention to this smaller piece of work. I see that Andreas Karlsson is signed up to review this, so that's something. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] to_char_at_timezone()?
On 11/04/2014 04:04 PM, Marko Tiikkaja wrote: On 11/5/14, 12:59 AM, Tom Lane wrote: Marko Tiikkaja ma...@joh.to writes: So I got into thinking whether it would make sense to provide a new function, say, to_char_at_timezone() to solve this problem. For example: ... Any thoughts? The patch is quite trivial. I'm not convinced that it's all that trivial. Is the input timestamp or timestamptz, and what's the semantics of that exactly (ie what timezone rotation happens)? One's first instinct is often wrong in this area. In my example, the input is a timestamptz, and the output is converted to the target time zone the same way timestamptz_out() does, except based on the input timezone instead of TimeZone. Not sure whether it would make sense to do this for timestamp, or whether there's even a clear intuitive behaviour there. Why wouldn't we just add the timezone as an additional parameter? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] to_char_at_timezone()?
On 11/5/14, 7:36 PM, Josh Berkus wrote: On 11/04/2014 04:04 PM, Marko Tiikkaja wrote: In my example, the input is a timestamptz, and the output is converted to the target time zone the same way timestamptz_out() does, except based on the input timezone instead of TimeZone. Not sure whether it would make sense to do this for timestamp, or whether there's even a clear intuitive behaviour there. Why wouldn't we just add the timezone as an additional parameter? Are you suggesting that we add a new overload of to_char() instead of a new function to_char_at_timezone()? That sounds a bit confusing, but that might just be me. .marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
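[Editor's note: the semantics Marko proposes — render an absolute instant (timestamptz) using a caller-supplied zone instead of the session's TimeZone, the way timestamptz_out() would — can be modeled in a few lines of Python. Fixed offsets are used to keep the example self-contained; this is not the Postgres implementation.]

```python
# Model of the proposed to_char_at_timezone() behavior: same instant,
# formatted at an explicitly supplied zone rather than the session zone.
from datetime import datetime, timedelta, timezone

def to_char_at_timezone(instant, fmt, tz):
    """Format an aware datetime in the given target zone."""
    return instant.astimezone(tz).strftime(fmt)

utc_instant = datetime(2014, 11, 5, 12, 0, tzinfo=timezone.utc)
helsinki_winter = timezone(timedelta(hours=2))  # UTC+2 stand-in
print(to_char_at_timezone(utc_instant, "%Y-%m-%d %H:%M", helsinki_winter))
# -> 2014-11-05 14:00
```

Josh's suggestion amounts to the same thing spelled as an extra parameter to to_char() rather than a separately named function.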
Re: [HACKERS] pg_multixact not getting truncated
On 11/3/14, 7:40 PM, Josh Berkus wrote: On 11/03/2014 05:24 PM, Josh Berkus wrote: BTW, the reason I started poking into this was a report from a user that they have a pg_multixact directory which is 21GB in size, and is 2X the size of the database. Here's XID data:

Latest checkpoint's NextXID:          0/1126461940
Latest checkpoint's NextOID:          371135838
Latest checkpoint's NextMultiXactId:  162092874
Latest checkpoint's NextMultiOffset:  778360290
Latest checkpoint's oldestXID:        945761490
Latest checkpoint's oldestXID's DB:   370038709
Latest checkpoint's oldestActiveXID:  1126461940
Latest checkpoint's oldestMultiXid:   123452201
Latest checkpoint's oldestMulti's DB: 370038709

Oldest mxid file is 29B2, newest is 3A13. No tables had a relminmxid of 1 (some of the system tables were 0, though), and the data from pg_class and pg_database is consistent. More tidbits: I just did a quick check on customer systems (5 of them). This only seems to be happening on customer systems where I specifically know there is a high number of FK lock waits (the system above gets around 1000 per minute that we know of). Other systems with higher transaction rates don't exhibit this issue; I checked a 9.3.5 database which normally needs to do XID wraparound once every 10 days, and its pg_multixact is only 48K (it has one file, ). Note that pg_clog on the bad machine is only 64K in size. How many IDs are there per mxid file? #define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset)) So for 8k blocks, there are 2k offsets (really MultiXactIds) per page, 32 pages per SLRU segment. Your file names aren't making sense to me. :( If I'm doing the math correctly, 29B2 is MXID 699 531 264 and 3A13 is 974 323 712. You're only looking in pg_multixact/members/, yes? Relevant code starts in vacuum.c/vac_update_datfrozenxid(). If there's any rows in pg_class for tables/matviews/toast with either relfrozenxid > next XID or relminmxid > next MXID then the code *silently* pulls the plug right there.
IMO we should at least issue a warning. That you see relminmxid advancing tells me this isn't the case here. ForceTransactionIdLimitUpdate() is a bit suspect in that it only looks at xidVacLimit, but if it were breaking then you wouldn't see pg_database minmxid advancing. Looking through TruncateMultiXact, I don't see anything that could prevent truncation, unless the way we're handling MultiXactID wraparound is broken (which I don't see any indication of). Can you post the contents of pg_multixact/members/? -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
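[Editor's note: the segment-filename arithmetic Jim uses above can be checked directly — with 8 kB pages, each pg_multixact/offsets page holds 8192 / 4 = 2048 entries and each SLRU segment file spans 32 pages, i.e. 65536 MultiXactIds per segment. (For the members/ directory the per-page count differs, which is the confusion resolved in the next message.)]

```python
# Map a pg_multixact/offsets segment file name (hex) to the first
# MultiXactId it covers, using the constants quoted above.

BLCKSZ = 8192
SIZEOF_MULTIXACT_OFFSET = 4
OFFSETS_PER_PAGE = BLCKSZ // SIZEOF_MULTIXACT_OFFSET  # 2048
PAGES_PER_SEGMENT = 32

def first_mxid_of_segment(segname):
    """First MultiXactId covered by an offsets segment file."""
    return int(segname, 16) * OFFSETS_PER_PAGE * PAGES_PER_SEGMENT

print(first_mxid_of_segment("29B2"))  # -> 699531264
print(first_mxid_of_segment("3A13"))  # -> 974323712
```

These reproduce Jim's figures of 699 531 264 and 974 323 712, both far beyond the reported NextMultiXactId of 162 092 874 — hence his confusion about the file names.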
Re: [HACKERS] pg_multixact not getting truncated
On 11/05/2014 10:40 AM, Jim Nasby wrote: On 11/3/14, 7:40 PM, Josh Berkus wrote: On 11/03/2014 05:24 PM, Josh Berkus wrote: BTW, the reason I started poking into this was a report from a user that they have a pg_multixact directory which is 21GB in size, and is 2X the size of the database. Here's XID data:

Latest checkpoint's NextXID:          0/1126461940
Latest checkpoint's NextOID:          371135838
Latest checkpoint's NextMultiXactId:  162092874
Latest checkpoint's NextMultiOffset:  778360290
Latest checkpoint's oldestXID:        945761490
Latest checkpoint's oldestXID's DB:   370038709
Latest checkpoint's oldestActiveXID:  1126461940
Latest checkpoint's oldestMultiXid:   123452201
Latest checkpoint's oldestMulti's DB: 370038709

Oldest mxid file is 29B2, newest is 3A13. No tables had a relminmxid of 1 (some of the system tables were 0, though), and the data from pg_class and pg_database is consistent. More tidbits: I just did a quick check on customer systems (5 of them). This only seems to be happening on customer systems where I specifically know there is a high number of FK lock waits (the system above gets around 1000 per minute that we know of). Other systems with higher transaction rates don't exhibit this issue; I checked a 9.3.5 database which normally needs to do XID wraparound once every 10 days, and its pg_multixact is only 48K (it has one file, ). Note that pg_clog on the bad machine is only 64K in size. How many IDs are there per mxid file? #define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset)) So for 8k blocks, there are 2k offsets (really MultiXactIds) per page, 32 pages per SLRU segment. Your file names aren't making sense to me. :( If I'm doing the math correctly, 29B2 is MXID 699 531 264 and 3A13 is 974 323 712. You're only looking in pg_multixact/members/, yes? These are members, not offsets. What's stored in members?
Relevant code starts in vacuum.c/vac_update_datfrozenxid(). If there's any rows in pg_class for tables/matviews/toast with either relfrozenxid > next XID or relminmxid > next MXID then the code *silently* pulls the plug right there. IMO we should at least issue a warning. Wait, how would that situation arise in the first place? Wraparound is supposed to prevent it. Mind you, I checked pg_class, and it matches the minmxid shown by pg_database, so that's not the smoking gun. Can you post the contents of pg_multixact/members/? Well, not as of last week, obviously. https://gist.github.com/jberkus/d05db3629e8c898664c4 I haven't pasted all the filenames, because, well, look at the count. I also added the contents of the /offsets directory, for full information. Note that we've added 400 multixact files since my first email. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tracking commit timestamps
Jim Nasby jim.na...@bluetreble.com wrote: for a single system AIUI all we need to do is expose the LSN of each commit record and that will give you the exact and unique order in which transactions committed. This isn't a hypothetical feature either; if we had this, logical replication systems wouldn't have to try and fake this via batches. You could actually recreate exactly what data was visible at what time to all transactions, not just repeatable read ones (as long as you kept snapshot data as well, which isn't hard). Well, that's not entirely true for serializable transactions; there are points in time where reading the committed state could cause a transaction to roll back[1] -- either a writing transaction which would make that visible state inconsistent with the later committed state, or the reading transaction if it views something which is not (yet) consistent. That's not to say that this feature is a bad idea; part of the serializable implementation itself depends on being able to accurately determine commit order, and this feature could allow that to work more efficiently. I'm saying that, like hot standby, a replicated database could not provide truly serializable transactions (even read only ones) without something else in addition to this. We've discussed various ways of doing that. Perhaps the most promising is to include in the stream some indication of which points in the transaction stream are safe for a serializable transaction to read. If there's a way to implement commit order recording such that a two-state flag could be associated with each commit, I think it could be made to work for serializable transactions. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company [1] https://wiki.postgresql.org/wiki/SSI#Read_Only_Transactions -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
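[Editor's note: Kevin's suggestion — a one-bit "safe for serializable read" flag carried with each commit in the ordered stream — can be sketched as a consumer that only takes snapshots at flagged commits. This is a hypothetical design sketch of the idea, not anything implemented in PostgreSQL.]

```python
# Toy sketch: a read-only replica wanting truly serializable reads scans
# the ordered commit stream and only snapshots at commits flagged safe.

def latest_safe_snapshot(commit_stream):
    """Return the last commit LSN at which a serializable read is safe.

    commit_stream is an ordered iterable of (lsn, safe_flag) pairs;
    returns None if no safe point has been seen yet.
    """
    safe_lsn = None
    for lsn, safe in commit_stream:
        if safe:
            safe_lsn = lsn
    return safe_lsn
```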
Re: [HACKERS] WAL replay bugs
On 11/4/14 10:50 PM, Michael Paquier wrote: Except pg_upgrade, are there other tests using bash? There are a few obscure things under src/test/. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Order of views in stats docs
On 11/5/14 10:57 AM, Tom Lane wrote: However, should we consider the possibility of changing the table to straight alphabetical ordering? I'm not as much in love with that approach as some folks, but it does have the merit that it's always clear where you ought to put a new item. Yes, I think that property is important when developing in a loose community. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Repeatable read and serializable transactions see data committed after tx start
Jim Nasby jim.na...@bluetreble.com wrote: If we go that route we should also mention that now() will no longer be doing what you probably hope it would (AFAIK it's driven by BEGIN and not the first snapshot). There is also the fact that pg_stat_activity shows a connection as being idle in transaction as soon as BEGIN is executed; it does not wait for the snapshot to be acquired. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On 10/30/2014 09:19 PM, Andres Freund wrote: Some things I noticed while reading the patch: A lot of good comments, but let me pick up just two that are related: * There's a couple record types (e.g. XLOG_SMGR_TRUNCATE) that only refer to the relation, but not to the block number. These still log their rnode manually. Shouldn't we somehow deal with those in a similar way explicit block references are dealt with? * Hm. At least WriteMZeroPageXlogRec (and probably the same for all the other slru stuff) doesn't include a reference to the page. Isn't that bad? Shouldn't we make XLogRegisterBlock() usable for that case? Otherwise I fail to see how pg_rewind like tools can sanely deal with this? Yeah, there are still operations that modify relation pages, but don't store the information about the modified pages in the standard format. That includes XLOG_SMGR_TRUNCATE that you spotted, and XLOG_SMGR_CREATE, and also XLOG_DBASE_CREATE/DROP. And then there are updates to non-relation files, like all the slru stuff, relcache init files, etc. And updates to the FSM and VM bypass the full-page write mechanism too. To play it safe, pg_rewind copies all non-relation files as is. That includes all SLRUs, FSM and VM files, and everything else whose filename doesn't match the (main fork of) a relation file. Of course, that's a fair amount of copying to do, so we might want to optimize that in the future, but I want to nail the relation files first. They are usually an order of magnitude larger than the other files, after all. Unfortunately pg_rewind still needs to recognize and parse the special WAL records like XLOG_SMGR_CREATE/TRUNCATE, that modify relation files outside the normal block registration system. I've been thinking that we should add another flag to the WAL record format to mark such records. pg_rewind will still need to understand the record format of such records, but the flag will help to catch bugs of omission. 
If pg_rewind or another such tool sees a record that's flagged as special, but doesn't recognize the record type, it can throw an error. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
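[Editor's note: the classification rule Heikki proposes — flag WAL records that modify relation files outside the block-reference system, so a tool can fail loudly on a flagged record type it doesn't understand instead of silently missing changed blocks — can be sketched as below. Record shapes and type names are invented for illustration.]

```python
# Sketch of how a pg_rewind-like tool would consume flagged WAL records.

KNOWN_SPECIAL = {"SMGR_CREATE", "SMGR_TRUNCATE", "DBASE_CREATE", "DBASE_DROP"}

def collect_changed_blocks(records):
    """Accumulate touched blocks; error out on unknown 'special' records."""
    changed = set()
    for rec in records:
        for blk in rec.get("blocks", []):
            changed.add(blk)  # standard block references
        if rec.get("special"):
            if rec["type"] not in KNOWN_SPECIAL:
                # The flag is what lets us catch this case instead of
                # silently missing a relation-file change.
                raise ValueError("unrecognized special record: %s" % rec["type"])
            # a real tool would parse the special record type here
    return changed
```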
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On Mon, Oct 27, 2014 at 5:15 PM, Peter Geoghegan p...@heroku.com wrote: Let's see if we can link these two thoughts. 1. You think the biggest problem is the lack of attention to the design. 2. I keep asking you to put the docs in a readable form. If you can't understand the link between those two things, I am at a loss. You've read the docs. Please be clearer. In what sense are they not readable? The main description of the feature appears on the INSERT reference page: http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/on-conflict-docs/sql-insert.html I've updated that reference page. I did a fair amount of copy-editing, but also updated the docs to describe the latest (unpublished) refinements to the syntax. Which is, as you and Robert requested, that the target and rejected-for-insertion tuples may be referenced with magical aliases in the style of OLD.* and NEW.*. I've spelt these aliases as TARGET.* and EXCLUDED.*, since OLD.* and NEW.* didn't seem to make much sense here. This requires some special processing during rewriting (which, as you probably know, is true of DML statements in general), and is certainly more invasive than what I had before, but all told isn't too bad. Basically, there is still an ExcludedExpr, but it only appears in the post-rewrite query tree, and is never created by the raw grammar or processed during parse analysis. I attach the doc patch with the relevant changes, in case you'd like a quick reference to where things are changed. I have already implemented the two things that you and Robert asked for most recently: A costing model for unique index inference, and the above syntax. I've also added IGNORE support to postgres_fdw (so you can IGNORE if and only if a unique index inference specification is omitted, just as with updatable views since V1.3). 
Currently, I'm working on fixing an issue with RLS that I describe in detail here: https://wiki.postgresql.org/wiki/UPSERT#RLS Once I fix that (provided it doesn't take too long), I'll publish a V1.4. AFAICT, that'll close out all of the current open issues. I hope this goes some way towards addressing your concerns. -- Peter Geoghegan From b9556286b3710234a32ce48e8d163d5e844154b8 Mon Sep 17 00:00:00 2001 From: Peter Geoghegan p...@heroku.com Date: Fri, 26 Sep 2014 20:59:04 -0700 Subject: [PATCH 6/6] User-visible documentation for INSERT ... ON CONFLICT {UPDATE | IGNORE} INSERT ... ON CONFLICT {UPDATE | IGNORE} is documented as a new clause of the INSERT command. Some potentially surprising interactions with triggers are noted -- BEFORE INSERT per-row triggers must fire without the INSERT path necessarily being taken, for example. All the existing features that INSERT ... ON CONFLICT {UPDATE | IGNORE} fails to completely play nice with have those limitations noted. (Notes are added to the existing documentation for those other features, although some of these cases will need to be revisited). This includes postgres_fdw, updatable views and table inheritance (although these have most interesting cases covered, particularly inheritance). Finally, a user-level description of the new MVCC violation that the ON CONFLICT UPDATE variant sometimes requires has been added to Chapter 13 - Concurrency Control, beside existing commentary on READ COMMITTED mode's special handling of concurrent updates. The new MVCC violation introduced seems somewhat distinct from the existing one (i.e. what is internally referred to as the EvalPlanQual() mechanism), because in READ COMMITTED mode it is no longer necessary for any row version to be conventionally visible to the command's MVCC snapshot for an UPDATE of the row to occur (or for the row to be locked, should the WHERE clause predicate not be satisfied). 
---
 doc/src/sgml/ddl.sgml                 |  23 +++
 doc/src/sgml/fdwhandler.sgml          |   8 ++
 doc/src/sgml/indices.sgml             |  11 +-
 doc/src/sgml/keywords.sgml            |   7 +
 doc/src/sgml/mvcc.sgml                |  24
 doc/src/sgml/plpgsql.sgml             |  14 +-
 doc/src/sgml/postgres-fdw.sgml        |   8 ++
 doc/src/sgml/ref/create_index.sgml    |   7 +-
 doc/src/sgml/ref/create_rule.sgml     |   6 +-
 doc/src/sgml/ref/create_table.sgml    |   5 +-
 doc/src/sgml/ref/create_trigger.sgml  |   5 +-
 doc/src/sgml/ref/create_view.sgml     |  33 -
 doc/src/sgml/ref/insert.sgml          | 262 +++---
 doc/src/sgml/ref/set_constraints.sgml |   6 +-
 doc/src/sgml/trigger.sgml             |  29 +++-
 15 files changed, 419 insertions(+), 29 deletions(-)

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index f9dc151..1bf3537 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -2428,9 +2428,27 @@ VALUES ('Albany', NULL, NULL, 'NY');
  </para>

  <para>
+  There is limited inheritance support for <command>INSERT</command>
+  commands with <literal>ON CONFLICT</literal> clauses.  Tables with
+  children are not generally accepted as targets.
Re: [HACKERS] Tweaking Foreign Keys for larger tables
On 10/31/14 6:19 AM, Simon Riggs wrote: Various ways of tweaking Foreign Keys are suggested that are helpful for larger databases. *INITIALLY NOT ENFORCED FK created, but is not enforced during DML. Will be/Must be marked NOT VALID when first created. We can run a VALIDATE on the constraint at any time; if it passes the check it is marked VALID and presumed to stay that way until the next VALIDATE run. Does that mean the FK would become invalid after every DML operation, until you explicitly revalidate it? Is that practical? ON DELETE IGNORE ON UPDATE IGNORE If we allow this specification then the FK is one way - we check the existence of a row in the referenced table, but there is no need for a trigger on the referenced table to enforce an action on delete or update, so no need to lock the referenced table when adding FKs. Are you worried about locking the table at all, or about having to lock many rows? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Amazon Redshift
Hackers, do you think we should implement some of the functions offered by Amazon Redshift? http://docs.aws.amazon.com/redshift/latest/dg/c_SQL_functions.html Amazon Redshift is completely based on PostgreSQL, but they have implemented some features not currently available in PostgreSQL. Given the fact that it's a successful customer-driven commercial project, one can easily guess that the functions they implemented are useful and requested by many users. Phil
Re: [HACKERS] Amazon Redshift
philip taylor philta...@mail.com wrote: do you think we should implement some of the functions offered by Amazon Redshift? http://docs.aws.amazon.com/redshift/latest/dg/c_SQL_functions.html Well, obviously we already have many of the functions linked from that page. I think that if you want to go through them and pick out a few that you think are particularly promising it could start a useful discussion. Pointing at a page that includes such things as CURRENT_DATE, the avg() aggregate, and the lead() window function is just going to waste the time of everyone who follows the link. It's not like all of them would be committed in the same patch anyway. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Order of views in stats docs
On 11/5/14, 2:43 PM, Peter Eisentraut wrote: On 11/5/14 10:57 AM, Tom Lane wrote: However, should we consider the possibility of changing the table to straight alphabetical ordering? I'm not as much in love with that approach as some folks, but it does have the merit that it's always clear where you ought to put a new item. Yes, I think that property is important when developing in a loose community. Couldn't we just stick a warning SGML comment at the end of the list? ISTM that's no more likely to be missed/ignored than noticing that the list happens to be alphabetical. Perhaps the best solution is to split the list into different areas; one for database stats, another for table stats, etc. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Amazon Redshift
On 11/5/14, 3:42 PM, Kevin Grittner wrote: philip taylor philta...@mail.com wrote: do you think we should implement some of the functions offered by Amazon Redshift? http://docs.aws.amazon.com/redshift/latest/dg/c_SQL_functions.html Well, obviously we already have many of the functions linked from that page. I think that if you want to go through them and pick out a few that you think are particularly promising it could start a useful discussion. Pointing at a page that includes such things as CURRENT_DATE, the avg() aggregate, and the lead() window function is just going to waste the time of everyone who follows the link. It's not like all of them would be committed in the same patch anyway. At the risk of stating the obvious, before we look at implementing them we should see if Amazon would donate that effort. But as Kevin said, first step is discussion on what we don't have and whether we want it. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] superuser() shortcuts
Attached is two separate patches to address previous comments/recommendations: * superuser-cleanup-shortcuts_11-5-2014.patch * has_privilege-cleanup_11-5-2014.patch -Adam -- Adam Brightwell - adam.brightw...@crunchydatasolutions.com Database Engineer - www.crunchydatasolutions.com diff --git a/contrib/test_decoding/expected/permissions.out b/contrib/test_decoding/expected/permissions.out new file mode 100644 index 212fd1d..f011955 *** a/contrib/test_decoding/expected/permissions.out --- b/contrib/test_decoding/expected/permissions.out *** RESET ROLE; *** 54,66 -- plain user *can't* can control replication SET ROLE lr_normal; SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding'); ! ERROR: must be superuser or replication role to use replication slots INSERT INTO lr_test VALUES('lr_superuser_init'); ERROR: permission denied for relation lr_test SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); ! ERROR: must be superuser or replication role to use replication slots SELECT pg_drop_replication_slot('regression_slot'); ! ERROR: must be superuser or replication role to use replication slots RESET ROLE; -- replication users can drop superuser created slots SET ROLE lr_superuser; --- 54,69 -- plain user *can't* can control replication SET ROLE lr_normal; SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding'); ! ERROR: permission denied to use replication slots ! HINT: You must be superuser or replication role to use replication slots. INSERT INTO lr_test VALUES('lr_superuser_init'); ERROR: permission denied for relation lr_test SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); ! ERROR: permission denied to use replication slots ! HINT: You must be superuser or replication role to use replication slots. SELECT pg_drop_replication_slot('regression_slot'); ! 
ERROR:  permission denied to use replication slots
! HINT:  You must be superuser or replication role to use replication slots.
RESET ROLE;
-- replication users can drop superuser created slots
SET ROLE lr_superuser;
*** SELECT 'init' FROM pg_create_logical_rep
*** 90,96 ****
RESET ROLE;
SET ROLE lr_normal;
SELECT pg_drop_replication_slot('regression_slot');
! ERROR:  must be superuser or replication role to use replication slots
RESET ROLE;
-- all users can see existing slots
SET ROLE lr_superuser;
--- 93,100 ----
RESET ROLE;
SET ROLE lr_normal;
SELECT pg_drop_replication_slot('regression_slot');
! ERROR:  permission denied to use replication slots
! HINT:  You must be superuser or replication role to use replication slots.
RESET ROLE;
-- all users can see existing slots
SET ROLE lr_superuser;

diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
new file mode 100644
index 133143d..52d8208
*** a/src/backend/access/transam/xlogfuncs.c
--- b/src/backend/access/transam/xlogfuncs.c
*** 27,32 **** --- 27,33 ----
  #include "miscadmin.h"
  #include "replication/walreceiver.h"
  #include "storage/smgr.h"
+ #include "utils/acl.h"
  #include "utils/builtins.h"
  #include "utils/numeric.h"
  #include "utils/guc.h"
*** pg_start_backup(PG_FUNCTION_ARGS)
*** 54,63 ****
  backupidstr = text_to_cstring(backupid);
! if (!superuser() && !has_rolreplication(GetUserId()))
  ereport(ERROR,
  (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
! errmsg("must be superuser or replication role to run a backup")));
  startpoint = do_pg_start_backup(backupidstr, fast, NULL, NULL);
--- 55,65 ----
  backupidstr = text_to_cstring(backupid);
! if (!has_replication_privilege(GetUserId()))
  ereport(ERROR,
  (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
! errmsg("permission denied to run a backup"),
! errhint("You must be superuser or replication role to run a backup.")));
  startpoint = do_pg_start_backup(backupidstr, fast, NULL, NULL);
*** pg_stop_backup(PG_FUNCTION_ARGS)
*** 82,91 ****
  {
  XLogRecPtr stoppoint;
! if (!superuser() && !has_rolreplication(GetUserId()))
  ereport(ERROR,
  (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
! errmsg("must be superuser or replication role to run a backup")));
  stoppoint = do_pg_stop_backup(NULL, true, NULL);
--- 84,94 ----
  {
  XLogRecPtr stoppoint;
! if (!has_replication_privilege(GetUserId()))
  ereport(ERROR,
  (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
! errmsg("permission denied to run a backup"),
! errhint("You must be superuser or replication role to run a backup.")));
  stoppoint = do_pg_stop_backup(NULL, true, NULL);

diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
new file mode 100644
index d30612c..4d26b02
*** a/src/backend/catalog/aclchk.c
---
Re: [HACKERS] tracking commit timestamps
On 11/05/2014 11:23 AM, Jim Nasby wrote: Except that commit time is not guaranteed unique *even on a single system*. That's my whole point. If we're going to bother with all the commit time machinery it seems really silly not to provide a way to uniquely order every commit. Clearly trying to uniquely order commits across multiple systems is a far larger problem, and I'm not suggesting we attempt that. But for a single system AIUI all we need to do is expose the LSN of each commit record and that will give you the exact and unique order in which transactions committed. This isn't a hypothetical feature either; if we had this, logical replication systems wouldn't have to try and fake this via batches. You could actually recreate exactly what data was visible at what time to all transactions, not just repeatable read ones (as long as you kept snapshot data as well, which isn't hard). As for how much data to keep, if you have a process that's doing something to record this information permanently all it needs to do is keep an old enough snapshot around. That's not that hard to do, even from user space. +1 for this. It isn't just 'replication' systems that have a need for getting the commit order of transactions on a single system. I have an application (not slony) where we want to query a table but order the output based on the transaction commit order of when the insert into the table was done (think of a queue). I'm not replicating the output but passing the data to other applications for further processing. If I just had the commit timestamp I would need to put in some other condition to break ties in a consistent way. I think being able to get an ordering by commit LSN is what I really want in this case not the timestamp. Logical decoding is one solution to this (that I was considering) but being able to do something like select * FROM event_log order by commit_id would be a lot simpler.
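The gap being described can be sketched as follows. `event_log` is the table from the message above; `pg_xact_commit_timestamp()` stands for whatever lookup function the patch ends up exposing, and `pg_xact_commit_lsn()` is purely hypothetical — it is the function this sub-thread wishes existed:

```sql
-- With only a commit timestamp, ties within one microsecond are possible:
SELECT e.*, pg_xact_commit_timestamp(e.xmin) AS committed_at
FROM event_log e
ORDER BY committed_at;              -- order among tied rows is undefined

-- What is being asked for: a total order derived from the commit LSN
-- (hypothetical function, does not exist)
SELECT e.*
FROM event_log e
ORDER BY pg_xact_commit_lsn(e.xmin);
```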
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Amazon Redshift
do you think we should implement some of the functions offered by Amazon Redshift? http://docs.aws.amazon.com/redshift/latest/dg/c_SQL_functions.html Well, obviously we already have many of the functions linked from that page. I think that if you want to go through them and pick out a few that you think are particularly promising it could start a useful discussion. Pointing at a page that includes such things as CURRENT_DATE, the avg() aggregate, and the lead() window function is just going to waste the time of everyone who follows the link. It's not like all of them would be committed in the same patch anyway. At the risk of stating the obvious, before we look at implementing them we should see if Amazon would donate that effort. But as Kevin said, first step is discussion on what we don't have and whether we want it. Ok, this is a summary of what they have that we don't (of course, I could have missed something): Date Functions http://docs.aws.amazon.com/redshift/latest/dg/r_ADD_MONTHS.html http://docs.aws.amazon.com/redshift/latest/dg/r_DATEADD_function.html http://docs.aws.amazon.com/redshift/latest/dg/r_DATEDIFF_function.html http://docs.aws.amazon.com/redshift/latest/dg/r_LAST_DAY.html http://docs.aws.amazon.com/redshift/latest/dg/r_MONTHS_BETWEEN_function.html http://docs.aws.amazon.com/redshift/latest/dg/r_NEXT_DAY.html String Functions http://docs.aws.amazon.com/redshift/latest/dg/REGEXP_COUNT.html http://docs.aws.amazon.com/redshift/latest/dg/REGEXP_INSTR.html http://docs.aws.amazon.com/redshift/latest/dg/r_STRTOL.html http://docs.aws.amazon.com/redshift/latest/dg/crc32-function.html http://docs.aws.amazon.com/redshift/latest/dg/FUNC_SHA1.html JSON Functions http://docs.aws.amazon.com/redshift/latest/dg/JSON_ARRAY_LENGTH.html http://docs.aws.amazon.com/redshift/latest/dg/JSON_EXTRACT_ARRAY_ELEMENT_TEXT.html http://docs.aws.amazon.com/redshift/latest/dg/JSON_EXTRACT_PATH_TEXT.html Window functions 
http://docs.aws.amazon.com/redshift/latest/dg/r_WF_RATIO_TO_REPORT.html http://docs.aws.amazon.com/redshift/latest/dg/r_WF_MEDIAN.html -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
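For context, several of the listed functions already have reasonable spellings in stock PostgreSQL. The sketch below shows rough equivalents (approximations only, not drop-in replacements — they don't mimic Redshift's exact NULL handling or rounding). Note also that json_array_length() and json_extract_path_text() already exist for the json type as of 9.3:

```sql
-- ADD_MONTHS(d, 3)
SELECT date '2014-11-05' + interval '3 months';
-- DATEDIFF(day, a, b)
SELECT date '2014-12-01' - date '2014-11-05';
-- LAST_DAY(d)
SELECT (date_trunc('month', date '2014-11-05')
        + interval '1 month' - interval '1 day')::date;
-- REGEXP_COUNT(s, pat): count matches with the 'g' flag
SELECT count(*) FROM regexp_matches('abcabc', 'a', 'g');
-- MEDIAN(x): 9.4's ordered-set aggregate
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY x)
FROM (VALUES (1.0), (2.0), (4.0)) v(x);
-- RATIO_TO_REPORT(x) OVER ()
SELECT x / sum(x) OVER () FROM (VALUES (1.0), (3.0)) v(x);
```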
Re: [HACKERS] WAL format and API changes (9.5)
On 2014-11-05 23:08:31 +0200, Heikki Linnakangas wrote: On 10/30/2014 09:19 PM, Andres Freund wrote: Some things I noticed while reading the patch: A lot of good comments, but let me pick up just two that are related: * There's a couple record types (e.g. XLOG_SMGR_TRUNCATE) that only refer to the relation, but not to the block number. These still log their rnode manually. Shouldn't we somehow deal with those in a similar way explicit block references are dealt with? * Hm. At least WriteMZeroPageXlogRec (and probably the same for all the other slru stuff) doesn't include a reference to the page. Isn't that bad? Shouldn't we make XLogRegisterBlock() usable for that case? Otherwise I fail to see how pg_rewind like tools can sanely deal with this? Yeah, there are still operations that modify relation pages, but don't store the information about the modified pages in the standard format. That includes XLOG_SMGR_TRUNCATE that you spotted, and XLOG_SMGR_CREATE, and also XLOG_DBASE_CREATE/DROP. And then there are updates to non-relation files, like all the slru stuff, relcache init files, etc. And updates to the FSM and VM bypass the full-page write mechanism too. That's an awful number of special cases. I see little reason not to invent something that can reference at least the relation in those cases. Then we can at least only resync the relations if there were changes. E.g. for VMs that's not necessarily all that likely in an insert-mostly case. I'm not that worried about things like create/drop database, but fsm, vm, and the various slru's are really somewhat essential. To play it safe, pg_rewind copies all non-relation files as is. That includes all SLRUs, FSM and VM files, and everything else whose filename doesn't match the (main fork of) a relation file. Of course, that's a fair amount of copying to do, so we might want to optimize that in the future, but I want to nail the relation files first. 
They are usually an order of magnitude larger than the other files, after all. That's fair enough. Unfortunately pg_rewind still needs to recognize and parse the special WAL records like XLOG_SMGR_CREATE/TRUNCATE, that modify relation files outside the normal block registration system. I've been thinking that we should add another flag to the WAL record format to mark such records. So every time pg_rewind comes across a record with that flag set for which it doesn't have special-case code, it'd balk? Sounds sensible. Personally I think we should at least have a generic format to refer to entire relations without a specific block number. And one to refer to SLRUs. I don't think we necessarily need to implement them now, but we should make sure there's bit space left to denote them. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sequence Access Method WIP
On 05/11/14 18:32, Heikki Linnakangas wrote: On 11/05/2014 05:07 PM, Petr Jelinek wrote: On 05/11/14 13:45, Heikki Linnakangas wrote: In fact, if the seqam manages the current value outside the database (e.g. a remote seqam that gets the value from another server), nextval() never needs to write a WAL record. Sure it does, you need to keep the current state in Postgres also, at least the current value so that you can pass correct input to sequence_alloc(). And you need to do this in a crash-safe way, so WAL is necessary. Why does sequence_alloc need the current value? If it's a remote seqam, the current value is kept in the remote server, and the last value that was given to this PostgreSQL server is irrelevant. Hmm, I am not sure if I see this use case as practical TBH, but I also don't see a fundamental problem with it. That irks me with this API. The method for acquiring a new value isn't fully abstracted behind the AM interface, as sequence.c still needs to track it itself. That's useful for the local AM, of course, and maybe some others, but for others it's totally useless. Hmm, I think that kind of abstraction can only be done by passing the current tuple and returning an updated tuple (yes I realize that it's what you have been saying basically). In general it sounds like the level of abstraction you'd want would be one where the AM cares about everything except the code that does the actual writes to the page and WAL (but when to do those would still be controlled completely by the AM?) and the SQL interface. I don't see how to make that work with ALTER SEQUENCE USING to be honest, and I do care quite a lot about that use case (I think the ability to convert the local sequences to 3rd party ones and back is very important). 
That's not the issue I was referring to, I was talking about the page replacement code which is not as simple now that we have potentially dynamic size tuple and if tuples were different for different AMs the code would still have to be able to handle that case. Setting the values in tuple itself is not too complicated. I don't see the problem with that. We deal with variable-sized tuples in heap pages all the time. The max size of amdata (or the extra AM-specific columns) is going to be determined by the block size, though. Glad to hear that. Yes the limit is block size, I think we can live with that at least for the moment... -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tracking commit timestamps
On 2014-11-05 17:17:05 -0500, Steve Singer wrote: It isn't just 'replication' systems that have a need for getting the commit order of transactions on a single system. I have a application (not slony) where we want to query a table but order the output based on the transaction commit order of when the insert into the table was done (think of a queue). I'm not replicating the output but passing the data to other applications for further processing. If I just had the commit timestamp I would need to put in some other condition to break ties in a consistent way. I think being able to get an ordering by commit LSN is what I really want in this case not the timestamp. Logical decoding is one solution to this (that I was considering) but being able to do something like select * FROM event_log order by commit_id would be a lot simpler. Imo that's essentially a different feature. What you essentially would need here is a 'commit sequence number' - but no timestamps. And probably to be useful that number has to be 8 bytes in itself. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] BRIN indexes - TRAP: BadArgument
On Wed, Nov 5, 2014 at 12:54 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Thanks for the updated patch. Now when I run the test program (version with better error reporting attached), it runs fine until I open a psql session and issue: reindex table foo; Then it immediately falls over with some rows no longer being findable through the index. -- use index select count(*) from foo where text_array = md5(4611::text); 0 -- use seq scan select count(*) from foo where text_array||'' = md5(4611::text); 1 Where the number '4611' was taken from the error message of the test program. brin_crash.pl Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Amazon Redshift
On 11/05/2014 02:36 PM, philip taylor wrote: Ok, this is a summary of what they have that we don't (of course, I could have missed something): I can't see any functions on that list I'd want. For example, DATEADD is there just to be compatible with MSSQL. It's useless to us. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] numeric_normalize() is a few bricks shy of a load
9.4 introduced a function numeric_normalize() whose charter is to produce a string representing a numeric, such that two numerics' string representations will compare equal if and only if the numeric values compare equal. (This is used, so far, only by GIN indexes on JSONB.) Thomas Fanghaenel pointed out to me that this function is utterly, completely broken: it will reduce distinct values such as 1 and 1000 to the same string, and what's much worse for the GIN use-case, it can reduce values that should compare equal to distinct strings. Here's a test case demonstrating the latter assertion:

regression=# create table t1 (f1 jsonb);
CREATE TABLE
regression=# insert into t1 values('{"x": 12000}');
INSERT 0 1
regression=# insert into t1 values('{"x": 12000.00}');
INSERT 0 1
regression=# select * from t1 where f1 @> '{"x": 12000}';
       f1
-----------------
 {"x": 12000}
 {"x": 12000.00}
(2 rows)

regression=# create index on t1 using gin (f1);
CREATE INDEX
regression=# set enable_seqscan TO 0;
SET
regression=# select * from t1 where f1 @> '{"x": 12000}';
      f1
--------------
 {"x": 12000}
(1 row)

regression=# select * from t1 where f1 @> '{"x": 12000.00}';
        f1
-----------------
 {"x": 12000.00}
(1 row)

Since JSONB GIN indexes are always considered lossy, the reduction-to-same-string case is masked by index rechecks, resulting only in an inefficient index that fails to distinguish keys it could distinguish. However, the reduction-to-different-strings case results in wrong answers, as illustrated above. I think there's no choice but to fix this in 9.4. The only reason it even needs discussion is that this would invalidate index entries in beta testers' JSONB GIN indexes, such that queries would not find entries that they did find before. I'm not sure we can do anything about this except recommend that beta testers REINDEX such indexes after updating to 9.4next. Thoughts? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: dynahash replacement for buffer table
Hi, git branch also available at: http://git.postgresql.org/gitweb/?p=users/rhaas/postgres.git;a=shortlog;h=refs/heads/chash2014 A minor review of this: * Should be rebased on top of the atomics API * In benchmarks it becomes apparent that the dynamic element width makes some macros like CHashTableGetNode() and CHashTableGetGarbageByBucket() quite expensive. At least gcc on x86 can't figure out how to build an efficient expression for the target. There's two ways to address this: a) Actually morph chash into something that will be included into the user sites. Since I think there'll only be a limited number of those that's probably acceptable. b) Simplify the access rules. I quite seriously doubt that the interleaving of garbage and freelists is actually necessary/helpful. * There's several cases where it's noticeable that chash creates more cacheline misses - that's imo a good part of why the single threaded performance regresses. There's good reasons for the current design without an inline bucket, namely that it makes the concurrency handling easier. But I think that can be countered by still storing a pointer - just to an element inside the bucket. Afaics that'd allow this to be introduced quite easily? I'm afraid that I can't see us going forward unless we address the noticeable single threaded penalties. Do you see that differently? * There's some whitespace damage. (space followed by tab, followed by actual contents) * I still think it's a good idea to move the full memory barriers into the cleanup/writing process by doing write memory barriers when acquiring a hazard pointer and RMW atomic operations (e.g. atomic add) in the process testing for the hazard pointer. * Shouldn't we relax the CPU in a couple places like CHashAllocate(), CHashBucketScan()'s loops? * I don't understand right now (but then I'm in a Hotel bar) why it's safe for CHashAddToGarbage() to clobber the hash value? CHashBucketScan() relies on the chain being ordered by hashcode. No? 
Another backend might just have been past the IsMarked() bit and look at the hashcode? * We really should find a way to sensibly print the stats. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Repeatable read and serializable transactions see data committed after tx start
On Mon, Nov 3, 2014 at 2:14 PM, Álvaro Hernández Tortosa a...@8kdata.com wrote: Given a transaction started with BEGIN (REPEATABLE READ | SERIALIZABLE), if a concurrent session commits some data before *any* query within the first transaction, that committed data is seen by the transaction. This is not what I'd expect. I think the problem is with your expectation, not the behavior. Serializable means that the transactions execute in such a fashion that their parallel execution is equivalent to some serial order of execution. It doesn't say that it must be equivalent to any particular order that you might imagine, such as the order in which the transactions commit, or, as you propose, the order in which they begin. Generally, I think that's a good thing, because transaction isolation is expensive: even at repeatable read, but for the need to provide transaction isolation, there would be no such thing as bloat. The repeatable read and serializable levels add further overhead of various kinds. We could provide a super-duper serializable level that provides even tighter guarantees, but (1) I can't imagine many people are keen to make the cost of serialization even higher than it already is and (2) if you really want that behavior, just do some trivial operation sufficient to cause a snapshot to be taken immediately after the BEGIN. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL replay bugs
On Thu, Nov 6, 2014 at 5:41 AM, Peter Eisentraut pete...@gmx.net wrote: On 11/4/14 10:50 PM, Michael Paquier wrote: Except pg_upgrade, are there other tests using bash? There are a few obscure things under src/test/. Oh, I see. There is quite a number here, and each script is doing quite different things.. $ git grep /sh src/test/ src/test/locale/de_DE.ISO8859-1/runall:#! /bin/sh src/test/locale/gr_GR.ISO8859-7/runall:#! /bin/sh src/test/locale/koi8-r/runall:#! /bin/sh src/test/locale/koi8-to-win1251/runall:#! /bin/sh src/test/mb/mbregress.sh:#! /bin/sh src/test/performance/start-pgsql.sh:#!/bin/sh src/test/regress/regressplans.sh:#! /bin/sh -- Michael
Re: [HACKERS] Repeatable read and serializable transactions see data committed after tx start
On 05/11/14 17:46, Jim Nasby wrote: On 11/4/14, 6:11 PM, Álvaro Hernández Tortosa wrote: Should we improve then the docs stating this more clearly? Any objection to do this? If we go that route we should also mention that now() will no longer be doing what you probably hope it would (AFAIK it's driven by BEGIN and not the first snapshot). If I understand you correctly, you mean that if we add a note to the documentation stating that the transaction really freezes when you do the first query, people would expect now() to be also frozen when the first query is done, which is not what happens (it's frozen at tx start). Then, yes, you're right, probably *both* the isolation levels and the now() function documentation should be patched to become more precise. Perhaps we should change how now() works, but I'm worried about what that might do to existing applications... Perhaps, I also believe it might not be good for existing applications, but it definitely has a different freeze behavior, which seems inconsistent too. Thanks, Álvaro -- Álvaro Hernández Tortosa --- 8Kdata -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Repeatable read and serializable transactions see data committed after tx start
On 06/11/14 00:42, Robert Haas wrote: On Mon, Nov 3, 2014 at 2:14 PM, Álvaro Hernández Tortosa a...@8kdata.com wrote: Given a transaction started with BEGIN (REPEATABLE READ | SERIALIZABLE), if a concurrent session commits some data before *any* query within the first transaction, that committed data is seen by the transaction. This is not what I'd expect. I think the problem is with your expectation, not the behavior. But my expectation is derived from the documentation: The Repeatable Read isolation level only sees data committed before the transaction began; In PostgreSQL you will see data committed after a BEGIN ... (REPEATABLE READ | SERIALIZABLE) statement (only before the first query). And it's reasonable to think that transaction begins when you issue a BEGIN statement. It's also reasonable to think this way as: - now() is frozen at BEGIN time, as Nasby pointed out - pg_stat_activity says that the transaction is started, as Kevin mentioned So if the behavior is different from what the documentation says and what other external indicators may point out, I think at least documentation should be clear about this precise behavior, to avoid confusing users. Serializable means that the transactions execute in such a fashion that their parallel execution is equivalent to some serial order of execution. It doesn't say that it must be equivalent to any particular order that you might imagine, such as the order in which the transactions commit, or, as you propose, the order in which they begin. Generally, I think that's a good thing, because transaction isolation is expensive: even at repeatable read, but for the need to provide transaction isolation, there would be no such thing as bloat. The repeatable read and serializable levels add further overhead of various kinds. 
We could provide a super-duper serializable level that provides even tighter guarantees, but (1) I can't imagine many people are keen to make the cost of serialization even higher than it already is and (2) if you really want that behavior, just do some trivial operation sufficient to cause a snapshot to be taken immediately after the BEGIN. I'm not really asking for a new isolation level, just that either BEGIN really freezes (for repeatable read and serializable) or if that's expensive and not going to happen, that the documentation clearly states the fact that freeze starts at first-query-time, and that if you need to freeze before your first real query, you should do a dummy one instead (like SELECT 1). Also, if this early freeze is a performance hit -and for that reason BEGIN is not going to be changed to freeze- then that also should be pointed out in the documentation, so that users that freeze early with SELECT 1-type queries understand that. Regards, Álvaro -- Álvaro Hernández Tortosa --- 8Kdata -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
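The workaround both Robert and Álvaro mention amounts to this (plain SQL; the comments summarize the snapshot-timing behavior described above):

```sql
BEGIN ISOLATION LEVEL REPEATABLE READ;
-- now() is frozen here, at BEGIN time ...
SELECT 1;  -- ... but the snapshot is only taken by this first (dummy) query
-- from this point on, rows committed by concurrent sessions are invisible
COMMIT;
```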
Re: [HACKERS] tracking commit timestamps
On 11/05/2014 05:43 PM, Andres Freund wrote: On 2014-11-05 17:17:05 -0500, Steve Singer wrote: Imo that's essentially a different feature. What you essentially would need here is a 'commit sequence number' - but no timestamps. And probably to be useful that number has to be 8 bytes in itself. I think this gets to the heart of some of the differing views people have expressed on this patch. Is this patch supposed to: A) Add commit timestamp tracking but nothing more B) Add infrastructure to store commit timestamps and provide a facility for storing additional bits of data extensions might want to be associated with the commit C) Add commit timestamps and node identifiers to commits If the answer is (A) then I can see why some people are objecting to adding extradata. If the answer is (B) then it's fair to ask how well this patch handles various types of things people might want to attach to the commit record (such as the LSN). I think the problem is that once you start providing a facility extensions can use to store data along with the commit record then being restricted to 4 or 8 bytes is very limiting. It also doesn't allow you to load two extensions at once on a system. You wouldn't be able to have both the 'steve_commit_order' extension and BDR installed at the same time. I don't think this patch does a very good job at (B) but it wasn't intended to. If what we are really doing is (C), and we're just calling the space 'extradata' until we get the logical identifier stuff in and then we are going to rename extradata to nodeid, then we should say so. If we are really concerned about the pg_upgrade impact of expanding the record later then maybe we should add 4 bytes of padding to the CommitTimeStampEntry now but leave manipulating the node id until later. Steve Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] recovery_target_time and standby_mode
Hackers, Someone brought me an issue that recovery_target_time and standby_mode weren't working as they expected. I think that the way they work now makes sense, but we do need to clarify it in the docs. However, I'm posting this to hackers first in case the way these two work together *isn't* as intended. Setup: two servers are restored from the same pgBarman archive. The master is brought to a specific point in time using recovery_target_time. Then the user attempts to do the same with the new replica. recovery.conf: recovery_target_time = 'SOME-PAST-TIMESTAMP' standby_mode = on primary_conninfo = 'host=mymaster user=postgres port=5432' How It Works Now: When the recovery_target_time is reached, standby_mode is ignored and the server comes up as a standalone. How The User Wanted It To Work: When the recovery_target_time is reached, switch to streaming replication and stay a standby. Note that there is a workaround for what the user wants to do. I'm just trying to clarify what our desired behavior is. From there we can either work on patches or on doc fixes. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] B-Tree index builds, CLUSTER, and sortsupport
On 10/11/2014 02:26 AM, Peter Geoghegan wrote: Both Robert and Heikki expressed dissatisfaction with the fact that B-Tree index builds don't use sortsupport. Because B-Tree index builds cannot really use the onlyKey optimization, the historic oversight of not supporting B-Tree builds (and CLUSTER) might have been at least a little understandable [1]. But with the recent work on sortsupport for text, it's likely that B-Tree index builds will be missing out on rather a lot by not taking advantage of sortsupport. Attached patch modifies tuplesort to correct this oversight. What's really nice about it is that there is actually a net negative code footprint: src/backend/access/nbtree/nbtsort.c | 63 +++--- src/backend/utils/sort/sortsupport.c | 77 ++-- src/backend/utils/sort/tuplesort.c | 274 +++ src/include/utils/sortsupport.h | 3 + 4 files changed, 203 insertions(+), 214 deletions(-) The code compiles and passes the test suite. I looked at the changes to the code. The new code is clean and there is more code re-use and improved readability. One possible further improvement would be to move the preparation of SortSupport to a common function since this is done three times in the code. I did some simple benchmarks by adding indexes to temporary tables and could see improvements of around 10% in index build time. So it gives a nice, but not amazing, performance improvement. Is there any case where we should expect any greater performance improvement? Either way I find this a nice patch which improves code quality and performance. = Minor code style issues I found - There is a double space in strategy = (scanKey-sk_flags [...]. - I think there should be a newline in tuplesort_begin_index_btree() before /* Prepare SortSupport data for each column */. - Remove the extra newline after reversedirection(). - End sentences in comments with a period. That seems to be the common practice in the project.
-- Andreas Karlsson
Re: [HACKERS] numeric_normalize() is a few bricks shy of a load
* Tom Lane (t...@sss.pgh.pa.us) wrote: I think there's no choice but to fix this in 9.4. The only reason it even needs discussion is that this would invalidate index entries in beta testers' JSONB GIN indexes, such that queries would not find entries that they did find before. I'm not sure we can do anything about this except recommend that beta testers REINDEX such indexes after updating to 9.4next. Thoughts? I'm not happy about it, but I agree with your assessment. Beta testers will survive (not exactly the first time they've had to deal with such changes..). We surely can't ship it as-is. Thanks, Stephen
Re: [HACKERS] recovery_target_time and standby_mode
On Thu, Nov 6, 2014 at 12:32 AM, Josh Berkus j...@agliodbs.com wrote: When the recovery_target_time is reached, switch to streaming replication and stay a standby. Then shouldn't he just not specify a recovery_target at all? That's the default behaviour for standby_mode on; the whole point of recovery_target is to specify when to stop recovery and leave standby mode, no? -- greg
Re: [HACKERS] Repeatable read and serializable transactions see data committed after tx start
On 11/5/14, 6:04 PM, Álvaro Hernández Tortosa wrote: On 05/11/14 17:46, Jim Nasby wrote: On 11/4/14, 6:11 PM, Álvaro Hernández Tortosa wrote: Should we then improve the docs to state this more clearly? Any objection to doing this? If we go that route we should also mention that now() will no longer be doing what you probably hope it would (AFAIK it's driven by BEGIN and not the first snapshot). If I understand you correctly, you mean that if we add a note to the documentation stating that the transaction really freezes when you do the first query, people would expect now() to also be frozen when the first query is done, which is not what happens (it's frozen at tx start). Then, yes, you're right, probably *both* the isolation levels and the now() function documentation should be patched to become more precise. Bingo. Hrm, is there anything else that differs between the two? Perhaps we should change how now() works, but I'm worried about what that might do to existing applications... Perhaps; I also believe it might not be good for existing applications, but it definitely has a different freeze behavior, which seems inconsistent too. Yeah, I'd really rather fix it... Hmm... we do have transaction_timestamp(); perhaps we could leave that as the time BEGIN executed and shift everything else to use the snapshot time. Or we could do the opposite. I suspect that would be more likely to cause data quality errors, but anyone that treats timestamps as magic values is going to have those anyway. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] B-Tree index builds, CLUSTER, and sortsupport
Thanks for the review. On Wed, Nov 5, 2014 at 4:33 PM, Andreas Karlsson andr...@proxel.se wrote: I looked at the changes to the code. The new code is clean and there is more code re-use and improved readability. One possible further improvement would be to move the preparation of SortSupport to a common function since this is done three times in the code. The idea there is to have more direct control of sortsupport. With the abbreviated keys patch, abbreviation occurs based on a decision made by tuplesort.c. I can see why you'd say that, but I prefer to keep initialization of sortsupport structs largely concentrated in tuplesort.c, and more or less uniform regardless of the tuple-type being sorted. I did some simple benchmarks by adding indexes to temporary tables and could see improvements of around 10% in index build time. So it gives a nice, but not amazing, performance improvement. Cool. Is there any case where we should expect any greater performance improvement? The really compelling case is abbreviated keys - as you probably know, there is a patch that builds on this patch, and the abbreviated keys patch, so that B-Tree builds and CLUSTER can use abbreviated keys too. That doesn't really have much to do with this patch, though. The important point is that heap tuple sorting (with a query that has no client overhead, and involves one big sort) and B-Tree index creation both have their tuplesort as a totally dominant cost. The improvements for each case should be quite comparable, which is good (except, as noted in my opening e-mail, when heap/datum tuplesorting can use the onlyKey optimization, while B-Tree/CLUSTER tuplesorting cannot). -- Peter Geoghegan
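The abbreviated-keys technique Peter refers to can be illustrated with a toy comparator: pack a fixed-size prefix of each string into an integer so that most comparisons never touch the full value, and fall back to the authoritative comparator only on ties. This is a sketch of the idea only, not PostgreSQL's actual sortsupport code:

```c
#include <stdint.h>
#include <string.h>

/*
 * Toy abbreviated key: the first 8 bytes of a NUL-terminated string,
 * packed big-endian so that integer comparison matches byte order.
 */
static uint64_t
abbrev_key(const char *s)
{
    uint64_t key = 0;

    for (int i = 0; i < 8 && s[i] != '\0'; i++)
        key |= (uint64_t) (unsigned char) s[i] << (8 * (7 - i));
    return key;
}

/*
 * Compare abbreviations first; only when they tie do we pay for the
 * full comparison. Real sortsupport caches the abbreviation alongside
 * each SortTuple instead of recomputing it per comparison.
 */
static int
abbrev_cmp(const char *a, const char *b)
{
    uint64_t ka = abbrev_key(a);
    uint64_t kb = abbrev_key(b);

    if (ka != kb)
        return ka < kb ? -1 : 1;
    return strcmp(a, b);    /* authoritative tiebreak */
}
```

The win comes from sorts where most keys differ within the first few bytes, which is why the benefit compounds with the text sortsupport work mentioned in the thread.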
Re: [HACKERS] [BUGS] ltree::text not immutable?
On Tue, Nov 4, 2014 at 6:48 PM, Tom Lane t...@sss.pgh.pa.us wrote: I wrote: An alternative that just occurred to me is to put the no-volatile- I/O-functions check into CREATE TYPE, but make it just WARNING not ERROR. That would be nearly as good as an ERROR in terms of nudging people who'd accidentally omitted a volatility marking from their custom types. But we could leave chkpass as-is and wait to see if anyone complains "hey, why am I getting this warning?". If we don't hear anyone complaining, maybe that means we can get away with changing the type's behavior in 9.6 or later. Attached is a complete patch along these lines. As I suggested earlier, this just makes the relevant changes in ltree--1.0.sql and pg_trgm--1.1.sql without bumping their extension version numbers, since it doesn't seem important enough to justify a version bump. I propose that we could back-patch the immutability additions in ltree and pg_trgm, since they won't hurt anything and might make life a little easier for future adopters of those modules. The WARNING additions should only go into HEAD though. I don't understand why you went to all the trouble of building a versioning system for extensions if you're not going to use it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] recovery_target_time and standby_mode
On 11/05/2014 05:00 PM, Greg Stark wrote: On Thu, Nov 6, 2014 at 12:32 AM, Josh Berkus j...@agliodbs.com wrote: When the recovery_target_time is reached, switch to streaming replication and stay a standby. Then shouldn't he just not specify a recovery_target at all? That's the default behaviour for standby_mode on; the whole point of recovery_target is to specify when to stop recovery and leave standby mode, no? Their goal was to restore a master *and replica* to a point in time before Bad Stuff happened, and then have a working master-replica pair. Like I said, this is probably working as intended, we just need to clarify it in the docs. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] recovery_target_time and standby_mode
On Thu, Nov 6, 2014 at 10:00 AM, Greg Stark st...@mit.edu wrote: On Thu, Nov 6, 2014 at 12:32 AM, Josh Berkus j...@agliodbs.com wrote: When the recovery_target_time is reached, switch to streaming replication and stay a standby. Then shouldn't he just not specify a recovery_target at all? That's the default behaviour for standby_mode on; the whole point of recovery_target is to specify when to stop recovery and leave standby mode, no? Agreed with Greg: once a recovery target is reached, the node gets out of recovery. What the user should have done here is not specify recovery_target_time in the standby's recovery.conf, so that it follows the master through streaming. -- Michael
Re: [HACKERS] recovery_target_time and standby_mode
On Thu, Nov 6, 2014 at 10:41 AM, Michael Paquier michael.paqu...@gmail.com wrote: On Thu, Nov 6, 2014 at 10:00 AM, Greg Stark st...@mit.edu wrote: On Thu, Nov 6, 2014 at 12:32 AM, Josh Berkus j...@agliodbs.com wrote: When the recovery_target_time is reached, switch to streaming replication and stay a standby. Then shouldn't he just not specify a recovery_target at all? That's the default behaviour for standby_mode on; the whole point of recovery_target is to specify when to stop recovery and leave standby mode, no? Agreed with Greg: once a recovery target is reached, the node gets out of recovery. What the user should have done here is not specify recovery_target_time in the standby's recovery.conf, so that it follows the master through streaming. Just adding: ... on the new timeline that the master switches to. If the standby has already replayed past the point where the WAL forked on the master, then the standby needs to be rewound. -- Michael
Re: [HACKERS] recovery_target_time and standby_mode
On 11/05/2014 05:41 PM, Michael Paquier wrote: On Thu, Nov 6, 2014 at 10:00 AM, Greg Stark st...@mit.edu wrote: On Thu, Nov 6, 2014 at 12:32 AM, Josh Berkus j...@agliodbs.com wrote: When the recovery_target_time is reached, switch to streaming replication and stay a standby. Then shouldn't he just not specify a recovery_target at all? That's the default behaviour for standby_mode on; the whole point of recovery_target is to specify when to stop recovery and leave standby mode, no? Agreed with Greg: once a recovery target is reached, the node gets out of recovery. What the user should have done here is not specify recovery_target_time in the standby's recovery.conf, so that it follows the master through streaming. What I'm pointing out is that you can't actually do that. You think you can, but you can't. Instead, what you need to do is: 1) recover to target_time; 2) pause; 3) shut down the replica; 4) replace recovery.conf with one that streams; 5) restart the replica. This is consistent behavior and makes sense when you think about it. So I think what we need to do is clarify in the documentation covering recovery_target and standby_mode that they are exclusive. Hmmm. You know, I think this means we do have a bug. If recovery_target_time and standby_mode are exclusive, we should error if the user attempts to set them both. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] group locking: incomplete patch, just for discussion
On Sun, Nov 2, 2014 at 7:31 AM, Simon Riggs si...@2ndquadrant.com wrote: The procgloballist stuff should be the subject of a separate patch which I agree with. Yes, I think that's probably a net improvement in robustness quite apart from what we decide to do about any of the rest of this. I've attached it here as revise-procglobal-tracking.patch and will commit that bit if nobody objects. The remainder is reattached without change as group-locking-v0.1.patch. Per your other comment, I've developed the beginnings of a testing framework which I attached here as test_group_locking-v0.patch. That doesn't look to have much hope of evolving into something we'd want even in contrib, but I think it'll be rather useful for debugging. It works like this: rhaas=# create table foo (a int); CREATE TABLE rhaas=# select test_group_locking('1.0:start,2.0:start,1.0:lock:AccessExclusiveLock:foo,2.0:lock:AccessExclusiveLock:foo'); NOTICE: starting worker 1.0 NOTICE: starting worker 2.0 NOTICE: instructing worker 1.0 to acquire AccessExclusiveLock on relation with OID 16387 NOTICE: instructing worker 2.0 to acquire AccessExclusiveLock on relation with OID 16387 ERROR: could not obtain AccessExclusiveLock on relation with OID 16387 CONTEXT: background worker, group 2, task 0 The syntax is a little arcane, I guess, but it's documented in the comments within. In this case I asked it to start up two background workers and have them both try to take AccessExclusiveLock on table foo. As expected, the second one fails. The idea is that workers are identified by a pair of numbers X.Y; two workers with the same X-value are in the same locking group. So if I call the second worker 1.1 rather than 2.0, it'll join the same locking group as worker 1.0 and ... then it does the wrong thing, and then it crashes the server, because my completely-untested code is unsurprisingly riddled with bugs. Eventually, this needs to be generalized a bit so that we can use it to test deadlock detection. 
That's tricky, because what you really want to do is tell worker A to wait for some lock and then, once you're sure it's on the wait queue, tell worker B to go take some other lock and check that you see the resulting deadlock. There doesn't seem to be a good API for the user backend to find out whether some background worker is waiting for some particular lock, so I may have to resort to the hacky expedient of having the driver process wait for a few seconds and assume that's long enough that the background worker will be on the wait queue by then. Or maybe I can drum up some solution, but anyway it's not done yet. The value of this test code is that we can easily reproduce locking scenarios which would be hard to reproduce in a real workload - e.g. because they're timing-dependent. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c index 723051e..85997f6 100644 --- a/src/backend/storage/lmgr/lock.c +++ b/src/backend/storage/lmgr/lock.c @@ -210,6 +210,10 @@ static bool FastPathUnGrantRelationLock(Oid relid, LOCKMODE lockmode); static bool FastPathTransferRelationLocks(LockMethod lockMethodTable, const LOCKTAG *locktag, uint32 hashcode); static PROCLOCK *FastPathGetRelationLockEntry(LOCALLOCK *locallock); +static bool GroupLockShouldJumpQueue(LockMethod lockMethodTable, + LOCKMODE lockmode, + LOCK *lock, + PROCLOCK *proclock); /* * To make the fast-path lock mechanism work, we must have some way of @@ -339,7 +343,8 @@ PROCLOCK_PRINT(const char *where, const PROCLOCK *proclockP) static uint32 proclock_hash(const void *key, Size keysize); static void RemoveLocalLock(LOCALLOCK *locallock); static PROCLOCK *SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc, - const LOCKTAG *locktag, uint32 hashcode, LOCKMODE lockmode); + PGPROC *leader, const LOCKTAG *locktag, uint32 hashcode, + LOCKMODE lockmode); static void GrantLockLocal(LOCALLOCK 
*locallock, ResourceOwner owner); static void BeginStrongLockAcquire(LOCALLOCK *locallock, uint32 fasthashcode); static void FinishStrongLockAcquire(void); @@ -894,8 +899,8 @@ LockAcquireExtended(const LOCKTAG *locktag, * away anytime. So we have to use SetupLockInTable() to recompute the * lock and proclock pointers, even if they're already set. */ - proclock = SetupLockInTable(lockMethodTable, MyProc, locktag, -hashcode, lockmode); + proclock = SetupLockInTable(lockMethodTable, MyProc, LockGroupLeader, +locktag, hashcode, lockmode); if (!proclock) { AbortStrongLockAcquire(); @@ -914,18 +919,27 @@ LockAcquireExtended(const LOCKTAG *locktag, /* * If lock requested conflicts with locks requested by waiters, must join - * wait queue. Otherwise, check for conflict with already-held locks. - * (That's last because most complex check.) + * wait queue (except for certain cases involving
Re: [HACKERS] Additional role attributes superuser review
On Mon, Nov 3, 2014 at 11:44 AM, Adam Brightwell adam.brightw...@crunchydatasolutions.com wrote: That said, I don't feel very strongly about that position, so if you and Robert (and others on the thread) agree that's the right approach then I'll see about getting it done. We haven't reached consensus on this one yet and I didn't want it to fall too far off the radar. Here is what I summarize as the current state of the discussion: 1. Syntax: ALTER ROLE role { ADD | DROP } CAPABILITY capability Seems reasonable. * Perhaps keeping the current syntax around as deprecated, to be removed in a scheduled future version. (provide a deprecation message to the user?) Yeah, I don't think we should remove the existing syntax because a lot of people are used to it. We still have some very old COPY syntax around for backward-compatibility, and it's not hurting us, either. At the same time, I think recasting the existing capability-like things as capabilities per se is a good idea, because otherwise we've got this old stuff that is randomly different for no especially good reason. GRANT EXECUTE PRIVILEGES ON capability TO role Yuck. The only other approach I can think of is to have some magic, hard-coded roles that implicitly have each capability, and then those can be granted. e.g. have a role pg_unrestricted_copy or whatever with a given OID, and then test has_privs_of_role() with that OID. Condense all attributes in pg_authid to a single int64 column and create bitmasks accordingly. Obviously there is some concern for upgrading and whether to do both at once or to do them incrementally. IMHO, I think if the changes are going to be made, then we should go ahead and do them at the same time. Though, would it be beneficial to separate into two distinct patches?
If we're going to change the catalog representation for the existing capabilities, I'd recommend that the first patch change the catalog representation and add the new syntax without adding any more capabilities; and then the second and subsequent patches can add additional capabilities. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
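The "condense all attributes in pg_authid to a single int64 column" option discussed above might look like the following bitmask sketch. The flag names and helper functions are hypothetical illustrations, not anything that exists in pg_authid:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability bits packed into one int64-style mask. */
#define CAP_SUPERUSER   (INT64_C(1) << 0)
#define CAP_CREATEROLE  (INT64_C(1) << 1)
#define CAP_CREATEDB    (INT64_C(1) << 2)
#define CAP_REPLICATION (INT64_C(1) << 3)

/* ALTER ROLE role ADD CAPABILITY cap */
static int64_t
cap_add(int64_t caps, int64_t cap)
{
    return caps | cap;
}

/* ALTER ROLE role DROP CAPABILITY cap */
static int64_t
cap_drop(int64_t caps, int64_t cap)
{
    return caps & ~cap;
}

/* Does the role hold all bits in cap? */
static bool
cap_has(int64_t caps, int64_t cap)
{
    return (caps & cap) == cap;
}
```

One column then replaces a row of boolean catalog attributes, which is what makes the upgrade-path question in the thread (change representation first, add capabilities later) worth separating into distinct patches.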
Re: [HACKERS] Convert query plan to sql query
wrote: I don't think SQL can express the information the plan contains. For example, join methods (hash, nest loop, merge). I don't need the way the query will be executed, so there is no need for (hash, nest loop, merge). -- View this message in context: http://postgresql.1045698.n5.nabble.com/Convert-query-plan-to-sql-query-tp5825727p5825877.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
Re: [HACKERS] Convert query plan to sql query
Maybe you want to check how it's done in Postgres-XC. Postgres-XC works on plans being created by PostgreSQL and reverse-engineers queries (for the parts of the plans which are shippable). The notion of shippability may not be of interest to you, but the code to reverse-engineer most of the plan nodes is there in Postgres-XC. I'm glad you used reverse engineering; that's exactly what I need. I didn't get a chance to look at the internals of Postgres-XC, thank you for the info. -- View this message in context: http://postgresql.1045698.n5.nabble.com/Convert-query-plan-to-sql-query-tp5825727p5825878.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
Re: [HACKERS] REINDEX CONCURRENTLY 2.0
On Thu, Oct 30, 2014 at 5:19 PM, Michael Paquier michael.paqu...@gmail.com wrote: Updated patch is attached. Please find attached an updated patch with the following things changed: - Addition of tab completion in psql for all new commands - Addition of a call to WaitForLockers in index_concurrent_swap to ensure that there are no running transactions on the parent table running before exclusive locks are taken on the index and its concurrent entry. Previous patch versions created deadlocks because of that, issue spotted by the isolation tests integrated in the patch. - Isolation tests for reindex concurrently are re-enabled by default. Regards, -- Michael diff --git a/doc/src/sgml/mvcc.sgml b/doc/src/sgml/mvcc.sgml index cd55be8..653b120 100644 --- a/doc/src/sgml/mvcc.sgml +++ b/doc/src/sgml/mvcc.sgml @@ -864,7 +864,8 @@ ERROR: could not serialize access due to read/write dependencies among transact para Acquired by commandVACUUM/command (without optionFULL/option), - commandANALYZE/, commandCREATE INDEX CONCURRENTLY/, and + commandANALYZE/, commandCREATE INDEX CONCURRENTLY/, + commandREINDEX CONCURRENTLY/, commandALTER TABLE VALIDATE/command and other commandALTER TABLE/command variants (for full details see xref linkend=SQL-ALTERTABLE). 
@@ -1143,7 +1144,7 @@ ERROR: could not serialize access due to read/write dependencies among transact sect2 id=locking-pages titlePage-level Locks/title - + para In addition to table and row locks, page-level share/exclusive locks are used to control read/write access to table pages in the shared buffer diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml index cabae19..285f3ff 100644 --- a/doc/src/sgml/ref/reindex.sgml +++ b/doc/src/sgml/ref/reindex.sgml @@ -21,7 +21,7 @@ PostgreSQL documentation refsynopsisdiv synopsis -REINDEX { INDEX | TABLE | DATABASE | SYSTEM } replaceable class=PARAMETERname/replaceable [ FORCE ] +REINDEX { INDEX | TABLE | DATABASE | SYSTEM } [ CONCURRENTLY ] replaceable class=PARAMETERname/replaceable [ FORCE ] /synopsis /refsynopsisdiv @@ -68,9 +68,12 @@ REINDEX { INDEX | TABLE | DATABASE | SYSTEM } replaceable class=PARAMETERnam An index build with the literalCONCURRENTLY/ option failed, leaving an quoteinvalid/ index. Such indexes are useless but it can be convenient to use commandREINDEX/ to rebuild them. Note that - commandREINDEX/ will not perform a concurrent build. To build the - index without interfering with production you should drop the index and - reissue the commandCREATE INDEX CONCURRENTLY/ command. + commandREINDEX/ will perform a concurrent build if literal + CONCURRENTLY/ is specified. To build the index without interfering + with production you should drop the index and reissue either the + commandCREATE INDEX CONCURRENTLY/ or commandREINDEX CONCURRENTLY/ + command. Indexes of toast relations can be rebuilt with commandREINDEX + CONCURRENTLY/. 
/para /listitem @@ -139,6 +142,21 @@ REINDEX { INDEX | TABLE | DATABASE | SYSTEM } replaceable class=PARAMETERnam /varlistentry varlistentry +termliteralCONCURRENTLY/literal/term +listitem + para + When this option is used, productnamePostgreSQL/ will rebuild the + index without taking any locks that prevent concurrent inserts, + updates, or deletes on the table; whereas a standard reindex build + locks out writes (but not reads) on the table until it's done. + There are several caveats to be aware of when using this option + mdash; see xref linkend=SQL-REINDEX-CONCURRENTLY + endterm=SQL-REINDEX-CONCURRENTLY-title. + /para +/listitem + /varlistentry + + varlistentry termliteralFORCE/literal/term listitem para @@ -218,6 +236,194 @@ REINDEX { INDEX | TABLE | DATABASE | SYSTEM } replaceable class=PARAMETERnam reindex anything. /para + refsect2 id=SQL-REINDEX-CONCURRENTLY + title id=SQL-REINDEX-CONCURRENTLY-titleRebuilding Indexes Concurrently/title + + indexterm zone=SQL-REINDEX-CONCURRENTLY + primaryindex/primary + secondaryrebuilding concurrently/secondary + /indexterm + + para +Rebuilding an index can interfere with regular operation of a database. +Normally productnamePostgreSQL/ locks the table whose index is rebuilt +against writes and performs the entire index build with a single scan of the +table. Other transactions can still read the table, but if they try to +insert, update, or delete rows in the table they will block until the +index rebuild is finished. This could have a severe effect if the system is +a live production database. Very large tables can take many hours to be +indexed, and even for smaller tables, an index rebuild can lock out writers +for periods that are unacceptably long for a production system. + /para
Re: [HACKERS] Amazon Redshift
On Wed, Nov 5, 2014 at 5:36 PM, philip taylor philta...@mail.com wrote: do you think we should implement some of the functions offered by Amazon Redshift? http://docs.aws.amazon.com/redshift/latest/dg/c_SQL_functions.html JSON Functions http://docs.aws.amazon.com/redshift/latest/dg/JSON_ARRAY_LENGTH.html http://docs.aws.amazon.com/redshift/latest/dg/JSON_EXTRACT_PATH_TEXT.html We actually have the two above: http://www.postgresql.org/docs/current/static/functions-json.html Actually, if you look at the examples you will see they only copied ours and changed the values a little. -- Jaime Casanova www.2ndQuadrant.com Professional PostgreSQL: Soporte 24x7 y capacitación Phone: +593 4 5107566 Cell: +593 987171157
Re: [HACKERS] [GSoC2014] Patch ALTER TABLE ... SET LOGGED
On Sat, Sep 13, 2014 at 11:02 PM, Fabrízio de Royes Mello fabriziome...@gmail.com wrote: Patch rebased and added to commitfest [1]. It looks like a good thing to remove ATChangeIndexesPersistence; this puts the persistence switch directly into the reindex process. A couple of minor comments about this patch: 1) Reading it, I am wondering if it would not finally be time to switch to a macro to get a relation's persistence, something like RelationGetPersistence in rel.h... Not related directly to this patch. 2) reindex_index has as a new argument a relpersistence value for the new index. reindex_relation, differently, has a new set of flags to enforce the relpersistence of all the underlying indexes. Wouldn't it be better for API consistency to pass a relpersistence value directly through reindex_relation? In any case, the comment block of reindex_relation does not contain a description of the new flags. 3) Here you may as well just set the value and be done: +/* + * Check if need to set the new relpersistence + */ +if (iRel->rd_rel->relpersistence != relpersistence) +iRel->rd_rel->relpersistence = relpersistence; Regards, -- Michael
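The accessor macro suggested in point 1 might look like the following. The structs here are mocked down to the single field needed for illustration; the real RelationData and FormData_pg_class in PostgreSQL carry many more fields, and no such macro is confirmed to exist in rel.h:

```c
#include <assert.h>

/* Mocked-down stand-ins for the catalog structs. */
typedef struct FormData_pg_class
{
    char relpersistence;    /* 'p' permanent, 'u' unlogged, 't' temp */
} FormData_pg_class;

typedef struct RelationData
{
    FormData_pg_class *rd_rel;  /* cached pg_class tuple */
} RelationData;

typedef RelationData *Relation;

/* The hypothetical accessor the review suggests adding to rel.h. */
#define RelationGetPersistence(relation) \
    ((relation)->rd_rel->relpersistence)
```

With such a macro, call sites compare or assign `RelationGetPersistence(iRel)` instead of spelling out the `rd_rel` dereference each time, which is the consistency win the review is after.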
Re: [HACKERS] WAL format and API changes (9.5)
On Wed, Nov 5, 2014 at 2:56 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 11/05/2014 09:06 AM, Amit Kapila wrote: 2. XLogRecPtr XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn) So the scenario is that: * XLogRecordAssemble decides that a page doesn't need to be backed up * both RedoRecPtr and doPageWrites change while building the record. doPageWrites goes from true to false. Without the patch, we would retry, because the first check sees that RedoRecPtr has changed. With the patch, we notice that even though RedoRecPtr has changed, doPageWrites is now off, so no FPWs are required regardless of RedoRecPtr, and we do not retry. Yeah, you're right, the behavior changes in that case. However, the new behavior is correct; the retry is unnecessary in that scenario. How does it interact with backup? Basically, in stop backup we first change forcePageWrite to false and then get the stop WAL location by inserting XLOG_BACKUP_END, so it seems to me that it is quite possible that the record which a backend is inserting using XLogInsert() will be considered part of the backup. Now shouldn't this record contain an FPW if forcePageWrite was true when XLogInsert() started, to avoid any torn page in the backup? 3. There are a couple of places where *XLogInsert* is used in wal.sgml and it seems to me some of those need changes w.r.t this patch. Hmm. It's a bit strange to mention the XLogInsert and XLogFlush functions by name in that text. It's otherwise admin-level stuff, on how to set WAL-related settings. Up to 9.2 that manual page actually called them LogInsert and LogFlush. But I guess you're right; if we are to mention the function by name, it should say XLogInsertRecord. Agreed. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Convert query plan to sql query
mariem mariem.benfad...@gmail.com writes: I don't think SQL can express the information the plan contains. For example, join methods (hash, nest loop, merge). I don't need the way the query will be executed, so there is no need for (hash, nest loop, merge). If you don't need that, why are you insistent on extracting the information from a plan tree? It seems far simpler to me to make use of ruleutils.c to reverse-list the original parsetree. That functionality already exists and is well tested and well maintained. If you insist on working from a plan tree, you will be writing a fair amount of code that you will have to maintain yourself. And I absolutely, positively guarantee that we will break it in every major release, and occasionally in minor releases. You should read the git history of explain.c and ruleutils.c and ask yourself whether you want to keep up with that level of churn. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Repeatable read and serializable transactions see data committed after tx start
Jim Nasby jim.na...@bluetreble.com writes: Hmm... we do have transaction_timestamp(); perhaps we could leave that as the time BEGIN executed and shift everything else to use the snapshot time. It's not possible to take a timestamp that *exactly* matches the snapshot time. We could rearrange the code so that we ask the kernel for timeofday just before or after capturing some relevant snapshot, but there's still going to be some skew there. In any case, I believe we document those timestamps as being the time of arrival of a command from the client, not in terms of any snapshots. To the extent that people use now() to represent the time of insertion of data, I think the command arrival time is arguably the best definition. Delaying it until the server code gets around to capturing a snapshot would not be an improvement. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tracking commit timestamps
On 2014-11-05 19:31:52 -0500, Steve Singer wrote: On 11/05/2014 05:43 PM, Andres Freund wrote: On 2014-11-05 17:17:05 -0500, Steve Singer wrote: Imo that's essentially a different feature. What you essentially would need here is a 'commit sequence number' - but no timestamps. And probably, to be useful, that number has to be 8 bytes in itself. I think this gets to the heart of some of the differing views people have expressed on this patch. I think it's actually beside the heart... Is this patch supposed to: A) Add commit timestamp tracking but nothing more; B) Add infrastructure to store commit timestamps and provide a facility for storing additional bits of data extensions might want to associate with the commit; or C) Add commit timestamps and node identifiers to commits? If the answer is (A) then I can see why some people are objecting to adding extradata. If the answer is (B) then it's fair to ask how well this patch handles the various types of things people might want to attach to the commit record (such as the LSN). I think there's a mistake exactly here. The LSN of the commit isn't just some extra information about the commit. You can't say 'here, also attach this piece of information'. Instead you need special-case code in xact.c to add it, thus prohibiting that space from being used for something else. I think the problem is that once you start providing a facility extensions can use to store data along with the commit record, then being restricted to 4 or 8 bytes is very limiting. Well, you can easily use those 4/8 bytes to start adding more data to the transaction, by referring to some table with transaction metadata for example. It also doesn't allow you to load two such extensions at once on a system. You wouldn't be able to have both the 'steve_commit_order' extension and BDR installed at the same time. I don't think this patch does a very good job at (B), but it wasn't intended to. Well, I don't agree that steve_commit_order makes much sense in this context.
But there actually is a real problem here, namely that there's no namespacing in those bytes. I'd be ok with saying that we split the extradata into four bytes for the namespace and four for the actual data. That's roughly how it works for advisory locks already. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers