Re: [HACKERS] discarding duplicate indexes
On 20/12/12 14:57, Josh Kupershmidt wrote: CREATE TABLE test (id int); CREATE INDEX test_idx1 ON test (id); CREATE INDEX test_idx2 ON test (id); I initially misread your example code, but after I realised my mistake, I thought of an alternative scenario that might be worth considering: CREATE TABLE test (id int, sub int, payload text); CREATE INDEX test_idx1 ON test (id, sub); CREATE INDEX test_idx2 ON test (id); Now test_idx2 is logically included in test_idx1, but if the majority of transactions only query on id, then test_idx2 would be the better choice, as it ties up less RAM. Cheers, Gavin
Re: [HACKERS] [GENERAL] trouble with pg_upgrade 9.0 - 9.1
19.12.2012, 21:47, Tom Lane t...@sss.pgh.pa.us: Kevin Grittner kgri...@mail.com writes: Groshev Andrey wrote: Mismatch of relation names: database database, old rel public.lob.ВерсияВнешнегоДокумента$Документ_pkey, new rel public.plob.ВерсияВнешнегоДокумента$Документ There is a limit on identifiers of 63 *bytes* (not characters), after which the name is truncated. In UTF8 encoding, the underscore would be in the 64th position. Hmm ... that is a really good point, except that you are not counting the "lob." or "plob." part, which we previously saw is part of the relation name, not the schema name. Counting that part, it's already over the limit, which seems to be proof that Andrey isn't using UTF8 but some single-byte encoding. Anyway, that would only explain the issue if pg_upgrade were somehow changing the database encoding, which surely we'd have heard complaints about already? Or maybe this has something to do with pg_upgrade's client-side encoding rather than the server encoding... regards, tom lane I initialized the data directory with ru_RU.UTF8, but this database uses CP1251, i.e. one byte per character. Agreed. This is a complicated report because the identifiers: * contain periods * are long * are in cyrillic * don't use utf8 * are very similar However, I just can't see how these could be causing the problem. Looking at the 9.1 pg_upgrade code, we already know that there are the same number of relations in the old and new clusters, so everything must be getting restored. And there is a lob.* and a plob.* that exist. The C code is also saying that the pg_class.oid of the lob.* relation in the old database is the same as that of the plob.* relation in the new database. The question is how that is happening. Can you email me privately the output of: pg_dump --schema-only --binary-upgrade database Thanks.
If you want to debug this yourself, check these lines in the pg_dump output: -- For binary upgrade, must preserve pg_class oids SELECT binary_upgrade.set_next_index_pg_class_oid('786665369'::pg_catalog.oid); ALTER TABLE ONLY lob.ВерсияВнешнегоДокумента$Документ ADD CONSTRAINT plob.ВерсияВнешнегоДокумента$Документ PRIMARY KEY (@Файл, Страница); See that 786665369? That is the pg_class.oid of the plob.* entry in the old cluster, and hopefully in the new one. Find where the lob*_pkey index is created and get that oid. The pg_class.oid values should match between identically named entries in the old and new clusters, but it seems the new plob* oid is matching the lob* oid in the old cluster. Also, pg_upgrade sorts everything by oid, so it can't be that pg_upgrade is somehow ordering things wrong; and because we already passed the oid check, we already know they have the same oid but different names. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] discarding duplicate indexes
On 12/20/2012 12:26 AM, Gavin Flower wrote: CREATE TABLE test (id int, sub int, payload text); CREATE INDEX test_idx1 ON test (id, sub); CREATE INDEX test_idx2 ON test (id); Now test_idx2 is logically included in test_idx1, but if the majority of transactions only query on id, then test_idx2 would be the better choice, as it ties up less RAM. If sub is an integer, that index isn't that much larger. Both indexes need to index all the rows, and with the header and block overhead, the extra word isn't that big of a deal. As long as there are some transactions using the other index, most of both of them will likely want to be in memory, so you'll end up using MORE memory.
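The size argument is easy to check empirically. A sketch (table and index names follow the example above; the exact sizes depend on version and platform, and on 64-bit builds alignment padding often makes the single-column int4 index no smaller per entry at all):

```sql
-- Populate the example table so the size comparison is meaningful
CREATE TABLE test (id int, sub int, payload text);
INSERT INTO test SELECT i, i % 100, 'x' FROM generate_series(1, 1000000) i;
CREATE INDEX test_idx1 ON test (id, sub);
CREATE INDEX test_idx2 ON test (id);

-- Compare on-disk sizes: each leaf entry carries an index tuple header
-- and line pointer regardless of key width, so the delta is modest
SELECT relname, pg_size_pretty(pg_relation_size(oid))
FROM pg_class WHERE relname IN ('test_idx1', 'test_idx2');

-- Check which index the planner actually picks for an id-only query
EXPLAIN SELECT * FROM test WHERE id = 42;
```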
Re: [HACKERS] Review of Row Level Security
On 20 December 2012 00:24, Robert Haas robertmh...@gmail.com wrote: On Wed, Dec 19, 2012 at 12:54 PM, Simon Riggs si...@2ndquadrant.com wrote: I can see a use case for not having security apply for users who have *only* INSERT privilege. This would allow people to run bulk loads of data into a table with row security. We should add that. That is not the common case, so with proper documentation it should be a useful feature without relaxing default security. Never applying security for INSERT and then forcing them to add BEFORE triggers if they want full security is neither secure nor performant. I think INSERT vs. not-INSERT is not the relevant distinction, because the question also arises for UPDATE. Not sure I understand you. You suggested it was a valid use case for a user to have only INSERT privilege and wish to bypass security checks. I agreed and suggested it could be special-cased. In the UPDATE case, the question is whether the RLS qual should be checked only against the OLD tuple (to make sure that we can see the tuple to modify it) or also against the NEW tuple (to make sure that we're not modifying it to a form that we can no longer see). In other words, the question is not "do we support all of the commands?" but rather "do we check not only the tuple read but also the tuple written?". For INSERT, we only write a tuple, without reading. For SELECT and DELETE, we only read a tuple, without writing a new one. UPDATE does both a read and a write. I'm not sure what this comment adds to the discussion; what you say is understood. Previously, I suggested that we handle this by enforcing row-level security only on data read from the table - the OLD row, so to speak - and not on data written to the table - the NEW row, so to speak - because the latter case can be handled well enough by triggers. (The OLD case cannot, because not seeing the row is different from erroring out when you do see it.)
There are other alternatives, like allowing the user to specify which behavior they want. But I think that simply decreeing that the policy will apply not only to rows read but also to rows written, in all cases, will be less flexible than we will ultimately want to be. As discussed, we should add a security feature that is secure by default. Adding options to make it less secure can follow the initial commit. We might even make it in this release if the review of the main feature goes well. Saying that something is or is not secure is not meaningful without defining what you want to be secure against. There's nothing insecure about checking only the tuples read; it's just a different (and useful) threat model. There are three main points: * "Applies to all commands" should not be implemented via triggers. That is a complex, slow, unacceptable thing to force upon users. Doing it that way raises the question of why we would have the feature at all, since we already have triggers and barrier views. * The default for row security should be "applies to all commands". Anything else may be useful in some cases, but is surprising to users and requires careful thought to determine if it is appropriate. * How to handle asymmetric row security policies? KaiGai has already begun discussing problems caused by a security policy that differs between reads/writes in his latest patch post. That needs further analysis to check that it actually makes sense to allow it, since it is more complex. It would be better to fully analyse that situation and post solutions, rather than simply argue it's OK. Kevin has made good arguments to show there could be value in such a setup; nobody has talked about banning it, but we do need analysis, suggested syntax/mechanisms, and extensive documentation to explain it.
-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Set visibility map bit after HOT prune
On 20 December 2012 00:43, Robert Haas robertmh...@gmail.com wrote: On Wed, Dec 19, 2012 at 12:39 PM, Simon Riggs si...@2ndquadrant.com wrote: The benefit of saying that only UPDATEs clean the block is that this penalises only the workload making the mess, rather than everybody cleaning up repeatedly after one messy guy. Right, but there are plenty of situations where having everybody clean up after the messy guy is better than waiting around and hoping that Mom (aka vacuum) will do it. The problems I see are that cleaning on SELECT is too frequent, interferes with foreground performance, and re-dirties data blocks too often. Waiting for Mom is configurable, since we can set parameters for autovacuum. But we can't turn off the cleaning by SELECTs, which makes the configurability of autovacuum somewhat moot. We could also contact the Cleaner instead. ISTM that if someone spots a block that could use cleanup, they mark the block as BM_PIN_COUNT_WAITER, but don't set the pid. Then when they unpin the block they send a signal/queue work for a special cleaning process to come in and do the work now that nobody is waiting. Logic would allow VACUUMs to go past that by setting the pid. If we prioritised cleanup onto blocks that are already dirty, we would minimise I/O. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] [GENERAL] trouble with pg_upgrade 9.0 - 9.1
On Thu, Dec 20, 2012 at 08:55:16AM +0400, Groshev Andrey wrote: No, the old database does not use a table plob.*, only the primary key: -- -- Name: plob.ВерсияВнешнегоДокумента$Документ; Type: CONSTRAINT; Schema: public; Owner: postgres; Tablespace: -- -- For binary upgrade, must preserve pg_class oids SELECT binary_upgrade.set_next_index_pg_class_oid('786665369'::pg_catalog.oid); ALTER TABLE ONLY lob.ВерсияВнешнегоДокумента$Документ ADD CONSTRAINT plob.ВерсияВнешнегоДокумента$Документ PRIMARY KEY (@Файл, Страница); OK, now I know what is happening, though I can't figure out yet how you got there. Basically, when you create a primary key, the name you supply goes into two places: pg_class, for the index, and pg_constraint, for the constraint name. What is happening is that you have a pg_class entry called lob.*_pkey and a pg_constraint entry with plob.*. You can verify it yourself by running queries on the system tables. Let me know if you want me to show you the queries. pg_dump dumps the pg_constraint name when recreating the index, while pg_upgrade uses the pg_class name. When you restore the database into the new cluster, the pg_class index name is lost and the new primary key gets identical pg_class and pg_constraint names. I tried to recreate the problem with these commands:

test=# create table test (x int primary key);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "test_pkey" for table "test"
CREATE TABLE
test=# alter index test_pkey rename to ptest;
ALTER INDEX
test=# select * from pg_constraint where conname = 'ptest';
 conname | connamespace |
---------+--------------+-
 ptest   |         2200 |
(1 row)

test=# select * from pg_class where relname = 'ptest';
 relname | relnamespace |
---------+--------------+-
 ptest   |         2200 |
(1 row)

As you can see, ALTER INDEX renamed both the pg_constraint and pg_class entries. Is it possible someone manually updated the system tables to rename this primary key? That would cause this error message. The fix is just to make sure they match.
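The catalog check Bruce alludes to might look like this (a sketch, not the exact query he has in mind; it joins each index-backed constraint to its pg_class entry and reports the rows where the two names disagree):

```sql
-- Find primary-key/unique constraints whose pg_constraint name differs
-- from the pg_class name of the index that implements them.
-- pg_constraint.conindid (the implementing index's oid) exists in 9.0+.
SELECT n.nspname  AS schema,
       c.conname  AS constraint_name,
       ic.relname AS index_name
FROM pg_constraint c
JOIN pg_class ic    ON ic.oid = c.conindid
JOIN pg_namespace n ON n.oid = c.connamespace
WHERE c.contype IN ('p', 'u')
  AND c.conname <> ic.relname;
```

A healthy cluster should return zero rows; the database in this report would show the lob.*_pkey / plob.* pair.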
Does pg_upgrade need to be modified to handle this case? Are there legitimate cases where they will not match and the index name will not be preserved through a dump/restore? This seems safe:

test=# alter table test add constraint zz primary key using index ii;
NOTICE:  ALTER TABLE / ADD CONSTRAINT USING INDEX will rename index "ii" to "zz"
ALTER TABLE

-- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] Set visibility map bit after HOT prune
On 20 December 2012 00:10, Pavan Deolasee pavan.deola...@gmail.com wrote: I just thought that we can fairly easily limit the damage if we are really worried about SELECTs being penalised. What if we set a configurable limit on the *extra* things that a query may do which are otherwise not very useful for the query itself, but are useful to keep the system healthy and steady. HOT prune definitely counts as one of them, and maybe even the setting of hint bits. (This is a topic for a separate thread though.) I like this idea: transaction_cleanup_limit = -1 (default), 0, 1+ -1 means no limit on the number of cleanups in this transaction, which is the current behaviour. Other numbers are the number of cleanups that will be tolerated in this transaction; once we hit the limit we don't attempt cleanup any more, we just get on with it. The limit is ignored for UPDATEs since they need to clear space for their work. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
[HACKERS] EDB hosted buildfarm animals - extended downtime
Due to circumstances beyond my control (blame the power company), the following buildfarm animals will be down from 27th December until 2nd January: baiji mastodon protosciurus castoroides -- Dave Page Blog: http://pgsnake.blogspot.com Twitter: @pgsnake EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] ThisTimeLineID in checkpointer and bgwriter processes
On Wednesday, December 19, 2012 9:30 PM Heikki Linnakangas wrote: In both checkpointer.c and bgwriter.c, we do this before entering the main loop: /* * Use the recovery target timeline ID during recovery */ if (RecoveryInProgress()) ThisTimeLineID = GetRecoveryTargetTLI(); That seems reasonable. However, since it's only done once, when the process starts up, ThisTimeLineID is never updated in those processes, even though the startup process changes the recovery target timeline. That actually seems harmless to me, and I also haven't heard of any complaints of misbehavior in 9.1 or 9.2 caused by it. I'm not sure why we bother to set ThisTimeLineID in those processes in the first place. It is used in RemoveOldXlogFiles(), so if that function gets called during recovery while ThisTimeLineID is not set, there might be a problem. I think it could get called from CreateRestartPoint() during recovery. With Regards, Amit Kapila.
Re: [HACKERS] [GENERAL] trouble with pg_upgrade 9.0 - 9.1
20.12.2012, 13:00, Bruce Momjian br...@momjian.us: On Thu, Dec 20, 2012 at 08:55:16AM +0400, Groshev Andrey wrote: No, the old database does not use a table plob.*, only the primary key: -- -- Name: plob.ВерсияВнешнегоДокумента$Документ; Type: CONSTRAINT; Schema: public; Owner: postgres; Tablespace: -- -- For binary upgrade, must preserve pg_class oids SELECT binary_upgrade.set_next_index_pg_class_oid('786665369'::pg_catalog.oid); ALTER TABLE ONLY lob.ВерсияВнешнегоДокумента$Документ ADD CONSTRAINT plob.ВерсияВнешнегоДокумента$Документ PRIMARY KEY (@Файл, Страница); OK, now I know what is happening, though I can't figure out yet how you got there. Basically, when you create a primary key, the name you supply goes into two places: pg_class, for the index, and pg_constraint, for the constraint name. What is happening is that you have a pg_class entry called lob.*_pkey and a pg_constraint entry with plob.*. You can verify it yourself by running queries on the system tables. Let me know if you want me to show you the queries. pg_dump dumps the pg_constraint name when recreating the index, while pg_upgrade uses the pg_class name. When you restore the database into the new cluster, the pg_class index name is lost and the new primary key gets identical pg_class and pg_constraint names. I had already started to arrive at this idea myself, when I noticed that pgAdmin describes this index with the _pkey name, while pg_dump uses plob.*.
But your letter immediately pointed me to the end of my research :) I tried to recreate the problem with these commands: test=# create table test (x int primary key); NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "test_pkey" for table "test" CREATE TABLE test=# alter index test_pkey rename to ptest; ALTER INDEX test=# select * from pg_constraint where conname = 'ptest'; conname | connamespace | -+--+- ptest | 2200 | (1 row) test=# select * from pg_class where relname = 'ptest'; relname | relnamespace | -+--+- ptest | 2200 | (1 row) As you can see, ALTER INDEX renamed both the pg_constraint and pg_class entries. Is it possible someone manually updated the system tables to rename this primary key? That would cause this error message. The fix is just to make sure they match. Does pg_upgrade need to be modified to handle this case? Unfortunately, my knowledge is not enough to say. I do not know which comes first in this case: pg_class, pg_constraint, pg_catalog.pg_index, or pg_catalog.pg_indexes. Incidentally, in the last of these:

# select schemaname, tablename, indexname, tablespace from pg_catalog.pg_indexes where indexname like '%ВерсияВнешнегоДокумента$Документ%';
 schemaname |              tablename               |                  indexname                   | tablespace
------------+--------------------------------------+----------------------------------------------+-----------
 public     | lob.ВерсияВнешнегоДокумента$Документ | lob.ВерсияВнешнегоДокумента$Документ_pkey    |
 public     | ВерсияВнешнегоДокумента$Документ     | ВерсияВнешнегоДокумента$Документ_pkey        |
 public     | ВерсияВнешнегоДокумента$Документ     | iВерсияВнешнегоДокумента$Документ-blb_header |
(3 rows)

If pg_upgrade had said that the old database was not in a very good condition, I would have looked for a problem in the database, and not elsewhere. Are there legitimate cases where they will not match and the index name will not be preserved through a dump/restore?
This seems safe: test=# alter table test add constraint zz primary key using index ii; NOTICE: ALTER TABLE / ADD CONSTRAINT USING INDEX will rename index "ii" to "zz" ALTER TABLE -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] PATCH: optimized DROP of multiple tables within a transaction
On 2012-12-20 02:54:48 +0100, Tomas Vondra wrote: On 19.12.2012 02:18, Andres Freund wrote: On 2012-12-17 00:31:00 +0100, Tomas Vondra wrote: I think except for the temp buffer issue mentioned below it's ready.

-DropRelFileNodeAllBuffers(RelFileNodeBackend rnode)
+DropRelFileNodeAllBuffers(RelFileNodeBackend *rnodes, int nnodes)
 {
-    int i;
+    int i, j;
+
+    /* sort the list of rnodes */
+    pg_qsort(rnodes, nnodes, sizeof(RelFileNodeBackend), rnode_comparator);

     /* If it's a local relation, it's localbuf.c's problem. */
-    if (RelFileNodeBackendIsTemp(rnode))
+    for (i = 0; i < nnodes; i++)
     {
-        if (rnode.backend == MyBackendId)
-            DropRelFileNodeAllLocalBuffers(rnode.node);
-        return;
+        if (RelFileNodeBackendIsTemp(rnodes[i]))
+        {
+            if (rnodes[i].backend == MyBackendId)
+                DropRelFileNodeAllLocalBuffers(rnodes[i].node);
+        }
     }

While you deal with local buffers here, you don't anymore in the big loop over shared buffers. That wasn't needed earlier since we just returned after noticing we have a local relation, but that's not the case anymore. Hmm, but that would require us to handle the temp relations explicitly wherever we call DropRelFileNodeAllBuffers. Currently there are two such places - smgrdounlink() and smgrdounlinkall(). By placing it into DropRelFileNodeAllBuffers() this code is shared, and I think that's a good thing. But that does not mean the code is perfect - it was based on the assumption that if there's a mix of temp and regular relations, the temp relations will be handled in the first part and the rest in the second one. Maybe it'd be better to improve it so that the temp relations are removed from the array after the first part (and skip the second one if there are no remaining relations). The problem is that afaik without the backend-local part a temporary relation could match the same relfilenode as a full relation, which would have pretty bad consequences. Still surprised this is supposed to be faster than a memcmp, but as you seem to have measured it earlier..
It surprised me too. These are the numbers with the current patch:

1) one by one
=============
 indexes |  0  2  4  6  8  10  12  14  16  18  20
 --------+---------------------------------------
 current | 15 22 28 34 41  75  77  82  92  99 106
 memcmp  | 16 23 29 36 44 122 125 128 153 154 158

Until the number of indexes reaches ~10, the numbers are almost exactly the same. Then the bsearch branch kicks in and it's clear how much slower the memcmp comparator is.

2) batches of 100
=================
 indexes | 0  2   4   6   8  10  12  14  16  18  20
 --------+-----------------------------------------
 current | 3  5   8  10  12  15  17  21  23  27  31
 memcmp  | 4  7  10  13  16  19  22  28  30  32  36

Here the difference is much smaller, but even here memcmp is consistently a bit slower. My theory is that while the current comparator starts with the most variable part (the relation OID), and thus usually stops after comparing the first 4 bytes, memcmp starts from the other end (tablespace and database) and therefore needs to compare more data. That's a good guess. I think it's ok this way, but if you feel like playing you could just change the order in the struct...

+void
+smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
+{
+    int i = 0;
+    RelFileNodeBackend *rnodes;
+    ForkNumber forknum;
+
+    /* initialize an array which contains all relations to be dropped */
+    rnodes = palloc(sizeof(RelFileNodeBackend) * nrels);
+    for (i = 0; i < nrels; i++)
+    {
+        RelFileNodeBackend rnode = rels[i]->smgr_rnode;
+        int which = rels[i]->smgr_which;
+
+        rnodes[i] = rnode;
+
+        /* Close the forks at smgr level */
+        for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+            (*(smgrsw[which].smgr_close)) (rels[i], forknum);
+    }
+
+    /*
+     * Get rid of any remaining buffers for the relation. bufmgr will just
+     * drop them without bothering to write the contents.
+     */
+    DropRelFileNodeAllBuffers(rnodes, nrels);

I think it might be a good idea to handle temp relations/buffers on this level instead of trying to put it into DropRelFileNodeAllBuffers. It would also allow you to only build an array of RelFileNodes instead of RelFileNodeBackends, which might make it slightly faster.
Hmmm, sadly that'd require duplication of code because it needs to be done in smgrdounlink too. It should be fairly small though.

int nrels_nonlocal = 0;

for (i = 0; i < nrels; i++)
{
    RelFileNodeBackend rnode = rels[i]->smgr_rnode;
    int which =
Re: [HACKERS] [GENERAL] trouble with pg_upgrade 9.0 - 9.1
On Thu, Dec 20, 2012 at 03:19:17PM +0400, Groshev Andrey wrote: 20.12.2012, 13:00, Bruce Momjian br...@momjian.us: On Thu, Dec 20, 2012 at 08:55:16AM +0400, Groshev Andrey wrote: No, the old database does not use a table plob.*, only the primary key: -- -- Name: plob.ВерсияВнешнегоДокумента$Документ; Type: CONSTRAINT; Schema: public; Owner: postgres; Tablespace: -- -- For binary upgrade, must preserve pg_class oids SELECT binary_upgrade.set_next_index_pg_class_oid('786665369'::pg_catalog.oid); ALTER TABLE ONLY lob.ВерсияВнешнегоДокумента$Документ ADD CONSTRAINT plob.ВерсияВнешнегоДокумента$Документ PRIMARY KEY (@Файл, Страница); OK, now I know what is happening, though I can't figure out yet how you got there. Basically, when you create a primary key, the name you supply goes into two places: pg_class, for the index, and pg_constraint, for the constraint name. What is happening is that you have a pg_class entry called lob.*_pkey and a pg_constraint entry with plob.*. You can verify it yourself by running queries on the system tables. Let me know if you want me to show you the queries. pg_dump dumps the pg_constraint name when recreating the index, while pg_upgrade uses the pg_class name. When you restore the database into the new cluster, the pg_class index name is lost and the new primary key gets identical pg_class and pg_constraint names. I had already started to arrive at this idea myself, when I noticed that pgAdmin describes this index with the _pkey name, while pg_dump uses plob.*. But your letter immediately pointed me to the end of my research :) Good.
I tried to recreate the problem with these commands: test=# create table test (x int primary key); NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "test_pkey" for table "test" CREATE TABLE test=# alter index test_pkey rename to ptest; ALTER INDEX test=# select * from pg_constraint where conname = 'ptest'; conname | connamespace | -+--+- ptest | 2200 | (1 row) test=# select * from pg_class where relname = 'ptest'; relname | relnamespace | -+--+- ptest | 2200 | (1 row) As you can see, ALTER INDEX renamed both the pg_constraint and pg_class entries. Is it possible someone manually updated the system tables to rename this primary key? That would cause this error message. The fix is just to make sure they match. Does pg_upgrade need to be modified to handle this case? Unfortunately, my knowledge is not enough to say. I do not know which comes first in this case: pg_class, pg_constraint, pg_catalog.pg_index, or pg_catalog.pg_indexes. Incidentally, in the last of these: # select schemaname, tablename, indexname, tablespace from pg_catalog.pg_indexes where indexname like '%ВерсияВнешнегоДокумента$Документ%'; schemaname | tablename | indexname | tablespace +--+--+ public | lob.ВерсияВнешнегоДокумента$Документ | lob.ВерсияВнешнегоДокумента$Документ_pkey | public | ВерсияВнешнегоДокумента$Документ | ВерсияВнешнегоДокумента$Документ_pkey | public | ВерсияВнешнегоДокумента$Документ | iВерсияВнешнегоДокумента$Документ-blb_header | (3 rows) If pg_upgrade had said that the old database was not in a very good condition, I would have looked for a problem in the database, and not elsewhere. pg_catalog.pg_indexes is a view. You can modify pg_class to match the pg_constraint name. You might be able to just rename the index in pgAdmin to match. Perhaps pgAdmin allowed this mismatch to happen? -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
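If one did want to repair such a mismatch directly, the least invasive route is DDL rather than catalog surgery, since in Bruce's experiment above ALTER INDEX ... RENAME updated the pg_class and pg_constraint names together. A sketch (the quoted identifier follows the names from this report; verify behaviour on the actual server version, and take a backup first):

```sql
-- Renaming via DDL keeps pg_class and pg_constraint in sync,
-- so after this the two names can no longer disagree.
ALTER INDEX "lob.ВерсияВнешнегоДокумента$Документ_pkey"
    RENAME TO "plob.ВерсияВнешнегоДокумента$Документ";
```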
Re: [HACKERS] ThisTimeLineID in checkpointer and bgwriter processes
On 20.12.2012 12:08, Amit Kapila wrote: On Wednesday, December 19, 2012 9:30 PM Heikki Linnakangas wrote: In both checkpointer.c and bgwriter.c, we do this before entering the main loop: /* * Use the recovery target timeline ID during recovery */ if (RecoveryInProgress()) ThisTimeLineID = GetRecoveryTargetTLI(); That seems reasonable. However, since it's only done once, when the process starts up, ThisTimeLineID is never updated in those processes, even though the startup process changes the recovery target timeline. That actually seems harmless to me, and I also haven't heard of any complaints of misbehavior in 9.1 or 9.2 caused by it. I'm not sure why we bother to set ThisTimeLineID in those processes in the first place. It is used in RemoveOldXlogFiles(), so if that function gets called during recovery while ThisTimeLineID is not set, there might be a problem. I think it could get called from CreateRestartPoint() during recovery. Hmm, right, it's used for this: XLogFileName(lastoff, ThisTimeLineID, segno); The resulting lastoff string, which is an xlog file name like 000000020000000000000008, is used to compare the file names of the files in pg_xlog. However, the tli part, the first 8 characters, is ignored for comparison purposes. In addition to that, 'lastoff' is printed in a DEBUG2 line, purely for debugging purposes. So I think we're good on that front. But I'll add a comment there, and use 0 explicitly instead of ThisTimeLineID, for clarity. - Heikki
Re: [HACKERS] [GENERAL] trouble with pg_upgrade 9.0 - 9.1
20.12.2012, 11:43, Bruce Momjian br...@momjian.us: 19.12.2012, 21:47, Tom Lane t...@sss.pgh.pa.us: Kevin Grittner kgri...@mail.com writes: Groshev Andrey wrote: Mismatch of relation names: database database, old rel public.lob.ВерсияВнешнегоДокумента$Документ_pkey, new rel public.plob.ВерсияВнешнегоДокумента$Документ There is a limit on identifiers of 63 *bytes* (not characters), after which the name is truncated. In UTF8 encoding, the underscore would be in the 64th position. Hmm ... that is a really good point, except that you are not counting the "lob." or "plob." part, which we previously saw is part of the relation name, not the schema name. Counting that part, it's already over the limit, which seems to be proof that Andrey isn't using UTF8 but some single-byte encoding. Anyway, that would only explain the issue if pg_upgrade were somehow changing the database encoding, which surely we'd have heard complaints about already? Or maybe this has something to do with pg_upgrade's client-side encoding rather than the server encoding... regards, tom lane I initialized the data directory with ru_RU.UTF8, but this database uses CP1251, i.e. one byte per character. Agreed. This is a complicated report because the identifiers: * contain periods * are long * are in cyrillic * don't use utf8 * are very similar However, I just can't see how these could be causing the problem. Looking at the 9.1 pg_upgrade code, we already know that there are the same number of relations in the old and new clusters, so everything must be getting restored. And there is a lob.* and a plob.* that exist. The C code is also saying that the pg_class.oid of the lob.* relation in the old database is the same as that of the plob.* relation in the new database. The question is how that is happening. Can you email me privately the output of: pg_dump --schema-only --binary-upgrade database Thanks.
If you want to debug this yourself, check these lines in the pg_dump output: -- For binary upgrade, must preserve pg_class oids SELECT binary_upgrade.set_next_index_pg_class_oid('786665369'::pg_catalog.oid); ALTER TABLE ONLY lob.ВерсияВнешнегоДокумента$Документ ADD CONSTRAINT plob.ВерсияВнешнегоДокумента$Документ PRIMARY KEY (@Файл, Страница); See that 786665369? That is the pg_class.oid of the plob.* entry in the old cluster, and hopefully in the new one. Find where the lob*_pkey index is created and get that oid. The pg_class.oid values should match between identically named entries in the old and new clusters, but it seems the new plob* oid is matching the lob* oid in the old cluster. Also, pg_upgrade sorts everything by oid, so it can't be that pg_upgrade is somehow ordering things wrong; and because we already passed the oid check, we already know they have the same oid but different names. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + Yes, one last question: how do I find out which name should stay? And of course, I forgot to say a great big thank you!
Re: [HACKERS] [GENERAL] trouble with pg_upgrade 9.0 - 9.1
On Thu, Dec 20, 2012 at 03:41:37PM +0400, Groshev Andrey wrote:
> > See that 786665369? That is the pg_class.oid of the plob in the old cluster, and hopefully the new one. Find where the lob*_pkey index is created and get that oid. Those should match for the same names in the old and new clusters, but it seems the new plob* oid is matching the lob oid in the old cluster. Also, pg_upgrade sorts everything by oid, so it can't be that somehow pg_upgrade isn't ordering things right, and because we already passed the oid check, we already know they have the same oid, but different names.
>
> Yes, that was the last question. How do I find out which version should stay? And of course, I forgot to say a great big thank you!

You can pick either name to be the right one; they just have to match. The oids are fine.

-- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: patch submission: truncate trailing nulls from heap rows to reduce the size of the null bitmap [Review]
On 13 October 2012 08:54, Amit kapila amit.kap...@huawei.com wrote:

As per the last discussion of this patch, performance data needs to be provided before its review can proceed further. So, as per your suggestion and the discussions about this patch, I have collected the performance data as below.

Results are taken with the following configuration:
1. Schema - UNLOGGED TABLE with 2,000,000 records, all columns of INT type.
2. shared_buffers = 10GB
3. All performance results are taken with a single connection.
4. Performance is collected for the INSERT operation (insert into temptable select * from inittable).

Platform details:
Operating System: Suse-Linux 10.2 x86_64
Hardware: 4 core (Intel(R) Xeon(R) CPU L5408 @ 2.13GHz)
RAM: 24GB

Documents attached:
init.sh: creates the schema
sql_used.sql: SQLs used for taking the results
Trim_Nulls_Perf_Report.html: performance data

Observations from the performance results:
1. There is no performance change for columns that have all valid (non-NULL) values.
2. There is a visible performance increase when the number of columns containing NULLs is more than 60~70% in tables with a large number of columns.
3. There are visible space savings when the number of columns containing NULLs is more than 60~70% in tables with a large number of columns.

Let me know if more performance data needs to be collected for this patch.

I can't make sense of your performance report, and because of that I can't derive the same conclusions from it that you do. Can you explain the performance results in more detail, so we can see what they mean? For example, which are the patched and which are the unpatched results? Which results are comparable, what do the percentages mean, etc.? We might then move quickly towards commit, or at least more tests.
-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] pg_top
Hi List, This might not be the right place to post, but I've got a minor patch for pg_top and would like to submit it for review. Basically, the rpm version in the repositories (pg_top92-3.6.2) doesn't work with PostgreSQL 9.2:

    #define QUERY_PROCESSES \
            "SELECT procpid\n" \
            "FROM pg_stat_activity;"

It appears that procpid has been renamed to pid at some point; also, the column current_query appears to have been shortened to query. My patch updates a couple of queries to use the new shorter column names. Where or to whom should I submit the changes?

Thanks, Brett

_ Brett Maton t: +44 (0) 2392 658 264 m: +44 (0) 7427 610 472 e: mat...@ltresources.co.uk http://www.ltresources.co.uk/
Re: [HACKERS] [GENERAL] trouble with pg_upgrade 9.0 - 9.1
On Wed, Dec 19, 2012 at 10:19:30PM -0500, Bruce Momjian wrote: On Wed, Dec 19, 2012 at 12:56:05PM -0500, Kevin Grittner wrote: Groshev Andrey wrote: Mismatch of relation names: database "database", old rel public.lob.ВерсияВнешнегоДокумента$Документ_pkey, new rel public.plob.ВерсияВнешнегоДокумента$Документ

There is a limit on identifiers of 63 *bytes* (not characters), after which the name is truncated. In UTF8 encoding, the underscore would be in the 64th position.

OK, Kevin is certainly pointing out a bug in the pg_upgrade code, though I am unclear how it would exhibit the mismatch error reported. pg_upgrade uses NAMEDATALEN for database, schema, and relation name storage lengths. While NAMEDATALEN works fine in the backend, it is possible that a frontend client, like pg_upgrade, could retrieve a name in the client encoding whose length exceeds NAMEDATALEN if the client encoding does not match the database encoding (or is it the cluster encoding for system tables?). This would cause truncation of these values. The truncation would not cause crashes, but might cause failures from not being able to connect to databases with overly long names, and it weakens the checking of relation/schema names --- the same check that is reported above. (I believe initdb.c also erroneously uses NAMEDATALEN.)

I have developed the attached patch to pg_strdup() the string returned from libpq, rather than use a fixed NAMEDATALEN buffer to store the value. I am only going to apply this to 9.3 because I can't see this causing problems except for weaker comparisons of very long identifiers where the client encoding is longer than the server encoding, and failures for very long database names, none of which we have gotten bug reports about.

Turns out initdb.c was fine because it expects collation names to be only in ASCII; I added a comment to that effect.

-- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true.
+

diff --git a/contrib/pg_upgrade/info.c b/contrib/pg_upgrade/info.c
new file mode 100644
index 2250442..5cb9b61
*** a/contrib/pg_upgrade/info.c
--- b/contrib/pg_upgrade/info.c
*************** static void get_db_infos(ClusterInfo *cl
*** 23,29 ****
  static void get_rel_infos(ClusterInfo *cluster, DbInfo *dbinfo);
  static void free_rel_infos(RelInfoArr *rel_arr);
  static void print_db_infos(DbInfoArr *dbinfo);
! static void print_rel_infos(RelInfoArr *arr);

  /*
--- 23,29 ----
  static void get_rel_infos(ClusterInfo *cluster, DbInfo *dbinfo);
  static void free_rel_infos(RelInfoArr *rel_arr);
  static void print_db_infos(DbInfoArr *dbinfo);
! static void print_rel_infos(RelInfoArr *rel_arr);

  /*
*************** create_rel_filename_map(const char *old_
*** 127,134 ****
      map->new_relfilenode = new_rel->relfilenode;

      /* used only for logging and error reporing, old/new are identical */
!     snprintf(map->nspname, sizeof(map->nspname), "%s", old_rel->nspname);
!     snprintf(map->relname, sizeof(map->relname), "%s", old_rel->relname);
  }
--- 127,134 ----
      map->new_relfilenode = new_rel->relfilenode;

      /* used only for logging and error reporing, old/new are identical */
!     map->nspname = old_rel->nspname;
!     map->relname = old_rel->relname;
  }
*************** get_db_infos(ClusterInfo *cluster)
*** 220,227 ****
      for (tupnum = 0; tupnum < ntups; tupnum++)
      {
          dbinfos[tupnum].db_oid = atooid(PQgetvalue(res, tupnum, i_oid));
!         snprintf(dbinfos[tupnum].db_name, sizeof(dbinfos[tupnum].db_name), "%s",
!                  PQgetvalue(res, tupnum, i_datname));
          snprintf(dbinfos[tupnum].db_tblspace, sizeof(dbinfos[tupnum].db_tblspace), "%s",
                   PQgetvalue(res, tupnum, i_spclocation));
      }
--- 220,226 ----
      for (tupnum = 0; tupnum < ntups; tupnum++)
      {
          dbinfos[tupnum].db_oid = atooid(PQgetvalue(res, tupnum, i_oid));
!         dbinfos[tupnum].db_name = pg_strdup(PQgetvalue(res, tupnum, i_datname));
          snprintf(dbinfos[tupnum].db_tblspace, sizeof(dbinfos[tupnum].db_tblspace), "%s",
                   PQgetvalue(res, tupnum, i_spclocation));
      }
*************** get_rel_infos(ClusterInfo *cluster, DbIn
*** 346,355 ****
          curr->reloid = atooid(PQgetvalue(res, relnum, i_oid));

          nspname = PQgetvalue(res, relnum, i_nspname);
!         strlcpy(curr->nspname, nspname, sizeof(curr->nspname));

          relname = PQgetvalue(res, relnum, i_relname);
!         strlcpy(curr->relname, relname, sizeof(curr->relname));

          curr->relfilenode = atooid(PQgetvalue(res, relnum, i_relfilenode));
--- 345,354 ----
          curr->reloid = atooid(PQgetvalue(res, relnum, i_oid));

          nspname = PQgetvalue(res, relnum, i_nspname);
!         curr->nspname = pg_strdup(nspname);

          relname = PQgetvalue(res, relnum, i_relname);
!         curr->relname = pg_strdup(relname);

          curr->relfilenode = atooid(PQgetvalue(res, relnum, i_relfilenode));
*************** free_db_and_rel_infos(DbInfoArr *db_arr)
*** 377,383
Re: [HACKERS] pg_top
On Thursday 20 December 2012, Brett Maton wrote: Hi List, This might not be the right place to post, but I've got a minor patch for pg_top and would like to submit it for review.

This project lives on GitHub: https://github.com/markwkm/pg_top/

I think this commit fixed the issue you are talking about: https://github.com/markwkm/pg_top/commit/a211ac3f161d7ec2752ed038009f93b4f470e86c

-- Say NO to spam and viruses. Stop using Microsoft Windows! -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Switching timeline over streaming replication
On 17.12.2012 15:05, Thom Brown wrote: I just set up 120 chained standbys, and for some reason I'm seeing these errors:

LOG:  replication terminated by primary server
DETAIL:  End of WAL reached on timeline 1
LOG:  record with zero length at 0/301EC10
LOG:  fetching timeline history file for timeline 2 from primary server
LOG:  restarted WAL streaming at 0/300 on timeline 1
LOG:  replication terminated by primary server
DETAIL:  End of WAL reached on timeline 1
LOG:  new target timeline is 2
LOG:  restarted WAL streaming at 0/300 on timeline 2
LOG:  replication terminated by primary server
DETAIL:  End of WAL reached on timeline 2
FATAL:  error reading result of streaming command: ERROR:  requested WAL segment 00020003 has already been removed
ERROR:  requested WAL segment 00020003 has already been removed
LOG:  started streaming WAL from primary at 0/300 on timeline 2
ERROR:  requested WAL segment 00020003 has already been removed

I just committed a patch that should make the "requested WAL segment 00020003 has already been removed" errors go away. The trick was for walsenders to not switch to the new timeline until at least one record has been replayed on it. That closes the window where the walsender already considers the new timeline to be the latest, but the WAL file has not been created yet.

- Heikki

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_top
Ah, I cloned the repository from git://git.postgresql.org/ — I'll have a look at Mark's repo.

Thanks, Brett

-----Original Message----- From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of P. Christeas Sent: 20 December 2012 12:33 To: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] pg_top

On Thursday 20 December 2012, Brett Maton wrote: Hi List, This might not be the right place to post, but I've got a minor patch for pg_top and would like to submit it for review.

This project lives on GitHub: https://github.com/markwkm/pg_top/

I think this commit fixed the issue you are talking about: https://github.com/markwkm/pg_top/commit/a211ac3f161d7ec2752ed038009f93b4f470e86c

-- Say NO to spam and viruses. Stop using Microsoft Windows! -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Switching timeline over streaming replication
On 2012-12-20 14:45:05 +0200, Heikki Linnakangas wrote: On 17.12.2012 15:05, Thom Brown wrote: I just set up 120 chained standbys, and for some reason I'm seeing these errors:

LOG:  replication terminated by primary server
DETAIL:  End of WAL reached on timeline 1
LOG:  record with zero length at 0/301EC10
LOG:  fetching timeline history file for timeline 2 from primary server
LOG:  restarted WAL streaming at 0/300 on timeline 1
LOG:  replication terminated by primary server
DETAIL:  End of WAL reached on timeline 1
LOG:  new target timeline is 2
LOG:  restarted WAL streaming at 0/300 on timeline 2
LOG:  replication terminated by primary server
DETAIL:  End of WAL reached on timeline 2
FATAL:  error reading result of streaming command: ERROR:  requested WAL segment 00020003 has already been removed
ERROR:  requested WAL segment 00020003 has already been removed
LOG:  started streaming WAL from primary at 0/300 on timeline 2
ERROR:  requested WAL segment 00020003 has already been removed

I just committed a patch that should make the "requested WAL segment 00020003 has already been removed" errors go away. The trick was for walsenders to not switch to the new timeline until at least one record has been replayed on it. That closes the window where the walsender already considers the new timeline to be the latest, but the WAL file has not been created yet.

I vote for introducing InvalidTimeLineID soon... 0 as an invalid TimeLineID seems to spread and is annoying to grep for.

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parser Cruft in gram.y
On Wed, Dec 19, 2012 at 10:18 PM, Tom Lane t...@sss.pgh.pa.us wrote: valgrind comes with a tool called cachegrind which can emulate the cache algorithm on some variants of various cpus and produce reports.

Can it be made to produce a report for a specific block of memory?

I believe that oprofile can be persuaded to produce statistics about where in one's code are the most cache misses, not just the most wall-clock ticks; which would shed a lot of light on this question. However, my oprofile-fu doesn't quite extend to actually persuading it.

perf can certainly do this.

$ perf record -a -e cache-misses pgbench -n -S -T 30
...output elided...
$ perf report -d postgres | grep -v '^#' | head
  8.88%  postgres  base_yyparse
  7.05%  swapper   0x807c
  4.67%  postgres  SearchCatCache
  3.77%  pgbench   0x172dd58
  3.47%  postgres  hash_search_with_hash_value
  3.23%  postgres  AllocSetAlloc
  2.58%  postgres  core_yylex
  1.87%  postgres  LWLockAcquire
  1.83%  postgres  fmgr_info_cxt_security
  1.75%  postgres  0x4d1054

For comparison:

$ perf record -a -e cycles -d postgres pgbench -n -S -T 30
$ perf report -d postgres | grep -v '^#' | head
  6.54%  postgres  AllocSetAlloc
  4.08%  swapper   0x4ce754
  3.60%  postgres  hash_search_with_hash_value
  2.74%  postgres  base_yyparse
  2.71%  postgres  MemoryContextAllocZeroAligned
  2.57%  postgres  MemoryContextAlloc
  2.36%  postgres  SearchCatCache
  2.10%  postgres  _bt_compare
  1.70%  postgres  LWLockAcquire
  1.54%  postgres  FunctionCall2Coll

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cascading replication: should we detect/prevent cycles?
On Wed, Dec 19, 2012 at 5:14 PM, Joshua Berkus j...@agliodbs.com wrote: What would such a test look like? It's not obvious to me that there's any rapid way for a user to detect this situation, without checking each server individually. Change something on the master and observe that none of the supposed standbys notice? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ThisTimeLineID in checkpointer and bgwriter processes
On Thursday, December 20, 2012 5:12 PM Heikki Linnakangas wrote: On 20.12.2012 12:08, Amit Kapila wrote: On Wednesday, December 19, 2012 9:30 PM Heikki Linnakangas wrote: In both checkpointer.c and bgwriter.c, we do this before entering the main loop:

    /*
     * Use the recovery target timeline ID during recovery
     */
    if (RecoveryInProgress())
        ThisTimeLineID = GetRecoveryTargetTLI();

That seems reasonable. However, since it's only done once, when the process starts up, ThisTimeLineID is never updated in those processes, even though the startup process changes the recovery target timeline. That actually seems harmless to me, and I also haven't heard of any complaints of misbehavior in 9.1 or 9.2 caused by it. I'm not sure why we bother to set ThisTimeLineID in those processes in the first place.

This is used in RemoveOldXlogFiles(), so if that function gets called during recovery while ThisTimeLineID is not set, there might be a problem. I think it could get called from CreateRestartPoint() during recovery.

Hmm, right, it's used for this:

    XLogFileName(lastoff, ThisTimeLineID, segno);

So I think we're good on that front. But I'll add a comment there, and use 0 explicitly instead of ThisTimeLineID, for clarity.

True, it might not have any functional effect in RemoveOldXlogFiles(). However, it can also be used in PreallocXlogFiles() -> XLogFileInit().

With Regards, Amit Kapila. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ThisTimeLineID in checkpointer and bgwriter processes
On 20 December 2012 13:19, Amit Kapila amit.kap...@huawei.com wrote: So I think we're good on that front. But I'll add a comment there, and use 0 explicitly instead of ThisTimeLineID, for clarity. True, it might not have any functional effect in RemoveOldXlogFiles(). However, it can also be used in PreallocXlogFiles() -> XLogFileInit().

PreallocXlogFiles() should have been put to the sword long ago. It's a performance tweak aimed at people without a performance problem in this area. I'll happily accept this excuse to remove it. We can discuss whether any replacement is required.

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ThisTimeLineID in checkpointer and bgwriter processes
On 2012-12-20 13:32:36 +0000, Simon Riggs wrote: On 20 December 2012 13:19, Amit Kapila amit.kap...@huawei.com wrote: So I think we're good on that front. But I'll add a comment there, and use 0 explicitly instead of ThisTimeLineID, for clarity. True, it might not have any functional effect in RemoveOldXlogFiles(). However, it can also be used in PreallocXlogFiles() -> XLogFileInit().

PreallocXlogFiles() should have been put to the sword long ago. It's a performance tweak aimed at people without a performance problem in this area.

Creating, zeroing and fsync()'ing a 16MB file shouldn't be done in individual transactions because it would increase latency noticeably. So it seems it addresses a real performance problem. What do you dislike about it?

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parser Cruft in gram.y
On 12/18/12 5:10 PM, Robert Haas wrote: I can't help but suspect that the way we handle keywords today is monumentally inefficient. The unreserved_keyword products, et al, just seem somehow badly wrong-headed. We take the trouble to distinguish all of those cases so that we can turn around and not distinguish them. I feel like there ought to be some way to use lexer states to handle this - if we're in a context where an unreserved keyword will be treated as an IDENT, then have the lexer return IDENT when it sees an unreserved keyword. I might be wrong, but it seems like that would eliminate a whole lot of parser state transitions. However, even if I'm right, I have no idea how to implement it. It just seems very wasteful that we have so many parser states that have no purpose other than (effectively) to convert an unreserved_keyword into an IDENT when the lexer could do the same thing much more cheaply given a bit more context.

Another way of attack along these lines might be to use the %glr-parser and then try to cut back on all those redundant rules that were put in to avoid conflicts. The number of keyword categories and such could perhaps also be reduced that way. Of course, this is mostly speculation.

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parser Cruft in gram.y
On 18 December 2012 22:10, Robert Haas robertmh...@gmail.com wrote: Well that would be nice, but the problem is that I see no way to implement it. If, with a unified parser, the parser is 14% of our source code, then splitting it in two will probably crank that number up well over 20%, because there will be duplication between the two. That seems double-plus un-good. I don't think the size of the parser binary is that relevant. What is relevant is how much of that is regularly accessed. Increasing parser cache misses for DDL and increasing size of binary overall are acceptable costs if we are able to swap out the unneeded areas and significantly reduce the cache misses on the well travelled portions of the parser. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ThisTimeLineID in checkpointer and bgwriter processes
On Thursday, December 20, 2012 7:03 PM Simon Riggs wrote: On 20 December 2012 13:19, Amit Kapila amit.kap...@huawei.com wrote: So I think we're good on that front. But I'll add a comment there, and use 0 explicitly instead of ThisTimeLineID, for clarity. True, it might not have any functionality effect in RemoveOldXlogFiles(). However it can be used in PreallocXlogFiles()-XLogFileInit() as well. PreallocXlogFiles() should have been put to the sword long ago. It's a performance tweak aimed at people without a performance problem in this area. Yes, it seems to be for a rare scenario. I'll happily accept this excuse to remove it. We can discuss whether any replacement is required. We should think of alternatives if there is no other reasonable fix for the problem. With Regards, Amit Kapila. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parser Cruft in gram.y
On Thu, Dec 20, 2012 at 8:55 AM, Simon Riggs si...@2ndquadrant.com wrote: On 18 December 2012 22:10, Robert Haas robertmh...@gmail.com wrote: Well that would be nice, but the problem is that I see no way to implement it. If, with a unified parser, the parser is 14% of our source code, then splitting it in two will probably crank that number up well over 20%, because there will be duplication between the two. That seems double-plus un-good. I don't think the size of the parser binary is that relevant. What is relevant is how much of that is regularly accessed. Increasing parser cache misses for DDL and increasing size of binary overall are acceptable costs if we are able to swap out the unneeded areas and significantly reduce the cache misses on the well travelled portions of the parser. I generally agree. We don't want to bloat the size of the parser with wild abandon, but yeah if we can reduce the cache misses on the well-travelled portions that seems like it ought to help. My previous hacky attempt to quantify the potential benefit in this area was: http://archives.postgresql.org/pgsql-hackers/2011-05/msg01008.php On my machine there seemed to be a small but consistent win; on a very old box Jeff Janes tried, it didn't seem like there was any benefit at all. Somehow, I have a feeling we're missing a trick here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] recent ALTER whatever .. SET SCHEMA refactoring
The recent SET SCHEMA refactoring has changed the error message that you get when trying to move a function into the schema that already contains it. For a table, as ever, you get:

rhaas=# create table foo (a int);
CREATE TABLE
rhaas=# alter table foo set schema public;
ERROR:  table "foo" is already in schema "public"

Functions used to produce the same message, but not any more:

rhaas=# create function foo(a int) returns int as $$select 1$$ language sql;
CREATE FUNCTION
rhaas=# alter function foo(int) set schema public;
ERROR:  function "foo" already exists in schema "public"

The difference is that the first error message is complaining about an operation that is a no-op, whereas the second one is complaining about a name collision. It seems a bit off in this case because the name collision is between the function and itself, not between the function and some other object with a conflicting name.

The root of the problem is that AlterObjectNamespace_internal generates both error messages and does the checks in the correct order, but for functions, AlterFunctionNamespace_oid has a second copy of the conflicting-names check that is argument-type aware, which happens before the same-schema check in AlterObjectNamespace_internal. IMHO, we ought to fix this by getting rid of the check in AlterFunctionNamespace_oid and adding an appropriate special case for functions in AlterObjectNamespace_internal that knows how to do the check correctly. It's not a huge deal, but it seems confusing to have AlterObjectNamespace_internal know how to do the check correctly in some cases but not others.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parser Cruft in gram.y
On 2012-12-20 09:11:46 -0500, Robert Haas wrote: On Thu, Dec 20, 2012 at 8:55 AM, Simon Riggs si...@2ndquadrant.com wrote: On 18 December 2012 22:10, Robert Haas robertmh...@gmail.com wrote: Well that would be nice, but the problem is that I see no way to implement it. If, with a unified parser, the parser is 14% of our source code, then splitting it in two will probably crank that number up well over 20%, because there will be duplication between the two. That seems double-plus un-good.

I don't think the size of the parser binary is that relevant. What is relevant is how much of that is regularly accessed. Increasing parser cache misses for DDL and increasing size of binary overall are acceptable costs if we are able to swap out the unneeded areas and significantly reduce the cache misses on the well travelled portions of the parser.

I generally agree. We don't want to bloat the size of the parser with wild abandon, but yeah if we can reduce the cache misses on the well-travelled portions that seems like it ought to help. My previous hacky attempt to quantify the potential benefit in this area was: http://archives.postgresql.org/pgsql-hackers/2011-05/msg01008.php On my machine there seemed to be a small but consistent win; on a very old box Jeff Janes tried, it didn't seem like there was any benefit at all. Somehow, I have a feeling we're missing a trick here.

I don't think you will see too many cache misses on such a low number of extremely simple statements, so it's not too surprising not to see that big a difference with that. Are we sure it's really cache misses and not just the actions performed in the grammar that make bison code show up in profiles? I remember the latter being the case...
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parser Cruft in gram.y
On 2012-12-20 15:45:47 +0100, Andres Freund wrote: On 2012-12-20 09:11:46 -0500, Robert Haas wrote: On Thu, Dec 20, 2012 at 8:55 AM, Simon Riggs si...@2ndquadrant.com wrote: On 18 December 2012 22:10, Robert Haas robertmh...@gmail.com wrote: Well that would be nice, but the problem is that I see no way to implement it. If, with a unified parser, the parser is 14% of our source code, then splitting it in two will probably crank that number up well over 20%, because there will be duplication between the two. That seems double-plus un-good. I don't think the size of the parser binary is that relevant. What is relevant is how much of that is regularly accessed. Increasing parser cache misses for DDL and increasing size of binary overall are acceptable costs if we are able to swap out the unneeded areas and significantly reduce the cache misses on the well travelled portions of the parser. I generally agree. We don't want to bloat the size of the parser with wild abandon, but yeah if we can reduce the cache misses on the well-travelled portions that seems like it ought to help. My previous hacky attempt to quantify the potential benefit in this area was: http://archives.postgresql.org/pgsql-hackers/2011-05/msg01008.php On my machine there seemed to be a small but consistent win; on a very old box Jeff Janes tried, it didn't seem like there was any benefit at all. Somehow, I have a feeling we're missing a trick here. I don't think you will see too many cache misses on such a low number of extremly simply statements, so its not too surprising not to see a that big difference with that. Are we sure its really cache-misses and not just the actions performed in the grammar that make bison code show up in profiles? I remember the latter being the case... Hm. 
A very, very quick perf stat -dvvv of pgbench -S -c 20 -j 20 -T 20 later:

      218350.885559 task-clock                #   10.095 CPUs utilized
          1,676,479 context-switches          #    0.008 M/sec
              2,396 cpu-migrations            #    0.011 K/sec
            796,038 page-faults               #    0.004 M/sec
    506,312,525,518 cycles                    #    2.319 GHz                     [20.00%]
    405,944,435,754 stalled-cycles-frontend   #   80.18% frontend cycles idle    [30.32%]
    236,712,872,641 stalled-cycles-backend    #   46.75% backend cycles idle     [40.51%]
    193,459,120,458 instructions              #    0.38  insns per cycle
                                              #    2.10  stalled cycles per insn [50.70%]
     36,433,144,472 branches                  #  166.856 M/sec                   [51.12%]
      3,623,778,087 branch-misses             #    9.95% of all branches         [50.87%]
     50,344,123,581 L1-dcache-loads           #  230.565 M/sec                   [50.33%]
      5,548,192,686 L1-dcache-load-misses     #   11.02% of all L1-dcache hits   [49.69%]
      2,666,461,361 LLC-loads                 #   12.212 M/sec                   [35.63%]
        112,407,198 LLC-load-misses           #    4.22% of all LL-cache hits    [ 9.67%]

       21.629396701 seconds time elapsed

So there's definitely a noticeable rate of cache misses...

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: patch submission: truncate trailing nulls from heap rows to reduce the size of the null bitmap [Review]
On Thursday, December 20, 2012 5:46 PM Simon Riggs wrote: On 13 October 2012 08:54, Amit kapila amit.kap...@huawei.com wrote: As per the last discussion for this patch, performance data needed to be provided before its review could proceed further. So, as per your suggestion and the discussions about this patch, I have collected the data:

1. There is no performance change for columns that have all valid (non-NULL) values.
2. There is a visible performance increase when more than 60~70% of the columns contain NULLs, in tables with a large number of columns.
3. There are visible space savings when more than 60~70% of the columns contain NULLs, in tables with a large number of columns.

Let me know if more performance data needs to be collected for this patch.

I can't make sense of your performance report. Because of that I can't derive the same conclusions from it that you do. Can you explain the performance results in more detail, so we can see what they mean? Like which are the patched, and which are the unpatched results?

At the extreme left it is mentioned "Original Code / Trim Trailing Nulls Patch". In any case, I have framed the results again as below:

1. Table with 800 columns
   A. INSERT tuples with 600 trailing nulls
   B. UPDATE last column value to non-null
   C. UPDATE last column value to null

         Original Code       | Trim Trailing NULLs | Improvement (%)
         TPS     space used  | TPS    space used   | Results
                 (pages)     |        (pages)      |
   -----+--------------------+---------------------+------------------------
   1A:   0.2068  25          | 0.2302 23           | 10.1% tps, 11.1% space
   1B:   0.0448  50          | 0.0481 472223       |  6.8% tps,  5.6% space
   1C:   0.0433  75          | 0.0493 694445       | 12.2% tps,  7.4% space

2. Table with 800 columns
   A. INSERT tuples with 300 trailing nulls
   B. UPDATE last column value to non-null
   C. UPDATE last column value to null

         Original Code       | Trim Trailing NULLs | Improvement (%)
         TPS     space used  | TPS    space used   | Results
                 (pages)     |        (pages)      |
   -----+--------------------+---------------------+------------------------
   2A:   0.0280  67          | 0.0287 67           |  2.3% tps, 0% space
   2B:   0.0143  134         | 0.0152 134          |  5.3% tps, 0% space
   2C:   0.0145  200         | 0.0149 200          |  2.9% tps, 0% space

3. Table with 300 columns
   A. INSERT tuples with 150 trailing nulls
   B. UPDATE last column value to non-null
   C. UPDATE last column value to null

         Original Code       | Trim Trailing NULLs | Improvement (%)
         TPS     space used  | TPS    space used   | Results
                 (pages)     |        (pages)      |
   -----+--------------------+---------------------+------------------------
   3A:   0.2815  17          | 0.2899 17           |  2.9% tps, 0% space
   3B:   0.0851  34          | 0.0870 34           |  2.2% tps, 0% space
   3C:   0.0846  50          | 0.0852 50           |  0.7% tps, 0% space

4. Table with 300 columns
   A. INSERT tuples with 250 trailing nulls
   B. UPDATE last column value to non-null
   C. UPDATE last column value to null

         Original Code       | Trim Trailing NULLs | Improvement (%)
         TPS     space used  | TPS    space used   | Results
                 (pages)     |        (pages)      |
   -----+--------------------+---------------------+------------------------
   4A:   0.54477             | 0.5996 58824        | 09.2% tps, 11.8% space
   4B:   0.1251  135633      | 0.1232 127790       | -01.5% tps,  5.8% space
   4C:   0.1223  202299      | 0.1361 186613       | 10.1% tps,  7.5% space

Please let me know if it is still not clear. With Regards, Amit Kapila. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parser Cruft in gram.y
On 2012-12-20 15:51:37 +0100, Andres Freund wrote: On 2012-12-20 15:45:47 +0100, Andres Freund wrote: On 2012-12-20 09:11:46 -0500, Robert Haas wrote: On Thu, Dec 20, 2012 at 8:55 AM, Simon Riggs si...@2ndquadrant.com wrote: On 18 December 2012 22:10, Robert Haas robertmh...@gmail.com wrote: Well, that would be nice, but the problem is that I see no way to implement it. If, with a unified parser, the parser is 14% of our source code, then splitting it in two will probably crank that number up well over 20%, because there will be duplication between the two. That seems double-plus un-good. I don't think the size of the parser binary is that relevant. What is relevant is how much of it is regularly accessed. Increasing parser cache misses for DDL and increasing the size of the binary overall are acceptable costs if we are able to swap out the unneeded areas and significantly reduce the cache misses on the well-travelled portions of the parser. I generally agree. We don't want to bloat the size of the parser with wild abandon, but if we can reduce the cache misses on the well-travelled portions, that seems like it ought to help. My previous hacky attempt to quantify the potential benefit in this area was: http://archives.postgresql.org/pgsql-hackers/2011-05/msg01008.php On my machine there seemed to be a small but consistent win; on a very old box Jeff Janes tried, it didn't seem like there was any benefit at all. Somehow, I have a feeling we're missing a trick here. I don't think you will see too many cache misses on such a low number of extremely simple statements, so it's not too surprising not to see that big a difference with that. Are we sure it's really cache misses, and not just the actions performed in the grammar, that make the bison code show up in profiles? I remember the latter being the case... Hm. 
A very, very quick perf stat -dvvv of pgbench -S -c 20 -j 20 -T 20 later:

      218350.885559 task-clock                #   10.095 CPUs utilized
          1,676,479 context-switches          #    0.008 M/sec
              2,396 cpu-migrations            #    0.011 K/sec
            796,038 page-faults               #    0.004 M/sec
    506,312,525,518 cycles                    #    2.319 GHz                     [20.00%]
    405,944,435,754 stalled-cycles-frontend   #   80.18% frontend cycles idle    [30.32%]
    236,712,872,641 stalled-cycles-backend    #   46.75% backend cycles idle     [40.51%]
    193,459,120,458 instructions              #    0.38  insns per cycle
                                              #    2.10  stalled cycles per insn [50.70%]
     36,433,144,472 branches                  #  166.856 M/sec                   [51.12%]
      3,623,778,087 branch-misses             #    9.95% of all branches         [50.87%]
     50,344,123,581 L1-dcache-loads           #  230.565 M/sec                   [50.33%]
      5,548,192,686 L1-dcache-load-misses     #   11.02% of all L1-dcache hits   [49.69%]
      2,666,461,361 LLC-loads                 #   12.212 M/sec                   [35.63%]
        112,407,198 LLC-load-misses           #    4.22% of all LL-cache hits    [ 9.67%]

       21.629396701 seconds time elapsed

So there's definitely a noticeable rate of cache misses...

L1 misses:

# Samples: 997K of event 'L1-dcache-load-misses'
# Overhead  Command   Shared Object      Symbol
# ...
     6.49%  postgres  postgres           [.] SearchCatCache
     3.65%  postgres  postgres           [.] base_yyparse
     3.48%  postgres  postgres           [.] hash_search_with_hash_value
     3.41%  postgres  postgres           [.] AllocSetAlloc
     1.84%  postgres  postgres           [.] LWLockAcquire
     1.40%  postgres  postgres           [.] fmgr_info_cxt_security
     1.36%  postgres  postgres           [.] nocachegetattr
     1.23%  postgres  libc-2.13.so       [.] _int_malloc
     1.20%  postgres  postgres           [.] core_yylex
     1.15%  postgres  postgres           [.] MemoryContextAllocZeroAligned
     0.94%  postgres  postgres           [.] PostgresMain
     0.94%  postgres  postgres           [.] MemoryContextAlloc
     0.91%  postgres  libc-2.13.so       [.] __memcpy_ssse3_back
     0.89%  postgres  postgres           [.] CatalogCacheComputeHashValue
     0.86%  postgres  postgres           [.] PinBuffer
     0.86%  postgres  [kernel.kallsyms]  [k] __audit_syscall_entry
     0.80%  postgres  libc-2.13.so       [.] __strcmp_sse42
     0.80%  postgres  postgres           [.] _bt_compare
     0.78%  postgres  postgres           [.] FunctionCall2Coll
     0.77%  postgres  libc-2.13.so       [.] malloc
     0.73%  postgres  libc-2.13.so       [.] __memset_sse2
     0.72%  postgres  postgres           [.] GetSnapshotData
     0.69%  postgres  [kernel.kallsyms]
Re: [HACKERS] discarding duplicate indexes
On Thu, Dec 20, 2012 at 1:26 AM, Gavin Flower gavinflo...@archidevsys.co.nz wrote: On 20/12/12 14:57, Josh Kupershmidt wrote: CREATE TABLE test (id int); CREATE INDEX test_idx1 ON test (id); CREATE INDEX test_idx2 ON test (id); I initially misread your example code, but after I realised my mistake, I thought of an alternative scenario that might be worth considering. CREATE TABLE test (id int, sub int, payload text); CREATE INDEX test_idx1 ON test (id, sub); CREATE INDEX test_idx2 ON test (id); Now test_idx2 is logically included in test_idx1, but if the majority of transactions only query on id, then test_idx2 would be better, as it ties up less RAM. Well, this situation works without any LIKE ... INCLUDING INDEXES surprises. If you CREATE TABLE test_copycat (LIKE test INCLUDING INDEXES); you should see test_copycat created with both indexes, since indexParams is considered for this deduplication. Josh -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parser Cruft in gram.y
On 2012-12-20 16:04:49 +0100, Andres Freund wrote: On 2012-12-20 15:51:37 +0100, Andres Freund wrote: When doing a source/assembly annotation in SearchCatCache, about half of the misses are attributed to the memcpy() directly at the beginning. Using a separate array for storing the arguments, instead of copying and modifying cache->cc_skey, yields a 2% speedup in pgbench -S for me... The attached patch is clearly not ready and I don't really have the time or energy to pursue it right now, but it seems interesting enough to post. The approach seems solid and sensible, although the implementation is not (too much c&p, no comments). Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 9ae1432..bee6f3d 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -73,7 +73,7 @@ static CatCacheHeader *CacheHdr = NULL;

 static uint32 CatalogCacheComputeHashValue(CatCache *cache, int nkeys,
-                             ScanKey cur_skey);
+                             ScanKey cur_skey, Datum *argument);
 static uint32 CatalogCacheComputeTupleHashValue(CatCache *cache,
                              HeapTuple tuple);
@@ -173,7 +173,7 @@ GetCCHashEqFuncs(Oid keytype, PGFunction *hashfunc, RegProcedure *eqfunc)
  * Compute the hash value associated with a given set of lookup keys
  */
 static uint32
-CatalogCacheComputeHashValue(CatCache *cache, int nkeys, ScanKey cur_skey)
+CatalogCacheComputeHashValue(CatCache *cache, int nkeys, ScanKey cur_skey, Datum *argument)
 {
     uint32      hashValue = 0;
     uint32      oneHash;
@@ -188,28 +188,28 @@ CatalogCacheComputeHashValue(CatCache *cache, int nkeys, ScanKey cur_skey)
         case 4:
             oneHash =
                 DatumGetUInt32(DirectFunctionCall1(cache->cc_hashfunc[3],
-                                                   cur_skey[3].sk_argument));
+                                                   argument[3]));
             hashValue ^= oneHash << 24;
             hashValue ^= oneHash >> 8;
             /* FALLTHROUGH */
         case 3:
             oneHash =
                 DatumGetUInt32(DirectFunctionCall1(cache->cc_hashfunc[2],
-                                                   cur_skey[2].sk_argument));
+                                                   argument[2]));
             hashValue ^= oneHash << 16;
             hashValue ^= oneHash >> 16;
             /* FALLTHROUGH */
         case 2:
             oneHash =
                 DatumGetUInt32(DirectFunctionCall1(cache->cc_hashfunc[1],
-                                                   cur_skey[1].sk_argument));
+                                                   argument[1]));
             hashValue ^= oneHash << 8;
             hashValue ^= oneHash >> 24;
             /* FALLTHROUGH */
         case 1:
             oneHash =
                 DatumGetUInt32(DirectFunctionCall1(cache->cc_hashfunc[0],
-                                                   cur_skey[0].sk_argument));
+                                                   argument[0]));
             hashValue ^= oneHash;
             break;
         default:
@@ -228,17 +228,14 @@ CatalogCacheComputeHashValue(CatCache *cache, int nkeys, ScanKey cur_skey)
 static uint32
 CatalogCacheComputeTupleHashValue(CatCache *cache, HeapTuple tuple)
 {
-    ScanKeyData cur_skey[CATCACHE_MAXKEYS];
+    Datum       arguments[CATCACHE_MAXKEYS];
     bool        isNull = false;

-    /* Copy pre-initialized overhead data for scankey */
-    memcpy(cur_skey, cache->cc_skey, sizeof(cur_skey));
-
     /* Now extract key fields from tuple, insert into scankey */
     switch (cache->cc_nkeys)
     {
         case 4:
-            cur_skey[3].sk_argument =
+            arguments[3] =
                 (cache->cc_key[3] == ObjectIdAttributeNumber)
                 ? ObjectIdGetDatum(HeapTupleGetOid(tuple))
                 : fastgetattr(tuple,
@@ -248,7 +245,7 @@ CatalogCacheComputeTupleHashValue(CatCache *cache, HeapTuple tuple)
             Assert(!isNull);
             /* FALLTHROUGH */
         case 3:
-            cur_skey[2].sk_argument =
+            arguments[2] =
                 (cache->cc_key[2] == ObjectIdAttributeNumber)
                 ? ObjectIdGetDatum(HeapTupleGetOid(tuple))
                 : fastgetattr(tuple,
@@ -258,7 +255,7 @@ CatalogCacheComputeTupleHashValue(CatCache *cache, HeapTuple tuple)
             Assert(!isNull);
             /* FALLTHROUGH */
         case 2:
-            cur_skey[1].sk_argument =
+            arguments[1] =
                 (cache->cc_key[1] == ObjectIdAttributeNumber)
                 ? ObjectIdGetDatum(HeapTupleGetOid(tuple))
                 : fastgetattr(tuple,
@@ -268,7 +265,7 @@ CatalogCacheComputeTupleHashValue(CatCache *cache, HeapTuple tuple)
             Assert(!isNull);
             /* FALLTHROUGH */
         case 1:
-            cur_skey[0].sk_argument =
+            arguments[0] =
                 (cache->cc_key[0] == ObjectIdAttributeNumber)
                 ? ObjectIdGetDatum(HeapTupleGetOid(tuple))
                 : fastgetattr(tuple,
@@ -282,7 +279,7 @@ CatalogCacheComputeTupleHashValue(CatCache *cache, HeapTuple tuple)
             break;
     }

-    return CatalogCacheComputeHashValue(cache, cache->cc_nkeys, cur_skey);
+    return CatalogCacheComputeHashValue(cache, cache->cc_nkeys, cache->cc_skey, arguments);
 }
@@ -1058,6 +1055,7 @@ SearchCatCache(CatCache *cache,
                Datum v4)
 {
     ScanKeyData cur_skey[CATCACHE_MAXKEYS];
+    Datum       arguments[CATCACHE_MAXKEYS];
     uint32      hashValue;
     Index       hashIndex;
     dlist_iter  iter;
@@ -1080,16 +1078,15 @@ SearchCatCache(CatCache *cache,
     /*
      * initialize the search key information
      */
-    memcpy(cur_skey, cache->cc_skey, sizeof(cur_skey));
-
Re: [HACKERS] operator dependency of commutator and negator, redux
Brendan Jurd dire...@gmail.com writes: On 20 December 2012 11:51, Tom Lane t...@sss.pgh.pa.us wrote: While reconsidering the various not-too-satisfactory fixes we thought of back then, I had a sudden thought. Instead of having a COMMUTATOR or NEGATOR forward reference create a shell operator and link to it, why not simply *ignore* such references? Then when the second operator is defined, go ahead and fill in both links? Ignore with warning sounds pretty good. So it would go something like this? # CREATE OPERATOR (... COMMUTATOR ); WARNING: COMMUTATOR (foo, foo) undefined, ignoring. CREATE OPERATOR # CREATE OPERATOR (... COMMUTATOR ); CREATE OPERATOR I was thinking a NOTICE at most. If it's a WARNING then restoring perfectly valid pg_dump files will result in lots of scary-looking chatter. You could make an argument for printing nothing at all, but that would probably mislead people who'd fat-fingered their COMMUTATOR entries. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Set visibility map bit after HOT prune
On Wed, Dec 19, 2012 at 11:12 PM, Pavan Deolasee pavan.deola...@gmail.com wrote: Yeah, VM buffer contention can become prominent if we break the invariant that the page-level bit status implies the VM bit status, at least when it's clear. OTOH, IMHO we need some mechanism to address the issue of aggressive clearing of the VM bits combined with a very lame corresponding set operation. I agree. Today we don't have much contention on the VM page, but we must be sacrificing its usability in return. IOS (index-only scans) as well as vacuum optimizations using the VM will turn out not so useful for many workloads. I have that fear too, but the evidence isn't really in yet. The testing that people have reported on this list has had mostly positive outcomes. Of course that doesn't mean that it will work as well in the field as it did in the lab, but it doesn't mean that it won't, either. I'm very reluctant to suggest that we can solve this by setting aside another page-level bit to track visibility of tuples for heapscans. Or even have a bit in the tuple header itself to track this information at that level, to avoid repeated visibility checks for a tuple which is known to be visible to all current and future transactions. This has been suggested before, as an alternative to freezing tuples. It seems to have some potential, although I think more thought and work is needed to figure out exactly where to go with it. And we expect vacuums to be very infrequent, or absent entirely. AFAIR in pgbench, it now takes hours for the accounts table to get chosen for vacuum, and we should be happy about it. But IOS are almost impossible for pgbench-like workloads today because of our aggressive strategy of clearing the VM bits. IMHO, it's probably fairly hopeless to make a pure pgbench workload show a benefit from index-only scans. A large table under a very heavy, completely random write workload is just about the worst possible case for index-only scans. 
Index-only scans are a way of avoiding unnecessary visibility checks when the target data hasn't changed recently, not a magic bullet to escape all heap access. If the target data has changed, you're going to have to touch the heap. And while I agree that we aren't aggressive enough in setting the VM bits right now, I also think it wouldn't be too hard to go too far in the opposite direction: we could easily spend more effort trying to make index-only scans effective than we could ever hope to recoup from the scans themselves. Now, if you set up N threads of which 10% are doing regular pgbench and the other 90% are doing pgbench -S, or something like that, then you might start to hope for some benefit from index-only scans. But I think you might also GET some benefit in that case, even at steady state. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Set visibility map bit after HOT prune
On Thu, Dec 20, 2012 at 4:58 AM, Simon Riggs si...@2ndquadrant.com wrote: ISTM that if someone spots a block that could use cleanup, they mark the block as BM_PIN_COUNT_WAITER, but don't set pid. Then when they unpin the block they send a signal/queue work for a special cleaning process to come in and do the work now that nobody is waiting. Logic would allow VACUUMs to go past that by setting the pid. If we prioritised cleanup onto blocks that are already dirty we would minimise I/O. I don't favor that particular signaling mechanism, but I agree that there is quite a bit of potential utility in having foreground processes notice that work (like a HOT prune, or setting the VM bit) needs to be done and pass those requests off to a background process. I'm hoping the new background worker framework in 9.3 will make that sort of thing easier for people to play around with. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] operator dependency of commutator and negator, redux
On Thu, Dec 20, 2012 at 10:52 AM, Tom Lane t...@sss.pgh.pa.us wrote: Brendan Jurd dire...@gmail.com writes: On 20 December 2012 11:51, Tom Lane t...@sss.pgh.pa.us wrote: While reconsidering the various not-too-satisfactory fixes we thought of back then, I had a sudden thought. Instead of having a COMMUTATOR or NEGATOR forward reference create a shell operator and link to it, why not simply *ignore* such references? Then when the second operator is defined, go ahead and fill in both links? Ignore with warning sounds pretty good. So it would go something like this? # CREATE OPERATOR (... COMMUTATOR ); WARNING: COMMUTATOR (foo, foo) undefined, ignoring. CREATE OPERATOR # CREATE OPERATOR (... COMMUTATOR ); CREATE OPERATOR I was thinking a NOTICE at most. If it's a WARNING then restoring perfectly valid pg_dump files will result in lots of scary-looking chatter. You could make an argument for printing nothing at all, but that would probably mislead people who'd fat-fingered their COMMUTATOR entries. What about jiggering the dump so that only the second of the two operators to be dumped includes the COMMUTATOR clause? Or even adding a separate ALTER OPERATOR COMMUTATOR statement (or something of the sort) that pg_dump can emit as a separate item. Even a NOTICE in pg_dump seems like too much chatter (witness recent quieting of some other NOTICE messages we've all grown tired of) but silently ignoring the problem doesn't seem smart either, for the reason you mention. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parser Cruft in gram.y
On Thu, Dec 20, 2012 at 3:18 AM, Tom Lane t...@sss.pgh.pa.us wrote: Greg Stark st...@mit.edu writes: But I'm not entirely convinced any of this is actually useful. Just because the transition table is large doesn't mean it's inefficient. That's a fair point. However, I've often noticed base_yyparse() showing up rather high on profiles --- higher than seemed plausible at the time, given that its state-machine implementation is pretty tight. Now I'm wondering whether that isn't coming from cache stalls from trying to touch all the requisite parts of the transition table. For what it's worth, the bloat isn't in the parser transition table at all:

   516280  yy_transition
   147208  yytable
   147208  yycheck
   146975  base_yyparse
    17468  yypact
     9571  core_yylex
     8734  yystos
     8734  yydefact

Unless I'm confused, yy_transition is in fact the *lexer* transition table. I'm not sure how to reconcile that with the profiling results showing the cache misses in base_yyparse(), though. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [GENERAL] trouble with pg_upgrade 9.0 - 9.1
Bruce Momjian br...@momjian.us writes: As you can see, ALTER INDEX renamed both the pg_constraint and pg_class names. Is it possible someone manually updated the system table to rename this primary key? That would cause this error message. The fix is just to make sure they match. Does pg_upgrade need to be modified to handle this case? Are there legitimate cases where they will not match and the index name will not be preserved through a dump/restore? This seems safe: It's not always been true that ALTER INDEX would try to rename constraints to keep 'em in sync. A quick check says that only 8.3 and later do that. I'm not sure though how a 9.0 database could get into such a state without manual catalog hacking, since as you say a dump and reload should have ended up with the index and constraint having the same name again. I'd be inclined not to worry about this in pg_upgrade, at least not till we see a plausible scenario for the situation to arise without manual catalog changes. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parser Cruft in gram.y
On 2012-12-20 15:58:12 +0000, Greg Stark wrote: On Thu, Dec 20, 2012 at 3:18 AM, Tom Lane t...@sss.pgh.pa.us wrote: Greg Stark st...@mit.edu writes: But I'm not entirely convinced any of this is actually useful. Just because the transition table is large doesn't mean it's inefficient. That's a fair point. However, I've often noticed base_yyparse() showing up rather high on profiles --- higher than seemed plausible at the time, given that its state-machine implementation is pretty tight. Now I'm wondering whether that isn't coming from cache stalls from trying to touch all the requisite parts of the transition table. For what it's worth, the bloat isn't in the parser transition table at all:

   516280  yy_transition
   147208  yytable
   147208  yycheck
   146975  base_yyparse
    17468  yypact
     9571  core_yylex
     8734  yystos
     8734  yydefact

Unless I'm confused, yy_transition is in fact the *lexer* transition table. I'm not sure how to reconcile that with the profiling results showing the cache misses in base_yyparse(), though. The scanner is compiled together with the parser, so it's not unlikely that the compiler bunches them together, making only base_yyparse visible in the profile. I quickly checked though, and the top three offenders for L1 caches are in the bison code, not in the lexer... It's not implausible that the state transitions in the lexer are better cached and/or predicted... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ThisTimeLineID in checkpointer and bgwriter processes
On Thu, Dec 20, 2012 at 8:41 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 20.12.2012 12:08, Amit Kapila wrote: On Wednesday, December 19, 2012 9:30 PM Heikki Linnakangas wrote: In both checkpointer.c and bgwriter.c, we do this before entering the main loop: /* * Use the recovery target timeline ID during recovery */ if (RecoveryInProgress()) ThisTimeLineID = GetRecoveryTargetTLI(); That seems reasonable. However, since it's only done once, when the process starts up, ThisTimeLineID is never updated in those processes, even though the startup process changes the recovery target timeline. That actually seems harmless to me, and I also haven't heard of any complaints of misbehavior in 9.1 or 9.2 caused by that. I'm not sure why we bother to set ThisTimeLineID in those processes in the first place. This is used in RemoveOldXlogFiles(), so if it's not set during recovery and this function gets called, there might be a problem. I think it could get called from CreateRestartPoint() during recovery. Hmm, right, it's used for this: XLogFileName(lastoff, ThisTimeLineID, segno); The resulting lastoff string, which is an xlog filename like 00020008, is used to compare filenames of files in pg_xlog. However, the TLI part, the first 8 characters, is ignored for comparison purposes. In addition to that, 'lastoff' is printed in a DEBUG2 line, purely for debugging purposes. InstallXLogFileSegment() also uses ThisTimeLineID. But your recent commit doesn't take care of it, and prevents the standby from recycling the WAL files properly. Specifically, the standby recycles the WAL file to the wrong name. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ThisTimeLineID in checkpointer and bgwriter processes
Simon Riggs si...@2ndquadrant.com writes: PreallocXlogFiles() should have been put to the sword long ago. It's a performance tweak aimed at people without a performance problem in this area. This claim seems remarkably lacking in any supporting evidence. I'll freely grant that PreallocXlogFiles could stand to be improved (by which I mean made more aggressive, ie willing to create files more often and/or further in advance). I don't see how it follows that it's okay to remove the functionality altogether. To the extent that we can make that activity happen in checkpointer rather than in foreground processes, it's surely a good thing. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] operator dependency of commutator and negator, redux
Robert Haas robertmh...@gmail.com writes: On Thu, Dec 20, 2012 at 10:52 AM, Tom Lane t...@sss.pgh.pa.us wrote: I was thinking a NOTICE at most. If it's a WARNING then restoring perfectly valid pg_dump files will result in lots of scary-looking chatter. You could make an argument for printing nothing at all, but that would probably mislead people who'd fat-fingered their COMMUTATOR entries. What about jiggering the dump so that only the second of the two operators to be dumped includes the COMMUTATOR clause? Seems messy and fragile. In particular this'd represent a lot of work in order to make it more likely that the restore malfunctions if someone makes use of pg_restore's -l switch to reorder the entries. Also it would not retroactively fix the problem for restoring dumps made with existing pg_dump versions. Even a NOTICE in pg_dump seems like too much chatter (witness recent quieting of some other NOTICE messages we've all grown tired of) pg_dump has included set client_min_messages = warning in its output for quite some time now. So as long as we don't insist on making this a WARNING, people won't see it in that usage. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] ALTER .. OWNER TO error mislabels schema as other object type
This looks busted:

rhaas=# create role clerks;
CREATE ROLE
rhaas=# create role bob in role clerks;
CREATE ROLE
rhaas=# create schema foo;
CREATE SCHEMA
rhaas=# grant usage on schema foo to bob, clerks;
GRANT
rhaas=# create aggregate foo.sum(basetype=text,sfunc=textcat,stype=text,initcond='');
CREATE AGGREGATE
rhaas=# alter aggregate foo.sum(text) owner to bob;
ALTER AGGREGATE
rhaas=# set role bob;
SET
rhaas=> alter aggregate foo.sum(text) owner to clerks;
ERROR:  permission denied for function foo

Eh? There's no function called foo. There's a schema called foo, which seems to be the real problem: clerks needs to have CREATE on foo in order for bob to complete the ownership change. But somehow the error message is confused about what type of object it's dealing with. [ Credit: The above example is adapted from an EDB-internal regression test, the failure of which was what alerted me to this problem. ] -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Tue, Dec 18, 2012 at 04:06:02AM -0500, Greg Smith wrote: On 12/18/12 3:17 AM, Simon Riggs wrote: Clearly part of the response could involve pg_dump on the damaged structure, at some point. This is the main thing I wanted to try out more, once I have a decent corruption generation tool. If you've corrupted a single record but can still pg_dump the remainder, that seems the best we can do to help people recover from that. Providing some documentation on how to figure out what rows are in that block, presumably by using the contrib inspection tools, would be helpful too. FWIW, Postgres is pretty resilient against corruption. I've maintained a postgres db on a server with bad memory (don't ask), and since most scrambling was in text strings you just got funny output sometimes. The most common failure was a memory allocation failure as postgres tried to copy a datum whose length field was corrupted. If things went really wonky you could identify the bad tuples by hand and then delete them by ctid. Regular reindexing helped too. All I'm saying is that a mode where you log a warning but proceed anyway is useful. It won't pin down the exact error, but it will tell you where to look and help find the non-obvious corruption (so you can possibly fix it by hand). Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] ThisTimeLineID in checkpointer and bgwriter processes
On 20 December 2012 13:19, Amit Kapila amit.kap...@huawei.com wrote: True, it might not have any functionality effect in RemoveOldXlogFiles(). However it can be used in PreallocXlogFiles()-XLogFileInit() as well. Which is never called in recovery because we never write WAL. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ALTER .. OWNER TO error mislabels schema as other object type
Robert Haas robertmh...@gmail.com writes: This looks busted: Between this and your previous example, it's becoming clear that the recent refactorings of the ALTER code were not ready for prime time. Perhaps we should just revert those instead of playing bug whack-a-mole. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Switching timeline over streaming replication
On Sat, Dec 15, 2012 at 9:36 AM, Fujii Masao masao.fu...@gmail.com wrote: On Sat, Dec 8, 2012 at 12:51 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 06.12.2012 15:39, Amit Kapila wrote: On Thursday, December 06, 2012 12:53 AM Heikki Linnakangas wrote: On 05.12.2012 14:32, Amit Kapila wrote: On Tuesday, December 04, 2012 10:01 PM Heikki Linnakangas wrote: After some diversions to fix bugs and refactor existing code, I've committed a couple of small parts of this patch, which just add some sanity checks to notice incorrect PITR scenarios. Here's a new version of the main patch based on current HEAD. After testing with the new patch, the following problems are observed. Defect - 1:
1. start primary A
2. start standby B following A
3. start cascade standby C following B.
4. start another standby D following C.
5. Promote standby B.
6. After successful timeline switch in cascade standbys C and D, stop D.
7. Restart D. Startup is successful and it connects to standby C.
8. Stop C.
9. Restart C; startup is failing.
Ok, the error I get in that scenario is:
C 2012-12-05 19:55:43.840 EET 9283 FATAL: requested timeline 2 does not contain minimum recovery point 0/3023F08 on timeline 1
C 2012-12-05 19:55:43.841 EET 9282 LOG: startup process (PID 9283) exited with exit code 1
C 2012-12-05 19:55:43.841 EET 9282 LOG: aborting startup due to startup process failure
That mismatch causes the error. I'd like to fix this by always treating the checkpoint record as part of the new timeline. That feels more correct. The most straightforward way to implement that would be to peek at the xlog record before updating replayEndRecPtr and replayEndTLI. If it's a checkpoint record that changes TLI, set replayEndTLI to the new timeline before calling the redo-function. But it's a bit of a modularity violation to peek into the record like that.
Or we could just revert the sanity check at the beginning of recovery that throws the "requested timeline 2 does not contain minimum recovery point 0/3023F08 on timeline 1" error. The error I added to redo of the checkpoint record that says "unexpected timeline ID %u in checkpoint record, before reaching minimum recovery point %X/%X on timeline %u" checks basically the same thing, but at a later stage. However, the way minRecoveryPointTLI is updated still seems wrong to me, so I'd like to fix that. I'm thinking of something like the attached (with some more comments before committing). Thoughts? This has fixed the problem reported. However, I am not able to see what problem there would be if we removed the "requested timeline 2 does not contain minimum recovery point 0/3023F08 on timeline 1" check at the beginning of recovery and just updated replayEndTLI with ThisTimeLineID? Well, it seems wrong for the control file to contain a situation like this:
pg_control version number:            932
Catalog version number:              201211281
Database system identifier:          5819228770976387006
Database cluster state:              shut down in recovery
pg_control last modified:            pe 7. joulukuuta 2012 17.39.57
Latest checkpoint location:          0/3023EA8
Prior checkpoint location:           0/260
Latest checkpoint's REDO location:   0/3023EA8
Latest checkpoint's REDO WAL file:   00020003
Latest checkpoint's TimeLineID:      2
...
Time of latest checkpoint:           pe 7. joulukuuta 2012 17.39.49
Min recovery ending location:        0/3023F08
Min recovery ending loc's timeline:  1
Note the latest checkpoint location and its TimelineID, and compare them with the min recovery ending location. The min recovery ending location is ahead of the latest checkpoint's location; the min recovery ending location actually points to the end of the checkpoint record. But how come the min recovery ending location's timeline is 1, while the checkpoint record's timeline is 2? Now maybe that would happen to work if we remove the sanity check, but it still seems horribly confusing.
I'm afraid that discrepancy will come back to haunt us later if we leave it like that. So I'd like to fix that. Mulling this over some more, I propose the attached patch. With the patch, we peek into the checkpoint record, and actually perform the timeline switch (by changing ThisTimeLineID) before replaying it. That way the checkpoint record is really considered to be on the new timeline for all purposes. At the moment, the only difference that makes in practice is that we set replayEndTLI, and thus minRecoveryPointTLI, to the new TLI, but it feels logically more correct to do it that way. This patch has already been included in HEAD. Right? I found another "requested timeline does not contain minimum recovery point" error scenario in HEAD:
1. Set up the master 'M', one standby 'S1', and one cascade standby 'S2'.
2. Shutdown the master 'M'
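The "requested timeline N does not contain minimum recovery point" errors discussed above come down to checking, against the timeline history, which timeline a given WAL position belongs to. A simplified sketch of that containment check (LSNs are plain integers here, and the history list stands in for the .history file; this is an illustration, not the server's actual code):

```python
# A timeline history is a list of (tli, switchpoint): timeline `tli` ends
# at switchpoint, where the next timeline takes over. Positions at or past
# the last switchpoint belong to the current (live) timeline.

def tli_of_point(history, lsn, current_tli):
    """Return the timeline a WAL position belongs to."""
    for tli, switchpoint in history:
        if lsn < switchpoint:
            return tli
    return current_tli

# Timeline 1 switched to timeline 2 at (say) 0x3023F10. A min recovery
# point of 0x3023F08 then still falls on timeline 1, so a request for
# timeline 2 covering that point trips the sanity check -- the situation
# Heikki's fix resolves by treating the TLI-changing checkpoint record
# as already belonging to the new timeline.
history = [(1, 0x3023F10)]
assert tli_of_point(history, 0x3023F08, current_tli=2) == 1
assert tli_of_point(history, 0x3023F10, current_tli=2) == 2
```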
Re: [HACKERS] Set visibility map bit after HOT prune
On Thu, Dec 20, 2012 at 9:23 PM, Robert Haas robertmh...@gmail.com wrote: On Wed, Dec 19, 2012 at 11:12 PM, Pavan Deolasee I'm very reluctant to suggest that we can solve this by setting aside another page-level bit to track visibility of tuples for heapscans. Or even have a bit in the tuple header itself to track this information at that level, to avoid repeated visibility checks for a tuple which is known to be visible to all current and future transactions. This has been suggested before, as an alternative to freezing tuples. It seems to have some potential although I think more thought and work is needed to figure out exactly where to go with it. Ok. Will try to read archives to see what was actually suggested and why it was put on the back burner. BTW at the risk of being shot down again, I wonder if we can push down the freeze operation to HOT prune also. A single WAL record can then record this action as well. Also, it saves us from repeated checks for transaction status flags in heap_freeze_tuple(). Of course, we do all this only if HOT prune has work of its own and we just try to piggyback. I wonder if we should add a flag to heap_page_prune and try to do some additional work if it's being called from lazy vacuum, such as setting the VM bit and the tuple freeze. IIRC I had put something like that in the early patches, but then ripped it off for simplicity. Maybe it's time to play with that again. In fact, I'd also suggested ripping off the line pointer scan in lazy vacuum since it's preceded by a HOT prune which does the bulk of the work anyway. I remember Tom taking objection to that, but can't remember why. Will try to read up the old thread again. And we expect vacuums to be rare or nonexistent. AFAIR in pgbench, it now takes hours for the accounts table to get chosen for vacuum and we should be happy about it. But index-only scans are almost impossible for pgbench kind of workloads today because of our aggressive strategy to clear the VM bits.
IMHO, it's probably fairly hopeless to make a pure pgbench workload show a benefit from index-only scans. A large table under a very heavy, completely random write workload is just about the worst possible case for index-only scans. Index-only scans are a way of avoiding unnecessary visibility checks when the target data hasn't changed recently, not a magic bullet to escape all heap access. If the target data has changed, you're going to have to touch the heap. Not always. Not clearing the VM bit at HOT update is one such idea we discussed. Of course, there are open issues with that, but they are not unsolvable. The advantage of not touching the heap is just too big to ignore. And while I agree that we aren't aggressive enough in setting the VM bits right now, I also think it wouldn't be too hard to go too far in the opposite direction: we could easily spend more effort trying to make index-only scans effective than we could ever hope to recoup from the scans themselves. I agree. I also started having that worry. We are at one extreme right now and it might not help to get to the other extreme. Looks like I'm coming around to the idea of somehow detecting if the scan is happening on the result relation of a ModifyTable and avoiding setting the VM bit in that case. Thanks, Pavan -- Pavan Deolasee http://www.linkedin.com/in/pavandeolasee -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ThisTimeLineID in checkpointer and bgwriter processes
On 20 December 2012 16:21, Tom Lane t...@sss.pgh.pa.us wrote: Simon Riggs si...@2ndquadrant.com writes: PreallocXlogFiles() should have been put to the sword long ago. It's a performance tweak aimed at people without a performance problem in this area. This claim seems remarkably lacking in any supporting evidence. I'll freely grant that PreallocXlogFiles could stand to be improved (by which I mean made more aggressive, ie willing to create files more often and/or further in advance). I don't see how it follows that it's okay to remove the functionality altogether. To the extent that we can make that activity happen in checkpointer rather than in foreground processes, it's surely a good thing. More aggressive implies it is currently in some way aggressive. Removing it will make as little difference as keeping it, so let it stay. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ThisTimeLineID in checkpointer and bgwriter processes
On Fri, Dec 21, 2012 at 1:46 AM, Simon Riggs si...@2ndquadrant.com wrote: On 20 December 2012 13:19, Amit Kapila amit.kap...@huawei.com wrote: True, it might not have any functional effect in RemoveOldXlogFiles(). However it can be used in PreallocXlogFiles() -> XLogFileInit() as well. Which is never called in recovery because we never write WAL. No. CreateRestartPoint() calls PreallocXlogFiles(). Walreceiver may write WAL, so PreallocXlogFiles() would be useful even during recovery to some extent. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ThisTimeLineID in checkpointer and bgwriter processes
On 2012-12-20 16:46:21 +, Simon Riggs wrote: On 20 December 2012 13:19, Amit Kapila amit.kap...@huawei.com wrote: True, it might not have any functional effect in RemoveOldXlogFiles(). However it can be used in PreallocXlogFiles() -> XLogFileInit() as well. Which is never called in recovery because we never write WAL. It's called from CreateRestartPoint. And the pre-inited files get used by the walreceiver, and I would guess that's beneficial at least for sync rep... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FDW: ForeignPlan and parameterized paths
Ronan Dunklau rdunk...@gmail.com writes: I intentionally did the nestloop_params substitution after calling GetForeignPlan not before. It's not apparent to me why it would be useful to do it before, because the FDW is going to have no idea what those params represent. (Note that they represent values coming from some other, probably local, relation; not from the foreign table.) Even if the FDW has no idea what they represent, it can identify a clause of the form Var Operator Param, which allows it to store the param reference (paramid) for retrieving the param value at execution time. I don't see any plausible reason for an FDW to special-case nestloop params like that. What an FDW should be looking for is clauses of the form Var-of-foreign-table Operator Expression-not-involving-foreign-table, and a Param is just one case of Expression-not-involving-foreign-table. (Compare the handling of indexscan clauses: indxpath.c doesn't much care what's on the righthand side of an indexable clause, so long as there is no Var of the indexed table there.) Moreover, in order to do effective parameterized-path creation in the first place, the FDW's GetForeignPaths function will already have had to recognize these same clauses in their original form. If we do the param substitution before calling GetForeignPlan, that will just mean that the two functions can't share code anymore. Or in short: the fact that the righthand-side expression gets replaced (perhaps only partially) by a Param is an implementation detail of the executor's expression evaluation methods. The FDW shouldn't care about that, only about the result of the expression. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
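The clause shape Tom describes can be sketched with toy node classes (stand-ins for the real PostgreSQL parse nodes, not their actual API): an FDW should accept Var-of-foreign-table Operator Expression-not-involving-the-foreign-table, treating a Param as just one instance of the right-hand expression.

```python
# Minimal stand-ins for Var / Param / OpExpr parse nodes, to illustrate
# the clause classification an FDW's GetForeignPaths/GetForeignPlan would
# do. These classes are hypothetical; the real nodes live in the server.

class Var:
    def __init__(self, relid, attno):
        self.relid, self.attno = relid, attno

class Param:
    def __init__(self, paramid):
        self.paramid = paramid

class OpExpr:
    def __init__(self, opname, left, right):
        self.opname, self.left, self.right = opname, left, right

def references_rel(node, relid):
    """Does this expression mention any Var of the given relation?"""
    if isinstance(node, Var):
        return node.relid == relid
    if isinstance(node, OpExpr):
        return references_rel(node.left, relid) or references_rel(node.right, relid)
    return False  # Param, Const, ...

def is_pushable(clause, foreign_relid):
    """True for: Var-of-foreign-table OP expr-not-involving-foreign-table."""
    return (isinstance(clause, OpExpr)
            and isinstance(clause.left, Var)
            and clause.left.relid == foreign_relid
            and not references_rel(clause.right, foreign_relid))

# ft.id = $1  -- a Param on the right is just one case of the shape
assert is_pushable(OpExpr("=", Var(1, 1), Param(7)), foreign_relid=1)
# ft.id = ft.other_col  -- RHS involves the foreign table: not pushable
assert not is_pushable(OpExpr("=", Var(1, 1), Var(1, 2)), foreign_relid=1)
```

Because the test only asks "does the right-hand side avoid the foreign table?", the same check works whether the planner later substitutes a Param there or not, which is exactly why the substitution's timing shouldn't matter to the FDW.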
Re: [HACKERS] ALTER .. OWNER TO error mislabels schema as other object type
On Thu, Dec 20, 2012 at 11:46 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: This looks busted: Between this and your previous example, it's becoming clear that the recent refactorings of the ALTER code were not ready for prime time. Perhaps we should just revert those instead of playing bug whack-a-mole. Well, as yet, I have no clear evidence that there is any problem with anything other than the error messages. It seems like overkill to revert the whole thing just for that. Not to say that there might not be further issues, of course. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Set visibility map bit after HOT prune
Pavan Deolasee pavan.deola...@gmail.com writes: Ok. Will try to read archives to see what was actually suggested and why it was put on the back burner. BTW at the risk of being shot down again, I wonder if we can push down the freeze operation to HOT prune also. Seems unlikely to be a win. We only care about freezing tuples in the context of being able to advance a relation-wide relfrozenxid horizon. Freezing pages retail accomplishes nothing whatsoever towards that goal, unless you have some way to know that no new freeze work will be needed on the page before the next vacuum freeze happens. Otherwise, you're just moving portions of the work from background vacuuming into foreground processes, with no benefit gained thereby. In fact, you might well be *creating* work that would otherwise not have had to be done at all --- the tuple might get deleted before the next freeze happens. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Set visibility map bit after HOT prune
On Thu, Dec 20, 2012 at 11:49 AM, Pavan Deolasee pavan.deola...@gmail.com wrote: I wonder if we should add a flag to heap_page_prune and try to do some additional work if it's being called from lazy vacuum, such as setting the VM bit and the tuple freeze. IIRC I had put something like that in the early patches, but then ripped it off for simplicity. Maybe it's time to play with that again. That seems unlikely to be a good trade-off. If VACUUM is going to do extra stuff, it's better to have that in the vacuum-specific code, rather than in code that is also traversed from other places. Otherwise the conditional logic might impose a penalty on people who aren't taking those branches. IMHO, it's probably fairly hopeless to make a pure pgbench workload show a benefit from index-only scans. A large table under a very heavy, completely random write workload is just about the worst possible case for index-only scans. Index-only scans are a way of avoiding unnecessary visibility checks when the target data hasn't changed recently, not a magic bullet to escape all heap access. If the target data has changed, you're going to have to touch the heap. Not always. Not clearing the VM bit at HOT update is one such idea we discussed. Of course, there are open issues with that, but they are not unsolvable. The advantage of not touching heap is just too big to ignore. I don't really agree. Sure, not touching the heap is nice, but mostly because you avoid pulling pages into shared_buffers that aren't otherwise needed. IIRC, an index-only scan isn't faster than an index scan if all the necessary table and index pages are already cached. Touching already-resident pages just isn't that expensive. And of course, if a page has recently suffered an insert, update, or delete, it is more likely to be resident. You can construct access patterns where this isn't so - e.g.
update the page, wait for it to get paged out, and then SELECT from it with an index-only scan, wait for it to get paged out again, etc. - but I'm not sure how much of a problem that is in the real world. And while I agree that we aren't aggressive enough in setting the VM bits right now, I also think it wouldn't be too hard to go too far in the opposite direction: we could easily spend more effort trying to make index-only scans effective than we could ever hope to recoup from the scans themselves. I agree. I also started having that worry. We are at one extreme right now and it might not help to get to the other extreme. Looks like I'm coming around to the idea of somehow detecting if the scan is happening on the result relation of a ModifyTable and avoiding setting the VM bit in that case. It's unclear to me that that's the right way to slice it. There are several different sets of concerns here: (1) avoiding setting the all-visible bit when it'll be cleared again just after, (2) avoiding slowing down SELECT with hot-pruning, and (3) avoiding slowing down repeated SELECTs by NOT having the first one do HOT-pruning. And maybe others. The right thing to do depends on which problems you think are relatively more important. That question might not even have one right answer, but even if it does we don't have consensus on what it is. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ThisTimeLineID in checkpointer and bgwriter processes
On 20.12.2012 18:19, Fujii Masao wrote: InstallXLogFileSegment() also uses ThisTimeLineID. But your recent commit doesn't take care of it and prevents the standby from recycling the WAL files properly. Specifically, the standby recycles the WAL file to wrong name. A-ha, good catch. So that's actually a live bug in 9.1 and 9.2 as well: after the recovery target timeline has changed, restartpoints will continue to preallocate/recycle WAL files for the old timeline. That's otherwise harmless, but the useless WAL files waste space, and walreceiver will have to always create new files. So instead of always running with ThisTimeLineID = 0 in the checkpointer process, I guess we'll have to update it to the timeline being replayed, when creating a restartpoint. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Set visibility map bit after HOT prune
On Thu, Dec 20, 2012 at 10:55 PM, Tom Lane t...@sss.pgh.pa.us wrote: Pavan Deolasee pavan.deola...@gmail.com writes: Ok. Will try to read archives to see what was actually suggested and why it was put on the back burner. BTW at the risk of being shot down again, I wonder if we can push down the freeze operation to HOT prune also. Seems unlikely to be a win. We only care about freezing tuples in the context of being able to advance a relation-wide relfrozenxid horizon. Freezing pages retail accomplishes nothing whatsoever towards that goal, unless you have some way to know that no new freeze work will be needed on the page before the next vacuum freeze happens. Otherwise, you're just moving portions of the work from background vacuuming into foreground processes, with no benefit gained thereby. If we can establish an invariant that an all-visible page is always fully frozen, then vacuum freeze does not need to look at those pages again. Another advantage is that we are holding the right lock and can piggyback freezing on the cleanup WAL-logging, thus avoiding re-dirtying of the page and additional WAL logging. In fact, you might well be *creating* work that would otherwise not have had to be done at all --- the tuple might get deleted before the next freeze happens. Yeah, there will be cases where it might not add any value or even add a little overhead. Don't know which will serve better on average or for the majority of workloads, though. Vacuum freeze has been known to add sudden and unexpected load on the system, so I thought this might mitigate that to a certain extent. Thanks, Pavan -- Pavan Deolasee http://www.linkedin.com/in/pavandeolasee -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Set visibility map bit after HOT prune
On Thu, Dec 20, 2012 at 10:59 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Dec 20, 2012 at 11:49 AM, Pavan Deolasee pavan.deola...@gmail.com wrote: I wonder if we should add a flag to heap_page_prune and try to do some additional work if it's being called from lazy vacuum, such as setting the VM bit and the tuple freeze. IIRC I had put something like that in the early patches, but then ripped it off for simplicity. Maybe it's time to play with that again. That seems unlikely to be a good trade-off. If VACUUM is going to do extra stuff, it's better to have that in the vacuum-specific code, rather than in code that is also traversed from other places. Otherwise the conditional logic might impose a penalty on people who aren't taking those branches. That's a call we need to make between code duplication and customising execution. We do that all over the code. Not sure if it will be any different here. It's unclear to me that that's the right way to slice it. There are several different sets of concerns here: (1) avoiding setting the all-visible bit when it'll be cleared again just after, (2) avoiding slowing down SELECT with hot-pruning, and (3) avoiding slowing down repeated SELECTs by NOT having the first one do HOT-pruning. And maybe others. The right thing to do depends on which problems you think are relatively more important. That question might not even have one right answer, but even if it does we don't have consensus on what it is. Hmm. We tossed around and discussed many interesting ideas in this thread. It will be sad if none of them go anywhere. When I look at archives, I see we might have discussed some of these even in the past but never got an agreement because there is always a workload which may not be served well by any specific idea. And many times, they are so interrelated that we either have to do all or none. Unfortunately, trying to do all is too much and too invasive most often.
Maybe what we need is an official experimental branch where such ideas can be checked in, encouraging people to try out those branches in their real-world tests or to set up dedicated benchmark machines to run regular tests. Tested and proven ideas can then be merged into the main trunk. That may be the only way to know the efficacy of such ideas. Thanks, Pavan -- Pavan Deolasee http://www.linkedin.com/in/pavandeolasee -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_top
On 12/20/2012 4:17 AM, Brett Maton wrote: It appears that procpid has been renamed to pid at some point, and the column current_query appears to have been shortened to query. My patch updates a couple of queries to use the new shorter column names. IMHO, any such fix should check the version, and use the old names for < 9.2 and the new ones for >= 9.2. /me tossed another mumbled curse at whomever changed that field name.
Re: [HACKERS] [GENERAL] trouble with pg_upgrade 9.0 - 9.1
On Thu, Dec 20, 2012 at 11:08:58AM -0500, Tom Lane wrote: Bruce Momjian br...@momjian.us writes: As you can see, ALTER INDEX renamed both the pg_constraint and pg_class names. Is it possible someone manually updated the system table to rename this primary key? That would cause this error message. The fix is to just to make sure they match. Does pg_upgrade need to be modified to handle this case? Are there legitimate cases where they will not match and the index name will not be preserved though a dump/restore? This seems safe: It's not always been true that ALTER INDEX would try to rename constraints to keep 'em in sync. A quick check says that only 8.3 and later do that. I'm not sure though how a 9.0 database could get into such a state without manual catalog hacking, since as you say a dump and reload should have ended up with the index and constraint having the same name again. I'd be inclined not to worry about this in pg_upgrade, at least not till we see a plausible scenario for the situation to arise without manual catalog changes. Agreed. I added a C comment so we don't forget about the matching requirement. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Set visibility map bit after HOT prune
Pavan Deolasee pavan.deola...@gmail.com writes: On Thu, Dec 20, 2012 at 10:55 PM, Tom Lane t...@sss.pgh.pa.us wrote: Seems unlikely to be a win. We only care about freezing tuples in the context of being able to advance a relation-wide relfrozenxid horizon. Freezing pages retail accomplishes nothing whatsoever towards that goal, unless you have some way to know that no new freeze work will be needed on the page before the next vacuum freeze happens. Otherwise, you're just moving portions of the work from background vacuuming into foreground processes, with no benefit gained thereby. If we can establish an invariant that an all-visible page is always fully frozen, then vacuum freeze does not need to look at those pages again. We're not going to do that, because it would require freezing tuples immediately after they fall below the RecentGlobalXmin horizon. This would be a significant loss of capability from a forensic standpoint, not to mention breaking existing applications that look at xmin to determine whether a tuple has recently been updated. Besides which, I think it would result in a large increase in the WAL volume emitted by prune operations (hint bit setting doesn't require WAL, unlike freezing). I don't believe for a minute your argument that it would result in a net reduction in WAL. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Set visibility map bit after HOT prune
On Thu, Dec 20, 2012 at 1:30 PM, Tom Lane t...@sss.pgh.pa.us wrote: Pavan Deolasee pavan.deola...@gmail.com writes: On Thu, Dec 20, 2012 at 10:55 PM, Tom Lane t...@sss.pgh.pa.us wrote: Seems unlikely to be a win. We only care about freezing tuples in the context of being able to advance a relation-wide relfrozenxid horizon. Freezing pages retail accomplishes nothing whatsoever towards that goal, unless you have some way to know that no new freeze work will be needed on the page before the next vacuum freeze happens. Otherwise, you're just moving portions of the work from background vacuuming into foreground processes, with no benefit gained thereby. If we can establish an invariant that a all-visible page is always fully freezed, then vacuum freeze does not need to look at those pages again. We're not going to do that, because it would require freezing tuples immediately after they fall below the RecentGlobalXmin horizon. This would be a significant loss of capability from a forensic standpoint, not to mention breaking existing applications that look at xmin to determine whether a tuple has recently been updated. Besides which, I think it would result in a large increase in the WAL volume emitted by prune operations (hint bit setting doesn't require WAL, unlike freezing). I don't believe for a minute your argument that it would result in a net reduction in WAL. I don't think the above makes sense, because making a page all-visible already requires emitting a WAL record. Pavan didn't say freeze the page every time we set a hint bit; he said freeze the page every time it gets marked all-visible. And that's already WAL-logged. Now, there is a downside: right now, we play a tricky little game where we emit a WAL record for setting the visibility map bit, but we don't actually set the LSN of the heap page. 
It's OK because it's harmless if the PD_ALL_VISIBLE bit makes it to disk and the visibility-map doesn't, and also because the PD_ALL_VISIBLE bit can be set without relying on the previous page contents. But doing anything more complicated with the same WAL record, like freezing, is likely to require setting the LSN on the heap page. And that will result in a huge increase in WAL traffic when vacuuming an insert-only table. Whee, crash recovery is fun. With respect to the forensic problem, we've previously discussed setting a HEAP_XMIN_FROZEN bit in the tuple header rather than overwriting the xmin with FrozenXID. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
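The HEAP_XMIN_FROZEN idea mentioned at the end can be sketched as a flag bit in the tuple header: visibility checks treat the tuple as frozen, while the original xmin stays readable for forensics. The flag value and field names below are made up for illustration; they are not PostgreSQL's actual infomask layout.

```python
# Sketch of marking a tuple frozen via a header flag instead of
# overwriting xmin with FrozenXID. Hypothetical bit value.

HEAP_XMIN_FROZEN = 0x0001  # assumed infomask bit, not the real one

class TupleHeader:
    def __init__(self, xmin):
        self.xmin = xmin      # inserting transaction ID
        self.infomask = 0     # status flag bits

def freeze(t):
    """Mark the tuple frozen; xmin itself is left untouched."""
    t.infomask |= HEAP_XMIN_FROZEN

def xmin_is_frozen(t):
    return bool(t.infomask & HEAP_XMIN_FROZEN)

t = TupleHeader(xmin=123456)
freeze(t)
assert xmin_is_frozen(t) and t.xmin == 123456  # forensic info preserved
```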
Re: [HACKERS] [GENERAL] trouble with pg_upgrade 9.0 - 9.1
Applied. We can mark this report closed. Groshev, let us know if you have any further problems. --- On Thu, Dec 20, 2012 at 07:19:48AM -0500, Bruce Momjian wrote: On Wed, Dec 19, 2012 at 10:19:30PM -0500, Bruce Momjian wrote: On Wed, Dec 19, 2012 at 12:56:05PM -0500, Kevin Grittner wrote: Groshev Andrey wrote: Mismatch of relation names: database database, old rel public.lob.ВерсияВнешнегоДокумента$Документ_pkey, new rel public.plob.ВерсияВнешнегоДокумента$Документ There is a limit on identifiers of 63 *bytes* (not characters) after which the name is truncated. In UTF8 encoding, the underscore would be in the 64th position. OK, Kevin is certainly pointing out a bug in the pg_upgrade code, though I am unclear how it would exhibit the mismatch error reported. pg_upgrade uses NAMEDATALEN for database, schema, and relation name storage lengths. While NAMEDATALEN works fine in the backend, it is possible that a frontend client, like pg_upgrade, could retrieve a name in the client encoding whose length exceeds NAMEDATALEN if the client encoding did not match the database encoding (or is it the cluster encoding for system tables). This would cause truncation of these values. The truncation would not cause crashes, but might cause failures by not being able to connect to databases with overly long names, and it weakens the checking of relation/schema names --- the same check that is reported above. (I believe initdb.c also erroneously uses NAMEDATALEN.) I have developed the attached patch to pg_strdup() the string returned from libpq, rather than use a fixed NAMEDATALEN buffer to store the value. I am only going to apply this to 9.3 because I can't see this causing problems except for weaker comparisons for very long identifiers where the client encoding is longer than the server encoding, and failures for very long database names, none of which we have gotten bug reports about.
Turns out initdb.c was fine because it expects collation names to be only in ASCII; I added a comment to that effect. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
diff --git a/contrib/pg_upgrade/info.c b/contrib/pg_upgrade/info.c
new file mode 100644
index 2250442..5cb9b61
*** a/contrib/pg_upgrade/info.c
--- b/contrib/pg_upgrade/info.c
*** static void get_db_infos(ClusterInfo *cl
*** 23,29
  static void get_rel_infos(ClusterInfo *cluster, DbInfo *dbinfo);
  static void free_rel_infos(RelInfoArr *rel_arr);
  static void print_db_infos(DbInfoArr *dbinfo);
! static void print_rel_infos(RelInfoArr *arr);
  /*
--- 23,29
  static void get_rel_infos(ClusterInfo *cluster, DbInfo *dbinfo);
  static void free_rel_infos(RelInfoArr *rel_arr);
  static void print_db_infos(DbInfoArr *dbinfo);
! static void print_rel_infos(RelInfoArr *rel_arr);
  /*
*** create_rel_filename_map(const char *old_
*** 127,134
  map->new_relfilenode = new_rel->relfilenode;
  /* used only for logging and error reporing, old/new are identical */
! snprintf(map->nspname, sizeof(map->nspname), "%s", old_rel->nspname);
! snprintf(map->relname, sizeof(map->relname), "%s", old_rel->relname);
  }
--- 127,134
  map->new_relfilenode = new_rel->relfilenode;
  /* used only for logging and error reporing, old/new are identical */
! map->nspname = old_rel->nspname;
! map->relname = old_rel->relname;
  }
*** get_db_infos(ClusterInfo *cluster)
*** 220,227
  for (tupnum = 0; tupnum < ntups; tupnum++)
  {
  dbinfos[tupnum].db_oid = atooid(PQgetvalue(res, tupnum, i_oid));
! snprintf(dbinfos[tupnum].db_name, sizeof(dbinfos[tupnum].db_name), "%s",
! PQgetvalue(res, tupnum, i_datname));
  snprintf(dbinfos[tupnum].db_tblspace, sizeof(dbinfos[tupnum].db_tblspace), "%s", PQgetvalue(res, tupnum, i_spclocation));
  }
--- 220,226
  for (tupnum = 0; tupnum < ntups; tupnum++)
  {
  dbinfos[tupnum].db_oid = atooid(PQgetvalue(res, tupnum, i_oid));
! dbinfos[tupnum].db_name = pg_strdup(PQgetvalue(res, tupnum, i_datname));
  snprintf(dbinfos[tupnum].db_tblspace, sizeof(dbinfos[tupnum].db_tblspace), "%s", PQgetvalue(res, tupnum, i_spclocation));
  }
*** get_rel_infos(ClusterInfo *cluster, DbIn
*** 346,355
  curr->reloid = atooid(PQgetvalue(res, relnum, i_oid));
  nspname = PQgetvalue(res, relnum, i_nspname);
! strlcpy(curr->nspname, nspname, sizeof(curr->nspname));
  relname =
Re: [HACKERS] Feature Request: pg_replication_master()
On Wed, Dec 19, 2012 at 07:32:33PM -0500, Robert Haas wrote: On Wed, Dec 19, 2012 at 5:34 PM, Simon Riggs si...@2ndquadrant.com wrote: As ever, we spent much energy on debating backwards compatibility rather than just solving the problem it posed, which is fairly easy to solve. I'm still of the opinion (as were a lot of people on the previous thread, IIRC) that just making them GUCs and throwing backward compatibility under the bus is acceptable in this case. Changes that break application code are anathema to me, because people can have a LOT of application code and updating it can be REALLY hard. The same cannot be said about recovery.conf - you have at most one of those per standby, and if it needs to be changed in some way, you can do it with a very small Perl script. Yes, third-party tools will need to be updated; that is surely a downside, but I think it might be a tolerable one in this case.

Agreed. We have always had a more lax requirement for changing the Postgres administration interface. I am not saying to ignore backward compatibility, but future admin interface clarity overrules backward compatibility in most cases. If people are really worried about this, they can write a Perl script to convert from the old to new format.

-- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Feature Request: pg_replication_master()
On Wed, Dec 19, 2012 at 10:34:14PM +, Simon Riggs wrote: On 19 December 2012 22:19, Joshua Berkus j...@agliodbs.com wrote: It stalled because the patch author decided not to implement the request to detect recovery.conf in data directory, which allows backwards compatibility. Well, I don't think we had agreement on how important backwards compatibility for recovery.conf was, particularly not on the whole recovery.conf/recovery.done functionality and the weird formatting of recovery.conf. As ever, we spent much energy on debating backwards compatibility rather than just solving the problem it posed, which is fairly easy to solve.

Let me also add that I am tired of having recovery.conf improvement stalled by backward compatibility concerns. At this point, let's just trash recovery.conf backward compatibility and move on. And I don't want to hear complaints about tool breakage either. These are external tools, not shipped with community Postgres, and they will just have to adjust. I will be glad to beat all complainants into the ground for the good of the community. ;-)

We just can't operate like this, and if we allowed these things to block us in the past, Postgres would be a royal mess today! At this point backward compatibility has paralyzed us from fixing a recovery.conf API that everyone agrees is non-optimal, and this has gone on for multiple major releases. I don't care what we have to do, just clean this up for 9.3!

-- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
Re: [HACKERS] Review of Row Level Security
Kevin, all, * Kevin Grittner (kgri...@mail.com) wrote: The more secure behavior is to allow entry of data which will not be visible by the person doing the entry.

wrt this - I'm inclined to agree with Kevin. It's certainly common in certain environments that you can write to a higher level than you can read from. Granting those writers access to read the data later would be... difficult. What we're really arguing about here, afaict, is what the default should be. In line with Kevin's comments and Tom's reading of the spec (along with my own experience in these environments), I'd argue for the default to allow writing rows you're not allowed to read.

It would certainly be ideal if we could support both options, on a per-relation basis, when we release the overall feature. It doesn't feel like it'd be a lot of work to do that, but I've not been able to follow this discussion up til now. Thankfully, I'm hopeful that I'm going to have more time now to keep up with PG. :)

Thanks, Stephen
Re: [HACKERS] Review of Row Level Security
On Thu, Dec 20, 2012 at 4:35 AM, Simon Riggs si...@2ndquadrant.com wrote: Not sure I understand you. You suggested it was a valid use case for a user to have only INSERT privilege and wish to bypass security checks. I agreed and suggested it could be special-cased. That's not really what I intended to suggest. I view checking an inserted tuple and checking the new version of an updated tuple as of a piece. I would think we would check against the RLS quals in either both of those situations or neither, not one without the other. * Applies to all commands should not be implemented via triggers. Complex, slow, unacceptable thing to force upon users. Doing that begs the question of why we would have the feature at all, since we already have triggers and barrier views. I agree that it is questionable whether we need this feature given that we already have security barrier views. I don't agree that performing security checks via triggers is unacceptably slow or complex. Rather, I would say it is flexible and can meet a variety of needs, unlike this feature, which imposes much tighter constraints on what you can and cannot check and in which situations. * the default for row security should be applies to all commands. Anything else may be useful in some cases, but is surprising to users and requires careful thought to determine if it is appropriate. I (and several other people, it seems) do not agree. * How to handle asymmetric row security policies? KaiGai has already begun discussing problems caused by a security policy that differs between reads/writes, on his latest patch post. That needs further analysis to check that it actually makes sense to allow it, since it is more complex. It would be better to fully analyse that situation and post solutions, rather than simply argue its OK. 
Kevin has made good arguments to show there could be value in such a setup; nobody has talked about banning it, but we do need analysis, suggested syntax/mechanisms and extensive documentation to explain it etc. Frankly, in view of your comments above, I am starting to rethink whether we want this at all. I mean, if you've got security barrier views, you can check the data being read. If you've got triggers, you can check the data being written. So what's left? There's something notationally appealing about being able to apply a security policy to a table rather than creating a separate view and telling people to use the view in lieu of the table, but how much is that notational convenience worth? It has some value from the standpoint of compatibility with other database products ... but probably not a whole lot, since all the syntax we're inventing here is PostgreSQL-specific anyway. Your proposal to check both tuples read and tuples written might add some value ... but unless there's an as-yet-undemonstrated performance benefit, it is again mostly a notational benefit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Feature Request: pg_replication_master()
On Thu, Dec 20, 2012 at 02:29:49PM -0500, Bruce Momjian wrote: Let me also add that I am tired of having recovery.conf improvement stalled by backward compatibility concerns. At this point, let's just trash recovery.conf backward compatibility and move on. And I don't want to hear complaints about tool breakage either. These are external tools, not shipped with community Postgres, and they will just have to adjust. I will be glad to beat all complainants into the ground for the good of the community. ;-) We just can't operate like this, and if we allowed these things to block us in the past, Postgres would be a royal mess today! At this point backward compatibility has paralyzed us from fixing a recovery.conf API that everyone agrees is non-optimal, and this has gone on for multiple major releases. I don't care what we have to do, just clean this up for 9.3!

Let me add two more things. First, no matter how many people you interact with who use Postgres, there are many more who you do not interact with. Please don't give excessive weight to those people you know, or who use your tools, or who are your customers, and forget the many more Postgres users who you will never know. They are not using your tools or scripts --- they just want Postgres to be simple to use.

Second, no matter how successful Postgres is now, there are many more users in our future, and we have a responsibility to give them a database that is as easy to configure as possible, without hampering them with decisions to avoid disruption for current Postgres users. I am not saying we should ignore current users, or our customers or our tool users, but it is very clear to me that we have lost the proper balance in this area.

-- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
Re: [HACKERS] Review of Row Level Security
2012/12/20 Stephen Frost sfr...@snowman.net: Kevin, all, * Kevin Grittner (kgri...@mail.com) wrote: The more secure behavior is to allow entry of data which will not be visible by the person doing the entry. wrt this- I'm inclined to agree with Kevin. It's certainly common in certain environments that you can write to a higher level than you can read from. Granting those writers access to read the data later would be... difficult. What we're really arguing about here, afaict, is what the default should be. In line with Kevin's comments and Tom's reading of the spec (along with my own experience in these environments), I'd argue for the default to allow writing rows you're not allowed to read.

If the system ensures the writer's permission is always equivalent to or more restrictive than the reader's permission, it also eliminates the problem around asymmetric row-security policy between commands. The problematic scenario was that updatable but invisible rows could be exposed, for example by using UPDATE with a RETURNING clause. When updatable rows are always visible, there is no problem even if the user displays them. Probably, we can implement it as ((row-security of select) AND (row-security of update)), which ensures an always more restrictive row-security policy.

Thanks,

It would certainly be ideal if we could support both options, on a per-relation basis, when we release the overall feature. It doesn't feel like it'd be a lot of work to do that, but I've not been able to follow this discussion up til now. Thankfully, I'm hopeful that I'm going to have more time now to keep up with PG. :) Thanks, Stephen

-- KaiGai Kohei kai...@kaigai.gr.jp
Re: [HACKERS] Review of Row Level Security
* Robert Haas (robertmh...@gmail.com) wrote: * Applies to all commands should not be implemented via triggers. Complex, slow, unacceptable thing to force upon users. Doing that begs the question of why we would have the feature at all, since we already have triggers and barrier views.

I would rather neither require writing custom triggers, but rather have both supported through this feature.

I agree that it is questionable whether we need this feature given that we already have security barrier views.

This I don't agree with - the plan has long been to have PG-specific RLS first and then to support SELinux capabilities on top of it. We didn't want to have SELinux-specific functionality that couldn't be achieved without SELinux being involved, and I continue to agree with that. There are many situations, environments, and individuals that would view having to implement RLS through views and triggers as being far-and-away too painful and error-prone to rely on.

Thanks, Stephen
Re: [HACKERS] Review of Row Level Security
2012/12/20 Robert Haas robertmh...@gmail.com: On Thu, Dec 20, 2012 at 4:35 AM, Simon Riggs si...@2ndquadrant.com wrote: Not sure I understand you. You suggested it was a valid use case for a user to have only INSERT privilege and wish to bypass security checks. I agreed and suggested it could be special-cased. That's not really what I intended to suggest. I view checking an inserted tuple and checking the new version of an updated tuple as of a piece. I would think we would check against the RLS quals in either both of those situations or neither, not one without the other. * Applies to all commands should not be implemented via triggers. Complex, slow, unacceptable thing to force upon users. Doing that begs the question of why we would have the feature at all, since we already have triggers and barrier views. I agree that it is questionable whether we need this feature given that we already have security barrier views. I don't agree that performing security checks via triggers is unacceptably slow or complex. Rather, I would say it is flexible and can meet a variety of needs, unlike this feature, which imposes much tighter constraints on what you can and cannot check and in which situations.

I'd like to ask Simon which point is more significant: the performance penalty, or the complex operations required of users. If the latter, FK constraints are a good example: they automatically define triggers that apply their checks to inserted tuples and to newer versions of updated tuples. Even though we need to consider how to handle row-security policies added dynamically by an extension (e.g. sepgsql), I don't think we need to force users to define triggers for each table with row-security as long as the system supports it.

* the default for row security should be applies to all commands. Anything else may be useful in some cases, but is surprising to users and requires careful thought to determine if it is appropriate. I (and several other people, it seems) do not agree.
* How to handle asymmetric row security policies? KaiGai has already begun discussing problems caused by a security policy that differs between reads/writes, on his latest patch post. That needs further analysis to check that it actually makes sense to allow it, since it is more complex. It would be better to fully analyse that situation and post solutions, rather than simply argue its OK. Kevin has made good arguments to show there could be value in such a setup; nobody has talked about banning it, but we do need analysis, suggested syntax/mechanisms and extensive documentation to explain it etc. Frankly, in view of your comments above, I am starting to rethink whether we want this at all. I mean, if you've got security barrier views, you can check the data being read. If you've got triggers, you can check the data being written. So what's left?

In some cases, it is not a reasonable choice to redefine the kind or name of database objects that an existing application assumes. That is why we need adaptive security features on regular tables with no or minimal application changes.

Thanks, -- KaiGai Kohei kai...@kaigai.gr.jp
Re: [HACKERS] Feature Request: pg_replication_master()
Let me also add that I am tired of having recovery.conf improvement stalled by backward compatibility concerns. At this point, let's just trash recovery.conf backward compatibility and move on. And I don't want to hear complaints about tool breakage either. These are external tools, not shipped with community Postgres, and they will just have to adjust. I will be glad to beat all complainants into the ground for the good of the community. ;-) We just can't operate like this, and if we allowed these things to block us in the past, Postgres would be a royal mess today! At this point backward compatibility has paralyzed us from fixing a recovery.conf API that everyone agrees is non-optimal, and this has gone on for multiple major releases. I don't care what we have to do, just clean this up for 9.3!

+1 And the sooner we do it before release, the more time those external tools will have to adjust.

Regards, Petr Jelinek
Re: [HACKERS] operator dependency of commutator and negator, redux
Robert Haas robertmh...@gmail.com writes: a separate ALTER OPERATOR COMMUTATOR statement (or something of the sort) that pg_dump can emit as a separate item. Even a NOTICE in

I like that capability, but it's not helping us in the backward compatibility department, where we will still read commutator declarations as operator properties. And maintaining an extension with different syntax for CREATE OPERATOR depending on the major version would be a pain (it's the case already for CREATE TYPE by the way, painfully so). So I think Tom's idea is better to fix the problem at hand.

About dropping the Operator Shell, we've been talking in the past about adding more properties to our operators to allow for some more optimizer tricks (reducing expressions to constants or straight variable references at parse time, reducing joins, adding parametrized paths etc). I can think about associativity and neutral element, but those are properties of one operator only. Now there's the distributive property that holds between two different operators, and that maybe would better be added as an ALTER OPERATOR statement rather than as a possibly forward reference when we come to that. I'm not too sure about other concepts that we might want to tackle down the road here; another angle would be support for parallelism, where maybe operator properties could tell the planner how to spread a complex WHERE clause or output column computation…

All in all, it looks to me like the current proposals on the table would allow us to dispose of the Operator Shell idea entirely. If we ever need it back, the ALTER OPERATOR trick looks like a better tool.

Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] strange OOM errors with EXECUTE in PL/pgSQL
Tom Lane t...@sss.pgh.pa.us writes: The reason this fails is that you've got a half-megabyte source string, and each of the 11000 plans that are due to be created from it saves its own copy of the source string. Hence, 5500 megabytes needed just for source strings. We could possibly fix this by inventing some sort of reference-sharing arrangement (which'd be complicated and fragile) or by not storing the source strings with the plans (which'd deal a serious blow to our ability to provide helpful error messages). Neither answer seems appealing.

I don't readily see how complicated and fragile it would be; it looks like a hash table of symbols pointing to source strings, with reference counting, and each plan would need to reference that symbol. Now maybe that's what you call complicated and fragile, and even if not, I'm not really sure it would pull its weight. The use case of sending *in a given session* the exact same query string over and over again without using PREPARE/EXECUTE looks quite tiny.

-- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] Review of Row Level Security
Kohei KaiGai wrote: If system ensures writer's permission is always equivalent or more restrictive than reader's permission, it also eliminates the problem around asymmetric row-security policy between commands. I'm not sure we're understanding each other; so far people who favor asymmetric read and write predicates for row level security have been arguing that people should be able to write tuples that they can't read back. Of course you need to be able to read a tuple to update it, but the argument is that you should be able to configure security so that a role can (for example) update a row to set a sealed flag, and then no longer have rights to read that row (including for purposes of update). You can give away data which is yours, but you can't then take it back if it's not. The problematic scenario was that updatable but invisible rows are exposed; I have not seen anyone argue that such behavior should be allowed. Probably, we can implement it as ((row-security of select) AND (row-security of update)) that ensures always restrictive row-security policy. I hadn't been paying a lot of attention to this patch until I saw the question about whether a user with a particular role could write a row which would then not be visible to that role. I just took a look at the patch. I don't think I like ALTER TABLE as a syntax for row level security. How about using existing GRANT syntax but allowing a WHERE clause? That seems more natural to me, and it would make it easy to apply the same conditions to multiple types of operations when desired, but use different expressions when desired. Without having spent a lot of time pondering it, I think that if row level SELECT permissions exist, they would need to be met on the OLD tuple to allow DELETE or UPDATE, and UPDATE row level permissions would be applied to the NEW tuple. So, Simon's use-case could be met with: GRANT SELECT, INSERT, UPDATE, DELETE ON tabx TO org12user WHERE org = 12; ... and similar GRANTs. 
A user who should be able to access rows for a particular value of org would be granted the appropriate permission. These could be combined by granting a role to another role. To go back to a Wisconsin Courts example, staff in a county might be granted rights to access data for that county, but district roles could be set up and granted to court officials, who need to be able to access data for all counties in their judicial district, because judges fill in for each other across county lines, but only within their own district. My use-case could be met with: GRANT SELECT, INSERT, UPDATE, DELETE ON addr TO general_staff WHERE NOT sealed; GRANT SELECT, INSERT, UPDATE, DELETE ON addr TO sealed_addr_authority WHERE SEALED; GRANT general_staff TO sealed_addr_authority; Note that I think that if one has multiple roles with row level permissions on a table, access should be allowed if any of those roles allows it. I think that the above should be logically equivalent to (although perhaps slightly less efficient at run-time): GRANT SELECT, INSERT, UPDATE, DELETE ON addr TO general_staff WHERE NOT sealed; GRANT SELECT, INSERT, UPDATE, DELETE ON addr TO sealed_addr_authority; And just to round it out, that these could be applied to users with: GRANT general_staff TO bob; GRANT sealed_addr_authority TO supervisor; GRANT supervisor TO jean; I realize this is a major syntactic departure from the current patch, but it just seems a lot more natural and flexible. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Feature Request: pg_replication_master()
As ever, we spent much energy on debating backwards compatibility rather than just solving the problem it posed, which is fairly easy to solve. Well, IIRC, the debate was primarily of *your* making. Almost everyone else on the thread was fine with the original patch, and it was nearly done for 9.2 before you stepped in. I can't find anyone else on that thread who thought that backwards compatibility was more important than fixing the API. --Josh -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Feature Request: pg_replication_master()
On 2012-12-18 19:43:09 -0800, Josh Berkus wrote: We should probably have a function, like pg_replication_master(), which gives the host address of the current master. This would help DBAs for large replication clusters a lot. Obviously, this would only work in streaming. Do you want the node one step up or the top-level in the chain? Because I don't think we can do the latter without complicating the replication protocol noticably. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Switching timeline over streaming replication
I just committed a patch that should make the requested WAL segment 00020003 has already been removed errors go away. The trick was for walsenders to not switch to the new timeline until at least one record has been replayed on it. That closes the window where the walsender already considers the new timeline to be the latest, but the WAL file has not been created yet.

OK, I'll download the snapshot in a couple days and make sure this didn't break something else.

--Josh
Re: [HACKERS] pg_top
John R Pierce pie...@hogranch.com writes: /me tossed another mumbled curse at whomever changed that field name. The reason for the field name change was that the semantics of the field changed. You typically ought to look at what the application is actually doing with the field, not just do s/current_query/query/g and expect that all will be well. (In particular, if the app is looking for idle or idle in transaction state markers, it's going to need more adjustment than that.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_top
On Thu, Dec 20, 2012 at 4:17 PM, Tom Lane t...@sss.pgh.pa.us wrote: John R Pierce pie...@hogranch.com writes: /me tossed another mumbled curse at whomever changed that field name. The reason for the field name change was that the semantics of the field changed. You typically ought to look at what the application is actually doing with the field, not just do s/current_query/query/g and expect that all will be well. (In particular, if the app is looking for idle or idle in transaction state markers, it's going to need more adjustment than that.) IMSNHO, neither of these should have been changed, I would much rather have seen a new view or some other way of opting into the new functionality. merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cascading replication: should we detect/prevent cycles?
Robert, What would such a test look like? It's not obvious to me that there's any rapid way for a user to detect this situation, without checking each server individually. Change something on the master and observe that none of the supposed standbys notice? That doesn't sound like an infallible test, or a 60-second one. My point is that in a complex situation (imagine a shop with 9 replicated servers in 3 different cascaded groups, immediately after a failover of the original master), it would be easy for a sysadmin, responding to middle of the night page, to accidentally fat-finger an IP address and create a cycle instead of a new master. And once he's done that, a longish troubleshooting process to figure out what's wrong and why writes aren't working, especially if he goes to bed and some other sysadmin picks up the Writes failing to PostgreSQL ticket. *if* it's relatively easy for us to detect cycles (that's a big if, I'm not sure how we'd do it), then it would help a lot for us to at least emit a WARNING. That would short-cut a lot of troubleshooting. --Josh Berkus -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
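On the "big if" of detecting cycles: if each standby could report its immediate upstream (something the streaming protocol does not currently expose directly), the check itself is a short walk. A hypothetical client-side sketch, where `upstream[i]` is the index of the server that standby `i` streams from, or -1 for a true master:

```c
#include <stdbool.h>
#include <stdlib.h>

/* Follow the chain of upstreams from one server. If any server repeats
 * before the walk reaches a true master (-1), the topology has a cycle
 * and no writes can ever originate anywhere on it. */
bool replication_chain_has_cycle(const int *upstream, int nservers,
                                 int start)
{
    bool       *seen = calloc(nservers, sizeof(bool));
    bool        cycle = false;
    int         cur = start;

    while (cur != -1)
    {
        if (seen[cur])
        {
            cycle = true;       /* revisited a server: fat-fingered loop */
            break;
        }
        seen[cur] = true;
        cur = upstream[cur];
    }
    free(seen);
    return cycle;
}
```

The hard part is not this walk but gathering the `upstream` information in the first place; a server-side WARNING would need some equivalent of this chain knowledge, which is exactly what the protocol doesn't carry today.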
Re: [HACKERS] Feature Request: pg_replication_master()
Andreas, Do you want the node one step up or the top-level in the chain? Because I don't think we can do the latter without complicating the replication protocol noticably. Well, clearly a whole chain would be nice for the user. But even just one step up would be very useful. --Josh -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parser Cruft in gram.y
Another way of attack along these lines might be to use the %glr-parser and then try to cut back on all those redundant rules that were put in to avoid conflicts. The number of key words categories and such could perhaps also be reduced that way. Of course, this is mostly speculation. The GLR output from Bison is licensed under the GPL (unlike the LALR output). So using Bison's GLR mode isn't an option. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parser Cruft in gram.y
On 2012-12-20 23:12:55 +, McDevitt, Charles wrote: Another way of attack along these lines might be to use the %glr-parser and then try to cut back on all those redundant rules that were put in to avoid conflicts. The number of key words categories and such could perhaps also be reduced that way. Of course, this is mostly speculation. The GLR output from Bison is licensed under the GPL (unlike the LALR output). So using Bison's GLR mode isn't an option.

That's not the case anymore: http://www.gnu.org/software/bison/manual/html_node/Conditions.html

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Parser Cruft in gram.y
The GLR output from Bison is licensed under the GPL (unlike the LALR output). So using Bison's GLR mode isn't an option. Thats not the case anymore: http://www.gnu.org/software/bison/manual/html_node/Conditions.html Sorry! My mistake... I didn't realize they changed the rules. I'll be more careful to check these things in the future. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ThisTimeLineID in checkpointer and bgwriter processes
On Fri, Dec 21, 2012 at 2:45 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 20.12.2012 18:19, Fujii Masao wrote: InstallXLogFileSegment() also uses ThisTimeLineID. But your recent commit doesn't take care of it and prevents the standby from recycling the WAL files properly. Specifically, the standby recycles the WAL file to wrong name. A-ha, good catch. So that's actually a live bug in 9.1 and 9.2 as well: after the recovery target timeline has changed, restartpoints will continue to preallocate/recycle WAL files for the old timeline. That's otherwise harmless, but the useless WAL files waste space, and walreceiver will have to always create new files. So instead of always running with ThisTimeLineID = 0 in the checkpointer process, I guess we'll have to update it to the timeline being replayed, when creating a restartpoint. Yes. Thanks for fixing that. Regards, -- Fujii Masao
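The recycling bug above comes down to which timeline ID ends up in the recycled segment's file name: a WAL segment name in this era of PostgreSQL is simply three 8-digit uppercase hex fields (timeline ID, log file number, segment number) concatenated. A short sketch of that naming scheme (Python used purely for illustration; the helper name is mine, not the server's):

```python
def wal_segment_name(tli: int, log: int, seg: int) -> str:
    """Illustrative helper: a WAL segment file name is the timeline ID,
    log file number, and segment number, each as 8 uppercase hex digits."""
    return f"{tli:08X}{log:08X}{seg:08X}"

# A segment on timeline 2, of the kind referenced in this thread:
print(wal_segment_name(2, 0, 3))  # 000000020000000000000003
```

With ThisTimeLineID left at 0 (or stuck on the old timeline) in the checkpointer, recycled segments get renamed under the wrong timeline component, which is exactly the "recycles the WAL file to wrong name" symptom.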
Re: [HACKERS] Parser Cruft in gram.y
Andres Freund and...@2ndquadrant.com writes: On 2012-12-20 23:12:55 +, McDevitt, Charles wrote: Another way of attack along these lines might be to use the %glr-parser and then try to cut back on all those redundant rules that were put in to avoid conflicts. The number of keyword categories and such could perhaps also be reduced that way. The GLR output from Bison is licensed under the GPL (unlike the LALR output). So using Bison's GLR mode isn't an option. That's not the case anymore: http://www.gnu.org/software/bison/manual/html_node/Conditions.html This does mean that we'd have to specify a minimum bison version of 2.2 in order to be squeaky-clean license-wise, if we went over to using the GLR mode. However, that would likely be a good idea anyway from a technical standpoint --- the GLR mode may exist in ancient bison versions, but who knows how bug-free it is. Anyway, this is all merest speculation until somebody actually tries it and sees if a performance gain is possible. Having just re-read the description of GLR mode, I wouldn't be surprised if any savings in table size is squandered by its handling of ambiguous cases (ie, the need to track and merge multiple parser states). regards, tom lane
Re: [HACKERS] Switching timeline over streaming replication
On 20 December 2012 12:45, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 17.12.2012 15:05, Thom Brown wrote: I just set up 120 chained standbys, and for some reason I'm seeing these errors:

LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: record with zero length at 0/301EC10
LOG: fetching timeline history file for timeline 2 from primary server
LOG: restarted WAL streaming at 0/300 on timeline 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: new target timeline is 2
LOG: restarted WAL streaming at 0/300 on timeline 2
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 2
FATAL: error reading result of streaming command: ERROR: requested WAL segment 00020003 has already been removed
ERROR: requested WAL segment 00020003 has already been removed
LOG: started streaming WAL from primary at 0/300 on timeline 2
ERROR: requested WAL segment 00020003 has already been removed

I just committed a patch that should make the requested WAL segment 00020003 has already been removed errors go away. The trick was for walsenders to not switch to the new timeline until at least one record has been replayed on it. That closes the window where the walsender already considers the new timeline to be the latest, but the WAL file has not been created yet. Now I'm getting this on all standbys after promoting the first standby in a chain.
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: record with zero length at 0/301EC10
LOG: fetching timeline history file for timeline 2 from primary server
LOG: restarted WAL streaming at 0/300 on timeline 1
FATAL: could not receive data from WAL stream:
LOG: new target timeline is 2
FATAL: could not connect to the primary server: FATAL: the database system is in recovery mode
LOG: started streaming WAL from primary at 0/300 on timeline 2
TRAP: FailedAssertion(!(((sentPtr) <= (SendRqstPtr))), File: walsender.c, Line: 1425)
LOG: server process (PID 19917) was terminated by signal 6: Aborted
LOG: terminating any other active server processes
LOG: all server processes terminated; reinitializing
LOG: database system was interrupted while in recovery at log time 2012-12-20 23:41:23 GMT
HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
LOG: entering standby mode
FATAL: the database system is in recovery mode
LOG: redo starts at 0/228
LOG: consistent recovery state reached at 0/2E8
LOG: database system is ready to accept read only connections
LOG: record with zero length at 0/301EC70
LOG: started streaming WAL from primary at 0/300 on timeline 2
LOG: unexpected EOF on standby connection

And if I restart the new primary, the first new standby connected to it shows:

LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 2
FATAL: error reading result of streaming command: server closed the connection unexpectedly
This probably means the server terminated abnormally before or while processing the request.
LOG: record with zero length at 0/301F1E0

However, all other standbys don't show any additional log output. -- Thom
Re: [HACKERS] Review of Row Level Security
On 20 December 2012 21:50, Kevin Grittner kgri...@mail.com wrote: How about using existing GRANT syntax but allowing a WHERE clause? It's a nice feature, but a completely different thing to what is being discussed here. Row security adds the ability to enforce a single coherent policy at table level. It might be nice to have multiple, potentially overlapping policies, but that would require significantly different design and coding to what we have here. For me, enforcing a single policy at table level helps to make it secure by being coherent and understandable. So perhaps in later releases we might do the feature you suggest. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Set visibility map bit after HOT prune
On Wednesday, December 19, 2012, Robert Haas wrote: On Wed, Dec 19, 2012 at 12:26 PM, Pavan Deolasee pavan.deola...@gmail.com wrote: I would like to run some pgbench tests where we get the system in a steady state such as all/most updates are HOT updates (not entirely unlikely scenario for many real life cases). And then try running some concurrent queries which can be executed via IOS. My gut feel is that, today we will see slow and continuous drop in performance for these queries because IOS will slowly stop working. If there are no vacuums, I agree. If the table is randomly updated over its entire size, then pretty much every block will be not-all-visible (and so disqualified from IOS) before you hit the default 20% vacuum threshold. I wonder if there ought not be another vac threshold, based on vm density rather than estimated obsolete tuple density. Cheers, Jeff
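The "default 20% vacuum threshold" Jeff refers to is autovacuum's standard trigger: a table is vacuumed once its dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples (defaults 50 and 0.2). A quick sketch of that arithmetic, in Python for illustration:

```python
def autovacuum_dead_tuple_trigger(reltuples: float,
                                  threshold: int = 50,
                                  scale_factor: float = 0.2) -> float:
    # Defaults mirror autovacuum_vacuum_threshold (50) and
    # autovacuum_vacuum_scale_factor (0.2).
    return threshold + scale_factor * reltuples

# A 10M-row table accumulates roughly 2M dead tuples before autovacuum
# fires -- under random updates, plenty of time for most visibility-map
# bits to be cleared first:
print(autovacuum_dead_tuple_trigger(10_000_000))  # 2000050.0
```

This is why a random-update workload defeats index-only scans long before autovacuum runs: dead tuples spread over nearly every page, each clearing that page's all-visible bit.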
[HACKERS] Set visibility map bit after HOT prune
On Thursday, December 20, 2012, Robert Haas wrote: IMHO, it's probably fairly hopeless to make a pure pgbench workload show a benefit from index-only scans. A large table under a very heavy, completely random write workload is just about the worst possible case for index-only scans. Index-only scans are a way of avoiding unnecessary visibility checks when the target data hasn't changed recently, not a magic bullet to escape all heap access. If the target data has changed, you're going to have to touch the heap. And while I agree that we aren't aggressive enough in setting the VM bits right now, I also think it wouldn't be too hard to go too far in the opposite direction: we could easily spend more effort trying to make index-only scans effective than we could ever hope to recoup from the scans themselves. Now, if you set up N threads of which 10% are doing regular pgbench and the other 90% are doing pgbench -S, or something like that, then you might start to hope for some benefit from index-only scans. I set this up before, by dropping the primary key and instead building an index on (aid, abalance) and then just running pgbench with a mixture of -f flags that corresponded to some -S-like and some default-like transactions. On a freshly vacuumed table, I saw a hint of a performance boost at even a 50:50 ratio, and a clear boost at 3 -S to 1 default. I ran this at a size where not even all the index fit in RAM, to maximize the benefit of not having to visit the table. However, the boost started going away due to vm clearance long before autovacuum kicked in at default settings. Cheers, Jeff
Re: [HACKERS] [PERFORM] pgbench to the MAXINT
On Wed, Feb 16, 2011 at 8:15 AM, Greg Smith g...@2ndquadrant.com wrote: Tom Lane wrote: I think that might be a good idea --- it'd reduce the cross-platform variability of the results quite a bit, I suspect. random() is not to be trusted everywhere, but I think erand48 is pretty much the same wherever it exists at all (and src/port/ provides it elsewhere). Given that pgbench will run with threads in some multi-worker configurations, after some more portability research I think odds are good we'd get nailed by http://sourceware.org/bugzilla/show_bug.cgi?id=10320: erand48 implementation not thread safe but POSIX says it should be. The AIX docs have a similar warning on them, so who knows how many versions of that library have the same issue. Maybe we could make sure the one in src/port/ is thread safe and make sure pgbench only uses it. This whole area continues to be messy enough that I think the patch needs to brew for another CF before it will all be sorted out properly. I'll mark it accordingly and can pick this back up later. Hi Greg, I spent some time rebasing this patch to current master. Attached is the patch, based on master a couple of commits old. Your concern about using erand48() has been resolved since pgbench now uses the thread-safe and concurrent pg_erand48() from src/port/. The patch is very much what you had posted, except for a couple of differences due to bit-rot. (i) I didn't have to #define MAX_RANDOM_VALUE64 since its cousin MAX_RANDOM_VALUE is not used by code anymore, and (ii) I used a ternary operator in the DDLs[] array to decide when to use bigint vs int columns. Please review. As for tests, I am currently running 'pgbench -i -s 21474' using unpatched pgbench, and am recording the time taken; scale factor 21475 had actually failed to do anything meaningful using unpatched pgbench.
Next I'll run with '-s 21475' on patched version to see if it does the right thing, and in acceptable time compared to '-s 21474'. What tests would you and others like to see, to get some confidence in the patch? The machine that I have access to has 62 GB RAM, 16-core 64-hw-threads, and about 900 GB of disk space. Linux host 3.2.6-3.fc16.ppc64 #1 SMP Fri Feb 17 21:41:20 UTC 2012 ppc64 ppc64 ppc64 GNU/Linux Best regards, PS: The primary source of patch is this branch: https://github.com/gurjeet/postgres/tree/64bit_pgbench -- Gurjeet Singh http://gurjeet.singh.im/ pgbencg-64-v6.patch
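The scale factors chosen for the test aren't arbitrary: pgbench creates 100,000 pgbench_accounts rows per unit of scale, so with a 32-bit integer aid column the cutoff falls exactly between 21474 and 21475. A quick check of that arithmetic (Python, for illustration):

```python
INT32_MAX = 2**31 - 1        # 2147483647, largest 32-bit int aid value
ROWS_PER_SCALE = 100_000     # pgbench_accounts rows per unit of scale

max_scale = INT32_MAX // ROWS_PER_SCALE
print(max_scale)                              # 21474
print(21474 * ROWS_PER_SCALE <= INT32_MAX)    # True
print(21475 * ROWS_PER_SCALE <= INT32_MAX)    # False: overflows, needs bigint
```

That is why the unpatched pgbench fails at scale 21475: 2,147,500,000 account rows cannot be numbered with a 32-bit int, and the 64-bit patch switches the column to bigint past that point.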
Re: [HACKERS] ThisTimeLineID in checkpointer and bgwriter processes
On Thursday, December 20, 2012 11:15 PM Heikki Linnakangas wrote: On 20.12.2012 18:19, Fujii Masao wrote: InstallXLogFileSegment() also uses ThisTimeLineID. But your recent commit doesn't take care of it and prevents the standby from recycling the WAL files properly. Specifically, the standby recycles the WAL file to wrong name. A-ha, good catch. So that's actually a live bug in 9.1 and 9.2 as well: after the recovery target timeline has changed, restartpoints will continue to preallocate/recycle WAL files for the old timeline. That's otherwise harmless, but the useless WAL files waste space, and walreceiver will have to always create new files. So instead of always running with ThisTimeLineID = 0 in the checkpointer process, I guess we'll have to update it to the timeline being replayed, when creating a restartpoint. Shouldn't there be a check if (RecoveryInProgress), before assigning RecoveryTargetTLI to ThisTimeLineID in CreateRestartPoint()? With Regards, Amit Kapila.
Re: [HACKERS] Review of Row Level Security
2012/12/20 Kevin Grittner kgri...@mail.com: Kohei KaiGai wrote: If the system ensures the writer's permission is always equivalent to or more restrictive than the reader's permission, it also eliminates the problem around asymmetric row-security policies between commands. I'm not sure we're understanding each other; so far people who favor asymmetric read and write predicates for row level security have been arguing that people should be able to write tuples that they can't read back. Of course you need to be able to read a tuple to update it, but the argument is that you should be able to configure security so that a role can (for example) update a row to set a sealed flag, and then no longer have rights to read that row (including for purposes of update). You can give away data which is yours, but you can't then take it back if it's not. Hmm, for this example, the right behavior is to check rows with ((row-security of select) AND (row-security of update)) at the scan stage, but to check only (row-security of update) just before each row update. If (row-security of select) is sealed = false, a sealed row is not visible and thus not a target of the update, but a user can turn on the flag to make a row disappear. In any case, the row-security applied at the scan stage of the source of an update/delete needs to be equivalent to or more restrictive than the rules for select. However, it might not be necessary to ensure the rules are restrictive at the write stage. Probably, we can implement it as ((row-security of select) AND (row-security of update)), which always ensures a restrictive row-security policy. I hadn't been paying a lot of attention to this patch until I saw the question about whether a user with a particular role could write a row which would then not be visible to that role. I just took a look at the patch. I don't think I like ALTER TABLE as a syntax for row level security. How about using existing GRANT syntax but allowing a WHERE clause?
That seems more natural to me, and it would make it easy to apply the same conditions to multiple types of operations when desired, but use different expressions when desired. It seems to me this syntax gives the impression that the row-security feature is tightly coupled with the role mechanism, even though it need not be. For example, we can set a row-security policy based on whether the primary key is an even or odd number, independent of the working role. Without having spent a lot of time pondering it, I think that if row level SELECT permissions exist, they would need to be met on the OLD tuple to allow DELETE or UPDATE, and UPDATE row level permissions would be applied to the NEW tuple. Yes, I agree. Thanks, -- KaiGai Kohei kai...@kaigai.gr.jp
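KaiGai's proposed composition, scanning with ((row-security of select) AND (row-security of update)) and checking only the update policy on the new tuple, can be modeled with a small sketch. The predicates below (a sealed flag hidden from SELECT, an owner check for UPDATE) are hypothetical illustrations of Kevin's example, not the patch's actual syntax or API:

```python
from dataclasses import dataclass

@dataclass
class Row:
    owner: str
    sealed: bool

# Hypothetical per-command policies, for illustration only.
def select_policy(user: str, row: Row) -> bool:
    return not row.sealed            # sealed rows are invisible to SELECT

def update_policy(user: str, row: Row) -> bool:
    return row.owner == user         # you may only write rows you own

def update_targets(user: str, rows: list) -> list:
    # Scan stage: the OLD tuple must pass select-policy AND update-policy.
    return [r for r in rows
            if select_policy(user, r) and update_policy(user, r)]

rows = [Row("alice", sealed=False),
        Row("alice", sealed=True),   # already sealed: invisible to alice
        Row("bob", sealed=False)]    # visible but not alice's to update
targets = update_targets("alice", rows)
print(len(targets))  # 1
```

Only the update policy is applied to the NEW tuple, so alice may set sealed=True on her one target row and then lose sight of it, Kevin's "give away but can't take back" case.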