Re: [HACKERS] Synchronous Standalone Master Redoux
On 16.07.2012 22:01, Robert Haas wrote: On Sat, Jul 14, 2012 at 7:54 PM, Josh Berkus j...@agliodbs.com wrote:

So, here's the core issue with degraded mode. I'm not mentioning this to block any patch anyone has, but rather out of a desire to see someone address this core problem with some clever idea I've not thought of. The problem in a nutshell is: indeterminacy.

Assume someone implements degraded mode. Then:

1. Master has one synchronous standby, Standby1, and two asynchronous, Standby2 and Standby3.
2. Standby1 develops a NIC problem and is in and out of contact with Master. As a result, it's flipping in and out of synchronous / degraded mode.
3. Master fails catastrophically due to a RAID card meltdown. All data lost.

At this point, the DBA is in kind of a pickle, because he doesn't know:

(a) Was Standby1 in synchronous or degraded mode when Master died? The only log for that was on Master, which is now gone.
(b) Is Standby1 actually the most caught-up standby, and thus the appropriate new master for Standby2 and Standby3, or is it behind?

With the current functionality of Synchronous Replication, you don't have either piece of indeterminacy, because some external management process (hopefully located on another server) needs to disable synchronous replication when Standby1 develops its problem. That is, if the master is accepting synchronous transactions at all, you know that Standby1 is up-to-date, and no data is lost. While you can answer (b) by checking all servers, (a) is particularly pernicious, because unless you have the application log all "operating in degraded mode" messages, there is no way to ever determine the truth.

Good explanation. In brief, the problem here is that you can only rely on the no-transaction-loss guarantee provided by synchronous replication if you can be certain that you'll always be aware of it when synchronous replication gets shut off. Right now that is trivially true, because it has to be shut off manually.
If we provide a facility that logs a message and then shuts it off, we lose that certainty, because the log message could get eaten en route by the same calamity that takes down the master. There is no way for the master to WAIT for the log message to be delivered and only then degrade.

However, we could craft a mechanism that has this effect. Suppose we create a new GUC with a name like synchronous_replication_status_change_command. If we're thinking about switching between synchronous replication and degraded mode automatically, we first run this command. If it returns 0, then we're allowed to switch, but if it returns anything else, then we're not allowed to switch (but can retry the command after a suitable interval). The user is responsible for supplying a command that records the status change somewhere off-box in a fashion that's sufficiently durable that the user has confidence that the notification won't subsequently be lost. For example, the user-supplied command could SSH into three machines located in geographically disparate data centers and create a file with a certain name on each one, returning 0 only if it's able to reach at least two of them and create the file on all the ones it can reach. If the master dies, but at least two of those three machines are still alive, we can determine with confidence whether the master might have been in degraded mode at the time of the crash. More or less paranoid versions of this scheme are possible depending on user preferences, but the key point is that for the no-transaction-loss guarantee to be of any use, there has to be a way to reliably know whether that guarantee was in effect at the time the master died in a fire. Logging isn't enough, but I think some more sophisticated mechanism can get us there.

Yeah, I think that's the right general approach. Not necessarily that exact GUC, but something like that.
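Robert's quorum notification scheme above is easy to sketch independently of any PostgreSQL specifics. The following is a minimal illustration only; the function name, the exit-code convention, and the injected "writer" callables (standing in for SSH to three geographically separate machines) are all assumptions made for the sketch, not a proposed API:

```python
# Sketch of a synchronous_replication_status_change_command-style check
# (the GUC proposed in the mail above): record the mode switch at N
# off-box locations and permit degrading only if a majority accepted it.
# The "writer" callables stand in for ssh/scp to real machines.

def record_status_change(writers, quorum=None):
    """Return 0 (allowed to switch modes) if at least `quorum` writers
    durably accept the marker, else 1 (caller should retry later).
    Mirrors the exit-code convention proposed for the command."""
    quorum = quorum if quorum is not None else len(writers) // 2 + 1
    acknowledged = 0
    for write in writers:
        try:
            write("master entered degraded mode")
            acknowledged += 1
        except OSError:
            pass  # unreachable data center; keep trying the others
    return 0 if acknowledged >= quorum else 1
```

An external monitoring script would wire real SSH invocations in as the writers; the point is simply that the master refuses to degrade unless a majority of off-box recorders acknowledged the mode change, so a surviving majority can later answer "was the master degraded when it died?"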
I don't want PostgreSQL to get more involved in determining the state of the standby, when to do failover, or when to fall back to degraded mode. That's a whole new territory with all kinds of problems, and there is plenty of software out there to handle that. Usually you have some external software to do monitoring and to initiate failovers anyway. What we need is a better API for co-operating with such software, to perform failover, and to switch replication between synchronous and asynchronous modes.

BTW, one little detail that I don't think has been mentioned in this thread before: even though the master currently knows whether a standby is connected or not, and you could write a patch to act based on that, there are other failure scenarios where you would still not be happy. For example, imagine that the standby has a disk failure. It stays connected to the master, but fails to fsync anything to disk. Would you want to fall back to degraded mode and just do asynchronous replication in that case? How do you decide when to do that in the master? Or what if the standby keeps making progress, but becomes incredibly slow for some reason, like disk failure in a RAID array? I'd rather outsource all that logic to external monitoring software - software that you should be running anyway.
Re: [HACKERS] Synchronous Standalone Master Redoux
On Mon, Jul 16, 2012 at 10:58 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: BTW, one little detail that I don't think has been mentioned in this thread before: Even though the master currently knows whether a standby is connected or not, and you could write a patch to act based on that, there are other failure scenarios where you would still not be happy. For example, imagine that the standby has a disk failure. It stays connected to the master, but fails to fsync anything to disk. Would you want to fall back to degraded mode and just do asynchronous replication in that case? How do you decide when to do that in the master? Or what if the standby keeps making progress, but becomes incredibly slow for some reason, like disk failure in a RAID array? I'd rather outsource all that logic to external monitoring software - software that you should be running anyway.

I would like to express some support for the non-edge nature of this case. Outside of simple loss of availability of a server, losing access to a block device is probably the second-most-common cause of loss of availability for me. It's especially insidious because simple "SELECT 1" checks may continue to return for quite some time, so instead we rely on Linux diskstats parsing to see if write progress hits zero for a while.

In cases like these, the overhead of a shell command to rapidly consort with a decision-making process can be prohibitive -- it's already a pretty big waster of time for me in WAL archiving/dearchiving, where process startup and SSL negotiation and lack of parallelization can be pretty slow. The proposed status-change command may exhibit the same problem. I would like to plead that whatever is done would be most useful being controllable via non-GUCs in its entirety -- arguably that is already the case, since one can write a replication protocol client to do the job, by faking the standby status update messages, but perhaps there is a more lucid way if one makes accommodation.
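The diskstats-based stall detection Daniel mentions can be sketched roughly like this. The field layout follows the /proc/diskstats format (the tenth field of each line is the cumulative count of sectors written); the function names and the simple zero-delta rule are illustrative assumptions, not his actual tooling:

```python
# Rough sketch of the /proc/diskstats polling described above: if the
# sectors-written counter for a device stops advancing between two
# samples, treat the volume as stalled even though the server may still
# be answering "SELECT 1".  Field 10 of each diskstats line is the
# cumulative number of sectors written to that device.

def sectors_written(diskstats_text, device):
    """Extract the cumulative sectors-written counter for one device
    from a snapshot of /proc/diskstats contents."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) > 9 and fields[2] == device:
            return int(fields[9])
    raise KeyError(device)

def write_stalled(sample_before, sample_after, device):
    """True if no sectors were written between the two snapshots."""
    return (sectors_written(sample_after, device)
            <= sectors_written(sample_before, device))
```

A real monitor would take the snapshots from open("/proc/diskstats").read() at an interval and require the stall to persist across several polls before raising an alarm.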
In particular, the awkwardness of using pg_receivexlog[0] or a similar tool for replacing archive_command is something that I feel should be addressed eventually, so as not to be a second-class citizen. Although that is already being worked on[1]... the archive command has no backpressure either, other than running out of disk.

The case of restore_command is even more sore: remastering or archive-recovery via streaming protocol actions is kind of a pain at the moment. I haven't thoroughly explored this yet and I don't think it is documented, but it can be hard for something that is dearchiving from WAL segments stored somewhere to find exactly the right record to start replaying at: the WAL record format is not stable, and it need not be, if the server helps by ignoring records that predate what it requires or can inform the process feeding WAL that it got things wrong. Maybe that is the case, but it is not documented.

I also don't think there are any guarantees around the maximum size or alignment of WAL shipped by the streaming protocol in XLogData messages, and that's too bad. Also, the endianness of WAL position fields in the XLogData messages is host-byte-order-dependent, which sucks if you are forwarding WAL around but need to know what range is contained in a message. In practice many people can say "all I have is little-endian", but it is somewhat unpleasant and not necessarily the case. Correct me if I'm wrong, I'd be glad for it.

[0]: see the notes section, http://www.postgresql.org/docs/devel/static/app-pgreceivexlog.html
[1]: http://archives.postgresql.org/pgsql-hackers/2012-06/msg00348.php

-- fdr

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using pg_upgrade on log-shipping standby servers
On Mon, Jul 16, 2012 at 5:29 PM, Jeff Davis pg...@j-davis.com wrote: On Tue, 2012-07-10 at 11:50 -0400, Bruce Momjian wrote: I don't think we can assume that because pg_upgrade was run on the master and standby that they are binary identical, can we? Technically the user files are identical, but the system catalogs and WAL might be different, hence my suggestion to run rsync before allowing the standby to rejoin the primary.

Do you have plans to change that in the future? If we know that the user data files are identical between primary and replica, it would be nice if we could provide a robust way to avoid copying them.

How about this alternative that may sound crazy, but would lend itself to some unification in archiving: could pg_upgrade emit WAL segment(s) to provide continuity of a timeline? So something like:

* Take down the writable primary for pg_upgrade
* Some WAL is emitted and possibly archived
* The old version, when reaching the special pg_upgrade WAL, could exit or report its situation having paused replay (as clearly, it cannot proceed). Unsure.
* Start up a new version of postgres on the same cluster at that point, which plays the upgrade-WAL.

I see this being pretty mechanically intensive, but right now my hands are completely tied as to achieving total continuity of my archives, costing a base-backup's worth of risk window upon upgrade.

-- fdr
Re: [HACKERS] pl/perl and utf-8 in sql_ascii databases
Hello, I suppose that testing for the two cases, plus one additional case which runs pg_do_encoding_conversion(), say latin1, would be enough to confirm that encoding/decoding is properly done, since the concrete conversion scheme is not significant in this case. So I recommend that we should add the test for latin1 and omit the test for encodings other than sql_ascii, utf8 and latin1. This might be achieved by creating empty plperl_lc.sql and plperl_lc.out files for those encodings. What do you think about that?

I think that's probably too much engineering for something that doesn't really warrant it. A real solution to this problem could be to create yet another new test file containing just this function definition and the query that calls it, and have one expected file for each encoding; but that's too much work and too many files, I'm afraid.

I agree completely. The balance between the additional complexity in regress and what we would get from that complexity...

I can see us supporting tests that require a small number of expected files. No Make tricks with file copying, though. If we can't get some easy way to test this without that, I submit we should just remove the test.

OK, I agree to do so.

regards, -- Kyotaro Horiguchi NTT Open Source Software Center == My e-mail address has been changed since Apr. 1, 2012.
Re: [HACKERS] b-tree index search algorithms
Am 17.07.12 05:21, schrieb Tom Lane: Samuel Vogel s...@muel-vogel.de writes: I'm currently on a university research project investigating whether performance could be increased by substituting different inter-node search algorithms for the currently used binary search.

Hm, what have you got in mind exactly?

At first I will try a simple interpolation search, but problems start there since I need to have a numerical representation of the index keys (or map them to one) to do the interpolation. But I'm having trouble understanding how the general b-tree implementation (nbtree.h) is used to represent, for example, a simple primary key on an integer column. I've debug-printed 'scankey->sk_argument' and all attributes of the index tuples on the pages being traversed (simply ran 'DatumGetInt32' on both) but I never see one of the integers actually appearing in my table being logged when I do a select.

Not clear what you did wrong from this amount of detail, but integer keys ought to be pretty obvious at the debugger level.

Okay, to be more specific: printing 'DatumGetInt32(scankey->sk_argument)' in '_bt_compare' never shows me 50 when I execute this query: SELECT * FROM simpletest WHERE id = 50; This is why I assume that all column values are hashed before they are pushed into the b-tree.

PG's b-trees do not hash anything. If you're not seeing interpretable key values then you're doing something wrong in your inspection methodology.

Okay, how are indexes on char/text columns handled then? Are they hashed before being put into the b-tree, or is my assumption correct that in that case the Datum is only a link to where the actual data is stored, and only 'scankey->sk_func' knows how to make use of it (compare it)? In that case it would be extremely hard to get to a numeric representation which can be used for the interpolation.
Regards, Samuel Vogel
Re: [HACKERS] Closing out the June commitfest
On 16 July 2012 01:16, Tom Lane t...@sss.pgh.pa.us wrote: We are now at the end of the originally scheduled one-month window for the June commitfest. While the numbers look fairly bad: Needs Review: 17, Waiting on Author: 10, Ready for Committer: 3, Committed: 29, Returned with Feedback: 12, Rejected: 5. Total: 76. it's not quite a complete disaster, because almost all of the needs-review patches did actually get some review and/or had new versions posted during the fest. We did not get them to the point of being committable, but we did make progress. I only see about three patches that seem to have received no attention whatsoever. At this point we could move the open items to the September fest and call this one good, or we could keep trying to close things out. Personally I'd like to do the former, because we really need to spend some effort on closing out the various open issues for 9.2, and the commitfest seems to have sucked up all the available time of those who might've been fixing those issues over the past month.

Sounds fine to me. I've been unavailable for much of this CF, so my intention is to continue with my parts of it, meaning the reviews I was scheduled to do won't be put off until Sept. But first, I'll review the 9.2 open items list again. That's a personal point, not trying to suggest everybody else should do that.

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Covering Indexes
On 28 June 2012 13:16, David E. Wheeler da...@justatheory.com wrote: Very interesting design document for SQLite 4: http://www.sqlite.org/src4/doc/trunk/www/design.wiki I'm particularly intrigued by covering indexes. For example: CREATE INDEX cover1 ON table1(a,b) COVERING(c,d); This allows the following query to do an index-only scan: SELECT c, d FROM table1 WHERE a=? AND b=?; Now that we have index-only scans in 9.2, I'm wondering if it would make sense to add covering index support, too, where additional, unindexed columns are stored alongside indexed columns. Just to be clear, the ability to have covered indexes is already in 9.2, I updated the release notes to explain that a few months back. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Covering Indexes
On Jul 17, 2012, at 5:18 PM, Simon Riggs wrote: Now that we have index-only scans in 9.2, I'm wondering if it would make sense to add covering index support, too, where additional, unindexed columns are stored alongside indexed columns. Just to be clear, the ability to have covered indexes is already in 9.2, I updated the release notes to explain that a few months back. You mean this? Allow queries to retrieve data only from indexes, avoiding heap access (Robert Haas, Ibrar Ahmed, Heikki Linnakangas, Tom Lane) This is often called index-only scans or covering indexes. This is possible for heap pages with exclusively all-visible tuples, as reported by the visibility map. The visibility map was made crash-safe as a necessary part of implementing this feature. That’s not how SQLite is using the term “covering index.” What they mean is the ability to have additional, unindexed columns in an index, so that they can *also* be returned in the event of an index-only scan. Best, David
Re: [HACKERS] Covering Indexes
On 17 July 2012 16:21, David E. Wheeler da...@justatheory.com wrote: On Jul 17, 2012, at 5:18 PM, Simon Riggs wrote: Now that we have index-only scans in 9.2, I'm wondering if it would make sense to add covering index support, too, where additional, unindexed columns are stored alongside indexed columns. Just to be clear, the ability to have covered indexes is already in 9.2, I updated the release notes to explain that a few months back. You mean this? Allow queries to retrieve data only from indexes, avoiding heap access (Robert Haas, Ibrar Ahmed, Heikki Linnakangas, Tom Lane) This is often called index-only scans or covering indexes. This is possible for heap pages with exclusively all-visible tuples, as reported by the visibility map. The visibility map was made crash-safe as a necessary part of implementing this feature. That’s not how SQLite is using the term “covering index.” What they mean is the ability to have additional, unindexed columns in an index, so that they can *also* be returned in the event of an index-only scan.

CREATE INDEX ON foo (a, b, c, d); allows SELECT c, d FROM foo WHERE a = ? AND b = ? to use an index-only scan. The phrase 'unindexed' seems misleading, since the data is clearly in the index from the description on the URL you gave. And since the index is non-unique, I don't see any gap between Postgres and SQLite4.

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Covering Indexes
On Jul 17, 2012, at 5:32 PM, Simon Riggs wrote: CREATE INDEX ON foo (a, b, c, d); allows SELECT c, d FROM foo WHERE a = ? AND b = ? to use an index only scan. The phrase unindexed seems misleading since the data is clearly in the index from the description on the URL you gave. And since the index is non-unique, I don't see any gap between Postgres and SQLite4. Yeah, but that index is unnecessarily big if one will never use c or d in the search. The nice thing about covering indexes as described for SQLite 4 and implemented in MSSQL is that you can specify additional columns that just come along for the ride, but are not part of the indexed data: CREATE INDEX cover1 ON table1(a,b) COVERING(c,d); Yes, you can do that by also indexing c and d as of 9.2, but it might be nice to be able to include them in the index as additional row data without actually indexing them. Best, David
Re: [HACKERS] Covering Indexes
On 17 July 2012 16:54, David E. Wheeler da...@justatheory.com wrote: On Jul 17, 2012, at 5:32 PM, Simon Riggs wrote: CREATE INDEX ON foo (a, b, c, d); allows SELECT c, d FROM foo WHERE a = ? AND b = ? to use an index only scan. The phrase unindexed seems misleading since the data is clearly in the index from the description on the URL you gave. And since the index is non-unique, I don't see any gap between Postgres and SQLite4. Yeah, but that index is unnecessarily big if one will never use c or d in the search. The nice thing about covering indexes as described for SQLite 4 and implemented in MSSQL is that you can specify additional columns that just come along for the ride, but are not part of the indexed data: CREATE INDEX cover1 ON table1(a,b) COVERING(c,d); Yes, you can do that by also indexing c and d as of 9.2, but it might be nice to be able to include them in the index as additional row data without actually indexing them.

Can you explain what you mean by "without actually indexing them"? ISTM that it is a non-feature if the index is already non-unique, and the difference is simply down to the amount of snake oil applied to the descriptive text in the release notes.

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Covering Indexes
On Tue, Jul 17, 2012 at 6:08 PM, Simon Riggs si...@2ndquadrant.com wrote: On 17 July 2012 16:54, David E. Wheeler da...@justatheory.com wrote: Yeah, but that index is unnecessarily big if one will never use c or d in the search. The nice thing about covering indexes as described for SQLite 4 and implemented in MSSQL is that you can specify additional columns that just come along for the ride, but are not part of the indexed data: CREATE INDEX cover1 ON table1(a,b) COVERING(c,d); Yes, you can do that by also indexing c and d as of 9.2, but it might be nice to be able to include them in the index as additional row data without actually indexing them. Can you explain what you mean by without actually indexing them? ISTM that it is a non-feature if the index is already non-unique, and the difference is simply down to the amount of snake oil applied to the descriptive text on the release notes. It would be useful in non-unique indexes to store data without ordering operators (like xml).
Re: [HACKERS] Covering Indexes
-Original Message- From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers- ow...@postgresql.org] On Behalf Of David E. Wheeler Sent: Tuesday, July 17, 2012 11:55 AM To: Simon Riggs Cc: Pg Hackers Subject: Re: [HACKERS] Covering Indexes

On Jul 17, 2012, at 5:32 PM, Simon Riggs wrote: CREATE INDEX ON foo (a, b, c, d); allows SELECT c, d FROM foo WHERE a = ? AND b = ? to use an index only scan. The phrase unindexed seems misleading since the data is clearly in the index from the description on the URL you gave. And since the index is non-unique, I don't see any gap between Postgres and SQLite4. Yeah, but that index is unnecessarily big if one will never use c or d in the search. The nice thing about covering indexes as described for SQLite 4 and implemented in MSSQL is that you can specify additional columns that just come along for the ride, but are not part of the indexed data: CREATE INDEX cover1 ON table1(a,b) COVERING(c,d); Yes, you can do that by also indexing c and d as of 9.2, but it might be nice to be able to include them in the index as additional row data without actually indexing them. Best, David

Concretely, I would presume that the contents of a covering index could then look like the following (a,b,c,d):

(2,1,2,A)
(2,1,5,A) -- the 5 is out of natural order but exists in the covering part
(2,1,3,A)

Whereas PostgreSQL would be forced to have the index ordered as such:

(2,1,2,A)
(2,1,3,A)
(2,1,5,A)

Either way, the data in c and d are IN THE INDEX; otherwise, in neither case could the data values be returned while strictly querying the index. So the question that needs to be asked is what kind of performance increase can be had during DML (insert/update) statements and whether those gains are worth pursuing. Since these other engines appear to allow both cases, you should be able to get at least a partial idea of the performance gains between index (a,b,c,d) and index (a,b) covering (c,d).
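The distinction being debated can be modeled in a few lines. This is an illustrative toy, not PostgreSQL code: an index keyed only on (a, b) whose entries carry (c, d) as un-compared payload. Because lookups never compare the payload, it needs no ordering operators, which is where Vik's xml example fits in; all class and method names here are invented for the sketch.

```python
# Toy model of "CREATE INDEX ... ON t(a,b) COVERING (c,d)": entries are
# kept sorted on the (a, b) key alone, and the (c, d) payload rides
# along un-ordered, so an equality lookup can return c and d without
# touching the "heap" -- an index-only scan.
import bisect

class CoveringIndex:
    def __init__(self):
        self._keys = []      # sorted list of (a, b) key pairs
        self._payload = []   # (c, d) stored alongside, never compared

    def insert(self, a, b, c, d):
        pos = bisect.bisect_right(self._keys, (a, b))
        self._keys.insert(pos, (a, b))
        self._payload.insert(pos, (c, d))

    def lookup(self, a, b):
        """Index-only scan: return every (c, d) stored under key (a, b)."""
        lo = bisect.bisect_left(self._keys, (a, b))
        hi = bisect.bisect_right(self._keys, (a, b))
        return self._payload[lo:hi]
```

Note that, exactly as David Johnston's example shows, the payload tuples for equal keys come back in arrival order rather than (c, d) order, which a four-column index would have imposed.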
Vik's concurrent point regarding non-indexable values makes some sense but the use case there seems specialized as I suspect that in the general case values that are non-indexable (if there truly are any) are generally those that would be too large to warrant sticking into an index in the first place. But, XML values do ring true in my mind (particularly frequently used fragments that are generally quite small). But again whether that is a reasonable use case for a covering index I do not know. It feels like trying to solve the remaining 10% when it took a long while to even muster up enough support and resources to solve the 90%. David J.
Re: [HACKERS] Covering Indexes
On 17 July 2012 17:41, David Johnston pol...@yahoo.com wrote: Concretely, I would presume that the contents of a covering index could then look like the following (a,b,c,d): (2,1,2,A) (2,1,5,A) -- the 5 is out of natural order but exists in the covering part (2,1,3,A) Whereas PostgreSQL would be forced to have the index ordered as such: (2,1,2,A) (2,1,3,A) (2,1,5,A) Either way the data in c and d are IN THE INDEX otherwise in neither case could the data values be returned while strictly querying the index. So the question that needs to be asked is what kind of performance increase can be had during DML (insert/update) statements and whether those gains are worth pursuing. Since these other engines appear to allow both cases you should be able to get at least a partial idea of the performance gains between index (a,b,c,d) and index (a,b) covering (c,d). There is a use case, already discussed, whereby that is useful for create unique index on foo (a,b) covering (c,d) but there really isn't any functional difference between create index on foo (a,b) covering (c,d) and create index on foo (a,b,c,d) There is a potential performance impact. But as Tom says, that might even be negative if it is actually measurable. Vik's concurrent point regarding non-indexable values makes some sense but the use case there seems specialized as I suspect that in the general case values that are non-indexable (if there truly are any) are generally those that would be too large to warrant sticking into an index in the first place. I think it would be easy enough to add noop operators for sorts if you wanted to do that. But, XML values do ring true in my mind (particularly frequently used fragments that are generally quite small). But again whether that is a reasonable use case for a covering index I do not know. It feels like trying to solve the remaining 10% when it took a long while to even muster up enough support and resources to solve the 90%. 
The main thing is that we definitely already do have covering indexes and we will be announcing we have that soon. The fact we have chosen to implement that without adding new syntax strikes me as a selling point as well, so all client tools still work. So the feature we are talking about here needs to be called something else, otherwise we will be confusing people. Unsorted trailing index columns... -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Covering Indexes
On 07/17/2012 12:41 PM, David Johnston wrote: So the question that needs to be asked is what kind of performance increase can be had during DML (insert/update) statements and whether those gains are worth pursuing. Since these other engines appear to allow both cases you should be able to get at least a partial idea of the performance gains between index (a,b,c,d) and index (a,b) covering (c,d). Tom's recent answer to me on this point (as I understood it) was that he would expect performance to degrade, not improve, since the btree code is known not to perform well when there are many non-unique values. cheers andrew
Re: [HACKERS] Covering Indexes
David E. Wheeler da...@justatheory.com writes: On Jul 17, 2012, at 5:32 PM, Simon Riggs wrote: The phrase unindexed seems misleading since the data is clearly in the index from the description on the URL you gave. And since the index is non-unique, I don't see any gap between Postgres and SQLite4. Yeah, but that index is unnecessarily big if one will never use c or d in the search.

The data would still have to be stored in the leaf entries, at least. Yeah, you could possibly omit the unindexed columns from upper tree levels, but with typical btree fanout ratios in the hundreds, the overall space savings would be negligible. The idea of different index tuple descriptors on different tree levels doesn't appeal to me, either.

regards, tom lane
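Tom's fanout argument above is quick to check numerically: in a B-tree with branching factor f, the non-leaf levels hold roughly 1/f + 1/f^2 + ... of all entries, so stripping payload columns from them saves almost nothing. A back-of-envelope sketch (the function name is invented for illustration):

```python
# With branching factor `fanout`, each level above the leaves holds
# about 1/fanout as many entries as the level below it, so the fraction
# of all index entries living in non-leaf pages is the geometric series
# 1/f + 1/f^2 + ... -- which is what omitting "covering" columns from
# upper levels could save, at most.

def upper_level_fraction(fanout, levels=10):
    """Approximate fraction of index entries stored above the leaf level."""
    return sum(fanout ** -k for k in range(1, levels + 1))

# With fanout in the hundreds, well under 1% of entries sit above the
# leaves, matching Tom's "negligible" assessment.
```

For a fanout of 200 the fraction is about 0.5%, so even a perfect implementation of payload-free upper levels would shrink the index by half a percent.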
Re: [HACKERS] CompactCheckpointerRequestQueue versus pad bytes
Robert Haas robertmh...@gmail.com writes: On Mon, Jul 16, 2012 at 9:58 PM, Tom Lane t...@sss.pgh.pa.us wrote: BTW, I wonder whether the code that checks for relfilenode conflict when selecting a pg_class or relfilenode OID tries both file naming conventions? If not, should we make it do so? I don't believe it does, nor do I see what we would gain by making it do so.

What we would gain is ensuring that we aren't using the same relfilenode for both a regular table and a temp table. Do you really want to assume that such a conflict is 100% safe? It sounds pretty scary to me, and even if we were sure the backend would never get confused, what about client-side code that's looking at relfilenode?

regards, tom lane
Re: [HACKERS] b-tree index search algorithms
Samuel Vogel s...@muel-vogel.de writes: Am 17.07.12 05:21, schrieb Tom Lane: Samuel Vogel s...@muel-vogel.de writes: I'm currently on a university research project investigating whether performance could be increased by substituting different inter-node search algorithms for the currently used binary search. Hm, what have you got in mind exactly? At first I will try a simple interpolation search, but problems start there since I need to have a numerical representation of the index keys (or map them to one) to do the interpolation.

Dunno about that. btree knows nothing about the datatypes it's working on except that they have comparison functions. Converting the values to some sort of numeric scale that you can interpolate on seems logically dubious and fraught with practical difficulties. Now, we do have some code in selfuncs.c that tries to do that, for some data types, but it's only for planner estimation purposes, and we don't rely very heavily on its results even in that context. Depending on it to be right for search purposes sounds pretty scary.

Not clear what you did wrong from this amount of detail, but integer keys ought to be pretty obvious at the debugger level. Okay, to be more specific: printing 'DatumGetInt32(scankey->sk_argument)' in '_bt_compare' never shows me 50 when I execute this query: SELECT * FROM simpletest WHERE id = 50;

Um, what does it show you? DatumGetInt32 is a macro, and at least in gdb that won't work at all:

(gdb) p DatumGetInt32(scankey->sk_argument)
No symbol "DatumGetInt32" in current context.

However, just looking at the value produces sane answers for me:

(gdb) p *scankey
$1 = {sk_flags = 196608, sk_attno = 1, sk_strategy = 0, sk_subtype = 23, sk_collation = 0, sk_func = {fn_addr = 0x486ec0 btint4cmp, fn_oid = 351, fn_nargs = 2, fn_strict = 1 '\001', fn_retset = 0 '\000', fn_stats = 2 '\002', fn_extra = 0x0, fn_mcxt = 0x2b82fe8, fn_expr = 0x0}, sk_argument = 50}
(gdb) p scankey->sk_argument
$2 = 50

PG's b-trees do not hash anything.
If you're not seeing interpretable key values then you're doing something wrong in your inspection methodology. Okay, how are indexes on char/text columns handled then? The datum values will be pointers to strings. ... is my assumption correct, that in that case the Datum is only a link to where the actual data is stored and only 'scankey->sk_func' knows how to make use of it (compare it)? In that case it would be extremely hard to get to a numeric representation which can be used for the interpolation. The btree code is (or reasonably can be) aware that such values are pass-by-reference, and how to get to the bits. But the comparison semantics of two different values are not something it knows about except by asking the comparison function. This can be quite a nontrivial matter even for text, since we follow strcoll() comparison rules. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] isolation check takes a long time
Excerpts from Andrew Dunstan's message of dom jul 15 16:42:22 -0400 2012: I'm looking into that. But given that the default is to set max_prepared_transactions to 0, shouldn't we just remove that test from the normal installcheck schedule? We could provide an alternative schedule that does include it. That's a thought -- AFAIR we do provide a numeric_big test that's not exercised by the regular regress schedule, for a precedent. However, there's more work to do in isolation testing. It'd be good to have it being routinely run in serializable isolation level, for example, not just in read committed. I wouldn't want to overload the slowest machines in the buildfarm (some of which are already barely capable of running the tests on all branches in a 24h schedule, of which Stefan Kaltenbrunner is so proud), but if we could have a few of the fastest members running isolation and isolation-serializable, with max_prepared_transactions set to a nonzero value, that'd be great. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] several problems in pg_receivexlog
On Fri, Jul 13, 2012 at 1:15 AM, Magnus Hagander mag...@hagander.net wrote: On Thu, Jul 12, 2012 at 6:07 PM, Fujii Masao masao.fu...@gmail.com wrote: On Thu, Jul 12, 2012 at 8:39 PM, Magnus Hagander mag...@hagander.net wrote: On Tue, Jul 10, 2012 at 7:03 PM, Fujii Masao masao.fu...@gmail.com wrote: On Tue, Jul 10, 2012 at 3:23 AM, Fujii Masao masao.fu...@gmail.com wrote: Hi, I found several problems in pg_receivexlog, e.g., memory leaks, file-descriptor leaks, etc. The attached patch fixes these problems. ISTM there are still some other problems in pg_receivexlog, so I'll read it more deeply later. While the pg_basebackup background process is streaming WAL records, if its replication connection is terminated (e.g., walsender in the server is accidentally terminated by a SIGTERM signal), pg_basebackup ends up failing to include all required WAL files in the backup. The problem is that, in this case, pg_basebackup doesn't emit any error message at all. So a user might misunderstand that a base backup has been successfully taken even though it doesn't include all required WAL files. Ouch. That is definitely a bug if it behaves that way. To fix this problem, I think that, when the replication connection is terminated, ReceiveXlogStream() should check whether we've already reached the stop point by calling stream_stop() before returning TRUE. If we've not yet (this means that we've not received all required WAL files yet), ReceiveXlogStream() should return FALSE and pg_basebackup should emit an error message. Comments? Doesn't it already return false because it detects the error of the connection? What's the codepath where we end up returning true even though we had a connection failure? Shouldn't that end up under the "could not read copy data" branch, which already returns false? You're right. 
If the error is detected, that function always returns false and the error message is emitted (but I think that the current error message "pg_basebackup: child process exited with error 1" is confusing), so it's OK. But if walsender in the server is terminated by SIGTERM, no error is detected and the pg_basebackup background process gets out of the loop in ReceiveXlogStream() and returns true. Oh. Because the server does a graceful shutdown. D'uh, of course. Then yes, your suggested fix seems like a good one. Attached patch adds the fix. Also I found I had forgotten to set the file descriptor to -1 at the end of ReceiveXlogStream(), in my previously-committed patch. Attached patch fixes this problem. Regards, -- Fujii Masao pgreceivexlog_check_stoppoint_v1.patch Description: Binary data
Re: [HACKERS] CompactCheckpointerRequestQueue versus pad bytes
I wrote: I had thought that we might get a performance boost here by saving fsync queue traffic, but I see that md.c was already not calling register_dirty_segment for temp rels, so there's no joy there. Actually, wait a second. We were smart enough to not send fsync requests in the first place for temp rels. But we were not smart enough to not call ForgetRelationFsyncRequests when deleting a temp rel, which made for an entirely useless scan through the pending-fsyncs table. So there could be a win there, on top of not forwarding the actual unlink operation. regards, tom lane
Re: [HACKERS] pl/perl and utf-8 in sql_ascii databases
Excerpts from Kyotaro HORIGUCHI's message of mar jul 17 05:01:10 -0400 2012: I think that's probably too much engineering for something that doesn't really warrant it. A real solution to this problem could be to create yet another new test file containing just this function definition and the query that calls it, and have one expected file for each encoding; but that's too much work and too many files, I'm afraid. I agree completely. The balance between the additional complexity in the regression tests and what we would get from that complexity... I had to remove both that test and the one about the 0x80, because it wasn't working for me in either SQL_ASCII or Latin1, I forget which. I'm not sure I understand the reason for the failure -- I was getting a false result instead of true, which was unexpected. Maybe there's a trivial explanation for this ... or maybe it really is broken. In any case, maybe it'd be a good idea to have more tests related to encodings, if we can write them in some reasonable manner. But only in HEAD, I guess, because having to backpatch stuff and test every branch in at least three encodings is just too annoying. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
[HACKERS] Re: [COMMITTERS] pgsql: Split contrib documentation into extensions and programs
On ons, 2012-05-09 at 14:44 -0400, Alvaro Herrera wrote: Excerpts from Peter Eisentraut's message of mié may 09 13:54:53 -0400 2012: Split contrib documentation into extensions and programs Create separate appendixes for contrib extensions and other server plugins on the one hand, and utility programs on the other. Recast the documentation of the latter as refentries, so that man pages are generated. I noticed that the pgupgrade manpage doesn't have the titles of each step in Usage. Not sure this is a serious problem, but it's different from the HTML at any rate. Fixed.
Re: [HACKERS] Using pg_upgrade on log-shipping standby servers
On Tue, 2012-07-17 at 01:02 -0700, Daniel Farina wrote: Could pg_upgrade emit WAL segment(s) to provide continuity of a timeline? So something like: By segments did you mean records? * Take down the writable primary for pg_upgrade * Some WAL is emitted and possibly archived * The old version, when reaching the special pg_upgrade WAL, could exit or report its situation having paused replay (as clearly, it cannot proceed). Unsure. I don't really understand this step. * Start up a new version of postgres on the same cluster at that point, which plays the upgrade-WAL. I see this being pretty mechanically intensive, but right now my hands are completely tied as to achieving total continuity of my archives, costing a base-backup's worth of risk window upon upgrade. Does "continuity of archives" mean avoiding downtime, or maintaining a single WAL sequence? Regards, Jeff Davis
Re: [HACKERS] isolation check takes a long time
On Tue, Jul 17, 2012 at 01:56:19PM -0400, Alvaro Herrera wrote: Excerpts from Andrew Dunstan's message of dom jul 15 16:42:22 -0400 2012: I'm looking into that. But given that the default is to set max_prepared_transactions to 0, shouldn't we just remove that test from the normal installcheck schedule? That's a thought -- AFAIR we do provide a numeric_big test that's not exercised by the regular regress schedule, for a precedent. It would be nice to have a pattern for adding tests run less often than every commit but more often than whenever a human explicitly remembers the test and invokes it manually. Perhaps a schedule that the recommended buildfarm configuration would somehow run every two weeks and before each release (including betas and branch releases). However, there's more work to do in isolation testing. It'd be good to have it being routinely run in serializable isolation level, for example, not just in read committed. Except for the foreign key specs, isolation test specs request a specific isolation level when starting their transactions. Running such specs under different default_transaction_isolation settings primarily confirms that BEGIN TRANSACTION ISOLATION LEVEL x is indistinguishable from BEGIN under default_transaction_isolation = x. It might also discover transaction isolation sensitivity in the setup/cleanup steps, which often omit explicit transaction control. I don't think such verification justifies regularly running thousands of tests. The foreign key tests, however, would benefit from running under all three isolation levels. Let's control it per-spec instead of repeating the entire suite.
[HACKERS] New buildfarm client release
There is a new release of the PostgreSQL buildfarm client available at https://github.com/downloads/PGBuildFarm/client-code/build-farm-4_7.tgz Most of the changes in the release are minor bug fixes. Enhancements include:

* extra_config can now have a DEFAULT key, and these entries are prepended to any branch-specific entry. This means you can now set common extra config in one place instead of having to specify them for each branch. The sample entry now uses this as its key instead of HEAD, and the sample entry has a line for fsync = off, which can improve the speed of buildfarm runs.
* provide for include as well as exclude filters on file names for deciding if a run is needed. The old config setting of trigger_filter is still honored as if it were trigger_exclude, the new name.
* allow skipping install steps as well as other steps.

Enjoy! cheers andrew
Re: [HACKERS] Using pg_upgrade on log-shipping standby servers
On Mon, Jul 16, 2012 at 05:29:26PM -0700, Jeff Davis wrote: On Tue, 2012-07-10 at 11:50 -0400, Bruce Momjian wrote: I don't think we can assume that because pg_upgrade was run on the master and standby that they are binary identical, can we? Technically the user files are identical, but the system catalogs and WAL might be different, hence my suggestion to run rsync before allowing the standby to rejoin the primary. Do you have plans to change that in the future? If we know that the user data files are identical between primary and replica, it would be nice if we could provide a robust way to avoid copying them. Well, rsync --checksum would work, but both systems have to be down for that. You could snapshot the down primary and rsync --checksum that against the standby, but I am not sure how much that helps us. I can't figure out how to make this work better without adding a whole lot more code to pg_upgrade that might need adjustment for every minor release, i.e. pg_upgrade knows nothing about the WAL file format, and I want to keep it that way. However, I have two ideas. First, I don't know _why_ the primary/standby would be any different after pg_upgrade, so I added the documentation mention because I couldn't _guarantee_ they were the same. Actually, if people can test this, we might be able to say this is safe. Second, the user files (large) are certainly identical; it is only the system tables (small) that _might_ be different, so rsync'ing just those would add the guarantee, but I know of no easy way to rsync just the system tables. Does that help? -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
[HACKERS] Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
Robert Haas robertmh...@gmail.com writes: On Mon, Jul 16, 2012 at 3:18 PM, Tom Lane t...@sss.pgh.pa.us wrote: BTW, while we are on the subject: hasn't this split completely broken the statistics about backend-initiated writes? Yes, it seems to have done just that. So I went to fix this in the obvious way (attached), but while testing it I found that the number of buffers_backend events reported during a regression test run barely changed; which surprised the heck out of me, so I dug deeper. The cause turns out to be extremely scary: ForwardFsyncRequest isn't getting called at all in the bgwriter process, because the bgwriter process has a pendingOpsTable. So it just queues its fsync requests locally, and then never acts on them, since it never runs any checkpoints anymore. This implies that nobody has done pull-the-plug testing on either HEAD or 9.2 since the checkpointer split went in (2011-11-01), because even a modicum of such testing would surely have shown that we're failing to fsync a significant fraction of our write traffic. Furthermore, I would say that any performance testing done since then, if it wasn't looking at purely read-only scenarios, isn't worth the electrons it's written on. In particular, any performance gain that anybody might have attributed to the checkpointer splitup is very probably hogwash. This is not giving me a warm feeling about our testing practices. As far as fixing the bug is concerned, the reason for the foulup is that mdinit() looks to IsBootstrapProcessingMode() to decide whether to create a pendingOpsTable. That probably was all right when it was coded, but what it means today is that *any* process started via AuxiliaryProcessMain will have one; thus not only do bgwriters have one, but so do walwriter and walreceiver processes; which might not represent a bug today but it's pretty scary anyway. I think we need to fix that so it's more directly dependent on the auxiliary process type. 
We can't use flags set by the respective FooMain() functions, such as am_bg_writer, because mdinit is called from BaseInit() which happens before reaching those functions. My suggestion is that bootstrap.c ought to make the process's AuxProcType value available and then mdinit should consult that to decide what to do. (Having done that, we might consider getting rid of the retail process-type flags am_bg_writer etc.) regards, tom lane

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 5f93fccbfab1bbb8306f5de4ad228f3cb48b0862..41a7b2be4f680db08556d948eaaa002ed50119c5 100644
*** a/src/backend/postmaster/bgwriter.c
--- b/src/backend/postmaster/bgwriter.c
*** BackgroundWriterMain(void)
*** 341,346 ****
--- 341,357 ----
  }
  
+ /*
+  * IsBackgroundWriterProcess
+  *		Return true if running in background writer process.
+  */
+ bool
+ IsBackgroundWriterProcess(void)
+ {
+ 	return am_bg_writer;
+ }
+ 
  /*
   * signal handler routines
   *
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 92fd4276cd1b3be81d1ac741f9f6ea09d241ea52..bd1db4811d661d75e616bc77157c4a9e9b7e92fc 100644
*** a/src/backend/postmaster/checkpointer.c
--- b/src/backend/postmaster/checkpointer.c
*** bool
*** 1124,1129 ****
--- 1124,1130 ----
  ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
  {
  	CheckpointerRequest *request;
+ 	bool		am_bg_writer;
  	bool		too_full;
  
  	if (!IsUnderPostmaster)
*** ForwardFsyncRequest(RelFileNode rnode, F
*** 1131,1141 ****
  	if (am_checkpointer)
  		elog(ERROR, "ForwardFsyncRequest must not be called in checkpointer");
  
  	LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
  
  	/* Count all backend writes regardless of if they fit in the queue */
! 	CheckpointerShmem->num_backend_writes++;
  
  	/*
  	 * If the checkpointer isn't running or the request queue is full, the
--- 1132,1144 ----
  	if (am_checkpointer)
  		elog(ERROR, "ForwardFsyncRequest must not be called in checkpointer");
  
+ 	am_bg_writer = IsBackgroundWriterProcess();
+ 
  	LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
  
  	/* Count all backend writes regardless of if they fit in the queue */
! 	if (!am_bg_writer)
! 		CheckpointerShmem->num_backend_writes++;
  
  	/*
  	 * If the checkpointer isn't running or the request queue is full, the
*** ForwardFsyncRequest(RelFileNode rnode, F
*** 1150,1156 ****
  	 * Count the subset of writes where backends have to do their own
  	 * fsync
  	 */
! 	CheckpointerShmem->num_backend_fsync++;
  	LWLockRelease(CheckpointerCommLock);
  	return false;
  }
--- 1153,1160 ----
  	 * Count the subset of writes where backends have to do their own
  	 * fsync
  	 */
! 	if (!am_bg_writer)
! 		CheckpointerShmem->num_backend_fsync++;
  	LWLockRelease(CheckpointerCommLock);
  	return false;
  }
diff
Re: [HACKERS] Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
On 17 July 2012 23:56, Tom Lane t...@sss.pgh.pa.us wrote: This implies that nobody has done pull-the-plug testing on either HEAD or 9.2 since the checkpointer split went in (2011-11-01), because even a modicum of such testing would surely have shown that we're failing to fsync a significant fraction of our write traffic. Furthermore, I would say that any performance testing done since then, if it wasn't looking at purely read-only scenarios, isn't worth the electrons it's written on. In particular, any performance gain that anybody might have attributed to the checkpointer splitup is very probably hogwash. This is not giving me a warm feeling about our testing practices. The checkpointer split-up was not justified as a performance optimisation so much as a re-factoring effort that might have some concomitant performance benefits. While I agree that it is regrettable that this was allowed to go undetected for so long, I do not find it especially surprising that some performance testing results post-split didn't strike somebody as fool's gold. Much of the theory surrounding checkpoint tuning, if followed, results in relatively little work being done during the sync phase of a checkpoint, especially if an I/O scheduler like deadline is used. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] Using pg_upgrade on log-shipping standby servers
On Tue, Jul 17, 2012 at 11:55 AM, Jeff Davis pg...@j-davis.com wrote: On Tue, 2012-07-17 at 01:02 -0700, Daniel Farina wrote: Could pg_upgrade emit WAL segment(s) to provide continuity of a timeline? So something like: By segments did you mean records? Yes. It would be nicer not to have to tie it to the WAL segment file size. * Take down the writable primary for pg_upgrade * Some WAL is emitted and possibly archived * The old version, when reaching the special pg_upgrade WAL, could exit or report its situation having paused replay (as clearly, it cannot proceed). Unsure. I don't really understand this step. Some WAL is emitted and possibly archived needs a subject in that fragment: pg_upgrade somehow (directly, or indirectly) emits and/or archives WAL used to complete binary-upgrade. That means that it should appear in the WAL stream and be subject to archive_command, like any other WAL. The sticky part is what the standby should do when it encounters the special wal-upgrade records. It should probably pause replay to allow some other program to stop the old postgres version and start the new version with the same cluster. * Start up a new version of postgres on the same cluster at that point, which plays the upgrade-WAL. I see this being pretty mechanically intensive, but right now my hands are completely tied as to achieving total continuity of my archives, costing a base-backup's worth of risk window upon upgrade. Does continuity of archives mean avoid downtime or maintain a single WAL sequence. The latter. -- fdr
[HACKERS] Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
On 07/18/2012 06:56 AM, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: On Mon, Jul 16, 2012 at 3:18 PM, Tom Lane t...@sss.pgh.pa.us wrote: BTW, while we are on the subject: hasn't this split completely broken the statistics about backend-initiated writes? Yes, it seems to have done just that. So I went to fix this in the obvious way (attached), but while testing it I found that the number of buffers_backend events reported during a regression test run barely changed; which surprised the heck out of me, so I dug deeper. The cause turns out to be extremely scary: ForwardFsyncRequest isn't getting called at all in the bgwriter process, because the bgwriter process has a pendingOpsTable. So it just queues its fsync requests locally, and then never acts on them, since it never runs any checkpoints anymore. This implies that nobody has done pull-the-plug testing on either HEAD or 9.2 since the checkpointer split went in (2011-11-01) That makes me wonder if on top of the buildfarm, extending some buildfarm machines into a crashfarm is needed:

- Keep kvm instances with copy-on-write snapshot disks and the build env on them
- Fire up the VM, do a build, and start the server
- From outside the vm have the test controller connect to the server and start a test run
- Hard-kill the OS instance at a random point in time.
- Start the OS instance back up
- Start Pg back up and connect to it again
- From the test controller, test the Pg install for possible corruption by reading the indexes and tables, doing some test UPDATEs, etc.

The main challenge would be coming up with suitable tests to run, ones that could then be checked to make sure nothing was broken. The test controller would know how far a test got before the OS got killed and would know which test it was running, so it'd be able to check for expected data if provided with appropriate test metadata. Use of enable_ flags should permit scans of indexes and table heaps to be forced. What else should be checked? 
The main thing that comes to mind for me is something I've worried about for a while: that Pg might not always handle out-of-disk-space anywhere near as gracefully as it's often claimed to. There's no automated testing for that, so it's hard to really know. A harnessed VM could be used to test that. Instead of virtual plug pull tests it could generate a virtual disk of constrained random size, run its tests until out-of-disk caused failure, stop Pg, expand the disk, restart Pg, and run its checks. Variants where WAL was on a separate disk and only WAL or only the main non-WAL disk run out of space would also make sense and be easy to produce with such a harness. I've written some automated kvm test harnesses, so I could have a play with this idea. I would probably need some help with the test design, though, and the guest OS would be Linux, Linux, or Linux at least to start with. Opinions? -- Craig Ringer
Re: [HACKERS] Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
Craig Ringer ring...@ringerc.id.au writes: On 07/18/2012 06:56 AM, Tom Lane wrote: This implies that nobody has done pull-the-plug testing on either HEAD or 9.2 since the checkpointer split went in (2011-11-01) That makes me wonder if on top of the buildfarm, extending some buildfarm machines into a crashfarm is needed: Not sure if we need a whole farm, but certainly having at least one machine testing this sort of stuff on a regular basis would make me feel a lot better. The main challenge would be coming up with suitable tests to run, ones that could then be checked to make sure nothing was broken. One fairly simple test scenario could go like this:

* run the regression tests
* pg_dump the regression database
* run the regression tests again
* hard-kill immediately upon completion
* restart database, allow it to perform recovery
* pg_dump the regression database
* diff previous and new dumps; should be the same

The main thing this wouldn't cover is discrepancies in user indexes, since pg_dump doesn't do anything that's likely to result in indexscans on user tables. It ought to be enough to detect the sort of system-wide problem we're talking about here, though. In general I think the hard part is automated reproduction of an OS-crash scenario, but your ideas about how to do that sound promising. Once we have that going, it shouldn't be hard to come up with tests of the form do X, hard-crash, recover, check X still looks sane. What else should be checked? The main thing that comes to mind for me is something I've worried about for a while: that Pg might not always handle out-of-disk-space anywhere near as gracefully as it's often claimed to. +1 regards, tom lane
[HACKERS] During Xlog replaying, is there maybe emitted xlog?
Hi all, I reviewed the source code, and saw the following calling path:

StartupXLOG()
  StartupDatabase()
    RmgrTable[rmid].rm_cleanup()
      btree_xlog_cleanup()
        _bt_insert_parent()
          _bt_insertonpg()
            XLogInsert()

As we can see, during xlog replaying, XLog may be emitted. So are there some *ISSUES* in the above calling stack? thanks. Date: 2012-07-18 xu2002261
Re: [HACKERS] During Xlog replaying, is there maybe emitted xlog?
xu2002261 xu2002...@163.com writes: Hi all, I reviewed the source code, and saw the following calling path: StartupXLOG() StartupDatabase() RmgrTable[rmid].rm_cleanup() btree_xlog_cleanup() _bt_insert_parent _bt_insertonpg() XLogInsert() As we can see, during xlog replaying, XLog may be emitted. So whether there are some *ISSUE* in above calling stack? No, it's entirely correct. That path isn't during replay, it's upon completion of replay, where we're cleaning up anything that failed to be completed before the crash. Emitting more WAL is allowed then. Note the comment a few lines above the rm_cleanup calls: /* * Resource managers might need to write WAL records, eg, to record * index cleanup actions. So temporarily enable XLogInsertAllowed in * this process only. */ regards, tom lane
Re: [HACKERS] During Xlog replaying, is there maybe emitted xlog?
Thanks a lot. Oops, indeed, the cleanup stage is not in the XLog replay, so there is no problem. 2012-07-18 xu2002261 From: Tom Lane Sent: 2012-07-18 10:05:26 To: xu2002261 Cc: pgsql-hackers Subject: Re: [HACKERS] During Xlog replaying, is there maybe emitted xlog? xu2002261 xu2002...@163.com writes: Hi all, I reviewed the source code, and saw the following calling path: StartupXLOG() StartupDatabase() RmgrTable[rmid].rm_cleanup() btree_xlog_cleanup() _bt_insert_parent _bt_insertonpg() XLogInsert() As we can see, during xlog replaying, XLog may be emitted. So whether there are some *ISSUE* in above calling stack? No, it's entirely correct. That path isn't during replay, it's upon completion of replay, where we're cleaning up anything that failed to be completed before the crash. Emitting more WAL is allowed then. Note the comment a few lines above the rm_cleanup calls: /* * Resource managers might need to write WAL records, eg, to record * index cleanup actions. So temporarily enable XLogInsertAllowed in * this process only. */ regards, tom lane
Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation
On 07/16/2012 02:39 PM, Robert Haas wrote: Unfortunately, there are lots of important operations (like bulk loading, SELECT * FROM bigtable, and VACUUM notverybigtable) that inevitably end up writing out their own dirty buffers. And even when the background writer does write something, it's not always clear that this is a positive thing. Here's Greg Smith commenting on the more-is-worse phenomenon: http://archives.postgresql.org/pgsql-hackers/2012-02/msg00564.php You can add crash recovery to the list of things where the interaction with the OS write cache matters a lot too, something I just took a beating and learned from recently. Since the recovery process is essentially one giant unified backend, how effectively the background writer and/or checkpointer move writes from recovery to themselves is really important. It's a bit easier to characterize than a complicated mixed set of clients, which has given me a couple of ideas to chase down. What I've been doing for much of the last month (instead of my original plan of reviewing patches) is moving toward the bottom of characterizing that under high pressure. It provides an even easier way to compare multiple write strategies at the OS level than regular pgbench-like benchmarks. Recovery playback with a different tuning becomes as simple as rolling back to a simple base backup and replaying all the WAL, possibly including some number of bulk operations that showed up. You can measure that speed instead of transaction-level throughput. I'm seeing the same ~100% difference in performance between various Linux tunings on recovery as I was getting on VACUUM tests, and it's a whole lot easier to setup and (ahem) replicate the results. I'm putting together a playback time benchmark based on this observation. The fact that I have servers all over the place now with 64GB worth of RAM has turned the topic of how much dirty memory should be used for write caching into a hot item for me again in general too. 
If I live through 9.3 development, I expect to have a lot more ideas about how to deal with this whole area play out in the upcoming months. I could really use a cool day to sit outside thinking about it right now.

> Jeff Janes and I came up with what I believe to be a plausible explanation for the problem: http://archives.postgresql.org/pgsql-hackers/2012-03/msg00356.php I kinda think we ought to be looking at fixing that for 9.2, and perhaps even back-patching further, but nobody else seemed terribly excited about it.

FYI, I never rejected any of that thinking, I just haven't chewed on what you two were proposing. If that's still something you think should be revisited for 9.2, I'll take a longer look at it. My feeling on this so far has really been that the write-blocking issues are much larger than the exact logic used by the background writer in the code you were highlighting, which I always saw as more active/important during idle periods. This whole area needs a complete overhaul during 9.3 though, especially since there are plenty of people who want to fit checksum writes into that path too.

-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
On 07/17/2012 06:56 PM, Tom Lane wrote:
> So I went to fix this in the obvious way (attached), but while testing it I found that the number of buffers_backend events reported during a regression test run barely changed; which surprised the heck out of me, so I dug deeper. The cause turns out to be extremely scary: ForwardFsyncRequest isn't getting called at all in the bgwriter process, because the bgwriter process has a pendingOpsTable.

When I did my testing early this year to look at checkpointer performance (among other 9.2 write changes like group commit), I did see some cases where buffers_backend was dramatically different on 9.2 vs. 9.1. There were plenty of cases where the totals across a 10-minute pgbench run were almost identical though, so this issue didn't stick out then. That's a very different workload than the regression tests though.

> This implies that nobody has done pull-the-plug testing on either HEAD or 9.2 since the checkpointer split went in (2011-11-01), because even a modicum of such testing would surely have shown that we're failing to fsync a significant fraction of our write traffic.

Ugh. Most of my pull-the-plug testing over the last six months has been focused on SSD tests with older versions. I want to duplicate this (and any potential fix) now that you've highlighted it.

> Furthermore, I would say that any performance testing done since then, if it wasn't looking at purely read-only scenarios, isn't worth the electrons it's written on. In particular, any performance gain that anybody might have attributed to the checkpointer splitup is very probably hogwash.

There hasn't been any performance testing that suggested the checkpointer splitup was justified. The stuff I did showed it being flat-out negative for a subset of pgbench-oriented cases, which didn't seem real-world enough to disprove it as the right thing to do though.
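The failure mode Tom describes can be illustrated with a toy model (pure Python for illustration, not PostgreSQL source): a process that inherits its own pendingOpsTable absorbs fsync requests locally instead of forwarding them, so the checkpointer never fsyncs those files:

```python
# Hypothetical, simplified model of the bug: ForwardFsyncRequest checks
# whether the calling process has its own pendingOpsTable, and if so
# queues the request locally rather than sending it to the checkpointer.

class Process:
    def __init__(self, name, has_pending_ops_table):
        self.name = name
        self.pending_ops = set() if has_pending_ops_table else None

checkpointer = Process("checkpointer", has_pending_ops_table=True)

def forward_fsync_request(proc, relfile):
    if proc.pending_ops is not None:
        # The buggy path: the bgwriter inherited a pendingOpsTable, so it
        # remembers the request itself -- but unlike the checkpointer it
        # never processes that table, so the fsync is silently lost.
        proc.pending_ops.add(relfile)
    else:
        checkpointer.pending_ops.add(relfile)

bgwriter = Process("bgwriter", has_pending_ops_table=True)   # the bug
backend = Process("backend", has_pending_ops_table=False)

forward_fsync_request(backend, "base/16384/16385")
forward_fsync_request(bgwriter, "base/16384/16386")

# The checkpointer only ever fsyncs what is in *its* table, so the
# bgwriter's write never reaches disk durably before a crash.
```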
I thought there were two valid justifications for the checkpointer split (which is not a feature I have any corporate attachment to--I'm as isolated from how it was developed as you are).

The first is that it seems like the right architecture to allow reworking checkpoints and background writes for future write-path optimization. A good chunk of the time when I've tried to improve one of those (like my spread sync stuff from last year), the code was complicated by the background writer needing to follow the drum of checkpoint timing, and vice-versa. Being able to hack on those independently got a sigh of relief from me. And while this adds some code duplication in things like the process setup, I thought the result would be cleaner for people reading the code to follow too. This problem is terrible, but I think part of how it crept in is that the single checkpoint+background writer process was doing way too many things to even follow all of them some days.

The second justification for the split was that it seems easier to get a low-power result from, which I believe was the angle Peter Geoghegan was working when this popped up originally. The checkpointer has to run sometimes, but only at a 50% duty cycle as it's tuned out of the box. It seems nice to be able to approach that in a way that's power efficient without coupling it to whatever heartbeat the BGW is running at. I could even see people changing the frequencies for each independently depending on expected system load. Tune for lower power when you don't expect many users, that sort of thing.

-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] Using pg_upgrade on log-shipping standby servers
On Tue, Jul 17, 2012 at 04:49:39PM -0700, Daniel Farina wrote:
> On Tue, Jul 17, 2012 at 11:55 AM, Jeff Davis pg...@j-davis.com wrote:
> > On Tue, 2012-07-17 at 01:02 -0700, Daniel Farina wrote:
> > > Could pg_upgrade emit WAL segment(s) to provide continuity of a timeline? So something like:
> > By segments did you mean records?
>
> Yes. It would be nicer not to have to tie it to the WAL segment file size.
>
> > > * Take down the writable primary for pg_upgrade
> > > * Some WAL is emitted and possibly archived
> > > * The old version, when reaching the special pg_upgrade WAL, could exit or report its situation having paused replay (as clearly, it cannot proceed).
> > Unsure. I don't really understand this step.
>
> "Some WAL is emitted and possibly archived" needs a subject in that fragment: pg_upgrade somehow (directly, or indirectly) emits and/or archives WAL used to complete the binary upgrade. That means that it should appear in the WAL stream and be subject to archive_command, like any other WAL. The sticky part is what the standby should do when it encounters the special wal-upgrade records. It should probably pause replay to allow some other program to stop the old postgres version and start the new version with the same cluster.

WAL is not guaranteed to be the same between PG major versions, so doing anything with WAL is pretty much a no-go.

-- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
On 07/18/2012 12:00 PM, Greg Smith wrote:
> The second justification for the split was that it seems easier to get a low power result from, which I believe was the angle Peter Geoghegan was working when this popped up originally. The checkpointer has to run sometimes, but only at a 50% duty cycle as it's tuned out of the box. It seems nice to be able to approach that in a way that's power efficient without coupling it to whatever heartbeat the BGW is running at. I could even see people changing the frequencies for each independently depending on expected system load. Tune for lower power when you don't expect many users, that sort of thing.

Yeah - I'm already seeing benefits from that on my laptop, with much less need to stop Pg when I'm not using it.

-- Craig Ringer
Re: [HACKERS] Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
On 07/18/2012 08:31 AM, Tom Lane wrote:
> Not sure if we need a whole farm, but certainly having at least one machine testing this sort of stuff on a regular basis would make me feel a lot better.

OK. That's something I can actually be useful for.

My current qemu/kvm test harness control code is in Python, since that's what all the other tooling for the project I was using it for is in. Is it likely to be useful for me to adapt that code for use as a Pg crash-test harness, or will you need a particular tool/language to be used? If so, which/what? I'll do pretty much anything except Perl. I'll have a result for you more quickly working in Python, though I'm happy enough to write it in C (or Java, but I'm guessing that won't get any enthusiasm around here).

One fairly simple test scenario could go like this:

* run the regression tests
* pg_dump the regression database
* run the regression tests again
* hard-kill immediately upon completion
* restart database, allow it to perform recovery
* pg_dump the regression database
* diff previous and new dumps; should be the same

The main thing this wouldn't cover is discrepancies in user indexes, since pg_dump doesn't do anything that's likely to result in index scans on user tables. It ought to be enough to detect the sort of system-wide problem we're talking about here, though. It also won't detect issues that only occur at certain points in execution, under concurrent load, etc. Still, it's a start, and I could look at extending it into some kind of crash fuzzing once the basics were working.

> In general I think the hard part is automated reproduction of an OS-crash scenario, but your ideas about how to do that sound promising.

It's worked well for other testing I've done. Any writes that are still in the guest OS's memory, write queues, etc. are lost when kvm is killed, just like in a hard crash.
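The seven-step scenario above could be sketched along these lines (the commands, file names, and the `dumps_match` helper are illustrative assumptions, not the harness Craig is proposing to write):

```python
# Sketch of the crash-test scenario, assuming a harness that can
# hard-kill the server (or the kvm guest it runs in).
import subprocess

def dump(dbname, outfile):
    subprocess.check_call(["pg_dump", "-f", outfile, dbname])

def hard_kill(datadir):
    # Simulate a crash: SIGKILL the postmaster so nothing gets a chance
    # to flush. (With kvm, kill the qemu process instead -- that also
    # discards anything the guest OS had not flushed out to the host.)
    pid = int(open(datadir + "/postmaster.pid").readline())
    subprocess.check_call(["kill", "-9", str(pid)])

def dumps_match(path_a, path_b):
    # Compare dumps, ignoring comment lines, which can differ harmlessly.
    def lines(p):
        return [l for l in open(p) if not l.startswith("--")]
    return lines(path_a) == lines(path_b)

def crash_test(datadir, dbname="regression"):
    subprocess.check_call(["make", "installcheck"])   # regression tests
    dump(dbname, "before.sql")
    subprocess.check_call(["make", "installcheck"])   # dirty it again
    hard_kill(datadir)
    subprocess.check_call(["pg_ctl", "-D", datadir, "-w", "start"])  # recovery
    dump(dbname, "after.sql")
    return dumps_match("before.sql", "after.sql")     # should be True
```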
Anything the kvm guest has flushed to disk is on the host and preserved - either on the host's disks (cache=writethrough) or at least in dirty writeback buffers in RAM (cache=writeback). kvm can even do a decent job of simulating a BBU-equipped write-through volume by allowing the host OS to do write-back caching of KVM's backing device/files. You don't get to set a max write-back cache size directly, but Linux I/O writeback settings provide some control.

My favourite thing about kvm is that it's just another command. It can be run headless and controlled via a virtual serial console and/or its monitor socket. It doesn't require special privileges and can operate on ordinary files. It's very well suited for hooking into test harnesses.

The only challenge with using kvm/qemu is that there have been some breaking changes and a couple of annoying bugs that mean I won't be able to support anything except pretty much the latest versions initially. kvm is easy to compile and has limited dependencies, so I don't expect that to be an issue, but I thought it was worth raising.

-- Craig Ringer
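Driving kvm "as just another command" from a harness might look roughly like this (binary name, options, and image path are assumptions based on stock qemu usage, not Craig's actual code):

```python
# Sketch: start a headless kvm guest over a disk image, then SIGKILL the
# qemu process to simulate pulling the plug. cache=writethrough models a
# disk whose acknowledged writes survive the "power loss"; everything
# still in the guest's RAM and write queues is gone.
import signal, subprocess

def qemu_cmd(image, cache="writethrough", mem_mb=1024):
    return [
        "qemu-system-x86_64", "-enable-kvm",
        "-m", str(mem_mb),
        "-display", "none",                 # headless
        "-serial", "stdio",                 # guest console on stdio
        "-drive", "file=%s,if=virtio,cache=%s" % (image, cache),
    ]

def pull_the_plug(guest):
    # Hard-kill qemu: no shutdown, no flush, just like a power failure.
    guest.send_signal(signal.SIGKILL)
    guest.wait()

# guest = subprocess.Popen(qemu_cmd("pg-test.qcow2"))
# ... run the workload over ssh or the serial console, then:
# pull_the_plug(guest)
```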
Re: [HACKERS] Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
Greg Smith g...@2ndquadrant.com writes:
> On 07/17/2012 06:56 PM, Tom Lane wrote:
> > Furthermore, I would say that any performance testing done since then, if it wasn't looking at purely read-only scenarios, isn't worth the electrons it's written on. In particular, any performance gain that anybody might have attributed to the checkpointer splitup is very probably hogwash.
> There hasn't been any performance testing that suggested the checkpointer splitup was justified. The stuff I did showed it being flat out negative for a subset of pgbench-oriented cases, which didn't seem real-world enough to disprove it as the right thing to do though.

Just to clarify, I'm not saying that this means we should revert the checkpointer split. What I *am* worried about is that we may have been hacking other things on the basis of faulty performance tests.

regards, tom lane