Re: [HACKERS] Window Functions: v07 APIs and buffering strategies

2008/10/28 Hitoshi Harada [EMAIL PROTECTED]:
> Thanks for your testing, all!
>
> 2008/10/28 ITAGAKI Takahiro [EMAIL PROTECTED]:
>> Hitoshi Harada [EMAIL PROTECTED] wrote:
>>
>>     select relid, AVG(seq_Scan) OVER (ORDER BY relid)
>>     FROM pg_stat_user_tables GROUP BY relid, seq_scan;
>>
>> crashes with segfault.
>
> I reproduced it on Linux in my environment, building without
> debug/cassert. It could be a problem around there.

And I fixed this problem, confirming with/without debug/cassert/gcc -O,
and pushed it to git. If you want the delta patch, please see

http://git.postgresql.org/?p=~davidfetter/window_functions/.git;a=commitdiff;h=fbf19bfd0c8d2ac083b775f4cc724ec66e74fa8f

Thanks for all of your feedback.

Regards,

--
Hitoshi Harada

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Proposal of PITR performance improvement for 8.4.

Hi,

This is my first proposal of a PITR performance improvement for
PostgreSQL 8.4 development. The proposal includes a readahead mechanism
for data pages which will be read by redo() routines during recovery.
This is especially effective in recovery without full page writes.
Readahead is done by posix_fadvise(), as proposed in the index scan
improvement. Details of the implementation will be found in the README
file in the attached material.

--
Koichi Suzuki

pg_readahead_20081028.tar.gz
Description: GNU Zip compressed data
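[Editor's sketch] The attached tool itself is not shown here, but the core idea (hint the kernel about blocks that redo routines will soon read) can be sketched roughly as below. The function name, the fixed block size, and the caller-supplied block list are made up for illustration; this is not the actual pg_readahead code.

```c
/* Minimal sketch of WAL-driven readahead via posix_fadvise().
 * The block list here stands in for page references decoded from
 * WAL records; names and layout are illustrative only. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>

#define BLCKSZ 8192

/* Ask the kernel to start fetching the given blocks of an open file.
 * Returns 0 on success, or the first posix_fadvise() error code. */
int advise_blocks(int fd, const unsigned *blknos, int nblocks)
{
    for (int i = 0; i < nblocks; i++)
    {
        int rc = posix_fadvise(fd,
                               (off_t) blknos[i] * BLCKSZ,
                               BLCKSZ,
                               POSIX_FADV_WILLNEED);
        if (rc != 0)
            return rc;          /* rc is an errno value, not -1 */
    }
    return 0;
}
```

The actual page reads then happen later in the redo routines; by the time they do, the kernel has ideally already scheduled the I/O across all spindles in the array.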
Re: [HACKERS] Window Functions: v07 APIs and buffering strategies

Hitoshi Harada [EMAIL PROTECTED] wrote:
> And I fixed this problem, confirming with/without debug/cassert/gcc -O,
> and pushed it to git. If you want the delta patch, please see
> http://git.postgresql.org/?p=~davidfetter/window_functions/.git;a=commitdiff;h=fbf19bfd0c8d2ac083b775f4cc724ec66e74fa8f

Now it passes all regression tests! There is still one trivial warning:

    parse_func.c: In function `ParseFuncOrColumn':
    parse_func.c:77: warning: 'retval' might be used uninitialized in this function

It looks completely safe, but I suggest initializing the variable just
to keep the compiler quiet.

[src/backend/parser/parse_func.c] ParseFuncOrColumn()

      else
    + {
          elog(ERROR, "unknown function status");
    +     retval = NULL;      /* keep compiler quiet */
    + }

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
[HACKERS] VACUUMs and WAL

Looking at a VACUUM's WAL records makes me think twice about the way we
issue a VACUUM.

1. First we scan the heap, issuing a HEAP2 clean record for every block
   that needs cleaning.
2. Then we scan the index, issuing WAL records as appropriate.
3. Then we rescan the heap, issuing a HEAP2 clean record for every block.

I don't see a reason why we would issue 2 WAL records per block for a
VACUUM, nor why we would prune and remove in two steps, dirtying the
block each time. Seems like we could write approximately half the
amount of data that we do.

Surely we can come up with a better plan than that one?

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] Window Functions: v07 APIs and buffering strategies

2008/10/28 ITAGAKI Takahiro [EMAIL PROTECTED]:
> Now it passes all regression tests! There is still one trivial warning:
>
>     parse_func.c: In function `ParseFuncOrColumn':
>     parse_func.c:77: warning: 'retval' might be used uninitialized in this function
>
> It looks completely safe, but I suggest initializing the variable just
> to keep the compiler quiet.
>
> [src/backend/parser/parse_func.c] ParseFuncOrColumn()
>
>       else
>     + {
>           elog(ERROR, "unknown function status");
>     +     retval = NULL;      /* keep compiler quiet */
>     + }

Agreed. I've just noticed it as well, and I think it would be much more
like the original code if the last "else if" clause were changed to a
plain "else". Anyway, thanks.

Regards,

--
Hitoshi Harada
Re: [HACKERS] VACUUMs and WAL
On Tue, 2008-10-28 at 08:49 +, Simon Riggs wrote: Looking at a VACUUM's WAL records makes me think twice about the way we issue a VACUUM. 1. First we scan the heap, issuing a HEAP2 clean record for every block that needs cleaning. IIRC the first heap pass just collects info and does nothing else. Is this just an empty/do-nothing WAL record ? 2. Then we scan the index, issuing WAL records as appropriate. 3. Then we rescan the heap, issuing a HEAP2 clean record for every block. I don't see a reason why we would issue 2 WAL records per block for a VACUUM, nor why we would prune and remove in two steps, dirtying the block each time. The first pass should just be collecting info and not dirtying anything. Could it be side effect of setting some transaction visibility bits on first visit ? In that case It would be good, if we could disable doing that that for vacuum. Seems like we could write approximately half the amount of data that we do. Surely we can come up with a better plan than that one? --- Hannu Krosing http://www.2ndQuadrant.com PostgreSQL Scalability Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] VACUUMs and WAL

On Tue, 2008-10-28 at 11:45 +0200, Hannu Krosing wrote:
> On Tue, 2008-10-28 at 08:49 +0000, Simon Riggs wrote:
>> Looking at a VACUUM's WAL records makes me think twice about the way
>> we issue a VACUUM.
>>
>> 1. First we scan the heap, issuing a HEAP2 clean record for every
>>    block that needs cleaning.
>
> IIRC the first heap pass just collects info and does nothing else. Is
> this just an empty/do-nothing WAL record?

8.3 changed that; it used to work that way. I guess I never looked at
the amount of WAL being generated.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] WIP patch: convert SQL-language functions to return tuplestores

Tom Lane [EMAIL PROTECTED] writes:
> Robert Haas [EMAIL PROTECTED] writes:
>> I'm pretty excited by that example. LIMIT/OFFSET is really useful as
>> a way of paginating query results for display on a web page (show
>> results 1-100, 101-200, etc), and I use it on potentially expensive
>> SRFs just as I do on tables and views.
>
> I suspect it doesn't help you as much as you think. It's always been
> the case that SRFs in FROM-items are fed through a tuplestore, and so
> are plpgsql SRF results.

I always thought we considered that a bug, though. It sure would be nice
if we could generate results as needed instead of having to generate
them in advance and store all of them.

In particular, I fear there are a lot of places that use functions where
we might expect them to use views. They're never going to get really
good plans, but it would be nice if we could at least avoid the extra
materialize steps.

Now your patch isn't affecting that one way or the other, but does it
rule it out forever?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!
Re: [HACKERS] WIP patch: convert SQL-language functions to return tuplestores

On Sun, 2008-10-26 at 21:49 -0400, Tom Lane wrote:
> So I'm concluding that we can easily afford to switch to
> tuplestore-always operation, especially if we are willing to put any
> effort into tuplestore optimization. (I note that the current
> tuplestore code writes 24 bytes per row for this example, which is a
> shade on the high side for only 4 bytes payload. It looks like it
> would be pretty easy to knock 10 bytes off that for a 40% savings in
> I/O volume.)

That seems like an important, possibly more important, change.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] Proposal of PITR performance improvement for 8.4.

Koichi Suzuki [EMAIL PROTECTED] writes:
> This is my first proposal of a PITR performance improvement for
> PostgreSQL 8.4 development. The proposal includes a readahead
> mechanism for data pages which will be read by redo() routines during
> recovery. This is especially effective in recovery without full page
> writes. Readahead is done by posix_fadvise(), as proposed in the index
> scan improvement.

Incidentally, a bit of background for anyone who wasn't around when this
last came up: prefetching matters especially for our recovery code
because it's single-threaded. If you have a RAID array, you're
effectively limited to using a single drive at a time. This is a major
problem because the logs could have been written by many processes
hammering the RAID array concurrently. In other words, your warm standby
database might not be able to keep up with the logs from the master
database even on identical (or even better) hardware.

Simon (I think?) proposed allowing our recovery code to be
multi-threaded. Heikki suggested using prefetching.

> Details of the implementation will be found in the README file in the
> material.

I've read through this, and I think I disagree with the idea of using a
separate program. It's a lot of extra code -- and duplicated code from
the normal recovery too. I recognise that it's awkward to handle during
recovery, since you would have to rework the WAL reading logic quite a
bit. But it might be worth doing anyway, even for non-RAID situations.
Even if you don't have a RAID array, it would be worthwhile to do the
equivalent of a bitmap heap scan by fetching blocks in order.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
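[Editor's sketch] Fetching blocks in order, as a bitmap heap scan does, amounts to collecting the block numbers referenced by upcoming WAL records, then sorting and de-duplicating them before issuing the reads. A rough sketch of that ordering step follows; the function names are made up for illustration, and in reality the block numbers would come from decoded WAL records rather than a caller-supplied array.

```c
/* Sketch of ordering a batch of WAL page references before fetching. */
#include <stdlib.h>

static int cmp_blkno(const void *a, const void *b)
{
    unsigned x = *(const unsigned *) a;
    unsigned y = *(const unsigned *) b;
    return (x > y) - (x < y);
}

/* Sort blknos in place and drop duplicates; returns the new count.
 * Reads issued in this order sweep the file sequentially, and each
 * block is prefetched at most once. */
int order_block_refs(unsigned *blknos, int n)
{
    if (n <= 1)
        return n;
    qsort(blknos, n, sizeof(unsigned), cmp_blkno);
    int out = 1;
    for (int i = 1; i < n; i++)
        if (blknos[i] != blknos[out - 1])
            blknos[out++] = blknos[i];
    return out;
}
```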
Re: [HACKERS] VACUUMs and WAL

Simon Riggs [EMAIL PROTECTED] writes:
> I don't see a reason why we would issue 2 WAL records per block for a
> VACUUM, nor why we would prune and remove in two steps, dirtying the
> block each time. Seems like we could write approximately half the
> amount of data that we do. Surely we can come up with a better plan
> than that one?

This sounds like the same issue Pavan identified and proposed solving by
rotating the three passes, so that we do step 3 at the beginning of the
next vacuum run, where it can be done in the same pass as step 1. To do
that he proposed we do:

1. Scan the heap, doing two things: a) remove any marked tuples if they
   were marked by a previous vacuum which committed, and b) prune and
   mark any tuples we find are deletable, for a future vacuum to remove.

2. Scan the indexes and remove the tuples we marked in 1b.

The main blocking issue IIRC was how to mark the tuples in a way which
is safe if vacuum aborts. He suggested putting a vacuum xid in pg_class.
Other suggestions were to mark the pages in the page header (which I
thought made the most sense) or to put the xid in the line pointer
(since nothing else needs to go there, the tuples being already
cleaned). There might also have been a question of how to deal with the
statistics.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!
Re: [HACKERS] VACUUMs and WAL

On Tue, 2008-10-28 at 10:59 +0000, Gregory Stark wrote:
> This sounds like the same issue Pavan identified and proposed solving
> by rotating the three passes, so that we do step 3 at the beginning of
> the next vacuum run, where it can be done in the same pass as step 1.
> To do that he proposed we do:
>
> 1. Scan the heap, doing two things: a) remove any marked tuples if
>    they were marked by a previous vacuum which committed, and b) prune
>    and mark any tuples we find are deletable, for a future vacuum to
>    remove.
>
> 2. Scan the indexes and remove the tuples we marked in 1b.

It's fairly hard to remove the second heap pass completely. I think what
I am suggesting is two heap passes, but writing WAL and dirtying blocks
on only one of the passes. The biggest I/O cost comes from the writes,
not the reads, ISTM.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] Visibility map, partial vacuums

On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote:
> One option would be to just ignore that problem for now, and not
> WAL-log.

Probably worth skipping for now, since it will cause patch conflicts if
you do. Are there any other interactions with Hot Standby?

> But it seems like we can sneak in an extra flag on a HEAP2_CLEAN
> record to say "page is now all visible", without too much work.

Does the PD_ALL_VISIBLE flag need to be set at the same time as updating
the VM? Surely heapgetpage() could do a conditional exclusive lock on
the buffer to set the block flag (unlogged), but just not update the VM.
Separating the two concepts should allow the visibility-check speed gain
to be more generally available.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] Proposal of PITR performance improvement for 8.4.

On Tue, 2008-10-28 at 16:33 +0900, Koichi Suzuki wrote:
> This is my first proposal of a PITR performance improvement for
> PostgreSQL 8.4 development. The proposal includes a readahead
> mechanism for data pages which will be read by redo() routines during
> recovery. This is especially effective in recovery without full page
> writes. Readahead is done by posix_fadvise(), as proposed in the index
> scan improvement. Details of the implementation will be found in the
> README file in the material.

I'm happy with the idea of a readahead process. I thought we were
implementing a BackgroundReader process for other uses. Is that dead
now?

I see you're doing this as a standalone command. It's a shame we can't
include this in core for now, but I can see the reasons why.

How does the whole toolchain look with this? Can you give a simple
example of using pg_lesslog, pg_readahead and pg_standby together, if
that is appropriate?

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] VACUUMs and WAL

Simon Riggs [EMAIL PROTECTED] writes:
> On Tue, 2008-10-28 at 10:59 +0000, Gregory Stark wrote:
>> To do that he proposed we do:
>>
>> 1. Scan the heap, doing two things: a) remove any marked tuples if
>>    they were marked by a previous vacuum which committed, and b)
>>    prune and mark any tuples we find are deletable, for a future
>>    vacuum to remove.
>>
>> 2. Scan the indexes and remove the tuples we marked in 1b.
>
> It's fairly hard to remove the second heap pass completely. I think
> what I am suggesting is two heap passes, but writing WAL and dirtying
> blocks on only one of the passes.

How small a patch would it be? I guess you just need to disable all
pruning in the first pass and do it in the second pass? I would still
rather see Pavan's optimization if we can do it cleanly, but if that's
not going to happen and this is trivial, then sure, we may as well.

> The biggest I/O cost comes from the writes, not the reads, ISTM.

It's counter-intuitive, but it's usually actually the other way around.
Writes can be buffered, re-ordered, and scheduled during otherwise idle
time. Reads, however, are always blocking.

However, in this situation I think you may be right. Vacuum is doing a
sequential scan through the table, so if the OS has to interrupt that
scan to go do some writes, it'll end up having to go back and forth.
That would be a *lot* slower than just doing a sequential scan.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!
Re: [HACKERS] VACUUMs and WAL

On Tue, 2008-10-28 at 10:10 +0000, Simon Riggs wrote:
> On Tue, 2008-10-28 at 11:45 +0200, Hannu Krosing wrote:
>> IIRC the first heap pass just collects info and does nothing else. Is
>> this just an empty/do-nothing WAL record?
>
> 8.3 changed that; it used to work that way. I guess I never looked at
> the amount of WAL being generated.

I can't see how it is safe to do anything more than just lookups on the
first pass. There will be dangling index pointers if the system crashes
or is rebooted, or the vacuum is just interrupted, after cleaning some
heap pages but before cleaning the corresponding index pages.

---
Hannu
Re: [HACKERS] Proposal of PITR performance improvement for 8.4.

Gregory Stark wrote:
> Incidentally, a bit of background for anyone who wasn't around when
> this last came up: prefetching matters especially for our recovery
> code because it's single-threaded. If you have a RAID array, you're
> effectively limited to using a single drive at a time. This is a major
> problem because the logs could have been written by many processes
> hammering the RAID array concurrently. In other words, your warm
> standby database might not be able to keep up with the logs from the
> master database even on identical (or even better) hardware.
>
> Simon (I think?) proposed allowing our recovery code to be
> multi-threaded. Heikki suggested using prefetching.

I actually played around with the prefetching, and even wrote a quick
prototype of it, about a year ago. It read ahead a fixed number of WAL
records in xlog.c, calling posix_fadvise() for all pages that were
referenced in them. I never got around to finishing it, as I wanted to
see Greg's posix_fadvise() patch get done first and rely on the same
infrastructure, but here are some lessons I learned:

1. You should avoid useless posix_fadvise() calls. In the naive
implementation, where you simply call posix_fadvise() for every page
referenced in every WAL record, you'll do 1-2 posix_fadvise() syscalls
per WAL record, and that's a lot of overhead. We face the same design
question as with Greg's patch to use posix_fadvise() to prefetch index
and bitmap scans: what should the interface to the buffer manager look
like?

The simplest approach would be a new function call like
AdviseBuffer(Relation, BlockNumber) that calls posix_fadvise() for the
page if it's not in the buffer cache, but is a no-op otherwise. But that
means more overhead, since for every page access we need to find the
page twice in the buffer cache: once for the AdviseBuffer() call, and a
second time for the actual ReadBuffer(). It would be more efficient to
pin the buffer in the AdviseBuffer() call already, but that requires
many more changes to the callers.

2. The format of each WAL record is different, so you need a readahead
handler for every resource manager, for every record type. It would be
a lot simpler if there were a standardized way to store that
information in the WAL records.

3. IIRC I tried to handle just a few of the most important WAL record
types at first, but it turned out that you really need to handle all
WAL records (that are used at all) before you see any benefit.
Otherwise, every time you hit a WAL record that you haven't done
posix_fadvise() on, the recovery stalls, and you don't need many of
those to diminish the gains.

Not sure how these apply to your approach; it's very different. You seem
to handle 1. by collecting all the page references for the WAL file, and
sorting and removing the duplicates. I wonder how much CPU time is spent
on that?

>> Details of the implementation will be found in the README file in the
>> material.
>
> I've read through this and I think I disagree with the idea of using a
> separate program. It's a lot of extra code -- and duplicated code from
> the normal recovery too.

Agreed, this belongs in core. The nice thing about a separate process is
that you could hook it into restore_command with no changes to the
server, but as you note in the README, we'd want to use this in crash
recovery as well, and the interaction between the external command and
the startup process seems overly complex for that.

Besides, we want to use the posix_fadvise() stuff in the backend anyway,
so we should use the same infrastructure during WAL replay as well.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
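[Editor's sketch] The AdviseBuffer() idea above can be modelled, very loosely, outside the backend. The "cache" below is a plain array standing in for shared buffers, and the linear lookup and function names are made up for illustration; this is not PostgreSQL's actual buffer manager API.

```c
/* Toy model of an AdviseBuffer()-style call: only issue the readahead
 * hint when the block is not already cached. */
#include <stdbool.h>

static bool block_is_cached(const unsigned *cached, int ncached,
                            unsigned blkno)
{
    for (int i = 0; i < ncached; i++)
        if (cached[i] == blkno)
            return true;
    return false;
}

/* Returns true if a (simulated) posix_fadvise() hint was issued,
 * false if the page was already resident and the call is a no-op.
 * advise_calls counts the syscalls we avoided making redundantly. */
bool advise_buffer(const unsigned *cached, int ncached, unsigned blkno,
                   int *advise_calls)
{
    if (block_is_cached(cached, ncached, blkno))
        return false;           /* already resident: skip the syscall */
    (*advise_calls)++;          /* would call posix_fadvise() here */
    return true;
}
```

The cost Heikki describes is visible even in this toy: each page access pays for one lookup here and a second one at the eventual ReadBuffer(), which is why pinning the buffer at advise time is attractive despite the caller churn.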
Re: [HACKERS] Window Functions: v07 APIs and buffering strategies

Hitoshi Harada [EMAIL PROTECTED] writes:
> 2008/10/28 ITAGAKI Takahiro [EMAIL PROTECTED]:
>> I tested the patch on MinGW (Windows) and got the following warning
>> and error:
>>
>>     A. gram.y: conflicts: 3 shift/reduce
>>     B. include/nodes/plannodes.h:650: error: syntax error before uint
>
> I have no idea about A. I have noticed it but didn't think it was a
> problem; it doesn't occur in production, does it?

We have a zero-tolerance policy for bison warnings. Patches that
introduce shift/reduce conflicts *will* be rejected.

(And no, %expect isn't an acceptable fix. The problem with it is you
can't be sure which warnings it ignored. In a grammar that gets hacked
on as often as PG's does, we couldn't rely on the conflicts not to move
around, possibly resulting in unforeseen misbehavior.)

regards, tom lane
Re: [HACKERS] VACUUMs and WAL

Hannu Krosing wrote:
> I can't see how it is safe to do anything more than just lookups on
> the first pass. There will be dangling index pointers if the system
> crashes or is rebooted, or the vacuum is just interrupted, after
> cleaning some heap pages but before cleaning the corresponding index
> pages.

What's done in the first pass is the same HOT pruning that is done
opportunistically on other page accesses as well. IIRC it's required for
correctness, though I can't remember what exactly the issue was.

I don't think the extra WAL volume is a problem; VACUUM doesn't generate
much WAL anyway. As for the extra data page writes it causes: yeah, that
might cause some I/O that could be avoided, but remember that the first
pass often dirties buffers anyway to set hint bits.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Window Functions: v07 APIs and buffering strategies

2008/10/28 Tom Lane [EMAIL PROTECTED]:
> We have a zero-tolerance policy for bison warnings. Patches that
> introduce shift/reduce conflicts *will* be rejected.
>
> (And no, %expect isn't an acceptable fix. The problem with it is you
> can't be sure which warnings it ignored. In a grammar that gets hacked
> on as often as PG's does, we couldn't rely on the conflicts not to
> move around, possibly resulting in unforeseen misbehavior.)

OK, I'll try to remove it. I'm not familiar with bison, so my first task
is to find where the conflict is...

Regards,

--
Hitoshi Harada
Re: [HACKERS] Proposal of PITR performance improvement for 8.4.

Simon Riggs [EMAIL PROTECTED] writes:
> I'm happy with the idea of a readahead process. I thought we were
> implementing a BackgroundReader process for other uses. Is that dead
> now?

You and Bruce seem to keep resurrecting that idea. I've never liked it
-- I always hated that in Oracle and thought it was a terrible kludge. I
think the inter-process communication would be way too heavyweight; by
the time the other process is scheduled, the process which needed the
blocks would probably have done many of them already anyway. Worse, you
would need a large number of reading processes, and you would start to
run into locking contention on the work queue as well. In any case it
would be a lot of code to do what posix_fadvise() does for us with a
simple syscall.

Am I misjudging this? Is this a popular idea? We could always implement
it as a fall-back implementation for mdprefetch() where posix_fadvise()
(and libaio, assuming we implement that as well) don't work. It has the
advantage of working with any system at all, even one that predates
1003.1. But it seems like an awful lot of code to implement a solution
that I fear won't work very well. And aren't people running RAID arrays
likely to be running modern OSes anyway?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning
Re: [HACKERS] VACUUMs and WAL

Simon Riggs [EMAIL PROTECTED] writes:
> I think what I am suggesting is two heap passes, but writing WAL and
> dirtying blocks on only one of the passes.

I think you've all forgotten about hint-bit setting. The assumption is
that the first VACUUM pass is going to update a lot of hint bits, and we
might as well get some other work done with the same write. Now of
course that doesn't necessarily entail a WAL write too, but it makes
this less than a slam-dunk win.

Also, I think that the reason the code ended up this way is that there
were pretty severe difficulties in making the VACUUM code cope correctly
with un-pruned tuples. Pavan might remember more about that.

regards, tom lane
Re: [HACKERS] VACUUMs and WAL

Simon Riggs [EMAIL PROTECTED] writes:
> Looking at a VACUUM's WAL records makes me think twice about the way
> we issue a VACUUM.
>
> 1. First we scan the heap, issuing a HEAP2 clean record for every
>    block that needs cleaning.
> 2. Then we scan the index, issuing WAL records as appropriate.
> 3. Then we rescan the heap, issuing a HEAP2 clean record for every
>    block.

The first pass removes dead HOT tuples. The second pass removes dead
normal tuples (it does NOT write every block, only those with dead
tuples). In principle the set of pages written in pass 1 might be
completely disjoint from the set of pages written in pass 2 (though I
admit that's probably not real likely).

> Surely we can come up with a better plan than that one?

Maybe, but it's not as bad as you're painting it.

regards, tom lane
Re: [HACKERS] Proposal of PITR performance improvement for 8.4.

Gregory Stark wrote:
> You and Bruce seem to keep resurrecting that idea. I've never liked it
> -- I always hated that in Oracle and thought it was a terrible kludge.
> I think the inter-process communication would be way too heavyweight;
> by the time the other process is scheduled, the process which needed
> the blocks would probably have done many of them already anyway.
> Worse, you would need a large number of reading processes, and you
> would start to run into locking contention on the work queue as well.
> In any case it would be a lot of code to do what posix_fadvise() does
> for us with a simple syscall.

I agree.

> Am I misjudging this? Is this a popular idea? We could always
> implement it as a fall-back implementation for mdprefetch() where
> posix_fadvise() (and libaio, assuming we implement that as well) don't
> work. It has the advantage of working with any system at all, even one
> that predates 1003.1.

Yeah, if we want to support prefetching on systems that don't have
posix_fadvise(), that would be the way to do it. posix_fadvise() is a
good API. We can support multiple platform-specific implementations of
it if there's interest: posix_fadvise(), aio to a dummy buffer, whatever
functions there are on Windows, or background reader processes or
threads.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] VACUUMs and WAL

On Tue, 2008-10-28 at 14:28 +0200, Heikki Linnakangas wrote:
> Hannu Krosing wrote:
>> I can't see how it is safe to do anything more than just lookups on
>> the first pass.
>
> What's done in the first pass is the same HOT pruning that is done
> opportunistically on other page accesses as well. IIRC it's required
> for correctness, though I can't remember what exactly the issue was.

Are you sure it is a correctness thing? Maybe HOT pruning just happened
to be in a path used by vacuum to read pages.

> I don't think the extra WAL volume is a problem;

Probably not (unless you need to ship your WAL records over a very
expensive network connection). If it is a simple performance problem,
then it can probably be fixed by just running VACUUM more slowly.

> VACUUM doesn't generate much WAL anyway. As for the extra data page
> writes it causes: yeah, that might cause some I/O that could be
> avoided, but remember that the first pass often dirties buffers anyway
> to set hint bits.

Still, can't we special-case HOT pruning and hint-bit change WAL-logging
for the first pass of vacuum? They both seem redundant in the case of
VACUUM.

---
Hannu
Re: [HACKERS] Window Functions: v07 APIs and buffering strateties
Hitoshi Harada [EMAIL PROTECTED] writes: OK, I'll try to remove it. I'm not used to bison so my first task is to find where the conflict is... Use bison -v to get details of where the conflict is. I find that the most common way to fix things is to postpone where the parser has to make a decision, which usually ends up meaning a slightly more verbose set of productions --- for instance instead of

	foo: bar opt_baz

	opt_baz: baz | /*EMPTY*/

you might have to do

	foo: bar baz
		| bar

(which is actually shorter in this pseudo-example, but wouldn't be if bar were complicated or foo had a lot of alternatives already). The reason this can fix it is that the parser doesn't have to commit to which case applies until after it's read the tokens for baz. In the first form, it has to decide whether baz is there or not without looking any further ahead than the first token of baz. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Visibility map, partial vacuums
Simon Riggs wrote: On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote: One option would be to just ignore that problem for now, and not WAL-log. Probably worth skipping for now, since it will cause patch conflicts if you do. Are there any other interactions with Hot Standby? But it seems like we can sneak in an extra flag on a HEAP2_CLEAN record to say page is now all visible, without too much work. Hmm. Even if a tuple is visible to everyone on the master, it's not necessarily yet visible to all the read-only transactions in the slave. Does the PD_ALL_VISIBLE flag need to be set at the same time as updating the VM? Surely heapgetpage() could do a ConditionalLockBuffer exclusive to set the block flag (unlogged), but just not update the VM. Separating the two concepts should allow the visibility-check speed gain to be more generally available. Yes, that should be possible in theory. There's no version of ConditionalLockBuffer() for conditionally upgrading a shared lock to exclusive, but one could be added. I'm not sure if it would be safe to set the PD_ALL_VISIBLE_FLAG while holding just a shared lock, though. If it is, then we could do just that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Visibility map, partial vacuums
Heikki Linnakangas [EMAIL PROTECTED] writes: ... I'm not sure if it would be safe to set the PD_ALL_VISIBLE_FLAG while holding just a shared lock, though. If it is, then we could do just that. Seems like it must be safe. If you have shared lock on a page then no one else could be modifying the page in a way that would falsify PD_ALL_VISIBLE. You might have several processes concurrently try to set the bit but that is safe (same situation as for hint bits). The harder part is propagating the bit to the visibility map, but I gather you intend to only allow VACUUM to do that? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: convert SQL-language functions to return tuplestores
Gregory Stark [EMAIL PROTECTED] writes: Tom Lane [EMAIL PROTECTED] writes: I suspect it doesn't help you as much as you think. It's always been the case that SRFs in FROM-items are fed through a tuplestore, and so are plpgsql SRF results. I always thought we considered that a bug though. It sure would be nice if we could generate results as needed instead of having to generate them in advance and store all of them. I suppose, but short of a fundamental rethink of how PL functions work that's not going to happen. There's also the whole issue of when do side-effects happen (such as before/after statement triggers). In particular I fear there are a lot of places that use functions where we might expect them to use views. They're never going to get really good plans but it would be nice if we could at least avoid the extra materialize steps. Agreed, but I think the fundamental solution there, for simple-select functions, is inlining. Now your patch isn't affecting that one way or the other but does it rule it out forever? I think the PL side of the problem is the hard part --- if we knew how to solve these issues for plpgsql then SQL functions would surely be easy. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: convert SQL-language functions to return tuplestores
Simon Riggs [EMAIL PROTECTED] writes: On Sun, 2008-10-26 at 21:49 -0400, Tom Lane wrote: So I'm concluding that we can easily afford to switch to tuplestore-always operation, especially if we are willing to put any effort into tuplestore optimization. (I note that the current tuplestore code writes 24 bytes per row for this example, which is a shade on the high side for only 4 bytes payload. It looks like it would be pretty easy to knock 10 bytes off that for a 40% savings in I/O volume.) That seems like an important, possibly more important, change. Yeah, seeing that both WITH and window functions will be stressing tuplestore performance, anything we can save there is probably worth the trouble. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: convert SQL-language functions to return tuplestores
On Tue, Oct 28, 2008 at 09:28:38AM -0400, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: On Sun, 2008-10-26 at 21:49 -0400, Tom Lane wrote: So I'm concluding that we can easily afford to switch to tuplestore-always operation, especially if we are willing to put any effort into tuplestore optimization. (I note that the current tuplestore code writes 24 bytes per row for this example, which is a shade on the high side for only 4 bytes payload. It looks like it would be pretty easy to knock 10 bytes off that for a 40% savings in I/O volume.) That seems like an important, possibly more important, change. Yeah, seeing that both WITH and window functions will be stressing tuplestore performance, anything we can save there is probably worth the trouble. regards, tom lane The pre-sort for index builds would also benefit from this change. Ken -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Visibility map, partial vacuums
Tom Lane wrote: The harder part is propagating the bit to the visibility map, but I gather you intend to only allow VACUUM to do that? Yep. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Visibility map, partial vacuums
Tom Lane wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: To modify a page: If PD_ALL_VISIBLE flag is set, the bit in the visibility map is cleared first. The heap page is kept pinned, but not locked, while the visibility map is updated. We want to avoid holding a lock across I/O, even though the visibility map is likely to stay in cache. After the visibility map has been updated, the page is exclusively locked and modified as usual, and PD_ALL_VISIBLE flag is cleared before releasing the lock. So after having determined that you will modify a page, you release the ex lock on the buffer and then try to regain it later? Seems like a really bad idea from here. What if it's no longer possible to do the modification you intended? In case of insert/update, you have to find a new target page. I put the logic in RelationGetBufferForTuple(). In case of delete and update (old page), the flag is checked and bit cleared just after pinning the buffer, before doing anything else. (I note that that's not actually what the patch is doing for heap_update, will fix..) If we give up on the strict requirement that the bit in the visibility map has to be cleared if the PD_ALL_VISIBLE flag on the page is not set, then we could just update the visibility map after releasing the locks on the heap pages. I think I'll do that for now, for simplicity. To set the PD_ALL_VISIBLE flag, you must hold an exclusive lock on the page, while you observe that all tuples on the page are visible to everyone. That doesn't sound too good from a concurrency standpoint... Well, no, but it's only done in VACUUM. And pruning. I implemented it as a new loop that call HeapTupleSatisfiesVacuum on each tuple, and checking that xmin is old enough for live tuples, but come to think of it, we're already calling HeapTupleSatisfiesVacuum for every tuple on the page during VACUUM, so it should be possible to piggyback on that by restructuring the code. That's how the patch works right now. 
However, there's a small performance problem with the current approach: setting the PD_ALL_VISIBLE flag must be WAL-logged. Otherwise, this could happen: I'm more concerned about *clearing* the bit being WAL-logged. That's necessary for correctness. Yes, clearing the PD_ALL_VISIBLE flag always needs to be WAL-logged. There's a new boolean field in xl_heap_insert/update/delete records indicating if the operation cleared the flag. On replay, if the flag was cleared, the bit in the visibility map is also cleared. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Optimizing tuplestore usage for SRFs
I wrote: Simon Riggs [EMAIL PROTECTED] writes: On Sun, 2008-10-26 at 21:49 -0400, Tom Lane wrote: ... effort into tuplestore optimization. (I note that the current tuplestore code writes 24 bytes per row for this example, which is a shade on the high side for only 4 bytes payload. It looks like it would be pretty easy to knock 10 bytes off that for a 40% savings in I/O volume.) That seems like an important, possibly more important, change. Yeah, seeing that both WITH and window functions will be stressing tuplestore performance, anything we can save there is probably worth the trouble. Six of the ten bytes I was speaking of are alignment padding, which can be removed with some relatively simple hacking in tuplestore.c (I think the same might apply in tuplesort.c BTW). The other place I was looking at is that currently, a SRF's result tuplestore is always built with randomAccess = true, which causes the tuplestore to write a trailing length word on each tuple, to support the possibility of being asked to read backwards. Of course, 99% of the time it never will be asked to do so. So we ought to try to suppress that overhead. I can see two ways we might do this: 1. Stop treating nodeFunctionscan.c as supporting backwards scan. This would be an extremely localized change (about two lines ;-)) but the downside is that calling a function scan in a SCROLL CURSOR would result in an extra Materialize node getting stuck in front of the Functionscan node to support backwards scanning. 2. Propagate the Functionscan node's EXEC_FLAG_BACKWARD flag bit to the called function. I'd be inclined to do this by adding it as an additional bit in the rsinfo->allowedModes field. In either case we would need cooperation from called SRFs to get the optimization to happen --- the tuplestore_begin_heap calls are not consolidated into one place (which maybe was a mistake) and they're all passing constant TRUE for randomAccess. 
That coding is safe but would need to change to reflect whichever policy we adopt. (Note: one advantage of approach #1 is that if anyone is mistakenly passing randomAccess = FALSE, it would become safe, which it isn't now.) I'm of mixed mind about which way to go. I'm not sure that optimizing scroll cursors on functions is something worth worrying about. However, if we do #1 now then we are probably locked into that approach and could not change to #2 in the future --- we'd have to worry about breaking third-party SRFs that are passing randomAccess = FALSE. Comments? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Updating FSM on recovery
The one remaining issue I'd like to address in the new FSM implementation is the fact that the FSM is currently not updated at all in WAL recovery. The old FSM wasn't updated on WAL recovery either, and was in fact completely thrown away if the system wasn't shut down cleanly. The difference is that after recovery, we used to start with no FSM information at all, and all inserts would have to extend the relations until the next vacuum, while now the inserts use the old data in the FSM. In case of a PITR recovery or warm stand-by, the FSM information would come from the last base backup, which could be *very* old. The first inserter after the recovery might have to visit a lot of pages that the FSM claimed had free space, but didn't in reality, before finding a suitable target. In the absolutely worst case, where the table was almost empty when the base backup was taken, but is now full, it might have to visit every single heap page. That's not good. So we should try to update the FSM during recovery as well. It doesn't need to be very accurate, as the FSM information isn't accurate anyway, but we should try to avoid the worst case scenarios. The attached patch is my first attempt at that. Arbitrarily, if after a heap insert/update there's less than 20% of free space on the page, the FSM is updated. Compared to updating it every time, that saves a lot of overhead, while doing a pretty good job at marking full pages as full in the FSM. My first thought was to update the FSM if there isn't enough room on the page for a new tuple of the same size as the one just inserted; that would be pretty close to the logic we have during normal operation, where the FSM is updated when the tuple that we're about to insert doesn't fit on the page. But because we don't know the fillfactor during recovery, I don't think we can do that reliably. One issue with this patch is that it doesn't update the FSM at all when pages are restored from full page images. 
It would require fetching the page and checking the free space on it, or peeking into the size of the backup block data, and I'm not sure if it's worth the extra code to do that. Thoughts? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 1d43b0b..a9bc17a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -54,6 +54,7 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
+#include "storage/freespace.h"
 #include "storage/lmgr.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
@@ -4029,6 +4030,7 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record, bool clean_move)
 	int		nredirected;
 	int		ndead;
 	int		nunused;
+	Size	freespace;
 
 	if (record->xl_info & XLR_BKP_BLOCK_1)
 		return;
@@ -4068,6 +4070,15 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record, bool clean_move)
 	PageSetLSN(page, lsn);
 	PageSetTLI(page, ThisTimeLineID);
 	MarkBufferDirty(buffer);
+
+	/*
+	 * update the FSM as well
+	 *
+	 * XXX: We don't get here if the page was restored from full page image
+	 */
+	freespace = PageGetHeapFreeSpace(page);
+	XLogRecordPageWithFreeSpace(xlrec->node, xlrec->block, freespace);
+
 	UnlockReleaseBuffer(buffer);
 }
@@ -4212,6 +4223,7 @@ heap_xlog_insert(XLogRecPtr lsn, XLogRecord *record)
 	HeapTupleHeader htup;
 	xl_heap_header xlhdr;
 	uint32		newlen;
+	Size		freespace;
 
 	if (record->xl_info & XLR_BKP_BLOCK_1)
 		return;
@@ -4271,6 +4283,19 @@ heap_xlog_insert(XLogRecPtr lsn, XLogRecord *record)
 	PageSetLSN(page, lsn);
 	PageSetTLI(page, ThisTimeLineID);
 	MarkBufferDirty(buffer);
+
+	/*
+	 * If the page is running low on free space, update the FSM as well.
+	 * Pretty arbitrarily, our definition of "low" is less than 20%. We can't
+	 * do much better than that without knowing the fill-factor for the table.
+	 *
+	 * XXX: We don't get here if the page was restored from full page image
+	 */
+	freespace = PageGetHeapFreeSpace(page);
+	if (freespace < BLCKSZ / 5)
+		XLogRecordPageWithFreeSpace(xlrec->target.node,
+									BufferGetBlockNumber(buffer), freespace);
+
 	UnlockReleaseBuffer(buffer);
 }
@@ -4296,6 +4321,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool move, bool hot_update)
 	xl_heap_header xlhdr;
 	int			hsize;
 	uint32		newlen;
+	Size		freespace;
 
 	if (record->xl_info & XLR_BKP_BLOCK_1)
 	{
@@ -4456,6 +4482,16 @@ newsame:;
 	PageSetLSN(page, lsn);
 	PageSetTLI(page, ThisTimeLineID);
 	MarkBufferDirty(buffer);
+
+	/*
+	 * If the page is running low on free space, update the FSM as well.
+	 * XXX: We don't get here if the page was restored from full page image
+	 */
+	freespace = PageGetHeapFreeSpace(page);
+	if (freespace < BLCKSZ / 5)
+		XLogRecordPageWithFreeSpace(xlrec->target.node,
+									BufferGetBlockNumber(buffer), freespace);
+
 	UnlockReleaseBuffer(buffer);
 }
diff --git a/src/backend/storage/freespace/freespace.c
Re: [HACKERS] Visibility map, partial vacuums
On Tue, 2008-10-28 at 14:57 +0200, Heikki Linnakangas wrote: Simon Riggs wrote: On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote: One option would be to just ignore that problem for now, and not WAL-log. Probably worth skipping for now, since it will cause patch conflicts if you do. Are there any other interactions with Hot Standby? But it seems like we can sneak in an extra flag on a HEAP2_CLEAN record to say page is now all visible, without too much work. Hmm. Even if a tuple is visible to everyone on the master, it's not necessarily yet visible to all the read-only transactions in the slave. Never a problem. No query can ever see the rows removed by a cleanup record, enforced by the recovery system. Does the PD_ALL_VISIBLE flag need to be set at the same time as updating the VM? Surely heapgetpage() could do a ConditionalLockBuffer exclusive to set the block flag (unlogged), but just not update VM. Separating the two concepts should allow the visibility check speed gain to more generally available. Yes, that should be possible in theory. There's no version of ConditionalLockBuffer() for conditionally upgrading a shared lock to exclusive, but it should be possible in theory. I'm not sure if it would be safe to set the PD_ALL_VISIBLE_FLAG while holding just a shared lock, though. If it is, then we could do just that. To be honest, I'm more excited about your perf results for that than I am about speeding up some VACUUMs. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: convert SQL-language functions to return tuplestores
I always thought we considered that a bug though. It sure would be nice if we could generate results as needed instead of having to generate them in advance and store all of them. I suppose, but short of a fundamental rethink of how PL functions work that's not going to happen. There's also the whole issue of when do side-effects happen (such as before/after statement triggers). For PL/pgsql, I think it might be possible to execute a function to precisely the point where you have generated a sufficient number of records. In other words, when someone asks for a tuple, you start executing the function until a tuple pops out, and then save the execution context until someone asks for another. Conceivably you can push LIMIT and WHERE clauses down into any RETURN QUERY statements executed, as well. Maybe that qualifies as a fundamental rethink, though, and we can worry about how to suppress the tuplestore in that case when and if someone is prepared to implement it. For other procedural languages, you would need support from the executor for that PL, which in most cases will probably be lacking. thinks a little more In fact, I suspect that you would gain a lot by optimizing specifically for the case of a PL/pgsql function of the form: (1) execute 0 or more statements that may or may not have side effects but do not return any tuples, (2) execute exactly 1 RETURN QUERY statement, and (3) implicit or explicit RETURN. I suspect that's a very common usage pattern, and it wouldn't require being able to save the entire execution context at an arbitrary point. (I agree that BEFORE/AFTER statement triggers are a problem here but I'm not sure that they are an insoluble one, and I'd hate for that to be the thing that kills this type of optimization. Even if you implemented a full-blown partial-execution model, it would be reasonable to always run any particular INSERT/UPDATE/DELETE to completion. It's really SELECT that is the problem.) 
In particular I fear there are a lot of places that use functions where we might expect them to use views. They're never going to get really good plans but it would be nice if we could at least avoid the extra materialize steps. Agreed, but I think the fundamental solution there, for simple-select functions, is inlining. +1. Upthread passing LIMIT and OFFSET clauses into the SRF as parameters was suggested, but that's really intractable for real-world use where you are also applying WHERE clauses to the SRF results. ...Robert -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal of PITR performance improvement for 8.4.
Gregory Stark wrote: Simon Riggs [EMAIL PROTECTED] writes: And are people running raid arrays not likely to be running modern OSes anyways? No and those that are can upgrade. Joshua D. Drake -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Updating FSM on recovery
Heikki Linnakangas [EMAIL PROTECTED] writes: So we should try to update the FSM during recovery as well. It doesn't need to be very accurate, as the FSM information isn't accurate anyway, but we should try to avoid the worst case scenarios. Agreed. One issue with this patch is that it doesn't update the FSM at all when pages are restored from full page images. It would require fetching the page and checking the free space on it, or peeking into the size of the backup block data, and I'm not sure if it's worth the extra code to do that. I'd vote not to bother, at least not in the first cut. As you say, 100% accuracy isn't required, and I think that in typical scenarios an insert/update that causes a page to become full would be relatively less likely to have a full-page image. As far as the ugliness in XLogRecordPageWithFreeSpace goes: couldn't you just call XLogReadBufferWithFork with init = true, and then initialize the page if PageIsNew? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: convert SQL-language functions to return tuplestores
Hi, In the python language, functions that lazily return collections are called generators and use the yield keyword instead of return. http://www.python.org/doc/2.5.2/tut/node11.html#SECTION0011100 Maybe having such a concept in PostgreSQL would allow the user to choose between the current behavior (materializing) and lazy computing, with a new internal API to be done in the executor, maybe.

CREATE FUNCTION mygenerator() RETURNS SETOF integer LANGUAGE plpgsql AS $f$
BEGIN
  FOR v_foo IN SELECT foo FROM table LOOP
    YIELD my_expensive_function(v_foo);
  END LOOP;
  RETURN;
END;
$f$;

At the plain SQL level, we could expose this with a new function parameter, GENERATOR maybe?

CREATE FUNCTION my_generator_example(integer, integer) RETURNS SETOF integer GENERATOR LANGUAGE SQL AS $f$
  SELECT generate_series($1, $2);
$f$;

Maybe we should prefer to add the GENERATOR (or LAZY or whatever sounds good to a native English speaker) parameter to PL functions instead of providing YIELD, having RETURN do YIELD in this case. On Tuesday, 28 October 2008, Tom Lane wrote: I suppose, but short of a fundamental rethink of how PL functions work that's not going to happen. There's also the whole issue of when do side-effects happen (such as before/after statement triggers). Would it be possible to forbid generators in those cases? Agreed, but I think the fundamental solution there, for simple-select functions, is inlining. Would it be possible to maintain the current behavior with the ROWS estimator for functions, even when inlining, as a way to trick the planner when you can't feed it good enough stats? I think the PL side of the problem is the hard part --- if we knew how to solve these issues for plpgsql then SQL functions would surely be easy. What about this python idea of GENERATORS and the YIELD control for lazy evaluation of functions? -- dim
Re: [HACKERS] WIP patch: convert SQL-language functions to return tuplestores
2008/10/28 Dimitri Fontaine [EMAIL PROTECTED]: Hi, In the python language, functions that lazily return collections are called generators and use the yield keyword instead of return. http://www.python.org/doc/2.5.2/tut/node11.html#SECTION0011100 Maybe having such a concept in PostgreSQL would allow the user to choose between current behavior (materializing) and lazy computing, with a new internal API to get done in the executor maybe. Lazy computing is a good idea, but I am afraid it would be really hard to implement. You would have to store the current state somewhere, stop execution, return back from the node, and later be able to restore the PL state and continue processing. I can't see how to do that without thread support. Regards Pavel Stehule -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Updating FSM on recovery
On Tue, 2008-10-28 at 15:35 +, Simon Riggs wrote: I wonder if there is merit in having an XLogInsertMulti() which inserts multiple records in a batch as a way of reducing WALInsertLock traffic. It might be possible to piggyback FSM records onto the main heap changes. Or possibly an XLogInsertDeferred() which just queues up some work so the next time we call XLogInsert() it will insert the deferred work as well as the main work all in one lock cycle. It would only be usable for low priority info like FSM stuff that isn't needed for recovery. Maybe we could do that with hints also. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] FAQ_Solaris 1.28 to spanish
This file is to be added to pgsql-docs; it is the translation of FAQ_Solaris. I have new things to add, especially OpenSolaris platform specifics. Should I write a new FAQ, or just concatenate the other OpenSolaris features onto this one? Regards, -- Emanuel Calvo Franco Syscope PostgreSQL DBA BaPUG Member FAQ_Solaris_spanish Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal of PITR performance improvement for 8.4.
On Tue, 2008-10-28 at 12:34 +, Gregory Stark wrote: Simon Riggs [EMAIL PROTECTED] writes: I'm happy with the idea of a readahead process. I thought we were implementing a BackgroundReader process for other uses. Is that dead now? You and Bruce seem to keep resurrecting that idea. Actually just asking for status, not trying to change design. I've never liked it -- I always hated that in Oracle and thought it was a terrible kludge. But now... If you have a better way, great, but that doesn't make perfectly workable and fairly simple userspace solutions into kludges. That's just an emotive description of your preference. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Updating FSM on recovery
On Tue, 2008-10-28 at 16:22 +0200, Heikki Linnakangas wrote: Arbitrarily, if after a heap insert/update there's less than 20% of free space on the page, the FSM is updated. Compared to updating it every time, that saves a lot of overhead, while doing a pretty good job at marking full pages as full in the FSM. My first thought was to update the FSM if there isn't enough room on the page for a new tuple of the same size as the one just inserted; that would be pretty close to the logic we have during normal operation, where the FSM is updated when the tuple that we're about to insert doesn't fit on the page. But because we don't know the fillfactor during recovery, I don't think we can do that reliably. With HOT, we tend to hover around the nearly-full state, so this seems like it will trigger repeatedly. Is it possible that we could put an extra field onto a heap_clean record to show remaining space? We would use it only for VACUUMs, not HOT, just as we do now. Probably a good idea to make a list of use cases and say what we do in each case, e.g. COPY, other bulk ops, HOT, etc. I wonder if there is merit in having an XLogInsertMulti() which inserts multiple records in a batch as a way of reducing WALInsertLock traffic. It might be possible to piggyback FSM records onto the main heap changes. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Updating FSM on recovery
Simon Riggs wrote: On Tue, 2008-10-28 at 16:22 +0200, Heikki Linnakangas wrote: Arbitrarily, if after a heap insert/update there's less than 20% of free space on the page, the FSM is updated. Compared to updating it every time, that saves a lot of overhead, while doing a pretty good job at marking full pages as full in the FSM. My first thought was to update the FSM if there isn't enough room on the page for a new tuple of the same size as the one just inserted; that would be pretty close to the logic we have during normal operation, where the FSM is updated when the tuple that we're about to insert doesn't fit on the page. But because we don't know the fillfactor during recovery, I don't think we can do that reliably. With HOT, we tend to hover around the nearly-full state, so this seems like it will trigger repeatedly. Hmm, true. Perhaps we should skip updating the FSM on HOT updates. After recovery, the new HOT-updated tuples are prunable anyway, so for inserting a new tuple, the page is almost as good as it was before the HOT update. Is it possible that we could put an extra field onto a heap_clean record to show remaining space? We would use it only for VACUUMs, not HOT, just as we do now. Sure, we could do that. I'm more worried about killing the pages from the FSM that are full, though, than keeping track of pages with plenty of free space accurately. I wonder if there is merit in having an XLogInsertMulti() which inserts multiple records in a batch as a way of reducing WALInsertLock traffic. It might be possible to piggyback FSM records onto the main heap changes. Umm, in the version that was finally committed, FSM doesn't generate any extra WAL records (except for truncation). -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal of PITR performance improvement for 8.4.
Simon Riggs [EMAIL PROTECTED] writes: On Tue, 2008-10-28 at 12:34 +, Gregory Stark wrote: Simon Riggs [EMAIL PROTECTED] writes: I've never liked it -- I always hated that in Oracle and thought it was a terrible kludge. But now... If you have a better way, great, but that doesn't make perfectly workable and fairly simple userspace solutions into kludges. That's just an emotive description of your preference. I won't dispute that calling it a kludge is a personal emotional reaction. I do think it's *far* from simple though. We would have to manage a variable number of worker processes, which is a fair amount of code right off the bat. But the real complexity I fear is in managing the work queue. We would have yet another fixed size shared memory data structure to configure and a lot of locking and memory contention on it. Keep in mind that potentially every backend process will by trying to enqueue hundreds of blocks and you'll have dozens (or even hundreds) of worker processes trying to dequeue blocks. Worst of all, you'll (hopefully) have someone trying to sort the blocks too which would make it basically impossible to do anything to minimize the contention. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training! -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Updating FSM on recovery
Simon Riggs [EMAIL PROTECTED] writes: Or possibly an XLogInsertDeferred() which just queues up some work so the next time we call XLogInsert() it will insert the deferred work as well as the main work all in one lock cycle. It would only be usable for low priority info like FSM stuff that isn't needed for recovery. Maybe we could do that with hints also. If it isn't needed for recovery, why would we be logging it at all? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Window Functions: v07 APIs and buffering strateties
2008/10/28 Tom Lane [EMAIL PROTECTED]: Hitoshi Harada [EMAIL PROTECTED] writes: OK, I'll try to remove it. I'm not used to bison so my first task is to find where the conflict is... Use bison -v to get details of where the conflict is. I find that the most common way to fix things is to postpone where the parser has to make a decision, which usually ends up meaning a slightly more verbose set of productions --- for instance instead of foo: bar opt_baz opt_baz: baz | /*EMPTY*/ you might have to do foo: bar baz | bar Thanks for your advice. I found that the problem is around the FRAME clause. Now my question is: can ROWS be a reserved_keyword? In window specifications we have OVER (ORDER BY expr_list [(ROWS|RANGE) ... ]), and currently ROWS is not reserved, so bison is confused between the case where ROWS appears inside expr_list and the case where it begins the FRAME clause. Because there is no delimiter between the ORDER BY clause and the FRAME (that is, (ROWS | RANGE)) clause, ROWS can appear in expr_list as an a_expr. I see that ROWS has been an unreserved keyword and that in many places in gram.y such cases have been avoided by other hacking methods; isn't something similar possible in this case? If I missed something, please let me know. Regards, -- Hitoshi Harada -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Window Functions: v07 APIs and buffering strateties
Hitoshi Harada [EMAIL PROTECTED] writes: Can ROWS be reserved_keyword? In window specifications, we have OVER (ORDER BY expr_list [(ROWS|RANGE) ... ]) and currently ROWS is not reserved so bison is confused with cases of ROWS included in expr_list and in FRAME clause. Because there is no delimiter between ORDER BY clause and FRAME (that is (ROWS | RANGE)) clause, ROWS can be in expr_list as a_expr. Right offhand, I don't see any alternative but to make both ROWS and RANGE reserved. It's pretty annoying since that might break existing applications that have been using these as identifiers, but the SQL committee seems to care little about that :-( BTW, finding this sort of problem is exactly why ignoring shift/reduce conflicts is a bad idea. You would've ended up with unexpected behaviors given the wrong input. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Updating FSM on recovery
On Tue, 2008-10-28 at 12:16 -0400, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: Or possibly an XLogInsertDeferred() which just queues up some work so the next time we call XLogInsert() it will insert the deferred work as well as the main work all in one lock cycle. It would only be usable for low priority info like FSM stuff that isn't needed for recovery. Maybe we could do that with hints also. If it isn't needed for recovery, why would we be logging it at all? You just agreed that the info didn't need to be very accurate. There's a few things on the server that aren't needed for recovery, but it might be useful if they were logged occasionally to give roughly correct values. Contention on WALInsertLock seems to be a problem, yet writing WAL to disk is not a bottleneck. Deferring writing it slightly to allow things to be batched might be one way of smoothing the impact of that type of operation. That might be better than a heuristic method of reducing the number of inserts. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Visibility map, partial vacuums
Simon Riggs wrote: On Tue, 2008-10-28 at 14:57 +0200, Heikki Linnakangas wrote: Simon Riggs wrote: On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote: One option would be to just ignore that problem for now, and not WAL-log. Probably worth skipping for now, since it will cause patch conflicts if you do. Are there any other interactions with Hot Standby? But it seems like we can sneak in an extra flag on a HEAP2_CLEAN record to say page is now all visible, without too much work. Hmm. Even if a tuple is visible to everyone on the master, it's not necessarily yet visible to all the read-only transactions in the slave. Never a problem. No query can ever see the rows removed by a cleanup record, enforced by the recovery system. Yes, but there's a problem with recently inserted tuples: 1. A query begins in the slave, taking a snapshot with xmax = 100. So the effects of anything more recent should not be seen. 2. Transaction 100 inserts a tuple in the master, and commits 3. A vacuum comes along. There's no other transactions running in the master. Vacuum sees that all tuples on the page, including the one just inserted, are visible to everyone, and sets PD_ALL_VISIBLE flag. 4. The change is replicated to the slave. 5. The query in the slave that began at step 1 looks at the page, sees that the PD_ALL_VISIBLE flag is set. Therefore it skips the visibility checks, and erroneously returns the inserted tuple. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Visibility map, partial vacuums
On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote: Lazy VACUUM only needs to visit pages that are '0' in the visibility map. This allows partial vacuums, where we only need to scan those parts of the table that need vacuuming, plus all indexes. Just realised that this means we still have to visit each block of a btree index with a cleanup lock. That means the earlier idea of saying I don't need a cleanup lock if the page is not in memory makes a lot more sense with a partial vacuum. 1. Scan all blocks in memory for the index (and so, don't do this unless the index is larger than a certain % of shared buffers). 2. Start reading in new blocks until you've removed the correct number of tuples. 3. Work through the rest of the blocks checking that they are either in shared buffers and we can get a cleanup lock, or they aren't in shared buffers and so nobody has them pinned. If you do step (2) intelligently with regard to index correlation you might not need to do much I/O at all, if any. (1) has a good hit ratio because mostly only active tables will be vacuumed, so they are fairly likely to be in memory. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Updating FSM on recovery
Tom Lane wrote: As far as the ugliness in XLogRecordPageWithFreeSpace goes: couldn't you just call XLogReadBufferWithFork with init = true, and then initialize the page if PageIsNew? With init=true, we don't even try to read the page from the disk (since 8.3), so all FSM pages accessed during recovery would be zeroed out. I don't think that's what you intended. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Window Functions: v07 APIs and buffering strateties
2008/10/29 Tom Lane [EMAIL PROTECTED]: Hitoshi Harada [EMAIL PROTECTED] writes: Can ROWS be reserved_keyword? In window specifications, we have OVER (ORDER BY expr_list [(ROWS|RANGE) ... ]) and currently ROWS is not reserved so bison is confused with cases of ROWS included in expr_list and in FRAME clause. Because there is no delimiter between ORDER BY clause and FRAME (that is (ROWS | RANGE)) clause, ROWS can be in expr_list as a_expr. Right offhand, I don't see any alternative but to make both ROWS and RANGE reserved. It's pretty annoying since that might break existing applications that have been using these as identifiers, but the SQL committee seems to care little about that :-( BTW, finding this sort of problem is exactly why ignoring shift/reduce conflicts is a bad idea. You would've ended up with unexpected behaviors given the wrong input. I see it now. This has been a good lesson for me, though it took me a lot of time. Thanks anyway. Regards, -- Hitoshi Harada -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Visibility map, partial vacuums
On Tue, 2008-10-28 at 19:02 +0200, Heikki Linnakangas wrote: Yes, but there's a problem with recently inserted tuples: 1. A query begins in the slave, taking a snapshot with xmax = 100. So the effects of anything more recent should not be seen. 2. Transaction 100 inserts a tuple in the master, and commits 3. A vacuum comes along. There's no other transactions running in the master. Vacuum sees that all tuples on the page, including the one just inserted, are visible to everyone, and sets PD_ALL_VISIBLE flag. 4. The change is replicated to the slave. 5. The query in the slave that began at step 1 looks at the page, sees that the PD_ALL_VISIBLE flag is set. Therefore it skips the visibility checks, and erroneously returns the inserted tuple. Yep. I was thinking about FSM and row removal. So PD_ALL_VISIBLE must be separately settable on the standby. Another reason why it should be able to be set without a VACUUM - since there will never be one on standby. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Window Functions: v07 APIs and buffering strateties
On Tue, Oct 28, 2008 at 12:38:09PM -0400, Tom Lane wrote: Hitoshi Harada [EMAIL PROTECTED] writes: In window specifications, we have OVER (ORDER BY expr_list [(ROWS|RANGE) ... ]) and currently ROWS is not reserved so bison is confused with cases of ROWS included in expr_list and in FRAME clause. Because there is no delimiter between ORDER BY clause and FRAME (that is (ROWS | RANGE)) clause, ROWS can be in expr_list as a_expr. Right offhand, I don't see any alternative but to make both ROWS and RANGE reserved. It's pretty annoying since that might break existing applications that have been using these as identifiers, but the SQL committee seems to care little about that :-( Given that the only problematic case is if expr_list ends with a postfix operator, wouldn't it be sufficient to simply decree that in that case you need parentheses? Seems a lot less painful than adding two reserved words. Have a nice day, -- Martijn van Oosterhout [EMAIL PROTECTED] http://svana.org/kleptog/ Please line up in a tree and maintain the heap invariant while boarding. Thank you for flying nlogn airlines. signature.asc Description: Digital signature
Re: [HACKERS] Updating FSM on recovery
Heikki Linnakangas [EMAIL PROTECTED] writes: Tom Lane wrote: As far as the ugliness in XLogRecordPageWithFreeSpace goes: couldn't you just call XLogReadBufferWithFork with init = true, and then initialize the page if PageIsNew? With init=true, we don't even try to read the page from the disk (since 8.3), so all FSM pages accessed during recovery would be zeroed out. I don't think that's what you intended. Ah, right. Maybe the API change you suggested in the comment is the way to go. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PostgreSQL + Replicator developer meeting 10/28
On Mon, 2008-10-27 at 13:42 -0700, Joshua Drake wrote: With the recent open sourcing of Replicator, the team has been trying to come up with ways to ensure an open development process. In that light we have decided to have our first release 1.9 meeting on Freenode. All people interested in participating in a discussion about the upcoming Replicator 1.9 are welcome to attend. I missed the meeting at #replicator, but still have some questions: * Is there a mailing list for Replicator? The current topics are: * New MCP architecture What's new? I have some doubts about the current architecture based on my reading of the Replicator wiki, but would like to learn about the new architecture before spending too much time studying the old one. * DDL Replication Is it there already, or is it just a planned feature? * Release timeline * Questions * How hard would it be to extract the DDL replication part and use it as a basis for DDL after-trigger support, for use in trigger-based replication/auditing like Slony and pgQ/Londiste? * Special DDL triggers capturing DDL statement source could also be an alternative to full DDL triggers. Replicator is set to be a short cycle release, hopefully landing before PostgreSQL 8.4. It will support PostgreSQL 8.3 and PostgreSQL 8.4 (when 8.4 is available). We will be meeting in the #replicator channel at 10:00 AM PDT... -- Hannu -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Window Functions: v07 APIs and buffering strateties
Martijn van Oosterhout [EMAIL PROTECTED] writes: On Tue, Oct 28, 2008 at 12:38:09PM -0400, Tom Lane wrote: Right offhand, I don't see any alternative but to make both ROWS and RANGE reserved. It's pretty annoying since that might break existing applications that have been using these as identifiers, but the SQL committee seems to care little about that :-( Given that the only problematic case is if expr_list ends with a postfix operator, wouldn't it be sufficient to simply decree that in that case you need parentheses? Seems a lot less painful than adding two reserved words. Hmm ... actually, it might be possible to fix it with a suitable precedence declaration? The trick is to make sure that in ... ORDER BY foo ! ROWS ... the operator is taken as postfix not infix, which is the exact opposite of what we did for AS-less column aliases (and, in fact, is the opposite of what bison will do by default, IIRC). So it might be possible to fix by attaching some new precedence level to the ROWS token. On the other hand, this would still mean that ROWS acts oddly differently from any other column name, and in a way that would be hard to explain or document. So I'm really not sure that this is a better solution than reserving it. Still another trick in our toolbox is to use merged tokens to fix the lack of lookahead. If I read the spec correctly, the problematic cases of ROWS and RANGE must be followed by UNBOUNDED or BETWEEN, so folding these token pairs into four merged-tokens would get rid of the need to reserve anything. Merged tokens create their own little oddities too though, as I was just complaining to Peter. I wonder whether it would be possible to improve on that problem by arranging some way for the grammar to signal the lexer about whether lookahead is needed, and thus not do the merging in contexts where it couldn't be appropriate. 
regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Visibility map, partial vacuums
Simon Riggs [EMAIL PROTECTED] writes: On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote: Lazy VACUUM only needs to visit pages that are '0' in the visibility map. This allows partial vacuums, where we only need to scan those parts of the table that need vacuuming, plus all indexes. Just realised that this means we still have to visit each block of a btree index with a cleanup lock. Yes, and your proposal cannot fix that. Read The Deletion Algorithm in nbtree/README, particularly the second paragraph. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Visibility map, partial vacuums
Heikki Linnakangas [EMAIL PROTECTED] writes: Yes, but there's a problem with recently inserted tuples: 1. A query begins in the slave, taking a snapshot with xmax = 100. So the effects of anything more recent should not be seen. 2. Transaction 100 inserts a tuple in the master, and commits 3. A vacuum comes along. There's no other transactions running in the master. Vacuum sees that all tuples on the page, including the one just inserted, are visible to everyone, and sets PD_ALL_VISIBLE flag. 4. The change is replicated to the slave. 5. The query in the slave that began at step 1 looks at the page, sees that the PD_ALL_VISIBLE flag is set. Therefore it skips the visibility checks, and erroneously returns the inserted tuple. But this is exactly equivalent to the problem with recently deleted tuples: vacuum on the master might take actions that are premature with respect to the status on the slave. Whatever solution we adopt for that will work for this too. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PostgreSQL + Replicator developer meeting 10/28
On Tue, 28 Oct 2008 19:46:42 +0200 Hannu Krosing [EMAIL PROTECTED] wrote: On Mon, 2008-10-27 at 13:42 -0700, Joshua Drake wrote: With the recent open sourcing of Replicator, the team has been trying to come up with ways to ensure an open development process. In that light we have decided to have our first release 1.9 meeting on Freenode. All people interested in participating in a discussion about the upcoming Replicator 1.9 are welcome to attend. I missed the meeting at #replicator, but still have some questions Meeting is still going on. I will answer your questions and post results this afternoon. Joshua D. Drake -- The PostgreSQL Company since 1997: http://www.commandprompt.com/ PostgreSQL Community Conference: http://www.postgresqlconference.org/ United States PostgreSQL Association: http://www.postgresql.us/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Visibility map, partial vacuums
On Tue, 2008-10-28 at 13:58 -0400, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote: Lazy VACUUM only needs to visit pages that are '0' in the visibility map. This allows partial vacuums, where we only need to scan those parts of the table that need vacuuming, plus all indexes. Just realised that this means we still have to visit each block of a btree index with a cleanup lock. Yes, and your proposal cannot fix that. Read The Deletion Algorithm in nbtree/README, particularly the second paragraph. Yes, understood. Please read the algorithm again. It does guarantee that each block in the index has been checked to see if nobody is pinning it, it just avoids performing I/O to prove that. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Window Functions: v07 APIs and buffering strateties
On Tue, Oct 28, 2008 at 01:50:26PM -0400, Tom Lane wrote: Given that the only problematic case is if expr_list ends with a postfix operator, wouldn't it be sufficient to simply decree that in that case you need parentheses? Seems a lot less painful than adding two reserved words. Hmm ... actually, it might be possible to fix it with a suitable precedence declaration? The trick is to make sure that in ... ORDER BY foo ! ROWS ... the operator is taken as postfix not infix, which is the exact opposite of what we did for AS-less column aliases (and, in fact, is the opposite of what bison will do by default, IIRC). So it might be possible to fix by attaching some new precedence level to the ROWS token. Yes. Bison's default is to shift, which means that if you do nothing it will treat ROWS as part of the expression if it makes any sense at all. Given the requirement for a following UNBOUNDED or BETWEEN, the only problem is that you'll get a syntax error if the expr_list ends in a postfix operator, I don't see how you get hidden ambiguity. Operator precedence is exactly the way to do this, since operator precedence rules exist solely to resolve the shift/reduce conflicts. You're right about the documentation though. I suppose you could put in the documentation for the ROWS stuff something along the lines of: If the last expression of your ORDER by ends in a postfix operator, you must use parentheses. How many postfix operators are in common use in ORDER BY expressions anyway? Have a nice day, -- Martijn van Oosterhout [EMAIL PROTECTED] http://svana.org/kleptog/ Please line up in a tree and maintain the heap invariant while boarding. Thank you for flying nlogn airlines. signature.asc Description: Digital signature
Re: [HACKERS] Re: [COMMITTERS] pgsql: Rework subtransaction commit protocol for hot standby.
On Tue, 2008-10-28 at 00:35 +, Simon Riggs wrote: Comments please. I'm a little unclear as to what is happening with this. We have a bug in CVS HEAD, so clearly doing nothing shouldn't really be an option. It was my patch with the bug, so clearly my responsibility. After some prototypes, I'm happy with the submitted v3 patch. Please could somebody commit this? Or at least say what they think needs to happen, or what we're waiting for? -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Window Functions: v07 APIs and buffering strateties
Martijn van Oosterhout [EMAIL PROTECTED] writes: On Tue, Oct 28, 2008 at 01:50:26PM -0400, Tom Lane wrote: ... So it might be possible to fix by attaching some new precedence level to the ROWS token. Yes. Bison's default is to shift, which means that if you do nothing it will treat ROWS as part of the expression if it makes any sense at all. Given the requirement for a following UNBOUNDED or BETWEEN, the only problem is that you'll get a syntax error if the expr_list ends in a postfix operator, I don't see how you get hidden ambiguity. Hmm, now I see what you meant; that's a little different than what I was envisioning. I was thinking of trying to force a parse decision that would support the windowing syntax, whereas you propose forcing a parse decision that does the opposite, and making the user parenthesize if he's got a conflict. What the choice seems to come down to is making ROWS and RANGE reserved (in some form or other) versus creating a corner case for users of postfix operators. Phrased that way it does seem like the second alternative is better. Hitoshi: you can probably make this happen by including ROWS and RANGE in the %nonassoc IDENT precedence declaration, but you'll want to test to make sure the right things happen. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
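For reference, the precedence approach Tom suggests would amount to a gram.y fragment roughly along these lines. This is only a sketch: the rule names are illustrative, and as he notes the actual precedence interactions need testing:

```yacc
/* Give ROWS and RANGE the same precedence as IDENT, so the
 * shift/reduce conflict at "expr . ROWS" is resolved by precedence
 * instead of by reserving the keywords.  An expr_list ending in a
 * postfix operator before ROWS/RANGE then needs parentheses. */
%nonassoc	IDENT ROWS RANGE

window_specification: '(' opt_sort_clause opt_frame_clause ')' ;

opt_frame_clause:
	  ROWS frame_extent
	| RANGE frame_extent
	| /* EMPTY */
	;
```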
[HACKERS] Feature Request - Table Definition query
Howdy! It was suggested to me that this proposed feature is interesting enough to be floated around for discussion. Don't get your hopes up, because the request is extremely simple: It could be useful to have a command that returns the table definition (like pg_dump -st) from within the query interface. This could be particularly useful if one doesn't have access to or prefers not to manipulate results coming from a shell. If I have an API from which to query the database, it seems silly to have to spawn a shell and run a command to query the very same database. (I'm not exactly familiar with how pg_dump manipulates or extracts that data, so this could be a very, very stupid question. If so, please forgive me.) Regards, Wilfried Schobeiri -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: Rework subtransaction commit protocol for hot standby.
On Tue, 2008-10-28 at 14:43 -0400, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: I'm a little unclear as to what is happening with this. We have a bug in CVS HEAD, so clearly doing nothing shouldn't really be an option. Doing nothing until release is certainly not an option ;-). But AFAICS this is not a showstopper for development so I was intending to wait until commitfest starts before doing anything with it. Is it blocking some other progress for you? No, since I know how to fix it. Just uncomfortable leaving known bugs in the tree, that's all. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] current head fails regression tests on mac osx
I won't send the full logs here; if someone's interested, here's a tar.bz2 version: http://zlew.org/pg_regress.tar.bz2 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] current head fails regression tests on mac osx
Grzegorz Jaskiewicz wrote: I won't send the full logs here, if someone's interested in it - here's tar.bz2 version: http://zlew.org/pg_regress.tar.bz2 This seems to be the root cause: + ERROR: could not open file /Users/gj/Projects/postgres-head/pgsql/src/test/regress/results/onek.data for writing: Permission denied The buildfarm members running Mac OS X seem to be working fine, so it seems there's something special in your environment that's causing that to fail. Is the results directory writable? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: Rework subtransaction commit protocol for hot standby.
Simon Riggs [EMAIL PROTECTED] writes: On Tue, 2008-10-28 at 14:43 -0400, Tom Lane wrote: Doing nothing until release is certainly not an option ;-). But AFAICS this is not a showstopper for development so I was intending to wait until commitfest starts before doing anything with it. Just uncomfortable leaving known bugs in the tree, that's all. Feel free to stick something about it on the open items list http://wiki.postgresql.org/wiki/PostgreSQL_8.4_Open_Items if you want to make doubly sure it's not forgotten about. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] usermap regexp support
The attached patch tries to implement regexp support in the usermaps (pg_ident.conf). The use case is to allow things like realm-based matches in Kerberos/GSSAPI authentication (which will require some further hacks on that side to expose the realm name). For example you could have:

krb   /^(.*)@myrealm.com$     \1
krb   /^(.*)@otherrealm.com   guest

and things like that. It will also likely be quite useful once we get SSL certificate based authentication. Reviews particularly appreciated - I really can't claim to *know* our regexp code :-O Obviously docs need to be updated as well.

//Magnus

*** a/src/backend/libpq/hba.c
--- b/src/backend/libpq/hba.c
***************
*** 27,32 ****
--- 27,33 ----
  #include "libpq/ip.h"
  #include "libpq/libpq.h"
+ #include "regex/regex.h"
  #include "storage/fd.h"
  #include "utils/flatfiles.h"
  #include "utils/guc.h"
*************** parse_ident_usermap(List *line, int line_number, const char *usermap_name,
*** 1325,1344 ****
  		token = lfirst(line_item);
  		file_pgrole = token;
  
  	/* Match? */
! 	if (case_insensitive)
  	{
! 		if (strcmp(file_map, usermap_name) == 0 &&
! 			pg_strcasecmp(file_pgrole, pg_role) == 0 &&
! 			pg_strcasecmp(file_ident_user, ident_user) == 0)
! 			*found_p = true;
  	}
  	else
  	{
! 		if (strcmp(file_map, usermap_name) == 0 &&
! 			strcmp(file_pgrole, pg_role) == 0 &&
! 			strcmp(file_ident_user, ident_user) == 0)
! 			*found_p = true;
  	}
  
  	return;
--- 1326,1453 ----
  		token = lfirst(line_item);
  		file_pgrole = token;
  
+ 	if (strcmp(file_map, usermap_name) != 0)
+ 		/* Line does not match the map name we're looking for, so just abort */
+ 		return;
+ 
  	/* Match? */
! 	if (file_ident_user[0] == '/')
  	{
! 		/*
! 		 * When the system username starts with a slash, treat it as a
! 		 * regular expression.  In this case, we process the system username
! 		 * as a regular expression that returns exactly one match.  This is
! 		 * substituted for \1 in the database username string, if present.
! 		 */
! 		int			r;
! 		regex_t		re;
! 		regmatch_t	matches[2];
! 		pg_wchar   *wstr;
! 		int			wlen;
! 		char	   *ofs;
! 		char	   *regexp_pgrole;
! 
! 		wstr = palloc((strlen(file_ident_user + 1) + 1) * sizeof(pg_wchar));
! 		wlen = pg_mb2wchar_with_len(file_ident_user + 1, wstr,
! 									strlen(file_ident_user + 1));
! 
! 		/*
! 		 * XXX: Major room for optimization: regexps could be compiled when
! 		 * the file is loaded and then re-used in every connection.
! 		 */
! 		r = pg_regcomp(&re, wstr, wlen, REG_ADVANCED);
! 		if (r)
! 		{
! 			char		errstr[100];
! 
! 			pg_regerror(r, &re, errstr, sizeof(errstr));
! 			ereport(ERROR,
! 					(errcode(ERRCODE_INVALID_REGULAR_EXPRESSION),
! 					 errmsg("invalid regular expression '%s': %s",
! 							file_ident_user + 1, errstr)));
! 
! 			pfree(wstr);
! 			*error_p = true;
! 			return;
! 		}
! 		pfree(wstr);
! 
! 		wstr = palloc((strlen(ident_user) + 1) * sizeof(pg_wchar));
! 		wlen = pg_mb2wchar_with_len(ident_user, wstr, strlen(ident_user));
! 
! 		r = pg_regexec(&re, wstr, wlen, 0, NULL, 2, matches, 0);
! 		if (r)
! 		{
! 			char		errstr[100];
! 
! 			if (r != REG_NOMATCH)
! 			{
! 				/* REG_NOMATCH is not an error, everything else is */
! 				pg_regerror(r, &re, errstr, sizeof(errstr));
! 				ereport(ERROR,
! 						(errcode(ERRCODE_INVALID_REGULAR_EXPRESSION),
! 						 errmsg("regular expression match for '%s' failed: %s",
! 								file_ident_user + 1, errstr)));
! 				*error_p = true;
! 			}
! 
! 			pfree(wstr);
! 			pg_regfree(&re);
! 			return;
! 		}
! 		pfree(wstr);
! 
! 		if ((ofs = strstr(file_pgrole, "\\1")) != NULL)
! 		{
! 			/* substitution of the first argument requested */
! 			if (matches[1].rm_so < 0)
! 				ereport(ERROR,
! 						(errcode(ERRCODE_INVALID_REGULAR_EXPRESSION),
! 						 errmsg("regular expression '%s' has no subexpressions as requested by backreference in '%s'",
! 								file_ident_user + 1, file_pgrole)));
! 			/* length: original length minus length of \1, plus length of match, plus terminator */
! 			regexp_pgrole = palloc0(strlen(file_pgrole) - 2 +
! 									(matches[1].rm_eo - matches[1].rm_so) + 1);
! 			strncpy(regexp_pgrole, file_pgrole, (ofs - file_pgrole));
! 			memcpy(regexp_pgrole + strlen(regexp_pgrole),
! 				   ident_user + matches[1].rm_so,
! 				   matches[1].rm_eo - matches[1].rm_so);
! 			strcat(regexp_pgrole, ofs + 2);
! 		}
! 		else
! 		{
! 			/* no substitution, so copy the match */
! 			regexp_pgrole = pstrdup(file_pgrole);
! 		}
! 
! 		pg_regfree(&re);
! 
! 		/* now check if the username actually matched what the user is trying to connect as */
! 		if (case_insensitive)
! 		{
! 			if (pg_strcasecmp(regexp_pgrole, pg_role) == 0)
! 				*found_p = true;
! 		}
! 		else
! 		{
! 			if (strcmp(regexp_pgrole, pg_role) == 0)
! 				*found_p = true;
! 		}
! 		pfree(regexp_pgrole);
! 
! 		return;
! 	}
! 	else
! 	{
! 		/* Not regular expression, so make complete match */
! 		if (case_insensitive)
! 		{
! 			if (pg_strcasecmp(file_pgrole, pg_role) == 0 &&
! 				pg_strcasecmp(file_ident_user, ident_user) == 0)
! 				*found_p = true;
! 		}
! 		else
! 		{
! 			if
Re: [HACKERS] current head fails regression tests on mac osx
hmm, how could I miss that. but than: thickbook:src gj$ ls -lahd /Users/gj/Projects/postgres-head/pgsql/src/ test/regress/results/ drwxr-xr-x 119 gj staff 4,0K 28 paź 19:14 /Users/gj/Projects/ postgres-head/pgsql/src/test/regress/results/ thickbook:src gj$ touch /Users/gj/Projects/postgres-head/pgsql/src/ test/regress/results/onek.data thickbook:src gj$ ls -lah /Users/gj/Projects/postgres-head/pgsql/src/ test/regress/results/ total 6728 drwxr-xr-x 120 gj staff 4,0K 28 paź 20:03 . drwxr-xr-x 29 gj staff 986B 28 paź 19:14 .. -rw-r--r--1 gj staff 4,9K 28 paź 19:14 abstime.out -rw-r--r--1 gj staff 9,4K 28 paź 19:14 aggregates.out -rw-r--r--1 gj staff57K 28 paź 19:14 alter_table.out -rw-r--r--1 gj staff40K 28 paź 19:14 arrays.out -rw-r--r--1 gj staff16K 28 paź 19:14 bit.out -rw-r--r--1 gj staff 1,4K 28 paź 19:14 bitmapops.out -rw-r--r--1 gj staff 7,3K 28 paź 19:14 boolean.out -rw-r--r--1 gj staff 4,8K 28 paź 19:14 box.out -rw-r--r--1 gj staff 1,6K 28 paź 19:14 btree_index.out -rw-r--r--1 gj staff 6,0K 28 paź 19:14 case.out -rw-r--r--1 gj staff 2,1K 28 paź 19:14 char.out -rw-r--r--1 gj staff 2,6K 28 paź 19:14 circle.out -rw-r--r--1 gj staff20K 28 paź 19:14 cluster.out -rw-r--r--1 gj staff 3,6K 28 paź 19:14 combocid.out -rw-r--r--1 gj staff 1,3K 28 paź 19:14 comments.out -rw-r--r--1 gj staff12K 28 paź 19:14 constraints.out -rw-r--r--1 gj staff14K 28 paź 19:14 conversion.out -rw-r--r--1 gj staff 5,2K 28 paź 19:14 copy.out -rw-r--r--1 gj staff 6,6K 28 paź 19:14 copy2.out -rw-r--r--1 gj staff 2,6K 28 paź 19:14 copyselect.out -rw-r--r--1 gj staff 1,4K 28 paź 19:14 create_aggregate.out -rw-r--r--1 gj staff 3,2K 28 paź 19:14 create_function_1.out -rw-r--r--1 gj staff 1,9K 28 paź 19:14 create_function_2.out -rw-r--r--1 gj staff46K 28 paź 19:14 create_index.out -rw-r--r--1 gj staff 5,0K 28 paź 19:14 create_misc.out -rw-r--r--1 gj staff 663B 28 paź 19:14 create_operator.out -rw-r--r--1 gj staff 3,5K 28 paź 19:14 create_table.out -rw-r--r--1 gj staff 3,4K 28 paź 19:14 
create_type.out -rw-r--r--1 gj staff 9,4K 28 paź 19:14 create_view.out -rw-r--r--1 gj staff22K 28 paź 19:14 date.out -rw-r--r--1 gj staff 1,0K 28 paź 19:14 delete.out -rw-r--r--1 gj staff 5,1K 28 paź 19:14 dependency.out -rw-r--r--1 gj staff17K 28 paź 19:14 domain.out -rw-r--r--1 gj staff 3,1K 28 paź 19:14 drop_if_exists.out -rw-r--r--1 gj staff 7,2K 28 paź 19:14 enum.out -rw-r--r--1 gj staff13K 28 paź 19:14 errors.out -rw-r--r--1 gj staff 6,8K 28 paź 19:14 float4.out -rw-r--r--1 gj staff13K 28 paź 19:14 float8.out -rw-r--r--1 gj staff53K 28 paź 19:14 foreign_key.out -rw-r--r--1 gj staff20K 28 paź 19:14 geometry.out -rw-r--r--1 gj staff12K 28 paź 19:14 guc.out -rw-r--r--1 gj staff 4,0K 28 paź 19:14 hash_index.out -rw-r--r--1 gj staff 141K 28 paź 19:14 horology.out -rw-r--r--1 gj staff20K 28 paź 19:14 inet.out -rw-r--r--1 gj staff28K 28 paź 19:14 inherit.out -rw-r--r--1 gj staff 2,2K 28 paź 19:14 insert.out -rw-r--r--1 gj staff 6,1K 28 paź 19:14 int2.out -rw-r--r--1 gj staff 8,0K 28 paź 19:14 int4.out -rw-r--r--1 gj staff28K 28 paź 19:14 int8.out -rw-r--r--1 gj staff15K 28 paź 19:14 interval.out -rw-r--r--1 gj staff69K 28 paź 19:14 join.out -rw-r--r--1 gj staff 1,1M 28 paź 19:14 largeobject.out -rw-r--r--1 gj staff 2,1K 28 paź 19:14 limit.out -rw-r--r--1 gj staff 1,6K 28 paź 19:14 lseg.out -rw-r--r--1 gj staff 3,1K 28 paź 19:14 macaddr.out -rw-r--r--1 gj staff15K 28 paź 19:14 misc.out -rw-r--r--1 gj staff 2,8K 28 paź 19:14 money.out -rw-r--r--1 gj staff 4,8K 28 paź 19:14 name.out -rw-r--r--1 gj staff 1,3K 28 paź 19:14 namespace.out -rw-r--r--1 gj staff57K 28 paź 19:14 numeric.out -rw-r--r--1 gj staff 3,4K 28 paź 19:14 numerology.out -rw-r--r--1 gj staff 3,6K 28 paź 19:14 oid.out -rw-r--r--1 gj staff24K 28 paź 19:14 oidjoins.out -rw-r--r--1 gj staff 0B 28 paź 20:03 onek.data -rw-r--r--1 gj staff41K 28 paź 19:14 opr_sanity.out -rw-r--r--1 gj staff 2,2K 28 paź 19:14 path.out -rw-r--r--1 gj staff 6,1K 28 paź 19:14 plancache.out -rw-r--r--1 gj staff 111K 28 paź 19:14 
plpgsql.out -rw-r--r--1 gj staff 7,9K 28 paź 19:14 point.out -rw-r--r--1 gj staff 4,5K 28 paź 19:14 polygon.out -rw-r--r--1 gj staff24K 28 paź 19:14 polymorphism.out -rw-r--r--1 gj staff91K 28 paź 19:14 portals.out -rw-r--r--1 gj staff 5,3K 28 paź 19:14 portals_p2.out -rw-r--r--1 gj staff12K 28 paź 19:14 prepare.out -rw-r--r--1 gj staff 4,1K 28 paź 19:14 prepared_xacts.out -rw-r--r--1 gj staff17K 28
[HACKERS] don't use MAKE_PTR/OFFSET for shmem pointers
Since we require every process to map the shared memory region to the same address, we don't need the MAKE_PTR/OFFSET code that was needed when that was not the case. This patch makes shared memory pointers just like regular pointers.

http://archives.postgresql.org/pgsql-general/2007-08/msg01510.php

Kris Jurka

*** a/src/backend/access/transam/twophase.c
--- b/src/backend/access/transam/twophase.c
*************** typedef struct GlobalTransactionData
*** 122,128 ****
  typedef struct TwoPhaseStateData
  {
  	/* Head of linked list of free GlobalTransactionData structs */
! 	SHMEM_OFFSET freeGXacts;

  	/* Number of valid prepXacts entries. */
  	int			numPrepXacts;
--- 122,128 ----
  typedef struct TwoPhaseStateData
  {
  	/* Head of linked list of free GlobalTransactionData structs */
! 	void	   *freeGXacts;

  	/* Number of valid prepXacts entries. */
  	int			numPrepXacts;
*************** TwoPhaseShmemInit(void)
*** 184,190 ****
  	int			i;

  	Assert(!found);
! 	TwoPhaseState->freeGXacts = INVALID_OFFSET;
  	TwoPhaseState->numPrepXacts = 0;

  	/*
--- 184,190 ----
  	int			i;

  	Assert(!found);
! 	TwoPhaseState->freeGXacts = NULL;
  	TwoPhaseState->numPrepXacts = 0;

  	/*
*************** TwoPhaseShmemInit(void)
*** 197,203 ****
  	for (i = 0; i < max_prepared_xacts; i++)
  	{
  		gxacts[i].proc.links.next = TwoPhaseState->freeGXacts;
! 		TwoPhaseState->freeGXacts = MAKE_OFFSET(&gxacts[i]);
  	}
  }
  else
--- 197,203 ----
  	for (i = 0; i < max_prepared_xacts; i++)
  	{
  		gxacts[i].proc.links.next = TwoPhaseState->freeGXacts;
! 		TwoPhaseState->freeGXacts = &gxacts[i];
  	}
  }
  else
*************** MarkAsPreparing(TransactionId xid, const char *gid,
*** 243,249 ****
  		TwoPhaseState->prepXacts[i] = TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts];
  		/* and put it back in the freelist */
  		gxact->proc.links.next = TwoPhaseState->freeGXacts;
! 		TwoPhaseState->freeGXacts = MAKE_OFFSET(gxact);
  		/* Back up index count too, so we don't miss scanning one */
  		i--;
  	}
--- 243,249 ----
  		TwoPhaseState->prepXacts[i] = TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts];
  		/* and put it back in the freelist */
  		gxact->proc.links.next = TwoPhaseState->freeGXacts;
! 		TwoPhaseState->freeGXacts = gxact;
  		/* Back up index count too, so we don't miss scanning one */
  		i--;
  	}
*************** MarkAsPreparing(TransactionId xid, const char *gid,
*** 263,275 ****
  	}

  	/* Get a free gxact from the freelist */
! 	if (TwoPhaseState->freeGXacts == INVALID_OFFSET)
  		ereport(ERROR,
  				(errcode(ERRCODE_OUT_OF_MEMORY),
  				 errmsg("maximum number of prepared transactions reached"),
  				 errhint("Increase max_prepared_transactions (currently %d).",
  						 max_prepared_xacts)));
! 	gxact = (GlobalTransaction) MAKE_PTR(TwoPhaseState->freeGXacts);
  	TwoPhaseState->freeGXacts = gxact->proc.links.next;

  	/* Initialize it */
--- 263,275 ----
  	}

  	/* Get a free gxact from the freelist */
! 	if (TwoPhaseState->freeGXacts == NULL)
  		ereport(ERROR,
  				(errcode(ERRCODE_OUT_OF_MEMORY),
  				 errmsg("maximum number of prepared transactions reached"),
  				 errhint("Increase max_prepared_transactions (currently %d).",
  						 max_prepared_xacts)));
! 	gxact = (GlobalTransaction) TwoPhaseState->freeGXacts;
  	TwoPhaseState->freeGXacts = gxact->proc.links.next;

  	/* Initialize it */
*************** RemoveGXact(GlobalTransaction gxact)
*** 452,458 ****
  	/* and put it back in the freelist */
  	gxact->proc.links.next = TwoPhaseState->freeGXacts;
! 	TwoPhaseState->freeGXacts = MAKE_OFFSET(gxact);

  	LWLockRelease(TwoPhaseStateLock);
--- 452,458 ----
  	/* and put it back in the freelist */
  	gxact->proc.links.next = TwoPhaseState->freeGXacts;
! 	TwoPhaseState->freeGXacts = gxact;

  	LWLockRelease(TwoPhaseStateLock);
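To see what the patch buys, here is a self-contained toy freelist written in the new style; the comments show what each line looked like in the old MAKE_OFFSET/MAKE_PTR style. The GXact struct and function names are invented for illustration and are not PostgreSQL's:

```c
/* Toy freelist in the post-patch style: plain pointers in (notionally)
 * shared memory. Comments show the pre-patch offset-based equivalents. */
#include <stddef.h>

typedef struct GXact
{
    void   *next;                      /* was a SHMEM_OFFSET in the old style */
    int     id;
} GXact;

static GXact pool[4];                  /* stands in for the shared array */
static void *freeGXacts;               /* was: SHMEM_OFFSET freeGXacts */

void
gxact_init(void)
{
    freeGXacts = NULL;                 /* was: freeGXacts = INVALID_OFFSET */
    for (int i = 0; i < 4; i++)
    {
        pool[i].id = i;
        pool[i].next = freeGXacts;
        freeGXacts = &pool[i];         /* was: MAKE_OFFSET(&pool[i]) */
    }
}

GXact *
gxact_alloc(void)
{
    if (freeGXacts == NULL)            /* was: == INVALID_OFFSET */
        return NULL;
    GXact *g = (GXact *) freeGXacts;   /* was: MAKE_PTR(freeGXacts) */
    freeGXacts = g->next;
    return g;
}

void
gxact_free(GXact *g)
{
    g->next = freeGXacts;
    freeGXacts = g;                    /* was: MAKE_OFFSET(g) */
}
```

The simplification only holds because, as the patch notes, every backend maps the segment at the same address; with differing attach addresses the offsets were load-bearing.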
Re: [HACKERS] Re: [COMMITTERS] pgsql: Rework subtransaction commit protocol for hot standby.
On Tue, 2008-10-28 at 15:52 -0400, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: On Tue, 2008-10-28 at 14:43 -0400, Tom Lane wrote: Doing nothing until release is certainly not an option ;-). But AFAICS this is not a showstopper for development so I was intending to wait until commitfest starts before doing anything with it. Just uncomfortable leaving known bugs in the tree, that's all. Feel free to stick something about it on the open items list http://wiki.postgresql.org/wiki/PostgreSQL_8.4_Open_Items if you want to make doubly sure it's not forgotten about. You're pulling my chain. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: Rework subtransaction commit protocol for hot standby.
Simon Riggs wrote: On Tue, 2008-10-28 at 15:52 -0400, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: On Tue, 2008-10-28 at 14:43 -0400, Tom Lane wrote: Doing nothing until release is certainly not an option ;-). But AFAICS this is not a showstopper for development so I was intending to wait until commitfest starts before doing anything with it. Just uncomfortable leaving known bugs in the tree, that's all. Feel free to stick something about it on the open items list http://wiki.postgresql.org/wiki/PostgreSQL_8.4_Open_Items if you want to make doubly sure it's not forgotten about. You're pulling my chain. :-) The new entry seems too vague to be useful ... -- Alvaro Herrerahttp://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Any reason to have heap_(de)formtuple?
Kris Jurka wrote: On Thu, 23 Oct 2008, Kris Jurka wrote: The problem with trying to deprecate it is that the vast majority of the backend is still using the old interfaces, so people looking for inspiration for their external modules will likely end up using the old interface. Like Alvaro I started this conversion a while ago, got bored, and forgot about it. If people do want this conversion done while keeping the old interface around, I can track down that patch, update it and finish it up for the next CommitFest. Here's a patch that changes everything over to the new API and implements the old API by calling the new API. It seems OK to me. I have only one comment: I would prefer to pfree the memory allocated for the temporary nulls array. A caller could call the old API many times without a memory context cleanup. Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PostgreSQL + Replicator developer meeting 10/28
On Tue, 2008-10-28 at 19:46 +0200, Hannu Krosing wrote: * Is there a mailing list for replicator ? https://lists.commandprompt.com/mailman/listinfo/replicator-general Regards, -- Devrim GÜNDÜZ, RHCE devrim~gunduz.org, devrim~PostgreSQL.org, devrim.gunduz~linux.org.tr http://www.gunduz.org signature.asc Description: This is a digitally signed message part
Re: [HACKERS] Re: [COMMITTERS] pgsql: Rework subtransaction commit protocol for hot standby.
On Tue, 2008-10-28 at 17:36 -0300, Alvaro Herrera wrote: Simon Riggs wrote: On Tue, 2008-10-28 at 15:52 -0400, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: On Tue, 2008-10-28 at 14:43 -0400, Tom Lane wrote: Doing nothing until release is certainly not an option ;-). But AFAICS this is not a showstopper for development so I was intending to wait until commitfest starts before doing anything with it. Just uncomfortable leaving known bugs in the tree, that's all. Feel free to stick something about it on the open items list http://wiki.postgresql.org/wiki/PostgreSQL_8.4_Open_Items if you want to make doubly sure it's not forgotten about. You're pulling my chain. :-) The new entry seems too vague to be useful ... I'll correct it before release. ;-) -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] contrib/pg_stat_statements
ITAGAKI Takahiro wrote: It might be possible to add a queryText field into QueryDesc and have all of the code using QueryDesc initialize the field with its own query text, but it requires modifications in many different modules and copying the query text might be needed. I don't want to do it only for this contrib module, so I'll support only top-level statements at the moment. Sorry.

Indeed, this turned out to be more difficult than I initially thought. Nevertheless I still think that tracking nested statements is worth the effort. Attached is a patch that adds sourceText to QueryDesc (should apply cleanly after auto_explain.patch). This is very WIP; for one thing it assumes that the source text pointers can be passed around freely. But is the idea of extending QueryDesc generally acceptable? Is it OK to make a copy of the query string? I tested with a modified pg_stat_statements (removed the toplevel checks), and much to my delight it didn't crash immediately :) I've only done some limited testing but a lot of interesting stuff seems to turn up -- for instance we get to see the lookups generated by referential integrity checks (could help to identify missing indexes on FK fields). Regards, Martin

*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
*************** DoCopy(const CopyStmt *stmt, const char *queryString)
*** 1056,1062 ****
  	/* Create a QueryDesc requesting no output */
  	cstate->queryDesc = CreateQueryDesc(plan, GetActiveSnapshot(),
  										InvalidSnapshot,
! 										dest, NULL, false);

  	/*
  	 * Call ExecutorStart to prepare the plan for execution.
--- 1056,1062 ----
  	/* Create a QueryDesc requesting no output */
  	cstate->queryDesc = CreateQueryDesc(plan, GetActiveSnapshot(),
  										InvalidSnapshot,
! 										dest, NULL, false, queryString);

  	/*
  	 * Call ExecutorStart to prepare the plan for execution.
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** ExplainOneQuery(Query *query, ExplainStmt *stmt, const char *queryString,
*** 172,178 ****
  		plan = pg_plan_query(query, 0, params);

  		/* run it (if needed) and produce output */
! 		ExplainOnePlan(plan, params, stmt, tstate);
  	}
  }
--- 172,178 ----
  		plan = pg_plan_query(query, 0, params);

  		/* run it (if needed) and produce output */
! 		ExplainOnePlan(plan, params, stmt, tstate, queryString);
  	}
  }
*************** ExplainOneUtility(Node *utilityStmt, ExplainStmt *stmt,
*** 219,225 ****
   */
  void
  ExplainOnePlan(PlannedStmt *plannedstmt, ParamListInfo params,
! 			   ExplainStmt *stmt, TupOutputState *tstate)
  {
  	QueryDesc  *queryDesc;
  	instr_time	starttime;
--- 219,226 ----
   */
  void
  ExplainOnePlan(PlannedStmt *plannedstmt, ParamListInfo params,
! 			   ExplainStmt *stmt, TupOutputState *tstate,
! 			   const char *queryString)
  {
  	QueryDesc  *queryDesc;
  	instr_time	starttime;
*************** ExplainOnePlan(PlannedStmt *plannedstmt, ParamListInfo params,
*** 237,243 ****
  	queryDesc = CreateQueryDesc(plannedstmt,
  								GetActiveSnapshot(), InvalidSnapshot,
  								None_Receiver, params,
! 								stmt->analyze);

  	INSTR_TIME_SET_CURRENT(starttime);
--- 238,244 ----
  	queryDesc = CreateQueryDesc(plannedstmt,
  								GetActiveSnapshot(), InvalidSnapshot,
  								None_Receiver, params,
! 								stmt->analyze, queryString);

  	INSTR_TIME_SET_CURRENT(starttime);
*** a/src/backend/commands/prepare.c
--- b/src/backend/commands/prepare.c
*************** ExplainExecuteQuery(ExecuteStmt *execstmt, ExplainStmt *stmt,
*** 697,703 ****
  			pstmt->intoClause = execstmt->into;
  		}

! 		ExplainOnePlan(pstmt, paramLI, stmt, tstate);
  	}
  	else
  	{
--- 697,703 ----
  			pstmt->intoClause = execstmt->into;
  		}

! 		ExplainOnePlan(pstmt, paramLI, stmt, tstate, queryString);
  	}
  	else
  	{
*** a/src/backend/executor/functions.c
--- b/src/backend/executor/functions.c
*************** postquel_start(execution_state *es, SQLFunctionCachePtr fcache)
*** 309,320 ****
  		es->qd = CreateQueryDesc((PlannedStmt *) es->stmt,
  								 snapshot, InvalidSnapshot,
  								 None_Receiver,
! 								 fcache->paramLI, false);
  	else
  		es->qd = CreateUtilityQueryDesc(es->stmt,
  										snapshot,
  										None_Receiver,
! 										fcache->paramLI);

  	/* We assume we don't need to set up ActiveSnapshot for ExecutorStart */
--- 309,320 ----
  		es->qd = CreateQueryDesc((PlannedStmt *) es->stmt,
  								 snapshot, InvalidSnapshot,
  								 None_Receiver,
! 								 fcache->paramLI, false, fcache->src);
  	else
  		es->qd = CreateUtilityQueryDesc(es->stmt,
  										snapshot,
  										None_Receiver,
! 										fcache->paramLI, fcache->src);

  	/* We assume we don't need to set up ActiveSnapshot for ExecutorStart */
*** a/src/backend/executor/spi.c
--- b/src/backend/executor/spi.c
***************
*** 1795,1801 ****
_SPI_execute_plan(SPIPlanPtr plan, ParamListInfo paramLI,
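The shape of the change is easy to see in isolation: one extra constructor argument carried into the descriptor so that hooks running during (possibly nested) execution can read it. A minimal sketch, with an illustrative two-field struct rather than PostgreSQL's real QueryDesc:

```c
/* Sketch of the sourceText-threading pattern from the patch above.
 * The struct and constructor here are stand-ins, not the real QueryDesc. */
#include <stdlib.h>

typedef struct QueryDesc
{
    int         doInstrument;
    const char *sourceText;    /* new field: the originating query string */
} QueryDesc;

QueryDesc *
CreateQueryDesc(int doInstrument, const char *queryString)
{
    QueryDesc *qd = malloc(sizeof(QueryDesc));

    qd->doInstrument = doInstrument;
    /* The patch passes the pointer through unchanged; as noted in the
     * mail, a copy would be needed if the caller's string could be
     * freed before execution finishes. */
    qd->sourceText = queryString;
    return qd;
}
```

Every call site then grows one argument (queryString, fcache->src, ...), which is exactly the mechanical churn the hunks above show.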
Re: [HACKERS] Any reason to have heap_(de)formtuple?
On Tue, 28 Oct 2008, Zdenek Kotala wrote: Kris Jurka wrote: Here's a patch that changes everything over to the new API and implements the old API by calling the new API. It seems OK to me. I have only one comment: I would prefer to pfree the memory allocated for the temporary nulls array. A caller could call the old API many times without a memory context cleanup.

Here's an incremental patch to add the suggested pfreeing.

Kris Jurka

*** a/src/backend/access/common/heaptuple.c
--- b/src/backend/access/common/heaptuple.c
*************** heap_formtuple(TupleDesc tupleDescriptor,
*** 801,806 ****
--- 801,808 ----
  	tuple = heap_form_tuple(tupleDescriptor, values, boolNulls);

+ 	pfree(boolNulls);
+
  	return tuple;
  }
*************** heap_modifytuple(HeapTuple tuple,
*** 894,899 ****
--- 896,902 ----
  {
  	int			numberOfAttributes = tupleDesc->natts;
  	int			attnum;
+ 	HeapTuple	result;
  	bool	   *replNulls = (bool *) palloc(numberOfAttributes * sizeof(bool));
  	bool	   *replActions = (bool *) palloc(numberOfAttributes * sizeof(bool));
*************** heap_modifytuple(HeapTuple tuple,
*** 903,909 ****
  		replActions[attnum] = replCharActions[attnum] == 'r';
  	}

! 	return heap_modify_tuple(tuple, tupleDesc, replValues, replNulls, replActions);
  }

  /*
--- 906,917 ----
  		replActions[attnum] = replCharActions[attnum] == 'r';
  	}

! 	result = heap_modify_tuple(tuple, tupleDesc, replValues, replNulls, replActions);
!
! 	pfree(replNulls);
! 	pfree(replActions);
!
! 	return result;
  }

  /*
*************** heap_deformtuple(HeapTuple tuple,
*** 1051,1056 ****
--- 1059,1066 ----
  		charNulls[attnum] = boolNulls[attnum] ? 'n' : ' ';
  	}

+ 	pfree(boolNulls);
+
  }

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
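The pattern under review — an old 'n'/' ' char-based API converting to the new bool-based API and now pfree'ing its scratch array — can be shown in miniature. This is an illustrative standalone analogue, not heaptuple.c itself; palloc/pfree are mocked with malloc/free:

```c
#include <stdbool.h>
#include <stdlib.h>

#define palloc(sz) malloc(sz)
#define pfree(p)   free(p)

/* "new"-style API: null flags as a bool array */
int
count_nulls(int natts, const bool *nulls)
{
    int n = 0;

    for (int i = 0; i < natts; i++)
        if (nulls[i])
            n++;
    return n;
}

/* "old"-style wrapper: 'n' means null, anything else means not null.
 * It converts, delegates, and frees its temporary array -- the leak the
 * incremental patch above plugs in heap_formtuple() and friends. */
int
count_nulls_compat(int natts, const char *charNulls)
{
    bool   *boolNulls = palloc(natts * sizeof(bool));
    int     result;

    for (int i = 0; i < natts; i++)
        boolNulls[i] = (charNulls[i] == 'n');

    result = count_nulls(natts, boolNulls);
    pfree(boolNulls);           /* without this, repeated calls accumulate
                                 * garbage until the memory context resets */
    return result;
}
```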
Re: [HACKERS] PostgreSQL + Replicator developer meeting 10/28
On Tue, 2008-10-28 at 11:01 -0700, Joshua Drake wrote: On Tue, 28 Oct 2008 19:46:42 +0200 Hannu Krosing [EMAIL PROTECTED] wrote: On Mon, 2008-10-27 at 13:42 -0700, Joshua Drake wrote: With the recent open sourcing of Replicator, the team has been trying to come up with ways to ensure an open development process. In that light we have decided to have our first release 1.9 meeting on Freenode. All people interested in participating in a discussion about the upcoming Replicator 1.9 are welcome to attend. I missed the meeting at #replicator, but still have some questions Meeting is still going on. Heh. Seems I underestimated the time difference :) I will answer your questions and post results this afternoon. Thanks! Hannu -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal of PITR performance improvement for 8.4.
Gregory Stark wrote: Simon Riggs [EMAIL PROTECTED] writes: I'm happy with the idea of a readahead process. I thought we were implementing a BackgroundReader process for other uses. Is that dead now? You and Bruce seem to keep resurrecting that idea. I've never liked it -- I always hated that in Oracle and thought it was a terrible kludge. I didn't think I was promoting the separate reader process after you had the posix_fadvise() idea. -- Bruce Momjian [EMAIL PROTECTED]http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: Rework subtransaction commit protocol for hot standby.
Simon Riggs [EMAIL PROTECTED] writes: I'm a little unclear as to what is happening with this. We have a bug in CVS HEAD, so clearly doing nothing shouldn't really be an option. Doing nothing until release is certainly not an option ;-). But AFAICS this is not a showstopper for development so I was intending to wait until commitfest starts before doing anything with it. Is it blocking some other progress for you? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal of PITR performance improvement for 8.4.
On Tue, 2008-10-28 at 17:40 -0400, Bruce Momjian wrote: Gregory Stark wrote: Simon Riggs [EMAIL PROTECTED] writes: I'm happy with the idea of a readahead process. I thought we were implementing a BackgroundReader process for other uses. Is that dead now? You and Bruce seem to keep resurrecting that idea. I've never liked it -- I always hated that in Oracle and thought it was a terrible kludge. I didn't think I was promoting the separate reader process after you had the posix_fadvise() idea. I think Greg is misinterpreting our occasional lack of exactness as disagreement. The end solution is the goal, not any of the discussed mechanisms. It's always good to have a name for it that sums up the goals rather than the methods e.g. frequent update optimisation rather than update-in-place. It would be good if the solutions for normal running and recovery were similar. Greg, please could you look into that? -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] SQL/MED compatible connection manager
Chris Browne wrote: Slony-I does some vaguely similar stuff in its handling of connection paths; here's the schema:

create table @[EMAIL PROTECTED] (
    pa_server    int4,
    pa_client    int4,
    pa_conninfo  text NOT NULL,
    pa_connretry int4,
[snip ...]

I wouldn't be surprised to find there being some value in using something like SQL/MED. Here the pa_conninfo could be replaced with the connection name (actually SERVER). For the complete connection definition a USER MAPPING (eg. remote username and password) is also needed. But that can be fetched by the connection lookup function.

One detail I'll point out, that I'm noticing from an application I'm working on right now. We might want to have something like a db connection data type; here's a prototype I put together:

slonyregress1=# create type dbconn as (port integer, dbname text, username text, password text, ssl boolean);
CREATE TYPE
[snip]
slonyregress1=# select * from dbconns;
 id | db
----+--------------------------------------
  1 | (5432,slonyregress1,slony,secret!,t)
(1 row)

I'm not certain that this is necessarily the right representation, but I think it is possible that we'd want a finer-grained representation than merely a connection string.

Yes -- the server, user mapping and FDW all take generic options. Some of them might be part of the connect string, others could specify hints about how the connection should be handled (driver options etc). DBD-Excel has a particularly nasty example of those. A fixed type would not be able to cover all of them. This is where the SQL/MED stuff can help - all of this complexity can be reduced to a single name, though it adds the additional step of doing the lookup. The dbconns example could be written like this:

test=# create table dbconns (id serial primary key, db regserver);
...
test=# insert into dbconns (db) values ('test');
INSERT 0 1
test=# select * from dbconns;
 id | db
----+-------------
  1 | public.test
(1 row)

And then for the connection details:

test=# select * from pg_get_remote_connection_info('test');
  option  | value
----------+--------
 host     | /tmp
 port     | 6543
 dbname   | foo
 username | bob
 password | secret
(5 rows)

This assumes that there is a server public.test and a user mapping for the session user. The option/value pairs are output by a dummy FDW that just dumps the generic options for the server and user mapping. A smart FDW could turn this into just a connection string. Still, there should probably also be a set of functions for accessing the raw option/value pairs.

regards, Martin

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
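As an illustration of the "smart FDW could turn this into just a connection string" step, here is a hedged sketch that flattens option/value pairs like the ones above into a conninfo-style string. options_to_conninfo() is a made-up helper, and no quoting or escaping of values is attempted:

```c
/* Illustrative flattening of SQL/MED generic options into a conninfo
 * string. Returns a static buffer for brevity (not reentrant); real
 * code would also quote values containing spaces or quotes. */
#include <stdio.h>
#include <string.h>

const char *
options_to_conninfo(int n, const char *keys[], const char *vals[])
{
    static char buf[256];
    size_t      used = 0;

    buf[0] = '\0';
    for (int i = 0; i < n && used < sizeof(buf); i++)
        used += (size_t) snprintf(buf + used, sizeof(buf) - used, "%s%s=%s",
                                  i > 0 ? " " : "", keys[i], vals[i]);
    return buf;
}
```

Feeding it the five pairs from the example would produce "host=/tmp port=6543 dbname=foo username=bob password=secret"; a real FDW would map its option names onto whatever the client library expects.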
Re: [HACKERS] PostgreSQL + Replicator developer meeting 10/28
On Tue, 28 Oct 2008 19:46:42 +0200 Hannu Krosing [EMAIL PROTECTED] wrote: The current topics are: * New MCP architecture What's new ? I have some doubts about the current architecture based on my reading of the replicator wiki, but would like to learn about the new architecture before spending too much time studying the old one.

The two obvious problems with the existing MCP architecture are: 1. Single point of failure 2. Portability The new architecture is set to remove both of those. The short version is that the MCP will be moved into the backend. Thus:

Master -> MCP|Slave -> Slave1
                    -> Slave2
                    -> Slave3

The process being: the Master sends data to the MCP|Slave, the MCP|Slave writes it to disk (optionally restores it) and then forwards it to slaves 1, 2 and 3, who receive the data, write it to disk and then restore it. If the master dies, you can promote any of the slaves, and the remaining slaves will connect to the promoted slave and begin receiving updates. If the MCP|Slave dies, a new slave can take over the MCP|Slave role. Alvaro or Alexey can speak more technically about the implementation than I can.

* DDL Replication Is it there already, or is it just a planned feature ? Planned feature. * How hard would it be to extract the DDL replication part and use it as a basis for DDL after-trigger support, for use in trigger-based replication/auditing like Slony and pgQ/Londiste ? Hmm, I am not sure. We are pretty deep into the core and only use triggers for GRANT/REVOKE/CREATE ROLE.

Sincerely, Joshua D. Drake -- The PostgreSQL Company since 1997: http://www.commandprompt.com/ PostgreSQL Community Conference: http://www.postgresqlconference.org/ United States PostgreSQL Association: http://www.postgresql.us/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PostgreSQL + Replicator developer meeting 10/28
On Tue, 2008-10-28 at 15:18 -0700, Joshua Drake wrote: On Tue, 28 Oct 2008 19:46:42 +0200 Hannu Krosing [EMAIL PROTECTED] wrote: The current topics are: * New MCP architecture What's new ? I have some doubts about the current architecture based on my reading of the replicator wiki, but would like to learn about the new architecture before spending too much time studying the old one. The two obvious problems with the existing MCP architecture are: 1. Single point of failure

For async replication there is always a SPoF; at least the master, until the first slave has acquired the log, is a SPoF. Or do you plan for both the Master and the MCP|Slave to keep the log and be able to step in for each other if the other fails?

2. Portability

Portability to where ? Other DBMSs ? Other PG versions ? For me there were also two more problems:

3. A separate replication log, which at least seems to be able to get out of sync with the main DB. Why don't you just use a DB table, WAL-logged and all?

4. Also, again from reading the Replicator FAQ, it seems that there is a window of corruption/data loss when rotating Replicator's transaction log. I think that doing it with copy/truncate either needs locking the logfile (== bad performance during copy/truncate) or is just a data-eating failure waiting to happen. pgQ has a solution to that by rotating queue tables, and PostgreSQL core also effectively rotates WAL segments. To be robust _and_ effective, Replicator should also rotate the logfile itself.

The new architecture is set to remove both of those. The short version is the MCP will be moved into the backend. Thus:

Master -> MCP|Slave -> Slave1
                    -> Slave2
                    -> Slave3

The process being, Master sends data to MCP|Slave, MCP|Slave writes it to disk (optionally restores it)

Will this first send be sync or async ? Or have you planned for it to be configurable among several robustness-vs-performance levels, similar to the planned integrated WAL shipping?

If async, will it also use MVCC for keeping the log on the Master (like Slony and pgQ do), just to be at least as reliable as PostgreSQL core itself and not require a full resync after a server crash?

and then forwards it to 1,2,3 who then receive the data, write it to disk and then restore it. If master dies, you can promote to any of the slaves and the left over slaves will connect to the promoted slave and begin receiving updates. If the MCP|Slave dies a new Slave can begin the MCP|Slave process. Alvaro or Alexey can speak more technically about implementation than I can.

Alvaro - I guess you have already discussed most of it, but basically you need to solve all the same problems that WAL-shipping based Hot Standby is solving and Slony/pgQ/Londiste have solved. Hopefully you get it more robust than Slony when making changes under high load :)

Will there be a helper application for setting up and configuring changes in replication, or will it all be done using added SQL commands ?

How will DDL be handled (I understood that you don't yet have DDL replication) ?

Will Slave tables be kind-of-read-only like Slony slaves ? Or even _really_ read-only like Simon's Hot Standby ?

* DDL Replication Is it there already, or is it just a planned feature ? Planned feature.

Did the plans get any clearer during this meeting ?

* How hard would it be to extract the DDL replication part and use it as a basis for DDL after-trigger support, for use in trigger-based replication/auditing like Slony and pgQ/Londiste ? Hmm I am not sure. We are pretty deep into the core and only use triggers for GRANT/REVOKE/CREATE ROLE.

By the way, why did you choose the pretty-deep-into-the-core approach instead of triggers ? I mean, you probably end up duplicating (or missing) lots of PostgreSQL's internal goodness instead of just using what is already available ?

Sincerely, Joshua D. Drake

Thanks for the update.
I hope something useful will come out of this too, though I hoped that it already had some advantages over trigger-based replication, like ability to replicate DDL . --- Hannu Krosing -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal of PITR performance improvement for 8.4.
Simon Riggs [EMAIL PROTECTED] writes: On Tue, 2008-10-28 at 17:40 -0400, Bruce Momjian wrote: Gregory Stark wrote: Simon Riggs [EMAIL PROTECTED] writes: I'm happy with the idea of a readahead process. I thought we were implementing a BackgroundReader process for other uses. Is that dead now? You and Bruce seem to keep resurrecting that idea. I've never liked it -- I always hated that in Oracle and thought it was a terrible kludge. I didn't think I was promoting the separate reader process after you had the posix_fadvise() idea. I'm sorry, I thought I remembered you mentioning it again. But perhaps I was thinking of someone else (perhaps it was Simon again?) or perhaps it was before you saw the actual patch. It would be good if the solutions for normal running and recovery were similar. Greg, please could you look into that? I could do the readahead side of things, but what I'm not sure how to arrange is how to restructure the WAL reading logic to read records ahead of the actual replay. I think we would have to maintain two pointers, one for the prefetch and one for the actual running. But the logic for recovery is complex enough that I'm concerned about changing it enough to do that, and whether it can be done without uglifying the code quite a bit. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services! -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] SQL/MED compatible connection manager
On Mon, 2008-10-27 at 16:50 -0400, Chris Browne wrote: [EMAIL PROTECTED] (Martin Pihlak) writes: Tons of details have been omitted, but this should be enough to start discussion. What do you think, does this sound usable? Suggestions, objections? Slony-I does some vaguely similar stuff in its handling of connection paths; here's the schema: I think the whole issue was initially raised by insecurity, as the dblink contrib module exposed connection strings to all users, and SQL/MED was seen as a standard way to hide them. The simple credentials hiding could of course be achieved by having something similar to pg_user/pg_shadow and some SECURITY DEFINER functions for actually opening the connections, but it probably seemed easier to at least base it on standards. So we can actually start with a pg_remote_server table (public), a pg_user_mapping_shadow table (restricted) / pg_user_mapping view (public), and some functions with proper grants, to match the subset that Martin outlined in http://wiki.postgresql.org/wiki/SqlMedConnectionManager Once we've got that working, we could move to massaging it into the parser to provide standard SQL/MED syntax. So I think that first we should agree on functionality and get the few system (?) tables and functions done, and worry about parser changes once the actual functionality is field-tested. -- Hannu Krosing http://www.2ndQuadrant.com PostgreSQL Scalability and Availability Services, Consulting and Training -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Window Functions: v07 APIs and buffering strateties
2008/10/29 Tom Lane [EMAIL PROTECTED]: Martijn van Oosterhout [EMAIL PROTECTED] writes: On Tue, Oct 28, 2008 at 01:50:26PM -0400, Tom Lane wrote: ... So it might be possible to fix by attaching some new precedence level to the ROWS token. Yes. Bison's default is to shift, which means that if you do nothing it will treat ROWS as part of the expression if it makes any sense at all. Given the requirement for a following UNBOUNDED or BETWEEN, the only problem is that you'll get a syntax error if the expr_list ends in a postfix operator, I don't see how you get hidden ambiguity. Hmm, now I see what you meant; that's a little different than what I was envisioning. I was thinking of trying to force a parse decision that would support the windowing syntax, whereas you propose forcing a parse decision that does the opposite, and making the user parenthesize if he's got a conflict. What the choice seems to come down to is making ROWS and RANGE reserved (in some form or other) versus creating a corner case for users of postfix operators. Phrased that way it does seem like the second alternative is better. Hitoshi: you can probably make this happen by including ROWS and RANGE in the %nonassoc IDENT precedence declaration, but you'll want to test to make sure the right things happen. Bison and parsing are quite new to me, so it'll take a little time, but I will try it. One thing: the words that can follow ROWS/RANGE are not only UNBOUNDED and BETWEEN but also CURRENT and an unsigned constant. Even so, does the phrasing approach still have hope? Regards, -- Hitoshi Harada -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
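For illustration, Tom's suggestion above would amount to something like the following in gram.y. This is an untested sketch; the exact set of tokens already sharing the %nonassoc IDENT precedence level is a guess and may differ in the actual grammar:

```yacc
/* Sketch: add ROWS and RANGE to the keyword precedence level used for
 * IDENT, so Bison resolves the shift/reduce conflict by terminating the
 * expression list instead of absorbing a trailing postfix operator. */
%nonassoc	IDENT ROWS RANGE	/* ...plus whatever is already listed here */
```

The effect is that an unparenthesized postfix operator directly before ROWS/RANGE becomes a syntax error the user must parenthesize around, rather than ROWS/RANGE becoming fully reserved words.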
Re: [HACKERS] Proposal of PITR performance improvement for 8.4.
Thanks for a lot of inspiring discussion. Please note that my proposal includes only a few lines of change to the recovery code itself. It does not affect buffer management, the order of WAL record application, etc. The only change needed is to invoke the prefetch feature if redo is going to read WAL which has not yet been handled by the prefetch (the prefetch function returns the last-handled LSN). Before writing the readahead code, I ran several experiments on how posix_fadvise() speeds up random reads, and I found that POSIX_FADV_WILLNEED can improve total read performance by around five times, if we sort the posix_fadvise() calls into block-position order. Without sorting by block position, the improvement ratio was around three times. This result was achieved with a single process, but on a RAID configuration. I'd like to do a similar measurement against a single disk. I'd like to run some benchmarks to clarify the improvement; I agree I should show how my proposal is useful. In terms of influence on the recovery code, pg_readahead just calls posix_fadvise() to tell the operating system to prefetch the data page into the kernel's cache, not PG's shared memory, so we don't have to implement this in PG core code. Because of this, and because I think it is more practical to keep platform-specific code outside the core as much as possible, I wrote most of the prefetch in an external process, which can be made available in contrib or on PgFoundry, perhaps the latter. Heikki suggested having a separate reader process. I think it's a very good idea, but this would change PG's performance dramatically: better in some cases, but possibly even worse in others. I'm not clear on this, so I think the background reader issue should be a challenge for 8.5 or later, and we must call for research work. For now, I think it is reasonable to keep improving the recovery-specific code. I'd like to hear some more about these points. I'm more than happy to write all the code inside PG core to avoid the overhead of creating another process. 
--- Koichi Suzuki 2008/10/29 Gregory Stark [EMAIL PROTECTED]: Simon Riggs [EMAIL PROTECTED] writes: On Tue, 2008-10-28 at 17:40 -0400, Bruce Momjian wrote: Gregory Stark wrote: Simon Riggs [EMAIL PROTECTED] writes: I'm happy with the idea of a readahead process. I thought we were implementing a BackgroundReader process for other uses. Is that dead now? You and Bruce seem to keep resurrecting that idea. I've never liked it -- I always hated that in Oracle and thought it was a terrible kludge. I didn't think I was promoting the separate reader process after you had the posix_fadvise() idea. I'm sorry, I thought I remembered you mentioning it again. But perhaps I was thinking of someone else (perhaps it was Simon again?) or perhaps it was before you saw the actual patch. It would be good if the solutions for normal running and recovery were similar. Greg, please could you look into that? I could do the readahead side of things but what I'm not sure how to arrange is how to restructure the wal reading logic to read records ahead of the actual replay. I think we would have to maintain two pointers one for the prefetch and one for the actual running. But the logic in for recovery is complex enough that I'm concerned about changing it enough to do that and whether it can be done without uglifying the code quite a bit. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services! -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal of PITR performance improvement for 8.4.
Hi, I think we would have to maintain two pointers, one for the prefetch and one for the actual running. But the logic for recovery is complex enough that I'm concerned about changing it enough to do that, and whether it can be done without uglifying the code quite a bit. Yes, this is what the code is doing. The prefetch function returns the LSN up to which prefetch was performed. The redo routine has to call prefetch if it is going to read a WAL record beyond this. -- -- Koichi Suzuki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
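That two-pointer contract can be sketched as follows (types and names are simplifications for illustration, not the actual patch code):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;	/* simplified; real 8.4 LSNs are (logid, xrecoff) */

/* LSN up to which prefetch hints have already been issued. */
static XLogRecPtr prefetched_upto = 0;

/* Stand-in for scanning WAL ahead of replay and issuing posix_fadvise()
 * hints; returns the LSN the prefetch pointer advanced to. */
static XLogRecPtr
do_prefetch(XLogRecPtr from, XLogRecPtr distance)
{
	return from + distance;
}

/*
 * Called by the redo loop before reading the record at replay_lsn.
 * Two pointers are kept: the prefetcher is re-invoked only when replay
 * is about to pass the point already covered by prefetch.  Returns the
 * new prefetch horizon.
 */
XLogRecPtr
maybe_prefetch(XLogRecPtr replay_lsn, XLogRecPtr distance)
{
	if (replay_lsn >= prefetched_upto)
		prefetched_upto = do_prefetch(replay_lsn, distance);
	return prefetched_upto;
}
```

The redo loop itself stays single-threaded; only the read-ahead hint runs ahead of replay.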
Re: [HACKERS] Proposal of PITR performance improvement for 8.4.
Heikki, 1. In the code, all the referenced pages are extracted from the WAL records in scope and sorted, both to schedule the reads and to avoid duplicate posix_fadvise() calls. If a full-page write is included as the first WAL record, that page is not prefetched. There's still some risk of calling posix_fadvise() for pages already in the kernel cache, but I don't think the overhead is so big. I'd like to measure how much improvement we can achieve with some benchmarks. 2. Yes, we have to analyze WAL records using rm-specific code, and I think that is not good. At present, the major part of this comes from xlogdump. Are you proposing to improve the WAL record format to be less rm-specific? That's interesting too. 3. Yes, the code includes handling for all WAL records except those which do not need any read operation during recovery, such as creating new pages with a new record. Improving the WAL record format as in 2. will help to simplify this as well. 2008/10/28 Heikki Linnakangas [EMAIL PROTECTED]: Gregory Stark wrote: Koichi Suzuki [EMAIL PROTECTED] writes: This is my first proposal of PITR performance improvement for PostgreSQL 8.4 development. This proposal includes readahead mechanism of data pages which will be read by redo() routines in the recovery. This is especially effective in the recovery without full page write. Readahead is done by posix_fadvise() as proposed in index scan improvement. Incidentally: a bit of background for anyone who wasn't around when last this came up: prefetching is especially important for our recovery code because it's single-threaded. If you have a RAID array you're effectively limited to using a single drive at a time. This is a major problem because the logs could have been written by many processes hammering the RAID array concurrently. In other words your warm standby database might not be able to keep up with the logs from the master database even on identical (or even better) hardware. Simon (I think?) proposed allowing our recovery code to be multi-threaded. 
Heikki suggested using prefetching. I actually played around with the prefetching, and even wrote a quick prototype of it, about a year ago. It read ahead a fixed number of the WAL records in xlog.c, calling posix_fadvise() for all pages that were referenced in them. I never got around to finishing it, as I wanted to see Greg's posix_fadvise() patch get done first and rely on the same infrastructure, but here are some lessons I learned: 1. You should avoid useless posix_fadvise() calls. In the naive implementation, where you simply call posix_fadvise() for every page referenced in every WAL record, you'll do 1-2 posix_fadvise() syscalls per WAL record, and that's a lot of overhead. We face the same design question as with Greg's patch to use posix_fadvise() to prefetch index and bitmap scans: what should the interface to the buffer manager look like? The simplest approach would be a new function call like AdviseBuffer(Relation, BlockNumber), that calls posix_fadvise() for the page if it's not in the buffer cache, but is a no-op otherwise. But that means more overhead, since for every page access, we need to find the page twice in the buffer cache; once for the AdviseBuffer() call, and a 2nd time for the actual ReadBuffer(). It would be more efficient to pin the buffer in the AdviseBuffer() call already, but that requires many more changes to the callers. 2. The format of each WAL record is different, so you need a readahead handler for every resource manager, for every record type. It would be a lot simpler if there was a standardized way to store that information in the WAL records. 3. IIRC I tried to handle just a few of the most important WAL records at first, but it turned out that you really need to handle all WAL records (that are used at all) before you see any benefit. Otherwise, every time you hit a WAL record that you haven't done posix_fadvise() on, the recovery stalls, and you don't need many of those to diminish the gains. 
I'm not sure how these apply to your approach; it's very different. You seem to handle 1. by collecting all the page references for the WAL file, and sorting and removing the duplicates. I wonder how much CPU time is spent on that? Details of the implementation will be found in the README file in the material. I've read through this and I think I disagree with the idea of using a separate program. It's a lot of extra code -- and duplicated code from the normal recovery too. Agreed, this belongs in core. The nice thing about a separate process is that you could hook it into recovery_command, with no changes to the server, but as you note in the README, we'd want to use this in crash recovery as well, and the interaction between the external command and the startup process seems overly complex for that. Besides, we want to use the posix_fadvise() stuff in the backend anyway, so we should use the same infrastructure during WAL replay as well.
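The no-op-if-cached behavior described for a hypothetical AdviseBuffer() call can be illustrated with a toy model (every name here is made up for illustration; the real buffer mapping is a shared hash table guarded by partition locks):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the shared-buffer lookup table: a tiny linear-scan
 * set of resident block numbers. */
#define NCACHE 64
static long cache[NCACHE];
static bool cache_used[NCACHE];

static bool
buffer_cached(long blkno)
{
	for (size_t i = 0; i < NCACHE; i++)
		if (cache_used[i] && cache[i] == blkno)
			return true;
	return false;
}

/* Simulate a page being read into the buffer cache. */
void
buffer_insert(long blkno)
{
	for (size_t i = 0; i < NCACHE; i++)
		if (!cache_used[i])
		{
			cache[i] = blkno;
			cache_used[i] = true;
			return;
		}
}

/*
 * AdviseBuffer-style call: issue a readahead hint only when the page is
 * not already resident; a no-op otherwise, so hints for cached pages
 * cost only the lookup.  Returns true when a hint was issued.
 */
bool
advise_buffer(long blkno)
{
	if (buffer_cached(blkno))
		return false;			/* already resident: nothing to do */
	/* real code would posix_fadvise() the underlying relation file here */
	return true;
}
```

The double-lookup overhead Heikki mentions is visible here: the subsequent ReadBuffer() equivalent would have to search the cache for the same block a second time.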
Re: [HACKERS] PostgreSQL + Replicator developer meeting 10/28
Hannu Krosing wrote: On Tue, 2008-10-28 at 15:18 -0700, Joshua Drake wrote: The two obvious problems with the existing MCP architecture is: 1. Single point of failure For async replication there is always SPoF, at least the master until first slave has aquired log is a SPoF, or do you plan that both Master and MCP|Slave to keep the log and be able to step in for each other if the other fails? Yeah, with the new architecture there is still going to be a bit of a SPoF in the master-MCP but it's a lot smaller than the current setup, in which if you lose the MCP you basically lose everything. 2. Portability Portability to where ? Other DBMS's ? Other PG versions ? Other operating systems mainly. The trouble is we never got around to porting the MCP to any OS beyond Linux; I think it should work on Solaris and BSDs, but surely not Windows. We want to just get rid of what I consider a (crappy) reimplementation of postmaster; instead we should just let postmaster do the job. Additionally we would get rid of the ugly way we import backend code into the MCP server. for me there was also two more problems: 3. separate replication log, which at least seems to be able to get out of sync with main DB. Why don't you just use a DB table, WAL-logged and all The whole replication log thing is a topic of dissent in the team ;-) 4. Also, again from reading Replicator FAQ, it seems that there is a window of corruption/data loss when rotating the Replicators transaction log. I think that doing it with copy/truncate either needs locking the logfile (== bad performance, during copy/truncate) or is just a data-eating failure waiting to happen. Hmm, what Replicator FAQ? We used to have this copy/truncate problem, and we rearchitected the log to avoid this (we use a rotating setup now) Master-MCP|Slave -Slave1 -Slave2 -Slave3 The process being, Master sends data to MCP|Slave, MCP|Slave writes it to disk (optionally restores it) Will this first send be sync or async ? 
Or have you planned to have it be configurable among several robustness vs. performance levels, similar to the planned integrated WAL shipping? It is async, and we haven't talked about sync. if async, will it also use MVCC for keeping the log on the Master (like Slony and pgQ do), just to be at least as reliable as PostgreSQL core itself and not require a full resync on server crash? You mean WAL? We don't currently. Alvaro or Alexey can speak more technically about implementation than I can. Alvaro - I guess you have already discussed most of it, but basically you need to solve all the same problems that WAL-shipping based Hot Standby is solving and Slony/pgQ/Londiste have solved. If you mean that we're duplicating the effort that's already going elsewhere, my opinion is yes, we are. Hopefully you get it more robust than Slony when making changes under high load :) Hmm, I don't know about lack of robustness in Slony, so I don't know. Will there be a helper application for setting up and configuring changes in replication, or will it all be done using added SQL commands? Well, the interface I work on is all SQL commands :-) How will DDL be handled (I understood that you don't yet have DDL replication)? We don't have it yet. However, since we can just add any code in any place we like, and we have a protocol to transmit changes, it is relatively easy to add calls to collect the needed information and replay it on the slave. Will Slave tables be kind-of-read-only like Slony slaves? Or even _really_ read-only like Simon's Hot Standby? Heh -- they are read only, and they turn into read-write when the slave promotes. I'm not sure what kind does that make it :-) * DDL Replication Is it there already, or is it just a planned feature? Planned feature. Did the plans get any clearer during this meeting? Not really; we didn't talk about that. Hmm I am not sure. We are pretty deep into the core and only use triggers for GRANT/REVOKE/CREATE ROLE. 
By the way, why did you choose the pretty-deep-into-the-core approach instead of triggers? Speed maybe? I don't know. I mean, you probably end up duplicating (or missing) lots of PostgreSQL's internal goodness instead of just using what is already available? Yeah. I hope something useful will come out of this too, though I hoped that it already had some advantages over trigger-based replication, like the ability to replicate DDL. I fear that our approach to replication is so ad-hoc that there's not much to be gained from elsewhere. Replicator is pretty much a fork that's not likely to yield anything useful to upstream. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your
[HACKERS] Checking stack depth
Hello, Why is the check_stack_depth function enforced in the context of evaluating expressions in PostgreSQL? What sort of recursion are we trying to safeguard against? thanks, Suresh
Re: [HACKERS] Checking stack depth
Suresh [EMAIL PROTECTED] writes: Why is check_stack_depth function enforced in context of evaluating expressions in PostgreSQL ? What sort of recursion we are trying to safeguard ? create function foo(int) returns int as $$ select foo($1) $$ language sql; select foo(1); regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
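For illustration, the guard works by measuring how far the current stack frame is from a base address recorded at process start, and erroring out past a limit. This is a simplified stand-alone sketch of the idea behind stack_base_ptr / check_stack_depth, not the backend's actual code:

```c
#include <assert.h>
#include <stddef.h>

/* Address of a local variable near the bottom of the stack, captured at
 * startup.  Comparing addresses of locals in different frames is a
 * portability shortcut the real backend also relies on. */
static char *stack_base;

void
set_stack_base(void)
{
	char		here;

	stack_base = &here;
}

/* Current stack depth in bytes: distance of a fresh local from the
 * recorded base, direction-agnostic since stacks may grow either way. */
size_t
stack_depth(void)
{
	char		here;
	ptrdiff_t	d = stack_base - &here;

	return (size_t) (d < 0 ? -d : d);
}

/*
 * A deliberately recursive evaluator.  Without the depth check, unbounded
 * recursion like Tom's foo() example would overrun the stack and crash
 * the process instead of raising a clean error.
 */
int
eval_depth_limited(int n, size_t limit)
{
	if (stack_depth() > limit)
		return -1;				/* the backend would elog(ERROR, ...) here */
	if (n == 0)
		return 0;
	return eval_depth_limited(n - 1, limit);
}
```

Expression evaluation recurses whenever a function's body evaluates more expressions, so the check in the expression evaluator is what turns Tom's self-calling foo() into a "stack depth limit exceeded" error rather than a segfault.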
Re: [HACKERS] BufferAccessStrategy for bulk insert
And here's the patch, which, based on comments thus far, does the following: - Replaces the use_wal, use_fsm arguments in various places with a single options argument. - Creates a BAS_BULKWRITE buffer access strategy. - Creates a BulkInsertState object so that COPY and CTAS can use BAS_BULKWRITE and also keep the most recent page pinned. Note that the original purpose of this exercise was to implement the optimization that COPY and CTAS would keep the most recent page pinned to avoid repeated pin/unpin cycles. This change shows a small but measurable performance improvement on short rows. The remaining items were added based on reviewer comments. One concern that I have about this approach is that the situation in which people are probably most concerned about COPY performance is restoring a dump. In that case, the COPY will be the only thing running, and using a BufferAccessStrategy is an anti-optimization. I don't think it's a very big effect (any testing anyone can do on real hardware rather than what I have would be appreciated) but I'm sort of unsold on optimizing for what I believe to be the less-common use case. If the consensus is to reverse course on this point I'm happy to rip those changes back out and resubmit; they are a relatively small proportion of the patch. ...Robert On Sun, Oct 26, 2008 at 8:37 PM, Robert Haas [EMAIL PROTECTED] wrote: Seems sane to me. I don't see the point of the HEAP_INSERT_BULK flag bit --- providing or not providing bistate would cover that, and if you have a bit as well then you have to define what the inconsistent combinations mean. I concur with making all-zeroes be the typical state of the flag bits, too. Thanks for the design review. I had thought to make the inconsistent combinations fail an assertion, but I'm just as happy to leave it out altogether. FWIW, we generally declare bitmask flag variables as int, unless there's some really good reason to do otherwise. OK, thanks for the tip. 
...Robert

Index: src/backend/access/heap/heapam.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/access/heap/heapam.c,v
retrieving revision 1.266
diff -c -r1.266 heapam.c
*** src/backend/access/heap/heapam.c	27 Oct 2008 21:50:12 -	1.266
--- src/backend/access/heap/heapam.c	29 Oct 2008 03:25:41 -
***************
*** 1798,1803 ****
--- 1798,1827 ----
  	}
  }
  
+ /*
+  * GetBulkInsertState - set up for a bulk insert
+  */
+ BulkInsertState
+ GetBulkInsertState(void)
+ {
+ 	BulkInsertState bistate;
+ 
+ 	bistate = palloc(sizeof(struct BulkInsertStateData));
+ 	bistate->strategy = GetAccessStrategy(BAS_BULKWRITE);
+ 	bistate->last_pin = InvalidBuffer;
+ 	return bistate;
+ }
+ 
+ /*
+  * FreeBulkInsertState - clean up after finishing a bulk insert
+  */
+ void
+ FreeBulkInsertState(BulkInsertState bistate)
+ {
+ 	if (bistate->last_pin != InvalidBuffer)
+ 		ReleaseBuffer(bistate->last_pin);
+ 	FreeAccessStrategy(bistate->strategy);
+ }
  
  /*
   * heap_insert	- insert tuple into a heap
***************
*** 1805,1821 ****
   * The new tuple is stamped with current transaction ID and the specified
   * command ID.
   *
!  * If use_wal is false, the new tuple is not logged in WAL, even for a
!  * non-temp relation.  Safe usage of this behavior requires that we arrange
!  * that all new tuples go into new pages not containing any tuples from other
!  * transactions, and that the relation gets fsync'd before commit.
   * (See also heap_sync() comments)
   *
!  * use_fsm is passed directly to RelationGetBufferForTuple, which see for
!  * more info.
   *
!  * Note that use_wal and use_fsm will be applied when inserting into the
!  * heap's TOAST table, too, if the tuple requires any out-of-line data.
   *
   * The return value is the OID assigned to the tuple (either here or by the
   * caller), or InvalidOid if no OID.  The header fields of *tup are updated
--- 1829,1846 ----
   * The new tuple is stamped with current transaction ID and the specified
   * command ID.
   *
!  * If the HEAP_INSERT_SKIP_WAL option is supplied, the new tuple is not logged
!  * in WAL, even for a non-temp relation.  Safe usage of this behavior requires
!  * that we arrange that all new tuples go into new pages not containing any
!  * tuples from other transactions, and that the relation gets fsync'd before
!  * commit.
   * (See also heap_sync() comments)
   *
!  * The HEAP_INSERT_SKIP_FSM option is passed directly to
!  * RelationGetBufferForTuple, which see for more info.
   *
!  * Note that options will be applied when inserting into the heap's TOAST
!  * table, too, if the tuple requires any out-of-line data.
   *
   * The return value is the OID assigned to the tuple (either here or by the
   * caller), or InvalidOid if no OID.  The header fields of *tup are updated
***************
*** 1825,1831 ****
   */
  Oid
  heap_insert(Relation relation, HeapTuple tup, CommandId cid,
! 			bool use_wal, bool
Re: [HACKERS] array_agg and array_accum (patch)
It's worth noting that this is the third version of this idea that has been submitted. Ian Caulfield submitted a patch to add this, and so did I. Someone should probably look at all three of them and compare. ...Robert On Mon, Oct 27, 2008 at 1:41 PM, Jeff Davis [EMAIL PROTECTED] wrote: On Mon, 2008-10-27 at 18:47 +0200, Peter Eisentraut wrote: How else will you tell an aggregate function whose result depends on the input order which order you want? The only aggregates defined in the standard where this matters are array_agg, array_accum, and xmlagg, but I don't see array_accum() in the standard; I wrote it just as an alternative to array_agg() because I thought array_agg() ignored NULLs. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers