Re: [HACKERS] Review of: explain / allow collecting row counts without timing info
Tom Lane wrote: > Yeah, I think we need to preserve that property. Unexpectedly > executing query (which may have side-effects) is a very dangerous > thing. People are used to the idea that ANALYZE == execute, and > adding random other flags that also cause execution is going to > burn somebody. +1 FWIW, another reason not to use Robert's suggested syntax is that you get "rows=n" entries with or without the actual run. You just don't get the "actual" block to compare to the estimate. So ROWS as an option would be very ambiguous. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On Sat, Dec 24, 2011 at 03:56:58PM +, Simon Riggs wrote: > > Also, as far as I can see this patch usurps the page version field, > > which I find unacceptably short-sighted. Do you really think this is > > the last page layout change we'll ever make? > > No, I don't. I hope and expect the next page layout change to > reintroduce such a field. > > But since we're agreed now that upgrading is important, changing page > format isn't likely to be happening until we get an online upgrade > process. So future changes are much less likely. If they do happen, we > have some flag bits spare that can be used to indicate later versions. > It's not the prettiest thing in the world, but it's a small ugliness > in return for an important feature. If there was a way without that, I > would have chosen it. Have you considered that the CRC might match a valid page version number? Is that safe? -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] 16-bit page checksums for 9.2
On Sun, Dec 25, 2011 at 04:25:19PM +0100, Martijn van Oosterhout wrote: > On Sat, Dec 24, 2011 at 04:01:02PM +, Simon Riggs wrote: > > On Sat, Dec 24, 2011 at 3:54 PM, Andres Freund wrote: > > > Why don't you use the same tricks as the former patch and copy the buffer, > > > compute the checksum on that, and then write out that copy (you can even > > > do > > > both at the same time). I have a hard time believing that the additional > > > copy > > > is more expensive than the locking. > > > > ISTM we can't write and copy at the same time because the checksum is > > not a trailer field. > > Of course you can. If the checksum is in the trailer field you get the > nice property that the whole block has a constant checksum. However, if > you store the checksum elsewhere you just need to change the checking > algorithm to copy the checksum out, zero those bytes and run the > checksum and compare with the extracted checksum. > > Not pretty, but I don't think it makes a difference in performance. Sorry to be late replying to this, but an interesting idea would be to zero the _hint_ bits before doing the CRC checksum. That would avoid the problem of WAL-logging the hint bits. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
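The two ideas in this exchange — checksumming around a checksum field that is not a trailer, and zeroing hint bits before the CRC — can be sketched together. This is an illustrative sketch only (plain CRC-32, a hypothetical field offset and hypothetical hint-bit positions), not the actual PostgreSQL page layout or checksum algorithm:

```python
import zlib

PAGE_SIZE = 8192
CHECKSUM_OFF = 8          # hypothetical offset of a 4-byte mid-page checksum field
CHECKSUM_LEN = 4

def page_checksum(page: bytes, hint_masks=()) -> int:
    """CRC over the page with the checksum field (and, optionally, any
    hint bits) zeroed out, so the stored checksum can live mid-page."""
    buf = bytearray(page)
    buf[CHECKSUM_OFF:CHECKSUM_OFF + CHECKSUM_LEN] = b"\x00" * CHECKSUM_LEN
    for off, mask in hint_masks:          # zero hint bits before the CRC
        buf[off] &= ~mask & 0xFF
    return zlib.crc32(bytes(buf)) & 0xFFFFFFFF

def set_checksum(page: bytes, hint_masks=()) -> bytes:
    """Return a copy of the page with its checksum field filled in."""
    buf = bytearray(page)
    crc = page_checksum(page, hint_masks)
    buf[CHECKSUM_OFF:CHECKSUM_OFF + CHECKSUM_LEN] = crc.to_bytes(4, "little")
    return bytes(buf)

def verify(page: bytes, hint_masks=()) -> bool:
    """Extract the stored checksum, recompute with the field zeroed, compare."""
    stored = int.from_bytes(page[CHECKSUM_OFF:CHECKSUM_OFF + CHECKSUM_LEN], "little")
    return stored == page_checksum(page, hint_masks)
```

The point of the hint-bit masking is visible here: a bit covered by `hint_masks` can flip without invalidating the stored checksum, which is exactly the property Bruce is after.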
Re: [HACKERS] Review of: explain / allow collecting row counts without timing info
Jeff Janes writes: > I suspect we will be unwilling to make such a break with the past. In > that case, I think I prefer the originally proposed semantics, even > though I agree they are somewhat less natural. ANALYZE is a big flag > that means "This query will be executed, not just planned". If we are > not going to make a major break, but only nibble around the edges, > then I don't think we should remove the property that the query will > be executed if and only if ANALYZE is specified. Yeah, I think we need to preserve that property. Unexpectedly executing a query (which may have side-effects) is a very dangerous thing. People are used to the idea that ANALYZE == execute, and adding random other flags that also cause execution is going to burn somebody. regards, tom lane
Re: [HACKERS] initdb and fsync
On Sat, Feb 04, 2012 at 03:41:27PM -0800, Jeff Davis wrote: > On Sat, 2012-01-28 at 13:18 -0500, Tom Lane wrote: > > Yeah. Personally I would be sad if initdb got noticeably slower, and > > I've never seen or heard of a failure that this would fix. > > I worked up a patch, and it looks like it does about 6 file fsync's and > a 7th for the PGDATA directory. That degrades the time from about 1.1s > to 1.4s on my workstation. > So, is it worth it? Should we make it an option that can be specified? If we add fsync calls to the initdb process, they should cover the entire data directory tree. This patch syncs files that initdb.c writes, but we ought to also sync files that bootstrap-mode backends had written. An optimization like the pg_flush_data() call in copy_file() may reduce the speed penalty. initdb should do these syncs by default and offer an option to disable them.
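The "cover the entire data directory tree" idea can be sketched as a recursive walk that fsyncs every file and then every directory — directories too, so that the file *entries* (names) are durable, not just the file contents. This is an illustrative Python stand-in for what initdb's C code would do, not the actual implementation; note that opening and fsyncing a directory is known to work on Linux but may fail on some other platforms:

```python
import os

def fsync_tree(root: str) -> None:
    """Recursively fsync every file and directory under root, finishing
    with root itself (os.walk with topdown=False yields root last)."""
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            fd = os.open(os.path.join(dirpath, name), os.O_RDONLY)
            try:
                os.fsync(fd)
            finally:
                os.close(fd)
        # fsync the directory so the entries themselves hit disk
        fd = os.open(dirpath, os.O_RDONLY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
```

This covers files written by any process — including bootstrap-mode backends — because it syncs whatever is on disk at the end, rather than only the files one particular writer remembers creating.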
Re: [HACKERS] Patch: Allow SQL-language functions to reference parameters by parameter name
Matthew Draper writes: > [ sql-named-param-refs-v2.patch ] Applied with some editorialization: I switched the behavior for two-part names as discussed, and did some other mostly-cosmetic code cleanup, and did some work on the documentation. > I'm still not sure whether to just revise (almost) all the SQL function > examples to use parameter names, and declare them the "right" choice; as > it's currently written, named parameters still seem rather second-class. They're less second-class in the docs as committed, but I left a lot of examples still using $n for parameters. I'm not sure how far to go in that direction. We should not be too eager to scrub the docs of $n, because if nothing else people will need to understand the notation when they see it for a long time to come. But feel free to submit a follow-up docs patch if you feel more is warranted. regards, tom lane
Re: [HACKERS] initdb and fsync
On Sat, 2012-01-28 at 13:18 -0500, Tom Lane wrote: > Yeah. Personally I would be sad if initdb got noticeably slower, and > I've never seen or heard of a failure that this would fix. I worked up a patch, and it looks like it does about 6 file fsync's and a 7th for the PGDATA directory. That degrades the time from about 1.1s to 1.4s on my workstation. pg_test_fsync says this about my workstation (one 8kB write):

    open_datasync         117.495 ops/sec
    fdatasync             117.949 ops/sec
    fsync                  25.530 ops/sec
    fsync_writethrough        n/a
    open_sync              24.666 ops/sec

25 ops/sec means about 40ms per fsync, times 7 is about 280ms, so that seems like about the right degradation for fsync. I tried with fdatasync as well to see if it improved things, and I wasn't able to realize any difference (not sure exactly why). So, is it worth it? Should we make it an option that can be specified? > I wonder whether it wouldn't be sufficient to call sync(2) at the end, > anyway, rather than cluttering the entire initdb codebase with fsync > calls. It looks like there are only a few places, so I don't think clutter is really the problem with the simple patch at this point (unless there is a portability problem with just calling fsync). Regards, Jeff Davis

initdb-fsync.patch.gz
Description: GNU Zip compressed data
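Jeff's back-of-the-envelope check can be restated as a one-liner; the inputs (7 fsyncs, 25.530 ops/sec) are his measurements above, and the result lines up with the observed 1.1s → 1.4s degradation:

```python
def fsync_penalty_seconds(num_fsyncs: int, fsync_ops_per_sec: float) -> float:
    """Rough wall-clock cost of N serial fsyncs at a measured ops/sec rate."""
    return num_fsyncs / fsync_ops_per_sec

# 7 fsyncs at ~25.5 ops/sec is roughly 0.27 s of added time.
```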
Re: [HACKERS] initdb and fsync
* Tom Lane: > I wonder whether it wouldn't be sufficient to call sync(2) at the end, > anyway, rather than cluttering the entire initdb codebase with fsync > calls. We tried to do this in the Debian package manager. It works as expected on Linux systems, but it can cause a lot of data to hit the disk, and there are kernel versions where sync(2) never completes if the system is rather busy. initdb is much faster with 9.1 than with 8.4. It's so fast that you can use it in test suites, instead of reusing an existing cluster. I think this is a rather desirable property.
Re: [HACKERS] [v9.2] Add GUC sepgsql.client_label
Tom Lane wrote: > What I was concerned about was the case where GUC is trying to > re-establish an old value during transaction rollback. For example, > assume we are superuser to start with, and we do > > begin; > set role unprivileged_user; > ... > rollback; > > The rollback needs to transition the role setting back to > superuser. Sorry for missing the point. Yeah, I totally agree with that. > And yeah, I agree it's a bug that you can do what Kevin's example > shows. I'll look at it and see if I can pull together a patch. -Kevin
Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches
On Wed, Feb 1, 2012 at 9:47 AM, Robert Haas wrote: > On Wed, Jan 25, 2012 at 8:49 AM, Robert Haas wrote: >> On Tue, Jan 24, 2012 at 4:28 PM, Simon Riggs wrote: >>> I think we should be working to commit XLogInsert and then Group >>> Commit, then come back to the discussion. >> >> I definitely agree that those two have way more promise than anything >> else on the table. However, now that I understand how badly we're >> getting screwed by checkpoints, I'm curious to do some more >> investigation of what's going on there. I can't help thinking that in >> these particular cases the full page writes may be a bigger issue than >> the checkpoint itself. If that turns out to be the case it's not >> likely to be something we're able to address for 9.2, but I'd like to >> at least characterize it. > > I did another run with full_page_writes=off: > > http://wiki.postgresql.org/wiki/File:Tps-master-nofpw.png > > This is run with the master branch as of commit > 4f42b546fd87a80be30c53a0f2c897acb826ad52, same as the previous tests, > so that the results are comparable. > > The graph is pretty dramatically different from the earlier one: > > http://wiki.postgresql.org/wiki/File:Tps-master.png > > So are the TPS numbers: > > full page writes on: tps = 2200.848350 (including connections establishing) > full page writes off: tps = 9494.757916 (including connections establishing) > > It seems pretty clear that, even with full_page_writes=off, the > checkpoint is affecting things negatively: the first 700 seconds or so > show much better and more consistent performance than the remaining > portion of the test. I'm not sure why that is, but I'm guessing there > was a checkpoint around that time. We really need to nail that down. Could you post the scripts (on the wiki) you use for running the benchmark and making the graph? 
I'd like to see how much work it would be for me to change it to detect checkpoints and do something like color the markers blue during checkpoints and red elsewhen. Also, I'm not sure how bad that graph really is. The overall throughput is more variable, and there are a few latency spikes but they are few. The dominant feature is simply that the long-term average is less than the initial burst. Of course the goal is to have a high level of throughput with a smooth latency under sustained conditions. But to expect that that long-sustained smooth level of throughput be identical to the "initial burst throughput" sounds like more of a fantasy than a goal. If we want to accept the lowered throughput and work on the what variability/spikes are there, I think a good approach would be to take the long term TPS average, and dial the number of clients back until the initial burst TPS matches that long term average. Then see if the spikes still exist over the long term using that dialed back number of clients. > But the effect is much worse with > full_page_writes=on: the distinctive parabolic shape of those graphs > is apparently caused by the gradually decreasing frequency of full > page writes as the number of transactions processed since the last > checkpoint grows. The dilution out of not-yet-full-page-written buffers as the checkpoint ages is probably the ultimate cause of that pattern, but I suspect there are intermediate causes that we could tackle. I don't think the full-page-writes are leading to WALInsert contention, for example, because that would probably lead to smooth throughput decline, but not those latency spikes in which those entire seconds go by without transactions. I doubt it is leading to general IO compaction, as the IO at that point should be pretty much sequential (the checkpoint has not yet reached the sync stage, the WAL is sequential). So I bet that that is caused by fsyncs occurring at xlog segment switches, and the locking that that entails. 
If I recall, we can have a segment which is completely written to OS and in the process of being fsynced, and we can have another segment which is in some state of partially in wal_buffers and partly written out to OS cache, but that we can't start reusing the wal_buffers that were already written to OS for that segment (and therefore are theoretically available for reuse by the upcoming 3rd segment) until the previous segment's fsync has completed. So all WALInserts freeze. Or something like that. This code has changed a bit since last time I studied it. Cheers, Jeff
Re: [HACKERS] Finer Extension dependencies
Hitoshi Harada writes: > Ping. In case you don't have updates soon, I'll mark Returned with Feedback. Pong. Sorry about my recent silence. I've not been in a position to work on this recently, and am done with those other duties now. I intend to be posting an updated patch soon now. Please wait some more before punting this patch out of the current commit fest. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] [BUGS] BUG #6425: Bus error in slot_deform_tuple
Simon Riggs writes: > The cause here is data changing underneath the user. Your patch solves > the most obvious error, but it still allows other problems if applying > the backup block changes data. If the backup block doesn't do anything > at all then we don't need to apply it either. This is nonsense. What applying the backup block does is to apply the change that the WAL record would otherwise have applied, except we decided to make it store a full-page image instead. regards, tom lane
Re: [HACKERS] [BUGS] BUG #6425: Bus error in slot_deform_tuple
On Sat, Feb 4, 2012 at 6:37 PM, Simon Riggs wrote: > Patch to do that attached -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 99a431a..4758931 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -4590,6 +4590,7 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record) int ndead; int nunused; Size freespace; + bool hit; /* * We're about to remove tuples. In Hot Standby mode, ensure that there's @@ -4608,7 +4609,7 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record) if (record->xl_info & XLR_BKP_BLOCK_1) return; - buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL); + buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL, &hit); if (!BufferIsValid(buffer)) return; LockBufferForCleanup(buffer); @@ -4664,6 +4665,7 @@ heap_xlog_freeze(XLogRecPtr lsn, XLogRecord *record) TransactionId cutoff_xid = xlrec->cutoff_xid; Buffer buffer; Page page; + bool hit; /* * In Hot Standby mode, ensure that there's no queries running which still @@ -4677,7 +4679,7 @@ heap_xlog_freeze(XLogRecPtr lsn, XLogRecord *record) if (record->xl_info & XLR_BKP_BLOCK_1) return; - buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL); + buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL, &hit); if (!BufferIsValid(buffer)) return; LockBufferForCleanup(buffer); @@ -4728,6 +4730,7 @@ heap_xlog_visible(XLogRecPtr lsn, XLogRecord *record) xl_heap_visible *xlrec = (xl_heap_visible *) XLogRecGetData(record); Buffer buffer; Page page; + bool hit; /* * Read the heap page, if it still exists. If the heap file has been @@ -4736,7 +4739,7 @@ heap_xlog_visible(XLogRecPtr lsn, XLogRecord *record) * will have to be cleared out at the same time. 
*/ buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, - RBM_NORMAL); + RBM_NORMAL, &hit); if (!BufferIsValid(buffer)) return; page = (Page) BufferGetPage(buffer); @@ -4806,13 +4809,14 @@ heap_xlog_newpage(XLogRecPtr lsn, XLogRecord *record) xl_heap_newpage *xlrec = (xl_heap_newpage *) XLogRecGetData(record); Buffer buffer; Page page; + bool hit; /* * Note: the NEWPAGE log record is used for both heaps and indexes, so do * not do anything that assumes we are touching a heap. */ buffer = XLogReadBufferExtended(xlrec->node, xlrec->forknum, xlrec->blkno, - RBM_ZERO); + RBM_ZERO, &hit); Assert(BufferIsValid(buffer)); LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); page = (Page) BufferGetPage(buffer); diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c index 0f5c113..d10b0b8 100644 --- a/src/backend/access/nbtree/nbtxlog.c +++ b/src/backend/access/nbtree/nbtxlog.c @@ -466,6 +466,7 @@ btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record) Buffer buffer; Page page; BTPageOpaque opaque; + bool hit; xlrec = (xl_btree_vacuum *) XLogRecGetData(record); @@ -491,7 +492,7 @@ btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record) * Another simple optimization would be to check if there's any * backends running; if not, we could just skip this. */ - buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, blkno, RBM_NORMAL); + buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, blkno, RBM_NORMAL, &hit); if (BufferIsValid(buffer)) { LockBufferForCleanup(buffer); @@ -513,7 +514,7 @@ btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record) * Like in btvacuumpage(), we need to take a cleanup lock on every leaf * page. See nbtree/README for details. 
*/ - buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL); + buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL, &hit); if (!BufferIsValid(buffer)) return; LockBufferForCleanup(buffer); diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index cce87a3..3f4842d 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -3687,6 +3687,7 @@ RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup) BkpBlock bkpb; char *blk; int i; + bool hit; if (!(record->xl_info & XLR_BKP_BLOCK_MASK)) return; @@ -3700,8 +3701,21 @@ RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup) memcpy(&bkpb, blk, sizeof(BkpBlock)); blk += sizeof(BkpBlock); + hit = false; buffer = XLogReadBufferExtended(bkpb.node, bkpb.fork, bkpb.block, - RBM_ZERO); + RBM_ZERO, &hit); + + /* + * If we found the block in shared buffers and we are already + * consistent then skip applying the backup block. The block + * was already removable anyway,
Re: [HACKERS] [BUGS] BUG #6425: Bus error in slot_deform_tuple
On Fri, Feb 3, 2012 at 6:45 AM, Tom Lane wrote:
> The reason it is relevant to our current problem is that even though
> RestoreBkpBlocks faithfully takes exclusive lock on the buffer, *that
> is not enough to guarantee that no one else is touching that buffer*.
> Another backend that has already located a visible tuple on a page is
> entitled to keep accessing that tuple with only a buffer pin. So the
> existing code transiently wipes the data from underneath the other
> backend's pin.

While deciding whether to apply the patch, I'm thinking about whether we should be doing this at all. We already agreed that backup blocks were removable from the WAL stream. The cause here is data changing underneath the user. Your patch solves the most obvious error, but it still allows other problems if applying the backup block changes data. If the backup block doesn't do anything at all then we don't need to apply it either. So ISTM that we should just skip applying backup blocks over the top of existing buffers once we have reached consistency. Patch to do that attached, but the basic part of it is just this...

@@ -3700,8 +3701,21 @@ RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup)
 		memcpy(&bkpb, blk, sizeof(BkpBlock));
 		blk += sizeof(BkpBlock);

+		hit = false;
 		buffer = XLogReadBufferExtended(bkpb.node, bkpb.fork, bkpb.block,
-										RBM_ZERO);
+										RBM_ZERO, &hit);
+
+		/*
+		 * If we found the block in shared buffers and we are already
+		 * consistent then skip applying the backup block. The block
+		 * was already removable anyway, so we can skip without problems.
+		 * This avoids us needing to take a cleanup lock in all cases when
+		 * we apply backup blocks because of potential effects on user queries,
+		 * which expect data on blocks to remain constant while being read.
+		 */
+		if (reachedConsistency && hit)
+			continue;
+
 		Assert(BufferIsValid(buffer));

 		if (cleanup)
 			LockBufferForCleanup(buffer);

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Memory usage during sorting
On Sat, Feb 4, 2012 at 10:06 AM, Jeff Janes wrote: > Attached is a completely uncommitable proof of concept/work in > progress patch to get around the limitation. It shows a 2 fold > improvement when indexing an integer column on a 50,000,000 row > randomly ordered table. Oops, forgot to say that is with set maintenance_work_mem=4194304;
Re: [HACKERS] Memory usage during sorting
On Sat, Jan 21, 2012 at 4:51 PM, Peter Geoghegan wrote: > On 16 January 2012 00:59, Jeff Janes wrote: >> I think it would be better to pre-deduct the tape overhead amount we >> will need if we decide to switch to tape sort from the availMem before >> we even start reading (and then add it back if we do indeed make that >> switch). That way we wouldn't over-run the memory in the first place. >> However, that would cause apparent regressions in which sorts that >> previously fit into maintenance_work_mem no longer do. Boosting >> maintenance_work_mem to a level that was actually being used >> previously would fix those regressions, but pointing out that the >> previous behavior was not optimal doesn't change the fact that people >> are used to it and perhaps tuned to it. So the attached patch seems >> more backwards-friendly. > > Hmm. Are people really setting maintenance_work_mem such that it is > exactly large enough to quicksort when building an index in one case > but not another? It could also apply to work_mem in other situations. I'm just using maintenance_work_mem in my example because index creation is the thing that lead me into this little diversion and so that is what my test-bed is set up to use. Some people do have very stable "ritualized" work-loads and so could have tuned exactly to them. I don't know how common that would be. More common might be synthetic benchmarks which had been tuned, either intentionally or accidentally, to just barely fit in the (overshot) memory, so it would be a perception problem that there was a regression when in fact the tuning merely became out of date. > Is the difference large enough to warrant avoiding > pre-deduction? Switching to an external sort when you could have done it all by quick sort (by cheating just a little bit on memory usage) makes a pretty big difference, over 2 fold in my tests. If the fast-path optimization for qsort goes in, the difference might get even bigger. 
Of course, there will always be some point at which that switch over must occur, so the real question is do we need to keep that switch point historically consistent to avoid nasty surprises for people with excessively fine-tuned *work_mem settings based on old behavior? And do such people even exist outside of my imagination? But a bigger question I now have is if any of this matters. Since there is currently no way to know how many connections might be running how many concurrent sorts, there is really no way to tune work_mem with much precision. So, does it matter if we slightly lie about how much we will actually use? And is it worth worrying about whether every last drop of memory is used efficiently? I was trying to compare the performance I was getting with a theoretical model of what I should get, and it was confusing to me that we were using an originally defined size of memory for the priority heap, and then later reducing that amount of memory. That is annoying because it complicates the theoretical model, and even more annoying when I realized that the "freed" memory that results from doing this is too fragmented to even be used for other purposes. But theoretical annoyances aside, it is hardly the worst thing about memory usage in sorting. It is just one that is standing in the way of my further study. So I don't think this patch I proposed is particularly important in its own right. I want to see if anyone will point out that I am all wet because my analysis failed to take something into account. I probably should have explained this when I submitted the patch. The worst thing about the current memory usage is probably that big machines can't qsort more than 16,777,216 tuples no matter how much memory they have, because memtuples has to live entirely within a single allocation, which is currently limited to 1GB. It is probably too late to fix this problem for 9.2. I wish I had started there rather than here. 
This 16,777,216 tuple limitation will get even more unfortunate if the specializations that speed up qsort but not external sort get accepted. Attached is a completely uncommitable proof of concept/work in progress patch to get around the limitation. It shows a 2 fold improvement when indexing an integer column on a 50,000,000 row randomly ordered table. As for what to do to make it commitable, it seems like maybe there should be two different MaxAllocSize. One determined by the "aset.c assumes they can compute twice an allocation's size without overflow". I would think that simply testing size_t at compile time would be enough. Allowing more than 1 GB allocations would probably be pointless on 32 bit machines, and 64 bit should be able to compute twice of a much larger value than 1GB without overflow. For the varlena stuff, I don't know what to do. Is the "I swear to God I won't pass this pointer to one of those functions" sufficient? Or does each function need to test it? And I'm pretty sure that palloc, not just repalloc, would need
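The 16,777,216-tuple ceiling discussed here follows directly from dividing the 1GB single-allocation cap by the size of each in-memory sort array entry. The arithmetic can be sketched as follows; the 64-byte entry size is simply the value that reproduces the figure quoted in the thread, not a claim about the actual sizeof(SortTuple) on any given build:

```python
MAX_ALLOC = 1 << 30            # the 1 GB single-allocation cap discussed above

def max_qsortable_tuples(sort_tuple_size: int) -> int:
    """How many sort-array entries fit in one capped allocation.
    sort_tuple_size stands in for sizeof(SortTuple) on a given build."""
    return MAX_ALLOC // sort_tuple_size
```

The same formula also shows why raising the allocation cap on 64-bit machines (as the proof-of-concept patch does) directly raises the number of tuples that can be quicksorted in memory.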
Re: [HACKERS] [v9.2] Add GUC sepgsql.client_label
"Kevin Grittner" writes: > Tom Lane wrote: >> More to the point, a GUC rollback transition *has to always >> succeed*. Period. > [ counterexample showing we should sometimes disallow "RESET" ] This actually isn't what I was talking about: a RESET statement is a commanded adoption of a new value that happens to be gotten off the stack, and there's not a lot of logical difficulty in letting it fail. What I was concerned about was the case where GUC is trying to re-establish an old value during transaction rollback. For example, assume we are superuser to start with, and we do

    begin;
    set role unprivileged_user;
    ...
    rollback;

The rollback needs to transition the role setting back to superuser. A check based purely on allowed transitions would disallow that, since it's not visibly different from the unprivileged_user trying to make himself superuser. You have to take the context of the state change into account. And yeah, I agree it's a bug that you can do what Kevin's example shows. regards, tom lane
Re: [HACKERS] [v9.2] Add GUC sepgsql.client_label
Tom Lane wrote: > More to the point, a GUC rollback transition *has to always > succeed*. Period. I was about to point out the exception of the transaction_read_only GUC, which according to the standard must not be changed except at the beginning of a transaction or a subtransaction, and must not be changed from "on" to "off" in a subtransaction. Then I noticed that, while we protect against an explicit change in a prohibited way, we allow a RESET:

    test=# begin transaction read only;
    BEGIN
    test=# select * from x;
     x
    ---
     1
    (1 row)

    test=# set transaction_read_only = off;
    ERROR: transaction read-write mode must be set before any query
    test=# rollback;
    ROLLBACK
    test=# begin transaction read only;
    BEGIN
    test=# select * from x;
     x
    ---
     1
    (1 row)

    test=# reset transaction_read_only;
    RESET
    test=# insert into x VALUES (2);
    INSERT 0 1
    test=# commit;
    COMMIT

I think that's a problem. It could allow back-door violations of invariants enforced by triggers, and seems to violate the SQL standard. I think this should be considered a bug, although I'm not sure whether it's safe to back-patch, given the change to existing behavior. Whether such a (required) exception to what you assert above should open the door to any others is another question. -Kevin
Re: [HACKERS] basic pgbench runs with various performance-related patches
On 01/24/2012 08:58 AM, Robert Haas wrote:
> One somewhat odd thing about these numbers is that, on permanent
> tables, all of the patches seemed to show regressions vs. master in
> single-client throughput. That's a slightly difficult result to
> believe, though, so it's probably a testing artifact of some kind.

It looks like you may have run the ones against master first, then the ones applying various patches. The one test artifact I have to be very careful to avoid in that situation is that later files on the physical disk are slower than earlier ones. There's a >30% difference between the fastest part of a regular hard drive, the logical beginning, and its end. Multiple test runs tend to creep forward onto later sections of disk, and be biased toward the earlier run in that case. To eliminate that bias when it gets bad, I normally either a) run each test 3 times, interleaved, or b) rebuild the filesystem in between each initdb. I'm not sure that's the problem you're running into, but it's the only one I've been hit by that matches the suspicious part of your results. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] Hot standby fails if any backend crashes
On Fri, Feb 3, 2012 at 4:48 AM, Tom Lane wrote:
> I think saner behavior might only require this change:
>
>     /*
>      * Any unexpected exit (including FATAL exit) of the startup
>      * process is treated as a crash, except that we don't want to
>      * reinitialize.
>      */
>     if (!EXIT_STATUS_0(exitstatus))
>     {
> -       RecoveryError = true;
> +       if (!FatalError)
> +           RecoveryError = true;
>         HandleChildCrash(pid, exitstatus,
>                          _("startup process"));
>         continue;
>     }
>
> plus suitable comment adjustments of course.  Haven't tested this yet
> though.

Looks good, will test.

> It's a bit disturbing that nobody has reported this from the field yet.
> Seems to imply that hot standby isn't being used much.

There are many people I know using it in production for more than a
year now.  Either they haven't seen it or they haven't reported it to
us.

--
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Finer Extension dependencies
On Mon, Jan 23, 2012 at 3:06 AM, Hitoshi Harada wrote:
> On Mon, Jan 23, 2012 at 2:00 AM, Dimitri Fontaine wrote:
>> Hitoshi Harada writes:
>>>>> - What happens if DROP EXTENSION ... CASCADE?  Does it work?
>>>>
>>>> It should, what happens when you try? :)
>>>
>>> I just tried DROP EXTENSION now, and found it broken :(
>>>
>>> db1=# create extension kmeans;
>>> CREATE EXTENSION
>>> db1=# drop extension kmeans;
>>> ERROR:  cannot drop extension kmeans because extension feature kmeans
>>> requires it
>>> HINT:  You can drop extension feature kmeans instead.
>>
>> Can you provide me the test case you've been using?  That looks like a
>> bug I need to fix, indeed (unless the problem lies in the test case,
>> which would mean I need to tighten things some more).
>
> The test case is just above; createdb db1 and create and drop an
> extension.  The kmeans extension is on pgxn.  I tried my small test
> extension named ext1 which contains only one plpgsql function, and
> created it then dropped it, reproduced.

Ping.  In case you don't have updates soon, I'll mark it Returned with
Feedback.

Thanks,
--
Hitoshi Harada

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch: Allow SQL-language functions to reference parameters by parameter name
On Thu, Feb 2, 2012 at 3:19 PM, Tom Lane wrote:
> [ working on this patch now ... ]
>
> Matthew Draper writes:
>> On 25/01/12 18:37, Hitoshi Harada wrote:
>>> Should we throw an error in such ambiguity?  Or did you make it
>>> happen intentionally?  If latter, we should also mention the rule
>>> in the manual.
>
>> I did consider it, and felt it was the most consistent:
>
> I believe the issue here is that a two-part name A.B has two possible
> interpretations (once we have eliminated table references supplied by
> the current SQL command inside the function): per the comment,
>
>  * A.B   A = record-typed parameter name, B = field name
>  * OR:   A = function name, B = parameter name
>
> If both interpretations are feasible, what should we do?  The patch
> tries them in the above order, but I think the other order would be
> better.  My argument is this: the current behavior doesn't provide any
> "out" other than changing the function or parameter name.  Now
> presumably, if someone is silly enough to use a parameter name the same
> as the function's name, he's got a good reason to do so and would not
> like to be forced to change it.  If we prefer the function.parameter
> interpretation, then he still has a way to get to a field name: he just
> has to use a three-part name function.parameter.field.  If we prefer the
> parameter.field interpretation, but he needs to refer to a scalar
> parameter, the only way to do it is to use an unqualified reference,
> which might be infeasible (eg, if it also matches a column name exposed
> in the SQL command).
>
> Another problem with the current implementation is that if A matches a
> parameter name, but ParseFuncOrColumn fails (ie, the parameter is not of
> composite type or doesn't contain a field named B), then the hook just
> fails to resolve anything; it doesn't fall back to consider the
> function-name-first alternative.  So that's another usability black
> mark.  We could probably complicate the code enough so it did consider
> the function.parameter case at that point, but I don't think that there
> is a strong enough argument for this precedence order to justify such
> complication.
>
> In short, I propose swapping the order in which these cases are tried.

+1 from me.

Thanks,
--
Hitoshi Harada

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
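The proposed precedence can be modeled abstractly: try the
function.parameter interpretation of A.B first, and fall back to
parameter.field if it doesn't apply.  A hypothetical sketch in Python
(dicts stand in for composite-typed parameters; this models the
resolution order under discussion, not PostgreSQL's actual parser
hook):

```python
def resolve_two_part_name(a, b, func_name, params):
    """Resolve A.B with the proposed precedence: function.parameter
    first, then parameter.field.  'params' maps parameter names to
    values; composite parameters are represented as dicts."""
    # Interpretation 1: A = function name, B = parameter name.
    if a == func_name and b in params:
        return params[b]
    # Interpretation 2: A = record-typed parameter name, B = field name.
    if a in params and isinstance(params[a], dict) and b in params[a]:
        return params[a][b]
    raise LookupError("cannot resolve %s.%s" % (a, b))

# A parameter named the same as its function: function.parameter wins,
# yet the parameter's fields stay reachable because interpretation 1
# only matches when B is itself a parameter name.
params = {"f": {"x": 1}, "y": 2}
first = resolve_two_part_name("f", "y", "f", params)   # scalar param y
second = resolve_two_part_name("f", "x", "f", params)  # field x of param f
```

Under the pre-patch order the second lookup would shadow the first,
with no qualification available to reach the scalar parameter.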
Re: [HACKERS] SKIP LOCKED DATA
On 16 January 2012 21:30, Josh Berkus wrote:
> Useful, yes.  Harder than it looks, probably.  I tried to mock up a
> version of this years ago for a project where I needed it, and ran
> into all kinds of race conditions.

Can you remember any details about those race conditions?

> Anyway, if it could be made to work, this is extremely useful for any
> application which needs queueing behavior (which is a large
> plurality, if not a majority, of applications).

Ok, based on this feedback I decided to push further and try
implementing this.  See POC/WIP patch attached.  It seems to work for
simple examples but I haven't yet tried to break it or see how it
interacts with more complicated queries or high concurrency levels.
It probably contains at least a few rookie mistakes!  Any feedback
gratefully received.

The approach is described in my original email.  Short version:
heap_lock_tuple now takes an enum called wait_policy instead of a
boolean called nowait, with the following values:

LockWaitBlock: wait for lock (like nowait = false before)
LockWaitError: error if not immediately lockable (like nowait = true
before)
LockWaitSkip: give up and return HeapTupleWouldBlock if not
immediately lockable (this is a new policy)

The rest of the patch is about getting the appropriate value down to
that function call, following the example of the existing nowait
support, and skipping rows if you said SKIP LOCKED DATA and you got
HeapTupleWouldBlock.

Compared to one very popular commercial database's implementation, I
think this is a little bit friendlier for the user who wants to
distribute work.  Let's say you want to lock one row without lock
contention, which this patch allows with FETCH FIRST 1 ROW ONLY FOR
UPDATE SKIP LOCKED DATA in an SQL query.
In that other system, the mechanism for limiting the number of rows
fetched is done in the WHERE clause, and therefore the N rows are
counted *before* checking if the lock can be obtained, so users
sometimes have to resort to stored procedures so they can control the
FETCH from a cursor imperatively.  In another popular commercial
database from Redmond, you can ask for the top (first) N rows while
using the equivalent of SKIP LOCKED DATA and it has the same effect as
this patch as far as I can tell, and another large blue system is the
same.

As discussed in another branch of this thread, you can probably get
the same effect with transactional advisory locks.  But I personally
like row skipping better as an explicit feature because:

(1) I think there might be an order-of-evaluation problem with a WHERE
clause containing both lock testing and row filtering expressions (ie
it is undefined right?) which you might need subselects to work around
(ie to be sure to avoid false positive locks, not sure about this).

(2) The advisory technique requires you to introduce an integer
identifier if you didn't already have one and then lock that, despite
having already said which rows you want to try to lock in standard row
filtering expressions.

(3) The advisory technique requires all users of the table to
participate in the advisory lock protocol even if they don't want to
use the option to skip lock.

(4) It complements NOWAIT (which could also have been done with
transactional advisory locks, with a function like try_lock_or_fail).

(5) I like the idea of directly porting applications from other
databases with this feature (that is what led me here).
I can also imagine some other arguments against SKIP LOCKED DATA:
"the type of applications that would use this technique generate too
much dead tuple churn for PostgreSQL anyway" (for example, compared to
the update-tuples-in-place systems like DB2 z/OS edition where SKIP
LOCKED DATA is used to distribute work efficiently), and "databases
shouldn't be used as queues anyway, for that we have $X", and
"skipping rows provides an inconsistent view of the data" (ie a result
set that never strictly existed).

Here are some examples of previous requests or discussion of this
feature:

http://archives.postgresql.org/pgsql-general/2008-07/msg00442.php
http://archives.postgresql.org/pgsql-bugs/2003-12/msg00154.php
http://archives.postgresql.org/pgsql-general/2002-07/msg00744.php
http://blog.hydrobiont.com/2011/06/select-for-update-skip-locked-in.html

Thanks for reading!

Thomas

diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml
index 01c0104..58adcae 100644
--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@@ -45,7 +45,7 @@ SELECT [ ALL | DISTINCT [ ON ( expression
 count | ALL } ]
 [ OFFSET start [ ROW | ROWS ] ]
 [ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ONLY ]
-[ FOR { UPDATE | SHARE } [ OF table_name [, ...] ] [ NOWAIT ] [...] ]
+[ FOR { UPDATE | SHARE } [ OF table_name [, ...] ] [ { NOWAIT | SKIP LOCKED DATA } ] [...] ]
 where from_item can be one of:
@@ -1146,14 +1146,14 @@
 FETCH { FIRST | NEXT } [ count ] { T
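The work-distribution pattern the patch enables can be illustrated
outside SQL: each worker tries to claim a job without blocking and
simply moves on if another worker already holds it.  A rough Python
analogy, with threading locks standing in for row locks (this models
the behavior, it is not the patch itself):

```python
import threading

class JobQueue:
    """Each job carries a lock; claim() returns the first job whose
    lock can be taken without blocking -- the essence of SKIP LOCKED."""

    def __init__(self, jobs):
        self.jobs = [(job, threading.Lock()) for job in jobs]

    def claim(self):
        for job, lock in self.jobs:
            # Like heap_lock_tuple with LockWaitSkip: never wait,
            # just move on to the next candidate row.
            if lock.acquire(blocking=False):
                return job, lock        # caller releases when done
        return None, None               # everything locked: no work now

q = JobQueue(["job1", "job2"])
j1, l1 = q.claim()   # first worker claims job1
j2, l2 = q.claim()   # second worker skips job1 (held) and gets job2
```

With NOWAIT the second claim would error out instead of skipping, and
with the default policy it would block until job1's holder finished.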
Re: [HACKERS] Caching for stable expressions with constant arguments v6
On Sat, Feb 4, 2012 at 09:49, Jaime Casanova wrote:
> i little review...

Thanks!  By the way, you should update to the v7 patch.

> first, i notice a change of behaviour... i'm not sure if i can say
> this is good or not.
> if you execute: select *, cached_random() from (select
> generate_series(1, 10) ) i;

Yeah, this is pretty much expected.

> seems you are moving code in simplify_function(), do you think is
> useful to do that independently? at least if it provides some clarity
> to the code

I think so, yeah, but separating it from the patch would be quite a
bit of work.

> shared_buffers                  | 4096

Note that the "ts" table is 36MB, so it doesn't fit in your
shared_buffers now.  If you increase shared_buffers, you will see
better gains for tests #1 and #2.

> from what i see there is no too much gain for the amount of complexity
> added... i can see there should be cases which a lot more gain

Yeah, the goal of this test script wasn't to demonstrate the best
cases of the patch, but to display how it behaves in different
scenarios.  Here's what they do:

Test case #1 is a typical "real world" query that benefits from this
patch.  With a higher shared_buffers you should see a 6-7x improvement
there, which I think is a compelling speedup for a whole class of
queries.

Test #2 doesn't benefit much from the patch since now() is already a
very fast function, but the point is that even saving the function
call overhead helps noticeably.

Tests #3 and #4 show the possible *worst* case of the patch -- it's
processing a complex expression, but there's just one row in the
table, so no opportunity for caching.

Besides stable functions, this patch also improves the performance of
expressions involving placeholder parameters, such as variable
references from PL/pgSQL, since these are not constant-folded.  E.g:

DECLARE
    foo timestamptz = '2012-02-04';
BEGIN
    SELECT * FROM ts WHERE ts > (foo - interval '1 day');

Before this patch, the interval subtraction was executed once per row;
now it's once per execution.

> a benchmark with pgbench scale 20 (average of 3 runs, -T 300 clients
> =1, except on second run)
> -scale 20

I think the differences here are mostly noise, because the queries
pgbench generates aren't affected by caching.

Regards,
Marti

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
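The once-per-execution behavior can be illustrated abstractly: a
stable expression is hoisted out of the per-row loop and evaluated a
single time per execution.  A hypothetical Python sketch (the counter
and function names are invented for illustration; this is not the
patch's executor code):

```python
evaluations = 0

def stable_expr():
    """Stands in for a stable expression like (foo - interval '1 day'):
    its result cannot change within a single execution."""
    global evaluations
    evaluations += 1
    return 42

def scan_without_cache(rows):
    # Pre-patch behavior: re-evaluate the stable expression per row.
    return [r for r in rows if r > stable_expr()]

def scan_with_cache(rows):
    # Patched behavior: evaluate once per execution, reuse the result.
    bound = stable_expr()
    return [r for r in rows if r > bound]

rows = list(range(100))
uncached = scan_without_cache(rows)   # 100 evaluations of stable_expr
calls_uncached = evaluations

evaluations = 0
cached = scan_with_cache(rows)        # a single evaluation
calls_cached = evaluations
```

Both scans return identical rows; only the evaluation count differs,
which is why the gain grows with the row count and vanishes on
one-row tables (tests #3 and #4 above).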