Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Mar 24, 2016 at 7:12 PM, Peter Geoghegan wrote:
> On Thu, Mar 24, 2016 at 7:17 AM, Robert Haas wrote:
>> I really like this idea, and the performance results seem impressive,
>> but I think we should push this out to 9.7. A btree patch that didn't
>> have WAL support until two and a half weeks into the final CommitFest
>> just doesn't seem to me like a good candidate. First, as a general
>> matter, if a patch isn't code-complete at the start of a CommitFest,
>> it's reasonable to say that it should be reviewed but not necessarily
>> committed in that CommitFest. This patch has had some review, but I'm
>> not sure how deep that review is, and I think it's had no code review
>> at all of the WAL logging changes, which were submitted only a week
>> ago, well after the CF deadline. Second, the btree AM is a
>> particularly poor place to introduce possibly destabilizing changes.
>> Everybody depends on it, all the time, for everything. And despite
>> new tools like amcheck, it's not a particularly easy thing to debug.
>
> Regrettably, I must agree. I don't see a plausible path to commit for
> this patch in the ongoing CF.
>
> I think that Anastasia did an excellent job here, and I wish I could
> have been of greater help sooner. Nevertheless, it would be unwise to
> commit this given the maturity of the code. There have been very few
> instances of performance improvements to the B-Tree code for as long
> as I've been interested, because it's so hard, and the standard is so
> high. The only examples I can think of from the last few years are
> Kevin's commit 2ed5b87f96 and Tom's commit 1a77f8b63d, both of which
> were far less invasive, and Simon's commit c7111d11b1, which we just
> outright reverted from 9.5 due to subtle bugs (and even that was
> significantly less invasive than this patch). Improving nbtree is
> something that requires several rounds of expert review, and that's
> something that's in short supply for the B-Tree code in particular. I
> think that a new testing strategy is needed to make this easier, and I
> hope to get that going with amcheck. I need help with formalizing a
> "testing first" approach for improving the B-Tree code, because I
> think it's the only way that we can move forward with projects like
> this. It's *incredibly* hard to push forward patches like this given
> our current, limited testing strategy.

I've been toying (having gotten nowhere concrete, really) with prefix
compression myself, and I agree that messing with the btree code is
quite a bit harder than it ought to be.

Perhaps trying experimental format changes in a separate experimental am
wouldn't be all that bad (say, nxbtree?). People could opt in to those
by creating their indexes with nxbtree instead of plain btree (say, in
development environments) and get some testing going without risking
much.

Normally the same effect should be achievable with mere flags, but since
format changes to btree tend to be rather invasive, ensuring the patch
doesn't change behavior with the flag off is hard as well, hence the
wholly separate am idea.
Re: [HACKERS] Use pread and pwrite instead of lseek + write and read
On Wed, Aug 17, 2016 at 10:40 AM, Tom Lane wrote:
> Oskari Saarenmaa writes:
>> On my laptop a simple pgbench run (scale 100, 15 minutes) shows a 1.5%
>> performance improvement.
>
> I would have hoped for a lot better result before anyone would propose
> that we should deal with all the portability issues this'll create.
>
>> A 1.5% performance improvement is small but
>> measurable - and IMV more importantly it allows us to drop more than 100
>> lines of backwards (compatible?) code; maybe we could start targeting
>> more recent platforms in v10?
>
> That's basically nonsense: we'll end up adding way more than that to
> deal with platforms that haven't got these APIs.

Perhaps a better target would then be to try and make use of readv and
writev, which should both be useful for sequential access of various
kinds and for network I/O. For instance, when reading 10 sequential
buffers, a single readv could fill all 10 buffers at once.

I remember a project where we got a linear improvement in throughput by
using them for network I/O, because we were limited by packet throughput
rather than byte throughput, and using the iovec utilities reduced the
overhead considerably. But all this is anecdotal evidence in any case,
YMMV.
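To make that concrete, here's a minimal standalone sketch (not
PostgreSQL code; the file name and buffer count are made up) of filling
ten fixed-size buffers from consecutive file offsets with a single
preadv() call, instead of ten lseek()+read() round trips:

/*
 * Minimal standalone sketch: one preadv() fills ten 8 kB buffers from
 * consecutive file offsets in a single syscall.
 */
#define _DEFAULT_SOURCE         /* for preadv() on glibc */
#include <sys/uio.h>
#include <fcntl.h>
#include <stdio.h>

#define BLCKSZ   8192
#define NBUFFERS 10

int
main(void)
{
    static char  buffers[NBUFFERS][BLCKSZ];
    struct iovec iov[NBUFFERS];
    ssize_t      nread;
    int          fd, i;

    if ((fd = open("datafile", O_RDONLY)) < 0)
    {
        perror("open");
        return 1;
    }

    /* One iovec entry per buffer; the kernel fills them in order. */
    for (i = 0; i < NBUFFERS; i++)
    {
        iov[i].iov_base = buffers[i];
        iov[i].iov_len = BLCKSZ;
    }

    /* Read ten consecutive blocks starting at offset 0, in one syscall. */
    nread = preadv(fd, iov, NBUFFERS, 0);
    if (nread < 0)
        perror("preadv");
    else
        printf("read %zd bytes into %d buffers\n", nread, NBUFFERS);
    return 0;
}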
Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location
On Thu, Aug 18, 2016 at 5:05 PM, Robert Haas wrote:
> On Wed, Aug 17, 2016 at 10:54 PM, Claudio Freire wrote:
>> The attached patch tries to maintain the initial status of B-Tree
>> indexes, which are created with equal-key runs in physical order,
>> during the whole life of the B-Tree, and make key-tid pairs
>> efficiently searchable in the process.
>
> I have two pretty important questions that don't seem to be
> addressed in your email.

I believe I addressed that in the README, but if I wasn't clear, let me
know and I'll expand there too. Summarizing here:

> 1. How does it do that?

It adds the block number and the offset number of the index tuple's
t_tid (that is, the pointer into the heap) as a final (implicit) sort
key, as is done already in tuplesort. It just keeps doing that while
comparing keys in the B-Tree when searching the leaf page for insertion,
so the insertion point now points to the exact place where the insertion
has to happen to maintain that condition, and not just to the first
valid position within the same-key run.

In fact, that's why non-leaf index tuples need a different format:
while leaf index tuples already contain the heap pointer, non-leaf ones
contain only the downlink, not the pointer into the heap. To be able to
do comparisons and pick the right downlink, the original heap pointer in
the leaf index tuple is copied into the downlink index tuple when
splitting pages, into an additional IndexTupleData header that is
prepended only to non-leaf index tuples.

This representation was chosen to minimize code changes, such that
(itup+1) is usable as before, and also because it's reasonably compact.
But it *is* necessary to check whether a tuple is leaf or non-leaf to
choose between (itup+1) and plain itup, which is why the patch requires
so many changes in so many places.

> 2. What's the benefit?

The idea came up in the discussion about WARM updates. Basically, in
various situations it is convenient to be able to find the index tuples
that point to a specific heap tuple. Without this, doing so isn't
efficient - the only way to find such index tuples is to find the
leftmost index tuple that has the same key, and walk the index right
links until the right tid is found.

For non-unique indexes, but also for unique indexes with heavy bloat,
this could involve a large amount of I/O, so it isn't efficient in
several contexts. Think vacuum and WARM updates mostly (like HOT
updates, but where only some indexes need updating and some don't).

Having the ability to find a heap tuple by key-ctid pair is thus
beneficial because it allows efficient implementations of those
operations. It may benefit other parts of the code, but those are the
ones that come to mind.

It also makes index scans of the form

SELECT * FROM t WHERE some_col = some_const;

scan the heap in sequential order, even if some_col has low cardinality
(ie: the query returns lots of tuples), which is a nice performance side
effect.
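To make the "final (implicit) sort key" bit concrete, here's a hedged
standalone sketch with hypothetical names (the real patch works on
IndexTuples and the backend already has ItemPointerCompare for this);
it only illustrates the tiebreak order: key columns first, then heap
block, then offset within the block:

/*
 * Sketch only: keycmp is the result of comparing all key columns;
 * the TID fields break ties so equal-key runs stay in heap order.
 */
typedef struct
{
    unsigned int   block;   /* heap block number */
    unsigned short offset;  /* line pointer number within the block */
} HeapTidSketch;

static int
compare_with_tid_tiebreak(int keycmp,
                          const HeapTidSketch *a,
                          const HeapTidSketch *b)
{
    if (keycmp != 0)
        return keycmp;          /* key columns decide, as before */
    if (a->block != b->block)   /* implicit final sort key, part 1 */
        return (a->block < b->block) ? -1 : 1;
    if (a->offset != b->offset) /* implicit final sort key, part 2 */
        return (a->offset < b->offset) ? -1 : 1;
    return 0;                   /* exact same heap tuple */
}

With that ordering, the binary search for an insertion point lands on
the exact slot within an equal-key run, which is what makes the descent
described in point 1 work.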
Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location
On Thu, Aug 18, 2016 at 6:04 PM, Kevin Grittner wrote:
> On Thu, Aug 18, 2016 at 3:41 PM, Claudio Freire wrote:
>
>> It also makes index scans of the form
>>
>> SELECT * FROM t WHERE some_col = some_const;
>>
>> scan the heap in sequential order, even if some_col has low
>> cardinality (ie: the query returns lots of tuples), which is a nice
>> performance side effect.
>
> Speaking of performance side effects, does this avoid O(N^2)
> performance on index tuple insertion with duplicate values, for all
> insertion orderings? For example, does it descend directly to the
> right leaf page for the insert rather than starting at the front of
> the block of duplicate values and scanning to the right for a
> block with space, with a random chance to split a full block on
> each page it moves through?

Yes, but only on non-unique indexes. Unique indexes still need to scan
all duplicates to check visibility, so they remain O(N^2) there.

The code with the random chance to split is still there, but it should
never get a chance to run, because the comparison against the heap
tuple pointer would stop it very quickly (before walking a single
offset, I believe).

I thought about cleaning that up, but to make review easier I figured
I'd leave it there for now. Removing it would add a lot of diff noise.
Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location
On Thu, Aug 18, 2016 at 6:15 PM, Peter Geoghegan wrote:
> On Thu, Aug 18, 2016 at 1:41 PM, Claudio Freire wrote:
>> In fact, that's why non-leaf index tuples need a different format,
>> because while leaf index tuples contain the heap pointer already,
>> non-leaf ones contain only the downlink, not the pointer into the
>> heap. To be able to do comparisons and pick the right downlink, the
>> original heap pointer in the leaf index tuple is copied into the
>> downlink index tuple when splitting pages into an additional
>> IndexTupleData header that is prepended only to non-leaf index tuples.
>
> I think that this is a bad idea. We need to implement suffix
> truncation of internal page index tuples at some point, to make them
> contain less information from the original leaf page index tuple.
> That's an important optimization, because it increases fan-in. This
> seems like a move in the opposite direction.

I see that. I could try to measure average tree depth to gauge the
impact this has on fan-in.

While it should cut fan-in in half for narrow indexes, half of very high
is still high. Wide indexes, which are the ones that would suffer from
poor fan-in, would feel this far less, since the overhead is constant.

Even if it does have an impact, I don't see an alternative without also
implementing suffix truncation. Perhaps I could try to avoid adding the
leaf tid header when it isn't necessary, though I would have to use up
the last available flag bit in t_info for that.

> ISTM that the way to address this problem is with a duplicate list
> and/or prefix compression in leaf pages.

Prefix compression is another one I will be looking into eventually, but
last time I tried it was far more invasive, so I abandoned it until I
could find a less invasive way to do it.
Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location
On Thu, Aug 18, 2016 at 6:27 PM, Tom Lane wrote:
> Claudio Freire writes:
>> On Thu, Aug 18, 2016 at 6:04 PM, Kevin Grittner wrote:
>>> Speaking of performance side effects, does this avoid O(N^2)
>>> performance on index tuple insertion with duplicate values, for all
>>> insertion orderings? For example, does it descend directly to the
>>> right leaf page for the insert rather than starting at the front of
>>> the block of duplicate values and scanning to the right for a
>>> block with space, with a random chance to split a full block on
>>> each page it moves through?
>
>> Yes, but only on non-unique indexes.
>
> How's that work if the existing entries aren't in TID order (which they
> will not be, in a pre-existing index)? Or are you assuming you can blow
> off on-disk compatibility of indexes?
>
> regards, tom lane

This patch does blow off on-disk compatibility, but the plan is to
re-add it later on.

A flag in the meta page would indicate whether it's a sorted index or
not, and the only way to get a sorted index would be with a reindex. The
code would have to support both formats for a while, until whatever
point we figure we can drop support for old-format indexes.

Since this is a sort-order change, I see no alternative: either the
whole index is sorted by TID or it isn't.
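For clarity, a hedged sketch of what I have in mind (the flag name and
flags field are hypothetical; nothing like this is in the patch yet):

/*
 * Hypothetical compatibility flag, names made up for illustration.
 * The real BTMetaPageData would grow a flags word (or reuse a spare
 * field), set only when the index is built or rebuilt.
 */
#define BTM_SORTED_BY_TID  0x0001   /* hypothetical: equal keys in TID order */

typedef struct SketchBTMetaPageData
{
    unsigned int  btm_magic;
    unsigned int  btm_version;
    unsigned int  btm_flags;        /* hypothetical flags word */
    /* ... root, fastroot, level, etc. as today ... */
} SketchBTMetaPageData;

/* Insertion and search code would branch on this, once per descent. */
static int
index_is_tid_sorted(const SketchBTMetaPageData *metad)
{
    return (metad->btm_flags & BTM_SORTED_BY_TID) != 0;
}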
Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location
On Thu, Aug 18, 2016 at 6:38 PM, Peter Geoghegan wrote:
> On Thu, Aug 18, 2016 at 2:26 PM, Claudio Freire wrote:
>> I see that. I could try to measure average tree depth to gauge the
>> impact this has on fan-in.
>>
>> While it should cut fan-in in half for narrow indexes, half of very
>> high is still high. Wide indexes, which are the ones that would suffer
>> from poor fan-in, would feel this far less, since the overhead is
>> constant.
>
> You'll also have to consider the impact on the choice of split point,
> FWIW. There is currently leeway about what page an index tuple can
> land on, if and when it happens to be equal to the high-key. I'm not
> sure how important that is, but it's a consideration.

When I read the code, it seemed generic enough, using ItemIdGetLength
(which works on non-leaf index tuples correctly, reporting the right
size), so it should still work. But the choice of split point may make a
difference for future insertions, so I'll look into that.

>> Even if it does have an impact, I don't see an alternative without
>> also implementing suffix truncation. Perhaps I could try to avoid
>> adding the leaf tid header when it isn't necessary, though I would
>> have to use up the last available flag bit in t_info for that.
>
> To be clear, I'm particularly worried about painting ourselves into a
> corner, suffix-truncation-wise. It's a very common optimization, that
> we should really have.

Well, page version numbers could be used to switch between semantically
similar page formats later on. So if another format is necessary (say,
prefix compression, suffix truncation, etc.), it can be changed
on-the-fly on existing indexes, by checking page version numbers and
doing the proper switch. So we wouldn't be painting ourselves into a
corner, but picking the wrong format may indeed be a headache.

I can go the way of the flag in t_info if that's preferable. Both would
work. It's just that it's the last flag available, and using it would be
painting ourselves into a corner regarding flags. To avoid that, the
flag itself could mean "has extra data" (instead of "has leaf tid"),
with a whole set of flags in the extra data itself. That could also help
for prefix compression and suffix truncation.

>>> ISTM that the way to address this problem is with a duplicate list
>>> and/or prefix compression in leaf pages.
>>
>> Prefix compression is another one I will be looking into eventually,
>> but last time I tried it was far more invasive, so I abandoned it
>> until I could find a less invasive way to do it.
>
> Unfortunately, sometimes it is necessary to be very ambitious in order
> to solve a problem. The understandable and usually well-reasoned
> approach of making progress as incremental as possible occasionally
> works against contributors. It's worth considering whether this is such
> a case.

I'd agree if this regressed performance considerably without the other
improvements, so I guess this will depend on the fan-in measurements.
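To sketch what that "has extra data" layout could look like (all names
hypothetical; as far as I can tell, 0x2000 is the one t_info bit still
free in index tuples):

/*
 * Hypothetical layout sketch, names made up.  The one remaining t_info
 * bit only says "an extension area follows the IndexTupleData header";
 * the extension area carries its own flags word, so future features
 * (suffix truncation, prefix compression, ...) get bits of their own.
 */
#define INDEX_HAS_EXTRA         0x2000  /* hypothetical: last free t_info bit */

/* Hypothetical extension flags, stored in the extension area itself. */
#define EXTRA_HAS_HEAP_TID      0x0001  /* downlink carries a heap pointer */
#define EXTRA_SUFFIX_TRUNCATED  0x0002  /* reserved for future use */
#define EXTRA_PREFIX_COMPRESSED 0x0004  /* reserved for future use */

typedef struct SketchIndexTupleExtra
{
    unsigned short extra_flags;     /* which optional fields follow */
    /* optional fixed-size fields follow, in flag-bit order */
} SketchIndexTupleExtra;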
Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location
On Thu, Aug 18, 2016 at 6:26 PM, Claudio Freire wrote:
> On Thu, Aug 18, 2016 at 6:15 PM, Peter Geoghegan wrote:
>> On Thu, Aug 18, 2016 at 1:41 PM, Claudio Freire wrote:
>>> In fact, that's why non-leaf index tuples need a different format,
>>> because while leaf index tuples contain the heap pointer already,
>>> non-leaf ones contain only the downlink, not the pointer into the
>>> heap. To be able to do comparisons and pick the right downlink, the
>>> original heap pointer in the leaf index tuple is copied into the
>>> downlink index tuple when splitting pages into an additional
>>> IndexTupleData header that is prepended only to non-leaf index tuples.
>>
>> I think that this is a bad idea. We need to implement suffix
>> truncation of internal page index tuples at some point, to make them
>> contain less information from the original leaf page index tuple.
>> That's an important optimization, because it increases fan-in. This
>> seems like a move in the opposite direction.
>
> I see that. I could try to measure average tree depth to gauge the
> impact this has on fan-in.
>
> While it should cut fan-in in half for narrow indexes, half of very
> high is still high. Wide indexes, which are the ones that would suffer
> from poor fan-in, would feel this far less, since the overhead is
> constant.
>
> Even if it does have an impact, I don't see an alternative without
> also implementing suffix truncation. Perhaps I could try to avoid
> adding the leaf tid header when it isn't necessary, though I would
> have to use up the last available flag bit in t_info for that.

So, I did some tests. TL;DR version is, as expected, there is a big
impact on narrow indexes, but I don't believe it's something to be
worried about.

**Begin LONG version** (skip to end of long version if not interested)

Regardless of the fanout (or fanin, if you look at it that way), what
matters is tree depth, so I used pageinspect to check how the depth of
indexes changed after patching. This is all on pristine, recently built
indexes. I will try to measure the same on bloated indexes later on, but
the tests are painfully slow.

I used the following query to inspect all user indexes, since
pageinspect failed to work for toast indexes for some reason (it
probably needed qualified relnames or something like that). Anyway, user
indexes should be enough.

select level, count(*), pg_size_pretty(max(relpages) * 8192.0) as biggest
from pg_class, bt_metap(relname)
where relkind = 'i' and relnamespace = 2200 and relam = 403
group by level order by level desc;

I tested it on both patched and unpatched versions of my test database
(roughly 23GB), pgbench scale 1000 (15GB) and pgbench scale 10000
(146GB).

I believe pgbench is a worst-case example, because it only has very
narrow PK indexes, which are the ones that will suffer the most from
this patch.

The numbers:

mat, patched
level;count;biggest
3;18;"1369 MB"
2;60;"322 MB"
1;26;"808 kB"
0;50;"16 kB"

mat, unpatched
level;count;biggest
3;15;"1367 MB"
2;63;"334 MB"
1;26;"800 kB"
0;49;"16 kB"

pgbench, scale 1000, patched
level;count;biggest
3;1;"2145 MB"
1;2;"456 kB"

pgbench, scale 1000, unpatched
level;count;biggest
3;1;"2142 MB"
1;2;"456 kB"

pgbench, scale 10000, patched
level;count;biggest
3;1;"21 GB"
2;1;"2224 kB"
1;1;"240 kB"

pgbench, scale 10000, unpatched
level;count;biggest
3;1;"21 GB"
1;2;"2208 kB"

So, clearly some indexes are pushed to the next level, but it isn't a
widespread effect - it looks more like the threshold index size for
needing a root split is slightly pushed downwards.

This agrees with the math. The index tuple header is 8 bytes, and the
datums need to be maxaligned, so they will also be at least 8 bytes.
That means the B in B-tree gets cut to 2/3 of its value at the worst,
but only for inner tuples. So if you define a as the resulting depth
ratio,

a * log_B(N/B) = log_(2B/3)(N/B)

then for big enough N (and any realistic B), a < 3/2.

Which means the depth of huge B-trees is increased by at most 50%, but
the effect only begins to be seen at level 2, which is what we have
here, and level 2 for such a narrow index seems to start at about 2GB.
So we're talking about very big indexes already. If level 2 can hold
2GB of index, level 3 can hold around 2GB x 8k / 16 = 1TB of index, and
I have yet to see (even in very big databases) a 1TB index on a PK.

If anyone has such an index, yes, this patch will hurt performance by
33% (4 page reads instead of 3). Bad, but since it only affects very
extreme cases, it doesn't seem so bad.

**End LONG version**

So, that said, I started mocking up a version of the patch that uses a
"has aux data" flag as discussed above.
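For the record, here's the depth arithmetic spelled out (my own
notation, not from the thread; B is the unpatched inner-page fan-out, N
the number of leaf tuples):

% Cutting inner-page fan-out from B to (2/3)B multiplies tree depth by
%   a = log_{(2/3)B}(N) / log_B(N) = ln(B) / (ln(B) - ln(3/2)),
% which is below 3/2 for any B > (3/2)^3 = 27/8, i.e. any real page.
\[
  a \;=\; \frac{\log_{\frac{2}{3}B} N}{\log_{B} N}
    \;=\; \frac{\ln B}{\ln B - \ln\frac{3}{2}}
    \;<\; \frac{3}{2}
  \qquad \text{whenever } B > \left(\tfrac{3}{2}\right)^{3} = \tfrac{27}{8}.
\]

For realistic fan-outs the ratio is much closer to 1 (B = 300 gives
a of about 1.08), which is why the extra level only shows up at the
very large end, matching the measurements above.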
Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location
On Sat, Aug 20, 2016 at 1:05 AM, Amit Kapila wrote:
> On Thu, Aug 18, 2016 at 8:24 AM, Claudio Freire wrote:
>>
>> A couple of points make me uneasy about this patch, yet I can think of
>> no better alternative, so I seek feedback:
>>
>> - introducing a different format for inner index tuples makes for an
>> invasive patch and quite difficult-to-reason code (it's easy to forget
>> whether a page is leaf or inner and that now causes assertion failures
>> or segfaults)
>> - the ascent-descent to check for uniqueness when there are large
>> dead tuple runs could have locking subtleties that escape me. Perhaps
>> there's better ways to solve this.
>
> I have tried to study this part of your patch and it seems a somewhat
> non-trivial and risky part of the proposal.
>
> + } else {
> + /*
> +  * We found the actual first item on another block, so we have to perform
> +  * a two-step search - first half, until our write-locked buffer, then another
> +  * starting from our write-locked buffer.
> +  */
> + LockBuffer(buf, BUFFER_LOCK_UNLOCK);
> + LockBuffer(buf, BT_WRITE);
> +
> + buf = _bt_moveright(rel, buf, natts, itup_scankey, &(itup->t_tid), false,
> + true, stack, BT_WRITE, NULL);
> +
> + first_offset = _bt_binsrch(rel, buf, natts, itup_scankey, NULL, false, NULL);
> +
> + xwait = _bt_check_unique(rel, itup, heapRel, nbuf, buf, first_offset, itup_scankey,
> + checkUnique, &is_unique, &speculativeToken);
> +
> + _bt_relbuf(rel, nbuf);
> + }
>
> The idea for the uniqueness check is that we hold the write lock on the
> buffer/page on which we are trying to operate (we do take read locks
> on the consecutive pages during this check). Here, in your changed
> proposal, you have two buffers (one to which the search led you and
> one buffer previous in the chain) and before checking uniqueness, on
> one of the buffers you seem to have a write lock and on the other you
> have a read lock. This seems to change the way the uniqueness check has
> worked till now; can you explain how this works (can we safely assume
> that all searches for the uniqueness check will lead to the same buffer
> first)?

All uniqueness checks will find the same "nbuf" (read-locked), which is
the first one that matches the key.

They could have different "buf" buffers (the write-locked one). But
since read locks cannot be held while a write lock is held, I believe it
doesn't matter. All uniqueness checks will try to read-lock nbuf unless
the uniqueness check for that insertion is done. When they do, they will
block until the other insertion is finished, and when that happens,
either the insert happened, or it returned an xid to wait for and retry.
If it succeeded, the check attempting the read lock will see the
inserted tuple, return the xid of the earlier transaction, and wait on
it. If it failed, no insert happened because there's a visible tuple
there, and the read lock will succeed, find the visible tuple, and the
insert will fail too.

In essence, it is still the case, I believe, that only one insert can
succeed at a time; all the others will retry. It only happens that with
the patch, the contention will be between a read lock and a write lock,
and not two write locks, and there's a little less contention (some more
work can be done in parallel).

> With this new mechanism, do we have two types of search interfaces,
> where one would work for keys (as now) and the other for key-ctid, or
> will it be just a single interface which works both ways? I think
> either way there are pros and cons.

The patch doesn't add another interface, but the idea is to add it. The
implementation will probably fall on a common function that can do both,
but some index ams can't implement the key-ctid operation (hash
indexes), so there will have to be separate interfaces to properly
represent the separate capabilities (besides, other implementations may
well use completely different paths for the two operations).

> Also, in the thread below you are talking about using the last bit in
> t_info; I want to bring it to your notice that there is a patch of
> mine [1] in which I have used it to avoid changing the on-disk
> compatibility of hash indexes. I am not saying that we should not
> plan it for some other use, but just keep in mind that there are
> other proposals to use it. We can decide what is the best way to
> proceed, if we are aware of all the proposals that want to use it.
>
> [1] - https://www.postgresql.org/message-id/caa4ek1lkq_udism-z2dq6cuvjh3db5fnfnnezzbpsrjw0ha...@mail.gmail.com

I wasn't aware of that particular thread, but there's another (the WARM
one) where there's another proposed use of that bit. IMHO, any use of
the bit that doesn't leave room for further extension would be painting
ourselves into a corner, as discussed above.
Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location
On Sat, Aug 20, 2016 at 1:53 AM, Claudio Freire wrote:
> All uniqueness checks will try to read-lock nbuf
> unless the uniqueness check for that insertion is done.

That should read "all uniqueness checks will try to read-lock the buf
unless the uniqueness check for that insertion is done." (nbuf -> buf)
Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location
On Sat, Aug 20, 2016 at 4:27 AM, Amit Kapila wrote:
>> All uniqueness checks will find the same "nbuf" (read-locked), which
>> is the first one that matches the key.
>>
>> They could have different "buf" buffers (the write-locked one). But
>> since read locks cannot be held while a write lock is held, I believe
>> it doesn't matter. All uniqueness checks will try to read-lock nbuf
>> unless the uniqueness check for that insertion is done. When they do,
>> they will block until the other insertion is finished, and when that
>> happens, either the insert happened, or it returned an xid to wait for
>> and retry. If it succeeded, the check attempting the read lock will
>> see the inserted tuple, return the xid of the earlier transaction, and
>> wait on it. If it failed, no insert happened because there's a visible
>> tuple there, and the read lock will succeed, find the visible tuple,
>> and the insert will fail too.
>>
>> In essence, it is still the case, I believe, that only one insert can
>> succeed at a time; all the others will retry. It only happens that
>> with the patch, the contention will be between a read lock and a write
>> lock, and not two write locks, and there's a little less contention
>> (some more work can be done in parallel).
>>
>>> With this new mechanism, do we have two types of search interfaces,
>>> where one would work for keys (as now) and the other for key-ctid, or
>>> will it be just a single interface which works both ways? I think
>>> either way there are pros and cons.
>>
>> The patch doesn't add another interface, but the idea is to add it.
>> The implementation will probably fall on a common function that can do
>> both
>
> That makes sense, but this means there is a chance that the searches
> could lead to different buffers in case of uniqueness checks (the
> search with key-ctid could lead to a different buffer than the search
> with just the key). I am not clear whether we have a requirement for
> doing this uniqueness check for the key-ctid search API, because as I
> understand it you want to do it mainly for vacuum and WARM tuples.
> Vacuum won't add new tuples, so is this required for WARM tuples?

Well, I'm not really sure what exactly would need to be done when doing
the WARM conditional insert in the case of unique indexes, and that's a
strong motivation not to add the interface for inserts just yet.

Vacuum will only need to delete, and in the case of deletes the
operation would be quite straightforward.
Re: [HACKERS] Block level parallel vacuum WIP
On Tue, Aug 23, 2016 at 8:02 AM, Masahiko Sawada wrote:
>
> 2. Vacuum table and index (after 1 transaction executed)
> 1 worker : 12 sec
> 2 workers : 49 sec
> 3 workers : 54 sec
> 4 workers : 53 sec
>
> As a result of my test, since multiple processes could frequently try
> to acquire the cleanup lock on the same index buffer, execution time of
> parallel vacuum got worse.
> And it seems to be effective only for table vacuum so far, but that is
> not improved as expected (maybe a disk bottleneck).

Not only that, but from your description (I haven't read the patch,
sorry), you'd be scanning the whole index multiple times (once per
worker).
Re: [HACKERS] [Patch] Temporary tables that do not bloat pg_catalog (a.k.a fast temp tables)
On Tue, Aug 23, 2016 at 7:11 PM, Tomas Vondra wrote:
> On 08/22/2016 10:32 PM, Robert Haas wrote:
>>
>> ...
>>
>> 1. The number of tables for which we would need to add a duplicate,
>> unlogged table is formidable. You need pg_attribute, pg_attrdef,
>> pg_constraint, pg_description, pg_type, pg_trigger, pg_rewrite, etc.
>> And the backend changes needed so that we used the unlogged copy for
>> temp tables and the permanent copy for regular tables are probably
>> really large.
>>
>> 2. You can't write to unlogged tables on standby servers, so this
>> doesn't help solve the problem of wanting to use temporary tables on
>> standbys.
>>
>> 3. While it makes creating temporary tables a lighter-weight
>> operation, because you no longer need to write WAL for the catalog
>> entries, there's probably still substantially more overhead than just
>> stuffing them in backend-local RAM. So the performance benefits are
>> probably fairly modest.
>>
>> Overall I feel like the development effort that it would take to make
>> this work would almost certainly be better-expended elsewhere. But of
>> course I'm not in charge of how people who work for other companies
>> spend their time...
>
> Could someone please explain how the unlogged tables are supposed to
> fix the catalog bloat problem, as stated in the initial patch
> submission? We'd still need to insert/delete the catalog rows when
> creating/dropping the temporary tables, causing the bloat. Or is there
> something I'm missing?

Wouldn't more aggressive vacuuming of catalog tables fix the bloat?

Perhaps reserving a worker (or N) to run only on catalog schemas?

That'd be far simpler.
Re: [HACKERS] [Patch] Temporary tables that do not bloat pg_catalog (a.k.a fast temp tables)
On Tue, Aug 23, 2016 at 7:20 PM, Andres Freund wrote:
>> Wouldn't more aggressive vacuuming of catalog tables fix the bloat?
>
> Not really, in my experience, at least not without more drastic vacuum
> changes. The issue is that if you have a single "long running"
> transaction (in some workloads that can even just be a query/xact
> taking 3 minutes), nothing will be cleaned up during that time. If you
> have a few hundred temp tables created per sec, you'll be in trouble
> even then. Not to speak of the case where you have queries taking hours
> (say a backup).

Well, my experience isn't as extreme as that (just a few dozen temp
tables per minute), but when I see bloat in catalog tables it's because
all autovacuum workers are stuck vacuuming huge tables for huge periods
of time (hours or days).

So that's certainly another bloat case to consider.
Re: [HACKERS] [Patch] Temporary tables that do not bloat pg_catalog (a.k.a fast temp tables)
On Tue, Aug 23, 2016 at 7:25 PM, Tomas Vondra wrote:
>>> Could someone please explain how the unlogged tables are supposed to
>>> fix the catalog bloat problem, as stated in the initial patch
>>> submission? We'd still need to insert/delete the catalog rows when
>>> creating/dropping the temporary tables, causing the bloat. Or is
>>> there something I'm missing?
>>
>> Wouldn't more aggressive vacuuming of catalog tables fix the bloat?
>>
>> Perhaps reserving a worker (or N) to run only on catalog schemas?
>>
>> That'd be far simpler.
>
> Maybe, although IIRC the issues with catalog bloat were due to a
> combination of long queries and many temporary tables being
> created/dropped. In that case simply ramping up autovacuum (or even
> having dedicated workers for catalogs) would not really help, due to
> the xmin horizon being blocked by the long-running queries.
>
> Maybe it's an entirely crazy idea due to the wine I drank at the
> dinner, but couldn't we vacuum the temporary table records differently?
> For example, couldn't we just consider them removable as soon as the
> backend that owns them disappears?

Or perhaps go all the way and generalize that to rows that never become
visible outside their parent transaction.

As in, a delete of rows created by the deleting transaction itself could
clean them up - carefully, to avoid voiding indexes and all that, but
more aggressively than regular deletes.
Re: [HACKERS] Re: [Patch] Temporary tables that do not bloat pg_catalog (a.k.a fast temp tables)
On Tue, Aug 23, 2016 at 8:50 PM, Greg Stark wrote:
> On Tue, Aug 23, 2016 at 4:15 PM, Aleksander Alekseev wrote:
>> Frankly I have much more faith in Tom's idea of using a virtual part
>> of the catalog for all temporary tables, i.e. turning all temporary
>> tables into "fast" temporary tables. Instead of introducing a new type
>> of temporary tables that solves the catalog bloating problem and
>> forcing users to rewrite applications, why not just not create the
>> problem in the first place?
>
> Would applications really need to be rewritten? Are they really
> constructing temporary tables where the definition of the table is
> dynamic, not just the content?

Mine is. But it wouldn't be a big deal to adapt.
Re: [HACKERS] [Patch] Temporary tables that do not bloat pg_catalog (a.k.a fast temp tables)
On Tue, Aug 23, 2016 at 9:12 PM, Tomas Vondra wrote:
> On 08/24/2016 12:38 AM, Claudio Freire wrote:
>> On Tue, Aug 23, 2016 at 7:25 PM, Tomas Vondra wrote:
>>>>> Could someone please explain how the unlogged tables are supposed
>>>>> to fix the catalog bloat problem, as stated in the initial patch
>>>>> submission? We'd still need to insert/delete the catalog rows when
>>>>> creating/dropping the temporary tables, causing the bloat. Or is
>>>>> there something I'm missing?
>>>>
>>>> Wouldn't more aggressive vacuuming of catalog tables fix the bloat?
>>>>
>>>> Perhaps reserving a worker (or N) to run only on catalog schemas?
>>>>
>>>> That'd be far simpler.
>>>
>>> Maybe, although IIRC the issues with catalog bloat were due to a
>>> combination of long queries and many temporary tables being
>>> created/dropped. In that case simply ramping up autovacuum (or even
>>> having dedicated workers for catalogs) would not really help, due to
>>> the xmin horizon being blocked by the long-running queries.
>>>
>>> Maybe it's an entirely crazy idea due to the wine I drank at the
>>> dinner, but couldn't we vacuum the temporary table records
>>> differently? For example, couldn't we just consider them removable as
>>> soon as the backend that owns them disappears?
>>
>> Or perhaps go all the way and generalize that to rows that never
>> become visible outside their parent transaction.
>>
>> As in, a delete of rows created by the deleting transaction itself
>> could clean them up - carefully, to avoid voiding indexes and all
>> that, but more aggressively than regular deletes.
>
> Maybe, but I wouldn't be surprised if such a generalization was an
> order of magnitude more complicated - and even the vacuuming changes I
> mentioned are undoubtedly a fair amount of work.

After looking at it from a bird's-eye view, I agree it's conceptually
complex (reading HeapTupleSatisfiesSelf already makes one dizzy). But
other than that, the implementation seems rather simple.

It seems to me that if one figures out that it is safe to do so (a
priori: xmin not committed, xmax is the current transaction), it would
simply be a matter of chasing the HOT chain root, setting all LPs except
the first to LP_UNUSED and the first one to LP_DEAD, as sketched below.

Of course I may be missing a ton of stuff.

> Sadly, I don't see how this might fix the other issues mentioned in
> this thread (e.g. the impossibility of creating temp tables on
> standbys),

No it doesn't :(
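A hedged sketch of that step (a standalone illustration, not
backend-ready code; it assumes the caller already verified the
preconditions above, that the whole chain lives on this page, and that
the appropriate buffer lock is held - the macros are the real ones from
storage/itemid.h, storage/bufpage.h and access/htup_details.h):

#include "postgres.h"
#include "access/htup_details.h"
#include "storage/bufpage.h"
#include "storage/itemid.h"

static void
prune_self_only_chain(Page page, OffsetNumber rootoff)
{
    OffsetNumber off = rootoff;

    for (;;)
    {
        ItemId          lp = PageGetItemId(page, off);
        HeapTupleHeader htup;
        OffsetNumber    nextoff;
        bool            hot_updated;

        if (!ItemIdIsNormal(lp))
            break;

        /* Capture the chain link before clobbering the line pointer. */
        htup = (HeapTupleHeader) PageGetItem(page, lp);
        hot_updated = HeapTupleHeaderIsHotUpdated(htup);
        nextoff = ItemPointerGetOffsetNumber(&htup->t_ctid);

        if (off == rootoff)
            ItemIdSetDead(lp);   /* root stays dead so index entries don't dangle */
        else
            ItemIdSetUnused(lp); /* later members can be reclaimed outright */

        if (!hot_updated || nextoff == off)
            break;               /* reached the end of the HOT chain */
        off = nextoff;
    }
}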
Re: [HACKERS] [Patch] Temporary tables that do not bloat pg_catalog (a.k.a fast temp tables)
On Wed, Aug 24, 2016 at 2:04 AM, Alvaro Herrera wrote:
> Claudio Freire wrote:
>
>> After looking at it from a bird's-eye view, I agree it's conceptually
>> complex (reading HeapTupleSatisfiesSelf already makes one dizzy).
>>
>> But other than that, the implementation seems rather simple. It seems
>> to me that if one figures out that it is safe to do so (a priori: xmin
>> not committed, xmax is the current transaction), it would simply be a
>> matter of chasing the HOT chain root, setting all LPs except the first
>> to LP_UNUSED and the first one to LP_DEAD.
>>
>> Of course I may be missing a ton of stuff.
>
> What you seem to be missing is that rows corresponding to temp tables
> are not "visible to their own transaction only". The rows are valid
> after the transaction is gone; what makes the tables temporary is the
> fact that they are in a temporary schema. And what makes them invisible
> to one backend is the fact that they are in the temporary schema for
> another backend. Not that they are uncommitted.

Yeah, I was thinking of "on commit drop" behavior, but granted, there
are all the others.
Re: [HACKERS] increasing the default WAL segment size
On Wed, Aug 24, 2016 at 10:31 PM, Robert Haas wrote:
> 1. Transaction rates are vastly higher these days. In 1999, I think
> we were still limited to ~2^32 transactions during the entire lifetime
> of the server; transaction ID wraparound hadn't been invented yet.[1]
> Today, some installations do that many write transactions in under a
> week. The practical consequence of this is that WAL files fill up in
> extremely short periods of time. Some users generate multiple
> terabytes of WAL per day, which means they are generating - and very
> likely archiving - WAL files at a rate of greater than 1 per second!
> That poses multiple problems. For example, if your archive command
> happens to involve ssh, you might run into trouble because of this
> sort of thing:
>
> [rhaas pgsql]$ /usr/bin/time ssh hydra true
>         1.57 real         0.00 user         0.00 sys

...

> Considering those three factors, I think we should consider pushing
> the default value up somewhat higher for v10. Reverting to the 64MB
> size that we had prior to 47937403676d913c0e740eec6b85113865c6c8ab
> sounds pretty reasonable. Users with really high transaction rates
> might even prefer a higher value (e.g. 256MB, 1GB) but that's hardly
> practical for small installs given our default of max_wal_size = 1GB.
> Possibly it would make sense for this to be configurable at initdb
> time instead of requiring a recompile; we probably don't save any
> significant number of cycles by compiling this into the server.

FWIW, +1.

We're already hurt by the small segments due to a similar phenomenon as
in the ssh case: TCP slow start. Designing the archive/recovery command
to work around TCP slow start is quite complex, and bigger segments
would just be a better thing.

Not to mention that bigger segments compress better.
Re: [HACKERS] increasing the default WAL segment size
On Thu, Aug 25, 2016 at 12:21 PM, Simon Riggs wrote:
> On 25 August 2016 at 02:31, Robert Haas wrote:
>
>> Furthermore, there is an enforced, synchronous fsync at the end of
>> every segment, which actually does hurt performance on write-heavy
>> workloads.[2] Of course, if that were the only reason to consider
>> increasing the segment size, it would probably make more sense to just
>> try to push that extra fsync into the background, but that's not
>> really the case. From what I hear, the gigantic number of files is a
>> bigger pain point.
>
> I think we should fully describe the problem before finding a solution.
>
> This is too big a change to just tweak a value without discussing the
> actual issue.
>
> And if the problem is as described, how can a change of x4 be enough
> to make it worth the pain of change? I think you're already admitting
> it can't be worth it by discussing initdb configuration.
>
> If we do have the pain of change, should we also consider making WAL
> files variable length? What do we gain by having the files all the
> same size? ISTM better to have WAL files that vary in length up to 1GB
> in size.
>
> (This is all about XLOG_SEG_SIZE; I presume XLOG_BLCKSZ can stay as it
> is, right?)

Avoiding variable sizes does avoid some failure modes on the filesystem
side in the face of crashes/power loss. So making them variable size,
while possible, wouldn't be simple at all (it would involve figuring out
how filesystems behave when a crash hits while a file is changing in
size).
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Wed, Aug 31, 2016 at 1:45 PM, Pavan Deolasee wrote:
> We discussed a few ideas to address the "Duplicate Scan" problem. For
> example, we can teach Index AMs to discard any duplicate (key, CTID)
> insert requests. Or we could guarantee uniqueness by only allowing
> updates in one lexical order. While the former is a more complete
> solution to avoid duplicate entries, searching through a large number
> of keys for non-unique indexes could be a drag on performance. The
> latter approach may not be sufficient for many workloads. Also,
> tracking increment/decrement for many indexes will be non-trivial.
>
> There is another problem with allowing many index entries pointing to
> the same WARM chain. It will be non-trivial to know how many index
> entries are currently pointing to the WARM chain, and index/heap
> vacuum will throw up more challenges.
>
> Instead, what I would like to propose and the patch currently
> implements is to restrict WARM updates to once per chain. So the first
> non-HOT update to a tuple or a HOT chain can be a WARM update. The
> chain can further be HOT updated any number of times. But it can no
> further be WARM updated. This might look too restrictive, but it can
> still bring down the number of regular updates by almost 50%. Further,
> if we devise a strategy to convert a WARM chain back to a HOT chain,
> it can again be WARM updated. (This part is currently not
> implemented.) A good side effect of this simple strategy is that we
> know there can be at most two index entries pointing to any given WARM
> chain.

We should probably think about coordinating with my btree patch.

From the description above, the strategy is quite readily "upgradable"
to one in which the indexam discards duplicate (key, ctid) pairs, and
that would remove the limitation of only one WARM update... right?
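To spell out what I mean by "readily upgradable" (a sketch with
hypothetical names, not the patch's code): with the index ordered by
(key, heap TID), the descent lands on the exact slot a duplicate would
occupy, so discarding a duplicate insert request becomes a single
equality test there instead of a scan over the whole equal-key run:

/*
 * Sketch only.  keycmp_at_slot is the result of comparing the new key
 * against the tuple at the insertion point; the TIDs are the heap
 * pointers of that tuple and of the incoming entry.
 */
typedef struct
{
    unsigned int   block;
    unsigned short offset;
} TidSketch;

static int
insert_is_duplicate(int keycmp_at_slot,
                    const TidSketch *slot_tid,
                    const TidSketch *new_tid)
{
    return keycmp_at_slot == 0 &&
           slot_tid->block == new_tid->block &&
           slot_tid->offset == new_tid->offset;
}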
[HACKERS] Vacuum: allow usage of more than 1GB of work mem
The attached patch allows setting maintenance_work_mem or
autovacuum_work_mem higher than 1GB (and have it be effective), by
turning the allocation of the dead_tuples array into a huge allocation.

This results in fewer index scans for heavily bloated tables, and could
be a lifesaver in many situations (in particular, the situation I'm
living right now in production, where we don't have enough room for a
vacuum full, and have just deleted 75% of a table to make room, but have
to rely on regular lazy vacuum to free the space).

The patch also makes vacuum free the dead_tuples before starting
truncation. It didn't seem necessary to hold onto it beyond that point,
and it might help give the OS more cache, especially if work mem is
configured very high to avoid multiple index scans.

Tested with pgbench scale 4000 after deleting the whole
pgbench_accounts table; it seemed to work fine.

From feddd4f58c30ca6a446ce0eed343f2562030f0e5 Mon Sep 17 00:00:00 2001
From: Claudio Freire
Date: Fri, 2 Sep 2016 23:21:01 -0300
Subject: [PATCH 1/2] Vacuum: allow using more than 1GB work mem

Turns the palloc for dead_tuples into a huge allocation to allow
using more than 1GB worth of dead_tuples, saving index scans on
heavily bloated tables
---
 src/backend/commands/vacuumlazy.c | 19 +--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 231e92d..dbe2040 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -155,6 +155,7 @@ static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats);
 static BlockNumber count_nondeletable_pages(Relation onerel,
 					   LVRelStats *vacrelstats);
 static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks);
+static void lazy_space_dealloc(LVRelStats *vacrelstats);
 static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
 					   ItemPointer itemptr);
 static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
@@ -1310,6 +1311,8 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 	for (i = 0; i < nindexes; i++)
 		lazy_cleanup_index(Irel[i], indstats[i], vacrelstats);
 
+	lazy_space_dealloc(vacrelstats);
+
 	/* If no indexes, make log report that lazy_vacuum_heap would've made */
 	if (vacuumed_pages)
 		ereport(elevel,
@@ -1952,7 +1955,7 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
 	{
 		maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData);
 		maxtuples = Min(maxtuples, INT_MAX);
-		maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData));
+		maxtuples = Min(maxtuples, MaxAllocHugeSize / sizeof(ItemPointerData));
 
 		/* curious coding here to ensure the multiplication can't overflow */
 		if ((BlockNumber) (maxtuples / LAZY_ALLOC_TUPLES) > relblocks)
@@ -1969,7 +1972,19 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
 	vacrelstats->num_dead_tuples = 0;
 	vacrelstats->max_dead_tuples = (int) maxtuples;
 	vacrelstats->dead_tuples = (ItemPointer)
-		palloc(maxtuples * sizeof(ItemPointerData));
+		MemoryContextAllocHuge(CurrentMemoryContext, maxtuples * sizeof(ItemPointerData));
+}
+
+/*
+ * lazy_space_dealloc - free lazy vacuum work mem
+ */
+static void
+lazy_space_dealloc(LVRelStats *vacrelstats)
+{
+	if (vacrelstats->dead_tuples != NULL) {
+		pfree(vacrelstats->dead_tuples);
+		vacrelstats->dead_tuples = NULL;
+	}
 }
 
 /*
-- 
2.6.6
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
On Sun, Sep 4, 2016 at 3:46 AM, Simon Riggs wrote:
> On 3 September 2016 at 04:25, Claudio Freire wrote:
>> The patch also makes vacuum free the dead_tuples before starting
>> truncation. It didn't seem necessary to hold onto it beyond that
>> point, and it might help give the OS more cache, especially if work
>> mem is configured very high to avoid multiple index scans.
>
> How long does that part ever take? Is there any substantial gain from
> this?
>
> Let's discuss that as a potential second patch.

In the test case I mentioned, it takes longer than the vacuum part
itself.

Other than freeing RAM there's no gain. I didn't measure any speed
difference while testing, but that's probably because the backward scan
doesn't benefit from the cache, though other activity on the system
might. So, depending on the workload on the server, the extra available
RAM may be a significant gain on its own, or not. It just didn't seem
there was a reason to keep that RAM reserved, especially after making it
a huge allocation.

I'm fine either way. I can remove that from the patch or leave it as-is.
It just seemed like a good idea at the time.
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
On Mon, Sep 5, 2016 at 11:50 AM, Claudio Freire wrote:
> On Sun, Sep 4, 2016 at 3:46 AM, Simon Riggs wrote:
>> On 3 September 2016 at 04:25, Claudio Freire wrote:
>>> The patch also makes vacuum free the dead_tuples before starting
>>> truncation. It didn't seem necessary to hold onto it beyond that
>>> point, and it might help give the OS more cache, especially if work
>>> mem is configured very high to avoid multiple index scans.
>>
>> How long does that part ever take? Is there any substantial gain from
>> this?
>>
>> Let's discuss that as a potential second patch.
>
> In the test case I mentioned, it takes longer than the vacuum part
> itself.
>
> Other than freeing RAM there's no gain. I didn't measure any speed
> difference while testing, but that's probably because the backward
> scan doesn't benefit from the cache, though other activity on the
> system might. So, depending on the workload on the server, the extra
> available RAM may be a significant gain on its own, or not. It just
> didn't seem there was a reason to keep that RAM reserved, especially
> after making it a huge allocation.
>
> I'm fine either way. I can remove that from the patch or leave it
> as-is. It just seemed like a good idea at the time.

Rebased and split versions attached.

From 7b47b59d89bf3edaeb11c4ffe3660f3d5b7b86de Mon Sep 17 00:00:00 2001
From: Claudio Freire
Date: Fri, 2 Sep 2016 23:21:01 -0300
Subject: [PATCH 1/2] Vacuum: allow using more than 1GB work mem

Turns the palloc for dead_tuples into a huge allocation to allow
using more than 1GB worth of dead_tuples, saving index scans on
heavily bloated tables
---
 src/backend/commands/vacuumlazy.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 231e92d..2c98da4 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -1952,7 +1952,7 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
 	{
 		maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData);
 		maxtuples = Min(maxtuples, INT_MAX);
-		maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData));
+		maxtuples = Min(maxtuples, MaxAllocHugeSize / sizeof(ItemPointerData));
 
 		/* curious coding here to ensure the multiplication can't overflow */
 		if ((BlockNumber) (maxtuples / LAZY_ALLOC_TUPLES) > relblocks)
@@ -1969,7 +1969,7 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
 	vacrelstats->num_dead_tuples = 0;
 	vacrelstats->max_dead_tuples = (int) maxtuples;
 	vacrelstats->dead_tuples = (ItemPointer)
-		palloc(maxtuples * sizeof(ItemPointerData));
+		MemoryContextAllocHuge(CurrentMemoryContext, maxtuples * sizeof(ItemPointerData));
 }
 
 /*
-- 
2.6.6

From 24fa0815d915cc31ae455ef60585f54a33dbd09c Mon Sep 17 00:00:00 2001
From: Claudio Freire
Date: Fri, 2 Sep 2016 23:21:01 -0300
Subject: [PATCH 2/2] Vacuum: free dead_tuples sooner

Frees the dead_tuples array sooner, decreasing impact of huge settings
for autovacuum_work_mem or maintenance_work_mem during lazy vacuum.
---
 src/backend/commands/vacuumlazy.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 2c98da4..dbe2040 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -155,6 +155,7 @@ static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats);
 static BlockNumber count_nondeletable_pages(Relation onerel,
 					   LVRelStats *vacrelstats);
 static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks);
+static void lazy_space_dealloc(LVRelStats *vacrelstats);
 static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
 					   ItemPointer itemptr);
 static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
@@ -1310,6 +1311,8 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 	for (i = 0; i < nindexes; i++)
 		lazy_cleanup_index(Irel[i], indstats[i], vacrelstats);
 
+	lazy_space_dealloc(vacrelstats);
+
 	/* If no indexes, make log report that lazy_vacuum_heap would've made */
 	if (vacuumed_pages)
 		ereport(elevel,
@@ -1973,6 +1976,18 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
 }
 
 /*
+ * lazy_space_dealloc - free lazy vacuum work mem
+ */
+static void
+lazy_space_dealloc(LVRelStats *vacrelstats)
+{
+	if (vacrelstats->dead_tuples != NULL) {
+		pfree(vacrelstats->dead_tuples);
+		vacrelstats->dead_tuples = NULL;
+	}
+}
+
+/*
  * lazy_record_dead_tuple - remember one deletable tuple
  */
 static void
-- 
2.6.6
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
On Mon, Sep 5, 2016 at 5:36 PM, Simon Riggs wrote:
> On 5 September 2016 at 15:50, Claudio Freire wrote:
>> On Sun, Sep 4, 2016 at 3:46 AM, Simon Riggs wrote:
>>> On 3 September 2016 at 04:25, Claudio Freire wrote:
>>>> The patch also makes vacuum free the dead_tuples before starting
>>>> truncation. It didn't seem necessary to hold onto it beyond that
>>>> point, and it might help give the OS more cache, especially if work
>>>> mem is configured very high to avoid multiple index scans.
>>>
>>> How long does that part ever take? Is there any substantial gain from
>>> this?
>>>
>>> Let's discuss that as a potential second patch.
>>
>> In the test case I mentioned, it takes longer than the vacuum part
>> itself.
>
> Please provide a test case and timings so we can see what's happening.

The referenced test case is the one I mentioned in the OP:

- createdb pgbench
- pgbench -i -s 4000 pgbench
- psql pgbench -c 'delete from pgbench_accounts;'
- vacuumdb -v -t pgbench_accounts pgbench

fsync=off, autovacuum=off, maintenance_work_mem=4GB

From what I remember, it used ~2.7GB of RAM up until the truncate phase,
where it freed it. It performed a single index scan over the PK.

I don't remember timings, and I didn't take them, so I'll have to repeat
the test to get them. It takes all day and makes my laptop unusably
slow, so I'll post them later, but they're not very interesting. The
only interesting bit is that it does a single index scan instead of
several, which on TB-or-more tables is kinda nice.

Btw, without a further patch to prefetch pages on the backward scan for
truncate, my patience ran out before it finished truncating. I haven't
submitted that patch because there was an identical patch in an older
thread that was discussed and more or less rejected, since it slightly
penalized SSDs. While I'm confident my version of the patch is a little
easier on SSDs, I haven't got an SSD at hand to confirm, so that patch
is still waiting on the queue until I get access to an SSD.

The tests I'll make include that patch, so the timing regarding truncate
won't be representative of HEAD (I really can't afford to run the tests
on a scale=4000 pgbench without that patch; it crawls, and smaller
scales don't fill the dead_tuples array).
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
On Sun, Sep 4, 2016 at 8:10 PM, Jim Nasby wrote: > On 9/4/16 1:46 AM, Simon Riggs wrote: >>> > The patch also makes vacuum free the dead_tuples before starting >>> > truncation. It didn't seem necessary to hold onto it beyond that >>> > point, and it might help give the OS more cache, especially if work >>> > mem is configured very high to avoid multiple index scans. >> >> How long does that part ever take? Is there any substantial gain from >> this? > > If you're asking about how long the dealloc takes, we're going to have to > pay that cost anyway when the context gets destroyed/reset, no? Doing that > sooner rather than later certainly seems like a good idea since we've seen > that truncation can take quite some time. Might as well give the memory back > to the OS ASAP.

AFAIK, except on debug builds where it has to memset the whole thing, the cost is constant (unrelated to the allocated block size), so it should be rather small in this context.

On Tue, Sep 6, 2016 at 1:42 PM, Robert Haas wrote: > On Sat, Sep 3, 2016 at 8:55 AM, Claudio Freire wrote: >> The attached patch allows setting maintenance_work_mem or >> autovacuum_work_mem higher than 1GB (and be effective), by turning the >> allocation of the dead_tuples into a huge allocation. >> >> This results in fewer index scans for heavily bloated tables, and >> could be a lifesaver in many situations (in particular, the situation >> I'm living right now in production, where we don't have enough room >> for a vacuum full, and have just deleted 75% of a table to make room >> but have to rely on regular lazy vacuum to free the space). >> >> The patch also makes vacuum free the dead_tuples before starting >> truncation. It didn't seem necessary to hold onto it beyond that >> point, and it might help give the OS more cache, especially if work >> mem is configured very high to avoid multiple index scans. >> >> Tested with pgbench scale 4000 after deleting the whole >> pgbench_accounts table, seemed to work fine. > > The problem with this is that we allocate the entire amount of > maintenance_work_mem even when the number of actual dead tuples turns > out to be very small. That's not so bad if the amount of memory we're > potentially wasting is limited to ~1 GB, but it seems pretty dangerous > to remove the 1 GB limit, because somebody might have > maintenance_work_mem set to tens or hundreds of gigabytes to speed > index creation, and allocating that much space for a VACUUM that > encounters 1 dead tuple does not seem like a good plan. > > What I think we need to do is make some provision to initially > allocate only a small amount of memory and then grow the allocation > later if needed. For example, instead of having > vacrelstats->dead_tuples be declared as ItemPointer, declare it as > ItemPointer * and allocate the array progressively in segments. I'd > actually argue that the segment size should be substantially smaller > than 1 GB, like say 64MB; there are still some people running systems > which are small enough that allocating 1 GB when we may need only 6 > bytes can drive the system into OOM.

Growing a single allocation in place would, however, incur the cost of having to copy the whole GB-sized chunk every time it's expanded. It wouldn't be cheap. I've monitored the vacuum as it runs, and the OS doesn't map the whole block unless it's touched, which it isn't until dead tuples are found.
Surely, if overcommit is disabled (as it should be), it could exhaust the virtual address space if set very high, but it wouldn't really use the memory unless it's needed; it would merely reserve it. To fix that, rather than repalloc the whole thing, dead_tuples would have to be an ItemPointer** of sorted chunks. That'd be a significantly more complex patch, but at least it wouldn't incur the memcpy. I could attempt that, but I don't see the difference between vacuum and create index in this case. Both could allocate a huge chunk of the virtual address space if maintenance_work_mem says so, both proportional to the size of the table. I can't see how that could take any DBA by surprise. A sensible compromise could be dividing maintenance_work_mem by autovacuum_max_workers when used in autovacuum, as is done for cost limits, to protect those that set both rather high.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
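[To make the ItemPointer** idea a bit more concrete, here is a minimal self-contained sketch of a segmented TID store. All names and the simplified TID type are illustrative stand-ins, not code from any submitted patch; fixed-size segments are assumed here for simplicity, though a later message in this thread describes exponentially growing chunks. Lookup scans the (few) segments linearly and binary-searches within the right one.]

```c
#include <stdlib.h>
#include <stdbool.h>

typedef struct
{
	unsigned int	block;		/* stand-in for ItemPointerData's block */
	unsigned short	offset;		/* and its offset number */
} Tid;

typedef struct
{
	Tid	   *tuples;				/* one sorted chunk, allocated on first use */
	size_t	ntuples;
} TidSegment;

typedef struct
{
	TidSegment *segs;
	size_t		nsegs;
} TidMultiArray;

static int
tid_cmp(const void *a, const void *b)
{
	const Tid  *ta = a;
	const Tid  *tb = b;

	if (ta->block != tb->block)
		return (ta->block < tb->block) ? -1 : 1;
	if (ta->offset != tb->offset)
		return (ta->offset < tb->offset) ? -1 : 1;
	return 0;
}

/*
 * Vacuum collects TIDs in physical order, so segments are themselves
 * ordered: every TID in segment i sorts before every TID in segment
 * i+1. A linear scan thus finds the candidate segment cheaply, and
 * bsearch finishes the job inside it.
 */
static bool
tid_multiarray_member(const TidMultiArray *ma, const Tid *key)
{
	for (size_t i = 0; i < ma->nsegs; i++)
	{
		const TidSegment *seg = &ma->segs[i];

		if (seg->ntuples == 0)
			continue;
		if (tid_cmp(key, &seg->tuples[seg->ntuples - 1]) > 0)
			continue;			/* key sorts past this whole segment */
		return bsearch(key, seg->tuples, seg->ntuples,
					   sizeof(Tid), tid_cmp) != NULL;
	}
	return false;
}
```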
Re: [HACKERS] Re: [COMMITTERS] pgsql: Make initdb's suggested "pg_ctl start" command line more reliabl
On Tue, Sep 6, 2016 at 2:08 PM, Tom Lane wrote: >> The not-quoting-if-not-needed doesn't appear to do anything useful for me: >> 'pg-install/bin/pg_ctl' -D 'pg-install/var/data' -l logfile start > > Dash is considered a character that needs quoting. It might be possible > to avoid that if we could be certain that appendShellString's output would > never be placed in a spot where it could be taken for a switch, but that > seems like a large assumption to me. Or maybe we could treat it as > forcing quotes only if it starts the string; but I don't personally care > enough to write the code for that. Feel free if you want to. Wouldn't it be taken for a switch even with quoting? Quoting "-D" seems to work fine, which would suggest the above is true. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
On Tue, Sep 6, 2016 at 2:39 PM, Robert Haas wrote: >> I could attempt that, but I don't see the difference between >> vacuum and create index in this case. Both could allocate a huge chunk >> of the virtual address space if maintenance_work_mem says so, both >> proportional to the size of the table. I can't see how that could take >> any DBA by surprise. > > Really? CREATE INDEX isn't going to allocate more storage space than > the size of the data actually being sorted, because tuplesort.c is > smart about that kind of thing. But VACUUM will very happily allocate > vastly more memory than the number of dead tuples. It is thankfully > smart enough not to allocate more storage than the number of line > pointers that could theoretically exist in a relation of the given > size, but that only helps for very small relations. In a large > relation that divergence between the amount of storage space that > could theoretically be needed and the amount that is actually needed > is likely to be extremely high. 1 TB relation = 2^27 blocks, each of > which can contain MaxHeapTuplesPerPage dead line pointers. On my > system, MaxHeapTuplesPerPage is 291, so that's 291 * 2^27 possible > dead line pointers, which at 6 bytes each is 291 * 6 * 2^27 = ~218GB, > but the expected number of dead line pointers is much less than that. > Even if this is a vacuum triggered by autovacuum_vacuum_scale_factor > and you're using the default of 0.2 (probably too high for such a > large table), assuming there are about 60 tuples per page (which is > what I get with pgbench -i) the table would have about 2^27 * 60 = 7.7 > billion tuples of which 1.5 billion would be dead, meaning we need > about 9-10GB of space to store all of those dead tuples. Allocating > as much as 218GB when we need 9-10GB is going to sting, and I don't > see how you will get a comparable distortion with CREATE INDEX. I > might be missing something, though.

CREATE INDEX could also allocate 218GB; you just need to index enough columns and you'll get that. Aside from the fact that CREATE INDEX will only allocate what is going to be used, while VACUUM will overallocate, the potential to fully allocate the amount given is still there for both cases.

> There's no real issue when there's only one process running on the > system at a time. If the user set maintenance_work_mem to an amount > of memory that he can't afford to pay even once, then that's simple > misconfiguration and it's not really our problem. The issue is that > when there are 3 or potentially more VACUUM processes running plus a > CREATE INDEX or two at the same time. If you set maintenance_work_mem > to a value that is large enough to make the CREATE INDEX run fast, now > with your patch that is also going to cause each VACUUM process to > gobble up lots of extra memory that it probably doesn't need, and now > you may well start to get failures. I've seen this happen even with > the current 1GB limit, though you need a pretty small system - e.g. > 8GB RAM - for it to be a problem. I think it is really really likely > to cause big problems for us if we dramatically increase that limit > without making the allocation algorithm smarter.

OK. A pity it will invalidate all the testing already done, though (I was almost done with the testing). I guess I'll send the results anyway.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
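[The quoted worst-case arithmetic is easy to reproduce; a few lines of C check it under the same assumptions stated above (8kB pages, MaxHeapTuplesPerPage = 291, 6-byte TIDs, roughly 60 tuples per page, 20% dead):]

```c
#include <stdio.h>

int
main(void)
{
	unsigned long long blocks = 1ULL << 27;			/* 1TB of 8kB pages */
	unsigned long long worst = 291ULL * 6 * blocks;	/* every slot dead */
	unsigned long long dead = blocks * 60 / 5;		/* 20% of ~60 tuples/page */
	double		gib = 1024.0 * 1024 * 1024;

	printf("worst case: %.0f GiB\n", worst / gib);		/* prints ~218 GiB */
	printf("expected:   %.0f GiB\n", dead * 6 / gib);	/* prints ~9 GiB */
	return 0;
}
```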
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
On Tue, Sep 6, 2016 at 3:11 PM, Tom Lane wrote: > We could get around (1) by something like Robert's idea of segmented > allocation, but TBH I've seen nothing on this thread to make me think > it's necessary or would even result in any performance improvement > at all. The bigger we make that array, the worse index-cleaning > is going to perform, and complicating the data structure will add > another hit on top of that.

I wouldn't be so sure. I've seen cases where two binary searches were faster than a single binary search, especially when working with humongous arrays like this TID array, because touching fewer (memory) pages per search pays off considerably. I'd try before giving up on the idea. The test results (which I'll post in a second) do lend credence to your expectation that making the array bigger/more complex impacts index scan performance. It's still faster than scanning several times, though.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
On Tue, Sep 6, 2016 at 3:45 AM, Simon Riggs wrote: > On 5 September 2016 at 21:58, Claudio Freire wrote: >>>>> How long does that part ever take? Is there any substantial gain from >>>>> this? >> Btw, without a further patch to prefetch pages on the backward scan >> for truncate, my patience ran out before it finished >> truncating. I haven't submitted that patch because there was an >> identical patch in an older thread that was discussed and more or less >> rejected since it slightly penalized SSDs. > > OK, that's enough context. Sorry for being forgetful on that point. > > Please post that new patch also.

Attached.

On Mon, Sep 5, 2016 at 5:58 PM, Claudio Freire wrote: > On Mon, Sep 5, 2016 at 5:36 PM, Simon Riggs wrote: >> On 5 September 2016 at 15:50, Claudio Freire wrote: >>> On Sun, Sep 4, 2016 at 3:46 AM, Simon Riggs wrote: >>>> On 3 September 2016 at 04:25, Claudio Freire >>>> wrote: >>>>> The patch also makes vacuum free the dead_tuples before starting >>>>> truncation. It didn't seem necessary to hold onto it beyond that >>>>> point, and it might help give the OS more cache, especially if work >>>>> mem is configured very high to avoid multiple index scans. >>>> >>>> How long does that part ever take? Is there any substantial gain from this? >>>> >>>> Lets discuss that as a potential second patch. >>> >>> In the test case I mentioned, it takes longer than the vacuum part itself. >> >> Please provide a test case and timings so we can see what's happening.

Robert made a strong point for a change in the approach, so the information below is applicable only to the old patch (to be rewritten). I'm sending this merely to document the testing done; it will be a while before I can get the proposed design running and tested.

> The referenced test case is the one I mentioned in the OP: > > - createdb pgbench > - pgbench -i -s 4000 pgbench > - psql pgbench -c 'delete from pgbench_accounts;' > - vacuumdb -v -t pgbench_accounts pgbench > > fsync=off, autovacuum=off, maintenance_work_mem=4GB > > From what I remember, it used ~2.7GB of RAM up until the truncate > phase, where it freed it. It performed a single index scan over the > PK. > > I don't remember timings, and I didn't take them, so I'll have to > repeat the test to get them. It takes all day and makes my laptop > unusably slow, so I'll post them later, but they're not very > interesting. The only interesting bit is that it does a single index > scan instead of several, which on TB-or-more tables is kinda nice.

So, on to the test results. For context, during setup the delete took 52m 50s of real time (measured with: time psql pgbench -c 'delete from pgbench_accounts;'). During the delete, my I/O looked on average like the following, which should give an indication of what my I/O subsystem is capable of (not much, granted):

Device:  rrqm/s  wrqm/s    r/s    w/s  rMB/s  wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda        0.47    5.27  35.53  77.42  17.58  42.95  1097.51   145.22 1295.23   33.47 1874.36   8.85 100.00

Since it's a 5k RPM laptop drive, it's rather slow on IOPS, and since I'm using the defaults for shared buffers and checkpoints, write throughput isn't stellar either. But that's not the point of the test anyway; it's just for context. The hardware is an HP Envy laptop with a 1TB 5.4k RPM hard drive, 12GB RAM, a Core i7-4722HQ, and no weird performance tweaking of any kind (i.e., CPU scaling left intact). The system was not dedicated, of course, being a laptop, but it had little else going on while the test was running. 
Given the size of the test, I don't believe there's any chance concurrent activity could invalidate the results. The timing for setup was comparable with both versions (patched and unpatched), so I'm reporting the patched times only. The vacuum phase:

patched:

$ vacuumdb -v -t pgbench_accounts pgbench
INFO: vacuuming "public.pgbench_accounts"
INFO: scanned index "pgbench_accounts_pkey" to remove 4 row versions
DETAIL: CPU 12.46s/48.76u sec elapsed 566.47 sec.
INFO: "pgbench_accounts": removed 4 row versions in 6557378 pages
DETAIL: CPU 56.68s/28.90u sec elapsed 1872.76 sec.
INFO: index "pgbench_accounts_pkey" now contains 0 row versions in 1096762 pages
DETAIL: 4 index row versions were removed. 1092896 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.47 sec.
INFO: "pgbench_accounts": found 4 remova
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Mon, Aug 15, 2016 at 9:33 PM, Peter Geoghegan wrote: > The patch is intended to be applied on top of parallel B-Tree patches > 0001-* and 0002-* [1]. I happened to test it with parallelism, but > these are all independently useful, and will be entered as a separate > CF entry (perhaps better to commit the earlier two patches first, to > avoid merge conflicts). I'm optimistic that we can get those 3 patches > in the series out of the way early, without blocking on discussing > parallel sort.

Applied patches 1 and 2; builds fine, regression tests run fine. That was a prerequisite to reviewing patch 3 (which I'm going to do below), so I thought I might as well report on that tidbit of info, fwiw.

> The patch makes tuplesort merging shift down and displace the root > tuple with the tape's next preread tuple, rather than compacting and > then inserting into the heap anew. This approach to maintaining the > heap as tuples are returned to caller will always produce fewer > comparisons overall. The new approach is also simpler. We were already > shifting down to compact the heap within the misleadingly named [2] > function tuplesort_heap_siftup() -- why not instead just use the > caller tuple (the tuple that we currently go on to insert) when > initially shifting down (not the heap's preexisting last tuple, which > is guaranteed to go straight to the leaf level again)? That way, we > don't need to enlarge the heap at all through insertion, shifting up, > etc. We're done, and are *guaranteed* to have performed less work > (fewer comparisons and swaps) than with the existing approach (this is > the reason for my optimism about getting this stuff out of the way > early).

Patch 3 applies fine to git master as of 25794e841e5b86a0f90fac7f7f851e5d950e51e2 (on top of patches 1 and 2). It builds without warnings on gcc 4.8.5 AFAICT, and the regression test suite runs without issues as well. The patch lacks any new tests, but the changed code paths seem covered sufficiently by existing tests. A little bit of fuzzing on the patch itself, like reverting some key changes or flipping some key comparisons, induces test failures as it should, mostly in cluster.

The logic in tuplesort_heap_root_displace seems sound, except:

+	 */
+	memtuples[i] = memtuples[imin];
+	i = imin;
+	}
+
+	Assert(state->memtupcount > 1 || imin == 0);
+	memtuples[imin] = *newtup;
+}

Why that assert? Wouldn't it make more sense to Assert(imin < n)?

In the meantime, I'll go and do some perf testing. Assuming the speedup is realized during testing, LGTM.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
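[For readers without the patch at hand: tuplesort_heap_root_displace replaces the heap's root with a new element and sifts it down in a single pass, instead of compacting (delete-top) and then inserting. A generic hedged sketch of that operation on a plain int min-heap follows; this is illustrative only, not the patch's code.]

```c
#include <stddef.h>

/*
 * Replace the root of a binary min-heap of n ints with newval, then
 * sift it down to restore the heap property. One downward pass; the
 * delete-then-insert alternative pays for a full root-to-leaf path
 * plus a leaf-to-root bubble-up.
 */
static void
heap_root_displace(int *heap, size_t n, int newval)
{
	size_t		i = 0;

	for (;;)
	{
		size_t		left = 2 * i + 1;
		size_t		right = left + 1;
		size_t		smallest = i;
		int			smallval = newval;

		if (left < n && heap[left] < smallval)
		{
			smallest = left;
			smallval = heap[left];
		}
		if (right < n && heap[right] < smallval)
		{
			smallest = right;
			smallval = heap[right];
		}
		if (smallest == i)
			break;
		heap[i] = heap[smallest];	/* pull the smaller child up into the hole */
		i = smallest;
	}
	heap[i] = newval;				/* drop newval into the final hole */
}
```

[The win is that newval only travels as far down as it must; a tuple that belongs near the root costs just a comparison or two.]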
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Tue, Sep 6, 2016 at 8:28 PM, Peter Geoghegan wrote: > On Tue, Sep 6, 2016 at 12:57 PM, Claudio Freire > wrote: >> Patch lacks any new tests, but the changed code paths seem covered >> sufficiently by existing tests. A little bit of fuzzing on the patch >> itself, like reverting some key changes, or flipping some key >> comparisons, induces test failures as it should, mostly in cluster. >> >> The logic in tuplesort_heap_root_displace seems sound, except: >>
>> +	 */
>> +	memtuples[i] = memtuples[imin];
>> +	i = imin;
>> +	}
>> +
>> +	Assert(state->memtupcount > 1 || imin == 0);
>> +	memtuples[imin] = *newtup;
>> +}
>>
>> Why that assert? Wouldn't it make more sense to Assert(imin < n)? > > There might only be one or two elements in the heap. Note that the > heap size is indicated by state->memtupcount at this point in the > sort, which is a little confusing (that differs from how memtupcount > is used elsewhere, where we don't partition memtuples into a heap > portion and a preread tuples portion, as we do here).

I noticed, but here n = state->memtupcount:

+	Assert(memtuples[0].tupindex == newtup->tupindex);
+
+	CHECK_FOR_INTERRUPTS();
+
+	n = state->memtupcount;		/* n is heap's size, including old root */
+	imin = 0;					/* start with caller's "hole" in root */
+	i = imin;

In fact, the assert in the patch would allow writing memtuples outside the heap, as in calling tuplesort_heap_root_displace with memtupcount==0, but I don't think that should be legal (memtuples[0] == memtuples[imin] would be outside the heap). Sure, that's a weird enough case (the assert up there already reads memtuples[0], which would be equally illegal if memtupcount==0), but it goes to show that the assert expression just seems odd for its intent.

BTW, I know it's not in the scope of the patch, but shouldn't root_displace be usable in the TSS_BOUNDED phase?

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Tue, Sep 6, 2016 at 9:19 PM, Peter Geoghegan wrote: > On Tue, Sep 6, 2016 at 4:55 PM, Claudio Freire wrote: >> I noticed, but here n = state->memtupcount: >>
>> +	Assert(memtuples[0].tupindex == newtup->tupindex);
>> +
>> +	CHECK_FOR_INTERRUPTS();
>> +
>> +	n = state->memtupcount;	/* n is heap's size, including old root */
>> +	imin = 0;				/* start with caller's "hole" in root */
>> +	i = imin;
> > I'm fine with using "n" in the later assertion you mentioned, if > that's clearer to you. memtupcount is broken out as "n" simply because > that's less verbose, in a place where that makes things far clearer. > >> In fact, the assert on the patch would allow writing memtuples outside >> the heap, as in calling tuplesort_heap_root_displace if >> memtupcount==0, but I don't think that should be legal (memtuples[0] >> == memtuples[imin] would be outside the heap). > > You have to have a valid heap (i.e. there must be at least one > element) to call tuplesort_heap_root_displace(), and it doesn't > directly compact the heap, so it must remain valid on return. The > assertion exists to make sure that everything is okay with a > one-element heap, a case which is quite possible.

More than using "n" or "memtupcount", what I'm saying is to assert that memtuples[imin] is inside the heap, which would catch the same errors the original assert would, and more:

Assert(imin < state->memtupcount)

If you prefer. The original assert allows any value of imin for memtupcount>1, and that's my main concern. It shouldn't.

On Tue, Sep 6, 2016 at 9:19 PM, Peter Geoghegan wrote: >> Sure, that's a weird enough case (that assert up there already reads >> memtuples[0] which would be equally illegal if memtupcount==0), but it >> goes on to show that the assert expression just seems odd for its >> intent. >> >> BTW, I know it's not the scope of the patch, but shouldn't >> root_displace be usable on the TSS_BOUNDED phase? > > I don't think it should be, no. With a top-n heap sort, the > expectation is that after a little while, we can immediately determine > that most tuples do not belong in the heap (this will require more > than one comparison per tuple when the tuple that may be entered into > the heap will in fact go in the heap, which should be fairly rare > after a time). That's why that general strategy can be so much faster, > of course.

I wasn't proposing getting rid of that optimization, just replacing the siftup+insert step with root_displace...

> Note that that heap is "reversed" -- the sort order is inverted, so > that we can use a minheap. The top of the heap is the most marginal > tuple in the top-n heap so far, and so is the next to be removed from > consideration entirely (not the next to be returned to caller, when > merging).

...but I didn't pause to consider that point. It still looks like a valid optimization: instead of rearranging the heap twice (siftup + insert), do it once (replace + relocate). However, I agree that it's not worth the risk of conflating the two optimizations. That one can be done later as a separate patch.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
On Wed, Sep 7, 2016 at 12:12 PM, Greg Stark wrote: > On Wed, Sep 7, 2016 at 1:45 PM, Simon Riggs wrote: >> On 6 September 2016 at 19:59, Tom Lane wrote: >>> The idea of looking to the stats to *guess* about how many tuples are >>> removable doesn't seem bad at all. But imagining that that's going to be >>> exact is folly of the first magnitude. >> >> Yes. Bear in mind I had already referred to allowing +10% to be safe, >> so I think we agree that a reasonably accurate, yet imprecise >> calculation is possible in most cases. > > That would all be well and good if it weren't trivial to do what > Robert suggested. This is just a large unsorted list that we need to > iterate through. Just allocate chunks of a few megabytes and when > it's full allocate a new chunk and keep going. There's no need to get > tricky with estimates and resizing and whatever.

I agree. While the idea of estimating the right size sounds promising a priori, the estimate can go wrong and over- or under-allocate quite severely, so the risks outweigh the benefits compared to the alternative of a dynamic allocation strategy. Unless the dynamic strategy has a bigger CPU impact than expected, I believe it's the superior approach.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
On Thu, Sep 8, 2016 at 11:54 AM, Pavan Deolasee wrote: > For example, for a table with 60 bytes wide tuple (including 24 byte > header), each page can approximately have 8192/60 = 136 tuples. Say we > provision for 136*2 = 272 bits per page i.e. 34 bytes per page for the > bitmap. First 272 offsets in every page are represented in the bitmap and > anything greater than that goes in the overflow region. On the other hand, the current > representation will need about 16 bytes per page assuming 2% dead tuples, 40 > bytes per page assuming 5% dead tuples and 80 bytes assuming 10% dead > tuples. So the bitmap will take more space for small tuples or when vacuum is > run very aggressively, both of which seem unlikely for very large tables. Of course > the calculation does not take into account the space needed by the overflow > area, but I expect that to be small.

I thought about something like this, but it could be extremely inefficient for mostly-frozen tables, since the bitmap cannot account for frozen pages without losing the O(1) lookup characteristic.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
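[A hedged sketch of the O(1) probe being discussed, under the provisioning scheme described above (272 bits per heap page, overflow elsewhere); the names and layout are assumptions for illustration, not from an actual patch:]

```c
#include <stdbool.h>
#include <stdint.h>

#define BITS_PER_PAGE 272		/* provisioned dead-tuple offsets per page */

/*
 * One flat bitmap covering every page of the relation, BITS_PER_PAGE
 * bits per page. The probe is O(1) precisely because page p's bits
 * live at the fixed position p * BITS_PER_PAGE, which is also why
 * all-frozen pages can't be elided without breaking the indexing:
 * every page must be provisioned whether or not it has dead tuples.
 */
static bool
dead_tuple_test(const uint8_t *bitmap, uint32_t blkno, uint16_t offnum)
{
	uint64_t	bit;

	if (offnum > BITS_PER_PAGE)
		return false;			/* would fall back to the overflow region here */
	bit = (uint64_t) blkno * BITS_PER_PAGE + (offnum - 1);	/* offnum is 1-based */
	return (bitmap[bit / 8] >> (bit % 8)) & 1;
}
```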
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Tue, Sep 6, 2016 at 8:28 PM, Peter Geoghegan wrote: >> In the meanwhile, I'll go and do some perf testing. >> >> Assuming the speedup is realized during testing, LGTM. > > Thanks. I suggest spending at least as much time on unsympathetic > cases (e.g., only 2 or 3 tapes must be merged). At the same time, I > suggest focusing on a type that has relatively expensive comparisons, > such as collated text, to make differences clearer.

The tests are still running (the benchmark script I came up with runs for a lot longer than I anticipated, about 2 days), but preliminary results are very promising; I can see a clear and consistent speedup. We'll have to wait for the complete results to see if there's any significant regression, though. I'll post the full results when I have them, but so far it all looks like this:

setup:

create table lotsofitext(i text, j text, w text, z integer, z2 bigint);
insert into lotsofitext select
    cast(random() * 10.0 as text) || 'blablablawblabla',
    cast(random() * 10.0 as text) || 'blablablawjjjblabla',
    cast(random() * 10.0 as text) || 'blablablawwwabla',
    random() * 10.0,
    random() * 1.0
from generate_series(1, 1000);

timed:

select count(*) FROM (select * from lotsofitext order by i, j, w, z, z2) t;

Unpatched Time: 100351.251 ms
Patched Time: 75180.787 ms

That's like a 25% speedup on random input. As we say over here, rather badly translated, not a turkey's boogers (meaning "nice!").

On Tue, Sep 6, 2016 at 9:50 PM, Claudio Freire wrote: > On Tue, Sep 6, 2016 at 9:19 PM, Peter Geoghegan wrote: >> On Tue, Sep 6, 2016 at 4:55 PM, Claudio Freire >> wrote: >>> I noticed, but here n = state->memtupcount: >>>
>>> +	Assert(memtuples[0].tupindex == newtup->tupindex);
>>> +
>>> +	CHECK_FOR_INTERRUPTS();
>>> +
>>> +	n = state->memtupcount;	/* n is heap's size, including old root */
>>> +	imin = 0;				/* start with caller's "hole" in root */
>>> +	i = imin;
>> >> I'm fine with using "n" in the later assertion you mentioned, if >> that's clearer to you. memtupcount is broken out as "n" simply because >> that's less verbose, in a place where that makes things far clearer. >> >>> In fact, the assert on the patch would allow writing memtuples outside >>> the heap, as in calling tuplesort_heap_root_displace if >>> memtupcount==0, but I don't think that should be legal (memtuples[0] >>> == memtuples[imin] would be outside the heap). >> >> You have to have a valid heap (i.e. there must be at least one >> element) to call tuplesort_heap_root_displace(), and it doesn't >> directly compact the heap, so it must remain valid on return. The >> assertion exists to make sure that everything is okay with a >> one-element heap, a case which is quite possible. > > More than using "n" or "memtupcount", what I'm saying is to assert that > memtuples[imin] is inside the heap, which would catch the same errors > the original assert would, and more: > > Assert(imin < state->memtupcount) > > If you prefer. > > The original assert allows any value of imin for memtupcount>1, and > that's my main concern. It shouldn't. 
So, for the assertions to properly avoid clobbering/reading out-of-bounds memory, you need both the above assert:

+	 */
+	memtuples[i] = memtuples[imin];
+	i = imin;
+	}
+
>+	Assert(imin < state->memtupcount);
+	memtuples[imin] = *newtup;
+}

And another one at the beginning, asserting:

+	SortTuple  *memtuples = state->memtuples;
+	int			n,
+				imin,
+				i;
+
>+	Assert(state->memtupcount > 0 &&
>+		   memtuples[0].tupindex == newtup->tupindex);
+
+	CHECK_FOR_INTERRUPTS();

It's worth making that change, IMHO, unless I'm missing something.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Thu, Sep 8, 2016 at 2:13 PM, Peter Geoghegan wrote: > On Thu, Sep 8, 2016 at 8:53 AM, Claudio Freire wrote: >> setup: >> >> create table lotsofitext(i text, j text, w text, z integer, z2 bigint); >> insert into lotsofitext select cast(random() * 10.0 as text) >> || 'blablablawblabla', cast(random() * 10.0 as text) || >> 'blablablawjjjblabla', cast(random() * 10.0 as text) || >> 'blablabl >> awwwabla', random() * 10.0, random() * 1.0 from >> generate_series(1, 1000); >> >> timed: >> >> select count(*) FROM (select * from lotsofitext order by i, j, w, z, z2) t; >> >> Unpatched Time: 100351.251 ms >> Patched Time: 75180.787 ms >> >> That's like a 25% speedup on random input. As we say over here, rather >> badly translated, not a turkey's boogers (meaning "nice!") > > Cool! What work_mem setting were you using here? The script iterates over a few variations of string patterns (easy comparisons vs hard comparisons), work mem (4MB, 64MB, 256MB, 1GB, 4GB), and table sizes (~350M, ~650M, ~1.5G). That particular case I believe is using work_mem=4MB, easy strings, 1.5GB table. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Is tuplesort_heap_siftup() a misnomer?
On Thu, Sep 8, 2016 at 4:20 PM, Peter Geoghegan wrote: > On Thu, Sep 8, 2016 at 10:40 AM, Tom Lane wrote: >>> On Thu, Sep 8, 2016 at 12:01 AM, Heikki Linnakangas wrote: >>>> I still think tuplesort_heap_siftup is a confusing name, although I'm not sure that Peter's "compact" is much better. I suggest that we rename it to tuplesort_heap_delete_top(). In comments within the function, explain that the *loop* corresponds to the "siftup" in Knuth's book. >> >>> I'm also fine with that name. >> >> I can live with it too. > > Attached patch does it that way, then. I stuck with the reference to > "shift down", though, since I think we all agree that that is > unambiguous.

I believe the term is "sift", not "shift".

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Is tuplesort_heap_siftup() a misnomer?
On Thu, Sep 8, 2016 at 2:19 PM, Peter Geoghegan wrote: >> Peter, looking at your "displace" patch in this light, I think >> tuplesort_heap_root_displace() and tuplesort_heap_delete_top() (as I'm >> calling it now), should share a common subroutine. Displace replaces the top >> node with the new node passed as argument, and then calls the subroutine to >> push it down to the right place. Delete_top replaces the top node with the >> last value in the heap, and then calls the subroutine. Or perhaps delete_top >> should remove the last value from the heap, and then call displace() with >> it, to re-insert it. > > I can see the value in that, but I'd still rather not. The code reuse > win isn't all that large, and having a separate > tuplesort_heap_root_displace() routine allows us to explain what's > going on for merging, to assert what we're able to assert with tape > numbers vs. heap position, and to lose the HEAPCOMPARE()/checkIndex > cruft that the existing routines need (if only barely) to support > replacement selection. I would be surprised if with a very tight inner > loop like this, HEAPCOMPARE() has an appreciable overhead (maybe it > increases pipeline stalls).

BTW, regarding this: since I read in some other thread that it's OK to use static inline functions, I believe the compiler is smart enough to optimize out the constant true/false in checkIndex for inlined calls, so that may be viable (on modern compilers).

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
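[The point about constant propagation is easy to illustrate with a toy stand-in; this is not the tuplesort code, just the mechanism: with a static inline function, each call site passing a constant flag gets a specialized body with the dead branch removed.]

```c
#include <stdbool.h>

/*
 * Toy stand-in for HEAPCOMPARE(): the checkIndex branch disappears
 * wherever the flag is a compile-time constant at an inlined call site.
 */
static inline int
heap_compare(int a, int b, int a_idx, int b_idx, bool checkIndex)
{
	if (checkIndex && a_idx != b_idx)
		return (a_idx < b_idx) ? -1 : 1;
	return (a < b) ? -1 : (a > b);
}

int
compare_for_merge(int a, int b)
{
	/* compiles down to just the value comparison; no index check emitted */
	return heap_compare(a, b, 0, 0, false);
}
```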
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
... On Fri, Sep 9, 2016 at 9:22 PM, Claudio Freire wrote: > Since it is true that doing so would make it impossible to keep the > asserts about tupindex in tuplesort_heap_root_displace, I guess it > depends on how useful those asserts are (ie: how likely it is that > those conditions could be violated, and how damaging it could be if > they were). If it is decided the refactor is desirable, I'd suggest > making the common siftup procedure static inline, to allow > tuplesort_heap_root_displace to inline and specialize it, since it > will be called with checkIndex=false and that simplifies the resulting > code considerably. > > Peter also mentioned that there were some other changes going on in > the surrounding code that could impact this patch, so I'm marking the > patch Waiting on Author. > > Overall, however, I believe the patch is in good shape. Only minor > form issues need to be changed; the functionality seems both desirable > and ready.

Sorry, I forgot to specify: that was all about patch 3, the one about tuplesort_heap_root_displace.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Tuplesort merge pre-reading
On Fri, Sep 9, 2016 at 8:13 AM, Heikki Linnakangas wrote: > > Claudio, if you could also repeat the tests you ran on Peter's patch set on > the other thread, with these patches, that'd be nice. These patches are > effectively a replacement for > 0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch. And review > would be much appreciated too, of course. > > Attached are new versions. Compared to last set, they contain a few comment > fixes, and a change to the 2nd patch to not allocate tape buffers for tapes > that were completely unused. Will do so -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Tuplesort merge pre-reading
On Fri, Sep 9, 2016 at 9:51 PM, Claudio Freire wrote: > On Fri, Sep 9, 2016 at 8:13 AM, Heikki Linnakangas wrote: >> >> Claudio, if you could also repeat the tests you ran on Peter's patch set on >> the other thread, with these patches, that'd be nice. These patches are >> effectively a replacement for >> 0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch. And review >> would be much appreciated too, of course. >> >> Attached are new versions. Compared to last set, they contain a few comment >> fixes, and a change to the 2nd patch to not allocate tape buffers for tapes >> that were completely unused. > > > Will do so It seems both 1 and 1+2 break make check. Did I misunderstand something? I'm applying the first patch of Peter's series (cap number of tapes), then your first one (remove prefetch) and second one (use larger read buffers). Peter's patch needs some rebasing on top of those but nothing major. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Tuplesort merge pre-reading
On Sun, Sep 11, 2016 at 12:47 PM, Heikki Linnakangas wrote: > Here's a new version of these patches, rebased over current master. I > squashed the two patches into one, there's not much point to keep them > separate. I don't know what was up with the other ones, but this one works fine. Benchmarking now. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
On Tue, Sep 13, 2016 at 3:51 PM, Robert Haas wrote: > On Fri, Sep 9, 2016 at 3:04 AM, Masahiko Sawada wrote: >> Attached PoC patch changes the representation of dead tuple locations >> to a hashmap holding tuple bitmaps. >> Each hashmap entry consists of a block number and the TID bitmap >> of the corresponding block, with the block number as the hash key. >> The current implementation of this patch is not smart yet, because each >> hashmap entry allocates the tuple bitmap with a fixed >> size (LAZY_ALLOC_TUPLES), so each hash entry can store up to >> LAZY_ALLOC_TUPLES (291 if block size is 8kB) tuples. >> In cases where one block can store only several tens of tuples, >> most bits would be wasted. >> >> After improving this patch as you suggested, I will measure the performance >> benefit. > > We also need to consider the amount of memory that gets used. What I > proposed - replacing the array with an array of arrays - would not > increase memory utilization significantly. I don't think it would > have much impact on CPU utilization either.

I've finished writing that patch; I'm in the process of testing its CPU impact. The first test seemed to hint at a 40% increase in CPU usage, which seems rather steep compared to what I expected, so I'm trying to rule out some methodology error here.

> It would require > replacing the call to bsearch() in lazy_heap_reaptid() with an > open-coded implementation of bsearch, or with one bsearch to find the > chunk and another to find the TID within the chunk, but that shouldn't > be very expensive.

I did a linear search to find the chunk, with exponentially growing chunks, and then a bsearch to find the item inside the chunk. With the typical number of segments and given the 12GB limit, the segment array size is well within the range that favors linear search.

> For example, we could keep a bitmap with one bit per K > pages. If the bit is set, there is at least 1 dead tuple on that > page; if clear, there are none. When we see an index tuple, we > consult the bitmap to determine whether we need to search the TID > list. We select K to be the smallest power of 2 such that the bitmap > uses less memory than some threshold, perhaps 64kB.

I've been pondering something like that, but that's an optimization that's quite orthogonal to the multiarray stuff.

> Assuming that > updates and deletes to the table have some locality, we should be able > to skip a large percentage of the TID searches with a probe into this > very compact bitmap.

I don't think you can assume locality.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
On Tue, Sep 13, 2016 at 4:06 PM, Robert Haas wrote: > On Tue, Sep 13, 2016 at 2:59 PM, Claudio Freire > wrote: >> I've finished writing that patch; I'm in the process of testing its CPU >> impact. >> >> The first test seemed to hint at a 40% increase in CPU usage, which seems >> rather steep compared to what I expected, so I'm trying to rule out >> some methodology error here. > > Hmm, wow. That's pretty steep. Maybe lazy_heap_reaptid() is hotter > than I think it is, but even if it accounts for 10% of total CPU usage > within a vacuum, which seems like an awful lot, you'd have to make it > 4x as expensive, which also seems like an awful lot.

IIRC, perf top reported a combined 45% between lazy_heap_reaptid + vac_cmp_itemptr (after patching); vac_cmp_itemptr was around 15% on its own. That was a debug build, of course (I need the assertions and the debug symbols); I'll retest with optimizations once the debug tests make sense.

>>> For example, we could keep a bitmap with one bit per K >>> pages. If the bit is set, there is at least 1 dead tuple on that >>> page; if clear, there are none. When we see an index tuple, we >>> consult the bitmap to determine whether we need to search the TID >>> list. We select K to be the smallest power of 2 such that the bitmap >>> uses less memory than some threshold, perhaps 64kB. >> >> I've been pondering something like that, but that's an optimization >> that's quite orthogonal to the multiarray stuff. > > Sure, but if this really does increase CPU time, it'd be reasonable to > do something to decrease it again in order to get the other benefits > of this patch - i.e. increasing the maintenance_work_mem limit while > reducing the chances that overallocation will cause OOM.

I was hoping it wouldn't regress performance so much. I'd rather micro-optimize the multiarray implementation until it doesn't, and then think of orthogonal optimizations.

>>> Assuming that >>> updates and deletes to the table have some locality, we should be able >>> to skip a large percentage of the TID searches with a probe into this >>> very compact bitmap. >> >> I don't think you can assume locality > > Really? If you have a 1TB table, how many 2MB ranges of that table do > you think will contain dead tuples for a typical vacuum? I think most > tables of that size are going to be mostly static, and the all-visible > and all-frozen bits are going to be mostly set. You *could* have > something like a pgbench-type workload that does scattered updates > across the entire table, but that's going to perform pretty poorly > because you'll constantly be updating blocks that have to be pulled in > from disk.

I have a few dozen of those in my biggest database. They do updates and deletes all over the place, and even if they were few, they're scattered almost uniformly. Thing is, I think we really need to not worsen that case, which seems rather common (almost any OLTP with a big enough user base, or a K-V type of table, or TOAST tables).

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
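[For reference, the quoted "one bit per K pages" filter amounts to only a few lines. This sketch is illustrative; the names and the value of K are assumptions, and as quoted above K would actually be derived from a ~64kB memory budget.]

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGES_PER_BIT 256		/* K: illustrative; pick so bitmap fits ~64kB */

/* Mark that blkno contains at least one dead tuple. */
static void
skipmap_set(uint8_t *skipmap, uint32_t blkno)
{
	uint32_t	bit = blkno / PAGES_PER_BIT;

	skipmap[bit / 8] |= (uint8_t) (1 << (bit % 8));
}

/*
 * If this returns false, no page in blkno's K-page range has any dead
 * tuples, so the comparatively expensive TID search can be skipped.
 * False positives just mean the search happens anyway.
 */
static bool
skipmap_maybe_dead(const uint8_t *skipmap, uint32_t blkno)
{
	uint32_t	bit = blkno / PAGES_PER_BIT;

	return (skipmap[bit / 8] >> (bit % 8)) & 1;
}
```

[How much this helps hinges exactly on the locality question debated above: uniformly scattered dead tuples set nearly every bit, and the filter then rejects nothing.]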
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
On Wed, Sep 14, 2016 at 12:17 PM, Robert Haas wrote: > > I am kind of doubtful about this whole line of investigation because > we're basically trying pretty hard to fix something that I'm not sure > is broken. I do agree that, all other things being equal, the TID > lookups will probably be faster with a bitmap than with a binary > search, but maybe not if the table is large and the number of dead > TIDs is small, because cache efficiency is pretty important. But even > if it's always faster, does TID lookup speed even really matter to > overall VACUUM performance? Claudio's early results suggest that it > might, but maybe that's just a question of some optimization that > hasn't been done yet.

FYI, the reported impact was on CPU time, not runtime. There was no significant difference in runtime (real time), because my test is heavily I/O bound. I tested with a few small tables and there was no significant difference either, but small tables don't stress the array lookup anyway, so that's expected. But on the assumption that some systems may be CPU bound during vacuum (particularly those able to do more than 300-400MB/s sequential I/O), in those cases the increased or decreased cost of lazy_tid_reaped will directly correlate with runtime. It's just none of my systems, which all run on Amazon and are heavily bandwidth constrained (the fastest I/O subsystem I can get my hands on does 200MB/s).

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
On Wed, Sep 14, 2016 at 12:17 PM, Robert Haas wrote: > For instance, one idea to grow memory usage incrementally would be to > store dead tuple information separately for each 1GB segment of the > relation. So we have an array of dead-tuple-representation objects, > one for every 1GB of the relation. If there are no dead tuples in a > given 1GB segment, then this pointer can just be NULL. Otherwise, it > can point to either the bitmap representation (which will take ~4.5MB) > or it can point to an array of TIDs (which will take 6 bytes/TID). > That could handle an awfully wide variety of usage patterns > efficiently; it's basically never worse than what we're doing today, > and when the dead tuple density is high for any portion of the > relation it's a lot better. If you compress the list into a bitmap a posteriori, you know the number of tuples per page, so you could encode the bitmap even more efficiently. It's not a bad idea, one that can be slapped on top of the multiarray patch - when closing a segment, it can be decided whether to turn it into a bitmap or not. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
On Thu, Sep 15, 2016 at 12:50 PM, Tomas Vondra wrote: > On 09/14/2016 07:57 PM, Tom Lane wrote: >> Pavan Deolasee writes: >>> On Wed, Sep 14, 2016 at 10:53 PM, Alvaro Herrera wrote: >>>> One thing not quite clear to me is how do we create the bitmap >>>> representation starting from the array representation in midflight >>>> without using twice as much memory transiently. Are we going to write >>>> the array to a temp file, free the array memory, then fill the bitmap by >>>> reading the array from disk? >>> >>> We could do that. >> >> People who are vacuuming because they are out of disk space will be very >> very unhappy with that solution. > > The people are usually running out of space for data, while these files > would be temporary files placed wherever temp_tablespaces points to. I'd > argue if this is a source of problems, the people are already in deep > trouble due to sorts, CREATE INDEX, ... as those commands may also generate > a lot of temporary files.

One would not expect "CREATE INDEX" to succeed when space is tight, but VACUUM is quite the opposite. Still, temporary storage could be used when available, falling back gracefully to some other technique (like not using bitmaps) when it's not. Not sure it's worth the trouble, though.

On Wed, Sep 14, 2016 at 12:24 PM, Claudio Freire wrote: > On Wed, Sep 14, 2016 at 12:17 PM, Robert Haas wrote: >> >> I am kind of doubtful about this whole line of investigation because >> we're basically trying pretty hard to fix something that I'm not sure >> is broken. I do agree that, all other things being equal, the TID >> lookups will probably be faster with a bitmap than with a binary >> search, but maybe not if the table is large and the number of dead >> TIDs is small, because cache efficiency is pretty important. But even >> if it's always faster, does TID lookup speed even really matter to >> overall VACUUM performance? Claudio's early results suggest that it >> might, but maybe that's just a question of some optimization that >> hasn't been done yet. > > FYI, the reported impact was on CPU time, not runtime. There was no > significant difference in runtime (real time), because my test is > heavily I/O bound. > > I tested with a few small tables and there was no significant > difference either, but small tables don't stress the array lookup > anyway, so that's expected. > > But on the assumption that some systems may be CPU bound during vacuum > (particularly those able to do more than 300-400MB/s sequential I/O), > in those cases the increased or decreased cost of lazy_tid_reaped will > directly correlate with runtime. It's just none of my systems, which all > run on Amazon and are heavily bandwidth constrained (the fastest I/O > subsystem I can get my hands on does 200MB/s).

Attached is the patch with the multiarray version. The tests are weird. A best-case comparison over several runs, taken to remove the impact of concurrent activity on this host (I couldn't remove all background activity even when running the tests overnight; the distro adds tons of crons and background cleanup tasks, it would seem), shows only very mild CPU impact. I'd say insignificant, as it's well below the mean variance.

Worst case:
DETAIL: CPU 9.90s/80.94u sec elapsed 1232.42 sec.

Best case:
DETAIL: CPU 12.10s/63.82u sec elapsed 832.79 sec.

There seems to be more variance with the multiarray approach than the single-array one, but I could not figure out why. 
Even I/O seems less stable:

Worst case:
INFO: "pgbench_accounts": removed 4 row versions in 6557382 pages
DETAIL: CPU 64.31s/37.60u sec elapsed 2573.88 sec.

Best case:
INFO: "pgbench_accounts": removed 4 row versions in 6557378 pages
DETAIL: CPU 54.48s/31.78u sec elapsed 1552.18 sec.

Since this test takes several hours to complete, I could only do a few runs of each version, so the statistical significance of the test isn't great. I'll try to compare with smaller pgbench scale numbers and more runs over the weekend (gotta script that). It's certainly puzzling; I cannot explain the increased variance, especially in I/O, since the I/O should be exactly the same. I'm betting it's my system that's unpredictable somehow. So I'm posting the patch in case someone gets inspired and can spot the reason, and because there's been a lot of talk about this very same approach, so I thought I'd better post the code ;) I'll also try to get a more predictable system to run the tests on. From f85fd4213eb6c0b4da8dc9196424f6
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
On Thu, Sep 15, 2016 at 3:48 PM, Tomas Vondra wrote: > For example, we always allocate the TID array as large as we can fit into > m_w_m, but maybe we don't need to wait with switching to the bitmap until > filling the whole array - we could wait as long as the bitmap fits into the > remaining part of the array, build it there and then copy it to the > beginning (and use the bitmap from that point).

The bitmap can be created like that, but grown from the end of the segment backwards. That way the scan can proceed until the bitmap fills the whole segment (filling backwards), with no copy required. I'll try that later, but first I'd like to get the multiarray approach right, since that's the basis of it.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Tuplesort merge pre-reading
On Fri, Sep 9, 2016 at 9:51 PM, Claudio Freire wrote: > On Fri, Sep 9, 2016 at 8:13 AM, Heikki Linnakangas wrote: >> >> Claudio, if you could also repeat the tests you ran on Peter's patch set on >> the other thread, with these patches, that'd be nice. These patches are >> effectively a replacement for >> 0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch. And review >> would be much appreciated too, of course. >> >> Attached are new versions. Compared to last set, they contain a few comment >> fixes, and a change to the 2nd patch to not allocate tape buffers for tapes >> that were completely unused. > > Will do so

Well, here they are, the results. ODS format only (unless you've got issues opening the ODS). The results seem all over the map. Some regressions seem significant (both in the amount of performance lost and their significance, since all 4 runs show a similar regression). The worst is "CREATE INDEX ix_lotsofitext_zz2ijw ON lotsofitext (z, z2, i, j, w);" with 4GB work_mem, which should be an in-memory sort, which makes it odd. I will re-run it overnight just in case to confirm the outcome.

logtape_preload_timings.ods Description: application/vnd.oasis.opendocument.spreadsheet

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Tuplesort merge pre-reading
On Tue, Sep 20, 2016 at 3:34 PM, Claudio Freire wrote: > On Fri, Sep 9, 2016 at 9:51 PM, Claudio Freire wrote: >> On Fri, Sep 9, 2016 at 8:13 AM, Heikki Linnakangas wrote: >>> >>> Claudio, if you could also repeat the tests you ran on Peter's patch set on >>> the other thread, with these patches, that'd be nice. These patches are >>> effectively a replacement for >>> 0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch. And review >>> would be much appreciated too, of course. >>> >>> Attached are new versions. Compared to last set, they contain a few comment >>> fixes, and a change to the 2nd patch to not allocate tape buffers for tapes >>> that were completely unused. >> >> Will do so > > Well, here they are, the results. > > ODS format only (unless you've got issues opening the ODS). > > The results seem all over the map. Some regressions seem significant > (both in the amount of performance lost and their significance, since > all 4 runs show a similar regression). The worst is "CREATE INDEX > ix_lotsofitext_zz2ijw ON lotsofitext (z, z2, i, j, w);" with 4GB > work_mem, which should be an in-memory sort, which makes it odd. > > I will re-run it overnight just in case to confirm the outcome.

A new run for "patched" gives better results; it seems it was some kind of glitch in the run (maybe some cron decided to do something while running those queries). Attached.

In essence, it doesn't look like it's harmfully affecting CPU efficiency. Results seem neutral on the CPU front.

logtape_preload_timings.ods Description: application/vnd.oasis.opendocument.spreadsheet

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Tuplesort merge pre-reading
On Thu, Sep 22, 2016 at 4:17 AM, Heikki Linnakangas wrote: > On 09/22/2016 03:40 AM, Claudio Freire wrote: >> >> On Tue, Sep 20, 2016 at 3:34 PM, Claudio Freire >> wrote: >>> >>> The results seem all over the map. Some regressions seem significant >>> (both in the amount of performance lost and their significance, since >>> all 4 runs show a similar regression). The worst is "CREATE INDEX >>> ix_lotsofitext_zz2ijw ON lotsofitext (z, z2, i, j, w);" with 4GB >>> of work_mem. I can't reproduce that on my laptop, though. Got any clue >>> what's going on there? >> >> A new run for "patched" gives better results; it seems it was some >> kind of glitch in the run (maybe some cron decided to do something >> while running those queries). >> >> Attached >> >> In essence, it doesn't look like it's harmfully affecting CPU >> efficiency. Results seem neutral on the CPU front. > > Looking at the spreadsheet, there is a 40% slowdown in the "slow" "CREATE > INDEX ix_lotsofitext_zz2ijw ON lotsofitext (z, z2, i, j, w);" test with 4GB > of work_mem. I can't reproduce that on my laptop, though. Got any clue > what's going on there?

It's not present in other runs, so I think it's a fluke that the spreadsheet isn't filtering out, especially considering that one should be a fully in-memory fast sort and thus unaffected by the current patch (z and z2 being integers, most comparisons should, IIRC, involve only the first columns and thus rarely touch the big strings). I'll try to confirm that's the case, though.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Small improvement to compactify_tuples
On Thu, Nov 2, 2017 at 11:46 PM, Tom Lane wrote: > Sokolov Yura writes: >> [ 0001-Improve-compactify_tuples.patch, v5 or thereabouts ] > > I started to review this patch. I spent a fair amount of time on > beautifying the code, because I found it rather ugly and drastically > undercommented. Once I had it to the point where it seemed readable, > I went to check the shellsort algorithm against Wikipedia's entry, > and found that this appears to be an incorrect implementation of > shellsort: where pg_shell_sort_pass has
>
> for (_i = off; _i < _n; _i += off) \
>
> it seems to me that we need to have
>
> for (_i = off; _i < _n; _i += 1) \
>
> or maybe just _i++. As-is, this isn't h-sorting the whole file, > but just the subset of entries that have multiple-of-h indexes > (ie, the first of the h distinct subfiles that should get sorted). > The bug is masked by the final pass of plain insertion sort, but > we are not getting the benefit we should get from the earlier passes. > > However, I'm a bit dubious that it's worth fixing that; instead > my inclination would be to rip out the shellsort implementation > entirely. The code is only using it for the nitems <= 48 case > (which makes the first three offset steps certainly no-ops) and > I am really unconvinced that it's worth expending the code space > for a shellsort rather than plain insertion sort in that case, > especially when we have good reason to think that the input data > is nearly sorted.

I actually noticed that and benchmarked some variants. None made any noticeable difference in performance, so I decided not to complain about them. I guess the same case can be made for removing the shell sort, so I'm inclined to agree.

> BTW, the originally given test case shows no measurable improvement > on my box.

I did manage to reproduce the original test and got a consistent improvement.

> I was eventually able to convince myself by profiling > that the patch makes us spend less time in compactify_tuples, but > this test case isn't a very convincing one. > > So, quite aside from the bug, I'm not excited about committing the > attached as-is. I think we should remove pg_shell_sort and just > use pg_insertion_sort. If somebody can show a test case that > provides a measurable speed improvement from the extra code, > I could be persuaded to reconsider.

My tests modifying the shell sort didn't produce any measurable difference, but I didn't test removing it altogether.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
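[To make the quoted bug concrete, here is a hedged sketch of a correct shellsort pass in plain C (illustrative, not the macro from the patch): every index from `off` upward must be visited, each element compared against the one `off` positions back.]

```c
#include <stddef.h>

/*
 * One h-sort pass: insertion sort with stride `off`. Note the outer
 * loop advances by 1, not by `off`; advancing by `off` (the bug Tom
 * describes) would only sort the first of the `off` interleaved
 * subfiles, leaving the rest untouched until the final plain pass.
 */
static void
shell_sort_pass(int *a, size_t n, size_t off)
{
	for (size_t i = off; i < n; i++)
	{
		int		tmp = a[i];
		size_t	j = i;

		while (j >= off && a[j - off] > tmp)
		{
			a[j] = a[j - off];
			j -= off;
		}
		a[j] = tmp;
	}
}
```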
Re: [HACKERS] Small improvement to compactify_tuples
On Fri, Nov 3, 2017 at 4:30 PM, Tom Lane wrote: > Claudio Freire writes: >> On Thu, Nov 2, 2017 at 11:46 PM, Tom Lane wrote: >>> BTW, the originally given test case shows no measurable improvement >>> on my box. > >> I did manage to reproduce the original test and got a consistent improvement. > > This gave me the idea to memcpy the page into some workspace and then use > memcpy, not memmove, to put the tuples back into the caller's copy of the > page. That gave me about a 50% improvement in observed TPS, and a perf > profile like this: >
> +   38.50%  38.40%  299520  postmaster  postgres      [.] compactify_tuples
> +   31.11%  31.02%  241975  postmaster  libc-2.12.so  [.] memcpy
> +    8.74%   8.72%   68022  postmaster  postgres      [.] itemoffcompare
> +    6.51%   6.49%   50625  postmaster  postgres      [.] compactify_tuples_loop
> +    4.21%   4.19%   32719  postmaster  postgres      [.] pg_qsort
> +    1.70%   1.69%   13213  postmaster  postgres      [.] memcpy@plt
>
> There still doesn't seem to be any point in replacing the qsort, > but it does seem like something like the second attached patch > might be worth doing. > > So I'm now wondering why my results seem to be so much different > from those of other people who have tried this, both as to whether > compactify_tuples is worth working on at all and as to what needs > to be done to it if so. Thoughts? > > regards, tom lane > I'm going to venture a guess that the versions of gcc and libc, and the build options used both for libc (ie: the distro) and postgres, may play a part here. I'm running with glibc 2.22, for instance, and building with gcc 4.8.5. I will try to benchmark memcpy vs memmove and see what the performance difference is there with my versions, too. This may heavily depend on compiler optimizations that vary between versions, since memcpy/memmove tend to be inlined a lot. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
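[Editor's note: to make Tom's workspace idea concrete, here is a rough sketch of the technique. The structure and names are illustrative stand-ins, not the real compactify_tuples, which works on PageHeader/ItemId directly:

#include <string.h>
#include <stdint.h>

#define BLCKSZ 8192

/* simplified stand-in for the sorted line-pointer data */
typedef struct
{
    uint16_t    itemoff;        /* tuple's current offset within the page */
    uint16_t    alignedlen;     /* MAXALIGN'd tuple length */
} ItemSortData;

/*
 * Copy the page out once, then lay tuples back down with memcpy:
 * every move now has its source in the scratch copy and its
 * destination in the caller's page, so there is no overlap and no
 * memmove penalty.  Returns the page's new pd_upper.
 */
static uint16_t
compactify_via_workspace(char *page, ItemSortData *items, int nitems)
{
    static char workspace[BLCKSZ];  /* per-process scratch page */
    uint16_t    upper = BLCKSZ;

    memcpy(workspace, page, BLCKSZ);

    for (int i = 0; i < nitems; i++)
    {
        upper -= items[i].alignedlen;
        memcpy(page + upper, workspace + items[i].itemoff,
               items[i].alignedlen);
        items[i].itemoff = upper;   /* record the tuple's new location */
    }
    return upper;
}

The trade-off, as the thread goes on to debate, is one extra full-page copy in exchange for strictly non-overlapping copies afterwards.]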
Re: [HACKERS] Small improvement to compactify_tuples
On Sat, Nov 4, 2017 at 8:07 PM, Юрий Соколов wrote: > 2017-11-03 5:46 GMT+03:00 Tom Lane : >> >> Sokolov Yura writes: >> > [ 0001-Improve-compactify_tuples.patch, v5 or thereabouts ] >> >> I went to check the shellsort algorithm against Wikipedia's entry, >> and found that this appears to be an incorrect implementation of >> shellsort: where pg_shell_sort_pass has >> >> for (_i = off; _i < _n; _i += off) \ >> >> it seems to me that we need to have >> >> for (_i = off; _i < _n; _i += 1) \ >> >> or maybe just _i++. > > > Shame on me :-( > I've written shell sort several times, so I forgot to recheck myself once > again. > And it looks like the best gap sequence from Wikipedia really is best > ({301, 132, 57, 23, 10, 4} in my notation). > > > 2017-11-03 17:37 GMT+03:00 Claudio Freire : >> On Thu, Nov 2, 2017 at 11:46 PM, Tom Lane wrote: >>> BTW, the originally given test case shows no measurable improvement >>> on my box. >> >> I did manage to reproduce the original test and got a consistent >> improvement. > > I've rechecked myself using my benchmark. > Without memmove, compactify_tuples consumes: > - with qsort, 11.66% cpu (pg_qsort + med3 + swapfunc + itemoffcompare + > compactify_tuples = 5.97 + 0.51 + 2.87 + 1.88 + 0.44) > - with just insertion sort, 6.65% cpu (sort is inlined, itemoffcompare also > inlined, so the whole cost is in compactify_tuples) > - with just shell sort, 5.98% cpu (sort is inlined again) > - with bucket sort, 1.76% cpu (sort_itemIds + compactify_tuples = 1.30 + > 0.46) Is that just insertion sort without bucket sort? Because I think shell sort has little impact in your original patch because it's rarely exercised. With bucket sort, most buckets are very small, too small for shell sort to do any useful work. That's why I'm inclined to agree with Tom that we could safely simplify it out, remove it, without much impact. Maybe leave a fallback to qsort if some corner case produces big buckets? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Small improvement to compactify_tuples
On Mon, Nov 6, 2017 at 11:50 AM, Юрий Соколов wrote: >> Maybe leave a fallback to qsort if some corner case produces big buckets? > > For 8kb pages, each bucket covers 32 bytes. So, for heap pages it is at > most 1 heap-tuple per bucket, and for index pages it is at most 2 index > tuples per bucket. For 32kb pages it is 4 heap-tuples and 8 index-tuples > per bucket. > It would be unnecessary overhead to call the non-inlineable qsort in these > cases. > > So, I think, shell sort could be removed, but insertion sort has to remain. > > I'd prefer shell sort to remain also. It could be useful in other places > also, > because it is easily inlinable, and provides performance comparable to > qsort for > up to several hundred elements. I'd rather have an inlineable qsort. And I'd recommend doing that when there is a need, and I don't think this patch really needs it, since bucket sort handles most cases anyway. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Small improvement to compactify_tuples
On Mon, Nov 6, 2017 at 6:58 PM, Юрий Соколов wrote: > > 2017-11-06 17:55 GMT+03:00 Claudio Freire : >> >> On Mon, Nov 6, 2017 at 11:50 AM, Юрий Соколов >> wrote: >> >> Maybe leave a fallback to qsort if some corner case produces big >> >> buckets? >> > >> > For 8kb pages, each bucket covers 32 bytes. So, for heap pages it is at >> > most 1 heap-tuple per bucket, and for index pages it is at most 2 index >> > tuples per bucket. For 32kb pages it is 4 heap-tuples and 8 index-tuples >> > per bucket. >> > It would be unnecessary overhead to call the non-inlineable qsort in >> > these cases. >> > >> > So, I think, shell sort could be removed, but insertion sort has to >> > remain. >> > >> > I'd prefer shell sort to remain also. It could be useful in other places >> > also, >> > because it is easily inlinable, and provides performance comparable to >> > qsort for >> > up to several hundred elements. >> >> I'd rather have an inlineable qsort. > > But qsort is recursive. It is quite hard to make it inlineable. And still it > will be > much heavier than insertion sort (btw, all qsort implementations use > insertion > sort for small arrays). And it will be heavier than shell sort for small > arrays. I haven't seen this trick used in postgres, nor do I know whether it would be well received, so this is more like throwing an idea out to see if it sticks... But a way to do this without writing the whole algorithm as a macro is to have an includable "template" algorithm that simply doesn't define the comparison function/type, it rather assumes it:

qsort_template.h

#define QSORT_NAME qsort_ ## QSORT_SUFFIX

static void QSORT_NAME(ELEM_TYPE *arr, size_t num_elems)
{
    ... if (ELEM_LESS(arr[a], arr[b]))
    ...
}

#undef QSORT_NAME

Then, in "offset_qsort.h":

#define QSORT_SUFFIX offset
#define ELEM_TYPE offset
#define ELEM_LESS(a,b) ((a) < (b))

#include "qsort_template.h"

#undef QSORT_SUFFIX
#undef ELEM_TYPE
#undef ELEM_LESS

Now, I realize this may have its cons, but it does simplify maintenance of type-specific or parameterized variants of performance-critical functions. > I can do a specialized qsort for this case. But it will be a larger bunch of > code than > shell sort. > >> And I'd recommend doing that when there is a need, and I don't think >> this patch really needs it, since bucket sort handles most cases >> anyway. > > And it still needs insertion sort for buckets. > I can agree to get rid of shell sort. But insertion sort is necessary. I didn't suggest getting rid of insertion sort. But the trick above is equally applicable to insertion sort. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
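[Editor's note: one wrinkle worth recording for anyone who tries this: `##` suppresses macro expansion of its operands, so QSORT_NAME as sketched would paste the literal token QSORT_SUFFIX. A compilable toy version needs one level of indirection. Everything below is illustrative, not proposed code, and uses insertion sort for brevity:

/* qsort_template.h -- define QSORT_SUFFIX, ELEM_TYPE and ELEM_LESS,
 * then #include this file (requires <stddef.h> for size_t). */
#define QSORT_CONCAT_(a, b) a##b
#define QSORT_CONCAT(a, b) QSORT_CONCAT_(a, b)
#define QSORT_NAME QSORT_CONCAT(qsort_, QSORT_SUFFIX)

static void
QSORT_NAME(ELEM_TYPE *arr, size_t num_elems)
{
    for (size_t i = 1; i < num_elems; i++)
    {
        ELEM_TYPE   tmp = arr[i];
        size_t      j = i;

        while (j > 0 && ELEM_LESS(tmp, arr[j - 1]))
        {
            arr[j] = arr[j - 1];
            j--;
        }
        arr[j] = tmp;
    }
}

#undef QSORT_NAME
#undef QSORT_CONCAT
#undef QSORT_CONCAT_

/* offset_qsort.h -- one hypothetical instantiation */
#include <stddef.h>

typedef unsigned short offset;

#define QSORT_SUFFIX offset
#define ELEM_TYPE offset
#define ELEM_LESS(a, b) ((a) < (b))
#include "qsort_template.h"     /* defines qsort_offset(offset *, size_t) */
#undef QSORT_SUFFIX
#undef ELEM_TYPE
#undef ELEM_LESS

Since the comparison is a macro, the compiler sees it fully inlined at each call site; there is no function-pointer indirection as with the stdlib qsort.]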
Re: [HACKERS] Small improvement to compactify_tuples
On Mon, Nov 6, 2017 at 9:08 PM, Юрий Соколов wrote: > 2017-11-07 1:14 GMT+03:00 Claudio Freire : >> >> I haven't seen this trick used in postgres, nor do I know whether it >> would be well received, so this is more like throwing an idea out to see >> if it sticks... >> >> But a way to do this without writing the whole algorithm as a macro is >> to have an includable "template" algorithm that simply doesn't define >> the comparison function/type, it rather assumes it: >>
>> qsort_template.h
>>
>> #define QSORT_NAME qsort_ ## QSORT_SUFFIX
>>
>> static void QSORT_NAME(ELEM_TYPE *arr, size_t num_elems)
>> {
>>     ... if (ELEM_LESS(arr[a], arr[b]))
>>     ...
>> }
>>
>> #undef QSORT_NAME
>>
>> Then, in "offset_qsort.h":
>>
>> #define QSORT_SUFFIX offset
>> #define ELEM_TYPE offset
>> #define ELEM_LESS(a,b) ((a) < (b))
>>
>> #include "qsort_template.h"
>>
>> #undef QSORT_SUFFIX
>> #undef ELEM_TYPE
>> #undef ELEM_LESS
>>
>> Now, I realize this may have its cons, but it does simplify >> maintenance of type-specific or parameterized variants of >> performance-critical functions. >> >> > I can do a specialized qsort for this case. But it will be a larger >> > bunch of code than >> > shell sort. >> > >> >> And I'd recommend doing that when there is a need, and I don't think >> >> this patch really needs it, since bucket sort handles most cases >> >> anyway. >> > >> > And it still needs insertion sort for buckets. >> > I can agree to get rid of shell sort. But insertion sort is necessary. >> >> I didn't suggest getting rid of insertion sort. But the trick above is >> equally applicable to insertion sort. > This trick is used in simplehash.h. I agree, it could be useful for qsort. > This will not make qsort inlineable, but it will reduce overhead a lot. > > This trick is too heavy-weight for insertion sort alone, though. Without > shellsort, insertion sort can be expressed in a 14-line macro (8 lines > without curly braces). But if insertion sort will be defined together with > qsort (because qsort still needs it), then it is justifiable. What do you mean by heavy-weight? Aside from requiring all that include magic, if you place specialized sort functions in a reusable header, using them is as simple as including the type-specific header (or declaring the type macros and including the template), and calling them as regular functions. There's no runtime overhead involved, especially if you declare the comparison function as a macro or a static inline function. The sort itself can be declared static inline as well, and the compiler will decide whether it's worth inlining. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Small improvement to compactify_tuples
On Tue, Nov 7, 2017 at 11:42 AM, Юрий Соколов wrote: > > 2017-11-07 17:15 GMT+03:00 Claudio Freire : >> Aside from requiring all that include magic, if you place specialized >> sort functions in a reusable header, using them is as simple as >> including the type-specific header (or declaring the type macros and >> including the template), and calling them as regular functions. There's >> no runtime overhead involved, especially if you declare the comparison >> function as a macro or a static inline function. The sort itself can >> be declared static inline as well, and the compiler will decide >> whether it's worth inlining. > > Ok, if no one complains about yet another qsort implementation, > I will add a template header for qsort. Since qsort needs insertion sort, > it will be in the same file. > Do you approve of this? > > With regards, > Sokolov Yura If you need it. I'm not particularly fond of writing code before it's needed. If you can measure the impact for that particular case where qsort might be needed, and it's a real-world case, then by all means. Otherwise, if it's a rarely-encountered corner case, I'd recommend simply calling the stdlib's qsort. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Small improvement to compactify_tuples
On Wed, Nov 8, 2017 at 12:33 PM, Tom Lane wrote: > Robert Haas writes: >> On Tue, Nov 7, 2017 at 4:39 PM, Tom Lane wrote: >>> What I'm getting from the standard pgbench measurements, on both machines, >>> is that this patch might be a couple percent slower than HEAD, but that is >>> barely above the noise floor so I'm not too sure about it. > >> Hmm. It seems like slowing down single client performance by a couple >> of percent is something that we really don't want to do. > > I do not think there is any change here that can be proven to always be a > win. Certainly the original patch, which proposes to replace an O(n log n) > sort algorithm with an O(n^2) one, should not be thought to be that. > The question to focus on is what's the average case, and I'm not sure how > to decide what the average case is. But more than two test scenarios > would be a good start. > > regards, tom lane Making no change to the overall algorithm and merely replacing qsort with an inlineable type-specific one should be a net win in all cases. Doing bucket sort with a qsort of large buckets (or small tuple arrays) should also be a net win in all cases. Using shell sort might not seem as clear-cut, but let's not forget the original patch only uses it on very small arrays, and very infrequently at that. What's perhaps not clear is whether there are better ideas. Like rebuilding the page as Tom proposes, which doesn't seem like a bad idea. Bucket sort already is O(bytes), just as memcpy, only it has a lower constant factor (it's bytes/256 in the original patch), which might make copying the whole page an extra time lose against bucket sort in a few cases. Deciding that last point does need more benchmarking. That doesn't mean the other improvements can't be pursued in the meanwhile, right? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
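[Editor's note: the bucketing step being debated amounts to something like the following sketch, assuming 8kB pages and the thread's 32-byte bucket granularity (256 buckets). This is illustrative only; the real patch sorts ItemIdData entries, not bare offsets:

#include <stdint.h>

#define N_BUCKETS 256           /* 8kB page / 32-byte granularity */

/*
 * One counting pass, prefix sums, then scatter: O(nitems + N_BUCKETS)
 * work, which is the O(bytes) cost with a small constant factor
 * mentioned above.  Each bucket ends up tiny (1-2 items on 8kB
 * pages), so a per-bucket insertion sort finishes the job cheaply.
 */
static void
bucket_spread_offsets(const uint16_t *in, uint16_t *out, int n)
{
    int         counts[N_BUCKETS] = {0};
    int         starts[N_BUCKETS];
    int         pos = 0;

    for (int i = 0; i < n; i++)
        counts[in[i] >> 5]++;   /* bucket = offset / 32 */

    for (int b = 0; b < N_BUCKETS; b++)
    {
        starts[b] = pos;
        pos += counts[b];
    }

    for (int i = 0; i < n; i++)
        out[starts[in[i] >> 5]++] = in[i];
}

The whole-page-copy approach and the bucket sort both touch O(bytes); which wins depends on the constant factors, which is the benchmarking question left open above.]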
Re: [HACKERS] Logical tape pause/resume
On Tue, Oct 4, 2016 at 7:47 AM, Heikki Linnakangas wrote: > 1. The first patch changes the way we store the logical tapes on disk. > Instead of using indirect blocks, HFS style, the blocks form a linked list. > Each block has a trailer, with the block numbers of the previous and next > block of the tape. That eliminates the indirect blocks, which simplifies the > code quite a bit, and makes it simpler to implement the pause/resume > functionality in the second patch. It also immediately reduces the memory > needed for the buffers, from 3 to 1 block per tape, as we don't need to hold > the indirect blocks in memory. Doesn't that make prefetching more than a buffer ahead harder? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
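[Editor's note: concretely, the layout Heikki describes makes each tape block look roughly like this (names invented for illustration):

#include <stdint.h>

#define BLCKSZ 8192

/*
 * Illustrative tape-block trailer: prev/next turn the tape into a
 * doubly-linked list, replacing the indirect-block tree.  The catch
 * Claudio raises: the next block's number is only known once the
 * current block has been read, so readahead can't run more than one
 * block ahead without extra bookkeeping.
 */
typedef struct TapeBlockTrailer
{
    int64_t     prev;           /* previous block on this tape, or -1 */
    int64_t     next;           /* next block on this tape, or -1 */
} TapeBlockTrailer;

#define TapeBlockPayloadSize  (BLCKSZ - sizeof(TapeBlockTrailer))
#define TapeBlockGetTrailer(buf) \
    ((TapeBlockTrailer *) ((char *) (buf) + TapeBlockPayloadSize))

With indirect blocks, by contrast, a reader holds a whole list of upcoming block numbers in memory and can issue prefetches for several of them at once.]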
Re: [HACKERS] Merge Join with an Index Optimization
On Sun, Sep 11, 2016 at 1:20 PM, Tom Lane wrote: > Michael Malis writes: >> As I understand it, a merge join will currently read all tuples from both >> subqueries (besides early termination). I believe it should be possible to >> take advantages of the indexes on one or both of the tables being read from >> to skip a large number of tuples that would currently be read. > > IIUC, what you're proposing is to replace "read past some tuples in the > index" with "re-descend the index btree". Yes, that would win if it > allows skipping over a large number of index entries, but it could lose > big-time if not. The difficulty is that I don't see how the merge join > level could predict whether it would win for any particular advance step. > You'd really need most of the smarts to be in the index AM. > > You might want to troll the archives for past discussions of index skip > scan, which is more or less the same idea (could possibly be exactly the > same idea, depending on how it's implemented). I think we had arrived > at the conclusion that re-descent from the root might be appropriate > when transitioning to another index page, but not intra-page. Anyway > no one's produced a concrete patch yet. Interesting it should be brought up. I was considering the index skip optimization for vacuum at some point, might as well consider it for regular scans too if I get to that. Basically, instead of a simple get next, I was considering adding a "get next skip until". The caller would be the one responsible for sending the hint, and the index am would be free to implement the skip in a smart way if possible. The interface for vacuum is a bit different in practice, but conceptually the same. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
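[Editor's note: for illustration, the hinted interface could look like this. It is purely hypothetical; no such AM entry point exists:

#include <stdbool.h>

/* opaque stand-ins for the real scan descriptor and key types */
typedef struct IndexScanDescData *IndexScanDesc;
typedef struct ScanKeyData *ScanKey;
typedef struct ItemPointerData ItemPointerData;

/*
 * Hypothetical "get next, skipping until" callback: like amgettuple,
 * but the caller passes a lower-bound key.  A smart AM can re-descend
 * from the root and jump to the first entry >= skipkey; a naive AM
 * may ignore the hint and simply step forward, which is always
 * correct, just slower.  A merge join would pass the current outer
 * key here; vacuum would pass the next heap block of interest.
 */
typedef bool (*amgettuple_skipuntil_fn) (IndexScanDesc scan,
                                         ScanKey skipkey,
                                         ItemPointerData *out_tid);

Keeping the hint advisory is what sidesteps Tom's predictability objection: the AM, which can see the page boundaries, decides whether a re-descent is likely to pay off.]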
Re: [HACKERS] Indirect indexes
On Tue, Oct 18, 2016 at 3:28 PM, Alvaro Herrera wrote: > I propose we introduce the concept of "indirect indexes". I have a toy > implementation and before I go further with it, I'd like this assembly's > input on the general direction. > > Indirect indexes are similar to regular indexes, except that instead of > carrying a heap TID as payload, they carry the value of the table's > primary key. Because this is laid out on top of existing index support > code, values indexed by the PK can only be six bytes long (the length of > ItemPointerData); in other words, 281,474,976,710,656 rows are > supported, which should be sufficient for most use cases.[1] You don't need that limitation (and vacuum will be simpler) if you add the PK as another key, akin to: CREATE INDIRECT INDEX idx ON tab (a, b, c); turns into CREATE INDEX idx ON tab (a, b, c, pk); And is queried appropriately (using an index-only scan, extracting the PK from the index tuple, and then querying the PK index to get the tids). In fact, I believe that can work with all index ams supporting index-only scans. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Indirect indexes
On Tue, Oct 18, 2016 at 5:48 PM, Simon Riggs wrote: > On 18 October 2016 at 22:04, Claudio Freire wrote: >> On Tue, Oct 18, 2016 at 3:28 PM, Alvaro Herrera >> wrote: >>> I propose we introduce the concept of "indirect indexes". I have a toy >>> implementation and before I go further with it, I'd like this assembly's >>> input on the general direction. >>> >>> Indirect indexes are similar to regular indexes, except that instead of >>> carrying a heap TID as payload, they carry the value of the table's >>> primary key. Because this is laid out on top of existing index support >>> code, values indexed by the PK can only be six bytes long (the length of >>> ItemPointerData); in other words, 281,474,976,710,656 rows are >>> supported, which should be sufficient for most use cases.[1] >> >> >> You don't need that limitation (and vacuum will be simpler) if you add >> the PK as another key, akin to: >> >> CREATE INDIRECT INDEX idx ON tab (a, b, c); >> >> turns into >> >> CREATE INDEX idx ON tab (a, b, c, pk); >> >> And is queried appropriately (using an index-only scan, extracting the >> PK from the index tuple, and then querying the PK index to get the >> tids). >> >> In fact, I believe that can work with all index ams supporting index-only >> scans. > > But if you did that, an UPDATE of a b or c would cause a non-HOT > update, so would defeat the purpose of indirect indexes. I meant besides all the other work, omitting the tid from the index (as only the PK matters), marking them indirect, and all that. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Indirect indexes
On Wed, Oct 19, 2016 at 10:21 AM, Simon Riggs wrote: >> Simon objected that putting the PK >> into the index tuple would disable HOT, but I don't think that's a >> valid objection. > > Just to be clear, that's not what I objected to. Claudio appeared to > be suggesting that an indirect index is the same thing as an index > with PK tacked onto the end, which I re-confirm is not the case since > doing that would not provide the primary objective of indirect > indexes. No, I was suggesting using the storage format of those indexes. Perhaps I wasn't clear. CREATE INDEX could be implemented entirely as the rewrite I mention, I believe. But everything else can't, as you say. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Indirect indexes
On Wed, Oct 19, 2016 at 2:04 PM, Bruce Momjian wrote: >> What we should ask is what is the difference between indirect indexes >> and WARM and to what extent they overlap. >> >> My current understanding is that WARM won't help you if you update >> parts of a JSON document and/or use GIN indexes, but is effective >> without needing to add a new index type and will be faster for >> retrieval than indirect indexes. >> >> So everybody please chirp in with benefits or comparisons. > > I am not sure we have even explored all the limits of WARM with btree > indexes --- I haven't heard anyone talk about non-btree indexes yet. AFAIK there's no fundamental reason why it wouldn't work for other index ams, but it does require quite a bit of legwork to get everything working there. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Indirect indexes
On Thu, Oct 20, 2016 at 12:14 PM, Petr Jelinek wrote: > WARM can do WARM update 50% of time, indirect index can do HOT update > 100% of time (provided the column is not changed), I don't see why we > could not have both solutions. > > That all being said, it would be interesting to hear Álvaro's thoughts > about which use-cases he expects indirect indexes to work better than WARM. I'm not Alvaro, but it's quite evident that indirect indexes don't need space on the same page to get the benefits of HOT update (even though it wouldn't be HOT). That's a big difference IMO. That said, WARM isn't inherently limited to 50%, but it *is* limited to HOT-like updates (new tuple is in the same page as the old), and since in many cases that is a limiting factor for HOT updates, one can expect WARM will be equally limited. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Indirect indexes
On Thu, Oct 20, 2016 at 12:30 PM, Pavan Deolasee wrote: > > > On Thu, Oct 20, 2016 at 8:44 PM, Petr Jelinek wrote: >> >> >> >> WARM can do WARM update 50% of time, indirect index can do HOT update >> 100% of time (provided the column is not changed), I don't see why we >> could not have both solutions. >> > > I think the reason why I restricted WARM to one update per chain, also > applies to indirect index. For example, if a indirect column value is > changed from 'a' to 'b' and back to 'a', there will be two pointers from 'a' > to the PK and AFAICS that would lead to the same duplicate scan issue. > > We have a design to convert WARM chains back to HOT and that will increase > the percentage of WARM updates much beyond 50%. I was waiting for feedback > on the basic patch before putting in more efforts, but it went unnoticed > last CF. With indirect indexes, since you don't need to insert a tid, you can just "insert on conflict do nothing" on the index. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Indirect indexes
On Thu, Oct 20, 2016 at 1:08 PM, Pavan Deolasee wrote: > On Thu, Oct 20, 2016 at 9:20 PM, Claudio Freire > wrote: >> >> >> >> With indirect indexes, since you don't need to insert a tid, you can >> just "insert on conflict do nothing" on the index. > > > Would that work with non-unique indexes? Anyways, the point I was trying to > make is that there are similar technical challenges and we could solve it > for WARM as well with your work for finding conflicting pairs and > then not doing inserts. My thinking currently is that it will lead to other > challenges, especially around vacuum, but I could be wrong. Consider this: Have a vacuum_cycle_id field in the metapage, with one bit reserved to indicate whether there's a vacuum in progress. While there is a vacuum in progress on the index, all kinds of modifications will look up the entry, and store the current vacuum_cycle_id in the unused space for the tid pointer on the index entry. When no vacuum is in progress, only new entries will be added (with the current vacuum cycle id). So indirect indexes incur a cost similar to that of regular indexes, but only while a vacuum runs. When vacuuming, allocate 1/2 maintenance_work_mem for a bloom filter, increase all vacuum cycle ids (on the metapage), and mark a vacuum in progress. Scan the heap, adding the (key, PK) pairs of *non-dead* tuples to the bloom filter. That's one BF per index, sadly, but bear with me. Then scan the indexes. (key, PK) pairs *not* in the BF that have the *old* vacuum cycle id get removed. Clear the vacuum-in-progress flag on all indexes' metapages. The only drawback here is that maintenance_work_mem dictates the amount of uncleanable waste left on the indexes (BF false positives). Surely, the BF could be replaced with an accurate set rather than an approximate one, but that could require a lot of memory if keys are big, and a lot of scans of the indexes. The BF trick bounds the amount of waste left while minimizing work. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
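[Editor's note: in pseudo-C, the cleanup pass being proposed looks something like this. BloomFilter, IndexEntry and the callbacks are all hypothetical; the real thing would live inside the index AM's bulk-delete path:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* hypothetical approximate set: no false negatives, bounded false positives */
typedef struct BloomFilter BloomFilter;
extern bool bloom_maybe_contains(const BloomFilter *bf,
                                 const void *key, size_t keylen);

/* hypothetical indirect-index entry: indexed value plus PK, stamped
 * with the vacuum cycle id that last touched it */
typedef struct
{
    char        key[16];        /* indexed value (fixed size for the sketch) */
    char        pk[6];          /* primary-key value, PK-as-TID encoding */
    uint32_t    cycle_id;       /* vacuum cycle id stamped on the entry */
} IndexEntry;

/*
 * Remove entries from the old vacuum cycle that the heap scan never
 * saw alive.  Entries stamped with the new cycle id were inserted or
 * touched while this vacuum ran, so they are kept unconditionally.
 * BF false positives only leave a bounded amount of waste behind.
 */
static void
indirect_index_cleanup(const BloomFilter *live_pairs,
                       IndexEntry *entries, int nentries,
                       uint32_t old_cycle_id,
                       void (*remove_entry) (IndexEntry *))
{
    for (int i = 0; i < nentries; i++)
    {
        if (entries[i].cycle_id != old_cycle_id)
            continue;           /* concurrently touched: keep */
        if (!bloom_maybe_contains(live_pairs, &entries[i],
                                  offsetof(IndexEntry, cycle_id)))
            remove_entry(&entries[i]);
    }
}

The bloom filter sized at half of maintenance_work_mem is what bounds the false-positive rate, and hence the uncleanable waste the message mentions.]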
Re: [HACKERS] CLUSTER, reform_and_rewrite_tuple(), and parallelism
On Wed, Oct 26, 2016 at 9:33 PM, Peter Geoghegan wrote: > Besides, parallel CREATE INDEX alone will probably > be quite effective at speeding up CLUSTER in practice, simply because > that's often where most wall clock time is spent during a CLUSTER > operation. Creating all indexes in parallel could also help considerably in the case where there are multiple indexes, and seems far simpler. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
On Thu, Nov 17, 2016 at 2:34 PM, Masahiko Sawada wrote: > I glanced at the patches but both patches don't obey the coding > style of PostgreSQL. > Please refer to [1]. > > [1] > http://wiki.postgresql.org/wiki/Developer_FAQ#What.27s_the_formatting_style_used_in_PostgreSQL_source_code.3F. I thought I had. I'll go through that list to check what I missed. >> In the results archive, the .vacuum prefix is the patched version with >> both patch 1 and 2, .git.ref is just patch 1 (without which the >> truncate takes unholily long). > > Did you measure the performance benefit of the 0001 patch by comparing > HEAD with HEAD+0001? Not the whole test, but yes. Without the 0001 patch the backward scan step during truncate goes between 3 and 5 times slower. I don't have timings because the test never finished without the patch. It would have finished, but it would have taken about a day. >> Grepping the results a bit, picking an average run out of all runs on >> each scale: >>
>> Timings:
>>
>> Patched:
>>
>> s100: CPU: user: 3.21 s, system: 1.54 s, elapsed: 18.95 s.
>> s400: CPU: user: 14.03 s, system: 6.35 s, elapsed: 107.71 s.
>> s4000: CPU: user: 228.17 s, system: 108.33 s, elapsed: 3017.30 s.
>>
>> Unpatched:
>>
>> s100: CPU: user: 3.39 s, system: 1.64 s, elapsed: 18.67 s.
>> s400: CPU: user: 15.39 s, system: 7.03 s, elapsed: 114.91 s.
>> s4000: CPU: user: 282.21 s, system: 105.95 s, elapsed: 3017.28 s.
>>
>> Total I/O (in MB):
>>
>> Patched:
>>
>> s100: R:2.4 - W:5862
>> s400: R:1337.4 - W:29385.6
>> s4000: R:318631 - W:370154
>>
>> Unpatched:
>>
>> s100: R:1412.4 - W:7644.6
>> s400: R:3180.6 - W:36281.4
>> s4000: R:330683 - W:370391
>>
>> So, in essence, CPU time didn't get adversely affected. If anything, >> it got improved by about 20% on the biggest case (scale 4000). > > And this test case deletes all tuples in the relation and then measures the > duration of vacuum. > It would not have much effect in practical use cases. Well, this patch set started because I had to do exactly that, and realized just how inefficient vacuum was in that case. But that doesn't mean it won't benefit more realistic use cases. Almost any big database ends up hitting this 1GB limit because big tables take very long to vacuum and accumulate a lot of bloat in-between vacuums. If you have a specific test case in mind I can try to run it. >> While total runtime didn't change much, I believe this is only due to >> the fact that the index is perfectly correlated (clustered?) since >> it's a pristine index, so index scans either remove or skip full >> pages, never leaving things half-way. A bloated index would probably >> show substantially different behavior; I'll try to get a script that >> does it by running pgbench a while before the vacuum or something like >> that. >> >> However, the total I/O metric already shows remarkable improvement. >> This metric is measuring all the I/O, including pgbench init, the >> initial vacuum pgbench init does, the delete and the final vacuum. So >> it's not just the I/O for the vacuum itself, but the whole run. We can >> see the patched version reading a lot less (fewer scans over the >> indexes will do that), and in some cases writing less too (again, fewer >> index scans may be performing less redundant writes when cleaning >> up/reclaiming index pages). > > What value of maintenance_work_mem did you use for this test? 4GB on both, patched and HEAD.
> Since the DeadTuplesSegment struct still stores an array of ItemPointerData (6 bytes each) > representing dead tuples, I supposed that the frequency of index vacuums > would not change. But according to the test results, an index vacuum is > invoked once and removes 400000000 rows at a time. It means that the > vacuum stored about 2289 MB of memory during the heap vacuum. On the other > side, the results of the test without the 0002 patch show that an index vacuum > removes 178956737 rows at a time, which means 1GB of memory was used. 1GB is a hardcoded limit on HEAD for vacuum work mem. This patch removes that limit and lets vacuum use all the memory the user gave to vacuum. In the test, in both cases, 4GB was used as the maintenance_work_mem value, but HEAD cannot use all of the 4GB, and will limit itself to just 1GB. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
On Thu, Nov 17, 2016 at 2:51 PM, Claudio Freire wrote: > On Thu, Nov 17, 2016 at 2:34 PM, Masahiko Sawada > wrote: >> I glanced at the patches but both patches don't obey the coding >> style of PostgreSQL. >> Please refer to [1]. >> >> [1] >> http://wiki.postgresql.org/wiki/Developer_FAQ#What.27s_the_formatting_style_used_in_PostgreSQL_source_code.3F. > > I thought I had. I'll go through that list to check what I missed. Attached is patch 0002 with pgindent applied over it. I don't think there's any other formatting issue, but feel free to point a finger at it if I missed any.

From b9782cef0d971de341115bd3474d192419674219 Mon Sep 17 00:00:00 2001
From: Claudio Freire
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH 2/2] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB of work
mem during vacuum and thus reduce the number of full index scans
necessary to remove all dead tids when the memory is available.
---
 src/backend/commands/vacuumlazy.c | 295 +-
 1 file changed, 230 insertions(+), 65 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 854bce3..dfb0612 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -91,6 +91,7 @@
  * tables.
  */
 #define LAZY_ALLOC_TUPLES		MaxHeapTuplesPerPage
+#define LAZY_MIN_TUPLES		Max(MaxHeapTuplesPerPage, (128<<20) / sizeof(ItemPointerData))
 
 /*
  * Before we consider skipping a page that's marked as clean in
@@ -98,6 +99,27 @@
  */
 #define SKIP_PAGES_THRESHOLD	((BlockNumber) 32)
 
+typedef struct DeadTuplesSegment
+{
+	int			num_dead_tuples;	/* # of entries in the segment */
+	int			max_dead_tuples;	/* # of entries allocated in the segment */
+	ItemPointerData last_dead_tuple;	/* Copy of the last dead tuple (unset
+										 * until the segment is fully
+										 * populated) */
+	unsigned short padding;
+	ItemPointer dead_tuples;	/* Array of dead tuples */
+}	DeadTuplesSegment;
+
+typedef struct DeadTuplesMultiArray
+{
+	int			num_entries;	/* current # of entries */
+	int			max_entries;	/* total # of slots that can be allocated in
+								 * array */
+	int			num_segs;		/* number of dead tuple segments allocated */
+	int			last_seg;		/* last dead tuple segment with data (or 0) */
+	DeadTuplesSegment *dead_tuples;		/* array of num_segs segments */
+}	DeadTuplesMultiArray;
+
 typedef struct LVRelStats
 {
 	/* hasindex = true means two-pass strategy; false means one-pass */
@@ -117,14 +139,14 @@ typedef struct LVRelStats
 	BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
 	/* List of TIDs of tuples we intend to delete */
 	/* NB: this list is ordered by TID address */
-	int			num_dead_tuples;	/* current # of entries */
-	int			max_dead_tuples;	/* # slots allocated in array */
-	ItemPointer dead_tuples;	/* array of ItemPointerData */
+	DeadTuplesMultiArray dead_tuples;
 	int			num_index_scans;
 	TransactionId latestRemovedXid;
 	bool		lock_waiter_detected;
 } LVRelStats;
 
+#define DeadTuplesCurrentSegment(lvrelstats) (&((lvrelstats)->dead_tuples.dead_tuples[ \
+	(lvrelstats)->dead_tuples.last_seg ]))
 
 /* A few variables that don't seem worth passing around as parameters */
 static int	elevel = -1;
@@ -149,7 +171,7 @@ static void lazy_cleanup_index(Relation indrel,
 				   IndexBulkDeleteResult *stats,
 				   LVRelStats *vacrelstats);
 static int lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
-				 int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer);
+				 int tupindex, LVRelStats *vacrelstats, DeadTuplesSegment * seg, Buffer *vmbuffer);
 static bool should_attempt_truncation(LVRelStats *vacrelstats);
 static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats);
 static BlockNumber count_nondeletable_pages(Relation onerel,
@@ -157,8 +179,9 @@ static BlockNumber count_nondeletable_pages(Relation onerel,
 static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks);
 static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
 					   ItemPointer itemptr);
+static void lazy_clear_dead_tuples(LVRelStats *vacrelstats);
 static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
-static int	vac_cmp_itemptr(const void *left, const void *right);
+static inline int vac_cmp_itemptr(const void *left, const void *right);
 static bool heap_page_is_all_visible(Relation rel, Buffer buf,
 						 TransactionId *visibility_cutoff_xid, bool *all_frozen);
@@ -498,7 +521,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 	/* Report that we're scanning the heap, advertising total # of blocks */
 	initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
 	initprog_val[1] = nblocks;
-	initprog_val[2] = vacrelstats->max_dead_tu
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
On Thu, Nov 17, 2016 at 6:34 PM, Robert Haas wrote: > On Thu, Nov 17, 2016 at 1:42 PM, Claudio Freire > wrote: >> Attached is patch 0002 with pgindent applied over it >> >> I don't think there's any other formatting issue, but feel free to >> point a finger to it if I missed any > > Hmm, I had imagined making all of the segments the same size rather > than having the size grow exponentially. The whole point of this is > to save memory, and even in the worst case you don't end up with that > many segments as long as you pick a reasonable base size (e.g. 64MB). Wastage is bound by a fraction of the total required RAM, that is, it's proportional to the amount of required RAM, not the amount allocated. So it should still be fine, and the exponential strategy should improve lookup performance considerably. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location
On Fri, Nov 18, 2016 at 10:05 PM, Peter Geoghegan wrote: > On Thu, Aug 18, 2016 at 2:15 PM, Peter Geoghegan wrote: >> I think that this is a bad idea. We need to implement suffix >> truncation of internal page index tuples at some point, to make them >> contain less information from the original leaf page index tuple. >> That's an important optimization, because it increases fan-in. This >> seems like a move in the opposite direction. > > Maybe I was too hasty, here. Can you rebase this patch, Claudio? I've been changing the on-disk format considerably, to one that makes more sense. I haven't posted it because it still doesn't have suffix (and prefix) truncation, but it should be compatible with it (ie: it could be implemented as an optimization that doesn't change the on-disk format). However, I haven't tested the current state of the patch thoroughly. If a WIP is fine, I can post what I have, rebased. > It's possible that this idea could have further non-obvious benefits. > I see some potential for this to help with nbtree page deletion, item > recycling, etc. For example, maybe VACUUM can manage to do something > much closer to a regular index scan when that seems to make sense -- That was on my mind too. > I also think that this might help with bitmap index scans. How so? AFAIK it helps regular scans on low-cardinality indexes more than bitmap index scans. With low-cardinality indexes, the resulting heap access pattern will be an unordered series of sequential range scans, which is better than the fully random scan the current layout causes. Bitmap index scans, however, deny that benefit. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location
On Fri, Nov 18, 2016 at 11:09 PM, Peter Geoghegan wrote: > On Fri, Nov 18, 2016 at 5:27 PM, Claudio Freire > wrote: >> I've been changing the on-disk format considerably, to one that makes >> more sense. > > I can see how that would be necessary. I'm going to suggest some more > things that need a new btree version number now, too. > > I realized today, quite suddenly, why I disliked your original patch: > it didn't go far enough with embracing the idea of > heap-item-pointer-as-key. In other words, I didn't like that you > didn't change anything in the internal pages. But it did. In fact it only changed internal pages. > Maybe you should put > heap TIDs someplace in new internal pages, so that even there we treat > them as part of the key. That is indeed what's happening (both in the old and new version). The new version also opens up the possibility of omitting the heap TID in inner pages, if they're redundant (ie: if two consecutive keys are different already, the heap TID part of the key is redundant, since it's not necessary to make tree descent decisions). > This may significantly benefit UPDATE-heavy > workloads where HOT doesn't work out so well. Particularly for > non-unique indexes, we currently have to wade through a lot of bloat > from the leaf level, rather than jumping straight to the correct leaf > page when updating or inserting. That is improved in the patch as well. The lookup for an insertion point on a heavily bloated (unique or not) index is done efficiently, instead of the O(N^2) way it used before. > You might not want to do the same with unique indexes, where we must > initially buffer lock "the first leaf page that a duplicate could be > on" while we potentially look in one or more additional sibling pages > (but bloat won't be so bad there, perhaps). It's done even for unique indexes. Locking in that case becomes complex, true, but I believe it's not an insurmountable problem. > Or, you might want to make > sure that B-Tree tuple insertions still find that "first page" to > check, despite still generally treating heap item pointer as part of > the key proper (in unique indexes, it can still behave like NULL, and > not participate in uniqueness checking, while still essentially being > part of the key/sort order). It behaves like that (even though it's not coded like that). > It also occurs to me that our performance for updates in the event of > many NULL values in indexes is probably really bad, and could be > helped by this. You'll want to test all of this with a workload that > isn't very sympathetic to HOT, of course -- most of these benefits are > seen when HOT doesn't help. I haven't really tested with an overabundance of NULLs; I'll add that to the tests. I have tested low-cardinality indexes though, but only for correctness. >> However, I haven't tested the current state of the patch thoroughly. >> >> If a WIP is fine, I can post what I have, rebased. > > I'm mostly curious about the effects on bloat. I now feel like you > haven't sufficiently examined the potential benefits there, since you > never made heap item pointer a first class part of the key space. > Maybe you'd like to look into that yourself first? What do you mean by "first class part"? It's not included in the scankey for a variety of reasons, not the least of which is not breaking the interface for current uses, and because the tid type is quite limited in its capabilities ATM.
Also, the heap TID is usually there on most AM calls that care about it (ie: insertions have it, of course), so adding it to the scankey also seemed redundant. If not adding to the scankey, what do you mean by "first class" then? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
On Mon, Nov 21, 2016 at 2:15 PM, Masahiko Sawada wrote: > On Fri, Nov 18, 2016 at 6:54 AM, Claudio Freire > wrote: >> On Thu, Nov 17, 2016 at 6:34 PM, Robert Haas wrote: >>> On Thu, Nov 17, 2016 at 1:42 PM, Claudio Freire >>> wrote: >>>> Attached is patch 0002 with pgindent applied over it. >>>> >>>> I don't think there's any other formatting issue, but feel free to >>>> point a finger at it if I missed any. >>> >>> Hmm, I had imagined making all of the segments the same size rather >>> than having the size grow exponentially. The whole point of this is >>> to save memory, and even in the worst case you don't end up with that >>> many segments as long as you pick a reasonable base size (e.g. 64MB). >> >> Wastage is bound by a fraction of the total required RAM, that is, >> it's proportional to the amount of required RAM, not the amount >> allocated. So it should still be fine, and the exponential strategy >> should improve lookup performance considerably. > > I'm concerned that it could use repalloc for a large memory area when > vacrelstats->dead_tuples.dead_tuples is bloated. It would be overhead > and slow. How large? That array cannot be very large. It contains pointers to exponentially-growing arrays, but the repalloc'd array itself is small: one struct per segment, and each segment starts at 128MB and grows exponentially. In fact, IIRC, it can be proven that such a repalloc strategy has an amortized cost of O(log log n) per tuple. If it repalloc'd the whole tid array it would be O(log n), but since it handles only pointers to segments of exponentially growing tuple arrays it becomes O(log log n). Furthermore, n there is still limited to MAX_INT, which means the cost per tuple is bound by O(log log 2^32) = 5. And that's an absolute worst case that's ignoring the 128MB constant factor, which is indeed relevant. > What about using semi-fixed memory space without repalloc; > Allocate an array of ItemPointerData arrays, where each ItemPointerData > array stores the dead tuple locations. The size of the ItemPointerData > array starts small (e.g. 16MB or 32MB). After we use an > array up, we then allocate the next segment with twice the size of the > previous segment. That's what the patch does. > But to prevent over-allocating memory, it would be better to > set a limit on the allocated size of the ItemPointerData arrays, to 512MB or > 1GB. There already is a limit of 1GB (the maximum amount palloc can allocate) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
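[Editor's note: a sketch of the allocation pattern under discussion, mirroring the patch's DeadTuplesMultiArray but greatly simplified (plain malloc/realloc stand in for palloc/repalloc, and error handling is omitted):

#include <stdlib.h>

typedef struct { char data[6]; } ItemPtr;  /* stand-in for ItemPointerData */

typedef struct
{
    ItemPtr    *tuples;         /* the segment's dead-tuple array */
    int         ntuples;
    int         maxtuples;
} Segment;

typedef struct
{
    Segment    *segs;           /* small, realloc'd directory of segments */
    int         nsegs;
} MultiArray;

/*
 * Exponential segment growth: each new segment doubles the previous
 * capacity, so the directory realloc touches only O(log n) structs in
 * total, and -- unlike a single realloc'd tid array -- the tuples
 * themselves are never copied when the structure grows.
 */
static Segment *
add_segment(MultiArray *ma, int first_seg_tuples)
{
    int         cap = (ma->nsegs == 0) ? first_seg_tuples :
        ma->segs[ma->nsegs - 1].maxtuples * 2;

    ma->segs = realloc(ma->segs, (ma->nsegs + 1) * sizeof(Segment));
    ma->segs[ma->nsegs].tuples = malloc((size_t) cap * sizeof(ItemPtr));
    ma->segs[ma->nsegs].ntuples = 0;
    ma->segs[ma->nsegs].maxtuples = cap;
    return &ma->segs[ma->nsegs++];
}

Lookups (lazy_tid_reaped in the patch) then binary-search the directory by each segment's last_dead_tuple, and binary-search within the segment, which is where the O(log log n) claim comes from.]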
Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location
On Mon, Nov 21, 2016 at 5:42 PM, Peter Geoghegan wrote: >>> Or, you might want to make >>> sure that B-Tree tuple insertions still find that "first page" to >>> check, despite still generally treating heap item pointer as part of >>> the key proper (in unique indexes, it can still behave like NULL, and >>> not participate in uniqueness checking, while still essentially being >>> part of the key/sort order). >> >> It behaves like that (even though it's not coded like that) > > Why not? What do you mean? > > What I'm interested in evaluating is an implementation where an index > on (foo) has a sort order/key space that is precisely the same as an > index on (foo, ctid), with zero exceptions. Well, the patch does the same sorting, but without explicitly adding the ctid to the scankey. When inserting into a unique index, the lookup for the insertion point proceeds as it would if the key was (foo, ctid), finding a page in the middle of the range that contain item pointers for foo. When that happens on a regular index, the insertion is done at that point and nothing else needs to be done. But when it happens on a unique index, the code first checks to see if the first item pointer for foo is on the same page (allegedly a frequent case). If so, the uniqueness check is done in a very straightforward and efficient manner. If not, however, the code retraces its steps up the tree to the first time it followed a key other than foo (that's the only way it could land at a middle page), and then follows the downlinks looking at a truncated key (just foo, ignoring ctid). Kind of what you say, but without the explicit null. > >>> It also occurs to me that our performance for updates in the event of >>> many NULL values in indexes is probably really bad, and could be >>> helped by this. You'll want to test all of this with a workload that >>> isn't very sympathetic to HOT, of course -- most of these benefits are >>> seen when HOT doesn't help. >> >> I haven't really tested with an overabundance of NULLs, I'll add that >> to the tests. I have tested low-cardinality indexes though, but only >> for correctness. > > I think we'll have to model data distribution to evaluate the idea > well. After all, if there is a uniform distribution of values anyway, > you're going to end up with many dirty leaf pages. Whereas, in the > more realistic case where updates are localized, we can hope to better > amortize the cost of inserting index tuples, and recycling index > tuples. Quite possibly >> What do you mean with "first class part"? >> >> It's not included in the scankey for a variety of reasons, not the >> least of which is not breaking the interface for current uses, and >> because the tid type is quite limited in its capabilities ATM. Also, >> the heap TID is usually there on most AM calls that care about it (ie: >> insertions have it, of course), so adding it to the scankey also >> seemed redundant. > > I mean that insertions into indexes that are low cardinality should be > largely guided by TID comparisons. ... >> If not adding to the scankey, what do you mean by "first class" then? > > Just what I said about the sort order of an index on "(foo)" now > precisely matching what we'd get for "(foo, ctid)". It is the case in both versions of the patch > There are a couple > of tricky issues with that that you'd have to look out for, like > making sure that the high key continues to hold a real TID, which at a > glance looks like something that just happens already. 
> I'm not sure > that we preserve the item pointer TID in all cases, since the current > assumption is that a high key's TID is just filler -- maybe we can > lose that at some point. I thought long about that, and inner pages don't need a real TID in their keys, as those keys only specify key space cutoff points and not real tids, and high keys' TIDs in leaf pages are always real. I haven't had any issue with that, aside from the headaches resulting from thinking about it for protracted periods of time. > You should use amcheck to specifically verify that that happens > reliably in all cases. Presumably, its use of an insertion scankey > would automatically see the use of TID as a tie-breaker with patched > Postgres amcheck verification, and so amcheck will work for this > purpose unmodified. It's totally on my roadmap. I'll have to adapt it for the new index format though; it doesn't work without modification. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
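[Editor's note: the ordering contract under discussion -- an index on (foo) sorting exactly as (foo, ctid) would -- boils down to a comparison like this conceptual sketch (not the patch's actual code):

#include <stdint.h>

/* stand-in for ItemPointerData: block number split in two halves,
 * plus the offset number, as stored on disk */
typedef struct
{
    uint16_t    blkhi;
    uint16_t    blklo;
    uint16_t    offnum;
} ItemPtr;

static int
itemptr_cmp(const ItemPtr *a, const ItemPtr *b)
{
    uint32_t    ablk = ((uint32_t) a->blkhi << 16) | a->blklo;
    uint32_t    bblk = ((uint32_t) b->blkhi << 16) | b->blklo;

    if (ablk != bblk)
        return ablk < bblk ? -1 : 1;
    if (a->offnum != b->offnum)
        return a->offnum < b->offnum ? -1 : 1;
    return 0;
}

/*
 * Key comparison with the heap TID as tie-breaker: equal user keys
 * fall through to physical position, so duplicates stay in heap
 * order, and the TID can be truncated away from inner-page keys
 * whenever the user keys already differ, as described above.
 */
static int
index_key_cmp(int key_a, const ItemPtr *tid_a,
              int key_b, const ItemPtr *tid_b)
{
    if (key_a != key_b)
        return key_a < key_b ? -1 : 1;
    return itemptr_cmp(tid_a, tid_b);
}

This is what turns duplicate-heavy scans into runs of sequential heap range accesses rather than fully random ones.]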
Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)
On Fri, Feb 24, 2017 at 1:24 PM, Robert Haas wrote: > On Fri, Feb 24, 2017 at 7:29 PM, Kohei KaiGai wrote: >> The attached patch re-designs the previous v2 according to Robert's >> comment. >> All I had to do was put an entrypoint for ForeignScan/CustomScan in >> ExecShutdownNode, because it is already modified to call the child node >> first, earlier than ExecShutdownGather, which releases the DSM segment. > > Aside from the documentation, which needs some work, this looks fine > to me on a quick read. LGTM too. You may want to clarify in the docs when the hook is called in relation to other hooks, too. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Thu, Jun 12, 2014 at 1:25 PM, Robert Haas wrote: > > It appears that any string starting with the letter "a" will create > output that begins with 001S00 and the seventh character always > appears to be 0 or 1: >
> [rhaas ~]$ ./strxfrm en_US ab ac ad ae af a% a0 "a "
> "ab" -> "001S001T001S001T"
> "ac" -> "001S001U001S001U"
> "ad" -> "001S001V001S001V"
> "ae" -> "001S001W001S001W"
> "af" -> "001S001X001S001X"
> "a%" -> "001S000W001S000W"
> "a0" -> "001S000b001S000b"
> "a " -> "001S000R001S000R"
... 
> [rhaas@hydra ~]$ ./strxfrm-binary en_US aaaaaaaa aaaaaaaA
> "aaaaaaaa" -> 0c0c0c0c0c0c0c0c010202020202020202010202020202020202 (26 bytes)
> "aaaaaaaA" -> 0c0c0c0c0c0c0c0c010202020202020202010202020202020204 (26 bytes)
> > Still, it's fair to say that on this Linux system, the first 8 bytes > capture a significant portion of the entropy of the first 8 bytes of > the string, whereas on MacOS X you only get entropy from the first 2 > bytes of the string. It would be interesting to see results from > other platforms people might care about also. If you knew the Mac's output character set with some certainty, you could compress it rather efficiently by applying a tabulated decode back into non-printable bytes. Say, like base64 decoding (the set appears to be a subset of base64, but it's hard to be sure). Ie:

x = strxfrm(s)
xz = hex(tab[x[0]] + 64*tab[x[1]] + 64^2*tab[x[2]] ... etc)

This can be made rather efficient. But the first step is defining with some certainty the output character set. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
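[Editor's note: a sketch of the repack being proposed, assuming (unverified, as the message stresses) that strxfrm output really is confined to a 64-character alphabet; tab[] is the hypothetical byte-to-code table that still needs to be established:

#include <stddef.h>
#include <stdint.h>

/*
 * If each strxfrm output byte takes only 64 distinct values, it
 * carries 6 bits of entropy, so 4 strxfrm bytes pack into 3
 * abbreviated-key bytes.  tab maps each possible output byte to its
 * 0..63 code.
 */
static size_t
repack_strxfrm(const unsigned char *xfrm, size_t len,
               unsigned char *out, const uint8_t tab[256])
{
    size_t      o = 0;
    uint32_t    acc = 0;
    int         nbits = 0;

    for (size_t i = 0; i < len; i++)
    {
        acc = (acc << 6) | tab[xfrm[i]];    /* 6 new entropy bits */
        nbits += 6;
        if (nbits >= 8)
        {
            nbits -= 8;
            out[o++] = (unsigned char) (acc >> nbits);
            acc &= (1u << nbits) - 1;       /* keep only residual bits */
        }
    }
    if (nbits > 0)
        out[o++] = (unsigned char) (acc << (8 - nbits));
    return o;
}

Packed this way, an 8-byte abbreviated key would carry the entropy of roughly the first 10-11 strxfrm bytes instead of only 8.]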
Re: [HACKERS] Minmax indexes
On Tue, Jun 17, 2014 at 1:04 PM, Andres Freund wrote: > For me minmax indexes are helpful because they allow generating *small* > 'coarse' indexes over large volumes of data. From my pov that's possible > because they don't contain item pointers for every contained > row. But minmax is just a specific form of bloom filter. This could certainly be generalized to a bloom filter index with some set of bloom & hashing operators (minmax being just one). -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Tue, Jun 17, 2014 at 8:47 AM, Abhijit Menon-Sen wrote: > if (compress_backup_block = BACKUP_BLOCK_COMPRESSION_SNAPPY) You mean == right? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Thu, Jun 19, 2014 at 10:06 AM, Heikki Linnakangas wrote: > On 06/18/2014 06:09 PM, Claudio Freire wrote: >> >> On Tue, Jun 17, 2014 at 8:48 PM, Greg Stark wrote: >>> >>> An aggregate to generate a min/max "bounding box" from several values >>> A function which takes a "bounding box" and a new value and returns >>> the new "bounding box" >>> A function which tests if a value is in a "bounding box" >>> A function which tests if a "bounding box" overlaps a "bounding box" >> >> >> Which I'd generalize a bit further by replacing "bounding box" with >> "compressed set", and allowing it to be parameterized. > > > What do you mean by parameterized? Bloom filters can be parameterized by the number of hashes, the number of bit positions, and the hash function, so it's not a simple bloom filter index, but a bloom filter index with N SHA-1-based hashes spread over a K-length bitmap. >> So, you have: >> >> An aggregate to generate a "compressed set" from several values >> A function which adds a new value to the "compressed set" and returns >> the new "compressed set" >> A function which tests if a value is in a "compressed set" >> A function which tests if a "compressed set" overlaps another >> "compressed set" of equal type > > > Yeah, something like that. I'm not sure I like the "compressed set" term any > more than bounding box, though. GiST seems to have avoided naming the thing, > and just talks about "index entries". But if we can come up with a good > name, that would be more clear. I don't want to use the term bloom filter since it's specific to one particular technique, but it's basically that - an approximate set without false negatives. Ie: a compressed set. It's not a bounding box either when using bloom filters. So... >> One problem with such a generalized implementation would be that I'm >> not sure in-place modification of the "compressed set" on-disk can be >> assumed to be safe in all cases. Surely, for strictly-enlarging sets >> it would, but while min/max and bloom filters both fit the bill, it's >> not clear that one can assume this for all structures. > > > I don't understand what you mean. It's a fundamental property of minmax > indexes that you can always replace the "min" or "max" or "compressing set" > or "bounding box" or whatever with another datum that represents all the > keys that the old one did, plus some. Yes, and bloom filters happen to fall in that category too. Never mind what I said. I was thinking of other potential and imaginary implementations that support removal or updates, which might need care with transaction lifetimes, but that's easily fixed by letting vacuum or some lazy process do the deleting, just as happens with other indexes anyway. So, I guess the interface must also include the invariant that compressed sets only grow, never shrink, unless from a rebuild or a vacuum operation. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
On Thu, Jun 19, 2014 at 2:49 PM, Greg Stark wrote: > I don't think the threaded implementation on Linux is the one to use > though. I find this *super* confusing but the kernel definitely > supports aio syscalls, glibc also has a threaded implementation it > uses if run on a kernel that doesn't implement the syscalls, and I > think there are existing libaio and librt libraries from outside glibc > that do one or the other. Which you build against seems to make a big > difference. My instinct is that anything but the kernel native > implementation will be worthless. The overhead of thread communication > will completely outweigh any advantage over posix_fadvise's partial > win. What overhead? The only communication is through a "done" bit and associated synchronization structure when *and only when* you want to wait on it. Furthermore, posix_fadvise is braindead for this use case; been there, done that. What you win with threads is a better postgres-kernel interaction; even if you lose some CPU performance, it's gonna beat posix_fadvise by a large margin. > The main advantage of posix aio is that we can actually receive the > data out of order. With posix_fadvise we can get the i/o and cpu > overlap but we will never process the later blocks until the earlier > requests are satisfied and processed in order. With aio you could do a > sequential scan, initiating i/o on 1,000 blocks and then processing > them as they arrive, initiating new requests as those blocks are > handled. And each and every I/O you issue with it counts as such on the kernel side. That's not the case with posix_fadvise, mind you, and that's terribly damaging for performance. > When I investigated this I found the buffer manager's I/O bits seemed > to already be able to represent the state we needed (i/o initiated on > this buffer but not completed). The problem was in ensuring that a > backend would process the i/o completion promptly when it might be in > the midst of handling other tasks and might even get an elog() stack > unwinding. The interface that actually fits Postgres best might be the > threaded interface (orthogonal to the threaded implementation > question) which is you give aio a callback which gets called on a > separate thread when the i/o completes. The alternative is you give > aio a list of operation control blocks and it tells you the state of > all the i/o operations. But it's not clear to me how you arrange to do > that regularly, promptly, and reliably. Indeed, we've been musing about using librt's support for completion callbacks for that. > The other gotcha here is that the kernel implementation only does > anything useful on DIRECT_IO files. That means you have to do *all* > the prefetching and i/o scheduling yourself. If that's the case, we should discard kernel-based implementations and stick to thread-based ones. Postgres cannot do scheduling as efficiently as the kernel, and it shouldn't try. > You would be doing that > anyways for sequential scans and bitmap scans -- and we already do it > with things like synchronised scans and posix_fadvise That only works because the patterns are semi-sequential. If you try to schedule random access, it becomes effectively impossible without hardware info. The kernel is the one with hardware info. > Finally, when I did the posix_fadvise work I wrote a synthetic > benchmark for testing the equivalent i/o pattern of a bitmap scan. It > let me simulate bitmap scans of varying densities with varying > parameters, notably how many i/o to keep in flight at once.
> supported posix_fadvise or aio. You should look it up in the archives,
> it made for some nice looking graphs. IIRC I could not find any build
> environment where aio offered any performance boost at all. I think
> this means I just didn't know how to build it against the right
> libraries or wasn't using the right kernel or there was some skew
> between them at the time.

If it's old, it's probable you didn't hit the kernel's braindeadness, since it was introduced somewhat recently (i.e., before 3.x, I believe). Even if you did hit it, bitmap heap scans are blessed with sequential ordering. The technique doesn't work nearly as well with random I/O that might be sorted from time to time.

When traversing an index, you get a mostly sequential pattern due to physical correlation, but not a completely sequential one. Not only that, but with the assumption of random I/O, and the uncertainty of when the scan will be aborted, you don't read ahead as much as you could if you knew it was sequential or a full scan. That kills performance. You don't fetch enough ahead of time to avoid stalls, and the kernel doesn't do read-ahead either, because posix_fadvise effectively disables it, resulting in the equivalent of direct I/O with bad scheduling.

Solving this for index scans isn't just a little more complex. It's insanely more complex, because you need hardware information to do it right. How many spindles, how many
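To make the in-order limitation above concrete, here is a minimal sketch of the posix_fadvise() prefetch pattern under discussion (BLCKSZ, the block list, and the prefetch depth are illustrative; error handling omitted). However many hints are in flight, the reads still have to be consumed in submission order:

    #include <fcntl.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    static void scan_blocks(int fd, const off_t *blocks, int nblocks,
                            char *buf, int depth)
    {
        /* prime the prefetch window */
        for (int i = 0; i < nblocks && i < depth; i++)
            posix_fadvise(fd, blocks[i] * BLCKSZ, BLCKSZ,
                          POSIX_FADV_WILLNEED);

        for (int i = 0; i < nblocks; i++)
        {
            /* keep "depth" hints in flight */
            if (i + depth < nblocks)
                posix_fadvise(fd, blocks[i + depth] * BLCKSZ, BLCKSZ,
                              POSIX_FADV_WILLNEED);

            /* but consumption is strictly in order: if this block hasn't
               arrived yet, we stall here even if later ones are ready */
            pread(fd, buf, BLCKSZ, blocks[i] * BLCKSZ);
        }
    }

With true AIO, the pread() in the loop could instead pick up whichever block happened to complete first.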
Re: [HACKERS] Minmax indexes
On Wed, Jun 18, 2014 at 8:51 AM, Heikki Linnakangas wrote:
>
> I liked Greg's sketch of what the opclass support functions would be. It
> doesn't seem significantly more complicated than what's in the patch now.

Which was

On Tue, Jun 17, 2014 at 8:48 PM, Greg Stark wrote:
> An aggregate to generate a min/max "bounding box" from several values
> A function which takes an "bounding box" and a new value and returns
> the new "bounding box"
> A function which tests if a value is in a "bounding box"
> A function which tests if a "bounding box" overlaps a "bounding box"

Which I'd generalize a bit further by renaming "bounding box" to "compressed set", and allowing it to be parameterized. So, you have:

An aggregate to generate a "compressed set" from several values
A function which adds a new value to the "compressed set" and returns the new "compressed set"
A function which tests if a value is in a "compressed set"
A function which tests if a "compressed set" overlaps another "compressed set" of equal type

If you can define different compressed sets, you can use this to generate both min/max indexes and bloom filter indexes. Whether we'd want to have both is perhaps questionable, but having the ability to is probably desirable.

One problem with such a generalized implementation would be that I'm not sure in-place modification of the "compressed set" on disk can be assumed to be safe in all cases. Surely it would be for strictly-enlarging sets, and while min/max and bloom filters both fit the bill, it's not clear that one can assume this for all structures. Adding an "in-place updateable" bit to the "type" would perhaps inflate the complexity of the patch due to the need to provide both code paths?

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
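As a rough illustration of that generalization, the support functions could be packaged like this (a sketch only; all names and signatures are invented, not an actual PostgreSQL API):

    #include "postgres.h"   /* for Datum and bool */

    /* A "compressed set" is opaque to the AM; only the opclass knows its
       layout (a min/max pair, bloom filter bits, etc.). */
    typedef struct CompressedSet CompressedSet;

    typedef struct CompressedSetOpclass
    {
        /* aggregate: build a set from several values */
        CompressedSet *(*build) (Datum *values, int nvalues);

        /* fold a new value into an existing set, returning the new set */
        CompressedSet *(*add_value) (CompressedSet *set, Datum value);

        /* may return false positives, never false negatives */
        bool (*maybe_contains) (const CompressedSet *set, Datum value);

        /* overlap test between two sets of the same type */
        bool (*overlaps) (const CompressedSet *a, const CompressedSet *b);

        /* whether in-place on-disk modification is safe for this type */
        bool updateable_in_place;
    } CompressedSetOpclass;

A min/max opclass would implement maybe_contains() as a range check and a bloom opclass as a hash-bit test, while updateable_in_place is the extra bit discussed above.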
Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
On Tue, Jun 24, 2014 at 12:08 PM, John Lumby wrote:
>> The question is, if you receive the notification of the I/O completion
>> using a signal or a thread, is it safe to release the lwlock from the
>> signal handler or a separate thread?
>
> In the forthcoming new version of the patch that uses sigevent,
> the originator locks a LWlock associated with that BAaiocb eXclusive,
> and, when signalled, in the signal handler it places that LWlock
> on a process-local queue of LWlocks awaiting release.
> (No, it cannot be safely released inside the signal handler or in a
> separate thread). Whenever the mainline passes a CHECK_FOR_INTERRUPTS macro
> and at a few additional points in bufmgr, the backend walks this
> process-local
> queue and releases those LWlocks. This is also done if the originator
> itself issues a ReadBuffer, which is the most frequent case in which it
> is released.

I suggest using a semaphore instead. Semaphores are meant to be incremented/decremented from multiple threads or processes already, so, in theory, the completion callback itself should be able to do it. The problem with the process-local queue is that it may take time to be processed (the time it takes to get to a CHECK_FOR_INTERRUPTS macro, which, as happened with regexes, can be quite long).

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
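A minimal sketch of the semaphore idea, assuming POSIX aio with SIGEV_THREAD notification (names are illustrative; an actual patch would presumably use PGSemaphore rather than a raw sem_t). sem_post() is async-signal-safe, so the same call would also be legal from a plain signal handler:

    #include <aio.h>
    #include <semaphore.h>
    #include <string.h>

    static sem_t io_done_sem;   /* at startup: sem_init(&io_done_sem, 0, 0) */

    /* runs on a librt notification thread when the read completes */
    static void io_complete_cb(union sigval sv)
    {
        (void) sv;
        sem_post(&io_done_sem);     /* safe from threads and signal handlers */
    }

    static int start_read(struct aiocb *cb, int fd, void *buf,
                          size_t len, off_t off)
    {
        memset(cb, 0, sizeof(*cb));
        cb->aio_fildes = fd;
        cb->aio_buf = buf;
        cb->aio_nbytes = len;
        cb->aio_offset = off;
        cb->aio_sigevent.sigev_notify = SIGEV_THREAD;
        cb->aio_sigevent.sigev_notify_function = io_complete_cb;
        return aio_read(cb);
    }

    /* Waiter side: sem_wait(&io_done_sem) blocks until some I/O is done,
       with no process-local queue to walk at CHECK_FOR_INTERRUPTS time. */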
Re: [HACKERS] NUMA packaging and patch
On Thu, Jun 26, 2014 at 11:18 AM, Kohei KaiGai wrote:
> One thing I concern is, it may conflict with numa-balancing
> features that is supported in the recent Linux kernel; that
> migrates physical pages according to the location of tasks
> which references the page beyond the numa zone.
> # I'm not sure whether it is applied on shared memory region.
> # Please correct me if I misunderstood. But it looks to me
> # physical page in shared memory is also moved.
> http://events.linuxfoundation.org/sites/events/files/slides/summit2014_riel_chegu_w_0340_automatic_numa_balancing_0.pdf

Sadly, it excludes the OS cache explicitly (when it mentions libc.so), which is one of the hottest sources of memory bandwidth consumption in a database.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Wed, Jul 9, 2014 at 6:16 PM, Alvaro Herrera wrote:
> Another thing I noticed is that version 8 of the patch blindly believed
> the "pages_per_range" declared in catalogs. This meant that if somebody
> did "alter index foo set pages_per_range=123" the index would
> immediately break (i.e. return corrupted results when queried). I have
> fixed this by storing the pages_per_range value used to construct the
> index in the metapage. Now if you do the ALTER INDEX thing, the new
> value is only used when the index is recreated by REINDEX.

This seems a lot like parameterizing. So I guess the only thing left is to issue a NOTICE when said ALTER takes place (I don't see that in the patch, but maybe it's there?).

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Thu, Jul 10, 2014 at 4:20 PM, Alvaro Herrera wrote:
> Claudio Freire wrote:
>> On Wed, Jul 9, 2014 at 6:16 PM, Alvaro Herrera
>> wrote:
>> > Another thing I noticed is that version 8 of the patch blindly believed
>> > the "pages_per_range" declared in catalogs. This meant that if somebody
>> > did "alter index foo set pages_per_range=123" the index would
>> > immediately break (i.e. return corrupted results when queried). I have
>> > fixed this by storing the pages_per_range value used to construct the
>> > index in the metapage. Now if you do the ALTER INDEX thing, the new
>> > value is only used when the index is recreated by REINDEX.
>>
>> This seems a lot like parameterizing.
>
> I don't understand what that means -- care to elaborate?

We've been talking about bloom filters, and how their shape differs according to the parameters of the bloom filter (number of hashes, hash type, etc.). But after seeing this case of pages_per_range, I noticed it's an effective-enough mechanism. Like:

CREATE INDEX ix_blah ON some_table USING bloom (somecol)
  WITH (BLOOM_HASHES=15, BLOOM_BUCKETS=1024, PAGES_PER_RANGE=64);

Marking them as read-only would be OK; so would emitting a NOTICE, so that anyone who changes a parameter that alters the shape of the index knows it needs a rebuild. Both mechanisms work for me.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Fri, Jul 11, 2014 at 3:47 PM, Greg Stark wrote:
> On Fri, Jul 11, 2014 at 6:00 PM, Claudio Freire
> wrote:
>> Marking as read-only is ok, or emitting a NOTICE so that if anyone
>> changes those parameters that change the shape of the index, they know
>> it needs a rebuild would be OK too. Both mechanisms work for me.
>
> We don't actually have any of these mechanisms. They wouldn't be bad
> things to have but I don't think we should gate adding new types of
> indexes on adding them. In particular, the index could just hard code
> a value for these parameters and having them be parameterized is
> clearly better even if that doesn't produce all the warnings or
> rebuild things automatically or whatever.

No, I agree, it's just a nice-to-have. But at least the docs should mention it.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Mon, Jul 14, 2014 at 2:53 PM, Peter Geoghegan wrote:
> My concern is that it won't be worth it to do the extra work,
> particularly given that I already have 8 bytes to work with. Supposing
> I only had 4 bytes to work with (as researchers writing [2] may have
> only had in 1994), that would leave me with a relatively small number
> of distinct normalized keys in many representative cases. For example,
> I'd have a mere 40,665 distinct normalized keys in the case of my
> "cities" database, rather than 243,782 (out of a set of 317,102 rows)
> for 8 bytes of storage. But if I double that to 16 bytes (which might
> be taken as a proxy for what a good compression scheme could get me),
> I only get a modest improvement - 273,795 distinct keys. To be fair,
> that's in no small part because there are only 275,330 distinct city
> names overall (and so most dups get away with a cheap memcmp() on
> their tie-breaker), but this is a reasonably organic, representative
> dataset.

Are those numbers measured with the Mac's strxfrm? That was the one with suboptimal entropy in the first 8 bytes.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
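For reference, the pattern whose entropy is being discussed is roughly this (a sketch only; the actual patch is far more careful about memory management and collations):

    #include <stdlib.h>
    #include <string.h>

    typedef struct
    {
        char        abbrev[8];   /* first 8 bytes of the strxfrm() blob */
        const char *full;        /* original string, for tie-breaking */
    } SortKey;

    static void make_key(SortKey *key, const char *s)
    {
        size_t  need = strxfrm(NULL, s, 0) + 1;   /* sizing call */
        char   *buf = malloc(need);

        strxfrm(buf, s, need);
        memset(key->abbrev, 0, sizeof(key->abbrev));
        memcpy(key->abbrev, buf, need - 1 < 8 ? need - 1 : 8);
        key->full = s;
        free(buf);
    }

    static int cmp_keys(const SortKey *a, const SortKey *b)
    {
        int r = memcmp(a->abbrev, b->abbrev, sizeof(a->abbrev));

        if (r != 0)
            return r;                      /* resolved by the abbreviation */
        return strcoll(a->full, b->full);  /* dup prefix: full tie-breaker */
    }

The count of distinct abbrev values is exactly the entropy question above: the more ties, the more expensive strcoll() tie-breakers get paid.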
Re: [HACKERS] Proposal: Incremental Backup
On Fri, Jul 25, 2014 at 10:14 AM, Marco Nenciarini wrote:
> 1. Proposal
> =
> Our proposal is to introduce the concept of a backup profile. The backup
> profile consists of a file with one line per file detailing tablespace,
> path, modification time, size and checksum.
> Using that file the BASE_BACKUP command can decide which file needs to
> be sent again and which is not changed. The algorithm should be very
> similar to rsync, but since our files are never bigger than 1 GB per
> file that is probably granular enough not to worry about copying parts
> of files, just whole files.

That wouldn't be nearly as useful as the LSN-based approach mentioned before.

I've had my share of rsyncing live databases (when resizing filesystems, not for backup, but the anecdotal evidence applies anyhow) and with moderately write-heavy databases, even if you only modify a tiny portion of the records, you end up modifying a huge portion of the segments, because the free space choice is random.

There have been patches going around to change the random nature of that choice, but none are very likely to make a huge difference for this application. In essence, file-level comparisons get you only a mild speed-up, and are not worth the effort.

I'd go for the hybrid file+LSN method, or nothing. The hybrid avoids the I/O of inspecting the LSN of entire segments (a necessary optimization for huge multi-TB databases) and backs up only the portions modified when segments do contain changes, so it's the best of both worlds. Any partial implementation would either require lots of I/O (LSN only) or save very little (file only) unless it's an almost read-only database.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
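A sketch of that hybrid selection logic, with invented types (the per-file max LSN is precisely the metadata this thread is debating how to maintain; per-page LSNs already exist in every page header):

    #include <stdint.h>

    typedef uint64_t XLogRecPtr;

    typedef struct
    {
        XLogRecPtr  file_max_lsn;  /* hypothetical per-file metadata */
        uint32_t    npages;
        XLogRecPtr *page_lsn;      /* per-page LSNs, read from the pages */
    } SegmentFile;

    /* Decide what one 1 GB segment contributes to the incremental backup. */
    static void backup_segment(const SegmentFile *seg, XLogRecPtr since,
                               void (*ship_page)(uint32_t pageno))
    {
        /* file-level filter: untouched segments cost no page I/O at all */
        if (seg->file_max_lsn <= since)
            return;

        /* block-level refinement: ship only pages newer than last backup */
        for (uint32_t i = 0; i < seg->npages; i++)
            if (seg->page_lsn[i] > since)
                ship_page(i);
    }

The LSN-only method is this loop without the early return (it reads every page of every segment); the file-only method is the early return with whole-file shipping instead of the loop.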
Re: [HACKERS] Proposal: Incremental Backup
On Fri, Jul 25, 2014 at 3:44 PM, Robert Haas wrote:
> On Fri, Jul 25, 2014 at 2:21 PM, Claudio Freire
> wrote:
>> On Fri, Jul 25, 2014 at 10:14 AM, Marco Nenciarini
>> wrote:
>>> 1. Proposal
>>> =
>>> Our proposal is to introduce the concept of a backup profile. The backup
>>> profile consists of a file with one line per file detailing tablespace,
>>> path, modification time, size and checksum.
>>> Using that file the BASE_BACKUP command can decide which file needs to
>>> be sent again and which is not changed. The algorithm should be very
>>> similar to rsync, but since our files are never bigger than 1 GB per
>>> file that is probably granular enough not to worry about copying parts
>>> of files, just whole files.
>>
>> That wouldn't nearly as useful as the LSN-based approach mentioned before.
>>
>> I've had my share of rsyncing live databases (when resizing
>> filesystems, not for backup, but the anecdotal evidence applies
>> anyhow) and with moderately write-heavy databases, even if you only
>> modify a tiny portion of the records, you end up modifying a huge
>> portion of the segments, because the free space choice is random.
>>
>> There have been patches going around to change the random nature of
>> that choice, but none are very likely to make a huge difference for
>> this application. In essence, file-level comparisons get you only a
>> mild speed-up, and are not worth the effort.
>>
>> I'd go for the hybrid file+lsn method, or nothing. The hybrid avoids
>> the I/O of inspecting the LSN of entire segments (necessary
>> optimization for huge multi-TB databases) and backups only the
>> portions modified when segments do contain changes, so it's the best
>> of both worlds. Any partial implementation would either require lots
>> of I/O (LSN only) or save very little (file only) unless it's an
>> almost read-only database.
>
> I agree with much of that. However, I'd question whether we can
> really seriously expect to rely on file modification times for
> critical data-integrity operations. I wouldn't like it if somebody
> ran ntpdate to fix the time while the base backup was running, and it
> set the time backward, and the next differential backup consequently
> omitted some blocks that had been modified during the base backup.

I was thinking the same. But that timestamp could be saved in the file itself, or in some other catalog, like a "trusted metadata" store implemented by pg itself, and it could be an LSN range instead of a timestamp, really.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Incremental Backup
On Fri, Jul 25, 2014 at 7:38 PM, Josh Berkus wrote:
> On 07/25/2014 11:49 AM, Claudio Freire wrote:
>>> I agree with much of that. However, I'd question whether we can
>>> > really seriously expect to rely on file modification times for
>>> > critical data-integrity operations. I wouldn't like it if somebody
>>> > ran ntpdate to fix the time while the base backup was running, and it
>>> > set the time backward, and the next differential backup consequently
>>> > omitted some blocks that had been modified during the base backup.
>> I was thinking the same. But that timestamp could be saved on the file
>> itself, or some other catalog, like a "trusted metadata" implemented
>> by pg itself, and it could be an LSN range instead of a timestamp
>> really.
>
> What about requiring checksums to be on instead, and checking the
> file-level checksums? Hmmm, wait, do we have file-level checksums? Or
> just page-level?

It would be very computationally expensive to have up-to-date file-level checksums, so I highly doubt it.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Incremental Backup
On Tue, Jul 29, 2014 at 1:24 PM, Marco Nenciarini wrote:
>> On Fri, Jul 25, 2014 at 10:14 AM, Marco Nenciarini
>> wrote:
>>> 1. Proposal
>>> =
>>> Our proposal is to introduce the concept of a backup profile. The backup
>>> profile consists of a file with one line per file detailing tablespace,
>>> path, modification time, size and checksum.
>>> Using that file the BASE_BACKUP command can decide which file needs to
>>> be sent again and which is not changed. The algorithm should be very
>>> similar to rsync, but since our files are never bigger than 1 GB per
>>> file that is probably granular enough not to worry about copying parts
>>> of files, just whole files.
>>
>> That wouldn't nearly as useful as the LSN-based approach mentioned before.
>>
>> I've had my share of rsyncing live databases (when resizing
>> filesystems, not for backup, but the anecdotal evidence applies
>> anyhow) and with moderately write-heavy databases, even if you only
>> modify a tiny portion of the records, you end up modifying a huge
>> portion of the segments, because the free space choice is random.
>>
>> There have been patches going around to change the random nature of
>> that choice, but none are very likely to make a huge difference for
>> this application. In essence, file-level comparisons get you only a
>> mild speed-up, and are not worth the effort.
>>
>> I'd go for the hybrid file+lsn method, or nothing. The hybrid avoids
>> the I/O of inspecting the LSN of entire segments (necessary
>> optimization for huge multi-TB databases) and backups only the
>> portions modified when segments do contain changes, so it's the best
>> of both worlds. Any partial implementation would either require lots
>> of I/O (LSN only) or save very little (file only) unless it's an
>> almost read-only database.
>
> From my experience, if a database is big enough and there is any kind of
> historical data in the database, the "file only" approach works well.
> Moreover it has the advantage of being simple and easily verifiable.

I don't see how that would be true if it's not full of read-only or append-only tables.

Furthermore, even in that case, you need to have the database locked while performing the file-level backup, and computing all the checksums means processing the whole thing. That's a huge amount of time to be locked for multi-TB databases, so how is that good enough?

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance issue in pg_dump's dependency loop searching
On Tue, Jul 29, 2014 at 3:06 PM, Tom Lane wrote:
> Simon Riggs writes:
>> On 25 July 2014 20:47, Tom Lane wrote:
>>> Another idea would be to
>
>> ...persist the optimal dump order in the database.
>
>> That way we can maintain the correct dump order each time we do DDL,
>> which is only a small incremental cost, no matter how many objects we
>> have.
>
> I don't see any obvious way to make it incremental; so I doubt that
> it would be a small extra cost. In any case I disagree that making DDL
> slower to make pg_dump faster is a good tradeoff. Many people seldom
> or never use pg_dump.
>
> regards, tom lane

Not to mention slowing down temp tables.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Incremental Backup
On Thu, Jul 31, 2014 at 5:26 AM, desmodemone wrote:
> b) yes the backends need to update the map, but it's in memory, and as I
> show, could be very small if we you chunk of blocks. If we not compress the
> map, I not think could be a bottleneck.

If it's in memory, it's not crash-safe. For something aimed at backups, I think crash safety is a requirement.

So it's at least one extra I/O per commit, maybe less if many can be coalesced at checkpoints, but I wouldn't count on it too much, because worst cases are easy to come by (sparse enough updates). I think this could be pegged on WAL replay / checkpoint stuff alone, so it would be very asynchronous, but not free.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Incremental Backup
On Fri, Aug 1, 2014 at 12:35 AM, Amit Kapila wrote:
>> c) the map is not crash safe by design, because it needs only for
>> incremental backup to track what blocks needs to be backuped, not for
>> consistency or recovery of the whole cluster, so it's not an heavy cost for
>> the whole cluster to maintain it. we could think an option (but it's heavy)
>> to write it at every flush on file to have crash-safe map, but I not think
>> it's so usefull . I think it's acceptable, and probably it's better to force
>> that, to say: "if your db will crash, you need a fullbackup ",
>
> I am not sure if your this assumption is right/acceptable, how can
> we say that in such a case users will be okay to have a fullbackup?
> In general, taking fullbackup is very heavy operation and we should
> try to avoid such a situation.

Besides, the one taking the backup (i.e., a script) may not be aware of the need to take a full one.

It's a bad design to allow broken backups at all, IMNSHO.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Incremental Backup
On Fri, Aug 1, 2014 at 1:43 PM, desmodemone wrote:
>
> 2014-08-01 18:20 GMT+02:00 Claudio Freire :
>
>> On Fri, Aug 1, 2014 at 12:35 AM, Amit Kapila
>> wrote:
>> >> c) the map is not crash safe by design, because it needs only for
>> >> incremental backup to track what blocks needs to be backuped, not for
>> >> consistency or recovery of the whole cluster, so it's not an heavy cost
>> >> for
>> >> the whole cluster to maintain it. we could think an option (but it's
>> >> heavy)
>> >> to write it at every flush on file to have crash-safe map, but I not
>> >> think
>> >> it's so usefull . I think it's acceptable, and probably it's better to
>> >> force
>> >> that, to say: "if your db will crash, you need a fullbackup ",
>> >
>> > I am not sure if your this assumption is right/acceptable, how can
>> > we say that in such a case users will be okay to have a fullbackup?
>> > In general, taking fullbackup is very heavy operation and we should
>> > try to avoid such a situation.
>>
>> Besides, the one taking the backup (ie: script) may not be aware of
>> the need to take a full one.
>>
>> It's a bad design to allow broken backups at all, IMNSHO.
>
> Hi Claudio,
> thanks for your observation
> First: the case it's after a crash of a database, and it's not something
> happens every day or every week. It's something that happens in rare
> conditions, or almost my experience is so. If it happens very often probably
> there are other problems.

Not so much. The problem is that the design isn't software-crash safe, not just that it isn't hardware-crash safe. What I mean is that an in-memory bitmap will also be out of sync if you kill -9 (or if one of the backends is killed by the OOM killer), or if the server runs out of disk space. Normally, a simple restart fixes it, because pg will do crash recovery just fine, but now the bitmap is out of sync, and further backups are broken. It's not a situation I want to face unless there's a huge reason to go for such a design.

If you make it so that commit includes flipping the bitmap, it can be done cleverly enough to avoid too much overhead (though it will have some), and then any to-be-touched block is guaranteed to be part of the next backup. You just apply all the bitmap changes in batch after a checkpoint, before syncing to disk, and before erasing the WAL segments (see the sketch after this message). Simple, relatively efficient, and far more robust than an in-memory thing.

Still, it *can* double checkpoint I/O in the worst case, and that's not an unfathomable case either.

> Second: to avoid the problem to know if the db needed to have a full backup
> to rebuild the map we could think to write in the map header the backup
> reference (with an id and LSN reference for example ) so if the
> someone/something try to do an incremental backup after a crash, the map
> header will not have noone full backup listed [because it will be empty] ,
> and automaticcaly switch to a full one. I think after a crash it's a good
> practice to do a full backup, to see if there are some problems on files or
> on filesystems, but if I am wrong I am happy to know :) .

After a crash I do not take a backup; I do a verification of the data first (usually VACUUM and some data consistency checks), lest the backup be useless. The backup comes after that. But I'm no DBA guru.
> Remember that I propose a map in ram to reduce the impact on performances,
> but we could create an option to leave the choose to the user, if you want a
> crash safe map, at every flush will be updated also a map file , if not, the
> map will be in ram.

I think the performance impact of a WAL-linked map isn't so big as to justify the possibility of broken backups. I wouldn't even allow it. Making the map crash safe isn't free, but it's not that expensive either.

If you want to support incremental backups, you really, really need to make sure those backups are correct and usable, and IMV anything short of full crash safety will be too fragile for that purpose. I don't want to be in a position of needing the backup and finding out it's inconsistent after the fact, and I don't want to encourage people to set themselves up for that by adding that "faster but unsafe backups" flag.

I'd do it either safe, or not at all.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
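For illustration, the crash-safe variant described above could fold pending bits into a persistent map at checkpoint time, before WAL recycling, along these lines (entirely hypothetical; a real implementation would live in the checkpointer and use proper locking):

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    #define MAP_BYTES (1024 * 1024)     /* 1 bit per block group */

    static uint8_t pending[MAP_BYTES];  /* shared memory, set by backends */
    static uint8_t ondisk[MAP_BYTES];   /* mirror of the persistent map */

    /* backend side: cheap flag flip when a block group is touched */
    static void mark_block_group(uint32_t group)
    {
        pending[group >> 3] |= (uint8_t) (1 << (group & 7));
    }

    /* checkpoint side: fold and fsync BEFORE recycling WAL */
    static void checkpoint_flush_map(int fd)
    {
        for (size_t i = 0; i < MAP_BYTES; i++)
            ondisk[i] |= pending[i];
        pwrite(fd, ondisk, sizeof(ondisk), 0);
        fsync(fd);

        /* only after the map is durable is it safe to reset the pending
           bits; this ordering is what keeps a crash from ever losing a
           "dirty" bit that the next incremental backup depends on */
        memset(pending, 0, sizeof(pending));
    }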
Re: [HACKERS] Proposal: Incremental Backup
On Mon, Aug 4, 2014 at 5:15 AM, Gabriele Bartolini wrote:
> I really like the proposal of working on a block level incremental
> backup feature and the idea of considering LSN. However, I'd suggest
> to see block level as a second step and a goal to keep in mind while
> working on the first step. I believe that file-level incremental
> backup will bring a lot of benefits to our community and users anyway.

Thing is, I don't see how the LSN method is that much harder than an on-disk bitmap. An in-memory bitmap, IMO, is just a recipe for disaster.

Keeping a last-updated-LSN for each segment (or group of blocks) is just as easy as keeping a bitmap, and far more flexible and robust.

The complexity and cost of safely keeping the map up-to-date is what's in question here, but as was pointed out before, there's no really safe alternative. Neither modification times nor checksums (nor in-memory bitmaps, IMV) are really safe enough for backups, so you really want to use something like the LSN. It's extra work, but it opens up a world of possibilities.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
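Sketched out, such a map is barely more complex than the bitmap (layout hypothetical); note that unlike a bitmap it never needs to be reset, because each backup simply remembers the LSN it was taken at:

    #include <stdint.h>

    typedef uint64_t XLogRecPtr;

    #define GROUPS_PER_SEGMENT 1024   /* granularity: 1/1024 of a segment */

    typedef struct
    {
        XLogRecPtr last_lsn[GROUPS_PER_SEGMENT];
    } SegmentLsnMap;

    /* maintenance: called when a block in this group is WAL-logged */
    static void lsnmap_update(SegmentLsnMap *map, int group, XLogRecPtr lsn)
    {
        if (map->last_lsn[group] < lsn)
            map->last_lsn[group] = lsn;
    }

    /* backup: ship a group iff it changed after the reference LSN */
    static int lsnmap_changed(const SegmentLsnMap *map, int group,
                              XLogRecPtr since)
    {
        return map->last_lsn[group] > since;
    }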
Re: [HACKERS] Proposal: Incremental Backup
On Tue, Aug 5, 2014 at 3:23 PM, Simon Riggs wrote:
> On 4 August 2014 19:30, Claudio Freire wrote:
>> On Mon, Aug 4, 2014 at 5:15 AM, Gabriele Bartolini
>> wrote:
>>> I really like the proposal of working on a block level incremental
>>> backup feature and the idea of considering LSN. However, I'd suggest
>>> to see block level as a second step and a goal to keep in mind while
>>> working on the first step. I believe that file-level incremental
>>> backup will bring a lot of benefits to our community and users anyway.
>>
>> Thing is, I don't see how the LSN method is that much harder than an
>> on-disk bitmap. In-memory bitmap IMO is just a recipe for disaster.
>>
>> Keeping a last-updated-LSN for each segment (or group of blocks) is
>> just as easy as keeping a bitmap, and far more flexible and robust.
>>
>> The complexity and cost of safely keeping the map up-to-date is what's
>> in question here, but as was pointed before, there's no really safe
>> alternative. Nor modification times nor checksums (nor in-memory
>> bitmaps IMV) are really safe enough for backups, so you really want to
>> use something like the LSN. It's extra work, but opens up a world of
>> possibilities.
>
> OK, some comments on all of this.
>
> * Wikipedia thinks the style of backup envisaged should be called
> "Incremental"
> https://en.wikipedia.org/wiki/Differential_backup
>
> * Base backups are worthless without WAL right up to the *last* LSN
> seen during the backup, which is why pg_stop_backup() returns an LSN.
> This is the LSN that is the effective viewpoint of the whole base
> backup. So if we wish to get all changes since the last backup, we
> must re-quote this LSN. (Or put another way - file level LSNs don't
> make sense - we just need one LSN for the whole backup).

File-level LSNs are an optimization. When you want to back up all files modified since the last base or incremental backup (yes, you need the previous backup label at least), you check the file-level LSN range. That tells you which "changesets" touched that file, so you know whether to process it or not. Block-level LSNs (or, rather, block-segment-level ones) are just a refinement of that.

> * When we take an incremental backup we need the WAL from the backup
> start LSN through to the backup stop LSN. We do not need the WAL
> between the last backup stop LSN and the new incremental start LSN.
> That is a huge amount of WAL in many cases and we'd like to avoid
> that, I would imagine. (So the space savings aren't just the delta
> from the main data files, we should also look at WAL savings).

Yes, probably something along the lines of removing redundant FPWs and stuff like that.

> * For me, file based incremental is a useful and robust feature.
> Block-level incremental is possible, but requires either significant
> persistent metadata (1 MB per GB file) or access to the original
> backup. One important objective here is to make sure we do NOT have to
> re-read the last backup when taking the next backup; this helps us to
> optimize the storage costs for backups. Plus, block-level recovery
> requires us to have a program that correctly re-writes data into the
> correct locations in a file, which seems likely to be a slow and bug
> ridden process to me. Nice, safe, solid file-level incremental backup
> first please. Fancy, bug prone, block-level stuff much later.

OK. You could do incremental first without any kind of optimization, then file-level optimization by keeping a file-level LSN range, and then extend that to block-segment-level LSN ranges.
That sounds like a plan to me. But I don't see how you'd do the unoptimized version without reading the previous backup to compare deltas. Remember, checksums are deemed not trustworthy (and not just by me), so that approach (which was the original proposition) doesn't work.

> * If we don't want/have file checksums, then we don't need a profile
> file and using just the LSN seems fine. I don't think we should
> specify that manually - the correct LSN is written to the backup_label
> file in a base backup and we should read it back from there.

Agreed.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
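For instance, the reference LSN can be recovered from backup_label with something as simple as this (the "START WAL LOCATION" line is part of the actual backup_label format; error handling elided):

    #include <stdio.h>
    #include <stdint.h>

    static uint64_t read_start_lsn(const char *path)
    {
        FILE        *f = fopen(path, "r");
        char         line[256];
        unsigned int hi = 0, lo = 0;

        while (fgets(line, sizeof(line), f))
            if (sscanf(line, "START WAL LOCATION: %X/%X", &hi, &lo) == 2)
                break;
        fclose(f);
        return ((uint64_t) hi << 32) | lo;
    }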