Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple
On Thu, Nov 9, 2017 at 4:53 PM, Andres Freund wrote:
> Primarily because it's not an anti-corruption tool. I'd be surprised if
> there weren't ways to corrupt the page using these corruptions that
> aren't detected by it.

It's very hard to assess the risk of missing something that's actually detectable with total confidence, but I think that the check is very thorough.

> But even if it were, I don't think there's enough information to do so
> in the general case. You very well can end up with pages where
> subsequent hot pruning has removed a good bit of the direct evidence of
> this bug.

Sure, but maybe those are cases that can't get any worse anyway, so the question of avoiding making them worse doesn't arise.

> But I'm not really sure why the error detection capabilities matter
> much for the principal point I raised, which is how much work we need
> to do to not further worsen the corruption.

You're right. I'm just trying to put the risk in context, and to understand the extent of the concern that you have.

--
Peter Geoghegan
Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple
On Thu, Nov 9, 2017 at 4:17 PM, Andres Freund wrote:
>> Actually, on second thought, I take that back -- I don't think that
>> REINDEXing will even finish once a HOT chain is broken by the bug.
>> IndexBuildHeapScan() actually does quite a good job of making sure
>> that HOT chains are sane, which is how the enhanced amcheck notices
>> the bug here in practice.
>
> I think that's too optimistic.

Why? Because the "find the TID of the root" logic in IndexBuildHeapScan()/heap_get_root_tuples() won't reliably find the actual root (it might be some other HOT chain root following TID recycling by VACUUM)?

Assuming that's what you meant: I would have thought that the xmin/xmax matching within heap_get_root_tuples() makes the sanity checking fairly reliable in practice.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple
On Thu, Nov 9, 2017 at 4:17 PM, Andres Freund wrote:
>> I don't follow you here. Why would REINDEXing make the rows that
>> should be dead disappear again, even for a short period of time?
>
> It's not the REINDEX that makes them reappear.

Of course. I was just trying to make sense of what you said.

> It's the second vacuum. The reindex part was about $user trying to fix
> the problem... As you need two vacuums with appropriate cutoffs to hit
> the "rows revive" problem, that'll often in practice not happen
> immediately.

This explanation clears things up, though.

--
Peter Geoghegan
Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple
On Thu, Nov 9, 2017 at 2:24 PM, Andres Freund wrote:
> Attached is a version of the already existing regression test that both
> reproduces the broken hot chain (and thus failing index lookups) and
> then also the tuple reviving. I don't see any need for letting this run
> with arbitrary permutations.

I thought that the use of every possible permutation was excessive, myself. It left us with an isolation test that didn't precisely describe the behavior being tested. What you came up with seems far, far better, especially because of the comments you included. The mail message-id references seem to add a lot, too.

> What I'm currently wondering about is how much we need to harden
> postgres against such existing corruption. If e.g. the hot chains are
> broken somebody might have reindexed thinking the problem is fixed - but
> if they then later vacuum everything goes to shit again, with dead rows
> reappearing.

I don't follow you here. Why would REINDEXing make the rows that should be dead disappear again, even for a short period of time? It might do so for index scans, I suppose, but not for sequential scans. Are you concerned about a risk of somebody not noticing that sequential scans are still broken?

Actually, on second thought, I take that back -- I don't think that REINDEXing will even finish once a HOT chain is broken by the bug. IndexBuildHeapScan() actually does quite a good job of making sure that HOT chains are sane, which is how the enhanced amcheck notices the bug here in practice.

(Before this bug was discovered, I would have expected amcheck to catch problems like it slightly later, during the Bloom filter probe for that HOT chain... but, in fact, it never gets there with corruption from this bug in practice, AFAIK.)

--
Peter Geoghegan
Re: [HACKERS] pageinspect option to forgo buffer locking?
On Thu, Nov 9, 2017 at 9:49 AM, Andres Freund wrote:
> Currently the locking in get_raw_page_internal() prevents that. If it's
> an option defaulting to off, I don't see why we couldn't allow that to
> skip locking the page's contents. Obviously you can get corrupted
> contents that way, but we already allow to pass arbitrary stuff to
> heap_page_items(). Since pinning wouldn't be changed, there's no danger
> of the page being moved out from under us.

+1. I've done things like this before myself.

--
Peter Geoghegan
Re: [HACKERS] Small improvement to compactify_tuples
On Wed, Nov 8, 2017 at 12:59 PM, Andres Freund wrote:
> I complained about multiple related things, I'm not exactly sure what
> exactly you're referring to here:
>
> - The fact that HeapTupleHeaderData's are commonly iterated over in
>   reverse order is bad for performance. For shared buffers resident
>   workloads involving seqscans that yields 15-25% slowdowns for me. It's
>   trivial to fix that by just changing iteration order, but that
>   obviously changes results. But we could reorder the page during heap
>   pruning.

FWIW, the classic page layout (the one that appears in Gray's Transaction Processing Systems, at any rate) has the ItemId array at the end of the page and the tuples at the start (immediately after a generic page header) -- it's the other way around. I think that that has its pros and cons.

> - The layout of items in index pages is suboptimal. We regularly do
>   binary searches over the linearly ordered items, which is cache
>   inefficient. So instead we should sort items as [1/2, 1/4, 3/4, ...]
>   elements, which will access items in a close-ish to linear manner.

I still think that we can repurpose each ItemId's lp_len as an abbreviated key in internal index pages [1], and always get IndexTuple size through the index tuple header. I actually got as far as writing a very rough prototype of that. That's obviously a significant project, but it seems doable.

[1] https://www.postgresql.org/message-id/CAH2-Wz=mv4dmoapficrsyntv2kinxeotbwuy5r7fxxoc-oe...@mail.gmail.com

--
Peter Geoghegan
Re: [HACKERS] Small improvement to compactify_tuples
On Wed, Nov 8, 2017 at 8:19 AM, Robert Haas wrote:
> I don't remember any more just how much faster qsort_tuple() and
> qsort_ssup() are than plain qsort(), but it was significant enough to
> convince me to commit 337b6f5ecf05b21b5e997986884d097d60e4e3d0...

IIRC, qsort_ssup() was about 20% faster at the time, while qsort_tuple() was 5% - 10% faster.

--
Peter Geoghegan
Re: [HACKERS] MERGE SQL Statement for PG11
On Tue, Nov 7, 2017 at 3:29 PM, Nico Williams wrote:
> On Thu, Nov 02, 2017 at 03:25:48PM -0700, Peter Geoghegan wrote:
>> Nico Williams wrote:
>>> A MERGE mapped to a DML like this:
>
> I needed to spend more time reading MERGE docs from other RDBMSes.

Please don't hijack this thread. It's about the basic question of semantics, and is already hard enough for others to follow as-is.

--
Peter Geoghegan
Re: [HACKERS] Small improvement to compactify_tuples
On Tue, Nov 7, 2017 at 2:40 PM, Юрий Соколов wrote:
>> The same is true of unique indexes vs. non-unique.
>
> offtopic: recently I had a look at setting LP_DEAD in indexes.
> I didn't find a huge difference between unique and non-unique indices.
> There is a codepath that works only for unique, but it is called less
> frequently than the common codepath that also sets LP_DEAD.

I meant to say that this is only important with UPDATEs + contention. The extra LP_DEAD setting within _bt_check_unique() makes quite a noticeable difference, at least in terms of index bloat (though less so in terms of raw TPS).

--
Peter Geoghegan
Re: [HACKERS] Small improvement to compactify_tuples
On Tue, Nov 7, 2017 at 2:36 PM, Tom Lane wrote:
> Peter Geoghegan writes:
>> My point is only that it's worth considering that this factor affects
>> how representative your sympathetic case is. It's not clear how many
>> PageIndexMultiDelete() calls are from opportunistic calls to
>> _bt_vacuum_one_page(), how important that subset of calls is, and so
>> on. Maybe it doesn't matter at all.
>
> According to the perf measurements I took earlier, essentially all the
> compactify_tuple calls in this test case are from PageRepairFragmentation
> (from heap_page_prune), not PageIndexMultiDelete.

For a workload with high contention (e.g., lots of updates that follow a Zipfian distribution), lots of important cleanup has to occur within _bt_vacuum_one_page(), with an exclusive buffer lock held. It may be that making PageIndexMultiDelete() faster pays off disproportionately well there, but I'd only expect to see that at higher client counts with lots of contention -- workloads that we still do quite badly on (note that we have never done well here, even prior to commit 2ed5b87f9 -- Yura showed this at one point).

It's possible that this work influenced Yura in some way. When Postgres Pro did some benchmarking of this at my request, we saw that the bloat got really bad past a certain client count. IIRC there was a clear point at around 32 or 64 clients where TPS nosedived, presumably because cleanup could not keep up. This was a 128 core box, or something like that, so you'll probably have difficulty recreating it with what's at hand.

--
Peter Geoghegan
Re: [HACKERS] Small improvement to compactify_tuples
On Tue, Nov 7, 2017 at 1:39 PM, Tom Lane wrote:
> So I think we should seriously consider the attached, but it'd be a
> good idea to benchmark it on a wider variety of platforms and test
> cases.
>
> create unlogged table test3 (
>     id integer PRIMARY KEY with (fillfactor=85),
>     val text
> ) WITH (fillfactor=85);

Passing observation: Unlogged table B-Tree indexes have a much greater tendency for LP_DEAD setting/kill_prior_tuple() working out following commit 2ed5b87f9 [1], because unlogged tables were unaffected by that commit. (I've been meaning to follow up with my analysis of that regression, actually.) The same is true of unique indexes vs. non-unique.

There are workloads where the opportunistic LP_DEAD setting performed by _bt_check_unique() is really important (it calls ItemIdMarkDead()). Think high contention workloads, like when Postgres is used to implement a queue table.

My point is only that it's worth considering that this factor affects how representative your sympathetic case is. It's not clear how many PageIndexMultiDelete() calls are from opportunistic calls to _bt_vacuum_one_page(), how important that subset of calls is, and so on. Maybe it doesn't matter at all.

[1] https://postgr.es/m/cah2-wzmyry7mnjf0gw5wtk3cszh3gqfhhoxvsyuno5pk8cu...@mail.gmail.com

--
Peter Geoghegan
Re: [HACKERS] Parallel Hash take II
On Tue, Nov 7, 2017 at 1:01 PM, Andres Freund wrote:
> +/*
> + * Build the name for a given segment of a given BufFile.
> + */
> +static void
> +MakeSharedSegmentName(char *name, const char *buffile_name, int segment)
> +{
> +	snprintf(name, MAXPGPATH, "%s.%d", buffile_name, segment);
> +}
>
> Not a fan of this name - you're not "making" a filename here (as in
> allocating or such). I think I'd just remove the Make prefix.

+1.

Can we document the theory behind file naming here, if that isn't in the latest version? This is a routine private to parallel hash join (or shared tuplestore), not BufFile. Maybe BufFile should have some opinion on this, though. Just as a matter of style.

> +/*
> + * Delete a BufFile that was created by BufFileCreateShared in the given
> + * SharedFileSet using the given name.
> + *
> + * It is not necessary to delete files explicitly with this function. It is
> + * provided only as a way to delete files proactively, rather than waiting for
> + * the SharedFileSet to be cleaned up.
> + *
> + * Only one backend should attempt to delete a given name, and should know
> + * that it exists and has been exported or closed.
> + */

This part is new to me. We now want one backend to delete a given filename. What changed? Please provide a Message-Id reference if that will help me to understand. For now, I'm going to guess that this development had something to do with the need to deal with virtual FDs that do a close() on an FD to keep under backend limits. Do I have that right?

> +	/*
> +	 * We don't set FD_DELETE_AT_CLOSE for files opened this way, but we still
> +	 * want to make sure they get closed at end of xact.
> +	 */
> +	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
> +	ResourceOwnerRememberFile(CurrentResourceOwner, file);
> +	VfdCache[file].resowner = CurrentResourceOwner;
>
> So maybe I'm being pedantic here, but wouldn't the right order be to do
> ResourceOwnerEnlargeFiles() *before* creating the file? It's a memory
> allocating operation, so it can fail, which'd leak the file.

I remember going to pains to get this right with my own unifiable BufFile concept. I'm going to wait for an answer to my question about file deletion + resowners before commenting further, though.

> +	if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
> +	{
> +		/* Subtract its size from current usage (do first in case of error) */
> +		temporary_files_size -= vfdP->fileSize;
> +		vfdP->fileSize = 0;
> +	}
>
> So, is it right to do so unconditionally and without regard for errors?
> If the file isn't deleted, it shouldn't be subtracted from fileSize. I
> guess you're managing that through the flag, but that's not entirely
> obvious.

I think that the problem here is that the accounting is expected to always work. It's not like there is a resowner-style error path in which temporary_files_size gets reset to 0.

> Is that entirely unproblematic? Are there any DSM callbacks that rely on
> locks still being held? Please split this part into a separate commit
> with such analysis.

I was always confused about the proper use of DSM callbacks myself. They seemed generally underdocumented.

> +	/* Create the output buffer. */
> +	if (accessor->write_chunk != NULL)
> +		pfree(accessor->write_chunk);
> +	accessor->write_chunk = (SharedTuplestoreChunk *)
> +		palloc0(accessor->write_pages * BLCKSZ);
>
> Are we guaranteed to be in a long-lived memory context here?

I imagine that Thomas looked to tuplestore_begin_heap() + interXact as a kind of precedent here. See comments above that function.

--
Peter Geoghegan
Re: [HACKERS] MERGE SQL Statement for PG11
Simon Riggs wrote:
> In step 3 we discover that an entry exists in the index for a committed
> row. Since we have a unique index we use it to locate the row we know
> exists and UPDATE that. We don't use a new MVCC snapshot, we do what
> EPQ does. EPQ is already violating MVCC for UPDATEs, so why does it
> matter if we do it for INSERTs also?

Before I go on to say why I think that this approach is problematic, I want to point out a few things that I think we actually agree on:

* EPQ is fairly arbitrary as a behavior for READ COMMITTED UPDATE conflict handling. It has more to do with how VACUUM works than with some platonic ideal that everyone agrees on.

* We can imagine other alternatives, such as the behavior in Oracle (statement-level rollback + optimistic retry).

* Those alternatives are probably better in some ways but worse in other ways.

* EPQ violates snapshot consistency, even though that's not inherently necessary to avoid "READ COMMITTED serialization errors".

* ON CONFLICT also violates snapshot consistency, in rather a different way. (Whether or not this is necessary is more debatable.)

I actually think that other MVCC systems don't copy Oracle here, either, and for similar pragmatic reasons. It's a mixed bag.

Where does the problem hide? The problem is that violating MVCC is something that can be done in different ways, and by meaningful degrees:

* EPQ semantics are believed to be fine because we don't get complaints about them. I think that that's because they're specialized to UPDATEs and UPDATE-like operations, where we walk an UPDATE chain specifically, and only use a dirty snapshot for the chain's newer tuples.

* ON CONFLICT doesn't care about UPDATE chains. Unlike EPQ, it makes no distinction between a concurrent UPDATE and a concurrent DELETE + fresh INSERT. It's specialized to CONFLICTs.

This might seem abstract, but it has real, practical implications. Certain contradictions arise when you start with MVCC semantics, then fall back to EPQ semantics, then finally fall back to ON CONFLICT semantics. Questions about mixing these two things:

* What do we do if someone concurrently UPDATEs in a way that makes the qual not pass during EPQ traversal? Should we INSERT when that happens?

* If so, what about the case when the MERGE join qual/unique index values didn't change (just some other attributes that do not pass the additional WHEN MATCHED qual)?

* What about when there was a concurrent DELETE -- should we INSERT then?

ON CONFLICT starts from a CONFLICT, and then applies its own qual. That's hugely different from doing it the other way around: starting from your own MVCC snapshot qual, and going to a CONFLICT. This is because evaluating the DO UPDATE's WHERE clause is just one little extra step after the one and only latest row for that value has been locked. You could theoretically go this way with 2PL, I think, because that's a bit like locking every row that the predicate touches, but of course that isn't at all practical.

I should stop trying to make a watertight case against this, even though I still think that's possible. For now, instead, I'll just say that this is *extremely* complicated, and still has unresolved questions about semantics.

--
Peter Geoghegan
Re: [HACKERS] MERGE SQL Statement for PG11
Simon Riggs wrote:
> APPROACH1
> 1. Join to produce results based upon snapshot at start of query
> 2. Apply results for INSERT, UPDATE or DELETE
>
> Such failures are of great concern in practice because the time between
> 1 and 2 could be very long for large statements, or for smaller
> statements we might have sufficiently high concurrency to allow us to
> see regular failures.

I'm not sure that they're a *great* concern in a world with something that targets UPSERT use cases, which is a situation that does not exist in DBMSs with MERGE (with the notable exception of Teradata). But it's clearly a concern that users may expect to avoid duplicate violations in READ COMMITTED, since this caused confusion among users of other database systems with MERGE.

> APPROACH2 (modified from my original proposal slightly)

This write-up actually begins to confront the issues that I've raised. I'm glad to see this.

> 1. Join...
> 2. Apply results for UPDATE, if present not visible via the snapshot
>    taken at 1, do EPQ to ensure we locate current live tuple
> 3. If still not visible, do speculative insertion if we have a unique
>    index available, otherwise ERROR. If spec insertion fails, go to 2
>
> The loop created above can live-lock, meaning that an infinite loop
> could be created.

The loop is *guaranteed* to live-lock once you "goto 2". So you might as well just throw an error at that point, which is the behavior that I've been arguing for all along! If this isn't guaranteed to live-lock at "goto 2", then it's not clear why. The outcome of step 2 is clearly going to be identical if you don't acquire a new MVCC snapshot, but you don't address that.

You might have meant "apply an equivalent ON CONFLICT DO UPDATE", or something like that, despite the fact that the use of ON CONFLICT DO NOTHING was clearly implied by the "goto 2". I also see problems with that, but I'll wait for you to clarify what you meant before going into what they are.

> In practice, such live-locks are rare and we could detect them by
> falling out of the loop after a few tries. Approach2's purpose is to
> alleviate errors in Approach1, so falling out of the loop merely takes
> us back to the error we would have got if we didn't try, so Approach2
> has considerable benefit over Approach1.

I don't hate the idea of retrying a fixed number of times for things like this, but I don't like it either. I'm going to assume that it's fine for now.

> I read that step 3 in Approach2 is some kind of problem in MVCC
> semantics. My understanding is that SQL Standard allows us to define
> what the semantics of the statement are in relation to concurrency, so
> any semantic issue can be handled by defining it to work the way we
> want.

My only concern is that our choices here should be good ones, based on practical considerations. We both more or less agree on how this should be assessed, I think; we just reach different conclusions.

> As you point out, whichever we choose, we will be bound by those
> semantics. So if we take Approach1, as has been indicated currently,
> what is the written explanation for that, so we can show that to the
> people who ask in the future about our decisions?

Well, Approach1 is what other systems implement. I think that it would be important to point out that MERGE with Approach1 isn't special, but ON CONFLICT DO UPDATE is special. We'd also say that higher isolation levels will not have duplicate violations.

--
Peter Geoghegan
Re: [HACKERS] Display number of heap accesses for index scans
Andres Freund wrote:
> The number of index lookups that failed to return anything can be a
> critical performance factor in OLTP workloads. Therefore it seems like
> it'd be a good idea to extend the explain analyze output to include
> that information.

I certainly agree.

I've sometimes wondered if we could do better here by exploiting unique indexes' uniqueness property. The general idea is that we could avoid visiting duplicates in the heap when we've determined that they cannot be visible to our MVCC snapshot -- we already encountered the visible version for that value, so this must be true. This is somewhat like what we do with HOT chains, except it occurs across index tuples with the same value rather than across heap tuples within a HOT chain.

The big difficulty is LP_DEAD bit setting within B-Tree pages; if no one is visiting the heap, no one can opportunistically mark the bit to indicate that nobody needs to visit the heap tuple pointed to by the index tuple (because the heap tuple is dead to everyone). Now, there isn't an immediate problem with that, because the main point of both optimizations is to avoid heap visits. But there is a problem when there needs to be a page split and there is a lack of LP_DEAD bits set from which to reclaim space to defer the page split (in nbtree, we can reclaim the space needed for an index tuple that needs to be inserted without actually splitting, in the end). We cannot undermine the secondary goal of the LP_DEAD/kill_prior_tuple optimization, which is reclaiming space early within B-Tree pages.

I can imagine a way of addressing this problem, though it is very invasive -- versioning distinct index tuple versions in the index with ordinal version numbers. I wonder if anyone else has thought about doing something like this, and has a better idea, though.

I think that we could be cleverer about unique indexes in a number of different ways [1]. Reducing heap accesses is a big one.

[1] https://wiki.postgresql.org/wiki/Key_normalization#Making_all_items_in_the_index_unique_by_treating_heap_TID_as_an_implicit_last_attribute

--
Peter Geoghegan
Re: [HACKERS] Small improvement to compactify_tuples
Юрий Соколов wrote:
> tps also reflects the changes:
> ~17ktps with qsort
> ~19ktps with bucket sort
>
> Also vacuum of the benchmark's table is improved:
> ~3s with qsort,
> ~2.4s with bucket sort

One thing that you have to be careful with when it comes to our qsort with partially presorted inputs is what I like to call "banana skin effects":

https://postgr.es/m/cah2-wzku2xk2dpz7n8-a1mvuuttuvhqkfna+eutwnwctgyc...@mail.gmail.com

This may have nothing at all to do with your results; I'm just pointing it out as a possibility.

--
Peter Geoghegan
Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple
Alvaro Herrera wrote:
> He means that the tuple that heap_update moves to page 1 (which will no
> longer be processed by vacuum) will contain a multixact that's older
> than relminmxid -- because it is copied unchanged by heap_update
> instead of properly checking against age limit.

I see. The problem is more or less with this heap_update() code:

    /*
     * And also prepare an Xmax value for the new copy of the tuple. If there
     * was no xmax previously, or there was one but all lockers are now gone,
     * then use InvalidXid; otherwise, get the xmax from the old tuple. (In
     * rare cases that might also be InvalidXid and yet not have the
     * HEAP_XMAX_INVALID bit set; that's fine.)
     */
    if ((oldtup.t_data->t_infomask & HEAP_XMAX_INVALID) ||
        HEAP_LOCKED_UPGRADED(oldtup.t_data->t_infomask) ||
        (checked_lockers && !locker_remains))
        xmax_new_tuple = InvalidTransactionId;
    else
        xmax_new_tuple = HeapTupleHeaderGetRawXmax(oldtup.t_data);

My naive guess is that we have to create a new MultiXactId here in at least some cases, just like FreezeMultiXactId() sometimes does.

--
Peter Geoghegan
Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple
Andres Freund wrote:
> Here's that patch. I've stared at this some, and Robert did too. Robert
> mentioned that the commit message might need some polish and I'm not
> 100% sure about the error message texts yet.

The commit message should probably say that the bug involves the resurrection of previously dead tuples, which is different from there being duplicates because a constraint is not enforced because HOT chains are broken (that's a separate, arguably less serious problem).

> Staring at the vacuumlazy hunk I think I might have found a related
> bug: heap_update_tuple() just copies the old xmax to the new tuple's
> xmax if a multixact and still running. It does so without verifying
> liveliness of members. Isn't that buggy? Consider what happens if we
> have three blocks: 1 has free space, two is being vacuumed and is
> locked, three is full and has a tuple that's key share locked by a live
> tuple and is updated by a dead xmax from before the xmin horizon. In
> that case afaict the multi will be copied from the third page to the
> first one. Which is quite bad, because vacuum already processed it, and
> we'll set relfrozenxid accordingly. I hope I'm missing something here?

Can you be more specific about what you mean here? I think that I understand where you're going with this, but I'm not sure.

--
Peter Geoghegan
Re: [HACKERS] MERGE SQL Statement for PG11
Simon Riggs wrote:
>>> The *only* behavioural difference I have proposed would be the *lack*
>>> of an ERROR in (some) concurrent cases.
>>
>> I think that's a big difference. Error vs. non-error is a big deal by
>> itself;
>
> Are you saying avoiding an ERROR is a bad or good thing?

Are you really asking Robert to repeat what has already been said about a dozen different ways?

That's *not* the only difference. You need to see a couple of steps ahead to see further differences, as the real dilemma comes when you have to reconcile having provided the UPSERT guarantees with cases that they don't map on to (which can happen in a number of different ways). I don't understand why you'll talk about just about anything but that. This is a high-level concern about the overarching design. Do you really not understand the concern at this point?

>> also, the non-error case involves departing from MVCC semantics just
>> as INSERT .. ON CONFLICT UPDATE does.
>
> Meaning what exactly? What situation occurs that a user would be
> concerned with? Please describe exactly what you mean so we get it
> clear.
>
> The concurrent behaviour for MERGE is allowed to be
> implementation-specific, so we can define it any way we want.

Agreed -- we can. It isn't controversial at all to say that the SQL standard has nothing to say on this question. The problem is that the semantics you argue for are ill-defined, and seem to create more problems than they solve. Why keep bringing up the SQL standard?

--
Peter Geoghegan
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Thomas Munro wrote:
>> I'm going to make an item on my personal TODO list for that. No useful
>> insights on that right now, though.
>
> I decided to try that, but it didn't really work: fd.h gets included by
> front-end code, so I can't very well define a struct and declare
> functions that deal in dsm_segment and slock_t. On the other hand it
> does seem a bit better for these shared file sets to work in terms of
> File, not BufFile. Realistically, fd.h has a number of functions that
> are really owned by buffile.c already.

This sounds fine. That way you don't have to opt in to BufFile's double buffering and segmentation schemes just to get shared file clean-up, if for some reason you want direct file handles. Is that something that you really think is possible?

--
Peter Geoghegan
Re: [HACKERS] MERGE SQL Statement for PG11
Nico Williams wrote:
> A MERGE mapped to a DML like this:
>
>   WITH updated AS (
>     UPDATE <target> SET ...
>     WHERE <condition>
>     RETURNING <key>
>   ), inserted AS (
>     INSERT INTO <target>
>     SELECT ...
>     WHERE <key> NOT IN (SELECT <key> FROM updated) AND ..
>     ON CONFLICT DO NOTHING -- see below!
>     RETURNING <key>
>   )
>   DELETE FROM <target>
>   WHERE <key> NOT IN (SELECT <key> FROM updated) AND
>         <key> NOT IN (SELECT <key> FROM inserted) AND ...;

This is a bad idea. An implementation like this is not at all maintainable.

> can handle concurrency via ON CONFLICT DO NOTHING in the INSERT CTE.

That's not handling concurrency -- it's silently ignoring an error. Who is to say that the conflict that IGNORE ignored is associated with a row visible to the MVCC snapshot of the statement? IOW, why should the DELETE affect any row?

There are probably a great many reasons why you need a ModifyTable executor node that keeps around state, and explicitly indicates that a MERGE is a MERGE. For example, we'll probably want statement-level triggers to execute in a fixed order regardless of the particular MERGE, RLS will probably require explicit knowledge of MERGE semantics, and so on.

FWIW, your example doesn't actually have a source (just a target), so it isn't actually like MERGE.

--
Peter Geoghegan
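For concreteness, the kind of CTE-based mapping under discussion might be instantiated like this (the accounts/staged_accounts tables and their columns are invented for illustration; this spells out the pattern being criticized, not an endorsement of it):

```sql
WITH updated AS (
    UPDATE accounts a SET balance = s.balance
    FROM staged_accounts s
    WHERE s.id = a.id
    RETURNING a.id
), inserted AS (
    INSERT INTO accounts (id, balance)
    SELECT s.id, s.balance
    FROM staged_accounts s
    WHERE s.id NOT IN (SELECT id FROM updated)
    ON CONFLICT DO NOTHING   -- "handles" concurrency by ignoring it
    RETURNING id
)
-- Rows matched by neither CTE are removed from the target:
DELETE FROM accounts a
WHERE a.id NOT IN (SELECT id FROM updated)
  AND a.id NOT IN (SELECT id FROM inserted);
```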
Re: [HACKERS] MERGE SQL Statement for PG11
Simon Riggs wrote:
> I think people imagined you had worked out how to make MERGE run
> concurrently, I certainly did, but in fact you're just saying you
> don't believe it ever should.

I'm certain that they didn't think that at all. But I'll let them speak for themselves.

> That is strange since the SQL Standard specifically allows the
> implementation to decide upon concurrent behaviour.

And yet nobody else decided to do what you propose with this apparent leeway. (See the UPSERT wiki page for many references that confirm this.)

> So in your view we should make no attempt to avoid concurrent errors,
> even when we have the capability to do so (in some cases) and doing
> so would be perfectly compliant with the SQL Standard.

Yes. That's what I believe. I believe this because I can't see a way to do this that isn't a mess, and because ON CONFLICT DO UPDATE exists and works well for the cases where we can do better in READ COMMITTED mode.

Did you know that Oracle doesn't have an EPQ-style mechanism at all? Instead, it rolls back the entire statement and retries it from scratch. That is not really any further from or closer to the standard than the EPQ stuff, because the standard doesn't say anything about what should happen as far as READ COMMITTED conflict handling goes. My point here is that all of the stuff I'm talking about is only relevant in READ COMMITTED mode, in areas where the standard never provides us with guidance.

If you can rely on SSI, then there is no difference between what you propose and what I propose anyway, except that what I propose is more general and will have better performance, especially for batch MERGEs. If READ COMMITTED didn't exist, implementing ON CONFLICT would have been more or less free of controversy.

> Yes, that certainly will make an easier patch for MERGE.

Indeed, it will.

--
Peter Geoghegan
Re: [HACKERS] MERGE SQL Statement for PG11
Simon Riggs wrote:
> So if I understand you correctly, in your view MERGE should just fail
> with an ERROR if it runs concurrently with other DML?

That's certainly my opinion on the matter. It seems like that might be the consensus, too. Obviously there are things that you as a user can do about this on your own, like opt to use a higher isolation level, or manually LOCK TABLE. For some use cases, including bulk loading for OLAP, users might just know that there isn't going to be concurrent activity, because it's not an OLTP system.

If this still seems odd to you, then consider that exactly the same situation exists with UPDATE. A user could want their UPDATE to affect a row where no row version is actually visible to their MVCC snapshot, because they have an idea about reliably updating the latest row. UPDATE doesn't work like that, of course. Is this unacceptable because the user expects that it should work that way?

Bear in mind that ON CONFLICT DO UPDATE *can* actually update a row when there is no version of it visible to the snapshot. It can also update a row where there is a concurrent DELETE + INSERT, and the tuples with the relevant unique index values end up not even being part of the same update chain in each case (MVCC-snapshot-visible vs. latest). IOW, you may end up updating a completely different logical row to the row with the conflicting value that is visible to your MVCC snapshot!

> i.e. if a race condition between the query and an INSERT runs
> concurrently with another INSERT
>
> We have no interest in making that work?

Without meaning to sound glib: we already did make it work for a special, restricted case that is important enough to justify introducing a couple of kludges -- ON CONFLICT DO UPDATE/upsert.

I do agree that what I propose for MERGE will probably cause confusion; just look into Oracle's MERGE implementation for examples of this. We ought to go out of our way to make it clear that MERGE doesn't provide these guarantees.
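The ON CONFLICT DO UPDATE behavior described above can be sketched as an interleaving like the following (table and values invented for illustration; both sessions at READ COMMITTED):

```sql
-- Setup:
--   CREATE TABLE t (k int PRIMARY KEY, v text);
--   INSERT INTO t VALUES (1, 'orig');

-- Session 1:
BEGIN;
DELETE FROM t WHERE k = 1;
INSERT INTO t VALUES (1, 'new');   -- new heap tuple, unrelated update chain

-- Session 2 (its snapshot sees only (1, 'orig')):
INSERT INTO t VALUES (1, 'x')
ON CONFLICT (k) DO UPDATE SET v = 'upserted';
-- ...blocks on session 1's pending insert...

-- Session 1:
COMMIT;

-- Session 2 now proceeds and updates the (1, 'new') row: a row version
-- that was never visible to its snapshot, and not even part of the same
-- update chain as the (1, 'orig') row it could see.
```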
--
Peter Geoghegan
Re: [HACKERS] MERGE SQL Statement for PG11
Nico Williams wrote:
> If you want to ignore conflicts arising from concurrency you could
> always add an ON CONFLICT DO NOTHING to the INSERT DML in the mapping
> I proposed earlier. Thus a MERGE CONCURRENTLY could just do that.
>
> Is there any reason not to map MERGE as I proposed?

Performance, for one. MERGE generally has a join that can be optimized like an UPDATE FROM join. I haven't studied this question in any detail, but FWIW I think that using CTEs for merging is morally equivalent to a traditional MERGE implementation. It may actually be possible to map from CTEs to a MERGE statement, but I don't think that that's a good approach to implementing MERGE.

Most of the implementation time will probably be spent doing things like making sure MERGE behaves appropriately with triggers, RLS, updatable views, and so on. That will take quite a while, but isn't particularly technically challenging IMV.

--
Peter Geoghegan
Re: [HACKERS] MERGE SQL Statement for PG11
Robert Haas wrote:
> And if, in the meantime, MERGE can only handle the cases where there
> is a unique index, then it can only handle the cases INSERT .. ON
> CONFLICT UPDATE can cover, which makes it, as far as I can see,
> syntactic sugar over what we already have. Maybe it's not entirely -
> you might be planning to make some minor functional enhancements -
> but it's not clear what those are, and I feel like whatever it is
> could be done with less work and more elegance by just extending the
> INSERT .. ON CONFLICT UPDATE syntax.

+1

Marko Tiikkaja's INSERT ... ON CONFLICT SELECT patch, which is in the current CF [1], moves things in this direction. I myself have occasionally wondered if it was worth adding an alternative DO DELETE conflict_action. This could appear alongside DO UPDATE, and be applied using MERGE-style conditions. All of these things seem like small adjuncts to ON CONFLICT, because they're all just an alternative way of modifying or projecting the tuple that is locked by ON CONFLICT. Everything new would have to happen after the novel ON CONFLICT handling has already completed.

The only reason that I haven't pursued this is because it doesn't seem that compelling. I mention it now because it's worth acknowledging that ON CONFLICT could be pushed a bit further in this direction. Of course, this still falls far short of making ON CONFLICT entirely like MERGE.

[1] https://commitfest.postgresql.org/15/1241/

--
Peter Geoghegan
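A sketch of what that hypothetical conflict_action might look like (DO DELETE is not real syntax, and the stock table is invented; this only illustrates the idea):

```sql
-- Hypothetical only: on conflict, delete the existing row rather than
-- updating it, gated by a MERGE-style condition on the locked tuple.
INSERT INTO stock AS t (item_id, qty)
VALUES (42, 0)
ON CONFLICT (item_id)
    DO DELETE WHERE t.qty + EXCLUDED.qty <= 0;
```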
Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple
On Thu, Nov 2, 2017 at 9:44 AM, Robert Haas wrote:
> The second commit (22576734b805fb0952f9be841ca8f643694ee868) is where
> I think things get a lot more dangerous. The problem (as Andres
> pointed out to me this afternoon) is that it seems possible to end up
> with a situation where there should be two HOT chains on a page, and
> because of the weakened xmin/xmax checking rules, we end up thinking
> that the second one is a continuation of the first one, which will be
> all kinds of bad news. That would be particularly likely to happen in
> releases from before we invented HEAP_XMIN_FROZEN, when there's no
> xmin/xmax matching at all, but could happen on later releases if we
> are extraordinarily unlucky (i.e. xmin of the first tuple in the
> second chain happens to be the same as the pre-freeze xmax in the old
> chain, probably because the same XID was used to update the page in
> two consecutive epochs). Fortunately, that commit is (I think) not
> released anywhere.

FWIW, if you look at that second commit carefully, you'll realize that it doesn't even treat those two cases differently. It was buggy even on its own terms: the FrozenTransactionId test used an xmin from HeapTupleHeaderGetXmin(), not HeapTupleHeaderGetRawXmin().

--
Peter Geoghegan
Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple
On Thu, Nov 2, 2017 at 4:20 AM, Andres Freund wrote:
> I think the problem is on the pruning, rather than the freezing side.
> We can't freeze a tuple if it has an alive predecessor - rather than
> weakening this, we should be fixing the pruning to not have the alive
> predecessor.

Excellent catch.

> If the update xmin is actually below the cutoff we can remove the
> tuple even if there's live lockers - the lockers will also be present
> in the newer version of the tuple. I verified that for me that fixes
> the problem. Obviously that'd require some comment work and more
> careful diagnosis.

I didn't even know that that was safe.

> I think a5736bf754c82d8b86674e199e232096c679201d might be dangerous
> in the face of previously corrupted tuple chains and pg_upgraded
> clusters - it can lead to tuples being considered related, even
> though they're from entirely independent hot chains. Especially when
> upgrading 9.3 post your fix, to current releases.

Frankly, I'm relieved that you got to this. I was highly suspicious of a5736bf754c82d8b86674e199e232096c679201d, even beyond my specific, actionable concern about how it failed to handle the 9.3/FrozenTransactionId xmin case as special.

As I went into in the "heap/SLRU verification, relfrozenxid cut-off, and freeze-the-dead bug" thread, these commits left us with a situation where there didn't seem to be a reliable way of knowing whether or not it is safe to interrogate clog for a given heap tuple using a tool like amcheck. And it wasn't obvious that you couldn't have a codepath that failed to account for pre-cutoff non-frozen tuples -- codepaths that call TransactionIdDidCommit() despite it actually being unsafe. If I'm not mistaken, your proposed fix restores sanity there.

--
Peter Geoghegan
Re: [HACKERS] MERGE SQL Statement for PG11
On Wed, Nov 1, 2017 at 10:19 AM, Simon Riggs wrote:
>> The problem here is: Iff the first statement uses ON CONFLICT
>> infrastructure, doesn't the absence of WHEN NOT MATCHED imply
>> different semantics for the remaining updates and deletes in the
>> second version of the query?
>
> Not according to the SQL Standard, no. I have no plans for such
> differences to exist.
>
> Spec says: If we hit HeapTupleSelfUpdated then we throw an ERROR.

Your documentation said that the MERGE was driven by a speculative insertion (BTW, I don't think that this internal implementation detail should be referenced in user-facing docs). I inferred that that could not always be true, since there won't always be an INSERT/WHEN NOT MATCHED case, assuming that you allow that at all (I now gather that you will).

>> You've removed what seems like a neat adjunct to the MERGE, but it
>> actually changes everything else too when using READ COMMITTED.
>> Isn't that pretty surprising?
>
> I think you're presuming things I haven't said and don't mean, so
> we're both surprised.

You're right -- I'm surmising what I think might be true, because I don't have the information available to know one way or the other.

As far as this issue with using speculative insertions in one context but not in another goes, I still don't really know where you stand. I can still only surmise that you must want both implementations, and will use one or the other as circumstances dictate (to avoid dup violations in the style of ON CONFLICT where that's possible). This seems true because you now say that it will be possible to omit WHEN NOT MATCHED, and yet there is no such thing as a speculative insertion without the insertion. You haven't said that that conclusion is true yourself, but it's the only conclusion that I can draw based on what you have said.

> I think we need some way of expressing the problems clearly.

It's certainly hard to talk about these problems. I know this from experience.
> "a general purpose solution is one that more or less works like an
> UPDATE FROM, with an outer join, whose ModifyTable node is capable of
> insert, update, or delete (and accepts quals for MATCHED and NOT
> MATCHED cases, etc). You could still get duplicate violations due to
> concurrent activity in READ COMMITTED mode".
>
> Surely the whole point of this is to avoid duplicate violations due
> to concurrent activity?

Now we're getting somewhere. I *don't* think that that's the whole point of MERGE. No other MERGE implementation does that, or claims to do that. The SQL standard says nothing about this. Heikki found this to be acceptable when working on the GSoC MERGE implementation that went nowhere. My position is that we ought to let MERGE be MERGE, and let ON CONFLICT be ON CONFLICT.

In Postgres, you can avoid duplicate violations with MERGE by using a higher isolation level (these days, duplicate violations are turned into a serialization error at higher isolation levels when no duplicate is visible to the xact's snapshot). MERGE isn't and shouldn't be special when it comes to concurrency.

> I'm not seeing how either design sketch rigorously avoids live locks,
> but those are fairly unlikely and easy to detect and abort.

My MERGE semantics (which really are not mine at all) avoid live lock/lock starvation by simply never retrying anything without making forward progress. MERGE doesn't take any special interest in concurrency, just like any other DML statement that isn't INSERT with ON CONFLICT. ON CONFLICT would have live locks if it didn't always have the choice of inserting [1]. In many ways, the syntax of INSERT ON CONFLICT DO UPDATE is restricted in exactly the way it needs to be in order to function correctly. It wasn't an accident that it didn't end up being UPDATE ... ON NOUPDATE DO INSERT, or something like that, which Robert proposed at one point.
ON CONFLICT plays by its own rules to a certain extent, because that's what you need in order to get the desired guarantees in READ COMMITTED mode [2]. This is the main reason why it was as painful a project as it was. Further generalizing that seems fraught with difficulties. It seems logically impossible to generalize it in a way where you don't end up with two behaviors masquerading as one.

[1] https://wiki.postgresql.org/wiki/UPSERT#Theoretical_lock_starvation_hazards
[2] https://wiki.postgresql.org/wiki/UPSERT#Goals_for_implementation

--
Peter Geoghegan
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Tue, Oct 31, 2017 at 5:07 PM, Thomas Munro wrote:
> So that's this bit:
>
> + pg_itoa(worker, filename);
> + lts->pfile = BufFileCreateShared(fileset, filename);
>
> ... and:
>
> + pg_itoa(i, filename);
> + file = BufFileOpenShared(fileset, filename);

Right.

> What's wrong with using a worker number like this?

I guess nothing, though there is the question of discoverability for DBAs, etc. You do address this separately, by having (potentially) descriptive filenames, as you go into.

> It's not random choice: buffile.c creates a uniquely named directory
> (or directories, if you have more than one location configured in the
> temp_tablespaces GUC) to hold all the backing files involved in each
> BufFileSet. Naming of BufFiles within the BufFileSet is the caller's
> problem, and a worker number seems like a reasonable choice to me. It
> won't collide with a concurrent parallel CREATE INDEX because that'll
> be using its own BufFileSet.

Oh, I see. I may have jumped the gun on that one.

> One complaint about the current coding that someone might object to:
> MakeSharedSegmentPath() just dumps the caller's BufFile name into a
> path without sanitisation: I should fix that so that we only accept
> fairly limited strings here. Another complaint is that perhaps fd.c
> knows too much about buffile.c's business. For example,
> RemovePgTempFilesInDir() knows about the ".set" directories created
> by buffile.c, which might be called a layering violation. Perhaps the
> set/directory logic should move entirely into fd.c, so you'd call
> FileSetInit(FileSet *), not BufFileSetInit(BufFileSet *), and then
> BufFileOpenShared() would take a FileSet *, not a BufFileSet *.
> Thoughts?

I'm going to make an item on my personal TODO list for that. No useful insights on that right now, though.

> 3. sharedtuplestore.c takes a caller-supplied BufFileSet and creates
> its shared BufFiles in there. Earlier versions created and owned a
> BufFileSet, but in the current Parallel Hash patch I create loads of
> separate SharedTuplestore objects but I didn't want to create loads
> of directories to back them, so you can give them all the same
> BufFileSet. That works because SharedTuplestores are also given a
> name, and it's the caller's job (in my case nodeHash.c) to make sure
> the SharedTuplestores are given unique names within the same
> BufFileSet. For Parallel Hash you'll see names like 'i3of8' (inner
> batch 3 of 8). There is no need for there to be any sort of central
> registry for that though, because it rides on top of the guarantees
> from 2 above: buffile.c will put those files into a uniquely named
> directory, and that works as long as no one else is allowed to create
> files or directories in the temp directory that collide with its
> reserved pattern /^pgsql_tmp.+\.set$/. For the same reason, parallel
> CREATE INDEX is free to use worker numbers as BufFile names, since it
> has its own BufFileSet to work within.

If the new standard is that you have temp file names that suggest the purpose of each temp file, then that may be something that parallel CREATE INDEX should buy into.

> In an earlier version, BufFileSet was one of those annoying data
> structures with a FLEXIBLE_ARRAY_MEMBER that you'd use as an
> incomplete type (declared but not defined in the includable header),
> and here it was being used "inside" (or rather after) SharedSort,
> which *itself* had a FLEXIBLE_ARRAY_MEMBER. The reason for the
> variable sized object was that I needed all backends to agree on the
> set of temporary tablespace OIDs, of which there could be any number,
> but I also needed a 'flat' (pointer-free) object I could stick in
> relocatable shared memory. In the newest version I changed that
> flexible array to tablespaces[8], because 8 should be enough
> tablespaces for anyone (TM).
I guess that that's something that you'll need to take up with Andres, if you haven't already. I have a hard time imagining a single query needing to use more than that many tablespaces at once, so maybe this is fine.

> I don't really believe anyone uses temp_tablespaces for IO load
> balancing anymore and I hate code like the above. So I think Rushabh
> should now remove the above-quoted code and just use a BufFileSet
> directly as a member of SharedSort.

FWIW, I agree with you that nobody uses temp_tablespaces this way these days. This seems like a discussion for your hash join patch, though. I'm happy to buy into that.

--
Peter Geoghegan
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Thu, Oct 26, 2017 at 4:22 AM, Rushabh Lathia wrote:
> Attaching the rebased patch according to the v22 parallel-hash patch
> sets

I took a quick look at this today, and noticed a few issues:

* make_name() is used to name files in sharedtuplestore.c, which is what is passed to BufFileOpenShared() for parallel hash join. You're using your own logic for that within the equivalent logtape.c call to BufFileOpenShared(), presumably because make_name() wants to identify participants by PID rather than by an ordinal identifier number.

I think that we need some kind of central registry for things that use shared buffiles. It could be that sharedtuplestore.c is further generalized to support this, or it could be that they both call something else that takes care of naming. It's not okay to have this left to random chance. You're going to have to ask Thomas about this. You should also use MAXPGPATH for the char buffer on the stack.

* This logtape.c comment needs to be updated, as it's no longer true:

 * successfully. In general, workers can take it that the leader will
 * reclaim space in files under their ownership, and so should not
 * reread from tape.

* Robert hated the comment changes in the header of nbtsort.c. You might want to change it back, because he is likely to be the one that commits this.

* You should look for similar comments in tuplesort.c (IIRC a couple of places will need to be revised).

* tuplesort_begin_common() should actively reject a randomAccess parallel case using elog(ERROR).

* tuplesort.h should note that randomAccess isn't supported, too.

* What's this all about?:

+ /* Accessor for the SharedBufFileSet that is at the end of Sharedsort. */
+ #define GetSharedBufFileSet(shared)\
+   ((BufFileSet *) (&(shared)->tapes[(shared)->nTapes]))

You can't just cast from one type to the other without regard for the underlying size of the shared memory buffer, which is what this looks like to me.
This only fails to crash because you're only abusing the last member in the tapes array for this purpose, and there happens to be enough shared memory slop that you get away with it. I'm pretty sure that ltsUnify() ends up clobbering the last/leader tape, which is a place where BufFileSet is also used, so this is just wrong. You should rethink the shmem structure a little bit.

* There is still that FIXME comment within leader_takeover_tapes(). I believe that you should still have a leader tape (at least in local memory in the leader), even though you'll never be able to do anything with it, since randomAccess is no longer supported. You can remove the FIXME, and just note that you have a leader tape to be consistent with the serial case, though recognize that it's not useful. Note that even with randomAccess, we always had the leader tape, so it's not that different, really.

I suppose it might make sense to make shared->tapes not have a leader tape. It hardly matters -- perhaps you should leave it there in order to keep the code simple, as you'll be keeping the leader tape in local memory, too. (But it still won't fly to continue to clobber it, of course -- you still need to find a dedicated place for BufFileSet in shared memory.)

That's all I have right now.

--
Peter Geoghegan
Re: [HACKERS] MERGE SQL Statement for PG11
On Tue, Oct 31, 2017 at 2:25 AM, Simon Riggs wrote:
> If there are challenges ahead, its reasonable to ask for test cases
> for that now especially if you think you know what they already are.
> Imagine we go forwards 2 months - if you dislike my patch when it
> exists you will submit a test case showing the fault. Why not save us
> all the trouble and describe that now? Test Driven Development.

I already have, on several occasions now. But if you're absolutely insistent on my constructing the test case in terms of a real SQL statement, then that's what I'll do. Consider this MERGE statement, from your mock documentation:

MERGE INTO wines w
USING wine_stock_changes s ON s.winename = w.winename
WHEN NOT MATCHED AND s.stock_delta > 0 THEN
  INSERT VALUES (s.winename, s.stock_delta)
WHEN MATCHED AND w.stock + s.stock_delta > 0 THEN
  UPDATE SET stock = w.stock + s.stock_delta
ELSE
  DELETE;

Suppose we remove the WHEN NOT MATCHED case, leaving us with:

MERGE INTO wines w
USING wine_stock_changes s ON s.winename = w.winename
WHEN MATCHED AND w.stock + s.stock_delta > 0 THEN
  UPDATE SET stock = w.stock + s.stock_delta
ELSE
  DELETE;

We now have a MERGE that will not INSERT, but will continue to UPDATE and DELETE. (It's implied that your syntax cannot do this at all, because you propose to use the ON CONFLICT infrastructure, but I think we have to imagine a world in which that restriction either never existed or has subsequently been lifted.)

The problem here is: Iff the first statement uses ON CONFLICT infrastructure, doesn't the absence of WHEN NOT MATCHED imply different semantics for the remaining updates and deletes in the second version of the query? You've removed what seems like a neat adjunct to the MERGE, but it actually changes everything else too when using READ COMMITTED. Isn't that pretty surprising? If you're not clear on what I mean, see my previous remarks on EPQ, live lock, and what a CONFLICT could be in READ COMMITTED mode. Concurrent activity at READ COMMITTED mode can be expected to significantly alter the outcome here.

Why not just always use the behavior that the second query requires, which is very much like an UPDATE FROM with an outer join that can sometimes do deletes (and inserts)? We should always use an MVCC snapshot, and never play ON CONFLICT style games with visibility/dirty snapshots.

> It's difficult to discuss anything with someone that refuses to
> believe that there are acceptable ways around things. I believe there
> are.

Isn't that blind faith? Again, it seems like you haven't really tried to convince me.

> If you can calm down the rhetoric we can work together, but if you
> continue to grandstand it makes it more difficult.

I'm trying to break the deadlock, by getting you to actually consider what I'm saying. I don't enjoy confrontation. Currently, it seems like you're just ignoring my objections, which you actually could fairly easily work through. That is rather frustrating.

> You've said its possible another way. Show that assertion is actually
> true. We're all listening, me especially, for the technical details.

My proposal, if you want to call it that, has the merit of actually being how MERGE works in every other system. Both Robert and Stephen seem to be almost sold on what I suggest, so I think that I've probably already explained my position quite well.

--
Peter Geoghegan
Re: [HACKERS] MERGE SQL Statement for PG11
On Mon, Oct 30, 2017 at 11:07 AM, Simon Riggs wrote:
> Please explain in detail the MERGE SQL statements that you think will
> be problematic and why.

Your proposal is totally incomplete, so I can only surmise its behavior in certain cases, to make a guess at what the problems might be (see my remarks on EPQ, live lock, etc). This makes it impossible to do what you ask right now.

Besides, you haven't answered the question from my last e-mail ("What's wrong with that [set of MERGE semantics]?"), so why should I go to further trouble? You're just not constructively engaging with me at this point. We're going around in circles.

--
Peter Geoghegan
Re: [HACKERS] MERGE SQL Statement for PG11
On Mon, Oct 30, 2017 at 6:21 AM, Simon Riggs wrote:
>> That's not really true. Nobody's going to be happy if MERGE has one
>> behavior in one set of cases and an astonishingly different behavior
>> in another set of cases. If you adopt a behavior for certain cases
>> that can't be extended to other cases, then you're blocking a
>> general-purpose MERGE.
>
> If a general purpose solution exists, please explain what it is.

For the umpteenth time, a general purpose solution is one that more or less works like an UPDATE FROM, with an outer join, whose ModifyTable node is capable of insert, update, or delete (and accepts quals for MATCHED and NOT MATCHED cases, etc). You could still get duplicate violations due to concurrent activity in READ COMMITTED mode, but not at higher isolation levels, thanks to Thomas Munro's work there. In this world, ON CONFLICT and MERGE are fairly distinct things.

What's wrong with that? You haven't actually told us why you don't like that.

> The problem is that nobody has ever done so, so its not like we are
> discussing the difference between bad solution X and good solution Y,
> we are comparing reasonable solution X with non-existent solution Y.

Nobody knows what your proposal would be like when the time came to remove the restrictions that you suggest could be removed later. You're the one with the information here. We need to know what those semantics would be up-front, since you're effectively committing us down that path.

You keep making vague appeals to pragmatism, but, in all sincerity, I don't understand where you're coming from at all. I strongly believe that generalizing from ON CONFLICT doesn't even make the implementation easier in the short term. ISTM that you're making this difficult for yourself for reasons that are known only to you.

>> And, indeed, it seems that you're proposing an implementation that
>> adds no new functionality, just syntax compatibility. Do we really
>> want or need two syntaxes for the same thing in core? I kinda think
>> Peter might have the right idea here. Under his proposal, we'd be
>> getting something that is, in a way, new.
>
> Partitioning looked like "just new syntax", but it has been a useful
> new feature.

False equivalency. Nobody, including you, ever argued that that work risked painting us into a corner. (IIRC you said something like the progress was too small to justify putting into a single release.)

> MERGE provides new capabilities that we do not have and is much more
> powerful than INSERT/UPDATE, in a syntax that follows what other
> databases use today. Just like partitioning.

But you haven't told us *how* it is more powerful. Again, the semantics of a MERGE that is a generalization of ON CONFLICT are not at all obvious, and seem like they might be very surprising and risky. There is no question that it's your job to (at a minimum) define those semantics ahead of time, since you're going to commit us to them in the long term if you continue down this path. It is most emphatically *not* just a "small matter of programming".

--
Peter Geoghegan
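The duplicate-violation window being conceded here can be sketched with the wines example from the mock documentation (assuming the proposed MERGE syntax; both sessions at READ COMMITTED):

```sql
-- Sessions 1 and 2 concurrently run the same statement, and the
-- wines table does not yet contain the row in wine_stock_changes:
MERGE INTO wines w
USING wine_stock_changes s ON s.winename = w.winename
WHEN NOT MATCHED THEN
  INSERT VALUES (s.winename, s.stock_delta);

-- Both sessions find no match and take the WHEN NOT MATCHED path; the
-- slower inserter gets a unique violation on wines' key, just as two
-- plain concurrent INSERTs would.  At higher isolation levels this
-- surfaces as a serialization failure instead.
```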
Re: [HACKERS] MERGE SQL Statement for PG11
On Sun, Oct 29, 2017 at 4:48 AM, Simon Riggs wrote:
> I have no objection to you writing a better version than me and if my
> work inspires you to complete that in a reasonable timescale then we
> all win.

My whole point is that the way that you seem determined to go on this is a dead end. I don't think that *anyone* can improve on what you come up with if that's based heavily on ON CONFLICT, for the simple reason that the final user-visible design is totally unclear.

There is an easy way to make me shut up - come up with a design for MERGE that more or less builds on how UPDATE FROM works, rather than building MERGE on ON CONFLICT. (You might base things like RLS handling on ON CONFLICT, but in the main MERGE should be like an UPDATE FROM with an outer join, that can do INSERTs and DELETEs, too.)

The original effort to add MERGE didn't do anything upsert-like, which Heikki (the GSoC mentor of the project) was perfectly comfortable with. I'm too lazy to go search the archives right now, but it's there. Heikki cites the SQL standard. This is what MERGE *actually is*, which you can clearly see from the Oracle/SQL Server/DB2 docs. It says this in the first paragraph of their MERGE documentation. It's crystal clear from their docs -- "This statement is a convenient way to combine multiple operations. It lets you avoid multiple INSERT, UPDATE, and DELETE DML statements."

> I'm also very happy to work together on writing the version
> described - you have already done much work in this area.

You seem to want to preserve the ON CONFLICT guarantees at great cost. But you haven't even defended that based on a high-level goal, or a use case, or something that makes sense to users (describing how it is possible is another matter). You haven't even tried to convince me.

> I don't see any major problems in points you've raised so far, though
> obviously there is much detail.

Did you even read them? They are not mere details. They're fundamental to the semantics of the feature (if you base it on ON CONFLICT). It's not actually important that you understand them all; the important message is that generalizing ON CONFLICT has all kinds of terrible problems.

> All of this needs to be written and then committed, so I'll get on and
> write my proposal. We can then see whether that is an 80% solution or
> something less. There are more obvious barriers to completion at this
> point, like time and getting on with it.

Getting on with *what*, exactly?

In general, I have nothing against an 80% solution, or even a 50% solution, provided there is a plausible path to a 100% solution. I don't think that you have such a path, but only because you're tacitly inserting requirements that no other MERGE implementation has to live with, that I doubt any implementation *could* live with. Again, I'm not the one making this complicated, or adding requirements that will be difficult for you to get into your v1 -- you're the one doing that. The semantics that I suggest (the SQL standard's semantics) will require less code, and will be far simpler.

Right now, I simply don't understand why you're insisting on using ON CONFLICT without even saying why. I can only surmise that you think that doing so will simplify the implementation, but I can guarantee you that it won't.

-- 
Peter Geoghegan
Re: [HACKERS] MERGE SQL Statement for PG11
On Sat, Oct 28, 2017 at 12:49 PM, Simon Riggs wrote:
> Nothing I am proposing blocks later work.

Actually, many things will block future work if you go down that road. You didn't respond to the specific points I raised, but that doesn't mean that they're not real.

> Everything you say makes it clear that a fully generalized solution is
> going to be many years in the making, assuming we agree.

I think that it's formally impossible as long as you preserve the ON CONFLICT guarantees, unless you somehow define the problems out of existence. Those are guarantees which no other MERGE implementation has ever made, and which the SQL standard says nothing about. And for good reasons.

> "The extent to which an SQL-implementation may disallow independent
> changes that are not significant is implementation-defined”.
>
> So we get to choose. I recommend that we choose something practical.
> We're approaching the 10 year anniversary of my first serious attempt
> to do MERGE. I say that its time to move forwards with useful
> solutions, rather than wait another 10 years for the perfect one, even
> assuming it exists.

As far as I'm concerned, you're the one arguing for an unobtainable solution over a good one, not me. I *don't* think you should solve the problems that I raise -- you should instead implement MERGE without any of the ON CONFLICT guarantees, just like everyone else has.

Building MERGE on top of the ON CONFLICT guarantees, and ultimately arriving at something that is comparable to other implementations over many releases, might be okay if anyone had the slightest idea of what that would look like. You haven't even _described the semantics_, which you could do by addressing the specific points that I raised.

-- 
Peter Geoghegan
Re: [HACKERS] MERGE SQL Statement for PG11
On Fri, Oct 27, 2017 at 3:00 PM, Serge Rielau wrote:
>> What other systems *do* have this restriction? I've never seen one that did.
>
> Not clear what you are leading up to here.
> When I did MERGE in DB2 there was also no limitation:
> "Each row in the target can only be operated on once. A row in the target can
> only be identified as MATCHED with one row in the result table of the
> table-reference”
> What there was however was a significant amount of code I had to write and
> test to enforce the above second sentence.

Then it seems that we were talking about two different things all along.

> Maybe in PG there is a trivial way to detect an expanding join and block it
> at runtime.

There is for ON CONFLICT. See the cardinality violation logic within ExecOnConflictUpdate(). (There are esoteric cases where this error can be raised due to a wCTE that does an insert "from afar", which is theoretically undesirable but not actually a problem.)

The MERGE implementation that I have in mind would probably do almost the same thing, and make the "HeapTupleSelfUpdated" case within ExecUpdate() raise an error when the caller happened to be a MERGE, rather than following the historic UPDATE behavior. (The behavior is to silently suppress a second or subsequent UPDATE attempt from the same command, a behavior that Simon's mock MERGE documentation references.)

> So the whole point I’m trying to make is that I haven’t seen the need for the
> extra work I had to do once the feature appeared in the wild.

That seems pretty reasonable to me. My whole point is that I think it's a mistake to do things like lock rows ahead of evaluating any UPDATE predicate, in the style of ON CONFLICT, in order to replicate the ON CONFLICT guarantees [1]. I'm arguing for implementation simplicity, too.

Trying to implement MERGE in a way that extends ON CONFLICT seems like a big mistake to me, because ON CONFLICT updates rows on the basis of a would-be duplicate violation, along with all the baggage that that carries. This is actually enormously different to an equi-join that is fed by a scan using an MVCC snapshot. The main difference is that there actually is no MVCC snapshot in play in most cases [2]. If *no* row with the PK value of 5 is visible to our MVCC snapshot, but an xact committed having inserted such a row, that still counts as a CONFLICT with READ COMMITTED.

[1] https://wiki.postgresql.org/wiki/UPSERT#Goals_for_implementation
[2] https://www.postgresql.org/docs/devel/static/transaction-iso.html#xact-read-committed

-- 
Peter Geoghegan
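The cardinality-violation behavior referenced above is easy to demonstrate with ON CONFLICT as it exists today (table name hypothetical):

```sql
CREATE TABLE t (k int PRIMARY KEY, v text);

-- Both proposed rows arbitrate on the same key, so a single target
-- row would have to be affected twice by one command:
INSERT INTO t (k, v) VALUES (1, 'a'), (1, 'b')
ON CONFLICT (k) DO UPDATE SET v = EXCLUDED.v;
-- ERROR:  ON CONFLICT DO UPDATE command cannot affect row a second time
```

A MERGE whose join produces two source rows for one target row would presumably raise an analogous error, per the DB2 rule Serge quotes.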
Re: [HACKERS] MERGE SQL Statement for PG11
On Sat, Oct 28, 2017 at 3:10 AM, Simon Riggs wrote:
> SQL:2011 specifically states "The extent to which an
> SQL-implementation may disallow independent changes that are not
> significant is implementation-defined”, so in my reading the above
> behaviour would make us fully spec compliant. Thank you to Peter for
> providing the infrastructure on which this is now possible for PG11.
>
> Serge puts this very nicely by identifying two different use cases for MERGE.

MERGE benefits from having a join that is more or less implemented in the same way as any other join. It can be a merge join, hash join, or nestloop join. ON CONFLICT doesn't work using a join.

Should I take it that you won't be supporting any of these alternative join algorithms? If not, then you'll have something that really isn't comparable to MERGE as implemented in Oracle, SQL Server, or DB2. They *all* do this.

Would the user be able to omit WHEN NOT MATCHED/INSERT, as is the case with every existing MERGE implementation? If so, what actually happens under the hood when WHEN NOT MATCHED is omitted? For example, would you actually use a regular "UPDATE FROM" style join, as opposed to the ON CONFLICT infrastructure? And, if that is so, can you justify the semantic difference for rows that are updated in each scenario (omitted vs. not omitted) in READ COMMITTED mode? Note that this could be the difference between updating a row when *no* version is visible to our MVCC snapshot, as opposed to doing the EPQ stuff and updating the latest row version if possible. That's a huge, surprising difference. On top of all this, you risk live-lock if INSERT isn't a possible outcome (this is also why ON CONFLICT can never accept a predicate on its INSERT portion -- again, quite unlike MERGE).

Why not just follow what other systems do? It's actually easier to go that way, and you get a better outcome. ON CONFLICT involves what you could call a sleight of hand, and I fear that you don't appreciate just how specialized the internal infrastructure is.

> Now, I accept that you might also want a MERGE statement that
> continues to work even if there is no unique constraint, but it would
> need to have different properties to the above. I do not in any way
> argue against adding that.

Maybe you *should* be arguing against it, though, and arguing against ever supporting anything but equijoins, because these things will *become* impossible if you go down that road. By starting with the ON CONFLICT infrastructure, while framing no-unique-index support as work for some unspecified future release, you're leaving it up to someone else to resolve the problems. Someone else must square the circle of mixing ON CONFLICT semantics with fully generalized MERGE semantics. But who?

-- 
Peter Geoghegan
Re: [HACKERS] MERGE SQL Statement for PG11
On Fri, Oct 27, 2017 at 2:13 PM, srielau wrote:
> While the standard may not require a unique index for the ON clause I have
> never seen a MERGE statement that did not have this property. So IMHO this
> is a reasonable restrictions.

The Oracle docs on MERGE say nothing about unique indexes or constraints. They don't even mention them in passing. They do say "This statement is a convenient way to combine multiple operations. It lets you avoid multiple INSERT, UPDATE, and DELETE DML statements."

SQL Server's MERGE docs do mention unique indexes, but only in passing, saying something about unique violations, and that unique violations *cannot* be suppressed in MERGE, even though that's possible with other DML statements (with something called IGNORE_DUP_KEY).

What other systems *do* have this restriction? I've never seen one that did.

-- 
Peter Geoghegan
Re: [HACKERS] MERGE SQL Statement for PG11
On Fri, Oct 27, 2017 at 6:24 AM, Robert Haas wrote:
> I think one of the reasons why Peter Geoghegan decided to pursue
> INSERT .. ON CONFLICT UPDATE was that, because it is non-standard SQL
> syntax, he felt free to mandate a non-standard SQL requirement, namely
> the presence of a unique index on the arbiter columns.

That's true, but what I was really insistent on, more than anything else, was that the user would get a practical guarantee about an insert-or-update outcome under concurrency. There could be no "unprincipled deadlocks", nor could there be spurious unique violations. This is the kind of thing that the SQL standard doesn't really concern itself with, and yet it's of significant practical importance to users.

Both Oracle and SQL Server allow these things that I specifically set out to avoid. I think that that's mostly a good thing, though; they do a really bad job of explaining what's what, and don't provide for a very real need ("upsert") in some other way, but their MERGE semantics do make sense to me.

-- 
Peter Geoghegan
Re: [HACKERS] MERGE SQL Statement for PG11
On Fri, Oct 27, 2017 at 7:41 AM, Simon Riggs wrote:
> Good points.
>
> I didn't say it but my intention was to just throw an ERROR if no
> single unique index can be identified.

You'd also throw an error when there were no "upsert compatible" join quals, I take it? I don't see the point in that. That's just mapping one syntax on to another.

> It could be possible to still run MERGE in that situation but we would
> need to take a full table lock at ShareRowExclusive. It's quite likely
> that such statements would throw duplicate update errors, so I
> wouldn't be aiming to do anything with that for PG11.

I would avoid mixing up ON CONFLICT DO UPDATE and MERGE. The "anomalies" you describe in MERGE are not really anomalies IMV. They're simply how the feature is supposed to operate, and how it's possible to make MERGE use alternative join algorithms based only on the underlying cost. You might use a merge join for a bulk-load use case, for example.

I think an SQL MERGE feature would be compelling, but I don't think that it should take much from ON CONFLICT. As I've said many times [1], they're really two different features (Teradata had features similar to each, for example). I suggest that you come up with something that has the semantics that the standard requires, and therefore makes none of the ON CONFLICT guarantees about an outcome under concurrency (INSERT or UPDATE). Those guarantees are basically incompatible with how MERGE needs to work.

In case it matters, I think that the idea of varying relation heavyweight lock strength based on subtle semantics within a DML statement is a bad one. Frankly, I think that that's going to be a nonstarter.

[1] https://www.postgresql.org/message-id/CAM3SWZRP0c3g6+aJ=yydgyactzg0xa8-1_fcvo5xm7hrel3...@mail.gmail.com

-- 
Peter Geoghegan
Re: [HACKERS] unique index violation after pg_upgrade to PG10
On Tue, Oct 24, 2017 at 10:20 PM, Justin Pryzby wrote:
> I think you must have compared these:

Yes, I did. My mistake.

> On Tue, Oct 24, 2017 at 03:11:44PM -0500, Justin Pryzby wrote:
>> ts=# SELECT * FROM bt_page_items(get_raw_page('sites_idx', 1));
>>
>> itemoffset | 48
>> ctid       | (1,37)
>> itemlen    | 32
>> nulls      | f
>> vars       | t
>> data       | 1b 43 52 43 4c 4d 54 2d 43 45 4d 53 30 0b 31 31 31 31 00 00 00 00 00 00
> ...
>> itemoffset | 37
>> ctid       | (0,97)
>> itemlen    | 24
>> nulls      | f
>> vars       | t
>> data       | 1b 43 52 43 4c 4d 54 2d 43 45 4d 53 30 03 00 00
>
> ..but note those are both items in sites_idx (48 and 37, which I seem to have
> pasted out of order).. I included both because I'm not confident I know what
> the "index tid=(1,37)" referred to, but I gather it means item at offset=37
> (and not item with ctid=(1,37).)
>
> | [pryzbyj@database amcheck]$ psql --port 5678 ts -c "SELECT
> bt_index_check('sites_idx'::regclass::oid, heapallindexed=>True)"
> | ERROR: high key invariant violated for index "sites_idx"
> | DETAIL: Index tid=(1,37) points to heap tid=(0,97) page lsn=0/0.

This means that the item at (1,37) in the index is not <= the high key, which is located at (1,1). (The high key comes first, despite being an upper bound rather than a lower bound, per the pageinspect docs.)

I find it suspicious that the page lsn is 0 here, btw.

> ts=# SELECT * FROM page_header(get_raw_page('sites_idx', 1));
> lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
> ----+----------+-------+-------+-------+---------+----------+---------+----------
> 0/0 |        0 |     0 |   872 |  1696 |    8176 |     8192 |       4 |        0
>
> Here is its heap page:
>
> ts=# SELECT * FROM heap_page_items(get_raw_page('sites', 0)) WHERE lp=97;
> lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax   | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits   | t_oid | t_data
> ---+--------+----------+--------+--------+----------+----------+--------+-------------+------------+--------+----------+-------+---------------------------------------
> 97 |    968 |        1 |     52 |  21269 | 33567444 |        0 | (3,27) |        8204 |       2307 |     32 | 11101001 |       | \x70001b4352434c4d542d43454d5330030303
>
> Which I see ends with 0303 vs ..

Looks like I was accidentally right, then -- the heap and index do differ. You might try this tool I published recently, to get a better sense of details like this:

https://github.com/petergeoghegan/pg_hexedit

(Don't do so with a data directory that you cannot afford to corrupt again, though!)

> Maybe this is relevant ?
> ts=# SELECT * FROM heap_page_items(get_raw_page('sites', 3)) WHERE lp=27;
> lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data
> ---+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+-------
> 27 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |

This looks like an LP_DEAD item.

-- 
Peter Geoghegan
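The `data` columns above can be decoded by hand: each value is a short varlena whose 1-byte header's upper 7 bits give the total length, header byte included (assuming a little-endian build). A small sketch of a decoder (my own illustration, not a PostgreSQL tool):

```python
def decode_1b_varlena(data: bytes, offset: int) -> tuple[str, int]:
    """Decode a 1-byte-header varlena; return (text, next offset)."""
    header = data[offset]
    # On little-endian builds, a set low bit marks a 1-byte header.
    assert header & 0x01, "not a 1-byte-header varlena"
    total_len = header >> 1                    # includes the header byte
    payload = data[offset + 1 : offset + total_len]
    return payload.decode("ascii"), offset + total_len

# Index item at offset 48: 0x1b -> 12 payload bytes, then 0x0b -> 4 bytes.
item48 = bytes.fromhex("1b4352434c4d542d43454d53300b31313131")
col1, off = decode_1b_varlena(item48, 0)
col2, _ = decode_1b_varlena(item48, off)
print(col1, repr(col2))        # CRCLMT-CEMS0 '1111'

# Index item at offset 37 ends with 0x03: total length 1, i.e. an empty
# string (the trailing 00 00 looks like alignment padding).
item37 = bytes.fromhex("1b4352434c4d542d43454d533003")
_, off = decode_1b_varlena(item37, 0)
empty, _ = decode_1b_varlena(item37, off)
print(repr(empty))             # ''
```

So the two index tuples really do carry different second-column values ('1111' vs. an empty string), consistent with the heap tuple disagreeing with one of them.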
Re: [HACKERS] unique index violation after pg_upgrade to PG10
On Tue, Oct 24, 2017 at 1:11 PM, Justin Pryzby wrote:
> ..which I gather just verifies that the index is corrupt, not sure if there's
> anything else to do with it? Note, we've already removed the duplicate rows.

Yes, the index itself is definitely corrupt -- this failed before the new "heapallindexed" check even started. Though it looks like that would have failed too, if you got that far, since the index points to a row that does not contain the same data. (I only know this because you dumped the heap tuple and the index tuple.)

Maybe you could try verifying a different index on the same table with "heapallindexed", too. Perhaps that would fail in a more interesting way.

I don't know how pg_repack works in any detail, but I have a hard time imagining it causing corruption like this, where a single B-Tree page is corrupt (the high key invariant fails), possibly because of a torn page (possibly due to recovery not replaying all the WAL needed, for whatever reason).

You're using LVM snapshots -- I hope that you're aware that they're not guaranteed to be consistent across logical volumes. There are a few different ways that they could cause corruption like this if you weren't careful. (In general, I wouldn't recommend using LVM snapshots as any kind of backup solution.)

-- 
Peter Geoghegan
Re: [HACKERS] unique index violation after pg_upgrade to PG10
On Tue, Oct 24, 2017 at 11:48 AM, Kenneth Marshall wrote:
>> Really ? pg_repack "found" and was victim to the duplicate keys, and rolled
>> back its work. The CSV logs clearly show that our application INSERTed rows
>> which are duplicates.
>>
>> [pryzbyj@database ~]$ rpm -qa pg_repack10
>> pg_repack10-1.4.2-1.rhel6.x86_64
>>
>> Justin
>
> Hi Justin,
>
> I just dealt with a similar problem with pg_repack and a PostgreSQL 9.5 DB,
> the exact same error. It seemed to be caused by a tuple visibility issue that
> allowed the "working" unique index to be built, even though a duplicate row
> existed. Then the next pg_repack would fail with the error you got. In our
> case I needed the locality of reference to keep the DB performance acceptable
> and it was not a critical issue if there was a duplicate. We would remove the
> duplicates if we had a failure. We never had a problem with the NO pg_repack
> scenarios.

A new, enhanced version of the corruption detection tool amcheck is now available, and has both apt and yum packages available:

https://github.com/petergeoghegan/amcheck

Unlike the version in Postgres 10, this enhanced version (v1.2) has "heapallindexed" verification, which is really what you want here. If you install it and run it against the unique index in question (with "heapallindexed" verification), that could help. It might provide a more useful diagnostic message.

This is very new, so do let me know how you get on if you try it out.

-- 
Peter Geoghegan
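Once the extension is installed, the suggested check is a single function call; a sketch (index name hypothetical, and the function raises an ERROR describing the first inconsistency it finds rather than returning a result):

```sql
-- Structural B-Tree checks, plus verification that every heap tuple
-- that ought to be indexed has a matching index entry:
SELECT bt_index_check('some_unique_idx'::regclass, heapallindexed => true);
```

The `heapallindexed` argument defaults to false, so plain `bt_index_check('some_unique_idx'::regclass)` performs only the structural checks.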
Re: [HACKERS] How to determine that a TransactionId is really aborted?
On Sun, Oct 22, 2017 at 2:19 PM, Eric Ridge wrote:
> I'm looking for the status as any concurrent open transaction might see it.
> For example, if any concurrent transaction might see it as "in progress",
> that's what I'd want returned. Does that make sense?

Maybe, but note that that's fundamentally something that can become stale immediately. And, just because an MVCC snapshot cannot see a row does not mean that it cannot affect its transaction/statement in some other way (e.g. unique index enforcement).

Again, you'll probably need to put this low-level requirement into context if you want sound advice from this list.

-- 
Peter Geoghegan
Re: [HACKERS] How to determine that a TransactionId is really aborted?
On Sun, Oct 22, 2017 at 12:23 PM, Eric Ridge wrote:
> Can anyone confirm or deny that this is correct? I feel like it is correct,
> but I'm no expert.

What are you going to use the code for? I think that that context is likely to matter here.

-- 
Peter Geoghegan
Re: [HACKERS] A design for amcheck heapam verification
On Thu, Oct 5, 2017 at 7:00 PM, Peter Geoghegan wrote:
> v3 of the patch series, attached, does it that way -- it adds a
> bloom_create(). The new bloom_create() function still allocates its
> own memory, but does so while using a FLEXIBLE_ARRAY_MEMBER. A
> separate bloom_init() function (that works with dynamic shared memory)
> could easily be added later, for the benefit of parallel hash join.

Since Peter E's work on making the documentation sgml files more XML-like has broken the v3 patch doc build, attached is v4, which fixes this bit rot. It also has a few small tweaks here and there to the docs. Nothing worth noting specifically, really -- I just don't like to leave my patches with bit rot for long. (Hat-tip to Thomas Munro for making this easy to detect with his new CF continuous integration tooling.)

I should point out that I shipped virtually the same code yesterday, as v1.1 of the Github version of amcheck (also known as amcheck_next). Early adopters will be able to use this new "heapallindexed" functionality in the next few days, once packages become available for the apt and yum community repos. Just as before, the Github version will work on versions of Postgres >= 9.4.

This seems like good timing on my part, because we know that this new "heapallindexed" verification will detect the "freeze the dead" bugs that the next point release is set to have fixes for -- that is actually kind of how one of the bugs was found [1]. We may even want to advertise the availability of this check within amcheck_next in the release notes for the next Postgres point release.

[1] https://www.postgresql.org/message-id/cah2-wznm4rcrhfaiwkpwtpew2bxdtgrozk7jwwgucxeh3d1...@mail.gmail.com

-- 
Peter Geoghegan

From 7906c7391a9f52d334c2cbc7d3e245ff014629f2 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan
Date: Tue, 2 May 2017 00:19:24 -0700
Subject: [PATCH 2/2] Add amcheck verification of indexes against heap.

Add a new, optional capability to bt_index_check() and
bt_index_parent_check(): callers can check that each heap tuple that
ought to have an index entry does in fact have one. This happens at the
end of the existing verification checks.

This is implemented by using a Bloom filter data structure. The
implementation performs set membership tests within a callback (the same
type of callback that each index AM registers for CREATE INDEX). The
Bloom filter is populated during the initial index verification scan.
---
 contrib/amcheck/Makefile                 |   2 +-
 contrib/amcheck/amcheck--1.0--1.1.sql    |  28 +++
 contrib/amcheck/amcheck.control          |   2 +-
 contrib/amcheck/expected/check_btree.out |  14 +-
 contrib/amcheck/sql/check_btree.sql      |   9 +-
 contrib/amcheck/verify_nbtree.c          | 298 ---
 doc/src/sgml/amcheck.sgml                | 173 ++
 src/include/utils/snapmgr.h              |   2 +-
 8 files changed, 454 insertions(+), 74 deletions(-)
 create mode 100644 contrib/amcheck/amcheck--1.0--1.1.sql

diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
index 43bed91..c5764b5 100644
--- a/contrib/amcheck/Makefile
+++ b/contrib/amcheck/Makefile
@@ -4,7 +4,7 @@ MODULE_big = amcheck
 OBJS = verify_nbtree.o $(WIN32RES)
 
 EXTENSION = amcheck
-DATA = amcheck--1.0.sql
+DATA = amcheck--1.0--1.1.sql amcheck--1.0.sql
 PGFILEDESC = "amcheck - function for verifying relation integrity"
 
 REGRESS = check check_btree
diff --git a/contrib/amcheck/amcheck--1.0--1.1.sql b/contrib/amcheck/amcheck--1.0--1.1.sql
new file mode 100644
index 000..e6cca0a
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.0--1.1.sql
@@ -0,0 +1,28 @@
+/* contrib/amcheck/amcheck--1.0--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION amcheck UPDATE TO '1.1'" to load this file. \quit
+
+--
+-- bt_index_check()
+--
+DROP FUNCTION bt_index_check(regclass);
+CREATE FUNCTION bt_index_check(index regclass,
+    heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+--
+-- bt_index_parent_check()
+--
+DROP FUNCTION bt_index_parent_check(regclass);
+CREATE FUNCTION bt_index_parent_check(index regclass,
+    heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+-- Don't want these to be available to public
+REVOKE ALL ON FUNCTION bt_index_check(regclass, boolean) FROM PUBLIC;
+REVOKE ALL ON FUNCTION bt_index_parent_check(regclass, boolean) FROM PUBLIC;
diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control
index 05e2861..4690484 100644
--- a/contrib/amcheck/amcheck.control
+++ b/contrib/amcheck/amcheck.control
@@ -1,5 +1,5 @@
 # amcheck extension
 comment = 'functions for verifying r
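The verification strategy the patch describes can be sketched in miniature (a toy Python model of the idea, not the C implementation):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash probes into a bit array."""
    def __init__(self, nbits: int = 1 << 20, nhashes: int = 3):
        self.nbits = nbits
        self.nhashes = nhashes
        self.bits = bytearray(nbits // 8)

    def _probes(self, item: bytes):
        # Derive k independent probe positions from seeded SHA-256.
        for seed in range(self.nhashes):
            h = hashlib.sha256(bytes([seed]) + item).digest()
            yield int.from_bytes(h[:8], "big") % self.nbits

    def add(self, item: bytes) -> None:
        for p in self._probes(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._probes(item))

# Model of heapallindexed verification: fingerprint every index tuple
# during the index scan, then probe the filter for each heap tuple that
# ought to be indexed.  A miss proves the heap tuple has no index entry
# (false positives are possible, so some corruption can be missed, but
# there are no false negatives).
index_tuples = [b"(0,1)|foo", b"(0,2)|bar"]
heap_tuples = [b"(0,1)|foo", b"(0,2)|bar", b"(0,3)|baz"]  # (0,3) unindexed

f = BloomFilter()
for t in index_tuples:
    f.add(t)

corrupt = [t for t in heap_tuples if not f.might_contain(t)]
print(corrupt)  # the tuple with the missing index entry
```

The hash scheme and sizing here are arbitrary choices for illustration; the point is the one-sided guarantee, which is why the check can run in one pass with bounded memory.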
Re: [HACKERS] [COMMITTERS] pgsql: Fix traversal of half-frozen update chains
On Thu, Oct 19, 2017 at 9:03 AM, Peter Geoghegan wrote:
>> /me studies the problem for a while.
>>
>> What's bothering me about this is: how is cutoff_xid managing to be a
>> new enough transaction ID for this to happen in the first place? The
>> cutoff XID should certainly be older than anything any current
>> snapshot can see, because it's computed using GetOldestXmin() -- which
>> means that XIDs older than the cutoff can't still be running (except
>> maybe if old_snapshot_threshold is set to a non-default value).
>
> Now that the problem is fixed by a5736bf7, this test case will prune
> and get an LP_REDIRECT ItemId (not an LP_DEAD one), as we're no longer
> confused about the continuity of HOT chains within heap_prune_chain().

In case I was unclear: the key point here is that it wasn't wrong to prune the tuples because they might still be visible to somebody's snapshot. It was wrong to do so in a specific way that could still happen prior to a5736bf7, leaving only a stub LP_DEAD line pointer, because index scans needed to do HOT chain traversal in order to get to the visible tuple that wasn't pruned (the one that was still frozen).

You might say that the HOT chain "appeared to split in two" from heap_prune_chain()'s perspective, a problem that has nothing to do with an XID cutoff. That's why after 20b65522 there was no problem with sequential scans. There were remaining problems with index scans, though, purely because of this HOT pruning business -- that was fixed in a5736bf7. (And, as I've said, problems with "traversal of half-frozen update chains" are presumed to have existed for routines like EvalPlanQual() -- even more subtle problems.)

One thing I'm not clear on is whether we could or should have limited this new HeapTupleUpdateXmaxMatchesXmin() stuff to HOT chains, and so not altered behavior for cross-page cases (for routines like EvalPlanQualFetch() -- not routines like heap_prune_chain(), which only ever deal with a single page at a time).

-- 
Peter Geoghegan
Re: [HACKERS] [COMMITTERS] pgsql: Fix traversal of half-frozen update chains
On Thu, Oct 19, 2017 at 7:21 AM, Robert Haas wrote:
> The commit message for a5736bf7 doesn't say anything about a race; it
> just claims that it is fixing traversal of half-frozen update chains,
> without making any reference to how such a thing as a half-frozen
> update chain came to exist in the first place. I think the commit
> that talks about that race condition is 20b65522.

Well, the race was that the wrong tuple was pruned, because of naivety in the heap_prune_chain() update chain handling, which is what started the work that became commit a5736bf7. That's by far the worst consequence of the traversal-of-half-frozen-update-chains business that a5736bf7's commit message describes, since we know it leads to wrong answers to index scans (and REINDEX fails with "failed to find parent tuple"). It's the only consequence that's been directly observed, but probably not the only one that's possible. Most other places that suffer the problem are in obscure-to-users paths like EvalPlanQual().

> /me studies the problem for a while.
>
> What's bothering me about this is: how is cutoff_xid managing to be a
> new enough transaction ID for this to happen in the first place? The
> cutoff XID should certainly be older than anything any current
> snapshot can see, because it's computed using GetOldestXmin() -- which
> means that XIDs older than the cutoff can't still be running (except
> maybe if old_snapshot_threshold is set to a non-default value).

The problem that we directly observed during pruning, which caused "failed to find parent tuple" even after the first freeze-the-dead commit (commit 20b65522), was that the mixture of frozen and unfrozen tuples in a HOT chain caused heap_prune_chain() to reach an incorrect conclusion about the continuity of the chain. We'd prune the wrong tuples -- it failed to find the visible tuple at the end of the HOT chain, and to respect that that meant it could not prune down to a stub ItemId (blow away what it *incorrectly* thought was an entire HOT chain).

Now that the problem is fixed by a5736bf7, this test case will prune and get an LP_REDIRECT ItemId (not an LP_DEAD one), as we're no longer confused about the continuity of HOT chains within heap_prune_chain().

-- 
Peter Geoghegan
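The chain-continuity rule at issue can be modeled in a few lines (a toy sketch, not PostgreSQL's actual code or data structures): a chain walker only follows a link when the next tuple's xmin matches the current tuple's xmax. Rewriting a mid-chain tuple's xmin (as freezing does) breaks exactly this match, which is how a chain could "appear to split in two".

```python
def walk_hot_chain(page, root):
    """Follow a HOT chain from line pointer `root`, requiring that each
    successor's xmin match its predecessor's xmax (chain continuity)."""
    chain = [root]
    cur = page[root]
    while cur["next"] is not None:
        nxt = page.get(cur["next"])
        # The successor must have been created by the same transaction
        # that updated away the current version; otherwise the line
        # pointer was recycled and now holds an unrelated tuple.
        if nxt is None or nxt["xmin"] != cur["xmax"]:
            break
        chain.append(cur["next"])
        cur = nxt
    return chain

page = {
    1: {"xmin": 100, "xmax": 110, "next": 2},   # root, updated by xact 110
    2: {"xmin": 110, "xmax": 120, "next": 3},   # heap-only, updated by 120
    3: {"xmin": 999, "xmax": 0, "next": None},  # recycled slot: unrelated
}
print(walk_hot_chain(page, 1))  # stops before the unrelated tuple: [1, 2]
```

In this model, "freezing" tuple 2's xmin to some sentinel value would make the walk stop at [1], orphaning the rest of the chain, which is the shape of the problem discussed above.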
Re: [HACKERS] [COMMITTERS] pgsql: Fix traversal of half-frozen update chains
On Wed, Oct 18, 2017 at 1:54 PM, Robert Haas wrote: > Well, I guess what I don't understand is, suppose we have a HOT chain > that looks like this, where [x,y] denotes a tuple with an xmin of x > and an xmax of y. > > [a,b]->[b,c]->[c,d] > > If b is eligible to be frozen, then b is committed and all-visible, > which means that the [a,b] tuple should be pruned altogether. IOW, I > don't think that a non-root tuple should ever have a frozen xmin; if > that happens, I think we've already screwed up. So I don't quite > understand how this problem arises in the first place. Technically, we don't freeze the heap-only tuples here, because we cannot, because freezing tuples renders them visible (we don't set XMIN_FROZEN). Instead, we set hint bits with WAL-logging, on the assumption that nobody will ever *need* to interrogate the clog -- see commit 20b65522 from several weeks back. To be clear, this *does* mean that hint bits stop being mere hints, which I find troubling. I described these heap-only tuples as "logically frozen" over on the "heap/SLRU verification, relfrozenxid cut-off, and freeze-the-dead bug" thread, which is where I brought up my general concern. > I think that the way we are supposed to be guaranteeing this is to > first prune the page and then freeze it. There is a race where we cannot prune the page, though. That's why we had to revisit what I suppose was a tacit assumption, and address its problems in the commit that started this thread (commit a5736bf7). > But that also seems like something > that shouldn't happen - our notion of the freeze threshold should be > frozen at the beginning of the scan and should not advance, and it > should be prior to our freeze horizon, which could be allowed to > advance but not retreat in the course of the scan. It is not allowed to retreat -- kind of.
(Alvaro mentioned something in passing about an *alternative* fix where it really was allowed to retreat in the simplest sense [1], but I don't think that that's going to happen.) > Apologies if this is stupid; please clue me on what I'm missing. I don't think that your questions are stupid at all. It intuitively seems bad that xmin freezing is only something we can do to HOT root and non-HOT tuples, while xmax freezing needs to happen to heap-only tuples, as well as HOT root tuples and non-HOT tuples. But, as I said, I'm still playing catch-up on MultiXacts, and I too feel like I might still be missing important details. [1] https://postgr.es/m/20171017100200.ruszi2c6qqwetce5@alvherre.pgsql -- Peter Geoghegan
[HACKERS] Re: heap/SLRU verification, relfrozenxid cut-off, and freeze-the-dead bug (Was: amcheck (B-Tree integrity checking tool))
On Mon, Oct 16, 2017 at 8:09 PM, Noah Misch wrote: > That presupposes construction of two pieces of software, the server and the > checker, such that every disagreement is a bug in the server. But checkers > get bugs just like servers get bugs. You make a good point, which is that *some* code must be wrong when an error is raised and hardware is not to blame, but ISTM that the nuance of that really matters. The checker seems much less likely to be where bugs are, for three reasons: * There is far less code for us to maintain as compared to the volume of backend code that is effectively tested (again, not including the hidden universe of complex, unauditable firmware code that could be involved these days). * Much of the actual checking (as much as possible) is outsourced to core code that is already critically important. If that has bugs in it, then it is unlikely to be defined as an amcheck bug. * Knowing all this, we can go out of our way to do a good job of getting the design right the first time. (A sound design is far more important than actually having zero bugs.) Obviously there could be unambiguous bugs; I'm not arguing otherwise. I just hope that we can push this model as far as we need to, and perhaps accommodate verifiability as a goal for *future* development projects. We're almost doing that today; debuggability of on-disk structures is something that the community already values. This is the logical next step, IMV. > Checkers do provide a sort of > double-entry bookkeeping. When a reproducible test case prompts a checker > complaint, we'll know *some* code is wrong. I really like your double entry bookkeeping analogy. A tiny discrepancy will bubble up, even in a huge organization, and yet the underlying principles are broad and not all that complicated. > That's an admirable contribution. Thank you. I just hope that it becomes something that other contributors have some sense of ownership over. 
> I'm essentially saying that the server is innocent until proven guilty. It > would be cool to have a self-contained specification of PostgreSQL data files, > but where the server disagrees with the spec without causing problem > behaviors, we'd ultimately update the spec to fit the server. I might not have done a good job of explaining my position. I agree with everything you say here. I would like to see amcheck become a kind of vehicle for discussing things that we already discuss. You get a nice tool at the end, that clarifies and increases confidence in the original understanding over time (or acts as a canary-in-the-coalmine forcing function when the original understanding turns out to be faulty). The tool itself is ultimately just a bonus. Bringing it back to the concrete freeze-the-dead issue, and the question of an XID-cutoff for safely interrogating CLOG: I guess it will be possible to assess a HOT chain as a whole. We can probably do this safely while holding a super-exclusive lock on the buffer. I can probably find a way to ensure this only needs to happen in a rare slow path, when it looks like the invariant might be violated but we need to make sure (I'm already following this pattern in a couple of places). Realistically, there will be some amount of "try it and see" here. -- Peter Geoghegan
Re: [HACKERS] [COMMITTERS] pgsql: Fix traversal of half-frozen update chains
On Tue, Oct 17, 2017 at 3:02 AM, Alvaro Herrera wrote: > Yeah, me too. If you see another way to fix the problem, let's discuss > it. I doubt that there is a better way. > I think a possible way is to avoid considering that the relfrozenxid > value computed by the caller is final. While that alternative seems possible, it also seems riskier. > One thing I didn't quite investigate is why this bug only shows up with > multixacts so far. Is it just because multixacts provide an easy way to > reproduce it, and that there are others, more difficult ways to cause > the same problem without involving multixacts? If so, then the problem > is likely present in 9.2 as well. The obvious explanation (although not necessarily the correct one) is that freezing didn't have a MultiXactIdGetUpdateXid() call in 9.2. The way we pass down both cutoff_xid and cutoff_multi to FreezeMultiXactId() seems like it might be involved in the data corruption that we saw (the incorrect pruning/failed to find parent tuple thing). I might spend some time figuring this out later in the week. It's hard to pin down, and I've only really started to learn about MultiXacts in the past few months. -- Peter Geoghegan
Re: [HACKERS] [PATCH] pageinspect function to decode infomasks
On Sat, Oct 14, 2017 at 2:47 PM, Peter Geoghegan wrote: > I am working on an experimental version of pg_filedump, customized to > output XML that can be interpreted by an open source hex editor. The > XML makes the hex editor produce color coded, commented > tags/annotations for any given heap or B-Tree relation. This includes > tooltips with literal values for all status bits (including > t_infomask/t_infomask2 bits, IndexTuple bits, B-Tree meta page status > bits, PD_* page-level bits, ItemId bits, and others). This is now available from: https://github.com/petergeoghegan/pg_hexedit -- Peter Geoghegan
[HACKERS] Re: heap/SLRU verification, relfrozenxid cut-off, and freeze-the-dead bug (Was: amcheck (B-Tree integrity checking tool))
On Fri, Oct 13, 2017 at 7:09 PM, Noah Misch wrote: > All good questions; I don't know offhand. Discovering those answers is > perhaps the chief labor required of such a project. ISTM that by far the hardest part of the project is arriving at a consensus around what a good set of invariants for CLOG and MultiXact looks like. I think that it's fair to say that this business with relfrozenxid now appears to be more complicated than many of us would have thought. Or, at least, more complicated than I thought when I first started thinking about it. Once we're measuring this complexity (by having checks), we should be in a better position to keep it under control, and to avoid ambiguity. > The checker should > consider circumstances potentially carried from past versions via pg_upgrade. Right. False positives are simply unacceptable. > Fortunately, if you get some details wrong, it's cheap to recover from checker > bugs. Ideally, amcheck will become a formal statement of the contracts provided by major subsystems, such as the heapam, the various SLRUs, and so on. While I agree that having bugs there is much less severe than having bugs in backend code, I would like the tool to reach a point where it actually *defines* correctness (by community consensus). If a bug in amcheck reflects a bug in our high level thinking about correctness, then that actually is a serious problem. Arguably, it's the most costly variety of bug that Postgres can have. I may never be able to get general buy-in here; building broad consensus like that is a lot harder than writing some code for a contrib module. Making the checking code the *authoritative* record of how invariants are *expected* to work is a major goal of the project, though. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] pageinspect function to decode infomasks
On Sat, Oct 14, 2017 at 10:58 AM, Robert Haas wrote: > I think it's perfectly sensible to view those 2 bits as making up a > 2-bit field with 4 states rather than displaying each bit > individually, but you obviously disagree. Fair enough. I guess it is that simple. > I can think of two possible explanations for that. Number one, the > tool was written before HEAP_XMIN_FROZEN was invented and hasn't been > updated for those changes. Have we invented our last t_infomask/t_infomask2 (logical) status already? > Number two, the author of the tool agrees > with your position rather than mine. I am working on an experimental version of pg_filedump, customized to output XML that can be interpreted by an open source hex editor. The XML makes the hex editor produce color coded, commented tags/annotations for any given heap or B-Tree relation. This includes tooltips with literal values for all status bits (including t_infomask/t_infomask2 bits, IndexTuple bits, B-Tree meta page status bits, PD_* page-level bits, ItemId bits, and others). I tweeted about this several months ago, when it was just a tool I wrote for myself, and received a surprisingly positive response. It seems like I'm on to something, and should release the tool to the community. I mention this project because it very much informs my perspective here. Having spent quite a while deliberately corrupting test data in novel ways, just to see what happens, the "work backwards from the storage format" perspective feels very natural to me. I do think that I understand where you're coming from too, though. -- Peter Geoghegan
Re: [HACKERS] [PATCH] pageinspect function to decode infomasks
On Fri, Oct 13, 2017 at 1:02 PM, Robert Haas wrote: >> I don't think it's our place to "interpret" the bits. Are we *also* >> going to show HEAP_XMIN_FROZEN when xmin is physically set to >> FrozenTransactionId? > > No, of course not. We're talking about how to display the 256 and 512 > bits of t_infomask. Those have four states: nothing, committed, > invalid, frozen. You're arguing that frozen isn't a real state, that > it's somehow just a combination of committed and invalid, but I think > that's the wrong way of thinking about it. No, I'm arguing that they're just bits. Show the bits, rather than interpreting what is displayed. Document that there are other logical states that are represented as composites of contradictory/mutually exclusive states. Anyone who hopes to interpret these values has to be an expert anyway, or willing to become something of an expert. There is a good chance that they've taken an interest because something is already wrong. > Before HEAP_XMIN_FROZEN existed, it would have been right to display > the state where both bits are set as committed|invalid, because that > would clearly show you that two things had been set that should never > both be set at the same time. But now that's a valid state with a > well-defined meaning and I think we should display the actual meaning > of that state. > >> Where does it end? > > I guess it ends wherever we decide to stop. You can take what you're saying much further. What about HEAP_XMAX_SHR_LOCK, and HEAP_MOVED? Code like HEAP_LOCKED_UPGRADED() pretty strongly undermines the idea that these composite values are abstractions. >> I think that we should prominently document that HEAP_XMIN_COMMITTED >> |HEAP_XMIN_ABORTED == HEAP_XMIN_FROZEN, rather than trying to hide >> complexity that we have no business hiding in a tool like pageinspect. > > I respect that opinion, but I don't think I'm trying to hide anything. 
> I think I'm proposing that we display the information in what I > believed to be the clearest and most accurate way. pg_filedump doesn't display HEAP_XMIN_FROZEN, either. (Nor does it ever display any of the other composite t_infomask/t_infomask2 values.) -- Peter Geoghegan
Re: [HACKERS] Pluggable storage
On Thu, Oct 12, 2017 at 2:23 PM, Robert Haas wrote: > On Thu, Oct 12, 2017 at 4:38 PM, Alexander Korotkov >> However I imply that alternative storage would share our "MVCC model". So, >> it >> should share our transactional model including transactions, >> subtransactions, snapshots etc. >> Therefore, if alternative storage is transactional, then in particular it >> should be able to fetch tuple with >> given TID according to given snapshot. However, how it's implemented >> internally is >> a black box for us. Thus, we don't insist that tuple should have different >> TID after update; >> we don't insist there is any analogue of HOT; we don't insist alternative >> storage needs vacuum >> (or if even it needs vacuum, it might be performed in completely different >> way) and so on. > > Fully agreed. If we implement that interface, where does that leave EvalPlanQual()? Do those semantics have to be preserved? -- Peter Geoghegan
Re: [HACKERS] [COMMITTERS] pgsql: Fix traversal of half-frozen update chains
On Fri, Oct 13, 2017 at 4:54 AM, Robert Haas wrote: > I haven't really followed this thread in depth, but I'm very nervous > about the idea that we should ever be examining the raw-xmin of a > tuple that has been marked HEAP_XMIN_FROZEN for anything other than > forensic purposes. I'm nervous about it as well. These problems are very difficult to reason about. > The design principle I followed in writing the > original patch was that marking a tuple HEAP_XMIN_FROZEN was identical > to setting the xmin to 2 except that the original xmin was still > available for manual inspection. I did point this out myself. > If we're finding that we need the > raw xmin after freezing, doesn't that mean that our freezing algorithm > was flat busted prior to that commit? I wouldn't put it that way at all. That commit provided us with the opportunity to put in a better fix for a problem with update chain traversal, a problem that was particularly critical when certain HOT pruning + freezing races occur (the "failed to find parent tuple" thing). It's clear that freezing was just as busted before and after that commit, though. > And maybe still busted when > things wrap around multiple times and raw-xmins are reused? I'm more concerned about the situation that will remain (or has now been created) on 9.3, where we don't have raw xmin, and so must use very forgiving FrozenTransactionId matching within HeapTupleUpdateXmaxMatchesXmin(). -- Peter Geoghegan
Re: [HACKERS] [PATCH] pageinspect function to decode infomasks
On Tue, Aug 15, 2017 at 10:54 AM, Robert Haas wrote: >> Or at least make the filtering optional. > > I don't think "filtering" is the right way to think about it. It's > just labeling each combination of bits with the meaning appropriate to > that combination of bits. I do. -1 to not just showing what's on the page -- if the HEAP_XMIN_COMMITTED and HEAP_XMIN_ABORTED bits are set, then I think we should show them. Yeah, I accept that there is a real danger of confusing people with that. Unfortunately, I think that displaying HEAP_XMIN_FROZEN will cause even more confusion. I don't think that HEAP_XMIN_FROZEN is an abstraction at all. It's a notational convenience. I don't think it's our place to "interpret" the bits. Are we *also* going to show HEAP_XMIN_FROZEN when xmin is physically set to FrozenTransactionId? Where does it end? I think that we should prominently document that HEAP_XMIN_COMMITTED|HEAP_XMIN_ABORTED == HEAP_XMIN_FROZEN, rather than trying to hide complexity that we have no business hiding in a tool like pageinspect. -- Peter Geoghegan
Re: [HACKERS] Pluggable storage
On Wed, Oct 11, 2017 at 1:08 PM, Robert Haas wrote: > On Mon, Oct 9, 2017 at 10:22 AM, Alexander Korotkov > wrote: >> For me, it's crucial point that pluggable storages should be able to have >> different MVCC implementation, and correspondingly have full control over >> its interactions with indexes. >> Thus, it would be good if we would get consensus on that point. I'd like >> other discussion participants to comment whether they agree/disagree and >> why. >> Any comments? > > I think it's good for new storage managers to have full control over > interactions with indexes. I'm not sure about the MVCC part. I think > it would be legitimate to want a storage manager to ignore MVCC > altogether - e.g. to build a non-transactional table. I agree with Alexander -- if you're going to have a new MVCC implementation, you have to do significant work within index access methods. Adding "retail index tuple deletion" is probably just the beginning. ISTM that you need something like InnoDB's purge thread when index values change, since two versions of the same index tuple (each with distinct attribute values) have to physically co-exist for a time. > I don't know > that it would be a very good idea to have two different full-fledged > MVCC implementations, though. Like Tom says, that would be > replicating a lot of the awfulness of the MySQL model. It's not just the MySQL model, FWIW. SQL-on-Hadoop systems like Impala, certain NoSQL systems, and AFAIK any database system that claims to have pluggable storage all do it this way. That is, core transaction management functions (e.g. MVCC snapshot acquisition) is outsourced to the storage engine. It *is* very cumbersome, but that's what they do. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On markers of changed data
On Sat, Oct 7, 2017 at 6:34 AM, Stephen Frost wrote: >> > That’s actually what pg_rman is doing for what it calls incremental >> > backups (perhaps that would be differential backup in PG >> > terminology?), and the performance is bad as you can imagine. We could >> > have a dedicated LSN map to do such things with 4 bytes per page. I am >> > still not convinced that this much facility and the potential bug >> > risks are worth it though, Postgres already knows about differential >> > backups if you shape it as a delta of WAL segments. I think that, in >> > order to find a LSN map more convincing, we should find first other >> > use cases where it could become useful. Some use cases may pop up with >> > VACUUM, but I have not studied the question hard enough... >> >> The case I've discussed with barman developers is a large database >> (couple dozen of TBs should be enough) where a large fraction (say 95%) >> is read-only but there are many changes to the active part of the data, >> so that WAL is more massive than size of active data. > > Yes, we've seen environments like that also. I'm pretty sure that those cases are cases where there are many more FPIs than might be expected, due to a lack of locality. (UUID PKs can make the size of WAL balloon, for example.) -- Peter Geoghegan
[HACKERS] heap/SLRU verification, relfrozenxid cut-off, and freeze-the-dead bug (Was: amcheck (B-Tree integrity checking tool))
On Sun, Oct 16, 2016 at 6:46 PM, Noah Misch wrote: >> Noah and I discussed possible future directions for amcheck in person >> recently. I would like to get Noah's thoughts again here on how a tool >> like amcheck might reasonably target other access methods for >> verification. In particular, the heapam. MultiXacts were mentioned as >> a structure that could receive verification in a future iteration of >> this tool, but I lack expertise there. > > Yes, a heap checker could examine plenty of things. Incomplete list: The heap checking enhancement that is under review in the next CF [1], which checks an index against the table it indexes, doesn't directly test any of the things you outlined in this mail almost exactly a year ago. (That said, some of the same things are checked indirectly, simply by using IndexBuildHeapScan().) I do still think that there is a very real need for a "direct heap/SLRU checker" function, though. I only put off implementing one because there was still some low-hanging fruit. I would like to go over your suggestions again now. My understanding of what we *can* do has evolved in the past several months. > - Detect impossible conditions in the hint bits. A tuple should not have both > HEAP_XMAX_COMMITTED and HEAP_XMAX_INVALID. Every tuple bearing > HEAP_ONLY_TUPLE should bear HEAP_UPDATED. HEAP_HASVARWIDTH should be true > if and only if the tuple has a non-NULL value in a negative-typlen column, > possibly a dropped column. A tuple should not have both HEAP_KEYS_UPDATED > and HEAP_XMAX_LOCK_ONLY. It certainly makes sense to check for clearly contradictory hint bits like this. > - Verify agreement between CLOG, MULTIXACT, and hint bits. This is where it gets complicated, I think. This is what I really want to talk about. The recent commit 20b65522 pretty clearly established that a tuple could technically escape freezing (never having HEAP_XMIN_FROZEN set) despite actually being before a relfrozenxid cut-off.
The idea was that as long as we reliably set hint bits on heap-only tuples through WAL-logging, it doesn't matter that their CLOG could be truncated, because nobody will ever need to interrogate the CLOG anyway (to coin a phrase, the heap-only tuple nevertheless still had its xmax "logically frozen"). If nothing else, this leaves me with a very squishy set of invariant conditions to work with when I go to verify agreement with CLOG, MULTIXACT, and hint bits. So, the question is: What is the exact set of invariant conditions that I can check when I go to verify agreement between a heap relation and the various SLRUs? In particular, at what precise XID-wise point do CLOG and MULTIXACT stop reliably storing transaction status? And, is there a different point for the xmax of tuples that happen to be heap-only tuples? Another important concern here following 20b65522 is: Why is it safe to assume that nobody will ever call TransactionIdDidCommit() for "logically frozen" heap-only tuples that are not at the end of their HOT chain, and in so doing get a wrong answer? I can't find a rule that implies that there is no dangerous ambiguity that's written down anywhere. I *can* find a comment that suggests that it's wrong, though -- the "N.B." comment at the top of heap_prepare_freeze_tuple(). (Granted, that comment doesn't seem to acknowledge the fact that the caller does WAL-log as part of freezing; there has been some churn in this area.) I'm pretty sure that the classic rule for TransactionIdDidCommit() when called from tqual.c was that the tuple cannot have HEAP_XMIN_FROZEN set, which was straightforward enough back when a frozen tuple was assumed to have necessarily committed (and to have a "raw" xmin point that should logically be visible to everyone). This business of "logically frozen" xmax values for dead heap-only tuples makes things *way* more complicated, though. 
I'm concerned that we've failed to account for that, and that TransactionIdDidCommit() calls concerning pre-relfrozenxid XIDs can still happen. I should point out that, for whatever reason, there is evidence that we do in fact risk TransactionIdDidCommit() calls that give wrong answers (independent of the ~2014 MULTIXACT bugs where that was clearly the case, and even before 20b65522): Jeff Janes showed [2] that there is probably some unaccounted-for case where "premature" truncation takes place. This may be related to the recent HOT chain/freezing bugs, and our (only partial [3]) fixes may have fixed that, too -- I just don't know. [1] https://commitfest.postgresql.org/15/1263/ [2] https://postgr.es/m/CAMkU=1y-tnp_7jdfg_ubxqdb8esmo280acdgdwsa429hrte...@mail.gmail.com [3] https://postgr.es/m/20171007232524.sdpi3jk2636fvjge@alvherre.pgsql -- Peter Geoghegan
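The "impossible conditions in the hint bits" checks from Noah's list have a natural fast-path shape. A rough sketch covering the first two items on that list -- bit values copied from src/include/access/htup_details.h; the function is hypothetical, not actual amcheck code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values copied from src/include/access/htup_details.h */
#define HEAP_XMAX_COMMITTED  0x0400  /* t_infomask */
#define HEAP_XMAX_INVALID    0x0800  /* t_infomask */
#define HEAP_UPDATED         0x2000  /* t_infomask */
#define HEAP_ONLY_TUPLE      0x8000  /* t_infomask2 */

/*
 * Hypothetical checker fast path: reject bit combinations that the first
 * two items on Noah's list call impossible.  Returns true when no
 * contradiction is found.
 */
static bool
infomask_is_sane(uint16_t t_infomask, uint16_t t_infomask2)
{
    /* xmax cannot be hinted both committed and invalid */
    if ((t_infomask & HEAP_XMAX_COMMITTED) && (t_infomask & HEAP_XMAX_INVALID))
        return false;

    /* every heap-only tuple is the product of a (HOT) update */
    if ((t_infomask2 & HEAP_ONLY_TUPLE) && !(t_infomask & HEAP_UPDATED))
        return false;

    return true;
}
```

Checks like these are cheap and unambiguous precisely because they need no CLOG or MULTIXACT lookups; the hard part, as above, is the agreement checks that do.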
Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple
On Sat, Oct 7, 2017 at 4:25 PM, Alvaro Herrera wrote: > Hmm, I think I added a random sleep (max. 100ms) right after the > HeapTupleSatisfiesVacuum call in vacuumlazy.c (lazy_scan_heap), and that > makes the race easier to hit. I still cannot reproduce. Perhaps you can be more specific? -- Peter Geoghegan
Re: [HACKERS] Prepared statements assume text type in PG10
On Sat, Oct 7, 2017 at 4:27 PM, Peter Geoghegan wrote: > I suspect commit d8d32d9 is involved here, though I haven't verified that. Weirdly, there is a git hash collision here, so you'll have to put in d8d32d9a (8 characters -- the default of 7 for a short git hash isn't cutting it). -- Peter Geoghegan
Re: [HACKERS] Prepared statements assume text type in PG10
On Sat, Oct 7, 2017 at 2:56 PM, Jack Christensen wrote:
> The test suite for the Go PostgreSQL driver pgx
> (https://github.com/jackc/pgx) found an unexpected behavior change in PG10.
> Previously, it was impossible to prepare a statement with an unknown or
> ambiguous parameter type.
>
> Pre-version 10:
>
> jack=# prepare ps as select $1;
> ERROR: could not determine data type of parameter $1
>
> But on PG10 the type defaults to text:
>
> jack=# prepare ps as select $1;
> PREPARE
> Time: 0.183 ms
> jack=# execute ps('Hello, there');
>    ?column?
> --------------
>  Hello, there
> (1 row)
>
> Time: 0.437 ms
>
> I looked through the git log and couldn't find any commits referencing this.
> Is this an intended behavior change?

I suspect commit d8d32d9 is involved here, though I haven't verified that. -- Peter Geoghegan
Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple
On Sat, Oct 7, 2017 at 1:31 AM, Alvaro Herrera wrote:
>> As you must have seen, Alvaro said he has a variant of Dan's original
>> script that demonstrates that a problem remains, at least on 9.6+,
>> even with today's fix. I think it's the stress-test that plays with
>> fillfactor, many clients, etc [1].
>
> I just execute setup.sql once and then run this shell command,
>
> while :; do
>   psql -e -P pager=off -f ./repro.sql
>   for i in `seq 1 5`; do
>     psql -P pager=off -e --no-psqlrc -f ./lock.sql &
>   done
>   wait && psql -P pager=off -e --no-psqlrc -f ./reindex.sql
>   psql -P pager=off -e --no-psqlrc -f ./report.sql
>   echo "done"
> done

I cannot reproduce the problem on my personal machine using this script/stress-test. I tried to do so on the master branch git tip. This reinforces the theory that there is some timing sensitivity, because the remaining race condition is very narrow. -- Peter Geoghegan
Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple
On Fri, Oct 6, 2017 at 2:09 PM, Wong, Yi Wen wrote: > Yesterday, I've been spending time with pg_visibility on the pages when I > reproduce the issue in 9.6. > None of the all-frozen or all-visible bits are necessarily set in problematic > pages. Since this happened yesterday, I assume it was with an unfixed version? As you must have seen, Alvaro said he has a variant of Dan's original script that demonstrates that a problem remains, at least on 9.6+, even with today's fix. I think it's the stress-test that plays with fillfactor, many clients, etc [1]. I've tried to independently reproduce the problem on the master branch's current tip, with today's new fix, but cannot break things despite trying many variations. I cannot reproduce the problem that Alvaro still sees. I'll have to wait until Alvaro posts his repro to the list before commenting further, which I assume he'll post as soon as he can. There doesn't seem to be much point in not waiting for that. [1] https://postgr.es/m/20171005162402.jahqflf3mekileqm@alvherre.pgsql -- Peter Geoghegan
Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple
On Fri, Oct 6, 2017 at 11:34 AM, Peter Geoghegan wrote: >> I don't know if it's really the freeze map at fault or something else. > > Ideally, it would be possible to effectively disable the new freeze > map stuff in a minimal way, for testing purposes. Perhaps the authors > of that patch, CC'd, can suggest a way to do that. Actually, the simplest thing might be to just use pg_visibility's pg_check_frozen() to check that the visibility/freeze map accurately summarizes the all-frozen status of tuples in the heap. If that doesn't indicate that there is corruption, we can be fairly confident that the problem is elsewhere. The metadata in the visibility/freeze map should be accurate when a bit is set to indicate that an entire heap page is all-frozen (or, separately, all-visible). We can hardly expect it to have better information than the authoritative source of truth, the heap itself. The more I think about it, the more I tend to doubt that the remaining problems are with the freeze map. If the freeze map were wrong, and incorrectly said that a page was all-frozen, then surely the outward symptoms would take a long time to show up, as they always do when we accidentally fail to freeze a tuple before a relfrozenxid cutoff. ISTM that that's the only meaningful way that the freeze map can be wrong -- it only promises to be accurate when it says that no further freezing is needed for a page/bit. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple
On Fri, Oct 6, 2017 at 10:49 AM, Alvaro Herrera wrote: > I can tell that, in 9.6, REINDEX still reports the error we saw in > earlier releases, after some of the runs of my reproducer scripts. I'm > unable to reproduce it anymore in 9.3 to 9.5. I can't see the one Dan > originally reported anywhere, either. You mean the enhanced stress-test that varied fillfactor, added filler columns, and so on [1]? Can you post that to the list, please? I think that several of us would like to have a reproducible test case. > I don't know if it's really the freeze map at fault or something else. Ideally, it would be possible to effectively disable the new freeze map stuff in a minimal way, for testing purposes. Perhaps the authors of that patch, CC'd, can suggest a way to do that. If I had to guess, I'd say that it's just as likely that the issue is only reproducible on 9.6 because of the enhancements added in that release that improved buffer pinning (the use of atomic ops to pin buffers, moving buffer content locks into buffer descriptors, etc). It was already a bit tricky to get the problem that remained after 20b6552 but before today's a5736bf to reproduce with Dan's script. It often took me 4 or 5 attempts. (I wonder what it looks like with your enhanced version of that script -- the one that I just asked about.) It seems possible that we've merely reduced the window for the race to the point that it's practically (though not theoretically) impossible to reproduce the problem on versions < 9.6, though not on 9.6+. Applying Occam's razor, the problem doesn't seem particularly likely to be in the freeze map stuff, which isn't actually all that closely related. [1] https://postgr.es/m/20171005162402.jahqflf3mekileqm@alvherre.pgsql -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple
On Fri, Oct 6, 2017 at 7:59 AM, Alvaro Herrera wrote: > By the way, I still wonder if there's any way for a new tuple to get > inserted in the place where a HOT redirect would be pointing to, and > have it be marked as Frozen, where the old redirect contains a > non-invalid Xmax. I tried to think of a way for that to happen, but > couldn't think of anything. > > What I imagine is a sequence like this: > > 1. insert a tuple > 2. HOT-update a tuple > 3. prune the page, making lp 1 be a redirect (lp 2 is the new tuple) > 4. start transaction > 5. HOT-update the tuple again, creating HOT in lp 3 > 6. abort transaction (leaving aborted update in lp 3) > 7. somehow remove tuple from lp 3, make slot available > 8. new transaction comes along, inserts tuple in lp 3 > 9. somebody freezes tuple in lp3 (???) > > Then we follow the HOT chain, see that Xmin=2 in lp3 and conclude that > the tuple is part of the chain because of an xid "match". > > Basically from step 7 onwards I don't think this is possible, but maybe > I'm just blind. For the record, I also think that this is impossible, in part because pruning requires a cleanup buffer lock (and because HOT chains cannot span pages). I wouldn't say that I am 100% confident about this, though. BTW, is this comment block that appears above heap_prepare_freeze_tuple() now obsolete, following 20b65522 (and maybe much earlier commits)? * NB: It is not enough to set hint bits to indicate something is * committed/invalid -- they might not be set on a standby, or after crash * recovery. We really need to remove old xids. */ We WAL-log setting hint bits during freezing now, iff tuple xmin is before the Xid cutoff and tuple is a heap-only tuple. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple
On Fri, Oct 6, 2017 at 6:18 AM, Alvaro Herrera wrote: > Wood, Dan wrote: >> Yes, I’ve been testing 9.6. I’ll try Alvaro’s patch today. >> >> I would prefer to focus on either latest 9X or 11dev. > > I tested my patch on 9.4 and 9.5 today and it seems to close the problem > (with the patch, I waited 10x as many iterations as it took for the > problem to occur ~10 times without the patch), but I can reproduce a > problem in 9.6 with my patch installed. There must be something new in > 9.6 that is causing the problem to reappear. What problem persists? The original one (or, at least, the original symptom of pruning HOT chains incorrectly)? If that's what you mean, I wouldn't be so quick to assume that it's the freeze map. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] A design for amcheck heapam verification
On Fri, Sep 29, 2017 at 10:54 AM, Peter Geoghegan wrote: >> Something that allocates new memory as the patch's bloom_init() >> function does I'd tend to call 'make' or 'create' or 'new' or >> something, rather than 'init'. > > I tend to agree. I'll adopt that style in the next version. I just > didn't want the caller to have to manage the memory themselves. v3 of the patch series, attached, does it that way -- it adds a bloom_create(). The new bloom_create() function still allocates its own memory, but does so while using a FLEXIBLE_ARRAY_MEMBER. A separate bloom_init() function (that works with dynamic shared memory) could easily be added later, for the benefit of parallel hash join. Other notable changes: * We now support bloom filters that have bitsets of up to 512MB in size. The previous limit was 128MB. * We now use TransactionXmin for our AccessShareLock xmin cutoff, rather than calling GetOldestXmin(). This is the same cut-off used by xacts that must avoid broken hot chains for their earliest snapshot. This avoids a scan of the proc array, and allows more thorough verification, since GetOldestXmin() was overly restrictive here. * Expanded code comments describing the kinds of problems the new verification capability is expected to be good at catching. For example, there is now a passing reference to the CREATE INDEX CONCURRENTLY bug that was fixed back in February (it's given as an example of a more general problem -- faulty HOT safety assessment). With the new heapallindexed enhancement added by this patch, amcheck can be expected to catch that issue much of the time. We also go into heap-only tuple handling within IndexBuildHeapScan(). The way that CREATE INDEX tries to index the most recent tuple in a HOT chain (while locating the root tuple in the chain, to get the right heap TID for the index) has proven to be very useful as a smoke test while investigating HOT/VACUUM FREEZE bugs in the past couple of weeks [1]. 
I believe it would have caught several historic MultiXact/recovery bugs, too. This all seems worth noting explicitly. [1] https://postgr.es/m/cah2-wznm4rcrhfaiwkpwtpew2bxdtgrozk7jwwgucxeh3d1...@mail.gmail.com -- Peter Geoghegan From 3bed03a9e0506c0b81097b634c5f1b5534a2dcb3 Mon Sep 17 00:00:00 2001 From: Peter Geoghegan Date: Tue, 2 May 2017 00:19:24 -0700 Subject: [PATCH 2/2] Add amcheck verification of indexes against heap. Add a new, optional capability to bt_index_check() and bt_index_parent_check(): callers can check that each heap tuple that ought to have an index entry does in fact have one. This happens at the end of the existing verification checks. This is implemented by using a Bloom filter data structure. The implementation performs set membership tests within a callback (the same type of callback that each index AM registers for CREATE INDEX). The Bloom filter is populated during the initial index verification scan. --- contrib/amcheck/Makefile | 2 +- contrib/amcheck/amcheck--1.0--1.1.sql| 28 contrib/amcheck/amcheck.control | 2 +- contrib/amcheck/expected/check_btree.out | 14 +- contrib/amcheck/sql/check_btree.sql | 9 +- contrib/amcheck/verify_nbtree.c | 275 --- doc/src/sgml/amcheck.sgml| 157 ++ src/include/utils/snapmgr.h | 2 +- 8 files changed, 423 insertions(+), 66 deletions(-) create mode 100644 contrib/amcheck/amcheck--1.0--1.1.sql diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile index 43bed91..c5764b5 100644 --- a/contrib/amcheck/Makefile +++ b/contrib/amcheck/Makefile @@ -4,7 +4,7 @@ MODULE_big = amcheck OBJS = verify_nbtree.o $(WIN32RES) EXTENSION = amcheck -DATA = amcheck--1.0.sql +DATA = amcheck--1.0--1.1.sql amcheck--1.0.sql PGFILEDESC = "amcheck - function for verifying relation integrity" REGRESS = check check_btree diff --git a/contrib/amcheck/amcheck--1.0--1.1.sql b/contrib/amcheck/amcheck--1.0--1.1.sql new file mode 100644 index 000..e6cca0a --- /dev/null +++ b/contrib/amcheck/amcheck--1.0--1.1.sql @@ -0,0 +1,28 @@ +/* 
contrib/amcheck/amcheck--1.0--1.1.sql */ + +-- complain if script is sourced in psql, rather than via CREATE EXTENSION +\echo Use "ALTER EXTENSION amcheck UPDATE TO '1.1'" to load this file. \quit + +-- +-- bt_index_check() +-- +DROP FUNCTION bt_index_check(regclass); +CREATE FUNCTION bt_index_check(index regclass, +heapallindexed boolean DEFAULT false) +RETURNS VOID +AS 'MODULE_PATHNAME', 'bt_index_check' +LANGUAGE C STRICT PARALLEL RESTRICTED; + +-- +-- bt_index_parent_check() +-- +DROP FUNCTION bt_index_parent_check(regclass); +CREATE FUNCTION bt_index_parent_check(index regclass, +heapallindexed boolean DEFAULT false) +RETURNS VOID +AS 'MODULE_PATHNAME', 'bt_index_parent_check' +LANGUAGE
Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple
On Thu, Oct 5, 2017 at 4:35 AM, Alvaro Herrera wrote: > At any rate, I was thinking of a new routine to encapsulate the logic, > > /* > * Check the tuple XMIN against prior XMAX, if any > */ > if (!HeapTupleUpdateXmaxMatchesXmin(priorXmax, > HeapTupleHeaderGetXmin(htup))) > break; > > where ..XmaxMatchesXmin would know about checking for possibly frozen > tuples. Makes sense. >> I can see why it >> would be safe (or at least no more dangerous) to rely on >> HeapTupleHeaderGetRawXmin() in the way mentioned here, at least on >> installations that initdb'd on a version after commit 37484ad2a >> (version 9.4+). However, I'm not sure why what you propose here would >> be safe when even raw xmin happens to be FrozenTransactionId. Are you >> sure that that's truly race-free? If it's really true that we only >> need to check for FrozenTransactionId on 9.3, why not just do that on >> all versions, and never bother with HeapTupleHeaderGetRawXmin()? >> ("Sheer paranoia" is a valid answer; I just want us to be clear on the >> reasoning.) > > I think the RawXmin is the better mechanism. I'm not absolutely certain > that the window is completely closed in 9.3; as I understand things, it > is possible for transaction A to prune an aborted heap-only tuple, then > transaction B to insert a frozen tuple in the same location, then > transaction C follows a link to the HOT tuple that was pruned. I think we'd > end up considering that the new frozen tuple is part of the chain, which > is wrong ... In 9.4 we can compare the real xmin. Good point: the race doesn't exist on 9.4+ with pg_upgrade from 9.3, when raw xmin happens to be FrozenTransactionId, because there can be no new tuples that have that property. This possible race is strictly a 9.3 issue. (You have to deal with a raw xmin of FrozenTransactionId on 9.4+, but there are no race conditions, because affected tuples are all pre-pg_upgrade tuples -- they were frozen months ago, not microseconds ago.) 
> I hope I am proved wrong about this, because if not, I think we're > looking at an essentially unsolvable problem in 9.3. Well, as noted within README.HOT, the xmin/xmax matching did not appear in earlier versions of the original HOT patch, because it was thought that holding a pin of the buffer was a sufficient interlock against concurrent line pointer recycling (pruning must have a buffer cleanup lock). Clearly it is more robust to match xmin to xmax, but is there actually any evidence that it was really necessary at all (for single-page HOT chain traversals) when HOT went in in 2007, or that it has since become necessary? heap_page_prune()'s caller has to have a buffer cleanup lock. I'm certainly not advocating removing the xmin/xmax matching within heap_prune_chain() on 9.3. However, it may be acceptable to rely on holding a buffer cleanup lock within heap_prune_chain() on 9.3, just for the rare/theoretical cases where the FrozenTransactionId raw xmin ambiguity means that xmin/xmax matching alone might not be enough. As you say, it seems unsolvable through looking at state on the page directly on 9.3, so there may be no other way. And, that's still strictly better than what we have today. > As far as I understand, this problem only emerges if one part of a HOT > chain reaches the min freeze age while another part of the same chain is > still visible to some running transaction. It is particularly > noticeable in our current test case because we use a min freeze age of > 0, with many concurrent sessions modifying the same page. What this says to me > is that VACUUM FREEZE is mildly dangerous when there's lots of highly > concurrent HOT UPDATE activity. I'm not sure what you mean here. It is dangerous right now, because there is a bug, but we can squash the bug. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple
On Wed, Oct 4, 2017 at 6:46 AM, Alvaro Herrera wrote: > Wong, Yi Wen wrote: >> My interpretation of README.HOT is the check is just to ensure the chain is >> continuous; in which case the condition should be: >> >> > if (TransactionIdIsValid(priorXmax) && >> > !TransactionIdEquals(priorXmax, >> > HeapTupleHeaderGetRawXmin(htup))) >> > break; >> >> So the difference is GetRawXmin vs GetXmin, because otherwise we get the >> FreezeId instead of the Xmin when the transaction happened As you know, on version 9.4+, as of commit 37484ad2a, we decided that we are "largely ignoring the value to which it [xmin] is set". The expectation became that raw xmin is available after freezing, but mostly for forensic purposes. I think Alvaro should now memorialize the idea that its value is actually critical in some place (htup_details.h?). > I independently arrived at the same conclusion. Since I was trying with > 9.3, the patch differs -- in the old version we must explicitely test > for the FrozenTransactionId value, instead of using GetRawXmin. Obviously you're going to have to be prepared for a raw xmin of FrozenTransactionId, even on 9.4+, due to pg_upgrade. I can see why it would be safe (or at least no more dangerous) to rely on HeapTupleHeaderGetRawXmin() in the way mentioned here, at least on installations that initdb'd on a version after commit 37484ad2a (version 9.4+). However, I'm not sure why what you propose here would be safe when even raw xmin happens to be FrozenTransactionId. Are you sure that that's truly race-free? If it's really true that we only need to check for FrozenTransactionId on 9.3, why not just do that on all versions, and never bother with HeapTupleHeaderGetRawXmin()? ("Sheer paranoia" is a valid answer; I just want us to be clear on the reasoning.) Obviously any race would have a ridiculously tiny window, but it's not obvious why this protocol would be completely race-free (in the event of a FrozenTransactionId raw xmin). 
-- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple
On Wed, Oct 4, 2017 at 11:00 AM, Wood, Dan wrote: > The early “break;” here is likely the xmin frozen reason as I found in the > other loop. It looks that way. Since we're already very defensive when it comes to this xmin/xmax matching business, and we're defensive while following an update chain more generally, I wonder if we should be checking HeapTupleHeaderIsSpeculative() on versions >= 9.5 (versions with ON CONFLICT DO UPDATE, where t_ctid block number might actually be a speculative insertion token). Or, at least acknowledging that case in comments. I remember expressing concern that something like this was possible at the time that that went in. We certainly don't want to have heap_abort_speculative() "super delete" the wrong tuple in the event of item pointer recycling. There are at least defensive sanity checks that would turn that into an error within heap_abort_speculative(), so it wouldn't be a disaster if it was attempted. I don't think that it's possible in practice, and maybe it's sufficient that routines like heap_get_latest_tid() check for a sane item offset, which will discriminate against SpecTokenOffsetNumber, which must be well out of range for ItemIds on the page. Could be worth a comment. (I guess that heap_prune_chain() wouldn't need to be changed if we decide to add such comments, because the speculative tuple ItemId is going to be skipped over due to being ItemIdIsUsed() before we even get there.) -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple
On Tue, Oct 3, 2017 at 8:43 PM, Wong, Yi Wen wrote: > My interpretation of README.HOT is the check is just to ensure the chain is > continuous; in which case the condition should be: > >> if (TransactionIdIsValid(priorXmax) && >> !TransactionIdEquals(priorXmax, >> HeapTupleHeaderGetRawXmin(htup))) >> break; > > So the difference is GetRawXmin vs GetXmin, because otherwise we get the > FreezeId instead of the Xmin when the transaction happened I was thinking along similar lines. > The interesting consequence of changing that is the prune seems to get the > entire chain altogether with Dan's repro... I've run it a couple of times and > have consistently gotten the following page > > lp | t_ctid | lp_off | lp_flags | t_infomask | t_infomask2 > +++--++- > 1 | (0,1) | 8152 |1 | 2818 | 3 > 2 || 7 |2 || > 3 || 0 |0 || > 4 || 0 |0 || > 5 || 0 |0 || > 6 || 0 |0 || > 7 | (0,7) | 8112 |1 | 11010 | 32771 > (7 rows) That's also what I see. This is a good thing, I think; that's how we ought to prune. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple
On Tue, Oct 3, 2017 at 6:09 PM, Wood, Dan wrote: > I’ve just started looking at this again after a few weeks break. > if (TransactionIdIsValid(priorXmax) && > !TransactionIdEquals(priorXmax, > HeapTupleHeaderGetXmin(htup))) > break; > We need to understand why these TXID equal checks exist. Can we > differentiate the cases they are protecting against with the two exceptions > I’ve found? I haven't read your remarks here in full, since I'm about to stop working for the day, but I will point out that src/backend/access/heap/README.HOT says a fair amount about this, under "Abort Cases". -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple
On Tue, Oct 3, 2017 at 5:15 PM, Michael Paquier wrote: >> FYI, the repro case page contents looks like this with the patch applied: >> postgres=# select lp, lp_flags, t_xmin, t_xmax, t_ctid, >> to_hex(t_infomask) as infomask, >> to_hex(t_infomask2) as infomask2 >> from heap_page_items(get_raw_page('t', 0)); >> lp | lp_flags | t_xmin | t_xmax | t_ctid | infomask | infomask2 >> +--+-+++--+--- >> 1 |1 | 1845995 | 0 | (0,1) | b02 | 3 >> 2 |2 | ||| | >> 3 |0 | ||| | >> 4 |0 | ||| | >> 5 |0 | ||| | >> 6 |0 | ||| | >> 7 |1 | 1846001 | 0 | (0,7) | 2b02 | 8003 >> (7 rows) > > Is lp_off for tid (0,2) pointing to (0,7)? A hot chain preserved is > what would look correct to me. Yes, it is: postgres=# select * from bt_page_items('foo', 1); itemoffset | ctid | itemlen | nulls | vars | data +---+-+---+--+- 1 | (0,1) | 16 | f | f| 01 00 00 00 00 00 00 00 2 | (0,2) | 16 | f | f| 03 00 00 00 00 00 00 00 (2 rows) I can tell from looking at my hex editor that the 4 bytes of ItemId that we see for position '(0,2)' in the ItemId array are "07 00 01 00", meaning that '(0,2)' this is a LP_REDIRECT item, repointing us to '(0,7)'. Everything here looks sane to me, at least at first blush. > - * Check the tuple XMIN against prior XMAX, if any > - */ > -if (TransactionIdIsValid(priorXmax) && > -!TransactionIdEquals(HeapTupleHeaderGetXmin(htup), priorXmax)) > -break; > If you remove this check, you could also remove completely priorXmax. > > Actually, I may be missing something, but why is priorXmax updated > even for dead tuples? For example just doing that is also taking care > of the problem: I'll study what you suggest here some more tomorrow. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple
On Tue, Oct 3, 2017 at 9:48 AM, Alvaro Herrera wrote: > But that still doesn't fix the problem; > as far as I can see, vacuum removes the root of the chain, not yet sure > why, and then things are just as corrupted as before. Are you sure it's not opportunistic pruning? Another thing that I've noticed with this problem is that the relevant IndexTuple will pretty quickly vanish, presumably due to LP_DEAD setting (but maybe not actually due to LP_DEAD setting). (Studies the problem some more...) I now think that it actually is a VACUUM problem, specifically a problem with VACUUM pruning. You see the HOT xmin-to-xmax check pattern that you mentioned within heap_prune_chain(), which looks like where the incorrect tuple prune (or possibly, at times, redirect?) takes place. (I refer to the prune/kill that you mentioned today, that frustrated your first attempt at a fix -- "I modified the multixact freeze code...".) The attached patch "fixes" the problem -- I cannot get amcheck to complain about corruption with this applied. And, "make check-world" passes. Hopefully it goes without saying that this isn't actually my proposed fix. It tells us something that this at least *masks* the problem, though; it's a start. 
FYI, the repro case page contents looks like this with the patch applied: postgres=# select lp, lp_flags, t_xmin, t_xmax, t_ctid, to_hex(t_infomask) as infomask, to_hex(t_infomask2) as infomask2 from heap_page_items(get_raw_page('t', 0)); lp | lp_flags | t_xmin | t_xmax | t_ctid | infomask | infomask2 +--+-+++--+--- 1 |1 | 1845995 | 0 | (0,1) | b02 | 3 2 |2 | ||| | 3 |0 | ||| | 4 |0 | ||| | 5 |0 | ||| | 6 |0 | ||| | 7 | 1 | 1846001 | 0 | (0,7) | 2b02 | 8003 (7 rows) -- Peter Geoghegan diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c index 52231ac..90eb39e 100644 --- a/src/backend/access/heap/pruneheap.c +++ b/src/backend/access/heap/pruneheap.c @@ -470,13 +470,6 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum, ItemPointerSet(&(tup.t_self), BufferGetBlockNumber(buffer), offnum); /* - * Check the tuple XMIN against prior XMAX, if any - */ - if (TransactionIdIsValid(priorXmax) && - !TransactionIdEquals(HeapTupleHeaderGetXmin(htup), priorXmax)) - break; - - /* * OK, this tuple is indeed a member of the chain. */ chainitems[nchain++] = offnum; -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple
On Tue, Oct 3, 2017 at 9:48 AM, Alvaro Herrera wrote: > which shows a HOT-update chain, where the t_xmax are multixacts. Then a > vacuum freeze comes, and because the multixacts are below the freeze > horizon for multixacts, we get this: > > select lp, lp_flags, t_xmin, t_xmax, t_ctid, to_hex(t_infomask) as infomask, > to_hex(t_infomask2) as infomask2 > from heap_page_items(get_raw_page('t', 0)); > lp | lp_flags | t_xmin | t_xmax | t_ctid | infomask | infomask2 > +--++++--+--- > 1 |1 | 2 | 0 | (0,1) | 902 | 3 > 2 |0 |||| | > 3 |1 | 2 | 14662 | (0,4) | 2502 | c003 > 4 |1 | 2 | 14663 | (0,5) | 2502 | c003 > 5 |1 | 2 | 14664 | (0,6) | 2502 | c003 > 6 |1 | 2 | 14665 | (0,7) | 2502 | c003 > 7 |1 | 2 | 0 | (0,7) | 2902 | 8003 > (7 rows) > > where the xmin values have all been frozen, and the xmax values are now > regular Xids. I thought that we no longer store FrozenTransactionId (xid 2) as our "raw" xmin while freezing, and yet that's what we see here. It looks like pageinspect is looking at raw xmin (it's calling HeapTupleHeaderGetRawXmin(), not HeapTupleHeaderGetXmin()). What's the deal with that? Shouldn't the original xmin be preserved for forensic analysis, following commit 37484ad2? I guess what you mean is that this is what you see after modifying the code to actually store FrozenTransactionId as xmin once more, in an effort to fix this? -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 64-bit queryId?
On Mon, Oct 2, 2017 at 9:10 AM, Robert Haas wrote: > On Mon, Oct 2, 2017 at 11:02 AM, Joshua D. Drake > wrote: >> +1 to both of these as well. > > OK, so here's a patch. Review appreciated. You need to change the SQL interface as well, although I'm not sure exactly how. The problem is that you are now passing a uint64 queryId to Int64GetDatumFast() within pg_stat_statements_internal(). That worked when queryId was a uint32, because you can easily represent values <= UINT_MAX as an int64/int8. However, you cannot represent the second half of the range of uint64 within an int64/int8. I think that this will behave differently depending on USE_FLOAT8_BYVAL, if nothing else. The background here is that we used int8 as the output type for the function when queryId was first exposed via the SQL interface because there was no 32-bit unsigned int type that we could have used (except possibly Oid, but that's weird). You see the same pattern in a couple of other places. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?
On Mon, Oct 2, 2017 at 9:42 AM, Peter Eisentraut wrote: > I understand where you are coming from. My experience with developing > this feature has been that it is quite fragile in some ways. We have > had this out there for testing for many months, and we have fixed many > bugs, and follow-up bugs, up to very recently. We don't have good > automatic test coverage, so we need to rely on user testing. Making > significant code and design changes a week or two before the final > release is just too late, even if the changes were undisputed. And this > wasn't just one specific isolated change, it was a set of overlapping > changes that seemingly kept changing and expanding. For the record, I don't accept that summary. We could have not bothered with the sanitization thing at all, and I wouldn't have cared very much. I even revised my proposal such that you would never have needed to do the language tag mapping at all [1] (albeit while desupporting ICU versions prior to 4.6). And, as recently as 7 days ago, you yourself said: "Reading over this code again, it is admittedly not quite clear why this "canonicalization" is in there right now. I think it had something to do with how we built the keyword variants at one point. It might not make sense. I'd be glad to take that out and use the result straight from uloc_getAvailable() for collcollate." [2] This would have been far more radical than what I proposed. > I'm satisfied that what we are shipping is "good enough". It has been > in development for over a year, it has been reviewed and tested for > months, and a lot of credit for that goes to you. It is "good enough", I suppose. I am disappointed by this particular outcome, but that's mostly because it could have been avoided. I'm still very happy that you took on this project, and I don't want that to be overshadowed by this particular disagreement. 
> I'm available to discuss the changes you are envisioning, preferably one > at a time, in the near future as part of the PG11 development process. I don't really think that there is anything more to be done about the issues raised on this thread, with the possible exception of the DEBUG1 display name string thing. The sanitization just isn't valuable enough to add after the fact. Apart from the risks of adding it after the fact, another reason not to bother with it is that it's *incredibly* forgiving. So forgiving that you could reasonably argue that we might as well not have it at all. I may well do more on ICU with you in the future, but not in the area of how things are canonicalized or sanitized. It's too late for that now. [1] https://www.postgresql.org/message-id/CAH2-WzmVtRyNg2gT=u=ktEC-jM3aKq4bYzJ0u2=osxe+o3k...@mail.gmail.com [2] https://www.postgresql.org/message-id/f6c0fca7-e277-3f46-c0c1-adc001bff...@2ndquadrant.com -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?
On Sun, Oct 1, 2017 at 6:27 AM, Peter Eisentraut wrote: > I'm still pondering committing my documentation patch I had posted, and > I've been reviewing your patches to see if there is anything else > documentation-wise that could be added. -1 from me -- I don't think we should tell users about the deprecated locale format at all. I hope that ICU continue to support the deprecated locale string format within ucol_open() into the future, and that users will only ever use BCP 47 within CREATE COLLATION (as the 'locale' string). Perhaps it won't matter that much in the end, and will turn out to be more of a wart than something that causes pain for users. It's hard to predict. > As I had mentioned before, I disagree with the substance of the > functionality changes you are proposing, and I think it would be way too > late to make such significant changes. The general community opinion > also seems to be in favor of letting things be. It does seem too late. It's disappointing that we did not do better here. This problem was entirely avoidable. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple
On Thu, Sep 28, 2017 at 1:20 PM, Alvaro Herrera wrote: > Maybe what this means is that we need to do both Dan's initially > proposed patch (or something related to it) apart from the fixes already > pushed. IOW we need to put back some of the "tupkeep" business ... I took the time to specifically check if that would fix the problem. Unfortunately, it did not. We see exactly the same problem, or at least amcheck/REINDEX produces exactly the same error. I checked both Dan's original update_freeze.patch, and your revision that retained some of the "tupkeep" stuff, 0002-Don-t-freeze-recently-dead-HOT-tuples, which you posted on September 6th. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?
On Sat, Sep 30, 2017 at 12:28 PM, Tom Lane wrote: > This suggests to me that arguing about canonicalization is moot so > far as avoiding reindexing goes: if you change ICU library versions, > you're screwed and will have to jump through all the reindexing hoops, > no matter what we do or don't do. (Maybe we are looking at the wrong > information to populate collversion?) Just to be clear, I was never arguing that canonicalization would avoid reindexing, and I didn't think anyone else was. I believe that the version string will change when the behavior of its collator changes for any reason and in any way. This includes changes to how binary sort keys are generated. (We don't currently store binary keys on disk, so a change to that representation doesn't really necessitate a REINDEX for us. The UCA spec explicitly decouples compatibility for indexing with binary keys from changes needed due to human requirements. ICU binary sort keys are compressed, and this is sometimes improved, independently of the evolution of how natural language experts say text should be sorted.) > Now, it may still be worthwhile to argue about whether canonicalization > will help the other component of the problem, which is whether you can > dump and reload CREATE COLLATION commands into a new ICU version and > expect to get more-or-less-the-same behavior as before. Again, to be clear, that's what I was arguing all along. I think that users will get very good results with this approach. When the sorting behavior of a locale is refined by natural language experts, it's almost certainly a very small change that most users that are affected won't actually notice. For example, when en-us-x-icu is changed, that's actually due to a general change in the inherited root collator that doesn't really affect English speakers. There is no practical question about how you're supposed to sort English text. 
-- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?
On Sat, Sep 30, 2017 at 8:25 AM, Tom Lane wrote: > I'd also argue that the point of adopting ICU was exactly so we *could* > distinguish those cases, and limit the scope of a normal upgrade to > "reindex these identifiable indexes and you're done". In the libc world, > when you upgrade libc's locale definitions, you have no idea what the > consequences are. Right. With libc, we think of collations as something that there is a small, fixed number of on a system, that we cannot safely assume anything about. But with ICU, all of the semantics of how natural languages should be sorted are exposed via various APIs, and there are literally more possible sets of collation behaviors than there are grains of sand in the Sahara (there are hundreds of distinct scripts, which we can change the overall ordering of arbitrarily, on top of all the other customizations). Clearly the libc way of looking at things doesn't really carry over. BCP 47 is supposed to be universal -- it's an IETF standard. That's where all the stability guarantees are. The officially recognized 'u' extension that ICU uses is a CLDR/Unicode thing, not an ICU thing. The same format could, in the future, be used by other collation providers, since there actually are other CLDR consumers/UCA implementations. And, ICU have said that they have deprecated the old locale format, and have standardized on BCP 47. As of ICU 54, it is recommended that ucol_open() be passed a string in BCP 47 format. I'm surprised that this issue was not resolved earlier in the week. I presumed that all of this was obvious to Peter E., but I seem to have been wrong about that. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 64-bit queryId?
On Sat, Sep 30, 2017 at 8:39 AM, Peter Geoghegan wrote: > Isn't that already true in the case of queryId? I've never heard any > complaints about collisions. Most people don't change > pg_stat_statements.max, so the probability of a collision is more like > 1%. And, that's the probability of *any* collision, not the > probability of a collision that the user actually cares about. The > majority of entries in pg_stat_statements among those ten thousand > will not be interesting. Correction: ten thousand is an example value of pg_stat_statements.max in the docs, not the default value. The default is five thousand. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 64-bit queryId?
On Sat, Sep 30, 2017 at 7:34 AM, Robert Haas wrote: > Assuming, however, that you don't manage to prove all known > mathematics inconsistent, what one might reasonably hope to do is > render collisions remote enough that one need not worry about them too > much in practice. Isn't that already true in the case of queryId? I've never heard any complaints about collisions. Most people don't change pg_stat_statements.max, so the probability of a collision is more like 1%. And, that's the probability of *any* collision, not the probability of a collision that the user actually cares about. The majority of entries in pg_stat_statements among those ten thousand will not be interesting. Often, 90%+ will be one-hit wonders. If that isn't true, then you're probably not going to find pg_stat_statements very useful, because you have nothing to focus on. I have heard complaints about a number of different things in pg_stat_statements, like the fact that we don't always manage to replace constants with '?'/'$n' characters in all cases. I heard about that quite a few times during my time at Heroku. But never this. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?
On Mon, Sep 25, 2017 at 11:40 AM, Peter Eisentraut wrote: > Reading over this code again, it is admittedly not quite clear why this > "canonicalization" is in there right now. I think it had something to > do with how we built the keyword variants at one point. It might not > make sense. I'd be glad to take that out and use the result straight > from uloc_getAvailable() for collcollate. That is, after all, the > "canonical" version that ICU chooses to report to us. Is anything going to happen here ahead of shipping v10? This remains an open item. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Wed, Sep 20, 2017 at 2:32 AM, Rushabh Lathia wrote: > Yes, I haven't touched the randomAccess part yet. My initial goal was > to incorporate the BufFileSet api's here. This is going to need a rebase, due to the commit today to remove replacement selection sort. That much should be easy. > Sorry, I didn't get this part. Are you talking about the your patch changes > into OpenTemporaryFileInTablespace(), BufFileUnify() and other changes > related to ltsUnify() ? If that's the case, I don't think it require with > the > BufFileSet. Correct me if I am wrong here. I thought that you'd have multiple BufFiles, which would be multiplexed (much like a single BufFile itself multiplexes 1GB segments), so that logtape.c could still recycle space in the randomAccess case. I guess that that's not a goal now. > To be frank its too early for me to comment anything in this area. I need > to study this more closely. As an initial goal I was just focused on > understanding the current implementation of the patch and incorporate > the BufFileSet APIs. Fair enough. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgbench stuck with 100% cpu usage
On Thu, Sep 28, 2017 at 10:36 PM, Pavan Deolasee wrote: > Well, I think it's a very legitimate test, not for testing performance, but > testing crash recovery and I use it very often. This particular test was run > to catch another bug which will be reported separately. Yeah, I use pgbench for stuff like that all the time. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] A design for amcheck heapam verification
On Fri, Sep 29, 2017 at 10:29 AM, Robert Haas wrote: > I am also wondering whether this patch should consider > 81c5e46c490e2426db243eada186995da5bb0ba7 as a way of obtaining > multiple hash values. I suppose that's probably inferior to what is > already being done on performance grounds, but I'll just throw out a > mention of it here all the same in case it was overlooked or the > relevance not spotted... Well, we sometimes only want one hash value. This happens when we're very short on memory (especially relative to the estimated final size of the set), so it's a fairly common requirement. And, we have a convenient way to get a second independent uint32 hash function now (murmurhash32()). -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] A design for amcheck heapam verification
On Thu, Sep 28, 2017 at 8:34 PM, Thomas Munro wrote: > FWIW I think if I were attacking that problem the first thing I'd > probably try would be getting rid of that internal pointer > filter->bitset in favour of a FLEXIBLE_ARRAY_MEMBER and then making > the interface look something like this: > > extern size_t bloom_estimate(int64 total_elems, int work_mem); > extern void bloom_init(bloom_filter *filter, int64 total_elems, int work_mem); > > Something that allocates new memory as the patch's bloom_init() > function does I'd tend to call 'make' or 'create' or 'new' or > something, rather than 'init'. I tend to agree. I'll adopt that style in the next version. I just didn't want the caller to have to manage the memory themselves. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
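As a sketch of the two allocation styles under discussion: an estimate function so the caller can size the allocation itself, plus a 'create' function that allocates internally. The struct layout, sizing policy, and names below are all hypothetical illustrations, not the patch's actual definitions:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct bloom_filter
{
    uint64_t        nbits;      /* size of bitset, in bits */
    unsigned char   bitset[];   /* flexible array member */
} bloom_filter;

/*
 * Caller-managed style: report how many bytes a filter sized for
 * total_elems would occupy, capped by a work_mem budget (in KB).
 * Roughly 10 bits per element gives a ~1% false-positive rate.
 */
static size_t
bloom_estimate(int64_t total_elems, int work_mem_kb)
{
    uint64_t nbits = (uint64_t) total_elems * 10;
    uint64_t max_bits = (uint64_t) work_mem_kb * 1024 * 8;

    if (nbits > max_bits)
        nbits = max_bits;
    return offsetof(bloom_filter, bitset) + (nbits + 7) / 8;
}

/*
 * Allocating style: because of the flexible array member, the header and
 * bitset live in one allocation, so 'create' can both size and allocate.
 */
static bloom_filter *
bloom_create(int64_t total_elems, int work_mem_kb)
{
    size_t        sz = bloom_estimate(total_elems, work_mem_kb);
    bloom_filter *filter = malloc(sz);

    memset(filter, 0, sz);
    filter->nbits = (uint64_t) (sz - offsetof(bloom_filter, bitset)) * 8;
    return filter;
}
```

The flexible array member is what makes the single-allocation 'create' shape natural: there is no separate bitset pointer to manage, and the whole filter can be freed with one free().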
Re: [HACKERS] The case for removing replacement selection sort
On Fri, Sep 29, 2017 at 7:19 AM, Robert Haas wrote: > That supports your theory that there's some confounding factor in the > CREATE INDEX case, such as I/O scheduling. Since this machine has an > SSD, I guess I don't have a mental model for how that works. We're > not waiting for the platter to rotate... Random I/O is still significantly more expensive with SSDs, especially random writes, where all the wear leveling stuff comes into play. There is a tiny universe of very complicated firmware within every SSD [1]. (I am generally concerned about the trend towards increasingly complicated, unauditable firmware like this, but that's another story.) > ...but I guess that's all irrelevant as far as this patch goes. The > point of this patch is to simplify things from removing a technique > that is no longer effective, and the evidence we have supports the > contention that it is no longer effective. I'll go commit this. Thanks. [1] https://lwn.net/Articles/353411/ -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] The case for removing replacement selection sort
On Thu, Sep 28, 2017 at 3:18 PM, Robert Haas wrote: > On Fri, Jul 14, 2017 at 6:20 PM, Peter Geoghegan wrote: >> With the additional enhancements made to Postgres 10, I doubt that >> there are any remaining cases where it wins. > > I tried my favorite sorting test case -- pgbench -i -s 300 and then > reindex index pgbench_accounts_pkey. I set maintenance_work_mem to > 4MB so that replacement selection would be chosen. > > On master, this takes ~21.5 seconds; with replacement_sort_tuples = 0, > it takes ~19.1 seconds. So, disabling replacement selection is a win. > > On 9.6, this takes ~19.1 seconds; with replacement_sort_tuples = 0, it > takes ~23 seconds. So, disabling replacement selection is a loss. > > This supports your theory that we should go ahead and remove > replacement selection. I'm glad to hear that. But, I should reiterate that if sorting actually gets faster when my patch is applied, then that's something that I consider to be a bonus. (This is primarily a refactoring patch, to remove a bunch of obsolete code.) > It does however bother me a bit that this test > case is apparently slower in master than in 9.6, and not just with > very small values of work_mem. With default maintenance_work_mem > (64MB), this takes about 13 seconds on 9.6 but about 15 seconds on > master -- and with that setting replacement selection is not chosen at > all. As I mentioned, the picture changed for replacement selection during the Postgres 10 development cycle, which caused me to finally push for it to be killed in this thread. So, anyone that just started following the thread should note that it's totally expected that replacement selection still just about pulls its weight in 9.6, but not in 10. > Any idea what causes this regression? I don't know. My best guess is that the overall I/O scheduling is now suboptimal due to commit e94568e (Heikki's preloading thing), because this is CREATE INDEX, and there is new competitive pressure. 
You might find it hard to replicate the problem with a "SELECT COUNT(DISTINCT aid) FROM pgbench_accounts", which would confirm this explanation. Or, you could also see what happens with a separate temp tablespace. It's really, really hard to have a 100%, unambiguous win within tuplesort.c. We've seen this time and again over the years. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple
On Thu, Sep 28, 2017 at 3:20 PM, Robert Haas wrote: > On Thu, Sep 28, 2017 at 5:47 PM, Peter Geoghegan wrote: >> In the end, commit 6bfa88a fixed that old recovery bug by making sure >> the recovery routine heap_xlog_lock() did the right thing. In both >> cases (Feb 2014 and today), the index wasn't really corrupt -- it just >> pointed to the root of a HOT chain when it should point to some child >> tuple (or maybe a successor HOT chain). > > Unless I'm very confused, it's really not OK to point at a child tuple > rather than the root of the HOT chain. Please see my later clarification. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple
On Thu, Sep 28, 2017 at 2:47 PM, Peter Geoghegan wrote: > In the end, commit 6bfa88a fixed that old recovery bug by making sure > the recovery routine heap_xlog_lock() did the right thing. In both > cases (Feb 2014 and today), the index wasn't really corrupt -- it just > pointed to the root of a HOT chain when it should point to some child > tuple (or maybe a successor HOT chain). Just to be clear, obviously I don't mean that the index should have been magically updated to compensate for that 2014 heap_lock_tuple() recovery bug (say, via an in-place IndexTuple update during recovery). Certainly, the index AM was entitled to assume that the heap TID it pointed to would continue to be the root of the same HOT chain, at least until the next VACUUM. That was what I meant by "the index wasn't actually corrupt". All I meant here is that the index scan and sequential scan would have been in agreement if something like that *had* happened artificially (say, when the index tuple was edited with a hex editor). And, that the actual high level problem was that we failed to treat a HOT-updated tuple as a HOT-updated tuple, leading to consequences that were mostly noticeable during index scans. Probably on both occasions (back in 2014, and now). -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple
On Thu, Sep 28, 2017 at 2:39 PM, Alvaro Herrera wrote: >> FWIW, I am reminded a little bit of the MultiXact/recovery bug I >> reported way back in February of 2014 [1], which also had a HOT >> interaction that caused index scans to give wrong answers, despite >> more or less structurally sound indexes. >> >> [1] >> https://www.postgresql.org/message-id/CAM3SWZTMQiCi5PV5OWHb+bYkUcnCk=o67w0csswpvv7xfuc...@mail.gmail.com > > Thanks for the reference. I didn't remember this problem and it's not > (wasn't) in my list of things to look into. Perhaps these are both the > same bug. I was reminded of that old bug because initially, at the time, it looked very much like a corrupt index: sequential scans were fine, but index scans gave wrong answers. This is what I saw today. In the end, commit 6bfa88a fixed that old recovery bug by making sure the recovery routine heap_xlog_lock() did the right thing. In both cases (Feb 2014 and today), the index wasn't really corrupt -- it just pointed to the root of a HOT chain when it should point to some child tuple (or maybe a successor HOT chain). -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple
On Thu, Sep 28, 2017 at 1:20 PM, Alvaro Herrera wrote: > Alvaro Herrera wrote: > >> Odd that it's not fixed. I guess there's still some more work to do >> here ... > > Maybe what this means is that we need to do both Dan's initially > proposed patch (or something related to it) apart from the fixes already > pushed. IOW we need to put back some of the "tupkeep" business ... We certainly do still see wrong answers to queries here:

postgres=# select ctid, xmin, xmax, * from t;
 ctid  | xmin  | xmax | id | name | x
-------+-------+------+----+------+---
 (0,1) | 21171 |    0 |  1 | 111  | 0
 (0,7) | 21177 |    0 |  3 | 333  | 5
(2 rows)

postgres=# select * from t where id = 3;
 id | name | x
----+------+---
  3 | 333  | 5
(1 row)

postgres=# set enable_seqscan = off;
SET
postgres=# select * from t where id = 3;
 id | name | x
----+------+---
(0 rows)

FWIW, I am reminded a little bit of the MultiXact/recovery bug I reported way back in February of 2014 [1], which also had a HOT interaction that caused index scans to give wrong answers, despite more or less structurally sound indexes. [1] https://www.postgresql.org/message-id/CAM3SWZTMQiCi5PV5OWHb+bYkUcnCk=o67w0csswpvv7xfuc...@mail.gmail.com -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] A design for amcheck heapam verification
add_element(), when they have one close at hand anyway -- much like addHyperLogLog(). Again, that seems like work for the ultimate consumer of that functionality. It's a trivial tweak that can happen later. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] The case for removing replacement selection sort
On Wed, Sep 27, 2017 at 12:06 AM, Simon Riggs wrote: >> I think we should remove the replacement_sort_tuples GUC, and kill >> replacement selection entirely. There is no need to do this for >> Postgres 10. I don't feel very strongly about it. It just doesn't make >> sense to continue to support replacement selection. > > Forgive me if I missed the explanation, but how will we handle bounded sorts? No change there at all. If anything, they will probably be slightly faster, because the heap management routines no longer have to work with a special heap comparator that compares run number, which was only ever needed to support replacement selection. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers