For many years now, MySQL has had a feature called INSERT IGNORE [1]; SQLite has INSERT ON CONFLICT IGNORE [2]; SQL Server has an option called IGNORE_DUP_KEY, and Oracle has a hint called IGNORE_ROW_ON_DUPKEY_INDEX (they acknowledge that it's a bit odd that a hint changes the semantics of a DML statement - of course, MERGE has always been able to do something approximately equivalent).
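For reference, the spellings in those other systems look roughly like the following (approximate forms from memory rather than verbatim from the cited docs; the table and index names are made up):

-- MySQL
INSERT IGNORE INTO tab(id, val) VALUES (1, 'val');

-- SQLite
INSERT OR IGNORE INTO tab(id, val) VALUES (1, 'val');

-- SQL Server: IGNORE_DUP_KEY is an index option, not an INSERT clause
CREATE UNIQUE INDEX tab_id_idx ON tab(id) WITH (IGNORE_DUP_KEY = ON);

-- Oracle: the hint names the table and a unique index (or column list)
INSERT /*+ IGNORE_ROW_ON_DUPKEY_INDEX(tab, tab_pk) */
  INTO tab(id, val) VALUES (1, 'val');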
Each of these features has its respective system simply not insert certain tuples that would result in a duplicate key violation, while potentially proceeding with the insertion of other, non-violating tuples. The attached WIP patch implements this for Postgres, with a few notable differences:

1) The patch is only interested in unique index violations (specifically, violations of amcanunique access method unique indexes); it will not do anything with NOT NULL constraint violations, as the MySQL feature does, for example. It also doesn't care about exclusion constraints, even though it's currently possible to defer exclusion constraint enforcement in a way that is analogous to how it's done for btree indexes.

2) The clause that I've proposed is, as you'll have already gathered, spelt slightly differently to any of these systems. I'm not particularly attached to how I've spelt it. I've spelt it this way so as to avoid suggesting that it's fully compatible with the MySQL feature. I don't think we want weird NULLness behavior, but I may be mistaken. If we want that additional behavior, we probably want it as an optional adjunct, or perhaps an entirely independent clause to what I've specified here.

3) RETURNING is expanded - "RETURNING REJECTS *" is now possible where that makes sense.

It is possible for clients to interrogate the wire protocol to see the number of rows affected (this can be done with a PQcmdTuples() call by libpq clients, for example), and see how that compares with the number of tuples they tried to insert. In addition, a client can use RETURNING to see what rows were successfully inserted. Here is a simple example session that shows this usage:

postgres=# create table tab(id int4 primary key, val text);
CREATE TABLE
postgres=# insert into tab(id, val) values(1, 'val'), (3, 'val'), (4, 'val');
INSERT 0 3
postgres=# insert into tab(id, val) values(1, 'nosert'), (2, 'nosert'), (3, 'nosert') on duplicate key ignore returning id;
 id
----
  2
(1 row)

INSERT 0 1
postgres=# select xmin, * from tab;
  xmin  | id |  val
--------+----+--------
 580843 |  1 | val
 580843 |  3 | val
 580843 |  4 | val
 580844 |  2 | nosert
(4 rows)

If this all happened in two separate, concurrent sessions, the transaction with xid 580844 might well have blocked on the outcome of 580843 in the usual way (ultimately inserting or doing nothing as appropriate for each tuple), just as you'd expect. Note, however, that the xid of the "noserter" is only seen here for the tuple that it actually successfully inserted - in effect, the noserter session does not lock tuples that it did not originally insert. That's consistent with MySQL's INSERT IGNORE, for example. However, some people might prefer it if this did share lock rows that prevented tuples from being inserted, or if that was at least an option. That would be something quite a bit closer to a fully-fledged upsert - I imagine we'd have to lock the old tuple using EvalPlanQual, and if a submission did that correctly then I think we'd be well on our way to solving all the problems.

RETURNING REJECTS is a way of getting the inverse of the affected tuples of an ordinary INSERT...RETURNING statement - rather than returning only the rows that were successfully inserted, we return only the rows that were not successfully inserted (i.e. tuples where either a BEFORE trigger returned NULL or we didn't proceed with insertion). This is only meaningful in the context of INSERT...ON DUPLICATE KEY IGNORE.
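To make that concrete, here is a hypothetical continuation of the session above; the output shown is what one would expect rather than a captured run, and the exact placement of REJECTS in the grammar follows the spelling given above:

postgres=# insert into tab(id, val) values(3, 'nosert'), (5, 'nosert') on duplicate key ignore returning rejects *;
 id |  val
----+--------
  3 | nosert
(1 row)

Here (5, 'nosert') is inserted quietly, while the proposed tuple (3, 'nosert') - rejected because id = 3 already exists - is what comes back to the client.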
Sure, clients could figure this out for themselves using vanilla "RETURNING *", but I think that this addition is justified by the fact that it's generally expected that the number of rejects will be relatively small. People are going to want to use this feature in an ad-hoc manner within a psql session, and supporting this will particularly help there. It might be nice to find a way to indicate the exact constraint violated per row returned, but I preferred to wait and see if people thought it was a good idea generally.

Use-cases
=========

The use-cases for this feature are fairly obvious. It can be used as a way of merging tuples easily, if clients can live with the present limitations. In addition to that, the feature may be of value to clients of the logical change-set generation infrastructure (such as the proposed BDR contrib module [3]), if and when that infrastructure is finally committed. Currently, the BDR prototype simply balks at INSERT conflicts. This is because the number of subtransactions/xids generated to do anything else is considered prohibitive. If Postgres is ever going to solve the kinds of problems that BDR proposes to solve well (except for the relatively narrow use-case where balking at INSERT conflicts is acceptable), it needs something like what I've proposed here. The only alternative is probably to replay a transaction without using subtransactions, and if that fails, remember that fact and wrap every single change in a subxact. This is really going to be painful for large transactions.

Andres privately said some weeks back (prior to my sharing of this information, and even prior to my writing most of the code posted here) that he'd like to see an INSERT...ON DUPLICATE KEY LOCK. I suppose that he must have meant that he'd like to see shared locks obtained on rows that blocked a noserter from actually inserting some of their proposed tuples, for the purposes of conflict resolution (which, as I've said, this doesn't do). Perhaps he'd like to comment on these issues here.

Mechanism
=========

This patch is a spin-off from a broader effort to implement INSERT...ON DUPLICATE KEY UPDATE (upsert). During the 2012 developer meeting, it was agreed that we needed a simple, atomic upsert, and that we should probably use the MySQL syntax as long as we don't have MERGE in its full generality [4]. In doing things this way, I'm attempting to manage the risk of that broader patch not being accepted by checkpointing progress towards it in a major release. In particular, I'd like to see us reach consensus on the questions surrounding our basic approach to implementing upsert. I'm keen to see this happen before attempting to solve all the other problems, in a way which is perhaps predicated on my own personal opinion about how to solve the basic problem. Having said all that, even if I never work on upsert, this patch certainly has independent value.

Having reviewed the archives for previous discussions on upsert or MERGE, it seems that there is some diversity of opinion among more senior developers about just how to solve the basic concurrency issue. For example, a few years ago Heikki opined that continuing to insert the heap tuple first, and then potentially cleaning that up in the event of a duplicate rather than raising an error, was a workable approach [5]. Simon addressed Heikki's point here specifically, and suggested that we should try and find the tuple first [6], which is approximately the approach I have taken here.
In saying this, Simon appears to have changed his mind, because on a previous occasion he remarked that doing an UPDATE potentially followed by an INSERT might be preferable [7]. Tom at least once briefly sketched how he thought this ought to work [8], and it appeared to me that he was roughly in agreement with Simon in [6]. (I hope I haven't misrepresented somebody's view here - if so, please correct me).

From a high level, the patch works by adding something that is tentatively internally referred to as "speculative insertion" in lower-level code comments. The basic approach is:

* Within ExecInsert, when inserting into a regular heap relation from a tuple table slot (after BEFORE triggers fire but before inserting the heap tuple), there is an extra function call to a new function, ExecLockIndexTuples().

* For each applicable amcanunique unique index, ExecLockIndexTuples() calls a new am method, amlock. For each unique index, we end up acquiring a write buffer lock in a similar fashion to _bt_doinsert. When we return from the new amlock function, we report whether or not there was a duplicate, without having inserted any index tuples or having otherwise done anything that requires a heap tuple to be present. I like to call this point "the point of no return", where individual unique indexes commit themselves to insertion (which is to say, the point after which a unique constraint violation becomes impossible but before actual index tuple insertion in _bt_doinsert [9]). Another duplicate key cannot be inserted by anyone else until that write lock is released, which it won't be until insertion or IGNORE, for the same reason that's currently true once control reaches the point of no return. A page split might need to happen before insertion proper of that index tuple, for example, and that's okay.

* If we have total consensus among all unique indexes, we proceed with a heap insertion of the tuple formed from the tuple table slot ExecInsert() was called in respect of. Yes, this means that potentially multiple exclusive buffer locks are held for the duration of the heap tuple insertion. If not, we release the locks early and return NULL from ExecInsert() before heap insertion ever happens (and certainly before index tuple insertion ever happens).

* If and when we get to ExecInsertIndexTuples(), and ultimately to _bt_doinsert, the new _bt_doinsert knows that, for any unique indexes it sees, the process has already been partially completed. It takes state made available from the first phase of speculative insertion, and finishes the job, pretty much picking up from the point of no return for amcanunique unique indexes (or starting from scratch in the usual way for non-unique amcanunique-method indexes that happen to be involved in speculative insertion).

Deadlock hazards
================

In order to avoid deadlocks, the locking stage acquires locks in the reverse of the usual index insertion order. So, regardless of how those locks are subsequently freed (either by actual insertion or by finishing early in the IGNORE case), we avoid buffer lock related deadlocks. Still, deadlock hazards are something that reviewers will want to pay particular attention to.

Unlike some other systems, such as DB2, we have always allowed BEFORE ROW triggers to execute arbitrary SQL. I've frequently thought this was a bit of a wart (e.g. [10]), and certainly not supportive of sensible, idiomatic trigger use, but there isn't much we can do about it at this stage.
Right now, BEFORE ROW triggers will still fire when the new code decides not to do the insert. It certainly wouldn't be acceptable to allow before triggers to run *after* the first phase of speculative insertion, because they might try and access an index with an exclusive-locked buffer, resulting in backend self-deadlock. Besides, we cannot really know *what* to lock until after the before triggers fire. That leaves us with two options, as far as I can tell:

1. Just have users live with those semantics. This is what the patch does presently, and anyone who is using triggers in what I consider to be a sensible way doesn't have to care. For everyone else, it's a gotcha that they have to deal with, to be noted prominently.

2. Do something overly-cute to prevent the trigger from doing anything we can't back out of. I'm not sure how feasible this is, because I haven't looked at it at all (since option 1 looks perfectly reasonable to me), but everything is on the table I suppose. As noted in code comments, this would be pretty fragile. It would also probably be at least as ugly as requiring users to live with option 1.

BEFORE ROW triggers aside, there are clearly other hazards implied by taking exclusive locks on unique index buffers - a naive implementation could be dealing with a system index. While these locks aren't held for much longer than in the existing code, a requirement to carefully avoid catalog access for the duration of the heap insertion seems do-able but still quite fragile, and perhaps a source of bugs in the future. So, as a precaution against some catalog access using an index that is already locked by speculative insertion in a way that leads to deadlock, we simply forbid speculative insertion into system catalogs outright. This doesn't seem like too onerous a restriction to me.

Transaction isolation
=====================

The interaction with SSI is something that I considered, but may not have got right. With the transaction isolation level set to serializable, each exclusive-locked buffer gets a CheckForSerializableConflictIn() call in a way that I believe is analogous to the extant _bt_doinsert code. Crucially, we always make those calls when the isolation level is serializable, which means extra work (we've perhaps already concluded that the speculative insertion won't go ahead) over read committed and repeatable read. So, because the number of new calls to CheckForSerializableConflictIn() isn't influenced by interaction with other sessions, if I'm not mistaken that preserves SSI's guarantees. However, this may still be unnecessary - it may also be fine to just rely on the buffer locks that are taken. I need better analysis here.

Tests
=====

Isolation tests are included that test the feature in an obvious way. My original test case involved a pgbench session with a custom script where the primary key value is a random value between 1 and :scale inclusive, and the expectation is that we'll end up with exactly :scale tuples (a sketch of such a script appears below). After many sessions did INSERT...ON DUPLICATE KEY IGNORE, this is consistently the case, without spurious duplicates or any other incorrect behavior (that I'm aware of). Presently, this works fine with tens of thousands of tuples. I've also done quite a lot of other smoke testing of the feature, but this isn't included here. I have done no performance testing to date.
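A pgbench custom script along those lines might look something like this (my reconstruction rather than the actual script used; it assumes the tab table from the example session above, and that :scale is supplied via pgbench's -s option):

\setrandom id 1 :scale
INSERT INTO tab(id, val) VALUES (:id, 'val') ON DUPLICATE KEY IGNORE;

Run from many concurrent clients with -f, the invariant to check afterwards is simply that SELECT count(*) FROM tab returns :scale, with no duplicate id values and no unique violation errors raised along the way.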
Reviewers will want to pay attention to the performance implications, particularly in the regular insert case; it's conceivable that I've regressed things, though I don't specifically suspect that I have.

Documentation
=============

I haven't added any documentation, because I didn't think that there was much point before hearing comments from others. Obviously at least the INSERT documentation, and "Chapter 52. Index Access Method Interface Definition" and the pg_am docs will need to be updated if we're to proceed.

Thoughts?

*** References ***

[1] https://dev.mysql.com/doc/refman/5.5/en/insert.html
[2] https://www.sqlite.org/lang_conflict.html
[3] http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/bdr
[4] https://wiki.postgresql.org/wiki/PgCon_2012_Developer_Meeting#MERGE
[5] http://www.postgresql.org/message-id/45e845c4.6030...@enterprisedb.com
[6] http://www.postgresql.org/message-id/1172858409.3760.1618.ca...@silverbirch.site
[7] http://www.postgresql.org/message-id/1132000496.3388.31.camel@holly
[8] http://www.postgresql.org/message-id/20083.1131730...@sss.pgh.pa.us
[9] https://github.com/postgres/postgres/blob/REL9_3_STABLE/src/backend/access/nbtree/nbtinsert.c#L174
[10] http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=6868ed7491b7ea7f0af6133bb66566a2f5fe5a75

-- 
Peter Geoghegan
insert_on_dup.v1.2013_08_30.patch.gz