Re: New addition to the merge sql standard

Nick DeCoursin Mon, 20 Nov 2023 13:38:39 -0800

> "ignore" may not be what you want, though.  Perhaps the fact that insert
> (coming from the NOT MATCHED clause) fails (== conflicts with a tuple
> concurrently inserted in an unique or exclusion constraint) should
> transform the row operation into a MATCHED case, so it'd fire the other
> clauses in the overall MERGE operation.  Then you could add a WHEN
> MATCHED DO NOTHING case which does the ignoring that you want; or just
> let them be handled by WHEN MATCHED UPDATE or whatever.  But you may
> need some way to distinguish rows that appeared concurrently from rows
> that were there all along.

Your suggestion (or idea) would also be good with me, but that would seem
to me to be considerably more complex to implement and to use, I think.

In a similar sense, I think perhaps it could be good to add the complete
`insert ... on conflict` syntax and functionality onto the insert part of
merge (including handling unique violations with an update), but I haven't
completely thought it through. This is similar to what you seemed to be
brainstorming, but rather than going back into the match statements, it
would look like this perhaps (with the new additions underlined):

merge into some_table st
using different_table as dt
on (st.id = dt.st_id)
when matched do nothing
when not matched then
  insert (cola, colb, colc)
  values ("uniquekey", "gal", "asfa")
  *on conflict (cola) do update set*
  *colb = "gal",*
  *colc = "asfa"*
;

As a result, `merge` would essentially have all the functionality that
`insert ... on conflict` has, so that perhaps (perhaps!) `merge` >=
'insert' in every case except for simply inserting raw data without a
select.

I personally think adding the `on conflict ...` syntax onto merge insert
like ^ this would be better than trying to go back into the merge's match
statements.

However, even if it were desired to add the complete `on conflict do update
...` functionality, I would think it would be better to take one step at a
time by just first adding the ignore functionality (of ignoring rows that
cause unique violation exceptions) which could be called `on conflict do
nothing`, leaving open the possibility for future expansion with the `on
conflict do update set ...`.

> In regards to the SQL standard, I hope what you're saying is merely not
> documented by them.  If it indeed isn't, it may be possible to get them
> to accept some new behavior, and then I'm sure we'd consider
> implementing it.  If your suggestion goes against what they already
> have, I'm afraid you'd be doomed.  So the next question is, how do other
> implementations handle this case you're talking about?  SQL Server, DB2
> and Oracle being the relevant ones.

I would suppose it's merely not documented but yet that there's nothing
contradictory to it either. Unfortunately, I don't know how other databases
do it, but I would think probably none of them do it yet.

> Assuming the idea is good and there are no conflicts, then maybe it's
> just lack of round tuits.

I think it's just an advance feature that's hard to grasp, understand, and
develop. But as our database systems become more and more advanced, advance
features like this become more and more feasible.

You know, Postgres is sort of reaching that point - I believe - where it's
looking around, like, ok what next? We've done everything, while other
databases are still struggling under structural problems, postgres is
starting to "push the envelope".

> I vaguely recall thinking about this, and noticing that implementing
> something of this sort would require messing around with the ExecInsert
> interface.  It'd probably require splitting it in pieces, similar to how
> ExecUpdate was split.
>
> There are some comments in the code about possible "live-locks" where
> merge would be eternally confused between inserting a new row which it
> then wants to delete; or something like that.  For sure we would need to
> understand the concurrent behavior of this new feature very clearly.
>
> An interesting point is that our inserts *wait* to see whether the
> concurrent insertion commits or aborts, when a unique constraint is
> involved.  I'm not sure you want to have MERGE blocking on concurrent
> inserts.  This is all assuming READ COMMITTED semantics; on REPEATABLE
> READ or higher, I think you're just screwed, because of course MERGE is
> not going to get a snapshot that sees the rows inserted by transactions
> that started after.
>
> You'd need to explore all this very carefully.

Yeah. I would have liked to have been able to volunteer to do it myself,
but it seems based on your description to be way over my head 😄. C is
definitely not my forte, and I don't have any postgres development
experience (actually we did go into the postgres source for a database
project in college haha).

However, it might be pertinent for me to reference this under the READ
COMMITTED semantics:

MERGE allows the user to specify various combinations of INSERT, UPDATE and
> DELETE subcommands. A MERGE command with both INSERT and UPDATE subcommands
> looks similar to INSERT with an ON CONFLICT DO UPDATE clause but does not
> guarantee that either INSERT or UPDATE will occur. If MERGE attempts an
> UPDATE or DELETE and the row is concurrently updated but the join
> condition still passes for the current target and the current source tuple,
> then MERGE will behave the same as the UPDATE or DELETE commands and
> perform its action on the updated version of the row. *However, because *
> *MERGE** can specify several actions and they can be conditional, the
> conditions for each action are re-evaluated on the updated version of the
> row, starting from the first action, even if the action that had originally
> matched appears later in the list of actions.* On the other hand, if the
> row is concurrently updated or deleted so that the join condition fails,
> then MERGE will evaluate the condition's NOT MATCHED actions next, and
> execute the first one that succeeds. If MERGE attempts an INSERT and a
> unique index is present and a duplicate row is concurrently inserted, then
> a uniqueness violation error is raised; MERGE does not attempt to avoid
> such errors by restarting evaluation of MATCHED conditions.
>

With this in mind, the `merge` statement doesn't block on concurrent
inserts, nor is that necessary. The merge semantics imply that there is no
blocking/waiting. Deriving from this or in tandem with this, the insert
within merge doesn't need to do any blocking or waiting either, only when
it actually performs the job of committing the insert, if this fails, then
perform the `on conflict do nothing`. Therefore, due to the original merge
semantics, merge insert doesn't need to follow the same requirements as
normal `insert`, and it doesn't need to *wait* - it's a best effort thing.

In my opinion, `merge` is meant for batch operations of large data, and
that's the best way to think about it. It's not meant for perfectly
serializable data. It's meant for moving huge datasets efficiently in a
best effort means.

Cheers,
Nick

On Thu, Nov 16, 2023 at 6:13 PM Alvaro Herrera <[email protected]>
wrote:

> On 2023-Nov-16, Nick DeCoursin wrote:
>
> > In my opinion, it would be better for merge to offer the functionality to
> > simply ignore the rows that cause unique violation exceptions instead of
> > tanking the whole query.
>
> "ignore" may not be what you want, though.  Perhaps the fact that insert
> (coming from the NOT MATCHED clause) fails (== conflicts with a tuple
> concurrently inserted in an unique or exclusion constraint) should
> transform the row operation into a MATCHED case, so it'd fire the other
> clauses in the overall MERGE operation.  Then you could add a WHEN
> MATCHED DO NOTHING case which does the ignoring that you want; or just
> let them be handled by WHEN MATCHED UPDATE or whatever.  But you may
> need some way to distinguish rows that appeared concurrently from rows
> that were there all along.
>
> In regards to the SQL standard, I hope what you're saying is merely not
> documented by them.  If it indeed isn't, it may be possible to get them
> to accept some new behavior, and then I'm sure we'd consider
> implementing it.  If your suggestion goes against what they already
> have, I'm afraid you'd be doomed.  So the next question is, how do other
> implementations handle this case you're talking about?  SQL Server, DB2
> and Oracle being the relevant ones.
>
> Assuming the idea is good and there are no conflicts, then maybe it's
> just lack of round tuits.
>
> Happen to have some?
>
> I vaguely recall thinking about this, and noticing that implementing
> something of this sort would require messing around with the ExecInsert
> interface.  It'd probably require splitting it in pieces, similar to how
> ExecUpdate was split.
>
> There are some comments in the code about possible "live-locks" where
> merge would be eternally confused between inserting a new row which it
> then wants to delete; or something like that.  For sure we would need to
> understand the concurrent behavior of this new feature very clearly.
>
>
> An interesting point is that our inserts *wait* to see whether the
> concurrent insertion commits or aborts, when a unique constraint is
> involved.  I'm not sure you want to have MERGE blocking on concurrent
> inserts.  This is all assuming READ COMMITTED semantics; on REPEATABLE
> READ or higher, I think you're just screwed, because of course MERGE is
> not going to get a snapshot that sees the rows inserted by transactions
> that started after.
>
> You'd need to explore all this very carefully.
>
> --
> Álvaro Herrera        Breisgau, Deutschland  —
> https://www.EnterpriseDB.com/
>

Re: New addition to the merge sql standard

Reply via email to