Re: [EXTERNAL]: Re: [EXTERNAL]: Re: UPSERT in Postgres

Louis Tian Sun, 09 Apr 2023 09:50:11 -0700


________________________________
From: Israel Brewster <ijbrews...@alaska.edu>
Sent: Sunday, April 9, 2023 3:09 AM
To: Louis Tian <louis.t...@aquamonix.com.au>
Cc: Peter Geoghegan <p...@bowt.ie>; pgsql-general@lists.postgresql.org 
<pgsql-general@lists.postgresql.org>
Subject: [EXTERNAL]: Re: [EXTERNAL]: Re: UPSERT in Postgres

Thanks Israel. Your example really helped me to understand where we differ.

> Not necessarily. Consider the following UPSERT statement:
> INSERT INTO customer_order_counts (customer_id,order_count) VALUES (123456, 
> 1) ON CONFLICT (customer_id) DO UPDATE SET 
> order_count=customer_order_counts.order_count+1;

Yes, you are 100% right that is not idempotent statement. But that's not 
because UPSERT is idempotent *operation*. I am making a distinction between an 
operation and statement here.
This statement is not idempotent is not because UPSERT operation is not 
idempotent, it is because the value being upsert is different in the statement 
each time you run that statement.

> Not really. Generally when I am doing an UPSERT, I am NOT using the primary 
> key, but rather some other UNIQUE-ly indexed column(s). My primary key is 
> typically an ID column that is defined as a serial, > and automatically 
> generated by the database. The unique column I use for the upset, however, is 
> generally something that would identify the row to a human - such as Station 
> ID and timestamp
> columns in a database I have of seismic readings. Each reading gets a unique 
> ID (the primary key) that identifies it to the database and is used in joins. 
> However, occasionally a reading gets updated, so I > update the row, based 
> not on the primary key (which I don’t know for incoming data), but on the 
> station and timestamp. UPSERT, but not based on the primary key.

I am using "primary key" to refer to the "row identifier" in a conceptual way.
Definition from Wiki, "In the relational 
model<https://en.wikipedia.org/wiki/Relational_model> of 
databases<https://en.wikipedia.org/wiki/Database>, a primary key is a specific 
choice of a minimal set of attributes 
(columns<https://en.wikipedia.org/wiki/Column_(database)>) that uniquely 
specify a tuple (row<https://en.wikipedia.org/wiki/Row_(database)>) in a 
relation<https://en.wikipedia.org/wiki/Relation_(database)> 
(table<https://en.wikipedia.org/wiki/Table_(database)>)."
So, it's not necessarily the primary key you declared on the table (for 
example, the id column in your example).
To put it in another way, the Station ID and timestamp columns work in the 
capability of a primary key.
I was merely setting a scene in my orginal email. What I was trying to say is 
"UPSERT only makes sense when each row is uniquely identifiable".
So please don't get me wrong there, I am not arguing that being able to do on 
conflict on any unique index is wrong or bad.

Cheers,
Louis Tian

> On Apr 6, 2023, at 4:49 PM, Louis Tian <louis.t...@aquamonix.com.au> wrote:
>
> Hi Peter,
>
> Thanks for your reply. Appreciate the help and discussion.
>
>>  In general UPSERT (or any definition of it that I can think of) does
>> not imply idempotency.
>
> "Idempotence is the property of certain operations in mathematics and 
> computer science whereby they can be applied multiple times without changing 
> the result beyond the initial application." from Wikipedia.
> the concept of Idempotence when applies to HTTP is consistent with the above. 
> https://developer.mozilla.org/en-US/docs/Glossary/Idempotent. Or are you 
> going by a different defintion that I am not aware of?
> If you execute the same upsert multiple times, the state of the database 
> remains the same as if only execute once.

Not necessarily. Consider the following UPSERT statement:

INSERT INTO customer_order_counts (customer_id,order_count) VALUES (123456, 1) 
ON CONFLICT (customer_id) DO UPDATE SET 
order_count=customer_order_counts.order_count+1;

That is completely valid (I tested it), and actually makes sense as something 
you might want to do - keep track of how many orders a customer has placed, for 
example if you only keep the order records for 6 months, but still want to know 
the total number of orders the customer has placed. If it is a new customer, 
you insert a record for the customer with an order count of 1. Otherwise, you 
update the record to increment the order count. Clearly this is NOT an 
idempotent operation - every time you run it, it changes the order count, so 
the state of the database does NOT remain the same as if you only execute it 
once.

> If a row already exists, the first statement will update the row so does any 
> subsequent statements. executing the same update multiple time is the same as 
> executing it only once.
> If the row doesn't exist, the first statement will insert that row and any 
> subsequent will try to update, but the update has no real effect since it the 
> value is exactly the same as the insert.
> So by defintion, upsert is idempotent.
>
>> It could just be a unique index or a unique constraint. So you can
>> upsert on any individual unique constraint/index, or the primary key.
>> Of course there might be several on a given table, but you can only
>> use one as the "conflict arbiter" per statement.
>
> Understand that I can use any unique constraint with on conflict.
> But semantically the only correct one is the primary key, since that's what 
> identifies a row logically.

Not really. Generally when I am doing an UPSERT, I am NOT using the primary 
key, but rather some other UNIQUE-ly indexed column(s). My primary key is 
typically an ID column that is defined as a serial, and automatically generated 
by the database. The unique column I use for the upset, however, is generally 
something that would identify the row to a human - such as Station ID and 
timestamp  columns in a database I have of seismic readings. Each reading gets 
a unique ID (the primary key) that identifies it to the database and is used in 
joins. However, occasionally a reading gets updated, so I update the row, based 
not on the primary key (which I don’t know for incoming data), but on the 
station and timestamp. UPSERT, but not based on the primary key.

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

> In that sense, any unique column(s) is a potential candidate for primary key.
> It's more of a pedantic point rather than pragmatic one.
> It's less of a problem for PostgreSQL where the semantic importance of 
> primary key is not manifested at implementation level, since all index points 
> to the tuple directly
> Whereas it is more import for Databaes like MySQL where the secondary index 
> points to the primary key index.
--
Message  protected by MailGuard: e-mail anti-virus, anti-spam and content 
filtering.https://www.mailguard.com.au/mg
Click here to report this message as spam:
https://console.mailguard.com.au/ras/25ah5yO8hL/7ACQJux9EWGExcgeAdX4X1/5.6

Re: [EXTERNAL]: Re: [EXTERNAL]: Re: UPSERT in Postgres

Reply via email to