On Mon, May 5, 2025 at 1:56 AM Mikhail Kharitonov
<[email protected]> wrote:
>
> Hi hackers,
>
> An inconsistency was observed when using logical replication on partitioned
> tables with the option `publish_via_partition_root = true`: if REPLICA
> IDENTITY
> FULL is set only on the parent table, but not on all partitions, logical
> decoding emits UPDATE and DELETE messages with tag 'O' (old tuple) even for
> partitions that do not have full replica identity. In those cases, only the
> primary key columns are included in the message, which contradicts the
> expected
> meaning of 'O' and violates the logical replication message protocol:
>
> https://www.postgresql.org/docs/current/protocol-logicalrep-message-formats.html
>
> This can cause issues in downstream consumers, which interpret
> the 'O' tag as implying that a full tuple is present.
>
> The attached patch resolves the inconsistency by selecting the correct tuple
> type ('O' vs 'K') based on the replica identity of the actual leaf relation
> being published, rather than using the setting of the root relation alone.
> As a result, the format of logical replication messages aligns with
> the semantics
> defined by the protocol.
>
> Steps to reproduce:
>
> 1. Create a partitioned table with REPLICA IDENTITY FULL on the parent
> and only one of the partitions.
>
> 2. Create a publication with `publish_via_partition_root = true`.
>
> 3. Perform INSERT, UPDATE, DELETE operations through the root table.
>
> 4. Observe via `pg_recvlogical` that for a partition without full replica
> identity, the logical replication stream contains 'O' records with
> only key fields.
I tested this scenario but what I've seen in my env is somewhat
different from the above analysis; pgoutput plugin writes 'O' records
as you mentioned, but it doesn't omit non-key fields, but writes NULL
as non-key fields. Here are my reproducible steps:
create table p (a int not null, b int) partition by list (a);
create table c1 partition of p for values in (1);
create table c2 partition of p for values in (2);
create unique index on c2 (a);
alter table p replica identity full;
alter table c1 replica identity full;
alter table c2 replica identity using INDEX c2_a_idx ;
insert into p values (1, 10), (2, 20);
create publication pub for all tables with (publish_via_partition_root
= 'true');
select pg_create_logical_replication_slot('sub', 'pgoutput');
delete from p where a = 1;
delete from p where a = 2;
select encode(data, 'escape') from
pg_logical_slot_peek_binary_changes('sub', null, null,
'proto_version', '1', 'publication_names', 'pub');
The last pg_logical_slot_peek_binary_changes() writes the two 'D'
(delete) messages:
1. D\000\000@\000O\000\x02t\000\000\000\x011t\000\000\000\x0210
2. D\000\000@\000O\000\x02t\000\000\000\x012n
What we can know from these messages are:
- Both messages have 'O'.
- Both messages have two columns ('\000\x02').
- The first message has: the first column '1' (length is 1
('\000\000\000\x01')), and the second column '10' (length is 2
('\000\000\000\x02')).
- The second message has: the first column '2', and the second column
NULL ('n').
>From these facts, I guess there could be problematic cases even in the
native logical replication. Here are reproducible steps:
-- Publisher
create table p (a int not null, b int) partition by list (a);
create table c1 partition of p for values in (1);
create table c2 partition of p for values in (2);
create unique index on c2 (a);
alter table p replica identity full;
alter table c1 replica identity full;
alter table c2 replica identity using INDEX c2_a_idx ;
insert into p values (1, 10), (2, 20);
create publication pub for all tables with (publish_via_partition_root
= 'true');
-- Subscriber
create table p (a int, b int, c int);
create subscription sub connection 'dbname=postgres port=5551' publication pub;
-- Publisher
delete from p where a = 1; -- generate a message 'DELETE (1, 10)'
delete from p where a = 2; -- generate a message 'DELETE (2, NULL)'
The second delete message cannot find the tuple on the subscriber, so
the table contents are now inconsistent between the publisher and the
subscriber. I need more investigation to verify that it's a problem,
but this behavior doesn't change even with the proposed change.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com