On Tue, Jan 27, 2026 at 8:25 PM vignesh C <[email protected]> wrote:
>
> On Fri, 23 Jan 2026 at 18:41, vignesh C <[email protected]> wrote:
> >
> > On Wed, 21 Jan 2026 at 11:35, Dilip Kumar <[email protected]> wrote:
> > >
> > > On Mon, Jan 19, 2026 at 3:08 PM shveta malik <[email protected]>
> > > wrote:
> > > >
> > > > Approaches for Supporting EXCEPT in Partitioned Tables
> > > > ------------------------------------------------------------------------
> > > >
> > > > In an offline discussion with Peter Smith, Amit, and Shlok, we
> > > > identified several approaches for supporting EXCEPT with partitioned
> > > > tables and their partitions. I’d like to hear others’ opinions on
> > > > these approaches.
> > > >
> > > > Consider the following partition hierarchy:
> > > > tab_root
> > > > ├─ tab_part_1
> > > > │ ├─ tab_part_1_p1
> > > > │ └─ tab_part_1_p2
> > > > └─ tab_part_2
> > > > ├─ tab_part_2_p1
> > > > └─ tab_part_2_p2
> > > >
> > > >
> > > > Approach 1:
> > > > ---------------------------------
> > > > If we exclude a table, then the data in that table and all of its
> > > > partitions (i.e., the entire subtree under that table) should not be
> > > > replicated.
> > > >
> > > > For example EXCEPT (tab_part_1) skips replication of tab_part_1 and
> > > > all of its partitions.
> > > >
> > > > This behaviour remains the same with or without
> > > > publish_via_partition_root. The publish_via_partition_root flag only
> > > > affects publish_via_relid, i.e., the relation through which data is
> > > > published.
> > > >
> > > > This approach involves certain implementation challenges. For brevity,
> > > > these are documented in the attached 'Approach1_challenges' document.
> > > >
> > > > Approach 2:
> > > > ---------------------------------------------------
> > > > Assign meaning to ONLY and '*' for partition tables in the EXCEPT
> > > > list. In HEAD, ONLY and '*' do not have any meaning for partitioned
> > > > tables or partitions, and these keywords are currently ignored.
> > > >
> > > > Examples:
> > > > 1. EXCEPT (ONLY tab_part_1) skips replication of only the table
> > > > tab_part_1. Changes for tab_root, tab_part_1_p1, and tab_part_1_p2 are
> > > > still replicated.
> > > >
> > > > ii. EXCEPT (tab_part_1*) skips replication of tables tab_part_1,
> > > > tab_part_1_p1, and tab_part_1_p2
> > > >
> > > > The challenges described in Approach 1, particularly around tablesync
> > > > handling and COPY behaviour, would still need to be addressed under
> > > > this approach as well. ONLY or '*' with partitioned tables is not
> > > > supported in HEAD, supporting it specifically for ALL TABLES EXCEPT
> > > > may introduce additional confusion for users.
> > > >
> > > > Approach 3:
> > > > ----------------
> > > > Do not allow partitions to be specified in the EXCEPT clause.
> > > >
> > > > Only EXCEPT (tab_root) is supported, which excludes tab_root and all
> > > > of its partitions. Specifying EXCEPT (tab_part_1) or EXCEPT
> > > > (tab_part_1_p1) will result in an error.
> > > >
> > > > ~~
> > > >
> > > > While Approach 1 and Approach 2 offer more flexibility to the user
> > > > compared to Approach 3, they also introduce additional design
> > > > complexity which does not seem simpler to address.
> > >
> > > Thanks for explaining this, overall I like the Approach 1, and I also
> > > see the problem when publish via root is given in that case COPY FROM
> > > is executed on the root and it would be hard to exclude specific
> > > partitions.
> >
> > Regarding the above issue which is also mentioned in
> > Approach1_challenges at [1]:
> > When a publication is created with publish_via_partition_root = true
> > and a specific partition(tab_part_1_1) is excluded, the expected
> > behavior is that changes from non-excluded partitions (for example,
> > tab_part_2 and tab_part_1_2 and their descendants) are replicated,
> > while changes from the excluded partition (tab_part_1_1 and its
> > subtree) are not.
> > tab_root
> > ├── tab_part_1
> > │ ├── tab_part_1_1 (except)
> > │ │ ├── tab_part_1_1_1
> > │ │ │ └── tab_part_1_1_1_1
> > │ │ └── tab_part_1_1_2
> > │ └── tab_part_1_2
> > │ ├── tab_part_1_2_1
> > │ └── tab_part_1_2_2
> > └── tab_part_2
> >
> > In this situation, replication cannot be performed purely via the
> > partition root (tab_root), because doing so would implicitly include
> > data from the excluded child partitions.
> >
> > To address this, the publication creation should explicitly record the
> > excluded partition(tab_part_1_1) in pg_publication_rel with an
> > excluded = true flag. The publish_via_partition_root setting remains
> > stored at the publication level, as it is today. With
> > publish_via_partition_root = true, the publisher–subscriber mapping is
> > not partition-to-partition. Instead, all eligible data is mapped to
> > the subscriber’s partition root. Therefore,
> > pg_get_publication_tables() should return only the top-level root
> > table (tab_root) to the subscriber for table synchronization. During
> > initial table sync, when the tablesync worker prepares the COPY
> > command, it can query the publisher to determine the effective set of
> > tables that belong to the publication after applying the exclusion
> > rules. Based on this resolved table list, the tablesync worker can
> > construct a COPY query that unions data only from the non-excluded
> > partitions, for example:
> > COPY (
> > SELECT * FROM tab_part_1_2_1
> > UNION ALL
> > SELECT * FROM tab_part_1_2_2
> > UNION ALL
> > SELECT * FROM tab_part_2
> > )
> >
> > This ensures that only non-excluded data is copied and applied to
> > tab_root on the subscriber, while preserving the semantics of
> > publish_via_partition_root = true.
I agree with the suggested changes in tablesync. It will be good if we
can add these details in the commit-msg section of the patch. Also
please mention how increment replication is impacted (or supposed to
work) with Approach1.
> Here is a patch which has the changes to handle the same.
>
Thank You for the patch.
1)
There are certain parts of Approach 3 still present in Approach 1, as
an example:
1a)
+ For partitioned tables, only the root partitioned table may be specified
+ in <literal>EXCEPT TABLE</literal>.
1b)
+ /*
+ * Only the topmost ancestor of a partitioned table can be specified
+ * in EXCEPT TABLES clause of a FOR ALL TABLES publication. So fetch
+ * the publications excluding the topmost ancestor only.
+ */
+ GetRelationPublications(llast_oid(ancestors), NULL, &exceptpuboids);
+
1c)
+ /* Check if the partiton is part of EXCEPT list of any publication */
+ GetRelationPublications(RelationGetRelid(attachrel), NULL, &except_pubids);
+ if (except_pubids != NIL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot attach relation \"%s\" as partition because it is
part of EXCEPT list in publication",
+ RelationGetRelationName(attachrel))));
+
Overall, please take a diff of v35 and v37 to find such parts and
please correct these and others (if any).
2)
Also I don't think if below is correct statement for Approach 1:
+ * 2. For a partition, if the topmost ancestor is part of
+ * the EXCEPT TABLE list, we don't publish it.
Even if any ancestor is part of EXECPT list (not only top most) we
should not publish that partition, isn't it?
3)
I tried a scenario and found that incremental replication is not
working correctly. Attached the failing test as Approach1_v37_fail.txt
Once these basic things are corrected, I can review further.
thanks
Shveta
tab_root (RANGE range_col)
├── tab_part_1 (1–1000) PARTITION BY RANGE (i)
│ ├── tab_part_1_p1 (i 0–500)
│ └── tab_part_1_p2 (i 500–1000)
└── tab_part_2 (1000–2000) PARTITION BY RANGE (i)
├── tab_part_2_p1 (i 0–500)
└── tab_part_2_p2 (i 500–1000)
Sceanrio 1:
CREATE TABLE tab_root (range_col int, i int, j int) PARTITION BY RANGE
(range_col);
CREATE TABLE tab_part_1 PARTITION OF tab_root FOR VALUES FROM (1) TO (1000)
PARTITION BY RANGE (i);
CREATE TABLE tab_part_2 PARTITION OF tab_root FOR VALUES FROM (1000) TO (2000)
PARTITION BY RANGE (i);
-- Leaf partitions under tab_part_1
CREATE TABLE tab_part_1_p1 PARTITION OF tab_part_1 FOR VALUES FROM (0) TO (500);
CREATE TABLE tab_part_1_p2 PARTITION OF tab_part_1 FOR VALUES FROM (500) TO
(1000);
-- Leaf partitions under tab_part_2
CREATE TABLE tab_part_2_p1 PARTITION OF tab_part_2 FOR VALUES FROM (0) TO (500);
CREATE TABLE tab_part_2_p2 PARTITION OF tab_part_2 FOR VALUES FROM (500) TO
(1000);
-- Leaf partition: tab_part_1_p1 (range_col 1–1000, i 0–500)
INSERT INTO tab_root (range_col, i, j) VALUES (200, 250, 2);
-- Leaf partition: tab_part_1_p2 (range_col 1–1000, i 500–1000)
INSERT INTO tab_root (range_col, i, j) VALUES (250, 750, 11);
-- Leaf partition: tab_part_2_p1 (range_col 1000–2000, i 0–500)
INSERT INTO tab_root (range_col, i, j) VALUES (1300, 250, 21);
-- Leaf partition: tab_part_2_p2 (range_col 1000–2000, i 500–1000)
INSERT INTO tab_root (range_col, i, j) VALUES (1700, 750, 31);
CREATE PUBLICATION pub1 for all tables EXCEPT(tab_part_1,tab_part_2_p2) WITH
(PUBLISH_VIA_PARTITION_ROOT=true);
Sub:
CREATE TABLE tab_root (range_col int, i int, j int);
CREATE TABLE tab_part_1 (range_col int, i int, j int);
CREATE TABLE tab_part_2 (range_col int, i int, j int);
CREATE TABLE tab_part_1_p1 (range_col int, i int, j int);
CREATE TABLE tab_part_1_p2 (range_col int, i int, j int);
CREATE TABLE tab_part_2_p1 (range_col int, i int, j int);
CREATE TABLE tab_part_2_p2 (range_col int, i int, j int);
CREATE SUBSCRIPTION sub1 connection 'dbname=postgres host=localhost user=shveta
port=5433' publication pub1;
--Initial data belonging to tab_part_2_p1 alone should show up in table_root
select * from tab_root;
Pub:
--Now test incremental replication:
-- Leaf partition: tab_part_1_p1 (range_col 1–1000, i 0–500)
INSERT INTO tab_root (range_col, i, j) VALUES (200, 250, 2);
-- Leaf partition: tab_part_1_p2 (range_col 1–1000, i 500–1000)
INSERT INTO tab_root (range_col, i, j) VALUES (250, 750, 11);
-- Leaf partition: tab_part_2_p1 (range_col 1000–2000, i 0–500)
INSERT INTO tab_root (range_col, i, j) VALUES (1300, 250, 21);
-- Leaf partition: tab_part_2_p2 (range_col 1000–2000, i 500–1000)
INSERT INTO tab_root (range_col, i, j) VALUES (1700, 750, 31);
Sub:
-- Expectation: data belonging to tab_part_2_p1 alone should show up in
tab_root.
-- Actual: tab_part_2_p2's data also show up in tab_root
select * from tab_root;
postgres=# select * from tab_root;
range_col | i | j
-----------+-----+----
1300 | 250 | 21
1300 | 250 | 21
1700 | 750 | 31 --> tab_part_2_p2 data
(3 rows)
~~
Sceanrio 2:
CREATE PUBLICATION pub2 for all tables EXCEPT(tab_root) WITH
(PUBLISH_VIA_PARTITION_ROOT=true);
--This gives error on Pub, expectation is no error:
CREATE TABLE tab_top_root (range_col int, i int, j int) PARTITION BY RANGE
(range_col);
ALTER TABLE tab_top_root ATTACH PARTITION tab_root FOR VALUES FROM (0) TO
(2000);
ERROR: cannot attach relation "tab_root" as partition because it is part of
EXCEPT list in publication