Re: Adding REPACK [concurrently]

Robert Treat Sat, 26 Jul 2025 18:59:40 -0700

On Sat, Jul 26, 2025 at 5:56 PM Alvaro Herrera <[email protected]> wrote:
>
> Hello,
>
> Here's a patch to add REPACK and eventually the CONCURRENTLY flag to it.
> This is coming from [1].  The ultimate goal is to have an in-core tool
> to allow concurrent table rewrite to get rid of bloat; right now, VACUUM
> FULL does that, but it's not concurrent.  Users have resorted to using
> the pg_repack third-party tool, which is ancient and uses a weird
> internal implementation, as well as pg_squeeze, which uses logical
> decoding to capture changes that occur during the table rewrite.  The
> patch submitted here, largely by Antonin Houska with some changes by me,
> is based on the the pg_squeeze code which he authored, and first
> introduces a new command called REPACK to absorb both VACUUM FULL and
> CLUSTER, followed by addition of a CONCURRENTLY flag to allow some forms
> of REPACK to operate online using logical decoding.
>
> Essentially, this first patch just reshuffles the CLUSTER code to create
> the REPACK command.
>


Thanks for keeping this ball rolling.

>
> My other change to Antonin's patch is that I made REPACK USING INDEX set
> the 'indisclustered' flag to the index being used, so REPACK behaves
> identically to CLUSTER.  We can discuss whether we really want this.
> For instance we could add an option so that by default REPACK omits
> persisting the clustered index, and instead it only does that when you
> give it some special option, say something like
>   "REPACK (persist_clustered_index=true) tab USING INDEX idx"
> Overall I'm not sure this is terribly interesting, since clustered
> indexes are not very useful for most users anyway.
>

I think I would lean towards having it work like CLUSTER (preserve the
index), since that helps people making the transition, and it doesn't
feel terribly useful to invent new syntax for a feature that I would
agree isn't very useful for most people.

> I made a few other minor changes not worthy of individual mention, and
> there are a few others pending, such as updates to the
> pg_stat_progress_repack view infrastructure, as well as phasing out
> pg_stat_progress_cluster (maybe the latter would offer a subset of the
> former; not yet sure about this.)  Also, I'd like to work on adding a
> `repackdb` command for completeness.
>
> On repackdb: I think is going to be very similar to vacuumdb, mostly in
> that it is going to need to be able to run tasks in parallel; but there
> are things it doesn't have to deal with, such as analyze-in-stages,
> which I think is a large burden.  I estimate about 1k LOC there,
> extremely similar to vacuumdb.  Maybe it makes sense to share the source
> code and make the new executable a symlink instead, with some additional
> code to support the two different modes.  Again, I'm not sure about
> this -- I like the idea, but I'd have to see the implementation.
>
> I'll be rebasing the rest of Antonin's patch series afterwards,
> including the logical decoding changes necessary for CONCURRENTLY.  In
> the meantime, if people want to review those, which would be very
> valuable, they can go back to branch master from around the time he
> submitted it and apply the old patches there.
>

For clarity, are you intending to commit this patch before having the
other parts ready? (If that sounds like an objection, it isn't) After
a first pass, I think there's some confusing bits in the new docs that
could use straightening out, but there likely going to overlap changes
once concurrently is brought in, so it might make sense to hold off on
those. Either way I definitely want to dive into this a bit deeper
with some fresh eyes, there's a lot to digest... speaking of, for this
bit in src/backend/commands/cluster.c

+    switch (cmd)
+    {
+        case REPACK_COMMAND_REPACK:
+            return "REPACK";
+        case REPACK_COMMAND_VACUUMFULL:
+            return "VACUUM";
+        case REPACK_COMMAND_CLUSTER:
+            return "VACUUM";
+    }
+    return "???";

The last one should return "CLUSTER" no?


Robert Treat
https://xzilla.net

Re: Adding REPACK [concurrently]

Reply via email to