Hi Robert,
Thanks for the feedback and suggestions.
> I'm somewhat less of a noob here, so I'll confirm that this proposal
> has basically zero chance of getting in, at least for the v19 cycle.
> This isn't so much about the proposal itself, but more in that if you
> were trying to pick the worst time of year to submit a large,
> complicated feature into the postgresql workflow, this would be really
> close to that.
> However, I have also wondered about this specific trade-off (FPW vs
> DWB) for years, but until now, the level of effort required to produce
> a meaningful POC that would confirm if the idea was worth pursuing was
> so large that I think it stopped anyone from even trying. So,
> hopefully everyone will realize that we don't live in that world
> anymore, and as a side benefit, apparently the idea is worth pursuing.
I completely understand, and I actually have no intention of pushing
this patch for the v19 cycle. My primary goal right now is simply to
share the POC results and discuss the idea with the community to see
if this direction is worth pursuing.
For context, I have been a MySQL InnoDB developer for over 10 years,
but I admit I am a newcomer to the PostgreSQL community, so I am still
familiarizing myself with the standard workflow and processes here.
>> I think it would be valuable to have this as I've been hit by PostgreSQL's
>> unsteady (chain-saw-like) WAL traffic, especially related to touching pages
>> for the first time after a checkpoint, up to the point of saturating
>> network links. The common counter-argument to double buffering is probably
>> that FPI may(?) increase the WAL standby replication rate and this would
>> have to be measured (but we should also take into account how much the
>> maintenance_io_concurrency/posix_fadvise() prefetching that we do today
>> helps avoid any I/O stalls on fetching pages - so it should be basically
>> free). I see that you even got benefits by not using FPI. Interesting.
>>
>> Some notes/questions about the patches itself:
>>
> So, I haven't looked at the code itself; tbh I am a bit too
> paranoid to dive into generated code that would seem to carry some
> likely level of legal risk around potential reuse of GPL/proprietary
> code it might be based on (either in its original training, inference,
> or context used for generation. Yeah, I know innodb isn't written in
> C, but still). That said, I did have some feedback and questions on
> the proposal itself, and some suggestions for how to move things
> forward.
>
>> 0. The convention here is to send the patches using:
>>    git format-patch -v<VERSION> HEAD~<numberOfpatches>
>> for easier review. The 0003 probably should be out of scope. Anyway I've
>> attached all of those so maybe somebody else is going to take a look at
>> them too; they look very mature. Is this code used in production already
>> anywhere? (and BTW the numbers are quite impressive)
>>
> While Jakub is right that the convention is to send patches, that
> convention is based on a manual development model, not an agentic
> development model. While there is no official project policy on this,
> IMHO the thing we really need from you is not the code output, but the
> prompts that were used to generate the code. There are plenty of folks
> who have access to claude that could then use those prompts to
> "recreate with enough proximity" the work you had claude do, and that
> process would also allow for additional verification and reduction of
> any legal concerns or concerns about investing further human
> time/energy. (No offense, but as you are not a regular contributor,
> you could analogize this to when third parties do large code dumps and
> say "here's a contribution, it's up to you to figure out how to use
> it". Ideally we want other folks to be able to pick up the project and
> continue with it, even if it means recreating it, and that works best
> if we have the underlying prompts).
> The claude code configuration file is a good start, but certainly not
> enough. Probably the ideal here would be full session logs, although a
> developer-diary would probably also suffice. I'm kind of guessing here
> because I don't know the scope of the prompts involved or how you were
> interacting with Claude in order to get where you are now, but those
> seem like the more obvious tools for work of this size whose intention
> is to be open.
Regarding the AI-generated code, the raw output from Claude was far
from perfect. I have manually reviewed and modified the code
extensively to get it to this state.
Our plan is to first deploy and test this thoroughly on our own
product, Alibaba Cloud RDS for PostgreSQL. Once we are confident that
it is stable and issue-free, we intend to submit a formalized patch
to the community. I am very much looking forward to discussing and
reviewing the actual code with you all when the time comes.
As for sharing the prompts or session logs, I personally feel they
might not be as valuable as the final code itself. The generation
process involved a lot of iterative, back-and-forth communication;
the AI only arrived at the right modifications after continuous
human guidance, correction, and architectural decisions.
> It would be helpful if you could provide a little more information on
> the system you are running these benchmarks on, specifically for me
> the underlying OS/Filesystem/hardware, and I'd even be interested in
> the build flags. I'd also be interested to know if you did any kind of
> crash safety testing... while it is great to have improved
> performance, presumably that isn't actually the primary point of these
> subsystems. It'd also be worth knowing if you tested this on any
> systems with replication (physical or logical) since we'd need to
> understand those potential downstream effects. I'm tempted to say you
> should have an AI generate some pgbench scripts. Granted it's early and
> fine if you haven't done any of this, but I imagine we'll need to look at
> it eventually.
I have addressed the feedback and conducted comprehensive benchmarks
comparing the three io_torn_pages_protection modes. Here are the
detailed performance results and the system setup information you
requested.
Benchmark Setup:
- Hardware: x86_64, Linux 5.10, NVMe SSD
- PostgreSQL: 19devel (with DWB patch applied)
- Tool: pgbench (TPC-B), 64 clients, 8 threads, 60 seconds per run
- Common config: shared_buffers = 1GB, wal_level = replica
- Three modes tested:
* io_torn_pages_protection = full_pages (traditional FPW)
* io_torn_pages_protection = double_writes (DWB size = 128MB)
* io_torn_pages_protection = off (no protection, baseline)
Each test was run sequentially on the same machine to avoid I/O
contention.
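Concretely, the server settings for these runs boil down to the
following (a condensed sketch of the configuration used; only the
protection mode and max_wal_size change between runs):

```
# Common settings for all benchmark runs
shared_buffers = 1GB
wal_level = replica
max_wal_size = 10GB                        # 64MB for the frequent-checkpoint runs

# Exactly one protection mode per run:
io_torn_pages_protection = full_pages      # traditional full-page writes
#io_torn_pages_protection = double_writes  # double-write buffer (128MB here)
#io_torn_pages_protection = off            # no torn-page protection (baseline)
```

Each database was initialized with pgbench -i -s 100 (or -s 10) and
driven with pgbench -c 64 -j 8 -T 60.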
Test 1: pgbench scale=100 (~1.5GB dataset), max_wal_size = 10GB
With infrequent checkpoints, FPW overhead is minimal and all three
modes perform similarly:
Mode           TPS      Latency(avg)  WAL Size  FPI Count  FPI Size
-------------  -------  ------------  --------  ---------  --------
full_pages     103,290  0.610 ms      3,903 MB  191,341    1,456 MB
double_writes  104,088  0.606 ms      2,475 MB  0          0
off            104,622  0.602 ms      2,510 MB  0          0
DWB vs FPW: +0.8% TPS, WAL reduced by 36.6%.
Test 2: pgbench scale=100 (~1.5GB dataset), max_wal_size = 64MB
With frequent checkpoints (triggered every ~64MB of WAL), the FPW write
amplification becomes severe:
Mode           TPS      Latency(avg)  WAL Size  FPI Count  FPI Size
-------------  -------  ------------  --------  ---------  --------
full_pages      54,324  1.171 ms      29 GB     3,806,504  28 GB
double_writes   93,942  0.672 ms      2,303 MB  2          16 kB
off            108,901  0.578 ms      2,746 MB  0          0
DWB vs FPW: +72.9% TPS, latency reduced by 42.6%, WAL reduced by 92.2%.
Test 3: pgbench scale=10 (~150MB dataset), max_wal_size = 64MB
Even with a smaller dataset that fits in shared_buffers, the advantage
is clear:
Mode           TPS     Latency(avg)  WAL Size  FPI Count  FPI Size
-------------  ------  ------------  --------  ---------  --------
full_pages     33,707  1.895 ms      15 GB     2,010,439  14 GB
double_writes  43,743  1.459 ms      1,140 MB  0          0
off            43,982  1.452 ms      1,150 MB  0          0
DWB vs FPW: +29.8% TPS, WAL reduced by 92.6%.
Test 4: sysbench oltp_write_only (10 tables x 100K rows), 64 threads, 30s
               max_wal_size=10GB     max_wal_size=64MB
               -------------------   -------------------
Mode           TPS       FPI Count   TPS       FPI Count
-------------  --------  ---------   --------  ---------
full_pages     138,019   32,253      40,187    3,455,253
double_writes  136,642   0           133,034   0
With large max_wal_size: DWB and FPW perform equally (-1.0%).
With small max_wal_size: DWB is +231% faster (3.3x).
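As a sanity check, the headline deltas quoted under each test can be
recomputed from the raw table values (plain arithmetic over the numbers
above, not new measurements; the reported GB figures are treated as GiB):

```python
# Recompute the DWB-vs-FPW deltas from the raw benchmark table values.
def pct_gain(new, old):
    """Relative gain of `new` over `old`, in percent."""
    return (new / old - 1) * 100

def pct_drop(new, old):
    """Relative reduction of `new` vs `old`, in percent."""
    return (1 - new / old) * 100

GB = 1024  # MB per GB, treating reported "GB" as GiB

# Test 1: scale=100, max_wal_size=10GB
print(round(pct_gain(104088, 103290), 1))   # 0.8   (+0.8% TPS)
print(round(pct_drop(2475, 3903), 1))       # 36.6  (WAL reduction %)

# Test 2: scale=100, max_wal_size=64MB
print(round(pct_gain(93942, 54324), 1))     # 72.9  (+72.9% TPS)
print(round(pct_drop(0.672, 1.171), 1))     # 42.6  (latency reduction %)
print(round(pct_drop(2303, 29 * GB), 1))    # 92.2  (WAL reduction %)

# Test 3: scale=10, max_wal_size=64MB
print(round(pct_gain(43743, 33707), 1))     # 29.8  (+29.8% TPS)
print(round(pct_drop(1140, 15 * GB), 1))    # 92.6  (WAL reduction %)

# Test 4: sysbench, large vs small max_wal_size
print(round(pct_gain(136642, 138019), 1))   # -1.0  (parity at 10GB)
print(round(pct_gain(133034, 40187)))       # 231   (+231%, i.e. 3.3x, at 64MB)
```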
Analysis:
The key factor is checkpoint frequency. FPW must write a full 8KB page
image to WAL for every page's first modification after each checkpoint.
When checkpoints are frequent:
- FPI count explodes (191K -> 3.8M at scale=100 when max_wal_size
  drops from 10GB to 64MB)
- WAL becomes dominated by FPI data (28GB out of 29GB = 96.6%)
- This creates massive write amplification on the WAL path
DWB avoids this entirely. Its WAL size stays constant regardless of
checkpoint frequency (~2.3GB vs FPW's 29GB). The double-write buffer
itself wrote 42GB in the heavy test, but since it uses sequential writes
with batched fsync, the overhead is modest — DWB achieves 86.3% of the
"no protection" baseline (93,942 vs 108,901 TPS).
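The dominance of FPIs is easy to model from first principles: every
first post-checkpoint touch of a page logs a full 8KB block image, so
FPI count x 8KB approximates the extra WAL volume (a back-of-the-envelope
check against the Test 2 numbers; the reported FPI size comes in slightly
under the model because hole compression elides the unused middle of each
page image):

```python
# Back-of-the-envelope: WAL volume contributed by FPIs ~= count * block size.
PAGE = 8192      # PostgreSQL block size in bytes
GIB = 1024 ** 3

fpi_count = 3_806_504                    # Test 2, full_pages mode
print(round(fpi_count * PAGE / GIB, 1))  # 29.0  (GiB of raw page images)

# FPIs dominate the WAL stream: 28 GB of FPI data out of 29 GB total.
print(round(28 / 29 * 100, 1))           # 96.6  (percent of WAL that is FPI)

# DWB cost relative to running with no protection at all (Test 2 TPS):
print(round(93942 / 108901 * 100, 1))    # 86.3  (percent of baseline TPS)
```

Anyone can extract equivalent FPI counts and sizes from their own WAL
with pg_waldump --stats.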
When does DWB matter most?
- Large active datasets that exceed shared_buffers
- Frequent checkpoints (small max_wal_size or short checkpoint_timeout)
- Write-heavy workloads
- Replication scenarios where WAL volume directly impacts network
In production environments where max_wal_size is often set
conservatively (e.g., 1GB) and datasets are much larger than
shared_buffers, DWB should provide significant and consistent benefits
over FPW. As for the crash safety testing you mentioned, it is on our
roadmap as we continue to refine the patch for our internal RDS
deployment.
Regards,
Baotiao