Hi Jakub,

Thanks for your detailed suggestions. Sorry for the delayed response; I was away for the Chinese Spring Festival.
Overall, my current work on this is mostly a proof of concept (POC). I believe that introducing a double write buffer (DWB) is a more reasonable design at this point, so I wanted to bring it up for discussion with the community. I am still modifying the code, and it has not been deployed to production yet. We are currently reviewing it, and we plan to deploy and release it on Alibaba Cloud RDS for PostgreSQL in the future.

> I'm a newbie here, but took your idea with some interest; probably everyone
> else is busy with work on other patches before the commit freeze.
>
> I think it would be valuable to have this, as I've been hit by PostgreSQL's
> unsteady (chain-saw-like) WAL traffic, especially related to touching pages
> for the first time after a checkpoint, up to the point of saturating network
> links. The common counter-argument to double buffering is probably that FPIs
> may(?) increase the WAL standby replication rate, and this would have to be
> taken into account when measuring (but we should also take into account how
> much the maintenance_io_concurrency / posix_fadvise() prefetching that we do
> today helps avoid I/O stalls on fetching pages - so it should be basically
> free). I see that you even got benefits by not using FPIs. Interesting.

Yes, I believe that using a DWB can reduce the WAL size, which also helps with PostgreSQL's replication synchronization latency.

> Some notes/questions about the patches themselves:
>
> 0. The convention here is to send the patches using:
>        git format-patch -v<VERSION> HEAD~<numberOfpatches>
>    for easier review. The 0003 probably should be out of scope. Anyway, I've
>    attached all of them, so maybe somebody else is going to take a look at
>    them too; they look very mature. Is this code used in production already
>    anywhere? (And BTW the numbers are quite impressive.)

Since my initial goal was just to start a discussion and provide some experimental data, I didn't organize a proper patch series for this mail thread.
As mentioned, this code is not yet used in production. However, we are reviewing it and will push it out for use in Alibaba Cloud RDS for PostgreSQL later (I am the manager of the Alibaba Cloud RDS team).

> 1. We have full_page_writes = on/off, but your patch adds double_write_buffer.
>    IMHO if we have competing solutions it would be better to have something
>    like
>        io_torn_pages_protection = off | full_pages | double_writes
>    and maybe we'll be able to add 'atomic_writes' one day.
>    BTW: once you stabilize the GUC, it is worth adding to
>    postgresql.conf.sample

That is a great idea; I will modify it accordingly later. I omitted this initially because I was mainly focusing on providing experimental data for the POC.

> 2. How would one know how to size double_write_buffer_size?

If we decide to merge this into upstream, I will provide a detailed design document covering the internal mechanisms, sizing guidelines, etc.

> 2b. IMHO the patch could have enriched pg_stat_io with some information.
>     Please take a look at the pg_stat_io view and functions like
>     pgstat_count_io_op_time() and their parameters and enums there; that way
>     we could maybe have IOOBJECT_DWBUF and be able to say how much I/O was
>     attributed to double-buffering, fsync() times related to it, and so on.

Agreed, these metrics need to be provided as well. I haven't modified that part yet for this POC.

> 3. In DWBufPostCheckpoint() there's pg_usleep(1ms) just before atomic
>    pwrites(), but exactly why is it necessary to have this literally
>    sched_yield(2) there?

Good catch, this will be revised in the next version.

> 4. In BufferSync() I have doubts whether such copying is safe in a loop:
>        page = BufHdrGetBlock(bufHdr);
>        memcpy(dwb_buf, page, BLCKSZ);
>    Shouldn't there be some form of locking (BUFFER_LOCK_SHARE?) / pinning
>    of buffers? Also, wouldn't it be better if that memcpy were guarded by a
>    critical section (START_CRIT_SECTION)?

Yes, this needs to be fixed.
> 4b. There seems to be double copying: there's a palloc for dwb_buf in
>     BufferSync() that is filled by memcpy(), and then DWBufWritePage() is
>     called and that "page" is copied a second time using memcpy(). This seems
>     to be done for every checkpoint page, so it may reduce the benefits of
>     this double-buffering code.

Yes, this also needs to be fixed to avoid the double copying.

> 4c. Shouldn't this active waiting in DWBufWritePage() be achieved using
>     spinlocks rather than pg_usleep(100us)?

Yes, that is doable. This part hasn't been carefully refined yet.

> 5. Have you maybe verified using injection points (or gdb) whether crashing
>    in several places really hits that DWBufRecoverPage()? Is there a simple
>    way of reproducing this to play with it? (Possibly that could be a good
>    test on its own.)

OK, I will work on providing a way to test this using injection points in subsequent updates.

> 6. Quick testing overview (for completeness)
>    - a basic test without even enabling this feature complains about
>      postgresql.conf.sample (test_misc/003_check_guc)
>    - with `PG_TEST_INITDB_EXTRA_OPTS="-c double_write_buffer=on" meson test`
>      I've got 3 failures:
>      * test_misc/003_check_guc (expected)
>      * pg_waldump/002_save_fullpage (I would say it's expected)
>      * pg_walinspect / pg_walinspect/regress (I would say it's expected)

OK, I will address these test failures in the next version.

> I haven't really got it up and running for real, but at least that's some
> start and I hope that helps.

Thanks again for the review.

Regards,
Baotiao
