Hi hackers, While using the pg_enable_data_checksums() feature, I found a likely bug, a race condition in datachecksum_state.c's launcher_exit().
When pg_enable_data_checksums() is called twice before the first launcher starts, two bg workers are registered (the code expects this). The redundant launcher exits early, but it's launcher_exit() callback unconditionally clears the shared launcher_running flag and may call SetDataChecksumsOff() -- even though it never owned the flag. This allows a third pg_enable_data_checksums() call to launch another launcher concurrently with the first (duplicate work, doubled I/O, spurious warnings). Worse, if the redundant launcher initialized after the winner transitioned to inprogress-on, its exit handler calls SetDataChecksumsOff(), silently aborting the enable operation. (I have not triggered the SetDataChecksumsOff part though calling out ad it can be a likely scenario based on timing of workers) Reproduced by firing three calls in quick succession: psql -c "SELECT pg_enable_data_checksums();" & psql -c "SELECT pg_enable_data_checksums();" & sleep 0.5 psql -c "SELECT pg_enable_data_checksums();" & Log shows two launchers processing databases concurrently: [2093292] LOG: enabling data checksums requested [2093293] LOG: already running, exiting [2093299] LOG: enabling data checksums requested -- third launcher admitted [2093292] LOG: processing database "postgres" [2093299] LOG: processing database "postgres" -- same DB, concurrently [2093299] WARNING: cannot set data checksums to "on", current state is not "inprogress-on" I think the process-local launcher_running flag exists for this purpose and is already used for the worker-kill block, but the flag-clear and state-revert blocks do not use it. The attached patch returns early from launcher_exit() when the local flag is false. Thoughts? Regards, Ayush
0001-Fix-race-in-online-checksums-launcher_exit.patch
Description: Binary data
