Hi hackers,

While using the pg_enable_data_checksums() feature, I found a likely bug, a
race condition in  datachecksum_state.c's launcher_exit().

When pg_enable_data_checksums() is called twice before the first launcher
starts, two bg workers are registered (the code expects this).  The
redundant launcher exits early, but it's launcher_exit() callback
unconditionally clears the shared launcher_running flag and may call
SetDataChecksumsOff() -- even though it never owned the flag.

This allows a third pg_enable_data_checksums() call to launch another
launcher concurrently with the first (duplicate work, doubled I/O, spurious
warnings).  Worse, if the redundant launcher initialized after the winner
transitioned to inprogress-on, its exit handler calls
SetDataChecksumsOff(), silently aborting the enable operation.  (I have
not triggered the SetDataChecksumsOff part though calling out ad it can be
a likely scenario based on timing of workers)

Reproduced by firing three calls in quick succession:

  psql -c "SELECT pg_enable_data_checksums();" &
  psql -c "SELECT pg_enable_data_checksums();" &
  sleep 0.5
  psql -c "SELECT pg_enable_data_checksums();" &

Log shows two launchers processing databases concurrently:

  [2093292] LOG:  enabling data checksums requested
  [2093293] LOG:  already running, exiting
  [2093299] LOG:  enabling data checksums requested     -- third launcher
admitted
  [2093292] LOG:  processing database "postgres"
  [2093299] LOG:  processing database "postgres"        -- same DB,
concurrently
  [2093299] WARNING:  cannot set data checksums to "on", current state is
not "inprogress-on"

I think the process-local launcher_running flag exists for this purpose and
is already used for the worker-kill block, but the flag-clear and
state-revert blocks do not use it.

The attached patch returns early from launcher_exit() when the local flag
is false. Thoughts?

Regards,
Ayush

Attachment: 0001-Fix-race-in-online-checksums-launcher_exit.patch
Description: Binary data

Reply via email to