Ming,

I recently discovered a bug in the FUA code (a bcachefs change exposed it), and my best guess is that it's related to your recent changes to blk-flush.c.
What I'm seeing is that if all writes are issued as FUA writes, the request queue gets stuck within a short period of time: writes are on the queue, but they aren't being issued or completed. This is with an AHCI device - so no blk-mq, and it's emulating FUA with flushes.

You ought to be able to reproduce this yourself by changing generic_make_request() to make all writes FUA, and then just doing O_DIRECT writes with dd or something.

I suspect that if there are non-FUA flushes being issued, they'll end up kicking the queue and keeping things from getting stuck. In my testing I'm only seeing things get completely stuck when testing bcachefs in multi-device mode, with no metadata or journal IO to the device in question, just FUA data writes.

After things get stuck, with kgdb I'm seeing a request on the request queue that has flush_data_end_io for its endio function.

I'm still trying to figure out how the flush machinery is supposed to work, so I'm not sure what else you'd want to know. Much appreciated if you could take a look.
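
P.S. In case it helps with reproducing, here is a rough sketch of the kind of O_DIRECT write load I mean. It's only an illustration, not what I actually ran: the function name write_blocks and its parameters are made up for this example, and it does nothing about FUA by itself (that still needs the generic_make_request() change). Something like "dd if=/dev/zero of=<device> bs=4k oflag=direct" should do the same job.

```python
import mmap
import os


def write_blocks(path, block_size=4096, count=1024,
                 extra_flags=getattr(os, "O_DIRECT", 0)):
    """Write `count` zero-filled blocks sequentially to `path`.

    Hypothetical helper for illustration only. With O_DIRECT the buffer,
    the offset and the length all have to be suitably aligned, so the
    buffer comes from an anonymous mmap (page aligned) and block_size
    should be a multiple of the device's logical block size.
    Returns the total number of bytes written.
    """
    buf = mmap.mmap(-1, block_size)  # page-aligned, zero-filled
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | extra_flags, 0o600)
    try:
        written = 0
        for i in range(count):
            # pwrite keeps each write at an explicit, aligned offset
            written += os.pwrite(fd, buf, i * block_size)
        return written
    finally:
        os.close(fd)
        buf.close()
```

Pointing this at a block device overwrites whatever is on it, so it should only be run against a scratch device (for example write_blocks("/dev/sdX"), where /dev/sdX is a placeholder).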