Hi hackers,

I found this while reading the related code. Here is the scenario:

    bool
    RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
                        bool retryOnError)

When retryOnError is true, this function calls ForwardSyncRequest() in a
loop until it succeeds. However, if ForwardSyncRequest() takes the branch
below, it returns false, and RegisterSyncRequest() has to keep looping until
the checkpointer absorbs the existing requests. RegisterSyncRequest() might
therefore hang for some time, until a checkpoint request is triggered. This
scenario seems possible in theory, though the chance is not high.

ForwardSyncRequest():

    if (CheckpointerShmem->checkpointer_pid == 0 ||
        (CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
         !CompactCheckpointerRequestQueue()))
    {
        /*
         * Count the subset of writes where backends have to do their own
         * fsync
         */
        if (!AmBackgroundWriterProcess())
            CheckpointerShmem->num_backend_fsync++;
        LWLockRelease(CheckpointerCommLock);
        return false;
    }

One fix is to add code similar to the following in RegisterSyncRequest(), to
trigger a checkpoint in this scenario:

    /* checkpointer_triggered: variable for one trigger only. */
    if (!ret && retryOnError && ProcGlobal->checkpointerLatch &&
        !checkpointer_triggered)
        SetLatch(ProcGlobal->checkpointerLatch);

Any comments?

Regards,
Paul Guo (VMware)