On Wed, Apr 4, 2018 at 07:32:04PM +1200, Thomas Munro wrote:
> On Wed, Apr 4, 2018 at 6:00 PM, Craig Ringer <cr...@2ndquadrant.com> wrote:
> > On 4 April 2018 at 13:29, Thomas Munro <thomas.mu...@enterprisedb.com>
> > wrote:
> >> /* Ensure that we skip any errors that predate opening of the file */
> >> f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
> >>
> >> [...]
> >
> > Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel
> > will deliberately hide writeback errors that predate our fsync() call from
> > us?
>
> Predates the opening of the file by the process that calls fsync().
> Yeah, it sure looks that way based on the above code fragment.  Does
> anyone know better?

Uh, just to clarify, what is new here is that it is ignoring any
_errors_ that happened before the open().  It is not ignoring write()s
that happened but have not been written to storage before the open().

FYI, pg_test_fsync has always tested the ability to fsync() write()s
from other processes:

	Test if fsync on non-write file descriptor is honored:
	(If the times are similar, fsync() can sync data written on a different
	descriptor.)
	        write, fsync, close                5360.341 ops/sec     187 usecs/op
	        write, close, fsync                4785.240 ops/sec     209 usecs/op

Those two numbers should be similar.  I added this as a check to make
sure the behavior we were relying on was working.  I never tested sync
errors though.

I think the fundamental issue is that we always assumed that writes to
the kernel that could not be written to storage would remain in the
kernel until they succeeded, and that fsync() would report their
existence.

I can understand why kernel developers don't want to keep failed sync
buffers in memory, and once they are gone we lose reporting of their
failure.  Also, if the kernel is not going to retry the syncs, how long
should it keep reporting the sync failure?  To the first fsync() that
happens after the failure?  How long should it continue to record the
failure?  What if no fsync() ever happens, which is likely for
non-Postgres workloads?  I think once they decided to discard failed
syncs and not retry them, the fsync behavior we are complaining about
was almost required.

Our only option might be to tell administrators to closely watch for
kernel write failure messages, and then restore or failover.  :-(

The last time I remember being this surprised about storage was in the
early Postgres years when we learned that just because the BSD file
system uses 8k pages doesn't mean those are atomically written to
storage.
We knew the operating system wrote the data in 8k chunks to storage but:

o  the 8k pages are written as separate 512-byte sectors
o  the 8k might be contiguous logically on the drive but not physically
o  even 512-byte sectors are not written atomically

This is why pre-page images are written to WAL, which is what
full_page_writes controls.

-- 
  Bruce Momjian  <br...@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
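
For illustration only, here is a minimal sketch of the "write, close, fsync" case from the pg_test_fsync output quoted above: data is written through one descriptor, that descriptor is closed, and fsync() is then called on a second, freshly opened descriptor. This is not pg_test_fsync's actual code. The function name `write_close_fsync` and the 8k buffer size are assumptions chosen for the example, and the caller supplies the scratch file path. Per the kernel fragment quoted at the top, a writeback error that occurred before the second open() would apparently not be reported by the fsync() call here.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of PostgreSQL): write one 8k block through
 * a first descriptor, close it, then fsync() through a second descriptor
 * that never wrote anything.  Returns 0 on success, -1 on any failure.
 */
static int
write_close_fsync(const char *path)
{
    char    buf[8192];
    int     wfd;
    int     sfd;
    int     rc;

    memset(buf, 'x', sizeof(buf));

    /* First descriptor: does the write, then goes away. */
    wfd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (wfd < 0)
    {
        perror("open for write");
        return -1;
    }
    if (write(wfd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
    {
        perror("write");
        close(wfd);
        return -1;
    }
    if (close(wfd) != 0)
    {
        perror("close");
        return -1;
    }

    /*
     * Second descriptor: opened after the write.  The data may still be
     * dirty in the kernel, so this fsync() is expected to flush it, but
     * errors that predate this open() may not be visible through it.
     */
    sfd = open(path, O_RDWR);
    if (sfd < 0)
    {
        perror("open for fsync");
        return -1;
    }
    rc = fsync(sfd);
    if (rc != 0)
        perror("fsync");
    if (close(sfd) != 0)
        rc = -1;

    return rc;
}
```

A return value of 0 only says that fsync() on the second descriptor reported no error; given the behavior discussed in this thread, it does not prove that an earlier writeback of the same data succeeded.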