On Fri, May 07, 2021 at 10:18:14PM -0400, Tom Lane wrote:
> Andres Freund <and...@anarazel.de> writes:
> > On 2021-05-07 17:14:18 -0700, Noah Misch wrote:
> >> Having a flaky buildfarm member is bad news.  I'll LD_PRELOAD the
> >> attached to prevent fsync from reaching the kernel.  Hopefully, that
> >> will make the hardware-or-kernel trouble unreachable.  (Changing
> >> 008_fsm_truncation.pl wouldn't avoid this, because fsync=off doesn't
> >> affect syncs outside the backend.)
>
> > Not sure how reliable that is - there's other paths that could return an
> > error, I think.
Yep, one can imagine a failure at close() or something.  All the non-HEAD
buildfarm failures are at some *sync call, so I'm optimistic about getting
mileage from this.  (I didn't check the more-numerous HEAD failures.)  If
it's not enough, I may move the farm directory to tmpfs.

> > If the root cause is the disk responding weirdly to write cache
> > flushes, you could tell the kernel that the disk has no write cache
> > (e.g. echo write through > /sys/block/sda/queue/write_cache).
>
> I seriously doubt Noah has root on that machine.

If I can make the case for that setting being a good thing for the VM's
users generally, I probably can file a ticket and get it done.

> More to the point, the admin told me it's a VM (or LDOM, whatever that is)
> under a Solaris host, so there's no direct hardware access going on
> anyway.  He didn't say in so many words, but I suspect the reason he's
> suspecting kernel bugs is that there's nothing going wrong so far as the
> host OS is concerned.