On Thu, Mar 29, 2018 at 11:30:59AM +0900, Michael Paquier wrote: > On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote: > > Craig Ringer <cr...@2ndquadrant.com> writes: > >> TL;DR: Pg should PANIC on fsync() EIO return. > > > > Surely you jest. > > Any callers of pg_fsync in the backend code are careful enough to check > the returned status, sometimes doing retries like in mdsync, so what is > proposed here would be a regression.
The retries are the source of the problem ; the first fsync() can return EIO, and also *clears the error* causing a 2nd fsync (of the same data) to return success. (Note, I can see that it might be useful to PANIC on EIO but retry for ENOSPC). On Thu, Mar 29, 2018 at 03:48:27PM +1300, Thomas Munro wrote: > Craig, is the phenomenon you described the same as the second issue > "Reporting writeback errors" discussed in this article? > https://lwn.net/Articles/724307/ Worse, the article acknowledges the behavior without apparently suggesting to change it: "Storing that value in the file structure has an important benefit: it makes it possible to report a writeback error EXACTLY ONCE TO EVERY PROCESS THAT CALLS FSYNC() .... In current kernels, ONLY THE FIRST CALLER AFTER AN ERROR OCCURS HAS A CHANCE OF SEEING THAT ERROR INFORMATION." I believe I reproduced the problem behavior using dmsetup "error" target, see attached. strace looks like this: kernel is Linux 4.10.0-28-generic #32~16.04.2-Ubuntu SMP Thu Jul 20 10:19:48 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux 1 open("/dev/mapper/eio", O_RDWR|O_CREAT, 0600) = 3 2 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192 3 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192 4 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192 5 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192 6 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192 7 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192 8 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 2560 9 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = -1 ENOSPC (No space left on device) 10 dup(2) = 4 11 fcntl(4, F_GETFL) = 0x8402 (flags O_RDWR|O_APPEND|O_LARGEFILE) 12 brk(NULL) = 0x1299000 13 brk(0x12ba000) = 0x12ba000 14 fstat(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0 15 write(4, "write(1): No space left on devic"..., 34write(1): No space left on device 16 ) = 34 17 close(4) = 0 18 fsync(3) = -1 EIO (Input/output error) 19 dup(2) = 4 20 fcntl(4, F_GETFL) = 0x8402 (flags O_RDWR|O_APPEND|O_LARGEFILE) 21 fstat(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0 22 write(4, "fsync(1): Input/output error\n", 29fsync(1): Input/output error 23 ) = 29 24 close(4) = 0 25 close(3) = 0 26 open("/dev/mapper/eio", O_RDWR|O_CREAT, 0600) = 3 27 fsync(3) = 0 28 write(3, "\0", 1) = 1 29 fsync(3) = 0 30 exit_group(0) = ? 2: EIO isn't seen initially due to writeback page cache; 9: ENOSPC due to small device 18: original IO error reported by fsync, good 25: the original FD is closed 26: ..and file reopened 27: fsync on file with still-dirty data+EIO returns success BAD 10, 19: I'm not sure why there's dup(2), I guess glibc thinks that perror should write to a separate FD (?) Also note, close() ALSO returned success..which you might think exonerates the 2nd fsync(), but I think may itself be problematic, no? In any case, the 2nd byte certainly never got written to DM error, and the failure status was lost following fsync(). I get the exact same behavior if I break after one write() loop, such as to avoid ENOSPC. Justin
/* % make CFLAGS='-Wall -Wextra -O3' /tmp/eio % sudo lvcreate -L 9M -n tst data echo ' 0 1 linear /dev/data/tst 0 1 1 error 1 2 99 linear /dev/data/tst 2' |sudo dmsetup create eio */ #include <unistd.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <stdlib.h> #include <stdio.h> int main() { char buf[8<<10]; int fd; if (-1==(fd=open("/dev/mapper/eio", O_CREAT|O_RDWR, 00600))) { //if (-1==(fd=open("/mnt/t", O_CREAT|O_RDWR, 00600))) { perror("open"); exit(1); } while (1) { // if (sizeof(buf)!=(write(fd, buf, sizeof(buf)))) { if (-1==write(fd, buf, sizeof(buf))) { perror("write(1)"); break; } // break; } if (0!=fsync(fd)) perror("fsync(1)"); if (0!=close(fd)) perror("close(1)"); if (-1==(fd=open("/dev/mapper/eio", O_CREAT|O_RDWR, 00600))) { //if (-1==(fd=open("/mnt/t", O_CREAT|O_RDWR, 00600))) { perror("open"); exit(1); } if (0!=fsync(fd)) perror("fsync(2)"); if (1!=write(fd, buf, 1)) perror("write(2)"); if (0!=fsync(fd)) perror("fsync(3)"); return 0; }