On Thu, Mar 29, 2018 at 11:30:59AM +0900, Michael Paquier wrote:
> On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
> > Craig Ringer <cr...@2ndquadrant.com> writes:
> >> TL;DR: Pg should PANIC on fsync() EIO return.
> > 
> > Surely you jest.
> 
> Any callers of pg_fsync in the backend code are careful enough to check
> the returned status, sometimes doing retries like in mdsync, so what is
> proposed here would be a regression.

The retries are the source of the problem ; the first fsync() can return EIO,
and also *clears the error* causing a 2nd fsync (of the same data) to return
success.

(Note, I can see that it might be useful to PANIC on EIO but retry for ENOSPC).

On Thu, Mar 29, 2018 at 03:48:27PM +1300, Thomas Munro wrote:
> Craig, is the phenomenon you described the same as the second issue
> "Reporting writeback errors" discussed in this article?
> https://lwn.net/Articles/724307/

Worse, the article acknowledges the behavior without apparently suggesting to
change it:

 "Storing that value in the file structure has an important benefit: it makes
it possible to report a writeback error EXACTLY ONCE TO EVERY PROCESS THAT
CALLS FSYNC() .... In current kernels, ONLY THE FIRST CALLER AFTER AN ERROR
OCCURS HAS A CHANCE OF SEEING THAT ERROR INFORMATION."

I believe I reproduced the problem behavior using dmsetup "error" target, see
attached.

strace looks like this:

kernel is Linux 4.10.0-28-generic #32~16.04.2-Ubuntu SMP Thu Jul 20 10:19:48 
UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

     1  open("/dev/mapper/eio", O_RDWR|O_CREAT, 0600) = 3
     2  write(3, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 
8192
     3  write(3, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 
8192
     4  write(3, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 
8192
     5  write(3, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 
8192
     6  write(3, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 
8192
     7  write(3, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 
8192
     8  write(3, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 
2560
     9  write(3, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 
-1 ENOSPC (No space left on device)
    10  dup(2)                                  = 4
    11  fcntl(4, F_GETFL)                       = 0x8402 (flags 
O_RDWR|O_APPEND|O_LARGEFILE)
    12  brk(NULL)                               = 0x1299000
    13  brk(0x12ba000)                          = 0x12ba000
    14  fstat(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
    15  write(4, "write(1): No space left on devic"..., 34write(1): No space 
left on device
    16  ) = 34
    17  close(4)                                = 0
    18  fsync(3)                                = -1 EIO (Input/output error)
    19  dup(2)                                  = 4
    20  fcntl(4, F_GETFL)                       = 0x8402 (flags 
O_RDWR|O_APPEND|O_LARGEFILE)
    21  fstat(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
    22  write(4, "fsync(1): Input/output error\n", 29fsync(1): Input/output 
error
    23  ) = 29
    24  close(4)                                = 0
    25  close(3)                                = 0
    26  open("/dev/mapper/eio", O_RDWR|O_CREAT, 0600) = 3
    27  fsync(3)                                = 0
    28  write(3, "\0", 1)                       = 1
    29  fsync(3)                                = 0
    30  exit_group(0)                           = ?

2: EIO isn't seen initially due to writeback page cache;
9: ENOSPC due to small device
18: original IO error reported by fsync, good
25: the original FD is closed
26: ..and file reopened
27: fsync on file with still-dirty data+EIO returns success BAD

10, 19: I'm not sure why there's dup(2), I guess glibc thinks that perror
should write to a separate FD (?)

Also note, close() ALSO returned success..which you might think exonerates the
2nd fsync(), but I think may itself be problematic, no?  In any case, the 2nd
byte certainly never got written to DM error, and the failure status was lost
following fsync().

I get the exact same behavior if I break after one write() loop, such as to
avoid ENOSPC.

Justin
/*
% make CFLAGS='-Wall -Wextra -O3' /tmp/eio
% sudo lvcreate -L 9M -n tst data

echo '
0 1 linear /dev/data/tst 0
1 1 error 1
2 99 linear /dev/data/tst 2' |sudo dmsetup create eio

*/

#include <unistd.h>

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

#include <stdlib.h>
#include <stdio.h>

int main()
{
	char buf[8<<10];
	int fd;

	if (-1==(fd=open("/dev/mapper/eio", O_CREAT|O_RDWR, 00600))) {
	//if (-1==(fd=open("/mnt/t", O_CREAT|O_RDWR, 00600))) {
		perror("open");
		exit(1);
	}

	while (1) {
		// if (sizeof(buf)!=(write(fd, buf, sizeof(buf)))) {
		if (-1==write(fd, buf, sizeof(buf))) {
			perror("write(1)");
			break;
		}
		// break;
	}

	if (0!=fsync(fd))
		perror("fsync(1)");

	if (0!=close(fd))
		perror("close(1)");

	if (-1==(fd=open("/dev/mapper/eio", O_CREAT|O_RDWR, 00600))) {
	//if (-1==(fd=open("/mnt/t", O_CREAT|O_RDWR, 00600))) {
		perror("open");
		exit(1);
	}

	if (0!=fsync(fd))
		perror("fsync(2)");

	if (1!=write(fd, buf, 1))
		perror("write(2)");

	if (0!=fsync(fd))
		perror("fsync(3)");

	return 0;
}

Reply via email to