On 20.04.2016 at 20:38, Rik van Riel wrote:
> On Wed, 2016-04-20 at 13:46 +0200, Kevin Wolf wrote:
> > On 20.04.2016 at 12:40, Ric Wheeler wrote:
> > > On 04/20/2016 05:24 AM, Kevin Wolf wrote:
> > > > On 20.04.2016 at 03:56, Ric Wheeler wrote:
> > > > > On 04/19/2016 10:09 AM, Jeff Cody wrote:
> > > > > > On Tue, Apr 19, 2016 at 08:18:39AM -0400, Ric Wheeler wrote:
> > > > > I still worry that in many non-gluster situations we will have
> > > > > permanent data loss here. Specifically, the way the page cache
> > > > > works, if we fail to write back cached data *at any time*, a
> > > > > future fsync() will get a failure.
> > > > And this is actually what saves the semantic correctness. If you
> > > > threw away data, any following fsync() must fail. This is of
> > > > course inconvenient because you won't be able to resume a VM that
> > > > is configured to stop on errors, and it means some data loss, but
> > > > it's safe because we never tell the guest that the data is on
> > > > disk when it really isn't.
> > > >
> > > > gluster's behaviour (without resync-failed-syncs-after-fsync set)
> > > > is different, if I understand correctly. It will throw away the
> > > > data and then happily report success on the next fsync() call.
> > > > And this is what causes not only data loss, but corruption.
> > > Yes, that makes sense to me - the kernel will remember that it
> > > could not write data back from the page cache, and the future
> > > fsync() will see an error.
> > >
> > > > [ Hm, or having read what's below... Did I misunderstand and
> > > >   Linux returns failure only for a single fsync(), and on the
> > > >   next one it returns success again? That would be bad. ]
> > > I would need to think through that scenario with the memory
> > > management people to see if that could happen.
> > Okay, please do.
> > This is the fundamental assumption we make: If an fsync() succeeds,
> > *all* successfully completed writes are on disk, no matter whether
> > another fsync() failed in between. If they can't be written to the
> > disk (e.g. because the data was thrown away), no subsequent fsync()
> > can succeed any more.
>
> Is that actually desirable behaviour?
>
> What would it take to make fsync succeed again on that file at any
> point in the future?
>
> Umount of the filesystem?
>
> Reboot of the whole system?
>
> Restore from backup?
I would say at least a close()/open() pair. In other words, I would mark
all file descriptors that exist for the file when dirty pages are
dropped or marked clean without having been written back successfully,
so that fsync() consistently returns failure for them whenever data
written to the file descriptor has been lost.

So for more precise wording, I should have said "If an fsync() succeeds,
all successfully completed writes *to that file descriptor* are on
disk". fsync() takes an fd as its argument, so I think that's the
obvious scope.

And as I write this and think more about the reasons and implications, I
think we need to get rid of the behaviour that after a write error, the
page stays in the cache and is marked clean while it's inconsistent with
the actual disk contents. If we were to ensure that the cache is
consistent, we'd have the two options I mentioned earlier in the thread:

1. On write error, don't mark the page clean, but remove it from the
   cache entirely. After a new open(), programs will see the state of
   the file as it is on the disk, with the data that is going to be lost
   already gone. (The other option is, obviously, that the read returns
   an I/O error because the disk is really broken.)

   fsync() can legitimately return success on new file descriptors
   because the program already reads what is on the disk, and there are
   no additional pages in the cache that need to be written back for
   this to be stable. On the other hand, even this works only if the
   problem was temporary, because presumably the program wrote something
   to its new file descriptor before calling fsync(). On old file
   descriptors, it will always return failure because the data is
   already lost.

2. On write error, keep the pages in the cache as long as the memory
   isn't desperately needed elsewhere. When it is desperately needed,
   goto 1. This means that newly opened files would still see the cache
   contents that haven't been successfully written back yet.
fsync() would retry writing it out, both from old and from new file
descriptors. For programs holding new file descriptors, this is probably
equivalent to 1: fsync() succeeds if, and only if, the temporary problem
has been resolved in the meantime. On old file descriptors, this
preserves the data, and if the failure was temporary, the program can
recover the file descriptor and get back to working.

Implementing 1. would already improve the situation considerably, and I
guess 2. would be the best thing that we can possibly achieve. Both are
by far better than a situation where fsync() lies and returns success in
cases where the cache contents differ from the disk contents (which is
what I understand we have now).

Kevin
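The two options contrasted above could be sketched in pseudo-C roughly as follows (all type and helper names here are invented for illustration; this is pseudocode, not actual kernel code):

```c
/* Pseudocode sketch of options 1 and 2 from the text.
 * All names (page_like, mark_existing_fds_error, ...) are invented. */
void handle_writeback_error(struct page_like *page, bool memory_pressure)
{
    /* Common to both options: every file descriptor that already
     * exists for this file must keep seeing fsync() failures,
     * because data written through it has been lost. */
    mark_existing_fds_error(page->file);

    if (memory_pressure) {
        /* Option 1: drop the page entirely. New open()s then read
         * the real on-disk state, so fsync() on a new fd can
         * legitimately succeed. */
        remove_from_cache(page);
    } else {
        /* Option 2: keep the page dirty so fsync() retries the
         * write-out; if the error was transient, old fds can
         * recover, otherwise fall back to option 1 under
         * memory pressure. */
        keep_dirty_for_retry(page);
    }
}
```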