On 20.04.2016 at 20:38, Rik van Riel wrote:
> On Wed, 2016-04-20 at 13:46 +0200, Kevin Wolf wrote:
> > On 20.04.2016 at 12:40, Ric Wheeler wrote:
> > > On 04/20/2016 05:24 AM, Kevin Wolf wrote:
> > > > On 20.04.2016 at 03:56, Ric Wheeler wrote:
> > > > > On 04/19/2016 10:09 AM, Jeff Cody wrote:
> > > > > > On Tue, Apr 19, 2016 at 08:18:39AM -0400, Ric Wheeler wrote:
> > > > > I still worry that in many non-gluster situations we will have
> > > > > permanent data loss here. Specifically, the way the page cache
> > > > > works, if we fail to write back cached data *at any time*, a
> > > > > future fsync() will get a failure.
> > > > And this is actually what saves the semantic correctness. If you
> > > > threw away data, any following fsync() must fail. This is of
> > > > course inconvenient because you won't be able to resume a VM that
> > > > is configured to stop on errors, and it means some data loss, but
> > > > it's safe because we never tell the guest that the data is on
> > > > disk when it really isn't.
> > > >
> > > > gluster's behaviour (without resync-failed-syncs-after-fsync set)
> > > > is different, if I understand correctly. It will throw away the
> > > > data and then happily report success on the next fsync() call.
> > > > And this is what causes not only data loss, but corruption.
> > > Yes, that makes sense to me - the kernel will remember that it
> > > could not write data back from the page cache, and the future
> > > fsync() will see an error.
> > >
> > > > [ Hm, or having read what's below... Did I misunderstand and
> > > >   Linux returns failure only for a single fsync(), and on the
> > > >   next one it returns success again? That would be bad. ]
> > > I would need to think through that scenario with the memory
> > > management people to see if that could happen.
> > Okay, please do.
> > This is the fundamental assumption we make: If an fsync() succeeds,
> > *all* successfully completed writes are on disk, no matter whether
> > another fsync() failed in between. If they can't be written to the
> > disk (e.g. because the data was thrown away), no subsequent fsync()
> > can succeed any more.
>
> Is that actually desirable behaviour?
>
> What would it take to make fsync succeed again on that file at any
> point in the future?
>
> Umount of the filesystem?
>
> Reboot of the whole system?
>
> Restore from backup?
I would say at least a close()/open() pair. In other words, I would mark
all file descriptors that exist for the file when dirty pages are
dropped or marked clean without having been written back successfully,
so that fsync() consistently returns failure for them whenever data
written to the file descriptor has been lost.

So for more precise wording, I should have said "If an fsync() succeeds,
all successfully completed writes *to that file descriptor* are on
disk". fsync() takes an fd as its argument, so I think that's the
obvious scope.

And as I write this and think more about the reasons and implications, I
think we need to get rid of the behaviour that after a write error, the
page stays in the cache and is marked clean while it's inconsistent with
the actual disk contents. If we were to ensure that the cache is
consistent, we'd have the two options I mentioned earlier in the thread:

1. On write error, don't mark the page clean, but remove it from the
   cache entirely. After a new open(), programs will see the state of
   the file as it is on the disk, with the data that is going to be lost
   already gone. (The other option is, obviously, that the read returns
   an I/O error because the disk is really broken.)

   fsync() can legitimately return success on new file descriptors
   because the program already reads what is on the disk, and there are
   no additional pages in the cache that need to be written back for
   this to be stable. On the other hand, even this works only if the
   problem was temporary, because presumably the program wrote something
   to its new file descriptor before calling fsync(). On old file
   descriptors, it will always return failure because the data is
   already lost.

2. On write error, keep the pages in the cache as long as the memory
   isn't desperately needed elsewhere. When it is desperately needed,
   goto 1. This means that newly opened files would still see the cache
   contents that haven't been successfully written back yet.
fsync() would retry writing it out, both from old and from new file
descriptors. For programs holding new file descriptors, this is probably
equivalent to 1: fsync() succeeds if, and only if, the temporary problem
has been resolved in the meantime. On old file descriptors, this
preserves the data, and if the failure was temporary, the program can
recover the file descriptor and get back to working.

Implementing 1. would already improve the situation considerably, and I
guess 2. would be the best thing that we can possibly achieve. Both are
by far better than a situation where fsync() lies and returns success in
cases where the cache contents differ from the disk contents (which is
what I understand we have now).

Kevin
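The two options contrasted above could be sketched in pseudo-C roughly as follows (all type and helper names here are invented for illustration; this is pseudocode, not actual kernel code):

```c
/* Pseudocode sketch of options 1 and 2 from the text.
 * All names (page_like, mark_existing_fds_error, ...) are invented. */
void handle_writeback_error(struct page_like *page, bool memory_pressure)
{
    /* Common to both options: every file descriptor that already
     * exists for this file must keep seeing fsync() failures,
     * because data written through it has been lost. */
    mark_existing_fds_error(page->file);

    if (memory_pressure) {
        /* Option 1: drop the page entirely. New open()s then read
         * the real on-disk state, so fsync() on a new fd can
         * legitimately succeed. */
        remove_from_cache(page);
    } else {
        /* Option 2: keep the page dirty so fsync() retries the
         * write-out; if the error was transient, old fds can
         * recover, otherwise fall back to option 1 under
         * memory pressure. */
        keep_dirty_for_retry(page);
    }
}
```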