On Wed, Apr 06, 2016 at 01:51:59PM +0200, Kevin Wolf wrote:
> Am 06.04.2016 um 13:41 hat Kevin Wolf geschrieben:
> > Am 06.04.2016 um 13:19 hat Ric Wheeler geschrieben:
> > >
> > > We had a thread discussing this not on the upstream list.
> > >
> > > My summary of the thread is that I don't understand why gluster
> > > should drop cached data after a failed fsync() for any open file.
> >
> > It certainly shouldn't, but it does by default. :-)
> >
> > Have a look at commit 3fcead2d in glusterfs.git, which at least
> > introduces an option to get usable behaviour:
> >
> >     { .key = {"resync-failed-syncs-after-fsync"},
> >       .type = GF_OPTION_TYPE_BOOL,
> >       .default_value = "off",
> >       .description = "If sync of \"cached-writes issued before fsync\" "
> >                      "(to backend) fails, this option configures whether "
> >                      "to retry syncing them after fsync or forget them. "
> >                      "If set to on, cached-writes are retried "
> >                      "till a \"flush\" fop (or a successful sync) on sync "
> >                      "failures. "
> >                      "fsync itself is failed irrespective of the value of "
> >                      "this option. ",
> >     },
> >
> > As you can see, the default is still to drop cached data, and this is
> > with the file still opened. qemu needs to make sure that this option is
> > set, and if Jeff's comment in the code below is right, there is no way
> > currently to make sure that the option isn't silently ignored.
> >
> > Can we get some function that sets an option and fails if the option is
> > unknown? Or one that queries the state after setting an option, so we
> > can check whether we succeeded in switching to the mode we need?
> >
> > > For closed files, I think it might still happen but this is the same
> > > as any file system (and unlikely to be the case for qemu?).
> >
> > Our problem is only with open images. Dropping caches for files that
> > qemu doesn't use any more is fine as far as I'm concerned.
> >
> > Note that our usage can involve cases where we reopen a file with
> > different flags, i.e. first open a second file descriptor, then close
> > the first one. The image was never completely closed here and we would
> > still want the cache to preserve our data in such cases.
>
> Hm, actually, maybe we should just call bdrv_flush() before reopening an
> image, and if an error is returned, we abort the reopen. It's far from
> being a hot path, so the overhead of a flush shouldn't matter, and it
> seems we're taking an unnecessary risk without doing this.
>
[I seem to have been dropped from the cc]

Are you talking about doing a bdrv_flush() on the new descriptor
(i.e. reop_s->glfs)?  Because otherwise, we already do this in
bdrv_reopen_prepare() on the original fd.  It happens right before the
call to drv->bdrv_reopen_prepare():

2020     ret = bdrv_flush(reopen_state->bs);
2021     if (ret) {
2022         error_setg_errno(errp, -ret, "Error flushing drive");
2023         goto error;
2024     }
2025
2026     if (drv->bdrv_reopen_prepare) {
2027         ret = drv->bdrv_reopen_prepare(reopen_state, queue, &local_err);

> > > I will note that Linux in general had (still has I think?) the
> > > behavior that once the process closes a file (or exits), we lose
> > > context to return an error to. From that point on, any failed IO
> > > from the page cache to the target disk will be dropped from cache.
> > > To hold things in the cache would lead it to fill with old data that
> > > is not really recoverable and we have no good way to know that the
> > > situation is repairable and how long that might take. Upstream
> > > kernel people have debated this, the behavior might be tweaked for
> > > certain types of errors.
> >
> > That's fine, we just don't want the next fsync() to signal success when
> > in reality the cache has thrown away our data. As soon as we close the
> > image, there is no next fsync(), so you can do whatever you like.
> >
> > Kevin
> >
> > > On 04/06/2016 07:02 AM, Kevin Wolf wrote:
> > > >[ Adding some CCs ]
> > > >
> > > >Am 06.04.2016 um 05:29 hat Jeff Cody geschrieben:
> > > >>Upon receiving an I/O error after an fsync, by default gluster will
> > > >>dump its cache. However, QEMU will retry the fsync, which is especially
> > > >>useful when encountering errors such as ENOSPC when using the
> > > >>werror=stop option. When using caching with gluster, however, the last
> > > >>written data will be lost upon encountering ENOSPC. Using the cache
> > > >>xlator option of 'resync-failed-syncs-after-fsync' should cause gluster
> > > >>to retain the cached data after a failed fsync, so that ENOSPC and
> > > >>other transient errors are recoverable.
> > > >>
> > > >>Signed-off-by: Jeff Cody <jc...@redhat.com>
> > > >>---
> > > >> block/gluster.c | 27 +++++++++++++++++++++++++++
> > > >> configure       |  8 ++++++++
> > > >> 2 files changed, 35 insertions(+)
> > > >>
> > > >>diff --git a/block/gluster.c b/block/gluster.c
> > > >>index 30a827e..b1cf71b 100644
> > > >>--- a/block/gluster.c
> > > >>+++ b/block/gluster.c
> > > >>@@ -330,6 +330,23 @@ static int qemu_gluster_open(BlockDriverState *bs,  QDict *options,
> > > >>         goto out;
> > > >>     }
> > > >>+#ifdef CONFIG_GLUSTERFS_XLATOR_OPT
> > > >>+    /* Without this, if fsync fails for a recoverable reason (for instance,
> > > >>+     * ENOSPC), gluster will dump its cache, preventing retries. This means
> > > >>+     * almost certain data loss. Not all gluster versions support the
> > > >>+     * 'resync-failed-syncs-after-fsync' key value, but there is no way to
> > > >>+     * discover during runtime if it is supported (this api returns success for
> > > >>+     * unknown key/value pairs) */
> > > >Honestly, this sucks. There is apparently no way to operate gluster so
> > > >we can safely recover after a failed fsync. "We hope everything is fine,
> > > >but depending on your gluster version, we may now corrupt your image"
> > > >isn't very good.
> > > >
> > > >We need to consider very carefully if this is good enough to go on after
> > > >an error. I'm currently leaning towards "no". That is, we should only
> > > >enable this after Gluster provides us a way to make sure that the option
> > > >is really set.
> > > >
> > > >>+    ret = glfs_set_xlator_option (s->glfs, "*-write-behind",
> > > >>+                                           "resync-failed-syncs-after-fsync",
> > > >>+                                           "on");
> > > >>+    if (ret < 0) {
> > > >>+        error_setg_errno(errp, errno, "Unable to set xlator key/value pair");
> > > >>+        ret = -errno;
> > > >>+        goto out;
> > > >>+    }
> > > >>+#endif
> > > >We also need to consider the case without CONFIG_GLUSTERFS_XLATOR_OPT.
> > > >In this case (as well as theoretically in the case that the option
> > > >didn't take effect - if only we could know about it), a failed
> > > >glfs_fsync_async() is fatal and we need to stop operating on the image,
> > > >i.e. set bs->drv = NULL like when we detect corruption in qcow2 images.
> > > >The guest will see a broken disk that fails all I/O requests, but that's
> > > >better than corrupting data.
> > > >
> > > >Kevin
> > >
> >
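
For the !CONFIG_GLUSTERFS_XLATOR_OPT case, is something along these lines
what you have in mind?  Rough, untested sketch only, based on my reading of
the current qemu_gluster_co_flush_to_disk(); I'm assuming we can simply
reuse qemu_gluster_close() to tear things down here:

static coroutine_fn int qemu_gluster_co_flush_to_disk(BlockDriverState *bs)
{
    int ret;
    GlusterAIOCB acb;
    BDRVGlusterState *s = bs->opaque;

    acb.size = 0;
    acb.ret = 0;
    acb.coroutine = qemu_coroutine_self();
    acb.aio_context = bdrv_get_aio_context(bs);

    ret = glfs_fsync_async(s->fd, gluster_finish_aiocb, &acb);
    if (ret < 0) {
        ret = -errno;
        goto error;
    }

    qemu_coroutine_yield();
    if (acb.ret < 0) {
        ret = acb.ret;
        goto error;
    }

    return acb.ret;

error:
#ifndef CONFIG_GLUSTERFS_XLATOR_OPT
    /* Without 'resync-failed-syncs-after-fsync' we cannot trust the
     * write-behind cache after a failed fsync, so stop using the image
     * entirely, the same way qcow2 does when it detects corruption.
     * The guest then sees a disk that fails all further requests
     * instead of silently losing data. */
    qemu_gluster_close(bs);
    bs->drv = NULL;
#endif
    return ret;
}

That would at least make the failure loud on builds against older libgfapi,
rather than pretending the retry is safe.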
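
And on verifying that the option actually took effect: the kind of
interface you're asking Gluster for would let qemu_gluster_open() do
something like the following.  This is purely illustrative;
glfs_get_xlator_option() is a made-up name, and no such getter (or strict
setter) exists in libgfapi today, which is exactly the problem:

    char value[16] = "";

    ret = glfs_set_xlator_option(s->glfs, "*-write-behind",
                                 "resync-failed-syncs-after-fsync", "on");
    /* Hypothetical read-back of the option, so that a silently ignored
     * key is detected instead of being treated as success. */
    if (ret < 0 ||
        glfs_get_xlator_option(s->glfs, "*-write-behind",
                               "resync-failed-syncs-after-fsync",
                               value, sizeof(value)) < 0 ||
        strcmp(value, "on") != 0) {
        error_setg(errp, "Could not enable resync-failed-syncs-after-fsync");
        ret = -EINVAL;
        goto out;
    }

Either a setter that fails on unknown keys or a read-back like this would
be enough for us to know whether continuing after a failed fsync is safe.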