Re: [PATCHv2] bcache: option for allow stale data on read failure
On 2017/9/20 5:40 PM, Michael Lyle wrote:
> On Wed, Sep 20, 2017 at 3:28 AM, Coly Li wrote:
>> Even if the read request failed on file system metadata, because stale
>> data will finally be provided to the kernel file system code, the file
>> system probably won't complain either.
>
> The scary case is when filesystem data that points to other filesystem
> data is cached. E.g. the data structures representing what space is
> free on disk, or a directory, or a database btree. Some examples:
>
> Free space handling-- if a big file /foo is created, and the active
> free-space data structures are in cache (and this is likely, because
> actively written places can have their writeback-writes
> cancelled/deferred indefinitely)-- and then later the caching disk
> fails, an old version of this will be read from disk. Later, an
> effort to write a file /bar allocates the space used by /foo, and
> writes over it.
>
> Directory entity handling-- if /var/spool/foo is an active directory
> (associated data structures in cache), and has the directory
> /var/spool/foo/bar under it, and then /bar is removed... the backing
> disk will still have a reference to bar. If the space for bar is then
> used for something else, the kernel may end up reading something very
> different from what it expects for a directory later after a cache
> device failure.
>
> Btrees, etc-- the same thing. If a tree shrinks, old tree entities can
> end up pointing to other kinds of data.
>
> I think this change is harmful-- it is not a good idea to
> automatically, at runtime, decide to start returning data that
> violates the guarantees a block device is supposed to obey about
> ordering and persistence.

Hi Mike,

I totally agree with you. The misleading commit log is my fault; if you
read it again you may find we stand on the same side, which is what I
feel from your response :-)

The current bcache code does provide stale data from read failure
recovery. In the v1 patch discussion people wanted to keep this behavior,
so in v2 I added an option to permit this "harmful" behavior and disabled
it by default.

It is good to know Kent does not like an option; then we can just disable
this "harmful" behavior.

Thanks.

Coly
Re: [PATCHv2] bcache: option for allow stale data on read failure
On 2017/9/20 6:07 PM, Kent Overstreet wrote:
> On Wed, Sep 20, 2017 at 06:24:33AM +0800, Coly Li wrote:
>> When bcache does read I/Os, for example in writeback or writethrough
>> mode, if a read request on the cache device fails, bcache will try to
>> recover the request by reading from the cached device. If the data on
>> the cached device is not synced with the cache device, the requester
>> will get stale data.
>>
>> For critical storage systems like databases, providing stale data from
>> recovery may result in application-level data corruption, which is
>> unacceptable. But for other situations like a multimedia stream cache,
>> continuous service may be more important, and it is acceptable to fetch
>> a chunk of stale data.
>>
>> This patch tries to solve the above conflict by adding a sysfs option
>>     /sys/block/bcache/bcache/allow_stale_data_on_failure
>> which is cleared (to 0) by default, i.e. disabled. Now people can make
>> choices for different situations.
>
> IMO this is just a bug, I'd rather not have an option to keep the buggy
> behaviour. How about this patch:
>

Hi Kent,

OK. Last time, when I discussed this with other bcache developers, people
wanted to keep this behavior, so I made it an option in this version of
the patch. I support fixing it without an option, because there are too
many options already. Good to know you came to a similar decision :-)

> commit 2746f9c1f962288d8c5d7dabe698bf7b3fddd405
> Author: Kent Overstreet
> Date:   Wed Sep 20 18:06:37 2017 +0200
>
>     bcache: Don't recover from IO errors when reading dirty data
>
>     Signed-off-by: Kent Overstreet
>
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 382397772a..c2d57ef953 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -532,8 +532,10 @@ static int cache_lookup_fn(struct btree_op *op, struct btree *b, struct bkey *k)
>
>  	PTR_BUCKET(b->c, k, ptr)->prio = INITIAL_PRIO;
>
> -	if (KEY_DIRTY(k))
> +	if (KEY_DIRTY(k)) {
>  		s->read_dirty_data = true;
> +		s->recoverable = false;
> +	}
>

I thought of fixing it here. The reason I gave up on modifying this spot
is that cache_lookup_fn() is called for keys in leaf nodes
(b->level == 0), and bch_btree_map_keys_recurse() needs to do I/O to
fetch upper-level nodes before accessing a leaf node. When an SSD fails,
bch_btree_node_get() will fail before cache_lookup_fn() is executed. So
with your patch there is no chance to set s->recoverable to false, and
recovery still happens.

If you don't like an option, the following modification should be much
simpler:

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 681b4f12b05a..f397785d9c38 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -697,8 +697,10 @@ static void cached_dev_read_error(struct closure *cl)
 {
 	struct search *s = container_of(cl, struct search, cl);
 	struct bio *bio = &s->bio.bio;
+	struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);

-	if (s->recoverable) {
+	if (s->recoverable &&
+	    (dc && !atomic_read(&dc->has_dirty))) {
 		/* Retry from the backing device: */
 		trace_bcache_read_retry(s->orig_bio);

This might be the simplest way I know for now.

Thanks.

Coly Li
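[Editorial sketch] The two retry policies discussed in this message can be compared in a small user-space model. This is only an illustration under stated assumptions, not kernel code: `model_search`, `model_cached_dev`, `retry_v2()` and `retry_no_option()` are hypothetical names, and the plain `int`/`bool` fields merely stand in for `s->recoverable`, `atomic_read(&dc->has_dirty)` and the v2 sysfs flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-request state (s->recoverable). */
struct model_search {
	bool recoverable;
};

/* Hypothetical stand-in for the per-device state. */
struct model_cached_dev {
	int  has_dirty;   /* models atomic_read(&dc->has_dirty) */
	bool allow_stale; /* models the v2 allow_stale_data_on_failure flag */
};

/* v2 patch: retry if nothing is dirty, or if stale data was opted in to. */
static bool retry_v2(const struct model_search *s,
		     const struct model_cached_dev *dc)
{
	assert(s && dc);
	return s->recoverable && (dc->has_dirty == 0 || dc->allow_stale);
}

/* Simplified proposal: retry only when the device holds no dirty data. */
static bool retry_no_option(const struct model_search *s,
			    const struct model_cached_dev *dc)
{
	assert(s && dc);
	return s->recoverable && dc->has_dirty == 0;
}
```

With a recoverable request, both policies retry on a clean device and both refuse on a dirty device whose flag is clear; they differ only when the device is dirty and the flag is set, which is exactly the case the thread argues should not exist.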
Re: [PATCHv2] bcache: option for allow stale data on read failure
On Wed, Sep 20, 2017 at 06:24:33AM +0800, Coly Li wrote:
> When bcache does read I/Os, for example in writeback or writethrough
> mode, if a read request on the cache device fails, bcache will try to
> recover the request by reading from the cached device. If the data on
> the cached device is not synced with the cache device, the requester
> will get stale data.
>
> For critical storage systems like databases, providing stale data from
> recovery may result in application-level data corruption, which is
> unacceptable. But for other situations like a multimedia stream cache,
> continuous service may be more important, and it is acceptable to fetch
> a chunk of stale data.
>
> This patch tries to solve the above conflict by adding a sysfs option
>     /sys/block/bcache/bcache/allow_stale_data_on_failure
> which is cleared (to 0) by default, i.e. disabled. Now people can make
> choices for different situations.

IMO this is just a bug, I'd rather not have an option to keep the buggy
behaviour. How about this patch:

commit 2746f9c1f962288d8c5d7dabe698bf7b3fddd405
Author: Kent Overstreet
Date:   Wed Sep 20 18:06:37 2017 +0200

    bcache: Don't recover from IO errors when reading dirty data

    Signed-off-by: Kent Overstreet

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 382397772a..c2d57ef953 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -532,8 +532,10 @@ static int cache_lookup_fn(struct btree_op *op, struct btree *b, struct bkey *k)

 	PTR_BUCKET(b->c, k, ptr)->prio = INITIAL_PRIO;

-	if (KEY_DIRTY(k))
+	if (KEY_DIRTY(k)) {
 		s->read_dirty_data = true;
+		s->recoverable = false;
+	}

 	n = bio_next_split(bio, min_t(uint64_t, INT_MAX,
 				      KEY_OFFSET(k) - bio->bi_iter.bi_sector),
Re: [PATCHv2] bcache: option for allow stale data on read failure
On Wed, Sep 20, 2017 at 3:28 AM, Coly Li wrote:
> Even if the read request failed on file system metadata, because stale
> data will finally be provided to the kernel file system code, the file
> system probably won't complain either.

The scary case is when filesystem data that points to other filesystem
data is cached. E.g. the data structures representing what space is
free on disk, or a directory, or a database btree. Some examples:

Free space handling-- if a big file /foo is created, and the active
free-space data structures are in cache (and this is likely, because
actively written places can have their writeback-writes
cancelled/deferred indefinitely)-- and then later the caching disk
fails, an old version of this will be read from disk. Later, an
effort to write a file /bar allocates the space used by /foo, and
writes over it.

Directory entity handling-- if /var/spool/foo is an active directory
(associated data structures in cache), and has the directory
/var/spool/foo/bar under it, and then /bar is removed... the backing
disk will still have a reference to bar. If the space for bar is then
used for something else, the kernel may end up reading something very
different from what it expects for a directory later after a cache
device failure.

Btrees, etc-- the same thing. If a tree shrinks, old tree entities can
end up pointing to other kinds of data.

I think this change is harmful-- it is not a good idea to
automatically, at runtime, decide to start returning data that
violates the guarantees a block device is supposed to obey about
ordering and persistence.

Mike
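[Editorial sketch] The free-space example above can be made concrete with a toy model. Everything here is an assumption made for illustration: `toy_dev`, `read_bitmap()`, `write_bitmap()`, `alloc_block()` and `demo_double_allocation()` are hypothetical names, and a single byte stands in for a filesystem's free-space bitmap. It is not how bcache or any real filesystem is implemented; it only shows how falling back to a stale copy leads to the same block being handed out twice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NBLOCKS 8 /* one bitmap bit per data block */

/* Toy device: the free-space bitmap exists in two copies. */
struct toy_dev {
	uint8_t backing_bitmap; /* copy on the backing (cached) device */
	uint8_t cached_bitmap;  /* newer copy held dirty in the cache */
	bool    cache_dirty;    /* cached copy not yet written back */
	bool    cache_failed;   /* cache device has died */
};

/* Read path with fallback: a failed cache silently yields the old copy. */
static uint8_t read_bitmap(const struct toy_dev *d)
{
	if (d->cache_dirty && !d->cache_failed)
		return d->cached_bitmap;
	return d->backing_bitmap; /* may be stale */
}

/* Writeback mode: updates land in the cache while it is alive. */
static void write_bitmap(struct toy_dev *d, uint8_t bm)
{
	if (d->cache_failed) {
		d->backing_bitmap = bm;
		return;
	}
	d->cached_bitmap = bm;
	d->cache_dirty = true;
}

/* Allocate the first block the bitmap reports as free, or -1 if full. */
static int alloc_block(struct toy_dev *d)
{
	uint8_t bm = read_bitmap(d);

	assert(d);
	for (int i = 0; i < NBLOCKS; i++) {
		if (!(bm & (1u << i))) {
			write_bitmap(d, (uint8_t)(bm | (1u << i)));
			return i;
		}
	}
	return -1;
}

/* Returns true when /foo and /bar end up sharing one block. */
static bool demo_double_allocation(void)
{
	struct toy_dev d = {0};
	int foo = alloc_block(&d); /* bitmap update stays dirty in cache */

	d.cache_failed = true;     /* cache dies before writeback */
	int bar = alloc_block(&d); /* stale bitmap says foo's block is free */
	return foo == bar;
}
```

In this model the second allocation sees the pre-/foo bitmap, so /bar is given the block that already belongs to /foo. Returning an I/O error from `read_bitmap()` instead of the stale copy would stop the allocator before it overwrote anything.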
Re: [PATCHv2] bcache: option for allow stale data on read failure
On 2017/9/20 8:59 AM, Michael Lyle wrote:
> Coly--
>
> It's an interesting changeset.

Hi Mike,

Yes, it's interesting :-) It fixes a silent database data corruption in
our product kernel. The most dangerous point is that it happens silently
even when an in-data checksum is used; this issue was detected by an
out-of-data checksum.

> I am not positive if it will work in practice-- the most likely
> objects to be cached are filesystem metadata. Won't most filesystems
> fall apart if some of their data structures revert back to an earlier
> point of time?

For a database workload, most of the data cached on SSD is data blocks of
database files which are replayed from the binlog (for example MySQL).
The file system won't complain in such a situation, and an earlier
version means all transaction information since the last update is lost,
in *silence*.

Even if the read request failed on file system metadata, because stale
data will finally be provided to the kernel file system code, the file
system probably won't complain either. Because,
- A file system reports an error when I/O fails. If stale data from
  recovery is provided to the file system, it just uses the stale data
  until a worse failure is detected by file system code.
- If the file system uses a metadata checksum, and the checksum is inside
  the metadata block (which is quite common), the stale data is also
  checksum-consistent, so the file system won't report an error either.

So the data corruption happens at the application level, even though the
file system kernel code still thinks everything is consistent on disk.

Thanks.

Coly Li

> On Tue, Sep 19, 2017 at 3:24 PM, Coly Li wrote:
>> When bcache does read I/Os, for example in writeback or writethrough
>> mode, if a read request on the cache device fails, bcache will try to
>> recover the request by reading from the cached device. If the data on
>> the cached device is not synced with the cache device, the requester
>> will get stale data.
>>
>> For critical storage systems like databases, providing stale data from
>> recovery may result in application-level data corruption, which is
>> unacceptable. But for other situations like a multimedia stream cache,
>> continuous service may be more important, and it is acceptable to fetch
>> a chunk of stale data.
>>
>> This patch tries to solve the above conflict by adding a sysfs option
>>     /sys/block/bcache/bcache/allow_stale_data_on_failure
>> which is cleared (to 0) by default, i.e. disabled. Now people can make
>> choices for different situations.
>>
>> With this patch, for a failed read request in writeback or writethrough
>> mode, recovery of a recoverable read request only happens under one of
>> the following conditions:
>> - dc->has_dirty is zero. It means all data on the cache device is
>>   synced to the cached device, so the recovered data is up to date.
>> - dc->has_dirty is non-zero, and dc->allow_stale_data_on_failure is set
>>   to 1. It means there is dirty data not yet synced to the cached
>>   device, but the option allow_stale_data_on_failure is set, so
>>   receiving stale data is explicitly acceptable to the requester.
>>
>> For other cache modes in bcache, read requests will never hit
>> cached_dev_read_error(), so they don't need this patch.
>>
>> Please note, because the cache mode can be switched arbitrarily at run
>> time, a writethrough mode might have been switched from a writeback
>> mode. Therefore checking dc->has_dirty in writethrough mode still makes
>> sense.
>>
>> Changelog:
>> v2: rename sysfs entry from allow_stale_data_on_failure to
>>     allow_stale_data_on_failure, and fix the confusing commit log.
>> v1: initial patch posted.
>>
>> Signed-off-by: Coly Li
>> Reported-by: Arne Wolf
>> Cc: Nix
>> Cc: Kai Krakow
>> Cc: Eric Wheeler
>> Cc: Junhui Tang
>> Cc: sta...@vger.kernel.org
[snip]
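[Editorial sketch] Coly's point in this message, that a checksum stored inside the block cannot catch a stale block while a checksum kept elsewhere can, is illustrated below. The names `toy_block`, `toy_csum()`, `make_block()`, `in_block_ok()` and `out_of_band_ok()` are hypothetical, and the additive checksum is a deliberately trivial stand-in chosen so the result is easy to verify by hand; no real filesystem or database is claimed to work this way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAYLOAD 16

/* A block that carries its own checksum, as many metadata formats do. */
struct toy_block {
	uint8_t  payload[PAYLOAD];
	uint32_t csum; /* checksum stored inside the block itself */
};

/* Toy additive checksum: the sum of all payload bytes. */
static uint32_t toy_csum(const uint8_t *p, size_t n)
{
	uint32_t sum = 0;

	assert(p);
	for (size_t i = 0; i < n; i++)
		sum += p[i];
	return sum;
}

/* Build a self-consistent block whose payload is filled with one byte. */
static struct toy_block make_block(uint8_t fill)
{
	struct toy_block b;

	for (size_t i = 0; i < PAYLOAD; i++)
		b.payload[i] = fill;
	b.csum = toy_csum(b.payload, PAYLOAD);
	return b;
}

/* In-block check: only proves the block agrees with itself. */
static bool in_block_ok(const struct toy_block *b)
{
	assert(b);
	return b->csum == toy_csum(b->payload, PAYLOAD);
}

/* Out-of-band check: compares against a checksum recorded elsewhere. */
static bool out_of_band_ok(const struct toy_block *b, uint32_t expected)
{
	assert(b);
	return toy_csum(b->payload, PAYLOAD) == expected;
}
```

If the application recorded the checksum of the newest version (say `make_block(0x22)`) somewhere outside the block, then an older version (`make_block(0x11)`) returned by the fallback read still passes `in_block_ok()`, because it was internally consistent when it was written, but fails `out_of_band_ok()`. That matches the report that only the out-of-data checksum detected the corruption.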
Re: [PATCHv2] bcache: option for allow stale data on read failure
Coly--

It's an interesting changeset.

I am not positive if it will work in practice-- the most likely
objects to be cached are filesystem metadata. Won't most filesystems
fall apart if some of their data structures revert back to an earlier
point of time?

Mike

On Tue, Sep 19, 2017 at 3:24 PM, Coly Li wrote:
> When bcache does read I/Os, for example in writeback or writethrough
> mode, if a read request on the cache device fails, bcache will try to
> recover the request by reading from the cached device. If the data on
> the cached device is not synced with the cache device, the requester
> will get stale data.
>
> For critical storage systems like databases, providing stale data from
> recovery may result in application-level data corruption, which is
> unacceptable. But for other situations like a multimedia stream cache,
> continuous service may be more important, and it is acceptable to fetch
> a chunk of stale data.
>
> This patch tries to solve the above conflict by adding a sysfs option
>     /sys/block/bcache/bcache/allow_stale_data_on_failure
> which is cleared (to 0) by default, i.e. disabled. Now people can make
> choices for different situations.
>
> With this patch, for a failed read request in writeback or writethrough
> mode, recovery of a recoverable read request only happens under one of
> the following conditions:
> - dc->has_dirty is zero. It means all data on the cache device is
>   synced to the cached device, so the recovered data is up to date.
> - dc->has_dirty is non-zero, and dc->allow_stale_data_on_failure is set
>   to 1. It means there is dirty data not yet synced to the cached
>   device, but the option allow_stale_data_on_failure is set, so
>   receiving stale data is explicitly acceptable to the requester.
>
> For other cache modes in bcache, read requests will never hit
> cached_dev_read_error(), so they don't need this patch.
>
> Please note, because the cache mode can be switched arbitrarily at run
> time, a writethrough mode might have been switched from a writeback
> mode. Therefore checking dc->has_dirty in writethrough mode still makes
> sense.
>
> Changelog:
> v2: rename sysfs entry from allow_stale_data_on_failure to
>     allow_stale_data_on_failure, and fix the confusing commit log.
> v1: initial patch posted.
>
> Signed-off-by: Coly Li
> Reported-by: Arne Wolf
> Cc: Nix
> Cc: Kai Krakow
> Cc: Eric Wheeler
> Cc: Junhui Tang
> Cc: sta...@vger.kernel.org
> ---
>  drivers/md/bcache/bcache.h  |  1 +
>  drivers/md/bcache/request.c | 14 +-
>  drivers/md/bcache/sysfs.c   |  4
>  3 files changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index dee542fff68e..f26b174f409a 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -356,6 +356,7 @@ struct cached_dev {
>  	unsigned		partial_stripes_expensive:1;
>  	unsigned		writeback_metadata:1;
>  	unsigned		writeback_running:1;
> +	unsigned		allow_stale_data_on_failure:1;
>  	unsigned char		writeback_percent;
>  	unsigned		writeback_delay;
>
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 019b3df9f1c6..becbc0959ca2 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -702,8 +702,20 @@ static void cached_dev_read_error(struct closure *cl)
>  {
>  	struct search *s = container_of(cl, struct search, cl);
>  	struct bio *bio = &s->bio.bio;
> +	struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
> +	int recovery_stale_data = dc ? dc->allow_stale_data_on_failure : 0;
>
> -	if (s->recoverable) {
> +	/*
> +	 * If dc->has_dirty is non-zero and the recovering data is on cache
> +	 * device, then recover from cached device will return a stale data
> +	 * to requester. But in some cases people accept stale data to avoid
> +	 * a -EIO. So I/O error recovery only happens when,
> +	 * - No dirty data on cache device.
> +	 * - Cached device is dirty but sysfs allow_stale_data_on_failure is
> +	 *   explicitly set (to 1) to accept stale data from recovery.
> +	 */
> +	if (s->recoverable &&
> +	    (!atomic_read(&dc->has_dirty) || recovery_stale_data)) {
>  		/* Retry from the backing device: */
>  		trace_bcache_read_retry(s->orig_bio);
>
> diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
> index f90f13616980..8603756005a8 100644
> --- a/drivers/md/bcache/sysfs.c
> +++ b/drivers/md/bcache/sysfs.c
> @@ -106,6 +106,7 @@ rw_attribute(cache_replacement_policy);
>  rw_attribute(btree_shrinker_disabled);
>  rw_attribute(copy_gc_enabled);
>  rw_attribute(size);
> +rw_attribute(allow_stale_data_on_failure);
>
>  SHOW(__bch_cached_dev)
>  {
> @@ -125,6 +126,7 @@ SHOW(__bch_cached_dev)
>  	var_printf(bypass_torture_test, "%i");
>  	var_printf(wr