Re: [PATCHv2] bcache: option for allow stale data on read failure

2017-09-20 Thread Coly Li
On 2017/9/20 5:40 PM, Michael Lyle wrote:
> On Wed, Sep 20, 2017 at 3:28 AM, Coly Li  wrote:
>> Even if the read request fails on file system metadata, because stale
>> data is eventually handed to the kernel file system code, the file
>> system probably won't complain either.
> 
> The scary case is when filesystem data that points to other filesystem
> data is cached.  E.g. the data structures representing what space is
> free on disk, or a directory, or a database btree.  Some examples:
> 
> Free space handling-- if a big file /foo is created, and the active
> free-space datastructures are in cache (and this is likely, because
> actively written places can have their writeback-writes
> cancelled/deferred indefinitely)-- and then later the caching disk
> fails, an old version of this will be read from disk.  Later, an
> effort to write a file /bar allocates the space used by /foo, and
> writes over it.
> 
> Directory entity handling-- if /var/spool/foo is an active directory
> (associated data structures in cache), and has the directory
> /var/spool/foo/bar under it, and then /bar is removed... the backing
> disk will still have a reference to bar.  If the space for bar is then
> used for something else, the kernel may end up reading something very
> different from what it expects for a directory later after a cache
> device failure.
> 
> Btrees, etc-- the same thing.  If a tree shrinks, old tree entities can
> end up pointing to other kinds of data.
> 
> I think this change is harmful-- it is not a good idea to
> automatically, at runtime, decide to start returning data that
> violates the guarantees a block device is supposed to obey about
> ordering and persistence.

Hi Mike,

I totally agree with you. The misleading commit log is my fault; if you
read it again you may find we stand on the same side, which is what I
sense from your response :-)

The current bcache code does provide stale data from read failure
recovery. In the v1 patch discussion people wanted to keep this behavior,
so in v2 I added an option to permit this "harmful" behavior and
disabled it by default.

And it is good to know Kent does not like an option; then we can simply
disable this "harmful" behavior.

Thanks.

Coly


Re: [PATCHv2] bcache: option for allow stale data on read failure

2017-09-20 Thread Coly Li
On 2017/9/20 6:07 PM, Kent Overstreet wrote:
> On Wed, Sep 20, 2017 at 06:24:33AM +0800, Coly Li wrote:
>> When bcache handles read I/O, for example in writeback or writethrough mode,
>> if a read request on the cache device fails, bcache tries to recover
>> the request by reading from the cached device. If the data on the cached
>> device is not in sync with the cache device, the requester gets stale data.
>>
>> For critical storage systems such as databases, providing stale data from
>> recovery may result in application-level data corruption, which is
>> unacceptable. But in other situations, such as a multimedia stream cache,
>> continuous service may be more important, and it is acceptable to fetch
>> a chunk of stale data.
>>
>> This patch tries to resolve the above conflict by adding a sysfs option
>>  /sys/block/bcache/bcache/allow_stale_data_on_failure
>> which is cleared (set to 0, disabled) by default. Now people can choose
>> the behavior that fits their situation.
> 
> IMO this is just a bug, I'd rather not have an option to keep the buggy
> behaviour. How about this patch:
> 

Hi Kent,

OK. Last time, when I discussed this with other bcache developers, people
wanted to keep this behavior, so I made it an option in this version of
the patch. I support fixing it without an option, because there are too
many options already. Good to know you reached a similar decision :-)


> commit 2746f9c1f962288d8c5d7dabe698bf7b3fddd405
> Author: Kent Overstreet 
> Date:   Wed Sep 20 18:06:37 2017 +0200
> 
> bcache: Don't recover from IO errors when reading dirty data
> 
> Signed-off-by: Kent Overstreet 
> 
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 382397772a..c2d57ef953 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -532,8 +532,10 @@ static int cache_lookup_fn(struct btree_op *op, struct 
> btree *b, struct bkey *k)
>  
>   PTR_BUCKET(b->c, k, ptr)->prio = INITIAL_PRIO;
>  
> - if (KEY_DIRTY(k))
> + if (KEY_DIRTY(k)) {
>   s->read_dirty_data = true;
> + s->recoverable = false;
> + }
>  

I thought of fixing it here. The reason I gave up on modifying this spot
is that cache_lookup_fn() is called for keys in leaf nodes (b->level == 0),
and bch_btree_map_keys_recurse() needs to do I/O to fetch upper-level
nodes before accessing a leaf node. When an SSD fails, bch_btree_node_get()
fails before cache_lookup_fn() is executed. So with your patch there is
no chance to set s->recoverable to false, and recovery still happens.

If you don't like an option, the following modification should be much
simpler:

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 681b4f12b05a..f397785d9c38 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -697,8 +697,10 @@ static void cached_dev_read_error(struct closure *cl)
 {
    struct search *s = container_of(cl, struct search, cl);
    struct bio *bio = &s->bio.bio;
+   struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
 
-   if (s->recoverable) {
+   if (s->recoverable &&
+       (dc && !atomic_read(&dc->has_dirty))) {
        /* Retry from the backing device: */
        trace_bcache_read_retry(s->orig_bio);

This might be the simplest way I know of for now.

Thanks.

Coly Li


Re: [PATCHv2] bcache: option for allow stale data on read failure

2017-09-20 Thread Kent Overstreet
On Wed, Sep 20, 2017 at 06:24:33AM +0800, Coly Li wrote:
> When bcache handles read I/O, for example in writeback or writethrough mode,
> if a read request on the cache device fails, bcache tries to recover
> the request by reading from the cached device. If the data on the cached
> device is not in sync with the cache device, the requester gets stale data.
> 
> For critical storage systems such as databases, providing stale data from
> recovery may result in application-level data corruption, which is
> unacceptable. But in other situations, such as a multimedia stream cache,
> continuous service may be more important, and it is acceptable to fetch
> a chunk of stale data.
> 
> This patch tries to resolve the above conflict by adding a sysfs option
>   /sys/block/bcache/bcache/allow_stale_data_on_failure
> which is cleared (set to 0, disabled) by default. Now people can choose
> the behavior that fits their situation.

IMO this is just a bug, I'd rather not have an option to keep the buggy
behaviour. How about this patch:

commit 2746f9c1f962288d8c5d7dabe698bf7b3fddd405
Author: Kent Overstreet 
Date:   Wed Sep 20 18:06:37 2017 +0200

bcache: Don't recover from IO errors when reading dirty data

Signed-off-by: Kent Overstreet 

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 382397772a..c2d57ef953 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -532,8 +532,10 @@ static int cache_lookup_fn(struct btree_op *op, struct btree *b, struct bkey *k)
 
    PTR_BUCKET(b->c, k, ptr)->prio = INITIAL_PRIO;
 
-   if (KEY_DIRTY(k))
+   if (KEY_DIRTY(k)) {
        s->read_dirty_data = true;
+       s->recoverable = false;
+   }
 
    n = bio_next_split(bio, min_t(uint64_t, INT_MAX,
                                  KEY_OFFSET(k) - bio->bi_iter.bi_sector),


Re: [PATCHv2] bcache: option for allow stale data on read failure

2017-09-20 Thread Coly Li
On 2017/9/20 8:59 AM, Michael Lyle wrote:
> Coly--
> 
> It's an interesting changeset.

Hi Mike,

Yes it's interesting :-) It fixes a silent database data corruption in
our product kernel. The most dangerous point is that it happens silently
even when an in-data checksum is used; this issue was detected by an
out-of-data checksum.

> I am not positive if it will work in practice-- the most likely
> objects to be cached are filesystem metadata.  Won't most filesystems
> fall apart if some of their data structures revert back to an earlier
> point of time?

For a database workload, most of the data cached on the SSD is data
blocks of database files which are replayed from the binlog (for example
MySQL). The file system won't complain in such a situation, and an
earlier version means all transaction information since the last update
is lost, in *silence*.

Even if the read request fails on file system metadata, because stale
data is eventually handed to the kernel file system code, the file
system probably won't complain either. Because,
- the file system reports an error when I/O fails; if stale data from
recovery is provided to the file system, it just uses the stale data
until a worse failure is detected by file system code.
- if the file system uses a metadata checksum, and the checksum is inside
the metadata block (which is quite common), the stale data is also
checksum consistent, so the file system won't report an error either.

So the data corruption happens at the application level, even though the
file system kernel code still thinks everything on disk is consistent.

Thanks.

Coly Li


> On Tue, Sep 19, 2017 at 3:24 PM, Coly Li  wrote:
[snip]



Re: [PATCHv2] bcache: option for allow stale data on read failure

2017-09-20 Thread Michael Lyle
Coly--

It's an interesting changeset.

I am not positive if it will work in practice-- the most likely
objects to be cached are filesystem metadata.  Won't most filesystems
fall apart if some of their data structures revert back to an earlier
point of time?

Mike


[PATCHv2] bcache: option for allow stale data on read failure

2017-09-19 Thread Coly Li
When bcache handles read I/O, for example in writeback or writethrough mode,
if a read request on the cache device fails, bcache tries to recover
the request by reading from the cached device. If the data on the cached
device is not in sync with the cache device, the requester gets stale data.

For critical storage systems such as databases, providing stale data from
recovery may result in application-level data corruption, which is
unacceptable. But in other situations, such as a multimedia stream cache,
continuous service may be more important, and it is acceptable to fetch
a chunk of stale data.

This patch tries to resolve the above conflict by adding a sysfs option
/sys/block/bcache/bcache/allow_stale_data_on_failure
which is cleared (set to 0, disabled) by default. Now people can choose
the behavior that fits their situation.

With this patch, for a failed read request in writeback or writethrough
mode, a recoverable read request is recovered only under one of the
following conditions:
 - dc->has_dirty is zero. This means all data on the cache device is synced
   to the cached device, so the recovered data is up to date.
 - dc->has_dirty is non-zero and dc->allow_stale_data_on_failure is set
   to 1. This means there is dirty data not yet synced to the cached device,
   but because allow_stale_data_on_failure is set, receiving stale data is
   explicitly acceptable to the requester.

For other cache modes in bcache, read requests never reach
cached_dev_read_error(), so they do not need this patch.

Please note that because the cache mode can be switched arbitrarily at run
time, a device in writethrough mode may have been switched from writeback
mode. Therefore checking dc->has_dirty in writethrough mode still makes sense.

Changelog:
v2: rename sysfs entry from allow_stale_data_on_failure  to
allow_stale_data_on_failure, and fix the confusing commit log.
v1: initial patch posted.

Signed-off-by: Coly Li 
Reported-by: Arne Wolf 
Cc: Nix 
Cc: Kai Krakow 
Cc: Eric Wheeler 
Cc: Junhui Tang 
Cc: sta...@vger.kernel.org
---
 drivers/md/bcache/bcache.h  |  1 +
 drivers/md/bcache/request.c | 14 +++++++++++++-
 drivers/md/bcache/sysfs.c   |  4 ++++
 3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index dee542fff68e..f26b174f409a 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -356,6 +356,7 @@ struct cached_dev {
    unsigned        partial_stripes_expensive:1;
    unsigned        writeback_metadata:1;
    unsigned        writeback_running:1;
+   unsigned        allow_stale_data_on_failure:1;
    unsigned char   writeback_percent;
    unsigned        writeback_delay;
 
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 019b3df9f1c6..becbc0959ca2 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -702,8 +702,20 @@ static void cached_dev_read_error(struct closure *cl)
 {
    struct search *s = container_of(cl, struct search, cl);
    struct bio *bio = &s->bio.bio;
+   struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
+   int recovery_stale_data = dc ? dc->allow_stale_data_on_failure : 0;
 
-   if (s->recoverable) {
+   /*
+    * If dc->has_dirty is non-zero and the recovering data is on cache
+    * device, then recover from cached device will return a stale data
+    * to requester. But in some cases people accept stale data to avoid
+    * a -EIO. So I/O error recovery only happens when,
+    * - No dirty data on cache device.
+    * - Cached device is dirty but sysfs allow_stale_data_on_failure is
+    *   explicitly set (to 1) to accept stale data from recovery.
+    */
+   if (s->recoverable &&
+       (!atomic_read(&dc->has_dirty) || recovery_stale_data)) {
        /* Retry from the backing device: */
        trace_bcache_read_retry(s->orig_bio);
 
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index f90f13616980..8603756005a8 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -106,6 +106,7 @@ rw_attribute(cache_replacement_policy);
 rw_attribute(btree_shrinker_disabled);
 rw_attribute(copy_gc_enabled);
 rw_attribute(size);
+rw_attribute(allow_stale_data_on_failure);
 
 SHOW(__bch_cached_dev)
 {
@@ -125,6 +126,7 @@ SHOW(__bch_cached_dev)
    var_printf(bypass_torture_test, "%i");
    var_printf(writeback_metadata,  "%i");
    var_printf(writeback_running,   "%i");
+   var_printf(allow_stale_data_on_failure,"%i");
    var_print(writeback_delay);
    var_print(writeback_percent);
    sysfs_hprint(writeback_rate,    dc->writeback_rate.rate << 9);
@@ -201,6 +203,7 @@ STORE(__cached_dev)
 #define d_strtoi_h(var)    sysfs_hatoi(var, dc->var)