Re: [PULL] Re: bcache stability patches

2015-12-30 Thread Denis Bychkov
On Tue, Dec 29, 2015 at 10:00 PM, Eric Wheeler
 wrote:
> Hi Jens and Kent,
>
> This affects many users, so please take a look when you have a moment:
>
> There is a growing bcache user community with a well-tested patchset that
> is necessary for production bcache use.  The diffstat is small, and pulling
> it into mainline would serve many people.
>
> More below:
>
> On Tue, 22 Dec 2015, Denis Bychkov wrote:
>> There is a set of bcache stability patches elevating bcache stability to
>> production level. As far as I know, there is not a single reported and
>> peer-confirmed bug that is not solved by this set. Unfortunately, for some
>> reason, Kent does not have enough time and/or energy to review them and
>> send them upstream. Let's come up with a way to review all these patches
>> (some of them written by Kent himself, some of them produced by the
>> community) and hand them to a maintainer who is willing to apply them
>> upstream. Without that, bcache is just another half-assed, unstable and
>> buggy cache layer. These patches will allow people to start using bcache
>> in production systems. Please find the patch set attached. (The patches
>> apply cleanly to the 4.3 and 4.4 kernel series.)
>
> Hi Denis,
>
> I'm maintaining a branch here that is ready to merge.  We have been
> testing this for about a year in production and it works great.  All Cc's
> and authors are correct and it (should) have every stability patch below,
> possibly others too.  Please tell me if there are any patches missing:

Just went through all the patches I have, and it looks like you did not miss
any. The only small correction: the author of commit
53803810fce7826feff7d5632a7ab3cc991e6243 is me, not Gabriel de Perthuis.
Gabriel wrote the original patch, which I mention in the comment, but it was
broken (it did not release the mutex), so I created a new one that actually
fixes the device_busy error.

>
> git pull https://github.com/ewheelerinc/linux.git bcache-patches-for-3.17
> (Yes, github for hosting only, I don't edit with their web interfaces.)
>
> Note that this branch still merges cleanly through v4.4-rc7 and as far
> back as 3.17-rc1 (maybe earlier).  Each patch provides Cc: 
> sta...@vger.kernel.org.
>
> It is ready to merge!  We just need Jens or Kent or someone to pull it in.
> Here is the diffstat and shortlog against v4.4-rc7:
>
>  drivers/md/bcache/btree.c |  5 -
>  drivers/md/bcache/super.c | 16 
>  drivers/md/bcache/writeback.c | 37 ++---
>  drivers/md/bcache/writeback.h |  3 ++-
>  4 files changed, 48 insertions(+), 13 deletions(-)
>
> Al Viro (1):
>   bcache: fix a leak in bch_cached_dev_run()
>
> Gabriel de Perthuis (1):
>   bcache: allows use of register in udev to avoid "device_busy" error.
>
> Kent Overstreet (2):
>   bcache: Add a cond_resched() call to gc
>   bcache: Change refill_dirty() to always scan entire disk if
> necessary
>
> Stefan Bader (1):
>   bcache: prevent crash on changing writeback_running
>
> Zheng Liu (3):
>   bcache: fix a livelock when we cause a huge number of cache misses
>   bcache: clear BCACHE_DEV_UNLINK_DONE flag when attaching a backing 
> device
>   bcache: unregister reboot notifier if bcache fails to unregister device
>
> See also these threads:
>   https://lkml.org/lkml/2015/12/5/38
>   https://www.mail-archive.com/linux-bcache@vger.kernel.org/msg03159.html
>
> Quickly view commits here, too:
>   https://github.com/ewheelerinc/linux/commits/bcache-patches-for-3.17
>
>
> Cheers,
>
> -Eric
>
>
> --
> Eric Wheeler, President   eWheeler, Inc. dba Global Linux Security
> 888-LINUX26 (888-546-8926)Fax: 503-716-3878   PO Box 25107
> www.GlobalLinuxSecurity.pro   Linux since 1996! Portland, OR 97298
>



-- 

Denis


BCACHE stability patches

2015-12-22 Thread Denis Bychkov
Hi,

There is a set of bcache stability patches elevating bcache stability to
production level. As far as I know, there is not a single reported and
peer-confirmed bug that is not solved by this set. Unfortunately, for some
reason, Kent does not have enough time and/or energy to review them and
send them upstream. Let's come up with a way to review all these patches
(some of them written by Kent himself, some of them produced by the
community) and hand them to a maintainer who is willing to apply them
upstream. Without that, bcache is just another half-assed, unstable and
buggy cache layer.
These patches will allow people to start using bcache in production systems.
Please find the patch set attached. (The patches apply cleanly to the 4.3
and 4.4 kernel series.)

---
From: Zheng Liu 

In bcache_init(), the reboot notifier is not unregistered if bcache then
fails to register a block device.  This commit fixes that.

Signed-off-by: Zheng Liu 
Tested-by: Joshua Schmid 
---
 drivers/md/bcache/super.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2066,8 +2066,10 @@ static int __init bcache_init(void)
 	closure_debug_init();
 
 	bcache_major = register_blkdev(0, "bcache");
-	if (bcache_major < 0)
+	if (bcache_major < 0) {
+		unregister_reboot_notifier(&reboot);
 		return bcache_major;
+	}
 
 	if (!(bcache_wq = create_workqueue("bcache")) ||
 	    !(bcache_kobj = kobject_create_and_add("bcache", fs_kobj)) ||
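
An aside for context: the pattern behind this fix is that every init step
that can fail must undo whatever earlier steps already set up. A minimal
userspace sketch of that unwind pattern, using hypothetical step_a(),
step_b() and undo_a() helpers rather than the real bcache calls, could look
like this:

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical init steps standing in for register_reboot_notifier(),
 * register_blkdev(), etc.  Each returns 0 on success. */
static int step_a(void) { puts("step_a done"); return 0; }
static void undo_a(void) { puts("step_a undone"); }
static int step_b(void) { puts("step_b failed"); return -1; }

static int init_example(void)
{
	int ret;

	ret = step_a();
	if (ret)
		return ret;

	ret = step_b();
	if (ret)
		goto err_undo_a;	/* undo everything done so far */

	return 0;

err_undo_a:
	undo_a();
	return ret;
}

int main(void)
{
	return init_example() ? EXIT_FAILURE : EXIT_SUCCESS;
}

The patch above does exactly this unwinding inline, since register_blkdev()
is the only step after the reboot notifier is registered.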
From: Zheng Liu 
To: linux-bca...@vger.kernel.org
Cc: Zheng Liu, Joshua Schmid, Zhu Yanhai, Kent Overstreet
Subject: [PATCH v2] bcache: fix a livelock in btree lock
Date: Wed, 25 Feb 2015 20:32:09 +0800
From: Zheng Liu 

This commit tries to fix a livelock in bcache.  The livelock can happen
when we cause a huge number of cache misses simultaneously.

On a cache miss, bcache executes the following path:

->cached_dev_make_request()
  ->cached_dev_read()
    ->cached_lookup()
      ->bch_btree_map_keys()
        ->btree_root()  <-----------------------
          ->bch_btree_map_keys_recurse()        |
            ->cache_lookup_fn()                 |
              ->cached_dev_cache_miss()         |
                ->bch_btree_insert_check_key() -|
                  [If btree->seq is not equal to seq + 1, we return
                   EINTR and traverse the btree again.]

In bch_btree_insert_check_key() we first check the upgrade flag
(op->lock == -1); when it is set we release the read btree->lock and try
to take the write btree->lock.  Every time the write lock is taken,
btree->seq is monotonically increased in order to detect other threads
modifying the node during the cache miss (see btree.h:74).  But if many
cache misses are being handled at once, we can hit a livelock because
btree->seq is always being changed by someone else, so nobody makes
progress.

With this commit, if we encounter such a race while traversing the btree,
the retry takes the write btree->lock.  This sacrifices some scalability,
but it ensures that only one thread modifies the btree at a time and that
forward progress is made.

Signed-off-by: Zheng Liu 
Tested-by: Joshua Schmid 
Cc: Joshua Schmid 
Cc: Zhu Yanhai 
Cc: Kent Overstreet 
---
changelog:
v2: fix a bug that stopped all concurrent writes unconditionally.

 drivers/md/bcache/btree.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -2162,8 +2162,10 @@ int bch_btree_insert_check_key(struct bt
 		rw_lock(true, b, b->level);
 
 		if (b->key.ptr[0] != btree_ptr ||
-		    b->seq != seq + 1)
+		    b->seq != seq + 1) {
+			op->lock = b->level;
 			goto out;
+		}
 	}
 
 	SET_KEY_PTRS(check_key, 1);
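
For readers unfamiliar with the seq trick described in the commit message,
here is a minimal standalone sketch (plain pthreads, not bcache code) of the
optimistic lock-upgrade check and the escalation the v2 patch performs via
op->lock when the sequence check fails. The names node, insert_check_key and
retry_with_write_lock are invented for illustration:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

struct node {
	pthread_rwlock_t lock;
	uint64_t seq;		/* bumped whenever the write lock is taken */
};

static void node_write_lock(struct node *n)
{
	pthread_rwlock_wrlock(&n->lock);
	n->seq++;
}

static int insert_check_key(struct node *n, int *retry_with_write_lock)
{
	/* Remember seq under the read lock, then drop it to upgrade. */
	pthread_rwlock_rdlock(&n->lock);
	uint64_t seq = n->seq;
	pthread_rwlock_unlock(&n->lock);

	node_write_lock(n);	/* seq becomes old seq + 1 unless we raced */
	int ok = (n->seq == seq + 1);
	if (!ok)
		*retry_with_write_lock = 1;	/* escalate, analogous to op->lock = b->level */
	pthread_rwlock_unlock(&n->lock);

	return ok ? 0 : -1;
}

int main(void)
{
	struct node n = { .seq = 0 };
	int escalate = 0;

	pthread_rwlock_init(&n.lock, NULL);
	if (insert_check_key(&n, &escalate))
		printf("raced; retry holding the write lock\n");
	else
		printf("no race, inserted\n");
	return 0;
}

In this single-threaded demo the check always succeeds; under heavy
contention the escalation is what guarantees forward progress.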

From: Joshua Schmid 
Subject: [PATCH] fix a leak in bch_cached_dev_run()
Newsgroups: gmane.linux.kernel.bcache.devel
Date: 2015-02-03 11:24:06 GMT

From: Al Viro 

Signed-off-by: Al Viro 
Tested-by: Joshua Schmid 
---
 drivers/md/bcache/super.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -847,8 +847,11 @@ void bch_cached_dev_run(struct cached_de
 	buf[SB_LABEL_SIZE] = '\0';
 	env[2] = kasprintf(GFP_KERNEL, "CACHED_LABEL=%s", buf);
 
-	if (atomic_xchg(&dc->running, 1))
+	if (atomic_xchg(&dc->running, 1)) {
+		kfree(env[1]);
+		kfree(env[2]);
 		return;
+	}
 
 	if (!d->c &&
 	    BDEV_STATE(&dc->sb) != BDEV_STATE_NONE) {
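
For context (my note, not part of the patch): env[1] and env[2] are
allocated with kasprintf() a few lines earlier in bch_cached_dev_run(), so
the early return leaked them. A tiny userspace analog of the rule that every
exit path frees what was allocated, using asprintf() as a stand-in and a
hypothetical run_once() helper, might look like:

#define _GNU_SOURCE		/* for asprintf() */
#include <stdio.h>
#include <stdlib.h>

static int already_running;	/* stand-in for dc->running */

static int run_once(const char *uuid, const char *label)
{
	char *env1 = NULL, *env2 = NULL;

	if (asprintf(&env1, "CACHED_UUID=%s", uuid) < 0)
		return -1;
	if (asprintf(&env2, "CACHED_LABEL=%s", label) < 0) {
		free(env1);
		return -1;
	}

	if (already_running) {	/* the early-return path that used to leak */
		free(env1);
		free(env2);
		return 0;
	}

	printf("%s %s\n", env1, env2);	/* normal path: use, then free */
	free(env1);
	free(env2);
	already_running = 1;
	return 0;
}

int main(void)
{
	run_once("1234", "mydisk");
	run_once("1234", "mydisk");	/* second call hits the early return */
	return 0;
}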
From: Denis Bychkov 

Allows the use of "register", rather than "register_quiet", in udev to avoid
the "device_busy" error.  The initial patch proposed at
https://lkml.org/lkml/2013/8/26/549 by Gabriel de Perthuis does not unlock
the mutex and hangs the kernel.
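
The hang itself is the classic missing-unlock-on-error-path bug. A generic
userspace illustration (not the bcache patch itself, which is not shown
here, and with a made-up do_register() helper) of why a skipped unlock turns
into a hang on the next attempt:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static int do_register(int device_busy)
{
	pthread_mutex_lock(&lock);

	if (device_busy) {
		pthread_mutex_unlock(&lock);	/* the unlock the broken patch forgot */
		return -1;
	}

	/* ... do the registration work ... */
	pthread_mutex_unlock(&lock);
	return 0;
}

int main(void)
{
	printf("busy attempt: %d\n", do_register(1));
	printf("second attempt: %d\n", do_register(0));	/* would block forever without the unlock above */
	return 0;
}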

4.3 kernel panics when MMC/SDHC card is inserted on thinkpad

2015-11-08 Thread Denis Bychkov
The problem only started with the 4.3 kernel (at least as of rc5); 4.2.x
does not have it. The kernel panic happens immediately after the SDHC card
is inserted, with 100% reproducibility. If the system boots up with the
card already inserted, it crashes as soon as the sdhci_pci module is
loaded. If the module is unloaded or blacklisted, obviously, nothing
happens, as the system does not see the MMC card reader.
The machine is a Lenovo ThinkPad T510 laptop with an Intel Westmere
CPU/3400 series chipset running 64-bit kernel 4.3.0.

(somewhat) relevant kernel configuration bits:
# CONFIG_CALGARY_IOMMU is not set
CONFIG_IOMMU_HELPER=y
CONFIG_VFIO_IOMMU_TYPE1=m
CONFIG_IOMMU_API=y
CONFIG_IOMMU_SUPPORT=y
# Generic IOMMU Pagetable Support
CONFIG_IOMMU_IOVA=y
# CONFIG_AMD_IOMMU is not set
CONFIG_INTEL_IOMMU=y
CONFIG_INTEL_IOMMU_DEFAULT_ON=y
CONFIG_INTEL_IOMMU_FLOPPY_WA=y
# CONFIG_IOMMU_STRESS is not set
CONFIG_KVM_INTEL=m
CONFIG_PCI_MMCONFIG=y
# Supported MMC/SDIO adapters
CONFIG_MMC=m
# CONFIG_MMC_DEBUG is not set
# CONFIG_MMC_CLKGATE is not set
# MMC/SD/SDIO Card Drivers
CONFIG_MMC_BLOCK=m
CONFIG_MMC_BLOCK_MINORS=8
CONFIG_MMC_BLOCK_BOUNCE=y
CONFIG_MMC_TEST=m
# MMC/SD/SDIO Host Controller Drivers
CONFIG_MMC_SDHCI=m
CONFIG_MMC_SDHCI_PCI=m
CONFIG_MMC_RICOH_MMC=y
CONFIG_MMC_SDHCI_ACPI=m

Card reader device:
0d:00.0 SD Host controller: Ricoh Co Ltd MMC/SD Host Controller (rev 01)
Subsystem: Lenovo MMC/SD Host Controller
Flags: bus master, fast devsel, latency 0, IRQ 16
Memory at f210 (32-bit, non-prefetchable) [size=256]
Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Power Management version 3
Capabilities: [80] Express Endpoint, MSI 00
Capabilities: [100] Virtual Channel
Capabilities: [800] Advanced Error Reporting
Kernel driver in use: sdhci-pci
Kernel modules: sdhci_pci

The panic report caught via netconsole:

[22946.904308] [ cut here ]
[22946.906564] kernel BUG at drivers/iommu/intel-iommu.c:3485!
[22946.908801] invalid opcode:  [#1] PREEMPT SMP
[22946.93] Modules linked in: netconsole dm_mod bnep
cpufreq_powersave cpufreq_stats cpufreq_conservative cpufreq_userspace
coretemp intel_powerclamp kvm_intel kvm crct10dif_pclmul crc32_pclmul
jitterentropy_rng hmac sha256_ssse3 sha256_generic drbg
snd_hda_codec_hdmi ansi_cprng gpio_ich iTCO_wdt iTCO_vendor_support
aesni_intel arc4 aes_x86_64 nouveau mxm_wmi lrw gf128mul glue_helper
ablk_helper iwldvm cryptd psmouse mac80211 uvcvideo serio_raw pcspkr
nd_e820 videobuf2_vmalloc ttm evdev videobuf2_memops i2c_algo_bit
mousedev btusb videobuf2_core btrtl drm_kms_helper v4l2_common mac_hid
btbcm videodev btintel drm snd_hda_codec_conexant bluetooth
snd_hda_codec_generic iwlwifi syscopyarea sysfillrect sysimgblt
fb_sys_fops snd_hda_intel snd_hda_codec cfg80211 snd_hda_core
snd_hwdep i2c_i801 thinkpad_acpi lpc_ich snd_pcm sg mfd_core nvram
i2c_core snd_timer intel_ips rfkill hwmon snd mei_me soundcore
intel_agp mei tpm_tis intel_gtt shpchp tpm agpgart battery rtc_cmos ac
video thermal wmi acpi_cpufreq button processor tp_smapi(O)
thinkpad_ec(O) autofs4 ext4 crc16 mbcache jbd2 btrfs xor hid_generic
usbhid hid raid6_pq sr_mod cdrom sd_mod uas usb_storage firewire_ohci
ahci libahci crc32c_intel libata atkbd sdhci_pci scsi_mod ehci_pci
sdhci ehci_hcd e1000e firewire_core mmc_core crc_itu_t ptp usbcore
usb_common pps_core
[22946.929431] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G   O
4.3.0-westmere #1
[22946.932551] Hardware name: LENOVO 4313CTO/4313CTO, BIOS 6MET92WW
(1.52 ) 09/26/2012
[22946.935701] task: 88023231a580 ti: 88023232c000 task.ti:
88023232c000
[22946.938878] RIP: 0010:[]  []
intel_unmap+0x1d0/0x210
[22946.942117] RSP: 0018:88023bd83da8  EFLAGS: 00010046
[22946.945341] RAX:  RBX: 880231ea5580 RCX: 0002
[22946.948592] RDX:  RSI: fffebda0 RDI: 880231e7d098
[22946.951855] RBP: 88023bd83de0 R08:  R09: 
[22946.955131] R10: 563f08fc R11: 1849050d R12: 880231e7d098
[22946.958423] R13: 8800bacbbc20 R14: fffebda0 R15: 
[22946.961723] FS:  () GS:88023bd8()
knlGS:
[22946.965051] CS:  0010 DS:  ES:  CR0: 8005003b
[22946.968387] CR2: e4d9c0e0 CR3: 01a0c000 CR4: 06e0
[22946.971760] Stack:
[22946.975131]  8800bacbbc60  880231ea5580
880231ea5580
[22946.978598]  8800bacbbc20 0010 
88023bd83df0
[22946.982064]  813cad22 88023bd83e48 c01090c2
0282
[22946.985546] Call Trace:
[22946.988984]  
[22946.989016]  [] intel_unmap_sg+0x12/0x20
[22946.995844]  [] sdhci_finish_data+0x142/0x340 [sdhci]
[22946.999296]  [] sdhci_irq+0x484/0x9b5 [sdhci]
[22947.002759]  [] ? notifier_call_chain+0x4a/0x70
[22947.006222]  [] handle_ir

Re: [PATCH] bcache: Fix writeback_thread never writing back incomplete stripes.

2015-09-18 Thread Denis Bychkov
Hi Kent,

After running for a day with this new version of your patch, I did not
notice any problems, so I assume it works. I'll keep an eye on it and
report back if I find anything bad. I believe you finally fixed that
long-standing bug with the writeback thread spinning the CPU.
To Vojtech Pavlik: thank you for catching it! For a long time I had to
run bcache with the partial_stripes_expensive branch disabled, which
seriously affected its performance on my RAID-6.

On Thu, Sep 17, 2015 at 2:31 PM, Kent Overstreet
 wrote:
> On Thu, Sep 17, 2015 at 08:40:54AM -0800, Kent Overstreet wrote:
>> On Thu, Sep 17, 2015 at 11:30:17AM -0400, Denis Bychkov wrote:
>> > Well, it turns out my celebration was a bit premature.
>> >
>> > PLEASE, DO NOT APPLY THE PATCH POSTED BY KENT (not the one Vojtech
>> > posted) ON A PRODUCTION SYSTEM, IT CAUSES DATA CORRUPTION.
>> >
>> > The interesting thing is that it somehow damaged the partition that
>> > was not supposed to receive any writes (the file system was mounted
>> > read-only), so my guess is that the patch causes the blocks residing
>> > in the write-back cache to flush to the wrong blocks on the backing
>> > device.
>> > Everything was going great until I rebooted and saw this in the log:
>> >
>> > [   19.639082] attempt to access beyond end of device
>> > [   19.643984] md1p2: rw=1, want=75497520, limit=62914560
>> > [   19.659033] attempt to access beyond end of device
>> > [   19.663929] md1p2: rw=1, want=75497624, limit=62914560
>> > [   19.669447] attempt to access beyond end of device
>> > [   19.674338] md1p2: rw=1, want=75497752, limit=62914560
>> > [   19.679195] attempt to access beyond end of device
>> > [   19.679199] md1p2: rw=1, want=75498080, limit=62914560
>> > [   19.689007] attempt to access beyond end of device
>> > [   19.689011] md1p2: rw=1, want=75563376, limit=62914560
>> > [   19.699055] attempt to access beyond end of device
>> > [   19.699059] md1p2: rw=1, want=79691816, limit=62914560
>> > [   19.719246] attempt to access beyond end of device
>> > [   19.724144] md1p2: rw=1, want=79691928, limit=62914560
>> > ..
>> > (it's a small example, the list was much longer)
>> > And the next thing I found out the super block on my 10-Tb XFS RAID was 
>> > gone. :)
>> > Oh well, it's a good thing I have backups.
>> > I knew what I was doing when trying the untested patches. I should
>> > have made the RAID md partition read-only, not the file system. I kind
>> > of expected that something could have gone wrong with the file system
>> > I was testing, just did not expect it would fire nukes at the innocent
>> > bystanders.
>>
>> Aw, shit. That's just _bizzare_.
>>
>> I have a theory - it appears that last_scanned isn't getting initialized 
>> before
>> it's used, so it's going to be all 0s the very first time... which it appears
>> could cause it to slurp up keys from the wrong device (and if that device was
>> bigger than the correct device, that could explain the accesses beyond the 
>> end
>> of the device).
>>
>> Currently just a theory though, and I have no clue why it would only be 
>> exposed
>> with my patch.
>
> Here's an updated patch that has a fix for _that_ theory, and also a new
> BUG_ON(). Any chance you could test it?
>
> Oh - I didn't ask - _do_ you have multiple backing devices attached to the 
> same
> cache set? Because if you don't, this isn't it at all...
>
> -- >8 --
> Subject: [PATCH] bcache: Change refill_dirty() to always scan entire disk if 
> necessary
>
> Previously, it would only scan the entire disk if it was starting from the 
> very
> start of the disk - i.e. if the previous scan got to the end.
>
> This was broken by refill_full_stripes(), which updates last_scanned so that
> refill_dirty was never triggering the searched_from_start path.
>
> But if we change refill_dirty() to always scan the entire disk if necessary,
> regardless of what last_scanned was, the code gets cleaner and we fix that bug
> too.
>
> Signed-off-by: Kent Overstreet 
> ---
>  drivers/md/bcache/writeback.c | 37 ++---
>  1 file changed, 30 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index cdde0f32f0..d383024247 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -310,6 +310,10 @@ void bcache_dev_sectors_dirty_add(struct cache_set *c, 
> unsigned

Re: [PATCH] bcache: Fix writeback_thread never writing back incomplete stripes.

2015-09-17 Thread Denis Bychkov
Yes, sure, I'll try it.
And yes, this was exactly my setup: several backing devices sharing
the same cache set, all in writeback mode.
When I decided to try your patch yesterday, I created a few more and
attached them to the same cache set. I did not want to experiment with
the live ones, so I just remounted them read-only to make sure they
wouldn't send any writes, instead of switching the md array into RO
mode. I am an idiot.
And today, after I detached the botched XFS partition from the cache
(to make sure there was no more dirty data left) and looked at the
area where there used to be an XFS super block, I found a super block
belonging to one of the new devices I created. So there is no mistake
here, this is exactly what happened; I think you pinned it.
The only thing I don't understand is how those writes ended up on a
different device. I'd understand it if it were the same device but a
different partition: then we would be writing to sectors that belong
to a different partition on the same disk. But this is not the case; it
is a different md device that received the unwanted writes. Unless...
unless, of course, the addresses of the write requests stored in the
bcache cache set have already been translated into physical interface
(ATA/SCSI) addresses. Have they? If that is the case, they share the
same address space, because they are the same SATA drives.
But that can't be true, right? Bcache cannot and does not store
physical drive addresses in the cache, does it?

On Thu, Sep 17, 2015 at 2:31 PM, Kent Overstreet
 wrote:
> On Thu, Sep 17, 2015 at 08:40:54AM -0800, Kent Overstreet wrote:
>> On Thu, Sep 17, 2015 at 11:30:17AM -0400, Denis Bychkov wrote:
>> > Well, it turns out my celebration was a bit premature.
>> >
>> > PLEASE, DO NOT APPLY THE PATCH POSTED BY KENT (not the one Vojtech
>> > posted) ON A PRODUCTION SYSTEM, IT CAUSES DATA CORRUPTION.
>> >
>> > The interesting thing is that it somehow damaged the partition that
>> > was not supposed to receive any writes (the file system was mounted
>> > read-only), so my guess is that the patch causes the blocks residing
>> > in the write-back cache to flush to the wrong blocks on the backing
>> > device.
>> > Everything was going great until I rebooted and saw this in the log:
>> >
>> > [   19.639082] attempt to access beyond end of device
>> > [   19.643984] md1p2: rw=1, want=75497520, limit=62914560
>> > [   19.659033] attempt to access beyond end of device
>> > [   19.663929] md1p2: rw=1, want=75497624, limit=62914560
>> > [   19.669447] attempt to access beyond end of device
>> > [   19.674338] md1p2: rw=1, want=75497752, limit=62914560
>> > [   19.679195] attempt to access beyond end of device
>> > [   19.679199] md1p2: rw=1, want=75498080, limit=62914560
>> > [   19.689007] attempt to access beyond end of device
>> > [   19.689011] md1p2: rw=1, want=75563376, limit=62914560
>> > [   19.699055] attempt to access beyond end of device
>> > [   19.699059] md1p2: rw=1, want=79691816, limit=62914560
>> > [   19.719246] attempt to access beyond end of device
>> > [   19.724144] md1p2: rw=1, want=79691928, limit=62914560
>> > ..
>> > (it's a small example, the list was much longer)
>> > And the next thing I found out the super block on my 10-Tb XFS RAID was 
>> > gone. :)
>> > Oh well, it's a good thing I have backups.
>> > I knew what I was doing when trying the untested patches. I should
>> > have made the RAID md partition read-only, not the file system. I kind
>> > of expected that something could have gone wrong with the file system
>> > I was testing, just did not expect it would fire nukes at the innocent
>> > bystanders.
>>
>> Aw, shit. That's just _bizzare_.
>>
>> I have a theory - it appears that last_scanned isn't getting initialized 
>> before
>> it's used, so it's going to be all 0s the very first time... which it appears
>> could cause it to slurp up keys from the wrong device (and if that device was
>> bigger than the correct device, that could explain the accesses beyond the 
>> end
>> of the device).
>>
>> Currently just a theory though, and I have no clue why it would only be 
>> exposed
>> with my patch.
>
> Here's an updated patch that has a fix for _that_ theory, and also a new
> BUG_ON(). Any chance you could test it?
>
> Oh - I didn't ask - _do_ you have multiple backing devices attached to the 
> same
> cache set? Because if you don't, this isn't it at all...
>
> -- >8 --
> Subject: [PATCH] bcache: Change refill_dirty() to

Re: [PATCH] bcache: Fix writeback_thread never writing back incomplete stripes.

2015-09-17 Thread Denis Bychkov
Well, it turns out my celebration was a bit premature.

PLEASE, DO NOT APPLY THE PATCH POSTED BY KENT (not the one Vojtech
posted) ON A PRODUCTION SYSTEM, IT CAUSES DATA CORRUPTION.

The interesting thing is that it somehow damaged the partition that
was not supposed to receive any writes (the file system was mounted
read-only), so my guess is that the patch causes the blocks residing
in the write-back cache to flush to the wrong blocks on the backing
device.
Everything was going great until I rebooted and saw this in the log:

[   19.639082] attempt to access beyond end of device
[   19.643984] md1p2: rw=1, want=75497520, limit=62914560
[   19.659033] attempt to access beyond end of device
[   19.663929] md1p2: rw=1, want=75497624, limit=62914560
[   19.669447] attempt to access beyond end of device
[   19.674338] md1p2: rw=1, want=75497752, limit=62914560
[   19.679195] attempt to access beyond end of device
[   19.679199] md1p2: rw=1, want=75498080, limit=62914560
[   19.689007] attempt to access beyond end of device
[   19.689011] md1p2: rw=1, want=75563376, limit=62914560
[   19.699055] attempt to access beyond end of device
[   19.699059] md1p2: rw=1, want=79691816, limit=62914560
[   19.719246] attempt to access beyond end of device
[   19.724144] md1p2: rw=1, want=79691928, limit=62914560
..
(this is just a small sample; the list was much longer)
And the next thing I found out was that the super block on my 10-TB XFS
RAID was gone. :)
Oh well, it's a good thing I have backups.
I knew what I was doing when trying the untested patches. I should
have made the RAID md partition read-only, not the file system. I kind
of expected that something could go wrong with the file system I was
testing; I just did not expect it to fire nukes at innocent bystanders.


On Wed, Sep 16, 2015 at 5:08 PM, Denis Bychkov  wrote:
> Hi Kent, Vojtech, list
>
> Your fix works perfectly; I can finally remove my patch that disables
> the partial_stripes_expensive branch and unleash the full bcache
> performance on my RAID-6 array. I have been aware of this problem for
> some time now, but never got around to learning the bcache codebase
> well enough to find out why it happens with partial-stripes writeback
> enabled, so I just disabled that branch to avoid the CPU spinning.
> Thanks a lot to Vojtech for finding the reason and to you for the
> patch. I have quite a collection of stability patches for mainline
> bcache. Would you please be so kind as to review them when convenient
> and merge upstream the ones you think are worth merging? It will be
> the 4.4 merge window, I believe.
> Please find all the patches attached. All of them are rebased on the
> mainline 4.2 kernel. If somebody needs the 4.1.x versions, please let
> me know.
>
>
> On Wed, Sep 16, 2015 at 7:32 AM, Kent Overstreet
>  wrote:
>> On Sat, Sep 05, 2015 at 01:10:12PM +0200, Vojtech Pavlik wrote:
>>> Fix writeback_thread never finishing writing back all dirty data in bcache 
>>> when
>>> partial_stripes_expensive is set, and spinning, consuming 100% of CPU 
>>> instead.
>>>
>>> Signed-off-by: Vojtech Pavlik 
>>> ---
>>>
>>> This is a fix for the current upstream bcache, not the devel branch.
>>>
>>> If partial_stripes_expensive is set for a cache set, then writeback_thread
>>> always attempts to write full stripes out back to the backing device first.
>>> However, since it does that based on a bitmap and not a simple linear
>>> search, like the rest of the code of refill_dirty(), it changes the
>>> last_scanned pointer so that it never points to 'end'. refill_dirty() then
>>> never tries to scan from 'start', resulting in the writeback_thread
>>> looping, consuming 100% of CPU, but never making progress in writing out
>>> the incomplete dirty stripes in the cache.
>>>
>>> Scanning the tree after not finding enough full stripes fixes the issue.
>>>
>>> Incomplete dirty stripes are written to the backing device, the device
>>> eventually reaches a clean state if there is nothing dirtying data and
>>> writeback_thread sleeps. This also fixes the problem of the cache device
>>> not being possible to detach in the partial_stripes_expensive scenario.
>>
>> Good catch!
>>
>>> It may be more efficient to separate the last_scanned field for normal and
>>> stripe scans instead.
>>
>> Actually, I think I like your approach - I think it gives us the behaviour we
>> want, although it took some thinking to figure out why.
>>
>> One of the reasons for last_scanned is just to do writeback in LBA order and
>> avoid making the disks seek around - so, if refill_full_stripes() did just 
>> queue
>> up some IO at last_scanne

Re: [PATCH] bcache: Fix writeback_thread never writing back incomplete stripes.

2015-09-16 Thread Denis Bychkov
Hi Kent, Vojtech, list

Your fix works perfectly; I can finally remove my patch that disables
the partial_stripes_expensive branch and unleash the full bcache
performance on my RAID-6 array. I have been aware of this problem for
some time now, but never got around to learning the bcache codebase
well enough to find out why it happens with partial-stripes writeback
enabled, so I just disabled that branch to avoid the CPU spinning.
Thanks a lot to Vojtech for finding the reason and to you for the
patch. I have quite a collection of stability patches for mainline
bcache. Would you please be so kind as to review them when convenient
and merge upstream the ones you think are worth merging? It will be
the 4.4 merge window, I believe.
Please find all the patches attached. All of them are rebased on the
mainline 4.2 kernel. If somebody needs the 4.1.x versions, please let
me know.


On Wed, Sep 16, 2015 at 7:32 AM, Kent Overstreet
 wrote:
> On Sat, Sep 05, 2015 at 01:10:12PM +0200, Vojtech Pavlik wrote:
>> Fix writeback_thread never finishing writing back all dirty data in bcache 
>> when
>> partial_stripes_expensive is set, and spinning, consuming 100% of CPU 
>> instead.
>>
>> Signed-off-by: Vojtech Pavlik 
>> ---
>>
>> This is a fix for the current upstream bcache, not the devel branch.
>>
>> If partial_stripes_expensive is set for a cache set, then writeback_thread
>> always attempts to write full stripes out back to the backing device first.
>> However, since it does that based on a bitmap and not a simple linear
>> search, like the rest of the code of refill_dirty(), it changes the
>> last_scanned pointer so that it never points to 'end'. refill_dirty() then
>> never tries to scan from 'start', resulting in the writeback_thread
>> looping, consuming 100% of CPU, but never making progress in writing out
>> the incomplete dirty stripes in the cache.
>>
>> Scanning the tree after not finding enough full stripes fixes the issue.
>>
>> Incomplete dirty stripes are written to the backing device, the device
>> eventually reaches a clean state if there is nothing dirtying data and
>> writeback_thread sleeps. This also fixes the problem of the cache device
>> not being possible to detach in the partial_stripes_expensive scenario.
>
> Good catch!
>
>> It may be more efficient to separate the last_scanned field for normal and
>> stripe scans instead.
>
> Actually, I think I like your approach - I think it gives us the behaviour we
> want, although it took some thinking to figure out why.
>
> One of the reasons for last_scanned is just to do writeback in LBA order and
> avoid making the disks seek around - so, if refill_full_stripes() did just 
> queue
> up some IO at last_scanned, we do want to keep scanning from that position for
> non full stripes.
>
> But since (as you noted) last_scanned is never going to be >= end after 
> calling
> refill_full_strips() (if there were any full stripes) - really the correct 
> thing
> to do is loop around to the start if necessary so that we can successfully 
> scan
> everything.
>
> The only inefficiency I see with your approach is that on the second scan, 
> after
> we've looped, we don't need to scan to the end of the disk - we only need to
> scan up to where we initially started.
>
> But, another observation - if we change refill_dirty() so that it always scans
> the entire keyspace if necessary, regardless of where last_scanned was - well,
> that really isn't specific to the refill_full_stripes() case - we can just
> always do that.
>
> Can you give this patch a try?
>
> -- >8 --
> Subject: [PATCH] bcache: Change refill_dirty() to always scan entire disk if 
> necessary
>
> Previously, it would only scan the entire disk if it was starting from the 
> very
> start of the disk - i.e. if the previous scan got to the end.
>
> This was broken by refill_full_stripes(), which updates last_scanned so that
> refill_dirty was never triggering the searched_from_start path.
>
> But if we change refill_dirty() to always scan the entire disk if necessary,
> regardless of what last_scanned was, the code gets cleaner and we fix that bug
> too.
>
> Signed-off-by: Kent Overstreet 
> ---
>  drivers/md/bcache/writeback.c | 24 
>  1 file changed, 16 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index cdde0f32f0..08a52db38b 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -359,11 +359,13 @@ next:
> }
>  }
>
> +/*
> + * Returns true if we scanned the entire disk
> + */
>  static bool refill_dirty(struct cached_dev *dc)
>  {
> struct keybuf *buf = &dc->writeback_keys;
> -   struct bkey end = KEY(dc->disk.id, MAX_KEY_OFFSET, 0);
> -   bool searched_from_start = false;
> +   struct bkey start_pos, end = KEY(dc->disk.id, MAX_KEY_OFFSET, 0);
>
> if (dc->partial_stripes_expensive) {
> refill_full_stripes(dc);
> @@ -371,14 +373,20 @@ static bool refill_dirty(struct cach
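
Since the quoted patch is cut off above, here is a small standalone model
(my own sketch under simplified assumptions, not Kent's code) of the idea
being discussed: scan from last_scanned to the end of the keyspace and, if
the buffer is not full, wrap around and scan from the start up to where the
scan began, so the whole disk is always covered:

#include <stdbool.h>
#include <stdio.h>

#define KEYSPACE	16	/* pretend the disk has 16 "keys" */
#define BUF_SIZE	4	/* how many dirty keys we refill at a time */

static bool dirty[KEYSPACE];	/* which keys are dirty */
static int last_scanned;	/* persists between refills, like buf->last_scanned */

/* Scan [from, to) and collect dirty keys; returns how many are in buf. */
static int scan(int from, int to, int *buf, int have)
{
	for (int i = from; i < to && have < BUF_SIZE; i++) {
		if (dirty[i]) {
			buf[have++] = i;
			dirty[i] = false;
		}
		last_scanned = i + 1;
	}
	return have;
}

/* Returns the number of dirty keys collected into buf. */
static int refill_dirty(int *buf)
{
	int start_pos = last_scanned;
	int n = scan(start_pos, KEYSPACE, buf, 0);

	if (n < BUF_SIZE) {
		/* Wrap: only rescan up to where we started. */
		last_scanned = 0;
		n = scan(0, start_pos, buf, n);
	}
	return n;
}

int main(void)
{
	int buf[BUF_SIZE];

	dirty[2] = dirty[5] = dirty[14] = true;
	last_scanned = 10;	/* a previous scan left us in the middle */

	int n = refill_dirty(buf);
	printf("collected %d dirty keys:", n);
	for (int i = 0; i < n; i++)
		printf(" %d", buf[i]);
	printf("\n");		/* expected: 14 2 5 */
	return 0;
}

The second pass stopping at start_pos is the efficiency point Kent raises
above: after wrapping, there is no need to rescan the region already covered
by the first pass.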