[dm-devel] bcache gets stuck flushing writeback cache when used in combination with LUKS/dm-crypt and non-default bucket size

2016-05-08 Thread James Johnston
Hi,

[1.] One line summary of the problem:

bcache gets stuck flushing writeback cache when used in combination with
LUKS/dm-crypt and non-default bucket size

[2.] Full description of the problem/report:

I've run into a problem where the bcache writeback cache can't be flushed to
disk when the backing device is a LUKS / dm-crypt device and the cache set has
a non-default bucket size.  Basically, only a few megabytes will be flushed to
disk, and then it gets stuck.  Stuck means that the bcache writeback task
thrashes the disk by constantly reading hundreds of MB/second from the cache set
in an infinite loop, while not actually progressing (dirty_data never decreases
beyond a certain point).

I am wondering if anybody else can reproduce this apparent bug?  Apologies for
mailing both device mapper and bcache mailing lists, but I'm not sure where the
bug lies as I've only reproduced it when both are used in combination.

The situation is basically unrecoverable as far as I can tell: if you attempt
to detach the cache set then the cache set disk gets thrashed extra-hard
forever, and it's impossible to actually get the cache set detached.  The only
solution seems to be to back up the data and destroy the volume...

[3.] Keywords (i.e., modules, networking, kernel):

bcache, dm-crypt, LUKS, device mapper, LVM

[4.] Kernel information
[4.1.] Kernel version (from /proc/version):
Linux version 4.6.0-040600rc6-generic (kernel@gloin) (gcc version 5.2.1 
20151010 (Ubuntu 5.2.1-22ubuntu2) ) #201605012031 SMP Mon May 2 00:33:26 UTC 
2016

[7.] A small shell script or example program which triggers the
 problem (if possible)

Here are the steps I used to reproduce:

1.  Set up an Ubuntu 16.04 virtual machine in VMware with three SATA hard
drives.  Ubuntu was installed with default settings, except that: (1) guided
partitioning used with NO LVM or dm-crypt, (2) OpenSSH server installed.
First SATA drive has operating system installation.  Second SATA drive is
used for bcache cache set.  Third SATA drive has dm-crypt/LUKS + bcache
backing device.  Note that all drives have 512 byte physical sectors.  Also,
all virtual drives are backed by a single physical SSD with 512 byte
sectors. (i.e. not advanced format)

2.  Ubuntu was updated to the latest packages as of 2016-05-08.  The problem
reproduces with both distribution kernel 4.4.0-22-generic and also mainline
kernel 4.6.0-040600rc6-generic distributed by Ubuntu kernel team.  Installed
bcache-tools package was 1.0.8-2.  Installed cryptsetup-bin package was
2:1.6.6-5ubuntu2.

3.  Set up the cache set, dm-crypt, and backing device:

sudo -s
# Make cache set on second drive
# IMPORTANT:  Problem does not occur if I omit the --bucket parameter.
make-bcache --bucket 2M -C /dev/sdb
# Set up LUKS/dm-crypt on the third drive.
# IMPORTANT:  Problem does not occur if I omit the dm-crypt layer.
cryptsetup luksFormat /dev/sdc
cryptsetup open --type luks /dev/sdc backCrypt
# Make bcache backing device & enable writeback
make-bcache -B /dev/mapper/backCrypt
bcache-super-show /dev/sdb | grep cset.uuid | \
cut -f 3 > /sys/block/bcache0/bcache/attach
echo writeback > /sys/block/bcache0/bcache/cache_mode

4.  Finally, this is the kill sequence to bring the system to its knees:

sudo -s
cd /sys/block/bcache0/bcache
echo 0 > sequential_cutoff
# Verify that the cache is attached (i.e. does not say "no cache").  It should
# say that it's clean since we haven't written anything yet.
cat state
# Copy some random data.
dd if=/dev/urandom of=/dev/bcache0 bs=1M count=250
# Show current state.  On my system approximately 20 to 25 MB remain in
# writeback cache.
cat dirty_data
cat state
# Detach the cache set.  This will start the cache set disk thrashing.
echo 1 > detach
# After a few moments, confirm that the cache set is not going anywhere.  On
# my system, only a few MB have been flushed as evidenced by a small decrease
# in dirty_data.  State remains dirty.
cat dirty_data
cat state
# At this point, the hypervisor system reports hundreds of MB/second of reads
# to the underlying physical SSD coming from the virtual machine; the hard drive
# light is stuck on...  hypervisor status bar shows the activity is on cache
# set.  No writes seem to be occurring on any disk.

[8.] Environment
[8.1.] Software (add the output of the ver_linux script here)
Linux bcachetest2 4.6.0-040600rc6-generic #201605012031 SMP Mon May 2 00:33:26 
UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Util-linux  2.27.1
Mount   2.27.1
Module-init-tools   22
E2fsprogs   1.42.13
Xfsprogs4.3.0
Linux C Library 2.23
Dynamic linker (ldd)2.23
Linux C++ Library   6.0.21
Procps  3.3.10
Net-tools   1.60
Kbd 1.15.5
Console-tools   1.15.5
Sh-utils8.25
Udev229
Modules Loaded  8250_fintek ablk_helper aesni_intel aes_x86_64 ahci 
async_memcpy 

Re: [dm-devel] bcache gets stuck flushing writeback cache when used in combination with LUKS/dm-crypt and non-default bucket size

2016-05-11 Thread Eric Wheeler

On Sun, 8 May 2016, James Johnston wrote:

> Hi,
> 
> [1.] One line summary of the problem:
> 
> bcache gets stuck flushing writeback cache when used in combination with
> LUKS/dm-crypt and non-default bucket size
> 
> [2.] Full description of the problem/report:
> 
> I've run into a problem where the bcache writeback cache can't be flushed to
> disk when the backing device is a LUKS / dm-crypt device and the cache set has
> a non-default bucket size.  

You might try LUKS atop bcache instead of under it.  This might be 
better for privacy too; otherwise your cached data is unencrypted.

> # Make cache set on second drive
> # IMPORTANT:  Problem does not occur if I omit --bucket parameter.
> make-bcache --bucket 2M -C /dev/sdb

2MB is quite large; maybe it exceeds the 256-bvec limit.  I'm not sure if 
Ming Lei's patch got into 4.6 yet, but try this:
  https://lkml.org/lkml/2016/4/5/1046

and maybe Shaohua Li's patch too:
  http://www.spinics.net/lists/raid/msg51830.html
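For context, a back-of-envelope check (my arithmetic, not something from the
patches): with 4 KiB pages, a single bucket-sized 2 MiB bio would need 512
bio_vecs, which is double the traditional 256-segment cap per bio:

```shell
# Segments needed to map one 2 MiB bucket using 4 KiB pages,
# versus the traditional 256-bvec limit on a single bio.
bucket_bytes=$((2 * 1024 * 1024))
page_bytes=4096
echo $((bucket_bytes / page_bytes))   # 512 segments, i.e. > 256
```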


--
Eric Wheeler

> # Set up LUKS/dm-crypt on the third drive.
> # IMPORTANT:  Problem does not occur if I omit the dm-crypt layer.
> cryptsetup luksFormat /dev/sdc
> cryptsetup open --type luks /dev/sdc backCrypt
> # Make bcache backing device & enable writeback
> make-bcache -B /dev/mapper/backCrypt
> bcache-super-show /dev/sdb | grep cset.uuid | \
> cut -f 3 > /sys/block/bcache0/bcache/attach
> echo writeback > /sys/block/bcache0/bcache/cache_mode
> 
> 4.  Finally, this is the kill sequence to bring the system to its knees:
> 
> sudo -s
> cd /sys/block/bcache0/bcache
> echo 0 > sequential_cutoff
> # Verify that the cache is attached (i.e. does not say "no cache").  It should
> # say that it's clean since we haven't written anything yet.
> cat state
> # Copy some random data.
> dd if=/dev/urandom of=/dev/bcache0 bs=1M count=250
> # Show current state.  On my system approximately 20 to 25 MB remain in
> # writeback cache.
> cat dirty_data
> cat state
> # Detach the cache set.  This will start the cache set disk thrashing.
> echo 1 > detach
> # After a few moments, confirm that the cache set is not going anywhere.  On
> # my system, only a few MB have been flushed as evidenced by a small decrease
> # in dirty_data.  State remains dirty.
> cat dirty_data
> cat state
> # At this point, the hypervisor system reports hundreds of MB/second of reads
> # to the underlying physical SSD coming from the virtual machine; the hard drive
> # light is stuck on...  hypervisor status bar shows the activity is on cache
> # set.  No writes seem to be occurring on any disk.
> 
> [8.] Environment
> [8.1.] Software (add the output of the ver_linux script here)
> Linux bcachetest2 4.6.0-040600rc6-generic #201605012031 SMP Mon May 2 
> 00:33:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
> 
> Util-linux  2.27.1
> Mount   2.27.1
> Module-init-tools   22
> E2fsprogs   1.42.13
> Xfsprogs4.3.0
> Linux C Library 2.23
> Dynamic linker (ldd)2.23
> Linux C++ Library   6.0.21
> Procps  3.3.10
> Net-tools   1.60
> Kbd 1.15.5
> Console-tools   1.15.5
> Sh-utils8.25
> Udev229
> Modules Loaded  8250_fintek ablk_helper aesni_intel aes_x86_64 ahci 
> async_memcpy async_pq async_raid6_recov async_tx async_xor autofs4 btrfs 
> configfs coretemp crc32_pclmul crct10dif_pclmul cryptd drm drm_kms_helper 
> e1000 fb_sys_fops fjes gf128mul ghash_clmulni_intel glue_helper hid 
> hid_generic i2c_piix4 ib_addr ib_cm ib_core ib_iser ib_mad ib_sa input_leds 
> iscsi_tcp iw_cm joydev libahci libcrc32c libiscsi libiscsi_tcp linear lrw 
> mac_hid mptbase mptscsih mptspi multipath nfit parport parport_pc pata_acpi 
> ppdev psmouse raid0 raid10 raid1 raid456 raid6_pq rdma_cm 
> scsi_transport_iscsi scsi_transport_spi serio_raw shpchp syscopyarea 
> sysfillrect sysimgblt ttm usbhid vmw_balloon vmwgfx vmw_vmci 
> vmw_vsock_vmci_transport vsock xor
> 
> [8.2.] Processor information (from /proc/cpuinfo):
> processor   : 0
> vendor_id   : GenuineIntel
> cpu family  : 6
> model   : 42
> model name  : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
> stepping: 7
> microcode   : 0x29
> cpu MHz : 2491.980
> cache size  : 3072 KB
> physical id : 0
> siblings: 1
> core id : 0
> cpu cores   : 1
> apicid  : 0
> initial apicid  : 0
> fpu : yes
> fpu_exception   : yes
> cpuid level : 13
> wp  : yes
> flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
> cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx rdtscp lm 
> constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc 
> aperfmperf eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt 
> tsc_deadline_timer aes xsave avx hypervisor lahf_lm epb tsc_adjust dtherm ida 
> arat pln pts
> bugs:
> bogomips: 4

Re: [dm-devel] bcache gets stuck flushing writeback cache when used in combination with LUKS/dm-crypt and non-default bucket size

2016-05-17 Thread Tim Small
Hi Eric,

On 15/05/16 10:08, Tim Small wrote:
> On 11/05/16 02:38, Eric Wheeler wrote:
>> Ming Lei's patch got in to 4.6 yet, but try this:
>> >   https://lkml.org/lkml/2016/4/5/1046
>> > 
>> > and maybe Shaohua Li's patch too:
>> >   http://www.spinics.net/lists/raid/msg51830.html

> I'll give them both a go...

I tried both of these on 4.6.0-rc7 without change to the symptoms (cache
device continuously read).  Then I tried also disabling
partial_stripes_expensive prior to registering the bcache device as per
your instructions here:

https://lkml.org/lkml/2016/2/1/636

and that seems to have improved things, but not fixed them.

The cache device is 120G, and dirty_data had got up to 55.3G, but has
now dropped down to 44.5G, but isn't going any further...

The cache device is being read at a steady ~270 MB/s, and the backing
device (dm-crypt) being written at the same rate, but the writes aren't
flowing down to the underlying devices (md RAID5, and SATA disks).  I'm
guessing that these writes are being refused/retried, and are maybe
failing due to their size (avgrq-sz showing > 4000 sectors on the
backing device)?  Disabling the partial stripes expensive maybe just
resulted in a few GB of small writes succeeding?
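As a quick sanity check on that guess: iostat(1) reports avgrq-sz in 512-byte
sectors, so ~4056 sectors works out to roughly one full 2 MiB bucket per
request:

```shell
# avgrq-sz is in 512-byte sectors; convert ~4056 sectors to bytes.
echo $((4056 * 512))      # 2076672 bytes, just under 2 MiB (2097152)
```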

# iostat -y -d 2 -x -p /dev/sdf /dev/dm-0 /dev/md2 /dev/bcache0
Linux 4.6.0-rc7+  16/05/16_x86_64_(2 CPU)

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdf   0.00 0.00  413.000.00 281422.00 0.00
1362.82   143.18  338.31  338.310.00   2.42 100.00
sdf1  0.00 0.000.000.00 0.00 0.00
0.00 0.000.000.000.00   0.00   0.00
sdf2  0.00 0.000.000.00 0.00 0.00
0.00 0.000.000.000.00   0.00   0.00
sdf3  0.00 0.00  413.000.00 281422.00 0.00
1362.82   143.18  338.31  338.310.00   2.42 100.00
dm-0  0.00 0.000.00  138.50 0.00 280912.00
4056.49 0.000.010.000.01   0.01   0.20
md2   0.00 0.000.000.00 0.00 0.00
0.00 0.000.000.000.00   0.00   0.00
bcache0   0.00 0.000.000.00 0.00 0.00
0.00 0.000.000.000.00   0.00   0.00

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdf   0.00 6.00  412.001.50 281806.0032.00
1363.18   135.19  314.09  314.78  124.00   2.42 100.00
sdf1  0.00 6.000.001.50 0.0032.00
42.67 4.10  124.000.00  124.00 388.00  58.20
sdf2  0.00 0.000.000.00 0.00 0.00
0.00 0.000.000.000.00   0.00   0.00
sdf3  0.00 0.00  412.000.00 281806.00 0.00
1367.99   131.10  314.78  314.780.00   2.43 100.00
dm-0  0.00 0.000.00  138.50 0.00 282388.00
4077.81 0.000.010.000.01   0.01   0.20
md2   0.00 0.000.000.00 0.00 0.00
0.00 0.000.000.000.00   0.00   0.00
bcache0   0.00 0.000.000.00 0.00 0.00
0.00 0.000.000.000.00   0.00   0.00

Cheers,

Tim.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] bcache gets stuck flushing writeback cache when used in combination with LUKS/dm-crypt and non-default bucket size

2016-05-17 Thread Tim Small
On 16/05/16 14:02, Tim Small wrote:
> # iostat -y -d 2 -x -p /dev/sdf /dev/dm-0 /dev/md2 /dev/bcache0

... and then mangled the word-wrapping.  Try again:

Here's a typical hand-edited excerpt from:

iostat -d 2 -x -y -m -p /dev/sdf /dev/dm-0 /dev/md2 /dev/bcache0

...

Device:r/s w/srMB/swMB/s avgrq-sz avgqu-sz   await
sdf 396.50   19.50   272.02 0.25  1340.38   138.44  346.09
sdf3397.000.00   272.52 0.00  1405.83   130.05  338.40
dm-0  0.00  149.00 0.00   271.29  3728.81 0.010.04
md2   0.000.00 0.00 0.00 0.00 0.000.00
bcache0   0.000.00 0.00 0.00 0.00 0.000.00

where:

sdf is the SSD (bcache cache device is sdf3)
dm-0 is dm-crypt backing device (bcache backing store)
md2 is the underlying device for dm-crypt
bcache0 is the bcache device.

According to the iostat manual page:

"avgrq-sz The average size (in sectors) of the requests that were issued
to the device."

dm-0 is described like this in the output of 'dmsetup table':

encryptedstore01: 0 46879675392 crypt aes-xts-plain64  0 9:2 3072 1 allow_discards

Tim.



Re: [dm-devel] bcache gets stuck flushing writeback cache when used in combination with LUKS/dm-crypt and non-default bucket size

2016-05-17 Thread Tim Small
On 08/05/16 19:39, James Johnston wrote:
> I've run into a problem where the bcache writeback cache can't be flushed to
> disk when the backing device is a LUKS / dm-crypt device and the cache set has
> a non-default bucket size.  Basically, only a few megabytes will be flushed to
> disk, and then it gets stuck.  Stuck means that the bcache writeback task
> thrashes the disk by constantly reading hundreds of MB/second from the cache 
> set
> in an infinite loop, while not actually progressing (dirty_data never 
> decreases
> beyond a certain point).

> [...]

> The situation is basically unrecoverable as far as I can tell: if you attempt
> to detach the cache set then the cache set disk gets thrashed extra-hard
> forever, and it's impossible to actually get the cache set detached.  The only
> solution seems to be to back up the data and destroy the volume...

You can boot an older kernel to flush the device without destroying it
(I'm guessing that's because older kernels split down the big requests
which are failing on the 4.4 kernel).  Once flushed you could put the
cache into writethrough mode, or use a smaller bucket size.
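A rough sketch of that recovery sequence, assuming the device comes up as
bcache0 under the older kernel (standard bcache sysfs paths; untested on my
side):

```shell
# After booting a pre-4.4 kernel, stop accumulating new dirty data,
# then let the existing writeback cache drain.
cd /sys/block/bcache0/bcache
echo writethrough > cache_mode
cat dirty_data     # poll until this reaches 0
cat state          # should eventually report "clean"
```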

Tim.



Re: [dm-devel] bcache gets stuck flushing writeback cache when used in combination with LUKS/dm-crypt and non-default bucket size

2016-05-18 Thread James Johnston
> On Sun, 8 May 2016, James Johnston wrote:
> 
> > Hi,
> >
> > [1.] One line summary of the problem:
> >
> > bcache gets stuck flushing writeback cache when used in combination with
> > LUKS/dm-crypt and non-default bucket size
> >
> > [2.] Full description of the problem/report:
> >
> > I've run into a problem where the bcache writeback cache can't be flushed to
> > disk when the backing device is a LUKS / dm-crypt device and the cache set 
> > has
> > a non-default bucket size.
> 
> You might try LUKS atop of bcache instead of under it.  This might be
> better for privacy too, otherwise your cached data is unencrypted.

Only in this test case; on my real setup, the cache device is also layered on
top of LUKS.  (On both backing & cache, it's LUKS --> LVM2 --> bcache.  This
gives me flexibility to adjust volumes without messing with the encryption, or
having more encryption devices than really needed.  At any rate, I expect this
setup to at least work...)

> 
> > # Make cache set on second drive
> > # IMPORTANT:  Problem does not occur if I omit --bucket parameter.
> > make-bcache --bucket 2M -C /dev/sdb
> 
> 2MB is quite large, maybe it exceeds the 256-bvec limit.  I'm not sure if
> Ming Lei's patch got in to 4.6 yet, but try this:
>   https://lkml.org/lkml/2016/4/5/1046
> 
> and maybe Shaohua Li's patch too:
>   http://www.spinics.net/lists/raid/msg51830.html

Trying these is still on my TODO list (thus the belated reply here) but based
on the responses from Tim Small I'm doubtful this will fix anything, as it
sounds like he has the same problem (symptoms sound exactly the same) and he
says the patches didn't help.

Like Tim, I also chose a large bucket size because the manual page told me to.
Based on the high-level description of bcache and my knowledge of how flash
works, it certainly sounds necessary.

Perhaps the union of people who read manpages and people who use LUKS like
this is very small. :)

James




Re: [dm-devel] bcache gets stuck flushing writeback cache when used in combination with LUKS/dm-crypt and non-default bucket size

2016-05-20 Thread James Johnston
> On Mon, 16 May 2016, Tim Small wrote:
> 
> > On 08/05/16 19:39, James Johnston wrote:
> > > I've run into a problem where the bcache writeback cache can't be flushed 
> > > to
> > > disk when the backing device is a LUKS / dm-crypt device and the cache 
> > > set has
> > > a non-default bucket size.  Basically, only a few megabytes will be 
> > > flushed to
> > > disk, and then it gets stuck.  Stuck means that the bcache writeback task
> > > thrashes the disk by constantly reading hundreds of MB/second from the 
> > > cache set
> > > in an infinite loop, while not actually progressing (dirty_data never 
> > > decreases
> > > beyond a certain point).
> >
> > > [...]
> >
> > > The situation is basically unrecoverable as far as I can tell: if you 
> > > attempt
> > > to detach the cache set then the cache set disk gets thrashed extra-hard
> > > forever, and it's impossible to actually get the cache set detached.  The 
> > > only
> > > solution seems to be to back up the data and destroy the volume...
> >
> > You can boot an older kernel to flush the device without destroying it
> > (I'm guessing that's because older kernels split down the big requests
> > which are failing on the 4.4 kernel).  Once flushed you could put the
> > cache into writethrough mode, or use a smaller bucket size.
> 
> Indeed, can someone test 4.1.y and see if the problem persists with a 2M
> bucket size?  (If someone has already tested 4.1, then apologies as I've
> not yet seen that report.)
> 
> If 4.1 works, then I think a bisect is in order.  Such a bisect would at
> least highlight the problem and might indicate a (hopefully trivial) fix.

To help narrow this down, I tested the following generic pre-compiled mainline
kernels on Ubuntu 15.10:

 * WORKS:  http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.3.6-wily/
 * DOES NOT WORK:  http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4-rc1+cod1-wily/

I also tried the default & latest distribution-provided 4.2 kernel.  It worked.
This one also worked:

 * WORKS:  http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.2.8-wily/

So it seems to me that it is a regression from the 4.3.6 kernel to any 4.4
kernel.  That should help save time with bisection...
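That narrowing translates into a bisect roughly like this (a sketch assuming a
mainline git checkout; the good/bad tags follow from the kernel tests above):

```shell
# Bisect the regression between the last known-good and first known-bad kernels.
git bisect start
git bisect bad v4.4-rc1     # any 4.4 kernel reproduces the stuck writeback
git bisect good v4.3        # 4.3.x flushes the cache without incident
# Then, for each candidate the bisect checks out: build, boot, run the
# reproducer, and mark "git bisect good" or "git bisect bad" until git
# reports the first bad commit.
```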

James




Re: [dm-devel] bcache gets stuck flushing writeback cache when used in combination with LUKS/dm-crypt and non-default bucket size

2016-05-20 Thread Eric Wheeler
On Mon, 16 May 2016, Tim Small wrote:
> Hi Eric,
> 
> On 15/05/16 10:08, Tim Small wrote:
> > On 11/05/16 02:38, Eric Wheeler wrote:
> >> Ming Lei's patch got in to 4.6 yet, but try this:
> >> >   https://lkml.org/lkml/2016/4/5/1046
> >> > 
> >> > and maybe Shaohua Li's patch too:
> >> >   http://www.spinics.net/lists/raid/msg51830.html
> 
> > I'll give them both a go...
> 
> I tried both of these on 4.6.0-rc7 without change to the symptoms (cache
> device continuously read).  Then I tried also disabling
> partial_stripes_expensive prior to registering the bcache device as per
> your instructions here:
> 
> https://lkml.org/lkml/2016/2/1/636
> 
> and that seems to have improved things, but not fixed them.

What is your /sys/class/X/queue/limits/io_opt value?  (Requires the sysfs 
patch.)

Caution: make these changes at your own risk.  I have no idea what other 
side effects there might be when modifying io_opt and dc->disk.stripe_size, 
so be sure this is a test machine.

You could update my sysfs limits patch to set QL_SYSFS_RW for io_opt and 
shrink it or set it to zero before registering.  

or,

bcache sets the disk.stripe_size at initialization, so you could just 
force this to 0 in cached_dev_init() and see if it fixes that:

-bcache/super.c:1138    dc->disk.stripe_size = q->limits.io_opt >> 9;
+bcache/super.c:1138    dc->disk.stripe_size = 0;

It then uses stripe_size in the writeback code:

writeback.c:299:    stripe_offset = offset & (d->stripe_size - 1);
writeback.c:303:    d->stripe_size - stripe_offset);
writeback.c:313:    if (sectors_dirty == d->stripe_size)
writeback.c:357:    stripe * dc->disk.stripe_size, 0);
writeback.c:361:    next_stripe * dc->disk.stripe_size, 0),
writeback.h:20:     do_div(offset, d->stripe_size);
writeback.h:34:     if (nr_sectors <= dc->disk.stripe_size)
writeback.h:37:     nr_sectors -= dc->disk.stripe_size;

Speculation only, but I've always wondered if there are issues when io_opt != 0.
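For anyone without the sysfs-limits patch applied, the stock queue attribute
exposes the same value (optimal_io_size is queue->limits.io_opt in bytes; the
dm-0 and bcache0 device names here are assumptions for the devices in
question):

```shell
# optimal_io_size reports io_opt in bytes; 0 means no preferred I/O size.
cat /sys/block/dm-0/queue/optimal_io_size
cat /sys/block/bcache0/queue/optimal_io_size
```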

Are you able to test one or the other or both methods?

--
Eric Wheeler


> 
> The cache device is 120G, and dirty_data had got up to 55.3G, but has
> now dropped down to 44.5G, but isn't going any further...
> 
> The cache device is being read at a steady ~270 MB/s, and the backing
> device (dm-crypt) being written at the same rate, but the writes aren't
> flowing down to the underlying devices (md RAID5, and SATA disks).  I'm
> guessing that these writes are being refused/retried, and are maybe
> failing due to their size (avgrq-sz showing > 4000 sectors on the
> backing device)?  Disabling the partial stripes expensive maybe just
> resulted in a few GB of small writes succeeding?
> 
> # iostat -y -d 2 -x -p /dev/sdf /dev/dm-0 /dev/md2 /dev/bcache0
> Linux 4.6.0-rc7+  16/05/16_x86_64_(2 CPU)
> 
> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdf   0.00 0.00  413.000.00 281422.00 0.00
> 1362.82   143.18  338.31  338.310.00   2.42 100.00
> sdf1  0.00 0.000.000.00 0.00 0.00
> 0.00 0.000.000.000.00   0.00   0.00
> sdf2  0.00 0.000.000.00 0.00 0.00
> 0.00 0.000.000.000.00   0.00   0.00
> sdf3  0.00 0.00  413.000.00 281422.00 0.00
> 1362.82   143.18  338.31  338.310.00   2.42 100.00
> dm-0  0.00 0.000.00  138.50 0.00 280912.00
> 4056.49 0.000.010.000.01   0.01   0.20
> md2   0.00 0.000.000.00 0.00 0.00
> 0.00 0.000.000.000.00   0.00   0.00
> bcache0   0.00 0.000.000.00 0.00 0.00
> 0.00 0.000.000.000.00   0.00   0.00
> 
> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdf   0.00 6.00  412.001.50 281806.0032.00
> 1363.18   135.19  314.09  314.78  124.00   2.42 100.00
> sdf1  0.00 6.000.001.50 0.0032.00
> 42.67 4.10  124.000.00  124.00 388.00  58.20
> sdf2  0.00 0.000.000.00 0.00 0.00
> 0.00 0.000.000.000.00   0.00   0.00
> sdf3  0.00 0.00  412.000.00 281806.00 0.00
> 1367.99   131.10  314.78  314.780.00   2.43 100.00
> dm-0  0.00 0.000.00  138.50 0.00 282388.00
> 4077.81 0.000.010.000.01   0.01   0.20
> md2   0.00 0.000.000.00 0.00 0.00
> 0.00 0.000.000.000.00   0.00   0.00
> bcache0   0.00 0.000.000.00 0.00 0.00
> 0.00 0.000.000.000.00   0.00   0.00
> 
> Cheers,
> 
> Tim.
> 


Re: [dm-devel] bcache gets stuck flushing writeback cache when used in combination with LUKS/dm-crypt and non-default bucket size

2016-05-20 Thread Eric Wheeler

On Mon, 16 May 2016, Tim Small wrote:

> On 08/05/16 19:39, James Johnston wrote:
> > I've run into a problem where the bcache writeback cache can't be flushed to
> > disk when the backing device is a LUKS / dm-crypt device and the cache set 
> > has
> > a non-default bucket size.  Basically, only a few megabytes will be flushed 
> > to
> > disk, and then it gets stuck.  Stuck means that the bcache writeback task
> > thrashes the disk by constantly reading hundreds of MB/second from the 
> > cache set
> > in an infinite loop, while not actually progressing (dirty_data never 
> > decreases
> > beyond a certain point).
> 
> > [...]
> 
> > The situation is basically unrecoverable as far as I can tell: if you 
> > attempt
> > to detach the cache set then the cache set disk gets thrashed extra-hard
> > forever, and it's impossible to actually get the cache set detached.  The 
> > only
> > solution seems to be to back up the data and destroy the volume...
> 
> You can boot an older kernel to flush the device without destroying it
> (I'm guessing that's because older kernels split down the big requests
> which are failing on the 4.4 kernel).  Once flushed you could put the
> cache into writethrough mode, or use a smaller bucket size.

Indeed, can someone test 4.1.y and see if the problem persists with a 2M 
bucket size?  (If someone has already tested 4.1, then apologies as I've 
not yet seen that report.)

If 4.1 works, then I think a bisect is in order.  Such a bisect would at 
least highlight the problem and might indicate a (hopefully trivial) fix.

--
Eric Wheeler



> 
> Tim.
> 



Re: [dm-devel] bcache gets stuck flushing writeback cache when used in combination with LUKS/dm-crypt and non-default bucket size

2016-05-21 Thread James Johnston
> On Fri, 20 May 2016, James Johnston wrote:
> 
> > > On Mon, 16 May 2016, Tim Small wrote:
> > >
> > > > On 08/05/16 19:39, James Johnston wrote:
> > > > > I've run into a problem where the bcache writeback cache can't be 
> > > > > flushed to
> > > > > disk when the backing device is a LUKS / dm-crypt device and the 
> > > > > cache set has
> > > > > a non-default bucket size.  Basically, only a few megabytes will be 
> > > > > flushed to
> > > > > disk, and then it gets stuck.  Stuck means that the bcache writeback 
> > > > > task
> > > > > thrashes the disk by constantly reading hundreds of MB/second from 
> > > > > the cache set
> > > > > in an infinite loop, while not actually progressing (dirty_data never 
> > > > > decreases
> > > > > beyond a certain point).
> > > >
> > > > > [...]
> > > >
> > > > > The situation is basically unrecoverable as far as I can tell: if you 
> > > > > attempt
> > > > > to detach the cache set then the cache set disk gets thrashed 
> > > > > extra-hard
> > > > > forever, and it's impossible to actually get the cache set detached.  
> > > > > The only
> > > > > solution seems to be to back up the data and destroy the volume...
> > > >
> > > > You can boot an older kernel to flush the device without destroying it
> > > > (I'm guessing that's because older kernels split down the big requests
> > > > which are failing on the 4.4 kernel).  Once flushed you could put the
> > > > cache into writethrough mode, or use a smaller bucket size.
> > >
> > > Indeed, can someone test 4.1.y and see if the problem persists with a 2M
> > > bucket size?  (If someone has already tested 4.1, then apologies as I've
> > > not yet seen that report.)
> > >
> > > If 4.1 works, then I think a bisect is in order.  Such a bisect would at
> > > least highlight the problem and might indicate a (hopefully trivial) fix.
> >
> > To help narrow this down, I tested the following generic pre-compiled 
> > mainline kernels
> > on Ubuntu 15.10:
> >
> >  * WORKS:  http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.3.6-wily/
> >  * DOES NOT WORK:  
> > http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4-rc1+cod1-wily/
> >
> > I also tried the default & latest distribution-provided 4.2 kernel.  It 
> > worked.
> > This one also worked:
> >
> >  * WORKS:  http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.2.8-wily/
> >
> > So it seems to me that it is a regression from 4.3.6 kernel to any 4.4 
> > kernel.  That
> > should help save time with bisection...
> 
> Below is the patchlist for md and block that might help with a place to
> start.  Are there any other places in the Linux tree where we should watch
> for changes?
> 
> I'm wondering if it might be in dm-4.4-changes since this is dm-crypt
> related, but it could be ac322de which was quite large.
> 
> James or Tim,
> 
> Can you try building ac322de?  If that produces the problem, then there
> are only 3 more to try (unless this was actually a problem in 4.3 which
> was fixed in 4.3.y, but hopefully that isn't so).
> 
> ccf21b6 is probably the next to test to rule out Neil's big md patch,
> which Linus abbreviated in the commit log so it must be quite long.  OTOH,
> if dm-4.4-changes works, then I'm not sure what commit might produce the
> problem, because the more recent commits are not obviously relevant to the
> issue.

So I decided to go ahead and bisect it today.  Looks like the bad commit is
this one.  The commit prior flushed the bcache writeback cache without
incident; this one does not and I guess caused this bcache regression.
(FWIW ac322de came up during bisection, and tested good.)

johnstonj@kernel-build:~/linux$ git bisect bad
dbba42d8a9ebddcc1c1412e8457f79f3cb6ef6e7 is the first bad commit
commit dbba42d8a9ebddcc1c1412e8457f79f3cb6ef6e7
Author: Mikulas Patocka 
Date:   Wed Oct 21 16:34:20 2015 -0400

dm: eliminate unused "bioset" process for each bio-based DM device

Commit 54efd50bfd873e2dbf784e0b21a8027ba4299a3e ("block: make
generic_make_request handle arbitrarily sized bios") makes it possible
for block devices to process large bios.  In doing so that commit
allocates a new queue->bio_split bioset for each block device, this
bioset is used for allocating bios when the driver needs to split large
bios.

Each bioset allocates a workqueue process, thus the above commit
increases the number of processes allocated per block device.

DM doesn't need the queue->bio_split bioset, thus we can deallocate it.
This reduces the number of allocated processes per bio-based DM device
from 3 to 2.  Also remove the call to blk_queue_split(), it is not
needed because DM does its own splitting.

Signed-off-by: Mikulas Patocka 
Signed-off-by: Mike Snitzer 

The patch for this commit is very brief; reproduced here:

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 9555843..64b50b7 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1763,8 +1763,6 @@ static void dm_make_request(struct request_queue

Re: [dm-devel] bcache gets stuck flushing writeback cache when used in combination with LUKS/dm-crypt and non-default bucket size

2016-05-23 Thread 'Eric Wheeler'
On Fri, 20 May 2016, James Johnston wrote:

> > On Mon, 16 May 2016, Tim Small wrote:
> > 
> > > On 08/05/16 19:39, James Johnston wrote:
> > > > I've run into a problem where the bcache writeback cache can't be
> > > > flushed to disk when the backing device is a LUKS / dm-crypt device
> > > > and the cache set has a non-default bucket size.  Basically, only a
> > > > few megabytes will be flushed to disk, and then it gets stuck.  Stuck
> > > > means that the bcache writeback task thrashes the disk by constantly
> > > > reading hundreds of MB/second from the cache set in an infinite loop,
> > > > while not actually progressing (dirty_data never decreases beyond a
> > > > certain point).
> > >
> > > > [...]
> > >
> > > > The situation is basically unrecoverable as far as I can tell: if you
> > > > attempt to detach the cache set then the cache set disk gets thrashed
> > > > extra-hard forever, and it's impossible to actually get the cache set
> > > > detached.  The only solution seems to be to back up the data and
> > > > destroy the volume...
> > >
> > > You can boot an older kernel to flush the device without destroying it
> > > (I'm guessing that's because older kernels split down the big requests
> > > which are failing on the 4.4 kernel).  Once flushed you could put the
> > > cache into writethrough mode, or use a smaller bucket size.
> > 
> > Indeed, can someone test 4.1.y and see if the problem persists with a 2M
> > bucket size?  (If someone has already tested 4.1, then apologies as I've
> > not yet seen that report.)
> > 
> > If 4.1 works, then I think a bisect is in order.  Such a bisect would at
> > least highlight the problem and might indicate a (hopefully trivial) fix.
> 
> To help narrow this down, I tested the following generic pre-compiled
> mainline kernels on Ubuntu 15.10:
> 
>  * WORKS:  http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.3.6-wily/
>  * DOES NOT WORK:  http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4-rc1+cod1-wily/
> 
> I also tried the default & latest distribution-provided 4.2 kernel.  It
> worked.  This one also worked:
> 
>  * WORKS:  http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.2.8-wily/
> 
> So it seems to me that it is a regression from the 4.3.6 kernel to any
> 4.4 kernel.  That should help save time with bisection...

Below is the patchlist for md and block that might help with a place to 
start.  Are there any other places in the Linux tree where we should watch 
for changes?

I'm wondering if it might be in dm-4.4-changes since this is dm-crypt
related, but it could be ac322de which was quite large.

James or Tim,

Can you try building ac322de?  If that produces the problem, then there 
are only 3 more to try (unless this was actually a problem in 4.3 which 
was fixed in 4.3.y, but hopefully that isn't so). 

ccf21b6 is probably the next to test to rule out Neil's big md patch, 
which Linus abbreviated in the commit log, so it must be quite long.  OTOH, 
if dm-4.4-changes works, then I'm not sure which commit might produce the 
problem, because the rest are not obviously relevant to the issue, being 
more recent.  

-Eric

]# git log --oneline v4.3~1..v4.4-rc1 drivers/md/ block/ Makefile | egrep -v 'md-cluster|raid5|blk-mq'

 8005c49 Linux 4.4-rc1
 ccc2600 block: fix blk-core.c kernel-doc warning
 c34e6e0 Merge branch 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild
 3419b45 Merge branch 'for-4.4/io-poll' of git://git.kernel.dk/linux-block
 3934bbc Merge tag 'md/4.4-rc0-fix' of git://neil.brown.name/md
 ad804a0 Merge branch 'akpm' (patches from Andrew)
 75021d2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
 05229be block: add block polling support
 dece163 block: change ->make_request_fn() and users to return a queue cookie
 8639b46 pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode
 71baba4 mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM
 d0164ad mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
 8d090f4 bcache: Really show state of work pending bit
 933425fb Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
 5ebe0ee Merge tag 'docs-for-linus' of git://git.lwn.net/linux
 69234ac Merge branch 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
 e0700ce Merge tag 'dm-4.4-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
 ac322de Merge tag 'md/4.4' of git://neil.brown.name/md
 ccf21b6 Merge branch 'for-4.4/reservations' of git://git.kernel.dk/linux-block
 527d152 Merge branch 'for-4.4/integrity' of git://git.kernel.dk/linux-block
 d9734e0 Merge branch 'for-4.4/core' of git://git.kernel.dk/linux-block
 6a13feb Linux 4.3


--
Eric Wheeler



> 
> James
> 
> 
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel