Re: A review of dm-writeboost

2013-10-20 Thread Dave Chinner
On Sat, Oct 19, 2013 at 07:59:55PM +0900, Akira Hayakawa wrote:
> Dave,
> 
> # -EIO returned corrupts XFS
> I turned on
> lockdep, frame pointers, and XFS debugging,
> and also moved from rc1 to 3.12.0-rc5.
> 
> With these changes,
> the problem we discussed in the previous mails
> *never* reproduces.
> 
> However, if I turn off only lockdep,
> it hangs up when I set blockup to 1 and then back to 0.

Which indicates that the corruption is likely to be caused by a race
condition, and that adding lockdep slows things down enough to
change the timing and hence not hit the race condition...

e.g. the use after free that causes the memory corruption now occurs
after the critical point where XFS can get stuck on it.

> An underlying device that once went dead and then revives
> may confuse the filesystem, but
> this does not look related to locking.
> This problem may be reproducible using dm-flakey.

I use dm-flakey quite a bit, and I haven't seen this.

> This behavior is not reproducible with ext4.
> 
> 
> root@Lyle:/home/akira# virsh console Hercules
> Connected to domain Hercules
> Escape character is ^]
> [  143.042980] device-mapper: writeboost: info@modulator_proc() reactivated 
> after blockup
> [  143.042999] device-mapper: writeboost: info@dm_safe_io_internal() 
> reactivated after blockup
> [  143.043006] device-mapper: writeboost: info@migrate_proc() reactivated 
> after blockup
> [  143.045328] XFS: metadata I/O error: block 0x300d0f ("xlog_iodone") error 
> 5 numblks 64
> [  143.045333] XFS: xfs_do_force_shutdown(0x2) called from line 1161 of file 
> fs/xfs/xfs_log.c.  Return address = 0xa0430f9c
> [  143.045335] XFS: Log I/O Error Detected.  Shutting down filesystem
> [  143.045335] XFS: Please umount the filesystem and rectify the problem(s)
> [  143.045338] XFS: Assertion failed: atomic_read(&iclog->ic_refcnt) == 0, 
> file: fs/xfs/xfs_log.c, line: 2751
.
> [  143.045465] Workqueue: xfslogd xfs_buf_iodone_work [xfs]
.
> [  143.045558]  [] xlog_state_done_syncing+0x6d/0xe4 [xfs]
> [  143.045590]  [] xlog_iodone+0xd0/0xd9 [xfs]
> [  143.045609]  [] xfs_buf_iodone_work+0x57/0xa2 [xfs]
> [  143.045627]  [] process_one_work+0x18b/0x297
> [  143.045631]  [] worker_thread+0x12e/0x1fb
> [  143.045634]  [] ? rescuer_thread+0x268/0x268
> [  143.045638]  [] kthread+0x88/0x90
> [  143.045641]  [] ? __kthread_parkme+0x60/0x60

So that has got through xlog_iodone() and has found that the
reference count of the iclog was not zero when it went to run the
log IO completion callbacks.

We asserted that the reference count was zero before issuing the IO,
and we only ever increment the reference count when the iclog is in
an active state. We cannot increment the reference count while the
log is under IO because the state of the iclog is "syncing", not
"active".

Once the log is shut down, the iclog state goes to
XLOG_STATE_IOERROR and never goes out of that state. The assert just
prior to the one that tripped above indicates that we are either in
an ACTIVE state or IOERROR:

	ASSERT(iclog->ic_state == XLOG_STATE_SYNCING ||
	       iclog->ic_state == XLOG_STATE_IOERROR);
>>	ASSERT(atomic_read(&iclog->ic_refcnt) == 0);

It was the second assert that failed, and hence it's clear that
in either state the reference count cannot be incremented until the
IO is fully completed and the iclog has transitioned back to the active state.

> [  143.045333] XFS: xfs_do_force_shutdown(0x2) called from line 1161 of file 
> fs/xfs/xfs_log.c.  Return address = 0xa0430f9c

This indicates that the shutdown was called from xlog_iodone() as a
result of an IO error on the iclog buffer, and the call to
xlog_state_done_syncing() happens 5 lines of code later, which
immediately assert fails with a corrupt iclog state.

Because we shutdown with SHUTDOWN_LOG_IO_ERROR set, we call into
xfs_log_force_umount() with logerror = true. This first call ends up
setting the log flag XLOG_IO_ERROR, then calling
xlog_state_ioerror() which sets ic_state on all iclogs to
XLOG_STATE_IOERROR.

The shutdown then runs xlog_state_do_callback() which aborts the
completions on all the iclogs that have callbacks attached, and
wakes anyone waiting on log space or log forces so they can get
errors returned to them.

We now return to xlog_iodone() in a shutdown state, with all the
iclog buffers in ic_state = XLOG_STATE_IOERROR. So, there is
absolutely no opportunity for anyone to take a new reference to the
iclog in the 10 *microseconds* between the IO error being detected,
the shutdown being processed and the call to
xlog_state_done_syncing() where the assert fails. At this point, I
cannot see how this can occur in the XFS code.

Indeed, I can trigger this error path easily under heavy load just
using the godown utility in xfstests:

[1044535.986040] XFS (vdc): Mounting Filesystem
[1044536.040630] XFS (vdc): Ending clean mount
[1044536.059299] XFS: Clearing xfsstats
# src/xfstests-dev/src/godown -v 

Re: A review of dm-writeboost

2013-10-19 Thread Akira Hayakawa
Dave,

# -EIO returned corrupts XFS
I turned on
lockdep, frame pointers, and XFS debugging,
and also moved from rc1 to 3.12.0-rc5.

With these changes,
the problem we discussed in the previous mails
*never* reproduces.

However, if I turn off only lockdep,
it hangs up when I set blockup to 1 and then back to 0.
An underlying device that once went dead and then revives
may confuse the filesystem, but
this does not look related to locking.
This problem may be reproducible using dm-flakey.

This behavior is not reproducible with ext4.


root@Lyle:/home/akira# virsh console Hercules
Connected to domain Hercules
Escape character is ^]
[  143.042980] device-mapper: writeboost: info@modulator_proc() reactivated 
after blockup
[  143.042999] device-mapper: writeboost: info@dm_safe_io_internal() 
reactivated after blockup
[  143.043006] device-mapper: writeboost: info@migrate_proc() reactivated after 
blockup
[  143.045328] XFS: metadata I/O error: block 0x300d0f ("xlog_iodone") error 5 
numblks 64
[  143.045333] XFS: xfs_do_force_shutdown(0x2) called from line 1161 of file 
fs/xfs/xfs_log.c.  Return address = 0xa0430f9c
[  143.045335] XFS: Log I/O Error Detected.  Shutting down filesystem
[  143.045335] XFS: Please umount the filesystem and rectify the problem(s)
[  143.045338] XFS: Assertion failed: atomic_read(&iclog->ic_refcnt) == 0, 
file: fs/xfs/xfs_log.c, line: 2751
[  143.045392] [ cut here ]
[  143.045393] kernel BUG at fs/xfs/xfs_message.c:108!
[  143.045396] invalid opcode:  [#1] SMP 
[  143.045429] Modules linked in: xfs crc32c libcrc32c dm_writeboost(O) fuse 
nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop 
snd_hda_intel snd_hda_codec snd_hwdep snd_pcm joydev hid_generic usbhid 
processor snd_page_alloc i2c_piix4 i2c_core snd_timer snd soundcore psmouse 
serio_raw evdev virtio_balloon pcspkr hid thermal_sys microcode button ext4 
crc16 jbd2 mbcache dm_mod sg sr_mod cdrom ata_generic virtio_blk virtio_net 
floppy uhci_hcd ehci_hcd ata_piix usbcore libata usb_common scsi_mod virtio_pci 
virtio_ring virtio
[  143.045433] CPU: 2 PID: 216 Comm: kworker/2:1H Tainted: G   O 
3.12.0-rc5 #11
[  143.045434] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[  143.045465] Workqueue: xfslogd xfs_buf_iodone_work [xfs]
[  143.045466] task: 88002fe7e0c0 ti: 88002ff66000 task.ti: 
88002ff66000
[  143.045493] RIP: 0010:[]  [] 
assfail+0x1d/0x1f [xfs]
[  143.045494] RSP: :88002ff67d98  EFLAGS: 00010292
[  143.045495] RAX: 005e RBX: 88020de3c280 RCX: 0c8a
[  143.045496] RDX:  RSI: 819c3f2c RDI: 818356e0
[  143.045497] RBP: 88002ff67d98 R08:  R09: 819c3ecc
[  143.045498] R10:  R11:  R12: 8801fa583d28
[  143.045499] R13: 8801fa583c00 R14: 0002 R15: 
[  143.045501] FS:  () GS:88021fc8() 
knlGS:
[  143.045502] CS:  0010 DS:  ES:  CR0: 8005003b
[  143.045505] CR2: 7f6e7680db67 CR3: 0180b000 CR4: 06e0
[  143.045511] Stack:
[  143.045514]  88002ff67dc8 a0430ed1 8801fb124e40 
0002
[  143.045517]  88020de3c280 e8c80500 88002ff67df8 
a0431018
[  143.045519]  8801fb124ee0 8801fb124ee0 8801fb124e40 
88021fc92a00
[  143.045520] Call Trace:
[  143.045558]  [] xlog_state_done_syncing+0x6d/0xe4 [xfs]
[  143.045590]  [] xlog_iodone+0xd0/0xd9 [xfs]
[  143.045609]  [] xfs_buf_iodone_work+0x57/0xa2 [xfs]
[  143.045627]  [] process_one_work+0x18b/0x297
[  143.045631]  [] worker_thread+0x12e/0x1fb
[  143.045634]  [] ? rescuer_thread+0x268/0x268
[  143.045638]  [] kthread+0x88/0x90
[  143.045641]  [] ? __kthread_parkme+0x60/0x60
[  143.045646]  [] ret_from_fork+0x7c/0xb0
[  143.045649]  [] ? __kthread_parkme+0x60/0x60
[  143.045670] Code: 48 c7 c7 6b 26 45 a0 e8 a4 2a c5 e0 5d c3 55 48 89 f1 41 
89 d0 48 c7 c6 80 26 45 a0 48 89 fa 31 c0 48 89 e5 31 ff e8 aa fc ff ff <0f> 0b 
55 48 63 f6 49 89 f9 41 b8 01 00 00 00 b9 10 00 00 00 ba 
[  143.045694] RIP  [] assfail+0x1d/0x1f [xfs]
[  143.045695]  RSP 
[  143.045697] ---[ end trace e4f1a3c306353178 ]---
[  143.045741] BUG: unable to handle kernel paging request at ffd8
[  143.045745] IP: [] kthread_data+0xb/0x11
[  143.045747] PGD 180c067 PUD 180e067 PMD 0 
[  143.045749] Oops:  [#2] SMP 
[  143.045774] Modules linked in: xfs crc32c libcrc32c dm_writeboost(O) fuse 
nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop 
snd_hda_intel snd_hda_codec snd_hwdep snd_pcm joydev hid_generic usbhid 
processor snd_page_alloc i2c_piix4 i2c_core snd_timer snd soundcore psmouse 
serio_raw evdev virtio_balloon pcspkr hid thermal_sys microcode button ext4 
crc16 jbd2 mbcache dm_mod sg sr_mod cdrom ata_generic virtio_blk virtio_net 
floppy uhci_hcd ehci_hcd ata_pii

Re: A review of dm-writeboost

2013-10-16 Thread Dave Chinner
On Wed, Oct 16, 2013 at 09:17:40PM +0900, Akira Hayakawa wrote:
> Dave
> 
> > XFS shuts down because you've returned EIO to a log IO. That's a
> > fatal error. If you do the same to an ext4 journal write, it will do
> > the equivalent of shut down (e.g. complain and turn read-only).
> You mean a block device should not return -EIO at all if
> it doesn't want XFS to shut down suddenly?

Yes. EIO means an IO error has occurred. That causes failure paths
to be triggered in the upper layers.

I really don't understand what you are trying to achieve with this
"blockup" thing. If something goes wrong with the device, then you
*cannot recover* by sending EIO to any new IOs and then continuing
on at a later time as though nothing has happened. The moment a
filesystem gets EIO from a metadata write, it is likely to be
corrupted and if you continue onwards after that you simply
propagate the corruption.

> As Mikulas said, connection failure is often the cause of
> I/O errors from the underlying devices.

Connection failure is *rarely* the cause of IO errors, except in
environments where SANs are in use. Even then multipathing makes
fatal connection failure a rare occurrence. Broken hardware is a
much more common cause of problems at the storage layers.

> Is the fact that both ext4 and XFS decide to shut down on
> erroneous journal writes due to a limitation of journal writes,
> or just an implementation compromise? This is just out of curiosity.

A failed, unrecoverable journal write violates the filesystem
consistency model of any journalling filesystem. Operations must be
stopped and the hardware and filesystem must be repaired, otherwise
loss of data will occur.

i.e. You're telling the filesystem that it's had a fatal IO error by
returning EIO, and the filesystems are treating it as though they've
seen a fatal IO error.


Simple rule: Don't complete IOs with EIO if you haven't had a fatal
IO error.

> 	struct wb_cache *cache = data;
> 	struct wb_device *wb = cache->wb;
> 	unsigned long intvl;
> 
> 	while (!kthread_should_stop()) {
> 
> 		wait_on_blockup();

Ugh. You should be using workqueue with timed work for this.

BTW, you're missing the handling needed by these kernel threads
for suspend-to-disk/ram.
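For illustration, a hedged sketch of what that could look like: periodic work driven by delayed work on a freezable workqueue instead of a polling kthread. The wb_sync_work function and the sync_work/sync_wq fields on wb_cache are hypothetical names, not part of the posted driver:

#include <linux/blkdev.h>
#include <linux/workqueue.h>

/*
 * Hypothetical sketch: the periodic sync driven by delayed work on a
 * freezable workqueue, so it also plays well with suspend-to-disk/ram.
 */
static void wb_sync_work(struct work_struct *work)
{
	struct wb_cache *cache =
		container_of(to_delayed_work(work), struct wb_cache, sync_work);

	flush_current_buffer(cache);
	blkdev_issue_flush(cache->device->bdev, GFP_NOIO, NULL);

	/* Re-arm ourselves; no polling loop and no kthread needed. */
	queue_delayed_work(cache->sync_wq, &cache->sync_work,
			   msecs_to_jiffies(cache->sync_interval * 1000));
}

/*
 * Setup, e.g. in the target constructor:
 *
 *	cache->sync_wq = alloc_workqueue("wbsync",
 *					 WQ_FREEZABLE | WQ_MEM_RECLAIM, 1);
 *	INIT_DELAYED_WORK(&cache->sync_work, wb_sync_work);
 *	queue_delayed_work(cache->sync_wq, &cache->sync_work,
 *			   msecs_to_jiffies(cache->sync_interval * 1000));
 */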

> [   68.825016] XFS (dm-3): Mounting Filesystem
> [   68.847027] XFS (dm-3): Ending clean mount
> [   72.100112] device-mapper: writeboost: err@dm_safe_io_internal() system is 
> blocked up on I/O error. set blockup to 0 after checkup.
> [   72.109702] device-mapper: writeboost: err@migrate_proc() system is 
> blocked up on I/O error. set blockup to 0 after checkup.
> [   72.812097] device-mapper: writeboost: err@modulator_proc() system is 
> blocked up on I/O error. set blockup to 0 after checkup.
> [   73.894429] Buffer I/O error on device dm-3, logical block 98354
> [   73.895824] lost page write due to I/O error on dm-3

Data IO has been lost due to EIOs. You've corrupted user files when
this error is emitted.

...

> [   73.930022] XFS (dm-3): metadata I/O error: block 0x40 
> ("xfs_buf_iodone_callbacks") error 5 numblks 16
> [   74.036759] XFS (dm-3): metadata I/O error: block 0x300c7f ("xlog_iodone") 
> error 5 numblks 64
> [   74.043456] XFS (dm-3): xfs_do_force_shutdown(0x2) called from line 1161 
> of file fs/xfs/xfs_log.c.  Return address = 0xa03a6417
> [   74.047556] XFS (dm-3): Log I/O Error Detected.  Shutting down filesystem
> [   74.049893] XFS (dm-3): Please umount the filesystem and rectify the 
> problem(s)
> [   74.051467] XFS (dm-3): metadata I/O error: block 0x300cbf ("xlog_iodone") 
> error 5 numblks 64
> [   74.053190] XFS (dm-3): xfs_do_force_shutdown(0x2) called from line 1161 
> of file fs/xfs/xfs_log.c.  Return address = 0xa03a6417

And these are all the log buffers containing uncommitted changes
being aborted due to EIO. The filesystem state in memory now doesn't
match the state on disk, and so it's effectively corrupt and shuts
down.



> [  100.052005] BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:1H:215]
...
> [  100.052005] RIP: 0010:[]  [] 
> do_raw_spin_lock+0x16/0x23
> [  100.052005] Workqueue: xfslogd xfs_buf_iodone_work [xfs]
> [  100.052005] Call Trace:
> [  100.052005]  [] ? xfs_buf_iodone+0x1b/0x49 [xfs]
> [  100.052005]  [] ? xfs_buf_do_callbacks+0x22/0x30 [xfs]
> [  100.052005]  [] ? xfs_buf_iodone_callbacks+0x16b/0x1c4 
> [xfs]
> [  100.052005]  [] ? process_one_work+0x191/0x294
> [  100.052005]  [] ? worker_thread+0x121/0x1e7
> [  100.052005]  [] ? rescuer_thread+0x269/0x269
> [  100.052005]  [] ? kthread+0x81/0x89
> [  100.052005]  [] ? __kthread_parkme+0x5d/0x5d
> [  100.052005]  [] ? ret_from_fork+0x7c/0xb0
> [  100.052005]  [] ? __kthread_parkme+0x5d/0x5d

You need to compile your kernel with frame pointers enabled so we get
reliable stack traces. I think it's stuck on a spinlock in
xfs_buf_iodone, which would imply the AIL lock.

.

> [  100.244006] CPU: 2 PID: 3167 Comm: xfsaild/dm-

Re: A review of dm-writeboost

2013-10-16 Thread Akira Hayakawa
Dave

> XFS shuts down because you've returned EIO to a log IO. That's a
> fatal error. If you do the same to an ext4 journal write, it will do
> the equivalent of shut down (e.g. complain and turn read-only).
You mean block device should not return -EIO anyway if
it doesn't want XFS to suddenly shut down?
As Mikulas said, connection failure often be the cause of
I/O error from the underlying devices.

That ext4 and XFS are both determine shutting down in
erroneous journal writes is also due to the limitation of journal write?
or just a compromise in implementation? This is just for my curiosity.

>> [  180.560040] device-mapper: writeboost: err@recorder_proc() system is 
>> blocked up on I/O error. set blockup to 0 after checkup.
>> [  180.561179] device-mapper: writeboost: err@sync_proc() system is blocked 
>> up on I/O error. set blockup to 0 after checkup.
> 
> What's with the 35s delay between these writeboost messages? Have
> you only done a partial shutdown of the block device and it takes
> this length of time for it to completely block IO?
Strange.
These daemons should stop within a few seconds with the current configuration.
Yes, it is partial in a sense. Not all the daemons stop immediately, but
to the client the logical device appears blocked up,
returning -EIO on every I/O. I don't think this behavior harms the
upper layer.

Currently, sync_proc looks like this.
It sleeps for a few seconds, wakes up, and then
hits wait_on_blockup(), which dumps that message and halts the thread.
recorder_proc is implemented in the same way.

int sync_proc(void *data)
{
	int r;
	struct wb_cache *cache = data;
	struct wb_device *wb = cache->wb;
	unsigned long intvl;

	while (!kthread_should_stop()) {

		wait_on_blockup();

		/* sec -> ms */
		intvl = ACCESS_ONCE(cache->sync_interval) * 1000;

		if (!intvl) {
			schedule_timeout_interruptible(msecs_to_jiffies(1000));
			continue;
		}

		flush_current_buffer(cache);

		RETRY(blkdev_issue_flush(cache->device->bdev, GFP_NOIO, NULL));

		schedule_timeout_interruptible(msecs_to_jiffies(intvl));
	}
	return 0;
}

Does XFS shutting down and going crazy prevent the kthread from waking up?

> These should be emitting a stack trace. Can you turn up the logging
> level you are using so that they emit a full stack trace? The
> messages are useless without the stack dump
I turned the level up to 7. Here is the result.

Connected to domain Hercules
Escape character is ^]
[   54.683482] device-mapper: writeboost: err@audit_cache_device() superblock 
header: magic number invalid
[   54.809262] bio: create slab  at 2
[   68.812800] SGI XFS with ACLs, security attributes, realtime, large 
block/inode numbers, no debug enabled
[   68.825016] XFS (dm-3): Mounting Filesystem
[   68.847027] XFS (dm-3): Ending clean mount
[   72.100112] device-mapper: writeboost: err@dm_safe_io_internal() system is 
blocked up on I/O error. set blockup to 0 after checkup.
[   72.109702] device-mapper: writeboost: err@migrate_proc() system is blocked 
up on I/O error. set blockup to 0 after checkup.
[   72.812097] device-mapper: writeboost: err@modulator_proc() system is 
blocked up on I/O error. set blockup to 0 after checkup.
[   73.894429] Buffer I/O error on device dm-3, logical block 98354
[   73.895824] lost page write due to I/O error on dm-3
[   73.897042] Buffer I/O error on device dm-3, logical block 98355
[   73.897209] Buffer I/O error on device dm-3, logical block 196641
[   73.897210] lost page write due to I/O error on dm-3
[   73.897263] Buffer I/O error on device dm-3, logical block 196688
[   73.897264] lost page write due to I/O error on dm-3
[   73.897266] Buffer I/O error on device dm-3, logical block 196689
[   73.897267] lost page write due to I/O error on dm-3
[   73.897268] Buffer I/O error on device dm-3, logical block 196690
[   73.897269] lost page write due to I/O error on dm-3
[   73.897270] Buffer I/O error on device dm-3, logical block 196691
[   73.897271] lost page write due to I/O error on dm-3
[   73.897272] Buffer I/O error on device dm-3, logical block 196692
[   73.897273] lost page write due to I/O error on dm-3
[   73.897307] Buffer I/O error on device dm-3, logical block 294955
[   73.897308] lost page write due to I/O error on dm-3
[   73.897335] Buffer I/O error on device dm-3, logical block 294956
[   73.897335] lost page write due to I/O error on dm-3
[   73.914261] lost page write due to I/O error on dm-3
[   73.930022] XFS (dm-3): metadata I/O error: block 0x40 
("xfs_buf_iodone_callbacks") error 5 numblks 16
[   74.036759] XFS (dm-3): metadata I/O error: block 0x300c7f ("xlog_iodone") 
error 5 numblks 64
[   74.043456] XFS (dm-3): xfs_do_force_shutdown(0x2) called from line 1161 of 
file fs/xfs/xfs_log.c.  Return address = 0xa03a6417
[   74.047556] XFS (dm-3): Log I/O Error Detected

Re: A review of dm-writeboost

2013-10-16 Thread Dave Chinner
On Wed, Oct 16, 2013 at 07:34:38PM +0900, Akira Hayakawa wrote:
> Dave
> 
> > Akira, can you please post the entire set of messages you are
> > getting when XFS is showing problems? That way I can try to confirm
> > whether it's a regression in XFS or something else.
> 
> Environment:
> - The kernel version is 3.12-rc1
> - The debuggee is a KVM virtual machine equipped with 8 vcpus.
> - writeboost version is commit 236732eb84684e8473353812acb3302232e1eab0
>   You can clone it from https://github.com/akiradeveloper/dm-writeboost
> 
> Test:
> 1. Make a writeboost device with a 3MB cache device and a 3GB backing store
>    with default options (segment size order is 7 and a 2MB RAM buffer is
>    allocated).
> 2. Start the testing/1 script (compiling Ruby and running make test after it).
> 3. Set the blockup variable to 1 via the message interface a few seconds later.
>    The writeboost device starts to return -EIO on all incoming requests.
>    I guess this behavior causes the problem.
> 
> In some cases, XFS doesn't collapse after setting blockup to 1.
> When I set the variable to 1 about 10 or 20 seconds later,
> it didn't collapse but cleanly stopped the compile, and
> after I set it back to 0, the compile resumed.
> That XFS collapses (badly shutting down the filesystem, as seen below) in
> some cases
> but not in others suggests to me that
> the former case runs into a very obscure corner-case bug.

XFS shuts down because you've returned EIO to a log IO. That's a
fatal error. If you do the same to an ext4 journal write, it will do
the equivalent of shut down (e.g. complain and turn read-only).

> The entire set of messages via virsh console is shown below.
> The writeboost-related lines are all benign.
> The daemons are just stopping because the blockup variable is 1.
> 
> [  146.284626] XFS (dm-3): metadata I/O error: block 0x300d91 ("xlog_iodone") 
> error 5 numblks 64
> [  146.285825] XFS (dm-3): Log I/O Error Detected.  Shutting down filesystem
> [  146.286699] XFS (dm-3): Please umount the filesystem and rectify the 
> problem(s)

What happened before this? Please attach the *full* log.

> [  146.560036] device-mapper: writeboost: err@modulator_proc() system is 
> blocked up on I/O error. set blockup to 0 after checkup.
> [  147.244036] device-mapper: writeboost: err@migrate_proc() system is 
> blocked up on I/O error. set blockup to 0 after checkup.
> [  172.052006] BUG: soft lockup - CPU#0 stuck for 23s! [script:3170]
> [  172.436003] BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:1:57]

These should be emitting a stack trace. Can you turn up the logging
level you are using so that they emit a full stack trace? The
messages are useless without the stack dump

Also, 23 seconds before this timestamp is 149s, about 3s after the
XFS filesystem shut down, so it's not clear that the XFS shutdown is
related to the soft lockup yet. That's what we need the stack traces
for...

> [  180.560040] device-mapper: writeboost: err@recorder_proc() system is 
> blocked up on I/O error. set blockup to 0 after checkup.
> [  180.561179] device-mapper: writeboost: err@sync_proc() system is blocked 
> up on I/O error. set blockup to 0 after checkup.

What's with the 35s delay between these writeboost messages? Have
you only done a partial shutdown of the block device and it takes
this length of time for it to completely block IO?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: A review of dm-writeboost

2013-10-16 Thread Akira Hayakawa
Dave

> Akira, can you please post the entire set of messages you are
> getting when XFS is showing problems? That way I can try to confirm
> whether it's a regression in XFS or something else.

Environment:
- The kernel version is 3.12-rc1
- The debuggee is a KVM virtual machine equipped with 8 vcpus.
- writeboost version is commit 236732eb84684e8473353812acb3302232e1eab0
  You can clone it from https://github.com/akiradeveloper/dm-writeboost

Test:
1. Make a writeboost device with a 3MB cache device and a 3GB backing store
   with default options (segment size order is 7 and a 2MB RAM buffer is
   allocated).
2. Start the testing/1 script (compiling Ruby and running make test after it).
3. Set the blockup variable to 1 via the message interface a few seconds later.
   The writeboost device starts to return -EIO on all incoming requests.
   I guess this behavior causes the problem.

In some cases, XFS doesn't collapse after setting blockup to 1.
When I set the variable to 1 about 10 or 20 seconds later,
it didn't collapse but cleanly stopped the compile, and
after I set it back to 0, the compile resumed.
That XFS collapses (badly shutting down the filesystem, as seen below) in some
cases
but not in others suggests to me that
the former case runs into a very obscure corner-case bug.

The entire set of messages via virsh console is shown below.
The writeboost-related lines are all benign.
The daemons are just stopping because the blockup variable is 1.

[  146.284626] XFS (dm-3): metadata I/O error: block 0x300d91 ("xlog_iodone") 
error 5 numblks 64
[  146.285825] XFS (dm-3): Log I/O Error Detected.  Shutting down filesystem
[  146.286699] XFS (dm-3): Please umount the filesystem and rectify the 
problem(s)
[  146.560036] device-mapper: writeboost: err@modulator_proc() system is 
blocked up on I/O error. set blockup to 0 after checkup.
[  147.244036] device-mapper: writeboost: err@migrate_proc() system is blocked 
up on I/O error. set blockup to 0 after checkup.
[  172.052006] BUG: soft lockup - CPU#0 stuck for 23s! [script:3170]
[  172.436003] BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:1:57]
[  180.560040] device-mapper: writeboost: err@recorder_proc() system is blocked 
up on I/O error. set blockup to 0 after checkup.
[  180.561179] device-mapper: writeboost: err@sync_proc() system is blocked up 
on I/O error. set blockup to 0 after checkup.
[  200.052005] BUG: soft lockup - CPU#0 stuck for 23s! [script:3170]
[  200.436005] BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:1:57]
[  206.484005] INFO: rcu_sched self-detected stall on CPU { 0}  (t=15000 
jiffies g=1797 c=1796 q=3022)
[  232.052007] BUG: soft lockup - CPU#0 stuck for 23s! [script:3170]
[  232.436003] BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:1:57]
[  260.052006] BUG: soft lockup - CPU#0 stuck for 23s! [script:3170]
[  260.436004] BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:1:57]
[  288.052006] BUG: soft lockup - CPU#0 stuck for 23s! [script:3170]
[  288.436004] BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:1:57]

Akira


Re: A review of dm-writeboost

2013-10-15 Thread Dave Chinner
[cc x...@oss.sgi.com]

On Tue, Oct 15, 2013 at 08:01:45PM -0400, Mikulas Patocka wrote:
> On Mon, 14 Oct 2013, Akira Hayakawa wrote:
> > But, XFS stalls ...
> > ---
> > For testing,
> > I manually turn `blockup` to 1
> > while compiling Ruby is in progress
> > on XFS on a writeboost device.
> > As soon as I do it,
> > XFS starts to dump error messages
> > like "metadata I/O error: ... ("xlog_iodone") error ..."
> > and after a few seconds it starts to dump
> > like "BUG: soft lockup -CPU#3 stuck for 22s!".
> > The system stalls and doesn't accept keyboard input.
> > 
> > I think this behavior is caused by
> > the device always returning -EIO after the variable
> > is turned to 1.
> > But why does XFS stall on an I/O error?
> 
> Because it is bloated and buggy.

How did I know you'd take that cheap shot, Mikulas? You are so
predictable...

> We have bug 924301 for XFS crash on I/O 
> error...

Which is a problem with memory corruption after filling a dm
snapshot volume to 100% and shortly after XFS has shut down the
kernel panics from memory corruption. Can't be reproduced without
filling the dm-snapshot volume to 100%, can't be reproduced with any
other filesystem. Crashes are also occurring randomly in printk and
the worker thread infrastructure. Memory and list poisoning clearly
indicates worker thread lists have freed objects on them. There are
lockdep messages from the DM snapshot code, etc.

There's actually very little to point at XFS problems other than the
first hang that was reported where XFS was stuck in a tight loop due
to memory corruption.  It reminds me of a very similar bug report
and triage we went through last week:

http://oss.sgi.com/pipermail/xfs/2013-October/030681.html

Further analysis and bisects pointed to the zram driver being buggy,
not XFS:

http://oss.sgi.com/pipermail/xfs/2013-October/030707.html

XFS has historically exposed bugs in block device drivers that no
other filesystem exposes, and so when a new block device driver gets
tested with XFS and we start seeing memory corruption symptoms, it's
a fair bet that it's not XFS that is causing it.

Just sayin'.

---

Akira, can you please post the entire set of messages you are
getting when XFS is showing problems? That way I can try to confirm
whether it's a regression in XFS or something else.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: A review of dm-writeboost

2013-10-15 Thread Akira Hayakawa
Mikulas,

> I/Os shouldn't be returned with -ENOMEM. If they are, you can treat it as 
> a hard error.
It seems that blkdev_issue_discard returns -ENOMEM
when bio_alloc fails, for example.
Waiting for a second, by which time the allocation may succeed,
is my idea for handling a returned -ENOMEM.

> Blocking I/O until the admin turns a specific variable isn't too 
> reliable.
> 
> Think of this case - your driver detects I/O error and blocks all I/Os. 
> The admin tries to log in. The login process needs memory. To fulfill this 
> memory need, the login process writes out some dirty pages. Those writes 
> are blocked by your driver - in the result, the admin is not able to log 
> in and flip the switch to unblock I/Os.
> 
> Blocking I/O indefinitely isn't good because any system activity 
> (including typing commands into shell) may wait on this I/O.
I understand the problem. But what should I do then?
Since writeboost is caching software,
it loses consistency if we ignore the cache entirely
when it returns an I/O error.
Panicking in that case is also inappropriate (though inaccessibility of
the storage will eventually halt the whole system; if so, panicking might
be an acceptable solution).

I am afraid my idea is based on your past comment:
> If you can't handle a specific I/O request failure gracefully, you should 
> mark the driver as dead, don't do any more I/Os to the disk or cache 
> device and return -EIO on all incoming requests.
> 
> Always think that I/O failures can happen because of connection problems, 
> not data corruption problems - for example, a disk cable can go loose, a 
> network may lose connectivity, etc. In these cases, it is best to stop 
> doing any I/O at all and let the user resolve the situation.
1) On failure, mark the driver dead - set `blockup` to 1 in my case -
   and return -EIO on all incoming requests. Yes.
2) Then wait for the user to resolve the situation - keep returning -EIO until
   the admin turns `blockup` back to 0 after a checkup in my case. Yes.

Did you mean we should not provide any manual way to recover the system,
because the admin may not be able to reach the switch?
Should the writeboost module instead autonomously check whether
the troubled device has recovered?
Retrying the I/O submission to the device and detecting recovery
on I/O success is one solution, and I have implemented it.
I/O retry doesn't destroy any consistency in writeboost;
at worst it will sooner or later stop accepting writes because it runs out of
RAM buffers, which can only be reused after I/O to the cache device succeeds.

Akira



Re: A review of dm-writeboost

2013-10-15 Thread Mikulas Patocka


On Mon, 14 Oct 2013, Akira Hayakawa wrote:

> Hi, DM Guys
> 
> I suppose I have finished the tasks needed to
> address Mikulas's points.
> So, let me update the progress report.
> 
> The code is now updated in my GitHub repo.
> Check out the develop branch to get
> the latest source code.
> 
> Compilation Status
> --
> First, compilation status.
> Mikulas advised me to compile the module in a
> 32-bit environment, and yes, I did.
> With all the kernels listed below,
> writeboost compiles without any errors or warnings.
> 
> For 64 bit
> 3.2.36, 3.4.25, 3.5.7, 3.6.9, 3.7.5, 3.8.5, 3.9.8, 3.10.5, 3.11.4 and 3.12-rc1
> 
> For 32 bit
> 3.2.0-4-686-pae (Debian 7.1.0-i386)
> 
> 
> Block up on error
> -
> The most annoying thing in this update
> is how to handle I/O errors.
> For memory allocation errors,
> writeboost now makes use of mempools to avoid
> the problem Mikulas mentioned in his last comments,
> but handling I/O errors gracefully
> while the system is running is very difficult.
> 
> My answer is that
> all the daemons stop when an I/O error (-EIO returned) happens
> in any part of this module.
> They wait on a wait queue (blockup_wait_queue)
> and reactivate when the sysadmin turns the `blockup` variable to 0
> through the message interface.
> When `blockup` is 1, all the incoming I/O
> is returned as -EIO to the upper layer.
> A RETRY macro is introduced
> which wraps an I/O and
> retries the I/O submission if the error is -ENOMEM but

I/Os shouldn't be returned with -ENOMEM. If they are, you can treat it as 
a hard error.

> turns blockup to 1 and sleeps if the error is -EIO.
> -EIO is more serious than -ENOMEM because
> it may indicate the storage is being damaged by some accidental problem
> that we have no control over in the device-mapper layer
> (e.g. the storage controller went crazy).
> Blocking up all I/O is meant to minimize the
> probable damage.
> 
> But, XFS stalls ...
> ---
> For testing,
> I manually turn `blockup` to 1
> while compiling Ruby is in progress
> on XFS on a writeboost device.
> As soon as I do it,
> XFS starts to dump error messages
> like "metadata I/O error: ... ("xlog_iodone") error ..."
> and after a few seconds it starts to dump
> like "BUG: soft lockup -CPU#3 stuck for 22s!".
> The system stalls and doesn't accept keyboard input.
> 
> I think this behavior is caused by
> the device always returning -EIO after the variable
> is turned to 1.
> But why does XFS stall on an I/O error?

Because it is bloated and buggy. We have bug 924301 for XFS crash on I/O 
error...

> It should just suspend and start returning
> errors to the upper layer, as writeboost now does.
> As Mikulas said, the I/O error is often
> due to a connection failure that is usually recoverable.
> Stalling the kernel requires a reboot
> after recovery, whereas writeboost
> can recover just by turning `blockup` back to 0.
> Is there any reason for this design, or
> is there an option to not stall XFS on an I/O error?
> 
> Thanks,
> Akira

Blocking I/O until the admin turns a specific variable isn't too 
reliable.

Think of this case - your driver detects I/O error and blocks all I/Os. 
The admin tries to log in. The login process needs memory. To fulfill this 
memory need, the login process writes out some dirty pages. Those writes 
are blocked by your driver - in the result, the admin is not able to log 
in and flip the switch to unblock I/Os.

Blocking I/O indefinitely isn't good because any system activity 
(including typing commands into shell) may wait on this I/O.

Mikulas


Re: A review of dm-writeboost

2013-10-14 Thread Akira Hayakawa
Hi, DM Guys

I suppose I have finished the tasks needed to
address Mikulas's points.
So, let me update the progress report.

The code is now updated in my GitHub repo.
Check out the develop branch to get
the latest source code.

Compilation Status
--
First, compilation status.
Mikulas advised me to compile the module in a
32-bit environment, and yes, I did.
With all the kernels listed below,
writeboost compiles without any errors or warnings.

For 64 bit
3.2.36, 3.4.25, 3.5.7, 3.6.9, 3.7.5, 3.8.5, 3.9.8, 3.10.5, 3.11.4 and 3.12-rc1

For 32 bit
3.2.0-4-686-pae (Debian 7.1.0-i386)


Block up on error
-
The most annoying thing in this update
is how to handle I/O errors.
For memory allocation errors,
writeboost now makes use of mempools to avoid
the problem Mikulas mentioned in his last comments,
but handling I/O errors gracefully
while the system is running is very difficult.

My answer is that
all the daemons stop when an I/O error (-EIO returned) happens
in any part of this module.
They wait on a wait queue (blockup_wait_queue)
and reactivate when the sysadmin turns the `blockup` variable to 0
through the message interface.
When `blockup` is 1, all the incoming I/O
is returned as -EIO to the upper layer.
A RETRY macro is introduced
which wraps an I/O and
retries the I/O submission if the error is -ENOMEM but
turns blockup to 1 and sleeps if the error is -EIO.
-EIO is more serious than -ENOMEM because
it may indicate the storage is being damaged by some accidental problem
that we have no control over in the device-mapper layer
(e.g. the storage controller went crazy).
Blocking up all I/O is meant to minimize the
probable damage.
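For illustration, a rough, hypothetical sketch of what such a RETRY macro might look like, based only on the description above. It is not the actual macro from the repository; the wb variable and the blockup field access are assumptions:

/*
 * Hypothetical sketch of the retry-on-ENOMEM / block-on-EIO policy
 * described above (not the driver's real RETRY macro).
 */
#define RETRY(io_expr)						\
	do {							\
		int r__;					\
		for (;;) {					\
			r__ = (io_expr);			\
			if (!r__)				\
				break;				\
			if (r__ == -ENOMEM) {			\
				/* transient: wait and retry */	\
				schedule_timeout_interruptible(	\
					msecs_to_jiffies(1000));\
				continue;			\
			}					\
			/* -EIO or worse: block up the device */\
			wb->blockup = true;			\
			/* sleep until the admin clears blockup */\
			wait_on_blockup();			\
		}						\
	} while (0)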

But, XFS stalls ...
---
For testing,
I manually turn `blockup` to 1
while compiling Ruby is in progress
on XFS on a writeboost device.
As soon as I do it,
XFS starts to dump error messages
like "metadata I/O error: ... ("xlog_iodone") error ..."
and after a few seconds it starts to dump
like "BUG: soft lockup -CPU#3 stuck for 22s!".
The system stalls and doesn't accept keyboard input.

I think this behavior is caused by
the device always returning -EIO after the variable
is turned to 1.
But why does XFS stall on an I/O error?
It should just suspend and start returning
errors to the upper layer, as writeboost now does.
As Mikulas said, the I/O error is often
due to a connection failure that is usually recoverable.
Stalling the kernel requires a reboot
after recovery, whereas writeboost
can recover just by turning `blockup` back to 0.
Is there any reason for this design, or
is there an option to not stall XFS on an I/O error?

Thanks,
Akira

--

Followed by
changes to Driver/* and Documentation

diff --git a/Driver/dm-writeboost-daemon.c b/Driver/dm-writeboost-daemon.c
index cb8febe..7a7f353 100644
--- a/Driver/dm-writeboost-daemon.c
+++ b/Driver/dm-writeboost-daemon.c
@@ -12,9 +12,11 @@
 
 int flush_proc(void *data)
 {
+   int r;
unsigned long flags;
 
struct wb_cache *cache = data;
+   struct wb_device *wb = cache->wb;
 
while (true) {
struct flush_job *job;
@@ -22,13 +24,13 @@ int flush_proc(void *data)
struct dm_io_request io_req;
struct dm_io_region region;
 
+   wait_on_blockup();
+
spin_lock_irqsave(&cache->flush_queue_lock, flags);
while (list_empty(&cache->flush_queue)) {
spin_unlock_irqrestore(&cache->flush_queue_lock, flags);
-   wait_event_interruptible_timeout(
-   cache->flush_wait_queue,
-   (!list_empty(&cache->flush_queue)),
-   msecs_to_jiffies(100));
+
+   schedule_timeout_interruptible(msecs_to_jiffies(1000));
 
/*
 * flush daemon can exit
@@ -49,6 +51,8 @@ int flush_proc(void *data)
list_del(&job->flush_queue);
spin_unlock_irqrestore(&cache->flush_queue_lock, flags);
 
+   smp_rmb();
+
seg = job->seg;
 
io_req = (struct dm_io_request) {
@@ -65,9 +69,9 @@ int flush_proc(void *data)
.count = (seg->length + 1) << 3,
};
 
-   dm_safe_io_retry(&io_req, 1, &region, false);
+   RETRY(dm_safe_io(&io_req, 1, &region, NULL, false));
 
-   cache->last_flushed_segment_id = seg->global_id;
+   atomic64_set(&cache->last_flushed_segment_id, seg->global_id);
 
complete_all(&seg->flush_done);
 
@@ -78,15 +82,15 @@ int flush_proc(void *data)
 */
if (!bio_list_empty(&job->barrier_ios)) {
struct bio *bio;
-   blkdev_issue_flush(cache->device->bdev, GFP_NOIO, NULL);
+   RETRY(blkdev_issue_flush(cache->device->bdev, GFP_NOIO, NULL));
while ((bio = b

Re: A review of dm-writeboost

2013-10-09 Thread Akira Hayakawa
Mikulas,

> Next, you need to design some locking - which variables are protected by 
> which locks. If you use shared variables without locks, you need to use 
> memory barriers (it is harder to design code using memory barriers than 
> locks).

First I will explain the locking and the shared variables.
writeboost doesn't have many variables shared among asynchronous threads.

1. We have a hash table for cache lookup and a RAM buffer that holds write data
   transiently.
   These data structures are protected by a mutex, io_lock.
   barrier_ios is a queue onto which bios flagged with REQ_FUA or REQ_FLUSH are
   pushed.
   This queue is also protected by io_lock.
   These bios are chained to a flush job inside io_lock.
   queue_flushing is a routine that makes a flush job and queues it in
   flush_queue, which is protected by a spinlock, flush_queue_lock.
   The foreground (.map) path is the producer of flush jobs for the flush daemon.
2. The flush daemon loops in flush_proc in a single thread.
   It pops and consumes one flush job at a time from flush_queue (writing to the
   cache device).
   Therefore, the foreground must ensure that a flush job is
   completely initialized before queueing it.
   After completing the write, it sets (actually increments)
   last_flushed_segment_id.
   This daemon is the only one that updates this shared variable.
3. The migrate daemon loops in migrate_proc, also running in a single thread.
   It first calculates how many segments (nr_mig) to migrate, from
   last_migrated_segment_id + 1 up to + nr_mig, by referring to
   last_flushed_segment_id and last_migrated_segment_id.
   The former is the head of the dirty log and the latter
   is the tail. Note that the head doesn't have to be synchronized,
   since last_flushed_segment_id increases linearly; an unsynced
   last_flushed_segment_id only results in migrating fewer segments
   than actually possible.
   It doesn't corrupt anything at all.
   After migrating all nr_mig dirty segments, it increments
   last_migrated_segment_id by nr_mig.
   This daemon is the only one that updates last_migrated_segment_id.
   
Other daemons don't share such variables.

> Think of this piece of code:
> nr_mig_candidates = cache->last_flushed_segment_id -
> cache->last_migrated_segment_id;
> ...
> nr_mig = min(nr_mig_candidates,
>  cache->nr_cur_batched_migration);
> ...
> for (i = 1; i <= nr_mig; i++) {
>   seg = get_segment_header_by_id(
> cache,
> cache->last_migrated_segment_id + i);
> list_add_tail(&seg->migrate_list, &cache->migrate_list);
> }
> 
> The processor may reorder all memory accesses - for example it may read 
> the data accessed in the "for" cycle before reading the variables 
> "cache->last_flushed_segment_id" and "cache->last_migrated_segment_id". If 
> this happens, the code may work with invalid data.

Really?
My understanding is that reordering does not happen
across these three blocks of code because there are sequential data dependencies.
Or is the point that the last block is a loop?

But I sorted out two other points where we need to insert memory barriers.

The first is in queue_flushing.
We can't rule out that an incompletely initialized flush job is queued and
flush_proc goes bad.
I am thinking of inserting smp_wmb() before queueing and
smp_rmb() after popping a job in flush_proc.

The second is in migrate_proc.
We can't rule out that migrate_proc enters migrate_linked_segments
before cache->migrate_list is properly initialized.
I am thinking of inserting smp_wmb() before migrate_linked_segments.

I don't understand memory barriers well,
so please correct me if I am going the wrong way.
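For reference, a minimal sketch of the producer/consumer barrier pairing proposed here, with names loosely following the posted code. These are illustrative fragments only, not the actual patch:

/* Producer side (queue_flushing) - illustrative fragment only. */
	job->seg = seg;
	/* ... fully initialize the rest of the flush job here ... */

	smp_wmb();	/* publish the initialized fields before queueing */

	spin_lock_irqsave(&cache->flush_queue_lock, flags);
	list_add_tail(&job->flush_queue, &cache->flush_queue);
	spin_unlock_irqrestore(&cache->flush_queue_lock, flags);

/* Consumer side (flush_proc) - illustrative fragment only. */
	spin_lock_irqsave(&cache->flush_queue_lock, flags);
	job = list_first_entry(&cache->flush_queue, struct flush_job, flush_queue);
	list_del(&job->flush_queue);
	spin_unlock_irqrestore(&cache->flush_queue_lock, flags);

	smp_rmb();	/* pairs with the smp_wmb() above before reading the job */
	seg = job->seg;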

> Nonatomic 64-bit variables
> --
> 
> Note that 64-bit variables are not atomic on 32-bit architectures.
> 
> Linux assumes that 32-bit variables are atomic on 32-bit architectures and 
> 64-bit or 32-bit variables are atomic on 64-bit architectures. That is, 
> variables having the "long" or "int" type are atomic. Atomicity means that 
> if you read and modify the variable at the same time, you read either the 
> old value or the new values.
> 
> 64-bit variables on 32-bit architectures are usually read and written in 
> two steps, and thus the atomicity requirement isn't true.
> 
> For example, suppose that you change cache->last_flushed_segment_id from 
> 0x to 0x0001. Now, the function migrate_proc 
> that asynchronously reads cache->last_flushed_segment_id can read 
> 0x or 0x0001 (if it reads one of these values, 
> it behaves correctly), but it can also read 0x or 
> 0x0001 - if migrate_proc reads one of these two values, it 
> goes wild, migrating segments that weren't ever supposed to be migrated, 
> and likely causes a crash or data corruption.
> 
> I found this bug in migrate_proc and update_superblock_record (but it may 
> be also in other parts of the code).
> 
> You can use atomic64_t type for atomic 64-bit variables (plus memory 
> barriers as explained in the previous section). Or

Re: [dm-devel] A review of dm-writeboost

2013-10-08 Thread Akira Hayakawa
Mikulas,

> Waking up every 100ms in flush_proc is not good because it wastes CPU time 
> and energy if the driver is idle.
Yes, 100ms is too short. I will change it to 1 second then.
We can afford to wait 1 second at termination.

> The problem is that if you fill up the whole cache device in less time 
> than 1 second. Then, you are waiting for 1 second until the migration 
> process notices that it has some work to do. That waiting is pointless and 
> degrades performance - the new write requests received by your driver sit 
> in queue_current_buffer, waiting for data to be migrated. And the 
> migration process sits in 
> schedule_timeout_interruptible(msecs_to_jiffies(1000)), waiting for the 
> end of the timeout. Your driver does absolutely nothing in this moment.
> 
> For this reason, you should wake the migration process as soon as 
> it is needed.
I see the reason. I agree.

Migration is not executed restlessly in writeboost.
The cache device being full and needing migration to make room for new writes
is the "urgent" situation.
Setting reserving_segment_id to non-zero tells the
migration daemon that the migration is urgent.

There is also non-urgent migration,
when the load on the backing store is lower than a threshold.
Migrating to the backing store restlessly may affect the
whole system. For example, it may delay read requests
to the backing store.
So, migrating only in idle time can be a good strategy.

> Pointless modulator_proc
> 
> 
> This thread does no work, it just turns cache->allow_migrate on and off. 
> Thus, the thread is useless, you can just remove it. In the place where 
> you read cache->allow_migrate (in migrate_proc) you could just do the work 
> that used to be performed in modulator_proc.
Sure, it turns the flag on and off, but this daemon is needed.
This daemon calculates the load of the backing store;
moving this code into migrate_proc and doing the same work
on every loop iteration would be too CPU-consuming.
Making the decision once a second seems reasonable.

However, some systems don't want to delay migration at all,
because the backing store has a large write-back cache
and wants it filled for its optimizations
(e.g. reordering) to be effective.
In this case, setting both enable_migration_modulator
and allow_migrate to 0 will do.

Also note that, related to migration,
nr_max_batched_migration determines how many segments
can be migrated at a time.

Back to urgent migration:
the problem can be solved easily.
How about waking up the migration daemon
just after setting reserving_segment_id to non-zero (the // HERE mark below)?
It is similar to waking up the flush daemon when a flush job is queued.

void wait_for_migration(struct wb_cache *cache, u64 id)
{
	struct segment_header *seg = get_segment_header_by_id(cache, id);

	/*
	 * Set reserving_segment_id to non-zero
	 * to force the migration daemon
	 * to complete migration of this segment
	 * immediately.
	 */
	cache->reserving_segment_id = id;
	// HERE
	wait_for_completion(&seg->migrate_done);
	cache->reserving_segment_id = 0;
}
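For illustration only, the wake-up at // HERE might then look roughly like this, assuming a hypothetical migrate_wait_queue that the migration daemon sleeps on (the posted code does not show such a field):

	cache->reserving_segment_id = id;

	/*
	 * Hypothetical: kick the migration daemon instead of waiting
	 * for its next polling interval to expire.
	 */
	wake_up_interruptible(&cache->migrate_wait_queue);

	wait_for_completion(&seg->migrate_done);
	cache->reserving_segment_id = 0;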

> flush_proc is woken up correctly. But the other threads aren't. 
> migrate_proc, modulator_proc, recorder_proc, sync_proc all do polling.
For the other daemons:
modulator: turns migration on and off according to the load of the backing store,
every second (default ON)
recorder: updates the superblock record every T seconds (default T=60)
sync: makes all transient data persistent every T seconds (default T=60)

They just loop by themselves.

Maybe recorder and sync should be turned off by default.
- The recorder daemon is just for fast rebooting. The record section contains
  last_migrated_segment_id, which is used in recover_cache()
  to reduce the number of segments to recover.
- The sync daemon is for enterprise SLAs. Some users want to
  make the storage system persistent every given period.
  This is intrinsically needless, so turning it off by default is appropriate.

Akira


Re: [dm-devel] A review of dm-writeboost

2013-10-08 Thread Akira Hayakawa
Mikulas,

Let me first ask you about this comment
on choosing the best API.
For the rest, I will reply later.

> BTW. You should use wait_event_interruptible_lock_irq instead of 
> wait_event_interruptible and wait_event_interruptible_lock_irq_timeout 
> instead of wait_event_interruptible_timeout. The functions with "lock_irq" 
> automatically drop and re-acquire the spinlock, so that the condition is 
> tested while the lock is held - so that there are no asynchronous accesses 
> to !list_empty(&cache->flush_queue).

wait_event_interruptible_lock_irq_timeout was
added to the kernel in 3.12 by this patch:
https://lkml.org/lkml/2013/8/21/362
However, it is only used by zfcp itself.

I am afraid I would rather refrain from this new API
and keep the code as it is now.
My opinion is that once this API is widely accepted,
that will be the time to use it in writeboost.
The fact that this API was not added for such a long time
makes me feel it should not be used, at least at this moment.
I want to use only APIs that are truly stable.
writeboost, as a storage kernel module, should be stable.
I believe depending only on stable APIs is a
good way of making software stable.

That said, I tried to fix this.
Is this change what you meant?

@@ -24,10 +24,10 @@ int flush_proc(void *data)

spin_lock_irqsave(&cache->flush_queue_lock, flags);
while (list_empty(&cache->flush_queue)) {
-   spin_unlock_irqrestore(&cache->flush_queue_lock, flags);
-   wait_event_interruptible_timeout(
+   wait_event_interruptible_lock_irq_timeout(
cache->flush_wait_queue,
-   (!list_empty(&cache->flush_queue)),
+   !list_empty(&cache->flush_queue),
+   cache->flush_queue_lock,
msecs_to_jiffies(100));

/*
@@ -36,8 +36,6 @@ int flush_proc(void *data)
 */
if (kthread_should_stop())
return 0;
-   else
-   spin_lock_irqsave(&cache->flush_queue_lock, flags);
}

Akira


Re: A review of dm-writeboost

2013-10-07 Thread Mikulas Patocka


On Sun, 6 Oct 2013, Akira Hayakawa wrote:

> Mikulas,
> 
> Thank you for your reviewing.
> 
> I will reply to polling issue first.
> For the rest, I will reply later.
> 
> > Polling for state
> > -
> > 
> > Some of the kernel threads that you spawn poll for data in one-second 
> > interval - see migrate_proc, modulator_proc, recorder_proc, sync_proc.
> > 
> > flush_proc correctly contains wait_event, but there is still some 100ms 
> > polling timeout in flush_proc that shouldn't be there.
> > 
> > If you set the polling interval too low, it wastes CPU cycles and it wastes 
> > energy due to CPU wake-ups. If you set the polling interval too high, it 
> > causes artificial delays. For example, current middle-class SSDs can do 
> > writes at a rate 500MB/s. Now, think that the user uses 400MB partition 
> > for the cache - the partition is filled up in 0.8s and the kernel waits 
> > for additional 0.2s doing absolutely nothing, until your polling code 
> > wakes up and notices that it should start flusing the cache.
> > 
> > So - the polling code introduces artificial delays that can cause 
> > performance degradation. You may think about lowering the polling interval 
> > to lessen the performance impact, but if you lower the polling interval, 
> > it increases CPU and power consumption when the driver is idle. Either 
> > way, it is wrong. Just don't use polling.
> 
> You dislike the fact that a sleeping looping thread
> can delay newly arriving jobs.
> 
> I am afraid your understanding
> of the sleeping of the flush daemon is wrong.
> Please see queue_flushing(): it wakes up the daemon
> if the queue is empty.
> 
> 	spin_lock_irqsave(&cache->flush_queue_lock, flags);
> 	empty = list_empty(&cache->flush_queue);
> 	list_add_tail(&job->flush_queue, &cache->flush_queue);
> 	spin_unlock_irqrestore(&cache->flush_queue_lock, flags);
> 	if (empty)
> 		wake_up_interruptible(&cache->flush_wait_queue);
> 
> Even if these wake-up lines are removed,
> the flush daemon wakes up by itself
> 100ms at worst after the first 1MB is written.

flush_proc is woken up correctly. But the other threads aren't. 
migrate_proc, modulator_proc, recorder_proc, sync_proc all do polling.

Waking up every 100ms in flush_proc is not good because it wastes CPU time 
and energy if the driver is idle.


BTW. You should use wait_event_interruptible_lock_irq instead of 
wait_event_interruptible and wait_event_interruptible_lock_irq_timeout 
instead of wait_event_interruptible_timeout. The functions with "lock_irq" 
automatically drop and re-acquire the spinlock, so that the condition is 
tested while the lock is held - so that there are no asynchronous accesses 
to !list_empty(&cache->flush_queue).


> > Don't do this. You can do polling in a piece of code that is executed 
> > infrequently, such as driver unload, but general functionality must not 
> > depend on polling.
> 100ms is the time the user must wait at termination, as you point out.
>
> However, I think this is a good time to
> explain why I didn't choose to use workqueue.
> 
> Keeping it a polling thread that consumes what in queue
> instead of replacing it with workqueue
> has these reasons:
> 
> 1. CPU overhead is lower
>    The queue_work overhead every time a flush job is queued
>    is crucial for writeboost. queue_work is too CPU-consuming
>    as far as I have looked at the code.
>    I have measured the theoretical maximum 4KB random-write throughput
>    of writeboost using the fio benchmark. It was 1.5GB/sec
>    with a fast enough cache device.
>    I don't think this extraordinary score can be achieved with
>    other caching software.
>    In this benchmark, the flush daemon didn't sleep at all
>    because the writes are ceaseless.
>    Using a workqueue would not only lose throughput
>    but also waste CPU cycles regularly, which is what I dislike.
> 2. Ease of error handling
>    Keeping it a single thread that loops and self-manages the queue
>    makes it easy to handle errors.
>    If we choose to use a workqueue, we need a mechanism to
>    * notify other work items of the occurrence of an error
>    * replay the work items in the same order, since waiting until the
>      system recovers for a long time is prohibited with workqueues,
>      as we discussed in the previous posts.
>      Segments must be flushed and then migrated in sequential
>      order by id.
>    Leaving control of the queue, which must stay in the desired order,
>    to the workqueue subsystem is dangerous in this case.

You don't have to use workqueues if you don't like them. You can use 
kernel threads and wait_event/wake_up instead. Workqueues are simpler to 
use in some circumstances (for example, you can submit the same work item 
multiple times) but it's up to you if you want that simplicity or not.

But you shouldn't do looping and waiting for events in migrate_proc.
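As a concrete illustration of that alternative, here is a hedged sketch of an event-driven migrate loop using a kernel thread plus wait_event/wake_up rather than a fixed polling interval. migrate_wait_queue, migrate_should_run() and do_migrate_some_segments() are illustrative names, not taken from the posted driver:

/*
 * Hypothetical sketch: event-driven migration daemon.
 */
static int migrate_proc(void *data)
{
	struct wb_cache *cache = data;

	while (!kthread_should_stop()) {
		/* Sleep until there is work (or we are asked to stop). */
		wait_event_interruptible(cache->migrate_wait_queue,
					 migrate_should_run(cache) ||
					 kthread_should_stop());

		if (kthread_should_stop())
			break;

		do_migrate_some_segments(cache);
	}
	return 0;
}

/* Producers (e.g. flush completion, wait_for_migration) then just do:	*/
/*	wake_up_interruptible(&cache->migrate_wait_queue);		*/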

> For migrate_proc,
> I see no reason why this daemon's polling is bad,
> although it actually sleeps in idle time.
> I

Re: A review of dm-writeboost

2013-10-06 Thread Akira Hayakawa
Mikulas,

Thank you for your review.

I will reply to the polling issue first.
For the rest, I will reply later.

> Polling for state
> -
> 
> Some of the kernel threads that you spawn poll for data in one-second 
> interval - see migrate_proc, modulator_proc, recorder_proc, sync_proc.
> 
> flush_proc correctly contains wait_event, but there is still some 100ms 
> polling timeout in flush_proc that shouldn't be there.
> 
> If you set the polling interval too low, it wastes CPU cycles and it wastes 
> energy due to CPU wake-ups. If you set the polling interval too high, it 
> causes artificial delays. For example, current middle-class SSDs can do 
> writes at a rate 500MB/s. Now, think that the user uses 400MB partition 
> for the cache - the partition is filled up in 0.8s and the kernel waits 
> for additional 0.2s doing absolutely nothing, until your polling code 
> wakes up and notices that it should start flushing the cache.
> 
> So - the polling code introduces artificial delays that can cause 
> performance degradation. You may think about lowering the polling interval 
> to lessen the performance impact, but if you lower the polling interval, 
> it increases CPU and power consumption when the driver is idle. Either 
> way, it is wrong. Just don't use polling.

You dislike the fact that a sleeping looping thread
can delay newly arriving jobs.

I am afraid your understanding
of the sleeping of the flush daemon is wrong.
Please see queue_flushing(): it wakes up the daemon
if the queue is empty.

spin_lock_irqsave(&cache->flush_queue_lock, flags);
empty = list_empty(&cache->flush_queue);
list_add_tail(&job->flush_queue, &cache->flush_queue);
spin_unlock_irqrestore(&cache->flush_queue_lock, flags);
/* the daemon can only be sleeping if the queue was empty - wake it up */
if (empty)
	wake_up_interruptible(&cache->flush_wait_queue);

Even if these wake-up lines were removed,
the flush daemon would wake up by itself
at worst 100ms after the first 1MB is written.

> Don't do this. You can do polling in a piece of code that is executed 
> infrequently, such as driver unload, but general functionality must not 
> depend on polling.
100ms is, as you say, the time the user must wait at termination.

However, I think this is a good time to
explain why I didn't choose to use a workqueue.

I kept a polling thread that consumes whatever is in the queue,
instead of replacing it with a workqueue,
for these reasons:

1. CPU overhead is lower
   The overhead of calling queue_work every time a flush job is
   queued is a critical cost for writeboost. queue_work is too
   CPU-consuming, as far as I can tell from reading the code.
   I measured the theoretical maximum 4KB random-write throughput
   of writeboost with the fio benchmark. It was 1.5GB/sec
   with a fast enough cache device.
   I don't think this extraordinary score can be achieved by
   other caching software.
   In this benchmark, the flush daemon didn't sleep at all
   because the writes are ceaseless.
   What I dislike is that using a workqueue would not only lose
   throughput but also waste CPU cycles on a regular basis.
2. Ease of error handling
   Keeping it a single thread that loops and manages its own queue
   makes it easy to handle errors.
   If we choose to use a workqueue, we need a mechanism to
   * notify other work items that an error has occurred
   * replay the work items in the same order, since waiting for a
     long time until the system recovers is prohibited inside a
     workqueue, as we discussed in the previous posts.
     Segments must be flushed and then migrated in sequential
     order of their ids.
   Leaving control of the queue, which must stay in the desired
   order, to the workqueue subsystem is dangerous in this case
   (see the sketch after this list).
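
A rough sketch of the single-thread design I mean (this is not the
actual writeboost code; pop_oldest_flush_job, flush_segment and
free_flush_job are made-up helpers):

/* one flush daemon: jobs are consumed strictly in segment-id order */
while (!kthread_should_stop()) {
	struct flush_job *job = pop_oldest_flush_job(cache);

	if (!job) {
		schedule_timeout_interruptible(msecs_to_jiffies(100));
		continue;
	}
	/* on error, retry the same job in place; later jobs simply
	   wait in the queue, so the segment-id order is preserved */
	while (flush_segment(cache, job) < 0)
		schedule_timeout_interruptible(msecs_to_jiffies(100));
	free_flush_job(job);
}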

For migrate_proc,
I see no reason why polling is bad for this daemon,
although it does sleep when it is idle.
In actual use of writeboost, we can safely assume that
the sequential write throughput of the cache device is much higher
than the random write throughput of the backing store,
so in the ordinary situation there are many dirty segments
to migrate and the migrate daemon is unlikely to sleep.
Furthermore, since the migration daemon is one more layer removed
from the user (the upper layer) than the flush daemon,
the one-second sleep interval matters less there.
If you still dislike the delay of this daemon,
I can forcefully wake it up when it is sleeping,
as is done for the flush daemon.

> First, you must design some event model - (a bunch of threads polling at a 
> one-second interval doesn't seem viable). For example, you can use workqueues 
> correctly this way:
> 
> * You create a workqueue for migration, but you don't submit any work to 
>   it.
> * If you fill the cache device above the watermark, you submit a work item 
>   to the migration workqueue (you can submit the same work item to the 
>   same workqueue multiple times - if the work item is still queued and 
>   hasn't started executing, the second submit is ignored; if the work item 
>   has started executing, the second submit causes that it is executed once
>   more).
> * The work 
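
For reference, a bare-bones sketch of the pattern described above
(the names migrate_wq, migrate_work and migrate_work_fn are
illustrative only, not from the driver):

static void migrate_work_fn(struct work_struct *work);
static DECLARE_WORK(migrate_work, migrate_work_fn);
static struct workqueue_struct *migrate_wq;

/* at construction time: create the workqueue, submit nothing yet */
migrate_wq = alloc_workqueue("wb_migrate", WQ_MEM_RECLAIM, 1);

/* whenever the cache fills above the watermark: */
queue_work(migrate_wq, &migrate_work);	/* re-queuing while the item
					   is still pending is a no-op */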

A review of dm-writeboost

2013-10-05 Thread Mikulas Patocka
Hi

I looked at the dm-writeboost source code and here is the list of 
problems I found:


Polling for state
-----------------

Some of the kernel threads that you spawn poll for data at a one-second 
interval - see migrate_proc, modulator_proc, recorder_proc, sync_proc.

flush_proc correctly contains wait_event, but there is still some 100ms 
polling timeout in flush_proc that shouldn't be there.


Don't do this. You can do polling in a piece of code that is executed 
infrequently, such as driver unload, but general functionality must not 
depend on polling.


If you set the polling interval too low, it wastes CPU cycles and it wastes 
energy due to CPU wake-ups. If you set the polling interval too high, it 
causes artificial delays. For example, current middle-class SSDs can do 
writes at a rate of 500MB/s. Now, think that the user uses a 400MB partition 
for the cache - the partition is filled up in 0.8s and the kernel waits 
for an additional 0.2s doing absolutely nothing, until your polling code 
wakes up and notices that it should start flushing the cache.

So - the polling code introduces artificial delays that can cause 
performance degradation. You may think about lowering the polling interval 
to lessen the performance impact, but if you lower the polling interval, 
it increases CPU and power consumption when the driver is idle. Either 
way, it is wrong. Just don't use polling.


Lack of locking/barriers
------------------------

migrate_proc lacks any lock or barrier. Note that processors may execute 
instructions out of order, and thus concurrent access without locks or 
barriers is always wrong.

Think of this piece of code:
nr_mig_candidates = cache->last_flushed_segment_id -
                    cache->last_migrated_segment_id;
...
nr_mig = min(nr_mig_candidates,
             cache->nr_cur_batched_migration);
...
for (i = 1; i <= nr_mig; i++) {
	seg = get_segment_header_by_id(
		cache,
		cache->last_migrated_segment_id + i);
	list_add_tail(&seg->migrate_list, &cache->migrate_list);
}

The processor may reorder all memory accesses - for example it may read 
the data accessed in the "for" loop before reading the variables 
"cache->last_flushed_segment_id" and "cache->last_migrated_segment_id". If 
this happens, the code may work with invalid data.

Similarly, the processor that updates cache->last_flushed_segment_id can 
make that update visible before the updates to the segment data themselves.

You need to use smp_wmb() before incrementing 
cache->last_flushed_segment_id in the producer process and smp_rmb() after 
reading cache->last_flushed_segment_id in the consumer process. Read 
Documentation/memory-barriers.txt for full explanation.

You can use locks instead of memory barriers, locks are simpler to use and 
simpler to verify, but they are usually slower than memory barriers.
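
For illustration, the pairing would look roughly like this (a sketch
following the names above, not the actual writeboost code):

/* producer (flush path): publish the segment data, then the id */
/* ... fill in the segment that was just flushed ... */
smp_wmb();	/* order: segment data before last_flushed_segment_id */
cache->last_flushed_segment_id = new_id;

/* consumer (migrate_proc): read the id, then the segment data */
flushed_id = cache->last_flushed_segment_id;
smp_rmb();	/* order: last_flushed_segment_id before segment data */
/* ... segments up to flushed_id can now be read safely ... */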


Nonatomic 64-bit variables
--------------------------

Note that 64-bit variables are not atomic on 32-bit architectures.

Linux assumes that 32-bit variables are atomic on 32-bit architectures and 
64-bit or 32-bit variables are atomic on 64-bit architectures. That is, 
variables having the "long" or "int" type are atomic. Atomicity means that 
if you read and modify the variable at the same time, you read either the 
old value or the new value.

64-bit variables on 32-bit architectures are usually read and written in 
two steps, and thus the atomicity requirement isn't true.

For example, suppose that you change cache->last_flushed_segment_id from 
0x00000000ffffffff to 0x0000000100000000. Now, the function migrate_proc 
that asynchronously reads cache->last_flushed_segment_id can read 
0x00000000ffffffff or 0x0000000100000000 (if it reads one of these values, 
it behaves correctly), but it can also read 0x0000000000000000 or 
0x00000001ffffffff - if migrate_proc reads one of these two values, it 
goes wild, migrating segments that weren't ever supposed to be migrated, 
and likely causes a crash or data corruption.

I found this bug in migrate_proc and update_superblock_record (but it may 
also be in other parts of the code).

You can use atomic64_t type for atomic 64-bit variables (plus memory 
barriers as explained in the previous section). Or you can use locks.
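
For illustration, a minimal sketch of the atomic64_t variant (not the
actual writeboost code):

atomic64_t last_flushed_segment_id;	/* declared in the cache struct */

/* producer: the new id becomes visible atomically */
atomic64_set(&cache->last_flushed_segment_id, new_id);

/* consumer (migrate_proc): reads either the old or the new id,
   never a torn half-and-half value */
u64 flushed_id = atomic64_read(&cache->last_flushed_segment_id);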

reserving_segment_id is also affected. However, you never actually need 
the value of reserving_segment_id, you only compare it to zero. Thus, you 
can change this variable to the "int" type and set it to "0" or "1". (you 
must use int and not bool, see the next section).


Variables smaller than 32 bits must not be asynchronously modified
------------------------------------------------------------------

Early Alpha processors can't access memory objects smaller than 32 bits - 
so, for example, if your code writes an 8-bit char or bool variable, the 
processor reads 32 bits to a register, modifies the 8-bit part of the 
register and writes 32 bits from the register back to memory. Now, if you 
define something like
un