Re: A review of dm-writeboost
On Sat, Oct 19, 2013 at 07:59:55PM +0900, Akira Hayakawa wrote:
> Dave,
>
> # -EIO returned corrupts XFS
>
> I turned up lockdep, frame pointer, xfs debug
> and also changed to 3.12.0-rc5 from rc1.
>
> What's changed is that the problem we discussed in previous mails
> *never* reproduces.
>
> However, if I turn off only lockdep,
> it hangs up by setting blockup to 1 and then to 0 after that.

Which indicates that the corruption is likely to be caused by a race
condition, and that adding lockdep slows things down enough to change
the timing and hence not hit the race condition... e.g. the
use-after-free that causes the memory corruption now occurs after the
critical point where XFS can get stuck on it.

> The underlying device that once became dead and then revives
> may confuse the filesystem, but
> this looks unrelated to locking.
> This problem may be reproducible using dm-flakey.

I use dm-flakey quite a bit, and I haven't seen this.

> This behavior is not reproducible with ext4.
>
> root@Lyle:/home/akira# virsh console Hercules
> Connected to domain Hercules
> Escape character is ^]
> [ 143.042980] device-mapper: writeboost: info@modulator_proc() reactivated after blockup
> [ 143.042999] device-mapper: writeboost: info@dm_safe_io_internal() reactivated after blockup
> [ 143.043006] device-mapper: writeboost: info@migrate_proc() reactivated after blockup
> [ 143.045328] XFS: metadata I/O error: block 0x300d0f ("xlog_iodone") error 5 numblks 64
> [ 143.045333] XFS: xfs_do_force_shutdown(0x2) called from line 1161 of file fs/xfs/xfs_log.c. Return address = 0xa0430f9c
> [ 143.045335] XFS: Log I/O Error Detected. Shutting down filesystem
> [ 143.045335] XFS: Please umount the filesystem and rectify the problem(s)
> [ 143.045338] XFS: Assertion failed: atomic_read(&iclog->ic_refcnt) == 0, file: fs/xfs/xfs_log.c, line: 2751
> [ 143.045465] Workqueue: xfslogd xfs_buf_iodone_work [xfs]
> [ 143.045558] [] xlog_state_done_syncing+0x6d/0xe4 [xfs]
> [ 143.045590] [] xlog_iodone+0xd0/0xd9 [xfs]
> [ 143.045609] [] xfs_buf_iodone_work+0x57/0xa2 [xfs]
> [ 143.045627] [] process_one_work+0x18b/0x297
> [ 143.045631] [] worker_thread+0x12e/0x1fb
> [ 143.045634] [] ? rescuer_thread+0x268/0x268
> [ 143.045638] [] kthread+0x88/0x90
> [ 143.045641] [] ? __kthread_parkme+0x60/0x60

So that has got through xlog_iodone() and has found that the reference
count of the iclog was not zero when it went to run the log IO
completion callbacks. We asserted that the reference count was zero
before issuing the IO, and we only ever increment the reference count
when the iclog is in an active state. We cannot increment the
reference count while the log is under IO because the state of the
iclog is "syncing", not "active".

Once the log is shut down, the iclog state goes to XLOG_STATE_IOERROR
and never comes out of that state. The assert just prior to the one
that tripped above indicates that we are either in an ACTIVE state or
IOERROR:

        ASSERT(iclog->ic_state == XLOG_STATE_SYNCING ||
               iclog->ic_state == XLOG_STATE_IOERROR);
>>      ASSERT(atomic_read(&iclog->ic_refcnt) == 0);

It was the second assert that failed, and hence it's clear that in
either state the reference count cannot be incremented until the IO is
fully completed and the iclog has transitioned back to the active
state.

> [ 143.045333] XFS: xfs_do_force_shutdown(0x2) called from line 1161 of file fs/xfs/xfs_log.c. Return address = 0xa0430f9c

This indicates that the shutdown was called from xlog_iodone() as a
result of an IO error on the iclog buffer, and the call to
xlog_state_done_syncing() happens 5 lines of code later, which
immediately assert fails with a corrupt iclog state. Because we shut
down with SHUTDOWN_LOG_IO_ERROR set, we call into
xfs_log_force_umount() with logerror = true.
This first call ends up setting the log flag XLOG_IO_ERROR, then
calling xlog_state_ioerror(), which sets ic_state on all iclogs to
XLOG_STATE_IOERROR. The shutdown then runs xlog_state_do_callback(),
which aborts the completions on all the iclogs that have callbacks
attached, and wakes anyone waiting on log space or log forces so they
can have errors returned to them.

We now return to xlog_iodone() in a shutdown state, with all the iclog
buffers having ic_state = XLOG_STATE_IOERROR. So there is absolutely
no opportunity for anyone to take a new reference to the iclog in the
10 *microseconds* between the IO error being detected, the shutdown
being processed, and the call to xlog_state_done_syncing() where the
assert fails. At this point, I cannot see how this can occur in the
XFS code.

Indeed, I can trigger this error path easily under heavy load just
using the godown utility in xfstests:

[1044535.986040] XFS (vdc): Mounting Filesystem
[1044536.040630] XFS (vdc): Ending clean mount
[1044536.059299] XFS: Clearing xfsstats

# src/xfstests-dev/src/godown -v
Re: A review of dm-writeboost
Dave,

# -EIO returned corrupts XFS

I turned up lockdep, frame pointer, xfs debug
and also changed to 3.12.0-rc5 from rc1.

What's changed is that the problem we discussed in previous mails
*never* reproduces.

However, if I turn off only lockdep,
it hangs up by setting blockup to 1 and then to 0 after that.

The underlying device that once became dead and then revives
may confuse the filesystem, but this looks unrelated to locking.
This problem may be reproducible using dm-flakey.

This behavior is not reproducible with ext4.

root@Lyle:/home/akira# virsh console Hercules
Connected to domain Hercules
Escape character is ^]
[ 143.042980] device-mapper: writeboost: info@modulator_proc() reactivated after blockup
[ 143.042999] device-mapper: writeboost: info@dm_safe_io_internal() reactivated after blockup
[ 143.043006] device-mapper: writeboost: info@migrate_proc() reactivated after blockup
[ 143.045328] XFS: metadata I/O error: block 0x300d0f ("xlog_iodone") error 5 numblks 64
[ 143.045333] XFS: xfs_do_force_shutdown(0x2) called from line 1161 of file fs/xfs/xfs_log.c. Return address = 0xa0430f9c
[ 143.045335] XFS: Log I/O Error Detected. Shutting down filesystem
[ 143.045335] XFS: Please umount the filesystem and rectify the problem(s)
[ 143.045338] XFS: Assertion failed: atomic_read(&iclog->ic_refcnt) == 0, file: fs/xfs/xfs_log.c, line: 2751
[ 143.045392] ------------[ cut here ]------------
[ 143.045393] kernel BUG at fs/xfs/xfs_message.c:108!
[ 143.045396] invalid opcode: [#1] SMP
[ 143.045429] Modules linked in: xfs crc32c libcrc32c dm_writeboost(O) fuse nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop snd_hda_intel snd_hda_codec snd_hwdep snd_pcm joydev hid_generic usbhid processor snd_page_alloc i2c_piix4 i2c_core snd_timer snd soundcore psmouse serio_raw evdev virtio_balloon pcspkr hid thermal_sys microcode button ext4 crc16 jbd2 mbcache dm_mod sg sr_mod cdrom ata_generic virtio_blk virtio_net floppy uhci_hcd ehci_hcd ata_piix usbcore libata usb_common scsi_mod virtio_pci virtio_ring virtio
[ 143.045433] CPU: 2 PID: 216 Comm: kworker/2:1H Tainted: G O 3.12.0-rc5 #11
[ 143.045434] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[ 143.045465] Workqueue: xfslogd xfs_buf_iodone_work [xfs]
[ 143.045466] task: 88002fe7e0c0 ti: 88002ff66000 task.ti: 88002ff66000
[ 143.045493] RIP: 0010:[] [] assfail+0x1d/0x1f [xfs]
[ 143.045494] RSP: :88002ff67d98 EFLAGS: 00010292
[ 143.045495] RAX: 005e RBX: 88020de3c280 RCX: 0c8a
[ 143.045496] RDX: RSI: 819c3f2c RDI: 818356e0
[ 143.045497] RBP: 88002ff67d98 R08: R09: 819c3ecc
[ 143.045498] R10: R11: R12: 8801fa583d28
[ 143.045499] R13: 8801fa583c00 R14: 0002 R15:
[ 143.045501] FS: () GS:88021fc8() knlGS:
[ 143.045502] CS: 0010 DS: ES: CR0: 8005003b
[ 143.045505] CR2: 7f6e7680db67 CR3: 0180b000 CR4: 06e0
[ 143.045511] Stack:
[ 143.045514] 88002ff67dc8 a0430ed1 8801fb124e40 0002
[ 143.045517] 88020de3c280 e8c80500 88002ff67df8 a0431018
[ 143.045519] 8801fb124ee0 8801fb124ee0 8801fb124e40 88021fc92a00
[ 143.045520] Call Trace:
[ 143.045558] [] xlog_state_done_syncing+0x6d/0xe4 [xfs]
[ 143.045590] [] xlog_iodone+0xd0/0xd9 [xfs]
[ 143.045609] [] xfs_buf_iodone_work+0x57/0xa2 [xfs]
[ 143.045627] [] process_one_work+0x18b/0x297
[ 143.045631] [] worker_thread+0x12e/0x1fb
[ 143.045634] [] ? rescuer_thread+0x268/0x268
[ 143.045638] [] kthread+0x88/0x90
[ 143.045641] [] ? __kthread_parkme+0x60/0x60
[ 143.045646] [] ret_from_fork+0x7c/0xb0
[ 143.045649] [] ? __kthread_parkme+0x60/0x60
[ 143.045670] Code: 48 c7 c7 6b 26 45 a0 e8 a4 2a c5 e0 5d c3 55 48 89 f1 41 89 d0 48 c7 c6 80 26 45 a0 48 89 fa 31 c0 48 89 e5 31 ff e8 aa fc ff ff <0f> 0b 55 48 63 f6 49 89 f9 41 b8 01 00 00 00 b9 10 00 00 00 ba
[ 143.045694] RIP [] assfail+0x1d/0x1f [xfs]
[ 143.045695] RSP
[ 143.045697] ---[ end trace e4f1a3c306353178 ]---
[ 143.045741] BUG: unable to handle kernel paging request at ffd8
[ 143.045745] IP: [] kthread_data+0xb/0x11
[ 143.045747] PGD 180c067 PUD 180e067 PMD 0
[ 143.045749] Oops: [#2] SMP
[ 143.045774] Modules linked in: xfs crc32c libcrc32c dm_writeboost(O) fuse nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop snd_hda_intel snd_hda_codec snd_hwdep snd_pcm joydev hid_generic usbhid processor snd_page_alloc i2c_piix4 i2c_core snd_timer snd soundcore psmouse serio_raw evdev virtio_balloon pcspkr hid thermal_sys microcode button ext4 crc16 jbd2 mbcache dm_mod sg sr_mod cdrom ata_generic virtio_blk virtio_net floppy uhci_hcd ehci_hcd ata_pii
Re: A review of dm-writeboost
On Wed, Oct 16, 2013 at 09:17:40PM +0900, Akira Hayakawa wrote:
> Dave,
>
> > XFS shuts down because you've returned EIO to a log IO. That's a
> > fatal error. If you do the same to an ext4 journal write, it will do
> > the equivalent of shut down (e.g. complain and turn read-only).
>
> You mean a block device should not return -EIO at all if
> it doesn't want XFS to suddenly shut down?

Yes. EIO means an IO error has occurred. That causes failure paths to
be triggered in the upper layers.

I really don't understand what you are trying to achieve with this
"blockup" thing. If something goes wrong with the device, then you
*cannot recover* by sending EIO to any new IOs and then continuing on
at a later time as though nothing has happened. The moment a
filesystem gets EIO from a metadata write, it is likely to be
corrupted, and if you continue onwards after that you simply propagate
the corruption.

> As Mikulas said, connection failure is often the cause of
> I/O errors from the underlying devices.

Connection failure is *rarely* the cause of IO errors, except in
environments where SANs are in use. Even then, multipathing makes
fatal connection failure a rare occurrence. Broken hardware is a much
more common cause of problems at the storage layers.

> That ext4 and XFS both decide to shut down on
> erroneous journal writes: is that due to a limitation of journal
> writes, or just a compromise in implementation? This is just for my
> curiosity.

A failed, unrecoverable journal write violates the filesystem
consistency model of any journalling filesystem. Operations must be
stopped and the hardware and filesystem must be repaired, otherwise
loss of data will occur. i.e. you're telling the filesystem that it's
had a fatal IO error by returning EIO, and the filesystems are
treating it as though they've seen a fatal IO error.

Simple rule: don't complete IOs with EIO if you haven't had a fatal
IO error.
> 	struct wb_cache *cache = data;
> 	struct wb_device *wb = cache->wb;
> 	unsigned long intvl;
>
> 	while (!kthread_should_stop()) {
>
> 		wait_on_blockup();

Ugh. You should be using a workqueue with timed work for this.

BTW, you're missing the handling needed by these kernel threads for
suspend-to-disk/ram.

> [ 68.825016] XFS (dm-3): Mounting Filesystem
> [ 68.847027] XFS (dm-3): Ending clean mount
> [ 72.100112] device-mapper: writeboost: err@dm_safe_io_internal() system is blocked up on I/O error. set blockup to 0 after checkup.
> [ 72.109702] device-mapper: writeboost: err@migrate_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
> [ 72.812097] device-mapper: writeboost: err@modulator_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
> [ 73.894429] Buffer I/O error on device dm-3, logical block 98354
> [ 73.895824] lost page write due to I/O error on dm-3

Data IO has been lost due to EIOs. You've corrupted user files when
this error is emitted.

...

> [ 73.930022] XFS (dm-3): metadata I/O error: block 0x40 ("xfs_buf_iodone_callbacks") error 5 numblks 16
> [ 74.036759] XFS (dm-3): metadata I/O error: block 0x300c7f ("xlog_iodone") error 5 numblks 64
> [ 74.043456] XFS (dm-3): xfs_do_force_shutdown(0x2) called from line 1161 of file fs/xfs/xfs_log.c. Return address = 0xa03a6417
> [ 74.047556] XFS (dm-3): Log I/O Error Detected. Shutting down filesystem
> [ 74.049893] XFS (dm-3): Please umount the filesystem and rectify the problem(s)
> [ 74.051467] XFS (dm-3): metadata I/O error: block 0x300cbf ("xlog_iodone") error 5 numblks 64
> [ 74.053190] XFS (dm-3): xfs_do_force_shutdown(0x2) called from line 1161 of file fs/xfs/xfs_log.c. Return address = 0xa03a6417

And these are all the log buffers containing uncommitted changes being
aborted due to EIO. The filesystem state in memory now doesn't match
the state on disk, and so it's effectively corrupt and shuts down.
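For reference, the pattern being suggested here looks roughly like the following. This is a kernel-style sketch with hypothetical field names (`sync_work`, `sync_wq` are not in the posted writeboost code) and is not compile-tested; it only illustrates replacing the polling kthread with a self-rearming delayed work item.

```c
/* Sketch: a self-rearming delayed work item in place of the
 * kthread polling loop. Hypothetical names; not writeboost code. */
static void sync_work_fn(struct work_struct *work)
{
	struct wb_cache *cache =
		container_of(to_delayed_work(work), struct wb_cache, sync_work);
	unsigned long intvl = ACCESS_ONCE(cache->sync_interval) * 1000;

	if (intvl) {
		flush_current_buffer(cache);
		blkdev_issue_flush(cache->device->bdev, GFP_NOIO, NULL);
	}

	/* re-arm; teardown uses cancel_delayed_work_sync() instead
	 * of kthread_should_stop() */
	queue_delayed_work(cache->sync_wq, &cache->sync_work,
			   msecs_to_jiffies(intvl ? intvl : 1000));
}

/* setup:    INIT_DELAYED_WORK(&cache->sync_work, sync_work_fn);
 *           queue_delayed_work(cache->sync_wq, &cache->sync_work, 0);
 * teardown: cancel_delayed_work_sync(&cache->sync_work);
 */
```

A workqueue allocated with WQ_FREEZABLE is also frozen across suspend-to-disk/ram automatically, which would address the second point above without any extra code in the work function.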
> [ 100.052005] BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:1H:215]
...
> [ 100.052005] RIP: 0010:[] [] do_raw_spin_lock+0x16/0x23
> [ 100.052005] Workqueue: xfslogd xfs_buf_iodone_work [xfs]
> [ 100.052005] Call Trace:
> [ 100.052005] [] ? xfs_buf_iodone+0x1b/0x49 [xfs]
> [ 100.052005] [] ? xfs_buf_do_callbacks+0x22/0x30 [xfs]
> [ 100.052005] [] ? xfs_buf_iodone_callbacks+0x16b/0x1c4 [xfs]
> [ 100.052005] [] ? process_one_work+0x191/0x294
> [ 100.052005] [] ? worker_thread+0x121/0x1e7
> [ 100.052005] [] ? rescuer_thread+0x269/0x269
> [ 100.052005] [] ? kthread+0x81/0x89
> [ 100.052005] [] ? __kthread_parkme+0x5d/0x5d
> [ 100.052005] [] ? ret_from_fork+0x7c/0xb0
> [ 100.052005] [] ? __kthread_parkme+0x5d/0x5d

You need to compile your kernel with framepointers enabled so we get
reliable stack traces. I think it's stuck on a spinlock in
xfs_buf_iodone, which would imply the AIL lock.

...

> [ 100.244006] CPU: 2 PID: 3167 Comm: xfsaild/dm-
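For the record, reliable frame-pointer-based stack traces are controlled by a kconfig option; a minimal .config fragment for a debug build of this kind would be (CONFIG_FRAME_POINTER is the relevant option; the others are common companions, not strictly required):

```
CONFIG_DEBUG_KERNEL=y
CONFIG_FRAME_POINTER=y
CONFIG_DEBUG_INFO=y
```

With CONFIG_FRAME_POINTER=y the unwinder can walk saved frame pointers instead of heuristically scanning the stack, so the `?` entries in traces like the one above largely disappear.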
Re: A review of dm-writeboost
Dave,

> XFS shuts down because you've returned EIO to a log IO. That's a
> fatal error. If you do the same to an ext4 journal write, it will do
> the equivalent of shut down (e.g. complain and turn read-only).

You mean a block device should not return -EIO at all if
it doesn't want XFS to suddenly shut down?

As Mikulas said, connection failure is often the cause of
I/O errors from the underlying devices.

That ext4 and XFS both decide to shut down on erroneous journal
writes: is that due to a limitation of journal writes, or just a
compromise in implementation? This is just for my curiosity.

>> [ 180.560040] device-mapper: writeboost: err@recorder_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
>> [ 180.561179] device-mapper: writeboost: err@sync_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
>
> What's with the 35s delay between these writeboost messages? Have
> you only done a partial shutdown of the block device and it takes
> this length of time for it to completely block IO?

Strange. These daemons should stop in a few seconds with the current
configuration.

Yes, partial in a sense. Not all the daemons stop immediately, but to
the client the logical device is seen as blocked up, returning -EIO on
every I/O. I don't think this behavior harms the upper layer.

Currently, sync_proc is like this: it sleeps for a few seconds, wakes
up, and meets wait_on_blockup(), which dumps that message and halts
the thread. recorder_proc is implemented in the same way.
int sync_proc(void *data)
{
	int r;
	struct wb_cache *cache = data;
	struct wb_device *wb = cache->wb;
	unsigned long intvl;

	while (!kthread_should_stop()) {

		wait_on_blockup();

		/* sec -> ms */
		intvl = ACCESS_ONCE(cache->sync_interval) * 1000;

		if (!intvl) {
			schedule_timeout_interruptible(msecs_to_jiffies(1000));
			continue;
		}

		flush_current_buffer(cache);

		RETRY(blkdev_issue_flush(cache->device->bdev, GFP_NOIO, NULL));

		schedule_timeout_interruptible(msecs_to_jiffies(intvl));
	}
	return 0;
}

XFS shuts down, goes crazy, and that disturbs the kthread's wakeup?

> These should be emitting a stack trace. Can you turn up the logging
> level you are using so that they emit a full stack trace? The
> messages are useless without the stack dump.

I turned the level up to 7. Here is the one.

Connected to domain Hercules
Escape character is ^]
[ 54.683482] device-mapper: writeboost: err@audit_cache_device() superblock header: magic number invalid
[ 54.809262] bio: create slab at 2
[ 68.812800] SGI XFS with ACLs, security attributes, realtime, large block/inode numbers, no debug enabled
[ 68.825016] XFS (dm-3): Mounting Filesystem
[ 68.847027] XFS (dm-3): Ending clean mount
[ 72.100112] device-mapper: writeboost: err@dm_safe_io_internal() system is blocked up on I/O error. set blockup to 0 after checkup.
[ 72.109702] device-mapper: writeboost: err@migrate_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
[ 72.812097] device-mapper: writeboost: err@modulator_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
[ 73.894429] Buffer I/O error on device dm-3, logical block 98354
[ 73.895824] lost page write due to I/O error on dm-3
[ 73.897042] Buffer I/O error on device dm-3, logical block 98355
[ 73.897209] Buffer I/O error on device dm-3, logical block 196641
[ 73.897210] lost page write due to I/O error on dm-3
[ 73.897263] Buffer I/O error on device dm-3, logical block 196688
[ 73.897264] lost page write due to I/O error on dm-3
[ 73.897266] Buffer I/O error on device dm-3, logical block 196689
[ 73.897267] lost page write due to I/O error on dm-3
[ 73.897268] Buffer I/O error on device dm-3, logical block 196690
[ 73.897269] lost page write due to I/O error on dm-3
[ 73.897270] Buffer I/O error on device dm-3, logical block 196691
[ 73.897271] lost page write due to I/O error on dm-3
[ 73.897272] Buffer I/O error on device dm-3, logical block 196692
[ 73.897273] lost page write due to I/O error on dm-3
[ 73.897307] Buffer I/O error on device dm-3, logical block 294955
[ 73.897308] lost page write due to I/O error on dm-3
[ 73.897335] Buffer I/O error on device dm-3, logical block 294956
[ 73.897335] lost page write due to I/O error on dm-3
[ 73.914261] lost page write due to I/O error on dm-3
[ 73.930022] XFS (dm-3): metadata I/O error: block 0x40 ("xfs_buf_iodone_callbacks") error 5 numblks 16
[ 74.036759] XFS (dm-3): metadata I/O error: block 0x300c7f ("xlog_iodone") error 5 numblks 64
[ 74.043456] XFS (dm-3): xfs_do_force_shutdown(0x2) called from line 1161 of file fs/xfs/xfs_log.c. Return address = 0xa03a6417
[ 74.047556] XFS (dm-3): Log I/O Error Detected
Re: A review of dm-writeboost
On Wed, Oct 16, 2013 at 07:34:38PM +0900, Akira Hayakawa wrote:
> Dave,
>
> > Akira, can you please post the entire set of messages you are
> > getting when XFS showing problems? That way I can try to confirm
> > whether it's a regression in XFS or something else.
>
> Environment:
> - The kernel version is 3.12-rc1
> - The debuggee is a KVM virtual machine equipped with 8 vcpus.
> - writeboost version is commit 236732eb84684e8473353812acb3302232e1eab0
>   You can clone it from https://github.com/akiradeveloper/dm-writeboost
>
> Test:
> 1. Make a writeboost device with a 3MB cache device and 3GB backing store
>    with the default options (segment size order is 7 and the RAM buffer is 2MB allocated).
> 2. Start the testing/1 script (compiling Ruby, then make test).
> 3. Set the blockup variable to 1 via the message interface a few seconds later.
>    The writeboost device starts to return -EIO on all incoming requests.
>    I guess this behavior causes the problem.
>
> In some cases, XFS doesn't collapse after setting blockup to 1.
> When I set the variable to 1 about 10 or 20 seconds later,
> it didn't collapse but neatly stopped the compile, and
> after I set it to 0 again, it restarted the compile.
> That XFS collapses (badly shutting down the filesystem, as seen below) in some cases
> but doesn't collapse in others sounds to me like
> the former case runs into a very corner-case bug.

XFS shuts down because you've returned EIO to a log IO. That's a
fatal error. If you do the same to an ext4 journal write, it will do
the equivalent of shut down (e.g. complain and turn read-only).

> The entire set of messages via virsh console is shown below.
> The lines related to writeboost are all benign.
> The daemons are just stopping because the blockup variable is 1.
>
> [ 146.284626] XFS (dm-3): metadata I/O error: block 0x300d91 ("xlog_iodone") error 5 numblks 64
> [ 146.285825] XFS (dm-3): Log I/O Error Detected. Shutting down filesystem
> [ 146.286699] XFS (dm-3): Please umount the filesystem and rectify the problem(s)

What happened before this? Please attach the *full* log.

> [ 146.560036] device-mapper: writeboost: err@modulator_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
> [ 147.244036] device-mapper: writeboost: err@migrate_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
> [ 172.052006] BUG: soft lockup - CPU#0 stuck for 23s! [script:3170]
> [ 172.436003] BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:1:57]

These should be emitting a stack trace. Can you turn up the logging
level you are using so that they emit a full stack trace? The
messages are useless without the stack dump.

Also, 23 seconds before this timestamp is 149s, about 3s after the XFS
filesystem shut down, so it's not clear that the XFS shutdown is
related to the soft lockup yet. That's what we need the stack traces
for...

> [ 180.560040] device-mapper: writeboost: err@recorder_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
> [ 180.561179] device-mapper: writeboost: err@sync_proc() system is blocked up on I/O error. set blockup to 0 after checkup.

What's with the 35s delay between these writeboost messages? Have you
only done a partial shutdown of the block device and it takes this
length of time for it to completely block IO?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: A review of dm-writeboost
Dave,

> Akira, can you please post the entire set of messages you are
> getting when XFS showing problems? That way I can try to confirm
> whether it's a regression in XFS or something else.

Environment:
- The kernel version is 3.12-rc1
- The debuggee is a KVM virtual machine equipped with 8 vcpus.
- writeboost version is commit 236732eb84684e8473353812acb3302232e1eab0
  You can clone it from https://github.com/akiradeveloper/dm-writeboost

Test:
1. Make a writeboost device with a 3MB cache device and 3GB backing store
   with the default options (segment size order is 7 and the RAM buffer is 2MB allocated).
2. Start the testing/1 script (compiling Ruby, then make test).
3. Set the blockup variable to 1 via the message interface a few seconds later.
   The writeboost device starts to return -EIO on all incoming requests.
   I guess this behavior causes the problem.

In some cases, XFS doesn't collapse after setting blockup to 1.
When I set the variable to 1 about 10 or 20 seconds later,
it didn't collapse but neatly stopped the compile, and
after I set it to 0 again, it restarted the compile.
That XFS collapses (badly shutting down the filesystem, as seen below) in some cases
but doesn't collapse in others sounds to me like
the former case runs into a very corner-case bug.

The entire set of messages via virsh console is shown below.
The lines related to writeboost are all benign.
The daemons are just stopping because the blockup variable is 1.

[ 146.284626] XFS (dm-3): metadata I/O error: block 0x300d91 ("xlog_iodone") error 5 numblks 64
[ 146.285825] XFS (dm-3): Log I/O Error Detected. Shutting down filesystem
[ 146.286699] XFS (dm-3): Please umount the filesystem and rectify the problem(s)
[ 146.560036] device-mapper: writeboost: err@modulator_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
[ 147.244036] device-mapper: writeboost: err@migrate_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
[ 172.052006] BUG: soft lockup - CPU#0 stuck for 23s! [script:3170]
[ 172.436003] BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:1:57]
[ 180.560040] device-mapper: writeboost: err@recorder_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
[ 180.561179] device-mapper: writeboost: err@sync_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
[ 200.052005] BUG: soft lockup - CPU#0 stuck for 23s! [script:3170]
[ 200.436005] BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:1:57]
[ 206.484005] INFO: rcu_sched self-detected stall on CPU { 0} (t=15000 jiffies g=1797 c=1796 q=3022)
[ 232.052007] BUG: soft lockup - CPU#0 stuck for 23s! [script:3170]
[ 232.436003] BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:1:57]
[ 260.052006] BUG: soft lockup - CPU#0 stuck for 23s! [script:3170]
[ 260.436004] BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:1:57]
[ 288.052006] BUG: soft lockup - CPU#0 stuck for 23s! [script:3170]
[ 288.436004] BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:1:57]

Akira
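As an aside on the test recipe above: the "message interface" used to set the blockup variable is device-mapper's standard one, driven from userspace with dmsetup. A sketch of the invocation (the device name is hypothetical; the `blockup` key is taken from the writeboost messages in this thread, and the exact argument syntax is the module's own):

```shell
# Send a message to the dm target mapped at sector 0 of the device.
# "wbdev" is a hypothetical mapped-device name.
dmsetup message wbdev 0 blockup 1   # start failing all I/O with -EIO
dmsetup message wbdev 0 blockup 0   # resume after checking the device
```

This requires root and an existing writeboost mapping, which is why the test is run inside a disposable KVM guest.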
Re: A review of dm-writeboost
[cc x...@oss.sgi.com]

On Tue, Oct 15, 2013 at 08:01:45PM -0400, Mikulas Patocka wrote:
> On Mon, 14 Oct 2013, Akira Hayakawa wrote:
> > But, XFS stalls ...
> > ---
> > For testing,
> > I manually turn `blockup` to 1
> > while compiling Ruby is in progress
> > on XFS on a writeboost device.
> > As soon as I do it,
> > XFS starts to dump error messages
> > like "metadata I/O error: ... ("xlog_iodone") error ..."
> > and after a few seconds it then starts to dump
> > like "BUG: soft lockup - CPU#3 stuck for 22s!".
> > The system stalls and doesn't accept the keyboard.
> >
> > I think this behavior is caused by
> > the device always returning -EIO after turning
> > the variable to 1.
> > But why does XFS stall on I/O error?
>
> Because it is bloated and buggy.

How did I know you'd take that cheap shot, Mikulas? You are so
predictable...

> We have bug 924301 for XFS crash on I/O error...

Which is a problem with memory corruption after filling a dm snapshot
volume to 100%, where shortly after XFS has shut down the kernel
panics from memory corruption. Can't be reproduced without filling the
dm-snapshot volume to 100%, can't be reproduced with any other
filesystem. Crashes are also occurring randomly in printk and the
worker thread infrastructure. Memory and list poisoning clearly
indicates worker thread lists have freed objects on them. There are
lockdep messages from the DM snapshot code, etc. There's actually very
little to point at XFS problems other than the first hang that was
reported, where XFS was stuck in a tight loop due to memory
corruption.
It reminds me of a very similar bug report and triage we went through
last week:

http://oss.sgi.com/pipermail/xfs/2013-October/030681.html

Further analysis and bisects pointed to the zram driver being buggy,
not XFS:

http://oss.sgi.com/pipermail/xfs/2013-October/030707.html

XFS has a history of exposing bugs in block device drivers that no
other filesystem exposes, and so when a new block device driver gets
tested with XFS and we start seeing memory corruption symptoms, it's a
fair bet that it's not XFS that is causing it.

Just sayin'.

---

Akira, can you please post the entire set of messages you are getting
when XFS is showing problems? That way I can try to confirm whether
it's a regression in XFS or something else.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: A review of dm-writeboost
Mikulas,

> I/Os shouldn't be returned with -ENOMEM. If they are, you can treat it as
> a hard error.

It seems that blkdev_issue_discard returns -ENOMEM when bio_alloc
fails, for example. Waiting for a second until we can allocate the
memory is my idea for handling a returned -ENOMEM.

> Blocking I/O until the admin turns a specific variable isn't too
> reliable.
>
> Think of this case - your driver detects I/O error and blocks all I/Os.
> The admin tries to log in. The login process needs memory. To fulfill this
> memory need, the login process writes out some dirty pages. Those writes
> are blocked by your driver - in the result, the admin is not able to log
> in and flip the switch to unblock I/Os.
>
> Blocking I/O indefinitely isn't good because any system activity
> (including typing commands into shell) may wait on this I/O.

I understand the problem. But what should I do then?

Since writeboost is cache software, it loses consistency if it ignores
the cache entirely when returning an I/O error. Panicking in that case
is also inappropriate (but inaccessibility of the storage will
eventually halt the whole system anyway, so panicking might be an
acceptable solution).

My idea is based on your past comment:

> If you can't handle a specific I/O request failure gracefully, you should
> mark the driver as dead, don't do any more I/Os to the disk or cache
> device and return -EIO on all incoming requests.
>
> Always think that I/O failures can happen because of connection problems,
> not data corruption problems - for example, a disk cable can go loose, a
> network may lose connectivity, etc. In these cases, it is best to stop
> doing any I/O at all and let the user resolve the situation.

1) On failure, mark the driver dead (set `blockup` to 1 in my case)
   and return -EIO on all incoming requests. Yes.
2) Then wait for the user to resolve the situation (keep returning
   -EIO until the admin turns `blockup` to 0 after a checkup, in my
   case). Yes.
Did you mean we should not provide any way to recover the system
because the admin may not be able to reach the switch? Should the
writeboost module instead autonomously check whether the troubled
device has recovered?

Retrying the I/O submission and detecting recovery on I/O success is
one solution, and I have implemented it. I/O retry doesn't destroy any
consistency in writeboost; sooner or later it simply becomes unable to
accept further writes because it runs out of RAM buffers, which can
only be reused after a successful I/O to the cache device.

Akira
Re: A review of dm-writeboost
On Mon, 14 Oct 2013, Akira Hayakawa wrote: > Hi, DM Guys > > I suppose I have finished the tasks to > answer Mikulas's pointing outs. > So, let me update the progress report. > > The code is updated now on my Github repo. > Checkout the develop branch to avail > the latest source code. > > Compilation Status > -- > First, compilation status. > Mikulas's advised me to compile the module in > 32 bit environment. and Yes, I did. > With all these kernels listed below > writeboost can compile without any error nor warning. > > For 64 bit > 3.2.36, 3.4.25, 3.5.7, 3.6.9, 3.7.5, 3.8.5, 3.9.8, 3.10.5, 3.11.4 and 3.12-rc1 > > For 32 bit > 3.2.0-4-686-pae (Debian 7.1.0-i386) > > > Block up on error > - > The most annoying thing in this update > is how to handle the I/O error. > For memory allocation error, > writeboost now makes use of mempool to avoid > the problem Mikulas's said in his last comments > but handling I/O error gracefully > when the system is running is very difficult. > > My answer is > all the daemon stop when I/O error (-EIO returned) happens > in any part of this module. > They waits on wait_queue (blockup_wait_queue) > and reactivates when sysadmin turns `blockup` variable to 0 > through message interface. > When `blockup` is 1, all the incoming I/O > are returned as -EIO to the upper layer. > RETRY macro is introduced > which wraps doing I/O and > retries I/O submission if the error is -ENOMEN but I/Os shouldn't be returned with -ENOMEM. If they are, you can treat it as a hard error. > turns blockup to 1 and sleeps if the error is -EIO. > -EIO is more serious than -ENOMEM because > it may destroy the storage for some accidental problem > that we have no control in device-mapper layer > (e.g. the storage controller went crazy). > Blocking up the whole I/O is to minimize the > probable damage. > > But, XFS stalls ... > --- > For testing, > I manually turns `blockup` to 1 > when compiling Ruby is in progress > on XFS on a writeboost device. 
> As soon as I do it, XFS starts to dump error messages like
> "metadata I/O error: ... ("xlog_iodone") error ..."
> and after a few seconds it starts to dump messages like
> "BUG: soft lockup - CPU#3 stuck for 22s!".
> The system stalls and doesn't accept keyboard input.
>
> I think this behavior is caused by the device always returning -EIO after
> the variable is turned to 1. But why does XFS stall on I/O error?

Because it is bloated and buggy. We have bug 924301 for an XFS crash on I/O
error...

> It should just suspend and start returning errors to the upper layer as
> writeboost now does. As Mikulas said, an I/O error is often due to a
> connection failure that is usually recoverable. Stalling the kernel
> requires a reboot after recovery, whereas writeboost can recover just by
> turning `blockup` back to 0. Is there any reason for this design, or an
> option to not stall XFS on I/O error?
>
> Thanks,
> Akira

Blocking I/O until the admin flips a specific variable isn't too reliable.

Think of this case - your driver detects an I/O error and blocks all I/Os.
The admin tries to log in. The login process needs memory. To fulfill this
memory need, the login process writes out some dirty pages. Those writes
are blocked by your driver - as a result, the admin is not able to log in
and flip the switch to unblock I/Os.

Blocking I/O indefinitely isn't good because any system activity (including
typing commands into a shell) may wait on this I/O.

Mikulas
Re: A review of dm-writeboost
Hi, DM Guys

I suppose I have finished the tasks to answer Mikulas's points. So, let me
update the progress report.

The code is now updated on my GitHub repo. Check out the develop branch for
the latest source code.

Compilation Status
--

First, compilation status. Mikulas advised me to compile the module in a
32-bit environment, and yes, I did. With all the kernels listed below,
writeboost compiles without any error or warning.

For 64 bit:
3.2.36, 3.4.25, 3.5.7, 3.6.9, 3.7.5, 3.8.5, 3.9.8, 3.10.5, 3.11.4 and 3.12-rc1

For 32 bit:
3.2.0-4-686-pae (Debian 7.1.0-i386)

Block up on error
-

The most annoying thing in this update is how to handle I/O errors. For
memory allocation errors, writeboost now makes use of a mempool to avoid the
problem Mikulas raised in his last comments, but handling I/O errors
gracefully while the system is running is very difficult.

My answer is that all the daemons stop when an I/O error (-EIO returned)
happens in any part of this module. They wait on a wait queue
(blockup_wait_queue) and reactivate when the sysadmin turns the `blockup`
variable to 0 through the message interface. When `blockup` is 1, all
incoming I/O is returned as -EIO to the upper layer. A RETRY macro is
introduced which wraps the I/O submission: it retries if the error is
-ENOMEM but turns blockup to 1 and sleeps if the error is -EIO. -EIO is more
serious than -ENOMEM because it may mean the storage is broken due to some
accidental problem that we have no control over in the device-mapper layer
(e.g. the storage controller went crazy). Blocking up all I/O is meant to
minimize the probable damage.

But, XFS stalls ...
---

For testing, I manually turn `blockup` to 1 while a compile of Ruby is in
progress on XFS on a writeboost device. As soon as I do it, XFS starts to
dump error messages like "metadata I/O error: ... ("xlog_iodone") error ..."
and after a few seconds it starts to dump messages like "BUG: soft lockup -
CPU#3 stuck for 22s!". The system stalls and doesn't accept keyboard input.
I think this behavior is caused by the device always returning -EIO after
the variable is turned to 1. But why does XFS stall on I/O error? It should
just suspend and start returning errors to the upper layer as writeboost now
does. As Mikulas said, an I/O error is often due to a connection failure
that is usually recoverable. Stalling the kernel requires a reboot after
recovery, whereas writeboost can recover just by turning `blockup` back to
0. Is there any reason for this design, or an option to not stall XFS on I/O
error?

Thanks,
Akira

--
Followed by changes to Driver/* and Documentation:

diff --git a/Driver/dm-writeboost-daemon.c b/Driver/dm-writeboost-daemon.c
index cb8febe..7a7f353 100644
--- a/Driver/dm-writeboost-daemon.c
+++ b/Driver/dm-writeboost-daemon.c
@@ -12,9 +12,11 @@
 int flush_proc(void *data)
 {
+	int r;
 	unsigned long flags;
 	struct wb_cache *cache = data;
+	struct wb_device *wb = cache->wb;

 	while (true) {
 		struct flush_job *job;
@@ -22,13 +24,13 @@ int flush_proc(void *data)
 		struct dm_io_request io_req;
 		struct dm_io_region region;

+		wait_on_blockup();
+
 		spin_lock_irqsave(&cache->flush_queue_lock, flags);
 		while (list_empty(&cache->flush_queue)) {
 			spin_unlock_irqrestore(&cache->flush_queue_lock, flags);
-			wait_event_interruptible_timeout(
-				cache->flush_wait_queue,
-				(!list_empty(&cache->flush_queue)),
-				msecs_to_jiffies(100));
+
+			schedule_timeout_interruptible(msecs_to_jiffies(1000));

 			/*
 			 * flush daemon can exit
@@ -49,6 +51,8 @@
 		list_del(&job->flush_queue);
 		spin_unlock_irqrestore(&cache->flush_queue_lock, flags);

+		smp_rmb();
+
 		seg = job->seg;

 		io_req = (struct dm_io_request) {
@@ -65,9 +69,9 @@
 			.count = (seg->length + 1) << 3,
 		};

-		dm_safe_io_retry(&io_req, 1, &region, false);
+		RETRY(dm_safe_io(&io_req, 1, &region, NULL, false));

-		cache->last_flushed_segment_id = seg->global_id;
+		atomic64_set(&cache->last_flushed_segment_id, seg->global_id);

 		complete_all(&seg->flush_done);

@@ -78,15 +82,15 @@ int flush_proc(void *data)
 		 */
 		if (!bio_list_empty(&job->barrier_ios)) {
 			struct bio *bio;
-			blkdev_issue_flush(cache->device->bdev, GFP_NOIO, NULL);
+			RETRY(blkdev_issue_flush(cache->device->bdev, GFP_NOIO, NULL));
 			while ((bio = b
Re: A review of dm-writeboost
Mikulas,

> Next, you need to design some locking - which variables are protected by
> which locks. If you use shared variables without locks, you need to use
> memory barriers (it is harder to design code using memory barriers than
> locks).

First I will explain the locking and the shared variables. writeboost
doesn't have many variables shared among asynchronous threads.

1. We have a hash table for cache lookup and a RAM buffer that transiently
holds write data. These data structures are protected by a mutex, io_lock.
barrier_ios is a queue holding bios flagged with REQ_FUA or REQ_FLUSH; this
queue is also protected by io_lock, and these bios are chained to a flush
job inside io_lock. queue_flushing is a routine that makes a flush job and
queues it in flush_queue, which is protected by a spinlock,
flush_queue_lock. The foreground (.map) is the producer of flush jobs for
the flush daemon.

2. The flush daemon loops flush_proc in a single thread. It pops and
consumes one flush job at a time from flush_queue (writing to the cache
device). Therefore, the foreground must ensure the flush job is completely
initialized before queueing it. After completing the write, it sets
(actually increments) last_flushed_segment_id. This daemon is the only one
to update this shared variable.

3. The migrate daemon loops migrate_proc, also in a single thread. It first
calculates how many segments to migrate (nr_mig), from
last_migrated_segment_id + 1 to last_migrated_segment_id + nr_mig, by
referring to last_flushed_segment_id and last_migrated_segment_id. The
former is the head of the dirty log and the latter is the tail. Note that
the head doesn't have to be synchronized, since last_flushed_segment_id
increases linearly; an unsynchronized read of it only results in fewer
migrations than actually possible, and doesn't break anything at all. After
migrating all the nr_mig dirty segments, it increments
last_migrated_segment_id by nr_mig. This daemon is the only one to update
last_migrated_segment_id.

Other daemons don't share such variables.
> Think of this piece of code:
>
>	nr_mig_candidates = cache->last_flushed_segment_id -
>			    cache->last_migrated_segment_id;
>	...
>	nr_mig = min(nr_mig_candidates,
>		     cache->nr_cur_batched_migration);
>	...
>	for (i = 1; i <= nr_mig; i++) {
>		seg = get_segment_header_by_id(
>				cache,
>				cache->last_migrated_segment_id + i);
>		list_add_tail(&seg->migrate_list, &cache->migrate_list);
>	}
>
> The processor may reorder all memory accesses - for example it may read
> the data accessed in the "for" cycle before reading the variables
> "cache->last_flushed_segment_id" and "cache->last_migrated_segment_id". If
> this happens, the code may work with invalid data.

True? My understanding is that reordering does not happen for these three
blocks of code because there are sequential data dependencies. Is the point
that the last block is a while loop?

But I sorted out two other points where we need to insert memory barriers.

First is in queue_flushing. We can't rule out that an incompletely
initialized flush job is queued and flush_proc goes bad. I am thinking of
inserting smp_wmb before queueing and smp_rmb after popping a job in
flush_proc.

Second is in migrate_proc. We can't rule out that migrate_proc goes into
migrate_linked_segments before cache->migrate_list is properly initialized.
I am thinking of inserting smp_wmb before migrate_linked_segments.

I don't understand memory barriers well, so please correct me if I am going
the wrong way.

> Nonatomic 64-bit variables
> --
>
> Note that 64-bit variables are not atomic on 32-bit architectures.
>
> Linux assumes that 32-bit variables are atomic on 32-bit architectures and
> 64-bit or 32-bit variables are atomic on 64-bit architectures. That is,
> variables having the "long" or "int" type are atomic. Atomicity means that
> if you read and modify the variable at the same time, you read either the
> old value or the new value.
>
> 64-bit variables on 32-bit architectures are usually read and written in
> two steps, and thus the atomicity requirement doesn't hold.
>
> For example, suppose that you change cache->last_flushed_segment_id from
> 0x00000000ffffffff to 0x0000000100000000. Now, the function migrate_proc
> that asynchronously reads cache->last_flushed_segment_id can read
> 0x00000000ffffffff or 0x0000000100000000 (if it reads one of these values,
> it behaves correctly), but it can also read 0x0000000000000000 or
> 0x00000001ffffffff - if migrate_proc reads one of these two values, it
> goes wild, migrating segments that weren't ever supposed to be migrated,
> and likely causes a crash or data corruption.
>
> I found this bug in migrate_proc and update_superblock_record (but it may
> be also in other parts of the code).
>
> You can use atomic64_t type for atomic 64-bit variables (plus memory
> barriers as explained in the previous section). Or
Re: A review of dm-writeboost
On Sun, 6 Oct 2013, Akira Hayakawa wrote:

> Mikulas,
>
> Thank you for your review.
>
> I will reply to the polling issue first. For the rest, I will reply later.
>
> > Polling for state
> > -
> >
> > Some of the kernel threads that you spawn poll for data in one-second
> > intervals - see migrate_proc, modulator_proc, recorder_proc, sync_proc.
> >
> > flush_proc correctly contains wait_event, but there is still some 100ms
> > polling timeout in flush_proc that shouldn't be there.
> >
> > If you set the polling interval too low, it wastes CPU cycles and it
> > wastes energy due to CPU wake-ups. If you set the polling interval too
> > high, it causes artificial delays. For example, current middle-class
> > SSDs can do writes at a rate of 500MB/s. Now, suppose the user uses a
> > 400MB partition for the cache - the partition is filled up in 0.8s and
> > the kernel waits for an additional 0.2s doing absolutely nothing, until
> > your polling code wakes up and notices that it should start flushing
> > the cache.
> >
> > So - the polling code introduces artificial delays that can cause
> > performance degradation. You may think about lowering the polling
> > interval to lessen the performance impact, but if you lower the polling
> > interval, it increases CPU and power consumption when the driver is
> > idle. Either way, it is wrong. Just don't use polling.
>
> You dislike the fact that a sleeping loop thread can delay newly arriving
> jobs.
>
> I am afraid your understanding of the sleeping of the flush daemon is
> wrong. Please see queue_flushing(): it wakes up the daemon if the queue
> is empty.
>
> 	spin_lock_irqsave(&cache->flush_queue_lock, flags);
> 	empty = list_empty(&cache->flush_queue);
> 	list_add_tail(&job->flush_queue, &cache->flush_queue);
> 	spin_unlock_irqrestore(&cache->flush_queue_lock, flags);
> 	if (empty)
> 		wake_up_interruptible(&cache->flush_wait_queue);
>
> Even if these wake-up lines are removed, the flush daemon wakes up by
> itself, 100ms at worst after the first 1MB is written.

flush_proc is woken up correctly. But the other threads aren't.
migrate_proc, modulator_proc, recorder_proc and sync_proc all do polling.

Waking up every 100ms in flush_proc is not good because it wastes CPU time
and energy if the driver is idle.

BTW. You should use wait_event_interruptible_lock_irq instead of
wait_event_interruptible and wait_event_interruptible_lock_irq_timeout
instead of wait_event_interruptible_timeout. The functions with "lock_irq"
automatically drop and re-acquire the spinlock, so that the condition is
tested while the lock is held - so that there are no asynchronous accesses
to !list_empty(&cache->flush_queue).

> > Don't do this. You can do polling in a piece of code that is executed
> > infrequently, such as driver unload, but general functionality must not
> > depend on polling.
>
> 100ms is the time the user must wait at termination, as you claim.
>
> However, I think this is a good time to explain why I didn't choose to
> use a workqueue.
>
> Keeping it a polling thread that consumes what is in the queue, instead
> of replacing it with a workqueue, has these reasons:
>
> 1. CPU overhead is lower
>    The queue_work overhead every time a flush job is queued is crucial
>    for writeboost. queue_work is too CPU-consuming as far as I looked at
>    the code.
>    I have measured the theoretical maximum 4KB randwrite throughput of
>    writeboost using the fio benchmark. It was 1.5GB/sec with a fast
>    enough cache device.
>    I don't think this extraordinary score can be achieved with other
>    caching software.
>    In this benchmark, the flush daemon didn't sleep at all because the
>    writes were ceaseless.
>    Using a workqueue would not only lose throughput but also regularly
>    waste CPU cycles, which I dislike.
> 2. Ease of Error Handling
>    Keeping it a single thread looping and self-managing the queue makes
>    it easy to handle errors.
>    If we choose to use a workqueue, we need a mechanism to
>    * notify the occurrence of an error to other work items
>    * play back the work items in the same order, since waiting a long
>      time until the system recovers is prohibited with a workqueue, as
>      we discussed in the previous posts. Segments must be flushed and
>      then migrated in sequential order along their ids.
>    Leaving the control of the queue, which should stay in the desired
>    order, to the workqueue subsystem is dangerous in this case.

You don't have to use workqueues if you don't like them. You can use kernel
threads and wait_event/wake_up instead. Workqueues are simpler to use in
some circumstances (for example, you can submit the same work item multiple
times) but it's up to you if you want that simplicity or not.

But you shouldn't do looping and waiting for events in migrate_proc.

> For migrate_proc, I see no reason why this daemon polling is bad,
> although it actually sleeps in its leisure time.
> I
Re: A review of dm-writeboost
Mikulas,

Thank you for your review.

I will reply to the polling issue first. For the rest, I will reply later.

> Polling for state
> -
>
> Some of the kernel threads that you spawn poll for data in one-second
> intervals - see migrate_proc, modulator_proc, recorder_proc, sync_proc.
>
> flush_proc correctly contains wait_event, but there is still some 100ms
> polling timeout in flush_proc that shouldn't be there.
>
> If you set the polling interval too low, it wastes CPU cycles and it
> wastes energy due to CPU wake-ups. If you set the polling interval too
> high, it causes artificial delays. For example, current middle-class SSDs
> can do writes at a rate of 500MB/s. Now, suppose the user uses a 400MB
> partition for the cache - the partition is filled up in 0.8s and the
> kernel waits for an additional 0.2s doing absolutely nothing, until your
> polling code wakes up and notices that it should start flushing the
> cache.
>
> So - the polling code introduces artificial delays that can cause
> performance degradation. You may think about lowering the polling
> interval to lessen the performance impact, but if you lower the polling
> interval, it increases CPU and power consumption when the driver is idle.
> Either way, it is wrong. Just don't use polling.

You dislike the fact that a sleeping loop thread can delay newly arriving
jobs.

I am afraid your understanding of the sleeping of the flush daemon is
wrong. Please see queue_flushing(): it wakes up the daemon if the queue is
empty.

	spin_lock_irqsave(&cache->flush_queue_lock, flags);
	empty = list_empty(&cache->flush_queue);
	list_add_tail(&job->flush_queue, &cache->flush_queue);
	spin_unlock_irqrestore(&cache->flush_queue_lock, flags);
	if (empty)
		wake_up_interruptible(&cache->flush_wait_queue);

Even if these wake-up lines are removed, the flush daemon wakes up by
itself, 100ms at worst after the first 1MB is written.

> Don't do this.
> You can do polling in a piece of code that is executed infrequently, such
> as driver unload, but general functionality must not depend on polling.

100ms is the time the user must wait at termination, as you claim.

However, I think this is a good time to explain why I didn't choose to use
a workqueue.

Keeping it a polling thread that consumes what is in the queue, instead of
replacing it with a workqueue, has these reasons:

1. CPU overhead is lower
   The queue_work overhead every time a flush job is queued is crucial for
   writeboost. queue_work is too CPU-consuming as far as I looked at the
   code.
   I have measured the theoretical maximum 4KB randwrite throughput of
   writeboost using the fio benchmark. It was 1.5GB/sec with a fast enough
   cache device.
   I don't think this extraordinary score can be achieved with other
   caching software.
   In this benchmark, the flush daemon didn't sleep at all because the
   writes were ceaseless.
   Using a workqueue would not only lose throughput but also regularly
   waste CPU cycles, which I dislike.

2. Ease of Error Handling
   Keeping it a single thread looping and self-managing the queue makes it
   easy to handle errors.
   If we choose to use a workqueue, we need a mechanism to
   * notify the occurrence of an error to other work items
   * play back the work items in the same order, since waiting a long time
     until the system recovers is prohibited with a workqueue, as we
     discussed in the previous posts. Segments must be flushed and then
     migrated in sequential order along their ids.
   Leaving the control of the queue, which should stay in the desired
   order, to the workqueue subsystem is dangerous in this case.

For migrate_proc, I see no reason why this daemon polling is bad, although
it actually sleeps in its leisure time.
In using writeboost, we can obviously assume that the sequential write
throughput of the cache device is much higher than the random write
throughput of the backing store, so the ordinary situation is that the
migrate daemon is unlikely to sleep: there are many dirty segments to
migrate. Furthermore, since the migration daemon is one more block removed
from the user (the upper layer) than the flush daemon, the one-second
sleeping problem is weaker, or diluted. If you still dislike the delay of
this daemon, I can forcefully wake it up when it is sleeping, like the
flush daemon.

> First, you must design some event model - (a bunch of threads polling in
> 1-second intervals doesn't seem viable). For example, you can use
> workqueues correctly this way:
>
> * You create a workqueue for migration, but you don't submit any work to
>   it.
> * If you fill the cache device above the watermark, you submit a work
>   item to the migration workqueue (you can submit the same work item to
>   the same workqueue multiple times - if the work item is still queued
>   and hasn't started executing, the second submit is ignored; if the work
>   item has started executing, the second submit causes that it is
>   executed once more).
> * The work