Re: [PATCHSET v5] blk-mq: reimplement timeout handling
On Sun, Jan 14, 2018 at 11:12 PM, jianchao.wang wrote:
>
> On 01/13/2018 05:19 AM, Bart Van Assche wrote:
>> Sorry but I only retrieved the blk-mq debugfs output several minutes after
>> the hang started, so I'm not sure the state information is relevant.
>> Anyway, I have attached it to this e-mail. The most remarkable part is the
>> following:
>>
>> ./9ddfa913/requeue_list:9646711c {.op=READ, .state=idle, gen=0x118,
>> abort_gen=0x0, .cmd_flags=, .rq_flags=SORTED|1|SOFTBARRIER|IO_STAT,
>> complete=0, .tag=-1, .internal_tag=217}
>>
>> The hexadecimal number at the start is the request_queue pointer (I
>> modified the blk-mq-debugfs code such that queues are registered with
>> their address just after creation and until a name is assigned). This is
>> a dm-mpath queue.
>
> There seems to be something wrong in hctx->nr_active.

Then it looks like the same issue I saw while starting multipathd, and the
following patch should fix it, if there isn't another issue:

https://marc.info/?l=linux-block&m=151586577400558&w=2

--
Ming Lei
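(For context on Ming's nr_active point: on a shared tag set the counter is
supposed to be bumped exactly once when a request enters flight and dropped
exactly once when the request is freed. The sketch below paraphrases the
4.15-era accounting in block/blk-mq.c from memory -- illustrative, not a
verbatim excerpt; check the actual tree before relying on details.)

    /* allocation side, roughly as in blk_mq_rq_ctx_init(): */
    if (blk_mq_tag_busy(data->hctx)) {      /* true on shared tag sets */
            rq->rq_flags |= RQF_MQ_INFLIGHT;        /* mark: we counted this rq */
            atomic_inc(&data->hctx->nr_active);
    }

    /* free side, roughly as in blk_mq_free_request(): */
    if (rq->rq_flags & RQF_MQ_INFLIGHT)
            atomic_dec(&hctx->nr_active);

If any path completes or recycles a request without the matching decrement,
nr_active only ever grows, and the fair-share check in hctx_may_queue()
(sketched below) eventually refuses every tag allocation -- which matches the
hangs in blk_mq_get_tag().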
Re: [PATCHSET v5] blk-mq: reimplement timeout handling
On 01/13/2018 05:19 AM, Bart Van Assche wrote:
> Sorry but I only retrieved the blk-mq debugfs output several minutes after
> the hang started, so I'm not sure the state information is relevant.
> Anyway, I have attached it to this e-mail. The most remarkable part is the
> following:
>
> ./9ddfa913/requeue_list:9646711c {.op=READ, .state=idle, gen=0x118,
> abort_gen=0x0, .cmd_flags=, .rq_flags=SORTED|1|SOFTBARRIER|IO_STAT,
> complete=0, .tag=-1, .internal_tag=217}
>
> The hexadecimal number at the start is the request_queue pointer (I
> modified the blk-mq-debugfs code such that queues are registered with
> their address just after creation and until a name is assigned). This is
> a dm-mpath queue.

There seems to be something wrong in hctx->nr_active:

./sde/hctx2/cpu2/completed:2 3
./sde/hctx2/cpu2/merged:0
./sde/hctx2/cpu2/dispatched:2 3
./sde/hctx2/active:5
./sde/hctx1/cpu1/completed:2 38
./sde/hctx1/cpu1/merged:0
./sde/hctx1/cpu1/dispatched:2 38
./sde/hctx1/active:40
./sde/hctx0/cpu0/completed:20 11
./sde/hctx0/cpu0/merged:0
./sde/hctx0/cpu0/dispatched:20 11
./sde/hctx0/active:31
...
./sdc/hctx1/cpu1/completed:14 13
./sdc/hctx1/cpu1/merged:0
./sdc/hctx1/cpu1/dispatched:14 13
./sdc/hctx1/active:21
./sdc/hctx0/cpu0/completed:1 41
./sdc/hctx0/cpu0/merged:0
./sdc/hctx0/cpu0/dispatched:1 41
./sdc/hctx0/active:36

Then hctx_may_queue() returns false.

Thanks,
Jianchao
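(For reference, hctx_may_queue() is the fairness gate in blk-mq's tag
allocator: when a tag set is shared, each active hardware queue only gets a
fraction of the tag space, computed against hctx->nr_active. The following is
a paraphrase of the 4.15-era block/blk-mq-tag.c from memory, not a verbatim
excerpt.)

    static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
                                      struct sbitmap_queue *bt)
    {
            unsigned int depth, users;

            /* fairness only matters when the tag set is shared */
            if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_SHARED))
                    return true;
            if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
                    return true;

            users = atomic_read(&hctx->tags->active_queues);
            if (!users)
                    return true;

            /* each active queue gets roughly depth/users tags, at least 4 */
            depth = max((bt->sb.depth + users - 1) / users, 4U);

            /* an inflated nr_active makes this false forever */
            return atomic_read(&hctx->nr_active) < depth;
    }

With active counts of 21-40 as in the dump above, a leaked hctx->nr_active
easily exceeds the per-queue share, and every allocator simply waits.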
Re: [PATCHSET v5] blk-mq: reimplement timeout handling
On 1/12/18 2:19 PM, Bart Van Assche wrote:
> On Fri, 2018-01-12 at 14:07 -0700, Jens Axboe wrote:
>> You're really not making it easy for folks to run this :-)
>
> My hope is that the ib_srp and ib_srpt patches will be accepted upstream
> soon. As long as these are not upstream, anyone who wants to retrieve these
> patches is welcome to clone
> https://github.com/bvanassche/linux/tree/block-scsi-for-next,
> a kernel tree with all my pending patches.
>
>> Do you have the matching blk-mq debugfs output for the device?
>
> Sorry but I only retrieved the blk-mq debugfs output several minutes after
> the hang started, so I'm not sure the state information is relevant.
> Anyway, I have attached it to this e-mail. The most remarkable part is the
> following:
>
> ./9ddfa913/requeue_list:9646711c {.op=READ, .state=idle, gen=0x118,
> abort_gen=0x0, .cmd_flags=, .rq_flags=SORTED|1|SOFTBARRIER|IO_STAT,
> complete=0, .tag=-1, .internal_tag=217}

Two things come to mind here:

1) We forgot to add RQF_STARTED to the debugfs bits; I just rectified that.

2) You are using a scheduler (which one?). The request was inserted and
retrieved by the driver, then requeued. After this requeue, apparently
nothing happened. The queue should have been re-run, but that didn't
happen. What are the queue/hctx states?

--
Jens Axboe
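(On the re-run Jens mentions: a requeued request only makes progress if the
requeue list is kicked afterwards, which schedules the requeue work,
reinserts the request, and re-runs the hardware queues.
blk_mq_requeue_request() and blk_mq_kick_requeue_list() are the real
4.15-era interfaces; the wrapper function here is invented for illustration.)

    #include <linux/blk-mq.h>

    /* illustrative only: how a driver such as dm-mpath requeues a request */
    static void example_requeue(struct request *rq)
    {
            /*
             * kick_requeue_list=true schedules the requeue work, which
             * reinserts the request and re-runs the queue.  With 'false'
             * the caller must kick the list itself later:
             *
             *      blk_mq_requeue_request(rq, false);
             *      ...
             *      blk_mq_kick_requeue_list(rq->q);
             *
             * If neither happens, the request sits on q->requeue_list
             * with .tag=-1 -- much like the dm-mpath entry in Bart's dump.
             */
            blk_mq_requeue_request(rq, true);
    }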
Re: [PATCHSET v5] blk-mq: reimplement timeout handling
On Fri, 2018-01-12 at 14:07 -0700, Jens Axboe wrote:
> You're really not making it easy for folks to run this :-)

My hope is that the ib_srp and ib_srpt patches will be accepted upstream
soon. As long as these are not upstream, anyone who wants to retrieve these
patches is welcome to clone
https://github.com/bvanassche/linux/tree/block-scsi-for-next, a kernel tree
with all my pending patches.

> Do you have the matching blk-mq debugfs output for the device?

Sorry but I only retrieved the blk-mq debugfs output several minutes after
the hang started, so I'm not sure the state information is relevant. Anyway,
I have attached it to this e-mail. The most remarkable part is the
following:

./9ddfa913/requeue_list:9646711c {.op=READ, .state=idle, gen=0x118,
abort_gen=0x0, .cmd_flags=, .rq_flags=SORTED|1|SOFTBARRIER|IO_STAT,
complete=0, .tag=-1, .internal_tag=217}

The hexadecimal number at the start is the request_queue pointer (I modified
the blk-mq-debugfs code such that queues are registered with their address
just after creation and until a name is assigned). This is a dm-mpath queue.

Bart.

./9ddfa913/hctx0/cpu3/completed:29 0
./9ddfa913/hctx0/cpu3/merged:0
./9ddfa913/hctx0/cpu3/dispatched:30 0
./9ddfa913/hctx0/cpu2/completed:0 0
./9ddfa913/hctx0/cpu2/merged:0
./9ddfa913/hctx0/cpu2/dispatched:0 0
./9ddfa913/hctx0/cpu1/completed:0 0
./9ddfa913/hctx0/cpu1/merged:0
./9ddfa913/hctx0/cpu1/dispatched:0 0
./9ddfa913/hctx0/cpu0/completed:0 0
./9ddfa913/hctx0/cpu0/merged:0
./9ddfa913/hctx0/cpu0/dispatched:0 0
./9ddfa913/hctx0/active:0
./9ddfa913/hctx0/run:92
./9ddfa913/hctx0/queued:30
./9ddfa913/hctx0/dispatched: 00
./9ddfa913/hctx0/dispatched: 197
./9ddfa913/hctx0/dispatched: 20
./9ddfa913/hctx0/dispatched: 40
./9ddfa913/hctx0/dispatched: 80
./9ddfa913/hctx0/dispatched: 160
./9ddfa913/hctx0/dispatched: 32+ 0
./9ddfa913/hctx0/io_poll:considered=0
./9ddfa913/hctx0/io_poll:invoked=0
./9ddfa913/hctx0/io_poll:success=0
./9ddfa913/hctx0/sched_tags_bitmap::
./9ddfa913/hctx0/sched_tags_bitmap:0010: 0002
./9ddfa913/hctx0/sched_tags:nr_tags=256
./9ddfa913/hctx0/sched_tags:nr_reserved_tags=0
./9ddfa913/hctx0/sched_tags:active_queues=0
./9ddfa913/hctx0/sched_tags:
./9ddfa913/hctx0/sched_tags:bitmap_tags:
./9ddfa913/hctx0/sched_tags:depth=256
./9ddfa913/hctx0/sched_tags:busy=1
./9ddfa913/hctx0/sched_tags:bits_per_word=64
./9ddfa913/hctx0/sched_tags:map_nr=4
./9ddfa913/hctx0/sched_tags:alloc_hint={45, 56, 144, 218}
./9ddfa913/hctx0/sched_tags:wake_batch=8
./9ddfa913/hctx0/sched_tags:wake_index=0
./9ddfa913/hctx0/sched_tags:ws={
./9ddfa913/hctx0/sched_tags:{.wait_cnt=8, .wait=inactive},
./9ddfa913/hctx0/sched_tags:{.wait_cnt=8, .wait=inactive},
./9ddfa913/hctx0/sched_tags:{.wait_cnt=8, .wait=inactive},
./9ddfa913/hctx0/sched_tags:{.wait_cnt=8, .wait=inactive},
./9ddfa913/hctx0/sched_tags:{.wait_cnt=8, .wait=inactive},
./9ddfa913/hctx0/sched_tags:{.wait_cnt=8, .wait=inactive},
./9ddfa913/hctx0/sched_tags:{.wait_cnt=8, .wait=inactive},
./9ddfa913/hctx0/sched_tags:{.wait_cnt=8, .wait=inactive},
./9ddfa913/hctx0/sched_tags:}
./9ddfa913/hctx0/sched_tags:round_robin=0
./9ddfa913/hctx0/tags_bitmap::
./9ddfa913/hctx0/tags_bitmap:0010:
./9ddfa913/hctx0/tags_bitmap:0020:
./9ddfa913/hctx0/tags_bitmap:0030:
./9ddfa913/hctx0/tags_bitmap:0040:
./9ddfa913/hctx0/tags_bitmap:0050:
./9ddfa913/hctx0/tags_bitmap:0060:
./9ddfa913/hctx0/tags_bitmap:0070:
./9ddfa913/hctx0/tags_bitmap:0080:
./9ddfa913/hctx0/tags_bitmap:0090:
./9ddfa913/hctx0/tags_bitmap:00a0:
./9ddfa913/hctx0/tags_bitmap:00b0:
./9ddfa913/hctx0/tags_bitmap:00c0:
./9ddfa913/hctx0/tags_bitmap:00d0:
./9ddfa913/hctx0/tags_bitmap:00e0:
Re: [PATCHSET v5] blk-mq: reimplement timeout handling
On 1/12/18 1:57 PM, Bart Van Assche wrote:
> On Tue, 2018-01-09 at 08:29 -0800, Tejun Heo wrote:
>> Currently, blk-mq timeout path synchronizes against the usual
>> issue/completion path using a complex scheme involving atomic
>> bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
>> rules. Unfortunately, it contains quite a few holes.
>
> Hello Tejun,
>
> With this patch series applied I see weird hangs in blk_mq_get_tag() when I
> run the srp-test software. If I pull Jens' latest for-next branch and revert
> this patch series then the srp-test software runs successfully. Note: if you
> don't have InfiniBand hardware available then you will need the RDMA/CM
> patches for the SRP initiator and target drivers that have been posted
> recently on the linux-rdma mailing list to run the srp-test software.

You're really not making it easy for folks to run this :-)

> This is how I run the srp-test software in a VM:
>
> ./run_tests -c -d -r 10
>
> Here is an example of what SysRq-w reported when the hang occurred:
>
> sysrq: SysRq : Show Blocked State
>   task                 PC stack    pid father
> kworker/u8:0    D12864     5      2 0x8000
> Workqueue: events_unbound sd_probe_async [sd_mod]
> Call Trace:
>  ? __schedule+0x2b4/0xbb0
>  schedule+0x2d/0x90
>  io_schedule+0xd/0x30
>  blk_mq_get_tag+0x169/0x290
>  ? finish_wait+0x80/0x80
>  blk_mq_get_request+0x16a/0x4f0
>  blk_mq_alloc_request+0x59/0xc0
>  blk_get_request_flags+0x3f/0x260
>  scsi_execute+0x33/0x1e0 [scsi_mod]
>  read_capacity_16.part.35+0x9c/0x460 [sd_mod]
>  sd_revalidate_disk+0x14bb/0x1cb0 [sd_mod]
>  sd_probe_async+0xf2/0x1a0 [sd_mod]
>  process_one_work+0x21c/0x6d0
>  worker_thread+0x35/0x380
>  ? process_one_work+0x6d0/0x6d0
>  kthread+0x117/0x130
>  ? kthread_create_worker_on_cpu+0x40/0x40
>  ret_from_fork+0x24/0x30
> systemd-udevd   D13672  1048    285 0x0100
> Call Trace:
>  ? __schedule+0x2b4/0xbb0
>  schedule+0x2d/0x90
>  io_schedule+0xd/0x30
>  generic_file_read_iter+0x32f/0x970
>  ? page_cache_tree_insert+0x100/0x100
>  __vfs_read+0xcc/0x120
>  vfs_read+0x96/0x140
>  SyS_read+0x40/0xa0
>  do_syscall_64+0x5f/0x1b0
>  entry_SYSCALL64_slow_path+0x25/0x25

Do you have the matching blk-mq debugfs output for the device?

--
Jens Axboe
Re: [PATCHSET v5] blk-mq: reimplement timeout handling
On Tue, 2018-01-09 at 08:29 -0800, Tejun Heo wrote:
> Currently, blk-mq timeout path synchronizes against the usual
> issue/completion path using a complex scheme involving atomic
> bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
> rules. Unfortunately, it contains quite a few holes.

Hello Tejun,

With this patch series applied I see weird hangs in blk_mq_get_tag() when I
run the srp-test software. If I pull Jens' latest for-next branch and revert
this patch series then the srp-test software runs successfully. Note: if you
don't have InfiniBand hardware available then you will need the RDMA/CM
patches for the SRP initiator and target drivers that have been posted
recently on the linux-rdma mailing list to run the srp-test software.

This is how I run the srp-test software in a VM:

./run_tests -c -d -r 10

Here is an example of what SysRq-w reported when the hang occurred:

sysrq: SysRq : Show Blocked State
  task                 PC stack    pid father
kworker/u8:0    D12864     5      2 0x8000
Workqueue: events_unbound sd_probe_async [sd_mod]
Call Trace:
 ? __schedule+0x2b4/0xbb0
 schedule+0x2d/0x90
 io_schedule+0xd/0x30
 blk_mq_get_tag+0x169/0x290
 ? finish_wait+0x80/0x80
 blk_mq_get_request+0x16a/0x4f0
 blk_mq_alloc_request+0x59/0xc0
 blk_get_request_flags+0x3f/0x260
 scsi_execute+0x33/0x1e0 [scsi_mod]
 read_capacity_16.part.35+0x9c/0x460 [sd_mod]
 sd_revalidate_disk+0x14bb/0x1cb0 [sd_mod]
 sd_probe_async+0xf2/0x1a0 [sd_mod]
 process_one_work+0x21c/0x6d0
 worker_thread+0x35/0x380
 ? process_one_work+0x6d0/0x6d0
 kthread+0x117/0x130
 ? kthread_create_worker_on_cpu+0x40/0x40
 ret_from_fork+0x24/0x30
systemd-udevd   D13672  1048    285 0x0100
Call Trace:
 ? __schedule+0x2b4/0xbb0
 schedule+0x2d/0x90
 io_schedule+0xd/0x30
 generic_file_read_iter+0x32f/0x970
 ? page_cache_tree_insert+0x100/0x100
 __vfs_read+0xcc/0x120
 vfs_read+0x96/0x140
 SyS_read+0x40/0xa0
 do_syscall_64+0x5f/0x1b0
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f8ce6d08d11
RSP: 002b:7fff96dec288 EFLAGS: 0246 ORIG_RAX:
RAX: ffda RBX: 5651de7f6e10 RCX: 7f8ce6d08d11
RDX: 0040 RSI: 5651de7f6e38 RDI: 0007
RBP: 5651de7ea500 R08: 7f8ce6cf1c20 R09: 5651de7f6e10
R10: 006f R11: 0246 R12: 01ff
R13: 01ff0040 R14: 5651de7ea550 R15: 0040
systemd-udevd   D13496  1049    285 0x0100
Call Trace:
 ? __schedule+0x2b4/0xbb0
 schedule+0x2d/0x90
 io_schedule+0xd/0x30
 blk_mq_get_tag+0x169/0x290
 ? finish_wait+0x80/0x80
 blk_mq_get_request+0x16a/0x4f0
 blk_mq_make_request+0x105/0x8e0
 ? generic_make_request+0xd6/0x3d0
 generic_make_request+0x103/0x3d0
 ? submit_bio+0x57/0x110
 submit_bio+0x57/0x110
 mpage_readpages+0x13b/0x160
 ? I_BDEV+0x10/0x10
 ? rcu_read_lock_sched_held+0x66/0x70
 ? __alloc_pages_nodemask+0x2e8/0x360
 __do_page_cache_readahead+0x2a4/0x370
 ? force_page_cache_readahead+0xaf/0x110
 force_page_cache_readahead+0xaf/0x110
 generic_file_read_iter+0x743/0x970
 ? find_held_lock+0x2d/0x90
 ? _raw_spin_unlock+0x29/0x40
 __vfs_read+0xcc/0x120
 vfs_read+0x96/0x140
 SyS_read+0x40/0xa0
 do_syscall_64+0x5f/0x1b0
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f8ce6d08d11
RSP: 002b:7fff96dec8b8 EFLAGS: 0246 ORIG_RAX:
RAX: ffda RBX: 7f8ce7085010 RCX: 7f8ce6d08d11
RDX: 0004 RSI: 7f8ce7085038 RDI: 000f
RBP: 5651de7ec840 R08: R09: 7f8ce7085010
R10: 7f8ce7085028 R11: 0246 R12:
R13: 0004 R14: 5651de7ec890 R15: 0004
systemd-udevd   D13672  1055    285 0x0100
Call Trace:
 ? __schedule+0x2b4/0xbb0
 schedule+0x2d/0x90
 io_schedule+0xd/0x30
 blk_mq_get_tag+0x169/0x290
 ? finish_wait+0x80/0x80
 blk_mq_get_request+0x16a/0x4f0
 blk_mq_make_request+0x105/0x8e0
 ? generic_make_request+0xd6/0x3d0
 generic_make_request+0x103/0x3d0
 ? submit_bio+0x57/0x110
 submit_bio+0x57/0x110
 mpage_readpages+0x13b/0x160
 ? I_BDEV+0x10/0x10
 ? rcu_read_lock_sched_held+0x66/0x70
 ? __alloc_pages_nodemask+0x2e8/0x360
 __do_page_cache_readahead+0x2a4/0x370
 ? force_page_cache_readahead+0xaf/0x110
 force_page_cache_readahead+0xaf/0x110
 generic_file_read_iter+0x743/0x970
 __vfs_read+0xcc/0x120
 vfs_read+0x96/0x140
 SyS_read+0x40/0xa0
 do_syscall_64+0x5f/0x1b0
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f8ce6d08d11
RSP: 002b:7fff96dec848 EFLAGS: 0246 ORIG_RAX:
RAX: ffda RBX: 5651de7ec300 RCX: 7f8ce6d08d11
RDX: 0100 RSI: 5651de7ec328 RDI: 000f
RBP: 5651de7ea500 R08: R09: 5651de7ec300
R10: 5651de7ec318 R11: 0246 R12: 01ffe000
R13: 01ffe100 R14: 5651de7ea550 R15: 0100

Please let me know if you need more information.

Bart.
Re: [PATCHSET v5] blk-mq: reimplement timeout handling
On 1/9/18 9:29 AM, Tejun Heo wrote:
> Hello,
>
> Changes from [v4]
>
> - Comments added. Patch description updated.
>
> Changes from [v3]
>
> - Rebased on top of for-4.16/block.
>
> - Integrated Jens's hctx_[un]lock() factoring patch and refreshed the
>   patches accordingly.
>
> - Added comment explaining the use of hctx_lock() instead of
>   rcu_read_lock() in completion path.
>
> Changes from [v2]
>
> - Possible extended looping around seqcount and u64_stat_sync fixed.
>
> - Misplaced MQ_RQ_IDLE state setting fixed.
>
> - RQF_MQ_TIMEOUT_EXPIRED added to prevent firing the same timeout
>   multiple times.
>
> - s/queue_rq_src/srcu/ patch added.
>
> - Other misc changes.
>
> Changes from [v1]
>
> - BLK_EH_RESET_TIMER handling fixed.
>
> - s/request->gstate_seqc/request->gstate_seq/
>
> - READ_ONCE() added to blk_mq_rq_update_state().
>
> - Removed left-over blk_clear_rq_complete() invocation from
>   blk_mq_rq_timed_out().
>
> Currently, the blk-mq timeout path synchronizes against the usual
> issue/completion path using a complex scheme involving atomic
> bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
> rules. Unfortunately, it contains quite a few holes.
>
> It's pretty easy to make blk_mq_check_expired() terminate a later
> instance of a request. If we induce a 5 sec delay before the
> time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
> 2s, and issue back-to-back large IOs, blk-mq starts timing out
> requests spuriously pretty quickly. Nothing actually timed out. It
> just made the call on a recycled instance of a request and then
> terminated a later instance long after the original instance finished.
> The scenario isn't theoretical either.
>
> This patchset replaces the broken synchronization mechanism with an RCU
> and generation number based one. Please read the patch description of
> the second patch for more details.

Applied for 4.16, and added a patch to silence the false positive srcu_idx
warning for hctx_unlock() that got reported. This grows the request a bit,
which sucks, but is probably unavoidable. There's some room for improvement
with a hole or two, however.

--
Jens Axboe
[PATCHSET v5] blk-mq: reimplement timeout handling
Hello,

Changes from [v4]

- Comments added. Patch description updated.

Changes from [v3]

- Rebased on top of for-4.16/block.

- Integrated Jens's hctx_[un]lock() factoring patch and refreshed the
  patches accordingly.

- Added comment explaining the use of hctx_lock() instead of
  rcu_read_lock() in completion path.

Changes from [v2]

- Possible extended looping around seqcount and u64_stat_sync fixed.

- Misplaced MQ_RQ_IDLE state setting fixed.

- RQF_MQ_TIMEOUT_EXPIRED added to prevent firing the same timeout
  multiple times.

- s/queue_rq_src/srcu/ patch added.

- Other misc changes.

Changes from [v1]

- BLK_EH_RESET_TIMER handling fixed.

- s/request->gstate_seqc/request->gstate_seq/

- READ_ONCE() added to blk_mq_rq_update_state().

- Removed left-over blk_clear_rq_complete() invocation from
  blk_mq_rq_timed_out().

Currently, the blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic bitflags,
REQ_ATOM_*, memory barriers and subtle memory coherence rules.
Unfortunately, it contains quite a few holes.

It's pretty easy to make blk_mq_check_expired() terminate a later instance
of a request. If we induce a 5 sec delay before the time_after_eq() test in
blk_mq_check_expired(), shorten the timeout to 2s, and issue back-to-back
large IOs, blk-mq starts timing out requests spuriously pretty quickly.
Nothing actually timed out. It just made the call on a recycled instance of
a request and then terminated a later instance long after the original
instance finished. The scenario isn't theoretical either.

This patchset replaces the broken synchronization mechanism with an RCU and
generation number based one. Please read the patch description of the
second patch for more details.

This patchset contains the following eight patches.

 0001-blk-mq-move-hctx-lock-unlock-into-a-helper.patch
 0002-blk-mq-protect-completion-path-with-RCU.patch
 0003-blk-mq-replace-timeout-synchronization-with-a-RCU-an.patch
 0004-blk-mq-use-blk_mq_rq_state-instead-of-testing-REQ_AT.patch
 0005-blk-mq-make-blk_abort_request-trigger-timeout-path.patch
 0006-blk-mq-remove-REQ_ATOM_COMPLETE-usages-from-blk-mq.patch
 0007-blk-mq-remove-REQ_ATOM_STARTED.patch
 0008-blk-mq-rename-blk_mq_hw_ctx-queue_rq_srcu-to-srcu.patch

and is available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git blk-mq-timeout-v5

diffstat follows. Thanks.

 block/blk-core.c       |    2
 block/blk-mq-debugfs.c |    4
 block/blk-mq.c         |  346 +
 block/blk-mq.h         |   49 ++
 block/blk-timeout.c    |   16 +-
 block/blk.h            |    7
 include/linux/blk-mq.h |    3
 include/linux/blkdev.h |   25 +++
 8 files changed, 294 insertions(+), 158 deletions(-)

--
tejun

[v4] http://lkml.kernel.org/r/20180108191542.379478-1...@kernel.org
[v3] http://lkml.kernel.org/r/20171216120726.517153-1...@kernel.org
[v2] http://lkml.kernel.org/r/20171010155441.753966-1...@kernel.org
[v1] http://lkml.kernel.org/r/20171209192525.982030-1...@kernel.org
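(To make the generation scheme concrete: every request carries a generation
number that is bumped each time the request is recycled, and the
(generation, state) word plus the deadline are read under a seqcount so the
timeout path sees a coherent snapshot. The sketch below is reconstructed
from memory of this patchset -- field names follow the cover letter above,
but it is a paraphrase; see patch 0003 for the real code.)

    /* rough paraphrase of the timeout-side check, not the actual patch */
    static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
                                     struct request *rq, void *priv,
                                     bool reserved)
    {
            unsigned long gstate, deadline;
            unsigned int start;

            /* read generation+state and deadline coherently */
            do {
                    start = read_seqcount_begin(&rq->gstate_seq);
                    gstate = READ_ONCE(rq->gstate);
                    deadline = rq->deadline;
            } while (read_seqcount_retry(&rq->gstate_seq, start));

            if ((gstate & MQ_RQ_STATE_MASK) != MQ_RQ_IN_FLIGHT ||
                !time_after_eq(jiffies, deadline))
                    return;

            /*
             * Remember which instance we saw expire.  The abort later
             * fires only if rq->gstate still equals this value, so a
             * recycled instance (new generation) is left alone --
             * exactly the race the old REQ_ATOM_* scheme lost.
             */
            rq->aborted_gstate = gstate;
    }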