RE: [PATCH] [v3] blk-mq-tag: make blk_mq_tag_busy() return void
Hi Jens,

Could I get your comments? Thanks.

-----Original Message-----
From: tianxianting (RD)
Sent: Thursday, December 10, 2020 2:05 PM
To: ax...@kernel.dk
Cc: ming@redhat.com; linux-bl...@vger.kernel.org; linux-kernel@vger.kernel.org; tianxianting (RD)
Subject: [PATCH] [v3] blk-mq-tag: make blk_mq_tag_busy() return void

As no one cares about the return value of blk_mq_tag_busy() and
__blk_mq_tag_busy(), make them return void. The other change is to
simplify blk_mq_tag_idle().

Signed-off-by: Xianting Tian
Reviewed-by: Ming Lei
---
 block/blk-mq-tag.c |  4 +---
 block/blk-mq-tag.h | 16 ++++++----------
 2 files changed, 7 insertions(+), 13 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 9c92053e7..01c0bb1fb 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -21,7 +21,7 @@
  * to get tag when first time, the other shared-tag users could reserve
  * budget for it.
  */
-bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
+void __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 {
 	if (blk_mq_is_sbitmap_shared(hctx->flags)) {
 		struct request_queue *q = hctx->queue;
@@ -35,8 +35,6 @@ bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 		    !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
 			atomic_inc(&hctx->tags->active_queues);
 	}
-
-	return true;
 }
 
 /*
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 7d3e6b333..4b4ccd794 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -60,23 +60,19 @@ enum {
 	BLK_MQ_TAG_MAX		= BLK_MQ_NO_TAG - 1,
 };
 
-extern bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *);
+extern void __blk_mq_tag_busy(struct blk_mq_hw_ctx *);
 extern void __blk_mq_tag_idle(struct blk_mq_hw_ctx *);
 
-static inline bool blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
+static inline void blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 {
-	if (!(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
-		return false;
-
-	return __blk_mq_tag_busy(hctx);
+	if (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
+		__blk_mq_tag_busy(hctx);
 }
 
 static inline void blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 {
-	if (!(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
-		return;
-
-	__blk_mq_tag_idle(hctx);
+	if (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
+		__blk_mq_tag_idle(hctx);
 }
 
 static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags,
-- 
2.17.1
RE: [PATCH] [v2] tty: Protect disc_data in n_tty_close and n_tty_flush_buffer
Hi Greg KH,

Could we get your comments on the updates? Thanks.

-----Original Message-----
From: gaoyan (RD)
Sent: Friday, December 11, 2020 2:09 PM
To: Greg KH
Cc: jirisl...@kernel.org; linux-kernel@vger.kernel.org; tianxianting (RD)
Subject: Re: [PATCH] [v2] tty: Protect disc_data in n_tty_close and n_tty_flush_buffer

Hi Greg KH:

I tried to reproduce this problem in testing, but it is difficult to
trigger again. It is hard to hit the timing where n_tty_flush_buffer
accesses the disc_data that was just set to NULL by n_tty_close.

Thanks
Gao Yan

-----Original Message-----
From: Greg KH [mailto:gre...@linuxfoundation.org]
Sent: December 10, 2020 14:22
To: gaoyan (RD)
Cc: jirisl...@kernel.org; linux-kernel@vger.kernel.org; tianxianting (RD)
Subject: Re: [PATCH] [v2] tty: Protect disc_data in n_tty_close and n_tty_flush_buffer

On Thu, Dec 10, 2020 at 10:25:07AM +0800, Yan.Gao wrote:
> n_tty_flush_buffer can run in parallel with n_tty_close, which sets
> tty->disc_data to NULL. n_tty_flush_buffer accesses tty->disc_data, so
> we must prevent n_tty_close from clearing tty->disc_data while
> n_tty_flush_buffer still has a non-NULL view of it.
>
> So we need to make sure that accesses to disc_data are serialized via
> tty->termios_rwsem.
>
> Here is an example I hit: when n_tty_flush_buffer accesses the tty
> struct, disc_data is valid. But by the time reset_buffer_flags
> accesses tty->disc_data, it has become NULL, so the kernel crashes
> when dereferencing tty->disc_data->real_tail. Another thread can set
> tty->disc_data to NULL: with the N_TTY line discipline, n_tty_close
> sets tty->disc_data to NULL. So use tty->termios_rwsem to protect
> disc_data between close and flush_buffer.
>
> IP: reset_buffer_flags+0x9/0xf0
> PGD 0 P4D 0
> Oops: 0002 [#1] SMP
> CPU: 23 PID: 2087626 Comm: (agetty) Kdump: loaded Tainted: G
> Hardware name: UNISINSIGHT X3036P-G3/ST01M2C7S, BIOS 2.00.13 01/11/2019
> task: 9c4e9da71e80 task.stack: b30cfe898000
> RIP: 0010:reset_buffer_flags+0x9/0xf0
> RSP: 0018:b30cfe89bca8 EFLAGS: 00010246
> RAX: 9c4e9da71e80 RBX: 9c368d1bac00 RCX:
> RDX: RSI: 9c4ea17b50f0 RDI:
> RBP: b30cfe89bcc8 R08: 0100 R09: 0001
> R10: 0001 R11: R12: 9c368d1bacc0
> R13: 9c20cfd18428 R14: 9c4ea17b50f0 R15: 9c368d1bac00
> FS: 7f9fbbe97940() GS:9c375c74() knlGS:
> CS: 0010 DS: ES: CR0: 80050033
> CR2: 2260 CR3: 002f72233003 CR4: 007606e0
> DR0: DR1: DR2:
> DR3: DR6: fffe0ff0 DR7: 0400
> PKRU: 5554
> Call Trace:
>  ? n_tty_flush_buffer+0x2a/0x60
>  tty_buffer_flush+0x76/0x90
>  tty_ldisc_flush+0x22/0x40
>  vt_ioctl+0x5a7/0x10b0
>  ? n_tty_ioctl_helper+0x27/0x110
>  tty_ioctl+0xef/0x8c0
>  do_vfs_ioctl+0xa7/0x5e0
>  ? __audit_syscall_entry+0xaf/0x100
>  ? syscall_trace_enter+0x1d0/0x2b0
>  SyS_ioctl+0x79/0x90
>  do_syscall_64+0x6c/0x1b0
>  entry_SYSCALL64_slow_path+0x25/0x25
>
> n_tty_flush_buffer      ---> tty->disc_data is OK
>   -> reset_buffer_flags  --> tty->disc_data is NULL
>
> Signed-off-by: Yan.Gao
> Reviewed-by: Xianting Tian
> ---
>  drivers/tty/n_tty.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/tty/n_tty.c b/drivers/tty/n_tty.c
> index 7e5e36315..e78124ce1 100644
> --- a/drivers/tty/n_tty.c
> +++ b/drivers/tty/n_tty.c
> @@ -1892,8 +1892,10 @@ static void n_tty_close(struct tty_struct *tty)
>  	if (tty->link)
>  		n_tty_packet_mode_flush(tty);
>  
> +	down_write(&tty->termios_rwsem);
>  	vfree(ldata);
>  	tty->disc_data = NULL;
> +	up_write(&tty->termios_rwsem);
>  }
>  
>  /**

So does this solve your problem in testing?

Do you have a reproducer for this problem?

thanks,

greg k-h
RE: [PATCH] [v2] blk-mq-tag: make blk_mq_tag_busy() return void
Yes, sorry, I missed it in v2, so I sent v3 :)

-----Original Message-----
From: Chaitanya Kulkarni [mailto:chaitanya.kulka...@wdc.com]
Sent: Thursday, December 10, 2020 2:21 PM
To: tianxianting (RD); ax...@kernel.dk
Cc: ming@redhat.com; linux-bl...@vger.kernel.org; linux-kernel@vger.kernel.org
Subject: Re: [PATCH] [v2] blk-mq-tag: make blk_mq_tag_busy() return void

On 12/9/20 22:06, Xianting Tian wrote:
> As no one cares about the return value of blk_mq_tag_busy() and
> __blk_mq_tag_busy(), make them return void.
>
> The other change is to simplify blk_mq_tag_idle().
>
> Signed-off-by: Xianting Tian
> Reviewed-by: Ming Lei
> ---
>  block/blk-mq-tag.c |  4 ++--
>  block/blk-mq-tag.h | 16 ++++++----------
>  2 files changed, 8 insertions(+), 12 deletions(-)
>
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 9c92053e7..21ff7d156 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -21,7 +21,7 @@
>   * to get tag when first time, the other shared-tag users could reserve
>   * budget for it.
>   */
> -bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
> +void __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
>  {
>  	if (blk_mq_is_sbitmap_shared(hctx->flags)) {
>  		struct request_queue *q = hctx->queue;
> @@ -36,7 +36,7 @@ bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
>  		atomic_inc(&hctx->tags->active_queues);
>  	}
>  
> -	return true;
> +	return;

If the above return is the last statement, then you need to remove it
entirely instead of leaving a bare return with no value.

Also, please add the version history.
RE: [PATCH] blk-mq-tag: make blk_mq_tag_busy() return void
Thanks for the comments. Then blk_mq_tag_idle() can also be simplified
as below; I will send a v2 patch for review.

 static inline void blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 {
-	if (!(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
-		return;
-
-	__blk_mq_tag_idle(hctx);
+	if (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
+		__blk_mq_tag_idle(hctx);
 }

-----Original Message-----
From: Ming Lei [mailto:ming@redhat.com]
Sent: Thursday, December 10, 2020 12:01 PM
To: tianxianting (RD)
Cc: ax...@kernel.dk; linux-bl...@vger.kernel.org; linux-kernel@vger.kernel.org
Subject: Re: [PATCH] blk-mq-tag: make blk_mq_tag_busy() return void

On Tue, Dec 08, 2020 at 03:40:02PM +0800, Xianting Tian wrote:
> As no one cares about the return value of blk_mq_tag_busy() and
> __blk_mq_tag_busy(), make them return void.
>
> Signed-off-by: Xianting Tian
> ---
>  block/blk-mq-tag.c | 4 ++--
>  block/blk-mq-tag.h | 8 ++++----
>  2 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 9c92053e7..21ff7d156 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -21,7 +21,7 @@
>   * to get tag when first time, the other shared-tag users could reserve
>   * budget for it.
>   */
> -bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
> +void __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
>  {
>  	if (blk_mq_is_sbitmap_shared(hctx->flags)) {
>  		struct request_queue *q = hctx->queue;
> @@ -36,7 +36,7 @@ bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
>  		atomic_inc(&hctx->tags->active_queues);
>  	}
>  
> -	return true;
> +	return;
>  }
>  
>  /*
> diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
> index 7d3e6b333..dd80e5a85 100644
> --- a/block/blk-mq-tag.h
> +++ b/block/blk-mq-tag.h
> @@ -60,15 +60,15 @@ enum {
>  	BLK_MQ_TAG_MAX		= BLK_MQ_NO_TAG - 1,
>  };
>  
> -extern bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *);
> +extern void __blk_mq_tag_busy(struct blk_mq_hw_ctx *);
>  extern void __blk_mq_tag_idle(struct blk_mq_hw_ctx *);
>  
> -static inline bool blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
> +static inline void blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
>  {
>  	if (!(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
> -		return false;
> +		return;
>  
> -	return __blk_mq_tag_busy(hctx);
> +	__blk_mq_tag_busy(hctx);

The above can be simplified as:

	if (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
		__blk_mq_tag_busy(hctx);

Otherwise, looks fine:

Reviewed-by: Ming Lei

Thanks,
Ming
RE: [PATCH] sched/rt: Print curr when RT throttling activated
Thanks. We hit an issue where a normal thread could not get the CPU,
and at that moment we found the 'sched: RT throttling activated' log.
So I think this patch is useful for such issues. Could I get more
comments? Thanks in advance.

-----Original Message-----
From: Steven Rostedt [mailto:rost...@goodmis.org]
Sent: Thursday, December 03, 2020 10:40 PM
To: tianxianting (RD)
Cc: mi...@redhat.com; pet...@infradead.org; juri.le...@redhat.com; vincent.guit...@linaro.org; dietmar.eggem...@arm.com; bseg...@google.com; mgor...@suse.de; linux-kernel@vger.kernel.org
Subject: Re: [PATCH] sched/rt: Print curr when RT throttling activated

On Thu, 3 Dec 2020 15:51:29 +0800, Xianting Tian wrote:
> We may hit an issue where one RT thread occupies the CPU for 950ms out
> of every 1s. The RT thread may be a business thread or some other
> unknown thread.
>
> Currently, the kernel only prints "sched: RT throttling activated"
> when RT throttling happens. It is hard to know which RT thread is
> responsible; for further analysis we need more prints.
>
> This patch prints the current RT task when RT throttling is activated.
> It helps us identify the offending RT thread right away.

I think this can be useful information to include.

Acked-by: Steven Rostedt (VMware)

-- Steve

>
> Signed-off-by: Xianting Tian
> ---
>  kernel/sched/rt.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index f215eea6a..8913f38cb 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -946,7 +946,7 @@ static inline int rt_se_prio(struct sched_rt_entity *rt_se)
>  	return rt_task_of(rt_se)->prio;
>  }
>  
> -static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
> +static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq, struct task_struct *curr)
>  {
>  	u64 runtime = sched_rt_runtime(rt_rq);
>  
> @@ -970,7 +970,8 @@ static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
>  	 */
>  	if (likely(rt_b->rt_runtime)) {
>  		rt_rq->rt_throttled = 1;
> -		printk_deferred_once("sched: RT throttling activated\n");
> +		printk_deferred_once("sched: RT throttling activated (curr: pid %d, comm %s)\n",
> +				     curr->pid, curr->comm);
>  	} else {
>  		/*
>  		 * In case we did anyway, make it go away,
> @@ -1026,7 +1027,7 @@ static void update_curr_rt(struct rq *rq)
>  	if (sched_rt_runtime(rt_rq) != RUNTIME_INF) {
>  		raw_spin_lock(&rt_rq->rt_runtime_lock);
>  		rt_rq->rt_time += delta_exec;
> -		if (sched_rt_runtime_exceeded(rt_rq))
> +		if (sched_rt_runtime_exceeded(rt_rq, curr))
>  			resched_curr(rq);
>  		raw_spin_unlock(&rt_rq->rt_runtime_lock);
>  	}
RE: [PATCH] ext4: remove the null check of bio_vec page
Thanks Ted :)

-----Original Message-----
From: Theodore Y. Ts'o [mailto:ty...@mit.edu]
Sent: Thursday, December 03, 2020 10:11 PM
To: Jan Kara
Cc: tianxianting (RD); adilger.ker...@dilger.ca; linux-e...@vger.kernel.org; linux-kernel@vger.kernel.org
Subject: Re: [PATCH] ext4: remove the null check of bio_vec page

On Wed, Oct 21, 2020 at 12:25:03PM +0200, Jan Kara wrote:
> On Tue 20-10-20 16:22:01, Xianting Tian wrote:
> > bv_page can't be NULL in a valid bio_vec, so we can remove the NULL
> > check, as we did in other places when calling
> > bio_for_each_segment_all() to go through all bio_vecs of a bio.
> >
> > Signed-off-by: Xianting Tian
>
> Thanks for the patch. It looks good to me. You can add:
>
> Reviewed-by: Jan Kara

Applied, thanks.

- Ted
RE: [PATCH] ext4: remove the null check of bio_vec page
Thanks Jan.

Hi Ted Ts'o, could I get your opinion on this patch? Thanks.

-----Original Message-----
From: Jan Kara [mailto:j...@suse.cz]
Sent: Saturday, October 24, 2020 12:57 AM
To: tianxianting (RD)
Cc: Jan Kara; ty...@mit.edu; adilger.ker...@dilger.ca; linux-e...@vger.kernel.org; linux-kernel@vger.kernel.org
Subject: Re: [PATCH] ext4: remove the null check of bio_vec page

On Fri 23-10-20 16:38:16, Tianxianting wrote:
> Thanks Jan
> Can the patch be applied?

Ted Ts'o is the ext4 maintainer, so he should eventually pick up and
apply the patch. But since the merge window is currently open, I guess
he's busy shuffling patches to send to Linus. I'd expect he'll get to
your patch in a week or two.

Honza

> -----Original Message-----
> From: Jan Kara [mailto:j...@suse.cz]
> Sent: Wednesday, October 21, 2020 6:25 PM
> To: tianxianting (RD)
> Cc: ty...@mit.edu; adilger.ker...@dilger.ca; j...@suse.cz; linux-e...@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] ext4: remove the null check of bio_vec page
>
> On Tue 20-10-20 16:22:01, Xianting Tian wrote:
> > bv_page can't be NULL in a valid bio_vec, so we can remove the NULL
> > check, as we did in other places when calling
> > bio_for_each_segment_all() to go through all bio_vecs of a bio.
> >
> > Signed-off-by: Xianting Tian
>
> Thanks for the patch. It looks good to me. You can add:
>
> Reviewed-by: Jan Kara
>
> Honza
>
> > ---
> >  fs/ext4/page-io.c | 3 ---
> >  1 file changed, 3 deletions(-)
> >
> > diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> > index defd2e10d..cb135a944 100644
> > --- a/fs/ext4/page-io.c
> > +++ b/fs/ext4/page-io.c
> > @@ -111,9 +111,6 @@ static void ext4_finish_bio(struct bio *bio)
> >  		unsigned under_io = 0;
> >  		unsigned long flags;
> >  
> > -		if (!page)
> > -			continue;
> > -
> >  		if (fscrypt_is_bounce_page(page)) {
> >  			bounce_page = page;
> >  			page = fscrypt_pagecache_page(bounce_page);
> > -- 
> > 2.17.1

-- 
Jan Kara
SUSE Labs, CR
RE: [PATCH] mm: bio_alloc never fails when set GFP_NOIO, GFP_KERNEL
Hi Andrew,

Could I get your comments on this patch? Thanks in advance.

-----Original Message-----
From: tianxianting (RD)
Sent: Wednesday, October 21, 2020 11:11 AM
To: a...@linux-foundation.org
Cc: linux...@kvack.org; linux-kernel@vger.kernel.org; tianxianting (RD)
Subject: [PATCH] mm: bio_alloc never fails when set GFP_NOIO, GFP_KERNEL

bio_alloc with __GFP_DIRECT_RECLAIM (which is included in GFP_NOIO and
GFP_KERNEL) never fails, as stated in the comments of
bio_alloc_bioset(). So we can remove multiple unneeded NULL checks of
bio_alloc and simplify the code. We have already done this in
fs/ext4/readpage.c, fs/ext4/page-io.c, fs/direct-io.c, and so forth.

Signed-off-by: Xianting Tian
---
 mm/page_io.c | 31 +++++++++----------------------
 1 file changed, 11 insertions(+), 20 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index e485a6e8a..9215bb356 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -30,18 +30,20 @@ static struct bio *get_swap_bio(gfp_t gfp_flags,
 		struct page *page, bio_end_io_t end_io)
 {
 	struct bio *bio;
+	struct block_device *bdev;
 
+	/*
+	 * bio_alloc will _always_ be able to allocate a bio if
+	 * __GFP_DIRECT_RECLAIM is set, see comments for bio_alloc_bioset().
+	 */
 	bio = bio_alloc(gfp_flags, 1);
-	if (bio) {
-		struct block_device *bdev;
+	bio->bi_iter.bi_sector = map_swap_page(page, &bdev);
+	bio_set_dev(bio, bdev);
+	bio->bi_iter.bi_sector <<= PAGE_SHIFT - 9;
+	bio->bi_end_io = end_io;
 
-		bio->bi_iter.bi_sector = map_swap_page(page, &bdev);
-		bio_set_dev(bio, bdev);
-		bio->bi_iter.bi_sector <<= PAGE_SHIFT - 9;
-		bio->bi_end_io = end_io;
+	bio_add_page(bio, page, thp_size(page), 0);
 
-		bio_add_page(bio, page, thp_size(page), 0);
-	}
 	return bio;
 }
@@ -351,19 +353,13 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 	ret = 0;
 	bio = get_swap_bio(GFP_NOIO, page, end_write_func);
-	if (bio == NULL) {
-		set_page_dirty(page);
-		unlock_page(page);
-		ret = -ENOMEM;
-		goto out;
-	}
 	bio->bi_opf = REQ_OP_WRITE | REQ_SWAP | wbc_to_write_flags(wbc);
 	bio_associate_blkg_from_page(bio, page);
 	count_swpout_vm_event(page);
 	set_page_writeback(page);
 	unlock_page(page);
 	submit_bio(bio);
-out:
+
 	return ret;
 }
@@ -416,11 +412,6 @@ int swap_readpage(struct page *page, bool synchronous)
 	ret = 0;
 	bio = get_swap_bio(GFP_KERNEL, page, end_swap_bio_read);
-	if (bio == NULL) {
-		unlock_page(page);
-		ret = -ENOMEM;
-		goto out;
-	}
 	disk = bio->bi_disk;
 	/*
 	 * Keep this task valid during swap readpage because the oom killer may
-- 
2.17.1
RE: [PATCH] ext4: remove the null check of bio_vec page
Thanks Jan.

Can the patch be applied?

-----Original Message-----
From: Jan Kara [mailto:j...@suse.cz]
Sent: Wednesday, October 21, 2020 6:25 PM
To: tianxianting (RD)
Cc: ty...@mit.edu; adilger.ker...@dilger.ca; j...@suse.cz; linux-e...@vger.kernel.org; linux-kernel@vger.kernel.org
Subject: Re: [PATCH] ext4: remove the null check of bio_vec page

On Tue 20-10-20 16:22:01, Xianting Tian wrote:
> bv_page can't be NULL in a valid bio_vec, so we can remove the NULL
> check, as we did in other places when calling
> bio_for_each_segment_all() to go through all bio_vecs of a bio.
>
> Signed-off-by: Xianting Tian

Thanks for the patch. It looks good to me. You can add:

Reviewed-by: Jan Kara

Honza

> ---
>  fs/ext4/page-io.c | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index defd2e10d..cb135a944 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -111,9 +111,6 @@ static void ext4_finish_bio(struct bio *bio)
>  		unsigned under_io = 0;
>  		unsigned long flags;
>  
> -		if (!page)
> -			continue;
> -
>  		if (fscrypt_is_bounce_page(page)) {
>  			bounce_page = page;
>  			page = fscrypt_pagecache_page(bounce_page);
> -- 
> 2.17.1

-- 
Jan Kara
SUSE Labs, CR
RE: [PATCH] scsi: megaraid_sas: use spin_lock() in hard IRQ
Yes, thanks, I see. If we add this patch, we would need to account for
every CPU architecture that supports nested interrupts.

-----Original Message-----
From: Finn Thain [mailto:fth...@telegraphics.com.au]
Sent: Thursday, October 22, 2020 11:29 AM
To: tianxianting (RD)
Cc: kashyap.de...@broadcom.com; sumit.sax...@broadcom.com; shivasharan.srikanteshw...@broadcom.com; j...@linux.ibm.com; martin.peter...@oracle.com; megaraidlinux@broadcom.com; linux-s...@vger.kernel.org; linux-kernel@vger.kernel.org
Subject: RE: [PATCH] scsi: megaraid_sas: use spin_lock() in hard IRQ

On Thu, 22 Oct 2020, Tianxianting wrote:
> Do you mean the megasas RAID can be used on the m68k arch?

m68k is one example of an architecture on which the unstated
assumptions in your patch would be invalid. Does this help to clarify
what I wrote?

If megasas RAID did work on m68k, I'm sure it could potentially benefit
from the theoretical performance improvement of your patch. So perhaps
you would consider adding support for slower CPUs like m68k.
RE: [PATCH] scsi: megaraid_sas: use spin_lock() in hard IRQ
Thanks. Do you mean the megasas RAID can be used on the m68k arch?

-----Original Message-----
From: Finn Thain [mailto:fth...@telegraphics.com.au]
Sent: Thursday, October 22, 2020 10:25 AM
To: tianxianting (RD)
Cc: kashyap.de...@broadcom.com; sumit.sax...@broadcom.com; shivasharan.srikanteshw...@broadcom.com; j...@linux.ibm.com; martin.peter...@oracle.com; megaraidlinux@broadcom.com; linux-s...@vger.kernel.org; linux-kernel@vger.kernel.org
Subject: Re: [PATCH] scsi: megaraid_sas: use spin_lock() in hard IRQ

On Wed, 21 Oct 2020, Xianting Tian wrote:
> Since we are already in hard IRQ context when megasas_isr() runs,

On m68k, hard irq context does not mean interrupts are disabled. Are
there no other architectures in that category?

> spin_lock() is enough, which is faster than spin_lock_irqsave().

Is that measurable?

> Signed-off-by: Xianting Tian
> ---
>  drivers/scsi/megaraid/megaraid_sas_base.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
> index 2b7e7b5f3..bd186254d 100644
> --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> @@ -3977,15 +3977,14 @@ static irqreturn_t megasas_isr(int irq, void *devp)
>  {
>  	struct megasas_irq_context *irq_context = devp;
>  	struct megasas_instance *instance = irq_context->instance;
> -	unsigned long flags;
>  	irqreturn_t rc;
>  
>  	if (atomic_read(&instance->fw_reset_no_pci_access))
>  		return IRQ_HANDLED;
>  
> -	spin_lock_irqsave(&instance->hba_lock, flags);
> +	spin_lock(&instance->hba_lock);
>  	rc = megasas_deplete_reply_queue(instance, DID_OK);
> -	spin_unlock_irqrestore(&instance->hba_lock, flags);
> +	spin_unlock(&instance->hba_lock);
>  
>  	return rc;
>  }
RE: [PATCH] blk-mq: remove the calling of local_memory_node()
Thanks Michal.

Hi Raghavendra, Jens, could you help comment on this issue? Thanks in
advance.

-----Original Message-----
From: Michal Hocko [mailto:mho...@suse.com]
Sent: Monday, October 19, 2020 7:40 PM
To: tianxianting (RD)
Cc: ax...@kernel.dk; raghavendra...@linux.vnet.ibm.com; linux-bl...@vger.kernel.org; linux-kernel@vger.kernel.org
Subject: Re: [PATCH] blk-mq: remove the calling of local_memory_node()

On Mon 19-10-20 16:20:47, Xianting Tian wrote:
> We don't need to check whether a node is a memoryless NUMA node before
> calling an allocator interface. SLUB (and SLAB, SLOB) relies on the
> page allocator to pick a node. The page allocator deals with
> memoryless nodes just fine: it has zonelists constructed for each
> possible node, and it will automatically fall back to the node closest
> to the requested one, as long as __GFP_THISNODE is not enforced of
> course.
>
> The code comments of kmem_cache_alloc_node() in SLAB also state this:
>  * Fallback to other node is possible if __GFP_THISNODE is not set.
>
> blk-mq code doesn't set __GFP_THISNODE, so we can remove the calls to
> local_memory_node().

Yes, this is indeed the case. I cannot really judge the blk-mq code,
but it seems to be unnecessary. Maybe there are some subtle details not
explained by bffed457160ab though.

> Fixes: bffed457160ab ("blk-mq: Avoid memoryless numa node encoded in hctx numa_node")

But the existing code is not broken. It just overdoes what needs to be
done, so effectively bffed457160ab was not needed. I do not think that
a Fixes tag is really necessary.

> Signed-off-by: Xianting Tian
> ---
>  block/blk-mq-cpumap.c | 2 +-
>  block/blk-mq.c        | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
> index 0157f2b34..3db84d319 100644
> --- a/block/blk-mq-cpumap.c
> +++ b/block/blk-mq-cpumap.c
> @@ -89,7 +89,7 @@ int blk_mq_hw_queue_to_node(struct blk_mq_queue_map *qmap, unsigned int index)
>  
>  	for_each_possible_cpu(i) {
>  		if (index == qmap->mq_map[i])
> -			return local_memory_node(cpu_to_node(i));
> +			return cpu_to_node(i);
>  	}
>  
>  	return NUMA_NO_NODE;
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index cdced4aca..48f8366b2 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2737,7 +2737,7 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
>  		for (j = 0; j < set->nr_maps; j++) {
>  			hctx = blk_mq_map_queue_type(q, j, i);
>  			if (nr_hw_queues > 1 && hctx->numa_node == NUMA_NO_NODE)
> -				hctx->numa_node = local_memory_node(cpu_to_node(i));
> +				hctx->numa_node = cpu_to_node(i);
>  		}
>  	}
>  }
> -- 
> 2.17.1

-- 
Michal Hocko
SUSE Labs
RE: [PATCH] mm: Make allocator take care of memoryless numa node
Thanks Michal. Yes, it is commit bffed457160ab. Sorry I forgot to paste
it in my previous reply.

-----Original Message-----
From: Michal Hocko [mailto:mho...@suse.com]
Sent: Monday, October 19, 2020 3:07 PM
To: tianxianting (RD)
Cc: c...@linux.com; penb...@kernel.org; rient...@google.com; iamjoonsoo@lge.com; a...@linux-foundation.org; linux...@kvack.org; linux-kernel@vger.kernel.org; k...@kernel.org; alexei.starovoi...@gmail.com
Subject: Re: [PATCH] mm: Make allocator take care of memoryless numa node

On Sun 18-10-20 14:18:37, Tianxianting wrote:
> Thanks for the comments.
> I found that in the current code there are two places that call
> local_memory_node(node) before calling kzalloc_node(). I think we can
> remove them?

I am not sure which code you are talking about. git grep shows me two
places in blk-mq code (e.g. bffed457160ab) and that looks quite bogus
to me. Bring that up with the respective maintainer and Raghavendra.
The changelog doesn't really describe any problem, if there is any.

But from the allocator semantics point of view, memoryless nodes are to
be expected and the allocator should fall back to the proper node, as
long as __GFP_THISNODE is not enforced of course.

-- 
Michal Hocko
SUSE Labs
RE: [PATCH] mm: Make allocator take care of memoryless numa node
Thanks for the comments.

I found that in the current code there are two places that call
local_memory_node(node) before calling kzalloc_node(). I think we can
remove them?

-----Original Message-----
From: Michal Hocko [mailto:mho...@suse.com]
Sent: Monday, October 12, 2020 11:06 PM
To: tianxianting (RD)
Cc: c...@linux.com; penb...@kernel.org; rient...@google.com; iamjoonsoo@lge.com; a...@linux-foundation.org; linux...@kvack.org; linux-kernel@vger.kernel.org; k...@kernel.org; alexei.starovoi...@gmail.com
Subject: Re: [PATCH] mm: Make allocator take care of memoryless numa node

On Mon 12-10-20 16:27:39, Xianting Tian wrote:
> On architectures like powerpc, we can have CPUs without any local
> memory attached. In such cases the node does not have real memory.

Yes, this is normal (unfortunately).

> In many places the current kernel code doesn't check whether a node is
> a memoryless NUMA node before calling an allocator interface.

And that is correct. It shouldn't make any assumption about the memory
on a given node, because that memory might be depleted (similar to no
memory) or it can disappear at any moment because of memory offlining.

> This patch uses local_memory_node(), which is guaranteed to have
> memory, in the allocator interfaces. local_memory_node() is a no-op on
> architectures that don't support memoryless nodes.
>
> As the call path is:
>   alloc_pages_node
>     __alloc_pages_node
>       __alloc_pages_nodemask
> and __alloc_pages_node/__alloc_pages_nodemask may be called directly,
> only add local_memory_node() in __alloc_pages_nodemask.

The page allocator deals with memoryless nodes just fine. It has
zonelists constructed for each possible node, and it will automatically
fall back to a node which is closest to the requested node.
local_memory_node might be an incorrect choice from the topology POV.
What kind of problem are you trying to fix?
> Signed-off-by: Xianting Tian
> ---
>  include/linux/slab.h |  3 +++
>  mm/page_alloc.c      |  1 +
>  mm/slab.c            |  6 +++++-
>  mm/slob.c            |  1 +
>  mm/slub.c            | 10 ++++++++--
>  5 files changed, 18 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 24df2393e..527e811e0 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -574,6 +574,7 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
>  					flags, node, size);
>  	}
>  #endif
> +	node = local_memory_node(node);
>  	return __kmalloc_node(size, flags, node);
>  }
>  
> @@ -626,6 +627,8 @@ static inline void *kmalloc_array_node(size_t n, size_t size, gfp_t flags,
>  		return NULL;
>  	if (__builtin_constant_p(n) && __builtin_constant_p(size))
>  		return kmalloc_node(bytes, flags, node);
> +
> +	node = local_memory_node(node);
>  	return __kmalloc_node(bytes, flags, node);
>  }
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6866533de..be63c62c2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4878,6 +4878,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
>  		return NULL;
>  	}
>  
> +	preferred_nid = local_memory_node(preferred_nid);
>  	gfp_mask &= gfp_allowed_mask;
>  	alloc_mask = gfp_mask;
>  	if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
> diff --git a/mm/slab.c b/mm/slab.c
> index f658e86ec..263c2f2e1 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -3575,7 +3575,10 @@ EXPORT_SYMBOL(kmem_cache_alloc_trace);
>   */
>  void *kmem_cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid)
>  {
> -	void *ret = slab_alloc_node(cachep, flags, nodeid, _RET_IP_);
> +	void *ret;
> +
> +	nodeid = local_memory_node(nodeid);
> +	ret = slab_alloc_node(cachep, flags, nodeid, _RET_IP_);
>  
>  	trace_kmem_cache_alloc_node(_RET_IP_, ret,
>  				    cachep->object_size, cachep->size,
> @@ -3593,6 +3596,7 @@ void *kmem_cache_alloc_node_trace(struct kmem_cache *cachep,
>  {
>  	void *ret;
>  
> +	nodeid = local_memory_node(nodeid);
>  	ret = slab_alloc_node(cachep, flags, nodeid, _RET_IP_);
>  
>  	ret = kasan_kmalloc(cachep, ret, size, flags);
> diff --git a/mm/slob.c b/mm/slob.c
> index 7cc9805c8..1f1c25e06 100644
> --- a/mm/slob.c
> +++ b/mm/slob.c
> @@ -636,6 +636,7 @@ EXPORT_SYMBOL(__kmalloc_node);
>  
>  void *kmem_cache_alloc_node(struct kmem_cache *cachep, gfp_t gfp, int node)
>  {
> +	node = local_memory_node(node);
>  	return slob_alloc_node(cachep, gfp, node);

[the rest of the quoted patch was truncated in the original message]
RE: [PATCH] mm: vmscan: avoid a unnecessary reschedule in shrink_slab()
Thanks for the explanation. I got it.

-----Original Message-----
From: Michal Hocko [mailto:mho...@suse.com]
Sent: Friday, October 16, 2020 9:45 PM
To: tianxianting (RD)
Cc: a...@linux-foundation.org; linux...@kvack.org; linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm: vmscan: avoid a unnecessary reschedule in shrink_slab()

On Fri 16-10-20 13:20:41, Tianxianting wrote:
> Thanks, I understood what you said :)
> But is it proper to check for a reschedule in every loop when the lock
> is taken?

I do not see any actual problem, TBH. cond_resched is mostly there to
increase interactivity for non-preemptible kernels. It can reduce
throughput, but this is a memory reclaim path and I do not expect it to
contribute to any moderately hot paths. Direct reclaim doesn't really
count as a hot path.

-- 
Michal Hocko
SUSE Labs
RE: [PATCH] mm: vmscan: avoid a unnecessary reschedule in shrink_slab()
Thanks, I understood what you said :) But is it proper to check for a reschedule on every loop iteration when the lock is taken? By the way, I have not actually met an issue with this; I just studied this code and came up with one possible optimization based on my understanding. -Original Message- From: Michal Hocko [mailto:mho...@suse.com] Sent: Friday, October 16, 2020 9:02 PM To: tianxianting (RD) Cc: a...@linux-foundation.org; linux...@kvack.org; linux-kernel@vger.kernel.org Subject: Re: [PATCH] mm: vmscan: avoid a unnecessary reschedule in shrink_slab() On Fri 16-10-20 12:48:23, Tianxianting wrote: > Thanks, my understanding is, > In shrink_slab(), do_shrink_slab() will do the real reclaim work, which will > occupy the current cpu and consume more cpu time, so we need to trigger a > reschedule after reclaim. > But if it jumps to the 'out' label, that means we didn't do the reclaim work > this time, so it won't cause other threads to starve, and we don't need > to call cond_resched() in this case. > Is that right? You are almost right. But consider the situation when the lock is taken for quite some time. do_shrink_slab cannot make any forward progress and effectively busy loops. Unless the caller does cond_resched it might cause soft lockups. Anyway, let me try to ask again: why would this be any problem that deserves a fix? > > -Original Message- > From: Michal Hocko [mailto:mho...@suse.com] > Sent: Friday, October 16, 2020 8:08 PM > To: tianxianting (RD) > Cc: a...@linux-foundation.org; linux...@kvack.org; > linux-kernel@vger.kernel.org > Subject: Re: [PATCH] mm: vmscan: avoid a unnecessary reschedule in > shrink_slab() > > On Fri 16-10-20 11:39:52, Xianting Tian wrote: > > In shrink_slab(), it directly goes to the 'out' label only when it can't > > get the lock of shrinker_rwsem. In this case, it doesn't do the real > > work of shrinking slab, so we don't need to trigger a reschedule by > > cond_resched(). > > Your changelog doesn't explain why this is not needed or undesirable. Do you > see any actual problem? > > The point of this code is to provide a deterministic scheduling point > regardless of the shrinker_rwsem. > > > > > Signed-off-by: Xianting Tian > > --- > > mm/vmscan.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c index 466fc3144..676e97b28 > > 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -687,8 +687,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int > > nid, > > } > > > > up_read(&shrinker_rwsem); > > -out: > > + > > cond_resched(); > > +out: > > return freed; > > } > > > > -- > > 2.17.1 > > > > -- > Michal Hocko > SUSE Labs -- Michal Hocko SUSE Labs
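Michal's objection — that the trylock-failure path still needs a deterministic scheduling point, otherwise a caller can effectively busy-loop into a soft lockup — can be sketched outside the kernel. This is a minimal user-space analogue, not kernel code: the flag-based trylock, the `mock_*` names, and the return value 128 are hypothetical stand-ins, with `sched_yield()` playing the role of `cond_resched()`.

```c
#include <sched.h>

/* Hypothetical stand-in for shrinker_rwsem: a trylock-style flag. */
static int mock_lock_held;

static int mock_down_read_trylock(void)
{
	if (mock_lock_held)
		return 0;	/* contended: a "writer" holds it */
	mock_lock_held = 1;
	return 1;
}

static void mock_up_read(void)
{
	mock_lock_held = 0;
}

/* Pretend reclaim work; in the kernel this is do_shrink_slab(). */
static unsigned long do_mock_reclaim(void)
{
	return 128;
}

/*
 * Analogue of shrink_slab(): if the lock cannot be taken we skip the
 * real work, but we still yield the CPU (the kernel's cond_resched())
 * so a caller looping on this function cannot busy-loop while the
 * lock stays held elsewhere.
 */
static unsigned long mock_shrink_slab(void)
{
	unsigned long freed = 0;

	if (!mock_down_read_trylock())
		goto out;
	freed = do_mock_reclaim();
	mock_up_read();
out:
	sched_yield();	/* deterministic scheduling point on BOTH paths */
	return freed;
}
```

If `sched_yield()` were moved above the `out:` label, as the patch proposes, a thread calling `mock_shrink_slab()` in a loop while the lock is held would spin without ever yielding — the user-space analogue of the soft lockup Michal describes.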
RE: [PATCH] mm: vmscan: avoid a unnecessary reschedule in shrink_slab()
Thanks, my understanding is: in shrink_slab(), do_shrink_slab() will do the real reclaim work, which will occupy the current cpu and consume more cpu time, so we need to trigger a reschedule after reclaim. But if it jumps to the 'out' label, that means we didn't do the reclaim work this time, so it won't cause other threads to starve, and we don't need to call cond_resched() in this case. Is that right? -Original Message- From: Michal Hocko [mailto:mho...@suse.com] Sent: Friday, October 16, 2020 8:08 PM To: tianxianting (RD) Cc: a...@linux-foundation.org; linux...@kvack.org; linux-kernel@vger.kernel.org Subject: Re: [PATCH] mm: vmscan: avoid a unnecessary reschedule in shrink_slab() On Fri 16-10-20 11:39:52, Xianting Tian wrote: > In shrink_slab(), it directly goes to the 'out' label only when it can't > get the lock of shrinker_rwsem. In this case, it doesn't do the real > work of shrinking slab, so we don't need to trigger a reschedule by > cond_resched(). Your changelog doesn't explain why this is not needed or undesirable. Do you see any actual problem? The point of this code is to provide a deterministic scheduling point regardless of the shrinker_rwsem. > > Signed-off-by: Xianting Tian > --- > mm/vmscan.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/mm/vmscan.c b/mm/vmscan.c index 466fc3144..676e97b28 > 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -687,8 +687,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid, > } > > up_read(&shrinker_rwsem); > -out: > + > cond_resched(); > +out: > return freed; > } > > -- > 2.17.1 > -- Michal Hocko SUSE Labs
RE: [PATCH] IB/hfi1: Avoid allocing memory on memoryless numa node
Hi Dennis, thanks for the comments. If it depends on x86_64, I think this issue doesn't exist. Sorry to disturb you. -Original Message- From: Dennis Dalessandro [mailto:dennis.dalessan...@cornelisnetworks.com] Sent: Monday, October 12, 2020 8:37 PM To: tianxianting (RD) ; mike.marcinis...@intel.com; dennis.dalessan...@intel.com; dledf...@redhat.com; j...@ziepe.ca Cc: linux-r...@vger.kernel.org; linux-kernel@vger.kernel.org Subject: Re: [PATCH] IB/hfi1: Avoid allocing memory on memoryless numa node On 10/10/2020 4:57 AM, Xianting Tian wrote: > On architectures like powerpc, we can have cpus without any local > memory attached. In such cases the node does not have real memory. > > Use local_memory_node(), which is guaranteed to have memory. > local_memory_node() is a noop on other architectures that do not > support memoryless nodes. > > Signed-off-by: Xianting Tian > --- > drivers/infiniband/hw/hfi1/file_ops.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/infiniband/hw/hfi1/file_ops.c > b/drivers/infiniband/hw/hfi1/file_ops.c > index 8ca51e43c..79fa22cc7 100644 > --- a/drivers/infiniband/hw/hfi1/file_ops.c > +++ b/drivers/infiniband/hw/hfi1/file_ops.c > @@ -965,7 +965,7 @@ static int allocate_ctxt(struct hfi1_filedata *fd, struct > hfi1_devdata *dd, >*/ > fd->rec_cpu_num = hfi1_get_proc_affinity(dd->node); > if (fd->rec_cpu_num != -1) > - numa = cpu_to_node(fd->rec_cpu_num); > + numa = local_memory_node(cpu_to_node(fd->rec_cpu_num)); > else > numa = numa_node_id(); > ret = hfi1_create_ctxtdata(dd->pport, numa, ); > The hfi1 driver depends on X86_64. I'm not sure what this patch buys, can you expand a bit? -Denny
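The one-line pattern the patch proposes can be illustrated with a user-space mock of the two helpers. Everything below is hypothetical — especially the node tables — since on a real kernel cpu_to_node() and local_memory_node() consult firmware-described topology.

```c
#define MOCK_NR_CPUS 4

/* Hypothetical topology: CPUs 2 and 3 sit on memoryless node 1. */
static const int mock_cpu_node[MOCK_NR_CPUS] = { 0, 0, 1, 1 };
static const int mock_node_has_memory[2]    = { 1, 0 };

static int mock_cpu_to_node(int cpu)
{
	return mock_cpu_node[cpu];
}

/*
 * Analogue of local_memory_node(): return a node that actually has
 * memory.  On architectures without CONFIG_HAVE_MEMORYLESS_NODES the
 * real helper is a noop, which is why Dennis's reply (hfi1 depends
 * on X86_64) makes the change moot for that driver.
 */
static int mock_local_memory_node(int node)
{
	return mock_node_has_memory[node] ? node : 0;
}

/* The call-site pattern the patch introduces. */
static int mock_alloc_node_for_cpu(int cpu)
{
	return mock_local_memory_node(mock_cpu_to_node(cpu));
}
```

With the mock tables above, an allocation hinted at CPU 2 is redirected from memoryless node 1 to node 0, which is the entire effect of wrapping `cpu_to_node()` in `local_memory_node()`.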
RE: [PATCH] bpf: Avoid allocing memory on memoryless numa node
Thanks Alexei for your suggestion, I will try to do it. -Original Message- From: Alexei Starovoitov [mailto:alexei.starovoi...@gmail.com] Sent: Monday, October 12, 2020 9:21 AM To: tianxianting (RD) Cc: Alexei Starovoitov ; Daniel Borkmann ; David S. Miller ; Jakub Kicinski ; Jesper Dangaard Brouer ; John Fastabend ; Martin KaFai Lau ; Song Liu ; Yonghong Song ; Andrii Nakryiko ; KP Singh ; Network Development ; bpf ; LKML Subject: Re: [PATCH] bpf: Avoid allocing memory on memoryless numa node On Sat, Oct 10, 2020 at 1:55 AM Xianting Tian wrote: > > On architectures like powerpc, we can have cpus without any local > memory attached. In such cases the node does not have real memory. > > Use local_memory_node(), which is guaranteed to have memory. > local_memory_node() is a noop on other architectures that do not > support memoryless nodes. ... > /* Have map->numa_node, but choose node of redirect target CPU */ > - numa = cpu_to_node(cpu); > + numa = local_memory_node(cpu_to_node(cpu)); There are so many calls to cpu_to_node() throughout the kernel. Are you going to convert all of them, one patch at a time, to the above sequence? Why not handle this CONFIG_HAVE_MEMORYLESS_NODES case in cpu_to_node() instead, and save the churn?
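Alexei's suggestion — push the fallback into the lookup helper itself so none of the call sites change — might be sketched as below. The `MOCK_*` names are invented; the kernel already exposes a helper in this spirit, `cpu_to_mem()` (paired with `numa_mem_id()`), which resolves to a node with memory when `CONFIG_HAVE_MEMORYLESS_NODES` is enabled.

```c
#define MOCK_HAVE_MEMORYLESS_NODES 1	/* analogue of the Kconfig option */

/* Hypothetical topology tables, same shape as the previous sketch. */
static const int mock_raw_node[4]  = { 0, 0, 1, 1 };
static const int mock_node_mem[2]  = { 1, 0 };

static int mock_raw_cpu_to_node(int cpu)
{
	return mock_raw_node[cpu];
}

/*
 * Centralized variant: one wrapper hides the memoryless-node fallback,
 * so drivers keep calling a single helper instead of sprinkling
 * local_memory_node() over every allocation site.
 */
static int mock_cpu_to_mem(int cpu)
{
	int node = mock_raw_cpu_to_node(cpu);
#ifdef MOCK_HAVE_MEMORYLESS_NODES
	if (!mock_node_mem[node])
		node = 0;	/* nearest node with memory, hypothetically */
#endif
	return node;
}
```

The design point is the one Alexei makes: with the fallback compiled into the helper, architectures that cannot have memoryless nodes pay nothing, and the hundreds of `cpu_to_node()` call sites need no churn.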
RE: [PATCH] net: Avoid allocing memory on memoryless numa node
Hi Jakub, thanks for your suggestion; let me try it :-) -Original Message- From: Jakub Kicinski [mailto:k...@kernel.org] Sent: Monday, October 12, 2020 3:18 AM To: tianxianting (RD) Cc: da...@davemloft.net; net...@vger.kernel.org; linux-kernel@vger.kernel.org Subject: Re: [PATCH] net: Avoid allocing memory on memoryless numa node On Sun, 11 Oct 2020 12:11:40 +0800 Xianting Tian wrote: > On architectures like powerpc, we can have cpus without any local > memory attached. In such cases the node does not have real memory. > > Use local_memory_node(), which is guaranteed to have memory. > local_memory_node() is a noop on other architectures that do not > support memoryless nodes. > > Signed-off-by: Xianting Tian > --- > net/core/dev.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/core/dev.c b/net/core/dev.c index > 266073e30..dcb4533ef 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -2590,7 +2590,7 @@ static struct xps_map *expand_xps_map(struct xps_map > *map, int attr_index, > new_map = kzalloc(XPS_MAP_SIZE(alloc_len), GFP_KERNEL); > else > new_map = kzalloc_node(XPS_MAP_SIZE(alloc_len), GFP_KERNEL, > - cpu_to_node(attr_index)); > + local_memory_node(cpu_to_node(attr_index))); > if (!new_map) > return NULL; > Are we going to patch all kmalloc_node() callers now to apply local_memory_node()? Can't the allocator take care of this?
RE: [PATCH] fs: use correct parameter in notes of generic_file_llseek_size()
Sorry Randy, I used my cellphone to send the previous mail, it was html. -Original Message- From: Randy Dunlap [mailto:rdun...@infradead.org] Sent: Thursday, October 08, 2020 12:50 AM To: tianxianting (RD) ; v...@zeniv.linux.org.uk Cc: linux-fsde...@vger.kernel.org; linux-kernel@vger.kernel.org Subject: Re: [PATCH] fs: use correct parameter in notes of generic_file_llseek_size() On 10/7/20 8:20 AM, Tianxianting wrote: > hi, > > thanks Randy > > I checked the latest code; it seems this patch has not been applied yet. Hi-- Please don't send html email. I'm pretty sure that the mailing list has dropped (discarded) your email because it was html. Probably only Al and I received your email. Al- Would you prefer that fs/ documentation patches go to someone else for merging? maybe Andrew? Thanks. PS: I can't tell if I am writing an html email or not... :( > > From: Randy Dunlap > Sent: Friday, September 11, 2020 10:57:24 AM > To: tianxianting (RD); v...@zeniv.linux.org.uk > Cc: linux-fsde...@vger.kernel.org; linux-kernel@vger.kernel.org > Subject: Re: [PATCH] fs: use correct parameter in notes of > generic_file_llseek_size() > > On 9/10/20 7:06 PM, Tianxianting wrote: >> Hi viro, >> Could I get your feedback? >> This patch fixed the build warning, I think it can be applied, thanks >> :) >> >> -Original Message- >> From: tianxianting (RD) >> Sent: Saturday, September 05, 2020 3:15 PM >> To: v...@zeniv.linux.org.uk >> Cc: linux-fsde...@vger.kernel.org; linux-kernel@vger.kernel.org; >> tianxianting (RD) >> Subject: [PATCH] fs: use correct parameter in notes of >> generic_file_llseek_size() >> >> Fix warning when compiling with W=1: >> fs/read_write.c:88: warning: Function parameter or member 'maxsize' not >> described in 'generic_file_llseek_size' >> fs/read_write.c:88: warning: Excess function parameter 'size' description in >> 'generic_file_llseek_size' >> >> Signed-off-by: Xianting Tian > > Acked-by: Randy Dunlap > > Thanks. > >> --- >> fs/read_write.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/fs/read_write.c b/fs/read_write.c index >> 5db58b8c7..058563ee2 100644 >> --- a/fs/read_write.c >> +++ b/fs/read_write.c >> @@ -71,7 +71,7 @@ EXPORT_SYMBOL(vfs_setpos); >> * @file: file structure to seek on >> * @offset: file offset to seek to >> * @whence: type of seek >> - * @size: max size of this file in file system >> + * @maxsize: max size of this file in file system >> * @eof: offset used for SEEK_END position >> * >> * This is a variant of generic_file_llseek that allows passing in a >> custom >> > > > -- > ~Randy > -- ~Randy
RE: [PATCH] [v3] blk-mq: add cond_resched() in __blk_mq_alloc_rq_maps()
Thank you, Bart. Hi Jens, is this version of the patch OK for you? Looking forward to your further comments :) -Original Message- From: Bart Van Assche [mailto:bvanass...@acm.org] Sent: Saturday, September 26, 2020 12:07 PM To: tianxianting (RD) ; ax...@kernel.dk Cc: linux-bl...@vger.kernel.org; linux-kernel@vger.kernel.org Subject: Re: [PATCH] [v3] blk-mq: add cond_resched() in __blk_mq_alloc_rq_maps() On 2020-09-25 19:39, Xianting Tian wrote: > diff --git a/block/blk-mq.c b/block/blk-mq.c index > b3d2785ee..62d152d03 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -3256,9 +3256,11 @@ static int __blk_mq_alloc_rq_maps(struct > blk_mq_tag_set *set) { > int i; > > - for (i = 0; i < set->nr_hw_queues; i++) > + for (i = 0; i < set->nr_hw_queues; i++) { > if (!__blk_mq_alloc_map_and_request(set, i)) > goto out_unwind; > + cond_resched(); > + } > > return 0; Reviewed-by: Bart Van Assche
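The shape of the v3 fix — yielding once per hardware-queue allocation instead of holding the CPU through all of them — can be mocked in user space. A sketch under assumptions: the queue counts, the `mock_*` names, and `sched_yield()` (standing in for `cond_resched()`) are illustrative only, not the real blk-mq implementation.

```c
#include <sched.h>
#include <stdlib.h>

#define MOCK_NR_HW_QUEUES 16
#define MOCK_QUEUE_DEPTH  1024

static void *mock_rq_map[MOCK_NR_HW_QUEUES];

/* Stand-in for __blk_mq_alloc_map_and_request(): one per-queue alloc.
 * Returns nonzero on success, matching the "if (!...) goto out_unwind"
 * pattern in the quoted diff. */
static int mock_alloc_map_and_request(int i)
{
	mock_rq_map[i] = calloc(MOCK_QUEUE_DEPTH, sizeof(void *));
	return mock_rq_map[i] != NULL;
}

/*
 * Analogue of __blk_mq_alloc_rq_maps() after the patch: each iteration
 * took 140-250us in the reported measurements, so yield between
 * iterations instead of staying on the CPU for the whole ~3ms.
 */
static int mock_alloc_rq_maps(void)
{
	int i;

	for (i = 0; i < MOCK_NR_HW_QUEUES; i++) {
		if (!mock_alloc_map_and_request(i))
			goto out_unwind;
		sched_yield();	/* the kernel's cond_resched() */
	}
	return 0;

out_unwind:
	while (--i >= 0)
		free(mock_rq_map[i]);
	return -1;	/* -ENOMEM in the kernel */
}
```

Note the unwind path frees only the maps allocated so far, mirroring the `out_unwind` label in the real function.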
RE: [PATCH] [v2] blk-mq: add cond_resched() in __blk_mq_alloc_rq_maps()
Hi Jens, thanks a lot for the comments. I think it is not a hot path; it is only called at system startup or during device hot-plugging. So I submitted the v3 patch for your review :) https://lkml.org/lkml/2020/9/25/1543 -Original Message- From: Jens Axboe [mailto:ax...@kernel.dk] Sent: Saturday, September 26, 2020 3:26 AM To: tianxianting (RD) Cc: linux-bl...@vger.kernel.org; linux-kernel@vger.kernel.org Subject: Re: [PATCH] [v2] blk-mq: add cond_resched() in __blk_mq_alloc_rq_maps() On 9/17/20 2:13 AM, Xianting Tian wrote: > We found blk_mq_alloc_rq_maps() takes more time in kernel space > when testing nvme hot-plugging. The test and analysis are as below. > > Debug code, > 1, blk_mq_alloc_rq_maps(): > u64 start, end; > depth = set->queue_depth; > start = ktime_get_ns(); > pr_err("[%d:%s switch:%ld,%ld] queue depth %d, nr_hw_queues %d\n", > current->pid, current->comm, current->nvcsw, > current->nivcsw, > set->queue_depth, set->nr_hw_queues); > do { > err = __blk_mq_alloc_rq_maps(set); > if (!err) > break; > > set->queue_depth >>= 1; > if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN) { > err = -ENOMEM; > break; > } > } while (set->queue_depth); > end = ktime_get_ns(); > pr_err("[%d:%s switch:%ld,%ld] all hw queues init cost time %lld > ns\n", > current->pid, current->comm, > current->nvcsw, current->nivcsw, end - start); > > 2, __blk_mq_alloc_rq_maps(): > u64 start, end; > for (i = 0; i < set->nr_hw_queues; i++) { > start = ktime_get_ns(); > if (!__blk_mq_alloc_rq_map(set, i)) > goto out_unwind; > end = ktime_get_ns(); > pr_err("hw queue %d init cost time %lld\n", i, end - start); > } > > Test nvme hot-plugging with the above debug code: we found it totally costs > more than 3ms in kernel space without being scheduled out when allocating > rqs for all > 16 hw queues with depth 1024; each hw queue costs about 140-250us. The > time cost will increase as the hw queue number and queue depth > increase.
And if __blk_mq_alloc_rq_maps() returns -ENOMEM, it will > try "queue_depth >>= 1", more time will be consumed. > [ 428.428771] nvme nvme0: pci function 1:01:00.0 > [ 428.428798] nvme 1:01:00.0: enabling device ( -> 0002) > [ 428.428806] pcieport 1:00:00.0: can't derive routing for PCI INT > A > [ 428.428809] nvme 1:01:00.0: PCI INT A: no GSI > [ 432.593374] [4688:kworker/u33:8 switch:663,2] queue depth 30, > nr_hw_queues 1 > [ 432.593404] hw queue 0 init cost time 22883 ns > [ 432.593408] [4688:kworker/u33:8 switch:663,2] all hw queues init > cost time 35960 ns > [ 432.595953] nvme nvme0: 16/0/0 default/read/poll queues > [ 432.595958] [4688:kworker/u33:8 switch:700,2] queue depth 1023, > nr_hw_queues 16 > [ 432.596203] hw queue 0 init cost time 242630 ns > [ 432.596441] hw queue 1 init cost time 235913 ns > [ 432.596659] hw queue 2 init cost time 216461 ns > [ 432.596877] hw queue 3 init cost time 215851 ns > [ 432.597107] hw queue 4 init cost time 228406 ns > [ 432.597336] hw queue 5 init cost time 227298 ns > [ 432.597564] hw queue 6 init cost time 224633 ns > [ 432.597785] hw queue 7 init cost time 219954 ns > [ 432.597937] hw queue 8 init cost time 150930 ns > [ 432.598082] hw queue 9 init cost time 143496 ns > [ 432.598231] hw queue 10 init cost time 147261 ns > [ 432.598397] hw queue 11 init cost time 164522 ns > [ 432.598542] hw queue 12 init cost time 143401 ns > [ 432.598692] hw queue 13 init cost time 148934 ns > [ 432.598841] hw queue 14 init cost time 147194 ns > [ 432.598991] hw queue 15 init cost time 148942 ns > [ 432.598993] [4688:kworker/u33:8 switch:700,2] all hw queues init > cost time 3035099 ns > [ 432.602611] nvme0n1: p1 > > So use this patch to trigger schedule between each hw queue init, to > avoid other threads getting stuck. We call cond_resched() only when > "queue depth >= 512". We are not in atomic context when executing > __blk_mq_alloc_rq_maps(), so it is safe to call cond_resched(). 
> > Signed-off-by: Xianting Tian > --- > block/blk-mq.c |
RE: [PATCH] [v2] blk-mq: add cond_resched() in __blk_mq_alloc_rq_maps()
Hi Jens, could I get your comments on this issue? Is it a problem? Maybe my understanding is superficial; I would highly appreciate it if you could comment a bit. -Original Message- From: tianxianting (RD) Sent: Thursday, September 17, 2020 4:13 PM To: ax...@kernel.dk Cc: linux-bl...@vger.kernel.org; linux-kernel@vger.kernel.org; tianxianting (RD) Subject: [PATCH] [v2] blk-mq: add cond_resched() in __blk_mq_alloc_rq_maps() We found blk_mq_alloc_rq_maps() takes more time in kernel space when testing nvme hot-plugging. The test and analysis are as below. Debug code, 1, blk_mq_alloc_rq_maps(): u64 start, end; depth = set->queue_depth; start = ktime_get_ns(); pr_err("[%d:%s switch:%ld,%ld] queue depth %d, nr_hw_queues %d\n", current->pid, current->comm, current->nvcsw, current->nivcsw, set->queue_depth, set->nr_hw_queues); do { err = __blk_mq_alloc_rq_maps(set); if (!err) break; set->queue_depth >>= 1; if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN) { err = -ENOMEM; break; } } while (set->queue_depth); end = ktime_get_ns(); pr_err("[%d:%s switch:%ld,%ld] all hw queues init cost time %lld ns\n", current->pid, current->comm, current->nvcsw, current->nivcsw, end - start); 2, __blk_mq_alloc_rq_maps(): u64 start, end; for (i = 0; i < set->nr_hw_queues; i++) { start = ktime_get_ns(); if (!__blk_mq_alloc_rq_map(set, i)) goto out_unwind; end = ktime_get_ns(); pr_err("hw queue %d init cost time %lld\n", i, end - start); } Test nvme hot-plugging with the above debug code: we found it totally costs more than 3ms in kernel space without being scheduled out when allocating rqs for all 16 hw queues with depth 1024; each hw queue costs about 140-250us. The time cost will increase as the hw queue number and queue depth increase. And if __blk_mq_alloc_rq_maps() returns -ENOMEM, it will try "queue_depth >>= 1", and more time will be consumed.
[ 428.428771] nvme nvme0: pci function 1:01:00.0 [ 428.428798] nvme 1:01:00.0: enabling device ( -> 0002) [ 428.428806] pcieport 1:00:00.0: can't derive routing for PCI INT A [ 428.428809] nvme 1:01:00.0: PCI INT A: no GSI [ 432.593374] [4688:kworker/u33:8 switch:663,2] queue depth 30, nr_hw_queues 1 [ 432.593404] hw queue 0 init cost time 22883 ns [ 432.593408] [4688:kworker/u33:8 switch:663,2] all hw queues init cost time 35960 ns [ 432.595953] nvme nvme0: 16/0/0 default/read/poll queues [ 432.595958] [4688:kworker/u33:8 switch:700,2] queue depth 1023, nr_hw_queues 16 [ 432.596203] hw queue 0 init cost time 242630 ns [ 432.596441] hw queue 1 init cost time 235913 ns [ 432.596659] hw queue 2 init cost time 216461 ns [ 432.596877] hw queue 3 init cost time 215851 ns [ 432.597107] hw queue 4 init cost time 228406 ns [ 432.597336] hw queue 5 init cost time 227298 ns [ 432.597564] hw queue 6 init cost time 224633 ns [ 432.597785] hw queue 7 init cost time 219954 ns [ 432.597937] hw queue 8 init cost time 150930 ns [ 432.598082] hw queue 9 init cost time 143496 ns [ 432.598231] hw queue 10 init cost time 147261 ns [ 432.598397] hw queue 11 init cost time 164522 ns [ 432.598542] hw queue 12 init cost time 143401 ns [ 432.598692] hw queue 13 init cost time 148934 ns [ 432.598841] hw queue 14 init cost time 147194 ns [ 432.598991] hw queue 15 init cost time 148942 ns [ 432.598993] [4688:kworker/u33:8 switch:700,2] all hw queues init cost time 3035099 ns [ 432.602611] nvme0n1: p1 So use this patch to trigger schedule between each hw queue init, to avoid other threads getting stuck. We call cond_resched() only when "queue depth >= 512". We are not in atomic context when executing __blk_mq_alloc_rq_maps(), so it is safe to call cond_resched(). 
Signed-off-by: Xianting Tian --- block/blk-mq.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index b3d2785ee..5a71fe53a 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -3255,11 +3255,16 @@ void blk_mq_exit_queue(struct request_queue *q) static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set) { int i; + unsigned int depth = set->queue_depth; - for (i = 0; i < set->nr_hw_queues; i++) + for (i = 0; i < set->nr_hw_queues;
RE: [PATCH] [v2] nvme: replace meaningless judgement by checking whether req is null
Thank you Keith, Christoph. So I don't need to send a v3 patch? -Original Message- From: Christoph Hellwig [mailto:h...@lst.de] Sent: Tuesday, September 22, 2020 10:59 PM To: Keith Busch Cc: tianxianting (RD) ; ax...@fb.com; h...@lst.de; s...@grimberg.me; linux-n...@lists.infradead.org; linux-kernel@vger.kernel.org Subject: Re: [PATCH] [v2] nvme: replace meaningless judgement by checking whether req is null On Tue, Sep 22, 2020 at 07:57:05AM -0700, Keith Busch wrote: > The commit subject is a bit too long. We should really try to keep these > to 50 characters or less. > > nvme-pci: fix NULL req in completion handler > > Otherwise, looks fine. > > Reviewed-by: Keith Busch Yes. I was about to apply it with a similar edit, but I'll take yours happily.
RE: [PATCH] [v2] nvme: replace meaningless judgement by checking whether req is null
Finally, it is applied :) Thanks again for all your kind guidance. -Original Message- From: Christoph Hellwig [mailto:h...@lst.de] Sent: Tuesday, September 22, 2020 11:41 PM To: tianxianting (RD) Cc: Christoph Hellwig ; Keith Busch ; ax...@fb.com; s...@grimberg.me; linux-n...@lists.infradead.org; linux-kernel@vger.kernel.org Subject: Re: [PATCH] [v2] nvme: replace meaningless judgement by checking whether req is null On Tue, Sep 22, 2020 at 03:27:27PM +, Tianxianting wrote: > Thank you Keith, Christoph, > So I don't need to send a v3 patch? No, it is all fine. I've already applied it locally.
RE: [PATCH] nvme: replace meaningless judgement by checking whether req is null
I get it now, thanks Keith :) -Original Message- From: Keith Busch [mailto:kbu...@kernel.org] Sent: Monday, September 21, 2020 11:59 PM To: tianxianting (RD) Cc: ax...@fb.com; h...@lst.de; s...@grimberg.me; linux-n...@lists.infradead.org; linux-kernel@vger.kernel.org Subject: Re: [PATCH] nvme: replace meaningless judgement by checking whether req is null On Mon, Sep 21, 2020 at 03:49:09PM +, Tianxianting wrote: > Hi Keith, > Thanks for your comments, > I will submit a new patch of version 2 for further review; the v2 patch > will contain: > 1, retain the existing judgement and dev_warn; No no, go ahead and remove the existing check just as you've done. That check becomes redundant with the safer one you're adding, and we don't want redundant checks in the fast path. My only suggestion is to use the same dev_warn() in your new check. > 2, add the check of whether req is null (already done in this patch); 3, > simplify the changelog and make it succinct, per your "This is what I'm thinking:". > Is that right? > Should I retain the nvme_irq crash log in the changelog and mention the difference > between nvmeq->q_depth and the tagset queue_depth? The tagset's queue_depth is a valid point to mention as well. The driver's current indirect check is not necessarily in sync with the actual tagset.
> Thanks > > -Original Message- > From: Keith Busch [mailto:kbu...@kernel.org] > Sent: Monday, September 21, 2020 11:08 PM > To: tianxianting (RD) > Cc: ax...@fb.com; h...@lst.de; s...@grimberg.me; > linux-n...@lists.infradead.org; linux-kernel@vger.kernel.org > Subject: Re: [PATCH] nvme: replace meaningless judgement by checking > whether req is null > > On Mon, Sep 21, 2020 at 10:10:52AM +0800, Xianting Tian wrote: > > @@ -940,13 +940,6 @@ static inline void nvme_handle_cqe(struct nvme_queue > > *nvmeq, u16 idx) > > struct nvme_completion *cqe = >cqes[idx]; > > struct request *req; > > > > - if (unlikely(cqe->command_id >= nvmeq->q_depth)) { > > - dev_warn(nvmeq->dev->ctrl.device, > > - "invalid id %d completed on queue %d\n", > > - cqe->command_id, le16_to_cpu(cqe->sq_id)); > > - return; > > - } > > - > > /* > > * AEN requests are special as they don't time out and can > > * survive any kind of queue freeze and often don't respond to @@ > > -960,6 +953,13 @@ static inline void nvme_handle_cqe(struct nvme_queue > > *nvmeq, u16 idx) > > } > > > > req = blk_mq_tag_to_rq(nvme_queue_tagset(nvmeq), cqe->command_id); > > + if (unlikely(!req)) { > > + dev_warn(nvmeq->dev->ctrl.device, > > + "req is null for tag %d completed on queue %d\n", > > + cqe->command_id, le16_to_cpu(cqe->sq_id)); > > + return; > > + } > > This is making sense now, though I think we should retain the existing > dev_warn() since it's still accurate and provides continuity for people who > are used to looking for these sorts of messages. > > Your changelog is a bit much though. I think we can say it a bit more > succinctly. This is what I'm thinking: > > The driver registers interrupts for queues before initializing the > tagset because it uses the number of successful request_irq() calls > to configure the tagset parameters. This allows a race condition with > the current tag validity check if the controller happens to produce > an interrupt with a corrupted CQE before the tagset is initialized. 
> > Replace the driver's indirect tag check with the one already provided > by the block layer.
RE: [PATCH] nvme: replace meaningless judgement by checking whether req is null
Hi Keith, thanks for your comments. I will submit a new patch of version 2 for further review; the v2 patch will contain: 1, retain the existing judgement and dev_warn; 2, add the check of whether req is null (already done in this patch); 3, simplify the changelog and make it succinct, per your "This is what I'm thinking:". Is that right? Should I retain the nvme_irq crash log in the changelog and mention the difference between nvmeq->q_depth and the tagset queue_depth? Thanks -Original Message- From: Keith Busch [mailto:kbu...@kernel.org] Sent: Monday, September 21, 2020 11:08 PM To: tianxianting (RD) Cc: ax...@fb.com; h...@lst.de; s...@grimberg.me; linux-n...@lists.infradead.org; linux-kernel@vger.kernel.org Subject: Re: [PATCH] nvme: replace meaningless judgement by checking whether req is null On Mon, Sep 21, 2020 at 10:10:52AM +0800, Xianting Tian wrote: > @@ -940,13 +940,6 @@ static inline void nvme_handle_cqe(struct nvme_queue > *nvmeq, u16 idx) > struct nvme_completion *cqe = &nvmeq->cqes[idx]; > struct request *req; > > - if (unlikely(cqe->command_id >= nvmeq->q_depth)) { > - dev_warn(nvmeq->dev->ctrl.device, > - "invalid id %d completed on queue %d\n", > - cqe->command_id, le16_to_cpu(cqe->sq_id)); > - return; > - } > - > /* > * AEN requests are special as they don't time out and can > * survive any kind of queue freeze and often don't respond to @@ -960,6 +953,13 @@ static inline void nvme_handle_cqe(struct nvme_queue > *nvmeq, u16 idx) > } > > req = blk_mq_tag_to_rq(nvme_queue_tagset(nvmeq), cqe->command_id); > + if (unlikely(!req)) { > + dev_warn(nvmeq->dev->ctrl.device, > + "req is null for tag %d completed on queue %d\n", > + cqe->command_id, le16_to_cpu(cqe->sq_id)); > + return; > + } This is making sense now, though I think we should retain the existing dev_warn() since it's still accurate and provides continuity for people who are used to looking for these sorts of messages. Your changelog is a bit much though. I think we can say it a bit more succinctly.
This is what I'm thinking: The driver registers interrupts for queues before initializing the tagset because it uses the number of successful request_irq() calls to configure the tagset parameters. This allows a race condition with the current tag validity check if the controller happens to produce an interrupt with a corrupted CQE before the tagset is initialized. Replace the driver's indirect tag check with the one already provided by the block layer.
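Keith's point — that `blk_mq_tag_to_rq()` already returns NULL for a tag the tagset does not cover, making it stricter than comparing against `nvmeq->q_depth` — can be mocked with a fixed lookup table. The struct and the depths below are hypothetical (a 30-tag tagset behind a 32-deep queue, the same kind of mismatch discussed elsewhere in this thread); the real `blk_mq_tag_to_rq()` walks the tagset's static request array.

```c
#include <stddef.h>

#define MOCK_TAGSET_DEPTH 30	/* tagset depth, smaller than q_depth */
#define MOCK_Q_DEPTH      32	/* hardware queue depth */

struct mock_request {
	int tag;
};

static struct mock_request mock_rqs[MOCK_TAGSET_DEPTH];

/*
 * Analogue of blk_mq_tag_to_rq(): NULL for any command_id the tagset
 * does not cover.  A bound check against MOCK_Q_DEPTH (what the
 * removed driver check used) would wrongly accept tags 30 and 31.
 */
static struct mock_request *mock_tag_to_rq(unsigned int tag)
{
	if (tag >= MOCK_TAGSET_DEPTH)
		return NULL;
	return &mock_rqs[tag];
}

/* The completion-path pattern the patch adopts. */
static int mock_handle_cqe(unsigned int command_id)
{
	struct mock_request *req = mock_tag_to_rq(command_id);

	if (req == NULL)
		return -1;	/* corrupted CQE: warn and bail out */
	req->tag = (int)command_id;
	return 0;
}
```

A corrupted command_id of 31 would pass the old `>= q_depth` check yet still index past the allocated requests; the NULL check catches it.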
Recall: [PATCH] [v2] nvme: use correct upper limit for tag in nvme_handle_cqe()
tianxianting (RD) would like to recall the message "[PATCH] [v2] nvme: use correct upper limit for tag in nvme_handle_cqe()".
RE: [PATCH] [v2] nvme: use correct upper limit for tag in nvme_handle_cqe()
Hi Keith, I found an extreme case: in blk_mq_alloc_map_and_requests(), the tagset depth is adjusted by 'set->queue_depth >>= 1' if there is not enough memory for the rqs. If this happens, the real available number of tags (nr_tags) is much smaller than nvmeq->q_depth. So the judgement "if (unlikely(cqe->command_id >= nvmeq->q_depth))" in nvme_handle_cqe() is really meaningless. I figured out a new patch, which replaces the meaningless judgement with a check of whether the returned req is null; if it is null, it directly returns. Would you please spare several minutes to review the new patch? Thanks https://lkml.org/lkml/2020/9/20/400 -Original Message- From: tianxianting (RD) Sent: Saturday, September 19, 2020 11:15 AM To: 'Keith Busch' Cc: ax...@fb.com; h...@lst.de; s...@grimberg.me; linux-n...@lists.infradead.org; linux-kernel@vger.kernel.org Subject: RE: [PATCH] [v2] nvme: use correct upper limit for tag in nvme_handle_cqe() Hi Keith, Thanks a lot for your comments, I will try to figure out a safe fix for this issue, then send it for your review :) -Original Message- From: Keith Busch [mailto:kbu...@kernel.org] Sent: Saturday, September 19, 2020 3:21 AM To: tianxianting (RD) Cc: ax...@fb.com; h...@lst.de; s...@grimberg.me; linux-n...@lists.infradead.org; linux-kernel@vger.kernel.org Subject: Re: [PATCH] [v2] nvme: use correct upper limit for tag in nvme_handle_cqe() On Fri, Sep 18, 2020 at 06:44:20PM +0800, Xianting Tian wrote: > @@ -940,7 +940,9 @@ static inline void nvme_handle_cqe(struct nvme_queue > *nvmeq, u16 idx) > struct nvme_completion *cqe = &nvmeq->cqes[idx]; > struct request *req; > > - if (unlikely(cqe->command_id >= nvmeq->q_depth)) { > + if (unlikely(cqe->command_id >= > + nvmeq->qid ? nvmeq->dev->tagset.queue_depth : > + nvmeq->dev->admin_tagset.queue_depth)) { Both of these values are set before blk_mq_alloc_tag_set(), so you still have a race.
The interrupt handler probably just shouldn't be registered with the queue before the tagset is initialized since there can't be any work for the handler to do before that happens anyway. The controller is definitely broken, though, and will lead to unavoidable corruption if it's really behaving this way.
RE: [PATCH] [v2] nvme: use correct upper limit for tag in nvme_handle_cqe()
Hi, I tested and traced the init flow of the nvme admin queue and io queues on kernel 5.6. Currently, the code uses nvmeq->q_depth as the upper limit for the tag in nvme_handle_cqe(); according to the init flow below, we already have the race today. Admin queue init flow: 1, set nvmeq->q_depth to 32 for the admin queue; 2, register the irq handler (nvme_irq) for admin queue 0; 3, set admin_tagset.queue_depth to 30 and alloc rqs; 4, nvme irq happens on the admin queue. IO queue init flow: 1, set nvmeq->q_depth to 1024 for io queues 1~16; 2, register the irq handler (nvme_irq) for io queues 1~16; 3, set tagset.queue_depth to 1023 and alloc rqs; 4, nvme irq happens on the io queues. So we have two issues to fix: 1, register the interrupt handler (nvme_irq) after tagset init (step 3 above) is done, to avoid a race; 2, use the correct upper limit (queue_depth in the tagset) for the tag in nvme_handle_cqe(), which is the issue solved by this patch I submitted. Is that right? Thanks a lot. -Original Message- From: tianxianting (RD) Sent: Saturday, September 19, 2020 11:15 AM To: 'Keith Busch' Cc: ax...@fb.com; h...@lst.de; s...@grimberg.me; linux-n...@lists.infradead.org; linux-kernel@vger.kernel.org Subject: RE: [PATCH] [v2] nvme: use correct upper limit for tag in nvme_handle_cqe() Hi Keith, Thanks a lot for your comments, I will try to figure out a safe fix for this issue, then send it for your review :) -Original Message- From: Keith Busch [mailto:kbu...@kernel.org] Sent: Saturday, September 19, 2020 3:21 AM To: tianxianting (RD) Cc: ax...@fb.com; h...@lst.de; s...@grimberg.me; linux-n...@lists.infradead.org; linux-kernel@vger.kernel.org Subject: Re: [PATCH] [v2] nvme: use correct upper limit for tag in nvme_handle_cqe() On Fri, Sep 18, 2020 at 06:44:20PM +0800, Xianting Tian wrote: > @@ -940,7 +940,9 @@ static inline void nvme_handle_cqe(struct nvme_queue > *nvmeq, u16 idx) > struct nvme_completion *cqe = &nvmeq->cqes[idx]; > struct request *req; > > - if (unlikely(cqe->command_id >= nvmeq->q_depth)) { > + if
(unlikely(cqe->command_id >= > + nvmeq->qid ? nvmeq->dev->tagset.queue_depth : > + nvmeq->dev->admin_tagset.queue_depth)) { Both of these values are set before blk_mq_alloc_tag_set(), so you still have a race. The interrupt handler probably just shouldn't be registered with the queue before the tagset is initialized since there can't be any work for the handler to do before that happens anyway. The controller is definitely broken, though, and will lead to unavoidable corruption if it's really behaving this way.
RE: [PATCH] [v2] nvme: use correct upper limit for tag in nvme_handle_cqe()
Hi Keith, thanks a lot for your comments. I will try to figure out a safe fix for this issue, then send it for your review :) -Original Message- From: Keith Busch [mailto:kbu...@kernel.org] Sent: Saturday, September 19, 2020 3:21 AM To: tianxianting (RD) Cc: ax...@fb.com; h...@lst.de; s...@grimberg.me; linux-n...@lists.infradead.org; linux-kernel@vger.kernel.org Subject: Re: [PATCH] [v2] nvme: use correct upper limit for tag in nvme_handle_cqe() On Fri, Sep 18, 2020 at 06:44:20PM +0800, Xianting Tian wrote: > @@ -940,7 +940,9 @@ static inline void nvme_handle_cqe(struct nvme_queue > *nvmeq, u16 idx) > struct nvme_completion *cqe = &nvmeq->cqes[idx]; > struct request *req; > > - if (unlikely(cqe->command_id >= nvmeq->q_depth)) { > + if (unlikely(cqe->command_id >= > + nvmeq->qid ? nvmeq->dev->tagset.queue_depth : > + nvmeq->dev->admin_tagset.queue_depth)) { Both of these values are set before blk_mq_alloc_tag_set(), so you still have a race. The interrupt handler probably just shouldn't be registered with the queue before the tagset is initialized since there can't be any work for the handler to do before that happens anyway. The controller is definitely broken, though, and will lead to unavoidable corruption if it's really behaving this way.
RE: [PATCH] nvme: use correct upper limit for tag in nvme_handle_cqe()
I am very sorry, Please ignore this patch, the code need to rewrite. -Original Message- From: tianxianting (RD) Sent: Friday, September 18, 2020 3:45 PM To: kbu...@kernel.org; ax...@fb.com; h...@lst.de; s...@grimberg.me Cc: linux-n...@lists.infradead.org; linux-kernel@vger.kernel.org; tianxianting (RD) Subject: [PATCH] nvme: use correct upper limit for tag in nvme_handle_cqe() We met a crash issue when hot-insert a nvme device, blk_mq_tag_to_rq() returned null(req=null), then crash happened in nvme_end_request(): req = blk_mq_tag_to_rq(); struct nvme_request *rq = nvme_req(req); //rq = req + 1 rq->result = result; <==crash here!!! [ 1124.256246] nvme nvme5: pci function :e1:00.0 [ 1124.256323] nvme :e1:00.0: enabling device ( -> 0002) [ 1125.720859] nvme nvme5: 96/0/0 default/read/poll queues [ 1125.732483] nvme5n1: p1 p2 p3 [ 1125.788049] BUG: unable to handle kernel NULL pointer dereference at 0130 [ 1125.788054] PGD 0 P4D 0 [ 1125.788057] Oops: 0002 [#1] SMP NOPTI [ 1125.788059] CPU: 50 PID: 0 Comm: swapper/50 Kdump: loaded Tainted: G --- -t - 4.18.0-147.el8.x86_64 #1 [ 1125.788065] RIP: 0010:nvme_irq+0xe8/0x240 [nvme] [ 1125.788068] RSP: 0018:916b8ec83ed0 EFLAGS: 00010813 [ 1125.788069] RAX: RBX: 918ae9211b00 RCX: [ 1125.788070] RDX: 400b RSI: RDI: [ 1125.788071] RBP: 918ae887 R08: 0004 R09: 918ae887 [ 1125.788072] R10: R11: R12: [ 1125.788073] R13: 0001 R14: R15: 0001 [ 1125.788075] FS: () GS:916b8ec8() knlGS: [ 1125.788075] CS: 0010 DS: ES: CR0: 80050033 [ 1125.788076] CR2: 0130 CR3: 001768f0 CR4: 00340ee0 [ 1125.788077] Call Trace: [ 1125.788080] [ 1125.788085] __handle_irq_event_percpu+0x40/0x180 [ 1125.788087] handle_irq_event_percpu+0x30/0x80 [ 1125.788089] handle_irq_event+0x36/0x53 [ 1125.788090] handle_edge_irq+0x82/0x190 [ 1125.788094] handle_irq+0xbf/0x100 [ 1125.788098] do_IRQ+0x49/0xd0 [ 1125.788100] common_interrupt+0xf/0xf The analysis of the possible cause of above crash as below. 
According to our test: in nvme_pci_enable(), 'dev->q_depth' is set to 1024; in nvme_create_io_queues()->nvme_alloc_queue(), 'nvmeq->q_depth' is set to 'dev->q_depth'. In nvme_dev_add(), 'dev->tagset.queue_depth' is set to 1023:
dev->tagset.queue_depth = min_t(unsigned int, dev->q_depth, BLK_MQ_MAX_DEPTH) - 1; //why -1?? first introduced by commit a4aea562
So we got the below values for the multiple depths:
dev->q_depth = 1024
dev->tagset.queue_depth = 1023
nvmeq->q_depth = 1024
In blk_mq_alloc_rqs(), 1023 (dev->tagset.queue_depth) rqs are allocated for one hw queue. In nvme_alloc_queue(), 1024 (nvmeq->q_depth) cqes are allocated for nvmeq->cqes[]. When processing a cqe in nvme_handle_cqe(), it is fetched by:
struct nvme_completion *cqe = &nvmeq->cqes[idx];
We also added the below prints in nvme_handle_cqe() and blk_mq_tag_to_rq():
static inline void nvme_handle_cqe(struct nvme_queue *nvmeq, u16 idx)
{
	volatile struct nvme_completion *cqe = &nvmeq->cqes[idx];
	struct request *req;

	//debug print
	dev_warn(nvmeq->dev->ctrl.device,
		"command_id %d completed on queue %d, nvmeq q_depth %d, nvme tagset q_depth %d\n",
		cqe->command_id, le16_to_cpu(cqe->sq_id),
		nvmeq->q_depth, nvmeq->dev->tagset.queue_depth);

	if (unlikely(cqe->command_id >= nvmeq->q_depth)) {
		dev_warn(nvmeq->dev->ctrl.device,
			"invalid id %d completed on queue %d\n",
			cqe->command_id, le16_to_cpu(cqe->sq_id));
		return;
	}
	... ...
	req = blk_mq_tag_to_rq(nvme_queue_tagset(nvmeq), cqe->command_id);
	... ...
}

struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
{
	//debug print
	printk("tag, nr_tags:%d %d\n", tag, tags->nr_tags);

	if (tag < tags->nr_tags) {
		prefetch(tags->rqs[tag]);
		return tags->rqs[tag];
RE: [PATCH] tracing: use __this_cpu_read() in trace_buffered_event_enable()
Hi Steven, Thanks for your reply:) -Original Message- From: Steven Rostedt [mailto:rost...@goodmis.org] Sent: Friday, September 18, 2020 8:31 AM To: tianxianting (RD) Cc: mi...@redhat.com; linux-kernel@vger.kernel.org Subject: Re: [PATCH] tracing: use __this_cpu_read() in trace_buffered_event_enable() Sorry for the late reply (been busy with Linux Plumbers, took a vacation, and then trying to catch up on all my email for the last two months!) But I just wanted to let you know that I added this to my queue. Thanks! -- Steve On Thu, 13 Aug 2020 19:28:03 +0800 Xianting Tian wrote: > The code is executed with preemption disabled, so it's safe to use > __this_cpu_read(). > > Signed-off-by: Xianting Tian > --- > kernel/trace/trace.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index > bb6226972..7d0d71ce9 100644 > --- a/kernel/trace/trace.c > +++ b/kernel/trace/trace.c > @@ -2514,7 +2514,7 @@ void trace_buffered_event_enable(void) > > preempt_disable(); > if (cpu == smp_processor_id() && > - this_cpu_read(trace_buffered_event) != > + __this_cpu_read(trace_buffered_event) != > per_cpu(trace_buffered_event, cpu)) > WARN_ON_ONCE(1); > preempt_enable();
RE: [PATCH] ipmi: add retry in try_get_dev_id()
Thank you, Corey, for your kind guidance on these three patches :) -Original Message- From: Corey Minyard [mailto:tcminy...@gmail.com] On Behalf Of Corey Minyard Sent: Wednesday, September 16, 2020 10:01 PM To: tianxianting (RD) Cc: a...@arndb.de; gre...@linuxfoundation.org; openipmi-develo...@lists.sourceforge.net; linux-kernel@vger.kernel.org Subject: Re: [PATCH] ipmi: add retry in try_get_dev_id() On Wed, Sep 16, 2020 at 02:21:29PM +0800, Xianting Tian wrote: > Use retry machanism to give device more opportunitys to correctly > response kernel when we received specific completion codes. > > This is similar to what we done in __get_device_id(). Thanks. I moved GET_DEVICE_ID_MAX_RETRY to include/linux/ipmi.h since uapi is for things exported to userspace. But this is good, it's in my next tree. -corey > > Signed-off-by: Xianting Tian > --- > drivers/char/ipmi/ipmi_msghandler.c | 2 -- > drivers/char/ipmi/ipmi_si_intf.c| 17 + > include/uapi/linux/ipmi.h | 2 ++ > 3 files changed, 19 insertions(+), 2 deletions(-) > > diff --git a/drivers/char/ipmi/ipmi_msghandler.c > b/drivers/char/ipmi/ipmi_msghandler.c > index b9685093e..75cb7e062 100644 > --- a/drivers/char/ipmi/ipmi_msghandler.c > +++ b/drivers/char/ipmi/ipmi_msghandler.c > @@ -62,8 +62,6 @@ enum ipmi_panic_event_op { #define > IPMI_PANIC_DEFAULT IPMI_SEND_PANIC_EVENT_NONE #endif > > -#define GET_DEVICE_ID_MAX_RETRY 5 > - > static enum ipmi_panic_event_op ipmi_send_panic_event = > IPMI_PANIC_DEFAULT; > > static int panic_op_write_handler(const char *val, diff --git > a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c > index 77b8d551a..beeb705f1 100644 > --- a/drivers/char/ipmi/ipmi_si_intf.c > +++ b/drivers/char/ipmi/ipmi_si_intf.c > @@ -1316,6 +1316,7 @@ static int try_get_dev_id(struct smi_info *smi_info) > unsigned char *resp; > unsigned long resp_len; > int rv = 0; > + unsigned int retry_count = 0; > > resp = kmalloc(IPMI_MAX_MSG_LENGTH, GFP_KERNEL); > if (!resp) > @@ -1327,6 +1328,8
@@ static int try_get_dev_id(struct smi_info *smi_info) >*/ > msg[0] = IPMI_NETFN_APP_REQUEST << 2; > msg[1] = IPMI_GET_DEVICE_ID_CMD; > + > +retry: > smi_info->handlers->start_transaction(smi_info->si_sm, msg, 2); > > rv = wait_for_msg_done(smi_info); > @@ -1339,6 +1342,20 @@ static int try_get_dev_id(struct smi_info *smi_info) > /* Check and record info from the get device id, in case we need it. */ > rv = ipmi_demangle_device_id(resp[0] >> 2, resp[1], > resp + 2, resp_len - 2, _info->device_id); > + if (rv) { > + /* record completion code */ > + char cc = *(resp + 2); > + > + if ((cc == IPMI_DEVICE_IN_FW_UPDATE_ERR > + || cc == IPMI_DEVICE_IN_INIT_ERR > + || cc == IPMI_NOT_IN_MY_STATE_ERR) > + && ++retry_count <= GET_DEVICE_ID_MAX_RETRY) { > + dev_warn(smi_info->io.dev, > + "retry to get device id as completion code > 0x%x\n", > + cc); > + goto retry; > + } > + } > > out: > kfree(resp); > diff --git a/include/uapi/linux/ipmi.h b/include/uapi/linux/ipmi.h > index 32d148309..bc57f07e3 100644 > --- a/include/uapi/linux/ipmi.h > +++ b/include/uapi/linux/ipmi.h > @@ -426,4 +426,6 @@ struct ipmi_timing_parms { > #define IPMICTL_GET_MAINTENANCE_MODE_CMD _IOR(IPMI_IOC_MAGIC, 30, int) > #define IPMICTL_SET_MAINTENANCE_MODE_CMD _IOW(IPMI_IOC_MAGIC, 31, int) > > +#define GET_DEVICE_ID_MAX_RETRY 5 > + > #endif /* _UAPI__LINUX_IPMI_H */ > -- > 2.17.1 >
RE: [PATCH] [v2] ipmi: retry to get device id when error
Hi Corey, Would you please review this patch: add the same retry in try_get_dev_id() https://lkml.org/lkml/2020/9/16/244 Thanks a lot. -Original Message- From: tianxianting (RD) Sent: Tuesday, September 15, 2020 11:20 PM To: 'miny...@acm.org' Cc: a...@arndb.de; gre...@linuxfoundation.org; openipmi-develo...@lists.sourceforge.net; linux-kernel@vger.kernel.org Subject: RE: [PATCH] [v2] ipmi: retry to get device id when error I get it now, thank you Corey :) I will send the patch for you review tomorrow. -Original Message- From: Corey Minyard [mailto:tcminy...@gmail.com] On Behalf Of Corey Minyard Sent: Tuesday, September 15, 2020 10:53 PM To: tianxianting (RD) Cc: a...@arndb.de; gre...@linuxfoundation.org; openipmi-develo...@lists.sourceforge.net; linux-kernel@vger.kernel.org Subject: Re: [PATCH] [v2] ipmi: retry to get device id when error On Tue, Sep 15, 2020 at 09:40:02AM +, Tianxianting wrote: > Hi Corey, > Thanks for your comments, > Please review these two patches, which are based on your guide. > 1, [PATCH] ipmi: print current state when error > https://lkml.org/lkml/2020/9/15/183 > 2, [PATCH] [v3] ipmi: retry to get device id when error > https://lkml.org/lkml/2020/9/15/156 Patches are applied and in my next tree. > > As you said "You are having the same issue in the ipmi_si code. It's choosing > defaults, but that's not ideal. You probably need to handle this there, too, > in a separate patch." > I am not sure whether I grasped what you said, The print ' device id > demangle failed: -22' in commit message, is just triggered by > bmc_device_id_handler->ipmi_demangle_device_id, this is the issue we met and > is solving. > I found try_get_dev_id(in drivers/char/ipmi/ipmi_si_intf.c) also called > ipmi_demangle_device_id(), do you mean if this ipmi_demangle_device_id() > returned error, we also need to retry? Yes, I think so, retrying in try_get_dev_id() would be a good idea, I think. You are probably getting sub-optimal performance if you don't do this. 
Thanks, -corey > > Thanks a lot. > > -Original Message- > From: Corey Minyard [mailto:tcminy...@gmail.com] On Behalf Of Corey > Minyard > Sent: Monday, September 14, 2020 11:40 PM > To: tianxianting (RD) > Cc: a...@arndb.de; gre...@linuxfoundation.org; > openipmi-develo...@lists.sourceforge.net; linux-kernel@vger.kernel.org > Subject: Re: [PATCH] [v2] ipmi: retry to get device id when error > > On Mon, Sep 14, 2020 at 04:13:13PM +0800, Xianting Tian wrote: > > We can't get bmc's device id with low probability when loading ipmi > > driver, it caused bmc device register failed. When this issue > > happened, we got below kernel printks: > > This patch is moving in the right direction. For the final patch(es), I can > clean up the english grammar issues, since that's not your native language. > A few comments: > > > [Wed Sep 9 19:52:03 2020] ipmi_si IPI0001:00: IPMI message handler: > > device id demangle failed: -22 > > You are having the same issue in the ipmi_si code. It's choosing defaults, > but that's not ideal. You probably need to handle this there, too, in a > separate patch. > > Can you create a separate patch to add a dev_warn() to the BT code when it > returns IPMI_NOT_IN_MY_STATE_ERR, like I asked previously? And print the > current state when it happens. That way we know where this issue is coming > from and possibly fix the state machine. I'm thinking that the BMC is just > not responding, but I'd like to be sure. > > Other comments inline... > > > [Wed Sep 9 19:52:03 2020] IPMI BT: using default values > > [Wed Sep 9 19:52:03 2020] IPMI BT: req2rsp=5 secs retries=2 > > [Wed Sep 9 19:52:03 2020] ipmi_si IPI0001:00: Unable to get the device > > id: -5 > > [Wed Sep 9 19:52:04 2020] ipmi_si IPI0001:00: Unable to register > > device: error -5 > > > > When this issue happened, we want to manually unload the driver and > > try to load it again, but it can't be unloaded by 'rmmod' as it is already > > 'in use'. 
> > > > We add below 'printk' in handle_one_recv_msg(), when this issue > > happened, the msg we received is "Recv: 1c 01 d5", which means the > > data_len is 1, data[0] is 0xd5(completion code), which means "bmc cannot > > execute command. > > Command, or request parameter(s), not supported in present state". > > Debug code: > > static int handle_one_recv_msg(struct ipmi_smi *intf, > >struct ipmi_smi_msg *msg) { > > printk("Recv: %*ph\n", msg->rsp_size, msg->rsp); > > ... ... > > } > > Then
RE: [PATCH] [v2] ipmi: retry to get device id when error
Hi Corey, Thanks for your comments, Please review these two patches, which are based on your guide. 1, [PATCH] ipmi: print current state when error https://lkml.org/lkml/2020/9/15/183 2, [PATCH] [v3] ipmi: retry to get device id when error https://lkml.org/lkml/2020/9/15/156 As you said "You are having the same issue in the ipmi_si code. It's choosing defaults, but that's not ideal. You probably need to handle this there, too, in a separate patch." I am not sure whether I grasped what you said, The print ' device id demangle failed: -22' in commit message, is just triggered by bmc_device_id_handler->ipmi_demangle_device_id, this is the issue we met and is solving. I found try_get_dev_id(in drivers/char/ipmi/ipmi_si_intf.c) also called ipmi_demangle_device_id(), do you mean if this ipmi_demangle_device_id() returned error, we also need to retry? Thanks a lot. -Original Message- From: Corey Minyard [mailto:tcminy...@gmail.com] On Behalf Of Corey Minyard Sent: Monday, September 14, 2020 11:40 PM To: tianxianting (RD) Cc: a...@arndb.de; gre...@linuxfoundation.org; openipmi-develo...@lists.sourceforge.net; linux-kernel@vger.kernel.org Subject: Re: [PATCH] [v2] ipmi: retry to get device id when error On Mon, Sep 14, 2020 at 04:13:13PM +0800, Xianting Tian wrote: > We can't get bmc's device id with low probability when loading ipmi > driver, it caused bmc device register failed. When this issue > happened, we got below kernel printks: This patch is moving in the right direction. For the final patch(es), I can clean up the english grammar issues, since that's not your native language. A few comments: > [Wed Sep 9 19:52:03 2020] ipmi_si IPI0001:00: IPMI message handler: > device id demangle failed: -22 You are having the same issue in the ipmi_si code. It's choosing defaults, but that's not ideal. You probably need to handle this there, too, in a separate patch. 
Can you create a separate patch to add a dev_warn() to the BT code when it returns IPMI_NOT_IN_MY_STATE_ERR, like I asked previously? And print the current state when it happens. That way we know where this issue is coming from and possibly fix the state machine. I'm thinking that the BMC is just not responding, but I'd like to be sure. Other comments inline... > [Wed Sep 9 19:52:03 2020] IPMI BT: using default values > [Wed Sep 9 19:52:03 2020] IPMI BT: req2rsp=5 secs retries=2 > [Wed Sep 9 19:52:03 2020] ipmi_si IPI0001:00: Unable to get the device > id: -5 > [Wed Sep 9 19:52:04 2020] ipmi_si IPI0001:00: Unable to register > device: error -5 > > When this issue happened, we want to manually unload the driver and > try to load it again, but it can't be unloaded by 'rmmod' as it is already > 'in use'. > > We add below 'printk' in handle_one_recv_msg(), when this issue > happened, the msg we received is "Recv: 1c 01 d5", which means the > data_len is 1, data[0] is 0xd5(completion code), which means "bmc cannot > execute command. > Command, or request parameter(s), not supported in present state". > Debug code: > static int handle_one_recv_msg(struct ipmi_smi *intf, >struct ipmi_smi_msg *msg) { > printk("Recv: %*ph\n", msg->rsp_size, msg->rsp); > ... ... > } > Then in ipmi_demangle_device_id(), it returned '-EINVAL' as 'data_len < 7' > and 'data[0] != 0'. > > We used this patch to retry to get device id when error happen, we > reproduced this issue again and the retry succeed on the first retry, > we finally got the correct msg and then all is ok: > Recv: 1c 01 00 01 81 05 84 02 af db 07 00 01 00 b9 00 10 00 > > So use retry machanism in this patch to give bmc more opportunity to > correctly response kernel when we received specific completion code. 
> > Signed-off-by: Xianting Tian > --- > drivers/char/ipmi/ipmi_msghandler.c | 29 + > include/uapi/linux/ipmi_msgdefs.h | 2 ++ > 2 files changed, 27 insertions(+), 4 deletions(-) > > diff --git a/drivers/char/ipmi/ipmi_msghandler.c > b/drivers/char/ipmi/ipmi_msghandler.c > index 737c0b6b2..07d5be2cd 100644 > --- a/drivers/char/ipmi/ipmi_msghandler.c > +++ b/drivers/char/ipmi/ipmi_msghandler.c > @@ -34,6 +34,7 @@ > #include > #include > #include > +#include > > #define IPMI_DRIVER_VERSION "39.2" > > @@ -60,6 +61,9 @@ enum ipmi_panic_event_op { #else #define > IPMI_PANIC_DEFAULT IPMI_SEND_PANIC_EVENT_NONE #endif > + > +#define GET_DEVICE_ID_MAX_RETRY 5 > + > static enum ipmi_panic_event_op ipmi_send_panic_event = > IPMI_PANIC_DEFAULT; > > static int panic_op_write_handler(const char *val, @@ -317,6 +321,7 > @@ st
RE: [PATCH] ipmi: retry to get device id when error
Hi Corey, Thank you for your kind guidance. I have sent the version 2 patch to you: [PATCH] [v2] ipmi: retry to get device id when error. If there is still something wrong or I missed something, please correct me :) -Original Message- From: Corey Minyard [mailto:tcminy...@gmail.com] On Behalf Of Corey Minyard Sent: Monday, September 14, 2020 1:26 AM To: tianxianting (RD) Cc: a...@arndb.de; gre...@linuxfoundation.org; openipmi-develo...@lists.sourceforge.net; linux-kernel@vger.kernel.org Subject: Re: [PATCH] ipmi: retry to get device id when error On Sun, Sep 13, 2020 at 02:10:01PM +0000, Tianxianting wrote: > Hi Corey > Thanks for your quickly reply, > We didn't try the method you mentioned, actually, I didn't know it > before you told me:( The issue ever occurred on our 2 ceph storage server > both with low probability. > We finally use this patch to solve the issue, it can automatically solve the > issue when it happened. So no need to judge and reload ipmi driver manually > or develop additional scripts to make it. > The 1 second delay is acceptable to us. > > If there really isn't a BMC out there, ipmi driver will not be loaded, is it > right? No, there are systems that have IPMI controllers that are specified in the ACPI/SMBIOS tables but have no IPMI controller. > > May be we can adjust to retry 3 times with 500ms interval? Maybe there is another way. What I'm guessing is happening is not that the interface is lossy, but that the BMC is in the middle of a reset or an upgrade. The D5 completion code means: Cannot execute command. Command, or request parameter(s), not supported in present state. That error is also returned from bt_start_transaction() in the driver if the driver is not in the right state when starting a transaction, which may point to a bug in the BT state machine. Search for IPMI_NOT_IN_MY_STATE_ERR in drivers/char/ipmi/ipmi_bt_sm.c.
If it's coming fron the state machine, it would be handy to know what state it was in when the error happened to help trace the bug down. That's what I would suggest first. Fix the fundamental bug, if you can. a pr_warn() added to the code there would be all that's needed, and thats probably a good permanent addition. I would be ok with a patch that retried some number of times if it got a D5 completion code. That wouldn't have any effect on other systems. Plus you could add a D1 and D2 completion code, which are similar things. Add the new completion codes to include/uapi/linux/ipmi_msgdefs.h and implement the retry. That makes sense from a general point of view. Thanks, -corey > > Thanks in advance if you can feedback again. > > -Original Message- > From: Corey Minyard [mailto:tcminy...@gmail.com] On Behalf Of Corey > Minyard > Sent: Sunday, September 13, 2020 8:40 PM > To: tianxianting (RD) > Cc: a...@arndb.de; gre...@linuxfoundation.org; > openipmi-develo...@lists.sourceforge.net; linux-kernel@vger.kernel.org > Subject: Re: [PATCH] ipmi: retry to get device id when error > > On Sun, Sep 13, 2020 at 08:02:03PM +0800, Xianting Tian wrote: > > We can't get bmc's device id with low probability when loading ipmi > > driver, it caused bmc device register failed. This issue may caused > > by bad lpc signal quality. When this issue happened, we got below kernel > > printks: > > [Wed Sep 9 19:52:03 2020] ipmi_si IPI0001:00: IPMI message handler: > > device id demangle failed: -22 > > [Wed Sep 9 19:52:03 2020] IPMI BT: using default values > > [Wed Sep 9 19:52:03 2020] IPMI BT: req2rsp=5 secs retries=2 > > [Wed Sep 9 19:52:03 2020] ipmi_si IPI0001:00: Unable to get the device > > id: -5 > > [Wed Sep 9 19:52:04 2020] ipmi_si IPI0001:00: Unable to register > > device: error -5 > > > > When this issue happened, we want to manually unload the driver and > > try to load it again, but it can't be unloaded by 'rmmod' as it is already > > 'in use'. 
> > I'm not sure this patch is a good idea; it would cause a long boot delay in > situations where there really isn't a BMC out there. Yes, it happens. > > You don't have to reload the driver to add a device, though. You can hot-add > devices using /sys/modules/ipmi_si/parameters/hotmod. Look in > Documentation/driver-api/ipmi.rst for details. > > Does that work for you? > > -corey > > > > > We add below 'printk' in handle_one_recv_msg(), when this issue > > happened, the msg we received is "Recv: 1c 01 d5", which means the > > data_len is 1, data[0] is 0xd5. > > Debug code: > > static int handle_one_recv_msg(struct ipmi_smi *intf, > >struct ipmi_smi_msg *msg) { > > printk("Recv: %*ph\n", msg-
RE: [PATCH] ipmi: retry to get device id when error
Hi Corey, Thanks for your quick reply. We hadn't tried the method you mentioned; actually, I didn't know about it before you told me :( The issue occurred on two of our Ceph storage servers, both with low probability. We finally used this patch to solve the issue; it resolves the problem automatically when it happens, so there is no need to detect it and reload the ipmi driver manually, or to develop additional scripts for that. The 1 second delay is acceptable to us. If there really isn't a BMC out there, the ipmi driver will not be loaded, is that right? Maybe we can adjust it to retry 3 times with a 500ms interval? Thanks in advance for your feedback. -Original Message- From: Corey Minyard [mailto:tcminy...@gmail.com] On Behalf Of Corey Minyard Sent: Sunday, September 13, 2020 8:40 PM To: tianxianting (RD) Cc: a...@arndb.de; gre...@linuxfoundation.org; openipmi-develo...@lists.sourceforge.net; linux-kernel@vger.kernel.org Subject: Re: [PATCH] ipmi: retry to get device id when error On Sun, Sep 13, 2020 at 08:02:03PM +0800, Xianting Tian wrote: > We can't get bmc's device id with low probability when loading ipmi > driver, it caused bmc device register failed. This issue may caused by > bad lpc signal quality. When this issue happened, we got below kernel printks: > [Wed Sep 9 19:52:03 2020] ipmi_si IPI0001:00: IPMI message handler: > device id demangle failed: -22 > [Wed Sep 9 19:52:03 2020] IPMI BT: using default values > [Wed Sep 9 19:52:03 2020] IPMI BT: req2rsp=5 secs retries=2 > [Wed Sep 9 19:52:03 2020] ipmi_si IPI0001:00: Unable to get the device > id: -5 > [Wed Sep 9 19:52:04 2020] ipmi_si IPI0001:00: Unable to register > device: error -5 > > When this issue happened, we want to manually unload the driver and > try to load it again, but it can't be unloaded by 'rmmod' as it is already > 'in use'. I'm not sure this patch is a good idea; it would cause a long boot delay in situations where there really isn't a BMC out there. Yes, it happens.
You don't have to reload the driver to add a device, though. You can hot-add devices using /sys/modules/ipmi_si/parameters/hotmod. Look in Documentation/driver-api/ipmi.rst for details. Does that work for you? -corey > > We add below 'printk' in handle_one_recv_msg(), when this issue > happened, the msg we received is "Recv: 1c 01 d5", which means the > data_len is 1, data[0] is 0xd5. > Debug code: > static int handle_one_recv_msg(struct ipmi_smi *intf, >struct ipmi_smi_msg *msg) { > printk("Recv: %*ph\n", msg->rsp_size, msg->rsp); > ... ... > } > Then in ipmi_demangle_device_id(), it returned '-EINVAL' as 'data_len < 7' > and 'data[0] != 0'. > > We used this patch to retry to get device id when error happen, we > reproduced this issue again and the retry succeed on the first retry, > we finally got the correct msg and then all is ok: > Recv: 1c 01 00 01 81 05 84 02 af db 07 00 01 00 b9 00 10 00 > > So use retry machanism in this patch to give bmc more opportunity to > correctly response kernel. 
> > Signed-off-by: Xianting Tian > --- > drivers/char/ipmi/ipmi_msghandler.c | 17 ++--- > 1 file changed, 14 insertions(+), 3 deletions(-) > > diff --git a/drivers/char/ipmi/ipmi_msghandler.c > b/drivers/char/ipmi/ipmi_msghandler.c > index 737c0b6b2..bfb2de77a 100644 > --- a/drivers/char/ipmi/ipmi_msghandler.c > +++ b/drivers/char/ipmi/ipmi_msghandler.c > @@ -34,6 +34,7 @@ > #include > #include > #include > +#include > > #define IPMI_DRIVER_VERSION "39.2" > > @@ -60,6 +61,9 @@ enum ipmi_panic_event_op { #else #define > IPMI_PANIC_DEFAULT IPMI_SEND_PANIC_EVENT_NONE #endif > + > +#define GET_DEVICE_ID_MAX_RETRY 5 > + > static enum ipmi_panic_event_op ipmi_send_panic_event = > IPMI_PANIC_DEFAULT; > > static int panic_op_write_handler(const char *val, @@ -2426,19 > +2430,26 @@ send_get_device_id_cmd(struct ipmi_smi *intf) static int > __get_device_id(struct ipmi_smi *intf, struct bmc_device *bmc) { > int rv; > - > - bmc->dyn_id_set = 2; > + unsigned int retry_count = 0; > > intf->null_user_handler = bmc_device_id_handler; > > +retry: > + bmc->dyn_id_set = 2; > + > rv = send_get_device_id_cmd(intf); > if (rv) > return rv; > > wait_event(intf->waitq, bmc->dyn_id_set != 2); > > - if (!bmc->dyn_id_set) > + if (!bmc->dyn_id_set) { > + msleep(1000); > + if (++retry_count <= GET_DEVICE_ID_MAX_RETRY) > + goto retry; > + > rv = -EIO; /* Something went wrong in the fetch. */ > + } > > /* dyn_id_set makes the id data available. */ > smp_rmb(); > -- > 2.17.1 >
RE: [PATCH] fs: use correct parameter in notes of generic_file_llseek_size()
Hi Viro, Could I get your feedback? This patch fixes a kernel-doc build warning, so I think it can be applied, thanks :) -Original Message- From: tianxianting (RD) Sent: Saturday, September 05, 2020 3:15 PM To: v...@zeniv.linux.org.uk Cc: linux-fsde...@vger.kernel.org; linux-kernel@vger.kernel.org; tianxianting (RD) Subject: [PATCH] fs: use correct parameter in notes of generic_file_llseek_size() Fix warning when compiling with W=1: fs/read_write.c:88: warning: Function parameter or member 'maxsize' not described in 'generic_file_llseek_size' fs/read_write.c:88: warning: Excess function parameter 'size' description in 'generic_file_llseek_size' Signed-off-by: Xianting Tian --- fs/read_write.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/read_write.c b/fs/read_write.c index 5db58b8c7..058563ee2 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -71,7 +71,7 @@ EXPORT_SYMBOL(vfs_setpos); * @file: file structure to seek on * @offset: file offset to seek to * @whence: type of seek - * @size: max size of this file in file system + * @maxsize: max size of this file in file system * @eof: offset used for SEEK_END position * * This is a variant of generic_file_llseek that allows passing in a custom -- 2.17.1
RE: [PATCH] block: remove redundant empty check of mq_list
Hi Jens, thanks for your feedback. Yes, blk_flush_plug_list() is the only caller of blk_mq_flush_plug_list(). So I checked the callers of blk_flush_plug_list() and found that the code paths below reach it:

io_schedule_prepare/sched_submit_work -> blk_schedule_flush_plug
writeback_sb_inodes -> blk_flush_plug
blk_finish_plug
dm_submit_bio/__submit_bio_noacct_mq/__submit_bio -> blk_mq_submit_bio
blk_poll

So I think the redundant check could still be hit in many paths?

-Original Message-
From: Jens Axboe [mailto:ax...@kernel.dk]
Sent: Wednesday, September 09, 2020 10:21 PM
To: tianxianting (RD) ; a...@kernel.org; dan...@iogearbox.net; ka...@fb.com; songliubrav...@fb.com; y...@fb.com; andr...@fb.com; john.fastab...@gmail.com; kpsi...@chromium.org
Cc: linux-bl...@vger.kernel.org; linux-kernel@vger.kernel.org; net...@vger.kernel.org; b...@vger.kernel.org
Subject: Re: [PATCH] block: remove redundant empty check of mq_list

On 9/9/20 12:48 AM, Xianting Tian wrote:
> blk_mq_flush_plug_list() itself will do the empty check of mq_list, so
> remove such check in blk_flush_plug_list().
> Actually normally mq_list is not empty when blk_flush_plug_list is
> called.

It's cheaper to do in the caller, instead of doing the function call and then aborting if it's empty. So I'd suggest just leaving it alone. Right now this is the only caller, but it's nicer to assume we can be called in any state vs not having the check.

--
Jens Axboe
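Jens's argument, that the cheap emptiness test belongs in the caller so the common empty case never pays for a function call, can be sketched like this. struct plug, mq_flush() and flush_plug() are hypothetical stand-ins for blk_plug, blk_mq_flush_plug_list() and blk_flush_plug_list(), not the real API:

```c
#include <assert.h>

struct plug { int pending; };  /* stand-in for blk_plug's mq_list */

static int flush_calls;        /* counts how often the "expensive" path runs */

/* Stand-in for blk_mq_flush_plug_list(): in the kernel this walks and
 * dispatches the plugged requests. */
static void mq_flush(struct plug *p)
{
    flush_calls++;
    p->pending = 0;
}

/* Caller-side guard, as in blk_flush_plug_list(): when nothing is
 * queued, the function call is skipped entirely. */
static void flush_plug(struct plug *p)
{
    if (p->pending)
        mq_flush(p);
}
```

The guard in the caller means an empty plug costs only a branch, which is why the "redundant" check was kept.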
RE: [PATCH] blkcg: add plugging support for punt bio
Thanks, TJ, and thanks for your earlier guidance. I will summarize it in the commit message and resubmit the patch.

-Original Message-
From: Tejun Heo [mailto:hte...@gmail.com] On Behalf Of Tejun Heo
Sent: Wednesday, September 09, 2020 2:03 AM
To: tianxianting (RD)
Cc: ax...@kernel.dk; cgro...@vger.kernel.org; linux-bl...@vger.kernel.org; linux-kernel@vger.kernel.org
Subject: Re: [PATCH] blkcg: add plugging support for punt bio

On Sat, Sep 05, 2020 at 11:25:03AM +, Tianxianting wrote:
> Hi jens, tj
> Could you share a couple of minutes to comment it?
> I really appreciate it

The result looks fine to me but can you please summarize that in the commit message of the patch? Thanks.

--
tejun
RE: [PATCH] clocksource: Return negative error code in h8300_8timer_init()
Hi Markus, thanks for your comments. There is no real bug being fixed; the point is that error paths should consistently return a negative error code. In h8300_8timer_init(), one path sets 'ret = -EINVAL' while another sets 'ret = ENXIO' (positive); the latter should be aligned to a negative code:

static int __init h8300_8timer_init(struct device_node *node)
{
	ret = ENXIO; <<<
	base = of_iomap(node, 0);
	if (!base) {
		pr_err("failed to map registers for clockevent\n");
		goto free_clk;
	}

	ret = -EINVAL; <<<
	irq = irq_of_parse_and_map(node, 0);
	if (!irq) {
		pr_err("failed to get irq for clockevent\n");
		goto unmap_reg;
	}

-Original Message-
From: Markus Elfring [mailto:markus.elfr...@web.de]
Sent: Tuesday, September 01, 2020 10:46 PM
To: tianxianting (RD) ; uclinux-h8-de...@lists.sourceforge.jp
Cc: linux-kernel@vger.kernel.org; kernel-janit...@vger.kernel.org; Daniel Lezcano ; Thomas Gleixner ; Yoshinori Sato
Subject: Re: [PATCH] clocksource: Return negative error code in h8300_8timer_init()

> A negative error code should be returned

* Can another imperative wording become helpful for the change description?
* Would you like to add the tag "Fixes" to the commit message?

Regards,
Markus
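The convention at issue is that kernel init/probe paths return 0 on success and a negated errno on failure; a bare 'ret = ENXIO' is positive and defeats 'ret < 0' checks in callers. A userspace-compilable sketch, where timer_init() is a hypothetical analog of h8300_8timer_init():

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical probe mirroring h8300_8timer_init(): every failure path
 * must set a *negative* errno so callers can test "ret < 0".  The bug
 * being fixed was a bare "ret = ENXIO" (positive). */
static int timer_init(const void *base, int irq)
{
    if (!base)
        return -ENXIO;   /* mapped-register lookup failed */
    if (!irq)
        return -EINVAL;  /* no interrupt could be parsed */
    return 0;
}
```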
RE: [PATCH] [v3] blk-mq: use BLK_MQ_NO_TAG for no tag
Hi Jens, could I get your feedback on whether this patch can be applied? Thanks a lot.

-Original Message-
From: tianxianting (RD)
Sent: Thursday, August 27, 2020 2:34 PM
To: ax...@kernel.dk
Cc: linux-bl...@vger.kernel.org; linux-kernel@vger.kernel.org; tianxianting (RD)
Subject: [PATCH] [v3] blk-mq: use BLK_MQ_NO_TAG for no tag

Replace various magic -1 constants for tags with BLK_MQ_NO_TAG.

Signed-off-by: Xianting Tian
---
 block/blk-core.c     | 4 ++--
 block/blk-mq-sched.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d9d632639..c7eaf7504 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -116,8 +116,8 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	rq->__sector = (sector_t) -1;
 	INIT_HLIST_NODE(&rq->hash);
 	RB_CLEAR_NODE(&rq->rb_node);
-	rq->tag = -1;
-	rq->internal_tag = -1;
+	rq->tag = BLK_MQ_NO_TAG;
+	rq->internal_tag = BLK_MQ_NO_TAG;
 	rq->start_time_ns = ktime_get_ns();
 	rq->part = NULL;
 	refcount_set(&rq->ref, 1);
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index a19cdf159..439481f59 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -522,7 +522,7 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
 		goto run;
 	}
 
-	WARN_ON(e && (rq->tag != -1));
+	WARN_ON(e && (rq->tag != BLK_MQ_NO_TAG));
 
 	if (blk_mq_sched_bypass_insert(hctx, !!e, rq)) {
 		/*
--
2.17.1
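The point of the patch is readability: -1U silently wraps to UINT_MAX, so scattered '-1' literals do compare equal to the sentinel but hide the intent. A sketch using the same enum shape; request_stub and rq_init() are hypothetical stand-ins for struct request and blk_rq_init():

```c
#include <assert.h>
#include <limits.h>

/* The block-layer shape (from block/blk-mq-tag.h): -1U wraps to
 * UINT_MAX, so the sentinel sits above every valid tag value. */
enum {
    BLK_MQ_NO_TAG  = -1U,
    BLK_MQ_TAG_MIN = 1,
    BLK_MQ_TAG_MAX = BLK_MQ_NO_TAG - 1,
};

struct request_stub { unsigned int tag, internal_tag; };  /* hypothetical */

static void rq_init(struct request_stub *rq)
{
    rq->tag = BLK_MQ_NO_TAG;           /* was: rq->tag = -1; */
    rq->internal_tag = BLK_MQ_NO_TAG;  /* was: rq->internal_tag = -1; */
}
```

Both forms produce identical machine code; the named constant just makes "no tag assigned" searchable and self-describing.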
RE: [PATCH] [v2] nvme-pci: check req to prevent crash in nvme_handle_cqe()
Hi, could I get feedback on this patch, i.e. whether it can be applied or needs further improvement? It really does prevent a crash we hit. Thanks :)

-Original Message-
From: tianxianting (RD)
Sent: Monday, August 31, 2020 6:56 PM
To: kbu...@kernel.org; ax...@fb.com; h...@lst.de; s...@grimberg.me
Cc: linux-n...@lists.infradead.org; linux-kernel@vger.kernel.org; tianxianting (RD)
Subject: [PATCH] [v2] nvme-pci: check req to prevent crash in nvme_handle_cqe()

We hit a crash when hot-inserting an nvme device: blk_mq_tag_to_rq() returned NULL (req == NULL), and the crash then happened in nvme_end_request():

	struct nvme_request *rq = nvme_req(req);
	rq->result = result;	<== crash here

The test environment is a server configured with 2 backplanes, each backplane supporting 8 nvme devices; the crash happened when hot-inserting an nvme device into the second backplane. We measured the signal the CPU sends out to ack the nvme interrupt; it was very weak by the time it reached the second backplane, so the device could not recognize it as an ack and therefore could not clear its interrupt flag. After updating the related driver, the signal reaching the second backplane was good and the crash issue disappeared.

Since blk_mq_tag_to_rq() may return NULL, the result should be checked for NULL before use to prevent a crash.
[ 1124.256246] nvme nvme5: pci function :e1:00.0
[ 1124.256323] nvme :e1:00.0: enabling device ( -> 0002)
[ 1125.720859] nvme nvme5: 96/0/0 default/read/poll queues
[ 1125.732483] nvme5n1: p1 p2 p3
[ 1125.788049] BUG: unable to handle kernel NULL pointer dereference at 0130
[ 1125.788054] PGD 0 P4D 0
[ 1125.788057] Oops: 0002 [#1] SMP NOPTI
[ 1125.788059] CPU: 50 PID: 0 Comm: swapper/50 Kdump: loaded Tainted: G --- -t - 4.18.0-147.el8.x86_64 #1
[ 1125.788065] RIP: 0010:nvme_irq+0xe8/0x240 [nvme]
[ 1125.788068] RSP: 0018:916b8ec83ed0 EFLAGS: 00010813
[ 1125.788069] RAX: RBX: 918ae9211b00 RCX:
[ 1125.788070] RDX: 400b RSI: RDI:
[ 1125.788071] RBP: 918ae887 R08: 0004 R09: 918ae887
[ 1125.788072] R10: R11: R12:
[ 1125.788073] R13: 0001 R14: R15: 0001
[ 1125.788075] FS: () GS:916b8ec8() knlGS:
[ 1125.788075] CS: 0010 DS: ES: CR0: 80050033
[ 1125.788076] CR2: 0130 CR3: 001768f0 CR4: 00340ee0
[ 1125.788077] Call Trace:
[ 1125.788080]
[ 1125.788085] __handle_irq_event_percpu+0x40/0x180
[ 1125.788087] handle_irq_event_percpu+0x30/0x80
[ 1125.788089] handle_irq_event+0x36/0x53
[ 1125.788090] handle_edge_irq+0x82/0x190
[ 1125.788094] handle_irq+0xbf/0x100
[ 1125.788098] do_IRQ+0x49/0xd0
[ 1125.788100] common_interrupt+0xf/0xf

Signed-off-by: Xianting Tian
---
 drivers/nvme/host/pci.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index ba725ae47..5f1c51a43 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -960,6 +960,13 @@ static inline void nvme_handle_cqe(struct nvme_queue *nvmeq, u16 idx)
 	}
 
 	req = blk_mq_tag_to_rq(nvme_queue_tagset(nvmeq), cqe->command_id);
+	if (unlikely(!req)) {
+		dev_warn(nvmeq->dev->ctrl.device,
+			"req is null(tag:%d) on queue %d\n",
+			cqe->command_id, le16_to_cpu(cqe->sq_id));
+		return;
+	}
+
 	trace_nvme_sq(req, cqe->sq_head, nvmeq->sq_tail);
 	if (!nvme_end_request(req, cqe->status, cqe->result))
 		nvme_pci_complete_rq(req);
--
2.17.1
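The shape of the fix, look the completion's command_id up and bail out with a warning instead of dereferencing NULL, can be sketched in userspace C. handle_cqe(), tag_table and request_stub are hypothetical stand-ins for nvme_handle_cqe(), the tagset lookup and struct request:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define TAG_DEPTH 4

struct request_stub { int done; };
static struct request_stub *tag_table[TAG_DEPTH];  /* blk_mq_tag_to_rq() analog */

static int handled, dropped;

/* Miniature of nvme_handle_cqe(): a spurious or corrupt completion can
 * carry a command_id with no request behind it, so guard before use. */
static void handle_cqe(unsigned int command_id)
{
    struct request_stub *req =
        command_id < TAG_DEPTH ? tag_table[command_id] : NULL;

    if (!req) {                               /* the guard the patch adds */
        fprintf(stderr, "req is null (tag:%u)\n", command_id);
        dropped++;
        return;
    }
    req->done = 1;                            /* nvme_end_request() analog */
    handled++;
}
```

Dropping the bogus completion with a warning trades a silent NULL dereference in interrupt context for a diagnosable log line.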
RE: [PATCH] nvme-pci: check req to prevent crash in nvme_handle_cqe()
I am very sorry, I used the wrong patch file :( I am sending it again. Please review, thanks.

-Original Message-
From: kernel test robot [mailto:l...@intel.com]
Sent: Monday, August 31, 2020 5:41 PM
To: tianxianting (RD) ; kbu...@kernel.org; ax...@fb.com; h...@lst.de; s...@grimberg.me
Cc: kbuild-...@lists.01.org; linux-n...@lists.infradead.org; linux-kernel@vger.kernel.org; tianxianting (RD)
Subject: Re: [PATCH] nvme-pci: check req to prevent crash in nvme_handle_cqe()

Hi Xianting,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on v5.9-rc2]
[also build test ERROR on next-20200828]
[cannot apply to linus/master v5.9-rc3]
[If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/Xianting-Tian/nvme-pci-check-req-to-prevent-crash-in-nvme_handle_cqe/20200831-155653
base: d012a7190fc1fd72ed48911e77ca97ba4521bccd
config: parisc-randconfig-r004-20200831 (attached as .config)
compiler: hppa-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=parisc

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot

All errors (new ones prefixed by >>):

   drivers/nvme/host/pci.c: In function 'nvme_handle_cqe':
>> drivers/nvme/host/pci.c:3244: error: unterminated argument list invoking macro "dev_warn"
    3244 | module_exit(nvme_exit);
         |
>> drivers/nvme/host/pci.c:966:3: error: 'dev_warn' undeclared (first use in this function); did you mean '_dev_warn'?
     966 |   dev_warn(nvmeq->dev->ctrl.device,
         |   ^~~~
         |   _dev_warn
   drivers/nvme/host/pci.c:966:3: note: each undeclared identifier is reported only once for each function it appears in
>> drivers/nvme/host/pci.c:966:11: error: expected ';' at end of input
     966 |   dev_warn(nvmeq->dev->ctrl.device,
         |           ^
         |           ;
   ..
    3244 | module_exit(nvme_exit);
         |
>> drivers/nvme/host/pci.c:966:3: error: expected declaration or statement at end of input
     966 |   dev_warn(nvmeq->dev->ctrl.device,
         |   ^~~~
>> drivers/nvme/host/pci.c:966:3: error: expected declaration or statement at end of input
   drivers/nvme/host/pci.c: At top level:
   drivers/nvme/host/pci.c:105:13: warning: 'nvme_dev_disable' declared 'static' but never defined [-Wunused-function]
     105 | static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
         |             ^~~~
   drivers/nvme/host/pci.c:106:13: warning: '__nvme_disable_io_queues' declared 'static' but never defined [-Wunused-function]
     106 | static bool __nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode);
         |             ^~~~
   drivers/nvme/host/pci.c:901:13: warning: 'nvme_pci_complete_rq' defined but not used [-Wunused-function]
     901 | static void nvme_pci_complete_rq(struct request *req)
         |             ^~~~
   drivers/nvme/host/pci.c:853:21: warning: 'nvme_queue_rq' defined but not used [-Wunused-function]
     853 | static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
         |                     ^
   drivers/nvme/host/pci.c:484:13: warning: 'nvme_commit_rqs' defined but not used [-Wunused-function]
     484 | static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
         |             ^~~
   drivers/nvme/host/pci.c:427:12: warning: 'nvme_pci_map_queues' defined but not used [-Wunused-function]
     427 | static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
         |            ^~~
   drivers/nvme/host/pci.c:403:12: warning: 'nvme_init_request' defined but not used [-Wunused-function]
     403 | static int nvme_init_request(struct blk_mq_tag_set *set, struct request *req,
         |            ^
   drivers/nvme/host/pci.c:392:12: warning: 'nvme_init_hctx' defined but not used [-Wunused-function]
     392 | static int nvme_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
         |            ^~
   drivers/nvme/host/pci.c:379:12: warning: 'nvme_admin_init_hctx' defined but not used [-Wunused-function]
     379 | static int nvme_admin_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
         |            ^~~~
   drivers/nvme/host/pci.c:371:15: warning: 'nvme_pci_iod_alloc_size' defined but not used [-Wunused-function]
     371 | static size_t nvme_pci_iod_alloc_size(void)
         |
RE: [PATCH] aio: make aio wait path to account iowait time
Thanks, peterz and Jan. So enabling aio iowait time accounting is a bad idea :( I learned a lot from you both, thanks.

-Original Message-
From: pet...@infradead.org [mailto:pet...@infradead.org]
Sent: Friday, August 28, 2020 6:52 PM
To: Jan Kara
Cc: tianxianting (RD) ; v...@zeniv.linux.org.uk; b...@kvack.org; mi...@redhat.com; juri.le...@redhat.com; vincent.guit...@linaro.org; dietmar.eggem...@arm.com; rost...@goodmis.org; bseg...@google.com; mgor...@suse.de; linux-fsde...@vger.kernel.org; linux-...@kvack.org; linux-kernel@vger.kernel.org; Tejun Heo ; han...@cmpxchg.org
Subject: Re: [PATCH] aio: make aio wait path to account iowait time

On Fri, Aug 28, 2020 at 11:41:29AM +0200, Jan Kara wrote:
> On Fri 28-08-20 11:07:29, pet...@infradead.org wrote:
> > On Fri, Aug 28, 2020 at 02:07:12PM +0800, Xianting Tian wrote:
> > > As the normal aio wait path(read_events() ->
> > > wait_event_interruptible_hrtimeout()) doesn't account iowait time,
> > > so use this patch to make it to account iowait time, which can
> > > truely reflect the system io situation when using a tool like 'top'.
> >
> > Do be aware though that io_schedule() is potentially far more
> > expensive than regular schedule() and io-wait accounting as a whole
> > is a trainwreck.
>
> Hum, I didn't know that io_schedule() is that much more expensive.
> Thanks for info.

It's all relative, but it can add up under contention. And since these storage thingies are getting faster every year, I'm assuming these schedule rates are increasing along with it.

> > When in_iowait is set schedule() and ttwu() will have to do
> > additional atomic ops, and (much) worse, PSI will take additional locks.
> >
> > And all that for a number that, IMO, is mostly useless, see the
> > comment with nr_iowait().
>
> Well, I understand the limited usefulness of the system or even per
> CPU percentage spent in IO wait. However whether a particular task is
> sleeping waiting for IO or not

So strict per-task state is not a problem, and we could easily change get_task_state() to distinguish between IO-wait or not, basically duplicate S/D state into an IO-wait variant of the same. Although even this has ABI implications :-(

> is IMO a useful diagnostic information and there are several places in
> the kernel that take that into account (PSI, hangcheck timer, cpufreq,
> ...).

So PSI is the one I hate most. We spend an awful lot of time to not have to take the old rq->lock on wakeup, and PSI reintroduced it for accounting purposes -- I hate accounting overhead. :/ There's a number of high frequency scheduling workloads where it really adds up, which is the reason we got rid of it in the first place. OTOH, PSI gives more sensible numbers, although it goes side-ways when you introduce affinity masks / cpusets.

The menu-cpufreq gov is known crazy and we're all hard working on replacing it. And the tick-sched usage is, iirc, the nohz case of iowait.

> So I don't see that properly accounting that a task is waiting for IO
> is just "expensive random number generator" as you mention below :).
> But I'm open to being educated...

It's the userspace iowait, and in particular the per-cpu iowait numbers that I hate. Only on UP does any of that make sense. But we can't remove them because ABI :-(
RE: [PATCH] [v2] blk-mq: use BLK_MQ_NO_TAG for no tag
Hi Ming Lei, thanks for your quick comment. Since request_to_qc_t() in 'include/linux/blk-mq.h' uses the magic '-1', it seems hard to replace that one with BLK_MQ_NO_TAG :(

-Original Message-
From: Ming Lei [mailto:ming@redhat.com]
Sent: Wednesday, August 26, 2020 12:29 PM
To: tianxianting (RD)
Cc: ax...@kernel.dk; linux-bl...@vger.kernel.org; linux-kernel@vger.kernel.org
Subject: Re: [PATCH] [v2] blk-mq: use BLK_MQ_NO_TAG for no tag

On Wed, Aug 26, 2020 at 10:06:51AM +0800, Xianting Tian wrote:
> Replace various magic -1 constants for tags with BLK_MQ_NO_TAG.
> And move the definition of BLK_MQ_NO_TAG from 'block/blk-mq-tag.h'
> to 'include/linux/blk-mq.h'

All three symbols are supposed for block core internal code only, so looks you shouldn't move them to a public header.

Thanks,
Ming
RE: [PATCH] blk-mq: use BLK_MQ_NO_TAG for no tag
Hi Jens, sorry to trouble you; I apologize for taking the compile for granted. I checked: the two changed files compile fine:

block/blk-core.c
block/blk-mq-sched.c

The compile fails for this function in the header file below:

include/linux/blk-mq.h: request_to_qc_t()
-	if (rq->tag != -1)
+	if (rq->tag != BLK_MQ_NO_TAG)

BLK_MQ_NO_TAG is defined in 'block/blk-mq-tag.h'. If I include that header in 'include/linux/blk-mq.h' via '#include "../../block/blk-mq-tag.h"', many other similar compile failures appear, for example:

In file included from ./include/linux/../../block/blk-mq-tag.h:5:0,
                 from ./include/linux/blk-mq.h:8,
                 from block/blk.h:6,
                 from block/bio.c:24:
./include/linux/../../block/blk-mq.h:21:29: error: 'HCTX_MAX_TYPES' undeclared here (not in a function); did you mean 'KOBJ_NS_TYPES'?
  struct list_head rq_lists[HCTX_MAX_TYPES];
                            ^~

If I move the BLK_MQ_NO_TAG definition below to 'include/linux/blk-mq.h', the whole kernel compiles fine with the patch:

enum {
	BLK_MQ_NO_TAG		= -1U,
	BLK_MQ_TAG_MIN		= 1,
	BLK_MQ_TAG_MAX		= BLK_MQ_NO_TAG - 1,
};

Will you accept moving BLK_MQ_NO_TAG to 'include/linux/blk-mq.h' as above? Thanks.

-Original Message-
From: Jens Axboe [mailto:ax...@kernel.dk]
Sent: Monday, August 24, 2020 4:58 AM
To: tianxianting (RD) ; a...@kernel.org; dan...@iogearbox.net; ka...@fb.com; songliubrav...@fb.com; y...@fb.com; andr...@fb.com; john.fastab...@gmail.com; kpsi...@chromium.org
Cc: linux-bl...@vger.kernel.org; linux-kernel@vger.kernel.org
Subject: Re: [PATCH] blk-mq: use BLK_MQ_NO_TAG for no tag

On 8/23/20 9:44 AM, Xianting Tian wrote:
> Replace various magic -1 constants for tags with BLK_MQ_NO_TAG.

Doesn't look like this patch was even compiled...

--
Jens Axboe
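The underlying problem is layering: a static inline in the public header cannot see an enum defined in a private block/ header, and Ming's objection above is that the private symbols should not migrate wholesale. One way to sketch a resolution, purely as an illustration and not what was merged, is to give the public header its own sentinel and let the private enum build on it:

```c
#include <assert.h>

/* --- public header (include/linux/blk-mq.h analog) ---
 * A single sentinel visible to inline helpers; -1U wraps to UINT_MAX. */
#define BLK_MQ_NO_TAG (-1U)

static inline int rq_tag_valid(unsigned int tag)
{
    return tag != BLK_MQ_NO_TAG;  /* replaces the bare "tag != -1" */
}

/* --- private header (block/blk-mq-tag.h analog) ---
 * The internal range constants stay private but reuse the sentinel. */
enum {
    BLK_MQ_TAG_MIN = 1,
    BLK_MQ_TAG_MAX = BLK_MQ_NO_TAG - 1,
};
```

This keeps BLK_MQ_TAG_MIN/MAX internal while still letting the public inline compare against a named value, which is the trade-off the thread is negotiating.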
RE: [PATCH] timers: use set_current_state macro
Please review this one, thanks. https://lkml.org/lkml/2020/5/14/796

-Original Message-
From: tianxianting (RD)
Sent: May 14, 2020 9:46 AM
To: 'Thomas Gleixner' ; john.stu...@linaro.org; sb...@kernel.org
Cc: linux-kernel@vger.kernel.org
Subject: RE: [PATCH] timers: use set_current_state macro

Hi Thomas, thanks for your reply :) I am sorry for the inconvenience. tian.xiant...@h3c.com is the mail address provided by my company, and the company may add some restrictions when I send mail outside of it. I will re-send the patch via my personal mail; I have used it to send several patches before, and those patches were accepted.

-Original Message-
From: Thomas Gleixner [mailto:t...@linutronix.de]
Sent: May 14, 2020 5:22 AM
To: tianxianting (RD) ; john.stu...@linaro.org; sb...@kernel.org
Cc: linux-kernel@vger.kernel.org; tianxianting (RD)
Subject: Re: [PATCH] timers: use set_current_state macro

Xianting,

Xianting Tian writes:

thanks for your patch. Can you please fix your mail client to have proper mail headers? It provides:

Content-Type: text/plain
Content-Transfer-Encoding: quoted-printable

but it fails to provide the charset information. That causes the footer to become unreadable garbage not only in my mail reader. See:

https://lore.kernel.org/lkml/20200508020222.15791-1-tian.xiant...@h3c.com/

What's worse is that it causes my patch handling scripts to fail to decode the mail body correctly. And I'm not really inclined to figure out how to handle this case.

> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -1885,7 +1885,7 @@ signed long __sched schedule_timeout(signed long timeout)
>  		printk(KERN_ERR "schedule_timeout: wrong timeout "
>  			"value %lx\n", timeout);
>  		dump_stack();
> -		current->state = TASK_RUNNING;
> +		set_current_state(TASK_RUNNING);

That's not the same and adds a barrier which is not needed. Not a big problem in that particular error handling code path, but in general you really have to look whether your replacement is resulting in the same code.
If not then you need to make an argument in the changelog why you are replacing existing code with something which is not fully equivalent. For this particular case, please check the implementation and read the documentation of set_current_state() in include/linux/sched.h.

> --
> ---
> ±¾Óʼþ¼°Æ丽¼þº¬ÓÐлªÈý¼¯Íŵı£ÃÜÐÅÏ¢£¬½öÏÞÓÚ·¢Ë͸øÉÏÃæµØÖ·ÖÐÁгö
> µÄ¸öÈË»òȺ×é¡£½ûÖ¹ÈκÎÆäËûÈËÒÔÈκÎÐÎʽʹÓ㨰üÀ¨µ«²»ÏÞÓÚÈ«²¿»ò²¿·ÖµØй¶¡¢¸´ÖÆ¡¢
> »òÉ¢·¢£©±¾ÓʼþÖеÄÐÅÏ¢¡£Èç¹ûÄú´íÊÕÁ˱¾Óʼþ£¬ÇëÄúÁ¢¼´µç»°»òÓʼþ֪ͨ·¢¼þÈ˲¢É¾³ý±¾
> Óʼþ£¡

This is the resulting garbage. Not that I could decipher the Chinese characters which should be here instead, but at least they would look way nicer. But see below:

> This e-mail and its attachments contain confidential information from
> New H3C, which is intended only for the person or entity whose address
> is listed above. Any use of the information contained herein in any
> way (including, but not limited to, total or partial disclosure,
> reproduction, or dissemination) by persons other than the intended
> recipient(s) is prohibited. If you receive this e-mail in error,
> please notify the sender by phone or email immediately and delete it!

Can you please remove this disclaimer completely (which avoids the garbage issue as well)? It does not make any sense if you send mail to a public mailing list:

1) If you send mail to a public list which is archived in public then the information can't be confidential and restricted to a particular audience. It can be accessed by everyone on this planet who has access to the internet.

2) If you really send confidential information accidentally then there is no way to delete it. It's out there in the public and in archives and you can't call it back.

Thanks,
tglx
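Thomas's distinction, that set_current_state() is a store followed by a full memory barrier (smp_mb()) while 'current->state = TASK_RUNNING' is just a plain store, can be illustrated with a userspace analog. task_stub and both helpers are stand-ins, and atomic_thread_fence() plays the role of smp_mb():

```c
#include <assert.h>
#include <stdatomic.h>

struct task_stub { int state; };  /* stand-in for task_struct */
#define TASK_RUNNING       0
#define TASK_INTERRUPTIBLE 1

/* Analog of a plain "current->state = x": a bare store, no ordering. */
static void set_state_plain(struct task_stub *t, int state)
{
    t->state = state;
}

/* Analog of set_current_state(): the store is followed by a full
 * barrier, so a later load of the wakeup condition cannot be reordered
 * before the state store.  That pairing with the waker is what makes
 * kernel wait loops race-free, and it is why the two forms compile to
 * different code. */
static void set_state_mb(struct task_stub *t, int state)
{
    t->state = state;
    atomic_thread_fence(memory_order_seq_cst);  /* smp_mb() analog */
}
```

In the schedule_timeout() error path the barrier buys nothing, which is exactly Thomas's point: the substitution is not a no-op and would need a changelog justification.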