Re: [PATCH v2] IB/umem: ib_ucontext already have tgid, remove pid from ib_umem structure
On Tue, May 08, 2018 at 04:50:16PM +0800, Lidong Chen wrote:
> The userspace may invoke ibv_reg_mr and ibv_dereg_mr from different threads.
> If, when ibv_dereg_mr is invoked, the thread which invoked ibv_reg_mr has
> exited, get_pid_task will return NULL and ib_umem_release does not decrease
> mm->pinned_vm. This patch fixes it by using the tgid in the ib_ucontext struct.
>
> Signed-off-by: Lidong Chen
> ---
> [v2]
> - use ib_ucontext tgid instead of tgid in ib_umem structure

I'm looking at this again, and it doesn't seem quite right..

> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index 9a4e899..2b6c9b5 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -119,7 +119,6 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
>  	umem->length = size;
>  	umem->address = addr;
>  	umem->page_shift = PAGE_SHIFT;
> -	umem->pid = get_task_pid(current, PIDTYPE_PID);
>  	/*
>  	 * We ask for writable memory if any of the following
>  	 * access flags are set. "Local write" and "remote write"
> @@ -132,7 +131,6 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
>  		   IB_ACCESS_REMOTE_ATOMIC | IB_ACCESS_MW_BIND));
>
>  	if (access & IB_ACCESS_ON_DEMAND) {
> -		put_pid(umem->pid);
>  		ret = ib_umem_odp_get(context, umem, access);
>  		if (ret) {
>  			kfree(umem);
> @@ -148,7 +146,6 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
>
>  	page_list = (struct page **) __get_free_page(GFP_KERNEL);
>  	if (!page_list) {
> -		put_pid(umem->pid);
>  		kfree(umem);
>  		return ERR_PTR(-ENOMEM);
>  	}

in ib_umem_get we are doing this:

	down_write(&current->mm->mmap_sem);
	locked = npages + current->mm->pinned_vm;

And then in release we now do:

	task = get_pid_task(umem->context->tgid, PIDTYPE_PID);
	if (!task)
		goto out;
	mm = get_task_mm(task);
	mm->pinned_vm -= diff;

But there is no guarantee that context->tgid and 'current' are the
same thing during ib_umem_get..
So in the dysfunctional case where someone forks and keeps the context
FD open on both sides of the fork, they can cause the pinned_vm counter
to become wrong in the processes. Sounds bad..

Thus, I think we need to go back to storing the tgid in the ib_umem,
and just fix it to store the group leader, not the thread PID? And
then, even more, we need the ib_get_mr_mm() helper to make sense of
this, because all the drivers are doing the wrong thing by using
context->tgid too.

Is that all right?

Jason
Re: WARNING: bad unlock balance in ucma_event_handler
On Wed, Jun 13, 2018 at 06:47:02AM -0700, syzbot wrote:
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit:    73fcb1a370c7 Merge branch 'akpm' (patches from Andrew)
> git tree:       upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=16d7082780
> kernel config:  https://syzkaller.appspot.com/x/.config?x=f3b4e30da84ec1ed
> dashboard link: https://syzkaller.appspot.com/bug?extid=e5579222b6a3edd96522
> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
> syzkaller repro: https://syzkaller.appspot.com/x/repro.syz?x=176daf9780
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=15e7bd5780
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+e5579222b6a3edd96...@syzkaller.appspotmail.com
>
> =
> WARNING: bad unlock balance detected!
> 4.17.0-rc5+ #58 Not tainted
> kworker/u4:0/6 is trying to release lock (&file->mut) at:
> [] ucma_event_handler+0x780/0xff0 drivers/infiniband/core/ucma.c:390
> but there are no more locks to release!
>
> other info that might help us debug this:
> 4 locks held by kworker/u4:0/6:
>  #0: (ptrval) ((wq_completion)"ib_addr"){+.+.}, at: __write_once_size include/linux/compiler.h:215 [inline]
>  #0: (ptrval) ((wq_completion)"ib_addr"){+.+.}, at: arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
>  #0: (ptrval) ((wq_completion)"ib_addr"){+.+.}, at: atomic64_set include/asm-generic/atomic-instrumented.h:40 [inline]
>  #0: (ptrval) ((wq_completion)"ib_addr"){+.+.}, at: atomic_long_set include/asm-generic/atomic-long.h:57 [inline]
>  #0: (ptrval) ((wq_completion)"ib_addr"){+.+.}, at: set_work_data kernel/workqueue.c:617 [inline]
>  #0: (ptrval) ((wq_completion)"ib_addr"){+.+.}, at: set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
>  #0: (ptrval) ((wq_completion)"ib_addr"){+.+.}, at: process_one_work+0xaef/0x1b50 kernel/workqueue.c:2116
>  #1: (ptrval) ((work_completion)(&(&req->work)->work)){+.+.}, at: process_one_work+0xb46/0x1b50 kernel/workqueue.c:2120
>  #2: (ptrval) (&id_priv->handler_mutex){+.+.}, at: addr_handler+0xa6/0x3d0 drivers/infiniband/core/cma.c:2796
>  #3: (ptrval) (&file->mut){+.+.}, at: ucma_event_handler+0x10e/0xff0 drivers/infiniband/core/ucma.c:350

I think this is probably a use-after-free race, eg when we do
ctx->file->mut we have raced with ucma_free_ctx()..

Which probably means something along the way to free_ctx() did not
call rdma_addr_cancel?

Jason
Re: [PATCH v3 1/2] IDR: Expose the XArray lock
On Wed, Jun 13, 2018 at 05:34:02AM -0700, Matthew Wilcox wrote:
> Allow users of the IDR to use the XArray lock for their own
> synchronisation purposes. The IDR continues to rely on the caller to
> handle locking, but this lets the caller use the lock embedded in the
> IDR data structure instead of allocating their own lock.
>
> Signed-off-by: Matthew Wilcox
> ---
>  include/linux/idr.h | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/include/linux/idr.h b/include/linux/idr.h
> index e856f4e0ab35..109d7a3c8d56 100644
> --- a/include/linux/idr.h
> +++ b/include/linux/idr.h
> @@ -98,6 +98,17 @@ static inline void idr_set_cursor(struct idr *idr, unsigned int val)
>   * period).
>   */
>
> +#define idr_lock(idr)		xa_lock(&(idr)->idr_xa)
> +#define idr_unlock(idr)		xa_unlock(&(idr)->idr_xa)
> +#define idr_lock_bh(idr)	xa_lock_bh(&(idr)->idr_xa)
> +#define idr_unlock_bh(idr)	xa_unlock_bh(&(idr)->idr_xa)
> +#define idr_lock_irq(idr)	xa_lock_irq(&(idr)->idr_xa)
> +#define idr_unlock_irq(idr)	xa_unlock_irq(&(idr)->idr_xa)
> +#define idr_lock_irqsave(idr, flags) \
> +	xa_lock_irqsave(&(idr)->idr_xa, flags)
> +#define idr_unlock_irqrestore(idr, flags) \
> +	xa_unlock_irqrestore(&(idr)->idr_xa, flags)
> +
>  void idr_preload(gfp_t gfp_mask);

Hmm, what do I need to apply this?

  drivers/infiniband/core/mad.c: In function ‘ib_register_mad_agent’:
  drivers/infiniband/core/mad.c:387:34: error: ‘struct idr’ has no member named ‘idr_xa’; did you mean ‘idr_rt’?
    idr_lock(&ib_mad_clients);
    ^
    idr_rt

Linus's master (be779f03d563981) doesn't seem to have idr_xa?

Jason
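The win the patch is after can be sketched in userspace (hypothetical names; a pthread mutex standing in for the XArray spinlock): callers synchronise on the lock already embedded in the container instead of pairing every IDR with a separately allocated lock.

```c
#include <assert.h>
#include <pthread.h>

/* Container with an embedded lock, mirroring how idr_lock()/idr_unlock()
 * let IDR users borrow the XArray's internal lock rather than
 * allocating their own lock alongside every IDR. */
struct id_table {
	pthread_mutex_t lock;	/* stands in for the embedded xa_lock */
	int next_id;
};

#define id_table_lock(t)	pthread_mutex_lock(&(t)->lock)
#define id_table_unlock(t)	pthread_mutex_unlock(&(t)->lock)

/* Allocate the next id under the embedded lock, much as a caller of
 * idr_alloc() would do under idr_lock(). */
static int id_table_alloc(struct id_table *t)
{
	int id;

	id_table_lock(t);
	id = t->next_id++;
	id_table_unlock(t);
	return id;
}
```

The macro wrappers keep the embedded lock an implementation detail, so the container can later change its internal lock type without touching callers — the same property the `idr_lock()` family gives the IDR.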
Re: [PATCH] infiniband: fix a subtle race condition
On Thu, Jun 14, 2018 at 10:01:08AM +0300, Leon Romanovsky wrote:
> On Wed, Jun 13, 2018 at 11:21:54PM -0700, Cong Wang wrote:
> > On Wed, Jun 13, 2018 at 10:34 PM, Leon Romanovsky wrote:
> > >
> > > Hi Cong,
> > >
> > > If the compiler optimizes the first line (mutex_lock) as you wrote,
> > > it will reuse "f" for the second line (mutex_unlock) too.
> >
> > Nope, check the assembly if you don't trust me, at least
> > my compiler always fetches ctx->file without this patch.
> >
> > I can show you the assembly code tomorrow (too late to
> > access my dev machine now).
>
> I trust you, so I don't need to check it, however I wanted to emphasize
> that your solution is compiler specific and not universally true.
>
> > > You need to ensure that ucma_modify_id() doesn't run in parallel to
> > > anything that uses "ctx->file" directly and indirectly.
> >
> > Talk is easy, show me the code. :) I knew there is probably
> > some other race with this code even after my patch, possibly with
> > ->close() for example, but for this specific unlock warning, this patch
> > is sufficient. I can't solve all the races in one patch.
>
> We do prefer a complete solution once the problem is fully understood.
> It looks like you are one step away from the final patch. It would be a
> conversion of the mutex to an rwlock, separating the read paths (almost
> all places) from the write path (in ucma_migrate_id).

This was my brief reaction too; this code path almost certainly has a
use-after-free, and we should fix the concurrency between the two
places in some correct way..

Jason
Re: [PATCH] infiniband: fix a subtle race condition
On Thu, Jun 14, 2018 at 10:03:09AM -0700, Cong Wang wrote:
> On Thu, Jun 14, 2018 at 7:24 AM, Jason Gunthorpe wrote:
> >
> > This was my brief reaction too, this code path almost certainly has a
> > use-after-free, and we should fix the concurrency between the two
> > places in some correct way..
>
> First of all, why could a use-after-free trigger an imbalanced unlock?
> IOW, why do we have to solve the use-after-free to fix this imbalanced
> unlock?

The issue syzkaller hit is that accessing ctx->file does not seem to be
locked in any way and can race with other manipulations of ctx->file.

So.. for this patch to be correct we need to understand how this
statement:

	f = ctx->file

avoids f becoming a dangling pointer - and without locking, my
suspicion is that it doesn't - because the missing locking around
ctx->file is probably the actual bug syzkaller found.

If this is not the case, then add a comment explaining how f's lifetime
is OK. Otherwise, we need some kind of locking, and my guess is we need
to hold a kref on f?

> Third of all, the use-after-free I can see (race with ->close) exists
> before my patch; this patch doesn't make it better or worse, nor do I
> have any intent to fix it.

I'm not sure that race exists, there should be something that flushes
the WQ on the path to close... (though I have another email suggesting
perhaps that is broken, sigh)

Jason
Re: [PATCH] infiniband: fix a subtle race condition
On Thu, Jun 14, 2018 at 04:14:13PM -0700, Cong Wang wrote:
> On Thu, Jun 14, 2018 at 10:24 AM, Jason Gunthorpe wrote:
> > On Thu, Jun 14, 2018 at 10:03:09AM -0700, Cong Wang wrote:
> >> On Thu, Jun 14, 2018 at 7:24 AM, Jason Gunthorpe wrote:
> >> >
> >> > This was my brief reaction too, this code path almost certainly has a
> >> > use-after-free, and we should fix the concurrency between the two
> >> > places in some correct way..
> >>
> >> First of all, why could a use-after-free trigger an imbalanced unlock?
> >> IOW, why do we have to solve the use-after-free to fix this imbalanced
> >> unlock?
> >
> > The issue syzkaller hit is that accessing ctx->file does not seem
> > locked in any way and can race with other manipulations of ctx->file.
> >
> > So.. for this patch to be correct we need to understand how this
> > statement:
> >
> >	f = ctx->file
> >
> > avoids f becoming a dangling pointer - and without locking, my
>
> It doesn't, because this is not the point; this is not the cause
> of the unlock imbalance either. syzbot didn't report a use-after-free
> or a kernel segfault here.

No, it *is* the point - you've proposed a solution, one of many, and
we need to see an actual sensible design for how the locking around
ctx->file should work correctly. We need solutions that solve the
underlying problem, not just paper over the symptoms.

Stated another way, for a syzkaller report like this there are a few
really obvious fixes.

1) Capture the lock pointer on the stack:

	f = ctx->file;
	mutex_lock(&f->mut);
	mutex_unlock(&f->mut);

2) Prevent ctx->file from changing, eg add more locking:

	mutex_lock(&mut);
	mutex_lock(&ctx->file->mut);
	mutex_unlock(&ctx->file->mut);
	mutex_unlock(&mut);

3) Prevent ctx->file from being changed/freed by flushing the WQ at
   the right times:

	rdma_addr_cancel(...);
	ctx->file = XYZ;

This patch proposed #1. An explanation is required why that is a
correct locking design for this code. It sure looks like it isn't.
Looking at this *just a bit*, I wonder why not do something like this:

	mutex_lock(&mut);
	f = ctx->file;
	mutex_lock(&f->mutex);
	mutex_unlock(&mut);

? At least that *might* make sense. Though probably it deadlocks, as
it looks like we call rdma_addr_cancel() while holding mut. Yuk. But
maybe that sequence could be done before launching the work..

> > I'm not sure that race exists, there should be something that flushes
> > the WQ on the path to close... (though I have another email that
> > perhaps that is broken, sigh)
>
> This is not related to my patch, but to convince you, let me explain:
>
> struct ucma_file is not refcnt'ed. I know you cancel the work in
> rdma_destroy_id(), but after ucma_migrate_id() the ctx has already
> been moved to the new file; for the old file, it won't cancel the
> ctx flying in the workqueue. So, I think the following use-after-free
> could happen:
>
> ucma_event_handler():
>	cur_file = ctx->file; // old file
>
>		ucma_migrate_id():
>			lock();
>			list_move_tail(&ctx->list, &new_file->ctx_list);
>			ctx->file = new_file;
>			unlock();
>
>		ucma_close():
>			// retrieve old file via filp->private_data
>			// the loop won't cover the ctx moved to the new_file
>			kfree(file);

Yep. That sure seems like the right analysis!

> This is _not_ the cause of the unlock imbalance, and it is _not_
> expected to be solved by this patch either.

What do you mean? Not calling rdma_addr_cancel() prior to freeing the
file is *exactly* the cause of the lock imbalance.

The design of this code *assumes* that rdma_addr_cancel() will be
called before altering/freeing/etc any of the state it is working on;
migration makes a state change that violates that invariant.

Jason
Re: [PATCH] infiniband: fix a subtle race condition
On Fri, Jun 15, 2018 at 10:57:49AM -0700, Cong Wang wrote:
> > No, it *is* the point - you've proposed a solution, one of many, and
> > we need to see an actual sensible design for how the locking around
> > ctx->file should work correctly.
>
> I proposed a solution for the imbalanced unlock; you ask for a design
> for the use-after-free, which is *irrelevant*. So why is it the point?

The point is, I don't care about the imbalance report. I care about
the actual bug, which you have identified as ucma_migrate_id() running
concurrently with ucma_event_handler(). That seems like a great
analysis, BTW.

Stop that from happening and the lock imbalance warning will naturally
go away. So will the use after free.

I gave you some general ideas on how to do that; obviously they are
not easy to do, eg somehow solving the deadlock with mut would be
tricky. But maybe there is still some kind of simple solution..

Another option might be to just fail ucma_migrate_id() when
ucma_event_handler() is outstanding.. That *might* be OK.. We've
talked about doing things like this for other ucma syzkaller bugs.
Also a bit complicated.

Anyhow, I'm NAK'ing this patch, since it just doesn't move things
forward, and it removes a warning that is pointing at a bunch of
different bugs.

Jason
Re: [PATCH rdma-next 1/5] RDMA/hns: Implement the disassociate_ucontext API
On Thu, May 17, 2018 at 04:02:49PM +0800, Wei Hu (Xavier) wrote:
> This patch implements the IB core disassociate_ucontext API.
>
> Signed-off-by: Wei Hu (Xavier)
> ---
>  drivers/infiniband/hw/hns/hns_roce_main.c | 36 ++++++++++++++++++++++++++++++
>  1 file changed, 36 insertions(+)
>
> diff --git a/drivers/infiniband/hw/hns/hns_roce_main.c b/drivers/infiniband/hw/hns/hns_roce_main.c
> index 96fb6a9..7fafe9d 100644
> --- a/drivers/infiniband/hw/hns/hns_roce_main.c
> +++ b/drivers/infiniband/hw/hns/hns_roce_main.c
> @@ -33,6 +33,9 @@
>  #include
>  #include
>  #include
> +#include
> +#include
> +#include
>  #include
>  #include
>  #include
> @@ -422,6 +425,38 @@ static int hns_roce_port_immutable(struct ib_device *ib_dev, u8 port_num,
>  	return 0;
>  }
>
> +static void hns_roce_disassociate_ucontext(struct ib_ucontext *ibcontext)
> +{
> +	struct task_struct *process;
> +	struct mm_struct *mm;
> +
> +	process = get_pid_task(ibcontext->tgid, PIDTYPE_PID);
> +	if (!process)
> +		return;
> +
> +	mm = get_task_mm(process);
> +	if (!mm) {
> +		pr_info("no mm, disassociate ucontext is pending task termination\n");
> +		while (1) {
> +			put_task_struct(process);
> +			usleep_range(1000, 2000);
> +			process = get_pid_task(ibcontext->tgid, PIDTYPE_PID);
> +			if (!process || process->state == TASK_DEAD) {
> +				pr_info("disassociate ucontext done, task was terminated\n");
> +				/* if task was dead, need to release the task
> +				 * struct.
> +				 */
> +				if (process)
> +					put_task_struct(process);
> +				return;
> +			}
> +		}
> +	}

I don't want to see this boilerplate code copied into every driver.
Hoist it into the core code, have the disassociate driver callback
accept a mm_struct parameter, and refactor the other drivers using
this.

> +	mmput(mm);
> +	put_task_struct(process);
> +}

This can't be right; disassociate requires the driver to replace all
the mmaps it made to user space with the 0 page. I see hns does use
mmaps, so it must zap them here.

Jason
Re: [PATCH rdma-next 4/5] RDMA/hns: Add reset process for RoCE in hip08
On Thu, May 17, 2018 at 04:02:52PM +0800, Wei Hu (Xavier) wrote:
> diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
> index 86ef15f..e1c44a6 100644
> --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
> +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
> @@ -774,6 +774,9 @@ static int hns_roce_cmq_send(struct hns_roce_dev *hr_dev,
>  	int ret = 0;
>  	int ntc;
>
> +	if (hr_dev->is_reset)
> +		return 0;
> +
>  	spin_lock_bh(&csq->lock);
>
>  	if (num > hns_roce_cmq_space(csq)) {
> @@ -4790,6 +4793,7 @@ static int hns_roce_hw_v2_init_instance(struct hnae3_handle *handle)
>  	return 0;
>
>  error_failed_get_cfg:
> +	handle->priv = NULL;
>  	kfree(hr_dev->priv);
>
>  error_failed_kzalloc:
> @@ -4803,14 +4807,70 @@ static void hns_roce_hw_v2_uninit_instance(struct hnae3_handle *handle,
>  {
>  	struct hns_roce_dev *hr_dev = (struct hns_roce_dev *)handle->priv;
>
> +	if (!hr_dev)
> +		return;
> +
>  	hns_roce_exit(hr_dev);
> +	handle->priv = NULL;
>  	kfree(hr_dev->priv);
>  	ib_dealloc_device(&hr_dev->ib_dev);
>  }

Why are these hunks here? If init fails then uninit should not be
called, so why meddle with priv?

Jason
Re: [PATCH rdma-next 4/5] RDMA/hns: Add reset process for RoCE in hip08
On Fri, May 18, 2018 at 11:28:11AM +0800, Wei Hu (Xavier) wrote:
>
> On 2018/5/17 23:14, Jason Gunthorpe wrote:
> > On Thu, May 17, 2018 at 04:02:52PM +0800, Wei Hu (Xavier) wrote:
> >> diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
> >> index 86ef15f..e1c44a6 100644
> >> --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
> >> +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
> >> @@ -774,6 +774,9 @@ static int hns_roce_cmq_send(struct hns_roce_dev *hr_dev,
> >>	int ret = 0;
> >>	int ntc;
> >>
> >> +	if (hr_dev->is_reset)
> >> +		return 0;
> >> +
> >>	spin_lock_bh(&csq->lock);
> >>
> >>	if (num > hns_roce_cmq_space(csq)) {
> >> @@ -4790,6 +4793,7 @@ static int hns_roce_hw_v2_init_instance(struct hnae3_handle *handle)
> >>	return 0;
> >>
> >> error_failed_get_cfg:
> >> +	handle->priv = NULL;
> >>	kfree(hr_dev->priv);
> >>
> >> error_failed_kzalloc:
> >> @@ -4803,14 +4807,70 @@ static void hns_roce_hw_v2_uninit_instance(struct hnae3_handle *handle,
> >> {
> >>	struct hns_roce_dev *hr_dev = (struct hns_roce_dev *)handle->priv;
> >>
> >> +	if (!hr_dev)
> >> +		return;
> >> +
> >>	hns_roce_exit(hr_dev);
> >> +	handle->priv = NULL;
> >>	kfree(hr_dev->priv);
> >>	ib_dealloc_device(&hr_dev->ib_dev);
> >> }
> > Why are these hunks here? If init fails then uninit should not be
> > called, so why meddle with priv?
>
> In the hns_roce_hw_v2_init_instance function, we assign hr_dev to
> handle->priv, and we want to clear the value in the
> hns_roce_hw_v2_uninit_instance function, so we can ensure there is no
> problem in the RoCE driver.

What problem could happen?

I keep removing unnecessary sets to NULL and checks for NULL, so
please don't add them if they cannot happen. Eg uninit should never be
called with a NULL priv; that is a serious logic mis-design someplace
if it happens.

Jason
Re: linux-next: manual merge of the block tree with the rdma tree
On Wed, Aug 15, 2018 at 11:45:39AM +1000, Stephen Rothwell wrote:
> Hi all,
>
> On Thu, 26 Jul 2018 13:58:04 +1000 Stephen Rothwell wrote:
> >
> > Today's linux-next merge of the block tree got a conflict in:
> >
> >   drivers/nvme/target/rdma.c
> >
> > between commits:
> >
> >   23f96d1f15a7 ("nvmet-rdma: Simplify ib_post_(send|recv|srq_recv)() calls")
> >   202093848cac ("nvmet-rdma: add an error flow for post_recv failures")
> >
> > from the rdma tree and commit:
> >
> >   2fc464e2162c ("nvmet-rdma: add unlikely check in the fast path")
> >
> > from the block tree.
> >
> > I fixed it up (see below) and can carry the fix as necessary. This
> > is now fixed as far as linux-next is concerned, but any non trivial
> > conflicts should be mentioned to your upstream maintainer when your
> > tree is submitted for merging. You may also want to consider
> > cooperating with the maintainer of the conflicting tree to minimise
> > any particularly complex conflicts.
>
> This is now a conflict between Linus' tree and the rdma tree.

Yes, I expected this.. good thing we had linux-next, as several of
these needed non-obvious changes.

I keep track of your postings and build a conflict resolution for
Linus to refer to. netdev is the last conflicting tree I expect, and
it hasn't been sent yet..

Thanks,
Jason
Re: [PATCH] RDMA/bnxt_re: QPLIB: Add and use #define dev_fmt(fmt) "QPLIB: " fmt
On Fri, Aug 10, 2018 at 11:42:46AM -0700, Joe Perches wrote:
> Consistently use the "QPLIB: " prefix for dev_<level> logging.
>
> Miscellanea:
>
> o Add missing newlines to avoid possible message interleaving
> o Coalesce consecutive dev_<level> uses that emit a message header to
>   avoid < 80 column lengths and mistakenly output on multiple lines
> o Reflow modified lines to use 80 columns where appropriate
> o Consistently use "%s: " where __func__ is output
> o QPLIB: is now always output immediately after the dev_<level> header
>
> Signed-off-by: Joe Perches
> ---
>  drivers/infiniband/hw/bnxt_re/qplib_fp.c   | 130 +++-
>  drivers/infiniband/hw/bnxt_re/qplib_rcfw.c |  51 ++-
>  drivers/infiniband/hw/bnxt_re/qplib_res.c  |  29 ---
>  drivers/infiniband/hw/bnxt_re/qplib_sp.c   |  57 ++---
>  4 files changed, 116 insertions(+), 151 deletions(-)

Broadcom: This is a big patch, will you ack it please?

https://patchwork.kernel.org/patch/10563083/

Thanks,
Jason
Re: [PATCH] rdma: move the ib_wr_opcode enum to include/uapi
On Tue, Aug 14, 2018 at 03:33:02PM -0700, Seth Howell wrote:
> By making this enum accessible to the userspace API, we can avoid
> conflicts between the kernel-space ib_wr_opcode and the user-space
> ibv_wr_opcode enums.
>
> When using the rxe software driver interface between kernel and user
> space, the enum value IBV_WR_SEND_WITH_INV was being improperly
> translated to IB_WR_READ_WITH_INV, causing the userspace application
> to fail.
>
> Signed-off-by: Seth Howell
> ---
>  include/rdma/ib_verbs.h           | 31 -------------------------------
>  include/uapi/rdma/ib_user_verbs.h | 31 +++++++++++++++++++++++++++++++
>  2 files changed, 31 insertions(+), 31 deletions(-)

I've looked at this for a bit now, and I think we should just do the
below and not change userspace:

From e3c1b1b81373601fa6cbad2ba5fadd2c8cfdfaed Mon Sep 17 00:00:00 2001
From: Seth Howell
Date: Tue, 14 Aug 2018 15:33:02 -0700
Subject: [PATCH] IB/rxe: Revise the ib_wr_opcode enum

This enum has become part of the uABI, as both RXE and the
ib_uverbs_post_send() command expect userspace to supply values from
this enum. So it should be properly placed in include/uapi/rdma.

In userspace this enum is called 'enum ibv_wr_opcode' as part of
libibverbs.h. That enum defines different values for IB_WR_LOCAL_INV,
IB_WR_SEND_WITH_INV, and IB_WR_LSO. These were introduced
(incorrectly, it turns out) into libibverbs in 2015.

The kernel has changed its mind on the numbering for several of the
IB_WC values over the years, but has remained stable on
IB_WR_LOCAL_INV and below.

Based on this we can conclude that there is no real user space user of
the values beyond IB_WR_ATOMIC_FETCH_AND_ADD, as they have never
worked via rdma-core. This is confirmed by inspection: only rxe uses
the kernel enum and implements the latter operations. rxe has clearly
never worked with these attributes from userspace. Other drivers that
support these opcodes implement the functionality without calling out
to the kernel.

To make IB_WR_SEND_WITH_INV and related work for RXE in userspace we
choose to renumber the IB_WR enum in the kernel to match the uABI that
userspace has been using since before Soft RoCE was merged. This is an
overall simpler configuration for the whole software stack, and
obviously can't break anything existing.

Reported-by: Seth Howell
Fixes: 8700e3e7c485 ("Soft RoCE driver")
Cc:
Signed-off-by: Jason Gunthorpe
---
 include/rdma/ib_verbs.h           | 34 ++++++++++++++++++++-----------
 include/uapi/rdma/ib_user_verbs.h | 20 +++++++++++++++++-
 2 files changed, 39 insertions(+), 15 deletions(-)

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 5d404c20b49f27..5e51ea9832a0c8 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1278,21 +1278,27 @@ struct ib_qp_attr {
 };
 
 enum ib_wr_opcode {
-	IB_WR_RDMA_WRITE,
-	IB_WR_RDMA_WRITE_WITH_IMM,
-	IB_WR_SEND,
-	IB_WR_SEND_WITH_IMM,
-	IB_WR_RDMA_READ,
-	IB_WR_ATOMIC_CMP_AND_SWP,
-	IB_WR_ATOMIC_FETCH_AND_ADD,
-	IB_WR_LSO,
-	IB_WR_SEND_WITH_INV,
-	IB_WR_RDMA_READ_WITH_INV,
-	IB_WR_LOCAL_INV,
-	IB_WR_REG_MR,
-	IB_WR_MASKED_ATOMIC_CMP_AND_SWP,
-	IB_WR_MASKED_ATOMIC_FETCH_AND_ADD,
+	/* These are shared with userspace */
+	IB_WR_RDMA_WRITE = IB_UVERBS_WR_RDMA_WRITE,
+	IB_WR_RDMA_WRITE_WITH_IMM = IB_UVERBS_WR_RDMA_WRITE_WITH_IMM,
+	IB_WR_SEND = IB_UVERBS_WR_SEND,
+	IB_WR_SEND_WITH_IMM = IB_UVERBS_WR_SEND_WITH_IMM,
+	IB_WR_RDMA_READ = IB_UVERBS_WR_RDMA_READ,
+	IB_WR_ATOMIC_CMP_AND_SWP = IB_UVERBS_WR_ATOMIC_CMP_AND_SWP,
+	IB_WR_ATOMIC_FETCH_AND_ADD = IB_UVERBS_WR_ATOMIC_FETCH_AND_ADD,
+	IB_WR_LSO = IB_UVERBS_WR_TSO,
+	IB_WR_SEND_WITH_INV = IB_UVERBS_WR_SEND_WITH_INV,
+	IB_WR_RDMA_READ_WITH_INV = IB_UVERBS_WR_RDMA_READ_WITH_INV,
+	IB_WR_LOCAL_INV = IB_UVERBS_WR_LOCAL_INV,
+	IB_WR_MASKED_ATOMIC_CMP_AND_SWP =
+		IB_UVERBS_WR_MASKED_ATOMIC_CMP_AND_SWP,
+	IB_WR_MASKED_ATOMIC_FETCH_AND_ADD =
+		IB_UVERBS_WR_MASKED_ATOMIC_FETCH_AND_ADD,
+
+	/* These are kernel only and can not be issued by userspace */
+	IB_WR_REG_MR = 0x20,
 	IB_WR_REG_SIG_MR,
 	/* reserve values for low level drivers' internal use.
 	 * These values will not be used at all in the ib core layer.
 	 */
diff --git a/include/uapi/rdma/ib_user_verbs.h b/include/uapi/rdma/ib_user_verbs.h
index 25a16760de2ad1..1254b51a551a1c 100644
--- a/include/uapi/rdma/ib_user_verbs.h
+++ b/include/uapi/rdma/ib_user_verbs.h
@@ -763,10 +763,28 @@ struct ib_uverbs_sge {
 	__u32 lkey;
 };
 
+enum ib_uverbs_wr_opcode {
+	IB_UVERBS_WR_RDMA_WRITE = 0,
+	IB_UVERBS_WR_RDMA_WRITE_WITH_IMM = 1,
+	IB_UVERBS_WR_SEND = 2,
+	IB_UVERBS_WR_SEND_WITH_IMM = 3,
+	IB_UVERBS_WR_RDMA_READ = 4,
+	IB_UV
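The numbering clash the patch fixes can be shown directly. This sketch spells out both orderings (the kernel values come from the pre-patch `ib_verbs.h` enum above; the userspace ordering is an assumption based on libibverbs, per the commit message): userspace's SEND_WITH_INV lands on the slot the kernel had given to RDMA_READ_WITH_INV, which is exactly the mistranslation Seth reported.

```c
#include <assert.h>

/* Pre-patch kernel ordering from include/rdma/ib_verbs.h: LSO sits
 * between ATOMIC_FETCH_AND_ADD and SEND_WITH_INV. */
enum kern_wr_opcode {
	KERN_WR_RDMA_WRITE,
	KERN_WR_RDMA_WRITE_WITH_IMM,
	KERN_WR_SEND,
	KERN_WR_SEND_WITH_IMM,
	KERN_WR_RDMA_READ,
	KERN_WR_ATOMIC_CMP_AND_SWP,
	KERN_WR_ATOMIC_FETCH_AND_ADD,
	KERN_WR_LSO,			/* = 7 */
	KERN_WR_SEND_WITH_INV,		/* = 8 */
	KERN_WR_RDMA_READ_WITH_INV,	/* = 9 */
	KERN_WR_LOCAL_INV,		/* = 10 */
};

/* Userspace (libibverbs) ordering, assumed here: LOCAL_INV, BIND_MW and
 * SEND_WITH_INV diverge from the kernel after ATOMIC_FETCH_AND_ADD. */
enum user_wr_opcode {
	USER_WR_RDMA_WRITE,
	USER_WR_RDMA_WRITE_WITH_IMM,
	USER_WR_SEND,
	USER_WR_SEND_WITH_IMM,
	USER_WR_RDMA_READ,
	USER_WR_ATOMIC_CMP_AND_SWP,
	USER_WR_ATOMIC_FETCH_AND_ADD,
	USER_WR_LOCAL_INV,		/* = 7 */
	USER_WR_BIND_MW,		/* = 8 */
	USER_WR_SEND_WITH_INV,		/* = 9 */
	USER_WR_TSO,			/* = 10 */
};

/* What a kernel that trusts the raw wire value does with a userspace
 * opcode - the passthrough that bit rxe. */
static enum kern_wr_opcode naive_translate(enum user_wr_opcode op)
{
	return (enum kern_wr_opcode)op;
}
```

Renumbering the kernel enum to the uABI values, as the patch does, makes `naive_translate` correct by construction, which is why it is simpler than teaching every driver to remap.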
Re: [PATCH] rdma: move the ib_wr_opcode enum to include/uapi
On Wed, Aug 15, 2018 at 11:27:36PM +, Howell, Seth wrote:
> Thank you for taking the time to review this issue. The patch you
> submitted will definitely fix the problem I was facing.

Can you test it to be certain?

Thanks,
Jason
[GIT PULL] Please pull RDMA subsystem changes
ng
      RDMA/ocrdma: Remove a set-but-not-used variable
      IB/iser: Remove set-but-not-used variables
      RDMA/rxe: Simplify the error handling code in rxe_create_ah()
      RDMA/cxgb3: Make iwch_poll_cq_one() easier to analyze
      RDMA/cxgb4: Make c4iw_poll_cq_one() easier to analyze
      RDMA/rw: Fix rdma_rw_ctx_signature_init() kernel-doc header
      RDMA/nes: Avoid complaints about unused variables
      RDMA/ocrdma: Make ocrdma_destroy_qp() easier to analyze
      IB/hfi1: Suppress a compiler warning
      hns: Remove a set-but-not-used variable
      IB/srpt: Fix srpt_cm_req_recv() error path (1/2)
      IB/srpt: Fix srpt_cm_req_recv() error path (2/2)
      RDMA/bnxt_re: Modify a fall-through annotation
      MAINTAINERS: Remove Dave Goodell from the usnic RDMA driver maintainer list
      IB/core: Allow ULPs to specify NULL as the third ib_post_(send|recv|srq_recv)() argument
      RDMA/core: Simplify ib_post_(send|recv|srq_recv)() calls
      IB/IPoIB: Simplify ib_post_(send|recv|srq_recv)() calls
      IB/iser: Simplify ib_post_(send|recv|srq_recv)() calls
      IB/isert: Simplify ib_post_(send|recv|srq_recv)() calls
      IB/srp: Simplify ib_post_(send|recv|srq_recv)() calls
      IB/srpt: Simplify ib_post_(send|recv|srq_recv)() calls
      nvme-rdma: Simplify ib_post_(send|recv|srq_recv)() calls
      nvmet-rdma: Simplify ib_post_(send|recv|srq_recv)() calls
      fs/cifs: Simplify ib_post_(send|recv|srq_recv)() calls
      net/9p: Simplify ib_post_(send|recv|srq_recv)() calls
      net/rds: Remove two WARN_ON() statements
      net/rds: Simplify ib_post_(send|recv|srq_recv)() calls
      net/smc: Remove a WARN_ON() statement
      net/smc: Simplify ib_post_(send|recv|srq_recv)() calls
      net/xprtrdma: Simplify ib_post_(send|recv|srq_recv)() calls
      RDMA/ocrdma: Suppress a compiler warning
      RDMA/usnic: Suppress a compiler warning
      IB/iser: Inline two work request conversion functions
      RDMA: Constify the argument of the work request conversion functions
      IB/mlx5, ib_post_send(), IB_WR_REG_SIG_MR: Do not modify the 'wr' argument
      RDMA, core and ULPs: Declare ib_post_send() and ib_post_recv() arguments const
      rdma/cxgb4: Remove a set-but-not-used variable
      rdma/cxgb4: Fix SRQ endianness annotations
      rdma/cxgb4: Simplify a structure initialization
      RDMA/rxe: Set wqe->status correctly if an unexpected response is received
      Update the e-mail address of Bart Van Assche

Bharat Potnuri (1):
      iw_cxgb4: remove duplicate memcpy() in c4iw_create_listen()

Dan Carpenter (4):
      IB/core: type promotion bug in rdma_rw_init_one_mr()
      RDMA/bnxt_re: Fix a couple off by one bugs
      RDMA/bnxt_re: Fix a bunch of off by one bugs in qplib_fp.c
      rdma/cxgb4: fix some info leaks

Denis Drozdov (1):
      IB/IPoIB: Set ah valid flag in multicast send flow

Erez Shitrit (1):
      IB/ipoib: Use cancel_delayed_work_sync for neigh-clean task

Gal Pressman (1):
      RDMA/hns: Fix usage of bitmap allocation functions return values

Håkon Bugge (1):
      IB/cm: Remove unused and erroneous msg sequence encoding

Jack Morgenstein (1):
      IB/mlx4: Use 4K pages for kernel QP's WQE buffer

Jan Dakinevich (2):
      ib_srpt: use kvmalloc to allocate ring pointers
      IPoIB: use kvzalloc to allocate an array of bucket pointers

Jann Horn (1):
      IB/mlx5: fix uaccess beyond "count" in debugfs read/write handlers

Jason Gunthorpe (90):
      IB/core: Provide rdma_ versions of the gid cache API
      IB/core: Make rdma_find_gid_by_filter support all protocols
      IB/rxe: Do not hide uABI stuff in memcpy
      IB/core: Add a sgid_attr pointer to struct rdma_ah_attr
      IB{cm, core}: Introduce and use ah_attr copy, move, replace APIs
      RDMA: Hold the sgid_attr inside the struct ib_ah/qp
      Merge branch 'mellanox/mlx5-next' into RDMA for-next
      Merge branch 'icrc-counter' into rdma.git for-next
      IB/uverbs: Delete type and id from uverbs_obj_attr
      RDMA/core: Remove unused ib cache functions
      IB/usnic: Update with bug fixes from core code
      RDMA/uverbs: Check existence of create_flow callback
      IB/mlx4: Create slave AH's directly
      IB/core: Check for rdma_protocol_ib only after validating port_num
      IB/cm: Remove now useless rcu_lock in dst_fetch_ha
      Merge branch 'mlx5-dump-fill-mkey' into rdma.git for-next
      RDMA/uverbs: Store the specs_root in the struct ib_uverbs_device
      RDMA/uverbs: Split UVERBS_ATTR_FLOW_ACTION_ESP_HANDLE
      RDMA/uverbs: Simplify UVERBS_ATTR family of macros
      RDMA/uverbs: Simplify method definition macros
      RDMA/uverbs: Simplify UVERBS_OBJECT and _TREE family of macros
      RDMA/uverbs: Get rid of the & in method specifications
      RDMA/uverbs: Remove UA_FLAGS
      RDMA/uverbs: Use UVERBS_ATTR_MIN_SIZE correctly and uniformly
      RDMA/uverbs: Combine MIN_SZ_OR_ZERO with UVERBS_ATTR
Re: [GIT PULL] Please pull RDMA subsystem changes
On Fri, Aug 17, 2018 at 12:31:00PM -0700, Linus Torvalds wrote:
> On Thu, Aug 16, 2018 at 2:57 PM Jason Gunthorpe wrote:
> >
> > This time there are many merge conflicts, all due to various
> > tree-wide or RDMA subsystem wide changes done by various people. The
> > resolution is tricky, as git does not highlight two areas that need
> > revision.
>
> I was confused by this, because there were no merge conflicts.
>
> Then I looked at the history, and the reason is that you actually
> pushed the for-next branch into the for-linus branch, and so I got the
> pre-merged version.

Er, yes, sorry, I made a typo in my email. It should have read "As
before I've resolved the conflicts in the for-linus tag and provided
the unmerged tree in the for-linus-unmerged tag".

> Oh well. I'll look at what the conflicts were, but may end up just
> taking your resolution.

We often have merge conflicts with RDMA; how do you prefer to get the
PR? I'm actually not very clear on this part of the process.

Historically, when we have conflicts, I've given you a tag
'for-linus-unmerged', a text description of the conflicts with merge
diff hunks, and then a tag 'for-linus' which contains the full
resolution I used and checked (based on linux-next). In the past
you've taken from both tags, I think, depending on how the resolution
looks?

I generate the 'git request-pull' against the 'for-linus' tag only
because the diffstat of the resolved branch is closer to what you'll
actually see during the merge. Otherwise there are always misleading
netdev changes due to the shared branches with netdev.

If you don't care about the diffstat, I'll switch to 'for-linus' and
'for-linus-resolved' tags so you won't accidentally get any resolution
commit unless you want it.

Thanks,
Jason
Re: [GIT PULL] Please pull RDMA subsystem changes
On Fri, Aug 17, 2018 at 01:50:05PM -0700, Linus Torvalds wrote: > On Fri, Aug 17, 2018 at 12:44 PM Linus Torvalds > wrote: > > > > Ok, everything but the max_sge thing was trivial. I just took yours. > > Oh, and doing the full test compile, I notice there are new warnings. > > That is *NOT* ok. I expect to send you a pull request to remove all of this next week. This is fallout from the SMC merge conflict reversion. > The whole "x is deprecated" is not a useful warning. If you can't > remove something, making it a reminder for yourself for later is not > an acceptable excuse for bothering everybody else with the warning, > and possibly having other issues hidden because by the time there are > expected warnings, people will start ignoring the unexpected ones too. > > So "__deprecated" is for stuff that really isn't used any more, to > catch possible new users. People have tried to use it for anything > else, but it's always been a much bigger pain than it is worth. In this case it has become confusing what CONFIG_ENABLE_WARN_DEPRECATED is for. If the kernel is supposed to compile warning free with that config set, then why have the config at all - it does nothing, by definition? Would you like a patch to remove that CONFIG option? Jason
Re: [GIT PULL] Please pull RDMA subsystem changes
On Fri, Aug 17, 2018 at 01:27:02PM -0700, Linus Torvalds wrote: > On Fri, Aug 17, 2018 at 1:15 PM Jason Gunthorpe wrote: > > > > We often have merge conflicts with RDMA, how do you prefer to get the > > PR? I'm actually not very clear on this part of the process. > > I generally prefer the non-merged version, but with comments about > the conflicts. > > In fact, the really preferred model is generally to send me a > non-merged pull request (say "tags/for-linus") but if the conflicts > look even half-way nasty - or simply because you did the merge anyway > just to get the proper diffstat because history is complex - mention > that you have a merged branch for verification (say > "branch/for-linus-merged"). > > When I get that kind of pull request, I'll just do the merge > resolution myself, and after I've done it I'll generally then do > >git checkout -b test-merge >git pull for-linus-merged > > and then just compare the end result with my resolution as an extra > verification step. > > I may end up skipping the verification step if everything looks > entirely trivial (and really, if you have no real reason for the > pre-merged branch, don't bother even mentioning it even if you did it > for the diffstat), but in practice whenever somebody does that, I have > no reason not to just do that simple extra verification. > > Most of the time the merges are exactly the same, possibly with > whitespace or trivial ordering differences (things like Makefiles and > Kconfig files often have add-add conflicts where the ordering really > doesn't matter between two config options). > > And then sometimes it shows that I missed something in my mmerge. > > And sometimes it shows that I do so many merges that I actually ended > up noticing something that the submaintainer didn't. 
> > So it can be uninteresting, and when it isn't uninteresting it can go > either way, but so far for the people who do this, I've never been in > the situation where I was *sorry* for the double merge and extra > verification step. > > Because when mis-merges happen (they are happily pretty rare) they are > _so_ annoying and can be so hard to figure out later, that I'd rather > spend a bit more time on the merge than have a dodgy one. Thanks for the explanation. Doug and I will do this in future. Jason
[PATCH] Revert "Permit silencing of __deprecated warnings."
This reverts commit de48844398f81cfdf087d56e12c920d620dae8d5. Linus would prefer that __deprecated never produce a warning in an allyesconfig compile. Since we have been at this state for some time, the option no longer has a purpose. Link: https://lkml.kernel.org/r/ca+55afyglkszs8o3m4jae69wum03tgof+507jccvitrqagb...@mail.gmail.com Signed-off-by: Jason Gunthorpe --- Documentation/process/4.Coding.rst | 2 +- include/linux/compiler_types.h | 6 -- lib/Kconfig.debug | 8 3 files changed, 1 insertion(+), 15 deletions(-) Linus: As discussed. Jonathan: I'm not sure if you prefer diffs to documentation to be minimal like this, or if I should reflow the paragraph? I can send this in a PR via rdma.git with the __deprecation removal patch probably Monday/Tuesday if there is agreement, no 0-day surprises, etc. Cheers, Jason diff --git a/Documentation/process/4.Coding.rst b/Documentation/process/4.Coding.rst index eb4b185d168c05..f03c62f50d7d0a 100644 --- a/Documentation/process/4.Coding.rst +++ b/Documentation/process/4.Coding.rst @@ -249,7 +249,7 @@ features; most of these are found in the "kernel hacking" submenu. Several of these options should be turned on for any kernel used for development or testing purposes. In particular, you should turn on: - - ENABLE_WARN_DEPRECATED, ENABLE_MUST_CHECK, and FRAME_WARN to get an + - ENABLE_MUST_CHECK, and FRAME_WARN to get an extra set of warnings for problems like the use of deprecated interfaces or ignoring an important return value from a function. 
The output generated by these warnings can be verbose, but one need not worry about diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h index a8ba6b04152c13..a2b428c43ae134 100644 --- a/include/linux/compiler_types.h +++ b/include/linux/compiler_types.h @@ -135,12 +135,6 @@ struct ftrace_likely_data { #undef __must_check #define __must_check #endif -#ifndef CONFIG_ENABLE_WARN_DEPRECATED -#undef __deprecated -#undef __deprecated_for_modules -#define __deprecated -#define __deprecated_for_modules -#endif #ifndef __malloc #define __malloc diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index c6e73904c5a5da..ab1b599202bca7 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -211,14 +211,6 @@ config GDB_SCRIPTS instance. See Documentation/dev-tools/gdb-kernel-debugging.rst for further details. -config ENABLE_WARN_DEPRECATED - bool "Enable __deprecated logic" - default y - help - Enable the __deprecated logic in the kernel build. - Disable this to suppress the "warning: 'foo' is deprecated - (declared at kernel/power/somefile.c:1234)" messages. - config ENABLE_MUST_CHECK bool "Enable __must_check logic" default y -- 2.18.0
Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
On Wed, Dec 12, 2018 at 04:53:49PM -0500, Jerome Glisse wrote: > > Almost, we need some safety around assuming that DMA is complete the > > page, so the notification would need to go all to way to userspace > > with something like a file lease notification. It would also need to > > be backstopped by an IOMMU in the case where the hardware does not / > > can not stop in-flight DMA. > > You can always reprogram the hardware right away it will redirect > any dma to the crappy page. That causes silent data corruption for RDMA users - we can't do that. The only way out for current hardware is to forcibly terminate the RDMA activity somehow (and I'm not even sure this is possible, at least it would be driver specific) Even the IOMMU idea probably doesn't work, I doubt all current hardware can handle a PCI-E error TLP properly. On some hardware it probably just protects DAX by causing data corruption for RDMA - I fail to see how that is a win for system stability if the user obviously wants to use DAX and RDMA together... I think your approach with ODP only is the only one that meets your requirements, the only other data-integrity-preserving approach is to block/fail ftruncate/etc. > From my point of view driver should listen to ftruncate before the > mmu notifier kicks in and send event to userspace and maybe wait > and block ftruncate (or move it to a worker thread). We can do this, but we can't guarantee forward progress in userspace, and the only way we have to cancel it that is portable to all RDMA hardware is to kill the process(es).. So if that is acceptable then we could use user notifiers and allow non-ODP users... Jason
Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
On Wed, Dec 12, 2018 at 07:01:09PM -0500, Jerome Glisse wrote: > On Wed, Dec 12, 2018 at 04:37:03PM -0700, Jason Gunthorpe wrote: > > On Wed, Dec 12, 2018 at 04:53:49PM -0500, Jerome Glisse wrote: > > > > Almost, we need some safety around assuming that DMA is complete the > > > > page, so the notification would need to go all to way to userspace > > > > with something like a file lease notification. It would also need to > > > > be backstopped by an IOMMU in the case where the hardware does not / > > > > can not stop in-flight DMA. > > > > > > You can always reprogram the hardware right away it will redirect > > > any dma to the crappy page. > > > > That causes silent data corruption for RDMA users - we can't do that. > > > > The only way out for current hardware is to forcibly terminate the > > RDMA activity somehow (and I'm not even sure this is possible, at > > least it would be driver specific) > > > > Even the IOMMU idea probably doesn't work, I doubt all current > > hardware can handle a PCI-E error TLP properly. > > What i saying is reprogram hardware to crappy page ie valid page > dma map but that just has random content as a last resort to allow > filesystem to reuse block. So their should be no PCIE error unless > hardware freak out to see its page table reprogram randomly. No, that isn't an option. You can't silently provide corrupted data for RDMA to transfer out onto the network, or silently discard data coming in!! Think of the consequences of that - I have a fileserver process and someone does ftruncate and now my clients receive corrupted data?? The only option is to prevent the RDMA transfer from ever happening, and we just don't have hardware support (beyond destroy everything) to do that. > The question is who do you want to punish ? RDMA user that pin stuff > and expect thing to work forever without worrying for other fs > activities ? 
Or filesystem to pin block forever :) I don't want to punish everyone, I want both sides to have complete data integrity as the USER has deliberately decided to combine DAX and RDMA. So either stop it at the front end (ie get_user_pages_longterm) or make it work in a way that guarantees integrity for both. > S2: notify userspace program through device/sub-system > specific API and delay ftruncate. After a while if there > is no answer just be mean and force hardware to use > crappy page as anyway this is what happens today I don't think this happens today (outside of DAX).. Does it? .. and the remedy here is to kill the process, not provide corrupt data. Killing the process is likely not to go over well with any real users that want this combination. Think Samba serving files over RDMA - you can't have random unpriv users calling ftruncate and causing smbd to be killed or serve corrupt data. Jason
Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
On Wed, Dec 12, 2018 at 07:44:37PM -0500, Jerome Glisse wrote: > On many GPUs you can do that, it is hardware dependant and you have > steps to take but it is something you can do (and GPU can do > continuous DMA traffic have they have threads running that can > do continuous memory access). So i assume that other hardware > can do it too. RDMA has no generic way to modify a MR and then guarntee the HW sees the modifications. Some HW can do this (ie the same HW that can do ODP, because ODP needs this capability), other HW is an unknown as this has never been asked for as a driver API. Jason
Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
On Wed, Dec 19, 2018 at 10:42:54AM +1100, Dave Chinner wrote: > Essentially, what we are talking about is how to handle broken > hardware. I say we should just brun it with napalm and thermite > (i.e. taint the kernel with "unsupportable hardware") and force > wait_for_stable_page() to trigger when there are GUP mappings if > the underlying storage doesn't already require it. If you want to ban O_DIRECT/etc from writing to file backed pages, then just do it. Otherwise I'm not sure demanding some unrealistic HW design is reasonable. ie nvme drives are not likely to add page faulting to their IO path any time soon. A SW architecture that relies on page faulting is just not going to support real world block IO devices. GPUs and one RDMA are about the only things that can do this today, and they are basically irrelevant to O_DIRECT. Jason
Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
On Wed, Dec 19, 2018 at 12:35:40PM +0100, Jan Kara wrote: > On Wed 19-12-18 21:28:25, Dave Chinner wrote: > > On Tue, Dec 18, 2018 at 08:03:29PM -0700, Jason Gunthorpe wrote: > > > On Wed, Dec 19, 2018 at 10:42:54AM +1100, Dave Chinner wrote: > > > > > > > Essentially, what we are talking about is how to handle broken > > > > hardware. I say we should just brun it with napalm and thermite > > > > (i.e. taint the kernel with "unsupportable hardware") and force > > > > wait_for_stable_page() to trigger when there are GUP mappings if > > > > the underlying storage doesn't already require it. > > > > > > If you want to ban O_DIRECT/etc from writing to file backed pages, > > > then just do it. > > > > O_DIRECT IO *isn't the problem*. > > That is not true. O_DIRECT IO is a problem. In some aspects it is > easier than the problem with RDMA but currently O_DIRECT IO can > crash your machine or corrupt data the same way RDMA can. Just the > race window is much smaller. So we have to fix the generic GUP > infrastructure to make O_DIRECT IO work. I agree that fixing RDMA > will likely require even more work like revokable leases or what > not. This is what I've understood, talking to all the experts. Dave? Why do you think O_DIRECT is actually OK? I agree the duration issue with RDMA is different, but don't forget, O_DIRECT goes out to the network too and has potentially very long timeouts as well. If O_DIRECT works fine then let's use the same approach in RDMA?? Jason
Re: [PATCH -next] IB/mlx5: remove duplicated include from mlx5_ib.h
On Mon, Dec 10, 2018 at 03:27:59PM +0800, YueHaibing wrote: > Remove duplicated include. > > Signed-off-by: YueHaibing > Acked-by: Leon Romanovsky > --- > drivers/infiniband/hw/mlx5/mlx5_ib.h | 1 - > 1 file changed, 1 deletion(-) Applied to for-next Thanks, Jason
Re: rfc: bool structure members (was Re: [PATCH V3] net/mlx4: Get rid of page operation after dma_alloc_coherent)
On Thu, Dec 20, 2018 at 09:12:43PM -0800, Joe Perches wrote: > Care to submit a coding_style.rst patch or > improve the one below this? I took yours and revised it a little bit. I spent some time looking at assembly and decided to drop the performance note, the number of cases that run into overhead seems pretty small and probably already requires !! to be correct. There is also an equally unlikely gain, ie 'if (a & b)' optimizes a tiny bit better for bool types. I also added a small intro on bool, as I know some people are unfamiliar with C11 _Bool and might think bool is just '#define bool u8' (for those added to the cc) I'm looking at cases, like the patch that spawned this, where the struct has a single bool and no performance considerations. As CH said, seeing that get converted to int due to checkpatch is worse than having used bool. Using u8 won't make this struct smaller or faster. Cheers, Jason >From c5e2c887f6326e1c4369876f39996417da5e90cc Mon Sep 17 00:00:00 2001 From: Jason Gunthorpe Date: Fri, 21 Dec 2018 16:29:22 -0700 Subject: [PATCH] coding-style: Clarify the expectations around bool There has been some confusion since checkpatch started warning about bool use in structures, and people have been avoiding using it. Many people feel there is still a legitimate place for bool in structures, so provide some guidance on bool usage derived from the entire thread that spawned the checkpatch warning. Link: https://lkml.kernel.org/r/ca+55afwvzk1ofb9t2v014ptakfhtvan_zj2dojncy3x6e4u...@mail.gmail.com Signed-off-by: Joe Perches Signed-off-by: Jason Gunthorpe --- Documentation/process/coding-style.rst | 33 ++ 1 file changed, 29 insertions(+), 4 deletions(-) diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst index 4e7c0a1c427a9a..efdb1d32035fe1 100644 --- a/Documentation/process/coding-style.rst +++ b/Documentation/process/coding-style.rst @@ -918,7 +918,32 @@ result. 
Typical examples would be functions that return pointers; they use NULL or the ERR_PTR mechanism to report failure. -17) Don't re-invent the kernel macros +17) Using bool +-- + +The Linux kernel uses the C11 standard for the bool type. bool values can only +evaluate to 0 or 1, and implicit or explicit conversion to bool automatically +converts the value to true or false. When using bool types the !! construction +is not needed, which eliminates a class of bugs. + +When working with bool values the true and false labels should be used instead +of 0 and 1. + +bool function return types, arguments and stack variables are always fine to +use whenever appropriate. Use of bool is encouraged to improve readability and +is often a better option than 'int' for storing boolean values. + +Do not use bool if cache line layout or size of the value matters, its size +and alignment varies based on the compiled architecture. Structures that are +optimized for alignment and size should not use bool. + +If a structure has many true/false values, consider consolidating them into a +bitfield with 1 bit members, or using an appropriate fixed width type, such as +u8. + +Otherwise limited use of bool in structures does improve readability. + +18) Don't re-invent the kernel macros - The header file include/linux/kernel.h contains a number of macros that @@ -941,7 +966,7 @@ need them. Feel free to peruse that header file to see what else is already defined that you shouldn't reproduce in your code. -18) Editor modelines and other cruft +19) Editor modelines and other cruft Some editors can interpret configuration information embedded in source files, @@ -975,7 +1000,7 @@ own custom mode, or may have some other magic method for making indentation work correctly. 
-19) Inline assembly +20) Inline assembly --- In architecture-specific code, you may need to use inline assembly to interface @@ -1007,7 +1032,7 @@ the next instruction in the assembly output: : /* outputs */ : /* inputs */ : /* clobbers */); -20) Conditional Compilation +21) Conditional Compilation --- Wherever possible, don't use preprocessor conditionals (#if, #ifdef) in .c -- 2.19.2
[PATCH] lib/scatterlist: Provide a DMA page iterator
Commit 2db76d7c3c6d ("lib/scatterlist: sg_page_iter: support sg lists w/o backing pages") introduced the sg_page_iter_dma_address() function without providing a way to use it in the general case. If the sg_dma_len is not equal to the dma_length callers cannot safely use the for_each_sg_page/sg_page_iter_dma_address combination. Resolve this API mistake by providing a DMA specific iterator, for_each_sg_dma_page(), that uses the right length so sg_page_iter_dma_address() works as expected with all sglists. A new iterator type is introduced to provide compile-time safety against wrongly mixing accessors and iterators. Signed-off-by: Jason Gunthorpe --- drivers/gpu/drm/vmwgfx/vmwgfx_drv.h| 26 ++-- drivers/gpu/drm/vmwgfx/vmwgfx_mob.c| 26 +++- drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c | 42 +-- drivers/media/pci/intel/ipu3/ipu3-cio2.c | 4 +- include/linux/scatterlist.h| 49 ++ lib/scatterlist.c | 26 6 files changed, 134 insertions(+), 39 deletions(-) I'd like to run this patch through the RDMA tree as we have another series in the works that wants to use the for_each_sg_dma_page() API. The changes to vmwgfx make me nervous, it would be great if someone could test and ack them? 
Changes since the RFC: - Rework vmwgfx too [CH] - Use a distinct type for the DMA page iterator [CH] - Do not have a #ifdef [CH] Thanks, Jason diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h index 59f614225bcd72..3c6d71e13a9342 100644 --- a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h +++ b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h @@ -297,7 +297,10 @@ struct vmw_sg_table { struct vmw_piter { struct page **pages; const dma_addr_t *addrs; - struct sg_page_iter iter; + union { + struct sg_page_iter iter; + struct sg_dma_page_iter dma_iter; + }; unsigned long i; unsigned long num_pages; bool (*next)(struct vmw_piter *); @@ -869,9 +872,24 @@ extern int vmw_bo_map_dma(struct ttm_buffer_object *bo); extern void vmw_bo_unmap_dma(struct ttm_buffer_object *bo); extern const struct vmw_sg_table * vmw_bo_sg_table(struct ttm_buffer_object *bo); -extern void vmw_piter_start(struct vmw_piter *viter, - const struct vmw_sg_table *vsgt, - unsigned long p_offs); +void _vmw_piter_start(struct vmw_piter *viter, const struct vmw_sg_table *vsgt, + unsigned long p_offs, bool for_dma); + +/* Create a piter that can call vmw_piter_dma_addr() */ +static inline void vmw_piter_start(struct vmw_piter *viter, + const struct vmw_sg_table *vsgt, + unsigned long p_offs) +{ + _vmw_piter_start(viter, vsgt, p_offs, true); +} + +/* Create a piter that can call vmw_piter_page() */ +static inline void vmw_piter_cpu_start(struct vmw_piter *viter, + const struct vmw_sg_table *vsgt, + unsigned long p_offs) +{ + _vmw_piter_start(viter, vsgt, p_offs, false); +} /** * vmw_piter_next - Advance the iterator one page. 
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_mob.c b/drivers/gpu/drm/vmwgfx/vmwgfx_mob.c index 7ed179d30ec51f..a13788017ad608 100644 --- a/drivers/gpu/drm/vmwgfx/vmwgfx_mob.c +++ b/drivers/gpu/drm/vmwgfx/vmwgfx_mob.c @@ -503,7 +503,8 @@ static void vmw_mob_assign_ppn(u32 **addr, dma_addr_t val) */ static unsigned long vmw_mob_build_pt(struct vmw_piter *data_iter, unsigned long num_data_pages, - struct vmw_piter *pt_iter) + struct vmw_piter *pt_iter_cpu, + struct vmw_piter *pt_iter_dma) { unsigned long pt_size = num_data_pages * VMW_PPN_SIZE; unsigned long num_pt_pages = DIV_ROUND_UP(pt_size, PAGE_SIZE); @@ -513,7 +514,7 @@ static unsigned long vmw_mob_build_pt(struct vmw_piter *data_iter, struct page *page; for (pt_page = 0; pt_page < num_pt_pages; ++pt_page) { - page = vmw_piter_page(pt_iter); + page = vmw_piter_page(pt_iter_cpu); save_addr = addr = kmap_atomic(page); @@ -525,7 +526,8 @@ static unsigned long vmw_mob_build_pt(struct vmw_piter *data_iter, WARN_ON(!vmw_piter_next(data_iter)); } kunmap_atomic(save_addr); - vmw_piter_next(pt_iter); + vmw_piter_next(pt_iter_cpu); + vmw_piter_next(pt_iter_dma); } return num_pt_pages; @@ -547,29 +549,31 @@ static void vmw_mob_pt_setup(struct vmw_mob *mob, { unsigned long num_pt_pages = 0; struct ttm_buffer_object *bo = mob->pt_bo; - struct vmw_piter save_pt_iter; - str
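The new iterator is easiest to understand from the caller's side. A minimal kernel-context sketch of the intended call pattern (not standalone-buildable; names follow the API introduced by the patch above, with 'sgl' and 'dma_nents' — the count returned by dma_map_sg() — assumed to come from the caller):

```c
/* Walk the DMA addresses of a mapped sglist using the distinct
 * iterator type, which cannot be mixed up with the CPU page iterator
 * at compile time. */
struct sg_dma_page_iter dma_iter;

for_each_sg_dma_page(sgl, &dma_iter, dma_nents, 0) {
	dma_addr_t dma = sg_page_iter_dma_address(&dma_iter);

	/* program 'dma' into the device's page tables, etc. */
}
```

Because the loop advances by sg_dma_len() rather than the CPU-side length, sg_page_iter_dma_address() now returns correct addresses even when an IOMMU has coalesced entries.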
Re: [PATCH -next] IB/qib: Add missing err handle for qib_user_sdma_rb_insert
On Fri, Jan 04, 2019 at 06:39:50PM +, Marciniszyn, Mike wrote: > > diff --git a/drivers/infiniband/hw/qib/qib_user_sdma.c > > b/drivers/infiniband/hw/qib/qib_user_sdma.c > > index 31c523b..e87c0a7 100644 > > +++ b/drivers/infiniband/hw/qib/qib_user_sdma.c > > @@ -237,6 +237,8 @@ qib_user_sdma_queue_create(struct device *dev, > > int unit, int ctxt, int sctxt) > > > > ret = qib_user_sdma_rb_insert(&qib_user_sdma_rb_root, > > sdma_rb_node); > > + if (ret == 0) > > + goto err_rb; > > } > > pq->sdma_rb_node = sdma_rb_node; > > > > Thanks! > > This patch is ok. > The NULL returned from this routine will result in an error return up to PSM > which will fail. > > The bail branch will do the necessary cleanup, including freeing the > node that couldn't be added. How? sdma_rb_node is a stack variable that is not accessed during cleanup? Jason
Re: [PATCH] lib/scatterlist: Provide a DMA page iterator
On Sat, Jan 05, 2019 at 10:37:17AM +0800, kbuild test robot wrote: > Hi Jason, > > I love your patch! Yet something to improve: > > [auto build test ERROR on linus/master] > [also build test ERROR on v4.20 next-20190103] > [if your patch is applied to the wrong git tree, please drop us a note to > help improve the system] > > url: > https://github.com/0day-ci/linux/commits/Jason-Gunthorpe/lib-scatterlist-Provide-a-DMA-page-iterator/20190105-081739 > config: x86_64-randconfig-x017-201900 (attached as .config) > compiler: gcc-7 (Debian 7.3.0-1) 7.3.0 > reproduce: > # save the attached .config to linux build tree > make ARCH=x86_64 > > All error/warnings (new ones prefixed by >>): > >In file included from lib/scatterlist.c:9:0: > >> include/linux/export.h:81:20: error: redefinition of > >> '__kstrtab___sg_page_iter_next' > static const char __kstrtab_##sym[]\ >^ >include/linux/export.h:120:25: note: in expansion of macro > '___EXPORT_SYMBOL' > #define __EXPORT_SYMBOL ___EXPORT_SYMBOL > ^~~~ >include/linux/export.h:124:2: note: in expansion of macro '__EXPORT_SYMBOL' > __EXPORT_SYMBOL(sym, "") > ^~~ > >> lib/scatterlist.c:652:1: note: in expansion of macro 'EXPORT_SYMBOL' > EXPORT_SYMBOL(__sg_page_iter_next); > ^ Woops, should be __sg_page_dma_iter_next.. Will resend after getting some feedback. Jason
Re: [PATCH] PCI: Add no-D3 quirk for Mellanox ConnectX-[45]
On Fri, Jan 04, 2019 at 02:44:01PM +1100, David Gibson wrote: > On Thu, Dec 06, 2018 at 08:45:09AM +0200, Leon Romanovsky wrote: > > On Thu, Dec 06, 2018 at 03:19:51PM +1100, David Gibson wrote: > > > Mellanox ConnectX-5 IB cards (MT27800) seem to cause a call trace when > > > unbound from their regular driver and attached to vfio-pci in order to > > > pass > > > them through to a guest. > > > > > > This goes away if the disable_idle_d3 option is used, so it looks like a > > > problem with the hardware handling D3 state. To fix that more > > > permanently, > > > use a device quirk to disable D3 state for these devices. > > > > > > We do this by renaming the existing quirk_no_ata_d3() more generally and > > > attaching it to the ConnectX-[45] devices (0x15b3:0x1013). > > > > > > Signed-off-by: David Gibson > > > drivers/pci/quirks.c | 17 +++-- > > > 1 file changed, 11 insertions(+), 6 deletions(-) > > > > > > > Hi David, > > > > Thank for your patch, > > > > I would like to reproduce the calltrace before moving forward, > > but have trouble to reproduce the original issue. > > > > I'm working with vfio-pci and CX-4/5 cards on daily basis, > > tried manually enter into D3 state now, and it worked for me. > > Interesting. I've investigated this further, though I don't have as > many new clues as I'd like. The problem occurs reliably, at least on > one particular type of machine (a POWER8 "Garrison" with ConnectX-4). > I don't yet know if it occurs with other machines, I'm having trouble > getting access to other machines with a suitable card. I didn't > manage to reproduce it on a different POWER8 machine with a > ConnectX-5, but I don't know if it's the difference in machine or > difference in card revision that's important. Make sure the card has the latest firmware is always good advice.. > So possibilities that occur to me: > * It's something specific about how the vfio-pci driver uses D3 > state - have you tried rebinding your device to vfio-pci? 
> * It's something specific about POWER, either the kernel or the PCI > bridge hardware > * It's something specific about this particular type of machine Does the EEH indicate what happend to actually trigger it? Jason
[PATCH] phy: ti: Fix compilation failures without REGMAP
This driver requires regmap or the compile fails: drivers/phy/ti/phy-gmii-sel.c:43:27: error: array type has incomplete element type ‘struct reg_field’ const struct reg_field (*regfields)[PHY_GMII_SEL_LAST]; Add it to kconfig. Signed-off-by: Jason Gunthorpe --- drivers/phy/ti/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/phy/ti/Kconfig b/drivers/phy/ti/Kconfig index f137e010776463..c4709ed7fb0e72 100644 --- a/drivers/phy/ti/Kconfig +++ b/drivers/phy/ti/Kconfig @@ -82,6 +82,7 @@ config PHY_TI_GMII_SEL default y if TI_CPSW=y depends on TI_CPSW || COMPILE_TEST select GENERIC_PHY + select REGMAP default m help This driver supports configuring of the TI CPSW Port mode depending on -- 2.20.1
[PATCH] phy: ti: Fix compilation failures without REGMAP
This driver requires regmap or the compile fails: drivers/phy/ti/phy-gmii-sel.c:43:27: error: array type has incomplete element type ‘struct reg_field’ const struct reg_field (*regfields)[PHY_GMII_SEL_LAST]; Add it to kconfig. Signed-off-by: Jason Gunthorpe --- drivers/phy/ti/Kconfig | 1 + 1 file changed, 1 insertion(+) .. and this is why I'm even compiling this thing when I'm not doing anything with CPSW .. diff --git a/drivers/phy/ti/Kconfig b/drivers/phy/ti/Kconfig index f137e010776463..c4709ed7fb0e72 100644 --- a/drivers/phy/ti/Kconfig +++ b/drivers/phy/ti/Kconfig @@ -82,6 +82,7 @@ config PHY_TI_GMII_SEL default y if TI_CPSW=y depends on TI_CPSW || COMPILE_TEST select GENERIC_PHY + select REGMAP default m help This driver supports configuring of the TI CPSW Port mode depending on -- 2.20.1
Re: [PATCH] phy: ti: Fix compilation failures without REGMAP
On Mon, Jan 07, 2019 at 07:13:00PM +0100, Andrew Lunn wrote: > On Mon, Jan 07, 2019 at 05:48:00PM +0000, Jason Gunthorpe wrote: > > This driver requires regmap or the compile fails: > > > > drivers/phy/ti/phy-gmii-sel.c:43:27: error: array type has incomplete > > element type ‘struct reg_field’ > > const struct reg_field (*regfields)[PHY_GMII_SEL_LAST]; > > > > Add it to kconfig. > > Hi Jason > > It looks like more of the TI drivers need REGMAP, but don't select it. > I assume you are using COMPILE_TEST, since TI_CPSW selects REGMAP? Yes, this causes COMPILE_TEST kernels to fail. Jason
Re: [PATCH] IB/cm: Use struct_size() in kmalloc()
On Mon, Jan 07, 2019 at 11:26:16AM -0600, Gustavo A. R. Silva wrote: > One of the more common cases of allocation size calculations is finding > the size of a structure that has a zero-sized array at the end, along > with memory for some number of elements for that array. For example: > > struct foo { > int stuff; > void *entry[]; > }; > > instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL); > > Instead of leaving these open-coded and prone to type mistakes, we can > now use the new struct_size() helper: > > instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL); > > This code was detected with the help of Coccinelle. > > Signed-off-by: Gustavo A. R. Silva > --- > drivers/infiniband/core/cm.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) Applied to for-next thanks Jason
Re: [PATCH] IB/usnic: Use struct_size() in kmalloc()
On Mon, Jan 07, 2019 at 11:39:54AM -0600, Gustavo A. R. Silva wrote: > One of the more common cases of allocation size calculations is finding > the size of a structure that has a zero-sized array at the end, along > with memory for some number of elements for that array. For example: > > struct foo { > int stuff; > void *entry[]; > }; > > instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL); > > Instead of leaving these open-coded and prone to type mistakes, we can > now use the new struct_size() helper: > > instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL); > > This code was detected with the help of Coccinelle. > > Signed-off-by: Gustavo A. R. Silva > --- > drivers/infiniband/hw/usnic/usnic_uiom.c | 5 ++--- > 1 file changed, 2 insertions(+), 3 deletions(-) Applied to for-next, thanks Jason
Re: [PATCH] IB/core: Use struct_size() in kzalloc()
On Fri, Jan 04, 2019 at 12:15:58PM -0600, Gustavo A. R. Silva wrote: > One of the more common cases of allocation size calculations is finding the > size of a structure that has a zero-sized array at the end, along with memory > for some number of elements for that array. For example: > > struct foo { > int stuff; > void *entry[]; > }; > > instance = kzalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL); > > Instead of leaving these open-coded and prone to type mistakes, we can now > use the new struct_size() helper: > > instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL); > > This code was detected with the help of Coccinelle. > > Signed-off-by: Gustavo A. R. Silva > --- > drivers/infiniband/core/sa_query.c | 4 +--- > 1 file changed, 1 insertion(+), 3 deletions(-) Applied to for-next, thanks Jason
[PATCH v3] coding-style: Clarify the expectations around bool
There has been some confusion since checkpatch started warning about bool use in structures, and people have been avoiding using it. Many people feel there is still a legitimate place for bool in structures, so provide some guidance on bool usage derived from the entire thread that spawned the checkpatch warning. Link: https://lkml.kernel.org/r/ca+55afwvzk1ofb9t2v014ptakfhtvan_zj2dojncy3x6e4u...@mail.gmail.com Signed-off-by: Joe Perches Signed-off-by: Jason Gunthorpe --- Documentation/process/coding-style.rst | 38 +++--- 1 file changed, 34 insertions(+), 4 deletions(-) There hasn't been any negative feedback on the first thread that spawned this discussion, so here is a cleaned up patch, ready to be applied. Acks? v2: - Revise Joe's original suggestion with background and more details v3: - s/C11/C99/ [Al Viro] - Add some remarks on the use of bool in function arguments [GalP] I'm not sure what tree this should go through - Jonathan would you pick it up? Thanks, Jason diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst index b78dd680c03809..cbe6b01b05fa66 100644 --- a/Documentation/process/coding-style.rst +++ b/Documentation/process/coding-style.rst @@ -921,7 +921,37 @@ result. Typical examples would be functions that return pointers; they use NULL or the ERR_PTR mechanism to report failure. -17) Don't re-invent the kernel macros +17) Using bool +-- + +The Linux kernel uses the C99 standard for the bool type. bool values can only +evaluate to 0 or 1, and implicit or explicit conversion to bool automatically +converts the value to true or false. When using bool types the !! construction +is not needed, which eliminates a class of bugs. +
+When working with bool values the true and false labels should be used instead +of 0 and 1. +
+bool function return types and stack variables are always fine to use whenever +appropriate. 
Use of bool is encouraged to improve readability and is often a +better option than 'int' for storing boolean values. + +Do not use bool if cache line layout or size of the value matters, its size +and alignment vary based on the compiled architecture. Structures that are +optimized for alignment and size should not use bool. + +If a structure has many true/false values, consider consolidating them into a +bitfield with 1 bit members, or using an appropriate fixed width type, such as +u8. + +Similarly for function arguments, many true/false values can be consolidated +into a single bitwise 'flags' argument and 'flags' can often be a more readable +alternative if the call-sites have naked true/false constants. + +Otherwise limited use of bool in structures and arguments can improve +readability. + +18) Don't re-invent the kernel macros - The header file include/linux/kernel.h contains a number of macros that @@ -944,7 +974,7 @@ need them. Feel free to peruse that header file to see what else is already defined that you shouldn't reproduce in your code. -18) Editor modelines and other cruft +19) Editor modelines and other cruft Some editors can interpret configuration information embedded in source files, @@ -978,7 +1008,7 @@ own custom mode, or may have some other magic method for making indentation work correctly. -19) Inline assembly +20) Inline assembly --- In architecture-specific code, you may need to use inline assembly to interface @@ -1010,7 +1040,7 @@ the next instruction in the assembly output: : /* outputs */ : /* inputs */ : /* clobbers */); -20) Conditional Compilation +21) Conditional Compilation --- Wherever possible, don't use preprocessor conditionals (#if, #ifdef) in .c -- 2.20.1
Re: [PATCH v3] coding-style: Clarify the expectations around bool
On Mon, Jan 07, 2019 at 02:10:22PM -0800, Joe Perches wrote: > On Mon, 2019-01-07 at 14:11 -0700, Jason Gunthorpe wrote: > > There has been some confusion since checkpatch started warning about bool > > use in structures, and people have been avoiding using it. > > > > Many people feel there is still a legitimate place for bool in structures, > > so provide some guidance on bool usage derived from the entire thread that > > spawned the checkpatch warning. > > Thanks Jason. > > It'd be nice to combine this with some better checkpatch warning or > even a removal of that misleading warning from checkpatch > altogether. Okay, do you have a preference? > With a couple minor nits below and and Ack if > you want one: > > Acked-by: Joe Perches Thanks, I made the revisions with Bart's extra words. I'll send a v4 in a few days. Jason
Re: [PATCH v3] coding-style: Clarify the expectations around bool
On Mon, Jan 07, 2019 at 04:38:50PM -0800, Joe Perches wrote: > On Mon, 2019-01-07 at 16:25 -0700, Jason Gunthorpe wrote: > > On Mon, Jan 07, 2019 at 02:10:22PM -0800, Joe Perches wrote: > > > On Mon, 2019-01-07 at 14:11 -0700, Jason Gunthorpe wrote: > > > > There has been some confusion since checkpatch started warning about > > > > bool > > > > use in structures, and people have been avoiding using it. > > > > > > > > Many people feel there is still a legitimate place for bool in > > > > structures, > > > > so provide some guidance on bool usage derived from the entire thread > > > > that > > > > spawned the checkpatch warning. > > > > > > Thanks Jason. > > > > > > It'd be nice to combine this with some better checkpatch warning or > > > even a removal of that misleading warning from checkpatch > > > altogether. > > > > Okay, do you have a preference? > > Yes. Great, done, thanks Jason
Re: [PATCH] PCI: Add no-D3 quirk for Mellanox ConnectX-[45]
On Sun, Jan 06, 2019 at 09:43:46AM +1100, Benjamin Herrenschmidt wrote: > On Sat, 2019-01-05 at 10:51 -0700, Jason Gunthorpe wrote: > > > > > Interesting. I've investigated this further, though I don't have as > > > many new clues as I'd like. The problem occurs reliably, at least on > > > one particular type of machine (a POWER8 "Garrison" with ConnectX-4). > > > I don't yet know if it occurs with other machines, I'm having trouble > > > getting access to other machines with a suitable card. I didn't > > > manage to reproduce it on a different POWER8 machine with a > > > ConnectX-5, but I don't know if it's the difference in machine or > > > difference in card revision that's important. > > > > Make sure the card has the latest firmware is always good advice.. > > > > > So possibilities that occur to me: > > > * It's something specific about how the vfio-pci driver uses D3 > > > state - have you tried rebinding your device to vfio-pci? > > > * It's something specific about POWER, either the kernel or the PCI > > > bridge hardware > > > * It's something specific about this particular type of machine > > > > Does the EEH indicate what happened to actually trigger it? > > In a very cryptic way that requires manual parsing using non-public > docs sadly but yes. From the look of it, it's a completion timeout. > > Looks to me like we don't get a response to a config space access > during the change of D state. I don't know if it's the write of the D3 > state itself or the read back though (it's probably detected on the > read back or a subsequent read, but that doesn't tell me which specific > one failed). If it is just one card doing it (again, check you have latest firmware) I wonder if it is a sketchy PCI-E electrical link that is causing a long re-training cycle? Can you tell if the PCI-E link is permanently gone or does it eventually return? Does the card work in Gen 3 when it starts? Is there any indication of PCI-E link errors? Every time or sometimes? 
POWER 8 firmware is good? If the link does eventually come back, is the POWER8's D3 resumption timeout long enough? If this doesn't lead to an obvious conclusion you'll probably need to connect to IBM's Mellanox support team to get more information from the card side. Jason
Re: drivers/infiniband/core/.tmp_gl_uverbs_main.o:undefined reference to `__user_bad'
On Wed, Jan 02, 2019 at 02:33:41AM +0800, kbuild test robot wrote: > Hi Jason, > > FYI, the error/warning still remains. Like binder, DRM, etc, uverbs requires architectures to support 8 byte get_user/put_user. For some reason microblaze only supports 8 byte for put_user, so it fails to compile. I don't see a reason to change uverbs here, I think microblaze should be fixed, then it would fix all the cases below. Jason > tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git > master > head: e1ef035d272ef4dbfdda98e58699698305138856 > commit: 3a6532c9af1a7836da2d597f1aaca73cb16c3b97 RDMA/uverbs: Use > uverbs_attr_bundle to pass udata for write > date: 5 weeks ago > config: microblaze-allyesconfig (attached as .config) > compiler: microblaze-linux-gcc (GCC) 8.1.0 > reproduce: > wget > https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O > ~/bin/make.cross > chmod +x ~/bin/make.cross > git checkout 3a6532c9af1a7836da2d597f1aaca73cb16c3b97 > # save the attached .config to linux build tree > GCC_VERSION=8.1.0 make.cross ARCH=microblaze > > All errors (new ones prefixed by >>): > >`.exit.data' referenced in section `.exit.text' of drivers/tty/n_hdlc.o: > defined in discarded section `.exit.data' of drivers/tty/n_hdlc.o >`.exit.data' referenced in section `.exit.text' of drivers/tty/n_hdlc.o: > defined in discarded section `.exit.data' of drivers/tty/n_hdlc.o >drivers/infiniband/core/uverbs_main.o: In function `ib_uverbs_write': > >> drivers/infiniband/core/.tmp_gl_uverbs_main.o:(.text+0x13a4): undefined > >> reference to `__user_bad' >drivers/android/binder.o: In function `binder_thread_write': >drivers/android/.tmp_gl_binder.o:(.text+0xda6c): undefined reference to > `__user_bad' >drivers/android/.tmp_gl_binder.o:(.text+0xda98): undefined reference to > `__user_bad' >drivers/android/.tmp_gl_binder.o:(.text+0xdf10): undefined reference to > `__user_bad' >drivers/android/.tmp_gl_binder.o:(.text+0xe498): undefined reference to > 
`__user_bad' >drivers/android/binder.o:drivers/android/.tmp_gl_binder.o:(.text+0xea78): > more undefined references to `__user_bad' follow > > 0-DAY kernel test infrastructureOpen Source Technology Center > https://lists.01.org/pipermail/kbuild-all Intel Corporation
Re: [PATCH -next] IB/qib: Add missing err handle for qib_user_sdma_rb_insert
On Fri, Dec 21, 2018 at 10:19:38AM +0800, YueHaibing wrote: > It should go to the error handler if qib_user_sdma_rb_insert fails, > rather than return success. > > Fixes: 67810e8c3c01 ("RDMA/qib: Remove all occurrences of BUG_ON()") > Signed-off-by: YueHaibing > --- > drivers/infiniband/hw/qib/qib_user_sdma.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/drivers/infiniband/hw/qib/qib_user_sdma.c > b/drivers/infiniband/hw/qib/qib_user_sdma.c > index 31c523b..e87c0a7 100644 > --- a/drivers/infiniband/hw/qib/qib_user_sdma.c > +++ b/drivers/infiniband/hw/qib/qib_user_sdma.c > @@ -237,6 +237,8 @@ qib_user_sdma_queue_create(struct device *dev, int unit, > int ctxt, int sctxt) > > ret = qib_user_sdma_rb_insert(&qib_user_sdma_rb_root, > sdma_rb_node); > + if (ret == 0) > + goto err_rb; > } This doesn't look right, what about undoing the kmalloc directly above? Jason
Re: [PATCH -next] IB/qib: Add missing err handle for qib_user_sdma_rb_insert
On Wed, Jan 02, 2019 at 08:40:50PM +0200, Leon Romanovsky wrote: > On Wed, Jan 02, 2019 at 10:12:24AM -0700, Jason Gunthorpe wrote: > > On Fri, Dec 21, 2018 at 10:19:38AM +0800, YueHaibing wrote: > > > It should go to the error handler if qib_user_sdma_rb_insert fails, > > > rather than return success. > > > > > > Fixes: 67810e8c3c01 ("RDMA/qib: Remove all occurrences of BUG_ON()") > > > Signed-off-by: YueHaibing > > > drivers/infiniband/hw/qib/qib_user_sdma.c | 2 ++ > > > 1 file changed, 2 insertions(+) > > > > > > diff --git a/drivers/infiniband/hw/qib/qib_user_sdma.c > > > b/drivers/infiniband/hw/qib/qib_user_sdma.c > > > index 31c523b..e87c0a7 100644 > > > +++ b/drivers/infiniband/hw/qib/qib_user_sdma.c > > > @@ -237,6 +237,8 @@ qib_user_sdma_queue_create(struct device *dev, int > > > unit, int ctxt, int sctxt) > > > > > > ret = qib_user_sdma_rb_insert(&qib_user_sdma_rb_root, > > > sdma_rb_node); > > > + if (ret == 0) > > > + goto err_rb; > > > } > > > > This doesn't look right, what about undoing the kmalloc directly > > above? > > Back then, I came to the conclusion that qib_user_sdma_rb_insert() never > returns 0. Otherwise, Dennis would see that BUG_ON() for a long time > ago. It fails if the index is in the RB tree, which would indicate a programming bug. The way to replace the BUG_ON is something like: if (WARN_ON(ret == 0)) { kfree(sdma_rb_node); goto err_rb; } Jason
Re: [PATCH 06/20] IB/ipoib: drop useless LIST_HEAD
On Sun, Dec 23, 2018 at 09:57:01AM +0100, Julia Lawall wrote: > Drop LIST_HEAD where the variable it declares is never used. > > Commit 31c02e215700 ("IPoIB: Avoid using stale last_send counter > when reaping AHs") removed the uses, but not the declaration. > > The semantic patch that fixes this problem is as follows: > (http://coccinelle.lip6.fr/) > > // > @@ > identifier x; > @@ > - LIST_HEAD(x); > ... when != x > // > > Fixes: 31c02e215700 ("IPoIB: Avoid using stale last_send counter when reaping > AHs") > Signed-off-by: Julia Lawall > Reviewed-by: Leon Romanovsky > --- > Successfully 0-day tested on 151 configurations. > > drivers/infiniband/ulp/ipoib/ipoib_ib.c |1 - > 1 file changed, 1 deletion(-) Applied to rdma for-next thanks Jason
Re: [PATCH] infiniband: bnxt_re: qplib: Check the return value of send_message
On Wed, Dec 26, 2018 at 12:56:22PM -0600, Aditya Pakki wrote: > In bnxt_qplib_map_tc2cos(), bnxt_qplib_rcfw_send_message() can return > an error value. The fix returns the error from the latter function > upstream. > > Signed-off-by: Aditya Pakki > Acked-By: Devesh Sharma > --- > drivers/infiniband/hw/bnxt_re/qplib_sp.c | 5 ++--- > 1 file changed, 2 insertions(+), 3 deletions(-) Applied to for-next Jason
Re: [PATCH] infiniband/qedr: Potential null ptr dereference of qp
On Mon, Dec 24, 2018 at 12:24:45PM -0600, Aditya Pakki wrote: > idr_find() may fail and return a NULL pointer. The fix checks the > return value of the function and returns an error in case of NULL. > > Signed-off-by: Aditya Pakki > Acked-by: Michal Kalderon > --- > drivers/infiniband/hw/qedr/qedr_iw_cm.c | 2 ++ > 1 file changed, 2 insertions(+) Applied to for-rc thanks Jason
[GIT PULL] Please pull RDMA subsystem changes
Hi Linus, Second merge window pull request This could also be applied after rc1. It is a small collection of fixes posted over the holiday break. The merge error with advise_mr is worth fixing right away so the other, more minor, things are coming along for the ride. The following changes since commit 5d24ae67a961c51beb255a28c9c417d9710247c2: Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma (2018-12-28 14:57:10 -0800) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git tags/for-linus for you to fetch changes up to 9c6260de505b63638dd86fcc33849b17f6146d94: infiniband/qedr: Potential null ptr dereference of qp (2019-01-02 16:08:38 -0700) 4.21 merge window 2nd pull request Over the break a few defects were found, so this is a -rc style pull request of various small things that have been posted. - An attempt to shorten RCU grace period driven delays showed crashes during heavier testing, and has been entirely reverted - A missed merge/rebase error between the advise_mr and ib_device_ops series - Some small static analysis driven fixes from Julia and Aditya - Missed ability to create a XRC_INI in the devx verbs interop series Aditya Pakki (2): infiniband: bnxt_re: qplib: Check the return value of send_message infiniband/qedr: Potential null ptr dereference of qp Julia Lawall (1): IB/ipoib: drop useless LIST_HEAD Leon Romanovsky (1): Revert "IB/mlx5: Fix long EEH recover time with NVMe offloads" Moni Shoua (1): IB/core: Add advise_mr to the list of known ops Yishai Hadas (1): IB/mlx5: Allow XRC INI usage via verbs in DEVX context drivers/infiniband/core/device.c | 1 + drivers/infiniband/hw/bnxt_re/qplib_sp.c | 5 ++--- drivers/infiniband/hw/mlx5/mr.c | 19 +++ drivers/infiniband/hw/mlx5/qp.c | 3 ++- drivers/infiniband/hw/qedr/qedr_iw_cm.c | 2 ++ drivers/infiniband/ulp/ipoib/ipoib_ib.c | 1 - 6 files changed, 10 insertions(+), 21 deletions(-) signature.asc Description: PGP signature
Re: rfc: bool structure members (was Re: [PATCH V3] net/mlx4: Get rid of page operation after dma_alloc_coherent)
On Sun, Dec 23, 2018 at 04:53:12PM +, Al Viro wrote: > On Sun, Dec 23, 2018 at 06:42:20PM +0200, Gal Pressman wrote: > > > -17) Don't re-invent the kernel macros > > > +17) Using bool > > > +-- > > > + > > > +The Linux kernel uses the C11 standard for the bool type. bool values > > > can only > > C99, surely? Oops, yes, that is right Jason
[GIT PULL] Please pull RDMA subsystem changes
IB/uverbs: Signedness bug in UVERBS_HANDLER() RDMA/mlx5: Signedness bug in UVERBS_HANDLER() Daniel Jurgens (1): net/mlx5: When fetching CQEs return CQE instead of void pointer Danit Goldberg (3): net/mlx5: Expose packet based credit mode IB/mlx5: Add packet based credit mode support IB/mlx5: Report packet based credit mode device capability Devesh Sharma (1): RDMA/bnxt_re: Increase depth of control path command queue Doug Ledford (4): Merge branch 'write-handler-consistent-flow' into for-next Merge 'mlx5-next' into mlx5-devx Merge branch 'mlx5-devx' into wip/dl-for-next IB/mlx5: Fix compile issue when ODP disabled Eli Britstein (3): net/mlx5: Introduce extended destination fields net/mlx5: E-Switch, Change vhca id valid bool field to bit flag net/mlx5: Support extended destination format in flow steering command Eyal Davidovich (1): net/mlx5: Add monitor commands layout and event data Gal Pressman (7): net/mlx5: Fix offsets of ifc reserved fields RDMA/vmw_pvrdma: Use atomic memory allocation in create AH RDMA/mlx5: Fix function name typo 'fileds' -> 'fields' RDMA: Mark if create address handle is in a sleepable context RDMA: Mark if destroy address handle is in a sleepable context RDMA/bnxt_re: Make use of create AH sleepable flag RDMA/bnxt_re: Make use of destroy AH sleepable flag Guy Levi (2): IB/mlx5: Use fragmented SRQ's buffer for in-kernel users IB/mlx5: Use fragmented QP's buffer for in-kernel users Huy Nguyen (1): IB/mlx5: Fix long EEH recover time with NVMe offloads Jason Gunthorpe (38): Merge branch 'mlx5-next' into rdma.git RDMA/mlx5: Do not generate the uabi specs unconditionally RDMA/uverbs: Use a linear list to describe the compiled-in uapi RDMA/uverbs: Factor out the add/get pattern into a helper RDMA/uverbs: Add helpers to mark uapi functions as unsupported RDMA/mlx5: Use the uapi disablement APIs instead of code RDMA/uverbs: Require all objects to have a driver destroy function RDMA/verbs: Store the write/write_ex uapi entry points in the 
uverbs_api RDMA/uverbs: Convert the write interface to use uverbs_api RDMA/uverbs: Make all the method functions in uverbs_cmd static RDMA/uverbs: Check for NULL driver methods for every write call RDMA/uverbs: Add missing driver_data RDMA/uverbs: Replace ib_uverbs_file with uverbs_attr_bundle for write RDMA/uverbs: Make write() handlers return 0 on success RDMA/uverbs: Do not pass ib_uverbs_file to ioctl methods RDMA/uverbs: Add structure size info to write commands RDMA/uverbs: Prohibit write() calls with too small buffers RDMA/uverbs: Use uverbs_attr_bundle to pass udata for write_ex RDMA/uverbs: Use uverbs_attr_bundle to pass udata for write RDMA/uverbs: Use uverbs_attr_bundle to pass udata for ioctl() RDMA/uverbs: Remove out_len checks that are now done by the core RDMA/uverbs: Use uverbs_attr_bundle to pass ucore for write/write_ex RDMA/uverbs: Get rid of the 'callback' scheme in the compat path RDMA/uverbs: Use uverbs_response() for remaining response copying RDMA/uverbs: Use uverbs_request() for request copying RDMA/uverbs: Use uverbs_request() and core for write_ex handlers RDMA/uverbs: Fill in the response for IB_USER_VERBS_EX_CMD_MODIFY_QP RDMA/uverbs: Simplify ib_uverbs_ex_query_device RDMA/uverbs: Add a simple iterator interface for reading the command RDMA/uverbs: Use the iterator for ib_uverbs_unmarshall_recv() RDMA/uverbs: Do not check the input length on create_cq/qp paths RDMA/uverbs: Use only attrs for the write() handler signature Merge branch 'mlx5-packet-credit-fc' into rdma.git Merge tag 'v4.20-rc6' into rdma.git for-next RDMA/uverbs: Implement an ioctl that can call write and write_ex handlers RDMA/uverbs: Add a method to introspect handles in a context Merge branch 'mlx5-next' into rdma.git IB/mlx5: Fix wrong error unwind Kaike Wan (4): IB/hfi1: Dump pio info for non-user send contexts IB/hfi1: Ignore LNI errors before DC8051 transitions to Polling state IB/hfi1: Unreserve a reserved request when it is completed IB/hfi1: Add OPFN and TID 
RDMA capability bits Kamal Heib (20): RDMA/core: Introduce ib_device_ops RDMA/bnxt_re: Initialize ib_device_ops struct RDMA/cxgb3: Initialize ib_device_ops struct RDMA/cxgb4: Initialize ib_device_ops struct RDMA/hfi1: Initialize ib_device_ops struct RDMA/hns: Initialize ib_device_ops struct RDMA/i40iw: Initialize ib_device_ops struct RDMA/mlx4: Initialize ib_device_ops struct RDMA/mlx5: Initialize ib_device_ops struct RDMA
Re: rfc: bool structure members (was Re: [PATCH V3] net/mlx4: Get rid of page operation after dma_alloc_coherent)
On Sun, Dec 23, 2018 at 06:42:20PM +0200, Gal Pressman wrote: > On 22-Dec-18 01:52, Jason Gunthorpe wrote: > > On Thu, Dec 20, 2018 at 09:12:43PM -0800, Joe Perches wrote: > >> Care to submit a coding_style.rst patch or > >> improve the one below this? > > > > I took yours and revised it a little bit. I spent some time looking at > > assembly and decided to drop the performance note, the number of cases > > that run into overhead seems pretty small and probably already > > requires !! to be correct. There is also an equally unlikely gain, ie > > 'if (a & b)' optimizes a tiny bit better for bool types. > > > > I also added a small intro on bool, as I know some people are > > unfamiliar with C11 _Bool and might think bool is just '#define bool > > u8' > > > > (for those added to the cc) I'm looking at cases, like the patch that > > spawned this, where the struct has a single bool and no performance > > considerations. As CH said, seeing that get converted to int due to > > checkpatch is worse than having used bool. Using u8 won't make this > > struct smaller or faster. > > > > Since a "Using bool" section is added, perhaps it's worth documenting the bool > usage as a function parameter [1]? > > [1] https://www.spinics.net/lists/linux-rdma/msg72336.html I'm not really sure how to express that as something concrete.. That specific case clearly called out for a flags as it was a widely used API - maybe less spread out cases like static functions or something are OK?? Jason
Re: [PATCH v3 3/3] infiniband/mm: convert put_page() to put_user_page*()
On Fri, Oct 05, 2018 at 07:49:49PM -0700, john.hubb...@gmail.com wrote: > From: John Hubbard > > For code that retains pages via get_user_pages*(), > release those pages via the new put_user_page(), or > put_user_pages*(), instead of put_page() > > This prepares for eventually fixing the problem described > in [1], and is following a plan listed in [2], [3], [4]. > > [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()" > > [2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubb...@nvidia.com > Proposed steps for fixing get_user_pages() + DMA problems. > > [3]https://lkml.kernel.org/r/20180710082100.mkdwngdv5kkrc...@quack2.suse.cz > Bounce buffers (otherwise [2] is not really viable). > > [4] https://lkml.kernel.org/r/20181003162115.gg24...@quack2.suse.cz > Follow-up discussions. > > CC: Doug Ledford > CC: Jason Gunthorpe > CC: Mike Marciniszyn > CC: Dennis Dalessandro > CC: Christian Benvenuti > > CC: linux-r...@vger.kernel.org > CC: linux-kernel@vger.kernel.org > CC: linux...@kvack.org > Signed-off-by: John Hubbard > --- > drivers/infiniband/core/umem.c | 7 --- > drivers/infiniband/core/umem_odp.c | 2 +- > drivers/infiniband/hw/hfi1/user_pages.c | 11 --- > drivers/infiniband/hw/mthca/mthca_memfree.c | 6 +++--- > drivers/infiniband/hw/qib/qib_user_pages.c | 11 --- > drivers/infiniband/hw/qib/qib_user_sdma.c | 8 > drivers/infiniband/hw/usnic/usnic_uiom.c| 7 --- > 7 files changed, 24 insertions(+), 28 deletions(-) I have no issues with this, do you want this series to go through the rdma tree? Otherwise: Acked-by: Jason Gunthorpe Thanks, Jason
Re: WARNING in __put_task_struct (2)
On Mon, Oct 08, 2018 at 06:15:22PM +0200, Dmitry Vyukov wrote: > On Mon, Oct 8, 2018 at 6:12 PM, syzbot > wrote: > > Hello, > > > > syzbot found the following crash on: > > > > HEAD commit:ae16eea39a86 Add linux-next specific files for 20181008 > > git tree: linux-next > > console output: https://syzkaller.appspot.com/x/log.txt?x=12e7c6a140 > > kernel config: https://syzkaller.appspot.com/x/.config?x=42525ce985d1a955 > > dashboard link: https://syzkaller.appspot.com/bug?extid=1ed9dfaa984c88cf0ef8 > > compiler: gcc (GCC) 8.0.1 20180413 (experimental) > > > > Unfortunately, I don't have any reproducer for this crash yet. > > > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > > Reported-by: syzbot+1ed9dfaa984c88cf0...@syzkaller.appspotmail.com > > > This was triggered by the following program, all is rdma-related, so > +rdma maintainers. > > 10:27:26 executing program 5: > r0 = openat$rdma_cm(0xff9c, > &(0x7f000380)='/dev/infiniband/rdma_cm\x00', 0x2, 0x0) > write$RDMA_USER_CM_CMD_CREATE_ID(r0, &(0x7f80)={0x0, 0x18, > 0xfa00, {0x0, &(0x7f40)={0x}, 0x13f}}, > 0x12d) > write$RDMA_USER_CM_CMD_RESOLVE_IP(r0, &(0x7f000200)={0x3, 0x40, > 0xfa00, {{0xa, 0x4e21}, {0xa, 0x0, 0x0, @mcast1}, r1}}, 0x48) > write$RDMA_USER_CM_CMD_CREATE_ID(r0, &(0x7f000340)={0x0, 0x18, > 0xfa00, {0x0, &(0x7f000300)={0x}, 0x13f}}, > 0x20) > write$RDMA_USER_CM_CMD_SET_OPTION(r0, &(0x7f0002c0)={0xe, 0x18, > 0xfa00, @id_afonly={&(0x7f00)=0x1, r1, 0x0, 0x2, 0x4}}, 0x20) > write$RDMA_USER_CM_CMD_RESOLVE_IP(r0, &(0x7f000100)={0x3, 0x40, > 0xfa00, {{0x2, 0x4e21}, {0xa, 0x0, 0x0, @local}, r2}}, 0x1ff) > write$RDMA_USER_CM_CMD_DESTROY_ID(r0, &(0x7f000180)={0x1, 0x10, > 0xfa00, {&(0x7fc0), r2}}, 0x18) This might be fixed in Linux-next now by this commit: Author: Leon Romanovsky Date: Tue Oct 2 11:48:03 2018 +0300 RDMA/restrack: Release task struct which was hold by CM_ID object Jason
Re: [PATCH v3 3/3] infiniband/mm: convert put_page() to put_user_page*()
On Mon, Oct 08, 2018 at 01:37:35PM -0700, John Hubbard wrote: > On 10/8/18 12:42 PM, Jason Gunthorpe wrote: > > On Fri, Oct 05, 2018 at 07:49:49PM -0700, john.hubb...@gmail.com wrote: > >> From: John Hubbard > [...] > >> drivers/infiniband/core/umem.c | 7 --- > >> drivers/infiniband/core/umem_odp.c | 2 +- > >> drivers/infiniband/hw/hfi1/user_pages.c | 11 --- > >> drivers/infiniband/hw/mthca/mthca_memfree.c | 6 +++--- > >> drivers/infiniband/hw/qib/qib_user_pages.c | 11 --- > >> drivers/infiniband/hw/qib/qib_user_sdma.c | 8 > >> drivers/infiniband/hw/usnic/usnic_uiom.c| 7 --- > >> 7 files changed, 24 insertions(+), 28 deletions(-) > > > > I have no issues with this, do you want this series to go through the > > rdma tree? Otherwise: > > > > Acked-by: Jason Gunthorpe > > > > The RDMA tree seems like a good path for this, yes, glad you suggested > that. > > I'll post a v4 with the comment fix and the recent reviewed-by's, which > should be ready for that. It's based on today's linux.git tree at the > moment, but let me know if I should re-apply it to the RDMA tree. I'm unclear who needs to ack the MM sections for us to take it to RDMA? Otherwise it is no problem.. Jason
Re: [PATCH] tpm: fix unused-value issues in tpm_try_transmit
On Wed, Oct 10, 2018 at 03:38:17PM +0200, Gustavo A. R. Silva wrote: > Currently, there are some values assigned to variable *rc*, which > are never actually used in any computation, because such variable > is updated at line 550, before they can be used: > > 549out: > 550rc = tpm_go_idle(chip, flags); > 551if (rc) > 552goto out; > > Fix this by removing such assignments. Should this be done by not quashing rc during the error unwind rather than dropping the errors? Jason
Re: [PATCH] tpm: tpm_try_transmit() ignore value of go_to_idle()
On Mon, Oct 15, 2018 at 02:14:34PM +0300, Tomas Winkler wrote: > Ignore the return value of go_to_idle() in tpm_try_transmit(). > First, it may shadow the return value of the actual tpm operation; > second, the subsequent command will fail as well and the error > will be caught anyway. > Last, fix a wrong goto that jumped back instead of forward. > > Fixes: 627448e85c76 ("tpm: separate cmd_ready/go_idle from runtime_pm") > Signed-off-by: Tomas Winkler > drivers/char/tpm/tpm-interface.c | 4 +--- > 1 file changed, 1 insertion(+), 3 deletions(-) > > diff --git a/drivers/char/tpm/tpm-interface.c > b/drivers/char/tpm/tpm-interface.c > index 129f640424b7..f69c711bf74a 100644 > +++ b/drivers/char/tpm/tpm-interface.c > @@ -547,9 +547,7 @@ static ssize_t tpm_try_transmit(struct tpm_chip *chip, > dev_err(&chip->dev, "tpm2_commit_space: error %d\n", rc); > > out: > - rc = tpm_go_idle(chip, flags); > - if (rc) > - goto out; > + (void)tpm_go_idle(chip, flags); We don't really use this style in the kernel, AFAIK. Jason
Re: [PATCH v3] IB/rxe: Remove unnecessary enum values
On Wed, Sep 26, 2018 at 10:12:23PM -0700, Nathan Chancellor wrote: > Clang warns when an enumerated type is implicitly converted to another. > > drivers/infiniband/sw/rxe/rxe.c:106:27: warning: implicit conversion > from enumeration type 'enum rxe_device_param' to different enumeration > type 'enum ib_atomic_cap' [-Wenum-conversion] > rxe->attr.atomic_cap= RXE_ATOMIC_CAP; > ~ ^~ > drivers/infiniband/sw/rxe/rxe.c:131:22: warning: implicit conversion > from enumeration type 'enum rxe_port_param' to different enumeration > type 'enum ib_port_state' [-Wenum-conversion] > port->attr.state= RXE_PORT_STATE; > ~ ^~ > drivers/infiniband/sw/rxe/rxe.c:132:24: warning: implicit conversion > from enumeration type 'enum rxe_port_param' to different enumeration > type 'enum ib_mtu' [-Wenum-conversion] > port->attr.max_mtu = RXE_PORT_MAX_MTU; > ~ ^~~~ > drivers/infiniband/sw/rxe/rxe.c:133:27: warning: implicit conversion > from enumeration type 'enum rxe_port_param' to different enumeration > type 'enum ib_mtu' [-Wenum-conversion] > port->attr.active_mtu = RXE_PORT_ACTIVE_MTU; > ~ ^~~ > drivers/infiniband/sw/rxe/rxe.c:151:24: warning: implicit conversion > from enumeration type 'enum rxe_port_param' to different enumeration > type 'enum ib_mtu' [-Wenum-conversion] > ib_mtu_enum_to_int(RXE_PORT_ACTIVE_MTU); > ~~ ^~~ > 5 warnings generated. > > Use the appropriate values from the expected enumerated type so no > conversion needs to happen, then remove the unneeded definitions. > > Reported-by: Nick Desaulniers > Signed-off-by: Nathan Chancellor > Reviewed-by: Nick Desaulniers > --- Applied to for-next Thanks, Jason
Re: [RFC PATCH v4 01/13] ktask: add documentation
On Tue, Nov 06, 2018 at 12:34:11PM -0800, Daniel Jordan wrote: > > What isn't clear is if this calling thread is waiting or not. Only do > > this inheritance trick if it is actually waiting on the work. If it is > > not, nobody cares. > > The calling thread waits. Even if it didn't though, the inheritance trick > would still be desirable for timely completion of the job. Can you make lockdep aware that this is synchronous? ie if I do mutex_lock() ktask_run() mutex_lock() Can lockdep know that all the workers are running under that lock? I'm thinking particularly about rtnl_lock as a possible case, but there could also make some sense to hold the read side of the mm_sem or similar like the above. Jason
Re: [PATCH v4 2/3] mm: introduce put_user_page*(), placeholder versions
On Thu, Oct 11, 2018 at 10:49:29AM +0200, Jan Kara wrote: > > This is a real worry. If someone uses a mistaken put_page() then how > > will that bug manifest at runtime? Under what set of circumstances > > will the kernel trigger the bug? > > At runtime such bug will manifest as a page that can never be evicted from > memory. We could warn in put_page() if page reference count drops below > bare minimum for given user pin count which would be able to catch some > issues but it won't be 100% reliable. So at this point I'm more leaning > towards making get_user_pages() return a different type than just > struct page * to make it much harder for refcount to go wrong... At least for the infiniband code being used as an example here we take the struct page from get_user_pages, then stick it in a sgl, and at put_page time we get the page back out of the sgl via sg_page() So type safety will not help this case... I wonder how many other users are similar? I think this is a pretty reasonable flow for DMA with user pages. Jason
Re: [PATCHv1] RDMA/core: Check error status of rdma_find_ndev_for_src_ip_rcu
On Thu, Oct 04, 2018 at 02:28:54AM +, Parav Pandit wrote: > Hi Doug, Jason, > > > From: Parav Pandit > > Sent: Friday, September 21, 2018 10:00 AM > > To: linux-r...@vger.kernel.org; linux-kernel@vger.kernel.org; > > l...@kernel.org; j...@ziepe.ca; syzkaller-b...@googlegroups.com; Daniel > > Jurgens ; dledf...@redhat.com > > Cc: Parav Pandit > > Subject: [PATCHv1] RDMA/core: Check error status of > > rdma_find_ndev_for_src_ip_rcu > > > > rdma_find_ndev_for_src_ip_rcu() returns either valid netdev pointer or > > ERR_PTR(). > > Instead of checking for NULL, check for error. > > > > Fixes: caf1e3ae9fa6 ("RDMA/core Introduce and use > > rdma_find_ndev_for_src_ip_rcu") > > Reported-by: syzbot+20c32fa6ff84a2d28...@syzkaller.appspotmail.com > > Signed-off-by: Parav Pandit > > drivers/infiniband/core/addr.c | 4 ++-- > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c > > index c2ca9e4..3c07eeb 100644 > > +++ b/drivers/infiniband/core/addr.c > > @@ -513,8 +513,8 @@ static int rdma_set_src_addr_rcu(struct > > rdma_dev_addr *dev_addr, > > * loopback IP address. > > */ > > ndev = rdma_find_ndev_for_src_ip_rcu(dev_net(ndev), > > dst_in); > > - if (!ndev) > > - return -ENODEV; > > + if (IS_ERR(ndev)) > > + return PTR_ERR(ndev); > > } > > > > return copy_src_l2_addr(dev_addr, dst_in, dst, ndev); > > Can you please review this fix? I got below report from syzbot that > it tested the patch and reproducer didn't trigger. It is very strange, but this patch does not show up in rdma's patchworks. This happened to Dennis as well for one patch. I'm afraid that, as a general rule, people will need to check that patchworks has their patches, and maybe talk to LF IT about why things have gone missing. I would guess it is some spam filter issue? I have applied this patch from my email. Jason
Re: [PATCH v2 2/3] mm: introduce put_user_page[s](), placeholder versions
On Thu, Oct 04, 2018 at 09:02:24PM -0700, john.hubb...@gmail.com wrote: > From: John Hubbard > > Introduces put_user_page(), which simply calls put_page(). > This provides a way to update all get_user_pages*() callers, > so that they call put_user_page(), instead of put_page(). > > Also introduces put_user_pages(), and a few dirty/locked variations, > as a replacement for release_pages(), for the same reasons. > These may be used for subsequent performance improvements, > via batching of pages to be released. > > This prepares for eventually fixing the problem described > in [1], and is following a plan listed in [2], [3], [4]. > > [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()" > > [2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubb...@nvidia.com > Proposed steps for fixing get_user_pages() + DMA problems. > > [3]https://lkml.kernel.org/r/20180710082100.mkdwngdv5kkrc...@quack2.suse.cz > Bounce buffers (otherwise [2] is not really viable). > > [4] https://lkml.kernel.org/r/20181003162115.gg24...@quack2.suse.cz > Follow-up discussions. 
> > CC: Matthew Wilcox > CC: Michal Hocko > CC: Christopher Lameter > CC: Jason Gunthorpe > CC: Dan Williams > CC: Jan Kara > CC: Al Viro > CC: Jerome Glisse > CC: Christoph Hellwig > Signed-off-by: John Hubbard > include/linux/mm.h | 42 -- > 1 file changed, 40 insertions(+), 2 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index a61ebe8ad4ca..1a9aae7c659f 100644 > +++ b/include/linux/mm.h > @@ -137,6 +137,8 @@ extern int overcommit_ratio_handler(struct ctl_table *, > int, void __user *, > size_t *, loff_t *); > extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *, > size_t *, loff_t *); > +int set_page_dirty(struct page *page); > +int set_page_dirty_lock(struct page *page); > > #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) > > @@ -943,6 +945,44 @@ static inline void put_page(struct page *page) > __put_page(page); > } > > +/* Placeholder version, until all get_user_pages*() callers are updated. */ > +static inline void put_user_page(struct page *page) > +{ > + put_page(page); > +} > + > +/* For get_user_pages*()-pinned pages, use these variants instead of > + * release_pages(): > + */ > +static inline void put_user_pages_dirty(struct page **pages, > + unsigned long npages) > +{ > + while (npages) { > + set_page_dirty(pages[npages]); > + put_user_page(pages[npages]); > + --npages; > + } > +} Shouldn't these do the !PageDirty(page) thing? Jason
Re: [PATCH v2 3/3] infiniband/mm: convert to the new put_user_page[s]() calls
On Thu, Oct 04, 2018 at 09:02:25PM -0700, john.hubb...@gmail.com wrote: > From: John Hubbard > > For code that retains pages via get_user_pages*(), > release those pages via the new put_user_page(), > instead of put_page(). > > This prepares for eventually fixing the problem described > in [1], and is following a plan listed in [2], [3], [4]. > > [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()" > > [2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubb...@nvidia.com > Proposed steps for fixing get_user_pages() + DMA problems. > > [3]https://lkml.kernel.org/r/20180710082100.mkdwngdv5kkrc...@quack2.suse.cz > Bounce buffers (otherwise [2] is not really viable). > > [4] https://lkml.kernel.org/r/20181003162115.gg24...@quack2.suse.cz > Follow-up discussions. > > CC: Doug Ledford > CC: Jason Gunthorpe > CC: Mike Marciniszyn > CC: Dennis Dalessandro > CC: Christian Benvenuti > > CC: linux-r...@vger.kernel.org > CC: linux-kernel@vger.kernel.org > CC: linux...@kvack.org > Signed-off-by: John Hubbard > drivers/infiniband/core/umem.c | 2 +- > drivers/infiniband/core/umem_odp.c | 2 +- > drivers/infiniband/hw/hfi1/user_pages.c | 11 --- > drivers/infiniband/hw/mthca/mthca_memfree.c | 6 +++--- > drivers/infiniband/hw/qib/qib_user_pages.c | 11 --- > drivers/infiniband/hw/qib/qib_user_sdma.c | 8 > drivers/infiniband/hw/usnic/usnic_uiom.c| 2 +- > 7 files changed, 18 insertions(+), 24 deletions(-) > > diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c > index a41792dbae1f..9430d697cb9f 100644 > +++ b/drivers/infiniband/core/umem.c > @@ -60,7 +60,7 @@ static void __ib_umem_release(struct ib_device *dev, struct > ib_umem *umem, int d > page = sg_page(sg); > if (!PageDirty(page) && umem->writable && dirty) > set_page_dirty_lock(page); > - put_page(page); > + put_user_page(page); > } How about ? if (umem->writable && dirty) put_user_pages_dirty_lock(&page, 1); else put_user_page(page); ? 
> diff --git a/drivers/infiniband/hw/hfi1/user_pages.c > b/drivers/infiniband/hw/hfi1/user_pages.c > index e341e6dcc388..99ccc0483711 100644 > +++ b/drivers/infiniband/hw/hfi1/user_pages.c > @@ -121,13 +121,10 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, > unsigned long vaddr, size_t np > void hfi1_release_user_pages(struct mm_struct *mm, struct page **p, >size_t npages, bool dirty) > { > - size_t i; > - > - for (i = 0; i < npages; i++) { > - if (dirty) > - set_page_dirty_lock(p[i]); > - put_page(p[i]); > - } > + if (dirty) > + put_user_pages_dirty_lock(p, npages); > + else > + put_user_pages(p, npages); And I know Jan gave the feedback to remove the bool argument, but just pointing out that quite possibly every caller will wrap it in an if like this.. Jason
Re: [PATCH v2 2/3] mm: introduce put_user_page[s](), placeholder versions
On Fri, Oct 05, 2018 at 12:49:06PM -0700, John Hubbard wrote: > On 10/5/18 8:17 AM, Jason Gunthorpe wrote: > > On Thu, Oct 04, 2018 at 09:02:24PM -0700, john.hubb...@gmail.com wrote: > >> From: John Hubbard > >> > >> Introduces put_user_page(), which simply calls put_page(). > >> This provides a way to update all get_user_pages*() callers, > >> so that they call put_user_page(), instead of put_page(). > >> > >> Also introduces put_user_pages(), and a few dirty/locked variations, > >> as a replacement for release_pages(), for the same reasons. > >> These may be used for subsequent performance improvements, > >> via batching of pages to be released. > >> > >> This prepares for eventually fixing the problem described > >> in [1], and is following a plan listed in [2], [3], [4]. > >> > >> [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()" > >> > >> [2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubb...@nvidia.com > >> Proposed steps for fixing get_user_pages() + DMA problems. > >> > >> [3]https://lkml.kernel.org/r/20180710082100.mkdwngdv5kkrc...@quack2.suse.cz > >> Bounce buffers (otherwise [2] is not really viable). > >> > >> [4] https://lkml.kernel.org/r/20181003162115.gg24...@quack2.suse.cz > >> Follow-up discussions. > >> > [...] > >> > >> +/* Placeholder version, until all get_user_pages*() callers are updated. > >> */ > >> +static inline void put_user_page(struct page *page) > >> +{ > >> + put_page(page); > >> +} > >> + > >> +/* For get_user_pages*()-pinned pages, use these variants instead of > >> + * release_pages(): > >> + */ > >> +static inline void put_user_pages_dirty(struct page **pages, > >> + unsigned long npages) > >> +{ > >> + while (npages) { > >> + set_page_dirty(pages[npages]); > >> + put_user_page(pages[npages]); > >> + --npages; > >> + } > >> +} > > > > Shouldn't these do the !PageDirty(page) thing? > > > > Well, not yet. 
This is the "placeholder" patch, in which I planned to keep > the behavior the same, while I go to all the get_user_pages call sites and > change > put_page() and release_pages() over to use these new routines. Hmm.. Well, if it is the right thing to do here, why not include it and take it out of callers when doing the conversion? If it is the wrong thing, then let us still take it out of callers when doing the conversion :) Just seems like things will be in a better place to make future changes if all the call sites are de-duplicated and correct. Jason
[GIT PULL] Please pull RDMA subsystem changes
kshin (1): IB/iser: Fix possible NULL deref at iser_inv_desc() Jack Morgenstein (1): IB/core: Add an unbound WQ type to the new CQ API Jason Gunthorpe (31): Merge branch 'mlx5-flow-mutate' into rdma.git for-next Merge branch 'uverbs_dev_cleanups' into rdma.git for-next IB/rxe: Revise the ib_wr_opcode enum RDMA/uverbs: Remove is_closed from ib_uverbs_file RDMA/uverbs: Fix error unwind in ib_uverbs_add_one RDMA/ucontext: Add a core API for mmaping driver IO memory RDMA/mlx4: Use rdma_user_mmap_io RDMA/mlx5: Use rdma_user_mmap_io RDMA/hns: Use rdma_user_mmap_io RDMA/ucontext: Get rid of the old disassociate flow RDMA/umem: Do not use current->tgid to track the mm_struct RDMA/usnic: Do not use ucontext->tgid RDMA/umem: Use ib_umem_odp in all function signatures connected to ODP RDMA/umem: Make ib_umem_odp into a sub structure of ib_umem RDMA/umem: Get rid of struct ib_umem.odp_data RDMA/umem: Move all the ODP related stuff out of ucontext and into per_mm RDMA/umem: Use umem->owning_mm inside ODP RDMA/umem: Get rid of per_mm->notifier_count RDMA/umem: Handle a half-complete start/end sequence RDMA/umem: Avoid synchronize_srcu in the ODP MR destruction path RDMA/uverbs: Get rid of ucontext->tgid Merge branch 'mellanox/mlx5-next' into rdma.git for-next RDMA: Fully setup the device name in ib_register_device RDMA/core: Use dev_err/dbg/etc instead of pr_* + ibdev->name RDMA/drivers: Use dev_err/dbg/etc instead of pr_* + ibdev->name RDMA/core: Use dev_name instead of ibdev->name RDMA/drivers: Use dev_name instead of ibdev->name RDMA/ulp: Use dev_name instead of ibdev->name RDMA: Fix building with CONFIG_MMU=n RDMA/uverbs: Fix RCU annotation for radix slot deference Merge branch 'for-rc' into rdma.git for-next Joe Perches (1): RDMA/bnxt_re: QPLIB: Add and use #define dev_fmt(fmt) "QPLIB: " fmt Kaike Wan (3): IB/rdmavt: Rename check_send_wqe as setup_wqe IB/hfi1: Add static trace for iowait IB/hfi1: Use VL15 for SM packets Kamal Heib (2): RDMA/rxe: Remove unused addr_same() 
RDMA: Remove unused parameter from ib_modify_qp_is_ok() Kamenee Arumugam (1): IB/hfi1: PCIe bus width retry Leon Romanovsky (17): RDMA/nes: Delete impossible debug prints RDMA/uverbs: Declare closing variable as boolean RDMA/umem: Restore lockdep check while downgrading lock RDMA/netlink: Simplify netlink listener existence check RDMA/restrack: Un-inline set task implementation RDMA/restrack: Consolidate task name updates in one place RDMA/restrack: Release task struct which was hold by CM_ID object RDMA/restrack: Protect from reentry to resource return path RDMA/cm: Respect returned status of cm_init_av_by_path RDMA/cma: Remove unused timeout_ms parameter from cma_resolve_iw_route() RDMA/core: Align multiple functions to kernel coding style RDMA/core: Annotate timeout as unsigned long RDMA/core: Implement IB device rename function RDMA/nldev: Allow IB device rename through RDMA netlink RDMA/core: Increase total number of RDMA ports across all devices RDMA/uverbs: Use kernel API to allocate uverbs indexes RDMA/umad: Use kernel API to allocate umad indexes Lijun Ou (15): RDMA/hns: Refactor the codes for setting transport opode RDMA/hns: Add atomic support RDMA/hns: Add CM of vlan device support RDMA/hns: Add enable judgement for UD vlan RDMA/hns: Bugfix for reserved qp number RDMA/hns: Submit bad wr when post send wr exception RDMA/hns: Bugfix for CM test RDMA/hns: Limit the size of extend sge of sq RDMA/hns: Configure ecn field of ip header RDMA/hns: Update some attributes of the RoCE device RDMA/hns: Limit extend sq sge num RDMA/hns: Update some fields of qp context RDMA/hns: Support local invalidate for hip08 in kernel space RDMA/hns: Add vlan enable bit for hip08 RDMA/hns: Bugfix for atomic operation Majd Dibbiny (1): IB/mlx5: Change TX affinity assignment in RoCE LAG mode Mark Bloch (23): RDMA/uverbs: Add UVERBS_ATTR_CONST_IN to the specs language RDMA/mlx5: Add a new flow action verb - modify header RDMA/uverbs: Add generic function to fill in flow action 
object RDMA/mlx5: Add new flow action verb - packet reformat RDMA/mlx5: Extend packet reformat verbs RDMA/mlx5: Add NIC TX steering support RDMA/mlx5: Enable attaching modify header to steering flows RDMA/mlx5: Enable decap and packet reformat on flow tables RDMA/mlx5: Enable attaching DECAP action to steering flows RDMA/mlx5: Enable reformat on NIC RX if supported RDMA/mlx5: Enable attaching packet reform
Re: [PATCH] RDMA/hns: Use 64-bit arithmetic instead of 32-bit
On Thu, Oct 18, 2018 at 10:02:58AM +0200, Gustavo A. R. Silva wrote: > Cast *max_num_sg* to u64 in order to give the compiler complete > information about the proper arithmetic to use. > > Notice that such variable is used in a context that expects an > expression of type u64 (64 bits, unsigned) and the following > expression is currently being evaluated using 32-bit > arithmetic: > > length = max_num_sg * page_size; > > Addresses-Coverity-ID: 1474517 ("Unintentional integer overflow") > Signed-off-by: Gustavo A. R. Silva > drivers/infiniband/hw/hns/hns_roce_mr.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/infiniband/hw/hns/hns_roce_mr.c > b/drivers/infiniband/hw/hns/hns_roce_mr.c > index 521ad2a..d479d5e 100644 > +++ b/drivers/infiniband/hw/hns/hns_roce_mr.c > @@ -1219,7 +1219,7 @@ struct ib_mr *hns_roce_alloc_mr(struct ib_pd *pd, enum > ib_mr_type mr_type, > int ret; > > page_size = 1 << (hr_dev->caps.pbl_buf_pg_sz + PAGE_SHIFT); > - length = max_num_sg * page_size; > + length = (u64)max_num_sg * page_size; This should be done with check_mul_overflow() which will also force the input types to the correct thing. alloc_mr is callable from userspace so the potential overflow here should not be ignored. Jason
Re: [PATCH v4 2/3] mm: introduce put_user_page*(), placeholder versions
On Thu, Oct 11, 2018 at 06:23:24PM -0700, John Hubbard wrote: > On 10/11/18 6:20 AM, Jason Gunthorpe wrote: > > On Thu, Oct 11, 2018 at 10:49:29AM +0200, Jan Kara wrote: > > > >>> This is a real worry. If someone uses a mistaken put_page() then how > >>> will that bug manifest at runtime? Under what set of circumstances > >>> will the kernel trigger the bug? > >> > >> At runtime such bug will manifest as a page that can never be evicted from > >> memory. We could warn in put_page() if page reference count drops below > >> bare minimum for given user pin count which would be able to catch some > >> issues but it won't be 100% reliable. So at this point I'm more leaning > >> towards making get_user_pages() return a different type than just > >> struct page * to make it much harder for refcount to go wrong... > > > > At least for the infiniband code being used as an example here we take > > the struct page from get_user_pages, then stick it in a sgl, and at > > put_page time we get the page back out of the sgl via sg_page() > > > > So type safety will not help this case... I wonder how many other > > users are similar? I think this is a pretty reasonable flow for DMA > > with user pages. > > > > That is true. The infiniband code, fortunately, never mixes the two page > types into the same pool (or sg list), so it's actually an easier example > than some other subsystems. But, yes, type safety doesn't help there. I can > take a moment to look around at the other areas, to quantify how much a type > safety change might help. Are most (all?) of the places working with SGLs? Maybe we could just have a 'get_user_pages_to_sgl' and 'put_pages_sgl' sort of interface that handled all this instead of trying to make something that is struct page based? It seems easier to get an extra bit for user/!user in the SGL datastructure? Jason
Re: [PATCH] rdma: move the ib_wr_opcode enum to include/uapi
On Mon, Sep 17, 2018 at 08:38:16PM +, Walker, Benjamin wrote: > On Mon, 2018-08-20 at 14:32 -0700, Howell, Seth wrote: > > Hi Jason, > > > > I apologize for the few days of radio silence on this one. I was able to > > apply > > your patch on my local configuration and can confirm that it fixes the issue > > of send with invalidate being improperly mapped between a user-space process > > and the kernel rxe driver. > > > > Thanks again for your time and help. > > > > Seth Howell > > > > From: linux-rdma-ow...@vger.kernel.org [mailto: > > linux-rdma-ow...@vger.kernel.org] On Behalf Of Jason Gunthorpe > > Sent: Thursday, August 16, 2018 11:01 AM > > To: Howell, Seth > > Cc: Walker, Benjamin ; Doug Ledford < > > dledf...@redhat.com>; linux-r...@vger.kernel.org; > > linux-kernel@vger.kernel.org > > ; Moni Shoua > > Subject: Re: [PATCH] rdma: move the ib_wr_opcode enum to include/uapi > > > > On Wed, Aug 15, 2018 at 11:27:36PM +, Howell, Seth wrote: > > > > > Thank you for taking the time to review this issue. The patch you > > > submitted will definitely fix the problem I was facing. > > > > Can you test it to be certain? > > > > Thanks, > > Jason > > We've recently run into this same issue on i40iw, which appears to make the > same > mistake of using the kernel version of the enum instead of the userspace > version. Confused by this?? i40iw_upost_send does not handle the kernel numbers at all, as far as I can see? How does it develop a kernel dependency?? > What's the current status here? Can it be merged? I just checked and > do not see it merged to Linux master. Oh! This apparently got lost, thanks for bringing it up again. > Running a user-space NVMe-oF target with RDMA and a recent Linux kernel > initiator is not currently possible on rxe or i40iw because it requires send > with invalidate support. Okay, but i40iw doesn't seem to support send with invalidate at all in userspace? 
i40iw_upost_send() switches on opcode, doesn't handle SEND_INV, and then blows up in the default clause - how does this patch make any difference??? Jason
Re: [PATCH] rdma: move the ib_wr_opcode enum to include/uapi
On Mon, Sep 17, 2018 at 10:29:18PM +, Walker, Benjamin wrote: > On Mon, 2018-09-17 at 15:08 -0600, Jason Gunthorpe wrote: > > On Mon, Sep 17, 2018 at 08:38:16PM +, Walker, Benjamin wrote: > > > We've recently run into this same issue on i40iw, which appears to make > > > the > > > same > > > mistake of using the kernel version of the enum instead of the userspace > > > version. > > > > Confused by this?? i40iw_upost_send does not handle the kernel numbers > > at all, as far as I can see? How does it develop a kernel dependency?? > > > > > What's the current status here? Can it be merged? I just checked and > > > do not see it merged to Linux master. > > > > Oh! This apparently got lost, thanks for bringing it up again. > > > > > Running a user-space NVMe-oF target with RDMA and a recent Linux kernel > > > initiator is not currently possible on rxe or i40iw because it requires > > > send > > > with invalidate support. > > > > Okay, but i40iw doesn't seem to support send with invalidate at all in > > userspace? > > > > i40iw_upost_send() switches on opcode, doesn't handle SEND_INV and > > then blows up in the default clause - how does this patch make any > > difference??? > > It appears I've read the error message incorrectly and was looking at the > kernel > version (i40iw_post_send) as opposed to the user version (i40iw_upost_send). > Indeed, the NIC does not support SEND_WITH_INVAL at all in that function. The > NIC does support SEND_WITH_INVAL in the kernel i40iw_post_send. OK > What is the correct way for a user space application to check whether a NIC > supports SEND_WITH_INVAL? We are currently examining the device_cap_flags in > the > structure returned by ibv_query_device. Specifically, we're looking at > IBV_DEVICE_MEM_MGT_EXTENSIONS. However, for i40iw, that flag is set. I'm > concerned that the feature support flags are common between user space and the > kernel, but the actual support differs in this case. That is the correct thing to do.
Sadly the i40iw driver is broken; it should be fixed to either mask those flags or implement the functionality. Jason
Re: [PATCH] rdma: move the ib_wr_opcode enum to include/uapi
> From e3c1b1b81373601fa6cbad2ba5fadd2c8cfdfaed Mon Sep 17 00:00:00 2001 > From: Seth Howell > Date: Tue, 14 Aug 2018 15:33:02 -0700 > Subject: [PATCH] IB/rxe: Revise the ib_wr_opcode enum > > This enum has become part of the uABI, as both RXE and the > ib_uverbs_post_send() command expect userspace to supply values from > this enum. So it should be properly placed in include/uapi/rdma. > > In userspace this enum is called 'enum ibv_wr_opcode' as part of > libibverbs.h. That enum defines different values for > IB_WR_LOCAL_INV, IB_WR_SEND_WITH_INV, and IB_WR_LSO. These were > introduced (incorrectly, it turns out) into libibverbs in 2015. > > The kernel has changed its mind on the numbering for several of the > IB_WC values over the years, but has remained stable on > IB_WR_LOCAL_INV and below. > > Based on this we can conclude that there is no real user space user > of the values beyond IB_WR_ATOMIC_FETCH_AND_ADD, as they have never > worked via rdma-core. This is confirmed by inspection, only rxe uses > the kernel enum and implements the latter operations. rxe has > clearly never worked with these attributes from userspace. Other > drivers that support these opcodes implement the functionality > without calling out to the kernel. > > To make IB_WR_SEND_WITH_INV and related work for RXE in userspace we > choose to renumber the IB_WR enum in the kernel to match the uABI > that userspace has been using since before Soft RoCE was merged. This > is an overall simpler configuration for the whole software stack, > and obviously can't break anything existing. > > Reported-by: Seth Howell > Fixes: 8700e3e7c485 ("Soft RoCE driver") > Cc: > Signed-off-by: Jason Gunthorpe > include/rdma/ib_verbs.h | 34 ++- > include/uapi/rdma/ib_user_verbs.h | 20 +- > 2 files changed, 39 insertions(+), 15 deletions(-) Applied to for-next Thanks, Jason
Re: [PATCH 1/4] IB/core: introduce ->release() callback
On Tue, Sep 18, 2018 at 04:03:43PM +0300, Jan Dakinevich wrote: > IB infrastructure shares a common device instance constructor with > reference counting, and it uses kzalloc() to allocate memory > for the device specific instance with an encapsulated ib_device field as one > contiguous memory block. > > The issue is that the device specific instances tend to be too large > and require high page order memory allocation. Unfortunately, kzalloc() > in ib_alloc_device() can not be replaced with kvzalloc() since it would > require a lot of review in all IB drivers to prove correctness of the > replacement. > > The driver can allocate some heavy parts of its instance separately > and keep pointers to them in its own instance. For this it is important > that the allocated parts have the same life time as ib_device, thus > their deallocation should be based on the same reference counting. > > Suppose: > > struct foo_ib_device { > struct ib_device device; > > void *part; > > ... > }; > > To properly free memory from foo->part the driver should provide > a function for the ->release() callback: > > void foo_ib_release(struct ib_device *device) > { > struct foo_ib_device *foo = container_of(device, struct foo_ib_device, >device); > > kvfree(foo->part); > } > > ...and initialize this callback immediately after foo_ib_device > instance allocation.
> > struct foo_ib_device *foo; > > foo = ib_alloc_device(sizeof(struct foo_ib_device)); > > foo->device.release = foo_ib_release; > > /* allocate parts */ > foo->part = kvmalloc(65536, GFP_KERNEL); > > Signed-off-by: Jan Dakinevich > drivers/infiniband/core/device.c | 2 ++ > include/rdma/ib_verbs.h | 2 ++ > 2 files changed, 4 insertions(+) > > diff --git a/drivers/infiniband/core/device.c > b/drivers/infiniband/core/device.c > index db3b627..a8c8b0d 100644 > +++ b/drivers/infiniband/core/device.c > @@ -215,6 +215,8 @@ static void ib_device_release(struct device *device) > ib_cache_release_one(dev); > kfree(dev->port_immutable); > } > + if (dev->release) > + dev->release(dev); > kfree(dev); > } Nope, the driver module could be unloaded at this point. The driver should free memory after its call to ib_unregister_device returns. Jason
Re: [PATCH 0/4] IB: decrease large contigous allocation
On Tue, Sep 18, 2018 at 04:03:42PM +0300, Jan Dakinevich wrote: > The size of mlx4_ib_device became too large to be allocated as a whole > contiguous > block of memory. Currently it takes about 55K. On an architecture with 4K pages > that > means a 3rd-order allocation. > > This patch series makes an attempt to split mlx4_ib_device into several parts > and allocate them with the less expensive kvzalloc Why split it up? Any reason not to just allocate the whole thing with kvzalloc? Jason
Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper
On Fri, Feb 09, 2024 at 08:55:31AM -0700, Alex Williamson wrote: > I think Kevin's point is also relative to this latter scenario, in the > L1 instance of the nvgrace-gpu driver the mmap of the usemem BAR is > cachable, but in the L2 instance of the driver where we only use the > vfio-pci-core ops nothing maintains that cachable mapping. Is that a > problem? An uncached mapping on top of a cachable mapping is often > prone to problems. On these CPUs the ARM architecture won't permit it; the L0 level blocks uncachable mappings using FWB and page table attributes. The VM, no matter what it does, cannot make the cachable memory uncachable. Jason
Re: [RFC v1 0/8] vhost-vdpa: add support for iommufd
On Thu, Nov 09, 2023 at 06:48:46PM -0500, Michael S. Tsirkin wrote: > On Tue, Nov 07, 2023 at 11:52:17AM -0400, Jason Gunthorpe wrote: > > On Tue, Nov 07, 2023 at 09:30:21AM -0500, Michael S. Tsirkin wrote: > > > On Tue, Nov 07, 2023 at 10:12:37AM -0400, Jason Gunthorpe wrote: > > > > Big companies should take the responsibility to train and provide > > > > skill development for their own staff. > > > > > > That would result in a beautiful cathedral of a patch. I know this is > > > how some companies work. We are doing more of a bazaar thing here, > > > though. In a bunch of subsystems it seems that you don't get the > > > necessary skills until you have been publicly shouted at by > > > maintainers - better to start early ;). Not a nice environment for > > > novices, for sure. > > > > In my view the "shouting from maintainers" is harmful to the people > > building skills and it is an unkind thing to dump employees into that > > kind of situation. > > > > They should have help to establish the basic level of competence where > > they may do the wrong thing, but all the process and presentation of > > the wrong thing is top notch. You get a much better reception. > > What - like e.g. mechanically fixing checkpatch warnings without > understanding? No, not at all. I mean actually going through and explaining what the idea is to another person and ensuring that the commit messages convey that idea, that the patches reflect the idea, that everything is conveyed, and it isn't obviously internally illogical. Like, why did this series have a giant block of #ifdef 0'd code with no explanation at all? That isn't checkpatch nitpicks, that is not meeting the minimum standard to convey an idea in an RFC. Jason
Re: [PATCH 1/1] remoteproc: Use iommu_paging_domain_alloc()
On Thu, Aug 22, 2024 at 10:17:56AM -0600, Mathieu Poirier wrote: > > - domain = iommu_domain_alloc(dev->bus); > > - if (!domain) { > > + domain = iommu_paging_domain_alloc(dev); > > I'm a little hesitant here. Function iommu_domain_alloc() passes NULL as the > second argument of __iommu_domain_alloc() while iommu_paging_domain_alloc() > provides a 'dev'. I don't have any HW to test on and I am not familiar enough > with the IOMMU subsystem to confidently move forward. So long as dev is the device that will be initiating DMA, this is a correct change from the iommu subsystem perspective. Jason
Re: [PATCH] iommufd/selftest:fix a resource leak
On Wed, Sep 04, 2024 at 04:08:06AM +0800, Liu Jing wrote: > If the file fails to open, not only return 1 but also close the file > descriptor, otherwise a resource > leak will occur Why doesn't FIXTURE_TEARDOWN do this? Jason
Re: [PATCH v2 02/19] iommufd/viommu: Add IOMMUFD_OBJ_VIOMMU and IOMMU_VIOMMU_ALLOC ioctl
On Sun, Sep 01, 2024 at 10:27:09PM -0700, Nicolin Chen wrote: > On Sun, Sep 01, 2024 at 10:39:17AM +0800, Baolu Lu wrote: > > On 2024/8/28 0:59, Nicolin Chen wrote: > > > +int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd) > > > +{ > > > + struct iommu_viommu_alloc *cmd = ucmd->cmd; > > > + struct iommufd_hwpt_paging *hwpt_paging; > > > + struct iommufd_viommu *viommu; > > > + struct iommufd_device *idev; > > > + int rc; > > > + > > > + if (cmd->flags) > > > + return -EOPNOTSUPP; > > > + > > > + idev = iommufd_get_device(ucmd, cmd->dev_id); > > > > Why is a device reference count needed here? When is this reference > > count released after the VIOMMU is allocated? > > Hmm, it was used to get dev->iommu->iommu_dev to pin the VIOMMU to > a physical IOMMU instance (in v1). Jason suggested to remove that, > yet I didn't realize that this idev is now completely useless. > > With that being said, a parent HWPT could be shared across VIOMMUs > allocated for the same VM. So, I think we do need a dev pointer to > know which physical instance the VIOMMU allocates for, especially > for a driver-managed VIOMMU. Eventually you need a way to pin the physical iommu, without pinning any idevs. Not sure how best to do that Jason
Re: [PATCH v2 02/19] iommufd/viommu: Add IOMMUFD_OBJ_VIOMMU and IOMMU_VIOMMU_ALLOC ioctl
On Wed, Sep 04, 2024 at 10:29:26AM -0700, Nicolin Chen wrote: > On Wed, Sep 04, 2024 at 01:26:21PM -0300, Jason Gunthorpe wrote: > > On Sun, Sep 01, 2024 at 10:27:09PM -0700, Nicolin Chen wrote: > > > On Sun, Sep 01, 2024 at 10:39:17AM +0800, Baolu Lu wrote: > > > > On 2024/8/28 0:59, Nicolin Chen wrote: > > > > > +int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd) > > > > > +{ > > > > > + struct iommu_viommu_alloc *cmd = ucmd->cmd; > > > > > + struct iommufd_hwpt_paging *hwpt_paging; > > > > > + struct iommufd_viommu *viommu; > > > > > + struct iommufd_device *idev; > > > > > + int rc; > > > > > + > > > > > + if (cmd->flags) > > > > > + return -EOPNOTSUPP; > > > > > + > > > > > + idev = iommufd_get_device(ucmd, cmd->dev_id); > > > > > > > > Why is a device reference count needed here? When is this reference > > > > count released after the VIOMMU is allocated? > > > > > > Hmm, it was used to get dev->iommu->iommu_dev to pin the VIOMMU to > > > a physical IOMMU instance (in v1). Jason suggested to remove that, > > > yet I didn't realize that this idev is now completely useless. > > > > > > With that being said, a parent HWPT could be shared across VIOMMUs > > > allocated for the same VM. So, I think we do need a dev pointer to > > > know which physical instance the VIOMMU allocates for, especially > > > for a driver-managed VIOMMU. > > > > Eventually you need a way to pin the physical iommu, without pinning > > any idevs. Not sure how best to do that > > Just trying to clarify "without pinning any idevs", does it mean > we shouldn't pass in an idev_id to get dev->iommu->iommu_dev? From userspace we have no choice but to use an idev_id to locate the physical iommu. But since we want to support hotplug it is rather problematic if that idev is permanently locked down. > Otherwise, iommu_probe_device_lock and iommu_device_lock in the > iommu.c are good enough to lock dev->iommu and iommu->list.
And > I think we just need an iommu helper refcounting the dev_iommu > (or iommu_device) as we previously discussed. If you have a ref on an idev then the iommu_dev has to be stable, so you can just incr some refcount and then drop the idev stuff. Jason
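The refcount handoff Jason describes — take a reference on the physical IOMMU *through* the still-held idev, then release the idev — can be sketched in plain userspace C. All of the type and function names below are illustrative stand-ins, not real kernel symbols:

```c
#include <assert.h>

/* Toy stand-ins: toy_piommu ~ the physical IOMMU (iommu_dev),
 * toy_idev ~ an iommufd device holding it stable. */
struct toy_piommu {
    int refcount;               /* stands in for a kref on the pIOMMU */
};

struct toy_idev {
    int refcount;               /* held by the caller during lookup */
    struct toy_piommu *piommu;  /* stable while the idev ref is held */
};

/* Pin the pIOMMU via a held idev, then drop the idev: because the idev
 * reference keeps piommu stable, the increment is race-free. */
static struct toy_piommu *pin_piommu_and_drop_idev(struct toy_idev *idev)
{
    struct toy_piommu *piommu = idev->piommu;

    piommu->refcount++;         /* "incr some refcount" on the iommu_dev */
    idev->refcount--;           /* "drop the idev stuff" */
    return piommu;
}
```

After this, the vIOMMU holds only the pIOMMU reference, so the idev can be hot-unplugged without tearing down the vIOMMU.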
Re: [PATCH v2 01/19] iommufd: Reorder struct forward declarations
On Tue, Aug 27, 2024 at 09:59:38AM -0700, Nicolin Chen wrote: > Reorder struct forward declarations to alphabetic order to simplify > maintenance, as upcoming patches will add more to the list. > > No functional change intended. > > Signed-off-by: Nicolin Chen > --- > include/linux/iommufd.h | 8 > 1 file changed, 4 insertions(+), 4 deletions(-) I picked this one up Thanks, Jason
Re: [PATCH v2 02/19] iommufd/viommu: Add IOMMUFD_OBJ_VIOMMU and IOMMU_VIOMMU_ALLOC ioctl
On Tue, Aug 27, 2024 at 09:59:39AM -0700, Nicolin Chen wrote: > +/** > + * struct iommu_viommu_alloc - ioctl(IOMMU_VIOMMU_ALLOC) > + * @size: sizeof(struct iommu_viommu_alloc) > + * @flags: Must be 0 > + * @type: Type of the virtual IOMMU. Must be defined in enum > iommu_viommu_type > + * @dev_id: The device to allocate this virtual IOMMU for @dev_id: The device's physical IOMMU will be used to back the vIOMMU > + * @hwpt_id: ID of a nesting parent HWPT to associate to A nesting parent HWPT that will provide translation for vIOMMU DMA > + * @out_viommu_id: Output virtual IOMMU ID for the allocated object > + * > + * Allocate a virtual IOMMU object that holds a (shared) nesting parent HWPT Allocate a virtual IOMMU object that represents the underlying physical IOMMU's virtualization support. The vIOMMU object is a security isolated slice of the physical IOMMU HW that is unique to a specific VM. Operations global to the IOMMU are connected to the vIOMMU, such as: - Security namespace for guest owned ID, eg guest controlled cache tags - Virtualization of various platform IDs like RIDs and others - direct assigned invalidation queues - direct assigned interrupts - non-affiliated event reporting - Delivery of paravirtualized invalidation Jason
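For reference, a minimal sketch of how a VMM might fill this ioctl's argument structure, using a local mirror of the layout described by the kdoc above. The field widths (all 32-bit) and the type value are assumptions for illustration; the authoritative definition lives in include/uapi/linux/iommufd.h from the patch:

```c
#include <stdint.h>

/* Local mirror of struct iommu_viommu_alloc per the kdoc quoted above.
 * Assumed layout -- do not use this in place of the real uapi header. */
struct iommu_viommu_alloc_sketch {
    uint32_t size;           /* sizeof(struct iommu_viommu_alloc) */
    uint32_t flags;          /* must be 0 */
    uint32_t type;           /* enum iommu_viommu_type */
    uint32_t dev_id;         /* device whose pIOMMU backs the vIOMMU */
    uint32_t hwpt_id;        /* nesting parent HWPT for S2 translation */
    uint32_t out_viommu_id;  /* output: allocated vIOMMU object ID */
};

static void fill_viommu_alloc(struct iommu_viommu_alloc_sketch *cmd,
                              uint32_t dev_id, uint32_t hwpt_id)
{
    cmd->size = sizeof(*cmd);
    cmd->flags = 0;          /* nonzero flags get -EOPNOTSUPP per the patch */
    cmd->type = 0;           /* assumed: a default/driver-agnostic type */
    cmd->dev_id = dev_id;
    cmd->hwpt_id = hwpt_id;
    cmd->out_viommu_id = 0;  /* kernel writes the result here */
}
```

A real caller would then issue ioctl(iommufd, IOMMU_VIOMMU_ALLOC, &cmd) and read out_viommu_id on success.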
Re: [PATCH v2 03/19] iommu: Pass in a viommu pointer to domain_alloc_user op
On Tue, Aug 27, 2024 at 09:59:40AM -0700, Nicolin Chen wrote: > With a viommu object wrapping a potentially shareable S2 domain, a nested > domain should be allocated by associating to a viommu instead. Driver can > store this viommu pointer somewhere, so as to later use it calling viommu > helpers for virtual device ID lookup and viommu invalidation. > > For drivers without viommu support, keep the parent domain input, which > should be just viommu->hwpt->common.domain otherwise. I've been thinking of adding an op for nested allocation, since every driver immediately jumps to a special function for nested allocation anyhow without sharing any code. Adding a new parameter that is nested-only seems like a good point to try to do that.. Jason
Re: [PATCH v2 04/19] iommufd: Allow pt_id to carry viommu_id for IOMMU_HWPT_ALLOC
On Tue, Aug 27, 2024 at 09:59:41AM -0700, Nicolin Chen wrote: > Now a VIOMMU can wrap a shareable nested parent HWPT. So, it can act like > a nested parent HWPT to allocate a nested HWPT. > > Support that in the IOMMU_HWPT_ALLOC ioctl handler, and update its kdoc. > > Also, associate a viommu to an allocating nested HWPT. > > Signed-off-by: Nicolin Chen > --- > drivers/iommu/iommufd/hw_pagetable.c| 24 ++-- > drivers/iommu/iommufd/iommufd_private.h | 1 + > include/uapi/linux/iommufd.h| 12 ++-- > 3 files changed, 29 insertions(+), 8 deletions(-) Reviewed-by: Jason Gunthorpe Jason
Re: [PATCH v2 06/19] iommufd/viommu: Add IOMMU_VIOMMU_SET/UNSET_VDEV_ID ioctl
On Tue, Aug 27, 2024 at 09:59:43AM -0700, Nicolin Chen wrote: > Introduce a pair of new ioctls to set/unset a per-viommu virtual device id > that should be linked to a physical device id via an idev pointer. Given some of the other discussions around CC I suspect we should rename these to 'create/destroy virtual device' with an eye that eventually they would be extended like other ops with per-CC platform data. ie this would be the interface to tell the CC trusted world that a secure device is being added to a VM with some additional flags.. Right now it only conveys the vRID parameter of the virtual device being created. A following question is if these objects should have their own IDs in the iommufd space too, and then unset is not unset but just a normal destroy object. If so then the thing you put in the ids xarray would also just be a normal object struct. This is probably worth doing if this is going to grow more CC stuff later. Jason
Re: [PATCH v2 08/19] iommufd/viommu: Add cache_invalidate for IOMMU_VIOMMU_TYPE_DEFAULT
On Tue, Aug 27, 2024 at 09:59:45AM -0700, Nicolin Chen wrote: > Add a default_viommu_ops with a new op for cache invalidation, similar to > the cache_invalidate_user op in struct iommu_domain_ops, but wider. An > IOMMU driver that allocated a nested domain with a core-managed viommu is > able to use the same viommu pointer for this cache invalidation API. > > ARM SMMUv3 for example supports IOTLB and ATC device cache invalidations. > The IOTLB invalidation is per-VMID, held currently by a parent S2 domain. > The ATC invalidation is per device (Stream ID) that should be translated > by a virtual device ID lookup table. Either case fits the viommu context. > > Signed-off-by: Nicolin Chen > --- > drivers/iommu/iommufd/iommufd_private.h | 3 +++ > drivers/iommu/iommufd/viommu.c | 3 +++ > include/linux/iommu.h | 5 + > include/linux/iommufd.h | 19 +++ > 4 files changed, 30 insertions(+) It looks OK Reviewed-by: Jason Gunthorpe Jason
Re: [PATCH v2 09/19] iommufd: Allow hwpt_id to carry viommu_id for IOMMU_HWPT_INVALIDATE
On Tue, Aug 27, 2024 at 09:59:46AM -0700, Nicolin Chen wrote: > With a VIOMMU object, userspace can flush any IOMMU related cache that can > be directed using the viommu. It is similar to the IOMMU_HWPT_INVALIDATE uAPI, > but can cover a wider range than IOTLB, such as device cache or descriptor > cache. > > Allow hwpt_id of the iommu_hwpt_invalidate structure to carry a viommu_id, > and reuse the IOMMU_HWPT_INVALIDATE uAPI for VIOMMU invalidations. Driver > can define a different structure for VIOMMU invalidations v.s. HWPT ones. > > Update the uAPI, kdoc, and selftest case accordingly. > > Signed-off-by: Nicolin Chen > --- > drivers/iommu/iommufd/hw_pagetable.c| 32 +++-- > include/uapi/linux/iommufd.h| 9 --- > tools/testing/selftests/iommu/iommufd.c | 4 ++-- > 3 files changed, 33 insertions(+), 12 deletions(-) Reviewed-by: Jason Gunthorpe Jason
Re: [PATCH v2 10/19] iommufd/viommu: Add vdev_id helpers for IOMMU drivers
On Tue, Aug 27, 2024 at 09:59:47AM -0700, Nicolin Chen wrote: > Driver can call the iommufd_viommu_find_device() to find a device pointer > using its per-viommu virtual ID. The returned device must be protected by > the pair of iommufd_viommu_lock/unlock_vdev_id() function. > > Put these three functions into a new viommu_api file, to build it with the > IOMMUFD_DRIVER config. > > Signed-off-by: Nicolin Chen > --- > drivers/iommu/iommufd/Makefile | 2 +- > drivers/iommu/iommufd/viommu_api.c | 39 ++ > include/linux/iommufd.h| 16 > 3 files changed, 56 insertions(+), 1 deletion(-) > create mode 100644 drivers/iommu/iommufd/viommu_api.c I still think this is better to just share the struct content with the driver, eventually we want to do this anyhow as the driver will want to use container_of() techniques to reach its private data. > +/* > + * Find a device attached to an VIOMMU object using a virtual device ID that > was > + * set via an IOMMUFD_CMD_VIOMMU_SET_VDEV_ID. Callers of this function must > call > + * iommufd_viommu_lock_vdev_id() prior and iommufd_viommu_unlock_vdev_id() > after > + * > + * Return device or NULL. > + */ > +struct device *iommufd_viommu_find_device(struct iommufd_viommu *viommu, u64 > id) > +{ > + struct iommufd_vdev_id *vdev_id; > + > + lockdep_assert_held(&viommu->vdev_ids_rwsem); > + > + xa_lock(&viommu->vdev_ids); > + vdev_id = xa_load(&viommu->vdev_ids, (unsigned long)id); > + xa_unlock(&viommu->vdev_ids); No need for this lock, xa_load is rcu safe against concurrent writer Jason
Re: [PATCH v2 17/19] iommu/arm-smmu-v3: Add arm_smmu_viommu_cache_invalidate
On Tue, Aug 27, 2024 at 09:59:54AM -0700, Nicolin Chen wrote: > +static int arm_smmu_viommu_cache_invalidate(struct iommufd_viommu *viommu, > + struct iommu_user_data_array *array) > +{ > + struct iommu_domain *domain = iommufd_viommu_to_parent_domain(viommu); > + > + return __arm_smmu_cache_invalidate_user( > + to_smmu_domain(domain), viommu, array); I'd like to have the viommu struct directly hold the VMID. The nested parent should be sharable between multiple viommus, it doesn't make any sense that it would hold the vmid. This is struggling because it is trying too hard to not have the driver allocate the viommu, and I think we should just go ahead and do that. Store the vmid, today copied from the nesting parent in the vmid private struct. No need for iommufd_viommu_to_parent_domain(), just rework the APIs to pass the vmid down not a domain. Jason
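The layout Jason is asking for — the driver allocates its own wrapper embedding the core viommu object, copies the VMID from the nesting parent S2 domain at allocation time, and later reaches its private data with container_of() instead of a parent-domain lookup — can be sketched with toy types (none of these are the real SMMUv3 structures):

```c
#include <stddef.h>

/* Toy stand-in for struct iommufd_viommu (the core object). */
struct core_viommu {
    unsigned int id;
};

/* Toy stand-in for a driver-private wrapper the driver allocates itself. */
struct smmu_viommu {
    struct core_viommu core;    /* embedded core object */
    unsigned int vmid;          /* copied from the S2 domain at alloc time */
};

/* container_of() technique: recover the driver struct from the core ptr. */
#define viommu_to_smmu(v) \
    ((struct smmu_viommu *)((char *)(v) - offsetof(struct smmu_viommu, core)))

static unsigned int viommu_vmid(struct core_viommu *core)
{
    /* No iommufd_viommu_to_parent_domain() needed: the VMID lives right
     * beside the core object in the driver's own allocation. */
    return viommu_to_smmu(core)->vmid;
}
```

With this shape, invalidation paths can be handed the VMID directly rather than a domain pointer, which is what the review comment asks for.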
Re: [PATCH v2 16/19] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user
On Tue, Aug 27, 2024 at 09:59:53AM -0700, Nicolin Chen wrote: > static const struct iommu_domain_ops arm_smmu_nested_ops = { > .get_msi_mapping_domain = arm_smmu_get_msi_mapping_domain, > .attach_dev = arm_smmu_attach_dev_nested, > .free = arm_smmu_domain_nested_free, > + .cache_invalidate_user = arm_smmu_cache_invalidate_user, > }; I think we should drop this op. The original intention was to do things in parts to split up the patches, but it turns out this is functionally useless so let's not even expose it to userspace. So the patch can maybe be split differently and combined with the next patch. Jason
Re: [PATCH v2 06/19] iommufd/viommu: Add IOMMU_VIOMMU_SET/UNSET_VDEV_ID ioctl
On Thu, Sep 05, 2024 at 10:37:45AM -0700, Nicolin Chen wrote: > we only have virtual device ID in its data structure. Also, the > virtual device sounds a bit confusing, given we already have idev. idev is "iommufd device" which is the physical device The virtual device is the host side handle of a device in a VM. > That being said, if we have a clear picture that in the long term > we would extend it to hold more information, I think it could be > a smart move. > > Perhaps virtual device can have its own "attach" to vIOMMU? Or > would you still prefer attaching via proxy hwpt_nested? I was thinking just creating it against a vIOMMU is an effective "attach" and the virtual device is permanently tied to the vIOMMU at creation time. Is there more to attach? I think some CC stuff had a few more verbs in the lifecycle though Jason
Re: [PATCH v2 02/19] iommufd/viommu: Add IOMMUFD_OBJ_VIOMMU and IOMMU_VIOMMU_ALLOC ioctl
On Thu, Sep 05, 2024 at 10:10:38AM -0700, Nicolin Chen wrote: > On Thu, Sep 05, 2024 at 12:53:02PM -0300, Jason Gunthorpe wrote: > > On Tue, Aug 27, 2024 at 09:59:39AM -0700, Nicolin Chen wrote: > > > +/** > > > + * struct iommu_viommu_alloc - ioctl(IOMMU_VIOMMU_ALLOC) > > > + * @size: sizeof(struct iommu_viommu_alloc) > > > + * @flags: Must be 0 > > > + * @type: Type of the virtual IOMMU. Must be defined in enum > > > iommu_viommu_type > > > + * @dev_id: The device to allocate this virtual IOMMU for > > > > @dev_id: The device's physical IOMMU will be used to back t he vIOMMU > > > > > + * @hwpt_id: ID of a nesting parent HWPT to associate to > > > > A nesting parent HWPT that will provide translation for an vIOMMU DMA > > > > > + * @out_viommu_id: Output virtual IOMMU ID for the allocated object > > > + * > > > + * Allocate a virtual IOMMU object that holds a (shared) nesting parent > > > HWPT > > > > Allocate a virtual IOMMU object that represents the underlying > > physical IOMMU's virtualization support. The vIOMMU object is a > > security isolated slice of the physical IOMMU HW that is unique to a > > specific VM. Operations global to the IOMMU are connected to the > > vIOMMU, such as: > > - Security namespace for guest owned ID, eg guest controlled cache tags > > - Virtualization of various platforms IDs like RIDs and others > > - direct assigned invalidation queues > > - direct assigned interrupts > > - non-affiliated event reporting > > - Delivery of paravirtualized invalidation > > Ack. Also write something about the HWPT.. > Looks like you prefer using "vIOMMU" v.s. "VIOMMU"? I would go > through all the patches (QEMU including) to keep that aligned. Yeah, VIOMMU just for all-caps constants Jason
Re: [PATCH v1 04/10] iommufd/viommu: Allow drivers to control vdev_id lifecycle
On Tue, Aug 27, 2024 at 10:02:06AM -0700, Nicolin Chen wrote: > The iommufd core provides a lookup helper for an IOMMU driver to find a > device pointer by device's per-viommu virtual ID. Yet a driver may need > an inverted lookup to find a device's per-viommu virtual ID by a device > pointer, e.g. when reporting virtual IRQs/events back to the user space. > In this case, it'd be unsafe for iommufd core to do an inverted lookup, > as the driver can't track the lifecycle of a viommu object or a vdev_id > object. > > Meanwhile, some HW can even support virtual device ID lookup by its HW- > accelerated virtualization feature. E.g. Tegra241 CMDQV HW supports > executing vanilla guest-issued SMMU commands containing a virtual Stream ID > but requires software to configure a link between virtual Stream ID and > physical Stream ID via HW registers. So not only the iommufd core needs > a vdev_id lookup table, drivers will want one too. > > Given the two justifications above, it's the best practice to provide a > pair of set_vdev_id/unset_vdev_id ops in the viommu ops, so a driver > can implement them to control a vdev_id's lifecycle, and configure the > HW properly if required. I think the lifecycle rules should be much simpler. If a nested domain is attached to a STE/RID/device then the vIOMMU affiliated with that nested domain is pinned while the STE is in place. So the driver only needs to provide locking around attach changing the STE's vIOMMU vs async operations translating from a STE to a vIOMMU. This can be a simple driver lock of some kind, ie a rwlock across the STE table. Generally that is how all the async events should work, go from the STE to the VIOMMU to a iommufd callback to the iommufd event queue. iommufd will translate the struct device from the driver to an idev_id (or maybe even a vdevid) the same basic way the PRI code works. Need to check the attach struct lifecycle carefully Jason
Re: [PATCH v2 17/19] iommu/arm-smmu-v3: Add arm_smmu_viommu_cache_invalidate
On Thu, Sep 05, 2024 at 11:00:49AM -0700, Nicolin Chen wrote: > On Thu, Sep 05, 2024 at 01:20:39PM -0300, Jason Gunthorpe wrote: > > On Tue, Aug 27, 2024 at 09:59:54AM -0700, Nicolin Chen wrote: > > > > > +static int arm_smmu_viommu_cache_invalidate(struct iommufd_viommu > > > *viommu, > > > + struct iommu_user_data_array *array) > > > +{ > > > + struct iommu_domain *domain = iommufd_viommu_to_parent_domain(viommu); > > > + > > > + return __arm_smmu_cache_invalidate_user( > > > + to_smmu_domain(domain), viommu, array); > > > > I'd like to have the viommu struct directly hold the VMID. The nested > > parent should be sharable between multiple viommus, it doesn't make > > any sense that it would hold the vmid. > > > > This is struggling because it is trying too hard to not have the > > driver allocate the viommu, and I think we should just go ahead and do > > that. Store the vmid, today copied from the nesting parent in the vmid > > private struct. No need for iommufd_viommu_to_parent_domain(), just > > rework the APIs to pass the vmid down not a domain. > > OK. When I designed all this stuff, we still hadn't made up our minds > about sharing the s2 domain, i.e. moving the VMID, which might > need a couple more patches to achieve. Yes, many more patches, and don't try to do it now.. But we can copy the vmid from the s2 and place it in the viommu struct during allocation time. Jason
Re: [PATCH v2 17/19] iommu/arm-smmu-v3: Add arm_smmu_viommu_cache_invalidate
On Wed, Sep 11, 2024 at 06:25:16AM +, Tian, Kevin wrote: > > From: Jason Gunthorpe > > Sent: Friday, September 6, 2024 2:22 AM > > > > On Thu, Sep 05, 2024 at 11:00:49AM -0700, Nicolin Chen wrote: > > > On Thu, Sep 05, 2024 at 01:20:39PM -0300, Jason Gunthorpe wrote: > > > > On Tue, Aug 27, 2024 at 09:59:54AM -0700, Nicolin Chen wrote: > > > > > > > > > +static int arm_smmu_viommu_cache_invalidate(struct > > iommufd_viommu *viommu, > > > > > + struct iommu_user_data_array > > *array) > > > > > +{ > > > > > + struct iommu_domain *domain = > > iommufd_viommu_to_parent_domain(viommu); > > > > > + > > > > > + return __arm_smmu_cache_invalidate_user( > > > > > + to_smmu_domain(domain), viommu, array); > > > > > > > > I'd like to have the viommu struct directly hold the VMID. The nested > > > > parent should be sharable between multiple viommus, it doesn't make > > > > any sense that it would hold the vmid. > > > > > > > > This is struggling because it is trying too hard to not have the > > > > driver allocate the viommu, and I think we should just go ahead and do > > > > that. Store the vmid, today copied from the nesting parent in the vmid > > > > private struct. No need for iommufd_viommu_to_parent_domain(), just > > > > rework the APIs to pass the vmid down not a domain. > > > > > > OK. When I designed all this stuff, we still haven't made mind > > > about sharing the s2 domain, i.e. moving the VMID, which might > > > need a couple of more patches to achieve. > > > > Yes, many more patches, and don't try to do it now.. But we can copy > > the vmid from the s2 and place it in the viommu struct during > > allocation time. > > does it assume that a viommu object cannot span multiple physical > IOMMUs so there is only one vmid per viommu? Yes, the viommu is not intended to cross physical iommus, it is intended to contain objects, like invalidation queues, that are tied to single piommus only. 
If someone does want to make a vIOMMU that unifies multiple pIOMMUs, and they have the feature set that would make that possible, then they will need multiple IOMMUFD vIOMMU objects and will have to divide up the nested domains appropriately. Jason
Re: [PATCH v2 17/19] iommu/arm-smmu-v3: Add arm_smmu_viommu_cache_invalidate
On Wed, Sep 11, 2024 at 08:13:01AM +, Tian, Kevin wrote: > Probably there is a good reason e.g. for simplification or better > aligned with hw accel stuff. But it's not explained clearly so far. Probably the most concrete thing is if you have a direct assignment invalidation queue (ie DMA'd directly by HW) then it only applies to a single pIOMMU and invalidation commands placed there are unavoidably limited in scope. This creates a representation problem: if we have a vIOMMU that spans many pIOMMUs but invalidations cover only some subset, how do we model that? Just saying the vIOMMU is linked to the pIOMMU solves this nicely. Jason
Re: [PATCH v2 10/19] iommufd/viommu: Add vdev_id helpers for IOMMU drivers
On Thu, Sep 05, 2024 at 10:53:31AM -0700, Nicolin Chen wrote: > On Thu, Sep 05, 2024 at 01:14:15PM -0300, Jason Gunthorpe wrote: > > On Tue, Aug 27, 2024 at 09:59:47AM -0700, Nicolin Chen wrote: > > > Driver can call the iommufd_viommu_find_device() to find a device pointer > > > using its per-viommu virtual ID. The returned device must be protected by > > > the pair of iommufd_viommu_lock/unlock_vdev_id() function. > > > > > > Put these three functions into a new viommu_api file, to build it with the > > > IOMMUFD_DRIVER config. > > > > > > Signed-off-by: Nicolin Chen > > > --- > > > drivers/iommu/iommufd/Makefile | 2 +- > > > drivers/iommu/iommufd/viommu_api.c | 39 ++ > > > include/linux/iommufd.h| 16 > > > 3 files changed, 56 insertions(+), 1 deletion(-) > > > create mode 100644 drivers/iommu/iommufd/viommu_api.c > > > > I still think this is better to just share the struct content with the > > driver, eventually we want to do this anyhow as the driver will > > want to use container_of() techniques to reach its private data. > > In my mind, exposing everything to the driver is something that > we have to (for driver-managed structures) v.s. we want to... > Even in that case, a driver actually only needs to know the size > of the core structure, without touching what's inside(?). > > I am a bit worried that drivers would abuse the content in the > core-level structure.. Providing a set of API would encourage > them to keep the core structure intact, hopefully.. This is always a tension in the kernel. If the core apis can be nice and tidy then it is a reasonable direction. But here I think we've crossed some threshold where the APIs are complex, want to be inlined and really we just want to expose data not APIs to drivers. > > No need for this lock, xa_load is rcu safe against concurrent writer > > I see iommufd's device.c and main.c grab xa_lock before xa_load? That is not to protect the xa_load, it is to protect the lifetime of the pointer it returns Jason
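The distinction Jason draws — the lock is not there to make the load safe (xa_load already is, under RCU), but to keep the returned pointer alive until the caller takes its own reference — can be illustrated with a toy userspace lookup table. None of these are real iommufd types; the "lock" here is a plain flag standing in for the xarray lock/rwsem:

```c
#include <stddef.h>

/* Toy looked-up object with a reference count. */
struct vdev {
    int refcount;
};

struct vdev_table {
    struct vdev *slots[8];
    int locked;                 /* stand-in for xa_lock / vdev_ids_rwsem */
};

/* Lookup-and-pin pattern: the lock guarantees no writer frees the entry
 * between the load and the refcount increment. The load itself would be
 * safe without it; the pointer's lifetime would not be. */
static struct vdev *vdev_lookup_and_get(struct vdev_table *t, unsigned int idx)
{
    struct vdev *v;

    t->locked = 1;              /* writers excluded from here... */
    v = t->slots[idx];
    if (v)
        v->refcount++;          /* pin before dropping the lock */
    t->locked = 0;              /* ...to here */
    return v;
}
```

Once the caller holds its own reference, the pointer stays valid after the lock is released, which is exactly why device.c/main.c take xa_lock around xa_load.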
Re: [PATCH v2 17/19] iommu/arm-smmu-v3: Add arm_smmu_viommu_cache_invalidate
On Fri, Sep 13, 2024 at 02:33:59AM +, Tian, Kevin wrote: > > From: Jason Gunthorpe > > Sent: Thursday, September 12, 2024 7:08 AM > > > > On Wed, Sep 11, 2024 at 08:13:01AM +, Tian, Kevin wrote: > > > > > Probably there is a good reason e.g. for simplification or better > > > aligned with hw accel stuff. But it's not explained clearly so far. > > > > Probably the most concrete thing is if you have a direct assignment > > invalidation queue (ie DMA'd directly by HW) then it only applies to a > > single pIOMMU and invalidation commands placed there are unavoidably > > limited in scope. > > > > This creates a representation problem, if we have a vIOMMU that spans > > many pIOMMUs but invalidations do some subset how to do we model > > that. Just saying the vIOMMU is linked to the pIOMMU solves this > > nicely. > > > > yes that is a good reason. > > btw do we expect the VMM to try-and-fail when deciding whether a > new vIOMMU object is required when creating a new vdev? I think there was some suggestion the getinfo could return this, but also I think qemu needs to have a command line that matches physical so maybe it needs some sysfs? Jason
Re: [PATCH 1/1] remoteproc: Use iommu_paging_domain_alloc()
On Thu, Aug 22, 2024 at 01:24:25PM -0300, Jason Gunthorpe wrote: > On Thu, Aug 22, 2024 at 10:17:56AM -0600, Mathieu Poirier wrote: > > > > - domain = iommu_domain_alloc(dev->bus); > > > - if (!domain) { > > > + domain = iommu_paging_domain_alloc(dev); > > > > I'm a little hesitant here. Function iommu_domain_alloc() passes NULL to the > > second argument of __iommu_domain_alloc() while iommu_paging_domain_alloc() > > provides a 'dev'. I don't have any HW to test on and I am not familiar > > enough > > with the IOMMU subsystem to confidently move forward. > > So long as dev is the device that will be initiating DMA this is a > correct change from the iommu subsystem perspective. This is the only call site left for iommu_domain_alloc() - we want to remove this API on the next cycle. Please take the patch Thanks, Jason
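One practical gotcha in this conversion, if memory serves: iommu_domain_alloc() returned NULL on failure, while iommu_paging_domain_alloc() returns an ERR_PTR, so call sites must also switch from a NULL check to IS_ERR(). A toy model of that pattern, using userspace stand-ins for the kernel's err.h macros (the toy allocator is purely illustrative):

```c
#include <errno.h>

/* Minimal userspace re-creations of the kernel's <linux/err.h> helpers. */
#define MAX_ERRNO 4095
#define ERR_PTR(err) ((void *)(long)(err))
#define IS_ERR(ptr)  ((unsigned long)(ptr) >= (unsigned long)-MAX_ERRNO)
#define PTR_ERR(ptr) ((long)(ptr))

struct toy_domain { int dummy; };

/* Stand-in for the new API: fails with an ERR_PTR, never NULL. */
static void *toy_paging_domain_alloc(void *dev)
{
    static struct toy_domain d;
    return dev ? (void *)&d : ERR_PTR(-ENODEV);
}

static long attach_domain(void *dev)
{
    void *domain = toy_paging_domain_alloc(dev);

    if (IS_ERR(domain))         /* not: if (!domain) -- the old-API check */
        return PTR_ERR(domain);
    return 0;
}
```

A converted caller that kept the old `if (!domain)` check would sail past a failed allocation and dereference an error pointer, so the check change matters as much as the argument change.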
Re: [PATCH v2 02/19] iommufd/viommu: Add IOMMUFD_OBJ_VIOMMU and IOMMU_VIOMMU_ALLOC ioctl
On Wed, Sep 11, 2024 at 08:39:57PM -0700, Nicolin Chen wrote: > You mentioned that actually only the iommufd selftest might hit such > a corner case, so perhaps we should do something in the selftest code > v.s. the iommu core. What do you think? Maybe, if there were viommu allocation callbacks maybe those can pin the memory in the selftest.. Jason
Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio
On Tue, Apr 20, 2021 at 08:50:12PM +0800, liulongfang wrote: > On 2021/4/19 20:33, Jason Gunthorpe wrote: > > On Mon, Apr 19, 2021 at 08:24:40PM +0800, liulongfang wrote: > > > >>> I'm also confused how this works securely at all, as a general rule a > >>> VFIO PCI driver cannot access the MMIO memory of the function it is > >>> planning to assign to the guest. There is a lot of danger that the > >>> guest could access that MMIO space one way or another. > >> > >> VF's MMIO memory is divided into two parts, one is the guest part, > >> and the other is the live migration part. They do not affect each other, > >> so there is no security problem. > > > > AFAIK there are several scenarios where a guest can access this MMIO > > memory using DMA even if it is not mapped into the guest for CPU > > access. > > > The hardware divides VF's MMIO memory into two parts. The live migration > driver in the host uses the live migration part, and the device driver in > the guest uses the guest part. They obtain the address of VF's MMIO memory > in their respective drivers; although these two parts of memory are > contiguous on the hardware device, due to the needs of the driver functions > they will not perform operations on the other part of the memory, and the > device hardware also independently responds to the operation commands of > the two parts. It doesn't matter, the memory is still under the same PCI BDF and VFIO supports scenarios where devices in the same IOMMU group are not isolated from each other. This is why the granule of isolation is a PCI BDF - VFIO directly blocks kernel drivers from attaching to PCI BDFs that are not completely isolated from the VFIO BDF. Bypassing this prevention and attaching a kernel driver directly to the same BDF being exposed to the guest breaks that isolation model. > So, I still don't understand what the security risk you are talking about is, > and what do you think the security design should look like? > Can you elaborate on it? 
Each security domain must have its own PCI BDF. The migration control registers must be on a different VF from the VF being plugged into a guest and the two VFs have to be in different IOMMU groups to ensure they are isolated from each other. Jason
Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio
On Tue, Apr 20, 2021 at 09:28:46PM +0800, liulongfang wrote: > >> So, I still don't understand what the security risk you are talking about > >> is, > >> and what do you think the security design should look like? > >> Can you elaborate on it? > > > > Each security domain must have its own PCI BDF. > > > The basic unit to perform the live migration function is the VF, and the basic > function of the VF is the business function of the device. If the live > migration > function and the business function are completely separated, and they are > bound > to their respective VFs, it will result in: a VF that can execute business > functions in the guest cannot perform live migration, and a VF with a live > migration function cannot perform business functions in the guest. > > In fact, the scenario that requires live migration is to migrate its business > functions to the VFs of other hosts when the VF's business functions are > heavily > loaded. > This usage scenario itself requires that the VF needs to have these two > functions > at the same time. Other vendors are managing it; this is not an unsolvable problem. I think these patches can't advance in any form without somehow addressing the isolation issue. Jason
Re: [PATCH] infiniband: ulp: Remove unnecessary struct declaration
On Thu, Apr 15, 2021 at 05:21:16PM +0800, Wan Jiabing wrote: > struct ipoib_cm_tx is defined at line 245, > and the definition is independent of the macro. > The declaration here is unnecessary. Remove it. > > Signed-off-by: Wan Jiabing > Reviewed-by: Christoph Lameter > --- > drivers/infiniband/ulp/ipoib/ipoib.h | 2 -- > 1 file changed, 2 deletions(-) Applied to for-next, thanks Jason
Re: [PATCH] RDMA/mlx4: remove an unused variable
On Tue, Apr 20, 2021 at 05:28:33PM +0200, Christophe JAILLET wrote: > 'in6' is unused. It is just declared and filled-in. > It can be removed. > > This is a left over from commit 5ea8bbfc4929 > ("mlx4: Implement IP based gids support for RoCE/SRIOV") > > Signed-off-by: Christophe JAILLET > --- > drivers/infiniband/hw/mlx4/qp.c | 3 --- > 1 file changed, 3 deletions(-) Applied to for-next, thanks Jason