Re: [PATCH] IB/ocrdma_hw: Remove redundant checks and goto labels

2018-04-03 Thread Jason Gunthorpe
On Fri, Mar 30, 2018 at 10:38:58AM -0500, Gustavo A. R. Silva wrote:
> Check on return values and goto label mbx_err are unnecessary.
> 
> Addresses-Coverity-ID: 1271151 ("Identical code for different branches")
> Addresses-Coverity-ID: 1268788 ("Identical code for different branches")
> Signed-off-by: Gustavo A. R. Silva 
> ---
>  drivers/infiniband/hw/ocrdma/ocrdma_hw.c | 14 +++++---------
>  1 file changed, 5 insertions(+), 9 deletions(-)

Applied to for-next, thanks

Jason


Re: [PATCH v5 3/3] RDMA/qedr: eliminate duplicate barriers on weakly-ordered archs #2

2018-04-03 Thread Jason Gunthorpe
On Tue, Apr 03, 2018 at 07:42:28AM +, Kalderon, Michal wrote:
> > From: Sinan Kaya [mailto:ok...@codeaurora.org]
> > Sent: Tuesday, April 03, 2018 5:30 AM
> > To: linux-r...@vger.kernel.org; ti...@codeaurora.org;
> > sulr...@codeaurora.org
> > Cc: linux-arm-...@vger.kernel.org; linux-arm-ker...@lists.infradead.org;
> > Kalderon, Michal ; Elior, Ariel
> > ; Doug Ledford ; Jason
> > Gunthorpe ; linux-kernel@vger.kernel.org
> > Subject: Re: [PATCH v5 3/3] RDMA/qedr: eliminate duplicate barriers on
> > weakly-ordered archs #2
> > 
> > On 3/22/2018 12:26 PM, Sinan Kaya wrote:
> > > @@ -860,7 +860,7 @@ static void doorbell_cq(struct qedr_cq *cq, u32
> > cons, u8 flags)
> > >   wmb();
> > >   cq->db.data.agg_flags = flags;
> > >   cq->db.data.value = cpu_to_le32(cons);
> > > - writeq(cq->db.raw, cq->db_addr);
> > > + writeq_relaxed(cq->db.raw, cq->db_addr);
> > 
> > Given the direction to get rid of wmb() in front of writeX() functions,
> > I have been reviewing this code. Under normal circumstances, I can get
> > rid of all wmb() as follows.
> > 
> > However, I started having my doubts now. Are these wmb() used as a SMP
> > barrier too?
> > I can't find any smp_Xmb() in drivers/infiniband/hw/qedr directory.
> 
> Your doubts are well founded. Your initial patch series modified writel
> to writel_relaxed. Simply removing the wmb is dangerous. The wmb before
> writel is used to make sure the HW observes the changes in memory
> before we trigger the doorbell. SMP barriers here wouldn't suffice:
> even on a single processor, we still need to make sure memory is
> updated and has not remained in cache when the HW accesses it.
> Reviewing the qedr barriers, I can find places where this may not have
> been necessary, but you definitely can't simply remove these wmb
> barriers.

As Sinan said, the consensus is that wmb();writel(); is redundant if
the only purpose of the wmb is to order DMA and system memory.

So can you review these patches on that basis please? Is the WMB doing
something else, eg SMP related? If yes, please send a patch adding
appropriate comments.
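
To make the distinction concrete, a minimal sketch of the two orderings
in question (illustrative only, not the actual qedr code):

#include <linux/io.h>

/* writeq() itself orders prior stores to coherent (DMA-visible)
 * memory before the MMIO write, so a preceding wmb() is redundant:
 */
static void ring_db(u64 raw, void __iomem *db_addr)
{
	writeq(raw, db_addr);		/* barrier implied */
}

/* writeq_relaxed() gives no such guarantee, so the wmb() must stay
 * to ensure the HW observes the memory updates before the doorbell:
 */
static void ring_db_relaxed(u64 raw, void __iomem *db_addr)
{
	wmb();				/* order DMA memory vs MMIO */
	writeq_relaxed(raw, db_addr);
}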

Thanks,
Jason


Re: [PATCH] RDMA/hns: ensure for-loop actually iterates and free's buffers

2018-03-27 Thread Jason Gunthorpe
On Mon, Mar 26, 2018 at 04:10:18PM +0100, Colin Ian King wrote:
> From: Colin Ian King 
> 
> The current for-loop zeros variable i and only loops once, hence
> not all the buffers are free'd.  Fix this by setting i correctly.
> 
> Detected by CoverityScan, CID#1463415 ("Operands don't affect result")
> 
> Fixes: a5073d6054f7 ("RDMA/hns: Add eq support of hip08")
> Signed-off-by: Colin Ian King 
> Reviewed-by: Yixian Liu 
> ---
>  drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Applied to for-rc, thanks

Jason


[GIT PULL] Please pull RDMA subsystem changes

2018-03-29 Thread Jason Gunthorpe
Hi Linus,

Fifth pull request for -rc.

It has been fairly silent lately on our -rc front. Big queue of patches on the
mailing list going to for-next though.

The following changes since commit e8980d67d6017c8eee8f9c35f782c4bd68e004c9:

  RDMA/ucma: Ensure that CM_ID exists prior to access it (2018-03-20 11:07:21 -0600)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git tags/for-linus

for you to fetch changes up to 84652aefb347297aa08e91e283adf7b18f77c2d5:

  RDMA/ucma: Introduce safer rdma_addr_size() variants (2018-03-28 16:13:36 -0600)


Fifth pull request for 4.16-rc

Bug fixes:
- qedr driver bugfixes causing application hangs, wrong uapi errnos, and a
  race condition
- 3 syzkaller found bugfixes in the ucma uapi

Regression fixes for things introduced in 4.16:
- Crash on error introduced in mlx5 UMR flow
- Crash on module unload/etc introduced by bad interaction of restrack
  and mlx5 patches this cycle
- Typo in a two line syzkaller bugfix causing a bad regression
- Coverity report of nonsense code in hns driver


Chien Tin Tung (1):
  RDMA/ucma: Correct option size check using optlen

Colin Ian King (1):
  RDMA/hns: ensure for-loop actually iterates and free's buffers

Jason Gunthorpe (1):
  RDMA/rdma_cm: Fix use after free race with process_one_req

Kalderon, Michal (3):
  RDMA/qedr: fix QP's ack timeout configuration
  RDMA/qedr: Fix rc initialization on CNQ allocation failure
  RDMA/qedr: Fix QP state initialization race

Leon Romanovsky (3):
  RDMA/restrack: Move restrack_clean to be symmetrical to restrack_init
  RDMA/ucma: Check that device is connected prior to access it
  RDMA/ucma: Check that device exists prior to accessing it

Mark Bloch (1):
  IB/mlx5: Don't clean uninitialized UMR resources

Roland Dreier (1):
  RDMA/ucma: Introduce safer rdma_addr_size() variants

 drivers/infiniband/core/addr.c | 25 
 drivers/infiniband/core/device.c   |  3 +-
 drivers/infiniband/core/ucma.c | 47 +-
 drivers/infiniband/hw/hns/hns_roce_hw_v2.c |  2 +-
 drivers/infiniband/hw/mlx5/main.c  | 12 ++--
 drivers/infiniband/hw/mlx5/mr.c|  3 ++
 drivers/infiniband/hw/qedr/main.c  |  3 +-
 drivers/infiniband/hw/qedr/verbs.c | 45 +++-
 include/rdma/ib_addr.h |  2 ++
 9 files changed, 102 insertions(+), 40 deletions(-)


Re: [PATCH] qedr: ix spelling mistake: "hanlde" -> "handle"

2018-03-29 Thread Jason Gunthorpe
On Thu, Mar 29, 2018 at 01:11:07PM +0100, Colin King wrote:
> From: Colin Ian King 
> 
> Trivial fix to spelling mistake in DP_ERR message text
> 
> Signed-off-by: Colin Ian King 
> Reviewed-by: Shamir Rabinovitch 
> ---
>  drivers/infiniband/hw/qedr/main.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

applied to for-next, thanks.

I also fixed the spelling error in the subject :)

Jason


Re: [PATCH] IB/core: Make ib_mad_client_id atomic

2018-04-30 Thread Jason Gunthorpe
On Mon, Apr 23, 2018 at 10:16:18PM +0300, jackm wrote:

> > > TIDs need to be globally unique on the entire machine.  

> Jason, that is not exactly correct.

The expectation for /dev/umad users is that they all receive locally
unique TID prefixes. The kernel may be OK to keep things port-specific
but it is slightly breaking the API we are presenting to userspace to
allow them to alias..

Jason


Re: [PATCH] IB/core: Make ib_mad_client_id atomic

2018-04-30 Thread Jason Gunthorpe
On Mon, Apr 30, 2018 at 01:10:49PM -0400, Doug Ledford wrote:
> On Mon, 2018-04-30 at 08:49 -0600, Jason Gunthorpe wrote:
> > On Mon, Apr 23, 2018 at 10:16:18PM +0300, jackm wrote:
> > 
> > > > > TIDs need to be globally unique on the entire machine.  
> > > Jason, that is not exactly correct.
> > 
> > The expectation for /dev/umad users is that they all receive locally
> > unique TID prefixes. The kernel may be OK to keep things port-specific
> > but it is slightly breaking the API we are presenting to userspace to
> > allow them to alias..
> > 
> > Jason
> 
> Would people be happier with this commit message then:
> 
> IB/core: Make ib_mad_client_id atomic
>   
> Currently, the kernel protects access to the agent ID allocator on a per
> port basis using a spinlock, so it is impossible for two apps/threads on
> the same port to get the same TID, but it is entirely possible for two
> threads on different ports to end up with the same TID.  
> 
> As this can be confusing (regardless of it being legal according to the
> IB Spec 1.3, C13-18.1.1, in section 13.4.6.4 - TransactionID usage),
> and as the rdma-core user space API for /dev/umad devices implies unique
> TIDs even across ports, make the TID an atomic type so that no two
> allocations, regardless of port number, will be the same.
> 
> Signed-off-by: Håkon Bugge 
> Reviewed-by: Jack Morgenstein 
> Reviewed-by: Ira Weiny 
> Reviewed-by: Zhu Yanjun 
> Signed-off-by: Doug Ledford 

fine for me

Jason


Re: [PATCH 3/5] ib_srpt: depend on INFINIBAND_ADDR_TRANS

2018-04-30 Thread Jason Gunthorpe
On Wed, Apr 25, 2018 at 03:33:39PM -0700, Greg Thelen wrote:
> INFINIBAND_SRPT code depends on INFINIBAND_ADDR_TRANS provided symbols.
> So declare the kconfig dependency.  This is necessary to allow for
> enabling INFINIBAND without INFINIBAND_ADDR_TRANS.
> 
> Signed-off-by: Greg Thelen 
> Cc: Tarick Bedeir 
>  drivers/infiniband/ulp/srpt/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/infiniband/ulp/srpt/Kconfig b/drivers/infiniband/ulp/srpt/Kconfig
> index 31ee83d528d9..fb8b7182f05e 100644
> +++ b/drivers/infiniband/ulp/srpt/Kconfig
> @@ -1,6 +1,6 @@
>  config INFINIBAND_SRPT
>   tristate "InfiniBand SCSI RDMA Protocol target support"
> - depends on INFINIBAND && TARGET_CORE
> + depends on INFINIBAND && INFINIBAND_ADDR_TRANS && TARGET_CORE

Isn't INFINIBAND && INFINIBAND_ADDR_TRANS a bit redundant? Can't have
INFINIBAND_ADDR_TRANS without INFINIBAND.

Jason


Re: [PATCH 3/5] ib_srpt: depend on INFINIBAND_ADDR_TRANS

2018-05-01 Thread Jason Gunthorpe
On Tue, May 01, 2018 at 03:08:57AM +, Greg Thelen wrote:
> On Mon, Apr 30, 2018 at 4:35 PM Jason Gunthorpe  wrote:
> 
> > On Wed, Apr 25, 2018 at 03:33:39PM -0700, Greg Thelen wrote:
> > > INFINIBAND_SRPT code depends on INFINIBAND_ADDR_TRANS provided symbols.
> > > So declare the kconfig dependency.  This is necessary to allow for
> > > enabling INFINIBAND without INFINIBAND_ADDR_TRANS.
> > >
> > > Signed-off-by: Greg Thelen 
> > > Cc: Tarick Bedeir 
> > >  drivers/infiniband/ulp/srpt/Kconfig | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/infiniband/ulp/srpt/Kconfig b/drivers/infiniband/ulp/srpt/Kconfig
> > > index 31ee83d528d9..fb8b7182f05e 100644
> > > +++ b/drivers/infiniband/ulp/srpt/Kconfig
> > > @@ -1,6 +1,6 @@
> > >  config INFINIBAND_SRPT
> > >   tristate "InfiniBand SCSI RDMA Protocol target support"
> > > - depends on INFINIBAND && TARGET_CORE
> > > + depends on INFINIBAND && INFINIBAND_ADDR_TRANS && TARGET_CORE
> 
> > Isn't INFINIBAND && INFINIBAND_ADDR_TRANS a bit redundant? Can't have
> > INFINIBAND_ADDR_TRANS without INFINIBAND.
> 
> By kconfig INFINIBAND_ADDR_TRANS depends on INFINIBAND.  So yes, it seems
> redundant.  I don't know if anyone has designs to break this dependency and
> allow for ADDR_TRANS without INFINIBAND.  Assuming not, I'd be willing to
> amend my series removing redundant INFINIBAND and a followup series to
> remove it from similar depends.  Though I'm not familiar with rdma dev tree
> lifecycle.  Is rdma/for-rc a throw away branch (akin to linux-next), or
> will it be merged into linus/master?  If throwaway, then we can amend
> its patches, otherwise followups will be needed.
> 
> Let me know what you'd prefer.  Thanks.

I think a single update patch to fix all of these, and the
pre-existing ones would be great

Jason


Re: linux-next: manual merge of the rdma tree with the rdma-fixes tree

2018-05-02 Thread Jason Gunthorpe
On Wed, May 02, 2018 at 10:00:52AM -0400, Doug Ledford wrote:
> On Wed, 2018-05-02 at 13:22 +0300, Leon Romanovsky wrote:
> > On Mon, Apr 30, 2018 at 08:55:35PM -0400, Doug Ledford wrote:
> > > On Tue, 2018-05-01 at 10:10 +1000, Stephen Rothwell wrote:
> > > > Hi all,
> > > > 
> > > > Today's linux-next merge of the rdma tree got a conflict in:
> > > > 
> > > >   drivers/infiniband/sw/rxe/rxe_resp.c
> > > > 
> > > > between commit:
> > > > 
> > > >   9fd4350ba895 ("IB/rxe: avoid double kfree_skb")
> > > > 
> > > > from the rdma-fixes tree and commit:
> > > > 
> > > >   2e47350789eb ("IB/rxe: optimize the function duplicate_request")
> > > > 
> > > > from the rdma tree.
> > > > 
> > > > I fixed it up (I think - see below) and can carry the fix as necessary.
> > > > This is now fixed as far as linux-next is concerned, but any non trivial
> > > > conflicts should be mentioned to your upstream maintainer when your tree
> > > > is submitted for merging.  You may also want to consider cooperating
> > > > with the maintainer of the conflicting tree to minimise any particularly
> > > > complex conflicts.
> > > > 
> > > 
> > > We will probably merge the for-rc branch into the for-next branch in the
> > > next few days, at which point we will do the conflict resolution
> > > ourselves and your need to carry anything should drop out.
> > 
> > Isn't "rdma/wip/for-testing" branch intended for this?
> 
> Not really.  It's there to provide a pre-merged branch for people to
> test.  But, I've rarely seen a release cycle where, *sometime*, we
> didn't get a patch set in the for-next that depends on changes in the
> for-rc area, and in that case, you need to merge for-rc into for-next. 
> If we don't have that this cycle, then you're right, I won't merge for-
> rc into for-next and for-testing will be the throwaway merge branch.  On
> occasion, if the merge fixups needed between for-rc and for-next get too
> difficult for a non-RDMA person to suss out, then we will do a merge of
> for-rc into for-next simply so we can provide the right merge fixup, but
> I doubt this merge fixup rises to that level.

What I've been doing is storing the resolutions in for-testing and
then when the PR is made I create two branches
  merge for-testing, for-next, linus/master
  merge for-next, linus/master

Then I directly diff them to ensure the merge resolutions are all
matching properly.

ditto when merging for-rc and linus/master

Basically for-testing becomes a place to store the merge resolutions
that we can create when the conflict comes up and people still
remember what the right resolution is...

Jason


Re: linux-next: Signed-off-by missing for commits in the rdma tree

2018-05-28 Thread Jason Gunthorpe
On Sat, May 26, 2018 at 01:31:12PM +1000, Stephen Rothwell wrote:
> Hi all,
> 
> Commits
> 
>   116aeb887371 ("iw_cxgb4: provide detailed provider-specific CM_ID information")
> to
>   0252f73334f9 ("IB/qib: Fix DMA api warning with debug kernel")
> 
> are missing a Signed-off-by from their committer.  I presume they have
> been rebased?

Yes, there was a compilation problem and I rebased some of Doug's
commits before finalizing them.

Jason


Re: [PATCH] IB: Revert "remove redundant INFINIBAND kconfig dependencies"

2018-05-28 Thread Jason Gunthorpe
On Fri, May 25, 2018 at 05:32:52PM -0700, Greg Thelen wrote:
> On Fri, May 25, 2018 at 2:32 PM Arnd Bergmann  wrote:
> 
> > Several subsystems depend on INFINIBAND_ADDR_TRANS, which in turn depends
> > on INFINIBAND. However, with CONFIG_INFINIBAND=m, this leads to a
> > link error when another driver using it is built-in. The
> > INFINIBAND_ADDR_TRANS dependency is insufficient here as this is
> > a 'bool' symbol that does not force anything to be a module in turn.
> 
> > fs/cifs/smbdirect.o: In function `smbd_disconnect_rdma_work':
> > smbdirect.c:(.text+0x1e4): undefined reference to `rdma_disconnect'
> > net/9p/trans_rdma.o: In function `rdma_request':
> > trans_rdma.c:(.text+0x7bc): undefined reference to `rdma_disconnect'
> > net/9p/trans_rdma.o: In function `rdma_destroy_trans':
> > trans_rdma.c:(.text+0x830): undefined reference to `ib_destroy_qp'
> > trans_rdma.c:(.text+0x858): undefined reference to `ib_dealloc_pd'
> 
> > Fixes: 9533b292a7ac ("IB: remove redundant INFINIBAND kconfig
> dependencies")
> > Signed-off-by: Arnd Bergmann 
> 
> Acked-by: Greg Thelen 
> 
> Sorry for the 9533b292a7ac problem.
> At this point the in release cycle, I think Arnd's revert is best.
> 
> If there is interest, I've put a little thought into an alternative fix:
> making INFINIBAND_ADDR_TRANS tristate.  But it's nontrivial.
> So I prefer this simple revert for now.

Is that a normal thing to do?

> Doug: do you need anything from me on this?

I can take the revert..

Jason


Re: [PATCH] IB: Revert "remove redundant INFINIBAND kconfig dependencies"

2018-05-28 Thread Jason Gunthorpe
On Fri, May 25, 2018 at 11:29:59PM +0200, Arnd Bergmann wrote:
> Several subsystems depend on INFINIBAND_ADDR_TRANS, which in turn depends
> on INFINIBAND. However, with CONFIG_INFINIBAND=m, this leads to a
> link error when another driver using it is built-in. The
> INFINIBAND_ADDR_TRANS dependency is insufficient here as this is
> a 'bool' symbol that does not force anything to be a module in turn.
> 
> fs/cifs/smbdirect.o: In function `smbd_disconnect_rdma_work':
> smbdirect.c:(.text+0x1e4): undefined reference to `rdma_disconnect'
> net/9p/trans_rdma.o: In function `rdma_request':
> trans_rdma.c:(.text+0x7bc): undefined reference to `rdma_disconnect'
> net/9p/trans_rdma.o: In function `rdma_destroy_trans':
> trans_rdma.c:(.text+0x830): undefined reference to `ib_destroy_qp'
> trans_rdma.c:(.text+0x858): undefined reference to `ib_dealloc_pd'
> 
> Fixes: 9533b292a7ac ("IB: remove redundant INFINIBAND kconfig dependencies")
> Signed-off-by: Arnd Bergmann 
> Acked-by: Greg Thelen 
> ---
> The patch that introduced the problem has been queued in the
> rdma-fixes/for-rc tree. Please revert the patch before sending
> the branch to Linus.
> ---
>  drivers/infiniband/ulp/srpt/Kconfig | 2 +-
>  drivers/nvme/host/Kconfig   | 2 +-
>  drivers/nvme/target/Kconfig | 2 +-
>  drivers/staging/lustre/lnet/Kconfig | 2 +-
>  fs/cifs/Kconfig | 2 +-
>  net/9p/Kconfig  | 2 +-
>  net/rds/Kconfig | 2 +-
>  net/sunrpc/Kconfig  | 2 +-
>  8 files changed, 8 insertions(+), 8 deletions(-)

Applied to for-rc, thanks

Jason


Re: [PATCH V2 rdma-next 3/4] RDMA/hns: Add reset process for RoCE in hip08

2018-05-28 Thread Jason Gunthorpe
On Sat, May 26, 2018 at 09:47:43AM +0800, Wei Hu (Xavier) wrote:
> 
> 
> On 2018/5/25 22:55, Jason Gunthorpe wrote:
> > On Fri, May 25, 2018 at 01:54:31PM +0800, Wei Hu (Xavier) wrote:
> >>
> >> On 2018/5/25 5:31, Jason Gunthorpe wrote:
> >>>>  static const struct hnae3_client_ops hns_roce_hw_v2_ops = {
> >>>>  .init_instance = hns_roce_hw_v2_init_instance,
> >>>>  .uninit_instance = hns_roce_hw_v2_uninit_instance,
> >>>> +.reset_notify = hns_roce_hw_v2_reset_notify,
> >>>>  };
> >>>>  
> >>>>  static struct hnae3_client hns_roce_hw_v2_client = {
> >>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_main.c b/drivers/infiniband/hw/hns/hns_roce_main.c
> >>>> index 1b79a38..ac51372 100644
> >>>> +++ b/drivers/infiniband/hw/hns/hns_roce_main.c
> >>>> @@ -332,6 +332,9 @@ static struct ib_ucontext *hns_roce_alloc_ucontext(struct ib_device *ib_dev,
> >>>>  struct hns_roce_ib_alloc_ucontext_resp resp = {};
> >>>>  struct hns_roce_dev *hr_dev = to_hr_dev(ib_dev);
> >>>>  
> >>>> +if (!hr_dev->active)
> >>>> +return ERR_PTR(-EAGAIN);
> >>> This still doesn't make sense, ib_unregister_device already makes sure
> >>> that hns_roce_alloc_ucontext isn't running and won't be called before
> >>> returning, don't need another flag to do that.
> >>>
> >>> Since this is the only place the active flag is tested it can just be 
> >>> deleted
> >>> entirely.
> >> Hi, Jason
> >>
> >> roce reset process:
> >> 1. hr_dev->active = false;   // make sure no process can call
> >>    ibv_open_device.
> >> 2. call ib_dispatch_event() to report the IB_EVENT_DEVICE_FATAL
> >> event.
> >> 3. msleep(100);   // give apps time to free resources
> >> 4. call ib_unregister_device().
> >> 5. ...
> >> 6. ...
> >>
> >> There are 2 steps as above before calling ib_unregister_device(); we
> >> set hr_dev->active to false to ensure that no process can call
> >> ibv_open_device afterwards.
> > If you think this is the right flow then it is core issue to block new
> > opens, not an individual driver issue, send a core patch - eg add a
> > 'ib_driver_fatal_error()' call that does the dispatch and call it from
> > all the drivers using this IB_EVENT_DEVICE_FATAL
> Hi, Jason
> 
> There seems to be no difference between calling ib_driver_fatal_error
> and calling ib_dispatch_event directly in the vendor driver.
> 
> void ib_driver_fatal_error(struct ib_device *ib_dev, u8 port_num)
> {
> 	struct ib_event event;
> 
> 	event.event = IB_EVENT_DEVICE_FATAL;
> 	event.device = ib_dev;
> 	event.element.port_num = port_num;
> 	ib_dispatch_event(&event);
> }

My point was the core code should block calling
hns_roce_alloc_ucontext after DEVICE_FATAL if we agree that is
correct, the driver shouldn't be doing it.

Jason


Re: [PATCH V3 rdma-next 4/4] RDMA/hns: Implement the disassociate_ucontext API

2018-05-28 Thread Jason Gunthorpe
On Sat, May 26, 2018 at 04:41:47PM +0800, Wei Hu (Xavier) wrote:
> This patch implemented the IB core disassociate_ucontext API.
> 
> Signed-off-by: Wei Hu (Xavier) 
> 
> v2->v3: Addressed the comments from Jason. The related link:
>   https://lkml.org/lkml/2018/5/22/967
> v1->v2: no change.
>  drivers/infiniband/hw/hns/hns_roce_device.h |  8 
>  drivers/infiniband/hw/hns/hns_roce_main.c   | 70 
> -
>  2 files changed, 77 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/infiniband/hw/hns/hns_roce_device.h b/drivers/infiniband/hw/hns/hns_roce_device.h
> index da8512b..31221d5 100644
> +++ b/drivers/infiniband/hw/hns/hns_roce_device.h
> @@ -217,11 +217,19 @@ struct hns_roce_uar {
>   unsigned long   logic_idx;
>  };
>  
> +struct hns_roce_vma_data {
> + struct list_head list;
> + struct vm_area_struct *vma;
> + struct mutex *vma_list_mutex;
> +};

This stuff is basically shared in all the drivers too - would it make
sense to maintain this information and do the zap_pte in the core code
as well?

There is no way to implement dis-associate without also doing the
zaps..
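
For illustration, a minimal sketch of the zap step such a shared core
implementation would need (assuming the vma list layout quoted above,
with locking against the mm left to the caller):

#include <linux/mm.h>
#include <linux/slab.h>

static void zap_tracked_vmas(struct list_head *vma_list)
{
	struct hns_roce_vma_data *vma_data, *tmp;

	list_for_each_entry_safe(vma_data, tmp, vma_list, list) {
		struct vm_area_struct *vma = vma_data->vma;

		/* Tear down the PTEs so userspace takes faults instead
		 * of touching dead hardware. */
		zap_vma_ptes(vma, vma->vm_start,
			     vma->vm_end - vma->vm_start);
		list_del(&vma_data->list);
		kfree(vma_data);
	}
}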

Jason


Re: [PATCH] IB/mad: Use ID allocator routines to allocate agent number

2018-05-29 Thread Jason Gunthorpe
On Tue, May 29, 2018 at 09:38:08AM +0200, Hans Westgaard Ry wrote:
> The agent TID is a 64 bit value split in two dwords.  The least
> significant dword is the TID running counter. The most significant
> dword is the agent number. In the CX-3 shared port model, the mlx4
> driver uses the most significant byte of the agent number to store the
> slave number, making agent numbers greater than or equal to 2^24 (3 bytes)
> unusable.

There is no reason for this to be an ida, just do something like

 mad_agent_priv->agent.hi_tid = atomic_inc_return(&ib_mad_client_id) &
				mad_agent_priv->ib_dev->tid_mask;

And have the driver set tid_mask to 3 bytes of 0xFF

And no sysctl.
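
Spelled out, the suggestion looks roughly like this (tid_mask is a
proposed field on struct ib_device here, not an existing one):

#include <rdma/ib_verbs.h>

static atomic_t ib_mad_client_id = ATOMIC_INIT(0);

/* mlx4 in shared-port mode would set the proposed tid_mask to
 * 0x00ffffff (3 bytes of 0xFF), all other drivers to 0xffffffff. */
static u32 alloc_hi_tid(struct ib_device *device)
{
	return atomic_inc_return(&ib_mad_client_id) & device->tid_mask;
}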

Jason


Re: [PATCH] IB/mad: Use ID allocator routines to allocate agent number

2018-05-29 Thread Jason Gunthorpe
On Tue, May 29, 2018 at 06:16:14PM +0200, Håkon Bugge wrote:
>
> > On 29 May 2018, at 17:49, Jason Gunthorpe  wrote:
> > 
> > On Tue, May 29, 2018 at 09:38:08AM +0200, Hans Westgaard Ry wrote:
> >> The agent TID is a 64 bit value split in two dwords.  The least
> >> significant dword is the TID running counter. The most significant
> >> dword is the agent number. In the CX-3 shared port model, the mlx4
> >> driver uses the most significant byte of the agent number to store the
> >> slave number, making agent numbers greater than or equal to 2^24 (3 bytes)
> >> unusable.
> > 
> > There is no reason for this to be an ida, just do something like
> > 
> > mad_agent_priv->agent.hi_tid = atomic_inc_return(&ib_mad_client_id) & 
> > mad_agent_priv->ib_dev->tid_mask;
> > 
> > And have the driver set tid_mask to 3 bytes of 0xFF
> 
> The issue is that some of the mad agents are long-lived, so you will
> wrap and use the same TID twice.

We already have that problem, and using ida is problematic because we
need to maximize the time between TID re-use, which ida isn't doing.

Preventing re-use seems like a separate issue from limiting the range
to be compatible with mlx4.

Jason


Re: [PATCH] iw_cxgb4: add INFINIBAND_ADDR_TRANS dependency

2018-06-04 Thread Jason Gunthorpe
On Wed, May 30, 2018 at 11:58:18PM +0200, Arnd Bergmann wrote:
> The newly added fill_res_ep_entry function fails to link if
> CONFIG_INFINIBAND_ADDR_TRANS is not set:
> 
> drivers/infiniband/hw/cxgb4/restrack.o: In function `fill_res_ep_entry':
> restrack.c:(.text+0x3cc): undefined reference to `rdma_res_to_id'
> restrack.c:(.text+0x3d0): undefined reference to `rdma_iw_cm_id'
> 
> This adds a Kconfig dependency for the driver.
> 
> Fixes: 116aeb887371 ("iw_cxgb4: provide detailed provider-specific CM_ID information")
> Signed-off-by: Arnd Bergmann 
> Acked-by: Greg Thelen 
> ---
>  drivers/infiniband/hw/cxgb4/Kconfig | 1 +
>  1 file changed, 1 insertion(+)

Applied to for-next

Thanks,
Jason


Re: [PATCH] infiniband: fix a possible use-after-free bug

2018-06-04 Thread Jason Gunthorpe
On Fri, Jun 01, 2018 at 11:31:44AM -0700, Cong Wang wrote:
> ucma_process_join() will free the new allocated "mc" struct,
> if there is any error after that, especially the copy_to_user().
> 
> But in parallel, ucma_leave_multicast() could find this "mc"
> through idr_find() before ucma_process_join() frees it, since it
> is already published.
> 
> So "mc" could be used in ucma_leave_multicast() after it is been
> allocated and freed in ucma_process_join(), since we don't refcnt
> it.
> 
> Fix this by separating "publish" from ID allocation, so that we
> can get an ID first and publish it later after copy_to_user().
> 
> Fixes: c8f6a362bf3e ("RDMA/cma: Add multicast communication support")
> Reported-by: Noam Rathaus 
> Cc: Sean Hefty 
> Cc: Doug Ledford 
> Cc: Jason Gunthorpe 
> Cc: linux-r...@vger.kernel.org
> Signed-off-by: Cong Wang 
> ---
>  drivers/infiniband/core/ucma.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)

Looks good to me, and we already fixed the same sort of bug in the
non-multicast IDs.. Applied to for-next

Jason


Re: [PATCH] infiniband: fix a possible use-after-free bug

2018-06-04 Thread Jason Gunthorpe
On Mon, Jun 04, 2018 at 06:23:20PM +0200, Gi-Oh Kim wrote:
> On Fri, Jun 1, 2018 at 8:31 PM, Cong Wang  wrote:
> > ucma_process_join() will free the new allocated "mc" struct,
> > if there is any error after that, especially the copy_to_user().
> >
> > But in parallel, ucma_leave_multicast() could find this "mc"
> > through idr_find() before ucma_process_join() frees it, since it
> > is already published.
> >
> > So "mc" could be used in ucma_leave_multicast() after it is been
> > allocated and freed in ucma_process_join(), since we don't refcnt
> > it.
> >
> > Fix this by separating "publish" from ID allocation, so that we
> > can get an ID first and publish it later after copy_to_user().
> >
> > Fixes: c8f6a362bf3e ("RDMA/cma: Add multicast communication support")
> > Reported-by: Noam Rathaus 
> > Cc: Sean Hefty 
> > Cc: Doug Ledford 
> > Cc: Jason Gunthorpe 
> > Cc: linux-r...@vger.kernel.org
> > Signed-off-by: Cong Wang 
> >  drivers/infiniband/core/ucma.c | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
> > index eab43b17e9cf..ec8fb289621f 100644
> > +++ b/drivers/infiniband/core/ucma.c
> > @@ -235,7 +235,7 @@ static struct ucma_multicast* ucma_alloc_multicast(struct ucma_context *ctx)
> > return NULL;
> >
> > mutex_lock(&mut);
> > -   mc->id = idr_alloc(&multicast_idr, mc, 0, 0, GFP_KERNEL);
> > +   mc->id = idr_alloc(&multicast_idr, NULL, 0, 0, GFP_KERNEL);
> > mutex_unlock(&mut);
> > if (mc->id < 0)
> > goto error;
> > @@ -1421,6 +1421,10 @@ static ssize_t ucma_process_join(struct ucma_file *file,
> > goto err3;
> > }
> >
> > +   mutex_lock(&mut);
> > +   idr_replace(&multicast_idr, mc, mc->id);
> > +   mutex_unlock(&mut);
> > +
> > mutex_unlock(&file->mut);
> > ucma_put_ctx(ctx);
> > return 0;
> >
> 
> 
> Hi,
> 
> Your patch is reasonable to me.
> Can I ask a question for that?
> Could it be solved by asymmetric locking as follows?

No, there are many other paths that touch multicast_idr that don't
hold both locks, we should protect all of them from accessing an
incompletely initialized structure.
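
The idiom the patch uses, reduced to its essentials (the names here are
illustrative, not the actual ucma code):

#include <linux/idr.h>
#include <linux/mutex.h>

static DEFINE_MUTEX(example_mut);
static DEFINE_IDR(example_idr);

static int publish_object(void *obj)
{
	int id;

	/* Reserve the ID with NULL so concurrent idr_find() callers
	 * cannot see a half-initialized object. */
	mutex_lock(&example_mut);
	id = idr_alloc(&example_idr, NULL, 0, 0, GFP_KERNEL);
	mutex_unlock(&example_mut);
	if (id < 0)
		return id;

	/* ... finish initializing obj, copy_to_user(), etc ... */

	/* Publish only after the last failure point has passed. */
	mutex_lock(&example_mut);
	idr_replace(&example_idr, obj, id);
	mutex_unlock(&example_mut);
	return id;
}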

Jason


Re: [PATCH] IB/hns: Use zeroing memory allocator instead of allocator/memset

2018-06-04 Thread Jason Gunthorpe
On Sun, Jun 03, 2018 at 05:32:22PM +0800, YueHaibing wrote:
> Use dma_zalloc_coherent for allocating zeroed memory and
> remove unnecessary memset function.
> 
> Signed-off-by: YueHaibing 
> ---
>  drivers/infiniband/hw/hns/hns_roce_alloc.c | 8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)

Applied to for-next, thanks.

Jason


Re: [PATCH] iw_cxgb4: add INFINIBAND_ADDR_TRANS dependency

2018-06-04 Thread Jason Gunthorpe
On Thu, May 31, 2018 at 02:40:59PM -0400, Doug Ledford wrote:
> On Wed, 2018-05-30 at 21:03 -0700, Greg Thelen wrote:
> > On Wed, May 30, 2018 at 4:01 PM Jason Gunthorpe  wrote:
> > 
> > > On Thu, May 31, 2018 at 12:40:54AM +0200, Arnd Bergmann wrote:
> > > > > On 5/30/2018 5:25 PM, Jason Gunthorpe wrote:
> > > > > > On Wed, May 30, 2018 at 05:10:35PM -0500, Steve Wise wrote:
> > > > > > > 
> > > > > > > On 5/30/2018 5:04 PM, Jason Gunthorpe wrote:
> > > > > > > > On Wed, May 30, 2018 at 11:58:18PM +0200, Arnd Bergmann wrote:
> > > > > > > > > The newly added fill_res_ep_entry function fails to link if
> > > > > > > > > CONFIG_INFINIBAND_ADDR_TRANS is not set:
> > > > > > > > > 
> > > > > > > > > drivers/infiniband/hw/cxgb4/restrack.o: In function `fill_res_ep_entry':
> > > > > > > > > restrack.c:(.text+0x3cc): undefined reference to `rdma_res_to_id'
> > > > > > > > > restrack.c:(.text+0x3d0): undefined reference to `rdma_iw_cm_id'
> > > > > > > > > 
> > > > > > > > > This adds a Kconfig dependency for the driver.
> > > > > > > > > 
> > > > > > > > > Fixes: 116aeb887371 ("iw_cxgb4: provide detailed provider-specific CM_ID information")
> > > > > > > > > Signed-off-by: Arnd Bergmann 
> > > > > > > > >  drivers/infiniband/hw/cxgb4/Kconfig | 1 +
> > > > > > > > >  1 file changed, 1 insertion(+)
> > > > > > > > 
> > > > > > > > Oh, I think we need to solve this with maybe a header fill null
> > > > > > > > stub instead..
> > > > > > > > 
> > > > > > > > We don't want to disable drivers just because a user interface is
> > > > > > > > disabled.
> > > > > > > > 
> > > > > > > 
> > > > > > > Why does CONFIG_INFINIBAND_ADDR_TRANS disable building rdma_cm.ko?
> > > > > > > That is not correct.
> > > > > > 
> > > > > > That seems like a reasonable thing to do..
> > > > > 
> > > > > rdma_ucm.ko is for usermode users, rdma_cm.ko is for kernel users, and
> > > > > is required for iwarp drivers.  It seems rdma_cm.ko is not being
> > > > > compiled if ADDR_TRANS is not set.
> > > I think the intention was to completely disable rdma-cm, including all
> > > support for rx'ing remote packets? Greg?
> > 
> > Yes.  That's my goal when INFINIBAND_ADDR_TRANS is unset.
> > 
> > > If this is required for iwarp then Arnd's patch is probably the right
> > > way to go..
> > > Jason
> > 
> > Agreed.
> > Acked-by: Greg Thelen 
> 
> If that's the case, then there should be a NOTE: in the Kconfig that
> disabling the connection manager completely disables iWARP hardware.
> 
> I don't really think I like this approach though.  At a minimum if you
> are going to make iWARP totally dependent on rdmacm, then there is zero
> reason that iw_cm.o should be part of the obj-$(CONFIG_INFINIBAND)
> Makefile recipe when ADDR_TRANS is disabled.

This makes sense, Greg, can you send a followup patch with these
items? I know this series has been tough..

Thanks,
Jason


[PATCH] clang-format: Set IndentWrappedFunctionNames false

2018-06-06 Thread Jason Gunthorpe
The true option causes this indenting for functions:

static struct something_very_very_long *
	function(void *arg)
{

While a quick survey suggests that the usual Linux fallback is the GNU
style:

static struct something_very_very_long *
function(void *arg)
{

Eg as seen in:

 kernel/cpu.c
 kernel/fork.c

and other places.

Signed-off-by: Jason Gunthorpe 
---
 .clang-format | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Thanks for making this Miguel!

I've been using clang-format for years with the kernel and noticed it
was performing just a little differently than I would have expected.

AFAIK the above describes the more common choice in the code base?

Not sure whose tree this is supposed to go through.. Andrew I guess?

diff --git a/.clang-format b/.clang-format
index faffc0d5af4eeb..1d5da22e0ba50c 100644
--- a/.clang-format
+++ b/.clang-format
@@ -382,7 +382,7 @@ IncludeIsMainRegex: '(Test)?$'
 IndentCaseLabels: false
 #IndentPPDirectives: None # Unknown to clang-format-5.0
 IndentWidth: 8
-IndentWrappedFunctionNames: true
+IndentWrappedFunctionNames: false
 JavaScriptQuotes: Leave
 JavaScriptWrapImports: true
 KeepEmptyLinesAtTheStartOfBlocks: false
-- 
2.17.0



Re: [PATCH] clang-format: Set IndentWrappedFunctionNames false

2018-06-06 Thread Jason Gunthorpe
On Wed, Jun 06, 2018 at 01:18:59PM -0700, Joe Perches wrote:
> On Wed, 2018-06-06 at 14:15 -0600, Jason Gunthorpe wrote:
> > The true option causes this indenting for functions:
> > 
> > static struct something_very_very_long *
> > function(void *arg)
> > {
> > 
> > While a quick survey suggests that the usual Linux fallback is the GNU
> > style:
> > 
> > static struct something_very_very_long *
> > function(void *arg)
> > 
> []
> > AFAIK the above describes the more common choice in the code base?
> > 
> > Not sure whose tree this is supposed to go through.. Andrew I guess?
> 
> I believe so yes.
> 
> I expect there will be more refinements to .clang-format
> as perhaps more people experiment with it too.
> 
> Acked-by: Joe Perches 

Thanks Joe,

I saw your note in the original mailing-list thread and just wanted to
say the way I use clang-format is editor based.

I have a hotkey that triggers clang-format to reformat the current
statement, or current hi-lighted region. If I don't like it, then
'undo' puts it all back for manual adjustment..

This way I am selective of where I apply it, and I find it is
something like 95% of the time good, IMHO.

Big time saver for wrapping long lines, moving code around (eg
indent,de-indent) and other mundane daily tasks.

Cheers,
Jason


[GIT PULL] Please pull RDMA subsystem changes

2018-06-06 Thread Jason Gunthorpe
Idan Burstein (1):
  IB/mlx5: posting klm/mtt list inline in the send queue for reg_wr

Jack Morgenstein (2):
  IB/core: Make testing MR flags for writability a static inline function
  IB/mlx4: Mark user MR as writable if actual virtual memory is writable

Jason Gunthorpe (8):
  RDMA/rdma_cm: Remove process_req and timer sorting
  RDMA/rdma_cm: Make rdma_addr_cancel into a fence
  RDMA/rdma_cm: Delete rdma_addr_client
  Merge tag 'mlx5-updates-2018-05-17' of git://git.kernel.org/.../mellanox/linux into for-next
  Merge branch 'mr_fix' into git://git.kernel.org/.../rdma/rdma for-next
  RDMA/core: Remove indirection through ib_cache_setup()
  Merge branch 'mini_cqe' into git://git.kernel.org/.../rdma/rdma for-next
  Merge tag 'verbs_flow_counters' of git://git.kernel.org/.../leon/linux-rdma.git into for-next

Jia-Ju Bai (3):
  infiniband: i40iw: Replace GFP_ATOMIC with GFP_KERNEL in i40iw_add_mqh_4
  infiniband: i40iw: Replace GFP_ATOMIC with GFP_KERNEL in i40iw_make_listen_node
  infiniband: i40iw: Replace GFP_ATOMIC with GFP_KERNEL in i40iw_l2param_change

Kaike Wan (3):
  IB/hfi1: Ensure VL index is within bounds
  IB/hfi1: Remove unused variable
  IB/hfi1: Rename exp_lock to exp_mutex

Kamenee Arumugam (1):
  IB/Hfi1: Read CCE Revision register to verify the device is responsive

Leon Romanovsky (16):
  net/mlx5: Decrease level of prints about non-existent MKEY
  RDMA/hns: Drop local zgid in favor of core defined variable
  RDMA/mlx5: Remove debug prints of VMA pointers
  RDMA/ucm: Mark UCM interface as BROKEN
  RDMA/mlx4: Catch FW<->SW misalignment without machine crash
  RDMA/mlx4: Discard unknown SQP work requests
  RDMA/cxgb3: Don't crash kernel just because IDR is full
  RDMA/cm: Abort loop in case of CM dequeue
  RDMA/cma: Ignore unknown event
  RDMA/mad: Delete inaccessible BUG_ON
  RDMA/mad: Convert BUG_ONs to error flows
  RDMA/mlx5: Don't check return value of zap_vma_ptes()
  RDMA/mlx4: Don't crash machine if zap_vma_ptes() fails
  mm: Remove return value of zap_vma_ptes()
  RDMA/restrack: Change SPDX tag to properly reflect license
  RDMA/mlx5: Update SPDX tags to show proper license

Matan Barak (2):
  IB/uverbs: Add an ib_uobject getter to ioctl() infrastructure
  IB/core: Support passing uhw for create_flow

Max Gurtovoy (4):
  IB/isert: fix T10-pi check mask setting
  RDMA/core: introduce check masks for T10-PI offload
  IB/iser: use T10-PI check mask definitions from core layer
  IB/isert: use T10-PI check mask definitions from core layer

Michael J. Ruhl (4):
  IB/hfi1: Return actual error value from program_rcvarray()
  IB/hfi1: Return correct value for device state
  IB/hfi1: Reorder incorrect send context disable
  IB/hfi1: Set port number for errorinfo MAD response

Mike Marciniszyn (7):
  IB/hfi1: Fix fault injection init/exit issues
  IB/hfi1: Cleanup of exp_rcv
  IB/{rdmavt,hfi1}: Change hrtimer add to use pinned version
  IB/qib: Fix DMA api warning with debug kernel
  IB/hfi1: Fix user context tail allocation for DMA_RTAIL
  IB/hfi1: Add bypass register defines and replace blind constants
  IB/hfi1: Fix comment on default hdr entry size

Mitko Haralanov (1):
  IB/hfi1: Rework fault injection machinery

Moshe Shemesh (2):
  net/mlx5: Refactor num of blocks in mailbox calculation
  net/mlx5: Fix dump_command mailbox length printed

Or Gerlitz (1):
  net/mlx5: Use flow counter pointer as input to the query function

Parav Pandit (9):
  IB/core: Use CONFIG_SECURITY_INFINIBAND to compile out security code
  IB/core: Make gid_table_reserve_default() return void
  IB/core: Reuse gid_table_release_one() in table allocation failure
  IB/cm: Avoid AV ah_attr overwriting during LAP message handling
  IB/cm: Store and restore ah_attr during LAP msg processing
  IB/cm: Store and restore ah_attr during CM message processing
  IB/core: Remove duplicate declaration of gid_cache_wq
  IB/core: Reduce the places that use zgid
  IB/core: Introduce and use rdma_gid_table()

Rabie Loulou (1):
  net/mlx5e: Explicitly set destination e-switch in FDB rules

Raed Salem (11):
  net/mlx5: Export flow counter related API
  IB/core: Introduce counters object and its create/destroy
  IB/uverbs: Add create/destroy counters support
  IB/core: Introduce counters read verb
  IB/uverbs: Add read counters support
  IB/core: Add support for flow counters
  IB/uverbs: Add support for flow counters
  IB/mlx5: Add counters create and destroy support
  IB/mlx5: Add flow counters binding support
  IB/mlx5: Add flow counters read support
  IB/mlx5: Add counters read support

Roi Dayan (1):
  net/mlx5: Add merged e-switch cap

Sebastian Andrzej Siewior (1):

Re: [PATCH IB/core 2/2] IB/cm: Send authentic pkey in REQ msg and check eligibility of the pkeys

2018-05-15 Thread Jason Gunthorpe
On Tue, May 15, 2018 at 08:11:09PM +0200, Håkon Bugge wrote:
> > On 15 May 2018, at 02:38, Hal Rosenstock  wrote:
> > 
> > On 5/14/2018 5:02 PM, Jason Gunthorpe wrote:
> >> On Thu, May 10, 2018 at 05:16:28PM +0200, Håkon Bugge wrote:
> >> 
> >>> We are talking about two things here. The PKey in the BTH and the
> >>> PKey in the CM REQ payload. They differ.
> >>> 
> >>> I am out of office, but if my memory serves me correct, the PKey in
> >>> the BTH in the MAD packet will be the default PKey. Further, we have
> >>> per IBTA:
> >> 
> >> This sounds like a Linux bug.
> >> 
> >> Linux does not do a PR to get a reversible path dedicated to the GMP,
> >> so it always uses the data flow path, thus the GMP path parameters
> >> and those in the REQ should always exactly match.
> >> 
> >> Where is Linux setting the BTH.PKey and how did it choose to use the
> >> default pkey? Lets fix that at least for sure.
> 
> Linux isn’t. The BTH.PKey is inserted by the HCA (hw or fw) coming from the 
> P_Key table (10.9.2 in IBTA), selected by a pkey_index associated with the QP.
> 
> As per C10-133: Packets sent from the Send Queue of a GSI QP shall
> attach a P_Key associated with that QP, just as a P_Key is
> associated with nonmanagement QPs

That language doesn't mean the PKey is forced to the default, it says
the pkey is programmable.

Linux implemented programmable PKeys for the GSI by using the work
request, so it deviates from what the spec imagined (which was
probably one GSI QP per PKey).

See ib_create_send_mad for instance:

mad_send_wr->send_wr.wr.wr_cqe = &mad_send_wr->mad_list.cqe;
mad_send_wr->send_wr.wr.sg_list = mad_send_wr->sg_list;
mad_send_wr->send_wr.wr.num_sge = 2;
mad_send_wr->send_wr.wr.opcode = IB_WR_SEND;
mad_send_wr->send_wr.wr.send_flags = IB_SEND_SIGNALED;
mad_send_wr->send_wr.remote_qpn = remote_qpn;
mad_send_wr->send_wr.remote_qkey = IB_QP_SET_QKEY;
mad_send_wr->send_wr.pkey_index = pkey_index;

The pkey_index associated with the QP1 is ignored:

/*
 * PKey index for QP1 is irrelevant but
 * one is needed for the Reset to Init transition
 */
attr->qp_state = IB_QPS_INIT;
attr->pkey_index = pkey_index;
attr->qkey = (qp->qp_num == 0) ? 0 : IB_QP1_QKEY;
ret = ib_modify_qp(qp, attr, IB_QP_STATE |
 IB_QP_PKEY_INDEX | IB_QP_QKEY);

Since it is present in every WR.

> >> Basically this means that any pkey in the REQ could randomly be the
> >> full or limited value, and that in-of-itself has not bearing on the
> >> connection.
> >> 
> >> So it is quite wrong to insist that the pkey be limited or full when
> >> processing the REQ. The end port is expected to match against the
> >> local table.
> > 
> > Note that there is thorny issue with shared (physical) port
> > virtualization. In shared port virtualization, the VF pkey assignment is
> > a local matter. Only thing SM knows is the physical port pkeys where
> > both full and limited membership of same partition is possible. It is
> > conceivable that CM REQ contains limited partition key between 2 limited
> > VFs and for that a new REJ reason code appears to be needed.
> 
> +1

This is not a difficult issue.

If the GMP is properly tagged with the right PKey then it will never
be delivered to the VM if the VM does not have the PKey in the
table. It is up to the hypervisor to block GMPs that have Pkeys not in
the virtual PKey table of the VF.

The only time you could need a new REJ code is if the GMP is using a
PKey different from the REQ - which is a pretty goofy thing to do
considering this VM case.

Remember the SM doesn't know what Pkeys are in the VM, so it is
basically impossible for the REQ side to reliably select two different
> pkeys and know that they will both make it to the VM.

One pkey could be done reliably if it matched ipoib, for instance, but
two really cannot in the general case.

So again - the bug here is that the GMP is being sent with the wrong
pkey. Fix that, then see what is left to fix..

If I recall there were bugs here in mlx drivers, where the driver sent
with the wrong Pkey. I think this has actually been fixed now, so
please check the upstream kernel to be sure the Pkey is now what it is
supposed to be.

Jason


Re: [PATCH v2] {net, IB}/mlx5: Use 'kvfree()' for memory allocated by 'kvzalloc()'

2018-05-15 Thread Jason Gunthorpe
On Sun, May 13, 2018 at 09:00:41AM +0200, Christophe JAILLET wrote:
> When 'kvzalloc()' is used to allocate memory, 'kvfree()' must be used to
> free it.
> 
> Signed-off-by: Christophe JAILLET 
> ---
> v1 -> v2: More places to update have been added to the patch
> ---
>  drivers/infiniband/hw/mlx5/cq.c| 2 +-
>  drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 2 +-
>  drivers/net/ethernet/mellanox/mlx5/core/vport.c| 6 +++---
>  3 files changed, 5 insertions(+), 5 deletions(-)

I agree with Eric on the need for fixes lines in v3..

> diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
> index 77d257ec899b..6d52ea03574e 100644
> --- a/drivers/infiniband/hw/mlx5/cq.c
> +++ b/drivers/infiniband/hw/mlx5/cq.c
> @@ -849,7 +849,7 @@ static int create_cq_user(struct mlx5_ib_dev *dev, struct ib_udata *udata,
>   return 0;
>  
>  err_cqb:
> - kfree(*cqb);
> + kvfree(*cqb);

For the infiniband part:

Acked-by: Jason Gunthorpe 

Since this is mostly ethernet, can it go through netdev? thanks

Jason


Re: [PATCH v2] IB/umem: ib_ucontext already have tgid, remove pid from ib_umem structure

2018-05-15 Thread Jason Gunthorpe
On Tue, May 08, 2018 at 04:50:16PM +0800, Lidong Chen wrote:
> The userspace may invoke ibv_reg_mr and ibv_dereg_mr from different
> threads. If, by the time ibv_dereg_mr is invoked, the thread which
> invoked ibv_reg_mr has exited, get_pid_task will return NULL and
> ib_umem_release does not decrease mm->pinned_vm. This patch fixes it
> by using the tgid in the ib_ucontext struct.
> 
> Signed-off-by: Lidong Chen 
> ---
>  [v2]
>  - use ib_ucontext tgid instread of tgid in ib_umem structure
>  
>  drivers/infiniband/core/umem.c | 7 +------
>  include/rdma/ib_umem.h | 1 -
>  2 files changed, 1 insertion(+), 7 deletions(-)

Applied to for-rc, thanks.

It would be nice to send a cleanup to have all the users of tgid doing
this pattern

	task = get_pid_task(umem->context->tgid, PIDTYPE_PID);
	if (!task)
		goto out;
	mm = get_task_mm(task);

To call some kind of common function like ib_get_mr_mm(), just to make
it really clear what is happening. 
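
Something like this, say (ib_get_mr_mm() is only the suggested name, it
does not exist yet):

#include <rdma/ib_verbs.h>
#include <linux/sched/mm.h>
#include <linux/sched/task.h>

/* Return the owning mm with a reference held, or NULL if the task
 * that created the ucontext has already exited. */
static struct mm_struct *ib_get_mr_mm(struct ib_ucontext *context)
{
	struct task_struct *task;
	struct mm_struct *mm;

	task = get_pid_task(context->tgid, PIDTYPE_PID);
	if (!task)
		return NULL;

	mm = get_task_mm(task);
	put_task_struct(task);
	return mm;
}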

Jason


Re: [PATCH IB/core 2/2] IB/cm: Send authentic pkey in REQ msg and check eligibility of the pkeys

2018-05-16 Thread Jason Gunthorpe
On Wed, May 16, 2018 at 08:47:21AM +0200, Håkon Bugge wrote:

> > This is not a difficult issue.
> > 
> > If the GMP is properly tagged with the right PKey then it will never
> > be delivered to the VM if the VM does not have the PKey in the
> > table.
> 
> Not quite right. For the shared port model, a GMP will (most
> probably) be accepted by the physical port, due to:

Sure, but I am talking about the VM's 'virtual port'.

> So, if the GMP is destined to VM1, which is a limited member, but we
> have another VM2 which is a full member of the same partition, the
> GMP will pass the HCA’s PKey check.

Sure.

> > It is up to the hypervisor to block GMPs that have Pkeys not in
> > the virtual PKey table of the VF.
> 
> The packet is received by the HCA and stripped of its IB headers, in
> particular the BTH. How can the "hypervisor" block it when it doesn't
> have access to the GMP's BTH.PKey?

This is wrong too, in Linux the WC's for GMP packets include the pkey
index that was used to RX the packet. This is a Linux extension
matching the one on the sending side.

The hypervisor *must* block these GMPs, it is a hard requirement of
the pkey security model for VMs.

AFAIK the Mellanox drivers do this right - do you know differently?

> > The only time you could need a new REJ code is if the GMP is using a
> > PKey different from the REQ - which is a pretty goofy thing to do
> > considering this VM case.
>
> It's goofy. In the CX-3 shared port model, the BTH.PKey is the
> default one and the REQ.PKey is the full one even if the sending
> VM’s port only is a limited member. This patch series fixes the last
> issue.

Again, this is wrong, the BTH.Pkey and REQ.Pkey should be the same -
please fix that first before trying to change anything else.

> > Remember the SM doesn't know what Pkeys are in the VM, so it is
> > basically impossible for the REQ side to reliably select two different
> > pkeys and know that they will bothmake it to the VM.
> 
> The active side should use the "authentic" PKey in the REQ
> message. That is the one that would be used in BTH.PKey when
> communication has been established. This is implemented by this
> patch series.
> 
> Not sure what you mean by "reliably select two different pkeys". The
> CM REQ message contains one PKey.

If BTH.Pkey != REQ.PKey then the requestor side has to obviously
select two PKeys, which is basically impossible.

The VM should not be part of the default partition, for instance.

> See above, not sure how that could be implemented. And if it is
> solved by the HCA trapping the GMP due to the PKey check, it doesn’t
> help me, as the purpose of the series is to avoid (excessive) PKey
> traps sent to the SM.

It might not help you, but is shows this fix for the pkey trap issue
is wrong. You must fix the pkey traps on the sending side, not on the
responder side..

> > If I recall there were bugs here in mlx drivers, where the driver
> > sent with the wrong Pkey. I think this has actually been fixed
> > now, so please check the upstream kernel to be sure the Pkey is
> > not what it is supposed to be.
> 
> Let me get back to you with some ibdumps here.

Upstream kernel (eg 4.16) and newish Mellanox firmware please.

And it would be fantastic if you could ID why this is happening, AFAIK
it should be OK in the kernel.

Jason


Re: [PATCH IB/core 2/2] IB/cm: Send authentic pkey in REQ msg and check eligibility of the pkeys

2018-05-16 Thread Jason Gunthorpe
On Wed, May 16, 2018 at 12:42:37PM -0400, Hal Rosenstock wrote:

> >>> The only time you could need a new REJ code is if the GMP is using a
> >>> PKey different from the REQ - which is a pretty goofy thing to do
> >>> considering this VM case.
> >>
> >> Its goofy. In the CX-3 shared port model, the BTH.PKey is the
> >> default one and the REQ.PKey is the full one even if the sending
> >> VM’s port only is a limited member. This patch series fixes the last
> >> issue.
> > 
> > Again, this is wrong, the BTH.Pkey and REQ.Pkey should be the same -
> 
> I do not believe there is anything in the spec that requires this. I
> agree it's the simplest use model though.

The spec doesn't require it, but the design of the Linux CM certainly
does.

> > If BTH.Pkey != REQ.PKey then the requestor side has to obviously
> > select two PKeys, which is basically impossible.
> > 
> > The VM should not be part of the default partition, for instance.
> 
> I think that the VM is at least a limited member of the default partition.

Well, being a limited member still means the default pkey cannot be
used for CM GMPs.

I actually can't think of why you'd want to do this, better to put the
SM nodes in all the pkeys and reserve the default pkey completely for
the network management control plane.

Jason


Re: [GIT PULL] Please pull RDMA subsystem changes

2018-05-16 Thread Jason Gunthorpe
On Wed, May 16, 2018 at 07:39:08PM +0200, Eugene Syromiatnikov wrote:
> On Fri, Apr 06, 2018 at 10:05:41AM -0600, Jason Gunthorpe wrote:
> >   RDMA: Change all uapi headers to use __aligned_u64 instead of __u64
> Looks like this change changed the size of struct hfi1_ctxt_info and the
> value of HFI1_IOCTL_CTXT_INFO ioctl number as a result.

Only on 32 bit..

HFI1 does not build on 32 bit kernels, so the 32 bit version of the
ioctl is never supported.

HFI1 also does not define '.compat_ioctl' so even a 32 bit user space
on 64 bit kernel cannot call this ioctl.

So, yes it changed, but it doesn't matter.
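
For the record, the size difference in one picture (illustrative
structs, not the real hfi1 layout):

#include <linux/types.h>

/* On i386 a __u64 member is only 4-byte aligned, so this struct has
 * no tail padding and sizeof() == 12 there (16 on x86_64). */
struct old_info {
	__u32 flags;
	__u64 data;
};

/* __aligned_u64 forces 8-byte alignment everywhere, so sizeof() == 16
 * on both 32 and 64 bit -- and since _IOR()/_IOW() encode sizeof()
 * into the ioctl number, only the 32-bit ioctl number changes. */
struct new_info {
	__u32 flags;
	__aligned_u64 data;
};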

> >   IB/uverbs: Extend uverbs_ioctl header with driver_id
> This patch changed the value of the RDMA_VERBS_IOCTL ioctl number.
> 
> (see https://lists.strace.io/pipermail/strace-devel/2018-May/008196.html
> for the reference)

Yes.

RDMA_VERBS_IOCTL has been considered experimental and was protected by
a CONFIG option.

After the above commit the experimental was dropped and the ABI was
fixed, but prior to doing so, the struct was revised based on
feedback.

There is no userspace user of the old struct, only the new struct.

Basically from strace's perspective it should use the new ioctl #'s
across all kernel revs and it will be correct as there is no user
space that will issue the old ioctls.

Jason


Re: [PATCH IB/core 2/2] IB/cm: Send authentic pkey in REQ msg and check eligibility of the pkeys

2018-05-16 Thread Jason Gunthorpe
On Wed, May 16, 2018 at 07:46:10PM +0200, Håkon Bugge wrote:

>OK. Lets take one example. The pkey table contains 0x, 0x8001,
>0x0001.
> 
>The wce.pkey_index is 1 (i.e., pointing to 0x8001). Now, tell me, was
>BTH.PKey 0x8001 (matches 0x8001) or was it 0x0001 (also matching
>0x8001) ?

As far as the Linux core is concerned, it must have been 0x8001,
because the only way the pkey_index feature works properly is if
exact-match takes precedence over inexact match.
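
An illustrative sketch of that precedence rule (not an existing kernel
function, just the rule in code form):

#include <linux/types.h>

/* Prefer an exact 16-bit match; otherwise fall back to the IBTA rule
 * that the low 15 bits match and at least one side is a full member
 * (bit 15 set). Returns the pkey index, or -1 if nothing matches. */
static int find_pkey_index(const u16 *table, int len, u16 bth_pkey)
{
	int i, partial = -1;

	for (i = 0; i < len; i++) {
		if (table[i] == bth_pkey)
			return i;	/* exact match wins */
		if (partial < 0 &&
		    (table[i] & 0x7fff) == (bth_pkey & 0x7fff) &&
		    ((table[i] | bth_pkey) & 0x8000))
			partial = i;
	}
	return partial;
}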

Jason


Re: [PATCH IB/core 2/2] IB/cm: Send authentic pkey in REQ msg and check eligibility of the pkeys

2018-05-16 Thread Jason Gunthorpe
On Wed, May 16, 2018 at 08:14:50PM +0200, Håkon Bugge wrote:
> 
> 
> > On 16 May 2018, at 20:01, Jason Gunthorpe  wrote:
> > 
> > On Wed, May 16, 2018 at 07:46:10PM +0200, Håkon Bugge wrote:
> > 
> >>   OK. Lets take one example. The pkey table contains 0x, 0x8001,
> >>   0x0001.
> >> 
> >>   The wce.pkey_index is 1 (i.e., pointing to 0x8001). Now, tell me, was
> >>   BTH.PKey 0x8001 (matches 0x8001) or was it 0x0001 (also matching
> >>   0x8001) ?
> > 
> > As far as the Linux core is concerned, it must have been 0x8001,
> > because the only way the pkey_index feature works properly is if
> > exact-match takes precedence over in-exact match.
> 
> And now if the table only contains 0x, 0x8001, how do you tell?

It doesn't matter.

The delegation of PKeys to VMs is on a pkey-table-index basis, so if
it matches table entry 1 and entry 1 is passed to the VM, then the
packet can be passed to the VM.

Jason


Re: [PATCH] RDMA/cxgb4: fix null pointer dereference on alloc_skb failure

2019-04-16 Thread Jason Gunthorpe
On Sat, Apr 13, 2019 at 05:00:26PM +0100, Colin King wrote:
> From: Colin Ian King 
> 
> Currently, if alloc_skb fails to allocate the skb, a null skb is passed
> to t4_set_arp_err_handler and this ends up dereferencing the null skb.
> Avoid the null pointer dereference by checking for a null skb and
> returning early.
> 
> Addresses-Coverity: ("Dereference null return")
> Fixes: b38a0ad8ec11 ("RDMA/cxgb4: Set arp error handler for PASS_ACCEPT_RPL messages")
> Signed-off-by: Colin Ian King 
> Acked-by: Potnuri Bharat Teja 
> ---
>  drivers/infiniband/hw/cxgb4/cm.c | 2 ++
>  1 file changed, 2 insertions(+)

Applied to for-next

Thanks,
Jason


Re: [PATCH] RDMA/cxgb4: fix spelling mistake "immedate" -> "immediate"

2019-04-17 Thread Jason Gunthorpe
On Tue, Apr 16, 2019 at 03:38:04PM +0100, Colin King wrote:
> From: Colin Ian King 
> 
> There is a spelling mistake in a module parameter description. Fix it.
> 
> Signed-off-by: Colin Ian King 
> ---
>  drivers/infiniband/hw/cxgb4/qp.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Applied to for-next, thanks

Jason


Re: [PATCH 5/6] powerpc/mmu: drop mmap_sem now that locked_vm is atomic

2019-04-24 Thread Jason Gunthorpe
On Tue, Apr 23, 2019 at 07:15:44PM -0700, Davidlohr Bueso wrote:
> On Wed, 03 Apr 2019, Daniel Jordan wrote:
> 
> > On Wed, Apr 03, 2019 at 06:58:45AM +0200, Christophe Leroy wrote:
> > > Le 02/04/2019 à 22:41, Daniel Jordan a écrit :
> > > > With locked_vm now an atomic, there is no need to take mmap_sem as
> > > > writer.  Delete and refactor accordingly.
> > > 
> > > Could you please detail the change ?
> > 
> > Ok, I'll be more specific in the next version, using some of your language 
> > in
> > fact.  :)
> > 
> > > It looks like this is not the only
> > > change. I'm wondering what the consequences are.
> > > 
> > > Before we did:
> > > - lock
> > > - calculate future value
> > > - check the future value is acceptable
> > > - update value if future value acceptable
> > > - return error if future value non acceptable
> > > - unlock
> > > 
> > > Now we do:
> > > - atomic update with future (possibly too high) value
> > > - check the new value is acceptable
> > > - atomic update back with older value if new value not acceptable and 
> > > return
> > > error
> > > 
> > > So if a concurrent action wants to increase locked_vm with an acceptable
> > > step while another one has temporarily set it too high, it will now fail.
> > > 
> > > I think we should keep the previous approach and do a cmpxchg after
> > > validating the new value.
> 
> Wouldn't the cmpxchg alternative also be exposed to the locked_vm
> changing between validating the new value and the cmpxchg()? We'd
> bogusly fail even when there is still room, just because the value
> changed (I'm assuming we don't hold any locks, otherwise all this is
> pointless).
> 
>	current_locked = atomic_read(&mm->locked_vm);
>	new_locked = current_locked + npages;
>	if (new_locked < lock_limit)
>		if (cmpxchg(&mm->locked_vm, current_locked, new_locked) == current_locked)
>			/* ENOMEM */

Well it needs a loop..

again:
	current_locked = atomic_read(&mm->locked_vm);
	new_locked = current_locked + npages;
	if (new_locked < lock_limit)
		if (cmpxchg(&mm->locked_vm, current_locked, new_locked) != current_locked)
			goto again;

So it won't have bogus failures as there is no unwind after
error. Basically this is a load locked/store conditional style of
locking pattern.

> > That's a good idea, and especially worth doing considering that an arbitrary
> > number of threads that charge a low amount of locked_vm can fail just 
> > because
> > one thread charges lots of it.
> 
> Yeah but the window for this is quite small, I doubt it would be a real issue.
> 
> What if before doing the atomic_add_return(), we first did the racy new_locked
> check for ENOMEM, then do the speculative add and cleanup, if necessary. This
> would further reduce the scope of the window where false ENOMEM can occur.
> 
> > pinned_vm appears to be broken the same way, so I can fix it too unless 
> > someone
> > beats me to it.
> 
> This should not be a surprise for the rdma folks. Cc'ing Jason nonetheless.

I think we accepted this tiny race as a side effect of removing the
lock, which was very beneficial. Really the time window between the
atomic failing and unwind is very small, and there are enough other
ways a hostile user could DOS locked_vm that I don't think it really
matters in practice..

However, the cmpxchg seems better, so a helper to implement that would
probably be the best thing to do.
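
Something like this, perhaps - a rough sketch only; the helper name is
made up and the atomic64 type assumes the conversion this series is
doing, so don't read it as an existing API:

static inline int locked_vm_try_charge(struct mm_struct *mm,
                                       unsigned long npages,
                                       unsigned long lock_limit)
{
        s64 cur, new;

        do {
                cur = atomic64_read(&mm->locked_vm);
                new = cur + npages;
                if (new > lock_limit)
                        return -ENOMEM; /* nothing charged, no unwind */
        } while (atomic64_cmpxchg(&mm->locked_vm, cur, new) != cur);

        return 0;
}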

Jason


Re: sg_dma_page_iter offset & length considerations

2019-04-24 Thread Jason Gunthorpe
On Wed, Apr 24, 2019 at 03:22:18PM +0800, Daniel Drake wrote:
> Hi,
> 
> In drivers/mmc/alcor.c we're working with a MMC controller which
> supports DMA transfers split up into page-sized chunks.

Keep in mind that sg_page_iter_page splits into PAGE_SIZE chunks, so
if your HW needs exactly a 4k chunk or something then it is not the
right API.

I have in mind the idea to move this code:

https://patchwork.kernel.org/patch/10909379/

Into the global scatterlist someday if there are other users in the
tree that need arbitrary splitting..

> Specifically I can see userspace generates requests which present a
> sglist such as:
>  - first entry with offset=1536 length=2560
>  - 7 entries with offset=0 length=4096
>  - last entry with offset=0 length=1536
> 
> I gather that dma_map_sg() will take care of the offsets, i.e. any
> physical address I get with sg_page_iter_dma_address() will already
> have the offset applied, so I don't have to worry about tracking that.

Well the DMA iter aligns everything to pages so all the offsets are
lost.

> But what about the length? For every page returned by the iterator, I
> can't assume that I am being asked to work with the full page,
> right?

So far no user has required the length/offset, but it would be easy
enough to make a function to calculate these values for the current
step.

This is because this API is used by drivers building page lists for
HW, and they usually have additional information outside the SGL that
indicates what the start/end offsets are.

A driver that simply wants a SGL with a capped max size (ie 4k?)
should use the dma_set_max_seg_size() API and just never get a SGE
with a larger length. 

But the SGE may still cross a page boundary..
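
For the capped-segment approach the driver side then looks roughly
like this - a sketch assuming one HW descriptor is built per SGE, not
code from alcor.c:

        struct scatterlist *sg;
        int i, nents;

        /* never produce an SGE longer than one page */
        dma_set_max_seg_size(dev, PAGE_SIZE);

        nents = dma_map_sg(dev, sgl, sg_nents(sgl), DMA_TO_DEVICE);
        for_each_sg(sgl, sg, nents, i) {
                dma_addr_t addr = sg_dma_address(sg); /* offset already applied */
                unsigned int len = sg_dma_len(sg);    /* <= PAGE_SIZE */

                /* program one HW descriptor with addr/len here */
        }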

Jason


Re: [PATCH] infiniband: i40iw: fix potential NULL pointer dereferences

2019-03-12 Thread Jason Gunthorpe
On Fri, Mar 08, 2019 at 11:27:50PM -0600, Kangjie Lu wrote:
> alloc_ordered_workqueue may fail and return NULL. Let's check
> its return value to ensure it is not NULL so as to avoid
> potential NULL pointer dereferences.
> 
> Signed-off-by: Kangjie Lu 
>  drivers/infiniband/hw/i40iw/i40iw_cm.c | 12 
>  1 file changed, 12 insertions(+)
> 
> diff --git a/drivers/infiniband/hw/i40iw/i40iw_cm.c 
> b/drivers/infiniband/hw/i40iw/i40iw_cm.c
> index 206cfb0016f8..ad9b4f235e30 100644
> +++ b/drivers/infiniband/hw/i40iw/i40iw_cm.c
> @@ -3256,9 +3256,21 @@ void i40iw_setup_cm_core(struct i40iw_device *iwdev)
>  
>   cm_core->event_wq = alloc_ordered_workqueue("iwewq",
>   WQ_MEM_RECLAIM);
> + if (!cm_core->event_wq) {
> + i40iw_debug(cm_core->dev,
> + I40IW_DEBUG_CM,
> + "%s allocate event work queue failed\n",
> + __func__);
> + }
>  
>   cm_core->disconn_wq = alloc_ordered_workqueue("iwdwq",
> WQ_MEM_RECLAIM);
> + if (!cm_core->disconn_wq) {
> + i40iw_debug(cm_core->dev,
> + I40IW_DEBUG_CM,
> + "%s allocate disconnect work queue failed\n",
> + __func__);
> + }
>  }

Same 'no' as the mlx patches - handle the error or don't bother

Jason


Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions

2019-03-12 Thread Jason Gunthorpe
On Tue, Mar 12, 2019 at 11:35:29AM -0400, Jerome Glisse wrote:
> > > > Yes you now have the filesystem as well as the GUP pinner claiming
> > > > authority over the contents of a single memory segment. Maybe better not
> > > > allow that?
> > >
> > > This goes back to regressing existing driver with existing users.
> > 
> > There is no regression if that behavior never really worked.
> 
> Well RDMA driver maintainer seems to report that this has been a valid
> and working workload for their users.

I think it is more O_DIRECT that is the history here..

In RDMA land long term GUPs of file backed pages tend to crash the
kernel (what John is trying to fix here) so I'm not sure there are
actual real & tested users, only people that wish they could do this..

Jason


Re: [PATCH 00/10] HMM updates for 5.1

2019-03-13 Thread Jason Gunthorpe
On Wed, Mar 13, 2019 at 09:10:04AM -0700, Andrew Morton wrote:
> On Tue, 12 Mar 2019 21:27:06 -0400 Jerome Glisse  wrote:
> 
> > Andrew you will not be pushing this patchset in 5.1 ?
> 
> I'd like to.  It sounds like we're converging on a plan.
> 
> It would be good to hear more from the driver developers who will be
> consuming these new features - links to patchsets, review feedback,
> etc.  Which individuals should we be asking?  Felix, Christian and
> Jason, perhaps?

At least the Mellanox driver patch looks like a good improvement:

https://patchwork.kernel.org/patch/10786625/
 5 files changed, 202 insertions(+), 452 deletions(-)

In fact it hollows out the 'umem_odp' driver abstraction we already
had in the RDMA core.

So, I fully expect to see this API used in mlx5 and RDMA-core after it
is merged.

We've done some testing now, and there are still some outstanding
questions on the driver parts, but I haven't seen anything
fundamentally wrong with HMM mirror come up.

Jason


Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions

2019-03-14 Thread Jason Gunthorpe
On Thu, Mar 14, 2019 at 10:03:45AM +0100, Jan Kara wrote:
> On Wed 13-03-19 19:16:51, Christopher Lameter wrote:
> > On Tue, 12 Mar 2019, Jerome Glisse wrote:
> > 
> > > > > This has been discuss extensively already. GUP usage is now 
> > > > > widespread in
> > > > > multiple drivers, removing that would regress userspace ie break 
> > > > > existing
> > > > > application. We all know what the rules for that is.
> > 
> > You are still misstating the issue. In RDMA land GUP is widely used for
> > anonymous memory and memory-based filesystems. *Not* for real filesystems.
> 
> Maybe in your RDMA land. But there are apparently other users which do use
> mmap of a file on normal filesystem (e.g. ext4) as a buffer for DMA
> (Infiniband does not prohibit this if nothing else, video capture devices
> also use very similar pattern of gup-ing pages and using them as video
> buffers). And these users are reporting occasional kernel crashes. That's
> how this whole effort started. Sadly the DMA to file mmap is working good
> enough that people started using it so at this point we cannot just tell:
> Sorry it was a mistake to allow this, just rewrite your applications.

This is where we are in RDMA too.. People are trying it and the ones
that do enough load testing find their kernel OOPs

So it is not clear at all if this has graduated to a real use, or just
an experiment. Perhaps there are some system configurations that don't
trigger crashes..

Jason


Re: [PATCH -next] RDMA/hns: Make some function static

2019-02-04 Thread Jason Gunthorpe
On Fri, Feb 01, 2019 at 11:11:04AM +0800, YueHaibing wrote:
> Fixes the following sparse warnings:
> 
> drivers/infiniband/hw/hns/hns_roce_hw_v2.c:5822:5: warning:
>  symbol 'hns_roce_v2_query_srq' was not declared. Should it be static?
> drivers/infiniband/hw/hns/hns_roce_srq.c:158:6: warning:
>  symbol 'hns_roce_srq_free' was not declared. Should it be static?
> drivers/infiniband/hw/hns/hns_roce_srq.c:81:5: warning:
>  symbol 'hns_roce_srq_alloc' was not declared. Should it be static?
> 
> Signed-off-by: YueHaibing 
> ---
>  drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 2 +-
>  drivers/infiniband/hw/hns/hns_roce_srq.c   | 9 +
>  2 files changed, 6 insertions(+), 5 deletions(-)

Applied to for-next

Thanks,
Jason


Re: [PATCH V4 rdma-next 0/3] RDMA/hns: Some fixes for hns RoCE driver

2019-02-04 Thread Jason Gunthorpe
On Sun, Feb 03, 2019 at 08:43:12PM +0800, Wei Hu (Xavier) wrote:
> Hi, Jason and Doug
> 
> This patch series includes three bugfixes for reset related operations
> and are made based on the wip/jgg-for-next branch.
> 
> Best Regards
> Xavier
> 
> Wei Hu (Xavier) (3):
>   RDMA/hns: Fix the Oops during rmmod or insmod ko when reset occurs
>   RDMA/hns: Fix the chip hanging caused by sending mailbox&CMQ during
> reset
>   RDMA/hns: Fix the chip hanging caused by sending doorbell during reset

Applied to for-next

Thanks,
Jason


Re: [PATCH] mlx4_ib: Increase the timeout for CM cache

2019-02-05 Thread Jason Gunthorpe
On Thu, Jan 31, 2019 at 06:09:51PM +0100, Håkon Bugge wrote:
> Using CX-3 virtual functions, either from a bare-metal machine or
> pass-through from a VM, MAD packets are proxied through the PF driver.
> 
> Since the VMs have separate name spaces for MAD Transaction Ids
> (TIDs), the PF driver has to re-map the TIDs and keep the book keeping
> in a cache.
> 
> Following the RDMA CM protocol, it is clear when an entry has to be
> evicted from the cache. But life is not perfect, remote peers may die
> or be rebooted. Hence, there is a timeout to wipe out a cache entry,
> when the PF driver assumes the remote peer has gone.
> 
> We have experienced excessive amount of DREQ retries during fail-over
> testing, when running with eight VMs per database server.
> 
> The problem has been reproduced in a bare-metal system using one VM
> per physical node. In this environment, running 256 processes in each
> VM, each process uses RDMA CM to create an RC QP between himself and
> all (256) remote processes. All in all 16K QPs.
> 
> When tearing down these 16K QPs, excessive DREQ retries (and
> duplicates) are observed. With some cat/paste/awk wizardry on the
> infiniband_cm sysfs, we observe:
> 
> cm_rx_duplicates:
>   dreq:   5007
> cm_rx_msgs:
>   drep:   3838
>   dreq:  13018
>rep:   8128
>req:   8256
>rtu:   8256
> cm_tx_msgs:
>   drep:   8011
>   dreq:  68856
>rep:   8256
>req:   8128
>rtu:   8128
> cm_tx_retries:
>   dreq:  60483
> 
> Note that the active/passive side is distributed.
> 
> Enabling pr_debug in cm.c gives tons of:
> 
> [171778.814239]  mlx4_ib_multiplex_cm_handler: id{slave:
> 1,sl_cm_id: 0xd393089f} is NULL!
> 
> By increasing the CM_CLEANUP_CACHE_TIMEOUT from 5 to 30 seconds, the
> tear-down phase of the application is reduced from 113 to 67
> seconds. Retries/duplicates are also significantly reduced:
> 
> cm_rx_duplicates:
>   dreq:   7726
> []
> cm_tx_retries:
>   drep:  1
>   dreq:   7779
> 
> Increasing the timeout further didn't help, as these duplicates and
> retries stem from a too short CMA timeout, which was 20 (~4 seconds)
> on the systems. By increasing the CMA timeout to 22 (~17 seconds), the
> numbers fell down to about one hundred for both of them.
> 
> Adjustment of the CMA timeout is _not_ part of this commit.
> 
> Signed-off-by: Håkon Bugge 
> ---
>  drivers/infiniband/hw/mlx4/cm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Jack? What do you think?

> diff --git a/drivers/infiniband/hw/mlx4/cm.c b/drivers/infiniband/hw/mlx4/cm.c
> index fedaf8260105..8c79a480f2b7 100644
> --- a/drivers/infiniband/hw/mlx4/cm.c
> +++ b/drivers/infiniband/hw/mlx4/cm.c
> @@ -39,7 +39,7 @@
>  
>  #include "mlx4_ib.h"
>  
> -#define CM_CLEANUP_CACHE_TIMEOUT  (5 * HZ)
> +#define CM_CLEANUP_CACHE_TIMEOUT  (30 * HZ)
>  
>  struct id_map_entry {
>   struct rb_node node;
> -- 
> 2.20.1
> 


Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA

2019-02-06 Thread Jason Gunthorpe
On Wed, Feb 06, 2019 at 10:50:00AM +0100, Jan Kara wrote:

> MM/FS asks for lease to be revoked. The revoke handler agrees with the
> other side on cancelling RDMA or whatever and drops the page pins. 

This takes a trip through userspace since the communication protocol
is entirely managed in userspace.

Most existing communication protocols don't have a 'cancel operation'.

> Now I understand there can be HW / communication failures etc. in
> which case the driver could either block waiting or make sure future
> IO will fail and drop the pins. 

We can always rip things away from the userspace.. However..

> But under normal conditions there should be a way to revoke the
> access. And if the HW/driver cannot support this, then don't let it
> anywhere near DAX filesystem.

I think the general observation is that people who want to do DAX &
RDMA want it to actually work, without data corruption, random process
kills or random communication failures.

Really, few users would actually want to run in a system where revoke
can be triggered.

So.. how can the FS/MM side provide a guarantee to the user that
revoke won't happen under a certain system design?

Jason


Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA

2019-02-06 Thread Jason Gunthorpe
On Wed, Feb 06, 2019 at 10:35:04AM -0800, Matthew Wilcox wrote:

> > Admittedly, I'm coming in late to this conversation, but did I miss the
> > portion where that alternative was ruled out?
> 
> That's my preferred option too, but the preponderance of opinion leans
> towards "We can't give people a way to make files un-truncatable".

I haven't heard an explanation why blocking ftruncate is worse than
giving people a way to break an RDMA-using process by calling ftruncate??

Isn't it exactly the same argument the other way?

Jason


Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA

2019-02-06 Thread Jason Gunthorpe
On Wed, Feb 06, 2019 at 12:20:21PM -0800, Matthew Wilcox wrote:
> On Wed, Feb 06, 2019 at 03:16:02PM -0500, Doug Ledford wrote:
> > On Wed, 2019-02-06 at 11:40 -0800, Matthew Wilcox wrote:
> > > On Wed, Feb 06, 2019 at 07:16:21PM +, Christopher Lameter wrote:
> > > > 
> > > > though? If we only allow this use case then we may not have to worry 
> > > > about
> > > > long term GUP because DAX mapped files will stay in the physical 
> > > > location
> > > > regardless.
> > > 
> > > ... except for truncate.  And now that I think about it, there was a
> > > desire to support hot-unplug which also needed revoke.
> > 
> > We already support hot unplug of RDMA devices.  But it is extreme.  How
> > does hot unplug deal with a program running from the device (something
> > that would have returned ETXTBSY)?
> 
> Not hot-unplugging the RDMA device but hot-unplugging an NV-DIMM.
> 
> It's straightforward to migrate text pages from one DIMM to another;
> you remove the PTEs from the CPU's page tables, copy the data over and
> pagefaults put the new PTEs in place.  We don't have a way to do similar
> things to an RDMA device, do we?

I've long said it is reasonable to have an emergency hard revoke for
exceptional error cases - like dis-orderly hot unplug and so forth.

However, IMHO an orderly migration should rely on user space to
co-ordinate the migration and the application usage, including some
user space driven scheme to assure forward progress..

.. and you are kind of touching on my fear here. revoke started out as
only being for ftruncate. Now we need it for data migration - how soon
before someone wants to do revoke just to re-balance
usage/bandwidth/etc between NVDIMMS? 

That is now way outside of what an RDMA-using system can reasonably
tolerate. How would a system designer prevent this?

Again, nobody is going to want a design where RDMA applications are
under some undefined threat of SIGKILL - which is where any lease
revoke idea is going. :( 

The priority of systems using RDMA is almost always to keep the RDMA
working right as it is the often key service the box is providing.

Jason


Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA

2019-02-06 Thread Jason Gunthorpe
On Thu, Feb 07, 2019 at 08:03:56AM +1100, Dave Chinner wrote:
> On Wed, Feb 06, 2019 at 07:16:21PM +, Christopher Lameter wrote:
> > On Wed, 6 Feb 2019, Doug Ledford wrote:
> > 
> > > > Most of the cases we want revoke for are things like truncate().
> > > > Shouldn't happen with a sane system, but we're trying to avoid users
> > > > doing awful things like being able to DMA to pages that are now part of
> > > > a different file.
> > >
> > > Why is the solution revoke then?  Is there something besides truncate
> > > that we have to worry about?  I ask because EBUSY is not currently
> > > listed as a return value of truncate, so extending the API to include
> > > EBUSY to mean "this file has pinned pages that can not be freed" is not
> > > (or should not be) totally out of the question.
> > >
> > > Admittedly, I'm coming in late to this conversation, but did I miss the
> > > portion where that alternative was ruled out?
> > 
> > Coming in late here too but isnt the only DAX case that we are concerned
> > about where there was an mmap with the O_DAX option to do direct write
> > though? If we only allow this use case then we may not have to worry about
> > long term GUP because DAX mapped files will stay in the physical location
> > regardless.
> 
> No, that is not guaranteed. Soon as we have reflink support on XFS,
> writes will physically move the data to a new physical location.
> This is non-negotiable, and cannot be blocked forever by a gup
> pin.
> 
> IOWs, DAX on RDMA requires a) page fault capable hardware so that
> the filesystem can move data physically on write access, and b)
> revokable file leases so that the filesystem can kick userspace out
> of the way when it needs to.

Why do we need both? You want to have leases for normal CPU mmaps too?

> Truncate is a red herring. It's definitely a case for revokable
> leases, but it's the rare case rather than the one we actually care
> about. We really care about making copy-on-write capable filesystems like
> XFS work with DAX (we've got people asking for it to be supported
> yesterday!), and that means DAX+RDMA needs to work with storage that
> can change physical location at any time.

Then we must continue to ban longterm pin with DAX..

Nobody is going to want to deploy a system where revoke can happen at
any time and if you don't respond fast enough your system either locks
with some kind of FS meltdown or your process gets SIGKILL. 

I don't really see a reason to invest so much design work into
something that isn't production worthy.

It *almost* made sense with ftruncate, because you could architect to
avoid ftruncate.. But just any FS op might reallocate? Naw.

Dave, you said the FS is responsible to arbitrate access to the
physical pages..

Is it possible to have a filesystem for DAX that is more suited to
this environment? Ie designed to not require block reallocation (no
COW, no reflinks, different approach to ftruncate, etc)

> And that's the real problem we need to solve here. RDMA has no trust
> model other than "I'm userspace, I pinned you, trust me!". That's
> not good enough for FS-DAX+RDMA

It is baked into the silicon, and I don't see much motion on this
front right now. My best hope is that IOMMU PASID will get widely
deployed and RDMA silicon will arrive that can use it. Seems to be
years away, if at all.

At least we have one chip design that can work in a page faulting mode
..

Jason


Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA

2019-02-06 Thread Jason Gunthorpe
On Wed, Feb 06, 2019 at 02:44:45PM -0800, Dan Williams wrote:

> > Do they need to stick with xfs?
> 
> Can you clarify the motivation for that question? This problem exists
> for any filesystem that implements an mmap that where the physical
> page backing the mapping is identical to the physical storage location
> for the file data. 

.. and needs to dynamically change that mapping. Which is not really
something inherent to the general idea of a filesystem. A file system
that had *strictly static* block assignments would work fine.

Not all filesystems even implement hole punch.

Not all filesystems implement reflink.

ftruncate doesn't *have* to instantly return the free blocks to the
allocation pool.

ie this is not a DAX & RDMA issue but a XFS & RDMA issue.

Replacing XFS is probably not reasonable, but I wonder if an XFS--
operating mode could exist that had enough features removed to be
safe? 

Ie turn off REFLINK. Change the semantic of ftruncate to be more like
ETXTBUSY. Turn off hole punch.

> > Are they really trying to do COW backed mappings for the RDMA
> > targets?  Or do they want a COW backed FS but are perfectly happy
> > if the specific RDMA targets are *not* COW and are statically
> > allocated?
> 
> I would expect the COW to be broken at registration time. Only ODP
> could possibly support reflink + RDMA. So I think this devolves the
> problem back to just the "what to do about truncate/punch-hole"
> problem in the specific case of non-ODP hardware combined with the
> Filesystem-DAX facility.

Usually the problem with COW is that you make a READ RDMA MR on a
COW'd file, and some other thread breaks the COW..

This probably becomes a problem if the same process that has the MR
triggers a COW break (ie by writing to the CPU mmap). This would cause
the page to be reassigned but the MR would not be updated, which is
not what the app expects.

WRITE is simpler, once the COW is broken during GUP, the pages cannot
be COW'd again until the DMA pin is released. So new reflinks would be
blocked during the DMA pin period.

To fix READ you'd have to treat it like WRITE and break the COW at GUP.
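
Concretely that means pinning with write intent up front - a generic
sketch using the core GUP flags, not any particular driver's
registration path:

        unsigned int gup_flags = FOLL_WRITE | FOLL_FORCE;

        /* FOLL_WRITE breaks any COW sharing now, even though the MR is
         * read-only from the HW's point of view; FOLL_FORCE permits
         * that on read-only mappings */
        ret = get_user_pages(addr, npages, gup_flags, pages, NULL);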

Jason


Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA

2019-02-06 Thread Jason Gunthorpe
On Wed, Feb 06, 2019 at 03:30:27PM -0800, Dan Williams wrote:
> On Wed, Feb 6, 2019 at 3:21 PM Jason Gunthorpe  wrote:
> >
> > On Wed, Feb 06, 2019 at 02:44:45PM -0800, Dan Williams wrote:
> >
> > > > Do they need to stick with xfs?
> > >
> > > Can you clarify the motivation for that question? This problem exists
> > > for any filesystem that implements an mmap that where the physical
> > > page backing the mapping is identical to the physical storage location
> > > for the file data.
> >
> > .. and needs to dynamically change that mapping. Which is not really
> > something inherent to the general idea of a filesystem. A file system
> > that had *strictly static* block assignments would work fine.
> >
> > Not all filesystems even implement hole punch.
> >
> > Not all filesystems implement reflink.
> >
> > ftruncate doesn't *have* to instantly return the free blocks to the
> > allocation pool.
> >
> > ie this is not a DAX & RDMA issue but a XFS & RDMA issue.
> >
> > Replacing XFS is probably not reasonable, but I wonder if an XFS--
> > operating mode could exist that had enough features removed to be
> > safe?
> 
> You're describing the current situation, i.e. Linux already implements
> this, it's called Device-DAX and some users of RDMA find it
> insufficient. The choices are to continue to tell them "no", or say
> "yes, but you need to submit to lease coordination".

Device-DAX is not what I'm imagining when I say XFS--.

I mean more like XFS with all features that require relocation of
blocks disabled.

Forbidding hole punch, reflink, cow, etc, doesn't devolve back to
device-dax.

Jason


Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA

2019-02-06 Thread Jason Gunthorpe
On Thu, Feb 07, 2019 at 02:52:58PM +1100, Dave Chinner wrote:
> On Wed, Feb 06, 2019 at 05:24:50PM -0500, Doug Ledford wrote:
> > On Wed, 2019-02-06 at 15:08 -0700, Jason Gunthorpe wrote:
> > > On Thu, Feb 07, 2019 at 08:03:56AM +1100, Dave Chinner wrote:
> > > > On Wed, Feb 06, 2019 at 07:16:21PM +, Christopher Lameter wrote:
> > > > > On Wed, 6 Feb 2019, Doug Ledford wrote:
> > > > > 
> > > > > > > Most of the cases we want revoke for are things like truncate().
> > > > > > > Shouldn't happen with a sane system, but we're trying to avoid 
> > > > > > > users
> > > > > > > doing awful things like being able to DMA to pages that are now 
> > > > > > > part of
> > > > > > > a different file.
> > > > > > 
> > > > > > Why is the solution revoke then?  Is there something besides 
> > > > > > truncate
> > > > > > that we have to worry about?  I ask because EBUSY is not currently
> > > > > > listed as a return value of truncate, so extending the API to 
> > > > > > include
> > > > > > EBUSY to mean "this file has pinned pages that can not be freed" is 
> > > > > > not
> > > > > > (or should not be) totally out of the question.
> > > > > > 
> > > > > > Admittedly, I'm coming in late to this conversation, but did I miss 
> > > > > > the
> > > > > > portion where that alternative was ruled out?
> > > > > 
> > > > > Coming in late here too but isnt the only DAX case that we are 
> > > > > concerned
> > > > > about where there was an mmap with the O_DAX option to do direct write
> > > > > though? If we only allow this use case then we may not have to worry 
> > > > > about
> > > > > long term GUP because DAX mapped files will stay in the physical 
> > > > > location
> > > > > regardless.
> > > > 
> > > > No, that is not guaranteed. Soon as we have reflink support on XFS,
> > > > writes will physically move the data to a new physical location.
> > > > This is non-negotiable, and cannot be blocked forever by a gup
> > > > pin.
> > > > 
> > > > IOWs, DAX on RDMA requires a) page fault capable hardware so that
> > > > the filesystem can move data physically on write access, and b)
> > > > revokable file leases so that the filesystem can kick userspace out
> > > > of the way when it needs to.
> > > 
> > > Why do we need both? You want to have leases for normal CPU mmaps too?
> 
> We don't need them for normal CPU mmaps because that's locally
> addressable page fault capable hardware. i.e. if we need to
> serialise something, we just use kernel locks, etc. When it's a
> remote entity (such as RDMA) we have to get that remote entity to
> release its reference/access so the kernel has exclusive access
> to the resource it needs to act on.

Why can't DAX follow the path of GPU? Jerome has been working on
patches that let GPU do page migrations and other activities and
maintain full sync with ODP MRs.

I don't know of a reason why DAX migration would be different from GPU
migration.

The ODP RDMA HW does support halting RDMA access and interrupting the
CPU to re-establish access, so you can get your locks/etc. With
today's implementation DAX has to trigger all the needed MM notifier
callbacks to make this work. Tomorrow it will have to interact with
the HMM mirror API.

Jerome is already demoing this for the GPU case, so the RDMA ODP HW is
fine.

Is DAX migration different in some way from GPU's migration that it
can't use this flow and needs a lease to??? This would be a big
surprise to me.

> If your argument is that "existing RDMA apps don't have a recall
> mechanism" then that's what they are going to need to implement to
> work with DAX+RDMA. Reliable remote access arbitration is required
> for DAX+RDMA, regardless of what filesysetm the data is hosted on.

My argument is that this is a toy configuration that no production user
would use. It either has the ability to wait for the lease to revoke
'forever' without consequence or the application will be critically
de-stabilized by the kernel's escalation to time bound the response.
(or production systems never get revoke)

> Anything less is a potential security hole.

How does it get to a security hole? Obviously the pages under DMA
can't be

Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA

2019-02-06 Thread Jason Gunthorpe
On Wed, Feb 06, 2019 at 04:22:16PM -0800, Dan Williams wrote:
> On Wed, Feb 6, 2019 at 3:41 PM Jason Gunthorpe  wrote:
> [..]
> > > You're describing the current situation, i.e. Linux already implements
> > > this, it's called Device-DAX and some users of RDMA find it
> > > insufficient. The choices are to continue to tell them "no", or say
> > > "yes, but you need to submit to lease coordination".
> >
> > Device-DAX is not what I'm imagining when I say XFS--.
> >
> > I mean more like XFS with all features that require relocation of
> > blocks disabled.
> >
> > Forbidding hole punch, reflink, cow, etc, doesn't devolve back to
> > device-dax.
> 
> True, not all the way, but the distinction loses significance as you
> lose fs features.
> 
> Filesystems mark DAX functionality experimental [1] precisely because
> it forbids otherwise typical operations that work in the nominal page
> cache case. An approach that says "lets cement the list of things a
> filesystem or a core-memory-mangement facility can't do because RDMA
> finds it awkward" is bad precedent. 

I'm not saying these rules should apply globally.

I'm suggesting you could have a FS that supports gup_longterm by
design, and a FS that doesn't. And that is OK. They can have different
rules.

Obviously the golden case here is to use ODP (which doesn't call
gup_longterm at all) - that works for both.

Supporting non-ODP is a trade off case - users that want to run on
limited HW must accept limited functionality. Limited functionality is
better than no functionality.

Linux has many of these user-choose tradeoffs. This is how it supports
such a wide range of HW capabilities. Not all HW can do all
things. Some features really do need HW support. It has always been
that way.

Jason


Re: [PATCH V2 rdma-next 2/3] RDMA/hns: Fix the chip hanging caused by sending mailbox&CMQ during reset

2019-01-25 Thread Jason Gunthorpe
On Fri, Jan 25, 2019 at 10:15:40AM +0800, Wei Hu (Xavier) wrote:
> 
> 
> On 2019/1/25 2:31, Jason Gunthorpe wrote:
> > On Thu, Jan 24, 2019 at 11:13:29AM +0800, Wei Hu (Xavier) wrote:
> >>
> >> On 2019/1/24 6:40, Jason Gunthorpe wrote:
> >>> On Sat, Jan 19, 2019 at 11:36:06AM +0800, Wei Hu (Xavier) wrote:
> >>>
> >>>> +static int hns_roce_v2_cmd_hw_resetting(struct hns_roce_dev *hr_dev,
> >>>> +unsigned long instance_stage,
> >>>> +unsigned long reset_stage)
> >>>> +{
> >>>> +struct hns_roce_v2_priv *priv = (struct hns_roce_v2_priv 
> >>>> *)hr_dev->priv;
> >>>> +struct hnae3_handle *handle = priv->handle;
> >>>> +const struct hnae3_ae_ops *ops = handle->ae_algo->ops;
> >>>> +unsigned long end;
> >>>> +
> >>>> +/* When hardware reset is detected, we should stop sending 
> >>>> mailbox&cmq
> >>>> + * to hardware, and wait until hardware reset finished. If now
> >>>> + * in .init_instance() function, we should exit with error. If 
> >>>> now at
> >>>> + * HNAE3_INIT_CLIENT stage of soft reset process, we should 
> >>>> exit with
> >>>> + * error, and then HNAE3_INIT_CLIENT related process can 
> >>>> rollback the
> >>>> + * operation like notifying hardware to free resources, 
> >>>> HNAE3_INIT_CLIENT
> >>>> + * related process will exit with error to notify NIC driver to
> >>>> + * reschedule soft reset process once again.
> >>>> + */
> >>>> +end = msecs_to_jiffies(HNS_ROCE_V2_HW_RST_TIMEOUT) + jiffies;
> >>>> +while (ops->get_hw_reset_stat(handle) && time_before(jiffies, 
> >>>> end))
> >>>> +udelay(1);
> >>> I thought you were getting rid of these loops?
> >> Hi, Jason
> >>
> >> Upper applications may notify the driver to issue mailbox or CMD
> >> commands to hardware; some commands are used to release resources,
> >> destroy bt/destroy cq/unreg mr/destroy qp etc. When such
> >> commands are executed successfully, the hardware engine will
> >> no longer access some memory registered by the driver.
> >>
> >> When reset occurs, it is possible for upper applications to notify
> >> the driver to issue mailbox or CMD commands; we need to wait until
> >> the hardware reset has finished to ensure that hardware no longer
> >> accesses related memory.
> > You should not wait for things using loops like the above.
> Hi, Jason
> 
> Are your comments focusing on udelay? If not, thanks for your detailed
> information.
> In the hns3 RoCE driver, some CMQ/mailbox operations are called inside
> the lock; we can't use msleep in the lock, otherwise it will cause
> deadlock.
> When reset occurs, RDMA service cannot be provided normally, so I think
> in this case using udelay will not have a great impact.

You should not use any kind of sleep call in a loop like this.

Jason


Re: [PATCH 0/2] infiniband: trivial header search path fixups

2019-01-25 Thread Jason Gunthorpe
On Fri, Jan 25, 2019 at 01:39:29PM +0900, Masahiro Yamada wrote:
> My main motivation is to get rid of crappy header search path manipulation
> from Kbuild core.
> 
> Before that, I want to do as many treewide cleanups as possible.
> 
> If you are interested in the big picture of this work,
> the full patch set is available at:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild.git 
> build-test
> 
> 
> 
> Masahiro Yamada (2):
>   infiniband: remove unneeded header search paths
>   infiniband: prefix header search paths with $(srctree)/

Still compiles for me too, applied to rdma for-next

Thanks,
Jason


Re: [PATCH][next] RDMA/nes: remove redundant check on udata

2019-03-26 Thread Jason Gunthorpe
On Sat, Mar 02, 2019 at 11:06:36PM +, Colin King wrote:
> From: Colin Ian King 
> 
> The non-null check on udata is redundant as this check was performed
> just a few statements earlier and the check is always true as udata
> must be non-null at this point. Remove the redundant check on udata
> and the redundant else part that can never be executed.
> 
> Detected by CoverityScan, CID#1477317 ("Logically dead code")
> 
> Signed-off-by: Colin Ian King 
> ---
>  drivers/infiniband/hw/nes/nes_verbs.c | 73 +--
>  1 file changed, 34 insertions(+), 39 deletions(-)

Applied to for-next

Thanks,
Jason


Re: [PATCH v2] drivers: infiniband: Kconfig: pedantic formatting

2019-03-26 Thread Jason Gunthorpe
On Wed, Mar 06, 2019 at 11:08:45PM +0100, Enrico Weigelt, metux IT consult 
wrote:
> Formatting of Kconfig files doesn't look so pretty, so just
> take a damp cloth and clean it up.
> 
> Signed-off-by: Enrico Weigelt, metux IT consult 
> ---
>  drivers/infiniband/hw/bnxt_re/Kconfig | 10 +-
>  drivers/infiniband/ulp/iser/Kconfig   |  4 ++--
>  2 files changed, 7 insertions(+), 7 deletions(-)

Applied to for-next thanks

Jason


Re: [PATCH v2] infiniband: i40iw: fix potential NULL pointer dereferences

2019-03-27 Thread Jason Gunthorpe
On Fri, Mar 15, 2019 at 01:57:14AM -0500, Kangjie Lu wrote:
> alloc_ordered_workqueue may fail and return NULL.
> The fix captures the failure and handles it properly to avoid
> potential NULL pointer dereferences.
> 
> Signed-off-by: Kangjie Lu 
> ---
> V2: add return value to capture the error code
> ---
>  drivers/infiniband/hw/i40iw/i40iw.h  |  2 +-
>  drivers/infiniband/hw/i40iw/i40iw_cm.c   | 19 ---
>  drivers/infiniband/hw/i40iw/i40iw_main.c |  5 -
>  3 files changed, 21 insertions(+), 5 deletions(-)

applied to for-next thanks
 
> diff --git a/drivers/infiniband/hw/i40iw/i40iw.h 
> b/drivers/infiniband/hw/i40iw/i40iw.h
> index 2f2b4426ded7..8feec35f95a7 100644
> --- a/drivers/infiniband/hw/i40iw/i40iw.h
> +++ b/drivers/infiniband/hw/i40iw/i40iw.h
> @@ -552,7 +552,7 @@ enum i40iw_status_code i40iw_obj_aligned_mem(struct 
> i40iw_device *iwdev,
>  
>  void i40iw_request_reset(struct i40iw_device *iwdev);
>  void i40iw_destroy_rdma_device(struct i40iw_ib_device *iwibdev);
> -void i40iw_setup_cm_core(struct i40iw_device *iwdev);
> +int i40iw_setup_cm_core(struct i40iw_device *iwdev);
>  void i40iw_cleanup_cm_core(struct i40iw_cm_core *cm_core);
>  void i40iw_process_ceq(struct i40iw_device *, struct i40iw_ceq *iwceq);
>  void i40iw_process_aeq(struct i40iw_device *);
> diff --git a/drivers/infiniband/hw/i40iw/i40iw_cm.c 
> b/drivers/infiniband/hw/i40iw/i40iw_cm.c
> index 206cfb0016f8..2e20786b9a57 100644
> --- a/drivers/infiniband/hw/i40iw/i40iw_cm.c
> +++ b/drivers/infiniband/hw/i40iw/i40iw_cm.c
> @@ -3237,7 +3237,7 @@ void i40iw_receive_ilq(struct i40iw_sc_vsi *vsi, struct 
> i40iw_puda_buf *rbuf)
>   * core
>   * @iwdev: iwarp device structure
>   */
> -void i40iw_setup_cm_core(struct i40iw_device *iwdev)
> +int i40iw_setup_cm_core(struct i40iw_device *iwdev)
>  {
>   struct i40iw_cm_core *cm_core = &iwdev->cm_core;
>  
> @@ -3256,9 +3256,20 @@ void i40iw_setup_cm_core(struct i40iw_device *iwdev)
>  
>   cm_core->event_wq = alloc_ordered_workqueue("iwewq",
>   WQ_MEM_RECLAIM);
> + if (!cm_core->event_wq)
> + goto error;
>  
>   cm_core->disconn_wq = alloc_ordered_workqueue("iwdwq",
> WQ_MEM_RECLAIM);
> + if (!cm_core->disconn_wq)
> + goto error;
> +
> + return 0;
> +error:
> + i40iw_cleanup_cm_core(&iwdev->cm_core);
> + i40iw_pr_err("fail to setup CM core");

But I deleted this print, memory allocation failures already print
enough

Jason


Re: [PATCH] IB/iser: remove uninitialized variable len

2019-03-27 Thread Jason Gunthorpe
On Sat, Mar 16, 2019 at 11:05:12PM +, Colin King wrote:
> From: Colin Ian King 
> 
> The variable len is not being initialized and the uninitialized
> value is being returned. However, this return path is never reached
> because the default case in the switch statement returns -ENOSYS.
> Clean up the code by replacing the return -ENOSYS with a break
> for the default case and returning -ENOSYS at the end of the
> function.  This allows len to be removed.  Also remove redundant
> break that follows a return statement.
> 
> Signed-off-by: Colin Ian King 
> Reviewed-by: Max Gurtovoy 
> Reviewed-by: Sagi Grimberg 
> ---
>  drivers/infiniband/ulp/iser/iscsi_iser.c | 7 ++-
>  1 file changed, 2 insertions(+), 5 deletions(-)

Applied to for-next

Thanks,
Jason


Re: [PATCH] infiniband: cxgb4: fix a potential NULL pointer dereference

2019-03-28 Thread Jason Gunthorpe
On Wed, Mar 27, 2019 at 07:08:54PM +0530, Potnuri Bharat Teja wrote:
> On Saturday, March 03/23/19, 2019 at 08:07:46 +0530, Kangjie Lu wrote:
> > 
> > 
> > > On Mar 8, 2019, at 11:19 PM, Kangjie Lu  wrote:
> > > 
> > > get_skb may fail and return NULL. The fix returns -ENOMEM
> > > when it fails to avoid a NULL dereference.
> > > 
> > > Signed-off-by: Kangjie Lu 
> > > drivers/infiniband/hw/cxgb4/cm.c | 3 +++
> > > 1 file changed, 3 insertions(+)
> > > 
> > > diff --git a/drivers/infiniband/hw/cxgb4/cm.c 
> > > b/drivers/infiniband/hw/cxgb4/cm.c
> > > index 8221813219e5..502a54d57e2c 100644
> > > +++ b/drivers/infiniband/hw/cxgb4/cm.c
> > > @@ -1919,6 +1919,9 @@ static int send_fw_act_open_req(struct c4iw_ep *ep, 
> > > unsigned int atid)
> > >   int win;
> > > 
> > >   skb = get_skb(NULL, sizeof(*req), GFP_KERNEL);
> > > + if (!skb)
> > > + return -ENOMEM;
> > > +
> > 
> > Can someone review this patch? Thanks.
> 
> Sorry for the late response.
> I recommend an error print before the return.
> if (!skb) {
>   pr_err("%s - failed to alloc skb\n", __func__);
>   return -ENOMEM;
> }

no error prints on memory allocation failure, the kernel already
prints enough on this

Jason


Re: WARNING in cma_exit_net

2019-04-01 Thread Jason Gunthorpe
On Mon, Apr 01, 2019 at 10:36:05AM -0700, syzbot wrote:
> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:e3ecb83e Add linux-next specific files for 20190401
> git tree:   linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=13bc36cd20
> kernel config:  https://syzkaller.appspot.com/x/.config?x=db6c9f2bfeb91a99
> dashboard link: https://syzkaller.appspot.com/bug?extid=2e3e485d5697ea610460
> compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
> 
> Unfortunately, I don't have any reproducer for this crash yet.
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+2e3e485d5697ea610...@syzkaller.appspotmail.com
> 
> WARNING: CPU: 1 PID: 7 at drivers/infiniband/core/cma.c:4674
> cma_exit_net+0x327/0x390 drivers/infiniband/core/cma.c:4674
> Kernel panic - not syncing: panic_on_warn set ...

Matt: This is why the WARN_ON(!xa_empty()) is so valuable. Magically
syzkaller can find that something in this code is buggy.

Mellanox is also showing a different testing failure over the weekend
(use after free or something) from your 'cma: Convert portspace IDRs
to XArray'

If you see something obviously wrong in that patch please speak
up.. Below is what I applied.
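
One behavioral difference worth double-checking while hunting this (an
observation about the two APIs, not a confirmed diagnosis):
idr_alloc() returns the allocated ID on success while xa_insert()
returns 0 (or -EBUSY if the slot is occupied), so a caller that
consumed cma_ps_alloc()'s return value as the port number would now
see 0. Hypothetically:

        ret = cma_ps_alloc(net, ps, bind_list, snum);
        if (ret < 0)
                goto err;
        /* with idr_alloc() ret was snum; with xa_insert() it is 0,
         * so the port must come from snum directly */
        bind_list->port = snum;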

Thanks,
Jason

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 68c997be242930..a8b9c66c85252b 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -39,7 +39,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
@@ -191,10 +191,10 @@ static struct workqueue_struct *cma_wq;
 static unsigned int cma_pernet_id;
 
 struct cma_pernet {
-   struct idr tcp_ps;
-   struct idr udp_ps;
-   struct idr ipoib_ps;
-   struct idr ib_ps;
+   struct xarray tcp_ps;
+   struct xarray udp_ps;
+   struct xarray ipoib_ps;
+   struct xarray ib_ps;
 };
 
 static struct cma_pernet *cma_pernet(struct net *net)
@@ -202,7 +202,8 @@ static struct cma_pernet *cma_pernet(struct net *net)
return net_generic(net, cma_pernet_id);
 }
 
-static struct idr *cma_pernet_idr(struct net *net, enum rdma_ucm_port_space ps)
+static
+struct xarray *cma_pernet_xa(struct net *net, enum rdma_ucm_port_space ps)
 {
struct cma_pernet *pernet = cma_pernet(net);
 
@@ -247,25 +248,25 @@ struct class_port_info_context {
 static int cma_ps_alloc(struct net *net, enum rdma_ucm_port_space ps,
struct rdma_bind_list *bind_list, int snum)
 {
-   struct idr *idr = cma_pernet_idr(net, ps);
+   struct xarray *xa = cma_pernet_xa(net, ps);
 
-   return idr_alloc(idr, bind_list, snum, snum + 1, GFP_KERNEL);
+   return xa_insert(xa, snum, bind_list, GFP_KERNEL);
 }
 
 static struct rdma_bind_list *cma_ps_find(struct net *net,
  enum rdma_ucm_port_space ps, int snum)
 {
-   struct idr *idr = cma_pernet_idr(net, ps);
+   struct xarray *xa = cma_pernet_xa(net, ps);
 
-   return idr_find(idr, snum);
+   return xa_load(xa, snum);
 }
 
 static void cma_ps_remove(struct net *net, enum rdma_ucm_port_space ps,
  int snum)
 {
-   struct idr *idr = cma_pernet_idr(net, ps);
+   struct xarray *xa = cma_pernet_xa(net, ps);
 
-   idr_remove(idr, snum);
+   xa_erase(xa, snum);
 }
 
 enum {
@@ -4655,10 +4656,10 @@ static int cma_init_net(struct net *net)
 {
struct cma_pernet *pernet = cma_pernet(net);
 
-   idr_init(&pernet->tcp_ps);
-   idr_init(&pernet->udp_ps);
-   idr_init(&pernet->ipoib_ps);
-   idr_init(&pernet->ib_ps);
+   xa_init(&pernet->tcp_ps);
+   xa_init(&pernet->udp_ps);
+   xa_init(&pernet->ipoib_ps);
+   xa_init(&pernet->ib_ps);
 
return 0;
 }
@@ -4667,10 +4668,10 @@ static void cma_exit_net(struct net *net)
 {
struct cma_pernet *pernet = cma_pernet(net);
 
-   idr_destroy(&pernet->tcp_ps);
-   idr_destroy(&pernet->udp_ps);
-   idr_destroy(&pernet->ipoib_ps);
-   idr_destroy(&pernet->ib_ps);
+   WARN_ON(!xa_empty(&pernet->tcp_ps));
+   WARN_ON(!xa_empty(&pernet->udp_ps));
+   WARN_ON(!xa_empty(&pernet->ipoib_ps));
+   WARN_ON(!xa_empty(&pernet->ib_ps));
 }
 
 static struct pernet_operations cma_pernet_operations = {



Re: [PATCH] tpm: Actually fail on TPM errors during "get random"

2019-04-01 Thread Jason Gunthorpe
On Mon, Apr 01, 2019 at 11:32:11AM -0700, Kees Cook wrote:
> A "get random" may fail with a TPM error, but those codes were returned
> as-is to the caller, which assumed the result was the number of bytes
> that had been written to the target buffer, which could lead to a kernel
> heap memory exposure and over-read.
> 
> This fixes tpm1_get_random() to mask positive TPM errors into -EIO, as
> before.
> 
> [   18.092103] tpm tpm0: A TPM error (379) occurred attempting get random
> [   18.092106] usercopy: Kernel memory exposure attempt detected from SLUB 
> object 'kmalloc-64' (offset 0, size 379)!
> 
> Link: https://bugzilla.redhat.com/show_bug.cgi?id=1650989
> Reported-by: Phil Baker 
> Reported-by: Craig Robson 
> Fixes: 7aee9c52d7ac ("tpm: tpm1: rewrite tpm1_get_random() using tpm_buf 
> structure")
> Cc: Laura Abbott 
> Cc: Tomas Winkler 
> Cc: Jarkko Sakkinen 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Kees Cook 
>  drivers/char/tpm/tpm1-cmd.c | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/char/tpm/tpm1-cmd.c b/drivers/char/tpm/tpm1-cmd.c
> index 85dcf2654d11..faeb78ecf960 100644
> +++ b/drivers/char/tpm/tpm1-cmd.c
> @@ -510,7 +510,7 @@ struct tpm1_get_random_out {
>   *
>   * Return:
>   * *  number of bytes read
> - * * -errno or a TPM return code otherwise
> + * * -errno (positive TPM return codes are masked to -EIO)
>   */
>  int tpm1_get_random(struct tpm_chip *chip, u8 *dest, size_t max)
>  {
> @@ -524,7 +524,7 @@ int tpm1_get_random(struct tpm_chip *chip, u8 *dest, 
> size_t max)
>  
>   rc = tpm_buf_init(&buf, TPM_TAG_RQU_COMMAND, TPM_ORD_GET_RANDOM);
>   if (rc)
> - return rc;
> + goto fail;

?? Does tpm_buf_init() return positive?? Shouldn't..

I think you actually want this:

diff --git a/drivers/char/tpm/tpm1-cmd.c b/drivers/char/tpm/tpm1-cmd.c
index 85dcf2654d1102..a01e6fba1aacb6 100644
--- a/drivers/char/tpm/tpm1-cmd.c
+++ b/drivers/char/tpm/tpm1-cmd.c
@@ -510,7 +510,7 @@ struct tpm1_get_random_out {
  *
  * Return:
  * *  number of bytes read
- * * -errno or a TPM return code otherwise
+ * * -errno
  */
 int tpm1_get_random(struct tpm_chip *chip, u8 *dest, size_t max)
 {
@@ -531,8 +531,10 @@ int tpm1_get_random(struct tpm_chip *chip, u8 *dest, 
size_t max)
 
rc = tpm_transmit_cmd(chip, &buf, sizeof(out->rng_data_len),
  "attempting get random");
-   if (rc)
+   if (rc) {
+   rc = -EIO;
goto out;
+   }
 
out = (struct tpm1_get_random_out *)&buf.data[TPM_HEADER_SIZE];
 
diff --git a/drivers/char/tpm/tpm2-cmd.c b/drivers/char/tpm/tpm2-cmd.c
index e74c5b7b64bfbd..9fa498b4cf8816 100644
--- a/drivers/char/tpm/tpm2-cmd.c
+++ b/drivers/char/tpm/tpm2-cmd.c
@@ -328,8 +328,10 @@ int tpm2_get_random(struct tpm_chip *chip, u8 *dest, 
size_t max)
   offsetof(struct tpm2_get_random_out,
buffer),
   "attempting get random");
-   if (err)
+   if (err) {
+   rc = -EIO;
goto out;
+   }
 
out = (struct tpm2_get_random_out *)
&buf.data[TPM_HEADER_SIZE];


Re: [PATCH v3] tpm: Actually fail on TPM errors during "get random"

2019-04-01 Thread Jason Gunthorpe
On Mon, Apr 01, 2019 at 12:06:07PM -0700, Kees Cook wrote:
> A "get random" may fail with a TPM error, but those codes were returned
> as-is to the caller, which assumed the result was the number of bytes
> that had been written to the target buffer, which could lead to a kernel
> heap memory exposure and over-read.
> 
> This fixes tpm1_get_random() to mask positive TPM errors into -EIO, as
> before.
> 
> [   18.092103] tpm tpm0: A TPM error (379) occurred attempting get random
> [   18.092106] usercopy: Kernel memory exposure attempt detected from SLUB 
> object 'kmalloc-64' (offset 0, size 379)!
> 
> Link: https://bugzilla.redhat.com/show_bug.cgi?id=1650989
> Reported-by: Phil Baker 
> Reported-by: Craig Robson 
> Fixes: 7aee9c52d7ac ("tpm: tpm1: rewrite tpm1_get_random() using tpm_buf 
> structure")
> Cc: Laura Abbott 
> Cc: Tomas Winkler 
> Cc: Jarkko Sakkinen 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Kees Cook 
> v3: fix never-succeed, limit checks to tpm cmd return (James, Jason)
> v2: also fix tpm2 implementation (Jason Gunthorpe)
>  drivers/char/tpm/tpm1-cmd.c | 7 +--
>  drivers/char/tpm/tpm2-cmd.c | 7 +--
>  2 files changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/char/tpm/tpm1-cmd.c b/drivers/char/tpm/tpm1-cmd.c
> index 85dcf2654d11..faacbe1ffa1a 100644
> +++ b/drivers/char/tpm/tpm1-cmd.c
> @@ -510,7 +510,7 @@ struct tpm1_get_random_out {
>   *
>   * Return:
>   * *  number of bytes read
> - * * -errno or a TPM return code otherwise
> + * * -errno (positive TPM return codes are masked to -EIO)
>   */
>  int tpm1_get_random(struct tpm_chip *chip, u8 *dest, size_t max)
>  {
> @@ -531,8 +531,11 @@ int tpm1_get_random(struct tpm_chip *chip, u8 *dest, 
> size_t max)
>  
>   rc = tpm_transmit_cmd(chip, &buf, sizeof(out->rng_data_len),
> "attempting get random");
> - if (rc)
> + if (rc) {
> + if (rc > 0)
> + rc = -EIO;
>   goto out;
> + }
>  
>   out = (struct tpm1_get_random_out *)&buf.data[TPM_HEADER_SIZE];
>  
> diff --git a/drivers/char/tpm/tpm2-cmd.c b/drivers/char/tpm/tpm2-cmd.c
> index e74c5b7b64bf..8ffa6af61580 100644
> +++ b/drivers/char/tpm/tpm2-cmd.c
> @@ -301,7 +301,7 @@ struct tpm2_get_random_out {
>   *
>   * Return:
>   *   size of the buffer on success,
> - *   -errno otherwise
> + *   -errno otherwise ((positive TPM return codes are masked to -EIO)

Extra bracket, but otherwise looks fine to me

Jason


Re: [PATCH 2/2] ftpm: firmware TPM running in TEE

2019-04-03 Thread Jason Gunthorpe
On Tue, Apr 02, 2019 at 03:33:16PM -0400, Sasha Levin wrote:

> +/*
> + * Undo what has been done in ftpm_tee_probe
> + */
> +static void ftpm_tee_deinit(struct ftpm_tee_private *pvt_data)
> +{
> + /* Release the chip */
> + if (pvt_data->state & STATE_REGISTERED_FLAG)
> + tpm_chip_unregister(pvt_data->chip);
> +
> + if (pvt_data->ctx != NULL)  {
> +
> + /* Free the shared memory pool */
> + if (pvt_data->state & STATE_TEE_SHMALLOC_FLAG)
> + tee_shm_free(pvt_data->shm);
> +
> + /* close the existing session with fTPM TA*/
> + if (pvt_data->state & STATE_TEE_OPENED_FLAG)
> + tee_client_close_session(pvt_data->ctx,
> + pvt_data->session);
> +
> + /* close the context with TEE driver */
> + tee_client_close_context(pvt_data->ctx);
> + }

all these flags are ugly, just use a normal goto unwind during probe
please
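
i.e. roughly this shape - a sketch reusing the identifiers from the
patch, with registration as the final step and logging elided:

static int ftpm_tee_probe(struct platform_device *pdev)
{
        struct tee_ioctl_open_session_arg sess_arg;
        struct ftpm_tee_private *pvt_data;
        struct device *dev = &pdev->dev;
        int rc;

        pvt_data = devm_kzalloc(dev, sizeof(*pvt_data), GFP_KERNEL);
        if (!pvt_data)
                return -ENOMEM;
        dev_set_drvdata(dev, pvt_data);

        pvt_data->ctx = tee_client_open_context(NULL, ftpm_tee_match,
                                                NULL, NULL);
        if (IS_ERR(pvt_data->ctx))
                return -EPROBE_DEFER;

        memset(&sess_arg, 0, sizeof(sess_arg));
        memcpy(sess_arg.uuid, ftpm_ta_uuid.b, TEE_IOCTL_UUID_LEN);
        sess_arg.clnt_login = TEE_IOCTL_LOGIN_PUBLIC;

        rc = tee_client_open_session(pvt_data->ctx, &sess_arg, NULL);
        if (rc < 0 || sess_arg.ret != 0) {
                rc = -EINVAL;
                goto out_close_context;
        }
        pvt_data->session = sess_arg.session;

        pvt_data->shm = tee_shm_alloc(pvt_data->ctx,
                                      MAX_COMMAND_SIZE + MAX_RESPONSE_SIZE,
                                      TEE_SHM_MAPPED | TEE_SHM_DMA_BUF);
        if (IS_ERR(pvt_data->shm)) {
                rc = -ENOMEM;
                goto out_close_session;
        }

        pvt_data->chip = tpmm_chip_alloc(dev, &ftpm_tee_tpm_ops);
        if (IS_ERR(pvt_data->chip)) {
                rc = PTR_ERR(pvt_data->chip);
                goto out_free_shm;
        }
        pvt_data->chip->flags |= TPM_CHIP_FLAG_TPM2;

        /* last step; nothing after this but the unwind */
        rc = tpm_chip_register(pvt_data->chip);
        if (rc)
                goto out_free_shm;
        return 0;

out_free_shm:
        tee_shm_free(pvt_data->shm);
out_close_session:
        tee_client_close_session(pvt_data->ctx, pvt_data->session);
out_close_context:
        tee_client_close_context(pvt_data->ctx);
        return rc;
}

Then remove() just runs the same teardown unconditionally, and all the
STATE_* flags go away.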


> +
> + /* memory allocated with devm_kzalloc() is freed automatically */
> +}
> +
> +/*
> + * ftpm_tee_probe initialize the fTPM
> + * @param: pdev, the platform_device description.
> + * @return: 0 in case of success.
> + *or a negative value describing the error.
> + */
> +static int ftpm_tee_probe(struct platform_device *pdev)
> +{
> + int rc;
> + struct tpm_chip *chip;
> + struct device *dev = &pdev->dev;
> + struct ftpm_tee_private *pvt_data = NULL;
> + struct tee_ioctl_open_session_arg sess_arg;
> +
> + dev_dbg(dev, "%s++\n", __func__);

Please don't push tracing like this to the upstream kernel, we have
ftrace and what not to do this generally :(

> + pvt_data = devm_kzalloc(dev, sizeof(struct ftpm_tee_private),
> + GFP_KERNEL);
> + if (!pvt_data)
> + return -ENOMEM;
> +
> + dev_set_drvdata(dev, pvt_data);
> +
> + /* Open context with TEE driver */
> + pvt_data->ctx = tee_client_open_context(NULL, ftpm_tee_match, NULL,
> + NULL);
> + if (IS_ERR(pvt_data->ctx))  {
> + dev_err(dev, "%s:tee_client_open_context failed\n", __func__);
> + return -EPROBE_DEFER;
> + }
> +
> + /* Open a session with fTPM TA*/
> + memset(&sess_arg, 0, sizeof(sess_arg));
> + memcpy(sess_arg.uuid, ftpm_ta_uuid.b, TEE_IOCTL_UUID_LEN);
> + sess_arg.clnt_login = TEE_IOCTL_LOGIN_PUBLIC;
> + sess_arg.num_params = 0;
> +
> + rc = tee_client_open_session(pvt_data->ctx, &sess_arg, NULL);
> + if ((rc < 0) || (sess_arg.ret != 0)) {
> + dev_err(dev, "%s:tee_client_open_session failed, err=%x\n",
> + __func__, sess_arg.ret);
> + rc = -EINVAL;
> + goto out;
> + }
> + pvt_data->session = sess_arg.session;
> + pvt_data->state |= STATE_TEE_OPENED_FLAG;
> +
> + /* Allocate dynamic shared memory with fTPM TA */
> + pvt_data->shm = tee_shm_alloc(pvt_data->ctx,
> + (MAX_COMMAND_SIZE + MAX_RESPONSE_SIZE),
> + TEE_SHM_MAPPED | TEE_SHM_DMA_BUF);
> + if (IS_ERR(pvt_data->shm)) {
> + dev_err(dev, "%s:tee_shm_alloc failed\n", __func__);
> + rc = -ENOMEM;
> + goto out;
> + }
> + pvt_data->state |= STATE_TEE_SHMALLOC_FLAG;
> +
> + /* Allocate new struct tpm_chip instance */
> + chip = tpm_chip_alloc(dev, &ftpm_tee_tpm_ops);

Why not tpmm_chip_alloc? You're using devm in other places

Doesn't this leak memory? I don't see a put_device cleanup for chip_alloc?
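
If the non-managed tpm_chip_alloc() is kept instead, the allocation is
refcounted and the error path needs an explicit reference drop -
roughly (a sketch):

        chip = tpm_chip_alloc(dev, &ftpm_tee_tpm_ops);
        if (IS_ERR(chip)) {
                rc = PTR_ERR(chip);
                goto out;
        }
        ...
        /* on any later failure, drop the reference instead of kfree() */
        put_device(&chip->dev);

tpmm_chip_alloc() ties the chip's lifetime to the parent device and
makes the whole question go away.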

> + if (IS_ERR(chip)) {
> + dev_err(dev, "%s:tpm_chip_alloc failed\n", __func__);
> + rc = PTR_ERR(chip);
> + goto out;
> + }
> +
> + pvt_data->chip = chip;
> + pvt_data->chip->flags |= TPM_CHIP_FLAG_TPM2;
> +
> + /* Create a character device for the fTPM */
> + rc = tpm_chip_register(pvt_data->chip);

It is a bad idea to do things after tpm_chip_register, it should be
the last thing done during probe except for error cleanup via a goto
unwind.

Jason


[GIT PULL] Please pull RDMA subsystem changes

2019-04-10 Thread Jason Gunthorpe
Hi Linus,

Second rc pull request

The following changes since commit 8c2ffd9174779014c3fe1f96d9dc3641d9175f00:

  Linux 5.1-rc2 (2019-03-24 14:02:26 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git tags/for-linus

for you to fetch changes up to d737b25b1ae0540ba13cbd45ebb9b58a1d6d7f0d:

  IB/hfi1: Do not flush send queue in the TID RDMA second leg (2019-04-10 
15:09:30 -0300)


5.1 Second RC pull request

Several driver bug fixes posted in the last several weeks

- Several bug fixes for the hfi1 driver 'TID RDMA' functionality merged
  into 5.1. Since TID RDMA is on by default these all seem to be
  regressions.

- Wrong software permission checks on memory in mlx5

- Memory leak in vmw_pvrdma during driver remove

- Several bug fixes for hns driver features merged into 5.1


Kaike Wan (5):
  IB/hfi1: Failed to drain send queue when QP is put into error state
  IB/hfi1: Clear the IOWAIT pending bits when QP is put into error state
  IB/hfi1: Eliminate opcode tests on mr deref
  IB/hfi1: Fix the allocation of RSM table
  IB/hfi1: Do not flush send queue in the TID RDMA second leg

Kamal Heib (1):
  RDMA/vmw_pvrdma: Fix memory leak on pvrdma_pci_remove

Lijun Ou (1):
  RDMA/hns: Fix bug that caused srq creation to fail

Moni Shoua (1):
  IB/mlx5: Reset access mask when looping inside page fault handler

Yangyang Li (1):
  RDMA/hns: Bugfix for SCC hem free

 drivers/infiniband/hw/hfi1/chip.c  | 26 +++--
 drivers/infiniband/hw/hfi1/qp.c|  4 +++-
 drivers/infiniband/hw/hfi1/rc.c|  4 ++--
 drivers/infiniband/hw/hfi1/tid_rdma.c  | 31 +++---
 drivers/infiniband/hw/hns/hns_roce_hem.c   |  6 +++--
 drivers/infiniband/hw/hns/hns_roce_mr.c|  4 ++--
 drivers/infiniband/hw/hns/hns_roce_qp.c|  3 ---
 drivers/infiniband/hw/mlx5/odp.c   |  3 ++-
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_main.c |  2 ++
 9 files changed, 42 insertions(+), 41 deletions(-)


signature.asc
Description: PGP signature


Re: [PATCH] vfio/type1: Use follow_pte()

2021-02-12 Thread Jason Gunthorpe
On Fri, Feb 12, 2021 at 12:01:50PM -0700, Alex Williamson wrote:
> follow_pfn() doesn't make sure that we're using the correct page
> protections, get the pte with follow_pte() so that we can test
> protections and get the pfn from the pte.
> 
> Fixes: 5cbf3264bc71 ("vfio/type1: Fix VA->PA translation for PFNMAP VMAs in 
> vaddr_get_pfn()")
> Signed-off-by: Alex Williamson 
> ---
>  drivers/vfio/vfio_iommu_type1.c |   14 --
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index ec9fd95a138b..90715413c3d9 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -463,9 +463,11 @@ static int follow_fault_pfn(struct vm_area_struct *vma, 
> struct mm_struct *mm,
>   unsigned long vaddr, unsigned long *pfn,
>   bool write_fault)
>  {
> + pte_t *ptep;
> + spinlock_t *ptl;
>   int ret;
>  
> - ret = follow_pfn(vma, vaddr, pfn);
> + ret = follow_pte(vma->vm_mm, vaddr, NULL, &ptep, NULL, &ptl);
>   if (ret) {
>   bool unlocked = false;
>  
> @@ -479,9 +481,17 @@ static int follow_fault_pfn(struct vm_area_struct *vma, 
> struct mm_struct *mm,
>   if (ret)
>   return ret;
>  
> - ret = follow_pfn(vma, vaddr, pfn);
> + ret = follow_pte(vma->vm_mm, vaddr, NULL, &ptep, NULL, &ptl);

commit 9fd6dad1261a541b3f5fa7dc5b15306e6702 in linux-next is what
exports follow_pte and it uses a different signature:

+int follow_pte(struct mm_struct *mm, unsigned long address,
+  pte_t **ptepp, spinlock_t **ptlp)

Recommend you send this patch for rc1 once the right stuff lands in
Linus's tree

Otherwise it looks OK
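
For reference, the call sites then look roughly like this against the
4-argument signature quoted above (a sketch, untested):

        ret = follow_pte(vma->vm_mm, vaddr, &ptep, &ptl);
        if (!ret) {
                if (write_fault && !pte_write(*ptep))
                        ret = -EFAULT;
                else
                        *pfn = pte_pfn(*ptep);
                pte_unmap_unlock(ptep, ptl);
        }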

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH v5 04/14] vfio/mdev: idxd: Add auxialary device plumbing for idxd mdev support

2021-02-12 Thread Jason Gunthorpe
On Fri, Feb 12, 2021 at 11:56:24AM -0700, Dave Jiang wrote:

> > > @@ -434,11 +502,19 @@ static int idxd_probe(struct idxd_device *idxd)
> > >   goto err_idr_fail;
> > >   }
> > > + rc = idxd_setup_mdev_auxdev(idxd);
> > > + if (rc < 0)
> > > + goto err_auxdev_fail;
> > > +
> > >   idxd->major = idxd_cdev_get_major(idxd);
> > >   dev_dbg(dev, "IDXD device %d probed successfully\n", idxd->id);
> > >   return 0;
> > > + err_auxdev_fail:
> > > + mutex_lock(&idxd_idr_lock);
> > > + idr_remove(&idxd_idrs[idxd->type], idxd->id);
> > > + mutex_unlock(&idxd_idr_lock);
> > Probably wrong to order things like this..
> 
> How should it be ordered?

The IDR is global data, so some other thread could have read the IDR
and gotten this pointer while it is being torn down in some racy
way. It is best to make the store to global data the very last
step, so you never have to unpublish from global memory and don't
have to think about concurrency.
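
A minimal sketch of the ordering I mean, reusing names from the patch
(the exact idr_alloc() arguments here are a guess, not from your code):

    /* do all the fallible setup first... */
    rc = idxd_setup_mdev_auxdev(idxd);
    if (rc < 0)
        return rc;

    /* ...and publish to the global IDR as the very last step, so
     * there is never a window where another thread can see a
     * half-initialized idxd and nothing global needs unwinding */
    mutex_lock(&idxd_idr_lock);
    rc = idr_alloc(&idxd_idrs[idxd->type], idxd, 0, 0, GFP_KERNEL);
    mutex_unlock(&idxd_idr_lock);
    if (rc < 0)
        goto err_undo_auxdev;   /* tear down only the local setup */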

> > Also somehow this has a
> > 
> > idxd = devm_kzalloc(dev, sizeof(struct idxd_device), GFP_KERNEL);
> > 
> > but the idxd has a kref'd struct device in it:
> 
> So the conf_dev is a struct device that let the driver do configuration of
> the device and other components through sysfs. It's a child device to the
> pdev. It should have no relation to the auxdev. The confdevs for each
> component should not be released until the physical device is released. For
> the mdev case, the auxdev shouldn't be released until the removal of the
> pdev as well since it is a child of the pdev also.
> 
> pdev --- device conf_dev --- wq conf_dev
>     |                    |--- engine conf_dev
>     |                    |--- group conf_dev
>     |--- aux_dev
> 
> > 
> > struct idxd_device {
> > enum idxd_type type;
> > struct device conf_dev;
> > 
> > So that's not right either
> > 
> > You'll need to fix the lifetime model for idxd_device before you get
> > to adding auxdevices
> 
> Can you kindly expand on how it's suppose to look like please?

Well, you can't call kfree on memory that contains a struct device,
you have to use put_device() - so the devm_kzalloc is unconditionally
wrong. Maybe you could replace it with a devm put device action, but
it would probably be a lot saner to just put the required put_device()
calls where they need to be in the first place.
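
Roughly like this - a minimal sketch, assuming idxd moves to a plain
kzalloc() so the conf_dev release is the one place the containing
memory dies:

    static void idxd_conf_dev_release(struct device *dev)
    {
        struct idxd_device *idxd =
            container_of(dev, struct idxd_device, conf_dev);

        kfree(idxd);
    }

Then every error and remove path does put_device(&idxd->conf_dev)
rather than relying on devm to kfree it.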

I didn't try to work out what this was all for, but once it is sorted
out you can just embed the aux device here and chain its release to
put_device on the conf_dev and all the lifetime will work out
naturally. 

Jason


Re: [PATCH 0/3] vfio: Device memory DMA mapping improvements

2021-02-12 Thread Jason Gunthorpe
On Fri, Feb 12, 2021 at 12:27:19PM -0700, Alex Williamson wrote:
> This series intends to improve some long standing issues with mapping
> device memory through the vfio IOMMU interface (ie. P2P DMA mappings).
> Unlike mapping DMA to RAM, we can't pin device memory, nor is it
> always accessible.  We attempt to tackle this (predominantly the
> first issue in this iteration) by creating a registration and
> notification interface through vfio-core, between the IOMMU backend
> and the bus driver.  This allows us to do things like automatically
> remove a DMA mapping to device if it's closed by the user.  We also
> keep references to the device container such that it remains isolated
> while this mapping exists.
> 
> Unlike my previous attempt[1], this version works across containers.
> For example if a user has device1 with IOMMU context in container1
> and device2 in container2, a mapping of device2 memory into container1
> IOMMU context would be removed when device2 is released.
> 
> What I don't tackle here is when device memory is disabled, such as
> for a PCI device when the command register memory bit is cleared or
> while the device is in reset.  Ideally is seems like it might be
> nice to have IOMMU API interfaces that could remove r/w permissions
> from the IOTLB entry w/o removing it entirely, but I'm also unsure
> of the ultimate value in modifying the IOTLB entries at this point.
> 
> In the PCI example again, I'd expect a DMA to disabled or unavailable
> device memory to get an Unsupported Request response.  If we play
> with the IOTLB mapping, we might change this to an IOMMU fault for
> either page permissions or page not present, depending on how we
> choose to invalidate that entry.  However, it seems that a system that
> escalates an UR error to fatal, through things like firmware first
> handling, is just as likely to also make the IOMMU fault fatal.  Are
> there cases where we expect otherwise, and if not is there value to
> tracking device memory enable state to that extent in the IOMMU?
> 
> Jason, I'm also curious if this scratches your itch relative to your
> suggestion to solve this with dma-bufs, and if that's still your
> preference, I'd love an outline to accomplish this same with that
> method.

I will look at this more closely later, but given this is solving a
significant security problem and the patches now exist, I'm not
inclined to push too hard to do something different if this works OK.

That said, it is not great to see VFIO create its own little dmabuf
like thing inside itself, in particular if this was in core code we
could add a new vm_operations_struct member like:

   struct dma_buf *(*get_dma_buf)(struct vm_area_struct *area);

And completely avoid a lot of the searching and fiddling with
ops. Maybe we can make this look closer to that ideal..

Jason


Re: [PATCH 1/3] vfio: Introduce vma ops registration and notifier

2021-02-12 Thread Jason Gunthorpe
On Fri, Feb 12, 2021 at 12:27:39PM -0700, Alex Williamson wrote:
> Create an interface through vfio-core where a vfio bus driver (ex.
> vfio-pci) can register the vm_operations_struct it uses to map device
> memory, along with a set of registration callbacks.  This allows
> vfio-core to expose interfaces for IOMMU backends to match a
> vm_area_struct to a bus driver and register a notifier for relevant
> changes to the device mapping.  For now we define only a notifier
> action for closing the device.
> 
> Signed-off-by: Alex Williamson 
>  drivers/vfio/vfio.c  |  120 ++
>  include/linux/vfio.h |   20 
>  2 files changed, 140 insertions(+)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 38779e6fd80c..568f5e37a95f 100644
> +++ b/drivers/vfio/vfio.c
> @@ -47,6 +47,8 @@ static struct vfio {
>   struct cdev group_cdev;
>   dev_t   group_devt;
>   wait_queue_head_t   release_q;
> + struct list_headvm_ops_list;
> + struct mutexvm_ops_lock;
>  } vfio;
>  
>  struct vfio_iommu_driver {
> @@ -2354,6 +2356,121 @@ struct iommu_domain *vfio_group_iommu_domain(struct vfio_group *group)
>  }
>  EXPORT_SYMBOL_GPL(vfio_group_iommu_domain);
>  
> +struct vfio_vma_ops {
> + const struct vm_operations_struct   *vm_ops;
> + vfio_register_vma_nb_t  *reg_fn;
> + vfio_unregister_vma_nb_t*unreg_fn;
> + struct list_headnext;
> +};
> +
> +int vfio_register_vma_ops(const struct vm_operations_struct *vm_ops,
> +   vfio_register_vma_nb_t *reg_fn,
> +   vfio_unregister_vma_nb_t *unreg_fn)

This just feels a little bit too complicated

I've recently learned from Daniel that we can use the address_space
machinery to drive the zap_vma_ptes() via unmap_mapping_range(). This
technique replaces all the open, close and vma_list logic in vfio_pci

If we don't need open anymore, we could do something like this:

 static const struct vm_operations_struct vfio_pci_mmap_ops = {
.open = vfio_pfn_open, // implemented in vfio.c
.close = vfio_pfn_close,
.fault = vfio_pci_mmap_fault,
 };

Then we could code the function needed:

struct vfio_pfn_range_handle 
{
   struct kref kref;
   struct vfio_device *vfio;
   struct srcu_notifier_head invalidation_cb;
   unsigned long flags;
};

struct vfio_pfn_range_handle *get_pfn_range(struct vm_area_struct *vma)
{
   struct vfio_pfn_range_handle *handle;

   if (vma->vm_ops->open != vfio_pfn_open)
  return NULL;
   
   handle = vma->vm_private_data;
   if (test_bit(DMA_STOPPED, &handle->flags))
  return NULL;
   kref_get(&handle->kref);
   return handle;
}

Where the common open/close only kref inc/dec the kref and all 'vfio'
VMAs always have a pointer to the same vfio_pfn_range_handle in their
private_data.

The vm_pgoff is already pointing at the physical pfn, so every part of
the system can get the information it needs fairly trivially.

Some stop access function is pretty simple looking

void stop_access(struct vfio_pfn_range_handle *handle)
{
  set_bit(DMA_STOPPED, &handle->flags);
  unmap_mapping_range(handle->vfio->[..]->inode, 0, max, false);
  srcu_notifier_call_chain(&handle->invalidation_cb, VFIO_VMA_NOTIFY_CLOSE, NULL);
}

(well, have to sort out the locking some more, but that is the general idea)

I think that would remove alot of the code added here and acts a lot
closer to how a someday dmabuf could act.

Also, this will need to update the nvlink vmops as well

Jason


Re: [RFC PATCH 05/10] vfio: Create a vfio_device from vma lookup

2021-02-24 Thread Jason Gunthorpe
On Wed, Feb 24, 2021 at 02:55:06PM -0700, Alex Williamson wrote:

> > The only use of the special ops would be if there are multiple types
> > of mmap's going on, but for this narrow use case those would be safely
> > distinguished by the vm_pgoff instead
> 
> We potentially do have device specific regions which can support mmap,
> for example the migration region.  We'll need to think about how we
> could even know if portions of those regions map to a device.  We could
> use the notifier to announce it and require the code supporting those
> device specific regions manage it.

So, the above basically says any VFIO VMA is allowed for VFIO to map
to the IOMMU.

If there are places creating mmaps for VFIO that should not go to the
IOMMU then they need to return NULL from this function.

> I'm not really clear what you're getting at with vm_pgoff though, could
> you explain further?

Ah, so I have to take a side discussion to explain what I meant.

The vm_pgoff is a bit confused because we change it here in vfio_pci:

vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;

But the address_space invalidation assumes it still has the region
based encoding:

+   vfio_device_unmap_mapping_range(vdev->device,
+   VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX),
+   VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_ROM_REGION_INDEX) -
+   VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX));

Those three indexes are in the vm_pgoff numberspace and so vm_pgoff
must always be set to the same thing - either the
VFIO_PCI_INDEX_TO_OFFSET() coding or the physical pfn. 

Since you say we need a limited invalidation this looks like a bug to
me - and it must always be the VFIO_PCI_INDEX_TO_OFFSET coding.

So, the PCI vma needs to get switched to use the
VFIO_PCI_INDEX_TO_OFFSET coding and then we can always extract the
region number from the vm_pgoff and thus access any additional data,
such as the base pfn or a flag saying this cannot be mapped to the
IOMMU. Do the reverse of VFIO_PCI_INDEX_TO_OFFSET and consult
information attached to that region ID.

All places creating vfio mmaps have to set the vm_pgoff to
VFIO_PCI_INDEX_TO_OFFSET().
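
For reference, the region coding being discussed is (quoting
drivers/vfio/pci/vfio_pci_private.h for context):

    #define VFIO_PCI_OFFSET_SHIFT   40
    #define VFIO_PCI_OFFSET_TO_INDEX(off)   (off >> VFIO_PCI_OFFSET_SHIFT)
    #define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
    #define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)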

But we have these violations that need fixing:

drivers/vfio/fsl-mc/vfio_fsl_mc.c:  vma->vm_pgoff = (region.addr >> PAGE_SHIFT) + pgoff;
drivers/vfio/platform/vfio_platform_common.c:   vma->vm_pgoff = (region.addr >> PAGE_SHIFT) + pgoff;

Couldn't see any purpose to this code, cargo cult copy? Just delete
it.

drivers/vfio/pci/vfio_pci.c:vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;

Used to implement fault() but we could get the region number and
extract the pfn from the vfio_pci_device's data easily enough.

I manually checked that other parts of VFIO not under drivers/vfio are
doing it OK, looks fine.

Jason


Re: [RFC PATCH 10/10] vfio/type1: Register device notifier

2021-02-24 Thread Jason Gunthorpe
On Wed, Feb 24, 2021 at 02:55:08PM -0700, Alex Williamson wrote:

> > > +static bool strict_mmio_maps = true;
> > > +module_param_named(strict_mmio_maps, strict_mmio_maps, bool, 0644);
> > > +MODULE_PARM_DESC(strict_mmio_maps,
> > > +  "Restrict to safe DMA mappings of device memory (true).");  
> > 
> > I think this should be a kconfig, historically we've required kconfig
> > to opt-in to unsafe things that could violate kernel security. Someone
> > building a secure boot trusted kernel system should not have an
> > option for userspace to just turn off protections.
> 
> It could certainly be further protected that this option might not
> exist based on a Kconfig, but I think we're already risking breaking
> some existing users and I'd rather allow it with an opt-in (like we
> already do for lack of interrupt isolation), possibly even with a
> kernel taint if used, if necessary.

Makes me nervous, security should not be optional.

> > I'd prefer this was written a bit differently, I would like it very
> > much if this doesn't mis-use follow_pte() by returning pfn outside
> > the lock.
> > 
> > vaddr_get_bar_pfn(..)
> > {
> > vma = find_vma_intersection(mm, vaddr, vaddr + 1);
> > if (!vma)
> >return -ENOENT;
> > if ((vma->vm_flags & VM_DENYWRITE) && (prot & PROT_WRITE)) // Check me
> >return -EFAULT;
> > device = vfio_device_get_from_vma(vma);
> > if (!device)
> >return -ENOENT;
> > 
> > /*
> >  * Now do the same as vfio_pci_mmap_fault() - the vm_pgoff must
> >  * be the physical pfn when using this mechanism. Delete follow_pte()
> >  * entirely.
> >  */
> > pfn = (vaddr - vma->vm_start)/PAGE_SIZE + vma->vm_pgoff
> > 
> > /* de-dup device and record that we are using device's pages in the
> >pfnmap */
> > ...
> > }
> 
> 
> This seems to undo both:
> 
> 5cbf3264bc71 ("vfio/type1: Fix VA->PA translation for PFNMAP VMAs in vaddr_get_pfn()")

No, the bug this commit described is fixed by calling
vfio_device_get_from_vma() which excludes all non-VFIO VMAs already.

We can assert that the vm_pgoff is in a specific format because it is
a VFIO owned VMA and must follow the rules to be part of the address
space. See my last email

Here I was suggesting to use the vm_pgoff == PFN rule, but since
you've clarified that doesn't work we'd have to determine the PFN from
the region number through the vfio_device instead.
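
Along these lines - just a sketch, assuming the vdev was already looked
up from the vma and the vm_pgoff carries the VFIO_PCI_INDEX_TO_OFFSET
coding:

    u64 off = (u64)vma->vm_pgoff << PAGE_SHIFT;
    int index = VFIO_PCI_OFFSET_TO_INDEX(off);
    unsigned long pfn;

    pfn = (pci_resource_start(vdev->pdev, index) +
           (off & VFIO_PCI_OFFSET_MASK) +
           (vaddr - vma->vm_start)) >> PAGE_SHIFT;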

> (which also suggests we are going to break users without the module
> option opt-in above)

Not necessarily, this is complaining vfio crashes, it doesn't say they
actually needed the IOMMU to work on those VMAs because they are doing
P2P DMA.

I think, if this does break someone, they are on a real fringe and
must have already modified their kernel, so a kconfig is the right
approach. It is pretty hard to get non-GUP'able DMA'able memory into a
process with the stock kernel.

Generally speaking, I think Linus has felt security bug fixes like
this are more on the OK side of things to break fringe users.

> And:
> 
> 41311242221e ("vfio/type1: Support faulting PFNMAP vmas")
> 
> So we'd have an alternate path in the un-safe mode and we'd lose the
> ability to fault in mappings.

As above we already exclude VMAs that are not from VFIO, and VFIO
sourced VMA's do not meaningfully implement fault for this use
case. So calling fixup_user_fault() is pointless.

Peter just did this so we could ask him what it was for..

I feel pretty strongly that removing the call to follow_pte is
important here. Even if we do cover all the issues with mis-using the
API, leaving it in just creates a maintenance problem.

Jason


Re: [RFC PATCH 04/10] vfio/pci: Use vfio_device_unmap_mapping_range()

2021-02-24 Thread Jason Gunthorpe
On Wed, Feb 24, 2021 at 02:55:05PM -0700, Alex Williamson wrote:

> Ok, but how does this really help us, unless you're also proposing some
> redesign of the memory_lock semaphore?  Even if we're zapping all the
> affected devices for a bus reset that doesn't eliminate that we still
> require device level granularity for other events.

Ok, I missed the device level one, forget this remark about the reflk
then, per vfio_pci_device is the right granularity

> > >  struct vfio_pci_device {
> > >   struct pci_dev  *pdev;
> > > + struct vfio_device  *device;  
> > 
> > Ah, I did this too, but I didn't use a pointer :)
> 
> vfio_device is embedded in vfio.c, so that worries me.

I'm working on what we talked about in the other thread to show how
VFIO looks if it follows the normal Linux container_of idiom and to
then remove the vfio_mdev.c indirection as an illustration.

I want to directly make the case the VFIO is better off following the
standard kernel designs and driver core norms so we can find an
agreement to get Max's work on vfio_pci going forward.

You were concerned about hand waving and maintainability, so I'm
willing to put in some more work to make it concrete.

I hope you'll keep an open mind

> > All the places trying to call vfio_device_put() when they really want
> > a vfio_pci_device * become simpler now. Eg struct vfio_devices wants
> > to have an array of vfio_pci_device, and get_pf_vdev() only needs to
> > return one pointer.
> 
> Sure, that example would be a good simplification.  I'm not sure see
> other cases where we're going out of our way to manage the vfio_device
> versus vfio_pci_device objects though.

What I've found is there are lots of little costs all over the
place. The above was just easy to describe in this context.

In my mind the biggest negative cost is the type erasure. Lots of
places are using 'device *', 'void *', 'kobj *' as generic handles to
things that are actually already concretely typed. In the kernel I
would say concretely typing things is considered a virtue.

Having gone through alot of VFIO now, as it relates to the
vfio_device, I think the type erasure directly hurts readability and
maintainability. It just takes too much information away from the
reader and the compiler. The core code is manageable, but when you
chuck mdev into it, which follows the same pattern, it really starts hurting.

Jason


Re: [PATCH 1/9] mm/migrate.c: Always allow device private pages to migrate

2021-02-10 Thread Jason Gunthorpe
On Wed, Feb 10, 2021 at 02:40:10PM +1100, Alistair Popple wrote:
> On Wednesday, 10 February 2021 12:39:32 AM AEDT Jason Gunthorpe wrote:
> > On Tue, Feb 09, 2021 at 12:07:14PM +1100, Alistair Popple wrote:
> > > Device private pages are used to represent device memory that is not
> > > directly accessible from the CPU. Extra references to a device private
> > > page are only used to ensure the struct page itself remains valid whilst
> > > waiting for migration entries. Therefore extra references should not
> > > prevent device private page migration as this can lead to failures to
> > > migrate pages back to the CPU which are fatal to the user process.
> > 
> > This should identify the extra references in expected_count, just
> > disabling this protection seems unsafe, ZONE_DEVICE is not so special
> > that the refcount means nothing
> 
> This is similar to what migrate_vma_check_page() does now. The issue is that a
> migration wait takes a reference on the device private page so you can end up
> with one thread stuck waiting for migration whilst the other can't migrate due
> to the extra refcount.
> 
> Given device private pages can't undergo GUP and that it's not possible to 
> differentiate the migration wait refcount from any other refcount we assume 
> any possible extra reference must be from migration wait.

GUP is not the only thing that elevates the refcount, I think this is
an unsafe assumption

Why is migration holding an extra refcount anyhow?

Jason


Re: [PATCH v2 0/9] Introduce vfio-pci-core subsystem

2021-02-10 Thread Jason Gunthorpe
On Wed, Feb 10, 2021 at 07:52:08AM +, Tian, Kevin wrote:
> > This subsystem framework will also ease on adding vendor specific
> > functionality to VFIO devices in the future by allowing another module
> > to provide the pci_driver that can setup number of details before
> > registering to VFIO subsystem (such as inject its own operations).
> 
> I'm a bit confused about the change from v1 to v2, especially about
> how to inject module specific operations. From live migration p.o.v
> it may requires two hook points at least for some devices (e.g. i40e 
> in original Yan's example):

IMHO, it was too soon to give up on putting the vfio_device_ops in the
final driver- we should try to define a reasonable public/private
split of vfio_pci_device as is the norm in the kernel. No reason we
can't achieve that.

>  register a migration region and intercept guest writes to specific
> registers. [PATCH 4/9] demonstrates the former but not the latter
> (which is allowed in v1).

And this is why: the ROI of wrapping every vfio op in a PCI op just to
keep vfio_pci_device completely private is poor :(

> Then another question. Once we have this framework in place, do we 
> mandate this approach for any vendor specific tweak or still allow
> doing it as vfio_pci_core extensions (such as igd and zdev in this
> series)?

I would say no to any further vfio_pci_core extensions that are tied
to specific PCI devices. Things like zdev are platform features, they
are not tied to specific PCI devices

> If the latter, what is the criteria to judge which way is desired? Also what 
> about the scenarios where we just want one-time vendor information, 
> e.g. to tell whether a device can tolerate arbitrary I/O page faults [1] or
> the offset in VF PCI config space to put PASID/ATS/PRI capabilities [2]?
> Do we expect to create a module for each device to provide such info?
> Having those questions answered is helpful for better understanding of
> this proposal IMO. 😊
> 
> [1] 
> https://lore.kernel.org/kvm/d4c51504-24ed-2592-37b4-f390b97fd...@huawei.com/T/

SVA is a platform feature, so no problem. Don't see a vfio-pci change
in here?

> [2] https://lore.kernel.org/kvm/20200407095801.648b1...@w520.home/

This one could have been done as a broadcom_vfio_pci driver. Not sure
exposing the entire config space unprotected is safe, hard to know
what the device has put in there, and if it is secure to share with a
guest..

> MDEV core is already a well defined subsystem to connect mdev
> bus driver (vfio-mdev) and mdev device driver (mlx5-mdev).

mdev is two things

 - a driver core bus layer and sysfs that makes a lifetime model
 - a vfio bus driver that doesn't do anything but forward ops to the
   main ops

> vfio-mdev is just the channel to bring VFIO APIs through mdev core
> to underlying vendor specific mdev device driver, which is already
> granted flexibility to tweak whatever needs through mdev_parent_ops.

This is the second thing, and it could just be deleted. The actual
final mdev driver can just use vfio_device_ops directly. The
redirection shim in vfio_mdev.c doesn't add value.

> Then what exact extension is talked here by creating another subsystem
> module? or are we talking about some general library which can be
> shared by underlying mdev device drivers to reduce duplicated
> emulation code?

IMHO it is more a design philosophy that the end driver should
implement the vfio_device_ops directly vs having a stack of ops
structs.

Jason


Re: linux-next: build warning after merge of the rdma tree

2021-02-10 Thread Jason Gunthorpe
On Wed, Feb 10, 2021 at 09:11:49PM +1100, Stephen Rothwell wrote:
> Hi all,
> 
> After merging the rdma tree, today's linux-next build (htmldocs) produced
> this warning:
> 
> drivers/infiniband/core/device.c:859: warning: Function parameter or member 'dev' not described in 'ib_port_immutable_read'
> drivers/infiniband/core/device.c:859: warning: Function parameter or member 'port' not described in 'ib_port_immutable_read'
> 
> Introduced by commit
> 
>   7416790e2245 ("RDMA/core: Introduce and use API to read port immutable data")

drivers/infiniband is W=1 clean right now in linux-next

But how can I build *only* drivers/infiniband using W=1 so I can keep
it that way?

The rest of the kernel is not clean and creates too much warning noise
to be usable, even with my mini config.

Just doing a 'make W=1 drivers/infiniband' is sort of OK, but then I
end up compiling things twice

Does anyone know a good solution?

Jason


Re: [PATCH v2 0/9] Introduce vfio-pci-core subsystem

2021-02-10 Thread Jason Gunthorpe
On Wed, Feb 10, 2021 at 09:37:46AM -0700, Alex Williamson wrote:

> > >  register a migration region and intercept guest writes to specific
> > > registers. [PATCH 4/9] demonstrates the former but not the latter
> > > (which is allowed in v1).  
> > 
> > And this is why: the ROI of wrapping every vfio op in a PCI op just to
> > keep vfio_pci_device completely private is poor :(
> 
> Says someone who doesn't need to maintain the core, fixing bugs and
> > adding features, while not breaking vendor drivers touching private data
> in unexpected ways ;)

Said as someone that maintains a driver subsystem 25x larger than VFIO
that is really experienced in "crazy things drivers do". :)

Private/public is rarely at the top of my worries, and I'm confident
to say this is the general kernel philosophy. Again, look anywhere, we
rarely have private data split out of major structs like struct
device, struct netdev, struct pci_device, etc. This data has to be
public because we are using C and we expect inline functions,
container_of() and so on to work. It is rarely done with hidden
structs.

If we can get private data in some places it is a nice win, but not
worth making a mess to achieve. eg I would not give up the normal
container_of pattern just to obscure some struct, the overall ROI is
bad.

Drivers always do unexpected and crazy things, I wouldn't get fixated
on touching private data as the worst sin a driver can do :(

So, no, I don't agree that exposing a struct vfio_pci_device is the
end of the world - it is normal in the kernel to do this kind of
thing, and yes drivers can do crazy things with that if crazy slips
past the review process.

Honestly I expect people to test their drivers and fix things if a
core change broke them. It happens, QA finds it, it gets fixed, normal
stuff for Linux, IMHO.

> > > Then what exact extension is talked here by creating another subsystem
> > > module? or are we talking about some general library which can be
> > > shared by underlying mdev device drivers to reduce duplicated
> > > emulation code?  
> > 
> > IMHO it is more a design philosophy that the end driver should
> > implement the vfio_device_ops directly vs having a stack of ops
> > structs.

> Like Kevin though, I don't really understand the hand-wave
> application to mdev.  Sure, vfio-mdev could be collapsed now that
> we've rejected that there could be other drivers binding to mdev
> devices,

Again, I think the point Max was trying to make is only that vfio_mdev
can follow the same design as proposed here and replace the
mdev_parent_ops with the vfio_device_ops.

Jason


Re: [PATCH 0/9] Add support for SVM atomics in Nouveau

2021-02-10 Thread Jason Gunthorpe
On Tue, Feb 09, 2021 at 04:17:38PM -0500, Jerome Glisse wrote:

> > > Yes, I would like to avoid the long term pin constraints as well if
> > > possible, I just haven't found a solution yet. Are you suggesting it
> > > might be possible to add a callback in the page migration logic to
> > > specially deal with moving these pages?
> > 
> > How would migration even find the page?
> 
> Migration can scan memory from physical address (isolate_migratepages_range())
> So the CPU mapping is not the only path to get to a page.

I mean find out that the page is now owned by the GPU driver to tell
it that it needs to migrate that reference. Normally that would go
through the VMA to the mmu notifier, but with the page removed from
the normal VMA it can't get to a mmu notifier or the GPU driver.

Jason


Re: [PATCH 0/9] Add support for SVM atomics in Nouveau

2021-02-10 Thread Jason Gunthorpe
On Tue, Feb 09, 2021 at 12:53:27PM -0800, John Hubbard wrote:

> This direction sounds at least...possible. Using MMU notifiers instead of pins
> is definitely appealing. I'm not quite clear on the callback idea above, but
> overall it seems like taking advantage of the ZONE_DEVICE tracking of pages
> (without having to put anything additional in each struct page), could work.

It isn't the ZONE_DEVICE page that needs to be tracked.

Really what you want to do here is leave the CPU page in the VMA and
the page tables where it started and deny CPU access to the page. Then
all the proper machinery will continue to work.

IMHO "migration" is the wrong idea if the data isn't actually moving.

Jason


Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-10 Thread Jason Gunthorpe
On Tue, Feb 09, 2021 at 10:22:47PM +, Song Bao Hua (Barry Song) wrote:

> The problem is that SVA declares we can use any memory of a process
> to do I/O. And in real scenarios, we are unable to customize most
> applications to make them use the pool. So we are looking for some
> extension generically for applications such as Nginx, Ceph.

But those applications will suffer jitter even if they are using the CPU
to do the same work. I fail to see why adding an accelerator suddenly
means the application owner will care about jitter introduced by
migration/etc.

Again, in proper SVA it should be quite unlikely to take a fault caused
by something like migration - about the same likelihood as the CPU. If
things are faulting so much that this is a problem, then I think it is a
system level problem with doing too much page motion.

Jason


Re: [PATCH] DMA/qedr: Use true and false for bool variable

2021-02-10 Thread Jason Gunthorpe
On Wed, Feb 10, 2021 at 05:38:21PM +0800, Jiapeng Chong wrote:
> Fix the following coccicheck warning:
> 
> ./drivers/infiniband/hw/qedr/qedr.h:629:9-10: WARNING: return of 0/1 in
> function 'qedr_qp_has_rq' with return type bool.
> 
> ./drivers/infiniband/hw/qedr/qedr.h:620:9-10: WARNING: return of 0/1 in
> function 'qedr_qp_has_sq' with return type bool.
> 
> Reported-by: Abaci Robot
> Signed-off-by: Jiapeng Chong 
> Acked-by: Michal Kalderon 
> ---
>  drivers/infiniband/hw/qedr/qedr.h | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)

Applied to for-next, thanks

Jason


Re: linux-next: manual merge of the rdma tree with Linus' tree

2021-02-10 Thread Jason Gunthorpe
On Wed, Feb 10, 2021 at 06:57:24PM +, Pearson, Robert B wrote:

> It looks like f1b0a8ea9f12 ("Revert "RDMA/rxe: Remove VLAN code
> leftovers from RXE"") has not been applied to rdma for-next which is
> where I do my work. Not sure how it got upstream.

It was sent to the rc tree, don't worry this is stuff I fix up.

Stephen's note is to help ensure it is fixed properly; does the change
look OK? It looks like this when applied (I changed Stephen's slightly):

static int rxe_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
{
struct udphdr *udph;
struct rxe_dev *rxe;
struct net_device *ndev = skb->dev;
struct rxe_pkt_info *pkt = SKB_TO_PKT(skb);

/* takes a reference on rxe->ib_dev
 * drop when skb is freed
 */
rxe = rxe_get_dev_from_net(ndev);
if (!rxe && is_vlan_dev(ndev))
rxe = rxe_get_dev_from_net(vlan_dev_real_dev(ndev));
if (!rxe)
goto drop;

Jason


Re: [PATCH v3 2/4] mm/gup: decrement head page once for group of subpages

2021-02-10 Thread Jason Gunthorpe
On Fri, Feb 05, 2021 at 08:41:25PM +, Joao Martins wrote:
> Rather than decrementing the head page refcount one by one, we
> walk the page array and check which pages belong to the same
> compound_head. Later on we decrement the calculated amount
> of references in a single write to the head page. To that
> end we switch to for_each_compound_head(), which does most of the work.
> 
> set_page_dirty() needs no adjustment as it's a nop for
> non-dirty head pages and it doesn't operate on tail pages.
> 
> This considerably improves unpinning of pages with THP and
> hugetlbfs:
> 
> - THP
> gup_test -t -m 16384 -r 10 [-L|-a] -S -n 512 -w
> PIN_LONGTERM_BENCHMARK (put values): ~87.6k us -> ~23.2k us
> 
> - 16G with 1G huge page size
> gup_test -f /mnt/huge/file -m 16384 -r 10 [-L|-a] -S -n 512 -w
> PIN_LONGTERM_BENCHMARK: (put values): ~87.6k us -> ~27.5k us
> 
> Signed-off-by: Joao Martins 
> Reviewed-by: John Hubbard 
> ---
>  mm/gup.c | 29 +++------
>  1 file changed, 11 insertions(+), 18 deletions(-)

Looks fine

Reviewed-by: Jason Gunthorpe 

I was wondering why this only touches the FOLL_PIN path, it would make
sense to also use this same logic for release_pages()

for (i = 0; i < nr; i++) {
struct page *page = pages[i];
page = compound_head(page);
if (is_huge_zero_page(page))
continue; 

Jason


Re: [PATCH v3 3/4] mm/gup: add a range variant of unpin_user_pages_dirty_lock()

2021-02-10 Thread Jason Gunthorpe
On Fri, Feb 05, 2021 at 08:41:26PM +, Joao Martins wrote:
> Add a unpin_user_page_range_dirty_lock() API which takes a starting page
> and how many consecutive pages we want to unpin and optionally dirty.
> 
> To that end, define another iterator for_each_compound_range()
> that operates in page ranges as opposed to page array.
> 
> For users (like RDMA mr_dereg) where each sg represents a
> contiguous set of pages, we're able to more efficiently unpin
> pages without having to supply an array of pages much of what
> happens today with unpin_user_pages().
> 
> Suggested-by: Jason Gunthorpe 
> Signed-off-by: Joao Martins 
> ---
>  include/linux/mm.h |  2 ++
>  mm/gup.c   | 62 ++
>  2 files changed, 64 insertions(+)

Reviewed-by: Jason Gunthorpe 

> +/**
> + * unpin_user_page_range_dirty_lock() - release and optionally dirty
> + * gup-pinned page range
> + *
> + * @page:  the starting page of a range maybe marked dirty, and definitely released.
> + * @npages: number of consecutive pages to release.
> + * @make_dirty: whether to mark the pages dirty
> + *
> + * "gup-pinned page range" refers to a range of pages that has had one of the
> + * get_user_pages() variants called on that page.

Tidy this language though; this only works with the pin_user_pages
variants because it hardwires FOLL_PIN
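
Something along these lines, perhaps (just a suggested rewording, not
run through kernel-doc):

     * "gup-pinned page range" refers to a range of pages that has had one of
     * the pin_user_pages() variants called on it, i.e. pages holding FOLL_PIN
     * references.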

Jason


Re: [PATCH v3 4/4] RDMA/umem: batch page unpin in __ib_umem_release()

2021-02-10 Thread Jason Gunthorpe
On Fri, Feb 05, 2021 at 08:41:27PM +, Joao Martins wrote:
> Use the newly added unpin_user_page_range_dirty_lock()
> for more quickly unpinning a consecutive range of pages
> represented as compound pages. This will also calculate the
> number of pages to unpin (for the tail pages matching the
> head page) and thus batch the refcount update.
> 
> Running a test program which calls mr reg/unreg on a 1G in size
> and measures cost of both operations together (in a guest using rxe)
> with THP and hugetlbfs:
> 
> Before:
> 590 rounds in 5.003 sec: 8480.335 usec / round
> 6898 rounds in 60.001 sec: 8698.367 usec / round
> 
> After:
> 2688 rounds in 5.002 sec: 1860.786 usec / round
> 32517 rounds in 60.001 sec: 1845.225 usec / round
> 
> Signed-off-by: Joao Martins 
> ---
>  drivers/infiniband/core/umem.c | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)

Would be best for this to go through Andrew's tree

Acked-by: Jason Gunthorpe 

4x improvement is pretty good!

Jason


Re: [PATCH v3 1/4] mm/gup: add compound page list iterator

2021-02-10 Thread Jason Gunthorpe
On Fri, Feb 05, 2021 at 08:41:24PM +, Joao Martins wrote:
> Add a helper that iterates over head pages in a list of pages. It
> essentially counts the tails until the next page to process has a
> different head than the current one. This is going to be used by
> unpin_user_pages() family of functions, to batch the head page refcount
> updates once for all passed consecutive tail pages.
> 
> Suggested-by: Jason Gunthorpe 
> Signed-off-by: Joao Martins 
> Reviewed-by: John Hubbard 
> ---
>  mm/gup.c | 26 ++
>  1 file changed, 26 insertions(+)

Reviewed-by: Jason Gunthorpe 

This can be used for check_and_migrate_cma_pages() too (there is a
series around to change this logic though, not sure if it is landed
yet)

Jason


Re: [PATCH v5 02/14] dmaengine: idxd: add IMS detection in base driver

2021-02-10 Thread Jason Gunthorpe
On Fri, Feb 05, 2021 at 01:53:05PM -0700, Dave Jiang wrote:

> diff --git a/drivers/dma/idxd/sysfs.c b/drivers/dma/idxd/sysfs.c
> index 21c1e23cdf23..ab5c76e1226b 100644
> +++ b/drivers/dma/idxd/sysfs.c
> @@ -1444,6 +1444,14 @@ static ssize_t numa_node_show(struct device *dev,
>  }
>  static DEVICE_ATTR_RO(numa_node);
>  
> +static ssize_t ims_size_show(struct device *dev, struct device_attribute *attr, char *buf)
> +{
> + struct idxd_device *idxd = container_of(dev, struct idxd_device, conf_dev);
> +
> + return sprintf(buf, "%u\n", idxd->ims_size);
> +}

use sysfs_emit for all new sysfs functions please
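
I.e. something like this (same semantics, just the preferred helper):

    return sysfs_emit(buf, "%u\n", idxd->ims_size);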

Jason


Re: [PATCH v5 04/14] vfio/mdev: idxd: Add auxialary device plumbing for idxd mdev support

2021-02-10 Thread Jason Gunthorpe
On Fri, Feb 05, 2021 at 01:53:18PM -0700, Dave Jiang wrote:
> diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
> index a2438b3166db..f02c96164515 100644
> +++ b/drivers/dma/idxd/idxd.h
> @@ -8,6 +8,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "registers.h"
>  
>  #define IDXD_DRIVER_VERSION  "1.00"
> @@ -221,6 +222,8 @@ struct idxd_device {
>   struct work_struct work;
>  
>   int *int_handles;
> +
> + struct auxiliary_device *mdev_auxdev;
>  };

If there is only one aux device there is not much reason to make it a
dedicated allocation.

>  /* IDXD software descriptor */
> @@ -282,6 +285,10 @@ enum idxd_interrupt_type {
>   IDXD_IRQ_IMS,
>  };
>  
> +struct idxd_mdev_aux_drv {
> + struct auxiliary_driver auxiliary_drv;
> +};

Wrong indent. What is this even for?

> +
>  static inline int idxd_get_wq_portal_offset(enum idxd_portal_prot prot,
>   enum idxd_interrupt_type irq_type)
>  {
> diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
> index ee56b92108d8..fd57f39e4b7d 100644
> +++ b/drivers/dma/idxd/init.c
> @@ -382,6 +382,74 @@ static void idxd_disable_system_pasid(struct idxd_device *idxd)
>   idxd->sva = NULL;
>  }
>  
> +static void idxd_remove_mdev_auxdev(struct idxd_device *idxd)
> +{
> + if (!IS_ENABLED(CONFIG_VFIO_MDEV_IDXD))
> + return;
> +
> + auxiliary_device_delete(idxd->mdev_auxdev);
> + auxiliary_device_uninit(idxd->mdev_auxdev);
> +}
> +
> +static void idxd_auxdev_release(struct device *dev)
> +{
> + struct auxiliary_device *auxdev = to_auxiliary_dev(dev);
> + struct idxd_device *idxd = dev_get_drvdata(dev);

Nope, where did you see drvdata being used like this? You need to use
container_of.

If you put the mdev_auxdev in as a non-pointer member then this is just:

 struct idxd_device *idxd = container_of(dev, struct idxd_device, mdev_auxdev);
 
 put_device(&idxd->conf_dev);

And fix the 'setup' to match this design

> + kfree(auxdev->name);

This is weird, the name shouldn't be allocated, it is supposed to be a
fixed string to make it easy to find the driver name in the code base.

> +static int idxd_setup_mdev_auxdev(struct idxd_device *idxd)
> +{
> + struct auxiliary_device *auxdev;
> + struct device *dev = &idxd->pdev->dev;
> + int rc;
> +
> + if (!IS_ENABLED(CONFIG_VFIO_MDEV_IDXD))
> + return 0;
> +
> + auxdev = kzalloc(sizeof(*auxdev), GFP_KERNEL);
> + if (!auxdev)
> + return -ENOMEM;
> +
> + auxdev->name = kasprintf(GFP_KERNEL, "mdev-%s", idxd_name[idxd->type]);
> + if (!auxdev->name) {
> + rc = -ENOMEM;
> + goto err_name;
> + }
> +
> + dev_dbg(&idxd->pdev->dev, "aux dev mdev: %s\n", auxdev->name);
> +
> + auxdev->dev.parent = dev;
> + auxdev->dev.release = idxd_auxdev_release;
> + auxdev->id = idxd->id;
> +
> + rc = auxiliary_device_init(auxdev);
> + if (rc < 0) {
> + dev_err(dev, "Failed to init aux dev: %d\n", rc);
> + goto err_auxdev;
> + }

Put the init earlier so it can handle the error unwinds

> + rc = auxiliary_device_add(auxdev);
> + if (rc < 0) {
> + dev_err(dev, "Failed to add aux dev: %d\n", rc);
> + goto err_auxdev;
> + }
> +
> + idxd->mdev_auxdev = auxdev;
> + dev_set_drvdata(&auxdev->dev, idxd);

No to using drvdata, and this is in the wrong order anyhow.

> + return 0;
> +
> + err_auxdev:
> + kfree(auxdev->name);
> + err_name:
> + kfree(auxdev);
> + return rc;
> +}
> +
>  static int idxd_probe(struct idxd_device *idxd)
>  {
>   struct pci_dev *pdev = idxd->pdev;
> @@ -434,11 +502,19 @@ static int idxd_probe(struct idxd_device *idxd)
>   goto err_idr_fail;
>   }
>  
> + rc = idxd_setup_mdev_auxdev(idxd);
> + if (rc < 0)
> + goto err_auxdev_fail;
> +
>   idxd->major = idxd_cdev_get_major(idxd);
>  
>   dev_dbg(dev, "IDXD device %d probed successfully\n", idxd->id);
>   return 0;
>  
> + err_auxdev_fail:
> + mutex_lock(&idxd_idr_lock);
> + idr_remove(&idxd_idrs[idxd->type], idxd->id);
> + mutex_unlock(&idxd_idr_lock);

Probably wrong to order things like this..

Also somehow this has a 

idxd = devm_kzalloc(dev, sizeof(struct idxd_device), GFP_KERNEL);

but the idxd has a kref'd struct device in it:

struct idxd_device {
enum idxd_type type;
struct device conf_dev;

So that's not right either

You'll need to fix the lifetime model for idxd_device before you get
to adding auxdevices

> +static int idxd_mdev_host_init(struct idxd_device *idxd)
> +{
> + /* FIXME: Fill in later */
> + return 0;
> +}
> +
> +static int idxd_mdev_host_release(struct idxd_device *idxd)
> +{
> + /* FIXME: Fill in later */
> + return 0;
> +}

Don't leave empty stubs like this, just provide the whole driver in
the next patch

> +static int id

Re: [PATCH v5 05/14] vfio/mdev: idxd: add basic mdev registration and helper functions

2021-02-10 Thread Jason Gunthorpe
On Fri, Feb 05, 2021 at 01:53:24PM -0700, Dave Jiang wrote:

> +static int check_vma(struct idxd_wq *wq, struct vm_area_struct *vma)
>  {
> - /* FIXME: Fill in later */
> + if (vma->vm_end < vma->vm_start)
> + return -EINVAL;

These checks are redundant

> -static int idxd_mdev_host_release(struct idxd_device *idxd)
> +static int idxd_vdcm_mmap(struct mdev_device *mdev, struct vm_area_struct *vma)
> +{
> + unsigned int wq_idx, rc;
> + unsigned long req_size, pgoff = 0, offset;
> + pgprot_t pg_prot;
> + struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
> + struct idxd_wq *wq = vidxd->wq;
> + struct idxd_device *idxd = vidxd->idxd;
> + enum idxd_portal_prot virt_portal, phys_portal;
> + phys_addr_t base = pci_resource_start(idxd->pdev, IDXD_WQ_BAR);
> + struct device *dev = mdev_dev(mdev);
> +
> + rc = check_vma(wq, vma);
> + if (rc)
> + return rc;
> +
> + pg_prot = vma->vm_page_prot;
> + req_size = vma->vm_end - vma->vm_start;
> + vma->vm_flags |= VM_DONTCOPY;
> +
> + offset = (vma->vm_pgoff << PAGE_SHIFT) &
> +  ((1ULL << VFIO_PCI_OFFSET_SHIFT) - 1);
> +
> + wq_idx = offset >> (PAGE_SHIFT + 2);
> + if (wq_idx >= 1) {
> + dev_err(dev, "mapping invalid wq %d off %lx\n",
> + wq_idx, offset);
> + return -EINVAL;
> + }
> +
> + /*
> +  * Check and see if the guest wants to map to the limited or unlimited portal.
> +  * The driver will allow mapping to the unlimited portal only if the wq is a
> +  * dedicated wq. Otherwise, it goes to limited.
> +  */
> + virt_portal = ((offset >> PAGE_SHIFT) & 0x3) == 1;
> + phys_portal = IDXD_PORTAL_LIMITED;
> + if (virt_portal == IDXD_PORTAL_UNLIMITED && wq_dedicated(wq))
> + phys_portal = IDXD_PORTAL_UNLIMITED;
> +
> + /* We always map IMS portals to the guest */
> + pgoff = (base + idxd_get_wq_portal_full_offset(wq->id, phys_portal,
> +IDXD_IRQ_IMS)) >> PAGE_SHIFT;
> + dev_dbg(dev, "mmap %lx %lx %lx %lx\n", vma->vm_start, pgoff, req_size,
> + pgprot_val(pg_prot));
> + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> + vma->vm_private_data = mdev;

What ensures the mdev pointer is valid strictly longer than the VMA?
This needs refcounting.
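
Roughly like this - an untested sketch; the mmap handler would take the
initial reference before installing the ops, and open/close keep it
balanced across forks and splits:

    static void idxd_vdcm_vm_open(struct vm_area_struct *vma)
    {
        struct mdev_device *mdev = vma->vm_private_data;

        get_device(mdev_dev(mdev));
    }

    static void idxd_vdcm_vm_close(struct vm_area_struct *vma)
    {
        struct mdev_device *mdev = vma->vm_private_data;

        put_device(mdev_dev(mdev));
    }

    static const struct vm_operations_struct idxd_vdcm_vm_ops = {
        .open = idxd_vdcm_vm_open,
        .close = idxd_vdcm_vm_close,
    };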

> + vma->vm_pgoff = pgoff;
> +
> + return remap_pfn_range(vma, vma->vm_start, pgoff, req_size, pg_prot);

Nothing validated req_size - did you copy this from the Intel RDMA
driver? It had a huge security bug just like this.
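
The missing check is something like this (assuming the portal mapping
is supposed to be exactly one page):

    if (req_size != PAGE_SIZE)
        return -EINVAL;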
> +
> +static int msix_trigger_unregister(struct vdcm_idxd *vidxd, int index)
> +{
> + struct mdev_device *mdev = vidxd->vdev.mdev;
> + struct device *dev = mdev_dev(mdev);
> + struct ims_irq_entry *irq_entry;
> + int rc;
> +
> + if (!vidxd->vdev.msix_trigger[index])
> + return 0;
> +
> + dev_dbg(dev, "disable MSIX trigger %d\n", index);
> + if (index) {
> + u32 auxval;
> +
> + irq_entry = &vidxd->irq_entries[index];
> + if (irq_entry->irq_set) {
> + free_irq(irq_entry->irq, irq_entry);
> + irq_entry->irq_set = false;
> + }
> +
> + auxval = ims_ctrl_pasid_aux(0, false);
> + rc = irq_set_auxdata(irq_entry->irq, IMS_AUXDATA_CONTROL_WORD, auxval);
> + if (rc)
> + return rc;
> + }
> + eventfd_ctx_put(vidxd->vdev.msix_trigger[index]);
> + vidxd->vdev.msix_trigger[index] = NULL;
> +
> + return 0;
> +}
> +
> +static int msix_trigger_register(struct vdcm_idxd *vidxd, u32 fd, int index)
> +{
> + struct mdev_device *mdev = vidxd->vdev.mdev;
> + struct device *dev = mdev_dev(mdev);
> + struct ims_irq_entry *irq_entry;
> + struct eventfd_ctx *trigger;
> + int rc;
> +
> + if (vidxd->vdev.msix_trigger[index])
> + return 0;
> +
> + dev_dbg(dev, "enable MSIX trigger %d\n", index);
> + trigger = eventfd_ctx_fdget(fd);
> + if (IS_ERR(trigger)) {
> + dev_warn(dev, "eventfd_ctx_fdget failed %d\n", index);
> + return PTR_ERR(trigger);
> + }
> +
> + if (index) {
> + u32 pasid;
> + u32 auxval;
> +
> + irq_entry = &vidxd->irq_entries[index];
> + rc = idxd_mdev_get_pasid(mdev, &pasid);
> + if (rc < 0)
> + return rc;
> +
> + /*
> +  * Program and enable the pasid field in the IMS entry. The programmed
> +  * pasid and enabled field is checked against the pasid and enable field
> +  * for the work queue configuration and the pasid for the descriptor. A
> +  * mismatch will result in a blocked IMS interrupt.
> +  */
> + auxval = ims_ctrl_pasid_aux(pasid, true);
> + rc = irq_set

Re: [PATCH v5 07/14] vfio/mdev: idxd: add 1dwq-v1 mdev type

2021-02-10 Thread Jason Gunthorpe
On Fri, Feb 05, 2021 at 01:53:37PM -0700, Dave Jiang wrote:

> -static const struct mdev_parent_ops idxd_vdcm_ops = {
> +static ssize_t name_show(struct kobject *kobj, struct device *dev, char *buf)
> +{
> + struct vdcm_idxd_type *type;
> +
> + type = idxd_vdcm_find_vidxd_type(dev, kobject_name(kobj));
> +
> + if (type)
> + return sprintf(buf, "%s\n", type->name);
> +
> + return -EINVAL;

Success oriented flow
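
I.e. fail early and keep the success path last (sketch, also using
sysfs_emit() as suggested for new sysfs code):

    type = idxd_vdcm_find_vidxd_type(dev, kobject_name(kobj));
    if (!type)
        return -EINVAL;

    return sysfs_emit(buf, "%s\n", type->name);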

Jason


Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd

2021-02-11 Thread Jason Gunthorpe
On Thu, Feb 11, 2021 at 08:47:03AM +, Christoph Hellwig wrote:
> On Wed, Feb 03, 2021 at 09:54:48AM -0400, Jason Gunthorpe wrote:
> > If people are accepting that these device-specific drivers are
> > required then we need to come to a community consensus to decide what
> > direction to pursue:
> > 
> > * Do we embrace the driver core and use it to load VFIO modules like a
> >   normal subsytem (this RFC)
> > 
> > OR 
> > 
> > * Do we make a driver-core like thing inside the VFIO bus drivers and
> >   have them run their own special driver matching, binding, and loading
> >   scheme. (May RFC)
> > 
> > Haven't heard a 3rd option yet..
> 
> The third option would be to use the driver core to bind the VFIO
> submodules.  Define a new bus for it, which also uses the normal PCI IDs
> for binding, and walk through those VFIO specific drivers when vfio_pci
> is bound to a device.  That would provide a pretty clean abstraction
> and could even keep the existing behavior of say bind to all VGA devices
> with an Intel vendor ID (even if I think that is a bad idea).

I think of this as some variant to the second option above.

Maximally using the driver core to make subdrivers still means the
VFIO side needs to reimplement all the existing matcher logic for PCI
(and it means we don't generalize, so future CXL/etc will need more
VFIO special code). It has to put this stuff on a new special bus and
somehow make names and match tables for it.

It also means we'd have to somehow fix vfio-pci to allow hot
plug/unplug of the subdriver. The driver core doesn't really run
synchronously for binding, so late binding would have to be
accommodated somehow. It feels like adding a lot of complexity for
very little gain to me.

Personally I dislike the subdriver direction of the May RFC quite a
bit, it has a lot of unfixable negatives for the admin side. The first
option does present some challenges for userspace but I believe we can
work through them.

Jason


Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd

2021-02-11 Thread Jason Gunthorpe
On Thu, Feb 11, 2021 at 08:50:21AM +, Christoph Hellwig wrote:
> On Thu, Feb 04, 2021 at 11:12:49AM +0200, Max Gurtovoy wrote:
> > But the PCI function (the bounded BDF) is GPU function or NVLINK function ?
> > 
> > If it's NVLINK function then we should fail probing in the host vfio-pci
> > driver.
> > 
> > if its a GPU function so it shouldn't been called nvlink2 vfio-pci driver.
> > Its just an extension in the GPU vfio-pci driver.
> 
> I suspect the trivial and correct answer is that we should just drop
> the driver entirely.  It is for obsolete hardware that never had
> upstream

The HW is still in active deployment and use. The big system, Summit,
only went to production sometime in 2019, so it is barely started on
its lifecycle. Something around a 5-10 year operational lifetime would
be pretty typical in this area.

> support for even using it in the guest.  It also is the reason for
> keeping cruft in the always built-in powernv platform code alive that
> is otherwise dead wood.

Or stated another way, once vfio-pci supports loadable extensions the
non-upstream hardware could provide the extension it needs out of
tree.

Jason


Re: [PATCH 1/3] vfio: Introduce vma ops registration and notifier

2021-02-17 Thread Jason Gunthorpe
On Fri, Feb 12, 2021 at 05:20:57PM -0400, Jason Gunthorpe wrote:
> On Fri, Feb 12, 2021 at 12:27:39PM -0700, Alex Williamson wrote:
> > Create an interface through vfio-core where a vfio bus driver (ex.
> > vfio-pci) can register the vm_operations_struct it uses to map device
> > memory, along with a set of registration callbacks.  This allows
> > vfio-core to expose interfaces for IOMMU backends to match a
> > vm_area_struct to a bus driver and register a notifier for relevant
> > changes to the device mapping.  For now we define only a notifier
> > action for closing the device.
> > 
> > Signed-off-by: Alex Williamson 
> >  drivers/vfio/vfio.c  |  120 ++
> >  include/linux/vfio.h |   20 
> >  2 files changed, 140 insertions(+)
> > 
> > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > index 38779e6fd80c..568f5e37a95f 100644
> > +++ b/drivers/vfio/vfio.c
> > @@ -47,6 +47,8 @@ static struct vfio {
> > struct cdev group_cdev;
> > dev_t   group_devt;
> > wait_queue_head_t   release_q;
> > +   struct list_headvm_ops_list;
> > +   struct mutexvm_ops_lock;
> >  } vfio;
> >  
> >  struct vfio_iommu_driver {
> > @@ -2354,6 +2356,121 @@ struct iommu_domain *vfio_group_iommu_domain(struct vfio_group *group)
> >  }
> >  EXPORT_SYMBOL_GPL(vfio_group_iommu_domain);
> >  
> > +struct vfio_vma_ops {
> > +   const struct vm_operations_struct   *vm_ops;
> > +   vfio_register_vma_nb_t  *reg_fn;
> > +   vfio_unregister_vma_nb_t*unreg_fn;
> > +   struct list_headnext;
> > +};
> > +
> > +int vfio_register_vma_ops(const struct vm_operations_struct *vm_ops,
> > + vfio_register_vma_nb_t *reg_fn,
> > + vfio_unregister_vma_nb_t *unreg_fn)
> 
> This just feels a little bit too complicated
> 
> I've recently learned from Daniel that we can use the address_space
> machinery to drive the zap_vma_ptes() via unmap_mapping_range(). This
> technique replaces all the open, close and vma_list logic in vfio_pci

Here is my effort to make rdma use this, it removes a lot of ugly code:

https://github.com/jgunthorpe/linux/commits/rdma_addr_space

Still needs some more detailed testing.

This gives an option to detect vfio VMAs by checking

   if (vma->vm_file &&
   file_inode(vma->vm_file) &&
   file_inode(vma->vm_file)->i_sb->s_type == vfio_fs_type)

And all vfio VMA's can have some consistent vm_private_data, or at
worst a consistent extended vm operations struct.

Jason


Re: [PATCH v4] tpm: fix reference counting for struct tpm_chip

2021-02-17 Thread Jason Gunthorpe
On Thu, Feb 18, 2021 at 12:14:11AM +0200, Jarkko Sakkinen wrote:
> On Tue, Feb 16, 2021 at 04:31:26PM +, David Laight wrote:
> > ...
> > > > > + get_device(&chip->dev);
> > > > > + chip->devs.release = tpm_devs_release;
> > > > > + chip->devs.devt =
> > > > > + MKDEV(MAJOR(tpm_devt), chip->dev_num + TPM_NUM_DEVICES);
> > > 
> > > Isn't this less than 100 chars?
> > 
> > Still best kept under 80 if 'reasonable'?
> > 
> > Really it is just split in the wrong place:
> > chip->devs.devt = MKDEV(MAJOR(tpm_devt),
> > chip->dev_num + TPM_NUM_DEVICES);
> 
> 
> Well it looks crap IMHO. Would be more reasonable to have it in a single
> line. And it is legit too, since it is accepted by checkpatch.
> 
> You might break the lines within 80 chars if it is somehow "logically"
> consistent.

FWIW, I've become kind of tired of the style wishy-washyness. I've
mostly been happy to accept anything that clang-format spits out for
ordinary C constructs.

It is good enough and universally usable. If devs don't have it linked
to their editor to format a single expression or selected blocks,
they are missing out :)

The community consensus on style is quite unclear. Is 1 or 2 above the
majority preference? Does this case fall under the new "use more than
80 cols if it improves readability?" I have no idea.

Frankly, for most people writing driver code, if they consistently use
clang-format their work will be a lot better than if they try to do it
by hand. It takes a lot of experience to reliably eyeball something
close to the kernel style..
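
For instance, the kernel tree ships a top-level .clang-format config,
so reformatting a file in place is just:

    $ clang-format -i drivers/infiniband/core/device.c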

Jason


Re: [PATCH 1/2] hugetlb: fix update_and_free_page contig page struct assumption

2021-02-18 Thread Jason Gunthorpe
On Thu, Feb 18, 2021 at 02:45:54PM +, Matthew Wilcox wrote:
> On Wed, Feb 17, 2021 at 11:02:52AM -0800, Andrew Morton wrote:
> > On Wed, 17 Feb 2021 10:49:25 -0800 Mike Kravetz  
> > wrote:
> > > page structs are not guaranteed to be contiguous for gigantic pages.  The
> >
> > June 2014.  That's a long lurk time for a bug.  I wonder if some later
> > commit revealed it.
> 
> I would suggest that gigantic pages have not seen much use.  Certainly
> performance with Intel CPUs on benchmarks that I've been involved with
> showed lower performance with 1GB pages than with 2MB pages until quite
> recently.

I suggested in another thread that maybe it is time to consider
dropping this "feature"

If it has been slightly broken for 7 years it seems a good bet it
isn't actually being used.

The cost to fix GUP to be compatible with this will hurt normal
GUP performance - and again, that nobody has hit this bug in GUP
further suggests the feature isn't used..

Jason 


Re: [PATCH 1/2] hugetlb: fix update_and_free_page contig page struct assumption

2021-02-18 Thread Jason Gunthorpe
On Thu, Feb 18, 2021 at 12:27:58PM -0500, Zi Yan wrote:
> On 18 Feb 2021, at 12:25, Jason Gunthorpe wrote:
> 
> > On Thu, Feb 18, 2021 at 02:45:54PM +, Matthew Wilcox wrote:
> >> On Wed, Feb 17, 2021 at 11:02:52AM -0800, Andrew Morton wrote:
> >>> On Wed, 17 Feb 2021 10:49:25 -0800 Mike Kravetz  
> >>> wrote:
> >>>> page structs are not guaranteed to be contiguous for gigantic pages.  The
> >>>
> >>> June 2014.  That's a long lurk time for a bug.  I wonder if some later
> >>> commit revealed it.
> >>
> >> I would suggest that gigantic pages have not seen much use.  Certainly
> >> performance with Intel CPUs on benchmarks that I've been involved with
> >> showed lower performance with 1GB pages than with 2MB pages until quite
> >> recently.
> >
> > I suggested in another thread that maybe it is time to consider
> > dropping this "feature"
>
> You mean dropping gigantic page support in hugetlb?

No, I mean dropping support for arches that want to do:

   tail_page != head_page + tail_page_nr

because they can't allocate the required page array either virtually
or physically contiguously.
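
For contrast, the always-correct form has to go through the pfn, which
is what would make the GUP hot path more expensive - this is the
generic nth_page() from include/linux/mm.h:

   #define nth_page(page, n) pfn_to_page(page_to_pfn((page)) + (n))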

It seems like quite a burden on the core mm for a very niche, and
maybe even non-existent, case. 

It was originally done for PPC, can these PPC systems use VMEMMAP now?

> > The cost to fix GUP to be compatible with this will hurt normal
> > GUP performance - and again, that nobody has hit this bug in GUP
> > further suggests the feature isn't used..
> 
> An easy fix might be to make gigantic hugetlb pages depend on
> CONFIG_SPARSEMEM_VMEMMAP, which guarantees all struct pages are contiguous.

Yes, exactly.

Jason


Re: [PATCH 1/3] vfio: Introduce vma ops registration and notifier

2021-02-18 Thread Jason Gunthorpe
On Thu, Feb 18, 2021 at 02:56:06PM -0700, Alex Williamson wrote:

> Looks pretty slick.  I won't claim it's fully gelled in my head yet,
> but AIUI you're creating these inodes on your new pseudo fs and
> associating it with the actual user fd via the f_mapping pointer, which
> allows multiple fds to associate an address space back to this inode
> when you want to call unmap_mapping_range().  

Yes, from what I can tell all the fs/inode stuff is just mandatory
overhead to get a unique address_space pointer, as that is the only
thing this is actually using.

I have to check the mmap flow more carefully; I recall pointing out an
existing race here with Daniel, but the general idea should hold.

> That clarifies from the previous email how we'd store the inode on
> the vfio_device without introducing yet another tracking list for
> device fds.

Right, you can tell from the vma what inode it is for, and the inode
can tell you if it is a VFIO VMA or not, so no tracking lists needed
at all.
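
A minimal sketch of the zap side, assuming the vfio_device caches that
pseudo-fs inode in a hypothetical ->inode member (unmap_mapping_range()
is the real API; the rest is illustrative):

/*
 * Zap every user mapping of the device, across all fds sharing the
 * inode's address_space; holelen == 0 means "to the end of the file".
 */
static void vfio_device_zap_mappings(struct vfio_device *vdev)
{
	unmap_mapping_range(vdev->inode->i_mapping, 0, 0, true);
}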

Jason


Re: [PATCH v2 1/4] hmm: Device exclusive memory access

2021-02-19 Thread Jason Gunthorpe
On Fri, Feb 19, 2021 at 09:47:41AM +, Christoph Hellwig wrote:

> > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > index 866a0fa104c4..5d28ff6d4d80 100644
> > +++ b/include/linux/hmm.h
> > @@ -109,6 +109,10 @@ struct hmm_range {
> >   */
> >  int hmm_range_fault(struct hmm_range *range);
> >  
> > +int hmm_exclusive_range(struct mm_struct *mm, unsigned long start,
> > +   unsigned long end, struct page **pages);
> > +vm_fault_t hmm_remove_exclusive_entry(struct vm_fault *vmf);
> 
> Can we avoid the hmm naming for new code (we should probably also kill
> it off for the existing code)?

Yes please, I'd prefer it if hmm.c was just the special page walker
and not a grab bag of unrelated things
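
For reference, the one thing that clearly does belong there is the
range-fault walker; its documented usage shape is roughly (simplified,
retry loop and error handling elided, mmu_interval_notifier setup
assumed done elsewhere):

	struct hmm_range range = {
		.notifier = &interval_sub,
		.start = start,
		.end = end,
		.hmm_pfns = pfns,
	};
	int ret;

	range.notifier_seq = mmu_interval_read_begin(range.notifier);
	mmap_read_lock(mm);
	ret = hmm_range_fault(&range);
	mmap_read_unlock(mm);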

Is there a more natural place in the mm to put this idea?

Jason


Re: [PATCH 1/4] mm: Trial do_wp_page() simplification

2021-02-02 Thread Jason Gunthorpe
On Tue, Feb 02, 2021 at 11:31:27AM -0500, Peter Xu wrote:
> On Tue, Feb 02, 2021 at 04:40:33PM +0200, Gal Pressman wrote:
> > Hi Peter & Jason,
> 
> Hi, Gal, Jason,
> 
> > 
> > It seems the hugetlb part was overlooked?
> > We're testing if the RDMA fork MADV_DONTFORK stuff can be removed on 
> > appropriate
> > kernels, but our tests still fail due to lacking explicit huge pages 
> > support [1].
> 
> I didn't think it high priority only because I think most hugetlbfs users
> should be using it shared, but maybe I'm wrong..  Then it got lost indeed.

It turns out people are doing this:

mmap(NULL, SEND_BUFF_SIZE, PROT_READ | PROT_WRITE,
     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0)

Which makes some sense...

Gal, you could also MADV_DONTFORK this range if you are explicitly
allocating it via a special mmap.
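
For example (a sketch reusing the SEND_BUFF_SIZE allocation above;
needs <sys/mman.h>, error handling trimmed):

	void *buf = mmap(NULL, SEND_BUFF_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (buf != MAP_FAILED)
		madvise(buf, SEND_BUFF_SIZE, MADV_DONTFORK);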

Jason


Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd

2021-02-02 Thread Jason Gunthorpe
On Tue, Feb 02, 2021 at 05:06:59PM +0100, Cornelia Huck wrote:

> On the other side, we have the zdev support, which both requires s390
> and applies to any pci device on s390.

Is there a reason why CONFIG_VFIO_PCI_ZDEV exists? Why not just always
return the s390-specific data in VFIO_DEVICE_GET_INFO when running on
s390?

It would be like returning data from ACPI on other platforms.

It really seems like part of vfio-pci-core

Jason


Re: [PATCH 1/4] mm: Trial do_wp_page() simplification

2021-02-02 Thread Jason Gunthorpe
On Tue, Feb 02, 2021 at 12:05:36PM -0500, Peter Xu wrote:

> > Gal, you could also MADV_DONTFORK this range if you are explicitly
> > allocating it via a special mmap.
> 
> Yeah I wanted to mention this one too but I just forgot when replying: the
> issue thread pasted previously smells like some people would like to drop
> MADV_DONTFORK, but if it can still be applied I don't know why not.

I want to drop MADV_DONTFORK for dynamic data memory allocated by the
application layers (e.g. with malloc) without knowledge of how it will
be used.

This case is a buffer internal to the communication system whose use
we know at allocation time, so an explicit, deliberate MADV_DONTFORK
is fine.

Jason


Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd

2021-02-02 Thread Jason Gunthorpe
On Tue, Feb 02, 2021 at 10:54:55AM -0700, Alex Williamson wrote:

> As noted previously, if we start adding ids for vfio drivers then we
> create conflicts with the native host driver.  We cannot register a
> vfio PCI driver that automatically claims devices.

We can't do that in vfio_pci.ko, but an nvlink_vfio_pci.ko can, just
like the RFC showed with the mlx5 example. The key thing is that the
module is not autoloadable and there is no modules.alias data for the
PCI IDs.

The admin must explicitly load the module, just as the admin must
explicitly do "cat > new_id" today. "modprobe nvlink_vfio_pci" replaces
the "new_id" step, and preloading the correct IDs into the module's
driver makes the entire admin experience much more natural and safe.

This could be improved with some simple work in the driver core:

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 2f32f38a11ed0b..dc3b088ad44d69 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -828,6 +828,9 @@ static int __device_attach_driver(struct device_driver *drv, void *_data)
 	bool async_allowed;
 	int ret;
 
+	if (drv->flags & DRIVER_EXPLICIT_BIND_ONLY)
+		return 0; /* bus_for_each_drv callback: 0 means no match, keep iterating */
+
 	ret = driver_match_device(drv, dev);
 	if (ret == 0) {
 		/* no match */

Thus the match table could be properly specified, but only explicit
manual bind would attach the driver. This would cleanly resolve the
duplicate ID problem, and we could even set a wildcard PCI match table
for vfio_pci and eliminate the new_id part of the sequence.
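
To sketch what such a driver's registration could look like
(DRIVER_EXPLICIT_BIND_ONLY is the hypothetical flag from the hunk
above, which would also require adding a flags member to struct
device_driver; the probe/remove callbacks and the wildcard NVIDIA
match are likewise only illustrative):

static const struct pci_device_id nvlink_vfio_pci_table[] = {
	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID) },
	{}
};

static struct pci_driver nvlink_vfio_pci_driver = {
	.name = "nvlink-vfio-pci",
	.id_table = nvlink_vfio_pci_table,
	.probe = nvlink_vfio_pci_probe,
	.remove = nvlink_vfio_pci_remove,
	/* hypothetical: advertise real IDs, but never autobind */
	.driver = { .flags = DRIVER_EXPLICIT_BIND_ONLY, },
};
module_pci_driver(nvlink_vfio_pci_driver);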

However, I'd prefer to keep any driver core work separate from the
VFIO parts - so I'd propose starting by splitting into vfio_pci_core.ko,
vfio_pci.ko, nvlink_vfio_pci.ko, and igd_vfio_pci.ko, working as above.

For uAPI compatibility, vfio_pci.ko would need some
request_module()/symbol_get() trick to pass control over to the
device-specific module.
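
Roughly along these lines - every name below is invented just to show
the shape of the trick; request_module(), symbol_get() and symbol_put()
are the real primitives:

	/* in vfio_pci.ko, handing a device to the nvlink variant */
	const struct vfio_pci_variant *variant;

	request_module("nvlink_vfio_pci");
	variant = symbol_get(nvlink_vfio_pci_variant);
	if (variant) {
		ret = variant->probe(pdev);
		symbol_put(nvlink_vfio_pci_variant);
	}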

Jason


<    2   3   4   5   6   7   8   9   10   11   >