Re: [PATCH] ib_umem_release should decrement mm-pinned_vm from ib_umem_get

2014-08-28 Thread Haggai Eran
On 26/08/2014 00:07, Shawn Bohrer wrote:
 The following patch fixes the issue by storing the mm_struct of the
  
  You are doing more than just storing the mm_struct - you are taking
  a reference to the process' mm.  This can lead to a massive resource
  leakage. The reason is bit complex: The destruction flow for IB
  uverbs is based upon releasing the file handle for it. Once the file
  handle is released, all MRs, QPs, CQs, PDs, etc. that the process
  allocated are released.  For the kernel to release the file handle,
  the kernel reference count to it needs to reach zero.  Most IB
  implementations expose some hardware registers to the application by
  allowing it to mmap the uverbs device file.  This mmap takes a
  reference to uverbs device file handle that the application opened.
  This reference is dropped when the process mm is released during the
  process destruction.  Your code takes a reference to the mm that
  will only be released when the parent MR/QP is released.
  
  Now, we have a deadlock - the mm is waiting for the MR to be
  destroyed, the MR is waiting for the file handle to be destroyed,
  and the file handle is waiting for the mm to be destroyed.
  
  The proper solution is to keep a reference to the task_pid (using
  get_task_pid), and use this pid to get the task_struct and from it
  the mm_struct during the destruction flow.
  
 I'll put together a patch using get_task_pid() and see if I can
 test/reproduce the issue.  This may take a couple of days since we
 have to test this in production at the moment.
  

Hi,

I just wanted to point out that while working on the on demand paging patches
we also needed to keep a reference to the task pid (to make sure we always 
handle page faults on behalf of the correct mm struct). You can find the 
relevant code in the patch titled IB/core: Add support for on demand paging 
regions [1].

Haggai

[1] https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg20552.html
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] s390/kernel: add system calls for access PCI memory

2014-08-28 Thread Alexey Ishchuk

Hi Shachar,

Thank you for your comments. Please, find my answers below.

Regards,
Alexey Ishchuk

On 08/27/2014 07:18 PM, Shachar Raindel wrote:

Hi Alex,

Few comments on your patch below.


Thanks,
--Shachar


-Original Message-
From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
ow...@vger.kernel.org] On Behalf Of Alexey Ishchuk
Sent: Wednesday, August 27, 2014 1:29 PM
To: linux-rdma@vger.kernel.org
Cc: arlin.r.da...@intel.com; g...@dev.mellanox.co.il; rol...@kernel.org;
linux-s...@vger.kernel.org; gmue...@de.ibm.com;
utz.bac...@de.ibm.com; martin.schwidef...@de.ibm.com;
frank.blasc...@de.ibm.com; Alexey Ishchuk
Subject: [PATCH] s390/kernel: add system calls for access PCI memory

Add the new __NR_s390_pci_mmio_write and __NR_s390_pci_mmio_read
system calls to allow user space applications to access device PCI I/O


Why do you need this to be a special syscall for this functionality? If S390 
platform supports mapping MMIO pages to the user space? If this must happen in 
kernel, it should be provided as a device file (probably character), on which 
writes or ioctls does mmio_write, and reads or ioctl does mmio_reads.


The special syscall is needed for this functionality because the s390
platform does not really support MMIO page mappings to user space. The
PCI memory itself on this platform can be accessed only using special
privileged CPU instructions. Because of that the user space program
cannot simply store the data into a mapped memory that belongs to the
PCI memory address space.

To get as close to the programming model used by normal architectures
the PCI MMIO address is stored into the page table for the user space
process as if the process could store to the memory. But in fact it can
not, every access to this memory area would create an exception.

To write data to the PCI memory address space  the user space program on
the s390 platform needs to call into the kernel to have these privileged
PCI instructions executed. A system call is a straight-forward way to do
this.

I tried to provide a new Infiniband verb command for data writing to PCI
memory but that approach was rejected by the community. My tests shown
that the syscall approach implementation works noticeably faster than
new verb command way. Concerning the security issues, please, see my
considerations below.


memory pages on s390x platform.

Signed-off-by: Alexey Ishchuk aishc...@linux.vnet.ibm.com
---
  arch/s390/include/uapi/asm/unistd.h |   4 +-
  arch/s390/kernel/Makefile   |   1 +
  arch/s390/kernel/entry.h|   4 +
  arch/s390/kernel/pci_mmio.c | 197

  arch/s390/kernel/syscalls.S |   2 +
  5 files changed, 207 insertions(+), 1 deletion(-)
  create mode 100644 arch/s390/kernel/pci_mmio.c

diff --git a/arch/s390/include/uapi/asm/unistd.h
b/arch/s390/include/uapi/asm/unistd.h
index 3802d2d..ab49d1d 100644
--- a/arch/s390/include/uapi/asm/unistd.h
+++ b/arch/s390/include/uapi/asm/unistd.h
@@ -283,7 +283,9 @@
  #define __NR_sched_setattr345
  #define __NR_sched_getattr346
  #define __NR_renameat2347
-#define NR_syscalls 348
+#define __NR_s390_pci_mmio_write   348
+#define __NR_s390_pci_mmio_read349
+#define NR_syscalls 350

  /*
   * There are some system calls that are not present on 64 bit, some
diff --git a/arch/s390/kernel/Makefile b/arch/s390/kernel/Makefile
index 8c2518f..44e8fbb 100644
--- a/arch/s390/kernel/Makefile
+++ b/arch/s390/kernel/Makefile
@@ -62,6 +62,7 @@ ifdef CONFIG_64BIT
  obj-$(CONFIG_PERF_EVENTS) += perf_event.o perf_cpum_cf.o
perf_cpum_sf.o \
perf_cpum_cf_events.o
  obj-y += runtime_instr.o cache.o
+obj-y  += pci_mmio.o
  endif

  # vdso
diff --git a/arch/s390/kernel/entry.h b/arch/s390/kernel/entry.h
index 6ac7819..a36b6f9 100644
--- a/arch/s390/kernel/entry.h
+++ b/arch/s390/kernel/entry.h
@@ -70,4 +70,8 @@ struct old_sigaction;
  long sys_s390_personality(unsigned int personality);
  long sys_s390_runtime_instr(int command, int signum);

+long sys_s390_pci_mmio_write(const unsigned long mmio_addr,
+const void *user_buffer, const size_t length);
+long sys_s390_pci_mmio_read(const unsigned long mmio_addr,
+   void *user_buffer, const size_t length);
  #endif /* _ENTRY_H */
diff --git a/arch/s390/kernel/pci_mmio.c b/arch/s390/kernel/pci_mmio.c
new file mode 100644
index 000..4539d23
--- /dev/null
+++ b/arch/s390/kernel/pci_mmio.c
@@ -0,0 +1,197 @@
+/*
+ * Copyright IBM Corp. 2014
+ */
+#include linux/kernel.h
+#include linux/syscalls.h
+#include linux/init.h
+#include linux/mm.h
+#include linux/errno.h
+#include linux/pci.h
+
+union value_buffer {
+   u8 buf8;
+   u16 buf16;
+   u32 buf32;
+   u64 buf64;
+   u8 buf_large[64];
+};
+
+static long get_pfn(const unsigned long user_addr,
+   

Re: mlx5 + SRP: max_qp_sz mismatch

2014-08-28 Thread Bart Van Assche
On 08/27/14 13:28, Eli Cohen wrote:
 On 08/26/14 18:10, Sagi Grimberg wrote:

 Since I don't know how true send queue size can be computed from the 
 device capabilities at the moment -I can suggest a fix to srpt to 
 retry with srp_sq_size/2 (ans so on until it succeeds...)

 The device capabilities provide the maximum number of send work
 requests that the device supports but the actual number of work
 requests that can be supported in a specific case depends on other
 characteristics of the work requests. For example, in the case of
 Connect-IB, the actual number depends on the number of s/g entries,
 the transport type, etc. This is in compliance with the IB spec:
 
 11.2.1.2 QUERY HCA
 Description:
 Returns the attributes for the specified HCA.
 The maximum values defined in this section are guaranteed
 not-to-exceed values. It is possible for an implementation to allocate
 some HCA resources from the same space. In that case, the maximum
 values returned are not guaranteed for all of those resources
 simultaneously.
 
 So, a well written application should try smaller values if it fails
 with ENOMEM.
 
Hello Mark,

It would help if you could test the patch below. Sorry but I don't
have access to a ConnectIB setup myself.

Thanks,

Bart.

Reported-by: Mark Lehrer leh...@gmail.com
Signed-off-by: Bart Van Assche bvanass...@acm.org
---
 drivers/infiniband/ulp/srpt/ib_srpt.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.c 
b/drivers/infiniband/ulp/srpt/ib_srpt.c
index fe09f27..3ffaf4e 100644
--- a/drivers/infiniband/ulp/srpt/ib_srpt.c
+++ b/drivers/infiniband/ulp/srpt/ib_srpt.c
@@ -2091,6 +2091,7 @@ static int srpt_create_ch_ib(struct srpt_rdma_ch *ch)
if (!qp_init)
goto out;
 
+retry:
ch-cq = ib_create_cq(sdev-device, srpt_completion, NULL, ch,
  ch-rq_size + srp_sq_size, 0);
if (IS_ERR(ch-cq)) {
@@ -2114,6 +2115,13 @@ static int srpt_create_ch_ib(struct srpt_rdma_ch *ch)
ch-qp = ib_create_qp(sdev-pd, qp_init);
if (IS_ERR(ch-qp)) {
ret = PTR_ERR(ch-qp);
+   if (ret == -ENOMEM) {
+   srp_sq_size /= 2;
+   if (srp_sq_size = MIN_SRPT_SQ_SIZE) {
+   ib_destroy_cq(ch-cq);
+   goto retry;
+   }
+   }
printk(KERN_ERR failed to create_qp ret= %d\n, ret);
goto err_destroy_cq;
}
-- 
1.8.4.5


--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH libibverbs V5 2/2] Use neighbour lookup for RoCE UD QPs Eth L2 resolution

2014-08-28 Thread Matan Barak

On 28/8/2014 12:48 PM, Devesh Sharma wrote:

Hi Matan,

I have been watching this thread for quite some time. I have a
Basic question, do you think ib_uverbs_create_ah() in uverbs_cmd.c
Should resolve to l2 address? Presently it is not calling 
rdma_addr_find_dmac_by_grh(), am I missing something here?

-Regards
  Devesh



Hi Devesh,

Some vendors don't call ib_uverbs_create_ah and do all this creation in 
userspace only. It's true that it might be a lot easier to do that 
resolution in kernel, but it could create dependency of new versions of 
libibverbs and the provider library tn new kernels only.

I would like to avoid creating such a dependency.

Matan


-Original Message-
From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
ow...@vger.kernel.org] On Behalf Of Jason Gunthorpe
Sent: Tuesday, August 26, 2014 9:39 PM
To: Or Gerlitz
Cc: Doug Ledford; Matan Barak; Roland Dreier; Yishai Hadas; linux-
r...@vger.kernel.org
Subject: Re: [PATCH libibverbs V5 2/2] Use neighbour lookup for RoCE UD
QPs Eth L2 resolution

On Tue, Aug 26, 2014 at 03:18:37PM +0300, Or Gerlitz wrote:

On 25/08/2014 22:33, Doug Ledford wrote:

On Aug 25, 2014, at 1:33 PM, Jason Gunthorpe

jguntho...@obsidianresearch.com wrote:



+timer_fd = timerfd_create(CLOCK_MONOTONIC,

TFD_NONBLOCK | TFD_CLOEXEC);

+if (-1 == timer_fd) {
+print_err(Couldn't create timer\n);
+return timer_fd;
+}


The use of timerfd will impact the minimum OS version, have you
checked this is OK? Does RHEL5 still work?



It was added in linux v2.6.25. I think that an API that's more than
6.5 years old is valid.


RHEL5 is using 2.6.18 as their base kernel. You should at least
consult with the OFED people to determine if this is a problem for
them.


Please don't.  This code should not be changed for something as ancient

as rhel5.


Indeed. Telling people to avoid using constructs/mechanisms ~6-7 years
after they were introduced isn't something we want nor need to do.


I looked myself and it looks like OFED has dropped support for these old
distros so there isn't any problem.

However, I still think this use of timerfd is fairly gratuitous, and looking 
closer,
causes little bugs:

+   if (timerfd_settime(timer_fd, 0, timer_time, NULL)) {
+ print_err(Couldn't set timer\n);
+   return -1;
  ^^
leaks timer_fd


Alos, I noticed:

+   /* wait for an incoming message on the netlink socket */
+  ret = select(nfds, fdset, NULL, NULL, NULL);
+
+   if (ret) {

Fails to detect error return from select.

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in the
body of a message to majord...@vger.kernel.org More majordomo info at
http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH libibverbs V5 2/2] Use neighbour lookup for RoCE UD QPs Eth L2 resolution

2014-08-28 Thread Matan Barak

On 26/8/2014 7:08 PM, Jason Gunthorpe wrote:

On Tue, Aug 26, 2014 at 03:18:37PM +0300, Or Gerlitz wrote:

On 25/08/2014 22:33, Doug Ledford wrote:

On Aug 25, 2014, at 1:33 PM, Jason Gunthorpe jguntho...@obsidianresearch.com 
wrote:


+timer_fd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK | TFD_CLOEXEC);
+if (-1 == timer_fd) {
+print_err(Couldn't create timer\n);
+return timer_fd;
+}


The use of timerfd will impact the minimum OS version, have you
checked this is OK? Does RHEL5 still work?



It was added in linux v2.6.25. I think that an API that's more than
6.5 years old is valid.


RHEL5 is using 2.6.18 as their base kernel. You should at least
consult with the OFED people to determine if this is a problem for
them.


Please don't.  This code should not be changed for something as ancient as 
rhel5.


Indeed. Telling people to avoid using constructs/mechanisms ~6-7
years after they were introduced isn't something we want nor need to
do.


I looked myself and it looks like OFED has dropped support for these
old distros so there isn't any problem.

However, I still think this use of timerfd is fairly gratuitous, and
looking closer, causes little bugs:

+   if (timerfd_settime(timer_fd, 0, timer_time, NULL)) {
+ print_err(Couldn't set timer\n);
+   return -1;
  ^^
leaks timer_fd


Alos, I noticed:

+   /* wait for an incoming message on the netlink socket */
+  ret = select(nfds, fdset, NULL, NULL, NULL);
+
+   if (ret) {

Fails to detect error return from select.

Jason



Thanks, I'll fix those issues too.

Matan
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


mlx4 query in sriov mode

2014-08-28 Thread Bob Biloxi
Hi All,


I really appreciate this wonderful community which has immensely
helped me broaden my knowledge and understanding.


I was going through the mlx4 sriov code, trying to understand the
communication between the VF driver and the PF driver.

I was having a few queries..hoping to get a better understanding.


As I understand, the commands are communicated between VF and PF
through a mechanism called communication channel. VF writes to
specific address in its BAR space, PF gets an event and then proceeds
ahead to read the command from its BAR space and then complete the
execution of it..


Now, my query is, lets say the VF driver is not yet present and only
the PF driver is there...

In this case, can we simulate a VF command write and get notified
through an event?

For eg. we write to some offset in the PF BAR space itself upon
completion of which, an event is generated because of the write? kind
of like loopback mechanism.


I searched through the code but couldn't find anywhere.

Can anyone please help me understand if this is possible? And if there
is any location in the code where i can find this?

Thanks a lot in advance!!


Best Regards,
Bob
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH libibverbs V5 2/2] Use neighbour lookup for RoCE UD QPs Eth L2 resolution

2014-08-28 Thread Devesh Sharma
Hi Matan,

Thanks for quick response, my further comments are inline below:

 -Original Message-
 From: Matan Barak [mailto:mat...@mellanox.com]
 Sent: Thursday, August 28, 2014 9:51 PM
 To: Devesh Sharma; Jason Gunthorpe; Or Gerlitz
 Cc: Doug Ledford; Roland Dreier; Yishai Hadas; linux-rdma@vger.kernel.org
 Subject: Re: [PATCH libibverbs V5 2/2] Use neighbour lookup for RoCE UD
 QPs Eth L2 resolution
 
 On 28/8/2014 12:48 PM, Devesh Sharma wrote:
  Hi Matan,
 
  I have been watching this thread for quite some time. I have a Basic
  question, do you think ib_uverbs_create_ah() in uverbs_cmd.c Should
  resolve to l2 address? Presently it is not calling
 rdma_addr_find_dmac_by_grh(), am I missing something here?
 
  -Regards
Devesh
 
 
 Hi Devesh,
 
 Some vendors don't call ib_uverbs_create_ah and do all this creation in
 userspace only. It's true that it might be a lot easier to do that resolution 
 in
 kernel, but it could create dependency of new versions of libibverbs and the
Which dependency you are specifying here, I see the scheme posted in this 
series adds dependency on libibverbs.
Do you anticipate modification even when uverbs interface is used?
 provider library tn new kernels only.
 I would like to avoid creating such a dependency.

Okay, got it, any suggestions for those who use uverbs interface.
I think another patch is needed for such vendors?

 
 Matan
 
  -Original Message-
  From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
  ow...@vger.kernel.org] On Behalf Of Jason Gunthorpe
  Sent: Tuesday, August 26, 2014 9:39 PM
  To: Or Gerlitz
  Cc: Doug Ledford; Matan Barak; Roland Dreier; Yishai Hadas; linux-
  r...@vger.kernel.org
  Subject: Re: [PATCH libibverbs V5 2/2] Use neighbour lookup for RoCE
  UD QPs Eth L2 resolution
 
  On Tue, Aug 26, 2014 at 03:18:37PM +0300, Or Gerlitz wrote:
  On 25/08/2014 22:33, Doug Ledford wrote:
  On Aug 25, 2014, at 1:33 PM, Jason Gunthorpe
  jguntho...@obsidianresearch.com wrote:
 
  +timer_fd = timerfd_create(CLOCK_MONOTONIC,
  TFD_NONBLOCK | TFD_CLOEXEC);
  +if (-1 == timer_fd) {
  +print_err(Couldn't create timer\n);
  +return timer_fd;
  +}
 
  The use of timerfd will impact the minimum OS version, have you
  checked this is OK? Does RHEL5 still work?
 
  It was added in linux v2.6.25. I think that an API that's more
  than
  6.5 years old is valid.
 
  RHEL5 is using 2.6.18 as their base kernel. You should at least
  consult with the OFED people to determine if this is a problem for
  them.
 
  Please don't.  This code should not be changed for something as
  ancient
  as rhel5.
 
  Indeed. Telling people to avoid using constructs/mechanisms ~6-7
  years after they were introduced isn't something we want nor need to
 do.
 
  I looked myself and it looks like OFED has dropped support for these
  old distros so there isn't any problem.
 
  However, I still think this use of timerfd is fairly gratuitous, and
  looking closer, causes little bugs:
 
  +  if (timerfd_settime(timer_fd, 0, timer_time, NULL)) {
  +print_err(Couldn't set timer\n);
  +  return -1;
^^
  leaks timer_fd
 
 
  Alos, I noticed:
 
  +  /* wait for an incoming message on the netlink socket */
  + ret = select(nfds, fdset, NULL, NULL, NULL);
  +
  +  if (ret) {
 
  Fails to detect error return from select.
 
  Jason
  --
  To unsubscribe from this list: send the line unsubscribe linux-rdma
  in the body of a message to majord...@vger.kernel.org More
 majordomo
  info at http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html