Re: [PATCH] ib_umem_release should decrement mm-pinned_vm from ib_umem_get
On 26/08/2014 00:07, Shawn Bohrer wrote: The following patch fixes the issue by storing the mm_struct of the You are doing more than just storing the mm_struct - you are taking a reference to the process' mm. This can lead to a massive resource leakage. The reason is bit complex: The destruction flow for IB uverbs is based upon releasing the file handle for it. Once the file handle is released, all MRs, QPs, CQs, PDs, etc. that the process allocated are released. For the kernel to release the file handle, the kernel reference count to it needs to reach zero. Most IB implementations expose some hardware registers to the application by allowing it to mmap the uverbs device file. This mmap takes a reference to uverbs device file handle that the application opened. This reference is dropped when the process mm is released during the process destruction. Your code takes a reference to the mm that will only be released when the parent MR/QP is released. Now, we have a deadlock - the mm is waiting for the MR to be destroyed, the MR is waiting for the file handle to be destroyed, and the file handle is waiting for the mm to be destroyed. The proper solution is to keep a reference to the task_pid (using get_task_pid), and use this pid to get the task_struct and from it the mm_struct during the destruction flow. I'll put together a patch using get_task_pid() and see if I can test/reproduce the issue. This may take a couple of days since we have to test this in production at the moment. Hi, I just wanted to point out that while working on the on demand paging patches we also needed to keep a reference to the task pid (to make sure we always handle page faults on behalf of the correct mm struct). You can find the relevant code in the patch titled IB/core: Add support for on demand paging regions [1]. Haggai [1] https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg20552.html -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] s390/kernel: add system calls for access PCI memory
Hi Shachar, Thank you for your comments. Please, find my answers below. Regards, Alexey Ishchuk On 08/27/2014 07:18 PM, Shachar Raindel wrote: Hi Alex, Few comments on your patch below. Thanks, --Shachar -Original Message- From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma- ow...@vger.kernel.org] On Behalf Of Alexey Ishchuk Sent: Wednesday, August 27, 2014 1:29 PM To: linux-rdma@vger.kernel.org Cc: arlin.r.da...@intel.com; g...@dev.mellanox.co.il; rol...@kernel.org; linux-s...@vger.kernel.org; gmue...@de.ibm.com; utz.bac...@de.ibm.com; martin.schwidef...@de.ibm.com; frank.blasc...@de.ibm.com; Alexey Ishchuk Subject: [PATCH] s390/kernel: add system calls for access PCI memory Add the new __NR_s390_pci_mmio_write and __NR_s390_pci_mmio_read system calls to allow user space applications to access device PCI I/O Why do you need this to be a special syscall for this functionality? If S390 platform supports mapping MMIO pages to the user space? If this must happen in kernel, it should be provided as a device file (probably character), on which writes or ioctls does mmio_write, and reads or ioctl does mmio_reads. The special syscall is needed for this functionality because the s390 platform does not really support MMIO page mappings to user space. The PCI memory itself on this platform can be accessed only using special privileged CPU instructions. Because of that the user space program cannot simply store the data into a mapped memory that belongs to the PCI memory address space. To get as close to the programming model used by normal architectures the PCI MMIO address is stored into the page table for the user space process as if the process could store to the memory. But in fact it can not, every access to this memory area would create an exception. To write data to the PCI memory address space the user space program on the s390 platform needs to call into the kernel to have these privileged PCI instructions executed. A system call is a straight-forward way to do this. I tried to provide a new Infiniband verb command for data writing to PCI memory but that approach was rejected by the community. My tests shown that the syscall approach implementation works noticeably faster than new verb command way. Concerning the security issues, please, see my considerations below. memory pages on s390x platform. Signed-off-by: Alexey Ishchuk aishc...@linux.vnet.ibm.com --- arch/s390/include/uapi/asm/unistd.h | 4 +- arch/s390/kernel/Makefile | 1 + arch/s390/kernel/entry.h| 4 + arch/s390/kernel/pci_mmio.c | 197 arch/s390/kernel/syscalls.S | 2 + 5 files changed, 207 insertions(+), 1 deletion(-) create mode 100644 arch/s390/kernel/pci_mmio.c diff --git a/arch/s390/include/uapi/asm/unistd.h b/arch/s390/include/uapi/asm/unistd.h index 3802d2d..ab49d1d 100644 --- a/arch/s390/include/uapi/asm/unistd.h +++ b/arch/s390/include/uapi/asm/unistd.h @@ -283,7 +283,9 @@ #define __NR_sched_setattr345 #define __NR_sched_getattr346 #define __NR_renameat2347 -#define NR_syscalls 348 +#define __NR_s390_pci_mmio_write 348 +#define __NR_s390_pci_mmio_read349 +#define NR_syscalls 350 /* * There are some system calls that are not present on 64 bit, some diff --git a/arch/s390/kernel/Makefile b/arch/s390/kernel/Makefile index 8c2518f..44e8fbb 100644 --- a/arch/s390/kernel/Makefile +++ b/arch/s390/kernel/Makefile @@ -62,6 +62,7 @@ ifdef CONFIG_64BIT obj-$(CONFIG_PERF_EVENTS) += perf_event.o perf_cpum_cf.o perf_cpum_sf.o \ perf_cpum_cf_events.o obj-y += runtime_instr.o cache.o +obj-y += pci_mmio.o endif # vdso diff --git a/arch/s390/kernel/entry.h b/arch/s390/kernel/entry.h index 6ac7819..a36b6f9 100644 --- a/arch/s390/kernel/entry.h +++ b/arch/s390/kernel/entry.h @@ -70,4 +70,8 @@ struct old_sigaction; long sys_s390_personality(unsigned int personality); long sys_s390_runtime_instr(int command, int signum); +long sys_s390_pci_mmio_write(const unsigned long mmio_addr, +const void *user_buffer, const size_t length); +long sys_s390_pci_mmio_read(const unsigned long mmio_addr, + void *user_buffer, const size_t length); #endif /* _ENTRY_H */ diff --git a/arch/s390/kernel/pci_mmio.c b/arch/s390/kernel/pci_mmio.c new file mode 100644 index 000..4539d23 --- /dev/null +++ b/arch/s390/kernel/pci_mmio.c @@ -0,0 +1,197 @@ +/* + * Copyright IBM Corp. 2014 + */ +#include linux/kernel.h +#include linux/syscalls.h +#include linux/init.h +#include linux/mm.h +#include linux/errno.h +#include linux/pci.h + +union value_buffer { + u8 buf8; + u16 buf16; + u32 buf32; + u64 buf64; + u8 buf_large[64]; +}; + +static long get_pfn(const unsigned long user_addr, +
Re: mlx5 + SRP: max_qp_sz mismatch
On 08/27/14 13:28, Eli Cohen wrote: On 08/26/14 18:10, Sagi Grimberg wrote: Since I don't know how true send queue size can be computed from the device capabilities at the moment -I can suggest a fix to srpt to retry with srp_sq_size/2 (ans so on until it succeeds...) The device capabilities provide the maximum number of send work requests that the device supports but the actual number of work requests that can be supported in a specific case depends on other characteristics of the work requests. For example, in the case of Connect-IB, the actual number depends on the number of s/g entries, the transport type, etc. This is in compliance with the IB spec: 11.2.1.2 QUERY HCA Description: Returns the attributes for the specified HCA. The maximum values defined in this section are guaranteed not-to-exceed values. It is possible for an implementation to allocate some HCA resources from the same space. In that case, the maximum values returned are not guaranteed for all of those resources simultaneously. So, a well written application should try smaller values if it fails with ENOMEM. Hello Mark, It would help if you could test the patch below. Sorry but I don't have access to a ConnectIB setup myself. Thanks, Bart. Reported-by: Mark Lehrer leh...@gmail.com Signed-off-by: Bart Van Assche bvanass...@acm.org --- drivers/infiniband/ulp/srpt/ib_srpt.c | 8 1 file changed, 8 insertions(+) diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.c b/drivers/infiniband/ulp/srpt/ib_srpt.c index fe09f27..3ffaf4e 100644 --- a/drivers/infiniband/ulp/srpt/ib_srpt.c +++ b/drivers/infiniband/ulp/srpt/ib_srpt.c @@ -2091,6 +2091,7 @@ static int srpt_create_ch_ib(struct srpt_rdma_ch *ch) if (!qp_init) goto out; +retry: ch-cq = ib_create_cq(sdev-device, srpt_completion, NULL, ch, ch-rq_size + srp_sq_size, 0); if (IS_ERR(ch-cq)) { @@ -2114,6 +2115,13 @@ static int srpt_create_ch_ib(struct srpt_rdma_ch *ch) ch-qp = ib_create_qp(sdev-pd, qp_init); if (IS_ERR(ch-qp)) { ret = PTR_ERR(ch-qp); + if (ret == -ENOMEM) { + srp_sq_size /= 2; + if (srp_sq_size = MIN_SRPT_SQ_SIZE) { + ib_destroy_cq(ch-cq); + goto retry; + } + } printk(KERN_ERR failed to create_qp ret= %d\n, ret); goto err_destroy_cq; } -- 1.8.4.5 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH libibverbs V5 2/2] Use neighbour lookup for RoCE UD QPs Eth L2 resolution
On 28/8/2014 12:48 PM, Devesh Sharma wrote: Hi Matan, I have been watching this thread for quite some time. I have a Basic question, do you think ib_uverbs_create_ah() in uverbs_cmd.c Should resolve to l2 address? Presently it is not calling rdma_addr_find_dmac_by_grh(), am I missing something here? -Regards Devesh Hi Devesh, Some vendors don't call ib_uverbs_create_ah and do all this creation in userspace only. It's true that it might be a lot easier to do that resolution in kernel, but it could create dependency of new versions of libibverbs and the provider library tn new kernels only. I would like to avoid creating such a dependency. Matan -Original Message- From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma- ow...@vger.kernel.org] On Behalf Of Jason Gunthorpe Sent: Tuesday, August 26, 2014 9:39 PM To: Or Gerlitz Cc: Doug Ledford; Matan Barak; Roland Dreier; Yishai Hadas; linux- r...@vger.kernel.org Subject: Re: [PATCH libibverbs V5 2/2] Use neighbour lookup for RoCE UD QPs Eth L2 resolution On Tue, Aug 26, 2014 at 03:18:37PM +0300, Or Gerlitz wrote: On 25/08/2014 22:33, Doug Ledford wrote: On Aug 25, 2014, at 1:33 PM, Jason Gunthorpe jguntho...@obsidianresearch.com wrote: +timer_fd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK | TFD_CLOEXEC); +if (-1 == timer_fd) { +print_err(Couldn't create timer\n); +return timer_fd; +} The use of timerfd will impact the minimum OS version, have you checked this is OK? Does RHEL5 still work? It was added in linux v2.6.25. I think that an API that's more than 6.5 years old is valid. RHEL5 is using 2.6.18 as their base kernel. You should at least consult with the OFED people to determine if this is a problem for them. Please don't. This code should not be changed for something as ancient as rhel5. Indeed. Telling people to avoid using constructs/mechanisms ~6-7 years after they were introduced isn't something we want nor need to do. I looked myself and it looks like OFED has dropped support for these old distros so there isn't any problem. However, I still think this use of timerfd is fairly gratuitous, and looking closer, causes little bugs: + if (timerfd_settime(timer_fd, 0, timer_time, NULL)) { + print_err(Couldn't set timer\n); + return -1; ^^ leaks timer_fd Alos, I noticed: + /* wait for an incoming message on the netlink socket */ + ret = select(nfds, fdset, NULL, NULL, NULL); + + if (ret) { Fails to detect error return from select. Jason -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH libibverbs V5 2/2] Use neighbour lookup for RoCE UD QPs Eth L2 resolution
On 26/8/2014 7:08 PM, Jason Gunthorpe wrote: On Tue, Aug 26, 2014 at 03:18:37PM +0300, Or Gerlitz wrote: On 25/08/2014 22:33, Doug Ledford wrote: On Aug 25, 2014, at 1:33 PM, Jason Gunthorpe jguntho...@obsidianresearch.com wrote: +timer_fd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK | TFD_CLOEXEC); +if (-1 == timer_fd) { +print_err(Couldn't create timer\n); +return timer_fd; +} The use of timerfd will impact the minimum OS version, have you checked this is OK? Does RHEL5 still work? It was added in linux v2.6.25. I think that an API that's more than 6.5 years old is valid. RHEL5 is using 2.6.18 as their base kernel. You should at least consult with the OFED people to determine if this is a problem for them. Please don't. This code should not be changed for something as ancient as rhel5. Indeed. Telling people to avoid using constructs/mechanisms ~6-7 years after they were introduced isn't something we want nor need to do. I looked myself and it looks like OFED has dropped support for these old distros so there isn't any problem. However, I still think this use of timerfd is fairly gratuitous, and looking closer, causes little bugs: + if (timerfd_settime(timer_fd, 0, timer_time, NULL)) { + print_err(Couldn't set timer\n); + return -1; ^^ leaks timer_fd Alos, I noticed: + /* wait for an incoming message on the netlink socket */ + ret = select(nfds, fdset, NULL, NULL, NULL); + + if (ret) { Fails to detect error return from select. Jason Thanks, I'll fix those issues too. Matan -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
mlx4 query in sriov mode
Hi All, I really appreciate this wonderful community which has immensely helped me broaden my knowledge and understanding. I was going through the mlx4 sriov code, trying to understand the communication between the VF driver and the PF driver. I was having a few queries..hoping to get a better understanding. As I understand, the commands are communicated between VF and PF through a mechanism called communication channel. VF writes to specific address in its BAR space, PF gets an event and then proceeds ahead to read the command from its BAR space and then complete the execution of it.. Now, my query is, lets say the VF driver is not yet present and only the PF driver is there... In this case, can we simulate a VF command write and get notified through an event? For eg. we write to some offset in the PF BAR space itself upon completion of which, an event is generated because of the write? kind of like loopback mechanism. I searched through the code but couldn't find anywhere. Can anyone please help me understand if this is possible? And if there is any location in the code where i can find this? Thanks a lot in advance!! Best Regards, Bob -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH libibverbs V5 2/2] Use neighbour lookup for RoCE UD QPs Eth L2 resolution
Hi Matan, Thanks for quick response, my further comments are inline below: -Original Message- From: Matan Barak [mailto:mat...@mellanox.com] Sent: Thursday, August 28, 2014 9:51 PM To: Devesh Sharma; Jason Gunthorpe; Or Gerlitz Cc: Doug Ledford; Roland Dreier; Yishai Hadas; linux-rdma@vger.kernel.org Subject: Re: [PATCH libibverbs V5 2/2] Use neighbour lookup for RoCE UD QPs Eth L2 resolution On 28/8/2014 12:48 PM, Devesh Sharma wrote: Hi Matan, I have been watching this thread for quite some time. I have a Basic question, do you think ib_uverbs_create_ah() in uverbs_cmd.c Should resolve to l2 address? Presently it is not calling rdma_addr_find_dmac_by_grh(), am I missing something here? -Regards Devesh Hi Devesh, Some vendors don't call ib_uverbs_create_ah and do all this creation in userspace only. It's true that it might be a lot easier to do that resolution in kernel, but it could create dependency of new versions of libibverbs and the Which dependency you are specifying here, I see the scheme posted in this series adds dependency on libibverbs. Do you anticipate modification even when uverbs interface is used? provider library tn new kernels only. I would like to avoid creating such a dependency. Okay, got it, any suggestions for those who use uverbs interface. I think another patch is needed for such vendors? Matan -Original Message- From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma- ow...@vger.kernel.org] On Behalf Of Jason Gunthorpe Sent: Tuesday, August 26, 2014 9:39 PM To: Or Gerlitz Cc: Doug Ledford; Matan Barak; Roland Dreier; Yishai Hadas; linux- r...@vger.kernel.org Subject: Re: [PATCH libibverbs V5 2/2] Use neighbour lookup for RoCE UD QPs Eth L2 resolution On Tue, Aug 26, 2014 at 03:18:37PM +0300, Or Gerlitz wrote: On 25/08/2014 22:33, Doug Ledford wrote: On Aug 25, 2014, at 1:33 PM, Jason Gunthorpe jguntho...@obsidianresearch.com wrote: +timer_fd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK | TFD_CLOEXEC); +if (-1 == timer_fd) { +print_err(Couldn't create timer\n); +return timer_fd; +} The use of timerfd will impact the minimum OS version, have you checked this is OK? Does RHEL5 still work? It was added in linux v2.6.25. I think that an API that's more than 6.5 years old is valid. RHEL5 is using 2.6.18 as their base kernel. You should at least consult with the OFED people to determine if this is a problem for them. Please don't. This code should not be changed for something as ancient as rhel5. Indeed. Telling people to avoid using constructs/mechanisms ~6-7 years after they were introduced isn't something we want nor need to do. I looked myself and it looks like OFED has dropped support for these old distros so there isn't any problem. However, I still think this use of timerfd is fairly gratuitous, and looking closer, causes little bugs: + if (timerfd_settime(timer_fd, 0, timer_time, NULL)) { +print_err(Couldn't set timer\n); + return -1; ^^ leaks timer_fd Alos, I noticed: + /* wait for an incoming message on the netlink socket */ + ret = select(nfds, fdset, NULL, NULL, NULL); + + if (ret) { Fails to detect error return from select. Jason -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html