Re: IB SR-IOV
On 12/08/2013 10:01, Ernst Gunnar Gran wrote:
> I'm about to start experimenting with SR-IOV and IB (ConnectX-2 G2, HP DL360p Gen8, E5-2609). Is it OK to just download the latest OFED release and start playing, or are there any known issues that I should know of when it comes to SR-IOV with the current release?

I would recommend using the IB/RDMA stack from the upstream kernel, or the one provided with one of the distros that picked up SR-IOV support from upstream into the bits they ship. Note that IB and Ethernet are supported for ConnectX VFs, but not yet RoCE.

Or.

Also note you need to do some preparations at the node level (BIOS, kernel boot command line) and at the card level (re-burn the firmware with SR-IOV support enabled; see the web or your Mellanox rep).

-- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] IPoIB: Fix race in deleting ipoib_neigh entries
On 09/08/2013 03:44, Jim Foraker wrote:
> In several places, this snippet is used when removing neigh entries:
>
>     list_del(&neigh->list);
>     ipoib_neigh_free(neigh);
>
> The list_del() removes neigh from the associated struct ipoib_path, while ipoib_neigh_free() removes neigh from the device's neigh entry lookup table. Both of these operations are protected by the priv->lock spinlock. The table, however, is also protected via RCU, and so naturally the lock is not held when doing reads. This leads to a race condition, in which a thread may successfully look up a neigh entry that has already been deleted from neigh->list. Since the previous deletion will have marked the entry with poison, a second list_del() on the object will cause a panic:
>
>     #5 [8802338c3c70] general_protection at 815108c5
>        [exception RIP: list_del+16]
>        RIP: 81289020  RSP: 8802338c3d20  RFLAGS: 00010082
>        RAX: dead00200200  RBX: 880433e60c88  RCX: 9e6c
>        RDX: 0246  RSI: 8806012ca298  RDI: 880433e60c88
>        RBP: 8802338c3d30  R8: 8806012ca2e8  R9:
>        R10: 0001  R11:  R12: 8804346b2020
>        R13: 88032a3e7540  R14: 8804346b26e0  R15: 0246
>        ORIG_RAX:  CS: 0010  SS:
>     #6 [8802338c3d38] ipoib_cm_tx_handler at a066fe0a [ib_ipoib]
>     #7 [8802338c3d98] cm_process_work at a05149a7 [ib_cm]
>     #8 [8802338c3de8] cm_work_handler at a05161aa [ib_cm]
>     #9 [8802338c3e38] worker_thread at 81090e10
>     #10 [8802338c3ee8] kthread at 81096c66
>     #11 [8802338c3f48] kernel_thread at 8100c0ca
>
> We move the list_del() into ipoib_neigh_free(), so that deletion happens only once, after the entry has been successfully removed from the lookup table. This same behavior is already used in ipoib_del_neighs_by_gid() and __ipoib_reap_neigh().
>
> Signed-off-by: Jim Foraker <forak...@llnl.gov>

Hi Jim,

I have reviewed the patch with Shlomo, who did the neighboring changes in IPoIB -- impressive analysis and debugging, and a good and proper fix to the issue presented e.g. here: marc.info/?l=linux-rdma&m=136881939301117&w=2

Or.
Re: [PATCH] IPoIB: Fix race in deleting ipoib_neigh entries
On 08/13/2013 09:54 AM, Or Gerlitz wrote:
> On 09/08/2013 03:44, Jim Foraker wrote:
>> [full patch description and backtrace snipped; see the original posting above]
>
> Hi Jim,
>
> I have reviewed the patch with Shlomo, who did the neighboring changes in IPoIB -- impressive analysis and debugging, and a good and proper fix to the issue presented e.g. here: marc.info/?l=linux-rdma&m=136881939301117&w=2
>
> Or.

Looks nice for me too.
Jack
Re: [PATCH] IPoIB: Fix race in deleting ipoib_neigh entries
On 8/13/2013 11:11 AM, Jack Wang wrote:
> On 08/13/2013 09:54 AM, Or Gerlitz wrote:
>> [full patch description, backtrace, and review snipped; see the original postings above]
> Looks nice for me too.
>
> Jack

Looks good to me, thanks.

S.P.
Re: [ceph-users] Help needed porting Ceph to RSockets
On Aug 13, 2013, at 10:06 AM, Andreas Bluemle <andreas.blue...@itxperts.de> wrote:
> Hi Matthew,
>
> I found a workaround for my (our) problem: in the librdmacm code, rsocket.c, there is a global constant polling_time, which is set to 10 microseconds at the moment. I raise this to 10,000 - and all of a sudden things work nicely. I am adding the linux-rdma list to CC so Sean might see this.

If I understand what you are describing, the caller to rpoll() spins for up to 10 ms (10,000 us) before calling the real poll(). What is the purpose of the real poll() call? Is it simply a means to block the caller and avoid spinning? Or does it actually expect to detect an event?

> I think we are looking at two issues here:
>
> 1. the thread structure of the ceph messenger
>
> For a given socket connection, there are three threads of interest here: the main messenger thread, the Pipe::reader, and the Pipe::writer. For a ceph client like the ceph admin command, I see the following sequence:
> - the connection to the ceph monitor is created by the main messenger thread, and the Pipe::reader and Pipe::writer are instantiated
> - the requested command is sent to the ceph monitor, and the answer is read and printed
> - at this point the Pipe::reader has already called tcp_read_wait(), polling for more data or connection termination
> - after the response has been printed, the main loop calls the shutdown routines, which in turn shutdown() the socket
>
> There is some time between the last two steps - and this gap is long enough to open a race:
>
> 2. rpoll, ibv and poll
>
> The rpoll implementation in rsockets is split into two phases:
> - a busy loop which checks the state of the underlying ibv queue pair
> - the call to the real poll() system call (i.e. the uverbs(?) implementation of poll() inside the kernel)
>
> The busy loop has a maximum duration of polling_time (10 microseconds by default) - and is able to detect the local shutdown, returning POLLHUP. The poll() system call (i.e.
> the uverbs implementation of poll() in the kernel) does not detect the local shutdown - and only returns after the caller-supplied timeout expires.
>
> Increasing the rsockets polling_time from 10 to 10,000 microseconds results in rpoll detecting the local shutdown within the busy loop. Decreasing the ceph "ms tcp read timeout" from the default of 900 to 5 seconds serves a similar purpose, but is much coarser.
>
> From my understanding, the real issue is neither at the ceph nor at the rsockets level: it is related to the uverbs kernel module.
>
> An alternative way to address the current problem at the rsockets level would be a re-write of rpoll(): instead of the busy loop at the beginning followed by the real poll() call with the full user-specified timeout value ("ms tcp read timeout" in our case), I would embed the real poll() into a loop, splitting the user-specified timeout into smaller portions and doing the rsockets-specific rs_poll_check() on every timeout of the real poll().

I have not looked at the rsocket code, so take the following with a grain of salt. If the purpose of the real poll() is to simply block the user for a specified time, then you can simply make it a short duration (taking into consideration what granularity the OS provides) and then call ibv_poll_cq(). Keep in mind, polling will prevent your CPU from reducing power. If the real poll() is actually checking for something (e.g. checking on the RDMA channel's fd or the IB channel's fd), then you may not want to spin too much.

Scott

> Best Regards
> Andreas Bluemle
>
> On Tue, 13 Aug 2013 07:53:12 +0200, Andreas Bluemle <andreas.blue...@itxperts.de> wrote:
>> Hi Matthew,
>>
>> I can confirm the behaviour which you describe. I too believe that the problem is on the client side (the ceph command). My log files show the very same symptom, i.e. the client side not being able to shut down the pipes properly.
>> (Q: I had problems yesterday sending a mail to the ceph-users list with the log files attached, because the size of the attachments exceeded some limit; I hadn't been subscribed to the list at that point. Is the use of pastebin.com the better way to provide such lengthy information in general?)
>>
>> Best Regards
>> Andreas Bluemle
>>
>> On Tue, 13 Aug 2013 11:59:36 +0800, Matthew Anderson <manderson8...@gmail.com> wrote:
>>> Moving this conversation to ceph-devel where the devs might be able to shed some light on this. I've added some additional debug to my code to narrow the issue down a bit, and the reader thread appears to be getting locked by tcp_read_wait() because rpoll never returns an event when the socket is shut down. A hack way of proving this was to lower the timeout in rpoll to 5 seconds. When a command like 'ceph osd tree' completes, you can see it block for 5 seconds until rpoll times out and returns 0. The reader thread is then able to join and the
RE: [PATCH 13/16] qib: clean up unnecessary MSI/MSI-X capability find
> Subject: [PATCH 13/16] qib: clean up unnecessary MSI/MSI-X capability find

Thanks for the patch!

Acked-by: Mike Marciniszyn <mike.marcinis...@intel.com>
Re: [PATCH V4 for-next 2/4] IB/core: Infra-structure to support verbs extensions through uverbs
> Add Infra-structure to support extended uverbs capabilities in a forward/backward manner. Uverbs command opcodes which are based on the verbs extensions approach should be greater or equal to IB_USER_VERBS_CMD_THRESHOLD. They have new header format and processed a bit differently.

Previously I wrote the following about an earlier version of this patch (quite a while before you sent the new version). You seem to have ignored it (and also ignored my correction about how to write "infrastructure"):

This patch at least doesn't have a sufficient changelog. I don't understand what "extended capabilities" are or why we need to change the header format. What is the "verbs extensions approach"? Why does the kernel need to know about it? What is different about the processing?

Yes, you have tried to answer those questions separately, but do you think git writes changelogs magically when a patch is applied? If you don't supply a good enough changelog, you're expecting me to write it for you.

> +struct ib_uverbs_cmd_hdr_ex {
> +	__u32 command;
> +	__u16 in_words;
> +	__u16 out_words;
> +	__u16 provider_in_words;
> +	__u16 provider_out_words;
> +	__u32 cmd_hdr_reserved;
> +};
> +

If I understand the vague explanations and the analogy to the presentation about userspace, then cmd_hdr_reserved is supposed to be used as a mask field. Should a kernel that doesn't understand any mask components make sure that this reserved field is 0, and return an error if it isn't? I don't see any code to do that, and it seems to risk new userspace silently getting wrong answers on an old kernel. Is there any reason not to name it as a mask field from the start?

 - R.
[PATCH] IB/qib: fix sparse warnings in qib_user_sdma.c
Commit fd1b423 ("IB/qib: Improve SDMA performance") introduced some new sparse warnings. This patch eliminates all sparse warnings in this file.

Reported-by: Fengguang Wu <fengguang...@intel.com>
Reviewed-by: Dean Luick <dean.lu...@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marcinis...@intel.com>
---
 drivers/infiniband/hw/qib/qib_user_sdma.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib_user_sdma.c b/drivers/infiniband/hw/qib/qib_user_sdma.c
index 0544166..d0a0ea0 100644
--- a/drivers/infiniband/hw/qib/qib_user_sdma.c
+++ b/drivers/infiniband/hw/qib/qib_user_sdma.c
@@ -222,7 +222,7 @@ static int qib_user_sdma_page_to_frags(const struct qib_devdata *dd,
 				       struct page *page, u16 put,
 				       u16 offset, u16 len, void *kvaddr)
 {
-	u16 *pbc16;
+	__le16 *pbc16;
 	void *pbcvaddr;
 	struct qib_message_header *hdr;
 	u16 newlen, pbclen, lastdesc, dma_mapped;
@@ -610,7 +610,7 @@ static int qib_user_sdma_pin_pages(const struct qib_devdata *dd,
 			(PAGE_SIZE - fofs) : tlen;

 		ret = qib_user_sdma_page_to_frags(dd, pq, pkt,
-				pages[i], 1, fofs, flen, 0);
+				pages[i], 1, fofs, flen, NULL);
 		if (ret < 0) {
 			/* current page has been taken
 			 * care of inside above call.
RE: [ceph-users] Help needed porting Ceph to RSockets
> I found a workaround for my (our) problem: in the librdmacm code, rsocket.c, there is a global constant polling_time, which is set to 10 microseconds at the moment. I raise this to 10,000 - and all of a sudden things work nicely. I am adding the linux-rdma list to CC so Sean might see this.
>
> If I understand what you are describing, the caller to rpoll() spins for up to 10 ms (10,000 us) before calling the real poll(). What is the purpose of the real poll() call? Is it simply a means to block the caller and avoid spinning? Or does it actually expect to detect an event?

When the real poll() is called, an event is expected on an fd associated with the CQ's completion channel.

> I think we are looking at two issues here:
>
> 1. the thread structure of the ceph messenger
>
> For a given socket connection, there are three threads of interest here: the main messenger thread, the Pipe::reader, and the Pipe::writer. For a ceph client like the ceph admin command, I see the following sequence:
> - the connection to the ceph monitor is created by the main messenger thread, and the Pipe::reader and Pipe::writer are instantiated
> - the requested command is sent to the ceph monitor, and the answer is read and printed
> - at this point the Pipe::reader has already called tcp_read_wait(), polling for more data or connection termination
> - after the response has been printed, the main loop calls the shutdown routines, which in turn shutdown() the socket
>
> There is some time between the last two steps - and this gap is long enough to open a race:
>
> 2. rpoll, ibv and poll
>
> The rpoll implementation in rsockets is split into two phases:
> - a busy loop which checks the state of the underlying ibv queue pair
> - the call to the real poll() system call (i.e. the uverbs(?) implementation of poll() inside the kernel)
>
> The busy loop has a maximum duration of polling_time (10 microseconds by default) - and is able to detect the local shutdown, returning POLLHUP. The poll() system call (i.e.
> the uverbs implementation of poll() in the kernel) does not detect the local shutdown - and only returns after the caller-supplied timeout expires.

It sounds like there's an issue here, either with a message getting lost or a race. Given that spinning longer works for you, it sounds like an event is getting lost, not being generated correctly, or not being configured to generate.

> Increasing the rsockets polling_time from 10 to 10,000 microseconds results in rpoll detecting the local shutdown within the busy loop. Decreasing the ceph "ms tcp read timeout" from the default of 900 to 5 seconds serves a similar purpose, but is much coarser.
>
> From my understanding, the real issue is neither at the ceph nor at the rsockets level: it is related to the uverbs kernel module.
>
> An alternative way to address the current problem at the rsockets level would be a re-write of rpoll(): instead of the busy loop at the beginning followed by the real poll() call with the full user-specified timeout value ("ms tcp read timeout" in our case), I would embed the real poll() into a loop, splitting the user-specified timeout into smaller portions and doing the rsockets-specific rs_poll_check() on every timeout of the real poll().
>
> I have not looked at the rsocket code, so take the following with a grain of salt. If the purpose of the real poll() is to simply block the user for a specified time, then you can simply make it a short duration (taking into consideration what granularity the OS provides) and then call ibv_poll_cq(). Keep in mind, polling will prevent your CPU from reducing power. If the real poll() is actually checking for something (e.g. checking on the RDMA channel's fd or the IB channel's fd), then you may not want to spin too much.

The real poll() call is intended to block the application until a timeout occurs or an event shows up. Since increasing the spin time works for you, it makes me suspect that there is a bug in the CQ event handling in rsockets.
> What's particularly weird is that the monitor receives a POLLHUP event when the ceph command shuts down its socket, but the ceph command never does. When using regular sockets, both sides of the connection receive a POLLIN | POLLHUP | POLLRDHUP event when the sockets are shut down. It would seem there is a bug in rsockets that causes the side that calls shutdown first not to receive the correct rpoll events.

rsockets does not support POLLRDHUP.

- Sean