Re: [PATCH V2 5/5] RDMA CM: Netlink Client
On Wed, 2010-12-08 at 11:23 -0700, Jason Gunthorpe wrote:
> > Sorry, I still need some clarifications... When you say deadlocks, do
> > you mean calling malloc while holding a lock, or overflowing a socket
> > receive buffer? For the second case, when we use netlink_unicast, the
> > skbuff is sent and freed. It is transferred to the userspace socket
> > using netlink_sendskb and accumulated in its receive buffer. Are you
> > referring to a deadlock there? I still fail to see the issue. Why
> > would the kernel socket receive buffer reach a limit? Could you
> > please elaborate?
>
> Netlink is all driven from user space syscalls.. so it looks like
>
>  sendmsg()
>  [..]
>   ibnl_rcv_msg
>    cma_get_stats
>    [..]
>     ibnl_unicast
>     [..]
>      netlink_attachskb
>      (now we block on the socket recv queue once it fills)
>
> The deadlock is that userspace is sitting in sendmsg() while the
> kernel is sleeping in netlink_attachskb waiting for the recvbuf to
> empty. User space cannot call recvmsg() while it is blocked in
> sendmsg(), so it all goes boom.

Oh, now I see what you mean. I thought you meant the recv buffer in the
netlink socket... But I'm using MSG_DONTWAIT when calling
netlink_unicast, so attachskb shouldn't block. I also tested that.
I do agree that freeing the skb and simply giving up is not the best we
can do, so we can try to send as much as we can instead, but either
way, a deadlock shouldn't occur.

Nir
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V2 5/5] RDMA CM: Netlink Client
On Thu, Dec 09, 2010 at 10:47:18AM +0200, Nir Muchtar wrote:
> But I'm using MSG_DONTWAIT when calling netlink_unicast, so attachskb
> shouldn't block. I also tested that.

But then you are guaranteed to have an incomplete dump once you have
enough entries! The best trade-off is what the other dump_start users
do: you might get an inconsistent dump sometimes, but at least it is
complete and correct most of the time.

Jason
Re: [PATCH V2 5/5] RDMA CM: Netlink Client
On Tue, 2010-12-07 at 14:29 -0700, Jason Gunthorpe wrote:
> What you've done in your v2 patch won't work if the table you are
> dumping is too large: once you pass sk_rmem_alloc for the netlink
> socket it will deadlock. The purpose of dump_start is to avoid that
> deadlock. (review my past messages on the subject)
>
> Your v1 patch wouldn't deadlock, but it would fail to dump with
> ENOMEM, and provides an avenue to build an unprivileged kernel OOM
> DOS. The places in the kernel that don't use dump_start have to stay
> under sk_rmem_alloc.
>
> Jason

Sorry, I still need some clarifications... When you say deadlocks, do
you mean calling malloc while holding a lock, or overflowing a socket
receive buffer? For the second case, when we use netlink_unicast, the
skbuff is sent and freed. It is transferred to the userspace socket
using netlink_sendskb and accumulated in its receive buffer. Are you
referring to a deadlock there? I still fail to see the issue. Why would
the kernel socket receive buffer reach a limit? Could you please
elaborate?

Thanks,
Nir
Re: [PATCH V2 5/5] RDMA CM: Netlink Client
On Wed, Dec 08, 2010 at 04:55:22PM +0200, Nir Muchtar wrote:
> On Tue, 2010-12-07 at 14:29 -0700, Jason Gunthorpe wrote:
> > What you've done in your v2 patch won't work if the table you are
> > dumping is too large: once you pass sk_rmem_alloc for the netlink
> > socket it will deadlock. The purpose of dump_start is to avoid that
> > deadlock. (review my past messages on the subject)
> >
> > Your v1 patch wouldn't deadlock, but it would fail to dump with
> > ENOMEM, and provides an avenue to build an unprivileged kernel OOM
> > DOS. The places in the kernel that don't use dump_start have to
> > stay under sk_rmem_alloc.
>
> Sorry, I still need some clarifications... When you say deadlocks, do
> you mean calling malloc while holding a lock, or overflowing a socket
> receive buffer? For the second case, when we use netlink_unicast, the
> skbuff is sent and freed. It is transferred to the userspace socket
> using netlink_sendskb and accumulated in its receive buffer. Are you
> referring to a deadlock there? I still fail to see the issue. Why
> would the kernel socket receive buffer reach a limit? Could you
> please elaborate?

Netlink is all driven from user space syscalls.. so it looks like

 sendmsg()
 [..]
  ibnl_rcv_msg
   cma_get_stats
   [..]
    ibnl_unicast
    [..]
     netlink_attachskb
     (now we block on the socket recv queue once it fills)

The deadlock is that userspace is sitting in sendmsg() while the
kernel is sleeping in netlink_attachskb waiting for the recvbuf to
empty. User space cannot call recvmsg() while it is blocked in
sendmsg(), so it all goes boom.

Even if cma_get_stats was executed from a kernel thread and
ibnl_rcv_msg returned back to userspace, you would still hold the
dev_list mutex while calling ibnl_unicast, which can sleep waiting on
userspace. That creates an easy DOS against the RDMA CM (I can write a
program that causes the kernel to hold the mutex indefinitely).

You can't hold the mutex while sleeping for userspace, so you have to
unlock it. If you unlock it you have to fix up your position when you
re-lock it. If you can fix up your position, then you can use
dump_start.

I don't see malloc being a concern anywhere in what you've done...

Jason
Re: [PATCH V2 5/5] RDMA CM: Netlink Client
On Wed, 2010-12-01 at 11:35 -0700, Jason Gunthorpe wrote:
> > I agree, but this is a bit out of scope for the current patches and
> > I think this kind of change should be given some thought.
>
> I'd view it as a pre-condition, actually.

I understand. Still, I prefer staying focused on the primary goal and
completing the patches before looking into this. So the solution for
now is to just skip the device name and obtain it from userspace using
the net devices and /sys/class. I might return to this once the first
patches are accepted.

> I'm happy to see things done later, if we can agree on what
> everything should look like later so the pieces we have now fit.
> Maybe you can outline the sort of schema you are thinking of as I
> did? Having a framework where ib_core generates the QP message and
> calls out to RDMA_CM, IB_CM, driver, uverbs, etc to fill in
> attributes seems best to me for the QP table.

What I mean is that I will supply the maximum flexibility by
supporting arbitrary netlink messages and attributes. This will
support your suggested schema as well as any changes we'll agree upon.
My current plans are for RDMA CM exports and not QP table exports, but
this should be next in line.

> You shouldn't be attempting to dump the structure in one go while
> holding a lock, you need to try best-efforts to dump it by keeping
> some kind of current position value. inet_diag seems to use a pretty
> simple scheme where it just records the hash bucket and count into
> the chain. Not sure what happens if things are erased - looks like
> you'll get duplicates/misses? You could do the same by keeping track
> of the offset into the linked list.
>
> Jason

The thing is, there's no easy and clean way to retrieve the export
when using dump_start. The locking problem could be solved by using
GFP_ATOMIC for the allocations. This would prevent possibly long
locking periods.

Nir
Re: [PATCH V2 5/5] RDMA CM: Netlink Client
On Tue, Dec 07, 2010 at 05:40:27PM +0200, Nir Muchtar wrote:
> I understand. Still, I prefer staying focused on the primary goal and
> completing the patches before looking into this. So the solution for
> now is to just skip the device name and obtain it from userspace
> using the net devices and /sys/class. I might return to this once
> the first patches are accepted.

This doesn't really work if the IB netdevice is part of a bond, and it
won't work at all once Sean is done adding AF_GID. Churning the
userspace ABI is a really bad idea; if something is hard to do, then
fine. But this isn't..

> What I mean is that I will supply the maximum flexibility by
> supporting arbitrary netlink messages and attributes. This will
> support your suggested schema as well as any changes we'll agree
> upon. My current plans are for RDMA CM exports and not QP table
> exports, but this should be next in line.

Do you have an idea what this will look like?

> > You shouldn't be attempting to dump the structure in one go while
> > holding a lock, you need to try best-efforts to dump it by keeping
> > some kind of current position value. inet_diag seems to use a
> > pretty simple scheme where it just records the hash bucket and
> > count into the chain. Not sure what happens if things are erased -
> > looks like you'll get duplicates/misses? You could do the same by
> > keeping track of the offset into the linked list.
>
> The thing is, there's no easy and clean way to retrieve the export
> when using dump_start.

I don't follow this comment, can you elaborate? This really needs to
use the dump api, and I can't see any reason why it can't.

Jason
RE: [PATCH V2 5/5] RDMA CM: Netlink Client
Jason Gunthorpe wrote:
> On Tue, Dec 07, 2010 at 05:40:27PM +0200, Nir Muchtar wrote:
> > I understand. Still, I prefer staying focused on the primary goal
> > and completing the patches before looking into this. So the
> > solution for now is to just skip the device name and obtain it
> > from userspace using the net devices and /sys/class. I might
> > return to this once the first patches are accepted.
>
> This doesn't really work if the IB netdevice is part of a bond, and
> it won't work at all once Sean is done adding AF_GID. Churning the
> userspace ABI is a really bad idea; if something is hard to do, then
> fine. But this isn't..

I'm just not convinced this can be easily completed/accepted without
diverting us from the primary goal, so I prefer picking this up later,
after those patches are accepted. If I discover there's no good way to
obtain this through userspace then I won't.

> > What I mean is that I will supply the maximum flexibility by
> > supporting arbitrary netlink messages and attributes. This will
> > support your suggested schema as well as any changes we'll agree
> > upon. My current plans are for RDMA CM exports and not QP table
> > exports, but this should be next in line.
>
> Do you have an idea what this will look like?

I'm submitting an RDMA CM extension and not a QP table extension, so I
don't have a complete design, but the infrastructure will support a
more modular, collaborative design like you suggested, as well as a
more per-module design as in the case of RDMA CM. The exact
specification can be agreed upon later.

> > The thing is, there's no easy and clean way to retrieve the export
> > when using dump_start.
>
> I don't follow this comment, can you elaborate? This really needs to
> use the dump api, and I can't see any reason why it can't.
>
> Jason

As I said, there's just no way (that I know of) to use dump_start,
divide the data into several packets, and receive a consistent
snapshot of the data, and this is an issue. We can achieve all that by
doing something a little different, so why shouldn't we? dump_start is
used by some subsystems and not used by others. It's a convenience
function and it doesn't necessarily fit every case, so we shouldn't
force it.

Nir
Re: [PATCH V2 5/5] RDMA CM: Netlink Client
On Tue, Dec 07, 2010 at 10:53:23PM +0200, Nir Muchtar wrote:
> I'm just not convinced this can be easily completed/accepted without
> diverting us from the primary goal, so I prefer picking this up
> later after those

This part of the kernel is mature; the 'primary goal' is to not add
new half-baked things that need to churn userspace APIs before they
are complete :)

> patches are accepted. If I discover there's no good way to obtain
> this through userspace then I won't.

There isn't. Read my last message.

> > > The thing is, there's no easy and clean way to retrieve the
> > > export when using dump_start.
> >
> > I don't follow this comment, can you elaborate? This really needs
> > to use the dump api, and I can't see any reason why it can't.
>
> As I said, there's just no way (that I know of) to use dump_start,
> divide the data into several packets, and receive a consistent
> snapshot of the data, and this is an issue. We can achieve all that
> by doing something a little different, so why shouldn't we?

You have to give up on 100% consistency to use dump_start, which is OK
for diags, and is what the other dumpers in the kernel do.

What you've done in your v2 patch won't work if the table you are
dumping is too large: once you pass sk_rmem_alloc for the netlink
socket it will deadlock. The purpose of dump_start is to avoid that
deadlock. (review my past messages on the subject)

Your v1 patch wouldn't deadlock, but it would fail to dump with
ENOMEM, and provides an avenue to build an unprivileged kernel OOM
DOS. The places in the kernel that don't use dump_start have to stay
under sk_rmem_alloc.

Jason
Re: [PATCH V2 5/5] RDMA CM: Netlink Client
On Wed, Dec 01, 2010 at 05:58:54PM +0200, Nir Muchtar wrote:
> On Tue, 2010-11-30 at 11:19 -0700, Jason Gunthorpe wrote:
> > > I don't know of an IB device index mapping like the one in
> > > netdevice. Am I missing one? Do you mean we should create one?
> >
> > Yes, definitely. It is very easy to do and goes hand-in-hand with
> > the typical netlink protocol design.
>
> I agree, but this is a bit out of scope for the current patches and
> I think this kind of change should be given some thought.

I'd view it as a pre-condition, actually.

> It needs to supply userspace with mapping functions and I don't
> think it will be that easy to complete. The patch in its current
> state uses names but it doesn't perpetuate their use because the
> rdma cm export is separate from the infrastructure. Once we have
> such an ability, it will be very easy to use here.

I think you are overthinking things. For now, just including the
ifindex attribute in sysfs is quite enough. As the netlink interface
is completed, a by-index lookup will naturally fall out.

> So we are in agreement that more than one export type is required
> here. I do agree that your suggestion will make sense once we try to
> export QP related data, so maybe we can agree that I will fully
> support such a scheme, so it will be easy to implement later. By
> that I mean that the infrastructure will allow adding arbitrary
> attributes to messages (in type and in size). What do you think?

I'm happy to see things done later, if we can agree on what everything
should look like later so the pieces we have now fit. Maybe you can
outline the sort of schema you are thinking of as I did? Having a
framework where ib_core generates the QP message and calls out to
RDMA_CM, IB_CM, driver, uverbs, etc to fill in attributes seems best
to me for the QP table. A table for listening objects would be kind of
similar, with information provided by rdma_cm and ib_cm, or just
ib_cm.

> > 6. Kernel copies the data from #3 into userspace
> > 7. netlink_dump calls callback which returns non-zero
> > 8. recv() returns in userspace
>
> Yes that's correct, but inet_diag takes care of the last two steps
> by updating its cb index, and not dump_start. If we use it that way
> we can have problems with changes in the data structure on
> subsequent recv calls, so if we want to keep it the same we would
> still need to employ locking. I don't see a way to keep the same
> data without locking and without a session mechanism of some sort.

That is what the netlink_callback structure is for: you can stick your
current position info into args[].

You shouldn't be attempting to dump the structure in one go while
holding a lock, you need to try best-efforts to dump it by keeping
some kind of current position value. inet_diag seems to use a pretty
simple scheme where it just records the hash bucket and count into the
chain. Not sure what happens if things are erased - looks like you'll
get duplicates/misses? You could do the same by keeping track of the
offset into the linked list.

Jason
Re: [PATCH V2 5/5] RDMA CM: Netlink Client
Jason Gunthorpe wrote:
> struct ib_nl_qp {
> 	// String names for IB devices was a mistake, don't perpetuate it.
> 	__u32 ib_dev_if;

Jason,

Do you have a concrete suggestion and/or a sketch for a patch someone
can work on to make this enumeration take place?

Or.
Re: [PATCH V2 5/5] RDMA CM: Netlink Client
On Mon, 2010-11-29 at 12:11 -0700, Jason Gunthorpe wrote:
> On Mon, Nov 29, 2010 at 06:16:39PM +0200, Nir Muchtar wrote:
> > Add callbacks and data types for statistics export. One callback
> > is implemented that exports all of the current devices/ids.
> > Add/remove the callback to IB Netlink on init/cleanup.
>
> Please include the schema for the messages you are adding to netlink
> in the comment so other people can review them easily. Looks to me
> like you have messages of the form:
>
> {
>   [IBNL_RDMA_CM_DEVICE_NAME -> char[]]
>   [IBNL_RDMA_CM_ID_STATS -> struct rdma_cm_id_stats]*
> }*

Yes, that's the basic structure. I'll add an explanation next time.

> As I've said before, I don't want to see this tied to RDMA_CM, that
> is not general enough for a new userspace API. The use of
> IBNL_RDMA_CM_DEVICE_NAME is very un-netlink-like and is just an ugly
> hack to avoid addressing that problem.

This is done to save space in the netlink messages. I am open to ideas
for improvements. I thought of another possibility: we can add another
op for RDMA devices only, with an index mapping. This could create
problems if a device is added/removed between calls, though. See my
question about your suggestion below.

> How about messages of the form:
>
> {
>   IBNL_QP -> struct ib_nl_qp
>   IBNL_QP_ATTR -> struct ib_qp_attr (or a reasonable subset)
>   IBNL_QP_CM_INFO u8[] -> struct ib_nl_qp_cm_info
>   IBNL_QP_CM_SERVICE_ID u8[]
>   IBNL_RDMA_CM_INFO -> struct ib_nl_rdma_cm_info
>   IBNL_RDMA_CM_SRC -> u8[] // This is a sockaddr_*
>   IBNL_RDMA_CM_DST -> u8[]
> }+
>
> Meaning there is an array of IBNL_QP messages which contains various
> attributes. Similar to how everything else in netlink works.
>
> struct ib_nl_qp {
> 	// String names for IB devices was a mistake, don't perpetuate it.

I don't know of an IB device index mapping like the one in netdevice.
Am I missing one? Do you mean we should create one?

> 	__u32 ib_dev_if;
> 	__u32 qp_num;
> 	__s32 creator_pid; // -1 for kernel consumer
> };
>
> struct ib_nl_qp_cm_info {
> 	u32 cm_state; // enum ib_cm_state
> 	u32 lap_state;
> };
>
> struct ib_nl_rdma_cm_info {
> 	__u32 bound_dev_if;
> 	__u32 family;
> 	__u32 cm_state; // enum rdma_cm_state
> };
>
> This captures more information and doesn't tie things to RDMA_CM.

My problem with making everything QP related is that not everything
necessarily is. For example, when rdma cm id's are created they are
still not bound to QPs. I guess you could send all zeros in this case,
but as more and more of such exceptions are needed this framework will
become a bit unnatural. The current implementation is not tying
anything to RDMA CM. It allows other modules to export data exactly
the way they need.

> > +static int cma_get_stats(struct sk_buff **nl_skb, int pid)
>
> You really have to use netlink_dump_start here, doing it like this
> will deadlock. See how other places use NLM_F_DUMP as well.

Well, I already reviewed netlink_dump_start, and this is how it works
as far as I can see (please correct me if I'm wrong):

1. Allocates an skb
2. Calls its supplied dump cb
3. Calls its supplied done cb if applicable
4. Appends NLMSG_DONE

This appears to be executed synchronously, within the context of the
calling thread, so I couldn't figure out how to use it to avoid long
locking times. Anyway, what I tried to achieve is a mechanism that
allocates more skbs as they are needed, and separates it from the
calling module. Do you know of an inherent way to make
netlink_dump_start do that?

> > +struct rdma_cm_id_stats {
> > +	enum rdma_node_type nt;
> > +	int port_num;
> > +	int bound_dev_if;
> > +	__be16 local_port;
> > +	__be16 remote_port;
> > +	__be32 local_addr[4];
> > +	__be32 remote_addr[4];
> > +	enum rdma_port_space ps;
> > +	enum rdma_cm_state cm_state;
> > +	u32 qp_num;
> > +	pid_t pid;
> > +};
>
> Putting enums in a user/kernel structure like this makes me nervous
> that we'll have a 32/64 bit problem. It would be more consistent
> with the uverbs stuff to use explicit fixed width types.

Yes, you're right. Also, I see now that this is not normally done this
way, so I'll drop the enums.

> Jason

Thanks again for all your input.
Nir
RE: [PATCH V2 5/5] RDMA CM: Netlink Client
> +struct rdma_cm_id_stats {
> +	enum rdma_node_type nt;
> +	int port_num;
> +	int bound_dev_if;
> +	__be16 local_port;
> +	__be16 remote_port;
> +	__be32 local_addr[4];
> +	__be32 remote_addr[4];

Please use sockaddr_storage, so that we can expand the rdma_cm beyond
ipv4/6 address support.
Re: [PATCH V2 5/5] RDMA CM: Netlink Client
On Tue, Nov 30, 2010 at 02:34:59PM +0200, Or Gerlitz wrote:
> Jason Gunthorpe wrote:
> > struct ib_nl_qp {
> > 	// String names for IB devices was a mistake, don't perpetuate it.
> > 	__u32 ib_dev_if;
>
> Do you have a concrete suggestion and/or a sketch for a patch
> someone can work on to make this enumeration take place?

Oh, it is really easy: just add an ifindex member to struct ib_device.
alloc_name can be extended to compute an appropriate index; see how
this works in dev_new_index. Add an 'ifindex' file to the ib_device's
sysfs directory that exports this number.

In the future we'd want to see a netlink query to get the RDMA device
list, to replace trundling through sysfs. I think this is really
important from a netlink perspective; there are going to be so many
places you want to refer to an RDMA device.

Jason
Re: [PATCH V2 5/5] RDMA CM: Netlink Client
On Tue, Nov 30, 2010 at 04:09:31PM +0200, Nir Muchtar wrote:
> > struct ib_nl_qp {
> > 	// String names for IB devices was a mistake, don't perpetuate it.
>
> I don't know of an IB device index mapping like the one in
> netdevice. Am I missing one? Do you mean we should create one?

Yes, definitely. It is very easy to do and goes hand-in-hand with the
typical netlink protocol design.

> > 	__u32 ib_dev_if;
> > 	__u32 qp_num;
> > 	__s32 creator_pid; // -1 for kernel consumer
> > };
> >
> > struct ib_nl_qp_cm_info {
> > 	u32 cm_state; // enum ib_cm_state
> > 	u32 lap_state;
> > };
> >
> > struct ib_nl_rdma_cm_info {
> > 	__u32 bound_dev_if;
> > 	__u32 family;
> > 	__u32 cm_state; // enum rdma_cm_state
> > };
> >
> > This captures more information and doesn't tie things to RDMA_CM.
>
> My problem with making everything QP related is that not everything
> necessarily is. For example, when rdma cm id's are created they are
> still not bound to QPs. I guess you could send all zeros in this
> case, but as more and more of such exceptions are needed this
> framework will become a bit unnatural. The current implementation is
> not tying anything to RDMA CM. It allows other modules to export
> data exactly the way they need.

Well, I was outlining how I think the QP-centric information can be
returned. You are right that we also have non-QP info, like listening
objects, and I think that can be best returned with a separate query.
Trying to conflate them seems like it would be trouble. Certainly, as
I've described it, IBNL_QP messages should only refer to active QPs.
Remember you can have as many queries as you like; this is just the QP
object query.

I guess an alternative would be to have many tables: RDMA_CM, QP, and
IB_CM, and then rely on userspace to 'join' them by ifindex+QPN - but
that seems like a lot of work in userspace and I think pretty much
everyone is going to want the joined data.

> > > +static int cma_get_stats(struct sk_buff **nl_skb, int pid)
> >
> > You really have to use netlink_dump_start here, doing it like this
> > will deadlock. See how other places use NLM_F_DUMP as well.
>
> Well, I already reviewed netlink_dump_start, and this is how it
> works as far as I can see (please correct me if I'm wrong):
> 1. Allocates an skb
> 2. Calls its supplied dump cb
> 3. Calls its supplied done cb if applicable
> 4. Appends NLMSG_DONE

No, this isn't quite right. The dump cb is also called after userspace
calls recvmsg(), which continues the dump once the buffer is freed.
The idea is to return a bit of the table on every dump callback. The
way it is used is:

1. Userspace does send()
2. Kernel calls netlink_dump_start()
3. netlink_dump_start calls callback which returns non-zero
4. send() returns in userspace
5. Userspace does recv()
6. Kernel copies the data from #3 into userspace
7. netlink_dump calls callback which returns non-zero
8. recv() returns in userspace

[...]

> Thanks again for all your input.

Thanks for working on this!

Jason
Re: [PATCH V2 5/5] RDMA CM: Netlink Client
On Tue, Nov 30, 2010 at 08:13:45AM -0800, Hefty, Sean wrote:
> > +struct rdma_cm_id_stats {
> > +	enum rdma_node_type nt;
> > +	int port_num;
> > +	int bound_dev_if;
> > +	__be16 local_port;
> > +	__be16 remote_port;
> > +	__be32 local_addr[4];
> > +	__be32 remote_addr[4];
>
> Please use sockaddr_storage, so that we can expand the rdma_cm
> beyond ipv4/6 address support.

Actually, for netlink the proper thing to do here is to place the
sockaddr in a sub-attribute that can be variable-sized. Do not put
variably sized data into a struct like this.

Further, sockaddr_storage is not actually ABI-guaranteed to be
constant in size; it is just the current largest sockaddr possible.
This is why the sockaddr size in/out parameter associated with all
sockaddrs is so important. Don't put sockaddrs in fixed-size
structures for user/kernel interfaces.

Jason
[PATCH V2 5/5] RDMA CM: Netlink Client
Add callbacks and data types for statistics export. One callback is
implemented that exports all of the current devices/ids. Add/remove
the callback to IB Netlink on init/cleanup.

Signed-off-by: Nir Muchtar <n...@voltaire.com>
---
 drivers/infiniband/core/cma.c |   97 +++++++++++++++++++++++++++++++++
 include/rdma/rdma_cm.h        |   14 ++++++
 2 files changed, 111 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 5821f93..9c6ce73 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -51,6 +51,7 @@
 #include <rdma/ib_cm.h>
 #include <rdma/ib_sa.h>
 #include <rdma/iw_cm.h>
+#include <rdma/ib_netlink.h>

 MODULE_AUTHOR("Sean Hefty");
 MODULE_DESCRIPTION("Generic RDMA CM Agent");
@@ -134,6 +135,7 @@ struct rdma_id_private {
 	u32 qp_num;
 	u8 srq;
 	u8 tos;
+	pid_t owner;
 };

 struct cma_multicast {
@@ -418,6 +420,7 @@ struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler event_handler,
 	if (!id_priv)
 		return ERR_PTR(-ENOMEM);

+	id_priv->owner = current->pid;
 	id_priv->state = RDMA_CM_IDLE;
 	id_priv->id.context = context;
 	id_priv->id.event_handler = event_handler;
@@ -2671,8 +2674,14 @@ int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
 {
 	struct rdma_id_private *id_priv;
 	int ret;
+	unsigned long flags;

 	id_priv = container_of(id, struct rdma_id_private, id);
+
+	spin_lock_irqsave(&id_priv->lock, flags);
+	id_priv->owner = current->pid;
+	spin_unlock_irqrestore(&id_priv->lock, flags);
+
 	if (!cma_comp(id_priv, RDMA_CM_CONNECT))
 		return -EINVAL;
@@ -3243,6 +3252,91 @@ static void cma_remove_one(struct ib_device *device)
 	kfree(cma_dev);
 }

+static int cma_get_stats(struct sk_buff **nl_skb, int pid)
+{
+	struct rdma_cm_id_stats *id_stats;
+	struct rdma_id_private *id_priv;
+	struct rdma_cm_id *id = NULL;
+	struct cma_device *cma_dev;
+	char *dev_name;
+	struct sockaddr_in *src, *dst;
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+	struct sockaddr_in6 *src6, *dst6;
+#endif
+	int seq = 0;
+
+	mutex_lock(&lock);
+	list_for_each_entry(cma_dev, &dev_list, list) {
+		dev_name = ibnl_put(nl_skb, pid, seq++,
+				    strlen(cma_dev->device->name) + 1,
+				    IBNL_RDMA_CM,
+				    IBNL_RDMA_CM_DEVICE_NAME);
+		if (!dev_name)
+			goto errmem;
+		strcpy(dev_name, cma_dev->device->name);
+		list_for_each_entry(id_priv, &cma_dev->id_list, list) {
+			id_stats = ibnl_put(nl_skb, pid, seq++,
+					    sizeof *id_stats, IBNL_RDMA_CM,
+					    IBNL_RDMA_CM_ID_STATS);
+			if (!id_stats)
+				goto errmem;
+			memset(id_stats, 0, sizeof *id_stats);
+			id = &id_priv->id;
+			id_stats->nt = id->route.addr.dev_addr.dev_type;
+			id_stats->port_num = id->port_num;
+			id_stats->bound_dev_if =
+				id->route.addr.dev_addr.bound_dev_if;
+
+			if (id->route.addr.src_addr.ss_family == AF_INET &&
+			    id->route.addr.dst_addr.ss_family == AF_INET) {
+				src = (struct sockaddr_in *)
+					&id->route.addr.src_addr;
+				dst = (struct sockaddr_in *)
+					&id->route.addr.dst_addr;
+				id_stats->local_port = src->sin_port;
+				id_stats->remote_port = dst->sin_port;
+				id_stats->local_addr[0] = src->sin_addr.s_addr;
+				id_stats->remote_addr[0] = dst->sin_addr.s_addr;
+			}
+
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+			if (id->route.addr.src_addr.ss_family == AF_INET6 &&
+			    id->route.addr.dst_addr.ss_family == AF_INET6) {
+				src6 = (struct sockaddr_in6 *)
+					&id->route.addr.src_addr;
+				dst6 = (struct sockaddr_in6 *)
+					&id->route.addr.dst_addr;
+
+				ipv6_addr_copy((struct in6_addr *)
+					       (id_stats->local_addr),
+					       &src6->sin6_addr);
+				ipv6_addr_copy((struct in6_addr *)
+					       (id_stats->remote_addr),
Re: [PATCH V2 5/5] RDMA CM: Netlink Client
On Mon, Nov 29, 2010 at 06:16:39PM +0200, Nir Muchtar wrote:
> Add callbacks and data types for statistics export. One callback is
> implemented that exports all of the current devices/ids. Add/remove
> the callback to IB Netlink on init/cleanup.

Please include the schema for the messages you are adding to netlink
in the comment so other people can review them easily. Looks to me
like you have messages of the form:

{
  [IBNL_RDMA_CM_DEVICE_NAME -> char[]]
  [IBNL_RDMA_CM_ID_STATS -> struct rdma_cm_id_stats]*
}*

As I've said before, I don't want to see this tied to RDMA_CM, that is
not general enough for a new userspace API. The use of
IBNL_RDMA_CM_DEVICE_NAME is very un-netlink-like and is just an ugly
hack to avoid addressing that problem. How about messages of the form:

{
  IBNL_QP -> struct ib_nl_qp
  IBNL_QP_ATTR -> struct ib_qp_attr (or a reasonable subset)
  IBNL_QP_CM_INFO u8[] -> struct ib_nl_qp_cm_info
  IBNL_QP_CM_SERVICE_ID u8[]
  IBNL_RDMA_CM_INFO -> struct ib_nl_rdma_cm_info
  IBNL_RDMA_CM_SRC -> u8[] // This is a sockaddr_*
  IBNL_RDMA_CM_DST -> u8[]
}+

Meaning there is an array of IBNL_QP messages which contains various
attributes. Similar to how everything else in netlink works.

struct ib_nl_qp {
	// String names for IB devices was a mistake, don't perpetuate it.
	__u32 ib_dev_if;
	__u32 qp_num;
	__s32 creator_pid; // -1 for kernel consumer
};

struct ib_nl_qp_cm_info {
	u32 cm_state; // enum ib_cm_state
	u32 lap_state;
};

struct ib_nl_rdma_cm_info {
	__u32 bound_dev_if;
	__u32 family;
	__u32 cm_state; // enum rdma_cm_state
};

This captures more information and doesn't tie things to RDMA_CM.
iWarp QPs would not export IBNL_QP_CM_INFO and QP_CM_SERVICE_ID, but
ideally we'd have a call out to the NIC to include the TCP diag
information for the underlying TCP socket, since there is no other way
to access that. Non RDMA-CM QPs (ie ipoib) would not include the
RDMA_CM bits. If you study how SS works you'll see it is similar; it
uses a message of type AF_INET/6 and then includes attributes like
INET_DIAG_MEMINFO/INFO/CONG.

> @@ -134,6 +135,7 @@ struct rdma_id_private {
>  	u32 qp_num;
>  	u8 srq;
>  	u8 tos;
> +	pid_t owner;

Maybe a separate patch for this? It probably really belongs in
ib_uverbs. What about kernel consumers?

> +static int cma_get_stats(struct sk_buff **nl_skb, int pid)

You really have to use netlink_dump_start here, doing it like this
will deadlock. See how other places use NLM_F_DUMP as well.

> +struct rdma_cm_id_stats {
> +	enum rdma_node_type nt;
> +	int port_num;
> +	int bound_dev_if;
> +	__be16 local_port;
> +	__be16 remote_port;
> +	__be32 local_addr[4];
> +	__be32 remote_addr[4];
> +	enum rdma_port_space ps;
> +	enum rdma_cm_state cm_state;
> +	u32 qp_num;
> +	pid_t pid;
> +};

Putting enums in a user/kernel structure like this makes me nervous
that we'll have a 32/64 bit problem. It would be more consistent with
the uverbs stuff to use explicit fixed width types.

Jason