Re: [PATCH V2 5/5] RDMA CM: Netlink Client

2010-12-09 Thread Nir Muchtar
On Wed, 2010-12-08 at 11:23 -0700, Jason Gunthorpe wrote:

  Sorry, I still need some clarifications...
  When you say deadlocks, do you mean when calling malloc with a lock or
  when overflowing a socket receive buffer?
  For the second case, when we use netlink_unicast, the skbuff is sent and
  freed. It is transferred to the userspace's socket using netlink_sendskb
  and accumulated in its recv buff.
  
  Are you referring to a deadlock there? I still fail to see the issue.
  Why would the kernel socket recv buff reach a limit? Could you please
  elaborate?
 
 Netlink is all driven from user space syscalls.. so it looks like
 
 sendmsg()
 [..]
 ibnl_rcv_msg
 cma_get_stats
 [..]
 ibnl_unicast
 [..]
 netlink_attachskb
 (now we block on the socket recv queue once it fills)
 
 The deadlock is that userspace is sitting in sendmsg() while the
 kernel is sleeping in netlink_attachskb waiting for the recvbuf to
 empty.
 
 User space cannot call recvmsg() while it is blocked in sendmsg()
 so it all goes boom.
 

Oh, now I see what you mean. I thought you meant the recv buffer in the
netlink socket... 

But I'm using MSG_DONTWAIT when calling netlink_unicast, so attachskb
shouldn't block. I also tested that.
I do agree that freeing the skb and simply giving up is not the best we
can do; we could instead try to send as much as we can. Either way,
a deadlock shouldn't occur.
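
For reference, a minimal hedged sketch of the non-blocking send being
described; the function name and the "nls" kernel socket are placeholders,
not code from the patch:

#include <linux/netlink.h>
#include <net/sock.h>

/* netlink_unicast() takes a "nonblock" argument; passing MSG_DONTWAIT makes
 * it drop the skb and return -EAGAIN instead of sleeping in
 * netlink_attachskb() when the destination receive buffer is full. */
static int send_stats_nonblocking(struct sock *nls, struct sk_buff *skb,
				  u32 pid)
{
	int ret = netlink_unicast(nls, skb, pid, MSG_DONTWAIT);

	return ret < 0 ? ret : 0;	/* >= 0 means the skb was queued */
}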

Nir



Re: [PATCH V2 5/5] RDMA CM: Netlink Client

2010-12-09 Thread Jason Gunthorpe
On Thu, Dec 09, 2010 at 10:47:18AM +0200, Nir Muchtar wrote:

 But I'm using MSG_DONTWAIT when calling netlink_unicast, so attachskb
 shouldn't block. I also tested that.

But then you are guaranteed to have an incomplete dump once you have
enough entries!

The best trade-off is what the other dump_start users do: you might
get an inconsistent dump sometimes, but at least it is complete and
correct most of the time.

Jason


Re: [PATCH V2 5/5] RDMA CM: Netlink Client

2010-12-08 Thread Nir Muchtar
On Tue, 2010-12-07 at 14:29 -0700, Jason Gunthorpe wrote:

 What you've done in your v2 patch won't work if the table you are
 dumping is too large, once you pass sk_rmem_alloc for the netlink
 socket it will deadlock. The purpose of dump_start is to avoid that
 deadlock. (review my past messages on the subject)
 
 Your v1 patch wouldn't deadlock, but it would fail to dump with
 ENOMEM, and provides an avenue to build an unprivileged kernel OOM
 DOS.
 
 The places in the kernel that don't use dump_start have to stay under
 sk_rmem_alloc.
 
 Jason

Sorry, I still need some clarifications...
When you say deadlocks, do you mean when calling malloc with a lock or
when overflowing a socket receive buffer?
For the second case, when we use netlink_unicast, the skbuff is sent and
freed. It is transferred to the userspace's socket using netlink_sendskb
and accumulated in its recv buff.

Are you referring to a deadlock there? I still fail to see the issue.
Why would the kernel socket recv buff reach a limit? Could you please
elaborate?

Thanks,
Nir



Re: [PATCH V2 5/5] RDMA CM: Netlink Client

2010-12-08 Thread Jason Gunthorpe
On Wed, Dec 08, 2010 at 04:55:22PM +0200, Nir Muchtar wrote:
 On Tue, 2010-12-07 at 14:29 -0700, Jason Gunthorpe wrote:
 
  What you've done in your v2 patch won't work if the table you are
  dumping is too large, once you pass sk_rmem_alloc for the netlink
  socket it will deadlock. The purpose of dump_start is to avoid that
  deadlock. (review my past messages on the subject)
  
  Your v1 patch wouldn't deadlock, but it would fail to dump with
  ENOMEM, and provides an avenue to build an unprivileged kernel OOM
  DOS.
  
  The places in the kernel that don't use dump_start have to stay under
  sk_rmem_alloc.
  
  Jason
 
 Sorry, I still need some clarifications...
 When you say deadlocks, do you mean when calling malloc with a lock or
 when overflowing a socket receive buffer?
 For the second case, when we use netlink_unicast, the skbuff is sent and
 freed. It is transferred to the userspace's socket using netlink_sendskb
 and accumulated in its recv buff.
 
 Are you referring to a deadlock there? I still fail to see the issue.
 Why would the kernel socket recv buff reach a limit? Could you please
 elaborate?

Netlink is all driven from user space syscalls.. so it looks like

sendmsg()
[..]
ibnl_rcv_msg
cma_get_stats
[..]
ibnl_unicast
[..]
netlink_attachskb
(now we block on the socket recv queue once it fills)

The deadlock is that userspace is sitting in sendmsg() while the
kernel is sleeping in netlink_attachskb waiting for the recvbuf to
empty.

User space cannot call recvmsg() while it is blocked in sendmsg()
so it all goes boom.

Even if cma_get_stats was executed from a kernel thread and
ibnl_rcv_msg returned back to userspace, you would still hold the dev_list
mutex while calling ibnl_unicast, which can sleep waiting on
userspace. That creates an easy DoS against the RDMA CM (I can write
a program that causes the kernel to hold the mutex indefinitely).

You can't hold the mutex while sleeping for userspace, so you have to
unlock it. If you unlock it you have to fixup your position when you
re-lock it. If you can fixup your position then you can use
dump_start.
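
To make that concrete, here is a hedged sketch (not code from the thread) of
handing the request to the dump machinery; the handler names and the "nls"
kernel socket are assumptions, and the netlink_dump_start() signature shown
is the one from this era of kernels:

static int cma_stats_dump(struct sk_buff *skb, struct netlink_callback *cb);

static int cma_handle_stats_request(struct sk_buff *skb, struct nlmsghdr *nlh)
{
	if (!(nlh->nlmsg_flags & NLM_F_DUMP))
		return -EINVAL;

	/* No table walk here and no mutex held while waiting on userspace:
	 * the dump callback runs once per skb, each time userspace drains
	 * the previous one with recvmsg(). */
	return netlink_dump_start(nls, skb, nlh, cma_stats_dump, NULL);
}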

I don't see malloc being a concern anywhere in what you've done...

Jason


Re: [PATCH V2 5/5] RDMA CM: Netlink Client

2010-12-07 Thread Nir Muchtar
On Wed, 2010-12-01 at 11:35 -0700, Jason Gunthorpe wrote:
  I agree, but this is a bit out of scope for the current patches and I
  think this kind of change should be given some thought. 
 
 I'd view it as a pre-condition, actually.
 

I understand. Still, I prefer staying focused on the primary goal and
completing the patches before looking into this, so the solution for now
is to just skip the device name and obtain it from userspace using the
net devices and /sys/class. I might return to this once the first
patches are accepted.

 I'm happy to see things done later, if we can agree on what everything
 should look like later so the pieces we have now fit. Maybe you can
 outline the sort of schema you are thinking of as I did?
 
 Having a framework where ib_core generates the QP message and calls
 out to RDMA_CM, IB_CM, driver, uverbs, etc to fill in attributes seems
 best to me for the QP table.

What I mean is that I will supply the maximum flexibility by supporting
arbitrary netlink messages and attributes. This will support your
suggested schema as well as any changes we'll agree upon. My current
plans are for RDMA CM exports and not QP table exports but this should
be next in line.

 You shouldn't be attempting to dump the structure in one go while
 holding a lock, you need to try best-efforts to dump it by keeping
 some kind of current position value.
 
 inet_diag seems to use a pretty simple scheme where it just records
 the hash bucket and count into the chain. Not sure what happens if
 things are erased - looks like you'll get duplicates/misses? You could
 do the same by keeping track of the offset into the linked list.
 
 Jason

The thing is, there's no easy and clean way to retrieve the export when
using dump_start.
The locking problem could be solved by using GFP_ATOMIC for the allocations,
which would avoid the risk of long locking periods.

Nir



Re: [PATCH V2 5/5] RDMA CM: Netlink Client

2010-12-07 Thread Jason Gunthorpe
On Tue, Dec 07, 2010 at 05:40:27PM +0200, Nir Muchtar wrote:

 I understand. Still, I prefer staying focused on the primary goal and
 completing the patches before looking into this, so the solution for now
 is to just skip the device name and obtain it from userspace using the
 net devices and /sys/class. I might return to this once the first
 patches are accepted.

This doesn't really work if the IB netdevice is part of a bond, and it
won't work at all once Sean is done adding AF_GID.

Churning the userspace ABI is a really bad idea; if something is hard
to do, then fine. But this isn't.

 What I mean is that I will supply the maximum flexibility by supporting
 arbitrary netlink messages and attributes. This will support your
 suggested schema as well as any changes we'll agree upon. My current
 plans are for RDMA CM exports and not QP table exports but this should
 be next in line.

Do you have an idea what this will look like?

  You shouldn't be attempting to dump the structure in one go while
  holding a lock, you need to try best-efforts to dump it by keeping
  some kind of current position value.
  
  inet_diag seems to use a pretty simple scheme where it just records
  the hash bucket and count into the chain. Not sure what happens if
  things are erased - looks like you'll get duplicates/misses? You could
  do the same by keeping track of the offset into the linked list.
 
 The thing is, there's no easy and clean way to retrieve the export when
 using dump_start.

I don't follow this comment, can you elaborate?

This really needs to use the dump api, and I can't see any reason why
it can't.

Jason


RE: [PATCH V2 5/5] RDMA CM: Netlink Client

2010-12-07 Thread Nir Muchtar

Jason Gunthorpe wrote:
  
 On Tue, Dec 07, 2010 at 05:40:27PM +0200, Nir Muchtar wrote:
 
  I understand. Still, I prefer staying focused on the primary goal and
  completing the patches before looking into this, so the solution for now
  is to just skip the device name and obtain it from userspace using the
  net devices and /sys/class. I might return to this once the first
  patches are accepted.
 
 This doesn't really work if the IB netdevice is part of a bond, and it
 won't work at all once Sean is done adding AF_GID.
 
 Churning the userspace ABI is a really bad idea; if something is hard
 to do, then fine. But this isn't.

I'm just not convinced this can be easily completed/accepted and not divert
us from the primary goal, so I prefer picking this up later after those
patches are accepted. If I discover there's no good way to obtain this 
through userspace then I won't.

 
  What I mean is that I will supply the maximum flexibility by supporting
  arbitrary netlink messages and attributes. This will support your
  suggested schema as well as any changes we'll agree upon. My current
  plans are for RDMA CM exports and not QP table exports but this should
  be next in line.
 
 Do you have an idea what this will look like?

I'm submitting an RDMA CM extension and not a QP table extension, so I
don't have a complete design, but the infrastructure will support a
more collaborative, multi-module design like the one you suggested, as well
as a more per-module design like in the case of RDMA CM. The exact
specification can be agreed upon later.

  The thing is, there's no easy and clean way to retrieve the export when
  using dump_start.
 
 I don't follow this comment, can you elaborate?
 
 This really needs to use the dump api, and I can't see any reason why
 it can't.
 
 Jason

As I said, there's just no way (I know of) to use dump_start, divide data
into several packets, and receive a consistent snapshot of the data, and
this is an issue. We can achieve all that by doing something a little 
different so why shouldn't we? 

dump_start is used by some subsystems and not used by others. It's a
convenience function and it doesn't necessarily fit in every case,
so we shouldn't force it.

Nir


Re: [PATCH V2 5/5] RDMA CM: Netlink Client

2010-12-07 Thread Jason Gunthorpe
On Tue, Dec 07, 2010 at 10:53:23PM +0200, Nir Muchtar wrote:

 I'm just not convinced this can be easily completed/accepted and not
 divert us from the primary goal, so I prefer picking this up later
 after those

This part of the kernel is mature; the 'primary goal' is to not add
new half-baked things that need to churn userspace APIs before they
are complete :)

 patches are accepted. If I discover there's no good way to obtain this 
 through userspace then I won't.

There isn't. Read my last message.

   The thing is, there's no easy and clean way to retrieve the export when
   using dump_start.
  
  I don't follow this comment, can you elaborate?
  
  This really needs to use the dump api, and I can't see any reason why
  it can't.
 
 As I said, there's just no way (I know of) to use dump_start, divide data
 into several packets, and receive a consistent snapshot of the data, and
 this is an issue. We can achieve all that by doing something a little 
 different so why shouldn't we? 

You have to give up on 100% consistency to use dump_start, which is OK
for diags, and what other dumpers in the kernel do.

What you've done in your v2 patch won't work if the table you are
dumping is too large, once you pass sk_rmem_alloc for the netlink
socket it will deadlock. The purpose of dump_start is to avoid that
deadlock. (review my past messages on the subject)

Your v1 patch wouldn't deadlock, but it would fail to dump with
ENOMEM, and provides an avenue to build an unprivileged kernel OOM
DOS.

The places in the kernel that don't use dump_start have to stay under
sk_rmem_alloc.

Jason


Re: [PATCH V2 5/5] RDMA CM: Netlink Client

2010-12-01 Thread Jason Gunthorpe
On Wed, Dec 01, 2010 at 05:58:54PM +0200, Nir Muchtar wrote:
 On Tue, 2010-11-30 at 11:19 -0700, Jason Gunthorpe wrote:
 
   I don't know of an IB device index mapping like the one in netdevice.
   Am I missing one? Do you mean we should create one?
  
  Yes, definitely. It is very easy to do and goes hand-in-hand with the
  typical netlink protocol design.
 
 I agree, but this is a bit out of scope for the current patches and I
 think this kind of change should be given some thought. 

I'd view it as a pre-condition, actually.

 It needs to supply userspace with mapping functions and I don't
 think it will be that easy to complete. The patch in its current
 state uses names but it doesn't perpetuate their use because the
 rdma cm export is separate from the infrastructure. Once we have
 such an ability, it will be very easy to use here.

I think you are overthinking things. For now, just including the
ifindex attribute in sysfs is quite enough. As the netlink interface
is completed, a by-index lookup will naturally fall out.

 So we are in agreement that more than one export type is required here.
 I do agree that your suggestion will make sense once we try to export QP
 related data, so maybe we can agree that I will fully support such a
 scheme, so it will be easy to implement later. By that I mean that the
 infrastructure will allow adding arbitrary attributes to messages (in
 type and in size). What do you think?

I'm happy to see things done later, if we can agree on what everything
should look like later so the pieces we have now fit. Maybe you can
outline the sort of schema you are thinking of as I did?

Having a framework where ib_core generates the QP message and calls
out to RDMA_CM, IB_CM, driver, uverbs, etc to fill in attributes seems
best to me for the QP table.

A table for listening objects would be kind of similar with
information provided by rdma_cm and ib_cm, or just ib_cm

   6. Kernel copies the data from #3 into userspace
   7. netlink_dump calls callback which returns non-zero
   8. recv() returns in userspace
 
 Yes that's correct, but inet_diag takes care of the last two steps by
 updating its cb index, and not dump_start. If we use it that way we can
 have problems with changes in data structure on subsequent recv calls,
 so if we want to keep it the same we would still need to employ locking.
 I don't see a way to keep the same data without locking and without a
 session mechanism of some sort.

That is what the netlink_callback structure is for: you can stick
your current position info into args[].

You shouldn't be attempting to dump the structure in one go while
holding a lock, you need to try best-efforts to dump it by keeping
some kind of current position value.

inet_diag seems to use a pretty simple scheme where it just records
the hash bucket and count into the chain. Not sure what happens if
things are erased - looks like you'll get duplicates/misses? You could
do the same by keeping track of the offset into the linked list.
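
A hedged sketch of that scheme follows; the lock, device, and fill helper
below are placeholders standing in for the RDMA CM structures in the patch,
not final code:

static int cma_stats_dump(struct sk_buff *skb, struct netlink_callback *cb)
{
	struct rdma_id_private *id_priv;
	int idx = 0, start = cb->args[0];

	/* Single list for brevity; the real dev_list/id_list nesting would
	 * need a second position value in cb->args[1]. */
	mutex_lock(&lock);
	list_for_each_entry(id_priv, &cma_dev->id_list, list) {
		if (idx < start)
			goto next;	/* already sent in an earlier pass */
		if (fill_one_id(skb, id_priv) < 0)
			break;		/* skb full: resume from here next time */
next:
		idx++;
	}
	mutex_unlock(&lock);

	/* Entries added/removed between passes can cause duplicates or
	 * misses, just like the inet_diag scheme. */
	cb->args[0] = idx;
	return skb->len;	/* 0 once nothing more was added: dump done */
}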

Jason


Re: [PATCH V2 5/5] RDMA CM: Netlink Client

2010-11-30 Thread Or Gerlitz
Jason Gunthorpe wrote:
 struct ib_nl_qp
 // String names for IB devices was a mistake, don't perpetuate it.
 __u32 ib_dev_if;

Jason,

Do you have a concrete suggestion and/or sketch for a patch
someone can work on to make this enumeration take place?

Or.


Re: [PATCH V2 5/5] RDMA CM: Netlink Client

2010-11-30 Thread Nir Muchtar
On Mon, 2010-11-29 at 12:11 -0700, Jason Gunthorpe wrote:
 On Mon, Nov 29, 2010 at 06:16:39PM +0200, Nir Muchtar wrote:
  Add callbacks and data types for statistics export.
  One callback is implemented that exports all of the current devices/ids.
  Add/remove the callback to IB Netlink on init/cleanup.
 
 Please include the schema for the messages you are adding to netlink
 in the comment so other people can review them easily.
 
 Looks to me like you have messages of the form:
 
 {
  [IBNL_RDMA_CM_DEVICE_NAME -> char[]]
  [IBNL_RDMA_CM_ID_STATS -> struct rdma_cm_id_stats]*
 }*
 

Yes that's the basic structure. I'll add an explanation next time.

 As I've said before, I don't want to see this tied to RDMA_CM, that is
 not general enough for a new userspace API. The use of
 IBNL_RDMA_CM_DEVICE_NAME is very un-netlink-like and is just an ugly
 hack to avoid addressing that problem.

This is done to save space in the netlink messages.
I am open to ideas for improvements.
I thought of another possibility: we could add another op that lists only
the rdma devices, with an index mapping. This could create a problem if a
device is added/removed between calls, though.
See my question about your suggestion below.

 
 How about messages of the form:
 {
  IBNL_QP -> struct ib_nl_qp
    IBNL_QP_ATTR -> struct ib_qp_attr (or a reasonable subset)
    IBNL_QP_CM_INFO u8[] -> struct ib_nl_qp_cm_info
    IBNL_QP_CM_SERVICE_ID u8[]
    IBNL_RDMA_CM_INFO -> struct ib_nl_rdma_cm_info
    IBNL_RDMA_CM_SRC -> u8[]  // This is a sockaddr_*
    IBNL_RDMA_CM_DST -> u8[]
 }+
 
 Meaning there is an array of IBNL_QP messages which contains various
 attributes. Similar to how everything else in netlink works.
 
 struct ib_nl_qp
 {
 // String names for IB devices was a mistake, don't perpetuate it.

I don't know of an IB device index mapping like the one in netdevice.
Am I missing one? Do you mean we should create one?

 __u32 ib_dev_if;
   __u32 qp_num;
 __s32 creator_pid; // -1 for kernel consumer
 };
 
 struct ib_nl_qp_cm_info
 {
   u32 cm_state; // enum ib_cm_state
   u32 lap_state;
 };
 
 struct ib_nl_rdma_cm_info
 {
   __u32 bound_dev_if;
 __u32 family;
   __u32 cm_state; // enum rdma_cm_state
 };
 
 This captures more information and doesn't tie things to RDMA_CM.

My problem with making everything QP related is that not everything
necessarily is. For example, when rdma cm ids are created they are still
not bound to QPs. I guess you could send all zeros in this case, but as
more such exceptions are needed this framework will become a bit
unnatural. The current implementation is not tying anything to RDMA
CM. It allows other modules to export data exactly the way they need.

 
  +static int cma_get_stats(struct sk_buff **nl_skb, int pid)
 
 You really have to use netlink_dump_start here, doing it like this
 will deadlock. See how other places use NLM_F_DUMP as well.

Well, I already reviewed netlink_dump_start, and this is how it works
as far as I can see (please correct me if I'm wrong):
1. Allocates an skb
2. Calls its supplied dump cb
3. Calls its supplied done cb if applicable
4. Appends NLMSG_DONE

This appears to be executed synchronously, within the context of the
calling thread. So I couldn't figure out how to use it for avoiding long
locking times.

Anyway, what I tried to achieve is a mechanism that allocates more skbs
as they are needed and separates this from the calling module. Do you know
of an inherent way to make netlink_dump_start do that?

 
  +struct rdma_cm_id_stats {
  +   enum rdma_node_type nt;
  +   int port_num;
  +   int bound_dev_if;
  +   __be16  local_port;
  +   __be16  remote_port;
  +   __be32  local_addr[4];
  +   __be32  remote_addr[4];
  +   enum rdma_port_space ps;
  +   enum rdma_cm_state cm_state;
  +   u32 qp_num;
  +   pid_t pid;
  +};
 
 Putting enums in a user/kernel structure like this makes me nervous
 that we'll have a 32/64 bit problem. It would be more consistent with
 the uverbs stuff to use explicit fixed width types.

Yes you're right. Also, I see now that this is not normally done this
way, so I'll drop the enums.

 
 Jason

Thanks again for all your input.
Nir



RE: [PATCH V2 5/5] RDMA CM: Netlink Client

2010-11-30 Thread Hefty, Sean
 +struct rdma_cm_id_stats {
 + enum rdma_node_type nt;
 + int port_num;
 + int bound_dev_if;
 + __be16  local_port;
 + __be16  remote_port;
 + __be32  local_addr[4];
 + __be32  remote_addr[4];

Please use sockaddr_storage, so that we can expand the rdma_cm beyond ipv4/6 
address support.


Re: [PATCH V2 5/5] RDMA CM: Netlink Client

2010-11-30 Thread Jason Gunthorpe
On Tue, Nov 30, 2010 at 02:34:59PM +0200, Or Gerlitz wrote:
 Jason Gunthorpe wrote:
  struct ib_nl_qp
  // String names for IB devices was a mistake, don't perpetuate it.
  __u32 ib_dev_if;

 Do you have a concrete suggestion and/or sketch for a patch
 someone can work on to make this enumeration take place?

Oh, it is really easy: just add an ifindex member to struct
ib_device. alloc_name can be extended to compute an appropriate index;
see how this works in dev_new_index.

Add an 'ifindex' attribute to the ib_device's sysfs directory that exports
this number.

In future we'd want to see a netlink query to get the RDMA device list
to replace trundling through sysfs.
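
A hedged sketch of that suggestion; the field, helper, and attribute names
are illustrative, not an existing kernel interface:

/* New member in struct ib_device (illustrative): u32 ifindex; */

/* Allocation in the spirit of dev_new_index(); real code would have to
 * handle wraparound and reuse of indexes from unregistered devices. */
static u32 ib_dev_new_index(void)
{
	static u32 next_index;

	return ++next_index;
}

/* Exposes /sys/class/infiniband/<device>/ifindex */
static ssize_t show_ifindex(struct device *device,
			    struct device_attribute *attr, char *buf)
{
	struct ib_device *dev = container_of(device, struct ib_device, dev);

	return sprintf(buf, "%u\n", dev->ifindex);
}
static DEVICE_ATTR(ifindex, S_IRUGO, show_ifindex, NULL);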

I think this is really important from a netlink perspective; there are
going to be so many places where you want to refer to an RDMA device.

Jason


Re: [PATCH V2 5/5] RDMA CM: Netlink Client

2010-11-30 Thread Jason Gunthorpe
On Tue, Nov 30, 2010 at 04:09:31PM +0200, Nir Muchtar wrote:

  struct ib_nl_qp
  {
  // String names for IB devices was a mistake, don't perpetuate it.
 
 I don't know of an IB device index mapping like the one in netdevice.
 Am I missing one? Do you mean we should create one?

Yes, definitely. It is very easy to do and goes hand-in-hand with the
typical netlink protocol design.
 
  __u32 ib_dev_if;
  __u32 qp_num;
  __s32 creator_pid; // -1 for kernel consumer
  };
  
  struct ib_nl_qp_cm_info
  {
u32 cm_state; // enum ib_cm_state
u32 lap_state;
  };
  
  struct ib_nl_rdma_cm_info
  {
  __u32 bound_dev_if;
  __u32 family;
  __u32 cm_state; // enum rdma_cm_state
  };
  
  This captures more information and doesn't tie things to RDMA_CM.
 
 My problem with making everything QP related is that not everything
 necessarily is. For example, when rdma cm ids are created they are still
 not bound to QPs. I guess you could send all zeros in this case, but as
 more such exceptions are needed this framework will become a bit
 unnatural. The current implementation is not tying anything to RDMA
 CM. It allows other modules to export data exactly the way they need.

Well, I was outlining how I think the QP-centric information can be
returned. You are right that we also have non-QP info, like listening
objects, and I think that can be best returned with a separate
query. Trying to conflate them seems like it would be
trouble. Certainly, as I've described, IBNL_QP messages should only
refer to active QPs.

Remember you can have as many queries as you like, this is just the QP
object query.

I guess an alternative would be to have many tables: RDMA_CM, QP, and
IB_CM, and then rely on userspace to 'join' them by ifindex+QPN, but
that seems like a lot of work in userspace and I think pretty much
everyone is going to want to have the joined data.

   +static int cma_get_stats(struct sk_buff **nl_skb, int pid)
  
  You really have to use netlink_dump_start here, doing it like this
  will deadlock. See how other places use NLM_F_DUMP as well.
 
 Well, I already reviewed netlink_dump_start, and this is how it works
 as far as I can see (please correct me if I'm wrong):
 1. Allocates an skb
 2. Calls its supplied dump cb
 3. Calls its supplied done cb if applicable
 4. Appends NLMSG_DONE

No, this isn't quite right. The dump callback is also called after userspace
calls recvmsg(), which continues the dump once the buffer is
freed. The idea is to return a bit of the table on every dump
callback.

The way it is used is:
 1. Userspace does send()
 2. Kernel calls netlink_dump_start()
 3. netlink_dump_start calls callback which returns non-zero
 4. send() returns in userspace
 5. Userspace does recv()
 6. Kernel copies the data from #3 into userspace
 7. netlink_dump calls callback which returns non-zero
 8. recv() returns in userspace
 [...]
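
From the userspace side, the same sequence looks roughly like this hedged
sketch; the protocol number and message type are placeholders for whatever
the IB netlink family ends up defining:

#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>

int dump_table(int nl_msg_type)
{
	/* NETLINK_GENERIC is only a stand-in for the proposed IB family. */
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
	struct sockaddr_nl kernel = { .nl_family = AF_NETLINK };   /* pid 0 */
	struct nlmsghdr req = {
		.nlmsg_len   = NLMSG_LENGTH(0),
		.nlmsg_type  = nl_msg_type,
		.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP,
	};
	char buf[8192];

	/* step 1: a single send starts the dump */
	sendto(fd, &req, req.nlmsg_len, 0,
	       (struct sockaddr *)&kernel, sizeof(kernel));

	for (;;) {	/* steps 5-8, repeated until NLMSG_DONE arrives */
		ssize_t len = recv(fd, buf, sizeof(buf), 0);
		struct nlmsghdr *nh;

		if (len <= 0)
			break;
		for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
		     nh = NLMSG_NEXT(nh, len)) {
			if (nh->nlmsg_type == NLMSG_DONE) {
				close(fd);
				return 0;
			}
			/* parse one chunk of the table here */
		}
	}
	close(fd);
	return -1;
}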

 Thanks again for all your input.

Thanks for working on this!

Jason


Re: [PATCH V2 5/5] RDMA CM: Netlink Client

2010-11-30 Thread Jason Gunthorpe
On Tue, Nov 30, 2010 at 08:13:45AM -0800, Hefty, Sean wrote:
  +struct rdma_cm_id_stats {
  +   enum rdma_node_type nt;
  +   int port_num;
  +   int bound_dev_if;
  +   __be16  local_port;
  +   __be16  remote_port;
  +   __be32  local_addr[4];
  +   __be32  remote_addr[4];
 
 Please use sockaddr_storage, so that we can expand the rdma_cm
 beyond ipv4/6 address support.

Actually for netlink the proper thing to do here is to place the
sockaddr in a sub-attribute that can be variable-sized. Do not put
variably sized data into a struct like this.

Further, sockaddr_storage is not actually guaranteed by the ABI to be
constant in size; it is just the current largest sockaddr
possible. This is why the sockaddr size in/out parameter associated
with all sockaddrs is so important.

Don't put sockaddrs in fixed-size structures for user/kernel
interfaces.
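
A hedged sketch of that approach; IBNL_RDMA_CM_SRC is borrowed from the
schema proposed earlier in the thread, and the helper itself is illustrative:

#include <net/netlink.h>
#include <linux/in.h>
#include <linux/in6.h>

static int put_src_addr(struct sk_buff *skb, const struct sockaddr *sa)
{
	int len;

	switch (sa->sa_family) {
	case AF_INET:
		len = sizeof(struct sockaddr_in);
		break;
	case AF_INET6:
		len = sizeof(struct sockaddr_in6);
		break;
	default:
		return -EAFNOSUPPORT;
	}

	/* Only 'len' bytes go on the wire; userspace reads the size from the
	 * attribute header, so new address families need no ABI change. */
	return nla_put(skb, IBNL_RDMA_CM_SRC, len, sa);
}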

Jason


[PATCH V2 5/5] RDMA CM: Netlink Client

2010-11-29 Thread Nir Muchtar
Add callbacks and data types for statistics export.
One callback is implemented that exports all of the current devices/ids.
Add/remove the callback to IB Netlink on init/cleanup.

Signed-off-by: Nir Muchtar n...@voltaire.com
---
 drivers/infiniband/core/cma.c |   97 +
 include/rdma/rdma_cm.h|   14 ++
 2 files changed, 111 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 5821f93..9c6ce73 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -51,6 +51,7 @@
 #include <rdma/ib_cm.h>
 #include <rdma/ib_sa.h>
 #include <rdma/iw_cm.h>
+#include <rdma/ib_netlink.h>
 
 MODULE_AUTHOR("Sean Hefty");
 MODULE_DESCRIPTION("Generic RDMA CM Agent");
@@ -134,6 +135,7 @@ struct rdma_id_private {
 	u32			qp_num;
 	u8			srq;
 	u8			tos;
+	pid_t			owner;
 };
 
 struct cma_multicast {
@@ -418,6 +420,7 @@ struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler event_handler,
 	if (!id_priv)
 		return ERR_PTR(-ENOMEM);
 
+	id_priv->owner = current->pid;
 	id_priv->state = RDMA_CM_IDLE;
 	id_priv->id.context = context;
 	id_priv->id.event_handler = event_handler;
@@ -2671,8 +2674,14 @@ int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
 {
 	struct rdma_id_private *id_priv;
 	int ret;
+	unsigned long flags;
 
 	id_priv = container_of(id, struct rdma_id_private, id);
+
+	spin_lock_irqsave(&id_priv->lock, flags);
+	id_priv->owner = current->pid;
+	spin_unlock_irqrestore(&id_priv->lock, flags);
+
 	if (!cma_comp(id_priv, RDMA_CM_CONNECT))
 		return -EINVAL;
 
@@ -3243,6 +3252,91 @@ static void cma_remove_one(struct ib_device *device)
 	kfree(cma_dev);
 }
 
+static int cma_get_stats(struct sk_buff **nl_skb, int pid)
+{
+	struct rdma_cm_id_stats *id_stats;
+	struct rdma_id_private *id_priv;
+	struct rdma_cm_id *id = NULL;
+	struct cma_device *cma_dev;
+	char *dev_name;
+	struct sockaddr_in *src, *dst;
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+	struct sockaddr_in6 *src6, *dst6;
+#endif
+	int seq = 0;
+
+	mutex_lock(&lock);
+	list_for_each_entry(cma_dev, &dev_list, list) {
+		dev_name = ibnl_put(nl_skb, pid, seq++,
+				    strlen(cma_dev->device->name) + 1,
+				    IBNL_RDMA_CM,
+				    IBNL_RDMA_CM_DEVICE_NAME);
+		if (!dev_name)
+			goto errmem;
+		strcpy(dev_name, cma_dev->device->name);
+		list_for_each_entry(id_priv, &cma_dev->id_list, list) {
+			id_stats = ibnl_put(nl_skb, pid, seq++,
+					    sizeof *id_stats, IBNL_RDMA_CM,
+					    IBNL_RDMA_CM_ID_STATS);
+			if (!id_stats)
+				goto errmem;
+			memset(id_stats, 0, sizeof *id_stats);
+			id = &id_priv->id;
+			id_stats->nt = id->route.addr.dev_addr.dev_type;
+			id_stats->port_num = id->port_num;
+			id_stats->bound_dev_if =
+				id->route.addr.dev_addr.bound_dev_if;
+
+			if (id->route.addr.src_addr.ss_family == AF_INET &&
+			    id->route.addr.dst_addr.ss_family == AF_INET) {
+				src = (struct sockaddr_in *)
+					&(id->route.addr.src_addr);
+				dst = (struct sockaddr_in *)
+					&(id->route.addr.dst_addr);
+				id_stats->local_port = src->sin_port;
+				id_stats->remote_port = dst->sin_port;
+				id_stats->local_addr[0] = src->sin_addr.s_addr;
+				id_stats->remote_addr[0] = dst->sin_addr.s_addr;
+			}
+
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+			if (id->route.addr.src_addr.ss_family == AF_INET6 &&
+			    id->route.addr.dst_addr.ss_family == AF_INET6) {
+				src6 = (struct sockaddr_in6 *)
+					&(id->route.addr.src_addr);
+				dst6 = (struct sockaddr_in6 *)
+					&(id->route.addr.dst_addr);
+
+				ipv6_addr_copy((struct in6_addr *)
+					       &(id_stats->local_addr),
+					       &src6->sin6_addr);
+				ipv6_addr_copy((struct in6_addr *)
+

Re: [PATCH V2 5/5] RDMA CM: Netlink Client

2010-11-29 Thread Jason Gunthorpe
On Mon, Nov 29, 2010 at 06:16:39PM +0200, Nir Muchtar wrote:
 Add callbacks and data types for statistics export.
 One callback is implemented that exports all of the current devices/ids.
 Add/remove the callback to IB Netlink on init/cleanup.

Please include the schema for the messages you are adding to netlink
in the comment so other people can review them easily.

Looks to me like you have messages of the form:

{
 [IBNL_RDMA_CM_DEVICE_NAME -> char[]]
 [IBNL_RDMA_CM_ID_STATS -> struct rdma_cm_id_stats]*
}*

As I've said before, I don't want to see this tied to RDMA_CM, that is
not general enough for a new userspace API. The use of
IBNL_RDMA_CM_DEVICE_NAME is very un-netlink-like and is just an ugly
hack to avoid addressing that problem.

How about messages of the form:
{
 IBNL_QP -> struct ib_nl_qp
   IBNL_QP_ATTR -> struct ib_qp_attr (or a reasonable subset)
   IBNL_QP_CM_INFO u8[] -> struct ib_nl_qp_cm_info
   IBNL_QP_CM_SERVICE_ID u8[]
   IBNL_RDMA_CM_INFO -> struct ib_nl_rdma_cm_info
   IBNL_RDMA_CM_SRC -> u8[]  // This is a sockaddr_*
   IBNL_RDMA_CM_DST -> u8[]
}+

Meaning there is an array of IBNL_QP messages which contains various
attributes. Similar to how everything else in netlink works.

struct ib_nl_qp
{
// String names for IB devices was a mistake, don't perpetuate it.
__u32 ib_dev_if;
__u32 qp_num;
__s32 creator_pid; // -1 for kernel consumer
};

struct ib_nl_qp_cm_info
{
  u32 cm_state; // enum ib_cm_state
  u32 lap_state;
};

struct ib_nl_rdma_cm_info
{
__u32 bound_dev_if;
__u32 family;
__u32 cm_state; // enum rdma_cm_state
};

This captures more information and doesn't tie things to RDMA_CM.

iWarp QPs would not export IBNL_QP_CM_INFO and QP_CM_SERVICE_ID, but
ideally we'd have a call out to the NIC to include the TCP diag
information for the underlying TCP socket since there is no other way
to access that.

Non RDMA-CM QPs (ie ipoib) would not include the RDMA_CM bits.

If you study how ss works you'll see it is similar: it uses a message
of type AF_INET/6 and then includes attributes like
INET_DIAG_MEMINFO/INFO/CONG.
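
For illustration, one entry in a layout like this could be emitted with the
standard netlink helpers roughly as below (a hedged sketch; IBNL_QP and the
attribute types come from the proposal above and are not an existing ABI):

static int fill_one_qp(struct sk_buff *skb, u32 pid, u32 seq,
		       const struct ib_nl_qp *hdr,
		       const struct ib_nl_rdma_cm_info *cm,
		       const struct sockaddr *src, int src_len)
{
	struct nlmsghdr *nlh;

	/* The message payload is the fixed ib_nl_qp header... */
	nlh = nlmsg_put(skb, pid, seq, IBNL_QP, sizeof(*hdr), NLM_F_MULTI);
	if (!nlh)
		return -EMSGSIZE;
	memcpy(nlmsg_data(nlh), hdr, sizeof(*hdr));

	/* ...followed by optional, variable-length attributes. */
	if (nla_put(skb, IBNL_RDMA_CM_INFO, sizeof(*cm), cm) ||
	    nla_put(skb, IBNL_RDMA_CM_SRC, src_len, src))
		goto nla_put_failure;

	nlmsg_end(skb, nlh);
	return 0;

nla_put_failure:
	nlmsg_cancel(skb, nlh);
	return -EMSGSIZE;
}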

 @@ -134,6 +135,7 @@ struct rdma_id_private {
   u32 qp_num;
   u8  srq;
   u8  tos;
 + pid_t   owner;

Maybe a separate patch for this? It probably really belongs in
ib_uverbs. What about kernel consumers?

 +static int cma_get_stats(struct sk_buff **nl_skb, int pid)

You really have to use netlink_dump_start here, doing it like this
will deadlock. See how other places use NLM_F_DUMP as well.

 +struct rdma_cm_id_stats {
 + enum rdma_node_type nt;
 + int port_num;
 + int bound_dev_if;
 + __be16  local_port;
 + __be16  remote_port;
 + __be32  local_addr[4];
 + __be32  remote_addr[4];
 + enum rdma_port_space ps;
 + enum rdma_cm_state cm_state;
 + u32 qp_num;
 + pid_t pid;
 +};

Putting enums in a user/kernel structure like this makes me nervous
that we'll have a 32/64 bit problem. It would be more consistent with
the uverbs stuff to use explicit fixed width types.
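
For example, the same structure with explicit fixed-width types might look
like this (a sketch only; the exact widths are assumptions, not a reviewed
layout):

struct rdma_cm_id_stats {
	__u32	node_type;	/* was enum rdma_node_type */
	__u32	port_num;
	__u32	bound_dev_if;
	__be16	local_port;
	__be16	remote_port;
	__be32	local_addr[4];
	__be32	remote_addr[4];
	__u32	port_space;	/* was enum rdma_port_space */
	__u32	cm_state;	/* was enum rdma_cm_state */
	__u32	qp_num;
	__u32	pid;
};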

Jason