Re: [ANNOUNCE] libcxgb3 1.2.5 released

2009-10-05 Thread Christoph Lameter
If it just would support multicasting... Sigh.

On Mon, 5 Oct 2009, Steve Wise wrote:

> Sorry for yet another libcxgb3 so soon, but:
>
> The libcxgb3 package is a userspace driver for Chelsio T3 iWARP RNICs.   It is
> a plug-in module for libibverbs that allows programs to use Chelsio RDMA
> hardware directly from userspace.
>
> A new release, libcxgb3-1.2.5, is available from
>
> http://www.openfabrics.org/downloads/cxgb3/libcxgb3-1.2.5.tar.gz
>
> with md5sum
>
> ed2eaf99bc7cce401dd0549505febdd1 libcxgb3-1.2.5.tar.gz
>
> This minor release fixes a bug where the qp wasn't being correctly RESET
> causing dapltest failures.
>
>
> Vlad: Please pull this into ofed-1.5 before RC1.
>
>
> Thanks,
>
>
> Steve.


Re: patch for non-ib network

2009-11-20 Thread Christoph Lameter
On Fri, 20 Nov 2009, ja...@cdac.in wrote:

> Is it possible to add a patch to the OFED release for non-IB networks? We've
> integrated OFED with 'paramnet-3', a 10 Gbps system area network. Currently
> IPoIB is working fine. More details can be provided if required.

Sure. Mellanox is doing something similar right now. Just be sure to sync
code up.



Re: RDMAoE verbs questions

2009-12-01 Thread Christoph Lameter
On Wed, 25 Nov 2009, Jason Gunthorpe wrote:

> If you have a single physical chip with two ports and they are running
> different protocols it seems much cleaner to me to report it to verbs
> apps as two devices.
>
> Doing this avoids creating compatibility problems.

Right. Mellanox has some limitations on how each port can be
configured. But that could also be checked with two independent devices.
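
(For what it's worth, nothing special would be needed on the application side for a
two-device presentation; a verbs app simply walks the device list. A minimal
libibverbs sketch:)

#include <stdio.h>
#include <infiniband/verbs.h>

/* List the RDMA devices as a verbs application sees them; a dual-port
 * chip exposed as two devices simply shows up as two entries here. */
int main(void)
{
	int i, num;
	struct ibv_device **list = ibv_get_device_list(&num);

	if (!list)
		return 1;
	for (i = 0; i < num; i++)
		printf("%s\n", ibv_get_device_name(list[i]));
	ibv_free_device_list(list);
	return 0;
}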



Re: RDMAoE verbs questions

2009-12-01 Thread Christoph Lameter
On Tue, 1 Dec 2009, Eli Cohen wrote:

> On Mon, Nov 30, 2009 at 09:03:47AM -0500, Jeff Squyres wrote:
> > Per my prior question: is it expected that IBoE will function
> > *exactly* the same as real IB?  The addition of the port attribute
> > seems to imply not.
>
> IBoE and IB should work exactly the same from the perspective of a
> user level application that makes use of rdmacm to create connections.
> Such apps can ignore the new attribute. And we believe this should
> cover the vast majority of apps.  The new port attribute optionally
> allows the distinction for apps that need it (e.g. those that do not
> use the rdmacm, apps that have a reason to prefer one over the other
> when there is a choice, etc).

Only via the rdmacm? I would expect syscall-level API compatibility. For our
use case this needs to work with multicast traffic.


IPoIB multiqueue support?

2010-05-10 Thread Christoph Lameter
I see that some IB NICs can do multiqueue in Ethernet mode.

Is there any work on multiqueue support for IPoIB going on?



Re: IPoIB multiqueue support?

2010-05-11 Thread Christoph Lameter
On Mon, 10 May 2010, Roland Dreier wrote:

>  > Is there any work on multiqueue support for IPoIB going on?
>
> No, although one could view connected mode as an even better place to
> start, since you already get perfect classification by remote peer for
> free.

I am mostly interested in multicast traffic. Connected mode is not
relevant to that usage scenario.






Re: IPoIB multiqueue support?

2010-05-11 Thread Christoph Lameter
On Tue, 11 May 2010, Roland Dreier wrote:

>  > I am mostly interested in multicast traffic. Connected mode is not
>  > relevant to that usage scenario.
>
> As I said, I don't think anyone is working on it.  However it wouldn't
> be that hard to get something pretty good for multicast, since the
> InfiniBand multicast join mechanism would let you have essentially a
> perfect filter for steering individual multicast groups to whichever QP
> (ring) you wanted to.

Right but then would each individual QP need its own IP address?

> Of course you could also implement the equivalent thing in userspace and
> probably get even better performance.

Start a QP listening to IPoIB mc traffic?
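
(A minimal sketch of that userspace side, assuming a UD QP has already been
created and the group's MGID and MLID have been obtained from an SA join;
listen_on_group is only an illustrative name:)

#include <stdint.h>
#include <infiniband/verbs.h>

/* Attach an existing UD QP to one multicast group so that packets sent
 * to this MGID/MLID land on that QP's receive queue.  The SA join that
 * makes the fabric route the group still has to be done separately. */
static int listen_on_group(struct ibv_qp *qp, union ibv_gid *mgid,
			   uint16_t mlid)
{
	return ibv_attach_mcast(qp, mgid, mlid);
}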



Re: IPoIB multiqueue support?

2010-05-11 Thread Christoph Lameter
On Tue, 11 May 2010, Jason Gunthorpe wrote:

> > Right but then would each individual QP need its own IP address?
>
> I think Roland means that each IP multicast address is mapped into an
> IB multicast GID, and you can bind a QP to a set of MGIDs. Right now
> the driver binds all MGIDs to the rx QP and basically ignores the
> MGID on receive.

Aha.

> To go multi-queue you'd create multiple QPs and spread the MGID binds
> amongst them.

It would be best to bind them to the QP of the local processor (assuming
that the process continues to run on that processor).

What about unicast traffic? One QP gets all unicast?

> Yes, but the downside is that if you rely on the kernel to the group
> join then the HCA will send the packet to user space and the kernel QP.
>
> Some of the weird features in the RDMA CM seem to be for supporting
> this..

The UMCAST flag can stop the kernel from processing the IGMP reply.
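
(For reference, that flag is exposed per interface as IPoIB's umcast attribute
in sysfs; a minimal sketch of flipping it from userspace, with the interface
name ib0 only an example:)

#include <stdio.h>

/* Enable user-managed multicast on an IPoIB interface so the kernel
 * side stays out of the way of a userspace joiner (see discussion). */
int main(void)
{
	FILE *f = fopen("/sys/class/net/ib0/umcast", "w");

	if (!f)
		return 1;
	fputs("1\n", f);
	return fclose(f) ? 1 : 0;
}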



Re: IPoIB multiqueue support?

2010-05-11 Thread Christoph Lameter
On Tue, 11 May 2010, Jason Gunthorpe wrote:

> > The UMCAST flag can stop the kernel from processing the IGMP reply.
>
> I'm not talking about IGMP, but the IB version of IGMP, the kernel
> joins the group in IB land and also attaches the IPOIB QP. This can
> all be faked out in userspace, but it isn't entirely straightforward.

Yes vendors use the UMCAST flag to avoid this.



IPoIB: Broken IGMP processing

2010-08-23 Thread Christoph Lameter
We see that IGMP timers are not properly deferred when hosts send IGMP
membership information. It looks as if the IPoIB layer does not properly
mark the multicast/broadcast packets with PACKET_MULTICAST or
PACKET_BROADCAST. As a result igmp_recv() ignores the IGMP membership
information from others. That in turn results in the IGMP timers
frequently expiring, and the network becomes quite chatty.

The following is an untested patch: I am not sure exactly how to access the
IPoIB MAC header. The IB header contains a special marker for IPoIB
multicast, so it should be simple to identify the multicast packets in the
receive path.


Subject: [IB] Make igmp processing work with IPOIB

IGMP processing is broken because IPoIB does not set
skb->pkt_type correctly for multicast traffic. All incoming
packets are set to PACKET_HOST, which means that the igmp_recv()
function will ignore the IGMP broadcasts/multicasts.

This in turn means that the IGMP timers are firing and are sending
information about multicast subscriptions unnecessarily. In a large
private network this can cause traffic spikes.

Signed-off-by: Christoph Lameter 

---
 drivers/infiniband/ulp/ipoib/ipoib.h|   13 +
 drivers/infiniband/ulp/ipoib/ipoib_ib.c |6 --
 2 files changed, 17 insertions(+), 2 deletions(-)

Index: linux-2.6/drivers/infiniband/ulp/ipoib/ipoib.h
===
--- linux-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2010-08-20 
19:44:13.0 -0500
+++ linux-2.6/drivers/infiniband/ulp/ipoib/ipoib.h  2010-08-20 
19:58:21.0 -0500
@@ -114,6 +114,9 @@ enum {
 #defineIPOIB_OP_CM (0)
 #endif

+#define IPOIB_MGID_IPV4_SIGNATURE 0x401B
+#define IPOIB_MGID_IPV6_SIGNATURE 0x601B
+
 /* structs */

 struct ipoib_header {
@@ -125,6 +128,16 @@ struct ipoib_pseudoheader {
u8  hwaddr[INFINIBAND_ALEN];
 };

+int ipoib_is_ipv4_multicast(u8 *p)
+{
+   return *((u16 *)(p + 2)) == htonl(IPOIB_MGID_IPV4_SIGNATURE);
+}
+
+int ipoib_is_ipv6_multicast(u8 *p)
+{
+   return *((u16 *)(p + 2)) == htonl(IPOIB_MGID_IPV6_SIGNATURE);
+}
+
 /* Used for all multicast joins (broadcast, IPv4 mcast and IPv6 mcast) */
 struct ipoib_mcast {
struct ib_sa_mcmember_rec mcmember;
Index: linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_ib.c
===
--- linux-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c  2010-08-20 
18:43:44.0 -0500
+++ linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_ib.c   2010-08-20 
19:58:34.0 -0500
@@ -281,8 +281,10 @@ static void ipoib_ib_handle_rx_wc(struct
dev->stats.rx_bytes += skb->len;

skb->dev = dev;
-   /* XXX get correct PACKET_ type here */
-   skb->pkt_type = PACKET_HOST;
+   if (ipoib_is_ipv4_multicast(skb_mac_header(skb)))
+   skb->pkt_type = PACKET_MULTICAST;
+   else
+   skb->pkt_type = PACKET_HOST;

if (test_bit(IPOIB_FLAG_CSUM, &priv->flags) && likely(wc->csum_ok))
skb->ip_summed = CHECKSUM_UNNECESSARY;


Re: IPoIB: Broken IGMP processing

2010-08-23 Thread Christoph Lameter
On Mon, 23 Aug 2010, Jason Gunthorpe wrote:

> On Mon, Aug 23, 2010 at 12:16:40PM -0500, Christoph Lameter wrote:
>
> > +int ipoib_is_ipv4_multicast(u8 *p)
> > +{
> > +   return *((u16 *)(p + 2)) == htonl(IPOIB_MGID_IPV4_SIGNATURE);
> > +}
> > +
> > +int ipoib_is_ipv6_multicast(u8 *p)
> > +{
> > +   return *((u16 *)(p + 2)) == htonl(IPOIB_MGID_IPV6_SIGNATURE);
> > +}
>
> static inline for functions in headers?

Right.
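
(For the record, a corrected sketch of those helpers as static inlines; it
assumes the caller passes a pointer to the first byte of the 16-byte
MGID/DGID and compares the signature bytes directly, which also sidesteps the
htonl()-on-a-16-bit-value comparison above:)

#include <linux/types.h>	/* u8 */

/* p points at a 16-byte IPoIB MGID/DGID: byte 0 is 0xff for multicast,
 * bytes 2-3 carry the IPoIB signature (0x401B IPv4, 0x601B IPv6). */
static inline int ipoib_is_ipv4_multicast(const u8 *p)
{
	return p[0] == 0xff && p[2] == 0x40 && p[3] == 0x1b;
}

static inline int ipoib_is_ipv6_multicast(const u8 *p)
{
	return p[0] == 0xff && p[2] == 0x60 && p[3] == 0x1b;
}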

> Maybe checking for wc->qp_num == multicast QPN is a better
> choice?

If that is properly set then yes, of course that is much easier. Would this
work? The broadcast QP is used for multicast, right?


Subject: [IB] Make igmp processing work with IPOIB

IGMP processing is broken because IPoIB does not set
skb->pkt_type correctly for multicast traffic. All incoming
packets are set to PACKET_HOST, which means that the igmp_recv()
function will ignore the IGMP broadcasts/multicasts.

This in turn means that the IGMP timers are firing and are sending
information about multicast subscriptions unnecessarily. In a large
private network this can cause traffic spikes.

Signed-off-by: Christoph Lameter 

---
 drivers/infiniband/ulp/ipoib/ipoib_ib.c |8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

Index: linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_ib.c
===
--- linux-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c  2010-08-23 
13:07:32.0 -0500
+++ linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_ib.c   2010-08-23 
13:09:06.0 -0500
@@ -223,6 +223,7 @@ static void ipoib_ib_handle_rx_wc(struct
unsigned int wr_id = wc->wr_id & ~IPOIB_OP_RECV;
struct sk_buff *skb;
u64 mapping[IPOIB_UD_RX_SG];
+   struct ipoib_dev_priv *multicast_priv = 
netdev_priv(priv->broadcast->dev);

ipoib_dbg_data(priv, "recv completion: id %d, status: %d\n",
   wr_id, wc->status);
@@ -281,8 +282,11 @@ static void ipoib_ib_handle_rx_wc(struct
dev->stats.rx_bytes += skb->len;

skb->dev = dev;
-   /* XXX get correct PACKET_ type here */
-   skb->pkt_type = PACKET_HOST;
+   if (wc->src_qp == multicast_priv->qp->qp_num)
+
+   skb->pkt_type = PACKET_MULTICAST;
+   else
+   skb->pkt_type = PACKET_HOST;

if (test_bit(IPOIB_FLAG_CSUM, &priv->flags) && likely(wc->csum_ok))
skb->ip_summed = CHECKSUM_UNNECESSARY;


Re: IPoIB: Broken IGMP processing

2010-08-23 Thread Christoph Lameter
On Mon, 23 Aug 2010, Jason Gunthorpe wrote:

> Hmmm... What are you trying to access here? I'm guessing it is the
> DGID of the GRH?
>
> ipoib_ud_skb_put_frags(priv, skb, wc->byte_len);
> skb_pull(skb, IB_GRH_BYTES);  <-- These are the bytes you want
> skb_reset_mac_header(skb);  <-- Sets skb_mac_header to skb->head+40
> skb_pull(skb, IPOIB_ENCAP_LEN);
>
> So, I think you are accessing byte 42, which doesn't seem right? The
> DGID starts in byte 24 from skb->head.
>
> Also, you need to check for IBV_WC_GRH, the 40 bytes are garbage if it
> is not set.

Trying to get the MGID information:

From http://tools.ietf.org/html/draft-ietf-ipoib-link-multicast-00


7. The IPoIB All-Node Multicast and Broadcast Group

   Once an IB partition is created with link attributes identified for
   an IPoIB link, the network administrator must create a special IB
   multicast group for every node on the IPoIB link to join. This is
   achieved through the creation of "MCGroupRecord" in each IB subnet
   that the IB partition encompasses, as described in section 4 above.

   The MGID will have the P_Key of the IB partition that defines the
   IPoIB link embedded in it. A special signature is also embedded to
   identify the MGID for IPoIB use only. For IPv4 over IB, the signature
   will be "0x401B". For IPv6 over IB, the signature will be "0x601B".

   For an IPv4 subnet, the MGID for this special IB multicast group
   SHALL have the following format:

   |   8    |  4 |  4 |     16 bits     | 16 bits | 48 bits  | 32 bits |
   +--------+----+----+-----------------+---------+----------+---------+
   |11111111|0001|scop|<IPoIB signature>|< P_Key >|  00...0  |         |
   +--------+----+----+-----------------+---------+----------+---------+



   For an IPv6 subnet, the format of the MGID SHALL look like this:

   |   8    |  4 |  4 |     16 bits     | 16 bits |    80 bits    |
   +--------+----+----+-----------------+---------+---------------+
   |11111111|0001|scop|<IPoIB signature>|< P_Key >| 000......0001 |
   +--------+----+----+-----------------+---------+---------------+

   As for the scop bits, if the IPoIB link is fully contained within a
   single IB subnet, the scop bits SHALL be set to 2 (link-local).
   Otherwise the scope will be set higher.

   A MCGroupRecord will be created with all the IPoIB link attributes
   described before, including the link MTU, Q_Key, TClass, FlowLabel,
   and HopLimit. When a node is attached to an IPoIB link identified by
   a P_Key, it must look for a special, all-node multicast/broadcast
   group to join. This is done by constructing the MGID with the link
   P_Key and the IPoIB signature. The node SHOULD always look for a MGID
   of a link-local scope first before attempting one with a greater
   scope.

   Once the right MGID and MCGroupRecord are identified, the local node
   SHOULD use the link MTU recorded in the MCGroupRecord. It MUST accept
   a smaller MTU if one is advertised through the link MTU option of a
   router advertisement [DISC].

   In case the link MTU is greater than the maximum payload size that
   the local HCA can support, the node can not join the IPoIB link and
   operate as an IP node.

   After the right MTU is determined, the local node must join the
   special all-node multicast/broadcast group by calling the SA to
   create a MCMemberRecord corresponding to the MGID. The SA will return
   all the link attributes for the local node to use. The node MUST use
   these attributes in all future multicast operations to the local
   IPoIB link.  The broadcast group for IPv4 will serve to provide a
   broadcast service for protocol like ARP to use.

   In addition to the all-node multicast/broadcast group, an all-router
   multicast group SHOULD be created at link configuration time if an IP
   router will be attached to the link. This is to facilitate IP
   multicast operations described later. A MCGroupRecord for the all-
   router MGID must be created in every IB subnet that the IPoIB link
   encompasses. The format of the all-router MGID will be covered in
   next section.
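
(As a concrete illustration of the layout above, a sketch that fills in the
IPv4 all-node/broadcast MGID for a given P_Key with link-local scope; the
trailing ffff:ffff matches the usual IPoIB default broadcast GID
ff12:401b:<P_Key>::ffff:ffff:)

#include <stdint.h>
#include <string.h>

/* Fill in the special IPv4 all-node/broadcast MGID described above. */
static void ipoib_ipv4_broadcast_mgid(uint8_t mgid[16], uint16_t pkey)
{
	memset(mgid, 0, 16);
	mgid[0] = 0xff;		/* multicast prefix (8 bits of ones) */
	mgid[1] = 0x12;		/* flags 0001, scope 2 (link-local) */
	mgid[2] = 0x40;		/* IPv4-over-IB signature 0x401B */
	mgid[3] = 0x1b;
	mgid[4] = pkey >> 8;	/* P_Key of the IPoIB link */
	mgid[5] = pkey & 0xff;
	mgid[12] = 0xff;	/* last 32 bits: the mapped group, */
	mgid[13] = 0xff;	/* here all ones for the broadcast */
	mgid[14] = 0xff;	/* group (255.255.255.255)         */
	mgid[15] = 0xff;
}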



Re: IPoIB: Broken IGMP processing

2010-08-23 Thread Christoph Lameter
On Mon, 23 Aug 2010, Jason Gunthorpe wrote:

> src_qp is just the send QPN, you need to look at qp_num (aka dest qp).
> I'm not entirely sure what it will be, I didn't find anything too clear
> in the spec. If it is 0xFF then the HCA is copying the dest QP
> directly into the WC and this can work, if it is something else then
> the HCA is setting it to the QPN of the RQ that received the packet,
> which is not useful for this.

AFAICT the spec has a QP for RC and one for UD packets. How do these
relate to multicast?



Re: IPoIB: Broken IGMP processing

2010-08-23 Thread Christoph Lameter
On Mon, 23 Aug 2010, Jason Gunthorpe wrote:

> Simplest then is to check if byte 24 of the packet is 0xff.
> (ie IN6_IS_ADDR_MULTICAST)

I don't see that function defined anywhere in the kernel.

> No need to worry about if it is properly formed or anything, if it is
> a multicast DGID then it is a multicast packet at the link level.

OK, so skb->head[24] == 0xff?

Looks raw and not kernel style. There needs to be some skb_xxx function
that gets to it.
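
(Something like this is what I am after; a sketch that assumes it runs in
ipoib_ib_handle_rx_wc() before the skb_pull(skb, IB_GRH_BYTES), so skb->data
still points at the GRH, and that honors the IB_WC_GRH caveat mentioned
earlier:)

#include <linux/skbuff.h>
#include <rdma/ib_verbs.h>

/* True if the completed receive carried a GRH whose DGID is a
 * multicast GID (first byte 0xff).  Only valid while skb->data
 * still points at the 40-byte GRH. */
static inline int ipoib_rx_dgid_is_multicast(const struct ib_wc *wc,
					     const struct sk_buff *skb)
{
	const struct ib_grh *grh = (const struct ib_grh *)skb->data;

	return (wc->wc_flags & IB_WC_GRH) && grh->dgid.raw[0] == 0xff;
}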



RE: IPoIB: Broken IGMP processing

2010-08-25 Thread Christoph Lameter
On Mon, 23 Aug 2010, Yossi Etigin wrote:

> >
> > Simplest then is to check if byte 24 of the packet is 0xff.
> > (ie IN6_IS_ADDR_MULTICAST)
> >
> > No need to worry about if it is properly formed or anything, if it is
> > a multicast DGID then it is a multicast packet at the link level.
> >
>
> Sounds good to me

Therefore this patch?

Subject: [IB] Make igmp processing work with IPOIB

IGMP processing is broken because IPoIB does not set
skb->pkt_type correctly for multicast traffic. All incoming
packets are set to PACKET_HOST, which means that the igmp_recv()
function will ignore the IGMP broadcasts/multicasts.

This in turn means that the IGMP timers are firing and are sending
information about multicast subscriptions unnecessarily. In a large
private network this can cause traffic spikes.

Signed-off-by: Christoph Lameter 

---
 drivers/infiniband/ulp/ipoib/ipoib_ib.c |7 +--
 include/linux/in6.h |3 +++
 2 files changed, 8 insertions(+), 2 deletions(-)

Index: linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_ib.c
===
--- linux-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c  2010-08-23 
16:04:38.0 -0500
+++ linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_ib.c   2010-08-25 
09:43:01.0 -0500
@@ -281,8 +281,11 @@ static void ipoib_ib_handle_rx_wc(struct
dev->stats.rx_bytes += skb->len;

skb->dev = dev;
-   /* XXX get correct PACKET_ type here */
-   skb->pkt_type = PACKET_HOST;
+   if (IN6_IS_ADDR_MULTICAST(skb->head + 24))
+
+   skb->pkt_type = PACKET_MULTICAST;
+   else
+   skb->pkt_type = PACKET_HOST;

if (test_bit(IPOIB_FLAG_CSUM, &priv->flags) && likely(wc->csum_ok))
skb->ip_summed = CHECKSUM_UNNECESSARY;
Index: linux-2.6/include/linux/in6.h
===
--- linux-2.6.orig/include/linux/in6.h  2010-08-25 09:39:40.0 -0500
+++ linux-2.6/include/linux/in6.h   2010-08-25 09:40:22.0 -0500
@@ -53,6 +53,9 @@ extern const struct in6_addr in6addr_lin
 extern const struct in6_addr in6addr_linklocal_allrouters;
 #define IN6ADDR_LINKLOCAL_ALLROUTERS_INIT \
{ { { 0xff,2,0,0,0,0,0,0,0,0,0,0,0,0,0,2 } } }
+
+#define IN6_IS_ADDR_MULTICAST(a) (((const __u8 *) (a))[0] == 0xff)
+
 #endif

 struct sockaddr_in6 {


Re: [PATCHv10 0/11] IBoE support to Infiniband

2010-08-26 Thread Christoph Lameter
On Thu, 26 Aug 2010, Eli Cohen wrote:

> With these patches, IBoE multicast frames may be broadcast as there is
> currently no use of a L2 multicast group membership protocol.

"May be"? They are broadcast because there is no IGMP-like support. Any
hope of multicast support soon?



[IPoIB] Identify Multicast packets and fix IGMP breakage

2010-08-26 Thread Christoph Lameter
I had to move the check around a bit but the patch now passes our tests.
Please merge soon.


Subject: [IPoIB] Identify Multicast packets and fix IGMP breakage

IGMP processing is broken because IPoIB does not set
skb->pkt_type correctly for multicast traffic. All incoming
packets are set to PACKET_HOST, which means that the igmp_recv()
function will ignore the IGMP broadcasts/multicasts.

This in turn means that the IGMP timers are firing and are sending
information about multicast subscriptions unnecessarily. In a large
private network this can cause traffic spikes.

Signed-off-by: Christoph Lameter 

---
 drivers/infiniband/ulp/ipoib/ipoib_ib.c |7 +--
 include/linux/in6.h |3 +++
 2 files changed, 8 insertions(+), 2 deletions(-)

Index: linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_ib.c
===
--- linux-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c  2010-08-26 
14:11:39.0 -0500
+++ linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_ib.c   2010-08-26 
14:51:31.0 -0500
@@ -271,6 +271,14 @@
ipoib_ud_dma_unmap_rx(priv, mapping);
ipoib_ud_skb_put_frags(priv, skb, wc->byte_len);

+   /* According to Jason Gunthorpe byte 24 in the GRH has the MGID */
+   if (IN6_IS_ADDR_MULTICAST(skb->data + 24))
+
+   skb->pkt_type = PACKET_MULTICAST;
+
+   else
+   skb->pkt_type = PACKET_HOST;
+
skb_pull(skb, IB_GRH_BYTES);

skb->protocol = ((struct ipoib_header *) skb->data)->proto;
@@ -281,9 +289,6 @@
dev->stats.rx_bytes += skb->len;

skb->dev = dev;
-   /* XXX get correct PACKET_ type here */
-   skb->pkt_type = PACKET_HOST;
-
if (test_bit(IPOIB_FLAG_CSUM, &priv->flags) && likely(wc->csum_ok))
skb->ip_summed = CHECKSUM_UNNECESSARY;

Index: linux-2.6/include/linux/in6.h
===
--- linux-2.6.orig/include/linux/in6.h  2010-08-26 14:11:39.0 -0500
+++ linux-2.6/include/linux/in6.h   2010-08-26 14:11:52.0 -0500
@@ -53,6 +53,9 @@
 extern const struct in6_addr in6addr_linklocal_allrouters;
 #define IN6ADDR_LINKLOCAL_ALLROUTERS_INIT \
{ { { 0xff,2,0,0,0,0,0,0,0,0,0,0,0,0,0,2 } } }
+
+#define IN6_IS_ADDR_MULTICAST(a) (((const __u8 *) (a))[0] == 0xff)
+
 #endif

 struct sockaddr_in6 {


Re: [IPoIB] Identify Multicast packets and fix IGMP breakage

2010-08-26 Thread Christoph Lameter
On Thu, 26 Aug 2010, Roland Dreier wrote:

> also it's not clear to me why it's OK to do this test of the DGID if the
> packet didn't have a GRH -- presumably we are just looking at random
> uninitialized memory so we might incorrectly say some packets are
> multicast if that byte happens to be 0xff.  (or does that not matter?
> if so why can't we just always make everything PACKET_MULTICAST?)

We do an skb_pull to skip over the GRH right after this, don't we? The IP
layer checks PACKET_HOST etc. in various places, so we may risk breakage in
odd places like what already occurred here. It would be good to also have
support for PACKET_BROADCAST, but that would require a full MGID comparison.

> It seems the check should be something like
>
>   if ((wc->wc_flags & IB_WC_GRH) &&
>   IN6_IS_ADDR_MULTICAST(((struct ipv6hdr *) skb->data)->daddr))

Hmmm... More IPV6 stuff slipping in.

> We probably want to run this addition past netdev.

Ok.



[IPoIB] Identify multicast packets and fix IGMP breakage V2

2010-08-26 Thread Christoph Lameter
Is this better?


Subject: [IPoIB] Identify Multicast packets and fix IGMP breakage V2

IGMP processing is broken because IPoIB does not set
skb->pkt_type correctly for multicast traffic. All incoming
packets are set to PACKET_HOST, which means that the igmp_recv()
function will ignore the IGMP broadcasts/multicasts.

This in turn means that the IGMP timers are firing and are sending
information about multicast subscriptions unnecessarily. In a large
private network this can cause traffic spikes.

Signed-off-by: Christoph Lameter 

---
 drivers/infiniband/ulp/ipoib/ipoib_ib.c |   10 +++---
 include/linux/in6.h |3 +++
 2 files changed, 10 insertions(+), 3 deletions(-)

Index: linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_ib.c
===
--- linux-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c  2010-08-26 
15:11:34.0 -0500
+++ linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_ib.c   2010-08-26 
16:29:19.0 -0500
@@ -271,6 +271,13 @@ static void ipoib_ib_handle_rx_wc(struct
ipoib_ud_dma_unmap_rx(priv, mapping);
ipoib_ud_skb_put_frags(priv, skb, wc->byte_len);

+   if ((wc->wc_flags & IB_WC_GRH) &&
+   IN6_IS_ADDR_MULTICAST(&((struct ipv6hdr *)skb->data)->daddr))
+
+   skb->pkt_type = PACKET_MULTICAST;
+   else
+   skb->pkt_type = PACKET_HOST;
+
skb_pull(skb, IB_GRH_BYTES);

skb->protocol = ((struct ipoib_header *) skb->data)->proto;
@@ -281,9 +288,6 @@ static void ipoib_ib_handle_rx_wc(struct
dev->stats.rx_bytes += skb->len;

skb->dev = dev;
-   /* XXX get correct PACKET_ type here */
-   skb->pkt_type = PACKET_HOST;
-
if (test_bit(IPOIB_FLAG_CSUM, &priv->flags) && likely(wc->csum_ok))
skb->ip_summed = CHECKSUM_UNNECESSARY;

Index: linux-2.6/include/linux/in6.h
===
--- linux-2.6.orig/include/linux/in6.h  2010-08-26 15:11:34.0 -0500
+++ linux-2.6/include/linux/in6.h   2010-08-26 16:27:08.0 -0500
@@ -53,6 +53,9 @@ extern const struct in6_addr in6addr_lin
 extern const struct in6_addr in6addr_linklocal_allrouters;
 #define IN6ADDR_LINKLOCAL_ALLROUTERS_INIT \
{ { { 0xff,2,0,0,0,0,0,0,0,0,0,0,0,0,0,2 } } }
+
+#define IN6_IS_ADDR_MULTICAST(a) ((a)->in6_u.u6_addr8[0] == 0xff)
+
 #endif

 struct sockaddr_in6 {


Re: [IPoIB] Identify multicast packets and fix IGMP breakage V3

2010-08-26 Thread Christoph Lameter
On Thu, 26 Aug 2010, Jason Gunthorpe wrote:

> The 40 bytes at this location are defined by the HW specification to
> be an IB GRH which has an identical layout to an IPv6 header. Roland
> is right, it would be clearer to use ib_grh ->dgid

Ok but then we have no nice function that checks for multicast anymore.



Subject: [IPoIB] Identify multicast packets and fix IGMP breakage V3

IGMP processing is broken because IPoIB does not set
skb->pkt_type correctly for multicast traffic. All incoming
packets are set to PACKET_HOST, which means that the igmp_recv()
function will ignore the IGMP broadcasts/multicasts.

This in turn means that the IGMP timers are firing and are sending
information about multicast subscriptions unnecessarily. In a large
private network this can cause traffic spikes.

Signed-off-by: Christoph Lameter 

---

Index: linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_ib.c
===
--- linux-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c  2010-08-26 
18:24:07.842079559 -0500
+++ linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_ib.c   2010-08-26 
18:25:33.859815544 -0500
@@ -271,6 +271,14 @@
ipoib_ud_dma_unmap_rx(priv, mapping);
ipoib_ud_skb_put_frags(priv, skb, wc->byte_len);

+   /* First byte of dgid signals multicast when 0xff */
+   if ((wc->wc_flags & IB_WC_GRH) &&
+   ((struct ib_grh *)skb->data)->dgid.raw[0] == 0xff)
+
+   skb->pkt_type = PACKET_MULTICAST;
+   else
+   skb->pkt_type = PACKET_HOST;
+
skb_pull(skb, IB_GRH_BYTES);

skb->protocol = ((struct ipoib_header *) skb->data)->proto;
@@ -281,9 +289,6 @@
dev->stats.rx_bytes += skb->len;

skb->dev = dev;
-   /* XXX get correct PACKET_ type here */
-   skb->pkt_type = PACKET_HOST;
-
if (test_bit(IPOIB_FLAG_CSUM, &priv->flags) && likely(wc->csum_ok))
skb->ip_summed = CHECKSUM_UNNECESSARY;



Re: [IPoIB] Identify multicast packets and fix IGMP breakage V3

2010-08-26 Thread Christoph Lameter
On Thu, 26 Aug 2010, Jason Gunthorpe wrote:

> I think doing the memcmp only in the multicast path should be
> reasonable overhead wise.

That is not always possible. Here the multicast path is the
default path that is taken 99% of the time.



Re: [IPoIB] Identify multicast packets and fix IGMP breakage V3

2010-08-26 Thread Christoph Lameter

On Thu, 26 Aug 2010, David Miller wrote:

> The highest cost is bringing in that packet header's cache line, which
> you've already done by reading the byte and checking for 0xff.

And then you need to bring in the cacheline of the broadcast address.
These are pretty long byte strings in the IB case.



Re: [IPoIB] Identify multicast packets and fix IGMP breakage V3

2010-08-27 Thread Christoph Lameter
On Thu, 26 Aug 2010, Jason Gunthorpe wrote:

> I think doing the memcmp only in the multicast path should be
> reasonable overhead wise.

OK, the dgid is only 16 bytes, not the whole 40 bytes. Here is the patch,
somewhat cleaned up, with PACKET_BROADCAST.


Subject: [IPoIB] Identify multicast packets and fix IGMP breakage V3

IGMP processing is broken because IPoIB does not set
skb->pkt_type correctly for multicast traffic. All incoming
packets are set to PACKET_HOST, which means that the igmp_recv()
function will ignore the IGMP broadcasts/multicasts.

This in turn means that the IGMP timers are firing and are sending
information about multicast subscriptions unnecessarily. In a large
private network this can cause traffic spikes.

Signed-off-by: Christoph Lameter 

---

Index: linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_ib.c
===
--- linux-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c  2010-08-26 
18:24:07.842079559 -0500
+++ linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_ib.c   2010-08-27 
08:26:37.929641162 -0500
@@ -223,6 +223,7 @@
unsigned int wr_id = wc->wr_id & ~IPOIB_OP_RECV;
struct sk_buff *skb;
u64 mapping[IPOIB_UD_RX_SG];
+   union ib_gid *dgid;

ipoib_dbg_data(priv, "recv completion: id %d, status: %d\n",
   wr_id, wc->status);
@@ -271,6 +272,21 @@
ipoib_ud_dma_unmap_rx(priv, mapping);
ipoib_ud_skb_put_frags(priv, skb, wc->byte_len);

+   /* First byte of dgid signals multicast when 0xff */
+   dgid = &((struct ib_grh *)skb->data)->dgid;
+
+   if (!(wc->wc_flags & IB_WC_GRH) || dgid->raw[0] != 0xff)
+
+   skb->pkt_type = PACKET_HOST;
+
+   else if (memcmp(dgid, dev->broadcast + 4, sizeof(union ib_gid)) == 0)
+
+   skb->pkt_type = PACKET_BROADCAST;
+
+   else
+
+   skb->pkt_type = PACKET_MULTICAST;
+
skb_pull(skb, IB_GRH_BYTES);

skb->protocol = ((struct ipoib_header *) skb->data)->proto;
@@ -281,9 +297,6 @@
dev->stats.rx_bytes += skb->len;

skb->dev = dev;
-   /* XXX get correct PACKET_ type here */
-   skb->pkt_type = PACKET_HOST;
-
if (test_bit(IPOIB_FLAG_CSUM, &priv->flags) && likely(wc->csum_ok))
skb->ip_summed = CHECKSUM_UNNECESSARY;



Re: [IPoIB] Identify multicast packets and fix IGMP breakage V3

2010-09-14 Thread Christoph Lameter
On Tue, 14 Sep 2010, Or Gerlitz wrote:

> I don't see this patch in Roland's for-next branch nor Dave's net-next-2.6
> tree, is anything else needed to merge that?

No there is nothing else needed.

Roland said he is going to merge it for 2.6.37.


igmp: Allow mininum interval specification for igmp timers.

2010-09-22 Thread Christoph Lameter
IGMP timers sometimes fire too rapidly due to the randomization of the
intervals from 0 to max_delay in igmp_start_timer(). For some situations
(like the initial IGMP reports that are not responses to an IGMP query) we
do not want them in too rapid succession, otherwise all the initial reports
may be lost due to a race condition with the reconfiguration of the
routers and switches going on via the link layer (like on InfiniBand). If
those are lost then the router will only discover that a new MC group was
joined when the next IGMP query is sent. General IGMP queries may be sent
rarely on large fabrics, resulting in excessively long wait times until
data starts flowing. The application may abort before then, concluding that
the network hardware is not operational.

The worst-case scenario without these changes sends 3 IGMP reports on join:

First   3 jiffies ("immediate" per the spec, ~3 ms)
Second  3 jiffies (randomization leads to the shortest interval) 3 ms
Third   3 jiffies (randomization leads to the shortest interval) 3 ms

This may result in a total of less than 10 ms before the kernel gives up
sending IGMP reports.
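
(For reference, the arming logic this patch replaces, visible in the removed
lines of the diff below; the random term can be zero, which is where the
back-to-back reports come from:)

/* Pre-patch version of igmp_start_timer(), net/ipv4/igmp.c */
static void igmp_start_timer(struct ip_mc_list *im, int max_delay)
{
	int tv = net_random() % max_delay;	/* may be 0 */

	im->tm_running = 1;
	if (!mod_timer(&im->timer, jiffies + tv + 2))
		atomic_inc(&im->refcnt);
}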

Change the IGMP layer to allow the specification of minimum and maximum delay.
Calculate the IGMP_Unsolicited_Report interval based on what the interval
before this patch would be on a HZ=100 kernel. 3 jiffies at 100 HZ would result
in a minimum spacing of ~30 milliseconds between the initial two IGMP reports.
Round it up to 40ms.

This will result in 3 initial unsolicited reports:

First   "immediately"   3 jiffies (~3 ms)
Second  randomized 40ms to 10 seconds later
Third   randomized 40ms to 10 seconds later

So a minimum of ~83ms will pass before the unsolicited reports are
given up.

Signed-off-by: Christoph Lameter 

---
 net/ipv4/igmp.c |   45 +++--
 1 file changed, 31 insertions(+), 14 deletions(-)

Index: linux-2.6/net/ipv4/igmp.c
===
--- linux-2.6.orig/net/ipv4/igmp.c  2010-09-22 11:15:19.0 -0500
+++ linux-2.6/net/ipv4/igmp.c   2010-09-22 12:50:32.0 -0500
@@ -116,10 +116,17 @@
 #define IGMP_V2_Router_Present_Timeout (400*HZ)
 #define IGMP_Unsolicited_Report_Interval   (10*HZ)
 #define IGMP_Query_Response_Interval   (10*HZ)
-#define IGMP_Unsolicited_Report_Count  2

+/* Parameters not specified in igmp rfc. */
+
+/* Minimum ticks to have a meaningful notion of delay */
+#define IGMP_Mininum_Delay (2)
+
+/* Control of unsolicited reports (after join) */

+#define IGMP_Unsolicited_Report_Count  2
 #define IGMP_Initial_Report_Delay  (1)
+#define IGMP_Unsolicited_Report_Min_Delay  (HZ/25)

 /* IGMP_Initial_Report_Delay is not from IGMP specs!
  * IGMP specs require to report membership immediately after
@@ -174,22 +181,30 @@ static __inline__ void igmp_stop_timer(s
spin_unlock_bh(&im->lock);
 }

-/* It must be called with locked im->lock */
-static void igmp_start_timer(struct ip_mc_list *im, int max_delay)
+static inline unsigned long jiffies_rand_delay(int min_delay, int max_delay)
 {
-   int tv = net_random() % max_delay;
+   int d = min_delay;
+
+   if (min_delay < max_delay)
+   d += net_random() % (max_delay - min_delay);

+   return jiffies + d;
+}
+
+/* It must be called with locked im->lock */
+static void igmp_start_timer(struct ip_mc_list *im, int min_delay, int 
max_delay)
+{
im->tm_running = 1;
-   if (!mod_timer(&im->timer, jiffies+tv+2))
+   if (!mod_timer(&im->timer, jiffies_rand_delay(min_delay, max_delay)))
atomic_inc(&im->refcnt);
 }

 static void igmp_gq_start_timer(struct in_device *in_dev)
 {
-   int tv = net_random() % in_dev->mr_maxdelay;
-
in_dev->mr_gq_running = 1;
-   if (!mod_timer(&in_dev->mr_gq_timer, jiffies+tv+2))
+   if (!mod_timer(&in_dev->mr_gq_timer,
+   jiffies_rand_delay(IGMP_Mininum_Delay,
+   in_dev->mr_maxdelay)))
in_dev_hold(in_dev);
 }

@@ -201,7 +216,7 @@ static void igmp_ifc_start_timer(struct
in_dev_hold(in_dev);
 }

-static void igmp_mod_timer(struct ip_mc_list *im, int max_delay)
+static void igmp_mod_timer(struct ip_mc_list *im, int min_delay, int max_delay)
 {
spin_lock_bh(&im->lock);
im->unsolicit_count = 0;
@@ -214,7 +229,7 @@ static void igmp_mod_timer(struct ip_mc_
}
atomic_dec(&im->refcnt);
}
-   igmp_start_timer(im, max_delay);
+   igmp_start_timer(im, min_delay, max_delay);
spin_unlock_bh(&im->lock);
 }

@@ -733,7 +748,8 @@ static void igmp_timer_expire(unsigned l

if (im->unsolicit_count) {
im->unsolicit_count--;
-   igmp_start_ti

igmp: Staggered igmp report intervals for unsolicited igmp reports

2010-09-22 Thread Christoph Lameter
The earlier patch added an initial minimum latency and got us up to
~80ms. However, there are large networks that take longer to configure
multicast paths.

This patch changes the behavior for unsolicited IGMP reports to ensure
that even sporadic loss of the initial IGMP reports still results in a
reasonably fast subscription.

The RFC states that the first IGMP report should be sent immediately and
then mentions that a couple more should be sent, but it does not specify
exactly how the repetition of the IGMP reports should occur. The RFC
suggests that the behavior in response to an IGMP query (randomized
response between 0 and the max response time) could be followed, but we
have seen issues with this suggestion since the intervals can be very
short. There is also no reason to randomize, since the unsolicited reports
are not a response to an IGMP query but the result of a join request in
the code.

The patch here establishes more fixed delays for sending unsolicited IGMP
reports after a join. There is still a fuzz factor, but the sending of the
IGMP reports follows a set of intervals more tightly and sends up to 7
IGMP reports.

IGMP Report     Time delay

0   3 ticks ("immediate" according to the RFC)
1   40ms
2   200ms
3   1sec
4   5sec
5   10sec
6   60sec

So unsolicited reports are sent over an interval of at least a minute
(reports are aborted if IGMP reports or other info is seen).

Signed-off-by: Christoph Lameter 

---
 net/ipv4/igmp.c |   38 ++
 1 file changed, 34 insertions(+), 4 deletions(-)

Index: linux-2.6/net/ipv4/igmp.c
===
--- linux-2.6.orig/net/ipv4/igmp.c  2010-09-22 12:50:32.0 -0500
+++ linux-2.6/net/ipv4/igmp.c   2010-09-22 13:32:58.0 -0500
@@ -124,17 +124,40 @@

 /* Control of unsolicited reports (after join) */

-#define IGMP_Unsolicited_Report_Count  2
+#define IGMP_Unsolicited_Report_Count  6
 #define IGMP_Initial_Report_Delay  (1)
 #define IGMP_Unsolicited_Report_Min_Delay  (HZ/25)
+#define IGMP_Unsolicited_Fuzz  (HZ/100)
+

 /* IGMP_Initial_Report_Delay is not from IGMP specs!
  * IGMP specs require to report membership immediately after
  * joining a group, but we delay the first report by a
  * small interval. It seems more natural and still does not
  * contradict to specs provided this delay is small enough.
+ *
+ * The spec does not say how the initial igmp reports
+ * need to be repeated (aside from suggesting to just do the
+ * randomization of the intervals as for igmp queries but then
+ * there is no centralized trigger and therefore no randomization
+ * needed). We provide an array of delays here that are likely
+ * to work in general avoiding the often too short or too long intervals
+ * that would be generated if we would follow the suggestion in the rfc.
+ *
+ * Note that the sending of unsolicited reports may stop at any point
+ * if we see an igmp query from a router or a neighbor's igmp report.
  */

+static int unsolicited_delay[IGMP_Unsolicited_Report_Count + 1] = {
+   IGMP_Initial_Report_Delay + IGMP_Mininum_Delay, /* "Immediate" */
+   HZ / 25,/* 40ms  */
+   HZ / 5, /* 200ms */
+   HZ,
+   5 * HZ,
+   10 * HZ,
+   60 * HZ
+};
+
 #define IGMP_V1_SEEN(in_dev) \
(IPV4_DEVCONF_ALL(dev_net(in_dev->dev), FORCE_IGMP_VERSION) == 1 || \
 IN_DEV_CONF_GET((in_dev), FORCE_IGMP_VERSION) == 1 || \
@@ -199,6 +222,13 @@ static void igmp_start_timer(struct ip_m
atomic_inc(&im->refcnt);
 }

+static void igmp_start_initial_timer(struct ip_mc_list *im, int interval)
+{
+   int delay = unsolicited_delay[interval];
+
+   igmp_start_timer(im, delay, delay + IGMP_Unsolicited_Fuzz);
+}
+
 static void igmp_gq_start_timer(struct in_device *in_dev)
 {
in_dev->mr_gq_running = 1;
@@ -748,8 +778,8 @@ static void igmp_timer_expire(unsigned l

if (im->unsolicit_count) {
im->unsolicit_count--;
-   igmp_start_timer(im, IGMP_Unsolicited_Report_Min_Delay,
-   IGMP_Unsolicited_Report_Interval);
+   igmp_start_initial_timer(im,
+   IGMP_Unsolicited_Report_Count - im->unsolicit_count);
}
im->reporter = 1;
spin_unlock(&im->lock);
@@ -1185,7 +1215,7 @@ static void igmp_group_added(struct ip_m
return;
if (IGMP_V1_SEEN(in_dev) || IGMP_V2_SEEN(in_dev)) {
spin_lock_bh(&im->lock);
-   igmp_start_timer(im, IGMP_Mininum_Delay, 
IGMP_Initial_Report_Delay);
+   igmp_start_initial_timer(im, 0);
spin_unlock_bh(&im->lock);
   

Re: igmp: Staggered igmp report intervals for unsolicited igmp reports

2010-09-22 Thread Christoph Lameter
On Wed, 22 Sep 2010, David Stevens wrote:

> I feel your pain, but the protocol allows this to be 0 and all
> of the unsolicited reports can be lost. I don't think adding a minimum
> latency solves a general problem. Perhaps the device should queue some

The protocol does not specify the intervals during unsolicited IGMP
sends. It only specifies the intervals as a result of an IGMP query.

> packets if it isn't ready quickly? A querier is what makes these
> reliable, but for the start-up in particular, I think it'd be better
> to not initiate the send on devices that have this problem until the
> device is actually ready to send-- why not put the delay in the device
> driver on initialization?

The device is ready. It's just the multicast group that has not been
established yet.

> > an igmp query but the result of a join request in the code.
>
> These are also staggered to prevent a storm by mass reboots, e.g.,
> from a power outage, and the default groups are joined on interface
> bring-up.

There is still some staggering left (see IGMP_Unsolicited_Fuzz). I can
increase that if necessary.

There also cannot be any storm since any unsolicited igmp report by any
system will stop the unsolicited igmp reports by any other system.

> > So unsolicited reports are send for an interval of at least a minute
> > (reports are aborted if igmp reports or other info is seen).
>
> This is outside the protocol spec, and the intervals are neither
> random nor scaled based on any network performance metric.

Where does it say that in the spec? Again this is an *unsolicited* igmp
report.

> 2) I think this would better be solved in the driver-- don't do the
> upper initialization and group joins until the sends can actually
> succeed.

The driver is fine. It's just the multicast path in the network that takes
time to establish.

> 3) I don't think it's a good idea to make up intervals, and especially
> non-randomized ones. The probability of getting all minimum
> intervals
> is very low (which goes back to #1) and sending fixed intervals
> may
> introduce a problem (packet storms) that isn't there per RFC.
> These
> fixed intervals can also be either way too long or way too short,
> depending on link characteristics they don't account for. Leaving
> the intervals randomized based on querier-supplied data seems much
> more appropriate to me.

These are *unsolicited* igmp reports. There is *no* querier supplied data
yet. The first querier supplied data (or any other unsolicited igmp
report) will immediately stop the unsolicited reports and then will
continue to respond in randomized intervals based on the data that the
querier has supplied.




Re: igmp: Staggered igmp report intervals for unsolicited igmp reports

2010-09-22 Thread Christoph Lameter
On Wed, 22 Sep 2010, David Stevens wrote:

> > The protocol does not specify the intervals during unsolicited IGMP
> > sends. It only specifies the intervals as a result of an IGMP query.
>
> RFC 3376:
> "  To cover the possibility of the State-Change Report being missed by
>one or more multicast routers, it is retransmitted [Robustness
>Variable] - 1 more times, at intervals chosen at random from the
>range (0, [Unsolicited Report Interval])."
> and
> "8.11. Unsolicited Report Interval
>
>The Unsolicited Report Interval is the time between repetitions of a
>host's initial report of membership in a group.  Default: 1 second."


Hmmm, looks like I was looking at the earlier RFC 2236, section 3 (I was not
really interested in IGMPv3; IGMPv2 is what is run here).

   When a host joins a multicast group, it should immediately transmit
   an unsolicited Version 2 Membership Report for that group, in case it
   is the first member of that group on the network.  To cover the
   possibility of the initial Membership Report being lost or damaged,
   it is recommended that it be repeated once or twice after short
   delays [Unsolicited Report Interval].  (A simple way to accomplish
   this is to send the initial Version 2 Membership Report and then act
   as if a Group-Specific Query was received for that group, and set a
   timer appropriately).

The new Unsolicited Report Interval is promising. Do we need to support that?

> > The device is ready. It's just the multicast group that has not been
> > established yet.
> Well, if you know that's going to happen with your device, then
> again, why not queue them on start up until you have indication that
> the group has been established, or delay in the driver.
> You're changing IGMP for all device types to fix a problem in
> only one.
>
> > There also cannot be any storm since any unsolicited igmp report by any
> > system will stop the unsolicited igmp reports by any other system.
>
> Not if they are simultaneous, which is exactly when it is a
> problem. :-)

But then they are not simultaneous, since there is a fuzz factor.

> > These are *unsolicited* igmp reports. There is *no* querier supplied
> data
> > yet. The first querier supplied data (or any other unsolicited igmp
> > report) will immediately stop the unsolicited reports and then will
> > continue to respond in randomized intervals based on the data that the
> > querier has supplied.
>
> There are initial values, which are currently constants, but it'd
> be (more) reasonable to turn those into per-interface tunables or
> per-interface initial values with IB interfaces using larger values.
>
> IGMP_Unsolicited_Report_Count (default 2)
> IGMP_Unsolicited_Report_Interval (default 10 secs, which is 10x larger, as
> you want, than the RFC suggests).

Ahhh... Interesting. 1 second now? That is much better and would avoid
long drawn out joins due to the long delays.




Re: igmp: Staggered igmp report intervals for unsolicited igmp reports

2010-09-22 Thread Christoph Lameter
On Wed, 22 Sep 2010, Bob Arendt wrote:

> multicast traffic is received. While IGMPv2 defines an "Unsolicited Report
> Interval" default of 10 seconds, it appears that this is a significant enough
> issue that the later IGMPv3 document calls out a default of 1 second, and
> goes on to define a "Robustness Variable" and talks about the same case that
> Christoph is trying to mitigate.

Actually that suggests a different way to reach the same goal:


Subject: igmp: Make unsolicited report interval conform to RFC3376

RFC3376 specifies a shorter time interval for sending IGMP joins.
This can address issues where joins are slow because the initial join is
frequently lost.

Also increase the report count so that 10 reports are sent over a
few seconds.

Signed-off-by: Christoph Lameter 


---
 net/ipv4/igmp.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6/net/ipv4/igmp.c
===
--- linux-2.6.orig/net/ipv4/igmp.c  2010-09-22 16:28:17.0 -0500
+++ linux-2.6/net/ipv4/igmp.c   2010-09-22 16:28:54.0 -0500
@@ -114,9 +114,9 @@

 #define IGMP_V1_Router_Present_Timeout (400*HZ)
 #define IGMP_V2_Router_Present_Timeout (400*HZ)
-#define IGMP_Unsolicited_Report_Interval   (10*HZ)
+#define IGMP_Unsolicited_Report_Interval   (HZ)
 #define IGMP_Query_Response_Interval   (10*HZ)
-#define IGMP_Unsolicited_Report_Count  2
+#define IGMP_Unsolicited_Report_Count  10


 #define IGMP_Initial_Report_Delay  (1)


Re: igmp: Staggered igmp report intervals for unsolicited igmp reports

2010-09-23 Thread Christoph Lameter
On Wed, 22 Sep 2010, Jason Gunthorpe wrote:

> > The device is ready. It's just the multicast group that has not been
> > established yet.
>
> In IB when the SA replies to a group join the group should be ready,
> prior to that the device can't send into the group because it has no
> MLID for the group.. If you have a MLID then the group is working.

When the SA replies it has created the MLID but not reconfigured the
fabric yet. So the initial IGMP messages get lost.

> Is the issue you are dropping IGMP packets because the 224.0.0.2 join
> hasn't finished? Ideally you'd wait for the SA to reply before sending
> a IGMP, but a simpler solution might just be to use the broadcast MLID
> for packets addressed to a MGID that has not yet got a MLID. This
> would be similar to the ethernet behaviour of flooding.

IGMP reports are sent on the multicast group not on 224.0.0.2. 224.0.0.2
is only used when leaving a multicast group.

I also thought about solutions along the same lines. We could modify the
IB layer to send to 224.0.0.2 until the SA has confirmed the
creation of the MC group. For that to work we would first need to modify
the SA logic to ensure that it only sends the confirmation *after* the fabric
has been reconfigured. Then we need to switch the MLIDs of the MC group
when the notification is received.

If the IB layer has not joined 224.0.0.2 yet (and it will take a while)
then we could even fall back to broadcast until it is ready.


Re: igmp: Staggered igmp report intervals for unsolicited igmp reports

2010-09-23 Thread Christoph Lameter
On Wed, 22 Sep 2010, David Stevens wrote:

> >
> > Also increment the frequency so that we get a 10 reports send over a
> > few seconds.
>
> Except you want to conform and not conform at the same time. :-)
> IGMPv2 should be: default count 2, interval 10secs
> IGMPv3 should be: default count 2, interval 1sec

This is during the period of unsolicited igmp reports. We do not know if
this group is managed using V3 or V2 since no igmp query/report has been
received yet.

> ...and no way is it a good idea to send 10 unsolicited reports on an
> Ethernet.

Why would that be an issue?

The IGMPv2 RFC has no strict limit and RFC3376
mentions that the retransmission occurs "Robustness Variable" times
minus one. Choosing 10 for the "Robustness Variable" is certainly ok.

If we do not increase the number of reports but just limit the interval,
then the chance that an outage of a second or so during MC group creation
causes routers to miss the IGMP reports is significantly increased.



Re: igmp: Staggered igmp report intervals for unsolicited igmp reports

2010-09-23 Thread Christoph Lameter
On Thu, 23 Sep 2010, Jason Gunthorpe wrote:

> On Thu, Sep 23, 2010 at 10:32:17AM -0500, Christoph Lameter wrote:
>
> > > Is the issue you are dropping IGMP packets because the 224.0.0.2 join
> > > hasn't finished? Ideally you'd wait for the SA to reply before sending
> > > a IGMP, but a simpler solution might just be to use the broadcast MLID
> > > for packets addressed to a MGID that has not yet got a MLID. This
> > > would be similar to the ethernet behaviour of flooding.
> >
> > IGMP reports are sent on the multicast group not on 224.0.0.2. 224.0.0.2
> > is only used when leaving a multicast group.
>
> Hm, that is quite different than in IGMPv3.. How does this work at all
> in IB? A message to the multicast group isn't going to make it to any
> routers unless the routers use some other means to join the IB MGID.

IPoIB creates an InfiniBand multicast group via the IB calls for an IP
multicast group. Then IGMP comes into play and the kernel sends the
IP-based IGMP report. This IGMP report must be received by an outside
router (on an IP network) in order for traffic to get forwarded into the
IB fabric. You can end up with an IB multicast configuration that is all
fine but with loss of the unsolicited packets because the fabric
reconfiguration is not complete yet. The larger the fabric, the worse the
situation.

If all unsolicited IGMP reports are lost then the router will only start
forwarding the MC group after the reporting interval (which could be in
the range of minutes), when it triggers IGMP reports through a general
IGMP query. Until that time the MC group looks dead, and people and
software may conclude that the network is broken.

This is a general issue for any network where configuration for MC
forwarding is needed and where the initial IGMP reports may get lost.
Staggering the time intervals would be a general solution to that issue.





Re: igmp: Staggered igmp report intervals for unsolicited igmp reports

2010-09-23 Thread Christoph Lameter
On Thu, 23 Sep 2010, Jason Gunthorpe wrote:

> > IPoIB creates an InfiniBand multicast group via the IB calls for an IP
> > multicast group. Then IGMP comes into play and the kernel sends the
> > IP-based IGMP report. This IGMP report must be received by an outside
> > router (on an IP network) in order for traffic to get forwarded into the
> > IB fabric. You can end up with an IB multicast configuration that is all
> > fine but with loss of the unsolicited packets because the fabric
> > reconfiguration is not complete yet. The larger the fabric, the worse the
> > situation.
>
> But my point is that IB has very limited multicast, if I create a IB
> group and then send IGMP into that group *it will not reach a router*.

The IPoIB routers automatically join all IP MC groups created.

> The only way this kind of scheme could work is if an IGMPv2 IPoIB
> router listens for IB MGID Create notices from the SA and
> automatically joins all groups that are created, so it can get IGMPv2
> membership reports. Which obviously adds more delay, lag, and risk.

Right that is how it works now.

> I'm *guessing* that the change in IGMPv3 to send reports to 224.0.0.22
> (all IGMPv3 multicast address) is related to this sort of problem, and
> it seems like on IB IGMPv2 is not a good fit and should not be used if
> v3 is available..

Existing routers do not support IGMPv3.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: igmp: Allow mininum interval specification for igmp timers.

2010-09-27 Thread Christoph Lameter
On Mon, 27 Sep 2010, David Stevens wrote:

> > With bad luck this thing times out way too fast because the total of
> > all of the randomized intervals can end up being very small, and I
> > think we should fix that independently of the other issues hit by the
> > IB folks.
>
> I think I'm caught up on the discussion now. For IGMPv3, we
> would send all the reports always in < 2 secs, and the average would
> be < 1 sec, so I'm not sure any sort of tweaks we do to enforce a
> minimum randomized interval are compatible with IGMPv3 and still
> solve IB's problem.

Ok thanks for the effort but so far I do not see you having caught up. I'd
rather avoid responding to the misleading statements you made in other
replies and just respond to where you missed the boat here.

> As I said before, I think per protocol, back-to-back is both
> allowed and not a problem, even if both subsequent randomized reports
> come out to 0 time. But if we wanted to enforce a minimum interval
> of, say, X, then I think the better way to do that is to set the
> timer to X + rand(Interval-X) and not a table of fixed intervals

The second patch sets the intervals to X .. X + Rand (interval) and not to
a table of fixed intervals as you state here. I have pointed this out
before.

> as in the original patch. For v2, X=1 or 2 sec and Interval=10
> might work well, but for v3, the entire interval is 1 sec and I
> think I saw that the set-up time for the fabric may be on the
> order of 1 sec.

Again there is no knowledge about V2 or V3 without a query and this is
during the period when no querier is known yet. You stated elsewhere that
I can assume V3 by default? So 1 sec?

> I also don't think that we want those kinds of delays on
> Ethernet. A program may join and send lots of traffic in 1 sec,
> and if the immediate join is lost, one of the quickly-following
> <1 sec duplicate reports will make it recover and work. Delaying
> the minimum would guarantee it wouldn't work until that minimum
> and drop all that traffic if the immediate report is lost, then.

There can be any number of reasons that a short outage could prevent the
packets from going through. A buffer overrun (that you mentioned
elsewhere) usually causes lots of packets to be lost. Buffer overrun
scenarios usually mean that all igmp queries are lost.

> Really, of course, I think the solution belongs in IB, but
> if we did anything in IGMP, I'd prefer it were a per-interface
> tunable that defaults as in the RFC. Since you can change the
> interval and # of reports through a querier now, exporting the
> default values of (10,2) for v2 and (1,2) for v3 to instead be
> per-interface tunables and then bumped as needed for IB would
> allow tweaking without running a querier. But a querier that's
> using default values would also override that and cause the
> problem all over again. Queuing in the driver until the MAC
> address is usable solves it generally.

There is no solution at the IB layer since there is no notification when
the fabric reconfiguration necessary for a multicast group is complete.

The querier is of no use since (for the gazillionth time) this is an
unsolicited IGMP report. If there is a querier then the unsolicited igmp
reports would not be used; the timeout indicated by the querier would
be used instead.


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re:

2010-09-27 Thread Christoph Lameter
On Mon, 27 Sep 2010, David Stevens wrote:

> Because a querier can set the robustness value and
> query interval to anything you want. In the original report,
> he's not running a querier. The fact that it's a new group
> doesn't matter -- these are per-interface.

The per interface settings are used to force an IGMP version, overriding
any information from the queriers. You would not want to enable that because
it disables support for other IGMP versions. Without the override,
different versions of IGMP can be handled per MC group.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re:

2010-09-28 Thread Christoph Lameter
On Mon, 27 Sep 2010, David Stevens wrote:

> No. I'm not talking about the force_igmp_tunable here, I'm talking
> about the per-interface robustness and interval settings which come from
> the querier (whatever version you are using).

The igmp subsystem currently does not keep state on the interface layer
about robustness etc. An interval setting is only kept for IGMPv3 and
used only for general query timeouts with IGMPv3. The interval is a
different one from the one used for the host membership reports.

Looking at the spec I get the impression that these variables seem to be
mainly of interest for router-to-router communication?



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: igmp: Allow mininum interval specification for igmp timers.

2010-09-28 Thread Christoph Lameter
On Mon, 27 Sep 2010, David Stevens wrote:

> > The second patch sets the intervals to X .. X + Rand (interval) and not
> to
> > a table of fixed intervals as you state here. I have pointed this out
> > before.
>
> Sorry if I've misunderstood something you're proposing, but what
> you describe above would be certainly technically incorrect. There are
> really no circumstances for sending a report greater than 
> that is protocol-compliant. You can enforce a minimum greater than 0,
> which is a departure from both RFCs, though IGMPv2 uses wishy-washy
> language. The intent for both was to explicitly allow 0, IMO.

There is no igmp interval set by any igmp query yet so this is your usual
unresponsive crappy response to something else that we are not talking
about.

I thought you were talking about the "fixed intervals" that you saw in the
patch. These initial intervals are for unsolicited igmp reports (do I need
to add that statement to every sentence in a thread where we *only*
discuss unsolicited igmp issues?) and those "intervals" are randomized
and not fixed.

> > > as in the original patch. For v2, X=1 or 2 sec and Interval=10
> > > might work well, but for v3, the entire interval is 1 sec and I
> > > think I saw that the set-up time for the fabric may be on the
> > > order of 1 sec.
> >
> > Again there is no knowledge about V2 or V3 without a query and this is
> > during the period when no querier is known yet. You stated elsewhere
> that
> > I can assume V3 by default? So 1 sec?
>
> Yes, without a querier or the tunable to force it to IGMPv2,
> the default is IGMPv3. It appears there is a bug where IGMPv3 is also
> using a 10sec interval (haven't verified that), but a 1 sec interval

You do not have the linux source code tree available?

from net/ipv4/igmp.c:

#define IGMP_Unsolicited_Report_Interval (10*HZ)

> as required makes your situation worse, not better. It makes it even
> more likely that all the initial reports will occur before your set-up
> is done.

Right. So can you please give me an approach that considers all these
issues, does not invent problems that do not exist, stays within the
subject discussed, and follows the RFCs?

> > There can be any number of reasons that a short outage could prevent the
> > packets from going through. A buffer overrun (that you mentioned
> > elsewhere) usually causes lots of packets to be lost. Buffer overrun
> > scenarios usually mean that all igmp queries are lost.
>
> You're arguing against protocol compliance. I didn't define
> the protocol, I only implemented it. And your view is through the
> IB lens, but I don't believe this is an actual problem in any way
> for typical networks. If you wrote a standards-track RFC that modifies
> IGMP for NBMA networks that require a delay or different parameters
> there, I'd have no objection to implementing that. Unilaterally
> changing linux's behavior on all network types without cause for
> departing from RFC on the most common types is another matter.

The RFCs state that the igmp reports have to be repeated at least 3
times. The first patch ensures that a minimum time passes between two igmp
reports (to avoid them getting lost in one go). The second patch doubles
the number of igmp reports and increases the intervals so that we still
have a chance to process the join before the next igmp query is sent by
the router (which can be minutes away).

It fixes the buggy behavior that we see, where multicast joins take very long
or fail outright.

> > There is no solution on the IB layer since there is no notification when
> > the fabric reconfiguration necessary for an multicast group is complete.
>
> Certainly that's not true; without notification, you can queue for
> first use of a new hardware multicast address and send the queue after an
> appropriate delay (1 sec? If that covers your set-up time). If you had
> positive acknowledgement from the IB network, you'd know exactly when to
> do it, but there's no need to change anything for non-IB networks here.

So you want an arbitrary delay introduced for all new multicast traffic?
I'd rather have a series of staggered attempts so that we can
avoid this setup time.

Also the setup time varies greatly depending on the complexity of the
fabric changes. It can be extremely fast if the multicast group is already
in use by others in the fabric. Adding a delay penalizes everyone
unnecessarily.

Also much of this seems to be contingent on IGMPv3. We are using IGMPv2.

> > The querier is of not use since (for the gazillionth of times) this is
> an
> > unsolicited IGMP report. If there is a querier then the unsolicited igmp
> > reports would not be used but the timeout indicated by the querier would
> > be used.
>
> A querier affects unsolicited reports because it sets both the
> query interval and the robustness value. If you want to send 10 reports,
> you can cause that by having a querier that sets it to that many. The
> in

Re: [PATCH] Make multicast and path record queue flexible.

2010-10-05 Thread Christoph Lameter
On Tue, 5 Oct 2010, Jason Gunthorpe wrote:

> On Tue, Oct 05, 2010 at 06:07:37PM +0200, Aleksey Senin wrote:
> > When using slow SM allow more packets to be buffered before answer
> > comming back. This patch based on idea of Christoph Lameter.
> >
> > http://lists.openfabrics.org/pipermail/general/2009-June/059853.html
>
> IMHO, I think it is better to send multicasts to the broadcast MLID than to
> queue them.. More like ethernet that way.

I agree. We had similar ideas. However, the kernel does send igmp
reports to the MC address, not to 224.0.0.2. We would have to redirect at
the IB layer until multicast via the MLID becomes functional. We cannot tell
when that will be the case.


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Make multicast and path record queue flexible.

2010-10-05 Thread Christoph Lameter
On Tue, 5 Oct 2010, Aleksey Senin wrote:

> When using slow SM allow more packets to be buffered before answer
> comming back. This patch based on idea of Christoph Lameter.

I agree, I think we need to have those things configurable.

Reviewed-by: Christoph Lameter 

How do you handle the situation of the SM responding before the fabric has
been reconfigured? I do not see any delay on join. So packets will be dropped
if the fabric was not reconfigured fast enough? Or does the SM somehow
delay the response?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Make multicast and path record queue flexible.

2010-10-05 Thread Christoph Lameter
On Tue, 5 Oct 2010, Jason Gunthorpe wrote:

> > I agree. We had similar ideas. However, the kernel does send igmp
> > reports to the MC address, not to 224.0.0.2. We would have to redirect at
> > the IB layer until multicast via MLID becomes functional. We cannot tell
> > when that will be the case.
>
> Sure, but Aleksey's patch is aimed at the case when the SM has not yet
> replied, not for your problem with IGMPv2. If their is no MLID then
> sending to the broadcast MLID is a better choice than hanging onto the
> packets. I wonder if you could even send unicasts to the broadcast?

The problem that the SM has not yet replied is no different between the
IGMP versions. If you get a confirmation but the MC group is not
functional then packets go nowhere.

> I still think the problem you have with IGMPv2 is best solved by
> leaning on the gateway vendors to support IGMPv3 - which *does* send
> all reports to 224.0.0.22

s/22/2

Certainly a solution for the igmp messages themselves but not for
initial traffic or traffic sent via a sendonly join.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Make multicast and path record queue flexible.

2010-10-05 Thread Christoph Lameter
On Tue, 5 Oct 2010, Jason Gunthorpe wrote:

> > > I still think the problem you have with IGMPv2 is best solved by
> > > leaning on the gateway vendors to support IGMPv3 - which *does* send
> > > all reports to 224.0.0.22
> >
> > s/22/2
>
> No, 22. RFC 3376:
>
> 4.2.14. IP Destination Addresses for Reports
>
>Version 3 Reports are sent with an IP destination address of
>224.0.0.22, to which all IGMPv3-capable multicast routers listen.  A
>system that is operating in version 1 or version 2 compatibility
>modes sends version 1 or version 2 Reports to the multicast group
>specified in the Group Address field of the Report.  In addition, a
>system MUST accept and process any version 1 or version 2 Report
>whose IP Destination Address field contains *any* of the addresses
>(unicast or multicast) assigned to the interface on which the Report
>arrives.

Argh. Another MC group. And the ib layer does need to do IB level joins
for those. So the initial messages will be lost for the first join(s)?

> > Certainly a solution for the igmp messages themselves but not for
> > initial traffic or traffic send via sendonly join.
>
> Using .22 will generally solve the problems with sychronizing the
> IPoIB gateway to the state of the IGMPv3 clients. Yes, there will
> still be some unknown lag in building the IB side of the network and
> for the router(s) to get ready to handle the group - but at least it
> is no longer dependent on any timeouts.

How do you propose to handle the IB level join to 224.0.0.22 to avoid
packet loss there? IGMP messages will still get lost because of that.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Make multicast and path record queue flexible.

2010-10-05 Thread Christoph Lameter
On Tue, 5 Oct 2010, Jason Gunthorpe wrote:

> > How do you propose to handle the IB level join to 224.0.0.22 to avoid
> > packet loss there? IGMP messages will still get lost because of that.
>
> First, the routers all join the group at startup and stay joined
> forever. This avoids the race in the route joining a new MGID after
> the client creates it, but before the IGMPv2 report is sent. I expect
> this is a major source of delay and uncertainty

I think the current routers join 224.0.0.2 already. Adding another MC
group should come with IGMPv3 support.

> Second, since all clients join this group as send-only it becomes
> possible for the SM to do reasonable things - for instance the MLID
> can be pre-provisioned as send-only from any end-port and thus after
> the SM replies with a MLID the MLID is guaranteed good for send-only
> use immediately.

The problem is that the client join on 224.0.0.22 will be delayed due to
fabric reconfig. The group is joined on demand. It is not automatically
joined.

> Third, once the client etners IGMPv3 mode and joins the group (maybe
> at system boot?) it stays joined forever.

IGMP does not explicitly join 224.0.0.X groups. It looks like messages to
224.0.0.X will not be sent unless there is no other responder on the
subnet. So the initial messages for the first join getting lost
may still be a problem.

> Finally, by sending multicast packets to the broadcast during the time
> the MLID is unknown we can pretty much guarantee that the first IGMPv3
> packet that is sent to .22 will reach all routers in a timely fashion.
> (Hence my objection to Aleksey's approach)

Right. So the multicast traffic will flow to the broadcast address until
the SM sends the response. The multicast traffic will then get lost until
the fabric reconfig is complete.

> Basically, this completely solves the IGMP client to IPoIB router
> communication problem. Yes, there will still be an unknown time until
> the IB network, router, and whatever is beyond the router is ready to
> actually process packets on a new group - BUT that is normal for IP
> multicast! The main point is that without lost IGMP packets things can
> proceed without relying on timeouts.

Sure, this sounds like a much better approach (we have thought through
such approaches here repeatedly) but I do not know of any IB gateway that
supports IGMPv3.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Make multicast and path record queue flexible.

2010-10-05 Thread Christoph Lameter
On Tue, 5 Oct 2010, Jason Gunthorpe wrote:

> > The problem is that the client join on 224.0.0.22 will be delayed due to
> > fabric reconfig. The group is joined on demand. It is not automatically
> > joined.
>
> I was trying to explain that it is possible for the SM to provide a
> MLID that is fully functional for .22 - there is no behind the scenes
> network reconfiguring delay. This is doable with IGMPv3 because the
> client join is send-only and all the listeners have been joined for a
> long time.

Ahh.. Good idea.

> There is virtually no cost with preconfiguring switches for send only
> traffic.

True.

> > Sure this sounds to be a much better approach (we have thought through
> > such approaches here repeatedly) but I do not know of any IB gateway that
> > supports IGMPv3.
>
> Lean on the vendors :( Seems crazy to not implement v3 when v2 is so
> unworkable on IB.

Oh we will. Do Obsidian routers support IGMPv3 for IB?



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Make multicast and path record queue flexible.

2010-10-06 Thread Christoph Lameter
On Wed, 6 Oct 2010, Alekseys Senin wrote:

> > How do you handle the situation of the SM responding before the fabric has
> > been reconfigured? I do not see any delay on join. So they will be dropped
> > if the fabric was not reconfigured fast enough? Or does the SM somehow
> > delay the response?
>
> I think this issue, that should solve the problem of sending or delaying
> packets after obtaining MLID should be solved in another patch. Proposed
> code improve today situation, when you can't change it at all.

I agree with that.

> Relating to broadcast, I don't think that this is a good solution it
> will bring unwarranted load, specially in the case if no MLID received.
> The way of adding delay before we start to send packages, seems to me
> better.

Broadcast is a temporary solution and it only occurs within an IB
partition. It would be a problem if one host suddenly starts sending
at full line speed. Maybe have that configurable?

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What is the best way to time RDMA over IB code segments?

2010-10-20 Thread Christoph Lameter
On Tue, 19 Oct 2010, i...@celticblues.com wrote:

> What is the best way to time RDMA over IB code segments?  Are there timers
> included in the RDMA libs?

The kernel has a clock? And some cpus have fancy timer registers.
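
The RDMA libraries themselves do not provide timers, so a plain
CLOCK_MONOTONIC timestamp around the code segment is the usual approach
(rdtsc-style CPU counters work too if you need finer resolution). A minimal
sketch; rdma_work() is a placeholder for whatever verbs calls are being
measured, and older glibc may need -lrt:

#include <stdio.h>
#include <time.h>

static void rdma_work(void)
{
    /* post work requests, poll the CQ, ... */
}

int main(void)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    rdma_work();
    clock_gettime(CLOCK_MONOTONIC, &end);

    double usec = (end.tv_sec - start.tv_sec) * 1e6 +
                  (end.tv_nsec - start.tv_nsec) / 1e3;
    printf("segment took %.3f usec\n", usec);
    return 0;
}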

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ewg] IPoIB to Ethernet routing performance

2010-12-09 Thread Christoph Lameter
On Mon, 6 Dec 2010, sebastien dugue wrote:

> > The Mellanox BridgeX looks a better hardware solution with 12x 10Ge
> > ports but when I tested this they could only provide vNIC
> > functionality and would not commit to adding IPoIB gateway on their
> > roadmap.
>
>   Right, we did some evaluation on it and this was really a show stopper.

Did the same thing here and came to the same conclusions.

> > Qlogic also offer the 12400 Gateway.  This has 6x 10ge ports.
> > However, like the Mellanox, I understand they only provide host vNIC
> > support.

Really? I was hoping that they would have something worth looking at.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ewg] IPoIB to Ethernet routing performance

2011-01-03 Thread Christoph Lameter
On Sat, 25 Dec 2010, Ali Ayoub wrote:

> On Thu, Dec 9, 2010 at 3:46 PM, Christoph Lameter  wrote:
> > On Mon, 6 Dec 2010, sebastien dugue wrote:
> >
> >> > The Mellanox BridgeX looks a better hardware solution with 12x 10Ge
> >> > ports but when I tested this they could only provide vNIC
> >> > functionality and would not commit to adding IPoIB gateway on their
> >> > roadmap.
> >>
> >>   Right, we did some evaluation on it and this was really a show stopper.
> >
> > Did the same thing here came to the same conclusions.
>
> May I ask why do you need IPoIB when you have EoIB (vNic driver)?
> Why it's a show stopper?

EoIB is immature for some use cases like finance. No multicast support,
f.e.: all multicast becomes broadcast. There is extensive support for
multicast on IPoIB, and the various gotchas and hiccups that were there
initially have mostly been worked out.



Re: [PATCH V2] IB/ipoib: Leave stale send-only multicast groups

2011-01-18 Thread Christoph Lameter
On Mon, 17 Jan 2011, Moni Shoua wrote:

> Unlike with send/receive multicast groups, there is no indication for IPoIB
> that a send-only multicast group is useless. Therefore, even a single packet
> to a multicast destination leaves a multicast entry on the fabric until the
> host interface is down. This causes an MGID leakage in the SM.

There is such an indication. The igmp subsystem removes the multicast
group from the grouplist for the interface when the process terminates.

The ipoib layer will then release the MGID when the multicast groups are
reprocessed in ipoib_mcast_restart_task(). The sendonly detect logic puts
it onto the remove_list() and then mcast_mcast_leave() is called at the
end to dispose of the MC group.

That is at least what our tests show. MC sendonly groups vanish when a
task terminates. Lets leave it that way.

Did you leave the task that caused the sendonly join running until
shutdown?

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2] IB/ipoib: Leave stale send-only multicast groups

2011-01-26 Thread Christoph Lameter
And here is my answer that also was not cced.


> Please take a look at this. It demonstrates what I claim.

Trouble is that iperf is a performance measurement tool and I am not sure
what is going on behind the scenes in regard to MC subscriptions etc. Also
I do not see when you terminate which task.

I can demonstrate using mcast that the groups go away (see below). AFAICT
the behavior that you suggest would be troublesome for some of the apps we
run here.


> linux:~ # cat /proc/net/dev_mcast |grep ib0
>
> 29   ib0 1 0 0012401b03070707   <--- there is a reference remaining;
that is why the group sticks around.

Here is the output of /sys/kernel/debug/ipoib/ib0.8030_mcg a few minutes
after an mcast test completed:

GID: ff12:401b:8030:0:0:0:0:1
  created: 4370767176
  queuelen: 0
  complete:   yes
  send_only:   no

GID: ff12:401b:8030:0:0:0::
  created: 4370767176
  queuelen: 0
  complete:   yes
  send_only:   no


Running mcast -b ib0.8030 on another host
and running ./mcast -n 1 -b ib0.8030 on this host yields:

Receiver: Listening to control channel 239.0.192.1
Receiver: Subscribing to 0 MC addresses 239.0.192-254.2-254 offset 0
origin 10.2.30.180
Sender: Sending 10 msgs/ch/sec on 1 channels. Probe interval=0.001-1 sec.

While the program is running we do:

clameter@rd-gateway-deb64:/sys/kernel/debug/ipoib$ cat ib0.8030_mcg
GID: ff12:401b:8030:0:0:0:0:1
  created: 4370767176
  queuelen: 0
  complete:   yes
  send_only:   no

GID: ff12:401b:8030:0:0:0:f00:c001
  created: 4371010379
  queuelen: 0
  complete:   yes
  send_only:   no

GID: ff12:401b:8030:0:0:0:f00:c002
  created: 4371010589
  queuelen: 0
  complete:   yes
  send_only:  yes

GID: ff12:401b:8030:0:0:0::
  created: 4370767176
  queuelen: 0
  complete:   yes
  send_only:   no


Terminating mcast yields:

clameter@rd-gateway-deb64:/sys/kernel/debug/ipoib$ cat ib0.8030_mcg
GID: ff12:401b:8030:0:0:0:0:1
  created: 4370767176
  queuelen: 0
  complete:   yes
  send_only:   no

GID: ff12:401b:8030:0:0:0:0:2
  created: 4371020715
  queuelen: 0
  complete:   yes
  send_only:  yes

GID: ff12:401b:8030:0:0:0:f00:c002
  created: 4371010589
  queuelen: 0
  complete:   yes
  send_only:  yes

GID: ff12:401b:8030:0:0:0::
  created: 4370767176
  queuelen: 0
  complete:   yes
  send_only:   no

Wait a few minutes and then c002 will also vanish and you will have the
state above.



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2] IB/ipoib: Leave stale send-only multicast groups

2011-01-26 Thread Christoph Lameter

Well, it turns out that my tests just prove that the patch works as
intended. After all, we run Voltaire OFED which has had a similar patch to
address the issue for a long time.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: compile ofed 1.5.3 against lustre kernel

2011-06-15 Thread Christoph Lameter
On Tue, 14 Jun 2011, Steve Wise wrote:

> The ofa kernel configure/build scripts are not detecting that your kernel
> source is RHEL5.x.  If you just use the ofa_kernel tree to build/install the
> OFED modules, then you can specify explicitly which backport patches to use
> when you configure it.  Or if you correctly name your kernel tree and set the
> version to be something like 2.6.18-238.el5.lustre18 it should correctly
> detect it.

I surely wish this constant nightmare would go away. Could we please have
OFED trees against each kernel version in use somewhere so that we can
just do a git pull to get these into the respective trees?

This business of OFED divining the kernel version and then applying upgrade
/ downgrade patches makes a lot of things very hard.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net/for-next V1 1/1] IB/ipoib: break linkage to neighbouring system

2012-07-19 Thread Christoph Lameter
On Thu, 19 Jul 2012, Or Gerlitz wrote:

> Aged ipoib_neigh instances are deleted by a garbage collection task that
> runs every 30 seconds and deletes every ipoib_neigh instance that was idle
> for at least 60 seconds. The deletion is safe since the ipoib_neigh
> instances are protected using RCU and reference count mechanisms.

Could we have the idle time configurable please? For many use cases we
want a much longer retention of the neighbors (actually we typically use 4
hrs).

Also I wish we would not run useless code every 30 seconds (it creates noise
events, especially if it is per cpu). Could we only run those events as
necessary and group the expiration of neighbors to reduce the events?

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net/for-next V1 1/1] IB/ipoib: break linkage to neighbouring system

2012-07-19 Thread Christoph Lameter
On Thu, 19 Jul 2012, Shlomo Pongartz wrote:

> The garbage collection and stale times follow the default ipv4/6
> neigh.default.gc_yyy
> sysctl values, for example
>
> net.ipv4.neigh.default.gc_interval = 30
> net.ipv4.neigh.default.gc_stale_time = 60
>
> If given access to these values from IPoIB, we will be happy
> to integrate them into that logic

It looks like the values are hardcoded right now.

> Please clarify what do you mean by group expiration.

If you have neighbor expiration periods of 4 hrs and it is necessary to
run the expiration logic then please also expire all the neighbor entries
that are due within a certain period after that, to avoid running the
expiration again in the next minute or so. I guess the fuzz factor needs to
scale depending on the expiration period.
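
A small sketch of that grouping idea (all names and numbers are
hypothetical): entries whose deadline falls within a fuzz window scaled to
the retention period are expired in the same pass instead of triggering
another garbage collection run a minute later:

#include <stdio.h>

struct neigh {
    const char *name;
    long expires;              /* deadline in seconds */
};

int main(void)
{
    struct neigh tbl[] = {
        { "n1", 14400 }, { "n2", 14430 }, { "n3", 14500 }, { "n4", 16000 },
    };
    long now = 14400;          /* current time in seconds */
    long period = 4 * 3600;    /* 4 hr retention */
    long fuzz = period / 100;  /* fuzz scales with the period */

    for (size_t i = 0; i < sizeof(tbl) / sizeof(tbl[0]); i++) {
        if (tbl[i].expires <= now + fuzz)
            printf("expiring %s in the same GC pass\n", tbl[i].name);
        else
            printf("keeping %s\n", tbl[i].name);
    }
    return 0;
}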


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net/for-next V1 1/1] IB/ipoib: break linkage to neighbouring system

2012-07-24 Thread Christoph Lameter
On Thu, 19 Jul 2012, Or Gerlitz wrote:

> > If you have neighbor expiration periods of 4 hrs and it is necessary to
> > run the expiration logic then please expire all the neighbor entries due a
> > certain period after that as well to avoid running the expiration again in
> > the next minute or so.
>
>
> This is still a bit unclear here... do you mean to say that at a certain point
> in time,
> **all** entries need to be deleted irrelevant of their (jiffies) age? why?

No. Just the ones in a certain time frame.
>
> > I guess the fuzz factor needs to scale depending on the expiration period.
> >
> >
>
> and this is what happens now, the factor is 0.5, an entry would be deleted
> if (60s <= unused < 90s) holds

Ok that sounds good and that is what I meant.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net/for-next V1 1/1] IB/ipoib: break linkage to neighbouring system

2012-07-24 Thread Christoph Lameter
On Sun, 22 Jul 2012, Or Gerlitz wrote:

> On 7/19/2012 8:08 PM, David Miller wrote:
> > These numbers come from the IPV6 Neighbour Discovery RFCs.  IPV4 replicates
> > the Neighbour Unreachability Detection schemes of IPV6 in pretty much it's
> > entirety, and therefore takes on the same timeout et al. parameters.
>
> OK, got it. At this point, I guess we should enhance the patch to use the
> values plugged into the IPv4 arp table at the time IPoIB is loaded, with
> arp_tbl being exported its easy to achieve. This would allow users to use
> non-default values by the ipoib neigh handling logic. In a later step, we need
> to see if/how to allow ipoib to use the already existing sysctl entries, makes
> sense?

Sounds about right.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] librdmacm-1.0.16

2012-08-13 Thread Christoph Lameter
On Mon, 13 Aug 2012, Joseph Glanville wrote:

> Hi Sean,
>
> On 14 July 2012 03:57, Hefty, Sean  wrote:
> > librdmacm release 1.0.16 is now available from
> >
> > www.openfabrics.org/downloads/rdmacm
> >
> > This release contains several bug fixes from 1.0.15, plus introduces the 
> > rsocket API and protocol.
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
> Are there any plans to include SOCK_DGRAM support?

I am a bit confused here. Is this librdmacm, the RDMA connection
manager? If so, then it already supports datagrams and multicast.

> I could see that being potentionally interesting along with mapping
> broadcast/multicast to IB physical layer multicast.

Such as implemented in IPoIB? Any ideas that would be better than that?

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Trust model for raw QPs

2012-08-15 Thread Christoph Lameter
On Wed, 15 Aug 2012, Or Gerlitz wrote:

> Currently, for an app to open a raw QP from user space, we (verbs) require
> admin permission, for which we (Mellanox) got customer feedback saying this is
> problematic on some of the environments.

Well yes it is, but the kernel mod to get rid of this problem is a
one-liner.

> Suppose we allow to user to provide source mac+vlan when creating the QP or
> when modifying its state, and the HW can enforce that -- in that case I think
> its OK to remove that restriction e.g ala what is allowed today with user
> space UD QPs when the fabric is IB.

Well yes, that would mean that the source mac and vlan are configured with
admin permissions and the app would then run within the constraints
established in privileged mode.


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Trust model for raw QPs

2012-08-15 Thread Christoph Lameter
On Wed, 15 Aug 2012, Jason Gunthorpe wrote:

> Can you fix this by elevating the process with SELinux?

Can SELinux be used to compromise security? How?

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Trust model for raw QPs

2012-08-15 Thread Christoph Lameter
On Wed, 15 Aug 2012, Or Gerlitz wrote:

> Jason Gunthorpe  wrote:
>
> > Can you fix this by elevating the process with SELinux?
>
> Chirstoph, do you think this would valid option from users standpoint?

Sure. If SELinux can be used to compromise system security (in a
controlled fashion) then I think we finally found a reason to use the
stuff. Could someone explain how this would work? Hopefully this is easily
usable and controllable?

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: IPoIB performance

2012-09-05 Thread Christoph Lameter
On Wed, 29 Aug 2012, Atchley, Scott wrote:

> I am benchmarking a sockets based application and I want a sanity check
> on IPoIB performance expectations when using connected mode (65520 MTU).
> I am using the tuning tips in Documentation/infiniband/ipoib.txt. The
> machines have Mellanox QDR cards (see below for the verbose ibv_devinfo
> output). I am using a 2.6.36 kernel. The hosts have single socket Intel
> E5520 (4 core with hyper-threading on) at 2.27 GHz.
>
> I am using netperf's TCP_STREAM and binding cores. The best I have seen
> is ~13 Gbps. Is this the best I can expect from these cards?

Sounds about right. This is not a hardware limitation but
a limitation of the socket I/O layer / PCI-E bus. The cards generally can
process more data than the PCI bus and the OS can handle.

PCI-E 2.0 should give you up to about 2.3 GBytes/sec with these
NICs. So there is likely something that the network layer does to you that
limits the bandwidth.

> What should I expect as a max for ipoib with FDR cards?

More of the same. You may want to

A) Increase the block size handled by the socket layer.

B) Increase the bandwidth by using PCI-E 3 or more PCI-E lanes.

C) Bypass the socket layer. Look at Sean's rsockets layer f.e. (see the
sketch below).
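
A minimal client sketch of option C using rsockets from librdmacm, which
keeps the sockets-style calls while bypassing the kernel socket path. The
peer address and port are placeholders and error handling is reduced to the
minimum; link with -lrdmacm:

#include <stdio.h>
#include <netdb.h>
#include <sys/socket.h>
#include <rdma/rsocket.h>

int main(void)
{
    struct addrinfo *res;

    /* placeholder peer address/port */
    if (getaddrinfo("192.168.1.10", "7471", NULL, &res))
        return 1;

    int fd = rsocket(res->ai_family, SOCK_STREAM, 0);
    if (fd < 0 || rconnect(fd, res->ai_addr, res->ai_addrlen)) {
        perror("rsockets");
        return 1;
    }

    const char msg[] = "hello over rsockets";
    rsend(fd, msg, sizeof(msg), 0);   /* same semantics as send() */

    rclose(fd);
    freeaddrinfo(res);
    return 0;
}
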
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: IPoIB performance

2012-09-05 Thread Christoph Lameter
On Wed, 5 Sep 2012, Atchley, Scott wrote:

> # ethtool -k ib0
> Offload parameters for ib0:
> rx-checksumming: off
> tx-checksumming: off
> scatter-gather: off
> tcp segmentation offload: off
> udp fragmentation offload: off
> generic segmentation offload: on
> generic-receive-offload: off
>
> There is no checksum support which I would expect to lower performance.
> Since checksums need to be calculated in the host, I would expect faster
> processors to help performance some.

OK, that is a major problem. Both are on by default here. What NIC is this?

> > A) increase the block size handled by the socket layer
>
> Do you mean altering sysctl with something like:

Nope, increase the MTU. Connected mode supports up to a 64k MTU size I believe.
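
For reference, a rough sketch of raising the IPoIB MTU programmatically; the
interface name "ib0" and the 65520 value are the commonly used ones but
depend on the setup, the same change is normally done with ip/ifconfig, and
it needs CAP_NET_ADMIN:

#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct ifreq ifr;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "ib0", IFNAMSIZ - 1);  /* placeholder interface */
    ifr.ifr_mtu = 65520;                         /* connected-mode maximum */

    if (ioctl(fd, SIOCSIFMTU, &ifr))
        perror("SIOCSIFMTU");
    else
        printf("ib0 MTU set to %d\n", ifr.ifr_mtu);

    close(fd);
    return 0;
}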

> or something increasing the SO_SNFBUF and SO_RCVBUF sizes or something else?

That does nothing for performance. The problem is that the handling of the
data by the kernel causes too much latency so that you cannot reach the
full bw of the hardware.

> We actually want to test the socket stack and not bypass it.

AFAICT the network stack is useful up to 1Gbps and
after that more and more band-aid comes into play.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: IPoIB performance

2012-09-05 Thread Christoph Lameter
On Wed, 5 Sep 2012, Atchley, Scott wrote:

> > AFAICT the network stack is useful up to 1Gbps and
> > after that more and more band-aid comes into play.
>
> Hmm, many 10G Ethernet NICs can reach line rate. I have not yet tested any 
> 40G Ethernet NICs, but I hope that they will get close to line rate. If not, 
> what is the point? ;-)

Oh yes they can, under restricted circumstances: large packets, multiple
cores, etc. With the band-aids.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: IPoIB performance

2012-09-05 Thread Christoph Lameter
On Wed, 5 Sep 2012, Atchley, Scott wrote:

> These are Mellanox QDR HCAs (board id is MT_0D90110009). The full output of 
> ibv_devinfo is in my original post.

Hmmm... You are running an old kernel. What version of OFED do you use?


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: IPoIB performance

2012-09-05 Thread Christoph Lameter
On Wed, 5 Sep 2012, Atchley, Scott wrote:

> With Myricom 10G NICs, for example, you just need one core and it can do
> line rate with 1500 byte MTU. Do you count the stateless offloads as
> band-aids? Or something else?

The stateless offloads also have certain limitations. It's a grey zone
whether you want to call them band-aids. It gets there at some point because
stateless offload can only get you so far. The need to send larger sized
packets through the kernel increases the latency and forces the app to do
larger batching. It is not very useful if you need to send small packets to
a variety of receivers.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: IPoIB performance

2012-09-05 Thread Christoph Lameter
On Wed, 5 Sep 2012, Atchley, Scott wrote:

> > Hmmm... You are running an old kernel. What version of OFED do you
> > use?
>
> Hah, if you think my kernel is old, you should see my userland
> (RHEL5.5). ;-)

My condolences.

> Does the version of OFED impact the kernel modules? I am using the
> modules that came with the kernel. I don't believe that libibverbs or
> librdmacm are used by the kernel's socket stack. That said, I am using
> source builds with tags libibverbs-1.1.6 and v1.0.16 (librdmacm).

OFED includes kernel modules which provide the drivers that you need.
Installing a new OFED release on RH5 is possible and would give you up to
date drivers. Check with RH: they may have them somewhere easy to install
for your version of RH.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Updated Debian packages?

2012-11-15 Thread Christoph Lameter
Where would I find updated Debian packages for the OFED library? Something
that works with kernel 3.5 please? There are old packages in Debian. We
need newer stuff for Ubuntu/Debian.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Updated Debian packages?

2012-11-16 Thread Christoph Lameter
On Thu, 15 Nov 2012, Narayan Desai wrote:

> I've packaged up a number of recent versions of libraries for ubuntu in a PPA:
> https://launchpad.net/~narayan-desai/+archive/infiniband
>
> There are both source packages and binaries for ubuntu precise. If
> there are missing packages that you want, let me know and I'll add
> them in. Rebuilding these for debian is pretty straightforward from
> the source packages.

I know but I wish these packages would be kept up to date. Are you doing
that now?

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Updated Debian packages?

2012-11-27 Thread Christoph Lameter
Roland, we need the raw eth qp patches in the git trees for the libraries.

Could you please bring the trees up to date so that the userspace raw eth
support is in sync with the kernel?

I guess others will build .debs out of this?

Do you plan to update the .deb packages at some point?

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Updated Debian packages?

2012-12-10 Thread Christoph Lameter
So far I do not see that any patches have gone in since March for mlx4 and
the most recent patches for libibverbs are from 2011.

Could we get the trees updated?

On Wed, 28 Nov 2012, Or Gerlitz wrote:

> On 28/11/2012 11:46, Roland Dreier wrote:
> > On Tue, Nov 27, 2012 at 12:35 PM, Christoph Lameter  wrote:
> > > Roland, we need the raw eth qp patches in the git trees for the libraries.
> > >
> > > Could you please bring the trees up to date so that the userspace raw eth
> > > support is in sync with the kernel?
> > Do you have branches I could pull from?
>
> Its in your patchwork...  here are the posts
>
> libmlx4
> http://marc.info/?l=linux-rdma&m=134817307614014&w=2
> http://marc.info/?l=linux-rdma&m=134838353315573&w=2
>
> libibverbs
> http://marc.info/?l=linux-rdma&m=134817304514001&w=2
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v4 2/7] libibverbs: Introduce XRC domains

2013-03-18 Thread Christoph Lameter
On Mon, 18 Mar 2013, Hefty, Sean wrote:

> > I don't see cover-letter for V4 of libibverbs and V3 of libmlx4 patches
> > you posted over the weekend, would be nice if you
> > can reply to one of your posting per series and send quick listing of
> > the changes (if any) from previous versions.
>
> There are no changes from v3.

Well the cover should still be included to give someone just reviewing the
patchset v4 the chance to understand what is going on.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Infiniband use of get_user_pages()

2013-04-24 Thread Christoph Lameter
On Wed, 24 Apr 2013, Jan Kara wrote:

>   Hello,
>
>   when checking users of get_user_pages() (I'm doing some cleanups in that
> area to fix filesystem's issues with mmap_sem locking) I've noticed that
> infiniband drivers add number of pages obtained from get_user_pages() to
> mm->pinned_vm counter. Although this makes some sence, it doesn't match
> with any other user of get_user_pages() (e.g. direct IO) so has infiniband
> some special reason why it does so?

get_user_pages is typically used to temporarily increase the refcount. The
InfiniBand layer needs to permanently pin the pages for memory
registration.
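
The pinning is triggered by memory registration; a minimal verbs sketch of
the userspace side (assumes at least one RDMA device is present; link with
-libverbs). The driver takes long-lived page references for the buffer and
only drops them when the MR is deregistered:

#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct ibv_device **list = ibv_get_device_list(NULL);
    if (!list || !list[0])
        return 1;                    /* no RDMA device present */

    struct ibv_context *ctx = ibv_open_device(list[0]);
    struct ibv_pd *pd = ctx ? ibv_alloc_pd(ctx) : NULL;
    if (!pd)
        return 1;

    size_t len = 1 << 20;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (mr) {
        printf("registered %zu bytes; the pages stay pinned until "
               "ibv_dereg_mr()\n", len);
        ibv_dereg_mr(mr);
    }

    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(list);
    return 0;
}
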
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next 6/9] IB/core: Add receive Flow Steering support

2013-04-29 Thread Christoph Lameter
On Mon, 29 Apr 2013, Steve Wise wrote:

> Hey Or,  This looks good at first glance.  I must confess I cannot tell yet if
> this will provide everything we need for chelsio's RAW packet requirements.
> But I think we should move forward on this, and enhance as needed.

Well, we are using raw QPs here too and would like to use receive
flow steering. Could we please get this merged?

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] mm: Fix RLIMIT_MEMLOCK

2013-05-24 Thread Christoph Lameter
On Fri, 24 May 2013, Peter Zijlstra wrote:

> Patch bc3e53f682 ("mm: distinguish between mlocked and pinned pages")
> broke RLIMIT_MEMLOCK.

Nope, the patch fixed a problem with double accounting.

The problem that we seem to have is to define what mlocked and pinned mean
and how this relates to RLIMIT_MEMLOCK.

mlocked pages are pages that are movable (not pinned!!!) and that are
marked in some way by user space actions as mlocked (POSIX semantics).
They are marked with a special page flag (PG_mlocked).

Pinned pages are pages that have an elevated refcount because the hardware
needs to use these pages for I/O. The elevated refcount may be temporary
(then we don't care about this) or for a longer time (such as the memory
registration of the IB subsystem). That is when we account the memory as
pinned. The elevated refcount stops page migration and other things from
trying to move that memory.

Pages can be both pinned and mlocked. Before my patch those two
issues were conflated since the same counter was used and therefore such
pages were counted twice. If an RDMA application was running using
mlockall() and was performing large scale I/O then the counters could show
extraordinarily large numbers and the VM would start to behave erratically.

It is important for the VM to know which pages cannot be evicted but that
involves many more pages due to dirty pages etc etc.

So far the assumption has been that RLIMIT_MEMLOCK is a limit on the pages
that userspace has mlocked.
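
To illustrate the mlock() side of that assumption, a small userspace example
that reads RLIMIT_MEMLOCK and locks a few pages; the pinned side has no user
space equivalent since it comes from kernel actions such as IB memory
registration:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void)
{
    struct rlimit rl;
    getrlimit(RLIMIT_MEMLOCK, &rl);
    printf("RLIMIT_MEMLOCK: soft=%llu hard=%llu\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);

    size_t len = 4 * sysconf(_SC_PAGESIZE);
    void *buf = malloc(len);
    memset(buf, 0, len);

    if (mlock(buf, len))             /* counted against RLIMIT_MEMLOCK */
        perror("mlock");
    else
        printf("locked %zu bytes; pages are resident but still movable\n",
               len);

    munlock(buf, len);
    free(buf);
    return 0;
}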

You want the counter to mean something different it seems. What is it?

I think we need to be first clear on what we want to accomplish and what
these counters actually should count before changing things.

Certainly would appreciate improvements in this area but resurrecting the
conflation between mlocked and pinned pages is not the way to go.

> This patch proposes to properly fix the problem by introducing
> VM_PINNED. This also provides the groundwork for a possible mpin()
> syscall or MADV_PIN -- although these are not included.

Maybe add a new PIN page flag? Pages are not pinned per vma as the patch
seems to assume.

> It recognises that pinned page semantics are a strict super-set of
> locked page semantics -- a pinned page will not generate major faults
> (and thus satisfies mlock() requirements).

Not exactly true. Pinned pages may not have the mlocked flag set and they
are not managed on the unevictable LRU lists of the MM.

> If people find this approach unworkable, I request we revert the above
> mentioned patch to at least restore RLIMIT_MEMLOCK to a usable state
> again.

Cannot do that. This will cause the breakage that the patch was fixing to
resurface.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] mm: Fix RLIMIT_MEMLOCK

2013-05-28 Thread Christoph Lameter
On Sat, 25 May 2013, KOSAKI Motohiro wrote:

> If pinned and mlocked are totally difference intentionally, why IB uses
> RLIMIT_MEMLOCK. Why don't IB uses IB specific limit and why only IB raise up
> number of pinned pages and other gup users don't.
> I can't guess IB folk's intent.

True, another limit would be better. The reason that IB raises the
pinned page count is that IB permanently pins those pages. Other users of
gup only do that temporarily.

If there are other users that pin pages permanently they should also account
for it.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] mm: Fix RLIMIT_MEMLOCK

2013-05-28 Thread Christoph Lameter
On Mon, 27 May 2013, Peter Zijlstra wrote:

> Before your patch pinned was included in locked and thus RLIMIT_MEMLOCK
> had a single resource counter. After your patch RLIMIT_MEMLOCK is
> applied separately to both -- more or less.

Before the patch the count was doubled since a single page was counted
twice: once because it was mlocked (marked with PG_mlocked) and then again
because it was also pinned (the refcount was increased). Two different things.

We have agreed for a long time that mlocked pages are movable. That is not
true for pinned pages and therefore pinned pages do not fall
into that category (Hugh? AFAICR you came up with that rule?)

> NO, mlocked pages are pages that do not leave core memory; IOW do not
> cause major faults. Pinning pages is a perfectly spec compliant mlock()
> implementation.

That is not the definition that we have used so far.

> Now in an earlier discussion on the issue 'we' (I can't remember if you
> participated there, I remember Mel and Kosaki-San) agreed that for
> 'normal' (read not whacky real-time people) mlock can still be useful
> and we should introduce a pinned user API for the RT people.

Right. I remember that.

> > Pinned pages are pages that have an elevated refcount because the hardware
> > needs to use these pages for I/O. The elevated refcount may be temporary
> > (then we dont care about this) or for a longer time (such as the memory
> > registration of the IB subsystem). That is when we account the memory as
> > pinned. The elevated refcount stops page migration and other things from
> > trying to move that memory.
>
> Again I _know_ that!!!

But then you refuse to acknowledge the difference and want to conflate
both.

> > Pages can be both pinned and mlocked.
>
> Right, but apart for mlockall() this is a highly unlikely situation to
> actually occur. And if you're using mlockall() you've effectively
> disabled RLIMIT_MEMLOCK and thus nobody cares if the resource counter
> goes funny.

mlockall() would never be used on all processes. You still need
RLIMIT_MEMLOCK to ensure that the box does not lock up.

> > I think we need to be first clear on what we want to accomplish and what
> > these counters actually should count before changing things.
>
> Backward isn't it... _you_ changed it without consideration.

I applied the categorization that we had agreed on before during the
development of page migration. Pinning is not compatible.

> The IB code does a big get_user_pages(), which last time I checked
> pins a sequential range of pages. Therefore the VMA approach.

The IB code (and other code) can require the pinning of pages in various
ways.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] mm: Revert pinned_vm braindamage

2013-06-06 Thread Christoph Lameter
On Thu, 6 Jun 2013, Peter Zijlstra wrote:

> Since RLIMIT_MEMLOCK is very clearly a limit on the amount of pages the
> process can 'lock' into memory it should very much include pinned pages
> as well as mlock()ed pages. Neither can be paged.

So we thought that this is the sum of the pages that a process has
mlocked: initiated explicitly by the process and/or environment, a user
space initiated action.

> Since nobody had anything constructive to say about the VM_PINNED
> approach and the IB code hurts my head too much to make it work I
> propose we revert said patch.

I said that the use of a PIN page flag would allow correct accounting if
one wanted to interpret the limit the way you do.

> Once again the rationale; MLOCK(2) is part of POSIX Realtime Extentsion
> (1003.1b-1993/1003.1i-1995). It states that the specified part of the
> user address space should stay memory resident until either program exit
> or a matching munlock() call.
>
> This definition basically excludes major faults from happening on the
> pages -- a major fault being one where IO needs to happen to obtain the
> page content; the direct implication being that page content must remain
> in memory.

Exactly that is the definition.

> Linux has taken this literal and made mlock()ed pages subject to page
> migration (albeit only for the explicit move_pages() syscall; but it
> would very much like to make them subject to implicit page migration for
> the purpose of compaction etc.).

Page migration is not a page fault? The ability to move a process
completely (including its mlocked segments) is important for the manual
migration of process memory. That is what page migration was made for. If
mlocked pages are treated as pinned pages then the complete process can
no longer be moved from node to node.
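
A small libnuma sketch of that manual migration use case (the pid and node
numbers are placeholders; link with -lnuma). mlocked pages are migratable
and move along with the rest of the address space, while pages pinned by a
driver keep their elevated refcount and stay where they are:

#include <stdio.h>
#include <stdlib.h>
#include <numa.h>

int main(int argc, char **argv)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    int pid = argc > 1 ? atoi(argv[1]) : 0;      /* 0 means the caller */
    struct bitmask *from = numa_parse_nodestring("0");
    struct bitmask *to   = numa_parse_nodestring("1");

    if (numa_migrate_pages(pid, from, to) < 0)
        perror("numa_migrate_pages");
    else
        printf("moved pages of pid %d from node 0 to node 1\n", pid);

    numa_free_nodemask(from);
    numa_free_nodemask(to);
    return 0;
}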

> This view disregards the intention of the spec; since mlock() is part of
> the realtime spec the intention is very much that the user address range
> generate no faults; neither minor nor major -- any delay is
> unacceptable.

Where does it say that no faults are generated? Don't we generate COW
faults on mlocked ranges?

> This leaves the RT people unhappy -- therefore _if_ we continue with
> this Linux specific interpretation of mlock() we must introduce new
> syscalls that implement the intended mlock() semantics.

Intended means Peter's semantics?

> It was found that there are useful purposes for this weaker mlock(), a
> rationale to indeed have two sets of syscalls. The weaker mlock() can be
> used in the context of security -- where we avoid sensitive data being
> written to disk, and in the context of userspace deamons that are part
> of the IO path -- which would otherwise form IO deadlocks.

Migratable mlocked pages enable complete process migration between nodes
of a NUMA system for HPC workloads.

> The proposed second set of primitives would be mpin() and munpin() and
> would implement the intended mlock() semantics.

I agree that we need mpin and munpin. But they should not be called mlock
semantics.

> Such pages would not be migratable in any way (a possible
> implementation would be to 'pin' the pages using an extra refcount on
> the page frame). From the above we can see that any mpin()ed page is
> also an mlock()ed page, since mpin() will disallow any fault, and thus
> will also disallow major faults.

That cannot be so since mlocked pages need to be migratable.

> While we still lack the formal mpin() and munpin() syscalls there are a
> number of sites that have similar 'side effects' and result in user
> controlled 'pinning' of pages. Namely IB and perf.

Right, that's why we need this.

> For the purpose of RLIMIT_MEMLOCK we must use intent only as it is not
> part of the formal spec. The only useful thing is to limit the amount of
> pages a user can exempt from paging. This would therefore include all
> pages either mlock()ed or mpin()ed.

RLIMIT_MEMLOCK is a limit on the pages that a process has mlocked into
memory. Pinning is not initiated by user space but by the kernel, either
temporarily (page count increases are used all over the kernel for this)
or for a longer time frame (IB and perf, and likely more drivers that we
have not found yet).
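
For reference, the user-space side of that limit looks roughly like this
(my sketch, assuming a finite soft limit; with CAP_IPC_LOCK or an
unlimited rlimit the mlock() below would simply succeed):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

int main(void)
{
	struct rlimit rl;
	size_t len;
	void *buf;

	getrlimit(RLIMIT_MEMLOCK, &rl);
	printf("RLIMIT_MEMLOCK soft limit: %llu bytes\n",
	       (unsigned long long)rl.rlim_cur);

	/* try to mlock one MB more than the limit allows */
	len = (size_t)rl.rlim_cur + (1UL << 20);
	buf = malloc(len);
	if (buf && mlock(buf, len))
		printf("mlock(%zu bytes) failed: %s\n", len, strerror(errno));
	free(buf);
	return 0;
}

Kernel-initiated pinning (IB memory registration, perf buffers) never goes
through this interface at all, which is the point of the disagreement.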


> Back to the patch; a resource limit must have a resource counter to
> enact the limit upon. Before the patch this was mm_struct::locked_vm.
> After the patch there is no such thing left.

The limit was not checked correctly before the patch since pinned pages
were accounted as mlocked.

> I state that since mlockall() disables/invalidates RLIMIT_MEMLOCK the
> actual resource counter value is irrelevant, and thus the reported
> problem is a non-problem.

Where does it disable RLIMIT_MEMLOCK?

> However, it would still be possible to observe weirdness in the very
> unlikely event that a user would indeed call mlock() upon an address
> range obtained from IB/perf. In this case he would be unduly constrained
> and find his effective RLIMIT_MEMLOCK limit halved (at worst).

This is

Re: [PATCH] mm: Revert pinned_vm braindamage

2013-06-07 Thread Christoph Lameter
On Fri, 7 Jun 2013, Peter Zijlstra wrote:

> However you twist this; your patch leaves an inconsistent mess. If you
> really think they're two different things then you should have
> introduced a second RLIMIT_MEMPIN to go along with your counter.

Well, continuing to repeat myself: I worked based on the agreed-upon
characteristics of mlocked pages. The patch was there to address
brokenness in the mlock accounting, because someone naively assumed that
pinning = mlock.

> I'll argue against such a thing; for I think that limiting the total
> amount of pages a user can exempt from paging is the far more
> useful/natural thing to measure/limit.

Pinned pages are exempted by the kernel. A device driver or some other
kernel activity (reclaim, page migration, I/O, etc.) increases the page
count. There is currently no consistent accounting for pinned pages. The
pinned_vm counter was introduced to allow the largest pinners to track
what they did.

> > I said that the use of a PIN page flag would allow correct accounting if
> > one wanted to interpret the limit the way you do.
>
> You failed to explain how that would help any. With a pin page flag you
> still need to find the mm to unaccount crap from. Also, all user
> controlled address space ops operate on vmas.

Pinning is kernel controlled...

> > Page migration is not a page fault?
>
> It introduces faults; what happens when a process hits the migration
> pte? It gets a random delay and eventually services a minor fault to the
> new page.

Ok but this is similar to reclaim and other such things that are unmapping
pages.

> At which point the saw will have cut your finger off (going with the
> most popular RT application ever -- that of a bandsaw and a laser beam).

I am pretty confused by your newer notion of RT. RT was about
deterministic behavior at the cost of higher latency, I thought. RT was
basically an abused marketing term referring to the bloating of the
kernel with all sorts of fairness code that slows us down. What happened
to make you work on low-latency stuff? There is still a shift you need to
go through to make that transition. Yes, for low latency you want to
avoid reclaim and all sorts of other things, so you disable auto NUMA,
defrag etc. to avoid them.

> > > This leaves the RT people unhappy -- therefore _if_ we continue with
> > > this Linux specific interpretation of mlock() we must introduce new
> > > syscalls that implement the intended mlock() semantics.
> >
> > Intended means Peter's semantics?
>
> No, I don't actually write RT applications. But I've had plenty of
> arguments with RT people when I explained to them what our mlock()
> actually does vs what they expected it to do.

Ok. Guess this is all new to you at this point. I am happy to see that you
are willing to abandon your evil ways (although under pressure from your
users) and are willing to put the low-latency people in the RT camp now.

> They're not happy. Aside from that; you HPC/HFT minimal latency lot
> should very well appreciate the minimal interference stuff they do
> actually expect.

Sure we do and we know how to do things to work around the "fair
scheduler" and other stuff. But you are breaking the basics of how we do
things with your conflation of pinning and mlocking.

We do not migrate, do not allow defragmentation or reclaim when running
low latency applications. These are non issues.

> This might well be; and I'm not arguing we remove this. I'm merely
> stating that it doesn't make everybody happy. Also what purpose do HPC
> type applications have for mlock()?

HPC wants to keep pages in memory to avoid eviction. HPC apps are not as
sensitive to faults as low-latency apps are; minor faults have
traditionally been tolerated there. The lower you get in terms of the
latencies required, the more difficult OS control becomes.

> Here we must disagree I fear; given that mlock() is of RT origin and RT
> people very much want/expect mlock() to do what our proposed mpin() will
> do.

RT is a dirty word for me given the fairness and bloat issues; not sure
what you mean by it. mlock is a means to keep data in memory, not a
magical wand that avoids all OS handling of the page.
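
For what it's worth, the usual low-latency recipe looks roughly like this
(a sketch of common practice, not something from this thread; the pool
size is made up): lock everything, then prefault the working set so that
neither major nor minor faults land on the critical path.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define POOL_SIZE (256UL << 20)	/* hypothetical 256 MB working set */

int main(void)
{
	char *pool;

	/* lock current and future mappings; RLIMIT_MEMLOCK must allow this */
	if (mlockall(MCL_CURRENT | MCL_FUTURE))
		perror("mlockall");

	pool = malloc(POOL_SIZE);
	if (!pool)
		return 1;
	memset(pool, 0, POOL_SIZE);	/* touch every page up front */

	/* ... latency-critical work runs here without page faults ... */
	return 0;
}
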

> > That cannot be so since mlocked pages need to be migratable.
>
> I'm talking about the proposed mpin() stuff.

Could you write that up in detail? I am not sure how this could work at
this point.

> So I proposed most of the machinery that would be required to actually
> implement the syscalls. Except that the IB code stumped me. In
> particular I cannot easily find the userspace address to unpin for
> ipath/qib release paths.
>
> Once we have that we can trivially implement the syscalls.

Why would you need syscalls? Pinning is driver/kernel subsystem initiated
and therefore the driver can do the pin/unpin calls.

> > Pinning is not initiated by user space but by the kernel. Either
> > temporarily (page count increases are used all over the kernel for this)
> > or for longer time frame (I

[RFC] mm: Distinguish between mlocked and pinned pages

2011-08-10 Thread Christoph Lameter
Some kernel components (Infiniband and perf) pin user space memory by
increasing the page count and account that memory as "mlocked".

The difference between mlocking and pinning is:

A. mlocked pages are marked with PG_mlocked and are exempt from
   swapping. Page migration may move them around though.
   They are kept on a special LRU list.

B. Pinned pages cannot be moved because something needs to
   directly access physical memory. They may not be on any
   LRU list.

I recently saw an mlockall()ed process where mm->locked_vm became
bigger than the virtual size of the process (!) because some
memory was accounted for twice:

Once when the page was mlocked and once when the Infiniband
layer increased the refcount because it needed to pin the RDMA
memory.

This patch introduces a separate counter for pinned pages and
accounts them separately.

Signed-off-by: Christoph Lameter 

---
 drivers/infiniband/core/umem.c                 |    6 +++---
 drivers/infiniband/hw/ipath/ipath_user_pages.c |    6 +++---
 drivers/infiniband/hw/qib/qib_user_pages.c     |    4 ++--
 fs/proc/task_mmu.c                             |    2 ++
 include/linux/mm_types.h                       |    2 +-
 kernel/events/core.c                           |    6 +++---
 6 files changed, 14 insertions(+), 12 deletions(-)

Index: linux-2.6/fs/proc/task_mmu.c
===
--- linux-2.6.orig/fs/proc/task_mmu.c   2011-08-10 14:08:42.0 -0500
+++ linux-2.6/fs/proc/task_mmu.c    2011-08-10 15:01:37.0 -0500
@@ -44,6 +44,7 @@ void task_mem(struct seq_file *m, struct
"VmPeak:\t%8lu kB\n"
"VmSize:\t%8lu kB\n"
"VmLck:\t%8lu kB\n"
+   "VmPin:\t%8lu kB\n"
"VmHWM:\t%8lu kB\n"
"VmRSS:\t%8lu kB\n"
"VmData:\t%8lu kB\n"
@@ -55,6 +56,7 @@ void task_mem(struct seq_file *m, struct
hiwater_vm << (PAGE_SHIFT-10),
(total_vm - mm->reserved_vm) << (PAGE_SHIFT-10),
mm->locked_vm << (PAGE_SHIFT-10),
+   mm->pinned_vm << (PAGE_SHIFT-10),
hiwater_rss << (PAGE_SHIFT-10),
total_rss << (PAGE_SHIFT-10),
data << (PAGE_SHIFT-10),
Index: linux-2.6/include/linux/mm_types.h
===
--- linux-2.6.orig/include/linux/mm_types.h 2011-08-10 14:08:42.0 -0500
+++ linux-2.6/include/linux/mm_types.h  2011-08-10 14:09:02.0 -0500
@@ -281,7 +281,7 @@ struct mm_struct {
unsigned long hiwater_rss;  /* High-watermark of RSS usage */
unsigned long hiwater_vm;   /* High-water virtual memory usage */

-   unsigned long total_vm, locked_vm, shared_vm, exec_vm;
+   unsigned long total_vm, locked_vm, pinned_vm, shared_vm, exec_vm;
unsigned long stack_vm, reserved_vm, def_flags, nr_ptes;
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack;
Index: linux-2.6/drivers/infiniband/core/umem.c
===
--- linux-2.6.orig/drivers/infiniband/core/umem.c   2011-08-10 14:08:57.0 -0500
+++ linux-2.6/drivers/infiniband/core/umem.c    2011-08-10 14:09:06.0 -0500
@@ -136,7 +136,7 @@ struct ib_umem *ib_umem_get(struct ib_uc

down_write(&current->mm->mmap_sem);

-   locked = npages + current->mm->locked_vm;
+   locked = npages + current->mm->pinned_vm;
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
@@ -206,7 +206,7 @@ out:
__ib_umem_release(context->device, umem, 0);
kfree(umem);
} else
-   current->mm->locked_vm = locked;
+   current->mm->pinned_vm = locked;

up_write(&current->mm->mmap_sem);
if (vma_list)
@@ -222,7 +222,7 @@ static void ib_umem_account(struct work_
struct ib_umem *umem = container_of(work, struct ib_umem, work);

down_write(&umem->mm->mmap_sem);
-   umem->mm->locked_vm -= umem->diff;
+   umem->mm->pinned_vm -= umem->diff;
up_write(&umem->mm->mmap_sem);
mmput(umem->mm);
kfree(umem);
Index: linux-2.6/drivers/infiniband/hw/ipath/ipath_user_pages.c
===
--- linux-2.6.orig/drivers/infiniband/hw/ipath/ipath_user_pages.c   2011-08-10 14:08:57.0 -0500
+++ linux-2.6/drivers/infiniband/hw/ipath/ipath_user_pages.c    2011-08-10 14:09:06.0 -0500
@@ -79,7 +79,7 @@ static int __ipath_get_user_pages(unsign
  

Re: [RFC] mm: Distinguish between mlocked and pinned pages

2011-08-18 Thread Christoph Lameter
On Wed, 17 Aug 2011, Andrew Morton wrote:

> Sounds reasonable.  But how do we prevent future confusion?  We should
> carefully define these terms in an obvious place, please.

Ok.

> > --- linux-2.6.orig/include/linux/mm_types.h 2011-08-10 14:08:42.0 
> > -0500
> > +++ linux-2.6/include/linux/mm_types.h  2011-08-10 14:09:02.0 
> > -0500
> > @@ -281,7 +281,7 @@ struct mm_struct {
> > unsigned long hiwater_rss;  /* High-watermark of RSS usage */
> > unsigned long hiwater_vm;   /* High-water virtual memory usage */
> >
> > -   unsigned long total_vm, locked_vm, shared_vm, exec_vm;
> > +   unsigned long total_vm, locked_vm, pinned_vm, shared_vm, exec_vm;
> > unsigned long stack_vm, reserved_vm, def_flags, nr_ptes;
> > unsigned long start_code, end_code, start_data, end_data;
> > unsigned long start_brk, brk, start_stack;
>
> This is an obvious place.  Could I ask that you split all these up into
> one-definition-per-line and we can start in on properly documenting
> each field?

Will do that after the linuxcon.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] mm: Distinguish between mlocked and pinned pages

2011-08-23 Thread Christoph Lameter
On Wed, 17 Aug 2011, Andrew Morton wrote:

> This is an obvious place.  Could I ask that you split all these up into
> one-definition-per-line and we can start in on properly documenting
> each field?

Subject: mm: Add comments to explain mm_struct fields

Add comments to explain the page statistics fields in the
mm_struct.

Signed-off-by: Christoph Lameter 

---
 include/linux/mm_types.h |   11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/mm_types.h
===
--- linux-2.6.orig/include/linux/mm_types.h 2011-08-23 09:43:32.0 -0500
+++ linux-2.6/include/linux/mm_types.h  2011-08-23 09:52:09.0 -0500
@@ -281,8 +281,15 @@ struct mm_struct {
unsigned long hiwater_rss;  /* High-watermark of RSS usage */
unsigned long hiwater_vm;   /* High-water virtual memory usage */

-   unsigned long total_vm, locked_vm, pinned_vm, shared_vm, exec_vm;
-   unsigned long stack_vm, reserved_vm, def_flags, nr_ptes;
+   unsigned long total_vm; /* Total pages mapped */
+   unsigned long locked_vm;/* Pages that have PG_mlocked set */
+   unsigned long pinned_vm;/* Refcount permanently increased */
+   unsigned long shared_vm;/* Shared pages (files) */
+   unsigned long exec_vm;  /* VM_EXEC & ~VM_WRITE */
+   unsigned long stack_vm; /* VM_GROWSUP/DOWN */
+   unsigned long reserved_vm;  /* VM_RESERVED|VM_IO pages */
+   unsigned long def_flags;
+   unsigned long nr_ptes;  /* Page table pages */
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack;
unsigned long arg_start, arg_end, env_start, env_end;
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Building 3.1-rc9 in kernel infiniband support with OFED libraries

2011-10-11 Thread Christoph Lameter
Seems to work mostly but some userspace libraries (mlx4) complain about
kernel version mismatch and missing XRC support.

Has XRC support not been merged? How can I build the OFED libraries
against Linux 3.1? I'd really like to get rid of the OFED kernel tree
nightmare.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Building 3.1-rc9 in kernel infiniband support with OFED libraries

2011-10-11 Thread Christoph Lameter
On Tue, 11 Oct 2011, Jason Gunthorpe wrote:

> On Tue, Oct 11, 2011 at 09:02:41AM -0500, Christoph Lameter wrote:
>
> > Has XRC support not been merged? How can I build the OFED libraries
> > against Linux 3.1? I'd really like to get rid of the OFED kernel tree
> > nightmare.
>
> You have to use upstream libraries with upstream kernels. Be warned
> that the OFED libraries of the same SONAME are not ABI compatible with
> upstream libraries.

That's a pretty bad situation. Could we not at least get the kernel API
standardized?

Publish git trees for ofed that are based on supported upstream and distro
versions?


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Building 3.1-rc9 in kernel infiniband support with OFED libraries

2011-10-13 Thread Christoph Lameter
On Thu, 13 Oct 2011, Roland Dreier wrote:

> > My latest patches are at:
> >
> >        git://git.openfabrics.org/~shefty/rdma-dev.git xrc
>
> So I went through this and merged it to my tree (pretty much only conflicts
> from 3.0->3.1-rc fixed, commit log changes and other minor cleanups).

We got the XRC support via the for-next branch on github. Is that current?

> The result is pushed out to my github  for-next branch, with the
> expectation that I'll ask Linus to pull for 3.2.
>
> However I do have one question: the last patch
> ("RDMA/uverbs: Export ib_open_qp() capability to user space" in
> my tree) adds IB_USER_VERBS_CMD_OPEN_QP but I don't
> see any driver add that to its uverbs_cmd_mask... does this work?
>
> Thanks, Roland

There seems to be some stuff missing in the upstream code compared to the
OFED releases:

1. Raw ethernet support (IB_QPT_RAW_ETH) is missing.

2. MLX4_IB_QP_BLOCK_LOOPBACK is broken it seems? All packets are fed back
via loopback?


Re: Building 3.1-rc9 in kernel infiniband support with OFED libraries

2011-10-13 Thread Christoph Lameter
Oh yeah, and can we get FDR support in for-next as well?


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Building 3.1-rc9 in kernel infiniband support with OFED libraries

2011-10-13 Thread Christoph Lameter
On Thu, 13 Oct 2011, Roland Dreier wrote:

> On Thu, Oct 13, 2011 at 10:17 AM, Christoph Lameter  wrote:
> > There seems to be some stuff missing in the upstream code compared to the
> > OFED releases:
> >
> > 1. Raw ethernet support (IB_QPT_RAW_ETH) is missing.
> >
> > 2. MLX4_IB_QP_BLOCK_LOOPBACK is broken it seems? All packets are fed back
> > via loopback?
>
> Clean reviewed patches for this are where?

They are in OFED-1.5.3.1 so they were already released.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Building 3.1-rc9 in kernel infiniband support with OFED libraries

2011-10-13 Thread Christoph Lameter
On Thu, 13 Oct 2011, Roland Dreier wrote:

> On Thu, Oct 13, 2011 at 10:24 AM, Christoph Lameter  wrote:
> >> Clean reviewed patches for this are where?
> >
> > They are in OFED-1.5.3.1 so they were already released.
>
> OFED is neither clean nor reviewed.  Really.  The stuff in OFED always
> needs a bunch
> of review before it is suitable for upstream.  Ignoring ABI stability
> is just one problem
> that OFED code typically has.

Yeah. ABI stability etc. is really what we want. That's why I would like
to see it upstream rather than continue to work with this OFED tarball
ofa_kernel mess.

So the patches need to be resubmitted for upstream inclusion from
Mellanox to you?


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


rdma: 3.1.0-rc9 breaks UD

2011-10-14 Thread Christoph Lameter
Running ibv_ud_pingpong and ibv_uc_pingpong between two hosts. One with
OFED 1.5.3.1 (Ubuntu LTS 10.04) and another on Linux 3.1.0-rc9 (same
Ubuntu version underlying) with the upstream libraries.

ibv_uc_pingpong

OFED:

# ibv_uc_pingpong
  local address:  LID 0x000a, QPN 0x54004d, PSN 0x00637e, GID ::
  remote address: LID 0x000b, QPN 0x04004c, PSN 0x7fdc0a, GID ::
8192000 bytes in 0.01 seconds = 7526.82 Mbit/sec
1000 iters in 0.01 seconds = 8.71 usec/iter


3.1.0-rc9:

ibv_uc_pingpong 10.2.35.21
  local address:  LID 0x000b, QPN 0x04004c, PSN 0x7fdc0a, GID ::
  remote address: LID 0x000a, QPN 0x54004d, PSN 0x00637e, GID ::
8192000 bytes in 0.01 seconds = 7634.67 Mbit/sec
1000 iters in 0.01 seconds = 8.58 usec/iter


ibv_ud_pingpong

OFED:

# ibv_ud_pingpong
  local address:  LID 0x000a, QPN 0x50004d, PSN 0x8b572b, GID ::
  remote address: LID 0x000b, QPN 0x02004c, PSN 0x2117ef, GID ::

3.1.0-rc9:

# ibv_ud_pingpong 10.2.35.21
  local address:  LID 0x000b, QPN 0x02004c, PSN 0x2117ef, GID ::
  remote address: LID 0x000a, QPN 0x50004d, PSN 0x8b572b, GID ::


No traffic flows.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: rdma: 3.1.0-rc9 breaks UD

2011-10-14 Thread Christoph Lameter
On Fri, 14 Oct 2011, Hefty, Sean wrote:

> > Running ibv_ud_pingpong and ibc_uc_pingpong between two hosts. One with
> > OFED 1.5.3.1 (Ubuntu LTS 10.04) and another on linux 3.1.0-rc9 (Same
> > ubuntu version uderlying) with the upstream libraries.
>
> FWIW, I was able to run 3.1-rc9 in loopback and between 3.0 and 3.1-rc9 
> systems.  I don't have 2 systems configured with 3.1-rc9.  I'm using upstream 
> builds of all libraries.

Our 3.1-rc9 included Roland's for-next branch.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rdma: 3.1.0-rc9 breaks UD

2011-10-18 Thread Christoph Lameter
On Mon, 17 Oct 2011, Or Gerlitz wrote:

> Could you try that again with small msg size e.g use "-s 100", if that
> doesn't help
> can you send the output of "ibv_devinfo" from both nodes. Also please
> specify the exact top commit of the two libraries (libibverbs and I
> assume libmlx4 if you use Mellanox ConnectX) in case you built them
> from the git trees, or the library version if you're using official
> release, thanks,

-s 100 fixes the issue.

The one side has OFED 1.5.3.1 kernel modules and libraries from the same
tarball.

The kernel IB side has:

libibverbs from a git mirror of kernel.org:

commit d0ebae72223298d6dbd7a2e2c812c529bf4d4a1d
Author: Bart Van Assche 
Date:   Sun Aug 7 18:01:48 2011 +

Makefile.am: Fix an automake warning

Fix the following automake warning message:

Makefile.am:1: `INCLUDES' is the old name for `AM_CPPFLAGS' (or 
`*_CPPFLAGS')

A quote from the automake manual:

INCLUDES
This does the same job as AM_CPPFLAGS (or any per-target _CPPFLAGS
variable if it is used). It is an older name for the same functionality.
This variable is deprecated; we suggest using AM_CPPFLAGS and per-target
_CPPFLAGS instead.

Signed-off-by: Bart Van Assche bvanass...@acm.org

libmlx4 from git at the same place:

commit 60dade7a5e97da3b2eedf5839169c044f67577b3
Author: Roland Dreier 
Date:   Thu Aug 11 09:35:20 2011 -0700

Add "foreign" option to AM_INIT_AUTOMAKE

Switch to the modern form of the AM_INIT_AUTOMAKE macro and tell
automake that the libmlx4 package does not follow the GNU standards.
This change makes it possible to use 'autoreconf' for libmlx4.

Signed-off-by: Roland Dreier 


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Linux 3.2-rc1 vs. OFED 1.5.4-rc4: Packets are looped back

2011-11-11 Thread Christoph Lameter
We have an app here that runs fine with OFED. But if I try to use the
kernel IB subsystem in 3.2 it complains about packets being looped back to
the application.

That seems to be controlled by IB_DEVICE_BLOCK_MULTICAST_LOOPBACK and
IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK.

Why does it not work in the kernel IB stack?

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RAW Ethernet support for Kernel IB stack?

2011-11-23 Thread Christoph Lameter
Seems that OFED has this raw ethernet support but the kernel IB stack does
not?

Any work in progress on that one?

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] – Proposal for new process for OFED releases

2011-12-02 Thread Christoph Lameter
On Fri, 2 Dec 2011, Bart Van Assche wrote:

> On Fri, Dec 2, 2011 at 1:04 AM, Hefty, Sean  wrote:
> > > - What should we do with modules like SDP that are not in kernel?
> >
> > Either remove them or carry them forward as experimental features.
>
> What I expect is that reworking the SDP implementation such that it can
> be included upstream will take less work in the long term than
> maintaining it as out-of-tree code.

What were the issues that prevented the merging of the SDP
implementation?

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] – Proposal for new process for OFED releases

2011-12-02 Thread Christoph Lameter
On Thu, 1 Dec 2011, Tziporet Koren wrote:

> We propose a new process for the OFED releases starting from next OFED 
> release:
> - OFED content will be the relevant kernel.org modules and user space 
> released packages
> - OFED will offer only backports to the distros  (no fixes)
> - OFED package will be used for easy installation of all packages in a 
> friendly manner
>
> The main goals of this change:
> 1. Ensure OFED and the upstream kernel are the same
> 2. Provide customers a way to use the new features in latest kernels on 
> existing distros
> 3. OFED qualification will contribute to the stability of the upstream code

Ohh. A Christmas (or whatever your favorite holiday label is) present.

Great, we may now be able to stop torturing our support departments to get
OFED running with this and that.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAW Ethernet support for Kernel IB stack?

2011-12-08 Thread Christoph Lameter
On Wed, 23 Nov 2011, Or Gerlitz wrote:

> Christoph Lameter  wrote:
> > Seems that OFED has this raw ethernet support but the kernel IB stack does 
> > not?
> > Any work in progress on that one?
>
> Yes, I will be reworking the patches for upstream submission,
> expected in the coming weeks.

Are there any prereleases of that available? Or if you just give me a list
of OFED patches then I can try and see how far I can get on my own.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/4] IB/core: add RAW Packet QP type

2012-02-16 Thread Christoph Lameter
On Thu, 16 Feb 2012, Or Gerlitz wrote:

> Also here, Sean provided his reviewed-by signature, people (CCed
> Christoph and others) keep asking me about this patch set and I didn't
> get any feedback from you.

I sure would like to see this merged for 3.4


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/4] IB/core: add RAW Packet QP type

2012-03-12 Thread Christoph Lameter
On Mon, 12 Mar 2012, Or Gerlitz wrote:

> Same here, I submitted the patches ~two months ago and haven't heard from you,
> I would be happy to see this going into 3.4 and if fixes/changes are needed
> please let me know as soon as possible so I will do them.

I can confirm that the patches in OFED to the same effect work very well.
Widely deployed here.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] please pull infiniband.git

2012-03-21 Thread Christoph Lameter
On Wed, 21 Mar 2012, Or Gerlitz wrote:

> So again, any reason not to merge the RAW QP patches for 3.4? they
> have been posted few months ago, its two kernel patches and we have
> Sean's reviewed-by signature for the core patch (see
> http://marc.info/?l=linux-rdma&m=132693668105264&w=2)

Yes please. Can we talk about this next week at the OFA conference?

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] please pull infiniband.git

2012-03-22 Thread Christoph Lameter
On Wed, 21 Mar 2012, Or Gerlitz wrote:

> Christoph, I will not attend, but will love to get any related
> feedback. Looking on the agenda
> (https://www.openfabrics.org/images/docs/PR/schedule.pdf) I see that
> on Wednesday noon there are two sessions dealing with exactly this -
> "User Mode Ethernet Programming" and "Network Adapter Flow Steering"
> by Tzahi Oved from Mellanox, so be there or be square (e.g. myself...)

Ok. I can also mention this in my session Monday morning. ;-)

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next 2/6] IB/core: fix mismatch between locked and pinned pages

2012-05-10 Thread Christoph Lameter
On Thu, 10 May 2012, Or Gerlitz wrote:

> Commit bc3e53f68 "mm: distinguish between mlocked and pinned pages"
> introduced a separate counter for pinned pages and used it over
> the IB stack. Specifically, in ib_umem_get the pinned counter
> is incremented, but ib_umem_release wrongly decrements the
> locked counter, fix that.

Reviewed-by: Christoph Lameter 
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next 2/6] IB/core: fix mismatch between locked and pinned pages

2012-05-11 Thread Christoph Lameter
On Thu, 10 May 2012, Or Gerlitz wrote:

> Commit bc3e53f68 "mm: distinguish between mlocked and pinned pages"
> introduced a separate counter for pinned pages and used it over
> the IB stack. Specifically, in ib_umem_get the pinned counter
> is incremented, but ib_umem_release wrongly decrements the
> locked counter, fix that.

This looks like a stable fix or at least something that should be merged
upstream asap. Patch is missing my reviewed-by tag.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

