Re: [openib-general] [RFC/BUG] DMA vs. CQ race
Roland Dreier [EMAIL PROTECTED] wrote on 02/27/2007 01:40:36 PM:

Shirley, can you clarify why doing dma_alloc_coherent() in the kernel helps on your Cell blade? It really seems that dma_alloc_coherent() just allocates some memory and then does dma_map(DMA_BIDIRECTIONAL), which would be exactly the same as allocating the CQ buffer in userspace and using ib_umem_get() to map it into the kernel. I'm looking at a possibly cleaner solution to the Altix issue, so I would like to make sure it fixes whatever the bug on Cell is as well. So any details you can provide about the problem you see on Cell would help a lot. Thanks...

Thanks, Roland. After reviewing the whole thread, the failure on Cell is different from the Altix issue, so this fix might not help Cell. The problem I have might be related to multiple DMA mappings to the same CQ; the sync might be getting lost somewhere else.

Thanks
Shirley Ma

___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[openib-general] Fw: [PATCH] enable IPoIB only if broadcast join finish
Hello Roland, Sorry to bother you again. Could you please review the patch below to see whether it can go upstream soon? IPoIB nodes can't ping each other if the broadcast join succeeds but any other IB multicast join fails (such as the join for the default IPv6 link-local solicited-node address) when bringing the interface up. This impacts IPoIB usability in large clusters where MCG LIDs are limited.

Thanks
Shirley Ma

----- Forwarded by Shirley Ma/Beaverton/IBM on 02/27/07 06:23 AM -----
From: Shirley Ma/Beaverton/IBM@IBMUS
To: Roland Dreier [EMAIL PROTECTED]
Cc: openib-general@openib.org
Date: 02/05/07 06:50 AM
Subject: [openib-general] [PATCH] enable IPoIB only if broadcast join finish

Hi, Roland, Please review this patch. According to IPoIB RFC 4391 section 5, once the IPoIB broadcast group has been joined, the interface should be ready for data transfer. In the current IPoIB implementation, the interface is UP and RUNNING only after all default multicast joins succeed. We hit a problem where the broadcast join finishes successfully but the all-hosts multicast join fails.
Here is the patch; if possible please give your input ASAP, as we have an urgent customer issue that needs to be resolved:

diff -urpN ipoib/ipoib_multicast.c ipoib-multicast/ipoib_multicast.c
--- ipoib/ipoib_multicast.c	2006-11-29 13:57:37.000000000 -0800
+++ ipoib-multicast/ipoib_multicast.c	2007-02-04 22:34:16.000000000 -0800
@@ -402,6 +402,11 @@ static void ipoib_mcast_join_complete(in
 		queue_work(ipoib_workqueue, &priv->mcast_task);
 		mutex_unlock(&mcast_mutex);
 		complete(&mcast->done);
+		/*
+		 * broadcast join finished, enable carrier
+		 */
+		if (mcast == priv->broadcast)
+			netif_carrier_on(dev);
 		return;
 	}
@@ -599,7 +604,6 @@ void ipoib_mcast_join_task(void *dev_ptr
 	ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n");
 	clear_bit(IPOIB_MCAST_RUN, &priv->flags);
-	netif_carrier_on(dev);
 }
 
 int ipoib_mcast_start_thread(struct net_device *dev)

(See attached file: ipoib-multicast.patch)

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638
Re: [openib-general] IPOIB NAPI
Roland Dreier [EMAIL PROTECTED] wrote on 02/26/2007 02:36:26 PM:

No way, it's way too late at this point to change the kernel-user ABI, let alone change all ULPs. - R.

Hello Roland, So has IBV_CQ_REPORT_MISSED_EVENTS been part of OFED-1.2 already? I can generate the patch for all ULPs to use this for review. Do you need me to do that?

Thanks
Shirley Ma
Re: [openib-general] Fw: [PATCH] enable IPoIB only if broadcast join finish
Roland Dreier [EMAIL PROTECTED] wrote on 02/27/2007 02:35:34 PM:

I don't think this applies any more since Sean's multicast stuff was merged. I didn't realize you wanted to get this merged upstream -- anyway, can you please regenerate the patch against the latest kernel? Thanks

Sure. I will generate a new patch.

Thanks
Shirley Ma
Re: [openib-general] IPOIB NAPI
Roland Dreier [EMAIL PROTECTED] wrote on 02/27/2007 02:41:44 PM:

So has IBV_CQ_REPORT_MISSED_EVENTS been part of OFED-1.2 already? I can generate the patch for all ULPs to use this for review. Do you need me to do that?

No, it's not in OFED 1.2 or the upstream kernel. And no one has implemented it for userspace (and I'm somewhat reluctant to break the ABI at this point without some performance numbers to motivate making this API change). Have the NAPI performance problems with ehca been resolved? We could probably merge IPoIB NAPI for 2.6.22 then, which would pull in the kernel changes at least.

- R.

We have addressed the NAPI performance issues with the ehca driver; I believe the patches are upstream. However, the test results show that it's better to delay polling again until the next NAPI interval, something like this:

poll CQ
notify CQ
if missed_event
	netif_rx_reschedule()
	return 1

vs.

poll CQ
notify CQ
if missed_event
	netif_rx_reschedule()
	poll again
return 0

It seems ehca delivers packets much faster than other HCAs, so polling again would stay in the loop many, many times. The above change doesn't impact other HCAs, so I would recommend it. I have seen the same approach in other Ethernet drivers.

Thanks
Shirley Ma
Re: [openib-general] Fw: [PATCH] enable IPoIB only if broadcast join finish
Hello Roland, Here is the new patch, against the 2.6.20-rc1 kernel. Please review it.

diff -urpN ipoib/ipoib_multicast.c ipoib-link/ipoib_multicast.c
--- ipoib/ipoib_multicast.c	2007-02-27 07:21:50.000000000 -0800
+++ ipoib-link/ipoib_multicast.c	2007-02-27 07:52:10.000000000 -0800
@@ -407,6 +407,11 @@ static int ipoib_mcast_join_complete(int
 		queue_delayed_work(ipoib_workqueue, &priv->mcast_task, 0);
 		mutex_unlock(&mcast_mutex);
+		/*
+		 * broadcast join finished, enable carrier
+		 */
+		if (unlikely(mcast == priv->broadcast))
+			netif_carrier_on(dev);
 		return 0;
 	}
@@ -596,7 +601,6 @@ void ipoib_mcast_join_task(struct work_s
 	ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n");
 	clear_bit(IPOIB_MCAST_RUN, &priv->flags);
-	netif_carrier_on(dev);
 }
 
 int ipoib_mcast_start_thread(struct net_device *dev)

(See attached file: ipoib-link.patch)

Thanks
Shirley Ma
[OFA General] Re: [openib-general] IPOIB NAPI
I'm confused. Which one is faster?

Sorry for the confusion, Michael. The one with return 1 has better throughput.

Thanks
Shirley Ma

___ general mailing list [EMAIL PROTECTED] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] ib0 interface up but can't ping
If your subnet already has an SM running, look at the ifconfig output. If the interface ib0 is UP but not RUNNING, you can't ping, since the carrier is not on. Also look at /var/log/messages to see whether there are any errors.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638
Re: [openib-general] IPOIB NAPI
Roland, Yes. It would be good to reduce the number of interrupts by changing all upper-layer protocols to use:

poll CQ
notify CQ (rotting-packet notification)
poll again

instead of:

notify CQ
poll CQ

If possible, can this be in OFED-1.2?

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638
Re: [openib-general] [RFC/BUG] DMA vs. CQ race
Hmm, OK. Then I will do my best to make sure we get a fix for this into 2.6.22.

That would be great. We hit a similar problem in our cluster test -- data corruption because of this race.

Thanks
Shirley Ma
Re: [openib-general] [RFC/BUG] DMA vs. CQ race
Roland Dreier [EMAIL PROTECTED] wrote on 02/26/2007 02:09:48 PM:

That would be great. We hit a similar problem in our cluster test -- data corruption because of this race.

On what platform? - R.

On our Cell blade + PCI-e Mellanox.

Thanks
Shirley Ma
Re: [openib-general] IPv6oIB neighbour discover broken when MCGs overflow
Roland, Thanks for your quick response. Even if the SM supports 1000 MCGs, it's still not sufficient for a 250-node cluster where each node has 4 links for IPv6, without any scoped/global IPv6 addresses configured (250*4 + a few default MCGs). There will be an MCG overflow problem in IPv6oIB anyway.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638

Roland Dreier [EMAIL PROTECTED] wrote on 02/16/2007 09:00 AM (cc Michael S. Tsirkin [EMAIL PROTECTED], openib-general@openib.org), Subject: Re: IPv6oIB neighbour discover broken when MCGs overflow:

We have a customer issue regarding IPv6oIB. In the subnet, there are a limited number of MCGs supported. So when multiple IPv6 addresses are assigned to one interface, each IPv6 address will have one unique solicited-node address (depending on its group ID). Then in a large subnet, we will have tons of MCGs. If the IPv6 solicited-node addresses exceed the number of MCGs in this subnet, then IPv6 neighbour discovery will be broken; this doesn't happen in Ethernet since send-only doesn't require the sender to join any MCG. I have done an initial patch to address the MCG overflow problem by redirecting the solicited-node address to the all-hosts node address, so that IPv6 neighbour discovery will work no matter how many IPv6 addresses are in this subnet. This patch is only triggered with IPv6 enabled and MCG overflow, so there is almost no performance penalty.

I really don't like this approach, since it can break things in very subtle ways (e.g. suppose one node fails to join its solicited-node group, but then a later node wants to talk to it and succeeds in joining the solicited-node group as a send-only member -- since the first node is not a member, it will never see the ND messages). I much prefer to fix the SM not to impose too-low limits on the number of MCGs. Supporting O(# nodes) MCGs is really not a very onerous requirement on the SM.

- R.
Re: [openib-general] IPv6oIB neighbour discover broken when MCGs overflow
Roland, I really don't like this approach, since it can break things in very subtle ways (e.g. suppose one node fails to join its solicited-node group, but then a later node wants to talk to it and succeeds in joining the solicited-node group as a send-only member -- since the first node is not a member, it will never see the ND messages).

For a successful join, ND sends to the node directly; for a failed join, ND sends to the all-hosts address. So ND will work whether the join succeeds or not; that's what the patch does.

Thanks
Shirley Ma
Re: [openib-general] IPv6oIB neighbour discover broken when MCGs overflow
Roland Dreier [EMAIL PROTECTED] wrote on 02/16/2007 09:49:24 AM:

For a successful join, ND sends to the node directly; for a failed join, ND sends to the all-hosts address. So ND will work whether the join succeeds or not; that's what the patch does.

But what if the full-member join fails on node A for node A's solicited-node group, but then node B succeeds in joining that group as a send-only member (perhaps because some other nodes have dropped off the fabric in the meantime)? Then node B will send the ND message on an MCG that A is not a member of.

- R.

Yes. B can send ND to A, and A responds without being a member, so IPv6 ND works. Is there any security or other problem here?

Thanks
Shirley Ma
Re: [openib-general] IPv6oIB neighbour discover broken when MCGs overflow
Roland Dreier [EMAIL PROTECTED] wrote on 02/16/2007 10:10:55 AM:

But what if the full-member join fails on node A for node A's solicited-node group, but then node B succeeds in joining that group as a send-only member (perhaps because some other nodes have dropped off the fabric in the meantime)? Then node B will send the ND message on an MCG that A is not a member of.

Yes. B can send ND to A, and A responds without being a member, so IPv6 ND works. Is there any security or other problem here?

Node A is not a member of the group B is sending on, so the SM does not have to set up any routes for the messages to even reach node A. So it doesn't see the messages and doesn't respond to ND.

- R.

Two MCGs must be established before the IPoIB link comes up: the broadcast group for IPv4 and the all-hosts multicast group for IPv6. So node A is a member of the all-hosts group; the patch redirects ND to the all-hosts group, so node A responds to it.

Thanks
Shirley Ma
Re: [openib-general] IPv6oIB neighbour discover broken when MCGs overflow
Roland Dreier [EMAIL PROTECTED] wrote on 02/16/2007 09:25:30 AM:

Even if the SM supports 1000 MCGs, it's still not sufficient for a 250-node cluster where each node has 4 links for IPv6 without any scoped/global IPv6 addresses configured (250*4 + a few default MCGs). There will be an MCG overflow problem in IPv6oIB anyway.

But what's the problem with supporting 1000 or even 1 MCGs? - R.

I am not sure whether I understand your question; I am trying to answer it, so please let me know whether I am wrong. Each IPv6 link-local address creates a unique solicited-node multicast address, which creates a unique full-member IB MCG; every other IPv6 address creates a solicited-node multicast address, which may or may not be unique depending on its group ID. So when the IPv6 module is loaded in the kernel (or it might become part of the kernel in the future), the SM will see more than 1000 MCGs when the IPoIB link comes up. Some nodes can't join their MCGs, and then IPv6 ND is broken for the nodes whose joins failed.

Shirley Ma
Re: [openib-general] IPv6oIB neighbour discover broken when MCGs overflow
Roland Dreier [EMAIL PROTECTED] wrote on 02/16/2007 01:55:06 PM:

Two MCGs must be established before the IPoIB link comes up: the broadcast group for IPv4 and the all-hosts multicast group for IPv6. So node A is a member of the all-hosts group; the patch redirects ND to the all-hosts group, so node A responds to it.

I'm still confused. How do you interoperate with other RFC-compliant nodes (they might not have your patch or might not even be running Linux) that send ND messages to the solicited-node group? If node A has your patch and doesn't try to join its own solicited-node group, then another node that doesn't know to send ND messages to the all-nodes group will not be able to find it.

- R.

All nodes in the subnet join the all-hosts multicast group by default. What the patch does differently than before is that on a join failure, ND is sent to the all-hosts multicast group instead of a particular solicited-node multicast address; the node with the destination solicited-node multicast address will respond to it, so the network does not lose connectivity when MCGs overflow. There is no interoperability issue here between patched and unpatched nodes, or between Linux and non-Linux nodes. I don't think the IPoIB RFC covers this corner case, so there is no RFC-compliance problem here. I will discuss this with the author.

Thanks
Shirley Ma
Re: [openib-general] [PATCH] enable IPoIB only if broadcast join finish
Roland, Could you please review this patch when you have time? I am looking forward to your comments; it addresses a customer issue. Appreciate your help.

Thanks
Shirley Ma
Re: [openib-general] [PATCH] enable IPoIB only if broadcast join finish
Thanks Roland, I will apply the patch to the customer's cluster. The problem I found is that when failover brings a new IPoIB interface up in the existing fabric, with a limited number of multicast groups in our configuration, the interface can join the broadcast group successfully but the all-hosts multicast group join fails. The IB interface is then UP but not RUNNING, and the interface doesn't work at all.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638

Roland Dreier [EMAIL PROTECTED] wrote on 02/06/2007 11:59 AM (cc Michael S. Tsirkin [EMAIL PROTECTED], openib-general@openib.org), Subject: Re: [PATCH] enable IPoIB only if broadcast join finish:

Here is the patch, if possible please give your input asap, we have an urgent customer issue need to be resolved:

I guess this is OK, but what is the urgent issue it fixes?

- R.
[openib-general] [PATCH] enable IPoIB only if broadcast join finish
Hi, Roland, Please review this patch. According to IPoIB RFC 4391 section 5, once the IPoIB broadcast group has been joined, the interface should be ready for data transfer. In the current IPoIB implementation, the interface is UP and RUNNING only after all default multicast joins succeed. We hit a problem where the broadcast join finishes successfully but the all-hosts multicast join fails. Here is the patch; if possible please give your input ASAP, as we have an urgent customer issue that needs to be resolved:

diff -urpN ipoib/ipoib_multicast.c ipoib-multicast/ipoib_multicast.c
--- ipoib/ipoib_multicast.c	2006-11-29 13:57:37.000000000 -0800
+++ ipoib-multicast/ipoib_multicast.c	2007-02-04 22:34:16.000000000 -0800
@@ -402,6 +402,11 @@ static void ipoib_mcast_join_complete(in
 		queue_work(ipoib_workqueue, &priv->mcast_task);
 		mutex_unlock(&mcast_mutex);
 		complete(&mcast->done);
+		/*
+		 * broadcast join finished, enable carrier
+		 */
+		if (mcast == priv->broadcast)
+			netif_carrier_on(dev);
 		return;
 	}
@@ -599,7 +604,6 @@ void ipoib_mcast_join_task(void *dev_ptr
 	ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n");
 	clear_bit(IPOIB_MCAST_RUN, &priv->flags);
-	netif_carrier_on(dev);
 }
 
 int ipoib_mcast_start_thread(struct net_device *dev)

(See attached file: ipoib-multicast.patch)

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638
Re: [openib-general] Multicast join group failure prevents IPoIB performing
According to IPoIB RFC 4391 section 5, once the IPoIB broadcast group has been joined, the IPoIB link should be UP, since it's ready for data transfer; the interface should be able to run for broadcast and unicast and does not need to wait for all multicast joins to succeed. Here is the patch to let the IPoIB interface run without waiting for every multicast join (such as the all-hosts group join) to be successful:

diff -urpN ipoib/ipoib_multicast.c ipoib-patch/ipoib_multicast.c
--- ipoib/ipoib_multicast.c	2006-11-29 13:57:37.000000000 -0800
+++ ipoib-patch/ipoib_multicast.c	2007-02-03 00:52:23.000000000 -0800
@@ -566,6 +566,7 @@ void ipoib_mcast_join_task(void *dev_ptr
 	if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
 		ipoib_mcast_join(dev, priv->broadcast, 0);
+		netif_carrier_on(dev);
 		return;
 	}
@@ -599,7 +600,6 @@ void ipoib_mcast_join_task(void *dev_ptr
 	ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n");
 	clear_bit(IPOIB_MCAST_RUN, &priv->flags);
-	netif_carrier_on(dev);
 }
 
 int ipoib_mcast_start_thread(struct net_device *dev)

(See attached file: multicast.patch)

http://www.rfc-editor.org/rfc/rfc4391.txt

5. Setting Up an IPoIB Link

The broadcast-GID, as defined in the previous section, MUST be set up for an IPoIB subnet to be formed. Every IPoIB interface MUST FullMember join the IB multicast group defined by the broadcast-GID. This multicast group will henceforth be referred to as the broadcast group. The join operation returns the MTU, the Q_Key, and other parameters associated with the broadcast group. The node then associates the parameters received as a result of the join operation with its IPoIB interface. The broadcast group also serves to provide a link-layer broadcast service for protocols like ARP, net-directed, subnet-directed, and all-subnets-directed broadcasts in IPv4 over IB networks.
The join operation is successful only if the Subnet Manager (SM) determines that the joining node can support the MTU registered with the broadcast group [RFC4392], ensuring support for a common link MTU. The SM also ensures that all the nodes joining the broadcast-GID have paths to one another and can therefore send and receive unicast packets. It further ensures that all the nodes do indeed form a multicast tree that allows packets sent from any member to be replicated to every other member. Thus, the IPoIB link is formed by the IPoIB nodes joining the broadcast group. There is no physical demarcation of the IPoIB link other than that determined by the broadcast group membership.

Shirley Ma

Shirley Ma/Beaverton/IBM@IBMUS wrote on 02/02/07 08:58 PM, Subject: [openib-general] Multicast join group failure prevents IPoIB performing:

When bringing the IPoIB interface up, I hit a default group multicast join failure. (Could this be fixed in the SM setup?)

ib0: multicast join failed for , status -22

Then the interface was UP but not RUNNING, so the nodes couldn't ping each other. I think the right behavior is for the interface to be UP and RUNNING even with some multicast join failures. I would like to provide a patch if there is no problem. Please advise.

Thanks
Shirley Ma
[openib-general] Multicast join group failure prevents IPoIB performing
When bringing the IPoIB interface up, I hit a default group multicast join failure. (Could this be fixed in the SM setup?)

ib0: multicast join failed for , status -22

Then the interface was UP but not RUNNING, so the nodes couldn't ping each other. I think the right behavior is for the interface to be UP and RUNNING even with some multicast join failures. I would like to provide a patch if there is no problem. Please advise.

Thanks
Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland Dreier [EMAIL PROTECTED] wrote on 11/16/2006 11:26:31 AM:

What I have found in the ehca driver is that n != t doesn't mean the CQ is empty; if you poll again, there are still some packets in the CQ. IB_CQ_REPORT_MISSED_EVENTS reports 1 most of the time. It relies on netif_rx_reschedule() returning 0 to exit the NAPI poll. That might be why it stays in the poll routine for a long time? I will rerun my test using n != 0 to see any difference here.

Maybe there's an ehca bug in poll CQ? If n != t then it should mean that the CQ was indeed drained. I would expect a missed event to be rare, because it means a completion occurs between the last poll CQ and the request notify, and that shouldn't be that common... My rough estimate is that even at a higher throughput than what you're seeing, IPoIB should only generate ~ 500K completions/sec, which means the average delay between completions is 2 microseconds. So I wouldn't expect completions to hit the window between poll and request notify that often.

- R.

I have tried low_latency = 1 to disable TCP prequeue; the throughput increased from 1XXMb/s to 4XXMb/s. If I delayed netif_receive_skb() a little bit, I could get around 1700Mb/s. If I totally disable netif_rx_reschedule(), so there is no repoll and it returns 0, I could get around 2900Mb/s throughput without seeing packet out-of-order issues. I have tried adding a spin lock in ipoib_poll(), and I still see packets out of order.

disable prequeue: 2XXMb/s to 4XXMb/s (packets out of order)
slow down netif_receive_skb: 17XXMb/s (packets out of order)
don't handle missed event: 28XXMb/s (no packets out of order)
handle missed event later: 7XXMb/s to 11XXMb/s (packets out of order)

Maybe the ehca driver delivers packets much faster? Which makes me think the user process's TCP backlog queue and prequeue might get out of order?
Thanks
Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland Dreier [EMAIL PROTECTED] wrote on 11/14/2006 03:18:23 PM:

Shirley: The rotting packet situation consistently happens for the ehca driver. The NAPI poll could loop forever with your original patch. That's the reason I defer the rotting-packet processing to the next NAPI poll.

Hmm, I don't see it. In my latest patch, the poll routine does:

repoll:
	done  = 0;
	empty = 0;

	while (max) {
		t = min(IPOIB_NUM_WC, max);
		n = ib_poll_cq(priv->cq, t, priv->ibwc);

		for (i = 0; i < n; ++i) {
			if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) {
				++done;
				--max;
				ipoib_ib_handle_rx_wc(dev, priv->ibwc + i);
			} else
				ipoib_ib_handle_tx_wc(dev, priv->ibwc + i);
		}

		if (n != t) {
			empty = 1;
			break;
		}
	}

	dev->quota -= done;
	*budget    -= done;

	if (empty) {
		netif_rx_complete(dev);
		if (unlikely(ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP |
					      IB_CQ_REPORT_MISSED_EVENTS)) &&
		    netif_rx_reschedule(dev, 0))
			goto repoll;
		return 0;
	}

	return 1;

so every receive completion will count against the limit set by the variable max. The only way I could see the driver staying in the poll routine for a long time would be if it was only processing send completions, but even that doesn't actually seem bad: the driver is making progress handling completions.

What I have found in the ehca driver is that n != t doesn't mean the CQ is empty; if you poll again, there are still some packets in the CQ. IB_CQ_REPORT_MISSED_EVENTS reports 1 most of the time. It relies on netif_rx_reschedule() returning 0 to exit the NAPI poll. That might be why it stays in the poll routine for a long time? I will rerun my test using n != 0 to see any difference here.

Shirley: It does help the performance from 1XXMb/s to 7XXMb/s, but not as expected 3XXXMb/s.

Is that 3xxx Mb/sec the performance you see without the NAPI patch?

Without the NAPI patch, in my test environment ehca can get around 2800Mb/s to 3000Mb/s throughput.

Shirley: With the deferred rotting-packet processing patch, I can see a packets-out-of-order problem in the TCP layer.
Shirley: Is it possible there is a race somewhere causing two NAPI polls at the same time? mthca seems to use IRQ auto-affinity, but ehca uses round-robin interrupts.

I don't see how two NAPI polls could run at once, and I would expect worse effects from them stepping on each other than just out-of-order packets. However, the fact that ehca does round-robin interrupt handling might lead to out-of-order packets just because different CPUs are all feeding packets into the network stack.

- R.

Normally for NAPI there should be only one poll running at a time, and NAPI processes packets one by one all the way to the TCP layer (netif_receive_skb()). So it shouldn't lead to out-of-order packets even with round-robin interrupt handling. I am still investigating this.

Thanks
Shirley
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
I will rerun my test using n != 0 to see any difference here.

Correction: it should be n == 0 to indicate empty.

Thanks
Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland Dreier [EMAIL PROTECTED] wrote on 11/13/2006 08:45:52 AM:

Sorry, I did not intend to send the previous email; I accidentally sent it out anyway. What I thought was that there would be a problem if missed_event always returns 1: then this NAPI poll would keep going forever.

Well, it's limited by the quota that the net stack gives it, so there's no possibility of looping forever.

How about deferring the rotting-packet processing until later? Like this:

However, that seems like it is still correct.

With this patch, I could get NAPI + non-scaling code throughput performance from 1XXMb/s to 7XXMb/s; anyway, there are some other problems I am still investigating now.

But I wonder why it gives you a factor of 4 in performance?? Why does it make a difference? I would have thought that the rotting-packet situation would be rare enough that it doesn't really matter for performance exactly how we handle it. What are the other problems you're investigating?

- R.

The rotting-packet situation consistently happens for the ehca driver. The NAPI poll could loop forever with your original patch. That's the reason I defer the rotting-packet processing to the next NAPI poll. It does help the performance from 1XXMb/s to 7XXMb/s, but not as expected 3XXXMb/s. With the deferred rotting-packet processing patch, I can see a packets-out-of-order problem in the TCP layer. Is it possible there is a race somewhere causing two NAPI polls at the same time? mthca seems to use IRQ auto-affinity, but ehca uses round-robin interrupts.

Thanks
Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland, I think a barrier might be needed in checking the LINK SCHED state, like smp_mb__before_clear_bit() and smp_mb__after_clear_bit(); otherwise the netif_rx_reschedule() for the rotting packet and the next interrupt's netif_rx_schedule() could be running at the same time. If interrupts are delivered in round-robin fashion, then packets are going to be out of order in the TCP layer. I will test it out once I have the resource. What do you think? Thanks Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland, Ignore my previous email -- test_and_set_bit() is an atomic operation and already implies the memory barrier. Thanks Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
From the code walk-through: if we defer the rotting packet process by returning

    (missed_event && netif_rx_reschedule(dev, 0));

then the same dev->poll can be added to the per-cpu poll list twice: once from netif_rx_reschedule(), and once from the napi poll returning 1. That might explain the packets out of order: one poll finishes and clears the LINK SCHED bit while the next interrupt runs on another cpu. Thanks Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
> I would really like to understand why ehca does worse with NAPI. In my tests both mthca and ipath exhibit various degrees of improvement depending on the test -- but I've never seen performance get worse. This is the main thing holding back merging NAPI. Does the NAPI patch help mthca on pSeries? I wonder if it's not ehca, but rather that there's some ppc64 quirk that makes NAPI a lot more expensive. - R.

Got your point. Sorry, I haven't made any big progress yet. What I have found so far in the non-scaling code: if I always set missed_event = 0 without peeking at the rotting packet, then NAPI increases the performance and reduces the cpu utilization. That's the reason I suggested the above change. I haven't found the reason the scaling code drops 2/3 of the performance yet. The NAPI touch test for mthca on power looks good, so I don't think it's a ppc64 issue. Thanks Shirley
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland Dreier [EMAIL PROTECTED] wrote on 11/10/2006 07:00:46 AM:

> I think it has to stay the way I wrote it. Your version:

+       if (empty)
+               return (ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS) && netif_rx_reschedule(dev, 0));

Thanks Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland, Sorry, I was not intending to send the previous email; I accidentally sent it out. What I thought was that there would be a problem if missed_event always returns 1 -- then this napi poll would keep polling forever. How about deferring the rotting packets process until later? like this:

+       if (empty)
+               return (ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS) && netif_rx_reschedule(dev, 0));

With this patch, I could get NAPI + non-scaling code throughput from 1XXMb/s to 7XXMb/s; anyway, there are some other problems I am still investigating now. Thanks Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland Dreier [EMAIL PROTECTED] wrote on 10/19/2006 09:10:35 PM:

> I looked over my code again, and I don't see anything obviously wrong, but it's quite possible I made a mistake that I just can't see right now (like reversing a truth value somewhere). Someone who knows how ehca works might be able to spot the error. - R.

Your code is OK. I just found the problem here:

+       if (empty) {
+               netif_rx_complete(dev);
+               ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP, &missed_event);
+               if (unlikely(missed_event) && netif_rx_reschedule(dev, 0))
+                       goto repoll;
+
+               return 0;
+       }

netif_rx_complete() should be called right before the return. It does improve non-scaling performance with this patch, but reduces scaling performance:

+       if (empty) {
+               ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP, &missed_event);
+               if (unlikely(missed_event) && netif_rx_reschedule(dev, 0))
+                       goto repoll;
+               netif_rx_complete(dev);
+
+               return 0;
+       }

Is there any other reason not to call netif_rx_complete() while still possibly within the napi poll? Thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Retested several times: this hack patch only fixes the non-scaling code. I thought I had tested both scaling and non-scaling; it seems I made a mistake -- I might have configured and tested the non-scaling configuration twice in the previous run. thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Michael S. Tsirkin [EMAIL PROTECTED] wrote on 10/19/2006 01:21:45 PM:

> Please also note that due to factors such as TCP window limits, TX on a single socket is often stalled. To really stress a connection and see benefit from NAPI you should be running multiple socket streams in parallel: either just run multiple instances of netperf/netserver, or use iperf with the -P flag.

I used to get 7600Mb/s IPoIB one-socket duplex throughput with my other IPoIB patches on a 2.6.5 kernel under a certain configuration, which makes me believe we could gain close to link throughput with one UD QP. Now I can't get it anymore on the new kernel. I was struggling with TCP window limits on the new kernel. Do you have any hint? thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] RHEL5 and OFED ...
[EMAIL PROTECTED] wrote on 10/16/2006 01:50:49 PM: On Mon, 2006-10-16 at 15:25 +0200, Michael S. Tsirkin wrote: Quoting r. Maestas, Christopher Daniel [EMAIL PROTECTED]: Subject: Re: [openib-general] RHEL5 and OFED ...

> Now for userspace - does RHEL5 include at least libibverbs-1.0? This has been released a while back, and Roland makes regular bugfix releases.

Here's what I see on a rhel4 u4 system:
---
$ rpm -q libibverbs
libibverbs-1.0.3-1
---
So I would think rhel5 would have at least that or greater. When I compiled rpms for 1.1rc7 it generated:
---
# ls libibverbs-*
libibverbs-1.0.4-0.x86_64.rpm  libibverbs-utils-1.0.4-0.x86_64.rpm
libibverbs-devel-1.0.4-0.x86_64.rpm
---

> Doug, would it be possible to update this + libmthca?

> Possibly. What's the justification? What's in 1.0.4 that is the primary reason for wanting to update from 1.0.3? -- Doug Ledford [EMAIL PROTECTED]

I am not sure whether this already has an answer. The justification is that madvise(..., MADV_DONTFORK) is used to make fork() work for verbs consumers in the recent packages. I hope the same patch will be in libehca. thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] [openfabrics-ewg] RHEL5 and OFED ...
Roland Dreier [EMAIL PROTECTED] wrote on 10/19/2006 09:19:14 AM:

Shirley> How can RHEL5 pick up this particular patch? Applications with fork() depend on this patch.

> It can't really, since it breaks the libibverbs ABI and therefore has to be part of a major release.

Then we need to wait for the new release or find an alternative way, which I doubt exists. Thanks Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Thanks Michael for all these tips. I have tried several of the suggestions you proposed here. I couldn't see performance get any better. The TCP_RR rate dropped to 472 trans/s from about 18,000 trans/s, and TCP_STREAM BW dropped to 1/3 of before (ehca + scaling code) with the same TCP configuration, send queue size = recv queue size = 1K. Thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland, I have applied this patch and updated patch 2/2. You will send out an updated patch 2/2, I think. I did some extra modification in the ipoib code (which has more extra repolls). I do see around 10% or more performance improvement now with this change on both scaling and non-scaling code. I will run oprofile tomorrow to see the difference. I think with these extra repolls, the cpu utilization would be much higher. Thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland Dreier [EMAIL PROTECTED] wrote on 10/19/2006 07:39:25 PM:

Shirley> I have applied this patch and updated patch 2/2. You will send out an updated patch 2/2, I think.

> Sorry, messed that up. I just sent out the patch.

No problem, I did the same change.

> You mean you add more calls to ib_poll_cq()? Where do you add them? Why does it help? - R.

I ran out of ideas why we were losing 2/3 of the throughput and getting 476 trans/s. So I assumed there was always a missed event; then ipoib would stay in its napi poll within its scheduled time. That's why it helps. This is really a hack and doesn't address the problem: it sacrifices cpu utilization to gain the performance back. I need to understand how ehca reports a missed event -- there might be some delay there? Thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland Dreier [EMAIL PROTECTED] wrote on 10/19/2006 09:10:35 PM:

> It's entirely possible that my implementation of the missed event hint in ehca is wrong. I just guessed based on how poll CQ is implemented -- if the consumer requests a hint about missed events, then I lock the CQ and check if it's empty after requesting notification. I looked over my code again, and I don't see anything obviously wrong, but it's quite possible I made a mistake that I just can't see right now (like reversing a truth value somewhere). Someone who knows how ehca works might be able to spot the error. - R.

The oprofile data (with your napi + this hack patch) looks good; it reduced cpu utilization significantly. (I was wrong about cpu utilization.) I will talk with the ehca team regarding this missed event hint patch on ehca. thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland Dreier [EMAIL PROTECTED] wrote on 10/17/2006 08:41:59 PM:

> Anyway, I'm eagerly awaiting your NAPI results with ehca. Thanks, Roland

Thanks. The touch test results are not good. This NAPI patch induces huge latency for the ehca driver scaling code, and the throughput performance is not good. (I am not fully convinced the huge latency is because of raising NAPI in thread context.) Then I tried the ehca non-scaling driver; the latency looks good, but the throughput is still a problem. We are working on these issues. Hopefully we can get the answer soon. Thanks Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland Dreier [EMAIL PROTECTED] wrote on 10/18/2006 01:55:13 PM:

> I would like to understand why there's a throughput difference with scaling turned off, since the NAPI code doesn't change the interrupt handling all that much, and should lower the CPU usage if anything.

That's what I am trying to understand now. Yes, the send side rate dropped significantly, with lower cpu usage as well.

> Does changing the netdev weight value affect anything? - R.

No, it doesn't. Thanks Shirley Ma IBM Linux Technology Center
[openib-general] ethtool support for ipoib
I am going to add the ethtool ops below in ipoib. Any comments? Once ethtool support is added, GSO can be get/set directly through ethtool, as Michael pointed out earlier.

static struct ethtool_ops ipoib_ethtool_ops = {
        .get_settings           = ipoib_get_settings,
        .set_settings           = ipoib_set_settings,
        .get_drvinfo            = ipoib_get_drvinfo,
        .get_link               = ethtool_op_get_link,
        .get_stats_count        = ipoib_get_stats_count,
        .get_ethtool_stats      = ipoib_get_ethtool_stats,
        /* can be added later once ipoib supports sg
        .get_sg                 = ethtool_op_get_sg,
        .set_sg                 = ethtool_op_set_sg,
        */
};

thanks Shirley Ma
Re: [openib-general] ethtool support for ipoib
Michael S. Tsirkin [EMAIL PROTECTED] wrote on 10/16/2006 11:12:03 PM:

Quoting r. Shirley Ma [EMAIL PROTECTED]:
> /* can be added later once ipoib supports sg
> .get_sg = ethtool_op_get_sg,
> .set_sg = ethtool_op_set_sg,
> */

> The difficulty here is that sg currently requires checksum offloading in netdevice. -- MST

I read the discussion in net-dev. Since an IB packet has its own CRC (ICRC, VCRC), is it a good idea to mark checksums unnecessary in a pure IB fabric for a large 64K MTU? It requires some negotiation. Does your prototype implementation for large MTU require agreement from both ends? Practically it can be implemented, but I don't know what the RFCs have defined.
Re: [openib-general] ethtool support for ipoib
What I suggested here is that when it's connected mode with a large MTU, we set the ib interface flag to CHECKSUM_UNNECESSARY. But this only works on packets not being routed off-net at the TCP layer. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general] ethtool support for ipoib
Parks Fields [EMAIL PROTECTED] wrote on 10/17/2006 01:12:48 PM:

> No, it's never a good idea to turn off TCP or IP checksums. That leads to possibilities of silent data corruption too easily.
> I totally agree...

Have we ever seen silent data corruption with CHECKSUM_HW? Thanks Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Hi, Roland, There were a couple of errors and a warning when I applied this patch to OFED-1.1-rc7.
1. ehca_req_notify_cq() in ehca_iverbs.h is not updated.
2. *maybe_missed_event = ipz_qeit_is_valid(my_cq->ipz_queue) should be = ipz_qeit_is_valid(&my_cq->ipz_queue)
3. a compile warning on this line: return cqe_flags >> 7 == queue->toggle_state & 1;
Thanks Shirley Ma
Re: [openib-general] enable GSO over IPoIB
Hi, Roland, If we only support GSO enablement in ethtool, there is no problem. What I meant is that anything related to MAC addresses in the ethtool utility needs to be updated for IB devices. Do you like the idea of adding ethtool support in IPoIB? Do you want me to work on this? Thanks Shirley Ma
Re: [openib-general] enable GSO over IPoIB
Good. Then after enabling GSO, we can chain multiple packets together in IPoIB for one doorbell to send a large packet. Thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] enable GSO over IPoIB
Roland Dreier [EMAIL PROTECTED] wrote on 10/16/2006 10:37:12 AM:

Shirley> Good. Then after enabling GSO, we can chain multiple packets together in IPoIB for one doorbell to send a large packet.

> How does that work? GSO doesn't change the hard_start_xmit() interface, does it? - R.

No, it doesn't. I am thinking of adding enqueue/dequeue of multiple packets in qdisc. It would benefit other networking devices too. Thanks Shirley Ma
Re: [openib-general] enable GSO over IPoIB
Roland Dreier [EMAIL PROTECTED] wrote on 10/16/2006 10:49:32 AM:

Shirley> No, it doesn't. I am thinking of adding enqueue/dequeue of multiple packets in qdisc. It would benefit other networking devices.

> So am I understanding correctly -- this is other work that is independent of GSO? Is the plan to add a new optional driver method that extends hard_start_xmit() to accept multiple packets? - R.

Yes, you are right. It is new work independent of GSO. I hope I have the bandwidth to do all the work on time. Thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] IPOIB NAPI
Roland, I don't know why I am having trouble getting this patch from your git tree. Do you mind posting the patch here so I can test the performance over ehca? Thanks Shirley Ma
Re: [openib-general] [PATCH] IB/ipoib: NAPI
[EMAIL PROTECTED] wrote on 09/28/2006 09:11:47 AM:

Michael> Looked pretty simple on the outset, but oh well. Keep us posted.

> I just work slowly. Anyway I don't think this is that urgent -- we've dumped enough stuff into 2.6.19, so I think this should wait for 2.6.20 at the earliest anyway.

Please wait for the other device drivers to finish the performance test. This NAPI patch somehow kills ehca performance -- extremely badly. Shirley Ma
Re: [openib-general] [PATCH] IB/ipoib: NAPI
Michael S. Tsirkin [EMAIL PROTECTED] wrote on 09/26/2006 09:59:30 PM:

> I still hope ehca NAPI performance can be fixed. But if not, maybe we should have the low level driver set a disable_napi flag rather than have users play with module options. -- MST

I forgot to mention that these NAPI parameters should be tunable for different device drivers, like dev->weight, or set up in the lower driver. thanks Shirley Ma
Re: [openib-general] [PATCH] IB/ipoib: NAPI
Michael S. Tsirkin [EMAIL PROTECTED] wrote on 09/26/2006 11:23:16 PM:

Quoting r. Shirley Ma [EMAIL PROTECTED]:
> We are implementing multiple EQs support for one adapter now.

> I think with MSI we can have a per-interface EQ in mthca. Main reason I'm not doing this is because I haven't figured out the right interface to pass this information to the low level driver yet. Maybe we should just assign EQs to CQs in a round-robin fashion for now, and just hope typical use allocates CQs sequentially. Worst case, we are back to where we are now, performance-wise. Roland, how does this sound?

> If that works, then we can modify the ehca code as mthca. Actually mthca has the same problem as ehca over two links on the same adapter.

> OK, but if as you point out the issue is not device-specific - that's a good reason not to do tricks in the low-level driver to try and work around this, but to address this at the ULP level. -- MST

Yes. That's what we are working on: defining the right APIs to pass this information to the low level driver. Now we are trying per-interface-per-EQ; then we will extend the work to an N(CQ):M(EQ) mapping. ehca can support up to 127 EQs, so I would suggest using a hash. Thanks Shirley Ma
Re: [openib-general] heads-up - ipoib NAPI
Hi, Eli,

Eli Cohen [EMAIL PROTECTED] wrote on 09/26/2006 11:35:26 PM:

> On Tue, 2006-09-26 at 21:34 -0700, Shirley Ma wrote:
>> The NAPI patch moves the ipoib poll from hardware interrupt context to softirq context. It would reduce the hardware interrupts, reduce hardware latency and introduce some network latency. It might reduce cpu utilization. But I still question the BW improvement. I did see varying performance with the same test under the same conditions.

> When you open just one connection you can see around 10% of variation in the BW measure. But then you don't utilize all the CPU power you have and you don't get to the threshold where NAPI becomes effective. Using multiple connections utilizes all CPUs in the system, increases send rate, and increases the chances of the receiver to poll CQEs up to its quota and be scheduled again without re-enabling interrupts.

The send rate shouldn't be limited by one connection; the cpu is much faster than the link speed. I don't think the multiple-connection send rate is higher than one connection's -- do you have any data to show that? When I monitored the CQEs, I didn't see too many CQEs in the CQ for one notification, and I don't think moving the poll from hardware interrupt context to softirq context would increase that number. Or the latency might cause the number to increase; I did see that number increase and performance increase with some udelay in hardware interrupt polling mode. If you saw the packet count increase, how many packets did you see in one hardware interrupt poll vs. one NAPI poll? Your NAPI poll is driven either by the receiver quota or by any send CQE in the CQ. Have you tested UDP performance? Any difference? Thanks Shirley Ma
Re: [openib-general] enable GSO over IPoIB
Michael S. Tsirkin [EMAIL PROTECTED] wrote on 09/27/2006 01:30:03 AM:

> Any idea what does ethtool do that IPoIB can't support?

ethtool is an ethernet device tool. It's OK to partially implement the ethtool operations in IPoIB. We also need to patch the userlevel utility to support ibX interfaces; now it only supports ethX. thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] [PATCH] IB/ipoib: NAPI
I have created a patch to monitor the CQ. That wasn't the reason for the performance drop; I couldn't see any race from the output. Thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] [PATCH] IB/ipoib: NAPI
Roland, We had a simple version of the NAPI patch. We saw the performance improvement on mthca but not ehca. We will test this NAPI patch on ehca when it's available to see how the performance is. Thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] [PATCH] IB/ipoib: NAPI
> This patch implements NAPI for ipoib. It is a draft implementation. I would like your opinion on whether we need a module parameter to control if NAPI should be activated or not.

It can be a configuration option to enable/disable NAPI, just like other network devices. Thanks Shirley Ma IBM Linux Technology Center
[openib-general] enable GSO over IPoIB
Since linux 2.6.18 supports GSO, I have patched IPoIB to enable GSO, but haven't tested the performance yet. Has anyone tried already? Thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] 2.6.18 kernel support in the main trunk.
> What's the status with the main trunk kernel code and 2.6.18? I noticed that it doesn't build and needs something like this. I haven't tested this yet...

Yes. You need this patch and also need to change ipoib_multicast.c: dev->xmit_lock to dev->_xmit_lock to build the trunk on 2.6.18. Thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] [PATCH] IB/ipoib: NAPI
We did some touch testing on the ehca driver and saw a performance drop somehow. I strongly recommend making NAPI a configurable option in ipoib, so customers can turn it on/off based on their configurations. Thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] [PATCH] IB/ipoib: NAPI
Roland,

> Do you know how ehca behaves? Does it have that race? i.e. what happens in this situation: poll CQ -> CQ is empty -> (new completion is added to CQ) -> request notify on CQ -> (no more completions are added). Mellanox HCAs will generate a CQ event in this case, although it's not strictly required by the IB spec. How will ehca behave? - R.

That could be the reason. I did see mthca poll an empty entry, but not ehca. I will confirm this with the ehca team. Thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] enable GSO over IPoIB
Shirley> Since linux 2.6.18 supports GSO, I have patched IPoIB to enable GSO, but haven't tested the performance yet. Has anyone tried already?

> No, I don't think anyone looked at that yet. Could you post your patch? What is required? Supporting gather/scatter? - R.

Don't need to. GSO only improves sender-side performance. It allows a large packet send in the ULPs, and segments the packet in the interface layer before the driver xmit. GSO enablement is through ethtool. Since ipoib doesn't support ethtool, I just simply added a module parameter to set the interface GSO flag when loading the module. My next step is to enable gather/scatter in the ipoib send path to chain multiple packets together for one doorbell. Thanks Shirley Ma
Re: [openib-general] heads-up - ipoib NAPI
Hi, Eli,

> Hi, I have a draft implementation of NAPI in ipoib and got the following results. System description: Quad CPU E64T 2.4 GHz, 4 GB RAM, MT25204 Sinai HCA. I used netperf for benchmarking; the BW test ran for 600 seconds with 8 clients and 8 servers. The results I received are below:
>
> netperf TCP_STREAM:
>                BW [MByte/sec]   client side [irqs/sec]   server side [irqs/sec]
> without NAPI:  506              86441                    66311
> with NAPI:     550              6830                     13600
>
> netperf TCP_RR:
>                rate [tran/sec]
> without NAPI:  39600
> with NAPI:     39470
>
> Please note this is still under work and we plan to do more tests and measure on other devices.

The NAPI patch moves the ipoib poll from hardware interrupt context to softirq context. It would reduce the hardware interrupts, reduce hardware latency and introduce some network latency. It might reduce cpu utilization. But I still question the BW improvement. I did see varying performance with the same test under the same conditions. Have you tested this patch with different message sizes and different socket sizes? Are these results consistently better? Thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] [PATCH] IB/ipoib: NAPI
Hi, Roland, Shirley It can be a configuration option to enable/disable NAPI, Shirley just like other network devices. But is there any reason to keep the non-NAPI code around? I hate to have two codepaths to maintain. If you would like to maintain only one code path, then we need to compare the NAPI patch with the thread-context polling mode patch. I did see a big performance improvement with the thread-context polling mode patch I have been working on. (I used to split the CQ; I am trying without splitting the CQ now.) And I think it would improve multiple-link performance when links share one EQ. Thanks Shirley Ma
Re: [openib-general] [PATCH] IB/ipoib: NAPI
Michael S. Tsirkin [EMAIL PROTECTED] wrote on 09/26/2006 09:59:30 PM: Quoting r. Shirley Ma [EMAIL PROTECTED]: Subject: Re: [PATCH] IB/ipoib: NAPI We did a quick test on the ehca driver and saw a performance drop. Hmm, it seems ehca still defers the completion event to a tasklet. It always seemed weird to me. So that could be the reason - with NAPI you now get 2 tasklet schedules, as you are actually doing part of what NAPI does, inside the low-level driver. Try ripping that out and calling the event handler directly, and see what it does to performance with NAPI. The reason for this ehca implementation is that two ports/links share one EQ. We are implementing multiple-EQ support for one adapter now. If that works, then we can modify the ehca code like mthca. Actually mthca has the same problem as ehca over two links on the same adapter: performance with two links on the same adapter is very bad, and it does not scale at all. I strongly recommend making NAPI a configurable option in ipoib, so customers can turn it on/off based on their configurations. I still hope ehca NAPI performance can be fixed. But if not, maybe we should have the low level driver set a disable_napi flag rather than have users play with module options. -- MST We have been working on this issue for some time. That's the reason we didn't post our NAPI patch. Hopefully we can fix it. If we can show that NAPI performance (latency, BW, cpu utilization) is better in all cases (UP vs. SMP, one socket vs. multiple sockets, one link vs. multiple links, different message sizes, different socket sizes) I will agree to turn on NAPI as the default. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
[openib-general] Re: [PATCH]Repost: IPoIB skb panic
Roland, Can you post a recipe to reproduce the crash? It happened on a 32-node cluster (each node has 8 dual-core cpus) running IBM applications over IPoIB. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
[openib-general] Re: Re: [PATCH]Repost: IPoIB skb panic
Michael, I will apply this patch. This patch would reduce the race, not address the problem. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general] Re: [PATCH]Repost: IPoIB skb panic
Ohmm. That's a myth. So this problem is hardware independent, right? It's not easy to reproduce: an ifconfig up/down stress test can hit this problem occasionally. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
[openib-general] [PATCH]Repost: IPoIB skb panic
Roland, I posted the patch yesterday, but it seems it only went to the web site. I am reposting the patch here for you to review. Please let me know if there is any problem applying this patch. There are two problems in path_free(), which caused a kernel skb panic during interface up/down stress tests.

1. path_free() should call dev_kfree_skb_any() (any context) instead of dev_kfree_skb_irq() (irq context), since it is called in process context.
2. path->queue should be protected by priv->lock, since there is a race between unicast_arp_send() and ipoib_flush_paths() to release the skb when bringing the interface down. It's safe to use priv->lock because skb_queue_len(path->queue) < IPOIB_MAX_PATH_REC_QUEUE, which is 3.

Signed-off-by: Shirley Ma [EMAIL PROTECTED]

diff -urpN infiniband/ulp/ipoib/ipoib_main.c infiniband-skb/ulp/ipoib/ipoib_main.c
--- infiniband/ulp/ipoib/ipoib_main.c	2006-05-03 13:16:18.0 -0700
+++ infiniband-skb/ulp/ipoib/ipoib_main.c	2006-06-01 09:14:05.0 -0700
@@ -252,11 +252,11 @@ static void path_free(struct net_device
 	struct sk_buff *skb;
 	unsigned long flags;
 
-	while ((skb = __skb_dequeue(&path->queue)))
-		dev_kfree_skb_irq(skb);
-
 	spin_lock_irqsave(&priv->lock, flags);
 
+	while ((skb = __skb_dequeue(&path->queue)))
+		dev_kfree_skb_any(skb);
+
 	list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) {
 		/*
 		 * It's safe to call ipoib_put_ah() inside priv->lock

Thanks Shirley Ma IBM LTC
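The effect of moving the dequeue under the lock can be modeled in plain userspace C (a toy sketch with pthreads; the names are illustrative, not the kernel API): two concurrent paths of execution drain the same queue, and because each dequeue happens with the lock held, every buffer is freed exactly once.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Toy model of the race fixed in path_free(): two contexts (standing in
 * for unicast_arp_send() and ipoib_flush_paths()) both drain the same
 * queue.  Holding the lock across the dequeue guarantees each node is
 * freed exactly once, never twice. */

struct node { struct node *next; };

static struct node *queue_head;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static int freed;

static void drain_queue(void)
{
    struct node *n;

    pthread_mutex_lock(&queue_lock);
    while ((n = queue_head)) {          /* dequeue under the lock */
        queue_head = n->next;
        free(n);
        freed++;
    }
    pthread_mutex_unlock(&queue_lock);
}

static void *drainer(void *arg)
{
    (void)arg;
    drain_queue();
    return NULL;
}

/* Enqueue `count` nodes, race two drainers, return how many got freed. */
static int race_drain(int count)
{
    pthread_t a, b;

    freed = 0;
    for (int i = 0; i < count; i++) {
        struct node *n = malloc(sizeof(*n));
        n->next = queue_head;
        queue_head = n;
    }
    pthread_create(&a, NULL, drainer, NULL);
    pthread_create(&b, NULL, drainer, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return freed;
}
```

If the dequeue loop ran before taking the lock, as in the old code, the two contexts could both manipulate the list and free the same node, which is exactly the double-free the assertion failures below report.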
[openib-general] Re: [PATCH]Repost: IPoIB skb panic
Roland, More clarification: we saw two races here:

1. path_free() was called by both unicast_arp_send() and ipoib_flush_paths() at the same time.

0xc004bff0a0d031 10 R 0xc004bff0a580 *ksoftirqd/0
SP(esp)        PC(eip)    Function(args)
0xcf707c80     0xc03199d0 .skb_release_data +0x7c
0xcf707c80     0xc0319688 (lr) .kfree_skbmem +0x20
0xcf707d10     0xc0319688 .kfree_skbmem +0x20
0xcf707da0     0xc03197fc .__kfree_skb +0x148
0xcf707e50     0xc031e2a8 .net_tx_action +0xa4
0xcf707f00     0xc006ab38 .__do_softirq +0xa8
0xcf707f90     0xc00177b0 .call_do_softirq +0x14
0xc000cff83d90 0xc0012064 .do_softirq +0x90
0xc000cff83e20 0xc006b0fc .ksoftirqd +0xfc
0xc000cff83ed0 0xc0081d74 .kthread +0x17c
0xc000cff83f90 0xc0017d24 .kernel_thread +0x4c
KERNEL: assertion (!atomic_read(&skb->users)) failed at net/core/dev.c

2. During unicast arp skb retransmission, unicast_arp_send() appended the skb to the list while ipoib_flush_paths() was calling path_free() to free the same skb from the list.

<3>KERNEL: assertion (!atomic_read(&skb->users)) failed at net/core/dev.c (1742)
<4>Warning: kfree_skb passed an skb still on a list (from c031e2a8).
<2>kernel BUG in __kfree_skb at net/core/skbuff.c:225! (sles9 sp3 kernel)

void __kfree_skb(struct sk_buff *skb)
{
	if (skb->list) {
		printk(KERN_WARNING "Warning: kfree_skb passed an skb still on a list (from %p).\n", NET_CALLER(skb));
		BUG();
	}

The patch will fix both problems by using priv->lock to protect the path->queue list. Am I right? Thanks Shirley Ma IBM LTC
[openib-general] Re: [PATCH]Repost: IPoIB skb panic
On Fri, 2006-06-02 at 16:15 -0700, Roland Dreier wrote: 2. during unicast arp skb retransmission, unicast_arp_send() appended the skb on the list, while ipoib_flush_paths() calling path_free() to free the same skb from the list. I think I see what's going on. the skb ends up being on two lists at once I guess... - R. The skb has only one prev pointer and one next pointer, so it can only be on one list at a time. How could the skb go on two lists at once? Thanks Shirley
[openib-general] [PATCH] IPoIB skb panic
Roland, I found two problems in path_free() that can cause a kernel skb panic.

1. path_free() should call dev_kfree_skb_any() (any context) instead of dev_kfree_skb_irq() (irq context).
2. path->queue should be protected by priv->lock, since there is a possible race between unicast_arp_send() and ipoib_flush_paths() when bringing the interface down. It's safe to use priv->lock because skb_queue_len(path->queue) < IPOIB_MAX_PATH_REC_QUEUE, which is 3.

Here is the patch. Please review it and let me know if there is a problem applying it.

Signed-off-by: Shirley Ma [EMAIL PROTECTED]

diff -urpN infiniband/ulp/ipoib/ipoib_main.c infiniband-skb/ulp/ipoib/ipoib_main.c
--- infiniband/ulp/ipoib/ipoib_main.c	2006-05-03 13:16:18.0 -0700
+++ infiniband-skb/ulp/ipoib/ipoib_main.c	2006-06-01 09:14:05.0 -0700
@@ -252,11 +252,11 @@ static void path_free(struct net_device
 	struct sk_buff *skb;
 	unsigned long flags;
 
-	while ((skb = __skb_dequeue(&path->queue)))
-		dev_kfree_skb_irq(skb);
-
 	spin_lock_irqsave(&priv->lock, flags);
 
+	while ((skb = __skb_dequeue(&path->queue)))
+		dev_kfree_skb_any(skb);
+
 	list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) {
 		/*
 		 * It's safe to call ipoib_put_ah() inside priv->lock

Thanks Shirley Ma IBM LTC
Re: [openib-general] [PATCH][1/7]ipoib performance patches -- remove ah_reap
Roland, Yes, the lock sequences look right to me. What I found is that the ah is always available via the IPoIB neigh, so I can modify this patch like this:

in ipoib_send:
	if (unlikely(*to_ipoib_neigh(skb->dst->neighbour)))
		kref_get();

in ipoib completion:
	if (unlikely(*to_ipoib_neigh(skb->dst->neighbour)))
		kref_put();

Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general] [PATCH][1/7]ipoib performance patches -- remove ah_reap
in ipoib send:
	if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour)))
		kref_get();

in ipoib completion:
	if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour)))
		ipoib_put_ah();

Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general] [PATCH][3/7]ipoib performance patches -- remove tx_ring
Roland, Thanks for the review comments. I will update these patches and test results. BTW it would be nice if you could figure out a way to fix your mail client to post patches inline without mangling them, or at least attach them with a mime type of text/plain or something. I will use my unix account to send out patches. Also, if you're interested, you could try the patch below and see how it does on your tests. Sure, I will test it after this weekend. Did you see send queue overrun with the tx_ring default size of 128? Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general] [PATCH][3/7]ipoib performance patches -- remove tx_ring
Roland Dreier [EMAIL PROTECTED] wrote on 05/26/2006 04:20:02 PM: (Also the default TX ring size is 64, not 128, isn't it?) - R. Yes. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general] [PATCH][3/7]ipoib performance patches -- remove tx_ring
Roland, I made some mistakes while splitting these patches. Thanks for pointing that out. The reason I removed the cacheline is that I have tested and proved that it didn't help. Even though I introduced some locks in the code, the overall performance of all 7 patches I am posting here improves IPoIB by 20% - 80% unidirectional and doubles it bidirectional. As you mentioned, I need help to repolish these patches. I am glad that you gave all these valuable inputs on my patches. Thanks a lot. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general] [PATCH][3/7]ipoib performance patches -- remove tx_ring
Roland Dreier [EMAIL PROTECTED] wrote on 05/25/2006 04:28:27 PM: Shirley didn't help somehow. Even in some code, I induced some Shirley locks, the overall performance of all these 7 patches I Shirley am trying to post here could improve IPoIB from 20% - 80% Shirley unidirectional and doubled bidirectional. I'm guessing that all of that gain comes from letting the send and receive completion handlers run simultaneously. Is that right? For example how much of an improvement do you see if you just apply the patches you've posted -- that is, only 1/7, 2/7 and 3/7? - R. That's not true. I tested performance with 1/7 and 3/7 a couple of weeks ago and saw more than a 10% improvement. I never saw a send queue overrun with the tx_ring before on one cpu with one TCP stream. After removing the tx_ring, the send path is much faster and the default 128 is not big enough to handle it. That's the reason I have another patch to handle send queue overrun: requeue the packet at the head of the dev xmit queue, instead of the current implementation, which silently drops packets when the device driver's send queue is full and depends on TCP retransmission. The current behavior causes TCP fast retransmit, slow start, and out-of-order packets. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
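The overrun handling described above can be sketched as follows (a minimal model, not the driver code; names and the tiny ring size are illustrative): when the count of posted-but-uncompleted sends reaches the ring size, the xmit routine reports busy so the stack requeues the packet at the head instead of dropping it.

```c
#include <assert.h>

/* Toy model of send-queue overrun handling.  SENDQ_SIZE is deliberately
 * tiny for illustration; the driver default discussed above was 64/128. */
#define SENDQ_SIZE 4

enum { TX_OK = 0, TX_BUSY = 1 };

static unsigned tx_head, tx_tail;   /* producer / consumer indices */

static int hard_start_xmit(void)
{
    if (tx_head - tx_tail >= SENDQ_SIZE)
        return TX_BUSY;             /* caller requeues at the head */
    tx_head++;                      /* post one send work request */
    return TX_OK;
}

static void send_completion(void)
{
    if (tx_tail != tx_head)
        tx_tail++;                  /* one send finished; a slot frees up */
}
```

Returning busy (rather than dropping) lets the queueing layer retry the same packet in order once a completion frees a slot, avoiding the fast-retransmit/slow-start/reordering penalties described above.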
Re: [openib-general] [PATCH][3/7]ipoib performance patches -- remove tx_ring
Roland, Roland Dreier [EMAIL PROTECTED] wrote on 05/25/2006 09:24:01 AM: This also looks like a step backwards to me. You are replacing a cache-friendly array with a cache-unfriendly linked list, which also requires two more lock/unlock operations in the fast path. This patch removes one extra ring between the dev xmit queue and the device send queue, and removes the tx_lock in the completion handler. The whole purpose of the send_list and its lock is shutdown cleanup; otherwise we wouldn't need to maintain this list. And most likely when shutting down, after waiting 5*HZ, the list is empty. I could implement it differently, e.g. an RCU list that is cache-friendly. I didn't think it was worth it before since I didn't see the gain. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
[openib-general] [PATCH][1/7]ipoib performance patches -- remove ah_reap
	}

-	clear_bit(IPOIB_STOP_REAPER, &priv->flags);
-	queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ);
-
	set_bit(IPOIB_FLAG_INITIALIZED, &priv->flags);

	return 0;
@@ -580,24 +540,6 @@ timeout:
	if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE))
		ipoib_warn(priv, "Failed to modify QP to RESET state\n");

-	/* Wait for all AHs to be reaped */
-	set_bit(IPOIB_STOP_REAPER, &priv->flags);
-	cancel_delayed_work(&priv->ah_reap_task);
-	flush_workqueue(ipoib_workqueue);
-
-	begin = jiffies;
-
-	while (!list_empty(&priv->dead_ahs)) {
-		__ipoib_reap_ah(dev);
-
-		if (time_after(jiffies, begin + HZ)) {
-			ipoib_warn(priv, "timing out; will leak address handles\n");
-			break;
-		}
-
-		msleep(1);
-	}
-
	return 0;
}
diff -urpN infiniband-split-cq/ulp/ipoib/ipoib_main.c infiniband-ah/ulp/ipoib/ipoib_main.c
--- infiniband-split-cq/ulp/ipoib/ipoib_main.c	2006-05-22 08:48:47.0 -0700
+++ infiniband-ah/ulp/ipoib/ipoib_main.c	2006-05-23 09:31:49.0 -0700
@@ -957,7 +957,6 @@ static void ipoib_setup(struct net_devic
	INIT_WORK(&priv->mcast_task, ipoib_mcast_join_task, priv->dev);
	INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush, priv->dev);
	INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task, priv->dev);
-	INIT_WORK(&priv->ah_reap_task, ipoib_reap_ah, priv->dev);
}

struct ipoib_dev_priv *ipoib_intf_alloc(const char *name)

Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general] [PATCH][1/7]ipoib performance patches -- remove ah_reap
Roland Dreier [EMAIL PROTECTED] wrote on 05/24/2006 01:50:37 PM: NAK to this patch. Not only is it a step backwards in performance -- you've essentially added two (expensive) atomic operations for every packet sent My observation is that the atomic operation is not that expensive. -- but the patch is actually wrong:

+	err = post_send(priv, priv->tx_head & (ipoib_sendq_size - 1),
+			address->ah, qpn, addr, skb->len);
+	kref_put(&address->ref, ipoib_free_ah);

The whole point of the complexity in AH handling in IPoIB is that AHs cannot be freed until the driver knows that all sends referring to them have _completed_. As you've written your patch, an AH can easily be freed before the HCA has a chance to execute the corresponding send request. - R. I thought the path holds another AH reference to prevent it from being freed? Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general] [PATCH][1/7]ipoib performance patches -- remove ah_reap
Roland, Roland Dreier [EMAIL PROTECTED] wrote on 05/24/2006 03:01:01 PM: Shirley Compared to having a single thread handling AHs, I don't Shirley think this atomic operation is expensive. But freeing AHs is something that happens infrequently and can be done asynchronously. You're replacing that cost with two atomic operations per sent packet! No, actually nothing was freed during sending in my test. Shirley It is true for unicast, it has a reference count before Shirley ipoib_send(). I need to look at multicast. But can you guarantee that the AH stays around until after the send completes (which could be an arbitrarily long delay)? - R. I checked neigh_add_path(); for unicast it is always true. See the code below.

static void neigh_add_path(..)
{
	...
	if (path->ah) {
		kref_get(&path->ah->ref);
		neigh->ah = path->ah;
		ipoib_send(dev, skb, path->ah...
	}

Please correct me if I am wrong. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general] [PATCH][1/7]ipoib performance patches -- remove ah_reap
Roland Dreier [EMAIL PROTECTED] wrote on 05/24/2006 04:07:58 PM: To reiterate: freeing AHs is a rare, slow path operation that can be done asynchronously. It is not a good tradeoff to do two atomic_t operations for every sent packet, just to avoid occasionally reaping AHs in process context. I don't think two atomic operations are that expensive compared to reaping AHs in process context, according to the test results and profiling data. Or we can use RCU instead. But can you guarantee that the AH stays around until after the send completes (which could be an arbitrarily long delay)? I checked neigh_add_path(); for unicast it is always true. See the code below.

static void neigh_add_path(..)
{
	...
	if (path->ah) {
		kref_get(&path->ah->ref);
		neigh->ah = path->ah;
		ipoib_send(dev, skb, path->ah...
	}

Again, I don't understand how this is a response at all. The AH cannot be freed until after the send operation is actually fully completed, which could be a long time after ib_post_send() returns. If an AH is freed after ipoib_send() returns but before the send is executed, then the HCA may use stale data, which could lead to a send error. To summarize: the patch is broken (leads to incorrect lifetimes for AHs), and in any case makes the send fast path slower. - R. That's a valid point. This problem will be addressed in the next tx_ring removal patch: the kref_put is called in ipoib_ib_handle_send_wc(). Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
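The lifetime rule under discussion can be modeled with a toy reference count (a sketch with illustrative names standing in for kref_get()/kref_put(), not the kernel API): the send path takes its own reference before posting, and only the completion handler drops it, so even if the path drops its reference mid-flight the AH survives until the send completes.

```c
#include <assert.h>

/* Toy model of AH lifetime: the refcount hits zero only after the send
 * completion, never while the HCA may still read the address handle. */
struct toy_ah { int refcnt; int freed; };

static void ah_put(struct toy_ah *ah)
{
    if (--ah->refcnt == 0)
        ah->freed = 1;                /* destroy the AH */
}

/* Returns 1 iff the AH stayed alive across the path teardown and was
 * freed only by the completion handler. */
static int ah_lifetime_ok(void)
{
    struct toy_ah ah = { 1, 0 };      /* the path holds the initial ref */

    ah.refcnt++;                      /* send path: kref_get() before posting */
    ah_put(&ah);                      /* path flushed: drops its ref mid-flight */
    if (ah.freed)                     /* freed while the send is outstanding? */
        return 0;
    ah_put(&ah);                      /* completion handler: kref_put() */
    return ah.freed;                  /* now, and only now, it is freed */
}
```

Roland's objection to the posted patch was that the put happened right after ib_post_send() rather than in the completion handler, which collapses the window this model keeps open.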
Re: [openib-general] [PATCH][1/7]ipoib performance patches -- remove ah_reap
Roland, My idea is to remove this AH reap thread. We can use RCU to do the same work without lots of coding. Do you agree? Also, in the AH reap code, tx_tail/tx_head isn't consistently protected by tx_lock; it uses priv->lock. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general] [PATCH][1/7]ipoib performance patches -- remove ah_reap
Roland Dreier [EMAIL PROTECTED] wrote on 05/24/2006 05:11:12 PM: Shirley Roland, My idea is to remove this AH reap thread. We can Shirley use RCU to do the same work without lots of coding. Do Shirley you agree? No, I don't see how that will help. How does RCU know when it's safe to free an AH? With the tx_ring removal patch, RCU can be done. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general] [PATCH][1/7]ipoib performance patches -- remove ah_reap
Roland Dreier [EMAIL PROTECTED] wrote on 05/24/2006 05:52:31 PM: Shirley With tx_ring removal patch, RCU can be done. OK, I guess I'll wait and see. But to be honest I don't see how RCU helps anything. - R. I am continuing to submit the tx_ring patch with the atomic operation for you to review; let's discuss the AH reap solution later. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
[openib-general] [PATCH][3/7]ipoib performance patches -- remove tx_ring
	 */
	priv->rx_ring = kzalloc(ipoib_recvq_size * sizeof *priv->rx_ring,
				GFP_KERNEL);
	if (!priv->rx_ring) {
@@ -855,24 +853,11 @@ int ipoib_dev_init(struct net_device *de
		goto out;
	}

-	priv->tx_ring = kzalloc(ipoib_sendq_size * sizeof *priv->tx_ring,
-				GFP_KERNEL);
-	if (!priv->tx_ring) {
-		printk(KERN_WARNING "%s: failed to allocate TX ring (%d entries)\n",
-		       ca->name, ipoib_sendq_size);
-		goto out_rx_ring_cleanup;
-	}
-
-	/* priv->tx_head & tx_tail are already 0 */
-
	if (ipoib_ib_dev_init(dev, ca, port))
-		goto out_tx_ring_cleanup;
+		goto out_rx_ring_cleanup;

	return 0;

-out_tx_ring_cleanup:
-	kfree(priv->tx_ring);
-
 out_rx_ring_cleanup:
	kfree(priv->rx_ring);

@@ -896,10 +881,8 @@ void ipoib_dev_cleanup(struct net_device
	ipoib_ib_dev_cleanup(dev);

	kfree(priv->rx_ring);
-	kfree(priv->tx_ring);

	priv->rx_ring = NULL;
-	priv->tx_ring = NULL;
}

static void ipoib_setup(struct net_device *dev)
@@ -944,6 +927,7 @@ static void ipoib_setup(struct net_devic

	spin_lock_init(&priv->lock);
	spin_lock_init(&priv->tx_lock);
+	spin_lock_init(&priv->slist_lock);

	mutex_init(&priv->mcast_mutex);
	mutex_init(&priv->vlan_mutex);
@@ -952,6 +936,7 @@ static void ipoib_setup(struct net_devic
	INIT_LIST_HEAD(&priv->child_intfs);
	INIT_LIST_HEAD(&priv->dead_ahs);
	INIT_LIST_HEAD(&priv->multicast_list);
+	INIT_LIST_HEAD(&priv->send_list);

	INIT_WORK(&priv->pkey_task, ipoib_pkey_poll, priv->dev);
	INIT_WORK(&priv->mcast_task, ipoib_mcast_join_task, priv->dev);

Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general] [PATCH][3/7]ipoib performance patches -- remove tx_ring
Oops, I missed one pair of spin_lock_irqsave()/spin_unlock_irqrestore() to protect the send_list in ipoib_ib_handle_send_wc(). Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general] [PATCH][1/7]ipoib performance patches -- split CQ
Roland Dreier [EMAIL PROTECTED] wrote on 05/23/2006 09:09:05 AM: Did you send the other 6 patches in this series? Yes, I am splitting these patches. I was waiting to comment until I had all the patches, but there is one really bad thing here:

+	IPOIB_NUM_SEND_WC = 32,

+void ipoib_ib_send_completion(struct ib_cq *cq, void *dev_ptr)
+{
+	struct net_device *dev = (struct net_device *) dev_ptr;
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_wc ibwc[IPOIB_NUM_SEND_WC];

If I'm doing the math correctly, this function now uses more than 2K of stack, which is of course unacceptable. I don't think there's any way around keeping the wc array in the ipoib_dev_priv structure. - R. The stack is 4K now, not 8K anymore. I think we can still use IPOIB_NUM_SEND_WC as 4. I modified mthca_XXX_post_send to remove the lock entirely before (since the sender is exclusive), and found that the lock didn't impact performance much. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
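Roland's "more than 2K of stack" estimate checks out if sizeof(struct ib_wc) is taken as roughly 64 bytes, which is an assumption here, not a measured value for any particular kernel:

```c
#include <assert.h>
#include <stddef.h>

/* Back-of-envelope stack cost of an on-stack ib_wc array.  The 64-byte
 * element size is an assumed sizeof(struct ib_wc) for illustration; the
 * real size depends on the kernel version. */
#define ASSUMED_SIZEOF_IB_WC 64

static size_t wc_array_bytes(size_t num_wc)
{
    return num_wc * ASSUMED_SIZEOF_IB_WC;
}
```

With 32 entries that is about 2 KB, a large bite out of a 4 KB kernel stack; dropping IPOIB_NUM_SEND_WC to 4, as suggested in the reply, shrinks it to roughly 256 bytes.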
Re: [openib-general] different send and receive CQs
Eric, I have no problem with splitting the CQ; you can refer to my IPoIB CQ-splitting patch. Could you share your code here so we can give you some suggestions? Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
[openib-general] Re: ipoib_reap_ah question
Roland Dreier [EMAIL PROTECTED] wrote on 05/22/2006 09:58:13 AM: I think you should keep your patches simple -- one idea per patch. So if you want to experiment with both tx_ring removal and the reap_ah removal, keep in mind that they should be merged as separate patches. So you should probably develop them that way. - R. I will, thanks. If I separate these two patches, I will have to make last_send an atomic_t in the tx_ring removal patch. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
[openib-general] [PATCH][0/7]ipoib performance patches
Hello Roland, Let me start submitting some of the performance patches one by one for review. These patches have been validated; more tests are still going on.

1. split the CQ and CQ handler into send/recv, and change the default NUM_WC value to a bigger size
2. requeue packets on send queue overrun
3. remove tx_ring
4. replace ipoib_reap_ah with kref_get()/kref_put()
5. remove rx_ring
6. move poll_cq from interrupt context to thread context, with multiple-thread support on both send and recv
7. tunable poll interval parameters to sync with the hardware driver

Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
[openib-general] ipoib_reap_ah question
Hello Roland, Is there any particular reason to use the ipoib_reap_ah thread? In my tx_ring removal patch, I tested without the ipoib_reap_ah work queue by simply adding kref_get()/kref_put() in ipoib_send(), and I didn't see any difference, including in performance. If there is no other risk, I will remove it to keep things simple. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general] 2.6.17 and 2.6.18 merge plans
Roland, By all the data I have collected so far, I think it's not a good idea to have a while-loop poll_cq() under IB hardware interrupt context. poll_cq() is very expensive, and it increases other hardware's interrupt latency. If we move this out of hardware interrupt context, latency would be increased anyway. I have done lots of tests with the splitting-CQ + work queue on recv/send + remove-tx_ring patches over mthca. Both SMP and UP unidirectional throughput improves by 20% - 75% with or without tuning. Latency has increased by 4-10% on mthca. The interesting result is that UP performance is good. I used hyperthreaded CPUs running all these tests; I don't know whether that's the reason. If you think there is enough time to review these patches and a good chance of them being merged into 2.6.17/18, I will clean up and submit these patches ASAP, and test on ehca if a non-multi-threaded ehca is available. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general] ip over ib throughtput
Talpey, Thomas [EMAIL PROTECTED] wrote on 05/10/2006 03:53:04 AM: At 11:13 PM 5/9/2006, Shirley Ma wrote: Have you tried to send payload smaller than 2044? Any difference? You mean MTU or ULP payload? The default NFS reads and writes are 32KB, and in the addressing mode used in these tests they were broken into 8 page-sized RDMA ops. So, there were 9 ops from the server, per NFS read. I used the default MTU so these were probably 19 messages on the wire. I don't expect much difference with smaller MTU, but smaller NFS ops would be noticeable. Tom. I meant payload less than or equal to 2044, not IB MTU. IPoIB can only send <= 2044 bytes of payload per ib_post_send(). NFS/RDMA in this case sends 32KB per ib_post_send(). It would be nice to know the performance difference under the same payload for IPoIB over UD and NFS/RDMA. Is that possible? Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
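The 2044-byte figure above follows from the IPoIB framing (assuming the usual 2048-byte IB UD MTU): the 4-byte IPoIB encapsulation header is carved out of each message.

```c
#include <assert.h>

/* Why IPoIB posts at most 2044 bytes of payload per send: a 2048-byte
 * IB MTU minus the 4-byte IPoIB encapsulation header.  Constants are
 * written out here for illustration. */
#define IB_MTU          2048
#define IPOIB_ENCAP_LEN 4

static int ipoib_max_payload(void)
{
    return IB_MTU - IPOIB_ENCAP_LEN;
}
```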
Re: [openib-general] ip over ib throughtput
Talpey, Thomas [EMAIL PROTECTED] wrote on 05/10/2006 03:10:57 PM: Sure, but I wonder why it's interesting. Nobody ever uses NFS in such small blocksizes, and 2044 bytes would mean, say, 1800 bytes of payload. What data are you looking for, throughput and overhead? Direct RDMA, or inline? Tom. Throughput. I am wondering how much room IPoIB performance (throughput) has to grow. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638