Re: [openib-general] IPOIB NAPI
Roland Dreier [EMAIL PROTECTED] wrote on 02/26/2007 02:36:26 PM:

> No way, it's way too late at this point to change the kernel-user ABI, let alone change all ULPs.
>
> - R.

Hello Roland,

So the IBV_CQ_REPORT_MISSED_EVENTS has been part of OFED-1.2 already? I can generate the patch for all ULPs to use this for review. Do you need me to do that?

Thanks
Shirley Ma

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] IPOIB NAPI
> So the IBV_CQ_REPORT_MISSED_EVENTS has been part of OFED-1.2 already? I can generate the patch for all ULPs to use this for review. Do you need me to do that?

No, it's not in OFED 1.2 or the upstream kernel. And no one has implemented it for userspace (and I'm somewhat reluctant to break the ABI at this point without some performance numbers to motivate making this API change).

Have the NAPI performance problems with ehca been resolved? We could probably merge IPoIB NAPI for 2.6.22 then, which would pull in the kernel changes at least.

- R.
Re: [openib-general] IPOIB NAPI
Roland Dreier [EMAIL PROTECTED] wrote on 02/27/2007 02:41:44 PM:

> > So the IBV_CQ_REPORT_MISSED_EVENTS has been part of OFED-1.2 already? I can generate the patch for all ULPs to use this for review. Do you need me to do that?
>
> No, it's not in OFED 1.2 or the upstream kernel. And no one has implemented it for userspace (and I'm somewhat reluctant to break the ABI at this point without some performance numbers to motivate making this API change).
>
> Have the NAPI performance problems with ehca been resolved? We could probably merge IPoIB NAPI for 2.6.22 then, which would pull in the kernel changes at least.
>
> - R.

We have addressed the NAPI performance issues with the ehca driver. I believe the patches have gone upstream. However, the test results show that it's better to delay polling again until the next NAPI interval, something like this:

    poll-cq
    notify-cq, if missed_event netif_rx_reschedule()
    return 1

vs.

    poll-cq,
    notify-cq, if missed_event netif_rx_reschedule()
    poll again
    return 0

It seems ehca delivers packets much faster than other HCAs, so polling again would stay in the loop many, many times. Since the above change doesn't impact other HCAs, I would recommend it. I saw the same implementation in other ethernet drivers.

Thanks
Shirley Ma
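A minimal sketch of the two poll variants compared in this message, written as a userspace mock rather than driver code. Everything here (`mock_cq`, `mock_poll_cq`, `mock_notify_cq`, `poll_defer`, `poll_again`) is a hypothetical name made up for illustration; the real IPoIB/ehca code and `netif_rx_reschedule()` are not reproduced. The mock assumes a fast HCA that keeps delivering a few completions every time the CQ is re-armed.

```c
/* Hypothetical stand-in for a CQ; none of these names come from the thread. */
struct mock_cq {
    int pending;  /* completions ready to be polled right now */
    int refill;   /* completions that land while the CQ is being re-armed */
    int refills;  /* how many more times such a late batch shows up */
};

struct poll_result {
    int done;     /* completions processed in this call */
    int ret;      /* value handed back to the NAPI core */
    int passes;   /* poll passes made inside this call */
};

/* Drain up to `budget` completions. */
int mock_poll_cq(struct mock_cq *cq, int budget)
{
    int n = cq->pending < budget ? cq->pending : budget;

    cq->pending -= n;
    return n;
}

/* Re-arm the CQ; nonzero means completions slipped in (a missed event). */
int mock_notify_cq(struct mock_cq *cq)
{
    if (cq->refills > 0) {
        cq->refills--;
        cq->pending += cq->refill;
    }
    return cq->pending > 0;
}

/* Variant 1: on a missed event, leave the work for the next NAPI interval. */
struct poll_result poll_defer(struct mock_cq *cq, int budget)
{
    struct poll_result r = { 0, 0, 1 };

    r.done = mock_poll_cq(cq, budget);
    if (mock_notify_cq(cq))
        r.ret = 1;  /* stands in for netif_rx_reschedule() + return 1 */
    return r;
}

/* Variant 2: on a missed event, poll again before returning 0. */
struct poll_result poll_again(struct mock_cq *cq, int budget)
{
    struct poll_result r = { 0, 0, 0 };

    do {
        r.done += mock_poll_cq(cq, budget);
        r.passes++;
    } while (mock_notify_cq(cq));
    return r;
}
```

With a CQ that keeps refilling, the second variant stays inside one call for as many passes as there are late batches, which is the looping behaviour described above; the first variant does a single pass and hands control back. The sketch says nothing about which is faster on real hardware.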
Re: [openib-general] IPOIB NAPI
Quoting Shirley Ma [EMAIL PROTECTED]:
Subject: Re: [openib-general] IPOIB NAPI

> Roland Dreier [EMAIL PROTECTED] wrote on 02/27/2007 02:41:44 PM:
>
> > > So the IBV_CQ_REPORT_MISSED_EVENTS has been part of OFED-1.2 already? I can generate the patch for all ULPs to use this for review. Do you need me to do that?
> >
> > No, it's not in OFED 1.2 or the upstream kernel. And no one has implemented it for userspace (and I'm somewhat reluctant to break the ABI at this point without some performance numbers to motivate making this API change).
> >
> > Have the NAPI performance problems with ehca been resolved? We could probably merge IPoIB NAPI for 2.6.22 then, which would pull in the kernel changes at least.
> >
> > - R.
>
> We have addressed the NAPI performance issues with the ehca driver. I believe the patches have gone upstream. However, the test results show that it's better to delay polling again until the next NAPI interval, something like this:
>
>     poll-cq
>     notify-cq, if missed_event netif_rx_reschedule()
>     return 1
>
> vs.
>
>     poll-cq,
>     notify-cq, if missed_event netif_rx_reschedule()
>     poll again
>     return 0
>
> It seems ehca delivers packets much faster than other HCAs, so polling again would stay in the loop many, many times. Since the above change doesn't impact other HCAs, I would recommend it. I saw the same implementation in other ethernet drivers.

I'm confused. Which one is faster?

-- MST
[OFA General] Re: [openib-general] IPOIB NAPI
> I'm confused. Which one is faster?

Sorry for the confusion, Michael. The one with return 1 has better throughput.

Thanks
Shirley Ma

___
general mailing list
[EMAIL PROTECTED]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] ipoib the partial pkey
On Sun, 2007-02-25 at 05:48, Or Gerlitz wrote:
> Sean Hefty wrote:
> > I looked into this more... RFC 4391 states (middle of page 5):
> >
> >     For a node to join a partition, one of its ports must be assigned the relevant P_Key by the SM [RFC4392].
> >
> > Jumping to RFC 4392 (top of page 4):
>
> Just to have us agree on the quote, it is from section 4 of rfc 4392 (page 14) eg in http://www.ietf.org/rfc/rfc4392.txt
>
> >     at the time of creating an IB multicast group, multiple values such as the P_Key, Q_Key, Service Level, Hop Limit, Flow ID, TClass, MTU, etc. have to be specified. These values should be such that all potential members of the IB multicast group are able to communicate with one another when using them.
>
> OK, I suggest to remove this spec limitation,

IMO you would need to get the IB spec changed first in order to do this.

> as it does not allow the use case of a server using a partition for which inter-client communication is not allowed. Actually since it does not let people use partial membership partitioning with IPoIB as every ipoib device needs to join the broadcast group, it is probably a spec bug and not a limitation done on purpose.

I'm pretty sure this was done on purpose (a conscious choice) as it is based on what the IBA spec requires. The flip side of this approach are the partial connectivity issues which Sean mentioned and this will be reported as SM failures (e.g. more support issues).

> A simple real-life example is I/O target, the system admin wants IB block and/or file storage traffic to use a partition, but he does not want initiators to communicate among themselves on this partition. To achieve that the SM is configured to assign the partial pkey to the initiator nodes and the full pkey to the target ports. The current implementation of IPoIB and core perfectly (and transparently...) supports that.

and is currently non compliant in its behavior.

-- Hal

> Or.
Re: [openib-general] ipoib the partial pkey
On Mon, 2007-02-26 at 10:37, Or Gerlitz wrote:
> Hal Rosenstock wrote:
> > On Sun, 2007-02-25 at 05:48, Or Gerlitz wrote:
> > > Just to have us agree on the quote, it is from section 4 of rfc 4392 (page 14) eg in http://www.ietf.org/rfc/rfc4392.txt
> > >
> > > >     at the time of creating an IB multicast group, multiple values such as the P_Key, Q_Key, Service Level, Hop Limit, Flow ID, TClass, MTU, etc. have to be specified. These values should be such that all potential members of the IB multicast group are able to communicate with one another when using them.
> > >
> > > OK, I suggest to remove this spec limitation,
> >
> > IMO you would need to get the IB spec changed first in order to do this.
>
> do you refer to this? What about the description of P_Key in MCMemberRecord (table 210 on p. 908 which is compliance) which states:
>
>     All members of the multicast group shall have full membership in the partition indicated by the partition key.
>
> if yes, indeed, this also has to be changed.

Yes, for one. There may be others; I didn't look exhaustively at the spec for this.

-- Hal

> Or.
Re: [openib-general] IPOIB NAPI
Roland,

Yes. It would be good to reduce the number of interrupts by changing all upper layer protocols to use:

    poll CQ
    notify CQ, rotting packet notification
    poll again

instead of

    notify CQ
    poll CQ

If possible, can this be in OFED-1.2?

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638
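One plausible reading of why the ordering matters for interrupt count, as a small userspace mock. This is only a sketch under stated assumptions: the mock CQ raises one interrupt when a completion arrives while it is armed, and all names (`mock_cq`, `cq_arrive`, `cq_drain`, `cq_notify`, `handler_notify_first`, `handler_poll_first`) are hypothetical, not verbs API calls.

```c
/* Hypothetical mock CQ; an "interrupt" fires when work arrives while armed. */
struct mock_cq {
    int pending;     /* completions waiting to be polled */
    int armed;       /* 1 if a notification has been requested */
    int interrupts;  /* interrupts raised so far */
};

/* n completions arrive; an armed CQ raises one interrupt and disarms. */
void cq_arrive(struct mock_cq *cq, int n)
{
    cq->pending += n;
    if (n > 0 && cq->armed) {
        cq->armed = 0;
        cq->interrupts++;
    }
}

/* Poll the CQ until it is empty; returns completions processed. */
int cq_drain(struct mock_cq *cq)
{
    int n = cq->pending;

    cq->pending = 0;
    return n;
}

/* Arm the CQ; nonzero means completions are already waiting (missed event). */
int cq_notify(struct mock_cq *cq)
{
    cq->armed = 1;
    return cq->pending > 0;
}

/* "notify CQ, poll CQ": arrivals during the poll loop hit an armed CQ. */
int handler_notify_first(struct mock_cq *cq, int mid_poll_arrivals)
{
    int done;

    cq_notify(cq);
    done = cq_drain(cq);
    cq_arrive(cq, mid_poll_arrivals);  /* raises an interrupt */
    done += cq_drain(cq);              /* ...yet the loop handles them anyway */
    return done;
}

/* "poll CQ, notify CQ, poll again": arm only once the CQ looks empty. */
int handler_poll_first(struct mock_cq *cq, int mid_poll_arrivals)
{
    int done = cq_drain(cq);

    cq_arrive(cq, mid_poll_arrivals);  /* CQ not armed: no interrupt */
    done += cq_drain(cq);
    while (cq_notify(cq))              /* missed event reported: poll again */
        done += cq_drain(cq);
    return done;
}
```

In this mock both handlers process the same completions, but the notify-first ordering pays an extra interrupt for work the poll loop picks up anyway. Real HCAs and drivers may behave differently.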
Re: [openib-general] IPOIB NAPI
> Yes. It would be good to reduce the number of interrupts by changing all upper layer protocols to use:
>
>     poll CQ
>     notify CQ, rotting packet notification
>     poll again
>
> instead of
>
>     notify CQ
>     poll CQ
>
> If possible, can this be in OFED-1.2?

No way, it's way too late at this point to change the kernel-user ABI, let alone change all ULPs.

- R.
Re: [openib-general] ipoib the partial pkey
Sean Hefty wrote:
> I looked into this more... RFC 4391 states (middle of page 5):
>
>     For a node to join a partition, one of its ports must be assigned the relevant P_Key by the SM [RFC4392].
>
> Jumping to RFC 4392 (top of page 4):

Just to have us agree on the quote, it is from section 4 of rfc 4392 (page 14) eg in http://www.ietf.org/rfc/rfc4392.txt

>     at the time of creating an IB multicast group, multiple values such as the P_Key, Q_Key, Service Level, Hop Limit, Flow ID, TClass, MTU, etc. have to be specified. These values should be such that all potential members of the IB multicast group are able to communicate with one another when using them.

OK, I suggest to remove this spec limitation, as it does not allow the use case of a server using a partition for which inter-client communication is not allowed. Actually since it does not let people use partial membership partitioning with IPoIB as every ipoib device needs to join the broadcast group, it is probably a spec bug and not a limitation done on purpose.

A simple real-life example is I/O target, the system admin wants IB block and/or file storage traffic to use a partition, but he does not want initiators to communicate among themselves on this partition. To achieve that the SM is configured to assign the partial pkey to the initiator nodes and the full pkey to the target ports. The current implementation of IPoIB and core perfectly (and transparently...) supports that.

Or.
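For readers less familiar with partition membership, a short sketch of the matching rule the I/O target example relies on: the top bit of the 16-bit P_Key marks full membership, and two ports can talk only if the low 15 bits match and at least one side is a full member. The helper name `pkey_can_communicate` and the macro names are made up for this sketch; it is not code from IPoIB or the core.

```c
#include <stdbool.h>
#include <stdint.h>

#define PKEY_FULL_MEMBER 0x8000u  /* membership bit: 1 = full, 0 = partial */
#define PKEY_BASE_MASK   0x7fffu  /* the 15-bit partition number */

/* Hypothetical helper: may two ports holding these P_Keys exchange packets? */
bool pkey_can_communicate(uint16_t a, uint16_t b)
{
    if ((a & PKEY_BASE_MASK) != (b & PKEY_BASE_MASK))
        return false;  /* different partitions */
    /* same partition: at least one side must be a full member */
    return ((a | b) & PKEY_FULL_MEMBER) != 0;
}
```

With the target on 0x8001 and the initiators on 0x0001, each initiator reaches the target but two initiators cannot reach each other, which is the isolation the example asks for.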
Re: [openib-general] ipoib the partial pkey, was: librdmacm: fix bug causing failure to work with partial membership pkey
On Thu, 2007-02-22 at 18:35, Sean Hefty wrote:
> > Doesn't this allow ipoib to join a multicast group for which it may not be able to communicate with all members? For the broadcast group, this seems like an error to me. Can ipoib work in such a configuration? If all nodes were assigned a partial membership PKey, none of them could communicate, but no errors would be generated anywhere.
>
> I looked into this more... RFC 4391 states (middle of page 5):
>
>     For a node to join a partition, one of its ports must be assigned the relevant P_Key by the SM [RFC4392].
>
> Jumping to RFC 4392 (top of page 4):
>
>     at the time of creating an IB multicast group, multiple values such as the P_Key, Q_Key, Service Level, Hop Limit, Flow ID, TClass, MTU, etc. have to be specified. These values should be such that all potential members of the IB multicast group are able to communicate with one another when using them.

Seems to me that for P_Key this would mean full membership.

> and page 14:
>
>     Note that this IB_join to the broadcast group is a FullMember join.

FullMember here is referring to MCMemberRecord:JoinState rather than partition membership.

-- Hal

>     If any of the ports or the switches linking the port to the rest of the IPoIB subnet cannot support the parameters (e.g., path MTU or P_Key) associated with the broadcast group, then the IB_join request will fail and the requesting port will not become part of the IPoIB subnet
>
> My initial interpretation of these statements led me to believe that the pkey check in ib_find_cached_pkey should not mask out the upper bit, which would prevent ipoib from joining a multicast group until it has been configured with the full membership pkey for the broadcast group. Does this seem reasonable?
>
> - Sean
Re: [openib-general] IPOIB NAPI
> > An API idea: how about instead testing missed_events, we add a flag: IB_CQ_TEST (or a longer name IB_CQ_REPORT_MISSED_EVENTS?) and change ib_req_notify_cq to return int which will keep the missed_events value, only if this flag is set?
> >
> > This has 2 advantages
> > - Less churn updating all users to new API - they just ignore return value - and still almost no overhead for them as they don't set IB_CQ_TEST
> > - For all users we have to push less values on stack - note compiler can't get rid of them as we are calling function through a pointer
> > - For users that do
> >       missed_events = ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP | IB_CQ_TEST)
> >   we get the result in register.
>
> Yes, I like this. So ib_req_notify_cq() gets a return value that is negative if an error occurred, 0 if everything is fine, or positive if a missed event might have happened.
>
> I think I prefer the longer name IB_CQ_REPORT_MISSED_EVENTS -- at least there's a chance at guessing what it means even if you don't read the documentation.

By the way, how about extending the userspace API in a similar fashion?

    missed_events = ibv_req_notify_cq(priv->cq, IBV_CQ_NEXT_COMP | IBV_CQ_REPORT_MISSED_EVENTS)

-- MST
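A minimal sketch of the return convention discussed above (negative on error, 0 when armed cleanly, positive when a missed event might have happened and the caller asked for that report). This is a userspace mock, not the verbs API: `mock_req_notify_cq`, the `MOCK_CQ_*` flag values, and `drain_and_arm` are all assumptions made for illustration.

```c
#include <errno.h>

/* Hypothetical flag values; the real constants are not given in the thread. */
#define MOCK_CQ_NEXT_COMP             (1u << 0)
#define MOCK_CQ_REPORT_MISSED_EVENTS  (1u << 1)

struct mock_cq {
    int pending;  /* completions waiting to be polled */
    int armed;    /* 1 once a notification has been requested */
    int broken;   /* simulate a failing device */
};

/*
 * Returns < 0 on error, 0 if the CQ was armed with nothing to report,
 * > 0 if completions were already waiting and the caller asked to be told.
 */
int mock_req_notify_cq(struct mock_cq *cq, unsigned int flags)
{
    if (cq->broken)
        return -EIO;
    cq->armed = 1;
    if ((flags & MOCK_CQ_REPORT_MISSED_EVENTS) && cq->pending > 0)
        return 1;
    return 0;
}

/* A caller that opts in: drain, arm, and re-drain while events were missed. */
int drain_and_arm(struct mock_cq *cq)
{
    int done = 0;
    int rc;

    do {
        done += cq->pending;  /* stands in for a poll loop */
        cq->pending = 0;
        rc = mock_req_notify_cq(cq, MOCK_CQ_NEXT_COMP |
                                    MOCK_CQ_REPORT_MISSED_EVENTS);
    } while (rc > 0);
    return rc < 0 ? rc : done;
}
```

Callers that never set the report flag only ever see 0 or a negative error, so they can keep ignoring the return value, which matches the "less churn" argument in the quoted proposal.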
Re: [openib-general] IPOIB NAPI
> By the way, how about extending the userspace API in a similar fashion?
>
>     missed_events = ibv_req_notify_cq(priv->cq, IBV_CQ_NEXT_COMP | IBV_CQ_REPORT_MISSED_EVENTS)

It would require a kernel-user ABI bump. Is it worth it?

- R.
[openib-general] ipoib the partial pkey, was: librdmacm: fix bug causing failure to work with partial membership pkey
> Doesn't this allow ipoib to join a multicast group for which it may not be able to communicate with all members? For the broadcast group, this seems like an error to me. Can ipoib work in such a configuration? If all nodes were assigned a partial membership PKey, none of them could communicate, but no errors would be generated anywhere.

I looked into this more... RFC 4391 states (middle of page 5):

    For a node to join a partition, one of its ports must be assigned the relevant P_Key by the SM [RFC4392].

Jumping to RFC 4392 (top of page 4):

    at the time of creating an IB multicast group, multiple values such as the P_Key, Q_Key, Service Level, Hop Limit, Flow ID, TClass, MTU, etc. have to be specified. These values should be such that all potential members of the IB multicast group are able to communicate with one another when using them.

and page 14:

    Note that this IB_join to the broadcast group is a FullMember join. If any of the ports or the switches linking the port to the rest of the IPoIB subnet cannot support the parameters (e.g., path MTU or P_Key) associated with the broadcast group, then the IB_join request will fail and the requesting port will not become part of the IPoIB subnet

My initial interpretation of these statements led me to believe that the pkey check in ib_find_cached_pkey should not mask out the upper bit, which would prevent ipoib from joining a multicast group until it has been configured with the full membership pkey for the broadcast group. Does this seem reasonable?

- Sean
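To make the masking question concrete, here is a small sketch of a P_Key table lookup with and without the membership bit masked out. It is not the body of `ib_find_cached_pkey`; `find_pkey_index` and its `ignore_membership` parameter are hypothetical, introduced only to show the two behaviours side by side.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical lookup: return the index of `pkey` in `table`, or -1.
 * With ignore_membership set, the upper (membership) bit is masked out,
 * so a partial-member entry matches a full-member request.
 */
int find_pkey_index(const uint16_t *table, int len, uint16_t pkey,
                    bool ignore_membership)
{
    uint16_t mask = ignore_membership ? 0x7fff : 0xffff;
    int i;

    for (i = 0; i < len; i++)
        if ((table[i] & mask) == (pkey & mask))
            return i;
    return -1;
}
```

For a port whose table holds only the partial key 0x0001, a lookup of the broadcast group's 0x8001 succeeds when the bit is masked and fails when it is not, so the strict comparison would keep such a port from joining until it is given the full-membership key.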
Re: [openib-general] IPOIB NAPI
Quoting Roland Dreier [EMAIL PROTECTED]:
Subject: Re: IPOIB NAPI

> > By the way, how about extending the userspace API in a similar fashion?
> >
> >     missed_events = ibv_req_notify_cq(priv->cq, IBV_CQ_NEXT_COMP | IBV_CQ_REPORT_MISSED_EVENTS)
>
> It would require a kernel-user ABI bump. Is it worth it?

I hear some people asking for it: I imagine reasons are same as NAPI - race-free, clean API to switch from polling to event mode - rather than a minor optimization.

-- MST
Re: [openib-general] IPoIB CM for merge?
Quoting Steve Wise [EMAIL PROTECTED]:
Subject: Re: [openib-general] IPoIB CM for merge?

> On Fri, 2007-02-02 at 13:15 +0200, Michael S. Tsirkin wrote:
> > Quoting Roland Dreier [EMAIL PROTECTED]:
> > Subject: Re: IPoIB CM for merge?
> >
> > > > Could you please spend some time reviewing IPoIB CM code? I am concerned about missing the 2.6.21 merge window.
> > >
> > > Thanks for the reminder. Can we trade? Have you looked at the cxgb3 iwarp driver? Any comments?
> >
> > OK. I am not sure I have the last version posted so I am going to go by what is there in OFED git tree. And I also only looked under drivers/infiniband/.
> >
> > So, here are some questions: I looked in the archives and have not seen these addressed. Maybe these can be answered and then I'll go from there? Does this sound OK?
> >
> > Files with names like ./core/cxio_hal.c ./core/cxio_hal.h normally generate a fair bit of discussion which wasn't present here, I did not guess everyone was just busy. For example, why is there both struct iwch_cq and struct t3_cq?
>
> The cxgb3/core code defines a low level interface to the RDMA bits of the T3 device. This code was originally a separate module (named cxio) that allowed other RDMA middleware layers to sit on top of this core rdma module. At the time, there was RNIC-PI and OFA being developed. So that is the history of this. As per the first openib review (about a year ago) of this code I merged this core module into the cxgb3 module. I left the file structure and names as-is because it was low priority IMO.
>
> The t3_cq struct is the low level CQ structure used to manage both a HW accessed CQ and a SW CQ (needed to handle error cases and out of order completions). The iwch_cq struct contains the stuff needed to integrate with the OFA core and uverbs code. It contains a t3_cq inline.

So now that there's a common module, there's no technical reason for the two-level structure to exist? I would say you want to at least move the files into a common directory.

I think you will also find that for datapath operations such as poll cq, converting completion from hardware to struct t3_cqe, and from that to ib_wc adds a nontrivial amount of overhead.

> > File tcb.h comment says: /* This file is automatically generated --- do not edit */ This looks like a GPL violation, does it not?
>
> I can add the license if that's what you mean.

I mean that this file does not seem to be the source, in the GPL sense. The following comes from COPYING under linux source directory:

    The source code for a work means the preferred form of the work for making modifications to it. For an executable work, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable.

So I think you must make the actual source available under the terms of GPL.

> > What's the deal with the naming convention? Is there a reason in cxgb3, some files start with iwch and some with cxio? How about using cxgb3 prefix all over?
>
> The cxio_ prefix is used for the low-level functions/types that talk directly with the HW. iwch_ is the provider driver functions that interface with the OFA stack. I'd rather not change the names. Especially since this has already gone through several review cycles. I'm hoping we can get this in and improve it with subsequent submissions. Is that reasonable?

-- MST
Re: [openib-general] IPoIB connected mode review comments
Quoting Steve Wise [EMAIL PROTECTED]:
Subject: IPoIB connected mode review comments

> On Thu, 2007-02-01 at 20:48 -0800, Roland Dreier wrote:
> > > Have you had a chance to review this?
> >
> > Still on my list. Can we trade? Can you look at the IPoIB connected mode stuff in the ipoib-cm branch in git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git and let me know if you see anything you don't like?
> >
> > - R.
>
> Here are my comments. I'm not an ib cm expert though. These are mostly questions:

Steve, thanks for looking at the code! I hope the following answers your questions.

> Since IPoIB is using IP addresses already, wouldn't it be simpler to use the rdma cm to setup connections?

IPoIB is not using IP addresses. It uses hardware addresses as any network device would. So using rdma cm does not make sense.

> Could you optimize this design and only signal some of the tx wrs?

This optimization would apply to UD mode too. No one so far came up with a way to do this cleanly.

> In ipoib_cm_send() you call ipoib_cm_skb_too_long() if the packet is too large for the interface mtu. And you print a warning. But ipoib_cm_skb_too_long() actually queues the packet for the cm case. For ud it just drops the packet. The skb task for cm then will send a ICMP_DEST_UNREACH for these packets. Why the difference?

For UD I just kept the current behaviour - I think this can actually only happen in case of a race when packet was queued before MTU was changed, so the originator was already notified of the MTU change by the stack above us.

For CM the local MTU may exceed the size of a buffer that was posted on the remote QP. So we need to send ICMP_DEST_UNREACH to reduce the originator's dest MTU to whatever this QP actually can support. Since this needs the original skb, and must be done from task or bh context, we queue the skb and handle it in task context.

> Also if this packet came from the local stack via a local application, you don't want to send DEST_UNREACH, right? (I'm probably just confused about the purpose of this).

Yes, sending DEST_UNREACH does not seem to affect local interface. That's why I call update_pmtu too. It is also good to update the MTU ASAP to reduce the number of packets that are dropped - and update_pmtu can be called from atomic context.

I do not know how to tell the packet is from local stack and it does not seem to do any harm to handle all packets in a uniform manner. net/ipv4/ip_gre.c and net/ipv4/ipip.c are examples of code that do something similar.

> In ipoib_cm_tx_completion() you rearm, then drain the cq. I thought there was some reason that it was better to do drain/rearm/drain? Something about if you rearm and there's a cq entry mthca does another immediate interrupt?

Again, this comment applies to UD mode as well. AFAIK so far this worked best.

> In ipoib_cm_handle_tx_wc(): When can a tx completion happen with a wr_id that isn't within the ipoib_sendq_size range? This looks like it is really a bug condition that should never happen.

Because of this:

    post_send(priv, tx, tx->tx_head & (ipoib_sendq_size - 1))

so wr_id is always within range. Again, this is exactly the same logic as in UD case.

> I see the same code in the rx completion path too.

It's even simpler there:

    + for (i = 0; i < ipoib_recvq_size; ++i) {
    ...
    + if (ipoib_cm_post_receive(dev, i)) {
    ...
    + }
    + }

So i is always within RX size range.

> Also, what's up with the /* FIXME */ comment?

Since I have QPs which I never post send WRs on, I should be able to set .cap.max_send_wr to 0 and .cap.max_send_sge should not matter. However, low level drivers do not seem to support this at the moment, so I set these to 1 for now - this is also correct but has a small memory cost.

> You lock the priv->lock inside of the priv->tx_lock. Is this ordering correct and consistent across all the code?

Yes, that's the nesting rule.

> ipoib_cm_handle_rx_wc() - what's up with the XXX comment?

We have the same comment in UD code - that's where this comes from. Basically we don't have an easy way to know the correct packet type, and always setting it to PACKET_HOST seems to work.

> What's the algorithm to keep enough buffers posted in the SRQ?

Same as with UD really - if I can't allocate a new skb I repost the old one and increment the dropped packet counter.

-- MST
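The last answer and the wr_id answer above can be sketched together in a few lines of plain C. This is only an illustration of the policy as described in the mail (replace the buffer if a new one can be allocated, otherwise keep the old one posted and count a drop), not the IPoIB code: `rx_ring`, `rx_complete`, `tx_slot`, `always_fail`, and the ring sizes are all hypothetical, and plain pointers stand in for skbs.

```c
#include <stddef.h>
#include <stdlib.h>

#define RECVQ_SIZE 4  /* hypothetical ring sizes; powers of two */
#define SENDQ_SIZE 4

struct rx_ring {
    void *buf[RECVQ_SIZE];  /* buffer currently posted in each slot */
    unsigned long dropped;  /* packets dropped for lack of memory */
};

/* Allocator passed in so that allocation failure can be simulated. */
typedef void *(*alloc_fn)(size_t);

/* An allocator that always fails, for exercising the drop path. */
void *always_fail(size_t size)
{
    (void)size;
    return NULL;
}

/*
 * Handle a receive completion for slot wr_id.  Returns the filled buffer
 * to hand up the stack, or NULL if the packet had to be dropped.
 */
void *rx_complete(struct rx_ring *ring, unsigned int wr_id,
                  alloc_fn alloc, size_t size)
{
    void *filled;
    void *fresh;

    if (wr_id >= RECVQ_SIZE)
        return NULL;            /* out of range: should never happen */

    filled = ring->buf[wr_id];
    fresh = alloc(size);
    if (!fresh) {
        ring->dropped++;        /* old buffer stays posted; data is lost */
        return NULL;
    }
    ring->buf[wr_id] = fresh;   /* slot is reposted with the new buffer */
    return filled;
}

/* Send side: masking the head keeps wr_id inside the ring by construction. */
unsigned int tx_slot(unsigned int tx_head)
{
    return tx_head & (SENDQ_SIZE - 1);
}
```

The point of the policy is that a slot is never left empty: the queue depth stays constant, and memory pressure shows up as a drop counter rather than as a shrinking receive queue.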
Re: [openib-general] IPoIB CM for merge?
Quoting Pradeep Satyanarayana [EMAIL PROTECTED]:
Subject: Re: [openib-general] IPoIB CM for merge?

> Hello Michael,
>
> Here are a few more observations :

Pradeep, I think you are posting in the wrong thread: it seems you are not talking about my code, but rather about the project you mentioned of implementing IPoIB CM without SRQ.

IPoIB CM currently falls back on UD mode for HCAs that do not support SRQ, so there should be no problem for the ehca - as new code won't be activated. As I said already, I do not see a clean way to address this limitation, so I would rather have current IPoIB CM code merged upstream first, and think about enhancements later.

> 1. For the SRQ case, the skbs and receive buffers are posted during init and even before the rx_qp is created. This causes a problem (at least for non SRQs) for the ehca. We need to call the ipoib_cm_alloc_skb() and ipoib_cm_post_receive() after the rx_qp is in the RTR state.
>
> 2. Also found that in ipoib_cm_create_rx_qp() one needs to initialize .cap.max_recv_wr and .cap.max_recv_sge. Otherwise this leads to some problems like rq overflows and causing communication failures.

Yes, I think these are some of the things that would need to be done to make IPoIB CM work without SRQ. It is clearly not something we want to do for SRQ case however: for example, posting WRs to SRQ during connection setup would race against completion events for other connections. And assigning .cap.max_recv_wr 0 for a QP not connected to SRQ does not make sense, and might conceivably confuse low level drivers.

-- MST
Re: [openib-general] IPoIB CM for merge?
Quoting Roland Dreier [EMAIL PROTECTED]:
Subject: Re: IPoIB CM for merge?

> > Could you please spend some time reviewing IPoIB CM code? I am concerned about missing the 2.6.21 merge window.
>
> Thanks for the reminder. Can we trade? Have you looked at the cxgb3 iwarp driver? Any comments?

OK. I am not sure I have the last version posted so I am going to go by what is there in OFED git tree. And I also only looked under drivers/infiniband/.

So, here are some questions: I looked in the archives and have not seen these addressed. Maybe these can be answered and then I'll go from there? Does this sound OK?

Files with names like ./core/cxio_hal.c ./core/cxio_hal.h normally generate a fair bit of discussion which wasn't present here, I did not guess everyone was just busy. For example, why is there both struct iwch_cq and struct t3_cq?

File tcb.h comment says: /* This file is automatically generated --- do not edit */ This looks like a GPL violation, does it not?

What's the deal with the naming convention? Is there a reason in cxgb3, some files start with iwch and some with cxio? How about using cxgb3 prefix all over?

-- MST
Re: [openib-general] IPoIB CM for merge?
On Fri, 2007-02-02 at 13:15 +0200, Michael S. Tsirkin wrote:
> Quoting Roland Dreier [EMAIL PROTECTED]:
> Subject: Re: IPoIB CM for merge?
>
> > > Could you please spend some time reviewing IPoIB CM code? I am concerned about missing the 2.6.21 merge window.
> >
> > Thanks for the reminder. Can we trade? Have you looked at the cxgb3 iwarp driver? Any comments?
>
> OK. I am not sure I have the last version posted so I am going to go by what is there in OFED git tree. And I also only looked under drivers/infiniband/.
>
> So, here are some questions: I looked in the archives and have not seen these addressed. Maybe these can be answered and then I'll go from there? Does this sound OK?
>
> Files with names like ./core/cxio_hal.c ./core/cxio_hal.h normally generate a fair bit of discussion which wasn't present here, I did not guess everyone was just busy. For example, why is there both struct iwch_cq and struct t3_cq?

The cxgb3/core code defines a low level interface to the RDMA bits of the T3 device. This code was originally a separate module (named cxio) that allowed other RDMA middleware layers to sit on top of this core rdma module. At the time, there was RNIC-PI and OFA being developed. So that is the history of this. As per the first openib review (about a year ago) of this code I merged this core module into the cxgb3 module. I left the file structure and names as-is because it was low priority IMO.

The t3_cq struct is the low level CQ structure used to manage both a HW accessed CQ and a SW CQ (needed to handle error cases and out of order completions). The iwch_cq struct contains the stuff needed to integrate with the OFA core and uverbs code. It contains a t3_cq inline.

> File tcb.h comment says: /* This file is automatically generated --- do not edit */ This looks like a GPL violation, does it not?

I can add the license if that's what you mean.

> What's the deal with the naming convention? Is there a reason in cxgb3, some files start with iwch and some with cxio? How about using cxgb3 prefix all over?

The cxio_ prefix is used for the low-level functions/types that talk directly with the HW. iwch_ is the provider driver functions that interface with the OFA stack. I'd rather not change the names. Especially since this has already gone through several review cycles. I'm hoping we can get this in and improve it with subsequent submissions. Is that reasonable?

Steve.
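As a rough picture of the two-level layout described above (a low-level CQ embedded inline in the provider-level CQ), here is a tiny sketch. The struct and field names are invented for illustration and deliberately carry a `_sketch` suffix; they are not the actual definitions from the cxgb3 driver.

```c
#include <stddef.h>

/* Low-level CQ state: what the hardware-facing layer would track.
 * All fields are hypothetical. */
struct t3_cq_sketch {
    unsigned int cqid;
    unsigned int hw_rptr;  /* read pointer into the HW-accessed queue */
    unsigned int sw_rptr;  /* read pointer into the software queue */
};

/* Provider-level CQ: stack-facing state plus the low-level CQ inline. */
struct iwch_cq_sketch {
    int ofa_cqe_count;        /* stands in for the OFA core/uverbs state */
    struct t3_cq_sketch cq;   /* embedded, not a pointer */
};

/* Same idea as the kernel's container_of(), spelled out for userspace. */
#define container_of_sketch(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Recover the provider-level CQ from a pointer to the embedded one. */
struct iwch_cq_sketch *to_iwch_cq(struct t3_cq_sketch *low)
{
    return container_of_sketch(low, struct iwch_cq_sketch, cq);
}
```

Embedding keeps the two layers in one allocation and lets low-level code work on the inner struct while provider code steps back out to the outer one when it needs the stack-facing state.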
[openib-general] IPoIB connected mode review comments
On Thu, 2007-02-01 at 20:48 -0800, Roland Dreier wrote:
> > Have you had a chance to review this?
>
> Still on my list. Can we trade? Can you look at the IPoIB connected mode stuff in the ipoib-cm branch in git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git and let me know if you see anything you don't like?
>
> - R.

Here are my comments. I'm not an ib cm expert though. These are mostly questions:

Since IPoIB is using IP addresses already, wouldn't it be simpler to use the rdma cm to setup connections?

Could you optimize this design and only signal some of the tx wrs?

In ipoib_cm_send() you call ipoib_cm_skb_too_long() if the packet is too large for the interface mtu. And you print a warning. But ipoib_cm_skb_too_long() actually queues the packet for the cm case. For ud it just drops the packet. The skb task for cm then will send a ICMP_DEST_UNREACH for these packets. Why the difference? Also if this packet came from the local stack via a local application, you don't want to send DEST_UNREACH, right? (I'm probably just confused about the purpose of this).

In ipoib_cm_tx_completion() you rearm, then drain the cq. I thought there was some reason that it was better to do drain/rearm/drain? Something about if you rearm and there's a cq entry mthca does another immediate interrupt?

In ipoib_cm_handle_tx_wc(): When can a tx completion happen with a wr_id that isn't within the ipoib_sendq_size range? This looks like it is really a bug condition that should never happen. I see the same code in the rx completion path too. Also, what's up with the /* FIXME */ comment?

You lock the priv->lock inside of the priv->tx_lock. Is this ordering correct and consistent across all the code?

ipoib_cm_handle_rx_wc() - what's up with the XXX comment?

What's the algorithm to keep enough buffers posted in the SRQ?
Re: [openib-general] IPoIB CM for merge?
Hello Michael, Here are a few more observations:
1. For the SRQ case, the skbs and receive buffers are posted during init, even before the rx_qp is created. This causes a problem (at least for non-SRQ) for the ehca. We need to call ipoib_cm_alloc_skb() and ipoib_cm_post_receive() after the rx_qp is in the RTR state.
2. Also found that in ipoib_cm_create_rx_qp() one needs to initialize .cap.max_recv_wr and .cap.max_recv_sge. Otherwise this leads to problems such as rq overflows and communication failures.
Pradeep [EMAIL PROTECTED]
[openib-general] IPoIB CM for merge?
Roland, 2.6.20 is nearly done. Could you please spend some time reviewing IPoIB CM code? I am concerned about missing the 2.6.21 merge window. -- MST
Re: [openib-general] IPoIB CM for merge?
Could you please spend some time reviewing IPoIB CM code? I am concerned about missing the 2.6.21 merge window. Thanks for the reminder. Can we trade? Have you looked at the cxgb3 iwarp driver? Any comments? - R.
Re: [openib-general] IPoIB CM for merge?
Quoting Roland Dreier [EMAIL PROTECTED]: Subject: Re: IPoIB CM for merge? Could you please spend some time reviewing IPoIB CM code? I am concerned about missing the 2.6.21 merge window. Thanks for the reminder. Can we trade? Have you looked at the cxgb3 iwarp driver? Any comments? I haven't yet, sorry. OK. I am not sure I have the last version posted so I am going to go by what is there in OFED git tree. And I also only looked under drivers/infiniband/. So, here are some questions: I looked in the archives and have not seen these addressed. Maybe these can be answered and then I'll go from there? Does this sound OK? Files with names like ./core/cxio_hal.c ./core/cxio_hal.h normally generate a fair bit of discussion which wasn't present here, I did not guess everyone was just busy. For example, why is there both struct iwch_cq and struct t3_cq? File tcb.h comment says: /* This file is automatically generated --- do not edit */ This looks like a GPL violation, does it not? What's the deal with the naming convention? Is there a reason in cxgb3, some files start with iwch and some with cxio? How about using cxgb3 prefix all over? -- MST
Re: [openib-general] IPOIB CM with Non SRQ support
-One artifact of the current send side implementation is that for every message we create a new set of tx qps. I do not believe this describes the implementation correctly - ipoib_cm_tx is cached in ipoib_neigh structure so that once a connection is set up, it is reused for all messages to the same neighbour. -- MST
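The caching described above (one connection object per neighbour, created on first use and reused afterwards) can be illustrated with a small user-space sketch. The `toy_*` types and functions are hypothetical stand-ins invented for this example. They only mirror the idea of ipoib_cm_tx being cached in ipoib_neigh, not the actual driver structures.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a per-connection tx context. */
struct toy_cm_tx {
	int id;
};

/* Hypothetical stand-in for a neighbour entry that caches its connection. */
struct toy_neigh {
	struct toy_cm_tx *cm;	/* NULL until the first send sets up a connection */
};

/* Counts how many connections were actually set up. */
static int toy_connections_created;

/* Return the cached connection for this neighbour, creating it on first use. */
static struct toy_cm_tx *toy_neigh_get_tx(struct toy_neigh *neigh)
{
	if (!neigh->cm) {
		neigh->cm = malloc(sizeof(*neigh->cm));
		if (!neigh->cm)
			return NULL;
		neigh->cm->id = ++toy_connections_created;
	}
	return neigh->cm;
}

/* Drop the cached connection, for example when the neighbour goes away. */
static void toy_neigh_put_tx(struct toy_neigh *neigh)
{
	free(neigh->cm);
	neigh->cm = NULL;
}
```

With this shape, repeated sends to the same neighbour (a periodic heartbeat, say) reuse one connection rather than triggering a new setup each time.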
[openib-general] ipoib, ipv6 and multicast groups
recently our sm started throwing the following errors:

Jan 29 18:10:49 706710 [42003940] - __get_new_mlid: ERR 1B23: All available:32 mlids are taken
Jan 29 18:10:49 706721 [42003940] - osm_mcmr_rcv_create_new_mgrp: ERR 1B19: __get_new_mlid failed
Jan 29 18:10:51 345113 [42804940] - __get_new_mlid: ERR 1B23: All available:32 mlids are taken
Jan 29 18:10:51 345132 [42804940] - osm_mcmr_rcv_create_new_mgrp: ERR 1B19: __get_new_mlid failed
Jan 29 18:10:51 514312 [41802940] - __get_new_mlid: ERR 1B23: All available:32 mlids are taken
Jan 29 18:10:51 514320 [41802940] - osm_mcmr_rcv_create_new_mgrp: ERR 1B19: __get_new_mlid failed
Jan 29 18:10:51 735732 [42804940] - __get_new_mlid: ERR 1B23: All available:32 mlids are taken

we tracked this down to a problem with ipoib interaction with ipv6. ipv6 joins two multicast groups, instead of just one like ipv4.

# netstat -A inet6 -g -n
...
IPv6/IPv4 Group Memberships
Interface       RefCnt Group
--------------- ------ ---------------------
lo              1      ff02::1
ib0             1      ff02::1:ff00:77a2
ib0             1      ff02::1

# netstat -A inet6 -g -n
...
IPv6/IPv4 Group Memberships
Interface       RefCnt Group
--------------- ------ ---------------------
lo              1      224.0.0.1
ib0             1      224.0.0.1

# cat /sys/kernel/debug/ipoib/ib0_mcg
GID: ff12:401b::0:0:0:0:1
  created: 4298482097
  queuelen: 0
  complete: yes
  send_only: no

GID: ff12:401b::0:0:0::
  created: 4298482097
  queuelen: 0
  complete: yes
  send_only: no

GID: ff12:601b::0:0:0:0:1
  created: 4298482097
  queuelen: 0
  complete: yes
  send_only: no

GID: ff12:601b::0:0:1:ff00:77a2
  created: 4298482097
  queuelen: 0
  complete: yes
  send_only: no

the ff02::1:ff00:77a2 group is specific to the interface (link local), so each of our ib hosts running ipv6 registers its own unique multicast group. since our network is bigger than 32 hosts, it appears that we have exceeded the multicast tables in our local switches and this is making opensm generate the above error. besides not running ipv6, are there any thoughts about this?
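The per-host group above is the IPv6 solicited-node multicast address, which is formed from the fixed prefix ff02::1:ff00:0/104 plus the low 24 bits of the interface's unicast address. That is why hosts with different addresses generally end up in different groups. A minimal sketch of the derivation follows. The function name `solicited_node_mcast` is made up for this example, and the further mapping of the group into an IPoIB MGID is not shown.

```c
#include <stdint.h>
#include <string.h>

/*
 * Build the IPv6 solicited-node multicast address for a unicast address:
 * ff02:0:0:0:0:1:ffXX:XXXX, where XX:XXXX are the low 24 bits of the
 * unicast address. Both addresses are 16 bytes in network byte order.
 */
static void solicited_node_mcast(const uint8_t unicast[16], uint8_t mcast[16])
{
	/* First 13 bytes (104 bits) of ff02::1:ff00:0 */
	static const uint8_t prefix[13] = {
		0xff, 0x02, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0x01, 0xff
	};

	memcpy(mcast, prefix, sizeof(prefix));
	memcpy(mcast + 13, unicast + 13, 3);	/* low 24 bits of the unicast address */
}
```

A unicast address ending in 00:77a2 yields ff02::1:ff00:77a2, matching the group in the netstat output above.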
Re: [openib-general] ipoib, ipv6 and multicast groups
On Mon, 2007-01-29 at 13:17, chas williams - CONTRACTOR wrote: recently our sm started throwing the following errors: Jan 29 18:10:49 706710 [42003940] - __get_new_mlid: ERR 1B23: All available:32 mlids are taken Jan 29 18:10:49 706721 [42003940] - osm_mcmr_rcv_create_new_mgrp: ERR 1B19: __get_new_mlid failed Jan 29 18:10:51 345113 [42804940] - __get_new_mlid: ERR 1B23: All available:32 mlids are taken Jan 29 18:10:51 345132 [42804940] - osm_mcmr_rcv_create_new_mgrp: ERR 1B19: __get_new_mlid failed Jan 29 18:10:51 514312 [41802940] - __get_new_mlid: ERR 1B23: All available:32 mlids are taken Jan 29 18:10:51 514320 [41802940] - osm_mcmr_rcv_create_new_mgrp: ERR 1B19: __get_new_mlid failed Jan 29 18:10:51 735732 [42804940] - __get_new_mlid: ERR 1B23: All available:32 mlids are taken 32 is too low for MLID space support IMO. we tracked this down to a problem with ipoib interaction with ipv6. ipv6 joins two multicast groups, instead of just one like ipv4. # netstat -A inet6 -g -n ... IPv6/IPv4 Group Memberships Interface RefCnt Group --- -- - lo 1 ff02::1 ib0 1 ff02::1:ff00:77a2 ib0 1 ff02::1 # netstat -A inet6 -g -n ... IPv6/IPv4 Group Memberships Interface RefCnt Group --- -- - lo 1 224.0.0.1 ib0 1 224.0.0.1 # cat /sys/kernel/debug/ipoib/ib0_mcg GID: ff12:401b::0:0:0:0:1 created: 4298482097 queuelen: 0 complete: yes send_only: no GID: ff12:401b::0:0:0:: created: 4298482097 queuelen: 0 complete: yes send_only: no GID: ff12:601b::0:0:0:0:1 created: 4298482097 queuelen: 0 complete: yes send_only: no GID: ff12:601b::0:0:1:ff00:77a2 created: 4298482097 queuelen: 0 complete: yes send_only: no the ff02::1:ff00:77a2 group is specific to the interface (link local), so each of our ib hosts running ipv6 registers its own unique multicast group. since our network is bigger than 32 hosts, it appears that we have exceeded the multicast tables in our local switches and this is making opensm generate the above error. besides not running ipv6, are there any thoughts about this? 
This has been discussed on the list before. Last time was a thread on IPv6 and IPoIB scalability issue back in late November (11/30) to early December (12/2). There are some options presented. None have been pursued to the best of my knowledge. -- Hal
Re: [openib-general] IPOIB CM with Non SRQ support
Hello Michael, Yes, the code seems to get complex with lots of small changes spread all over the receive side. Plus, special-casing them with #ifdef makes it look a little messy. It is unlikely I can get this out by Feb 1st. As I was working through this I noticed a few things and here are my observations: -ipoib_cm_modify_rx_rts() does not actually transition the passive side qp to RTS state; it remains in the RTR state. However, the active side qp does transition to RTS. -One artifact of the current send side implementation is that for every message we create a new set of tx qps. So, if one were to use IB for the cluster heartbeat mechanism as an example, then for every heartbeat we end up creating an ipoib_cm_tx structure and initiating a set of CM exchanges. This might consume a lot of resources (even on an idle system). Changing this has a potential performance upside. Pradeep [EMAIL PROTECTED] Michael S. Tsirkin [EMAIL PROTECTED] wrote on 01/25/2007 11:41:28 PM: Quoting Pradeep Satyanarayana [EMAIL PROTECTED]: Subject: IPOIB CM with Non SRQ support Michael, I am working on a prototype based on your IPOIB CM patch to incorporate support for Non SRQ as well. IPOIB CM was planned to be in OFED 1.2 if I remember correctly. If I were to submit a patch for non SRQ support, what would be the cut off date to make it into OFED 1.2? I think it must be ready for merge by feature freeze on Feb 1st, but at this stage it really needs to be a small patch. I can't commit to merging it before I see it. I have to warn you that I thought about this problem, and unfortunately I do not see a way to implement it in a robust fashion without complicating the code significantly. In this case, you'll just might have to maintain it as a separate patch until the code lands upstream, and propose as a separate improvement later.
-- MST
Re: [openib-general] IPOIB CM with Non SRQ support
Quoting Pradeep Satyanarayana [EMAIL PROTECTED]: Subject: IPOIB CM with Non SRQ support Michael, I am working on a prototype based on your IPOIB CM patch to incorporate support for Non SRQ as well. IPOIB CM was planned to be in OFED 1.2 if I remember correctly. If I were to submit a patch for non SRQ support, what would be the cut off date to make it into OFED 1.2? I think it must be ready for merge by feature freeze on Feb 1st, but at this stage it really needs to be a small patch. I can't commit to merging it before I see it. I have to warn you that I thought about this problem, and unfortunately I do not see a way to implement it in a robust fashion without complicating the code significantly. In this case, you'll just might have to maintain it as a separate patch until the code lands upstream, and propose as a separate improvement later. -- MST
Re: [openib-general] ipoib ipv6 multicast joins, was: multicast code/merge status
Can you explain how this relates to your multicast changes? the IPoIB send-only-full-member-join hack was there before your patch and stayed there after your patch... and how come a change in the multicast code can cause the error stream to be finite... have you moved the retry mechanism from the ib_sa consumer to the ib_sa mcast engine? There was a bug in the ib_sa multicast engine handling failed joins, which had it retry forever. (Basically, the response was not being matched with the request. So the response was discarded, and the request was retried.) I had fixed this in svn, but lost the patch moving over to git. - Sean
Re: [openib-general] ipoib ipv6 multicast joins, was: multicast code/merge status
On 1/15/07, Sean Hefty [EMAIL PROTECTED] wrote: Can you explain how this relates to your multicast changes? the IPoIB send-only-full-member-join hack was there before your patch and stayed there after your patch... and how come a change in the multicast code can cause the error stream to be finite... have you moved the retry mechanism from the ib_sa consumer to the ib_sa mcast engine? There was a bug in the ib_sa multicast engine handling failed joins, which had it retry forever. (Basically, the response was not being matched with the request. So the response was discarded, and the request was retried.) I had fixed this in svn, but lost the patch moving over to git. sure, got you. Or.
Re: [openib-general] ipoib ipv6 multicast joins, was: multicast code/merge status
Sean Hefty wrote: So, this looks like a work-around for some broken SM, does it not? Yes - I mentioned it because the resulting error message (wrong component mask) is what was filling up the opensm log file. Jan 11 14:21:36 083844 [40583BB0] - osm_mcmr_rcv_join_mgrp: ERR 1B11: method = SubnAdmSet, scope_state = 0x1, component mask = 0x00010083, expected comp mask = 0x000130c7, MGID: 0x : 0x201400020404 from port 0x0002c9010ad258f1 I've applied a missing patch to my rdma-dev git tree that should avoid filling up the opensm log file. But the error in the opensm log file is a result of this work-around. Sean, Can you explain how this relates to your multicast changes? the IPoIB send-only-full-member-join hack was there before your patch and stayed there after your patch... and how come a change in the multicast code can cause the error stream to be finite... have you moved the retry mechanism from the ib_sa consumer to the ib_sa mcast engine? Or.
Re: [openib-general] ipoib ipv6 multicast joins, was: multicast code/merge status
On Thu, 2007-01-11 at 19:11, Sean Hefty wrote: Hal Rosenstock wrote: (*) there are some more issues here which need to be addressed, see for example the Some SMs don't support send-only yet weird comment at ipoib_mcast_sendonly_join() It's more likely an SA issue but I'm only guessing... It may also be historical... Based on observation, it looks like ipoib joins a couple of IPv6 multicast groups with send only membership. Yes. However it changes the join_state from 4 to 1 (send-only to full member). Yes, that is the workaround Roland had put in (likely for a non compliant SM which didn't support send only joins). This results in the SA trying to create the multicast group, only the required MCMemberRecord components have not been set. Right, the group either needs to be previously precreated or a receiver started first which would create the group. I'm not sure if this indicates a serious problem, but I'm guessing not. I don't believe it's a serious problem (at least now). In any case, it is no worse than it was before your change for this (it is not a problem of your making...). The join request simply fails and returns an error back to ipoib. (Which would have happened for a send-only join if the group hadn't already been created.) Right. -- Hal - Sean
[openib-general] ipoib ipv6 multicast joins, was: multicast code/merge status
Hal Rosenstock wrote: (*) there are some more issues here which need to be addressed, see for example the Some SMs don't support send-only yet weird comment at ipoib_mcast_sendonly_join() It's more likely an SA issue but I'm only guessing... It may also be historical... Based on observation, it looks like ipoib joins a couple of IPv6 multicast groups with send only membership. However it changes the join_state from 4 to 1 (send-only to full member). This results in the SA trying to create the multicast group, only the required MCMemberRecord components have not been set. I'm not sure if this indicates a serious problem, but I'm guessing not. The join request simply fails and returns an error back to ipoib. (Which would have happened for a send-only join if the group hadn't already been created.) - Sean
Re: [openib-general] ipoib ipv6 multicast joins, was: multicast code/merge status
Quoting Sean Hefty [EMAIL PROTECTED]: Subject: ipoib ipv6 multicast joins, was: multicast code/merge status Hal Rosenstock wrote: (*) there are some more issues here which need to be addressed, see for example the Some SMs don't support send-only yet weird comment at ipoib_mcast_sendonly_join() It's more likely an SA issue but I'm only guessing... It may also be historical... Based on observation, it looks like ipoib joins a couple of IPv6 multicast groups with send only membership. However it changes the join_state from 4 to 1 (send-only to full member). This results in the SA trying to create the multicast group, only the required MCMemberRecord components have not been set. I'm not sure if this indicates a serious problem, but I'm guessing not. The join request simply fails and returns an error back to ipoib. (Which would have happened for a send-only join if the group hadn't already been created.) So, this looks like a work-around for some broken SM, does it not? -- MST
Re: [openib-general] ipoib ipv6 multicast joins, was: multicast code/merge status
So, this looks like a work-around for some broken SM, does it not? Yes - I mentioned it because the resulting error message (wrong component mask) is what was filling up the opensm log file. Jan 11 14:21:36 083844 [40583BB0] - osm_mcmr_rcv_join_mgrp: ERR 1B11: method = SubnAdmSet, scope_state = 0x1, component mask = 0x00010083, expected comp mask = 0x000130c7, MGID: 0x : 0x201400020404 from port 0x0002c9010ad258f1 I've applied a missing patch to my rdma-dev git tree that should avoid filling up the opensm log file. But the error in the opensm log file is a result of this work-around. - Sean
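To see which MCMemberRecord components the SA considered missing, the two masks in the log line can be compared bit by bit. The sketch below assumes the usual component-mask bit order for MCMemberRecord (the same order as the IB_SA_MCMEMBER_REC_* definitions in the kernel's ib_sa.h). The helper names `missing_components` and `component_names` are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* MCMemberRecord component names indexed by component-mask bit (assumed order). */
static const char *const mcmember_comp_names[] = {
	"MGID", "PortGID", "Q_Key", "MLID", "MTUSelector", "MTU",
	"TClass", "P_Key", "RateSelector", "Rate", "PacketLifeTimeSelector",
	"PacketLifeTime", "SL", "FlowLabel", "HopLimit", "Scope", "JoinState",
};

/* Components the SA expected but the request did not set. */
static uint64_t missing_components(uint64_t got, uint64_t expected)
{
	return expected & ~got;
}

/* Store the names of the bits set in mask into out[]; returns how many were stored. */
static size_t component_names(uint64_t mask, const char *out[], size_t max)
{
	size_t count = sizeof(mcmember_comp_names) / sizeof(mcmember_comp_names[0]);
	size_t n = 0;
	size_t bit;

	for (bit = 0; bit < count; bit++) {
		if ((mask & ((uint64_t)1 << bit)) && n < max)
			out[n++] = mcmember_comp_names[bit];
	}
	return n;
}
```

For the logged values, `missing_components(0x00010083, 0x000130c7)` is 0x3044, which under the assumed bit order corresponds to Q_Key, TClass, SL and FlowLabel. These are fields a member creating a group would normally supply, which fits the description of a send-only join rewritten as a full-member join.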
Re: [openib-general] IPoIB new multicast API patches oops
Quoting Sean Hefty [EMAIL PROTECTED]: Subject: RE: IPoIB new multicast API patches oops I have not been able to reproduce this crash on my systems, and even instrumenting the code isn't helping me to locate the issue. Can you apply the following patch on top of the previous patches, and let me know if you get any additional output? OK, I hope to get back to testing this next-week-ish. -- MST
Re: [openib-general] IPoIB new multicast API patches oops
I have not been able to reproduce this crash on my systems, and even instrumenting the code isn't helping me to locate the issue. Can you apply the following patch on top of the previous patches, and let me know if you get any additional output?

- Sean

---
diff --git a/drivers/infiniband/core/multicast.c b/drivers/infiniband/core/multicast.c
index 88a9edf..b3bc4c6 100644
--- a/drivers/infiniband/core/multicast.c
+++ b/drivers/infiniband/core/multicast.c
@@ -81,6 +81,12 @@ enum mcast_state {
 	MCAST_ERROR
 };
 
+enum mcast_debug {
+	MCAST_DEBUG_IDLE,
+	MCAST_DEBUG_JOINING,
+	MCAST_DEBUG_LEAVING,
+};
+
 struct mcast_member;
 
 struct mcast_group {
@@ -97,6 +103,7 @@ struct mcast_group {
 	enum mcast_state	state;
 	struct ib_sa_query	*query;
 	int			query_id;
+	enum mcast_debug	debug_state;
 };
 
 struct mcast_member {
@@ -179,6 +186,7 @@ static void release_group(struct mcast_g
 	if (atomic_dec_and_test(&group->refcount)) {
 		rb_erase(&group->node, &port->table);
 		spin_unlock_irqrestore(&port->lock, flags);
+		BUG_ON(group->debug_state != MCAST_DEBUG_IDLE);
 		kfree(group);
 		deref_port(port);
 	} else
@@ -319,6 +327,8 @@ static int send_join(struct mcast_group
 	struct mcast_port *port = group->port;
 	int ret;
 
+	BUG_ON(group->debug_state != MCAST_DEBUG_IDLE);
+	group->debug_state = MCAST_DEBUG_JOINING;
 	ret = ib_sa_mcmember_rec_query(&sa_client, port->dev->device,
 				       port->port_num, IB_MGMT_METHOD_SET,
 				       &member->multicast.rec,
@@ -341,6 +351,8 @@ static int send_leave(struct mcast_group
 	rec = group->rec;
 	rec.join_state = leave_state;
 
+	BUG_ON(group->debug_state != MCAST_DEBUG_IDLE);
+	group->debug_state = MCAST_DEBUG_LEAVING;
 	ret = ib_sa_mcmember_rec_query(&sa_client, port->dev->device,
 				       port->port_num, IB_SA_METHOD_DELETE, &rec,
 				       IB_SA_MCMEMBER_REC_MGID |
@@ -493,6 +505,8 @@ static void join_handler(int status, str
 {
 	struct mcast_group *group = context;
 
+	BUG_ON(group->debug_state != MCAST_DEBUG_JOINING);
+	group->debug_state = MCAST_DEBUG_IDLE;
 	if (status)
 		process_join_error(group, status);
 	else {
@@ -510,6 +524,10 @@ static void join_handler(int status, str
 static void leave_handler(int status, struct ib_sa_mcmember_rec *rec,
 			  void *context)
 {
+	struct mcast_group *group = context;
+
+	BUG_ON(group->debug_state != MCAST_DEBUG_LEAVING);
+	group->debug_state = MCAST_DEBUG_IDLE;
 	mcast_work_handler(context);
 }
 
Re: [openib-general] ipoib mtu problem with UDP
Michael S. Tsirkin wrote: I tried using ifconfig to limit the ipoib mtu. Once I do this on *either* both server and client, or only on the client side, UDP seems to stop working: #ifconfig ib0 mtu 512 #netperf -c -C -H 11.4.3.68 -f M -t UDP_STREAM UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.4.3.68 (11.4.3.68) port 0 AF_INET : demo Socket Message Elapsed Messages CPU Service SizeSize Time Okay Errors Throughput Util Demand bytes bytessecs# # MBytes/sec % SS us/KB 118784 65507 10.00 27582 0 172.2 26.33inf 118784 10.00 0 0.0 23.40inf Things work fine if the mtu on the client side is 2044: # ifconfig ib0 mtu 2044 # netperf -c -C -H 11.4.3.68 -f M -t UDP_STREAM UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.4.3.68 (11.4.3.68) port 0 AF_INET : demo Socket Message Elapsed Messages CPU Service SizeSize Time Okay Errors Throughput Util Demand bytes bytessecs# # MBytes/sec % SS us/KB 118784 65507 10.00 78488 0 490.1 25.312.310 118784 10.00 68534 428.0 24.552.241 Tested with kernel 2.6.19-rc4 and netperf 2.4.2. I get the same results with iperf. 
However they succeed with smaller datagrams (netperf uses 65507 by default) dodly5:/home/shared/testing-tools/x86_64/netperf/netperf-2.4.1 # ifconfig ib0 ib0 Link encap:UNSPEC HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00 inet addr:192.168.11.235 Bcast:192.168.11.255 Mask:255.255.255.0 UP BROADCAST RUNNING MULTICAST MTU:512 Metric:1 RX packets:42 errors:0 dropped:0 overruns:0 frame:0 TX packets:14077513 errors:0 dropped:5 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:5776 (5.6 Kb) TX bytes:6717604780 (6406.4 Mb) dodly5:/home/shared/testing-tools/x86_64/netperf/netperf-2.4.1 # ./netperf -H 192.168.11.233 -t UDP_STREAM -- -m 3 UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.11.233 (192.168.11.233) port 0 AF_INET Socket Message Elapsed Messages SizeSize Time Okay Errors Throughput bytes bytessecs# # 10^6bits/sec 262144 3 10.00 52533 01260.59 262144 10.00 22956550.86 dodly5:/home/shared/testing-tools/x86_64/iperf-2.0.2 # ./iperf -uc 192.168.11.233 -l 65000 Client connecting to 192.168.11.233, UDP port 5001 Sending 65000 byte datagrams UDP buffer size: 256 KByte (default) [ 3] local 192.168.11.235 port 32769 connected with 192.168.11.233 port 5001 [ 3] 0.0-10.9 sec 1.36 MBytes 1.05 Mbits/sec [ 3] Sent 22 datagrams [ 3] WARNING: did not receive ack of last datagram after 10 tries. dodly5:/home/shared/testing-tools/x86_64/iperf-2.0.2 # ./iperf -uc 192.168.11.233 Client connecting to 192.168.11.233, UDP port 5001 Sending 1470 byte datagrams UDP buffer size: 256 KByte (default) [ 3] local 192.168.11.235 port 32769 connected with 192.168.11.233 port 5001 [ 3] 0.0-10.0 sec 1.25 MBytes 1.05 Mbits/sec [ 3] Sent 893 datagrams [ 3] Server Report: [ 3] 0.0-10.0 sec 1.25 MBytes 1.05 Mbits/sec 0.002 ms0/ 893 (0%) ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[openib-general] ipoib mtu problem with UDP
I tried using ifconfig to limit the ipoib mtu. Once I do this on *either* both server and client, or only on the client side, UDP seems to stop working:

# ifconfig ib0 mtu 512
# netperf -c -C -H 11.4.3.68 -f M -t UDP_STREAM
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.4.3.68 (11.4.3.68) port 0 AF_INET : demo
Socket  Message  Elapsed      Messages                   CPU      Service
Size    Size     Time         Okay Errors   Throughput   Util     Demand
bytes   bytes    secs            #      #   MBytes/sec   % SS     us/KB

118784   65507   10.00       27582      0      172.2     26.33    inf
118784           10.00           0               0.0     23.40    inf

Things work fine if the mtu on the client side is 2044:

# ifconfig ib0 mtu 2044
# netperf -c -C -H 11.4.3.68 -f M -t UDP_STREAM
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.4.3.68 (11.4.3.68) port 0 AF_INET : demo
Socket  Message  Elapsed      Messages                   CPU      Service
Size    Size     Time         Okay Errors   Throughput   Util     Demand
bytes   bytes    secs            #      #   MBytes/sec   % SS     us/KB

118784   65507   10.00       78488      0      490.1     25.31    2.310
118784           10.00       68534             428.0     24.55    2.241

Tested with kernel 2.6.19-rc4 and netperf 2.4.2. -- MST
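The thread does not establish why the 512-byte MTU case fails, but the two runs do differ sharply in how many IP fragments each 65507-byte datagram needs. A rough sketch of that arithmetic follows, assuming IPv4 with a 20-byte header (no options) and an 8-byte UDP header. The function name `udp_fragment_count` is made up for this example.

```c
/*
 * Number of IPv4 packets needed to carry one UDP datagram with the given
 * payload size over a link with the given IP MTU. Assumes a 20-byte IPv4
 * header and an 8-byte UDP header. Returns 0 if the MTU is too small to
 * carry any fragment data.
 */
static unsigned int udp_fragment_count(unsigned int udp_payload, unsigned int mtu)
{
	const unsigned int ip_hdr = 20, udp_hdr = 8;
	unsigned int total = udp_payload + udp_hdr;	/* bytes carried above IP */
	unsigned int per_frag;

	if (mtu < ip_hdr + 8)
		return 0;
	if (total <= mtu - ip_hdr)
		return 1;	/* fits in a single unfragmented packet */

	/* Every fragment except the last carries a multiple of 8 bytes. */
	per_frag = (mtu - ip_hdr) & ~7u;
	return (total + per_frag - 1) / per_frag;
}
```

Under these assumptions a 65507-byte datagram needs 135 fragments at MTU 512 and 33 at MTU 2044, and the loss of any one fragment loses the whole datagram.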
Re: [openib-general] IPoIB odd loopback packet from arp
Todd, This does not look like an error. The first arp is a broadcast (qpn=ff) so it is received in at the sending interface and is dropped. The second on is a unicast arp (qpn=0x000404) so it is not received at the local interface. On Mon, 2006-10-23 at 13:48 -0600, Todd Bowman wrote: Using the OFED 1.0 and OFED 1.1 stack I have notice some rcvswrelay errors. I have tracked it down to the arp request. I can reproduce the problem with the following steps: ( I have used both 2.6.14.14 and 2.6.18.1 kernels) ib109 arp -d ib110 ib109 ping ib110 -c 2 # ib_ipoib module debug 13:15:46 ib109 kernel: ib0: sending packet, length=60 address=f6187200 qpn=0xff 13:15:46 ib109 kernel: ib0: called: id 34, op 0, status: 0 13:15:46 ib109 kernel: ib0: send complete, wrid 34 13:15:46 ib109 kernel: ib0: called: id -2147483623, op 128, status: 0 13:15:46 ib109 kernel: ib0: received 100 bytes, SLID 0x0369 13:15:46 ib109 kernel: ib0: dropping loopback packet 13:15:46 ib109 kernel: ib0: called: id -2147483622, op 128, status: 0 13:15:46 ib109 kernel: ib0: received 100 bytes, SLID 0x016d 13:15:46 ib109 kernel: ib0: sending packet, length=88 address=f6e57520 qpn=0x000404 13:15:46 ib109 kernel: ib0: called: id 35, op 0, status: 0 13:15:46 ib109 kernel: ib0: send complete, wrid 35 13:15:46 ib109 kernel: ib0: called: id -2147483621, op 128, status: 0 13:15:46 ib109 kernel: ib0: received 128 bytes, SLID 0x016d 13:15:47 ib109 kernel: ib0: sending packet, length=88 address=f6e57520 qpn=0x000404 13:15:47 ib109 kernel: ib0: called: id 36, op 0, status: 0 13:15:47 ib109 kernel: ib0: send complete, wrid 36 13:15:47 ib109 kernel: ib0: called: id -2147483620, op 128, status: 0 13:15:47 ib109 kernel: ib0: received 128 bytes, SLID 0x016d 13:15:51 ib109 kernel: ib0: called: id -2147483619, op 128, status: 0 13:15:51 ib109 kernel: ib0: received 100 bytes, SLID 0x016d 13:15:51 ib109 kernel: ib0: sending packet, length=60 address=f6e57520 qpn=0x000404 13:15:51 ib109 kernel: ib0: called: id 37, op 0, status: 0 
13:15:51 ib109 kernel: ib0: send complete, wrid 37 # tcpdump -i ib0 13:15:46.977578 arp who-has ib110 tell ib109 hardware #32 13:15:46.977682 arp reply ib110 is-at 00:00:04:04:fe:80:00:00:00:00:00:00:00:08:f1:04:03:96:11:59 hardware #32 13:15:46.977710 IP ib109 ib110: icmp 64: echo request seq 0 13:15:46.977790 IP ib110 ib109: icmp 64: echo reply seq 0 13:15:47.92 IP ib109 ib110: icmp 64: echo request seq 1 13:15:47.977892 IP ib110 ib109: icmp 64: echo reply seq 1 13:15:51.977076 arp who-has ib109 tell ib110 hardware #32 13:15:51.977094 arp reply ib109 is-at 00:02:00:14:fe:80:00:00:00:00:00:00:00:02:c9:02:00:00:3b:31 hardware #32 # error dump rcvswrelayerrors:1 MT47396 Infiniscale-III 0x2c9010b022090[1] ib109 HCA-1 0x2c9023b30[1] 1) The ping is successful and the arp table is populated so Is this really a problem or a false positive? 2) The second arp does not generate an error (the error dump reports all new errors in switches). Why? Any ideas? Thanks in advance. Todd
Re: [openib-general] IPoIB Question
Greg Lindahl wrote: On Mon, Oct 23, 2006 at 07:53:06AM -0500, Hubbell, Sean C Contractor/Decibel wrote: I currently have several applications that use a legacy IPv4 protocol and I use IPoIB to utilize my infiniband network which works great. I have completed some timing and throughput analysis and noticed that I do not get very much more if I use an infiniband network interface than using my GigE network interface. You might want to note that different InfiniBand implementations have quite different performance of IPoIB, especially for UDP. Another issue is that IPoIB has quite different performance with different Linux kernels. This is especially evident for TCP, although you can use SDP to accelerate TCP sockets and avoid this issue. We are currently looking at the new tickless kernel. Do you have one that you recommend? Sean
Re: [openib-general] IPoIB odd loopback packet from arp
Thanks Eli. So the switch is incrementing the rcvswrelay counter when it sends the broadcast back through the original port. This doesn't seem to be correct behavior, it makes that counter unreliable. On 10/24/06, Eli Cohen [EMAIL PROTECTED] wrote: Todd, This does not look like an error. The first arp is a broadcast (qpn=ff) so it is received in at the sending interface and is dropped. The second on is a unicast arp (qpn=0x000404) so it is not received at the local interface. On Mon, 2006-10-23 at 13:48 -0600, Todd Bowman wrote: Using the OFED 1.0 and OFED 1.1 stack I have notice some rcvswrelay errors. I have tracked it down to the arp request. I can reproduce the problem with the following steps: ( I have used both 2.6.14.14 and 2.6.18.1 kernels) ib109 arp -d ib110 ib109 ping ib110 -c 2 # ib_ipoib module debug 13:15:46 ib109 kernel: ib0: sending packet, length=60 address=f6187200 qpn=0xff 13:15:46 ib109 kernel: ib0: called: id 34, op 0, status: 0 13:15:46 ib109 kernel: ib0: send complete, wrid 34 13:15:46 ib109 kernel: ib0: called: id -2147483623, op 128, status: 0 13:15:46 ib109 kernel: ib0: received 100 bytes, SLID 0x0369 13:15:46 ib109 kernel: ib0: dropping loopback packet 13:15:46 ib109 kernel: ib0: called: id -2147483622, op 128, status: 0 13:15:46 ib109 kernel: ib0: received 100 bytes, SLID 0x016d 13:15:46 ib109 kernel: ib0: sending packet, length=88 address=f6e57520 qpn=0x000404 13:15:46 ib109 kernel: ib0: called: id 35, op 0, status: 0 13:15:46 ib109 kernel: ib0: send complete, wrid 35 13:15:46 ib109 kernel: ib0: called: id -2147483621, op 128, status: 0 13:15:46 ib109 kernel: ib0: received 128 bytes, SLID 0x016d 13:15:47 ib109 kernel: ib0: sending packet, length=88 address=f6e57520 qpn=0x000404 13:15:47 ib109 kernel: ib0: called: id 36, op 0, status: 0 13:15:47 ib109 kernel: ib0: send complete, wrid 36 13:15:47 ib109 kernel: ib0: called: id -2147483620, op 128, status: 0 13:15:47 ib109 kernel: ib0: received 128 bytes, SLID 0x016d 13:15:51 ib109 kernel: ib0:
called: id -2147483619, op 128, status: 0 13:15:51 ib109 kernel: ib0: received 100 bytes, SLID 0x016d 13:15:51 ib109 kernel: ib0: sending packet, length=60 address=f6e57520 qpn=0x000404 13:15:51 ib109 kernel: ib0: called: id 37, op 0, status: 0 13:15:51 ib109 kernel: ib0: send complete, wrid 37 # tcpdump -i ib0 13:15:46.977578 arp who-has ib110 tell ib109 hardware #32 13:15:46.977682 arp reply ib110 is-at 00:00:04:04:fe:80:00:00:00:00:00:00:00:08:f1:04:03:96:11:59 hardware #32 13:15:46.977710 IP ib109 ib110: icmp 64: echo request seq 0 13:15: 46.977790 IP ib110 ib109: icmp 64: echo reply seq 0 13:15:47.92 IP ib109 ib110: icmp 64: echo request seq 1 13:15:47.977892 IP ib110 ib109: icmp 64: echo reply seq 1 13:15:51.977076 arp who-has ib109 tell ib110 hardware #32 13:15:51.977094 arp reply ib109 is-at 00:02:00:14:fe:80:00:00:00:00:00:00:00:02:c9:02:00:00:3b:31 hardware #32 # error dump rcvswrelayerrors:1 MT47396 Infiniscale-III 0x2c9010b022090[1] ib109 HCA-1 0x2c9023b30[1] 1) The ping is successful and the arp table is populated so Is this really a problem or a false positive? 2) The second arp does not generate an error (the error dump reports all new errors in switches). Why? Any ideas? Thanks in advance. Todd ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
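Eli's broadcast-versus-unicast distinction can be read straight off the 20-byte hardware addresses that tcpdump prints in this thread. The sketch below is only an illustration: it assumes the RFC 4391 link-layer address layout (one flag byte, a 24-bit QPN, then a 16-byte GID) and the InfiniBand convention that multicast traffic is addressed to QPN 0xFFFFFF. The names `ipoib_hwaddr`, `parse_ipoib_hwaddr` and `is_multicast_qpn` are hypothetical helpers, not part of the ipoib driver.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define IPOIB_HWADDR_LEN 20          /* 1 flag byte + 3 QPN bytes + 16 GID bytes */
#define IB_MULTICAST_QPN 0xffffffu   /* destination QPN for multicast/broadcast */

/* Hypothetical decoded form of an IPoIB link-layer address. */
struct ipoib_hwaddr {
    uint8_t  flags;     /* byte 0 */
    uint32_t qpn;       /* bytes 1..3, big-endian, 24 bits */
    uint8_t  gid[16];   /* bytes 4..19 */
};

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_val(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Parse "xx:xx:...:xx" (20 colon-separated hex bytes). Returns 0 on success. */
static int parse_ipoib_hwaddr(const char *text, struct ipoib_hwaddr *out)
{
    uint8_t raw[IPOIB_HWADDR_LEN];
    size_t i;

    assert(text != NULL && out != NULL);
    if (strlen(text) != IPOIB_HWADDR_LEN * 3 - 1)
        return -1;

    for (i = 0; i < IPOIB_HWADDR_LEN; ++i) {
        int hi = hex_val(text[3 * i]);
        int lo = hex_val(text[3 * i + 1]);

        if (hi < 0 || lo < 0)
            return -1;
        if (i + 1 < IPOIB_HWADDR_LEN && text[3 * i + 2] != ':')
            return -1;
        raw[i] = (uint8_t)(hi * 16 + lo);
    }

    out->flags = raw[0];
    out->qpn = ((uint32_t)raw[1] << 16) | ((uint32_t)raw[2] << 8) | raw[3];
    memcpy(out->gid, raw + 4, sizeof(out->gid));
    return 0;
}

/* Non-zero if a packet sent to this QPN is multicast, so the sender sees it too. */
static int is_multicast_qpn(uint32_t qpn)
{
    return qpn == IB_MULTICAST_QPN;
}
```

Feeding it the `arp reply ib110 is-at 00:00:04:04:fe:80:...` address from Todd's tcpdump output gives QPN 0x000404, the same unicast QPN the ipoib debug log reports for the subsequent sends.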
Re: [openib-general] IPoIB Question
At 10:00 PM 10/23/2006, Greg Lindahl wrote:
> On Mon, Oct 23, 2006 at 07:53:06AM -0500, Hubbell, Sean C Contractor/Decibel wrote:
>
>> I currently have several applications that uses a legacy IPv4 protocol and I use IPoIB to utilize my infiniband network which works great. I have completed some timing and throughput analysis and noticed that I do not get very much more if I use an infiniband network interface than using my GigE network interface.
>
> You might want to note that different InfiniBand implementations have quite different performance of IPoIB, especially for UDP. Another issue is that IPoIB has quite different performance with different Linux kernels. This is especially evident for TCP, although you can use SDP to accelerate TCP sockets and avoid this issue.
>
>> My question is, am I using IPoIB correctly or are these the typical numbers that everyone is seeing?
>
> It is certainly the case that there are some message patterns and situations for which InfiniBand is not much of an improvement over gigE.

Unfortunately, comparisons of IB to GbE are often apple-to-orange comparisons, even for IP over IB. Until an HCA supplies the same level of functional off-load enabled by the IP network stack that is used with Ethernet, it really isn't a fair comparison. The same is also true for many of the marketroids and their comparisons of IB to Ethernet-based solutions. Fortunately, most customers are getting a bit smarter and not falling for the marketing drivel these days - certainly the OEMs don't fall for it, though the marketroids continue to come in and try to convince people it isn't an apple-to-orange comparison.

The fact is both technologies have their pros / cons, and it is really the workload or production environment that determines which is the best fit instead of the force fit.

In any case, not really a development issue so will drop further discussion.

Mike
Re: [openib-general] IPoIB Question
On Tue, Oct 24, 2006 at 08:35:18AM -0500, Sean Hubbell wrote:

> We are currently looking at the new tickless kernel. Do you have one that you recommend?

The main one to less-recommend is 2.6.9-based kernels; those are the slowest at TCP. Modern kernels, like the ones you see in Fedora 4 and up and SLES 10, all seem to be good and about equal in this area.

I don't think we've tried a tickless kernel. We do most of our testing on the various kernels that ship with distros, plus the tip-of-tree kernel.org kernel.

-- greg
Re: [openib-general] IPoIB Question
We see 3.6 Gb/sec with IPoIB using RHEL4U4 2.6.9-42 x86_64 kernel on Dell PE1950 Woodcrest systems. In my testing, faster hardware is more important than newer kernels, but I don't try newer kernels much.

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
Re: [openib-general] IPoIB Question
Is this with a combination of TCP and UDP or just TCP?

Sean

Scott Weitzenkamp (sweitzen) wrote:
> We see 3.6 Gb/sec with IPoIB using RHEL4U4 2.6.9-42 x86_64 kernel on Dell PE1950 Woodcrest systems. In my testing, faster hardware is more important than newer kernels, but I don't try newer kernels much.
[openib-general] IPoIB Question
Hello,

I currently have several applications that use a legacy IPv4 protocol, and I use IPoIB to utilize my InfiniBand network, which works great. I have completed some timing and throughput analysis and noticed that I do not get very much more throughput from an InfiniBand network interface than from my GigE network interface.

My question is, am I using IPoIB correctly, or are these the typical numbers that everyone is seeing? Is there a standard application that I may use to test my current configuration?

Thanks in advance,
Sean
Re: [openib-general] IPoIB Question
IPoIB performance will vary quite a bit depending on what motherboard, CPU speed, and HCA type you have. What are the specs on the systems you are using?

Netperf (www.netperf.org) is a good tool to measure IPoIB performance.

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
Re: [openib-general] IPoIB Question
Quoting r. Scott Weitzenkamp (sweitzen) [EMAIL PROTECTED]:
> Netperf (www.netperf.org) is a good tool to measure IPoIB performance.

Of special note is the -T flag, which often lets you get more consistent results by pinning the test to a single CPU.

Another useful tool is iperf, which has a -P option for running multiple socket tests in parallel. In TCP, multi-socket performance often exceeds that of a single socket.

--
MST
Re: [openib-general] IPoIB Question
We currently have a non-homogeneous cluster, which would possibly explain a few of the differences that I have seen in some of my tests. I will look at netperf.org and see what they have to offer.

On another note, are there plans to have IPoIB support the full throughput that InfiniBand 4x or 12x has? Specifically, can I keep my legacy apps and just upgrade the network to take advantage of the bandwidth?

Sean

Scott Weitzenkamp (sweitzen) wrote:
> IPoIB performance will vary quite a bit depending on what motherboard, CPU speed, and HCA type you have. What are the specs on the systems you are using?
>
> Netperf (www.netperf.org) is a good tool to measure IPoIB performance.
Re: [openib-general] IPoIB Question
If you are using TCP, you can use SDP transparently via libsdp to get improved latency and throughput.

Scott
Re: [openib-general] IPoIB Question
Scott,

Thanks for the reply again. The third-party API that we use leverages a combination of UDP and TCP socket connections for speed. Is there something for UDP as well?

Sean

Scott Weitzenkamp (sweitzen) wrote:
> If you are using TCP, you can use SDP transparently via libsdp to get improved latency and throughput.
Re: [openib-general] IPoIB Question
Quoting r. Sean Hubbell [EMAIL PROTECTED]:
> Subject: Re: IPoIB Question
>
> Scott,
> Thanks for the reply again. The third-party API that we use leverages a combination of UDP and TCP socket connections for speed. Is there something for UDP as well?

iperf supports UDP as well. Again, check out the -P flag.

--
MST
Re: [openib-general] IPoIB Question
Nothing today in OF to accelerate UDP sockets.

Scott

> Thanks for the reply again. The third-party API that we use leverages a combination of UDP and TCP socket connections for speed. Is there something for UDP as well?
>
> Sean
>
> Scott Weitzenkamp (sweitzen) wrote:
>> If you are using TCP, you can use SDP transparently via libsdp to get improved latency and throughput.
Re: [openib-general] IPoIB Question
Thanks Michael

I looked at iperf and that looks like a very nice tool. I will be using that when I evaluate and check the performance of my applications.

I am also interested in getting more bandwidth out of my applications by leveraging a current or planned capability for IPoIB. This way, I will not have to modify my source code and can just change the interfaces that my applications send and receive on.

So, I am looking at libsdp for the TCP functionality and wanted to know if libsdp supports UDP as well, or is there another library that I can use to maximize the bandwidth when transmitting and receiving over InfiniBand?

Sean

Michael S. Tsirkin wrote:
> Quoting r. Sean Hubbell [EMAIL PROTECTED]:
>> Thanks for the reply again. The third-party API that we use leverages a combination of UDP and TCP socket connections for speed. Is there something for UDP as well?
>
> iperf supports UDP as well. Again, check out the -P flag.
Re: [openib-general] IPoIB Question
Quoting r. Sean Hubbell [EMAIL PROTECTED]:
> I am looking at libsdp for the TCP functionality and wanted to know if libsdp supports UDP as well

AFAIK, SDP can only emulate TCP sockets.

--
MST
Re: [openib-general] IPoIB Question
At 10:19 AM 10/23/2006, Michael S. Tsirkin wrote:
> Quoting r. Sean Hubbell [EMAIL PROTECTED]:
>> I am looking at libsdp for the TCP functionality and wanted to know if libsdp supports UDP as well
>
> AFAIK, SDP can only emulate TCP sockets.

SDP is defined to work with AF_INET applications. If using a shared library approach / pre-load, one can transparently enable any AF_INET application to utilize SDP without a recompile, etc. The SDP Port Mapper specification for iWARP and the service ID for IB enable the connection management (or whatever service it is implemented within) to discover the real target listen port, transparently to the application, and establish an SDP session, nominally during connection establishment.

Implementations may vary in the robustness or policies used to determine what to off-load, the number of off-load sessions, etc. - in other words, a lot of opportunity and flexibility is provided to use SDP.

Note: Winsock Direct on Windows provides an equivalent service, though it uses a proprietary protocol. Vista will have SDP as defined in the specifications.

There are currently no plans to develop an equivalent for datagram applications. Any datagram application (user or kernel) can already access the hardware directly, and given that RDMA is not defined for datagram, it was felt such a specification would provide minimal value.

Mike
Re: [openib-general] IPoIB Question
Perfect, I'll check with my vendor to see if this is possible. If so, this rocks!

Thanks!

Sean
[openib-general] IPoIB odd loopback packet from arp
Using the OFED 1.0 and OFED 1.1 stack I have noticed some rcvswrelay errors. I have tracked it down to the arp request. I can reproduce the problem with the following steps (I have used both 2.6.14.14 and 2.6.18.1 kernels):

ib109 arp -d ib110
ib109 ping ib110 -c 2

# ib_ipoib module debug
13:15:46 ib109 kernel: ib0: sending packet, length=60 address=f6187200 qpn=0xff
13:15:46 ib109 kernel: ib0: called: id 34, op 0, status: 0
13:15:46 ib109 kernel: ib0: send complete, wrid 34
13:15:46 ib109 kernel: ib0: called: id -2147483623, op 128, status: 0
13:15:46 ib109 kernel: ib0: received 100 bytes, SLID 0x0369
13:15:46 ib109 kernel: ib0: dropping loopback packet
13:15:46 ib109 kernel: ib0: called: id -2147483622, op 128, status: 0
13:15:46 ib109 kernel: ib0: received 100 bytes, SLID 0x016d
13:15:46 ib109 kernel: ib0: sending packet, length=88 address=f6e57520 qpn=0x000404
13:15:46 ib109 kernel: ib0: called: id 35, op 0, status: 0
13:15:46 ib109 kernel: ib0: send complete, wrid 35
13:15:46 ib109 kernel: ib0: called: id -2147483621, op 128, status: 0
13:15:46 ib109 kernel: ib0: received 128 bytes, SLID 0x016d
13:15:47 ib109 kernel: ib0: sending packet, length=88 address=f6e57520 qpn=0x000404
13:15:47 ib109 kernel: ib0: called: id 36, op 0, status: 0
13:15:47 ib109 kernel: ib0: send complete, wrid 36
13:15:47 ib109 kernel: ib0: called: id -2147483620, op 128, status: 0
13:15:47 ib109 kernel: ib0: received 128 bytes, SLID 0x016d
13:15:51 ib109 kernel: ib0: called: id -2147483619, op 128, status: 0
13:15:51 ib109 kernel: ib0: received 100 bytes, SLID 0x016d
13:15:51 ib109 kernel: ib0: sending packet, length=60 address=f6e57520 qpn=0x000404
13:15:51 ib109 kernel: ib0: called: id 37, op 0, status: 0
13:15:51 ib109 kernel: ib0: send complete, wrid 37

# tcpdump -i ib0
13:15:46.977578 arp who-has ib110 tell ib109 hardware #32
13:15:46.977682 arp reply ib110 is-at 00:00:04:04:fe:80:00:00:00:00:00:00:00:08:f1:04:03:96:11:59 hardware #32
13:15:46.977710 IP ib109 > ib110: icmp 64: echo request seq 0
13:15:46.977790 IP ib110 > ib109: icmp 64: echo reply seq 0
13:15:47.92 IP ib109 > ib110: icmp 64: echo request seq 1
13:15:47.977892 IP ib110 > ib109: icmp 64: echo reply seq 1
13:15:51.977076 arp who-has ib109 tell ib110 hardware #32
13:15:51.977094 arp reply ib109 is-at 00:02:00:14:fe:80:00:00:00:00:00:00:00:02:c9:02:00:00:3b:31 hardware #32

# error dump
rcvswrelayerrors:1 MT47396 Infiniscale-III 0x2c9010b022090[1] ib109 HCA-1 0x2c9023b30[1]

1) The ping is successful and the arp table is populated, so is this really a problem or a false positive?

2) The second arp does not generate an error (the error dump reports all new errors in switches). Why?

Any ideas?

Thanks in advance.
Todd
Re: [openib-general] IPoIB Question
At 10:59 AM 10/23/2006, Sean Hubbell wrote:
> Thanks Michael
>
> I looked at iperf and that looks like a very nice tool.

Something else about iperf is that it supports multiple streams, which may be closer to the way some apps operate.
Re: [openib-general] IPoIB Question
On Mon, Oct 23, 2006 at 07:53:06AM -0500, Hubbell, Sean C Contractor/Decibel wrote:

> I currently have several applications that uses a legacy IPv4 protocol and I use IPoIB to utilize my infiniband network which works great. I have completed some timing and throughput analysis and noticed that I do not get very much more if I use an infiniband network interface than using my GigE network interface.

You might want to note that different InfiniBand implementations have quite different performance of IPoIB, especially for UDP. Another issue is that IPoIB has quite different performance with different Linux kernels. This is especially evident for TCP, although you can use SDP to accelerate TCP sockets and avoid this issue.

> My question is, am I using IPoIB correctly or are these the typical numbers that everyone is seeing?

It is certainly the case that there are some message patterns and situations for which InfiniBand is not much of an improvement over gigE.

-- greg
[openib-general] IPoIB multicast neighbour?!
While debugging something, I turned on the ipoib debug messages and see

  ib0: neigh_destructor for ff ff12:601b::::::0002

Do you have an idea what the source of this neighbour is? Why is it created, and is there a way to eliminate it somehow (my guess is that removing IPv6 support from the kernel will do that)?

It's a RH4 U3 system with OFED 1.1 rc7; more info below, thanks.

Or.

# ip a s ib0
9: ib0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 128
    link/[32] 00:02:04:04:fe:80:00:00:00:00:00:00:00:08:f1:04:03:97:08:c5 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 192.169.3.235/24 brd 192.169.3.255 scope global ib0
    inet6 fe80::208:f104:397:8c5/64 scope link
       valid_lft forever preferred_lft forever

# ip m s ib0
9:  ib0
    link  00:ff:ff:ff:ff:12:40:1b:00:00:00:00:00:00:00:00:00:00:00:01
    link  00:ff:ff:ff:ff:12:60:1b:00:00:00:00:00:00:00:01:ff:97:08:c5
    link  00:ff:ff:ff:ff:12:60:1b:00:00:00:00:00:00:00:00:00:00:00:01
    inet  224.0.0.1
    inet6 ff02::1:ff97:8c5
    inet6 ff02::1
Re: [openib-general] IPOIB NAPI
On Sun, 2006-10-15 at 09:39 -0700, Roland Dreier wrote:
> I've been meaning to mention this... I have a preliminary version in
>
>     git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git ipoib-napi
>
> There are further changes I would like to add on top of that, but comments on the two patches there would be appreciated. And also benchmarks would be good.

Please diff to see my comments. Generally it looks like the condition on netif_rx_reschedule() should be inverted. Also, you need to set max to some large value, since you don't know how many completions you missed and you want to make sure you get all the ones that sneaked in between the last poll and the request notify.

int ipoib_poll(struct net_device *dev, int *budget)
{
        struct ipoib_dev_priv *priv = netdev_priv(dev);
        int max = min(*budget, dev->quota);
        int done;
        int t;
        int empty;
        int missed_event;
        int n, i;

repoll:
        done  = 0;
        empty = 0;

        while (max) {
                t = min(IPOIB_NUM_WC, max);
                n = ib_poll_cq(priv->cq, t, priv->ibwc);

                for (i = 0; i < n; ++i) {
                        if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) {
                                ++done;
                                --max;
                                ipoib_ib_handle_rx_wc(dev, priv->ibwc + i);
                        } else
                                ipoib_ib_handle_tx_wc(dev, priv->ibwc + i);
                }

                if (n != t) {
                        empty = 1;
                        break;
                }
        }

        dev->quota -= done;
        *budget    -= done;

        if (empty) {
                netif_rx_complete(dev);
                ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP, &missed_event);
                if (missed_event && !netif_rx_reschedule(dev, 0)) {
                        max = 1000;
                        goto repoll;
                }

                return 0;
        }

        return 1;
}
Re: [openib-general] IPOIB NAPI
    Eli> Please diff to see my comments. Generally it looks like the
    Eli> condition on netif_rx_reschedule() should be inverted.

Why? A return value of 0 means that the reschedule failed (probably because the poll routine is already running somewhere else) and the poll routine should just return. I think the code is correct as it stands.

    Eli> Also you need to set max to some large value since you don't
    Eli> know how many completions you missed and you want to make
    Eli> sure you get all the ones that sneaked in between the last
    Eli> poll and the request notify.

Why? max is there to limit us from doing more work than the quota passed in from the networking stack. If we fail to drain the CQ because we exhaust max, then the poll routine will return 1 and will remain scheduled, so the networking stack will call the poll routine again to continue grabbing completions.

 - R.
Re: [openib-general] IPOIB NAPI
On Mon, 2006-10-16 at 09:48 -0700, Roland Dreier wrote:
>     Eli> Please diff to see my comments. Generally it looks like the
>     Eli> condition on netif_rx_reschedule() should be inverted.
>
> Why? A return value of 0 means that the reschedule failed (probably because the poll routine is already running somewhere else) and the poll routine should just return. I think the code is correct as it stands.
>
>     Eli> Also you need to set max to some large value since you don't
>     Eli> know how many completions you missed and you want to make
>     Eli> sure you get all the ones that sneaked in between the last
>     Eli> poll and the request notify.
>
> Why? max is there to limit us from doing more work than the quota passed in from the networking stack. If we fail to drain the CQ because we exhaust max, then the poll routine will return 1 and will remain scheduled, so the networking stack will call the poll routine again to continue grabbing completions.
>
>  - R.

OK, I see what you mean. So I guess it's OK then.
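Roland's point about max can be checked with a small user-space model of the poll loop. This is only a sketch under stated assumptions: `model_cq`, `model_poll_cq` and `model_ipoib_poll` are hypothetical stand-ins (not kernel APIs), every completion is treated as a receive completion, and the re-arm / missed-event step is left out so that only the budget accounting is visible.

```c
#include <assert.h>

#define MODEL_NUM_WC 4   /* completions fetched per poll call (stand-in for IPOIB_NUM_WC) */

/* Hypothetical completion queue: just a count of completions waiting. */
struct model_cq {
    int pending;
};

/* Take up to max_entries completions off the queue; return how many were taken. */
static int model_poll_cq(struct model_cq *cq, int max_entries)
{
    int n = cq->pending < max_entries ? cq->pending : max_entries;

    cq->pending -= n;
    return n;
}

/*
 * Old-style NAPI poll contract, as described in the thread:
 * return 1 to stay scheduled (budget exhausted, CQ may still hold work),
 * return 0 once the CQ has been seen empty.
 */
static int model_ipoib_poll(struct model_cq *cq, int *budget)
{
    int max = *budget;
    int done = 0;
    int empty = 0;

    assert(cq != 0 && budget != 0);

    while (max > 0) {
        int t = max < MODEL_NUM_WC ? max : MODEL_NUM_WC;
        int n = model_poll_cq(cq, t);

        done += n;
        max  -= n;

        if (n != t) {        /* short read: the CQ is drained */
            empty = 1;
            break;
        }
    }

    *budget -= done;
    return empty ? 0 : 1;
}
```

With 10 completions queued and a budget of 6, the first call consumes 6 and returns 1; a second call consumes the remaining 4, sees the short read and returns 0. Nothing is lost by respecting the budget, which is why a large max is not needed.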
Re: [openib-general] IPOIB NAPI
Roland,

I don't know why, but I am having trouble getting this patch from your git tree. Do you mind posting the patch here so I can test the performance over ehca?

Thanks
Shirley Ma
Re: [openib-general] IPOIB NAPI
Quoting r. Roland Dreier [EMAIL PROTECTED]:
> There are further changes I would like to add on top of that, but comments on the two patches there would be appreciated.

A small optimization:

        if (missed_event && netif_rx_reschedule(dev, 0))

should be, I think

        if (unlikely(missed_event) && netif_rx_reschedule(dev, 0))

since we are talking about an unlikely race where the CQ became non-empty just as we were calling req_notify_cq.

An API idea: how about, instead of testing missed_events, we add a flag, IB_CQ_TEST (or a longer name, IB_CQ_REPORT_MISSED_EVENTS?), and change ib_req_notify_cq to return an int which will hold the missed_events value, only if this flag is set?

This has a few advantages:
- Less churn updating all users to the new API - they just ignore the return value - and still almost no overhead for them as they don't set IB_CQ_TEST
- For all users we have to push fewer values on the stack - note the compiler can't get rid of them as we are calling the function through a pointer
- For users that do
        missed_events = ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP | IB_CQ_TEST)
  we get the result in a register.

I agree it's a minor optimization, but I think quite a similar change went into the linux irq code - waste not, want not.

Want to see how a patch like this will look?

--
MST
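The calling convention MST proposes can be sketched in user space as follows. Everything here is hypothetical: `model_cq`, `model_req_notify_cq` and the `MODEL_CQ_*` flag values are invented for illustration and are not the verbs API, and "missed event" is modelled simply as "completions were already pending when the CQ was armed".

```c
#include <assert.h>

/* Hypothetical flag bits; the values are arbitrary. */
enum model_cq_notify_flags {
    MODEL_CQ_NEXT_COMP            = 1 << 0,  /* arm the CQ for the next completion */
    MODEL_CQ_REPORT_MISSED_EVENTS = 1 << 1,  /* caller wants the missed-event result */
};

/* Hypothetical completion queue state. */
struct model_cq {
    int pending;  /* completions already sitting in the queue */
    int armed;    /* non-zero once notification has been requested */
};

/*
 * Arm the CQ. Only callers that pass MODEL_CQ_REPORT_MISSED_EVENTS get a
 * meaningful return value (1 if completions were already pending); every
 * other caller gets 0 and can keep ignoring the result.
 */
static int model_req_notify_cq(struct model_cq *cq, unsigned int flags)
{
    assert(cq != 0);

    if (flags & MODEL_CQ_NEXT_COMP)
        cq->armed = 1;

    if (flags & MODEL_CQ_REPORT_MISSED_EVENTS)
        return cq->pending > 0;

    return 0;
}
```

Existing callers keep passing only the arm flag and discard the return value, while a NAPI-style poller would OR in the report flag and decide from the returned int whether to reschedule. That is the "less churn" argument made above.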
[openib-general] IPOIB NAPI
Hi Roland,

Can you tell when you are going to push your NAPI patch to ipoib? Is there anything I can do to help make this happen?
Re: [openib-general] IPOIB NAPI
    Eli> Hi Roland, can you tell when you are going to push your NAPI
    Eli> patch to ipoib? Is there anything I can do to help making
    Eli> this happen?

I've been meaning to mention this... I have a preliminary version in

    git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git ipoib-napi

There are further changes I would like to add on top of that, but comments on the two patches there would be appreciated. And also benchmarks would be good.

 - R.
Re: [openib-general] ipoib: ignores dma mapping errors on TX?
Quoting r. Roland Dreier [EMAIL PROTECTED]:
> +	if (unlikely(dma_mapping_error(addr))) {
> +		++priv->stats.tx_errors;
> +		dev_kfree_skb_any(skb);
> +		return;
> +	}

Do we want a warning there?

-- 
MST
Re: [openib-general] ipoib: ignores dma mapping errors on TX?
Does anyone know what might happen if a device tries to bus master bad_dma_address? Does it get a pci-abort, an NMI, a bus err interrupt, all of the above?

On 10/9/06 1:01 PM, Roland Dreier [EMAIL PROTECTED] wrote:
>     Michael> It seems that IPoIB ignores the possibility that
>     Michael> dma_map_single with DMA_TO_DEVICE direction might return
>     Michael> dma_mapping_error.
>
>     Michael> Is there some reason that such mappings can't fail?
>
> No, it's just an oversight. Most network device drivers don't check
> for DMA mapping errors but it's probably better to do so anyway. I
> added this to my queue:
>
> commit 8edaf479946022d67350d6c344952fb65064e51b
> Author: Roland Dreier [EMAIL PROTECTED]
> Date:   Mon Oct 9 10:54:20 2006 -0700
>
>     IPoIB: Check for DMA mapping error for TX packets
>
>     Signed-off-by: Roland Dreier [EMAIL PROTECTED]
>
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> index f426a69..8bf5e9e 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> @@ -355,6 +355,11 @@ void ipoib_send(struct net_device *dev,
>  	tx_req->skb = skb;
>  	addr = dma_map_single(priv->ca->dma_device, skb->data, skb->len,
>  			      DMA_TO_DEVICE);
> +	if (unlikely(dma_mapping_error(addr))) {
> +		++priv->stats.tx_errors;
> +		dev_kfree_skb_any(skb);
> +		return;
> +	}
>  	pci_unmap_addr_set(tx_req, mapping, addr);
>
>  	if (unlikely(post_send(priv, priv->tx_head & (ipoib_sendq_size - 1),
Re: [openib-general] ipoib: ignores dma mapping errors on TX?
At 10:24 AM 10/10/2006, Tom Tucker wrote:
> Does anyone know what might happen if a device tries to bus master
> bad_dma_address. Does it get a pci-abort, an NMI, a bus err
> interrupt, all of the above?

It depends upon the platform. Some will enter a containment mode and, for example, shut down the PCI bus or the PCIe Root Port. Others may trigger a system error and shut down the system. These responses are, in part, a matter of implementation policy and of how the system is implemented.

In future chipsets that contain an IOMMU / Address Translation Protection Tables (ATPT) / pick your favorite name, the error can be contained to a single device and the appropriate error recovery triggered without requiring the system to go down. Again, it is all policy at the end of the day as to what action is triggered. For most, the potential for silent data corruption is too high to risk letting that bus or Root Port continue to operate without a reset / flush, so containment is used at a minimum.

Mike

> On 10/9/06 1:01 PM, Roland Dreier [EMAIL PROTECTED] wrote:
> >     Michael> It seems that IPoIB ignores the possibility that
> >     Michael> dma_map_single with DMA_TO_DEVICE direction might return
> >     Michael> dma_mapping_error.
> >
> >     Michael> Is there some reason that such mappings can't fail?
> >
> > No, it's just an oversight. Most network device drivers don't check
> > for DMA mapping errors but it's probably better to do so anyway. I
> > added this to my queue:
> >
> > commit 8edaf479946022d67350d6c344952fb65064e51b
> > Author: Roland Dreier [EMAIL PROTECTED]
> > Date:   Mon Oct 9 10:54:20 2006 -0700
> >
> >     IPoIB: Check for DMA mapping error for TX packets
> >
> >     Signed-off-by: Roland Dreier [EMAIL PROTECTED]
> >
> > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> > index f426a69..8bf5e9e 100644
> > --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> > +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> > @@ -355,6 +355,11 @@ void ipoib_send(struct net_device *dev,
> >  	tx_req->skb = skb;
> >  	addr = dma_map_single(priv->ca->dma_device, skb->data, skb->len,
> >  			      DMA_TO_DEVICE);
> > +	if (unlikely(dma_mapping_error(addr))) {
> > +		++priv->stats.tx_errors;
> > +		dev_kfree_skb_any(skb);
> > +		return;
> > +	}
> >  	pci_unmap_addr_set(tx_req, mapping, addr);
> >
> >  	if (unlikely(post_send(priv, priv->tx_head & (ipoib_sendq_size - 1),
[openib-general] ipoib: ignores dma mapping errors on TX?
It seems that IPoIB ignores the possibility that dma_map_single with DMA_TO_DEVICE direction might return dma_mapping_error.

Is there some reason that such mappings can't fail?

-- 
MST
Re: [openib-general] ipoib: ignores dma mapping errors on TX?
    Michael> It seems that IPoIB ignores the possibility that
    Michael> dma_map_single with DMA_TO_DEVICE direction might return
    Michael> dma_mapping_error.

    Michael> Is there some reason that such mappings can't fail?

No, it's just an oversight. Most network device drivers don't check for DMA mapping errors but it's probably better to do so anyway. I added this to my queue:

commit 8edaf479946022d67350d6c344952fb65064e51b
Author: Roland Dreier [EMAIL PROTECTED]
Date:   Mon Oct 9 10:54:20 2006 -0700

    IPoIB: Check for DMA mapping error for TX packets

    Signed-off-by: Roland Dreier [EMAIL PROTECTED]

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index f426a69..8bf5e9e 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -355,6 +355,11 @@ void ipoib_send(struct net_device *dev,
 	tx_req->skb = skb;
 	addr = dma_map_single(priv->ca->dma_device, skb->data, skb->len,
 			      DMA_TO_DEVICE);
+	if (unlikely(dma_mapping_error(addr))) {
+		++priv->stats.tx_errors;
+		dev_kfree_skb_any(skb);
+		return;
+	}
 	pci_unmap_addr_set(tx_req, mapping, addr);

 	if (unlikely(post_send(priv, priv->tx_head & (ipoib_sendq_size - 1),
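For anyone who wants to see the pattern outside the kernel, below is a userspace sketch of "map, check, and bail out before posting". Everything named `sketch_*` or `SKETCH_*` is hypothetical (including the sentinel value and the one-slot stand-in for the TX ring); it only mirrors the order of operations in the patch: count the error, free the buffer, and return without touching the send queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel meaning "mapping failed". */
#define SKETCH_BAD_DMA_ADDR UINT64_MAX

/* Hypothetical packet buffer; freed is set instead of really freeing. */
struct sketch_skb {
    unsigned char data[64];
    size_t len;
    bool freed;
};

/* Hypothetical per-device state with a one-slot stand-in TX ring. */
struct sketch_priv {
    unsigned long tx_errors;
    struct sketch_skb *in_flight;
    uint64_t mapping;
};

/* Mock mapper; map_fails forces the failure path. */
static uint64_t sketch_dma_map(const struct sketch_skb *skb, bool map_fails)
{
    if (map_fails)
        return SKETCH_BAD_DMA_ADDR;
    return (uint64_t)(uintptr_t)skb->data;
}

static bool sketch_dma_mapping_error(uint64_t addr)
{
    return addr == SKETCH_BAD_DMA_ADDR;
}

static void sketch_free_skb(struct sketch_skb *skb)
{
    skb->freed = true;
}

/* Returns true if the packet was handed to the (mock) send queue. */
static bool sketch_send(struct sketch_priv *priv, struct sketch_skb *skb,
                        bool map_fails)
{
    uint64_t addr;

    assert(skb->len <= sizeof(skb->data));

    addr = sketch_dma_map(skb, map_fails);
    if (sketch_dma_mapping_error(addr)) {
        ++priv->tx_errors;      /* account for the dropped packet */
        sketch_free_skb(skb);   /* nothing was posted, so free it now */
        return false;
    }

    /* Only record the mapping and post once the address is known good. */
    priv->mapping = addr;
    priv->in_flight = skb;
    return true;
}
```

The design point is simply that the check sits between the map call and anything that stores or uses the address, so a bad address never reaches the hardware.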
Re: [openib-general] ipoib mcast questions...
hi roland, ...

On Thu, Oct 05, 2006 at 09:18:36PM -0700, Roland Dreier wrote:
> > 1) the set_multicast_list net device callback seems to just kick
> > off another thread to do the work of registering the multicast
> > group. the mc_list net_device field is only valid under the
> > netif_tx_lock, but this lock is not grabbed by the restart_task.
> > what happens if the mc_list is modified while in the restart_task?
>
> Just looking quickly, I see that ipoib_mcast_restart_task() does
> netif_tx_lock() (right near the top). Isn't this sufficient?

doh! i just missed it -- i predicted it would be missing, so i made it missing...

> > 2) there seem to be 2 threads, the restart_task which creates
> > queries and the join_task which sends off the mad requests. why?
> > is there some performance advantage? it would seem easier to do
> > the registrations serially in the restart task...
>
> I guess it's really that way mainly for historical reasons. I'd be
> glad to see patches that simplify things (of course making sure that
> everything still works ;)

i'm imagining that all the proprietary eth interfaces + ipoib need to do about the same thing when it comes to registering with mcast groups. would you (all) be averse to pulling some of the mcast group registration code out into the core ib driver for all to use?

arthur
Re: [openib-general] ipoib mcast questions...
On Fri, 2006-10-06 at 11:17, Arthur Jones wrote:
> hi roland, ...
>
> On Thu, Oct 05, 2006 at 09:18:36PM -0700, Roland Dreier wrote:
> > > 1) the set_multicast_list net device callback seems to just kick
> > > off another thread to do the work of registering the multicast
> > > group. the mc_list net_device field is only valid under the
> > > netif_tx_lock, but this lock is not grabbed by the restart_task.
> > > what happens if the mc_list is modified while in the restart_task?
> >
> > Just looking quickly, I see that ipoib_mcast_restart_task() does
> > netif_tx_lock() (right near the top). Isn't this sufficient?
>
> doh! i just missed it -- i predicted it would be missing, so i made
> it missing...
>
> > > 2) there seem to be 2 threads, the restart_task which creates
> > > queries and the join_task which sends off the mad requests. why?
> > > is there some performance advantage? it would seem easier to do
> > > the registrations serially in the restart task...
> >
> > I guess it's really that way mainly for historical reasons. I'd be
> > glad to see patches that simplify things (of course making sure that
> > everything still works ;)
>
> i'm imagining that all the proprietary eth interfaces + ipoib need
> to do about the same thing when it comes to registering with mcast
> groups. would you (all) be averse to pulling some of the mcast group
> registration code out into the core ib driver for all to use?

Isn't this already done with Sean's multicast work ?

-- Hal

> arthur
Re: [openib-general] ipoib mcast questions...
hi hal, ...

On Fri, Oct 06, 2006 at 11:26:26AM -0400, Hal Rosenstock wrote:
> [...]
> > i'm imagining that all the proprietary eth interfaces + ipoib need
> > to do about the same thing when it comes to registering with mcast
> > groups. would you (all) be averse to pulling some of the mcast group
> > registration code out into the core ib driver for all to use?
>
> Isn't this already done with Sean's multicast work ?

i didn't know about this work. do you know where i can find it?

arthur
Re: [openib-general] ipoib mcast questions...
On Fri, 2006-10-06 at 11:44, Arthur Jones wrote:
> hi hal, ...
>
> On Fri, Oct 06, 2006 at 11:26:26AM -0400, Hal Rosenstock wrote:
> > [...]
> > > i'm imagining that all the proprietary eth interfaces + ipoib need
> > > to do about the same thing when it comes to registering with mcast
> > > groups. would you (all) be averse to pulling some of the mcast group
> > > registration code out into the core ib driver for all to use?
> >
> > Isn't this already done with Sean's multicast work ?
>
> i didn't know about this work. do you know where i can find it?

I think it is in svn trunk.

-- Hal

> arthur
Re: [openib-general] ipoib mcast questions...
Hal Rosenstock wrote:
> > i didn't know about this work. do you know where i can find it?
>
> I think it is in svn trunk.

It's in svn. I've created patches against for-2.6.19, and will post them as part of a request to merge some of the features upstream.

- Sean
Re: [openib-general] ipoib mcast questions...
thanks all! i'll have a look...

arthur

On Fri, Oct 06, 2006 at 09:37:39AM -0700, Sean Hefty wrote:
> Hal Rosenstock wrote:
> > > i didn't know about this work. do you know where i can find it?
> >
> > I think it is in svn trunk.
>
> It's in svn. I've create patches against for-2.6.19, and will post
> that as part of a request to merge some on the features upstream.
>
> - Sean
Re: [openib-general] ipoib mcast questions...
hi hal, ...

On Fri, Oct 06, 2006 at 11:26:26AM -0400, Hal Rosenstock wrote:
> [...]
> > i'm imagining that all the proprietary eth interfaces + ipoib need
> > to do about the same thing when it comes to registering with mcast
> > groups. would you (all) be averse to pulling some of the mcast group
> > registration code out into the core ib driver for all to use?
>
> Isn't this already done with Sean's multicast work ?

after reading the code, iiuc, sean's work provides nice infrastructure for ib_multicast group join/leave. i was thinking about one more level up, i.e. generic _net_ multicast join/leave infrastructure. i'm not sure exactly how it would go -- but i think all the ib net_devices are going to need a way to associate a multicast hw addr w/ a live mgid. if that could be broken out, we could all share it...

arthur
Re: [openib-general] ipoib mcast questions...
On Fri, 2006-10-06 at 15:47, Arthur Jones wrote:
> hi hal, ...
>
> On Fri, Oct 06, 2006 at 11:26:26AM -0400, Hal Rosenstock wrote:
> > [...]
> > > i'm imagining that all the proprietary eth interfaces + ipoib need
> > > to do about the same thing when it comes to registering with mcast
> > > groups. would you (all) be averse to pulling some of the mcast group
> > > registration code out into the core ib driver for all to use?
> >
> > Isn't this already done with Sean's multicast work ?
>
> after reading the code, iiuc, sean's work provides nice
> infrastructure for ib_multicast group join/leave. i was thinking
> about one more level up, i.e. generic _net_ multicast join/leave
> infrastructure. i'm not sure exactly how it would go -- but i think
> all the ib net_devices are going to need a way to associate a
> multicast hw addr w/ a live mgid.

Don't IPmc addresses translate to MGIDs per the RFC ? MGIDs are not hardware addresses (MLIDs are).

-- Hal

> if that could be broken out, we could all share it...
>
> arthur
Re: [openib-general] ipoib mcast questions...
hi hal, ...

On Fri, Oct 06, 2006 at 04:09:05PM -0400, Hal Rosenstock wrote:
> On Fri, 2006-10-06 at 15:47, Arthur Jones wrote:
> > hi hal, ...
> >
> > On Fri, Oct 06, 2006 at 11:26:26AM -0400, Hal Rosenstock wrote:
> > > [...]
> > > > i'm imagining that all the proprietary eth interfaces + ipoib need
> > > > to do about the same thing when it comes to registering with mcast
> > > > groups. would you (all) be averse to pulling some of the mcast group
> > > > registration code out into the core ib driver for all to use?
> > >
> > > Isn't this already done with Sean's multicast work ?
> >
> > after reading the code, iiuc, sean's work provides nice
> > infrastructure for ib_multicast group join/leave. i was thinking
> > about one more level up, i.e. generic _net_ multicast join/leave
> > infrastructure. i'm not sure exactly how it would go -- but i think
> > all the ib net_devices are going to need a way to associate a
> > multicast hw addr w/ a live mgid.
>
> Don't IPmc addresses translate to MGIDs per the RFC ?

that's a different problem than the one i'm trying to address. i think you're talking about mapping ip mcast addresses to hardware addresses. rfc4391 tells ipoib how to do that, for the virtual ethernet devices, we'll need to come up w/ something different...

> MGIDs are not hardware addresses (MLIDs are).

mgids are generated from the mc_list->dmi_addr. this is a hardware address to the linux net code. i'm looking for commonality to reduce duplicated code. we all (ipoib + virtual eth) need to associate mgids, however we got them, with the mlids (i think). i'm guessing we'll do it in a very similar way...

arthur
[openib-general] ipoib mcast questions...
hi all,

i'm looking over the ipoib multicast code, and i have a couple questions:

1) the set_multicast_list net device callback seems to just kick off another thread to do the work of registering the multicast group. the mc_list net_device field is only valid under the netif_tx_lock, but this lock is not grabbed by the restart_task. what happens if the mc_list is modified while in the restart_task?

2) there seem to be 2 threads, the restart_task which creates queries and the join_task which sends off the mad requests. why? is there some performance advantage? it would seem easier to do the registrations serially in the restart task...

arthur
Re: [openib-general] ipoib mcast questions...
> 1) the set_multicast_list net device callback seems to just kick off
> another thread to do the work of registering the multicast group.
> the mc_list net_device field is only valid under the netif_tx_lock,
> but this lock is not grabbed by the restart_task. what happens if
> the mc_list is modified while in the restart_task?

Just looking quickly, I see that ipoib_mcast_restart_task() does netif_tx_lock() (right near the top). Isn't this sufficient?

> 2) there seem to be 2 threads, the restart_task which creates
> queries and the join_task which sends off the mad requests. why?
> is there some performance advantage? it would seem easier to do the
> registrations serially in the restart task...

I guess it's really that way mainly for historical reasons. I'd be glad to see patches that simplify things (of course making sure that everything still works ;)

 - R.
[openib-general] ipoib question when running on the same node as opensm
We just brought another cluster up and had an issue with our management node (the node running opensm) not coming up on ipoib. Here is what happened and how I got it working, followed by some questions.

1) We had both opensm running and a switch-based Voltaire SM running. This caused problems.
2) We stopped the Voltaire SM and restarted all the nodes. This got all of the nodes except the one with opensm running to work.
3) I had to unload all the modules, load only those needed by opensm, start opensm, and then bring up the ipoib interface. At this point the node seemed to be in the multicast group and ipoib worked fine.

Does this seem like proper behavior? I would have thought that if ipoib does not find an SM running at boot, it would delay setting up a connection until the SM comes on-line (i.e. when the opensm init script gets run).

It seems like the card saves some information (from the Voltaire SM) across a soft reboot. I know that it was not coming up in the multicast group with the opensm. Is this by design?

At this point ipoib seems to work fine after a reboot even though the interface is brought up before opensm. Do I need to ensure that opensm is up before all ipoib requests in the future?

Thanks,
Ira Weiny [EMAIL PROTECTED]
Re: [openib-general] ipoib question when running on the same node as opensm
Quoting r. Ira Weiny [EMAIL PROTECTED]:
> Do I need to ensure that opensm is up before all ipoib requests in
> the future?

Shouldn't be required; things work well for me, anyway.

-- 
MST
[openib-general] [ipoib] [PATCH] - Removed unused include of vmalloc.h
IPoIB: Removed unused include of vmalloc.h.

Signed-off-by: Dotan Barak [EMAIL PROTECTED]
---
Index: last_stable/drivers/infiniband/ulp/ipoib/ipoib_main.c
===
--- last_stable.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2006-08-07 17:45:02.0 +0300
+++ last_stable/drivers/infiniband/ulp/ipoib/ipoib_main.c 2006-08-08 09:36:45.0 +0300
@@ -40,7 +40,6 @@
 #include <linux/init.h>
 #include <linux/slab.h>
-#include <linux/vmalloc.h>
 #include <linux/kernel.h>
 #include <linux/if_arp.h>	/* For ARPHRD_xxx */
[openib-general] ipoib multicast problem
Hi,

I have seen the following problem with ipoib:

1. An application registers to a multicast group as a full member. As a result all the groups are listed in dev->mclist.
2. The infiniband link goes down momentarily, opensm is restarted, etc.
3. All multicast memberships are flushed.
4. The net device will not join again until, at some later time, something causes ipoib_set_mcast_list() to be called.
[openib-general] ipoib multicast problems on RHEL4.0 u4
Hi,

while testing ipoib multicast on RHEL4.0 u4, I noticed that setsockopt() succeeds in adding a multicast group to an interface, but the multicast group is not actually added to the net_device. This means that an application cannot join a multicast group as a full member. When I examined the differences between the kernel sources for u3 and u4, I noticed that essential code was removed:

diff -ru net/ipv4/arp.c ../linux-2.6.9-42.ELsmp/net/ipv4/arp.c
--- net/ipv4/arp.c 2006-09-18 15:35:03.0 +0300
+++ ../linux-2.6.9-42.ELsmp/net/ipv4/arp.c 2006-09-19 10:08:06.0 +0300
@@ -213,9 +213,6 @@
 	case ARPHRD_IEEE802_TR:
 		ip_tr_mc_map(addr, haddr);
 		return 0;
-	case ARPHRD_INFINIBAND:
-		ip_ib_mc_map(addr, haddr);
-		return 0;
 	default:
 		if (dir) {
 			memcpy(haddr, dev->broadcast, dev->addr_len);

Can anyone suggest a workaround for this issue?

Thanks
Eli
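As background on why losing that case matters, here is a userspace sketch of the kind of mapping ip_ib_mc_map performs: turning an IPv4 group address into the 20-byte IPoIB hardware address (4 bytes of QPN field plus a 16-byte MGID). The byte layout below follows my reading of RFC 4391 and should be checked against the RFC. The function name, the `pkey` parameter and the host-byte-order argument are assumptions made for this sketch; the call in the diff above takes only addr and haddr.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_IPOIB_ALEN 20    /* 4-byte QPN field + 16-byte MGID */

/*
 * Hypothetical helper: addr_host is the IPv4 group address in host
 * byte order, pkey is the partition key of the IPoIB interface.
 */
static void sketch_ip_ib_mc_map(uint32_t addr_host, uint16_t pkey,
                                uint8_t buf[SKETCH_IPOIB_ALEN])
{
    /* Only class D (224.0.0.0/4) addresses are multicast groups. */
    assert((addr_host >> 28) == 0xe);

    memset(buf, 0, SKETCH_IPOIB_ALEN);

    buf[1] = 0xff;              /* multicast QPN 0xffffff */
    buf[2] = 0xff;
    buf[3] = 0xff;

    buf[4] = 0xff;              /* MGID prefix: multicast */
    buf[5] = 0x12;              /* flags and link-local scope */
    buf[6] = 0x40;              /* IPv4 signature 0x401b */
    buf[7] = 0x1b;
    buf[8] = (uint8_t)(pkey >> 8);
    buf[9] = (uint8_t)(pkey & 0xff);

    /* The low 28 bits of the group address fill the end of the MGID. */
    buf[16] = (uint8_t)((addr_host >> 24) & 0x0f);
    buf[17] = (uint8_t)((addr_host >> 16) & 0xff);
    buf[18] = (uint8_t)((addr_host >> 8) & 0xff);
    buf[19] = (uint8_t)(addr_host & 0xff);
}
```

Without the ARPHRD_INFINIBAND case, the code in the diff falls through to the default branch, which copies dev->broadcast when dir is set, so no per-group address is produced for the group.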
Re: [openib-general] ipoib multicast problem
    Eli> 1. An application registers to a multicast group as a full
    Eli> member. As a result all the groups are listed in dev->mclist.
    Eli> 2. The infiniband link falls momentarily, opensm restarted
    Eli> etc. 3. All multicast memberships are flushed. 4. The net
    Eli> device will not join again until at a later time something
    Eli> will cause ipoib_set_mcast_list() to be called.

I don't understand. How could ipoib rejoin the broadcast group and then not rejoin the rest of the full member groups it has?

 - R.
Re: [openib-general] ipoib multicast problems on RHEL4.0 u4
On Tue, 2006-09-19 at 14:44 +0300, Eli cohen wrote:
> Hi,
> while testing ipoib multicast on RHEL4.0 u4, I noticed that
> setsockopt() succeeds to add a multicast group to an interface but
> actually the multicast group is not added to the net_device. This
> means that an application cannot join a multicast group as a full
> member. When I examined the differences between the kernel sources
> for u3 and u4 I noticed that essential code was removed:
>
> diff -ru net/ipv4/arp.c ../linux-2.6.9-42.ELsmp/net/ipv4/arp.c
> --- net/ipv4/arp.c 2006-09-18 15:35:03.0 +0300
> +++ ../linux-2.6.9-42.ELsmp/net/ipv4/arp.c 2006-09-19 10:08:06.0 +0300
> @@ -213,9 +213,6 @@
>  	case ARPHRD_IEEE802_TR:
>  		ip_tr_mc_map(addr, haddr);
>  		return 0;
> -	case ARPHRD_INFINIBAND:
> -		ip_ib_mc_map(addr, haddr);
> -		return 0;
>  	default:
>  		if (dir) {
>  			memcpy(haddr, dev->broadcast, dev->addr_len);
>
> Can anyone suggest a workaround to this issue?

Short of spinning a kernel, it's going to be hard to work around. Thanks for finding this, I'll track down how this got left out of the U4 kernel when it was in the U3 kernel :-/

-- 
Doug Ledford [EMAIL PROTECTED]
GPG KeyID: CFBFF194
http://people.redhat.com/dledford
Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband
Re: [openib-general] ipoib multicast problem
> I don't understand. How could ipoib rejoin the broadcast group and
> then not rejoin the rest of the full member groups it has?

That is because the broadcast group is not part of the multicast groups maintained by the kernel but rather is part of ipoib and is joined from a different function. The other full members are maintained by the kernel for the net device and come from dev->mclist.
Re: [openib-general] ipoib multicast problem
    eli> That is because the broadcast group is not part of the
    eli> multicast groups maintained by the kernel but rather is part
    eli> of ipoib and is joined from a different function. The other
    eli> full members are maintained by the kernel for the net device
    eli> and come from dev->mclist.

Oh I see, when we flush the multicast groups we actually delete all of them instead of just removing the attached flag. OK I guess your fix makes sense then.

 - R.
[openib-general] ipoib send only failure
Hi,

when running a test I encountered the following scenario:

- the test sends to a multicast address
- ipoib issues a send-only join, which fails
- successive joins to this group will not be attempted, since the query field of the mcast object still holds the old pointer.
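The failure mode described here can be shown with a tiny userspace model. This is not the ipoib code: the struct, the function names and the `clear_on_failure` switch are all hypothetical, and the "fix" shown (clearing the query pointer when the join fails) is just one plausible reading of the report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an outstanding SA query. */
struct sketch_query {
    int id;
};

/* Hypothetical stand-in for the mcast object mentioned above. */
struct sketch_mcast {
    struct sketch_query *query; /* non-NULL while a join is believed pending */
    bool attached;
    int join_attempts;
};

static struct sketch_query sketch_pending_query = { .id = 1 };

/* Start a send-only join unless one already appears to be in flight. */
static bool sketch_sendonly_join(struct sketch_mcast *mcast)
{
    assert(mcast != NULL);

    if (mcast->query)
        return false;           /* looks busy, so no new attempt */

    mcast->query = &sketch_pending_query;
    mcast->join_attempts++;
    return true;
}

/*
 * Join completion. With clear_on_failure false the stale pointer is
 * left behind on error, which reproduces the reported behaviour.
 */
static void sketch_join_complete(struct sketch_mcast *mcast, int status,
                                 bool clear_on_failure)
{
    if (status == 0) {
        mcast->attached = true;
        mcast->query = NULL;
        return;
    }

    if (clear_on_failure)
        mcast->query = NULL;    /* allow a later send to retry the join */
}
```

With the stale pointer left in place, a second `sketch_sendonly_join()` returns false and the group is never retried. Clearing it on failure lets the next send trigger a fresh attempt.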
Re: [openib-general] IPOIB failover ?
Richard Frank wrote:
> Does IPOIB in this stack support transparent fail over between ports
> and across redundant HCAs using a virtual IP ?

I am working on a patch to the Linux bonding driver which will allow it to also enslave IPoIB devices in active-backup mode. I will send an RFC to netdev for review next week. Does this meet your needs?

By virtual IP, do you mean an ***alias address*** assigned at one point in time to one ipoib device and at another point in time (e.g. during fail-over) to a second ipoib device? Does this approach have any advantage over the bonding approach?

Or.
Re: [openib-general] IPOIB failover ?
Supporting IPOIB fail over with the Bonding driver will work - we currently use this for GE, etc.

On Wed, 2006-09-13 at 14:27 +0300, Or Gerlitz wrote:
> Richard Frank wrote:
> > Does IPOIB in this stack support transparent fail over between ports
> > and across redundant HCAs using a virtual IP ?
>
> I am working on a patch to the linux bonding driver which will allow
> it to enslave also IPoIB devices for the active-backup mode. I will
> send an RFC to netdev for review next week. Does this meets your
> needs?
>
> Does by virtual IP you mean an ***alias address*** assigned at one
> point of time to one ipoib device and in another point of time (eg
> during fail-over) to a second ipoib device? does this approach have
> any advantage on the bonding approach?
>
> Or.
Re: [openib-general] IPOIB failover ?
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Richard Frank
Sent: Wednesday, September 13, 2006 7:12 AM
To: Or Gerlitz
Cc: openib-general@openib.org
Subject: Re: [openib-general] IPOIB failover ?

> Supporting IPOIB fail over with the Bonding driver will work - we
> currently use this for GE, etc.

You can also get failover with IPoIB if you're willing to use SCTP as the transport.

-Brian