RE: [openib-general] Re: IPoIB Failure CQ overrun

2004-12-16 Thread Woodruff, Robert J
>This appears to be an issue with the latest FW (I see it with Tavor FW >3.3.1 but not 3.2.0). I am working with Mellanox on finding out >whether it's a FW bug or a problem with mthca. >For now you can work around it by changing > IPOIB_NUM_WC = 4, >to > IPOIB_NUM_WC

Re: [openib-general] Re: IPoIB Failure CQ overrun

2004-12-16 Thread Roland Dreier
Robert> :04:00.0: CQ overrun on CQN 0082 This appears to be an issue with the latest FW (I see it with Tavor FW 3.3.1 but not 3.2.0). I am working with Mellanox on finding out whether it's a FW bug or a problem with mthca. For now you can work around it by changing IPOIB_NUM

RE: [openib-general] Re: IPoIB Failure CQ overrun

2004-12-16 Thread Woodruff, Robert J
I am now seeing a new failure. I bring up 2 nodes and initially can ping between the nodes. Then I try to run netpipe, and after the message size gets a little past 4K, it hangs. I see the same behavior running MPI over TCP. This used to work. I look in the dmesg log and see the following:

Re: [openib-general] CM header file

2004-12-16 Thread frank zago
Areas where the majority of clients will need to implement the exact same code make sense to push into the CM. So far there doesn't seem to be any disagreement that the CM has features that it doesn't need. And the list of desired features seem to be: * Perform QP transitions for the user. *

Re: [openib-general] CM header file

2004-12-16 Thread Sean Hefty
Roland Dreier wrote: The CM retry policy is specified much more tightly than for general MADs. The total number of retries is limited by the "Max CM Retries" field and the timeout waiting for each response is also part of the CM protocol. It appears then that given max_cm_retries, remote_cm_respon

Re: [openib-general] CM header file

2004-12-16 Thread Sean Hefty
frank zago wrote: I was trying to match the existing MAD API. The CM would perform timeouts, but not retries. Consumers could retry request immediately upon notification of a timeout. This lets the client change the timeout value. (I'm negotiable on this, but the cost of having clients init

Re: [openib-general] CM header file

2004-12-16 Thread Roland Dreier
Sean> I think it makes sense in the case of a timeout for the CM Sean> to optimize for the retry case. (Such as keeping the last Sean> sent MAD around for retransmission.) The argument used Sean> against putting retries into the MAD layer was to allow the Sean> consumer to set

Re: [openib-general] CM header file

2004-12-16 Thread Sean Hefty
Libor Michalek wrote: I have given some thought to how retries should work. I've thought about adding a new call, ib_retry_cm_send() - or something like that, that resends the last message sent. Or the callback could indicate to retry. Oh, so the CM would have enough information to generate

[openib-general] Re: IPoIB Path Static Rate

2004-12-16 Thread Roland Dreier
Hal> It looks to me like after obtaining the PathRecord, the Hal> static rate is not used when the AV is created. Shouldn't it Hal> be ? Is there an issue with doing this ? There is a similar Hal> issue with the multicast AVs as well. I know there is an Hal> assumption that ever

Re: [openib-general] CM header file

2004-12-16 Thread Roland Dreier
Sean> My thinking was that the connection model isn't carried in Sean> the CM MADs however, so a receiving CM has to determine how Sean> to match the connection request based on what the remote Sean> user requested, which isn't known. Sean> I guess that by not having this param

Re: [openib-general] CM header file

2004-12-16 Thread Libor Michalek
On Thu, Dec 16, 2004 at 02:51:29PM -0800, Sean Hefty wrote: > Libor Michalek wrote: > > On Thu, Dec 16, 2004 at 12:14:18PM -0800, Sean Hefty wrote: > > > >>One final note, I'm hoping that a more abstracted CM could be layered > >>on top of this one, if it were desired. E.g. one that performs QP

Re: [openib-general] [PATCH] initial CM module

2004-12-16 Thread Libor Michalek
On Thu, Dec 16, 2004 at 04:10:52PM -0600, frank zago wrote: > > I've used several CM and I found this kind of interface to be painful to > use. > I'd rather see an interface similar to Topspin's where you register a CM > callback, get CM events and react (or not) to these. > > With the interfac

Re: [openib-general] CM header file

2004-12-16 Thread frank zago
I was trying to match the existing MAD API. The CM would perform timeouts, but not retries. Consumers could retry request immediately upon notification of a timeout. This lets the client change the timeout value. (I'm negotiable on this, but the cost of having clients initiate retries is b

Re: [openib-general] [PATCH] initial CM module

2004-12-16 Thread frank zago
Sean Hefty wrote: Libor Michalek wrote: This ties into what I was saying about an error return value from the consumer callback being treated as a connection handle destroy request. There were three return types supported: I'm not opposed to this. I just haven't thought about it enough. error

Re: [openib-general] CM header file

2004-12-16 Thread Sean Hefty
Libor Michalek wrote: On Thu, Dec 16, 2004 at 12:14:18PM -0800, Sean Hefty wrote: One final note, I'm hoping that a more abstracted CM could be layered on top of this one, if it were desired. E.g. one that performs QP transitions, automatically generates MRAs, retries requests, etc. Are you s

Re: [openib-general] [PATCH] initial CM module

2004-12-16 Thread Sean Hefty
Libor Michalek wrote: This ties into what I was saying about an error return value from the consumer callback being treated as a connection handle destroy request. There were three return types supported: I'm not opposed to this. I just haven't thought about it enough. error - connection han

Re: [openib-general] CM header file

2004-12-16 Thread Libor Michalek
On Thu, Dec 16, 2004 at 12:14:18PM -0800, Sean Hefty wrote: > > One final note, I'm hoping that a more abstracted CM could be layered > on top of this one, if it were desired. E.g. one that performs QP > transitions, automatically generates MRAs, retries requests, etc. Are you suggesting tha

Re: [openib-general] [PATCH] initial CM module

2004-12-16 Thread Sean Hefty
frank zago wrote: With the interface you propose it takes maybe 200 lines of code to establish a simple connection, while with a callback it can be down to 30 lines. It should be as easy as possible for an application or a driver to establish a connection. I shouldn't have to rewrite a CM stat

Re: [openib-general] [PATCH] initial CM module

2004-12-16 Thread frank zago
With the interface you propose it takes maybe 200 lines of code to establish a simple connection, while with a callback it can be down to 30 lines. It should be as easy as possible for an application or a driver to establish a connection. I shouldn't have to rewrite a CM state machine every ti

Re: [openib-general] [PATCH] initial CM module

2004-12-16 Thread Sean Hefty
frank zago wrote: +int ib_send_cm_req(struct ib_cm_id *cm_id, + struct ib_cm_req_param *param); + The Topspin API: int ib_cm_connect(struct ib_cm_active_param *param, struct ib_path_record *primary_path, struct ib_path_record *alternate_path,

Re: [openib-general] Re: IPoIB Partial Connectivity Scenario

2004-12-16 Thread Hal Rosenstock
On the remote node to which connectivity fails, it has a stale arp cache entry which does not seem to go away as if the timer is not started. Is that possible ? Is there a case where the ARP entry is created but not timed ? /sbin/ip neigh show dev ib0 192.168.0.1 lladdr 00:00:04:04:fe:80:00:00:00

Re: [openib-general] [PATCH] initial CM module

2004-12-16 Thread frank zago
Hi Sean, This patch adds in the initial CM API and module code. The module loads, unloads, and allocates/deallocates connection structures, but that's about it. This patch does not include changes needed to Kconfig or the Makefile, since I'm not sure that it makes sense to change these yet. I wi

[openib-general] Re: IPoIB Partial Connectivity Scenario

2004-12-16 Thread Hal Rosenstock
On Thu, 2004-12-16 at 16:01, Roland Dreier wrote: > Sure, that sounds reasonable. I had thought that future packets would > cause the path record to be retried, but maybe that's not happening. If that were to happen that would be fine too. I don't see them on subsequent packets. Not sure what is

[openib-general] IPoIB Path Static Rate

2004-12-16 Thread Hal Rosenstock
Hi Roland, It looks to me like after obtaining the PathRecord, the static rate is not used when the AV is created. Shouldn't it be ? Is there an issue with doing this ? There is a similar issue with the multicast AVs as well. I know there is an assumption that everything is 4x but I am not sure th

Re: [openib-general] [PATCH] initial CM module

2004-12-16 Thread Sean Hefty
Roland Dreier wrote: Can you use the new copyright header I posted? Will do. Also it's good to hold off on Makefile/Kconfig changes for now, since that will simplify generating patches for upstream merging. If it gets too difficult to hold back on the CM, I would suggest developing the CM on a bra

Re: [openib-general] CM header file

2004-12-16 Thread Sean Hefty
Hal Rosenstock wrote: Would UC as well as RC be supported ? If so, UC can wait a little for implementing if this adds time. I'm not sure that RC or UC matters much to the CM. The CM will probably just read the value from the QP type. So it's just a pass through in terms of the components ? I bel

Re: [openib-general] CM header file

2004-12-16 Thread Sean Hefty
Roland Dreier wrote: Sean> I'm not sure that peer-to-peer needs to be exposed by the Sean> API. The CM should be able to determine the connection Sean> model when matching a received MAD with a local service ID. Sean> I.e. does the local service ID match with a listen request S

[openib-general] Re: IPoIB Partial Connectivity Scenario

2004-12-16 Thread Roland Dreier
Hal> I have a proposal: Rather than a single SA Get(PathRecord) Hal> with a 1 second timeout, what about a retry or two with a Hal> smaller (0.33 - 0.5 sec) timeout ? SA Get/GetResp is Hal> inherently unreliable and these could be retried. Sure, that sounds reasonable. I had thoug
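
The proposal quoted above (a retry or two with a shorter per-attempt timeout instead of one long wait) can be sketched roughly as follows. This is only an illustration, not the IPoIB or SA query code: `query_fn`, `query_with_retries` and `drop_first_n` are hypothetical names, and the "query" is a stand-in that simply drops a configurable number of requests.

```c
#include <stdbool.h>

/* Hypothetical hook: returns true if a response arrived within timeout_ms. */
typedef bool (*query_fn)(void *ctx, unsigned int timeout_ms);

/*
 * Issue the query up to (1 + retries) times, each attempt using the same
 * per-attempt timeout. Returns the 1-based number of the attempt that
 * succeeded, or 0 if every attempt timed out.
 */
static int query_with_retries(query_fn query, void *ctx,
                              unsigned int timeout_ms, unsigned int retries)
{
    for (unsigned int attempt = 1; attempt <= retries + 1; attempt++) {
        if (query(ctx, timeout_ms))
            return (int)attempt;
    }
    return 0;
}

/* Stand-in for an unreliable request: drops the first *ctx requests. */
static bool drop_first_n(void *ctx, unsigned int timeout_ms)
{
    unsigned int *remaining_drops = ctx;

    (void)timeout_ms; /* the stand-in does not actually wait */
    if (*remaining_drops > 0) {
        (*remaining_drops)--;
        return false;
    }
    return true;
}
```

With a 400 ms timeout and 2 retries, a single lost request is recovered on the second attempt, while the total worst-case wait (3 x 400 ms) stays close to the original 1 second.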

Re: [openib-general] [PATCH] initial CM module

2004-12-16 Thread Roland Dreier
Can you use the new copyright header I posted? Also it's good to hold off on Makefile/Kconfig changes for now, since that will simplify generating patches for upstream merging. If it gets too difficult to hold back on the CM, I would suggest developing the CM on a branch. - R.

Re: [openib-general] CM header file

2004-12-16 Thread Roland Dreier
Sean> I'm not sure that peer-to-peer needs to be exposed by the Sean> API. The CM should be able to determine the connection Sean> model when matching a received MAD with a local service ID. Sean> I.e. does the local service ID match with a listen request Sean> or a connection

Re: [openib-general] GUID/EUI-64 Issue

2004-12-16 Thread Hal Rosenstock
On Wed, 2004-12-08 at 08:58, Hal Rosenstock wrote: > Hi, > > Did we come to closure on how to handle the GUID/EUI-64 issue ? > > -- Hal > > On Thu, 2004-11-11 at 13:11, Roland Dreier wrote: > > My only questions are: > > > > + eui[0] ^= 2; > > > > I remember some discussion about
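
For readers unfamiliar with the quoted `eui[0] ^= 2` line: XOR with 2 toggles the universal/local bit (0x02) of the first octet, which is the inversion used when forming a modified EUI-64 interface identifier. The sketch below only illustrates that bit operation; `guid_to_modified_eui64` is a hypothetical helper name, and whether the flip should be applied to a port GUID is exactly the question the thread leaves open.

```c
#include <stdint.h>
#include <string.h>

/*
 * Copy an 8-byte GUID and toggle the universal/local bit of the first
 * octet. Applying the function twice restores the original value.
 */
static void guid_to_modified_eui64(const uint8_t guid[8], uint8_t eui[8])
{
    memcpy(eui, guid, 8);
    eui[0] ^= 2; /* flip bit 0x02 of the first octet */
}
```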

Re: [openib-general] CM header file

2004-12-16 Thread Hal Rosenstock
On Thu, 2004-12-16 at 15:14, Sean Hefty wrote: > Hal Rosenstock wrote: > > > Would UC as well as RC be supported ? If so, UC can wait a little for > > implementing if this adds time. > > I'm not sure that RC or UC matters much to the CM. The CM will > probably just read the value from the QP t

[openib-general] IPoIB Partial Connectivity Scenario

2004-12-16 Thread Hal Rosenstock
I've looked at the remote side to understand what it was (or wasn't doing). The partial connectivity stems from an issue in resolving the path on the remote side. I have a proposal: Rather than a single SA Get(PathRecord) with a 1 second timeout, what about a retry or two with a smaller (0.33 - 0.

Re: [openib-general] CM header file

2004-12-16 Thread Sean Hefty
Hal Rosenstock wrote: Would UC as well as RC be supported ? If so, UC can wait a little for implementing if this adds time. I'm not sure that RC or UC matters much to the CM. The CM will probably just read the value from the QP type. /** * ib_send_cm_mra - Sends a message receipt acknowledge

[openib-general] [PATCH] new CM test utility

2004-12-16 Thread Sean Hefty
This patch adds in a new test utility framework for CM development. It's currently located in the gen2/utils directory. I will commit this change unless there are any objections. - Sean Index: util/cmpost/Kconfig === --- util/cmpos

[openib-general] [PATCH] initial CM module

2004-12-16 Thread Sean Hefty
This patch adds in the initial CM API and module code. The module loads, unloads, and allocates/deallocates connection structures, but that's about it. This patch does not include changes needed to Kconfig or the Makefile, since I'm not sure that it makes sense to change these yet. I will commit

Re: [openib-general] CM header file

2004-12-16 Thread Hal Rosenstock
On Wed, 2004-12-15 at 19:25, Sean Hefty wrote: Here are some initial CM comments: Would UC as well as RC be supported ? If so, UC can wait a little for implementing if this adds time. /** > * ib_send_cm_mra - Sends a message receipt acknowledgement to a > connection > * message. > * @c

Re: [openib-general] CM header file

2004-12-16 Thread Sean Hefty
frank zago wrote: However, a nice to have feature which I've grown used to is the ability to listen to an entire range of service IDs using a value/mask combo. I second this request. It is very useful in some cases. Also, I didn't see a provision for peer to peer connections. I'm not sure that p

RE: [openib-general] IPoIB oops on path record completion

2004-12-16 Thread Woodruff, Robert J
>Are you running the latest code from svn? I fixed a bug this morning >that would cause problems with more than 2 nodes. >Thanks, > Roland With the 1348 version I just downloaded, I can now ping from all nodes to all other nodes. I will now try to install and run some MPI tests and/or other

RE: [openib-general] IPoIB oops on path record completion

2004-12-16 Thread Woodruff, Robert J
>Are you running the latest code from svn? I fixed a bug this morning >that would cause problems with more than 2 nodes. >Thanks, > Roland Thanks, I will grab it and give it a try. I am running 1335 and I know that you pushed a couple of fixes late yesterday and this morning after I downloaded

Re: [openib-general] IPoIB oops on path record completion

2004-12-16 Thread Roland Dreier
Robert> I also seem to be having some partial connectivity Robert> problems. The first 2 nodes seem to be able to Robert> communicate, but adding the 3rd and 4th nodes, they cannot Robert> ping the first 2. Are you running the latest code from svn? I fixed a bug this morning that

Re: [openib-general] CM header file

2004-12-16 Thread Sean Hefty
Libor Michalek wrote: The other option is to destroy the connection if the consumer returns an error value from the callback. I'll have to think about this. As a personal preference I try to avoid having callbacks return values. But then I'm not thrilled about passing in flags to destroy to h

Re: [openib-general] IPoIB oops on path record completion

2004-12-16 Thread Hal Rosenstock
On Thu, 2004-12-16 at 12:35, Roland Dreier wrote: > Are you running the latest code from svn? I fixed a bug this morning > that would cause problems with more than 2 nodes. I am. -- Hal

Re: [openib-general] [PATCH] [RFC] new test directory + test code for MAD snooping

2004-12-16 Thread Sean Hefty
Hal Rosenstock wrote: On Tue, 2004-12-14 at 15:05, Sean Hefty wrote: I've pushed in the madeye code under gen2/utils. I think it would be better as something like gen2/trunk/src/tests, gen2/trunk/tests, gen2/trunk/src/utils, or /gen2/trunk/utils so it is all in one place and can be obtained with

RE: [openib-general] IPoIB oops on path record completion

2004-12-16 Thread Woodruff, Robert J
>Still have the partial connectivity problem. I can see the ARP going out >on the broadcast group followed by ARPs coming in on the broadcast >group followed by the PathRecord requests/responses with the SA followed >by the unicast ARP and ICMP. After the unicast ARP to one of the nodes, >it is

[openib-general] [PATCH] IPoIB FAQ

2004-12-16 Thread Hal Rosenstock
reflect proper Arbel firmware revision Signed-off-by: Hal Rosenstock <[EMAIL PROTECTED]> Index: ipoib_faq.txt === --- ipoib_faq.txt (revision 1346) +++ ipoib_faq.txt (working copy) @@ -50,8 +50,8 @@ cat /sys/class/infi

RE: [openib-general] IPoIB still not working

2004-12-16 Thread England, Joshua J
They're on 4.5.3. -JE -Original Message- From: Hal Rosenstock [mailto:[EMAIL PROTECTED]] Sent: Thu 12/16/2004 8:54 AM To: England, Joshua J Cc: Roland Dreier; Robert J Woodruff; [EMAIL PROTECTED] Subject: RE: [openib-general] IPoI

RE: [openib-general] IPoIB still not working

2004-12-16 Thread Hal Rosenstock
On Wed, 2004-12-15 at 13:22, England, Joshua J wrote: > I'll definitely pound on the stuff and let you know if anything > breaks. You are using the 4.3.5 firmware, right ? I want to put the proper info into the IPoIB FAQ. Thanks. -- Hal

Re: [openib-general] CM header file

2004-12-16 Thread frank zago
However, a nice to have feature which I've grown used to is the ability to listen to an entire range of service IDs using a value/mask combo. I second this request. It is very useful in some cases. Also, I didn't see a provision for peer to peer connections. Frank.
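
The value/mask listen requested above can be pictured as a bitwise comparison on the 64-bit service ID. This is a minimal sketch under that assumption; `service_id_matches` is a hypothetical helper, not a function from the proposed CM header.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * A listener registered with (value, mask) accepts any incoming service ID
 * whose bits agree with value in every position where mask has a 1 bit.
 * A mask of all ones matches exactly one ID; a mask of zero matches all.
 */
static bool service_id_matches(uint64_t incoming, uint64_t value, uint64_t mask)
{
    return (incoming & mask) == (value & mask);
}
```

For example, a mask covering only the top byte lets one listener accept every service ID that shares that byte, regardless of the lower 56 bits.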

Re: [openib-general] IPoIB oops on path record completion

2004-12-16 Thread Hal Rosenstock
On Thu, 2004-12-16 at 09:58, Roland Dreier wrote: > Hal> I am still seeing the continual retransmission of SA > Hal> Get(PathRecords) even after I terminate the ping -b. The > Hal> status on the callback is 0. > > OK, I found another bug and pushed the change out. Are things better no

Re: [openib-general] IPoIB oops on path record completion

2004-12-16 Thread Roland Dreier
Hal> I am still seeing the continual retransmission of SA Hal> Get(PathRecords) even after I terminate the ping -b. The Hal> status on the callback is 0. OK, I found another bug and pushed the change out. Are things better now? - Roland

Re: [openib-general] IPoIB oops on path record completion

2004-12-16 Thread Hal Rosenstock
On Wed, 2004-12-15 at 23:18, Roland Dreier wrote: > Hal> Can you shorten your timeout down to 1 msec and see what happens ? > > OK, that let me reproduce the oops, and I pushed a fix out (skqueue > was used uninitialized). That fixed the oops :-) Thanks. I am still seeing the continual retrans

Re: [openib-general] [PATCH] [RFC] new test directory + test code for MAD snooping

2004-12-16 Thread Hal Rosenstock
On Tue, 2004-12-14 at 15:05, Sean Hefty wrote: > I've pushed in the madeye code under gen2/utils. I think it would be better as something like gen2/trunk/src/tests, gen2/trunk/tests, gen2/trunk/src/utils, or /gen2/trunk/utils so it is all in one place and can be obtained with a single checkout.

RE: [openib-general] IPoIB still not working

2004-12-16 Thread Tziporet Koren
Roland is correct here - this was the only change we had to do for Arbel. Tavor FW was enhanced to get this workaround too. Tziporet -Original Message- From: Roland Dreier [mailto:[EMAIL PROTECTED]] Sent: Wednesday, December 15, 2