[openib-general] VAPI_RETRY_EXC_ERR

2004-11-09 Thread Sreenivasulu Pulichintala
Hi, I use the MPICH 1.2.5 and MVAPICH 0.9.2 stack, and when I run some of my Fortran applications, my application sometimes crashes with the following error: === Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81 mpi_latency: mpid/ch_vapi/viacheck.c:2109:

RE: [openib-general] VAPI_RETRY_EXC_ERR

2004-11-09 Thread Sreenivasulu Pulichintala
The corresponding IB macro is IB_COMP_RETRY_EXC_ERR. -Original Message- From: Sreenivasulu Pulichintala Sent: Tuesday, November 09, 2004 3:56 PM To: [EMAIL PROTECTED] Subject: [openib-general] VAPI_RETRY_EXC_ERR HI, I use MPICH 1.2.5 and MVAPICH 0.9.2 stack and when I run

Re: [openib-general] ib_mad_recv_done_handler questions

2004-11-09 Thread Hal Rosenstock
On Mon, 2004-11-08 at 18:48, Sean Hefty wrote: Looking at the latest changes to ib_mad_recv_done_handler, I have a couple of questions: * If process_mad consumes the MAD, should the code just goto out? Something more like: ret = port_priv->device->process_mad(...) if ((ret

[openib-general] [PATCH] mad: In ib_mad_recv_done_handler, don't dispatch additional error cases

2004-11-09 Thread Hal Rosenstock
mad: In ib_mad_recv_done_handler, don't dispatch additional error cases Index: mad.c === --- mad.c (revision 1180) +++ mad.c (working copy) @@ -1138,26 +1138,27 @@

Re: [openib-general] MAD agent code comments

2004-11-09 Thread Hal Rosenstock
On Mon, 2004-11-08 at 19:27, Sean Hefty wrote: A couple of comments (so far) while tracing through the MAD agent code. * There are a couple of places where ib_get_agent_mad() will be called multiple times in the same execution path. For example agent_send calls it, as does agent_mad_send.

[openib-general] [PATCH] agent: Rename ib_get_agent_mad to ib_get_agent_port and eliminate duplicated call to it in agent_mad_send

2004-11-09 Thread Hal Rosenstock
agent: Rename ib_get_agent_mad to ib_get_agent_port and eliminate duplicated call to it in agent_mad_send (pointed out by Sean Hefty) Index: agent.c === --- agent.c (revision 1180) +++ agent.c (working copy) @@ -35,8 +35,8 @@

Re: [openib-general] ib_mad_recv_done_handler questions

2004-11-09 Thread Hal Rosenstock
On Mon, 2004-11-08 at 19:51, Roland Dreier wrote: Sean> * If the underlying driver provides a process_mad routine, a response MAD is allocated every time a MAD is received on QP 0 or 1. Can we either push this allocation down into the HCA driver, or find an

[openib-general] IPoIB Completion Handling

2004-11-09 Thread Hal Rosenstock
Hi Roland, In ipoib_ib_handle_wc when status != success, isn't the WC opcode invalid ? Also, in that case, don't receives also need to be reposted ? -- Hal ___ openib-general mailing list [EMAIL PROTECTED]

RE: [openib-general] VAPI_RETRY_EXC_ERR

2004-11-09 Thread Tziporet Koren
There can be several causes: - The retry count is too small; try the maximum value, 7. - The timeout may be too small, so the HCA starts retrying too often; try raising it to 21. - The PSNs on the two sides may not be synchronized. - The link failed. - The QP in the
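To put the suggested values in perspective, here is a minimal sketch, assuming the timeout mentioned above is the InfiniBand Local ACK Timeout exponent, where the nominal interval is 4.096 us times 2 to the power of the exponent. The function names are hypothetical, and real HCAs may wait longer than the nominal value.

```c
#include <stdint.h>

/* The retry count is a 3-bit field, so 7 is the largest value it can hold. */
#define TOY_MAX_RETRY_CNT 7u

/* Nominal Local ACK Timeout in nanoseconds: 4.096 us * 2^exp.
 * The exponent is a 5-bit field; 0 disables the timer and is
 * reported here as 0. Out-of-range exponents also return 0. */
static uint64_t toy_ack_timeout_ns(unsigned exp)
{
    if (exp == 0 || exp > 31)
        return 0;
    return (uint64_t)4096 << exp;
}

/* Nominal time spent on the initial attempt plus retry_cnt retries
 * before the requester reports a "retry exceeded" completion.
 * This ignores any slack the hardware adds to each interval. */
static uint64_t toy_nominal_wait_ns(unsigned exp, unsigned retry_cnt)
{
    return toy_ack_timeout_ns(exp) * (retry_cnt + 1);
}
```

With an exponent of 21, `toy_ack_timeout_ns` gives roughly 8.6 seconds per attempt, so a retry count of 7 makes the requester wait on the order of a minute before failing.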

[openib-general] More on IPoIB Multicast

2004-11-09 Thread Hal Rosenstock
Hi Roland, If a multicast send is attempted and the node is not joined to the multicast group which is the destination of the send, a send only join (which is neutered due to lack of SM support) is assumed. Is my understanding correct ? Linux also supports multicast routing. For this case, I

Re: [openib-general] Re: IPoIB Completion Handling

2004-11-09 Thread Sean Hefty
Hal Rosenstock wrote: On Tue, 2004-11-09 at 10:37, Roland Dreier wrote: By the way, reposting the receives is not the right thing to do on error -- the QP will be in the error state, so any new work requests will just complete with a flush status. We need to reset the QP and start over to recover
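The point about the error state can be illustrated with a toy state model. This is only a sketch: the types and functions below are hypothetical and do not call any real verbs API. It shows why reposting on an errored QP is pointless and why recovery goes through a reset.

```c
/* Hypothetical, simplified QP states. */
enum toy_qp_state { TOY_QP_RESET, TOY_QP_INIT, TOY_QP_RTR, TOY_QP_RTS, TOY_QP_ERR };

/* What happens to a receive work request posted in a given state. */
enum toy_post_result { TOY_POST_OK, TOY_POST_FLUSHED, TOY_POST_REJECTED };

struct toy_qp {
    enum toy_qp_state state;
};

/* In the error state a new work request is accepted but completes
 * immediately with a flush status, so reposting does not help. */
static enum toy_post_result toy_post_recv(const struct toy_qp *qp)
{
    switch (qp->state) {
    case TOY_QP_ERR:
        return TOY_POST_FLUSHED;
    case TOY_QP_RESET:
        return TOY_POST_REJECTED;
    default:
        return TOY_POST_OK;
    }
}

/* Recovery: reset the QP and walk it back up to a usable state. */
static void toy_recover(struct toy_qp *qp)
{
    qp->state = TOY_QP_RESET;
    qp->state = TOY_QP_INIT;
    qp->state = TOY_QP_RTR;
    qp->state = TOY_QP_RTS;
}
```

A caller would invoke `toy_recover` first and only then repost its receive buffers.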

Re: [openib-general] [PATCH] mad: In ib_mad_recv_done_handler, don't dispatch additional error cases

2004-11-09 Thread Hal Rosenstock
On Tue, 2004-11-09 at 12:11, Sean Hefty wrote: Hal Rosenstock wrote: mad: In ib_mad_recv_done_handler, don't dispatch additional error cases + if (ret & IB_MAD_RESULT_SUCCESS) { + if (ret & IB_MAD_RESULT_REPLY) { + if
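The quoted hunk tests a return code as a set of flag bits. The sketch below shows one plausible way to turn such a bitmask into a decision. The TOY_ names, the flag values, and the policy itself are assumptions made for illustration and are not taken from the patch.

```c
/* Hypothetical flag bits for a process_mad-style return code. */
enum {
    TOY_MAD_RESULT_FAILURE  = 0,
    TOY_MAD_RESULT_SUCCESS  = 1 << 0,
    TOY_MAD_RESULT_REPLY    = 1 << 1,
    TOY_MAD_RESULT_CONSUMED = 1 << 2
};

/* What the receive handler does next with the MAD. */
enum toy_mad_action { TOY_DISPATCH, TOY_SEND_REPLY, TOY_DONE, TOY_DROP };

/* Assumed policy: a failure is dropped rather than dispatched,
 * a consumed MAD needs no further work, a reply is sent back,
 * and anything else is handed to the registered agents. */
static enum toy_mad_action toy_classify(int ret)
{
    if (!(ret & TOY_MAD_RESULT_SUCCESS))
        return TOY_DROP;
    if (ret & TOY_MAD_RESULT_CONSUMED)
        return TOY_DONE;
    if (ret & TOY_MAD_RESULT_REPLY)
        return TOY_SEND_REPLY;
    return TOY_DISPATCH;
}
```

Testing bits with `&` rather than comparing with `==` matters here because the driver may set several flags at once.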

[openib-general] [PATCH] mad: In ib_mad_recv_done_handler, don't dispatch in additional case

2004-11-09 Thread Hal Rosenstock
mad: In ib_mad_recv_done_handler, don't dispatch in additional case Index: mad.c === --- mad.c (revision 1183) +++ mad.c (working copy) @@ -1161,8 +1161,8 @@

[Fwd: [openib-general] [PATCH] mad/agent: Modify receive buffer allocation strategy]

2004-11-09 Thread Hal Rosenstock
One more thing on this I forgot to post: As I am not yet set up with Kegel cross tools (and don't have a machine where the pci_ macros are non trivial), I would appreciate it if someone could verify these changes (or latest code) on some architecture where the pci_ macros are non trivial.

Re: [openib-general] VAPI_RETRY_EXC_ERR

2004-11-09 Thread Libor Michalek
On Tue, Nov 09, 2004 at 04:19:17PM +0530, Sreenivasulu Pulichintala wrote: -Original Message- From: Sreenivasulu Pulichintala Sent: Tuesday, November 09, 2004 3:56 PM To: [EMAIL PROTECTED] Subject: [openib-general] VAPI_RETRY_EXC_ERR HI, I use MPICH 1.2.5 and MVAPICH 0.9.2

Re: [Fwd: [openib-general] [PATCH] mad/agent: Modify receive buffer allocation strategy]

2004-11-09 Thread Roland Dreier
By the way, we probably want this applied: Index: core/mad.c === --- core/mad.c (revision 1184) +++ core/mad.c (working copy) @@ -385,7 +385,7 @@ mad_agent->device->node_type,

[openib-general] error trying to bring up node

2004-11-09 Thread Sean Hefty
I have two nodes directly connected. When trying to bring up the openib node, I receive a local length error on the CQ after trying to perform a send. I'm continuing to debug... - Sean

Re: [openib-general] error trying to bring up node

2004-11-09 Thread Sean Hefty
Sean Hefty wrote: I have two nodes directly connected. When trying to bring up the openib node, I receive a local length error on the CQ after trying to perform a send. I'm continuing to debug... static int agent_mad_send(struct ib_mad_agent *mad_agent, struct

Re: [Fwd: [openib-general] [PATCH] mad/agent: Modify receive buffer allocation strategy]

2004-11-09 Thread Hal Rosenstock
On Tue, 2004-11-09 at 14:39, Roland Dreier wrote: By the way, we probably want this applied: Thanks. Applied. -- Hal

Re: [Fwd: [openib-general] [PATCH] mad/agent: Modify receive buffer allocation strategy]

2004-11-09 Thread Hal Rosenstock
On Tue, 2004-11-09 at 14:37, Roland Dreier wrote: Hal> One more thing on this I forgot to post: As I am not yet set up with Kegel cross tools (and don't have a machine where the pci_ macros are non trivial), I would appreciate it if someone could verify these changes

Re: [openib-general] error trying to bring up node

2004-11-09 Thread Hal Rosenstock
On Tue, 2004-11-09 at 14:56, Sean Hefty wrote: Sean Hefty wrote: I have two nodes directly connected. When trying to bring up the openib node, I receive a local length error on the CQ after trying to perform a send. I'm continuing to debug... static int agent_mad_send(struct

Re: [openib-general] error trying to bring up node

2004-11-09 Thread Roland Dreier
Hal> Doesn't that just map starting at the GRH? This is to handle PMA responses which might have GRHs. Sure, it maps starting at the GRH and uses that as the start of the gather segment used for the send (and tries to send more than 256 bytes). This is wrong even when sending a
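The length problem described here can be shown with simple offset arithmetic. This is a sketch under stated assumptions: a receive buffer laid out as a 40-byte GRH followed by a 256-byte MAD, with hypothetical struct and function names. It is not the code from agent.c.

```c
#include <stddef.h>

#define TOY_GRH_LEN 40   /* size of an InfiniBand Global Route Header */
#define TOY_MAD_LEN 256  /* size of a MAD */

/* Hypothetical receive buffer: GRH first, MAD payload after it. */
struct toy_recv_buf {
    unsigned char grh[TOY_GRH_LEN];
    unsigned char mad[TOY_MAD_LEN];
};

/* Offset and length of the gather segment handed to the send. */
struct toy_sge {
    size_t offset;
    size_t length;
};

/* Mapping from the start of the buffer sends GRH plus MAD,
 * which is more than 256 bytes. */
static struct toy_sge toy_sge_from_grh(void)
{
    struct toy_sge sge = { 0, sizeof(struct toy_recv_buf) };
    return sge;
}

/* Mapping from the MAD itself sends exactly one MAD. */
static struct toy_sge toy_sge_from_mad(void)
{
    struct toy_sge sge = { offsetof(struct toy_recv_buf, mad), TOY_MAD_LEN };
    return sge;
}
```

The first variant yields a 296-byte segment starting at the GRH, while the second starts 40 bytes in and covers exactly 256 bytes.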

Re: [openib-general] error trying to bring up node

2004-11-09 Thread Sean Hefty
Hal Rosenstock wrote: On Tue, 2004-11-09 at 14:56, Sean Hefty wrote: Sean Hefty wrote: I have two nodes directly connected. When trying to bring up the openib node, I receive a local length error on the CQ after trying to perform a send. I'm continuing to debug... static int

Re: [openib-general] error trying to bring up node

2004-11-09 Thread Hal Rosenstock
On Tue, 2004-11-09 at 15:25, Hal Rosenstock wrote: Doesn't that just map starting at the GRH ? This is to handle PMA responses which might have GRHs. Never mind. I see the problem. -- Hal

[openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length

2004-11-09 Thread Hal Rosenstock
agent: Fix agent_mad_send PCI mapping and gather address and length Index: agent.c === --- agent.c (revision 1183) +++ agent.c (working copy) @@ -116,10 +116,10 @@ /* PCI mapping */ gather_list.addr =

Re: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length

2004-11-09 Thread Roland Dreier
OK, this works on my i386 system but I'm still getting ib_mad: Invalid directed route on ppc64. I'll try to debug what exactly is happening (ie put some prints in to see why smi_handle_dr_smp_send() is rejecting it). - R.

Re: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length

2004-11-09 Thread Roland Dreier
Roland> OK, this works on my i386 system but I'm still getting ib_mad: Invalid directed route on ppc64. I'll try to debug what exactly is happening (ie put some prints in to see why smi_handle_dr_smp_send() is rejecting it). By the way, the i386

[openib-general] Re: [PATCH] Unnecessary initialization of sa_query in failure case.

2004-11-09 Thread Krishna Kumar
On Tue, 9 Nov 2004, Roland Dreier wrote: Why is this initialization unnecessary? If we delete these lines then sa_query is left pointing to invalid memory when a send fails? Because ULPs should not use pointers that are to be set in callee routines if the call failed. In this case, path_rec_start
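The convention argued for in the reply can be sketched as follows. All names here are hypothetical and stand in for the real query routines. The callee writes its out-parameter only on success, and the caller checks the return code before touching it.

```c
#include <stddef.h>

/* Hypothetical query object handed back through an out-parameter. */
struct toy_query {
    int id;
};

static struct toy_query toy_the_query = { 42 };

/* Hypothetical callee: *out is written only when the call succeeds. */
static int toy_start_query(int should_fail, struct toy_query **out)
{
    if (should_fail)
        return -1;
    *out = &toy_the_query;
    return 0;
}

/* Caller: never reads q unless the call reported success, so the
 * callee has no need to initialize *out on the failure path. */
static int toy_caller(int should_fail)
{
    struct toy_query *q;
    int ret = toy_start_query(should_fail, &q);

    if (ret < 0)
        return ret;
    return q->id;
}
```

The opposing concern raised in the quoted question still applies: a caller that skips the check would read an indeterminate pointer, which is the trade-off the thread is debating.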

[openib-general] Re: [openib-commits] r1186 - gen2/trunk/src/linux-kernel/infiniband/core

2004-11-09 Thread Tom Duffy
On Tue, 2004-11-09 at 15:23 -0800, [EMAIL PROTECTED] wrote: Author: halr Date: 2004-11-09 15:23:07 -0800 (Tue, 09 Nov 2004) New Revision: 1186 Modified: gen2/trunk/src/linux-kernel/infiniband/core/agent.c Log: Fix agent_mad_send PCI mapping and gather address and length Please revert

Re: [openib-general] Re: [openib-commits] r1186 - gen2/trunk/src/linux-kernel/infiniband/core

2004-11-09 Thread Nitin Hande
Tom Duffy wrote: On Tue, 2004-11-09 at 15:23 -0800, [EMAIL PROTECTED] wrote: Author: halr Date: 2004-11-09 15:23:07 -0800 (Tue, 09 Nov 2004) New Revision: 1186 Modified: gen2/trunk/src/linux-kernel/infiniband/core/agent.c Log: Fix agent_mad_send PCI mapping and gather address and length

Re: [openib-general] Re: [openib-commits] r1186 - gen2/trunk/src/linux-kernel/infiniband/core

2004-11-09 Thread Roland Dreier
Nitin> certainly it does break my x86_64 setup too. Can we revert back to working set of bits please ? It's actually not an architecture issue -- it's an issue if your node is more than one hop from the SM. You should be able to use the patch I just posted to get things working

Re: [openib-general] Re: [openib-commits] r1186 - gen2/trunk/src/linux-kernel/infiniband/core

2004-11-09 Thread Tom Duffy
On Tue, 2004-11-09 at 16:01 -0800, Roland Dreier wrote: Nitin> certainly it does break my x86_64 setup too. Can we revert back to working set of bits please ? It's actually not an architecture issue -- it's an issue if your node is more than one hop from the SM. You should be

Re: [openib-general] Re: [openib-commits] r1186 - gen2/trunk/src/linux-kernel/infiniband/core

2004-11-09 Thread Nitin Hande
Roland Dreier wrote: Nitin> certainly it does break my x86_64 setup too. Can we revert back to working set of bits please ? It's actually not an architecture issue -- it's an issue if your node is more than one hop from the SM. You should be able to use the patch I just posted

[openib-general] [PATCH] handle QP0/1 send queue overrun

2004-11-09 Thread Sean Hefty
The following patch adds support for handling QP0/1 send queue overrun, along with a couple of related fixes: * The patch includes that provided by Roland in order to configure the fabric. * The code no longer modifies the user's send_wr structures when sending a MAD. * Sent MADs work requests

Re: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length

2004-11-09 Thread Hal Rosenstock
On Tue, 2004-11-09 at 18:54, Roland Dreier wrote: OK, I think I understand the problem, but I'm not sure what the correct solution is. When a DR SMP arrives at a CA from the SM, hop_cnt == hop_ptr == number of hops in the directed route, What was the number ? and somehow they are not

Re: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length

2004-11-09 Thread Roland Dreier
Roland> OK, I think I understand the problem, but I'm not sure what the correct solution is. When a DR SMP arrives at a CA from the SM, hop_cnt == hop_ptr == number of hops in the directed route, Hal> What was the number ? For one port it was 4 and for