Re: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length

2004-11-09 Thread Roland Dreier
It seems that MAD handling is still not quite right. It seems in my set up that IPoIB is not seeing the response to its MCMember set... (it does look like the query is reaching the SM) - R. ___ openib-general mailing list [EMAIL PROTECTED] http://openi

Re: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length

2004-11-09 Thread Roland Dreier
Roland> OK, I think I understand the problem, but I'm not sure what the correct solution is. When a DR SMP arrives at a CA from the SM, hop_cnt == hop_ptr == number of hops in the directed route, Hal> What was the number ? For one port it was 4 and for

Re: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length

2004-11-09 Thread Hal Rosenstock
On Tue, 2004-11-09 at 18:54, Roland Dreier wrote: > OK, I think I understand the problem, but I'm not sure what the > correct solution is. When a DR SMP arrives at a CA from the SM, > hop_cnt == hop_ptr == number of hops in the directed route, What was the number ? > and somehow they are not upd

[openib-general] [PATCH] handle QP0/1 send queue overrun

2004-11-09 Thread Sean Hefty
The following patch adds support for handling QP0/1 send queue overrun, along with a couple of related fixes: * The patch includes that provided by Roland in order to configure the fabric. * The code no longer modifies the user's send_wr structures when sending a MAD. * Sent MADs work requests

Re: [openib-general] Re: [openib-commits] r1186 - gen2/trunk/src/linux-kernel/infiniband/core

2004-11-09 Thread Nitin Hande
Roland Dreier wrote: > Nitin> certainly it does break my x86_64 setup too. Can we revert > Nitin> back to working set of bits please ? > > It's actually not an architecture issue -- it's an issue if your node > is more than one hop from the SM. You should be able to use the patch > I just

Re: [openib-general] Re: [openib-commits] r1186 - gen2/trunk/src/linux-kernel/infiniband/core

2004-11-09 Thread Tom Duffy
On Tue, 2004-11-09 at 16:01 -0800, Roland Dreier wrote: > Nitin> certainly it does break my x86_64 setup too. Can we revert > Nitin> back to working set of bits please ? > > It's actually not an architecture issue -- it's an issue if your node > is more than one hop from the SM. You shoul

Re: [openib-general] Re: [openib-commits] r1186 - gen2/trunk/src/linux-kernel/infiniband/core

2004-11-09 Thread Sean Hefty
Roland Dreier wrote: Nitin> certainly it does break my x86_64 setup too. Can we revert back to working set of bits please ? It's actually not an architecture issue -- it's an issue if your node is more than one hop from the SM. You should be able to use the patch I just posted to ge

Re: [openib-general] Re: [openib-commits] r1186 - gen2/trunk/src/linux-kernel/infiniband/core

2004-11-09 Thread Roland Dreier
Nitin> certainly it does break my x86_64 setup too. Can we revert back to working set of bits please ? It's actually not an architecture issue -- it's an issue if your node is more than one hop from the SM. You should be able to use the patch I just posted to get things working aga

Re: [openib-general] Re: [openib-commits] r1186 - gen2/trunk/src/linux-kernel/infiniband/core

2004-11-09 Thread Nitin Hande
Tom Duffy wrote: > On Tue, 2004-11-09 at 15:23 -0800, [EMAIL PROTECTED] wrote: > >>Author: halr >>Date: 2004-11-09 15:23:07 -0800 (Tue, 09 Nov 2004) >>New Revision: 1186 >> >>Modified: >> gen2/trunk/src/linux-kernel/infiniband/core/agent.c >>Log: >>Fix agent_mad_send PCI mapping and gather addre

Re: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length

2004-11-09 Thread Roland Dreier
OK, I think I understand the problem, but I'm not sure what the correct solution is. When a DR SMP arrives at a CA from the SM, hop_cnt == hop_ptr == number of hops in the directed route, and somehow they are not updated correctly by the time the response reaches handle_outgoing_smp(). I can't fo

[openib-general] Re: [openib-commits] r1186 - gen2/trunk/src/linux-kernel/infiniband/core

2004-11-09 Thread Tom Duffy
On Tue, 2004-11-09 at 15:23 -0800, [EMAIL PROTECTED] wrote: > Author: halr > Date: 2004-11-09 15:23:07 -0800 (Tue, 09 Nov 2004) > New Revision: 1186 > > Modified: >gen2/trunk/src/linux-kernel/infiniband/core/agent.c > Log: > Fix agent_mad_send PCI mapping and gather address and length Please

[openib-general] Re: [PATCH] Unnecessary initialization of sa_query in failure case.

2004-11-09 Thread Krishna Kumar
On Tue, 9 Nov 2004, Roland Dreier wrote: > Why is this initialization unnecessary? If we delete these lines then > sa_query is left pointing to invalid memory when a send fails? Because ULPs should not use to-be-set-in-callee pointers if the call failed. In this case, path_rec_start

[openib-general] Question on handle_outgoing_smp

2004-11-09 Thread root
In following code : if (smi_check_local_dr_smp(smp, mad_agent->device, mad_agent->port_num)) { ... ret = mad_agent->device->process_mad( mad_agent->device, 0, mad

[openib-general] Re: [PATCH] Unnecessary initialization of sa_query in failure case.

2004-11-09 Thread Roland Dreier
Why is this initialization unnecessary? If we delete these lines then sa_query is left pointing to invalid memory when a send fails? - R.

[openib-general] [PATCH] Unnecessary initialization of sa_query in failure case.

2004-11-09 Thread root
diff -ruNp org/core/sa_query.c new/core/sa_query.c --- org/core/sa_query.c 2004-11-09 12:51:35.0 -0800 +++ new/core/sa_query.c 2004-11-09 13:30:38.0 -0800 @@ -547,7 +547,6 @@ int ib_sa_path_rec_get(struct ib_device *sa_query = &query->sa_query; ret = send_mad(&query-

Re: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length

2004-11-09 Thread Hal Rosenstock
On Tue, 2004-11-09 at 15:53, Roland Dreier wrote: > By the way, the i386 system is connected directly to the switch > running the SM, That's the config I run in too. > while the ppc64 system is a few hops away. I think Sean's original config was a couple of hops. > So it's > just as likely to

Re: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length

2004-11-09 Thread Roland Dreier
Roland> OK, this works on my i386 system but I'm still getting ib_mad: Invalid directed route on ppc64. I'll try to debug what exactly is happening (ie put some prints in to see why smi_handle_dr_smp_send() is rejecting it). By the way, the i3

Re: [openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length

2004-11-09 Thread Roland Dreier
OK, this works on my i386 system but I'm still getting ib_mad: Invalid directed route on ppc64. I'll try to debug what exactly is happening (ie put some prints in to see why smi_handle_dr_smp_send() is rejecting it). - R.

[openib-general] [PATCH] agent: Fix agent_mad_send PCI mapping and gather address and length

2004-11-09 Thread Hal Rosenstock
agent: Fix agent_mad_send PCI mapping and gather address and length Index: agent.c === --- agent.c (revision 1183) +++ agent.c (working copy) @@ -116,10 +116,10 @@ /* PCI mapping */ gather_list.addr = pci_map

Re: [openib-general] error trying to bring up node

2004-11-09 Thread Hal Rosenstock
On Tue, 2004-11-09 at 15:25, Hal Rosenstock wrote: > Doesn't that just map starting at the GRH ? This is to handle PMA > responses which might have GRHs. Never mind. I see the problem. -- Hal

Re: [openib-general] error trying to bring up node

2004-11-09 Thread Sean Hefty
Hal Rosenstock wrote: On Tue, 2004-11-09 at 14:56, Sean Hefty wrote: Sean Hefty wrote: I have two nodes directly connected. When trying to bring up the openib node, I receive a local length error on the CQ after trying to perform a send. I'm continuing to debug... static int agent_mad_send(str

Re: [openib-general] error trying to bring up node

2004-11-09 Thread Roland Dreier
Hal> Doesn't that just map starting at the GRH ? This is to handle PMA responses which might have GRHs. Sure, it maps starting at the GRH and uses that as the start of the gather segment used for the send (and tries to send more than 256 bytes). This is wrong even when sending a pack
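
A minimal sketch of the gather-segment point made in this message, assuming a receive buffer laid out as a 40-byte GRH followed by a 256-byte MAD. The struct and function names (`recv_buf`, `gather_seg`, `mad_gather_seg`) are hypothetical and are not taken from agent.c; only the 40-byte GRH size and the 256-byte MAD size are fixed by InfiniBand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GRH_LEN 40   /* size of an InfiniBand Global Route Header */
#define MAD_LEN 256  /* size of a MAD sent on QP0/QP1 */

/* Hypothetical receive buffer layout: GRH first, then the MAD. */
struct recv_buf {
    uint8_t grh[GRH_LEN];
    uint8_t mad[MAD_LEN];
};

/* Hypothetical stand-in for one gather-list entry. */
struct gather_seg {
    uint64_t addr;
    uint32_t length;
};

/*
 * Build the gather segment for sending only the MAD part of a buffer
 * whose mapping starts at mapped_base (the address of the GRH).
 * The segment must skip the GRH and cover exactly MAD_LEN bytes.
 */
struct gather_seg mad_gather_seg(uint64_t mapped_base)
{
    struct gather_seg seg;

    /* The MAD is expected to start right after the GRH. */
    assert(offsetof(struct recv_buf, mad) == GRH_LEN);

    seg.addr = mapped_base + offsetof(struct recv_buf, mad);
    seg.length = MAD_LEN;
    return seg;
}
```

Using the mapped base address and `sizeof(struct recv_buf)` directly would instead describe a 296-byte send that begins with the GRH, which matches the symptom described in the thread.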

Re: [openib-general] error trying to bring up node

2004-11-09 Thread Hal Rosenstock
On Tue, 2004-11-09 at 14:56, Sean Hefty wrote: > Sean Hefty wrote: > > > I have two nodes directly connected. When trying to bring up the openib > > node, I receive a local length error on the CQ after trying to perform a > > send. > > > > I'm continuing to debug... > > static int agent_mad_s

Re: [Fwd: [openib-general] [PATCH] mad/agent: Modify receive buffer allocation strategy]

2004-11-09 Thread Hal Rosenstock
On Tue, 2004-11-09 at 14:37, Roland Dreier wrote: > Hal> One more thing on this I forgot to post: As I am not yet set > Hal> up with Kegel cross tools (and don't have a machine where the > Hal> pci_ macros are non trivial), I would appreciate it if > Hal> someone could verify these

Re: [openib-general] error trying to bring up node

2004-11-09 Thread Roland Dreier
Sean> Wouldn't this result in sending the GRH data buffer before the MAD buffer? Yes, it sure looks that way. Sean> Does mthca check the size of sends that are posted to QP0/1 and report an error if they are larger than 256 bytes? No, it will probably send i

Re: [Fwd: [openib-general] [PATCH] mad/agent: Modify receive buffer allocation strategy]

2004-11-09 Thread Hal Rosenstock
On Tue, 2004-11-09 at 14:39, Roland Dreier wrote: > By the way, we probably want this applied: Thanks. Applied. -- Hal

Re: [openib-general] error trying to bring up node

2004-11-09 Thread Sean Hefty
Sean Hefty wrote: I have two nodes directly connected. When trying to bring up the openib node, I receive a local length error on the CQ after trying to perform a send. I'm continuing to debug... static int agent_mad_send(struct ib_mad_agent *mad_agent, struct ib_agent

[openib-general] error trying to bring up node

2004-11-09 Thread Sean Hefty
I have two nodes directly connected. When trying to bring up the openib node, I receive a local length error on the CQ after trying to perform a send. I'm continuing to debug... - Sean

Re: [Fwd: [openib-general] [PATCH] mad/agent: Modify receive buffer allocation strategy]

2004-11-09 Thread Roland Dreier
By the way, we probably want this applied: Index: core/mad.c === --- core/mad.c (revision 1184) +++ core/mad.c (working copy) @@ -385,7 +385,7 @@ mad_agent->device->node_type,

Re: [Fwd: [openib-general] [PATCH] mad/agent: Modify receive buffer allocation strategy]

2004-11-09 Thread Roland Dreier
Hal> One more thing on this I forgot to post: As I am not yet set up with Kegel cross tools (and don't have a machine where the pci_ macros are non trivial), I would appreciate it if someone could verify these changes (or latest code) on some architecture whe

Re: [openib-general] VAPI_RETRY_EXC_ERR

2004-11-09 Thread Libor Michalek
On Tue, Nov 09, 2004 at 04:19:17PM +0530, Sreenivasulu Pulichintala wrote: > -Original Message- > From: Sreenivasulu Pulichintala > Sent: Tuesday, November 09, 2004 3:56 PM > To: [EMAIL PROTECTED] > Subject: [openib-general] VAPI_RETRY_EXC_ERR > > HI, > > I use MPICH 1.2.5 and MVAPICH 0

[Fwd: [openib-general] [PATCH] mad/agent: Modify receive buffer allocation strategy]

2004-11-09 Thread Hal Rosenstock
One more thing on this I forgot to post: As I am not yet set up with Kegel cross tools (and don't have a machine where the pci_ macros are non trivial), I would appreciate it if someone could verify these changes (or latest code) on some architecture where the pci_ macros are non trivial. Thanks.

[openib-general] [PATCH] mad: In ib_mad_recv_done_handler, don't dispatch in additional case

2004-11-09 Thread Hal Rosenstock
mad: In ib_mad_recv_done_handler, don't dispatch in additional case Index: mad.c === --- mad.c (revision 1183) +++ mad.c (working copy) @@ -1161,8 +1161,8 @@ port_priv->devic

Re: [openib-general] [PATCH] mad: In ib_mad_recv_done_handler, don't dispatch additional error cases

2004-11-09 Thread Hal Rosenstock
On Tue, 2004-11-09 at 12:11, Sean Hefty wrote: > Hal Rosenstock wrote: > > > mad: In ib_mad_recv_done_handler, don't dispatch additional error cases > > + if (ret & IB_MAD_RESULT_SUCCESS) { > > + if (ret & IB_MAD_RESULT_REPLY) { > > + if (respo

Re: [openib-general] Re: IPoIB Completion Handling

2004-11-09 Thread Sean Hefty
Hal Rosenstock wrote: On Tue, 2004-11-09 at 10:37, Roland Dreier wrote: By the way, reposting the receives is not the right thing to do on error -- the QP will be in the error state, so any new work requests will just complete with a flush status. We need to reset the QP and start over to recover

Re: [openib-general] [PATCH] mad: In ib_mad_recv_done_handler, don't dispatch additional error cases

2004-11-09 Thread Sean Hefty
Hal Rosenstock wrote: mad: In ib_mad_recv_done_handler, don't dispatch additional error cases + if (ret & IB_MAD_RESULT_SUCCESS) { + if (ret & IB_MAD_RESULT_REPLY) { + if (response->mad_hdr.mgmt_class == +

[openib-general] Re: More on IPoIB Multicast

2004-11-09 Thread Roland Dreier
Hal> Hi Roland, If a multicast send is attempted and the node is not joined to the multicast group which is the destination of the send, a send only join (which is neutered due to lack of SM support) is assumed. Is my understanding correct ? Yes. Hal> Linux also

[openib-general] More on IPoIB Multicast

2004-11-09 Thread Hal Rosenstock
Hi Roland, If a multicast send is attempted and the node is not joined to the multicast group which is the destination of the send, a send only join (which is neutered due to lack of SM support) is assumed. Is my understanding correct ? Linux also supports multicast routing. For this case, I thin

Re: [openib-general] MAD agent code comments

2004-11-09 Thread Sean Hefty
Hal Rosenstock wrote: Since the agent does not use solicited sends, are its sends completed in order (so this is only an issue for clients using solicited sends) ? I would think that solicited sends (i.e. responses) would be easier to maintain order, since those wouldn't have a timeout. But my p

[openib-general] [PATCH] mad/agent: Modify receive buffer allocation strategy

2004-11-09 Thread Hal Rosenstock
mad/agent: Modify receive buffer allocation strategy (Inefficiency pointed out by Sean; algorithm described by Roland) Problem: Currently, if the underlying driver provides a process_mad routine, a response MAD is allocated every time a MAD is received on QP 0 or 1. Solution: The MAD layer can al

Re: [openib-general] Re: IPoIB Completion Handling

2004-11-09 Thread Roland Dreier
Roland> By the way, reposting the receives is not the right thing to do on error -- the QP will be in the error state, so any new work requests will just complete with a flush status. We need to reset the QP and start over to recover from errors.

Re: [openib-general] Re: IPoIB Completion Handling

2004-11-09 Thread Hal Rosenstock
On Tue, 2004-11-09 at 10:37, Roland Dreier wrote: > By the way, reposting the receives is not the right thing to do on > error -- the QP will be in the error state, so any new work requests > will just complete with a flush status. We need to reset the QP and > start over to recover from errors.
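
A toy model of the recovery order described above, not IPoIB code and not the verbs API. All names (`toy_qp`, `toy_post_recv`, `toy_recover`) are hypothetical, and the three states are a deliberate simplification of the real QP state machine. It only illustrates why reposting on error is pointless: work requests posted to a QP in the error state are flushed, so the QP has to be reset and brought back up before receives are reposted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified subset of QP states. */
enum qp_state { QP_RESET, QP_READY, QP_ERROR };

/* Outcome of posting one receive work request in this model. */
enum wr_result { WR_QUEUED, WR_FLUSHED, WR_REJECTED };

struct toy_qp {
    enum qp_state state;
    int posted;               /* receives currently queued */
};

enum wr_result toy_post_recv(struct toy_qp *qp)
{
    assert(qp != NULL);
    switch (qp->state) {
    case QP_READY:
        qp->posted++;
        return WR_QUEUED;
    case QP_ERROR:
        return WR_FLUSHED;    /* completes with a flush status */
    default:
        return WR_REJECTED;   /* QP not yet brought up */
    }
}

/* A completion with error status moves the QP to the error state. */
void toy_completion_error(struct toy_qp *qp)
{
    assert(qp != NULL);
    qp->state = QP_ERROR;
    qp->posted = 0;           /* outstanding receives are flushed */
}

/* Reset the QP, bring it back up, then repost nrecv receives.
 * Returns the number of receives actually queued. */
int toy_recover(struct toy_qp *qp, int nrecv)
{
    int queued = 0;

    assert(qp != NULL);
    qp->state = QP_RESET;
    qp->posted = 0;
    qp->state = QP_READY;
    for (int i = 0; i < nrecv; i++)
        if (toy_post_recv(qp) == WR_QUEUED)
            queued++;
    return queued;
}
```

In this model, calling `toy_post_recv` right after `toy_completion_error` returns `WR_FLUSHED`, while `toy_recover` queues every receive again.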

Re: [openib-general] Re: IPoIB Completion Handling

2004-11-09 Thread Roland Dreier
Hal> In ipoib_ib_handle_wc when status != success, isn't the WC opcode invalid ? Also, in that case, don't receives also need to be reposted ? Roland> Yes, the error handling in IPoIB needs to be fixed. By the way, reposting the receives is not the right thing to do on e

[openib-general] Re: IPoIB Completion Handling

2004-11-09 Thread Roland Dreier
Hal> In ipoib_ib_handle_wc when status != success, isn't the WC opcode invalid ? Also, in that case, don't receives also need to be reposted ? Yes, the error handling in IPoIB needs to be fixed. - R.

RE: [openib-general] VAPI_RETRY_EXC_ERR

2004-11-09 Thread Tziporet Koren
There can be several problems: - The retry count is too small; try the maximum, 7. - The timeout may be too small, so the HCA retries too often; try increasing it to 21. - The PSNs on the two sides may not be synchronized. - The link failed. - The QP in the oth
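
For scale, a small sketch of what the two suggested values mean, assuming the standard InfiniBand encoding in which the local ACK timeout is 4.096 microseconds times 2 to the power of the 5-bit timeout field, and the retry count is a 3-bit field. The helper names are hypothetical; a timeout value of 0 is excluded here because, as far as I know, it is treated as "wait indefinitely" rather than as a duration.

```c
#include <assert.h>
#include <stdint.h>

#define RETRY_CNT_MAX 7u   /* retry count is a 3-bit field */

/* Local ACK timeout in nanoseconds for an encoded value in 1..31:
 * 4.096 us * 2^timeout == 4096 ns << timeout. */
uint64_t ack_timeout_ns(unsigned timeout)
{
    assert(timeout >= 1 && timeout <= 31);
    return (uint64_t)4096 << timeout;
}

/* Clamp a requested retry count to the largest encodable value. */
unsigned clamp_retry_cnt(unsigned requested)
{
    return requested > RETRY_CNT_MAX ? RETRY_CNT_MAX : requested;
}
```

With these definitions, `ack_timeout_ns(21)` is 8589934592 ns, so the suggested timeout of 21 corresponds to roughly 8.6 seconds per attempt.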

[openib-general] IPoIB Completion Handling

2004-11-09 Thread Hal Rosenstock
Hi Roland, In ipoib_ib_handle_wc when status != success, isn't the WC opcode invalid ? Also, in that case, don't receives also need to be reposted ? -- Hal

Re: [openib-general] ib_mad_recv_done_handler questions

2004-11-09 Thread Hal Rosenstock
On Mon, 2004-11-08 at 19:51, Roland Dreier wrote: > Sean> * If the underlying driver provides a process_mad routine, a > Sean> response MAD is allocated every time a MAD is received on QP > Sean> 0 or 1. Can we either push this allocation down into the > Sean> HCA driver, or find a

[openib-general] [PATCH] agent: Rename ib_get_agent_mad to ib_get_agent_port and eliminate duplicated call to it in agent_mad_send

2004-11-09 Thread Hal Rosenstock
agent: Rename ib_get_agent_mad to ib_get_agent_port and eliminate duplicated call to it in agent_mad_send (pointed out by Sean Hefty) Index: agent.c === --- agent.c (revision 1180) +++ agent.c (working copy) @@ -35,8 +35,8 @@

Re: [openib-general] MAD agent code comments

2004-11-09 Thread Hal Rosenstock
On Mon, 2004-11-08 at 19:27, Sean Hefty wrote: > A couple of comments (so far) while tracing through the MAD agent code. > > * There are a couple of places where ib_get_agent_mad() will be called > multiple times in the same execution path. For example agent_send calls > it, as does agent_mad_s

[openib-general] [PATCH] mad: In ib_mad_recv_done_handler, don't dispatch additional error cases

2004-11-09 Thread Hal Rosenstock
mad: In ib_mad_recv_done_handler, don't dispatch additional error cases Index: mad.c === --- mad.c (revision 1180) +++ mad.c (working copy) @@ -1138,26 +1138,27 @@ wc->s

Re: [openib-general] ib_mad_recv_done_handler questions

2004-11-09 Thread Hal Rosenstock
On Mon, 2004-11-08 at 18:48, Sean Hefty wrote: > Looking at the latest changes to ib_mad_recv_done_handler, I have a > couple of questions: > * If process_mad consumes the MAD, should the code just goto out? > Something more like: > > ret = port_priv->device->process_mad(...) > if (

RE: [openib-general] VAPI_RETRY_EXC_ERR

2004-11-09 Thread Sreenivasulu Pulichintala
The corresponding IB macro is IB_COMP_RETRY_EXC_ERR. -Original Message- From: Sreenivasulu Pulichintala Sent: Tuesday, November 09, 2004 3:56 PM To: [EMAIL PROTECTED] Subject: [openib-general] VAPI_RETRY_EXC_ERR Hi, I use MPICH 1.2.5 and MVAPICH 0.9.2 stack and when I r

[openib-general] VAPI_RETRY_EXC_ERR

2004-11-09 Thread Sreenivasulu Pulichintala
Hi, I use the MPICH 1.2.5 and MVAPICH 0.9.2 stack, and when I run some of my Fortran applications, my application sometimes crashes with the following error: === Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81 mpi_latency: mpid/ch_vapi/viacheck.c:2109: viutil_sp