Hi,
I use MPICH 1.2.5 and MVAPICH 0.9.2 stack and when I run some of my fortran
applications, some times my application crashes producing the following error
===
Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81
mpi_latency: mpid/ch_vapi/viacheck.c:2109:
The corresponding IB macro is IB_COMP_RETRY_EXC_ERR
-Original Message-
From: Sreenivasulu Pulichintala
Sent: Tuesday, November 09, 2004
3:56 PM
To: [EMAIL PROTECTED]
Subject: [openib-general]
VAPI_RETRY_EXC_ERR
Hi,
I use MPICH 1.2.5 and MVAPICH 0.9.2 stack and when I
run
On Mon, 2004-11-08 at 18:48, Sean Hefty wrote:
Looking at the latest changes to ib_mad_recv_done_handler, I have a
couple of questions:
* If process_mad consumes the MAD, should the code just goto out?
Something more like:
ret = port_priv->device->process_mad(...)
if ((ret
mad: In ib_mad_recv_done_handler, don't dispatch additional error cases
Index: mad.c
===
--- mad.c (revision 1180)
+++ mad.c (working copy)
@@ -1138,26 +1138,27 @@
On Mon, 2004-11-08 at 19:27, Sean Hefty wrote:
A couple of comments (so far) while tracing through the MAD agent code.
* There are a couple of places where ib_get_agent_mad() will be called
multiple times in the same execution path. For example agent_send calls
it, as does agent_mad_send.
agent: Rename ib_get_agent_mad to ib_get_agent_port and eliminate
duplicated call to it in agent_mad_send (pointed out by Sean Hefty)
Index: agent.c
===
--- agent.c (revision 1180)
+++ agent.c (working copy)
@@ -35,8 +35,8 @@
On Mon, 2004-11-08 at 19:51, Roland Dreier wrote:
Sean * If the underlying driver provides a process_mad routine, a
Sean response MAD is allocated every time a MAD is received on QP
Sean 0 or 1. Can we either push this allocation down into the
Sean HCA driver, or find an
Hi Roland,
In ipoib_ib_handle_wc when status != success, isn't the WC opcode
invalid ? Also, in that case, don't receives also need to be reposted ?
-- Hal
___
openib-general mailing list
[EMAIL PROTECTED]
There can be several problems:
- The retry count is too small - try to set the max value, 7
- Maybe the timeout is too small, so the HCA starts to retry too soon - try to enlarge it to 21
- It can be that the PSN between the two sides is not synchronized
- The link failed
- The QP in the
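The timeout values mentioned above are exponents, not durations: per the IB spec, the 5-bit QP local-ACK-timeout field T encodes a wait of 4.096 us * 2^T before a retry fires. A minimal sketch of that encoding (a hypothetical helper for illustration; not part of VAPI or MVAPICH):

```c
#include <assert.h>

/* Local ACK timeout encoding from the IB spec: the 5-bit QP timeout
 * field T gives a wait of 4.096 us * 2^T before the HCA retries.
 * Hypothetical helper for illustration only. */
static double ib_timeout_usec(unsigned t)
{
    return 4.096 * (double)(1ULL << t);
}
```

With timeout = 21 this comes to roughly 8.59 seconds per attempt, and with the retry count at its 3-bit maximum of 7 the HCA makes up to 7 retries before reporting VAPI_RETRY_EXC_ERR.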
Hi Roland,
If a multicast send is attempted and the node is not joined to the
multicast group which is the destination of the send, a send only join
(which is neutered due to lack of SM support) is assumed. Is my
understanding correct ?
Linux also supports multicast routing. For this case, I
Hal Rosenstock wrote:
On Tue, 2004-11-09 at 10:37, Roland Dreier wrote:
By the way, reposting the receives is not the right thing to do on
error -- the QP will be in the error state, so any new work requests
will just complete with a flush status. We need to reset the QP and
start over to recover
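Roland's point about reposting can be sketched as a state machine, assuming the standard verbs QP states (RESET/INIT/RTR/RTS/ERR): once a QP enters ERR, every posted work request completes with a flush status, so the only way forward is ERR -> RESET and then the full re-init sequence. A simplified model:

```c
#include <assert.h>

/* Simplified IB QP state machine for the error-recovery path
 * discussed above; real transitions have more cases (e.g. any
 * state can move to ERR or RESET). */
enum qp_state { QP_RESET, QP_INIT, QP_RTR, QP_RTS, QP_ERR };

static int qp_transition_ok(enum qp_state from, enum qp_state to)
{
    switch (from) {
    case QP_RESET: return to == QP_INIT;
    case QP_INIT:  return to == QP_RTR;
    case QP_RTR:   return to == QP_RTS;
    case QP_RTS:   return to == QP_ERR;   /* error can strike while running */
    case QP_ERR:   return to == QP_RESET; /* the only recovery path */
    }
    return 0;
}
```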
On Tue, 2004-11-09 at 12:11, Sean Hefty wrote:
Hal Rosenstock wrote:
mad: In ib_mad_recv_done_handler, don't dispatch additional error cases
+ if (ret & IB_MAD_RESULT_SUCCESS) {
+ if (ret & IB_MAD_RESULT_REPLY) {
+ if
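The dispatch in the patch fragment above tests bit flags in the process_mad() return value. A standalone sketch of that pattern (flag values follow the mainline ib_mad definitions, but treat them as illustrative here):

```c
#include <assert.h>

/* process_mad()-style result flags, as in the dispatch above.
 * Values mirror the kernel's IB_MAD_RESULT_* bits but are
 * redefined here for a self-contained illustration. */
#define IB_MAD_RESULT_SUCCESS  (1 << 0) /* MAD was handled */
#define IB_MAD_RESULT_REPLY    (1 << 1) /* a reply MAD was generated */
#define IB_MAD_RESULT_CONSUMED (1 << 2) /* no further dispatch needed */

/* Returns 1 if the caller should send the generated reply. */
static int should_send_reply(int ret)
{
    return (ret & IB_MAD_RESULT_SUCCESS) && (ret & IB_MAD_RESULT_REPLY);
}
```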
mad: In ib_mad_recv_done_handler, don't dispatch in additional case
Index: mad.c
===
--- mad.c (revision 1183)
+++ mad.c (working copy)
@@ -1161,8 +1161,8 @@
One more thing on this I forgot to post:
As I am not yet set up with Kegel cross tools (and don't have a machine
where the pci_ macros are non trivial), I would appreciate it if someone
could verify these changes (or latest code) on some architecture where
the pci_ macros are non trivial.
On Tue, Nov 09, 2004 at 04:19:17PM +0530, Sreenivasulu Pulichintala wrote:
-Original Message-
From: Sreenivasulu Pulichintala
Sent: Tuesday, November 09, 2004 3:56 PM
To: [EMAIL PROTECTED]
Subject: [openib-general] VAPI_RETRY_EXC_ERR
Hi,
I use MPICH 1.2.5 and MVAPICH 0.9.2
By the way, we probably want this applied:
Index: core/mad.c
===
--- core/mad.c (revision 1184)
+++ core/mad.c (working copy)
@@ -385,7 +385,7 @@
mad_agent->device->node_type,
I have two nodes directly connected. When trying to bring up the openib
node, I receive a local length error on the CQ after trying to perform a
send.
I'm continuing to debug...
- Sean
Sean Hefty wrote:
I have two nodes directly connected. When trying to bring up the openib
node, I receive a local length error on the CQ after trying to perform a
send.
I'm continuing to debug...
static int agent_mad_send(struct ib_mad_agent *mad_agent,
struct
On Tue, 2004-11-09 at 14:39, Roland Dreier wrote:
By the way, we probably want this applied:
Thanks. Applied.
-- Hal
On Tue, 2004-11-09 at 14:37, Roland Dreier wrote:
Hal One more thing on this I forgot to post: As I am not yet set
Hal up with Kegel cross tools (and don't have a machine where the
Hal pci_ macros are non trivial), I would appreciate it if
Hal someone could verify these changes
On Tue, 2004-11-09 at 14:56, Sean Hefty wrote:
Sean Hefty wrote:
I have two nodes directly connected. When trying to bring up the openib
node, I receive a local length error on the CQ after trying to perform a
send.
I'm continuing to debug...
static int agent_mad_send(struct
Hal Doesn't that just map starting at the GRH ? This is to handle
Hal PMA responses which might have GRHs.
Sure, it maps starting at the GRH and uses that as the start of the
gather segment used for the send (and tries to send more than 256
bytes). This is wrong even when sending a
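The layout Sean is describing: a received MAD buffer holds a 40-byte GRH followed by the 256-byte MAD, and on a UD send the HCA builds any GRH from the address handle, so the gather entry must cover only the MAD. A sketch of the corrected gather setup, under those spec-defined sizes (struct and function names are illustrative, not the agent.c code):

```c
#include <assert.h>

/* IBA-defined sizes: 40-byte GRH preceding a 256-byte MAD in the
 * receive buffer. */
enum { IB_GRH_BYTES = 40, IB_MGMT_MAD_SIZE = 256 };

struct gather {
    unsigned long addr;
    unsigned length;
};

/* Point the gather entry at the MAD itself, skipping the GRH; the
 * HCA generates any outbound GRH from the address handle. */
static void set_reply_gather(struct gather *g, unsigned long recv_buf)
{
    g->addr   = recv_buf + IB_GRH_BYTES;
    g->length = IB_MGMT_MAD_SIZE;
}
```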
Hal Rosenstock wrote:
On Tue, 2004-11-09 at 14:56, Sean Hefty wrote:
Sean Hefty wrote:
I have two nodes directly connected. When trying to bring up the openib
node, I receive a local length error on the CQ after trying to perform a
send.
I'm continuing to debug...
static int
On Tue, 2004-11-09 at 15:25, Hal Rosenstock wrote:
Doesn't that just map starting at the GRH ? This is to handle PMA
responses which might have GRHs.
Never mind. I see the problem.
-- Hal
agent: Fix agent_mad_send PCI mapping and gather address and length
Index: agent.c
===
--- agent.c (revision 1183)
+++ agent.c (working copy)
@@ -116,10 +116,10 @@
/* PCI mapping */
gather_list.addr =
OK, this works on my i386 system but I'm still getting
ib_mad: Invalid directed route
on ppc64. I'll try to debug what exactly is happening (ie put some
prints in to see why smi_handle_dr_smp_send() is rejecting it).
- R.
Roland OK, this works on my i386 system but I'm still getting
Roland ib_mad: Invalid directed route
Roland on ppc64. I'll try to debug what exactly is happening (ie
Roland put some prints in to see why smi_handle_dr_smp_send() is
Roland rejecting it).
By the way, the i386
On Tue, 9 Nov 2004, Roland Dreier wrote:
Why is this initialization unnecessary? If we delete these lines then
sa_query is left pointing to invalid memory when a send fails?
Because ULPs should not use pointers that are set in the callee if the call failed. In this case, path_rec_start
On Tue, 2004-11-09 at 15:23 -0800, [EMAIL PROTECTED] wrote:
Author: halr
Date: 2004-11-09 15:23:07 -0800 (Tue, 09 Nov 2004)
New Revision: 1186
Modified:
gen2/trunk/src/linux-kernel/infiniband/core/agent.c
Log:
Fix agent_mad_send PCI mapping and gather address and length
Please revert
Tom Duffy wrote:
On Tue, 2004-11-09 at 15:23 -0800, [EMAIL PROTECTED] wrote:
Author: halr
Date: 2004-11-09 15:23:07 -0800 (Tue, 09 Nov 2004)
New Revision: 1186
Modified:
gen2/trunk/src/linux-kernel/infiniband/core/agent.c
Log:
Fix agent_mad_send PCI mapping and gather address and length
Nitin certainly it does break my x86_64 setup too. Can we revert
Nitin back to working set of bits please ?
It's actually not an architecture issue -- it's an issue if your node
is more than one hop from the SM. You should be able to use the patch
I just posted to get things working
On Tue, 2004-11-09 at 16:01 -0800, Roland Dreier wrote:
Nitin certainly it does break my x86_64 setup too. Can we revert
Nitin back to working set of bits please ?
It's actually not an architecture issue -- it's an issue if your node
is more than one hop from the SM. You should be
Roland Dreier wrote:
Nitin certainly it does break my x86_64 setup too. Can we revert
Nitin back to working set of bits please ?
It's actually not an architecture issue -- it's an issue if your node
is more than one hop from the SM. You should be able to use the patch
I just posted
The following patch adds support for handling QP0/1 send queue overrun,
along with a couple of related fixes:
* The patch includes that provided by Roland in order to configure the
fabric.
* The code no longer modifies the user's send_wr structures when sending
a MAD.
* Sent MADs work requests
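The overrun handling described above can be modeled simply: when the hardware send queue is full, hold the work request in software and post it when a completion frees a slot. A counting sketch of that scheme (the real code would keep the deferred WRs on a list; names here are illustrative):

```c
#include <assert.h>

/* Toy model of QP0/1 send-queue overrun handling: posted WRs are
 * bounded by the hardware queue depth; excess WRs wait in software. */
struct send_queue {
    unsigned posted;   /* WRs currently in the hardware queue */
    unsigned max_wr;   /* hardware send queue depth */
    unsigned overflow; /* WRs deferred in software */
};

static void mad_post(struct send_queue *q)
{
    if (q->posted < q->max_wr)
        q->posted++;
    else
        q->overflow++;     /* defer until a completion frees a slot */
}

static void mad_send_done(struct send_queue *q)
{
    q->posted--;
    if (q->overflow) {     /* promote one deferred WR */
        q->overflow--;
        q->posted++;
    }
}
```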
On Tue, 2004-11-09 at 18:54, Roland Dreier wrote:
OK, I think I understand the problem, but I'm not sure what the
correct solution is. When a DR SMP arrives at a CA from the SM,
hop_cnt == hop_ptr == number of hops in the directed route,
What was the number ?
and somehow they are not
Roland OK, I think I understand the problem, but I'm not sure
Roland what the correct solution is. When a DR SMP arrives at a
Roland CA from the SM, hop_cnt == hop_ptr == number of hops in
Roland the directed route,
Hal What was the number ?
For one port it was 4 and for
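The hop-pointer invariant Roland describes (a DR SMP arriving at a CA from the SM has hop_cnt == hop_ptr == number of hops) suggests a simple validity check for sending the response back. A simplified sketch of that rule, loosely modeled on smi_handle_dr_smp_send (field and function names assumed, not the actual smi.c code):

```c
#include <assert.h>

/* Minimal directed-route SMP header fields relevant to the check. */
struct dr_smp {
    unsigned hop_ptr;
    unsigned hop_cnt;
    int is_response; /* direction bit */
};

/* At a CA endpoint: an outgoing request starts with hop_ptr == 0;
 * an outgoing response retraces the route, so hop_ptr must equal
 * hop_cnt. Anything else is an invalid directed route. */
static int dr_smp_send_ok_at_ca(const struct dr_smp *smp)
{
    if (!smp->is_response)
        return smp->hop_ptr == 0;
    return smp->hop_ptr == smp->hop_cnt;
}
```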