MAD handling still doesn't seem quite right. In my setup, IPoIB is not
seeing the response to its MCMember Set (the query does appear to be
reaching the SM).
- R.
Roland> OK, I think I understand the problem, but I'm not sure
Roland> what the correct solution is. When a DR SMP arrives at a
Roland> CA from the SM, hop_cnt == hop_ptr == number of hops in
Roland> the directed route,
Hal> What was the number ?
For one port it was 4 and for
On Tue, 2004-11-09 at 18:54, Roland Dreier wrote:
> OK, I think I understand the problem, but I'm not sure what the
> correct solution is. When a DR SMP arrives at a CA from the SM,
> hop_cnt == hop_ptr == number of hops in the directed route,
What was the number ?
> and somehow they are not upd
The following patch adds support for handling QP0/1 send queue overrun,
along with a couple of related fixes:
* The patch includes the one provided by Roland in order to configure the
fabric.
* The code no longer modifies the user's send_wr structures when sending
a MAD.
* Sent MAD work requests
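To make the overrun handling concrete, here is a minimal sketch of the
scheme described above, under assumed names (send_outstanding,
send_queue_depth, overflow_list, and overflow_entry are illustrative,
not the actual structures in the patch): if the QP0/1 send queue is
full, the work request is parked on a list instead of being posted, and
the list is drained from the send completion handler.

static int queue_or_post_send(struct ib_mad_qp_info *qp_info,
                              struct ib_mad_send_wr_private *mad_send_wr)
{
        struct ib_send_wr *bad_wr;
        unsigned long flags;
        int ret = 0;

        spin_lock_irqsave(&qp_info->send_lock, flags);
        if (qp_info->send_outstanding >= qp_info->send_queue_depth) {
                /* Send queue full: defer the work request rather than
                 * overrun the QP. */
                list_add_tail(&mad_send_wr->overflow_entry,
                              &qp_info->overflow_list);
        } else {
                qp_info->send_outstanding++;
                ret = ib_post_send(qp_info->qp, &mad_send_wr->send_wr,
                                   &bad_wr);
        }
        spin_unlock_irqrestore(&qp_info->send_lock, flags);
        return ret;
}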
Nitin> certainly it does break my x86_64 setup too. Can we revert
Nitin> back to working set of bits please ?
It's actually not an architecture issue -- it's an issue if your node
is more than one hop from the SM. You should be able to use the patch
I just posted to get things working again.
OK, I think I understand the problem, but I'm not sure what the
correct solution is. When a DR SMP arrives at a CA from the SM,
hop_cnt == hop_ptr == number of hops in the directed route, and
somehow they are not updated correctly by the time the response
reaches handle_outgoing_smp().
I can't fo
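For reference, a simplified sketch of the send-side directed-route
check involved, based on the rules in IBA vol. 1 section 14.2.2 (the
function, helper, and field names only approximate smi.c and are
assumptions, not quotes from the tree):

/* Returns 1 if the DR SMP may be sent out on port_num, 0 to discard. */
static int dr_smp_send_ok(struct ib_smp *smp, u8 port_num)
{
        /* C14-6: valid hop_cnt values are 0 to 63 */
        if (smp->hop_cnt >= 64)
                return 0;

        if (ib_get_smp_direction(smp)) {
                /*
                 * Returning SMP leaving the responding CA: C14-13:1
                 * requires hop_ptr == hop_cnt + 1 at this point (the
                 * receive side is supposed to increment it).  If the
                 * request arrived with hop_ptr == hop_cnt and was never
                 * incremented, this is presumably the check that fails
                 * and produces "ib_mad: Invalid directed route".
                 */
                if (smp->hop_cnt && smp->hop_ptr == smp->hop_cnt + 1) {
                        smp->hop_ptr--;
                        return smp->return_path[smp->hop_ptr] == port_num;
                }
                return 0;
        }
        /* ... outbound (C14-9) cases elided ... */
        return 0;
}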
On Tue, 2004-11-09 at 15:23 -0800, [EMAIL PROTECTED] wrote:
> Author: halr
> Date: 2004-11-09 15:23:07 -0800 (Tue, 09 Nov 2004)
> New Revision: 1186
>
> Modified:
> gen2/trunk/src/linux-kernel/infiniband/core/agent.c
> Log:
> Fix agent_mad_send PCI mapping and gather address and length
Please
On Tue, 9 Nov 2004, Roland Dreier wrote:
> Why is this initialization unnecessary? If we delete these lines then
> sa_query is left pointing to invalid memory when a send fails?
Because ULPs should not use pointers that the callee was supposed to
set once the call has failed. In this case, path_rec_start
In the following code:
if (smi_check_local_dr_smp(smp, mad_agent->device, mad_agent->port_num)) {
        ...
        ret = mad_agent->device->process_mad(mad_agent->device,
                                             0,
                                             mad
Why is this initialization unnecessary? If we delete these lines then
sa_query is left pointing to invalid memory when a send fails?
- R.
diff -ruNp org/core/sa_query.c new/core/sa_query.c
--- org/core/sa_query.c 2004-11-09 12:51:35.000000000 -0800
+++ new/core/sa_query.c 2004-11-09 13:30:38.000000000 -0800
@@ -547,7 +547,6 @@ int ib_sa_path_rec_get(struct ib_device
*sa_query = &query->sa_query;
ret = send_mad(&query-
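To make the disagreement concrete, here is a simplified sketch of the
pattern (hypothetical code, not the actual sa_query.c): if *sa_query is
filled in before send_mad() and the send fails, the caller is left
holding a pointer to freed memory -- which is harmless only if, as
argued above, callers never touch the out-parameter after an error
return.

int example_path_rec_get(struct ib_sa_query **sa_query)
{
        struct ib_sa_query *query;
        int ret;

        query = kmalloc(sizeof *query, GFP_KERNEL);
        if (!query)
                return -ENOMEM;

        *sa_query = query;        /* the assignment the patch removes */

        ret = send_mad(query);    /* hypothetical send helper */
        if (ret) {
                kfree(query);     /* *sa_query now dangles ... */
                *sa_query = NULL; /* ... unless cleared (the alternative fix) */
        }
        return ret;
}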
On Tue, 2004-11-09 at 15:53, Roland Dreier wrote:
> By the way, the i386 system is connected directly to the switch
> running the SM,
That's the config I run in too.
> while the ppc64 system is a few hops away.
I think Sean's original config was a couple of hops.
> So it's
> just as likely to
OK, this works on my i386 system but I'm still getting
ib_mad: Invalid directed route
on ppc64. I'll try to debug what exactly is happening (ie put some
prints in to see why smi_handle_dr_smp_send() is rejecting it).
- R.
agent: Fix agent_mad_send PCI mapping and gather address and length
Index: agent.c
===
--- agent.c (revision 1183)
+++ agent.c (working copy)
@@ -116,10 +116,10 @@
/* PCI mapping */
gather_list.addr = pci_map
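Based on the discussion later in this thread, the bug is that the
gather entry started at the GRH and covered more than 256 bytes. A
sketch of the corrected mapping, with illustrative names (pdev,
mad_priv, and the buffer layout are assumptions, not the actual agent.c
code):

gather_list.addr = pci_map_single(pdev,               /* the HCA's pci_dev */
                                  &mad_priv->mad,     /* start at the MAD,
                                                         past the GRH */
                                  sizeof mad_priv->mad, /* 256 bytes, not the
                                                           whole buffer */
                                  PCI_DMA_TODEVICE);
gather_list.length = sizeof mad_priv->mad;
gather_list.lkey = mad_agent->mr->lkey;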
On Tue, 2004-11-09 at 15:25, Hal Rosenstock wrote:
> Doesn't that just map starting at the GRH ? This is to handle PMA
> responses which might have GRHs.
Never mind. I see the problem.
-- Hal
Hal> Doesn't that just map starting at the GRH ? This is to handle
Hal> PMA responses which might have GRHs.
Sure, it maps starting at the GRH and uses that as the start of the
gather segment used for the send (and tries to send more than 256
bytes). This is wrong even when sending a pack
Sean> Wouldn't this result in sending the GRH data buffer before
Sean> the MAD buffer?
Yes, it sure looks that way.
Sean> Does mthca check the size of sends that are
Sean> posted to QP0/1 and report an error if they are larger than
Sean> 256 bytes?
No, it will probably send i
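A sketch of the kind of guard Sean is asking about -- which, per
Roland, mthca does not implement -- purely illustrative (the qp_num
test and total_wr_length() are assumptions):

/* In the post-send path: MADs on the special QPs are at most 256
 * bytes (sizeof (struct ib_mad)); total_wr_length() would sum the
 * gather entries of the work request. */
if (qp->qp_num <= 1 && total_wr_length(wr) > sizeof (struct ib_mad))
        return -EINVAL; /* would have caught this bug at post time */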
On Tue, 2004-11-09 at 14:39, Roland Dreier wrote:
> By the way, we probably want this applied:
Thanks. Applied.
-- Hal
Sean Hefty wrote:
I have two nodes directly connected. When trying to bring up the openib
node, I receive a local length error on the CQ after trying to perform a
send.
I'm continuing to debug...
static int agent_mad_send(struct ib_mad_agent *mad_agent,
struct ib_agent
I have two nodes directly connected. When trying to bring up the openib
node, I receive a local length error on the CQ after trying to perform a
send.
I'm continuing to debug...
- Sean
By the way, we probably want this applied:
Index: core/mad.c
===
--- core/mad.c (revision 1184)
+++ core/mad.c (working copy)
@@ -385,7 +385,7 @@
mad_agent->device->node_type,
One more thing on this I forgot to post:
As I am not yet set up with Kegel cross tools (and don't have a machine
where the pci_ macros are non-trivial), I would appreciate it if someone
could verify these changes (or the latest code) on an architecture where
the pci_ macros are non-trivial.
Thanks.
mad: In ib_mad_recv_done_handler, don't dispatch in additional case
Index: mad.c
===
--- mad.c (revision 1183)
+++ mad.c (working copy)
@@ -1161,8 +1161,8 @@
port_priv->devic
Hal Rosenstock wrote:
mad: In ib_mad_recv_done_handler, don't dispatch additional error cases
+ if (ret & IB_MAD_RESULT_SUCCESS) {
+ if (ret & IB_MAD_RESULT_REPLY) {
+ if (response->mad_hdr.mgmt_class ==
+
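For context, the shape of the logic being reviewed, reconstructed
around the IB_MAD_RESULT_* flags (the call signature approximates the
gen2 process_mad hook; the surrounding code and the agent_send_reply
helper are assumptions, not quotes from the patch):

/* process_mad() returns a bitmask: SUCCESS means the driver handled
 * the MAD; REPLY means 'response' now holds a reply to send back
 * (e.g. a PMA response) rather than a MAD to dispatch to an agent. */
ret = port_priv->device->process_mad(port_priv->device, 0, port_num,
                                     wc->slid, recv_mad, response);
if (ret & IB_MAD_RESULT_SUCCESS) {
        if (ret & IB_MAD_RESULT_REPLY) {
                agent_send_reply(response, wc); /* illustrative helper */
                goto out;       /* a reply was generated; nothing to
                                   dispatch locally */
        }
} else {
        /* Driver did not handle it: fall through and dispatch the
         * received MAD to a registered agent, if any. */
}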
Hal> Hi Roland, If a multicast send is attempted and the node is
Hal> not joined to the multicast group which is the destination of
Hal> the send, a send only join (which is neutered due to lack of
Hal> SM support) is assumed. Is my understanding correct ?
Yes.
Hal> Linux also
Hi Roland,
If a multicast send is attempted and the node is not joined to the
multicast group which is the destination of the send, a send only join
(which is neutered due to lack of SM support) is assumed. Is my
understanding correct ?
Linux also supports multicast routing. For this case, I thin
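For reference, the join types in question as encoded in the JoinState
component of the SA MCMemberRecord (the bit assignments are the
standard ones from IBA vol. 1; the enum wrapper is only illustrative):

enum {
        MC_JOIN_STATE_FULL      = 1 << 0, /* full member: send and receive */
        MC_JOIN_STATE_NON       = 1 << 1, /* non-member */
        MC_JOIN_STATE_SEND_ONLY = 1 << 2, /* send-only non-member: the
                                             fallback described above,
                                             when the SM supports it */
};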
Hal Rosenstock wrote:
Since the agent does not use solicited sends, are its sends completed in
order (so this is only an issue for clients using solicited sends) ?
I would think that order would be easier to maintain for solicited sends
(i.e. responses), since those wouldn't have a timeout. But my p
mad/agent: Modify receive buffer allocation strategy
(Inefficiency pointed out by Sean; algorithm described by Roland)
Problem: Currently, if the underlying driver provides a process_mad
routine, a response MAD is allocated every time a MAD is received on
QP 0 or 1.
Solution: The MAD layer can al
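The solution text is truncated above, so the following is only one
plausible reading of it, sketched with illustrative names
(port_priv->response_mad, alloc_mad_priv(), and agent_send_reply() are
assumptions): cache one response buffer per port and replace it only
when the driver actually consumes it by returning IB_MAD_RESULT_REPLY.

if (!port_priv->response_mad)
        port_priv->response_mad = alloc_mad_priv(GFP_KERNEL);
if (!port_priv->response_mad)
        goto out;       /* no buffer; drop or dispatch without a reply */

ret = port_priv->device->process_mad(port_priv->device, 0, port_num,
                                     wc->slid, recv_mad,
                                     &port_priv->response_mad->mad);
if ((ret & IB_MAD_RESULT_SUCCESS) && (ret & IB_MAD_RESULT_REPLY)) {
        agent_send_reply(port_priv->response_mad, wc); /* illustrative */
        port_priv->response_mad = NULL; /* consumed; reallocate next time */
}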
On Tue, 2004-11-09 at 10:37, Roland Dreier wrote:
> By the way, reposting the receives is not the right thing to do on
> error -- the QP will be in the error state, so any new work requests
> will just complete with a flush status. We need to reset the QP and
> start over to recover from errors.
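A minimal sketch of the recovery sequence Roland describes, for a UD QP
such as QP0/1 (the attribute masks are the standard ones for a UD QP;
error paths and the gen2 QP0/1 specifics are elided):

static int recover_mad_qp(struct ib_qp *qp, u8 port, u16 pkey_index, u32 qkey)
{
        struct ib_qp_attr attr;
        int ret;

        /* The QP is in IB_QPS_ERR after the error completion; the
         * transition to RESET flushes any remaining work requests. */
        attr.qp_state = IB_QPS_RESET;
        ret = ib_modify_qp(qp, &attr, IB_QP_STATE);
        if (ret)
                return ret;

        attr.qp_state = IB_QPS_INIT;
        attr.port_num = port;
        attr.pkey_index = pkey_index;
        attr.qkey = qkey;
        ret = ib_modify_qp(qp, &attr, IB_QP_STATE | IB_QP_PORT |
                           IB_QP_PKEY_INDEX | IB_QP_QKEY);
        if (ret)
                return ret;

        attr.qp_state = IB_QPS_RTR;
        ret = ib_modify_qp(qp, &attr, IB_QP_STATE);
        if (ret)
                return ret;

        attr.qp_state = IB_QPS_RTS;
        attr.sq_psn = 0;
        ret = ib_modify_qp(qp, &attr, IB_QP_STATE | IB_QP_SQ_PSN);
        if (ret)
                return ret;

        /* ... and only now repost the receive work requests ... */
        return 0;
}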
Hal> In ipoib_ib_handle_wc when status != success, isn't the WC
Hal> opcode invalid ? Also, in that case, don't receives also need
Hal> to be reposted ?
Yes, the error handling in IPoIB needs to be fixed.
- R.
There can be several problems:
- The retry count is too small -- try the maximum value, 7.
- The timeout may be too small, so the HCA starts to retry too
aggressively -- try enlarging it to 21.
- The PSNs of the two sides are not synchronized.
- The link failed.
- The QP in the oth
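For reference, the first two knobs are RC QP attributes applied on the
transition to RTS; in the kernel verbs API they look like the sketch
below (VAPI spells the names differently, and the remaining mask bits
for a full bring-up are omitted):

struct ib_qp_attr attr = {
        .qp_state  = IB_QPS_RTS,
        .timeout   = 21, /* local ACK timeout: 4.096 us * 2^21, about 8.6 s */
        .retry_cnt = 7,  /* transport retry count: 7 is the maximum */
};
int ret = ib_modify_qp(qp, &attr,
                       IB_QP_STATE | IB_QP_TIMEOUT | IB_QP_RETRY_CNT
                       /* | IB_QP_RNR_RETRY | IB_QP_SQ_PSN | ... */);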
Hi Roland,
In ipoib_ib_handle_wc when status != success, isn't the WC opcode
invalid ? Also, in that case, don't receives also need to be reposted ?
-- Hal
On Mon, 2004-11-08 at 19:51, Roland Dreier wrote:
> Sean> * If the underlying driver provides a process_mad routine, a
> Sean> response MAD is allocated every time a MAD is received on QP
> Sean> 0 or 1. Can we either push this allocation down into the
> Sean> HCA driver, or find a
agent: Rename ib_get_agent_mad to ib_get_agent_port and eliminate
duplicated call to it in agent_mad_send (pointed out by Sean Hefty)
Index: agent.c
===
--- agent.c (revision 1180)
+++ agent.c (working copy)
@@ -35,8 +35,8 @@
On Mon, 2004-11-08 at 19:27, Sean Hefty wrote:
> A couple of comments (so far) while tracing through the MAD agent code.
>
> * There are a couple of places where ib_get_agent_mad() will be called
> multiple times in the same execution path. For example agent_send calls
> it, as does agent_mad_s
mad: In ib_mad_recv_done_handler, don't dispatch additional error cases
Index: mad.c
===
--- mad.c (revision 1180)
+++ mad.c (working copy)
@@ -1138,26 +1138,27 @@
wc->s
On Mon, 2004-11-08 at 18:48, Sean Hefty wrote:
> Looking at the latest changes to ib_mad_recv_done_handler, I have a
> couple of questions:
> * If process_mad consumes the MAD, should the code just goto out?
> Something more like:
>
> ret = port_priv->device->process_mad(...)
> if (
The corresponding IB macro is IB_COMP_RETRY_EXC_ERR.
Hi,
I use the MPICH 1.2.5 and MVAPICH 0.9.2 stack, and when I run some of my
Fortran applications, my application sometimes crashes, producing the
following error:
===
Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81
mpi_latency: mpid/ch_vapi/viacheck.c:2109: viutil_sp