Re: [ewg] OpenSM 1.5.4 Boot Problem

2011-12-27 Thread Hector Abrach
Hal,

Thank you for your help. I believe I have found the problem. It occurs in
the function you pointed me to (timeout_sends), in the QNX code, because QNX
doesn't compute jiffies automatically. Whoever wrote the driver introduced a
bug in the jiffies computation that always returned the same time, so
nothing ever timed out. This shows up in the following statement:

if (time_after(mad_send_wr->timeout, jiffies)) {

I noticed that when a send does time out, this if statement eventually
becomes true, maybe after 1 or 2 tries. That tells me the send will fail
anyway. If that's the case, is there a point in having that if statement?
Why not just time out right away?
Anyway, thank you once again.
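
A minimal sketch of why a frozen tick counter keeps that check from ever
letting a send expire. It assumes the usual wraparound-safe comparison
behind the kernel's time_after macro; tick_after and send_expired are
hypothetical names used only for illustration, not the QNX driver's code.

```c
#include <assert.h>
#include <limits.h>

/* Illustrative stand-in for time_after(a, b): true when tick count a is
 * later than tick count b, even if the counter has wrapped around.
 * Assumes the common two's-complement unsigned-to-signed conversion. */
static int tick_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/* Hypothetical helper: a send has expired once its deadline is no longer
 * after the current tick count. */
static int send_expired(unsigned long deadline, unsigned long now)
{
    return !tick_after(deadline, now);
}
```

If the tick value passed as now never advances, send_expired keeps
returning 0 for any deadline set in the future, which would match the
"never timing out" behavior described above. With a counter that does
advance, the check only defers sends whose deadline has not been reached
yet.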

Hector Abrach



From:
Hal Rosenstock h...@dev.mellanox.co.il
To:
Hector Abrach habr...@tmriusa.com
Cc:
ewg@lists.openfabrics.org
Date:
12/22/2011 07:24 AM
Subject:
Re: [ewg] OpenSM 1.5.4 Boot Problem



Hector,

On 12/21/2011 2:16 PM, Hector Abrach wrote:
 Hal,
 
 When an SMP times out how does the Linux kernel know it timed out?

The kernel MAD module maintains a timer list for outstanding
transactions and if no response is received before the timer expires, it
knows that transaction timed out. If the matching response is received,
it removes that transaction from that list. See timeout_sends in
drivers/infiniband/core/mad.c
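
As a rough illustration of the bookkeeping described above (not the
kernel's actual code), the sketch below keeps a small table of outstanding
transactions with deadlines. A matching response removes an entry, and a
periodic expiry pass removes and counts entries whose deadline has passed.
All names here (pending_send, wait_list_add, and so on) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 8

struct pending_send {
    unsigned long tid;      /* transaction ID of the request */
    unsigned long deadline; /* tick count at which the request expires */
    int in_use;
};

struct wait_list {
    struct pending_send slot[MAX_PENDING];
};

/* Record an outstanding request; returns 0 on success, -1 if full. */
static int wait_list_add(struct wait_list *wl, unsigned long tid,
                         unsigned long deadline)
{
    for (size_t i = 0; i < MAX_PENDING; i++) {
        if (!wl->slot[i].in_use) {
            wl->slot[i] = (struct pending_send){
                .tid = tid, .deadline = deadline, .in_use = 1 };
            return 0;
        }
    }
    return -1;
}

/* A matching response arrived: drop the entry. Returns 1 if found. */
static int wait_list_complete(struct wait_list *wl, unsigned long tid)
{
    for (size_t i = 0; i < MAX_PENDING; i++) {
        if (wl->slot[i].in_use && wl->slot[i].tid == tid) {
            wl->slot[i].in_use = 0;
            return 1;
        }
    }
    return 0;
}

/* Timer pass: drop every entry whose deadline has been reached and
 * return how many timed out (each would trigger a send error). */
static int wait_list_expire(struct wait_list *wl, unsigned long now)
{
    int expired = 0;
    for (size_t i = 0; i < MAX_PENDING; i++) {
        if (wl->slot[i].in_use && (long)(now - wl->slot[i].deadline) >= 0) {
            wl->slot[i].in_use = 0;
            expired++;
        }
    }
    return expired;
}
```

The expiry pass depends entirely on now advancing; if the tick source is
stuck, no entry is ever reported as timed out.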

 When the Linux kernel determines that it timed out, how does it signal
 the timeout/retry/send error to OpenSM? Through what function calls does
 this signal go?
 
 I noticed that cl_event_wait_on in vl15_poller() is passed
 EVENT_NO_TIMEOUT. Should this be a timeout, or should the timeout
 occur somewhere else? This is where it stalls.

Yes, I've already responded about this several times. I'm reasonably
sure that this is due to erroneous QP0/VL15 accounting due to lack of
timeouts.

 Do you know somewhere I could read a little bit more about the Linux
 Kernel timeout and how it interacts with OpenSM?

In terms of the kernel, look at:
linux/Documentation/infiniband/user_mad.txt
and
include/rdma/ib_mad.h and ib_user_mad.h

OpenSM uses osm_vendor_ibumad.c which is layered on top of libibumad. In
osm_vendor_ibumad.c, the send error callback is invoked for transaction
timeout in umad_receiver. For libibumad, see umad_status and umad_send
man pages.

-- Hal

 Thank you for the help and your insight.
 
 Hector Abrach
 
 
 From:  Hal Rosenstock h...@dev.mellanox.co.il
 To:    Hector Abrach habr...@tmriusa.com
 Cc:ewg@lists.openfabrics.org
 Date:  12/16/2011 11:11 AM
 Subject:   Re: [ewg] OpenSM 1.5.4 Boot Problem
 
 
 
 
 
 
 Hector,
 
 On 12/16/2011 11:59 AM, Hector Abrach wrote:
 Hal,

 Is timeout/retry/send error support implemented in your QNX
 implementation ? That would explain why the SM appears to stop...

 Based on the inherent nature of the QNX Kernel I don't believe we have a
 timeout/retry/send on it. This may be the reason I see the bootup
 freeze. If it is I may have to implement this somehow.
 
 I think that's the OpenSM side of the failure as a timed out transaction
 never times out so the MAD accounting is wrong, etc. It breaks that
 fundamental assumption.
 
 There may also be some issue with the SMA implementation on your QNX
 nodes which is the root cause. Of course, SMPs are unreliable so
 timeout/retries can be needed...
 
 However, for the time being at least, I believe that setting
 OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 will be an acceptable solution as it
 works reliably. But it would be nice to know why it freezes anyway;
 maybe it is because of the above.

 Thus far I've been unsuccessful in failing with debug property -D 0x23
 but I'll keep trying.
 
 That slows things down enough to make it work as does 1 SMP outstanding.
 It appears when SMPs are pipelined, some get dropped...
 
 -- Hal
 
 Thank you

 Hector Abrach


 From:  Hal Rosenstock h...@dev.mellanox.co.il
 To:  Hector Abrach habr...@tmriusa.com
 Date:  12/15/2011 01:21 PM
 Subject:  Re: [ewg] OpenSM 1.5.4 Boot Problem


 




 On 12/15/2011 1:57 PM, Hector Abrach wrote:
 Hal,

 I managed to get it to fail with debug information (-D 0x08). Attached is
 the log file.
 I'll dig deeper; it seems to be pkey related, maybe...

 Yes, I saw signs of that last night from the log you sent where it
 stopped on the pkey tables on the CAs but I wasn't 100% sure whether it
 was that or not. I didn't check how many pairs of the pkey tables you
 got back here to validate whether every port responded with the proper
 number of pkey table blocks.

 Is timeout/retry/send error support implemented in your QNX
 implementation ? That would explain why the SM appears to stop...

 -- Hal

 Once again thank you for your support.



 Hector Abrach


 From:  Hal

Re: [ewg] OpenSM 1.5.4 Boot Problem

2011-12-22 Thread Hal Rosenstock
Hector,

On 12/21/2011 2:16 PM, Hector Abrach wrote:
 Hal,
 
 When an SMP times out how does the Linux kernel know it timed out?

The kernel MAD module maintains a timer list for outstanding
transactions and if no response is received before the timer expires, it
knows that transaction timed out. If the matching response is received,
it removes that transaction from that list. See timeout_sends in
drivers/infiniband/core/mad.c

 When the Linux kernel determines that it timed out, how does it signal
 the timeout/retry/send error to OpenSM? Through what function calls does
 this signal go?
 
 I noticed that cl_event_wait_on in vl15_poller() is passed
 EVENT_NO_TIMEOUT. Should this be a timeout, or should the timeout
 occur somewhere else? This is where it stalls.

Yes, I've already responded about this several times. I'm reasonably
sure that this is due to erroneous QP0/VL15 accounting due to lack of
timeouts.

 Do you know somewhere I could read a little bit more about the Linux
 Kernel timeout and how it interacts with OpenSM?

In terms of the kernel, look at:
linux/Documentation/infiniband/user_mad.txt
and
include/rdma/ib_mad.h and ib_user_mad.h

OpenSM uses osm_vendor_ibumad.c which is layered on top of libibumad. In
osm_vendor_ibumad.c, the send error callback is invoked for transaction
timeout in umad_receiver. For libibumad, see umad_status and umad_send
man pages.

-- Hal

 Thank you for the help and your insight.
 
 Hector Abrach
 
 
 From: Hal Rosenstock h...@dev.mellanox.co.il
 To:   Hector Abrach habr...@tmriusa.com
 Cc:   ewg@lists.openfabrics.org
 Date: 12/16/2011 11:11 AM
 Subject:  Re: [ewg] OpenSM 1.5.4 Boot Problem
 
 
 
 
 
 
 Hector,
 
 On 12/16/2011 11:59 AM, Hector Abrach wrote:
 Hal,

 Is timeout/retry/send error support implemented in your QNX
 implementation ? That would explain why the SM appears to stop...

 Based on the inherent nature of the QNX Kernel I don't believe we have a
 timeout/retry/send on it. This may be the reason I see the bootup
 freeze. If it is I may have to implement this somehow.
 
 I think that's the OpenSM side of the failure as a timed out transaction
 never times out so the MAD accounting is wrong, etc. It breaks that
 fundamental assumption.
 
 There may also be some issue with the SMA implementation on your QNX
 nodes which is the root cause. Of course, SMPs are unreliable so
 timeout/retries can be needed...
 
 However, for the time being at least, I believe that setting
 OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 will be an acceptable solution as it
 works reliably. But it would be nice to know why it freezes anyway;
 maybe it is because of the above.

 Thus far I've been unsuccessful in failing with debug property -D 0x23
 but I'll keep trying.
 
 That slows things down enough to make it work as does 1 SMP outstanding.
 It appears when SMPs are pipelined, some get dropped...
 
 -- Hal
 
 Thank you

 Hector Abrach


 From:  Hal Rosenstock h...@dev.mellanox.co.il
 To:  Hector Abrach habr...@tmriusa.com
 Date:  12/15/2011 01:21 PM
 Subject:  Re: [ewg] OpenSM 1.5.4 Boot Problem


 



 On 12/15/2011 1:57 PM, Hector Abrach wrote:
 Hal,

 I managed to get it to fail with Debug information -D 0x08. Attached is
 the log file.
 I'll dig deeper; it seems to be pkey related, maybe...

 Yes, I saw signs of that last night from the log you sent where it
 stopped on the pkey tables on the CAs but I wasn't 100% sure whether it
 was that or not. I didn't check how many pairs of the pkey tables you
 got back here to validate whether every port responded with the proper
 number of pkey table blocks.

 Is timeout/retry/send error support implemented in your QNX
 implementation ? That would explain why the SM appears to stop...

 -- Hal

 Once again thank you for your support.



 Hector Abrach


 From:  Hal Rosenstock h...@dev.mellanox.co.il
 To:  Hector Abrach habr...@tmriusa.com
 Date:  12/14/2011 08:29 PM
 Subject:  Re: [ewg] OpenSM 1.5.4 Boot Problem


 



 Hector,

 On 12/14/2011 5:49 PM, Hector Abrach wrote:
 Hal,

 I got the system to fail with verbose enabled after 25 reboots. Please
 find attached the log file.


 I can see the responses but not the requests. What verbosity level did
 you use ?

 I was reading that OSM_DEFAULT_SMP_MAX_ON_WIRE is used to pipeline the
 boot process in multi-switch systems and make the boot process faster
 correct?

 It's multinode not just multiswitch and this configuration is 8 nodes (1
 switch + 7 CAs). It's not boot process but discovery/initialization
 which is pipelined.

 Since my system is a single switch system I do not need to have
 4 but 1

Re: [ewg] OpenSM 1.5.4 Boot Problem

2011-12-21 Thread Hector Abrach
Hal,

When an SMP times out how does the Linux kernel know it timed out?

When the Linux kernel determines that it timed out, how does it signal
the timeout/retry/send error to OpenSM? Through what function calls does
this signal go?

I noticed that cl_event_wait_on in vl15_poller() is passed
EVENT_NO_TIMEOUT. Should this be a timeout, or should the timeout
occur somewhere else? This is where it stalls.

Do you know somewhere I could read a little bit more about the Linux 
Kernel timeout and how it interacts with OpenSM?
Thank you for the help and your insight.

Hector Abrach



From:
Hal Rosenstock h...@dev.mellanox.co.il
To:
Hector Abrach habr...@tmriusa.com
Cc:
ewg@lists.openfabrics.org
Date:
12/16/2011 11:11 AM
Subject:
Re: [ewg] OpenSM 1.5.4 Boot Problem



Hector,

On 12/16/2011 11:59 AM, Hector Abrach wrote:
 Hal,
 
 Is timeout/retry/send error support implemented in your QNX
 implementation ? That would explain why the SM appears to stop...
 
 Based on the inherent nature of the QNX Kernel I don't believe we have a
 timeout/retry/send on it. This may be the reason I see the bootup
 freeze. If it is I may have to implement this somehow.

I think that's the OpenSM side of the failure as a timed out transaction
never times out so the MAD accounting is wrong, etc. It breaks that
fundamental assumption.

There may also be some issue with the SMA implementation on your QNX
nodes which is the root cause. Of course, SMPs are unreliable so
timeout/retries can be needed...

 However, for the time being at least, I believe that setting
 OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 will be an acceptable solution as it
 works reliably. But it would be nice to know why it freezes anyway;
 maybe it is because of the above.
 
 Thus far I've been unsuccessful in failing with debug property -D 0x23
 but I'll keep trying.

That slows things down enough to make it work as does 1 SMP outstanding.
It appears when SMPs are pipelined, some get dropped...

-- Hal

 Thank you
 
 Hector Abrach
 
 
 From:  Hal Rosenstock h...@dev.mellanox.co.il
 To:    Hector Abrach habr...@tmriusa.com
 Date:  12/15/2011 01:21 PM
 Subject:   Re: [ewg] OpenSM 1.5.4 Boot Problem
 
 
 
 
 
 
 On 12/15/2011 1:57 PM, Hector Abrach wrote:
 Hal,

 I managed to get it to fail with Debug information -D 0x08. Attached is
 the log file.
 I'll dig deeper; it seems to be pkey related, maybe...
 
 Yes, I saw signs of that last night from the log you sent where it
 stopped on the pkey tables on the CAs but I wasn't 100% sure whether it
 was that or not. I didn't check how many pairs of the pkey tables you
 got back here to validate whether every port responded with the proper
 number of pkey table blocks.
 
 Is timeout/retry/send error support implemented in your QNX
 implementation ? That would explain why the SM appears to stop...
 
 -- Hal
 
 Once again thank you for your support.



 Hector Abrach


 From:  Hal Rosenstock h...@dev.mellanox.co.il
 To:  Hector Abrach habr...@tmriusa.com
 Date:  12/14/2011 08:29 PM
 Subject:  Re: [ewg] OpenSM 1.5.4 Boot Problem


 




 Hector,

 On 12/14/2011 5:49 PM, Hector Abrach wrote:
 Hal,

 I got the system to fail with verbose enabled after 25 reboots. Please
 find attached the log file.


 I can see the responses but not the requests. What verbosity level did
 you use ?

 I was reading that OSM_DEFAULT_SMP_MAX_ON_WIRE is used to pipeline the
 boot process in multi-switch systems and make the boot process faster
 correct?

 It's multinode not just multiswitch and this configuration is 8 nodes (1
 switch + 7 CAs). It's not boot process but discovery/initialization
 which is pipelined.

 Since my system is a single switch system I do not need to have
 4 but 1 for OSM_DEFAULT_SMP_MAX_ON_WIRE.

 You can run with 1 if that suits your needs. It's just not the default.

 Maybe the pipelined SMPs are confusing the switch somehow.

 Even if it did, there's nothing that should stop the SM from
 working/proceeding. From the log, it looks like the SM does get stuck.

 -- Hal

 Thanks again for your help.

 Hector Abrach


 From:  Hal Rosenstock h...@dev.mellanox.co.il
 To:  Hector Abrach habr...@tmriusa.com
 Cc:  ewg@lists.openfabrics.org
 Date:  12/14/2011 08:03 AM
 Subject:  Re: [ewg] OpenSM 1.5.4 Boot Problem


 




 Hi,

 On 12/13/2011 2:35 PM, Hector Abrach wrote:
 Hello,

 I have a boot problem with OpenSM

 Are you saying the switch is booted rather than OpenSM ?

 What is the OpenSM running on and in what environment ?

 the problem occurs seldom and
 started to occur when we started using a new Mellanox

Re: [ewg] OpenSM 1.5.4 Boot Problem

2011-12-16 Thread Alex Netes
Hi Hector,

Few more questions.
Does this happen to you only when you try to shut down the OpenSM on reboot?
What is the host cpu architecture? x86/x86_64/ppc?


 -Original Message-
 From: ewg-boun...@lists.openfabrics.org [mailto:ewg-
 boun...@lists.openfabrics.org] On Behalf Of Hal Rosenstock
 Sent: Thursday, December 15, 2011 9:06 PM
 To: Hector Abrach
 Cc: ewg@lists.openfabrics.org
 Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem
 
 Hector,
 
 On 12/15/2011 12:49 PM, Hector Abrach wrote:
  Hal,
 
  Thank you for the response. To address your questions:
 
  So the switch stays up and the servers (including the one OpenSM is
  on) is rebooted, right ?
 
  Right.
 
  Do the servers run QNX rather than Linux ? Are you saying all OpenSM
  code is the same as stock OpenSM 3.3.12 (OFED 1.5.4-rc3) ?
 
  Yes, all 7 servers run QNX. The OpenSM code is 99% the same, the only
  changes I had to make were made to some #define libraries.
  The big changes were made for the driver, not so much OpenSM.
 
 I would think there are also changes for porting of complib to QNX. Do you
 use osm_vendor_ibumad.c as the OpenSM vendor layer ?
 
  I'm using IBNet 1.3.
 
 What's IBNet 1.3 ? I'm not familiar with that.
 
  OpenSM always runs on the same one server, the others don't run it.
 
 Understood.
 
  Is the topology the 7 servers and the 1 switch and if you use other
  switches you don't see this issue ?
 
  That's correct, the topology is 7 servers and 1 switch. We typically
  use fewer servers (4) for our application but the problem is more
  easily reproducible with more servers so we have a 7 server setup with
  1 switch. We don't have a great selection of switches but I know our
  previous switch did not cause this problem. Our intention is to go to
  production with this new switch but we can't release until we find an
  acceptable solution.
 
 I can see the responses but not the requests. What verbosity level did
 you use ?
 
  I ran OpenSM with level -D 0x06 (error, info, verbose). I don't want
  to do -D 0xFF because I know this fixes the problem for sure.
 
 I think -D 0x23 (error, info, frames) would do the trick...
 
  -
 
  In summary:
  1. The system gets stuck in osm_vendor_ibumad.c - umad_receiver() -
  for(;;) but keeps running properly in main.c - osm_manager_loop().
  2. If I use -D 0xFF the problem is completely fixed.
  3. If I use OSM_DEFAULT_SMP_MAX_ON_WIRE of 1 instead of any other
  value the problem is completely fixed.
  4. The failure always occurs with qp0_mads_outstanding of 1
  remaining.
  What do you think could be wrong?
  Do you think the driver could be the problem?
 
 Yes; The thing that I think is a likely suspect and may be missing and causing
 this issue is the (built in to kernel MAD in Linux) timeout retry code for MAD
 transactions which if the timeout/retries are exhausted triggers a send error
 (callback). Is that implemented ?
 
 However, I don't have a good explanation for why you see this now and not
 before with your other switches but maybe that's not important.
 
  What debug command should I use to see the sent requests?
 
 See above.
 
 -- Hal
 
  Thank you
 
  Hector Abrach
 
 
 
 
  From:     Hal Rosenstock h...@dev.mellanox.co.il
  To:       Hector Abrach habr...@tmriusa.com
  Cc:       ewg@lists.openfabrics.org
  Date:     12/14/2011 08:23 PM
  Subject:  Re: [ewg] OpenSM 1.5.4 Boot Problem
 
 
 
 
 
  Hector,
 
  On 12/14/2011 1:41 PM, Hector Abrach wrote:
  Hal,
 
  Sorry for the multiple emails, but I was thinking how it may be a
  freeze/stall rather than a timeout. One reason is that it
  doesn't send an error message; it is as if the log completely dies.
 
  So nothing interesting in the log...
 
  However, in
  file osm_vendor_ibumad.c under function umad_receiver there is an
  infinite loop for(;;) which seems to die when I get to that
  previously discussed vl15_poller. I checked to see if it breaks out
  of the loop but it doesn't seem to.
 
  It never breaks out of that loop except when OpenSM is shutting down.
  That's the basic receive loop.
 
  -- Hal
 
  I'm not sure if this may be an additional hint.
  Thank you
 
  Hector Abrach
 
 
  From:  Hector Abrach habr...@tmriusa.com
  To:  Hal Rosenstock h...@dev.mellanox.co.il
  Cc:  ewg@lists.openfabrics.org
  Date:  12/14/2011 11:15 AM
  Subject:  Re: [ewg] OpenSM 1.5.4 Boot Problem
  Sent by:  ewg-boun...@lists.openfabrics.org
 
 
 
 
 
  Hal,
 
  Thank you very much for the support, I am the same person from the
  gmail account so I will respond through here.
 
  Attached is a picture of the switch serial number:
 
 
 
  I am indeed using OFED 1.5.4-rc3. My experiment

Re: [ewg] OpenSM 1.5.4 Boot Problem

2011-12-16 Thread Hector Abrach
Alex,

 Few more questions.
 Does this happen to you only when you try to shut down the OpenSM on reboot?

Our servers don't have an actual hard drive, which means we boot
remotely. So, when I run the reboot script, OpenSM doesn't shut down
properly (could this affect the switch?). However, it always boots the same
way. The problem occurs while the system is being brought up.
Specifically, for OpenSM it occurs in the Discovering state.

 What is the host cpu architecture? x86/x86_64/ppc?

We use x86_64 but QNX is only a 32-bit OS which means we are technically 
running as 32-bit.
Thanks,

Hector Abrach





From:
Alex Netes ale...@mellanox.com
To:
Hal Rosenstock h...@dev.mellanox.co.il, Hector Abrach 
habr...@tmriusa.com
Cc:
ewg@lists.openfabrics.org ewg@lists.openfabrics.org
Date:
12/16/2011 03:15 AM
Subject:
RE: [ewg] OpenSM 1.5.4 Boot Problem



Hi Hector,

Few more questions.
Does this happen to you only when you try to shut down the OpenSM on reboot?
What is the host cpu architecture? x86/x86_64/ppc?


 -Original Message-
 From: ewg-boun...@lists.openfabrics.org [mailto:ewg-
 boun...@lists.openfabrics.org] On Behalf Of Hal Rosenstock
 Sent: Thursday, December 15, 2011 9:06 PM
 To: Hector Abrach
 Cc: ewg@lists.openfabrics.org
 Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem
 
 Hector,
 
 On 12/15/2011 12:49 PM, Hector Abrach wrote:
  Hal,
 
  Thank you for the response. To address your questions:
 
  So the switch stays up and the servers (including the one OpenSM is
  on) is rebooted, right ?
 
  Right.
 
  Do the servers run QNX rather than Linux ? Are you saying all OpenSM
  code is the same as stock OpenSM 3.3.12 (OFED 1.5.4-rc3) ?
 
  Yes, all 7 servers run QNX. The OpenSM code is 99% the same, the only
  changes I had to make were made to some #define libraries.
  The big changes were made for the driver, not so much OpenSM.
 
 I would think there are also changes for porting of complib to QNX. Do you
 use osm_vendor_ibumad.c as the OpenSM vendor layer ?
 
  I'm using IBNet 1.3.
 
 What's IBNet 1.3 ? I'm not familiar with that.
 
  OpenSM always runs on the same one server, the others don't run it.
 
 Understood.
 
  Is the topology the 7 servers and the 1 switch and if you use other
  switches you don't see this issue ?
 
  That's correct, the topology is 7 servers and 1 switch. We typically
  use fewer servers (4) for our application but the problem is more
  easily reproducible with more servers so we have a 7 server setup with
  1 switch. We don't have a great selection of switches but I know our
  previous switch did not cause this problem. Our intention is to go to
  production with this new switch but we can't release until we find an
  acceptable solution.
 
 I can see the responses but not the requests. What verbosity level did
 you use ?
 
  I ran OpenSM with level -D 0x06 (error, info, verbose). I don't want
  to do -D 0xFF because I know this fixes the problem for sure.
 
 I think -D 0x23 (error, info, frames) would do the trick...
 
  -
 
  In summary:
  1. The system gets stuck in osm_vendor_ibumad.c - umad_receiver() -
  for(;;) but keeps running properly in main.c - osm_manager_loop().
  2. If I use -D 0xFF the problem is completely fixed.
  3. If I use OSM_DEFAULT_SMP_MAX_ON_WIRE of 1 instead of any other
  value the problem is completely fixed.
  4. The failure always occurs with qp0_mads_outstanding of 1
  remaining.
  What do you think could be wrong?
  Do you think the driver could be the problem?
 
 Yes; the thing that I think is a likely suspect and may be missing and
 causing this issue is the (built in to kernel MAD in Linux) timeout retry
 code for MAD transactions which if the timeout/retries are exhausted
 triggers a send error (callback). Is that implemented ?
 
 However, I don't have a good explanation for why you see this now and
 not before with your other switches but maybe that's not important.
 
  What debug command should I use to see the sent requests?
 
 See above.
 
 -- Hal
 
  Thank you
 
  Hector Abrach
 
 
 
 
  From:     Hal Rosenstock h...@dev.mellanox.co.il
  To:       Hector Abrach habr...@tmriusa.com
  Cc:       ewg@lists.openfabrics.org
  Date:     12/14/2011 08:23 PM
  Subject:  Re: [ewg] OpenSM 1.5.4 Boot Problem
 
 
 
 
 
  Hector,
 
  On 12/14/2011 1:41 PM, Hector Abrach wrote:
  Hal,
 
  Sorry for the multiple emails, but I was thinking how it may be a
  freeze/stall rather than a timeout. One reason is that it
  doesn't send an error message; it is as if the log completely dies.
 
  So nothing interesting in the log...
 
  However, in
  file osm_vendor_ibumad.c under function umad_receiver there is an
  infinite loop for(;;) which seems to die when I get to that
  previously discussed vl15_poller. I checked to see

Re: [ewg] OpenSM 1.5.4 Boot Problem

2011-12-16 Thread Hector Abrach
Hal,

 Is timeout/retry/send error support implemented in your QNX
 implementation ? That would explain why the SM appears to stop...

Based on the inherent nature of the QNX Kernel I don't believe we have a
timeout/retry/send on it. This may be the reason I see the bootup freeze. 
If it is I may have to implement this somehow.

However, for the time being at least, I believe that setting 
OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 will be an acceptable solution as it 
works reliably. But it would be nice to know why it freezes anyway;
maybe it is because of the above.

Thus far I've been unsuccessful in failing with debug property -D 0x23 but 
I'll keep trying.
Thank you

Hector Abrach



From:
Hal Rosenstock h...@dev.mellanox.co.il
To:
Hector Abrach habr...@tmriusa.com
Date:
12/15/2011 01:21 PM
Subject:
Re: [ewg] OpenSM 1.5.4 Boot Problem



On 12/15/2011 1:57 PM, Hector Abrach wrote:
 Hal,
 
 I managed to get it to fail with Debug information -D 0x08. Attached is
 the log file.
 I'll dig deeper; it seems to be pkey related, maybe...

Yes, I saw signs of that last night from the log you sent where it
stopped on the pkey tables on the CAs but I wasn't 100% sure whether it
was that or not. I didn't check how many pairs of the pkey tables you
got back here to validate whether every port responded with the proper
number of pkey table blocks.

Is timeout/retry/send error support implemented in your QNX
implementation ? That would explain why the SM appears to stop...

-- Hal

 Once again thank you for your support.
 
 
 
 Hector Abrach
 
 
 From:  Hal Rosenstock h...@dev.mellanox.co.il
 To:    Hector Abrach habr...@tmriusa.com
 Date:  12/14/2011 08:29 PM
 Subject:   Re: [ewg] OpenSM 1.5.4 Boot Problem
 
 
 
 
 
 
 Hector,
 
 On 12/14/2011 5:49 PM, Hector Abrach wrote:
 Hal,

 I got the system to fail with verbose enabled after 25 reboots. Please
 find attached the log file.

 
 I can see the responses but not the requests. What verbosity level did
 you use ?
 
 I was reading that OSM_DEFAULT_SMP_MAX_ON_WIRE is used to pipeline the
 boot process in multi-switch systems and make the boot process faster
 correct?
 
 It's multinode not just multiswitch and this configuration is 8 nodes (1
 switch + 7 CAs). It's not boot process but discovery/initialization
 which is pipelined.
 
 Since my system is a single switch system I do not need to have
 4 but 1 for OSM_DEFAULT_SMP_MAX_ON_WIRE.
 
 You can run with 1 if that suits your needs. It's just not the default.
 
 Maybe the pipelined SMPs are confusing the switch somehow.
 
 Even if it did, there's nothing that should stop the SM from
 working/proceeding. From the log, it looks like the SM does get stuck.
 
 -- Hal
 
 Thanks again for your help.

 Hector Abrach


 From:  Hal Rosenstock h...@dev.mellanox.co.il
 To:  Hector Abrach habr...@tmriusa.com
 Cc:  ewg@lists.openfabrics.org
 Date:  12/14/2011 08:03 AM
 Subject:  Re: [ewg] OpenSM 1.5.4 Boot Problem


 




 Hi,

 On 12/13/2011 2:35 PM, Hector Abrach wrote:
 Hello,

 I have a boot problem with OpenSM

 Are you saying the switch is booted rather than OpenSM ?

 What is the OpenSM running on and in what environment ?

 the problem occurs seldom and
 started to occur when we started using a new Mellanox MT1118X03342 switch.
 The problem occurs during the discovery phase within
 state_mgr_sweep_hop_1.

 However, I discovered that the actual location is because the
 qp0_mads_outstanding stalls at 1 occasionally.

 Is it stuck or after timeout/retry does this get updated properly ?

 Within file osm_vl15intf.c, function vl15_poller checks the
 rfifo, and if the qlist still has items it calls vl15_send_mad,
 which later on triggers the signal.
 With the current default setting of 4 for OSM_DEFAULT_SMP_MAX_ON_WIRE I
 noticed that cl_qlist_end reaches zero before
 stats->qp0_mads_outstanding does. This causes a stall in
 cl_event_wait_on. The rfifo always reaches 0 when there are 4
 qp0_mads_outstanding; however, when it fails it always fails when there is
 1 qp0_mads_outstanding.

 Is some (request) SMP that OpenSM sent timing out (not being responded
 to) ?

 Have you seen this failure? By the way, I see this failure once every 15
 reboots approximately.

 I discovered that changing OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 fixes the
 problem.

 What do you mean exactly by fixes the problem ? I'm not sure I
 understand what the problem is yet.

 -- Hal

 My guess is that there is a race condition when the switch sends 4 SMPs
 in parallel. Also, this failure only appears to occur at reboot. Another
 solution, which is not acceptable, is adding a delay in the process:
 the failure then goes away. It is as if the switch needed more time to do
 something.

 I

Re: [ewg] OpenSM 1.5.4 Boot Problem

2011-12-16 Thread Hal Rosenstock
Hector,

On 12/16/2011 11:59 AM, Hector Abrach wrote:
 Hal,
 
 Is timeout/retry/send error support implemented in your QNX
 implementation ? That would explain why the SM appears to stop...
 
 Based on the inherent nature of the QNX Kernel I don't believe we have a
 timeout/retry/send on it. This may be the reason I see the bootup
 freeze. If it is I may have to implement this somehow.

I think that's the OpenSM side of the failure as a timed out transaction
never times out so the MAD accounting is wrong, etc. It breaks that
fundamental assumption.
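
To make the accounting point concrete, here is a minimal sketch of a send
window with a "max on wire" limit. It is an assumption-laden model, not
OpenSM's code: smp_window and its functions are hypothetical names, and the
fields only loosely mirror the qp0_mads_outstanding counter and the
OSM_DEFAULT_SMP_MAX_ON_WIRE limit discussed in this thread.

```c
#include <assert.h>

struct smp_window {
    int max_on_wire;  /* limit on requests in flight */
    int outstanding;  /* requests sent but not yet resolved */
    int queued;       /* requests still waiting to be sent */
};

/* Send queued requests while the window has room; returns how many. */
static int window_poll(struct smp_window *w)
{
    int sent = 0;
    while (w->queued > 0 && w->outstanding < w->max_on_wire) {
        w->queued--;
        w->outstanding++;
        sent++;
    }
    return sent;
}

/* A response arrived for one in-flight request. */
static void window_on_response(struct smp_window *w)
{
    if (w->outstanding > 0)
        w->outstanding--;
}

/* The lower layer reported a timeout (send error) for one request. */
static void window_on_send_error(struct smp_window *w)
{
    if (w->outstanding > 0)
        w->outstanding--;
}

/* Discovery can only move on once nothing is queued or in flight. */
static int window_idle(const struct smp_window *w)
{
    return w->queued == 0 && w->outstanding == 0;
}
```

If one of four pipelined requests is dropped and no send error is ever
delivered, outstanding stays at 1 and window_idle never becomes true. That
is the kind of stall a missing timeout path would produce in this model.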

There may also be some issue with the SMA implementation on your QNX
nodes which is the root cause. Of course, SMPs are unreliable so
timeout/retries can be needed...

 However, for the time being at least, I believe that setting
 OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 will be an acceptable solution as it
 works reliably. But it would be nice to know why it freezes anyway;
 maybe it is because of the above.
 
 Thus far I've been unsuccessful in failing with debug property -D 0x23
 but I'll keep trying.

That slows things down enough to make it work as does 1 SMP outstanding.
It appears when SMPs are pipelined, some get dropped...

-- Hal

 Thank you
 
 Hector Abrach
 
 
 From: Hal Rosenstock h...@dev.mellanox.co.il
 To:   Hector Abrach habr...@tmriusa.com
 Date: 12/15/2011 01:21 PM
 Subject:  Re: [ewg] OpenSM 1.5.4 Boot Problem
 
 
 
 
 
 
 On 12/15/2011 1:57 PM, Hector Abrach wrote:
 Hal,

 I managed to get it to fail with Debug information -D 0x08. Attached is
 the log file.
 I'll dig deeper; it seems to be pkey related, maybe...
 
 Yes, I saw signs of that last night from the log you sent where it
 stopped on the pkey tables on the CAs but I wasn't 100% sure whether it
 was that or not. I didn't check how many pairs of the pkey tables you
 got back here to validate whether every port responded with the proper
 number of pkey table blocks.
 
 Is timeout/retry/send error support implemented in your QNX
 implementation ? That would explain why the SM appears to stop...
 
 -- Hal
 
 Once again thank you for your support.



 Hector Abrach


 From:  Hal Rosenstock h...@dev.mellanox.co.il
 To:  Hector Abrach habr...@tmriusa.com
 Date:  12/14/2011 08:29 PM
 Subject:  Re: [ewg] OpenSM 1.5.4 Boot Problem


 



 Hector,

 On 12/14/2011 5:49 PM, Hector Abrach wrote:
 Hal,

 I got the system to fail with verbose enabled after 25 reboots. Please
 find attached the log file.


 I can see the responses but not the requests. What verbosity level did
 you use ?

 I was reading that OSM_DEFAULT_SMP_MAX_ON_WIRE is used to pipeline the
 boot process in multi-switch systems and make the boot process faster
 correct?

 It's multinode not just multiswitch and this configuration is 8 nodes (1
 switch + 7 CAs). It's not boot process but discovery/initialization
 which is pipelined.

 Since my system is a single switch system I do not need to have
 4 but 1 for OSM_DEFAULT_SMP_MAX_ON_WIRE.

 You can run with 1 if that suits your needs. It's just not the default.

 Maybe the pipelined SMPs are confusing the switch somehow.

 Even if it did, there's nothing that should stop the SM from
 working/proceeding. From the log, it looks like the SM does get stuck.

 -- Hal

 Thanks again for your help.

 Hector Abrach


 From:  Hal Rosenstock h...@dev.mellanox.co.il
 To:  Hector Abrach habr...@tmriusa.com
 Cc:  ewg@lists.openfabrics.org
 Date:  12/14/2011 08:03 AM
 Subject:  Re: [ewg] OpenSM 1.5.4 Boot Problem


 



 Hi,

 On 12/13/2011 2:35 PM, Hector Abrach wrote:
 Hello,

 I have a boot problem with OpenSM

 Are you saying the switch is booted rather than OpenSM ?

 What is the OpenSM running on and in what environment ?

 the problem occurs seldom and
 started to occur when we started using a new Mellanox MT1118X03342
 switch.
 The problem occurs during the discovery phase within
 state_mgr_sweep_hop_1.

 However, I discovered that the actual location is because the
 qp0_mads_outstanding stalls at 1 occasionally.

 Is it stuck or after timeout/retry does this get updated properly ?

 Within file osm_vl15intf.c in function vl15_poller it checks at the
 rfifo and if the qlist still has items it applies function vl15_send_mad
 which later on triggers the signal.
 With the current default setting of 4 for OSM_DEFAULT_SMP_MAX_ON_WIRE I
 noticed that cl_qlist_end reaches zero before
 stats->qp0_mads_outstanding does. This causes a stall in
 cl_event_wait_on. The rfifo always reaches 0 when there are 4
 qp0_mads_outstanding however when it fails it always fails when there is
 1 qp0_mad_outstanding.

 Is some (request) SMP that OpenSM

Re: [ewg] OpenSM 1.5.4 Boot Problem

2011-12-15 Thread Hal Rosenstock
Hector,

On 12/15/2011 12:49 PM, Hector Abrach wrote:
 Hal,
 
 Thank you for the response. To address your questions:
 
 So the switch stays up and the servers (including the one OpenSM is on)
 is rebooted, right ?
 
 Right.
 
 Do the servers run QNX rather than Linux ? Are you saying all OpenSM
 code is the same as stock OpenSM 3.3.12 (OFED 1.5.4-rc3) ?
 
 Yes, all 7 servers run QNX. The OpenSM code is 99% the same, the only
 changes I had to make were made to some #define libraries.
 The big changes were made for the driver, not so much OpenSM. 

I would think there are also changes for porting of complib to QNX. Do
you use osm_vendor_ibumad.c as the OpenSM vendor layer ?

 I'm using IBNet 1.3. 

What's IBNet 1.3 ? I'm not familiar with that.

 OpenSM always runs on the same one server, the others don't
 run it.

Understood.

 Is the topology the 7 servers and the 1 switch and if you use other
 switches you don't see this issue ?
 
 That's correct, the topology is 7 servers and 1 switch. We typically use
 less servers (4) for our application but the problem is more easily
 reproducible with more servers so we have a 7 server setup with 1
 switch. We don't have a great selection of switches but I know our
 previous switch did not cause this problem. Our intention is to go to
 production with this new switch but we can't release until we find an
 acceptable solution.
 
 I can see the responses but not the requests. What verbosity level did
 you use ?
 
 I ran OpenSM with level -D 0x06 (error, info, verbose). I don't want to
 do -D 0xFF because I know this fixes the problem for sure.

I think -D 0x23 (error, info, frames) would do the trick...
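
For reference, the -D values in this exchange are bit masks, so 0x23 is three flags OR-ed together. A minimal sketch, assuming the bit values listed in the opensm man page (0x01 ERROR, 0x02 INFO, 0x04 VERBOSE, 0x08 DEBUG, 0x10 FUNCS, 0x20 FRAMES); the SKETCH_* names and helpers are made up for illustration and are not OpenSM's own identifiers:

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values assumed from the opensm man page; names are illustrative only. */
enum sketch_log_bit {
    SKETCH_LOG_ERROR   = 0x01,
    SKETCH_LOG_INFO    = 0x02,
    SKETCH_LOG_VERBOSE = 0x04,
    SKETCH_LOG_DEBUG   = 0x08,
    SKETCH_LOG_FUNCS   = 0x10,
    SKETCH_LOG_FRAMES  = 0x20
};

/* Returns true when the given level is enabled in a -D style mask. */
static bool sketch_log_enabled(uint8_t mask, enum sketch_log_bit bit)
{
    return (mask & (uint8_t)bit) != 0;
}

/* The mask suggested above: error + info + frames. */
static uint8_t sketch_frames_mask(void)
{
    return SKETCH_LOG_ERROR | SKETCH_LOG_INFO | SKETCH_LOG_FRAMES;
}
```

Under these assumed values, 0x06 sets the INFO and VERBOSE bits but not FRAMES, so SMP frame dumps would not be expected at that level.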

 -
 
 In summary:
 1. The system gets stuck in osm_vendor_ibumad.c - umad_receiver() -
 for(;;) but keeps running properly in main.c - osm_manager_loop().
 2. If I use -D 0xFF the problem is completely fixed.
 3. If I use OSM_DEFAULT_SMP_MAX_ON_WIRE of 1 instead of any other
 value the problem is completely fixed.
 4. The failure always occurs with qp0_mads_outstanding of 1
 remaining.
 What do you think could be wrong?
 Do you think the driver could be the problem?

Yes; the thing that I think is a likely suspect, and may be missing and
causing this issue, is the timeout/retry code for MAD transactions (built
in to kernel MAD in Linux), which triggers a send error (callback) if the
timeouts/retries are exhausted. Is that implemented ?
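
The timeout/retry behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the Linux kernel MAD code: every name here (sketch_send, sketch_check_send and so on) is hypothetical, and a real implementation keeps a list of such entries and runs the check from a timer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Outcome of checking one outstanding send against the clock. */
enum sketch_action {
    SKETCH_WAIT,        /* deadline not reached yet */
    SKETCH_RETRY,       /* deadline passed, resend and re-arm the timer */
    SKETCH_SEND_ERROR   /* retries exhausted, report a send error upward */
};

/* Hypothetical bookkeeping for one outstanding MAD request. */
struct sketch_send {
    uint32_t deadline_ms;   /* absolute time at which the send expires */
    uint32_t timeout_ms;    /* per-attempt timeout */
    unsigned retries_left;  /* resends still allowed */
};

/* Wraparound-safe "a is later than b" for a free-running ms counter. */
static bool sketch_time_after(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}

/*
 * Decide what to do with one outstanding send at time now_ms.
 * now_ms must come from a clock that really advances; with a clock
 * stuck at one value this never returns anything but SKETCH_WAIT.
 */
static enum sketch_action sketch_check_send(struct sketch_send *s,
                                            uint32_t now_ms)
{
    if (!sketch_time_after(now_ms, s->deadline_ms))
        return SKETCH_WAIT;
    if (s->retries_left > 0) {
        s->retries_left--;
        s->deadline_ms = now_ms + s->timeout_ms;
        return SKETCH_RETRY;
    }
    return SKETCH_SEND_ERROR;
}
```

The point of the SKETCH_SEND_ERROR branch is that the sender always gets some completion for every request, even when no response ever arrives.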

However, I don't have a good explanation for why you see this now and
not before with your other switches but maybe that's not important.

 What debug command should I use to see the sent requests?

See above.

-- Hal

 Thank you
 
 Hector Abrach
 
 
 
 
 From: Hal Rosenstock h...@dev.mellanox.co.il
 To:   Hector Abrach habr...@tmriusa.com
 Cc:   ewg@lists.openfabrics.org
 Date: 12/14/2011 08:23 PM
 Subject:  Re: [ewg] OpenSM 1.5.4 Boot Problem
 
 
 
 
 
 
 Hector,
 
 On 12/14/2011 1:41 PM, Hector Abrach wrote:
 Hal,

 Sorry for the multiple emails, but I was thinking how it may be a
 freeze /stall rather than a time out.  One reason is that it doesn't
 send an error message, is as if the log completely dies.
 
 So nothing interesting in the log...
 
 However, in
 file osm_vendor_ibumad.c under function umad_receiver there is an
 infinite loop for(;;) which seems to die when I get to that previously
 discussed vl15_poller. I checked to see if it breaks out of the loop but
 it doesn't seem to.
 
 It never breaks out of that loop except when OpenSM is shutting down.
 That's the basic receive loop.
 
 -- Hal
 
 I'm not sure if this may be an additional hint.
 Thank you

 Hector Abrach


 From:  Hector Abrach habr...@tmriusa.com
 To:  Hal Rosenstock h...@dev.mellanox.co.il
 Cc:  ewg@lists.openfabrics.org
 Date:  12/14/2011 11:15 AM
 Subject:  Re: [ewg] OpenSM 1.5.4 Boot Problem
 Sent by:  ewg-boun...@lists.openfabrics.org


 



 Hal,

 Thank you very much for the support, I am the same person from the gmail
 account so I will respond through here.

 Attached is a picture of the switch serial number:



 I am indeed using OFED 1.5.4-rc3. My experiment consists of a 7 server
 system which I reboot via a script over and over again. Technically
 speaking the switch is not being powered off or physically rebooted. My
 server system is what is being rebooted. I am running OpenSM on one of
 the 7 servers. This means I'm constantly shutting down and rebooting
 OpenSM. I am running OpenSM on QNX but we have not had this problem
 until we decided to upgrade to this switch.

 The problem is that every 1 out of 15 of this remote reboots OpenSM
 stalls or times out because stats->qp0_mads_outstanding did not reach
 zero. Please

Re: [ewg] OpenSM 1.5.4 Boot Problem

2011-12-14 Thread Hal Rosenstock
Hi,

On 12/13/2011 2:35 PM, Hector Abrach wrote:
 Hello,
 
 I have a boot problem with OpenSM

Are you saying the switch is booted rather than OpenSM ?

What is the OpenSM running on and in what environment ?

 the problem occurs seldom and
 started to occur when we started using a new Mellanox MT1118X03342 switch.
 The problem occurs during the discovery phase within state_mgr_sweep_hop_1.
 
 However, I discovered that the actual location is because the
 qp0_mads_outstanding stalls at 1 occasionally.

Is it stuck or after timeout/retry does this get updated properly ?

 Within file osm_vl15intf.c in function vl15_poller it checks at the
 rfifo and if the qlist still has items it applies function vl15_send_mad
 which later on triggers the signal.
 With the current default setting of 4 for OSM_DEFAULT_SMP_MAX_ON_WIRE I
 noticed that cl_qlist_end reaches zero before
 stats->qp0_mads_outstanding does. This causes a stall in
 cl_event_wait_on. The rfifo always reaches 0 when there are 4
 qp0_mads_outstanding however when it fails it always fails when there is
 1 qp0_mad_outstanding.

Is some (request) SMP that OpenSM sent timing out (not being responded to) ?

 Have you seen this failure? By the way, I see this failure once every 15
 reboots approximately.
 
 I discovered that changing OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 fixes the
 problem.

What do you mean exactly by fixes the problem ? I'm not sure I
understand what the problem is yet.

-- Hal

 My guess is that there is a race condition when the switch sends 4 SMPs
 in parallel. Also, this failure only appears to occur at reboot. Another
 solution which is not acceptable is when I add a delay in the process
 the failure goes away. This as if the switch needed more time to do
 something.
 
 I would really appreciate your help and insight.
 Thank you
 
 Hector Abrach
 __
 This email has been scanned by the Symantec Email Security.cloud service.
 For more information please visit http://www.symanteccloud.com
 __
 
 
 ___
 ewg mailing list
 ewg@lists.openfabrics.org
 http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg



Re: [ewg] OpenSM 1.5.4 Boot Problem

2011-12-14 Thread Hector Abrach
Hal,

Sorry for the multiple emails, but I was thinking it may be a freeze/stall
rather than a timeout. One reason is that it doesn't send an error
message; it is as if the log completely dies. However, in file
osm_vendor_ibumad.c under function umad_receiver there is an infinite loop
for(;;) which seems to die when I get to the previously discussed
vl15_poller. I checked to see if it breaks out of the loop but it doesn't
seem to. I'm not sure if this may be an additional hint.
Thank you

Hector Abrach



From:
Hector Abrach habr...@tmriusa.com
To:
Hal Rosenstock h...@dev.mellanox.co.il
Cc:
ewg@lists.openfabrics.org
Date:
12/14/2011 11:15 AM
Subject:
Re: [ewg] OpenSM 1.5.4 Boot Problem
Sent by:
ewg-boun...@lists.openfabrics.org



Hal, 

Thank you very much for the support, I am the same person from the gmail 
account so I will respond through here. 

Attached is a picture of the switch serial number: 


I am indeed using OFED 1.5.4-rc3. My experiment consists of a 7 server 
system which I reboot via a script over and over again. Technically 
speaking the switch is not being powered off or physically rebooted. My 
server system is what is being rebooted. I am running OpenSM on one of the 
7 servers. This means I'm constantly shutting down and rebooting OpenSM. I 
am running OpenSM on QNX but we have not had this problem until we decided 
to upgrade to this switch. 

The problem is that in roughly 1 out of 15 of these remote reboots OpenSM
stalls or times out because stats->qp0_mads_outstanding did not reach zero.
Please excuse my ignorance as I'm relatively new at this, but how do I
verify whether it is a timeout problem vs a stall?

You also mentioned that you'd like to see the Verbose output of OpenSM;
however, when I run in Verbose mode I don't see the problem. It appears as
if the verbose output delays things long enough to give the switch time to
do whatever it needs to do, so the problem does not occur. But this is
the last thing I see when the problem occurs:



- 
OpenSM 3.3.12 
Command Line Arguments: 
 Log file max size is 5 MBytes 
 Log File: /tmp/opensm.log 
- 
OpenSM 3.3.12 

Entering DISCOVERING state 

Using default GUID 0x2c9020023277d 



The problem occurs in function osm_vl15intf.c - vl15_poller in the else 
statement. 

if (p_madw != (osm_madw_t *) cl_qlist_end(p_fifo)) {
    OSM_LOG(p_vl->p_log, OSM_LOG_DEBUG,
            "Servicing p_madw = %p\n", p_madw);
    if (osm_log_is_active(p_vl->p_log, OSM_LOG_FRAMES))
        osm_dump_dr_smp(p_vl->p_log,
                        osm_madw_get_smp_ptr(p_madw),
                        OSM_LOG_FRAMES);

    vl15_send_mad(p_vl, p_madw);
} else
    /*
       The VL15 FIFO is empty, so we have nothing left to do.
     */
    status = cl_event_wait_on(&p_vl->signal,
                              EVENT_NO_TIMEOUT, TRUE);

It won't move forward from the cl_event_wait_on in this line of code.
However, there are other locations, such as wait_for_pending_transactions
in the do_sweep function, that it won't move forward from either. But I
believe this to be a side effect of the problem I'm mentioning.

When you mention what is my timeout, I'm guessing you refer to 
max_smps_timeout which is used in the second while loop within 
vl15_poller? For this setting I am using the default which is defined in 
osm_subnet.c as: 

p_opt->transaction_timeout = OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC;
p_opt->transaction_retries = OSM_DEFAULT_RETRY_COUNT;
p_opt->max_smps_timeout = 1000 * p_opt->transaction_timeout
    * p_opt->transaction_retries;
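
To make the arithmetic concrete, here is a minimal sketch of that expression. The two values below (200 ms and 3 retries) are assumed stand-ins for what OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC and OSM_DEFAULT_RETRY_COUNT expand to; they should be checked against the headers of the release in use. The factor of 1000 suggests the result is in microseconds, which is likewise an assumption.

```c
#include <stdint.h>

/* Assumed stand-ins for the OpenSM defaults; verify against your headers. */
#define SKETCH_TRANS_TIMEOUT_MILLISEC 200u
#define SKETCH_RETRY_COUNT            3u

/* Mirrors the quoted expression: 1000 * timeout * retries. */
static uint32_t sketch_max_smps_timeout(uint32_t timeout_ms, uint32_t retries)
{
    return 1000u * timeout_ms * retries;
}
```

With the assumed defaults this gives 600000, i.e. 0.6 s if the unit is indeed microseconds.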

Would you explain to me what are the advantages or disadvantages of 
OSM_DEFAULT_SMP_MAX_ON_WIRE? Does this parameter change my bandwidth 
performance at all? 

I noticed that when using the default setting of 4 I get into the else of 
the above if statement when there are 4 qp0_mads_outstanding. I noticed 
that if I change OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 I don't get the failure 
I'm mentioning at all. Partly (I think) because I don't enter the else in 
the if statement until there is 1 qp0_mads_outstanding. 

I hope this explains the problem well enough and it may be a time out 
problem but I'd like to understand why the problem is occurring. 
Thank you very much, 

Hector Abrach 


From: 
Hal Rosenstock h...@dev.mellanox.co.il 
To: 
Hector Abrach habr...@tmriusa.com 
Cc: 
ewg@lists.openfabrics.org 
Date: 
12/14/2011 08:03 AM 
Subject: 
Re: [ewg] OpenSM 1.5.4 Boot Problem




Hi,

On 12/13/2011 2:35 PM, Hector Abrach wrote:
 Hello,
 
 I have a boot problem with OpenSM

Are you saying the switch is booted rather than OpenSM ?

What is the OpenSM running on and in what environment ?

 the problem occurs seldom and
 started to occur when we started using a new Mellanox MT1118X03342
switch.
 The problem occurs during the discovery phase within 
state_mgr_sweep_hop_1.
 
 However, I discovered that the actual location is because the
 qp0_mads_outstanding

Re: [ewg] OpenSM 1.5.4 Boot Problem

2011-12-14 Thread Hal Rosenstock
Hector,

On 12/14/2011 12:14 PM, Hector Abrach wrote:
 Hal,
 
 Thank you very much for the support, I am the same person from the gmail
 account so I will respond through here.
 
 Attached is a picture of the switch serial number:

OK; I see now; that's an 8 port unmanaged QDR switch.

 I am indeed using OFED 1.5.4-rc3. My experiment consists of a 7 server
 system which I reboot via a script over and over again. Technically
 speaking the switch is not being powered off or physically rebooted. My
 server system is what is being rebooted.

So the switch stays up and the servers (including the one OpenSM is on)
is rebooted, right ?

 I am running OpenSM on one of
 the 7 servers. This means I'm constantly shutting down and rebooting
 OpenSM. I am running OpenSM on QNX but we have not had this problem
 until we decided to upgrade to this switch.

Do the servers run QNX rather than Linux ? Are you saying all OpenSM
code is the same as stock OpenSM 3.3.12 (OFED 1.5.4-rc3) ?

Is the topology the 7 servers and the 1 switch and if you use other
switches you don't see this issue ?

 The problem is that every 1 out of 15 of this remote reboots OpenSM
 stalls or times out because stats->qp0_mads_outstanding did not reach
 zero. Please excuse my ignorance as I'm relatively new at this but how
 do I verify if it is a timeout problem vs a stall?
 
 You also mentioned that you'd like to see the Verbose output of openSM;
 however, when I run in Verbose mode I don't see the problem. It appears
 as if the verbose output stalls enough time to give the switch time to
 do what ever it needs to do and hence not have the problem occur. But
 this is the last I see when the problem occurs:
 
 
 
 -
 OpenSM 3.3.12
 Command Line Arguments:
  Log file max size is 5 MBytes
  Log File: /tmp/opensm.log
 -
 OpenSM 3.3.12
 
 Entering DISCOVERING state
 
 Using default GUID 0x2c9020023277d
 

Is there anything interesting in the log file (when running normally not
with verbosity on) ?

 The problem occurs in function osm_vl15intf.c - vl15_poller in the else
 statement.
 
 if (p_madw != (osm_madw_t *) cl_qlist_end(p_fifo)) {
     OSM_LOG(p_vl->p_log, OSM_LOG_DEBUG,
             "Servicing p_madw = %p\n", p_madw);
     if (osm_log_is_active(p_vl->p_log, OSM_LOG_FRAMES))
         osm_dump_dr_smp(p_vl->p_log,
                         osm_madw_get_smp_ptr(p_madw),
                         OSM_LOG_FRAMES);
 
     vl15_send_mad(p_vl, p_madw);
 } else
     /*
        The VL15 FIFO is empty, so we have nothing left to do.
      */
     status = cl_event_wait_on(&p_vl->signal,
                               EVENT_NO_TIMEOUT, TRUE);

 It won't move forward from the cl_event_wait_on in this line of code.

So it's stuck here forever and never gets past this ?

 However, there are other locations such as wait_for_pending_transactions
 in the do_sweep function that won't move forward from. But I believe
 this to be a side effect of the problem I'm mentioning.
 
 When you mention what is my timeout, I'm guessing you refer to
 max_smps_timeout which is used in the second while loop within
 vl15_poller? For this setting I am using the default which is defined in
 osm_subnet.c as:
 
 p_opt->transaction_timeout = OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC;
 p_opt->transaction_retries = OSM_DEFAULT_RETRY_COUNT;
 p_opt->max_smps_timeout = 1000 * p_opt->transaction_timeout
     * p_opt->transaction_retries;
 
 Would you explain to me what are the advantages or disadvantages of
 OSM_DEFAULT_SMP_MAX_ON_WIRE? 

It allows for more SMPs to be outstanding on the IB wire which helps
with subnet discovery/initialization, etc. So limiting SMPs to 1 will
slow things down but maybe that doesn't matter in your subnet.

 Does this parameter change my bandwidth
 performance at all?

It's a minor amount of bandwidth and is used to limit the SMPs which
unlike other VLs are not flow controlled so you can overflow the
dedicated buffers for those if OpenSM or diag tools send too quickly.
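
The limiting described above amounts to a small counting window. A minimal sketch, with hypothetical names throughout (this is not the osm_vl15intf.c code): max_on_wire stands in for OSM_DEFAULT_SMP_MAX_ON_WIRE and outstanding for qp0_mads_outstanding.

```c
#include <stdbool.h>

/* Hypothetical window of SMPs allowed on the wire at once. */
struct sketch_window {
    unsigned max_on_wire;   /* plays the role of OSM_DEFAULT_SMP_MAX_ON_WIRE */
    unsigned outstanding;   /* plays the role of qp0_mads_outstanding */
};

/* Try to put one more SMP on the wire; refuses when the window is full. */
static bool sketch_try_send(struct sketch_window *w)
{
    if (w->outstanding >= w->max_on_wire)
        return false;
    w->outstanding++;
    return true;
}

/*
 * Call once per completed transaction, whether it completed with a
 * response or with a send error after timeout/retries. If neither
 * ever arrives for a request, outstanding never returns to zero.
 */
static void sketch_complete(struct sketch_window *w)
{
    if (w->outstanding > 0)
        w->outstanding--;
}
```

In this model, a single request that gets neither a response nor a send error leaves outstanding stuck at 1, which is the same shape as the symptom reported in this thread.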

-- Hal

 I noticed that when using the default setting of 4 I get into the else
 of the above if statement when there are 4 qp0_mads_outstanding. I
 noticed that if I change OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 I don't get
 the failure I'm mentioning at all. Partly (I think) because I don't
 enter the else in the if statement until there is 1 qp0_mads_outstanding.
 
 I hope this explains the problem well enough and it may be a time out
 problem but I'd like to understand why the problem is occurring.
 Thank you very much,
 
 Hector Abrach
 
 
 From: Hal Rosenstock h...@dev.mellanox.co.il
 To:   Hector Abrach habr...@tmriusa.com
 Cc:   ewg@lists.openfabrics.org
 Date: 12/14/2011 08:03 AM
 Subject:  Re: [ewg] OpenSM 1.5.4 Boot Problem
 
 
 
 
 
 
 Hi,
 
 On 12/13/2011 2:35 PM, Hector Abrach wrote:
 Hello,

 I have a boot problem with OpenSM
 
 Are you

Re: [ewg] OpenSM 1.5.4 Boot Problem

2011-12-14 Thread Hal Rosenstock
Hector,

On 12/14/2011 1:41 PM, Hector Abrach wrote:
 Hal,
 
 Sorry for the multiple emails, but I was thinking how it may be a
 freeze /stall rather than a time out.  One reason is that it doesn't
 send an error message, is as if the log completely dies.

So nothing interesting in the log...

 However, in
 file osm_vendor_ibumad.c under function umad_receiver there is an
 infinite loop for(;;) which seems to die when I get to that previously
 discussed vl15_poller. I checked to see if it breaks out of the loop but
 it doesn't seem to. 

It never breaks out of that loop except when OpenSM is shutting down.
That's the basic receive loop.

-- Hal

 I'm not sure if this may be an additional hint.
 Thank you
 
 Hector Abrach
 
 
 From: Hector Abrach habr...@tmriusa.com
 To:   Hal Rosenstock h...@dev.mellanox.co.il
 Cc:   ewg@lists.openfabrics.org
 Date: 12/14/2011 11:15 AM
 Subject:  Re: [ewg] OpenSM 1.5.4 Boot Problem
 Sent by:  ewg-boun...@lists.openfabrics.org
 
 
 
 
 
 
 Hal,
 
 Thank you very much for the support, I am the same person from the gmail
 account so I will respond through here.
 
 Attached is a picture of the switch serial number:
 
 
 
 I am indeed using OFED 1.5.4-rc3. My experiment consists of a 7 server
 system which I reboot via a script over and over again. Technically
 speaking the switch is not being powered off or physically rebooted. My
 server system is what is being rebooted. I am running OpenSM on one of
 the 7 servers. This means I'm constantly shutting down and rebooting
 OpenSM. I am running OpenSM on QNX but we have not had this problem
 until we decided to upgrade to this switch.
 
 The problem is that every 1 out of 15 of this remote reboots OpenSM
 stalls or times out because stats->qp0_mads_outstanding did not reach
 zero. Please excuse my ignorance as I'm relatively new at this but how
 do I verify if it is a timeout problem vs a stall?
 
 You also mentioned that you'd like to see the Verbose output of openSM;
 however, when I run in Verbose mode I don't see the problem. It appears
 as if the verbose output stalls enough time to give the switch time to
 do what ever it needs to do and hence not have the problem occur. But
 this is the last I see when the problem occurs:
 
 
 
 -
 OpenSM 3.3.12
 Command Line Arguments:
 Log file max size is 5 MBytes
 Log File: /tmp/opensm.log
 -
 OpenSM 3.3.12
 
 Entering DISCOVERING state
 
 Using default GUID 0x2c9020023277d
 
 
 
 The problem occurs in function osm_vl15intf.c - vl15_poller in the else
 statement.
 
 if (p_madw != (osm_madw_t *) cl_qlist_end(p_fifo)) {
     OSM_LOG(p_vl->p_log, OSM_LOG_DEBUG,
             "Servicing p_madw = %p\n", p_madw);
     if (osm_log_is_active(p_vl->p_log, OSM_LOG_FRAMES))
         osm_dump_dr_smp(p_vl->p_log,
                         osm_madw_get_smp_ptr(p_madw),
                         OSM_LOG_FRAMES);
 
     vl15_send_mad(p_vl, p_madw);
 } else
     /*
        The VL15 FIFO is empty, so we have nothing left to do.
      */
     status = cl_event_wait_on(&p_vl->signal,
                               EVENT_NO_TIMEOUT, TRUE);
 
 It won't move forward from the cl_event_wait_on in this line of code.
 However, there are other locations such as wait_for_pending_transactions
 in the do_sweep function that won't move forward from. But I believe
 this to be a side effect of the problem I'm mentioning.
 
 When you mention what is my timeout, I'm guessing you refer to
 max_smps_timeout which is used in the second while loop within
 vl15_poller? For this setting I am using the default which is defined in
 osm_subnet.c as:
 
 p_opt->transaction_timeout = OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC;
 p_opt->transaction_retries = OSM_DEFAULT_RETRY_COUNT;
 p_opt->max_smps_timeout = 1000 * p_opt->transaction_timeout
     * p_opt->transaction_retries;
 
 Would you explain to me what are the advantages or disadvantages of
 OSM_DEFAULT_SMP_MAX_ON_WIRE? Does this parameter change my bandwidth
 performance at all?
 
 I noticed that when using the default setting of 4 I get into the else
 of the above if statement when there are 4 qp0_mads_outstanding. I
 noticed that if I change OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 I don't get
 the failure I'm mentioning at all. Partly (I think) because I don't
 enter the else in the if statement until there is 1 qp0_mads_outstanding.
 
 I hope this explains the problem well enough and it may be a time out
 problem but I'd like to understand why the problem is occurring.
 Thank you very much,
 
 Hector Abrach
 
 From: Hal Rosenstock h...@dev.mellanox.co.il
 To:   Hector Abrach habr...@tmriusa.com
 Cc:   ewg@lists.openfabrics.org
 Date: 12/14/2011 08:03 AM
 Subject:  Re: [ewg] OpenSM 1.5.4 Boot Problem
 
 
 
 
 
 
 
 Hi,
 
 On 12/13/2011 2:35 PM, Hector Abrach wrote:
 Hello

[ewg] OpenSM 1.5.4 Boot Problem

2011-12-13 Thread Hector Abrach
Hello,

I have a boot problem with OpenSM. The problem occurs seldom and started
to occur when we started using a new Mellanox MT1118X03342 switch.
The problem occurs during the discovery phase within
state_mgr_sweep_hop_1.

However, I discovered that the actual cause is that
qp0_mads_outstanding occasionally stalls at 1.

Within file osm_vl15intf.c, function vl15_poller checks the rfifo
and, if the qlist still has items, calls vl15_send_mad, which
later on triggers the signal.
With the current default setting of 4 for OSM_DEFAULT_SMP_MAX_ON_WIRE I
noticed that cl_qlist_end reaches zero before stats->qp0_mads_outstanding
does. This causes a stall in cl_event_wait_on. The rfifo always reaches 0
when there are 4 qp0_mads_outstanding; however, when it fails, it always
fails when there is 1 qp0_mads_outstanding.

Have you seen this failure? By the way, I see this failure once every 15 
reboots approximately.

I discovered that changing OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 fixes the 
problem.

My guess is that there is a race condition when the switch sends 4 SMPs in
parallel. Also, this failure only appears to occur at reboot. Another
workaround, which is not acceptable, is to add a delay in the process;
the failure then goes away. It is as if the switch needed more time to do
something.

I would really appreciate your help and insight.
Thank you

Hector Abrach
