Re: [ewg] OpenSM 1.5.4 Boot Problem
Hal,

Thank you for your help. I believe I have found the problem. It occurs in the function you pointed me to (timeout_sends), and it shows up in the QNX code because QNX doesn't compute jiffies automatically. Whoever wrote the driver introduced a bug in the jiffies computation that always gave me the same time, so nothing ever timed out. This occurs in the following statement:

    if (time_after(mad_send_wr->timeout, jiffies)) {

I noticed that when it times out, this if statement eventually becomes true, maybe after 1 or 2 tries. This tells me that it will fail anyway. If that's true, is there a point in having that if statement? Why not just time out right away?

Anyway, thank you once again.

Hector Abrach

From: Hal Rosenstock h...@dev.mellanox.co.il
To: Hector Abrach habr...@tmriusa.com
Cc: ewg@lists.openfabrics.org
Date: 12/22/2011 07:24 AM
Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem

Hector,

On 12/21/2011 2:16 PM, Hector Abrach wrote:

Hal,

When an SMP times out, how does the Linux kernel know it timed out?

The kernel MAD module maintains a timer list for outstanding transactions, and if no response is received before the timer expires, it knows that transaction timed out. If the matching response is received, it removes that transaction from that list. See timeout_sends in drivers/infiniband/core/mad.c.

When the Linux kernel determines it timed out, how does it signal OpenSM the timeout/retry/send? Through what function calls does this signal go? I noticed that cl_event_wait_on in vl15_poller() has a parameter passed as EVENT_NO_TIMEOUT. Should this be a timeout, or should the timeout occur somewhere else? This is where it stalls.

Yes, I've already responded about this several times. I'm reasonably sure that this is due to erroneous QP0/VL15 accounting due to lack of timeouts.

Do you know somewhere I could read a little bit more about the Linux kernel timeout and how it interacts with OpenSM?

In terms of the kernel, look at linux/Documentation/infiniband/user_mad.txt, include/rdma/ib_mad.h, and ib_user_mad.h.

OpenSM uses osm_vendor_ibumad.c, which is layered on top of libibumad. In osm_vendor_ibumad.c, the send error callback is invoked for transaction timeout in umad_receiver. For libibumad, see the umad_status and umad_send man pages.

-- Hal

Thank you for the help and your insight.

Hector Abrach

From: Hal Rosenstock h...@dev.mellanox.co.il
To: Hector Abrach habr...@tmriusa.com
Cc: ewg@lists.openfabrics.org
Date: 12/16/2011 11:11 AM
Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem

Hector,

On 12/16/2011 11:59 AM, Hector Abrach wrote:

Hal,

Is timeout/retry/send error support implemented in your QNX implementation? That would explain why the SM appears to stop...

Based on the inherent nature of the QNX kernel, I don't believe we have a timeout/retry/send on it. This may be the reason I see the bootup freeze. If it is, I may have to implement this somehow.

I think that's the OpenSM side of the failure, as a timed out transaction never times out, so the MAD accounting is wrong, etc. It breaks that fundamental assumption. There may also be some issue with the SMA implementation on your QNX nodes which is the root cause. Of course, SMPs are unreliable, so timeout/retries can be needed...

However, for the time being at least, I believe that setting OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 will be an acceptable solution as it works reliably. But it would be nice to know why it freezes anyway; it may be because of the above. Thus far I've been unsuccessful in failing with debug property -D 0x23, but I'll keep trying.

That slows things down enough to make it work, as does 1 SMP outstanding. It appears that when SMPs are pipelined, some get dropped...

-- Hal

Thank you

Hector Abrach

From: Hal Rosenstock h...@dev.mellanox.co.il
To: Hector Abrach habr...@tmriusa.com
Date: 12/15/2011 01:21 PM
Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem

On 12/15/2011 1:57 PM, Hector Abrach wrote:

Hal,

I managed to get it to fail with debug information -D 0x08. Attached is the log file. I'll dig deeper; it seems it is pkey related, maybe...

Yes, I saw signs of that last night from the log you sent, where it stopped on the pkey tables on the CAs, but I wasn't 100% sure whether it was that or not. I didn't check how many pairs of the pkey tables you got back here to validate whether every port responded with the proper number of pkey table blocks.

Is timeout/retry/send error support implemented in your QNX implementation? That would explain why the SM appears to stop...

-- Hal

Once again thank you for your support.

Hector Abrach

From: Hal
Re: [ewg] OpenSM 1.5.4 Boot Problem
Hector, On 12/21/2011 2:16 PM, Hector Abrach wrote: Hal, When an SMP times out how does the Linux kernel know it timed out? The kernel MAD module maintains a timer list for outstanding transactions and if no response is received before the timer expires, it knows that transaction timed out. If the matching response is received, it removes that transaction from that list. See timeout_sends in drivers/infiniband/core/mad.c When the Linux kernel determines it timed out how does it signal OpenSM the timeout/retry/send? Through what function calls does this signal go through? I was noticing that cl_event_wait_on in vl15_poller() has a parameter passed as EVENT_NO_TIMEOUT should this be a time out or should the time out occur somewhere else? This is where it stalls. Yes, I've already responded about this several times. I'm reasonably sure that this is due to erroneous QP0/VL15 accounting due to lack of timeouts. Do you know somewhere I could read a little bit more about the Linux Kernel timeout and how it interacts with OpenSM? In terms of the kernel, look at: linux/Documentation/infiniband/user_mad.txt and include/rdma/ib_mad.h and ib_user_mad.h OpenSM uses osm_vendor_ibumad.c which is layered on top of libibumad. In osm_vendor_ibumad.c, the send error callback is invoked for transaction timeout in umad_receiver. For libibumad, see umad_status and umad_send man pages. -- Hal Thank you for the help and your insight. Hector Abrach From: Hal Rosenstock h...@dev.mellanox.co.il To: Hector Abrach habr...@tmriusa.com Cc: ewg@lists.openfabrics.org Date: 12/16/2011 11:11 AM Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem Hector, On 12/16/2011 11:59 AM, Hector Abrach wrote: Hal, Is timeout/retry/send error support implemented in your QNX implementation ? That would explain why the SM appears to stop... Based on the inherit nature of the QNX Kernel I don't believe we have a timeout/retry/send on it. This may be the reason I see the bootup freeze. 
If it is I may have to implement this somehow. I think that's the OpenSM side of the failure as a timed out transaction never times out so the MAD accounting is wrong, etc. It breaks that fundamental assumption. There may also be some issue with the SMA implementation on your QNX nodes which is the root cause. Of course, SMPs are unreliable so timeout/retries can be needed... However, for the time being at least, I believe that setting OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 will be an acceptable solution as it works reliably. But, it would be nice to know why it freezes anyway, may be because of the above. Thus far I've been unsuccessful in failing with debug property -D 0x23 but I'll keep trying. That slows things down enough to make it work as does 1 SMP outstanding. It appears when SMPs are pipelined, some get dropped... -- Hal Thank you Hector Abrach From: Hal Rosenstock h...@dev.mellanox.co.il To: Hector Abrach habr...@tmriusa.com Date: 12/15/2011 01:21 PM Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem On 12/15/2011 1:57 PM, Hector Abrach wrote: Hal, I managed to get it to fail with Debug information -D 0x08. Attached is the log file. I'll dig deeper it seems is pkey related maybe... Yes, I saw signs of that last night from the log you sent where it stopped on the pkey tables on the CAs but I wasn't 100% sure whether it was that or not. I didn't check how many pairs of the pkey tables you got back here to validate whether every port responded with the proper number of pkey table blocks. Is timeout/retry/send error support implemented in your QNX implementation ? That would explain why the SM appears to stop... -- Hal Once again thank you for your support. Hector Abrach From: Hal Rosenstock h...@dev.mellanox.co.il To: Hector Abrach habr...@tmriusa.com Date: 12/14/2011 08:29 PM Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem Hector, On 12/14/2011 5:49 PM, Hector Abrach wrote: Hal, I got the system to fail with verbose enabled after 25 reboots. 
Please find attached the log file. I can see the responses but not the requests. What verbosity level did you use ? I was reading that OSM_DEFAULT_SMP_MAX_ON_WIRE is used to pipeline the boot process in multi-switch systems and make the boot process faster correct? It's multinode not just multiswitch and this configuration is 8 nodes (1 switch + 7 CAs). It's not boot process but discovery/initialization which is pipelined. Since my system is a single switch system I do not need to have 4 but 1
Re: [ewg] OpenSM 1.5.4 Boot Problem
Hal, When an SMP times out how does the Linux kernel know it timed out? When the Linux kernel determines it timed out how does it signal OpenSM the timeout/retry/send? Through what function calls does this signal go through? I was noticing that cl_event_wait_on in vl15_poller() has a parameter passed as EVENT_NO_TIMEOUT should this be a time out or should the time out occur somewhere else? This is where it stalls. Do you know somewhere I could read a little bit more about the Linux Kernel timeout and how it interacts with OpenSM? Thank you for the help and your insight. Hector Abrach From: Hal Rosenstock h...@dev.mellanox.co.il To: Hector Abrach habr...@tmriusa.com Cc: ewg@lists.openfabrics.org Date: 12/16/2011 11:11 AM Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem Hector, On 12/16/2011 11:59 AM, Hector Abrach wrote: Hal, Is timeout/retry/send error support implemented in your QNX implementation ? That would explain why the SM appears to stop... Based on the inherit nature of the QNX Kernel I don't believe we have a timeout/retry/send on it. This may be the reason I see the bootup freeze. If it is I may have to implement this somehow. I think that's the OpenSM side of the failure as a timed out transaction never times out so the MAD accounting is wrong, etc. It breaks that fundamental assumption. There may also be some issue with the SMA implementation on your QNX nodes which is the root cause. Of course, SMPs are unreliable so timeout/retries can be needed... However, for the time being at least, I believe that setting OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 will be an acceptable solution as it works reliably. But, it would be nice to know why it freezes anyway, may be because of the above. Thus far I've been unsuccessful in failing with debug property -D 0x23 but I'll keep trying. That slows things down enough to make it work as does 1 SMP outstanding. It appears when SMPs are pipelined, some get dropped... 
-- Hal Thank you Hector Abrach From: Hal Rosenstock h...@dev.mellanox.co.il To:Hector Abrach habr...@tmriusa.com Date: 12/15/2011 01:21 PM Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem On 12/15/2011 1:57 PM, Hector Abrach wrote: Hal, I managed to get it to fail with Debug information -D 0x08. Attached is the log file. I'll dig deeper it seems is pkey related maybe... Yes, I saw signs of that last night from the log you sent where it stopped on the pkey tables on the CAs but I wasn't 100% sure whether it was that or not. I didn't check how many pairs of the pkey tables you got back here to validate whether every port responded with the proper number of pkey table blocks. Is timeout/retry/send error support implemented in your QNX implementation ? That would explain why the SM appears to stop... -- Hal Once again thank you for your support. Hector Abrach From: Hal Rosenstock h...@dev.mellanox.co.il To: Hector Abrach habr...@tmriusa.com Date: 12/14/2011 08:29 PM Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem Hector, On 12/14/2011 5:49 PM, Hector Abrach wrote: Hal, I got the system to fail with verbose enabled after 25 reboots. Please find attached the log file. I can see the responses but not the requests. What verbosity level did you use ? I was reading that OSM_DEFAULT_SMP_MAX_ON_WIRE is used to pipeline the boot process in multi-switch systems and make the boot process faster correct? It's multinode not just multiswitch and this configuration is 8 nodes (1 switch + 7 CAs). It's not boot process but discovery/initialization which is pipelined. Since my system is a single switch system I do not need to have 4 but 1 for OSM_DEFAULT_SMP_MAX_ON_WIRE. You can run with 1 if that suits your needs. It's just not the default. Maybe the pipelined SMP's are confusing the switch some how. Even if it did, there's nothing that should stop the SM from working/proceeding. From the log, it looks like the SM does get stuck. -- Hal Thanks again for your help. 
Hector Abrach From: Hal Rosenstock h...@dev.mellanox.co.il To: Hector Abrach habr...@tmriusa.com Cc: ewg@lists.openfabrics.org Date: 12/14/2011 08:03 AM Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem Hi, On 12/13/2011 2:35 PM, Hector Abrach wrote: Hello, I have a boot problem with OpenSM Are you saying the switch is booted rather than OpenSM ? What is the OpenSM running on and in what environment ? the problem occurs seldomly and started to ocur when we started using a new Mellanox
Re: [ewg] OpenSM 1.5.4 Boot Problem
Hi Hector,

Few more questions. Does this happen to you only when you try to shut down the OpenSM on reboot? What is the host CPU architecture? x86/x86_64/ppc?

-----Original Message-----
From: ewg-boun...@lists.openfabrics.org [mailto:ewg-boun...@lists.openfabrics.org] On Behalf Of Hal Rosenstock
Sent: Thursday, December 15, 2011 9:06 PM
To: Hector Abrach
Cc: ewg@lists.openfabrics.org
Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem

Hector,

On 12/15/2011 12:49 PM, Hector Abrach wrote:

Hal,

Thank you for the response. To address your questions:

So the switch stays up and the servers (including the one OpenSM is on) are rebooted, right?

Right.

Do the servers run QNX rather than Linux? Are you saying all OpenSM code is the same as stock OpenSM 3.3.12 (OFED 1.5.4-rc3)?

Yes, all 7 servers run QNX. The OpenSM code is 99% the same; the only changes I had to make were to some #define libraries. The big changes were made for the driver, not so much OpenSM.

I would think there are also changes for porting of complib to QNX. Do you use osm_vendor_ibumad.c as the OpenSM vendor layer?

I'm using IBNet 1.3.

What's IBNet 1.3? I'm not familiar with that.

OpenSM always runs on the same one server; the others don't run it.

Understood. Is the topology the 7 servers and the 1 switch, and if you use other switches you don't see this issue?

That's correct, the topology is 7 servers and 1 switch. We typically use fewer servers (4) for our application, but the problem is more easily reproducible with more servers, so we have a 7 server setup with 1 switch. We don't have a great selection of switches, but I know our previous switch did not cause this problem. Our intention is to go to production with this new switch, but we can't release until we find an acceptable solution.

I can see the responses but not the requests. What verbosity level did you use?

I ran OpenSM with level -D 0x06 (error, info, verbose). I don't want to do -D 0xFF because I know this fixes the problem for sure.

I think -D 0x23 (error, info, frames) would do the trick...

In summary:
1. The system gets stuck in osm_vendor_ibumad.c - umad_receiver() - for(;;) but keeps running properly in main.c - osm_manager_loop().
2. If I use -D 0xFF the problem is completely fixed.
3. If I use OSM_DEFAULT_SMP_MAX_ON_WIRE of 1 instead of any other value the problem is completely fixed.
4. The failure always occurs with qp0_mads_outstanding of 1 remaining.

What do you think could be wrong? Do you think the driver could be the problem?

Yes; the thing that I think is a likely suspect, and may be missing and causing this issue, is the timeout/retry code for MAD transactions (built in to kernel MAD in Linux), which triggers a send error (callback) if the timeout/retries are exhausted. Is that implemented? However, I don't have a good explanation for why you see this now and not before with your other switches, but maybe that's not important.

What debug command should I use to see the sent requests?

See above.

-- Hal

Thank you

Hector Abrach

From: Hal Rosenstock h...@dev.mellanox.co.il
To: Hector Abrach habr...@tmriusa.com
Cc: ewg@lists.openfabrics.org
Date: 12/14/2011 08:23 PM
Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem

Hector,

On 12/14/2011 1:41 PM, Hector Abrach wrote:

Hal,

Sorry for the multiple emails, but I was thinking how it may be a freeze/stall rather than a timeout. One reason is that it doesn't send an error message; it is as if the log completely dies.

So nothing interesting in the log...

However, in file osm_vendor_ibumad.c under function umad_receiver there is an infinite loop for(;;) which seems to die when I get to that previously discussed vl15_poller. I checked to see if it breaks out of the loop, but it doesn't seem to.

It never breaks out of that loop except when OpenSM is shutting down. That's the basic receive loop.

-- Hal

I'm not sure if this may be an additional hint.
Thank you Hector Abrach From: Hector Abrach habr...@tmriusa.com To: Hal Rosenstock h...@dev.mellanox.co.il Cc: ewg@lists.openfabrics.org Date: 12/14/2011 11:15 AM Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem Sent by: ewg-boun...@lists.openfabrics.org - --- Hal, Thank you very much for the support, I am the same person from the gmail account so I will respond through here. Attached is a picture of the switch serial number: I am indeed using OFED 1.5.4-rc3. My experiment
Re: [ewg] OpenSM 1.5.4 Boot Problem
Alex,

Few more questions. Does this happen to you only when you try to shut down the OpenSM on reboot?

Our system servers don't have an actual hard drive, which means we boot remotely. So when I run the reboot script, OpenSM doesn't shut down properly (may this affect the switch?). However, it always boots the same way. The problem occurs when the system is in the bring-up process. Specifically for OpenSM, it occurs in the Discovering state.

What is the host cpu architecture? x86/x86_64/ppc?

We use x86_64, but QNX is only a 32-bit OS, which means we are technically running as 32-bit.

Thanks,
Hector Abrach

From: Alex Netes ale...@mellanox.com
To: Hal Rosenstock h...@dev.mellanox.co.il, Hector Abrach habr...@tmriusa.com
Cc: ewg@lists.openfabrics.org ewg@lists.openfabrics.org
Date: 12/16/2011 03:15 AM
Subject: RE: [ewg] OpenSM 1.5.4 Boot Problem

Hi Hector,

Few more questions. Does this happen to you only when you try to shut down the OpenSM on reboot? What is the host CPU architecture? x86/x86_64/ppc?

-----Original Message-----
From: ewg-boun...@lists.openfabrics.org [mailto:ewg-boun...@lists.openfabrics.org] On Behalf Of Hal Rosenstock
Sent: Thursday, December 15, 2011 9:06 PM
To: Hector Abrach
Cc: ewg@lists.openfabrics.org
Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem

Hector,

On 12/15/2011 12:49 PM, Hector Abrach wrote:

Hal,

Thank you for the response. To address your questions:

So the switch stays up and the servers (including the one OpenSM is on) are rebooted, right?

Right.

Do the servers run QNX rather than Linux? Are you saying all OpenSM code is the same as stock OpenSM 3.3.12 (OFED 1.5.4-rc3)?

Yes, all 7 servers run QNX. The OpenSM code is 99% the same; the only changes I had to make were to some #define libraries. The big changes were made for the driver, not so much OpenSM.

I would think there are also changes for porting of complib to QNX. Do you use osm_vendor_ibumad.c as the OpenSM vendor layer?

I'm using IBNet 1.3.

What's IBNet 1.3?
I'm not familiar with that. OpenSM always runs on the same one server, the others don't run it. Understood. Is the topology the 7 servers and the 1 switch and if you use other switches you don't see this issue ? That's correct, the topology is 7 servers and 1 switch. We typically use less servers (4) for our application but the problem is more easily reproducible with more servers so we have a 7 server setup with 1 switch. We don't have a great selection of switches but I know our previous switch did not cause this problem. Our intention is to go to production with this new switch but we can't release until we find an acceptable solution. Ican see the responses but not the requests. What verbosity level did you use ? I ran OpenSM with level -D 0x06 (error, info, verbose). I don't want to do -D 0xFF because I know this fixes the problem for sure. I think -D 0x23 (error, info, frames) would do the trick... - In summary: 1.knowing that the system gets stuck for sm_vendor_ibumad.c - umad_receiver() - for(;;) but keeps running properly for function main.c - osm_manager_loop(). 2.If I use -D 0xFF the problem is completely fixed 3.if I use OSM_DEFAULT_SMP_MAX_ON_WIRE of 1 instead of any other value the problem is completely fixed 4.The failure always occurs with qp0_mads_outstanding of 1 remaining what do you think could be wrong? Do you think the driver could be the problem? Yes; The thing that I think is a likely suspect and may be missing and causing this issue is the (built in to kernel MAD in Linux) timeout retry code for MAD transactions which if the timeout/retries are exhaused triggers a send error (callback). Is that implemented ? However, I don't have a good explanation for why you see this now and not before with your other switches but maybe that's not important. What debug command should I use to see the sent requests? See above. 
-- Hal Thank you Hector Abrach From:Hal Rosenstock h...@dev.mellanox.co.il To: Hector Abrach habr...@tmriusa.com Cc: ewg@lists.openfabrics.org Date:12/14/2011 08:23 PM Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem -- -- Hector, On 12/14/2011 1:41 PM, Hector Abrach wrote: Hal, Sorry for the multiple emails, but I was thinking how it may be a freeze /stall rather than a time out. One reason is that it doesn't send an error message, is as if the log completely dies. So nothing interesting in the log... However, in file osm_vendor_ibumad.c under function umad_receiver there is an infinite loop for(;;) which seems to die when I get to that previously discussed vl15_poller. I checked to see
Re: [ewg] OpenSM 1.5.4 Boot Problem
Hal, Is timeout/retry/send error support implemented in your QNX implementation ? That would explain why the SM appears to stop... Based on the inherit nature of the QNX Kernel I don't believe we have a timeout/retry/send on it. This may be the reason I see the bootup freeze. If it is I may have to implement this somehow. However, for the time being at least, I believe that setting OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 will be an acceptable solution as it works reliably. But, it would be nice to know why it freezes anyway, may be because of the above. Thus far I've been unsuccessful in failing with debug property -D 0x23 but I'll keep trying. Thank you Hector Abrach From: Hal Rosenstock h...@dev.mellanox.co.il To: Hector Abrach habr...@tmriusa.com Date: 12/15/2011 01:21 PM Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem On 12/15/2011 1:57 PM, Hector Abrach wrote: Hal, I managed to get it to fail with Debug information -D 0x08. Attached is the log file. I'll dig deeper it seems is pkey related maybe... Yes, I saw signs of that last night from the log you sent where it stopped on the pkey tables on the CAs but I wasn't 100% sure whether it was that or not. I didn't check how many pairs of the pkey tables you got back here to validate whether every port responded with the proper number of pkey table blocks. Is timeout/retry/send error support implemented in your QNX implementation ? That would explain why the SM appears to stop... -- Hal Once again thank you for your support. Hector Abrach From: Hal Rosenstock h...@dev.mellanox.co.il To:Hector Abrach habr...@tmriusa.com Date: 12/14/2011 08:29 PM Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem Hector, On 12/14/2011 5:49 PM, Hector Abrach wrote: Hal, I got the system to fail with verbose enabled after 25 reboots. Please find attached the log file. I can see the responses but not the requests. What verbosity level did you use ? 

I was reading that OSM_DEFAULT_SMP_MAX_ON_WIRE is used to pipeline the boot process in multi-switch systems and make the boot process faster, correct?

It's multinode, not just multiswitch, and this configuration is 8 nodes (1 switch + 7 CAs). It's not the boot process but discovery/initialization which is pipelined.

Since my system is a single switch system, I do not need to have 4 but 1 for OSM_DEFAULT_SMP_MAX_ON_WIRE.

You can run with 1 if that suits your needs. It's just not the default.

Maybe the pipelined SMPs are confusing the switch somehow.

Even if it did, there's nothing that should stop the SM from working/proceeding. From the log, it looks like the SM does get stuck.

-- Hal

Thanks again for your help.

Hector Abrach

From: Hal Rosenstock h...@dev.mellanox.co.il
To: Hector Abrach habr...@tmriusa.com
Cc: ewg@lists.openfabrics.org
Date: 12/14/2011 08:03 AM
Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem

Hi,

On 12/13/2011 2:35 PM, Hector Abrach wrote:

Hello,

I have a boot problem with OpenSM

Are you saying the switch is booted rather than OpenSM? What is the OpenSM running on and in what environment?

The problem occurs seldom and started to occur when we started using a new Mellanox MT1118X03342 switch. The problem occurs during the discovery phase within state_mgr_sweep_hop_1. However, I discovered that the actual cause is that qp0_mads_outstanding occasionally stalls at 1.

Is it stuck or after timeout/retry does this get updated properly?

Within file osm_vl15intf.c in function vl15_poller it checks the rfifo, and if the qlist still has items it applies function vl15_send_mad, which later on triggers the signal. With the current default setting of 4 for OSM_DEFAULT_SMP_MAX_ON_WIRE I noticed that cl_qlist_end reaches zero before stats->qp0_mads_outstanding does. This causes a stall in cl_event_wait_on. The rfifo always reaches 0 when there are 4 qp0_mads_outstanding; however, when it fails it always fails when there is 1 qp0_mad_outstanding.

Is some (request) SMP that OpenSM sent timing out (not being responded to)?

Have you seen this failure? By the way, I see this failure once every 15 reboots approximately. I discovered that changing OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 fixes the problem.

What do you mean exactly by fixes the problem? I'm not sure I understand what the problem is yet.

-- Hal

My guess is that there is a race condition when the switch sends 4 SMPs in parallel. Also, this failure only appears to occur at reboot. Another solution, which is not acceptable, is to add a delay in the process; when I do, the failure goes away. It is as if the switch needed more time to do something. I
Re: [ewg] OpenSM 1.5.4 Boot Problem
Hector, On 12/16/2011 11:59 AM, Hector Abrach wrote: Hal, Is timeout/retry/send error support implemented in your QNX implementation ? That would explain why the SM appears to stop... Based on the inherit nature of the QNX Kernel I don't believe we have a timeout/retry/send on it. This may be the reason I see the bootup freeze. If it is I may have to implement this somehow. I think that's the OpenSM side of the failure as a timed out transaction never times out so the MAD accounting is wrong, etc. It breaks that fundamental assumption. There may also be some issue with the SMA implementation on your QNX nodes which is the root cause. Of course, SMPs are unreliable so timeout/retries can be needed... However, for the time being at least, I believe that setting OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 will be an acceptable solution as it works reliably. But, it would be nice to know why it freezes anyway, may be because of the above. Thus far I've been unsuccessful in failing with debug property -D 0x23 but I'll keep trying. That slows things down enough to make it work as does 1 SMP outstanding. It appears when SMPs are pipelined, some get dropped... -- Hal Thank you Hector Abrach From: Hal Rosenstock h...@dev.mellanox.co.il To: Hector Abrach habr...@tmriusa.com Date: 12/15/2011 01:21 PM Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem On 12/15/2011 1:57 PM, Hector Abrach wrote: Hal, I managed to get it to fail with Debug information -D 0x08. Attached is the log file. I'll dig deeper it seems is pkey related maybe... Yes, I saw signs of that last night from the log you sent where it stopped on the pkey tables on the CAs but I wasn't 100% sure whether it was that or not. I didn't check how many pairs of the pkey tables you got back here to validate whether every port responded with the proper number of pkey table blocks. Is timeout/retry/send error support implemented in your QNX implementation ? That would explain why the SM appears to stop... 
-- Hal Once again thank you for your support. Hector Abrach From: Hal Rosenstock h...@dev.mellanox.co.il To: Hector Abrach habr...@tmriusa.com Date: 12/14/2011 08:29 PM Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem Hector, On 12/14/2011 5:49 PM, Hector Abrach wrote: Hal, I got the system to fail with verbose enabled after 25 reboots. Please find attached the log file. I can see the responses but not the requests. What verbosity level did you use ? I was reading that OSM_DEFAULT_SMP_MAX_ON_WIRE is used to pipeline the boot process in multi-switch systems and make the boot process faster correct? It's multinode not just multiswitch and this configuration is 8 nodes (1 switch + 7 CAs). It's not boot process but discovery/initialization which is pipelined. Since my system is a single switch system I do not need to have 4 but 1 for OSM_DEFAULT_SMP_MAX_ON_WIRE. You can run with 1 if that suits your needs. It's just not the default. Maybe the pipelined SMP's are confusing the switch some how. Even if it did, there's nothing that should stop the SM from working/proceeding. From the log, it looks like the SM does get stuck. -- Hal Thanks again for your help. Hector Abrach From: Hal Rosenstock h...@dev.mellanox.co.il To: Hector Abrach habr...@tmriusa.com Cc: ewg@lists.openfabrics.org Date: 12/14/2011 08:03 AM Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem Hi, On 12/13/2011 2:35 PM, Hector Abrach wrote: Hello, I have a boot problem with OpenSM Are you saying the switch is booted rather than OpenSM ? What is the OpenSM running on and in what environment ? the problem occurs seldomly and started to ocur when we started using a new Mellanox MT1118X03342 switch. The problem occurs during the discovery phase within state_mgr_sweep_hop_1. However, I discovered that the actual location is because the qp0_mads_outsanding stalls at 1 occasionally. Is it stuck or after timeout/retry does this get updated properly ? 
Within file osm_vl15intf.c in function vl15_poller it checks at the rfifo and if the qlist still has items it applies function vl15_send_mad which later on triggers the signal. With the current default setting of 4 for OSM_DEFAULT_SMP_MAX_ON_WIRE I noticed that cl_qlist_end reaches zero before stats-qp0_mads_outstanding does. This causes a stall in cl_event_wait_on. The rfifo always reaches 0 when there are 4 qp0_mads_outstanding however when it fails it always fails when there is 1 qp0_mad_outstanding. Is some (request) SMP that OpenSM
Re: [ewg] OpenSM 1.5.4 Boot Problem
Hector,

On 12/15/2011 12:49 PM, Hector Abrach wrote:

Hal, Thank you for the response. To address your questions:

So the switch stays up and the servers (including the one OpenSM is on) are rebooted, right ?

Right.

Do the servers run QNX rather than Linux ? Are you saying all OpenSM code is the same as stock OpenSM 3.3.12 (OFED 1.5.4-rc3) ?

Yes, all 7 servers run QNX. The OpenSM code is 99% the same, the only changes I had to make were made to some #define libraries. The big changes were made for the driver, not so much OpenSM.

I would think there are also changes for porting of complib to QNX.

Do you use osm_vendor_ibumad.c as the OpenSM vendor layer ?

I'm using IBNet 1.3.

What's IBNet 1.3 ? I'm not familiar with that.

OpenSM always runs on the same one server, the others don't run it.

Understood.

Is the topology the 7 servers and the 1 switch and if you use other switches you don't see this issue ?

That's correct, the topology is 7 servers and 1 switch. We typically use fewer servers (4) for our application but the problem is more easily reproducible with more servers so we have a 7 server setup with 1 switch. We don't have a great selection of switches but I know our previous switch did not cause this problem. Our intention is to go to production with this new switch but we can't release until we find an acceptable solution.

I can see the responses but not the requests. What verbosity level did you use ?

I ran OpenSM with level -D 0x06 (error, info, verbose). I don't want to do -D 0xFF because I know this fixes the problem for sure.

I think -D 0x23 (error, info, frames) would do the trick...

In summary:
1. The system gets stuck in osm_vendor_ibumad.c - umad_receiver() - for(;;) but keeps running properly in main.c - osm_manager_loop().
2. If I use -D 0xFF the problem is completely fixed.
3. If I use OSM_DEFAULT_SMP_MAX_ON_WIRE of 1 instead of any other value the problem is completely fixed.
4. The failure always occurs with qp0_mads_outstanding of 1 remaining.

What do you think could be wrong? Do you think the driver could be the problem?

Yes; The thing that I think is a likely suspect and may be missing and causing this issue is the (built in to kernel MAD in Linux) timeout retry code for MAD transactions which, if the timeout/retries are exhausted, triggers a send error (callback). Is that implemented ? However, I don't have a good explanation for why you see this now and not before with your other switches but maybe that's not important.

What debug command should I use to see the sent requests?

See above.

-- Hal

Thank you Hector Abrach

From: Hal Rosenstock h...@dev.mellanox.co.il To: Hector Abrach habr...@tmriusa.com Cc: ewg@lists.openfabrics.org Date: 12/14/2011 08:23 PM Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem

Hector,

On 12/14/2011 1:41 PM, Hector Abrach wrote:

Hal, Sorry for the multiple emails, but I was thinking how it may be a freeze/stall rather than a time out. One reason is that it doesn't send an error message; it is as if the log completely dies.

So nothing interesting in the log...

However, in file osm_vendor_ibumad.c under function umad_receiver there is an infinite loop for(;;) which seems to die when I get to that previously discussed vl15_poller. I checked to see if it breaks out of the loop but it doesn't seem to.

It never breaks out of that loop except when OpenSM is shutting down. That's the basic receive loop.

-- Hal

I'm not sure if this may be an additional hint.
Thank you Hector Abrach From: Hector Abrach habr...@tmriusa.com To: Hal Rosenstock h...@dev.mellanox.co.il Cc: ewg@lists.openfabrics.org Date: 12/14/2011 11:15 AM Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem Sent by: ewg-boun...@lists.openfabrics.org Hal, Thank you very much for the support, I am the same person from the gmail account so I will respond through here. Attached is a picture of the switch serial number: I am indeed using OFED 1.5.4-rc3. My experiment consists of a 7 server system which I reboot via a script over and over again. Technically speaking the switch is not being powered off or physically rebooted. My server system is what is being rebooted. I am running OpenSM on one of the 7 servers. This means I'm constantly shutting down and rebooting OpenSM. I am running OpenSM on QNX but we have not had this problem until we decided to upgrade to this switch. The problem is that every 1 out of 15 of this remote reboots OpenSM stalls or times out because stats-qp0_mads_outstanding did not reach zero. Please
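The -D values traded in the message above (0x06, 0x23, 0xFF) are bit masks, so it can help to decode them mechanically. The following is only a sketch: the bit-to-name table is an assumption based on the flag list in the opensm man page and should be checked against osm_log.h in the tree being built, and describe_log_mask is a hypothetical helper, not an OpenSM function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Assumed mapping of -D bits to log levels (verify against osm_log.h). */
struct log_flag {
    unsigned bit;
    const char *name;
};

static const struct log_flag log_flags[] = {
    {0x01, "error"}, {0x02, "info"},   {0x04, "verbose"}, {0x08, "debug"},
    {0x10, "funcs"}, {0x20, "frames"}, {0x40, "routing"},
};

/* Hypothetical helper: write the names of the bits set in mask into buf,
 * joined by '+'. Bits missing from the table (such as 0x80) are ignored.
 * Returns buf. */
char *describe_log_mask(unsigned mask, char *buf, size_t len)
{
    size_t used = 0;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof(log_flags) / sizeof(log_flags[0]); i++) {
        if (!(mask & log_flags[i].bit))
            continue;
        int n = snprintf(buf + used, len - used, "%s%s",
                         used ? "+" : "", log_flags[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break; /* buffer too small: stop with what fits */
        used += (size_t)n;
    }
    return buf;
}
```

Under this assumed table, 0x23 decodes to error, info and frames, which matches the description given for it. 0x06 has only the info and verbose bits set; if error logging is also wanted, 0x07 would be the corresponding mask.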
Re: [ewg] OpenSM 1.5.4 Boot Problem
Hi,

On 12/13/2011 2:35 PM, Hector Abrach wrote:

Hello, I have a boot problem with OpenSM

Are you saying the switch is booted rather than OpenSM ? What is the OpenSM running on and in what environment ?

the problem occurs seldom and started to occur when we started using a new Mellanox MT1118X03342 switch. The problem occurs during the discovery phase within state_mgr_sweep_hop_1. However, I discovered that the actual cause is that qp0_mads_outstanding occasionally stalls at 1.

Is it stuck or after timeout/retry does this get updated properly ?

Within file osm_vl15intf.c in function vl15_poller it checks the rfifo and if the qlist still has items it applies function vl15_send_mad which later on triggers the signal. With the current default setting of 4 for OSM_DEFAULT_SMP_MAX_ON_WIRE I noticed that cl_qlist_end reaches zero before stats->qp0_mads_outstanding does. This causes a stall in cl_event_wait_on. The rfifo always reaches 0 when there are 4 qp0_mads_outstanding; however, when it fails it always fails when there is 1 qp0_mads_outstanding.

Is some (request) SMP that OpenSM sent timing out (not being responded to) ?

Have you seen this failure? By the way, I see this failure once every 15 reboots approximately. I discovered that changing OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 fixes the problem.

What do you mean exactly by fixes the problem ? I'm not sure I understand what the problem is yet.

-- Hal

My guess is that there is a race condition when the switch sends 4 SMPs in parallel. Also, this failure only appears to occur at reboot. Another solution, which is not acceptable, is to add a delay in the process; the failure then goes away. It is as if the switch needed more time to do something. I would really appreciate your help and insight. Thank you Hector Abrach

This email has been scanned by the Symantec Email Security.cloud service.
For more information please visit http://www.symanteccloud.com

ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
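The question raised in the message above, whether the outstanding count recovers after timeout/retry, can be illustrated with a small model. This is a sketch under stated assumptions, not kernel or OpenSM code: mad_table, mad_send, mad_response and mad_sweep are hypothetical names, time is an abstract tick counter, and counter wraparound is ignored.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_TRANS 8

/* One outstanding request awaiting a response (hypothetical model). */
struct trans {
    bool in_use;
    unsigned long deadline; /* tick at which this attempt expires */
    int retries_left;
};

struct mad_table {
    struct trans t[MAX_TRANS];
    int outstanding;        /* plays the role of qp0_mads_outstanding */
    int send_errors;        /* how often the send-error path ran */
    unsigned long timeout;  /* ticks allowed per attempt */
};

/* Record a newly sent request. Returns its slot, or -1 if the table is full. */
int mad_send(struct mad_table *m, unsigned long now, int retries)
{
    for (int i = 0; i < MAX_TRANS; i++) {
        if (!m->t[i].in_use) {
            m->t[i] = (struct trans){ true, now + m->timeout, retries };
            m->outstanding++;
            return i;
        }
    }
    return -1;
}

/* A matching response arrived: retire the transaction. */
void mad_response(struct mad_table *m, int slot)
{
    assert(slot >= 0 && slot < MAX_TRANS);
    if (m->t[slot].in_use) {
        m->t[slot].in_use = false;
        m->outstanding--;
    }
}

/* Periodic sweep: re-arm expired transactions that still have retries,
 * otherwise report a send error and release the slot. */
void mad_sweep(struct mad_table *m, unsigned long now)
{
    for (int i = 0; i < MAX_TRANS; i++) {
        struct trans *tr = &m->t[i];

        if (!tr->in_use || now < tr->deadline)
            continue;                        /* idle or not yet expired */
        if (tr->retries_left > 0) {
            tr->retries_left--;
            tr->deadline = now + m->timeout; /* resend and wait again */
        } else {
            tr->in_use = false;              /* retries exhausted */
            m->outstanding--;
            m->send_errors++;
        }
    }
}
```

In this model a lost response only brings the count back down if mad_sweep runs and sees time advance. If the sweep never runs, or always sees the same tick, a single unanswered request leaves outstanding at 1 indefinitely.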
Re: [ewg] OpenSM 1.5.4 Boot Problem
Hal, Sorry for the multiple emails, but I was thinking how it may be a freeze /stall rather than a time out. One reason is that it doesn't send an error message, is as if the log completely dies. However, in file osm_vendor_ibumad.c under function umad_receiver there is an infinite loop for(;;) which seems to die when I get to that previously discussed vl15_poller. I checked to see if it breaks out of the loop but it doesn't seem to. I'm not sure if this may be an additional hint. Thank you Hector Abrach From: Hector Abrach habr...@tmriusa.com To: Hal Rosenstock h...@dev.mellanox.co.il Cc: ewg@lists.openfabrics.org Date: 12/14/2011 11:15 AM Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem Sent by: ewg-boun...@lists.openfabrics.org Hal, Thank you very much for the support, I am the same person from the gmail account so I will respond through here. Attached is a picture of the switch serial number: I am indeed using OFED 1.5.4-rc3. My experiment consists of a 7 server system which I reboot via a script over and over again. Technically speaking the switch is not being powered off or physically rebooted. My server system is what is being rebooted. I am running OpenSM on one of the 7 servers. This means I'm constantly shutting down and rebooting OpenSM. I am running OpenSM on QNX but we have not had this problem until we decided to upgrade to this switch. The problem is that every 1 out of 15 of this remote reboots OpenSM stalls or times out because stats-qp0_mads_outstanding did not reach zero. Please excuse my ignorance as I'm relatively new at this but how do I verify if it is a timeout problem vs a stall? You also mentioned that you'd like to see the Verbose output of openSM; however, when I run in Verbose mode I don't see the problem. It appears as if the verbose output stalls enough time to give the switch time to do what ever it needs to do and hence not have the problem occur. 
But this is the last I see when the problem occurs:

-
OpenSM 3.3.12
Command Line Arguments:
 Log file max size is 5 MBytes
 Log File: /tmp/opensm.log
-
OpenSM 3.3.12
Entering DISCOVERING state
Using default GUID 0x2c9020023277d

The problem occurs in function osm_vl15intf.c - vl15_poller in the else statement.

    if (p_madw != (osm_madw_t *) cl_qlist_end(p_fifo)) {
        OSM_LOG(p_vl->p_log, OSM_LOG_DEBUG,
                "Servicing p_madw = %p\n", p_madw);
        if (osm_log_is_active(p_vl->p_log, OSM_LOG_FRAMES))
            osm_dump_dr_smp(p_vl->p_log,
                            osm_madw_get_smp_ptr(p_madw),
                            OSM_LOG_FRAMES);
        vl15_send_mad(p_vl, p_madw);
    } else
        /* The VL15 FIFO is empty, so we have nothing left to do. */
        status = cl_event_wait_on(&p_vl->signal, EVENT_NO_TIMEOUT, TRUE);

It won't move forward from the cl_event_wait_on in this line of code. However, there are other locations such as wait_for_pending_transactions in the do_sweep function that won't move forward from. But I believe this to be a side effect of the problem I'm mentioning.

When you mention what is my timeout, I'm guessing you refer to max_smps_timeout which is used in the second while loop within vl15_poller? For this setting I am using the default which is defined in osm_subnet.c as:

    p_opt->transaction_timeout = OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC;
    p_opt->transaction_retries = OSM_DEFAULT_RETRY_COUNT;
    p_opt->max_smps_timeout = 1000 * p_opt->transaction_timeout * p_opt->transaction_retries;

Would you explain to me what are the advantages or disadvantages of OSM_DEFAULT_SMP_MAX_ON_WIRE? Does this parameter change my bandwidth performance at all? I noticed that when using the default setting of 4 I get into the else of the above if statement when there are 4 qp0_mads_outstanding. I noticed that if I change OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 I don't get the failure I'm mentioning at all. Partly (I think) because I don't enter the else in the if statement until there is 1 qp0_mads_outstanding.
I hope this explains the problem well enough and it may be a time out problem but I'd like to understand why the problem is occurring. Thank you very much, Hector Abrach From: Hal Rosenstock h...@dev.mellanox.co.il To: Hector Abrach habr...@tmriusa.com Cc: ewg@lists.openfabrics.org Date: 12/14/2011 08:03 AM Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem Hi, On 12/13/2011 2:35 PM, Hector Abrach wrote: Hello, I have a boot problem with OpenSM Are you saying the switch is booted rather than OpenSM ? What is the OpenSM running on and in what environment ? the problem occurs seldomly and started to ocur when we started using a new Mellanox MT1118X03342 switch. The problem occurs during the discovery phase within state_mgr_sweep_hop_1. However, I discovered that the actual location is because the qp0_mads_outsanding
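For a sense of scale, the max_smps_timeout formula quoted from osm_subnet.c in the message above can be evaluated directly. This is a sketch only: max_smps_timeout_usec is a hypothetical helper, and reading the factor of 1000 as a milliseconds-to-microseconds conversion is an inference from the formula rather than something stated in the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as the quoted osm_subnet.c lines:
 *   max_smps_timeout = 1000 * transaction_timeout * transaction_retries
 * The timeout is given in milliseconds; the result is taken to be in
 * microseconds (an assumption based on the factor of 1000). */
uint32_t max_smps_timeout_usec(uint32_t transaction_timeout_ms,
                               uint32_t transaction_retries)
{
    uint64_t usec = (uint64_t)1000 * transaction_timeout_ms
                    * transaction_retries;

    assert(usec <= UINT32_MAX); /* refuse inputs that would overflow */
    return (uint32_t)usec;
}
```

As an illustration with made-up inputs, a 200 ms transaction timeout and 3 retries give 600000, that is 0.6 seconds under the microsecond reading. The actual values behind OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC and OSM_DEFAULT_RETRY_COUNT should be read from the OpenSM headers in use.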
Re: [ewg] OpenSM 1.5.4 Boot Problem
Hector, On 12/14/2011 12:14 PM, Hector Abrach wrote: Hal, Thank you very much for the support, I am the same person from the gmail account so I will respond through here. Attached is a picture of the switch serial number: OK; I see now; that's an 8 port unmanaged QDR switch. I am indeed using OFED 1.5.4-rc3. My experiment consists of a 7 server system which I reboot via a script over and over again. Technically speaking the switch is not being powered off or physically rebooted. My server system is what is being rebooted. So the switch stays up and the servers (including the one OpenSM is on) is rebooted, right ? I am running OpenSM on one of the 7 servers. This means I'm constantly shutting down and rebooting OpenSM. I am running OpenSM on QNX but we have not had this problem until we decided to upgrade to this switch. Do the servers run QNX rather than Linux ? Are you saying all OpenSM code is the same as stock OpenSM 3.3.12 (OFED 1.5.4-rc3) ? Is the topology the 7 servers and the 1 switch and if you use other switches you don't see this issue ? The problem is that every 1 out of 15 of this remote reboots OpenSM stalls or times out because stats-qp0_mads_outstanding did not reach zero. Please excuse my ignorance as I'm relatively new at this but how do I verify if it is a timeout problem vs a stall? You also mentioned that you'd like to see the Verbose output of openSM; however, when I run in Verbose mode I don't see the problem. It appears as if the verbose output stalls enough time to give the switch time to do what ever it needs to do and hence not have the problem occur. But this is the last I see when the problem occurs: - OpenSM 3.3.12 Command Line Arguments: Log file max size is 5 MBytes Log File: /tmp/opensm.log - OpenSM 3.3.12 Entering DISCOVERING state Using default GUID 0x2c9020023277d Is there anything interesting in the log file (when running normally not with verbosity on) ? 
The problem occurs in function osm_vl15intf.c - vl15_poller in the else statement.

    if (p_madw != (osm_madw_t *) cl_qlist_end(p_fifo)) {
        OSM_LOG(p_vl->p_log, OSM_LOG_DEBUG,
                "Servicing p_madw = %p\n", p_madw);
        if (osm_log_is_active(p_vl->p_log, OSM_LOG_FRAMES))
            osm_dump_dr_smp(p_vl->p_log,
                            osm_madw_get_smp_ptr(p_madw),
                            OSM_LOG_FRAMES);
        vl15_send_mad(p_vl, p_madw);
    } else
        /* The VL15 FIFO is empty, so we have nothing left to do. */
        status = cl_event_wait_on(&p_vl->signal, EVENT_NO_TIMEOUT, TRUE);

It won't move forward from the cl_event_wait_on in this line of code.

So it's stuck here forever and never gets past this ?

However, there are other locations such as wait_for_pending_transactions in the do_sweep function that won't move forward from. But I believe this to be a side effect of the problem I'm mentioning.

When you mention what is my timeout, I'm guessing you refer to max_smps_timeout which is used in the second while loop within vl15_poller? For this setting I am using the default which is defined in osm_subnet.c as:

    p_opt->transaction_timeout = OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC;
    p_opt->transaction_retries = OSM_DEFAULT_RETRY_COUNT;
    p_opt->max_smps_timeout = 1000 * p_opt->transaction_timeout * p_opt->transaction_retries;

Would you explain to me what are the advantages or disadvantages of OSM_DEFAULT_SMP_MAX_ON_WIRE?

It allows for more SMPs to be outstanding on the IB wire which helps with subnet discovery/initialization, etc. So limiting SMPs to 1 will slow things down but maybe that doesn't matter in your subnet.

Does this parameter change my bandwidth performance at all?

It's a minor amount of bandwidth and is used to limit the SMPs which unlike other VLs are not flow controlled so you can overflow the dedicated buffers for those if OpenSM or diag tools send too quickly.

-- Hal

I noticed that when using the default setting of 4 I get into the else of the above if statement when there are 4 qp0_mads_outstanding.
I noticed that if I change OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 I don't get the failure I'm mentioning at all. Partly (I think) because I don't enter the else in the if statement until there is 1 qp0_mads_outstanding. I hope this explains the problem well enough and it may be a time out problem but I'd like to understand why the problem is occurring. Thank you very much, Hector Abrach From: Hal Rosenstock h...@dev.mellanox.co.il To: Hector Abrach habr...@tmriusa.com Cc: ewg@lists.openfabrics.org Date: 12/14/2011 08:03 AM Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem Hi, On 12/13/2011 2:35 PM, Hector Abrach wrote: Hello, I have a boot problem with OpenSM Are you
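The role of OSM_DEFAULT_SMP_MAX_ON_WIRE described in the reply above, capping how many SMPs are in flight because VL15 is not flow controlled, can be pictured as a simple send window. The sketch below is a simplified model and not the actual vl15_poller logic; vl15_model, vl15_pump and vl15_complete are hypothetical names.

```c
#include <assert.h>

/* Simplified stand-in for the VL15 sender state (hypothetical). */
struct vl15_model {
    int fifo_len;     /* requests still queued for sending */
    int outstanding;  /* requests sent and not yet completed */
    int max_on_wire;  /* stand-in for OSM_DEFAULT_SMP_MAX_ON_WIRE */
};

/* Send queued requests until the FIFO is empty or the window is full.
 * Returns how many requests were put on the wire. */
int vl15_pump(struct vl15_model *v)
{
    int sent = 0;

    assert(v->max_on_wire > 0);
    while (v->fifo_len > 0 && v->outstanding < v->max_on_wire) {
        v->fifo_len--;
        v->outstanding++;
        sent++;
    }
    return sent;
}

/* A response, or a send error after timeout, frees one window slot. */
void vl15_complete(struct vl15_model *v)
{
    if (v->outstanding > 0)
        v->outstanding--;
}
```

With max_on_wire set to 4 and six queued requests, the first pump sends four and then stops until a completion arrives; with 1, requests go out strictly one at a time. In either case the window only reopens when vl15_complete is called, so a request that is neither answered nor timed out holds its slot for good.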
Re: [ewg] OpenSM 1.5.4 Boot Problem
Hector, On 12/14/2011 1:41 PM, Hector Abrach wrote: Hal, Sorry for the multiple emails, but I was thinking how it may be a freeze /stall rather than a time out. One reason is that it doesn't send an error message, is as if the log completely dies. So nothing interesting in the log... However, in file osm_vendor_ibumad.c under function umad_receiver there is an infinite loop for(;;) which seems to die when I get to that previously discussed vl15_poller. I checked to see if it breaks out of the loop but it doesn't seem to. It never breaks out of that loop except when OpenSM is shutting down. That's the basic receive loop. -- Hal I'm not sure if this may be an additional hint. Thank you Hector Abrach From: Hector Abrach habr...@tmriusa.com To: Hal Rosenstock h...@dev.mellanox.co.il Cc: ewg@lists.openfabrics.org Date: 12/14/2011 11:15 AM Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem Sent by: ewg-boun...@lists.openfabrics.org Hal, Thank you very much for the support, I am the same person from the gmail account so I will respond through here. Attached is a picture of the switch serial number: I am indeed using OFED 1.5.4-rc3. My experiment consists of a 7 server system which I reboot via a script over and over again. Technically speaking the switch is not being powered off or physically rebooted. My server system is what is being rebooted. I am running OpenSM on one of the 7 servers. This means I'm constantly shutting down and rebooting OpenSM. I am running OpenSM on QNX but we have not had this problem until we decided to upgrade to this switch. The problem is that every 1 out of 15 of this remote reboots OpenSM stalls or times out because stats-qp0_mads_outstanding did not reach zero. Please excuse my ignorance as I'm relatively new at this but how do I verify if it is a timeout problem vs a stall? You also mentioned that you'd like to see the Verbose output of openSM; however, when I run in Verbose mode I don't see the problem. 
It appears as if the verbose output stalls enough time to give the switch time to do what ever it needs to do and hence not have the problem occur. But this is the last I see when the problem occurs:

-
OpenSM 3.3.12
Command Line Arguments:
 Log file max size is 5 MBytes
 Log File: /tmp/opensm.log
-
OpenSM 3.3.12
Entering DISCOVERING state
Using default GUID 0x2c9020023277d

The problem occurs in function osm_vl15intf.c - vl15_poller in the else statement.

    if (p_madw != (osm_madw_t *) cl_qlist_end(p_fifo)) {
        OSM_LOG(p_vl->p_log, OSM_LOG_DEBUG,
                "Servicing p_madw = %p\n", p_madw);
        if (osm_log_is_active(p_vl->p_log, OSM_LOG_FRAMES))
            osm_dump_dr_smp(p_vl->p_log,
                            osm_madw_get_smp_ptr(p_madw),
                            OSM_LOG_FRAMES);
        vl15_send_mad(p_vl, p_madw);
    } else
        /* The VL15 FIFO is empty, so we have nothing left to do. */
        status = cl_event_wait_on(&p_vl->signal, EVENT_NO_TIMEOUT, TRUE);

It won't move forward from the cl_event_wait_on in this line of code. However, there are other locations such as wait_for_pending_transactions in the do_sweep function that won't move forward from. But I believe this to be a side effect of the problem I'm mentioning.

When you mention what is my timeout, I'm guessing you refer to max_smps_timeout which is used in the second while loop within vl15_poller? For this setting I am using the default which is defined in osm_subnet.c as:

    p_opt->transaction_timeout = OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC;
    p_opt->transaction_retries = OSM_DEFAULT_RETRY_COUNT;
    p_opt->max_smps_timeout = 1000 * p_opt->transaction_timeout * p_opt->transaction_retries;

Would you explain to me what are the advantages or disadvantages of OSM_DEFAULT_SMP_MAX_ON_WIRE? Does this parameter change my bandwidth performance at all? I noticed that when using the default setting of 4 I get into the else of the above if statement when there are 4 qp0_mads_outstanding. I noticed that if I change OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 I don't get the failure I'm mentioning at all.
Partly (I think) because I don't enter the else in the if statement until there is 1 qp0_mads_outstanding. I hope this explains the problem well enough and it may be a time out problem but I'd like to understand why the problem is occurring. Thank you very much, Hector Abrach From: Hal Rosenstock h...@dev.mellanox.co.il To: Hector Abrach habr...@tmriusa.com Cc: ewg@lists.openfabrics.org Date: 12/14/2011 08:03 AM Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem Hi, On 12/13/2011 2:35 PM, Hector Abrach wrote: Hello
[ewg] OpenSM 1.5.4 Boot Problem
Hello,

I have a boot problem with OpenSM. The problem occurs seldom and started to occur when we started using a new Mellanox MT1118X03342 switch. The problem occurs during the discovery phase within state_mgr_sweep_hop_1. However, I discovered that the actual cause is that qp0_mads_outstanding occasionally stalls at 1.

Within file osm_vl15intf.c in function vl15_poller it checks the rfifo and if the qlist still has items it applies function vl15_send_mad which later on triggers the signal. With the current default setting of 4 for OSM_DEFAULT_SMP_MAX_ON_WIRE I noticed that cl_qlist_end reaches zero before stats->qp0_mads_outstanding does. This causes a stall in cl_event_wait_on. The rfifo always reaches 0 when there are 4 qp0_mads_outstanding; however, when it fails it always fails when there is 1 qp0_mads_outstanding.

Have you seen this failure? By the way, I see this failure once every 15 reboots approximately. I discovered that changing OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 fixes the problem. My guess is that there is a race condition when the switch sends 4 SMPs in parallel. Also, this failure only appears to occur at reboot. Another solution, which is not acceptable, is to add a delay in the process; the failure then goes away. It is as if the switch needed more time to do something.

I would really appreciate your help and insight.

Thank you
Hector Abrach