RE: Node Description mismatch between saquery & smpquery
> -----Original Message-----
> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
> Subject: Re: Node Description mismatch between saquery & smpquery
>
> On Tue, 2013-06-18 at 07:13 -0400, Hal Rosenstock wrote:
> > On 6/17/2013 5:38 PM, Albert Chu wrote:
> > > We've recently noticed that the Node Description for a node can
> > > mismatch between the output of smpquery and saquery. For example:
> > >
> > > # smpquery NodeDesc 427
> > > Node Description:.sierra1932 qib0
> > >
> > > # saquery NodeRecord 427 | grep NodeDesc
> > > NodeDescription.QLogic Infiniband HCA
> > >
> > > A restart of OpenSM is the current solution to resolve this. [snip]
> > >
> > > When the node descriptor is changed, a trap should be sent to opensm
> > > indicating the change. Normally OpenSM gets the trap and reads the
> > > new node descriptor.
> >
> > Are you sure the trap is being issued by those devices when the
> > NodeDescription is changed locally ?
>
> These particular devices do support the trap and tests show they do send
> traps on changes (i.e. manually changing
> /sys/class/infiniband/qib0/node_desc).
>
> > Also, if so, do these devices implement timeout/retry on sending the
> > trap (e.g. trying to make sure that they receive trap repress before
> > giving up on trap) ?
>
> This I don't know. I've been trying to figure out if they do and if they
> do how it might be configurable. Is there a way to figure this out?
>

Looking quickly at the driver I don't think it does resend the trap.
However, Mike might know better: CC'ed.

Ira

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
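For readers unsure what "timeout/retry on sending the trap" would look like, here is a minimal sketch in Python. It is not taken from the qib driver (which, per the message above, probably does not resend). The two callbacks `send_trap` and `wait_for_repress`, the attempt count, and the timeout are all hypothetical names and values chosen for illustration.

```python
from typing import Callable, Optional


def send_trap_with_retry(
    send_trap: Callable[[], None],
    wait_for_repress: Callable[[float], bool],
    max_attempts: int = 3,
    timeout_s: float = 1.0,
) -> Optional[int]:
    """Send a trap and resend it until a trap repress is seen.

    send_trap and wait_for_repress are hypothetical callbacks:
    wait_for_repress(timeout_s) should return True if a repress
    arrived within timeout_s seconds.

    Returns the attempt number that was acknowledged, or None if
    every attempt timed out.
    """
    for attempt in range(1, max_attempts + 1):
        send_trap()
        if wait_for_repress(timeout_s):
            return attempt
    return None
```

With logic like this on the sending side, a single lost trap would not leave the SM with a stale descriptor. Without it, one drop during a mass boot is enough.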
Re: Node Description mismatch between saquery & smpquery
On Tue, 2013-06-18 at 07:13 -0400, Hal Rosenstock wrote:
> On 6/17/2013 5:38 PM, Albert Chu wrote:
> > We've recently noticed that the Node Description for a node can
> > mismatch between the output of smpquery and saquery. For example:
> >
> > # smpquery NodeDesc 427
> > Node Description:.sierra1932 qib0
> >
> > # saquery NodeRecord 427 | grep NodeDesc
> > NodeDescription.QLogic Infiniband HCA
> >
> > A restart of OpenSM is the current solution to resolve this.
> >
> > We've noticed it occurring more often on our larger clusters than our
> > smaller clusters, leading to a speculation about why it is happening.
> >
> > The speculation is when a node comes up, there is a window of time in
> > which the HCA is up, can be scanned by OpenSM, but does not yet have
> > its node descriptor set (in RHEL it appears to be set via
> > /etc/init.d/rdma). During this window, OpenSM reads/stores the
> > non-desired node descriptor (in the above case the non-desired
> > "QLogic Infiniband HCA").
> >
> > When the node descriptor is changed, a trap should be sent to opensm
> > indicating the change. Normally OpenSM gets the trap and reads the new
> > node descriptor.
>
> Are you sure the trap is being issued by those devices when the
> NodeDescription is changed locally ?

These particular devices do support the trap and tests show they do send
traps on changes (i.e. manually changing
/sys/class/infiniband/qib0/node_desc).

> Also, if so, do these devices implement timeout/retry on sending the
> trap (e.g. trying to make sure that they receive trap repress before
> giving up on trap) ?

This I don't know. I've been trying to figure out if they do and, if they
do, how it might be configurable. Is there a way to figure this out?

> > On our large clusters all nodes are typically brought up at the same
> > time, so there are probably a ton of node descriptor change traps
> > happening at the exact same time. We speculate a number of these are
> > dropped/lost, and subsequently OpenSM never realizes that the node
> > descriptor has changed.
>
> Do you see any evidence that traps are being dropped ? Have you
> correlated any VL15Dropped counters in the subnet with this ? Also,
> there is a module parameter in MAD kernel module that might help with
> any unsolicited MAD bursts. You might try increasing that on your SM
> node(s).

On our largest clusters we always see a nice chunk of VL15 drops; however,
we haven't correlated them specifically to a trap.

> > I don't know if the speculation sounds reasonable or not. Regardless,
> > we're not sure of the best fix.
> >
> > A trivial fix would be to just make OpenSM re-scan the node descriptor
> > of an HCA, perhaps during a heavy sweep. But I don't know if this is
> > optimal. It'll introduce more MADs on the wire. However if the present
> > solution is to restart OpenSM, we figure this can't be any worse.
>
> Yes, but adding the additional queries is O(n) there and has been
> resisted in the past.
>
> > Just wondering what people's thoughts are or if there's another obvious
> > solution we're not seeing.
>
> I think this issue needs better understanding first.

Yeah, just looking for hints/pointers for the time being.

Thanks,
Al

> -- Hal
>
> > Al
> >

--
Albert Chu
ch...@llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
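The smpquery/saquery comparison described above could be automated along these lines. This is only a sketch: it parses the two output lines as they are shown in this thread and assumes the dot run after the field name is padding. The function names are made up for illustration. Actually running the two commands (for example via `subprocess`) is left out.

```python
import re


def parse_node_desc(line: str) -> str:
    """Extract the description from a line such as
    'Node Description:.sierra1932 qib0' (smpquery) or
    'NodeDescription.QLogic Infiniband HCA' (saquery).

    Assumption: any run of dots right after the field name is
    padding, so a description that itself begins with a dot
    would lose that dot.
    """
    m = re.match(r"\s*Node ?Description:?\.*(.*)$", line)
    if m is None:
        raise ValueError(f"not a node description line: {line!r}")
    return m.group(1).strip()


def descriptions_match(smpquery_line: str, saquery_line: str) -> bool:
    """True if the device's own view (smpquery) agrees with the SA's
    cached record (saquery)."""
    return parse_node_desc(smpquery_line) == parse_node_desc(saquery_line)
```

Fed the two lines from the example for LID 427, `descriptions_match` reports a mismatch, which is the condition that currently calls for an OpenSM restart.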
Re: Node Description mismatch between saquery & smpquery
On 6/17/2013 5:38 PM, Albert Chu wrote:
> We've recently noticed that the Node Description for a node can
> mismatch between the output of smpquery and saquery. For example:
>
> # smpquery NodeDesc 427
> Node Description:.sierra1932 qib0
>
> # saquery NodeRecord 427 | grep NodeDesc
> NodeDescription.QLogic Infiniband HCA
>
> A restart of OpenSM is the current solution to resolve this.
>
> We've noticed it occurring more often on our larger clusters than our
> smaller clusters, leading to a speculation about why it is happening.
>
> The speculation is when a node comes up, there is a window of time in
> which the HCA is up, can be scanned by OpenSM, but does not yet have
> its node descriptor set (in RHEL it appears to be set via
> /etc/init.d/rdma). During this window, OpenSM reads/stores the
> non-desired node descriptor (in the above case the non-desired
> "QLogic Infiniband HCA").
>
> When the node descriptor is changed, a trap should be sent to opensm
> indicating the change. Normally OpenSM gets the trap and reads the new
> node descriptor.

Are you sure the trap is being issued by those devices when the
NodeDescription is changed locally ?

Also, if so, do these devices implement timeout/retry on sending the
trap (e.g. trying to make sure that they receive trap repress before
giving up on trap) ?

> On our large clusters all nodes are typically brought up at the same
> time, so there are probably a ton of node descriptor change traps
> happening at the exact same time. We speculate a number of these are
> dropped/lost, and subsequently OpenSM never realizes that the node
> descriptor has changed.

Do you see any evidence that traps are being dropped ? Have you
correlated any VL15Dropped counters in the subnet with this ? Also,
there is a module parameter in MAD kernel module that might help with
any unsolicited MAD bursts. You might try increasing that on your SM
node(s).

> I don't know if the speculation sounds reasonable or not. Regardless,
> we're not sure of the best fix.
>
> A trivial fix would be to just make OpenSM re-scan the node descriptor
> of an HCA, perhaps during a heavy sweep. But I don't know if this is
> optimal. It'll introduce more MADs on the wire. However if the present
> solution is to restart OpenSM, we figure this can't be any worse.

Yes, but adding the additional queries is O(n) there and has been
resisted in the past.

> Just wondering what people's thoughts are or if there's another obvious
> solution we're not seeing.

I think this issue needs better understanding first.

-- Hal

> Al
>
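To make the trade-off in this exchange concrete, here is a toy model in Python of the speculated race and of the proposed re-scan. It is not OpenSM code. It assumes the SM's first sweep always lands inside the boot window, that each delivered trap triggers exactly one re-read, and that a re-scan costs one extra query per node. The function names and any LIDs or host names other than those quoted in the thread are hypothetical.

```python
DEFAULT_DESC = "QLogic Infiniband HCA"  # descriptor seen before the host sets its own


def sm_view_after_boot(final_descs, lost_traps, rescan_on_sweep=False):
    """Model the SM's cached descriptors after a mass boot.

    final_descs: dict mapping LID -> descriptor the host eventually sets.
    lost_traps:  set of LIDs whose change trap never reached the SM.
    Returns (sm_cache, extra_queries).
    """
    # Assumption: the first sweep hits every node inside the window,
    # so the SM caches the default descriptor everywhere.
    sm_cache = {lid: DEFAULT_DESC for lid in final_descs}

    # A delivered trap makes the SM re-read that one node.
    for lid, desc in final_descs.items():
        if lid not in lost_traps:
            sm_cache[lid] = desc

    # Proposed fix: re-read every node on a heavy sweep, one query each.
    extra_queries = 0
    if rescan_on_sweep:
        for lid, desc in final_descs.items():
            sm_cache[lid] = desc
            extra_queries += 1
    return sm_cache, extra_queries


def stale_lids(sm_cache, final_descs):
    """LIDs whose cached descriptor differs from what the host set."""
    return sorted(lid for lid in final_descs if sm_cache[lid] != final_descs[lid])
```

In this model every lost trap leaves one stale record indefinitely, while the re-scan clears all of them at a cost that grows linearly with the number of nodes, which is the O(n) objection raised above.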
RE: Node Description mismatch between saquery & smpquery
On Mon, 2013-06-17 at 22:00 +, Weiny, Ira wrote:
> Does running "update_desc" in the console fix this?

This worked as a short-term solution. But we're still thinking about a
longer-term one that requires less interaction.

Al

> Ira
>
> > -----Original Message-----
> > From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Albert Chu
> > Sent: Monday, June 17, 2013 2:38 PM
> > To: linux-rdma@vger.kernel.org
> > Subject: Node Description mismatch between saquery & smpquery
> >
> > We've recently noticed that the Node Description for a node can
> > mismatch between the output of smpquery and saquery. For example:
> >
> > # smpquery NodeDesc 427
> > Node Description:.sierra1932 qib0
> >
> > # saquery NodeRecord 427 | grep NodeDesc
> > NodeDescription.QLogic Infiniband HCA
> >
> > A restart of OpenSM is the current solution to resolve this.
> >
> > We've noticed it occurring more often on our larger clusters than our
> > smaller clusters, leading to a speculation about why it is happening.
> >
> > The speculation is when a node comes up, there is a window of time in
> > which the HCA is up, can be scanned by OpenSM, but does not yet have
> > its node descriptor set (in RHEL it appears to be set via
> > /etc/init.d/rdma). During this window, OpenSM reads/stores the
> > non-desired node descriptor (in the above case the non-desired
> > "QLogic Infiniband HCA").
> >
> > When the node descriptor is changed, a trap should be sent to opensm
> > indicating the change. Normally OpenSM gets the trap and reads the new
> > node descriptor.
> >
> > On our large clusters all nodes are typically brought up at the same
> > time, so there are probably a ton of node descriptor change traps
> > happening at the exact same time. We speculate a number of these are
> > dropped/lost, and subsequently OpenSM never realizes that the node
> > descriptor has changed.
> >
> > I don't know if the speculation sounds reasonable or not. Regardless,
> > we're not sure of the best fix.
> >
> > A trivial fix would be to just make OpenSM re-scan the node descriptor
> > of an HCA, perhaps during a heavy sweep. But I don't know if this is
> > optimal. It'll introduce more MADs on the wire. However if the present
> > solution is to restart OpenSM, we figure this can't be any worse.
> >
> > Just wondering what people's thoughts are or if there's another obvious
> > solution we're not seeing.
> >
> > Al
> >
> > --
> > Albert Chu
> > ch...@llnl.gov
> > Computer Scientist
> > High Performance Systems Division
> > Lawrence Livermore National Laboratory

--
Albert Chu
ch...@llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
RE: Node Description mismatch between saquery & smpquery
Does running "update_desc" in the console fix this?

Ira

> -----Original Message-----
> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Albert Chu
> Sent: Monday, June 17, 2013 2:38 PM
> To: linux-rdma@vger.kernel.org
> Subject: Node Description mismatch between saquery & smpquery
>
> We've recently noticed that the Node Description for a node can
> mismatch between the output of smpquery and saquery. For example:
>
> # smpquery NodeDesc 427
> Node Description:.sierra1932 qib0
>
> # saquery NodeRecord 427 | grep NodeDesc
> NodeDescription.QLogic Infiniband HCA
>
> A restart of OpenSM is the current solution to resolve this.
>
> We've noticed it occurring more often on our larger clusters than our
> smaller clusters, leading to a speculation about why it is happening.
>
> The speculation is when a node comes up, there is a window of time in
> which the HCA is up, can be scanned by OpenSM, but does not yet have
> its node descriptor set (in RHEL it appears to be set via
> /etc/init.d/rdma). During this window, OpenSM reads/stores the
> non-desired node descriptor (in the above case the non-desired
> "QLogic Infiniband HCA").
>
> When the node descriptor is changed, a trap should be sent to opensm
> indicating the change. Normally OpenSM gets the trap and reads the new
> node descriptor.
>
> On our large clusters all nodes are typically brought up at the same
> time, so there are probably a ton of node descriptor change traps
> happening at the exact same time. We speculate a number of these are
> dropped/lost, and subsequently OpenSM never realizes that the node
> descriptor has changed.
>
> I don't know if the speculation sounds reasonable or not. Regardless,
> we're not sure of the best fix.
>
> A trivial fix would be to just make OpenSM re-scan the node descriptor
> of an HCA, perhaps during a heavy sweep. But I don't know if this is
> optimal. It'll introduce more MADs on the wire. However if the present
> solution is to restart OpenSM, we figure this can't be any worse.
>
> Just wondering what people's thoughts are or if there's another obvious
> solution we're not seeing.
>
> Al
>
> --
> Albert Chu
> ch...@llnl.gov
> Computer Scientist
> High Performance Systems Division
> Lawrence Livermore National Laboratory