Re: Node Description mismatch between saquery & smpquery

2013-06-18 Thread Hal Rosenstock
On 6/17/2013 5:38 PM, Albert Chu wrote:
 We've recently noticed that the Node Description for a node can
 mismatch between the output of smpquery and saquery.  For example:
 
 # smpquery NodeDesc 427
 Node Description:.sierra1932 qib0
 
 # saquery NodeRecord 427 | grep NodeDesc
 NodeDescription.QLogic Infiniband HCA
 
 A restart of OpenSM is the current solution to resolve this.
 
 We've noticed it occurring more often on our larger clusters than our
 smaller clusters, leading to speculation about why it is happening.
 
 The speculation is that when a node comes up, there is a window of time in
 which the HCA is up and can be scanned by OpenSM, but does not yet have its
 node descriptor set (in RHEL it appears to be set via /etc/init.d/rdma).
 During this window, OpenSM reads and stores the undesired node descriptor
 (in the above case, the undesired QLogic Infiniband HCA).
 
 When the node descriptor is changed, a trap should be sent to opensm
 indicating the change.  Normally OpenSM gets the trap and reads the new
 node descriptor.

Are you sure the trap is being issued by those devices when the
NodeDescription is changed locally ?

Also, if so, do these devices implement timeout/retry on sending the
trap (e.g. resending until they receive a TrapRepress before giving up
on the trap) ?

 On our large clusters all nodes are typically brought up at the same
 time, so there are probably a ton of node descriptor change traps
 happening at the exact same time.  We speculate a number of these are
 dropped/lost, and subsequently OpenSM never realizes that the node
 descriptor has changed.

Do you see any evidence that traps are being dropped ? Have you
correlated any VL15Dropped counters in the subnet with this ? Also,
there is a module parameter in the MAD kernel module that might help
with any unsolicited MAD bursts. You might try increasing that on your
SM node(s).
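
As a rough way to do that correlation, the sketch below pulls VL15Dropped
out of perfquery output for a given LID and port. The function names are
made up for illustration, and it assumes perfquery prints the counter as a
dot-padded "VL15Dropped:....N" line. Which ports are worth checking (the
SM's own port and the switch port it hangs off would be a first guess) is
left open.

```shell
#!/bin/sh
# Hypothetical helpers, not part of infiniband-diags.

# Read perfquery output on stdin and print the VL15Dropped value, if present.
# Assumes the dot-padded "VL15Dropped:.....N" line format.
vl15_dropped() {
    sed -n -E 's/^VL15Dropped:\.*([0-9]+).*$/\1/p'
}

# Print a line for LID $1, port $2 when its VL15Dropped counter is non-zero.
# Prints nothing when the counter is zero or could not be read.
report_vl15() {
    lid=$1
    port=$2
    count=$(perfquery "$lid" "$port" | vl15_dropped)
    if [ "${count:-0}" -gt 0 ]; then
        echo "LID $lid port $port: VL15Dropped=$count"
    fi
}
```

Running report_vl15 for the same ports before and after a full cluster
bring-up would show whether the counter moves during the window in which
the node descriptions go stale.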

 I don't know if the speculation sounds reasonable or not.  Regardless,
 we're not sure of the best fix.
 
 A trivial fix would be to just make OpenSM re-scan the node descriptor
 of an HCA, perhaps during a heavy sweep.  But I don't know if this is
 optimal.  It'll introduce more MADs on the wire.  However if the present
 solution is to restart OpenSM, we figure this can't be any worse.

Yes, but adding the additional queries there is O(n) and has been
resisted in the past.

 Just wondering what people's thoughts are, or if there's another obvious
 solution we're not seeing.

I think this issue needs better understanding first.

-- Hal

 Al
 

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Node Description mismatch between saquery & smpquery

2013-06-18 Thread Weiny, Ira
 -----Original Message-----
 From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
 Subject: Re: Node Description mismatch between saquery & smpquery
 
 On Tue, 2013-06-18 at 07:13 -0400, Hal Rosenstock wrote:
  On 6/17/2013 5:38 PM, Albert Chu wrote:
   We've recently noticed that the Node Description for a node can
   mismatch between the output of smpquery and saquery.  For example:
  
   # smpquery NodeDesc 427
   Node Description:.sierra1932 qib0
  
   # saquery NodeRecord 427 | grep NodeDesc
   NodeDescription.QLogic Infiniband HCA
  
   A restart of OpenSM is the current solution to resolve this.

[snip]

  
   When the node descriptor is changed, a trap should be sent to opensm
   indicating the change.  Normally OpenSM gets the trap and reads the
   new node descriptor.
 
  Are you sure the trap is being issued by those devices when the
  NodeDescription is changed locally ?
 
 These particular devices do support the trap, and tests show they do send
 traps on changes (e.g. after manually changing
 /sys/class/infiniband/qib0/node_desc).
 
  Also, if so, do these devices implement timeout/retry on sending the
  trap (e.g. resending until they receive a TrapRepress before giving up
  on the trap) ?
 
 This I don't know.  I've been trying to figure out if they do and, if they
 do, how it might be configurable.  Is there a way to figure this out?
 

Looking quickly at the driver, I don't think it resends the trap.  However,
Mike might know better: CC'ed.
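
For anyone wanting to repeat the manual test mentioned above, here is a
minimal sketch: it rewrites the node_desc attribute under sysfs and reads
it back. The helper name and the optional sysfs-root argument (there only
so it can be tried against a scratch directory) are made up here, and
writing the real attribute needs root.

```shell
#!/bin/sh
# Hypothetical helper: set_node_desc DEVICE DESCRIPTION [SYSFS_ROOT]
# Writes DESCRIPTION to SYSFS_ROOT/DEVICE/node_desc and prints what is
# stored there afterwards. SYSFS_ROOT defaults to /sys/class/infiniband.
set_node_desc() {
    dev=$1
    desc=$2
    root=${3:-/sys/class/infiniband}
    attr="$root/$dev/node_desc"
    if [ ! -w "$attr" ]; then
        echo "cannot write $attr" >&2
        return 1
    fi
    printf '%s\n' "$desc" > "$attr"
    cat "$attr"
}
```

For example, set_node_desc qib0 "$(hostname -s) qib0", followed by the
smpquery and saquery commands quoted earlier, shows whether the SM picked
up the change.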

Ira



RE: Node Description mismatch between saquery & smpquery

2013-06-17 Thread Weiny, Ira
Does running update_desc in the OpenSM console fix this?

Ira

 -----Original Message-----
 From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
 ow...@vger.kernel.org] On Behalf Of Albert Chu
 Sent: Monday, June 17, 2013 2:38 PM
 To: linux-rdma@vger.kernel.org
 Subject: Node Description mismatch between saquery & smpquery
 
 We've recently noticed that the Node Description for a node can
 mismatch between the output of smpquery and saquery.  For example:
 
 # smpquery NodeDesc 427
 Node Description:.sierra1932 qib0
 
 # saquery NodeRecord 427 | grep NodeDesc
 NodeDescription.QLogic Infiniband HCA
 
 A restart of OpenSM is the current solution to resolve this.
 
 We've noticed it occurring more often on our larger clusters than our smaller
 clusters, leading to speculation about why it is happening.
 
 The speculation is that when a node comes up, there is a window of time in
 which the HCA is up and can be scanned by OpenSM, but does not yet have its
 node descriptor set (in RHEL it appears to be set via /etc/init.d/rdma).
 During this window, OpenSM reads and stores the undesired node descriptor
 (in the above case, the undesired QLogic Infiniband HCA).
 
 When the node descriptor is changed, a trap should be sent to opensm
 indicating the change.  Normally OpenSM gets the trap and reads the new
 node descriptor.
 
 On our large clusters all nodes are typically brought up at the same time, so
 there are probably a ton of node descriptor change traps happening at the
 exact same time.  We speculate a number of these are dropped/lost, and
 subsequently OpenSM never realizes that the node descriptor has changed.
 
 I don't know if the speculation sounds reasonable or not.  Regardless, we're
 not sure of the best fix.
 
 A trivial fix would be to just make OpenSM re-scan the node descriptor of an
 HCA, perhaps during a heavy sweep.  But I don't know if this is optimal.
 It'll introduce more MADs on the wire.  However, if the present solution is
 to restart OpenSM, we figure this can't be any worse.
 
 Just wondering what people's thoughts are, or if there's another obvious
 solution we're not seeing.
 
 Al
 
 --
 Albert Chu
 ch...@llnl.gov
 Computer Scientist
 High Performance Systems Division
 Lawrence Livermore National Laboratory
 
 


RE: Node Description mismatch between saquery & smpquery

2013-06-17 Thread Albert Chu
On Mon, 2013-06-17 at 22:00 +, Weiny, Ira wrote:
 Does running update_desc in the OpenSM console fix this?

This worked as a short-term solution.  But we're still thinking about a
longer-term one that requires less interaction.
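
One possible lower-interaction stopgap, sketched under assumptions: a
check that compares the two commands from the original report and flags
LIDs where they disagree. The function names are hypothetical, and the
label stripping assumes the output format shown in this thread.

```shell
#!/bin/sh
# Hypothetical helpers; they only wrap the two commands shown in this thread.

# Drop the "Node Description:..." / "NodeDescription..." label and its
# dot padding, keeping only the first matching line.
strip_label() {
    sed -n -E 's/^[[:space:]]*Node ?Description:?\.*//p' | head -n 1
}

# Compare what the node itself reports (smpquery) with what the SA has
# stored (saquery) for LID $1.
# Returns 0 when they agree, 1 on mismatch, 2 if either value is missing.
check_node_desc() {
    lid=$1
    live=$(smpquery NodeDesc "$lid" | strip_label)
    cached=$(saquery NodeRecord "$lid" | strip_label)
    if [ -z "$live" ] || [ -z "$cached" ]; then
        echo "LID $lid: could not read one of the descriptions" >&2
        return 2
    fi
    if [ "$live" != "$cached" ]; then
        echo "LID $lid: SA has '$cached', node reports '$live'"
        return 1
    fi
    return 0
}
```

Looped over the LIDs of interest from cron, this would at least report
when update_desc needs to be run, though it does not remove the manual
step.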

Al

 Ira
 
-- 
Albert Chu
ch...@llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory

