RE: Node Description mismatch between saquery & smpquery

2013-06-18 Thread Weiny, Ira
> -----Original Message-----
> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
> Subject: Re: Node Description mismatch between saquery & smpquery
> 
> On Tue, 2013-06-18 at 07:13 -0400, Hal Rosenstock wrote:
> > On 6/17/2013 5:38 PM, Albert Chu wrote:
> > > We've recently noticed that the Node Description for a node can
> > > mismatch between the output of smpquery and saquery.  For example:
> > >
> > > # smpquery NodeDesc 427
> > > Node Description:.sierra1932 qib0
> > >
> > > # saquery NodeRecord 427 | grep NodeDesc
> > > NodeDescription.QLogic Infiniband HCA
> > >
> > > A restart of OpenSM is the current solution to resolve this.

[snip]

> > >
> > > When the node descriptor is changed, a trap should be sent to opensm
> > > indicating the change.  Normally OpenSM gets the trap and reads the
> > > new node descriptor.
> >
> > Are you sure the trap is being issued by those devices when the
> > NodeDescription is changed locally ?
> 
> These particular devices do support the trap and tests show they do send
> traps on changes (i.e. manually changing
> /sys/class/infiniband/qib0/node_desc).
> 
> > Also, if so, do these devices implement timeout/retry on sending the
> > trap (e.g. trying to make sure that they receive TrapRepress before
> > giving up on trap) ?
> 
> This I don't know.  I've been trying to figure out if they do and, if they do,
> how it might be configurable.  Is there a way to figure this out?
> 

Looking quickly at the driver I don't think it does resend the trap.  However, 
Mike might know better: CC'ed.

Ira

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Node Description mismatch between saquery & smpquery

2013-06-18 Thread Albert Chu
On Tue, 2013-06-18 at 07:13 -0400, Hal Rosenstock wrote:
> On 6/17/2013 5:38 PM, Albert Chu wrote:
> > We've recently noticed that the Node Description for a node can
> > mismatch between the output of smpquery and saquery.  For example:
> > 
> > # smpquery NodeDesc 427
> > Node Description:.sierra1932 qib0
> > 
> > # saquery NodeRecord 427 | grep NodeDesc
> > NodeDescription.QLogic Infiniband HCA
> > 
> > A restart of OpenSM is the current solution to resolve this.
> > 
> > We've noticed it occurring more often on our larger clusters than our
> > smaller clusters, leading to a speculation about why it is happening.
> > 
> > The speculation is that when a node comes up, there is a window of time in
> > which the HCA is up and can be scanned by OpenSM, but does not yet have its
> > node descriptor set (in RHEL it appears to be set via /etc/init.d/rdma).
> > During this window, OpenSM reads/stores the non-desired node descriptor
> > (in the above case the non-desired "QLogic Infiniband HCA").
> > 
> > When the node descriptor is changed, a trap should be sent to opensm
> > indicating the change.  Normally OpenSM gets the trap and reads the new
> > node descriptor.
> 
> Are you sure the trap is being issued by those devices when the
> NodeDescription is changed locally ?

These particular devices do support the trap and tests show they do send
traps on changes (i.e. manually
changing /sys/class/infiniband/qib0/node_desc).
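
For reference, a minimal sketch of the kind of manual change used in such a test. It assumes the descriptor follows the "hostname device" pattern seen in the smpquery output above and that the sysfs file accepts a plain write; the helper names are hypothetical, and the 64-byte cap reflects the size of the NodeDescription field.

```python
from pathlib import Path

# NodeDescription is a fixed 64-byte field, so longer strings are truncated.
MAX_NODE_DESC_BYTES = 64


def build_node_desc(hostname, device):
    """Build a "hostname device" descriptor, truncated to 64 bytes of UTF-8."""
    encoded = f"{hostname} {device}".encode("utf-8")[:MAX_NODE_DESC_BYTES]
    # Drop any multi-byte character that the truncation cut in half.
    return encoded.decode("utf-8", errors="ignore")


def write_node_desc(desc, path="/sys/class/infiniband/qib0/node_desc"):
    """Write the descriptor to sysfs (needs root; path taken from this thread)."""
    Path(path).write_text(desc)
```

Calling write_node_desc(build_node_desc("sierra1932", "qib0")) as root should change the descriptor locally; whether the SM then learns of the change is exactly the open question here.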

> Also, if so, do these devices implement timeout/retry on sending the
> trap (e.g. trying to make sure that they receive TrapRepress before
> giving up on trap) ?

This I don't know.  I've been trying to figure out if they do and if
they do how it might be configurable.  Is there a way to figure this
out?

> > On our large clusters all nodes are typically brought up at the same
> > time, so there are probably a ton of node descriptor change traps
> > happening at the exact same time.  We speculate a number of these are
> > dropped/lost, and subsequently OpenSM never realizes that the node
> > descriptor has changed.
> 
> Do you see any evidence that traps are being dropped ? Have you
> correlated any VL15Dropped counters in the subnet with this ? Also,
> there is a module parameter in the MAD kernel module that might help with
> any unsolicited MAD bursts. You might try increasing that on your SM
> node(s).

On our largest clusters we always see a nice chunk of VL15 drops,
however we haven't correlated them specifically to a trap.
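
One way to start correlating would be to sample the counter before and after a mass boot and compare. The following is only a sketch: it assumes the counter text comes from a tool such as perfquery and uses the dot-filled "VL15Dropped:....N" layout, and the function names are hypothetical.

```python
import re

# Matches a line such as "VL15Dropped:.....................12".
_VL15_RE = re.compile(r"^\s*VL15Dropped:?\.*(\d+)\s*$", re.MULTILINE)


def parse_vl15_dropped(counter_output):
    """Return the VL15Dropped value from counter text, or None if absent."""
    match = _VL15_RE.search(counter_output)
    return int(match.group(1)) if match else None


def vl15_drop_delta(before_output, after_output):
    """Return how much VL15Dropped grew between two samples of one port.

    Returns None if either sample lacks the counter. A negative result
    suggests the counter was reset between the two samples.
    """
    before = parse_vl15_dropped(before_output)
    after = parse_vl15_dropped(after_output)
    if before is None or after is None:
        return None
    return after - before
```

A non-zero delta on the SM port during the boot window would not prove that a NodeDescription trap was lost, but it would narrow down where to look.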

> > I don't know if the speculation sounds reasonable or not.  Regardless,
> > we're not sure of the best fix.
> > 
> > A trivial fix would be to just make OpenSM re-scan the node descriptor
> > of an HCA, perhaps during a heavy sweep.  But I don't know if this is
> > optimal.  It'll introduce more MADs on the wire.  However if the present
> > solution is to restart OpenSM, we figure this can't be any worse.
> 
> Yes, but to add the additional queries in is O(n) there and has been
> resisted in the past.
> 
> > Just wondering what people's thoughts are, or if there's another obvious
> > solution we're not seeing.
> 
> I think this issue needs better understanding first.

Yeah, just looking for hints/pointers for the time being.

Thanks,

Al

> -- Hal
> 
> > Al
> > 
> 
-- 
Albert Chu
ch...@llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory




Re: Node Description mismatch between saquery & smpquery

2013-06-18 Thread Hal Rosenstock
On 6/17/2013 5:38 PM, Albert Chu wrote:
> We've recently noticed that the Node Description for a node can
> mismatch between the output of smpquery and saquery.  For example:
> 
> # smpquery NodeDesc 427
> Node Description:.sierra1932 qib0
> 
> # saquery NodeRecord 427 | grep NodeDesc
> NodeDescription.QLogic Infiniband HCA
> 
> A restart of OpenSM is the current solution to resolve this.
> 
> We've noticed it occurring more often on our larger clusters than our
> smaller clusters, leading to a speculation about why it is happening.
> 
> The speculation is that when a node comes up, there is a window of time in
> which the HCA is up and can be scanned by OpenSM, but does not yet have its
> node descriptor set (in RHEL it appears to be set via /etc/init.d/rdma).
> During this window, OpenSM reads/stores the non-desired node descriptor
> (in the above case the non-desired "QLogic Infiniband HCA").
> 
> When the node descriptor is changed, a trap should be sent to opensm
> indicating the change.  Normally OpenSM gets the trap and reads the new
> node descriptor.

Are you sure the trap is being issued by those devices when the
NodeDescription is changed locally ?

Also, if so, do these devices implement timeout/retry on sending the
trap (e.g. trying to make sure that they receive TrapRepress before
giving up on trap) ?

> On our large clusters all nodes are typically brought up at the same
> time, so there are probably a ton of node descriptor change traps
> happening at the exact same time.  We speculate a number of these are
> dropped/lost, and subsequently OpenSM never realizes that the node
> descriptor has changed.

Do you see any evidence that traps are being dropped ? Have you
correlated any VL15Dropped counters in the subnet with this ? Also,
there is a module parameter in the MAD kernel module that might help with
any unsolicited MAD bursts. You might try increasing that on your SM
node(s).

> I don't know if the speculation sounds reasonable or not.  Regardless,
> we're not sure of the best fix.
> 
> A trivial fix would be to just make OpenSM re-scan the node descriptor
> of an HCA, perhaps during a heavy sweep.  But I don't know if this is
> optimal.  It'll introduce more MADs on the wire.  However if the present
> solution is to restart OpenSM, we figure this can't be any worse.

Yes, but to add the additional queries in is O(n) there and has been
resisted in the past.

> Just wondering what people's thoughts are, or if there's another obvious
> solution we're not seeing.

I think this issue needs better understanding first.

-- Hal

> Al
> 



RE: Node Description mismatch between saquery & smpquery

2013-06-17 Thread Albert Chu
On Mon, 2013-06-17 at 22:00 +, Weiny, Ira wrote:
> Does running "update_desc" in the console fix this?

This worked as a short term solution.  But we're still thinking about a
longer term one that requires less interaction.
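
As a stopgap, the mismatch could at least be detected automatically. A minimal sketch, assuming the smpquery and saquery output formats shown in this thread (dot-filled labels, commands invoked exactly as above); the function names are hypothetical and this is not part of infiniband-diags.

```python
import re
import subprocess

# Matches both "Node Description:....<desc>" (smpquery) and
# "NodeDescription....<desc>" (saquery); the dots are fill characters.
_DESC_RE = re.compile(r"^\s*Node\s?Description:?\.*(.*)$", re.MULTILINE)


def extract_node_desc(output):
    """Return the node description found in tool output, or None."""
    match = _DESC_RE.search(output)
    return match.group(1).strip() if match else None


def descriptions_match(smp_output, sa_output):
    """True only if both outputs carry the same node description."""
    smp_desc = extract_node_desc(smp_output)
    sa_desc = extract_node_desc(sa_output)
    return smp_desc is not None and smp_desc == sa_desc


def check_lid(lid):
    """Run both tools for one LID, mirroring the commands in this thread."""
    smp = subprocess.run(["smpquery", "NodeDesc", str(lid)],
                         capture_output=True, text=True, check=True).stdout
    sa = subprocess.run(["saquery", "NodeRecord", str(lid)],
                        capture_output=True, text=True, check=True).stdout
    return descriptions_match(smp, sa)
```

A cron job could call check_lid() over the known LIDs and trigger "update_desc" only when a mismatch shows up, at the cost of one extra SMP and SA query per node.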

Al

> Ira
> 
> > -----Original Message-----
> > From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
> > ow...@vger.kernel.org] On Behalf Of Albert Chu
> > Sent: Monday, June 17, 2013 2:38 PM
> > To: linux-rdma@vger.kernel.org
> > Subject: Node Description mismatch between saquery & smpquery
> > 
> > > We've recently noticed that the Node Description for a node can
> > > mismatch between the output of smpquery and saquery.  For example:
> > 
> > # smpquery NodeDesc 427
> > Node Description:.sierra1932 qib0
> > 
> > # saquery NodeRecord 427 | grep NodeDesc
> > NodeDescription.QLogic Infiniband HCA
> > 
> > A restart of OpenSM is the current solution to resolve this.
> > 
> > > We've noticed it occurring more often on our larger clusters than our
> > > smaller clusters, leading to a speculation about why it is happening.
> > 
> > > The speculation is that when a node comes up, there is a window of time in
> > > which the HCA is up and can be scanned by OpenSM, but does not yet have its
> > > node descriptor set (in RHEL it appears to be set via /etc/init.d/rdma).
> > > During this window, OpenSM reads/stores the non-desired node descriptor
> > > (in the above case the non-desired "QLogic Infiniband HCA").
> > 
> > When the node descriptor is changed, a trap should be sent to opensm
> > indicating the change.  Normally OpenSM gets the trap and reads the new
> > node descriptor.
> > 
> > > On our large clusters all nodes are typically brought up at the same time,
> > > so there are probably a ton of node descriptor change traps happening at
> > > the exact same time.  We speculate a number of these are dropped/lost, and
> > > subsequently OpenSM never realizes that the node descriptor has changed.
> > 
> > I don't know if the speculation sounds reasonable or not.  Regardless, we're
> > not sure of the best fix.
> > 
> > A trivial fix would be to just make OpenSM re-scan the node descriptor of an
> > > HCA, perhaps during a heavy sweep.  But I don't know if this is optimal.
> > > It'll introduce more MADs on the wire.  However if the present solution is
> > > to restart OpenSM, we figure this can't be any worse.
> > 
> > > Just wondering what people's thoughts are, or if there's another obvious
> > > solution we're not seeing.
> > 
> > Al
> > 
> > --
> > Albert Chu
> > ch...@llnl.gov
> > Computer Scientist
> > High Performance Systems Division
> > Lawrence Livermore National Laboratory
> > 
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Albert Chu
ch...@llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory




RE: Node Description mismatch between saquery & smpquery

2013-06-17 Thread Weiny, Ira
Does running "update_desc" in the console fix this?

Ira

> -----Original Message-----
> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
> ow...@vger.kernel.org] On Behalf Of Albert Chu
> Sent: Monday, June 17, 2013 2:38 PM
> To: linux-rdma@vger.kernel.org
> Subject: Node Description mismatch between saquery & smpquery
> 
> We've recently noticed that the Node Description for a node can
> mismatch between the output of smpquery and saquery.  For example:
> 
> # smpquery NodeDesc 427
> Node Description:.sierra1932 qib0
> 
> # saquery NodeRecord 427 | grep NodeDesc
> NodeDescription.QLogic Infiniband HCA
> 
> A restart of OpenSM is the current solution to resolve this.
> 
> We've noticed it occurring more often on our larger clusters than our smaller
> clusters, leading to a speculation about why it is happening.
> 
> The speculation is that when a node comes up, there is a window of time in
> which the HCA is up and can be scanned by OpenSM, but does not yet have its
> node descriptor set (in RHEL it appears to be set via /etc/init.d/rdma).
> During this window, OpenSM reads/stores the non-desired node descriptor
> (in the above case the non-desired "QLogic Infiniband HCA").
> 
> When the node descriptor is changed, a trap should be sent to opensm
> indicating the change.  Normally OpenSM gets the trap and reads the new
> node descriptor.
> 
> On our large clusters all nodes are typically brought up at the same time, so
> there are probably a ton of node descriptor change traps happening at the
> exact same time.  We speculate a number of these are dropped/lost, and
> subsequently OpenSM never realizes that the node descriptor has changed.
> 
> I don't know if the speculation sounds reasonable or not.  Regardless, we're
> not sure of the best fix.
> 
> A trivial fix would be to just make OpenSM re-scan the node descriptor of an
> HCA, perhaps during a heavy sweep.  But I don't know if this is optimal.
> It'll introduce more MADs on the wire.  However if the present solution is
> to restart OpenSM, we figure this can't be any worse.
> 
> Just wondering what people's thoughts are, or if there's another obvious
> solution we're not seeing.
> 
> Al
> 
> --
> Albert Chu
> ch...@llnl.gov
> Computer Scientist
> High Performance Systems Division
> Lawrence Livermore National Laboratory
> 
> 