This might be a stretch, but do you happen to have a user, fileset, or group
over its hard quota, or over its soft quota past the grace period? We've
seen this really upset our cluster before. At least with 3.5, each op done
against an over-quota user/group/fileset results in at least one RPC from
the fs manager to every node in the cluster.
Are those waiters from an fs manager node? If so, perhaps briefly fire up
tracing (/usr/lpp/mmfs/bin/mmtrace start), let it run for ~10 seconds,
then stop it (/usr/lpp/mmfs/bin/mmtrace stop), and grep for
"TRACE_QUOTA" in the resulting trcrpt file. If you see a bunch of
lines that contain:
TRACE_QUOTA: qu.server revoke reply type
that might be what's going on. You can also spot the behavior in the output
of mmdiag --network on your fs manager nodes: look for a bunch of
RPCs with all of your cluster nodes listed as the recipients.
I can't recall what the RPC you're looking for is called, though.
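For reference, the trace-and-grep steps above look roughly like this (the trcrpt path is an assumption; mmtrace reports the actual file location when it stops):

```
# Sketch of the tracing steps described above
/usr/lpp/mmfs/bin/mmtrace start       # begin tracing on the fs manager node
sleep 10                              # capture ~10 seconds of activity
/usr/lpp/mmfs/bin/mmtrace stop        # stop tracing; a trcrpt file is produced
grep TRACE_QUOTA /tmp/mmfs/trcrpt.*   # report path is an assumption -- check mmtrace's output
```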
Hope that helps!
-Aaron
On 1/26/17 7:57 PM, Oesterlin, Robert wrote:
OK, I have a sick cluster, and it seems to be tied up with quota-related
RPCs like this. Any help in narrowing down what the issue is?
Waiting 3.8729 sec since 19:54:09, monitored, thread 32786 Msg handler
quotaMsgRequestShare: on ThCond 0x1801919D3C8 (LkObjCondvar), reason
'waiting for WA lock'
Waiting 4.3158 sec since 19:54:08, monitored, thread 32771 Msg handler
quotaMsgRequestShare: on ThCond 0x1801919D3C8 (LkObjCondvar), reason
'waiting for WA lock'
Waiting 4.3173 sec since 19:54:08, monitored, thread 35829 Msg handler
quotaMsgPrefetchShare: on ThCond 0x1801919D3C8 (LkObjCondvar), reason
'waiting for WA lock'
Waiting 4.4619 sec since 19:54:08, monitored, thread 9694 Msg handler
quotaMsgRequestShare: on ThCond 0x1801919D3C8 (LkObjCondvar), reason
'waiting for WA lock'
Waiting 4.4967 sec since 19:54:08, monitored, thread 32357 Msg handler
quotaMsgRelinquish: on ThCond 0x1801919D3C8 (LkObjCondvar), reason
'waiting for WA lock'
Waiting 4.6885 sec since 19:54:08, monitored, thread 32305 Msg handler
quotaMsgRequestShare: on ThCond 0x1801919D3C8 (LkObjCondvar), reason
'waiting for WA lock'
Waiting 4.7123 sec since 19:54:08, monitored, thread 32261 Msg handler
quotaMsgRequestShare: on ThCond 0x1801919D3C8 (LkObjCondvar), reason
'waiting for WA lock'
Waiting 4.7932 sec since 19:54:08, monitored, thread 53409 Msg handler
quotaMsgRelinquish: on ThCond 0x1801919D3C8 (LkObjCondvar), reason
'waiting for WA lock'
Waiting 5.2954 sec since 19:54:07, monitored, thread 32905 Msg handler
quotaMsgRequestShare: on ThCond 0x1801919D3C8 (LkObjCondvar), reason
'waiting for WA lock'
Waiting 5.3058 sec since 19:54:07, monitored, thread 32573 Msg handler
quotaMsgPrefetchShare: on ThCond 0x1801919D3C8 (LkObjCondvar), reason
'waiting for WA lock'
Waiting 5.3207 sec since 19:54:07, monitored, thread 32397 Msg handler
quotaMsgRelinquish: on ThCond 0x1801919D3C8 (LkObjCondvar), reason
'waiting for WA lock'
Waiting 5.3274 sec since 19:54:07, monitored, thread 32897 Msg handler
quotaMsgRelinquish: on ThCond 0x1801919D3C8 (LkObjCondvar), reason
'waiting for WA lock'
Waiting 5.3343 sec since 19:54:07, monitored, thread 32691 Msg handler
quotaMsgRelinquish: on ThCond 0x1801919D3C8 (LkObjCondvar), reason
'waiting for WA lock'
Waiting 5.3347 sec since 19:54:07, monitored, thread 32364 Msg handler
quotaMsgRequestShare: on ThCond 0x1801919D3C8 (LkObjCondvar), reason
'waiting for WA lock'
Waiting 5.3348 sec since 19:54:07, monitored, thread 32522 Msg handler
quotaMsgRelinquish: on ThCond 0x1801919D3C8 (LkObjCondvar), reason
'waiting for WA lock'
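(As an aside: when triaging a dump like this, a throwaway script can tally the waiters by message handler and report the longest wait. This is a hypothetical helper, not a GPFS tool, and it assumes each waiter prints on a single line rather than wrapped as in the email above.)

```python
import re
from collections import Counter

# Matches "mmdiag --waiters"-style lines: wait time in seconds and the
# message handler name (e.g. quotaMsgRequestShare, quotaMsgRelinquish).
WAITER_RE = re.compile(r"Waiting ([\d.]+) sec .* Msg handler\s+(\w+):")

def tally_waiters(text):
    """Return (Counter of waiters per handler, longest wait in seconds)."""
    counts = Counter()
    longest = 0.0
    for line in text.splitlines():
        m = WAITER_RE.search(line)
        if m:
            counts[m.group(2)] += 1
            longest = max(longest, float(m.group(1)))
    return counts, longest
```

Feeding it a waiter dump like the one above would show how many threads are stacked up behind each quota message handler and how long the oldest has been waiting.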
Bob Oesterlin
Sr Principal Storage Engineer, Nuance
507-269-0413
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776