Hi All, Time for a daily update on this saga…
First off, responses to those who have responded to me:

Yaron - we have QLogic switches, but I’ll RTFM and figure out how to clear the counters … with a quick look via the CLI interface to one of them I don’t see how to even look at those counters, much less clear them, but I’ll do some digging. QLogic does have a GUI app, but given that the Mac version is PowerPC only I think that’s a dead end! :-O

Jonathan - understood. We just wanted to eliminate as much hardware as potential culprits as we could. The storage arrays will all get a power cycle this Sunday when we take a downtime to do firmware upgrades on them … the vendor is basically refusing to assist further until we get on the latest firmware.

So … we had noticed that things seem to calm down starting Friday evening and continuing throughout the weekend. We have a script that runs every half hour, and if any NSD server shows an I/O > 1,000 ms in “mmdiag --iohist” we get an alert (again, designed to alert us to a CBM failure). We only got three alerts all weekend long (as opposed to last week, when the alerts were coming every half hour round the clock).

Then, this morning I repeated the “dd” test that I had run before and after replacing the FC cables going to “eon34”, which had shown very typical I/O rates for all the NSDs except the 4 in eon34, which were quite poor (~1.5 - 10 MB/sec). I ran the new tests this morning from different NSD servers and with a higher “count” passed to dd to eliminate any potential caching effects. I ran the test twice from two different NSD servers, and this morning all NSDs - including those on eon34 - showed normal I/O rates! Argh - so do we have a hardware problem or not?!? I still think we do, but am taking *nothing* for granted at this point!

So today we also used another script we’ve written to do some investigation … basically we took the script which runs “mmdiag --iohist” and added some options to it so that for every I/O greater than the threshold it notes which client issued the I/O and then queries SLURM to see what jobs are running on that client (a rough sketch of the idea is at the bottom of this message). Interestingly enough, one user showed up way more often than anybody else. Many times she was on a node with only one other user, who we know doesn’t access the GPFS filesystem, and other times she was the only user on the node.

We certainly recognize that correlation is not causation (she could be a victim and not the culprit), but she was on so many of the reported clients that we decided to investigate further … yet her jobs seem to have fairly modest I/O requirements. Each one processes 4 input files, which are basically just gzip’d text files of 1.5 - 5 GB in size. This is what prompted my other query to the list about determining which NSDs a given file has its blocks on: I couldn’t see how files of that size could have all their blocks on only a couple of NSDs in the pool (out of 19 total!), but wanted to verify that. The files I have looked at are evenly spread out across the NSDs.

So given that her files are spread across all 19 NSDs in the pool, and the high I/O wait times are almost always only on LUNs in eon34 (and, more specifically, on two of the four LUNs in eon34), I’m pretty well convinced it’s not her jobs causing the problems … I’m back to thinking a weird hardware issue. But if anyone wants to try to convince me otherwise, I’ll listen…
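In case it’s useful to anyone, here’s roughly the shape of that attribution logic (a stripped-down sketch, not our actual script; the “mmdiag --iohist” column layout varies between Scale releases, so the field indexes and the IP-to-hostname handling below are assumptions you’d have to adjust for your own environment):

#!/usr/bin/env python3
# Sketch: flag slow GPFS I/Os and map them to SLURM jobs on the client node.
# The mmdiag --iohist output format differs between Spectrum Scale releases,
# so TIME_COL and CLIENT_COL below are placeholders to be adjusted locally.
import socket
import subprocess

THRESHOLD_MS = 1000.0   # alert on any I/O slower than one second
TIME_COL = 7            # assumed column index of the "time ms" field
CLIENT_COL = -1         # assumed column index of the client/NSD node IP

def slow_ios():
    out = subprocess.run(["mmdiag", "--iohist"], capture_output=True,
                         text=True, check=True).stdout
    for line in out.splitlines():
        fields = line.split()
        try:
            ms = float(fields[TIME_COL])
        except (IndexError, ValueError):
            continue            # skip headers, separators, short lines
        if ms > THRESHOLD_MS:
            yield ms, fields[CLIENT_COL]

def jobs_on(ip):
    # Resolve the client IP to a short hostname, then ask SLURM what is
    # running there (user, job id, job name).
    try:
        node = socket.gethostbyaddr(ip)[0].split(".")[0]
    except OSError:
        return []
    out = subprocess.run(["squeue", "-h", "-w", node, "-o", "%u %i %j"],
                         capture_output=True, text=True).stdout
    return [l for l in out.splitlines() if l.strip()]

if __name__ == "__main__":
    for ms, client in slow_ios():
        print(f"{ms:8.1f} ms  client {client}")
        for job in jobs_on(client):
            print(f"              {job}")

The real script also handles the run-every-half-hour / send-an-alert plumbing, which I’ve left out here since the parsing and the squeue lookup are the interesting bits.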
Thanks!

Kevin

On Jul 8, 2018, at 12:32 PM, Yaron Daniel <y...@il.ibm.com> wrote:

Hi,

Clear all counters on the FC switches and see which ports have errors.

For Brocade run:
slotstatsclear
statsclear
porterrshow

For Cisco run:
clear counters all

There might be a bad GBIC/cable/storage GBIC, which can affect performance. If there is something like that, you can see which ports’ errors grow over time.

Regards,

Yaron Daniel
Storage Architect - IL Lab Services (Storage)
IBM Global Markets, Systems HW Sales
94 Em Ha'Moshavot Rd, Petach Tiqva, 49527, Israel
Phone: +972-3-916-5672   Fax: +972-3-916-5672   Mobile: +972-52-8395593
e-mail: y...@il.ibm.com
IBM Israel: http://www.ibm.com/il/he/

From: Jonathan Buzzard <jonathan.buzz...@strath.ac.uk>
To: gpfsug-discuss@spectrumscale.org
Date: 07/07/2018 11:43 AM
Subject: Re: [gpfsug-discuss] High I/O wait times
Sent by: gpfsug-discuss-boun...@spectrumscale.org

On 07/07/18 01:28, Buterbaugh, Kevin L wrote:

[SNIP]

> So, to try to rule out everything but the storage array we replaced the
> FC cables going from the SAN switches to the array, plugging the new
> cables into different ports on the SAN switches. Then we repeated the
> dd tests from a different NSD server, which both eliminated the NSD
> server and its FC cables as a potential cause … and saw results
> virtually identical to the previous test. Therefore, we feel pretty
> confident that it is the storage array and have let the vendor know all
> of this.

I was not thinking of doing anything quite as drastic as replacing stuff, more of looking into the logs on the switches in the FC network and examining them for packet errors. The above testing didn't eliminate bad optics in the storage array itself, for example, though it does appear to be the storage arrays themselves. Sounds like they could do with a power cycle...

JAB.

--
Jonathan A. Buzzard                         Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss