Re: [gpfsug-discuss] IO sizes

Kumaran Rajaram Thu, 24 Feb 2022 06:32:51 -0800

Hi Uwe,

>> But what puzzles me even more: one of the server compiles IOs even smaller, 
>> varying between 3.2MiB and 3.6MiB mostly - both for reads and writes ... I 
>> just cannot see why.

IMHO, if GPFS on this particular NSD server was restarted often during the 
setup, the GPFS pagepool may no longer be physically contiguous. As a result, 
an 8MiB GPFS buffer in the pagepool might be described by a scatter-gather (SG) 
list with many small entries, resulting in smaller I/Os when these buffers are 
issued to the disks. The fix would be to reboot the server and then start GPFS, 
so that the pagepool is contiguous and an 8MiB buffer consists of one (or just 
a few) SG entries.
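
To illustrate the effect described above, here is a minimal sketch in Python. 
It is a simplified model, not the actual kernel or GPFS logic: a buffer is 
described by the lengths of its physically contiguous runs (one SG entry each), 
and a request is closed as soon as it would exceed an assumed limit on SG 
entries (max_segments) or on size (max_kib). The function name, both limits and 
the 32 KiB run length are illustrative assumptions.

```python
def split_into_requests(run_lengths_kib, max_segments, max_kib):
    """Greedily pack contiguous runs (one SG entry each) into requests.

    Simplified model: a request is closed when adding the next run would
    exceed either the SG-entry limit or the size limit.
    """
    if max_segments < 1:
        raise ValueError("max_segments must be at least 1")
    requests = []
    size = 0
    entries = 0
    for run in run_lengths_kib:
        if run > max_kib:
            raise ValueError("a single run exceeds max_kib in this model")
        if entries + 1 > max_segments or size + run > max_kib:
            requests.append(size)
            size = 0
            entries = 0
        size += run
        entries += 1
    if entries:
        requests.append(size)
    return requests


# Contiguous pagepool: one 8 MiB run -> one 8 MiB request.
contiguous = split_into_requests([8192], max_segments=128, max_kib=8192)

# Fragmented pagepool: 256 runs of 32 KiB each, 128 SG entries allowed.
fragmented = split_into_requests([32] * 256, max_segments=128, max_kib=8192)

print(contiguous)   # [8192]
print(fragmented)   # [4096, 4096]
```

In this model the fragmented buffer is split only because the SG-entry limit is 
reached first; the size limit alone would have allowed a single 8 MiB request.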

>> In the current situation (i.e. with IOs bit larger than 4MiB) setting 
>> max_sectors_kB to 4096 might do the trick, but as I do not know the cause for 
>> that behaviour it might well start to issue IOs smaller than 4MiB again at 
>> some point, so that is not a nice solution.

It is advisable not to restart GPFS often on the NSD servers (in production), 
to keep the pagepool contiguous. Ensure that there is enough free memory on the 
NSD server and do not run any memory-intensive jobs there, so that the pagepool 
is not impacted (e.g. swapped out).

Also, set numaMemoryInterleave=yes in GPFS and verify that the pagepool is 
evenly distributed across the NUMA domains for good performance. 
numaMemoryInterleave=yes requires that the numactl package is installed and 
that GPFS is restarted afterwards.

# mmfsadm dump config | egrep "numaMemory|pagepool "
! numaMemoryInterleave yes
! pagepool 282394099712

# pgrep mmfsd | xargs numastat -p

Per-node process memory usage (in MBs) for PID 2120821 (mmfsd)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         1.26            3.26            4.52
Stack                        0.01            0.01            0.02
Private                 137710.43       137709.96       275420.39
----------------  --------------- --------------- ---------------
Total                   137711.70       137713.23       275424.92
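
The balance check on the numastat output can also be scripted. The following is 
a minimal Python sketch that assumes the output layout shown above (a "Private" 
row with one column per NUMA node followed by a total); the function names and 
the 1% tolerance are arbitrary choices for illustration.

```python
def private_mb_per_node(numastat_text):
    """Return the per-node values of the 'Private' row (total column dropped)."""
    for line in numastat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "Private":
            values = [float(v) for v in fields[1:]]
            return values[:-1]  # the last column is the total
    raise ValueError("no 'Private' row found")


def is_balanced(per_node_mb, tolerance=0.01):
    """True if every node is within `tolerance` (a fraction) of the mean."""
    mean = sum(per_node_mb) / len(per_node_mb)
    return all(abs(v - mean) <= tolerance * mean for v in per_node_mb)


sample = """\
Private                 137710.43       137709.96       275420.39
Total                   137711.70       137713.23       275424.92
"""
nodes = private_mb_per_node(sample)
print(nodes, is_balanced(nodes))   # [137710.43, 137709.96] True
```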

My two cents,
-Kums

Kumaran Rajaram

From: [email protected] 
<[email protected]> On Behalf Of Uwe Falke
Sent: Wednesday, February 23, 2022 8:04 PM
To: [email protected]
Subject: Re: [gpfsug-discuss] IO sizes

Hi,

The test bench is gpfsperf running on up to 12 clients with 1...64 threads 
doing sequential reads and writes; the file size per gpfsperf process is 12TB 
(with 6TB I saw caching effects, in particular for large thread numbers ...)

As I wrote initially: GPFS is issuing nothing but 8MiB IOs to the data disks, 
as expected in that case.

Interesting thing though:

I have rebooted the suspicious node. Now it does not issue smaller IOs than 
the others, but -- unbelievably -- larger ones (up to about 4.7MiB). This is 
still harmful, as that size is also incompatible with full stripe writes on the 
storage (8+2 disk groups, i.e. logically RAID6).
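
Whether a given IO size permits full stripe writes depends on the chunk size 
per data disk, which is not stated in this thread. The Python sketch below 
therefore uses a purely hypothetical chunk size of 512 KiB to show the 
arithmetic; the function names are illustrative as well.

```python
def full_stripe_kib(data_disks, chunk_kib):
    """Size of one full stripe: data chunks only, parity excluded."""
    return data_disks * chunk_kib


def is_full_stripe_io(io_kib, data_disks, chunk_kib):
    """True if the IO size is a whole multiple of the full stripe size.

    This is a size check only; the starting offset must also be
    stripe-aligned for a real full stripe write.
    """
    return io_kib % full_stripe_kib(data_disks, chunk_kib) == 0


CHUNK_KIB = 512  # hypothetical chunk size, not taken from this thread

print(full_stripe_kib(8, CHUNK_KIB))            # 4096
print(is_full_stripe_io(8192, 8, CHUNK_KIB))    # True
print(is_full_stripe_io(4813, 8, CHUNK_KIB))    # False (about 4.7 MiB)
```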

Currently, I draw this information from the storage boxes; I have not yet 
checked iostat data for that benchmark test after the reboot (before, when IO 
sizes were smaller, we saw that both in iostat and in the perf data retrieved 
from the storage controllers).

And: we have a separate data pool, hence dataOnly NSDs; I am just talking 
about these ...

As for "Are you sure that Linux OS is configured the same on all 4 NSD 
servers?" - of course no two boxes in the world are identical. I did not 
actually install those machines myself, and yes, I also considered reinstalling 
them (or at least the troublesome one).

However, I have no reason to assume or expect a difference; the supplier has 
only recently set up these systems from scratch.

In the current situation (i.e. with IOs a bit larger than 4MiB), setting 
max_sectors_kb to 4096 might do the trick, but as I do not know the cause of 
that behaviour, it might well start to issue IOs smaller than 4MiB again at 
some point, so that is not a nice solution.

Thanks

Uwe

On 23.02.22 22:20, Andrew Beattie wrote:
Alex,

Metadata will be 4KiB

Depending on the filesystem version you will also have subblocks to consider: 
V4 filesystems have subblocks of 1/32 of the block size, V5 filesystems of 
1/1024 (assuming metadata and data block size are the same).

My first question would be: "Are you sure that Linux OS is configured the 
same on all 4 NSD servers?"

My second question would be: do you know what your average file size is? If 
most of your files are smaller than your filesystem block size, then you are 
always going to be performing writes using groups of subblocks rather than 
full block writes.
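
As a rough illustration of the subblock arithmetic above, here is a minimal 
Python sketch. The number of subblocks per block is passed in as a parameter, 
taking the ratios quoted above at face value; the actual value should be 
checked on the filesystem itself. The function names and the 100 KiB file size 
are illustrative assumptions, and the calculation only covers a file smaller 
than one block.

```python
import math


def subblock_size_kib(block_size_kib, subblocks_per_block):
    """Size of one subblock in KiB."""
    return block_size_kib / subblocks_per_block


def subblocks_needed(file_size_kib, block_size_kib, subblocks_per_block):
    """Number of subblocks a small file (smaller than one block) occupies."""
    sub = subblock_size_kib(block_size_kib, subblocks_per_block)
    return math.ceil(file_size_kib / sub)


BLOCK_KIB = 8192  # 8 MiB block size, as used in this thread

print(subblock_size_kib(BLOCK_KIB, 32))        # 256.0
print(subblocks_needed(100, BLOCK_KIB, 32))    # 1
print(subblock_size_kib(BLOCK_KIB, 1024))      # 8.0
print(subblocks_needed(100, BLOCK_KIB, 1024))  # 13
```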

Regards,

Andrew

On 24 Feb 2022, at 04:39, Alex Chekholko 
<[email protected]><mailto:[email protected]> wrote:

Hi,

Metadata I/Os will always be smaller than the usual data block size, right?
Which version of GPFS?

Regards,
Alex

On Wed, Feb 23, 2022 at 10:26 AM Uwe Falke 
<[email protected]<mailto:[email protected]>> wrote:
Dear all,

sorry for asking a question which seems not directly GPFS related:

In a setup with 4 NSD servers (old-style, with storage controllers in
the back end), 12 clients and 10 Seagate storage systems, I see in
benchmark tests that just one of the NSD servers sends smaller IO
requests to the storage than the other 3 (that is, both reads and
writes are smaller).

The NSD servers form 2 pairs, each pair is connected to 5 Seagate boxes
(one server to the controllers A, the other one to the controllers B of
the Seagates, resp.).

All 4 NSD servers are set up similarly:

kernel: 3.10.0-1160.el7.x86_64 #1 SMP

HBA: Broadcom / LSI Fusion-MPT 12GSAS/PCIe Secure SAS38xx

driver : mpt3sas 31.100.01.00

max_sectors_kb=8192 (max_hw_sectors_kb=16383 , not 16384, as limited by
mpt3sas) for all sd devices and all multipath (dm) devices built on top.

scheduler: deadline

multipath (actually we do have 3 paths to each volume, so there is some
asymmetry, but that should not affect the IOs, should it? And if it
did, we would see the same effect in both pairs of NSD servers, but we do
not).

All 10 storage systems are also configured the same way (2 disk groups /
pools / declustered arrays, one managed by ctrl A, one by ctrl B, and
8 volumes out of each; makes altogether 2 x 8 x 10 = 160 NSDs).

GPFS BS is 8MiB; according to iohistory (mmdiag) we see clean IO
requests of 16384 disk blocks (i.e. 8192KiB) from GPFS.
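
The conversion behind that figure, as a minimal Python sketch. It assumes that 
the "disk blocks" reported are 512-byte sectors, which is consistent with 
16384 blocks equalling 8192KiB; the function name is illustrative.

```python
SECTOR_BYTES = 512  # assumption: one "disk block" is a 512-byte sector


def sectors_to_kib(sectors, sector_bytes=SECTOR_BYTES):
    """Convert a request size given in sectors to KiB."""
    return sectors * sector_bytes / 1024


print(sectors_to_kib(16384))  # 8192.0, i.e. one full 8 MiB GPFS block
print(sectors_to_kib(8192))   # 4096.0, i.e. a 4 MiB request
```

The same conversion applies to request sizes that a tool reports in sectors 
rather than in KiB.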

The first question I have - but that is not my main one: I see, both
in iostat and on the storage systems, that the default IO requests are
about 4MiB, not 8MiB as I'd expect from the above settings (max_sectors_kb
is really in terms of KiB, not sectors, cf.
https://www.kernel.org/doc/Documentation/block/queue-sysfs.txt).
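
To compare these limits across the NSD servers, they can be read per device 
from sysfs. The following is a minimal Python sketch; the function name and the 
device name "sda" are placeholders, and besides the two settings mentioned 
above it also reads max_segments, the limit on SG entries per request 
documented in the same kernel file. Files that do not exist are reported as 
None.

```python
from pathlib import Path


def read_queue_limits(device, sys_block=Path("/sys/block")):
    """Read selected block-queue limits for one device from sysfs."""
    queue = Path(sys_block) / device / "queue"
    limits = {}
    for name in ("max_sectors_kb", "max_hw_sectors_kb", "max_segments"):
        path = queue / name
        # int() tolerates the trailing newline in sysfs files.
        limits[name] = int(path.read_text()) if path.exists() else None
    return limits


# "sda" is only an example; use the sd and dm devices backing the NSDs.
print(read_queue_limits("sda"))
```

Running this for every sd and dm device on each server and comparing the 
results would show whether one server differs in any of these limits.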

But what puzzles me even more: one of the servers issues even smaller
IOs, varying mostly between 3.2MiB and 3.6MiB - both for reads and
writes ... I just cannot see why.

I have to suspect that this will (in writing to the storage) cause
incomplete stripe writes on our erasure-coded volumes (8+2p), as long as
the controller is not able to re-coalesce the data properly; and it
seems it cannot do so completely, at least.

If any of you has seen this before and/or knows a potential
explanation, I'd be glad to learn about it.

And if some of you wonder: yes, I (was) moved away from IBM and am now
at KIT.

Many thanks in advance

Uwe

--
Karlsruhe Institute of Technology (KIT)
Steinbuch Centre for Computing (SCC)
Scientific Data Management (SDM)

Uwe Falke

Hermann-von-Helmholtz-Platz 1, Building 442, Room 187
D-76344 Eggenstein-Leopoldshafen

Tel: +49 721 608 28024
Email: [email protected]<mailto:[email protected]>
www.scc.kit.edu

Registered office:
Kaiserstraße 12, 76131 Karlsruhe, Germany

KIT – The Research University in the Helmholtz Association

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
