> > Are you seeing the issues across the whole file system or in certain
> > areas?
>
Only with accounts in GPFS; local accounts and root do not get this.

> That sounds like inode exhaustion to me (and based on it not being block
> exhaustion as you've demonstrated).
>
> What does a "df -i /cluster" show you?

We bumped it up a few weeks ago:

df -i /cluster
Filesystem        Inodes     IUsed     IFree IUse% Mounted on
cluster        276971520 154807697 122163823   56% /cluster

> Or if this is only in a certain area you can "cd" into that directory and
> run a "df -i ."

As root on a login node:

df -i
Filesystem        Inodes     IUsed     IFree IUse% Mounted on
/dev/sda2       20971520    169536  20801984    1% /
devtmpfs        12169978       528  12169450    1% /dev
tmpfs           12174353      1832  12172521    1% /run
tmpfs           12174353        77  12174276    1% /dev/shm
tmpfs           12174353        15  12174338    1% /sys/fs/cgroup
/dev/sda1              0         0         0     - /boot/efi
/dev/sda3       52428800      2887  52425913    1% /var
/dev/sda7      277368832     35913 277332919    1% /local
/dev/sda5      104857600       398 104857202    1% /tmp
tmpfs           12174353         1  12174352    1% /run/user/551336
tmpfs           12174353         1  12174352    1% /run/user/0
moto           276971520 154807697 122163823   56% /cluster
tmpfs           12174353         3  12174350    1% /run/user/441245
tmpfs           12174353        12  12174341    1% /run/user/553562
tmpfs           12174353         1  12174352    1% /run/user/525583
tmpfs           12174353         1  12174352    1% /run/user/476374
tmpfs           12174353         1  12174352    1% /run/user/468934
tmpfs           12174353         5  12174348    1% /run/user/551200
tmpfs           12174353         1  12174352    1% /run/user/539143
tmpfs           12174353         1  12174352    1% /run/user/488676
tmpfs           12174353         1  12174352    1% /run/user/493713
tmpfs           12174353         1  12174352    1% /run/user/507831
tmpfs           12174353         1  12174352    1% /run/user/549822
tmpfs           12174353         1  12174352    1% /run/user/500569
tmpfs           12174353         1  12174352    1% /run/user/443748
tmpfs           12174353         1  12174352    1% /run/user/543676
tmpfs           12174353         1  12174352    1% /run/user/451446
tmpfs           12174353         1  12174352    1% /run/user/497945
tmpfs           12174353         6  12174347    1% /run/user/554672
tmpfs           12174353        32  12174321    1% /run/user/554653
tmpfs           12174353         1  12174352    1% /run/user/30094
tmpfs           12174353         1  12174352    1% /run/user/470790
tmpfs           12174353        59  12174294    1% /run/user/553037
tmpfs           12174353         1  12174352    1% /run/user/554670
tmpfs           12174353         1  12174352    1% /run/user/548236
tmpfs           12174353         1  12174352    1% /run/user/547288
tmpfs           12174353         1  12174352    1% /run/user/547289

> You may need to allocate more inodes to an independent inode fileset
> somewhere. Especially with something as old as 4.2.3 you won't have
> auto-inode expansion for the filesets.

Do we have to restart any service after upping the inode count? (See the
sketch of the relevant commands at the bottom of this message.)

> Best,
>
> J.D. Maloney
> Lead HPC Storage Engineer | Storage Enabling Technologies Group
> National Center for Supercomputing Applications (NCSA)

Hi JD, I took an intermediate LCI workshop with you at Univ of Cincinnati!

> *From: *gpfsug-discuss <gpfsug-discuss-boun...@gpfsug.org> on behalf of
> Rob Kudyba <rk3...@columbia.edu>
> *Date: *Thursday, June 6, 2024 at 3:50 PM
> *To: *gpfsug-discuss@gpfsug.org <gpfsug-discuss@gpfsug.org>
> *Subject: *[gpfsug-discuss] No space left on device, but plenty of quota
> space for inodes and blocks
>
> Running GPFS 4.2.3 on a DDN GridScaler and users are getting the "No space
> left on device" message when trying to write to a file. In
> /var/adm/ras/mmfs.log the only recent errors are these:
>
> 2024-06-06_15:51:22.311-0400: mmcommon getContactNodes cluster failed.
>   Return code -1.
> 2024-06-06_15:51:22.311-0400: The previous error was detected on node
>   x.x.x.x (headnode).
> 2024-06-06_15:53:25.088-0400: mmcommon getContactNodes cluster failed.
>   Return code -1.
> 2024-06-06_15:53:25.088-0400: The previous error was detected on node
>   x.x.x.x (headnode).
>
> According to
> https://www.ibm.com/docs/en/storage-scale/5.1.9?topic=messages-6027-615
>
>   Check the preceding messages, and consult the earlier chapters of this
>   document. A frequent cause for such errors is lack of space in /var.
>
> We have plenty of space left.
>
> /usr/lpp/mmfs/bin/mmlsdisk cluster
> disk          driver   sector     failure holds    holds                             storage
> name          type       size       group metadata data  status        availability pool
> ------------- -------- ------ ----------- -------- ----- ------------- ------------ ------------
> S01_MDT200_1  nsd        4096         200 Yes      No    ready         up           system
> S01_MDT201_1  nsd        4096         201 Yes      No    ready         up           system
> S01_DAT0001_1 nsd        4096         100 No       Yes   ready         up           data1
> S01_DAT0002_1 nsd        4096         101 No       Yes   ready         up           data1
> S01_DAT0003_1 nsd        4096         100 No       Yes   ready         up           data1
> S01_DAT0004_1 nsd        4096         101 No       Yes   ready         up           data1
> S01_DAT0005_1 nsd        4096         100 No       Yes   ready         up           data1
> S01_DAT0006_1 nsd        4096         101 No       Yes   ready         up           data1
> S01_DAT0007_1 nsd        4096         100 No       Yes   ready         up           data1
>
> /usr/lpp/mmfs/bin/mmdf headnode
> disk                disk size  failure holds    holds              free KB             free KB
> name                    in KB    group metadata data        in full blocks        in fragments
> --------------- ------------- -------- -------- ----- -------------------- -------------------
> Disks in storage pool: system (Maximum disk size allowed is 14 TB)
> S01_MDT200_1       1862270976      200 Yes      No       969134848 ( 52%)      2948720 ( 0%)
> S01_MDT201_1       1862270976      201 Yes      No       969126144 ( 52%)      2957424 ( 0%)
>                 -------------                         -------------------- -------------------
> (pool total)       3724541952                           1938260992 ( 52%)      5906144 ( 0%)
>
> Disks in storage pool: data1 (Maximum disk size allowed is 578 TB)
> S01_DAT0007_1     77510737920      100 No       Yes    21080752128 ( 27%)    897723392 ( 1%)
> S01_DAT0005_1     77510737920      100 No       Yes    14507212800 ( 19%)    949412160 ( 1%)
> S01_DAT0001_1     77510737920      100 No       Yes    14503620608 ( 19%)    951327680 ( 1%)
> S01_DAT0003_1     77510737920      100 No       Yes    14509205504 ( 19%)    949340544 ( 1%)
> S01_DAT0002_1     77510737920      101 No       Yes    14504585216 ( 19%)    948377536 ( 1%)
> S01_DAT0004_1     77510737920      101 No       Yes    14503647232 ( 19%)    952892480 ( 1%)
> S01_DAT0006_1     77510737920      101 No       Yes    14504486912 ( 19%)    949072512 ( 1%)
>                 -------------                         -------------------- -------------------
> (pool total)     542575165440                         108113510400 ( 20%)   6598146304 ( 1%)
>
>                 =============                         ==================== ===================
> (data)           542575165440                         108113510400 ( 20%)   6598146304 ( 1%)
> (metadata)         3724541952                           1938260992 ( 52%)      5906144 ( 0%)
>                 =============                         ==================== ===================
> (total)          546299707392                         110051771392 ( 22%)   6604052448 ( 1%)
>
> Inode Information
> -----------------
> Total number of used inodes in all Inode spaces:          154807668
> Total number of free inodes in all Inode spaces:           12964492
> Total number of allocated inodes in all Inode spaces:      167772160
> Total of Maximum number of inodes in all Inode spaces:     276971520
>
> On the head node:
>
> df -h
> Filesystem          Size  Used Avail Use% Mounted on
> /dev/sda4           430G  216G  215G  51% /
> devtmpfs             47G     0   47G   0% /dev
> tmpfs                47G     0   47G   0% /dev/shm
> tmpfs                47G  4.1G   43G   9% /run
> tmpfs                47G     0   47G   0% /sys/fs/cgroup
> /dev/sda1           504M  114M  365M  24% /boot
> /dev/sda2           100M  9.9M   90M  10% /boot/efi
> x.x.x.:/nfs-share   430G  326G  105G  76% /nfs-share
> cluster             506T  405T  101T  81% /cluster
> tmpfs               9.3G     0  9.3G   0% /run/user/443748
> tmpfs               9.3G     0  9.3G   0% /run/user/547288
> tmpfs               9.3G     0  9.3G   0% /run/user/551336
> tmpfs               9.3G     0  9.3G   0% /run/user/547289
>
> The login nodes have plenty of space in /var:
>
> /dev/sda3            50G  8.7G   42G  18% /var
>
> What else should we check? We are only at 81% on the GPFS-mounted file
> system, which should leave plenty of room to write without these errors.
> Any recommended service(s) that we can restart?
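For the record, here is a minimal sketch of the fileset checks this thread points to,
assuming the file system device is "cluster" (as in the mmlsdisk output above); the
fileset name "home" and the new limits below are placeholders, not our real values:

  # Show each fileset's inode limit, allocation, and usage (the -i scan can be slow)
  /usr/lpp/mmfs/bin/mmlsfileset cluster -L -i

  # Raise the inode limit on whichever independent fileset is exhausted, e.g. one named "home"
  /usr/lpp/mmfs/bin/mmchfileset cluster home --inode-limit 20000000

  # For the root fileset (the file-system-wide maximum), the equivalent is
  /usr/lpp/mmfs/bin/mmchfs cluster --inode-limit 300000000

My understanding is that these take effect online, with no mmfsd restart or remount
required, but corrections are welcome if 4.2.3 behaves differently.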
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org