Your help on this issue has been much appreciated, thanks. I deleted all the zero-length files for the group that was having issues. The robinhood report and the quota are now reporting the same number of files. Amazing. Thanks again.
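
For anyone who hits the same thing, the cleanup amounted to something like this (a sketch with our values filled in; the mountpoint, the gid, and the MDT0000 subdirectory of .lustre/lost+found are site-specific assumptions):

    # list the group's zero-length objects first and review them
    find /lustre1/.lustre/lost+found/MDT0000 -type f -size 0 -gid 9544 -ls

    # only after reviewing the list, remove them
    find /lustre1/.lustre/lost+found/MDT0000 -type f -size 0 -gid 9544 -delete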
—
Dan Szkola
FNAL

> On Oct 18, 2023, at 7:12 AM, Andreas Dilger <adil...@whamcloud.com> wrote:
>
> The zero-length objects are created for the file stripes, but if the MDT
> inodes were deleted and something went wrong with the MDT before the OST
> objects were deleted, then the objects would be left behind.
>
> If the objects are in lost+found with the FID as the filename, then the file
> itself is almost certainly already deleted, so fid2path would just return the
> file in lost+found.
>
> I don't think there would be any problem with deleting them.
>
> Cheers, Andreas
>
>> On Oct 18, 2023, at 08:30, Daniel Szkola <dszk...@fnal.gov> wrote:
>>
>> In this case almost all, if not all, of the files look a lot like this:
>>
>> -r-------- 1 someuser somegroup 0 Dec 31 1969 '[0x200012392:0xe0ad:0x0]-R-0'
>>
>> stat shows:
>>
>> # stat [0x200012392:0xe0ad:0x0]-R-0
>>   File: [0x200012392:0xe0ad:0x0]-R-0
>>   Size: 0          Blocks: 1          IO Block: 4194304  regular empty file
>> Device: a75b4da0h/2807778720d  Inode: 144116440360870061  Links: 1
>> Access: (0400/-r--------)  Uid: (43667/someuser)  Gid: ( 9349/somegroup)
>> Access: 1969-12-31 18:00:00.000000000 -0600
>> Modify: 1969-12-31 18:00:00.000000000 -0600
>> Change: 1969-12-31 18:00:00.000000000 -0600
>>  Birth: 2023-01-11 13:01:40.000000000 -0600
>>
>> Not sure what these were or how they ended up in lost+found. I took this
>> lustre fs over from folks who have moved on and I'm still trying to wrap my
>> head around some of the finer details. In a normal linux fs, usually, though
>> not always, the blocks will have data in them. These are all zero-length. My
>> inclination is to see if I can delete them and be done with it, but I'm a
>> bit paranoid.
>>
>> —
>> Dan Szkola
>> FNAL
>>
>>> On Oct 17, 2023, at 4:23 PM, Andreas Dilger <adil...@whamcloud.com> wrote:
>>>
>>> The files reported in .lustre/lost+found *ARE* the objects on the OSTs (at
>>> least when accessed through a Lustre mountpoint, not if accessed directly
>>> on the MDT mounted as ldiskfs), so when they are deleted the space on the
>>> OSTs will be freed.
>>>
>>> As for identification, the OST objects do not have any name information,
>>> but they should have UID/GID/PROJID and timestamps that might help with
>>> identification.
>>>
>>> Cheers, Andreas
>>>
>>>> On Oct 18, 2023, at 03:42, Daniel Szkola <dszk...@fnal.gov> wrote:
>>>>
>>>> OK, so I did find the hidden .lustre directory (thanks, Darby) and there
>>>> are many, many files in the lost+found directory. I can run 'stat' on
>>>> them and get some info. Is there anything else I can do to tell what
>>>> these were? Is it safe to delete them? Is there any way to tell if there
>>>> are matching files on the OST(s) that also need to be deleted?
>>>>
>>>> —
>>>> Dan Szkola
>>>> FNAL
>>>>
>>>>> On Oct 10, 2023, at 3:44 PM, Vicker, Darby J. (JSC-EG111)[Jacobs
>>>>> Technology, Inc.] <darby.vicke...@nasa.gov> wrote:
>>>>>
>>>>>> I don't have a .lustre directory at the filesystem root.
>>>>>
>>>>> It's there, but doesn't show up even with 'ls -a'. If you cd into it or
>>>>> ls it, it's there. Lustre magic. :)
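>>>>>
>>>>> For example, something like this should show it (mountpoint assumed):
>>>>>
>>>>>     $ ls -a /lustre1 | grep -Fx .lustre   # prints nothing; readdir hides it
>>>>>     $ ls -d /lustre1/.lustre              # but the path resolves fine
>>>>>     /lustre1/.lustre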
>>>>>
>>>>> -----Original Message-----
>>>>> From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf
>>>>> of Daniel Szkola via lustre-discuss <lustre-discuss@lists.lustre.org>
>>>>> Reply-To: Daniel Szkola <dszk...@fnal.gov>
>>>>> Date: Tuesday, October 10, 2023 at 2:30 PM
>>>>> To: Andreas Dilger <adil...@whamcloud.com>
>>>>> Cc: lustre <lustre-discuss@lists.lustre.org>
>>>>> Subject: [EXTERNAL] [BULK] Re: [lustre-discuss] Ongoing issues with quota
>>>>>
>>>>> Hello Andreas,
>>>>>
>>>>> lfs df -i reports 19,204,412 inodes used. When I did the full robinhood
>>>>> scan, it reported scanning 18,673,874 entries, so fairly close.
>>>>>
>>>>> I don't have a .lustre directory at the filesystem root.
>>>>>
>>>>> Another interesting aspect of this particular issue is that I can run
>>>>> lctl lfsck, and every time I get:
>>>>>
>>>>> layout_repaired: 1468299
>>>>>
>>>>> But it doesn't seem to actually be repairing anything, because if I run
>>>>> it again, I get the same or a similar number.
>>>>>
>>>>> I run it like this:
>>>>>
>>>>>     lctl lfsck_start -t layout -t namespace -o -M lfsc-MDT0000
>>>>>
>>>>> —
>>>>> Dan Szkola
>>>>> FNAL
>>>>>
>>>>>> On Oct 10, 2023, at 10:47 AM, Andreas Dilger <adil...@whamcloud.com> wrote:
>>>>>>
>>>>>> There is a $ROOT/.lustre/lost+found that you could check.
>>>>>>
>>>>>> What does "lfs df -i" report for the used inode count? Maybe it is RBH
>>>>>> that is reporting the wrong count?
>>>>>>
>>>>>> The other alternative would be to mount the MDT filesystem directly as
>>>>>> type ZFS and see what df -i and find report.
>>>>>>
>>>>>> Cheers, Andreas
>>>>>>
>>>>>>> On Oct 10, 2023, at 22:16, Daniel Szkola via lustre-discuss
>>>>>>> <lustre-discuss@lists.lustre.org> wrote:
>>>>>>>
>>>>>>> OK, I disabled, waited for a while, then reenabled. I still get the
>>>>>>> same numbers. The only thing I can think is that somehow the count is
>>>>>>> correct, despite the huge difference. Robinhood and find show about
>>>>>>> 1.7M files, dirs, and links. The quota is showing a bit over 3.1M
>>>>>>> inodes used. We only have one MDS and MGS. Any ideas where the
>>>>>>> discrepancy may lie? Orphans? Is there a lost+found area in lustre?
>>>>>>>
>>>>>>> —
>>>>>>> Dan Szkola
>>>>>>> FNAL
>>>>>>>
>>>>>>>> On Oct 10, 2023, at 8:24 AM, Daniel Szkola <dszk...@fnal.gov> wrote:
>>>>>>>>
>>>>>>>> Hi Robert,
>>>>>>>>
>>>>>>>> Thanks for the response. Do you remember exactly how you did it? Did
>>>>>>>> you bring everything down at any point? I know you can do this:
>>>>>>>>
>>>>>>>>     lctl conf_param fsname.quota.mdt=none
>>>>>>>>
>>>>>>>> but is that all you did? Did you wait or bring everything down before
>>>>>>>> reenabling? I'm worried because that allegedly just enables/disables
>>>>>>>> enforcement, and space accounting is always on. Andreas stated that
>>>>>>>> quotas are controlled by ZFS, but there has been no quota support
>>>>>>>> enabled on any of the ZFS volumes in our lustre filesystem.
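>>>>>>>>
>>>>>>>> Just to be sure I have the whole sequence, I'm assuming the full
>>>>>>>> off/on cycle on the MGS would look something like this (our fsname
>>>>>>>> substituted; 'ugp' as the re-enable value is my assumption):
>>>>>>>>
>>>>>>>>     lctl conf_param lfsc.quota.mdt=none
>>>>>>>>     lctl conf_param lfsc.quota.ost=none
>>>>>>>>     # wait for the change to propagate to all servers, then:
>>>>>>>>     lctl conf_param lfsc.quota.mdt=ugp
>>>>>>>>     lctl conf_param lfsc.quota.ost=ugp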
>>>>>>>>
>>>>>>>> —
>>>>>>>> Dan Szkola
>>>>>>>> FNAL
>>>>>>>>
>>>>>>>>> On Oct 10, 2023, at 2:17 AM, Redl, Robert <robert.r...@lmu.de> wrote:
>>>>>>>>>
>>>>>>>>> Dear Dan,
>>>>>>>>>
>>>>>>>>> I had a similar problem some time ago. We are also using ZFS for the
>>>>>>>>> MDT and OSTs. For us, the used disk space was reported wrong. The
>>>>>>>>> problem was fixed by switching quota support off on the MGS and then
>>>>>>>>> on again.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Robert
>>>>>>>>>
>>>>>>>>>> On 09.10.2023, at 17:55, Daniel Szkola via lustre-discuss
>>>>>>>>>> <lustre-discuss@lists.lustre.org> wrote:
>>>>>>>>>>
>>>>>>>>>> Thanks, I will look into the ZFS quota since we are using ZFS for
>>>>>>>>>> all storage, MDT and OSTs.
>>>>>>>>>>
>>>>>>>>>> In our case, there is a single MDS/MDT. I have used Robinhood and
>>>>>>>>>> lfs find (by group) commands to verify what the numbers apparently
>>>>>>>>>> should be.
>>>>>>>>>>
>>>>>>>>>> —
>>>>>>>>>> Dan Szkola
>>>>>>>>>> FNAL
>>>>>>>>>>
>>>>>>>>>>> On Oct 9, 2023, at 10:13 AM, Andreas Dilger <adil...@whamcloud.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> The quota accounting is controlled by the backing filesystem of the
>>>>>>>>>>> OSTs and MDTs.
>>>>>>>>>>>
>>>>>>>>>>> For ldiskfs/ext4 you could run e2fsck to re-count all of the inode
>>>>>>>>>>> and block usage.
>>>>>>>>>>>
>>>>>>>>>>> For ZFS you would have to ask on the ZFS list to see if there is
>>>>>>>>>>> some way to re-count the quota usage.
>>>>>>>>>>>
>>>>>>>>>>> The "inode" quota is accounted from the MDTs, while the "block"
>>>>>>>>>>> quota is accounted from the OSTs. You might be able to use
>>>>>>>>>>> "lfs quota -v -g group" to see if there is one particular MDT that
>>>>>>>>>>> is returning too many inodes.
>>>>>>>>>>>
>>>>>>>>>>> Possibly, if you have directories that are striped across many
>>>>>>>>>>> MDTs, that would inflate the used inode count. For example, if
>>>>>>>>>>> every one of the 426k directories reported by RBH was striped
>>>>>>>>>>> across 4 MDTs, then you would see the inode count add up to 3.6M.
>>>>>>>>>>>
>>>>>>>>>>> If that were the case, then I would really, really advise against
>>>>>>>>>>> striping every directory in the filesystem. That will cause
>>>>>>>>>>> problems far worse than just inflating the inode quota accounting.
>>>>>>>>>>>
>>>>>>>>>>> Cheers, Andreas
>>>>>>>>>>>
>>>>>>>>>>>> On Oct 9, 2023, at 22:33, Daniel Szkola via lustre-discuss
>>>>>>>>>>>> <lustre-discuss@lists.lustre.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Is there really no way to force a recount of files used by the
>>>>>>>>>>>> quota? All indications are that we have accounts where files were
>>>>>>>>>>>> removed and this is not reflected in the used file count in the
>>>>>>>>>>>> quota. The space used seems correct, but the inodes-used numbers
>>>>>>>>>>>> are way too high. There must be a way to clear these numbers and
>>>>>>>>>>>> have a fresh count done.
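>>>>>>>>>>>>
>>>>>>>>>>>> For reference, the numbers I keep comparing come from roughly this
>>>>>>>>>>>> (group, gid, and mountpoint are our local values):
>>>>>>>>>>>>
>>>>>>>>>>>>     lfs quota -v -g somegroup /lustre1   # per-MDT/OST breakdown
>>>>>>>>>>>>     lfs find /lustre1 -G 9544 | wc -l    # entries actually in the namespace
>>>>>>>>>>>>
>>>>>>>>>>>> and, since everything here is ZFS-backed, I assume the MDT's own
>>>>>>>>>>>> per-group accounting could be read on the MDS with something like
>>>>>>>>>>>> this (the dataset name is a placeholder):
>>>>>>>>>>>>
>>>>>>>>>>>>     zfs groupspace -o name,used,objused mdtpool/mdt0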
>>>>>>>>>>>>
>>>>>>>>>>>> —
>>>>>>>>>>>> Dan Szkola
>>>>>>>>>>>> FNAL
>>>>>>>>>>>>
>>>>>>>>>>>>> On Oct 4, 2023, at 11:37 AM, Daniel Szkola via lustre-discuss
>>>>>>>>>>>>> <lustre-discuss@lists.lustre.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Also, quotas on the OSTs don't add up to near 3 million files
>>>>>>>>>>>>> either:
>>>>>>>>>>>>>
>>>>>>>>>>>>> [root@lustreclient scratch]# ssh ossnode0 lfs quota -g somegroup -I 0 /lustre1
>>>>>>>>>>>>> Disk quotas for grp somegroup (gid 9544):
>>>>>>>>>>>>>     Filesystem      kbytes  quota       limit  grace   files  quota  limit  grace
>>>>>>>>>>>>>                 1394853459      0  1913344192      -  132863      0      0      -
>>>>>>>>>>>>> [root@lustreclient scratch]# ssh ossnode0 lfs quota -g somegroup -I 1 /lustre1
>>>>>>>>>>>>> Disk quotas for grp somegroup (gid 9544):
>>>>>>>>>>>>>     Filesystem      kbytes  quota       limit  grace   files  quota  limit  grace
>>>>>>>>>>>>>                 1411579601      0  1963246413      -  120643      0      0      -
>>>>>>>>>>>>> [root@lustreclient scratch]# ssh ossnode1 lfs quota -g somegroup -I 2 /lustre1
>>>>>>>>>>>>> Disk quotas for grp somegroup (gid 9544):
>>>>>>>>>>>>>     Filesystem      kbytes  quota       limit  grace   files  quota  limit  grace
>>>>>>>>>>>>>                 1416507527      0  1789950778      -  190687      0      0      -
>>>>>>>>>>>>> [root@lustreclient scratch]# ssh ossnode1 lfs quota -g somegroup -I 3 /lustre1
>>>>>>>>>>>>> Disk quotas for grp somegroup (gid 9544):
>>>>>>>>>>>>>     Filesystem      kbytes  quota       limit  grace   files  quota  limit  grace
>>>>>>>>>>>>>                 1636465724      0  1926578117      -  195034      0      0      -
>>>>>>>>>>>>> [root@lustreclient scratch]# ssh ossnode2 lfs quota -g somegroup -I 4 /lustre1
>>>>>>>>>>>>> Disk quotas for grp somegroup (gid 9544):
>>>>>>>>>>>>>     Filesystem      kbytes  quota       limit  grace   files  quota  limit  grace
>>>>>>>>>>>>>                 2202272244      0  3020159313      -  185097      0      0      -
>>>>>>>>>>>>> [root@lustreclient scratch]# ssh ossnode2 lfs quota -g somegroup -I 5 /lustre1
>>>>>>>>>>>>> Disk quotas for grp somegroup (gid 9544):
>>>>>>>>>>>>>     Filesystem      kbytes  quota       limit  grace   files  quota  limit  grace
>>>>>>>>>>>>>                 1324770165      0  1371244768      -  145347      0      0      -
>>>>>>>>>>>>> [root@lustreclient scratch]# ssh ossnode3 lfs quota -g somegroup -I 6 /lustre1
>>>>>>>>>>>>> Disk quotas for grp somegroup (gid 9544):
>>>>>>>>>>>>>     Filesystem      kbytes  quota       limit  grace   files  quota  limit  grace
>>>>>>>>>>>>>                 2892027349      0  3221225472      -  169386      0      0      -
>>>>>>>>>>>>> [root@lustreclient scratch]# ssh ossnode3 lfs quota -g somegroup -I 7 /lustre1
>>>>>>>>>>>>> Disk quotas for grp somegroup (gid 9544):
>>>>>>>>>>>>>     Filesystem      kbytes  quota       limit  grace   files  quota  limit  grace
>>>>>>>>>>>>>                 2076201636      0  2474853207      -  171552      0      0      -
>>>>>>>>>>>>>
>>>>>>>>>>>>> —
>>>>>>>>>>>>> Dan Szkola
>>>>>>>>>>>>> FNAL
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Oct 4, 2023, at 8:45 AM, Daniel Szkola via lustre-discuss
>>>>>>>>>>>>>> <lustre-discuss@lists.lustre.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> No combination of ossnodek runs has helped with this.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Again, robinhood shows 1796104 files for the group, and an
>>>>>>>>>>>>>> 'lfs find -G gid' found 1796104 files as well.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So why is the quota command showing over 3 million inodes used?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There must be a way to force it to recount, or to clear all
>>>>>>>>>>>>>> stale quota data and have it regenerated.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Anyone?
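>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Totaling what the OSTs report can be scripted, for what it's
>>>>>>>>>>>>>> worth; a rough sketch, run from any client with the filesystem
>>>>>>>>>>>>>> mounted (mountpoint assumed, and the 'files' column position
>>>>>>>>>>>>>> taken from the lfs quota output):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     total=0
>>>>>>>>>>>>>>     for i in $(seq 0 7); do   # one entry per OST index
>>>>>>>>>>>>>>       # files is field 5 of the last output line here
>>>>>>>>>>>>>>       n=$(lfs quota -g somegroup -I "$i" /lustre1 | awk 'END { print $5 }')
>>>>>>>>>>>>>>       total=$((total + n))
>>>>>>>>>>>>>>     done
>>>>>>>>>>>>>>     echo "files across OSTs: $total"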
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> —
>>>>>>>>>>>>>> Dan Szkola
>>>>>>>>>>>>>> FNAL
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sep 27, 2023, at 9:42 AM, Daniel Szkola via lustre-discuss
>>>>>>>>>>>>>>> <lustre-discuss@lists.lustre.org> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We have a lustre filesystem that we just upgraded to 2.15.3;
>>>>>>>>>>>>>>> however, this problem has been going on for some time.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The quota command shows this:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Disk quotas for grp somegroup (gid 9544):
>>>>>>>>>>>>>>>     Filesystem    used   quota   limit  grace     files    quota    limit    grace
>>>>>>>>>>>>>>>       /lustre1  13.38T     40T     45T      -  3136761*  2621440  3670016  expired
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The group is not using nearly that many files. We have
>>>>>>>>>>>>>>> robinhood installed and it shows this:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Using config file '/etc/robinhood.d/lustre1.conf'.
>>>>>>>>>>>>>>> group,      type,     count,    volume,    spc_used,   avg_size
>>>>>>>>>>>>>>> somegroup,  symlink,  59071,    5.12 MB,   103.16 MB,  91
>>>>>>>>>>>>>>> somegroup,  dir,      426619,   5.24 GB,   5.24 GB,    12.87 KB
>>>>>>>>>>>>>>> somegroup,  file,     1310414,  16.24 TB,  13.37 TB,   13.00 MB
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Total: 1796104 entries, volume: 17866508365925 bytes (16.25 TB),
>>>>>>>>>>>>>>> space used: 14704924899840 bytes (13.37 TB)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Any ideas what is wrong here?
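>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For context, the two reports above come from commands roughly
>>>>>>>>>>>>>>> like these (the rbh-report flags are from memory and may need
>>>>>>>>>>>>>>> checking; the config is picked up from /etc/robinhood.d):
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     lfs quota -h -g somegroup /lustre1
>>>>>>>>>>>>>>>     rbh-report --group-info=somegroup --csv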
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> —
>>>>>>>>>>>>>>> Dan Szkola
>>>>>>>>>>>>>>> FNAL
>>>
>>> Cheers, Andreas
>>> --
>>> Andreas Dilger
>>> Lustre Principal Architect
>>> Whamcloud

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org