Hi Andreas,

Thanks for getting back to me - we are in the process of expanding the storage 
on this filesystem, so I think I'll be pushing for an upgrade rather than just 
remounting the clients!

Cheers
Jon
________________________________
From: Andreas Dilger <adil...@whamcloud.com>
Sent: 22 June 2023 20:00
To: Jon Marshall <jon.marsh...@cruk.cam.ac.uk>
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] No space left on device MDT DoM but not full nor 
run out of inodes

There is a bug in the grant accounting that leaks under certain operations 
(maybe O_DIRECT?).  It is resolved by unmounting and remounting the clients, 
and/or upgrading.  There was a thread about it on lustre-discuss a couple of 
years ago.
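
If upgrading right away isn't practical, a client-by-client remount along the 
lines below is what should clear it (the mount source is a placeholder - use 
whatever MGS NID and fsname your clients already mount):

  # on each affected client, once nothing there is using the filesystem
  umount /mnt/scratchc
  # remount; per the above, this resets the grant state for that client
  mount -t lustre <mgsnode>@tcp:/scratchc /mnt/scratchc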

Cheers, Andreas

On Jun 20, 2023, at 09:32, Jon Marshall via lustre-discuss 
<lustre-discuss@lists.lustre.org> wrote:

Sorry, typo in the version number - the version we are actually running is 
2.12.6
________________________________
From: Jon Marshall
Sent: 20 June 2023 16:18
To: lustre-discuss@lists.lustre.org
Subject: No space left on device MDT DoM but not full nor run out of inodes

Hi,

We've been running lustre 2.15.1 in production for over a year and recently 
decided to enable PFL with DoM on our filesystem. Things have been fine up 
until last week, when users started reporting issues copying files, 
specifically "No space left on device". The MDT is running ldiskfs as the 
backend.

I've searched through the mailing list and found a couple of people reporting 
similar problems, which prompted me to check the inode allocation. It currently 
looks like this:

UUID                      Inodes       IUsed       IFree IUse% Mounted on
scratchc-MDT0000_UUID   624492544    71144384   553348160  12% /mnt/scratchc[MDT:0]
scratchc-OST0000_UUID    57712579    24489934    33222645  43% /mnt/scratchc[OST:0]
scratchc-OST0001_UUID    57114064    24505876    32608188  43% /mnt/scratchc[OST:1]

filesystem_summary:    136975217    71144384    65830833  52% /mnt/scratchc

So, nowhere near full - the disk usage is a little higher:

UUID                       bytes        Used   Available Use% Mounted on
scratchc-MDT0000_UUID      882.1G      451.9G      355.8G  56% /mnt/scratchc[MDT:0]
scratchc-OST0000_UUID       53.6T       22.7T       31.0T  43% /mnt/scratchc[OST:0]
scratchc-OST0001_UUID       53.6T       23.0T       30.6T  43% /mnt/scratchc[OST:1]

filesystem_summary:       107.3T       45.7T       61.6T  43% /mnt/scratchc

But not full either! The errors are accompanied in the logs by:

LustreError: 15450:0:(tgt_grant.c:463:tgt_grant_space_left()) scratchc-MDT0000: cli ba0195c7-1ab4-4f7c-9e28-8689478f5c17/ffff9e331e231c00 left 82586337280 < tot_grant 82586681321 unstable 0 pending 0 dirty 1044480
LustreError: 15450:0:(tgt_grant.c:463:tgt_grant_space_left()) Skipped 33050 previous similar messages
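
In case it's useful, this is roughly how I've been looking at the grant 
accounting (the mdt.* parameter name below is a guess on my part, mirroring the 
obdfilter.* equivalent on OSTs, so it may not exist on every version):

  # on a client: how much write grant it currently holds from each target
  lctl get_param osc.*.cur_grant_bytes
  # filesystem-wide space and inode view (as in the tables above)
  lfs df -h /mnt/scratchc
  lfs df -i /mnt/scratchc
  # on the MDS: total grant the MDT believes it has handed out
  # (assumed parameter name, analogous to obdfilter.*.tot_granted on an OSS)
  lctl get_param mdt.*.tot_granted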

For reference the DoM striping we're using is:

  lcm_layout_gen:    0
  lcm_mirror_count:  1
  lcm_entry_count:   3
    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 0
    lcme_extent.e_end:   1048576
      stripe_count:  0       stripe_size:   1048576       pattern:       mdt       stripe_offset: -1

    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 1048576
    lcme_extent.e_end:   1073741824
      stripe_count:  1       stripe_size:   1048576       pattern:       raid0       stripe_offset: -1

    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 1073741824
    lcme_extent.e_end:   EOF
      stripe_count:  -1       stripe_size:   1048576       pattern:       raid0       stripe_offset: -1

So the first 1MB of each file is stored on the MDT.
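
(In case it helps to reproduce: a layout like the above corresponds to a 
PFL/DoM setting along the lines of the command below - a sketch rather than 
the exact command we ran, with the directory path as a placeholder:

  lfs setstripe -E 1M -L mdt -E 1G -c 1 -E -1 -c -1 /mnt/scratchc/some/dir

i.e. a 1MiB DoM component, then a single-stripe component up to 1GiB, then 
striping across all OSTs out to EOF.)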

The obvious question is: what is causing these errors? I'm not massively 
familiar with Lustre internals, so any pointers on where to look would be 
greatly appreciated!

Cheers
Jon

Jon Marshall
High Performance Computing Specialist
IT and Scientific Computing Team

Cancer Research UK Cambridge Institute
Li Ka Shing Centre | Robinson Way | Cambridge | CB2 0RE
Web<http://www.cruk.cam.ac.uk/> | 
Facebook<http://www.facebook.com/cancerresearchuk> | 
Twitter<http://twitter.com/CR_UK>

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







