Sorry, typo in the version number - the version we are actually running is 2.12.6

________________________________
From: Jon Marshall
Sent: 20 June 2023 16:18
To: lustre-discuss@lists.lustre.org <lustre-discuss@lists.lustre.org>
Subject: No space left on device MDT DoM but not full nor run out of inodes

Hi,

We've been running lustre 2.15.1 in production for over a year and recently decided to enable PFL with DoM on our filesystem. Things have been fine up until last week, when users started reporting issues copying files, specifically "No space left on device". The MDT is running ldiskfs as the backend.

I've searched through the mailing list and found a couple of people reporting similar problems, which prompted me to check the inode allocation, which is currently:

UUID                      Inodes      IUsed      IFree  IUse%  Mounted on
scratchc-MDT0000_UUID  624492544   71144384  553348160    12%  /mnt/scratchc[MDT:0]
scratchc-OST0000_UUID   57712579   24489934   33222645    43%  /mnt/scratchc[OST:0]
scratchc-OST0001_UUID   57114064   24505876   32608188    43%  /mnt/scratchc[OST:1]

filesystem_summary:    136975217   71144384   65830833    52%  /mnt/scratchc

So, nowhere near full - the disk usage is a little higher:

UUID                     bytes    Used  Available  Use%  Mounted on
scratchc-MDT0000_UUID   882.1G  451.9G     355.8G   56%  /mnt/scratchc[MDT:0]
scratchc-OST0000_UUID    53.6T   22.7T      31.0T   43%  /mnt/scratchc[OST:0]
scratchc-OST0001_UUID    53.6T   23.0T      30.6T   43%  /mnt/scratchc[OST:1]

filesystem_summary:     107.3T   45.7T      61.6T   43%  /mnt/scratchc

But not full either!

The errors are accompanied in the logs by:

LustreError: 15450:0:(tgt_grant.c:463:tgt_grant_space_left()) scratchc-MDT0000: cli ba0195c7-1ab4-4f7c-9e28-8689478f5c17/ffff9e331e231c00 left 82586337280 < tot_grant 82586681321 unstable 0 pending 0 dirty 1044480
LustreError: 15450:0:(tgt_grant.c:463:tgt_grant_space_left()) Skipped 33050 previous similar messages

For reference the DoM striping we're using is:

  lcm_layout_gen:    0
  lcm_mirror_count:  1
  lcm_entry_count:   3
    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 0
    lcme_extent.e_end:   1048576
      stripe_count:  0       stripe_size:   1048576       pattern:       mdt       stripe_offset: -1

    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 1048576
    lcme_extent.e_end:   1073741824
      stripe_count:  1       stripe_size:   1048576       pattern:       raid0       stripe_offset: -1

    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 1073741824
    lcme_extent.e_end:   EOF
      stripe_count:  -1       stripe_size:   1048576       pattern:       raid0       stripe_offset: -1

So the first 1MB on the MDT.

My question is obviously what is causing these errors? I'm not massively familiar with Lustre internals, so any pointers on where to look would be greatly appreciated!

Cheers
Jon

Jon Marshall
High Performance Computing Specialist
IT and Scientific Computing Team
Cancer Research UK Cambridge Institute
Li Ka Shing Centre | Robinson Way | Cambridge | CB2 0RE
Web<http://www.cruk.cam.ac.uk/> | Facebook<http://www.facebook.com/cancerresearchuk> | Twitter<http://twitter.com/CR_UK>
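
As a rough aid to reading the tgt_grant_space_left message quoted above, here is a minimal Python sketch that pulls the numeric fields out of that log line and computes how far "left" falls below "tot_grant". The helper name parse_grant_line and the regular expression are hypothetical and based only on the single line in this message; other Lustre versions may format the message differently, and the sketch does not explain why the values are what they are.

```python
import re

# Log line copied verbatim from the message above.
LOG_LINE = (
    "LustreError: 15450:0:(tgt_grant.c:463:tgt_grant_space_left()) "
    "scratchc-MDT0000: cli ba0195c7-1ab4-4f7c-9e28-8689478f5c17/ffff9e331e231c00 "
    "left 82586337280 < tot_grant 82586681321 unstable 0 pending 0 dirty 1044480"
)

# Assumption: the pattern is derived from this one message only.
GRANT_RE = re.compile(
    r"(?P<target>\S+): cli \S+ left (?P<left>\d+) < tot_grant (?P<tot_grant>\d+) "
    r"unstable (?P<unstable>\d+) pending (?P<pending>\d+) dirty (?P<dirty>\d+)"
)


def parse_grant_line(line):
    """Return the fields of a tgt_grant_space_left message, or None if no match."""
    match = GRANT_RE.search(line)
    if match is None:
        return None
    # Every captured group except the target name is a byte count.
    fields = {k: int(v) for k, v in match.groupdict().items() if k != "target"}
    fields["target"] = match.group("target")
    # Bytes by which the reported space left falls short of the total grant.
    fields["shortfall"] = fields["tot_grant"] - fields["left"]
    return fields


info = parse_grant_line(LOG_LINE)
print(info["target"], "shortfall in bytes:", info["shortfall"])
print("left in GiB: %.2f" % (info["left"] / 2**30))
```

For the line quoted above the shortfall is 344041 bytes, so the space the MDT reports as left is only slightly below the total already granted to clients. Running the same parse over all of the similar messages in the log would show whether that gap is stable or growing.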

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org