[lustre-discuss] Zabbix Lustre template
Hi, I'm looking for a Zabbix Lustre template, but couldn't find one. Is anyone aware of such a template and can share a link? Thanks, David ___ lustre-discuss mailing list lustre-discuss@lists.lustre.org http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
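No official template seems to exist; rolling your own usually means exposing `lctl get_param` values to the Zabbix agent via UserParameters. A minimal, hypothetical sketch (the script path and item key are made up; `health_check` is a standard Lustre parameter, but verify the exact parameter names on your version):

```shell
#!/bin/sh
# /usr/local/bin/lustre_health.sh (hypothetical path); wire it up with e.g.:
#   UserParameter=lustre.health,/usr/local/bin/lustre_health.sh
# Prints 1 if the local Lustre node reports "healthy", 0 otherwise
# (including on hosts where lctl is not installed).
if lctl get_param -n health_check 2>/dev/null | grep -q healthy; then
  echo 1
else
  echo 0
fi
```

The same pattern extends to per-target stats (e.g. `lctl get_param obdfilter.*.stats` on an OSS), parsed into numeric items on the Zabbix side.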
[lustre-discuss] Project quota and project quota accounting
Hi, We are running Lustre 2.12.7 (ldiskfs) on both the servers and the clients.

lctl get_param osd-*.*.quota_slave.info returns, for all OSTs and the MDS/MDT:

quota enabled: ugp
space acct: ug

I tried enabling project quota on a client, with no success:

chattr -p 1 /storage/test
chattr: Operation not supported while setting project on /storage/test

Or with:

lfs project -p 1 /storage/test
lfs: failed to set xattr for '/storage/test': Operation not supported

Regards, David
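The `space acct: ug` line above is the tell: project ('p') space accounting is not enabled on the targets, which is consistent with the "Operation not supported" errors. A small illustrative check (the sample output is hard-coded in a here-doc; on a live system pipe the real `lctl get_param osd-*.*.quota_slave.info` output in instead). Enabling project accounting on ldiskfs typically requires `tune2fs -O project,quota` on each unmounted target with a project-quota-capable kernel and e2fsprogs; treat that command as something to verify against the Lustre manual for your version:

```shell
#!/bin/sh
# Report whether project ('p') accounting is active, from quota_slave.info
# output. The here-doc below is sample data; on a server run:
#   lctl get_param osd-*.*.quota_slave.info | check_project_acct
check_project_acct() {
  awk -F': *' '/space acct/ {
    if ($2 ~ /p/) print "project accounting enabled"
    else          print "project accounting NOT enabled"
  }'
}

check_project_acct <<'EOF'
quota enabled: ugp
space acct: ug
EOF
```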
Re: [lustre-discuss] Unable to mount new OST
s03 kernel: sd 15:0:0:92: [sddy] 34863054848 4096-byte logical blocks: (142 TB/129 TiB) Jul 6 07:59:41 oss03 kernel: sd 15:0:0:92: [sddy] Write Protect is off Jul 6 07:59:41 oss03 kernel: sd 15:0:0:92: [sddy] Write cache: enabled, read cache: enabled, supports DPO and FUA Jul 6 07:59:41 oss03 kernel: sd 15:0:0:92: [sddy] Attached SCSI disk Jul 6 07:59:42 oss03 multipathd: sddy: add path (uevent) Jul 6 07:59:42 oss03 multipathd: sddy [128:0]: path added to devmap OST0051 On Wed, Jul 7, 2021 at 7:24 AM Jeff Johnson wrote: > What devices are underneath dm-21 and are there any errors in > /var/log/messages for those devices? (assuming /dev/sdX devices underneath) > > Run `ls /sys/block/dm-21/slaves` to see what devices are beneath dm-21 > > > > > > On Tue, Jul 6, 2021 at 20:09 David Cohen > wrote: > >> Hi, >> The index of the OST is unique in the system and free for the new one, as >> it is increased by "1" for every new OST created, so whatever it converts >> to should not be relevant to it's refusal to mount, or am I mistaken? >> >> I'm pasting the log messages again, in case they were lost up the thread, >> adding the output of "fdisk -l", should the OST size be the issue: >> >> lctl dk show tens of thousands of lines repeating the same error after >> attempting to mount the OST: >> >> 0010:1000:26.0:1625546374.322973:0:248211:0:(osd_scrub.c:2039:osd_ios_scan_one()) >> local-OST0033: fail to set LMA for init OI scrub: rc = -30 >> 0010:1000:26.0:1625546374.322974:0:248211:0:(osd_scrub.c:2039:osd_ios_scan_one()) >> local-OST0033: fail to set LMA for init OI scrub: rc = -30 >> 0010:1000:26.0:1625546374.322975:0:248211:0:(osd_scrub.c:2039:osd_ios_scan_one()) >> local-OST0033: fail to set LMA for init OI scrub: rc = -30 >> >> in /var/log/messages I see the following corresponding to dm21 which is >> the new OST: >> >> Jul 6 07:38:37 oss03 kernel: LDISKFS-fs warning (device dm-21): >> ldiskfs_multi_mount_protect:322: MMP interval 42 higher than expected, >> please wait. 
>> Jul 6 07:39:19 oss03 kernel: LDISKFS-fs (dm-21): file extents enabled, >> maximum tree depth=5 >> Jul 6 07:39:19 oss03 kernel: LDISKFS-fs warning (device dm-21): >> ldiskfs_clear_journal_err:4862: Filesystem error recorded from previous >> mount: IO failure >> Jul 6 07:39:19 oss03 kernel: LDISKFS-fs warning (device dm-21): >> ldiskfs_clear_journal_err:4863: Marking fs in need of filesystem check. >> Jul 6 07:39:19 oss03 kernel: LDISKFS-fs (dm-21): warning: mounting fs >> with errors, running e2fsck is recommended >> Jul 6 07:39:22 oss03 kernel: LDISKFS-fs (dm-21): recovery complete >> Jul 6 07:39:22 oss03 kernel: LDISKFS-fs (dm-21): mounted filesystem with >> ordered data mode. Opts: >> user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc >> Jul 6 07:39:22 oss03 kernel: LDISKFS-fs error (device dm-21): >> htree_dirblock_to_tree:1278: inode #2: block 21233: comm mount.lustre: bad >> entry in directory: rec_len is too small for name_len - offset=4084(4084), >> inode=0, rec_len=12 >> , name_len=0 >> Jul 6 07:39:22 oss03 kernel: Aborting journal on device dm-21-8. >> Jul 6 07:39:22 oss03 kernel: LDISKFS-fs (dm-21): Remounting filesystem >> read-only >> Jul 6 07:39:24 oss03 kernel: LDISKFS-fs warning (device dm-21): >> kmmpd:187: kmmpd being stopped since filesystem has been remounted as >> readonly. 
>> Jul 6 07:44:22 oss03 kernel: LDISKFS-fs (dm-21): error count since last >> fsck: 6 >> Jul 6 07:44:22 oss03 kernel: LDISKFS-fs (dm-21): initial error at time >> 1625367384: htree_dirblock_to_tree:1278: inode 2: block 21233 >> Jul 6 07:44:22 oss03 kernel: LDISKFS-fs (dm-21): last error at time >> 1625546362: htree_dirblock_to_tree:1278: inode 2: block 21233 >> >> fdisk -l /dev/mapper/OST0051 >> >> Disk /dev/mapper/OST0051: 142799.1 GB, 142799072657408 bytes, 34863054848 >> sectors >> Units = sectors of 1 * 4096 = 4096 bytes >> Sector size (logical/physical): 4096 bytes / 4096 bytes >> I/O size (minimum/optimal): 2097152 bytes / 2097152 bytes >> >> >> Thanks, >> David >> >> On Tue, Jul 6, 2021 at 10:35 PM Spitz, Cory James >> wrote: >> >>> What OST index (number) were you trying to add? >>> >>> >>> >>> Andreas is right: >>> >>> Note that your "--index=0051" value is probably interpreted as an octal >>> number "41", it should be "--index=0x0051" or "--index=0x51" (hex, to match >>> the OST device name) or "-
Re: [lustre-discuss] Unable to mount new OST
Hi, The index of the OST is unique in the system and free for the new one, as it is increased by "1" for every new OST created, so whatever it converts to should not be relevant to its refusal to mount, or am I mistaken? I'm pasting the log messages again, in case they were lost up the thread, adding the output of "fdisk -l", should the OST size be the issue:

lctl dk shows tens of thousands of lines repeating the same error after attempting to mount the OST:

0010:1000:26.0:1625546374.322973:0:248211:0:(osd_scrub.c:2039:osd_ios_scan_one()) local-OST0033: fail to set LMA for init OI scrub: rc = -30
0010:1000:26.0:1625546374.322974:0:248211:0:(osd_scrub.c:2039:osd_ios_scan_one()) local-OST0033: fail to set LMA for init OI scrub: rc = -30
0010:1000:26.0:1625546374.322975:0:248211:0:(osd_scrub.c:2039:osd_ios_scan_one()) local-OST0033: fail to set LMA for init OI scrub: rc = -30

In /var/log/messages I see the following, corresponding to dm-21, which is the new OST:

Jul 6 07:38:37 oss03 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_multi_mount_protect:322: MMP interval 42 higher than expected, please wait.
Jul 6 07:39:19 oss03 kernel: LDISKFS-fs (dm-21): file extents enabled, maximum tree depth=5
Jul 6 07:39:19 oss03 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_clear_journal_err:4862: Filesystem error recorded from previous mount: IO failure
Jul 6 07:39:19 oss03 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_clear_journal_err:4863: Marking fs in need of filesystem check.
Jul 6 07:39:19 oss03 kernel: LDISKFS-fs (dm-21): warning: mounting fs with errors, running e2fsck is recommended
Jul 6 07:39:22 oss03 kernel: LDISKFS-fs (dm-21): recovery complete
Jul 6 07:39:22 oss03 kernel: LDISKFS-fs (dm-21): mounted filesystem with ordered data mode. 
Opts: user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc
Jul 6 07:39:22 oss03 kernel: LDISKFS-fs error (device dm-21): htree_dirblock_to_tree:1278: inode #2: block 21233: comm mount.lustre: bad entry in directory: rec_len is too small for name_len - offset=4084(4084), inode=0, rec_len=12, name_len=0
Jul 6 07:39:22 oss03 kernel: Aborting journal on device dm-21-8.
Jul 6 07:39:22 oss03 kernel: LDISKFS-fs (dm-21): Remounting filesystem read-only
Jul 6 07:39:24 oss03 kernel: LDISKFS-fs warning (device dm-21): kmmpd:187: kmmpd being stopped since filesystem has been remounted as readonly.
Jul 6 07:44:22 oss03 kernel: LDISKFS-fs (dm-21): error count since last fsck: 6
Jul 6 07:44:22 oss03 kernel: LDISKFS-fs (dm-21): initial error at time 1625367384: htree_dirblock_to_tree:1278: inode 2: block 21233
Jul 6 07:44:22 oss03 kernel: LDISKFS-fs (dm-21): last error at time 1625546362: htree_dirblock_to_tree:1278: inode 2: block 21233

fdisk -l /dev/mapper/OST0051

Disk /dev/mapper/OST0051: 142799.1 GB, 142799072657408 bytes, 34863054848 sectors
Units = sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 2097152 bytes / 2097152 bytes

Thanks, David

On Tue, Jul 6, 2021 at 10:35 PM Spitz, Cory James wrote:
> What OST index (number) were you trying to add?
>
> Andreas is right:
> Note that your "--index=0051" value is probably interpreted as an octal number "41", it should be "--index=0x0051" or "--index=0x51" (hex, to match the OST device name) or "--index=81" (decimal).
>
> And you said:
> I'm aware that index 51 actually translates to hex 33 (local-OST0033_UUID).
>
> Ok, 0051 (in octal by way of the leading zeros*) translates to decimal 41 as Andreas pointed out, but that's 0x29 in hexadecimal, not 0x33. Assuming you wanted to use decimal 51 then you'd have tried to mkfs.lustre the wrong index. So, if you wanted to use decimal 51, you'd have to use --index=0x33 or --index=0063. 
> > > > -Cory > > > > p.s. > > (*) BTW, the convention with leading zeros for octal can be googled or > read about at https://en.wikipedia.org/wiki/Octal. > > > > > > On 7/6/21, 12:35 AM, "lustre-discuss on behalf of David Cohen" < > lustre-discuss-boun...@lists.lustre.org on behalf of > cda...@physics.technion.ac.il> wrote: > > > > Thanks Andreas, > > I'm aware that index 51 actually translates to hex 33 (local-OST0033_UUID). > I don't believe that's the reason for the failed mount as it is only an > index that I increase for every new OST and there are no duplicates. > > > > lctl dk show tens of thousands of lines repeating the same error after > attempting to mount the OST: > > > > 0010:1000:26.0:1625546374.322973:0:248211:0:(osd_scrub.c:2039:osd_ios_scan_one()) > local-OST0033: fail to set LMA for init OI scrub: rc = -30 > > 0010:1000:26.0:1625546374.
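Cory's arithmetic can be checked directly with printf, which follows the same C-style prefix conventions (leading 0 means octal, leading 0x means hex) that Andreas suspected the mkfs.lustre parser applied. An illustrative sketch:

```shell
# "--index=0051": a C-style parser treats the leading zero as octal.
printf '0051 (octal) = %d (decimal) = 0x%x (hex)\n' 0051 0051
# prints: 0051 (octal) = 41 (decimal) = 0x29 (hex)

# What decimal 51 really is, matching "OST0033" in the hex device name:
printf '51 (decimal) = 0x%x (hex) = 0%o (octal)\n' 51 51
# prints: 51 (decimal) = 0x33 (hex) = 063 (octal)

# And the unambiguous hex spelling from Andreas's suggestion:
printf '0x51 = %d (decimal)\n' 0x51
# prints: 0x51 = 81 (decimal)
```

Note the observed target name local-OST0033 (0x33 = decimal 51) suggests the value was in fact taken as decimal 51 here; either way, spelling the index in explicit hex (`--index=0x51` or `--index=0x33`, depending on intent) removes the ambiguity.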
Re: [lustre-discuss] Unable to mount new OST
Thanks Artem, I already tried that (e2fsck), to no avail. I even tried tunefs.lustre --writeconf --erase-params on the MDS and all the other targets, but the behaviour remains the same. Best regards, David

On Tue, Jul 6, 2021 at 10:09 AM Благодаренко Артём <artem.blagodare...@gmail.com> wrote:
> Hello David,
>
> On 6 Jul 2021, at 08:34, David Cohen wrote:
> Jul 6 07:39:19 oss03 kernel: LDISKFS-fs (dm-21): warning: mounting fs with errors, running e2fsck is recommended
>
> It looks like the LDISKFS partition is in an inconsistent state now. It is better to follow the recommendation and run e2fsck.
>
> Best regards,
> Artem Blagodarenko.
Re: [lustre-discuss] Unable to mount new OST
Thanks Andreas, I'm aware that index 51 actually translates to hex 33 (local-OST0033_UUID). I don't believe that's the reason for the failed mount as it is only an index that I increase for every new OST and there are no duplicates. lctl dk show tens of thousands of lines repeating the same error after attempting to mount the OST: 0010:1000:26.0:1625546374.322973:0:248211:0:(osd_scrub.c:2039:osd_ios_scan_one()) local-OST0033: fail to set LMA for init OI scrub: rc = -30 0010:1000:26.0:1625546374.322974:0:248211:0:(osd_scrub.c:2039:osd_ios_scan_one()) local-OST0033: fail to set LMA for init OI scrub: rc = -30 0010:1000:26.0:1625546374.322975:0:248211:0:(osd_scrub.c:2039:osd_ios_scan_one()) local-OST0033: fail to set LMA for init OI scrub: rc = -30 in /var/log/messages I see the following corresponding to dm21 which is the new OST: Jul 6 07:38:37 oss03 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_multi_mount_protect:322: MMP interval 42 higher than expected, please wait. Jul 6 07:39:19 oss03 kernel: LDISKFS-fs (dm-21): file extents enabled, maximum tree depth=5 Jul 6 07:39:19 oss03 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_clear_journal_err:4862: Filesystem error recorded from previous mount: IO failure Jul 6 07:39:19 oss03 kernel: LDISKFS-fs warning (device dm-21): ldiskfs_clear_journal_err:4863: Marking fs in need of filesystem check. Jul 6 07:39:19 oss03 kernel: LDISKFS-fs (dm-21): warning: mounting fs with errors, running e2fsck is recommended Jul 6 07:39:22 oss03 kernel: LDISKFS-fs (dm-21): recovery complete Jul 6 07:39:22 oss03 kernel: LDISKFS-fs (dm-21): mounted filesystem with ordered data mode. 
Opts: user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc Jul 6 07:39:22 oss03 kernel: LDISKFS-fs error (device dm-21): htree_dirblock_to_tree:1278: inode #2: block 21233: comm mount.lustre: bad entry in directory: rec_len is too small for name_len - offset=4084(4084), inode=0, rec_len=12 , name_len=0 Jul 6 07:39:22 oss03 kernel: Aborting journal on device dm-21-8. Jul 6 07:39:22 oss03 kernel: LDISKFS-fs (dm-21): Remounting filesystem read-only Jul 6 07:39:24 oss03 kernel: LDISKFS-fs warning (device dm-21): kmmpd:187: kmmpd being stopped since filesystem has been remounted as readonly. Jul 6 07:44:22 oss03 kernel: LDISKFS-fs (dm-21): error count since last fsck: 6 Jul 6 07:44:22 oss03 kernel: LDISKFS-fs (dm-21): initial error at time 1625367384: htree_dirblock_to_tree:1278: inode 2: block 21233 Jul 6 07:44:22 oss03 kernel: LDISKFS-fs (dm-21): last error at time 1625546362: htree_dirblock_to_tree:1278: inode 2: block 21233 As I mentioned before mount never completes so the only way out of that is force reboot. Thanks, David On Tue, Jul 6, 2021 at 8:07 AM Andreas Dilger wrote: > > > On Jul 5, 2021, at 09:05, David Cohen > wrote: > > Hi, > I'm using Lustre 2.10.5 and lately tried to add a new OST. > The OST was formatted with the command below, which other than the index > is the exact same one used for all the other OSTs in the system. > > mkfs.lustre --reformat --mkfsoptions="-t ext4 -T huge" --ost > --fsname=local --index=0051 --param ost.quota_type=ug > --mountfsoptions='errors=remount-ro,extents,mballoc' --mgsnode=10.0.0.3@tcp > --mgsnode=10.0.0.1@tc > p --mgsnode=10.0.0.2@tcp --servicenode=10.0.0.3@tcp > --servicenode=10.0.0.1@tcp --servicenode=10.0.0.2@tcp /dev/mapper/OST0051 > > > Note that your "--index=0051" value is probably interpreted as an octal > number "41", it should be "--index=0x0051" or "--index=0x51" (hex, to match > the OST device name) or "--index=81" (decimal). 
> > > When trying to mount it with:
> > mount.lustre /dev/mapper/OST0051 /Lustre/OST0051
> >
> > The system stays at 100% CPU (one core) forever and the mount never completes, not even after a week.
> >
> > I tried tunefs.lustre --writeconf --erase-params on the MDS and all the other targets, but the behaviour remains the same.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Whamcloud
[lustre-discuss] Unable to mount new OST
Hi, I'm using Lustre 2.10.5 and lately tried to add a new OST. The OST was formatted with the command below, which other than the index is the exact same one used for all the other OSTs in the system.

mkfs.lustre --reformat --mkfsoptions="-t ext4 -T huge" --ost --fsname=local --index=0051 --param ost.quota_type=ug --mountfsoptions='errors=remount-ro,extents,mballoc' --mgsnode=10.0.0.3@tcp --mgsnode=10.0.0.1@tcp --mgsnode=10.0.0.2@tcp --servicenode=10.0.0.3@tcp --servicenode=10.0.0.1@tcp --servicenode=10.0.0.2@tcp /dev/mapper/OST0051

When trying to mount it with:

mount.lustre /dev/mapper/OST0051 /Lustre/OST0051

the system stays at 100% CPU (one core) forever and the mount never completes, not even after a week. I tried tunefs.lustre --writeconf --erase-params on the MDS and all the other targets, but the behaviour remains the same. David
Re: [lustre-discuss] [EXTERNAL] Re: Disk quota exceeded while quota is not filled
Thank you Chad for answering, We are using the patched kernel on the MDT/OSS The problem is in the group space quota. In any case I enabled project quota just for future purposes. There are no defined projects, do you think it can still pose a problem? Best, David On Wed, Aug 26, 2020 at 3:18 PM Chad DeWitt wrote: > Hi David, > > Hope you're doing well. > > This is a total shot in the dark, but depending on the kernel version you > are running, you may need a patched kernel to use project quotas. I'm not > sure what the symptoms would be, but it may be worth turning off project > quotas and seeing if doing so resolves your issue: > > lctl conf_param technion.quota.mdt=none > lctl conf_param technion.quota.mdt=ug > lctl conf_param technion.quota.ost=none > lctl conf_param technion.quota.ost=ug > > (Looks like you have been running project quota on your MDT for a while > without issue, so this may be a deadend.) > > Here's more info concerning when a patched kernel is necessary for > project quotas (25.2. Enabling Disk Quotas): > > http://doc.lustre.org/lustre_manual.xhtml > > > Cheers, > Chad > > > > Chad DeWitt, CISSP | University Research Computing > > UNC Charlotte *| *Office of OneIT > > ccdew...@uncc.edu > > > > > > On Tue, Aug 25, 2020 at 3:04 AM David Cohen > wrote: > >> [*Caution*: Email from External Sender. Do not click or open links or >> attachments unless you know this sender.] >> >> Hi, >> Still hoping for a reply... >> >> It seems to me that old groups are more affected by the issue than new >> ones that were created after a major disk migration. >> It seems that the quota enforcement is somehow based on a counter other >> than the accounting as the accounting produces the same numbers as du. >> So if quota is calculated separately from accounting, it is possible that >> quota is broken and keeps values from removed disks, while accounting is >> correct. >> So following that suspicion I tried to force the FS to recalculate quota. 
>> I tried: >> lctl conf_param technion.quota.ost=none >> and back to: >> lctl conf_param technion.quota.ost=ugp >> >> I tried running on mds and all ost: >> tune2fs -O ^quota >> and on again: >> tune2fs -O quota >> and after each attempt, also: >> lctl lfsck_start -A -t all -o -e continue >> >> But still the problem persists and groups under the quota usage get >> blocked with "quota exceeded" >> >> Best, >> David >> >> >> On Sun, Aug 16, 2020 at 8:41 AM David Cohen < >> cda...@physics.technion.ac.il> wrote: >> >>> Hi, >>> Adding some more information. >>> A Few months ago the data on the Lustre fs was migrated to new physical >>> storage. >>> After successful migration the old ost were marked as active=0 >>> (lctl conf_param technion-OST0001.osc.active=0) >>> >>> Since then all the clients were unmounted and mounted. >>> tunefs.lustre --writeconf was executed on the mgs/mdt and all the ost. >>> lctl dl don't show the old ost anymore, but when querying the quota they >>> still appear. >>> As I see that new users are less affected by the "quota exceeded" >>> problem (blocked from writing while quota is not filled), >>> I suspect that quota calculation is still summing values from the old >>> ost: >>> >>> *lfs quota -g -v md_kaplan /storage/* >>> Disk quotas for grp md_kaplan (gid 10028): >>> Filesystem kbytes quota limit grace files quota limit >>> grace >>> /storage/ 4823987000 0 5368709120 - 143596 0 >>> 0 - >>> technion-MDT_UUID >>> 37028 - 0 - 143596 - 0 >>> - >>> quotactl ost0 failed. >>> quotactl ost1 failed. >>> quotactl ost2 failed. >>> quotactl ost3 failed. >>> quotactl ost4 failed. >>> quotactl ost5 failed. >>> quotactl ost6 failed. >>> quotactl ost7 failed. >>> quotactl ost8 failed. >>> quotactl ost9 failed. >>> quotactl ost10 failed. >>> quotactl ost11 failed. >>> quotactl ost12 failed. >>> quotactl ost13 failed. >>> quotactl ost14 failed. >>> quotactl ost15 failed. >>> quotactl ost16 failed. >>> quotactl ost17 failed. >>> q
Re: [lustre-discuss] Disk quota exceeded while quota is not filled
Hi, Still hoping for a reply... It seems to me that old groups are more affected by the issue than new ones that were created after a major disk migration. It seems that the quota enforcement is somehow based on a counter other than the accounting as the accounting produces the same numbers as du. So if quota is calculated separately from accounting, it is possible that quota is broken and keeps values from removed disks, while accounting is correct. So following that suspicion I tried to force the FS to recalculate quota. I tried: lctl conf_param technion.quota.ost=none and back to: lctl conf_param technion.quota.ost=ugp I tried running on mds and all ost: tune2fs -O ^quota and on again: tune2fs -O quota and after each attempt, also: lctl lfsck_start -A -t all -o -e continue But still the problem persists and groups under the quota usage get blocked with "quota exceeded" Best, David On Sun, Aug 16, 2020 at 8:41 AM David Cohen wrote: > Hi, > Adding some more information. > A Few months ago the data on the Lustre fs was migrated to new physical > storage. > After successful migration the old ost were marked as active=0 > (lctl conf_param technion-OST0001.osc.active=0) > > Since then all the clients were unmounted and mounted. > tunefs.lustre --writeconf was executed on the mgs/mdt and all the ost. > lctl dl don't show the old ost anymore, but when querying the quota they > still appear. > As I see that new users are less affected by the "quota exceeded" problem > (blocked from writing while quota is not filled), > I suspect that quota calculation is still summing values from the old ost: > > *lfs quota -g -v md_kaplan /storage/* > Disk quotas for grp md_kaplan (gid 10028): > Filesystem kbytes quota limit grace files quota limit > grace > /storage/ 4823987000 0 5368709120 - 143596 0 > 0 - > technion-MDT_UUID > 37028 - 0 - 143596 - 0 > - > quotactl ost0 failed. > quotactl ost1 failed. > quotactl ost2 failed. > quotactl ost3 failed. > quotactl ost4 failed. 
> quotactl ost5 failed. > quotactl ost6 failed. > quotactl ost7 failed. > quotactl ost8 failed. > quotactl ost9 failed. > quotactl ost10 failed. > quotactl ost11 failed. > quotactl ost12 failed. > quotactl ost13 failed. > quotactl ost14 failed. > quotactl ost15 failed. > quotactl ost16 failed. > quotactl ost17 failed. > quotactl ost18 failed. > quotactl ost19 failed. > quotactl ost20 failed. > technion-OST0015_UUID > 114429464* - 114429464 - - - > - - > technion-OST0016_UUID > 92938588 - 92938592 - - - - > - > technion-OST0017_UUID > 128496468* - 128496468 - - - > - - > technion-OST0018_UUID > 191478704* - 191478704 - - - > - - > technion-OST0019_UUID > 107720552 - 107720560 - - - > - - > technion-OST001a_UUID > 165631952* - 165631952 - - - > - - > technion-OST001b_UUID > 460714156* - 460714156 - - - > - - > technion-OST001c_UUID > 157182900* - 157182900 - - - > - - > technion-OST001d_UUID > 102945952* - 102945952 - - - > - - > technion-OST001e_UUID > 175840980* - 175840980 - - - > - - > technion-OST001f_UUID > 142666872* - 142666872 - - - > - - > technion-OST0020_UUID > 188147548* - 188147548 - - - > - - > technion-OST0021_UUID > 125914240* - 125914240 - - - > - - > technion-OST0022_UUID > 186390800* - 186390800 - - - > - - > technion-OST0023_UUID > 115386876 - 115386884 - - - > - - > technion-OST0024_UUID > 127139556* - 127139556 - - - > - - > technion-OST0025_UUID > 179666580* - 179666580 - - - > - - > technion-OST0026_UUID > 147837348 - 147837356 - - - > - - > technion-OST0027_UUID > 129823528 - 129823536 - - - > - - > technion-OST0028_UUID > 158270776 - 158270784 - - - > - - > technion-OST0029_UUID &g
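To test the suspicion that deactivated OSTs still contribute to enforcement, it helps to total what the live OSTs actually report and compare that with the filesystem-wide kbytes line. An illustrative parser for `lfs quota -g -v` output (the here-doc below is an abridged, hypothetical slice of output like the above; real output formatting may vary slightly between versions):

```shell
#!/bin/sh
# Count stale "quotactl ostN failed." targets and sum the per-OST kbytes
# that do answer. Pipe real `lfs quota -g -v <group> <mnt>` output instead
# of the sample here-doc.
parse_quota() {
  awk '
    /^quotactl .* failed/ { failed++ }
    /_UUID$/ { getline            # usage kbytes are on the following line
               gsub(/\*/, "", $1) # strip the over-limit "*" marker
               sum += $1 }
    END { printf "failed targets: %d, summed OST kbytes: %d\n", failed, sum }
  '
}

parse_quota <<'EOF'
quotactl ost0 failed.
quotactl ost1 failed.
technion-OST0015_UUID
 114429464* - 114429464 - - - - -
technion-OST0016_UUID
 92938588 - 92938592 - - - - -
EOF
```

If the summed live-OST usage is well below the global total, the difference would be consistent with stale limits still counted from the removed targets.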
Re: [lustre-discuss] Disk quota exceeded while quota is not filled
les quota limit grace /storage/ 4.493T 0k 5T - 143596 0 0 - On Tue, Aug 11, 2020 at 7:35 AM David Cohen wrote: > Hi, > I'm running Lustre 2.10.5 on the oss and mds, and 2.10.7 on the clients. > While inode quota ons mdt worked fine for a while now: > lctl conf_param technion.quota.mdt=ugp > When, few days ago I turned on quota on ost: > lctl conf_param technion.quota.ost=ugp > Users started getting "Disk quota exceeded" error messages while quota is > not filled > > Actions taken: > Full e2fsck -f -y to all the file system, mdt and ost. > lctl lfsck_start -A -t all -o -e continue > turning quota to none and back. > > None of the above solved the problem. > > lctl lfsck_query > > > layout_mdts_init: 0 > layout_mdts_scanning-phase1: 0 > layout_mdts_scanning-phase2: 0 > layout_mdts_completed: 0 > layout_mdts_failed: 0 > layout_mdts_stopped: 0 > layout_mdts_paused: 0 > layout_mdts_crashed: 0 > *layout_mdts_partial: 1 *# is that normal output? > layout_mdts_co-failed: 0 > layout_mdts_co-stopped: 0 > layout_mdts_co-paused: 0 > layout_mdts_unknown: 0 > layout_osts_init: 0 > layout_osts_scanning-phase1: 0 > layout_osts_scanning-phase2: 0 > layout_osts_completed: 30 > layout_osts_failed: 0 > layout_osts_stopped: 0 > layout_osts_paused: 0 > layout_osts_crashed: 0 > layout_osts_partial: 0 > layout_osts_co-failed: 0 > layout_osts_co-stopped: 0 > layout_osts_co-paused: 0 > layout_osts_unknown: 0 > layout_repaired: 15 > namespace_mdts_init: 0 > namespace_mdts_scanning-phase1: 0 > namespace_mdts_scanning-phase2: 0 > namespace_mdts_completed: 1 > namespace_mdts_failed: 0 > namespace_mdts_stopped: 0 > namespace_mdts_paused: 0 > namespace_mdts_crashed: 0 > namespace_mdts_partial: 0 > namespace_mdts_co-failed: 0 > namespace_mdts_co-stopped: 0 > namespace_mdts_co-paused: 0 > namespace_mdts_unknown: 0 > namespace_osts_init: 0 > namespace_osts_scanning-phase1: 0 > namespace_osts_scanning-phase2: 0 > namespace_osts_completed: 0 > namespace_osts_failed: 0 > namespace_osts_stopped: 0 > 
namespace_osts_paused: 0
> namespace_osts_crashed: 0
> namespace_osts_partial: 0
> namespace_osts_co-failed: 0
> namespace_osts_co-stopped: 0
> namespace_osts_co-paused: 0
> namespace_osts_unknown: 0
> namespace_repaired: 99
[lustre-discuss] Disk quota exceeded while quota is not filled
Hi, I'm running Lustre 2.10.5 on the OSS and MDS, and 2.10.7 on the clients. Inode quota on the mdt has worked fine for a while now:

lctl conf_param technion.quota.mdt=ugp

When, a few days ago, I turned on quota on the ost:

lctl conf_param technion.quota.ost=ugp

users started getting "Disk quota exceeded" error messages while quota is not filled.

Actions taken:
Full e2fsck -f -y on all the file system, mdt and ost.
lctl lfsck_start -A -t all -o -e continue
Turning quota to none and back.

None of the above solved the problem.

lctl lfsck_query

layout_mdts_init: 0
layout_mdts_scanning-phase1: 0
layout_mdts_scanning-phase2: 0
layout_mdts_completed: 0
layout_mdts_failed: 0
layout_mdts_stopped: 0
layout_mdts_paused: 0
layout_mdts_crashed: 0
layout_mdts_partial: 1   # is that normal output?
layout_mdts_co-failed: 0
layout_mdts_co-stopped: 0
layout_mdts_co-paused: 0
layout_mdts_unknown: 0
layout_osts_init: 0
layout_osts_scanning-phase1: 0
layout_osts_scanning-phase2: 0
layout_osts_completed: 30
layout_osts_failed: 0
layout_osts_stopped: 0
layout_osts_paused: 0
layout_osts_crashed: 0
layout_osts_partial: 0
layout_osts_co-failed: 0
layout_osts_co-stopped: 0
layout_osts_co-paused: 0
layout_osts_unknown: 0
layout_repaired: 15
namespace_mdts_init: 0
namespace_mdts_scanning-phase1: 0
namespace_mdts_scanning-phase2: 0
namespace_mdts_completed: 1
namespace_mdts_failed: 0
namespace_mdts_stopped: 0
namespace_mdts_paused: 0
namespace_mdts_crashed: 0
namespace_mdts_partial: 0
namespace_mdts_co-failed: 0
namespace_mdts_co-stopped: 0
namespace_mdts_co-paused: 0
namespace_mdts_unknown: 0
namespace_osts_init: 0
namespace_osts_scanning-phase1: 0
namespace_osts_scanning-phase2: 0
namespace_osts_completed: 0
namespace_osts_failed: 0
namespace_osts_stopped: 0
namespace_osts_paused: 0
namespace_osts_crashed: 0
namespace_osts_partial: 0
namespace_osts_co-failed: 0
namespace_osts_co-stopped: 0
namespace_osts_co-paused: 0
namespace_osts_unknown: 0
namespace_repaired: 99
Re: [lustre-discuss] MGS+MDT migration to a new storage using LVM tools
Thanks Andreas for your detailed reply. I took your advice on the MDT naming. Now that the migration is complete, I want to share some major problems I had on the way. I don't know where to point the blame: the Lustre kernel, e2fsprogs, SRP tools, the multipath version, the LVM version, or moving from 512-byte to 4096-byte blocks. But as soon as I created the mirror, the server went into a kernel panic / core dump loop. I managed to stop it only by breaking the mirror from another server connected to the same storage. It took me a full day to recover the system. Today I restarted the process, this time from a different server, not running Lustre, which I had already used for a VM LUN LVM migration. The exact same procedure ran flawlessly and I only needed to refresh the LVM on the MDS to be able to mount the migrated mdt. Cheers, David

On Sun, Jul 19, 2020 at 12:27 PM Andreas Dilger wrote:
> On Jul 19, 2020, at 12:41 AM, David Cohen wrote:
> >
> > Hi,
> > We have a combined MGS+MDT and I'm looking for a migration to new storage with a minimal disruption to the running jobs on the cluster.
> >
> > Can anyone find problems in the scenario below and/or suggest another solution?
> > I would appreciate also "no problems" replies to reassure the scenario before I proceed.
> >
> > Current configuration:
> > The mdt is a logical volume in a lustre_pool VG on a /dev/mapper/MDT0001 PV
>
> I've been running Lustre on LVM at home for many years, and have done pvmove of the underlying storage to new devices without any problems.
>
> > Migration plan:
> > Add /dev/mapper/MDT0002 new disk (multipath)
>
> I would really recommend that you *not* use MDT0002 as the name of the PV. This is very confusing because the MDT itself (at the Lustre level) is almost certainly named "-MDT", and if you ever add new MDTs to this filesystem it will be confusing as to which *Lustre* MDT is on which underlying PV. 
Instead, I'd take the opportunity to name this "MDT" to match the actual Lustre MDT target name.
>
> > extend the VG:
> > pvcreate /dev/mapper/MDT0002
> > vgextend lustre_pool /dev/mapper/MDT0002
> > mirror the mdt to the new disk:
> > lvconvert -m 1 /dev/lustre_pool/TECH_MDT /dev/mapper/MDT0002
>
> I typically just use "pvmove", but doing this by adding a mirror and then splitting it off is probably safer. That would still leave you with a full copy of the MDT on the original PV if something happened in the middle.
>
> > wait the mirrored disk to sync:
> > lvs -o+devices
> > when it's fully synced unmount the MDT, remove the old disk from the mirror:
> > lvconvert -m 0 /dev/lustre_pool/TECH_MDT /dev/mapper/MDT0001
> > and remove the old disk from the pool:
> > vgreduce lustre_pool /dev/mapper/MDT0001
> > pvremove /dev/mapper/MDT0001
> > remount the MDT and let the clients few minutes to recover the connection.
>
> In my experience with pvmove, there is no need to do anything with the clients, as long as you are not also moving the MDT to a new server, since the LVM/DM operations are totally transparent to both the Lustre server and client.
>
> After my pvmove (your "lvconvert -m 0"), I would just vgreduce the old PV from the VG, and then leave it in the system (internal HDD) until the next time I needed to shut down the server. If you have hot-plug capability for the PVs, then you don't even need to wait for that.
>
> Cheers, Andreas
[lustre-discuss] MGS+MDT migration to a new storage using LVM tools
Hi, We have a combined MGS+MDT and I'm looking for a migration to new storage with a minimal disruption to the running jobs on the cluster. Can anyone find problems in the scenario below and/or suggest another solution? I would also appreciate "no problems" replies to reassure me about the scenario before I proceed. Current configuration: The MDT is a logical volume in a lustre_pool VG on a /dev/mapper/MDT0001 PV Migration plan: Add /dev/mapper/MDT0002 new disk (multipath) extend the VG: pvcreate /dev/mapper/MDT0002 vgextend lustre_pool /dev/mapper/MDT0002 mirror the MDT to the new disk: lvconvert -m 1 /dev/lustre_pool/TECH_MDT /dev/mapper/MDT0002 wait for the mirrored disk to sync: lvs -o+devices when it's fully synced, unmount the MDT, remove the old disk from the mirror: lvconvert -m 0 /dev/lustre_pool/TECH_MDT /dev/mapper/MDT0001 and remove the old disk from the pool: vgreduce lustre_pool /dev/mapper/MDT0001 pvremove /dev/mapper/MDT0001 remount the MDT and give the clients a few minutes to recover the connection. Thanks, David
Re: [lustre-discuss] frequent Connection lost, Connection restored to mdt
Hi, Yes, I do see load on the client side, but as the client has a 40Gb NIC and the load comes from a 10Gb WAN link I wouldn't expect it to overload the net. I can correlate the messages with load higher than 6Gb/s from the WAN, far from the limit of the NIC. The client has a latest-generation Xeon processor so I wouldn't expect that to be the bottleneck either. David On Mon, Dec 23, 2019 at 5:09 PM Degremont, Aurelien wrote: > Hi > > > > These messages mean the client thinks it has lost the communication with > the server and reconnects. The server only sees the reconnection and never > thought the client was gone. > > > > It could be related to lots of things. The server could be receiving RPCs > from this client but not processing them fast enough. Are there other errors > on your server? Is there any high load? > > Same on your clients? Is there any high load that could prevent your > client from communicating with your server properly? > > > > Do you correlate that with some specific load running on your clients? > > > > Aurélien > > > > *De : *lustre-discuss au nom de > David Cohen > *Date : *dimanche 22 décembre 2019 à 17:08 > *À : *"lustre-discuss@lists.lustre.org" > *Objet : *[lustre-discuss] frequent Connection lost, Connection restored > to mdt > > > > Hi, > > We are running 2.10.5 on the servers and 2.10.8 on the clients. 
> > Every few minutes, we see: > > > > On client side: > > > > Dec 22 15:26:34 gftp kernel: Lustre: > 439834:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has > timed out for slow reply: [sent 1577021187/real 1577021187] > req@88160be9c6c0 x1653620348981536/t0(0) > o36->lustre-MDT-mdc-8817d9776c00@10.0.0.1@tcp:12/10 lens 608/4768 > e 0 to 1 dl 1577021194 ref 2 fl Rpc:X/0/ rc 0/-1 > Dec 22 15:26:34 gftp kernel: Lustre: > 439834:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 3 previous > similar messages > Dec 22 15:26:34 gftp kernel: Lustre: lustre-MDT-mdc-8817d9776c00: > Connection to lustre-MDT (at 10.0.0.1@tcp) was lost; in progress > operations using this service will wait for recovery to complete > Dec 22 15:26:34 gftp kernel: Lustre: Skipped 3 previous similar messages > Dec 22 15:26:34 gftp kernel: Lustre: lustre-MDT-mdc-8817d9776c00: > Connection restored to 10.0.0.1@tcp (at 192.114.101.153@tcp) > Dec 22 15:26:34 gftp kernel: Lustre: Skipped 3 previous similar messages > > > > On server side: > > > > Dec 22 15:26:34 oss03 kernel: Lustre: lustre-MDT: Client > 38d6eef1-e146-be41-bab9-409b272d0d4f (at 10.0.0.10@tcp) reconnecting > Dec 22 15:26:34 oss03 kernel: Lustre: lustre-MDT: Connection restored > to ec2cdfce-353f-583a-c970-fde3f5d5189c (at 10.0.0.10@tcp) > > > ___ lustre-discuss mailing list lustre-discuss@lists.lustre.org http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
[lustre-discuss] frequent Connection lost, Connection restored to mdt
Hi, We are running 2.10.5 on the servers and 2.10.8 on the clients. Every few minutes, we see: On client side: Dec 22 15:26:34 gftp kernel: Lustre: 439834:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1577021187/real 1577021187] req@88160be9c6c0 x1653620348981536/t0(0) o36->lustre-MDT-mdc-8817d9776c00@10.0.0.1@tcp:12/10 lens 608/4768 e 0 to 1 dl 1577021194 ref 2 fl Rpc:X/0/ rc 0/-1 Dec 22 15:26:34 gftp kernel: Lustre: 439834:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Dec 22 15:26:34 gftp kernel: Lustre: lustre-MDT-mdc-8817d9776c00: Connection to lustre-MDT (at 10.0.0.1@tcp) was lost; in progress operations using this service will wait for recovery to complete Dec 22 15:26:34 gftp kernel: Lustre: Skipped 3 previous similar messages Dec 22 15:26:34 gftp kernel: Lustre: lustre-MDT-mdc-8817d9776c00: Connection restored to 10.0.0.1@tcp (at 192.114.101.153@tcp) Dec 22 15:26:34 gftp kernel: Lustre: Skipped 3 previous similar messages On server side: Dec 22 15:26:34 oss03 kernel: Lustre: lustre-MDT: Client 38d6eef1-e146-be41-bab9-409b272d0d4f (at 10.0.0.10@tcp) reconnecting Dec 22 15:26:34 oss03 kernel: Lustre: lustre-MDT: Connection restored to ec2cdfce-353f-583a-c970-fde3f5d5189c (at 10.0.0.10@tcp) ___ lustre-discuss mailing list lustre-discuss@lists.lustre.org http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
[lustre-discuss] Limit to the number of "--servicenode="
Hi, In all the manuals and examples there are only two "--servicenode=" options in the creation of the MGS nodes and OSSs. Is that a limitation, or can I create more service nodes? Is the maximum number of servicenodes different for MGS and OSS? David
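For what it's worth, "--servicenode" is a repeatable option, so a format command with three service nodes is syntactically possible; the two-node examples in the manual reflect the common HA-pair setup rather than a stated hard limit. Treat the following as a sketch with invented NIDs, and check the Lustre manual for your version before relying on it:

```shell
# Hypothetical sketch: formatting an OST with three service nodes.
# All NIDs and the device path below are made up for illustration.
mkfs.lustre --fsname=fsname --ost --index=21 \
    --mgsnode=oss01@tcp --mgsnode=oss03@tcp \
    --servicenode=oss01@tcp \
    --servicenode=oss02@tcp \
    --servicenode=oss03@tcp \
    /dev/mapper/OST0015
```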
Re: [lustre-discuss] Lustre 2.10.4 failover
The fstab line I use for mounting the Lustre filesystem: oss03@tcp:oss01@tcp:/fsname /storage lustre flock,user_xattr,defaults 0 0 The MDS is also configured for failover (unsuccessfully): tunefs.lustre --writeconf --erase-params --fsname=fsname --mgs --mountfsoptions='user_xattr,errors=remount-ro,acl' --param="mgsnode=oss03@tcp mgsnode=oss01@tcp servicenode=oss01@tcp servicenode=oss03@tcp" /dev/lustre_pool/MDT On Mon, Aug 13, 2018 at 8:40 PM Mohr Jr, Richard Frank (Rick Mohr) < rm...@utk.edu> wrote: > > > On Aug 13, 2018, at 7:14 AM, David Cohen > wrote: > > > > I installed a new 2.10.4 Lustre file system. > > Running MDS and OSS on the same servers. > > Failover wasn't configured at format time. > > I'm trying to configure failover node with tunefs without success. > > tunefs.lustre --writeconf --erase-params --param="ost.quota_type=ug" > --mgsnode=oss03@tcp --mgsnode=oss01@tcp --servicenode=oss01@tcp > --servicenode=oss03@tcp /dev/mapper/OST0015 > > > > I can mount the ost on the second server but the clients won't restore > the connection. > > Maybe I'm missing something obvious. Do you see any typo in the command? > > What mount command are you using on the client? > > -- > Rick Mohr > Senior HPC System Administrator > National Institute for Computational Sciences > http://www.nics.tennessee.edu
[lustre-discuss] Lustre 2.10.4 failover
Hi, I installed a new 2.10.4 Lustre file system, running MDS and OSS on the same servers. Failover wasn't configured at format time. I'm trying to configure a failover node with tunefs without success. tunefs.lustre --writeconf --erase-params --param="ost.quota_type=ug" --mgsnode=oss03@tcp --mgsnode=oss01@tcp --servicenode=oss01@tcp --servicenode=oss03@tcp /dev/mapper/OST0015 I can mount the OST on the second server but the clients won't restore the connection. Maybe I'm missing something obvious. Do you see any typo in the command? David
[lustre-discuss] How to support user_xattr in 2.10.4
Hi, I'm running a newly installed Lustre 2.10.4. The MDS is configured to support acl and user_xattr: Persistent mount opts: user_xattr,errors=remount-ro,acl But when trying to mount (or remount) the client with "-o remount,acl,user_xattr" and checking the mount I get only: type lustre (rw,lazystatfs) While ACL seems to be available, user_xattr isn't: rsync: rsync_xal_set: lsetxattr("/storage/atlas/atlasdatadisk/SAM/testfile-prep-GET-ATLASLOCALGROUPDISK.txt","user.storm.checksum.adler32") failed: Operation not supported (95) David
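One thing worth ruling out (an assumption on my part, not a confirmed diagnosis): a plain "-o remount" may not re-apply Lustre client options, so a full unmount and fresh mount with user_xattr is the cleaner test. Sketch, with a placeholder MGS NID and the mount point from the error message:

```shell
# Sketch: retry with a clean client mount rather than a remount.
# mgs@tcp and /storage are placeholders for this site's real values.
umount /storage
mount -t lustre -o user_xattr,flock mgs@tcp:/fsname /storage
# Then test a user xattr directly (setfattr/getfattr from the attr package):
touch /storage/xattr-test
setfattr -n user.test -v hello /storage/xattr-test
getfattr -n user.test /storage/xattr-test
```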
Re: [lustre-discuss] Lustre Client in a container
Thanks for all the answers. I was thinking of creating a new file system, starting from a clean configuration, implementing quotas etc. For that I was looking for a way in which the systems can coexist, moving symbolic links while the folders are synchronized to the new system, in the process emptying disks of the old file system and moving them to the new one. This is a long process that might take more than a month, but can be done without disturbing normal cluster operation. As it doesn't seem to be possible in real life, I will have to reevaluate my options and come up with a different migration scheme. On Wed, Jan 3, 2018 at 1:49 PM, Patrick Farrell wrote: > FWIW, as long as you don't intend to use any interesting features (quotas, > etc), 1.8 clients were used with 2.5 servers at ORNL for some time with no > ill effects on the IO side of things. > > I'm not sure how much further that limited compatibility goes, though. > -- > *From:* Dilger, Andreas > *Sent:* Wednesday, January 3, 2018 4:20:56 AM > *To:* David Cohen > *Cc:* Patrick Farrell; lustre-discuss@lists.lustre.org > *Subject:* Re: [lustre-discuss] Lustre Client in a container > > On Dec 31, 2017, at 01:50, David Cohen > wrote: > > > > Patrick, > > Thanks for your response. > > I'm looking for a way to migrate from a 1.8.9 system to 2.10.2, stable > enough to run the several weeks or more that it might take. > > Note that there is no longer direct support for upgrading from 1.8 to > 2.10. > > That said, are you upgrading the filesystem in place, or are you copying > the data from the 1.8.9 filesystem to the 2.10.2 filesystem? In the latter > case, the upgrade compatibility doesn't really matter. What you need is a > client that can mount both server versions at the same time. > > Unfortunately, no 2.x clients can mount the 1.8.x server filesystem > directly, so that does limit your options. 
There was a time of > interoperability with 1.8 clients being able to mount 2.1-ish servers, but > that doesn't really help you. You could upgrade the 1.8 servers to 2.1 or > later, and then mount both filesystems with a 2.5-ish client, or upgrade > the servers to 2.5. > > Cheers, Andreas > > > On Sun, Dec 31, 2017 at 12:12 AM, Patrick Farrell wrote: > > David, > > > > I have no direct experience trying this, but I would imagine not - > Lustre is a kernel module (actually a set of kernel modules), so unless the > container tech you're using allows loading multiple different versions of > *kernel modules*, this is likely impossible. My limited understanding of > container tech on Linux suggests that this would be impossible, containers > allow userspace separation but there is only one kernel/set of > modules/drivers. > > > > I don't know of any way to run multiple client versions on the same node. > > > > The other question is *why* do you want to run multiple client versions > on one node...? Clients are usually interoperable across a pretty generous > set of server versions. > > > > - Patrick > > > > > > From: lustre-discuss on > behalf of David Cohen > > Sent: Saturday, December 30, 2017 11:45:15 AM > > To: lustre-discuss@lists.lustre.org > > Subject: [lustre-discuss] Lustre Client in a container > > > > Hi, > > Is it possible to run Lustre client in a container? > > The goal is to run two different client version on the same node, can it > be done? > > > > David > > > > > > ___ > > lustre-discuss mailing list > > lustre-discuss@lists.lustre.org > > http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org > > Cheers, Andreas > -- > Andreas Dilger > Lustre Principal Architect > Intel Corporation > > > > > > > > ___ lustre-discuss mailing list lustre-discuss@lists.lustre.org http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
Re: [lustre-discuss] Lustre Client in a container
Patrick, Thanks for your response. I'm looking for a way to migrate from a 1.8.9 system to 2.10.2, stable enough to run for the several weeks or more that it might take. David On Sun, Dec 31, 2017 at 12:12 AM, Patrick Farrell wrote: > David, > > > I have no direct experience trying this, but I would imagine not - Lustre > is a kernel module (actually a set of kernel modules), so unless the > container tech you're using allows loading multiple different versions of > *kernel modules*, this is likely impossible. My limited understanding of > container tech on Linux suggests that this would be impossible, containers > allow userspace separation but there is only one kernel/set of > modules/drivers. > > > I don't know of any way to run multiple client versions on the same node. > > > The other question is *why* do you want to run multiple client versions on > one node...? Clients are usually interoperable across a pretty generous > set of server versions. > > > - Patrick > > > > -- > *From:* lustre-discuss on > behalf of David Cohen > *Sent:* Saturday, December 30, 2017 11:45:15 AM > *To:* lustre-discuss@lists.lustre.org > *Subject:* [lustre-discuss] Lustre Client in a container > > Hi, > Is it possible to run Lustre client in a container? > The goal is to run two different client version on the same node, can it > be done? > > David
[lustre-discuss] Lustre Client in a container
Hi, Is it possible to run the Lustre client in a container? The goal is to run two different client versions on the same node, can it be done? David
Re: [Lustre-discuss] delete a "undeletable" file
You can move the entire folder (mv) to another location, e.g. /lustre_fs/something/badfiles, recreate the folder, and mv back only the good files. > If I run "unlink .viminfo" I get the same error: > > unlink: cannot unlink `.viminfo': Invalid argument > > > I can't stop the MDS/OSS to do an lfsck or e2fsck because it is a filesystem in > production of many terabytes > > > Any more ideas to delete the "damn file"? > > > > THANKS! > > -Mensaje original- > From: Bob Ball > Sent: Thursday, March 07, 2013 6:09 PM > To: Colin Faber > Cc: Alfonso Pardo ; Ben Evans ; lustre-discuss@lists.lustre.org > Subject: Re: [Lustre-discuss] delete a "undeletable" file > > You could just unlink it instead. That will work when rm fails. > > bob > > On 3/7/2013 11:10 AM, Colin Faber wrote: > > Hi, > > > > If the file is disassociated with an OST which is offline, bring the OST > > back online; if the OST object itself is missing then you can remove > > the file using 'unlink' rather than 'rm' to unlink the object meta data. > > > > If you want to try and recover the missing OST object, an lfs getstripe > > against the file should yield the OST on which it resides. Once > > that's determined you can take that OST offline and e2fsck may > > successfully restore it. > > > > Another option, as Ben correctly points out: lfsck will correct / prune > > this meta data as well as the now orphaned (if any) OST object. > > > > -cf > > > > > > On 03/07/2013 08:30 AM, Ben Evans wrote: > >> The snarky reply would be to use Emacs. > >> > >> More seriously: > >> > >> When I see something like ?? in the attributes for a file, my > >> first thought is that the group_upcall on the filesystem is not > >> correct, so permissions are broken. If you can log on as root, you may > >> be able to see it clearly. 
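The move-aside workaround above, sketched with a throwaway local directory standing in for the real Lustre path (all names here are examples only):

```shell
# Sketch of the move-aside workaround. "fsroot/home" stands in for the real
# Lustre directory; "badfile" stands in for the undeletable .viminfo.
mkdir -p fsroot/home
touch fsroot/home/good1 fsroot/home/good2 fsroot/home/badfile
mv fsroot/home fsroot/badfiles            # move the whole folder aside
mkdir fsroot/home                         # recreate it at the original path
mv fsroot/badfiles/good1 fsroot/badfiles/good2 fsroot/home/  # bring back only the good files
ls fsroot/home                            # lists good1 and good2; badfile stays quarantined
```

The bad file is left stranded in the quarantine directory, out of everyone's way, until an lfsck/e2fsck window eventually opens.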
> >> > >> If that doesn't work, you may have to run an fsck on the MDT (which > >> may take minutes to hours depending on the size of your MDT) > >> > >> If that doesn't work, follow the procedure for running an lfsck (which > >> will take a long time, and require quite a bit of storage to execute) > >> > >> -Ben Evans > >> > >> > >> *From:* lustre-discuss-boun...@lists.lustre.org > >> [lustre-discuss-boun...@lists.lustre.org] on behalf of Alfonso Pardo > >> [alfonso.pa...@ciemat.es] > >> *Sent:* Thursday, March 07, 2013 10:09 AM > >> *To:* lustre-discuss@lists.lustre.org; wc-disc...@whamcloud.com > >> *Subject:* [Lustre-discuss] delete a "undeletable" file > >> > >> Hello, > >> I have a corrupt file that I can't delete. > >> This is my file: > >>> ls -la .viminfo > >> -? ? ? ? ? ? .viminfo > >> >lfs getstripe .viminfo > >> .viminfo > >> lmm_stripe_count: 6 > >> lmm_stripe_size: 1048576 > >> lmm_layout_gen: 0 > >> lmm_stripe_offset: 18 > >> obdidx objid objid group > >> 18 1442898 0x160452 0 > >> 22 48 0x30 0 > >> 19 1442770 0x1603d2 0 > >> 21 49 0x31 0 > >> 23 48 0x30 0 > >> 20 50 0x32 0 > >> And these are my OSTs: > >> >lctl dl > >> 0 UP mgc MGC192.168.11.9@tcp f6d5b76f-a7e0-61ca-b389-cb3896b86186 5 > >> 1 UP lov cetafs-clilov-88009816e400 > >> a7ba6783-6ed8-2197-4ffc-fecbff9860a5 4 > >> 2 UP lmv cetafs-clilmv-88009816e400 > >> a7ba6783-6ed8-2197-4ffc-fecbff9860a5 4 > >> 3 UP mdc cetafs-MDT-mdc-88009816e400 > >> a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5 > >> 4 UP osc cetafs-OST-osc-88009816e400 > >> a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5 > >> 5 UP osc cetafs-OST0001-osc-88009816e400 > >> a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5 > >> 6 UP osc cetafs-OST0002-osc-88009816e400 > >> a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5 > >> 7 UP osc cetafs-OST0003-osc-88009816e400 > >> a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5 > >> 8 UP osc cetafs-OST0004-osc-88009816e400 > >> a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5 > >> 9 UP osc cetafs-OST0005-osc-88009816e400 > >> a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5 > >> 10 UP osc cetafs-OST0006-osc-88009816e400 > >> a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5 > >> 11 UP osc cetafs-OST0007-osc-88009816e400 > >> a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5 > >> 12 UP osc cetafs-OST0012-osc-88009816e400 > >> a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5 > >> 13 UP osc cetafs-OST0013-osc-88009816e400 > >> a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5 > >> 14 UP osc cetafs-OST0008-osc-88009816e400 > >> a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5 > >> 15 UP osc cetafs-OST000a-osc-88009816e400 > >> a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5 > >> 16 UP osc cetafs-OST0009-osc-88009816e400 > >> a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5 > >> 17 UP osc cetafs-OST000b-osc-88009816e400 > >> a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5 > >> 18 UP osc cetafs-OST000c-osc-88009816e400 > >> a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5 > >> 19 UP osc cetafs-OST000d-osc-88009816e400 > >> a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5 > >> 20 UP osc cet
Re: [Lustre-discuss] MDS crashes daily at the same hour
On Monday 04 January 2010 20:42:12 Andreas Dilger wrote: > On 2010-01-04, at 03:02, David Cohen wrote: > > I'm using a mixed environment of 1.8.0.1 MDS and 1.6.6 OSS's (had a > > problem > > with qlogic drivers and rolled back to 1.6.6). > > My MDS get unresponsive each day at 4-5 am local time, no kernel > > panic or > > error messages before. It was indeed the *locate update; a simple edit of /etc/updatedb.conf on the clients and the system is stable again. Many thanks. > > Judging by the time, I'd guess this is "slocate" or "mlocate" running > on all of your clients at the same time. This used to be a source of > extremely high load back in the old days, but I thought that Lustre > was in the exclude list in newer versions of *locate. Looking at the > installed mlocate on my system, that doesn't seem to be the case... > strange. > > > Some errors and an LBUG appear in the log after force booting the > > MDS and > > mounting the MDT and then the log is clear until next morning: > > > > Jan 4 06:33:31 tech-mds kernel: LustreError: 6357:0: > > (class_hash.c:225:lustre_hash_findadd_unique_hnode()) > > ASSERTION(hlist_unhashed(hnode)) failed > > Jan 4 06:33:31 tech-mds kernel: LustreError: 6357:0: > > (class_hash.c:225:lustre_hash_findadd_unique_hnode()) LBUG > > Jan 4 06:33:31 tech-mds kernel: Lustre: 6357:0:(linux- > > debug.c:222:libcfs_debug_dumpstack()) showing stack for process 6357 > > Jan 4 06:33:31 tech-mds kernel: ll_mgs_02 R running task > > 0 6357 > > 16340 (L-TLB) > > Jan 4 06:33:31 tech-mds kernel: Call Trace: > > Jan 4 06:33:31 tech-mds kernel: thread_return+0x62/0xfe > > Jan 4 06:33:31 tech-mds kernel: __wake_up_common+0x3e/0x68 > > Jan 4 06:33:31 tech-mds kernel: :ptlrpc:ptlrpc_main+0x1218/0x13e0 > > Jan 4 06:33:31 tech-mds kernel: default_wake_function+0x0/0xe > > Jan 4 06:33:31 tech-mds kernel: audit_syscall_exit+0x31b/0x336 > > Jan 4 06:33:31 tech-mds kernel: child_rip+0xa/0x11 > > Jan 4 06:33:31 tech-mds kernel: :ptlrpc:ptlrpc_main+0x0/0x13e0 > > 
Jan 4 06:33:31 tech-mds kernel: child_rip+0x0/0x11 > > It shouldn't LBUG during recovery, however. > > Cheers, Andreas > -- > Andreas Dilger > Sr. Staff Engineer, Lustre Group > Sun Microsystems of Canada, Inc. > -- David Cohen Grid Computing Physics Department Technion - Israel Institute of Technology ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
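For anyone hitting the same nightly hang, the /etc/updatedb.conf change David mentions can be as small as adding "lustre" to PRUNEFS so slocate/mlocate skips Lustre mounts. Sketched here against a local example file with an invented stock PRUNEFS line; apply the same edit to the real /etc/updatedb.conf (after backing it up) on each client:

```shell
# Sketch: add "lustre" to PRUNEFS in updatedb.conf.
# Uses a local example file; the stock PRUNEFS contents below are invented.
conf=updatedb.conf.example
printf 'PRUNEFS="nfs afs iso9660"\n' > "$conf"    # stand-in for the distro's line
sed -i 's/^PRUNEFS="/PRUNEFS="lustre /' "$conf"   # prepend lustre to the list
grep '^PRUNEFS' "$conf"                           # PRUNEFS="lustre nfs afs iso9660"
```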
[Lustre-discuss] MDS crashes daily at the same hour
Jan 4 06:38:41 tech-mds kernel: LustreError: 6398:0: (mds_open.c:1665:mds_close()) @@@ no handle for file close ino 18531068: cookie 0xdcb9c7fd999e9dfc r...@8100dc7c8c00 x1323646224495073/t0 o35->5d1ee8c1- f826-9ab3-89bf-342c4f9e2...@net_0x2c0726512_uuid:0/0 lens 408/976 e 0 to 0 dl 1262579927 ref 1 fl Interpret:/0/0 rc 0/0 Jan 4 06:38:41 tech-mds kernel: LustreError: 6415:0: (mds_open.c:1665:mds_close()) @@@ no handle for file close ino 18508458: cookie 0xdcb9c7fd9983617e r...@8100d4bfb400 x1323646224495345/t0 o35->5d1ee8c1- f826-9ab3-89bf-342c4f9e2...@net_0x2c0726512_uuid:0/0 lens 408/976 e 0 to 0 dl 1262579927 ref 1 fl Interpret:/0/0 rc 0/0 Jan 4 06:38:41 tech-mds kernel: LustreError: 6415:0: (mds_open.c:1665:mds_close()) Skipped 271 previous similar messages Jan 4 06:38:42 tech-mds kernel: LustreError: 6409:0: (mds_open.c:1665:mds_close()) @@@ no handle for file close ino 18498078: cookie 0xdcb9c7fd99273a35 r...@810054d2e800 x1323646224496303/t0 o35->5d1ee8c1- f826-9ab3-89bf-342c4f9e2...@net_0x2c0726512_uuid:0/0 lens 408/976 e 0 to 0 dl 1262579928 ref 1 fl Interpret:/0/0 rc 0/0 Jan 4 06:38:42 tech-mds kernel: LustreError: 6409:0: (mds_open.c:1665:mds_close()) Skipped 957 previous similar messages Jan 4 06:38:44 tech-mds kernel: LustreError: 6413:0: (mds_open.c:1665:mds_close()) @@@ no handle for file close ino 18464618: cookie 0xdcb9c7fd9893064a r...@8100d39f3400 x1323646224498078/t0 o35->5d1ee8c1- f826-9ab3-89bf-342c4f9e2...@net_0x2c0726512_uuid:0/0 lens 408/976 e 0 to 0 dl 1262579930 ref 1 fl Interpret:/0/0 rc 0/0 Jan 4 06:38:44 tech-mds kernel: LustreError: 6413:0: (mds_open.c:1665:mds_close()) Skipped 1774 previous similar messages Jan 4 06:38:48 tech-mds kernel: LustreError: 6423:0: (mds_open.c:1665:mds_close()) @@@ no handle for file close ino 18437710: cookie 0xdcb9c7fd9817e589 r...@8100d45b5c00 x1323646224499484/t0 o35->5d1ee8c1- f826-9ab3-89bf-342c4f9e2...@net_0x2c0726512_uuid:0/0 lens 408/976 e 0 to 0 dl 1262579934 ref 1 fl Interpret:/0/0 rc 0/0 Jan 4 
06:38:48 tech-mds kernel: LustreError: 6423:0: (mds_open.c:1665:mds_close()) Skipped 1405 previous similar messages Jan 4 06:38:53 tech-mds kernel: LustreError: 6422:0: (ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error (-116) r...@810054d38000 x1323646224500886/t0 o35->5d1ee8c1- f826-9ab3-89bf-342c4f9e2...@net_0x2c0726512_uuid:0/0 lens 408/976 e 0 to 0 dl 1262579939 ref 1 fl Interpret:/0/0 rc -116/0 Jan 4 06:38:53 tech-mds kernel: LustreError: 6422:0: (ldlm_lib.c:1826:target_send_reply_msg()) Skipped 5838 previous similar messages Jan 4 06:38:56 tech-mds kernel: LustreError: 6420:0: (mds_open.c:1665:mds_close()) @@@ no handle for file close ino 13567564: cookie 0xde1fda06cd4d058c r...@810055378800 x1323646224501408/t0 o35->5d1ee8c1- f826-9ab3-89bf-342c4f9e2...@net_0x2c0726512_uuid:0/0 lens 408/976 e 0 to 0 dl 1262579942 ref 1 fl Interpret:/0/0 rc 0/0 Jan 4 06:38:56 tech-mds kernel: LustreError: 6420:0: (mds_open.c:1665:mds_close()) Skipped 1923 previous similar messages -- David Cohen ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss