[lustre-discuss] Tools to check a lustre
I'm having trouble diagnosing where the problem lies in my Lustre installation. Clients are 2.12.6, and I have /home and /lustre filesystems using Lustre. /home has 4 OSTs and /lustre is made up of 6 OSTs. lfs df shows all OSTs as ACTIVE.

The /lustre filesystem appears fine; I can ls into every directory. When people log into the login node, it appears to lock up. I have shut down everything and remounted the OSTs, MDTs, etc. in order with no errors reported, but I'm getting the lockup issue soon after a few people log in. The backend network is 100G Ethernet using ConnectX-5 cards, the OS is CentOS 7.9, everything was installed from RPMs, and updates are disabled in yum.conf.

Two questions to start with:
Is there a command-line tool to check each OST individually?
Apart from /var/log/messages, is there a Lustre-specific log I can monitor on the login node to see errors when I hit /home...

Sid Young

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
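[Editor's note: a minimal sketch answering both questions, assuming a Lustre 2.12 client with the standard lfs/lctl utilities installed; it only reports what is available, so it is safe to paste on a node without Lustre.]

```shell
# Sketch: per-OST checks and a Lustre-specific log, run on the client.
LFS_BIN=$(command -v lfs || echo "not-installed")
CTL_BIN=$(command -v lctl || echo "not-installed")
if [ "$LFS_BIN" != "not-installed" ]; then
    lfs check osts                 # ping each OST individually from this client
fi
if [ "$CTL_BIN" != "not-installed" ]; then
    lctl get_param osc.*.state     # per-target import state; FULL means healthy
    # Lustre's own in-kernel debug buffer (beyond /var/log/messages):
    lctl dk /tmp/lustre-debug.log  # flush debug log to a file for inspection
fi
```

`lfs check` also accepts `mds` and `servers` as targets; the osc.*.state output shows the connection history per OST, which helps spot a single flapping target that lfs df (which only shows ACTIVE/INACTIVE) would miss.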
Re: [lustre-discuss] eviction timeout
Hello,

This message appears during MDT recovery, most likely after an MDS restart. When the MDT restarts, it first tries to reconnect all clients that existed when it stopped. It seems all of these clients have also been rebooted. To avoid this message, stop your clients before the servers. If that is not possible, you can abort the recovery, either at start time (https://doc.lustre.org/lustre_manual.xhtml#lustremaint.abortRecovery) or while recovery is running, with the following command on the MDS host:

lctl --device lustre-MDT abort_recovery

Aurélien

From: lustre-discuss on behalf of Sid Young via lustre-discuss
Reply to: Sid Young
Date: Monday, 11 October 2021 at 03:16
To: lustre-discuss
Subject: [EXTERNAL] [lustre-discuss] eviction timeout

I'm seeing a lot of these messages:

Oct 11 11:12:09 hpc-mds-02 kernel: Lustre: lustre-MDT: Denying connection for new client b6df7eda-8ae1-617c-6ff1-406d1ffb6006 (at 10.140.90.82@tcp), waiting for 6 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 2:42

It seems to be a 3-minute timeout. Is it possible to shorten this, and even not log this message?

Sid Young
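[Editor's note: a sketch of how to watch and shorten the recovery window on the MDS, assuming Lustre server utilities; the recovery_time_soft/recovery_time_hard parameter names follow the Lustre manual and should be verified against your version before use.]

```shell
# Sketch: inspect and tune MDT recovery, run as root on the MDS host.
CTL_BIN=$(command -v lctl || echo "not-installed")
if [ "$CTL_BIN" != "not-installed" ]; then
    # Watch recovery progress and the remaining window:
    lctl get_param mdt.*.recovery_status
    # Shorten the recovery window (values in seconds; tune to taste):
    lctl set_param mdt.*.recovery_time_soft=60
    lctl set_param mdt.*.recovery_time_hard=120
fi
```

recovery_time_soft extends as clients reconnect, up to the recovery_time_hard cap, so a small soft value with a modest hard cap ends recovery quickly once the surviving clients are back.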
Re: [lustre-discuss] Tools to check a lustre
Have you tried lfs check servers on the login node?

Sent from my iPhone

On Oct 11, 2021, at 2:58 AM, Sid Young via lustre-discuss wrote:
> I'm having trouble diagnosing where the problem lies in my Lustre installation...
[lustre-discuss] Lustre /home lockup - how to check
My key issue is why /home locks solid when you try to use it, but /lustre is OK. The backend is ZFS, used to manage the disks presented from the HP D8000 JBOD. I'm at a loss, after 6 months of 100% operation, why this is suddenly occurring. If I do repeated "dd" tasks on /lustre it works fine; start one on /home and it locks solid.

I have started a ZFS scrub on two of the zfs pools. At 47T each it will take most of today to resolve, but that should rule out the actual storage (which is showing "NORMAL/ONLINE" and no errors).

I'm seeing a lot of these in /var/log/messages:

kernel: LustreError: 6578:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc 89cdf3b9dc00

A google search returned this:
https://wiki.lustre.org/Lustre_Resiliency:_Understanding_Lustre_Message_Loss_and_Tuning_for_Resiliency

Could it be a network issue? The nodes are running the CentOS 7.9 drivers... the Mellanox one did not seem to make any difference when I originally tried it 6 months ago.

Any help appreciated :)

Sid

> From: Dennis Nelson
> Date: Mon, 11 Oct 2021 12:20:25
> Subject: Re: [lustre-discuss] Tools to check a lustre
> Have you tried lfs check servers on the login node?

Yes - one of the first things I did, and this is what it always reports:

]# lfs check servers
home-OST-osc-89adb7e5e000 active.
home-OST0001-osc-89adb7e5e000 active.
home-OST0002-osc-89adb7e5e000 active.
home-OST0003-osc-89adb7e5e000 active.
lustre-OST-osc-89cdd14a2000 active.
lustre-OST0001-osc-89cdd14a2000 active.
lustre-OST0002-osc-89cdd14a2000 active.
lustre-OST0003-osc-89cdd14a2000 active.
lustre-OST0004-osc-89cdd14a2000 active.
lustre-OST0005-osc-89cdd14a2000 active.
home-MDT-mdc-89adb7e5e000 active.
lustre-MDT-mdc-89cdd14a2000 active.
[root@tri-minihub-01 ~]#
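[Editor's note: status -5 in client_bulk_callback is -EIO on a bulk transfer, which often points at the network path rather than the storage. A sketch of the LNet-level checks, assuming the lnetctl utility shipped with Lustre 2.12; the NID in the comment is one quoted earlier in this thread, used purely as an example.]

```shell
# Sketch: check the LNet layer for drops when bulk callbacks fail with -EIO.
LNETCTL=$(command -v lnetctl || echo "not-installed")
CTL_BIN=$(command -v lctl || echo "not-installed")
if [ "$LNETCTL" != "not-installed" ]; then
    lnetctl stats show      # global send/recv/drop counters at the LNet layer
    lnetctl net show -v     # per-interface status, credits, and queue depth
fi
if [ "$CTL_BIN" != "not-installed" ]; then
    # Raw LNet ping of a server NID (example NID from this thread):
    lctl ping 10.140.90.82@tcp
fi
```

A rising drop counter in lnetctl stats while a dd on /home is hanging would support the network theory; running the same check on the OSS side shows whether the loss is symmetric.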
Re: [lustre-discuss] Lustre /home lockup - more info
I tried remounting the /home lustre filesystem to /mnt in read-only mode, and when I try to ls the directory it locks up, but I can escape it. However, when I do a df command I get a completely wrong size (it should be around 192TB):

10.140.93.42@o2ib:/home  6.0P  4.8P  1.3P  80%  /mnt

The zfs scrub is still working, and all disks physically report as OK in the iLO of the two OSS servers... When the scrub finishes later today I will unmount and remount the 4 OSTs and see if the remount changes the status... updates in about 8 hours.

Sid Young

On Tue, Oct 12, 2021 at 8:18 AM Sid Young wrote:
> My key issue is why /home locks solid when you try to use it but /lustre is OK...
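[Editor's note: when a plain df hangs or reports a nonsense total, a per-target df can show which specific OST is stalled, since each target prints its own line. A sketch, assuming a Lustre client mount at /mnt as in the message above.]

```shell
# Sketch: per-target space report to isolate a wedged OST.
LFS_BIN=$(command -v lfs || echo "not-installed")
if [ "$LFS_BIN" != "not-installed" ]; then
    lfs df -h /mnt    # one line per MDT/OST; output stalls at the bad target
    lfs df -i /mnt    # inode view of the same targets
fi
```

Unlike the system df, which aggregates all targets, lfs df stops printing at the first unresponsive OST, so the last line that appears identifies the next target to check on the OSS.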