> 2. Tools to check a lustre (Sid Young)
> 4. Re: Tools to check a lustre (Dennis Nelson)

My key issue is why /home locks solid when you try to use it, while /lustre is OK. The backend is ZFS, used to manage the disks presented from the HP D8000 JBOD. After 6 months of 100% operation, I'm at a loss as to why this is suddenly occurring. If I run repeated "dd" tasks on /lustre it works fine; start one on /home and it locks solid.
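The repeated-dd probe described above can be sketched roughly as below. TARGET is an assumption: in the real setup you would point it at a directory on /home or /lustre; it defaults to /tmp here only so the sketch is runnable anywhere.

```shell
#!/bin/sh
# Minimal sketch of the repeated-dd probe described above.
# TARGET would be a directory on /home or /lustre in the real setup.
TARGET=${TARGET:-/tmp}
RESULTS="$TARGET/ddtest-results.txt"
: > "$RESULTS"
for i in 1 2 3; do
    # conv=fsync forces the data out before dd exits, so a hang here
    # (rather than an error) matches the "locks solid" symptom on /home.
    if dd if=/dev/zero of="$TARGET/ddtest.tmp" bs=1M count=16 conv=fsync 2>/dev/null; then
        echo "pass $i ok" | tee -a "$RESULTS"
    else
        echo "pass $i FAILED" | tee -a "$RESULTS"
    fi
done
rm -f "$TARGET/ddtest.tmp"
```

Running it once with TARGET on /lustre and once with TARGET on /home would reproduce the comparison described above.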
I have started a ZFS scrub on two of the ZFS pools. At 47T each it will take most of today to complete, but that should rule out the actual storage (which is showing "NORMAL/ONLINE" and no errors).

I'm seeing a lot of these in /var/log/messages:

kernel: LustreError: 6578:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff89cdf3b9dc00

A Google search returned this:
https://wiki.lustre.org/Lustre_Resiliency:_Understanding_Lustre_Message_Loss_and_Tuning_for_Resiliency

Could it be a network issue? The nodes are running the CentOS 7.9 in-box drivers; the Mellanox one did not seem to make any difference when I originally tried it 6 months ago.

Any help appreciated :)

Sid

> ---------- Forwarded message ----------
> From: Sid Young <sid.yo...@gmail.com>
> To: lustre-discuss <lustre-discuss@lists.lustre.org>
> Cc:
> Bcc:
> Date: Mon, 11 Oct 2021 16:07:56 +1000
> Subject: [lustre-discuss] Tools to check a lustre
>
> I'm having trouble diagnosing where the problem lies in my Lustre
> installation. Clients are 2.12.6, and I have /home and /lustre
> filesystems using Lustre.
>
> /home has 4 OSTs and /lustre is made up of 6 OSTs. lfs df shows all OSTs
> as ACTIVE.
>
> The /lustre file system appears fine; I can ls into every directory.
>
> When people log into the login node, it appears to lock up. I have shut
> down everything and remounted the OSTs and MDTs etc. in order with no
> errors reported, but I'm getting the lockup issue soon after a few people
> log in.
>
> The backend network is 100G Ethernet using ConnectX-5 cards and the OS is
> CentOS 7.9; everything was installed as RPMs and updates are disabled in
> yum.conf.
>
> Two questions to start with:
> Is there a command line tool to check each OST individually?
> Apart from /var/log/messages, is there a Lustre-specific log I can monitor
> on the login node to see errors when I hit /home...
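A quick way to get a feel for how often the bulk callback is failing, and with which status codes, is to tally the syslog lines. This is only a sketch: it uses a captured sample line so it runs anywhere; on the real system you would point LOG at /var/log/messages instead.

```shell
#!/bin/sh
# Sketch: tally client_bulk_callback errors by status code.
# A captured sample line is used so the sketch runs anywhere; on the
# real system, set LOG=/var/log/messages instead.
LOG=/tmp/messages.sample
cat > "$LOG" <<'EOF'
Oct 11 09:00:01 node kernel: LustreError: 6578:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff89cdf3b9dc00
EOF
# status -5 is -EIO on the bulk transfer, i.e. the data phase of the RPC
# failed; the wiki page above discusses this as a message-loss/network
# symptom rather than a disk error.
grep -o 'client_bulk_callback.*status -[0-9]*' "$LOG" \
    | sed 's/.*status //' | sort | uniq -c | tee /tmp/bulk_status_counts.txt
```

A sudden spike in one status code across many clients would support the network theory over a storage fault.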
> Sid Young

> ---------- Forwarded message ----------
> From: Dennis Nelson <dnel...@ddn.com>
> To: Sid Young <sid.yo...@gmail.com>
> Date: Mon, 11 Oct 2021 12:20:25 +0000
> Subject: Re: [lustre-discuss] Tools to check a lustre
>
> Have you tried lfs check servers on the login node?

Yes - one of the first things I did, and this is what it always reports:

]# lfs check servers
home-OST0000-osc-ffff89adb7e5e000 active.
home-OST0001-osc-ffff89adb7e5e000 active.
home-OST0002-osc-ffff89adb7e5e000 active.
home-OST0003-osc-ffff89adb7e5e000 active.
lustre-OST0000-osc-ffff89cdd14a2000 active.
lustre-OST0001-osc-ffff89cdd14a2000 active.
lustre-OST0002-osc-ffff89cdd14a2000 active.
lustre-OST0003-osc-ffff89cdd14a2000 active.
lustre-OST0004-osc-ffff89cdd14a2000 active.
lustre-OST0005-osc-ffff89cdd14a2000 active.
home-MDT0000-mdc-ffff89adb7e5e000 active.
lustre-MDT0000-mdc-ffff89cdd14a2000 active.
[root@tri-minihub-01 ~]#
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org