[lustre-discuss] Tools to check a lustre

2021-10-11 Thread Sid Young via lustre-discuss
I'm having trouble diagnosing where the problem lies in my Lustre
installation. The clients are 2.12.6, and I have /home and /lustre
filesystems using Lustre.

/home has 4 OSTs and /lustre is made up of 6 OSTs. lfs df shows all OSTs as
ACTIVE.

The /lustre file system appears fine; I can ls into every directory.

When people log into the login node, it appears to lock up. I have shut down
everything and remounted the OSTs, MDTs etc. in order with no errors
reported, but I'm getting the lockup issue soon after a few people log in.
The backend network is 100G Ethernet using ConnectX-5 cards and the OS is
CentOS 7.9; everything was installed as RPMs and updates are disabled in
yum.conf.

Two questions to start with:
Is there a command line tool to check each OST individually?
Apart from /var/log/messages, is there a Lustre-specific log I can monitor
on the login node to see errors when I hit /home?



Sid Young
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] eviction timeout

2021-10-11 Thread Degremont, Aurelien via lustre-discuss
Hello

This message appears during MDT recovery, likely after an MDS restart. When the MDT
restarts, it first tries to reconnect all of the clients that were connected before it stopped.
It seems all of those clients have also been rebooted. To avoid this message, try
to stop your clients before the servers.

If that is not possible, you can abort the recovery, either at start time
(https://doc.lustre.org/lustre_manual.xhtml#lustremaint.abortRecovery) or, while
recovery is running, with the following command on the MDS host:

lctl --device lustre-MDT abort_recovery
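
For reference, the start-time route is the abort_recov mount option, something like
the following on the MDS (the pool/dataset name and mount point are placeholders
for your own MDT):

mount -t lustre -o abort_recov mdt0pool/mdt0 /mnt/lustre/mdt   # placeholders, adjust to your MDT backend

# You can also watch recovery progress on the MDS with:
lctl get_param mdt.*.recovery_status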


Aurélien

From: lustre-discuss  on behalf of Sid 
Young via lustre-discuss 
Reply to: Sid Young 
Date: Monday, 11 October 2021 at 03:16
To: lustre-discuss 
Subject: [EXTERNAL] [lustre-discuss] eviction timeout


I'm seeing a lot of these messages:

Oct 11 11:12:09 hpc-mds-02 kernel: Lustre: lustre-MDT: Denying connection 
for new client b6df7eda-8ae1-617c-6ff1-406d1ffb6006 (at 10.140.90.82@tcp), 
waiting for 6 known clients (0 recovered, 0 in progress, and 0 evicted) to 
recover in 2:42

It seems to be a 3-minute timeout; is it possible to shorten this, and even to not
log this message?

Sid Young

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Tools to check a lustre

2021-10-11 Thread Dennis Nelson via lustre-discuss
Have you tried lfs check servers on the login node?

Sent from my iPhone

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Lustre /home lockup - how to check

2021-10-11 Thread Sid Young via lustre-discuss
My key issue is why /home locks up solid when you try to use it while /lustre
is OK. The backend is ZFS, used to manage the disks presented from the HP
D8000 JBOD. After 6 months of 100% operation I'm at a loss as to why this is
suddenly occurring. If I do repeated "dd" tasks on /lustre it works fine; start
one on /home and it locks up solid.
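
The dd test is just sequential writes; the paths, size and direct-I/O flag below
are only an example of what I'm running:

dd if=/dev/zero of=/lustre/tmp/ddtest.bin bs=1M count=4096 oflag=direct   # completes fine
dd if=/dev/zero of=/home/sid/ddtest.bin bs=1M count=4096 oflag=direct     # locks up solid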

I have started a ZFS scrub on two of the ZFS pools. At 47T each it will
take most of today to complete, but that should rule out the actual storage
(which is showing "NORMAL/ONLINE" and no errors).
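
A quick way to keep an eye on scrub progress and error counters on the OSS
nodes is something like:

zpool status -v | grep -E 'pool:|state:|scan:|errors:'   # the scan: line shows scrub progress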

I'm seeing a lot of these in /var/log/messages:
kernel: LustreError: 6578:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc 89cdf3b9dc00
A Google search returned this:
https://wiki.lustre.org/Lustre_Resiliency:_Understanding_Lustre_Message_Loss_and_Tuning_for_Resiliency

Could it be a network issue? The nodes are running the stock CentOS 7.9
drivers... the Mellanox driver did not seem to make any difference when I
originally tried it 6 months ago.
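
One way to sanity-check the LNet layer from the login node is something like the
following (the NID is just an example; substitute each MDS/OSS NID in turn):

lctl ping 10.140.93.42@o2ib    # should return the peer's NIDs quickly
lnetctl stats show             # look at the errors / drop_count counters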

Any help appreciated :)

Sid


> -- Forwarded message --
> From: Dennis Nelson 
> To: Sid Young 
>
> Date: Mon, 11 Oct 2021 12:20:25 +
> Subject: Re: [lustre-discuss] Tools to check a lustre
> Have you tried lfs check servers on the login node?
>

Yes, that was one of the first things I did, and this is what it always reports:

]# lfs check servers
home-OST-osc-89adb7e5e000 active.
home-OST0001-osc-89adb7e5e000 active.
home-OST0002-osc-89adb7e5e000 active.
home-OST0003-osc-89adb7e5e000 active.
lustre-OST-osc-89cdd14a2000 active.
lustre-OST0001-osc-89cdd14a2000 active.
lustre-OST0002-osc-89cdd14a2000 active.
lustre-OST0003-osc-89cdd14a2000 active.
lustre-OST0004-osc-89cdd14a2000 active.
lustre-OST0005-osc-89cdd14a2000 active.
home-MDT-mdc-89adb7e5e000 active.
lustre-MDT-mdc-89cdd14a2000 active.
[root@tri-minihub-01 ~]#
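
If it helps, the per-OSC import state can also be dumped on the client, which
gives a bit more detail than a simple active/inactive:

lctl get_param osc.*.state    # current import state and recent state history
lctl get_param osc.*.import   # full import details (target, connection state, failover NIDs)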
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre /home lockup - more info

2021-10-11 Thread Sid Young via lustre-discuss
I tried remounting the /home Lustre file system on /mnt in read-only mode.
When I try to ls the directory it locks up, but I can escape it; however, when
I do a df command I get a completely wrong size (it should be around 192TB):

10.140.93.42@o2ib:/home  6.0P  4.8P  1.3P  80% /mnt
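
For comparison, the per-OST breakdown of the same mount should show whether all
four /home OSTs are actually being counted in that total:

lfs df -h /mnt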

The ZFS scrub is still running and all disks physically report as OK in the iLO
of the two OSS servers...

When the scrub finishes later today I will unmount and remount the 4 OSTs
and see if the remount changes the status... updates in about 8 hours.

Sid Young

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org