Re: [lustre-discuss] OST is not mounting

2023-11-13 Thread Thomas Roth via lustre-discuss

So, did you do the "writeconf"? And the OST mounted afterwards?

As I understand it, the MGS was under the impression that this re-mounting 
OST was actually a new one reusing an old index.
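
A toy model may make this reading of the error easier to follow. It is only a sketch of the idea, not the MGS code: the class IndexMap, its register() method and the looks_new / writeconf arguments are hypothetical names chosen for illustration.

```python
import errno


class IndexMap:
    """Toy stand-in for the MGS's record of which OST indices are taken."""

    def __init__(self):
        self.in_use = set()

    def register(self, index, looks_new, writeconf=False):
        """Return 0 on success, or a negative errno like the rc values in an MGS log."""
        # A target that presents itself as new may not claim an index that is
        # already registered, unless the admin forces it with writeconf.
        if index in self.in_use and looks_new and not writeconf:
            return -errno.EADDRINUSE
        self.in_use.add(index)
        return 0


mgs = IndexMap()
first = mgs.register(33, looks_new=True)                   # index is free
known = mgs.register(33, looks_new=False)                  # known target coming back
refused = mgs.register(33, looks_new=True)                 # "new" target, used index
forced = mgs.register(33, looks_new=True, writeconf=True)  # forced through
print(first, known, refused, forced)
```

In this model a known target that re-mounts is accepted, and only a target that looks new gets the "already in use" refusal, which is why the question of what made the repaired OST look new matters.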

So, what made your repaired OST look new/different?
I would probably have mounted it locally, as an ext4 file system, if 
only to check that there is data still present (ok, "df" would do that, 
too).
"tunefs.lustre --dryrun" will show other quantum numbers (target name, 
index, flags) that _should not_ change when taking down and remounting 
an OST.
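
One way to make that comparison less error-prone is to save the dry-run output before and after the incident and diff the fields. The sketch below assumes the saved text consists of "Key: value" lines; the real layout may differ between Lustre versions, and the two sample texts are invented for illustration, not captured from a real system. The function names are hypothetical.

```python
def parse_report(text):
    """Collect 'Key: value' lines into a dict; lines without a colon are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")  # split at the first colon only
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def changed_fields(before, after):
    """Return {key: (old, new)} for every field that differs between two reports."""
    a, b = parse_report(before), parse_report(after)
    return {k: (a.get(k), b.get(k))
            for k in sorted(a.keys() | b.keys())
            if a.get(k) != b.get(k)}


# Invented sample reports (assumption: not real tool output).
BEFORE = "Target: fs1-OST0021\nIndex: 33\nFlags: 0x2\n"
AFTER = "Target: fs1-OST0021\nIndex: 33\nFlags: 0x22\n"

diff = changed_fields(BEFORE, AFTER)
print(diff)
```

Anything that shows up in the diff is a candidate for what made the target look different to the MGS.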


And since "writeconf" has to be done on all targets, you have to take 
down your MDS anyhow - so nothing is lost by simply trying an MDS restart?


Regards
Thomas

On 11/5/23 17:11, Backer via lustre-discuss wrote:

Hi,

I am new to this email list. Looking to get some help on why an OST is 
not getting mounted.



The cluster was running healthy until one OST hit an issue and Linux 
remounted it read-only. After I fixed the issue and rebooted the node 
several times, the OST would not mount.

The mount command errors out, stating that the index is already in use. 
The index for the device is 33. No other target with this index is 
mounted anywhere.
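
For anyone matching the index against target names: the suffix of an OST name is the index written in hexadecimal, so index 33 belongs to fs1-OST0021, the name that appears in the MGS log. A minimal sketch of the conversion follows; the helper name target_index is hypothetical.

```python
def target_index(target_name):
    """Return the decimal index encoded in a name such as 'fs1-OST0021'."""
    fsname, sep, label = target_name.rpartition("-")
    if not sep or not label.startswith("OST"):
        raise ValueError("not an OST target name: %r" % target_name)
    # The digits after 'OST' are the index in hexadecimal.
    return int(label[3:], 16)


idx = target_index("fs1-OST0021")
print(idx)
```

This is handy when searching a device list for a possible duplicate, since the list shows the hex form while the error message shows the decimal one.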


The debug log from the MGS during the mount is included at the end 
of this email. It says to use writeconf. After running writeconf, 
the device mounted. I have a couple of questions:


- Is writeconf the right thing to do here?
- Why did the OST end up in this state after the write failure and the 
read-only remount? The write error was caused by the iSCSI target going 
offline and coming back a few seconds later.


2000:0100:17.0:1698240468.758487:0:91492:0:(mgs_handler.c:496:mgs_target_reg()) updating fs1-OST0021, index=33
2000:0001:17.0:1698240468.758488:0:91492:0:(mgs_llog.c:4403:mgs_write_log_target()) Process entered
2000:0001:17.0:1698240468.758488:0:91492:0:(mgs_llog.c:671:mgs_set_index()) Process entered
2000:0001:17.0:1698240468.758488:0:91492:0:(mgs_llog.c:572:mgs_find_or_make_fsdb()) Process entered
2000:0001:17.0:1698240468.758489:0:91492:0:(mgs_llog.c:551:mgs_find_or_make_fsdb_nolock()) Process entered
2000:0001:17.0:1698240468.758489:0:91492:0:(mgs_llog.c:565:mgs_find_or_make_fsdb_nolock()) Process leaving (rc=0 : 0 : 0)
2000:0001:17.0:1698240468.758489:0:91492:0:(mgs_llog.c:578:mgs_find_or_make_fsdb()) Process leaving (rc=0 : 0 : 0)
2000:0202:17.0:1698240468.758490:0:91492:0:(mgs_llog.c:711:mgs_set_index()) 140-5: Server fs1-OST0021 requested index 33, but that index is already in use. Use --writeconf to force
2000:0001:17.0:1698240468.772355:0:91492:0:(mgs_llog.c:712:mgs_set_index()) Process leaving via out_up (rc=18446744073709551518 : -98 : 0xff9e)
2000:0001:17.0:1698240468.772356:0:91492:0:(mgs_llog.c:4408:mgs_write_log_target()) Process leaving (rc=18446744073709551518 : -98 : ff9e)
2000:0002:17.0:1698240468.772357:0:91492:0:(mgs_handler.c:503:mgs_target_reg()) Failed to write fs1-OST0021 log (-98)
2000:0001:17.0:1698240468.783747:0:91492:0:(mgs_handler.c:504:mgs_target_reg()) Process leaving via out (rc=18446744073709551518 : -98 : 0xff9e)
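
The large rc value in these lines is the same -98 printed as an unsigned 64-bit number. A short sketch of the conversion follows; the helper name to_signed64 is hypothetical, and the errno name is hardcoded on the assumption of Linux numbering, where 98 is EADDRINUSE ("Address already in use").

```python
def to_signed64(value):
    """Reinterpret an unsigned 64-bit integer as a signed (two's complement) one."""
    return value - (1 << 64) if value >= (1 << 63) else value


# Linux errno numbering (assumption: the MGS runs on Linux).
LINUX_ERRNO_NAMES = {98: "EADDRINUSE"}

raw = 18446744073709551518
rc = to_signed64(raw)
name = LINUX_ERRNO_NAMES.get(-rc, "unknown")
low16 = hex(raw & 0xFFFF)  # the trailing hex digits shown in the log
print(rc, name, low16)
```

So all the "Process leaving" lines report one and the same failure: the index check in mgs_set_index() returned "already in use" and the callers passed it up unchanged.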




___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org



Re: [lustre-discuss] OST is not mounting

2023-11-08 Thread Backer via lustre-discuss
Thanks for the explanation. There was a problem with the iSCSI target; the
storage is already multipathed. Anyhow, I was expecting things to come back
online after the problem was resolved. Instead this effectively created a
data-loss situation, and I thought Lustre was resilient enough not to lose a
whole OST. Here the OST became completely unmountable.

On Tue, 7 Nov 2023 at 13:56, Andreas Dilger wrote:

> The OST went read-only because that is what happens when the block device
> disappears underneath it. That is a behavior of ext4 and other local
> filesystems as well.
>
> If you look in the console logs you would see SCSI errors and the
> filesystem being remounted read-only.
>
> To have reliability in the face of such storage issues you need to use
> dm-multipath.
>
> Cheers, Andreas
>
> > On Nov 5, 2023, at 09:13, Backer via lustre-discuss
> > <lustre-discuss@lists.lustre.org> wrote:
> >
> > - Why did the OST end up in this state after the write failure and the
> > read-only remount? The write error was caused by the iSCSI target going
> > offline and coming back a few seconds later.
>


Re: [lustre-discuss] OST is not mounting

2023-11-07 Thread Andreas Dilger via lustre-discuss
The OST went read-only because that is what happens when the block device 
disappears underneath it. That is a behavior of ext4 and other local 
filesystems as well. 

If you look in the console logs you would see SCSI errors and the filesystem 
being remounted read-only. 
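
Besides the console log, the mount table shows whether a filesystem is currently read-only. The sketch below parses text in the /proc/mounts format (device, mount point, type, options, two numeric fields). The function name, the list of filesystem types and the sample lines are assumptions made for illustration; whether an affected OST shows up this way depends on the setup, so the console log remains the better source.

```python
def read_only_mounts(mounts_text, fstypes=("ext4", "ldiskfs", "lustre")):
    """Return (device, mountpoint) pairs whose mount options include 'ro'."""
    hits = []
    for line in mounts_text.splitlines():
        parts = line.split()
        if len(parts) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, options = parts[:4]
        if fstype in fstypes and "ro" in options.split(","):
            hits.append((device, mountpoint))
    return hits


# Invented sample in /proc/mounts format (device names are hypothetical).
SAMPLE = (
    "/dev/mapper/mpatha /mnt/ost33 ext4 ro,relatime 0 0\n"
    "/dev/sda1 / ext4 rw,relatime,errors=remount-ro 0 0\n"
)

ro_mounts = read_only_mounts(SAMPLE)
print(ro_mounts)
```

On a live server the text would come from reading /proc/mounts. Splitting the options on commas keeps "errors=remount-ro" from being mistaken for a read-only mount.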

To have reliability in the face of such storage issues you need to use 
dm-multipath. 

Cheers, Andreas

> On Nov 5, 2023, at 09:13, Backer via lustre-discuss wrote:
> 
> - Why did the OST end up in this state after the write failure and the 
> read-only remount? The write error was caused by the iSCSI target going 
> offline and coming back a few seconds later. 


Re: [lustre-discuss] OST is not mounting

2023-11-07 Thread James Lam via lustre-discuss

If possible, do a hexdump to check the affected OST for problems:

https://groups.google.com/g/lustre-discuss-list/c/3cmmcKAB34w

If the OST is on ldiskfs, run e2fsck for a low-level check of the ldiskfs 
filesystem. Remember to do a dry run first (e2fsck -n).

Regards,

James



From: lustre-discuss on behalf of Backer via lustre-discuss
Sent: Tuesday, November 7, 2023 2:19 PM
To: lustre-discuss@lists.lustre.org 
Subject: Re: [lustre-discuss] OST is not mounting

Hi,

Sending this again. Appreciate your help.



Re: [lustre-discuss] OST is not mounting

2023-11-07 Thread Backer via lustre-discuss
Hi,

Sending this again. Appreciate your help.



[lustre-discuss] OST is not mounting

2023-11-05 Thread Backer via lustre-discuss
Hi,

I am new to this email list. Looking to get some help on why an OST is not
getting mounted.


The cluster was running healthy until one OST hit an issue and Linux
remounted it read-only. After I fixed the issue and rebooted the node
several times, the OST would not mount.

The mount command errors out, stating that the index is already in use.
The index for the device is 33. No other target with this index is
mounted anywhere.

The debug log from the MGS during the mount is included at the end of
this email. It says to use writeconf. After running writeconf, the
device mounted. I have a couple of questions:

- Is writeconf the right thing to do here?
- Why did the OST end up in this state after the write failure and the
read-only remount? The write error was caused by the iSCSI target going
offline and coming back a few seconds later.

2000:0100:17.0:1698240468.758487:0:91492:0:(mgs_handler.c:496:mgs_target_reg()) updating fs1-OST0021, index=33
2000:0001:17.0:1698240468.758488:0:91492:0:(mgs_llog.c:4403:mgs_write_log_target()) Process entered
2000:0001:17.0:1698240468.758488:0:91492:0:(mgs_llog.c:671:mgs_set_index()) Process entered
2000:0001:17.0:1698240468.758488:0:91492:0:(mgs_llog.c:572:mgs_find_or_make_fsdb()) Process entered
2000:0001:17.0:1698240468.758489:0:91492:0:(mgs_llog.c:551:mgs_find_or_make_fsdb_nolock()) Process entered
2000:0001:17.0:1698240468.758489:0:91492:0:(mgs_llog.c:565:mgs_find_or_make_fsdb_nolock()) Process leaving (rc=0 : 0 : 0)
2000:0001:17.0:1698240468.758489:0:91492:0:(mgs_llog.c:578:mgs_find_or_make_fsdb()) Process leaving (rc=0 : 0 : 0)
2000:0202:17.0:1698240468.758490:0:91492:0:(mgs_llog.c:711:mgs_set_index()) 140-5: Server fs1-OST0021 requested index 33, but that index is already in use. Use --writeconf to force
2000:0001:17.0:1698240468.772355:0:91492:0:(mgs_llog.c:712:mgs_set_index()) Process leaving via out_up (rc=18446744073709551518 : -98 : 0xff9e)
2000:0001:17.0:1698240468.772356:0:91492:0:(mgs_llog.c:4408:mgs_write_log_target()) Process leaving (rc=18446744073709551518 : -98 : ff9e)
2000:0002:17.0:1698240468.772357:0:91492:0:(mgs_handler.c:503:mgs_target_reg()) Failed to write fs1-OST0021 log (-98)
2000:0001:17.0:1698240468.783747:0:91492:0:(mgs_handler.c:504:mgs_target_reg()) Process leaving via out (rc=18446744073709551518 : -98 : 0xff9e)