Re: [lustre-discuss] OST is not mounting
So, did you do the "writeconf"? And did the OST mount afterwards?

As I understand it, the MGS was under the impression that this re-mounting OST was actually a new one using an old index. So, what made your repaired OST look new or different?

I would probably have mounted it locally, as an ext4 file system, if only to check that the data is still present (ok, "df" would do that, too). "tunefs.lustre --dryrun" will show other quantum numbers that _should not_ change when taking down and remounting an OST.

And since "writeconf" has to be done on all targets, you have to take down your MDS anyhow, so nothing is lost by simply trying an MDS restart?

Regards
Thomas

On 11/5/23 17:11, Backer via lustre-discuss wrote:

Hi,

I am new to this email list. Looking to get some help on why an OST is not getting mounted.

The cluster was running healthy when the OST experienced an issue and Linux re-mounted the OST read-only. After fixing the issue and rebooting the node multiple times, it wouldn't mount.

When the mount is attempted, the mount command errors out stating that the index is already in use. The index for the device is 33. There is no place where this index is mounted.

The debug message from the MGS during the mount is attached at the end of this email. It is asking to use writeconf. After using writeconf, the device was mounted. Looking for a couple of things here.

- I am hoping that the writeconf method is the right thing to do here.
- Why did the OST get into this state after the write failure and being mounted RO? The write error was due to the iSCSI target going offline and coming back a few seconds later.

2000:0100:17.0:1698240468.758487:0:91492:0:(mgs_handler.c:496:mgs_target_reg()) updating fs1-OST0021, index=33
2000:0001:17.0:1698240468.758488:0:91492:0:(mgs_llog.c:4403:mgs_write_log_target()) Process entered
2000:0001:17.0:1698240468.758488:0:91492:0:(mgs_llog.c:671:mgs_set_index()) Process entered
2000:0001:17.0:1698240468.758488:0:91492:0:(mgs_llog.c:572:mgs_find_or_make_fsdb()) Process entered
2000:0001:17.0:1698240468.758489:0:91492:0:(mgs_llog.c:551:mgs_find_or_make_fsdb_nolock()) Process entered
2000:0001:17.0:1698240468.758489:0:91492:0:(mgs_llog.c:565:mgs_find_or_make_fsdb_nolock()) Process leaving (rc=0 : 0 : 0)
2000:0001:17.0:1698240468.758489:0:91492:0:(mgs_llog.c:578:mgs_find_or_make_fsdb()) Process leaving (rc=0 : 0 : 0)
2000:0202:17.0:1698240468.758490:0:91492:0:(mgs_llog.c:711:mgs_set_index()) 140-5: Server fs1-OST0021 requested index 33, but that index is already in use. Use --writeconf to force
2000:0001:17.0:1698240468.772355:0:91492:0:(mgs_llog.c:712:mgs_set_index()) Process leaving via out_up (rc=18446744073709551518 : -98 : 0xff9e)
2000:0001:17.0:1698240468.772356:0:91492:0:(mgs_llog.c:4408:mgs_write_log_target()) Process leaving (rc=18446744073709551518 : -98 : ff9e)
2000:0002:17.0:1698240468.772357:0:91492:0:(mgs_handler.c:503:mgs_target_reg()) Failed to write fs1-OST0021 log (-98)
2000:0001:17.0:1698240468.783747:0:91492:0:(mgs_handler.c:504:mgs_target_reg()) Process leaving via out (rc=18446744073709551518 : -98 : 0xff9e)

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
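
A note on the "rc=18446744073709551518 : -98" values in the MGS log: both numbers are the same return code, printed once as an unsigned 64-bit value and once as a signed one. The following is a minimal sketch of that conversion, not Lustre code; the helper name `decode_rc` is made up for illustration. Kernel code conventionally returns negative errno values, and on Linux errno 98 is EADDRINUSE, which matches the "index is already in use" message.

```python
import errno


def decode_rc(raw: int, bits: int = 64) -> int:
    """Reinterpret an unsigned integer as a two's-complement signed value."""
    if raw >= 1 << (bits - 1):
        raw -= 1 << bits
    return raw


rc = decode_rc(18446744073709551518)
print(rc)  # -98

# Look up the symbolic errno name for the positive value.
# The mapping is platform dependent; on Linux, 98 is EADDRINUSE.
print(errno.errorcode.get(-rc, "unknown"))
```

So the mount failure is the MGS refusing the registration with "address (index) already in use", not an I/O error from the OST device itself.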
Re: [lustre-discuss] OST is not mounting
Thanks for the explanation. There was a problem with the iSCSI target. It is already multipath. Anyhow, I was expecting things to come back online after the problem was resolved. This more or less created a data-loss situation, and I thought Lustre was resilient enough not to lose the whole OST. Here the OST became completely unmountable.

On Tue, 7 Nov 2023 at 13:56, Andreas Dilger wrote:

> The OST went read-only because that is what happens when the block device
> disappears underneath it. That is a behavior of ext4 and other local
> filesystems as well.
>
> If you look in the console logs you would see SCSI errors and the
> filesystem being remounted read-only.
>
> To have reliability in the face of such storage issues you need to use
> dm-multipath.
>
> Cheers, Andreas
>
> > On Nov 5, 2023, at 09:13, Backer via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:
> >
> > - Why did the OST get into this state after the write failure and being
> > mounted RO? The write error was due to the iSCSI target going offline and
> > coming back a few seconds later.
Re: [lustre-discuss] OST is not mounting
The OST went read-only because that is what happens when the block device disappears underneath it. That is a behavior of ext4 and other local filesystems as well.

If you look in the console logs you would see SCSI errors and the filesystem being remounted read-only.

To have reliability in the face of such storage issues you need to use dm-multipath.

Cheers, Andreas

> On Nov 5, 2023, at 09:13, Backer via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:
>
> - Why did OST become in this state after the write failure and was mounted
> RO. The write error was due to iSCSI target going offline and coming back
> after a few seconds later.
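
Besides the console logs, a read-only remount can also be seen in the kernel's mount table. The following is a small sketch, assuming the text of /proc/mounts is passed in as a string. The function name, the file system type list, and the sample devices and mount points are all made up for illustration.

```python
def read_only_mounts(mounts_text, fstypes=("lustre", "ldiskfs", "ext4")):
    """Return (device, mountpoint) pairs whose mount options include 'ro'."""
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, options = fields[:4]
        if fstype in fstypes and "ro" in options.split(","):
            hits.append((device, mountpoint))
    return hits


# Hypothetical sample in /proc/mounts layout: device, mountpoint, type, options.
SAMPLE = """\
/dev/mapper/mpatha /mnt/ost33 lustre ro 0 0
/dev/sda1 / ext4 rw,relatime 0 0
"""

print(read_only_mounts(SAMPLE))
```

On a live server the input would come from reading /proc/mounts; an OST that shows up here was remounted read-only by the kernel after the I/O errors.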
Re: [lustre-discuss] OST is not mounting
If possible, do a hexdump of the affected OST to check it for problems:

https://groups.google.com/g/lustre-discuss-list/c/3cmmcKAB34w

If the OST is on ldiskfs, run e2fsck, the lowest-level ldiskfs check, to see whether there are any problems. Remember: dry run first.

Regards,
James

From: lustre-discuss on behalf of Backer via lustre-discuss
Sent: Tuesday, November 7, 2023 2:19 PM
To: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] OST is not mounting

Hi,

Sending this again. Appreciate your help.

On Sun, 5 Nov 2023 at 11:11, Backer <backer.k...@gmail.com> wrote:

Hi,

I am new to this email list. Looking to get some help on why an OST is not getting mounted.

The cluster was running healthy when the OST experienced an issue and Linux re-mounted the OST read-only. After fixing the issue and rebooting the node multiple times, it wouldn't mount.

When the mount is attempted, the mount command errors out stating that the index is already in use. The index for the device is 33. There is no place where this index is mounted.

The debug message from the MGS during the mount is attached at the end of this email. It is asking to use writeconf. After using writeconf, the device was mounted. Looking for a couple of things here.

- I am hoping that the writeconf method is the right thing to do here.
- Why did the OST get into this state after the write failure and being mounted RO? The write error was due to the iSCSI target going offline and coming back a few seconds later.
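
The "dry run first" advice, together with the "tunefs.lustre --dryrun" suggestion elsewhere in this thread, could be scripted roughly as follows. This is only a sketch that builds and prints the commands without running them. The device path is hypothetical, and it assumes the OST is unmounted and that the Lustre-patched e2fsprogs is installed. "e2fsck -n" opens the file system read-only and answers "no" to every question, "-f" forces a check even if the file system looks clean, and "tunefs.lustre --dryrun" only prints what would be done.

```python
import shlex


def preflight_commands(device):
    """Build read-only inspection commands for an unmounted ldiskfs target."""
    return [
        # Forced check, read-only, answer "no" to all repair prompts.
        ["e2fsck", "-f", "-n", device],
        # Print the target's stored parameters without changing the disk.
        ["tunefs.lustre", "--dryrun", device],
    ]


# Hypothetical device path, for illustration only.
for cmd in preflight_commands("/dev/mapper/ost33"):
    print(shlex.join(cmd))
```

Comparing the "tunefs.lustre --dryrun" output (target name, index, flags) before and after an incident is one way to see whether the OST now looks "new" to the MGS.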
Re: [lustre-discuss] OST is not mounting
Hi,

Sending this again. Appreciate your help.
[lustre-discuss] OST is not mounting
Hi,

I am new to this email list. Looking to get some help on why an OST is not getting mounted.

The cluster was running healthy when the OST experienced an issue and Linux re-mounted the OST read-only. After fixing the issue and rebooting the node multiple times, it wouldn't mount.

When the mount is attempted, the mount command errors out stating that the index is already in use. The index for the device is 33. There is no place where this index is mounted.

The debug message from the MGS during the mount is attached at the end of this email. It is asking to use writeconf. After using writeconf, the device was mounted. Looking for a couple of things here.

- I am hoping that the writeconf method is the right thing to do here.
- Why did the OST get into this state after the write failure and being mounted RO? The write error was due to the iSCSI target going offline and coming back a few seconds later.

2000:0100:17.0:1698240468.758487:0:91492:0:(mgs_handler.c:496:mgs_target_reg()) updating fs1-OST0021, index=33
2000:0001:17.0:1698240468.758488:0:91492:0:(mgs_llog.c:4403:mgs_write_log_target()) Process entered
2000:0001:17.0:1698240468.758488:0:91492:0:(mgs_llog.c:671:mgs_set_index()) Process entered
2000:0001:17.0:1698240468.758488:0:91492:0:(mgs_llog.c:572:mgs_find_or_make_fsdb()) Process entered
2000:0001:17.0:1698240468.758489:0:91492:0:(mgs_llog.c:551:mgs_find_or_make_fsdb_nolock()) Process entered
2000:0001:17.0:1698240468.758489:0:91492:0:(mgs_llog.c:565:mgs_find_or_make_fsdb_nolock()) Process leaving (rc=0 : 0 : 0)
2000:0001:17.0:1698240468.758489:0:91492:0:(mgs_llog.c:578:mgs_find_or_make_fsdb()) Process leaving (rc=0 : 0 : 0)
2000:0202:17.0:1698240468.758490:0:91492:0:(mgs_llog.c:711:mgs_set_index()) 140-5: Server fs1-OST0021 requested index 33, but that index is already in use. Use --writeconf to force
2000:0001:17.0:1698240468.772355:0:91492:0:(mgs_llog.c:712:mgs_set_index()) Process leaving via out_up (rc=18446744073709551518 : -98 : 0xff9e)
2000:0001:17.0:1698240468.772356:0:91492:0:(mgs_llog.c:4408:mgs_write_log_target()) Process leaving (rc=18446744073709551518 : -98 : ff9e)
2000:0002:17.0:1698240468.772357:0:91492:0:(mgs_handler.c:503:mgs_target_reg()) Failed to write fs1-OST0021 log (-98)
2000:0001:17.0:1698240468.783747:0:91492:0:(mgs_handler.c:504:mgs_target_reg()) Process leaving via out (rc=18446744073709551518 : -98 : 0xff9e)
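
One detail in the log that is easy to misread: the target name fs1-OST0021 carries the index in hexadecimal, so 0x21 is the same as the requested decimal index 33. The sketch below pulls both numbers out of the "140-5" message and compares them. The regular expression and the function name `check_index` are illustrative assumptions, not part of any Lustre tool.

```python
import re

# Matches the MGS "index already in use" message quoted in this thread.
ERROR_RE = re.compile(
    r"Server (?P<target>\S+-(?:OST|MDT)(?P<hex>[0-9a-fA-F]{4})) "
    r"requested index (?P<index>\d+), but that index is already in use"
)


def check_index(line):
    """Return (target, index_from_name, requested_index, match) or None."""
    m = ERROR_RE.search(line)
    if m is None:
        return None
    name_index = int(m.group("hex"), 16)  # the suffix of the target name is hex
    requested = int(m.group("index"))     # the message prints decimal
    return m.group("target"), name_index, requested, name_index == requested


LINE = (
    "2000:0202:17.0:1698240468.758490:0:91492:0:(mgs_llog.c:711:mgs_set_index()) "
    "140-5: Server fs1-OST0021 requested index 33, but that index is already in use. "
    "Use --writeconf to force"
)

print(check_index(LINE))  # ('fs1-OST0021', 33, 33, True)
```

The name and the index agree here, so the OST is asking for its own slot; the MGS rejects it because that slot is already marked as used in its configuration.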