Re: [lustre-discuss] Mistake while removing an OST

2023-02-02 Thread Stephane Thiell via lustre-discuss
lctl del_ost was added in the upcoming Lustre 2.16 to remove a specific OST
while the filesystem is online. It actually runs the llog_cancel commands for
the specified OST on your behalf, which reduces the risk of user errors (like
what happened here). We have used it several times in production. lctl del_ost
is user-space code, so it is easy to backport to 2.15 or even 2.12 if you are
familiar with building Lustre. Do it yourself, or ask your favorite storage
vendor if it's not already done. :)
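For reference, the whole operation collapses to a single command run on the MGS node. This is a sketch based on the 2.16 tool; check the exact option names against `lctl del_ost --help` on your build, and the target name here is an example:

```shell
# Cancel all configuration llog records for one OST while the
# filesystem stays online. Run on the node hosting the MGS.
lctl del_ost --target lustre-OST0003
```

Because the command computes the llog indexes itself, there is no opportunity to cancel the wrong record by hand.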
In any case, do not panic. As Andreas said, you can always recover from this
mistake by doing a full writeconf as documented (which will regenerate all the
config files).
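The documented writeconf recovery boils down to the sequence below (a sketch; the mount points and device paths are placeholders for your own, and the Lustre manual's regenerateConfigLogs section is the authoritative ordering):

```shell
# 1. Unmount everything: all clients first, then MDT and OSTs.
umount /mnt/lustre        # on every client
umount /mnt/mdt           # on the MDS
umount /mnt/ost0          # on every OSS, for each OST

# 2. Regenerate config logs on every target, MGS/MDT first.
tunefs.lustre --writeconf /dev/mdtdev    # on the MDS
tunefs.lustre --writeconf /dev/ostdev    # on every OSS, for each OST

# 3. Remount in order: MGS/MDT, then OSTs, then clients.
mount -t lustre /dev/mdtdev /mnt/mdt
mount -t lustre /dev/ostdev /mnt/ost0
```

On remount, each target re-registers with the MGS and the llogs are rebuilt from scratch, which is why a previously removed OST simply stays gone.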

Best,
--
Stéphane Thiell
Stanford University


Re: [lustre-discuss] Mistake while removing an OST

2023-02-02 Thread Andreas Dilger via lustre-discuss
You should follow the documented process; that's why it is documented. All
targets need to be unmounted for it to work properly.


Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud








Re: [lustre-discuss] Mistake while removing an OST

2023-02-02 Thread BALVERS Martin via lustre-discuss
Hi Andreas,

Thank you for answering.

Can I just run the ‘tunefs.lustre --writeconf /dev/ost_device’ on the affected 
server (serving OST0002 in my case) on a live filesystem? So unmount OST0002, 
issue writeconf command and mount OST0002 again?
Or should I follow the procedure as described in 
https://doc.lustre.org/lustre_manual.xhtml#lustremaint.regenerateConfigLogs, 
and take the whole filesystem offline and run the writeconf command on all 
servers?

To be clear, my situation is this.
I had a server with OST0003 that needed to be removed. That worked, but in the 
process I deleted the add_uuid and attach indexes for OST0002. OST0002 is the 
one I need to keep.

Regards,
Martin


This e-mail and any files transmitted with it are confidential and intended 
solely for the use of the individual to whom it is addressed. If you have 
received this email in error please send it back to the person that sent it to 
you. Any views or opinions presented are solely those of its author and do not 
necessarily represent those of DANONE or any of its subsidiary companies. 
Unauthorized publication, use, dissemination, forwarding, printing or copying 
of this email and its associated attachments is strictly prohibited.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Mistake while removing an OST

2023-02-01 Thread Andreas Dilger via lustre-discuss
You should just be able to run the "writeconf" process to regenerate the config 
logs. The removed OST will not re-register with the MGS, but all of the other 
servers will, so it should be fine.

Cheers, Andreas

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Mistake while removing an OST

2023-02-01 Thread BALVERS Martin via lustre-discuss
Hi,

I have a defective OSS with a single OST that I was trying to remove from the
Lustre filesystem completely (2.15.1). I was following
https://doc.lustre.org/lustre_manual.xhtml#lustremaint.remove_ost
I had drained the OST and was using the llog_cancel commands to remove the
config. This is where it went wrong. I first deleted the attach, setup, and
add_osc indexes for the 'client' and then needed to also delete those for MDT
and MDT0001, but I accidentally removed two more indexes from the 'client'.
Now I have incomplete client llogs for one OST: I am missing the add_uuid and
attach lines for OST0002.

[root@mds ~]# lctl --device MGS llog_print lustre-client
- { index: 34, event: add_uuid, nid: 192.168.2.3@tcp(0x2c0a80203), node: 
192.168.2.3@tcp }
- { index: 35, event: attach, device: lustre-OST0001-osc, type: osc, UUID: 
lustre-clilov_UUID }
- { index: 36, event: setup, device: lustre-OST0001-osc, UUID: 
lustre-OST0001_UUID, node: 192.168.2.3@tcp }
- { index: 37, event: add_osc, device: lustre-clilov, ost: lustre-OST0001_UUID, 
index: 1, gen: 1 }

- { index: 42, event: setup, device: lustre-OST0002-osc, UUID: 
lustre-OST0002_UUID, node: 192.168.2.4@tcp }
- { index: 43, event: add_osc, device: lustre-clilov, ost: lustre-OST0002_UUID, 
index: 2, gen: 1 }
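For anyone following the same manual removal procedure, records like those above are cancelled one at a time by index (a sketch; run on the MGS, with an illustrative index, and re-run llog_print to verify before each cancel, which is exactly the step where a slip like this one happens):

```shell
# Inspect the client config llog, then cancel one record by index.
lctl --device MGS llog_print lustre-client
lctl --device MGS llog_cancel lustre-client --log_idx 42
```

The same pair of commands is repeated against the per-MDT llogs (e.g. lustre-MDT0000) for each record belonging to the OST being removed.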

Is there a way to recover from this?

I hope someone can help.

Regards,
Martin Balvers

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org