Re: [lustre-discuss] Mistake while removing an OST
lctl del_ost was added in the upcoming Lustre 2.16 to remove a specific OST while the filesystem is online. It effectively runs the llog_cancel commands for the specified OST on your behalf, which reduces the risk of user error (like what happened here). We have used it several times in production.

lctl del_ost is implemented in user space, so it is easy to backport to 2.15 or even 2.12 if you are familiar with building Lustre. Do it yourself, or ask your favorite storage vendor if it hasn't been done already. :)

In any case, do not panic: as Andreas said, you can always recover from this mistake by doing a full writeconf as documented, which regenerates all the config files.

Best,
--
Stéphane Thiell
Stanford University
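For reference, a rough sketch of how del_ost might be invoked once available; the option name shown here is an assumption and should be verified against the lctl documentation for your build:

```shell
# Sketch only: remove OST0003 from filesystem "lustre" while online.
# The --target option name is an assumption based on the 2.16 feature;
# verify the exact syntax with "lctl del_ost --help" on your build.
lctl del_ost --target lustre-OST0003

# Afterwards, check that no records for OST0003 remain in the client log:
lctl --device MGS llog_print lustre-client | grep OST0003
```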
Re: [lustre-discuss] Mistake while removing an OST
You should follow the documented process; that is why it is documented. All targets need to be unmounted for it to work properly.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud
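The documented process Andreas refers to can be sketched roughly as follows; device paths and mount points are placeholders, and the authoritative steps are in the regenerateConfigLogs section of the Lustre manual:

```shell
# Sketch of the full writeconf procedure (placeholder device/mount names).
# 1. Unmount everything: clients first, then the MDT(s), then the OSTs.
umount /mnt/lustre                 # on every client
umount /mnt/mdt                    # on the MDS
umount /mnt/ost                    # on every OSS

# 2. Regenerate the configuration logs on every target.
tunefs.lustre --writeconf /dev/mdt_device    # on the MDS
tunefs.lustre --writeconf /dev/ost_device    # on every OSS

# 3. Remount in order: MGS/MDT first, then OSTs, then clients.
mount -t lustre /dev/mdt_device /mnt/mdt
mount -t lustre /dev/ost_device /mnt/ost
mount -t lustre mds@tcp:/lustre /mnt/lustre  # on clients
```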
Re: [lustre-discuss] Mistake while removing an OST
Hi Andreas,

Thank you for answering. Can I just run 'tunefs.lustre --writeconf /dev/ost_device' on the affected server (serving OST0002 in my case) on a live filesystem? That is, unmount OST0002, issue the writeconf command, and mount OST0002 again? Or should I follow the procedure described in https://doc.lustre.org/lustre_manual.xhtml#lustremaint.regenerateConfigLogs and take the whole filesystem offline and run the writeconf command on all servers?

To be clear, my situation is this: I had a server with OST0003 that needed to be removed. That worked, but in the process I deleted the add_uuid and attach indexes for OST0002. OST0002 is the one I need to keep.

Regards,
Martin
Re: [lustre-discuss] Mistake while removing an OST
You should just be able to run the "writeconf" process to regenerate the config logs. The removed OST will not re-register with the MGS, but all of the other servers will, so it should be fine.

Cheers, Andreas
[lustre-discuss] Mistake while removing an OST
Hi,

I have a defective OSS with a single OST that I was trying to remove from the Lustre filesystem completely (2.15.1). I was following https://doc.lustre.org/lustre_manual.xhtml#lustremaint.remove_ost

I had drained the OST and was using the llog_cancel commands to remove the config. This is where it went wrong. I first deleted the attach, setup, and add_osc indexes for the 'client', and then needed to also delete those for the MDT and MDT0001, but I accidentally removed two more indexes from the 'client'. Now I have incomplete client llogs for one OST: I am missing the add_uuid and attach lines for OST0002.

[root@mds ~]# lctl --device MGS llog_print lustre-client
- { index: 34, event: add_uuid, nid: 192.168.2.3@tcp(0x2c0a80203), node: 192.168.2.3@tcp }
- { index: 35, event: attach, device: lustre-OST0001-osc, type: osc, UUID: lustre-clilov_UUID }
- { index: 36, event: setup, device: lustre-OST0001-osc, UUID: lustre-OST0001_UUID, node: 192.168.2.3@tcp }
- { index: 37, event: add_osc, device: lustre-clilov, ost: lustre-OST0001_UUID, index: 1, gen: 1 }
- { index: 42, event: setup, device: lustre-OST0002-osc, UUID: lustre-OST0002_UUID, node: 192.168.2.4@tcp }
- { index: 43, event: add_osc, device: lustre-clilov, ost: lustre-OST0002_UUID, index: 2, gen: 1 }

Is there a way to recover from this? I hope someone can help.

Regards,
Martin Balvers
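For context, the removal steps in the manual use llog_print to list config records with their index numbers and llog_cancel to delete them one at a time, which is exactly where a mistyped index can delete the wrong record. A rough sketch, with an example index value that would need to be read from the llog_print output first:

```shell
# Sketch: inspect the client config log, then cancel records by index.
# The index value below is an example only; take real values from the
# llog_print output for the OST actually being removed.
lctl --device MGS llog_print lustre-client

# Cancel the add_uuid/attach/setup/add_osc records of the OST being
# removed, one llog_cancel per record index:
lctl --device MGS llog_cancel lustre-client --log_idx=38
```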