Hello everybody.

I am dealing with an issue on a relatively new Lustre installation. The Metadata Server (MDS) hangs randomly: anywhere from 30 minutes to 30 days after startup, but it always ends up hanging, and I have not found a consistent pattern. The logs show nothing unusual at the time of the failure. The only thing I see repeatedly are messages like these:

[Mon Jan 20 14:17:10 2025] LustreError: 7068:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with -22, flags:0x4  qsd:LUSTRE-OST138f qtype:prj id:2325 enforced:1 granted: 16304159618662232032 pending:0 waiting:0 req:1 usage: 114636 qunit:262144 qtune:65536 edquot:0 default:yes
[Mon Jan 20 14:17:10 2025] LustreError: 7068:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 39 previous similar messages
[Mon Jan 20 14:21:52 2025] LustreError: 1895328:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much! uuid:LUSTRE-MDT0000-lwp-OST0c1f_UUID release: 15476132855418716160 granted:262144, total:14257500 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2582 enforced:1 hard:62914560 soft:52428800 granted:14257500 time:0 qunit: 262144 edquot:0 may_rel:0 revoke:0 default:yes
[Mon Jan 20 14:21:52 2025] LustreError: 1947381:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much! uuid:LUSTRE-MDT0000-lwp-OST0fb2_UUID release: 13809297465413342331 granted:66568, total:14179564 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2325 enforced:1 hard:62914560 soft:52428800 granted:14179564 time:0 qunit: 262144 edquot:0 may_rel:0 revoke:0 default:yes
[Mon Jan 20 14:21:52 2025] LustreError: 1947381:0:(qmt_handler.c:798:qmt_dqacq0()) Skipped 802 previous similar messages
[Mon Jan 20 14:21:52 2025] LustreError: 1895328:0:(qmt_handler.c:798:qmt_dqacq0()) Skipped 802 previous similar messages
[Mon Jan 20 14:27:24 2025] LustreError: 7047:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with -22, flags:0x4  qsd:LUSTRE-OST138f qtype:prj id:2325 enforced:1 granted: 16304159618662232032 pending:0 waiting:0 req:1 usage: 114636 qunit:262144 qtune:65536 edquot:0 default:yes
[Mon Jan 20 14:27:24 2025] LustreError: 7047:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 39 previous similar messages
[Mon Jan 20 14:31:52 2025] LustreError: 1844354:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much! uuid:LUSTRE-MDT0000-lwp-OST1399_UUID release: 12882711387029922688 granted:66116, total:14078012 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2586 enforced:1 hard:62914560 soft:52428800 granted:14078012 time:0 qunit: 262144 edquot:0 may_rel:0 revoke:0 default:yes
[Mon Jan 20 14:31:52 2025] LustreError: 1844354:0:(qmt_handler.c:798:qmt_dqacq0()) Skipped 785 previous similar messages
[Mon Jan 20 14:37:39 2025] LustreError: 7054:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with -22, flags:0x4  qsd:LUSTRE-OST138f qtype:prj id:2325 enforced:1 granted: 16304159618662232032 pending:0 waiting:0 req:1 usage: 114636 qunit:262144 qtune:65536 edquot:0 default:yes
[Mon Jan 20 14:37:39 2025] LustreError: 7054:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 39 previous similar messages
[Mon Jan 20 14:41:54 2025] LustreError: 1895328:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much! uuid:LUSTRE-MDT0000-lwp-OST0faa_UUID release: 13811459193234480169 granted:65632, total:14179564 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2325 enforced:1 hard:62914560 soft:52428800 granted:14179564 time:0 qunit: 262144 edquot:0 may_rel:0 revoke:0 default:yes
[Mon Jan 20 14:41:54 2025] LustreError: 1895328:0:(qmt_handler.c:798:qmt_dqacq0()) Skipped 798 previous similar messages
[Mon Jan 20 14:47:53 2025] LustreError: 7052:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with -22, flags:0x4  qsd:LUSTRE-OST138f qtype:prj id:2325 enforced:1 granted: 16304159618662232032 pending:0 waiting:0 req:1 usage: 114636 qunit:262144 qtune:65536 edquot:0 default:yes
[Mon Jan 20 14:47:53 2025] LustreError: 7052:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 39 previous similar messages
I have ruled out hardware failure since the MDS service has been moved between different servers, and it happens with all of them.

Linux distribution: AlmaLinux release 8.10 (Cerulean Leopard)
Kernel: Linux srv-lustre15 4.18.0-553.5.1.el8_lustre.x86_64 #1 SMP Fri Jun 28 18:44:24 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Lustre release: lustre-2.15.5-1.el8.x86_64
Not using ZFS.

Any ideas on where to continue investigating?
Do the errors appearing in dmesg indicate a bug, or corruption in the quota database?
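For what it is worth, the huge "granted" and "release" values in the messages (e.g. 16304159618662232032, roughly 1.6e19) sit close to 2^64, which looks more like a wrapped/garbage counter in the per-OST quota records than a real grant. A small, self-contained sketch (the sample lines are abbreviated copies of the dmesg output, and the 2^48 threshold is just my own heuristic) that pulls those values out and flags the implausible ones:

```python
import re

U64_MAX = 2**64 - 1  # values near this are almost certainly wrapped counters

def suspicious_grants(dmesg_text, threshold=2**48):
    """Yield (field, value) pairs whose value is implausibly large.

    Matches the 'granted:' and 'release:' fields of Lustre quota
    error lines; anything above `threshold` (heuristic) is flagged.
    """
    pattern = re.compile(r"(granted|release):\s*(\d+)")
    for field, raw in pattern.findall(dmesg_text):
        value = int(raw)
        if value > threshold:
            yield field, value

# Abbreviated sample lines taken from the dmesg output above.
sample = (
    "LustreError: qsd_req_completion() DQACQ failed with -22 "
    "granted: 16304159618662232032 pending:0\n"
    "LustreError: qmt_dqacq0() Release too much! "
    "release: 15476132855418716160 granted:262144, total:14257500\n"
)

for field, value in suspicious_grants(sample):
    print(f"{field} = {value}  ({value / U64_MAX:.0%} of 2^64-1)")
```

Every one of the bogus values in the log is in the 70-90% range of 2^64-1, while the legitimate grants (262144, 14257500, ...) are many orders of magnitude smaller, which is why I suspect the slave-side accounting rather than the limits themselves.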

The known quota bugs that might be related all appear to have been fixed by release 2.15.


Thanks in advance.

--

Jose Manuel Martínez García

Coordinador de Sistemas

Supercomputación de Castilla y León

Tel: 987 293 174

Edificio CRAI-TIC, Campus de Vegazana, s/n Universidad de León - 24071 León, España

<https://www.scayle.es/>

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
