Dear Slurm developers, please find below a patch for a slurmctld bug that results in incorrect reservation ids after the restart or reconfiguration of slurmctld if a reservation with a specific name pattern has been created previously. This may (I have not verified that but consider it to be very likely) result in sacct reporting incorrect reservation names for jobs when the reservation id wraps around. The patch applies to the current HEAD. We originally found this problem with version 14.03.4 to which the identical patch applies. All other versions in between are most likely affected as well.
Thank you for your consideration. Best regards, Dorian From 321b6998aa7ecd7bae10e5efbaf7a2bcf309dca4 Mon Sep 17 00:00:00 2001 From: Dorian Krause <[email protected]> Date: Thu, 7 Jan 2016 13:37:15 +0100 Subject: [PATCH] slurmctld/reservation.c: Fix bug in computation of top_suffix The top_suffix variable is used to store and update the reservation id when creating new reservations. As the name suggests the top_suffix value is also used as a suffix for reservations without user-provided name. In _validate_all_reservations() the top_suffix is updated after a restart or reconfiguration based on the suffices of existing reservations. This logic does not take into account the possibility of users specifying reservation names that end with "_[0-9]+". If users create such a reservation this will result in bogus top_suffix values and thus bogus (e.g., non-unique) reservation ids. Instead of retrieving the reservation id from the name it is faster and easier to use the resv_id member of slurmctld_resv_t. Prior to the commit: > cat p_top_suffix p top_suffix quit > scontrol create reservation=test1 nodes=node1 start=now duration=1 users=root Reservation created: test1 > gdb --pid=$(pgrep slurmctld) -x ./p_top_suffix 0x00000039884c8a5d in nanosleep () from /lib64/libc.so.6 $1 = 1 A debugging session is active. Inferior 1 [process 1166] will be detached. > scontrol create reservation=test2 nodes=node1 start=now+60 duration=1 > users=root Reservation created: test2 > gdb --pid=$(pgrep slurmctld) -x ./p_top_suffix 0x00000039884c8a5d in nanosleep () from /lib64/libc.so.6 $1 = 2 A debugging session is active. Inferior 1 [process 1166] will be detached. > scontrol create reservation=test3_100 nodes=node1 start=now+120 duration=1 > users=root Reservation created: test3_100 > gdb --pid=$(pgrep slurmctld) -x ./p_top_suffix 0x00000039884c8a5d in nanosleep () from /lib64/libc.so.6 $1 = 3 A debugging session is active. Inferior 1 [process 1166] will be detached. > scontrol reconfigure > gdb --pid=$(pgrep slurmctld) -x ./p_top_suffix 0x00000039884c8a5d in nanosleep () from /lib64/libc.so.6 $1 = 100 A debugging session is active. Inferior 1 [process 1166] will be detached. With the fix: > scontrol create reservation=test1 nodes=node1 start=now duration=1 users=root Reservation created: test1 > gdb --pid=$(pgrep slurmctld) -x ./p_top_suffix 0x00000039884c8a5d in nanosleep () from /lib64/libc.so.6 $1 = 1 A debugging session is active. Inferior 1 [process 5854] will be detached. > scontrol create reservation=test2 nodes=node1 start=now+60 duration=1 > users=root Reservation created: test2 > gdb --pid=$(pgrep slurmctld) -x ./p_top_suffix 0x00000039884c8a5d in nanosleep () from /lib64/libc.so.6 $1 = 2 A debugging session is active. Inferior 1 [process 5854] will be detached. > scontrol create reservation=test3_100 nodes=node1 start=now+120 duration=1 > users=root Reservation created: test3_100 > gdb --pid=$(pgrep slurmctld) -x ./p_top_suffix 0x00000039884c8a5d in nanosleep () from /lib64/libc.so.6 $1 = 3 A debugging session is active. Inferior 1 [process 5854] will be detached. > scontrol reconfigure > gdb --pid=$(pgrep slurmctld) -x ./p_top_suffix 0x00000039884c8a5d in nanosleep () from /lib64/libc.so.6 $1 = 3 A debugging session is active. Inferior 1 [process 5854] will be detached. --- src/slurmctld/reservation.c | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/src/slurmctld/reservation.c b/src/slurmctld/reservation.c index 9d9107a..41258d2 100644 --- a/src/slurmctld/reservation.c +++ b/src/slurmctld/reservation.c @@ -3301,8 +3301,6 @@ static void _validate_all_reservations(void) ListIterator iter; slurmctld_resv_t *resv_ptr; struct job_record *job_ptr; - char *tmp; - uint32_t res_num; iter = list_iterator_create(resv_list); while ((resv_ptr = (slurmctld_resv_t *) list_next(iter))) { @@ -3314,11 +3312,7 @@ static void _validate_all_reservations(void) list_delete_item(iter); } else { _set_assoc_list(resv_ptr); - tmp = strrchr(resv_ptr->name, '_'); - if (tmp) { - res_num = atoi(tmp + 1); - top_suffix = MAX(top_suffix, res_num); - } + top_suffix = MAX(top_suffix, resv_ptr->resv_id); } } list_iterator_destroy(iter); -- 2.4.3 ------------------------------------------------------------------------------------------------ ------------------------------------------------------------------------------------------------ Forschungszentrum Juelich GmbH 52425 Juelich Sitz der Gesellschaft: Juelich Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498 Vorsitzender des Aufsichtsrats: MinDir Dr. Karl Eugen Huthmacher Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender), Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt, Prof. Dr. Sebastian M. Schmidt ------------------------------------------------------------------------------------------------ ------------------------------------------------------------------------------------------------
