2014-07-15 15:43 GMT+02:00 Markus Blank-Burian <bur...@muenster.de>:
> Hi,
>
> after job 436172 completed, the slurmctld daemon segfaulted. Starting
> slurmctld again reproduces the segfault. Debugging with gdb shows the
> following backtrace. How can I fix this without losing the complete state?
>
> Markus

check this bug: http://bugs.schedmd.com/show_bug.cgi?id=958
> slurmctld: _sync_nodes_to_comp_job: Job 436172 in completing state
> [New Thread 0x7ffff2906700 (LWP 15397)]
> [New Thread 0x7ffff2805700 (LWP 15398)]
> slurmctld: debug:  Priority MULTIFACTOR plugin loaded
> slurmctld: debug2: _adjust_limit_usage: job 436172: MPC: job_memory set to 16384
> slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
> [New Thread 0x7ffff2704700 (LWP 15399)]
> slurmctld: _sync_nodes_to_comp_job: completing 1 jobs
> slurmctld: debug:  Updating partition uid access list
> slurmctld: Recovered state of 0 reservations
> slurmctld: State of 0 triggers recovered
> [New Thread 0x7ffff2603700 (LWP 15400)]
> slurmctld: debug2: got 1 threads to send out
> slurmctld: read_slurm_conf: backup_controller not specified.
> slurmctld: cons_res: select_p_reconfigure
> slurmctld: cons_res: select_p_node_init
> slurmctld: cons_res: preparing for 3 partitions
> [New Thread 0x7ffff2502700 (LWP 15401)]
> [New Thread 0x7ffff2401700 (LWP 15402)]
> slurmctld: debug2: Tree head got back 0 looking for 1
> slurmctld: Running as primary controller
> slurmctld: Registering slurmctld at port 6817 with slurmdbd.
> slurmctld: debug2: Tree head got back 1
> [Thread 0x7ffff2401700 (LWP 15402) exited]
> slurmctld: cleanup_completing: job 436172 completion process took 2671 seconds
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7ffff2502700 (LWP 15401)]
> 0x000000000054e0a7 in gres_plugin_job_clear (job_gres_list=<optimized out>)
>     at gres.c:2945
> 2945			FREE_NULL_BITMAP(job_state_ptr->gres_bit_step_alloc[i]);
> (gdb) bt
> #0  0x000000000054e0a7 in gres_plugin_job_clear (job_gres_list=<optimized out>)
>     at gres.c:2945
> #1  0x000000000048e350 in delete_step_records (job_ptr=job_ptr@entry=0xb85b08)
>     at step_mgr.c:263
> #2  0x000000000045d7d3 in cleanup_completing (job_ptr=job_ptr@entry=0xb85b08)
>     at job_scheduler.c:3057
> #3  0x000000000046713c in make_node_idle (node_ptr=0x7ff728,
>     job_ptr=job_ptr@entry=0xb85b08) at node_mgr.c:3072
> #4  0x000000000044bac6 in job_epilog_complete (job_id=436172,
>     node_name=0x7fffe0000cc8 "kaa-23", return_code=return_code@entry=0)
>     at job_mgr.c:10265
> #5  0x0000000000436d7c in _thread_per_group_rpc (args=0x7fffe8000a28)
>     at agent.c:923
> #6  0x00007ffff7486ed3 in start_thread (arg=0x7ffff2502700)
>     at pthread_create.c:308
> #7  0x00007ffff71bbe2d in clone ()
>     at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
> (gdb) list
> 2940		if (!job_gres_ptr)
> 2941			continue;
> 2942		job_state_ptr = (gres_job_state_t *) job_gres_ptr->gres_data;
> 2943		for (i = 0; i < job_state_ptr->node_cnt; i++) {
> 2944			FREE_NULL_BITMAP(job_state_ptr->gres_bit_alloc[i]);
> 2945			FREE_NULL_BITMAP(job_state_ptr->gres_bit_step_alloc[i]);
> 2946		}
> 2947		xfree(job_state_ptr->gres_bit_alloc);
> 2948		xfree(job_state_ptr->gres_bit_step_alloc);
> 2949		xfree(job_state_ptr->gres_cnt_step_alloc);
> (gdb)