2014-07-15 15:43 GMT+02:00 Markus Blank-Burian <bur...@muenster.de>:

> Hi,
>
> after job 436172 completed, the slurmctld daemon segfaulted. Starting
> slurmctld again reproduces the segfault. Debugging with gdb shows the
> following backtrace. How can I fix this without losing the entire state?
>
> Markus
>
>
Check this bug:
 http://bugs.schedmd.com/show_bug.cgi?id=958

>
> slurmctld: _sync_nodes_to_comp_job: Job 436172 in completing state
> [New Thread 0x7ffff2906700 (LWP 15397)]
> [New Thread 0x7ffff2805700 (LWP 15398)]
> slurmctld: debug:  Priority MULTIFACTOR plugin loaded
> slurmctld: debug2: _adjust_limit_usage: job 436172: MPC: job_memory set to 16384
> slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
> [New Thread 0x7ffff2704700 (LWP 15399)]
> slurmctld: _sync_nodes_to_comp_job: completing 1 jobs
> slurmctld: debug:  Updating partition uid access list
> slurmctld: Recovered state of 0 reservations
> slurmctld: State of 0 triggers recovered
> [New Thread 0x7ffff2603700 (LWP 15400)]
> slurmctld: debug2: got 1 threads to send out
> slurmctld: read_slurm_conf: backup_controller not specified.
> slurmctld: cons_res: select_p_reconfigure
> slurmctld: cons_res: select_p_node_init
> slurmctld: cons_res: preparing for 3 partitions
> [New Thread 0x7ffff2502700 (LWP 15401)]
> [New Thread 0x7ffff2401700 (LWP 15402)]
> slurmctld: debug2: Tree head got back 0 looking for 1
> slurmctld: Running as primary controller
> slurmctld: Registering slurmctld at port 6817 with slurmdbd.
> slurmctld: debug2: Tree head got back 1
> [Thread 0x7ffff2401700 (LWP 15402) exited]
> slurmctld: cleanup_completing: job 436172 completion process took 2671 seconds
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7ffff2502700 (LWP 15401)]
> 0x000000000054e0a7 in gres_plugin_job_clear (job_gres_list=<optimized out>) at gres.c:2945
> 2945                            FREE_NULL_BITMAP(job_state_ptr->gres_bit_step_alloc[i]);
> (gdb) bt
> #0  0x000000000054e0a7 in gres_plugin_job_clear (job_gres_list=<optimized out>) at gres.c:2945
> #1  0x000000000048e350 in delete_step_records (job_ptr=job_ptr@entry=0xb85b08) at step_mgr.c:263
> #2  0x000000000045d7d3 in cleanup_completing (job_ptr=job_ptr@entry=0xb85b08) at job_scheduler.c:3057
> #3  0x000000000046713c in make_node_idle (node_ptr=0x7ff728, job_ptr=job_ptr@entry=0xb85b08) at node_mgr.c:3072
> #4  0x000000000044bac6 in job_epilog_complete (job_id=436172, node_name=0x7fffe0000cc8 "kaa-23", return_code=return_code@entry=0) at job_mgr.c:10265
> #5  0x0000000000436d7c in _thread_per_group_rpc (args=0x7fffe8000a28) at agent.c:923
> #6  0x00007ffff7486ed3 in start_thread (arg=0x7ffff2502700) at pthread_create.c:308
> #7  0x00007ffff71bbe2d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
> (gdb) list
> 2940                    if (!job_gres_ptr)
> 2941                            continue;
> 2942                    job_state_ptr = (gres_job_state_t *) job_gres_ptr->gres_data;
> 2943                    for (i = 0; i < job_state_ptr->node_cnt; i++) {
> 2944                            FREE_NULL_BITMAP(job_state_ptr->gres_bit_alloc[i]);
> 2945                            FREE_NULL_BITMAP(job_state_ptr->gres_bit_step_alloc[i]);
> 2946                    }
> 2947                    xfree(job_state_ptr->gres_bit_alloc);
> 2948                    xfree(job_state_ptr->gres_bit_step_alloc);
> 2949                    xfree(job_state_ptr->gres_cnt_step_alloc);
> (gdb)
