Carles,

A big THANK YOU for your work tracking down this problem. This bug can
affect any job submitted to more than one partition. As you discovered,
the job's partition pointer must not be changed until after the test for
whether the job was already started in a different partition within that
same scheduling cycle. Your fix will be in the next release. Anyone who
wants a patch now can get it here:

https://github.com/SchedMD/slurm/commit/dd1d573f844727f43fcd2ad4d95226bf23452b03
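
In rough terms, the fix moves the IS_JOB_PENDING test ahead of the
partition pointer assignment, so a job that already started in another
partition within the same scheduling cycle is skipped before its
part_ptr is touched. A simplified sketch of the reordered block (the
linked commit is the authoritative change; freeing job_queue_rec before
the test avoids the leak Carles mentions below):

        job_ptr  = job_queue_rec->job_ptr;
        part_ptr = job_queue_rec->part_ptr;
        xfree(job_queue_rec);
        if (!IS_JOB_PENDING(job_ptr))
                continue;  /* started in other partition */
        job_ptr->part_ptr = part_ptr;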

Thanks again!
Moe Jette

Quoting Carles Fenoy <mini...@gmail.com>:

> Hi all,
>
> After a tough day, I've finally found the problem and a solution for 2.4.1.
> I was able to reproduce the described behavior by submitting jobs to 2
> partitions.
> The job gets allocated in one partition, but in the schedule function the
> job's partition is then changed to the NON-allocated one. Because of this,
> the resources cannot be freed at the end of the job.
>
> I've solved this by moving the IS_JOB_PENDING test a few lines up in the
> schedule function (in job_scheduler.c).
>
> This is the code from git HEAD (line 801). As this file has changed a lot
> since 2.4.x, I have not made a patch, but I'm describing the solution here.
> I've moved the if (!IS_JOB_PENDING) check to right after the 2nd line
> (part_ptr = ...). This prevents the job's partition from being changed if
> the job has already started in another partition.
>
>                       job_ptr  = job_queue_rec->job_ptr;
>                       part_ptr = job_queue_rec->part_ptr;
>                       job_ptr->part_ptr = part_ptr;
>                       xfree(job_queue_rec);
>                       if (!IS_JOB_PENDING(job_ptr))
>                               continue;  /* started in other partition */
>
>
> Hope this is enough information to solve it.
>
> I've just realized (while writing this mail) that my solution has a memory
> leak as job_queue_rec is not freed.
>
> Regards,
> Carles Fenoy
>
>
> On Thu, Jul 5, 2012 at 1:22 AM, Moe Jette <je...@schedmd.com> wrote:
>
>>
>> "CPUAlloc" comes from he "alloc_cpus" field on the node data structure
>> and that represents the number of bits set in the "row_bitmap" field.
>> The relevent code is in the select plugin (probably select/cons_res).
>> Adding logging after jobs are allocated resources (look for
>> "add_job_to_cores") and deallocated (see "remove_job_from_cores")
>> should help identify cases where the deallocate does not happen. The
>> "rows" here represent internal record keeping for allocated cores and
>> each "row" represents a time-slice for gang scheduling.
>>
>> Moe
>>
>> Quoting Carles Fenoy <mini...@gmail.com>:
>>
>> > Hi all,
>> >
>> > Today I migrated one cluster to 2.4.1 and I've found some strange behavior.
>> > Sometimes (I still have to find out when this happens), when a job finishes,
>> > the node appears as idle but it still has some CPUs allocated, and no new
>> > jobs get assigned to it.
>> > Here is an example:
>> >
>> > [root@n0 ~]# squeue -w n39
>> >   JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
>> > [root@n0 ~]# scontrol show node n39
>> > NodeName=n39 Arch=x86_64 CoresPerSocket=1
>> >    CPUAlloc=8 CPUErr=0 CPUTot=8 Features=RedHat,lustre
>> >    Gres=(null)
>> >    NodeAddr=n39 NodeHostName=n39
>> >    OS=Linux RealMemory=48162 Sockets=8
>> >    State=IDLE ThreadsPerCore=1 TmpDisk=170699 Weight=10
>> >    BootTime=2012-06-25T08:57:32 SlurmdStartTime=2012-07-04T18:35:59
>> >
>> > As you can see, Slurm reports CPUAlloc=8 but there is no job running
>> > and the node State is IDLE.
>> > This is only solved by restarting the controller.
>> >
>> > Any hint as to where the problem could be?
>> > I will take a look at this tomorrow and will make a patch if I find a
>> > solution, but any help is welcome.
>> >
>> > regards,
>> >
>> > --
>> > --
>> > Carles Fenoy
>> >
>>
>>
>
>
> --
> --
> Carles Fenoy
>
