yes, that's the line

PartitionName=CLUSTER Default=yes State=UP nodes=gpu-[1]-[4-17],gpu-[2]-[4,6-16],gpu-[3]-[9]

and I have gpu-2-4 defined in the nodenames file:
NodeName=gpu-2-4 NodeAddr=10.240.31.235 CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 Gres=gpu:4 Weight=20512304 Feature=rack-2,32CPUs RealMemory=245760

as well as in /etc/hosts:

10.240.31.235   gpu-2-4.local   gpu-2-4

but slurmctld nevertheless crashes:

[2013-07-11T10:11:40.764] Recovered state of 29 nodes
[2013-07-11T10:11:40.764] Recovered information about 0 jobs
[2013-07-11T10:11:40.764] error: find_node_record: lookup failure for gpu-[2]-[4]
[2013-07-11T10:11:40.764] error: node_name2bitmap: invalid node specified gpu-[2]-[4]
[2013-07-11T10:11:40.764] error: find_node_record: lookup failure for 6-16]
[2013-07-11T10:11:40.764] error: node_name2bitmap: invalid node specified 6-16]
[2013-07-11T10:11:40.764] fatal: Invalid node names in partition CLUSTER


This looks like a parsing error to me. Also, if Torque can't resolve a
hostname, it just logs an error and keeps running. Slurm is completely
dead with one missing node!
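
To make the suspected parsing slip concrete, here is a minimal Python sketch of hostlist expansion. It is not Slurm's parser: the function names (split_top_level, expand_bracket, expand_hostlist, undefined_nodes) are made up for illustration, and it assumes plain decimal ranges without zero padding. It shows that a bracket-aware split keeps gpu-2-[4,6-16] together, while a naive split on every comma leaves a stray "6-16]" like the one in the log above. Whether slurmctld actually takes such a path is only a guess.

```python
import itertools
import re


def split_top_level(expr):
    """Split a hostlist expression on commas that sit outside brackets."""
    parts, depth, current = [], 0, []
    for ch in expr:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


def expand_bracket(body):
    """Expand a bracket body such as '4,6-16' into ['4', '6', ..., '16'].

    Assumption: plain decimal numbers, no zero padding.
    """
    values = []
    for item in body.split(","):
        if "-" in item:
            lo, hi = item.split("-", 1)
            values.extend(str(n) for n in range(int(lo), int(hi) + 1))
        else:
            values.append(item)
    return values


def expand_hostlist(expr):
    """Expand every bracket group of every comma-separated term."""
    hosts = []
    for term in split_top_level(expr):
        # With a capture group, re.split alternates literal text (even
        # indexes) and bracket bodies (odd indexes).
        pieces = re.split(r"\[([^\]]*)\]", term)
        choices = [
            expand_bracket(piece) if index % 2 else [piece]
            for index, piece in enumerate(pieces)
        ]
        hosts.extend("".join(combo) for combo in itertools.product(*choices))
    return hosts


def undefined_nodes(expr, defined):
    """Return expanded names that are not in the set of defined node names."""
    return [host for host in expand_hostlist(expr) if host not in defined]


if __name__ == "__main__":
    bracketed = "gpu-[1]-[4-17],gpu-[2]-[4,6-16],gpu-[3]-[9]"
    print(expand_hostlist(bracketed))
    # A naive split on every comma breaks the bracket group apart:
    print("gpu-[2]-[4,6-16]".split(","))
```

In this sketch the bracketed spelling gpu-[1]-[4-17],gpu-[2]-[4,6-16],gpu-[3]-[9] and the plain spelling gpu-1-[4-17],gpu-2-[4,6-16],gpu-3-9 both expand to the same 27 names, so the extra brackets are harmless here. On a machine with Slurm installed, `scontrol show hostnames 'gpu-2-[4,6-16]'` prints how Slurm itself expands an expression.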


Thanks
Eva

On Wed, 10 Jul 2013, John Thiltges wrote:

>
> Hi Eva,
>
> I wasn't able to reproduce the problem with a quick test. You have
> config lines similar to these?
>
>      NodeName=gpu-1-[4-17],gpu-2-[4,6-16],gpu-3-9 ...
>      PartitionName=... Nodes=gpu-1-[4-17],gpu-2-[4,6-16],gpu-3-9
>
> Regards,
> John
>
> On 2013-07-10 19:20, Eva Hocks wrote:
> >
> > Thanks, John
> >
> > but this is what I have in the partition file:
> >
> > nodes=gpu-1-[4-17],gpu-2-[4,6-16],gpu-3-9
> >
> > slurm gets confused when it can't look up gpu-2-4 and then splits the
> > gpu-2-[4,6-16] into gpu-[2]-[4] (failed lookup) and 6-16] (which is
> > actually no node name at all but a wrong parsing after the failure)
> >
> > Thanks
> > Eva
> >
> > On Wed, 10 Jul 2013, John Thiltges wrote:
> >
> >> On 07/10/2013 06:16 PM, Eva Hocks wrote:
> > >>> The entry in partition.conf:
> > >>> PartitionName=CLUSTER Default=yes State=UP nodes=gpu-[1]-[4-17],gpu-[2]-[4,6-16],gpu-[3]-[9]
> > >>> causes slurmctld to crash:
> > >>> [2013-07-10T16:03:22.923] error: find_node_record: lookup failure for gpu-[2]-[4]
> > >>> [2013-07-10T16:03:22.923] error: node_name2bitmap: invalid node specified gpu-[2]-[4]
> > >>> [2013-07-10T16:03:22.923] error: find_node_record: lookup failure for 6-16]
> > >>> [2013-07-10T16:03:22.923] error: node_name2bitmap: invalid node specified 6-16]
> > >>> [2013-07-10T16:03:22.923] fatal: Invalid node names in partition CLUSTER
> >> It looks like the hostlist parser is confused by the brackets, finding
> >> names of "6-16]" and "gpu-[2]-[4]".
> >> Brackets are only needed when there is a range. If you take out the
> >> extra brackets, it should parse OK:
> >>       nodes=gpu-1-[4-17],gpu-2-[4,6-16],gpu-3-9
> >> Regards,
> >> John
> > >
>
