We have /home exported to all systems via NFS and also a parallel filesystem (FhGFS/BeeGFS, but any will do) mounted for scratch space. I'm sure other sites do it differently, but this is one way of doing it.
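A minimal sketch of that NFS setup, assuming a head node hostname of `headnode` and a compute subnet of 10.1.1.0/24 (both illustrative, substitute your own):

```shell
# /etc/exports on the head node (subnet is illustrative -- adjust to yours):
#   /home  10.1.1.0/24(rw,sync,no_root_squash)
# Publish the export after editing /etc/exports:
exportfs -ra                          # re-export everything in /etc/exports
# On each compute node, mount it:
mount -t nfs headnode:/home /home
# Or make it persistent in the compute nodes' /etc/fstab:
#   headnode:/home  /home  nfs  defaults  0 0
```

The scratch filesystem is mounted separately on every node by whatever client the parallel filesystem provides; the key point is that both paths must resolve identically on the head node and the compute nodes.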
Exporting /home via NFS is likely the most common approach. Another is exporting compiled applications via NFS that are loaded via modules (Lmod, for example); at our site that is /apps, some use /opt, or some non-standard directory not typically used by the OS.

- Trey

=============================
Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: [email protected]
Jabber: [email protected]

On Tue, Dec 9, 2014 at 11:27 AM, Adrian Reich <[email protected]> wrote:

> Thank you so much for that suggestion! That led me straight to the
> issue. The file system that I have mounted on the head node is not visible
> to the compute nodes. All the jobs were failing because I was getting
> streams of "No such file or directory" errors. If I launch a job from a
> folder that is part of the OS, the job runs, because that same folder also
> exists on the compute nodes.
>
> So, what is the best way for my compute nodes to write to the file system
> that I have set up on the headnode? Thank you again.
>
> Sincerely,
> Adrian Reich
>
> On Tue, Dec 9, 2014 at 11:57 AM, <[email protected]> wrote:
>
>> Look at your SlurmctldLogFile (on the head node) and SlurmdLogFile (on
>> the allocated node).
>>
>> Quoting Adrian Reich <[email protected]>:
>>
>>> Hello,
>>>
>>> I have set up a small SLURM cluster using the SLURM roll within Rocks.
>>> Every time I try to submit an sbatch job it fails immediately and the
>>> job quits. However, I can request resources using salloc and everything
>>> works. How can I go about diagnosing where the issue is and what
>>> information can I provide to help in the diagnosis? Thank you.
>>>
>>> Sincerely,
>>> Adrian Reich
>>
>> --
>> Morris "Moe" Jette
>> CTO, SchedMD LLC
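Adrian's failure mode in the quoted thread (a working directory that exists on the head node but not on the compute nodes) can be checked for before submitting. A sketch, where `JOBDIR` is just an illustrative placeholder, not a SLURM variable:

```shell
# Confirm the directory the job will run in actually exists. On a real
# cluster, wrap the test in srun so it runs on the compute nodes, e.g.:
#   srun --nodes=2 test -d "$JOBDIR" && echo visible
JOBDIR=/tmp        # substitute the directory your job writes to
if test -d "$JOBDIR"; then
    echo "visible: $JOBDIR"
else
    echo "missing: $JOBDIR -- jobs launched here fail with 'No such file or directory'"
fi
```

If the directory is missing on any allocated node, slurmd cannot chdir into it and the job dies immediately, which matches the sbatch-fails/salloc-works symptom above (salloc runs the shell on the node where you then cd manually).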
