Re: [slurm-users] slurmstepd: error: Too many levels of symbolic links

2021-12-03 Thread Bjørn-Helge Mevik
Adrian Sevcenco  writes:

> On 01.12.2021 10:25, Bjørn-Helge Mevik wrote:
>
>> In the end we had to give up
>> using automount, and implement a manual procedure that mounts/umounts
>> the needed nfs areas.
>
> Thanks a lot for the info! Manual as in "script" or as in "systemd.mount service"?

Script.  We mount (if needed) in the prolog.  Then in the healthcheck
(run every 5 mins), we check whether a job that needs the mount is still
running on the node, and unmount if not.  (We could have done it in the
epilog, but feared that could lead to a lot of mount/umount cycles if a
set of jobs failed immediately, hence we put it in the healthcheck
script instead.)
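A minimal sketch of that logic, for illustration only: the mount point,
NFS export, and DRYRUN mode are hypothetical, and the real prolog and
healthcheck scripts are site-specific:

```shell
#!/bin/sh
# Sketch of the prolog/healthcheck mount logic described above.
# MNT and SRV are hypothetical; DRYRUN=1 only prints the commands.
MNT=${MNT:-/mnt/nfs-home}
SRV=${SRV:-nfs-server:/export/home}
DRYRUN=${DRYRUN:-1}

run() { [ "$DRYRUN" = 1 ] && echo "$@" || "$@"; }

# Prolog: mount the NFS area if it is not already mounted.
prolog_mount() {
    mountpoint -q "$MNT" 2>/dev/null || run mount -t nfs "$SRV" "$MNT"
}

# Healthcheck (every 5 min): unmount only when no jobs remain on this node.
healthcheck_umount() {
    jobs_here=$(squeue -h -w "$(hostname)" 2>/dev/null)
    if [ -z "$jobs_here" ] && mountpoint -q "$MNT" 2>/dev/null; then
        run umount "$MNT"
    fi
}
```

The `squeue -h -w <node>` check is one way to test for remaining jobs on
the node; a site may equally well inspect the local cgroup or spool
directories instead.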

I don't have much experience with the systemd.mount service, but it is
possible it would work fine (and be less hackish than our solution :).
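For reference, a systemd mount unit for such an NFS area could look
roughly like this (paths are hypothetical; note that the unit file name
must be the systemd-escaped mount path, e.g. `mnt-nfs\x2dhome.mount` for
`/mnt/nfs-home`):

```ini
[Unit]
Description=NFS home area (hypothetical example)

[Mount]
What=nfs-server:/export/home
Where=/mnt/nfs-home
Type=nfs
Options=defaults

[Install]
WantedBy=multi-user.target
```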

> Also, the big and only advantage that autofs had over static mounts was
> that whenever there was a problem with the server, after the glitch
> passed, autofs would re-mount the target...

That's the theory. :) Our experience in practice is that if the client is
actively using the NFS-mounted area when the problem arises, you will
often have to reboot the client to resolve the disk waits.  (I *think*
it has something to do with NFS using longer and longer timeouts when it
cannot reach the server, so eventually it takes too long to time out
and return an error to the running applications.)
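For what it's worth, the client-side timeout behavior can be tuned with
NFS mount options; an illustrative fstab line follows (the server, path,
and values are hypothetical, and `soft` mounts risk silent I/O errors,
so many sites avoid them for home areas):

```
# retry each RPC every 10 s (timeo is in tenths of a second),
# return an error after 3 retries instead of blocking forever
nfs-server:/export/home  /mnt/nfs-home  nfs  soft,timeo=100,retrans=3  0  0
```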

> I'm not very sure that a static NFS mount has this capability ... did
> you also bake a recovery part into your manual procedure?

No, we simply pretend it will not happen. :)  In fact, I think we've
only had this type of problem once or twice in the last four or five
years.  But this might be because we only mount the home directories
with NFS, so most of the time the jobs are not actively using the
NFS-mounted area.  (Most of the activity happens in BeeGFS- or
GPFS-mounted areas.)

-- 
Bjørn-Helge




Re: [slurm-users] slurmstepd: error: Too many levels of symbolic links

2021-12-02 Thread Adrian Sevcenco

Hi!

On 01.12.2021 10:25, Bjørn-Helge Mevik wrote:

> Adrian Sevcenco  writes:
>
>> Hi! Does anyone know what could be the cause of such an error?
>> I have a shared home, slurm 20.11.8, and I try a simple script in the
>> submit directory, which is in the home that is NFS-shared...
>
> We had the "Too many levels of symbolic links" error some years ago,
> while using a combination of automounting NFS areas and private
> filesystem namespaces to get a private /tmp for each job.  In the end
> we had to give up using automount, and implement a manual procedure
> that mounts/umounts the needed NFS areas.


Thanks a lot for the info! Manual as in "script" or as in "systemd.mount service"?

Also, the big and only advantage that autofs had over static mounts was
that whenever there was a problem with the server, after the glitch
passed, autofs would re-mount the target...

I'm not very sure that a static NFS mount has this capability ... did
you also bake a recovery part into your manual procedure?

Thank you!
Adrian



Re: [slurm-users] slurmstepd: error: Too many levels of symbolic links

2021-12-01 Thread Bjørn-Helge Mevik
Adrian Sevcenco  writes:

> Hi! Does anyone know what could be the cause of such an error?
> I have a shared home, slurm 20.11.8, and I try a simple script in the
> submit directory, which is in the home that is NFS-shared...

We had the "Too many levels of symbolic links" error some years ago,
while using a combination of automounting NFS areas and private
filesystem namespaces to get a private /tmp for each job.  In the end
we had to give up using automount, and implement a manual procedure
that mounts/umounts the needed NFS areas.

-- 
B/H




[slurm-users] slurmstepd: error: Too many levels of symbolic links

2021-11-30 Thread Adrian Sevcenco




Hi! Does anyone know what could be the cause of such an error?
I have a shared home, slurm 20.11.8, and I try a simple script in the
submit directory, which is in the home that is NFS-shared...

Also, I have job_container.conf defined, but I have no idea if this is
a problem...
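For context, job_container.conf configures the job_container/tmpfs
plugin, which gives each job a private /tmp; a typical minimal file
looks roughly like this (the BasePath is illustrative, not my actual
setup):

```
AutoBasePath=true
BasePath=/var/tmp/slurm
```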

Thank you!
Adrian