Re: [slurm-users] slurmstepd: error: Too many levels of symbolic links
Adrian Sevcenco writes:

> On 01.12.2021 10:25, Bjørn-Helge Mevik wrote:
>
>> In the end we had to give up using automount, and implement a manual
>> procedure that mounts/umounts the needed NFS areas.
>
> Thanks a lot for the info! Manual as in "script" or as in "systemd.mount
> service"?

Script. We mount (if needed) in the prolog. Then in the healthcheck (run
every 5 minutes), we check whether a job that needs the mount is still
running on the node, and unmount if not. (We could have done it in the
epilog, but feared that could lead to a lot of mount/umount cycles if a set
of jobs failed immediately. Hence we put it in the healthcheck script
instead.)

I don't have much experience with the systemd.mount service, but it is
possible it would work fine (and be less hackish than our solution :).

> Also, the big and only advantage that autofs had over static mounts was
> that whenever there was a problem with the server, autofs would re-mount
> the target once the glitch had passed...

That's in theory. :) Our experience in practice is that if the client is
actively using the NFS-mounted area when the problem arises, you will often
have to reboot the client to resolve the disk waits. (I *think* it has
something to do with NFS using longer and longer timeouts when it cannot
reach the server, so eventually it takes too long to time out and return an
error to the running applications.)

> I'm not very sure that a static NFS mount has this capability... did you
> also bake a recovery part into your manual procedure?

No, we simply pretend it will not happen. :) In fact, I think we've only had
this type of problem once or twice in the last four or five years. But this
might be because we only mount the home directories with NFS, so most of the
time the jobs are not actively using the NFS-mounted area. (Most of the
activity happens in BeeGFS- or GPFS-mounted areas.)

--
Bjørn-Helge
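For a concrete picture of the approach described above, a minimal sketch of
the prolog and healthcheck logic might look like the following. The mount
point, NFS export and the squeue-based check are assumptions for
illustration, not the actual site scripts.

    #!/bin/bash
    # prolog.sh (sketch): mount the shared NFS area on demand before the job starts.
    NFS_EXPORT="nfsserver:/export/home"   # assumed export
    MOUNTPOINT="/home"                    # assumed mount point

    if ! mountpoint -q "$MOUNTPOINT"; then
        mount -t nfs "$NFS_EXPORT" "$MOUNTPOINT" || exit 1
    fi

    #!/bin/bash
    # healthcheck.sh (sketch, run every ~5 minutes): unmount the NFS area
    # once no job on this node needs it any more.
    MOUNTPOINT="/home"

    # Any jobs still allocated to this node?
    if [ -z "$(squeue --noheader --nodelist="$(hostname -s)")" ]; then
        # Only unmount if it is mounted and no process is still using it.
        if mountpoint -q "$MOUNTPOINT" && ! fuser -sm "$MOUNTPOINT"; then
            umount "$MOUNTPOINT"
        fi
    fi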
Re: [slurm-users] slurmstepd: error: Too many levels of symbolic links
Hi!

On 01.12.2021 10:25, Bjørn-Helge Mevik wrote:
> Adrian Sevcenco writes:
>
>> Hi! Does anyone know what could be the cause of such an error?
>> I have a shared home, Slurm 20.11.8, and I try a simple script in the
>> submit directory, which is in the home that is NFS-shared...
>
> We had the "Too many levels of symbolic links" error some years ago, while
> using a combination of automounting NFS areas and private fs name spaces
> to get a private /tmp for each job. In the end we had to give up using
> automount, and implement a manual procedure that mounts/umounts the needed
> NFS areas.

Thanks a lot for the info! Manual as in "script" or as in "systemd.mount
service"?

Also, the big and only advantage that autofs had over static mounts was that
whenever there was a problem with the server, autofs would re-mount the
target once the glitch had passed...

I'm not very sure that a static NFS mount has this capability... did you
also bake a recovery part into your manual procedure?

Thank you!
Adrian
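If one went the systemd.mount route instead of a prolog script, the units
would look roughly like this; the server name, export path and mount point
are made-up examples, and the unit file names must match the mount path
(here /home):

    # /etc/systemd/system/home.mount
    [Unit]
    Description=NFS-mounted home area (example)

    [Mount]
    What=nfsserver.example.org:/export/home
    Where=/home
    Type=nfs
    Options=vers=4.2,_netdev

    [Install]
    WantedBy=multi-user.target

    # /etc/systemd/system/home.automount (optional, autofs-like on-demand
    # mounting; if used, enable this unit instead of home.mount)
    [Unit]
    Description=Automount for /home (example)

    [Automount]
    Where=/home
    TimeoutIdleSec=600

    [Install]
    WantedBy=multi-user.target

Running "systemctl enable --now home.mount" (or home.automount) would then
take over the role of the prolog/healthcheck mount handling.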
Re: [slurm-users] slurmstepd: error: Too many levels of symbolic links
Adrian Sevcenco writes:

> Hi! Does anyone know what could be the cause of such an error?
> I have a shared home, Slurm 20.11.8, and I try a simple script in the
> submit directory, which is in the home that is NFS-shared...

We had the "Too many levels of symbolic links" error some years ago, while
using a combination of automounting NFS areas and private fs name spaces to
get a private /tmp for each job. In the end we had to give up using
automount, and implement a manual procedure that mounts/umounts the needed
NFS areas.

--
B/H
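The "private fs name spaces to get a private /tmp" idea is typically
implemented along these lines; this is a generic sketch with placeholder
paths and job command, not the setup that triggered the error:

    # Give a job a private /tmp by bind-mounting a per-job directory
    # inside a new mount namespace (requires root; placeholder paths).
    JOBTMP="/scratch/tmp.job-${SLURM_JOB_ID:-demo}"
    mkdir -p "$JOBTMP"
    # unshare --mount starts the command in its own mount namespace with
    # private propagation, so the bind mount over /tmp is only visible to
    # the job's processes.
    unshare --mount /bin/sh -c "mount --bind '$JOBTMP' /tmp && exec /path/to/job_script"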
[slurm-users] slurmstepd: error: Too many levels of symbolic links
Hi! Does anyone know what could be the cause of such an error?
I have a shared home, Slurm 20.11.8, and I try a simple script in the submit
directory, which is in the home that is NFS-shared... Also, I have
job_container.conf defined, but I have no idea whether this is a problem.

Thank you!
Adrian
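One quick thing to check for an error like this is whether the path really
contains a symlink loop, e.g. by walking it component by component; the path
below is a placeholder for the actual submit directory:

    # Show every component of the path and any symlinks in it:
    namei -l /home/adrian/jobdir
    # readlink -f exits non-zero if the path cannot be fully resolved,
    # e.g. because of a symlink loop:
    readlink -f /home/adrian/jobdir || echo "path cannot be resolved"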