Re: [slurm-users] Running vnc after srun fails but works after a direct ssh
Actually, I solved the issue by observing that the user had created a file "~/.vnc/xstartup.sh" when it should have been "~/.vnc/xstartup". Simply removing the extension lets vncserver start successfully, even inside an srun!

Best; Jeremy.

On 15/05/2021 14:02, Jeremy Fix wrote:
> Hello!
>
> I'm facing a weird issue. With one user, call it gpupro_user, if I log
> in with ssh on a compute node, I can run a vncserver (see command [1]
> below) successfully (in my case, a tigervnc server). However, if I
> allocate the exact same node through an srun (see command [2] below),
> running the vnc server fails with the error given at the end of this
> message.
>
> And finally, if I do the exact same srun, getting the exact same compute
> node, from another login (my own login, actually), and then start
> vncserver with the exact same command, it works.
>
> So, do you think there is anything in the way we configured the user
> gpupro_user, or maybe declared it in sacctmgr or somewhere, that could
> explain why running vncserver from within the srun session fails?
>
> Thank you for your help.
>
> Have a nice day;
>
> Jeremy.
>
> [1] vncserver -SecurityTypes None -depth 32 -geometry 1680x1050
>
> [2] srun --nodelist=tx01 -N 1 -p gpue60 -t 0:30:00 --pty bash
>
> --- VNC error
>
> Please be aware that you are exposing your VNC server to all users on the
> local machine. These users can access your server without authentication!
>
> New 'tx01:1 (gpuaut_2)' desktop at :1 on machine tx01
>
> Starting applications specified in /etc/X11/Xvnc-session
> Log file is /usr/users/gpuaut/gpuaut_2/.vnc/tx01:1.log
>
> Use xtigervncviewer -SecurityTypes None :1 to connect to the VNC server.
>
> vncserver: Failed command '/etc/X11/Xvnc-session': 256!
>
> === tail -15 /usr/users/gpuaut/gpuaut_2/.vnc/tx01:1.log ===
> Killing Xtigervnc process ID 31975... which seems to be deadlocked.
> Using SIGKILL!
> Xvnc TigerVNC 1.7.0 - built Dec 5 2017 09:25:01
> Copyright (C) 1999-2016 TigerVNC Team and many others (see README.txt)
> See http://www.tigervnc.org for information on TigerVNC.
> Underlying X server release 11905000, The X.Org Foundation
>
> Sat May 15 13:57:35 2021
> vncext: VNC extension running!
> vncext: Listening for VNC connections on local interface(s), port 5901
> vncext: created VNC server for screen 0
> XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":1"
> after 175 requests (175 known processed) with 0 events remaining.
> Killing Xtigervnc process ID 7169... which seems to be deadlocked.
> Using SIGKILL!
>
> ===
>
> Starting applications specified in /etc/X11/Xvnc-session has failed.
> Maybe try something simple first, e.g.,
> tigervncserver -xstartup /usr/bin/xterm
>
> --- VNC error
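For the record, Jeremy's fix amounts to a rename. A minimal sketch, using a throwaway directory in place of the real ~/.vnc (tigervnc looks for a user script named exactly "xstartup"; with the stray ".sh" extension it apparently fell through to /etc/X11/Xvnc-session, which is what fails in the log above):

```shell
# Demonstration in a scratch directory (stand-in for ~/.vnc).
vncdir=$(mktemp -d)
printf '#!/bin/sh\nexec xterm\n' > "$vncdir/xstartup.sh"   # the misnamed file
mv "$vncdir/xstartup.sh" "$vncdir/xstartup"                # drop the extension
chmod +x "$vncdir/xstartup"                                # must be executable
ls "$vncdir"                                               # now shows: xstartup
```

After the rename, vncserver picks up the user's own xstartup instead of the system session script.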
Re: [slurm-users] schedule mixed nodes first
Durai Arasan writes:
> Is there a way of improving this situation? E.g. by not blocking IDLE nodes
> with jobs that only use a fraction of the 8 GPUs? Why are single GPU jobs
> not scheduled to fill already MIXED nodes before using IDLE ones?
>
> What parameters/configuration need to be adjusted for this to be enforced?

There are two SchedulerParameters you could experiment with (from man slurm.conf):

bf_busy_nodes
    When selecting resources for pending jobs to reserve for future
    execution (i.e. the job cannot be started immediately), preferentially
    select nodes that are in use. This will tend to leave currently idle
    resources available for backfilling longer running jobs, but may result
    in allocations having less than optimal network topology. This option is
    currently only supported by the select/cons_res and select/cons_tres
    plugins (or select/cray_aries with SelectTypeParameters set to
    "OTHER_CONS_RES" or "OTHER_CONS_TRES", which layers the select/cray_aries
    plugin over the select/cons_res or select/cons_tres plugin respectively).

pack_serial_at_end
    If used with the select/cons_res or select/cons_tres plugin, put serial
    jobs at the end of the available nodes rather than using a best-fit
    algorithm. This may reduce resource fragmentation for some workloads.

-- B/H
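Putting the two options together, a slurm.conf fragment might look like the following (illustrative only: merge with whatever SchedulerParameters you already set, and note that both options require a cons_res/cons_tres select plugin):

```
# Illustrative slurm.conf fragment -- combine with your existing settings.
SelectType=select/cons_tres
SchedulerParameters=bf_busy_nodes,pack_serial_at_end
```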
Re: [slurm-users] Determining Cluster Usage Rate
On 17/05/21 09:25, Ole Holm Nielsen wrote:
> I hope that someone on the list can help you build Debian packages.
The problem is not just rebuilding Slurm: if I rebuild Slurm, I have to rebuild OpenMPI, OpenIB and a lot of other stuff that I don't know at the needed level of detail.

> When you find the time, you must upgrade by at most 2 Slurm versions at a
> time, so you have to upgrade in two steps, for example
> 18.08->19.05->20.11.
I usually just stop everything for the upgrade, then upgrade to whatever Debian is shipping at the moment. If the history is lost, it's not a big issue (that's what DB backups are for :) ).

> My Slurm upgrade instructions refer to CentOS, but the overall process
> would be the same for all Linuxes:
> https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
> Please read carefully the existing documentation from SchedMD linked to
> in this page.
Tks.

> I upgrade Slurm frequently and have no problems doing so. We're at
> 20.11.7 now. You should avoid 20.11.{0-2} due to a bug in MPI.
That's really useful info.

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
Re: [slurm-users] Determining Cluster Usage Rate
On 5/17/21 8:59 AM, Diego Zuccato wrote:
> On 15/05/21 00:43, Christopher Samuel wrote:
>>> It just doesn't recognize 'ALL'. It works if I specify the resources.
>>
>> That's odd, what does this say?
>>
>>   sreport --version
>
> slurm-wlm 18.08.5-2
>
> That's the package from Debian stable (we don't have the manpower to
> handle manually-compiled packages).
> As Ole said, it's an old version. I'd love to be able to keep up with
> the newest releases, but ... :(

I hope that someone on the list can help you build Debian packages.

When you find the time, you must upgrade by at most 2 Slurm versions at a time, so you have to upgrade in two steps, for example 18.08->19.05->20.11.

My Slurm upgrade instructions refer to CentOS, but the overall process would be the same for all Linuxes:
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
Please read carefully the existing documentation from SchedMD linked to on that page.

I upgrade Slurm frequently and have no problems doing so. We're at 20.11.7 now. You should avoid 20.11.{0-2} due to a bug in MPI.

/Ole
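The two-step rule above could be sketched roughly like this (a hypothetical outline for a Debian-style node, not a tested procedure; package names, version pins, and the database-conversion details must be checked against the SchedMD upgrade guide and the page linked above before running anything):

```shell
# Hypothetical outline only: upgrade slurmdbd first, back up its DB,
# and never jump more than two major releases at once.
systemctl stop slurmctld slurmd slurmdbd
mysqldump slurm_acct_db > slurm_acct_db.backup.sql   # safety net

# Step 1: 18.08 -> 19.05 (intermediate release; package names are assumed)
apt install slurmdbd=19.05.\* slurm-wlm=19.05.\*
systemctl start slurmdbd        # let it convert the accounting DB first
systemctl start slurmctld slurmd

# Step 2: 19.05 -> 20.11, repeating the same stop/backup/start sequence
```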
Re: [slurm-users] Determining Cluster Usage Rate
On 15/05/21 00:43, Christopher Samuel wrote:
>> It just doesn't recognize 'ALL'. It works if I specify the resources.
>
> That's odd, what does this say?
>
>   sreport --version

slurm-wlm 18.08.5-2

That's the package from Debian stable (we don't have the manpower to handle manually-compiled packages).
As Ole said, it's an old version. I'd love to be able to keep up with the newest releases, but ... :(

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
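For anyone hitting the same thing, the workaround being discussed can be illustrated as follows (hypothetical invocations; the exact TRES names depend on your AccountingStorageTRES configuration):

```
# On 18.08, asking for every TRES by keyword is not understood:
sreport cluster utilization start=2021-04-01 end=2021-05-01 --tres=ALL

# ...while listing the resources explicitly works:
sreport cluster utilization start=2021-04-01 end=2021-05-01 --tres=cpu,mem,gres/gpu
```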