Re: [slurm-users] Running vnc after srun fails but works after a direct ssh

2021-05-17 Thread Jeremy Fix
Actually, I solved the issue by noticing that the user had created a
file "~/.vnc/xstartup.sh" while it should have been "~/.vnc/xstartup".

Simply removing the extension was enough, and vncserver now starts
successfully, even inside an srun session!
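For anyone hitting the same thing, here is a quick sketch of the fix. The paths below use a temporary directory purely for illustration; on a real account the directory would of course be ~/.vnc:

```shell
# Simulate the user's ~/.vnc directory in a temp dir for illustration.
vncdir=$(mktemp -d)
touch "$vncdir/xstartup.sh"          # the misnamed startup script

# tigervnc's vncserver looks for "xstartup" with no extension, so an
# "xstartup.sh" file is silently ignored and the session fails to start.
mv "$vncdir/xstartup.sh" "$vncdir/xstartup"
chmod +x "$vncdir/xstartup"

ls "$vncdir"
```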

Best,

Jeremy.

On 15/05/2021 14:02, Jeremy Fix wrote:
> Hello !
>
> I'm facing a weird issue. With one user, call it gpupro_user, if I log
> in with ssh on a compute node, I can run a vncserver (see command [1]
> below) successfully (in my case, a tigervnc server). However, if I
> allocate the exact same node through an srun (see command [2] below),
> running the vnc server fails with the error given at the end of this message.
>
> And finally, if I do the exact same srun, getting the exact same compute
> node, from another login (my own login, actually), and then start
> vncserver with the exact same command, it works.
>
> So, do you think there is anything in the way we configured the user
> gpupro_user, or maybe declared it in sacctmgr or somewhere, that could
> explain why running vncserver from within the srun session fails?
>
> Thank you for your help.
>
> Have a nice day,
>
> Jeremy.
>
> [1] vncserver -SecurityTypes None -depth 32 -geometry 1680x1050
>
> [2]  srun --nodelist=tx01 -N 1 -p gpue60 -t 0:30:00 --pty bash
>
> --- VNC error 
>
> Please be aware that you are exposing your VNC server to all users on the
> local machine. These users can access your server without authentication!
>
> New 'tx01:1 (gpuaut_2)' desktop at :1 on machine tx01
>
> Starting applications specified in /etc/X11/Xvnc-session
> Log file is /usr/users/gpuaut/gpuaut_2/.vnc/tx01:1.log
>
> Use xtigervncviewer -SecurityTypes None :1 to connect to the VNC server.
>
>
> vncserver: Failed command '/etc/X11/Xvnc-session': 256!
>
> === tail -15 /usr/users/gpuaut/gpuaut_2/.vnc/tx01:1.log
> ===
> Killing Xtigervnc process ID 31975... which seems to be deadlocked.
> Using SIGKILL!
>
> Xvnc TigerVNC 1.7.0 - built Dec  5 2017 09:25:01
> Copyright (C) 1999-2016 TigerVNC Team and many others (see README.txt)
> See http://www.tigervnc.org for information on TigerVNC.
> Underlying X server release 11905000, The X.Org Foundation
>
>
> Sat May 15 13:57:35 2021
>  vncext:  VNC extension running!
>  vncext:  Listening for VNC connections on local interface(s), port 5901
>  vncext:  created VNC server for screen 0
> XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":1"
>   after 175 requests (175 known processed) with 0 events remaining.
> Killing Xtigervnc process ID 7169... which seems to be deadlocked. Using
> SIGKILL!
>
> ===
>
> Starting applications specified in /etc/X11/Xvnc-session has failed.
> Maybe try something simple first, e.g.,
>     tigervncserver -xstartup /usr/bin/xterm
> --- VNC error 
>
>
>



Re: [slurm-users] schedule mixed nodes first

2021-05-17 Thread Bjørn-Helge Mevik
Durai Arasan  writes:

> Is there a way of improving this situation? E.g. by not blocking IDLE nodes
> with jobs that only use a fraction of the 8 GPUs? Why are single GPU jobs
> not scheduled to fill already MIXED nodes before using IDLE ones?
>
> What parameters/configuration need to be adjusted for this to be enforced?

There are two SchedulerParameters you could experiment with (from man 
slurm.conf):

   bf_busy_nodes
          When selecting resources for pending jobs to reserve for future
          execution (i.e. the job can not be started immediately),
          preferentially select nodes that are in use.  This will tend to
          leave currently idle resources available for backfilling longer
          running jobs, but may result in allocations having less than
          optimal network topology.  This option is currently only
          supported by the select/cons_res and select/cons_tres plugins
          (or select/cray_aries with SelectTypeParameters set to
          "OTHER_CONS_RES" or "OTHER_CONS_TRES", which layers the
          select/cray_aries plugin over the select/cons_res or
          select/cons_tres plugin respectively).

   pack_serial_at_end
          If used with the select/cons_res or select/cons_tres plugin,
          then put serial jobs at the end of the available nodes rather
          than using a best fit algorithm.  This may reduce resource
          fragmentation for some workloads.
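For reference, these would be set in slurm.conf roughly like this (an untested sketch; adjust SelectType and the parameter list to your site's configuration):

```
# slurm.conf fragment -- illustrative only
SelectType=select/cons_tres
SchedulerParameters=bf_busy_nodes,pack_serial_at_end
```

A change to SchedulerParameters should take effect after an "scontrol reconfigure" (or a slurmctld restart).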

-- 
B/H


signature.asc
Description: PGP signature


Re: [slurm-users] Determining Cluster Usage Rate

2021-05-17 Thread Diego Zuccato

On 17/05/21 09:25, Ole Holm Nielsen wrote:

> I hope that someone on the list can help you build Debian packages.

The problem is not just rebuilding Slurm: if I rebuild Slurm, I have to
rebuild OpenMPI, OpenIB and a lot of other stuff that I don't know in
the needed detail.

> When you find the time, you must upgrade by at most 2 Slurm versions at
> a time, so you have to upgrade in two steps, for example
> 18.08->19.05->20.11.

I usually just stop everything for the upgrade, then upgrade to whatever
Debian is shipping at the moment. If the history is lost, it's not a big
issue (that's what DB backups are for :) ).

> My Slurm upgrade instructions refer to CentOS, but the overall process
> would be the same for all Linuxes:
>
> https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
> Please read carefully the existing documentation from SchedMD linked to
> in this page.

Thanks.

> I upgrade Slurm frequently and have no problems doing so.  We're at
> 20.11.7 now.  You should avoid 20.11.{0-2} due to a bug in MPI.

That's really useful info.

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786



Re: [slurm-users] Determining Cluster Usage Rate

2021-05-17 Thread Ole Holm Nielsen

On 5/17/21 8:59 AM, Diego Zuccato wrote:

> On 15/05/21 00:43, Christopher Samuel wrote:
>
>>> It just doesn't recognize 'ALL'. It works if I specify the resources.
>>
>> That's odd, what does this say?
>> sreport --version
>
> slurm-wlm 18.08.5-2
> That's the package from Debian stable (we don't have the manpower to
> handle manually-compiled packages).
> As Ole said, it's an old version. I'd love to be able to keep up with
> the newest releases, but ... :(

I hope that someone on the list can help you build Debian packages.  When
you find the time, you must upgrade by at most 2 Slurm versions at a time,
so you have to upgrade in two steps, for example 18.08->19.05->20.11.

My Slurm upgrade instructions refer to CentOS, but the overall process
would be the same for all Linuxes:

https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
Please read carefully the existing documentation from SchedMD linked to in
this page.

I upgrade Slurm frequently and have no problems doing so.  We're at
20.11.7 now.  You should avoid 20.11.{0-2} due to a bug in MPI.


/Ole



Re: [slurm-users] Determining Cluster Usage Rate

2021-05-17 Thread Diego Zuccato

On 15/05/21 00:43, Christopher Samuel wrote:

>> It just doesn't recognize 'ALL'. It works if I specify the resources.
>
> That's odd, what does this say?
> sreport --version

slurm-wlm 18.08.5-2
That's the package from Debian stable (we don't have the manpower to
handle manually-compiled packages).
As Ole said, it's an old version. I'd love to be able to keep up with
the newest releases, but ... :(
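For anyone else landing here: on older sreport versions that reject the 'ALL' TRES keyword, naming the tracked resources explicitly works, along these lines (the dates and TRES list are illustrative; use whatever your cluster actually accounts for):

```
sreport cluster utilization --tres=cpu,mem,gres/gpu \
        start=2021-01-01 end=2021-05-01 -t percent
```

The -t percent option reports utilization as a percentage of the available TRES instead of raw minutes.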


--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786