Sounds like a race condition where slurmd is starting before the node is
truly ready.
You can try adding dependencies for slurmd so it will not start until
some other needed service is running.
The benefits of systemd :)
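As a rough, untested sketch of what I mean: a systemd drop-in along these lines would hold slurmd back until the network is reported up and the shared filesystem is mounted. The drop-in file name, the choice of targets, and the idea that /shared (taken from the paths in your backtrace) is the mount that matters are all assumptions on my part, so adjust to whatever your nodes actually need.

```ini
# /etc/systemd/system/slurmd.service.d/wait-for-node.conf
# (hypothetical file name; any *.conf in this directory is read)
[Unit]
# Pull in and order after the "network is up" target,
# and order after remote filesystem mounts.
Wants=network-online.target
After=network-online.target remote-fs.target
# Assumption: the application lives under /shared, so require that
# mount point to be mounted before slurmd is started.
RequiresMountsFor=/shared
```

After adding the file, systemctl daemon-reload makes systemd pick it up. Since your compute nodes are provisioned fresh for each job, the drop-in would probably have to be part of the node image or bootstrap step rather than added by hand.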
Brian Andrus
On 6/9/2020 10:53 AM, Dumont, Joey wrote:
Hi,
I am encountering a weird issue, and I'm not sure where it is coming from.
I have set up a slurm-based cluster using AWS ParallelCluster. I have
tweaked the slurm configuration to enable X forwarding by setting
PrologFlags=X11. The ParallelCluster portion is relevant, as basically
every time a user queues a job, a brand new compute node is
provisioned, and added to the default queue. Users want to run a GUI
based application based on Qt5. To run it, they issue something like:
salloc --nodes=1 --ntasks=1 --cpus-per-task=48 --x11=all srun run_lsf.sh
However, if there are no nodes available, a new one is provisioned and
the job is run on the new node. Every time this job is the first job
on the compute node, the application crashes. If I issue the exact
same command a second time (it usually gets allocated to the same
node), then it runs without any issues. I was able to retrieve this
from the core dump:
(gdb) bt
#0 0x00007fffdfced337 in raise () from /lib64/libc.so.6
#1 0x00007fffdfceea28 in abort () from /lib64/libc.so.6
#2 0x00007fffe2e699db in QMessageLogger::fatal(char const*, ...) const ()
   from /shared/opt/spack/opt/spack/linux-centos7-cascadelake/gcc-9.2.0/lumerical-2020a-r5-mt7ihfs2o3wfpxrn2ciw2oqfoqvo34dl/opt/lumerical/2020a/bin/../lib/libQt5Core.so.5
#3 0x00007fffe44ce28b in QGuiApplicationPrivate::createPlatformIntegration() ()
   from /shared/opt/spack/opt/spack/linux-centos7-cascadelake/gcc-9.2.0/lumerical-2020a-r5-mt7ihfs2o3wfpxrn2ciw2oqfoqvo34dl/opt/lumerical/2020a/bin/../lib/libQt5Gui.so.5
#4 0x00007fffe44ce72d in QGuiApplicationPrivate::createEventDispatcher() ()
   from /shared/opt/spack/opt/spack/linux-centos7-cascadelake/gcc-9.2.0/lumerical-2020a-r5-mt7ihfs2o3wfpxrn2ciw2oqfoqvo34dl/opt/lumerical/2020a/bin/../lib/libQt5Gui.so.5
#5 0x00007fffe30579f5 in QCoreApplicationPrivate::init() ()
   from /shared/opt/spack/opt/spack/linux-centos7-cascadelake/gcc-9.2.0/lumerical-2020a-r5-mt7ihfs2o3wfpxrn2ciw2oqfoqvo34dl/opt/lumerical/2020a/bin/../lib/libQt5Core.so.5
#6 0x00007fffe44cfcec in QGuiApplicationPrivate::init() ()
   from /shared/opt/spack/opt/spack/linux-centos7-cascadelake/gcc-9.2.0/lumerical-2020a-r5-mt7ihfs2o3wfpxrn2ciw2oqfoqvo34dl/opt/lumerical/2020a/bin/../lib/libQt5Gui.so.5
#7 0x00007fffe4cfcca9 in QApplicationPrivate::init() ()
   from /shared/opt/spack/opt/spack/linux-centos7-cascadelake/gcc-9.2.0/lumerical-2020a-r5-mt7ihfs2o3wfpxrn2ciw2oqfoqvo34dl/opt/lumerical/2020a/bin/../lib/libQt5Widgets.so.5
#8 0x0000000001f17345 in ?? ()
#9 0x00000000005286bb in ?? ()
#10 0x00007fffdfcd9505 in __libc_start_main () from /lib64/libc.so.6
#11 0x0000000000522201 in ?? ()
#12 0x00007fffffff3928 in ?? ()
#13 0x000000000000001c in ?? ()
#14 0x0000000000000004 in ?? ()
#15 0x00007fffffff3c5e in ?? ()
#16 0x00007fffffff3cfd in ?? ()
#17 0x00007fffffff3d01 in ?? ()
#18 0x00007fffffff3d06 in ?? ()
#19 0x0000000000000000 in ?? ()
So it seems that the Qt5 application cannot initialize, possibly due
to the X server not being ready? I tried adding a delay before
starting the GUI application, but that didn't seem to help.
Do you have any idea of where to look for relevant errors?
/var/log/messages indicates that the app crashed, without any
additional information.
The nodes are running on CentOS 7.
Let me know if additional info is needed.
Cheers,
Joey Dumont
Technical Advisor, Knowledge, Information, and Technology Services
National Research Council Canada / Government of Canada
joey.dum...@nrc-cnrc.gc.ca / Tel: 613-990-8152 / Cell: 438-340-7436