On 28/6/22 12:19 pm, Jean-Christophe HAESSIG wrote:

Hi,

I'm facing a weird issue where launching a job through drmaa
(https://github.com/natefoo/slurm-drmaa) aborts with the message "Plugin
is corrupted", but only when that job is placed from one of my compute
nodes. Running the command from the login node seems to work.

I suspect this is where your error is happening:

https://github.com/SchedMD/slurm/blob/1ce55318222f89fbc862ce559edfd17e911fee38/src/common/plugin.c#L284

That is where Slurm checks that it can load the plugin without hitting any unresolved library symbols. The fact that you are hitting this suggests that the compute nodes are missing libraries that are present on the login node (or, if they are present, something is preventing them from being found).
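
One way to compare the two nodes is to run ldd over the plugin files on each and look for "not found" entries. Below is a minimal sketch of that idea, not a definitive tool. It assumes ldd is available on the node, and the function names are my own invention. You pass it the plugin paths yourself; the plugin directory varies by install (it is the PluginDir setting in slurm.conf).

```python
import subprocess
import sys


def missing_libraries(ldd_output: str) -> list[str]:
    """Return the library names that ldd reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # ldd prints an unresolved dependency as "libfoo.so.1 => not found"
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def check_plugin(path: str) -> list[str]:
    """Run ldd on one shared object and list its unresolved dependencies."""
    # Assumption: ldd is on PATH; a non-zero exit status is tolerated here
    result = subprocess.run(
        ["ldd", path], capture_output=True, text=True, check=False
    )
    return missing_libraries(result.stdout)


if __name__ == "__main__":
    # Usage: python3 check_plugins.py <plugin .so files>
    for plugin in sys.argv[1:]:
        for lib in check_plugin(plugin):
            print(f"{plugin}: missing {lib}")
```

Running it on both the login node and a compute node and comparing the output should show whether a library is absent on one side. Note that this only catches missing shared libraries, not individual undefined symbols inside a library that is present.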

[...]
Anyway, the message seems to originate from libslurm36 and I would like
to activate the debug messages (debug3, debug4). Is there a way to do
this with an environment variable or any other convenient method ?

This depends on which part of Slurm is generating these errors. Is it something like sbatch or srun? If so, using multiple -v's will increase the debug level so you can pick those up. If it's from slurmd then you'll want to set SlurmdDebug to "debug3" in your slurm.conf.

Once that's done you should get the information on what symbols are not being found and that should give you some insight into what's going on.

Best of luck,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
