What I am trying to say is that srun must load all of its plugins from the same version of SLURM to function properly. I am _guessing_ that srun loaded some plugins from v2.3 and then, when the resource allocation happened, loaded more plugins, which after the upgrade would be from v2.4 and not compatible with the v2.3 srun code (e.g., different arguments for some functions). This is just a guess. You would probably need to instrument the srun code to see exactly what is happening. Running srun with the arguments "-vvvvv -d7" will provide more logging that may be helpful.
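To make the guess above concrete, here is a toy model of the failure mode, not SLURM code. It assumes only that a long-running command loads plugins lazily from an install location that gets replaced underneath it. The `Client` class, the `install_dir` dict and the choice of which plugin loads when are all hypothetical, picked for illustration.

```python
class Client:
    """Toy stand-in for a long-running command that loads plugins on demand."""

    def __init__(self, version, plugin_dir):
        self.version = version
        # Shared, mutable stand-in for the on-disk plugin directory.
        self.plugin_dir = plugin_dir
        self.loaded = {}

    def load(self, name):
        # A lazy load picks up whatever version is installed *right now*.
        self.loaded[name] = self.plugin_dir[name]
        return self.loaded[name]

    def mismatched(self):
        # Plugins whose version differs from the version of the command itself.
        return sorted(n for n, v in self.loaded.items() if v != self.version)


# Hypothetical install: every plugin starts at v2.3.
install_dir = {"auth": "2.3", "select": "2.3", "switch": "2.3"}

client = Client("2.3", install_dir)
client.load("auth")              # loaded at startup, before the upgrade

for name in install_dir:         # the upgrade replaces every plugin in place
    install_dir[name] = "2.4"

client.load("select")            # loaded later, e.g. once resources are granted

print(client.loaded)             # mixed set of versions in one process
print(client.mismatched())
```

In this sketch the process that started on v2.3 ends up holding one v2.3 plugin and one v2.4 plugin, which is the inconsistent state I suspect srun reached.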
Quoting nancy.kritkau...@bull.com:

> Moe,
> I'm sorry I guess I am still a bit confused. I was assuming that I was
> having a similar problem as everyone else on this issue, but just having
> different symptoms. I guess to clarify first. Is it supported that jobs
> can be queued waiting resources in a 2.3.x release and then upgrade to a
> 2.4.1 release and expect these jobs to run once resources are available?
> If this is true, are you saying that we can not have any plugins
> configured if we want to do this? I am surprised that the previous
> reports of problems moving from previous releases to 2.4.1 do not have any
> plugins configured.
> Nancy
>
> From: Moe Jette <je...@schedmd.com>
> To: "slurm-dev" <slurm-dev@schedmd.com>,
> Date: 07/04/2012 10:48 AM
> Subject: [slurm-dev] Re: Problems upgrading to 2.4.0
>
> Did the srun start on v2.3, but not get a resource allocation, then
> continue execution on v2.4? In that case, it could have a combination
> of plugins, some from v2.3 and others from v2.4, which would probably
> not work. That is what I am thinking happened.
>
> Quoting nancy.kritkau...@bull.com:
>
>> Moe,
>> Thank you for your reply, but I am not sure I understand what you are
>> saying. I have the same slurm.conf file for both releases. The srun that
>> is queued is started with the 2.3 release, and I expected it to be started
>> even when I upgrade to V2.4.1 once resources are available. Maybe this is
>> not how it works...
>> Nancy
>>
>> From: Moe Jette <je...@schedmd.com>
>> To: slurm-dev <slurm-dev@schedmd.com>, nancy.kritkau...@bull.com,
>> Date: 07/04/2012 09:12 AM
>> Subject: Re: [slurm-dev] Re: Problems upgrading to 2.4.0
>>
>> RPC 4017 is RESPONSE_JOB_ALLOCATION_INFO_LITE (see
>> src/common/slurm_protocol_defs.h) and that only contains a job id.
>> Nothing in the message contents has changed. Most plugins are loaded
>> on demand rather than all being loaded when a program (e.g., srun)
>> starts. My best guess is that the srun command has some version 2.3
>> plugins loaded and some version 2.4 plugins were loaded after the
>> upgrade, resulting in an inconsistent set of software.
>>
>> You definitely don't want to keep using a version 2.3 srun with
>> version 2.4 daemons. The other commands (sinfo, sbatch, squeue, etc.)
>> should all work with new daemons though.
>>
>> Quoting nancy.kritkau...@bull.com:
>>
>>> Danny,
>>> We are having some trouble with the transition from v2.3.5 to v2.4.1. I
>>> tried to keep the test and logs as simple as possible. I have a single
>>> node and start a job and have a job queued awaiting resources. When I
>>> terminate v2.3.5 and start v2.4.1 the job terminates correctly, but the
>>> queued job does not start, with the following error coming to the console.
>>> The logs are attached as well.
>>> Thanks for any help,
>>> Nancy
>>>
>>> [sulu] (slurm) slurm>srun: error: Invalid Protocol Version 6144 from uid=200 at 141.112.17.124:39306
>>> srun: error: slurm_receive_msg: Protocol version has changed, re-link your code
>>> srun: error: _accept_msg_connection[sulu.gpv.az05.bull.com]: Protocol version has changed, re-link your code
>>> srun: error: Malformed RPC of type 4017 received
>>> srun: error: slurm_receive_msg: Header lengths are longer than data received
>>> srun: error: Invalid Protocol Version 6144 from uid=200 at 141.112.17.124:53548
>>> srun: error: slurm_receive_msg: Protocol version has changed, re-link your code
>>> srun: error: slurm_receive_msg[141.112.17.124]: Protocol version has changed, re-link your code
>>> srun: error: Unable to allocate resources: Header lengths are longer than data received
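A note on the "Invalid Protocol Version 6144" errors quoted above. The sketch below assumes the protocol version is a 16-bit value with a release number in the high byte and a revision in the low byte. Nothing in this thread states that layout, so treat it as an assumption to check against the SLURM headers. The helper names are hypothetical.

```python
def decode_protocol_version(value):
    """Split a 16-bit protocol version into (high byte, low byte).

    ASSUMPTION: release number in the high byte, revision in the low byte.
    """
    return value >> 8, value & 0xFF


def encode_protocol_version(release, revision=0):
    """Inverse of decode_protocol_version, under the same assumed layout."""
    return (release << 8) | revision


release, revision = decode_protocol_version(6144)
print(release, revision)            # 24 0

# Under the same assumption, a release number of 23 would encode as:
print(encode_protocol_version(23))  # 5888
```

If that layout is right, 6144 decodes to 24, which would be consistent with a v2.3 srun rejecting messages sent by v2.4 daemons.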