Hello,
Please note that one possible cause is non-equivalent allocations.
DMTCP cannot restore processes that were originally running on the same node
onto different nodes. This means that if you originally requested the
following allocation: cn[0-1], ppn = 4
and then try to restart on cn[0-3], ppn = 2,
it won't work, even though the allocations are logically equivalent.
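
To make the constraint concrete, here is a minimal sketch of the check in bash. It assumes that each original node has to map to its own node in the restart allocation; that is a simplification for illustration and not necessarily the exact algorithm DMTCP uses. `can_map` is a hypothetical helper, not a DMTCP or Slurm command, and the process counts are passed in by hand.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DMTCP): succeeds if every group of
# processes that shared a node at checkpoint time fits on one node of the
# restart allocation.
# Usage: can_map "<processes per original node>" "<slots per restart node>"
can_map() {
    local orig new i
    # Sort both lists in descending order so the largest group is matched
    # against the largest restart node first. $1 and $2 are deliberately
    # unquoted so that they split into one number per line.
    orig=($(printf '%s\n' $1 | sort -nr))
    new=($(printf '%s\n' $2 | sort -nr))

    # Assumption: one original node maps to one restart node, so there
    # must be at least as many restart nodes as original nodes.
    if (( ${#orig[@]} > ${#new[@]} )); then
        return 1
    fi
    for i in "${!orig[@]}"; do
        # A group cannot be split across nodes, so it must fit entirely.
        if (( orig[i] > new[i] )); then
            return 1
        fi
    done
    return 0
}

# Checkpoint taken on 2 nodes x 4 processes, restart on 4 nodes x 2 slots.
if can_map "4 4" "2 2 2 2"; then
    echo "ok"
else
    echo "cannot map"    # prints: cannot map
fi
```

Both allocations above hold 8 processes in total, but no restart node can take a whole group of 4, so the check fails. Restarting on an allocation with at least two nodes of 4 slots each, for example `can_map "4 4" "4 4"`, passes.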

2015-10-08 16:00 GMT+03:00 abderrahmane <[email protected]>:

> Hello
>
> I did it and still got the restart error: cannot map initial resources into
> the restart allocation.
>
> I also tried Open MPI 1.8.8 and got the same error message.
>
>
> On 10/06/2015 07:06 PM, Jiajun Cao wrote:
>
> Hi,
>
> Could you replace
>
> dmtcp_launch --rm mpirun --mca btl self,tcp ./<your binary>
>
> with the following:
>
> srun dmtcp_launch --rm ./<your binary>
>
> Also, add the following env vars to the script:
>
> export OMPI_MCA_mtl=^psm
> export OMPI_MCA_btl=self,tcp
>
> and try again?
>
> On Tue, Oct 6, 2015 at 4:41 PM, abderrahmane <[email protected]> wrote:
>
>> Hello
>> Thanks for the response.
>>
>>
>> On 10/06/2015 02:18 PM, Jiajun Cao wrote:
>>
>> Hi,
>>
>>
>> 1. What kind of application are you running? Is there an integration of
>> matlab and mpi? I'm asking because I haven't run any mpi-based matlab
>> applications before.
>>
>> I just created a script that calculates a Fibonacci number and prints it out.
>>
>> 2. What kind of environment are you using? Specifically, I'd like to know
>> the MPI version, interconnect network type (Ethernet or InfiniBand), and
>> how MPI and Slurm are integrated (i.e., in the cluster, what command do you
>> use to run the application, srun or mpirun).
>>
>> I am using RHEL 7 and Open MPI 1.8 over InfiniBand. Slurm is
>> integrated in a cluster environment; I used the script here:
>>
>> https://github.com/dmtcp/dmtcp/blob/master/plugin/batch-queue/job_examples/slurm_launch.job
>>
>> 3. Do you get a valid checkpoint image(s)? Also, please attach your job
>> scripts.
>>
>> I get the checkpoint needed, but when I restart I receive the error I sent.
>>
>> Thanks
>>
>>
>> On Tue, Oct 6, 2015 at 1:29 PM, Kapil Arya <[email protected]> wrote:
>>
>>> Jiajun, Artem,
>>>
>>> Can one of you take a look at this one?
>>>
>>> Kapil
>>>
>>> On Tue, Oct 6, 2015 at 12:31 PM, abderrahmane <[email protected]> wrote:
>>>
>>>> Hello
>>>>
>>>> Thank you for your effort and work on DMTCP. I have some questions:
>>>> (P.S.: I run my MATLAB code using --rm mpirun and Slurm.)
>>>>
>>>> 1- Is there a good way to run MATLAB code? I created a bash file and
>>>> added the following:
>>>>      matlab -nojvm < file.m
>>>>
>>>> 2- Running the code above with DMTCP and MATLAB worked fine, but when I
>>>> tried to restart it using the slurm_restart.job script from your GitHub
>>>> repository with --rm mpirun, I received the following error:
>>>>
>>>> restart error: cannot map initial resources into the restart allocation.
>>>> Allocated resources : *nodex:4  nodey:4
>>>>
>>>> Any ideas? Please feel free to ask me more questions.
>>>>
>>>> Best regards,
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> _______________________________________________
>>>> Dmtcp-forum mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
>>>>
>>>
>>>
>>
>>
>
>


-- 
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov