Hey,
Thank you for the email. Is there a way to make it work, or do I have to keep
variables to "remember" the exact allocations?
On Friday, October 9, 2015 4:34 AM, Artem Polyakov <[email protected]>
wrote:
Hello,
Please note that one of the reasons may be non-equivalent allocations.
DMTCP cannot restore processes that were originally running on the same node
onto different nodes. This means that if you originally requested the
following allocation: cn[0-1], ppn = 4, and then try to restart on cn[0-4],
ppn = 2, this won't work even though the allocations are logically equivalent.
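On the question of "remembering" the allocation, here is a minimal sketch,
assuming the job runs under Slurm so that SLURM_JOB_NUM_NODES and
SLURM_TASKS_PER_NODE are set in the job environment. It records the allocation
shape at launch and refuses to continue at restart if the shape differs. The
helper names (alloc_shape, save_alloc_shape, check_alloc_shape) and the shape
file are made up for illustration and are not part of DMTCP. Requiring an
identical shape is probably stricter than necessary, but it is a simple and
safe check.

```shell
#!/bin/bash
# Sketch only: hypothetical helpers, not part of DMTCP or Slurm.

# Print the current allocation shape, e.g. "2 4(x2)" for 2 nodes with
# 4 tasks each. Falls back to "unknown" when not running under Slurm.
alloc_shape() {
    echo "${SLURM_JOB_NUM_NODES:-unknown} ${SLURM_TASKS_PER_NODE:-unknown}"
}

# Call from the launch job: store the shape in the file given as $1.
save_alloc_shape() {
    alloc_shape > "$1"
}

# Call from the restart job: return 0 if the current shape matches the
# saved one, 1 on a mismatch, 2 if the saved file cannot be read.
check_alloc_shape() {
    local saved current
    saved=$(cat "$1") || return 2
    current=$(alloc_shape)
    if [ "$saved" != "$current" ]; then
        echo "allocation mismatch: launched on '$saved', restarting on '$current'" >&2
        return 1
    fi
}
```

In the launch script one would call save_alloc_shape with some file path
before dmtcp_launch, and in the restart script call check_alloc_shape with
the same path and exit if it fails, before attempting the restart.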
2015-10-08 16:00 GMT+03:00 abderrahmane <[email protected]>:
Hello
I did it and still got: Restart error: cannot map initial resources into the
restart allocation.
I also tried Open MPI 1.8.8 and got the same error message.
On 10/06/2015 07:06 PM, Jiajun Cao wrote:
Hi,
Could you replace
dmtcp_launch --rm mpirun --mca btl self,tcp ./<your binary>
with the following:
srun dmtcp_launch --rm ./<your binary>
Also, add the following env vars to the script:
export OMPI_MCA_mtl=^psm
export OMPI_MCA_btl=self,tcp
and try again?
On Tue, Oct 6, 2015 at 4:41 PM, abderrahmane <[email protected]> wrote:
Hello
Thanks for the response.
On 10/06/2015 02:18 PM, Jiajun Cao wrote:
Hi,
1. What kind of application are you running? Is there an integration of
matlab and mpi? I'm asking because I haven't run any mpi-based matlab
applications before.
I just created a script that calculates a Fibonacci number and prints it out.
2. What kind of environment are you using? Specifically, I'd like to know the
MPI version, interconnect network type (Ethernet or InfiniBand), and how MPI
and Slurm are integrated (i.e., in the cluster, what command do you use to run
the application, srun or mpirun).
I am using RHEL 7 and Open MPI 1.8 over InfiniBand. Slurm is integrated in
a cluster environment; I used the script here:
https://github.com/dmtcp/dmtcp/blob/master/plugin/batch-queue/job_examples/slurm_launch.job
3. Do you get a valid checkpoint image(s)? Also, please attach your job
scripts.
I get the checkpoint images, but when I restart I receive the error I sent.
Thanks
On Tue, Oct 6, 2015 at 1:29 PM, Kapil Arya <[email protected]> wrote:
Jiajun, Artem,
Can one of you take a look at this one?
Kapil
On Tue, Oct 6, 2015 at 12:31 PM, abderrahmane <[email protected]> wrote:
Hello
Thank you for the effort and work on DMTCP. I do have some questions.
(P.S.: I run my MATLAB code using --rm mpirun and Slurm.)
1- Is there a good way to run MATLAB code? I created a bash file and
added the following:
matlab -nojvm < file.m
2- Running the code above with DMTCP and MATLAB worked fine, but when I
tried to restart it using the slurm_restart.job script from your GitHub
repository with --rm mpirun, I received the following error:
restart error: cannot map initial resources into the restart allocation.
Allocated resources : *nodex:4 nodey:4
Any ideas? Please feel free to ask me more questions.
Best regards,
------------------------------------------------------------------------------
_______________________________________________
Dmtcp-forum mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
--
Best regards, Artem Y. Polyakov