Hi Joshua. I've tried the migration again, and I get the following (while the mpirun process is running):
Terminal 1:

[hmeyer@clus9 whoami]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 2 -am ft-enable-cr-recovery --mca orte_base_help_aggregate 0 ./whoami 10 10
Antes de MPI_Init
Antes de MPI_Init
--------------------------------------------------------------------------
Warning: Could not find any processes to migrate on the nodes specified.
         You provided the following:
  Nodes: node9
  Procs: (null)
--------------------------------------------------------------------------
Soy el número 1 (100000000)
Terminando, una instrucción antes del finalize
Soy el número 0 (100000000)
Terminando, una instrucción antes del finalize

Terminal 2:

[hmeyer@clus9 build]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -x node9 -t node3 11724
--------------------------------------------------------------------------
Error: The Job identified by PID (11724) was not able to migrate processes in this
       job. This could be caused by any of the following:
       - Invalid node or rank specified
       - No processes on the indicated node can by migrated
       - Process migration was not enabled for this job. Make sure to indicate
         the proper AMCA file: "-am ft-enable-cr-recovery".
--------------------------------------------------------------------------

Then I tried another way, and I get the following:

Terminal 1:

[hmeyer@clus9 whoami]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 3 -am ft-enable-cr-recovery ./whoami 10 10
Antes de MPI_Init
Antes de MPI_Init
Antes de MPI_Init
--------------------------------------------------------------------------
Notice: A migration of this job has been requested.
        The processes below will be migrated.
        Please standby.

  [[40382,1],1] Rank 1 on Node clus9

--------------------------------------------------------------------------
--------------------------------------------------------------------------
Error: The process below has failed. There is no checkpoint available for
       this job, so we are terminating the application since automatic
       recovery cannot occur.
Internal Name: [[40382,1],1]
MCW Rank: 1

--------------------------------------------------------------------------
Soy el número 0 (100000000)
Terminando, una instrucción antes del finalize
Soy el número 2 (100000000)
Terminando, una instrucción antes del finalize

Terminal 2:

[hmeyer@clus9 build]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -r 1 -t node3 11784
[clus9:11795] *** Process received signal ***
[clus9:11795] Signal: Segmentation fault (11)
[clus9:11795] Signal code: Address not mapped (1)
[clus9:11795] Failing at address: (nil)
[clus9:11795] [ 0] /lib64/libpthread.so.0 [0x2aaaac0b9d40]
[clus9:11795] *** End of error message ***
Segmentation fault
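
For reference, the full sequence I'm running is roughly the following (a sketch of both attempts; <mpirun PID> stands for the PID of the mpirun in Terminal 1, which was 11724 and 11784 in the runs above):

    # Terminal 1: start the job with the recovery-enabled AMCA file
    # (-np 2 in the first attempt, -np 3 in the second)
    /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 2 \
        -am ft-enable-cr-recovery ./whoami 10 10

    # Terminal 2: migrate by node (first attempt) or by rank (second attempt)
    /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -x node9 -t node3 <mpirun PID>
    /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -r 1 -t node3 <mpirun PID>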

Am I using the ompi-migrate command the right way, or am I missing something? The first attempt didn't find any processes at all.

Best Regards.

Hugo Meyer

2011/1/28 Hugo Meyer <meyer.h...@gmail.com>

> Thanks to you Joshua.
>
> I will try the procedure with these modifications and I will let you know
> how it goes.
>
> Best Regards.
>
> Hugo Meyer
>
> 2011/1/27 Joshua Hursey <jjhur...@open-mpi.org>
>
>> I believe that this is now fixed on the trunk. All the details are in the
>> commit message:
>> https://svn.open-mpi.org/trac/ompi/changeset/24317
>>
>> In my testing yesterday, I did not test the scenario where the node with
>> mpirun also contains processes (the test cluster I was using does not run
>> this way by default). So I was able to reproduce by running on a single
>> node. A couple of bugs emerged that are fixed in the commit. The two bugs
>> that were hurting you were the TCP socket cleanup (which caused the
>> looping of the automatic recovery) and the incorrect accounting of local
>> process termination (which caused the modex errors).
>>
>> Let me know if that fixes the problems that you were seeing.
>>
>> Thanks for the bug report and your patience while I pursued a fix.
>>
>> -- Josh
>>
>> On Jan 27, 2011, at 11:28 AM, Hugo Meyer wrote:
>>
>> > Hi Josh.
>> >
>> > Thanks for your reply. I'll tell you what I'm getting now from the executions.
>> > When I run without taking a checkpoint, I get this output, and the processes don't finish:
>> >
>> > [hmeyer@clus9 whoami]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 2 -am ft-enable-cr-recovery ./whoami 10 10
>> > Antes de MPI_Init
>> > Antes de MPI_Init
>> > [clus9:04985] [[41167,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>> > [clus9:04985] [[41167,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>> > Soy el número 1 (100000000)
>> > Terminando, una instrucción antes del finalize
>> > Soy el número 0 (100000000)
>> > Terminando, una instrucción antes del finalize
>> > --------------------------------------------------------------------------
>> > Error: The process below has failed. There is no checkpoint available for
>> >        this job, so we are terminating the application since automatic
>> >        recovery cannot occur.
>> > Internal Name: [[41167,1],0]
>> > MCW Rank: 0
>> >
>> > --------------------------------------------------------------------------
>> > [clus9:04985] 1 more process has sent help message help-orte-errmgr-hnp.txt / autor_failed_to_recover_proc
>> > [clus9:04985] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>> >
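>> > For reference, the checkpoint mentioned below is taken from a second terminal with something like this (a sketch; the argument is the PID of that run's mpirun):
>> >
>> >     # ask the running job (identified by the mpirun PID) to checkpoint itself
>> >     /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-checkpoint <mpirun PID>
>> >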
>> > If I take a checkpoint of the mpirun process from another terminal during the execution, I get this output:
>> >
>> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at line 350
>> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
>> > [clus9:06107] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
>> > [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at line 350
>> > [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
>> > [clus9:06106] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
>> > --------------------------------------------------------------------------
>> > Notice: The job has been successfully recovered from the
>> >         last checkpoint.
>> > --------------------------------------------------------------------------
>> > Soy el número 1 (100000000)
>> > Terminando, una instrucción antes del finalize
>> > Soy el número 0 (100000000)
>> > Terminando, una instrucción antes del finalize
>> > [clus9:06105] 1 more process has sent help message help-orte-errmgr-hnp.txt / autor_recovering_job
>> > [clus9:06105] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at line 350
>> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
>> > [clus9:06107] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
>> > [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at line 350
>> > [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
>> > [clus9:06106] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
>> > [clus9:06105] 1 more process has sent help message help-orte-errmgr-hnp.txt / autor_recovery_complete
>> > Soy el número 0 (100000000)
>> > Terminando, una instrucción antes del finalize
>> > Soy el número 1 (100000000)
>> > Terminando, una instrucción antes del finalize
>> > [clus9:06105] 1 more process has sent help message help-orte-errmgr-hnp.txt / autor_recovering_job
>> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at line 350
>> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
>> > [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at line 350
>> > [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
>> > [clus9:06106] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
>> > [clus9:06107] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
>> >
>> > As you can see, it keeps looping on the recovery. Then, when I try to migrate these processes using ompi-migrate, I get this:
>> >
>> > [hmeyer@clus9 ~]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -x node9 -t node3 18082
>> > --------------------------------------------------------------------------
>> > Error: The Job identified by PID (18082) was not able to migrate processes in this
>> >        job. This could be caused by any of the following:
>> >        - Invalid node or rank specified
>> >        - No processes on the indicated node can by migrated
>> >        - Process migration was not enabled for this job. Make sure to indicate
>> >          the proper AMCA file: "-am ft-enable-cr-recovery".
>> > --------------------------------------------------------------------------
>> >
>> > But in the terminal where the application is running I get this:
>> >
>> > [hmeyer@clus9 whoami]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 2 -am ft-enable-cr-recovery ./whoami 10 10
>> > Antes de MPI_Init
>> > Antes de MPI_Init
>> > [clus9:18082] [[62740,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>> > [clus9:18082] [[62740,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>> > --------------------------------------------------------------------------
>> > Warning: Could not find any processes to migrate on the nodes specified.
>> >          You provided the following:
>> >   Nodes: node9
>> >   Procs: (null)
>> > --------------------------------------------------------------------------
>> > --------------------------------------------------------------------------
>> > Notice: The processes have been successfully migrated to/from the specified
>> >         machines.
>> > --------------------------------------------------------------------------
>> > Soy el número 1 (100000000)
>> > Terminando, una instrucción antes del finalize
>> > Soy el número 0 (100000000)
>> > Terminando, una instrucción antes del finalize
>> > --------------------------------------------------------------------------
>> > Error: The process below has failed. There is no checkpoint available for
>> >        this job, so we are terminating the application since automatic
>> >        recovery cannot occur.
>> > Internal Name: [[62740,1],0]
>> > MCW Rank: 0
>> >
>> > --------------------------------------------------------------------------
>> > [clus9:18082] 1 more process has sent help message help-orte-errmgr-hnp.txt / autor_failed_to_recover_proc
>> > [clus9:18082] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>> >
>> > I assume that orte_get_job_data_object is the problem, because it is not obtaining the proper value.
>> >
>> > If you need more data, just let me know.
>> >
>> > Best Regards.
>> >
>> > Hugo Meyer
>> >
>> > 2011/1/26 Joshua Hursey <jjhur...@open-mpi.org>
>> >
>> > I found a few more bugs after testing the C/R functionality this morning. I just committed some more C/R fixes in r24306 (things are now working correctly on my test cluster):
>> > https://svn.open-mpi.org/trac/ompi/changeset/24306
>> >
>> > One thing I just noticed in your original email is that you are specifying the wrong parameter for migration (it is different from the standard C/R parameter for backwards-compatibility reasons). You need to use the 'ft-enable-cr-recovery' AMCA parameter:
>> >
>> >     mpirun -np 2 -am ft-enable-cr-recovery ./whoami 10 10
>> >
>> > If you still get the segmentation fault after upgrading to the current trunk, can you send me a backtrace from the core file? That will help me narrow down on the problem.
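>> > Something along these lines should produce one (a sketch; this assumes core dumps are enabled with "ulimit -c unlimited" before reproducing, and the core file name may vary by system, e.g. "core.11795"):
>> >
>> >     # load the crashed binary together with its core file
>> >     gdb /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate core
>> >     (gdb) bt full
>> >     (gdb) thread apply all bt    # if threads are involved, dump every stack
>> >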
>> > Thanks,
>> > Josh
>> >
>> > On Jan 26, 2011, at 8:40 AM, Hugo Meyer wrote:
>> >
>> > > Josh.
>> > >
>> > > ompi-checkpoint and its restart now work great, but the same error persists with ompi-migrate. I've also tried using "-r", but I get the same error.
>> > >
>> > > Best regards.
>> > >
>> > > Hugo Meyer
>> > >
>> > > 2011/1/26 Hugo Meyer <meyer.h...@gmail.com>
>> > > Thanks Josh.
>> > >
>> > > I've already checked the prelink and it is set to "no".
>> > >
>> > > I'm going to try with the trunk head, and then I'll let you know how it goes.
>> > >
>> > > Best regards.
>> > >
>> > > Hugo Meyer
>> > >
>> > > 2011/1/25 Joshua Hursey <jjhur...@open-mpi.org>
>> > >
>> > > Can you try with the current trunk head (r24296)?
>> > > I just committed a fix for the C/R functionality in which restarts were getting stuck. This will likely affect the migration functionality, but I have not had an opportunity to test it just yet.
>> > >
>> > > Another thing to check is that prelink is turned off on all of your machines:
>> > > https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink
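>> > > On Red Hat-style systems, something like this should confirm it (a sketch; the config file path may differ on your distribution):
>> > >
>> > >     # prelinking must be disabled for BLCR restarts to work reliably
>> > >     grep PRELINKING /etc/sysconfig/prelink   # expect PRELINKING=no
>> > >     prelink -ua                              # undo any previously prelinked binaries
>> > >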
>> > > Let me know if the problem persists, and I'll dig into it a bit more.
>> > >
>> > > Thanks,
>> > > Josh
>> > >
>> > > On Jan 24, 2011, at 11:37 AM, Hugo Meyer wrote:
>> > >
>> > > > Hello @ll
>> > > >
>> > > > I've got a problem when I try to use the ompi-migrate command.
>> > > >
>> > > > What I'm doing is running, for example, the following application on one node of a cluster (both processes will run on the same node):
>> > > >
>> > > > mpirun -np 2 -am ft-enable-cr ./whoami 10 10
>> > > >
>> > > > Then, on the same node, I try to migrate the processes to another node:
>> > > >
>> > > > ompi-migrate -x node9 -t node3 14914
>> > > >
>> > > > And then I get this message:
>> > > >
>> > > > [clus9:15620] *** Process received signal ***
>> > > > [clus9:15620] Signal: Segmentation fault (11)
>> > > > [clus9:15620] Signal code: Address not mapped (1)
>> > > > [clus9:15620] Failing at address: (nil)
>> > > > [clus9:15620] [ 0] /lib64/libpthread.so.0 [0x2aaaac0b8d40]
>> > > > [clus9:15620] *** End of error message ***
>> > > > Segmentation fault
>> > > >
>> > > > I assume that maybe there is something wrong with the thread level, but I have configured Open MPI like this:
>> > > >
>> > > > ../configure --prefix=/home/hmeyer/desarrollo/ompi-code/binarios/ --enable-debug --enable-debug-symbols --enable-trace --with-ft=cr --disable-ipv6 --enable-opal-multi-threads --enable-ft-thread --without-hwloc --disable-vt --with-blcr=/soft/blcr-0.8.2/ --with-blcr-libdir=/soft/blcr-0.8.2/lib/
>> > > >
>> > > > Checkpoint and restart work fine, but when I restore an application that has more than one process, it is restored and executed up to the last line before MPI_Finalize(), yet the processes never finalize; I assume they never call MPI_Finalize(). With a single process, ompi-checkpoint and ompi-restart work great.
>> > > >
>> > > > Best regards.
>> > > >
>> > > > Hugo Meyer
>> > > > _______________________________________________
>> > > > devel mailing list
>> > > > de...@open-mpi.org
>> > > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> > >
>> > > ------------------------------------
>> > > Joshua Hursey
>> > > Postdoctoral Research Associate
>> > > Oak Ridge National Laboratory
>> > > http://users.nccs.gov/~jjhursey
>>
>> ------------------------------------
>> Joshua Hursey
>> Postdoctoral Research Associate
>> Oak Ridge National Laboratory
>> http://users.nccs.gov/~jjhursey
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel