Re: [OMPI devel] OMPI-MIGRATE error
Hi Joshua.

I've tried the migration again, and this is what I get (with processes running on the same node as mpirun):

Terminal 1:

[hmeyer@clus9 whoami]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 2 -am ft-enable-cr-recovery --mca orte_base_help_aggregate 0 ./whoami 10 10
Antes de MPI_Init
Antes de MPI_Init
--------------------------------------------------------------------------
Warning: Could not find any processes to migrate on the nodes specified.
         You provided the following:
         Nodes: node9
         Procs: (null)
--------------------------------------------------------------------------
Soy el número 1 (1)
Terminando, una instrucción antes del finalize
Soy el número 0 (1)
Terminando, una instrucción antes del finalize

Terminal 2:

[hmeyer@clus9 build]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -x node9 -t node3 11724
--------------------------------------------------------------------------
Error: The Job identified by PID (11724) was not able to migrate processes in this
       job. This could be caused by any of the following:
       - Invalid node or rank specified
       - No processes on the indicated node can by migrated
       - Process migration was not enabled for this job. Make sure to indicate
         the proper AMCA file: "-am ft-enable-cr-recovery".
--------------------------------------------------------------------------

Then I tried another way, and got the following:

Terminal 1:

[hmeyer@clus9 whoami]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 3 -am ft-enable-cr-recovery ./whoami 10 10
Antes de MPI_Init
Antes de MPI_Init
Antes de MPI_Init
--------------------------------------------------------------------------
Notice: A migration of this job has been requested.
        The processes below will be migrated.
        Please standby.

        [[40382,1],1] Rank 1 on Node clus9

--------------------------------------------------------------------------
--------------------------------------------------------------------------
Error: The process below has failed. There is no checkpoint available for
       this job, so we are terminating the application since automatic
       recovery cannot occur.

       Internal Name: [[40382,1],1]
       MCW Rank: 1

--------------------------------------------------------------------------
Soy el número 0 (1)
Terminando, una instrucción antes del finalize
Soy el número 2 (1)
Terminando, una instrucción antes del finalize

Terminal 2:

[hmeyer@clus9 build]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -r 1 -t node3 11784
[clus9:11795] *** Process received signal ***
[clus9:11795] Signal: Segmentation fault (11)
[clus9:11795] Signal code: Address not mapped (1)
[clus9:11795] Failing at address: (nil)
[clus9:11795] [ 0] /lib64/libpthread.so.0 [0x2c0b9d40]
[clus9:11795] *** End of error message ***
Segmentation fault

Am I using the ompi-migrate command the right way, or am I missing something? Because the first attempt didn't find any process.

Best Regards.

Hugo Meyer

2011/1/28 Hugo Meyer

> Thanks to you, Joshua.
>
> I will try the procedure with these modifications and I will let you know
> how it goes.
>
> Best Regards.
>
> Hugo Meyer
>
> 2011/1/27 Joshua Hursey
>
>> I believe that this is now fixed on the trunk. All the details are in the
>> commit message:
>> https://svn.open-mpi.org/trac/ompi/changeset/24317
>>
>> In my testing yesterday, I did not test the scenario where the node with
>> mpirun also contains processes (the test cluster I was using does not by
>> default run this way), so I was able to reproduce by running on a single
>> node. There were a couple of bugs that emerged that are fixed in the
>> commit. The two bugs that were hurting you were the TCP socket cleanup
>> (which caused the looping of the automatic recovery) and the incorrect
>> accounting of local process termination (which caused the modex errors).
>>
>> Let me know if that fixes the problems that you were seeing.
>>
>> Thanks for the bug report and your patience while I pursued a fix.
>>
>> -- Josh
>>
>> On Jan 27, 2011, at 11:28 AM, Hugo Meyer wrote:
>>
>> > Hi Josh.
>> >
>> > Thanks for your reply. I'll tell you what I'm getting now from the
>> > executions. When I run without doing a checkpoint I get this output,
>> > and the processes don't finish:
>> >
>> > [hmeyer@clus9 whoami]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 2 -am ft-enable-cr-recovery ./whoami 10 10
>> > Antes de MPI_Init
>> > Antes de MPI_Init
>> > [clus9:04985] [[41167,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>> > [clus9:04985] [[41167,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>> > Soy el número 1 (1)
>> > Terminando, una instrucción antes del finalize
>> > [...]
Re: [OMPI devel] OMPI-MIGRATE error
On Jan 31, 2011, at 6:47 AM, Hugo Meyer wrote:

> I've tried the migration again, and this is what I get (with processes
> running on the same node as mpirun):
> [...]
> Warning: Could not find any processes to migrate on the nodes specified.
>          You provided the following:
>          Nodes: node9
>          Procs: (null)
> [...]

The error message indicates that there were no processes found on 'node9'. Did you confirm that there were processes running on that node?

It is possible that the node name that Open MPI is using is different from what you passed in. For example, it could be fully qualified (e.g., node9.my.domain.com), so you might try that too. MPI_Get_processor_name() should return the name of the node that we are attempting to use, so you could have all processes print it out when they start up.

> Then I tried another way:
> [...]
> [clus9:11795] *** Process received signal ***
> [clus9:11795] Signal: Segmentation fault (11)
> [clus9:11795] Signal code: Address not mapped (1)
> [clus9:11795] Failing at address: (nil)
> [clus9:11795] [ 0] /lib64/libpthread.so.0 [0x2c0b9d40]
> [clus9:11795] *** End of error message ***
> Segmentation fault

Humm. Well, that's not good. It looks like the automatic recovery is jumping in while migrating, which should not be happening. I'll take a look and see if I can reproduce locally.

Thanks,
Josh
Re: [OMPI devel] OMPI-MIGRATE error
So I was not able to reproduce this issue. A couple of notes:

- You can see the node-to-process-rank mapping using the '-display-map'
  command line option to mpirun. This will give you the node names that Open
  MPI is using, and how it intends to lay out the processes. You can use the
  '-display-allocation' option to see all of the nodes that Open MPI knows
  about. Open MPI cannot, currently, migrate to a node that it does not know
  about at startup.

- If the problem persists, add the following MCA parameters to your
  ~/.openmpi/mca-params.conf file and send me a zipped-up text file of the
  output. It might show us where things are going wrong:

    orte_debug_daemons=1
    errmgr_base_verbose=20
    snapc_full_verbose=20

-- Josh

On Jan 31, 2011, at 9:46 AM, Joshua Hursey wrote:

> [...]
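For reference, the three settings above go into the parameter file one `name=value` pair per line; a sketch of the resulting file (assuming the usual mca-params.conf format, where '#' starts a comment):

```
# ~/.openmpi/mca-params.conf
# Extra verbosity for debugging the migration problem
orte_debug_daemons=1
errmgr_base_verbose=20
snapc_full_verbose=20
```

These take effect for every subsequent mpirun by that user, so remember to remove them once the debug output has been collected.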
Re: [OMPI devel] OMPI-MIGRATE error
Hi Josh.

As you say, the first problem was because of the name of the node. But the second problem persists (the segmentation fault). As you asked, I'm sending you the output of a run with the MCA params you gave me. At the end of the file I put the output of the second terminal.

Best Regards

Hugo Meyer

2011/1/31 Joshua Hursey

> So I was not able to reproduce this issue.
> [...]
Re: [OMPI devel] OMPI-MIGRATE error
That helped. There was a missing check in the automatic recovery logic to prevent it from starting up while a migration is in progress. r24326 should fix this bug. The segfault should have just been residual fallout from it. Can you try the current trunk to confirm?

One other thing I noticed in the output is that one of your nodes appears to be asking you for a password (i.e., 'node1'). You may want to make sure that you can log in to that node without a password, as the prompt might otherwise hinder Open MPI's startup mechanism there.

Thanks,
Josh

On Jan 31, 2011, at 12:36 PM, Hugo Meyer wrote:

> Hi Josh.
>
> As you say, the first problem was because of the name of the node. But the
> second problem persists (the segmentation fault).
> [...]