Re: [OMPI devel] OMPI-MIGRATE error

2011-01-31 Thread Hugo Meyer
Hi Joshua.

I've tried the migration again, and this is what I get (with processes running
on the node where mpirun is running):

Terminal 1:

[hmeyer@clus9 whoami]$
/home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 2 -am
ft-enable-cr-recovery --mca orte_base_help_aggregate 0 ./whoami 10 10
Antes de MPI_Init
Antes de MPI_Init
--
Warning: Could not find any processes to migrate on the nodes specified.
 You provided the following:
Nodes: node9
Procs: (null)
--
Soy el número 1 (1)
Terminando, una instrucción antes del finalize
Soy el número 0 (1)
Terminando, una instrucción antes del finalize


Terminal 2:

[hmeyer@clus9 build]$
/home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -x node9 -t
node3 11724
--
Error: The Job identified by PID (11724) was not able to migrate processes
in this
   job. This could be caused by any of the following:
   - Invalid node or rank specified
   - No processes on the indicated node can by migrated
   - Process migration was not enabled for this job. Make sure to
indicate
 the proper AMCA file: "-am ft-enable-cr-recovery".
--

Then I tried another way, and this is what I get:

Terminal 1:

[hmeyer@clus9 whoami]$
/home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 3 -am
ft-enable-cr-recovery ./whoami 10 10
Antes de MPI_Init
Antes de MPI_Init
Antes de MPI_Init
--
Notice: A migration of this job has been requested.
The processes below will be migrated.
Please standby.
  [[40382,1],1] Rank 1 on Node clus9

--
--
Error: The process below has failed. There is no checkpoint available for
   this job, so we are terminating the application since automatic
   recovery cannot occur.
Internal Name: [[40382,1],1]
MCW Rank: 1

--
Soy el número 0 (1)
Terminando, una instrucción antes del finalize
Soy el número 2 (1)
Terminando, una instrucción antes del finalize

Terminal 2:

[hmeyer@clus9 build]$
/home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -r 1 -t node3
11784
[clus9:11795] *** Process received signal ***
[clus9:11795] Signal: Segmentation fault (11)
[clus9:11795] Signal code: Address not mapped (1)
[clus9:11795] Failing at address: (nil)
[clus9:11795] [ 0] /lib64/libpthread.so.0 [0x2c0b9d40]
[clus9:11795] *** End of error message ***
Segmentation fault

Am I using the ompi-migrate command in the right way, or am I missing
something? The first attempt didn't find any processes.

Best Regards.

Hugo Meyer


2011/1/28 Hugo Meyer 

> Thanks to you Joshua.
>
> I will try the procedure with these modifications and I will let you know
> how it goes.
>
> Best Regards.
>
> Hugo Meyer
>
> 2011/1/27 Joshua Hursey 
>
> I believe that this is now fixed on the trunk. All the details are in the
>> commit message:
>>  https://svn.open-mpi.org/trac/ompi/changeset/24317
>>
>> In my testing yesterday, I did not test the scenario where the node with
>> mpirun also contains processes (the test cluster I was using does not by
>> default run this way). So I was able to reproduce by running on a single
>> node. There were a couple bugs that emerged that are fixed in the commit.
>> The two bugs that were hurting you were the TCP socket cleanup (which caused
>> the looping of the automatic recovery), and the incorrect accounting of
>> local process termination (which caused the modex errors).
>>
>> Let me know if that fixes the problems that you were seeing.
>>
>> Thanks for the bug report and your patience while I pursued a fix.
>>
>> -- Josh
>>
>> On Jan 27, 2011, at 11:28 AM, Hugo Meyer wrote:
>>
>> > Hi Josh.
>> >
>> > Thanks for your reply. Below is what I'm getting now from the
>> executions.
>> > When I run without taking a checkpoint, I get this output, and the
>> processes don't finish:
>> >
>> > [hmeyer@clus9 whoami]$
>> /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 2 -am
>> ft-enable-cr-recovery ./whoami 10 10
>> > Antes de MPI_Init
>> > Antes de MPI_Init
>> > [clus9:04985] [[41167,0],0] ORTE_ERROR_LOG: Error in file
>> ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>> > [clus9:04985] [[41167,0],0] ORTE_ERROR_LOG: Error in file
>> ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>> > Soy el número 1 (1)
>> > Terminando, una instrucción antes del finalize

Re: [OMPI devel] OMPI-MIGRATE error

2011-01-31 Thread Joshua Hursey

On Jan 31, 2011, at 6:47 AM, Hugo Meyer wrote:

> Hi Joshua.
> 
> I've tried the migration again, and this is what I get (with processes running 
> on the node where mpirun is running):
> 
> Terminal 1:
> 
> [hmeyer@clus9 whoami]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun 
> -np 2 -am ft-enable-cr-recovery --mca orte_base_help_aggregate 0 ./whoami 10 
> 10
> Antes de MPI_Init
> Antes de MPI_Init
> --
> Warning: Could not find any processes to migrate on the nodes specified.
>  You provided the following:
> Nodes: node9
> Procs: (null)
> --
> Soy el número 1 (1)
> Terminando, una instrucción antes del finalize
> Soy el número 0 (1)
> Terminando, una instrucción antes del finalize
> 
> Terminal 2:
> 
> [hmeyer@clus9 build]$ 
> /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -x node9 -t node3 
> 11724
> --
> Error: The Job identified by PID (11724) was not able to migrate processes in 
> this
>job. This could be caused by any of the following:
>- Invalid node or rank specified
>- No processes on the indicated node can by migrated
>- Process migration was not enabled for this job. Make sure to indicate
>  the proper AMCA file: "-am ft-enable-cr-recovery".
> --

The error message indicates that there were no processes found on 'node9'. Did 
you confirm that there were processes running on that node?

It is possible that the node name that Open MPI is using is different from what 
you put in. For example, it could be fully qualified (e.g., 
node9.my.domain.com). So you might try that too. MPI_Get_processor_name() 
should return the name of the node that we are attempting to use, so you could 
have all processes print that out when they start up.
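
For example, a minimal sketch (not your actual whoami code, just an
illustration) where every rank prints the node name Open MPI sees:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, len;
    char node[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(node, &len);
    /* This is the node name that ompi-migrate needs to match with -x */
    printf("Rank %d is running on node %s\n", rank, node);

    MPI_Finalize();
    return 0;
}

Whatever each rank prints there is the string that ompi-migrate should be able
to match.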


> Then I tried another way, and this is what I get:
> 
> Terminal 1:
> 
> [hmeyer@clus9 whoami]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun 
> -np 3 -am ft-enable-cr-recovery ./whoami 10 10
> Antes de MPI_Init
> Antes de MPI_Init
> Antes de MPI_Init
> --
> Notice: A migration of this job has been requested.
> The processes below will be migrated.
> Please standby.
>   [[40382,1],1] Rank 1 on Node clus9
> 
> --
> --
> Error: The process below has failed. There is no checkpoint available for
>this job, so we are terminating the application since automatic
>recovery cannot occur.
> Internal Name: [[40382,1],1]
> MCW Rank: 1
> 
> --
> Soy el número 0 (1)
> Terminando, una instrucción antes del finalize
> Soy el número 2 (1)
> Terminando, una instrucción antes del finalize
> 
> Terminal 2:
> 
> [hmeyer@clus9 build]$ 
> /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -r 1 -t node3 
> 11784
> [clus9:11795] *** Process received signal ***
> [clus9:11795] Signal: Segmentation fault (11)
> [clus9:11795] Signal code: Address not mapped (1)
> [clus9:11795] Failing at address: (nil)
> [clus9:11795] [ 0] /lib64/libpthread.so.0 [0x2c0b9d40]
> [clus9:11795] *** End of error message ***
> Segmentation fault

Humm. Well that's not good. It looks like the automatic recovery is jumping in 
while migrating, which should not be happening. I'll take a look and see if I 
can reproduce locally.

Thanks,
Josh

> 
> Am I using the ompi-migrate command in the right way, or am I missing 
> something? The first attempt didn't find any processes.
> 
> Best Regards.
> 
> Hugo Meyer
> 
> 
> 2011/1/28 Hugo Meyer 
> Thanks to you Joshua.
> 
> I will try the procedure with these modifications and I will let you know how 
> it goes.
> 
> Best Regards.
> 
> Hugo Meyer
> 
> 2011/1/27 Joshua Hursey 
> 
> I believe that this is now fixed on the trunk. All the details are in the 
> commit message:
>  https://svn.open-mpi.org/trac/ompi/changeset/24317
> 
> In my testing yesterday, I did not test the scenario where the node with 
> mpirun also contains processes (the test cluster I was using does not by 
> default run this way). So I was able to reproduce by running on a single 
> node. There were a couple bugs that emerged that are fixed in the commit. The 
> two bugs that were hurting you were the TCP socket cleanup (which caused the 
> looping of the automatic recovery), and the incorrect accounting of local 
> process termination (which caused the modex errors).
> 
> Let me know if that fixes the problems that you were seeing.
> 
> Thanks for the bug report and your patience while I pursued a fix.
> 
> -- 

Re: [OMPI devel] OMPI-MIGRATE error

2011-01-31 Thread Joshua Hursey
So I was not able to reproduce this issue.

A couple notes:
 - You can see the node-to-process-rank mapping using the '-display-map' 
command line option to mpirun. This will give you the node names that Open MPI 
is using, and how it intends to lay out the processes (see the example after 
this list). You can use the '-display-allocation' option to see all of the 
nodes that Open MPI knows about. Open MPI cannot, currently, migrate to a node 
that it does not know about on startup.
 - If the problem persists, add the following MCA parameters to your 
~/.openmpi/mca-params.conf file and send me a zipped-up text file of the 
output. It might show us where things are going wrong:

orte_debug_daemons=1
errmgr_base_verbose=20
snapc_full_verbose=20
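
For the mapping check, something along these lines (reusing the paths and
arguments from your earlier runs, so adjust as needed) should print the
allocation and the per-rank node names before your application output:

/home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 2 -display-map \
    -display-allocation -am ft-enable-cr-recovery ./whoami 10 10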


-- Josh

On Jan 31, 2011, at 9:46 AM, Joshua Hursey wrote:

> 
> On Jan 31, 2011, at 6:47 AM, Hugo Meyer wrote:
> 
>> Hi Joshua.
>> 
>> I've tried the migration again, and this is what I get (with processes running 
>> on the node where mpirun is running):
>> 
>> Terminal 1:
>> 
>> [hmeyer@clus9 whoami]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun 
>> -np 2 -am ft-enable-cr-recovery --mca orte_base_help_aggregate 0 ./whoami 10 
>> 10
>> Antes de MPI_Init
>> Antes de MPI_Init
>> --
>> Warning: Could not find any processes to migrate on the nodes specified.
>> You provided the following:
>> Nodes: node9
>> Procs: (null)
>> --
>> Soy el número 1 (1)
>> Terminando, una instrucción antes del finalize
>> Soy el número 0 (1)
>> Terminando, una instrucción antes del finalize
>> 
>> Terminal 2:
>> 
>> [hmeyer@clus9 build]$ 
>> /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -x node9 -t 
>> node3 11724
>> --
>> Error: The Job identified by PID (11724) was not able to migrate processes 
>> in this
>>   job. This could be caused by any of the following:
>>   - Invalid node or rank specified
>>   - No processes on the indicated node can by migrated
>>   - Process migration was not enabled for this job. Make sure to indicate
>> the proper AMCA file: "-am ft-enable-cr-recovery".
>> --
> 
> The error message indicates that there were no processes found on 'node9'. 
> Did you confirm that there were processes running on that node?
> 
> It is possible that the node name that Open MPI is using is different from 
> what you put in. For example, it could be fully qualified (e.g., 
> node9.my.domain.com). So you might try that too. MPI_Get_processor_name() 
> should return the name of the node that we are attempting to use, so you 
> could have all processes print that out when they start up.
> 
> 
>> Then I tried another way, and this is what I get:
>> 
>> Terminal 1:
>> 
>> [hmeyer@clus9 whoami]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun 
>> -np 3 -am ft-enable-cr-recovery ./whoami 10 10
>> Antes de MPI_Init
>> Antes de MPI_Init
>> Antes de MPI_Init
>> --
>> Notice: A migration of this job has been requested.
>>The processes below will be migrated.
>>Please standby.
>>  [[40382,1],1] Rank 1 on Node clus9
>> 
>> --
>> --
>> Error: The process below has failed. There is no checkpoint available for
>>   this job, so we are terminating the application since automatic
>>   recovery cannot occur.
>> Internal Name: [[40382,1],1]
>> MCW Rank: 1
>> 
>> --
>> Soy el número 0 (1)
>> Terminando, una instrucción antes del finalize
>> Soy el número 2 (1)
>> Terminando, una instrucción antes del finalize
>> 
>> Terminal 2:
>> 
>> [hmeyer@clus9 build]$ 
>> /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -r 1 -t node3 
>> 11784
>> [clus9:11795] *** Process received signal ***
>> [clus9:11795] Signal: Segmentation fault (11)
>> [clus9:11795] Signal code: Address not mapped (1)
>> [clus9:11795] Failing at address: (nil)
>> [clus9:11795] [ 0] /lib64/libpthread.so.0 [0x2c0b9d40]
>> [clus9:11795] *** End of error message ***
>> Segmentation fault
> 
> Humm. Well that's not good. It looks like the automatic recovery is jumping 
> in while migrating, which should not be happening. I'll take a look and see 
> if I can reproduce locally.
> 
> Thanks,
> Josh
> 
>> 
>> Am I using the ompi-migrate command in the right way, or am I missing 
>> something? The first attempt didn't find any processes.
>> 
>> Best Regards.
>> 
>> Hugo Meyer
>> 
>> 
>> 2011/1/28 Hugo Meyer 
>> Thanks to you Joshua.
>> 
>> I will try the procedure with these modifications

Re: [OMPI devel] OMPI-MIGRATE error

2011-01-31 Thread Hugo Meyer
Hi Josh.

As you said, the first problem was caused by the name of the node. But the
second problem persists (the segmentation fault). As you asked, I'm sending you
the output of an execution with the MCA params that you gave me. At the end of
the file I put the output of the second terminal.

Best Regards

Hugo Meyer

2011/1/31 Joshua Hursey 

> So I was not able to reproduce this issue.
>
> A couple notes:
>  - You can see the node-to-process-rank mapping using the '-display-map'
> command line option to mpirun. This will give you the node names that Open
> MPI is using, and how it intends to lay out the processes. You can use the
> '-display-allocation' option to see all of the nodes that Open MPI knows
> about. Open MPI cannot, currently, migrate to a node that it does not know
> about on startup.
>  - If the problem persists, add the following MCA parameters to your
> ~/.openmpi/mca-params.conf file and send me a zipped-up text file of the
> output. It might show us where things are going wrong:
> 
> orte_debug_daemons=1
> errmgr_base_verbose=20
> snapc_full_verbose=20
> 
>
> -- Josh
>
> On Jan 31, 2011, at 9:46 AM, Joshua Hursey wrote:
>
> >
> > On Jan 31, 2011, at 6:47 AM, Hugo Meyer wrote:
> >
> >> Hi Joshua.
> >>
> >> I've tried the migration again, and this is what I get (with processes
> running on the node where mpirun is running):
> >>
> >> Terminal 1:
> >>
> >> [hmeyer@clus9 whoami]$
> /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 2 -am
> ft-enable-cr-recovery --mca orte_base_help_aggregate 0 ./whoami 10 10
> >> Antes de MPI_Init
> >> Antes de MPI_Init
> >>
> --
> >> Warning: Could not find any processes to migrate on the nodes specified.
> >> You provided the following:
> >> Nodes: node9
> >> Procs: (null)
> >>
> --
> >> Soy el número 1 (1)
> >> Terminando, una instrucción antes del finalize
> >> Soy el número 0 (1)
> >> Terminando, una instrucción antes del finalize
> >>
> >> Terminal 2:
> >>
> >> [hmeyer@clus9 build]$
> /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -x node9 -t
> node3 11724
> >>
> --
> >> Error: The Job identified by PID (11724) was not able to migrate
> processes in this
> >>   job. This could be caused by any of the following:
> >>   - Invalid node or rank specified
> >>   - No processes on the indicated node can by migrated
> >>   - Process migration was not enabled for this job. Make sure to
> indicate
> >> the proper AMCA file: "-am ft-enable-cr-recovery".
> >>
> --
> >
> > The error message indicates that there were no processes found on
> 'node9'. Did you confirm that there were processes running on that node?
> >
> > It is possible that the node name that Open MPI is using is different
> from what you put in. For example, it could be fully qualified (e.g.,
> node9.my.domain.com). So you might try that too. MPI_Get_processor_name()
> should return the name of the node that we are attempting to use, so you
> could have all processes print that out when they start up.
> >
> >
> >> Then I tried another way, and this is what I get:
> >>
> >> Terminal 1:
> >>
> >> [hmeyer@clus9 whoami]$
> /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 3 -am
> ft-enable-cr-recovery ./whoami 10 10
> >> Antes de MPI_Init
> >> Antes de MPI_Init
> >> Antes de MPI_Init
> >>
> --
> >> Notice: A migration of this job has been requested.
> >>The processes below will be migrated.
> >>Please standby.
> >>  [[40382,1],1] Rank 1 on Node clus9
> >>
> >>
> --
> >>
> --
> >> Error: The process below has failed. There is no checkpoint available
> for
> >>   this job, so we are terminating the application since automatic
> >>   recovery cannot occur.
> >> Internal Name: [[40382,1],1]
> >> MCW Rank: 1
> >>
> >>
> --
> >> Soy el número 0 (1)
> >> Terminando, una instrucción antes del finalize
> >> Soy el número 2 (1)
> >> Terminando, una instrucción antes del finalize
> >>
> >> Terminal 2:
> >>
> >> [hmeyer@clus9 build]$
> /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -r 1 -t node3
> 11784
> >> [clus9:11795] *** Process received signal ***
> >> [clus9:11795] Signal: Segmentation fault (11)
> >> [clus9:11795] Signal code: Address not mapped (1)
> >> [clus9:11795] Failing at address: (nil)
> >> [clus9:11795] [ 0] /lib64/libpthread.so.0 [0x2c0b9d40]
> >> [clus9:11795] *** End of error message ***
> >> Segmentation fault

Re: [OMPI devel] OMPI-MIGRATE error

2011-01-31 Thread Joshua Hursey
That helped. The automatic recovery logic was missing the check that prevents 
it from starting up while a migration is in progress; r24326 should fix this 
bug. The segfault was most likely just residual fallout from that bug. Can you 
try the current trunk to confirm?

One other thing I noticed in the output is that it looks like one of your nodes 
is asking you for a password (i.e., 'node1'). You may want to make sure that 
you can log in without a password on that node, as it might otherwise hinder 
Open MPI's startup mechanism on that node.
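
If it is a plain ssh setup, one common way to do that (assuming OpenSSH and 
that your home directory is shared across the nodes, or that you copy the key 
over yourself) is roughly:

ssh-keygen -t rsa      # accept the defaults, empty passphrase
ssh-copy-id node1      # after this, 'ssh node1' should not ask for a password

but any equivalent key-based login setup will do.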

Thanks,
Josh

On Jan 31, 2011, at 12:36 PM, Hugo Meyer wrote:

> Hi Josh.
> 
> As you said, the first problem was caused by the name of the node. But the 
> second problem persists (the segmentation fault). As you asked, I'm sending you 
> the output of an execution with the MCA params that you gave me. At the end of 
> the file I put the output of the second terminal.
> 
> Best Regards
> 
> Hugo Meyer
> 
> 2011/1/31 Joshua Hursey 
> So I was not able to reproduce this issue.
> 
> A couple notes:
>  - You can see the node-to-process-rank mapping using the '-display-map' 
> command line option to mpirun. This will give you the node names that Open 
> MPI is using, and how it intends to lay out the processes. You can use the 
> '-display-allocation' option to see all of the nodes that Open MPI knows 
> about. Open MPI cannot, currently, migrate to a node that it does not know 
> about on startup.
>  - If the problem persists, add the following MCA parameters to your 
> ~/.openmpi/mca-params.conf file and send me a zipped-up text file of the 
> output. It might show us where things are going wrong:
> 
> orte_debug_daemons=1
> errmgr_base_verbose=20
> snapc_full_verbose=20
> 
> 
> -- Josh
> 
> On Jan 31, 2011, at 9:46 AM, Joshua Hursey wrote:
> 
> >
> > On Jan 31, 2011, at 6:47 AM, Hugo Meyer wrote:
> >
> >> Hi Joshua.
> >>
> >> I've tried the migration again, and this is what I get (with processes 
> >> running on the node where mpirun is running):
> >>
> >> Terminal 1:
> >>
> >> [hmeyer@clus9 whoami]$ 
> >> /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 2 -am 
> >> ft-enable-cr-recovery --mca orte_base_help_aggregate 0 ./whoami 10 10
> >> Antes de MPI_Init
> >> Antes de MPI_Init
> >> --
> >> Warning: Could not find any processes to migrate on the nodes specified.
> >> You provided the following:
> >> Nodes: node9
> >> Procs: (null)
> >> --
> >> Soy el número 1 (1)
> >> Terminando, una instrucción antes del finalize
> >> Soy el número 0 (1)
> >> Terminando, una instrucción antes del finalize
> >>
> >> Terminal 2:
> >>
> >> [hmeyer@clus9 build]$ 
> >> /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -x node9 -t 
> >> node3 11724
> >> --
> >> Error: The Job identified by PID (11724) was not able to migrate processes 
> >> in this
> >>   job. This could be caused by any of the following:
> >>   - Invalid node or rank specified
> >>   - No processes on the indicated node can by migrated
> >>   - Process migration was not enabled for this job. Make sure to 
> >> indicate
> >> the proper AMCA file: "-am ft-enable-cr-recovery".
> >> --
> >
> > The error message indicates that there were no processes found on 'node9'. 
> > Did you confirm that there were processes running on that node?
> >
> > It is possible that the node name that Open MPI is using is different from 
> > what you put in. For example, it could be fully qualified (e.g., 
> > node9.my.domain.com). So you might try that too. MPI_Get_processor_name() 
> > should return the name of the node that we are attempting to use, so you 
> > could have all processes print that out when they start up.
> >
> >
> >> Then I tried another way, and this is what I get:
> >>
> >> Terminal 1:
> >>
> >> [hmeyer@clus9 whoami]$ 
> >> /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 3 -am 
> >> ft-enable-cr-recovery ./whoami 10 10
> >> Antes de MPI_Init
> >> Antes de MPI_Init
> >> Antes de MPI_Init
> >> --
> >> Notice: A migration of this job has been requested.
> >>The processes below will be migrated.
> >>Please standby.
> >>  [[40382,1],1] Rank 1 on Node clus9
> >>
> >> --
> >> --
> >> Error: The process below has failed. There is no checkpoint available for
> >>   this job, so we are terminating the application since automatic
> >>   recovery cannot occur.
> >> Internal Name: [[40382,1],1]
> >> MCW Rank: 1
> >>
> >> --