Re: [OMPI devel] OMPI 1.4.3 hangs in gather

2011-01-26 Thread Doron Shoham
Using the flag --mca mpi_preconnect_mpi seems to solve the issue with the
oob connection manager.
This workaround is not scalable, but it looks more and more like a
connection-establishment problem.
I'm still trying to figure out the root cause and how to solve it.
Any ideas would be more than welcome.
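For reference, the workaround is passed on the mpirun command line; the
application name and process count below are placeholders, not from the
original report:

```shell
# Force all MPI connections to be established eagerly during MPI_Init
# instead of lazily on first use (a workaround, not a fix for the
# underlying connection-establishment problem).
# "./my_app" and "-np 64" are placeholders for your own job.
mpirun --mca mpi_preconnect_mpi 1 -np 64 ./my_app
```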


Thanks,
Doron

On Tue, Jan 18, 2011 at 3:29 PM, Terry Dontje wrote:

>  On 01/18/2011 07:48 AM, Jeff Squyres wrote:
>
> IBCM is broken and disabled (has been for a long time).
>
> Did you mean RDMACM?
>
>
>
> No, I think I meant the OMPI oob.
>
> sorry,
>
> --
>
> Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.781.442.2631
> Oracle *- Performance Technologies*
> 95 Network Drive, Burlington, MA 01803
> Email terry.don...@oracle.com
>
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>


Re: [OMPI devel] OMPI-MIGRATE error

2011-01-26 Thread Hugo Meyer
Thanks Josh.

I've already checked the prelink setting and it is set to "no".

I'm going to try with the trunk head, and then I'll let you know how it
goes.

Best regards.

Hugo Meyer

2011/1/25 Joshua Hursey 

> Can you try with the current trunk head (r24296)?
> I just committed a fix for the C/R functionality in which restarts were
> getting stuck. This will likely affect the migration functionality, but I
> have not had an opportunity to test just yet.
>
> Another thing to check is that prelink is turned off on all of your
> machines.
>  https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink
>
> Let me know if the problem persists, and I'll dig into it a bit more.
>
> Thanks,
> Josh
>
> On Jan 24, 2011, at 11:37 AM, Hugo Meyer wrote:
>
> > Hello @ll
> >
> > I've got a problem when I try to use the ompi-migrate command.
> >
> > What I'm doing is executing, for example, the following application on one
> node of a cluster (both processes will run on the same node):
> >
> > mpirun -np 2 -am ft-enable-cr ./whoami 10 10
> >
> > Then, on the same node, I try to migrate the processes to another node:
> >
> > ompi-migrate -x node9 -t node3 14914
> >
> > And then i get this message:
> >
> > [clus9:15620] *** Process received signal ***
> > [clus9:15620] Signal: Segmentation fault (11)
> > [clus9:15620] Signal code: Address not mapped (1)
> > [clus9:15620] Failing at address: (nil)
> > [clus9:15620] [ 0] /lib64/libpthread.so.0 [0x2c0b8d40]
> > [clus9:15620] *** End of error message ***
> > Segmentation fault
> >
> > I assume there may be something wrong with the thread level, but I have
> configured Open MPI like this:
> >
> > ../configure --prefix=/home/hmeyer/desarrollo/ompi-code/binarios/
> --enable-debug --enable-debug-symbols --enable-trace --with-ft=cr
> --disable-ipv6 --enable-opal-multi-threads --enable-ft-thread
> --without-hwloc --disable-vt --with-blcr=/soft/blcr-0.8.2/
> --with-blcr-libdir=/soft/blcr-0.8.2/lib/
> >
> > Checkpoint and restart work fine, but when I restore an application with
> more than one process, it is restored and runs until the last line before
> MPI_FINALIZE(); the processes never finalize, so I assume they never
> actually call MPI_FINALIZE(). With a single process, ompi-checkpoint and
> ompi-restart work great.
> >
> > Best regards.
> >
> > Hugo Meyer
>
> 
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
>
>
>


Re: [OMPI devel] OMPI-MIGRATE error

2011-01-26 Thread Hugo Meyer
Josh.

ompi-checkpoint and its restart are now working great, but the same error
persists with ompi-migrate. I've also tried using "-r", but I get the
same error.

Best regards.

Hugo Meyer



Re: [OMPI devel] OMPI-MIGRATE error

2011-01-26 Thread Joshua Hursey
I found a few more bugs after testing the C/R functionality this morning. I 
just committed some more C/R fixes in r24306 (things are now working correctly 
on my test cluster).
  https://svn.open-mpi.org/trac/ompi/changeset/24306

One thing I just noticed in your original email is that you are specifying the 
wrong parameter for migration (it is different from the standard C/R parameter 
for backwards-compatibility reasons). You need to use the 
'ft-enable-cr-recovery' AMCA parameter:
  mpirun -np 2 -am ft-enable-cr-recovery ./whoami 10 10

If you still get the segmentation fault after upgrading to the current trunk, 
can you send me a backtrace from the core file? That will help me narrow down 
the problem.
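If a core file is produced, the backtrace can be obtained along these
lines (the core file name and location are system-dependent and
illustrative here):

```shell
# Allow core dumps, reproduce the crash, then extract a backtrace.
ulimit -c unlimited
mpirun -np 2 -am ft-enable-cr-recovery ./whoami 10 10
# The core file may be named "core", "core.<pid>", or placed elsewhere
# depending on the system's core_pattern setting.
gdb -batch -ex bt ./whoami core
```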

Thanks,
Josh




Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey