[OMPI devel] trunk hangs since r19010

2008-07-28 Thread Lenny Verkhovsky
Hi,

I'm experiencing hangs in tests (latency) since r19010.

Best Regards

Lenny.


Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Jeff Squyres

Is this related to r1378?


On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:

> Hi, I'm experiencing hangs in tests (latency) since r19010.

--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Jeff Squyres

On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:


> Is this related to r1378?


Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.



On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:

> Hi, I'm experiencing hangs in tests (latency) since r19010.

--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Lenny Verkhovsky
I believe it is.

On 7/28/08, Jeff Squyres  wrote:
> Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.


Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Ralph Castain
It could also be something new. Brad and I noted on Fri that IB was  
locking up as soon as we tried any cross-node communications. Hadn't  
seen that before, and at least I haven't explored it further - planned  
to do so today.



On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:

> I believe it is.




[OMPI devel] Funny warning message

2008-07-28 Thread Ralph Castain
Just got this warning today while trying to test IB connections. Last  
I checked, 32 was indeed smaller than 192...


--
WARNING: rd_win specification is non optimal. For maximum performance
it is advisable to configure rd_win smaller then (rd_num - rd_low),
but currently
rd_win = 32 and (rd_num - rd_low) = 192.
--

Ralph



[OMPI devel] 1.3 build failing on MacOSX

2008-07-28 Thread Greg Watson

I'm getting the following when I try to build 1.3 from SVN:

 gcc -DHAVE_CONFIG_H -I. -I../../adio/include -DOMPI_BUILDING=1
   -I/Users/greg/Documents/workspaces/ptp_head/ompi/ompi/mca/io/romio/romio/../../../../..
   -I/Users/greg/Documents/workspaces/ptp_head/ompi/ompi/mca/io/romio/romio/../../../../../opal/include
   -I../../../../../../../opal/include -I../../../../../../../ompi/include
   -I/Users/greg/Documents/workspaces/ptp_head/ompi/ompi/mca/io/romio/romio/include
   -I/Users/greg/Documents/workspaces/ptp_head/ompi/ompi/mca/io/romio/romio/adio/include
   -D_REENTRANT -g -Wall -Wundef -Wno-long-long -Wsign-compare
   -Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic
   -Wno-long-double -Werror-implicit-function-declaration
   -finline-functions -fno-strict-aliasing -DHAVE_ROMIOCONF_H
   -DHAVE_ROMIOCONF_H -I../../include -MT ad_write_nolock.lo -MD -MP
   -MF .deps/ad_write_nolock.Tpo -c ad_write_nolock.c -fno-common -DPIC
   -o .libs/ad_write_nolock.o

ad_write_nolock.c: In function ‘ADIOI_NOLOCK_WriteStrided’:
ad_write_nolock.c:92: error: implicit declaration of function ‘lseek64’
make[5]: *** [ad_write_nolock.lo] Error 1
make[4]: *** [all-recursive] Error 1
make[3]: *** [all-recursive] Error 1
make[2]: *** [all-recursive] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1

Configured with:

./configure --with-platform=contrib/platform/lanl/macosx-dynamic

Any ideas?

Greg


Re: [OMPI devel] 1.3 build failing on MacOSX

2008-07-28 Thread Jeff Squyres

Blast.  Looks like a problem with the new ROMIO I brought in last week.

I'll fix shortly; thanks for the heads-up.


On Jul 28, 2008, at 9:36 AM, Greg Watson wrote:

> ad_write_nolock.c:92: error: implicit declaration of function ‘lseek64’



--
Jeff Squyres
Cisco Systems




[OMPI devel] MCA base changes

2008-07-28 Thread Jeff Squyres
With the update on #1400, I think we're ready to push the MCA base  
changes to the SVN trunk.  Speak now if you object, or forever hold  
your peace.  The most notable parts of this commit:


- add "register" function to mca_base_component_t
  - converted coll:basic and paffinity:linux and paffinity:solaris to  
use this function
  --> we'll convert the rest over time (I'll file a ticket once all  
this is committed)


- add 32 bytes of "reserved" space to the end of mca_base_component_t  
and mca_base_component_data_2_0_0_t to make future upgrades [slightly]  
easier

  - new mca_base_component_t size: 196 bytes
  - new mca_base_component_data_2_0_0_t size: 36 bytes

- MCA base version bumped to v2.0
  - **We now refuse to load components that are not MCA v2.0.x**

- all MCA framework versions bumped to v2.0

- be a little more explicit about version numbers in the MCA base
  - add big comment in mca.h about versioning philosophy

It's a pretty big commit because it touches a lot of files (although  
most are just changing the version number); I'll commit it this evening.
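
To sketch the shape of these changes (the names below are illustrative
only -- see mca.h on the trunk for the real definitions):

/* Illustrative sketch only -- not the actual mca.h contents */
typedef int (*mca_base_register_fn_t)(void);

struct mca_base_component_sketch {
    int version_major, version_minor, version_release;  /* now 2.0.x */
    /* ... framework/component name and version fields,
       open/close/query function pointers ... */
    mca_base_register_fn_t register_params;  /* the new "register" hook */
    char reserved[32];  /* padding so future additions don't break
                           older components */
};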


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Ralph Castain
I checked this out some more and I believe it is ticket #1378 related.
We lock up if sm is included in the BTLs, which is what I had done in
my test. If I ^sm, I can run fine.



On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:

> It could also be something new. Brad and I noted on Fri that IB was
> locking up as soon as we tried any cross-node communications.




Re: [OMPI devel] Funny warning message

2008-07-28 Thread Lenny Verkhovsky
It seems that the error crept into the help file.

Index: ompi/mca/btl/openib/help-mpi-btl-openib.txt
===
--- ompi/mca/btl/openib/help-mpi-btl-openib.txt (revision 19054)
+++ ompi/mca/btl/openib/help-mpi-btl-openib.txt (working copy)
@@ -497,7 +497,7 @@
 #
 [non optimal rd_win]
 WARNING: rd_win specification is non optimal. For maximum performance it is
-advisable to configure rd_win smaller then (rd_num - rd_low), but currently
+advisable to configure rd_win bigger then (rd_num - rd_low), but currently
 rd_win = %d and (rd_num - rd_low) = %d.
 #
 [apm without lmc]

Best regards

Lenny

On 7/28/08, Ralph Castain  wrote:

> Just got this warning today while trying to test IB connections. Last
> I checked, 32 was indeed smaller than 192...


Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Lenny Verkhovsky
I failed to run either on different nodes or on the same node via self,openib.



On 7/28/08, Ralph Castain  wrote:

> I checked this out some more and I believe it is ticket #1378
> related. We lock up if sm is included in the BTLs, which is what I
> had done in my test. If I ^sm, I can run fine.


Re: [OMPI devel] Funny warning message

2008-07-28 Thread Adrian Knoth
On Mon, Jul 28, 2008 at 05:14:29PM +0300, Lenny Verkhovsky wrote:

> -advisable to configure rd_win smaller then (rd_num - rd_low), but currently
> +advisable to configure rd_win bigger then (rd_num - rd_low), but currently
                                        ^^^^ s/then/than/


-- 
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany

private: http://adi.thur.de


[OMPI devel] RFC: MCA DSO filename

2008-07-28 Thread Jeff Squyres
WHAT: Rename MCA DSO filenames from "mca_<framework>_<component>.so"
to "libmca_<framework>_<component>.so" (backwards compatibility can be
preserved if we want it; see below)


WHY: Allows simplifying component Makefile.am's

WHEN: No real rush; just wanted to get the idea out there (does *not*  
need to be before v1.3; more explanation below)


WHERE: autogen.sh, some stuff in opal/mca/base, and every component's  
Makefile.am


TIMEOUT: Fri, 8 Aug 2008



In reviewing some old SVN/HG trees that I had hanging around, I
discovered one aimed at significantly simplifying (and slightly
optimizing) component Makefile.am's.  I believe that these ideas came
from Brian, Ralf, and possibly others.  Here's a "simple" current
Makefile.am (the TCP BTL):



https://svn.open-mpi.org/trac/ompi/browser/trunk/ompi/mca/btl/tcp/Makefile.am

At the end of this mail, I include what the meat of the TCP BTL  
Makefile.am can be reduced to.


However, to do this, we need to use the same output filename for both  
the static and dynamic builds (i.e., as a standalone DSO and as a  
convenience LT library).  Libtool will complain if we build a  
convenience library with a filename that does not begin with "lib".


Note that there are two parts involved:

1. touching each Makefile.am and converting to the simpler format.
2. converting the MCA base to look for "libmca_<framework>_<component>"
filenames.  NOTE: we can optionally have the MCA base *also* look for
the old-style name "mca_<framework>_<component>" if backwards
compatibility is desired.
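
As a sketch of what that lookup could look like (the function and
names below are invented for illustration; this is not the actual MCA
base code, which goes through libltdl):

#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper, illustration only: try the new "libmca_" name
 * first, then fall back to the old-style "mca_" name for backwards
 * compatibility. */
static void *open_component(const char *dir, const char *framework,
                            const char *component)
{
    char path[1024];
    void *handle;

    snprintf(path, sizeof(path), "%s/libmca_%s_%s.so",
             dir, framework, component);
    if (NULL != (handle = dlopen(path, RTLD_NOW | RTLD_GLOBAL))) {
        return handle;
    }

    /* Backwards compatibility: old-style filename */
    snprintf(path, sizeof(path), "%s/mca_%s_%s.so",
             dir, framework, component);
    return dlopen(path, RTLD_NOW | RTLD_GLOBAL);
}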


Because of the backwards compatibility possibility, there is no need  
to do this before v1.3 -- it could be done for v1.3.x or even v1.4  
(there's no real rush).  It's just an idea that has been around for a  
while, so I thought I'd turn it into an RFC.  If the community agrees,  
I'll likely file a ticket about this and we'll get to it someday.


Below is what the TCP BTL Makefile.am can be reduced to (compare the  
end of this file to the end of the current TCP BTL Makefile.am).  Note  
that the whole "if" logic at the end could possibly be hidden in  
autogen -- I haven't thought that through, but it's a possibility (we  
can't hide that stuff in autogen until we unify the output filename;  
we can't do it in today's build system, for example).


-
libmca_btl_tcp_la_SOURCES = \
btl_tcp.c \
btl_tcp.h \
btl_tcp_addr.h \
btl_tcp_component.c \
btl_tcp_endpoint.c \
btl_tcp_endpoint.h \
btl_tcp_frag.c \
btl_tcp_frag.h \
btl_tcp_hdr.h \
btl_tcp_proc.c \
btl_tcp_proc.h \
btl_tcp_ft.c \
btl_tcp_ft.h
libmca_btl_tcp_la_LDFLAGS = -module -avoid-version

if OMPI_BUILD_btl_tcp_DSO
mcacomponentdir = $(pkglibdir)
mcacomponent_LTLIBRARIES = libmca_btl_tcp.la
else
noinst_LTLIBRARIES = libmca_btl_tcp.la
endif
-

--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] Funny warning message

2008-07-28 Thread Jeff Squyres
I think Lenny is pointing out that "smaller" got changed to "bigger",  
too.  :-)


Looking at the test in the code (btl_openib_component.c):

    if ((rd_num - rd_low) > rd_win) {
        orte_show_help("help-mpi-btl-openib.txt", "non optimal rd_win",
                       true, rd_win, rd_num - rd_low);
    }

So the change in the help message is correct -- it is better when
rd_win is bigger than (rd_num - rd_low).  (With Ralph's values, rd_num
- rd_low = 192 > rd_win = 32, so the warning correctly fired; only the
message text had the comparison backwards.)


Ralph -- were you running with a non-default btl_openib_receive_queues?






--
Jeff Squyres
Cisco Systems




Re: [OMPI devel] Funny warning message

2008-07-28 Thread Ralph Castain


On Jul 28, 2008, at 8:22 AM, Jeff Squyres wrote:

> So the change in the help message is correct -- it is better when
> rd_win is bigger than (rd_num - rd_low).
>
> Ralph -- were you running with a non-default
> btl_openib_receive_queues?

Yep...was using a queue layout from Brad that is pretty complex. I was  
just pointing out that the warning's stated condition was met, so  
either the warning text is wrong or the test that generates it is wrong.











Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Brad Benton
My experience is the same as Lenny's.  I've tested on x86_64 and ppc64
systems, and tests using --mca btl openib,self hang in all cases.

--brad


2008/7/28 Lenny Verkhovsky:

> I failed to run either on different nodes or on the same node via
> self,openib.


Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Ralph Castain
Interesting - you are quite correct and I should have been more  
precise. I ran with -mca btl openib and it worked. So having just  
openib seems to be okay.




On Jul 28, 2008, at 8:37 AM, Brad Benton wrote:

> My experience is the same as Lenny's.  I've tested on x86_64 and
> ppc64 systems, and tests using --mca btl openib,self hang in all
> cases.




Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Jeff Squyres

FWIW, all my MTT runs are hanging as well.


On Jul 28, 2008, at 10:37 AM, Brad Benton wrote:

> My experience is the same as Lenny's.



--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Lenny Verkhovsky
Only openib works for me too, but Gleb said to me once that it's
illegal and that I always need to use the self BTL.

On 7/28/08, Jeff Squyres  wrote:

> FWIW, all my MTT runs are hanging as well.


Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Ralph Castain


On Jul 28, 2008, at 8:52 AM, Lenny Verkhovsky wrote:

> Only openib works for me too, but Gleb said to me once that it's
> illegal and that I always need to use the self BTL.




Don't know - could be true. But if that is true, then we should check
to see if that condition is met and error out - with an appropriate
message - if so. Otherwise, how is a user supposed to know about this
condition?









Re: [OMPI devel] 1.3 build failing on MacOSX

2008-07-28 Thread Jeff Squyres
Looking into it a bit more, the situation is a little convoluted.   
I've filed https://svn.open-mpi.org/trac/ompi/ticket/1419; followups  
will occur there.
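
For context: Mac OS X has no lseek64() -- its off_t is already 64 bits
-- so the usual fix is a shim along these lines (a sketch only;
HAVE_LSEEK64 is a hypothetical configure-time macro, and this is not
necessarily the actual ROMIO change):

#include <sys/types.h>
#include <unistd.h>

/* Sketch of a portability shim -- illustration, not the actual fix.
 * On platforms without lseek64(), off_t is typically already 64 bits,
 * so plain lseek() covers the 64-bit case. */
#ifndef HAVE_LSEEK64
#define lseek64(fd, offset, whence) lseek((fd), (off_t)(offset), (whence))
#endif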



On Jul 28, 2008, at 9:42 AM, Jeff Squyres wrote:

> Blast.  Looks like a problem with the new ROMIO I brought in last
> week.  I'll fix shortly; thanks for the heads-up.



--
Jeff Squyres
Cisco Systems




[OMPI devel] Change in slot_list specification

2008-07-28 Thread Ralph Castain

Just an FYI for those of you working with slot_lists.

Lenny, Jeff and I have changed the mca param associated with how you  
specify the slot list you want the rank_file mapper to use. This was  
done to avoid the possibility of ORTE processes such as mpirun and  
orted accidentally binding themselves to cores. The prior param was  
identical to the one used to tell MPI procs their core bindings - so  
if someone ever modified the paffinity system to detect the param and  
automatically perform the binding, mpirun and orted could both bind  
themselves to the specified cores...which isn't what we would want.


The new param is "rmaps_base_slot_list". To make life easier, we also  
added a new orterun cmd line option --slot-list which acts as a  
shorthand for the new mca param.
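
For example, these two invocations are now equivalent (the slot-list
value here is purely illustrative):

  mpirun -mca rmaps_base_slot_list "0,1" -np 2 ./a.out
  mpirun --slot-list "0,1" -np 2 ./a.out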


Ralph



[OMPI devel] Change in hostfile behavior

2008-07-28 Thread Ralph Castain
Per an earlier telecon, I have modified the hostfile behavior slightly  
to allow hostfiles to subdivide allocations.


Briefly: given an allocation, we allow users to specify --hostfile on  
a per-app_context basis. In this mode, the hostfile info is used to  
filter the nodes that will be used for that app_context. However, the  
prior implementation only filtered the nodes themselves - i.e., it was  
a binary filter that allowed you to include or exclude an entire node.


The change now allows you to include a specified #slots for a given  
node as opposed to -all- slots from that node. You are limited to the  
#slots included in the original allocation. I just realized that I  
hadn't output a warning if you attempt to violate this condition -  
will do so shortly. Rather than just abort if this happens, I set the  
allocation to that of the original - please let me know if you would  
prefer it to abort.
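
As an illustration (node name and counts invented): if the original
allocation gives you nodeA with 8 slots, an app_context hostfile line
like

  nodeA slots=4

now restricts that app_context to 4 of nodeA's 8 slots, instead of
including or excluding nodeA entirely.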


If you have interest in this behavior, please check it out and let me  
know if this meets needs.


Ralph



Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread George Bosilca
I'm a little bit lost here. You're stating that openib,self doesn't
work while openib does? In other words, that adding self to the BTL
list leads to deadlocks?

  george.

PS: Btw, it is not supposed to work at all, except in the case where
openib handles internal messages (where the source and destination are
the same process).


On Jul 28, 2008, at 5:05 PM, Ralph Castain wrote:

> Don't know - could be true. But if that is true, then we should check
> to see if that condition is met and error out - with an appropriate
> message - if so. Otherwise, how is a user supposed to know about this
> condition?


Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Ralph Castain

I just re-tested to confirm, and that is correct.

-mca btl openib          works
-mca btl openib,self     hangs
-mca btl openib,sm       works


On Jul 28, 2008, at 9:49 AM, George Bosilca wrote:

> I'm a little bit lost here. You're stating that openib,self doesn't
> work while openib does? In other words, that adding self to the BTL
> list leads to deadlocks?




Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread George Bosilca
Interesting. The self BTL is only used for local communications. I
didn't expect that any benchmark executes such communications, but
apparently I was wrong. Please let me know the failing test; I will
take a look this evening.


  Thanks,
george.

On Jul 28, 2008, at 5:56 PM, Ralph Castain wrote:

> I just re-tested to confirm, and that is correct.
>
> -mca btl openib          works
> -mca btl openib,self     hangs
> -mca btl openib,sm       works


Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Jeff Squyres

On Jul 28, 2008, at 12:03 PM, George Bosilca wrote:

> Interesting. The self BTL is only used for local communications. I
> didn't expect that any benchmark executes such communications, but
> apparently I was wrong. Please let me know the failing test; I will
> take a look this evening.

FWIW, my manual tests of a simplistic "ring" program work for all
combinations (openib, openib+self, openib+self+sm).  Shrug.

But for OSU latency, I found that openib and openib+sm work, but
openib+sm+self hangs (same results whether the 2 procs are on the same
node or different nodes).  There is no self communication in
osu_latency, so something else must be going on.
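
For reference, a minimal sketch of such a ring test (illustrative
only, not the actual program I ran):

#include <mpi.h>
#include <stdio.h>

/* Minimal MPI ring sketch.  Rank 0 injects a token; every rank
 * receives from its left neighbor and sends to its right neighbor. */
int main(int argc, char **argv)
{
    int rank, size, token;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {          /* need at least 2 procs for a ring */
        MPI_Finalize();
        return 0;
    }

    if (0 == rank) {
        token = 42;
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("token made it around the ring: %d\n", token);
    } else {
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}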


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Ralph Castain
My test wasn't a benchmark - I was just testing with a little program  
that calls mpi_init, mpi_barrier, and mpi_finalize.


A test with just mpi_init/finalize works fine, so it looks like we  
simply hang when trying to communicate. This also only happens on  
multi-node operations.
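
Something along these lines (a sketch, not the actual test code):

#include <mpi.h>

/* Sketch of the reproducer described above -- illustrative only. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Barrier(MPI_COMM_WORLD);   /* hangs here with openib,self */
    MPI_Finalize();
    return 0;
}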


On Jul 28, 2008, at 10:16 AM, Jeff Squyres wrote:

> But for OSU latency, I found that openib and openib+sm work, but
> openib+sm+self hangs (same results whether the 2 procs are on the
> same node or different nodes).  There is no self communication in
> osu_latency, so something else must be going on.




Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Jeff Squyres

On Jul 28, 2008, at 11:05 AM, Ralph Castain wrote:

> > Only openib works for me too, but Gleb said to me once that it's
> > illegal and that I always need to use the self BTL.
>
> Don't know - could be true. But if that is true, then we should
> check to see if that condition is met and error out - with an
> appropriate message - if so. Otherwise, how is a user supposed to
> know about this condition?


This used to be true, but I think we changed it a while ago (Pasha: do  
you remember?) because Mellanox HCAs are capable of send-to-self  
(process) and there were no code changes necessary to enable it.  So  
it allowed a slightly simpler command line.  This was quite a while  
ago, IIRC.


All current iWARP adapters do not allow loopback communication at all
(i.e., communication to either the same proc or other procs on the
same host), so we added the following test in openib's add_procs:

    if (IBV_TRANSPORT_IWARP == openib_btl->device->ib_dev->transport_type &&
        0 != (ompi_proc->proc_flags & OMPI_PROC_FLAG_LOCAL)) {
        continue;
    }

(meaning: skip this proc if it's on the same host; let btl self handle
it, etc.)


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] Change in hostfile behavior

2008-07-28 Thread Tim Mattox
My only concern is how this will interact with PLPA.
Say two Open MPI jobs each use "half" the cores (slots) on a
particular node... how would they be able to bind themselves to
a disjoint set of cores?  I'm not asking you to solve this, Ralph; I'm
just pointing it out so we can maybe warn users that if both jobs
sharing a node try to use processor affinity, we don't make that
magically work well, and in fact we would expect it to do quite poorly.

I could see disabling paffinity and/or warning if it was enabled for
one of these "fractional" nodes.

On Mon, Jul 28, 2008 at 11:43 AM, Ralph Castain  wrote:

> Per an earlier telecon, I have modified the hostfile behavior
> slightly to allow hostfiles to subdivide allocations.



-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 tmat...@gmail.com || timat...@open-mpi.org
 I'm a bright... http://www.the-brights.net/


Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Terry Dontje

Jeff Squyres wrote:

> But for OSU latency, I found that openib and openib+sm work, but
> openib+sm+self hangs.  There is no self communication in osu_latency,
> so something else must be going on.


Is it something to do with the MPI_Barrier call?  osu_latency uses 
MPI_Barrier and from rhc's email it sounds like his code does too.


--td


Re: [OMPI devel] Change in hostfile behavior

2008-07-28 Thread Ralph Castain
Actually, this is true today regardless of this change. If two  
separate mpirun invocations share a node and attempt to use paffinity,  
they will conflict with each other. The problem isn't caused by the  
hostfile sub-allocation. The problem is that the two mpiruns have no  
knowledge of each other's actions, and hence assign node ranks to each  
process independently.


Thus, we would have two procs that think they are node rank=0 and  
should therefore bind to the 0 processor, and so on up the line.


Obviously, if you run within one mpirun and have two app_contexts, the  
hostfile sub-allocation is fine - mpirun will track node rank across  
the app_contexts. It is only the use of multiple mpiruns that share  
nodes that causes the problem.


Several of us have discussed this problem and have a proposed solution  
for 1.4. Once we get past 1.3 (someday!), we'll bring it to the group.



On Jul 28, 2008, at 10:44 AM, Tim Mattox wrote:

> My only concern is how this will interact with PLPA. Say two Open MPI
> jobs each use "half" the cores (slots) on a particular node... how
> would they be able to bind themselves to a disjoint set of cores?




[OMPI devel] parallel debugger attach

2008-07-28 Thread Jeff Squyres
I think I fixed the parallel debugger attach stuff in an hg tree --
can interested parties test it out at their own sites before I bring
it back to the SVN trunk?  It should be working for both Allinea DDT
and TotalView.


HG:
http://www.open-mpi.org/hg/hgwebdir.cgi/jsquyres/debugger-stuff/
Ticket:
https://svn.open-mpi.org/trac/ompi/ticket/1361

Thanks.

--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Brad Benton
On Mon, Jul 28, 2008 at 12:08 PM, Terry Dontje  wrote:

> Is it something to do with the MPI_Barrier call?  osu_latency uses
> MPI_Barrier and from rhc's email it sounds like his code does too.


I don't think it's an issue with MPI_Barrier().  I'm running into this
problem with srtest.c (one of the example programs from the mpich
distribution).  It's a ring-type test with no barriers until the end, yet it
hangs on the very first Send/Recv pair from rank0 to rank1.

In my case, openib and openib+sm work, but openib+self and
openib+sm+self hang.

--brad




[OMPI devel] MCA_BTL_BASE_VERSION_1_0_1 and MCA_BTL_BASE_VERSION_1_0_0

2008-07-28 Thread Jeff Squyres
Since the trunk has now been bumped to MCA v2.0, and all frameworks  
have also been bumped to v2.0, are these two #defines relevant anymore:


MCA_BTL_BASE_VERSION_1_0_1
MCA_BTL_BASE_VERSION_1_0_0

I know there was at least one BTL being developed at an organization  
that may not have kept up with the trunk.  Do we need to put in  
backwards compatibility for that BTL, or should we delete these  
#defines?


--
Jeff Squyres
Cisco Systems