[OMPI devel] problem when binding to socket on a single socket node

2010-04-09 Thread Nadia Derbey
Hi,

I am facing a problem with a test that runs fine on some nodes, and
fails on others.

I have a heterogeneous cluster, with 3 types of nodes:
1) single socket, 4 cores
2) 2 sockets, 4 cores per socket
3) 2 sockets, 6 cores per socket

I am using:
 . salloc to allocate the nodes,
 . mpirun binding/mapping options "-bind-to-socket -bysocket"

# salloc -N 1 mpirun -n 4 -bind-to-socket -bysocket sleep 900

This command fails if the allocated node is of type #1 (single socket /
4 cpus), while it succeeds on nodes of type #2 or #3.
BTW, in the failing case orte_show_help references a tag
("could-not-bind-to-socket") that does not exist in
help-odls-default.txt.

I think a "bind to socket" should not return an error on a single-socket
machine, but rather be a no-op.

The problem comes from the test
OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
called in odls_default_fork_local_proc() after the process has been
bound to its socket:


    OPAL_PAFFINITY_CPU_ZERO(mask);
    for (n=0; n < orte_default_num_cores_per_socket; n++) {

        OPAL_PAFFINITY_CPU_SET(phys_cpu, mask);
    }
    /* if we did not bind it anywhere, then that is an error */
    OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
    if (!bound) {
        orte_show_help("help-odls-default.txt",
                       "odls-default:could-not-bind-to-socket", true);
        ORTE_ODLS_ERROR_OUT(ORTE_ERR_FATAL);
    }

OPAL_PAFFINITY_PROCESS_IS_BOUND() will return true only if there are
bits set in the mask *AND* the number of bits set is less than the
number of cpus on the machine. Thus on a single-socket, 4-core machine
the test will fail, while on the other kinds of machines it will succeed.
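
To illustrate, here is a minimal standalone sketch of that semantics
(this is just my reading of the macro's effect, not the actual OPAL
source):

    #include <stdio.h>

    /* "bound" here means: at least one cpu is in the mask AND the mask
     * does not cover every cpu on the node (sketch only, not OPAL code) */
    static int is_bound_sketch(unsigned long mask, int num_cpus)
    {
        int i, nset = 0;
        for (i = 0; i < num_cpus; i++) {
            if (mask & (1UL << i)) {
                nset++;
            }
        }
        return (nset > 0) && (nset < num_cpus);
    }

    int main(void)
    {
        /* binding to a socket sets one bit per core of that socket */
        printf("1 socket  x 4 cores: %d\n", is_bound_sketch(0xf, 4)); /* 0 -> "not bound" -> error */
        printf("2 sockets x 4 cores: %d\n", is_bound_sketch(0xf, 8)); /* 1 -> "bound", no error   */
        return 0;
    }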

Again, I think the problem could be solved by changing the algorithm
and treating ORTE_BIND_TO_SOCKET as a no-op on a single-socket machine.

Another solution could be to call OPAL_PAFFINITY_PROCESS_IS_BOUND() at
the end of the loop only if we were already bound at startup
(orte_odls_globals.bound). Actually, that is the only case where I see
a justification for this test (see attached patch).

And maybe both solutions could be combined; see the rough sketch below.
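
For illustration only, an untested sketch of what the combined check
might look like ("num_local_sockets" is just a placeholder for however
the launcher learns the local socket count, not an existing field):

    /* skip the sanity check entirely on a single-socket node, and
     * otherwise only perform it when we were actually bound at startup
     * (sketch only -- num_local_sockets is a placeholder name) */
    if (num_local_sockets > 1 && orte_odls_globals.bound) {
        OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
        if (!bound) {
            orte_show_help("help-odls-default.txt",
                           "odls-default:could-not-bind-to-socket", true);
            ORTE_ODLS_ERROR_OUT(ORTE_ERR_FATAL);
        }
    }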

Regards,
Nadia


-- 
Nadia Derbey 
[Attachment: 001_fix_process_binding_test.patch]

Do not test actual process binding in obvious cases

diff -r 0b851b2e7934 orte/mca/odls/default/odls_default_module.c
--- a/orte/mca/odls/default/odls_default_module.c	Thu Mar 18 16:10:25 2010 +0100
+++ b/orte/mca/odls/default/odls_default_module.c	Fri Apr 09 11:38:28 2010 +0200
@@ -747,12 +747,16 @@ static int odls_default_fork_local_proc(
  target_socket, phys_core, phys_cpu));
 OPAL_PAFFINITY_CPU_SET(phys_cpu, mask);
 }
-/* if we did not bind it anywhere, then that is an error */
-OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
-if (!bound) {
-orte_show_help("help-odls-default.txt",
-   "odls-default:could-not-bind-to-socket", true);
-ORTE_ODLS_ERROR_OUT(ORTE_ERR_FATAL);
+/* if we actually did not bind it anywhere and it was
+ * originally bound then that is an error
+ */
+if (orte_odls_globals.bound) {
+OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
+if (!bound) {
+orte_show_help("help-odls-default.txt",
+   "odls-default:could-not-bind-to-socket", true);
+ORTE_ODLS_ERROR_OUT(ORTE_ERR_FATAL);
+}
 }
 if (orte_report_bindings) {
 opal_output(0, "%s odls:default:fork binding child %s to socket %d cpus %04lx",


Re: [OMPI devel] problem when binding to socket on a single socket node

2010-04-09 Thread Ralph Castain
Just to check: is this with the latest trunk? Brad and Terry have been making 
changes to this section of code, including modifying the PROCESS_IS_BOUND 
test...


On Apr 9, 2010, at 3:39 AM, Nadia Derbey wrote:

> [original message and attachment quoted in full; elided]




Re: [OMPI devel] problem when binding to socket on a single socket node

2010-04-09 Thread Nadia Derbey
On Fri, 2010-04-09 at 08:41 -0600, Ralph Castain wrote:
> Just to check: is this with the latest trunk? Brad and Terry have been making 
> changes to this section of code, including modifying the PROCESS_IS_BOUND 
> test...
> 
> 

Well, it was on v1.5. But I just checked; it looks like:
  1. the call to OPAL_PAFFINITY_PROCESS_IS_BOUND is still there in
     odls_default_fork_local_proc()
  2. OPAL_PAFFINITY_PROCESS_IS_BOUND() is defined the same way

But I'll give it a try with the latest trunk.

Regards,
Nadia

> On Apr 9, 2010, at 3:39 AM, Nadia Derbey wrote:
> > [original message quoted in full; elided]
-- 
Nadia Derbey 



Re: [OMPI devel] problem when binding to socket on a single socket node

2010-04-09 Thread Terry Dontje

Nadia Derbey wrote:
> On Fri, 2010-04-09 at 08:41 -0600, Ralph Castain wrote:
>> Just to check: is this with the latest trunk? Brad and Terry have been
>> making changes to this section of code, including modifying the
>> PROCESS_IS_BOUND test...
>
> Well, it was on v1.5. But I just checked; it looks like:
>   1. the call to OPAL_PAFFINITY_PROCESS_IS_BOUND is still there in
>      odls_default_fork_local_proc()
>   2. OPAL_PAFFINITY_PROCESS_IS_BOUND() is defined the same way
>
> But I'll give it a try with the latest trunk.
>
> Regards,
> Nadia
>
The changes I've done do not touch OPAL_PAFFINITY_PROCESS_IS_BOUND at
all.  Also, I am only touching code related to the "bind-to-core" option,
so I really doubt that my changes are causing issues here.


--td

> On Apr 9, 2010, at 3:39 AM, Nadia Derbey wrote:
> > [original message quoted in full; elided]





--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.650.633.7054
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com



Re: [OMPI devel] problem when binding to socket on a single socket node

2010-04-09 Thread Ralph Castain
Okay, just wanted to ensure everyone was working from the same base code.

Terry, Brad: you might want to look this proposed change over. Something 
doesn't quite look right to me, but I haven't really walked through the code to 
check it.


On Apr 9, 2010, at 9:33 AM, Terry Dontje wrote:

> [Terry's reply and the earlier messages quoted in full; elided]



[OMPI devel] Fwd: [all-osl-users] Upgrading trac

2010-04-09 Thread Jeff Squyres
FYI -- short Trac outage next Monday.

Begin forwarded message:

> From: "Kim, DongInn"
> Date: April 9, 2010 2:19:33 PM EDT
> Subject: [all-osl-users] Upgrading trac
> 
> The OSL will upgrade trac from 0.11.5 to 0.11.7 on sourcehaven next
> Monday (4/12/2010).
> 
> Date: Monday, April 12, 2010.
> Time:
> - 5:00am-5:30am Pacific US time
> - 6:00am-6:30am Mountain US time
> - 7:00am-7:30am Central US time
> - 8:00am-8:30am Eastern US time
> - 12:00pm-12:30pm GMT
> 
> 
> We will have a short outage (half an hour) of the apache server, and the
> following services will not be available during the upgrade.
> 
> - trac
> - subversion
> - svn check-in hook to trac
> 
> Please let me know if you have any questions or issues with this upgrade.
> 
> --
> - DongInn
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] problem when binding to socket on a single socket node

2010-04-09 Thread Terry Dontje

Ralph Castain wrote:
> Okay, just wanted to ensure everyone was working from the same base code.
>
> Terry, Brad: you might want to look this proposed change over.
> Something doesn't quite look right to me, but I haven't really walked
> through the code to check it.


At first blush I don't really get the usage of orte_odls_globals.bound
in your patch.  It would seem to me that the insertion of that
conditional would prevent the check it surrounds from being done when
the process has not been bound prior to startup, which is a common case.


--td




On Apr 9, 2010, at 9:33 AM, Terry Dontje wrote:
> [earlier messages quoted in full; elided]



--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.650.633.7054
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com



Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-09 Thread Oliver Geisler
Sorry for replying late. Unfortunately I am not a "full time
administrator", and I am going to be at a conference next week, so
please be patient with my replies.

On 4/7/2010 6:56 PM, Eugene Loh wrote:
> Oliver Geisler wrote:
> 
>> Using netpipe and comparing tcp and mpi communication I get the
>> following results:
>>
>> TCP is much faster than MPI, approx. by factor 12
>>  
>>
> Faster?  12x?  I don't understand the following:
> 
>> e.g a packet size of 4096 bytes deliveres in
>> 97.11 usec with NPtcp and
>> 15338.98 usec with NPmpi
>>  
>>
> This implies NPtcp is 160x faster than NPmpi.
> 

The ratio NPtcp/NPmpi has a mean value of about 60 for small packet
sizes (<4 kB), a maximum of 160 at 4 kB (it was a bad value to pick out
in the first place), then drops to about 40 for packet sizes around
16 kB and falls below 20 for packets larger than 100 kB.


>> or
>> packet size 262kb
>> 0.05268801 sec NPtcp
>> 0.00254560 sec NPmpi
>>  
>>
> This implies NPtcp is 20x slower than NPmpi.
> 

Sorry, my fault ... vice versa, should read:
packet size 262kb
0.00254560 sec NPtcp
0.05268801 sec NPmpi


>> Further our benchmark started with "--mca btl tcp,self" runs with short
>> communication times, even using kernel 2.6.33.1
>>
>> Is there a way to see what type of communication is actually selected?
>>
>> Can anybody imagine why shared memory leads to these problems?
>>  
>>
> Okay, so it's a shared-memory performance problem since:
> 
> 1) You get better performance when you exclude sm explicitly with "--mca
> btl tcp,self".
> 2) You get better performance when you exclude sm by distributing one
> process per node (an observation you made relatively early in this thread).
> 3) TCP is faster than MPI (which is presumably using sm).
> 
> Can you run a pingpong test as a function of message length for two
> processes in a way that demonstrates the problem?  For example, if
> you're comfortable with SKaMPI, just look at Pingpong_Send_Recv and
> let's see what performance looks like as a function of message length. 
> Maybe this is a short-message-latency problem.

These are the results of skampi pt2pt, first with shared memory allowed,
second with shared memory excluded.
It doesn't look to me as if the long-message times are related to the
short-message ones.
Including hosts over ethernet results in higher communication times,
which are equal to what I get when I ping the host (a hundred-plus
milliseconds).

mpirun --mca btl self,sm,tcp -np 2 ./skampi -i ski/skampi_pt2pt.ski

# begin result "Pingpong_Send_Recv"
count=    1      4   12756.0   307.4   16   11555.3   11011.2
count=    2      8    9902.8   629.0   16    9615.4    8601.0
count=    3     12   12547.5   881.0   16   12233.1   11229.2
count=    4     16   12087.2   829.6   16   11610.6   10478.6
count=    6     24   13634.4   352.1   16   11247.8   12621.9
count=    8     32   13835.8   282.2   16   11091.7   12944.6
count=   11     44   13328.9   864.6   16   12095.6   11977.0
count=   16     64   13195.2   432.3   16   11460.4   10051.9
count=   23     92   13849.3   532.5   16   12476.9   12998.1
count=   32    128   14202.2   436.4   16   11923.8   12977.4
count=   45    180   14026.3   637.7   16   13042.5   12767.8
count=   64    256   13475.8   466.7   16   11720.4   12521.3
count=   91    364   14015.0   406.1   16   13300.4   12881.6
count=  128    512   13481.3   870.6   16   11187.7   12070.6
count=  181    724   10697.1    98.4   16   10697.1    9520.1
count=  256   1024   14120.8   602.1   16   13988.2   11349.9
count=  362   1448   15718.2   582.3   16   14468.2   12535.2
count=  512   2048   11214.9   749.1   16   11155.0    9928.5
count=  724   2896   15127.3   186.1   16   15127.3   10974.9
count= 1024   4096   34045.0   692.2   16   32963.6   31728.1
count= 1448   5792   29965.9   788.1   16   27997.8   27404.4
count= 2048   8192   30082.1   785.3   16   28023.9   29538.5
count= 2896  11584   32556.0   219.4   16   29312.2   32290.4
count= 4096  16384   24999.8   839.6   16   23422.0   23644.6
# end result "Pingpong_Send_Recv"
# duration = 10.15 sec

mpirun --mca btl tcp,self -np 2 ./skampi -i ski/skampi_pt2pt.ski

# begin result "Pingpong_Send_Recv"
count=    1      4   14.5   0.3   16   13.5   13.2
count=    2      8   13.5   0.2    8   12.9   12.4
count=    3     12   13.1   0.4   16   12.7   11.3
count=    4     16   13.9   0.4   16   12.7   13.0
count=    6     24   13.8   0.4   16   12.5   12.8
count=    8     32   13.8   0.4   16   12.7   13.0
count=   11     44   14.0   0.3   16   12.8   13.0
count=   16     64   13.5   0.5   16   12.3   12.4
count=   23     92   13.9   0.4   16   13.1   12.7
count=   32    128   14.8   0.1   16

Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-09 Thread Eugene Loh

Oliver Geisler wrote:


This is the results of skampi pt2pt, first with shared memory allowed,
second shared memory excluded.


Thanks for the data.  The TCP results are not very interesting... they 
look reasonable.


The shared-memory data is rather straightforward:  results are just 
plain ridiculously bad.  The results for "eager" messages (messages 
shorter than 4 Kbytes) are around 12 millisec.  The results for 
"rendezvous" messages (longer than 4 Kbytes, signal the receiver, wait 
for an acknowledgement, then send the message) are about 30 millisec.
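
(For clarity, the two paths being compared boil down to something like 
the following sketch.  This is only an illustration of the eager-vs- 
rendezvous split, not Open MPI's actual sm code; the helper names and 
the 4 KB limit are assumptions.)

    #include <stddef.h>

    #define EAGER_LIMIT 4096   /* ~4 Kbyte eager limit, as described above */

    /* Stubs standing in for the real transport operations (illustration only). */
    static void push_payload(const void *buf, size_t len) { (void)buf; (void)len; }
    static void send_rendezvous_header(size_t len)        { (void)len; }
    static void wait_for_ack(void)                        { }

    static void send_message(const void *buf, size_t len)
    {
        if (len < EAGER_LIMIT) {
            /* eager: just hand the payload to the receiver */
            push_payload(buf, len);
        } else {
            /* rendezvous: signal the receiver, wait for its
             * acknowledgement, then send the message */
            send_rendezvous_header(len);
            wait_for_ack();
            push_payload(buf, len);
        }
    }

    int main(void)
    {
        char buf[8192] = {0};
        send_message(buf, 64);      /* eager path */
        send_message(buf, 8192);    /* rendezvous path */
        return 0;
    }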


I was also curious about "long-message bandwidth", but since SKaMPI is 
only going up to 16 Kbyte messages, we can't really tell.


But maybe all that is irrelevant.

Why is shared-memory performance about four orders of magnitude slower 
than it should be?  The processes are communicating via memory that's 
shared by having the processes all mmap the same file into their address 
spaces.  Is it possible that with the newer kernels, operations to that 
shared file are going all the way out to disk?  Maybe you don't know the 
answer, but hopefully someone on this mail list can provide some insight.
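
(For reference, here is a minimal sketch of the mechanism in question: 
file-backed shared memory via mmap.  This is not the sm BTL's actual 
code; the file name and size are made up for illustration.)

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t size = 4096;
        /* hypothetical backing file; the real sm BTL uses its own file
         * under the session directory */
        int fd = open("/tmp/ompi_sm_sketch", O_RDWR | O_CREAT, 0600);
        if (fd < 0 || ftruncate(fd, size) != 0) {
            perror("backing file");
            return 1;
        }
        /* every process that maps the same file with MAP_SHARED sees the
         * same pages; whether dirty pages get pushed out to disk is up to
         * the kernel's writeback behavior */
        char *seg = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (seg == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        strcpy(seg, "hello from the shared segment");
        printf("%s\n", seg);
        munmap(seg, size);
        close(fd);
        return 0;
    }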