Re: [OMPI users] scaling problem with openmpi

2009-05-20 Thread Peter Kjellstrom
On Tuesday 19 May 2009, Peter Kjellstrom wrote:
> On Tuesday 19 May 2009, Roman Martonak wrote:
> > On Tue, May 19, 2009 at 3:29 PM, Peter Kjellstrom  wrote:
> > > On Tuesday 19 May 2009, Roman Martonak wrote:
> > > ...
> > >
> > >> openmpi-1.3.2                           time per one MD step is 3.66 s
> > >>    ELAPSED TIME :    0 HOURS  1 MINUTES 25.90 SECONDS
> > >>  = ALL TO ALL COMM           102033. BYTES               4221.  =
> > >>  = ALL TO ALL COMM             7.802  MB/S          55.200 SEC  =
>
> ...
>
> > With TASKGROUP=2 the summary looks as follows
>
> ...
>
> >  = ALL TO ALL COMM           231821. BYTES               4221.  =
> >  = ALL TO ALL COMM            82.716  MB/S          11.830 SEC  =
>
> Wow, according to this it takes 1/5th the time to do the same number (4221)
> of alltoalls if the size is (roughly) doubled... (ten times better
> performance with the larger transfer size)
>
> Something is not quite right, could you possibly try to run just the
> alltoalls like I suggested in my previous e-mail?

I was curious so I ran some tests. First, it seems that the size reported by 
CPMD is the total size of the data buffer, not the per-rank message size. Running 
alltoalls with 231821/64 (~3623 B) and 102033/64 (~1595 B) gives this (on a similar setup):

bw for   4221 x 1595 B :  36.5 Mbytes/s   time was:  23.3 s
bw for   4221 x 3623 B : 125.4 Mbytes/s   time was:  15.4 s
bw for   4221 x 1595 B :  36.4 Mbytes/s   time was:  23.3 s
bw for   4221 x 3623 B : 125.6 Mbytes/s   time was:  15.3 s

So it does seem that OpenMPI has some problems with small alltoalls. It is 
obviously broken when you can get things across faster by sending more...

As a reference I ran with a commercial MPI using the same program and node set 
(I did not have MVAPICH or Intel MPI on this system):

bw for   4221 x 1595 B :  71.4 Mbytes/s   time was:  11.9 s
bw for   4221 x 3623 B : 125.8 Mbytes/s   time was:  15.3 s
bw for   4221 x 1595 B :  71.1 Mbytes/s   time was:  11.9 s
bw for   4221 x 3623 B : 125.5 Mbytes/s   time was:  15.3 s

To see when OpenMPI falls over I ran with an increasing packet size:

bw for   10  x 2900 B :  59.8 Mbytes/s   time was:  61.2 ms
bw for   10  x 2925 B :  59.2 Mbytes/s   time was:  62.2 ms
bw for   10  x 2950 B :  59.4 Mbytes/s   time was:  62.6 ms
bw for   10  x 2975 B :  58.5 Mbytes/s   time was:  64.1 ms
bw for   10  x 3000 B : 113.5 Mbytes/s   time was:  33.3 ms
bw for   10  x 3100 B : 116.1 Mbytes/s   time was:  33.6 ms

The problem seems to be for packets with 1000 bytes < size < 3000 bytes, with a 
hard edge at 3000 bytes. Your CPMD run was communicating at more or less the 
worst-case packet size.

These are the figures for my "reference" MPI:

bw for   10  x 2900 B : 110.3 Mbytes/s   time was:  33.1 ms
bw for   10  x 2925 B : 110.4 Mbytes/s   time was:  33.4 ms
bw for   10  x 2950 B : 111.5 Mbytes/s   time was:  33.3 ms
bw for   10  x 2975 B : 112.4 Mbytes/s   time was:  33.4 ms
bw for   10  x 3000 B : 118.2 Mbytes/s   time was:  32.0 ms
bw for   10  x 3100 B : 114.1 Mbytes/s   time was:  34.2 ms

Setup details:
hw: dual-socket quad-core Harpertown nodes with ConnectX IB and a 1:1 two-level tree
sw: CentOS-5.3 x86_64 with OpenMPI-1.3b2 (did not have time to try 1.3.2) on 
OFED from CentOS (1.3.2-ish I think).

/Peter




Re: [OMPI users] scaling problem with openmpi

2009-05-20 Thread Pavel Shamis (Pasha)

The default algorithm thresholds in MVAPICH are different from Open MPI's.
Using the tuned collectives component in Open MPI you can configure the
Alltoall thresholds to match the MVAPICH defaults.
The following MCA parameters tell Open MPI to use custom rules defined in a
plain-text configuration file:

"--mca use_dynamic_rules 1 --mca dynamic_rules_filename"

Here is an example dynamic rules file that should make the Open MPI Alltoall
tuning similar to MVAPICH:

1 # num of collectives
3 # ID = 3 Alltoall collective (ID in coll_tuned.h)
1 # number of comm sizes
64 # comm size
2 # number of msg sizes
0 3 0 0 # from message size 0: bruck (algorithm 3), topo 0, 0 segmentation
8192 2 0 0 # from 8k: pairwise (algorithm 2), no topo or segmentation
# end of first collective


Thanks,
Pasha

Peter Kjellstrom wrote:
> ...

Re: [OMPI users] scaling problem with openmpi

2009-05-20 Thread Roman Martonak
Many thanks for the highly helpful analysis. Indeed, what Peter describes
seems to be precisely what is happening here. I ran the 32-waters test on 48
cores, once with the original cutoff of 100 Ry and once with a slightly
increased cutoff of 110 Ry. Normally the larger cutoff should obviously take
more time per step. Increasing the cutoff, however, also increases the size
of the data buffer, and it appears to just cross the packet size threshold
into the different behaviour (the test was run with openmpi-1.3.2).


cutoff 100 Ry

time per 1 step is 2.869 s

 = ALL TO ALL COMM           151583. BYTES               2211.  =
 = ALL TO ALL COMM            16.741  MB/S          20.020 SEC  =


cutoff 110 Ry

time per 1 step is 1.879 s

 = ALL TO ALL COMM           167057. BYTES               2211.  =
 = ALL TO ALL COMM            43.920  MB/S           8.410 SEC  =


So with the larger cutoff it actually runs much faster, and the ALL TO ALL COMM
bandwidth is 2.6 times higher. In my case the threshold seems to be somewhere
between 151583/48 = 3157 and 167057/48 = 3480 bytes.

I saved the text that Pavel suggested

1 # num of collectives
3 # ID = 3 Alltoall collective (ID in coll_tuned.h)
1 # number of comm sizes
64 # comm size
2 # number of msg sizes
0 3 0 0 # from message size 0: bruck (algorithm 3), topo 0, 0 segmentation
8192 2 0 0 # from 8k: pairwise (algorithm 2), no topo or segmentation
# end of first collective

to the file dyn_rules and tried to run with the options
"--mca use_dynamic_rules 1 --mca dynamic_rules_filename ./dyn_rules" appended to
the mpirun command line, but it does not make any change. Is this the correct
syntax to enable the rules? And will the above sample file shift the threshold
to lower values (and if so, to what value)?

Best regards

Roman

On Wed, May 20, 2009 at 10:39 AM, Peter Kjellstrom  wrote:
> ...

Re: [OMPI users] scaling problem with openmpi

2009-05-20 Thread Rolf Vandevaart

The correct MCA parameters are the following:
-mca coll_tuned_use_dynamic_rules 1
-mca coll_tuned_dynamic_rules_filename ./dyn_rules

You can also run the following command:
ompi_info -mca coll_tuned_use_dynamic_rules 1 -param coll tuned
This will give some insight into all the various algorithms that make up 
the tuned collectives.


If I am understanding what is happening, it looks like the original 
MPI_Alltoall made use of three algorithms.  (You can look in 
coll_tuned_decision_fixed.c)


If message size < 200 or communicator size > 12
  bruck
else if message size < 3000
  basic linear
else
  pairwise
end

With the file Pavel has provided things have changed to the following. 
(maybe someone can confirm)


If message size < 8192
  bruck
else
  pairwise
end
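Written out as code, the two rule sets above look roughly like this (an
illustrative sketch only, with made-up names, following the summary above --
not the actual coll_tuned source):

/* Illustrative sketch of the Alltoall selection rules summarized above.
 * Names are made up; this is not the actual Open MPI coll_tuned code. */
#include <stddef.h>

enum alltoall_alg { ALG_BRUCK, ALG_BASIC_LINEAR, ALG_PAIRWISE };

/* default (fixed) rules, per the summary of coll_tuned_decision_fixed.c */
enum alltoall_alg alltoall_default_rule(size_t msg_size, int comm_size)
{
    if (msg_size < 200 || comm_size > 12)
        return ALG_BRUCK;
    else if (msg_size < 3000)
        return ALG_BASIC_LINEAR;   /* the range that performed poorly here */
    else
        return ALG_PAIRWISE;
}

/* rules after loading the dynamic rules file Pavel provided */
enum alltoall_alg alltoall_pasha_rule(size_t msg_size)
{
    return (msg_size < 8192) ? ALG_BRUCK : ALG_PAIRWISE;
}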

Rolf


On 05/20/09 07:48, Roman Martonak wrote:
> ...

Re: [OMPI users] scaling problem with openmpi

2009-05-20 Thread Pavel Shamis (Pasha)



> The correct MCA parameters are the following:
> -mca coll_tuned_use_dynamic_rules 1
> -mca coll_tuned_dynamic_rules_filename ./dyn_rules

Ohh.. it was my mistake.

> You can also run the following command:
> ompi_info -mca coll_tuned_use_dynamic_rules 1 -param coll tuned
> This will give some insight into all the various algorithms that make
> up the tuned collectives.
>
> If I am understanding what is happening, it looks like the original
> MPI_Alltoall made use of three algorithms. (You can look in
> coll_tuned_decision_fixed.c)
>
> If message size < 200 or communicator size > 12
>    bruck
> else if message size < 3000
>    basic linear
> else
>    pairwise
> end

Yep, it is correct.

> With the file Pavel has provided things have changed to the following.
> (maybe someone can confirm)
>
> If message size < 8192
>    bruck
> else
>    pairwise
> end

You are right here. The target of my conf file is to disable basic_linear for
medium message sizes.

Pasha.


Re: [OMPI users] scaling problem with openmpi

2009-05-20 Thread Peter Kjellstrom
On Wednesday 20 May 2009, Rolf Vandevaart wrote:
...
> If I am understanding what is happening, it looks like the original
> MPI_Alltoall made use of three algorithms.  (You can look in
> coll_tuned_decision_fixed.c)
>
> If message size < 200 or communicator size > 12
>bruck
> else if message size < 3000
>basic linear
> else
>pairwise
> end

And 3000 was the observed threshold for bad behaviour, so it seems very likely 
that "basic linear" was the culprit. My testing suggests that "pairwise" is a 
good choice at ~3000 bytes (but maybe bruck, as configured by Pavel, is good 
too).

/Peter

> With the file Pavel has provided things have changed to the following.
> (maybe someone can confirm)
>
> If message size < 8192
>bruck
> else
>pairwise
> end
>
> Rolf




[OMPI users] FW: hanging after many comm create/destroy's

2009-05-20 Thread Lippert, Ross
 

The attached program hangs after printing "Iteration 65524".
It does not appear to me that it should. Removing the barrier call, or
changing it to use MPI_COMM_WORLD, gets rid of the hang, so I believe this
program is a minimal reproduction of a bug.
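
(The attached mpibug.c is not reproduced in this archive. A minimal sketch of
the pattern described -- repeated communicator create/destroy with a barrier
on the new communicator -- could look like the following. This is an
illustrative reconstruction, not the actual attachment.)

/* Illustrative reconstruction of the described test, not the actual mpibug.c:
 * repeatedly create a communicator, barrier on it, and free it. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (long i = 0; i < 100000; i++) {
        MPI_Comm dup;
        if (rank == 0)
            printf("Iteration %ld\n", i);
        MPI_Comm_dup(MPI_COMM_WORLD, &dup);   /* create a new communicator */
        MPI_Barrier(dup);                     /* barrier on the new comm   */
        MPI_Comm_free(&dup);                  /* and destroy it again      */
    }
    MPI_Finalize();
    return 0;
}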

I have attached the output of ompi_info --all as well.  I do not have
access to the config.log.

The command to compile was

mpicc mpibug.c

The command to run was 

orterun --np 8 --mca btl tcp,self -- ./a.out


-r




Re: [OMPI users] FW: hanging after many comm create/destroy's

2009-05-20 Thread Edgar Gabriel
I am 99.99% sure that this bug has been fixed in the current trunk and 
will be available in the upcoming 1.3.3 release...


Thanks
Edgar

Lippert, Ross wrote:
> ...


--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science  University of Houston
Philip G. Hoffman Hall, Room 524Houston, TX-77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335


Re: [OMPI users] FW: hanging after many comm create/destroy's

2009-05-20 Thread Lippert, Ross
OK.  I'll check back again when 1.3.3 comes out.  Thanks.

-r 

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Edgar Gabriel
Sent: Wednesday, May 20, 2009 11:16 AM
To: Open MPI Users
Subject: Re: [OMPI users] FW: hanging after many comm create/destroy's

> I am 99.99% sure that this bug has been fixed in the current trunk and
> will be available in the upcoming 1.3.3 release...
>
> Thanks
> Edgar




Re: [OMPI users] scaling problem with openmpi

2009-05-20 Thread Peter Kjellstrom
On Wednesday 20 May 2009, Pavel Shamis (Pasha) wrote:
> > With the file Pavel has provided things have changed to the following.
> > (maybe someone can confirm)
> >
> > If message size < 8192
> > bruck
> > else
> > pairwise
> > end
>
> You are right here. Target of my conf file is disable basic_linear for
> medium message side.

Disabling basic_linear seems like a good idea but your config file sets the 
cut-off at 128 Bytes for 64-ranks (the field you set to 8192 seems to result 
in a message size of that value divided by the number of ranks).

In my testing bruck seems to win clearly (at least for 64 ranks on my IB) up 
to 2048. Hence, the following line may be better:

 131072 2 0 0 # switch to pair wise for size 128K/nranks

Disclaimer: I guess this could differ quite a bit for nranks!=64 and different 
btls.

Here are some figures for this part of the packet size range:

all_bruck
bw for   10  x 10 B :  13.7 Mbytes/s     time was: 922.0 µs
bw for   10  x 500 B :  45.9 Mbytes/s    time was:  13.7 ms
bw for   10  x 1000 B : 122.7 Mbytes/s   time was:  10.3 ms
bw for   10  x 1500 B :  86.9 Mbytes/s   time was:  21.8 ms
bw for   10  x 2000 B : 120.1 Mbytes/s   time was:  21.0 ms
bw for   10  x 2047 B :  92.6 Mbytes/s   time was:  27.9 ms
bw for   10  x 2048 B : 107.3 Mbytes/s   time was:  24.1 ms
bw for   10  x 2400 B :  93.7 Mbytes/s   time was:  32.3 ms
bw for   10  x 2800 B :  73.0 Mbytes/s   time was:  48.3 ms
bw for   10  x 2900 B :  79.5 Mbytes/s   time was:  45.9 ms
bw for   10  x 2925 B :  89.3 Mbytes/s   time was:  41.3 ms
bw for   10  x 2950 B :  72.7 Mbytes/s   time was:  51.1 ms
bw for   10  x 2975 B :  75.2 Mbytes/s   time was:  49.8 ms
bw for   10  x 3000 B :  74.9 Mbytes/s   time was:  50.5 ms
bw for   10  x 3100 B :  95.9 Mbytes/s   time was:  40.7 ms
total time was: 479.5 ms
all_pair
bw for   10  x 10 B : 414.2 kbytes/s time was:  30.4 ms
bw for   10  x 500 B :  19.8 Mbytes/s    time was:  31.9 ms
bw for   10  x 1000 B :  43.3 Mbytes/s   time was:  29.1 ms
bw for   10  x 1500 B :  63.3 Mbytes/s   time was:  29.9 ms
bw for   10  x 2000 B :  81.2 Mbytes/s   time was:  31.0 ms
bw for   10  x 2047 B :  82.3 Mbytes/s   time was:  31.3 ms
bw for   10  x 2048 B :  83.0 Mbytes/s   time was:  31.1 ms
bw for   10  x 2400 B :  93.6 Mbytes/s   time was:  32.3 ms
bw for   10  x 2800 B : 105.0 Mbytes/s   time was:  33.6 ms
bw for   10  x 2900 B : 107.7 Mbytes/s   time was:  33.9 ms
bw for   10  x 2925 B : 108.1 Mbytes/s   time was:  34.1 ms
bw for   10  x 2950 B : 109.6 Mbytes/s   time was:  33.9 ms
bw for   10  x 2975 B : 111.1 Mbytes/s   time was:  33.7 ms
bw for   10  x 3000 B : 112.1 Mbytes/s   time was:  33.7 ms
bw for   10  x 3100 B : 114.5 Mbytes/s   time was:  34.1 ms
total time was: 484.1 ms
bruckto2k_pair
bw for   10  x 10 B :  11.9 Mbytes/s time was:   1.1 ms
bw for   10  x 500 B : 100.3 Mbytes/s    time was:   6.3 ms
bw for   10  x 1000 B : 115.9 Mbytes/s   time was:  10.9 ms
bw for   10  x 1500 B : 117.2 Mbytes/s   time was:  16.1 ms
bw for   10  x 2000 B :  95.7 Mbytes/s   time was:  26.3 ms
bw for   10  x 2047 B :  96.6 Mbytes/s   time was:  26.7 ms
bw for   10  x 2048 B :  82.2 Mbytes/s   time was:  31.4 ms
bw for   10  x 2400 B :  94.1 Mbytes/s   time was:  32.1 ms
bw for   10  x 2800 B : 105.6 Mbytes/s   time was:  33.4 ms
bw for   10  x 2900 B : 108.4 Mbytes/s   time was:  33.7 ms
bw for   10  x 2925 B : 108.3 Mbytes/s   time was:  34.0 ms
bw for   10  x 2950 B : 109.9 Mbytes/s   time was:  33.8 ms
bw for   10  x 2975 B : 111.5 Mbytes/s   time was:  33.6 ms
bw for   10  x 3000 B : 108.3 Mbytes/s   time was:  34.9 ms
bw for   10  x 3100 B : 114.7 Mbytes/s   time was:  34.0 ms
total time was: 388.4 ms

These figures were run on a freshly compiled OpenMPI-1.3.2. The numbers for 
bruck at small packet sizes vary a bit from run to run.

/Peter

> Pasha.




Re: [OMPI users] scaling problem with openmpi

2009-05-20 Thread Pavel Shamis (Pasha)


> Disabling basic_linear seems like a good idea but your config file sets the
> cut-off at 128 Bytes for 64-ranks (the field you set to 8192 seems to result
> in a message size of that value divided by the number of ranks).
>
> In my testing bruck seems to win clearly (at least for 64 ranks on my IB) up
> to 2048. Hence, the following line may be better:
>
>  131072 2 0 0 # switch to pair wise for size 128K/nranks
>
> Disclaimer: I guess this could differ quite a bit for nranks!=64 and different
> btls.

Sounds strange to me. From the code it looks like we take the threshold as is,
without dividing by the number of ranks.

Pasha.


Re: [OMPI users] scaling problem with openmpi

2009-05-20 Thread Peter Kjellstrom
On Wednesday 20 May 2009, Pavel Shamis (Pasha) wrote:
> > Disabling basic_linear seems like a good idea but your config file sets
> > the cut-off at 128 Bytes for 64-ranks (the field you set to 8192 seems to
> > result in a message size of that value divided by the number of ranks).
> >
> > In my testing bruck seems to win clearly (at least for 64 ranks on my IB)
> > up to 2048. Hence, the following line may be better:
> >
> >  131072 2 0 0 # switch to pair wise for size 128K/nranks
> >
> > Disclaimer: I guess this could differ quite a bit for nranks!=64 and
> > different btls.
>
> Sounds strange to me. From the code it looks like we take the threshold as
> is, without dividing by the number of ranks.

Interesting, I may have had too little or too much coffee, but the figures in my 
previous e-mail (3rd run, bruckto2k_pair) were run with the above line. And it 
very much looks like it switched at 128K/64=2K, not at 128K (which would have 
been above my largest size of 3000 and as such equivalent to all_bruck).

I also ran tests with:
 8192 2 0 0 # ...
And it seemed to switch between 10 Bytes and 500 Bytes (most likely then at 
8192/64=128).

My test program calls MPI_Alltoall like this:
  time1 = MPI_Wtime();
  for (i = 0; i < repetitions; i++) {
    MPI_Alltoall(sbuf, message_size, MPI_CHAR,
                 rbuf, message_size, MPI_CHAR, MPI_COMM_WORLD);
  }
  time2 = MPI_Wtime();
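
For completeness, a self-contained version of such a timing loop could look
roughly as follows. This is a sketch, not my actual test program; the default
repetition count, the buffer handling and the bandwidth formula (counting data
sent plus received to/from the other ranks) are assumptions:

/* Sketch of an alltoall timing loop in the spirit of the fragment above.
 * Assumptions: message_size and repetitions come from the command line;
 * bandwidth counts bytes sent plus received to/from the other ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, nranks, i;
    int message_size = (argc > 1) ? atoi(argv[1]) : 1595;  /* bytes per rank pair */
    int repetitions  = (argc > 2) ? atoi(argv[2]) : 4221;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    char *sbuf = malloc((size_t)message_size * nranks);
    char *rbuf = malloc((size_t)message_size * nranks);
    memset(sbuf, 1, (size_t)message_size * nranks);   /* something to send */

    double time1 = MPI_Wtime();
    for (i = 0; i < repetitions; i++) {
        MPI_Alltoall(sbuf, message_size, MPI_CHAR,
                     rbuf, message_size, MPI_CHAR, MPI_COMM_WORLD);
    }
    double time2 = MPI_Wtime();

    if (rank == 0) {
        double secs  = time2 - time1;
        double bytes = 2.0 * message_size * (nranks - 1) * repetitions;
        printf("bw for %d x %d B : %.1f Mbytes/s   time was: %.1f s\n",
               repetitions, message_size, bytes / secs / 1e6, secs);
    }

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}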

/Peter




Re: [OMPI users] scaling problem with openmpi

2009-05-20 Thread Pavel Shamis (Pasha)
Tomorrow I will add some printfs to the collective code and check what really 
happens there...


Pasha

Peter Kjellstrom wrote:
> ...



Re: [OMPI users] scaling problem with openmpi

2009-05-20 Thread Roman Martonak
I tried to run with the first dynamic rules file that Pavel proposed
and it works: the time per one MD step on 48 cores decreased from 2.8 s
to 1.8 s, as expected. It was clearly the basic linear algorithm that
was causing the problem. I will check the performance of bruck and
pairwise on my hardware. It would be nice if it could be tuned further.

Thanks

Roman

On Wed, May 20, 2009 at 7:18 PM, Pavel Shamis (Pasha)  wrote:
> ...



Re: [OMPI users] scaling problem with openmpi

2009-05-20 Thread Peter Kjellstrom
On Wednesday 20 May 2009, Roman Martonak wrote:
> I tried to run with the first dynamic rules file that Pavel proposed
> and it works: the time per one MD step on 48 cores decreased from 2.8
> s to 1.8 s as expected. It was clearly the basic linear algorithm that
> was causing the problem. I will check the performance of bruck and
> pairwise on my HW. It would be nice if it could be tuned further.

I'm guessing you'll see even better performance if you change 8192 to 131072 
in that config file. That moves the cross-over point between "bruck" and 
"pairwise" up (with 48 ranks, roughly from 8192/48 (~170 bytes) to 
131072/48 (~2730 bytes), given the divide-by-nranks behaviour observed above).

/Peter

> Thanks
>
> Roman

