My bad, just a dumb mistake. It was load balance, as Ralph suggested: I had 
decomposed the problem into 16 equally sized parts, which didn't map well to 15 cores.
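
A minimal sketch of that arithmetic, assuming the 16 parts were dealt out 
round-robin to the ranks (the actual assignment isn't shown in this thread, and 
the helper names are made up for illustration). Under that assumption, with 15 
ranks one rank ends up holding two parts while every other rank holds one:

```python
# Hypothetical helper: deal `parts` equal-sized blocks out to `ranks`
# processes round-robin and count how many blocks each rank receives.
# Round-robin is an assumption, not the application's documented scheme.
def blocks_per_rank(parts, ranks):
    counts = [0] * ranks
    for block in range(parts):
        counts[block % ranks] += 1
    return counts


def imbalance(parts, ranks):
    """Busiest rank's load divided by the average load (1.0 = balanced)."""
    counts = blocks_per_rank(parts, ranks)
    return max(counts) / (sum(counts) / ranks)


if __name__ == "__main__":
    for ranks in (15, 16):
        counts = blocks_per_rank(16, ranks)
        print(ranks, "ranks: most blocks on one rank =", max(counts),
              "imbalance =", round(imbalance(16, ranks), 3))
```

If every rank has to wait for the slowest one, the rank holding two parts sets 
the pace, which would also fit one rank showing about 2x the memory of the 
others in the n=15 runs.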

Regarding VTune, we have a code that doesn't scale well, so that's a good tip.  
I have access to VTune and have used it, but I only remember looking at OpenMP; 
I didn't know it could handle MPI runs. That would be great.  Is VampirTrace (?) 
another option for identifying communication bottlenecks, serial content, 
etc.?

Thanks

________________________________________
From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of Jeff 
Squyres (jsquyres) [jsquy...@cisco.com]
Sent: Friday, June 07, 2013 6:00 AM
To: Open MPI Users
Subject: EXTERNAL: Re: [OMPI users] Sandy Bridge performance question

+1

Depending on how much you care, you might also want to use some performance 
analysis tools to see what is happening under the covers.  The Intel 
VTune suite is the gold standard -- it shows all the counters and statistics 
from the CPUs themselves (be aware that there's a bit of a learning curve) -- 
to include things like cache statistics, instructions per clock, ...etc.  Lots 
and lots and lots of info.

Other tools are good, too -- google around (e.g., the cachegrind tool in 
valgrind, etc.).



On Jun 6, 2013, at 4:42 PM, Ralph Castain <r...@open-mpi.org> wrote:

> It depends on the application you are using. Some are "balanced" - i.e., they 
> run faster if the number of processes is a power of two. You'll see that n8 
> is faster than n7, so this is likely the situation.
>
>
> On Jun 6, 2013, at 4:10 PM, "Blosch, Edwin L" <edwin.l.blo...@lmco.com> wrote:
>
>> I am running single-node Sandy Bridge cases with OpenMPI and looking at 
>> scaling.
>>
>> I’m using --bind-to-core without any other options (default is --bycore I 
>> believe).
>>
>> In the labels below, the first number is the core count and the digit after 
>> the dot is the run number (except for n=1, all runs repeated 3 times).  Any 
>> thoughts on why n15 should be so much slower than n16?   I also measure the RSS of the running 
>> processes, and the rank 0 process for n=15 cases uses about 2x more memory 
>> than all the other ranks, whereas all the ranks use the same amount of 
>> memory for the n=16 cases.
>>
>> Thanks for insights,
>>
>> Ed
>>
>> n1.1:    6.9530
>> n2.1:    7.0185
>> n2.2:    7.0313
>> n3.1:    8.2069
>> n3.2:    8.1628
>> n3.3:    8.1311
>> n4.1:    7.5307
>> n4.2:    7.5323
>> n4.3:    7.5858
>> n5.1:    9.5693
>> n5.2:    9.5104
>> n5.3:    9.4821
>> n6.1:    8.9821
>> n6.2:    8.9720
>> n6.3:    8.9541
>> n7.1:    10.640
>> n7.2:    10.650
>> n7.3:    10.638
>> n8.1:    8.6822
>> n8.2:    8.6630
>> n8.3:    8.6903
>> n9.1:    9.5058
>> n9.2:    9.5255
>> n9.3:    9.4809
>> n10.1:    10.484
>> n10.2:    10.452
>> n10.3:    10.516
>> n11.1:    11.327
>> n11.2:    11.316
>> n11.3:    11.318
>> n12.1:    12.285
>> n12.2:    12.303
>> n12.3:    12.272
>> n13.1:    13.127
>> n13.2:    13.113
>> n13.3:    13.113
>> n14.1:    14.035
>> n14.2:    13.989
>> n14.3:    14.021
>> n15.1:    14.533
>> n15.2:    14.529
>> n15.3:    14.586
>> n16.1:    8.6542
>> n16.2:    8.6731
>> n16.3:    8.6586
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

