Re: [OMPI users] difference between OpenMPI - intel MPI -- how to understand where\why

2016-02-16 Thread Eugene Loh
Which one is producing correct (or at least reasonable) results? Are 
both results correct?  Do you have ways of assessing correctness of your 
results?



On February 16, 2016 at 5:19:16 AM, Diego Avesani (diego.aves...@gmail.com) 
wrote:

Dear all,
  
I have written a Fortran MPI code.

Usually, I compile it with Intel MPI or with Open MPI, depending on the cluster where
it runs.
Unfortunately, I get a completely different result and I do not know why.


Re: [OMPI users] OpenMPI Profiling

2016-01-07 Thread Eugene Loh
I don't know specifically what you want to do, but there is a FAQ 
section on profiling and tracing.

http://www.open-mpi.org/faq/?category=perftools
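
For a sense of how the MPI profiling (PMPI) layer works, independent of any particular tool, here is a minimal sketch (not from the original thread): it intercepts MPI_Recv through the standard PMPI_ names, counts calls and accumulates time, and reports totals at MPI_Finalize.  Linking such a file ahead of the MPI library requires no changes to the application source.

/* Minimal PMPI wrapper sketch: time and count MPI_Recv calls. */
#include <mpi.h>
#include <stdio.h>

static double recv_time  = 0.0;
static long   recv_calls = 0;

int MPI_Recv(void *buf, int count, MPI_Datatype type, int source,
             int tag, MPI_Comm comm, MPI_Status *status)
{
    double t0 = PMPI_Wtime();
    int rc = PMPI_Recv(buf, count, type, source, tag, comm, status);
    recv_time  += PMPI_Wtime() - t0;
    recv_calls += 1;
    return rc;
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: %ld MPI_Recv calls, %.6f s total\n",
           rank, recv_calls, recv_time);
    return PMPI_Finalize();
}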

On 12/31/2015 9:03 AM, anil maurya wrote:
I have compiled HPL using OpenMPI and GotoBLAS. I want to do profiling 
and tracing.
I have compiled openmpi using -enable-profile-mpi. Please let me know 
how to do the profiling.


If I want to use PAPI for hardware-based profiling, do I need to
compile HPL again with PAPI support?


Re: [OMPI users] now 1.9 [was: I have still a problem withrankfiles in openmpi-1.6.4rc3]

2013-02-10 Thread Eugene Loh


On 2/10/2013 1:14 AM, Siegmar Gross wrote:

I don't think the problem is related to Solaris.  I think it's also on Linux.
E.g., I can reproduce the problem with 1.9a1r28035 on Linux using GCC compilers.

Siegmar: can you confirm this is a problem also on Linux?  E.g.,
with OMPI 1.9, on one of your Linux nodes (linpc0?) try

  % cat myrankfile
  rank 0=linpc0 slot=0:1
  % mpirun --report-bindings --rankfile myrankfile numactl --show

For me, the binding I get is not 0:1 but 0,1.

I get the following outputs for openmpi-1.6.4rc4 (without your patch)


Okay, thanks, but 1.6 is not the issue here.  There is something going on
in 1.9/trunk that is very different.  Thanks for the 1.6 output, but 1.6
itself is all right.



and openmpi-1.9 (both compiled with Sun C 5.12).


Thanks for the confirmation.  You, too, are showing Linux demonstrating 
this problem.  It looks like bindings are wrong in 1.9.  Ralph says he's 
taking a look.  The rankfile says "0:1", but you're getting "0,1".



linpc1 rankfiles 96 mpirun --report-bindings --rankfile rf_1_linux numactl 
--show
[linpc1:16061] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 
0]]: [B/B][./.]
physcpubind: 0 1
linpc1 rankfiles 97 ompi_info | grep "MPI:"
 Open MPI: 1.9a1r28035


Re: [OMPI users] now 1.9 [was: I have still a problem with rankfiles in openmpi-1.6.4rc3]

2013-02-09 Thread Eugene Loh

On 02/09/13 00:32, Ralph Castain wrote:

On Feb 6, 2013, at 2:59 PM, Eugene Loh  wrote:


On 02/06/13 04:29, Siegmar Gross wrote:

thank you very much for your answer. I have compiled your program
and get different behaviours for openmpi-1.6.4rc3 and openmpi-1.9.

I think what's happening is that although you specified "0:0" or "0:1" in the rankfile, the string 
"0,0" or "0,1" is getting passed in (at least in the runs I looked at).  That colon became a comma. 
 So, it's just by accident that myrankfile_0 is working out all right.

Could someone who knows the code better than I do help me narrow this down?  
E.g., where is the rankfile parsed?  For what it's worth, by the time mpirun 
reaches orte_odls_base_default_get_add_procs_data(), orte_job_data already 
contains the corrupted cpu_bitmap string.


You'll want to look at orte/mca/rmaps/rank_file/rmaps_rank_file.c - the bit map 
is now computed in mpirun and then sent to the daemons


Actually, I'm getting lost in this code.  Anyhow, I don't think the problem is related to Solaris.  I think it's also on Linux. 
E.g., I can reproduce the problem with 1.9a1r28035 on Linux using GCC compilers.


Siegmar: can you confirm this is a problem also on Linux?  E.g., with OMPI 1.9, 
on one of your Linux nodes (linpc0?) try

% cat myrankfile
rank 0=linpc0 slot=0:1
% mpirun --report-bindings --rankfile myrankfile numactl --show

For me, the binding I get is not 0:1 but 0,1.

Could someone else take a look at this?


Re: [OMPI users] I have still a problem with rankfiles in openmpi-1.6.4rc3

2013-02-07 Thread Eugene Loh

On 02/07/13 01:05, Siegmar Gross wrote:


thank you very much for your patch. I have applied the patch to
openmpi-1.6.4rc4.

Open MPI: 1.6.4rc4r28022
: [B .][. .] (slot list 0:0)
: [. B][. .] (slot list 0:1)
: [B B][. .] (slot list 0:0-1)
: [. .][B .] (slot list 1:0)
: [. .][. B] (slot list 1:1)
: [. .][B B] (slot list 1:0-1)
: [B B][B B] (slot list 0:0-1,1:0-1)


That looks great.  I'll file a CMR to get this patch into 1.6.  Unless you indicate otherwise, I'll assume this issue is understood 
for 1.6.



I get the following output for an unpatched openmpi-1.9.

Open MPI: 1.9a1r28035
: [B/.][./.]
: [B/B][./.]
: [B/B][./.]
: [./.][B/B]
: [./.][./B]
: [./.][B/B]
: [B/B][./.]


Right.  There is something else going on for 1.9.  I think OMPI 1.9 is corrupting the binding strings.  In my case, I said "0:1" and 
the internal string was "0,1".  So, although I should have binding to only one core (0:1), OMPI was trying to bind to two of them 
(0,1).  I'm still waiting for a response to other e-mail where I asked for hints where to find the problem in the source code.


[OMPI users] now 1.9 [was: I have still a problem with rankfiles in openmpi-1.6.4rc3]

2013-02-06 Thread Eugene Loh

On 02/06/13 04:29, Siegmar Gross wrote:


thank you very much for your answer. I have compiled your program
and get different behaviours for openmpi-1.6.4rc3 and openmpi-1.9.

I get the following output for openmpi-1.9 (different outputs !!!).

sunpc1 rankfiles 104 mpirun --report-bindings --rankfile myrankfile ./a.out
[sunpc1:26554] MCW rank 0 bound to socket 0[core 0[hwt 0]],   socket 0[core 
1[hwt 0]]: [B/B][./.]
unbound

sunpc1 rankfiles 105 mpirun --report-bindings --rankfile myrankfile_0 ./a.out
[sunpc1:26557] MCW rank 0 bound to socket 0[core 0[hwt 0]]:   [B/.][./.]
bind to 0


I think what's happening is that although you specified "0:0" or "0:1" in the rankfile, the string "0,0" or "0,1" is getting passed 
in (at least in the runs I looked at).  That colon became a comma.  So, it's just by accident that myrankfile_0 is working out all 
right.


Could someone who knows the code better than I do help me narrow this down?  E.g., where is the rankfile parsed?  For what it's 
worth, by the time mpirun reaches orte_odls_base_default_get_add_procs_data(), orte_job_data already contains the corrupted 
cpu_bitmap string.


Re: [OMPI users] I have still a problem with rankfiles in openmpi-1.6.4rc3

2013-02-06 Thread Eugene Loh

On 02/06/13 04:29, Siegmar Gross wrote:

Hi

thank you very much for your answer. I have compiled your program
and get different behaviours for openmpi-1.6.4rc3 and openmpi-1.9.


Yes, something else seems to be going on for 1.9.

For 1.6, try the attached patch.  It works for me, but my machines have flatter (less interesting) topology.  It'd be great if you 
could try


  % mpirun --report-bindings --rankfile myrankfile ./a.out

with that check program I sent and with the following rankfiles:

rank 0=sunpc1 slot=0:0
rank 0=sunpc1 slot=0:1
rank 0=sunpc1 slot=0:0-1
rank 0=sunpc1 slot=1:0
rank 0=sunpc1 slot=1:1
rank 0=sunpc1 slot=1:0-1
rank 0=sunpc1 slot=0:0-1,1:0-1

where each line represents a different rankfile.
Index: opal/mca/hwloc/hwloc132/hwloc/src/topology-solaris.c
===================================================================
--- opal/mca/hwloc/hwloc132/hwloc/src/topology-solaris.c	(revision 28036)
+++ opal/mca/hwloc/hwloc132/hwloc/src/topology-solaris.c	(working copy)
@@ -137,6 +137,7 @@
   int depth = hwloc_get_type_depth(topology, HWLOC_OBJ_NODE);
   int n;
   int i;
+  processorid_t binding;

   if (depth < 0) {
     errno = ENOSYS;
@@ -146,6 +147,15 @@
   hwloc_bitmap_zero(hwloc_set);
   n = hwloc_get_nbobjs_by_depth(topology, depth);

+  /* first check if processor_bind() was used to bind to a single processor rather than to an lgroup */
+
+  if ( processor_bind(idtype, id, PBIND_QUERY, &binding) == 0 && binding != PBIND_NONE ) {
+    hwloc_bitmap_only(hwloc_set, binding);
+    return 0;
+  }
+
+  /* if not, check lgroups */
+
   for (i = 0; i < n; i++) {
     hwloc_obj_t obj = hwloc_get_obj_by_depth(topology, depth, i);
     lgrp_affinity_t aff = lgrp_affinity_get(idtype, id, obj->os_index);


Re: [OMPI users] I have still a problem with rankfiles in openmpi-1.6.4rc3

2013-02-05 Thread Eugene Loh

On 02/05/13 13:20, Eugene Loh wrote:

On 02/05/13 00:30, Siegmar Gross wrote:


now I can use all our machines once more. I have a problem on
Solaris 10 x86_64, because the mapping of processes doesn't
correspond to the rankfile.


A few comments.

First of all, the heterogeneous environment had nothing to do with this 
(as you have just confirmed).  You can reproduce the problem so:


% cat myrankfile
rank 0=mynode slot=0:1
% mpirun --report-bindings --rankfile myrankfile hostname
[mynode:5150] MCW rank 0 bound to socket 0[core 0-3]: [B B B B] (slot 
list 0:1)


Anyhow, that's water under the bridge at this point.

Next, and you might already know this, you can't bind arbitrarily on 
Solaris.  You have to bind to a locality group (lgroup) or an individual 
core.  Sorry if that's repeating something you already knew.  Anyhow, 
your problem cases are when binding to a single core.  So, you're all 
right (and OMPI isn't).


Finally, you can check the actual binding so:

% cat check.c
#include <stdio.h>
#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>

int main(int argc, char **argv) {
  processorid_t obind;
  if ( processor_bind(P_PID, P_MYID, PBIND_QUERY, &obind) != 0 ) {
    printf("ERROR\n");
  } else {
    if ( obind == PBIND_NONE ) printf("unbound\n");
    else                       printf("bind to %d\n", obind);
  }
  return 0;
}
% cc check.c
% mpirun --report-bindings --rankfile myrankfile ./a.out

I can reproduce your problem on my Solaris 11 machine (rankfile 
specifies a particular core but --report-bindings shows binding to 
entire node), but the test problem shows binding to the core I specified.


So, the problem is in --report-bindings?  I'll poke around some.


I'm thinking the issue is in openmpi-1.6.3-debug/opal/mca/hwloc/hwloc132/hwloc/src/topology-solaris.c .  First, we call 
hwloc_solaris_set_sth_cpubind() to perform the binding.  Again, there are two mechanisms for doing so:  lgroup/lgrp for an entire 
locality group or processor_bind to a specific core.  In our case, we use the latter.  Later, we call 
hwloc_solaris_get_sth_cpubind() to check what binding we have.  We check lgroups, but we don't check for processor_binding.


Sorry for the dumb question, but who maintains this code?  OMPI, or upstream in 
the hwloc project?  Where should the fix be made?


Re: [OMPI users] I have still a problem with rankfiles in openmpi-1.6.4rc3

2013-02-05 Thread Eugene Loh

On 02/05/13 00:30, Siegmar Gross wrote:


now I can use all our machines once more. I have a problem on
Solaris 10 x86_64, because the mapping of processes doesn't
correspond to the rankfile. I removed the output from "hostfile"
and wrapped around long lines.

tyr rankfiles 114 cat rf_ex_sunpc
# mpiexec -report-bindings -rf rf_ex_sunpc hostname

rank 0=sunpc0 slot=0:0-1,1:0-1
rank 1=sunpc1 slot=0:0-1
rank 2=sunpc1 slot=1:0
rank 3=sunpc1 slot=1:1


tyr rankfiles 115 mpiexec -report-bindings -rf rf_ex_sunpc hostname
[sunpc0:17920] MCW rank 0 bound to socket 0[core 0-1]  socket 1[core 0-1]: [B 
B][B B] (slot list 0:0-1,1:0-1)
[sunpc1:11265] MCW rank 1 bound to socket 0[core 0-1]: [B 
B][. .] (slot list 0:0-1)
[sunpc1:11265] MCW rank 2 bound to socket 0[core 0-1]  socket 1[core 0-1]: [B 
B][B B] (slot list 1:0)
[sunpc1:11265] MCW rank 3 bound to socket 0[core 0-1]  socket 1[core 0-1]: [B 
B][B B] (slot list 1:1)


A few comments.

First of all, the heterogeneous environment had nothing to do with this (as you 
have just confirmed).  You can reproduce the problem so:

% cat myrankfile
rank 0=mynode slot=0:1
% mpirun --report-bindings --rankfile myrankfile hostname
[mynode:5150] MCW rank 0 bound to socket 0[core 0-3]: [B B B B] (slot list 0:1)

Anyhow, that's water under the bridge at this point.

Next, and you might already know this, you can't bind arbitrarily on Solaris.  You have to bind to a locality group (lgroup) or an 
individual core.  Sorry if that's repeating something you already knew.  Anyhow, your problem cases are when binding to a single 
core.  So, you're all right (and OMPI isn't).


Finally, you can check the actual binding so:

% cat check.c
#include <stdio.h>
#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>

int main(int argc, char **argv) {
  processorid_t obind;
  if ( processor_bind(P_PID, P_MYID, PBIND_QUERY, &obind) != 0 ) {
    printf("ERROR\n");
  } else {
    if ( obind == PBIND_NONE ) printf("unbound\n");
    else                       printf("bind to %d\n", obind);
  }
  return 0;
}
% cc check.c
% mpirun --report-bindings --rankfile myrankfile ./a.out

I can reproduce your problem on my Solaris 11 machine (rankfile specifies a particular core but --report-bindings shows binding to 
entire node), but the test problem shows binding to the core I specified.


So, the problem is in --report-bindings?  I'll poke around some.


Re: [OMPI users] MPI_Recv operation time

2012-11-05 Thread Eugene Loh

On 11/5/2012 1:07 AM, huydanlin wrote:

Hi,
   My objective is to measure the time taken by MPI_Send & 
MPI_Recv. In the case of MPI_Send, I can put the timer before the MPI_Send 
and after it, like this:

t1 = MPI_Wtime()
MPI_Send
t2 = MPI_Wtime()
tsend = t2 - t1

This means that when the message goes to the system buffer, 
control returns to the sending process.
It means that the message is out of the user's send buffer.  The time 
could include a rendezvous with the receiving process.  Depending on 
what mechanism is used, a send (e.g., of a long message) might not be 
able to complete until most of the message is already in the receiver's 
buffer.

So I can measure MPI_Send.
   But my problem is with MPI_Recv. If I do the same as for MPI_Send (put the 
timer before and after MPI_Recv), I think that is wrong, because we don't 
know exactly when the message reaches the system buffer on the receiving side.
So how can we measure the MPI_Recv operation time (the time when the 
message is copied from the system buffer to the receive buffer)?
You cannot if you're instrumenting the user's MPI program.  If you want 
to time the various phases of how the message was passed, you would have 
to introduce timers into the underlying MPI implementation.
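
To make the point concrete, here is a small sketch (not from the original thread; run with two processes) of what user-level timers can and cannot see: each timer brackets the whole call, so the MPI_Recv time includes any time spent waiting for the sender, not just the copy from the system buffer into the receive buffer.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
  int rank, n = 1 << 20;                     /* an 8 MB message */
  double t0, t1, *buf;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  buf = calloc(n, sizeof(double));

  if (rank == 0) {
    t0 = MPI_Wtime();
    MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    t1 = MPI_Wtime();
    printf("MPI_Send: %g s (message is out of the user's send buffer)\n", t1 - t0);
  } else if (rank == 1) {
    t0 = MPI_Wtime();
    MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    t1 = MPI_Wtime();
    printf("MPI_Recv: %g s (includes any wait for the sender)\n", t1 - t0);
  }

  free(buf);
  MPI_Finalize();
  return 0;
}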


Re: [OMPI users] segmentation fault with openmpi-1.6.2

2012-09-10 Thread Eugene Loh

On 09/10/12 11:37, Ralph Castain wrote:

On Sep 10, 2012, at 8:12 AM, Aleksey Senin  wrote:


On 10/09/2012 15:41, Siegmar Gross wrote:

Hi,

I have built openmpi-1.6.2rc1 and get the following error.

tyr small_prog 123 mpicc -showme
cc -I/usr/local/openmpi-1.6.2_32_cc/include -mt
   -L/usr/local/openmpi-1.6.2_32_cc/lib -lmpi -lm -lkstat -llgrp
   -lsocket -lnsl -lrt -lm
tyr small_prog 124 mpiexec -np 2 -host tyr init_finalize

Hello!
Hello!

tyr small_prog 125 mpiexec -np 2 -host sunpc4 init_finalize
key_from_blob: remaining bytes in key blob 81

Hello!
Hello!

tyr small_prog 126 mpiexec -np 2 -host tyr,sunpc4 init_finalize
[tyr:23956] *** Process received signal ***
[tyr:23956] Signal: Segmentation Fault (11)
[tyr:23956] Signal code: Address not mapped (1)
[tyr:23956] Failing at address: 18
/.../openmpi-1.6.2_32_cc/lib/libopen-rte.so.4.0.0:0x15434c
/lib/libc.so.1:0xcad04
/lib/libc.so.1:0xbf3b4
/lib/libc.so.1:0xbf59c
/.../openmpi-1.6.2_32_cc/lib/libopen-rte.so.4.0.0:orte_rmaps_base_get_target_nodes+0x1cc
 [ Signal 11 (SEGV)]
/.../openmpi-1.6.2_32_cc/lib/openmpi/mca_rmaps_round_robin.so:0x1ec8
/.../openmpi-1.6.2_32_cc/lib/libopen-rte.so.4.0.0:orte_rmaps_base_map_job+0xe4
/.../openmpi-1.6.2_32_cc/lib/libopen-rte.so.4.0.0:orte_plm_base_setup_job+0xc4
/.../openmpi-1.6.2_32_cc/lib/openmpi/mca_plm_rsh.so:orte_plm_rsh_launch+0x1b0
/.../openmpi-1.6.2_32_cc/bin/orterun:orterun+0x16a8
/.../openmpi-1.6.2_32_cc/bin/orterun:main+0x24
/.../openmpi-1.6.2_32_cc/bin/orterun:_start+0xd8
[tyr:23956] *** End of error message ***
Segmentation fault

Do you have any ideas or suggestions? As I wrote in my email from
yesterday, I had to add "#include" into file
openmpi-1.6.2rc1/ompi/contrib/vt/vt/extlib/otf/tools/otfaux/otfaux.cpp
to have a prototype for function "rint" in line 834. Thank you very
much for any help in advance.

Really? That shouldn't happen - I'll take a look at that one.

Yes, Oracle MTT testing shows 1.6.2rc2r27272 DOA:

% mpirun --host burl-ct-x2200-2 -np 2 hostname
burl-ct-x2200-2
burl-ct-x2200-2
% mpirun --host burl-ct-x2200-3 -np 2 hostname
burl-ct-x2200-3
burl-ct-x2200-3
% mpirun --host burl-ct-x2200-2,burl-ct-x2200-3 -np 2 hostname
[burl-ct-x2200-2:26019] *** Process received signal ***
[burl-ct-x2200-2:26019] Signal: Segmentation fault (11)
[burl-ct-x2200-2:26019] Signal code: Address not mapped (1)
[burl-ct-x2200-2:26019] Failing at address: 0x18
[burl-ct-x2200-2:26019] [ 0] [0xe600]
[burl-ct-x2200-2:26019] [ 1] 
/workspace/euloh/hpc/mtt-scratch/burl-ct-x2200-2/ompi-tarball-testing/installs/kBc6/install/lib/libopen-rte.so.4(orte_rmaps_base_get_target_nodes+0x432) 
[0xf7e6d482]
[burl-ct-x2200-2:26019] [ 2] 
/workspace/euloh/hpc/mtt-scratch/burl-ct-x2200-2/ompi-tarball-testing/installs/kBc6/install/lib/openmpi/mca_rmaps_round_robin.so 
[0xf7dcd8e5]
[burl-ct-x2200-2:26019] [ 3] 
/workspace/euloh/hpc/mtt-scratch/burl-ct-x2200-2/ompi-tarball-testing/installs/kBc6/install/lib/libopen-rte.so.4(orte_rmaps_base_map_job+0x46) 
[0xf7e6c4d6]
[burl-ct-x2200-2:26019] [ 4] 
/workspace/euloh/hpc/mtt-scratch/burl-ct-x2200-2/ompi-tarball-testing/installs/kBc6/install/lib/libopen-rte.so.4(orte_plm_base_setup_job+0x9c) 
[0xf7e647ec]
[burl-ct-x2200-2:26019] [ 5] 
/workspace/euloh/hpc/mtt-scratch/burl-ct-x2200-2/ompi-tarball-testing/installs/kBc6/install/lib/openmpi/mca_plm_rsh.so(orte_plm_rsh_launch+0x244) 
[0xf7dfb634]

[burl-ct-x2200-2:26019] [ 6] mpirun(orterun+0xf5e) [0x804b868]
[burl-ct-x2200-2:26019] [ 7] mpirun(main+0x22) [0x804a8f6]
[burl-ct-x2200-2:26019] [ 8] /lib/libc.so.6(__libc_start_main+0xdc) 
[0xb10dec]

[burl-ct-x2200-2:26019] [ 9] mpirun [0x804a851]
[burl-ct-x2200-2:26019] *** End of error message ***
Segmentation fault



Re: [OMPI users] Regarding the execution time calculation

2012-05-05 Thread Eugene Loh
MPI_Wtime() returns the elapsed time since some arbitrary time in the 
past.  It is a measure of "wallclock" time, not of CPU time or anything.
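
A tiny illustration (a sketch, assuming a POSIX sleep()): the process consumes essentially no CPU during the sleep, yet roughly one second of wall-clock time elapses between the two MPI_Wtime() calls.

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv) {
  double t0, t1;
  MPI_Init(&argc, &argv);
  t0 = MPI_Wtime();
  sleep(1);                 /* no CPU time used, but wall-clock time passes */
  t1 = MPI_Wtime();
  printf("elapsed = %f s, timer resolution = %g s\n", t1 - t0, MPI_Wtick());
  MPI_Finalize();
  return 0;
}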


On 5/4/2012 3:08 PM, Jingcha Joba wrote:

Let's say I have code like this:
start = MPI_Wtime()

stop = MPI_Wtime();
What happens when, right after start = MPI_Wtime(), the timeslice of the 
process (from the operating system's perspective, not the MPI process) is 
over, the operating system saves the context and schedules the next process, 
and eventually this application resumes once its process is scheduled back 
by the OS?
Does MPI_Wtime() take care of storing/updating the time when this 
happens?

Of course, part of the answer lies in the implementation of Wtime.

On Fri, May 4, 2012 at 3:53 AM, Jeff Squyres wrote:


On May 3, 2012, at 2:02 PM, Jingcha Joba wrote:

> Not related to this question , but just curious, is Wtime
context switch safe ?

Not sure exactly what you're asking here...?



Re: [OMPI users] MPI_Testsome with incount=0, NULL array_of_indices and array_of_statuses causes MPI_ERR_ARG

2012-03-15 Thread Eugene Loh

On 03/13/12 13:25, Jeffrey Squyres wrote:

On Mar 9, 2012, at 5:17 PM, Jeremiah Willcock wrote:

On Open MPI 1.5.1, when I call MPI_Testsome with incount=0 and the two output 
arrays NULL, I get an argument error (MPI_ERR_ARG).  Is this the intended 
behavior?  If incount=0, no requests can complete, so the output arrays can 
never be written to.  I do not see anything in the MPI 2.2 standard that says 
either way whether this is allowed.

I have no strong opinions here, so I coded up a patch to just return 
MPI_SUCCESS in this scenario (attached).

If no one objects, we can probably get this in 1.6.


It isn't enough just to return MPI_SUCCESS when the count is zero.  The 
man pages indicate what behavior is expected when count==0 and the MTT 
tests (ibm/pt2pt/[test|wait][any|some|all].c) check for this behavior.  
Put another way, a bunch of MTT tests started failing since r26138 due 
to quick return on count=0.


Again, the trunk since r26138 sets no output values when count=0.  In 
contrast, the ibm/pt2pt/*.c tests correctly check for the count=0 
behavior that we document in our man pages.  Here are excerpts from our 
man pages:


  Testall

Returns flag = true if all communications associated
with active handles in the array have completed (this
includes the case where no handle in the list is active).

  Testany

MPI_Testany tests for completion of either one or none
of the operations associated with active handles.  In
the latter case (no operation completed), it returns
flag = false, returns a value of MPI_UNDEFINED in index,
and status is undefined.

The array may contain null or inactive handles. If the
array contains no active handles then the call returns
immediately with flag = true, index = MPI_UNDEFINED,
and an empty status.

  Testsome

If there is no active handle in the list, it returns
outcount = MPI_UNDEFINED.

  Waitall

[...no issues...]

  Waitany

The array_of_requests list may contain null or inactive
handles.  If the list contains no active handles (list
has length zero or all entries are null or inactive),
then the call returns immediately with index = MPI_UNDEFINED,
and an empty status.

  Waitsome

If the list contains no active handles, then the call
returns immediately with outcount = MPI_UNDEFINED.

I'll test and put back the attached patch.
Index: trunk/ompi/mpi/c/testall.c
===================================================================
--- trunk/ompi/mpi/c/testall.c  (revision 26147)
+++ trunk/ompi/mpi/c/testall.c  (working copy)
@@ -67,6 +67,7 @@
     }

     if (OPAL_UNLIKELY(0 == count)) {
+        *flag = true;
         return MPI_SUCCESS;
     }

Index: trunk/ompi/mpi/c/waitany.c
===================================================================
--- trunk/ompi/mpi/c/waitany.c  (revision 26147)
+++ trunk/ompi/mpi/c/waitany.c  (working copy)
@@ -67,6 +67,7 @@
     }

     if (OPAL_UNLIKELY(0 == count)) {
+        *indx = MPI_UNDEFINED;
         return MPI_SUCCESS;
     }

Index: trunk/ompi/mpi/c/testany.c
===================================================================
--- trunk/ompi/mpi/c/testany.c  (revision 26147)
+++ trunk/ompi/mpi/c/testany.c  (working copy)
@@ -67,6 +67,8 @@
     }

     if (OPAL_UNLIKELY(0 == count)) {
+        *completed = true;
+        *indx = MPI_UNDEFINED;
         return MPI_SUCCESS;
     }

Index: trunk/ompi/mpi/c/waitsome.c
===================================================================
--- trunk/ompi/mpi/c/waitsome.c (revision 26147)
+++ trunk/ompi/mpi/c/waitsome.c (working copy)
@@ -69,6 +69,7 @@
     }

     if (OPAL_UNLIKELY(0 == incount)) {
+        *outcount = MPI_UNDEFINED;
         return MPI_SUCCESS;
     }

Index: trunk/ompi/mpi/c/testsome.c
===================================================================
--- trunk/ompi/mpi/c/testsome.c (revision 26147)
+++ trunk/ompi/mpi/c/testsome.c (working copy)
@@ -69,6 +69,7 @@
     }

     if (OPAL_UNLIKELY(0 == incount)) {
+        *outcount = MPI_UNDEFINED;
         return OMPI_SUCCESS;
     }



Re: [OMPI users] parallelising ADI

2012-03-06 Thread Eugene Loh
Parallelize in distributed-memory fashion or is multi-threaded good 
enough?  Anyhow, you should be able to find many resources with an 
Internet search.  This particular mailing list is more for users of 
OMPI, a particular MPI implementation.  One approach would be to 
distribute only one axis, solve locally, and transpose axes as 
necessary.  But, I see Gus also just provided an answer...  :^)


On 3/6/2012 12:59 PM, Kharche, Sanjay wrote:

I am working on a 3D ADI solver for the heat equation. I have implemented it as 
serial. Would anybody be able to indicate the best and most straightforward way 
to parallelise it? Apologies if this is going to the wrong forum.


Re: [OMPI users] Mpirun: How to print STDOUT of just one process?

2012-02-01 Thread Eugene Loh

On 2/1/2012 7:59 AM, Frank wrote:

When running

mpirun -n 2

the STDOUT streams of both processes are combined and are displayed by
the shell. In such an interleaved format it's hard to tell which line
comes from which node.
As far as this part goes, there is also "mpirun --tag-output".  Check 
the mpirun man page.

Is there a way to have mpirun just merge the STDOUT of one process into its
STDOUT stream?


Re: [OMPI users] Openmpi performance issue

2011-12-27 Thread Eugene Loh
If I remember correctly, both Intel MPI and MVAPICH2 bind processes by 
default.  OMPI does not.  There are many cases where the "bind by 
default" behavior gives better default performance.  (There are also 
cases where it can give catastrophically worse performance.)  Anyhow, it 
seems possible to me that this accounts for the difference you're seeing.


To play with binding in OMPI, you can try adding "--bind-to-socket 
--bysocket" to your mpirun command line, though what to try can depend 
on what version of OMPI you're using as well as details of your 
processor (HyperThreads?), your application, etc.  There's a FAQ entry 
at http://www.open-mpi.org/faq/?category=tuning#using-paffinity


On 12/27/2011 6:45 AM, Ralph Castain wrote:
It depends a lot on the application and how you ran it. Can you 
provide some info? For example, if you oversubscribed the node, then 
we dial down the performance to provide better cpu sharing. Another 
point: we don't bind processes by default while other MPIs do. Etc.


So more info (like the mpirun command line you used, which version you 
used, how OMPI was configured, etc.) would help.



On Dec 27, 2011, at 6:35 AM, Eric Feng wrote:


Can anyone help me?
I get a similar performance issue when comparing to MVAPICH2, which is 
much faster in each MPI function in the real application but similar in 
the IMB benchmark.



*From:* Eric Feng
*To:* us...@open-mpi.org

*Sent:* Friday, December 23, 2011 9:12 PM
*Subject:* [OMPI users] Openmpi performance issue

Hello,

I am running into a performance issue with Open MPI, and I hope the experts 
here can provide me some help.

I have one application that calls a lot of sendrecv and isend/irecv, and also 
waitall. When I run Intel MPI, it is around 30% faster than Open MPI.
However, if I test sendrecv using IMB, Open MPI is even faster than 
Intel MPI, but when run with the real application, Open MPI is much 
slower than Intel MPI in all MPI functions, judging by the profiling 
results. So this is not an issue with a single function; there is an overall 
drawback somewhere. Can anyone give me some suggestions of where to 
tune to make it run faster with the real application?


Re: [OMPI users] Process Migration

2011-11-10 Thread Eugene Loh

On 11/10/2011 5:19 AM, Jeff Squyres wrote:

On Nov 10, 2011, at 8:11 AM, Mudassar Majeed wrote:

Thank you for your reply. I am implementing a load-balancing function for MPI 
that will balance both the computation load and the communication at the same time. So 
my algorithm assumes that all the cores may in the end get different numbers of 
processes to run.

Are you talking about over-subscribing cores?  I.e., putting more than 1 MPI 
process on each core?

In general, that's not a good idea.

In the beginning (before that function is called), each core will have an 
equal number of processes. So I am thinking of starting more processes on 
each core (than needed), running my function for load balancing, and then blocking 
the remaining processes (on each core). In this way I will be able to achieve 
different numbers of processes per core.

Open MPI spins aggressively looking for network progress.  For example, if you 
block in an MPI_RECV waiting for a message, Open MPI is actively banging on the 
CPU looking for network progress.  Because of this (and other reasons), you 
probably do not want to over-subscribe your processors (meaning: you probably 
don't want to put more than 1 process per core).
Or, introduce your own MPI_Test/sleep loop if you really feel that you 
otherwise want to oversubscribe.  Watch out for pitfalls.


Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for OpenMPI usage

2011-11-03 Thread Eugene Loh

Right.  Actually "--mca btl ^sm".  (Was missing "btl".)

On 11/3/2011 11:19 AM, Blosch, Edwin L wrote:

I don't tell OpenMPI what BTLs to use. The default uses sm and puts a session 
file on /tmp, which is NFS-mounted and thus not a good choice.

Are you suggesting something like --mca ^sm?


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Eugene Loh
Sent: Thursday, November 03, 2011 12:54 PM
To: us...@open-mpi.org
Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for 
OpenMPI usage

I've not been following closely.  Why must one use shared-memory
communications?  How about using other BTLs in a "loopback" fashion?


Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for OpenMPI usage

2011-11-03 Thread Eugene Loh
I've not been following closely.  Why must one use shared-memory 
communications?  How about using other BTLs in a "loopback" fashion?


Re: [OMPI users] Application in a cluster

2011-10-19 Thread Eugene Loh
Maybe someone else on this list has a better idea what you're trying to 
do, but I'll attempt to answer your question.


MPI is basically a set of library calls that can be used for processes 
to communicate with one another.  Of course, a program need not have any 
MPI calls in it, but if you want to parallelize an application, one 
strategy is to insert MPI calls so that the application can run as 
multiple processes that communicate via the MPI calls.


In essence, all mpirun does is launch multiple processes on a cluster.  
Those processes can run independently, but if they have MPI calls in 
them then they can work cooperatively on the same computation.


The questions you ask are more about basic MPI usage than about Open 
MPI.  There are good introductory MPI materials on the Internet.


On 10/19/2011 8:57 AM, Jorge Jaramillo wrote:
Hello everyone, I have a doubt about how to execute a parallel 
application on a cluster.
How was the application parallelized?  With MPI?  With OpenMP 
(multi-threaded model, not related to Open MPI)?  Other?
I used the 'mpirun' to execute some applications and they worked, but 
I guess this command only is useful with MPI applications.
mpirun can be used to launch multiple processes on a cluster even if 
those processes do not use MPI, but typically the real value of mpirun 
is to launch MPI jobs.
My question is, How do I execute a program that has no MPI statements 
on the cluster?

Again, is it parallelized?  Using what parallelization model?
If it is not possible, how do I change the structure of the program so 
it can be executed as a parallel application?
First, choose a parallelization model -- MPI (for multi-process 
parallelization on a cluster), OpenMP (multi-threaded parallelization on 
a shared-memory system), etc.  Search for on-line resources for more 
information.


Re: [OMPI users] Proper way to redirect GUI

2011-10-02 Thread Eugene Loh
Often you set the environment variable DISPLAY to any display you like.  
Export environment variables with "mpirun -x DISPLAY".


On 10/2/2011 5:32 AM, Xin Tong wrote:
I am launching a program with a GUI interface. How do I redirect the GUI 
to the machine I issued mpirun on?


Re: [OMPI users] MPIRUN + Environtment Variable

2011-09-30 Thread Eugene Loh

 On 09/29/11 20:54, Xin Tong wrote:
I need to set up some environment variables before I run my 
application (appA). I am currently using mpirun -np 1 -host socrates 
(socrates is another machine) appA. Before appA runs, it expects 
some environment variables to be set up. How do I do that?

% man mpirun
...
 To manage files and runtime environment:
...
 -x <env>
  Export  the  specified  environment  variables  to  the
  remote  nodes  before  executing the program.  Only one
  environment variable can be specified  per  -x  option.
  Existing  environment variables can be specified or new
  variable names  specified  with  corresponding  values.
  For example:
  % mpirun -x DISPLAY -x OFILE=/tmp/out ...

  The parser for the -x option is not very sophisticated;
  it  does  not even understand quoted values.  Users are
  advised to set variables in the environment,  and  then
  use -x to export (not define) them.


Re: [OMPI users] EXTERNAL: Re: How could OpenMPI (or MVAPICH) affect floating-point results?

2011-09-20 Thread Eugene Loh
I've not been following closely.  How do you know you're using the 
identical compilation flags?  Are you saying you specify the same flags 
to "mpicc" (or whatever) or are you confirming that the back-end 
compiler is seeing the same flags?  The MPI compiler wrapper (mpicc, et 
al.) can add flags.  E.g., as I remember it, "mpicc" with no flags means 
no optimization with OMPI but with optimization for MVAPICH.


On 9/20/2011 7:50 AM, Blosch, Edwin L wrote:

- It was exact same compiler, with identical compilation flags.


Re: [OMPI users] custom sparse collective non-reproducible deadlock, MPI_Sendrecv, MPI_Isend/MPI_Irecv or MPI_Send/MPI_Recv question

2011-09-19 Thread Eugene Loh



On 9/18/2011 9:12 AM, Evghenii Gaburov wrote:

Hi All,

Update to the original posting: METHOD4 also resulted in a deadlock on system HPC2 after 
5h of run with 32 MPI tasks; also, "const int scale=1;" was missing in the code 
snippet posted above.

--Evghenii


Message: 2
Date: Sun, 18 Sep 2011 02:06:33 +
From: Evghenii Gaburov
Subject: [OMPI users] custom sparse collective non-reproducible
deadlock, MPI_Sendrecv, MPI_Isend/MPI_Irecv or MPI_Send/MPI_Recv
question
To: "us...@open-mpi.org"
Message-ID:<8509050a-7357-408e-8d58-c5aefa7b3...@northwestern.edu>
Content-Type: text/plain; charset="us-ascii"

Hi All,

My MPI program's basic task consists of regularly establishing point-to-point 
communication with other procs via MPI_Alltoall, and then to communicate data. 
I tested it on two HPC clusters with 32-256 MPI tasks. One of the systems 
(HPC1) this custom collective runs flawlessly, while on another one (HPC2) the 
collective causes non-reproducible deadlocks (after a day of running, or after 
of few hours or so). So, I want to figure out whether it is a system (HPC2) bug 
that I can communicate to HPC admins, or a subtle bug in my code that needs to 
be fixed. One possibly important point, I communicate huge amount of data 
between tasks (up to ~2GB of data) in several all2all calls.

I would like to have expert eyes look at the code to confirm or disprove 
that the code is deadlock-safe. I have implemented several methods (METHOD1 - 
METHOD4) that, if I am not mistaken, should in principle be deadlock-safe. 
However, as a beginner MPI user, I can easily miss something subtle, so I 
seek your help with this! I mostly used METHOD4, which caused periodic 
deadlocks, after having deadlocks with METHOD1 and METHOD2. On HPC1 none of these 
methods deadlocks in my runs. METHOD3 I am currently testing, so I cannot comment 
on it yet, but I will later; however, I will be happy to hear your comments.

Both systems use openmpi-1.4.3.

Your answers will be of great help! Thanks!

Cheers,
Evghenii

Here is the code snippet:

  template<class T>
  void all2all(std::vector<T> sbuf[], std::vector<T> rbuf[],
               const int myid,
               const int nproc)
  {
    static int nsend[NMAXPROC], nrecv[NMAXPROC];
    for (int p = 0; p < nproc; p++)
      nsend[p] = sbuf[p].size();
    MPI_Alltoall(nsend, 1, MPI_INT, nrecv, 1, MPI_INT, MPI_COMM_WORLD);  // let the other tasks know how much data they will receive from this one

#ifdef _METHOD1_

    static MPI_Status  stat[NMAXPROC  ];
    static MPI_Request req[NMAXPROC*2];
    int nreq = 0;
    for (int p = 0; p < nproc; p++)
      if (p != myid)
      {
        const int scount = nsend[p];
        const int rcount = nrecv[p];
        rbuf[p].resize(rcount);
        if (scount > 0) MPI_Isend(&sbuf[p][0], nscount, datatype(), p, 1, MPI_COMM_WORLD, &req[nreq++]);
        if (rcount > 0) MPI_Irecv(&rbuf[p][0], rcount,  datatype(), p, 1, MPI_COMM_WORLD, &req[nreq++]);
      }
    rbuf[myid] = sbuf[myid];
    MPI_Waitall(nreq, req, stat);
Incidentally, here, you have 2*nproc requests, interlacing sends and 
receives.  Your array of statuses, however, is only MAXPROC big.  I 
think you need to declare stat[NMAXPROC*2].  Also, do you want scount in 
place of nscount?


Re: [OMPI users] RE : MPI hangs on multiple nodes

2011-09-19 Thread Eugene Loh
Should be fine.  Once MPI_Send returns, it should be safe to reuse the 
buffer.  In fact, the return of the call is the only way you have of 
checking that the message has left the user's send buffer.  The case 
you're worried about is probably MPI_Isend, where you have to check 
completion with an MPI_Test* or MPI_Wait* call.
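
A short sketch (not from the original thread; run with two processes) of the distinction: after the blocking MPI_Send returns, the buffer may be reused immediately, whereas after MPI_Isend the buffer must be left alone until MPI_Wait (or a successful MPI_Test) completes the request.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  int rank;
  double a[4] = {0.0, 0.0, 0.0, 0.0};
  MPI_Request req;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0) {
    MPI_Send(a, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    a[0] = 1.0;                       /* safe: the blocking send has returned */

    MPI_Isend(a, 4, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD, &req);
    /* a[] must not be modified here: the send may still be using it */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    a[0] = 2.0;                       /* safe only after the request completes */
  } else if (rank == 1) {
    MPI_Recv(a, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Recv(a, 4, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("rank 1 received both messages\n");
  }

  MPI_Finalize();
  return 0;
}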


On 9/19/2011 6:26 AM, Sébastien Boisvert wrote:

Hello,

Is it safe to re-use the same buffer (variable A) for MPI_Send and MPI_Recv 
given that MPI_Send may be eager depending on
the MCA parameters ?


Re: [OMPI users] Problem with MPI_Wtime()

2011-09-15 Thread Eugene Loh

On 9/15/2011 5:51 AM, Ghislain Lartigue wrote:

start_0 = MPI_Wtime()

start_1 = MPI_Wtime()
call foo()
end_1 = MPI_Wtime()
write(*,*) "timer1 = ", end_1 - start_1

start_2 = MPI_Wtime()
call bar()
end_2 = MPI_Wtime()
write(*,*) "timer2 = ", end_2 - start_2

end_0 = MPI_Wtime()
write(*,*) "timer0 = ", end_0 - start_0

==

When I run my code on a "small" number of processors, I find that 
timer0=timer1+timer2 with a very good precision (less than 1%).
However, as I increase the number of processors, this is not true any more: I 
can have 10%, 20% or even more discrepancy!
The more processors I use, the bigger the errors that are observed.

Obviously, my code is much bigger than the simple example above, but the 
principle is exactly the same.
In the simple example, if timer0 is much bigger than timer1+timer2, we'd 
be inclined to attribute extra time to the timer calls or the write 
statements... in any case, to time spent between end_1 and start_2 or 
between end_2 and end_0.  Are you sure in the actual code there are no 
substantial operations in those sections?  Also, is it possible your 
processes are not running during some of those times?  Are you 
oversubscribing?  Also, instead of printing out endX-startX, how about 
writing out endX and startX individually so you get all six timestamps 
and can see in greater detail where the discrepancy is arising.


Re: [OMPI users] Problem with MPI_BARRIER

2011-09-09 Thread Eugene Loh



On 9/8/2011 11:47 AM, Ghislain Lartigue wrote:

I guess you're perfectly right!
I will try to test it tomorrow by putting a call to system("wait(X)") before the 
barrier!

What does "wait(X)" mean?

Anyhow, here is how I see your computation:

A)  The first barrier simply synchronizes the processes.
B)  Then you start a bunch of non-blocking, point-to-point messages.
C)  Then another barrier.
D)  Finally, the point-to-point messages are completed.

Your mental model might be that A, B, and C should be fast and that D 
should take a long time.  The reality may be that the completion of all 
those messages is actually taking place during C.


How about the following?

Barrier
t0 = MPI_Wtime()
start all non-blocking messages
t1 = MPI_Wtime()
Barrier
t2 = MPI_Wtime()
complete all messages
t3 = MPI_Wtime()
Barrier
t4 = MPI_Wtime()

Then, look at the data from all the processes graphically.  Compare the 
picture to the same experiment, but with middle Barrier missing.  
Presumably, the full iteration will take roughly as long in both cases.  
The difference, I might expect, would be that with the middle barrier 
present, it gets all the time and the message-completion is fast.  
Without the middle barrier, the message completion is slow.  So, message 
completion is taking a long time either way and the only difference is 
whether it's taking place during your MPI_Test loop or during what you 
thought was only a barrier.
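
In concrete terms, the experiment above might look like the following sketch (in C, although the code in this thread is Fortran); the message start/completion steps are placeholders for the application's own calls:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  int rank;
  double t[5];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  MPI_Barrier(MPI_COMM_WORLD);
  t[0] = MPI_Wtime();
  /* start all non-blocking messages here (the MPI_Irecv/MPI_Isend loop) */
  t[1] = MPI_Wtime();
  MPI_Barrier(MPI_COMM_WORLD);
  t[2] = MPI_Wtime();
  /* complete all messages here (the MPI_Test loop) */
  t[3] = MPI_Wtime();
  MPI_Barrier(MPI_COMM_WORLD);
  t[4] = MPI_Wtime();

  /* print raw timestamps so all ranks can be compared afterwards */
  printf("rank %d: %.6f %.6f %.6f %.6f %.6f\n",
         rank, t[0], t[1], t[2], t[3], t[4]);

  MPI_Finalize();
  return 0;
}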


A simple way of doing all this is to run with a time-line profiler... 
some MPI performance analysis tool.  You won't have to instrument the 
code, dump timings, or figure out graphics.  Just look at pretty 
pictures!  There is some description of tool candidates in the OMPI FAQ 
at http://www.open-mpi.org/faq/?category=perftools

PS:
if anyone has more information about the implementation of the MPI_IRECV() 
procedure, I would be glad to learn more about it!
I don't know how much detail you want here, but I suspect not much 
detail is warranted.  There is a lot of complexity here, but I think a 
few key ideas will help.


First, I'm pretty sure you're sending "long" messages.  OMPI usually 
sends such messages by queueing up a request.  These requests can, in 
the general case, be "progressed" whenever an MPI call is made.  So, 
whenever you make an MPI call, get away from the thought that you're 
doing one specific thing, as specified by the call and its arguments.  
Think instead that you will also be looking around to see whatever other 
MPI work can be progressed.


Re: [OMPI users] Problem with MPI_BARRIER

2011-09-08 Thread Eugene Loh
I should know OMPI better than I do, but generally, when you make an MPI 
call, you could be diving into all kinds of other stuff.  E.g., with 
non-blocking point-to-point operations, a message might make progress 
during another MPI call.  E.g.,


MPI_Irecv(recv_req)
MPI_Isend(send_req)
MPI_Wait(send_req)
MPI_Wait(recv_req)

A receive is started in one call and completed in another, but it's 
quite possible that most of the data transfer (and waiting time) occurs 
while the program is in the calls associated with the send.  The 
accounting gets tricky.


So, I'm guessing during the second barrier, MPI is busy making progress 
on the pending non-blocking point-to-point operations, where progress is 
possible.  It isn't purely a barrier operation.


On 9/8/2011 8:04 AM, Ghislain Lartigue wrote:

This behavior happens at every call (first and following)


Here is my code (simplified):


start_time = MPI_Wtime()
call mpi_ext_barrier()
new_time = MPI_Wtime()-start_time
write(local_time,'(F9.1)') new_time*1.0e9_WP/(36.0_WP*36.0_WP*36.0_WP)
call print_message("CAST GHOST DATA2 LOOP 1 barrier "//trim(local_time),0)

 do conn_index_id=1, Nconn(conn_type_id)

   ! loop over data
   this_data =>  block%data
   do while (associated(this_data))

 MPI_IRECV(...)
 MPI_ISEND(...)

   this_data =>  this_data%next
   enddo

endif

 enddo

  enddo

start_time = MPI_Wtime()
call mpi_ext_barrier()
new_time = MPI_Wtime()-start_time
write(local_time,'(F9.1)') new_time*1.0e9_WP/(36.0_WP*36.0_WP*36.0_WP)
call print_message("CAST GHOST DATA2 LOOP 2 barrier "//trim(local_time),0)

  done=.false.
  counter = 0
  do while (.not.done)
 do ireq=1,nreq
if (recv_req(ireq)/=MPI_REQUEST_NULL) then
   call MPI_TEST(recv_req(ireq),found,mystatus,icommerr)
   if (found) then
  call 
  counter=counter+1
   endif
endif
 enddo
 if (counter==nreq) then
done=.true.
 endif
  enddo


The first call to the barrier works perfectly fine, but the second one gives 
the strange behavior...

Ghislain.

On Sep 8, 2011, at 16:53, Eugene Loh wrote:


On 9/8/2011 7:42 AM, Ghislain Lartigue wrote:

I will check that, but as I said in first email, this strange behaviour happens 
only in one place in my code.

Is the strange behavior on the first time, or much later on?  (You seem to 
imply later on, but I thought I'd ask.)

I agree the behavior is noteworthy, but it's plausible and there's not enough 
information to explain it based solely on what you've said.

Here is one scenario.  I don't know if it applies to you since I know very 
little about what you're doing.  I think with VampirTrace, you can collect 
performance data into large buffers.  Occasionally, the buffers need to be 
flushed to disk.  VampirTrace will wait for a good opportunity to do so -- 
e.g., a global barrier.  So, you execute lots of barriers, but suddenly you hit 
one where VT wants to flush to disk.  This takes a long time and everyone in 
the barrier spends a long time in the barrier.  Then, execution resumes and 
barrier performance looks again like what it used to look like.

Again, there are various scenarios to explain what you see.  More information 
would be needed to decide which applies to you.


Re: [OMPI users] Problem with MPI_BARRIER

2011-09-08 Thread Eugene Loh

On 9/8/2011 7:42 AM, Ghislain Lartigue wrote:

I will check that, but as I said in first email, this strange behaviour happens 
only in one place in my code.
Is the strange behavior on the first time, or much later on?  (You seem 
to imply later on, but I thought I'd ask.)


I agree the behavior is noteworthy, but it's plausible and there's not 
enough information to explain it based solely on what you've said.


Here is one scenario.  I don't know if it applies to you since I know 
very little about what you're doing.  I think with VampirTrace, you can 
collect performance data into large buffers.  Occasionally, the buffers 
need to be flushed to disk.  VampirTrace will wait for a good 
opportunity to do so -- e.g., a global barrier.  So, you execute lots of 
barriers, but suddenly you hit one where VT wants to flush to disk.  
This takes a long time and everyone in the barrier spends a long time in 
the barrier.  Then, execution resumes and barrier performance looks 
again like what it used to look like.


Again, there are various scenarios to explain what you see.  More 
information would be needed to decide which applies to you.


Re: [OMPI users] Problem with MPI_BARRIER

2011-09-08 Thread Eugene Loh
 I agree sentimentally with Ghislain.  The time spent in a barrier 
should conceptually be some wait time, which can be very long (possibly 
on the order of milliseconds or even seconds), and the time to execute 
the barrier operations, which should essentially be "instantaneous" on 
some time scale... in any case, very fast (probably on the order of 
microseconds).  The "laggard" who holds up the barrier operation should 
report a very fast time.  The other processes might show very much 
longer times, depending on how well synchronized they were.  At least 
one, however, should be "fast".


I agree with others that it'd be nice to know the time units so that we 
can judge the speeds.


Anyhow, is this on the first barrier of the program?  If one repeats the 
timing operation several times in succession, do the barrier times 
remain high?


On 09/08/11 10:27, Jai Dayal wrote:

what tick value are you using (i.e., what units are you using?)

On Thu, Sep 8, 2011 at 10:25 AM, Ghislain Lartigue wrote:


Thanks,

I understand this but the delays that I measure are huge compared
to a classical ack procedure... (1000x more)
And this is repeatable: as far as I understand it, this shows that
the network is not involved.

Ghislain.


On Sep 8, 2011, at 16:16, Teng Ma wrote:

> I guess you forget to count the "leaving time"(fan-out).  When
everyone
> hits the barrier, it still needs "ack" to leave.  And remember
in most
> cases, leader process will send out "acks" in a sequence way.
 It's very
> possible:
>
> P0 barrier time = 29 + send/recv ack 0
> P1 barrier time = 14 + send ack 0  + send/recv ack 1
> P2 barrier time = 0 + send ack 0 + send ack 1 + send/recv ack 2
>
> That's your measure time.
>
> Teng
>> This problem as nothing to do with stdout...
>>
>> Example with 3 processes:
>>
>> P0 hits barrier at t=12
>> P1 hits barrier at t=27
>> P2 hits barrier at t=41
>>
>> In this situation:
>> P0 waits 41-12 = 29
>> P1 waits 41-27 = 14
>> P2 waits 41-41 = 00
>
>
>
>> So I should see something  like (no ordering is expected):
>> barrier_time = 14
>> barrier_time = 00
>> barrier_time = 29
>>
>> But what I see is much more like
>> barrier_time = 22
>> barrier_time = 29
>> barrier_time = 25
>>
>> See? No process has a barrier_time equal to zero !!!
>>
>>
>>
>> On Sep 8, 2011, at 14:55, Jeff Squyres wrote:
>>
>>> The order in which you see stdout printed from mpirun is not
necessarily
>>> reflective of what order things were actually printers.
 Remember that
>>> the stdout from each MPI process needs to flow through at least 3
>>> processes and potentially across the network before it is actually
>>> displayed on mpirun's stdout.
>>>
>>> MPI process -> local Open MPI daemon -> mpirun -> printed to
mpirun's
>>> stdout
>>>
>>> Hence, the ordering of stdout can get transposed.
>>>
>>>
>>> On Sep 8, 2011, at 8:49 AM, Ghislain Lartigue wrote:
>>>
 Thank you for this explanation but indeed this confirms that
the LAST
 process that hits the barrier should go through nearly
instantaneously
 (except for the broadcast time for the acknowledgment signal).
 And this is not what happens in my code : EVERY process waits
for a
 very long time before going through the barrier (thousands of
times
 more than a broadcast)...


 On Sep 8, 2011, at 14:26, Jeff Squyres wrote:

> Order in which processes hit the barrier is only one factor
in the
> time it takes for that process to finish the barrier.
>
> An easy way to think of a barrier implementation is a "fan
in/fan out"
> model.  When each nonzero rank process calls MPI_BARRIER, it
sends a
> message saying "I have hit the barrier!" (it usually sends
it to its
> parent in a tree of all MPI processes in the communicator,
but you can
> simplify this model and consider that it sends it to rank
0).  Rank 0
> collects all of these messages.  When it has messages from all
> processes in the communicator, it sends out "ok, you can
leave the
> barrier now" messages (again, it's usually via a tree
distribution,
> but you can pretend that it directly, linearly sends a
message to each
> peer process in the communicator).
>
> Hence, the time that any individual process spends in the
communicator
> is relative to when every other process enters the
communicator.  But
> it's also dependent upon communication speed, congestion in the
> network, etc.
>
>
> On Sep 8, 2011, at 6:20 AM, Ghis

Re: [OMPI users] High CPU usage with yield_when_idle =1 on CFS

2011-09-01 Thread Eugene Loh

On 8/31/2011 11:48 PM, Randolph Pullen wrote:
I recall a discussion some time ago about yield, the Completely F%’d 
Scheduler (CFS) and OpenMPI.


My system is currently suffering from massive CPU use while busy 
waiting.  This gets worse as I try to bump up user concurrency.

Yup.

I am running with yield_when_idle but its not enough.

Yup.

Is there anything else I can do to release some CPU resource?
I recall seeing one post where usleep(1) was inserted around the 
yields, is this still feasible?


I'm using 1.4.1 - is there a fix to be found in upgrading?
Unfortunately I am stuck  with the CFS as I need Linux.  Currently its 
Ubuntu 10.10 with 2.6.32.14
I think OMPI doesn't yet do (much/any) better than what you've 
observed.  You might be able to hack something up yourself.  In 
something I did recently, I replaced blocking sends and receives with 
test/nanosleep loops.  An "optimum" solution (minimum latency, optimal 
performance at arbitrary levels of under and oversubscription) might be 
elusive, but hopefully you'll quickly be able to piece together 
something for your particular purposes.  In my case, I was lucky and the 
results were very gratifying... my bottleneck vaporized for modest 
levels of oversubscription (2-4 more processes than processors).
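
For what it's worth, the test/nanosleep idea looks roughly like the following sketch (assuming POSIX nanosleep; the 100-microsecond poll interval and the helper name are illustrative only).  It is a drop-in replacement for a blocking receive that releases the core between polls rather than spinning:

#include <mpi.h>
#include <time.h>

/* Hypothetical helper: a "polite" blocking receive that polls a
 * non-blocking request and sleeps briefly between polls. */
int polite_recv(void *buf, int count, MPI_Datatype type, int src, int tag,
                MPI_Comm comm, MPI_Status *status)
{
    MPI_Request req;
    int done = 0;
    struct timespec ts = {0, 100000};   /* 100 microseconds between polls */

    MPI_Irecv(buf, count, type, src, tag, comm, &req);
    MPI_Test(&req, &done, status);
    while (!done) {
        nanosleep(&ts, NULL);           /* give up the CPU briefly */
        MPI_Test(&req, &done, status);
    }
    return MPI_SUCCESS;
}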


Re: [OMPI users] poll taking too long in open-mpi

2011-08-26 Thread Eugene Loh

On 8/23/2011 1:24 PM, Dick Kachuma wrote:

I have used gprof to profile a program that uses openmpi. The result
shows that the code spends a long time in poll (37% on 8 cores, 50% on
16 and 85% on 32). I was wondering if there is anything I can do to
reduce the time spent in poll.
In serial performance optimization, if you spend a lot of time in a 
function, you try to speed that function up.  In parallel programming, 
if you spend a lot of time in a function that waits, speeding the 
function up probably will not help.  You need to speed up the thing 
you're waiting on.  In the case of poll, it is quite likely that the 
issue is not that poll is slow, but you're waiting on someone else.


Other performance tools might help here.  Check 
http://www.open-mpi.org/faq/?category=perftools  E.g., a timeline view 
of your run might be able to show you what other processes are doing 
while the long-poll process is idling.

I cannot determine the number of calls
made to poll and exactly where they are. The bulk of my code uses
exclusively MPI_Ssend for the send and MPI_Irecv and MPI_Wait for the
receive. For instance, would there be any gain expected if I switch
from MPI_Ssend to MPI_Send?
It depends.  Try to find out which MPI calls are taking a lot of time.  
How long are the messages you're sending?  If they're short and you're 
spending a lot of time in MPI_Ssend, then switching to MPI_Send could help.

Alternatively would there be any gain in
switching to MPI_Isend/MPI_Recv instead of MPI_Ssend/MPI_Irecv?
Using Isend and Irecv can help if you can do useful work during the time 
you're waiting for a non-blocking operation to complete.
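
The overlap idea, as a sketch (not from the poster's code; the function name and the work array are illustrative): start the non-blocking operations, do useful local work that touches neither buffer, then complete them.  Whether this reduces the time attributed to poll depends on how much independent work is available.

#include <mpi.h>

double exchange_with_overlap(double *sendbuf, double *recvbuf,
                             double *work, int n, int peer, MPI_Comm comm)
{
    MPI_Request req[2];
    double sum = 0.0;
    int i;

    /* start the communication ... */
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, peer, 0, comm, &req[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, peer, 0, comm, &req[1]);

    /* ... do useful work that needs neither sendbuf nor recvbuf ... */
    for (i = 0; i < n; i++)
        sum += work[i] * work[i];

    /* ... then complete the communication */
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    return sum;
}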


Before considering too many strategies, however, it may make most sense 
to get more performance information on your application.  Which MPI 
calls are taking the most time?  What message patterns characterize your 
slowdown?  Are all processes spending lots of time in MPI, or is there 
one process that is busy in computation and upon whom everyone else is 
waiting?


Re: [OMPI users] Documentation of MPI Implementation

2011-08-23 Thread Eugene Loh

On 8/23/2011 12:32 AM, Hoang-Vu Dang wrote:

Dear all,

Where could I find detailed documentation about the algorithms that are 
used in Open MPI?


For example, I would like to answer the following questions: how is the 
MPI_Allgather operation done? What is the complexity in terms of the 
number of data sends/receives, given the number of nodes involved? What is 
the data structure behind it? The same for MPI_Allreduce, etc.


Is this information available, and how can I access it? I know one way, 
which is digging into the source code, but I hope there is an easier way 
to achieve the same goal.
I'm no expert, but I suspect that any documentation is scant, 
incomplete, and likely to be out-of-date.


Re: [OMPI users] Urgent Question regarding, MPI_ANY_SOURCE.

2011-07-15 Thread Eugene Loh
I'm going to echo what you've already heard here:  it is impossible for 
a process to receive a message that was sent to a different process.  A 
sender must specify a unique destination.  No process other than the 
destination process will see that message.


In what you write below, why do you think you are receiving a message 
that was intended for a different destination?  Maybe you can put 
together a short program that illustrates your question.


On 7/15/2011 9:49 AM, Mudassar Majeed wrote:


Yes, processes receive messages that were not sent to them. I am 
receiving the message with the following call


MPI_Recv(&recv_packet, 1, loadDatatype, MPI_ANY_SOURCE, MPI_TAG_LOAD, 
comm, &status);


and that was sent using the following call,

MPI_Ssend(&load_packet, 1, loadDatatype, rec_rank, MPI_TAG_LOAD, comm);

What problem can it have? All the parameters are correct; I have 
checked them with printf.  What I am thinking is that the receive is done 
with MPI_ANY_SOURCE, so the process is getting any message (from any 
source). What should be done so that only the message that had this 
process as its destination is captured?


Re: [OMPI users] OpenMPI vs Intel Efficiency question

2011-07-12 Thread Eugene Loh

On 7/12/2011 4:45 PM, Mohan, Ashwin wrote:

I noticed that the exact same code took 50% more time to run on OpenMPI
than Intel.
It would be good to know if that extra time is spent inside MPI calls or 
not.  There is a discussion of how you might do this here:  
http://www.open-mpi.org/faq/?category=perftools  You should probably 
start here and narrow down your investigation.


If the difference is the time spent inside MPI calls... um, that would 
be interesting.


If the difference is time spent outside MPI calls, how you are compiling 
(which serial compiler is being used, which optimization flags, etc.) 
could be the issue.  Or possibly how processes are placed on a node 
("paffinity" or "binding" issues).

Do the compiler flags have an effect on the efficiency of the
simulation?
Sure.  Ideally, most of the time is spent in parallel computation and 
very little in MPI.  For performance in such an "ideal" case, any 
"decent" MPI implementation (OMPI and Intel hopefully among them) should 
do just fine.

Will including MPICH2 increase efficiency in running simulations using
OpenMPI?
MPICH2 and OMPI are MPI implementations.  You choose one or the other 
(or other options... e.g., Intel).


Re: [OMPI users] openmpi (1.2.8 or above) and Intel composer XE 2011 (aka 12.0)

2011-05-27 Thread Eugene Loh

On 5/27/2011 4:32 AM, Jeff Squyres wrote:

On May 27, 2011, at 4:30 AM, Robert Horton wrote:

To be clear, if you explicitly list which BTLs to use, OMPI will only
(try to) use exactly those and no others.

It might be worth putting the sm btl in the FAQ:

http://www.open-mpi.org/faq/?category=openfabrics#ib-btl

Is this entry not clear enough?

http://www.open-mpi.org/faq/?category=tuning#selecting-components
I think his point is that the example in the ib-btl entry would be more 
helpful as a template for usage if it added sm.  Why point users to a 
different FAQ entry (which we don't do anyhow) when three more 
characters ",sm" makes the ib-btl entry so much more helpful.


Re: [OMPI users] configure: mpi-threads disabled by default

2011-05-04 Thread Eugene Loh
Depending on what version you use, the option has been renamed 
--enable-mpi-thread-multiple.


Anyhow, there is widespread concern whether the support is robust.  The 
support is known to be limited and the performance poor.
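
For what it's worth, the configure line would look something like one
of these, depending on the version (the install prefix is just an example):

./configure --prefix=/opt/openmpi --enable-mpi-threads            # older name
./configure --prefix=/opt/openmpi --enable-mpi-thread-multiple    # newer name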


On 5/4/2011 9:14 AM, Mark Dixon wrote:
I've been asked about mixed-mode MPI/OpenMP programming with OpenMPI, 
so have been digging through the past list messages on MPI_THREAD_*, 
etc. Interesting stuff :)


Before I go ahead and add "--enable-mpi-threads" to our standard 
configure flags, is there any reason it's disabled by default, please?


I'm a bit puzzled, as this default seems in conflict with whole "Law 
of Least Astonishment" thing. Have I missed some disaster that's going 
to happen?


Re: [OMPI users] --enable-progress-threads broken in 1.5.3?

2011-04-28 Thread Eugene Loh

CMR 2728 did this.  I think the changes are in 1.5.4.

On 4/28/2011 5:00 AM, Jeff Squyres wrote:

It is quite likely that --enable-progress-threads is broken.  I think it's even 
disabled in 1.4.x; I wonder if we should do the same in 1.5.x...


Re: [OMPI users] Problem with setting up openmpi-1.4.3

2011-04-13 Thread Eugene Loh

amosl...@gmail.com wrote:


Hi,
 I am embarrassed!  I submitted a note to the users on setting 
up openmpi-1.4.3 using SUSE-11.3 under Linux and received several 
replies.  I wanted to transfer them but they disappeared for no 
apparent reason.   I hope that those that sent me messages will be 
kind enough to repeat them and perhaps more users will add their ideas.


Search the archives.

http://www.open-mpi.org/community/lists/users/2011/03/15917.php


Re: [OMPI users] mpi problems,

2011-04-06 Thread Eugene Loh




Nehemiah Dacres wrote:

Also, I'm not sure if I'm reading the results right.  According to the
last run, did using the Sun compilers (update 1) result in higher
performance with SunCT?

On Wed, Apr 6, 2011 at 11:38 AM, Nehemiah Dacres wrote:

This first test was run as a base case to see if MPI works; the second
run is to see the speedup that using OpenIB provides.

[jian@therock ~]$ mpirun -machinefile list /opt/iba/src/mpi_apps/mpi_stress/mpi_stress
[jian@therock ~]$ mpirun -mca orte_base_help_aggregate btl,openib,self, -machinefile list /opt/iba/src/mpi_apps/mpi_stress/mpi_stress
[jian@therock ~]$ mpirun -mca orte_base_help_aggregate btl,openib,self, -machinefile list sunMpiStress

I don't think the command-line syntax for the MCA parameters is quite
right.  I suspect it should be

--mca orte_base_help_aggregate 1 --mca btl openib,self

Further, they are unnecessary.  The first is on by default and the
second is unnecessary since OMPI finds the fastest interconnect
automatically (presumably openib,self, with sm if there are on-node
processes).  Another way of setting MCA parameters is with environment
variables:

setenv OMPI_MCA_orte_base_help_aggregate 1
setenv OMPI_MCA_btl openib,self

since then you can use ompi_info to check your settings.

Anyhow, it looks like your runs are probably all using openib and I
don't know why the last one is 2x faster.  If you're testing the
interconnect, the performance should be limited by the IB (more or
less) and not by the compiler.




Re: [OMPI users] Shared Memory Performance Problem.

2011-03-30 Thread Eugene Loh




Michele Marena wrote:

I've launched my app with mpiP both when the two processes are on
different nodes and when the two processes are on the same node.

The process 0 is the manager (gathers the results only); processes 1
and 2 are workers (compute).

This is the case where processes 1 and 2 are on different nodes (runs in 162s).

@--- MPI Time (seconds) ---
Task    AppTime    MPITime     MPI%
   0        162        162    99.99
   1        162       30.2    18.66
   2        162       14.7     9.04
   *        486        207    42.56

The case when processes 1 and 2 are on the same node (runs in 260s).

@--- MPI Time (seconds) ---
Task    AppTime    MPITime     MPI%
   0        260        260    99.99
   1        260       39.7    15.29
   2        260       26.4    10.17
   *        779        326    41.82

I think there's a contention problem on the memory bus.

Right.  Process 0 spends all its time in MPI, presumably waiting on
workers.  The workers spend about the same amount of time on MPI
regardless of whether they're placed together or not.  The big
difference is that the workers are much slower in non-MPI tasks when
they're located on the same node.  The issue has little to do with
MPI.  The workers are hogging local resources and work faster when
placed on different nodes.

  
However, the message size is 4096 * sizeof(double).  Maybe I am wrong
on this point.  Is the message size too big for shared memory?

No.  That's not very large at all.

  
  
  

>>> On Mar 27, 2011, at 10:33 AM, Ralph Castain wrote:
>>>
>>> > http://www.open-mpi.org/faq/?category=perftools





Re: [OMPI users] Is there an mca parameter equivalent to -bind-to-core?

2011-03-23 Thread Eugene Loh

Gus Correa wrote:


Ralph Castain wrote:


On Mar 21, 2011, at 9:27 PM, Eugene Loh wrote:


Gustavo Correa wrote:


Dear OpenMPI Pros

Is there an MCA parameter that would do the same as the mpiexec 
switch '-bind-to-core'?

I.e., something that I could set up not in the mpiexec command line,
but for the whole cluster, or for an user, etc.

In the past I used '-mca mpi mpi_paffinity_alone=1'.




Must be a typo here - the correct command is '-mca 
mpi_paffinity_alone 1'



But that was before '-bind-to-core' came along.
However, my recollection of some recent discussions here in the list
is that the latter would not do the same as '-bind-to-core',
and that the recommendation was to use '-bind-to-core' in the 
mpiexec command line.




Just to be clear: mpi_paffinity_alone=1 still works and will cause 
the same behavior as bind-to-core.




A little awkward, but how about

--bycore          rmaps_base_schedule_policy  core
--bysocket        rmaps_base_schedule_policy  socket
--bind-to-core    orte_process_binding        core
--bind-to-socket  orte_process_binding        socket
--bind-to-none    orte_process_binding        none





Thank you Ralph and Eugene

Ralph, forgive me the typo in the previous message, please.
Equal sign inside the openmpi-mca-params.conf file,
but no equal sign on the mpiexec command line, right?

I am using OpenMPI 1.4.3
I inserted the line
"mpi_paffinity_alone = 1"
in my openmpi-mca-params.conf file, following Ralph's suggestion
that it is equivalent to '-bind-to-core'.

However, now when I do "ompi_info -a",
the output shows the non-default value 1 twice in a row,
then later it shows the default value 0 again!
Please see the output enclosed below.

I am confused.

1) Is this just a glitch in ompi_info,
or did mpi_paffinity_alone get reverted to zero?

2) How can I increase the verbosity level to make sure I have processor
affinity set (i.e. that the processes are bound to cores/processors)?


Just a quick answer on 2).  The FAQ 
http://www.open-mpi.org/faq/?category=tuning#using-paffinity-v1.4 (or 
"man mpirun" or "mpirun --help") mentions --report-bindings.


If this is on a Linux system with numactl, you can also try "mpirun ... 
numactl --show".
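
For example (process count and program name are placeholders):

mpirun -np 4 --report-bindings numactl --show
mpirun -np 4 --report-bindings ./my_program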



##

ompi_info -a

...

  MCA mpi: parameter "mpi_paffinity_alone" (current 
value: "1", data source: file 
[/home/soft/openmpi/1.4.3/gnu-intel/etc/openmpi-mca-params.conf], 
synonym of: opal_paffinity_alone)
   If nonzero, assume that this job is the 
only (set of) process(es) running on each node and bind processes to 
processors, starting with processor ID 0


  MCA mpi: parameter "mpi_paffinity_alone" (current 
value: "1", data source: file 
[/home/soft/openmpi/1.4.3/gnu-intel/etc/openmpi-mca-params.conf], 
synonym of: opal_paffinity_alone)
   If nonzero, assume that this job is the 
only (set of) process(es) running on each node and bind processes to 
processors, starting with processor ID 0


...

[ ... and after 'mpi_leave_pinned_pipeline' ...]

  MCA mpi: parameter "mpi_paffinity_alone" (current 
value: "0", data source: default value)
   If nonzero, assume that this job is the 
only (set of) process(es) running on each node and bind processes to 
processors, starting with processor ID 0


...





Re: [OMPI users] Is there an mca parameter equivalent to -bind-to-core?

2011-03-21 Thread Eugene Loh

Gustavo Correa wrote:


Dear OpenMPI Pros

Is there an MCA parameter that would do the same as the mpiexec switch 
'-bind-to-core'?
I.e., something that I could set up not in the mpiexec command line,
but for the whole cluster, or for an user, etc.

In the past I used '-mca mpi mpi_paffinity_alone=1'.
But that was before '-bind-to-core' came along.
However, my recollection of some recent discussions here in the list
is that the latter would not do the same as '-bind-to-core',
and that the recommendation was to use '-bind-to-core' in the mpiexec command 
line.


A little awkward, but how about

 --bycore          rmaps_base_schedule_policy  core
 --bysocket        rmaps_base_schedule_policy  socket
 --bind-to-core    orte_process_binding        core
 --bind-to-socket  orte_process_binding        socket
 --bind-to-none    orte_process_binding        none



Re: [OMPI users] multi-threaded programming

2011-03-08 Thread Eugene Loh




Durga Choudhury wrote:

  A follow-up question (and pardon if this sounds stupid) is this:

If I want to make my process multithreaded, BUT only one thread has
anything to do with MPI (for example, using OpenMP inside MPI), then
the results will be correct EVEN IF #1 or #2 of Eugene holds true. Is
this correct?
  


I believe this is thoroughly covered by the standard (though I suppose
the same could have been said about my question).

In any case, for your situation, initialize MPI with
MPI_Init_thread().  Ask for thread level MPI_THREAD_FUNNELED and check
that that level is provided.  That should cover your case.  See the man
page for MPI_Init_thread().  My question should not have anything to do
with your case.
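
A minimal sketch of that initialization; error handling is kept
deliberately crude:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
  if ( provided < MPI_THREAD_FUNNELED ) {
    fprintf(stderr, "MPI_THREAD_FUNNELED not available (got %d)\n", provided);
    MPI_Abort(MPI_COMM_WORLD, 1);
  }
  /* Only the thread that called MPI_Init_thread() makes MPI calls;
     the OpenMP worker threads stay out of MPI entirely. */
  MPI_Finalize();
  return 0;
}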

  On Tue, Mar 8, 2011 at 12:34 PM, Eugene Loh  wrote:
  
  
Let's say you have multi-threaded MPI processes, you request
MPI_THREAD_MULTIPLE and get MPI_THREAD_MULTIPLE, and you use the self,sm,tcp
BTLs (which have some degree of threading support).  Is it okay to have an
[MPI_Isend|MPI_Irecv] on one thread be completed by an MPI_Wait on another
thread?  I'm assuming some sort of synchronization and memory barrier/flush
in between to protect against funny race conditions.

If it makes things any easier on you, we can do this multiple-choice style:

1)  Forbidden by the MPI standard.
2)  Not forbidden by the MPI standard, but will not work with OMPI (not even
with the BTLs that claim to be multi-threaded).
3)  Works well with OMPI (provided you use a BTL that's multi-threaded).

It's looking like #2 to me, but I'm not sure.
  






[OMPI users] multi-threaded programming

2011-03-08 Thread Eugene Loh
Let's say you have multi-threaded MPI processes, you request 
MPI_THREAD_MULTIPLE and get MPI_THREAD_MULTIPLE, and you use the 
self,sm,tcp BTLs (which have some degree of threading support).  Is it 
okay to have an [MPI_Isend|MPI_Irecv] on one thread be completed by an 
MPI_Wait on another thread?  I'm assuming some sort of synchronization 
and memory barrier/flush in between to protect against funny race 
conditions.


If it makes things any easier on you, we can do this multiple-choice style:

1)  Forbidden by the MPI standard.
2)  Not forbidden by the MPI standard, but will not work with OMPI (not 
even with the BTLs that claim to be multi-threaded).

3)  Works well with OMPI (provided you use a BTL that's multi-threaded).

It's looking like #2 to me, but I'm not sure.


Re: [OMPI users] using MPI through Qt

2011-03-01 Thread Eugene Loh

Eye RCS 51 wrote:


Hi,

In an effort to make a Qt gui using MPI, I have the following:

1. Gui started in master node.

2. In Gui, through a pushbutton, a global variable x is assigned some 
value; let say, x=1000;


3. I want this value to be know to all nodes. So I used broadcast in 
the function assigning it on the master node and all other nodes.


4. I printed values of x, which prints all 1000 in all nodes.

5. Now control has reached to MPI_Finalize in all nodes except master.

Now If I want to reassign value of x using pushbutton in master node 
and again broadcast to and print in all nodes, can it be done??


Not with MPI if MPI_Finalize has been called.

I mean, can I have an MPI function which through GUI is called many 
times and assigns and prints WHILE program is running.


You can call an MPI function like MPI_Bcast many times.  E.g.,

MPI_Init();
MPI_Comm_rank(...,&myrank);
while (...) {
  if ( myrank == MASTER ) x = ...;
  MPI_Bcast(&x,...);
}
MPI_Finalize();

There are many helpful MPI tutorials that can be found on the internet.



OR simply can I have a print function which is printing noderank value 
in all nodes whenever pushbutton is pressed while program is running.


command i used is "mpirun -np 3 ./a.out".




Re: [OMPI users] What's wrong with this code?

2011-02-23 Thread Eugene Loh




Prentice Bisbal wrote:

  Jeff Squyres wrote:
  
  Can you put together a small example that
shows the problem...
  
  Jeff,

Thanks for requesting that. As I was looking at the oringinal code to
write a small test program, I found the source of the error. Doesn't it
aways work that way.
  

Not always, but often enough.

George Polya was a mathematician who spent much time trying to
understand problem-solving techniques, leading for example to his book,
"How to Solve It."  It contains the line:

"If there is a problem you can’t solve, then there is an easier problem
you can solve: find it."

For most of my life, I remembered that line incorrectly, but I prefer
my version:

"If there is a problem you can’t solve, then there is an easier problem
you _can't_ solve: find it."

When we try to produce the simplest version of the problem we can't
solve, we often solve the problem.




Re: [OMPI users] Calculate time spent on non blocking communication?

2011-02-03 Thread Eugene Loh




Okay, so forget about Peruse.

You can basically figure that your user process will either be inside
an MPI call or else not.  If it's inside an MPI call, then that's time
spent in communications (and notably in the synchronization that's
implicit to communication).  If it's not inside an MPI call, then
that's time spent in computation.  Basically, no time in this model is
attributed to both communication and computation at once.

There is an OMPI FAQ on performance tools. 
http://www.open-mpi.org/faq/?category=perftools  Perhaps something
there will be helpful for you.  Specifically, the "Sun Studio
Performance Analyzer" allows you to divide that "communication" time
between "data transfer time" and "synchronization time".  But a basic
classification as either communication or else computation is pretty
central to all the tools.
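
If you just want a crude split along those lines without any extra
tools, you can bracket your own MPI calls with MPI_Wtime().  A minimal
sketch; the loop, compute_step(), and the reduction variables are
made-up placeholders for whatever your program actually does:

double t0, t_mpi = 0.0, t_total;

t_total = MPI_Wtime();
while ( work_remains ) {              /* your existing iteration loop */
  compute_step();                     /* computation: not counted as MPI */
  t0 = MPI_Wtime();
  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  t_mpi += MPI_Wtime() - t0;          /* communication plus synchronization */
}
t_total = MPI_Wtime() - t_total;
printf("fraction of time in MPI: %g\n", t_mpi / t_total);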

Bibrak Qamar wrote:

As for the reason for calculating the time of non-blocking
communication this way: the main reason is that I want to see what
percentage of the program's time is consumed by communication alone,
by computation alone, and by the intersection of both.

On Thu, Feb 3, 2011 at 5:08 AM, Eugene Loh <eugene@oracle.com> wrote:

Again, you can try the Peruse instrumentation.  Configure OMPI with
--enable-peruse.  The instrumentation points might help you decide how
you want to define the time you want to measure.  Again, you really
have to spend a bunch of your own time deciding what is meaningful to
measure.

Gustavo Correa wrote:

However, OpenMPI may give this info, with non-MPI (hence non-portable)
functions, I'd guess.

From: Eugene Loh <eugene@oracle.com>

Anyhow, the Peruse instrumentation in OMPI might help.





Re: [OMPI users] Calculate time spent on non blocking communication?

2011-02-02 Thread Eugene Loh
Again, you can try the Peruse instrumentation.  Configure OMPI with 
--enable-peruse.  The instrumentation points might help you decide how 
you want to define the time you want to measure.  Again, you really have 
to spend a bunch of your own time deciding what is meaningful to measure.


Gustavo Correa wrote:

However, OpenMPI may give this info, with non-MPI (hence non-portable) 
functions, I'd guess.



From: Eugene Loh 

Anyhow, the Peruse instrumentation in OMPI might help.





Re: [OMPI users] heterogenous cluster

2011-02-02 Thread Eugene Loh

jody wrote:


Thanks for your reply.

If i try your suggestion, every process fails with the following message:

*** The MPI_Init() function was called before MPI_INIT was invoked.
 

That's a funny error message.  If you search the OMPI users mail list 
archives, this message shows up, but I didn't spend long enough to see 
if those e-mails are actually relevant to your inquiry or whether they 
help.  http://www.open-mpi.org/community/lists/users/



*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[aim-triops:15460] Abort before MPI_INIT completed successfully; not
able to guarantee that all other processes were killed!

I think this is caused by the fact that on the 64Bit machine Open MPI
is also built as a 64 bit application.
How can i force OpenMPI to be built as a 32Bit application on a 64Bit machine?
 





Re: [OMPI users] Calculate time spent on non blocking communication?

2011-02-02 Thread Eugene Loh




Bibrak Qamar wrote:

Gus Correa, but it will include the time of computation which took
place before waitAll().

What's wrong with that?

From: Bibrak Qamar <bibr...@gmail.com>

I am using non-blocking send and receive, and I want to calculate the
time it took for the communication.

From: Eugene Loh <eugene@oracle.com>

You probably have to start by defining what you mean by "the time it
took for the communication".  Anyhow, the Peruse instrumentation in
OMPI might help.

Again, you should probably start by thinking more precisely about what
time you want to measure.  E.g., ask yourself why the answer even
matters.




Re: [OMPI users] Calculate time spent on non blocking communication?

2011-02-01 Thread Eugene Loh

Bibrak Qamar wrote:


Hello All,

I am using non-blocking send and receive, and i want to calculate the 
time it took for the communication. Is there any method or a way to do 
this using openmpi.


You probably have to start by defining what you mean by "the time it 
took for the communication".  Anyhow, the Peruse instrumentation in OMPI 
might help.


Re: [OMPI users] maximising bandwidth

2011-01-31 Thread Eugene Loh




David Zhang wrote:

Blocking send/recv, as the name suggests, stops processing in your
master and slave code until the data is received on the slave side.

Just to clarify...

If you use point-to-point send and receive calls, you can make the
block/nonblock decision independently on the send and receive sides. 
E.g., use blocking send and nonblocking receive.  Or nonblocking send
and blocking receive.  You get the idea.

Blocking on the send side only means blocking until the message has
left the user's buffer on the send side.  It does not guarantee that
the data has been received on the other end.

I agree with Bill that performance portability is an issue.  That is,
the MPI standard itself doesn't really provide any guarantees here
about what is fastest.  Perhaps polling this mailing list will be
helpful, but if you are looking for "the fastest" solution regardless
of which MPI implementation you use (and which interconnect you use...
which might be determined at run time) you will probably be
disappointed.

Using a collective call like MPI_Gather may be worthwhile, but it
doesn't deploy additional threads, and additional threads could indeed
help in certain cases.

In addition to MPI implementation and which interconnect (or BTL) one
uses, another important variable is message length.  Short messages may
be sent "eagerly" while long messages may involve more synchronization
between master and slaves.
Nonblocking send/recv wouldn't stop, instead you must
check the status on the slave side to see if data has been sent.
Yes and no.  Again, data can be sent from the master but not yet
received by the slave (if the MPI implementation buffers the data
somewhere in-between).
Nonblocking is faster on the master side because the
master doesn't need to wait for the slave to receive the data to
continue.

???  For most sends, the master has to wait only on the data to leave
the user send buffer.
So when you say you want your master to send "as fast as
possible", I suppose you meant get back to running your code as soon as
possible.  In that case you would want nonblocking.  However when you
say you want the slaves to receive data faster, it seems you're
implying the actual data transmission across the network.  I believe
the data transmission speed is not dependent on whether it is
blocking or nonblocking.
  
On Sun, Jan 30, 2011 at 11:09 AM, Toon Knapen wrote:

Hi all,

If I have a master process that needs to send a chunk of (different)
data to each of my N slave processes as fast as possible, would I
receive the chunk in each of the slaves faster if the master launched
N threads each doing a blocking send, or would it be better to launch
N nonblocking sends in the master?

I'm currently using OpenMPI on ethernet, but might the approach be
different with different types of networks?

Thanks in advance,
  






Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Eugene Loh




Gilbert Grosdidier wrote:
Any other suggestion ?
Can any more information be extracted from profiling?  Here is where I
think things left off:

Eugene Loh wrote:

  
  
Gilbert Grosdidier wrote:

#                       [time]       [calls]    <%mpi>  <%wall>
# MPI_Waitall           741683   7.91081e+07     77.96    21.58
# MPI_Allreduce         114057   2.53665e+07     11.99     3.32
# MPI_Isend            27420.6   6.53513e+08      2.88     0.80
# MPI_Irecv            464.616   6.53513e+08      0.05     0.01
###

It seems to my non-expert eye that MPI_Waitall is dominant among MPI
calls, but not for the overall application.

Looks like on average each MPI_Waitall call is completing 8+ MPI_Isend
calls and 8+ MPI_Irecv calls.  I think IPM gives some point-to-point
messaging information.  Maybe you can tell what the distribution is of
message sizes, etc.  Or, maybe you already know the characteristic
pattern.  Does a stand-alone message-passing test (without the
computational portion) capture the performance problem you're looking
for?

Do you know message lengths and patterns?  Can you confirm whether
non-MPI time is the same between good and bad runs?




Re: [OMPI users] mpirun --nice 10 prog ??

2011-01-07 Thread Eugene Loh




David Mathog wrote:

  Ralph Castain wrote:
  
  
Afraid not - though you could alias your program name to be "nice --10 prog"

  
  Is there an OMPI wish list?  If so, can we please add to it "a method
to tell mpirun  what nice values to use when it starts programs on
nodes"?  Minimally, something like this:

  --nice  12   #nice value used on all nodes
  --mnice 5#nice value for master (first) node
  --wnice 10   #nice value for worker (worker) nodes

For my purposes that would be enough, as the only distinction is
master/worker.  For more complex environments more flexibility might be
desired, for instance, in a large cluster, where a subset of nodes
integrate data from worker subsets, effectively acting as "local masters".

Obviously for platforms without nice mpirun would try to use whatever
priority scheme was available, and failing that, just run the program as
it does now.

Or are we the only site where quick high priority jobs must run on the
same nodes where long term low priority jobs are also running?
  

I'm guessing people might have all sorts of ideas about how they would
want to solve "a problem like this one".

One is to forbid MPI jobs from competing for the same resources.  The
assumption that an MPI process has dedicated use of its resources is
somewhat ingrained into OMPI.

Checkpoint/restart:  if a higher-priority job comes along, kick the
lower-priority job off.

Yield.  This issue comes up often on these lists.  That is, don't just
set process priorities high or low, but make them more aggressive when
they're doing useful work and more passive when they're waiting idly.




Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores: very poor performance

2010-12-22 Thread Eugene Loh




Gilbert Grosdidier wrote:

  
Good evening, Eugene,

Good morning where I am.

Here follows some output for a 1024-core run.

Assuming this corresponds meaningfully with your original e-mail, 1024
cores means performance of 700 vs 900.  So, that looks roughly
consistent with the 28% MPI time you show here.  That seems to imply
that the slowdown is due entirely to long MPI times (rather than slow
non-MPI times).  Just a sanity check.

Unfortunately, I'm yet unable to have the equivalent MPT chart.

That may be all right.  If one run clearly shows a problem (which is
perhaps the case here), then a "good profile" is not needed.  Here, a
"good profile" would perhaps be used only to confirm that near-zero MPI
time is possible.
#IPMv0.983
# host    : r34i0n0/x86_64_Linux   mpi_tasks : 1024 on 128 nodes
# start   : 12/21/10/13:18:09      wallclock : 3357.308618 sec
# stop    : 12/21/10/14:14:06      %comm     : 27.67
##
#
#                [total]      <avg>       min       max
# wallclock   3.43754e+06    3356.98   3356.83   3357.31
# user        2.82831e+06    2762.02   2622.04   2923.37
# system           376230    367.412   174.603   492.919
# mpi              951328    929.031   633.137   1052.86
# %comm                      27.6719   18.8601    31.363

No glaring evidence here of load imbalance being the sole explanation,
but hard to tell from these numbers.  (If min comm time is 0%, then
that process is presumably holding everyone else up.)

#                       [time]       [calls]    <%mpi>  <%wall>
# MPI_Waitall           741683   7.91081e+07     77.96    21.58
# MPI_Allreduce         114057   2.53665e+07     11.99     3.32
# MPI_Isend            27420.6   6.53513e+08      2.88     0.80
# MPI_Irecv            464.616   6.53513e+08      0.05     0.01
###

It seems to my non-expert eye that MPI_Waitall is dominant among MPI
calls, but not for the overall application.

If at 1024 cores, performance is 700 compared to 900, then whatever the
problem is still hasn't dominated the entire application performance. 
So, it looks like MPI_Waitall is the problem, even if it doesn't
dominate overall application time.

Looks like on average each MPI_Waitall call is completing 8+ MPI_Isend
calls and 8+ MPI_Irecv calls.  I think IPM gives some point-to-point
messaging information.  Maybe you can tell what the distribution is of
message sizes, etc.  Or, maybe you already know the characteristic
pattern.  Does a stand-alone message-passing test (without the
computational portion) capture the performance problem you're looking
for?
On 22/12/2010 18:50, Eugene Loh wrote:

Can you isolate a bit more where the time is being spent?  The performance
effect you're describing appears to be drastic.  Have you profiled the
code?  Some choices of tools can be found in the FAQ http://www.open-mpi.org/faq/?category=perftools 
The results may be "uninteresting" (all time spent in your MPI_Waitall
calls, for example), but it'd be good to rule out other possibilities
(e.g., I've seen cases where it's the non-MPI time that's the culprit).


If all the time is spent in MPI_Waitall, then I wonder if it would be
possible for you to reproduce the problem with just some
MPI_Isend|Irecv|Waitall calls that mimic your program.  E.g., "lots of
short messages", or "lots of long messages", etc.  It sounds like there
is some repeated set of MPI exchanges, so maybe that set can be
extracted and run without the complexities of the application. 





Re: [OMPI users] Open MPI vs IBM MPI performance help

2010-12-22 Thread Eugene Loh

I'm curious if that resolved the issue.

David Singleton wrote:


http://www.open-mpi.org/faq/?category=running#oversubscribing

On 12/03/2010 06:25 AM, Price, Brian M (N-KCI) wrote:

Additional testing seems to show that the problem is related to 
barriers and how often they poll to determine whether or not it's 
time to leave.  Is there some MCA parameter or environment variable 
that allows me to control the frequency of polling while in barriers?


From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] 
On Behalf Of Price, Brian M (N-KCI)

Sent: Wednesday, December 01, 2010 11:29 AM
To: Open MPI Users
Cc: Stern, Craig J
Subject: EXTERNAL: [OMPI users] Open MPI vs IBM MPI performance help

OpenMPI version: 1.4.3
Platform: IBM P5, 32 processors, 256 GB memory, Symmetric 
Multi-Threading (SMT) enabled
Application: starts up 48 processes and does MPI using MPI_Barrier, 
MPI_Get, MPI_Put (lots of transfers, large amounts of data)
Issue:  When implemented using Open MPI vs. IBM's MPI ('poe' from HPC 
Toolkit), the application runs 3-5 times slower.
I suspect that IBM's MPI implementation must take advantage of some 
knowledge that it has about data transfers that Open MPI is not 
taking advantage of.

Any suggestions?




Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores: very poor performance

2010-12-22 Thread Eugene Loh
Can you isolate a bit more where the time is being spent?  The 
performance effect you're describing appears to be drastic.  Have you 
profiled the code?  Some choices of tools can be found in the FAQ 
http://www.open-mpi.org/faq/?category=perftools  The results may be 
"uninteresting" (all time spent in your MPI_Waitall calls, for example), 
but it'd be good to rule out other possibilities (e.g., I've seen cases 
where it's the non-MPI time that's the culprit).


If all the time is spent in MPI_Waitall, then I wonder if it would be 
possible for you to reproduce the problem with just some 
MPI_Isend|Irecv|Waitall calls that mimic your program.  E.g., "lots of 
short messages", or "lots of long messages", etc.  It sounds like there 
is some repeated set of MPI exchanges, so maybe that set can be 
extracted and run without the complexities of the application.


Anyhow, some profiling might help guide one to the problem.

Gilbert Grosdidier wrote:


There are indeed a high rate of communications. But the buffer
size is always the same for a given pair of processes, and I thought
that mpi_leave_pinned should avoid freeing the memory in this case.
Am I wrong ?




Re: [OMPI users] difference between single and double precision

2010-12-16 Thread Eugene Loh




Jeff Squyres wrote:

  On Dec 16, 2010, at 5:14 AM, Mathieu Gontier wrote:
  
  
We have lead some tests and the option btl_sm_eager_limit has a positive consequence on the performance. Eugene, thank you for your links.

  
  Good!
Just be aware of the tradeoff you're making: space for time.
  
  
Now, to offer a good support to our users, we would like to get the value of this parameters at the runtime. I am aware I can have the value running ompi_info like following:
ompi_info --param btl all | grep btl_sm_eager_limit

but can I get the value during the computation when I run mpirun -np 12 --mca btl_sm_eager_limit 8192 my_binary? This value could be compared with the buffer size into my code and some warning put into the output.

  
  We don't currently have a user-exposed method of retrieving MCA parameter values.  As you noted in your 2nd email, if the value was set by setting an environment variable, then you can just getenv() it.  But if the value was set some other way (e.g., via a file), it won't necessarily be loaded in the environment.
  

If you are desperate to get this value, I suppose you could run
empirical tests within your application.  This would be a little ugly,
but could work well enough if you are desperate enough.
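
For the environment-variable case, the check is just a getenv().  A
minimal sketch; it only works if the value really was set through the
environment (OMPI_MCA_btl_sm_eager_limit), not through a file:

#include <stdio.h>
#include <stdlib.h>

void check_eager_limit(void) {
  const char *s = getenv("OMPI_MCA_btl_sm_eager_limit");
  if ( s != NULL )
    printf("btl_sm_eager_limit = %s\n", s);
  else
    printf("btl_sm_eager_limit was not set via the environment\n");
}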




Re: [OMPI users] MPI_Bcast vs. per worker MPI_Send?

2010-12-14 Thread Eugene Loh

David Mathog wrote:


For the receive I do not see how to use a collective.  Each worker sends
back a data structure, and the structures are of of varying size.  This
is almost always the case in Bioinformatics, where what is usually
coming back from each worker is a count M of the number of significant
results, M x (fixed size data per result: scores and the like), and M x
sequences or sequence alignments.  M runs from 0 to Z, where in
pathological cases, Z is a very large number, and the size of the
sequences or alignments returned also varies.
 


A collective call might not make sense in this case.

Arguably, each process could first send a size message (how much stuff 
is coming) and then the actual data.  In this case, you could do an 
MPI_Gather, master could allocate space, and then you do an MPI_Gatherv.
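
A sketch of that two-phase pattern; myresult is whatever buffer holds
this rank's results, mycount its length, and error checks and free()
calls are omitted (assume <stdlib.h> and <mpi.h>):

int np, myrank, mycount;            /* set mycount to this rank's result size */
int *counts = NULL, *displs = NULL;
double *allresults = NULL;

MPI_Comm_size(MPI_COMM_WORLD, &np);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if ( myrank == 0 ) counts = malloc(np * sizeof(int));

/* phase 1: master learns how much each worker will send */
MPI_Gather(&mycount, 1, MPI_INT, counts, 1, MPI_INT, 0, MPI_COMM_WORLD);

if ( myrank == 0 ) {
  int i, total = 0;
  displs = malloc(np * sizeof(int));
  for ( i = 0; i < np; i++ ) { displs[i] = total; total += counts[i]; }
  allresults = malloc(total * sizeof(double));
}

/* phase 2: variable-sized results land at the right offsets */
MPI_Gatherv(myresult, mycount, MPI_DOUBLE,
            allresults, counts, displs, MPI_DOUBLE, 0, MPI_COMM_WORLD);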


But it may make more sense for you to stick to your point-to-point 
implementation.  It may allow the master to operate with a smaller 
footprint and it may allow first finishers to send their results back 
earlier without everyone waiting for laggards.


Re: [OMPI users] curious behavior during wait for broadcast: 100% cpu

2010-12-14 Thread Eugene Loh

David Mathog wrote:

Is there a tool in openmpi that will reveal how much "spin time" the 
processes are using?


I don't know what sort of answer is helpful for you, but I'll describe 
one option.


With Oracle Message Passing Toolkit (formerly Sun ClusterTools, anyhow, 
an OMPI distribution from Oracle/Sun) and Oracle Solaris Studio 
Performance Analyzer (formerly Sun Studio Performance Analyzer) you can 
see how much time is spent in MPI work, MPI wait, and so on.  
Specifically, by process, you could see (I'm making an example up) that 
process 2 spent:

* 35% of its time in application-level computation
* 5% of its time in MPI moving data
* 60% of its time in MPI waiting
but process 7 spent:
* 90% of its time in application-level computation
* 5% of its time in MPI moving data
* only 5% of its time in MPI waiting
That is, beyond the usual profiling support you might find in other 
tools, with Performance Analyzer you can distinguish time spent in MPI 
moving data from time spent in MPI waiting.


On the other hand, you perhaps don't need that much detail.  For your 
purposes, it may suffice just to know how much time each process is 
spending in MPI.  There are various profiling tools that will give you 
that.  See http://www.open-mpi.org/faq/?category=perftools  Load 
balancing is a common problem people investigate with such tools.


Finally, if you want to stick to tools like top, maybe another 
alternative is to get your application to go into sleep waits.  I can't 
say this is the best choice, but it could be fun/interesting.  Let's say 
your application only calls a handful of different MPI functions.  Write 
PMPI wrappers for them that convert blocking functions 
(MPI_Send/MPI_Recv) to non-blocking ones mixed with short sleep calls.  
Not pretty, but might just be doable for your case.  I don't know.  
Anyhow, that might make MPI wait time detectable with tools like top.
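
A sketch of such a wrapper for MPI_Recv; the sleep interval is
arbitrary and error handling is omitted:

#include <time.h>
#include <mpi.h>

int MPI_Recv(void *buf, int count, MPI_Datatype type, int src, int tag,
             MPI_Comm comm, MPI_Status *status) {
  MPI_Request req;
  int flag = 0;
  struct timespec nap = { 0, 100000 };   /* 0.1 ms; tune to taste */

  PMPI_Irecv(buf, count, type, src, tag, comm, &req);
  PMPI_Test(&req, &flag, status);
  while ( !flag ) {
    nanosleep(&nap, NULL);               /* give the CPU away briefly */
    PMPI_Test(&req, &flag, status);
  }
  return MPI_SUCCESS;
}

Link this ahead of the MPI library and top should start showing idle
time instead of spin time, at the cost of some latency.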


Re: [OMPI users] MPI_Bcast vs. per worker MPI_Send?

2010-12-13 Thread Eugene Loh

David Mathog wrote:


Is there a rule of thumb for when it is best to contact N workers with
MPI_Bcast vs. when it is best to use a loop which cycles N times and
moves the same information with MPI_Send to one worker at a time?
 

The rule of thumb is to use a collective whenever you can.  The 
rationale is that the programming should be easier/cleaner and the 
underlying MPI implementation has the opportunity to do something clever.
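
For example, the hand-rolled loop and the single collective below
distribute the same buffer; root, tag, count, and buf are placeholders:

if ( myrank == root ) {
  for ( i = 0; i < nprocs; i++ )
    if ( i != root )
      MPI_Send(buf, count, MPI_DOUBLE, i, tag, MPI_COMM_WORLD);
} else {
  MPI_Recv(buf, count, MPI_DOUBLE, root, tag, MPI_COMM_WORLD,
           MPI_STATUS_IGNORE);
}

/* versus */

MPI_Bcast(buf, count, MPI_DOUBLE, root, MPI_COMM_WORLD);

The collective version is shorter and leaves the implementation free to
use a tree or a pipeline instead of serializing everything through the
root.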



For that matter, other than the coding semantics, is there any real
difference between the two approaches?  That is, does MPI_Bcast really
broadcast, daisy chain, or use other similar methods to reduce bandwidth
use when distributing its message, or does it just go ahead and run
MPI_Send in a loop anyway, but hide the details from the programmer?

I believe most MPI implementations, including OMPI, make an attempt to 
"do the right thing".  Multiple algorithms are available and the best 
one is chosen based on run-time conditions.


With any luck, you're better off with collective calls.  Of course, 
there are no guarantees.


Re: [OMPI users] How to check if Send was made or not before performing a recv

2010-12-12 Thread Eugene Loh

Alaukik Aggarwal wrote:


Thanks for your reply. I used this to solve the problem.

But I think there should be an in-built construct for this.
 

What would such a construct look like?  If you need information from the 
remote processes, they need to send messages (in the two-sided model).  
If you want to time out after a while, you can have MPI_Iprobe() checks 
for in-coming messages and then give up after some period of time.  I 
just don't know what you'd be looking for.
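
If you did want the timeout approach, a sketch would look something
like this; timeout is whatever grace period you pick:

int flag = 0;
MPI_Status status;
double deadline = MPI_Wtime() + timeout;

while ( MPI_Wtime() < deadline ) {
  MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
  if ( flag ) {
    /* a message from status.MPI_SOURCE is pending; post the matching MPI_Recv */
    break;
  }
}

But note the caveat: a process that finished without sending anything
simply leaves you waiting until the deadline.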


If you have concrete ideas you would really want, you should address 
them to the MPI Forum, which is in charge of defining MPI calls.



On Sat, Dec 11, 2010 at 10:28 AM, Eugene Loh  wrote:
 


Alaukik Aggarwal wrote:
   


I am using Open MPI 1.4.3.

I have to perform a receive operation from processes that are sending
data. It might happen that some of the processes don't send data
(might have completed in-fact).

So, how do I perform check on which processes to receive data from and
which processes to skip?

[code]
if(id != master)
 MPI::COMM_WORLD.Send(&dist, NUM_VERTEX, MPI_LONG, master, 1234);
if(id == master)
{
 for(int eachId = 1; eachId 



Re: [OMPI users] How to check if Send was made or not before performing a recv

2010-12-11 Thread Eugene Loh

Alaukik Aggarwal wrote:


Hi,

I am using Open MPI 1.4.3.

I have to perform a receive operation from processes that are sending
data. It might happen that some of the processes don't send data
(might have completed in-fact).

So, how do I perform check on which processes to receive data from and
which processes to skip?

[code]
if(id != master)
   MPI::COMM_WORLD.Send(&dist, NUM_VERTEX, MPI_LONG, master, 1234);
if(id == master)
{
   for(int eachId = 1; eachId 

One option is to have each non-master process send a "forget about me" 
message.  In practice, what this means is that every non-master process 
does, in fact, send a message, with that message either containing data 
or an indication that there is no data to send.


Re: [OMPI users] Guaranteed run rank 0 on a given machine?

2010-12-10 Thread Eugene Loh

David Mathog wrote:


Also, in my limited testing --host and -hostfile seem to be mutually
exclusive.

No.  You can use both together.  Indeed, the mpirun man page even has 
examples of this (though personally, I don't see having a use for 
this).  I think the idea was you might use a hostfile to define the 
nodes in your cluster and an mpirun command line that uses --host to 
select specific nodes from the file.



That is reasonable, but it isn't clear that it is intended.
Example, with a hostfile containing one entry for "monkey02.cluster
slots=1":

mpirun  --host monkey01   --mca plm_rsh_agent rsh  hostname
monkey01.cluster
 


Okay.


mpirun  --host monkey02   --mca plm_rsh_agent rsh  hostname
monkey02.cluster
 


Okay.


mpirun  -hostfile /usr/common/etc/openmpi.machines.test1 \
  --mca plm_rsh_agent rsh  hostname
monkey02.cluster
 


Okay.


mpirun  --host monkey01  \
 -hostfile /usr/commom/etc/openmpi.machines.test1 \
 --mca plm_rsh_agent rsh  hostname
--
There are no allocated resources for the application 
 hostname

that match the requested mapping:
 

Verify that you have mapped the allocated resources properly using the 
--host or --hostfile specification.

--
 

Right.  Your hostfile has monkey02.  On the command line, you specify 
monkey01, but that's not in your hostfile.  That's a problem.  Just like 
on the mpirun man page.


Re: [OMPI users] Method for worker to determine its "rank" on a single machine?

2010-12-10 Thread Eugene Loh




Terry Dontje wrote:

  
On 12/10/2010 09:19 AM, Richard Treumann wrote:

It seems to me the MPI_Get_processor_name description is too ambiguous
to make this 100% portable.  I assume most MPI implementations simply
use the hostname, so all processes on the same host will return the
same string.  The suggestion would work then.

However, it would also be reasonable for an MPI that did processor
binding to return "hostname.socket#.core#" so every rank would have a
unique processor name.

Fair enough.  However, I think it is a lot more stable than grabbing
information from the bowels of the runtime environment.  Of course one
could just call the appropriate system call to get the hostname, if you
are on the right type of OS/Architecture :-).

The extension idea is a bit at odds with the idea that MPI is an
architecture-independent API.  That does not rule out the option if
there is a good use case, but it does raise the bar just a bit.

Yeah, that is kind of the rub, isn't it.  There are enough architectural
differences out there that it might be difficult to come to an agreement
on the elements of locality you should focus on.  It would be nice if
there were some sort of distance value assigned to each peer a process
has.  Of course, then you still have the problem of trying to figure out
what distance you really want to base your grouping on.

Similar issues within a node (e.g., hwloc, shared caches, sockets,
boards, etc.) as outside a node (same/different hosts, number of switch
hops, number of torus hops, etc.).  Lots of potential complexity, but
the main difference inside/outside a node is that nodal boundaries
present "hard" process-migration boundaries.




Re: [OMPI users] curious behavior during wait for broadcast: 100% cpu

2010-12-08 Thread Eugene Loh

Ralph Castain wrote:


I know we have said this many times - OMPI made a design decision to poll hard 
while waiting for messages to arrive to minimize latency.

If you want to decrease cpu usage, you can use the yield_when_idle option (it 
will cost you some latency, though) - see ompi_info --param ompi all
 

I wouldn't mind some clarification here.  Would CPU usage really 
decrease, or would other processes simply have an easier time getting 
cycles?  My impression of yield was that if there were no one to yield 
to, the "yielding" process would still go hard.  Conversely, turning on 
"yield" would still show 100% cpu, but it would be easier for other 
processes to get time.



Or don't set affinity and we won't be as aggressive - but you'll lose some 
performance

Choice is yours! :-)
 



Re: [OMPI users] difference between single and double precision

2010-12-06 Thread Eugene Loh

Mathieu Gontier wrote:

Nevertheless, one can observed some differences between MPICH and 
OpenMPI from 25% to 100% depending on the options we are using into 
our software. Tests are lead on a single SGI node on 6 or 12 
processes, and thus, I am focused on the sm option.


Is it possible to narrow our focus here a little?  E.g., are there 
particular MPI calls that are much more expensive with OMPI than MPICH?  
Is the performance difference observable with simple ping-pong tests?



So, I have two questions:
1/ does the option--mca mpool_sm_max_size= can change something (I 
am wondering if the value is not too small and, as consequence, a set 
of small messages is sent instead of a big one)


There was recent related discussion on this mail list.
http://www.open-mpi.org/community/lists/users/2010/11/14910.php

Check the OMPI FAQ for more info.  E.g.,
http://www.open-mpi.org/faq/?category=sm

This particular parameter disappeared with OMPI 1.3.2.
http://www.open-mpi.org/faq/?category=sm#how-much-use

To move messages as bigger chunks, try btl_sm_eager_limit and 
btl_sm_max_send_size:

http://www.open-mpi.org/faq/?category=sm#more-sm

2/ is there a difference between --mca btl tcp,sm,self and --mca btl 
self,sm,tcp (or not put any explicit mca option)?


I think tcp,sm,self and self,sm,tcp will be the same.  Without an 
explicit MCA btl choice, it depends on what BTL choices are available.


Re: [OMPI users] difference between single and double precision

2010-12-05 Thread Eugene Loh

Mathieu Gontier wrote:


  Dear OpenMPI users

I am dealing with an arithmetic problem. In fact, I have two variants 
of my code: one in single precision, one in double precision. When I 
compare the two executable built with MPICH, one can observed an 
expected difference of performance: 115.7-sec in single precision 
against 178.68-sec in double precision (+54%).


The thing is, when I use OpenMPI, the difference is really bigger: 
238.5-sec in single precision against 403.19-sec double precision (+69%).


Our experiences have already shown OpenMPI is less efficient than 
MPICH on Ethernet with a small number of processes. This explain the 
differences between the first set of results with MPICH and the second 
set with OpenMPI. (But if someone have more information about that or 
even a solution, I am of course interested.)
But, using OpenMPI increases the difference between the two 
arithmetic. Is it the accentuation of the OpenMPI+Ethernet loss of 
performance, is it another issue into OpenMPI or is there any option a 
can use?


It is also unusual that the performance difference between MPICH and 
OMPI is so large.  You say that OMPI is slower than MPICH even at small 
process counts.  Can you confirm that this is because MPI calls are 
slower?  Some of the biggest performance differences I've seen between 
MPI implementations had nothing to do with the performance of MPI calls 
at all.  It had to do with process binding or other factors that 
impacted the computational (non-MPI) performance of the code.  The 
performance of MPI calls was basically irrelevant.


In this particular case, I'm not convinced since neither OMPI nor MPICH 
binds processes by default.


Still, can you do some basic performance profiling to confirm what 
aspect of your application is consuming so much time?  Is it a 
particular MPI call?  If your application is spending almost all of its 
time in MPI calls, do you have some way of judging whether the faster 
performance is acceptable?  That is, is 238 secs acceptable and 403 secs 
slow?  Or, are both timings unacceptable -- e.g., the code "should" be 
running in about 30 secs.


Re: [OMPI users] Calling MPI_Test() too many times results in a time spike

2010-11-30 Thread Eugene Loh

Ioannis Papadopoulos wrote:

Has anyone observed similar behaviour? Is it something that I'll have 
to deal with it in my code or does it indeed qualify as an issue to be 
looked into?


I would say this is NOT an issue that merits much attention.  There are 
too many potential performance anomalies that you might be encountering 
and they aren't worth "fixing" (or even understanding) unless they 
impact your application's performance in a meaningful way.


E.g., try timing "nothing".  Here is a sample test program:

#include <stdio.h>
#include <mpi.h>

#define N 100

int main(int argc, char **argv) {
 int i;
 double t[N], tavg = 0, tmin = 1.e20, tmax = 0;

 MPI_Init(&argc,&argv);
 for ( i = 0; i < N; i++ ) {
   t[i] = MPI_Wtime();
   t[i] = MPI_Wtime() - t[i];
 }
 for ( i = 0; i < N; i++ ) {
   tavg += t[i];
   if ( tmin > t[i] ) tmin = t[i];
   if ( tmax < t[i] ) tmax = t[i];
 }
 tavg /= N;

 printf("avg %12.3lf\n", tavg * 1.e6);
 printf("min %12.3lf\n", tmin * 1.e6);
 printf("max %12.3lf\n", tmax * 1.e6);

 MPI_Finalize();
 return 0;
}

I find that the minimum is 0 (indicating non-infinitesimal granularity 
of the timer), the average is small (some overhead of the timer call), 
and the maximum is very large.  Why?  Because something will happen now 
and then.  What it is doesn't matter unless your application's 
performance is suffering.


You report that the overall time is about the same.  That is, it takes 
just over a second to receive the message, which is expected if the 
sender delays a second before sending.


One of the things you could do is look at total time to receive the 
message and total time spent in MPI_Test.  Then, vary TIMEOUT more 
smoothly (0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 
0.0001, 0.0002, 0.0005, 0.001, 0.002, 0.005, 0.01, 0.02).  You may also 
have to run many times to see how reproducible the results are.  As 
TIMEOUT increases, the total time to get the message will roughly 
increase, but not by much until TIMEOUT gets pretty large.  The total 
time spent in MPI_Test should fall as TIMEOUT increases.  So, the idea 
is that by increasing TIMEOUT, you decrease the responsiveness of the 
receiver while you make more CPU time available for other tasks.  With 
any luck, there will be a broad range of TIMEOUT values that degrade 
responsiveness negligibly while freeing a meaningful amount of time up 
for other computational tasks.


The performance of MPI_Test() -- and of a particular MPI_Test() call -- 
is probably not very meaningful.


Note that your MPI_Irecv calls should strictly speaking have 
MPI_ANY_SOURCE rather than MPI_ANY_TAG.


Re: [OMPI users] mpool_sm_max_size disappeared ?

2010-11-29 Thread Eugene Loh




Gilbert Grosdidier wrote:

  
  
  I found this parameter mpool_sm_max_size in this post:
  http://www.open-mpi.org/community/lists/devel/2008/11/4883.php
  
  
  But I was unable to spot it back into the 'ompi_info -all'
output for v 1.4.3.
  Is it still existing ?
  

No.

  
   If not, which other one is replacing it, please ?
  

It no longer makes any sense.

Up through OMPI 1.3.1, OMPI made crude estimates of how large the sm
backing file should be.  You could limit the actual size with min and
max parameters.  The problem was that the estimate was so crude that
often the allocated size was excessively large while in other cases the
size was insufficient to allow the job to start up.  After 1.3.1, the
estimate of the size needed to start the job became much more precise. 
So, the job should always at least start.  You might still want to
employ a "minimum" (to give the sm area some extra room for performance
reasons), but there is no longer any point in have a "maximum" size. 
If you were to set a maximum that is smaller than what OMPI needs, the
job wouldn't start anyhow.

You may find the FAQ useful.  Check
http://www.open-mpi.org/faq/?category=sm .  There are discussions of
these issues, plus ways of estimating how much room OMPI needs,
therefore also ways of tuning OMPI so that it will need less.  E.g., if
you think the sm area is taking up too much space, the FAQ tells you
what parameters to use to make OMPI less space hungry.

  
  Also, is it possible to specify to OpenMPI
  which filesystem to use for the SM backing file, please ?
  

Again, check the FAQ.
http://www.open-mpi.org/faq/?category=sm#where-sm-file

  
  Thanks in advance for any help,   Regards,   G.
  
  


  

De rien.




Re: [OMPI users] tool for measuring the ping with accuracy

2010-11-23 Thread Eugene Loh

George Markomanolis wrote:


Dear Eugene,

Thanks a lot for the answer; you were right about the eager mode.

I have one more question. I am looking for an official tool to measure 
the ping time, just sending a message of 1 byte or more and measure 
the duration of the MPI_Send command on the rank 0 and the duration of 
the MPI_Recv on rank 1. I would like to know any formal tool because I 
am using also SkaMPI and the results really depend on the call of the 
synchronization before the measurement starts.


So for example with synchronizing the processors, sending 1 byte, I have:
rank 0, MPI_Send: ~7 ms
rank 1, MPI_Recv: ~52 ms

where 52 ms is almost the half of the ping-pong and this is ok.

Without synchronizing I have:
rank 0, MPI_Send: ~7 ms
rank 1, MPI_Recv: ~7 ms

However I developed a simple application where the rank 0 sends 1000 
messages of 1 byte to rank 1 and I have almost the second timings with 
the 7 ms. If in the same application I add the MPI_Recv and MPI_Send 
respectively in order to have a ping-pong application then the 
ping-pong duration is 100ms (like SkaMPI). Can someone explain me why 
is this happening? The ping-pong takes 100 ms and the ping without 
synchronization takes 7 ms.


I'm not convinced I'm following you at all.  Maybe the following helps, 
though maybe it's just obvious and misses the point you're trying to make.


In a ping-pong test, you have something like this:

tsend = MPI_Wtime()
MPI_Send
tsend = MPI_Wtime() - tsend
trecv = MPI_Wtime()
MPI_Recv
trecv = MPI_Wtime() - trecv

The send time times how long it takes to get the message out of the 
user's send buffer.  This time is very short.  In contrast, the 
"receive" time mostly measures how long it takes for the ping message to 
reach the peer and the pong message to return.  The actual time to do 
the receive processing is very short and accounts for a tiny fraction of 
trecv.


If a sender sends many short messages to a receiver and the two 
processes don't synchronize much, you can overlap many messages and hide 
the long transit time.


Here's a simple model:

sender injects message into interconnect, MPI_Send completes  (this time 
is short)

message travels the interconnect to the receiver (this time is long)
receiver unpacks the message and MPI_Recv completes (this time is short)

A ping-pong test counts the long inter-process transit time.  Sending 
many short messages before synchronizing hides the long transit time.


Sorry if this discussion misses the point you're trying to make.


Re: [OMPI users] Making MPI_Send to behave as blocking for all the sizes of the messages

2010-11-18 Thread Eugene Loh
Try lowering the eager threshold more gradually... e.g., 4K, 2K, 1K, 
512, etc. -- and watch what happens.  I think you will see what you 
expect, except once you get too small then the value is ignored 
entirely.  So, the setting just won't work at the extreme value (0) you 
want.


Maybe the thing to do is convert your MPI_Send calls to MPI_Ssend 
calls.  Or, compile in a wrapper that intercepts MPI_Send calls and 
implements them by calling PMPI_Ssend.
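
Such a wrapper is tiny.  A sketch:

#include <mpi.h>

/* Force every MPI_Send in the application onto the synchronous path. */
int MPI_Send(void *buf, int count, MPI_Datatype type, int dest, int tag,
             MPI_Comm comm) {
  return PMPI_Ssend(buf, count, type, dest, tag, comm);
}

Compile that into your application (or into a small library linked
ahead of the MPI library) and no source changes are needed.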


George Markomanolis wrote:


Dear all,

I am trying to disable the eager mode in OpenMPI 1.3.3 and I don't see 
a real difference between the timings.
I would like to execute a ping (rank 0 sends a message to rank 1) and 
to measure the duration of the MPI_Send on rank 0 and the duration of 
MPI_Recv on rank 1. I have the following results.


Without changing the eager mode:

bytes    MPI_Send (in msec)    MPI_Recv (in msec)
1        5.8                   52.2
2        5.6                   51.0
4        5.4                   51.1
8        5.6                   51.6
16       5.5                   49.7
32       5.4                   52.1
64       5.3                   53.3



with disabled the eager mode:

ompi_info --param btl tcp | grep eager
MCA btl: parameter "btl_tcp_eager_limit" (current value: "0", data 
source: environment)


bytes    MPI_Send (in msec)    MPI_Recv (in msec)
1        5.4                   52.3
2        5.4                   51.0
4        5.4                   52.1
8        5.4                   50.7
16       5.0                   50.2
32       5.1                   50.1
64       5.4                   52.8

However I was expecting that with disabled the eager mode the duration 
of MPI_Send should be longer. Am I wrong? Is there any option for 
making the MPI_Send to behave like blocking command for all the sizes 
of the messages?



Thanks a lot,
Best regards,
George Markomanolis






Re: [OMPI users] Open MPI data transfer error

2010-11-05 Thread Eugene Loh
Debugging is not a straightforward task.  Even posting the code doesn't 
necessarily help (since no one may be motivated to help or they can't 
reproduce the problem or...).  You'll just have to try different things 
and see what works for you.  Another option is to trace the MPI calls.  
If a process sends a message, dump out the MPI_Send() arguments.  When a 
receiver receives, correspondingly dump those arguments.  Etc.  This 
might be a way of seeing what the program is doing in terms of MPI and 
thereby getting to suggestion B below.


How do you trace and sort through the resulting data?  That's another 
tough question.  Among other things, if you can't find a tool that fits 
your needs, you can use the PMPI layer to write wrappers.  Writing 
wrappers is like inserting printf() statements, but doesn't quite have 
the same amount of moral shame associated with it!
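
Just to illustrate (a sketch only, not a polished tool), wrappers that
dump the arguments of each send and receive might look like this:

% cat trace_sendrecv.c
/* Print the arguments of every MPI_Send and MPI_Recv, then call the
 * underlying PMPI routine.  Link ahead of the MPI library. */
#include <stdio.h>
#include <mpi.h>

int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    int me;
    PMPI_Comm_rank(comm, &me);
    printf("rank %d: MPI_Send count=%d dest=%d tag=%d\n", me, count, dest, tag);
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}

int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm, MPI_Status *status)
{
    int me;
    PMPI_Comm_rank(comm, &me);
    printf("rank %d: MPI_Recv count=%d source=%d tag=%d\n",
           me, count, source, tag);
    return PMPI_Recv(buf, count, datatype, source, tag, comm, status);
}

Matching the sender's output against the receiver's then shows which
receive picked up which send.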


Prentice Bisbal wrote:


Choose one

A) Post only the relevant sections of the code. If you have syntax
error, it should be in the Send and Receive calls, or one of the lines
where the data is copied or read from the array/buffer/whatever that
you're sending or receiving.

B) Try reproducing your problem in a toy program that has only enough
code to reproduce your problem. For example, create an array, populate
it with data, send it, and then on the receiving end, receive it, and
print it out. Something simple like that. I find when I do that, I
usually find the error in my code.

Jack Bryan wrote:
 

But, my code is too long to be posted. 
dozens of files, thousands of lines. 
Do you have better ideas ? 
Any help is appreciated. 


Nov. 5 2010

From: solarbik...@gmail.com
Date: Fri, 5 Nov 2010 11:20:57 -0700
To: us...@open-mpi.org
Subject: Re: [OMPI users] Open MPI data transfer error

As Prentice said, we can't help you without seeing your code.  openMPI
has stood many trials from many programmers, with many bugs ironed out.
So typically it is unlikely openMPI is the source of your error. 
Without seeing your code the only logical conclusion is that something

is wrong with your programming.

On Fri, Nov 5, 2010 at 10:52 AM, Prentice Bisbal <prent...@ias.edu> wrote:

   We can't help you with your coding problem without seeing your code.


   Jack Bryan wrote:
   > Thanks,
   > I have used "cout" in c++ to print the values of data.
   >
   > The sender sends correct data to correct receiver.
   >
   > But, receiver gets wrong data from correct sender.
   >
   > why ?
   >
   > thanks
   >
   > Nov. 5 2010
   >
   >> Date: Fri, 5 Nov 2010 08:54:22 -0400
   >> From: prent...@ias.edu 
   >> To: us...@open-mpi.org 
   >> Subject: Re: [OMPI users] Open MPI data transfer error
   >>
   >> Jack Bryan wrote:
   >> >
   >> > Hi,
   >> >
   >> > In my Open MPI program, one master sends data to 3 workers.
   >> >
   >> > Two workers can receive their data.
   >> >
   >> > But, the third worker can not get their data.
   >> >
   >> > Before sending data, the master sends a head information to
   each worker
   >> > receiver
   >> > so that each worker knows what the following data package is.
   (such as
   >> > length, package tag).
   >> >
   >> > The third worker can get its head information message from
   master but
   >> > cannot get its correct
   >> > data package.
   >> >
   >> > It got the data that should be received by first worker, which
   get its
   >> > correct data.
   >> >
   >>
   >>
   >> Jack,
   >>
   >> Providing the relevant sections of code here would be very helpful.
   >>
   >> 
   >> I would tell you to add some printf statements to your code to
   see what
   >> data is stored in your variables on the master before it sends
   them to
   >> each node, but Jeff Squyres and I agreed to disagree in a civil
   manner
   >> on that debugging technique earlier this week, and I'd hate to
   re-open
   >> those old wounds by suggesting that technique here. ;)
   >> 
   



Re: [OMPI users] Need Help for understand heat equation 2D mpi solving version

2010-10-29 Thread Eugene Loh




christophe petit wrote:

> I am still trying to understand the parallelized version of the heat
> equation 2D solving that we saw at school.  I am confused between the
> shift of the values near the bounds done by the "updateBound" routine
> and the main loop (at line 161 in the main code) which calls the
> routine "Explicit".

Each process "owns" a subdomain of cells, for which it will compute
updated values.  The process has storage not only for these cells,
which it owns, but also for a perimeter of cells, whose values need to
be fetched from nearby processes.  So, there are two steps.  In
"updateBound", processes communicate so that each supplies boundary
values to neighbors and gets boundary values from neighbors.  In
"Explicit", the computation (stencil operation) is performed.

  
> For a given process (say number 1) (I use 4 here for execution), I send
> to the east process (3) the penultimate left column, to the north
> process (0) the penultimate top row, and to the others (mpi_proc_null=-2)
> the penultimate right column and the bottom row.  But how are the 4
> processes kept synchronous?

When UpdateBound is called, neighboring processes are implicitly
synchronized via the MPI_Sendrecv() calls.
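
In case it helps, here is that idea reduced to a small sketch (in C, for
brevity, even though the course code is Fortran; the buffer and neighbor
names are made up):

/* Exchange one "ghost" column with the west and east neighbors.
 * Neighbors equal to MPI_PROC_NULL make the calls no-ops at the
 * edges of the domain. */
#include <mpi.h>

void exchange_halo(double *west_edge, double *west_ghost,
                   double *east_edge, double *east_ghost,
                   int n, int west, int east, MPI_Comm comm2d)
{
    MPI_Status status;

    /* send my west edge west, receive my east ghost from the east */
    MPI_Sendrecv(west_edge, n, MPI_DOUBLE, west, 0,
                 east_ghost, n, MPI_DOUBLE, east, 0, comm2d, &status);

    /* send my east edge east, receive my west ghost from the west */
    MPI_Sendrecv(east_edge, n, MPI_DOUBLE, east, 1,
                 west_ghost, n, MPI_DOUBLE, west, 1, comm2d, &status);
}

Each MPI_Sendrecv returns only when both its send and its receive are
done, which is what keeps neighboring processes loosely in step.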

  
> I also don't understand why all the processes go through the solving
> piece of code calling the "Explicit" routine.

The computational domain is distributed among all processes.  Each cell
must be updated with the stencil operation.  So, each process calls
that computation for the cells that it owns.

You should be able to get better interactivity at your school than on
this mailing list.  Further, your questions at school would help the
instructor get feedback from the students.




Re: [OMPI users] Using hostfile with default hostfile

2010-10-27 Thread Eugene Loh

jody wrote:


Where is the option 'default-hostfile' described?
 

Try "mpirun --help".  Not everything makes it to the man page.  Heck, 
not everything is documented!



It does not appear in mpirun's man page (for v. 1.4.2)
and i couldn't find anything like that with googling.

On Wed, Oct 27, 2010 at 4:02 PM, Ralph Castain  wrote:
 


Specify your hostfile as the default one:

mpirun --default-hostfile ./Cluster.hosts

Otherwise, we take the default hostfile and then apply the hostfile as a filter 
to select hosts from within it. Sounds strange, I suppose, but the idea is that 
the default hostfile can contain configuration info (#sockets, #cores/socket, 
etc.) that you might not want to have to put in every hostfile.



Re: [OMPI users] try to understand heat equation 2D mpi version

2010-10-22 Thread Eugene Loh

christophe petit wrote:

i'm studying the parallelized version of a solving 2D heat equation 
code in order to understand cartesian topology and the famous 
"MPI_CART_SHIFT".

Here's my problem at this part of the code :


---
call MPI_INIT(infompi)
  comm = MPI_COMM_WORLD
  call MPI_COMM_SIZE(comm,nproc,infompi)
  call MPI_COMM_RANK(comm,me,infompi)
!

..


! Create 2D cartesian grid
  periods(:) = .false.

  ndims = 2
  dims(1)=x_domains
  dims(2)=y_domains
  CALL MPI_CART_CREATE(MPI_COMM_WORLD, ndims, dims, periods, &
reorganisation,comm2d,infompi)
!
! Identify neighbors
!
  NeighBor(:) = MPI_PROC_NULL
! Left/West and right/Est neigbors
  CALL MPI_CART_SHIFT(comm2d,0,1,NeighBor(W),NeighBor(E),infompi)
 
  print *,'rank=', me

  print *, 'here first mpi_cart_shift : neighbor(w)=',NeighBor(W)
  print *, 'here first mpi_cart_shift : neighbor(e)=',NeighBor(E)

...

---

with x_domains=y_domains=2

and i get at the execution :" mpirun -np 4 ./explicitPar"

 rank=   0
here first mpi_cart_shift : neighbor(w)=  -1
 here first  mpi_cart_shift : neighbor(e)=   2
rank=   3
 here first mpi_cart_shift : neighbor(w)=   1
 here first mpi_cart_shift : neighbor(e)=  -1
 rank=   2
 here first mpi_cart_shift : neighbor(w)=   0
 here first mpi_cart_shift : neighbor(e)=  -1
 rank=   1
 here first mpi_cart_shift : neighbor(w)=  -1
 here first mpi_cart_shift : neighbor(e)=   3

I saw that if the rank is out of the topology and without periodicity, 
the rank should be equal to MPI_UNDEFINED, which is assigned to -32766 
in "mpif.h".  So why have I got the value "-1"?

On my Macbook pro, i get the value "-2".


It seems to me the man page says MPI_PROC_NULL may be returned, and in 
OMPI that looks like -2.  Can you try the following:


% cat x.f90
 include "mpif.h"

 integer comm, dims(2)
 logical periods(2)

 call MPI_INIT(ier)
 ndims = 2
 dims(1)=2;  periods(1) = .false.
 dims(2)=2;  periods(2) = .false.
 CALL MPI_CART_CREATE(MPI_COMM_WORLD, ndims, dims, periods, .false., 
comm, ier)

 CALL MPI_CART_SHIFT(comm, 0, 1, iwest, ieast, ier)
 write(6,*) iwest, ieast, MPI_PROC_NULL
 call MPI_Finalize(ier)
end
% mpif90 x.f90
% mpirun -n 4 ./a.out
1 -2 -2
-2 2 -2
-2 3 -2
0 -2 -2



Re: [OMPI users] Question about MPI_Barrier

2010-10-21 Thread Eugene Loh




My main point was that, while what Jeff said about the short-comings of
calling timers after Barriers was true, I wanted to come in defense of
this timing strategy.  Otherwise, I was just agreeing with him that it
seems implausible that commenting out B should influence the timing of
A, but I'm equally clueless about what the real issue is.  I have seen cases
where the presence or absence of code that isn't executed can influence
timings (perhaps because code will come out of the instruction cache
differently), but all that is speculation.  It's all a guess that what
you're really seeing isn't really MPI related at all.

Storm Zhang wrote:
Hi, Eugene, You said:
"The bottom line here is that from a causal point of view it would
seem that B should not impact the timings.  Presumably, some other
variable is actually responsible here."  Could you explain the second
sentence in more detail?  Thanks a lot.

On Thu, Oct 21, 2010 at 9:58 AM, Eugene Loh <eugene@oracle.com> wrote:
  
Jeff Squyres wrote:

MPI::COMM_WORLD.Barrier();
if(rank == master) t1 = clock();
"code A";
MPI::COMM_WORLD.Barrier();
if(rank == master) t2 = clock();
"code B";
  
Remember that the time that individual processes exit barrier is not
guaranteed to be uniform (indeed, it most likely *won't* be the same).
 MPI only guarantees that a process will not exit until after all
processes have entered.  So taking t2 after the barrier might be a bit
misleading, and may cause unexpected skew.
 


The barrier exit times are not guaranteed to be uniform, but in
practice this style of timing is often the best (or only practical)
tool one has for measuring the collective performance of a group of
processes.


Code B *probably* has no effect on time spent between t1 and t2.  But
extraneous effects might cause it to do so -- e.g., are you running in
an oversubscribed scenario?  And so on.
 


Right.  The bottom line here is that from a causal point of view it
would seem that B should not impact the timings.  Presumably, some
other variable is actually responsible here.
  





Re: [OMPI users] Question about MPI_Barrier

2010-10-21 Thread Eugene Loh

Jeff Squyres wrote:


Ah.  The original code snipit you sent was:

MPI::COMM_WORLD.Barrier();
if(rank == master) t1 = clock();
"code A";
MPI::COMM_WORLD.Barrier();
if(rank == master) t2 = clock();
"code B";

Remember that the time that individual processes exit barrier is not guaranteed 
to be uniform (indeed, it most likely *won't* be the same).  MPI only 
guarantees that a process will not exit until after all processes have entered. 
 So taking t2 after the barrier might be a bit misleading, and may cause 
unexpected skew.
 

The barrier exit times are not guaranteed to be uniform, but in practice 
this style of timing is often the best (or only practical) tool one has 
for measuring the collective performance of a group of processes.



Code B *probably* has no effect on time spent between t1 and t2.  But 
extraneous effects might cause it to do so -- e.g., are you running in an 
oversubscribed scenario?  And so on.
 

Right.  The bottom line here is that from a causal point of view it 
would seem that B should not impact the timings.  Presumably, some other 
variable is actually responsible here.


Re: [OMPI users] busy wait in MPI_Recv

2010-10-19 Thread Eugene Loh

Brian Budge wrote:


Hi all -

I just ran a small test to find out the overhead of an MPI_Recv call
when no communication is occurring.   It seems quite high.  I noticed
during my google excursions that openmpi does busy waiting.  I also
noticed that the option to -mca mpi_yield_when_idle seems not to help
much (in fact, turning on the yield seems only to slow down the
program).  What is the best way to reduce this polling cost during
low-communication invervals?  Should I write my own recv loop that
sleeps for short periods?  I don't want to go write something that is
possibly already done much better in the library :)
 


I think this has been discussed a variety of times before on this list.

Yes, OMPI does busy wait.

Turning on the MCA yield parameter can help some.  There will still be a 
load, but one that defers somewhat to other loads.  In any case, even 
with yield, a wait is still relatively intrusive.


You might have some luck writing something like this yourself, 
particularly if you know you'll be idle long periods.
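
For what it's worth, a sketch of such a receive (poll with MPI_Iprobe and
sleep between polls) might be:

/* A "lazy" receive:  poll for a matching message and sleep between
 * polls instead of spinning, then do the actual receive. */
#include <unistd.h>
#include <mpi.h>

int lazy_recv(void *buf, int count, MPI_Datatype type,
              int source, int tag, MPI_Comm comm, MPI_Status *status)
{
    int flag = 0;
    while (!flag) {
        MPI_Iprobe(source, tag, comm, &flag, status);
        if (!flag)
            usleep(1000);    /* back off 1 ms between polls; tune to taste */
    }
    return MPI_Recv(buf, count, type, source, tag, comm, status);
}

The trade-off is that you add up to one sleep interval of latency to each
message, which is presumably acceptable if you expect long idle periods.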


Re: [OMPI users] How to time data transfers?

2010-10-13 Thread Eugene Loh




Ed Peddycoart wrote:

  
  
  I need to do some performance tests on my mpi app.  I simply want
to determine how long it takes for my sends from one process to be
received by another process.  
  
Here is the code I used as my example for non-blocking send/receive...
   if( myrank == 0 ) {
  /* Post a receive, send a message, then wait */
  MPI_Irecv( b, 100, MPI_DOUBLE, 1, 19, MPI_COMM_WORLD, &request );
  MPI_Isend( a, 100, MPI_DOUBLE, 1, 17, MPI_COMM_WORLD );
  MPI_Wait( &request, &status );
 }
 else if( myrank == 1 ) {
  /* Post a receive, send a message, then wait */
  MPI_Irecv( b, 100, MPI_DOUBLE, 0, 17, MPI_COMM_WORLD, &request
);   
  MPI_Isend( a, 100, MPI_DOUBLE, 0, 19, MPI_COMM_WORLD );
  MPI_Wait( &request, &status );
 }

First of all, you should also complete the Isend request.  One option
is to turn it into a blocking Send.  Another option is to add a request
to the Isend call (which is required by the API) and then turn the Wait
call into a Waitall call on both requests.
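
That is, something like this (just a sketch of the second option):

 MPI_Request requests[2];
 MPI_Status  statuses[2];

 if( myrank == 0 ) {
  MPI_Irecv( b, 100, MPI_DOUBLE, 1, 19, MPI_COMM_WORLD, &requests[0] );
  MPI_Isend( a, 100, MPI_DOUBLE, 1, 17, MPI_COMM_WORLD, &requests[1] );
  MPI_Waitall( 2, requests, statuses );
 }
 else if( myrank == 1 ) {
  MPI_Irecv( b, 100, MPI_DOUBLE, 0, 17, MPI_COMM_WORLD, &requests[0] );
  MPI_Isend( a, 100, MPI_DOUBLE, 0, 19, MPI_COMM_WORLD, &requests[1] );
  MPI_Waitall( 2, requests, statuses );
 }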

I originally thought to just put a timer call before/after the
rank=0 receive, but that doesn't seem to capture the complete time...
see the following code.
   if( myrank == 0 ) {
    timer.start();
  /* Post a receive, send a message, then wait */
  MPI_Irecv( b, 100, MPI_DOUBLE, 1, 19, MPI_COMM_WORLD, &request );
  MPI_Isend( a, 100, MPI_DOUBLE, 1, 17, MPI_COMM_WORLD );
  MPI_Wait( &request, &status );
  timer.stop();
  elapsedTime = getElapsedTime();
 }
 else if( myrank == 1 ) {
  /* Post a receive, send a message, then wait */
  MPI_Irecv( b, 100, MPI_DOUBLE, 0, 17, MPI_COMM_WORLD, &request
);   
  MPI_Isend( a, 100, MPI_DOUBLE, 0, 19, MPI_COMM_WORLD );
  MPI_Wait( &request, &status );
 }

That should work once the code is corrected.  Can you use MPI_Wtime()? 
(Not necessarily a big deal, but should be a portable way of getting
high-quality timings in MPI programs.)  In what sense does it not
capture the complete time?

  How do others time this process?  Should I send a msg from one app
to the other to initiate timing, send the data I want to time?

It's common to ping-pong many times back and forth.  There may be one
or more "warm-up" iterations (to make sure both processes are ready and
all resources used have been touched and warmed/woken up) and other
iterations to check reproducibility of results.  Also, one might have
many iterations between the timer calls to amortize the overhead of the
timer call.

  
   if( myrank == 0 ) {
    MPI_Irecv( b, 100, MPI_DOUBLE, 1,
startTimeTag, MPI_COMM_WORLD, &request );
  MPI_Wait( &request, &status );
  timer.start();
  MPI_Irecv( b, 100, MPI_DOUBLE, 1, dataTag, MPI_COMM_WORLD,
&request );
  MPI_Wait( &request, &status );
  timer.stop();
  elapsedTime = getElapsedTime();
 }
 else if( myrank == 1 ) {
    MPI_Isend( a, 100, MPI_DOUBLE, 0,
startTimerTag, MPI_COMM_WORLD );
  MPI_Wait( &request, &status );
  MPI_Isend( b, 100, MPI_DOUBLE, 0, dataTag , MPI_COMM_WORLD,
&request );   
  MPI_Wait( &request, &status );
 }
  Ed
  
  







Re: [OMPI users] OpenMPI and glut

2010-10-08 Thread Eugene Loh




Ed Peddycoart wrote:

After searching some more and reading some FAQs on the ompi website, I
see suggestions on how to make a remote app use the local display to
render, but that isn't what I need...  Let me revise or clarify my
question:

I have an app which will run on 5 machines:  The app will be kicked off
on Machine0.  Machine1,2,3,4 will render a scene to their own display,
then send data to Machine0, which will use the data as input to render
its scene, which will be rendered on its own display.

I'm missing the question.

  
   
Also, is my understanding of rank correct:  rank is like an app ID#?
Rank = 0 is the initial process?  And 1-N is all others?

The ranks start at 0 and end at N-1.

Typically, one thinks of all N processes starting at once.  E.g., you
could have a hostfile that lists the 5 machines on which your processes
will run.  When you use "mpirun --np 5 --hostfile myhostfile ...", all
5 processes will be started, each on its machine.  Each one should get
to an MPI_Init() call, at which point they coordinate with one another
to set up interprocess communications.

  
  
  
  
From: Ed Peddycoart
Sent: Fri 10/8/2010 11:10 AM
To: us...@open-mpi.org
Subject: OpenMPI and glut

I have a glut app I am infusing with MPI calls...  The glut init appears
to fail in the rank 1 processes.  How do I accomplish this, that is,
parallel rendering with GLUT and MPI?





Re: [OMPI users] Pros and cons of --enable-heterogeneous

2010-10-07 Thread Eugene Loh

David Ronis wrote:

Ralph, thanks for the reply.   


If I build with enable-heterogeneous and then decide to run on a
homogeneous set of nodes, does the additional "overhead" go away or
become completely negligible; i.e., if no conversion is necessary.
 

I'm no expert, but I think the overhead does not go away.  Even if you 
run on a homogeneous set of nodes, a local node does not know that.  It 
prepares a message without knowing if the destination is "same" or 
"different".  (There may be an exception with the sm BTL, which is only 
for processes on the same node and where it it assumed that a node 
comprises homogeneous processors.)


Whether the overhead is significant or negligible is another matter.  A 
subjective matter.  I suppose you could try some tests and judge for 
yourself for your case.


Re: [OMPI users] Bad performance when scattering big size of data?

2010-10-04 Thread Eugene Loh

Storm Zhang wrote:



Here is what I meant: the results of 500 procs in fact shows it with 
272-304(<500) real cores, the program's running time is good, which is 
almost five times 100 procs' time. So it can be handled very well. 
Therefore I guess OpenMPI or Rocks OS does make use of hyperthreading 
to do the job. But with 600 procs, the running time is more than 
double of that of 500 procs. I don't know why. This is my problem.  

BTW, how to use -bind-to-core? I added it as mpirun's options. It 
always gives me error " the executable 'bind-to-core' can't be found. 
Isn't it like:

mpirun --mca btl_tcp_if_include eth0 -np 600  -bind-to-core scatttest


Thanks for sending the mpirun run and error message.  That helps.

It's not recognizing the --bind-to-core option.  (Single hyphen, as you 
had, should also be okay.)  Skimming through the e-mail, it looks like 
you are using OMPI 1.3.2 and 1.4.2.  Did you try --bind-to-core with 
both?  If I remember my version numbers, --bind-to-core will not be 
recognized with 1.3.2, but should be with 1.4.2.  Could it be that you 
only tried 1.3.2?


Another option is to try "mpirun --help".  Make sure that it reports 
--bind-to-core.


Re: [OMPI users] Shared memory

2010-09-24 Thread Eugene Loh




It seems to me there are two extremes.

One is that you replicate the data for each process.  This has the
disadvantage of consuming lots of memory "unnecessarily."

Another extreme is that shared data is distributed over all processes. 
This has the disadvantage of making at least some of the data less
accessible, whether in programming complexity and/or run-time
performance.

I'm not familiar with Global Arrays.  I was somewhat familiar with
HPF.  I think the natural thing to do with those programming models is
to distribute data over all processes, which may relieve the excessive
memory consumption you're trying to address but which may also just put
you at a different "extreme" of this spectrum.

The middle ground I think might make most sense would be to share data
only within a node, but to replicate the data for each node.  There are
probably multiple ways of doing this -- possibly even GA, I don't
know.  One way might be to use one MPI process per node, with OMP
multithreading within each process|node.  Or (and I thought this was
the solution you were looking for), have some idea which processes are
collocal.  Have one process per node create and initialize some shared
memory -- mmap, perhaps, or SysV shared memory.  Then, have its peers
map the same shared memory into their address spaces.

You asked what source code changes would be required.  It depends.  If
you're going to mmap shared memory in on each node, you need to know
which processes are collocal.  If you're willing to constrain how
processes are mapped to nodes, this could be easy.  (E.g., "every 4
processes are collocal".)  If you want to discover dynamically at run
time which are collocal, it would be harder.  The mmap stuff could be
in a stand-alone function of about a dozen lines.  If the shared area
is allocated as one piece, substituting the single malloc() call with a
call to your mmap function should be simple.  If you have many
malloc()s you're trying to replace, it's harder.
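
For what it's worth, the shared-memory part might look roughly like this.
(A sketch only:  error checking and cleanup are omitted, the segment name
is made up, and it assumes you already have a communicator "node_comm" of
collocal processes -- e.g., built by gathering hostnames and calling
MPI_Comm_split.)

/* Create (on the node's first process) or attach to (on its peers) a
 * region of per-node shared memory.  Link with -lrt on Linux. */
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <mpi.h>

void *node_shared_alloc(size_t bytes, MPI_Comm node_comm)
{
    const char *name = "/myapp_shared";   /* made-up segment name */
    int node_rank, fd;
    void *p;

    MPI_Comm_rank(node_comm, &node_rank);

    if (node_rank == 0) {
        /* first process on the node creates and sizes the segment */
        fd = shm_open(name, O_CREAT | O_RDWR, 0600);
        ftruncate(fd, bytes);
        p = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        /* ... read the input file and initialize the shared data here ... */
    }

    MPI_Barrier(node_comm);      /* segment now exists and is initialized */

    if (node_rank != 0) {
        fd = shm_open(name, O_RDWR, 0600);
        p = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
    }
    return p;
}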

Andrei Fokau wrote:

  The data are read from a file and
processed before calculations begin, so I think that mapping will not
work in our case.
  
  
Global Arrays look promising indeed. As I said, we need to put just a
part of data to the shared section. John, do you (or may be other
users) have an experience of working with GA?
  
  
  
  
  
  
  http://www.emsl.pnl.gov/docs/global/um/build.html
  
  
  
  When GA runs with MPI:
  
  
  MPI_Init(..)
     ! start MPI 
  GA_Initialize()
  ! start global arrays 
  MA_Init(..)
      ! start memory allocator
  
  
    
 do work
  
  
  GA_Terminate()
   ! tidy up global arrays 
  MPI_Finalize()
   ! tidy up MPI 
    
               ! exit program
  
  
On Fri, Sep 24, 2010 at 13:44, Reuti wrote:

On 24.09.2010 at 13:26, John Hearns wrote:

> On 24 September 2010 08:46, Andrei Fokau 
wrote:
>> We use a C-program which consumes a lot of memory per process
(up to few
>> GB), 99% of the data being the same for each process. So for
us it would be
>> quite reasonable to put that part of data in a shared memory.
>
> http://www.emsl.pnl.gov/docs/global/
>
> Is this eny help? Apologies if I'm talking through my hat.


I was also thinking of this when I read "data in a shared memory"
(besides approaches like http://www.kerrighed.org/wiki/index.php/Main_Page).
Wasn't this also one idea behind "High Performance Fortran" - running
in parallel across nodes even without knowing that it's across nodes at
all while programming and access all data like it's being local.
  
  
  
  
  
  





Re: [OMPI users] latency #2

2010-09-13 Thread Eugene Loh

Georges Markomanolis wrote:


Dear all,

Hi again.  After using it, MPI_Ssend seems to be what I was looking for,
but I would like to know more about MPI_Send.


For example, sending 1 byte with MPI_Send takes 8.69 microsec but with
MPI_Ssend it takes 152.9 microsec.  I understand the difference, but it
seems that from some message size onward the difference is not so big;
for example, for 518400 bytes it needs 3515.78 microsec with MPI_Send
and 3584.1 microsec with MPI_Ssend.  So is there any rule to figure out
(of course it depends on the hardware) the threshold after which the
difference between the timings of MPI_Send and MPI_Ssend is not so big,
or at least how to find it for my hardware?


Most MPI implementations choose one strategy for passing short messages 
(sender sends, MPI implementation buffers the message, receiver 
receives) and long messages (sender alerts receiver, receiver replies, 
then sender and receiver coordinate the transfer).  The first style is 
"eager" (sender sends eagerly without waiting for receiver to 
coordinate) while the second style is "rendezvous" (sender and receiver 
meet).  The message size at which the crossover occurs can be determined 
or changed.  In OMPI, it depends on the BTL.  E.g., try "ompi_info -a | 
grep eager".


Try the OMPI FAQ at http://www.open-mpi.org/faq/ and look at the 
"Tuning" categories.


Re: [OMPI users] computing the latency with OpenMpi

2010-09-13 Thread Eugene Loh

Georges Markomanolis wrote:

I have some questions about the duration of the communication with 
MPI_Send and MPI_Recv.  I am using either SkaMPI or my own
implementation to measure the ping-pong (MPI_Send and MPI_Recv) time
between two nodes for 1 byte and more.  The timing of the ping-pong is
106.8 microseconds, although if I measure only the ping of the message
(only the MPI_Send) the time is ~20 microseconds.  Could anyone explain
to me why it is not half?  I would like to understand the difference
inside OpenMpi between MPI_Send and MPI_Recv.


The time for the MPI_Send is the time to move the data out of the user's 
send buffer.  It is quite possible that the data has not yet gotten to 
the destination.  If the message is short, it could be buffered 
somewhere by the MPI implementation.


The time for MPI_Recv probably includes some amount of waiting time.



More analytical the timings for pingpong between two nodes with a 
simple pingpong application, timings only for rank 0 (almost the same 
for rank 1):

1 byte, time for MPI_Send, 9 microsec, time for MPI_Recv, 86.4 microsec
1600 bytes, time for MPI_Send, 14.7 microsec, time for MPI_Recv, 
197.07 microsec
3200 bytes, time for MPI_Send, 19.73 microsec, time for MPI_Recv, 
227.6 microsec
518400 bytes, time for MPI_Send, 3536.5 microsec, time for MPI_Recv, 
5739.6 microsec
1049760 bytes, time for MPI_Send, 8020.33 microsec, time for MPI_Recv, 
10287 microsec


So the duration of the MPI_Send is till the buffer goes to the queue 
of the destination without the message to be saved in the memory or 
something like this, right?


It is possible that the data has not gone to the destination, but only 
some intermediate buffer, but yes it is possible that the message has 
not made it all the way to the receive buffer by the time the MPI_Send 
has finished.


So if I want to know the real time of sending one message to another 
node (taking the half of pingpoing seems that is not right)


It is not clear to me what "the real time" is.  I don't think there is 
any well-defined answer.  It depends on what you're really looking for, 
and that is unclear to me.  You could send many sends to many receivers 
and see how fast a process can emit sends.  You can use a profiler to 
see how the MPI implementation spends its time;  I've had some success 
with using Oracle Studio Performance Analyzer on OMPI.  You could use 
the PERUSE instrumentation inside of OMPI to get timestamps on 
particular internal events.  You could try designing other experiments.  
But which one is "right" could be debated.


Why does it matter?  What are you really looking for?


should I use a program with other commands like  MPI_Fence, MPI_Put etc?


Those are a different set of calls (one-sided operations) that could be 
more or less efficient than Send/Recv.  It varies.


Or is there any flag when I execute the application where MPI_Send 
behaves like I would expect? According to MPI standards what is 
MPI_Send measuring? If there is any article which explain all these 
please inform me. 


MPI_Send completes when the data has left the send buffer and that 
buffer can be reused by the application.  There are many implementation 
choices.  Specifically, it is possible that the MPI_Send will complete 
even before the MPI_Recv has started.  But it is also possible that the 
MPI_Send will not complete until after the MPI_Recv has completed.  It 
depends on the implementation, which may choose a strategy based on the 
message size, the interconnect, and other factors.


Re: [OMPI users] MPI_Reduce performance

2010-09-10 Thread Eugene Loh




Richard Treumann wrote:

> Hi Ashley
>
> I understand the problem with descriptor flooding can be serious in an
> application with unidirectional data dependency.  Perhaps we have a
> different perception of how common that is.

Ashley speculated it was a "significant minority."  I don't know what
that means, but it seems like it is a minority (most computations have
causal relationships among the processes holding unbounded imbalances
in check) and yet we end up seeing these exceptions.

> I think that adding some flow control to the application is a better
> solution than a semantically redundant barrier.

It seems to me there is no difference.  Flow control, at this level, is
just semantically redundant synchronization.  A barrier is just a
special case of that.




Re: [OMPI users] MPI_Reduce performance

2010-09-09 Thread Eugene Loh




Alex A. Granovsky wrote:

> Isn't it evident from the theory of random processes and probability
> theory that, in the limit of an infinitely large cluster and parallel
> process, the probability of deadlocks with the current implementation
> is unfortunately quite a finite quantity and in the limit approaches
> unity, regardless of any particular details of the program?

No, not at all.  Consider simulating a physical volume.  Each process
is assigned to some small subvolume.  It updates conditions locally,
but on the surface of its simulation subvolume it needs information
from "nearby" processes.  It cannot proceed along the surface until it
has that neighboring information.  Its neighbors, in turn, cannot
proceed until their neighbors have reached some point.  Two distant
processes can be quite out of step with one another, but only by some
bounded amount.  At some point, a leading process has to wait for
information from a laggard to propagate to it.  All processes proceed
together, in some loose lock-step fashion.  Many applications behave in
this fashion.  Actually, in many applications, the synchronization is
tightened in that "physics" is made to propagate faster than
neighbor-by-neighbor.

As the number of processes increases, the laggard might seem relatively
slower in comparison, but that isn't deadlock.

As the size of the cluster increases, the chances of a system component
failure increase, but that also is a different matter.




Re: [OMPI users] MPI_Reduce performance

2010-09-09 Thread Eugene Loh

Gus Correa wrote:


More often than not some components lag behind (regardless of how
much you tune the number of processors assigned to each component),
slowing down the whole scheme.
The coupler must sit and wait for that late component,
the other components must sit and wait for the coupler,
and the (vicious) "positive feedback" cycle that
Ashley mentioned goes on and on.


I think "sit and wait" is the "typical" scenario that Dick mentions.  
Someone lags, so someone else has to wait.


In contrast, the "feedback" cycle Ashley mentions is where someone lags 
and someone else keeps racing ahead, pumping even more data at the 
laggard, forcing the laggard ever further behind.


Re: [OMPI users] is there a way to bring to light _all_ configure options in a ready installation?

2010-08-24 Thread Eugene Loh




Terry Dontje wrote:

  
Jeff Squyres wrote:
  
You should be able to run "./configure --help" and see a lengthy help message that includes all the command line options to configure.

Is that what you're looking for? 
  
No, he wants to know what configure options were used with some
binaries.

Apparently even what configure options could have been used even if
they weren't actually used.

  
On Aug 24, 2010, at 7:40 AM, Paul Kapinos wrote:


  Hello OpenMPI developers,

I am searching for a way to discover _all_ configure options of an OpenMPI installation.

Background: in a existing installation, the ompi_info program helps to find out a lot of informations about the installation. So, "ompi_info -c" shows *some* configuration options like CFLAGS, FFLAGS et cetera. Compilation directories often does not survive for long time (or are not shipped at all, e.g. with SunMPI)

But what about --enable-mpi-threads or --enable-contrib-no-build=vt for example (and all other possible) flags of "configure", how can I see would these flags set or would not?

In other words: is it possible to get _all_ flags of configure from an "ready" installation in without having the compilation dirs (with configure logs) any more?

Many thanks

Paul
  

  





Re: [OMPI users] problem with .bashrc stetting of openmpi

2010-08-16 Thread Eugene Loh




sun...@chem.iitb.ac.in wrote:

  
sun...@chem.iitb.ac.in wrote:


  Dear Open-mpi users,

I installed openmpi-1.4.1 in my user area and then set the path for
openmpi in the .bashrc file as follow. However, am still getting
following
error message whenever am starting the parallel molecular dynamics
simulation using GROMACS. So every time am starting the MD job, I need
to
source the .bashrc file again.
  

Have you set OPAL_PREFIX to /home/sunitap/soft/openmpi?

  
  How to set OPAL_PREFIX?
During the installation of openmpi, I ran configure with
--prefix=/home/sunitap/soft/openmpi
Did you mean this?
  

No.  The "OPAL_PREFIX" steps occurs after you configure, build, and
install OMPI.  At the time that you run MPI programs, set the
"OPAL_PREFIX" environment variable to /home/sunitap/soft/openmpi.  The
syntax depends on your shell.  E.g., for csh:

setenv OPAL_PREFIX /home/sunitap/soft/openmpi

The sequence might be something like this:

./configure --prefix=/home/sunitap/soft/openmpi
make
make install
cd /home/sunitap/soft/openmpi/examples
mpicc connectivity_c.c
setenv OPAL_PREFIX /home/sunitap/soft/openmpi
mpirun -n 2 ./connectivity_c

though I didn't check all those commands out.




Re: [OMPI users] Hyper-thread architecture effect on MPI jobs

2010-08-11 Thread Eugene Loh




The way MPI processes are being assigned to hardware threads is perhaps
neither controlled nor optimal.  On the HT nodes, two processes may end
up sharing the same core, with poorer performance.

Try submitting your job like this

% cat myrankfile1
rank  0=os223 slot=0
rank  1=os221 slot=0
rank  2=os222 slot=0
rank  3=os224 slot=0
rank  4=os228 slot=0
rank  5=os229 slot=0
rank  6=os223 slot=1
rank  7=os221 slot=1
rank  8=os222 slot=1
rank  9=os224 slot=1
rank 10=os228 slot=1
rank 11=os229 slot=1
rank 12=os223 slot=2
rank 13=os221 slot=2
rank 14=os222 slot=2
rank 15=os224 slot=2
rank 16=os228 slot=2
rank 17=os229 slot=2
% mpirun -host os221,os222,os223,os224,os228,os229 -np 18 --rankfile
myrankfile1 ./a.out

You can also try

% cat myrankfile2
rank  0=os223 slot=0
rank  1=os221 slot=0
rank  2=os222 slot=0
rank  3=os224 slot=0
rank  4=os228 slot=0
rank  5=os229 slot=0
rank  6=os223 slot=1
rank  7=os221 slot=1
rank  8=os222 slot=1
rank  9=os224 slot=1
rank 10=os228 slot=2
rank 11=os229 slot=2
rank 12=os223 slot=2
rank 13=os221 slot=2
rank 14=os222 slot=2
rank 15=os224 slot=2
rank 16=os228 slot=4
rank 17=os229 slot=4
% mpirun -host os221,os222,os223,os224,os228,os229 -np 18 --rankfile
myrankfile2 ./a.out

which one reproduces your problem and which one avoids it depends on
how the BIOS numbers your HTs.  Once you can confirm you understand the
problem, you (with the help of this list) can devise a solution
approach for your situation.


Saygin Arkan wrote:
Hello,
  
I'm running mpi jobs in non-homogeneous cluster. 4 of my machines have
the following properties, os221, os222, os223, os224:
  
  vendor_id   : GenuineIntel
cpu family  : 6
model   : 23
model name  : Intel(R) Core(TM)2 Quad  CPU   Q9300  @ 2.50GHz
stepping    : 7
cache size  : 3072 KB
physical id : 0
siblings    : 4
core id : 3
cpu cores   : 4
fpu : yes
fpu_exception   : yes
cpuid level : 10
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni monitor
ds_cpl vmx smx est tm2 ssse3 cx16 xtpr sse4_1 lahf_lm
bogomips    : 4999.40
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
  
and the problematic, hyper-threaded 2 machines are as follows, os228
and os229:
  
  vendor_id   : GenuineIntel
cpu family  : 6
model   : 26
model name  : Intel(R) Core(TM) i7 CPU 920  @ 2.67GHz
stepping    : 5
cache size  : 8192 KB
physical id : 0
siblings    : 8
core id : 3
cpu cores   : 4
fpu : yes
fpu_exception   : yes
cpuid level : 11
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good pni
monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
ida
bogomips    : 5396.88
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
  
  
The problem is: those 2 machines seem to be having 8 cores (virtually,
actualy core number is 4).
When I submit an MPI job, I calculated the comparison times in the
cluster. I got strange results.
  
I'm running the job on 6 nodes, 3 core per node. And sometimes ( I can
say 1/3 of the tests) os228 or os229 returns strange results. 2 cores
are slow (slower than the first 4 nodes) but the 3rd core is extremely
fast.
  
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - RANK(0) Printing
Times...
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os221 RANK(1)   
:38 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os222 RANK(2)   
:38 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os224 RANK(3)   
:38 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os228 RANK(4)   
:37 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os229 RANK(5)   
:34 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os223 RANK(6)   
:38 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os221 RANK(7)   
:39 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os222 RANK(8)   
:37 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os224 RANK(9)   
:38 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os228
RANK(10)    :48 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os229
RANK(11)    :35 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os223
RANK(12)    :38 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os221
RANK(13)    :37 sec
2010-08-05 14:30:58,926 50673 DEBUG [0x7fcadf98c740] - os222
RANK(14)    :37 sec
2010-08-05 14:30:58,926 50673 DEBUG [0x7fcadf98c740] - os224
RANK(15)    :38 sec
2010-08-05 14:30:58,926 50673 DEBUG [0x7fcadf98c740] - os228
RANK(16)    :43 sec
2010-08-

Re: [OMPI users] MPI_Bcast issue

2010-08-09 Thread Eugene Loh




Personally, I've been having trouble following the explanations of the
problem.  Perhaps it'd be helpful if you gave us an example of how to
reproduce the problem.  E.g., short sample code and how you run the
example to produce the problem.  The shorter the example, the greater
the odds of resolution.
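
For instance, something as small as this -- just a made-up sketch of the
sort of thing that is easy for others to compile and run -- along with
the exact mpirun command line you use:

% cat bcast_test.c
/* Minimal broadcast test:  rank 0 broadcasts one int, everyone prints it. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) value = 42;
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d got %d\n", rank, value);
    MPI_Finalize();
    return 0;
}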

  

  
From: Randolph Pullen
To: us...@open-mpi.org
Date: 08/07/2010 01:23 AM
Subject: [OMPI users] MPI_Bcast issue
Sent by: users-boun...@open-mpi.org

I seem to be having a problem with MPI_Bcast.
My massive I/O intensive data movement program must broadcast from n to
n nodes. My problem starts because I require 2 processes per node, a
sender
and a receiver and I have implemented these using MPI processes rather
than tackle the complexities of threads on MPI.

Consequently, broadcast and calls like alltoall are not completely
helpful.
 The dataset is huge and each node must end up with a complete copy
built by the large number of contributing broadcasts from the sending
nodes.
 Network efficiency and run time are paramount.

As I don’t want to needlessly broadcast all this data to the sending
nodes
and I have a perfectly good MPI program that distributes globally from
a single node (1 to N), I took the unusual decision to start N copies
of
this program by spawning the MPI system from the PVM system in an
effort
to get my N to N concurrent transfers.

It seems that the broadcasts running on concurrent MPI environments
collide
and cause all but the first process to hang waiting for their
broadcasts.
 This theory seems to be confirmed by introducing a sleep of n-1
seconds
before the first MPI_Bcast  call on each node, which results in the
code working perfectly.  (total run time 55 seconds, 3 nodes, standard
TCP stack)

My guess is that unlike PVM, OpenMPI implements broadcasts with
broadcasts
rather than multicasts.  Can someone confirm this?  Is this a
bug?

Is there any multicast or N to N broadcast where sender processes can
avoid
participating when they don’t need to?

  

  
  





Re: [OMPI users] OpenMPI providing rank?

2010-08-04 Thread Eugene Loh




Eugene Loh wrote:

  
  
Yves Caniou wrote:
  
On Wednesday 28 July 2010 15:05:28, you wrote:


  I am confused. I thought all you wanted to do is report out the binding of
the process - yes? Are you trying to set the affinity bindings yourself?

If the latter, then your script doesn't do anything that mpirun wouldn't
do, and doesn't do it as well. You would be far better off just adding
--bind-to-core to the mpirun cmd line.
  

"mpirun -h" says that it is the default, so there is not even something to do?
I don't even have to add "--mca mpi_paffinity_alone 1" ?

  
Wow.  I just tried "mpirun -h" and, yes, it claims that
"--bind-to-core" is the default.  I believe this is wrong... or at
least "misleading."  :^)
To close the loop on this, Ralph just fixed this error in r23537.




Re: [OMPI users] Hybrid OpenMPI / OpenMP run pins OpenMP threads to a single core

2010-08-04 Thread Eugene Loh

David Akin wrote:


All,
I'm trying to get the OpenMP portion of the code below to run
multicore on a couple of 8 core nodes.
 

I was gone last week and am trying to catch up on e-mail.  This thread 
was a little intriguing.


I agree with Ralph and Terry:

*) OMPI should not be binding by default.
*) There is nothing in your program that would induce binding nor 
anything in your reported output that indicates binding is occurring.


So, any possibility that your use of taskset or top is misleading?  Did 
you ever try running with --report-bindings as Terry suggested?


The thread also discussed OMPI's inability to control the binding 
behavior of individual threads.  You can't manage individual threads 
with OMPI;  you'd have to use a thread-specific mechanism, and many OMP 
implementations support such mechanisms.  The best you could do with 
OMPI would be to unbind or bind broadly (e.g., to an entire socket), and 
that policy would be applied to all the threads within the process.


But, all that should be unnecessary... there shouldn't be any binding by 
default in the first place.  I'd check into whether these threads really 
are being bound and, if so, why.
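
One way to check (a Linux-specific sketch using sched_getaffinity;
compile with something like "mpicc -fopenmp") is to have each OpenMP
thread report which CPUs it is allowed to run on:

% cat showaffinity.c
/* Each OpenMP thread in each MPI process prints its CPU affinity mask. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <omp.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        cpu_set_t mask;
        char buf[1024] = "";
        int cpu, n = 0;

        CPU_ZERO(&mask);
        sched_getaffinity(0, sizeof(mask), &mask);   /* 0 = calling thread */
        for (cpu = 0; cpu < CPU_SETSIZE && n < 1000; cpu++)
            if (CPU_ISSET(cpu, &mask))
                n += sprintf(buf + n, " %d", cpu);
        printf("rank %d thread %d may run on:%s\n",
               rank, omp_get_thread_num(), buf);
    }

    MPI_Finalize();
    return 0;
}

If every thread of a process reports only a single CPU, the threads
really are being bound, and the question becomes who is doing the binding.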


Re: [OMPI users] execuation time is not stable with 2 processes

2010-08-04 Thread Eugene Loh

Tad Lake wrote:


I run it :
 mpirun -np 2 --host node2 ./a.out

But the result of time is not stable with difference of 100 times. For example, 
the max value of time can be 3000, meanwhile the min is 100.

Again, know what results to expect.  Is 3000 a reasonable time and 100 
too fast?  Or, is 100 a reasonable time and 3000 too slow?  Or other?  
The expected result should depend on the number of messages (2000), the 
size of a message (40Kbyte), and the interconnect (apparently shared 
memory even though you have openib).  2000 messages should take at least 
thousands of microseconds and when the messages are 40 Kbyte each even 
longer.


Try running the timing loop multiple times per program invocation and 
using MPI_Wtime() for your timer.


Re: [OMPI users] execuation time is not stable with 2 processes

2010-08-04 Thread Eugene Loh

Mark Potts wrote:


Hi,
   I'd opt for the fact that tv0 is given value only on rank 0 and tv1 is
only given value on rank 1.  Kind of hard to get a diff betwn the two
on either rank with that setup.  You need to determine the tv0 and tv1
on both ranks.


I don't understand this.  It appears to me that tv1-tv0 is computed and 
reported on each process.  This seems okay.



In addition, there are a number of other errors in the code (such as
MPI_Finalize() as an errant function outside of main), etc.


Yes, there seem to be many small errors, such as tag undeclared, extra 
bracket so that MPI_Finalize is outside main, MPI_Send on rank 1 sending 
to itself, etc.


Regarding performance methodology, you should consider adding another 
loop so that your program reports multiple timings instead of just one.  
Also, use MPI_Wtime for your timer since it's more likely to pick up a 
portable, high-resolution timer.
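
For what it's worth, a cleaned-up sketch of the test (tag declared,
rank 1 replying to rank 0, MPI_Finalize inside main, MPI_Wtime, and
several trials reported per run) might look like:

% cat pingpong_time.c
#include <stdio.h>
#include <mpi.h>

#define N      10240
#define NITER  1000
#define NTRIAL 5

int main(int argc, char *argv[])
{
    int buf[N] = {0};
    int rank, i, trial, tag = 99;
    double t;
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (trial = 0; trial < NTRIAL; trial++) {
        t = MPI_Wtime();
        for (i = 0; i < NITER; i++) {
            if (rank == 0) {
                MPI_Send(buf, N, MPI_INT, 1, tag, MPI_COMM_WORLD);
                MPI_Recv(buf, N, MPI_INT, 1, tag, MPI_COMM_WORLD, &stat);
            } else if (rank == 1) {
                MPI_Recv(buf, N, MPI_INT, 0, tag, MPI_COMM_WORLD, &stat);
                MPI_Send(buf, N, MPI_INT, 0, tag, MPI_COMM_WORLD);
            }
        }
        t = MPI_Wtime() - t;
        if (rank == 0)
            printf("trial %d: %g usec for %d round trips\n",
                   trial, 1.0e6 * t, NITER);
    }

    MPI_Finalize();
    return 0;
}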



Ralph Castain wrote:

Did you bind the processes? If not you may be seeing the impact of 
having processes bouncing between cpus, and/or processes not being 
local to their memory. Try adding -bind-to-core or -bind-to-socket to 
your cmd line and see if things smooth out. I'm assuming, of course, 
that you are running on a system that supports binding...


And know what result you are expecting.  You are reporting the total 
number of microseconds for 2000 round trips.  If we divide 3000 and 100 
by 2000, that's 1.5 usec and 0.05 usec for latency.  The first is 
reasonable for shared memory.  The second sounds much too short.  
Perhaps your timer has too high granularity?


The time can also be impacted by other things running on your cpu - 
could be context switching.


It seems to me that your results are not too slow but too fast.  Again, 
high-granularity timings may be at fault.  Might need to time a larger 
number of iterations within the timer and report multiple measurements.


Final point: since both processes are running on the same node, IB 
will have no involvement - the messages are going to flow over shared 
memory.



+1




On Aug 4, 2010, at 6:51 AM, Tad Lake wrote:


Hi,
 I have a little program for execution time.
=
#include "mpi.h"
#include 
#include 
#include 
#include 
int main (int argc, char *argv[]) {
MPI_Status Stat;
struct timeval tv0, tv1;

long int totaltime = 0;
int i, j;
int buf[10240];
 int numtasks, rank;

MPI_Init (&argc, &argv);
MPI_Comm_size (MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank (MPI_COMM_WORLD, &rank);


if (rank == 0) {
  gettimeofday(&tv0, NULL);
  for(i=0;i<1000;i++){
  MPI_Send (buf, 10240, MPI_INT, 1, tag, MPI_COMM_WORLD);
  MPI_Recv (buf, 10240, MPI_INT, 1, tag,MPI_COMM_WORLD, &Stat);
  }
  gettimeofday (&tv1, NULL);
}else{
 gettimeofday(&tv0, NULL);
 for(i=0;i<1000;i++){
   MPI_Recv(buf, 10240,MPI_INT, 0, tag, MPI_COMM_WORLD, &Stat);
   MPI_Send(buf, 10240, MPI_INT, 1, tag, MPI_COMM_WORLD);
 }
 gettimeofday(&tv1, NULL);
}

totaltime = (tv1.tv_sec - tv0.tv_sec) *  100 + (tv1.tv_usec - 
tv0.tv_usec);

fprintf (stdout, "rank %d with total time is %d",rank, totaltime);
}

MPI_Finalize ();

return 0;
} ===

I run it :
 mpirun -np 2 --host node2 ./a.out

But the result of time is not stable with difference of 100 times. 
For example, the max value of time can be 3000, meanwhile the min is 
100.


Is there anything wrong ?
I am using 1.4.2 and openib.




Re: [OMPI users] MPIRUN Error on Mac pro i7 laptop and linux desktop

2010-08-03 Thread Eugene Loh




christophe petit wrote:
Thanks for your answers, 
  
the execution of this parallel program works fine at my work, but we
used MPICH2. I thought this will run with OPEN-MPI too.


In your input deck, how big are x_domains and y_domains -- that is,
iconf(3) and iconf(4)?  Do they have to be changed if you change the
number of processes you run on?  Off hand, it looks like
x_domains*y_domains = iconf(3)*iconf(4) should equal nproc.  If you can
run with nproc=1 and don't change the input deck, you won't be able to
run on nproc/=1.

Given that the problem is in MPI_Cart_shift, could you produce a much
smaller program that illustrates the error you're trying to understand?

Here is the f90 source where MPI_CART_SHIFT is called :
  
  program heat
!**
!
!   This program solves the heat equation on the unit square [0,1]x[0,1]
!    | du/dt - Delta(u) = 0
!    |  u/gamma = cste
!   by implementing a explicit scheme.
!   The discretization is done using a 5 point finite difference scheme
!   and the domain is decomposed into sub-domains. 
!   The PDE is discretized using a 5 point finite difference scheme
!   over a (x_dim+2)*(x_dim+2) grid including the end points
!   correspond to the boundary points that are stored. 
!
!   The data on the whole domain are stored in
!   the following way :
!
!    y
!   
!    d  |  |
!    i  |  |
!    r  |  |
!    e  |  |
!    c  |  |
!    t  |  |
!    i  | x20  |
!    o /\   |  |
!    n  |   | x10  |
!   |   |  |
!   |   | x00  x01 x02 ... |
!   |   
!    ---> x direction  x(*,j)
!
!   The boundary conditions are stored in the following submatrices
!
!
!    x(1:x_dim, 0)  ---> left   temperature
!    x(1:x_dim, x_dim+1)    ---> right  temperature
!    x(0, 1:x_dim)  ---> top    temperature
!    x(x_dim+1, 1:x_dim)    ---> bottom temperature
!
!**
  implicit none
  include 'mpif.h'
! size of the discretization
  integer :: x_dim, nb_iter
  double precision, allocatable :: x(:,:),b(:,:),x0(:,:)
  double precision  :: dt, h, epsilon
  double precision  :: resLoc, result, t, tstart, tend
! 
  integer :: i,j
  integer :: step, maxStep
  integer :: size_x, size_y, me, x_domains,y_domains
  integer :: iconf(5), size_x_glo
  double precision conf(2)
!  
! MPI variables
  integer :: nproc, infompi, comm, comm2d, lda, ndims
  INTEGER, DIMENSION(2)  :: dims
  LOGICAL, DIMENSION(2)  :: periods
  LOGICAL, PARAMETER :: reorganisation = .false.
  integer :: row_type
  integer, parameter :: nbvi=4
  integer, parameter :: S=1, E=2, N=3, W=4
  integer, dimension(4) :: neighBor
  
!
  intrinsic abs
!
!
  call MPI_INIT(infompi)
  comm = MPI_COMM_WORLD
  call MPI_COMM_SIZE(comm,nproc,infompi)
  call MPI_COMM_RANK(comm,me,infompi)
!
!
  if (me.eq.0) then
  call readparam(iconf, conf)
  endif
  call MPI_BCAST(iconf,5,MPI_INTEGER,0,comm,infompi)
  call MPI_BCAST(conf,2,MPI_DOUBLE_PRECISION,0,comm,infompi)
!
  size_x    = iconf(1)
  size_y    = iconf(1)
  x_domains = iconf(3)
  y_domains = iconf(4)
  maxStep   = iconf(5)
  dt    = conf(1)
  epsilon   = conf(2)
!
  size_x_glo = x_domains*size_x+2
  h  = 1.0d0/dble(size_x_glo)
  dt = 0.25*h*h
!
!
  lda = size_y+2
  allocate(x(0:size_y+1,0:size_x+1))
  allocate(x0(0:size_y+1,0:size_x+1))
  allocate(b(0:size_y+1,0:size_x+1))
!
! Create 2D cartesian grid
  periods(:) = .false.
  
  ndims = 2
  dims(1)=x_domains
  dims(2)=y_domains
  CALL MPI_CART_CREATE(MPI_COMM_WORLD, ndims, dims, periods, &
    reorganisation,comm2d,infompi)
!
! Identify neighbors
!
  NeighBor(:) = MPI_PROC_NULL
! Left/West and right/Est neigbors
  CALL MPI_CART_SHIFT(comm2d,0,1,NeighBor(W),NeighBor(E),infompi)
! Bottom/South and Upper/North neigbors
  CALL MPI_CART_SHIFT(comm2d,1,1,NeighBor(S),NeighBor(N),infompi)
!
! Create row data type to coimmunicate with South and North neighbors
!
  CALL MPI_TYPE_VECTOR(size_x, 1, size_y+2, MPI_DOUBLE_PRECISION,
row_type,infompi)
  CALL MPI_TYPE_COMMIT(row_type, infompi)
!
! initialization 
!
  call initvalues(x0, b, size_x+1, size_x )
!
! Update the boundaries
!
  call updateBound(x0,size_x,size_x, NeighBor, comm2d, row_type)
    
  step = 0
  t    = 0.0
!
  tstar

Re: [OMPI users] Fortran MPI Struct with Allocatable Array

2010-08-02 Thread Eugene Loh




I can't give you a complete answer, but I think this is less an MPI
question and more of a Fortran question.  The question is if you have a
Fortran derived type, one of whose components is a POINTER, what does
the data structure look like in linear memory?  I could imagine the
answer is implementation dependent.  Anyhow, here is a sample, non-MPI,
Fortran program that illustrates the question:

% cat b.f90
  type :: small
  integer, pointer :: array(:)
  end type small
  type(small) :: lala

  integer, pointer :: array(:)

  n = 20

  allocate( lala%array(n) )
  allocate(  array(n) )

  lala%array = (/ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 20 /)
   array = (/ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 20 /)

  call sub(lala)
  call sub(lala%array)
  call sub( array)
end

subroutine sub(x)
  integer x(20)
  write(6,*) x
end
% f90 b.f90
% a.out
 599376 20 4 599372 1 20 -4197508 1 2561 0 33 0 0 0 0 0 0 0 0 0
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
% 

So, your model of 20 consecutive words does not work if you pass the
derived type.  It does work if you pass the POINTER component.  This is
with Oracle (Sun) Studio Fortran.  Again, I can imagine the behavior
depends on the Fortran compiler.

I suspect what's going on is that a POINTER is a complicated data
structure that has all sorts of metadata in it, but if you pass a
POINTER the compiler knows to pass the thing you're pointing to rather
than the metadata itself.

Jeremy Roberts wrote:
I'm trying to parallelize a Fortran
code with rather complicated derived types full of pointer arrays. 
When I build the MPI type for sending, all the static components are
sent, but the pointer arrays are not (and retain initial values).  I
imagine this has to do with memory addresses when creating the MPI
struct, but I have no idea how to fix it.
  
I've included a simple code illustrating my issue below.  Any
suggestions?
  
program mpi_struct_example
  use mpi
  implicit none
  ! declarations
  type :: small
  real, pointer :: array(:)
  end type small
  type(small) :: lala
  integer :: stat, counts(1), types(1), ierr, iam, n=0, MPI_SMALL
  integer (kind=MPI_ADDRESS_KIND) :: displs(1)
  ! initialize MPI and get my rank
  call MPI_INIT( ierr )
  call MPI_COMM_RANK( MPI_COMM_WORLD, iam, ierr )
  n = 20
  allocate( lala%array(n) )
  lala%array = 2.0
  ! build block counts, displacements, and oldtypes
  counts = (/n/)
  displs = (/0/)
  types  = (/MPI_REAL/)
  ! make and commit new type
  call MPI_TYPE_CREATE_STRUCT( 1, counts, displs, types, MPI_SMALL,
ierr )
  call MPI_TYPE_COMMIT( MPI_SMALL, ierr )
  if (iam .eq. 0) then
    ! reset the value of the array
    lala%array  = 1.0 
    call MPI_SEND( lala, 1, MPI_SMALL, 1, 1, MPI_COMM_WORLD,
ierr)   ! this doesn't work
    !call MPI_SEND( lala%array, n, MPI_REAL, 1, 1,
MPI_COMM_WORLD, ierr) ! this does work
    write (*,*) "iam ",iam," and lala%array(1)  = ",
lala%array(1)
  else
    call MPI_RECV( lala, 1, MPI_SMALL, 0, 1, MPI_COMM_WORLD,
stat, ierr )   ! this doesn't work
    !call MPI_RECV( lala%array, n, MPI_REAL, 0, 1,
MPI_COMM_WORLD, stat, ierr ) ! this does work
    write (*,*) "iam ",iam," and lala%array(1)  = ",
lala%array(1), " ( should be 1.0)"
  end if
  call MPI_FINALIZE(ierr)
end program mpi_struct_example



