[OMPI devel] OMPI 1.4.3 hangs in gather

2011-01-11 Thread Doron Shoham
Hi

All machines in the setup are IBM iDataPlex nodes with Nehalem CPUs, 12 cores
 per node, and 24GB of memory.



Problem 1 – OMPI 1.4.3 hangs in gather:



I’m trying to run the IMB gather benchmark with OMPI 1.4.3 (vanilla).

The hang happens when np >= 64 and the message size exceeds 4k:

mpirun -np 64 -machinefile voltairenodes -mca btl sm,self,openib
imb/src-1.4.2/IMB-MPI1 gather -npmin 64



voltairenodes consists of 64 machines.
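
For reference, the communication pattern that hangs (64 ranks, MPI_Gather with
per-rank payloads around 4k and above, repeated many times) can be expressed as
a small standalone program. The sketch below is only an illustrative reproducer
based on the description above; it is not part of IMB:

/* gather_repro.c - minimal sketch of the failing pattern (illustrative only,
 * not part of IMB).  Build and run, assuming the same environment as above:
 *   mpicc gather_repro.c -o gather_repro
 *   mpirun -np 64 -machinefile voltairenodes -mca btl sm,self,openib ./gather_repro
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    const int msg   = 4096;   /* per-rank message size where the hang shows up */
    const int iters = 331;    /* repetition count IMB uses at this size */
    int rank, size, i;
    char *sendbuf, *recvbuf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendbuf = malloc(msg);
    memset(sendbuf, rank & 0xff, msg);
    if (rank == 0) {
        recvbuf = malloc((size_t)msg * size);   /* root gathers from all ranks */
    }

    for (i = 0; i < iters; i++) {
        MPI_Gather(sendbuf, msg, MPI_BYTE, recvbuf, msg, MPI_BYTE,
                   0, MPI_COMM_WORLD);
    }

    if (rank == 0) {
        printf("gather completed\n");
    }
    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}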



#

# Benchmarking Gather

# #processes = 64

#

       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.02         0.02         0.02
            1          331        14.02        14.16        14.09
            2          331        12.87        13.08        12.93
            4          331        14.29        14.43        14.34
            8          331        16.03        16.20        16.11
           16          331        17.54        17.74        17.64
           32          331        20.49        20.62        20.53
           64          331        23.57        23.84        23.70
          128          331        28.02        28.35        28.18
          256          331        34.78        34.88        34.80
          512          331        46.34        46.91        46.60
         1024          331        63.96        64.71        64.33
         2048          331       460.67       465.74       463.18
         4096          331       637.33       643.99       640.75



This is the padb output:

padb -A -x -Ormgr=mpirun -tree:






Warning, remote process state differs across ranks

state : ranks

R (running) :
[1,3-6,8,10-13,16-20,23-28,30-32,34-42,44-45,47-49,51-53,56-59,61-63]

S (sleeping) : [0,2,7,9,14-15,21-22,29,33,43,46,50,54-55,60]

Stack trace(s) for thread: 1

-

[0-63] (64 processes)

-

main() at ?:?

  IMB_init_buffers_iter() at ?:?

IMB_gather() at ?:?

  PMPI_Gather() at pgather.c:175

mca_coll_sync_gather() at coll_sync_gather.c:46

  ompi_coll_tuned_gather_intra_dec_fixed() at
coll_tuned_decision_fixed.c:714

-

[0,3-63] (62 processes)

-

ompi_coll_tuned_gather_intra_linear_sync() at
coll_tuned_gather.c:248

  mca_pml_ob1_recv() at pml_ob1_irecv.c:104

ompi_request_wait_completion() at
../../../../ompi/request/request.h:375

  opal_condition_wait() at
../../../../opal/threads/condition.h:99

-

[1] (1 processes)

-

ompi_coll_tuned_gather_intra_linear_sync() at
coll_tuned_gather.c:302

  mca_pml_ob1_send() at pml_ob1_isend.c:125

ompi_request_wait_completion() at
../../../../ompi/request/request.h:375

  opal_condition_wait() at
../../../../opal/threads/condition.h:99

-

[2] (1 processes)

-

ompi_coll_tuned_gather_intra_linear_sync() at
coll_tuned_gather.c:315

  ompi_request_default_wait() at request/req_wait.c:37

ompi_request_wait_completion() at
../ompi/request/request.h:375

  opal_condition_wait() at ../opal/threads/condition.h:99

Stack trace(s) for thread: 2

-

[0-63] (64 processes)

-

start_thread() at ?:?

  btl_openib_async_thread() at btl_openib_async.c:344

poll() at ?:?

Stack trace(s) for thread: 3

-

[0-63] (64 processes)

-

start_thread() at ?:?

  service_thread_start() at btl_openib_fd.c:427

select() at ?:?

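
For context on where the ranks are stuck: ompi_coll_tuned_gather_intra_linear_sync
is, roughly, a linear gather in which each contribution is sent in two segments
with a zero-byte synchronization message from the root in between. The sketch
below only illustrates that general scheme (the names and the exact ordering are
simplified; it is not the Open MPI source), but it shows why, at any instant,
some ranks sit in a blocking receive, some in a send, and some waiting on a
request, much like the trace above:

/* Simplified illustration of a "linear gather with sync" scheme (NOT the
 * actual Open MPI implementation; ordering and names are made up). */
#include <mpi.h>
#include <string.h>

void gather_linear_sync_sketch(char *sbuf, char *rbuf, int count,
                               int root, MPI_Comm comm)
{
    int rank, size, r;
    int seg1 = count / 2;      /* size of the first segment (illustrative) */

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank != root) {
        /* send the first segment, then block until the root's zero-byte go-ahead */
        MPI_Send(sbuf, seg1, MPI_BYTE, root, 0, comm);
        MPI_Recv(NULL, 0, MPI_BYTE, root, 1, comm, MPI_STATUS_IGNORE);
        /* send the remainder */
        MPI_Send(sbuf + seg1, count - seg1, MPI_BYTE, root, 2, comm);
    } else {
        memcpy(rbuf + (size_t)root * count, sbuf, count);
        for (r = 0; r < size; r++) {
            char *dst = rbuf + (size_t)r * count;
            if (r == root) continue;
            /* receive first segment, hand out the go-ahead, receive the rest */
            MPI_Recv(dst, seg1, MPI_BYTE, r, 0, comm, MPI_STATUS_IGNORE);
            MPI_Send(NULL, 0, MPI_BYTE, r, 1, comm);
            MPI_Recv(dst + seg1, count - seg1, MPI_BYTE, r, 2, comm,
                     MPI_STATUS_IGNORE);
        }
    }
}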





When I run padb again after a couple of minutes, the number of processes at
each position stays the same, but different processes now occupy those
positions.

For example, this is the diff between two padb outputs:



Warning, remote process state differs across ranks

state : ranks

-R (running) : [0,2-4,6-13,16-18,20-21,28-31,33-36,38-56,58,60,62-63]

-S (sleeping) : [1,5,14-15,19,22-27,32,37,57,59,61]

+R (running) : [2,5-14,16-23,25,28-40,42-48,50-51,53-58,61,63]

+S (sleeping) : [0-1,3-4,15,24,26-27,41,49,52,59-60,62]

Stack trace(s) for thread: 1

-

[0-63] (64 processes)

@@ -13,21 +13,21 @@

mca_coll_sync_gather() at coll_sync_gather.c:46

ompi_coll_tuned_gather_intra_dec_fixed() at coll_tuned_decision_fixed.c:714

-

- [0,3-63] (62 processes)

+ [0-5,8-63] (62 processes)

-

ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:248

mca_pml_ob1_recv() at pml_ob1_irecv.c:104

ompi_request_wait_completion() at ../../../../ompi/request/request.h:375

opal_condition_wait() at ../../../../

[OMPI devel] Removing paffinity trunk components

2011-01-11 Thread Jeff Squyres
I updated the hwloc paffinity component to hwloc v1.1 last night.

Given that hwloc seems to be working well, I'd like to remove the following 
paffinity components from the trunk (and eventually, v1.5) tomorrow COB (5pm US 
Eastern, Wed, Jan 12 2011).

- solaris
- darwin
- posix
- windows

So all we'll be left with is hwloc and test.

Any problems with that?

I didn't make this an official RFC with a timeout because we all agreed to this 
general plan of removing non-hwloc/test paffinity components a long time ago.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] Removing paffinity trunk components

2011-01-11 Thread Jeff Squyres
Disregard this; we removed these components a long time ago.

I was looking in an old svn tree that still had sentinel 
solaris/darwin/posix/windows directories (because they still contained files 
like Makefile.in).

Sorry for the noise...



On Jan 11, 2011, at 8:35 AM, Jeff Squyres wrote:

> I updated the hwloc paffinity component to hwloc v1.1 last night.
> 
> Given that hwloc seems to be working well, I'd like to remove the following 
> paffinity components from the trunk (and eventually, v1.5) tomorrow COB (5pm 
> US Eastern, Wed, Jan 12 2011).
> 
> - solaris
> - darwin
> - posix
> - windows
> 
> So all we'll be left with is hwloc and test.
> 
> Any problems with that?
> 
> I didn't make this an official RFC with a timeout because we all agreed to 
> this general plan of removing non-hwloc/test paffinity components a long time 
> ago.
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r24219

2011-01-11 Thread Jeff Squyres
Terry --

The trunk doesn't use configure.params anymore.  You should probably remove 
this file again...


On Jan 11, 2011, at 1:31 PM, t...@osl.iu.edu wrote:

> Author: tdd
> Date: 2011-01-11 13:31:55 EST (Tue, 11 Jan 2011)
> New Revision: 24219
> URL: https://svn.open-mpi.org/trac/ompi/changeset/24219
> 
> Log:
> add configure.params to solaris sysinfo module to allow it to be built
> Added:
>   trunk/opal/mca/sysinfo/solaris/configure.params
> 
> Added: trunk/opal/mca/sysinfo/solaris/configure.params
> ==
> --- (empty file)
> +++ trunk/opal/mca/sysinfo/solaris/configure.params   2011-01-11 13:31:55 EST 
> (Tue, 11 Jan 2011)
> @@ -0,0 +1,18 @@
> +# -*- shell-script -*-
> +#
> +# Copyright (c) 2011  Oracle and/or affiliates.  All rights reserved. 
> +#
> +# $COPYRIGHT$
> +# 
> +# Additional copyrights may follow
> +# 
> +# $HEADER$
> +#
> +
> +PARAM_CONFIG_FILES="Makefile"
> +
> +#
> +# Set the config priority so that, if we can build,
> +# only this component will build
> +
> +PARAM_CONFIG_PRIORITY=60
> ___
> svn-full mailing list
> svn-f...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/svn-full


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] u_int8_t

2011-01-11 Thread Eugene Loh

Jeff Squyres wrote:
> Shrug.  If they're not used anywhere, I'd whack them.

Excellent.  They screw things up (at least for me).  Turns out, Solaris
IB uses such types and has the sense to typedef them.  But such typedefs
conflict with opal_config.h, which #define's them (for apparently no
reason).

> Do we have configure tests for them, or just #define's?

Configure tests.

> On Jan 10, 2011, at 7:51 PM, Eugene Loh wrote:
>> Why do
>>   u_int8_t
>>   u_int16_t
>>   u_int32_t
>>   u_int64_t
>> get defined in opal_config.h?  I don't see them used anywhere in the
>> OMPI/OPAL/ORTE code base.
>>
>> Okay, one exception, in opal/util/if.c:
>>
>> #if defined(__DragonFly__)
>> #define IN_LINKLOCAL(i)(((u_int32_t)(i) & 0x) == 0xa9fe)
>> #endif

Ah, and even this one exception you got rid of in r22869.
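
To make the clash concrete: below is a small, hypothetical illustration (the
HAVE_U_INT32_T macro and the exact types are made up; this is not OMPI or
Solaris code) of why an unconditional #define of a type name collides with a
later system typedef, and of the usual configure-guarded fallback that avoids
it:

/* Hypothetical, simplified illustration of the typedef-vs-#define clash
 * discussed above (not actual OMPI or Solaris code).
 *
 * If a config header unconditionally does
 *     #define u_int32_t unsigned int
 * then a later, legitimate system typedef
 *     typedef unsigned int u_int32_t;
 * expands to
 *     typedef unsigned int unsigned int;
 * which does not compile.  Defining the name only when a configure check
 * found it missing avoids the collision: */

/* #define HAVE_U_INT32_T 1 */        /* would normally come from configure */

#ifndef HAVE_U_INT32_T
typedef unsigned int u_int32_t;       /* fallback only when the system lacks it */
#endif

int main(void)
{
    u_int32_t x = 42;                 /* usable whichever path was taken */
    return (int)(x - 42);
}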


Re: [OMPI devel] u_int8_t

2011-01-11 Thread Jeff Squyres
On Jan 11, 2011, at 2:05 PM, Eugene Loh wrote:

>> Do we have configure tests for them, or just #define's?
>> 
> Configure tests.

Ok, cool.  I assume you'll remove the senseless configure tests, too.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] u_int8_t

2011-01-11 Thread Eugene Loh

Jeff Squyres wrote:
> On Jan 11, 2011, at 2:05 PM, Eugene Loh wrote:
>>> Do we have configure tests for them, or just #define's?
>>>
>> Configure tests.
>
> Ok, cool.  I assume you'll remove the senseless configure tests, too.

Right.