[OMPI devel] OMPI 1.4.3 hangs in gather
Hi,

All machines in the setup are iDataPlex, with Nehalem processors, 12 cores per node and 24 GB of memory.

*Problem 1 - OMPI 1.4.3 hangs in gather:*

I'm trying to run the IMB gather operation with OMPI 1.4.3 (vanilla). The hang happens when np >= 64 and the message size exceeds 4k:

mpirun -np 64 -machinefile voltairenodes -mca btl sm,self,openib imb/src-1.4.2/IMB-MPI1 gather -npmin 64

voltairenodes consists of 64 machines.

#----------------------------------------------------------------
# Benchmarking Gather
# #processes = 64
#----------------------------------------------------------------
   #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
        0         1000         0.02         0.02         0.02
        1          331        14.02        14.16        14.09
        2          331        12.87        13.08        12.93
        4          331        14.29        14.43        14.34
        8          331        16.03        16.20        16.11
       16          331        17.54        17.74        17.64
       32          331        20.49        20.62        20.53
       64          331        23.57        23.84        23.70
      128          331        28.02        28.35        28.18
      256          331        34.78        34.88        34.80
      512          331        46.34        46.91        46.60
     1024          331        63.96        64.71        64.33
     2048          331       460.67       465.74       463.18
     4096          331       637.33       643.99       640.75

This is the padb output (padb -A -x -Ormgr=mpirun -tree):

Warning, remote process state differs across ranks
state : ranks
R (running)  : [1,3-6,8,10-13,16-20,23-28,30-32,34-42,44-45,47-49,51-53,56-59,61-63]
S (sleeping) : [0,2,7,9,14-15,21-22,29,33,43,46,50,54-55,60]
Stack trace(s) for thread: 1
- [0-63] (64 processes)
  - main() at ?:?
    IMB_init_buffers_iter() at ?:?
    IMB_gather() at ?:?
    PMPI_Gather() at pgather.c:175
    mca_coll_sync_gather() at coll_sync_gather.c:46
    ompi_coll_tuned_gather_intra_dec_fixed() at coll_tuned_decision_fixed.c:714
    - [0,3-63] (62 processes)
      - ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:248
        mca_pml_ob1_recv() at pml_ob1_irecv.c:104
        ompi_request_wait_completion() at ../../../../ompi/request/request.h:375
        opal_condition_wait() at ../../../../opal/threads/condition.h:99
    - [1] (1 processes)
      - ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:302
        mca_pml_ob1_send() at pml_ob1_isend.c:125
        ompi_request_wait_completion() at ../../../../ompi/request/request.h:375
        opal_condition_wait() at ../../../../opal/threads/condition.h:99
    - [2] (1 processes)
      - ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:315
        ompi_request_default_wait() at request/req_wait.c:37
        ompi_request_wait_completion() at ../ompi/request/request.h:375
        opal_condition_wait() at ../opal/threads/condition.h:99
Stack trace(s) for thread: 2
- [0-63] (64 processes)
  - start_thread() at ?:?
    btl_openib_async_thread() at btl_openib_async.c:344
    poll() at ?:?
Stack trace(s) for thread: 3
- [0-63] (64 processes)
  - start_thread() at ?:?
    service_thread_start() at btl_openib_fd.c:427
    select() at ?:?

When I run padb again after a couple of minutes, the processes as a whole are stuck at the same set of call sites, but individual ranks have moved between those sites.
For example, this is the diff between two padb outputs:

 Warning, remote process state differs across ranks
 state : ranks
-R (running)  : [0,2-4,6-13,16-18,20-21,28-31,33-36,38-56,58,60,62-63]
-S (sleeping) : [1,5,14-15,19,22-27,32,37,57,59,61]
+R (running)  : [2,5-14,16-23,25,28-40,42-48,50-51,53-58,61,63]
+S (sleeping) : [0-1,3-4,15,24,26-27,41,49,52,59-60,62]
 Stack trace(s) for thread: 1
 - [0-63] (64 processes)
@@ -13,21 +13,21 @@
     mca_coll_sync_gather() at coll_sync_gather.c:46
     ompi_coll_tuned_gather_intra_dec_fixed() at coll_tuned_decision_fixed.c:714
-    - [0,3-63] (62 processes)
+    - [0-5,8-63] (62 processes)
       - ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:248
         mca_pml_ob1_recv() at pml_ob1_irecv.c:104
         ompi_request_wait_completion() at ../../../../ompi/request/request.h:375
         opal_condition_wait() at ../../../../
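A stripped-down reproducer can help rule IMB itself out here. The sketch below is not from the original report; it simply loops MPI_Gather at the 4096-byte point where the hang first shows up, reusing the 331-repetition count from the IMB table above:

    /* gather_hang.c -- minimal reproducer sketch, assuming the hang is
     * specific to MPI_Gather with np >= 64 and messages >= 4 KiB.
     * Build:  mpicc -O2 -o gather_hang gather_hang.c
     * Run:    mpirun -np 64 -machinefile voltairenodes \
     *                -mca btl sm,self,openib ./gather_hang
     */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int msg_bytes = 4096;   /* first message size that hangs above */
        const int reps      = 331;    /* repetition count IMB used at 4096   */
        int rank, size, i;
        char *sendbuf, *recvbuf = NULL;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        sendbuf = calloc((size_t)msg_bytes, 1);
        if (rank == 0)   /* only the root needs the full receive buffer */
            recvbuf = calloc((size_t)msg_bytes * (size_t)size, 1);

        for (i = 0; i < reps; i++) {
            MPI_Gather(sendbuf, msg_bytes, MPI_CHAR,
                       recvbuf, msg_bytes, MPI_CHAR, 0, MPI_COMM_WORLD);
            if (rank == 0 && i % 50 == 0)
                printf("gather iteration %d complete\n", i);
        }

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }

Since every stuck rank is inside ompi_coll_tuned_gather_intra_linear_sync(), it may also be worth forcing a different gather algorithm while debugging, e.g. adding "-mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_gather_algorithm 1" to the mpirun line, assuming your build exposes those tuned-collective parameters.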
[OMPI devel] Removing paffinity trunk components
I updated the hwloc paffinity component to hwloc v1.1 last night.

Given that hwloc seems to be working well, I'd like to remove the following paffinity components from the trunk (and eventually, v1.5) tomorrow COB (5pm US Eastern, Wed, Jan 12 2011):

- solaris
- darwin
- posix
- windows

So all we'll be left with is hwloc and test.

Any problems with that?

I didn't make this an official RFC with a timeout because we all agreed to this general plan of removing non-hwloc/test paffinity components a long time ago.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] Removing paffinity trunk components
Disregard this; we removed these components a long time ago. I was looking in an old svn tree that still had sentinel solaris/darwin/posix/windows directories (because there were files like Makefile.in in them). Sorry for the noise...

On Jan 11, 2011, at 8:35 AM, Jeff Squyres wrote:

> I updated the hwloc paffinity component to hwloc v1.1 last night.
>
> Given that hwloc seems to be working well, I'd like to remove the following paffinity components from the trunk (and eventually, v1.5) tomorrow COB (5pm US Eastern, Wed, Jan 12 2011):
>
> - solaris
> - darwin
> - posix
> - windows
>
> So all we'll be left with is hwloc and test.
>
> Any problems with that?
>
> I didn't make this an official RFC with a timeout because we all agreed to this general plan of removing non-hwloc/test paffinity components a long time ago.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r24219
Terry --

The trunk doesn't use configure.params anymore. You should probably remove this file again...

On Jan 11, 2011, at 1:31 PM, t...@osl.iu.edu wrote:

> Author: tdd
> Date: 2011-01-11 13:31:55 EST (Tue, 11 Jan 2011)
> New Revision: 24219
> URL: https://svn.open-mpi.org/trac/ompi/changeset/24219
>
> Log:
> add configure.params to solaris sysinfo module to allow it to be built
>
> Added:
>    trunk/opal/mca/sysinfo/solaris/configure.params
>
> Added: trunk/opal/mca/sysinfo/solaris/configure.params
> ==============================================================================
> --- (empty file)
> +++ trunk/opal/mca/sysinfo/solaris/configure.params    2011-01-11 13:31:55 EST (Tue, 11 Jan 2011)
> @@ -0,0 +1,18 @@
> +# -*- shell-script -*-
> +#
> +# Copyright (c) 2011 Oracle and/or affiliates.  All rights reserved.
> +#
> +# $COPYRIGHT$
> +#
> +# Additional copyrights may follow
> +#
> +# $HEADER$
> +#
> +
> +PARAM_CONFIG_FILES="Makefile"
> +
> +#
> +# Set the config priority so that, if we can build,
> +# only this component will build
> +
> +PARAM_CONFIG_PRIORITY=60

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] u_int8_t
Jeff Squyres wrote:
> Shrug.  If they're not used anywhere, I'd whack them.

Excellent. They screw things up (at least for me). Turns out, Solaris IB uses such types and has the sense to typedef them. But such typedefs conflict with opal_config.h, which #define's them (for apparently no reason).

> Do we have configure tests for them, or just #define's?

Configure tests.

> On Jan 10, 2011, at 7:51 PM, Eugene Loh wrote:
>> Why do u_int8_t u_int16_t u_int32_t u_int64_t get defined in opal_config.h?  I don't see them used anywhere in the OMPI/OPAL/ORTE code base.  Okay, one exception, in opal/util/if.c:
>>
>> #if defined(__DragonFly__)
>> #define IN_LINKLOCAL(i) (((u_int32_t)(i) & 0xffff0000) == 0xa9fe0000)
>> #endif

Ah, and even this one exception you got rid of in r22869.
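For anyone who wants to see the failure mode Eugene describes in isolation, here is a minimal sketch with stand-in definitions (it is not the actual opal_config.h or Solaris header content):

    /* typedef_clash.c -- sketch of the #define-vs-typedef conflict.
     * Stand-in definitions only; the real players are opal_config.h
     * (which #define'd the BSD-style u_int*_t names) and the Solaris
     * IB headers (which typedef them). */
    #include <stdint.h>

    /* What the generated config header effectively did: */
    #define u_int32_t uint32_t

    /* A system header that typedefs the same name now preprocesses to
     *     typedef uint32_t uint32_t;
     * i.e. a typedef redefinition.  C11 tolerates redefining a typedef
     * to the identical type, but older or stricter compilers (e.g.
     * -std=c99 -pedantic-errors) reject it outright. */
    typedef uint32_t u_int32_t;

    int main(void)
    {
        u_int32_t x = 42;   /* resolves through the #define -- when it compiles at all */
        return (int)(x - 42u);
    }

Guarding the definition behind a configure-time check for the system type would avoid the clash; since nothing in the code base uses these names anyway, whacking them entirely is simpler.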
Re: [OMPI devel] u_int8_t
On Jan 11, 2011, at 2:05 PM, Eugene Loh wrote:
>> Do we have configure tests for them, or just #define's?
> Configure tests.

Ok, cool.  I assume you'll remove the senseless configure tests, too.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] u_int8_t
Jeff Squyres wrote:
> On Jan 11, 2011, at 2:05 PM, Eugene Loh wrote:
>>> Do we have configure tests for them, or just #define's?
>> Configure tests.
> Ok, cool.  I assume you'll remove the senseless configure tests, too.

Right.