Re: [OMPI devel] OMPI 1.4.3 hangs in gather

2011-01-13 Thread Nysal Jan
Try manually specifying the collective component: "-mca coll tuned".
You seem to be using the "sync" collective component -- any stale MCA param
files lying around?

--Nysal

On Tue, Jan 11, 2011 at 6:28 PM, Doron Shoham  wrote:

> Hi
>
> All machines in the setup are IDataPlex nodes with Nehalem processors,
> 12 cores per node, and 24GB of memory.
>
>
>
> · *Problem 1 – OMPI 1.4.3 hangs in gather:*
>
>
>
> I’m trying to run IMB and gather operation with OMPI 1.4.3 (Vanilla).
>
> It happens when np >= 64 and the message size exceeds 4k:
>
> mpirun -np 64 -machinefile voltairenodes -mca btl sm,self,openib
> imb/src-1.4.2/IMB-MPI1 gather -npmin 64
>
>
>
> voltairenodes consists of 64 machines.
>
>
>
> #
>
> # Benchmarking Gather
>
> # #processes = 64
>
> #
>
>     #bytes  #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>          0          1000         0.02         0.02         0.02
>          1           331        14.02        14.16        14.09
>          2           331        12.87        13.08        12.93
>          4           331        14.29        14.43        14.34
>          8           331        16.03        16.20        16.11
>         16           331        17.54        17.74        17.64
>         32           331        20.49        20.62        20.53
>         64           331        23.57        23.84        23.70
>        128           331        28.02        28.35        28.18
>        256           331        34.78        34.88        34.80
>        512           331        46.34        46.91        46.60
>       1024           331        63.96        64.71        64.33
>       2048           331       460.67       465.74       463.18
>       4096           331       637.33       643.99       640.75
>
>
>
> This is the padb output:
>
> padb -A -x -Ormgr=mpirun -tree:
>
>
>
>
>
>
> Warning, remote process state differs across ranks
>
> state : ranks
>
> R (running) :
> [1,3-6,8,10-13,16-20,23-28,30-32,34-42,44-45,47-49,51-53,56-59,61-63]
>
> S (sleeping) : [0,2,7,9,14-15,21-22,29,33,43,46,50,54-55,60]
>
> Stack trace(s) for thread: 1
>
> -
>
> [0-63] (64 processes)
>
> -
>
> main() at ?:?
>
>   IMB_init_buffers_iter() at ?:?
>
> IMB_gather() at ?:?
>
>   PMPI_Gather() at pgather.c:175
>
> mca_coll_sync_gather() at coll_sync_gather.c:46
>
>   ompi_coll_tuned_gather_intra_dec_fixed() at
> coll_tuned_decision_fixed.c:714
>
> -
>
> [0,3-63] (62 processes)
>
> -
>
> ompi_coll_tuned_gather_intra_linear_sync() at
> coll_tuned_gather.c:248
>
>   mca_pml_ob1_recv() at pml_ob1_irecv.c:104
>
> ompi_request_wait_completion() at
> ../../../../ompi/request/request.h:375
>
>   opal_condition_wait() at
> ../../../../opal/threads/condition.h:99
>
> -
>
> [1] (1 processes)
>
> -
>
> ompi_coll_tuned_gather_intra_linear_sync() at
> coll_tuned_gather.c:302
>
>   mca_pml_ob1_send() at pml_ob1_isend.c:125
>
> ompi_request_wait_completion() at
> ../../../../ompi/request/request.h:375
>
>   opal_condition_wait() at
> ../../../../opal/threads/condition.h:99
>
> -
>
> [2] (1 processes)
>
> -
>
> ompi_coll_tuned_gather_intra_linear_sync() at
> coll_tuned_gather.c:315
>
>   ompi_request_default_wait() at request/req_wait.c:37
>
> ompi_request_wait_completion() at
> ../ompi/request/request.h:375
>
>   opal_condition_wait() at ../opal/threads/condition.h:99
>
> Stack trace(s) for thread: 2
>
> -
>
> [0-63] (64 processes)
>
> -
>
> start_thread() at ?:?
>
>   btl_openib_async_thread() at btl_openib_async.c:344
>
> poll() at ?:?
>
> Stack trace(s) for thread: 3
>
> -
>
> [0-63] (64 processes)
>
> -
>
> start_thread() at ?:?
>
>   service_thread_start() at btl_openib_fd.c:427
>
> select() at ?:?
>
> -bash-3.2$
>
>
>
>
>
> When running padb again after a couple of minutes, I can see that the
> number of processes in each state stays about the same, but the individual
> ranks that are running or sleeping change.
>
> For example, this is the diff between two padb outputs:
>
>
>
> Warning, remote process state differs across ranks
>
> state : ranks
>
> -R (running) : [0,2-4,6-13,16-18,20-21,28-31,33-36,38-56,58,60,62-63]
>
> -S (sleeping) : [1,5,14-15,19,22-27,32,37,57,59,61]
>
> +R (running) : [2,5-14,16-23,25,28-40,42-48,50-51,53-58,61,63]
>
> +S (sleeping) : [0-1,3-4,15,24,26-27,41,49,52,59-60,62]
>

Re: [OMPI devel] Back-porting components from SVN trunk to v1.5 branch

2011-01-13 Thread Jeff Squyres
For the moment, that's true.  

Abhishek's working on bringing over SOS and the notifier...


On Jan 12, 2011, at 5:57 PM, Ralph Castain wrote:

> You also have to remove all references to OPAL SOS...
> 
> 
> On Jan 12, 2011, at 1:25 PM, Jeff Squyres wrote:
> 
>> I back-ported the trunk's paffinity/hwloc component to the v1.5 branch 
>> today.  Here are the things that you need to look out for if you undertake 
>> back-porting a component from the trunk to the v1.5 branch...
>> 
>> Remember: the whole autogen.pl infrastructure was not (and will not be) 
>> ported to the v1.5 branch.  So there are some things that you need to change 
>> in your component's build system:
>> 
>> - You need to add a configure.params file
>> - In your component's configure.m4 file:
>> 
>>- Rename your m4 define from MCACONFIG to 
>> MCA___CONFIG
>>- Same for _POST_CONFIG
>>- Remove AC_CONFIG_FILES (they should now be in configure.params)
>>- We renamed a few m4 macros on the trunk; e.g., it's OPAL_VAR_SCOPE_PUSH 
>> on the trunk and OMPI_VAR_SCOPE_PUSH on v1.5.  So if you run "configure" and 
>> it says that commands are not found and they're un-expanded m4 names, look 
>> to see if they have changed names.
>> 
>> - In your component's Makefile.am:
>> 
>>- Rename any "if" variables from the form 
>> MCA_BUILDDSO to OMPI_BUILD___DSO
>> 
>> I think those are the main points to watch out for.
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] Problem with attributes attached to communicators

2011-01-13 Thread Pascal Deveze

A new patch in ROMIO solves this problem.
Thanks to Dave.

Pascal

Dave Goodell wrote:

Hmm... Apparently I was too optimistic about my untested patch.  I'll work with 
Rob this afternoon to straighten this out.

-Dave

On Jan 10, 2011, at 5:53 AM CST, Pascal Deveze wrote:

  

Dave,

Your proposed patch does not work when the call to MPI_File_open() is done on 
MPI_COMM_SELF.
For example, with the romio test program "simple.c", I got the fatal error:

mpirun -np 1 ./simple -fname /tmp//TEST
Fatal error in MPI_Attr_put: Invalid keyval, error stack:
MPI_Attr_put(131): MPI_Attr_put(comm=0x8400, keyval=603979776, 
attr_value=0x2279fa0) failed
MPI_Attr_put(89).: Attribute key was MPI_KEYVAL_INVALID
APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)

Pascal

Dave Goodell wrote:


Try this (untested) patch instead:

  





-Dave

On Jan 7, 2011, at 3:50 AM CST, Rob Latham wrote:

  

  

Hi Pascal.  I'm really happy that you have been working with the
OpenMPI folks to re-sync romio.  I meant to ask you how that work was
progressing, so thanks for the email!

I need to copy Dave Goodell on this conversation because he helped me
understand the keyval issues when we last worked on this two years
ago.  


Dave, some background.  We added some code in ROMIO to address ticket
222:

http://trac.mcs.anl.gov/projects/mpich2/ticket/222


But that code apparently makes OpenMPI unhappy.  When we talked about
this, I remember it came down to a, shall we say, different
interpretation of the standard between MPICH2 and OpenMPI.

In case it's not clear from the nesting of messages, here's Pascal's
extraction of the ROMIO keyval code:


http://www.open-mpi.org/community/lists/devel/2011/01/8837.php


and here's the OpenMPI developer's response:

http://www.open-mpi.org/community/lists/devel/2011/01/8838.php


I think this is related to a discussion I had a couple years ago:

http://www.open-mpi.org/community/lists/users/2009/03/8409.php


So, to eventually answer your question: yes, I do have some remarks, but
I have no answers.  It's been a couple of years since I added those
frees...

==rob

On Fri, Jan 07, 2011 at 09:47:17AM +0100, Pascal Deveze wrote:




Hi Rob,

As you perhaps remember, I was porting ROMIO to OpenMPI.
The job is nearly finished; I only have a problem with the
allocation/deallocation of the keyval (cb_config_list_keyval in
adio/common/cb_config_list.c).
As the algorithm runs fine on MPICH2, I asked for help on the
de...@open-mpi.org mailing list.
I just received the following answer from George Bosilca.

The solution I found to run ROMIO with OpenMPI is to delete the line:
   MPI_Keyval_free(&keyval);
in the function ADIOI_cb_delete_name_array
(romio/adio/common/cb_config_list.c).

Do you have any remarks about that ?

Regards,

Pascal

 Original Message 
Subject:  Re: [OMPI devel] Problem with attributes attached to communicators
Date:   Thu, 6 Jan 2011 13:15:14 -0500
From: 	George Bosilca 



Reply-To: 	Open MPI Developers 



To: 	Open MPI Developers 



References: 
<4d25daf9.3070...@bull.net>




MPI_Comm_create_keyval and MPI_Comm_free_keyval are the functions you should 
use in order to be MPI 2.2 compliant.

Based on my understanding of the MPI standard, your application is incorrect, 
and therefore the MPICH behavior is incorrect. The delete function is not there 
for you to delete the keyval (!) but to delete the attribute. Here is what the 
MPI standard states about this:

  

  

Note that it is not erroneous to free an attribute key that is in use, because 
the actual free does not transpire until after all references (in other 
communicators on the process) to the key have been freed. These references need 
to be explicitly freed by the program, either via calls to MPI_COMM_DELETE_ATTR 
that free one attribute instance, or by calls to MPI_COMM_FREE that free all 
attribute instances associated with the freed communicator.
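Concretely, the pattern looks roughly like this (a minimal sketch, not code
from this thread or from ROMIO; the helper name delete_attr and the malloc'd
payload are made up): the delete callback cleans up only the attribute value,
and the keyval is freed exactly once with MPI_Comm_free_keyval, which the
quoted standard text says is legal even while the key is still attached to a
communicator.

#include <stdlib.h>
#include <mpi.h>

/* Delete callback: release the attribute value only.  Calling a keyval-free
   function here (as the ROMIO code does) is the point being objected to. */
static int delete_attr(MPI_Comm comm, int keyval, void *attr_val, void *extra)
{
    free(attr_val);
    return MPI_SUCCESS;
}

int main(int argc, char **argv)
{
    int keyval;

    MPI_Init(&argc, &argv);

    MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, delete_attr, &keyval, NULL);
    MPI_Comm_set_attr(MPI_COMM_SELF, keyval, malloc(16));

    /* Legal even though MPI_COMM_SELF still carries the attribute: the key
       is only really destroyed once all references to it are gone. */
    MPI_Comm_free_keyval(&keyval);

    /* MPI_Finalize tears down MPI_COMM_SELF, which invokes delete_attr and
       drops the last reference to the key. */
    MPI_Finalize();
    return 0;
}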




george.

On Jan 6, 2011, at 10:08 , Pascal Deveze wrote:

  

  

I have a problem finishing the port of ROMIO into Open MPI. It is related to 
the routine MPI_Comm_dup used together with MPI_Keyval_create, MPI_Keyval_free, 
MPI_Attr_get and MPI_Attr_put.

Here is a simple program that reproduces my problem:

===
#include 
#include "mpi.h"

int copy_fct(MPI_Comm comm, int keyval, void *extra, void *attr_in, void 
**attr_out, int *flag) {
return MPI_SUCCESS;
}

int delete_fct(MPI_Comm comm, int keyval, void *attr_val, void *extra) {
MPI_Keyval_free(&keyval);
return MPI_SUCCESS;
}

int main(int argc, char **argv) {
int i, found, attribute_val=100, keyval = MPI_KEYVAL_INVALID;
MPI_Comm dupcomm;

MPI_Init(&argc,&argv);

for (i=0; i<100;i++) {
/* This simulates the MPI_File_open() */
if (keyval == MPI_KEYVAL_INVALID) {
MPI_Keyval_create((MPI_Copy_function *) copy_fct, (MPI_Delete_function 
*) delete_fct, &keyval, NULL);

Re: [OMPI devel] RFC: Bring the latest ROMIO version from MPICH2-1.3 into the trunk

2011-01-13 Thread Pascal Deveze
This assertion problem is now solved by a patch in ROMIO just 
committed in http://bitbucket.org/devezep/new-romio-for-openmpi


I don't know of any other problems in this port of ROMIO.

Pascal

Pascal Deveze wrote:

Jeff Squyres wrote:

On Dec 16, 2010, at 3:31 AM, Pascal Deveze wrote:

 

int main(int argc, char **argv) {
  MPI_File fh;
  MPI_Info info, info_used;

  MPI_Init(&argc,&argv);

  MPI_File_open(MPI_COMM_WORLD, "/tmp/A", MPI_MODE_CREATE | MPI_MODE_RDWR, 
MPI_INFO_NULL, &fh);
  MPI_File_close(&fh);

  MPI_File_open(MPI_COMM_WORLD, "/tmp/A", MPI_MODE_CREATE | MPI_MODE_RDWR, 
MPI_INFO_NULL, &fh);
  MPI_File_close(&fh);

  MPI_Finalize();
}

I run this program on one process: salloc -p debug -n1 mpirun -np 1 ./a.out
And I get the assertion error:

a.out: attribute/attribute.c:763: ompi_attr_delete: Assertion `((0xdeafbeedULL << 
32) + 0xdeafbeedULL) == ((opal_object_t *) (keyval))->obj_magic_id' failed.
[cuzco10:24785] *** Process received signal ***
[cuzco10:24785] Signal: Aborted (6)



Ok.

  

I saw that there is a problem with an MPI_COMM_SELF communicator.

The problem disappears (and all ROMIO tests are OK) when I comment line 89 in 
the file ompi/mca/io/romio/romio/adio/common/ad_close.c :
 // MPI_Comm_free(&(fd->comm));

The problem disappears (and all ROMIO tests are OK) when I comment line 425 in 
the file ompi/mca/io/romio/romio/adio/common/cb_config_list.c :
   //  MPI_Keyval_free(&keyval);

The problem also disappears (but only 50% of the ROMIO tests are OK) when I 
comment line 133 in the file ompi/runtime/ompi_mpi_finalize.c:
  // ompi_attr_delete_all(COMM_ATTR, &ompi_mpi_comm_self,
 // ompi_mpi_comm_self.comm.c_keyhash);



It sounds like there's a problem with the ordering of shutdown of things in 
MPI_FINALIZE w.r.t. ROMIO.

FWIW: ROMIO violates some of our abstractions, but it's the price we pay for using a 
3rd party package.  One very, very important abstraction that we have is that 
top-level MPI API functions are not allowed to call any other MPI API functions.  
E.g., MPI_Send (i.e., ompi/mpi/c/send.c) cannot call MPI_Isend (i.e., 
ompi/mpi/c/isend.c).  MPI_Send *can* call the same back-end implementation functions 
that isend does -- it's just not allowed to call another MPI_* function.

The reason is that the top-level MPI API functions do things like check for 
whether MPI_INIT / MPI_FINALIZE have been called, etc.  The back-end functions 
do not do this.  Additionally, top-level MPI API functions may be overridden 
via PMPI kinds of things.  We wouldn't want our internal library calls to get 
intercepted by user code.
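
For instance, a profiling tool can intercept any MPI call through the PMPI
layer -- here is a minimal sketch of such a wrapper (an illustration, not code
from this thread):

#include <stdio.h>
#include <mpi.h>

/* A user or tool can provide its own MPI_Isend that forwards to PMPI_Isend.
   If our MPI_Send were implemented by calling the top-level MPI_Isend,
   every internal isend in the library would also pass through wrappers
   like this one. */
int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest,
              int tag, MPI_Comm comm, MPI_Request *request)
{
    printf("intercepted MPI_Isend: count=%d dest=%d tag=%d\n",
           count, dest, tag);
    return PMPI_Isend(buf, count, datatype, dest, tag, comm, request);
}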

  

I am not very familiar with the OBJ_RELEASE/OBJ_RETAIN mechanism, and so far I 
do not understand the real origin of that problem.



RETAIN/RELEASE is part of OMPI's "poor man's C++" design.  Way back in the 
beginning of the project, we debated whether to use C or C++ for developing the 
code.  There was a desire to use some of the basic object functionality of C++ 
(e.g., derived classes, constructors, destructors, etc.), but we wanted to stay 
as portable as possible.  So we ended up going with C, but with a few macros 
that emulate some C++-like functionality.  This led to OMPI's OBJ system that 
is used all over the place.  


The OBJ system does several things:

- allows you to have "constructor"- and "destructor"-like behavior for structs
- works for both stack and heap memory
- reference counting

The reference counting is perhaps the most-used function of OBJ.  Here's a 
sample scenario:

/* allocate some memory, call the some_object_type "constructor",
   and set the reference count of "foo" to 1 */
foo = OBJ_NEW(some_object_type);

/* increment the reference count of foo (to 2) */
OBJ_RETAIN(foo);

/* increment the reference count of foo (to 3) */
OBJ_RETAIN(foo);

/* decrement the reference count of foo (to 1) */
OBJ_RELEASE(foo);
OBJ_RELEASE(foo);

/* decrement the reference count of foo to 0 -- which will
   call foo's "destructor" and then free the memory */
OBJ_RELEASE(foo);

The same principle works for structs on the stack -- we do the same constructor 
/ destructor behavior, but just don't free the memory.  For example:

/* Instantiate the memory and call its "constructor" and set the
   ref count to 1 */
some_object_type foo;
OBJ_CONSTRUCT(&foo, some_object_type);

/* Increment and decrement the ref count */
OBJ_RETAIN(&foo);
OBJ_RETAIN(&foo);
OBJ_RELEASE(&foo);
OBJ_RELEASE(&foo);

/* The last RELEASE will call the destructor, but won't actually
   free the memory, because the memory was not allocated with 
   OBJ_NEW */

OBJ_RELEASE(&foo);

When the destructor is called, the OBJ system sets the magic number in the 
obj's memory to a sentinel value so that we know that the destructor has been 
called on this particular struct.  Hence, if we call OBJ_RELEASE *again* on a 
struct that has already had its ref count go to 0 (and therefore already had 
its destructor called), we get an assertion failure on that magic number -- 
exactly the obj_magic_id assertion shown above.
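
To make the failure mode concrete, here is a generic sketch of the idea (this
is not OMPI's actual OBJ implementation; the names obj_t, obj_construct, etc.
are made up): the "destructor" poisons a magic-number sentinel once the
refcount hits zero, so any later release trips an assertion.

#include <assert.h>

#define OBJ_MAGIC 0xdeafbeedUL

typedef struct {
    unsigned long magic;
    int refcount;
} obj_t;

static void obj_construct(obj_t *o) { o->magic = OBJ_MAGIC; o->refcount = 1; }
static void obj_retain(obj_t *o)    { assert(o->magic == OBJ_MAGIC); o->refcount++; }

static void obj_release(obj_t *o)
{
    assert(o->magic == OBJ_MAGIC);   /* fires on a release-after-destruction */
    if (--o->refcount == 0) {
        o->magic = 0;                /* "destructor": poison the sentinel */
    }
}

int main(void)
{
    obj_t foo;
    obj_construct(&foo);   /* refcount = 1 */
    obj_retain(&foo);      /* refcount = 2 */
    obj_release(&foo);     /* refcount = 1 */
    obj_release(&foo);     /* refcount = 0, destructor runs */
    obj_release(&foo);     /* one release too many: the assert aborts, much
                              like the obj_magic_id failure in this thread */
    return 0;
}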

Re: [OMPI devel] RFC: Bring the latest ROMIO version from MPICH2-1.3 into the trunk

2011-01-13 Thread Jeff Squyres
Great!

I see in your other mail that you pulled something from MPICH2 to make this 
work.

Does that mean that there's an even-newer version of ROMIO that we should pull 
in its entirety?  It's a little risky to pull most stuff from one released 
version of ROMIO and then more stuff from another released version.  Meaning: 
it's a little nicer/safer to say that we have ROMIO from a single released 
version of MPICH2.  

If possible.  :-)

Is it possible?

Don't get me wrong -- I want the new ROMIO, and I'm sorry you've had to go 
through so many hoops to get it ready.  :-(  But we should do it the best way 
we can; we have history/precedent for taking ROMIO from a single 
source/released version of MPICH[2], and I'd like to maintain that precedent if 
at all possible.


On Jan 13, 2011, at 8:04 AM, Pascal Deveze wrote:

> This problem of assertion is now solved by a patch in ROMIO just commited in 
> http://bitbucket.org/devezep/new-romio-for-openmpi
> 
> I don't know any other problem in this porting of ROMIO.
> 
> Pascal
> 
> Pascal Deveze wrote:
>> Jeff Squyres wrote:
>>> On Dec 16, 2010, at 3:31 AM, Pascal Deveze wrote:
>>> 
>>>  
>>> 
 int main(int argc, char **argv) {
   MPI_File fh;
   MPI_Info info, info_used;
 
   MPI_Init(&argc,&argv);
 
   MPI_File_open(MPI_COMM_WORLD, "/tmp/A", MPI_MODE_CREATE | MPI_MODE_RDWR, 
 MPI_INFO_NULL, &fh);
   MPI_File_close(&fh);
 
   MPI_File_open(MPI_COMM_WORLD, "/tmp/A", MPI_MODE_CREATE | MPI_MODE_RDWR, 
 MPI_INFO_NULL, &fh);
   MPI_File_close(&fh);
 
   MPI_Finalize();
 }
 
 I run this program on one process: salloc -p debug -n1 mpirun -np 1 
 ./a.out
 And I get the assertion error:
 
 a.out: attribute/attribute.c:763: ompi_attr_delete: Assertion 
 `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) 
 (keyval))->obj_magic_id' failed.
 [cuzco10:24785] *** Process received signal ***
 [cuzco10:24785] Signal: Aborted (6)
 
 
>>> 
>>> Ok.
>>> 
>>>   
>>> 
 I saw that there is a problem with an MPI_COMM_SELF communicator.
 
 The problem disappears (and all ROMIO tests are OK) when I comment line 89 
 in the file ompi/mca/io/romio/romio/adio/common/ad_close.c :
  // MPI_Comm_free(&(fd->comm));
 
 The problem disappears (and all ROMIO tests are OK) when I comment line 
 425 in the file ompi/mca/io/romio/romio/adio/common/cb_config_list.c :
//  MPI_Keyval_free(&keyval);
 
 The problem also disappears (but only 50% of the ROMIO tests are OK) when 
 I comment line 133 in the file ompi/runtime/ompi_mpi_finalize.c:
   // ompi_attr_delete_all(COMM_ATTR, &ompi_mpi_comm_self,
  // ompi_mpi_comm_self.comm.c_keyhash);
 
 
>>> 
>>> It sounds like there's a problem with the ordering of shutdown of things in 
>>> MPI_FINALIZE w.r.t. ROMIO.
>>> 
>>> FWIW: ROMIO violates some of our abstractions, but it's the price we pay 
>>> for using a 3rd party package.  One very, very important abstraction that 
>>> we have is that top-level MPI API functions are not allowed to call any 
>>> other MPI API functions.  E.g., MPI_Send (i.e., ompi/mpi/c/send.c) cannot 
>>> call MPI_Isend (i.e., ompi/mpi/c/isend.c).  MPI_Send *can* call the same 
>>> back-end implementation functions that isend does -- it's just not allowed 
>>> to call MPI_.
>>> 
>>> The reason is that the top-level MPI API functions do things like check for 
>>> whether MPI_INIT / MPI_FINALIZE have been called, etc.  The back-end 
>>> functions do not do this.  Additionally, top-level MPI API functions may be 
>>> overridden via PMPI kinds of things.  We wouldn't want our internal library 
>>> calls to get intercepted by user code.
>>> 
>>>   
>>> 
 I am not very familiar with the OBJ_RELEASE/OBJ_RETAIN mechanism and till 
 now I do not understand what is the real origin of that problem.
 
 
>>> 
>>> RETAIN/RELEASE is part of OMPI's "poor man's C++" design.  Way back in 
>>> the beginning of the project, we debated whether to use C or C++ for 
>>> developing the code.  There was a desire to use some of the basic object 
>>> functionality of C++ (e.g., derived classes, constructors, destructors, 
>>> etc.), but we wanted to stay as portable as possible.  So we ended up going 
>>> with C, but with a few macros that emulate some C++-like functionality.  
>>> This led to OMPI's OBJ system that is used all over the place.  
>>> 
>>> The OBJ system does several things:
>>> 
>>> - allows you to have "constructor"- and "destructor"-like behavior for 
>>> structs
>>> - works for both stack and heap memory
>>> - reference counting
>>> 
>>> The reference counting is perhaps the most-used function of OBJ.  Here's a 
>>> sample scenario:
>>> 
>>> /* allocate some memory, call the some_object_type "constructor",
>>>and set the reference count of "foo" to 1 */
>>> foo = OBJ_NEW(s

Re: [OMPI devel] OMPI 1.4.3 hangs in gather

2011-01-13 Thread Jeff Squyres
+1 on what Pasha said -- if using rdmacm fixes the problem, then there's 
something else nefarious going on...

You might want to run padb on your hangs to see where all the processes are 
stuck and whether anything obvious jumps out.  I'd be surprised if there's a bug 
in the oob cpc; it's been around for a long, long time; it should be pretty 
stable.

Do we create QP's differently between oob and rdmacm, such that perhaps they 
are "better" (maybe better routed, or using a different SL, or ...) when 
created via rdmacm?


On Jan 12, 2011, at 12:12 PM, Shamis, Pavel wrote:

> RDMACM or OOB cannot affect the performance of this benchmark, since they are 
> not involved in communication. So I'm not sure that the performance changes 
> that you see are related to connection manager changes.
> About oob - I'm not aware of any hang issues there; the code is very, very old, 
> we have not touched it for a long time.
> 
> Regards,
> 
> Pavel (Pasha) Shamis
> ---
> Application Performance Tools Group
> Computer Science and Math Division
> Oak Ridge National Laboratory
> Email: sham...@ornl.gov
> 
> 
> 
> 
> 
> On Jan 12, 2011, at 8:45 AM, Doron Shoham wrote:
> 
>> Hi,
>> 
>> For the first problem, I can see that when using rdmacm instead of oob as the
>> openib CPC, I get much better performance results (and no hangs!).
>> 
>> mpirun -display-map -np 64 -machinefile voltairenodes -mca btl
>> sm,self,openib -mca btl_openib_connect_rdmacm_priority 100
>> imb/src/IMB-MPI1 gather -npmin 64
>> 
>> 
>>  #bytes  #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>>       0          1000         0.04         0.05         0.05
>>       1          1000        19.64        19.69        19.67
>>       2          1000        19.97        20.02        19.99
>>       4          1000        21.86        21.96        21.89
>>       8          1000        22.87        22.94        22.90
>>      16          1000        24.71        24.80        24.76
>>      32          1000        27.23        27.32        27.27
>>      64          1000        30.96        31.06        31.01
>>     128          1000        36.96        37.08        37.02
>>     256          1000        42.64        42.79        42.72
>>     512          1000        60.32        60.59        60.46
>>    1024          1000        82.44        82.74        82.59
>>    2048          1000       497.66       499.62       498.70
>>    4096          1000       684.15       686.47       685.33
>>    8192           519       544.07       546.68       545.85
>>   16384           519       653.20       656.23       655.27
>>   32768           519       704.48       707.55       706.60
>>   65536           519       918.00       922.12       920.86
>>  131072           320      2414.08      2422.17      2418.20
>>  262144           160      4198.25      4227.58      4213.19
>>  524288            80      7333.04      7503.99      7438.18
>> 1048576            40     13692.60     14150.20     13948.75
>> 2097152            20     30377.34     32679.15     31779.86
>> 4194304            10     61416.70     71012.50     68380.04
>> 
>> How can the oob cause the hang? Isn't it only used to bring up the 
>> connection?
>> Does the oob play any part after the connections are made?
>> 
>> Thanks,
>> Doron
>> 
>> On Tue, Jan 11, 2011 at 2:58 PM, Doron Shoham  wrote:
>>> 
>>> Hi
>>> 
>>> All machines on the setup are IDataPlex with Nehalem 12 cores per node, 
>>> 24GB  memory.
>>> 
>>> 
>>> 
>>> · Problem 1 – OMPI 1.4.3 hangs in gather:
>>> 
>>> 
>>> 
>>> I’m trying to run IMB and gather operation with OMPI 1.4.3 (Vanilla).
>>> 
>>> It happens when np >= 64 and message size exceed 4k:
>>> 
>>> mpirun -np 64 -machinefile voltairenodes -mca btl sm,self,openib  
>>> imb/src-1.4.2/IMB-MPI1 gather –npmin 64
>>> 
>>> 
>>> 
>>> voltairenodes consists of 64 machines.
>>> 
>>> 
>>> 
>>> #
>>> 
>>> # Benchmarking Gather
>>> 
>>> # #processes = 64
>>> 
>>> #
>>> 
>>>  #bytes  #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>>>       0          1000         0.02         0.02         0.02
>>>       1           331        14.02        14.16        14.09
>>>       2           331        12.87        13.08        12.93
>>>       4           331        14.29        14.43        14.34
>>>       8           331        16.03        16.20        16.11
>>>      16           331        17.54        17.74        17.64
>>>      32           331        20.49        20.62        20.53
>>>      64           331        23.57        23.84        23.70
>>>     128           331        28.02        28.35        28.18
>>>     256           331        34.78        34.88        34.80
>>>     512           331        46.34        46.91        46.60
>>>    1024           331        63.96        64.71        64.33
>>>    2048           331       460.67       465.74       463.18
>>>    4096           331       637.33       643.9

Re: [OMPI devel] RFC: Bring the latest ROMIO version from MPICH2-1.3 into the trunk

2011-01-13 Thread George Bosilca

On Jan 13, 2011, at 14:08 , Jeff Squyres wrote:

> Great!
> 
> I see in your other mail that you pulled something from MPICH2 to make this 
> work.
> 
> Does that mean that there's a even-newer version of ROMIO that we should pull 
> in its entirety?  It's a little risky to pull most stuff from one released 
> version of ROMIO and then more stuff from another released version.  Meaning: 
> it's little nicer/safer to say that we have ROMIO from a single released 
> version of MPICH2.

My understanding is that the MPICH guys provided a patch for the MPI attribute 
issue. As such, the version here is the most up-to-date.

  george.

> 
> If possible.  :-)
> 
> Is it possible?
> 
> Don't get me wrong -- I want the new ROMIO, and I'm sorry you've had to go 
> through so many hoops to get it ready.  :-(  But we should do it the best way 
> we can; we have history/precedent for taking ROMIO from a single 
> source/released version of MPICH[2], and I'd like to maintain that precedent 
> if at all possible.
> 
> 
> On Jan 13, 2011, at 8:04 AM, Pascal Deveze wrote:
> 
>> This problem of assertion is now solved by a patch in ROMIO just commited in 
>> http://bitbucket.org/devezep/new-romio-for-openmpi
>> 
>> I don't know any other problem in this porting of ROMIO.
>> 
>> Pascal
>> 
>> Pascal Deveze wrote:
>>> Jeff Squyres wrote:
 On Dec 16, 2010, at 3:31 AM, Pascal Deveze wrote:
 
 
 
> int main(int argc, char **argv) {
>  MPI_File fh;
>  MPI_Info info, info_used;
> 
>  MPI_Init(&argc,&argv);
> 
>  MPI_File_open(MPI_COMM_WORLD, "/tmp/A", MPI_MODE_CREATE | MPI_MODE_RDWR, 
> MPI_INFO_NULL, &fh);
>  MPI_File_close(&fh);
> 
>  MPI_File_open(MPI_COMM_WORLD, "/tmp/A", MPI_MODE_CREATE | MPI_MODE_RDWR, 
> MPI_INFO_NULL, &fh);
>  MPI_File_close(&fh);
> 
>  MPI_Finalize();
> }
> 
> I run this program on one process: salloc -p debug -n1 mpirun -np 1 
> ./a.out
> And I get the assertion error:
> 
> a.out: attribute/attribute.c:763: ompi_attr_delete: Assertion 
> `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) 
> (keyval))->obj_magic_id' failed.
> [cuzco10:24785] *** Process received signal ***
> [cuzco10:24785] Signal: Aborted (6)
> 
> 
 
 Ok.
 
 
 
> I saw that there is a problem with an MPI_COMM_SELF communicator.
> 
> The problem disappears (and all ROMIO tests are OK) when I comment line 
> 89 in the file ompi/mca/io/romio/romio/adio/common/ad_close.c :
> // MPI_Comm_free(&(fd->comm));
> 
> The problem disappears (and all ROMIO tests are OK) when I comment line 
> 425 in the file ompi/mca/io/romio/romio/adio/common/cb_config_list.c :
>   //  MPI_Keyval_free(&keyval);
> 
> The problem also disappears (but only 50% of the ROMIO tests are OK) when 
> I comment line 133 in the file ompi/runtime/ompi_mpi_finalize.c:
>  // ompi_attr_delete_all(COMM_ATTR, &ompi_mpi_comm_self,
> // ompi_mpi_comm_self.comm.c_keyhash);
> 
> 
 
 It sounds like there's a problem with the ordering of shutdown of things 
 in MPI_FINALIZE w.r.t. ROMIO.
 
 FWIW: ROMIO violates some of our abstractions, but it's the price we pay 
 for using a 3rd party package.  One very, very important abstraction that 
 we have is that top-level MPI API functions are not allowed to call any 
 other MPI API functions.  E.g., MPI_Send (i.e., ompi/mpi/c/send.c) cannot 
 call MPI_Isend (i.e., ompi/mpi/c/isend.c).  MPI_Send *can* call the same 
 back-end implementation functions that isend does -- it's just not allowed 
 to call MPI_.
 
 The reason is that the top-level MPI API functions do things like check 
 for whether MPI_INIT / MPI_FINALIZE have been called, etc.  The back-end 
 functions do not do this.  Additionally, top-level MPI API functions may 
 be overridden via PMPI kinds of things.  We wouldn't want our internal 
 library calls to get intercepted by user code.
 
 
 
> I am not very familiar with the OBJ_RELEASE/OBJ_RETAIN mechanism and till 
> now I do not understand what is the real origin of that problem.
> 
> 
 
 RETAIN/RELEASE is part of OMPI's "poor man's C++" design.  Way back in 
 the beginning of the project, we debated whether to use C or C++ for 
 developing the code.  There was a desire to use some of the basic object 
 functionality of C++ (e.g., derived classes, constructors, destructors, 
 etc.), but we wanted to stay as portable as possible.  So we ended up 
 going with C, but with a few macros that emulate some C++-like 
 functionality.  This led to OMPI's OBJ system that is used all over the 
 place.  
 
 The OBJ system does several things:
 
 - allows you to have "constructor"- and "destructor"-like behavior for 
 structs

Re: [OMPI devel] OMPI 1.4.3 hangs in gather

2011-01-13 Thread Shamis, Pavel
RDMACM creates the same QPs with the same tunings as OOB, so I don't see how 
the CPC could affect performance.

Pavel (Pasha) Shamis
---
Application Performance Tools Group
Computer Science and Math Division
Oak Ridge National Laboratory






On Jan 13, 2011, at 2:15 PM, Jeff Squyres wrote:

> +1 on what Pasha said -- if using rdmacm fixes the problem, then there's 
> something else nefarious going on...
>
> You might want to check padb with your hangs to see where all the processes 
> are hung to see if anything obvious jumps out.  I'd be surprised if there's a 
> bug in the oob cpc; it's been around for a long, long time; it should be 
> pretty stable.
>
> Do we create QP's differently between oob and rdmacm, such that perhaps they 
> are "better" (maybe better routed, or using a different SL, or ...) when 
> created via rdmacm?
>
>
> On Jan 12, 2011, at 12:12 PM, Shamis, Pavel wrote:
>
>> RDMACM or OOB can not effect on performance of this benchmark, since they 
>> are not involved in communication. So I'm not sure that the performance 
>> changes that you see are related to connection manager changes.
>> About oob - I'm not aware about hangs issue there, the code is very-very 
>> old, we did not touch it for a long time.
>>
>> Regards,
>>
>> Pavel (Pasha) Shamis
>> ---
>> Application Performance Tools Group
>> Computer Science and Math Division
>> Oak Ridge National Laboratory
>> Email: sham...@ornl.gov
>>
>>
>>
>>
>>
>> On Jan 12, 2011, at 8:45 AM, Doron Shoham wrote:
>>
>>> Hi,
>>>
>>> For the first problem, I can see that when using rdmacm as openib oob
>>> I get much better performence results (and no hangs!).
>>>
>>> mpirun -display-map -np 64 -machinefile voltairenodes -mca btl
>>> sm,self,openib -mca btl_openib_connect_rdmacm_priority 100
>>> imb/src/IMB-MPI1 gather -npmin 64
>>>
>>>
>>>  #bytes  #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>>>       0          1000         0.04         0.05         0.05
>>>       1          1000        19.64        19.69        19.67
>>>       2          1000        19.97        20.02        19.99
>>>       4          1000        21.86        21.96        21.89
>>>       8          1000        22.87        22.94        22.90
>>>      16          1000        24.71        24.80        24.76
>>>      32          1000        27.23        27.32        27.27
>>>      64          1000        30.96        31.06        31.01
>>>     128          1000        36.96        37.08        37.02
>>>     256          1000        42.64        42.79        42.72
>>>     512          1000        60.32        60.59        60.46
>>>    1024          1000        82.44        82.74        82.59
>>>    2048          1000       497.66       499.62       498.70
>>>    4096          1000       684.15       686.47       685.33
>>>    8192           519       544.07       546.68       545.85
>>>   16384           519       653.20       656.23       655.27
>>>   32768           519       704.48       707.55       706.60
>>>   65536           519       918.00       922.12       920.86
>>>  131072           320      2414.08      2422.17      2418.20
>>>  262144           160      4198.25      4227.58      4213.19
>>>  524288            80      7333.04      7503.99      7438.18
>>> 1048576            40     13692.60     14150.20     13948.75
>>> 2097152            20     30377.34     32679.15     31779.86
>>> 4194304            10     61416.70     71012.50     68380.04
>>>
>>> How can the oob cause the hang? Isn't it only used to bring up the 
>>> connection?
>>> Does the oob has any part of the connections were made?
>>>
>>> Thanks,
>>> Dororn
>>>
>>> On Tue, Jan 11, 2011 at 2:58 PM, Doron Shoham  wrote:

 Hi

 All machines on the setup are IDataPlex with Nehalem 12 cores per node, 
 24GB  memory.



 · Problem 1 – OMPI 1.4.3 hangs in gather:



 I’m trying to run IMB and gather operation with OMPI 1.4.3 (Vanilla).

 It happens when np >= 64 and message size exceed 4k:

 mpirun -np 64 -machinefile voltairenodes -mca btl sm,self,openib  
 imb/src-1.4.2/IMB-MPI1 gather –npmin 64



 voltairenodes consists of 64 machines.



 #

 # Benchmarking Gather

 # #processes = 64

 #

 #bytes  #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
      0          1000         0.02         0.02         0.02
      1           331        14.02        14.16        14.09
      2           331        12.87        13.08        12.93
      4           331        14.29        14.43        14.34
      8           331        16.03        16.20        16.11
     16           331        17.54        17.74        17.64
     32           331        20.49        20.62        20.53
     64           331        23.57        23.84        23.70
128  33128