[OMPI devel] shared-memory allocations

2008-12-10 Thread Eugene Loh
For shared memory communications, each on-node connection (non-self, 
sender-receiver pair) gets a circular buffer (CB) during MPI_Init().  Each CB 
requires the following allocations:


*) ompi_cb_fifo_wrapper_t (roughly 64 bytes)
*) ompi_cb_fifo_ctl_t head (roughly 12 bytes)
*) ompi_cb_fifo_ctl_t tail (roughly 12 bytes)
*) queue (roughly 1024 bytes)

Importantly, the current code lays these four allocations out on three 
separate pages.  (The tail and queue are aggregated together.)  So, for 
example, that "head" allocation (12 bytes) ends up consuming a full page.


As one goes to more and more on-node processes -- say, for a large SMP 
or a multicore system -- the number of non-self connections grows as 
n*(n-1).  So, these circular-buffer allocations end up consuming a lot 
of shared memory.


For example, for a 4K pagesize and n=512 on-node processes, the circular 
buffers consume 3 Gbyte of memory -- 90% of which is empty and simply 
used for page alignment.
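
As a sanity check, that figure can be reproduced with a quick calculation 
(a sketch only; the 4K page size and the three page-aligned allocations per 
connection are as described above):

  /* Rough estimate of shared memory consumed by the per-connection circular
   * buffers, assuming 3 page-aligned allocations per connection. */
  #include <stdio.h>

  int main(void)
  {
      const long long page_size      = 4096;  /* bytes (assumed 4K pages)   */
      const long long pages_per_conn = 3;     /* wrapper; head; tail+queue  */
      const long long n              = 512;   /* on-node processes          */

      long long connections = n * (n - 1);    /* non-self sender-receiver pairs */
      long long bytes = connections * pages_per_conn * page_size;

      printf("%lld connections -> %.2f Gbyte of shared memory\n",
             connections, bytes / (1024.0 * 1024.0 * 1024.0));
      return 0;
  }

With roughly 1.1 Kbyte of actual data spread over three 4K pages per 
connection, about 90% of that memory is indeed just alignment padding.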


I'd like to aggregate more of these allocations so that:

*) shared-memory consumption is reduced
*) the number of allocations (and hence the degree of lock contention) 
during MPI_Init is reduced


Any comments?

I'd like to understand the original rationale for these page 
alignments.  I expect this is related to memory placement of pages.  So, 
I imagine three scenarios.  Which is it?


A) There really is a good reason for each allocation to have its own 
page and any attempt to aggregate is doomed.


B) There is actual benefit for placing things carefully in memory, but 
substantial aggregation is still possible.  That is, for n processes, we 
need at most n different allocations -- not 3*n*(n-1).


C) There is no actual justification for having everything on different 
pages.  That is, allowing different parts of a FIFO CB to be mapped 
differently to physical memory sounded to someone like a good idea at 
the time, but no one really did any performance measurements to justify 
this.  Or, if they did, it was only on one platform and we have no 
evidence that the same behavior exists on all platforms.  Personally, 
I've played with some simple experiments on one (or more?) platforms and 
found no performance variations due to placement of shared variables 
that two processes use for communication.  I guess it's possible that 
data is moving cache-to-cache and doesn't care where the backing memory is.


Note that I only want to reduce the number of page-aligned allocations.  
I'd preserve cacheline alignment.  So, no worry about false sharing due 
to a sender thrashing on one end of a FIFO and a receiver on the other.
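
To make the idea concrete, something like the following hypothetical layout 
is what I have in mind (the struct and field names are made up for 
illustration and are not the actual ompi_cb_fifo_* definitions; the 128-byte 
cacheline is an assumption):

  /* Hypothetical sketch: one shared-memory allocation per connection, with
   * the pieces padded to cacheline boundaries instead of page boundaries.
   * Assumes the allocation itself is handed out cacheline-aligned. */
  #define CACHELINE 128   /* assumed cacheline size */

  typedef struct {
      unsigned char wrapper[CACHELINE];  /* ~64-byte wrapper, padded        */
      unsigned char head[CACHELINE];     /* ~12-byte head ctl, own cacheline */
      unsigned char tail[CACHELINE];     /* ~12-byte tail ctl, own cacheline */
      unsigned char queue[1024];         /* the ~1 Kbyte queue itself       */
  } aggregated_cb_t;

  /* ~1.4 Kbyte per connection instead of 3 pages (12 Kbyte with 4K pages),
   * while each piece still sits on its own cacheline(s). */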


Re: [OMPI devel] RFC: merge windows branch into trunk

2008-12-10 Thread Rainer Keller
Ralph,
we delayed the COB for this to Dec 9 and announced yesterday that we would 
commit today.
We updated to pick up the newly fixed buglets and tested twice on Windows 
(shared & static) and on Linux to confirm that nothing breaks.

Now we are ready to commit -- and, just as well, we pick up r20106, which 
touches quite a bit of the code base once again ;-]


Thanks,
Rainer



On Thursday, November 20, 2008, Ralph Castain wrote:
> Hmmm... I was just typing this up when Tim's note hit. I also have two
> concerns that somewhat echo his:
>
> 1. since nearly everyone is at SC08, and since next week is a holiday,
> the timing of this merge is poor. I would really urge that you delay
> it until at least Dec 5 so people actually know about it - and have
> time to even think about it
>
> 2. how does this fit into our overall release schedule? There was talk
> at one time (when we thought 1.3 was going out soon) about having a
> short release cycle to get Windows support out for 1.4. Now this is
> coming into the trunk even before 1.3 goes out.
>
> So is 1.3 going to have a lifecycle of a month? Or are we going to
> delay 1.3 (if it even needs to be delayed) so it can include this code?
>
> Reason I ask: last time we rolled Windows support into the system, it
> created a complete code fork, making support for the current stable
> release nearly impossible. That generated a lot of unhappiness and
> argument within the community until we finally released a new version.
>
>  From what I have seen as we've discussed things during devel, these
> are fairly well-contained changes. However, it -will- make maintaining
> 1.3 more difficult if people attempt to do it the old way - making
> changes in the trunk and patching across to 1.3. If we instead use
> isolated 1.3 branches for maintaining the code, then this isn't an
> issue.
>
> Merits more thought than one week can provide.
>
> Ralph
>
> On Nov 20, 2008, at 6:53 AM, Tim Mattox wrote:
> > I have two concerns.  First is that we really need to focus on
> > getting 1.3 stable and released.  My second concern with
> > this is how it will affect the merging of bugfixes for 1.3 from the
> > trunk once we release 1.3.  Will the following modified files
> > cause merge conflicts for CMRs?  How big is this diff,
> > can you send it to the list, or otherwise make it available?
> >
> >> M ompi/runtime/ompi_mpi_init.c
> >> M opal/event/event.c
> >> M opal/event/WIN32-Code/win32.c
> >> M opal/mca/base/mca_base_param.c
> >> M opal/mca/installdirs/windows/opal_installdirs_windows.c
> >> M opal/runtime/opal_cr.c
> >> M opal/win32/ompi_misc.h
> >> M opal/win32/win_compat.h
> >> M orte/mca/plm/ccp/plm_ccp_component.c
> >> M orte/mca/plm/ccp/plm_ccp_module.c
> >> M orte/mca/plm/process/plm_process_module.c
> >> M orte/mca/ras/ccp/ras_ccp_component.c
> >> M orte/mca/ras/ccp/ras_ccp_module.c
> >> M orte/runtime/orte_wait.c
> >> M orte/tools/orterun/orterun.c
> >> M orte/util/hnp_contact.c
> >
> > I would ask that you consider breaking these
> > modifications into parts that could be harmlessly
> > brought over to 1.3 independently, in case a
> > subsequent non-Windows bugfix to one of those files
> > will only merge cleanly if some of your changes to
> > the same file are also brought over.
> > For example, it would be a real pain to have to use
> > patchfiles to resolve merge conflicts simply because
> > of an #ifdef or white-space change here or there.
> > Hopefully that made sense...
> >
> > Although I don't use windows myself, I appreciate your
> > and others' efforts to expand the number of platforms
> > we can run on.  Great work!
> > --
> > Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
> > tmat...@gmail.com || timat...@open-mpi.org
> >I'm a bright... http://www.the-brights.net/



-- 

Dipl.-Inf. Rainer Keller       http://www.hlrs.de/people/keller
 HLRS                          Tel: ++49 (0)711-685 6 5858
 Nobelstrasse 19               Fax: ++49 (0)711-685 6 5832
 70550 Stuttgart               email: kel...@hlrs.de
 Germany                       AIM/Skype: rusraink


Re: [OMPI devel] RFC: merge windows branch into trunk

2008-12-10 Thread Ralph Castain


On Dec 10, 2008, at 2:01 PM, Rainer Keller wrote:


> Ralph,
> we delayed the COB for this to Dec 9 and announced yesterday that we would
> commit today.
> We updated to pick up the newly fixed buglets and tested twice on Windows
> (shared & static) and on Linux to confirm that nothing breaks.



Sounds great!

> Now we are ready to commit -- and, just as well, we pick up r20106, which
> touches quite a bit of the code base once again ;-]


Actually, r20106 is pretty well confined to the iof area (the changes 
outside iof are rather trivial) and mostly just restores what was there a 
few days ago. So I would be surprised to see a conflict, other than perhaps 
in how Windows handles iof.


Glad to see this come over! Should be an interesting few days of MTT  
results... :-))

Ralph




Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r20003 (Solaris malloc.h issue)

2008-12-10 Thread Ethan Mallove
Hi Patrick,

r20003 seems to break MX support on Solaris.

  $ cd ompi/mca/common/mx
  $ make
  ...
  "/usr/include/malloc.h", line 46: syntax error before or at: (
  "/usr/include/malloc.h", line 47: syntax error before or at: (
  "/usr/include/malloc.h", line 48: syntax error before or at: (
  "/usr/include/malloc.h", line 48: cannot have void object: size_t
  "/usr/include/malloc.h", line 48: identifier redeclared: size_t
  ... <4000 more lines of compiler errors> ...

The patch below uses opal/util/malloc.h instead of /usr/include/malloc.h, 
and the compiler errors go away. (I also needed to include errno.h.) Would 
this be okay to do?

  diff -r 347f52a3713f ompi/mca/common/mx/common_mx.c
  --- ompi/mca/common/mx/common_mx.c
  +++ ompi/mca/common/mx/common_mx.c
  @@ -23,9 +23,8 @@
   #include "ompi/constants.h"
   #include "common_mx.h"

   -#ifdef HAVE_MALLOC_H
   -#include <malloc.h>
   -#endif
   +#include <errno.h>
   +#include "opal/util/malloc.h"
   #include "opal/memoryhooks/memory.h"
   #include "opal/mca/base/mca_base_param.h"
   #include "ompi/runtime/params.h"

I tested the above on Solaris and Linux with SunStudio.

Regards,
Ethan


On Fri, Nov/14/2008 11:17:59PM, patr...@osl.iu.edu wrote:
> Author: patrick
> Date: 2008-11-14 23:17:58 EST (Fri, 14 Nov 2008)
> New Revision: 20003
> URL: https://svn.open-mpi.org/trac/ompi/changeset/20003
> 
> Log:
> Define a "fake" mpool to provide a memory release callback for the 
> memory hooks (munmap) and initialize the mallopt component, and 
> nothing else.
> Use this mpool in the MX common initialization, supporting both BTL 
> and MTL. Automatically set the MX_RCACHE environment variable to 
> enable registration cache in MX.
> 
> Tested with success for munmap() and large free().
> 
> 
> Added:
>trunk/ompi/mca/mpool/fake/
>trunk/ompi/mca/mpool/fake/Makefile.am
>trunk/ompi/mca/mpool/fake/configure.params
>trunk/ompi/mca/mpool/fake/mpool_fake.h
>trunk/ompi/mca/mpool/fake/mpool_fake_component.c
>trunk/ompi/mca/mpool/fake/mpool_fake_module.c
> Text files modified: 
>trunk/ompi/mca/common/mx/common_mx.c | 56 +++
>1 files changed, 55 insertions(+), 1 deletions(-)
> 
> Modified: trunk/ompi/mca/common/mx/common_mx.c
> ==
> --- trunk/ompi/mca/common/mx/common_mx.c  (original)
> +++ trunk/ompi/mca/common/mx/common_mx.c  2008-11-14 23:17:58 EST (Fri, 
> 14 Nov 2008)
> @@ -9,6 +9,8 @@
>   * University of Stuttgart.  All rights reserved.
>   * Copyright (c) 2004-2006 The Regents of the University of California.
>   * All rights reserved.
> + * Copyright (c) 2008  Myricom. All rights reserved.
> + * 
>   * $COPYRIGHT$
>   *
>   * Additional copyrights may follow
> @@ -21,11 +23,29 @@
>  #include "ompi/constants.h"
>  #include "common_mx.h"
>  
> +#ifdef HAVE_MALLOC_H
> +#include <malloc.h>
> +#endif
> +#include "opal/memoryhooks/memory.h"
> +#include "opal/mca/base/mca_base_param.h"
> +#include "ompi/runtime/params.h"
> +#include "ompi/mca/mpool/mpool.h"
> +#include "ompi/mca/mpool/base/base.h"
> +#include "ompi/mca/mpool/fake/mpool_fake.h"
> +
> +
> +int mx__regcache_clean(void *ptr, size_t size);
> +
>  static int ompi_common_mx_initialize_ref_cnt = 0;
> +static mca_mpool_base_module_t *ompi_common_mx_fake_mpool = 0;
> +
>  int
>  ompi_common_mx_initialize(void)
>  {
>  mx_return_t mx_return;
> +struct mca_mpool_base_resources_t mpool_resources;
> +int index, value;
> +
>  ompi_common_mx_initialize_ref_cnt++;
>  
>  if(ompi_common_mx_initialize_ref_cnt == 1) { 
> @@ -35,7 +55,37 @@
>   * library does not exit the application.
>   */
>  mx_set_error_handler(MX_ERRORS_RETURN);
> -
> + 
> + /* If we have a memory manager available, and
> +mpi_leave_pinned == -1, then set mpi_leave_pinned to 1.
> +
> +We have a memory manager if:
> +- we have both FREE and MUNMAP support
> +- we have MUNMAP support and the linux mallopt */
> + value = opal_mem_hooks_support_level();
> + if (((value & (OPAL_MEMORY_FREE_SUPPORT | OPAL_MEMORY_MUNMAP_SUPPORT))
> +  == (OPAL_MEMORY_FREE_SUPPORT | OPAL_MEMORY_MUNMAP_SUPPORT))
> + || ((value & OPAL_MEMORY_MUNMAP_SUPPORT) &&
> + OMPI_MPOOL_BASE_HAVE_LINUX_MALLOPT)) {
> +   index = mca_base_param_find("mpi", NULL, "leave_pinned");
> +   if (index >= 0)
> +if ((mca_base_param_lookup_int(index, &value) == OPAL_SUCCESS) 
> + && (value == -1)) {
> +   
> +   ompi_mpi_leave_pinned = 1;
> +   setenv("MX_RCACHE", "2", 1);
> +   mpool_resources.regcache_clean = mx__regcache_clean;
> +   ompi_common_mx_fake_mpool = 
> + mca_mpool_base_module_create("fake", NULL, &mpool_resources);
> +   if (!ompi_common_mx_fake_mpool) {
> + omp

[OMPI devel] RFC: windows branch merge

2008-12-10 Thread Shiqing Fan

Hi all,

We have just merged the windows branch into the trunk, split into 4 patches 
(r20108 to r20111) to keep them separate. Incoming changes on the trunk had 
caused some compile errors on Windows, which we fixed, and we tested the 
following before committing:


Windows x86-64, static libs compilation and running
Windows x86-64, shared libs compilation and running
Linux x86-64, compilation and running

The Windows tests were done using CMake, selecting C, C++ and Fortran. The 
ompi wrappers have been tested with Visual Studio, and the orte tools seem 
to work. The MCA components that work under Windows are now marked with a 
.windows file in the corresponding folders.


To keep track of the proposed merge into a v1.3.x release, ticket #1708 has 
been opened. If it is decided to add this to a later release, additional 
patches may be added to the ticket.


Thank you very much.


With Best Regards,
Rainer and Shiqing




[OMPI devel] 1.3 staging area?

2008-12-10 Thread Ralph Castain

Hi all

I'm a tad concerned about our ability to test proposed CMR's for the  
1.3 branch. Given the long delays in getting 1.3 out, and the rapidly  
looming 1.4 milestones that many of us have in our individual  
projects, it is clear that the trunk is going to quickly diverge  
significantly from what is in the 1.3 branch. In addition, we are  
going to see quite a few commits occurring within a restricted time  
period.


Thus, the fact that some proposed change does or does not pass MTT  
tests on the trunk at some given point in time is no longer a reliable  
indicator of its behavior in 1.3. Likewise, it will be difficult to  
isolate that "this commit is okay" when MTT can really only tell us  
the state of the aggregated code base.


Let me hasten to point out that this has been a recurring problem with  
every major release. We have discussed the problem on several  
occasions, but failed to reach consensus on a solution.


I would like to propose that we create a 1.3 staging branch. This branch 
would be opened to one individual at a time to commit proposed CMRs for the 
1.3 branch. We would ask that people please include the staging branch in 
their MTT testing on occasions when a change has been made. Once the 
proposed change has been validated, it can then be brought over as a single 
(and easy) merge to the 1.3 release branch.


I realize this may slow the passage of bug fixes somewhat, and  
obviously we should apply this on a case-by-case basis (e.g., a simple  
removal of an unused variable would hardly merit such a step).  
However, I believe that something like the IOF patch that needs to  
eventually move to 1.3, and the Windows upgrade, are examples that  
probably do merit this step.


Just a suggestion - hope it helps.
Ralph



[OMPI devel] Fwd: [OMPI users] Onesided + derived datatypes

2008-12-10 Thread Brian Barrett

Hi all -

I looked into this, and it appears to be datatype related.  If the 
displacements are set to 3, 2, 1, 0, then the datatype will fail the type 
checks for one-sided because is_overlapped() returns 1 for the datatype.  
My reading of the standard seems to indicate that it should not.  I haven't 
looked into the problems with the displacements set to 0, 1, 2, 3, but I'm 
guessing it has something to do with the reverse problem.


This looks like a datatype issue, so it's out of my realm of  
expertise.  Can someone else take a look?


Brian

Begin forwarded message:


From: doriankrause 
Date: December 10, 2008 4:07:55 PM MST
To: us...@open-mpi.org
Subject: [OMPI users] Onesided + derived datatypes
Reply-To: Open MPI Users 

Hi List,

I have an MPI program which uses one-sided communication with derived
datatypes (MPI_Type_create_indexed_block). I developed the code with
MPICH2 and unfortunately didn't think about trying it out with
OpenMPI. Now that I'm "porting" the application to OpenMPI, I'm facing
some problems. On most machines I get a SIGSEGV in MPI_Win_fence;
sometimes an invalid-datatype error shows up. I ran the program in Valgrind
and didn't get anything valuable. Since I can't see a reason for this
problem (at least if I understand the standard correctly), I wrote the
attached test program.

Here are my experiences:

* If I compile without ONESIDED defined, everything works and V1 and V2
  give the same results.
* If I compile with ONESIDED and V2 defined (MPI_Type_contiguous), it works.
* ONESIDED + V1 + O2: No errors, but obviously nothing is sent? (Am I right
  in assuming that V1+O2 and V2 should be equivalent?)
* ONESIDED + V1 + O1:
  [m02:03115] *** An error occurred in MPI_Put
  [m02:03115] *** on win
  [m02:03115] *** MPI_ERR_TYPE: invalid datatype
  [m02:03115] *** MPI_ERRORS_ARE_FATAL (goodbye)

I didn't get a segfault as in the "real life" example, but if ompitest.cc
is correct, it means that OpenMPI is buggy when it comes to one-sided
communication and (some) derived datatypes, so the problem is probably not
in my code.

I'm using OpenMPI 1.2.8 with the newest gcc, 4.3.2, but the same behaviour
can be seen with gcc 3.3.1 and Intel 10.1.

Please correct me if ompitest.cc contains errors. Otherwise, I would be
glad to hear how I should report these problems to the developers (if they
don't read this).

Thanks + best regards

Dorian






Attachment: ompitest.tar.gz (GNU Zip compressed data)
