Re: [OMPI devel] Very poor performance with btl sm on twin nehalem servers with Mellanox ConnectX installed

2010-05-18 Thread Sylvain Jeaugey
I would go further on this: when available, putting the session directory
in a tmpfs filesystem (e.g. /dev/shm) should give you the maximum 
performance.


Again, when using /dev/shm instead of the local /tmp filesystem, I get a 
consistent 1-5us latency improvement on a barrier at 32 cores (on a single 
node). So it may not be noticeable for everyone, but it seems faster in 
all cases.
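
For anyone who wants to try it, the knob is the orte_tmpdir_base MCA
parameter that the FAQ already mentions; something along these lines
(the process count and application name are just placeholders):

   mpirun --mca orte_tmpdir_base /dev/shm -np 32 ./my_app

This of course assumes /dev/shm exists and is tmpfs-backed on your nodes.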


Sylvain

On Mon, 17 May 2010, Paul H. Hargrove wrote:


Entry looks good, but could probably use an additional sentence or two like:

On diskless nodes running Linux, use of /dev/shm may be an option if 
supported by your distribution.  This will use an in-memory file system for 
the session directory, but will NOT result in a doubling of the memory 
consumed for the shared memory file (i.e. file system "blocks" and memory 
"pages" share a single instance).


-Paul

Jeff Squyres wrote:

How's this?

http://www.open-mpi.org/faq/?category=sm#poor-sm-btl-performance

What's the advantage of /dev/shm?  (I don't know anything about /dev/shm)


On May 17, 2010, at 4:08 AM, Sylvain Jeaugey wrote:



I agree with Paul on the fact that a FAQ update would be great on this
subject. /dev/shm seems a good place to put the temporary files (when
available, of course).

Putting files in /dev/shm also showed better performance on our systems,
even with /tmp on a local disk.

Sylvain

On Sun, 16 May 2010, Paul H. Hargrove wrote:



If I google "ompi sm btl performance" the top match is
 http://www.open-mpi.org/faq/?category=sm

I scanned the entire page from top to bottom and don't see any questions of
the form
   Why is SM performance slower than ...?

The words "NFS", "network", "file system" or "filesystem" appear nowhere on
the page.  The closest I could find is


7. Where is the file that sm will mmap in?

The file will be in the OMPI session directory, which is typically
something like /tmp/openmpi-sessions-myusername@mynodename* . The file
itself will have the name shared_mem_pool.mynodename. For example, the full
path could be
/tmp/openmpi-sessions-myusername@node0_0/1543/1/shared_mem_pool.node0.

To place the session directory in a non-default location, use the MCA
parameter orte_tmpdir_base.


which says nothing about where one should or should not place the session
directory.

Not having read the entire FAQ from start to end, I will not contradict
Ralph's claim that the "your SM performance might suck if you put the session
directory on a remote filesystem" FAQ entry does exist, but I will assert
that I did not find it in the SM section of the FAQ.  I tried google on "ompi
session directory" and "ompi orte_tmpdir_base" and still didn't find whatever
entry Ralph is talking about.  So, I think the average user with no clue
about the relationship between the SM BTL and the session directory would
need some help finding it.  Therefore, I still feel an FAQ entry in the SM
category is warranted, even if it just references whatever entry Ralph is
referring to.

-Paul

Ralph Castain wrote:

We have had a FAQ on this for a long time... problem is, nobody reads it :-/


Glad you found the problem!

On May 14, 2010, at 3:15 PM, Paul H. Hargrove wrote:




Oskar Enoksson wrote:



Christopher Samuel wrote:




On 13/05/10 20:56, Oskar Enoksson wrote:




The problem is that I get very bad performance unless I
explicitly exclude the "sm" btl and I can't figure out why.



Recently someone reported issues which were traced back to
the fact that the files that sm uses for mmap() were in a
/tmp which was NFS mounted; changing the location where their
files were kept to another directory with the orte_tmpdir_base
MCA parameter fixed that issue for them.

Could it be similar for yourself ?

cheers,
Chris



That was exactly right: as you guessed, these are diskless nodes that
mount the root filesystem over NFS.

Setting orte_tmpdir_base to /dev/shm and btl_sm_num_fifos=9 and then
running mpi_stress on eight cores measures speeds of 1650MB/s for
1MB messages and 1600MB/s for 10kB messages.

Thanks!
/Oskar




Sounds like a new FAQ entry is warranted.

-Paul

--
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
HPC Research Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900




--
Paul H. Hargrove  

Re: [OMPI devel] RFC: Remove all other paffinity components

2010-05-18 Thread Jeff Squyres
Just chatted with Ralph about this on the phone and he came up with a slightly 
better compromise...

He points out that we really don't need *all* of the hwloc API (there's a 
bajillion tiny little accessor functions).  We could provide a steady, 
OPAL/ORTE/OMPI-specific API (probably down in opal/util or somesuch) with a 
dozen or two (or whatever) functions that we really need.  These functions can 
either call their back-end hwloc counterparts or they could do something safe 
if hwloc is not present / not supported / etc.

That would alleviate the need to put #if OPAL_HAVE_HWLOC elsewhere in the code 
base.  But the code calling opal_hwloc_() needs to be able to gracefully 
handle the failure case where it returns OPAL_ERR_NOT_SUPPORTED (etc.).
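
Just to make the shape of that concrete, here is roughly what one such
wrapper could look like -- the function name, file locations, and the exact
hwloc calls are all strawman, not a final design:

/* opal/util/hwloc.h (strawman) */
int opal_hwloc_num_cores(int *num);    /* hypothetical example wrapper */

/* opal/util/hwloc.c (strawman) */
#include "opal/constants.h"   /* OPAL_SUCCESS, OPAL_ERROR, OPAL_ERR_NOT_SUPPORTED */
#if OPAL_HAVE_HWLOC
#include <hwloc.h>
#endif

int opal_hwloc_num_cores(int *num)
{
#if OPAL_HAVE_HWLOC
    hwloc_topology_t topo;

    /* build a topology and count cores via the back-end hwloc API
       (a real version would presumably build the topology once and cache it) */
    if (0 != hwloc_topology_init(&topo)) {
        return OPAL_ERROR;
    }
    if (0 != hwloc_topology_load(topo)) {
        hwloc_topology_destroy(topo);
        return OPAL_ERROR;
    }
    *num = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    hwloc_topology_destroy(topo);
    return OPAL_SUCCESS;
#else
    /* hwloc configured out (e.g., embedded builds): callers must handle this */
    return OPAL_ERR_NOT_SUPPORTED;
#endif
}

Callers would then check for OPAL_ERR_NOT_SUPPORTED and fall back, rather
than sprinkling #if OPAL_HAVE_HWLOC around the code base.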



On May 17, 2010, at 8:25 PM, Jeff Squyres (jsquyres) wrote:

> On May 17, 2010, at 7:59 PM, Barrett, Brian W wrote:
> 
> > HWLOC could be extended to support Red Storm, probably, but we don't have 
> > the need or time to do such an implementation. 
> 
> Fair enough.
> 
> > Given that, I'm not really picky about what the method of not breaking an 
> > existing supported platform is, but I think having HAVE_HWLOC defines 
> > everywhere is a bad idea...
> 
> We need a mechanism to have hwloc *not* be there, particularly for embedded 
> environments -- where hwloc would add no value.  This is apparently just like 
> Red Storm, but even worse because we need to keep the memory footprint down 
> as much as possible (libhwloc.so.0.0 on linux is 104KB -- libhwloc.a is 139KB 
> -- both are big numbers when you only have a few MB of usable RAM).  So even 
> leaving stubs doesn't seem like a good idea -- they'll take up space, too.  
> And the hwloc API is fairly large -- maintaining stubs for all the API 
> functions could be a daunting task.
> 
> I think embedding is the main reason I can't think of any better idea than 
> #if OPAL_HAVE_HWLOC.
> 
> I anticipate that hwloc usage would be fairly localized in the OMPI code base:
> 
> int btl_sm_setup_stuff(...)
> {
> #if OPAL_HAVE_HWLOC
>  ...do interesting hwloc things...
>  ...setup stuff on btl_sm_component...
>  btl_sm_component.have_hwloc = 1;
> #else
>  btl_sm_component.have_hwloc = 0;
> #endif
> }
> 
> int btl_sm_other_stuff(...)
> {
> if (btl_sm_component.have_hwloc) {
> ...use the hwloc info...
> }
> }
> 
> But I'm certainly open to other ideas -- got any?
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] RFC: Remove all other paffinity components

2010-05-18 Thread Terry Dontje

Jeff Squyres wrote:

Just chatted with Ralph about this on the phone and he came up with a slightly 
better compromise...

He points out that we really don't need *all* of the hwloc API (there's a 
bajillion tiny little accessor functions).  We could provide a steady, 
OPAL/ORTE/OMPI-specific API (probably down in opal/util or somesuch) with a 
dozen or two (or whatever) functions that we really need.  These functions can 
either call their back-end hwloc counterparts or they could do something safe 
if hwloc is not present / not supported / etc.

That would alleviate the need to put #if OPAL_HAVE_HWLOC elsewhere in the code base.  
But the code calling opal_hwloc_() needs to be able to gracefully handle 
the failure case where it returns OPAL_ERR_NOT_SUPPORTED (etc.).


  
The above sounds like you are replacing the whole paffinity framework 
with hwloc.  Is that true?  Or are the hwloc accessors you are talking 
about non-paffinity related?


--td




--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.650.633.7054
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com



[OMPI devel] /dev/shm usage (was: Very poor performance with btl sm...)

2010-05-18 Thread Jeff Squyres
Ralph and I talked about this on the phone a bit this morning.  There are several 
complicating factors in using /dev/shm (aren't there always? :-) ).

0. Note that anything in /dev/shm will need to have session-directory-like 
semantics: there needs to be per-user and per-job characteristics (e.g., if the 
same user launches multiple jobs on the same node, etc.).

1. It is not necessarily a good idea to put the entire session directory in 
/dev/shm.  It's not just the shared memory files that go in the session 
directory; there's a handful of other metadata files that go in there as well. 
Those files don't take up much space, but it still feels wrong to put anything 
other than shared memory files in there.  Indeed, checkpoint files and filem 
files can go in there -- these can eat up lots of space (RAM).  

2. /dev/shm may not be configured right, and/or there are possible /dev/shm 
configurations where you *do* use twice the memory (Ralph cited an example of a 
nameless organization that had exactly this problem -- we don't know if this 
was a misconfiguration or whether it was done on purpose for some reason).  I 
don't know if kernel version comes into play here, too (e.g., if older Linux 
kernel versions did double the memory, or somesuch).  So it's not necessarily a 
slam dunk that you *always* want to do this.

3. The session directory has "best effort" cleanup at the end of the job:

- MPI jobs (effectively) rm -rf the session directory
- The orted (effectively) rm -rf's the session directory

But neither of these is *guaranteed* -- for example, if the resource manager 
kills the job with extreme prejudice, the session directory can be left around. 
 Where possible, ORTE tries to put the session directory in a resource manager 
job-specific-temp directory so that the resource manager itself whacks the 
session directory at the end of the job.  But this isn't always the case.

So the session directory has 2 levels of attempted cleanup (MPI procs and 
orted), and sometimes a 3rd (the resource manager).

3a. If the session directory is in /dev/shm, we get the 2 levels but definitely 
not the 3rd (note: I don't think that putting the whole session directory in 
/dev/shm is a good idea, per #1 -- I'm just being complete).

3b. If the shared memory files are outside the session directory, we don't get 
any of the additional cleanup without adding some additional infrastructure -- 
possibly into orte/util/session_dir.* (e.g., add /dev/shm as a secondary 
session directory root).  This would allow us to effect session directory-like 
semantics inside /dev/shm. 

4. But even with 2 levels of possible cleanup, not having the resource manager 
cleanup can be quite disastrous if shared memory is left around after a job is 
forcibly terminated.  Sysadmins can do stuff like rm -rf /dev/shm (or whatever) 
between jobs to guarantee cleanup, but that would require extra steps outside 
of OMPI.  

--> This seems to imply that using /dev/shm should not be default behavior.

-

All this being said, it seems like 3b is a reasonable way to go forward: extend 
orte/util/session_dir.* to allow for multiple session directory roots (somehow 
-- exact mechanism TBD).  Then both the MPI processes and the orted will try to 
clean up both the real session directory and /dev/shm.  Both roots will use the 
same per user/per job kinds of characteristics that the session dir already 
has.  

Then we can extend the MCA param orte_tmpdir_base to accept a comma-delimited 
list of roots.  It still defaults to /tmp, but a sysadmin can set it to be 
/tmp,/dev/shm (or whatever) if they want to use /dev/shm.  OMPI will still do 
"best effort" cleanup of /dev/shm, but it's the sysadmin's responsibility to 
*guarantee* its cleanup after a job ends, etc.

Thoughts?



Re: [OMPI devel] RFC: Remove all other paffinity components

2010-05-18 Thread Jeff Squyres
On May 18, 2010, at 8:31 AM, Terry Dontje wrote:

> The above sounds like you are replacing the whole paffinity framework with 
> hwloc.  Is that true?  Or is the hwloc accessors you are talking about 
> non-paffinity related?

Good point; these have all gotten muddled in the email chain.  Let me re-state 
everything in one place in an attempt to be clear:

1. Split paffinity into two frameworks (because some OS's support one and not 
the other):
  - binding: just for getting and setting processor affinity
  - hwmap: just for mapping (board, socket, core, hwthread) <--> OS processor ID
  --> Note that hwmap will be an expansion of the current paffinity capabilities

2. Add hwloc to opal
  - Commit the hwloc tree to opal/util/hwloc (or somesuch)
  - Have the ability to configure hwloc out (e.g., for embedded environments)
  - Add a dozen or two hwloc wrappers in opal/util/hwloc.c|h
  - The rest of the OPAL/ORTE/OMPI trees *only call these wrapper functions* -- 
they do not call hwloc directly
  - These wrappers will call the back-end hwloc functions or return 
OPAL_ERR_NOT_SUPPORTED (or somesuch) if hwloc is not available

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] RFC 2/2: merge the OPAL SOS development branch into trunk

2010-05-18 Thread Josh Hursey

Abhishek and Jeff,

Awesome! Thanks for all your hard work maintaining and shepherding  
this branch into the trunk.


-- Josh

On May 17, 2010, at 9:20 PM, Abhishek Kulkarni wrote:



On May 14, 2010, at 12:24 PM, Josh Hursey wrote:



On May 12, 2010, at 1:07 PM, Abhishek Kulkarni wrote:


Updated RFC (w/ discussed changes):

==========================================================
[RFC 2/2] merge the OPAL SOS development branch into trunk
==========================================================

WHAT: Merge the OPAL SOS development branch into the OMPI trunk.

WHY: Bring over some of the work done to enhance error reporting  
capabilities.


WHERE: opal/util/ and a few changes in the ORTE notifier.

TIMEOUT: May 17, Monday, COB.

REFERENCE BRANCHES: http://bitbucket.org/jsquyres/opal-sos-fixed/

==========================================================



BACKGROUND:

The OPAL SOS framework tries to meet the following objectives:

- Reduce the cascading error messages and the amount of code needed to
  print an error message.
- Build and aggregate stacks of encountered errors and associate
related individual errors with each other.
- Allow registration of custom callbacks to intercept error events.

The SOS system provides an interface to log events of varying
severities.  These events are associated with an "encoded" error code
which can be used to refer to stacks of SOS events. When logging
events, they can also be transparently relayed to all the activated
notifier components.

The SOS system is described in detail on this wiki page:

http://svn.open-mpi.org/trac/ompi/wiki/ErrorMessages
https://svn.open-mpi.org/trac/ompi/attachment/wiki/ErrorMessages/OPAL_SOS.pdf

CHANGES (since the last RFC):

* Wrapped all hard-coded error-code checks (OMPI_ERR_* == ret),
OPAL_SOS_GET_ERR_CODE(ret). There were about 30-40 such checks
each in the OMPI and ORTE layer and about 15 in the OPAL layer.
Since OPAL_SUCCESS is preserved by SOS, also changed calls of
the form (OPAL_SUCCESS != ret) to (OPAL_ERROR == ret).


You mean the other way around, right?
You changed code that previously looked like (OPAL_ERROR == ret) to  
(OPAL_SUCCESS != ret) where appropriate.





Yes, thanks for the correction! This (and ORTE WDC) is all in trunk  
now -- I've split the changes into smaller patches (see commits  
r23155 - r23164) so that they are easier to sift through.


Abhishek




* If the error is an SOS-encoded error, ORTE_ERROR_LOG decodes
the error, prints out the error stack and frees the errors.

==========================================================




On Mar 29, 2010, at 10:58 AM, Abhishek Kulkarni wrote:



==========================================================
[RFC 2/2]
==========================================================


WHAT: Merge the OPAL SOS development branch into the OMPI trunk.

WHY: Bring over some of the work done to enhance error reporting  
capabilities.


WHERE: opal/util/ and a few changes in the ORTE notifier.

TIMEOUT: April 6, Wednesday, COB.

REFERENCE BRANCHES: http://bitbucket.org/jsquyres/opal-sos-fixed/

==========================================================


BACKGROUND:

The OPAL SOS framework tries to meet the following objectives:

- Reduce the cascading error messages and the amount of code needed to
  print an error message.
- Build and aggregate stacks of encountered errors and associate
related individual errors with each other.
- Allow registration of custom callbacks to intercept error events.

The SOS system provides an interface to log events of varying
severities.  These events are associated with an "encoded" error code
which can be used to refer to stacks of SOS events. When logging
events, they can also be transparently relayed to all the activated
notifier components.

The SOS system is described in detail on this wiki page:

http://svn.open-mpi.org/trac/ompi/wiki/ErrorMessages

Feel free to comment and/or provide suggestions.

==========================================================





Re: [OMPI devel] RFC 2/2: merge the OPAL SOS development branch into trunk

2010-05-18 Thread Jeff Squyres
Indeed.  Nice job yesterday, Abhishek.  You did it better than my hwloc merge 
into the trunk!  :-)


On May 18, 2010, at 9:20 AM, Josh Hursey wrote:

> Abhishek and Jeff,
> 
> Awesome! Thanks for all your hard work maintaining and shepherding 
> this branch into the trunk.
> 
> -- Josh

Re: [OMPI devel] /dev/shm usage (was: Very poor performance with btl sm...)

2010-05-18 Thread Jeff Squyres
I was reminded this morning (by 2 people :-) ) that the sysv shmem stuff was 
initiated a long time ago as a workaround for many of these same issues 
(including the potential performance issues).

Sam's work is nearly complete; I think that -- at least on Linux -- the mmap 
performance issues can go away.  The cleanup issues will not go away; it still 
requires external help to *guarantee* that shared memory IDs are removed after 
the job has completed.
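
For reference, the kind of external help I mean is nothing fancy -- e.g., a
resource-manager epilogue that lists and removes any stale System V segments
between jobs (exact commands and flags vary by platform; this is only a
sketch, and <shmid> is a placeholder):

   ipcs -m              # list the System V shared memory segments
   ipcrm -m <shmid>     # remove a stale segment by its id

On Linux a site could script that per user, the same way some sites already
scrub /tmp or /dev/shm between jobs.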



Re: [OMPI devel] /dev/shm usage

2010-05-18 Thread Paul H. Hargrove

Jeff Squyres wrote:
[snip]

Ralph and I talked about this on the phone a bit this morning.  There's several 
complicating factors in using /dev/shm (aren't there always? :-) ).
  

[snip]

--> This seems to imply that using /dev/shm should not be default behavior.
  

[snip]


I agree that /dev/shm introduces extra complications and should not be 
the default.  The FAQ text I provided was intended to suggest /dev/shm 
as a session dir (or session root) ONLY for people who had diskless 
nodes and thus no obvious alternatives to network-mounted /tmp.


If one wants to pursue placing the SM BTL's shared memory files in 
/dev/shm by default, that is independent of adding something to the new 
FAQ entry to address the diskless case.


-Paul

--
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group Tel: +1-510-495-2352
HPC Research Department   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory 



Re: [OMPI devel] /dev/shm usage

2010-05-18 Thread Jeff Squyres
On May 18, 2010, at 10:58 AM, Paul H. Hargrove wrote:

> I agree that /dev/shm introduces extra complications and should not be
> the default.  The FAQ text I provided was intended to suggest /dev/shm
> as a session dir (or session root) ONLY for people who had diskless
> nodes and thus no obvious alternatives to network-mounted /tmp.
> 
> If one wants to pursue placing the SM BTL's shared memory files in
> /dev/shm by default, that is independent of adding something to the new
> FAQ entry to address the diskless case.

Excellent; thanks for the clarification!

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI devel] Bug in opal sos changes

2010-05-18 Thread Rolf vandeVaart
I am getting SEGVs while running the IMB-MPI1 tests.  I believe the 
problem has to do with changes made to the group_init.c file last 
night.  The error occurs in the call to MPI_Comm_split. 


There were 4 changes in the file that look like this:
OLD:
if (OMPI_ERROR == new_group->grp_f_to_c_index)

NEW:
if (OMPI_SUCCESS != new_group->grp_f_to_c_index)

If I change it back, things work.  I understand the idea of changing the 
logic, but maybe that does not apply in this case?  When running with 
np=2, the value of new_group->grp_f_to_c_index is 4, which does not 
equal OMPI_SUCCESS, so we end up in an error condition which results 
in a null pointer later on.


Am I the only one that has run into this?

Rolf




Re: [OMPI devel] Bug in opal sos changes

2010-05-18 Thread Jeff Squyres
Looks like the comparison to OMPI_ERROR worked by accident -- just because it 
happened to have a value of -1.  

The *_f_to_c_index values are the return value from a call to 
opal_pointer_array_add().  This value will either be non-negative or -1.  -1 
indicates a failure.  It's not an *_ERR_* code -- it's a -1 index value.  So 
the comparisons should really have been to -1 in the first place.

Rolf / Abhishek -- can you fix?  Did this happen in other *_f_to_c_index member 
variable comparisons elsewhere?





-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] Bug in opal sos changes

2010-05-18 Thread Rolf vandeVaart
I think we are almost saying the same thing. But to be sure, I will 
restate. The call to opal_pointer_array_add() can return either an index 
(which I assume is a positive integer, maybe also 0?) or 
OPAL_ERR_OUT_OF_RESOURCE (which is a -2) if it cannot malloc any more 
space in the table.  So, I guess I agree that the original code was 
wrong as it never would have detected the error since OMPI_ERROR != 
OPAL_ERR_OUT_OF_RESOURCE.  (-1 != -2)


Since we are overloading the return value, it seems like the only thing 
we could do is something like this:


if (new_group->grp_f_to_c_index < 0)
  error();

But that does not follow the SOS logic as the key is that we want to 
compare to OMPI_SUCCESS (I think).  Also, for what it is worth, the 
setting of the grp_f_to_c_index happens in the group constructor, so we 
cannot get at the return value of opal_pointer_array_add() except by 
looking at the grp_f_to_c value after the object is constructed.  I am 
not sure of the correct way to handle this.


Rolf





Re: [OMPI devel] Bug in opal sos changes

2010-05-18 Thread Abhishek Kulkarni


On Tue, 18 May 2010, Rolf vandeVaart wrote:


I think we are almost saying the same thing. But to be sure, I will restate. 
The call to opal_pointer_array_add() can return either an index (which I assume 
is a positive
integer, maybe also 0?) or OPAL_ERR_OUT_OF_RESOURCE (which is a -2) if it 
cannot malloc anymore space in the table.  So, I guess I agree that the 
original code was wrong as
it never would have detected the error since OMPI_ERROR != 
OPAL_ERR_OUT_OF_RESOURCE.  (-1 != -2)

Since we are overloading the return value, it seems like the only thing we 
could do is something like this:

if (new_group->grp_f_to_c_index < 0)
   error();



Yes, that looks like the right thing to do.


But that does not follow the SOS logic as the key is that we want to compare to 
OMPI_SUCCESS (I think).  Also, for what it is worth, the setting of the 
grp_f_to_c_index
happens in the group constructor, so we cannot get at the return value of 
opal_pointer_array_add() except by looking at the grp_f_to_c value after the 
object is
constructed.  I am not sure the correct way to handle this.



The only reason we replace the OMPI_ERROR checks with OMPI_SUCCESS is 
because when SOS logs an error in its internal data structures it returns 
a new reference to the error (an encoded error-code which SOS can use to 
locate the error). So, OMPI_ERROR is not OMPI_ERROR anymore but an SOS 
encoded OMPI_ERROR. We could always wrap the code to be checked with a 
call to extract its native error code and then perform the check like


  if (0 > OPAL_SOS_GET_ERROR_CODE(new_group->grp_f_to_c_index)) {
 error();
  }

In a lot of places (where functions return a boolean OMPI_SUCCESS or 
OMPI_ERROR), it was perfectly legit to just switch the way it's done, but 
for the opal_pointer_array_add() and mca_base_param_* functions, which 
return an index or an error, the above transformation seems to be the way 
to go.
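
In other words, spelling out the two patterns (illustrative snippets only, 
not actual trunk code; some_function, table and ptr are placeholders):

    int rc, idx;

    /* boolean-style return: compare against OMPI_SUCCESS; the native code
     * can be recovered with OPAL_SOS_GET_ERROR_CODE(rc) if needed */
    rc = some_function();
    if (OMPI_SUCCESS != rc) {
        /* handle error */
    }

    /* index-style return (opal_pointer_array_add(), mca_base_param_*):
     * non-negative index on success, negative error code on failure */
    idx = opal_pointer_array_add(table, ptr);
    if (0 > idx) {
        /* handle error */
    }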


I'll send in a patch with these changes.

Abhishek



Re: [OMPI devel] Bug in opal sos changes

2010-05-18 Thread Ralph Castain
Hmmm... well, the way that function -used- to work was it returned an error
code, and had the index as an int* param in the function call. Tim P changed
it awhile back (don't remember exactly why, but it was when he moved the
pointer_array code from orte to opal), and I'm not sure the fixes it
required were ever propagated everywhere (I occasionally run across them in
ORTE, though I think I've got them all now).

My point: the only real fix may be to go back to the old API and quit
overloading the return code.
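
To spell out the difference (signatures sketched from memory here, not 
verbatim from the tree):

    /* today: the return value is overloaded -- a non-negative index on
     * success, a negative error code on failure */
    idx = opal_pointer_array_add(table, ptr);
    if (0 > idx) { /* error */ }

    /* the older style: error status and index kept separate */
    rc = pointer_array_add(&idx, table, ptr);   /* illustrative signature */
    if (OPAL_SUCCESS != rc) { /* error */ }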




Re: [OMPI devel] The "Missing Symbol" issue and OpenMPI on NetBSD

2010-05-18 Thread Jeff Squyres
I added several FAQ items -- how do they look?

http://www.open-mpi.org/faq/?category=troubleshooting#erroneous-file-not-found-message
http://www.open-mpi.org/faq/?category=troubleshooting#missing-symbols
http://www.open-mpi.org/faq/?category=building#install-overwrite


On May 17, 2010, at 9:15 AM, Jeff Squyres (jsquyres) wrote:

> On May 16, 2010, at 5:56 PM,  
>  wrote:
> 
> > > Have you tried building Open MPI with the --disable-dlopen configure flag?
> > >  This will slurp all of OMPI's DSOs up into libmpi.so -- so there's no
> > > dlopening at run-time.  Hence, your app (R) can dlopen libmpi.so, but then
> > > libmpi.so doesn't dlopen anything else -- all of OMPI's plugins are
> > > physically located in libmpi.so.
> >
> > Given your reasoning, that's gotta be worth a shot: wilco.
> 
> This issue has come up a few times on the list; I will add something to the 
> FAQ about this.
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] The "Missing Symbol" issue and OpenMPI on NetBSD

2010-05-18 Thread Kevin . Buckley
> I added several FAQ items -- how do they look?
>
> http://www.open-mpi.org/faq/?category=troubleshooting#erroneous-file-not-found-message
> http://www.open-mpi.org/faq/?category=troubleshooting#missing-symbols
> http://www.open-mpi.org/faq/?category=building#install-overwrite
>

  "This is due to some deep run time linker voodoo"

From what I have come to understand about this: I think that pretty
much covers it!

Seriously, this is good stuff to have "out there" though because,
as you point out, the info an installer/user gets back -- and through
which they might then first look to diagnose such issues -- may not
steer them in the direction it should.

Kevin

PS
A style as opposed to substance thing:

I did notice that the last one of the three seems to be using a
fixed-size width, whereas text in the first and second flows
into the browser window.

-- 
Kevin M. Buckley  Room:  CO327
School of Engineering and Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand