Re: [OMPI devel] Annual OMPI membership review: SVN accounts

2013-07-16 Thread Eugene Loh
Terry is dropping his account due to a change in "day job" 
responsibilities.  I'm retaining mine.  Oracle's status is changing from 
member to contributor.


On 7/16/2013 12:16 AM, Rainer Keller wrote:

Hi Josh,
thanks for the info. Was about to look at this mail...

Is Oracle / Sun not part of OMPI anymore?


Re: [OMPI devel] Annual OMPI membership review: SVN accounts

2013-07-09 Thread Eugene Loh

On 7/8/2013 3:32 PM, Jeff Squyres (jsquyres) wrote:

According to https://svn.open-mpi.org/trac/ompi/wiki/Admistrative%20rules, it 
is time for our annual review of Open MPI SVN accounts of these SVN repos: 
hwloc, mtt, ompi-docs, ompi-tests, ompi-www, ompi.

*** Organizations must reply by COB Friday, 12 July, 2013 ***
*** No reply means: delete all of my organization's SVN accounts

Each organization must reply and specify which of their accounts can stay and 
which should go.  I cross-referenced the SVN logs from all of our SVN 
repositories to see who has not committed anything in the past year.

Oracle
==
emallove: Ethan Mallove <ethan.mall...@oracle.com> **NO COMMITS IN LAST YEAR**
eugene:   Eugene Loh <eugene@oracle.com>
tdd:  Terry Dontje <terry.don...@oracle.com>

Please keep eugene, but close emallove and tdd.


Re: [OMPI devel] v1.7.0rc7

2013-02-26 Thread Eugene Loh

On 02/23/13 14:45, Ralph Castain wrote:

This release candidate is the last one we expect to have before release, so 
please test it. Can be downloaded from the usual place:

http://www.open-mpi.org/software/ompi/v1.7/


I haven't looked at this very carefully yet.  Maybe someone can confirm what I'm seeing?  If I try to "mpirun `pwd`", the job should 
fail (since I'm launching a directory rather than an executable).  With v1.7, however, the return status is 0.  (The error message 
also suggests some confusion.)


My experiment is to run

mpirun `pwd`
echo status is $status

Here is v1.7:

--
Open MPI tried to fork a new process via the "execve" system call but
failed.  This is an unusual error because Open MPI checks many things
before attempting to launch a child process.  This error may be
indicative of another problem on the target host.  Your job will now
abort.

  Local host:/workspace/eugene/v1.7-testing
  Application name:  Permission denied
--
status is 0

Here is v1.6:

--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--
status is 1


Re: [OMPI devel] 1.6.4rc5: final rc

2013-02-20 Thread Eugene Loh

On 02/20/13 07:54, Jeff Squyres (jsquyres) wrote:

All MTT testing looks good for 1.6.4.  There seems to be an MPI dynamics 
problem when --enable-sparse-groups is used, but this does not look like a 
regression to me.

I put out a final rc, because there was one more minor change to accommodate an 
MXM API change; it's in the usual place:

http://www.open-mpi.org/software/ompi/v1.6/

Unless something disastrous happens, I plan to release this as the final 1.6.4 
tomorrow.


I don't think this qualifies as "disastrous", but...

I've been trying to do some 1.6 testing on Solaris (Solaris 11, Oracle Studio 
compilers, both SPARC and x86).  Results generally look good.  The main issue 
appears to be:


- SPARC
  *AND*
- compile with "-m32 -xmemalign=8s" (the latter means assume at most 8-byte 
alignment, with sigbus for misalignment)
  *AND*
- openib

There is a sigbus during MPI_Init.  Specifically, if I go to btl_openib_frag.h 
out_constructor(), I see:

frag->sr_desc.wr_id = (uint64_t)(uintptr_t)frag;

and the left-hand side is on a 4-byte (but not 8-byte) boundary.  How hard 
would it be to get openib frags on 8-byte boundaries?
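
For reference, one generic way to force an allocation onto an 8-byte boundary
(a hedged sketch only, not how the OMPI free lists actually hand out frags):

#include <stdint.h>
#include <stdlib.h>

/* Over-allocate, round the returned pointer up to an 8-byte boundary, and
   stash the original pointer just below it so the block can be freed. */
static void *alloc_aligned8(size_t size)
{
    void *raw = malloc(size + 8 + sizeof(void *));
    uintptr_t p;
    if (NULL == raw) return NULL;
    p = ((uintptr_t)raw + sizeof(void *) + 7) & ~(uintptr_t)7;
    ((void **)p)[-1] = raw;
    return (void *)p;
}

static void free_aligned8(void *p)
{
    if (NULL != p) free(((void **)p)[-1]);
}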


Re: [OMPI devel] [patch] Invalid MPI_Status for null or inactive request

2012-10-04 Thread Eugene Loh

On 10/04/12 07:00, Kawashima, Takahiro wrote:

(1) MPI_SOURCE of MPI_Status for a null request must be MPI_ANY_SOURCE.

   This bug is caused by a use of an incorrect variable in
   ompi/mpi/c/wait.c (for MPI_Wait) and by an incorrect
   initialization of ompi_request_null in ompi/request/request.c
   (for MPI_Waitall and MPI_Testall).

(2) MPI_Status for an inactive request must be an empty status.

   This bug is caused by not updating a req_status field of an
   inactive persistent request object in ompi/request/req_wait.c
   and ompi/request/req_test.c.

(3) Possible BUS errors on sparc64 processors.

   r23554 fixed possible BUS errors on sparc64 processors.
   But the fix seems to be insufficient.

   We should use OMPI_STATUS_SET macro for all user-supplied
   MPI_Status objects.

The attached patch is for Open MPI trunk and it also fixes some
typos in comments. A program to reproduce bugs (1) and (2) is
also attached.
Again, I apologize for the delays in fixing #3.  Anyhow, the fix is 
available in r27403 and I updated trac ticket 3218.  This particular fix 
does not address #1 or #2.  Note that OMPI_STATUS_SET has been removed 
as part of r27403 and status structs can now be accessed directly in the 
OMPI C internals.
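
For reference, a minimal sketch in the spirit of items (1) and (2) above (my
own illustration, not the attached reproducer): waiting on a null request
should return the empty status, in particular MPI_SOURCE == MPI_ANY_SOURCE.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Request req = MPI_REQUEST_NULL;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    /* A wait on a null request must complete immediately with the
       empty status (MPI_ANY_SOURCE / MPI_ANY_TAG / MPI_SUCCESS). */
    MPI_Wait(&req, &status);
    printf("MPI_SOURCE=%d (expect %d)  MPI_TAG=%d (expect %d)\n",
           status.MPI_SOURCE, MPI_ANY_SOURCE, status.MPI_TAG, MPI_ANY_TAG);
    MPI_Finalize();
    return 0;
}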


Re: [OMPI devel] [patch] Invalid MPI_Status for null or inactive request

2012-10-04 Thread Eugene Loh
On 10/4/2012 4:00 AM, Kawashima, Takahiro wrote:
> Hi Open MPI developers,
>
> I found some bugs in Open MPI and attach a patch to fix them.
>
> The bugs are:
>
> (1) MPI_SOURCE of MPI_Status for a null request must be MPI_ANY_SOURCE.
>
> (2) MPI_Status for an inactive request must be an empty status.
>
> (3) Possible BUS errors on sparc64 processors.
>
>   r23554 fixed possible BUS errors on sparc64 processors.
>   But the fix seems to be insufficient.
>
>   We should use OMPI_STATUS_SET macro for all user-supplied
>   MPI_Status objects.
Regarding #3, see also trac 3218. I'm putting a fix back today. Sorry
for the delay. One proposed solution was extending the use of the
OMPI_STATUS_SET macros, but I think the consensus was to fix the problem
in the Fortran layer. Indeed, the Fortran layer already routinely
converts between Fortran and C statuses. The problem was that we started
introducing optimizations to bypass the Fortran-to-C conversion and that
optimization was employed too liberally (e.g., in situations that would
introduce the alignment errors you're describing). My patch will clean
that up. I'll try to put it back in the next few hours.


[OMPI devel] nightly tarballs

2012-10-02 Thread Eugene Loh
Where do I find the details on how the nightly tarballs are made from 
the SVN repos?


Re: [OMPI devel] making Fortran MPI_Status components public

2012-09-27 Thread Eugene Loh

On 9/27/2012 11:31 AM, N.M. Maclaren wrote:

On Sep 27 2012, Jeff Squyres (jsquyres) wrote:

..."that obscene hack"...

...configure mechanism...
Good discussion, but as far as my specific issue goes, it looks like 
it's some peculiar interaction between different compiler versions.  I'm 
asking some experts.


(In any case, PRIVATE appears to be doing its job.  The problem is not 
seeing something that's supposed to be PUBLIC.)


[OMPI devel] making Fortran MPI_Status components public

2012-09-26 Thread Eugene Loh
The ibm tests aren't building for me.  One of the issues is 
mprobe_usempif08.f90 trying to access status%MPI_SOURCE and 
status%MPI_TAG.  I assume this is supposed to work, but it doesn't.  
E.g., trunk with Oracle Studio compilers:


% cat a.f90
  use mpi_f08
  type(MPI_Status) status
  write(6,*) status%MPI_SOURCE
  write(6,*) status%MPI_TAG
  end
% mpifort -m64 -c a.f90

  write(6,*) status%MPI_SOURCE
^
"a.f90", Line = 3, Column = 21: ERROR: "MPI_SOURCE" is a private 
component of "MPI_STATUS" and cannot be used outside of the module.


  write(6,*) status%MPI_TAG
^
"a.f90", Line = 4, Column = 21: ERROR: "MPI_TAG" is a private component 
of "MPI_STATUS" and cannot be used outside of the module.


If I look in ompi/mpi/fortran/[base|use-mpi-f08-desc]/mpi-f08-types.f90, 
I see:


   type, BIND(C) :: MPI_Status
  integer :: MPI_SOURCE
  integer :: MPI_TAG
  integer :: MPI_ERROR
  integer(C_INT)    OMPI_PRIVATE :: c_cancelled
  integer(C_SIZE_T) OMPI_PRIVATE :: c_count
   end type MPI_Status

Should the first three components explicitly be made public?


[OMPI devel] trunk's mapping to nodes... local host

2012-09-07 Thread Eugene Loh
Maybe this is related to Reuti's "-hostfile ignored in 1.6.1" on the 
users mail list, but not quite sure.


Let's pretend my nodes are called local, r1, and r2.  That is, I launch 
mpirun from "local" and there are two other (remote) nodes available to 
me.  With the trunk (e.g., v1.9 r27247), I get


% mpirun --bynode --nooversubscribe --host r1,r1,r1,r2,r2,r2 -n 6 
--tag-output hostname

[1,0]:r1
[1,1]:r2
[1,2]:r1
[1,3]:r2
[1,4]:r1
[1,5]:r2

which seems right to me.  But when the local node is involved:

% mpirun --bynode --nooversubscribe --host 
local,local,local,r1,r1,r1 -np 4 --tag-output hostname

[1,0]:local
[1,1]:r1
[1,2]:r1
[1,3]:r1
% mpirun --bynode --nooversubscribe --host 
local,local,local,r1,r1,r1 -np 5 --tag-output hostname


--
There are not enough slots available in the system to satisfy the 5 slots
that were requested by the application:
  hostname

Either request fewer slots for your application, or make more slots available
for use.
--

I'm not seeing all the local slots I should be seeing.  We're seeing 
wide-scale MTT trunk failures due to this problem.


There is a similar loss of local slots with hostfile syntax.  E.g.,

% hostname
local
% cat   hostfile
local
r1
% mpirun --hostfile hostfile -n 2 hostname

--
A hostfile was provided that contains at least one node not
present in the allocation:

  hostfile:  hostfile
  node:  local

If you are operating in a resource-managed environment, then only
nodes that are in the allocation can be used in the hostfile. You
may find relative node syntax to be a useful alternative to
specifying absolute node names; see the orte_hosts man page for
further information.
--

The problem is solved with "--mca orte_default_hostfile hostfile".


[OMPI devel] trunk broken?

2012-08-30 Thread Eugene Loh
Trunk broken?  Last night, Oracle's MTT trunk runs all came up 
empty-handed.  E.g.,


*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: (nil)
[ 0] [0xe600]
[ 1] /lib/libc.so.6(strlen+0x33) [0x3fa0a3]
[ 2] /lib/libc.so.6(__strdup+0x25) [0x3f9de5]
[ 3] .../lib/openmpi/mca_db_hash.so [0xf7bbdd34]
[ 4] .../lib/libmpi.so.0(orte_util_decode_pidmap+0x5f4) [0xf7e46654]
[ 5] .../lib/libmpi.so.0(orte_util_nidmap_init+0x1b4) [0xf7e46d54]
[ 6] .../lib/openmpi/mca_ess_env.so [0xf7bc4f62]
[ 7] .../lib/libmpi.so.0(orte_init+0x160) [0xf7e2d250]
[ 8] .../lib/libmpi.so.0(ompi_mpi_init+0x163) [0xf7de2133]
[ 9] .../lib/libmpi.so.0(MPI_Init+0x13f) [0xf7dfb6df]
[10] ./c_ring [0x8048759]
[11] /lib/libc.so.6(__libc_start_main+0xdc) [0x3a0dec]
[12] ./c_ring [0x80486a1]
*** End of error message ***

r27182.  The previous night, with r27175, ran fine.  Quick peek at 27178 
seems fine (I think).


Re: [OMPI devel] r27078 and OMPI build

2012-08-29 Thread Eugene Loh

r27178 seems to build fine.  Thanks.

On 8/29/2012 7:42 AM, Shamis, Pavel wrote:

Eugene,
Can you please confirm that the issue is resolved on your setup ?

On Aug 29, 2012, at 10:14 AM, Shamis, Pavel wrote:

The issue #2 was fixed in r27178.


Re: [OMPI devel] r27078 and OMPI build

2012-08-24 Thread Eugene Loh
race  
--enable-heterogeneous
--enable-cxx-exceptions --enable-shared 
--enable-orterun-prefix-by-default --with-sge
--enable-mpi-f90 --with-mpi-f90-size=small  --disable-peruse 
--disable-state
--disable-mpi-thread-multiple   --disable-debug  --disable-mem-debug  
--disable-mem-profile
CFLAGS="-xtarget=ultra3 -m32 -xarch=sparcvis2 -xprefetch 
-xprefetch_level=2 -xvector=lib -Qoption
cg -xregs=no%appl -xdepend=yes -xbuiltin=%all -xO5"  
CXXFLAGS="-xtarget=ultra3 -m32
-xarch=sparcvis2 -xprefetch -xprefetch_level=2 -xvector=lib -Qoption cg 
-xregs=no%appl -xdepend=yes
-xbuiltin=%all -xO5 -Bstatic -lCrun -lCstd -Bdynamic"  
FFLAGS="-xtarget=ultra3 -m32 -xarch=sparcvis2
-xprefetch -xprefetch_level=2 -xvector=lib -Qoption cg -xregs=no%appl 
-stackvar -xO5"
FCFLAGS="-xtarget=ultra3 -m32 -xarch=sparcvis2 -xprefetch 
-xprefetch_level=2 -xvector=lib -Qoption
cg -xregs=no%appl -stackvar -xO5"

--prefix=/workspace/euloh/hpc/mtt-scratch/burl-ct-t2k-3/ompi-tarball-testing/installs/JA08/install
--mandir=${prefix}/man  --bindir=${prefix}/bin  --libdir=${prefix}/lib
--includedir=${prefix}/include   
--with-tm=/ws/ompi-tools/orte/torque/current/shared-install32
--enable-contrib-no-build=vt --with-package-string="Oracle Message Passing 
Toolkit "
--with-ident-string="@(#)RELEASE VERSION 1.9openmpi-1.5.4-r1.9a1r27092"


and the error he gets is:

make[2]: Entering directory

`/workspace/euloh/hpc/mtt-scratch/burl-ct-t2k-3/ompi-tarball-testing/mpi-install/s3rI/src/openmpi-1.9a1r27092/ompi/tools/ompi_info'
   CCLD ompi_info
Undefined                       first referenced
 symbol                             in file
mca_coll_ml_memsync_intra          ../../../ompi/.libs/libmpi.so
ld: fatal: symbol referencing errors. No output written to .libs/ompi_info
make[2]: *** [ompi_info] Error 2
make[2]: Leaving directory

`/workspace/euloh/hpc/mtt-scratch/burl-ct-t2k-3/ompi-tarball-testing/mpi-install/s3rI/src/openmpi-1.9a1r27092/ompi/tools/ompi_info'
make[1]: *** [install-recursive] Error 1
make[1]: Leaving directory

`/workspace/euloh/hpc/mtt-scratch/burl-ct-t2k-3/ompi-tarball-testing/mpi-install/s3rI/src/openmpi-1.9a1r27092/ompi'
make: *** [install-recursive] Error 1


On Aug 24, 2012, at 4:30 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:


I have access to a few different Solaris machines and can
offer to build the trunk if somebody tells me what configure
flags are desired.

-Paul

On Fri, Aug 24, 2012 at 8:54 AM, Ralph Castain <r...@open-mpi.org> wrote:

Eugene - can you confirm that this is only happening on
the one Solaris system? In other words, is this a
general issue or something specific to that one machine?

I'm wondering because if it is just the one machine,
then it might be something strange about how it is setup
- perhaps the version of Solaris, or it is configuring
--enable-static, or...

Just trying to assess how general a problem this might
be, and thus if this should be a blocker or not.

On Aug 24, 2012, at 8:00 AM, Eugene Loh <eugene@oracle.com> wrote:

> On 08/24/12 09:54, Shamis, Pavel wrote:
>> Maybe there is a chance to get direct access to this
system ?
> No.
>
> But I'm attaching compressed log files from
configure/make.
>
>







-- 
Paul H. Hargrove  phhargr...@lbl.gov

Future Technologies Group
Computer and Data Sciences Department  Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory  Fax: +1-510-486-6900





Re: [OMPI devel] r27078 and OMPI build

2012-08-24 Thread Eugene Loh

On 08/24/12 09:54, Shamis, Pavel wrote:

Maybe there is a chance to get direct access to this system ?

No.

But I'm attaching compressed log files from configure/make.



tarball-of-log-files.tar.bz2
Description: application/bzip


Re: [OMPI devel] MPI_Mprobe

2012-08-09 Thread Eugene Loh

On 8/7/2012 5:45 AM, Jeff Squyres wrote:

So the issue is when (for example) Fortran MPI_Recv says "hey, C ints are the same 
as Fortran INTEGERs, so I don't need a temporary MPI_Status buffer; I'll just use the 
INTEGER array that I was given, and pass it to the back-end C MPI_Recv() routine." 
Then C MPI_Recv() tries to write to the size_t variable, and it might be poorly aligned.  
Kaboom.

Right.  Kind of.  Read on...

So, specifically, what I propose is getting rid of the short-cuts that try to 
use Fortran statuses in-place if Fortran INTEGERs are as big as C ints.  I can 
make the changes.  Sanity checks on all that are welcome.

Hmm.  I'm not excited about this -- the whole point is that if we don't need to 
do an extra copy, let's not do it.

Is there a better way to fix this?

Off the top of my head -- for example, could we change some of those 
compile-time checks to run-time checks, and add in an alignment check?  E.g.:


#if OMPI_SIZEOF_FORTRAN_INTEGER == SIZEOF_INT
   c_status = (MPI_Status *) status;
#else
   c_status = &c_status2;
#endif
-

to


 /* The constant checks will be resolved at compile time; assume
    alignment_is_good() is an inline macro checking for "good" alignment
    on platforms where alignment(int) != alignment(size_t) */
 if (OMPI_SIZEOF_FORTRAN_INTEGER == SIZEOF_INT &&
     alignment_is_good(status)) {
   c_status = (MPI_Status *) status;
 } else {
   c_status = &c_status2;
 }
-

Would something like that work?

I'm thinking that the benefit here is that we only penalize platforms (with an extra 
"if" statement) where alignment(int) != alignment(size_t).
I did quite a bit more poking around.  It appears this issue is already 
"well known."  That is, due to this issue, we're not allowed to assume 
that a status that the user passed to us (ompi/mpi/c layer) has proper 
alignment since it might have come from Fortran.  So, we should use 
OMPI_STATUS_SET and OMPI_STATUS_SET_COUNT instead of doing direct status 
assignments.  Not only does ompi/request/request.h define these macros, 
but it also gives a nice description of the issue and points us to trac 
2526.


It seems to me there are a number of places where we do direct status 
assignments.  I found a couple of places in ompi/mca/pml/ob1/*.c and 
quite a few more in ompi/mpi/c/*.c.  If I'm sufficiently inspired 
tomorrow, I might look around to see if I can identify other places to 
look.  I can also confirm this leads to failures -- not only the mprobe 
stuff I reported but even vanilla MPI_Recv if you tweak the conditions 
just right.  I'll try to get to this stuff tomorrow.
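
As an aside, a hedged sketch of the run-time check Jeff floats above;
alignment_is_good() is his placeholder name, and this is not actual OMPI code:

#include <stddef.h>
#include <stdint.h>

/* A Fortran INTEGER status array can be used in place only if it happens
   to satisfy the alignment the C MPI_Status needs, i.e. that of its
   size_t member (taken here to be sizeof(size_t)). */
#define alignment_is_good(ptr) \
    (0 == (((uintptr_t)(ptr)) % sizeof(size_t)))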


Re: [OMPI devel] MPI_Mprobe

2012-07-31 Thread Eugene Loh

On 7/31/2012 5:15 AM, Jeff Squyres wrote:

On Jul 31, 2012, at 2:58 AM, Eugene Loh wrote:

The main issue is this.  If I go to ompi/mpi/fortran/mpif-h, I see six files (*recv_f and 
*probe_f) that take status arguments.  Normally, we do some conversion between Fortran 
and C status arguments.  These files test if OMPI_SIZEOF_FORTRAN_INTEGER==SIZEOF_INT, 
however, and bypass the conversion if the two types of integers are the same size.  The 
problem with this is that while the structures may be the same size, the C status has a 
size_t in it.  So, while the Fortran INTEGER array can start on any 4-byte alignment, the 
C status can end up with a 64-bit pointer on a 4-byte alignment.  That's not pleasant in 
general and can incur some serious hand-slapping on SPARC.  Specifically, SPARC/-m64 
errors out on *probe and *recv with MPI_PROC_NULL sources.  Would it be all right if I 
removed these "short cuts"?

Ew.  Yes.  You're right.

What specifically do you propose?  I don't remember offhand if the status 
conversion macros are the same as the regular int/INTEGER conversion macros -- 
we want to keep the no-op behavior for the regular int/INTEGER conversion 
macros and only handle the conversion of MPI_Status separately, I think.  
Specifically: for MPI_Status, we can probably still have those shortcuts for 
the int/INTEGERs, but treat the copying to the size_t separately.
I'm embarrassingly unfamiliar with the code.  My impression is that 
internally we deal with C status structures and so our requirements for 
Fortran status are:

*)  enough bytes to hold whatever is in a C status
*)  several words are addressable via the indices MPI_SOURCE, MPI_TAG, 
and MPI_ERROR
So, I think what we do today is sufficient in most respects.  Copying 
between Fortran and C integer-by-integer is fine.  It might be a little 
nonsensical for an 8-byte size_t component to be handled as two 4-byte 
words, but if we do so only for copying and otherwise only use that 
component from the C side, things should be fine.


The only problem is if we try to use the Fortran array in-place.  It's 
big enough, but its alignment might be wrong.


So, specifically, what I propose is getting rid of the short-cuts that 
try to use Fortran statuses in-place if Fortran INTEGERs are as big as C 
ints.  I can make the changes.  Sanity checks on all that are welcome.

Thanks for fixing the ibm MPROBE tests, btw.  Further proof that I must have 
been clinically insane when I added all those tests.  :-(

Insane, no, but you might copy out long-hand 100x:
for(i=0;i<N;i++) {
translates to
DO I=0,N-1

Related issue: do we need to (conditionally) add padding for the size_t in the 
Fortran array?
I guess so, but once again am unsure of myself.  If I look in 
ompi/config/ompi_setup_mpi_fortran.m4, we compute the size of 4 C ints 
and a size_t in units of Fortran INTEGERs.  I'm guessing that usually 
works for us today since any padding is at the very end and doesn't need 
to be copied.  If someone reorganized MPI_Status, however, there could 
be internal padding and we would start losing parts of the status.  So, 
it might make the code a little more robust if the padding were 
accounted for.  I'm not keen on making such a change myself.


Meanwhile, the config code errors out if the size turns out not to be an 
even multiple of Fortran INTEGER size.  I don't get this.  I would think 
one could just round up to the next multiple.  I'm worried my 
understanding of what's going on is faulty.

Here are two more smaller issues.  I'm pretty sure about them and can make the 
appropriate changes, but if someone wants to give feedback...

1)  If I look at, say, the v1.7 MPI_Mprobe man page, it says:

 A  matching  probe  with  MPI_PROC_NULL  as  source  returns
 message  =  MPI_MESSAGE_NULL...

In contrast, if I look at ibm/pt2pt/mprobe_mpifh.f90, it's checking the message to be 
MPI_MESSAGE_NO_PROC.  Further, if I look at the source code, mprobe.c seems to set the message to 
"no proc".  So, I take it the man page is wrong?  It should say "message = 
MPI_MESSAGE_NO_PROC"?

Oh, yes -- I think the man page is wrong.  The issue here is that the original 
MPI-3 proposal said to return MESSAGE_NULL, but this turns out to be ambiguous. 
 So we amended the original MPI-3 proposal with the new constant 
MPI_MESSAGE_NO_PROC.  So I think we fixed the implementation, but accidentally 
left the man page saying MESSAGE_NULL.

If you care, here's the specifics:

https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/38
https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/328

2)  Next, looking further at mprobe.c, it looks like this:

int MPI_Mprobe(int source, int tag, MPI_Comm comm,
               MPI_Message *message, MPI_Status *status)
{
    if (MPI_PROC_NULL == source) {
        if (MPI_STATUS_IGNORE != status) {
            *status = ompi_request_empty.req_status;
            *message = &ompi_message_no_proc.message;

[OMPI devel] MPI_Mprobe

2012-07-31 Thread Eugene Loh
I have some questions originally motivated by some mpif-h/MPI_Mprobe 
failures we've seen in SPARC MTT runs at 64-bit in both v1.7 and v1.9, 
but my poking around spread out from there.


The main issue is this.  If I go to ompi/mpi/fortran/mpif-h, I see six 
files (*recv_f and *probe_f) that take status arguments.  Normally, we 
do some conversion between Fortran and C status arguments.  These files 
test if OMPI_SIZEOF_FORTRAN_INTEGER==SIZEOF_INT, however, and bypass the 
conversion if the two types of integers are the same size.  The problem 
with this is that while the structures may be the same size, the C 
status has a size_t in it.  So, while the Fortran INTEGER array can 
start on any 4-byte alignment, the C status can end up with a 64-bit 
pointer on a 4-byte alignment.  That's not pleasant in general and can 
incur some serious hand-slapping on SPARC.  Specifically, SPARC/-m64 
errors out on *probe and *recv with MPI_PROC_NULL sources.  Would it be 
all right if I removed these "short cuts"?


Here are two more smaller issues.  I'm pretty sure about them and can 
make the appropriate changes, but if someone wants to give feedback...


1)  If I look at, say, the v1.7 MPI_Mprobe man page, it says:

 A  matching  probe  with  MPI_PROC_NULL  as  source  returns
 message  =  MPI_MESSAGE_NULL...

In contrast, if I look at ibm/pt2pt/mprobe_mpifh.f90, it's checking the 
message to be MPI_MESSAGE_NO_PROC.  Further, if I look at the source 
code, mprobe.c seems to set the message to "no proc".  So, I take it the 
man page is wrong?  It should say "message = MPI_MESSAGE_NO_PROC"?


2)  Next, looking further at mprobe.c, it looks like this:

int MPI_Mprobe(int source, int tag, MPI_Comm comm,
               MPI_Message *message, MPI_Status *status)
{
    if (MPI_PROC_NULL == source) {
        if (MPI_STATUS_IGNORE != status) {
            *status = ompi_request_empty.req_status;
            *message = &ompi_message_no_proc.message;
        }
        return MPI_SUCCESS;
    }
    ..
}

This means that if source==MPI_PROC_NULL and status==MPI_STATUS_IGNORE, 
the message does not get set.  The assignment to *message should be 
moved outside the status check, right?
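
Concretely, the reordering I have in mind would look something like this (a
sketch, not a committed fix):

    if (MPI_PROC_NULL == source) {
        *message = &ompi_message_no_proc.message;     /* always set the message */
        if (MPI_STATUS_IGNORE != status) {
            *status = ompi_request_empty.req_status;  /* only copy a real status */
        }
        return MPI_SUCCESS;
    }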


Re: [OMPI devel] [EXTERNAL] non-blocking collectives, SPARC, and alignment

2012-07-18 Thread Eugene Loh
I put back r26802, modifying nbc_internal.h and nbc.c.  Can you add this 
to your omnibus v1.7 CMR 3169?  I left the padding/packing alone and 
simply replaced round-schedule accesses with macros that handle the 
addressing and use memcpy for data movement.  The former simplifies the 
pointer manipulation and the latter helps portability.  Specifically, 
NBC now works on SPARC.
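
To illustrate the idiom, here's a sketch in the same spirit as those macros
(my illustration, not the actual r26802 code):

#include <string.h>

/* Pull a pointer-sized value out of a byte-packed schedule buffer without
   dereferencing a possibly misaligned address; memcpy is alignment-safe. */
static inline void *sched_read_ptr(const char *buf)
{
    void *p;
    memcpy(&p, buf, sizeof(p));
    return p;
}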


On 7/16/2012 1:45 PM, Barrett, Brian W wrote:

It's unlikely that I will have time to fix this in the short term.  The
scheduling code is fairly localized in nbc.c if Oracle has some time to
spend looking at these issues.  If not, it might be best to remove the
libnbc code from 1.7, as it's unfortunately clear that it's not as ready
for integration as we believed and I don't have time to fix the code base.

On 7/16/12 2:50 PM, "Eugene Loh"<eugene@oracle.com>  wrote:

The NBC functionality doesn't fare very well on SPARC.  One of the
problems is with data alignment.  An NBC schedule is a number of
variously sized fields laid out contiguously in linear memory  (e.g.,
see nbc_internal.h or nbc.c) and words don't have much natural
alignment.


Re: [OMPI devel] [EXTERNAL] non-blocking collectives, SPARC, and alignment

2012-07-16 Thread Eugene Loh

On 7/16/2012 1:45 PM, Barrett, Brian W wrote:

It's unlikely that I will have time to fix this in the short term.  The
scheduling code is fairly localized in nbc.c if Oracle has some time to
spend looking at these issues.  If not, it might be best to remove the
libnbc code from 1.7, as it's unfortunately clear that it's not as ready
for integration as we believed
Or both!  That is, I agree the code looks manageable and I'm inclined to 
take a whack at it.  Nevertheless, the NBC stuff in v1.7 is in a painful 
state.  Without CMRs, it does perhaps more harm than good.

  and I don't have time to fix the code base.

Brian

On 7/16/12 2:50 PM, "Eugene Loh"<eugene@oracle.com>  wrote:


The NBC functionality doesn't fare very well on SPARC.  One of the
problems is with data alignment.  An NBC schedule is a number of
variously sized fields laid out contiguously in linear memory  (e.g.,
see nbc_internal.h or nbc.c) and words don't have much natural
alignment.  On SPARC, the "default" (for some definition of that word)
is to sigbus when a word is not properly aligned.  In any case (even
non-SPARC), one might argue misalignment and subsequent exception
handling is nice to avoid.

Here are two specific issues.

*)  Schedule layout uses single-char delimiters between "round
schedules".  So, even if the first "round schedule" has nice alignment,
the second will have single-byte offsets for its components.

*)  8-byte pointers can fall on 4-byte boundaries.  E.g., say a schedule
starts on some "nice" alignment.  The first words of the schedule will be:

 int    total size of the schedule
 int    number of elements in the first round schedule
 enum   type of function
 void * pointer to some buffer

So, with -m64, that 8-byte pointer is on a 12-byte boundary.

Any input/comments on how to proceed?







[OMPI devel] non-blocking collectives, SPARC, and alignment

2012-07-16 Thread Eugene Loh
The NBC functionality doesn't fare very well on SPARC.  One of the 
problems is with data alignment.  An NBC schedule is a number of 
variously sized fields laid out contiguously in linear memory  (e.g., 
see nbc_internal.h or nbc.c) and words don't have much natural 
alignment.  On SPARC, the "default" (for some definition of that word) 
is to sigbus when a word is not properly aligned.  In any case (even 
non-SPARC), one might argue misalignment and subsequent exception 
handling is nice to avoid.


Here are two specific issues.

*)  Schedule layout uses single-char delimiters between "round 
schedules".  So, even if the first "round schedule" has nice alignment, 
the second will have single-byte offsets for its components.


*)  8-byte pointers can fall on 4-byte boundaries.  E.g., say a schedule 
starts on some "nice" alignment.  The first words of the schedule will be:


int    total size of the schedule
int    number of elements in the first round schedule
enum   type of function
void * pointer to some buffer

So, with -m64, that 8-byte pointer is on a 12-byte boundary.
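
For concreteness, a small self-contained sketch (my illustration, not NBC
code; it assumes the enum is int-sized) showing where those fields land when
packed back-to-back:

#include <stddef.h>
#include <stdio.h>

int main(void)
{
    size_t off = 0;
    printf("total size (int)     at offset %zu\n", off); off += sizeof(int);
    printf("round count (int)    at offset %zu\n", off); off += sizeof(int);
    printf("function type (enum) at offset %zu\n", off); off += sizeof(int);
    printf("buffer (void *)      at offset %zu (offset %% %zu = %zu)\n",
           off, sizeof(void *), off % sizeof(void *));
    return 0;
}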

Any input/comments on how to proceed?


Re: [OMPI devel] [OMPI svn-docs] svn:open-mpi-tests r2002 - trunk/ibm/collective

2012-07-11 Thread Eugene Loh
Brian caught it.  I simply applied the change to the other ibarrier_f* 
tests.  With this and your "remove bozo debug statements" (+ sleeps) 
putbacks (26768/trunk and 26769/v1.7), I'm hoping our ibarrier_f* MTT 
time-outs will disappear.


On 7/11/2012 9:26 AM, Jeff Squyres wrote:

I thought i would be 100 at the end of that do loop.

$%#@#@$% Fortran.  :-(


On Jul 11, 2012, at 12:25 PM,<svn-commit-mai...@open-mpi.org>  wrote:


Author: eugene (Eugene Loh)
Date: 2012-07-11 12:25:09 EDT (Wed, 11 Jul 2012)
New Revision: 2002

Log:
Apply the "right value when calling waitall" fix to
all ibm/collective/ibarrier_f* tests.

Text files modified:
   trunk/ibm/collective/ibarrier_f.f90   | 4 ++--
   trunk/ibm/collective/ibarrier_f08.f90 | 3 ++-
   trunk/ibm/collective/ibarrier_f90.f90 | 3 ++-
   3 files changed, 6 insertions(+), 4 deletions(-)

Modified: trunk/ibm/collective/ibarrier_f.f90
==
--- trunk/ibm/collective/ibarrier_f.f90 Wed Jul 11 12:03:04 2012(r2001)
+++ trunk/ibm/collective/ibarrier_f.f90 2012-07-11 12:25:09 EDT (Wed, 11 Jul 
2012)  (r2002)
@@ -31,6 +31,7 @@
! Comments may be sent to:
!Richard Treumann
!treum...@kgn.ibm.com
+! Copyright (c) 2012  Oracle and/or its affiliates.

   program ibarrier
   implicit none
@@ -57,8 +58,7 @@
   do i = 1, 100
  call MPI_Ibarrier(MPI_COMM_WORLD, req(i), ierr)
   end do
-  i = 100
-  call MPI_Waitall(i, req, statuses, ierr)
+  call MPI_Waitall(100, req, statuses, ierr)

   call MPI_Barrier(MPI_COMM_WORLD, ierr)
   call MPI_Finalize(ierr)

Modified: trunk/ibm/collective/ibarrier_f08.f90
==
--- trunk/ibm/collective/ibarrier_f08.f90   Wed Jul 11 12:03:04 2012
(r2001)
+++ trunk/ibm/collective/ibarrier_f08.f90   2012-07-11 12:25:09 EDT (Wed, 
11 Jul 2012)  (r2002)
@@ -31,6 +31,7 @@
! Comments may be sent to:
!Richard Treumann
!treum...@kgn.ibm.com
+! Copyright (c) 2012  Oracle and/or its affiliates.

   program ibarrier
   use mpi_f08
@@ -56,7 +57,7 @@
   do i = 1, 100
  call MPI_Ibarrier(MPI_COMM_WORLD, req(i))
   end do
-  call MPI_Waitall(i, req, MPI_STATUSES_IGNORE)
+  call MPI_Waitall(100, req, MPI_STATUSES_IGNORE)

   call MPI_Barrier(MPI_COMM_WORLD)
   call MPI_Finalize()

Modified: trunk/ibm/collective/ibarrier_f90.f90
==
--- trunk/ibm/collective/ibarrier_f90.f90   Wed Jul 11 12:03:04 2012
(r2001)
+++ trunk/ibm/collective/ibarrier_f90.f90   2012-07-11 12:25:09 EDT (Wed, 
11 Jul 2012)  (r2002)
@@ -31,6 +31,7 @@
! Comments may be sent to:
!Richard Treumann
!treum...@kgn.ibm.com
+! Copyright (c) 2012  Oracle and/or its affiliates.

   program ibarrier
   use mpi
@@ -57,7 +58,7 @@
   do i = 1, 100
  call MPI_Ibarrier(MPI_COMM_WORLD, req(i), ierr)
   end do
-  call MPI_Waitall(i, req, statuses, ierr)
+  call MPI_Waitall(100, req, statuses, ierr)

   call MPI_Barrier(MPI_COMM_WORLD, ierr)
   call MPI_Finalize(ierr)




[OMPI devel] ibcast segfault on v1.7 [was: reduce_scatter_block failing on v1.7]

2012-07-07 Thread Eugene Loh

On 07/06/12 14:35, Barrett, Brian W wrote:

On 7/6/12 2:31 PM, "Eugene Loh"<eugene@oracle.com>  wrote:


The new reduce_scatter_block test is segfaulting with v1.7 but not with
the trunk.  When we drop down into MPI_Reduce_scatter_block and attempt
to call

comm->c_coll.coll_reduce_scatter_block()

it's NULL.  (So is comm->c_coll.coll_reduce_scatter_block_module.)

Is there some work on the trunk that should be CMRed to v1.7?

Yes.  All in good time :).
Does this also apply to ibcast?  These failures complicate wading 
through MTT results, so we each have our own idea what "in good time" 
means.  :^(  Anyhow, test ibm/collective/ibcast_f08 is segfaulting on 
v1.7.  If I look at nbc_ibcast.c:ompi_coll_libnbc_ibcast(), it looks 
like "handle" is used before it's initialized.  The trunk doesn't have 
this problem.  Anyway, here's one plea for cleaning up v1.7.


[OMPI devel] ibm/collective/bcast_f08.f90

2012-07-06 Thread Eugene Loh
I assume this is an orphaned file that should be removed?  (It looks 
like a draft version of ibcast_f08.f90.)


[OMPI devel] reduce_scatter_block failing on v1.7

2012-07-06 Thread Eugene Loh
The new reduce_scatter_block test is segfaulting with v1.7 but not with 
the trunk.  When we drop down into MPI_Reduce_scatter_block and attempt 
to call


comm->c_coll.coll_reduce_scatter_block()

it's NULL.  (So is comm->c_coll.coll_reduce_scatter_block_module.)

Is there some work on the trunk that should be CMRed to v1.7?


[OMPI devel] non-blocking barrier

2012-07-06 Thread Eugene Loh
Either there is a problem with MPI_Ibarrier or I don't understand the 
semantics.


The following example is with openmpi-1.9a1r26747.  (Thanks for the fix 
in 26757.  I tried with that as well with same results.)  I get similar 
results for different OSes, compilers, bitness, etc.


% cat ibarrier.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char** argv) {
    int i, me;
    double t0, t1, t2;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();              /* set "time zero" */

    if ( me < 2 ) sleep(3);        /* two processes delay before hitting barrier */

    t1 = MPI_Wtime() - t0;
    MPI_Barrier(MPI_COMM_WORLD);
    t2 = MPI_Wtime() - t0;
    printf("%d entered at %3.1lf and exited at %3.1lf\n", me, t1, t2);

    if ( me < 2 ) sleep(3);        /* two processes delay before hitting barrier */

    t1 = MPI_Wtime() - t0;
    MPI_Ibarrier(MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    t2 = MPI_Wtime() - t0;
    printf("%d entered at %3.1lf and exited at %3.1lf\n", me, t1, t2);

    MPI_Finalize();
    return 0;
}
% mpirun -n 4 ./a.out
0 entered at 3.0 and exited at 3.0
1 entered at 3.0 and exited at 3.0
2 entered at 0.0 and exited at 3.0
3 entered at 0.0 and exited at 3.0
0 entered at 6.0 and exited at 6.0
1 entered at 6.0 and exited at 6.0
2 entered at 3.0 and exited at 3.0
3 entered at 3.0 and exited at 3.0

With the first barrier, no one leaves until the last process has 
entered.  With the non-blocking barrier, two processes enter and leave 
before the two laggards arrive at the barrier.  Is that right?


[OMPI devel] ibarrier failures on MTT

2012-07-03 Thread Eugene Loh
I'll look at this more, but for now I'll just note that the new ibarrier 
test is showing lots of failures on MTT (cisco and oracle).


[OMPI devel] u_int32_t typo in nbc_internal.h?

2012-06-27 Thread Eugene Loh

ompi/mca/coll/libnbc/nbc_internal.h

   259/* Schedule cache structures/functions */
   260u_int32_t adler32(u_int32_t adler, int8_t *buf, int len);
   261void NBC_SchedCache_args_delete(void *entry);
   262void NBC_SchedCache_args_delete_key_dummy(void *k);

u_int32_t
->
uint32_t

perhaps?
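
For what it's worth, u_int32_t is a BSD-style typedef from <sys/types.h> and
isn't available everywhere, while uint32_t comes from the standard <stdint.h>.
A sketch of the suggested spelling of line 260:

#include <stdint.h>

uint32_t adler32(uint32_t adler, int8_t *buf, int len);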


Re: [OMPI devel] openib wasn't building

2012-06-25 Thread Eugene Loh

Thanks.  That explains one mystery.

I'm still unclear, though.  Or, maybe I'm hitting a different problem.  
I configure with "--with-openib" (along with other stuff).  I get:


r26639:checking if MCA component btl:openib can compile... yes
r26640:checking if MCA component btl:openib can compile... no

I'll poke at this more, but just wanted to see if I understood correctly 
what you were saying.


On 06/25/12 16:28, Jeff Squyres wrote:

Er... I could have sworn that I committed the fix before I sent this mail, but 
it looks like I didn't.  I just committed r26654 which fixes the issue.

On Jun 25, 2012, at 2:47 PM, Jeff Squyres wrote:

I noticed earlier today that the trunk openib btl was not building if you did 
not specify --with-openib[=DIR].

I have fixed the problem, but just wanted to give a heads up that this has 
happened; either re-configure --with-openib or svn up and 
re-autogen/configure/build.


[OMPI devel] MPI_Reduce_scatter_block

2012-06-25 Thread Eugene Loh
In tarball 26642, Fortran compilation no longer succeeds.  I suspect the 
problem might be 26641.  E.g.,


libmpi_usempif08.so:
undefined reference to `ompi_iscan_f'
libmpi_mpifh.so:
undefined reference to `MPI_Reduce_scatter_block'
libmpi_mpifh.so:
undefined reference to `MPI_Ireduce_scatter_block'

If you need further characterization, let me know.  I can isolate further.


Re: [OMPI devel] bug in r26626

2012-06-24 Thread Eugene Loh

Thanks for r26638.  Looks like that file still needs a little attention:
http://www.open-mpi.org/mtt/index.php?do_redir=2073

On 6/22/2012 10:40 AM, Eugene Loh wrote:
Looking good.  Just a few more:  btl_udapl_endpoint.c has instances of 
seg_len and seg_addr.  udapl may not have much of a future, but for 
now it's still there.


On 6/22/2012 7:22 AM, Hjelm, Nathan T wrote:
Looks like I missed a few places in udapl and osc. Fixed with r26635 
and r26634. Hopefully that's the last of them outside of btl/vw.


-Nathan

On Friday, June 22, 2012 7:05 AM, TERRY 
DONTJE<terry.don...@oracle.com>  wrote:

To: Open MPI Developers; Hjelm, Nathan T
Subject: bug in r26626

It looks like compilation of 32 bit platforms is failing due to a 
missing field.  It looks to me that for some reason r26626 deleted 
hdr_segkey in ompi/mca/osc/rdma/osc_rdma_header.h which is used in 
the macro OMPI_OSC_RDMA_RDMA_INFO_HDR_NTOH and HTON.  Is there a 
reason that hdr_segkey was removed, if so more changes are needed.


--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com






Re: [OMPI devel] bug in r26626

2012-06-22 Thread Eugene Loh
Looking good.  Just a few more:  btl_udapl_endpoint.c has instances of 
seg_len and seg_addr.  udapl may not have much of a future, but for now 
it's still there.


On 6/22/2012 7:22 AM, Hjelm, Nathan T wrote:

Looks like I missed a few places in udapl and osc. Fixed with r26635 and 
r26634. Hopefully that's the last of them outside of btl/vw.

-Nathan

On Friday, June 22, 2012 7:05 AM, TERRY DONTJE  wrote:

To: Open MPI Developers; Hjelm, Nathan T
Subject: bug in r26626

It looks like compilation of 32 bit platforms is failing due to a missing 
field.  It looks to me that for some reason r26626 deleted hdr_segkey in 
ompi/mca/osc/rdma/osc_rdma_header.h which is used in the macro 
OMPI_OSC_RDMA_RDMA_INFO_HDR_NTOH and HTON.  Is there a reason that hdr_segkey 
was removed, if so more changes are needed.

--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com






Re: [OMPI devel] hang with launch including remote nodes

2012-06-21 Thread Eugene Loh

On 06/19/12 23:11, Ralph Castain wrote:

Also, how did you configure this version?

  --enable-heterogeneous
  --enable-cxx-exceptions
  --enable-shared
  --enable-orterun-prefix-by-default
  --with-sge
  --enable-mpi-f90
  --with-mpi-f90-size=small
  --disable-peruse
  --disable-mpi-thread-multiple
  --disable-debug
  --disable-mem-debug
  --disable-mem-profile
  --enable-contrib-no-build=vt


  If you had --disable-static, then there was a bug that would indeed cause a 
hang. Just committing that fix now.

I still get a hang even with r26623.

On Jun 19, 2012, at 9:01 PM, Ralph Castain wrote:

See if it works with -mca orte_use_common_port 0


I get a segfault:

[remote1:01409] *** Process received signal ***
[remote1:01409] Signal: Segmentation Fault (11)
[remote1:01409] Signal code: Address not mapped (1)
[remote1:01409] Failing at address: 2c
/home/eugene/r26609/lib/libopen-rte.so.0.0.0'show_stackframe+0x7d0
/lib/amd64/libc.so.1'__sighndlr+0x6
/lib/amd64/libc.so.1'call_user_handler+0x2c5
/home/eugene/r26609/lib/libopen-rte.so.0.0.0'orte_grpcomm_base_rollup_recv+0x73 
[Signal 11 (SEGV)]
/home/eugene/r26609/lib/openmpi/mca_rml_oob.so'orte_rml_recv_msg_callback+0x9c 


/home/eugene/r26609/lib/openmpi/mca_oob_tcp.so'mca_oob_tcp_msg_data+0x283
/home/eugene/r26609/lib/libopen-rte.so.0.0.0'event_process_active_single_queue+0x54c 


/home/eugene/r26609/lib/libopen-rte.so.0.0.0'event_process_active+0x41
/home/eugene/r26609/lib/libopen-rte.so.0.0.0'opal_libevent2019_event_base_loop+0x606 


/home/eugene/r26609/lib/libopen-rte.so.0.0.0'orte_daemon+0xd6d
/home/eugene/r26609/bin/orted'0xd4b
[remote1:01409] *** End of error message ***
Segmentation Fault (core dumped)



On Jun 19, 2012, at 8:31 PM, Eugene Loh wrote:

I'm having bad luck with the trunk starting with r26609.  Basically, things 
hang if I run

   mpirun -H remote1,remote2 -n 2 hostname

where remote1 and remote2 are remote nodes.


[OMPI devel] hang with launch including remote nodes

2012-06-19 Thread Eugene Loh
I'm having bad luck with the trunk starting with r26609.  Basically, 
things hang if I run


mpirun -H remote1,remote2 -n 2 hostname

where remote1 and remote2 are remote nodes.


Re: [OMPI devel] Barrier/coll_tuned/pml_ob1 segfault for derived data types

2012-06-15 Thread Eugene Loh

On 6/15/2012 11:59 AM, Nathan Hjelm wrote:

Until we can find the root cause, I pushed a change that protects the reset by 
checking if size > 0.
Let me know if that works for you.

It does.


Re: [OMPI devel] Barrier/coll_tuned/pml_ob1 segfault for derived data types

2012-06-15 Thread Eugene Loh
Backing out r26597 solves my particular test cases.  I'll back it out of 
the trunk as well unless someone has objections.


I like how you say "same segfault."  In certain cases, I just go on to 
different segfaults.  E.g.,


  [2] btl_openib_handle_incoming(openib_btl, ep, frag, byte_len = 20U), 
line 3208 in "btl_openib_component.c"

  [3] handle_wc(device, cq = 0, wc), line 3516 in "btl_openib_component.c"
  [4] poll_device(device, count = 1), line 3654 in "btl_openib_component.c"
  [5] progress_one_device(device), line 3762 in "btl_openib_component.c"
  [6] btl_openib_component_progress(), line 3787 in 
"btl_openib_component.c"

  [7] opal_progress(), line 207 in "opal_progress.c"
  [8] opal_condition_wait(c, m), line 100 in "condition.h"
  [9] ompi_request_default_wait_all(count = 2U, requests, statuses), 
line 281 in "req_wait.c"
  [10] ompi_coll_tuned_sendrecv_actual(sendbuf = (nil), scount = 0, 
sdatatype, dest = 0, stag = -16, recvbuf = (nil), rcount = 0, rdatatype, 
source = 0, rtag = -16, comm, status = (nil)), line 54 in 
"coll_tuned_util.c"
  [11] ompi_coll_tuned_barrier_intra_recursivedoubling(comm, module), 
line 172 in "coll_tuned_barrier.c"
  [12] ompi_coll_tuned_barrier_intra_dec_fixed(comm, module), line 207 
in "coll_tuned_decision_fixed.c"

  [13] PMPI_Barrier(comm = 0x518370), line 62 in "pbarrier.c"

The reg->cbfunc is NULL.  I'm still considering whether that's an 
artifact of how I build that particular case, though.


On 06/15/12 09:44, George Bosilca wrote:

There should be no datatype attached to the barrier, so it is normal you get 
the zero values in the convertor.

Something weird is definitively going on. As there is no data to be sent, the 
opal_convertor_set_position function is supposed to trigger the special path, 
mark the convertor as completed and return successfully. However, this seems 
not to be the case anymore as in your backtrace I see the call to 
opal_convertor_set_position_nocheck, which only happens if the above described 
test fails.

I had some doubts about r26597, but I don't have time to check into it until 
Monday. Maybe you can remove it and see if you continue to have the same 
segfault.

   george.

On Jun 15, 2012, at 01:24 , Eugene Loh wrote:


I see a segfault show up in trunk testing starting with r26598 when tests like

ibm  collective/struct_gatherv
intel src/MPI_Type_free_[types|pending_msg]_[f|c]

are run over openib.  Here is a typical stack trace:

   opal_convertor_create_stack_at_begining(convertor = 0x689730, sizes), line 404 in 
"opal_convertor.c"
   opal_convertor_set_position_nocheck(convertor = 0x689730, position), line 423 in 
"opal_convertor.c"
   opal_convertor_set_position(convertor = 0x689730, position = 0x7fffc36e0bf0), line 321 
in "opal_convertor.h"
   mca_pml_ob1_send_request_start_copy(sendreq, bml_btl = 0x6a3ea0, size = 0), line 485 
in "pml_ob1_sendreq.c"
   mca_pml_ob1_send_request_start_btl(sendreq, bml_btl), line 387 in 
"pml_ob1_sendreq.h"
   mca_pml_ob1_send_request_start(sendreq = 0x689680), line 458 in 
"pml_ob1_sendreq.h"
   mca_pml_ob1_isend(buf = (nil), count = 0, datatype, dst = 2, tag = -16, sendmode = 
MCA_PML_BASE_SEND_STANDARD, comm, request), line 87 in "pml_ob1_isend.c"
   ompi_coll_tuned_sendrecv_actual(sendbuf = (nil), scount = 0, sdatatype, dest = 2, stag 
= -16, recvbuf = (nil), rcount = 0, rdatatype, source = 2, rtag = -16, comm, status = 
(nil)), line 51 in "coll_tuned_util.c"
   ompi_coll_tuned_barrier_intra_recursivedoubling(comm, module), line 172 in 
"coll_tuned_barrier.c"
   ompi_coll_tuned_barrier_intra_dec_fixed(comm, module), line 207 in 
"coll_tuned_decision_fixed.c"
   PMPI_Barrier(comm = 0x5195a0), line 62 in "pbarrier.c"
   main(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x403219

The fact that some derived data types were sent before seems to have something 
to do with it.  I see this sort of problem cropping up in Cisco and Oracle 
testing.  Up at the level of pml_ob1_send_request_start_copy, at line 485:

   MCA_PML_OB1_SEND_REQUEST_RESET(sendreq);

I see

*sendreq->req_send.req_base.req_convertor.use_desc = {
length = 0
used   = 0
desc   = (nil)
}

and I guess that desc=NULL is causing the segfault at opal_convertor.c line 404.

Anyhow, I'm trudging along, but thought I would share at least that much with 
you helpful folks in case any of this is ringing a bell.


[OMPI devel] Barrier/coll_tuned/pml_ob1 segfault for derived data types

2012-06-14 Thread Eugene Loh
I see a segfault show up in trunk testing starting with r26598 when 
tests like


ibm  collective/struct_gatherv
intel src/MPI_Type_free_[types|pending_msg]_[f|c]

are run over openib.  Here is a typical stack trace:

   opal_convertor_create_stack_at_begining(convertor = 0x689730, 
sizes), line 404 in "opal_convertor.c"
   opal_convertor_set_position_nocheck(convertor = 0x689730, position), 
line 423 in "opal_convertor.c"
   opal_convertor_set_position(convertor = 0x689730, position = 
0x7fffc36e0bf0), line 321 in "opal_convertor.h"
   mca_pml_ob1_send_request_start_copy(sendreq, bml_btl = 0x6a3ea0, 
size = 0), line 485 in "pml_ob1_sendreq.c"
   mca_pml_ob1_send_request_start_btl(sendreq, bml_btl), line 387 in 
"pml_ob1_sendreq.h"
   mca_pml_ob1_send_request_start(sendreq = 0x689680), line 458 in 
"pml_ob1_sendreq.h"
   mca_pml_ob1_isend(buf = (nil), count = 0, datatype, dst = 2, tag = 
-16, sendmode = MCA_PML_BASE_SEND_STANDARD, comm, request), line 87 in 
"pml_ob1_isend.c"
   ompi_coll_tuned_sendrecv_actual(sendbuf = (nil), scount = 0, 
sdatatype, dest = 2, stag = -16, recvbuf = (nil), rcount = 0, rdatatype, 
source = 2, rtag = -16, comm, status = (nil)), line 51 in 
"coll_tuned_util.c"
   ompi_coll_tuned_barrier_intra_recursivedoubling(comm, module), line 
172 in "coll_tuned_barrier.c"
   ompi_coll_tuned_barrier_intra_dec_fixed(comm, module), line 207 in 
"coll_tuned_decision_fixed.c"

   PMPI_Barrier(comm = 0x5195a0), line 62 in "pbarrier.c"
   main(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x403219

The fact that some derived data types were sent before seems to have 
something to do with it.  I see this sort of problem cropping up in 
Cisco and Oracle testing.  Up at the level of 
pml_ob1_send_request_start_copy, at line 485:


   MCA_PML_OB1_SEND_REQUEST_RESET(sendreq);

I see

*sendreq->req_send.req_base.req_convertor.use_desc = {
length = 0
used   = 0
desc   = (nil)
}

and I guess that desc=NULL is causing the segfault at opal_convertor.c 
line 404.


Anyhow, I'm trudging along, but thought I would share at least that much 
with you helpful folks in case any of this is ringing a bell.


Re: [OMPI devel] r26565 (orte progress threads and libevent thread support by default) causing segfaults

2012-06-11 Thread Eugene Loh

On 6/9/2012 6:49 PM, Ralph Castain wrote:

I fixed this one, I believe
Sorry, I'm confused.  You mean you think you fixed the oob:ud:qp_init one? 
Which rev has the fix?

will have to look more at the loop_spawn issue later.
The original one I reported, I assume?  I see similar stacks on 
segfaults with a variety of tests.  So, I think it's not specific to 
loop_spawn.


On Sat, Jun 9, 2012 at 3:35 PM, Eugene Loh <eugene@oracle.com> wrote:


On 6/9/2012 12:06 PM, Eugene Loh wrote:

With r26565:
   Enable orte progress threads and libevent thread support by
default
Oracle MTT testing started showing new spawn_multiple failures.

Sorry.  I meant loop_spawn.

(And then, starting I think in 26582, the problem is masked behind
another issue, "oob:ud:qp_init could not create queue pair", which
is creating widespread problems for Cisco, IU, and Oracle MTT
testing.  I suppose that's the subject of a different e-mail thread.)

I've only seen this in 64-bit.  Here are two segfaults, both
from Linux/x86 systems running over TCP:

This one with GNU compilers:
   [...]
   parent: MPI_Comm_spawn #300 return : 0
   [burl-ct-v20z-26:28518] *** Process received signal ***
   [burl-ct-v20z-26:28518] Signal: Segmentation fault (11)
   [burl-ct-v20z-26:28518] Signal code: Address not mapped (1)
   [burl-ct-v20z-26:28518] Failing at address: (nil)
   [burl-ct-v20z-26:28518] [ 0] /lib64/libpthread.so.0
[0x3a21c0e7c0]
   [burl-ct-v20z-26:28518] [ 1] /lib64/libc.so.6(memcpy+0x35)
[0x3a2107bde5]
   [burl-ct-v20z-26:28518] [ 2]

/workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_copy+0x58)
   [burl-ct-v20z-26:28518] [ 3]

/workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/openmpi/mca_oob_tcp.so
   [burl-ct-v20z-26:28518] [ 4]

/workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/openmpi/mca_oob_tcp.so(mca_oob_tcp_recv_nb+0x314)
   [burl-ct-v20z-26:28518] [ 5]

/workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/openmpi/mca_rml_oob.so(orte_rml_oob_recv_buffer_nb+0xff)
   [burl-ct-v20z-26:28518] [ 6]

/workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/openmpi/mca_dpm_orte.so
   [burl-ct-v20z-26:28518] [ 7]

/workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/libmpi.so.0(PMPI_Comm_spawn+0x2ee)
   [burl-ct-v20z-26:28518] [ 8] dynamic/loop_spawn [0x40120b]
   [burl-ct-v20z-26:28518] [ 9]
/lib64/libc.so.6(__libc_start_main+0xf4) [0x3a2101d994]
   [burl-ct-v20z-26:28518] [10] dynamic/loop_spawn [0x400dd9]
   [burl-ct-v20z-26:28518] *** End of error message ***

This one with Oracle Studio compilers:
   parent: MPI_Comm_spawn #0 return : 0
   parent: MPI_Comm_spawn #20 return : 0
   [burl-ct-x2200-12:02348] *** Process received signal ***
   [burl-ct-x2200-12:02348] Signal: Segmentation fault (11)
   [burl-ct-x2200-12:02348] Signal code: Address not mapped (1)
   [burl-ct-x2200-12:02348] Failing at address: 0x10
   [burl-ct-x2200-12:02348] [ 0] /lib64/libpthread.so.0
[0x318ac0de80]
   [burl-ct-x2200-12:02348] [ 1]

/workspace/tdontje/hpc/mtt-scratch/burl-ct-x2200-12/ompi-tarball-testing/installs/Q7wT/install/lib/lib64/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_recv_handler+0xe3)
   [burl-ct-x2200-12:02348] [ 2]

/workspace/tdontje/hpc/mtt-scratch/burl-ct-x2200-12/ompi-tarball-testing/installs/Q7wT/install/lib/lib64/openmpi/mca_oob_tcp.so
   [burl-ct-x2200-12:02348] [ 3]

/workspace/tdontje/hpc/mtt-scratch/burl-ct-x2200-12/ompi-tarball-testing/installs/Q7wT/install/lib/lib64/libmpi.so.0
   [burl-ct-x2200-12:02348] [ 4]

/workspace/tdontje/hpc/mtt-scratch/burl-ct-x2200-12/ompi-tarball-testing/installs/Q7wT/install/lib/lib64/libmpi.so.0(opal_libevent2019_event_base_loop+0x7c7)
   [burl-ct-x2200-12:02348] [ 5]

/workspace/tdontje/hpc/mtt-scratch/burl-ct-x2200-12/ompi-tarball-testing/installs/Q7wT/install/lib/lib64/libmpi.so.0
   [burl-ct-x2200-12:02348] [ 6] /lib64/libpthread.so.0
[0x318ac06307]
   [burl-ct-x2200-12:02348] [ 7] /lib64/libc.so.6(clone+0x6d)
[0x318a0d1ded]
   [burl-ct-x2200-12:02348] *** End of error message ***

Sometimes, I see a hang rather than a segfault.



Re: [OMPI devel] r26565 (orte progress threads and libevent thread support by default) causing segfaults

2012-06-09 Thread Eugene Loh

On 6/9/2012 12:06 PM, Eugene Loh wrote:

With r26565:
Enable orte progress threads and libevent thread support by default
Oracle MTT testing started showing new spawn_multiple failures.

Sorry.  I meant loop_spawn.

(And then, starting I think in 26582, the problem is masked behind 
another issue, "oob:ud:qp_init could not create queue pair", which is 
creating widespread problems for Cisco, IU, and Oracle MTT testing.  I 
suppose that's the subject of a different e-mail thread.)
I've only seen this in 64-bit.  Here are two segfaults, both from 
Linux/x86 systems running over TCP:


This one with GNU compilers:
[...]
parent: MPI_Comm_spawn #300 return : 0
[burl-ct-v20z-26:28518] *** Process received signal ***
[burl-ct-v20z-26:28518] Signal: Segmentation fault (11)
[burl-ct-v20z-26:28518] Signal code: Address not mapped (1)
[burl-ct-v20z-26:28518] Failing at address: (nil)
[burl-ct-v20z-26:28518] [ 0] /lib64/libpthread.so.0 [0x3a21c0e7c0]
[burl-ct-v20z-26:28518] [ 1] /lib64/libc.so.6(memcpy+0x35) 
[0x3a2107bde5]
[burl-ct-v20z-26:28518] [ 2] 
/workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_copy+0x58)
[burl-ct-v20z-26:28518] [ 3] 
/workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/openmpi/mca_oob_tcp.so
[burl-ct-v20z-26:28518] [ 4] 
/workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/openmpi/mca_oob_tcp.so(mca_oob_tcp_recv_nb+0x314)
[burl-ct-v20z-26:28518] [ 5] 
/workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/openmpi/mca_rml_oob.so(orte_rml_oob_recv_buffer_nb+0xff)
[burl-ct-v20z-26:28518] [ 6] 
/workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/openmpi/mca_dpm_orte.so
[burl-ct-v20z-26:28518] [ 7] 
/workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/libmpi.so.0(PMPI_Comm_spawn+0x2ee)

[burl-ct-v20z-26:28518] [ 8] dynamic/loop_spawn [0x40120b]
[burl-ct-v20z-26:28518] [ 9] 
/lib64/libc.so.6(__libc_start_main+0xf4) [0x3a2101d994]

[burl-ct-v20z-26:28518] [10] dynamic/loop_spawn [0x400dd9]
[burl-ct-v20z-26:28518] *** End of error message ***

This one with Oracle Studio compilers:
parent: MPI_Comm_spawn #0 return : 0
parent: MPI_Comm_spawn #20 return : 0
[burl-ct-x2200-12:02348] *** Process received signal ***
[burl-ct-x2200-12:02348] Signal: Segmentation fault (11)
[burl-ct-x2200-12:02348] Signal code: Address not mapped (1)
[burl-ct-x2200-12:02348] Failing at address: 0x10
[burl-ct-x2200-12:02348] [ 0] /lib64/libpthread.so.0 [0x318ac0de80]
[burl-ct-x2200-12:02348] [ 1] 
/workspace/tdontje/hpc/mtt-scratch/burl-ct-x2200-12/ompi-tarball-testing/installs/Q7wT/install/lib/lib64/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_recv_handler+0xe3)
[burl-ct-x2200-12:02348] [ 2] 
/workspace/tdontje/hpc/mtt-scratch/burl-ct-x2200-12/ompi-tarball-testing/installs/Q7wT/install/lib/lib64/openmpi/mca_oob_tcp.so
[burl-ct-x2200-12:02348] [ 3] 
/workspace/tdontje/hpc/mtt-scratch/burl-ct-x2200-12/ompi-tarball-testing/installs/Q7wT/install/lib/lib64/libmpi.so.0
[burl-ct-x2200-12:02348] [ 4] 
/workspace/tdontje/hpc/mtt-scratch/burl-ct-x2200-12/ompi-tarball-testing/installs/Q7wT/install/lib/lib64/libmpi.so.0(opal_libevent2019_event_base_loop+0x7c7)
[burl-ct-x2200-12:02348] [ 5] 
/workspace/tdontje/hpc/mtt-scratch/burl-ct-x2200-12/ompi-tarball-testing/installs/Q7wT/install/lib/lib64/libmpi.so.0

[burl-ct-x2200-12:02348] [ 6] /lib64/libpthread.so.0 [0x318ac06307]
[burl-ct-x2200-12:02348] [ 7] /lib64/libc.so.6(clone+0x6d) 
[0x318a0d1ded]

[burl-ct-x2200-12:02348] *** End of error message ***

Sometimes, I see a hang rather than a segfault.


[OMPI devel] r26565 (orte progress threads and libevent thread support by default) causing segfaults

2012-06-09 Thread Eugene Loh

With r26565:
Enable orte progress threads and libevent thread support by default
Oracle MTT testing started showing new spawn_multiple failures.  I've 
only seen this in 64-bit.  Here are two segfaults, both from Linux/x86 
systems running over TCP:


This one with GNU compilers:
[...]
parent: MPI_Comm_spawn #300 return : 0
[burl-ct-v20z-26:28518] *** Process received signal ***
[burl-ct-v20z-26:28518] Signal: Segmentation fault (11)
[burl-ct-v20z-26:28518] Signal code: Address not mapped (1)
[burl-ct-v20z-26:28518] Failing at address: (nil)
[burl-ct-v20z-26:28518] [ 0] /lib64/libpthread.so.0 [0x3a21c0e7c0]
[burl-ct-v20z-26:28518] [ 1] /lib64/libc.so.6(memcpy+0x35) 
[0x3a2107bde5]
[burl-ct-v20z-26:28518] [ 2] 
/workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_copy+0x58)
[burl-ct-v20z-26:28518] [ 3] 
/workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/openmpi/mca_oob_tcp.so
[burl-ct-v20z-26:28518] [ 4] 
/workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/openmpi/mca_oob_tcp.so(mca_oob_tcp_recv_nb+0x314)
[burl-ct-v20z-26:28518] [ 5] 
/workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/openmpi/mca_rml_oob.so(orte_rml_oob_recv_buffer_nb+0xff)
[burl-ct-v20z-26:28518] [ 6] 
/workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/openmpi/mca_dpm_orte.so
[burl-ct-v20z-26:28518] [ 7] 
/workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/libmpi.so.0(PMPI_Comm_spawn+0x2ee)

[burl-ct-v20z-26:28518] [ 8] dynamic/loop_spawn [0x40120b]
[burl-ct-v20z-26:28518] [ 9] 
/lib64/libc.so.6(__libc_start_main+0xf4) [0x3a2101d994]

[burl-ct-v20z-26:28518] [10] dynamic/loop_spawn [0x400dd9]
[burl-ct-v20z-26:28518] *** End of error message ***

This one with Oracle Studio compilers:
parent: MPI_Comm_spawn #0 return : 0
parent: MPI_Comm_spawn #20 return : 0
[burl-ct-x2200-12:02348] *** Process received signal ***
[burl-ct-x2200-12:02348] Signal: Segmentation fault (11)
[burl-ct-x2200-12:02348] Signal code: Address not mapped (1)
[burl-ct-x2200-12:02348] Failing at address: 0x10
[burl-ct-x2200-12:02348] [ 0] /lib64/libpthread.so.0 [0x318ac0de80]
[burl-ct-x2200-12:02348] [ 1] 
/workspace/tdontje/hpc/mtt-scratch/burl-ct-x2200-12/ompi-tarball-testing/installs/Q7wT/install/lib/lib64/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_recv_handler+0xe3)
[burl-ct-x2200-12:02348] [ 2] 
/workspace/tdontje/hpc/mtt-scratch/burl-ct-x2200-12/ompi-tarball-testing/installs/Q7wT/install/lib/lib64/openmpi/mca_oob_tcp.so
[burl-ct-x2200-12:02348] [ 3] 
/workspace/tdontje/hpc/mtt-scratch/burl-ct-x2200-12/ompi-tarball-testing/installs/Q7wT/install/lib/lib64/libmpi.so.0
[burl-ct-x2200-12:02348] [ 4] 
/workspace/tdontje/hpc/mtt-scratch/burl-ct-x2200-12/ompi-tarball-testing/installs/Q7wT/install/lib/lib64/libmpi.so.0(opal_libevent2019_event_base_loop+0x7c7)
[burl-ct-x2200-12:02348] [ 5] 
/workspace/tdontje/hpc/mtt-scratch/burl-ct-x2200-12/ompi-tarball-testing/installs/Q7wT/install/lib/lib64/libmpi.so.0

[burl-ct-x2200-12:02348] [ 6] /lib64/libpthread.so.0 [0x318ac06307]
[burl-ct-x2200-12:02348] [ 7] /lib64/libc.so.6(clone+0x6d) 
[0x318a0d1ded]

[burl-ct-x2200-12:02348] *** End of error message ***

Sometimes, I see a hang rather than a segfault.


[MTT devel] MTT queries... problems

2012-05-30 Thread Eugene Loh

I seem to get unreliable results from MTT queries.

To reproduce:
- go to http://www.open-mpi.org/mtt
- click on "Test run"
- for "Date range:" enter "2012-03-23 00:30:00 - 2012-03-23 23:55:00"
- for "Org:" enter "oracle"
- for "Platform name:" enter "t2k-0"
- for "Suite:" enter "ibm-32"
- click on "Summary"
- click on "Detail"

The summary indicates there are 193 passes and 6 skips.  The detail 
shows 199 results distributed over two pages, 1-100 on one page and 
101-199 on the next.  The total (199=193+6) is correct, but I think the 
second page is suspect.  It includes "Date range" output (which is nice, 
but I didn't ask for it and I think is a symptom of what's going wrong 
here).  That second page includes some repeats from the first page 
(e.g., "00_create" and "00_create_cxx"), etc.  Because the total (199) 
is correct and because there are repeats, other results are missing 
entirely.  Another indication that there is a problem is that there are 
only three skipped tests in the "detail" view but six in the summary.  
(I believe the summary.)


Before I click on "Detail", I can go to "Preferences" and set the number 
of rows per page to be 200.  Doing so and then clicking on "Detail", the 
repeats (00_create, 00_create_cxx, and others) disappear and the number 
of Skipped tests is correct.


So, the problem seems to be with distributing results over multiple pages.


[OMPI devel] orte_util_decode_pidmap and hwloc

2012-05-26 Thread Eugene Loh
I'm suspicious of some code, but would like comment from someone who 
understands it.


In orte/util/nidmap.c orte_util_decode_pidmap(), one cycles through a 
buffer.  One cycles through jobs.  For each one, one unpacks num_procs.  
One also unpacks all sorts of other stuff like bind_idx.  In particular, 
there's


orte_process_info.bind_idx = bind_idx[ORTE_PROC_MY_NAME->vpid];

Well, if we spawn a job with more processes than the parent job, we 
could have vpid >= num_procs and we read garbage which could and I think 
does lead to some less-than-enjoyable experiences later on.


Yes/no/fix?
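
For concreteness, the sort of guard I have in mind is just a couple of lines
(a sketch against the snippet above, not a tested patch; treating the
out-of-range case as bind_idx 0 is my assumption):

    /* only index bind_idx[] if this process's vpid is actually covered
     * by the map we just unpacked; otherwise don't read garbage */
    if (ORTE_PROC_MY_NAME->vpid < num_procs) {
        orte_process_info.bind_idx = bind_idx[ORTE_PROC_MY_NAME->vpid];
    } else {
        orte_process_info.bind_idx = 0;   /* assumption: 0 is a safe "unknown" value */
    }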


[OMPI devel] trunk hang (when remote orted has to spawn another orted?)

2012-05-08 Thread Eugene Loh
Here is another trunk hang.  I get it if I use at least three remote 
nodes.  E.g., with r26385:


% mpirun -H remoteA,remoteB,remoteC -n 2 hostname
[remoteA:20508] [[54625,0],1] ORTE_ERROR_LOG: Not found in file 
base/ess_base_fns.c at line 135

[remoteA:20508] [[54625,0],1] unable to get hostname for daemon 3
[remoteA:20508] [[54625,0],1] ORTE_ERROR_LOG: Not found in file 
orted/orted_comm.c at line 345

[hang]

I think the problem first appeared with r26359.

I guess if a remote orted has to spawn another orted, it gets here:

  opal_pointer_array_get_item(table = 0x7e410, element_index = 3), line 
136 in "opal_pointer_array.h"

  find_proc(proc = 0xffbff264), line 51 in "ess_base_fns.c"
  orte_ess_base_proc_get_hostname(proc = 0xffbff264), line 134 in 
"ess_base_fns.c"

  remote_spawn(launch = 0x85f30), line 812 in "plm_rsh_module.c"
  orte_daemon_recv(status = 0, sender = 0x85f54, buffer = 0x85f30, tag 
= 1U, cbdata = (nil)), line 344 in "orted_comm.c"
  orte_rml_recv_msg_callback(status = 0, peer = 0x69014, iov = 0x7d7e0, 
count = 2, tag = 1U, cbdata = 0x85ec0), line 68 in "rml_oob_recv.c"
  mca_oob_tcp_msg_data(msg = 0x85310, peer = 0x69000), line 436 in 
"oob_tcp_msg.c"
  mca_oob_tcp_msg_recv_complete(msg = 0x85310, peer = 0x69000), line 
322 in "oob_tcp_msg.c"
  mca_oob_tcp_peer_recv_handler(sd = 13, flags = 2, user = 0x69000), 
line 942 in "oob_tcp_peer.c"
  event_persist_closure(base = 0x3c600, ev = 0x647a8), line 1280 in 
"event.c"
  event_process_active_single_queue(base = 0x3c600, activeq = 0x3c4f0), 
line 1324 in "event.c"

  event_process_active(base = 0x3c600), line 1396 in "event.c"
  opal_libevent2013_event_base_loop(base = 0x3c600, flags = 1), line 
1593 in "event.c"

  orte_daemon(argc = 19, argv = 0xffbff97c), line 729 in "orted_main.c"
  main(argc = 19, argv = 0xffbff97c), line 62 in "orted.c"

So, in my case, I'm trying to look up item 3 while only item 1 in the 
array appears to be initialized.


[OMPI devel] mpirun hostname hangs on trunk r26380?

2012-05-03 Thread Eugene Loh
I'm hanging on the trunk, even with something as simple as "mpirun 
hostname".  r26377 and earlier are fine, but r26381 is not.  Quickly 
looking at the putback log, r26380 seems to be the likely candidate.  
I'll look at this some more, but the hang is here (orterun.c):


  935   /* loop the event lib until an exit event is detected */
  936   while (orte_event_base_active) {
  937   opal_event_loop(orte_event_base, OPAL_EVLOOP_ONCE);
  938   }
  939
  940    DONE:

in case anyone recognizes the problem.  This is with Intel, Sun, or GCC 
compilers.


Re: [OMPI devel] Fortran linking problem: libraries have changed

2012-04-23 Thread Eugene Loh

On 4/23/2012 8:22 AM, Jeffrey Squyres wrote:

On Apr 23, 2012, at 1:40 AM, Eugene Loh wrote:

[rhc@odin001 ~/svn-trunk]$ mpifort --showme
gfortran -I/nfs/rinfs/san/homedirs/rhc/openmpi/include 
-I/nfs/rinfs/san/homedirs/rhc/openmpi/lib 
-L/nfs/rinfs/san/homedirs/rhc/openmpi/lib -lmpi_usempi -lmpi_mpifh -lmpi 
-lopen-rte -lopen-pal -ldl -lm -lpci -lresolv -Wl,--export-dynamic -lrt -lnsl 
-lutil -lm -lpthread -ldl

Hmm, that's interesting.  "mpifort --showme" in my case is *NOT* showing "-lmpi_usempi 
-lmpi_mpifh" and it *IS* showing "-lmpi_f77 -lmpi_f90".

Hmm.  I can't imagine how that happened.  The wrappers get these values from 
the wrapper data files in $prefix/share/openmpi/mpi*wrapper-data.txt.  I can't 
imagine how mpifort-wrapper-data.txt is getting loaded with -lmpi_f77 
-lmpi_f90.  Weird!
It turns out that we have MTT magic to generate those wrapper data 
files.  I will fix our magic.

I guess I need to make sure that I'm actually picking up the post-Fortran-merge 
trunk build I think I'm picking up, but it sure looks to me like I am.  Among 
other things, it recognizes mpifort as a command.  I'll think about possible 
dumb mistakes more tomorrow.

Let me know what you find.  Let's not exclude the possibility that this is a 
problem on the trunk somehow.


[OMPI devel] Fortran linking problem: libraries have changed

2012-04-22 Thread Eugene Loh

Next Fortran problem.

Oracle MTT managed to build the trunk (r26307) in some cases.  No 
test-run failures in these cases, but the pass counts are way low.  
Turns out, the Fortran tests aren't being built (or run).  I try 
compiling a Fortran code:


ld: fatal: library -lmpi_f77: not found
ld: fatal: library -lmpi_f90: not found
ld: fatal: File processing errors. No output written to a.out

I try "mpifort --showme" and see that it's trying to link in "-lmpi_f77 
-lmpi_f90", but those libraries no longer exist.  They have been replaced by

-lmpi_mpifh
-lmpi_usempi_ignore_tkr
-lmpi_usempif08

So, the Fortran wrapper needs to be updated.


[OMPI devel] configure check for Fortran and threads

2012-04-21 Thread Eugene Loh

Another probably-Fortran-merge problem.  Three issues in this e-mail.

Introduction:  The last two nights, Oracle MTT tests have been unable to 
build the trunk (r26307) with Oracle Studio compilers.  This has been 
uncovered since the fix of r26302, allowing us to get further in the 
build process.  We configure with

  --with-openib
  --enable-openib-connectx-xrc
  --without-udapl
  --disable-openib-ibcm
  --enable-btl-openib-failover
  [...]
and fail in compilation with
  "btl_openib_failover.c", line 237: undefined struct/union member: 
port_error_failover
The member is defined in btl_openib.h, but it's inside an "#if 
OPAL_HAVE_THREADS" and we're not getting threads.


#1)  Isn't there supposed to be some diplomatic message about trying to 
use openib without threads?


Anyhow, why aren't we getting threads?  Well, configure complains:
  checking if Fortran compiler and POSIX threads work as is... no
  checking if Fortran compiler and POSIX threads work with -Kthread... no
  checking if Fortran compiler and POSIX threads work with -kthread... no
  checking if Fortran compiler and POSIX threads work with -pthread... no
  checking if Fortran compiler and POSIX threads work with -pthreads... no
  checking if Fortran compiler and POSIX threads work with -mt... no
  checking if Fortran compiler and POSIX threads work with -mthreads... no
  checking if Fortran compiler and POSIX threads work with -lpthreads... no
  checking if Fortran compiler and POSIX threads work with -llthread... no
  checking if Fortran compiler and POSIX threads work with -lpthread... no

Woke up on the wrong side of bed, did we?  Checking config.log:

configure:58332: checking if Fortran compiler and POSIX threads work as is
configure:58417: cc -DNDEBUG -m32 -xO5  -I. -c conftest.c
"conftest.c", line 21: void function cannot return value
"conftest.c", line 24: void function cannot return value
"conftest.c", line 27: void function cannot return value
"conftest.c", line 30: void function cannot return value
[...]
void pthreadtest(void)
{ return pthreadtest_f(); }
[...]
void pthreadtest_(void)
{ return pthreadtest_f(); }
[...etc...]

#2)  Okay, yes, we shouldn't be trying to return values from void functions.
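
In other words, the generated conftest presumably just needs to call the
Fortran symbol without "returning" its (nonexistent) value -- something like
this sketch:

    void pthreadtest(void)
    { pthreadtest_f(); }     /* call it; don't return a value from a void function */

    void pthreadtest_(void)
    { pthreadtest_f(); }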

Same for the other checks (-pthread, -pthreads, -mt, etc.).  But, 
something else strikes me as funny about those other checks.  Here is 
more from config.log:


configure:58698: checking if Fortran compiler and POSIX threads work 
with -Kthread

configure:58768: cc -DNDEBUG -m32 -xO5  -mt -I. -c conftest.c
[...]
configure:58698: checking if Fortran compiler and POSIX threads work 
with -kthread

configure:58768: cc -DNDEBUG -m32 -xO5  -mt -I. -c conftest.c
[...]
configure:58698: checking if Fortran compiler and POSIX threads work 
with -pthread

configure:58768: cc -DNDEBUG -m32 -xO5  -mt -I. -c conftest.c
[...]
configure:58698: checking if Fortran compiler and POSIX threads work 
with -pthreads

configure:58768: cc -DNDEBUG -m32 -xO5  -mt -I. -c conftest.c
[...]
configure:58698: checking if Fortran compiler and POSIX threads work 
with -mt

configure:58768: cc -DNDEBUG -m32 -xO5  -mt -I. -c conftest.c
[...]
configure:58698: checking if Fortran compiler and POSIX threads work 
with -mthreads

configure:58768: cc -DNDEBUG -m32 -xO5  -mt -I. -c conftest.c
[...]
configure:59320: checking if Fortran compiler and POSIX threads work 
with -lpthreads

configure:59390: cc -DNDEBUG -m32 -xO5  -mt -I. -c conftest.c
[...]
configure:59320: checking if Fortran compiler and POSIX threads work 
with -llthread

configure:59390: cc -DNDEBUG -m32 -xO5  -mt -I. -c conftest.c
[...]
configure:59320: checking if Fortran compiler and POSIX threads work 
with -lpthread

configure:59390: cc -DNDEBUG -m32 -xO5  -mt -I. -c conftest.c
[...]

The purged text complains about void functions returning values, but we 
already talked about that.  What interests me now is this:


#3)  While configure claims to be trying so many flags (-pthread, -mt, 
etc.) it appears always to be checking only -mt.


[OMPI devel] testing if Fortran compiler likes the C++ exception flags

2012-04-20 Thread Eugene Loh

I think this is related to the "Fortran merge."

Last night, Oracle MTT tests couldn't build the trunk (r26307) with 
Intel compilers.  Specifically, configure fails with


checking to see if Fortran compiler likes the C++ exception flags... no
configure: WARNING: C++ exception flags are different between the C 
and Fortran compilers; this
configure script cannot currently handle this scenario.  Either 
disable C++ exception support or send mail to the Open MPI users list.

configure: error: *** Cannot continue

Looking in the config.log file, I see this:

configure:30518: checking to see if Fortran compiler likes the C++ 
exception flags
configure:30538: icc -c -O3 -DNDEBUG -Wall -static-intel -m32 
-finline-functions -fno-strict-aliasing -restrict -fexceptions  
conftest.c >&5

conftest.c(223): error: identifier "INTEGER" is undefined
 INTEGER I
 ^

Looks like the test is failing because configure is trying to compile 
Fortran source code in a *.c file with the C compiler.


Re: [OMPI devel] v1.5 r26132 broken on multiple nodes?

2012-03-16 Thread Eugene Loh
I updated trac 3047.  Thanks for the additional patch:  "mpirun -H 
 hostname" now works.


On 3/15/2012 5:15 PM, Ralph Castain wrote:

Let me know what you find - I took a look at the code and it looks correct. All 
required changes were included in the patch that was applied to the branch.

On Mar 14, 2012, at 11:27 PM, Eugene Loh wrote:

I'm quitting for the day, but happened to notice that all our v1.5 MTT runs are failing 
with r26133, though tests ran fine as of r26129.  Things run fine on-node, but if you run 
even just "hostname" on a remote node, the job fails with

orted: Command not found

I get this problem whether I include "--prefix $OPAL_PREFIX" or not.

Looking at recent putbacks, I see r26132 pulls in r26081 to fix #3047.  
According to both the trac ticket and 
http://www.open-mpi.org/community/lists/devel/2012/03/10672.php , r26081 alone 
isn't enough, but... whatever, I'm going to bed.  It does seem like r26132 
isn't quite right.


[OMPI devel] v1.5 r26132 broken on multiple nodes?

2012-03-15 Thread Eugene Loh
I'm quitting for the day, but happened to notice that all our v1.5 MTT 
runs are failing with r26133, though tests ran fine as of r26129.  
Things run fine on-node, but if you run even just "hostname" on a remote 
node, the job fails with


orted: Command not found

I get this problem whether I include "--prefix $OPAL_PREFIX" or not.

Looking at recent putbacks, I see r26132 pulls in r26081 to fix #3047.  
According to both the trac ticket and 
http://www.open-mpi.org/community/lists/devel/2012/03/10672.php , r26081 
alone isn't enough, but... whatever, I'm going to bed.  It does seem 
like r26132 isn't quite right.


Re: [OMPI devel] trunk regression in mpirun (no --prefix) r26081

2012-03-03 Thread Eugene Loh

Yes, seems to work for me, thanks.

On 3/3/2012 3:14 PM, Ralph Castain wrote:

Should be fixed in r26093

On Mar 3, 2012, at 4:06 PM, Eugene Loh wrote:

I'll look at this some more, but for now I'll note that the trunk has an 
apparent regression in r26081.

./configure  \
  --enable-shared\
  --enable-orterun-prefix-by-default \
  --disable-peruse   \
  --disable-mpi-thread-multiple  \
  --enable-contrib-no-build=vt   \
  --prefix=...

setenv OPAL_PREFIX ...
set path = ( $OPAL_PREFIX/bin $path )
setenv LD_LIBRARY_PATH $OPAL_PREFIX/lib

mpirun [--prefix $OPAL_PREFIX] hostname

Up to 26080, the mpirun line runs fine, whether with the --prefix option or 
not.  Starting in 26081, I get a seg fault when I do *NOT* use --prefix.  
(Still runs fine *with* --prefix.)  I've seen the problem on SunOS and on 
Linux, with Studio, GCC, and Intel compilers.


[OMPI devel] trunk regression in mpirun (no --prefix) r26081

2012-03-03 Thread Eugene Loh
I'll look at this some more, but for now I'll note that the trunk has an 
apparent regression in r26081.


./configure  \
  --enable-shared\
  --enable-orterun-prefix-by-default \
  --disable-peruse   \
  --disable-mpi-thread-multiple  \
  --enable-contrib-no-build=vt   \
  --prefix=...

setenv OPAL_PREFIX ...
set path = ( $OPAL_PREFIX/bin $path )
setenv LD_LIBRARY_PATH $OPAL_PREFIX/lib

mpirun [--prefix $OPAL_PREFIX] hostname

Up to 26080, the mpirun line runs fine, whether with the --prefix option 
or not.  Starting in 26081, I get a seg fault when I do *NOT* use 
--prefix.  (Still runs fine *with* --prefix.)  I've seen the problem on 
SunOS and on Linux, with Studio, GCC, and Intel compilers.


[OMPI devel] locked memory consumption with openib and spawn

2012-02-27 Thread Eugene Loh
In the test suite, we have an ibm/dynamic/loop_spawn test that looks 
like this:


for (...) {
loop_spawn spawns loop_child
parent and child execute MPI_Intercomm_merge
parent and child execute MPI_Comm_free
parent and child execute MPI_Comm_disconnect
}

If the openib BTL is involved and you run long enough, it appears that 
you run out of locked memory.  Does anyone have a sense for whether that 
is expected or it shows a resource leak?
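
For reference, the parent side of that loop boils down to something like the
following sketch (argument values are illustrative; the real test is
ibm/dynamic/loop_spawn.c):

    /* minimal sketch of the parent loop; "loop_child" is the child executable */
    MPI_Comm inter, merged;
    int i;
    for (i = 0; i < niters; i++) {
        MPI_Comm_spawn("loop_child", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &inter, MPI_ERRCODES_IGNORE);
        MPI_Intercomm_merge(inter, 0 /* high = false on the parent side */, &merged);
        MPI_Comm_free(&merged);
        MPI_Comm_disconnect(&inter);
    }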


Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Eugene Loh

On 02/22/12 14:54, Ralph Castain wrote:
That doesn't really address the issue, though. What I want to know is: 
what happens when you try to bind processes? What about 
-bind-to-socket, and -persocket options? Etc. Reason I'm concerned: 
I'm not sure what happens if the socket layer isn't present. The logic 
in 1.5 is pretty old, but I believe it relies heavily on sockets being 
present.

Okay.  So,

*)  "out of the box", basically nothing works.  For example, "mpirun 
hostname" segfaults.


*)  With "--mca orte_num_sockets 1", stuff appears to work.

*)  With "--mca orte_num_sockets 1" and adding either "--bysocket 
--bind-to-socket" or "--npersocket ", I get:


--
Unable to bind to socket -13 on node burl-ct-v20z-10.
--
--
mpirun was unable to start the specified application as it encountered 
an error:


Error name: Fatal
Node: burl-ct-v20z-10

when attempting to start process rank 0.
--
2 total processes failed to start

So, I hear Brice's comment that this is an old kernel.  And, I hear what 
you're saying about a "real" fix being expensive.  Nevertheless, to my 
taste, automatically setting num_sockets==1 when num_sockets==0 is 
detected makes a lot of sense.  It makes things "basically" work, 
turning a situation where everything including "mpirun hostname" 
segfaults into a situation where default usage works just fine.  What 
remains broken is binding, which generates an error message that gives 
the user a hope of making progress (turning off binding).  That's in 
contrast from expecting users to go from


% mpirun hostname
Segmentation fault

to knowing that they should set orte_num_sockets==1.
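
To be concrete, the band-aid I'm suggesting for the odls code quoted further
down this thread is only a couple of lines (a sketch; whether binding then
works correctly downstream is exactly Ralph's open question):

    opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
    if (0 >= orte_odls_globals.num_sockets) {
        /* hwloc reported no socket level; pretend the board is one big socket */
        orte_odls_globals.num_sockets = 1;
    }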


Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Eugene Loh

On 2/22/2012 11:08 AM, Ralph Castain wrote:

On Feb 22, 2012, at 11:59 AM, Brice Goglin wrote:

On 22/02/2012 17:48, Ralph Castain wrote:

On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote

On 2/21/2012 10:31 PM, Eugene Loh wrote:

...  "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI pukes on 
divide by zero.  OS info was listed in the original message (below).  Might we want to do 
something else?  E.g., assume num_sockets==1 when num_sockets==0 (if you know what I 
mean)?  So, which one (or more) of the following should be fixed?

*) on this platform, hwloc finds no socket level
*) therefore hwloc returns num_sockets==0 to OMPI
*) OMPI divides by 0 and barfs on basically everything

Okay.  So, Brice's other e-mail indicates that the first two are "not really 
uncommon":

On 2/22/2012 7:55 AM, Brice Goglin wrote:

Anyway, we have seen other systems (mostly non-Linux) where lstopo
reports nothing interesting (only one machine object with multiple PU
children). So numsockets==0 isn't really uncommon.

So, it seems to me that OMPI needs to handle the num_sockets==0 case rather 
than just dividing by num_sockets.  This is v1.5 orte_odls_base_open() since 
r25914.

Unfortunately, just artificially setting the num_sockets to 1 won't solve much 
- you'll get past that point in the code, but attempts to bind are likely to 
fail down the road. Fixing it will require some significant effort.

Given we haven't heard reports of this before, I'm not convinced it is a 
widespread problem.
I assume we don't see the problem as widespread because it was only 
introduced into  v1.5 in r25914.  In my mind, the real question is how 
common it is for hwloc to decide numsockets==0.  On that one, Brice 
asserts it "isn't really uncommon."

For now, let's just use the mca param and see what happens.

I am probably missing something but: Why would setting num_sockets to 1
work fine as a mca param, while artificially setting it as said above
wouldn't ?

Because the param means that it isn't hardwired into the code base. I want to 
first verify that artificially forcing num_sockets to 1 doesn't break the code 
down the road, so the less change to find out, the better.
That sounds a lot different to me than the earlier statement.  Thanks 
for asking that question, Brice.  Anyhow, I tried using "--mca 
orte_num_sockets 1" and that seems to allow basic programs to run.


Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Eugene Loh

On 2/21/2012 10:31 PM, Eugene Loh wrote:
...  "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI 
pukes on divide by zero.  OS info was listed in the original message 
(below).  Might we want to do something else?  E.g., assume 
num_sockets==1 when num_sockets==0 (if you know what I mean)?  So, 
which one (or more) of the following should be fixed?


*) on this platform, hwloc finds no socket level
*) therefore hwloc returns num_sockets==0 to OMPI
*) OMPI divides by 0 and barfs on basically everything
Okay.  So, Brice's other e-mail indicates that the first two are "not 
really uncommon":


On 2/22/2012 7:55 AM, Brice Goglin wrote:

Anyway, we have seen other systems (mostly non-Linux) where lstopo
reports nothing interesting (only one machine object with multiple PU
children). So numsockets==0 isn't really uncommon.
So, it seems to me that OMPI needs to handle the num_sockets==0 case 
rather than just dividing by num_sockets.  This is v1.5 
orte_odls_base_open() since r25914.

On Feb 21, 2012, at 7:20 PM, Eugene Loh wrote:

In r25914, orte/mca/odls/base/odls_base_open.c, we get

222 /* get the number of local sockets unless we were given 
a number */

223 if (0 == orte_default_num_sockets_per_board) {
224 
opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);

225 }
226 /* get the number of local processors */
227 
opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);

228 /* compute the base number of cores/socket, if not given */
229 if (0 == orte_default_num_cores_per_socket) {
230 orte_odls_globals.num_cores_per_socket = 
orte_odls_globals.num_processors / orte_odls_globals.num_sockets;

231 }

Well, we execute the branch at line 224, but num_sockets remains 0.  
This leads to the divide-by-0 at line 230.


Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Eugene Loh

On 2/21/2012 5:40 PM, Paul H. Hargrove wrote:

Here are the first of the results of the testing I promised.
I am not 100% sure how to reach the code that Eugene reported as 
problematic,
I don't think you're going to see it.  Somehow, hwloc on the config in 
question thinks there is no socket level and returns num_sockets==0.  If 
you can run something successfully, your platform won't show the issue.


Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Eugene Loh

On 02/21/12 19:29, Jeffrey Squyres wrote:

What's the output of running lstopo from hwloc 1.3.2?  (this is the version 
that's in the OMPI trunk and v1.5 branches)

 http://www.open-mpi.org/software/hwloc/v1.3/

Is there any difference from v1.4 hwloc?

 http://www.open-mpi.org/software/hwloc/v1.4/

Machine (8192MB)
  NUMANode L#0 (P#0 4096MB) + PU L#0 (P#0)
  NUMANode L#1 (P#1 4096MB) + PU L#1 (P#1)

No difference between 1.3 and 1.4.  No information about sockets.

As Paul says, doesn't look like a compiler thing.  (I get the same with 
Intel and gcc.)


The hwloc README has a sample program that has ("third example")

 depth = hwloc_get_type_depth(topology, HWLOC_OBJ_SOCKET);
 if (depth == HWLOC_TYPE_DEPTH_UNKNOWN) {
 printf("*** The number of sockets is unknown\n");
 } else {
...
 }

that reports that the number of sockets is unknown.  So, "sockets" is 
unknown and hwloc returns 0 for num_sockets and OMPI pukes on divide by 
zero.  OS info was listed in the original message (below).  Might we 
want to do something else?  E.g., assume num_sockets==1 when 
num_sockets==0 (if you know what I mean)?  So, which one (or more) of 
the following should be fixed?


*) on this platform, hwloc finds no socket level
*) therefore hwloc returns num_sockets==0 to OMPI
*) OMPI divides by 0 and barfs on basically everything
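
Fleshing out that README fragment into something self-contained (a sketch; it
assumes hwloc 1.x headers and libraries are available, and it simply falls
back to one socket when the depth is unknown -- i.e., what I'm proposing OMPI
should do):

    #include <stdio.h>
    #include <hwloc.h>

    int main(void)
    {
        hwloc_topology_t topology;
        int depth, nsockets = 1;                /* fallback: assume one socket */

        hwloc_topology_init(&topology);
        hwloc_topology_load(topology);
        depth = hwloc_get_type_depth(topology, HWLOC_OBJ_SOCKET);
        if (depth == HWLOC_TYPE_DEPTH_UNKNOWN) {
            printf("*** The number of sockets is unknown; assuming 1\n");
        } else {
            nsockets = hwloc_get_nbobjs_by_depth(topology, depth);
        }
        printf("num_sockets = %d\n", nsockets);
        hwloc_topology_destroy(topology);
        return 0;
    }

(Compile with something like "cc sockets.c -lhwloc".)  On the v20z in
question, this should take the "unknown" branch.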

On Feb 21, 2012, at 7:20 PM, Eugene Loh wrote:

We have some amount of MTT testing going on every night and on ONE of our 
systems v1.5 has been dead since r25914.  The system is

Linux burl-ct-v20z-10 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007 x86_64 
x86_64 x86_64 GNU/Linux

and I'm encountering the problem with Intel (composer_xe_2011_sp1.7.256) 
compilers.  I haven't poked around enough yet to figure out what the 
problematic characteristic of this configuration is.

In r25914, orte/mca/odls/base/odls_base_open.c, we get

222 /* get the number of local sockets unless we were given a number */
223 if (0 == orte_default_num_sockets_per_board) {
224 
opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
225 }
226 /* get the number of local processors */
227 
opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
228 /* compute the base number of cores/socket, if not given */
229 if (0 == orte_default_num_cores_per_socket) {
230 orte_odls_globals.num_cores_per_socket = 
orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
231 }

Well, we execute the branch at line 224, but num_sockets remains 0.  This leads 
to the divide-by-0 at line 230.  Digging deeper, the call at line 224 led us to 
opal/mca/paffinity/hwloc/paffinity_hwloc_module.c (lots of stuff left out):

static int module_get_socket_info(int *num_sockets) {
hwloc_topology_t *t = &opal_hwloc_topology;
*num_sockets = (int) hwloc_get_nbobjs_by_type(*t, HWLOC_OBJ_SOCKET);
return OPAL_SUCCESS;
}

Anyhow, SOCKET is somehow an unknown layer, so num_sockets is returning 0.

I can poke around more, but does someone want to advise?






[OMPI devel] v1.5 r25914 DOA

2012-02-21 Thread Eugene Loh
We have some amount of MTT testing going on every night and on ONE of 
our systems v1.5 has been dead since r25914.  The system is


Linux burl-ct-v20z-10 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007 
x86_64 x86_64 x86_64 GNU/Linux


and I'm encountering the problem with Intel (composer_xe_2011_sp1.7.256) 
compilers.  I haven't poked around enough yet to figure out what the 
problematic characteristic of this configuration is.


In r25914, orte/mca/odls/base/odls_base_open.c, we get

222 /* get the number of local sockets unless we were given a 
number */

223 if (0 == orte_default_num_sockets_per_board) {
224 
opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);

225 }
226 /* get the number of local processors */
227 
opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);

228 /* compute the base number of cores/socket, if not given */
229 if (0 == orte_default_num_cores_per_socket) {
230 orte_odls_globals.num_cores_per_socket = 
orte_odls_globals.num_processors / orte_odls_globals.num_sockets;

231 }

Well, we execute the branch at line 224, but num_sockets remains 0.  
This leads to the divide-by-0 at line 230.  Digging deeper, the call at 
line 224 led us to opal/mca/paffinity/hwloc/paffinity_hwloc_module.c 
(lots of stuff left out):


static int module_get_socket_info(int *num_sockets) {
hwloc_topology_t *t = &opal_hwloc_topology;
*num_sockets = (int) hwloc_get_nbobjs_by_type(*t, HWLOC_OBJ_SOCKET);
return OPAL_SUCCESS;
}

Anyhow, SOCKET is somehow an unknown layer, so num_sockets is returning 0.

I can poke around more, but does someone want to advise?


[OMPI devel] Fortran improbe support

2012-02-15 Thread Eugene Loh

I had a question about our Fortran MPI_Improbe support.

If I look at ompi/mpi/f77/improbe_f.c I see basically (lots of code 
removed):


64    void mpi_improbe_f(MPI_Fint *source, MPI_Fint *tag, MPI_Fint *comm,
65                       ompi_fortran_logical_t *flag, MPI_Fint *message,
66                       MPI_Fint *status, MPI_Fint *ierr)
67    {
94        *ierr = OMPI_INT_2_FINT(MPI_Improbe(OMPI_FINT_2_INT(*source),
95                                            OMPI_FINT_2_INT(*tag),
96                                            c_comm, OMPI_LOGICAL_SINGLE_NAME_CONVERT(flag),
97                                            &c_message, c_status));
98
99        if (MPI_SUCCESS == OMPI_FINT_2_INT(*ierr)) {
106           *message = MPI_Message_c2f(c_message);
107       }
108   }

I think this logic isn't right.  We reference the message at line 106 if 
the call at line 94 was successful, but we should reference the message 
*only* if there was a match.  This condition is indicated not by ierr 
but by flag.


Yes?

If someone (Brian?) would be willing to make corrections, that'd be 
great.  Otherwise, I'll try my hand at figuring out those preprocessor 
macros that wrap around "flag".
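
In case it helps, here is roughly what I have in mind (a sketch only; the
exact handling of the Fortran logical via the OMPI_LOGICAL_* macros is glossed
over, with a plain local int standing in for it):

    int c_flag;
    *ierr = OMPI_INT_2_FINT(MPI_Improbe(OMPI_FINT_2_INT(*source),
                                        OMPI_FINT_2_INT(*tag),
                                        c_comm, &c_flag,
                                        &c_message, c_status));
    /* ...copy c_flag back to the Fortran logical here (macro details omitted)... */

    /* only touch the message handle if the probe both succeeded AND matched */
    if (MPI_SUCCESS == OMPI_FINT_2_INT(*ierr) && c_flag) {
        *message = MPI_Message_c2f(c_message);
    }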


Re: [MTT devel] duplicate results

2012-01-06 Thread Eugene Loh

On 01/06/12 08:52, Josh Hursey wrote:
Weird. I don't know what is going on here, unless the client is 
somehow submitting some of the results too many times. One thing to 
check is the debug output file that the MTT client is submitting to 
the server. Check that for duplicates.
Sorry, I don't understand where to check.  I do know that if I look at 
the output from the MTT client, I see a bunch of messages like this:


>> Reported to MTTDatabase client: 1 successful submit, 0 failed 
submits (total of 6 results)


If I add up those numbers of results submitted, the totals match what I 
would expect.  So, there is some indication that the number of client 
submissions is right.
That will help determine whether this is a server side problem or 
client side problem. I have not noticed this behavior on the server 
before,
I haven't either, but I only just started looking more closely at 
results.  Mostly, in any case, things look fine.
but might be something with the submit.php script - just a guess 
though at this point.


Unfortunately I have zero time to spend on MTT for a few weeks at 
least. :/


-- Josh

On Thu, Jan 5, 2012 at 8:11 PM, Eugene Loh <eugene@oracle.com> wrote:


I go to MTT and I choose:

Test run
Date range: 2012-01-05 05:00:00 - 2012-01-05 12:00:00
Org: Oracle
Platform name: $burl-ct-v20z-2$
Suite: intel-64

and I get:

1 oracle burl-ct-v20z-2 i86pc SunOS ompi-nightly-trunk 1.7a1r25692
intel-64 4 870 0 86 0 0
2 oracle burl-ct-v20z-2 i86pc SunOS ompi-nightly-v1.5
1.5.5rc2r25683 intel-64 4 915 0 92 0 0

I get more tests (passing and skipped) with v1.5 than I do with
the trunk run.  I have lots of ways of judging what the numbers
should be. The "trunk" numbers are right.  The "v1.5" numbers are
too high;  they should be the same as the trunk numbers.

I can see the explanation by clicking on "Detail" and looking at
individual runs.  (To get time stamps, I add a " | by minute"
qualifier before clicking on "Detail".  Maybe there's a more
proper way of getting time stamps, but that seems to work for me.)
 Starting with record 890 and ending with 991, records are
repeated.  That is, 890 and 891 have identical command lines, time
stamps, output, etc.  One of them is a duplicate.  Same with 892
and 893, then for 894 and 895, then 896 and 897, and so on.  So,
for about a one-hour period, the records sent in by this test run
appear duplicated when I submit queries to the database. These 51
duplicates are the 45 extra passes and 6 extra skips seen in the
results above.

Can someone figure out what's going wrong here?  Clearly, I'd like
to be able to rely on query results.



[OMPI devel] 2012 MTT results

2012-01-02 Thread Eugene Loh
Oracle has MTT jobs that have been running and, according to the log 
files, been successfully reporting results to the IU database, even in 
the last few days.  If I look at http://www.open-mpi.org/mtt, however, I 
can't seem to turn up any results for the new calendar year (2012).  Any 
insights?


Re: [OMPI devel] [BUG?] OpenMPI with openib on SPARC64: Signal: Bus error (10)

2011-12-06 Thread Eugene Loh

On 11/21/11 20:51, Lukas Razik wrote:

Hello everybody!

I've Sun T5120 (SPARC64) Servers with
- Debian: 6.0.3
- linux-2.6.39.4 (from kernel.org)
- OFED-1.5.3.2
- InfiniBand: Mellanox Technologies MT25418 [ConnectX VPI PCIe 2.0 2.5GT/s - IB 
DDR / 10GigE] (rev a0)
   with newest FW (2.9.1)
and the following issue:

If I try to mpirun a program like the osu_latency benchmark:
$ /usr/mpi/gcc/openmpi-1.4.3/bin/mpirun -np 2 --mca btl_base_verbose 50 --mca 
btl_openib_verbose 1 -host cluster1,cluster2 
/usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency

then I get these errors:

# OSU MPI Latency Test v3.1.1
# SizeLatency (us)
[cluster1:64027] *** Process received signal ***
[cluster1:64027] Signal: Bus error (10)
[cluster1:64027] Signal code: Invalid address alignment (1)
[cluster1:64027] Failing at address: 0xaa9053
[cluster1:64027] [ 0] 
/usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_pml_ob1.so(+0x62f0) 
[0xf8010209e2f0]
[cluster1:64027] [ 1] 
/usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_coll_tuned.so(+0x2904) 
[0xf801031ce904]
[cluster1:64027] [ 2] 
/usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_coll_tuned.so(+0xb498) 
[0xf801031d7498]
[cluster1:64027] [ 3] 
/usr/mpi/gcc/openmpi-1.4.3/lib64/libmpi.so.0(MPI_Barrier+0xbc) 
[0xf8010005a97c]
[cluster1:64027] [ 4] 
/usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency(main+0x2b0) 
[0x100f34]
[cluster1:64027] [ 5] /lib64/libc.so.6(__libc_start_main+0x100) 
[0xf80100ac1240]
[cluster1:64027] [ 6] 
/usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency(_start+0x2c) 
[0x100bac]
[cluster1:64027] *** End of error message ***
[cluster2:02759] *** Process received signal ***
[cluster2:02759] Signal: Bus error (10)
[cluster2:02759] Signal code: Invalid address alignment (1)
[cluster2:02759] Failing at address: 0xaa9053
[cluster2:02759] [ 0] 
/usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_pml_ob1.so(+0x62f0) 
[0xf8010209e2f0]
[cluster2:02759] [ 1] 
/usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_coll_tuned.so(+0x2904) 
[0xf801031ce904]
[cluster2:02759] [ 2] 
/usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_coll_tuned.so(+0xb498) 
[0xf801031d7498]
[cluster2:02759] [ 3] 
/usr/mpi/gcc/openmpi-1.4.3/lib64/libmpi.so.0(MPI_Barrier+0xbc) 
[0xf8010005a97c]
[cluster2:02759] [ 4] 
/usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency(main+0x2b0) 
[0x100f34]
[cluster2:02759] [ 5] /lib64/libc.so.6(__libc_start_main+0x100) 
[0xf80100ac1240]
[cluster2:02759] [ 6] 
/usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency(_start+0x2c) 
[0x100bac]
[cluster2:02759] *** End of error message ***
There do indeed seem to be a set of problems here addressing non-aligned 
words.


*IF* you were to use Oracle Solaris Studio compilers, you could use 
-xmemalign=8i as Terry suggested and it appears that eliminates these 
errors, albeit potentially with a loss of performance.


Your e-mail thread identified a problem with misalignment in

551 hdr->hdr_match.hdr_ctx = 
sendreq->req_send.req_base.req_comm->c_contextid;

It appears one can get past this problem by configuring OMPI with 
--enable-openib-control-hdr-padding.  This turns on OMPI_OPENIB_PAD_HDR and 
gives you padding/alignment in ompi/mca/btl/openib/btl_openib_frag.h here:

struct mca_btl_openib_control_header_t {
uint8_t  type;
#if OMPI_OPENIB_PAD_HDR
uint8_t  padding[15];
#endif
};
typedef struct mca_btl_openib_control_header_t mca_btl_openib_control_header_t;

struct mca_btl_openib_eager_rdma_header_t {
mca_btl_openib_control_header_t control;
uint8_t padding[3];
uint32_t rkey;
ompi_ptr_t rdma_start;
};
typedef struct mca_btl_openib_eager_rdma_header_t 
mca_btl_openib_eager_rdma_header_t;

But then perhaps the padding in mca_btl_openib_eager_rdma_header_t needs to be 
adjusted.  I don't yet know.
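
If it helps anyone poking at this, a throwaway way to see where the 4- and
8-byte members land is just to mock up the two structs from above and print
offsets (a standalone sketch, not the real headers; ompi_ptr_t is stood in by
a uint64_t, which is an assumption about its size/alignment):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    struct ctrl_hdr  { uint8_t type; uint8_t padding[15]; };   /* PAD_HDR on */
    struct eager_hdr {
        struct ctrl_hdr control;
        uint8_t  padding[3];
        uint32_t rkey;
        uint64_t rdma_start;      /* stand-in for ompi_ptr_t */
    };

    int main(void)
    {
        printf("rkey       at offset %zu\n", offsetof(struct eager_hdr, rkey));
        printf("rdma_start at offset %zu\n", offsetof(struct eager_hdr, rdma_start));
        return 0;
    }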

This helps (more tests pass), but in many cases it just delays problems until a 
later point.

All of this is I suppose to say:
1)  Yes, there is a problem with misaligned words in the openib BTL.
2)  We are interested in and looking at the problem.
3)  No promises of outcome.


Re: [OMPI devel] r25470 (hwloc CMR) breaks v1.5

2011-11-16 Thread Eugene Loh

On 11/16/2011 3:32 AM, TERRY DONTJE wrote:

On 11/15/2011 10:16 PM, Jeff Squyres wrote:

On Nov 14, 2011, at 10:17 PM, Eugene Loh wrote:

I tried building v1.5.  r25469 builds for me, r25470 does not.  This is 
Friday's hwloc putback of CMR 2866.  I'm on Solaris11/x86.  The problem is 
basically:

Doh!

Making all in tools/ompi_info
  CC ompi_info.o
"../../../opal/include/opal/sys/ia32/atomic.h", line 173: warning: parameter in 
inline asm statement unused: %2

Have these warnings always been there for you?  r25470 should not have changed 
any of the assembly stuff.
Yes.  You can ignore these warnings; they aren't the droids you are 
looking for.
+1  Those warnings aren't the issue I'm talking about.  E.g., they're 
there for r25469 as well.


[OMPI devel] r25470 (hwloc CMR) breaks v1.5

2011-11-15 Thread Eugene Loh
I tried building v1.5.  r25469 builds for me, r25470 does not.  This is 
Friday's hwloc putback of CMR 2866.  I'm on Solaris11/x86.  The problem 
is basically:


Making all in tools/ompi_info
  CC ompi_info.o
"../../../opal/include/opal/sys/ia32/atomic.h", line 173: warning: 
parameter in inline asm statement unused: %2
"../../../opal/include/opal/sys/ia32/atomic.h", line 193: warning: 
parameter in inline asm statement unused: %2

  CC output.o
"../../../opal/include/opal/sys/ia32/atomic.h", line 173: warning: 
parameter in inline asm statement unused: %2
"../../../opal/include/opal/sys/ia32/atomic.h", line 193: warning: 
parameter in inline asm statement unused: %2

  CC param.o
"../../../opal/include/opal/sys/ia32/atomic.h", line 173: warning: 
parameter in inline asm statement unused: %2
"../../../opal/include/opal/sys/ia32/atomic.h", line 193: warning: 
parameter in inline asm statement unused: %2

  CC components.o
"../../../opal/include/opal/sys/ia32/atomic.h", line 173: warning: 
parameter in inline asm statement unused: %2
"../../../opal/include/opal/sys/ia32/atomic.h", line 193: warning: 
parameter in inline asm statement unused: %2

  CC version.o
"../../../opal/include/opal/sys/ia32/atomic.h", line 173: warning: 
parameter in inline asm statement unused: %2
"../../../opal/include/opal/sys/ia32/atomic.h", line 193: warning: 
parameter in inline asm statement unused: %2

  CCLD   ompi_info
Undefined   first referenced
 symbol in file
opal_hwloc122_hwloc_bitmap_dup  components.o
opal_hwloc122_hwloc_bitmap_weight   components.o
ld: fatal: symbol referencing errors. No output written to .libs/ompi_info

Blood and gore are attached to this e-mail.


ompi-output.tar.bz2
Description: application/bzip


Re: [OMPI devel] ibm/io/file_status_get_count

2011-11-04 Thread Eugene Loh

On 11/4/2011 5:56 AM, Jeff Squyres wrote:

On Oct 28, 2011, at 1:59 AM, Eugene Loh wrote

In our MTT testing, we see ibm/io/file_status_get_count fail occasionally with:

File locking failed in ADIOI_Set_lock(fd A,cmd F_SETLKW/7,type F_RDLCK/0,whence 
0) with return value
 and errno 5.
- If the file system is NFS, you need to use NFS version 3, ensure that the 
lockd daemon is running
on all the machines, and mount the directory with the 'noac' option (no 
attribute caching).
- If the file system is LUSTRE, ensure that the directory is mounted with the 
'flock' option.
ADIOI_Set_lock:: Input/output error
ADIOI_Set_lock:offset 0, length 1

One of the curious things (to us) about this test is that no one else appears 
to run it.  Looking back through a lot of MTT results, essentially the only 
results reported are Oracle.  Almost no non-Oracle results for this test have 
been reported in the last few months.  Is there something special about this 
test we should know about?

Not that I'm aware of.

I see why Cisco skipped it -- I didn't have the "io" directory listed in my 
list of IBM directories to traverse.  Doh!  That's been fixed.

(Cisco's MTT runs look like they need a bit of TLC -- I'm guessing IB is down 
on a node or two, resulting in a lot of false failures, but I likely won't have 
time to look at them until after SC :-( )
Yeah.  In our recent experience, everyone's MTT runs seem to need lots 
of TLC.  Anyhow, thanks for the feedback:  it appears there is no 
general intentional avoidance of this particular test that we were 
simply unaware of.

P.S.  We're also interested in understanding the error message better.  I 
suppose that's more appropriately taken up with ROMIO folks, which I will do, 
but if anyone on this list has useful information I'd love to hear it.  The 
error apparently comes when MPI_File_get_size sets a lock.  Each process has 
its own file and the test usually passes, so it's unclear to me what the 
problem is.  Further, the error message discussing NFS and Lustre strikes me as 
rather speculative.  We tend to run these tests repeatedly on the same file 
systems from the same test nodes.  Anyone have any idea how sound the 
NFSv3/lockd/noac advice is or what the real issue is here?

No.  You'll need to ask Rob Latham.
Thanks.  He replied to my inquiry on the MPICH list.  Main answer is 
that robustness bets are off on NFS and the message might be a little 
misleading.


[OMPI devel] ibm/io/file_status_get_count

2011-10-28 Thread Eugene Loh
In our MTT testing, we see ibm/io/file_status_get_count fail 
occasionally with:


File locking failed in ADIOI_Set_lock(fd A,cmd F_SETLKW/7,type F_RDLCK/0,whence 
0) with return value
 and errno 5.
- If the file system is NFS, you need to use NFS version 3, ensure that the 
lockd daemon is running
on all the machines, and mount the directory with the 'noac' option (no 
attribute caching).
- If the file system is LUSTRE, ensure that the directory is mounted with the 
'flock' option.
ADIOI_Set_lock:: Input/output error
ADIOI_Set_lock:offset 0, length 1

One of the curious things (to us) about this test is that no one else 
appears to run it.  Looking back through a lot of MTT results, 
essentially the only results reported are Oracle.  Almost no non-Oracle 
results for this test have been reported in the last few months.  Is 
there something special about this test we should know about?


P.S.  We're also interested in understanding the error message better.  
I suppose that's more appropriately taken up with ROMIO folks, which I 
will do, but if anyone on this list has useful information I'd love to 
hear it.  The error apparently comes when MPI_File_get_size sets a 
lock.  Each process has its own file and the test usually passes, so 
it's unclear to me what the problem is.  Further, the error message 
discussing NFS and Lustre strikes me as rather speculative.  We tend to 
run these tests repeatedly on the same file systems from the same test 
nodes.  Anyone have any idea how sound the NFSv3/lockd/noac advice is or 
what the real issue is here?


[OMPI devel] MPI 2.2 datatypes

2011-10-20 Thread Eugene Loh
In MTT testing, we check OMPI version number to decide whether to test 
MPI 2.2 datatypes.


Specifically, in intel_tests/src/mpitest_def.h:

#define MPITEST_2_2_datatype 0
#if defined(OPEN_MPI)
#if (OMPI_MAJOR_VERSION > 1) || (OMPI_MAJOR_VERSION == 1 && 
OMPI_MINOR_VERSION >= 7)

#undef MPITEST_2_2_datatype
#define MPITEST_2_2_datatype 1
#endif
#endif
#if MPI_VERSION > 2 || (MPI_VERSION == 2 && MPI_SUBVERSION >= 2)
#undef MPITEST_2_2_datatype
#define MPITEST_2_2_datatype 1
#endif

The check looks for OMPI 1.7 or higher, but we introduced support for 
MPI 2.2. datatypes in 1.5.4.  So, can we check for 1.5.4 or higher?  Or, 
is it possible that this support might not go into the first 1.6 
release?  I'm willing to make the changes, but just wanted some guidance 
on what to expect in 1.6.
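
If the answer is "check for 1.5.4 or later", the guard would presumably become
something like this (a sketch; it assumes the OMPI_RELEASE_VERSION macro is
available alongside OMPI_MAJOR/MINOR_VERSION -- I believe mpi.h defines it,
but that's worth double-checking):

    #if defined(OPEN_MPI)
    #if (OMPI_MAJOR_VERSION > 1) || \
        (OMPI_MAJOR_VERSION == 1 && OMPI_MINOR_VERSION > 5) || \
        (OMPI_MAJOR_VERSION == 1 && OMPI_MINOR_VERSION == 5 && OMPI_RELEASE_VERSION >= 4)
    #undef  MPITEST_2_2_datatype
    #define MPITEST_2_2_datatype 1
    #endif
    #endif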


Re: [OMPI devel] OMPI_MCA_opal_set_max_sys_limits

2011-09-01 Thread Eugene Loh

On 8/31/2011 4:48 AM, Ralph Castain wrote:

Perhaps it would help if you had clearly stated your concern.
Yeah.  It would have helped had I clearly understood what was going on.  
Most of all, that way I wouldn't have had to ask any questions!  :^)

 From this description, I gather your concern is -not- that remote processes 
don't see the setting, but that the remote -orteds- don't see it.
Let's dumb this down a notch for my sake.  Let's say I want to run a job 
with the TCP BTL and have lots of processes.  I hit a descriptor limit 
and so my friend tells me to use the MCA parameter 
opal_set_max_sys_limits.  So, is the point that while most MCA 
parameters can be set on the mpirun command line *OR* in param files 
*OR* with environment variables, in this case the environment-variable 
setting should be avoided?

Yes, I'm aware of that issue for rsh-based launches. It stems from rsh not 
allowing one to extend the environment. If you place the param on the cmd line, 
then it gets propagated because we collect and extend the cmd line params. If 
you place it in the environment, then we don't - because (as we have repeatedly 
explained to people) we cannot pass all relevant envars on the cmd line due to 
length restrictions. We don't have this issue with cmd line params because (the 
thinking goes) it already fit on the cmd line.

So for rsh-like launches, there is an unavoidable discrepancy. It's one reason 
why we have both system-level and personal-level MCA param files.
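
For completeness, the param-file route mentioned above amounts to putting the
setting in either the system-wide file ($prefix/etc/openmpi-mca-params.conf)
or the per-user one, e.g. (per-user file shown; adjust the path if your
install differs):

    # $HOME/.openmpi/mca-params.conf
    opal_set_max_sys_limits = 1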


Re: [OMPI devel] OMPI_MCA_opal_set_max_sys_limits

2011-08-31 Thread Eugene Loh

On 8/30/2011 7:34 PM, Ralph Castain wrote:

On Aug 29, 2011, at 11:18 PM, Eugene Loh wrote:

Maybe someone can help me from having to think too hard.

Let's say I want to max my system limits.  I can say this:

% mpirun --mca opal_set_max_sys_limits 1 ...

Cool.

Meanwhile, if I do this:

% setenv OMPI_MCA_opal_set_max_sys_limits 1
% mpirun ...

remote processes don't see the setting.  (Local processes and ompi_info are 
fine.)

I looked at the 1.5 code, and mpirun is reaping all OMPI_ params from the 
environ and adding them to the app. So it should be getting set.

I then ran "mpirun -n 1 printenv" on a slurm machine, and verified that indeed 
that param was in the environment. Ditto when I told it to use the rsh launcher.

Bug?  Naively, this looks "wrong."  At least disturbing, in any case.
This is with v1.5.
Okay, so one answer is implicit in your reply:  you are expecting the 
same result I am.  So, if the behavior is not as I expect but as I 
describe, it's a bug candidate.  (As opposed to, "The problem you're 
describing is how it's supposed to work;  it's no problem at all.")


Now, regarding "mpirun -n 1 printenv", I agree that the environment 
variable is getting set.  Even on a remote node.  That suggests that 
things are fine, but it turns out they are not.  The problem is -- and 
I'm afraid I don't understand the details -- it's set "too late."  I 
imagine a time line like this:


A)  orted starts
B)  orted calls opal_util_init_sys_limits()
C)  daemonize a child process
D)  child process execs target process
E)  target process starts up

Looking at the environment, I don't see the variable set in B, which is 
the only place the variable does any good.  Like you, I do see it in E, 
which is interesting but doesn't help the user.


Your experiment was reasonable, but the problem is odd.  I suggest the 
following to see the problem.  Set the variable in your environment.  
Then use mpirun to launch a remote process.  Then:
1)  In the remote orted, inside opal_util_init_sys_limits(), check for 
the variable in your environment.

And/or:
2)  Make the remotely launched process something like this:

#!/bin/csh
limit descriptors

and see if the descriptor limit got bumped up from what it otherwise 
should be.
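
i.e., something along these lines (the script name is made up; the point is
that it's the remote orted, not the shell you typed mpirun into, that
determines the limit the script inherits):

    % cat showlimit.csh
    #!/bin/csh
    limit descriptors
    % setenv OMPI_MCA_opal_set_max_sys_limits 1
    % mpirun -H remotenode -np 1 ./showlimit.csh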


In contrast, if you set the MCA parameter on your mpirun command line, 
the environment variable *does* get set, even in the environment of the 
orted when it calls opal_util_init_sys_limits().


I can poke at this more tomorrow, but I suspect with one "aha!" you'll 
figure it out a lot faster than I can.  :^(


[OMPI devel] OMPI_MCA_opal_set_max_sys_limits

2011-08-30 Thread Eugene Loh

 Maybe someone can help me from having to think too hard.

Let's say I want to max my system limits.  I can say this:

% mpirun --mca opal_set_max_sys_limits 1 ...

Cool.

Meanwhile, if I do this:

% setenv OMPI_MCA_opal_set_max_sys_limits 1
% mpirun ...

remote processes don't see the setting.  (Local processes and ompi_info 
are fine.)


Bug?  Naively, this looks "wrong."  At least disturbing, in any case.

This is with v1.5.


[OMPI devel] descriptor limits -- FAQ item

2011-08-29 Thread Eugene Loh
It seems to me the FAQ item 
http://www.open-mpi.org/faq/?category=large-clusters#fd-limits needs 
updating.  I'm willing to give this a try, but need some help first.  
(I'm even more willing to let someone else do all this, but I'm not 
holding my breath.)


For example, the text sounds dated -- e.g., with references to v1.2.  Is 
the "road map" discussion still current?


Is the estimate of the needed number of descriptors our current best guess?

The FAQ is missing discussion of how to increase the limit.  For 
something like "limit/ulimit/unlimit", where should this be done?  In 
.login?  I assume it's not sufficient simply to set this in the shell 
where mpirun is executed, assuming processes will also be launched on 
remote nodes.  Yes?
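
For reference, the raw commands themselves are the easy part (what the FAQ
should spell out is where to put them so that remotely launched orteds
actually inherit them -- e.g. in .login/.profile on every node, or in the
system limits configuration); the value below is just an example:

    % limit descriptors 65536        (csh/tcsh)
    $ ulimit -n 65536                (sh/bash)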


(And, clearly, the FAQ item is missing discussion of the MCA parameter 
opal_set_max_sys_limits.)


Re: [OMPI devel] ibm/dynamic/loop_spawn

2011-08-20 Thread Eugene Loh

Okay, put back as r1846.

On 8/16/2011 1:23 PM, Jeff Squyres wrote:

We talked about this a lot today on the call (and then some more afterwards). 
:-)

I think there's 2 important points here.

1. Ralph's original test was written with the intent of launching it with 1 
process which would then do a series of local spawns.  Even doing a huge 
truckload of them, Ralph mentioned (on the phone to me today) that it only took 
about 15 seconds.
15 seconds sounds to me like a stretch -- at least for the 2000 
iterations that, not too long ago, were in the code.  But, I quibble.

2. My test -- i.e., the current one in the ibm test suite directory -- is more of a 
general "beat on the ORTE/spawn system" test.  I.e., just spawn/reap a 
bajillion times and ensure that it works.  I think that it still breaks openib, for 
example -- after you do a bunch of spawns, something runs out of resources (I don't 
remember the exact failure scenario).
The comments to the putback "say it all."   Main point is that the 
number of spawning processes is cut back in some cases, and sleep-waits 
have been introduced.  Specifically in the case of loop_spawn, both 
patterns (Ralph's and Jeff's) are in there, *and* there is a 
user-settable time limit (ten minutes by default).  So, /everyone/ 
should be happy!  Happy happy joy joy.

Ralph's opinion is that we don't need to test for #1 any more.  I don't think it would be bad 
to test for #1 any more, but the C code for such a test could be a bit smarter (i.e., only 
MCW rank 0 could COMM_SPAWN on COMM_SELF, and use a host info key of "localhost" to 
ensure spawning locally, while any other MCW procs could idle looping on while (!done) 
{sleep(1); MPI_Test(...,); } so that they don't spin the CPU).
This is a digression, but sleep(1) is excessive.  If the cost of a 
futile MPI_Test is a fraction of a millisecond, then checking out for a 
full second should drop load by over 1000x.  Run times will be 
horrendous... not because of oversubscription, but because the machine 
will be idle!  Anyhow, again, just a digression.  I hope/expect that the 
spawn/load problems are now gone.
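
Put differently, a back-off of roughly a millisecond already buys most of the
load reduction without adding up to a second of latency per check -- something
like this sketch (the helper name is made up):

    #include <mpi.h>
    #include <time.h>

    /* hypothetical helper: wait on a request without spinning the CPU
     * and without sleeping a whole second between checks */
    static void gentle_wait(MPI_Request *req)
    {
        int done = 0;
        struct timespec ts = { 0, 1000000 };   /* ~1 ms */
        while (!done) {
            MPI_Test(req, &done, MPI_STATUS_IGNORE);
            if (!done) nanosleep(&ts, NULL);
        }
    }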

For #2, I don't disagree that Eugene's suggestions could make it a bit more 
robust.  After all, we only have so many hours for testing with so much 
equipment; one test that runs for hours and hours probably isn't useful.  You 
can imagine a bunch of ways to make that test more useful: take an argv[1] 
specifying the number of iterations, take an argv[1] that indicates a number of 
seconds to run the test, ensure that you only spawn on half the MCW processes 
and have the other half idle in a while(!done){...} loop, like mentioned above 
so that you can spawn on CPUs that aren't spinning tightly on MPI progress, 
...etc.



On Aug 15, 2011, at 11:47 AM, Eugene Loh wrote:

This is a question about ompi-tests/ibm/dynamic.  Some of these tests (spawn, spawn_multiple, 
loop_spawn/child, and no-disconnect) exercise MPI_Comm_spawn* functionality.  Specifically, 
they spawn additional processes (beyond the initial mpirun launch) and therefore exert a 
different load on a test system than one might naively expect from the "mpirun 
-np" command line.

One approach to testing is to have the test harness know characteristics about 
individual tests like this.  E.g., if I have only 8 processors and I don't want 
to oversubscribe, have the test harness know that particular tests should be 
launched with fewer processes.  On the other hand, building such generality 
into a test harness when changes would have to be so pervasive (subjective 
assessment) and so few tests require it may not make that much sense.

Another approach would be to manage oversubscription in the tests themselves.  
E.g., for spawn.c, instead of spawning np new processes, do the following:

- idle np/2 of the processes
- have the remaining np/2 processes spawn np/2 new ones

(Okay, so that leaves open the possibility that the newly spawned processes might not 
appear on the same nodes where idled processes have "made room" for them.  Each 
solution seems loaded with shortcomings.)

Anyhow, I was interested in some feedback on this topic.  A very small number (1-4) of 
spawning tests are causing us lots of problems (undue complexity in the test harness as 
well as a bunch of our time for reasons I find difficult to explain succinctly).  We're 
inclined to modify the tests so that they're a little more social.  E.g., make decisions 
about how many of the launched processes should "really" be used, idling some 
fraction of the processes, and continuing the test only with the remaining fraction.


[OMPI devel] ibm/dynamic/loop_spawn

2011-08-15 Thread Eugene Loh
This is a question about ompi-tests/ibm/dynamic.  Some of these tests 
(spawn, spawn_multiple, loop_spawn/child, and no-disconnect) exercise 
MPI_Comm_spawn* functionality.  Specifically, they spawn additional 
processes (beyond the initial mpirun launch) and therefore exert a 
different load on a test system than one might naively expect from the 
"mpirun -np " command line.


One approach to testing is to have the test harness know characteristics 
about individual tests like this.  E.g., if I have only 8 processors and 
I don't want to oversubscribe, have the test harness know that 
particular tests should be launched with fewer processes.  On the other 
hand, building such generality into a test harness when changes would 
have to be so pervasive (subjective assessment) and so few tests require 
it may not make that much sense.


Another approach would be to manage oversubscription in the tests 
themselves.  E.g., for spawn.c, instead of spawning np new processes, do 
the following:


- idle np/2 of the processes
- have the remaining np/2 processes spawn np/2 new ones

(Okay, so that leaves open the possibility that the newly spawned 
processes might not appear on the same nodes where idled processes have 
"made room" for them.  Each solution seems loaded with shortcomings.)


Anyhow, I was interested in some feedback on this topic.  A very small 
number (1-4) of spawning tests are causing us lots of problems (undue 
complexity in the test harness as well as a bunch of our time for 
reasons I find difficult to explain succinctly).  We're inclined to 
modify the tests so that they're a little more social.  E.g., make 
decisions about how many of the launched processes should "really" be 
used, idling some fraction of the processes, and continuing the test 
only with the remaining fraction.


Comments?


Re: [OMPI devel] [TIPC BTL] test programmes

2011-08-01 Thread Eugene Loh

The NAS Parallel Benchmarks are self-verifying.

Another option is the MPI Testing Tool 
http://www.open-mpi.org/projects/mtt/ but it might be more trouble than 
it's worth.


(INCIDENTALLY, THERE ARE TRAC TROUBLES WITH THE THREE LINKS AT THE 
BOTTOM OF THAT PAGE!  COULD SOMEONE TAKE A LOOK?)


If you do decide to explore MTT, 
http://www.open-mpi.org/projects/mtt/svn.php tells you how to do a 
Subversion checkout.  It's a test harness.  For the tests themselves, 
look in mtt/trunk/samples/*-template.ini for examples of what tests to 
run.  Whether you want to pursue this route depends on whether you're 
serious about doing lots of testing.


On 08/01/11 17:13, Jeff Squyres wrote:

Additionally, you might want to download an run a bunch of common MPI 
benchmarks, such as:

- Netpipe
- Intel MPI Benchmarks (IMB)
- SKaMPI
- HPL (Linpack)
- ...etc.

On Aug 1, 2011, at 8:12 AM, Chris Samuel wrote:

On Mon, 1 Aug 2011 09:47:00 PM Xin He wrote:

Do any of you guys have any testing programs that I should
run to test if it really works?

How about a real MPI program which has test data to check
it's running OK ?  Gromacs is open source and has a self-test
mechanism run via "make test" IIRC.

I think HPL (Linpack) also checks the data from its run..
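
(For the record, the self-checking programs mentioned above can be run 
directly under mpirun once built; binary names vary a bit by package 
version, but typically something like:

% mpirun -np 2 ./IMB-MPI1 PingPong
% mpirun -np 4 ./xhpl

where xhpl reads its HPL.dat input file and verifies residuals itself.)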


Re: [OMPI devel] [OMPI svn] svn:open-mpi r24903

2011-07-14 Thread Eugene Loh
Thanks for the clarification.  My myopic sense of the issue came out of 
stumbling on this behavior due to MPI_Comm_spawn_multiple failing.


I think *multiple* issues caused this problem to escape notice for so 
long.  One is that if the system thought it was oversubscribed, 
num_procs_alive was used uninitialized, which could potentially cause us 
to believe the file limit had been exceeded (whether or not that was the 
case, and all depending on the value of the uninitialized variable).


That's a problem that can get us to the nested-loop problem.

And the nested-loop problem would not necessarily lead to a 
user-observable error if the last process (the one we pick up due to the 
iteration variable getting screwed up) is happy getting the app that was 
intended for the first process.  When are we unhappy with this 
situation?  When MPI_Comm_spawn_multiple is called!  (For reasons I need 
to investigate further, we actually triggered the nested-loop problem 
often, but it caused problems only with that one MPI call.)
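
(For anyone who hasn't hit this before, here is a tiny stand-alone 
illustration -- not the ORTE code -- of how reusing one iteration 
variable across nested loops corrupts the outer traversal:

#include <stdio.h>

int main(void) {
    int i;
    for (i = 0; i < 3; i++) {          /* intended: three outer iterations */
        printf("outer visiting %d\n", i);
        for (i = 0; i < 2; i++) {      /* bug: clobbers the outer index */
            printf("  inner pass %d\n", i);
        }
    }                                  /* outer loop ends after a single pass */
    return 0;
}

In the odls case the analogous outer item ended up pointing at the 
wrong child, hence the wrong app.)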


Anyhow, thanks for the perspective on the issues.  Most of all, the code 
should be cleaner now and we don't need to worry about all the possible 
failure modes of the old code.


On 7/14/2011 2:35 PM, Ralph Castain wrote:

Just to clarify, as this commit message is somewhat misleading. The nested loop 
problem would cause a problem whenever the system had a specified limit (that 
we had sensed) on the number of files a process could have open, and that 
number would have been violated by starting another process. It had nothing to 
do with comm_spawn_multiple or any other specific MPI command, which is why it 
has passed MTT for so long.


On Jul 14, 2011, at 2:10 PM, eug...@osl.iu.edu wrote:


Author: eugene
Date: 2011-07-14 16:10:48 EDT (Thu, 14 Jul 2011)
New Revision: 24903
URL: https://svn.open-mpi.org/trac/ompi/changeset/24903

Log:
Clean up the computations of num_procs_alive.  Do some code
refactoring to improve readability and to compute num_procs_alive
correctly and to remove the use of loop iteration variables for
two loops nested one inside another (causing MPI_Comm_spawn_multiple
to fail).


Text files modified:
   trunk/orte/mca/odls/base/odls_base_default_fns.c |62 

   1 files changed, 31 insertions(+), 31 deletions(-)

Modified: trunk/orte/mca/odls/base/odls_base_default_fns.c
==============================================================================
--- trunk/orte/mca/odls/base/odls_base_default_fns.c   (original)
+++ trunk/orte/mca/odls/base/odls_base_default_fns.c   2011-07-14 16:10:48 EDT (Thu, 14 Jul 2011)
@@ -9,7 +9,7 @@
  * University of Stuttgart.  All rights reserved.
  * Copyright (c) 2004-2005 The Regents of the University of California.
  * All rights reserved.
- * Copyright (c) 2007-2010 Oracle and/or its affiliates.  All rights reserved.
+ * Copyright (c) 2007-2011 Oracle and/or its affiliates.  All rights reserved.
  * Copyright (c) 2011  Oak Ridge National Labs.  All rights reserved.
  * Copyright (c) 2011  Los Alamos National Security, LLC.
  * All rights reserved.
@@ -1240,6 +1240,28 @@
 time_is_up = true;
}

+static int compute_num_procs_alive(orte_jobid_t *job)
+{
+    opal_list_item_t *item;
+    orte_odls_child_t *child;
+    int num_procs_alive = 0, match_job;
+
+    for (item  = opal_list_get_first(&orte_local_children);
+         item != opal_list_get_end  (&orte_local_children);
+         item  = opal_list_get_next(item)) {
+        child = (orte_odls_child_t*)item;
+        if ( NULL != job ) {
+            match_job = ( OPAL_EQUAL == opal_dss.compare(job, &(child->name->jobid), ORTE_JOBID) );
+        } else {
+            match_job = 0;
+        }
+        if (child->alive || match_job) {
+            num_procs_alive++;
+        }
+    }
+    return num_procs_alive;
+}
+
int orte_odls_base_default_launch_local(orte_jobid_t job,
 orte_odls_base_fork_local_proc_fn_t 
fork_local)
{
@@ -1371,16 +1393,7 @@
     /* compute the number of local procs alive or about to be launched
      * as part of this job
      */
-    num_procs_alive = 0;
-    for (item = opal_list_get_first(&orte_local_children);
-         item != opal_list_get_end(&orte_local_children);
-         item = opal_list_get_next(item)) {
-        child = (orte_odls_child_t*)item;
-        if (child->alive ||
-            OPAL_EQUAL == opal_dss.compare(&job, &(child->name->jobid), ORTE_JOBID)) {
-            num_procs_alive++;
-        }
-    }
+    num_procs_alive = compute_num_procs_alive(&job);
     /* get the number of local processors */
     if (ORTE_SUCCESS != (rc = opal_paffinity_base_get_processor_info(&num_processors))) {
         /* if we cannot find the number of local processors, we have no choice
@@ -1409,6 +1422,9 @@
 /* setup to report the proc state to the 

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r24830

2011-07-13 Thread Eugene Loh

On 7/13/2011 4:31 PM, Paul H. Hargrove wrote:

On 7/13/2011 4:20 PM, Yevgeny Kliteynik wrote:
>  Finally, are you sure that infiniband/complib/cl_types_osd.h 
exists on all platforms?  (e.g., Solaris)  I know you said you don't 
have any Solaris machines to test with, but you should ping Oracle 
directly for some testing -- Terry might not be paying attention to 
this specific thread...

I'll check it, but my guess would be that Solaris doesn't have it.
AFAIK Solaris doesn't use OpenFabrics OpenSM - it has a separate
subnet manager, so I can't assume that it has.
So right now the dynamic SL will probably not work on Solaris
(though it won't break the compilation).
I have a pair of old machines running Solaris 11 Express (aka "SunOS 
5.11 snv_151a November 2010").
These have IB Verbs support, but there is no such header.  In fact, 
/usr/include/infiniband has no sub-directories.

+1

(That is, no such header and not even any subdirectories on a very 
recent version of Solaris 11:  snv_168.)

I may be able to do some testing eventually, but now is not a good time.


[OMPI devel] orte_odls_base_default_launch_local()

2011-07-12 Thread Eugene Loh
The function orte_odls_base_default_launch_local() has a variable 
num_procs_alive that is basically initialized like this:


if ( oversubscribed ) {
...
} else {
num_procs_alive = ...;
}

Specifically, if the "oversubscribed" test passes, the variable is not 
initialized.


(Strictly speaking, this is true only in v1.5.  In the trunk, the 
variable is set to 0 when it is declared, but I'm not sure that's very 
helpful.)


I'm inclined to move the num_procs_alive computation ahead of the "if" 
block so that this computation is always performed.


Sanity check?


[OMPI devel] orterun hanging

2011-04-06 Thread Eugene Loh
I'm running into a hang that is very easy to reproduce.  Basically, 
something like this:


% mpirun -H remote_node hostname
remote_node
^C

That is, I run a program (doesn't need to be MPI) on a remote node.  The 
program runs, but my local orterun doesn't return.  The problem seems to 
be correlated to the OS version (some very recent builds of Solaris) 
running on the remote node.


The problem would seem to be in the OS, though arguably it could be a 
long-time OMPI problem that is being exposed by a change in the OS.  
Regardless, does anyone have suggestions where I should be looking?


So far, it looks to me that the HNP orterun forks a child who launches 
an ssh process to start the remote orted.  Then, the remote orted 
daemonizes itself (forks a child and kills the parent, thereby detaching 
the daemon from the controlling terminal) and runs the user binary.  It 
seems to me that this daemonization is related to the problem.  
Specifically, if I use "mpirun --debug-daemons", there is no 
daemonization and the hang does not occur.  Perhaps, with some recent OS 
changes, the daemonized process is no longer alerting the HNP orterun 
when it's done.
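
(For context, a minimal sketch of the classic daemonization step being 
described -- not the actual orted code:

#include <sys/types.h>
#include <stdlib.h>
#include <unistd.h>

static void daemonize(void) {
    pid_t pid = fork();
    if (pid < 0) exit(1);    /* fork failed */
    if (pid > 0) exit(0);    /* parent exits, so ssh sees its child finish */
    setsid();                /* child detaches from the controlling terminal */
}

int main(void) {
    daemonize();
    sleep(30);               /* stand-in for the daemon's real work */
    return 0;
}

The question is whether, after this dance, the HNP still reliably 
learns that the daemonized orted and its children have finished.)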


Any suggestions where I should focus my efforts?  I'm working with v1.5.


Re: [OMPI devel] turning on progress threads

2011-03-10 Thread Eugene Loh
No big deal one way or the other.  It's a symbolic gesture against bit 
rot, I suppose.  The fact is that there are different pieces of the code 
base that move forward while vestiges of old stuff get left behind 
elsewhere.  At first, it's easier to leave that stuff in.  With time, 
the history gets forgotten and there gets left more and more mysterious 
stuff that future developers have to figure out.


Let's say there's code that doesn't do anything.  One can ask, "Why not 
just leave it in?"  Or, one can ask, "Why not just strip it out?"


This particular case (*.conf enable_progress) is minor.  Either way, 
things are fine.  My concern is more around the accumulation of many 
such instances.


Ralph Castain wrote:


On Mar 10, 2011, at 5:54 PM, Eugene Loh wrote:
 


Ralph Castain wrote:
   


Just stale code that doesn't hurt anything
 


Okay, so it'd be all right to remove those lines.  Right?
   


They are in my platform files - why are they a concern?

Just asking - we don't normally worry about people's platform files. I would 
rather not have to go thru everyone's files and review what they have there.
 


- frankly, I wouldn't look at platform files to try to get a handle on such 
things as they tend to fall out of date unless someone needs to change it.

We always hard-code progress threads to off because the code isn't thread safe 
in key areas involving the event library, for one.

On Mar 10, 2011, at 3:43 PM, Eugene Loh wrote:
 


In the trunk, we hardwire progress threads to be off.  E.g.,

% grep progress configure.ac
# Hardwire all progress threads to be off
enable_progress_threads="no"
  [Hardcode the ORTE progress thread to be off])
  [Hardcode the OMPI progress thread to be off])

So, how do I understand the following?

% grep enable_progress contrib/platform/*/*.conf
contrib/platform/cisco/linux-static.conf:orte_enable_progress_threads = 1
contrib/platform/cisco/macosx-dynamic.conf:orte_enable_progress_threads = 1
contrib/platform/openrcm/debug.conf:orte_enable_progress_threads = 1
% grep enable_progress contrib/platform/*/*/*.conf
contrib/platform/cisco/ebuild/hlfr.conf:orte_enable_progress_threads = 1
contrib/platform/cisco/ebuild/ludd.conf:orte_enable_progress_threads = 1
contrib/platform/cisco/ebuild/native.conf:orte_enable_progress_threads = 1

These seem to try to turn progress threads on.  Ugly, but not a problem?
   



Re: [OMPI devel] turning on progress threads

2011-03-10 Thread Eugene Loh

Ralph Castain wrote:


Just stale code that doesn't hurt anything


Okay, so it'd be all right to remove those lines.  Right?


- frankly, I wouldn't look at platform files to try to get a handle on such 
things as they tend to fall out of date unless someone needs to change it.

We always hard-code progress threads to off because the code isn't thread safe 
in key areas involving the event library, for one.

On Mar 10, 2011, at 3:43 PM, Eugene Loh wrote:
 


In the trunk, we hardwire progress threads to be off.  E.g.,

% grep progress configure.ac
# Hardwire all progress threads to be off
enable_progress_threads="no"
[Hardcode the ORTE progress thread to be off])
[Hardcode the OMPI progress thread to be off])

So, how do I understand the following?

% grep enable_progress contrib/platform/*/*.conf
contrib/platform/cisco/linux-static.conf:orte_enable_progress_threads = 1
contrib/platform/cisco/macosx-dynamic.conf:orte_enable_progress_threads = 1
contrib/platform/openrcm/debug.conf:orte_enable_progress_threads = 1
% grep enable_progress contrib/platform/*/*/*.conf
contrib/platform/cisco/ebuild/hlfr.conf:orte_enable_progress_threads = 1
contrib/platform/cisco/ebuild/ludd.conf:orte_enable_progress_threads = 1
contrib/platform/cisco/ebuild/native.conf:orte_enable_progress_threads = 1

These seem to try to turn progress threads on.  Ugly, but not a problem?





[OMPI devel] turning on progress threads

2011-03-10 Thread Eugene Loh

In the trunk, we hardwire progress threads to be off.  E.g.,

% grep progress configure.ac
# Hardwire all progress threads to be off
enable_progress_threads="no"
  [Hardcode the ORTE progress thread to be off])
  [Hardcode the OMPI progress thread to be off])

So, how do I understand the following?

% grep enable_progress contrib/platform/*/*.conf
contrib/platform/cisco/linux-static.conf:orte_enable_progress_threads = 1
contrib/platform/cisco/macosx-dynamic.conf:orte_enable_progress_threads = 1
contrib/platform/openrcm/debug.conf:orte_enable_progress_threads = 1
% grep enable_progress contrib/platform/*/*/*.conf
contrib/platform/cisco/ebuild/hlfr.conf:orte_enable_progress_threads = 1
contrib/platform/cisco/ebuild/ludd.conf:orte_enable_progress_threads = 1
contrib/platform/cisco/ebuild/native.conf:orte_enable_progress_threads = 1

These seem to try to turn progress threads on.  Ugly, but not a problem?


[OMPI devel] multi-threaded test

2011-03-08 Thread Eugene Loh
I've been assigned CMR 2728, which is to apply some thread-support 
changes to 1.5.x.  The trac ticket has amusing language about "needs 
testing".  I'm not sure what that means.  We rather consistently say 
that we don't promise anything with regards to true thread support.  We 
specifically say certain BTLs are off limits and we say things are 
poorly tested and can be expected to break.  Given all that, what does 
it mean to test thread support in OMPI?


One option, specifically in the context of this CMR, is to test only 
configuration options and so on.  I've done this.


Another possibility is to confirm that simple run-time tests of 
multi-threaded message passing succeed.  I'm having trouble with this.


Attached is a simple test.  It passes over sm but fails over TCP.  (One 
or both of the initial messages is not received.)


How high should I set my sights on this?
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <string.h>  /* memset */


#define N 1
int main(int argc, char **argv) {
  int np, me, buf[2][N], provided;

  /* init some stuff */
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  MPI_Comm_size(MPI_COMM_WORLD,&np);
  MPI_Comm_rank(MPI_COMM_WORLD,&me);
  if ( provided < MPI_THREAD_MULTIPLE ) MPI_Abort(MPI_COMM_WORLD,-1);

  /* initialize the buffers */
  memset(buf[0], 0, N * sizeof(int));
  memset(buf[1], 0, N * sizeof(int));

  /* test */
  #pragma omp parallel num_threads(2)
  {
int id = omp_get_thread_num();
MPI_Status st;
printf("%d %d in parallel region\n", me, id); fflush(stdout);

/* pingpong */
    if ( me == 0 ) {
      MPI_Send(buf[id],N,MPI_INT,1,7+id,MPI_COMM_WORLD); printf("%d %d sent\n",me,id); fflush(stdout);
      MPI_Recv(buf[id],N,MPI_INT,1,7+id,MPI_COMM_WORLD,&st); printf("%d %d recd\n",me,id); fflush(stdout);
    } else {
      MPI_Recv(buf[id],N,MPI_INT,0,7+id,MPI_COMM_WORLD,&st); printf("%d %d recd\n",me,id); fflush(stdout);
      MPI_Send(buf[id],N,MPI_INT,0,7+id,MPI_COMM_WORLD); printf("%d %d sent\n",me,id); fflush(stdout);
    }
  }

  MPI_Finalize();

  return 0;
}

#!/bin/csh

mpicc -xopenmp -m64 -O5 main.c

mpirun -np 2 --mca btl self,sm  ./a.out
mpirun -np 2 --mca btl self,tcp ./a.out



Re: [OMPI devel] --enable-opal-multi-threads

2011-02-15 Thread Eugene Loh

Ralph Castain wrote:


On Feb 14, 2011, at 9:26 PM, Abhishek Kulkarni wrote:
 


On Mon, 14 Feb 2011, Ralph Castain wrote:
   


If the ability to turn "on" thread support is missing from 1.5, then that is an 
error.
 


No, it was changed from "--enable-mpi-threads" to "--enable-opal-multi-threads" 
on the trunk in r22841 [1].
   


If the changeset has not been brought over to v1.5, it indeed looks like an 
anachronism in the README.

[1] https://svn.open-mpi.org/trac/ompi/changeset/22841
   


My point is that it isn't an anachronism in the README, but an error in 1.5 - 
it needs to have the ability to turn on thread safety.
 

I'm not sure if we're making progress here.  So far as I can tell, the 
v1.5 README talks about --enable-opal-multi-threads.  This option does 
not otherwise appear in v1.5, but only in the trunk.  So, to make the 
v1.5 README consistent with the v1.5 source code (as opposed to talking 
about features that will appear in unspecified future releases), either:


*) the comment should be removed from the README, or

*) opal-multi-threads should be CMRed to v1.5


On Feb 14, 2011, at 5:36 PM, Eugene Loh wrote:
 


In the v1.5 README, I see this:

--enable-opal-multi-threads
Enables thread lock support in the OPAL and ORTE layers. Does
not enable MPI_THREAD_MULTIPLE - see above option for that feature.
This is currently disabled by default.

I don't otherwise find opal-multi-threads at all in this branch.  It seems to 
me, for such an option, one needs to move to the trunk.

Is this an error (anachronism) in the v1.5 README?
   



[OMPI devel] --enable-opal-multi-threads

2011-02-14 Thread Eugene Loh

In the v1.5 README, I see this:

--enable-opal-multi-threads
 Enables thread lock support in the OPAL and ORTE layers. Does
 not enable MPI_THREAD_MULTIPLE - see above option for that feature.
 This is currently disabled by default.

I don't otherwise find opal-multi-threads at all in this branch.  It 
seems to me, for such an option, one needs to move to the trunk.


Is this an error (anachronism) in the v1.5 README?


Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r24356

2011-02-03 Thread Eugene Loh

Jeff Squyres wrote:


Eugene --

This ROMIO fix needs to go upstream.
 

Makes sense.  Whom do I pester about that?  Is r24356 (and now CMR 2712) 
okay as is?  The ROMIO change is an unimportant stylistic change, so I'm 
okay cutting it loose from the other changes in the putback.


Re: [OMPI devel] u_int8_t

2011-01-11 Thread Eugene Loh




Jeff Squyres wrote:

  On Jan 11, 2011, at 2:05 PM, Eugene Loh wrote:
  
  

  Do we have configure tests for them, or just #define's?
  

Configure tests.

  
  Ok, cool.  I assume you'll remove the senseless configure tests, too.
  

Right.




Re: [OMPI devel] u_int8_t

2011-01-11 Thread Eugene Loh

Jeff Squyres wrote:


Shrug.  If they're not used anywhere, I'd whack them.
 

Excellent.  They screw things up (at least for me).  Turns out, Solaris 
IB uses such types and has the sense to typedef them.  But such typedefs 
conflict with opal_config.h, which #define's them (for apparently no 
reason).



Do we have configure tests for them, or just #define's?
 


Configure tests.


On Jan 10, 2011, at 7:51 PM, Eugene Loh wrote:
 


Why do
u_int8_t
u_int16_t
u_int32_t
u_int64_t
get defined in opal_config.h?  I don't see them used anywhere in the 
OMPI/OPAL/ORTE code base.

Okay, one exception, in opal/util/if.c:

#if defined(__DragonFly__)
#define IN_LINKLOCAL(i)(((u_int32_t)(i) & 0xffff0000) == 0xa9fe0000)
#endif
   


Ah, and even this one exception you got rid of in r22869.


[OMPI devel] u_int8_t

2011-01-10 Thread Eugene Loh

Why do
  u_int8_t
  u_int16_t
  u_int32_t
  u_int64_t
get defined in opal_config.h?  I don't see them used anywhere in the 
OMPI/OPAL/ORTE code base.


Okay, one exception, in opal/util/if.c:

#if defined(__DragonFly__)
#define IN_LINKLOCAL(i)(((u_int32_t)(i) & 0xffff0000) == 0xa9fe0000)
#endif


Re: [OMPI devel] mca_bml_r2_del_proc_btl()

2011-01-04 Thread Eugene Loh

Thanks for the sanity check.  Fix in r24202.
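
For reference, a sketch of what the fix amounts to (hedged; not 
necessarily the literal r24202 diff) -- pre-set the minimum before the 
recomputation loop quoted below:

  /* reset max_send_size to the min of all remaining btl's */
  ep->btl_max_send_size = (size_t)-1;
  for(b = 0; b < mca_bml_base_btl_array_get_size(&ep->btl_send); b++) {
      bml_btl = mca_bml_base_btl_array_get_index(&ep->btl_send, b);
      ep_btl = bml_btl->btl;
      if (ep_btl->btl_max_send_size < ep->btl_max_send_size) {
          ep->btl_max_send_size = ep_btl->btl_max_send_size;
      }
  }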

George Bosilca wrote:


As the endpoint's btl_max_send_size has been initialized to the min of the 
max_size of all BTLs in the send (respectively rdma) array, the loop you 
pinpointed will have no effect (as it is impossible to find a smaller value 
than the minimum already computed). Pre-setting to (size_t)-1 should fix the 
issue.

On Jan 3, 2011, at 17:17 , Eugene Loh wrote:
 


I can't tell if this is a problem, though I suspect it's a small one even if 
it's a problem at all.

In mca_bml_r2_del_proc_btl(), a BTL is removed from the send list and from the 
RDMA list.

If the BTL is removed from the send list, the end-point's max send size is 
recomputed to be the minimum of the max send sizes of the remaining BTLs.  The 
code looks like this, where I've removed some code to focus on the parts that 
matter:

 /* remove btl from send list */
 if(mca_bml_base_btl_array_remove(&ep->btl_send, btl)) {

 /* reset max_send_size to the min of all btl's */
 for(b=0; b< mca_bml_base_btl_array_get_size(&ep->btl_send); b++) {
 bml_btl = mca_bml_base_btl_array_get_index(&ep->btl_send, b);
 ep_btl = bml_btl->btl;

 if (ep_btl->btl_max_send_size < ep->btl_max_send_size) {
 ep->btl_max_send_size = ep_btl->btl_max_send_size;
 }
 }
 }

Shouldn't that inner loop be preceded by initialization of ep->btl_max_send_size to some 
very large value (ironically enough, perhaps "-1")?

Something similar happens in the same function when the BTL is removed from the RDMA 
list and  ep->btl_pipeline_send_length and ep->btl_send_limit are recomputed.
   



Re: [OMPI devel] IBV_EVENT_QP_ACCESS_ERR

2011-01-03 Thread Eugene Loh

George Bosilca wrote:


Eugene,

This error indicate that somehow we're accessing the QP while the QP is in 
"down" state. As the asynchronous thread is the one that see this error, I 
wonder if it doesn't look for some information about a QP that has been destroyed by the 
main thread (as this only occurs in MPI_Finalize).

Can you look in the syslog to see if there is any additional info related to 
this issue there?


Not much.  A one-liner like this:

Dec 27 21:49:36 burl-ct-x4150-11 hermon: [ID 492207 kern.info] hermon1: 
EQE local access violation



On Dec 30, 2010, at 20:43, Eugene Loh <eugene@oracle.com> wrote:
 


I was running a bunch of np=4 test programs over two nodes.  Occasionally, 
*one* of the codes would see an IBV_EVENT_QP_ACCESS_ERR during MPI_Finalize().  
I traced the code and ran another program that mimicked the particular MPI 
calls made by that program.  This other program, too, would occasionally 
trigger this error.  I never saw the problem with other tests.  Rate of 
incidence could go from consecutive runs (I saw this once) to 1:100s (more 
typically) to even less frequently -- I've had 1000s of consecutive runs with 
no problems.  (The tests run a few seconds apiece.)  The traffic pattern is 
sends from non-zero ranks to rank 0, with root-0 gathers, and lots of 
Allgathers.  The largest messages are 1000 bytes.  It appears the problem is 
always seen on rank 3.

Now, I wouldn't mind someone telling me, based on that little information, what 
the problem is here, but I guess I don't expect that.  What I am asking is what 
IBV_EVENT_QP_ACCESS_ERR means.  Again, it's seen during MPI_Finalize.  The 
async thread is seeing this.  What is this error trying to tell me?
   



[OMPI devel] IBV_EVENT_QP_ACCESS_ERR

2010-12-30 Thread Eugene Loh
I was running a bunch of np=4 test programs over two nodes.  
Occasionally, *one* of the codes would see an IBV_EVENT_QP_ACCESS_ERR 
during MPI_Finalize().  I traced the code and ran another program that 
mimicked the particular MPI calls made by that program.  This other 
program, too, would occasionally trigger this error.  I never saw the 
problem with other tests.  Rate of incidence could go from consecutive 
runs (I saw this once) to 1:100s (more typically) to even less 
frequently -- I've had 1000s of consecutive runs with no problems.  (The 
tests run a few seconds apiece.)  The traffic pattern is sends from 
non-zero ranks to rank 0, with root-0 gathers, and lots of Allgathers.  
The largest messages are 1000 bytes.  It appears the problem is always 
seen on rank 3.


Now, I wouldn't mind someone telling me, based on that little 
information, what the problem is here, but I guess I don't expect that.  
What I am asking is what IBV_EVENT_QP_ACCESS_ERR means.  Again, it's 
seen during MPI_Finalize.  The async thread is seeing this.  What is 
this error trying to tell me?


[OMPI devel] async thread in openib BTL

2010-12-23 Thread Eugene Loh
I'm starting to look at the openib BTL for the first time and am 
puzzled.  In btl_openib_async.c, it looks like an asynchronous thread is 
started.  During MPI_Init(), the main thread sends the async thread a 
file descriptor for each IB interface to be polled.  In MPI_Finalize(), 
the main thread asks the async thread to shut down.  Between MPI_Init() 
and MPI_Finalize(), I would think that the async thread would poll on 
the IB fd's and handle events that come up.  If I stick print statements 
into the async thread, however, I don't see any events come up on the IB 
fd's.  So, the async thread is useless.  Yes?  It starts up and shuts 
down, but never sees any events on the IB devices?


Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-23 Thread Eugene Loh

Jeff Squyres (jsquyres) wrote:

Ya, it sounds like we should fix this eager limit help text so that others aren't misled. We did say "attempt", but that's probably a bit too subtle. 

Eugene - iirc: this is in the btl base (or some other central location) because it's shared between all btls. 
 

The cited text was from the OMPI FAQ ("Tuning" / "sm" section, item 6).  
I made the change in r1309.


In ompi/mca/btl/base/btl_base_mca.c, I added the phrase "including 
header" to both


"rndv_eager_limit"
"Size (in bytes, including header) of \"phase 1\" fragment sent for all 
large messages (must be >= 0 and <= eager_limit)"

module->btl_rndv_eager_limit

and

"eager_limit"
"Maximum size (in bytes, including header) of \"short\" messages (must 
be >= 1)."

module->btl_eager_limit

but I left

"max_send_size"
"Maximum size (in bytes) of a single \"phase 2\" fragment of a long 
message when using the pipeline protocol (must be >= 1)"

module->btl_max_send_size

alone (for some combination of lukewarm reasons).  Changes are in r24085.


Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-23 Thread Eugene Loh

George Bosilca wrote:


Moreover, eager send can improve performance if and only if the matching 
receives are already posted on the peer. If not, the data will become 
unexpected, and there will be one additional memcpy.

I don't think the first sentence is strictly true.  There is a cost 
associated with eager messages, but whether there is an overall 
improvement or not depends on lots of factors.


Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-23 Thread Eugene Loh




Sébastien Boisvert wrote:

  On Tuesday, November 23, 2010, at 16:07 -0500, Eugene Loh wrote:
  
  
Sébastien Boisvert wrote:


  Case 1: 30 MPI ranks, message size is 4096 bytes

File: mpirun-np-30-Program-4096.txt
Outcome: It hangs -- I killed the poor thing after 30 seconds or so.
  

4096 is rendezvous.  For eager, try 4000 or lower.

  
  According to ompi_info, the threshold is 4096, not 4000, right ?
  

Right.

  "btl_sm_eager_limit: Below this size, messages are sent "eagerly" --
that is, a sender attempts to write its entire message to shared buffers
without waiting for a receiver to be ready. Above this size, a sender
will only write the first part of a message, then wait for the receiver
to acknowledge its ready before continuing. Eager sends can improve
performance by decoupling senders from receivers."

source:
http://www.open-mpi.org/faq/?category=sm#more-sm

It should say "Below this size or equal to this size" instead of "Below
this size" as ompi_info says. ;)
  

Well, I guess it should say:

If message data plus header information fits within this limit, the
message is sent "eagerly"...

I guess I'll fix it.  (I suspect I wrote it in the first place.  Sigh.)




Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-23 Thread Eugene Loh

Sébastien Boisvert wrote:


Now I can describe the cases.
 

The test cases can all be explained by the test requiring eager messages 
(something that test4096.cpp does not require).



Case 1: 30 MPI ranks, message size is 4096 bytes

File: mpirun-np-30-Program-4096.txt
Outcome: It hangs -- I killed the poor thing after 30 seconds or so.
 


4096 is rendezvous.  For eager, try 4000 or lower.


Case 2: 30 MPI ranks, message size is 1 byte

File: mpirun-np-30-Program-1.txt.gz
Outcome: It runs just fine.
 


1 byte is eager.


Case 3: 2 MPI ranks, message size is 4096 bytes

File: mpirun-np-2-Program-4096.txt
Outcome: It hangs -- I killed the poor thing after 30 seconds or so.
 


Same as Case 1.


Case 4: 30 MPI ranks, message size if 4096 bytes, shared memory is
disabled

File: mpirun-mca-btl-^sm-np-30-Program-4096.txt.gz
Outcome: It runs just fine.
 

Eager limit for TCP is 65536 (perhaps less some overhead).  So, these 
messages are eager.





Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-23 Thread Eugene Loh

To add to Jeff's comments:

Sébastien Boisvert wrote:


The reason is that I am developping an MPI-based software, and I use
Open-MPI as it is the only implementation I am aware of that send
messages eagerly (powerful feature, that is).
 

As wonderful as OMPI is, I am fairly sure other MPI implementations also 
support eager message passing.  That is, there is a capability for a 
sender to hand message data over to the MPI implementation, freeing the 
user send buffer and allowing an MPI_Send() call to complete, without 
the message reaching the receiver or the receiver being ready.



Each byte transfer layer has its default limit to send eagerly a
message. With shared memory (sm), the value is 4096 bytes. At least it
is according to ompi_info.
 

Yes.  I think that 4096 bytes can be a little tricky... it may include 
some header information.  So, the amount of user data that could be sent 
would be a little bit less... e.g., 4,000 bytes or so.
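
(One way to check the actual threshold on a given build is to ask 
ompi_info; the exact syntax varies a little across versions, but 
something like:

% ompi_info --param btl sm | grep eager_limit

shows btl_sm_eager_limit, and the usable payload is that value minus 
the header overhead.)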



To verify this limit, I implemented a very simple test. The source code
is test4096.cpp, which basically just send a single message of 4096
bytes from a rank to another (rank 1 to 0).
 

I don't think the test says much at all.  It has one process post an 
MPI_Send and another post an MPI_Recv.  Such a test should complete 
under a very wide range of conditions.


Here is perhaps a better test:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
 int me, np;
 char buf[N];

 MPI_Init(&argc,&argv);
 MPI_Comm_rank(MPI_COMM_WORLD,&me);
 MPI_Comm_size(MPI_COMM_WORLD,&np);
 MPI_Send(buf,N,MPI_BYTE,1-me,343,MPI_COMM_WORLD);
 MPI_Recv(buf,N,MPI_BYTE,1-me,343,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
 printf("%d of %d done\n", me, np);
 MPI_Finalize();

 return 0;
}

Compile with the preprocessor symbol N defined to, say, 64.  Run for 
--np 2.  Each process will try to send.  The code will complete for 
short, eager messages.  If the messages are long, nothing is sent 
eagerly and both processes stay hung in their sends.  Bump N up slowly.  
For N=4096, the code hangs.  For N slightly less -- say, 4000 -- it runs.





Re: [OMPI devel] knem_dma_min

2010-08-18 Thread Eugene Loh

Eugene Loh wrote:


In  mca_btl_sm_get_sync(), I see this:
  /* Use the DMA flag if knem supports it *and* the segment length
   is greater than the cutoff.  Note that if the knem_dma_min
   value is 0 (i.e., the MCA param was set to 0), the segment size
   will never be larger than it, so DMA will never be used. */
icopy.flags = 0;
if (mca_btl_sm_component.knem_dma_min <= dst->seg_len) {
icopy.flags = mca_btl_sm_component.knem_dma_flag;
}

I'm going to poke around some more, but this doesn't on the surface 
make sense to me.  If knem_dma_min==0, it would seem as though the 
segment size will *always* be at least that large and DMA will 
*always* be used (if supported). 


Answering my own question (or guessing, in any case), maybe the code is 
okay but the comment is misleading.  If knem_dma_min==0, then 
mca_btl_sm_component_init() sets knem_dma_flag to 0.  So, the seg_len 
test passes, but it has no effect.


[OMPI devel] knem_dma_min

2010-08-18 Thread Eugene Loh

In  mca_btl_sm_get_sync(), I see this:

   /* Use the DMA flag if knem supports it *and* the segment length
  is greater than the cutoff.  Note that if the knem_dma_min
  value is 0 (i.e., the MCA param was set to 0), the segment size
  will never be larger than it, so DMA will never be used. */
   icopy.flags = 0;
   if (mca_btl_sm_component.knem_dma_min <= dst->seg_len) {
   icopy.flags = mca_btl_sm_component.knem_dma_flag;
   }

I'm going to poke around some more, but this doesn't on the surface make 
sense to me.  If knem_dma_min==0, it would seem as though the segment 
size will *always* be at least that large and DMA will *always* be used 
(if supported).


[OMPI devel] RFC: mpirun options

2010-04-19 Thread Eugene Loh
Jeff and I were talking about trac 2035 and the handling of mpirun 
command-line options.  While most mpirun options have long, 
multi-character names prefixed with a double dash, OMPI had originally 
also wanted to support combinations of short names (e.g., "mpirun -hvq", 
even if we don't document such combinations) as well as legacy 
single-dash names (e.g., "-host").  To improve diagnosibility of error 
messages and simplify the source code and user documentations, some 
simplifications seemed in order.  Since the command-line parsing is 
shared not only by mpirun but by all OMPI command-line interfaces, 
however, Jeff suggested an RFC.  So, here goes.

RFC: Drop mpirun Short-Name Combinations


WHAT:  No longer support the combination of multiple mpirun
short-name (single-character) options into a single argument.  E.g., do not
allow users to combine mpirun -h -q -v into mpirun -hqv.


Also, no longer describe separate single-dash and double-dash names such
as -server-wait-time and --server-wait-time.  Simply
give one name per option and indicate that it could be prefixed with either
a single or a double dash.


WHY:  To improve the diagnosibility of error messages and simplify
the description and support of mpirun options.


WHERE:  Basically, in opal/util/cmd_line.c.


WHEN:  Upon acceptance.


TIMEOUT:  May 7, 2010.



WHY (details)

Definitions


There are three kinds of mpirun option names:

kind of name       prefix   length             example
long name          --       multi-character    --verbose
short name         -        single-character   -v
single-dash name   -        multi-character    -np


Background


We had wanted to support long and short names.


Short names were supposed to be combinable.
E.g., instead of ls -l -t -r, just write ls -ltr.


To support backwards compatibility with options that had become well-known
from other MPI implementations, we also wanted to support certain
short names, such as -np or -host.  That is, even though the option starts
with a single-dash, we would first check to see if it were a special recognized
"single-dash" option name.  Only if that check failed would we expand the
argument further to parse it as a combination of short names.

Obfuscates Error Messages


Unfortunately, the resulting, more complicated grammar leads to misleading
error messages.  E.g., consider this example from
trac 2035:

% mpirun -tag-output -np 4 -nperslot 1 -H saem9,saem10 hostname
--
mpirun was unable to launch the specified application as it could not find an executable:

Executable: -p
Node: saem9

while attempting to start process rank 0.
--



The point of the ticket was mostly that a misspelled option is handled as
an unfound executable, but it also points out that we end up reporting on
an option (-p) that from the user's perspective isn't even on the
command line in the first place.  What has happened is that an option
(-nperslot) was not recognized, the first character (n)
was recognized, the option was parsed as a combination of short names,
and one of those short names (-p) was not recognized.


There are different ways of cleaning all of this up, but a simple solution
is just not to support short-name combinations.

Fringe Functionality


The ability to combine short names into a single "-" option is fringe
functionality for mpirun anyhow.


We don't document this ability in the first place.


Further, we don't have that many short names  -- 10, out of a total of 82 options --
and many combinations don't make much sense.  The ability to combine options makes
most sense for utilities that use short option names, and then if those options don't
take arguments.  E.g., ls -ltr in place of ls -l -t -r.  The
mpirun options just aren't like that.

Simplify Single-/Double-Dash Usage


We were going to support single-dash (multi-character) names only sparingly and
only for backwards compatibility with well-established options from other MPIs.
In reality, we routinely add a single-dash name for each new option we introduce.


We end up having both single-dash and double-dash names, making both source code
and user documentation less readable.


However, ultimately the source code doesn't even check these distinctions
when parsing the mpirun command line.
For example, we go to the effort in our source code and user documentation to
distinguish between -server-wait-time and --server-wait-time,
and between -rf and --rankfile.  When options are parsed, however,
we disregard any such distinctions.  E.g., --rf and -rankfile
are recognized.

Other Issues


The command-line parser is not only for mpirun, however, but for
all OMPI command-line interfaces.  Hence, this RFC.


