date:20140123

[OMPI devel] trunk and v1.7: xlc and lost atomics patch

2014-01-23 Thread Paul Hargrove

Testing the trunk w/ xlc-11.1 on a linux/ppc64 system I see two failures
from "make check".  Specifically the atomic_cmpset and atomic_spinlock
tests both get segfaults.

This is an issue I first reported against 1.5.5rc2 and v1.6.

It appears that ticket 3040 was opened at the time of my original report,
and my patch (attached to the ticket) was applied to v1.6 as r26226.
HOWEVER, the patch never seems to have made it into trunk at the time, and
thus is not in either v1.7 or trunk today.

Though the ticket indicates (and my testing today confirms) that xlc-11
will botch the atomic both with and without the patch, there *are* versions
of xlc which only generate correct atomics with the patch applied.

So, please CMR r26226 from v1.6 to v1.7(.5?) and trunk.
The patch still applies cleanly (offset of 1 line).

-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

Re: [OMPI devel] Unknown object files in libmpi.a

2014-01-23 Thread Paul Hargrove

Irvanda,

Others on this list might have specific knowledge of the objects you
listed, but I am going to present a general solution that hopefully will
let you find the answers you seek.

If you have libmpi.a built from sources configured with --enable-debug,
then the source file information is stored in the object files.  You can
use gdb to extract this information.

I don't have an openmpi-1.6.x build on hand, but here is an example with
the current trunk.
None of the files you listed are present in this build, so I've picked one
of the profiling objects as an example.  You should replace "[libdir]" with
your actual openmpi installation's lib directory.

-bash-4.2$ ar x [libdir]/libmpi.a pcart_create.o
-bash-4.2$ gdb -q pcart_create.o
Reading symbols from
/home/phargrov/OMPI/openmpi-trunk-netbsd6-amd64/INST/lib/foo/pcart_create.o...done.
(gdb) list
1   pcart_create.c: No such file or directory.
in pcart_create.c
(gdb) info source
Current source file is pcart_create.c
Compilation directory is
/home/phargrov/OMPI/openmpi-trunk-netbsd6-amd64/BLD/ompi/mpi/c/profile
Source language is c.
Compiled with DWARF 2 debugging format.
Does not include preprocessor macro info.


Notice I used 2 commands in gdb: "list" and "info source".
The "list" appears to fail because the source directory has been deleted.
However, the "list" step is required to make gdb read the source info from
the object (or "info source" will fail).
The output from the second command, "info source", is the important part:
 + The first line is the name (without directory) of the source file.
 + The second line is the directory in which the .o file was created.
That directory (for files generated at build time) or its "twin" in the
source tree (for normal source files) are the likely places to find the
source file.

I hope that helps,
-Paul

P.S.
If others have shorter sequences to get the same debug info from an object,
I am curious to hear them.



On Wed, Jan 22, 2014 at 8:57 PM, Irvanda Kurniadi wrote:

> Hi,
>
> I'm trying to port openmpi-1.6.5 in l4/fiasco. I checked the libmpi.a. I
> did the " ar t libmpi.a " in my terminal. I can't find the source file (.c)
> of some object files created in libmpi.a, such as:
> ompi_bitmap.o
> op_predefined.o
> convertor.o
> copy_functions.o
> copy_functions_heterogeneous.o
> datatype_pack.o
> datatype_unpack.o
> dt_add.o dt_args.o .. dt_sndrcv.o (15 files)
> fake_stack.o
> position.o
> libdatatype_reliable_la-datatype_pack.o
> libdatatype_reliable_la-datatype_unpack.o
> common_sm_mmap.o
>
> Can you tell me where is the source of those object files? Because I have
> to compile every single .c file in openmpi which is needed to be compiled.
> Thanks
>
> regards,
> Irvanda
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

[OMPI devel] build failure in trunk

2014-01-23 Thread Mike Dubman

06:29:26 make[3]: Leaving directory `/scrap/jenkins/workspace/hpc-ompi-shmem/label/r-vmb-centos5-u7-x86-64/ompi/mca/bcol/ptpcoll'
06:29:26 make[2]: Leaving directory `/scrap/jenkins/workspace/hpc-ompi-shmem/label/r-vmb-centos5-u7-x86-64/ompi/mca/bcol/ptpcoll'
06:29:26 Making install in mca/bcol/iboffload
06:29:26 make[2]: Entering directory `/scrap/jenkins/workspace/hpc-ompi-shmem/label/r-vmb-centos5-u7-x86-64/ompi/mca/bcol/iboffload'
06:29:26   CC   bcol_iboffload_module.lo
06:29:26 bcol_iboffload_module.c: In function 'load_func':
06:29:26 bcol_iboffload_module.c:734: error: 'mca_bcol_iboffload_allgather_register' undeclared (first use in this function)
06:29:26 bcol_iboffload_module.c:734: error: (Each undeclared identifier is reported only once
06:29:26 bcol_iboffload_module.c:734: error: for each function it appears in.)
06:29:26 make[2]: *** [bcol_iboffload_module.lo] Error 1
06:29:26 make[2]: Leaving directory `/scrap/jenkins/workspace/hpc-ompi-shmem/label/r-vmb-centos5-u7-x86-64/ompi/mca/bcol/iboffload'
06:29:26 make[1]: *** [install-recursive] Error 1
06:29:26 make[1]: Leaving directory `/scrap/jenkins/workspace/hpc-ompi-shmem/label/r-vmb-centos5-u7-x86-64/ompi'
06:29:26 make: *** [install-recursive] Error 1

Re: [OMPI devel] 1.7.4 status update

2014-01-23 Thread Paul Hargrove

On Wed, Jan 22, 2014 at 7:27 PM, Paul Hargrove  wrote:

> After the 1.7 tests on the XLF, Open64 and PathScale platforms complete
> I'll be testing the trunk on those systems with the compiler-appropriate
> --enable-mpi-fortran= settings.



The following are results (for trunk) for four compilers that couldn't
build the trunk 24 hours ago unless configured with --disable-mpi-fortran:

pgi-11.9 works with --enable-mpi-fortran=mpif (has app-link failures at
higher levels - NOT source issues)
pathscale-4.0 works with --enable-mpi-fortran=usempi
open64-4.5 works with --enable-mpi-fortran=usempi
xlf-14.1 works with --enable-mpi-fortran=usempi

-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

[OMPI devel] trunk: typo in error message

2014-01-23 Thread Paul Hargrove

As originally noted in Dec 2011 (
http://www.open-mpi.org/community/lists/devel/2011/12/10169.php) there is a
1-character typo in generate-asm.pl:

$ cat -n generate-asm.pl | head -20
 1  #!/usr/bin/perl -w
 2
 3
 4  my $asmarch = shift;
 5  my $asmformat = shift;
 6  my $basedir = shift;
 7  my $output = shift;
 8
 9  if ( ! $asmarch) {
10  print "usage: generate-asm.pl [ASMARCH] [ASMFORMAT] [BASEDIR]
[OUTPUT NAME]\n";
11  exit(1);
12  }
13
14  open(INPUT, "$basedir/$asmarch.asm") ||
15  die "Could not open $basedir/$asmarch.asm: $!\n";
16  open(OUTPUT, ">$output") || die "Could not open $output: $1\n";
17
18  $CONFIG = "default";
19  $TEXT = "";
20  $GLOBAL = "";

The "$1" on line 16 should actually be "$!".
The perl variable "$1" is the result of a prior pattern match, of which
there ARE NONE.
The perl variable "$!", however, is the equivalent of "strerror(errno)" in
C.

This typo is still present in today's trunk.
It is, of course, entirely harmless.

-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

[OMPI devel] mca_bml_r2_del_btl incorrect memory size reallocation?

2014-01-23 Thread Christoph Niethammer

Hello

I think I found a minor memory bug in the bml_r2 code in function 
mca_bml_r2_del_btl but I could not figure out when this function ever gets 
called.
How can I test this function in a proper way?

Here is the diff showing the issue:

@@ -699,11 +699,11 @@ static int mca_bml_r2_del_btl(mca_btl_base_module_t* btl)
 if(!found) {
 /* doesn't even exist */
 goto CLEANUP;
 }
 /* remove from bml list */
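-modules = (mca_btl_base_module_t**)malloc(sizeof(mca_btl_base_module_t*) * mca_bml_r2.num_btl_modules-1);
+modules = (mca_btl_base_module_t**)malloc(sizeof(mca_btl_base_module_t*) * (mca_bml_r2.num_btl_modules-1));
 for(i=0,m=0; i<[...]
 if(mca_bml_r2.btl_modules[i] != btl) {
 modules[m++] = mca_bml_r2.btl_modules[i];
 }
 }


Regards
Christoph

--

Christoph Niethammer
High Performance Computing Center Stuttgart (HLRS)
Nobelstrasse 19
70569 Stuttgart

Tel: ++49(0)711-685-87203
email: nietham...@hlrs.de
http://www.hlrs.de/people/niethammer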

Re: [OMPI devel] 1.7.4 status update

2014-01-23 Thread Ralph Castain

woot!!! Thanks Paul and Jeff!

On Jan 22, 2014, at 10:22 PM, Paul Hargrove  wrote:

> 
> On Wed, Jan 22, 2014 at 7:27 PM, Paul Hargrove  wrote:
> After the 1.7 tests on the XLF, Open64 and PathScale platforms complete I'll 
> be testing the trunk on those systems with the compiler-appropriate 
> --enable-mpi-fortran= settings.
> 
> 
> The following are results (for trunk) for four compilers that couldn't build 
> the trunk 24 hours ago unless configured with --disable-mpi-fortran:
> 
> pgi-11.9 works with --enable-mpi-fortran=mpif (has app-link failures at 
> higher levels - NOT source issues)
> pathscale-4.0 works with --enable-mpi-fortran=usempi
> open64-4.5 works with --enable-mpi-fortran=usempi
> xlf-14.1 works with --enable-mpi-fortran=usempi
> 
> -Paul
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

Re: [OMPI devel] trunk: typo in error message

2014-01-23 Thread Jeff Squyres (jsquyres)

Fixed and slated for 1.7.5; thanks.

On Jan 23, 2014, at 2:33 AM, Paul Hargrove  wrote:

> As originally noted in Dec 2011 
> (http://www.open-mpi.org/community/lists/devel/2011/12/10169.php) there is a 
> 1-character typo in generate-asm.pl:
> 
> $ cat -n generate-asm.pl | head -20
>  1  #!/usr/bin/perl -w
>  2  
>  3  
>  4  my $asmarch = shift;
>  5  my $asmformat = shift;
>  6  my $basedir = shift;
>  7  my $output = shift;
>  8  
>  9  if ( ! $asmarch) { 
> 10  print "usage: generate-asm.pl [ASMARCH] [ASMFORMAT] [BASEDIR] 
> [OUTPUT NAME]\n";
> 11  exit(1);
> 12  }
> 13  
> 14  open(INPUT, "$basedir/$asmarch.asm") || 
> 15  die "Could not open $basedir/$asmarch.asm: $!\n";
> 16  open(OUTPUT, ">$output") || die "Could not open $output: $1\n";
> 17  
> 18  $CONFIG = "default";
> 19  $TEXT = "";
> 20  $GLOBAL = "";
> 
> The "$1" on line 16 should actually be "$!".
> The perl variable "$1" is the result of a prior pattern match, of which there 
> ARE NONE.
> The perl variable "$!", however, is the equivalent of "strerror(errno)" in C.
> 
> This typo is still present in today's trunk.
> It is, of course, entirely harmless.
> 
> -Paul
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

[OMPI devel] [PATCH] make orte-checkpoint communicate with orterun again

2014-01-23 Thread Adrian Reber

The following patch makes orte-checkpoint communicate with orterun again:

diff --git a/orte/tools/orte-checkpoint/orte-checkpoint.c b/orte/tools/orte-checkpoint/orte-checkpoint.c
index 7106342..8539f34 100644
--- a/orte/tools/orte-checkpoint/orte-checkpoint.c
+++ b/orte/tools/orte-checkpoint/orte-checkpoint.c
@@ -834,7 +834,7 @@ static int notify_process_for_checkpoint(opal_crs_base_ckpt_options_t *options)
 }

 if (ORTE_SUCCESS != (ret = orte_rml.send_buffer_nb(&(orterun_hnp->name), buffer,
-   ORTE_RML_TAG_CKPT, hnp_receiver,
+   ORTE_RML_TAG_CKPT, orte_rml_send_callback,
 NULL))) {
 exit_status = ret;
 goto cleanup;
@@ -845,11 +845,6 @@ static int notify_process_for_checkpoint(opal_crs_base_ckpt_options_t *options)
 ORTE_JOBID_PRINT(jobid));

  cleanup:
-if( NULL != buffer) {
-OBJ_RELEASE(buffer);
-buffer = NULL;
-}
-
 if( ORTE_SUCCESS != exit_status ) {
 opal_show_help("help-orte-checkpoint.txt", "unable_to_connect", true,
 orte_checkpoint_globals.pid);


Before committing the code into the repository I wanted to make
sure it is the correct way to fix it.

The first change changes the callback to orte_rml_send_callback().
When I initially made the code compile again I used hnp_receiver()
to change the code from blocking to non-blocking and that was
wrong.

The second change (removal of OBJ_RELEASE(buffer)) is necessary
because this seems to delete buffer during communication and then
everything breaks badly.

Adrian

[OMPI devel] [PATCH] use ORTE_PROC_IS_APP

2014-01-23 Thread Adrian Reber

Selecting SNAPC requires knowing whether the process is an app or not:

int orte_snapc_base_select(bool seed, bool app);

The following patch uses the correct define. Can I commit it like this?

diff --git a/orte/mca/ess/base/ess_base_std_app.c b/orte/mca/ess/base/ess_base_std_app.c
index dbbb2f4..f3a38f0 100644
--- a/orte/mca/ess/base/ess_base_std_app.c
+++ b/orte/mca/ess/base/ess_base_std_app.c
@@ -252,7 +252,7 @@ int orte_ess_base_app_setup(bool db_restrict_local)
 error = "orte_sstore_base_open";
 goto error;
 }
-if (ORTE_SUCCESS != (ret = orte_snapc_base_select(ORTE_PROC_IS_HNP, !ORTE_PROC_IS_DAEMON))) {
+if (ORTE_SUCCESS != (ret = orte_snapc_base_select(ORTE_PROC_IS_HNP, ORTE_PROC_IS_APP))) {
 ORTE_ERROR_LOG(ret);
 error = "orte_snapc_base_select";
 goto error;
diff --git a/orte/mca/ess/base/ess_base_std_tool.c b/orte/mca/ess/base/ess_base_std_tool.c
index 98c1685..7fcf83d 100644
--- a/orte/mca/ess/base/ess_base_std_tool.c
+++ b/orte/mca/ess/base/ess_base_std_tool.c
@@ -189,7 +189,7 @@ int orte_ess_base_tool_setup(void)
 error = "orte_snapc_base_open";
 goto error;
 }
-if (ORTE_SUCCESS != (ret = orte_snapc_base_select(ORTE_PROC_IS_HNP, !ORTE_PROC_IS_DAEMON))) {
+if (ORTE_SUCCESS != (ret = orte_snapc_base_select(ORTE_PROC_IS_HNP, ORTE_PROC_IS_APP))) {
 ORTE_ERROR_LOG(ret);
 error = "orte_snapc_base_select";
 goto error;
diff --git a/orte/mca/ess/hnp/ess_hnp_module.c b/orte/mca/ess/hnp/ess_hnp_module.c
index a6f1777..ea444c4 100644
--- a/orte/mca/ess/hnp/ess_hnp_module.c
+++ b/orte/mca/ess/hnp/ess_hnp_module.c
@@ -678,7 +678,7 @@ static int rte_init(void)
 error = "orte_sstore_base_open";
 goto error;
 }
-if (ORTE_SUCCESS != (ret = orte_snapc_base_select(ORTE_PROC_IS_HNP, !ORTE_PROC_IS_DAEMON))) {
+if (ORTE_SUCCESS != (ret = orte_snapc_base_select(ORTE_PROC_IS_HNP, ORTE_PROC_IS_APP))) {
 ORTE_ERROR_LOG(ret);
 error = "orte_snapc_base_select";
 goto error;

Re: [OMPI devel] 1.7.4rc: yet another launch failure

2014-01-23 Thread Nathan Hjelm

I agree. A configure option to disable the use of getpwuid would be
great as it is one of those functions that can never be static. getpwuid
also fails for no particular reason on at least one XC30.

-Nathan

On Wed, Jan 22, 2014 at 08:57:20PM -0800, Ralph Castain wrote:
>Interesting - still, I see no reason for OMPI to fail just because of
>that. We can run just fine with the uid, so I'll make things a little more
>flexible.
>Thanks for tracking it down!
>On Jan 22, 2014, at 7:54 PM, Paul Hargrove  wrote:
> 
>  Not lacking getpwuid():
>  [phh1@biou2 BLD]$ grep HAVE_GETPWUID */include/*_config.h
>  opal/include/opal_config.h:#define HAVE_GETPWUID 1
>  I also can't see why the quoted code could fail.
>  The following is working fine:
>  [phh1@biou2 BLD]$ cat q.c
>  #include <stdio.h>
>  #include <sys/types.h>
>  #include <pwd.h>
>  #include <unistd.h>
>  int main(void) {
> uid_t uid = getuid();
> printf("uid = %d\n", (int)uid);
> struct passwd *p = getpwuid(uid); 
> if (p) printf("name = %s\n", p->pw_name);
> return 0;
>  }
>  [phh1@biou2 BLD]$ gcc -std=c99 q.c && ./a.out
>  uid = 44154
>  name = phh1
>  HOWEVER, building for ILP32 target (as in the reported failure) fails:
>  [phh1@biou2 BLD]$ gcc -m32 -std=c99 q.c && ./a.out
>  uid = 44154
>  So, I am going to guess that this *is* a system misconfiguration (maybe
>  missing the 32-bit foo.so for the appropriate nsswitch resolver?) just
>  as the error message said.
>  Sorry for the false alarm,
>  -Paul
> 
>  On Wed, Jan 22, 2014 at 7:36 PM, Ralph Castain  wrote:
> 
>Here is the offending code:
> /* get the name of the user */
>uid = getuid();
>#ifdef HAVE_GETPWUID
>pwdent = getpwuid(uid);
>#else
>pwdent = NULL;
>#endif
>if (NULL != pwdent) {
>user = strdup(pwdent->pw_name);
>} else {
>orte_show_help("help-orte-runtime.txt",
>   "orte:session:dir:nopwname", true);
>return ORTE_ERR_OUT_OF_RESOURCE;
>}
>Is it possible on this platform that you don't have getpwuid? I'm
>surprised at the code as we could just use the uid instead - not sure
>why this more stringent test was applied
>On Jan 22, 2014, at 7:02 PM, Paul Hargrove  wrote:
> 
>  On yet another test platform I see the following:
>  $ mpirun -mca btl sm,self -np 1 examples/ring_c
>  
> --
>  Open MPI was unable to obtain the username in order to create a path
>  for its required temporary directories.  This type of error is
>  usually
>  caused by a transient failure of network-based authentication
>  services
>  (e.g., LDAP or NIS failure due to network congestion), but can also
>  be
>  an indication of system misconfiguration.
>  Please consult your system administrator about these issues and try
>  again.
>  
> --
>  [biou2.rice.edu:30021] [[40214,0],0] ORTE_ERROR_LOG: Out of resource
>  in file
>  
> /home/phh1/SCRATCH/OMPI/openmpi-1.7-latest-linux-ppc32-xlc-11.1/openmpi-1.7.4rc2r30361/orte/util/session_dir.c
>  at line 380
>  [biou2.rice.edu:30021] [[40214,0],0] ORTE_ERROR_LOG: Out of resource
>  in file
>  
> /home/phh1/SCRATCH/OMPI/openmpi-1.7-latest-linux-ppc32-xlc-11.1/openmpi-1.7.4rc2r30361/orte/mca/ess/hnp/ess_hnp_module.c
>  at line 599
>  
> --
>  It looks like orte_init failed for some reason; your parallel
>  process is
>  likely to abort.  There are many reasons that a parallel process can
>  fail during orte_init; some of which are due to configuration or
>  environment problems.  This failure appears to be an internal
>  failure;
>  here's some additional information (which may only be relevant to an
>  Open MPI developer):
>orte_session_dir failed
>--> Returned value Out of resource (-2) instead of ORTE_SUCCESS
>  
> --
>  An "-np 2" run fails in the same manner.
>  This is a production system and there is no problem with "whoami" or
>  "id", leaving me doubting the explanation provided by the error
>  message.
>  [phh1@biou2 ~]$ whoami
>  phh1
>  [phh1@biou2 ~]$ id
>  uid=44154(phh1) gid=2016(hpc)
>  groups=2016(hpc),3803(hpcusers),3805(sshgw),3808(biou)
>  The "ompi_info --all" output is attached.
>  Pl

Re: [OMPI devel] build failure in trunk

2014-01-23 Thread Nathan Hjelm

Shoot. Forgot to add the ignore for that component. Will do that now.

-Nathan

On Thu, Jan 23, 2014 at 08:17:47AM +0200, Mike Dubman wrote:
>  06:29:26 make[3]: Leaving directory 
> `/scrap/jenkins/workspace/hpc-ompi-shmem/label/r-vmb-centos5-u7-x86-64/ompi/mca/bcol/ptpcoll'
>  06:29:26 make[2]: Leaving directory 
> `/scrap/jenkins/workspace/hpc-ompi-shmem/label/r-vmb-centos5-u7-x86-64/ompi/mca/bcol/ptpcoll'
>  06:29:26 Making install in mca/bcol/iboffload
>  06:29:26 make[2]: Entering directory 
> `/scrap/jenkins/workspace/hpc-ompi-shmem/label/r-vmb-centos5-u7-x86-64/ompi/mca/bcol/iboffload'
>  06:29:26   CC   bcol_iboffload_module.lo
>  06:29:26 bcol_iboffload_module.c: In function 'load_func':
>  06:29:26 bcol_iboffload_module.c:734: error: 
> 'mca_bcol_iboffload_allgather_register' undeclared (first use in this 
> function)
>  06:29:26 bcol_iboffload_module.c:734: error: (Each undeclared identifier is 
> reported only once
>  06:29:26 bcol_iboffload_module.c:734: error: for each function it appears 
> in.)
>  06:29:26 make[2]: *** [bcol_iboffload_module.lo] Error 1
>  06:29:26 make[2]: Leaving directory 
> `/scrap/jenkins/workspace/hpc-ompi-shmem/label/r-vmb-centos5-u7-x86-64/ompi/mca/bcol/iboffload'
>  06:29:26 make[1]: *** [install-recursive] Error 1
>  06:29:26 make[1]: Leaving directory 
> `/scrap/jenkins/workspace/hpc-ompi-shmem/label/r-vmb-centos5-u7-x86-64/ompi'
>  06:29:26 make: *** [install-recursive] Error 1





Re: [OMPI devel] trunk and v1.7: xlc and lost atomics patch

2014-01-23 Thread Ralph Castain

Sigh - no idea how that patch went into the 1.6 series without first entering 
the trunk. Thanks so much for tracking it down!

Now in the trunk and cmr'd for 1.7.4

On Jan 22, 2014, at 9:23 PM, Paul Hargrove  wrote:

> Testing the trunk w/ xlc-11.1 on a linux/ppc64 system I see two failures from 
> "make check".  Specifically the atomic_cmpset and atomic_spinlock tests both 
> get segfaults.
> 
> This is an issue I first reported against 1.5.5rc2 and v1.6.
> 
> It appears that ticket 3040 was opened at the time of my original report, and 
> my patch (attached to the ticket) was applied to v1.6 as r26226.  HOWEVER, 
> the patch never seems to have made into trunk at the time; and thus is not in 
> either v1.7 or trunk today.
> 
> Though the ticket indicates (and my testing today confirms) that xlc-11 will 
> botch the atomic both with and without the patch, there *are* versions of xlc 
> which only generate correct atomics with the patch applied.
> 
> So, please CMR r26226 from v1.6 to v1.7(.5?) and trunk.
> The patch still applies cleanly (offset of 1 line).
> 
> -Paul
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

Re: [OMPI devel] [PATCH] make orte-checkpoint communicate with orterun again

2014-01-23 Thread Ralph Castain

Looks correct to me - you are right in that you cannot release the buffer until 
after the send completes. We don't copy the data underneath to save memory and 
time.


On Jan 23, 2014, at 6:51 AM, Adrian Reber  wrote:

> Following patch makes orte-checkpoint communicate with orterun again:
> 
> diff --git a/orte/tools/orte-checkpoint/orte-checkpoint.c 
> b/orte/tools/orte-checkpoint/orte-checkpoint.c
> index 7106342..8539f34 100644
> --- a/orte/tools/orte-checkpoint/orte-checkpoint.c
> +++ b/orte/tools/orte-checkpoint/orte-checkpoint.c
> @@ -834,7 +834,7 @@ static int 
> notify_process_for_checkpoint(opal_crs_base_ckpt_options_t *options)
> }
> 
> if (ORTE_SUCCESS != (ret = orte_rml.send_buffer_nb(&(orterun_hnp->name), 
> buffer,
> -   ORTE_RML_TAG_CKPT, 
> hnp_receiver,
> +   ORTE_RML_TAG_CKPT, 
> orte_rml_send_callback,
>NULL))) {
> exit_status = ret;
> goto cleanup;
> @@ -845,11 +845,6 @@ static int 
> notify_process_for_checkpoint(opal_crs_base_ckpt_options_t *options)
> ORTE_JOBID_PRINT(jobid));
> 
>  cleanup:
> -if( NULL != buffer) {
> -OBJ_RELEASE(buffer);
> -buffer = NULL;
> -}
> -
> if( ORTE_SUCCESS != exit_status ) {
> opal_show_help("help-orte-checkpoint.txt", "unable_to_connect", true,
>orte_checkpoint_globals.pid);
> 
> 
> Before committing the code into the repository I wanted to make
> sure it is the correct way to fix it.
> 
> The first change changes the callback to orte_rml_send_callback().
> When I initially made the code compile again I used hnp_receiver()
> to change the code from blocking to non-blocking and that was
> wrong.
> 
> The second change (removal of OBJ_RELEASE(buffer)) is necessary
> because this seems to delete buffer during communication and then
> everything breaks badly.
> 
>   Adrian

Re: [OMPI devel] [PATCH] use ORTE_PROC_IS_APP

2014-01-23 Thread Ralph Castain

Sure - no issues with me


On Jan 23, 2014, at 7:10 AM, Adrian Reber  wrote:

> Selecting SNAPC requires the information if it is an app or not:
> 
> int orte_snapc_base_select(bool seed, bool app);
> 
> The following patch uses the correct define. Can I commit it like this:
> 
> diff --git a/orte/mca/ess/base/ess_base_std_app.c 
> b/orte/mca/ess/base/ess_base_std_app.c
> index dbbb2f4..f3a38f0 100644
> --- a/orte/mca/ess/base/ess_base_std_app.c
> +++ b/orte/mca/ess/base/ess_base_std_app.c
> @@ -252,7 +252,7 @@ int orte_ess_base_app_setup(bool db_restrict_local)
> error = "orte_sstore_base_open";
> goto error;
> }
> -if (ORTE_SUCCESS != (ret = orte_snapc_base_select(ORTE_PROC_IS_HNP, 
> !ORTE_PROC_IS_DAEMON))) {
> +if (ORTE_SUCCESS != (ret = orte_snapc_base_select(ORTE_PROC_IS_HNP, 
> ORTE_PROC_IS_APP))) {
> ORTE_ERROR_LOG(ret);
> error = "orte_snapc_base_select";
> goto error;
> diff --git a/orte/mca/ess/base/ess_base_std_tool.c 
> b/orte/mca/ess/base/ess_base_std_tool.c
> index 98c1685..7fcf83d 100644
> --- a/orte/mca/ess/base/ess_base_std_tool.c
> +++ b/orte/mca/ess/base/ess_base_std_tool.c
> @@ -189,7 +189,7 @@ int orte_ess_base_tool_setup(void)
> error = "orte_snapc_base_open";
> goto error;
> }
> -if (ORTE_SUCCESS != (ret = orte_snapc_base_select(ORTE_PROC_IS_HNP, 
> !ORTE_PROC_IS_DAEMON))) {
> +if (ORTE_SUCCESS != (ret = orte_snapc_base_select(ORTE_PROC_IS_HNP, 
> ORTE_PROC_IS_APP))) {
> ORTE_ERROR_LOG(ret);
> error = "orte_snapc_base_select";
> goto error;
> diff --git a/orte/mca/ess/hnp/ess_hnp_module.c 
> b/orte/mca/ess/hnp/ess_hnp_module.c
> index a6f1777..ea444c4 100644
> --- a/orte/mca/ess/hnp/ess_hnp_module.c
> +++ b/orte/mca/ess/hnp/ess_hnp_module.c
> @@ -678,7 +678,7 @@ static int rte_init(void)
> error = "orte_sstore_base_open";
> goto error;
> }
> -if (ORTE_SUCCESS != (ret = orte_snapc_base_select(ORTE_PROC_IS_HNP, 
> !ORTE_PROC_IS_DAEMON))) {
> +if (ORTE_SUCCESS != (ret = orte_snapc_base_select(ORTE_PROC_IS_HNP, 
> ORTE_PROC_IS_APP))) {
> ORTE_ERROR_LOG(ret);
> error = "orte_snapc_base_select";
> goto error;

Re: [OMPI devel] mca_bml_r2_del_btl incorrect memory size reallocation?

2014-01-23 Thread Ralph Castain

I would think valgrind on the app would be your best bet. If/when you do commit 
it, please remember to cmr it for the 1.7.5 milestone.

Thanks
Ralph

On Jan 23, 2014, at 12:57 AM, Christoph Niethammer  wrote:

> Hello
> 
> I think I found a minor memory bug in the bml_r2 code in function 
> mca_bml_r2_del_btl but I could not figure out when this function ever gets 
> called.
> How can I test this function in a proper way?
> 
> Here the diff showing the issue:
> 
> @@ -699,11 +699,11 @@ static int mca_bml_r2_del_btl(mca_btl_base_module_t* 
> btl)
> if(!found) {
> /* doesn't even exist */
> goto CLEANUP;
> }
> /* remove from bml list */
> -modules = (mca_btl_base_module_t**)malloc(sizeof(mca_btl_base_module_t*) 
> * mca_bml_r2.num_btl_modules-1);
> +modules = (mca_btl_base_module_t**)malloc(sizeof(mca_btl_base_module_t*) 
> * (mca_bml_r2.num_btl_modules-1));
> for(i=0,m=0; i<[...]
> if(mca_bml_r2.btl_modules[i] != btl) {
> modules[m++] = mca_bml_r2.btl_modules[i];
> }
> }
> 
> 
> Regards
> Christoph
> 
> --
> 
> Christoph Niethammer
> High Performance Computing Center Stuttgart (HLRS)
> Nobelstrasse 19
> 70569 Stuttgart
> 
> Tel: ++49(0)711-685-87203
> email: nietham...@hlrs.de
> http://www.hlrs.de/people/niethammer

Re: [OMPI devel] [PATCH] make orte-checkpoint communicate with orterun again

2014-01-23 Thread Josh Hursey

+1


On Thu, Jan 23, 2014 at 10:16 AM, Ralph Castain  wrote:

> Looks correct to me - you are right in that you cannot release the buffer
> until after the send completes. We don't copy the data underneath to save
> memory and time.
>
>
> On Jan 23, 2014, at 6:51 AM, Adrian Reber  wrote:
>
> > Following patch makes orte-checkpoint communicate with orterun again:
> >
> > diff --git a/orte/tools/orte-checkpoint/orte-checkpoint.c
> b/orte/tools/orte-checkpoint/orte-checkpoint.c
> > index 7106342..8539f34 100644
> > --- a/orte/tools/orte-checkpoint/orte-checkpoint.c
> > +++ b/orte/tools/orte-checkpoint/orte-checkpoint.c
> > @@ -834,7 +834,7 @@ static int
> notify_process_for_checkpoint(opal_crs_base_ckpt_options_t *options)
> > }
> >
> > if (ORTE_SUCCESS != (ret =
> orte_rml.send_buffer_nb(&(orterun_hnp->name), buffer,
> > -
> ORTE_RML_TAG_CKPT, hnp_receiver,
> > +
> ORTE_RML_TAG_CKPT, orte_rml_send_callback,
> >NULL))) {
> > exit_status = ret;
> > goto cleanup;
> > @@ -845,11 +845,6 @@ static int
> notify_process_for_checkpoint(opal_crs_base_ckpt_options_t *options)
> > ORTE_JOBID_PRINT(jobid));
> >
> >  cleanup:
> > -if( NULL != buffer) {
> > -OBJ_RELEASE(buffer);
> > -buffer = NULL;
> > -}
> > -
> > if( ORTE_SUCCESS != exit_status ) {
> > opal_show_help("help-orte-checkpoint.txt", "unable_to_connect",
> true,
> >orte_checkpoint_globals.pid);
> >
> >
> > Before committing the code into the repository I wanted to make
> > sure it is the correct way to fix it.
> >
> > The first change changes the callback to orte_rml_send_callback().
> > When I initially made the code compile again I used hnp_receiver()
> > to change the code from blocking to non-blocking and that was
> > wrong.
> >
> > The second change (removal of OBJ_RELEASE(buffer)) is necessary
> > because this seems to delete buffer during communication and then
> > everything breaks badly.
> >
> >   Adrian



-- 
Joshua Hursey
Assistant Professor of Computer Science
University of Wisconsin-La Crosse
http://cs.uwlax.edu/~jjhursey

Re: [OMPI devel] [PATCH] use ORTE_PROC_IS_APP

2014-01-23 Thread Josh Hursey

That should be ok.


On Thu, Jan 23, 2014 at 10:17 AM, Ralph Castain  wrote:

> Sure - no issues with me
>
>
> On Jan 23, 2014, at 7:10 AM, Adrian Reber  wrote:
>
> > Selecting SNAPC requires the information if it is an app or not:
> >
> > int orte_snapc_base_select(bool seed, bool app);
> >
> > The following patch uses the correct define. Can I commit it like this:
> >
> > diff --git a/orte/mca/ess/base/ess_base_std_app.c
> b/orte/mca/ess/base/ess_base_std_app.c
> > index dbbb2f4..f3a38f0 100644
> > --- a/orte/mca/ess/base/ess_base_std_app.c
> > +++ b/orte/mca/ess/base/ess_base_std_app.c
> > @@ -252,7 +252,7 @@ int orte_ess_base_app_setup(bool db_restrict_local)
> > error = "orte_sstore_base_open";
> > goto error;
> > }
> > -if (ORTE_SUCCESS != (ret = orte_snapc_base_select(ORTE_PROC_IS_HNP,
> !ORTE_PROC_IS_DAEMON))) {
> > +if (ORTE_SUCCESS != (ret = orte_snapc_base_select(ORTE_PROC_IS_HNP,
> ORTE_PROC_IS_APP))) {
> > ORTE_ERROR_LOG(ret);
> > error = "orte_snapc_base_select";
> > goto error;
> > diff --git a/orte/mca/ess/base/ess_base_std_tool.c
> b/orte/mca/ess/base/ess_base_std_tool.c
> > index 98c1685..7fcf83d 100644
> > --- a/orte/mca/ess/base/ess_base_std_tool.c
> > +++ b/orte/mca/ess/base/ess_base_std_tool.c
> > @@ -189,7 +189,7 @@ int orte_ess_base_tool_setup(void)
> > error = "orte_snapc_base_open";
> > goto error;
> > }
> > -if (ORTE_SUCCESS != (ret = orte_snapc_base_select(ORTE_PROC_IS_HNP,
> !ORTE_PROC_IS_DAEMON))) {
> > +if (ORTE_SUCCESS != (ret = orte_snapc_base_select(ORTE_PROC_IS_HNP,
> ORTE_PROC_IS_APP))) {
> > ORTE_ERROR_LOG(ret);
> > error = "orte_snapc_base_select";
> > goto error;
> > diff --git a/orte/mca/ess/hnp/ess_hnp_module.c
> b/orte/mca/ess/hnp/ess_hnp_module.c
> > index a6f1777..ea444c4 100644
> > --- a/orte/mca/ess/hnp/ess_hnp_module.c
> > +++ b/orte/mca/ess/hnp/ess_hnp_module.c
> > @@ -678,7 +678,7 @@ static int rte_init(void)
> > error = "orte_sstore_base_open";
> > goto error;
> > }
> > -if (ORTE_SUCCESS != (ret = orte_snapc_base_select(ORTE_PROC_IS_HNP,
> !ORTE_PROC_IS_DAEMON))) {
> > +if (ORTE_SUCCESS != (ret = orte_snapc_base_select(ORTE_PROC_IS_HNP,
> ORTE_PROC_IS_APP))) {
> > ORTE_ERROR_LOG(ret);
> > error = "orte_snapc_base_select";
> > goto error;
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>



-- 
Joshua Hursey
Assistant Professor of Computer Science
University of Wisconsin-La Crosse
http://cs.uwlax.edu/~jjhursey
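
The difference between the two arguments in the quoted patch can be sketched as follows. This is only an illustration under stated assumptions: the flag values and the helper names (not_daemon, is_app) are hypothetical and are NOT the ORTE definitions. It assumes a process can be an HNP, a daemon, a tool, or an app, as the file names in the patch suggest.

```c
#include <stdbool.h>

/* Hypothetical process-type flags, loosely modeled on the roles named in
 * the patch (hnp, tool, app, daemon). These are NOT the ORTE definitions. */
enum proc_type {
    PROC_HNP    = 1 << 0,
    PROC_DAEMON = 1 << 1,
    PROC_TOOL   = 1 << 2,
    PROC_APP    = 1 << 3
};

/* Models the old argument: "anything that is not a daemon". */
static bool not_daemon(unsigned type)
{
    return !(type & PROC_DAEMON);
}

/* Models the new argument: "explicitly an app". */
static bool is_app(unsigned type)
{
    return (type & PROC_APP) != 0;
}
```

In this model not_daemon() is also true for PROC_HNP and PROC_TOOL, while is_app() is true only for PROC_APP, which is why an explicit app test is the more precise value to pass as the "app" argument.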

Re: [OMPI devel] mca_bml_r2_del_btl incorrect memory size reallocation?

2014-01-23 Thread Jeff Squyres (jsquyres)

This function is generally called during MPI_Finalize (i.e., when everything is 
being torn down).

It may also be called during the disconnection of MPI dynamic processes (i.e., 
we don't need a BTL connection to a given peer anymore because the last 
communicator containing it has been released).


On Jan 23, 2014, at 3:57 AM, Christoph Niethammer  wrote:

> Hello
> 
> I think I found a minor memory bug in the bml_r2 code in function 
> mca_bml_r2_del_btl but I could not figure out when this function ever gets 
> called.
> How can I test this function in a proper way?
> 
> Here the diff showing the issue:
> 
> @@ -699,11 +699,11 @@ static int mca_bml_r2_del_btl(mca_btl_base_module_t* 
> btl)
> if(!found) {
> /* doesn't even exist */
> goto CLEANUP;
> }
> /* remove from bml list */
> -modules = (mca_btl_base_module_t**)malloc(sizeof(mca_btl_base_module_t*) 
> * mca_bml_r2.num_btl_modules-1);
> +modules = (mca_btl_base_module_t**)malloc(sizeof(mca_btl_base_module_t*) 
> * (mca_bml_r2.num_btl_modules-1));
> for(i=0,m=0; i<mca_bml_r2.num_btl_modules; i++) {
> if(mca_bml_r2.btl_modules[i] != btl) {
> modules[m++] = mca_bml_r2.btl_modules[i];
> }
> }
> 
> 
> Regards
> Christoph
> 
> --
> 
> Christoph Niethammer
> High Performance Computing Center Stuttgart (HLRS)
> Nobelstrasse 19
> 70569 Stuttgart
> 
> Tel: ++49(0)711-685-87203
> email: nietham...@hlrs.de
> http://www.hlrs.de/people/niethammer


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
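
The precedence issue in the quoted diff can be seen in isolation with a small sketch. The helper names below are hypothetical and only compute the byte counts; they do not call malloc or touch any BML structures.

```c
#include <stddef.h>

/* Byte count as written in the original line. Multiplication binds tighter
 * than subtraction, so this evaluates as (elem_size * n) - 1. */
static size_t bytes_as_written(size_t elem_size, size_t n)
{
    return elem_size * n - 1;
}

/* Byte count with the parentheses proposed in the diff: room for n - 1
 * elements. Assumes n >= 1, since size_t arithmetic wraps for n == 0. */
static size_t bytes_as_intended(size_t elem_size, size_t n)
{
    return elem_size * (n - 1);
}
```

For an 8-byte pointer and n = 4, the first form asks for 31 bytes and the second for 24. For any element size of at least one byte the original form requests at least as much as the corrected one, so it over-allocates rather than under-allocates, which fits the description of a minor bug.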

[OMPI devel] vader on SGI UV?

2014-01-23 Thread Paul Hargrove

Nathan,

Is the vader BTL known to work or not work on an SGI UV (w/ XPMEM support,
of course)?
I can easily attempt the build, but any test runs would enter a queue that
is about 1 week deep.
So, I am wondering if the attempt is worth pursuing.

Additionally, does one need an explicit "-mca btl self,vader" or "-mca btl
^sm"?

-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

Re: [OMPI devel] 1.7.4rc: mpirun hangs on ia64

2014-01-23 Thread Paul Hargrove

Some progress:

I fixed IA64.asm but still saw failures.
I realized I'd not checked the ia64/atomic.h file.

Lo and behold the origin of the bogus "sxt4" is a pair of improper casts,
removed by the following:
--- opal/include/opal/sys/ia64/atomic.h~ 2014-01-23 13:04:03.0 -0800
+++ opal/include/opal/sys/ia64/atomic.h 2014-01-23 13:04:42.0 -0800
@@ -119,7 +119,7 @@
 __asm__ __volatile__ ("cmpxchg8.acq %0=[%1],%2,ar.ccv":
   "=r"(ret) : "r"(addr), "r"(newval) : "memory");

-return ((int32_t)ret == oldval);
+return (ret == oldval);
 }


@@ -132,7 +132,7 @@
 __asm__ __volatile__ ("cmpxchg8.rel %0=[%1],%2,ar.ccv":
   "=r"(ret) : "r"(addr), "r"(newval) : "memory");

-return ((int32_t)ret == oldval);
+return (ret == oldval);
 }

 #endif /* OMPI_GCC_INLINE_ASSEMBLY */


I will retest ASAP and report back with, I hope, an attachment that fixes
both IA64.asm and ia64/atomic.h.

-Paul
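
The effect of the two casts removed in the diff above can be reproduced in plain C. This sketch models only the final comparison, not the atomic exchange itself, and the function names are hypothetical.

```c
#include <stdint.h>

/* Comparison as it stood before the patch: the 64-bit value read from
 * memory is narrowed to 32 bits before being compared. Converting an
 * out-of-range value to int32_t is implementation-defined in C; common
 * compilers keep the low 32 bits. */
static int result_with_cast(int64_t ret, int64_t oldval)
{
    return ((int32_t)ret == oldval);
}

/* Comparison after the patch: all 64 bits take part. */
static int result_without_cast(int64_t ret, int64_t oldval)
{
    return (ret == oldval);
}
```

With ret and oldval both equal to 0x100000000 (only bit 32 set), the version without the cast reports a match, while the version with the cast typically compares 0 against 0x100000000 and reports a mismatch even though the exchange took place.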


On Wed, Jan 22, 2014 at 4:24 PM, Paul Hargrove  wrote:

>
> On Wed, Jan 22, 2014 at 2:22 PM, Paul Hargrove  wrote:
>
>> My ia64 asm is a bit rusty, but I'll give a quick look if/when I can.
>
>
> I had a look (in v1.7) and this is what I see:
>
> $cat -n IA64.asm | grep -A14 opal_atomic_cmpset_acq_64:
> 70  opal_atomic_cmpset_acq_64:
> 71  .prologue
> 72  .body
> 73  mov ar.ccv=r33;;
> 74  cmpxchg8.acq r32=[r32],r34,ar.ccv
> 75  ;;
> 76  sxt4 r32 = r32
> 77  ;;
> 78  cmp.eq p6, p7 = r33, r32
> 79  ;;
> 80  (p6) addl r8 = 1, r0
> 81  (p7) mov r8 = r0
> 82  br.ret.sptk.many b0
> 83  ;;
> 84  .endp opal_atomic_cmpset_acq_64#
>
> The (approximate and non-atomic) C equivalent is:
>
> // r32 = address
> // r33 = oldvalue
> // r34 = newvalue
> int opal_atomic_cmpset_acq_64(int64_t r32, int64_t r33, int64_t r34) {
>int64_t ccv = r33; // L73
>int64_t *addr = (int64_t *)r32;
>r32 = *addr; if (r32 == ccv) *addr = r34; // L74 (r32 receives the value read)
>
>r32 = (int64_t)(int32_t)r32; // L76 = sign-extend 32->64
>
>bool p6, p7;
>p7 = !(p6 = (r33 == r32)); // L78
>
>const int r0 = 0;
>int r8;
>if (p6) r8 = 1 + r0; // L80
>if (p7) r8 = r0; // L81
>return r8; // L82
> }
>
> Which is fine except that line 76 is totally wrong!!
> The "sxt4" instruction is "sign-extend from 4 bytes to 8 bytes".
> Thus the upper 32-bits of the value read from memory are lost!
> Unless the upper 33 bits of r33 (oldvalue) are all 0s or all 1s, the
> comparison on line 78 MUST fail.
> This explains the hang, as the lifo push will loop indefinitely waiting
> for the success of this cmpset.
>
> Note the same erroneous instruction is also present in the _rel variant
> (at line 94).
> The trunk has the same issue.
> This code has not changed at all since IA64.asm was added way back in
> r4471.
>
> I won't have access to the IA64 platform again until tomorrow AM.
> So, testing my hypothesis will need to wait.
>
> BTW:
> IFF I am right about the source of this problem, then it would be
> beneficial to have (and I may contribute) a stronger test (for "make
> check") that would detect this sort of bug in the atomics (specifically
> look for both false-positive and false-negative return value from 64-bit
> cmpset operations with values satisfying a range of "corner cases").  I
> think I have single-bit and double-bit "marching tests" for cmpset in my
> own arsenal of tests for GASNet's atomics.  If I don't have time to
> contribute a complete test, I can at least contribute that logic for
> somebody else to port to the OPAL atomics.
>
> -Paul
>
> P.S.:
> The cmpxchgN for N in 1,2,4 are documented as ZERO-extending their loads
> to 64-bits.
> So, there is a slim chance that the sxt4 actually was intended for the
> 32-bit cmpset code.
> However, since the comparison used there is a "cmp4.eq" the "sxt4" would
> still not be needed.
>
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
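
A single-bit "marching test" of the kind mentioned in the quoted message might look roughly like the sketch below. It is not the GASNet test and not OPAL code: it uses C11 <stdatomic.h> as a stand-in for the cmpset under test, and the names cmpset_64 and march_single_bit are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for the 64-bit cmpset under test (hypothetical name).
 * Returns 1 if *addr equaled oldval and was replaced by newval, else 0. */
static int cmpset_64(_Atomic uint64_t *addr, uint64_t oldval, uint64_t newval)
{
    return atomic_compare_exchange_strong(addr, &oldval, newval) ? 1 : 0;
}

/* Marches one set bit through all 64 positions and counts wrong results. */
static int march_single_bit(void)
{
    int failures = 0;
    for (int bit = 0; bit < 64; bit++) {
        uint64_t pattern = UINT64_C(1) << bit;
        _Atomic uint64_t target;
        atomic_init(&target, pattern);

        /* Old value differs from memory in exactly this one bit: the
         * operation must fail and leave memory alone (else false positive). */
        if (cmpset_64(&target, 0, ~pattern)) failures++;
        if (atomic_load(&target) != pattern) failures++;

        /* Old value matches memory: the operation must succeed and store
         * the new value (else false negative). */
        if (!cmpset_64(&target, pattern, ~pattern)) failures++;
        if (atomic_load(&target) != ~pattern) failures++;
    }
    return failures;
}
```

A correct cmpset yields zero failures. A variant that narrows the value read to 32 bits before comparing would be expected to trip both checks once the marching bit reaches position 32 or above.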

Re: [OMPI devel] vader on SGI UV?

2014-01-23 Thread Paul Hargrove

I've answered this one for myself:
  NO: the vader BTL does not build on an SGI UV.
However, xpmem support isn't detected at configure time either.
So, there is no "problem" here.

It might be nice to clarify in README that vader is for Cray's variant of
XPMEM only.

++ Everything below this point is for the "future" milestone. ++

If one ever does want vader/xpmem support on the SGI UV, I see 2 issues to
overcome:

issue 1)
SGI puts the header in /usr/include/sn/xpmem.h
There is no --with-xpmem= value that will let configure find the header in
that location.
This is a "good thing" because if configure did find it, a build would fail
by default due to issue #2.
To support SGI's xpmem will require additional configure logic to look for
EITHER xpmem.h or sn/xpmem.h

To work past issue #1 I did:
  $ mkdir $HOME/xpmem
  $ ln -s /usr/include/sn $HOME/xpmem/include
and configured ompi using --with-xpmem=$HOME/xpmem
That allowed me to see the second issue...

issue 2)
There are some minor API differences in types in SGI's "flavor" of xpmem
which cause the build to fail.
In GASNet we support both variants and the following snippet shows how we
deal with the differences:
  #if defined(HAVE_XPMEM_H)
   /* Cray XPMEM */
   #include <xpmem.h>
   typedef struct xpmem_addr gasneti_xpmem_addr_t;
   typedef xpmem_segid_t gasneti_xpmem_segid_t;
   typedef xpmem_apid_t gasneti_xpmem_apid_t;
   #define gasneti_xpmem_apid apid
  #elif defined(HAVE_SN_XPMEM_H)
   /* SGI XPMEM */
   #include <sn/xpmem.h>
   typedef xpmem_addr_t gasneti_xpmem_addr_t;
   typedef int64_t gasneti_xpmem_segid_t;
   typedef int64_t gasneti_xpmem_apid_t;
   #define gasneti_xpmem_apid id
  #endif

The differences:
+ Cray's "struct xpmem_addr" vs SGI's "xpmem_addr_t"
+ SGI uses int64_t instead of defining xpmem_segid_t and xpmem_apid_t
+ SGI uses a struct member name of "id" vs Cray's "apid"

Note that the different header locations have worked to our benefit
here by providing the variant detection mechanism without the need for
configure probes for the types and members (though one could go that route
if sufficiently paranoid about a variant between the two).
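
How variant-independent code can use such typedefs and the member-name macro is sketched below. The two "headers" are mock stand-ins so that the example builds without XPMEM installed: only the member names ("apid" vs "id") and the use of int64_t follow the differences listed above. The field types, the shim_ names, and the MOCK_SGI_XPMEM switch are assumptions made for illustration.

```c
#include <stdint.h>

/* Mock stand-ins, NOT the real vendor headers. Real code would include
 * <xpmem.h> or <sn/xpmem.h> instead of defining these types itself. */
#if defined(MOCK_SGI_XPMEM)
typedef struct { int64_t id; int64_t offset; } xpmem_addr_t;
typedef xpmem_addr_t shim_addr_t;
typedef int64_t shim_apid_t;
#define shim_apid id
#else  /* mock of the Cray flavor */
typedef int64_t xpmem_apid_t;
struct xpmem_addr { xpmem_apid_t apid; int64_t offset; };
typedef struct xpmem_addr shim_addr_t;
typedef xpmem_apid_t shim_apid_t;
#define shim_apid apid
#endif

/* Variant-independent code: the macro expands to the right member name. */
static shim_addr_t make_addr(shim_apid_t handle, int64_t offset)
{
    shim_addr_t addr;
    addr.shim_apid = handle;
    addr.offset = offset;
    return addr;
}

/* Reads the handle back through the same macro. */
static shim_apid_t addr_handle(const shim_addr_t *addr)
{
    return addr->shim_apid;
}
```

Compiling with -DMOCK_SGI_XPMEM selects the other mock layout, and make_addr() and addr_handle() build unchanged in both cases.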

-Paul

On Thu, Jan 23, 2014 at 12:11 PM, Paul Hargrove  wrote:

> Nathan,
>
> Is the vader BTL known to work or not work on an SGI UV (w/ XPMEM support,
> of course)?
> I can easily attempt the build, but any test runs would enter a queue that
> is about 1 week deep.
> So, I am wondering if the attempt is worth pursuing.
>
> Additionally, does one need an explicit "-mca btl self,vader" or "-mca btl
> ^sm"?
>
> -Paul
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

Re: [OMPI devel] vader on SGI UV?

2014-01-23 Thread Nathan Hjelm

Well, it *should* work since the Cray and SGI variants are more or less
the same. I would have to take a look at their xpmem.h to see if
anything is different.

-Nathan

On Thu, Jan 23, 2014 at 01:38:53PM -0800, Paul Hargrove wrote:
>I've answered this one for myself:
>  NO: the vader BTL does not build on an SGI UV.
>However, xpmem support isn't detected at configure time either.
>So, there is no "problem" here.
>It might be nice to clarify in README that vader is for Cray's variant of
>XPMEM only.
>++ Everything below this point is for the "future" milestone. ++
>If one ever does want vader/xpmem support on the SGI UV, I see 2 issues to
>overcome:
>issue 1)
>SGI puts the header in /usr/include/sn/xpmem.h
>There is no --with-xpmem= value that will let configure find the header in
>that location.
>This is a "good thing" because if configure did find it, a build would
>fail by default due to issue #2.
>To support SGI's xpmem will require additional configure logic to look for
>EITHER xpmem.h or sn/xpmem.h
>To work past issue #1 I did:
>  $ mkdir $HOME/xpmem
>  $ ln -s /usr/include/sn $HOME/xpmem/include
>and configured ompi using --with-xpmem=$HOME/xpmem
>That allowed me to see the second issue...
>issue 2)
>There are some minor API differences in types in SGI's "flavor" of xpmem
>which cause the build to fail.
>In GASNet we support both variants and the following snippet shows how we
>deal with the differences:
>  #if defined(HAVE_XPMEM_H)
>   /* Cray XPMEM */
>   #include <xpmem.h>
>   typedef struct xpmem_addr gasneti_xpmem_addr_t;
>   typedef xpmem_segid_t gasneti_xpmem_segid_t;
>   typedef xpmem_apid_t gasneti_xpmem_apid_t;
>   #define gasneti_xpmem_apid apid
>  #elif defined(HAVE_SN_XPMEM_H)
>   /* SGI XPMEM */
>   #include <sn/xpmem.h>
>   typedef xpmem_addr_t gasneti_xpmem_addr_t;
>   typedef int64_t gasneti_xpmem_segid_t;
>   typedef int64_t gasneti_xpmem_apid_t;
>   #define gasneti_xpmem_apid id
>  #endif
>The differences:
>+ Cray's "struct xpmem_addr" vs SGI's "xpmem_addr_t"
>+ SGI uses int64_t instead of defining xpmem_segid_t and xpmem_apid_t
>+ SGI uses a struct member name of "id" vs Cray's "apid"
>Note that the different header locations have worked to our benefit
>here by providing the variant detection mechanism without the need for
>configure probes for the types and members (though one could go that route
>if sufficiently paranoid about a variant between the two).
>-Paul
> 
>On Thu, Jan 23, 2014 at 12:11 PM, Paul Hargrove 
>wrote:
> 
>  Nathan,
>  Is the vader BTL known to work or not work on an SGI UV (w/ XPMEM
>  support, of course)?
>  I can easily attempt the build, but any test runs would enter a queue
>  that is about 1 week deep.
>  So, I am wondering if the attempt is worth pursuing.
>  Additionally, does one need an explicit "-mca btl self,vader" or "-mca
>  btl ^sm"?
>  -Paul
>  --
>  Paul H. Hargrove  phhargr...@lbl.gov
>  Future Technologies Group
>  Computer and Data Sciences Department Tel: +1-510-495-2352
>  Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> 
>--
>Paul H. Hargrove  phhargr...@lbl.gov
>Future Technologies Group
>Computer and Data Sciences Department Tel: +1-510-495-2352
>Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


[OMPI devel] [PATCH] mpirun hangs on ia64

2014-01-23 Thread Paul Hargrove

On Thu, Jan 23, 2014 at 1:16 PM, Paul Hargrove  wrote:
[snip]

> I will retest ASAP and report with, I hope, an attachment to fix both
> IA64.asm and ia64/atomic.h
>
[snip]

Eureka!!

With the bogus cast removed in both places, I can now run ring_c on
linux/ia64.
The attached patch is against trunk, but applies cleanly to v1.7.
In fact, since the code has been incorrect for a long time, it applies
cleanly to v1.6 too.

FWIW:
The code was broken by r3448, which apparently fixed some warnings but also
added the incorrect narrowing casts to the 64-bit cmpset code.  So, IA64
*did* work prior to April 2010.

Given the timeline, this can't possibly be a regression in the 1.7 series.
Additionally, with Sylvestre Ledru having given up on ia64, nobody may care
at all.
So, CMR to 1.7.4 vs .5 seems like a potentially moot point.

-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

ia64-cmpset_64.patch
Description: Binary data

[OMPI devel] yet another fortran (documentation) issue

2014-01-23 Thread Paul Hargrove

The following is an issue found when testing with xlf-14.1, which is
already known to have problems with the F08 stuff.  So, I've configured
with "FC=xlf90 --enable-mpi-fortran=usempi".

The problem is that mpifort is now a wrapper around xlf90 and thus is
assuming F90 free-form input, independent of the file suffix.  This results
in errors building fixed-form (f77) codes:

$ mpifort hello_mpifh.f
"hello_mpifh.f", line 1.1: 1515-019 (S) Syntax is incorrect.
"hello_mpifh.f", line 2.3: 1515-019 (S) Syntax is incorrect.
"hello_mpifh.f", line 3.27: 1515-019 (S) Syntax is incorrect.
"hello_mpifh.f", line 4.27: 1515-019 (S) Syntax is incorrect.
"hello_mpifh.f", line 5.3: 1515-019 (S) Syntax is incorrect.
"hello_mpifh.f", line 6.3: 1515-019 (S) Syntax is incorrect.
"hello_mpifh.f", line 7.1: 1515-019 (S) Syntax is incorrect.
"hello_mpifh.f", line 8.3: 1515-019 (S) Syntax is incorrect.
"hello_mpifh.f", line 9.1: 1515-019 (S) Syntax is incorrect.
"hello_mpifh.f", 1515-002 (S) END card is missing.  One is assumed.
** _main   === End of Compilation 1 ===
"hello_mpifh.f", line 22.19: 1515-019 (S) Syntax is incorrect.
** main   === End of Compilation 2 ===
1501-511  Compilation failed for file hello_mpifh.f.

This can, fortunately, be worked around with "-qfixed":

$ mpifort -qfixed hello_mpifh.f
** main   === End of Compilation 1 ===
1501-510  Compilation successful for file hello_mpifh.f.

I *think* that I should have configured with FC=xlf (rather than xlf90),
since xlf determines language level and source form from the file suffix.  I
can (without re-configuring) confirm that xlf honors the extension by
setting OMPI_FC=xlf:

{hargrove@vestalac1 examples}$ OMPI_FC=xlf mpifort  hello_mpifh.f
** main   === End of Compilation 1 ===
1501-510  Compilation successful for file hello_mpifh.f.
{hargrove@vestalac1 examples}$ OMPI_FC=xlf mpifort  hello_usempi.f90
** main   === End of Compilation 1 ===
1501-510  Compilation successful for file hello_usempi.f90.


The term "fortran compiler" in Open MPI's README is not (as far as I could
see) clearly defined as "a fortran compiler which honors file suffixes to
determine language dialect".

My FC=xlf90 setting is "historical" from the testing scripts I've used
since the 1.5 and 1.6 series.

ADDITIONALLY, the Open MPI manpages STILL say FC is to be a fortran90
compiler and F77 is to be a fortran77 compiler.  It looks like others
might encounter the same problem I describe above just by reading the
documentation too closely.

I will follow up with specifics on what appears to be out-of-date in the
manpages.

-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

Re: [OMPI devel] [PATCH] mpirun hangs on ia64

2014-01-23 Thread Ralph Castain

I put it for 1.7.5 just for completeness - I agree that not many people will 
care, but we should reward your hard work!

Thanks
Ralph

On Jan 23, 2014, at 2:06 PM, Paul Hargrove  wrote:

> On Thu, Jan 23, 2014 at 1:16 PM, Paul Hargrove  wrote:
> [snip]
> I will retest ASAP and report with, I hope, an attachment to fix both 
> IA64.asm and ia64/atomic.h
> [snip]
> 
> Eureka!!
> 
> With the bogus cast removed in both places, I can now run ring_c on 
> linux/ia64.
> The attached patch is against trunk, but applies cleanly to v1.7.
> In fact, since the code has been incorrect for a long time, it applies cleanly 
> to v1.6 too.
> 
> FWIW:
> The code was broken by r3448, which apparently fixed some warnings but also 
> added the incorrect narrowing casts to the 64-bit cmpset code.  So, IA64 
> *did* work prior to April 2010.
> 
> Given the timeline, this can't possibly be a regression in the 1.7 series.
> Additionally, with Sylvestre Ledru having given up on ia64, nobody may care at 
> all.
> So, CMR to 1.7.4 vs .5 seems like a potentially moot point.
> 
> -Paul
> 
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

[OMPI devel] out-of-date or missing manpages

2014-01-23 Thread Paul Hargrove

The man pages in trunk for the compiler wrappers still make reference to
distinct FC and F77 (and FCFLAGS and FFLAGS), though configure no longer
honors F77 or FFLAGS:

$ grep -e F77 -e FFLAGS INST/share/man/man1/*
INST/share/man/man1/mpiCC.1:the user in the CC, CXX, F77, and/or FC
environment variables
INST/share/man/man1/mpiCC.1:F77
INST/share/man/man1/mpiCC.1:FFLAGS
INST/share/man/man1/mpic++.1:the user in the CC, CXX, F77, and/or FC
environment variables
INST/share/man/man1/mpic++.1:F77
INST/share/man/man1/mpic++.1:FFLAGS
INST/share/man/man1/mpicc.1:the user in the CC, CXX, F77, and/or FC
environment variables
INST/share/man/man1/mpicc.1:F77
INST/share/man/man1/mpicc.1:FFLAGS
INST/share/man/man1/mpicxx.1:the user in the CC, CXX, F77, and/or FC
environment variables
INST/share/man/man1/mpicxx.1:F77
INST/share/man/man1/mpicxx.1:FFLAGS
INST/share/man/man1/mpifort.1:the user in the CC, CXX, F77, and/or FC
environment variables
INST/share/man/man1/mpifort.1:F77
INST/share/man/man1/mpifort.1:FFLAGS
grep: INST/share/man/man1/orteCC.1: No such file or directory


Note also that the last line of output is due to orteCC.1, which is a
dangling symlink:
lrwxrwxrwx 1 hargrove PARTS 9 Jan 23 21:53 orteCC.1 -> ortec++.1

There is no manpage for ortec++ (nor ortecc).

However bin/ is populated correctly:
lrwxrwxrwx 1 hargrove PARTS 12 Jan 23 21:53 orteCC -> opal_wrapper
lrwxrwxrwx 1 hargrove PARTS 12 Jan 23 21:53 ortec++ -> opal_wrapper
lrwxrwxrwx 1 hargrove PARTS 12 Jan 23 21:53 ortecc -> opal_wrapper

-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

Re: [OMPI devel] [PATCH] mpirun hangs on ia64

2014-01-23 Thread Paul Hargrove

On Thu, Jan 23, 2014 at 4:14 PM, Ralph Castain  wrote:

> I put it for 1.7.5 just for completeness - I agree that not many people
> will care, but we should reward your hard work!
>

"The reward of a thing well done is having done it."
Ralph Waldo Emerson


-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

