[OMPI devel] 1.7.5 end-of-week status report

2014-03-14 Thread Ralph Castain
Hi folks

I have both good and bad news to report - first the good.

OSHMEM now passes nearly all its tests on my Linux cluster (tcp). My hat is off 
to the Mellanox guys for getting this done, including getting our MTT repo 
tests complete.

The MPI layer passes nearly all the IBM, Intel, and one-sided tests. Only a few 
failures.

Now the bad. The coll/ml component continues to have problems, including 
segfaults, and I have discovered that the bcol and coll/ml code remains 
entangled (I thought it had been separated, but sadly not). I have therefore 
ompi_ignored coll/ml and bcol/ptpcoll.

We also retain a hang in mpirun under some failure cases. I'm working on a 
solution to that one.

So here is my proposal for release. We allow OSHMEM to build by default as it 
met our conditions for doing so. We add an --enable-coll-ml configure flag and 
disable coll/ml and friends unless specifically requested. We then release 
1.7.5 following the Tues developer telecon.

Ralph



Re: [OMPI devel] usage of mca variables in orte-restart

2014-03-14 Thread Hjelm, Nathan T
The preferred way is to use mca_base_var_find and then call 
mca_base_var_[set|get]_value. For performance's sake, we only look at the 
environment when the variable is registered.

-Nathan

Please excuse the horrible Outlook top-posting. OWA sucks.


From: devel [devel-boun...@open-mpi.org] on behalf of Adrian Reber 
[adr...@lisas.de]
Sent: Friday, March 14, 2014 3:05 PM
To: de...@open-mpi.org
Subject: [OMPI devel] usage of mca variables in orte-restart

I am now trying to run orte-restart. As far as I understand it
orte-restart analyzes the checkpoint metadata and then tries to exec()
mpirun which then starts opal-restart. During the startup of
opal-restart (during initialize()) detection of the best CRS module is
disabled:

/*
 * Turn off the selection of the CRS component,
 * we need to do that later
 */
(void) mca_base_var_env_name("crs_base_do_not_select", &tmp_env_var);
opal_setenv(tmp_env_var,
            "1", /* turn off the selection */
            true, &environ);
free(tmp_env_var);
tmp_env_var = NULL;

This seems to work. Later when actually selecting the correct CRS module
to restart the checkpointed process the selection is enabled again:

/* Re-enable the selection of the CRS component, so we can choose the right one */
(void) mca_base_var_env_name("crs_base_do_not_select", &tmp_env_var);
opal_setenv(tmp_env_var,
            "0", /* turn on the selection */
            true, &environ);
free(tmp_env_var);
tmp_env_var = NULL;

This does not seem to have an effect. The one reason why it does not work
is pretty obvious. The mca variable crs_base_do_not_select is registered during
opal_crs_base_register() and written to the bool variable 
opal_crs_base_do_not_select
only once (during register). Later in opal_crs_base_select() this bool
variable is queried if select should run or not and as it is only changed
during register it never changes. So from the code flow it cannot work
and is probably the result of one of the rewrites since C/R was introduced.

To fix this I am trying to read the value of the MCA variable
opal_crs_base_do_not_select during opal_crs_base_select() like this:

 idx = mca_base_var_find("opal", "crs", "base", "do_not_select")
 mca_base_var_get_value(idx, , NULL, NULL);

This also seems to work, because the value I read changes if I change the
first opal_setenv() during initialize(). The problem I am seeing is that
the second opal_setenv() (back to 0) cannot be detected using
mca_base_var_get_value().

So my question is: what is the preferred way to read and write MCA
variables to access them in the different modules? Is the existing
code still correct? There is also mca_base_var_set_value(); should I
rather use this to set 'opal_crs_base_do_not_select'? I was, however, not
able to use mca_base_var_set_value() without a segfault. There are not
many uses of mca_base_var_set_value() in the existing code and none uses
a bool variable.

I also discovered I can just access the global C variable
'opal_crs_base_do_not_select' from opal-restart.c as well as from
opal_crs_base_select(). This also works. This would solve my problem
of setting and reading MCA variables.

Adrian
___
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: 
http://www.open-mpi.org/community/lists/devel/2014/03/14347.php


Re: [OMPI devel] usage of mca variables in orte-restart

2014-03-14 Thread Ralph Castain
I don't believe we support changing the value of an MCA param on the fly. 
You'd need to transfer it to an appropriate-level global that you can change as 
required.
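
A minimal sketch of the behaviour discussed in this thread, as a toy model only. It does not use the real MCA API: toy_register(), toy_select_runs(), toy_do_not_select and the TOY_MCA_... environment name are all hypothetical stand-ins. It assumes, as described above, that the environment is read once at registration and cached in a global, so a later setenv() is not noticed while a direct change to the global is.

```c
#define _POSIX_C_SOURCE 200112L  /* for setenv() */
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical environment variable name, used only in this sketch. */
#define TOY_ENV_NAME "TOY_MCA_crs_base_do_not_select"

/* Hypothetical stand-in for the global opal_crs_base_do_not_select. */
bool toy_do_not_select = false;

/* Models registration: the environment is consulted exactly once, here. */
void toy_register(void)
{
    const char *value = getenv(TOY_ENV_NAME);
    toy_do_not_select = (value != NULL && strcmp(value, "1") == 0);
}

/* Models the select step: only the cached global is examined. */
bool toy_select_runs(void)
{
    return !toy_do_not_select;
}

/* Replays the sequence from the thread against the toy model. */
void toy_demo(void)
{
    setenv(TOY_ENV_NAME, "1", 1);
    toy_register();               /* caches "1": selection is disabled */
    assert(!toy_select_runs());

    setenv(TOY_ENV_NAME, "0", 1); /* too late: nothing re-reads the env */
    assert(!toy_select_runs());

    toy_do_not_select = false;    /* change the global directly instead */
    assert(toy_select_runs());
}
```

In this model the second setenv() has no effect because nothing re-reads the environment after registration, which matches the symptom reported for opal-restart; flipping the global is the kind of workaround suggested above.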

On Mar 14, 2014, at 2:05 PM, Adrian Reber  wrote:

> I am now trying to run orte-restart. As far as I understand it
> orte-restart analyzes the checkpoint metadata and then tries to exec()
> mpirun which then starts opal-restart. During the startup of
> opal-restart (during initialize()) detection of the best CRS module is
> disabled:
> 
>/* 
> * Turn off the selection of the CRS component,
> * we need to do that later
> */
>(void) mca_base_var_env_name("crs_base_do_not_select", &tmp_env_var);
>opal_setenv(tmp_env_var,
>            "1", /* turn off the selection */
>            true, &environ);
>free(tmp_env_var);
>tmp_env_var = NULL;
> 
> This seems to work. Later when actually selecting the correct CRS module
> to restart the checkpointed process the selection is enabled again:
> 
>/* Re-enable the selection of the CRS component, so we can choose the right one */
>(void) mca_base_var_env_name("crs_base_do_not_select", &tmp_env_var);
>opal_setenv(tmp_env_var,
>            "0", /* turn on the selection */
>            true, &environ);
>free(tmp_env_var);
>tmp_env_var = NULL;
> 
> This does not seem to have an effect. The one reason why it does not work
> is pretty obvious. The mca variable crs_base_do_not_select is registered 
> during
> opal_crs_base_register() and written to the bool variable 
> opal_crs_base_do_not_select
> only once (during register). Later in opal_crs_base_select() this bool
> variable is queried if select should run or not and as it is only changed
> during register it never changes. So from the code flow it cannot work
> and is probably the result of one of the rewrites since C/R was introduced.
> 
> To fix this I am trying to read the value of the MCA variable
> opal_crs_base_do_not_select during opal_crs_base_select() like this:
> 
> idx = mca_base_var_find("opal", "crs", "base", "do_not_select")
> mca_base_var_get_value(idx, , NULL, NULL);
> 
> This also seems to work, because the value I read changes if I change the
> first opal_setenv() during initialize(). The problem I am seeing is that
> the second opal_setenv() (back to 0) cannot be detected using
> mca_base_var_get_value().
> 
> So my question is: what is the preferred way to read and write MCA
> variables to access them in the different modules? Is the existing
> code still correct? There is also mca_base_var_set_value(); should I
> rather use this to set 'opal_crs_base_do_not_select'? I was, however, not
> able to use mca_base_var_set_value() without a segfault. There are not
> many uses of mca_base_var_set_value() in the existing code and none uses
> a bool variable.
> 
> I also discovered I can just access the global C variable
> 'opal_crs_base_do_not_select' from opal-restart.c as well as from
> opal_crs_base_select(). This also works. This would solve my problem
> of setting and reading MCA variables.
> 
>   Adrian



[OMPI devel] usage of mca variables in orte-restart

2014-03-14 Thread Adrian Reber
I am now trying to run orte-restart. As far as I understand it
orte-restart analyzes the checkpoint metadata and then tries to exec()
mpirun which then starts opal-restart. During the startup of
opal-restart (during initialize()) detection of the best CRS module is
disabled:

/* 
 * Turn off the selection of the CRS component,
 * we need to do that later
 */
(void) mca_base_var_env_name("crs_base_do_not_select", &tmp_env_var);
opal_setenv(tmp_env_var,
            "1", /* turn off the selection */
            true, &environ);
free(tmp_env_var);
tmp_env_var = NULL;

This seems to work. Later when actually selecting the correct CRS module
to restart the checkpointed process the selection is enabled again:

/* Re-enable the selection of the CRS component, so we can choose the right one */
(void) mca_base_var_env_name("crs_base_do_not_select", &tmp_env_var);
opal_setenv(tmp_env_var,
            "0", /* turn on the selection */
            true, &environ);
free(tmp_env_var);
tmp_env_var = NULL;

This does not seem to have an effect. The one reason why it does not work
is pretty obvious. The mca variable crs_base_do_not_select is registered during
opal_crs_base_register() and written to the bool variable 
opal_crs_base_do_not_select
only once (during register). Later in opal_crs_base_select() this bool
variable is queried if select should run or not and as it is only changed
during register it never changes. So from the code flow it cannot work
and is probably the result of one of the rewrites since C/R was introduced.

To fix this I am trying to read the value of the MCA variable
opal_crs_base_do_not_select during opal_crs_base_select() like this:

 idx = mca_base_var_find("opal", "crs", "base", "do_not_select")
 mca_base_var_get_value(idx, , NULL, NULL);

This also seems to work, because the value I read changes if I change the
first opal_setenv() during initialize(). The problem I am seeing is that
the second opal_setenv() (back to 0) cannot be detected using
mca_base_var_get_value().

So my question is: what is the preferred way to read and write MCA
variables to access them in the different modules? Is the existing
code still correct? There is also mca_base_var_set_value(); should I
rather use this to set 'opal_crs_base_do_not_select'? I was, however, not
able to use mca_base_var_set_value() without a segfault. There are not
many uses of mca_base_var_set_value() in the existing code and none uses
a bool variable.

I also discovered I can just access the global C variable
'opal_crs_base_do_not_select' from opal-restart.c as well as from
opal_crs_base_select(). This also works. This would solve my problem
of setting and reading MCA variables.

Adrian


Re: [OMPI devel] Loading Open MPI from MPJ Express (Java) fails

2014-03-14 Thread Bibrak Qamar
And I managed to run Open MPI with MPJ Express. I added the following code
and it worked like a charm.

*In Java*
  /*
   * Static Block for loading the libnativempjdev.so
   */
  static {
    System.loadLibrary("nativempjdev");

    if (!loadGlobalLibraries()) {
      System.out.println("MPJ Express failed to load required libraries");
      System.exit(1);
    }
  }

*In C*

JNIEXPORT jboolean JNICALL Java_mpjdev_natmpjdev_Comm_loadGlobalLibraries
  (JNIEnv *env, jclass thisObject) {
  // This will make sure the library is loaded
  // in the case of Open MPI
  if (NULL == (mpilibhandle = dlopen("libmpi.so",
                                     RTLD_NOW | RTLD_GLOBAL))) {
    return JNI_FALSE;
  }
  return JNI_TRUE;
}

It works for Open MPI, but for MPICH3 I have to comment out the dlopen. Is
there any way to tell the compiler to use the dlopen if it is building with
Open MPI (mpicc) and otherwise leave it out? Or something else?
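
One possible approach to the question above, sketched under assumptions: Open MPI's mpi.h defines the preprocessor macro OPEN_MPI, so the dlopen can be guarded with a compile-time check. The function name load_global_libraries() is a hypothetical stand-in for the JNI function, and the mpi.h include is left out on purpose so the sketch compiles without an MPI installation (in which case the dlopen branch is simply skipped).

```c
#include <stddef.h>

/*
 * In real code, #include <mpi.h> here so that OPEN_MPI is visible when
 * building against Open MPI. It is omitted so this sketch compiles alone.
 */

#if defined(OPEN_MPI)
#include <dlfcn.h>
static void *mpilibhandle = NULL;
#endif

/*
 * Hypothetical stand-in for the JNI loadGlobalLibraries function.
 * Returns 1 on success and 0 on failure.
 */
int load_global_libraries(void)
{
#if defined(OPEN_MPI)
    /* Only compiled when the Open MPI headers were included. */
    mpilibhandle = dlopen("libmpi.so", RTLD_NOW | RTLD_GLOBAL);
    if (NULL == mpilibhandle) {
        return 0;
    }
#endif
    /* Other MPI implementations: nothing extra to load in this sketch. */
    return 1;
}
```

With this guard the same source should build unchanged against MPICH3, where OPEN_MPI is not expected to be defined and the function just reports success.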

*On Java bindings to have some insight into the internals of the MPI
implementation*

Yes, there are some places where we need to be in sync with the internals of
the native MPI implementation. These are described in section 8.1.2 of MPI 2.1 (
http://www.mpi-forum.org/docs/mpi-2.1/mpi21-report.pdf). One example is
MPI_TAG_UB. For the pure Java devices of MPJ Express we have always used
Integer.MAX_VALUE.

*Datatypes?*

MPJ Express uses an internal buffering layer to buffer the user data into a
ByteBuffer. In this way, for the native device we end up using the
MPI_BYTE datatype most of the time. ByteBuffer simplifies matters since it
is directly accessible from the native code.

With our current implementation there is one exception: in Reduce,
Allreduce and Reduce_scatter the native MPI implementation needs to know
which Java datatype it is going to process. The same goes for MPI.Op.

*On Are your bindings similar in style/signature to ours?*

I checked it and there are differences. MPJ Express (and FastMPJ also)
implements the mpiJava 1.2 specifications. There is also MPJ API (this is
very close to mpiJava 1.2 API).

*Example 1: Getting the rank and size of COMM_WORLD*

*MPJ Express (the mpiJava 1.2 API):*
 public int Size() throws MPIException;
 public int Rank() throws MPIException;

*MPJ API:*
 public int size() throws MPJException;
 public int rank() throws MPJException;

*Open MPI's Java bindings:*
 public final int getRank() throws MPIException;
 public final int getSize() throws MPIException;

*Example 2: Point-to-Point communication*

*MPJ Express (the mpiJava 1.2 API):*
 public void Send(Object buf, int offset, int count, Datatype datatype, int
dest, int tag) throws MPIException

 public Status Recv(Object buf, int offset, int count, Datatype datatype,
  int source, int tag) throws MPIException

*MPJ API:*
 public void send(Object buf, int offset, int count, Datatype datatype, int
dest, int tag) throws MPJException;

 public Status recv(Object buf, int offset, int count, Datatype datatype,
int source, int tag) throws MPJException

*Open MPI's Java bindings:*
 public final void send(Object buf, int count, Datatype type, int dest, int
tag) throws MPIException

 public final Status recv(Object buf, int count, Datatype type, int source,
int tag) throws MPIException

*Example 3: Collective communication*

*MPJ Express (the mpiJava 1.2 API):*
 public void Bcast(Object buf, int offset, int count, Datatype type, int
root)
  throws MPIException;

*MPJ API:*
 public void bcast(Object buffer, int offset, int count, Datatype datatype,
int root) throws MPJException;


*Open MPI's Java bindings:*
 public final void bcast(Object buf, int count, Datatype type, int root)
throws MPIException;


I couldn't find which API Open MPI's Java bindings implement. But while
reading your README.JAVA.txt and your code I realised that you are trying
to avoid buffering overhead by giving the user the flexibility to directly
allocate data onto a ByteBuffer using MPI.newBuffer, hence not
following the mpiJava 1.2 specs (for communication operations)?


*On Performance Comparison*

Yes, this is interesting. I have managed to do two kinds of tests: Ping-Pong
(Latency and Bandwidth) and Collective Communications (Bcast).

Attached are graphs and the programs (testcases) that I used. The tests
were done using Infiniband, more on the platform here
http://www.nust.edu.pk/INSTITUTIONS/Centers/RCMS/AboutUs/facilities/screc/Pages/Resources.aspx

One reason for the low performance of Open MPI's Java bindings (in the
Bandwidth.png graph) is the way the test case was written
(Bandwidth_OpenMPi.java). It allocates a byte array of 16M in total and
uses the same array in send/recv for each data point (by varying count).

This could be mainly because of the following code in mpi_Comm.c (let me
know if I am mistaken):

static void* getArrayPtr(void** bufBase, JNIEnv *env,
 jobject buf, int baseType, int offset)
{
switch(baseType)
{
   ...
   ...
  case 1: {

Re: [OMPI devel] orte-restart and PATH

2014-03-14 Thread Josh Hursey
It looks like I did not add the prefix path to the binary name before
fork/exec in orte-restart.

There is a string variable that you can use to get the appropriate prefix:
  opal_install_dirs.prefix
from
  opal/mca/installdirs/installdirs.h

It's the same one that Ralph mentioned that orterun uses.

If you add that on there then you should be ok. You might want to check that
the app-files produced use the prefix as well when referencing opal-restart.
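
A rough sketch of the idea, not the actual orte-restart code: build_prefixed_path() is a hypothetical helper, its prefix argument stands in for opal_install_dirs.prefix, and the "bin" subdirectory is an assumption about the install layout.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: returns a malloc'ed "<prefix>/bin/<binary>" string,
 * or NULL on error. The caller is responsible for freeing the result.
 * The "bin" subdirectory is an assumption about the install layout.
 */
char *build_prefixed_path(const char *prefix, const char *binary)
{
    if (NULL == prefix || NULL == binary) {
        return NULL;
    }

    size_t len = strlen(prefix) + strlen("/bin/") + strlen(binary) + 1;
    char *path = malloc(len);
    if (NULL == path) {
        return NULL;
    }

    snprintf(path, len, "%s/bin/%s", prefix, binary);
    return path;
}
```

For example, a prefix of "/opt/openmpi" and a binary of "mpirun" would give "/opt/openmpi/bin/mpirun", which could then be handed to fork/exec instead of the bare binary name.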



On Wed, Mar 12, 2014 at 11:34 AM, Ralph Castain  wrote:

> That's what the --enable-orterun-prefix-by-default configure option is for
>
>
> On Mar 12, 2014, at 9:28 AM, Adrian Reber  wrote:
>
> > I am using orte-restart without setting my PATH to my Open MPI
> > installation. I am running /full/path/to/orte-restart and orte-restart
> > tries to run mpirun to restart the process. This fails on my system
> > because I do not have any mpirun in my PATH. Is it expected for an Open
> > MPI installation to set up the PATH variable or should it work using the
> > absolute path to the binaries?
> >
> > Should I just set my PATH correctly and be done with it or should
> > orte-restart figure out the full path to its accompanying mpirun and
> > start mpirun with the full path?
> >
> >   Adrian
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/03/14339.php
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/03/14340.php
>



-- 
Joshua Hursey
Assistant Professor of Computer Science
University of Wisconsin-La Crosse
http://cs.uwlax.edu/~jjhursey