[OMPI devel] 1.7.5 end-of-week status report
Hi folks

I have both good and bad news to report - first the good. OSHMEM now passes nearly all of its tests on my Linux cluster (tcp). My hat is off to the Mellanox guys for getting this done, including getting our MTT repo tests complete. The MPI layer passes nearly all of the IBM, Intel, and one-sided tests - only a few failures.

Now the bad. The coll/ml component continues to have problems, including segfaults, and I have discovered that the bcol and coll/ml code remains entangled (I thought it had been separated, but sadly not). I have therefore ompi_ignored coll/ml and bcol/ptpcoll. We also retain a hang in mpirun under some failure cases; I'm working on a solution to that one.

So here is my proposal for release. We allow OSHMEM to build by default, as it met our conditions for doing so. We add an --enable-coll-ml configure flag and disable coll/ml and friends unless specifically requested. We then release 1.7.5 following the Tues developer telecon.

Ralph
Re: [OMPI devel] usage of mca variables in orte-restart
The preferred way is to use mca_base_var_find() and then call mca_base_var_[set|get]_value(). For performance's sake we only look at the environment when the variable is registered.

-Nathan

Please excuse the horrible Outlook top-posting. OWA sucks.

From: devel [devel-boun...@open-mpi.org] on behalf of Adrian Reber [adr...@lisas.de]
Sent: Friday, March 14, 2014 3:05 PM
To: de...@open-mpi.org
Subject: [OMPI devel] usage of mca variables in orte-restart

> [...]

___
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: http://www.open-mpi.org/community/lists/devel/2014/03/14347.php
Re: [OMPI devel] usage of mca variables in orte-restart
I don't believe we support changing the value of an MCA param on-the-fly - you'd need to transfer it to an appropriate-level global that you can change as required.

On Mar 14, 2014, at 2:05 PM, Adrian Reber wrote:

> [...]
[OMPI devel] usage of mca variables in orte-restart
I am now trying to run orte-restart. As far as I understand it, orte-restart analyzes the checkpoint metadata and then tries to exec() mpirun, which then starts opal-restart. During the startup of opal-restart (during initialize()), detection of the best CRS module is disabled:

    /*
     * Turn off the selection of the CRS component,
     * we need to do that later
     */
    (void) mca_base_var_env_name("crs_base_do_not_select", &tmp_env_var);
    opal_setenv(tmp_env_var,
                "1", /* turn off the selection */
                true, &environ);
    free(tmp_env_var);
    tmp_env_var = NULL;

This seems to work. Later, when actually selecting the correct CRS module to restart the checkpointed process, the selection is enabled again:

    /* Re-enable the selection of the CRS component, so we can choose the
       right one */
    (void) mca_base_var_env_name("crs_base_do_not_select", &tmp_env_var);
    opal_setenv(tmp_env_var,
                "0", /* turn on the selection */
                true, &environ);
    free(tmp_env_var);
    tmp_env_var = NULL;

This does not seem to have an effect. The one reason why it does not work is pretty obvious. The MCA variable crs_base_do_not_select is registered during opal_crs_base_register() and written to the bool variable opal_crs_base_do_not_select only once (during register). Later, in opal_crs_base_select(), this bool variable is queried to decide whether select should run, and as it is only changed during register it never changes. So from the code flow it cannot work; this is probably the result of one of the rewrites since C/R was introduced.

To fix this I am trying to read the value of the MCA variable opal_crs_base_do_not_select during opal_crs_base_select() like this:

    idx = mca_base_var_find("opal", "crs", "base", "do_not_select");
    mca_base_var_get_value(idx, &value, NULL, NULL);

This also seems to work, because the behavior changes if I change the first opal_setenv() during initialize(). The problem I am seeing is that the second opal_setenv() (back to 0) cannot be detected using mca_base_var_get_value().

So my question is: what is the preferred way to read and write MCA variables so they can be accessed from the different modules? Is the existing code still correct? There is also mca_base_var_set_value() - should I rather use this to set 'opal_crs_base_do_not_select'? I was, however, not able to use mca_base_var_set_value() without a segfault. There are not many uses of mca_base_var_set_value() in the existing code, and none of them uses a bool variable.

I also discovered I can just access the global C variable 'opal_crs_base_do_not_select' from opal-restart.c as well as from opal_crs_base_select(). This also works, and would solve my problem setting and reading MCA variables.

Adrian
Re: [OMPI devel] Loading Open MPI from MPJ Express (Java) fails
And I managed to run Open MPI with MPJ Express. I added the following code and it worked like a charm.

*In Java*

    /*
     * Static block for loading libnativempjdev.so
     */
    static {
        System.loadLibrary("nativempjdev");
        if (!loadGlobalLibraries()) {
            System.out.println("MPJ Express failed to load required libraries");
            System.exit(1);
        }
    }

*In C*

    JNIEXPORT jboolean JNICALL Java_mpjdev_natmpjdev_Comm_loadGlobalLibraries
                                          (JNIEnv *env, jclass thisObject) {
        /* This will make sure the library is loaded
           in the case of Open MPI */
        if (NULL == (mpilibhandle = dlopen("libmpi.so",
                                           RTLD_NOW | RTLD_GLOBAL))) {
            return JNI_FALSE;
        }
        return JNI_TRUE;
    }

It works for Open MPI, but for MPICH3 I have to comment out the dlopen. Is there any way to tell the compiler, if it's using Open MPI (mpicc), to use dlopen, and otherwise keep it commented? Or something else?

*On Java bindings needing insight into the internals of the MPI implementation*

Yes, there are some places where we need to be in sync with the internals of the native MPI implementation. These are in section 8.1.2 of MPI 2.1 (http://www.mpi-forum.org/docs/mpi-2.1/mpi21-report.pdf) - for example MPI_TAG_UB. For the pure Java devices of MPJ Express we have always used Integer.MAX_VALUE.

*Datatypes?*

MPJ Express uses an internal buffering layer to buffer the user data into a ByteBuffer. In this way, for the native device we end up using the MPI_BYTE datatype most of the time. ByteBuffer simplifies matters since it is directly accessible from the native code. With our current implementation there is one exception to this: in Reduce, Allreduce, and Reduce_scatter, the native MPI implementation needs to know which Java datatype it is going to process. The same goes for MPI.Op.

*On "Are your bindings similar in style/signature to ours?"*

I checked, and there are differences. MPJ Express (and FastMPJ also) implements the mpiJava 1.2 specification. There is also the MPJ API (which is very close to the mpiJava 1.2 API).
*Example 1: Getting the rank and size of COMM_WORLD*

*MPJ Express (the mpiJava 1.2 API):*

    public int Size() throws MPIException;
    public int Rank() throws MPIException;

*MPJ API:*

    public int size() throws MPJException;
    public int rank() throws MPJException;

*Open MPI's Java bindings:*

    public final int getRank() throws MPIException;
    public final int getSize() throws MPIException;

*Example 2: Point-to-point communication*

*MPJ Express (the mpiJava 1.2 API):*

    public void Send(Object buf, int offset, int count, Datatype datatype,
                     int dest, int tag) throws MPIException
    public Status Recv(Object buf, int offset, int count, Datatype datatype,
                       int source, int tag) throws MPIException

*MPJ API:*

    public void send(Object buf, int offset, int count, Datatype datatype,
                     int dest, int tag) throws MPJException;
    public Status recv(Object buf, int offset, int count, Datatype datatype,
                       int source, int tag) throws MPJException

*Open MPI's Java bindings:*

    public final void send(Object buf, int count, Datatype type,
                           int dest, int tag) throws MPIException
    public final Status recv(Object buf, int count, Datatype type,
                             int source, int tag) throws MPIException

*Example 3: Collective communication*

*MPJ Express (the mpiJava 1.2 API):*

    public void Bcast(Object buf, int offset, int count, Datatype type,
                      int root) throws MPIException;

*MPJ API:*

    public void bcast(Object buffer, int offset, int count, Datatype datatype,
                      int root) throws MPJException;

*Open MPI's Java bindings:*

    public final void bcast(Object buf, int count, Datatype type,
                            int root) throws MPIException;

I couldn't find which API Open MPI's Java bindings implement. But while reading your README.JAVA.txt and your code I realised that you are trying to avoid buffering overhead by giving the user the flexibility to directly allocate data in a ByteBuffer using MPI.newBuffer, hence not following the mpiJava 1.2 specs (for communication operations)?
*On the performance comparison*

Yes, this is interesting. I have managed to do two kinds of tests: ping-pong (latency and bandwidth) and collective communication (Bcast). Attached are the graphs and the programs (test cases) that I used. The tests were done using InfiniBand; more on the platform here:
http://www.nust.edu.pk/INSTITUTIONS/Centers/RCMS/AboutUs/facilities/screc/Pages/Resources.aspx

One reason for the low performance of Open MPI's Java bindings (in the Bandwidth.png graph) is the way the test case was written (Bandwidth_OpenMPi.java). It allocates a total of 16M of byte array and uses the same array in send/recv for each data point (by varying count). This could be mainly because of the following code in mpi_Comm.c (let me know if I am mistaken):

    static void* getArrayPtr(void** bufBase, JNIEnv *env,
                             jobject buf, int baseType, int offset)
    {
        switch (baseType) {
        ...
        ...
        case 1: {
Re: [OMPI devel] orte-restart and PATH
It looks like I did not add the prefix path to the binary name before the fork/exec in orte-restart. There is a string variable that you can use to get the appropriate prefix: opal_install_dirs.prefix from opal/mca/installdirs/installdirs.h. It's the same one that Ralph mentioned orterun uses. If you add that on there, then you should be ok. You might also want to check that the app files produced use the prefix as well when referencing opal-restart.

On Wed, Mar 12, 2014 at 11:34 AM, Ralph Castain wrote:
> That's what the --enable-orterun-prefix-by-default configure option is for
>
> On Mar 12, 2014, at 9:28 AM, Adrian Reber wrote:
>
> > I am using orte-restart without setting my PATH to my Open MPI
> > installation. I am running /full/path/to/orte-restart and orte-restart
> > tries to run mpirun to restart the process. This fails on my system
> > because I do not have any mpirun in my PATH. Is it expected for an Open
> > MPI installation to set up the PATH variable or should it work using the
> > absolute path to the binaries?
> >
> > Should I just set my PATH correctly and be done with it or should
> > orte-restart figure out the full path to its accompanying mpirun and
> > start mpirun with the full path?
> >
> > Adrian
> > Link to this post: http://www.open-mpi.org/community/lists/devel/2014/03/14339.php
>
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/03/14340.php

--
Joshua Hursey
Assistant Professor of Computer Science
University of Wisconsin-La Crosse
http://cs.uwlax.edu/~jjhursey