[OMPI devel] alps ras patch for SLURM

2010-07-09 Thread Jerome Soumagne

Hi,

We've recently installed OpenMPI on one of our Cray XT5 machines, here 
at CSCS. This machine uses SLURM for launching jobs.

Doing an salloc defines this environment variable:
  BASIL_RESERVATION_ID
  The reservation ID on Cray systems running ALPS/BASIL only.
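
For example, after the allocation one can quickly check it (just a rough
illustration, assuming the variable is exported into the shell that salloc
spawns on our system):

  salloc -N 2
  echo $BASIL_RESERVATION_ID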

Since the alps ras module tries to find a variable called 
OMPI_ALPS_RESID which is set using a script, we thought that for SLURM 
systems it would be a good idea to directly integrate this 
BASIL_RESERVATION_ID variable in the code, rather than using a script. 
The small patch is attached.


Regards,

Jerome

--
Jérôme Soumagne
Scientific Computing Research Group
CSCS, Swiss National Supercomputing Centre
Galleria 2, Via Cantonale  | Tel: +41 (0)91 610 8258
CH-6928 Manno, Switzerland | Fax: +41 (0)91 610 8282


Index: orte/mca/ras/alps/ras_alps_component.c
===
--- orte/mca/ras/alps/ras_alps_component.c  (revision 23365)
+++ orte/mca/ras/alps/ras_alps_component.c  (working copy)
@@ -93,7 +93,7 @@

 /* Are we running under a ALPS job? */

-if (NULL != getenv("OMPI_ALPS_RESID")) {
+if ((NULL != getenv("OMPI_ALPS_RESID")) || (NULL != getenv("BASIL_RESERVATION_ID"))) {
 mca_base_param_lookup_int(param_priority, priority);
 opal_output_verbose(1, orte_ras_base.ras_output,
  "ras:alps: available for selection");
Index: orte/mca/ras/alps/ras_alps_module.c
===
--- orte/mca/ras/alps/ras_alps_module.c (revision 23365)
+++ orte/mca/ras/alps/ras_alps_module.c (working copy)
@@ -70,6 +70,10 @@
 char*alps_config_str;

 alps_batch_id = getenv("OMPI_ALPS_RESID");
+/* check if the system uses SLURM, in this case, OMPI_ALPS_RESID
+ * is not necessary and BASIL_RESERVATION_ID can be directly used instead
+ */
+if (NULL == alps_batch_id) alps_batch_id = getenv("BASIL_RESERVATION_ID");
 if (NULL == alps_batch_id) {
 orte_show_help("help-ras-alps.txt", "alps-env-var-not-found", 1,
"OMPI_ALPS_RESID");


[OMPI devel] some questions regarding the portals modules

2010-07-09 Thread Jerome Soumagne

Hi,

As I said in the previous e-mail, we've recently installed OpenMPI on a 
Cray XT5 machine, and we therefore use the portals and the alps 
libraries. Thanks for providing the configuration script from Jaguar; it 
was very helpful and only had to be slightly adapted in order to use the 
latest CNL version installed on this machine.


I have some questions though regarding the use of the portals btl and 
mtl components. I noticed that when I compiled OpenMPI with mpi-thread 
support enabled and ran a job, the portals components did not want to 
initialize due to these funny lines:


./mtl_portals_component.c
182 /* we don't run with no stinkin' threads */
183 if (enable_progress_threads || enable_mpi_threads) return NULL;

I'd like to know why MPI threads are disabled, since threads are 
supported on the XT5. Does the btl/mtl require thread safety to be 
implemented, or is it because of the portals library itself?


I would also like to use the MPI_Comm_accept/connect functions. It seems 
that this is not possible using the portals mtl, even though spawn seems 
to be supported. Did I do something wrong, or is it really not supported?
If it is not supported, is it possible to extend this module to support 
these functions? We could help in doing that.


I'd also like to know whether there are any plans to create a module 
that uses the DMAPP interface for the Gemini interconnect.


Thanks.

Jerome

--
Jérôme Soumagne
Scientific Computing Research Group
CSCS, Swiss National Supercomputing Centre
Galleria 2, Via Cantonale  | Tel: +41 (0)91 610 8258
CH-6928 Manno, Switzerland | Fax: +41 (0)91 610 8282




Re: [OMPI devel] alps ras patch for SLURM

2010-07-09 Thread Ralph Castain
Forgive my confusion, but could you please clarify something? You are using 
ALPS as the resource manager doing the allocation, and then using SLURM as the 
launcher (instead of ALPS)?

That's a combination we've never seen or heard about. I suspect our module 
selection logic would be confused by such a combination - are you using mca 
params to direct selection?


On Jul 9, 2010, at 4:19 AM, Jerome Soumagne wrote:

> Hi,
> 
> We've recently installed OpenMPI on one of our Cray XT5 machines, here at 
> CSCS. This machine uses SLURM for launching jobs.
> Doing an salloc defines this environment variable:
>   BASIL_RESERVATION_ID
>   The reservation ID on Cray systems running ALPS/BASIL only.
> 
> Since the alps ras module tries to find a variable called OMPI_ALPS_RESID 
> which is set using a script, we thought that for SLURM systems it would be a 
> good idea to directly integrate this BASIL_RESERVATION_ID variable in the 
> code, rather than using a script. The small patch is attached.
> 
> Regards,
> 
> Jerome
> -- 
> Jérôme Soumagne
> Scientific Computing Research Group
> CSCS, Swiss National Supercomputing Centre 
> Galleria 2, Via Cantonale  | Tel: +41 (0)91 610 8258
> CH-6928 Manno, Switzerland | Fax: +41 (0)91 610 8282
> 



Re: [OMPI devel] alps ras patch for SLURM

2010-07-09 Thread Jerome Soumagne
Well, we actually use a patched version of SLURM, 2.2.0-pre8. We plan to 
submit the modifications made internally at CSCS for the next SLURM 
release in November. We implement ALPS support on top of the basic 
architecture of SLURM.
SLURM is only used to do the ALPS resource allocation. We then use mpirun 
built on the portals and the alps libraries.
We don't use mca parameters to direct selection, and the alps RAS is 
automatically selected.


On 07/09/2010 01:59 PM, Ralph Castain wrote:
Forgive my confusion, but could you please clarify something? You are 
using ALPS as the resource manager doing the allocation, and then 
using SLURM as the launcher (instead of ALPS)?


That's a combination we've never seen or heard about. I suspect our 
module selection logic would be confused by such a combination - are 
you using mca params to direct selection?



On Jul 9, 2010, at 4:19 AM, Jerome Soumagne wrote:


Hi,

We've recently installed OpenMPI on one of our Cray XT5 machines, 
here at CSCS. This machine uses SLURM for launching jobs.

Doing an salloc defines this environment variable:
  BASIL_RESERVATION_ID
  The reservation ID on Cray systems running ALPS/BASIL only.

Since the alps ras module tries to find a variable called 
OMPI_ALPS_RESID which is set using a script, we thought that for 
SLURM systems it would be a good idea to directly integrate this 
BASIL_RESERVATION_ID variable in the code, rather than using a 
script. The small patch is attached.


Regards,

Jerome
--
Jérôme Soumagne
Scientific Computing Research Group
CSCS, Swiss National Supercomputing Centre
Galleria 2, Via Cantonale  | Tel: +41 (0)91 610 8258
CH-6928 Manno, Switzerland | Fax: +41 (0)91 610 8282

 



--
Jérôme Soumagne
Scientific Computing Research Group
CSCS, Swiss National Supercomputing Centre
Galleria 2, Via Cantonale  | Tel: +41 (0)91 610 8258
CH-6928 Manno, Switzerland | Fax: +41 (0)91 610 8282




Re: [OMPI devel] alps ras patch for SLURM

2010-07-09 Thread Ralph Castain
Afraid I'm now even more confused. You use SLURM to do the allocation, and then 
use ALPS to launch the job?

I'm just trying to understand because I'm the person who generally maintains 
this code area. We have two frameworks involved here:

1. RAS - determines what nodes were allocated to us. There are both slurm and 
alps modules here.

2. PLM - actually launches the job. There are both slurm and alps modules here.

Up until now, we have always seen people running with either alps or slurm, but 
never both together, so the module selection of these two frameworks is 
identical - if you select slurm for the RAS module, you will definitely get 
slurm for the launcher. Ditto for alps. Are you sure that mpirun is actually 
using the modules you think? Have you run this with -mca ras_base_verbose 10 
-mca plm_base_verbose 10 and seen what modules are being used?
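
For example, something along these lines (just a sketch; "./hello" stands in 
for whatever binary you normally launch):

  mpirun --mca ras_base_verbose 10 --mca plm_base_verbose 10 -np 2 ./hello

and, if you really did want to force the selection by hand:

  mpirun --mca ras alps --mca plm alps -np 2 ./hello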

In any event, this seems like a very strange combination, but I assume you have 
some reason for doing this?

I'm always leery of fiddling with the SLURM modules as (a) there aren't very 
many slurm users out there, (b) the primary users are the DOE national labs 
themselves, using software provided by LLNL (who controls slurm), and (c) there 
are major disconnects between the various slurm releases, so we wind up 
breaking things for someone rather easily.

So the more I can understand what you are doing, the easier it is to determine 
how to use your patch without breaking slurm support for others.

Thanks!
Ralph


On Jul 9, 2010, at 6:46 AM, Jerome Soumagne wrote:

> Well we actually use a patched version of SLURM, 2.2.0-pre8. It is planned to 
> submit the modifications made internally at CSCS for the next SLURM release 
> in November. We implement ALPS support based on the basic architecture of 
> SLURM.
> SLURM is only used to do the ALPS resource allocation. We then use mpirun 
> based on the portals and on the alps libraries.
> We don't use mca parameters to direct selection and the alps RAS is 
> automatically well selected.
> 
> On 07/09/2010 01:59 PM, Ralph Castain wrote:
>> 
>> Forgive my confusion, but could you please clarify something? You are using 
>> ALPS as the resource manager doing the allocation, and then using SLURM as 
>> the launcher (instead of ALPS)?
>> 
>> That's a combination we've never seen or heard about. I suspect our module 
>> selection logic would be confused by such a combination - are you using mca 
>> params to direct selection?
>> 
>> 
>> On Jul 9, 2010, at 4:19 AM, Jerome Soumagne wrote:
>> 
>>> Hi,
>>> 
>>> We've recently installed OpenMPI on one of our Cray XT5 machines, here at 
>>> CSCS. This machine uses SLURM for launching jobs.
>>> Doing an salloc defines this environment variable:
>>>   BASIL_RESERVATION_ID
>>>   The reservation ID on Cray systems running ALPS/BASIL only.
>>> 
>>> Since the alps ras module tries to find a variable called OMPI_ALPS_RESID 
>>> which is set using a script, we thought that for SLURM systems it would be 
>>> a good idea to directly integrate this BASIL_RESERVATION_ID variable in the 
>>> code, rather than using a script. The small patch is attached.
>>> 
>>> Regards,
>>> 
>>> Jerome
>>> -- 
>>> Jérôme Soumagne
>>> Scientific Computing Research Group
>>> CSCS, Swiss National Supercomputing Centre 
>>> Galleria 2, Via Cantonale  | Tel: +41 (0)91 610 8258
>>> CH-6928 Manno, Switzerland | Fax: +41 (0)91 610 8282
>>> 
>>> 
> 
> 
> -- 
> Jérôme Soumagne
> Scientific Computing Research Group
> CSCS, Swiss National Supercomputing Centre 
> Galleria 2, Via Cantonale  | Tel: +41 (0)91 610 8258
> CH-6928 Manno, Switzerland | Fax: +41 (0)91 610 8282
> 



Re: [OMPI devel] alps ras patch for SLURM

2010-07-09 Thread Jerome Soumagne
Ok, I may not have explained this very clearly. In our case we only use 
SLURM as the resource manager.
The difference here is that the SLURM version that we use has support 
for ALPS. Therefore, when we run our job using the mpirun command, since 
we have the alps environment loaded, it's the ALPS RAS which is 
selected, and the ALPS PLM as well. I think I could even skip compiling 
the OpenMPI slurm support.


Here is an example of what we do. This is my batch script (with the 
patched version):

#!/bin/bash
#SBATCH --job-name=HelloOMPI
#SBATCH --nodes=2
#SBATCH --time=00:30:00

set -ex
cd /users/soumagne/gele/hello
mpirun --mca ras_base_verbose 10 --mca plm_base_verbose 10 -np 2 --bynode `pwd`/hello


And here is the output that I get:
soumagne@gele1:~/gele/hello> more slurm-165.out
+ cd /users/soumagne/gele/hello
++ pwd
+ mpirun --mca ras_base_verbose 10 --mca plm_base_verbose 10 --bynode -np 2 /users/soumagne/gele/hello/hello
[gele2:15844] mca: base: components_open: Looking for plm components
[gele2:15844] mca: base: components_open: opening plm components
[gele2:15844] mca: base: components_open: found loaded component alps
[gele2:15844] mca: base: components_open: component alps has no register function
[gele2:15844] mca: base: components_open: component alps open function successful
[gele2:15844] mca: base: components_open: found loaded component slurm
[gele2:15844] mca: base: components_open: component slurm has no register function
[gele2:15844] mca: base: components_open: component slurm open function successful
[gele2:15844] mca:base:select: Auto-selecting plm components
[gele2:15844] mca:base:select:(  plm) Querying component [alps]
[gele2:15844] mca:base:select:(  plm) Query of component [alps] set priority to 75
[gele2:15844] mca:base:select:(  plm) Querying component [slurm]
[gele2:15844] mca:base:select:(  plm) Query of component [slurm] set priority to 75
[gele2:15844] mca:base:select:(  plm) Selected component [alps]
[gele2:15844] mca: base: close: component slurm closed
[gele2:15844] mca: base: close: unloading component slurm
[gele2:15844] mca: base: components_open: Looking for ras components
[gele2:15844] mca: base: components_open: opening ras components
[gele2:15844] mca: base: components_open: found loaded component cm
[gele2:15844] mca: base: components_open: component cm has no register function
[gele2:15844] mca: base: components_open: component cm open function successful
[gele2:15844] mca: base: components_open: found loaded component alps
[gele2:15844] mca: base: components_open: component alps has no register function
[gele2:15844] mca: base: components_open: component alps open function successful
[gele2:15844] mca: base: components_open: found loaded component slurm
[gele2:15844] mca: base: components_open: component slurm has no register function
[gele2:15844] mca: base: components_open: component slurm open function successful
[gele2:15844] mca:base:select: Auto-selecting ras components
[gele2:15844] mca:base:select:(  ras) Querying component [cm]
[gele2:15844] mca:base:select:(  ras) Skipping component [cm]. Query failed to return a module
[gele2:15844] mca:base:select:(  ras) Querying component [alps]
[gele2:15844] ras:alps: available for selection
[gele2:15844] mca:base:select:(  ras) Query of component [alps] set priority to 75
[gele2:15844] mca:base:select:(  ras) Querying component [slurm]
[gele2:15844] mca:base:select:(  ras) Query of component [slurm] set priority to 75
[gele2:15844] mca:base:select:(  ras) Selected component [alps]
[gele2:15844] mca: base: close: unloading component cm
[gele2:15844] mca: base: close: unloading component slurm
[gele2:15844] ras:alps:allocate: Using ALPS configuration file: "/etc/sysconfig/alps"
[gele2:15844] ras:alps:allocate: Located ALPS scheduler file: "/ufs/alps_shared/appinfo"
[gele2:15844] ras:alps:orte_ras_alps_get_appinfo_attempts: 10
[gele2:15844] ras:alps:read_appinfo: got NID 16
[gele2:15844] ras:alps:read_appinfo: added NID 16 to list
[gele2:15844] ras:alps:read_appinfo: got NID 16
[gele2:15844] ras:alps:read_appinfo: got NID 16
[gele2:15844] ras:alps:read_appinfo: got NID 16
[gele2:15844] ras:alps:read_appinfo: got NID 16
[gele2:15844] ras:alps:read_appinfo: got NID 16
[gele2:15844] ras:alps:read_appinfo: got NID 16
[gele2:15844] ras:alps:read_appinfo: got NID 16
[gele2:15844] ras:alps:read_appinfo: got NID 16
[gele2:15844] ras:alps:read_appinfo: got NID 16
[gele2:15844] ras:alps:read_appinfo: got NID 16
[gele2:15844] ras:alps:read_appinfo: got NID 16
[gele2:15844] ras:alps:read_appinfo: got NID 20
[gele2:15844] ras:alps:read_appinfo: added NID 20 to list
[gele2:15844] ras:alps:read_appinfo: got NID 20
[gele2:15844] ras:alps:read_appinfo: got NID 20
[gele2:15844] ras:alps:read_appinfo: got NID 20
[gele2:15844] ras:alps:read_appinfo: got NID 20
[gele2:15844] ras:alps:read_appinfo: got NID 20
[gele2:15844] ras:alps:read_appinfo: got NID 20
[gele2:15844] ras:alps:read_appinfo: got

Re: [OMPI devel] alps ras patch for SLURM

2010-07-09 Thread Ralph Castain
My bad - I see that you actually do patch the alps ras. Is BASIL_RESERVATION_ID 
something included in alps, or is this just a name you invented?


On Jul 9, 2010, at 8:08 AM, Jerome Soumagne wrote:

> Ok I may have not explained very clearly. In our case we only use SLURM for 
> the resource manager.
> The difference here is that the SLURM version that we use has support for 
> ALPS. Therefore when we run our job using the mpirun command, since we have 
> the alps environment loaded, it's the ALPS RAS which is selected, and the 
> ALPS PLM as well. I think I could even not compile the OpenMPI slurm support.
> 
> Here is what we do for example: here is my batch script (with the patched 
> version) 
> #!/bin/bash
> #SBATCH --job-name=HelloOMPI 
> #SBATCH --nodes=2
> #SBATCH --time=00:30:00
> 
> set -ex
> cd /users/soumagne/gele/hello
> mpirun --mca ras_base_verbose 10 --mca plm_base_verbose 10 -np 2 --bynode 
> `pwd`/hello
> 
> And here is the output that I get:
> soumagne@gele1:~/gele/hello> more slurm-165.out 
> + cd /users/soumagne/gele/hello
> ++ pwd
> + mpirun --mca ras_base_verbose 10 --mca plm_base_verbose 10 --bynode -np 2 
> /use
> rs/soumagne/gele/hello/hello
> [gele2:15844] mca: base: components_open: Looking for plm components
> [gele2:15844] mca: base: components_open: opening plm components
> [gele2:15844] mca: base: components_open: found loaded component alps
> [gele2:15844] mca: base: components_open: component alps has no register 
> functio
> n
> [gele2:15844] mca: base: components_open: component alps open function 
> successfu
> l
> [gele2:15844] mca: base: components_open: found loaded component slurm
> [gele2:15844] mca: base: components_open: component slurm has no register 
> functi
> on
> [gele2:15844] mca: base: components_open: component slurm open function 
> successf
> ul
> [gele2:15844] mca:base:select: Auto-selecting plm components
> [gele2:15844] mca:base:select:(  plm) Querying component [alps]
> [gele2:15844] mca:base:select:(  plm) Query of component [alps] set priority 
> to 
> 75
> [gele2:15844] mca:base:select:(  plm) Querying component [slurm]
> [gele2:15844] mca:base:select:(  plm) Query of component [slurm] set priority 
> to
>  75
> [gele2:15844] mca:base:select:(  plm) Selected component [alps]
> [gele2:15844] mca: base: close: component slurm closed
> [gele2:15844] mca: base: close: unloading component slurm
> [gele2:15844] mca: base: components_open: Looking for ras components
> [gele2:15844] mca: base: components_open: opening ras components
> [gele2:15844] mca: base: components_open: found loaded component cm
> [gele2:15844] mca: base: components_open: component cm has no register 
> function
> [gele2:15844] mca: base: components_open: component cm open function 
> successful
> [gele2:15844] mca: base: components_open: found loaded component alps
> [gele2:15844] mca: base: components_open: component alps has no register 
> functio
> n
> [gele2:15844] mca: base: components_open: component alps open function 
> successfu
> l
> [gele2:15844] mca: base: components_open: found loaded component slurm
> [gele2:15844] mca: base: components_open: component slurm has no register 
> functi
> on
> [gele2:15844] mca: base: components_open: component slurm open function 
> successf
> ul
> [gele2:15844] mca:base:select: Auto-selecting ras components
> [gele2:15844] mca:base:select:(  ras) Querying component [cm]
> [gele2:15844] mca:base:select:(  ras) Skipping component [cm]. Query failed 
> to r
> eturn a module
> [gele2:15844] mca:base:select:(  ras) Querying component [alps]
> [gele2:15844] ras:alps: available for selection
> [gele2:15844] mca:base:select:(  ras) Query of component [alps] set priority 
> to 
> 75
> [gele2:15844] mca:base:select:(  ras) Querying component [slurm]
> [gele2:15844] mca:base:select:(  ras) Query of component [slurm] set priority 
> to
>  75
> [gele2:15844] mca:base:select:(  ras) Selected component [alps]
> [gele2:15844] mca: base: close: unloading component cm
> [gele2:15844] mca: base: close: unloading component slurm
> [gele2:15844] ras:alps:allocate: Using ALPS configuration file: 
> "/etc/sysconfig/
> alps"
> [gele2:15844] ras:alps:allocate: Located ALPS scheduler file: 
> "/ufs/alps_shared/
> appinfo"
> [gele2:15844] ras:alps:orte_ras_alps_get_appinfo_attempts: 10
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: added NID 16 to list
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:r

Re: [OMPI devel] alps ras patch for SLURM

2010-07-09 Thread Ralph Castain
To clarify: what I'm trying to understand is what the heck a 
"BASIL_RESERVATION_ID" is - it isn't a standard slurm thing, nor can I find it 
defined in alps, so it appears to just be a local name you invented. True?

If so, I would rather see some standard name instead of something local to one 
organization.

On Jul 9, 2010, at 8:08 AM, Jerome Soumagne wrote:

> Ok I may have not explained very clearly. In our case we only use SLURM for 
> the resource manager.
> The difference here is that the SLURM version that we use has support for 
> ALPS. Therefore when we run our job using the mpirun command, since we have 
> the alps environment loaded, it's the ALPS RAS which is selected, and the 
> ALPS PLM as well. I think I could even not compile the OpenMPI slurm support.
> 
> Here is what we do for example: here is my batch script (with the patched 
> version) 
> #!/bin/bash
> #SBATCH --job-name=HelloOMPI 
> #SBATCH --nodes=2
> #SBATCH --time=00:30:00
> 
> set -ex
> cd /users/soumagne/gele/hello
> mpirun --mca ras_base_verbose 10 --mca plm_base_verbose 10 -np 2 --bynode 
> `pwd`/hello
> 
> And here is the output that I get:
> soumagne@gele1:~/gele/hello> more slurm-165.out 
> + cd /users/soumagne/gele/hello
> ++ pwd
> + mpirun --mca ras_base_verbose 10 --mca plm_base_verbose 10 --bynode -np 2 
> /use
> rs/soumagne/gele/hello/hello
> [gele2:15844] mca: base: components_open: Looking for plm components
> [gele2:15844] mca: base: components_open: opening plm components
> [gele2:15844] mca: base: components_open: found loaded component alps
> [gele2:15844] mca: base: components_open: component alps has no register 
> functio
> n
> [gele2:15844] mca: base: components_open: component alps open function 
> successfu
> l
> [gele2:15844] mca: base: components_open: found loaded component slurm
> [gele2:15844] mca: base: components_open: component slurm has no register 
> functi
> on
> [gele2:15844] mca: base: components_open: component slurm open function 
> successf
> ul
> [gele2:15844] mca:base:select: Auto-selecting plm components
> [gele2:15844] mca:base:select:(  plm) Querying component [alps]
> [gele2:15844] mca:base:select:(  plm) Query of component [alps] set priority 
> to 
> 75
> [gele2:15844] mca:base:select:(  plm) Querying component [slurm]
> [gele2:15844] mca:base:select:(  plm) Query of component [slurm] set priority 
> to
>  75
> [gele2:15844] mca:base:select:(  plm) Selected component [alps]
> [gele2:15844] mca: base: close: component slurm closed
> [gele2:15844] mca: base: close: unloading component slurm
> [gele2:15844] mca: base: components_open: Looking for ras components
> [gele2:15844] mca: base: components_open: opening ras components
> [gele2:15844] mca: base: components_open: found loaded component cm
> [gele2:15844] mca: base: components_open: component cm has no register 
> function
> [gele2:15844] mca: base: components_open: component cm open function 
> successful
> [gele2:15844] mca: base: components_open: found loaded component alps
> [gele2:15844] mca: base: components_open: component alps has no register 
> functio
> n
> [gele2:15844] mca: base: components_open: component alps open function 
> successfu
> l
> [gele2:15844] mca: base: components_open: found loaded component slurm
> [gele2:15844] mca: base: components_open: component slurm has no register 
> functi
> on
> [gele2:15844] mca: base: components_open: component slurm open function 
> successf
> ul
> [gele2:15844] mca:base:select: Auto-selecting ras components
> [gele2:15844] mca:base:select:(  ras) Querying component [cm]
> [gele2:15844] mca:base:select:(  ras) Skipping component [cm]. Query failed 
> to r
> eturn a module
> [gele2:15844] mca:base:select:(  ras) Querying component [alps]
> [gele2:15844] ras:alps: available for selection
> [gele2:15844] mca:base:select:(  ras) Query of component [alps] set priority 
> to 
> 75
> [gele2:15844] mca:base:select:(  ras) Querying component [slurm]
> [gele2:15844] mca:base:select:(  ras) Query of component [slurm] set priority 
> to
>  75
> [gele2:15844] mca:base:select:(  ras) Selected component [alps]
> [gele2:15844] mca: base: close: unloading component cm
> [gele2:15844] mca: base: close: unloading component slurm
> [gele2:15844] ras:alps:allocate: Using ALPS configuration file: 
> "/etc/sysconfig/
> alps"
> [gele2:15844] ras:alps:allocate: Located ALPS scheduler file: 
> "/ufs/alps_shared/
> appinfo"
> [gele2:15844] ras:alps:orte_ras_alps_get_appinfo_attempts: 10
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: added NID 16 to list
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15

Re: [OMPI devel] alps ras patch for SLURM

2010-07-09 Thread Ralph Castain
Appreciate your explanation, but it doesn't align with your patch. Your patch 
doesn't do anything because it patches the slurm ras module, but the system is 
selecting the alps ras module - so your patch never runs.

What am I missing?

On Jul 9, 2010, at 8:08 AM, Jerome Soumagne wrote:

> Ok I may have not explained very clearly. In our case we only use SLURM for 
> the resource manager.
> The difference here is that the SLURM version that we use has support for 
> ALPS. Therefore when we run our job using the mpirun command, since we have 
> the alps environment loaded, it's the ALPS RAS which is selected, and the 
> ALPS PLM as well. I think I could even not compile the OpenMPI slurm support.
> 
> Here is what we do for example: here is my batch script (with the patched 
> version) 
> #!/bin/bash
> #SBATCH --job-name=HelloOMPI 
> #SBATCH --nodes=2
> #SBATCH --time=00:30:00
> 
> set -ex
> cd /users/soumagne/gele/hello
> mpirun --mca ras_base_verbose 10 --mca plm_base_verbose 10 -np 2 --bynode 
> `pwd`/hello
> 
> And here is the output that I get:
> soumagne@gele1:~/gele/hello> more slurm-165.out 
> + cd /users/soumagne/gele/hello
> ++ pwd
> + mpirun --mca ras_base_verbose 10 --mca plm_base_verbose 10 --bynode -np 2 
> /use
> rs/soumagne/gele/hello/hello
> [gele2:15844] mca: base: components_open: Looking for plm components
> [gele2:15844] mca: base: components_open: opening plm components
> [gele2:15844] mca: base: components_open: found loaded component alps
> [gele2:15844] mca: base: components_open: component alps has no register 
> functio
> n
> [gele2:15844] mca: base: components_open: component alps open function 
> successfu
> l
> [gele2:15844] mca: base: components_open: found loaded component slurm
> [gele2:15844] mca: base: components_open: component slurm has no register 
> functi
> on
> [gele2:15844] mca: base: components_open: component slurm open function 
> successf
> ul
> [gele2:15844] mca:base:select: Auto-selecting plm components
> [gele2:15844] mca:base:select:(  plm) Querying component [alps]
> [gele2:15844] mca:base:select:(  plm) Query of component [alps] set priority 
> to 
> 75
> [gele2:15844] mca:base:select:(  plm) Querying component [slurm]
> [gele2:15844] mca:base:select:(  plm) Query of component [slurm] set priority 
> to
>  75
> [gele2:15844] mca:base:select:(  plm) Selected component [alps]
> [gele2:15844] mca: base: close: component slurm closed
> [gele2:15844] mca: base: close: unloading component slurm
> [gele2:15844] mca: base: components_open: Looking for ras components
> [gele2:15844] mca: base: components_open: opening ras components
> [gele2:15844] mca: base: components_open: found loaded component cm
> [gele2:15844] mca: base: components_open: component cm has no register 
> function
> [gele2:15844] mca: base: components_open: component cm open function 
> successful
> [gele2:15844] mca: base: components_open: found loaded component alps
> [gele2:15844] mca: base: components_open: component alps has no register 
> functio
> n
> [gele2:15844] mca: base: components_open: component alps open function 
> successfu
> l
> [gele2:15844] mca: base: components_open: found loaded component slurm
> [gele2:15844] mca: base: components_open: component slurm has no register 
> functi
> on
> [gele2:15844] mca: base: components_open: component slurm open function 
> successf
> ul
> [gele2:15844] mca:base:select: Auto-selecting ras components
> [gele2:15844] mca:base:select:(  ras) Querying component [cm]
> [gele2:15844] mca:base:select:(  ras) Skipping component [cm]. Query failed 
> to r
> eturn a module
> [gele2:15844] mca:base:select:(  ras) Querying component [alps]
> [gele2:15844] ras:alps: available for selection
> [gele2:15844] mca:base:select:(  ras) Query of component [alps] set priority 
> to 
> 75
> [gele2:15844] mca:base:select:(  ras) Querying component [slurm]
> [gele2:15844] mca:base:select:(  ras) Query of component [slurm] set priority 
> to
>  75
> [gele2:15844] mca:base:select:(  ras) Selected component [alps]
> [gele2:15844] mca: base: close: unloading component cm
> [gele2:15844] mca: base: close: unloading component slurm
> [gele2:15844] ras:alps:allocate: Using ALPS configuration file: 
> "/etc/sysconfig/
> alps"
> [gele2:15844] ras:alps:allocate: Located ALPS scheduler file: 
> "/ufs/alps_shared/
> appinfo"
> [gele2:15844] ras:alps:orte_ras_alps_get_appinfo_attempts: 10
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: added NID 16 to list
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinf

Re: [OMPI devel] alps ras patch for SLURM

2010-07-09 Thread Jerome Soumagne
It's not invented; it's a standard SLURM name. Sorry for not having said 
so, my first e-mail was really too short.

http://manpages.ubuntu.com/manpages/lucid/man1/sbatch.1.html
http://slurm-llnl.sourcearchive.com/documentation/2.1.1/basil__interface_8c-source.html
...

google could have been your friend in this case... ;) (but I agree, 
that's really a strange name)


Jerome

On 07/09/2010 04:27 PM, Ralph Castain wrote:
To clarify: what I'm trying to understand is what the heck a 
"BASIL_RESERVATION_ID" is - it isn't a standard slurm thing, nor can I 
find it defined in alps, so it appears to just be a local name you 
invented. True?


If so, I would rather see some standard name instead of something 
local to one organization.


On Jul 9, 2010, at 8:08 AM, Jerome Soumagne wrote:

Ok I may have not explained very clearly. In our case we only use 
SLURM for the resource manager.
The difference here is that the SLURM version that we use has support 
for ALPS. Therefore when we run our job using the mpirun command, 
since we have the alps environment loaded, it's the ALPS RAS which is 
selected, and the ALPS PLM as well. I think I could even not compile 
the OpenMPI slurm support.


Here is what we do for example: here is my batch script (with the 
patched version)

#!/bin/bash
#SBATCH --job-name=HelloOMPI
#SBATCH --nodes=2
#SBATCH --time=00:30:00

set -ex
cd /users/soumagne/gele/hello
mpirun --mca ras_base_verbose 10 --mca plm_base_verbose 10 -np 2 
--bynode `pwd`/hello


And here is the output that I get:
soumagne@gele1:~/gele/hello> more slurm-165.out
+ cd /users/soumagne/gele/hello
++ pwd
+ mpirun --mca ras_base_verbose 10 --mca plm_base_verbose 10 --bynode 
-np 2 /use

rs/soumagne/gele/hello/hello
[gele2:15844] mca: base: components_open: Looking for plm components
[gele2:15844] mca: base: components_open: opening plm components
[gele2:15844] mca: base: components_open: found loaded component alps
[gele2:15844] mca: base: components_open: component alps has no 
register functio

n
[gele2:15844] mca: base: components_open: component alps open 
function successfu

l
[gele2:15844] mca: base: components_open: found loaded component slurm
[gele2:15844] mca: base: components_open: component slurm has no 
register functi

on
[gele2:15844] mca: base: components_open: component slurm open 
function successf

ul
[gele2:15844] mca:base:select: Auto-selecting plm components
[gele2:15844] mca:base:select:(  plm) Querying component [alps]
[gele2:15844] mca:base:select:(  plm) Query of component [alps] set 
priority to

75
[gele2:15844] mca:base:select:(  plm) Querying component [slurm]
[gele2:15844] mca:base:select:(  plm) Query of component [slurm] set 
priority to

 75
[gele2:15844] mca:base:select:(  plm) Selected component [alps]
[gele2:15844] mca: base: close: component slurm closed
[gele2:15844] mca: base: close: unloading component slurm
[gele2:15844] mca: base: components_open: Looking for ras components
[gele2:15844] mca: base: components_open: opening ras components
[gele2:15844] mca: base: components_open: found loaded component cm
[gele2:15844] mca: base: components_open: component cm has no 
register function
[gele2:15844] mca: base: components_open: component cm open function 
successful

[gele2:15844] mca: base: components_open: found loaded component alps
[gele2:15844] mca: base: components_open: component alps has no 
register functio

n
[gele2:15844] mca: base: components_open: component alps open 
function successfu

l
[gele2:15844] mca: base: components_open: found loaded component slurm
[gele2:15844] mca: base: components_open: component slurm has no 
register functi

on
[gele2:15844] mca: base: components_open: component slurm open 
function successf

ul
[gele2:15844] mca:base:select: Auto-selecting ras components
[gele2:15844] mca:base:select:(  ras) Querying component [cm]
[gele2:15844] mca:base:select:(  ras) Skipping component [cm]. Query 
failed to r

eturn a module
[gele2:15844] mca:base:select:(  ras) Querying component [alps]
[gele2:15844] ras:alps: available for selection
[gele2:15844] mca:base:select:(  ras) Query of component [alps] set 
priority to

75
[gele2:15844] mca:base:select:(  ras) Querying component [slurm]
[gele2:15844] mca:base:select:(  ras) Query of component [slurm] set 
priority to

 75
[gele2:15844] mca:base:select:(  ras) Selected component [alps]
[gele2:15844] mca: base: close: unloading component cm
[gele2:15844] mca: base: close: unloading component slurm
[gele2:15844] ras:alps:allocate: Using ALPS configuration file: 
"/etc/sysconfig/

alps"
[gele2:15844] ras:alps:allocate: Located ALPS scheduler file: 
"/ufs/alps_shared/

appinfo"
[gele2:15844] ras:alps:orte_ras_alps_get_appinfo_attempts: 10
[gele2:15844] ras:alps:read_appinfo: got NID 16
[gele2:15844] ras:alps:read_appinfo: added NID 16 to list
[gele2:15844] ras:alps:read_appinfo: got NID 16
[gele2:15844] ras:alps:read_appinfo: got NID 16
[gele2:15844] ras:alps:read_appinfo: got NID 16
[gele2:15844] ras:alps:

Re: [OMPI devel] alps ras patch for SLURM

2010-07-09 Thread Matney Sr, Kenneth D.
Hi Jerome,

I am in part responsible for the current incarnation of the ALPS  support in 
OMPI.  We use the
modules environment to set OMPI_ALPS_RESID to the ALPS reservation ID, the 
pertinent
parts of which are:

  set   ridpath ${basedir}/share/openmpi
  set   ridname ras-alps-command.sh
  set   rid ${ridpath}/${ridname}

# Set local cluster parameters for XT5.
  set   resId   [exec /bin/bash ${rid}]
  setenv    OMPI_ALPS_RESID $resId
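
(In other words, after loading the module -- the module name is 
site-specific -- something like

  module load openmpi
  echo $OMPI_ALPS_RESID

shows the ALPS reservation ID that the alps ras will pick up.)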

Originally, the Cray XT systems automatically set an environmental variable, 
BATCH_PARTITION_ID
to the ALPS reservation ID for the job.  However, newer versions do not expose 
the ALPS reservation
ID to the user.  So, we need a way to get the ALPS reservation ID of the Torque 
job.  Unfortunately,
Cray has not made the internal structure of ALPS that does this available.  So, 
we are forced to use
apstat to get this information.  But, apstat is not as robust as we might like. 
 Ergo, the script is used to
loop on apstat until it does not fail.  In the end, we obtain the ALPS 
reservation ID for the current
Torque job and set it to OMPI_ALPS_RESID.  I chose this name so as to avoid 
namespace conflicts.
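
For the record, the retry idea is roughly the following (this is only a 
sketch, not the shipped ras-alps-command.sh; the apstat options and the 
output parsing are assumptions and vary between CLE releases):

  # Keep calling apstat until it succeeds, then pull out the reservation
  # ID that belongs to the current batch job.
  jid=${BATCH_JOBID:?no batch job ID in the environment}
  resid=""
  while [ -z "$resid" ]; do
      out=`apstat -r 2>/dev/null` || { sleep 1; continue; }
      resid=`echo "$out" | awk -v jid="$jid" '$0 ~ jid { print $1; exit }'`
      [ -z "$resid" ] && sleep 1
  done
  echo $resid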

So, the ALPS RAS mca is being selected, because your patch tells the ALPS RAS 
mca that
BASIL_RESERVATION_ID is equivalent to OMPI_ALPS_RESID.  In turn, while you 
invoke OMPI with
mpirun, the OMPI version of mpirun will select the ALPS PLM mca.  This will 
launch your job with an
aprun (under the covers).  So, your job does show a successful run.  However, 
you may not be taking
the path through mpirun that you intended.

I do hope that I have cleared up some confusion.
--
Ken Matney, Sr.
Oak Ridge National Laboratory


On Jul 9, 2010, at 6:19 AM, Jerome Soumagne wrote:

Hi,

We've recently installed OpenMPI on one of our Cray XT5 machines, here at CSCS. 
This machine uses SLURM for launching jobs.
Doing an salloc defines this environment variable:
  BASIL_RESERVATION_ID
  The reservation ID on Cray systems running ALPS/BASIL only.

Since the alps ras module tries to find a variable called OMPI_ALPS_RESID which 
is set using a script, we thought that for SLURM systems it would be a good 
idea to directly integrate this BASIL_RESERVATION_ID variable in the code, 
rather than using a script. The small patch is attached.

Regards,

Jerome

--
Jérôme Soumagne
Scientific Computing Research Group
CSCS, Swiss National Supercomputing Centre
Galleria 2, Via Cantonale  | Tel: +41 (0)91 610 8258
CH-6928 Manno, Switzerland | Fax: +41 (0)91 610 8282








Re: [OMPI devel] alps ras patch for SLURM

2010-07-09 Thread Jerome Soumagne
Another link worth mentioning: 
https://computing.llnl.gov/linux/slurm/cray.html

It says at the top of the page "NOTE: As of January 2009, the SLURM 
interface to Cray systems is incomplete." However, what we have now on 
our system is reasonably stable, and a good part of the SLURM interface 
to Cray is now complete.
What we have at CSCS is a set of patches which improve and complete the 
interface. As I said, these modifications will be submitted for the 
November release of SLURM. Again, there is nothing non-standard in it.


I hope that it helps,

Jerome

On 07/09/2010 05:02 PM, Jerome Soumagne wrote:
It's not invented, it's a SLURM standard name. Sorry for not having 
said that, my first e-mail was really too short.

http://manpages.ubuntu.com/manpages/lucid/man1/sbatch.1.html
http://slurm-llnl.sourcearchive.com/documentation/2.1.1/basil__interface_8c-source.html
...

google could have been your friend in this case... ;) (but I agree, 
that's really a strange name)


Jerome

On 07/09/2010 04:27 PM, Ralph Castain wrote:
To clarify: what I'm trying to understand is what the heck a 
"BASIL_RESERVATION_ID" is - it isn't a standard slurm thing, nor can 
I find it defined in alps, so it appears to just be a local name you 
invented. True?


If so, I would rather see some standard name instead of something 
local to one organization.


On Jul 9, 2010, at 8:08 AM, Jerome Soumagne wrote:

Ok I may have not explained very clearly. In our case we only use 
SLURM for the resource manager.
The difference here is that the SLURM version that we use has 
support for ALPS. Therefore when we run our job using the mpirun 
command, since we have the alps environment loaded, it's the ALPS 
RAS which is selected, and the ALPS PLM as well. I think I could 
even not compile the OpenMPI slurm support.


Here is what we do for example: here is my batch script (with the 
patched version)

#!/bin/bash
#SBATCH --job-name=HelloOMPI
#SBATCH --nodes=2
#SBATCH --time=00:30:00

set -ex
cd /users/soumagne/gele/hello
mpirun --mca ras_base_verbose 10 --mca plm_base_verbose 10 -np 2 
--bynode `pwd`/hello


And here is the output that I get:
soumagne@gele1:~/gele/hello> more slurm-165.out
+ cd /users/soumagne/gele/hello
++ pwd
+ mpirun --mca ras_base_verbose 10 --mca plm_base_verbose 10 
--bynode -np 2 /use

rs/soumagne/gele/hello/hello
[gele2:15844] mca: base: components_open: Looking for plm components
[gele2:15844] mca: base: components_open: opening plm components
[gele2:15844] mca: base: components_open: found loaded component alps
[gele2:15844] mca: base: components_open: component alps has no 
register functio

n
[gele2:15844] mca: base: components_open: component alps open 
function successfu

l
[gele2:15844] mca: base: components_open: found loaded component slurm
[gele2:15844] mca: base: components_open: component slurm has no 
register functi

on
[gele2:15844] mca: base: components_open: component slurm open 
function successf

ul
[gele2:15844] mca:base:select: Auto-selecting plm components
[gele2:15844] mca:base:select:(  plm) Querying component [alps]
[gele2:15844] mca:base:select:(  plm) Query of component [alps] set 
priority to

75
[gele2:15844] mca:base:select:(  plm) Querying component [slurm]
[gele2:15844] mca:base:select:(  plm) Query of component [slurm] set 
priority to

 75
[gele2:15844] mca:base:select:(  plm) Selected component [alps]
[gele2:15844] mca: base: close: component slurm closed
[gele2:15844] mca: base: close: unloading component slurm
[gele2:15844] mca: base: components_open: Looking for ras components
[gele2:15844] mca: base: components_open: opening ras components
[gele2:15844] mca: base: components_open: found loaded component cm
[gele2:15844] mca: base: components_open: component cm has no 
register function
[gele2:15844] mca: base: components_open: component cm open function 
successful

[gele2:15844] mca: base: components_open: found loaded component alps
[gele2:15844] mca: base: components_open: component alps has no 
register functio

n
[gele2:15844] mca: base: components_open: component alps open 
function successfu

l
[gele2:15844] mca: base: components_open: found loaded component slurm
[gele2:15844] mca: base: components_open: component slurm has no 
register functi

on
[gele2:15844] mca: base: components_open: component slurm open 
function successf

ul
[gele2:15844] mca:base:select: Auto-selecting ras components
[gele2:15844] mca:base:select:(  ras) Querying component [cm]
[gele2:15844] mca:base:select:(  ras) Skipping component [cm]. Query 
failed to r

eturn a module
[gele2:15844] mca:base:select:(  ras) Querying component [alps]
[gele2:15844] ras:alps: available for selection
[gele2:15844] mca:base:select:(  ras) Query of component [alps] set 
priority to

75
[gele2:15844] mca:base:select:(  ras) Querying component [slurm]
[gele2:15844] mca:base:select:(  ras) Query of component [slurm] set 
priority to

 75
[gele2:15844] mca:base:select:(  ras) Selected compo

Re: [OMPI devel] alps ras patch for SLURM

2010-07-09 Thread Ralph Castain
Actually, this patch doesn't have anything to do with slurm according to the 
documentation in the links. It has to do with Cray's batch allocator system, 
which slurm is just interfacing to. So what you are really saying is that you 
want the alps ras to run if we either detect the presence of alps acting as a 
resource manager, or we detect that the Cray batch allocator has assigned an id.

How that latter id was assigned is irrelevant to the patch.

True?

You Cray guys out there: is this going to cause a conflict with other Cray 
installations?


On Jul 9, 2010, at 9:44 AM, Jerome Soumagne wrote:

> another link which can be worth mentioning: 
> https://computing.llnl.gov/linux/slurm/cray.html
> 
> it says at the top of the page NOTE: As of January 2009, the SLURM interface 
> to Cray systems is incomplete.
> but what we have now on our system is something which is reasonably stable 
> and a good part of the SLURM interface to Cray is now complete.
> What we have at CSCS is a list of patches which improve and complete the 
> interface. As I said, these modifications will be submitted for the November 
> release of SLURM. Again, there is nothing non-standard in it.
> 
> I hope that it helps,
> 
> Jerome
> 
> On 07/09/2010 05:02 PM, Jerome Soumagne wrote:
>> 
>> It's not invented, it's a SLURM standard name. Sorry for not having said 
>> that, my first e-mail was really too short.
>> http://manpages.ubuntu.com/manpages/lucid/man1/sbatch.1.html
>> http://slurm-llnl.sourcearchive.com/documentation/2.1.1/basil__interface_8c-source.html
>> ...
>> 
>> google could have been your friend in this case... ;) (but I agree, that's 
>> really a strange name)
>> 
>> Jerome
>> 
>> On 07/09/2010 04:27 PM, Ralph Castain wrote:
>>> 
>>> To clarify: what I'm trying to understand is what the heck a 
>>> "BASIL_RESERVATION_ID" is - it isn't a standard slurm thing, nor can I find 
>>> it defined in alps, so it appears to just be a local name you invented. 
>>> True?
>>> 
>>> If so, I would rather see some standard name instead of something local to 
>>> one organization.
>>> 
>>> On Jul 9, 2010, at 8:08 AM, Jerome Soumagne wrote:
>>> 
 Ok I may have not explained very clearly. In our case we only use SLURM 
 for the resource manager.
 The difference here is that the SLURM version that we use has support for 
 ALPS. Therefore when we run our job using the mpirun command, since we 
 have the alps environment loaded, it's the ALPS RAS which is selected, and 
 the ALPS PLM as well. I think I could even not compile the OpenMPI slurm 
 support.
 
 Here is what we do for example: here is my batch script (with the patched 
 version) 
 #!/bin/bash
 #SBATCH --job-name=HelloOMPI 
 #SBATCH --nodes=2
 #SBATCH --time=00:30:00
 
 set -ex
 cd /users/soumagne/gele/hello
 mpirun --mca ras_base_verbose 10 --mca plm_base_verbose 10 -np 2 --bynode 
 `pwd`/hello
 
 And here is the output that I get:
 soumagne@gele1:~/gele/hello> more slurm-165.out 
 + cd /users/soumagne/gele/hello
 ++ pwd
 + mpirun --mca ras_base_verbose 10 --mca plm_base_verbose 10 --bynode -np 
 2 /use
 rs/soumagne/gele/hello/hello
 [gele2:15844] mca: base: components_open: Looking for plm components
 [gele2:15844] mca: base: components_open: opening plm components
 [gele2:15844] mca: base: components_open: found loaded component alps
 [gele2:15844] mca: base: components_open: component alps has no register 
 functio
 n
 [gele2:15844] mca: base: components_open: component alps open function 
 successfu
 l
 [gele2:15844] mca: base: components_open: found loaded component slurm
 [gele2:15844] mca: base: components_open: component slurm has no register 
 functi
 on
 [gele2:15844] mca: base: components_open: component slurm open function 
 successf
 ul
 [gele2:15844] mca:base:select: Auto-selecting plm components
 [gele2:15844] mca:base:select:(  plm) Querying component [alps]
 [gele2:15844] mca:base:select:(  plm) Query of component [alps] set 
 priority to 
 75
 [gele2:15844] mca:base:select:(  plm) Querying component [slurm]
 [gele2:15844] mca:base:select:(  plm) Query of component [slurm] set 
 priority to
  75
 [gele2:15844] mca:base:select:(  plm) Selected component [alps]
 [gele2:15844] mca: base: close: component slurm closed
 [gele2:15844] mca: base: close: unloading component slurm
 [gele2:15844] mca: base: components_open: Looking for ras components
 [gele2:15844] mca: base: components_open: opening ras components
 [gele2:15844] mca: base: components_open: found loaded component cm
 [gele2:15844] mca: base: components_open: component cm has no register 
 function
 [gele2:15844] mca: base: components_open: component cm open function 
 successful
 [gele2:15844] mca: base: components_open: found loaded component alps
 

Re: [OMPI devel] alps ras patch for SLURM

2010-07-09 Thread Matney Sr, Kenneth D.
Ralph,

His patch only modifies the ALPS RAS mca.  And, it causes the environmental
variable BASIL_RESERVATION_ID to be a synonym for OMPI_ALPS_RESID.
It makes it convenient for the version of SLURM that they are proposing.  But,
it does not invoke any side-effects.
--
Ken Matney, Sr.
Oak Ridge National Laboratory


On Jul 9, 2010, at 12:15 PM, Ralph Castain wrote:

Actually, this patch doesn't have anything to do with slurm according to the 
documentation in the links. It has to do with Cray's batch allocator system, 
which slurm is just interfacing to. So what you are really saying is that you 
want the alps ras to run if we either detect the presence of alps acting as a 
resource manager, or we detect that the Cray batch allocator has assigned an id.

How that latter id was assigned is irrelevant to the patch.

True?

You Cray guys out there: is this going to cause a conflict with other Cray 
installations?


On Jul 9, 2010, at 9:44 AM, Jerome Soumagne wrote:

another link which can be worth mentioning: 
https://computing.llnl.gov/linux/slurm/cray.html

it says at the top of the page NOTE: As of January 2009, the SLURM interface to 
Cray systems is incomplete.
but what we have now on our system is something which is reasonably stable and 
a good part of the SLURM interface to Cray is now complete.
What we have at CSCS is a list of patches which improve and complete the 
interface. As I said, these modifications will be submitted for the November 
release of SLURM. Again, there is nothing non-standard in it.

I hope that it helps,

Jerome

On 07/09/2010 05:02 PM, Jerome Soumagne wrote:
It's not invented, it's a SLURM standard name. Sorry for not having said that, 
my first e-mail was really too short.
http://manpages.ubuntu.com/manpages/lucid/man1/sbatch.1.html
http://slurm-llnl.sourcearchive.com/documentation/2.1.1/basil__interface_8c-source.html
...

google could have been your friend in this case... ;) (but I agree, that's 
really a strange name)

Jerome

On 07/09/2010 04:27 PM, Ralph Castain wrote:
To clarify: what I'm trying to understand is what the heck a 
"BASIL_RESERVATION_ID" is - it isn't a standard slurm thing, nor can I find it 
defined in alps, so it appears to just be a local name you invented. True?

If so, I would rather see some standard name instead of something local to one 
organization.

On Jul 9, 2010, at 8:08 AM, Jerome Soumagne wrote:

Ok I may have not explained very clearly. In our case we only use SLURM for the 
resource manager.
The difference here is that the SLURM version that we use has support for ALPS. 
Therefore when we run our job using the mpirun command, since we have the alps 
environment loaded, it's the ALPS RAS which is selected, and the ALPS PLM as 
well. I think I could even not compile the OpenMPI slurm support.

Here is what we do for example: here is my batch script (with the patched 
version)
#!/bin/bash
#SBATCH --job-name=HelloOMPI
#SBATCH --nodes=2
#SBATCH --time=00:30:00

set -ex
cd /users/soumagne/gele/hello
mpirun --mca ras_base_verbose 10 --mca plm_base_verbose 10 -np 2 --bynode 
`pwd`/hello

And here is the output that I get:
soumagne@gele1:~/gele/hello> more slurm-165.out
+ cd /users/soumagne/gele/hello
++ pwd
+ mpirun --mca ras_base_verbose 10 --mca plm_base_verbose 10 --bynode -np 2 /use
rs/soumagne/gele/hello/hello
[gele2:15844] mca: base: components_open: Looking for plm components
[gele2:15844] mca: base: components_open: opening plm components
[gele2:15844] mca: base: components_open: found loaded component alps
[gele2:15844] mca: base: components_open: component alps has no register functio
n
[gele2:15844] mca: base: components_open: component alps open function successfu
l
[gele2:15844] mca: base: components_open: found loaded component slurm
[gele2:15844] mca: base: components_open: component slurm has no register functi
on
[gele2:15844] mca: base: components_open: component slurm open function successf
ul
[gele2:15844] mca:base:select: Auto-selecting plm components
[gele2:15844] mca:base:select:(  plm) Querying component [alps]
[gele2:15844] mca:base:select:(  plm) Query of component [alps] set priority to
75
[gele2:15844] mca:base:select:(  plm) Querying component [slurm]
[gele2:15844] mca:base:select:(  plm) Query of component [slurm] set priority to
 75
[gele2:15844] mca:base:select:(  plm) Selected component [alps]
[gele2:15844] mca: base: close: component slurm closed
[gele2:15844] mca: base: close: unloading component slurm
[gele2:15844] mca: base: components_open: Looking for ras components
[gele2:15844] mca: base: components_open: opening ras components
[gele2:15844] mca: base: components_open: found loaded component cm
[gele2:15844] mca: base: components_open: component cm has no register function
[gele2:15844] mca: base: components_open: component cm open function successful
[gele2:15844] mca: base: components_open: found loaded component alps
[gele2:15844] mca: base: components_open: component alps has no register functio
n

Re: [OMPI devel] alps ras patch for SLURM

2010-07-09 Thread Jerome Soumagne

Hi Ken,

That's interesting; setting OMPI_ALPS_RESID in the modules environment so 
that it executes ras-alps-command.sh is a good idea. In that case, another 
way would be to add an extra check for BASIL_RESERVATION_ID in this 
script, as you did for BATCH_PARTITION_ID.

I have another possible patch then:

Index: ras-alps-command.sh
===
--- ras-alps-command.sh    (revision 23365)
+++ ras-alps-command.sh    (working copy)
@@ -22,6 +22,13 @@
 exit 0
   fi

+  # If the SLURM BASIL_RESERVATION_ID is set, use it.
+  if [ "${BASIL_RESERVATION_ID}" != "" ]
+  then
+  ${ECHO} ${BASIL_RESERVATION_ID}
+  exit 0
+  fi
+
 # Extract the batch job ID directly from the environment, if available.
   jid=${BATCH_JOBID:--1}
   if [ $jid -eq -1 ]
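
As a quick sanity check (assuming none of the earlier checks in the script 
fire first), the modified script can be exercised by hand with a dummy 
value:

  BASIL_RESERVATION_ID=12345 /bin/bash ras-alps-command.sh
  # should simply print 12345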


Thanks for your help in the clarification.

Jerome

On 07/09/2010 05:41 PM, Matney Sr, Kenneth D. wrote:

Hi Jerome,

I am in part responsible for the current incarnation of the ALPS  support in 
OMPI.  We use the
modules environment to set OMPI_ALPS_RESID to the ALPS reservation ID, the 
pertinent
parts of which are:

   set   ridpath ${basedir}/share/openmpi
   set   ridname ras-alps-command.sh
   set   rid ${ridpath}/${ridname}

# Set local cluster parameters for XT5.
   set   resId   [exec /bin/bash ${rid}]
   setenv    OMPI_ALPS_RESID $resId

Originally, the Cray XT systems automatically set an environmental variable, 
BATCH_PARTITION_ID
to the ALPS reservation ID for the job.  However, newer versions do not expose 
the ALPS reservation
ID to the user.  So, we need a way to get the ALPS reservation ID of the Torque 
job.  Unfortunately,
Cray has not made the internal structure of ALPS that does this available.  So, 
we are forced to use
apstat to get this information.  But, apstat is not as robust as we might like. 
 Ergo, the script is used to
loop on apstat until it does not fail.  In the end, we obtain the ALPS 
reservation ID for the current
Torque job and set it to OMPI_ALPS_RESID.  I chose this name so as to avoid 
namespace conflicts.

So, the ALPS RAS mca is being selected, because your patch tells the ALPS RAS 
mca that
BASIL_RESERVATION_ID is equivalent to OMPI_ALPS_RESID.  In turn, while you 
invoke OMPI with
mpirun, the OMPI version of mpirun will select the ALPS PLM mca.  This will 
launch your job with an
aprun (under the covers).  So, your job does show a successful run.  However, 
you may not be taking
the path through mpirun that you intended.

I do hope that I have cleared up some confusion.
--
Ken Matney, Sr.
Oak Ridge National Laboratory






Re: [OMPI devel] alps ras patch for SLURM

2010-07-09 Thread Jerome Soumagne
I would prefer the first patch, though, so that we get rid of the script and
of another environment variable, but I'll let you choose.


Jerome





Re: [OMPI devel] some questions regarding the portals modules

2010-07-09 Thread Matney Sr, Kenneth D.
Hello Jerome,

The first one is simple: portals is not thread-safe on the Cray XT. As I
recall, only the master thread can post an event, although any thread can
receive the event. Then again, I might have it backwards; it has been a
couple of years since I played with this.

The second one depends on how you use your Cray XT. In our case, the machine
is used process-per-core, i.e., not as a collection of SMPs. For performance
reasons, you definitely do not want MPI threads. Also, since it is run
process-per-core, there is nothing to be gained with progress threads;
Portals events will generate a kernel-level interrupt. Whether you can run
the XT as a cluster of SMPs is another question entirely. We really have not
tried this in the context of OMPI, but in conjunction with portals this might
open a "can of worms". For example, any thread can be run on any core, but
the portals ID for a thread will be the NID/PID pair for that core. If two
threads get scheduled to the same core, it would not be pretty.
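
For what it's worth, an application can at least request a thread level with
MPI_Init_thread() and then check what the library actually provides, rather
than assuming MPI_THREAD_MULTIPLE. A minimal, generic sketch (plain MPI;
nothing Cray- or OMPI-specific is assumed here):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* Ask for full thread support, but be prepared to get less. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (provided < MPI_THREAD_MULTIPLE) {
        /* A portals-based build may only provide MPI_THREAD_SINGLE;
         * in that case, keep all MPI calls on the main thread. */
        if (0 == rank) {
            printf("provided thread level = %d, running without MPI threads\n",
                   provided);
        }
    }

    MPI_Finalize();
    return 0;
}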

I could see lots of reasons why spawn might fail. First, it is run on a
compute node, and there is no way for a compute node to run a process on
another compute node. Also, there will be no rank/size initialization
forthcoming from ALPS, so even if it got past this, it would be running on
the same node as its parent.
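
For concreteness, the call in question is the standard MPI-2 spawn shown
below. This is only a generic sketch (the "./worker" executable name is made
up), included so it is clear what the XT would be asked to do:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm children;

    MPI_Init(&argc, &argv);

    /* Ask the runtime to start two copies of a (hypothetical) worker binary
     * and hand back an intercommunicator to them. */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &children, MPI_ERRCODES_IGNORE);

    /* On an XT compute node there is no way to start processes on other
     * compute nodes, so any spawned ranks would end up on the parent's node. */

    MPI_Comm_disconnect(&children);
    MPI_Finalize();
    return 0;
}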
-- Ken Matney, Sr.
   Oak Ridge National Laboratory










Re: [OMPI devel] some questions regarding the portals modules

2010-07-09 Thread Jerome Soumagne

Hi Ken,

Thank you very much for your reply; I will think about it and do some more
tests. I was only thinking about using MPI threads, but yes, as you say, if
two threads are scheduled on the same core, that wouldn't be pretty at all.
I can probably run some more tests of that functionality, but I don't expect
great results.


I'm not sure I correctly understand what you say about spawn. I found a
presentation on the web by Richard Graham saying that the spawn functionality
was implemented, and that you get full MPI-2 support on the Cray XT. When I
said that I had problems with the MPI_Comm_accept/connect functions, I meant
that I actually get errors when I try a "simple" MPI_Open_port. Do you know
where in the code I can find whether this function is implemented? If it is,
knowing where it is defined would help me find the origin of my problem and
possibly extend the support for this functionality (if that is feasible). I
would like to be able to link two different jobs together using these
functions, i.e. creating a communicator between the jobs.
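
For clarity, the kind of pattern I would like to run is the standard MPI-2
server/client sketch below. This is only a simplified example (the port name
is passed on the command line just to keep it self-contained; normally it
would be exchanged out of band or via MPI_Publish_name/MPI_Lookup_name):

#include <stdio.h>
#include <string.h>
#include <mpi.h>

/* Run one job with the argument "server" and, once it has printed its port
 * name, a second job with the arguments "client <port-name>". */
int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;

    MPI_Init(&argc, &argv);

    if (argc > 1 && 0 == strcmp(argv[1], "server")) {
        /* Job 1: open a port and wait for the other job to connect. */
        MPI_Open_port(MPI_INFO_NULL, port);
        printf("port name: %s\n", port);
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
        MPI_Close_port(port);
    } else if (argc > 2 && 0 == strcmp(argv[1], "client")) {
        /* Job 2: connect using the port name printed by the server. */
        strncpy(port, argv[2], MPI_MAX_PORT_NAME - 1);
        port[MPI_MAX_PORT_NAME - 1] = '\0';
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
    } else {
        MPI_Finalize();
        return 1;
    }

    /* "inter" is now an intercommunicator linking the two jobs. */
    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}

The two sides would be launched as two separate jobs, e.g. two separate
mpirun invocations.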


Thanks,

Jerome





Re: [OMPI devel] some questions regarding the portals modules

2010-07-09 Thread Ralph Castain

On Jul 9, 2010, at 3:23 PM, Jerome Soumagne wrote:

> Hi Ken,
> 
> Thank you very much for your reply; I will think about it and do some more
> tests. I was only thinking about using MPI threads, but yes, as you say, if
> two threads are scheduled on the same core, that wouldn't be pretty at all.
> I can probably run some more tests of that functionality, but I don't expect
> great results.
> 
> I'm not sure I correctly understand what you say about spawn. I found a
> presentation on the web by Richard Graham saying that the spawn functionality
> was implemented, and that you get full MPI-2 support on the Cray XT. When I
> said that I had problems with the MPI_Comm_accept/connect functions, I meant
> that I actually get errors when I try a "simple" MPI_Open_port. Do you know
> where in the code I can find whether this function is implemented? If it is,
> knowing where it is defined would help me find the origin of my problem and
> possibly extend the support for this functionality (if that is feasible). I
> would like to be able to link two different jobs together using these
> functions, i.e. creating a communicator between the jobs.

It is implemented in ompi/mca/dpm/orte. I believe it isn't supported for the 
reasons Ken described.
