[OMPI devel] btl_openib_receive_queues mca param not always taken into account

2014-07-11 Thread Nadia Derbey

Hi,

I noticed that specifying the receive_queues through an mca param (-mca 
btl_openib_receive_queues ) doesn't always override the 
mca-btl-openib-device-params.ini setting.


If for whatever reason we want to bypass the 
mca-btl-openib-device-params.ini file setting for the receive_queues, we 
should be able to specify a value through an mca param.
But if the string provided in the mca param is the same as the default 
one (default_qps in btl_openib_register_mca_params()), this does not 
work: we still get the receive_queues from the .ini file.


This is due to the way mca_btl_openib_component.receive_queues_source (which 
records where the receive_queues value came from) is computed:


1) In btl_openib_register_mca_params() we register btl_openib_receive_queues, 
providing default_qps as a default value.
2) mca_btl_openib_component.receive_queues_source is set to 
BTL_OPENIB_RQ_SOURCE_MCA only if the registered string is different from 
default_qps (if both strings are equal, the source is set to 
BTL_OPENIB_RQ_SOURCE_DEFAULT).
3) Then, in init_one_device(), mca_btl_openib_component.receive_queues_source 
is checked:
   . if its value is BTL_OPENIB_RQ_SOURCE_MCA, we bypass any other setting 
(this is the behaviour I expected)
   . otherwise, we go on and take the .ini file settings (this is the 
behaviour I got)

(A simplified sketch of this classification follows below.)
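The sketch uses simplified names and a placeholder default string (it is not 
the actual btl_openib_mca.c code): a value passed through 
-mca btl_openib_receive_queues that happens to equal default_qps is classified 
as DEFAULT, so init_one_device() later falls back to the .ini file:

/* Simplified illustration of steps 1-3 above; identifiers are placeholders. */
#include <stdio.h>
#include <string.h>

enum rq_source { RQ_SOURCE_DEFAULT, RQ_SOURCE_MCA };

/* step 2: the source is derived from a string comparison only */
static enum rq_source classify(const char *default_qps, const char *registered)
{
    return (0 == strcmp(default_qps, registered)) ? RQ_SOURCE_DEFAULT
                                                  : RQ_SOURCE_MCA;
}

int main(void)
{
    const char *default_qps = "P,4096,8,6,4:P,32768,8,6,4";  /* placeholder */

    /* the user explicitly passed the very same string on the command line */
    enum rq_source src = classify(default_qps, default_qps);

    /* step 3: only RQ_SOURCE_MCA bypasses the .ini file */
    if (RQ_SOURCE_MCA != src) {
        printf("falling back to the mca-btl-openib-device-params.ini settings\n");
    }
    return 0;
}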


I would like to know whether this behaviour is intentional and, if so, the 
reason for it.
If it is not, the attached trivial patch fixes it.

Regards,

--
Nadia Derbey

# HG changeset patch
# Parent 4cb09323aca44faec7d027586ffa94e7d9681989
btl/openib: when specifying the receive_queues as an mca param to bypass the XRC settings, the XRC settings in the .ini file are taken into account nevertheless if we use the default QPs value

diff -r 4cb09323aca4 ompi/mca/btl/openib/btl_openib_component.c
--- a/ompi/mca/btl/openib/btl_openib_component.c	Fri Jul 11 05:05:19 2014 +0000
+++ b/ompi/mca/btl/openib/btl_openib_component.c	Fri Jul 11 11:46:56 2014 +0200
@@ -268,6 +268,17 @@ static int btl_openib_component_close(vo
 ompi_btl_openib_fd_finalize();
 ompi_btl_openib_ini_finalize();

+if (NULL != mca_btl_openib_component.receive_queues
+&& BTL_OPENIB_RQ_SOURCE_DEFAULT ==
+mca_btl_openib_component.receive_queues_source) {
+/*
+ * In that case, the string has not been duplicated during variable
+ * registration. So it won't be freed by the mca_base_var system.
+ * Free it here.
+ */
+free(mca_btl_openib_component.receive_queues);
+}
+
 if (NULL != mca_btl_openib_component.default_recv_qps) {
 free(mca_btl_openib_component.default_recv_qps);
 }
diff -r 4cb09323aca4 ompi/mca/btl/openib/btl_openib_mca.c
--- a/ompi/mca/btl/openib/btl_openib_mca.c	Fri Jul 11 05:05:19 2014 +0000
+++ b/ompi/mca/btl/openib/btl_openib_mca.c	Fri Jul 11 11:46:56 2014 +0200
@@ -661,12 +661,14 @@ int btl_openib_register_mca_params(void)
 mca_btl_openib_component.default_recv_qps = default_qps;
 CHECK(reg_string("receive_queues", NULL,
  "Colon-delimited, comma-delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4",
- default_qps, &mca_btl_openib_component.receive_queues,
+ NULL, &mca_btl_openib_component.receive_queues,
  0));
-mca_btl_openib_component.receive_queues_source =
-(0 == strcmp(default_qps,
- mca_btl_openib_component.receive_queues)) ?
-BTL_OPENIB_RQ_SOURCE_DEFAULT : BTL_OPENIB_RQ_SOURCE_MCA;
+if (NULL == mca_btl_openib_component.receive_queues) {
+mca_btl_openib_component.receive_queues = strdup(default_qps);
+mca_btl_openib_component.receive_queues_source = BTL_OPENIB_RQ_SOURCE_DEFAULT;
+} else {
+mca_btl_openib_component.receive_queues_source = BTL_OPENIB_RQ_SOURCE_MCA;
+}

 CHECK(reg_string("if_include", NULL,
  "Comma-delimited list of devices/ports to be used (e.g. \"mthca0,mthca1:2\"; empty value means to use all ports found).  Mutually exclusive with btl_openib_if_exclude.",


Re: [OMPI devel] bug in opal_generic_simple_pack_function()

2013-11-25 Thread Nadia Derbey

George,

Yes, the revisions are from HG.
Actually, I saw the keyword "convertor" in the summary and I thought 
that was the one!
You're right, there are about 10 changesets about DDT fixes since the 
release I was testing with!


Thanks,
Nadia

On 25/11/2013 14:31, George Bosilca wrote:

Nadia,

I guess the revisions mentioned are from HG? If I'm not mistaken the 
change you mentioned corresponds to r29285. I'm not sure if they are 
related, as r29285 is about positioning a convertor, and this is only 
used in the case of multi-fragment messages. As this is not the case 
for your example, I don't think they are related.


I guess we should look at all the patches in the opal/datatype and 
ompi/datatype over the last 13 months (the starting point of the 1.6.3).


  George.


On Nov 25, 2013, at 14:10 , Nadia Derbey <nadia.der...@bull.net 
<mailto:nadia.der...@bull.net>> wrote:



George,

Thx for the detailed answer!
I did my tests on a v1.6.2 (changeset: 141b2759).
After you told me it worked for you with earlier releases, I looked 
at the changesets applied since that time. I guess 28fd94d282a3 is the 
one that fixes my issue?


Regards,
Nadia

On 25/11/2013 13:36, George Bosilca wrote:

Nadia,

Which version of Open MPI are you using? I tried with the nightly 
r29751, the current 1.6 and the current 1.7 and I __always__ got the 
expected output.


There is a simple way to show what the datatype engine is doing. You 
can set the MCA parameters mpi_ddt_unpack_debug and mpi_ddt_pack_debug 
to get more info. If you only want to see how the datatype looks after 
the MPI_Type_commit step, you can directly call ompi_datatype_dump(ddt). 
This will show the internals of the datatype, converted into predefined types.
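As a rough sketch of that suggestion (ompi_datatype_dump() is an Open 
MPI-internal function; the header path and the cast from an MPI_Datatype 
handle are assumptions about Open MPI's implementation, so this only builds 
inside an Open MPI source tree):

/* Dump a committed datatype's internals; run the resulting program with
 *   mpirun -mca mpi_ddt_pack_debug 1 -mca mpi_ddt_unpack_debug 1 ./a.out
 * to also get the pack/unpack traces mentioned above. */
#include <mpi.h>
#include "ompi/datatype/ompi_datatype.h"   /* assumed header for ompi_datatype_dump() */

static void dump_type(MPI_Datatype ddt)
{
    MPI_Type_commit(&ddt);
    /* In Open MPI an MPI_Datatype handle is an ompi_datatype_t *; this cast
     * is an implementation detail, not portable MPI. */
    ompi_datatype_dump((ompi_datatype_t *) ddt);
}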


As an example I took the application you provided and built the 
following picture of what is sent and what is received (original 
buffer, send datatype, packed buffer, recv datatype, resulting buffer).




Now using the ompi_datatype_dump, I see the recv and the send 
datatypes as:


-cC---P-DB-[---][---] OPAL_UINT1 count 8 disp 0x0 (0) extent 1 (size 8)
-cC---P-DB-[---][---] OPAL_UINT1 count 8 disp 0x10 (16) extent 1 (size 8)
-cC---P-DB-[---][---]   OPAL_INT4 count 4 disp 0x30 (48) extent 4 (size 16)
--G---[---][---]  OPAL_END_LOOP prev 3 elements first elem displacement 0 size of data 32


-cC---P-DB-[---][---] OPAL_UINT1 count 24 disp 0x10 (16) extent 1 (size 24)
-cC---P-DB-[---][---] OPAL_UINT1 count 8 disp 0x30 (48) extent 1 (size 8)
-G---[---][---] OPAL_END_LOOP prev 2 elements first elem displacement 16 size of data 32


This matches the datatype drawn by hand perfectly.

  George.



On Nov 25, 2013, at 11:40 , Nadia Derbey <nadia.der...@bull.net 
<mailto:nadia.der...@bull.net>> wrote:



Hi,

I'm currently working on a bug occurring at the client site with 
Open MPI when calling MPI_Sendrecv() on datatypes built by the 
application.
I think I've found where the bug comes from (it is located in 
opal_generic_simple_pack_function() - file 
opal/datatype/opal_datatype_pack.c). But this code is so 
complicated that I'm more than unsure of my fix. What I can say is 
that it fixes things for me, but I need some advice from the 
datatype specialists.


---

You will find in attachment the reproducer provided by the client, 
as well as the resulting output.

datatypes.c : reproducer
to run the binary: salloc --exclusive -p B510 -N 1 -n 1 mpirun 
./datatypes

trc_ko: traces got without the patch applied
trc_ok: traces got with the patch applied.

---

The proposed patch is the following: (Note that the very first 
change in this patch was enough in my case, but I thought all the 
"source_base" settings should follow this model.)


-
opal_generic_simple_pack_function: add the datatype lb when 
progressing in the input buffer


diff -r cb23c2f07e1f opal/datatype/opal_datatype_pack.c
--- a/opal/datatype/opal_datatype_pack.c Sun Nov 24 17:06:51 2013 +0000
+++ b/opal/datatype/opal_datatype_pack.c Mon Nov 25 10:48:00 2013 +0100
@@ -301,7 +301,7 @@ opal_generic_simple_pack_function( opal_
 PACK_PREDEFINED_DATATYPE( pConvertor, pElem, 
count_desc,

source_base, destination, iov_len_local );
 if( 0 == count_desc ) {  /* completed */
-source_base = pConvertor->pBaseBuf + pStack->disp;
+source_base = pConvertor->pBaseBuf + 
pStack->disp + pData->lb;

 pos_desc++;  /* advance to the next data */
 UPDATE_INTERNAL_COUNTERS( description, 
pos_desc, pElem, count_desc );

 continue;
@@ -333,7 +333,7 @@ opal_generic_simple_pack_function( opal_
 pStack->disp += 
description[pStack->index].loop.extent;

 }
 }
-source_base = 

Re: [OMPI devel] bug in opal_generic_simple_pack_function()

2013-11-25 Thread Nadia Derbey

George,

Thx for the detailed answer!
I did my tests on a v1.6.2 (changeset: 141b2759).
After you told me it worked for you with earlier releases, I looked at 
the changesets applied since that time. I guess 28fd94d282a3 is the one 
that fixes my issue?


Regards,
Nadia

On 25/11/2013 13:36, George Bosilca wrote:

Nadia,

Which version of Open MPI are you using? I tried with the nightly 
r29751, the current 1.6 and the current 1.7 and I __always__ got the 
expected output.


There is a simple way to show what the datatype engine is doing. You 
can set the MCA parameters mpi_ddt_unpack_debug and mpi_ddt_pack_debug 
to get more info. If you only want to see how the datatype looks after 
the MPI_Type_commit step, you can directly call ompi_datatype_dump(ddt). 
This will show the internals of the datatype, converted into predefined types.


As an example I took the application you provided and built the 
following picture of what is sent and what is received (original 
buffer, send datatype, packed buffer, recv datatype, resulting buffer).



Now using the ompi_datatype_dump, I see the recv and the send 
datatypes as:


-cC---P-DB-[---][---] OPAL_UINT1 count 8 disp 0x0 (0) extent 1 (size 8)
-cC---P-DB-[---][---] OPAL_UINT1 count 8 disp 0x10 (16) extent 1 (size 8)
-cC---P-DB-[---][---]   OPAL_INT4 count 4 disp 0x30 (48) extent 4 (size 16)
--G---[---][---]  OPAL_END_LOOP prev 3 elements first elem displacement 0 size of data 32


-cC---P-DB-[---][---] OPAL_UINT1 count 24 disp 0x10 (16) extent 1 (size 24)
-cC---P-DB-[---][---] OPAL_UINT1 count 8 disp 0x30 (48) extent 1 (size 8)
-G---[---][---] OPAL_END_LOOP prev 2 elements first elem displacement 16 size of data 32


This matches the datatype drawn by hand perfectly.

  George.



On Nov 25, 2013, at 11:40 , Nadia Derbey <nadia.der...@bull.net 
<mailto:nadia.der...@bull.net>> wrote:



Hi,

I'm currently working on a bug occurring at the client site with 
Open MPI when calling MPI_Sendrecv() on datatypes built by the 
application.
I think I've found where the bug comes from (it is located in 
opal_generic_simple_pack_function() - file 
opal/datatype/opal_datatype_pack.c). But this code is so complicated 
that I'm more than unsure of my fix. What I can say is that it fixes 
things for me, but I need some advice from the datatype specialists.


---

You will find in attachment the reproducer provided by the client, as 
well as the resulting output.

datatypes.c : reproducer
to run the binary: salloc --exclusive -p B510 -N 1 -n 1 mpirun 
./datatypes

trc_ko: traces got without the patch applied
trc_ok: traces got with the patch applied.

---

The proposed patch is the following: (Note that the very first change 
in this patch was enough in my case, but I thought all the 
"source_base" settings should follow this model.)


-
opal_generic_simple_pack_function: add the datatype lb when 
progressing in the input buffer


diff -r cb23c2f07e1f opal/datatype/opal_datatype_pack.c
--- a/opal/datatype/opal_datatype_pack.c	Sun Nov 24 17:06:51 2013 +0000
+++ b/opal/datatype/opal_datatype_pack.c	Mon Nov 25 10:48:00 2013 +0100

@@ -301,7 +301,7 @@ opal_generic_simple_pack_function( opal_
 PACK_PREDEFINED_DATATYPE( pConvertor, pElem, count_desc,
   source_base, destination, 
iov_len_local );

 if( 0 == count_desc ) {  /* completed */
-source_base = pConvertor->pBaseBuf + pStack->disp;
+source_base = pConvertor->pBaseBuf + 
pStack->disp + pData->lb;

 pos_desc++;  /* advance to the next data */
 UPDATE_INTERNAL_COUNTERS( description, pos_desc, 
pElem, count_desc );

 continue;
@@ -333,7 +333,7 @@ opal_generic_simple_pack_function( opal_
 pStack->disp += 
description[pStack->index].loop.extent;

 }
 }
-source_base = pConvertor->pBaseBuf + pStack->disp;
+source_base = pConvertor->pBaseBuf + pStack->disp + 
pData->lb;
 UPDATE_INTERNAL_COUNTERS( description, pos_desc, 
pElem, count_desc );
 DO_DEBUG( opal_output( 0, "pack new_loop count %d 
stack_pos %d pos_desc %d disp %ld space %lu\n",
(int)pStack->count, pConvertor->stack_pos, pos_desc, 
(long)pStack->disp, (unsigned long)iov_len_local ); );

@@ -354,7 +354,7 @@ opal_generic_simple_pack_function( opal_
 pStack->disp + local_disp);
 pos_desc++;
 update_loop_description:  /* update the current state */
-source_base = pConvertor->pBaseBuf + pStack->disp;
+source_base = pConvertor->pBaseBuf + pStack->disp + 
pDa

[OMPI devel] bug in opal_generic_simple_pack_function()

2013-11-25 Thread Nadia Derbey

Hi,

I'm currently working on a bug occurring at the client site with Open MPI 
when calling MPI_Sendrecv() on datatypes built by the application.
I think I've found where the bug comes from (it is located in 
opal_generic_simple_pack_function() - file 
opal/datatype/opal_datatype_pack.c). But this code is so complicated 
that I'm more than unsure of my fix. What I can say is that it fixes 
things for me, but I need some advice from the datatype specialists.
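For readers unfamiliar with the lb notion the patch below relies on (it adds 
pData->lb when computing source_base): a derived datatype can have a non-zero 
lower bound. Here is a small, self-contained example of building such a type; 
it is an illustration only, not the client's reproducer:

/* Build a datatype whose lower bound (lb) is non-zero, via
 * MPI_Type_create_resized().  Illustration only. */
#include <mpi.h>
#include <stdio.h>

int main(void)
{
    MPI_Datatype vec, resized;
    MPI_Aint lb, extent;

    MPI_Init(NULL, NULL);

    /* 2 blocks of 2 ints, stride 4 ints */
    MPI_Type_vector(2, 2, 4, MPI_INT, &vec);

    /* shift the lower bound to one int: lb != 0 */
    MPI_Type_create_resized(vec, (MPI_Aint) sizeof(int),
                            (MPI_Aint) (6 * sizeof(int)), &resized);
    MPI_Type_commit(&resized);

    MPI_Type_get_extent(resized, &lb, &extent);
    printf("lb = %ld, extent = %ld\n", (long) lb, (long) extent);

    MPI_Type_free(&vec);
    MPI_Type_free(&resized);
    MPI_Finalize();
    return 0;
}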


---

You will find in attachment the reproducer provided by the client, as 
well as the resulting output.

datatypes.c : reproducer
to run the binary: salloc --exclusive -p B510 -N 1 -n 1 mpirun ./datatypes
trc_ko: traces got without the patch applied
trc_ok: traces got with the patch applied.

---

The proposed patch is the following: (Note that the very first change in 
this patch was enough in my case, but I thought all the "source_base" 
settings should follow this model.)


-
opal_generic_simple_pack_function: add the datatype lb when progressing 
in the input buffer


diff -r cb23c2f07e1f opal/datatype/opal_datatype_pack.c
--- a/opal/datatype/opal_datatype_pack.c	Sun Nov 24 17:06:51 2013 +0000
+++ b/opal/datatype/opal_datatype_pack.c	Mon Nov 25 10:48:00 2013 +0100

@@ -301,7 +301,7 @@ opal_generic_simple_pack_function( opal_
 PACK_PREDEFINED_DATATYPE( pConvertor, pElem, count_desc,
   source_base, destination, 
iov_len_local );

 if( 0 == count_desc ) {  /* completed */
-source_base = pConvertor->pBaseBuf + pStack->disp;
+source_base = pConvertor->pBaseBuf + pStack->disp + 
pData->lb;

 pos_desc++;  /* advance to the next data */
 UPDATE_INTERNAL_COUNTERS( description, pos_desc, 
pElem, count_desc );

 continue;
@@ -333,7 +333,7 @@ opal_generic_simple_pack_function( opal_
 pStack->disp += 
description[pStack->index].loop.extent;

 }
 }
-source_base = pConvertor->pBaseBuf + pStack->disp;
+source_base = pConvertor->pBaseBuf + pStack->disp + 
pData->lb;
 UPDATE_INTERNAL_COUNTERS( description, pos_desc, 
pElem, count_desc );
 DO_DEBUG( opal_output( 0, "pack new_loop count %d 
stack_pos %d pos_desc %d disp %ld space %lu\n",
(int)pStack->count, 
pConvertor->stack_pos, pos_desc, (long)pStack->disp, (unsigned 
long)iov_len_local ); );

@@ -354,7 +354,7 @@ opal_generic_simple_pack_function( opal_
 pStack->disp + local_disp);
 pos_desc++;
 update_loop_description:  /* update the current state */
-source_base = pConvertor->pBaseBuf + pStack->disp;
+source_base = pConvertor->pBaseBuf + pStack->disp + 
pData->lb;
 UPDATE_INTERNAL_COUNTERS( description, pos_desc, 
pElem, count_desc );
 DDT_DUMP_STACK( pConvertor->pStack, 
pConvertor->stack_pos, pElem, "advance loop" );

 continue;
@@ -374,7 +374,7 @@ opal_generic_simple_pack_function( opal_
 }
 /* I complete an element, next step I should go to the next one */
 PUSH_STACK( pStack, pConvertor->stack_pos, pos_desc, 
OPAL_DATATYPE_INT8, count_desc,

-source_base - pStack->disp - pConvertor->pBaseBuf );
+source_base - pStack->disp - pConvertor->pBaseBuf - 
pData->lb );
 DO_DEBUG( opal_output( 0, "pack save stack stack_pos %d pos_desc 
%d count_desc %d disp %ld\n",
pConvertor->stack_pos, pStack->index, 
(int)pStack->count, (long)pStack->disp ); );

 return 0;

---

Regards,
Nadia

--
Nadia Derbey
Bull, Architect of an Open World
http://www.bull.com

#include 
#include 

/**
 * expected output for defective OpenMPI versions:
results_2[6] = 8
ref_results_2[6] = 12
results_2[7] = 9
ref_results_2[7] = 13

*/

static int
do_test(MPI_Datatype * recvs, MPI_Datatype * sends, int *raw_inputs);




int main(void) {

int rank, size;
int ierror = 0;
MPI_Datatype sends[2], recvs[2];

MPI_Init(NULL, NULL);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);


{
int count = 2, blocklen = 2, stride = 4;

MPI_Type_vector(count, blocklen, stride, MPI_INT, [0]);
MPI_Type_commit([0]);

MPI_Type_vector(count, blocklen, stride, MPI_INT, [1]);
MPI_Type_commit([1]);
}

{
int count = 1;
int blocklength = 4;
int array_of_displacements[] = {4};

MPI_Type_create_indexed_block(count, blocklength, array_of_displacements, 

Re: [OMPI devel] btl/openib: get_ib_dev_distance doesn't see processes as bound if the job has been launched by srun

2012-02-17 Thread nadia . derbey
devel-boun...@open-mpi.org wrote on 02/17/2012 08:36:54 AM:

> From: Brice Goglin 
> To: de...@open-mpi.org
> Date: 02/17/2012 08:37 AM
> Subject: Re: [OMPI devel] btl/openib: get_ib_dev_distance doesn't see
> processes as bound if the job has been launched by srun
> Sent by: devel-boun...@open-mpi.org
> 
> On 16/02/2012 14:16, nadia.der...@bull.net wrote: 
> Hi Jeff, 
> 
> Sorry for the delay, but my victim with 2 ib devices had been stolen ;-) 

> 
> So, I ported the patch on the v1.5 branch and finally could test it. 
> 
> Actually, there is no opal_hwloc_base_get_topology() in v1.5 so I had to 
set 
> the hwloc flags in ompi_mpi_init() and orte_odls_base_open() (i.e. the 
places
> where opal_hwloc_topology is initialized). 
> 
> With the new flag set, hwloc_get_nbobjs_by_type(opal_hwloc_topology,
> HWLOC_OBJ_CORE) 
> is now seeing the actual number of cores on the node (instead of 1 when 
our 
> cpuset is a singleton). 
> 
> Since opal_paffinity_base_get_processor_info() calls 
> module_get_processor_info() 
> (in hwloc/paffinity_hwloc_module.c), which in turn calls 
> hwloc_get_nbobjs_by_type(), 
> we are now getting the right number of cores in get_ib_dev_distance(). 
> 
> So we are looping over the exact number of cores, looking for a 
> potential binding. 
> 
> So as a conclusion, there's no need for any other patch: the fix 
youcommitted
> was the only one needed to fix the issue. 
> 
> I didn't follow this entire thread in details, but I am feeling that
> something is wrong here. The flag fixes your problem indeed, but I 
> think it may break binding too. It's basically making all 
> "unavailable resources" available. So the binding code may end up 
> trying to bind processes on cores that it can't actually use.

It's true that if we have a resource manager that can allocate, say, a single
socket within a node for us, the binding part of OMPI might go beyond its
actual boundaries.

> 
> If srun gives you the first cores of the machine, it works fine 
> because OMPI tries to use the first cores and those are available. 
> But did you ever try when srun gives the second socket only for 
> instance? Or whichever part of the machine that does not contain the
> first cores ?

But I have to look for the proper option in slurm: I don't know whether slurm 
allows such a fine-grained allocation. I have to find the option
that makes it possible to allocate socket X (X != 0).

> I think OMPI will still try to bind on the first cores
> if the flag is set, but those are not available for binding.
> 
> Unless I am missing something, the proper fix would be to have two 
> instances of the topology. One with the entire machine (for people 
> that really want to consult all physical resources), and one for the
> really available part of machine (mostly used for binding).

Agreed! 

Regards,
Nadia
> 
> Brice
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


Re: [OMPI devel] btl/openib: get_ib_dev_distance doesn't see processes as bound if the job has been launched by srun

2012-02-16 Thread nadia . derbey
Hi Jeff,

Sorry for the delay, but my victim with 2 ib devices had been stolen ;-)

So, I ported the patch on the v1.5 branch and finally could test it.

Actually, there is no opal_hwloc_base_get_topology() in v1.5 so I had to 
set
the hwloc flags in ompi_mpi_init() and orte_odls_base_open() (i.e. the 
places
where opal_hwloc_topology is initialized).

With the new flag set, hwloc_get_nbobjs_by_type(opal_hwloc_topology, 
HWLOC_OBJ_CORE)
is now seeing the actual number of cores on the node (instead of 1 when 
our
cpuset is a singleton).

Since opal_paffinity_base_get_processor_info() calls 
module_get_processor_info()
(in hwloc/paffinity_hwloc_module.c), which in turn calls 
hwloc_get_nbobjs_by_type(),
we are now getting the right number of cores in get_ib_dev_distance().

So we are looping over the exact number of cores, looking for a potential 
binding.

So as a conclusion, there's no need for any other patch: the fix you 
committed
was the only one needed to fix the issue.

Could you please move it to v1.5 (do I need to file a CMR)?

Thanks!

 
-- 
Nadia Derbey
 

devel-boun...@open-mpi.org wrote on 02/09/2012 06:00:48 PM:

> From: Jeff Squyres <jsquy...@cisco.com>
> To: Open MPI Developers <de...@open-mpi.org>
> Date: 02/09/2012 06:01 PM
> Subject: Re: [OMPI devel] btl/openib: get_ib_dev_distance doesn't see
> processes as bound if the job has been launched by srun
> Sent by: devel-boun...@open-mpi.org
> 
> Nadia --
> 
> I committed the fix in the trunk to use HWLOC_WHOLE_SYSTEM and 
IO_DEVICES.
> 
> Do you want to revise your patch to use hwloc APIs with 
> opal_hwloc_topology (instead of paffinity)?  We could use that as a 
> basis for the other places you identified that are doing similar things.
> 
> 
> On Feb 9, 2012, at 8:34 AM, Ralph Castain wrote:
> 
> > Ah, okay - in that case, having the I/O device attached to the 
> "closest" object at each depth would be ideal from an OMPI perspective.
> > 
> > On Feb 9, 2012, at 6:30 AM, Brice Goglin wrote:
> > 
> >> The bios usually tells you which numa location is close to each 
> host-to-pci bridge. So the answer is yes.
> >> Brice
> >> 
> >> 
> >> Ralph Castain <r...@open-mpi.org> a écrit :
> >> I'm not sure I understand this comment. A PCI device is attached 
> to the node, not to any specific location within the node, isn't it?
> Can you really say that a PCI device is "attached" to a specific 
> NUMA location, for example?
> >> 
> >> 
> >> On Feb 9, 2012, at 6:15 AM, Jeff Squyres wrote:
> >> 
> >>> That doesn't seem too attractive from an OMPI perspective, 
> though.  We'd want to know where the PCI devices are actually rooted.
> >> 
> >> ___
> >> devel mailing list
> >> de...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > 
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


Re: [OMPI devel] btl/openib: get_ib_dev_distance doesn't see processes as bound if the job has been launched by srun

2012-02-09 Thread nadia . derbey
 devel-boun...@open-mpi.org wrote on 02/09/2012 01:32:31 PM:

> From: Ralph Castain 
> To: Open MPI Developers 
> Date: 02/09/2012 01:32 PM
> Subject: Re: [OMPI devel] btl/openib: get_ib_dev_distance doesn't see
> processes as bound if the job has been launched by srun
> Sent by: devel-boun...@open-mpi.org
> 
> Hi Nadia
> 
> I'm wondering what value there is in showing the full topology, or 
> using it in any of our components, if the process is restricted to a
> specific set of cpus? Does it really help to know that there are 
> other cpus out there that are unreachable?

Ralph,

The intention here is not to show cpus that are unreachable, but to fix an 
issue we have at least in get_ib_dev_distance() in the openib btl.

The problem is that if a process is restricted to a single CPU, the 
algorithm used in get_ib_dev_distance doesn't work at all:
I have 2 ib interfaces on my victim (say mlx4_0 and mlx4_1), and I want 
the openib btl to select the one that is the closest to my rank.

As I said in my first e-mail, here is what is done today (a simplified sketch 
follows below):
   . opal_paffinity_base_get_processor_info() is called to get the number of 
logical processors (we get 1 due to the singleton cpuset)
   . we loop over that # of processors to check whether our process is bound 
to one of them. In our case the loop will be executed only once and we will 
never get the correct binding information.
   . if the process is bound, we actually get the distance to the device. In 
our case, the distance won't be computed and mlx4_0 will be seen as 
"equivalent" to mlx4_1 in terms of distances. This is what I definitely want 
to avoid.
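The sketch below is hypothetical and simplified (stand-in helper functions and 
a made-up core number; not the literal btl_openib_component.c code): with a 
singleton cpuset the reported processor count is 1, so the core the process is 
really bound to is never examined and no distance is ever computed:

/* Hypothetical, simplified sketch of the binding-detection loop. */
#include <stdio.h>

static int reported_num_processors(void) { return 1; }          /* singleton cpuset */
static int process_is_bound_to(int core) { return core == 37; } /* real binding */

int main(void)
{
    int bound_core = -1;
    int n = reported_num_processors();

    for (int i = 0; i < n; i++) {        /* only core 0 is ever checked */
        if (process_is_bound_to(i)) {
            bound_core = i;
            break;
        }
    }

    if (bound_core < 0)
        printf("process looks unbound: device distances are not computed\n");
    else
        printf("bound to core %d: compute the distance to mlx4_0 / mlx4_1\n",
               bound_core);
    return 0;
}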

Regards,
Nadia

> 
> On Feb 9, 2012, at 5:15 AM, nadia.der...@bull.net wrote:
> 
> 
> 
> devel-boun...@open-mpi.org wrote on 02/09/2012 12:20:41 PM:
> 
> > From: Brice Goglin 
> > To: Open MPI Developers 
> > Date: 02/09/2012 12:20 PM 
> > Subject: Re: [OMPI devel] btl/openib: get_ib_dev_distance doesn't see
> > processes as bound if the job has been launched by srun 
> > Sent by: devel-boun...@open-mpi.org 
> > 
> > By default, hwloc only shows what's inside the current cpuset. There's
> > an option to show everything instead (topology flag). 
> 
> So maybe using that flag inside 
> opal_paffinity_base_get_processor_info() would be a better fix than 
> the one I'm proposing in my patch. 
> 
> I found a bunch of other places where things are managed as in 
> get_ib_dev_distance(). 
> 
> Just doing a grep in the sources, I could find: 
>   . init_maffinity() in btl/sm/btl_sm.c 
>   . vader_init_maffinity() in btl/vader/btl_vader.c 
>   . get_ib_dev_distance() in btl/wv/btl_wv_component.c 
> 
> So I think the flag Brice is talking about should definitely be the fix. 

> 
> Regards, 
> Nadia 
> 
> > 
> > Brice
> > 
> > 
> > 
> > On 09/02/2012 12:18, Jeff Squyres wrote:
> > > Just so that I understand this better -- if a process is bound in 
> > a cpuset, will tools like hwloc's lstopo only show the Linux 
> > processors *in that cpuset*?  I.e., does it not have any visibility 
> > of the processors outside of its cpuset?
> > >
> > >
> > > On Jan 27, 2012, at 11:38 AM, nadia.derbey wrote:
> > >
> > >> Hi,
> > >>
> > >> If a job is launched using "srun --resv-ports --cpu_bind:..." and 
slurm
> > >> is configured with:
> > >>   TaskPlugin=task/affinity
> > >>   TaskPluginParam=Cpusets
> > >>
> > >> each rank of that job is in a cpuset that contains a single CPU.
> > >>
> > >> Now, if we use carto on top of this, the following happens in
> > >> get_ib_dev_distance() (in btl/openib/btl_openib_component.c):
> > >>   . opal_paffinity_base_get_processor_info() is called to get the
> > >> number of logical processors (we get 1 due to the singleton 
cpuset)
> > >>   . we loop over that # of processors to check whether our process 
is
> > >> bound to one of them. In our case the loop will be executed 
only
> > >> once and we will never get the correct binding information.
> > >>   . if the process is bound actually get the distance to the 
device.
> > >> in our case we won't execute that part of the code.
> > >>
> > >> The attached patch is a proposal to fix the issue.
> > >>
> > >> Regards,
> > >> Nadia
> > >> 
> 
___
> > >> devel mailing list
> > >> de...@open-mpi.org
> > >> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > >
> > 
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

Re: [OMPI devel] btl/openib: get_ib_dev_distance doesn't see processes as bound if the job has been launched by srun

2012-02-09 Thread nadia . derbey
 devel-boun...@open-mpi.org wrote on 02/09/2012 12:20:41 PM:

> From: Brice Goglin 
> To: Open MPI Developers 
> Date: 02/09/2012 12:20 PM
> Subject: Re: [OMPI devel] btl/openib: get_ib_dev_distance doesn't see
> processes as bound if the job has been launched by srun
> Sent by: devel-boun...@open-mpi.org
> 
> By default, hwloc only shows what's inside the current cpuset. There's
> an option to show everything instead (topology flag).

So maybe using that flag inside opal_paffinity_base_get_processor_info() 
would be a better fix than the one I'm proposing in my patch.

I found a bunch of other places where things are handled the same way as in 
get_ib_dev_distance().

Just doing a grep in the sources, I could find:
  . init_maffinity() in btl/sm/btl_sm.c
  . vader_init_maffinity() in btl/vader/btl_vader.c
  . get_ib_dev_distance() in btl/wv/btl_wv_component.c

So I think the flag Brice is talking about should definitely be the fix.
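For reference, here is a minimal sketch of what setting that flag looks like 
with the hwloc 1.x API (flag names as in hwloc 1.x; where exactly Open MPI 
would set them - e.g. inside the paffinity component - is deliberately not 
shown here):

/* Sketch, assuming the hwloc 1.x API: load a topology that also exposes
 * resources outside the calling process' cpuset, plus I/O devices. */
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;

    hwloc_topology_init(&topo);
    hwloc_topology_set_flags(topo,
                             HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM |
                             HWLOC_TOPOLOGY_FLAG_IO_DEVICES);
    hwloc_topology_load(topo);

    /* now reports all cores on the node, not just the singleton cpuset */
    printf("cores visible: %d\n",
           hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE));

    hwloc_topology_destroy(topo);
    return 0;
}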

Regards,
Nadia

> 
> Brice
> 
> 
> 
> On 09/02/2012 12:18, Jeff Squyres wrote:
> > Just so that I understand this better -- if a process is bound in 
> a cpuset, will tools like hwloc's lstopo only show the Linux 
> processors *in that cpuset*?  I.e., does it not have any visibility 
> of the processors outside of its cpuset?
> >
> >
> > On Jan 27, 2012, at 11:38 AM, nadia.derbey wrote:
> >
> >> Hi,
> >>
> >> If a job is launched using "srun --resv-ports --cpu_bind:..." and 
slurm
> >> is configured with:
> >>   TaskPlugin=task/affinity
> >>   TaskPluginParam=Cpusets
> >>
> >> each rank of that job is in a cpuset that contains a single CPU.
> >>
> >> Now, if we use carto on top of this, the following happens in
> >> get_ib_dev_distance() (in btl/openib/btl_openib_component.c):
> >>   . opal_paffinity_base_get_processor_info() is called to get the
> >> number of logical processors (we get 1 due to the singleton 
cpuset)
> >>   . we loop over that # of processors to check whether our process is
> >> bound to one of them. In our case the loop will be executed only
> >> once and we will never get the correct binding information.
> >>   . if the process is bound actually get the distance to the device.
> >> in our case we won't execute that part of the code.
> >>
> >> The attached patch is a proposal to fix the issue.
> >>
> >> Regards,
> >> Nadia
> >> 
___
> >> devel mailing list
> >> de...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


Re: [OMPI devel] btl/openib: get_ib_dev_distance doesn't see processes as bound if the job has been launched by srun

2012-02-09 Thread nadia . derbey
 devel-boun...@open-mpi.org wrote on 02/09/2012 12:18:20 PM:

> From: Jeff Squyres 
> To: Open MPI Developers 
> Date: 02/09/2012 12:18 PM
> Subject: Re: [OMPI devel] btl/openib: get_ib_dev_distance doesn't see
> processes as bound if the job has been launched by srun
> Sent by: devel-boun...@open-mpi.org
> 
> Just so that I understand this better -- if a process is bound in a 
> cpuset, will tools like hwloc's lstopo only show the Linux 
> processors *in that cpuset*?  I.e., does it not have any visibility 
> of the processors outside of its cpuset?

Yes, it looks like it. At least this is what is returned by 
opal_paffinity_base_get_processor_info().

Regards,
Nadia

> 
> 
> On Jan 27, 2012, at 11:38 AM, nadia.derbey wrote:
> 
> > Hi,
> > 
> > If a job is launched using "srun --resv-ports --cpu_bind:..." and 
slurm
> > is configured with:
> >   TaskPlugin=task/affinity
> >   TaskPluginParam=Cpusets
> > 
> > each rank of that job is in a cpuset that contains a single CPU.
> > 
> > Now, if we use carto on top of this, the following happens in
> > get_ib_dev_distance() (in btl/openib/btl_openib_component.c):
> >   . opal_paffinity_base_get_processor_info() is called to get the
> > number of logical processors (we get 1 due to the singleton 
cpuset)
> >   . we loop over that # of processors to check whether our process is
> > bound to one of them. In our case the loop will be executed only
> > once and we will never get the correct binding information.
> >   . if the process is bound actually get the distance to the device.
> > in our case we won't execute that part of the code.
> > 
> > The attached patch is a proposal to fix the issue.
> > 
> > Regards,
> > Nadia
> > 
___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


Re: [OMPI devel] btl/openib: get_ib_dev_distance doesn't see processes as bound if the job has been launched by srun

2012-02-06 Thread nadia . derbey
Resending, as I didn't get any answer...

Regards,
Nadia
 
-- 
Nadia Derbey

 


devel-boun...@open-mpi.org wrote on 01/27/2012 05:38:34 PM:

> De : "nadia.derbey" <nadia.der...@bull.net>
> A : Open MPI Developers <de...@open-mpi.org>
> Date : 01/27/2012 05:35 PM
> Objet : [OMPI devel] btl/openib: get_ib_dev_distance doesn't see 
> processes as bound if the job has been launched by srun
> Envoyé par : devel-boun...@open-mpi.org
> 
> Hi,
> 
> If a job is launched using "srun --resv-ports --cpu_bind:..." and slurm
> is configured with:
>TaskPlugin=task/affinity
>TaskPluginParam=Cpusets
> 
> each rank of that job is in a cpuset that contains a single CPU.
> 
> Now, if we use carto on top of this, the following happens in
> get_ib_dev_distance() (in btl/openib/btl_openib_component.c):
>. opal_paffinity_base_get_processor_info() is called to get the
>  number of logical processors (we get 1 due to the singleton cpuset)
>. we loop over that # of processors to check whether our process is
>  bound to one of them. In our case the loop will be executed only
>  once and we will never get the correct binding information.
>. if the process is bound actually get the distance to the device.
>  in our case we won't execute that part of the code.
> 
> The attached patch is a proposal to fix the issue.
> 
> Regards,
> Nadia
> [attachment "get_ib_dev_distance.patch" deleted by Nadia Derbey/FR/
> BULL] ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

get_ib_dev_distance.patch
Description: Binary data


Re: [OMPI devel] known limitation or bug in hwloc?

2011-08-30 Thread nadia . derbey
devel-boun...@open-mpi.org wrote on 08/29/2011 06:59:49 PM:

> From: Brice Goglin 
> To: Open MPI Developers 
> Date: 08/29/2011 07:00 PM
> Subject: Re: [OMPI devel] known limitation or bug in hwloc?
> Sent by: devel-boun...@open-mpi.org
> 
> I am playing with those aspects right now (it's planned for hwloc v1.4).
> hwloc (even the 1.2 currently in OMPI) can already support topology
> containing different machines,

I guess this is what corresponds to the HWLOC_OBJ_SYSTEM topology object?

> but there's no easy/automatic way to
> agregate multiple machine topologies into a single global one. The
> important thing to understand is that the cpuset/bitmap structure does
> not span to multiple machines, it remains local (because it's tightly
> coupled to binding processes/memory). So if a process running on A
> considers a topology containing nodes A and B, only the cpusets of
> objects corresponding to A are meaningful. Trying (on A) to bind on
> cpusets from B objects would actually bind on A (if the core numbers are
> similar). And the objects "above" the machine just have no cpusets at
> all (because there's no way to bind across multiple machines).
> 
> That said, my understanding is that this is not what this discussion is
> about. Doesn't OMPI use one topology for each node so far? Nadia might
> just be playing with large node (more than 64 cores?) which cause the
> bit loop to end too early.

Exactly: Bull guys are doing some tests on Westmere-EX nodes: 4 sockets of 
10 cores each, with potentially HT enabled.
The problem is that the BIOS has numbered the cores in the following way 
(each pair x,y corresponds to the ids of a physical core):

socket 0: 0,32 4,36  8,40 12,44 16,48 20,52 24,56 28,60 64,72 68,76
socket 1: 1,33 5,37  9,41 13,45 17,49 21,53 25,57 29,61 65,73 69,77
socket 2: 2,34 6,38 10,42 14,46 18,50 22,54 26,58 30,62 66,74 70,78
socket 3: 3,35 7,39 11,43 15,47 19,51 23,55 27,59 31,63 67,75 71,79

I hit the issue with a rankfile as soon as I reached the following line:

rank 8=my_host slot=p64

Regards,
Nadia

> 
> Brice
> 
> 
> 
> 
> On 29/08/2011 18:47, Kenneth Lloyd wrote:
> > This might get interesting.  In "portable hardware locality" (hwloc) 
as
> > originating at the native cpuset, and I see "locality" working at the
> > machine level (machines in my world can have up to 8 CPUs, for 
example).
> >
> > But from an ompi world view, the execution graph across myriad 
machines
> > might dictate a larger, yet still fine grained approach.  I haven't 
had a
> > chance to play with those aspects.  Has anyone else?
> >
> > Ken
> >
> >
> > -Original Message-
> > From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] 
On
> > Behalf Of Ralph Castain
> > Sent: Monday, August 29, 2011 8:21 AM
> > To: Open MPI Developers
> > Subject: Re: [OMPI devel] known limitation or bug in hwloc?
> >
> > Actually, I'll eat those words. I was looking at the wrong place.
> >
> > Yes, that is a bug in hwloc. It needs to loop over CPU_MAX for those 
cases
> > where the bit mask extends over multiple words.
> >
> >
> > On Aug 29, 2011, at 7:16 AM, Ralph Castain wrote:
> >
> >> Actually, if you look closely at the definition of those two values,
> > you'll see that it really doesn't matter which one we loop over. The
> > NUM_BITS value defines the actual total number of bits in the mask. 
The
> > CPU_MAX is the total number of cpus we can support, which was set to a 
value
> > such that the two are equal (i.e., it's a power of two that happens to 
be an
> > integer multiple of 64).
> >> I believe the original intent was to allow CPU_MAX to be independent 
of
> > address-alignment questions, so NUM_BITS could technically be greater 
than
> > CPU_MAX. Even if this happens, though, all that would do is cause the 
loop
> > to run across more bits than required.
> >> So it doesn't introduce a limitation at all. In hindsight, we could
> > simplify things by eliminating one of those values and just putting a
> > requirement on the number that it be a multiple of 64 so it aligns 
with a
> > memory address.
> >>
> >> On Aug 29, 2011, at 7:05 AM, Kenneth Lloyd wrote:
> >>
> >>> Nadia,
> >>>
> >>> Interesting. I haven't tried pushing this to levels above 8 on a
> > particular
> >>> machine. Do you think that the cpuset / paffinity / hwloc only 
applies at
> >>> the machine level, at which time you need to employ a graph with 
carto?
> >>>
> >>> Regards,
> >>>
> >>> Ken
> >>>
> >>> -Original Message-
> >>> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] 
On
> >>> Behalf Of nadia.derbey
> >>> Sent: Monday, August 29, 2011 5:45 AM
> >>> To: Open MPI Developers
> >>> Subject: [OMPI devel] known limitation or bug in hwloc?
> >>>
> >>> Hi list,
> >>>
> >>> I'm hitting a limitation with paffinity/hwloc with cpu numbers >= 
64.
> >>>
> >>> In opal/mca/paffinity/hwloc/paffinity_hwloc_module.c, module_set() 
is
> >>> the routine that sets 

Re: [OMPI devel] known limitation or bug in hwloc?

2011-08-30 Thread nadia . derbey
Thanks a lot Ralph!

Regards,

--
Nadia Derbey
Phone: +33 (0)4 76 29 77 62



devel-boun...@open-mpi.org wrote on 08/29/2011 06:12:13 PM:

> From: Ralph Castain <r...@open-mpi.org>
> To: Open MPI Developers <de...@open-mpi.org>
> Date: 08/29/2011 06:12 PM
> Subject: Re: [OMPI devel] known limitation or bug in hwloc?
> Sent by: devel-boun...@open-mpi.org
> 
> On Aug 29, 2011, at 10:08 AM, nadia.der...@bull.net wrote:
> 
> devel-boun...@open-mpi.org wrote on 08/29/2011 05:57:59 PM:
> 
> > From: Ralph Castain <r...@open-mpi.org> 
> > To: Open MPI Developers <de...@open-mpi.org> 
> > Date: 08/29/2011 05:58 PM 
> > Subject: Re: [OMPI devel] known limitation or bug in hwloc? 
> > Sent by: devel-boun...@open-mpi.org 
> > 
> > On Aug 29, 2011, at 8:35 AM, nadia.der...@bull.net wrote: 
> > 
> > 
> > devel-boun...@open-mpi.org wrote on 08/29/2011 04:20:30 PM:
> > 
> > > From: Ralph Castain <r...@open-mpi.org> 
> > > To: Open MPI Developers <de...@open-mpi.org> 
> > > Date: 08/29/2011 04:26 PM 
> > > Subject: Re: [OMPI devel] known limitation or bug in hwloc? 
> > > Sent by: devel-boun...@open-mpi.org 
> > > 
> > > Actually, I'll eat those words. I was looking at the wrong place.
> > > 
> > > Yes, that is a bug in hwloc. It needs to loop over CPU_MAX for those
> > > cases where the bit mask extends over multiple words. 
> > 
> > But I'm afraid the fix won't be trivial at all: hwloc in itself is 
> > coherent: it loops over NUM_BITS, but it uses masks that are 
> > NUM_BITS wide (hwloc_bitmap_t set)... 
> > 
> > I guess I'm missing that - I just did a search and cannot find any 
> > reference to OPAL_PAFFINITY_BITMASK_T_NUM_BITS anywhere in 
> > paffinity/hwloc after the last change. 
> > 
> > Can you point me to where you believe a problem exists? Or feel free
> > to submit a patch to fix it :-)  We can push it upstream to the 
> > hwloc folks for their consideration. 
> 
> file: opal/mca/paffinity/hwloc/paffinity_hwloc_module.c 
> routine: module_set() 
> 
> You have a reference to OPAL_PAFFINITY_BITMASK_T_NUM_BITS both in the
> trunk and in v1.5. 
> 
> But maybe this issue has already been fixed? 
> 
> I fixed it in the trunk (r25102) per this thread and filed a CMR to 
> move it to v1.5. You should be copied on the CMR ticket.
> 
> 
> Regards, 
> Nadia 
> 
> > 
> > 
> > Regards, 
> > Nadia
> > > 
> > > 
> > > On Aug 29, 2011, at 7:16 AM, Ralph Castain wrote:
> > > 
> > > > Actually, if you look closely at the definition of those two 
> > > values, you'll see that it really doesn't matter which one we loop 
> > > over. The NUM_BITS value defines the actual total number of bits in 
> > > the mask. The CPU_MAX is the total number of cpus we can support, 
> > > which was set to a value such that the two are equal (i.e., it's a 
> > > power of two that happens to be an integer multiple of 64).
> > > > 
> > > > I believe the original intent was to allow CPU_MAX to be 
> > > independent of address-alignment questions, so NUM_BITS could 
> > > technically be greater than CPU_MAX. Even if this happens, though, 
> > > all that would do is cause the loop to run across more bits 
thanrequired.
> > > > 
> > > > So it doesn't introduce a limitation at all. In hindsight, we 
> > > could simplify things by eliminating one of those values and just 
> > > putting a requirement on the number that it be a multiple of 64 so 
> > > it aligns with a memory address.
> > > > 
> > > > 
> > > > On Aug 29, 2011, at 7:05 AM, Kenneth Lloyd wrote:
> > > > 
> > > >> Nadia,
> > > >> 
> > > >> Interesting. I haven't tried pushing this to levels above 8 on 
> > a particular
> > > >> machine. Do you think that the cpuset / paffinity / hwloc 
> only applies at
> > > >> the machine level, at which time you need to employ a graph with 
carto?
> > > >> 
> > > >> Regards,
> > > >> 
> > > >> Ken
> > > >> 
> > > >> -Original Message-
> > > >> From: devel-boun...@open-mpi.org [
mailto:devel-boun...@open-mpi.org] On
> > > >> Behalf Of nadia.derbey
> > > >> Sent: Monday, August 29, 2011 5:45 AM
> > > >> To: Open MPI Developers
> > > >> Subject: [OMPI devel] known limitation or bug i

Re: [OMPI devel] known limitation or bug in hwloc?

2011-08-29 Thread nadia . derbey
devel-boun...@open-mpi.org wrote on 08/29/2011 05:57:59 PM:

> From: Ralph Castain 
> To: Open MPI Developers 
> Date: 08/29/2011 05:58 PM
> Subject: Re: [OMPI devel] known limitation or bug in hwloc?
> Sent by: devel-boun...@open-mpi.org
> 
> On Aug 29, 2011, at 8:35 AM, nadia.der...@bull.net wrote:
> 
> 
> devel-boun...@open-mpi.org wrote on 08/29/2011 04:20:30 PM:
> 
> > From: Ralph Castain  
> > To: Open MPI Developers  
> > Date: 08/29/2011 04:26 PM 
> > Subject: Re: [OMPI devel] known limitation or bug in hwloc? 
> > Sent by: devel-boun...@open-mpi.org 
> > 
> > Actually, I'll eat those words. I was looking at the wrong place.
> > 
> > Yes, that is a bug in hwloc. It needs to loop over CPU_MAX for those
> > cases where the bit mask extends over multiple words. 
> 
> But I'm afraid the fix won't be trivial at all: hwloc in itself is 
> coherent: it loops over NUM_BITS, but it uses masks that are 
> NUM_BITS wide (hwloc_bitmap_t set)... 
> 
> I guess I'm missing that - I just did a search and cannot find any 
> reference to OPAL_PAFFINITY_BITMASK_T_NUM_BITS anywhere in 
> paffinity/hwloc after the last change.
> 
> Can you point me to where you believe a problem exists? Or feel free
> to submit a patch to fix it :-)  We can push it upstream to the 
> hwloc folks for their consideration.

file: opal/mca/paffinity/hwloc/paffinity_hwloc_module.c
routine: module_set()

You have a reference to OPAL_PAFFINITY_BITMASK_T_NUM_BITS both in the trunk 
and in v1.5.

But maybe this issue has already been fixed?

Regards,
Nadia

> 
> 
> Regards, 
> Nadia
> > 
> > 
> > On Aug 29, 2011, at 7:16 AM, Ralph Castain wrote:
> > 
> > > Actually, if you look closely at the definition of those two 
> > values, you'll see that it really doesn't matter which one we loop 
> > over. The NUM_BITS value defines the actual total number of bits in 
> > the mask. The CPU_MAX is the total number of cpus we can support, 
> > which was set to a value such that the two are equal (i.e., it's a 
> > power of two that happens to be an integer multiple of 64).
> > > 
> > > I believe the original intent was to allow CPU_MAX to be 
> > independent of address-alignment questions, so NUM_BITS could 
> > technically be greater than CPU_MAX. Even if this happens, though, 
> > all that would do is cause the loop to run across more bits than 
required.
> > > 
> > > So it doesn't introduce a limitation at all. In hindsight, we 
> > could simplify things by eliminating one of those values and just 
> > putting a requirement on the number that it be a multiple of 64 so 
> > it aligns with a memory address.
> > > 
> > > 
> > > On Aug 29, 2011, at 7:05 AM, Kenneth Lloyd wrote:
> > > 
> > >> Nadia,
> > >> 
> > >> Interesting. I haven't tried pushing this to levels above 8 on 
> a particular
> > >> machine. Do you think that the cpuset / paffinity / hwloc only 
applies at
> > >> the machine level, at which time you need to employ a graph with 
carto?
> > >> 
> > >> Regards,
> > >> 
> > >> Ken
> > >> 
> > >> -Original Message-
> > >> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org
] On
> > >> Behalf Of nadia.derbey
> > >> Sent: Monday, August 29, 2011 5:45 AM
> > >> To: Open MPI Developers
> > >> Subject: [OMPI devel] known limitation or bug in hwloc?
> > >> 
> > >> Hi list,
> > >> 
> > >> I'm hitting a limitation with paffinity/hwloc with cpu numbers >= 
64.
> > >> 
> > >> In opal/mca/paffinity/hwloc/paffinity_hwloc_module.c, module_set() 
is
> > >> the routine that sets the calling process affinity to the mask 
given as
> > >> parameter. Note that "mask" is a opal_paffinity_base_cpu_set_t (so 
we
> > >> allow the cpus to be potentially numbered up to
> > >> OPAL_PAFFINITY_BITMASK_CPU_MAX - 1).
> > >> 
> > >> The problem with module_set() is that it loops over
> > >> OPAL_PAFFINITY_BITMASK_T_NUM_BITS bits to check if these bits are 
set in
> > >> the mask:
> > >> 
> > >> for (i = 0; ((unsigned int) i) < OPAL_PAFFINITY_BITMASK_T_NUM_BITS; 
++i)
> > >> {
> > >>   if (OPAL_PAFFINITY_CPU_ISSET(i, mask)) {
> > >>   hwloc_bitmap_set(set, i);
> > >>   }
> > >>   }
> > >> 
> > >> Given "mask"'s type, I think module_set() should instead loop over
> > >> OPAL_PAFFINITY_BITMASK_CPU_MAX bits.
> > >> 
> > >> Note that module_set() uses a type for its internal mask that is
> > >> coherent with OPAL_PAFFINITY_BITMASK_T_NUM_BITS (hwloc_bitmap_t).
> > >> 
> > >> So I'm wondering whether this is a known limitation I've never 
heard of
> > >> or an actual bug?
> > >> 
> > >> Regards,
> > >> Nadia
> > >> 
> > >> 
> > >> ___
> > >> devel mailing list
> > >> de...@open-mpi.org
> > >> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > >> -
> > >> No virus found in this message.
> > >> Checked by AVG - www.avg.com
> > >> Version: 10.0.1392 / Virus Database: 1520/3864 - Release Date: 
08/28/11
> > >> 
> 

Re: [OMPI devel] known limitation or bug in hwloc?

2011-08-29 Thread nadia . derbey
devel-boun...@open-mpi.org wrote on 08/29/2011 04:20:30 PM:

> From: Ralph Castain 
> To: Open MPI Developers 
> Date: 08/29/2011 04:26 PM
> Subject: Re: [OMPI devel] known limitation or bug in hwloc?
> Sent by: devel-boun...@open-mpi.org
> 
> Actually, I'll eat those words. I was looking at the wrong place.
> 
> Yes, that is a bug in hwloc. It needs to loop over CPU_MAX for those
> cases where the bit mask extends over multiple words.

But I'm afraid the fix won't be trivial at all: hwloc in itself is 
coherent: it loops over NUM_BITS, but it uses masks that are NUM_BITS 
wide (hwloc_bitmap_t set)...
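To make the mismatch concrete, here is a small self-contained illustration 
(the constants and the two-word mask are stand-ins, not the actual 
OPAL_PAFFINITY_BITMASK_T_NUM_BITS / OPAL_PAFFINITY_BITMASK_CPU_MAX values): a 
bit set for cpu 64 - like the "slot=p64" rankfile line mentioned elsewhere in 
this thread - is never seen when the loop stops at the per-word bit count, but 
is found when looping up to the maximum supported cpu number:

/* Stand-in illustration of the loop-bound issue; not OPAL code. */
#include <stdio.h>
#include <stdint.h>

#define NUM_BITS 64     /* stand-in: bits examined by the current loop   */
#define CPU_MAX  128    /* stand-in: total CPUs the bitmask can describe */

static int isset(const uint64_t mask[2], int i)
{
    return (int) ((mask[i / 64] >> (i % 64)) & 1);
}

int main(void)
{
    uint64_t mask[2] = { 0, 1 };   /* only cpu 64 is set */
    int found_num_bits = 0, found_cpu_max = 0;

    for (int i = 0; i < NUM_BITS; ++i) found_num_bits += isset(mask, i);
    for (int i = 0; i < CPU_MAX; ++i)  found_cpu_max  += isset(mask, i);

    printf("bits found looping to NUM_BITS: %d\n", found_num_bits); /* 0 */
    printf("bits found looping to CPU_MAX:  %d\n", found_cpu_max);  /* 1 */
    return 0;
}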

Regards,
Nadia
> 
> 
> On Aug 29, 2011, at 7:16 AM, Ralph Castain wrote:
> 
> > Actually, if you look closely at the definition of those two 
> values, you'll see that it really doesn't matter which one we loop 
> over. The NUM_BITS value defines the actual total number of bits in 
> the mask. The CPU_MAX is the total number of cpus we can support, 
> which was set to a value such that the two are equal (i.e., it's a 
> power of two that happens to be an integer multiple of 64).
> > 
> > I believe the original intent was to allow CPU_MAX to be 
> independent of address-alignment questions, so NUM_BITS could 
> technically be greater than CPU_MAX. Even if this happens, though, 
> all that would do is cause the loop to run across more bits than 
required.
> > 
> > So it doesn't introduce a limitation at all. In hindsight, we 
> could simplify things by eliminating one of those values and just 
> putting a requirement on the number that it be a multiple of 64 so 
> it aligns with a memory address.
> > 
> > 
> > On Aug 29, 2011, at 7:05 AM, Kenneth Lloyd wrote:
> > 
> >> Nadia,
> >> 
> >> Interesting. I haven't tried pushing this to levels above 8 on a 
particular
> >> machine. Do you think that the cpuset / paffinity / hwloc only 
applies at
> >> the machine level, at which time you need to employ a graph with 
carto?
> >> 
> >> Regards,
> >> 
> >> Ken
> >> 
> >> -Original Message-
> >> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] 
On
> >> Behalf Of nadia.derbey
> >> Sent: Monday, August 29, 2011 5:45 AM
> >> To: Open MPI Developers
> >> Subject: [OMPI devel] known limitation or bug in hwloc?
> >> 
> >> Hi list,
> >> 
> >> I'm hitting a limitation with paffinity/hwloc with cpu numbers >= 64.
> >> 
> >> In opal/mca/paffinity/hwloc/paffinity_hwloc_module.c, module_set() is
> >> the routine that sets the calling process affinity to the mask given 
as
> >> parameter. Note that "mask" is a opal_paffinity_base_cpu_set_t (so we
> >> allow the cpus to be potentially numbered up to
> >> OPAL_PAFFINITY_BITMASK_CPU_MAX - 1).
> >> 
> >> The problem with module_set() is that it loops over
> >> OPAL_PAFFINITY_BITMASK_T_NUM_BITS bits to check if these bits are set 
in
> >> the mask:
> >> 
> >> for (i = 0; ((unsigned int) i) < OPAL_PAFFINITY_BITMASK_T_NUM_BITS; 
++i)
> >> {
> >>   if (OPAL_PAFFINITY_CPU_ISSET(i, mask)) {
> >>   hwloc_bitmap_set(set, i);
> >>   }
> >>   }
> >> 
> >> Given "mask"'s type, I think module_set() should instead loop over
> >> OPAL_PAFFINITY_BITMASK_CPU_MAX bits.
> >> 
> >> Note that module_set() uses a type for its internal mask that is
> >> coherent with OPAL_PAFFINITY_BITMASK_T_NUM_BITS (hwloc_bitmap_t).
> >> 
> >> So I'm wondering whether this is a known limitation I've never heard 
of
> >> or an actual bug?
> >> 
> >> Regards,
> >> Nadia
> >> 
> >> 
> >> ___
> >> devel mailing list
> >> de...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> -
> >> No virus found in this message.
> >> Checked by AVG - www.avg.com
> >> Version: 10.0.1392 / Virus Database: 1520/3864 - Release Date: 
08/28/11
> >> 
> >> ___
> >> devel mailing list
> >> de...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > 
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


Re: [OMPI devel] Fix a hang in carto_base_select() if carto_module_init() fails

2011-07-08 Thread nadia . derbey
Yes, sure! Agreed.

Regards,
--
Nadia Derbey
Phone: +33 (0)4 76 29 77 62



devel-boun...@open-mpi.org wrote on 07/08/2011 02:10:22 AM:

> From: Jeff Squyres <jsquy...@cisco.com>
> To: Open MPI Developers <de...@open-mpi.org>
> Date: 07/08/2011 02:10 AM
> Subject: Re: [OMPI devel] Fix a hang in carto_base_select() if 
> carto_module_init() fails
> Sent by: devel-boun...@open-mpi.org
> 
> I'd go even slightly simpler than that:
> 
> Index: opal/mca/carto/base/carto_base_select.c
> ===
> --- opal/mca/carto/base/carto_base_select.c   (revision 24842)
> +++ opal/mca/carto/base/carto_base_select.c   (working copy)
> @@ -64,10 +64,7 @@
>  cleanup:
>  /* Initialize the winner */
>  if (NULL != opal_carto_base_module) {
> -if (OPAL_SUCCESS != (ret = 
> opal_carto_base_module->carto_module_init()) ) {
> -exit_status = ret;
> -goto cleanup;
> -}
> +exit_status = opal_carto_base_module->carto_module_init();
>  }
> 
>  return exit_status;
> 
> 
> 
> On Jun 28, 2011, at 3:02 AM, nadia.derbey wrote:
> 
> > Hi,
> > 
> > When using the carto/file module with a syntactically incorrect carto
> > file, we get stuck into opal_carto_base_select().
> > 
> > The attached trivial patch fixes the issue.
> > 
> > Regards,
> > Nadia
> > 
> > 
> > -- 
> > nadia.derbey <nadia.der...@bull.net>
> > 
___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


Re: [OMPI devel] problem when binding to socket on a single socket node

2010-04-13 Thread Nadia Derbey
On Tue, 2010-04-13 at 01:27 -0600, Ralph Castain wrote:
> On Apr 13, 2010, at 1:02 AM, Nadia Derbey wrote:
> 
> > On Mon, 2010-04-12 at 10:07 -0600, Ralph Castain wrote:
> >> By definition, if you bind to all available cpus in the OS, you are
> >> bound to nothing (i.e., "unbound") as your process runs on any
> >> available cpu.
> >> 
> >> 
> >> PLPA doesn't care, and I personally don't care. I was just explaining
> >> why it generates an error in the odls.
> >> 
> >> 
> >> A user app would detect its binding by (a) getting the affiinity mask
> >> from the OS, and then (b) seeing if the bits are set to '1' for all
> >> available processors. If it is, then you are not bound - there is no
> >> mechanism available for checking "are the bits set only for the
> >> processors I asked to be bound to". The OS doesn't track what you
> >> asked for, it only tracks where you are bound - and a mask with all
> >> '1's is defined as "unbound".
> >> 
> >> 
> >> So the reason for my question was simple: a user asked us to "bind"
> >> their process. If their process checks to see if it is bound, it will
> >> return "no". The user would therefore be led to believe that OMPI had
> >> failed to execute their request, when in fact we did execute it - but
> >> the result was (as Nadia says) a "no-op".
> >> 
> >> 
> >> After talking with Jeff, I think he has the right answer. It is a
> >> method we have used elsewhere, so it isn't unexpected behavior.
> >> Basically, he proposed that we use an mca param to control this
> >> behavior:
> >> 
> >> 
> >> * default: generate an error message as the "bind" results in a no-op,
> >> and this is our current behavior
> >> 
> >> 
> >> * warn: generate a warning that the binding wound up being a "no-op",
> >> but continue working
> >> 
> >> 
> >> * quiet: just ignore it and keep going
> > 
> > Excellent, I completely agree (though I would have put the 2nd star as
> > the default behavior, but never mind, I don't want to restart the
> > discussion ;-) )
> 
> I actually went back/forth on that as well - I personally think it might be 
> better to just have warn and quiet, with warn being the default. The warning 
> could be generated with orte_show_help so the messages would be consolidated 
> across nodes. Given that the enhanced paffinity behavior is fairly new, and 
> that no-one has previously raised this issue, I don't think the prior 
> behavior is relevant.
> 
> Would that make sense? If so, we could extend that to the other binding 
> options for consistency.

Sure!

Patch proposal attached.

Regards,
Nadia
> 
> > 
> > Also this is a good opportunity to fix the other issue I talked about in
> > the first message in this thread: the tag
> > "odls-default:could-not-bind-to-socket" does not exist in
> > orte/mca/odls/default/help-odls-default.txt
> 
> I'll take that one - my fault for missing it. I'll cross-check the other 
> messages as well. Thanks for catching it!
> 
> As for your other change: let me think on it. I -think- I understand your 
> logic, but honestly haven't had time to really walk through it properly. Got 
> an ORCM deadline to meet, but hope to break free towards the end of this week.
> 
> 
> > 
> > Regards,
> > Nadia
> >> 
> >> 
> >> Fairly trivial to implement, and Bull could set the default mca param
> >> file to "quiet" to get what they want. I'm not sure if that's what the
> >> community wants or not - like I said, it makes no diff to me so long
> >> as the code logic is understandable.
> >> 
> >> 
> >> 
> >> On Apr 12, 2010, at 8:27 AM, Terry Dontje wrote:
> >> 
> >>> Ralph, I guess I am curious why is it that if there is only one
> >>> socket we cannot bind to it?  Does plpa actually error on this or is
> >>> this a condition we decided was an error at odls?
> >>> 
> >>> I am somewhat torn on whether this makes sense.  On the one hand it
> >>> is definitely useless as to the result if you allow it.  However if
> >>> you don't allow it and you have a script or running tests on
> >>> multiple systems it would be nice to have this run because you are
> >>> not really running into a resource starvation issue.
> >>> 
> >>> At a minimum I thin

Re: [OMPI devel] problem when binding to socket on a single socket node

2010-04-13 Thread Nadia Derbey
ia an 
> > > external bind. People actually use that as a means of suballocating 
> > > nodes, so the test needs to be there. Again, if the user said "bind to 
> > > socket", but none of that socket's cores are assigned for our use, that 
> > > is an error.
> > > 
> > > I haven't looked at your specific fix, but I agree with Terry's question. 
> > > It seems to me that whether or not we were externally bound is 
> > > irrelevant. Even if the overall result is what you want, I think a more 
> > > logically understandable test would help others reading the code.
> > > 
> > > But first we need to resolve the question: should this scenario return an 
> > > error or not?
> > > 
> > > 
> > > On Apr 12, 2010, at 1:43 AM, Nadia Derbey wrote:
> > > 
> > >   
> > > > On Fri, 2010-04-09 at 14:23 -0400, Terry Dontje wrote:
> > > > 
> > > > > Ralph Castain wrote: 
> > > > >   
> > > > > > Okay, just wanted to ensure everyone was working from the same base
> > > > > > code. 
> > > > > > 
> > > > > > 
> > > > > > Terry, Brad: you might want to look this proposed change over.
> > > > > > Something doesn't quite look right to me, but I haven't really
> > > > > > walked through the code to check it.
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > At first blush I don't really get the usage of orte_odls_globals.bound
> > > > > in your patch.  It would seem to me that the insertion of that
> > > > > conditional would prevent the check it surrounds being done when the
> > > > > process has not been bound prior to startup, which is a common case.
> > > > >   
> > > > Well, if you have a look at the algo in the ORTE_BIND_TO_SOCKET path
> > > > (odls_default_fork_local_proc() in odls_default_module.c):
> > > > 
> > > > 
> > > >(line 715)
> > > >  {
> > > >
> > > >
> > > > {
> > > >
> > > >continue
> > > >}
> > > >
> > > > }
> > > > 
> > > > ...
> > > > 
> > > > 
> > > > What I'm saying is that the only way to have nothing set in the affinity
> > > > mask (which would justify the last test) is to have never called the
> > > >  instruction. This means:
> > > >  . the test on orte_odls_globals.bound is true
> > > >  . call  for all the cores in the socket.
> > > > 
> > > > In the other path, what we are doing is checking if we have set one or
> > > > more bits in a mask after having actually set them: don't you think it's
> > > > useless?
> > > > 
> > > > That's why I'm suggesting to call the last check only if
> > > > orte_odls_globals.bound is true.
> > > > 
> > > > Regards,
> > > > Nadia
> > > > 
> > > > > --td
> > > > > 
> > > > > 
> > > > >   
> > > > > > On Apr 9, 2010, at 9:33 AM, Terry Dontje wrote:
> > > > > > 
> > > > > > 
> > > > > > > Nadia Derbey wrote: 
> > > > > > >   
> > > > > > > > On Fri, 2010-04-09 at 08:41 -0600, Ralph Castain wrote:
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > Just to check: is this with the latest trunk? Brad and Terry 
> > > > > > > > > have been making changes to this section of code, including 
> > > > > > > > > modifying the PROCESS_IS_BOUND test...
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > >   
> > > > > > > > Well, it was on the v1.5. But I just checked: looks like
> > > > > > > >  1. the call to OPAL_PAFFINITY_PROCESS_IS_BOUND is still there 
> > > > > > > > in
> > > > > > > > odls_default_fork_local_proc()
> > > > > > > >  2. OPAL_PAFFINITY_PROCESS_IS_BOUND() is defined the same way
> > > > > > > >

Re: [OMPI devel] problem when binding to socket on a single socket node

2010-04-12 Thread Nadia Derbey
On Mon, 2010-04-12 at 07:50 -0600, Ralph Castain wrote:
> Guess I'll jump in here as I finally had a few minutes to look at the code 
> and think about your original note. In fact, I believe your original 
> statement is the source of contention.
> 
> If someone tells us -bind-to-socket, but there is only one socket, then we 
> really cannot bind them to anything. Any check by their code would reveal 
> that they had not, in fact, been bound - raising questions as to whether or 
> not OMPI is performing the request. Our operating standard has been to error 
> out if the user specifies something we cannot do to avoid that kind of 
> confusion. This is what generated the code in the system today.
> 
> Now I can see an argument that -bind-to-socket with one socket maybe 
> shouldn't generate an error, but that decision then has to get reflected in 
> other code areas as well.

Actually, that was my original point: -bind-to-socket on a single-socket
node should IMHO lead to a noop, not an error.

> 
> As for the test you cite -  it actually performs a valuable function and was 
> added to catch specific scenarios. In particular, if you follow the code flow 
> up just a little, you will see that it is possible to complete the loop 
> without ever actually setting a bit in the mask. This happens when none of 
> the cpus in that socket have been assigned to us via an external bind.

This is exactly what I said, but maybe I didn't express it right:
  . the test on orte_odls_globals.bound is true   
---> means we have an external bind
  . call  for all the cores in the socket.
---> none of the cpus on the socket belongs to our binding

Regards,

>  People actually use that as a means of suballocating nodes, so the test 
> needs to be there. Again, if the user said "bind to socket", but none of that 
> socket's cores are assigned for our use, that is an error.
> 
> I haven't looked at your specific fix, but I agree with Terry's question. It 
> seems to me that whether or not we were externally bound is irrelevant. Even 
> if the overall result is what you want, I think a more logically 
> understandable test would help others reading the code.
> 
> But first we need to resolve the question: should this scenario return an 
> error or not?
> 
> 
> On Apr 12, 2010, at 1:43 AM, Nadia Derbey wrote:
> 
> > On Fri, 2010-04-09 at 14:23 -0400, Terry Dontje wrote:
> >> Ralph Castain wrote: 
> >>> Okay, just wanted to ensure everyone was working from the same base
> >>> code. 
> >>> 
> >>> 
> >>> Terry, Brad: you might want to look this proposed change over.
> >>> Something doesn't quite look right to me, but I haven't really
> >>> walked through the code to check it.
> >>> 
> >>> 
> >> At first blush I don't really get the usage of orte_odls_globals.bound
> >> in your patch.  It would seem to me that the insertion of that
> >> conditional would prevent the check it surrounds being done when the
> >> process has not been bound prior to startup, which is a common case.
> > 
> > Well, if you have a look at the algo in the ORTE_BIND_TO_SOCKET path
> > (odls_default_fork_local_proc() in odls_default_module.c):
> > 
> > 
> >(line 715)
> >  {
> >
> >
> > {
> >
> >continue
> >}
> >
> > }
> > 
> > ...
> > 
> > 
> > What I'm saying is that the only way to have nothing set in the affinity
> > mask (which would justify the last test) is to have never called the
> >  instruction. This means:
> >  . the test on orte_odls_globals.bound is true
> >  . call  for all the cores in the socket.
> > 
> > In the other path, what we are doing is checking if we have set one or
> > more bits in a mask after having actually set them: don't you think it's
> > useless?
> > 
> > That's why I'm suggesting to call the last check only if
> > orte_odls_globals.bound is true.
> > 
> > Regards,
> > Nadia
> >> 
> >> --td
> >> 
> >> 
> >>> 
> >>> On Apr 9, 2010, at 9:33 AM, Terry Dontje wrote:
> >>> 
> >>>> Nadia Derbey wrote: 
> >>>>> On Fri, 2010-04-09 at 08:41 -0600, Ralph Castain wrote:
> >>>>> 
> >>>>>> Just to check: is this with the latest trunk? Brad and Terry have been 
> >>>>>> making changes to this section of code, including modifying the 
> >>>>>> PROCESS_IS_BOUND test...
> >>>>>> 
> >>>

Re: [OMPI devel] problem when binding to socket on a single socket node

2010-04-12 Thread Nadia Derbey
On Fri, 2010-04-09 at 14:23 -0400, Terry Dontje wrote:
> Ralph Castain wrote: 
> > Okay, just wanted to ensure everyone was working from the same base
> > code. 
> > 
> > 
> > Terry, Brad: you might want to look this proposed change over.
> > Something doesn't quite look right to me, but I haven't really
> > walked through the code to check it.
> > 
> > 
> At first blush I don't really get the usage of orte_odls_globals.bound
> in your patch.  It would seem to me that the insertion of that
> conditional would prevent the check it surrounds being done when the
> process has not been bound prior to startup, which is a common case.

Well, if you have a look at the algo in the ORTE_BIND_TO_SOCKET path
(odls_default_fork_local_proc() in odls_default_module.c):


   (line 715)
 {


 {

continue
}

}

...


What I'm saying is that the only way to have nothing set in the affinity
mask (which would justify the last test) is to have never called the
 instruction. This means:
  . the test on orte_odls_globals.bound is true
  . call  for all the cores in the socket.

In the other path, what we are doing is checking if we have set one or
more bits in a mask after having actually set them: don't you think it's
useless?

That's why I'm suggesting to call the last check only if
orte_odls_globals.bound is true.

Regards,
Nadia
> 
> --td
> 
> 
> > 
> > On Apr 9, 2010, at 9:33 AM, Terry Dontje wrote:
> > 
> > > Nadia Derbey wrote: 
> > > > On Fri, 2010-04-09 at 08:41 -0600, Ralph Castain wrote:
> > > >   
> > > > > Just to check: is this with the latest trunk? Brad and Terry have 
> > > > > been making changes to this section of code, including modifying the 
> > > > > PROCESS_IS_BOUND test...
> > > > > 
> > > > > 
> > > > > 
> > > > 
> > > > Well, it was on the v1.5. But I just checked: looks like
> > > >   1. the call to OPAL_PAFFINITY_PROCESS_IS_BOUND is still there in
> > > >  odls_default_fork_local_proc()
> > > >   2. OPAL_PAFFINITY_PROCESS_IS_BOUND() is defined the same way
> > > > 
> > > > But, I'll give it a try with the latest trunk.
> > > > 
> > > > Regards,
> > > > Nadia
> > > > 
> > > >   
> > > The changes, I've done do not touch
> > > OPAL_PAFFINITY_PROCESS_IS_BOUND at all.  Also, I am only touching
> > > code related to the "bind-to-core" option so I really doubt if my
> > > changes are causing issues here.
> > > 
> > > --td
> > > > > On Apr 9, 2010, at 3:39 AM, Nadia Derbey wrote:
> > > > > 
> > > > > 
> > > > > > Hi,
> > > > > > 
> > > > > > I am facing a problem with a test that runs fine on some nodes, and
> > > > > > fails on others.
> > > > > > 
> > > > > > I have a heterogeneous cluster, with 3 types of nodes:
> > > > > > 1) Single socket, 4 cores
> > > > > > 2) 2 sockets, 4 cores per socket
> > > > > > 3) 2 sockets, 6 cores/socket
> > > > > > 
> > > > > > I am using:
> > > > > > . salloc to allocate the nodes,
> > > > > > . mpirun binding/mapping options "-bind-to-socket -bysocket"
> > > > > > 
> > > > > > # salloc -N 1 mpirun -n 4 -bind-to-socket -bysocket sleep 900
> > > > > > 
> > > > > > This command fails if the allocated node is of type #1 (single 
> > > > > > socket/4
> > > > > > cpus).
> > > > > > BTW, in that case orte_show_help is referencing a tag
> > > > > > ("could-not-bind-to-socket") that does not exist in
> > > > > > help-odls-default.txt.
> > > > > > 
> > > > > > While it succeeds when run on nodes of type #2 or 3.
> > > > > > I think a "bind to socket" should not return an error on a single 
> > > > > > socket
> > > > > > machine, but rather be a noop.
> > > > > > 
> > > > > > The problem comes from the test
> > > > > > OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
> > > > > > called in odls_default_fork_local_proc() after the binding to the
> > > > > > processors socket has been done:
> > > > > > 
> > > > > >   

Re: [OMPI devel] problem when binding to socket on a single socket node

2010-04-09 Thread Nadia Derbey
On Fri, 2010-04-09 at 08:41 -0600, Ralph Castain wrote:
> Just to check: is this with the latest trunk? Brad and Terry have been making 
> changes to this section of code, including modifying the PROCESS_IS_BOUND 
> test...
> 
> 

Well, it was on the v1.5. But I just checked: looks like
  1. the call to OPAL_PAFFINITY_PROCESS_IS_BOUND is still there in
 odls_default_fork_local_proc()
  2. OPAL_PAFFINITY_PROCESS_IS_BOUND() is defined the same way

But, I'll give it a try with the latest trunk.

Regards,
Nadia

> On Apr 9, 2010, at 3:39 AM, Nadia Derbey wrote:
> 
> > Hi,
> > 
> > I am facing a problem with a test that runs fine on some nodes, and
> > fails on others.
> > 
> > I have a heterogeneous cluster, with 3 types of nodes:
> > 1) Single socket, 4 cores
> > 2) 2 sockets, 4 cores per socket
> > 3) 2 sockets, 6 cores/socket
> > 
> > I am using:
> > . salloc to allocate the nodes,
> > . mpirun binding/mapping options "-bind-to-socket -bysocket"
> > 
> > # salloc -N 1 mpirun -n 4 -bind-to-socket -bysocket sleep 900
> > 
> > This command fails if the allocated node is of type #1 (single socket/4
> > cpus).
> > BTW, in that case orte_show_help is referencing a tag
> > ("could-not-bind-to-socket") that does not exist in
> > help-odls-default.txt.
> > 
> > While it succeeds when run on nodes of type #2 or 3.
> > I think a "bind to socket" should not return an error on a single socket
> > machine, but rather be a noop.
> > 
> > The problem comes from the test
> > OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
> > called in odls_default_fork_local_proc() after the binding to the
> > processors socket has been done:
> > 
> >
> >OPAL_PAFFINITY_CPU_ZERO(mask);
> >for (n=0; n < orte_default_num_cores_per_socket; n++) {
> >
> >OPAL_PAFFINITY_CPU_SET(phys_cpu, mask);
> >}
> >/* if we did not bind it anywhere, then that is an error */
> >OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
> >if (!bound) {
> >orte_show_help("help-odls-default.txt",
> >   "odls-default:could-not-bind-to-socket", true);
> >ORTE_ODLS_ERROR_OUT(ORTE_ERR_FATAL);
> >}
> > 
> > OPAL_PAFFINITY_PROCESS_IS_BOUND() will return true if there are bits set in
> > the mask *AND* the number of bits set is less than the number of cpus
> > on the machine. Thus on a single-socket, 4-core machine the test will
> > fail, while on the other kinds of machines it will succeed.
> > 
> > Again, I think the problem could be solved by changing the algorithm,
> > and assuming that ORTE_BIND_TO_SOCKET, on a single socket machine =
> > noop.
> > 
> > Another solution could be to call the test
> > OPAL_PAFFINITY_PROCESS_IS_BOUND() at the end of the loop only if we are
> > bound (orte_odls_globals.bound). Actually that is the only case where I
> > see a justification to this test (see attached patch).
> > 
> > And maybe both solutions could be mixed.
> > 
> > Regards,
> > Nadia
> > 
> > 
> > -- 
> > Nadia Derbey <nadia.der...@bull.net>
> > <001_fix_process_binding_test.patch>___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
-- 
Nadia Derbey <nadia.der...@bull.net>



[OMPI devel] problem when binding to socket on a single socket node

2010-04-09 Thread Nadia Derbey
Hi,

I am facing a problem with a test that runs fine on some nodes, and
fails on others.

I have a heterogeneous cluster, with 3 types of nodes:
1) Single socket, 4 cores
2) 2 sockets, 4 cores per socket
3) 2 sockets, 6 cores/socket

I am using:
 . salloc to allocate the nodes,
 . mpirun binding/mapping options "-bind-to-socket -bysocket"

# salloc -N 1 mpirun -n 4 -bind-to-socket -bysocket sleep 900

This command fails if the allocated node is of type #1 (single socket/4
cpus).
BTW, in that case orte_show_help is referencing a tag
("could-not-bind-to-socket") that does not exist in
help-odls-default.txt.

While it succeeds when run on nodes of type #2 or 3.
I think a "bind to socket" should not return an error on a single socket
machine, but rather be a noop.

The problem comes from the test
OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
called in odls_default_fork_local_proc() after the binding to the
processors socket has been done:


OPAL_PAFFINITY_CPU_ZERO(mask);
for (n=0; n < orte_default_num_cores_per_socket; n++) {

OPAL_PAFFINITY_CPU_SET(phys_cpu, mask);
}
/* if we did not bind it anywhere, then that is an error */
OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
if (!bound) {
orte_show_help("help-odls-default.txt",
   "odls-default:could-not-bind-to-socket", true);
ORTE_ODLS_ERROR_OUT(ORTE_ERR_FATAL);
}

OPAL_PAFFINITY_PROCESS_IS_BOUND() will return true if there are bits set in
the mask *AND* the number of bits set is less than the number of cpus
on the machine. Thus on a single-socket, 4-core machine the test will
fail, while on the other kinds of machines it will succeed.
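
(Illustrative sketch only - not the OPAL macro itself - of the predicate
described above: "bound" is reported only when some, but not all, of the
node's CPUs are selected, which is why a single-socket node fails the test.)

#include <stdbool.h>
#include <stdio.h>

/* bound only when some, but not all, CPUs are selected */
static bool is_bound(int bits_set, int num_cpus)
{
    return (bits_set > 0) && (bits_set < num_cpus);
}

int main(void)
{
    /* 1 socket x 4 cores: binding to the socket sets all 4 bits */
    printf("1x4: %s\n", is_bound(4, 4) ? "bound" : "not bound");
    /* 2 sockets x 4 cores: binding to one socket sets 4 of 8 bits */
    printf("2x4: %s\n", is_bound(4, 8) ? "bound" : "not bound");
    return 0;
}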

Again, I think the problem could be solved by changing the algorithm,
and assuming that ORTE_BIND_TO_SOCKET, on a single socket machine =
noop.

Another solution could be to call the test
OPAL_PAFFINITY_PROCESS_IS_BOUND() at the end of the loop only if we are
bound (orte_odls_globals.bound). Actually that is the only case where I
see a justification to this test (see attached patch).

And maybe both solutions could be mixed.

Regards,
Nadia


-- 
Nadia Derbey <nadia.der...@bull.net>
Do not test actual process binding in obvious cases

diff -r 0b851b2e7934 orte/mca/odls/default/odls_default_module.c
--- a/orte/mca/odls/default/odls_default_module.c	Thu Mar 18 16:10:25 2010 +0100
+++ b/orte/mca/odls/default/odls_default_module.c	Fri Apr 09 11:38:28 2010 +0200
@@ -747,12 +747,16 @@ static int odls_default_fork_local_proc(
  target_socket, phys_core, phys_cpu));
 OPAL_PAFFINITY_CPU_SET(phys_cpu, mask);
 }
-/* if we did not bind it anywhere, then that is an error */
-OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
-if (!bound) {
-orte_show_help("help-odls-default.txt",
-   "odls-default:could-not-bind-to-socket", true);
-ORTE_ODLS_ERROR_OUT(ORTE_ERR_FATAL);
+/* if we actually did not bind it anywhere and it was
+ * originally bound then that is an error
+ */
+if (orte_odls_globals.bound) {
+OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
+if (!bound) {
+orte_show_help("help-odls-default.txt",
+   "odls-default:could-not-bind-to-socket", true);
+ORTE_ODLS_ERROR_OUT(ORTE_ERR_FATAL);
+}
 }
 if (orte_report_bindings) {
 opal_output(0, "%s odls:default:fork binding child %s to socket %d cpus %04lx",


Re: [OMPI devel] RFC 1/1: improvements to the "notifier" framework and ORTE WDC

2010-03-30 Thread Nadia Derbey
On Mon, 2010-03-29 at 09:37 -0600, Ralph Castain wrote:
> Hi Abhishek
> 
> 
> I'm confused by the WDC wiki page, specifically the part about the new
> ORTE_NOTIFIER_DEFINE_EVENT macro. Are you saying that I (as the
> developer) have to provide this macro with a unique notifier id?

Hi Ralph,

Actually ORTE_NOTIFIER_DEFINE_EVENT(, ) expands to a static
inline routine notifier_log_event_(). So I would say there is a one
to one relationship between an event id and a log_event routine. So
there is no need to do a lookup inside an array or a list.
So yes the event identifier needs to be unique, but only inside a single
source file: you can perfectly call ORTE_NOTIFIER_DEFINE_EVENT(0,
) in a .c file and ORTE_NOTIFIER_DEFINE_EVENT(0, ) in
another one.

Now, we could centralize the event ids in a .h file in the notifier
framework, but the purpose here would only be to have something
"cleaner".


>  So that would mean that ORTE/OMPI would have to maintain a global
> notifier id counter to ensure it is unique?

From what I said before, we don't need this.

Regards,
Nadia
> 
> 
> If so, that seems really cumbersome. Could you please clarify?
> 
> 
> Thanks
> Ralph
> 
> On Mar 29, 2010, at 8:57 AM, Abhishek Kulkarni wrote:
> 
> > 
> > ==
> > [RFC 1/2]
> > ==
> > 
> > WHAT: Merge improvements to the "notifier" framework from the OPAL
> > SOS
> > and the ORTE WDC mercurial branches into the SVN trunk.
> > 
> > WHY: Some improvements and interface changes were put into the ORTE
> >notifier framework during the development of the OPAL SOS[1] and
> >ORTE WDC[2] branches.
> > 
> > WHERE: Mostly restricted to ORTE notifier files and files using the
> >  notifier interface in OMPI.
> > 
> > TIMEOUT: The weekend of April 2-3.
> > 
> > REFERENCE MERCURIAL REPOS:
> > * SOS development: http://bitbucket.org/jsquyres/opal-sos-fixed/
> > * WDC development: http://bitbucket.org/derbeyn/orte-wdc-fixed/
> > 
> > ==
> > 
> > BACKGROUND:
> > 
> > The notifier interface and its components underwent a host of
> > improvements and changes during the development of the SOS[1] and
> > the
> > WDC[2] branches.  The ORTE WDC (Warning Data Capture) branch enables
> > accounting of events through the use of notifier interface, whereas
> > OPAL SOS uses the notifier interface by setting up callbacks to
> > relay
> > out logged events.
> > 
> > Some of the improvements include:
> > 
> > - added more severity levels.
> > - "ftb" notifier improvements.
> > - "command" notifier improvements.
> > - added "file" notifier component
> > - changes in the notifier modules selection
> > - activate only a subset of the callbacks
> > (i.e. any combination of log, help, log_peer)
> > - define different output media for any given callback (e.g.
> > log_peer
> > can be redirected to the syslog and smtp, while the show_help can be
> > sent to the hnp).
> > - ORTE_NOTIFIER_LOG_EVENT() (that accounts and warns about unusual
> > events)
> > 
> > Much more information is available on these two wiki pages:
> > 
> > [1] http://svn.open-mpi.org/trac/ompi/wiki/ErrorMessages
> > [2] http://svn.open-mpi.org/trac/ompi/wiki/ORTEWDC
> > 
> > NOTE: This is first of a two-part RFC to bring the SOS and WDC
> > branches
> > to the trunk. This only brings in the "notifier" changes from the
> > SOS
> > branch, while the rest of the branch will be brought over after the
> > timeout of the second RFC.
> > 
> > ==
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > 
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
-- 
Nadia Derbey <nadia.der...@bull.net>



Re: [OMPI devel] typo in opal/event/evutil.h ?

2010-02-26 Thread Nadia Derbey
On Fri, 2010-02-26 at 06:41 -0700, Ralph Castain wrote:
> Hi Nadia
> 
> I thought I saw a correction go by recently that fixed this in the
> trunk? What revision are you at, and on which branch?

Hi Ralph,

1) hg branch:
default
But I'm getting the warning in the v1.5 branch too.

2)
changeset:   17631:177d287dee3c
user:jsquyres
date:Thu Feb 25 21:04:09 2010 +
summary: This has bugged me for a long, long time: rename
btl_openib_iwarp.*
 ->

3) configure options:
--with-platform=../contrib/platform/optimized --enable-picky

4) Last update on this file is for me:

changeset:   17413:32687831ca9e
user:brbarret
date:Thu Feb 04 05:38:30 2010 +
summary: Update libevent to 1.4.13

But maybe something got messed up here in our repo, will check.

Regards,
Nadia

> 
> On Fri, Feb 26, 2010 at 3:48 AM, Nadia Derbey <nadia.der...@bull.net>
> wrote:
> Hi,
> 
> I'm getting this warning during the make if configured with
> --enable-picky:
> ../../../../opal/event/evutil.h:62:7: warning:
> "_EVENT_SIZEOF_LONG_LONG"
> is not defined
> Looks like changeset #32687831ca9e has introduced a typo?
> I'm wondering whether _EVENT_SIZEOF_LONG_LONG shouldn't be
> changed to
> SIZEOF_LONG_LONG?
> 
> 
> --- a/opal/event/evutil.h   Thu Feb 25 21:04:09 2010 +
> +++ b/opal/event/evutil.h   Fri Feb 26 10:29:31 2010 +0100
> @@ -59,7 +59,7 @@ extern "C" {
>  #elif defined(WIN32)
>  #define ev_uint64_t unsigned __int64
>  #define ev_int64_t signed __int64
> -#elif _EVENT_SIZEOF_LONG_LONG == 8
> +#elif SIZEOF_LONG_LONG == 8
>  #define ev_uint64_t unsigned long long
>  #define ev_int64_t long long
>      #elif SIZEOF_LONG == 8
> 
> 
> Regards,
> Nadia
> --
> Nadia Derbey <nadia.der...@bull.net>
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
-- 
Nadia Derbey <nadia.der...@bull.net>



[OMPI devel] typo in opal/event/evutil.h ?

2010-02-26 Thread Nadia Derbey
Hi,

I'm getting this warning during the make if configured with
--enable-picky:
../../../../opal/event/evutil.h:62:7: warning: "_EVENT_SIZEOF_LONG_LONG"
is not defined
Looks like changeset #32687831ca9e has introduced a typo?
I'm wondering whether _EVENT_SIZEOF_LONG_LONG shouldn't be changed to
SIZEOF_LONG_LONG?


--- a/opal/event/evutil.h   Thu Feb 25 21:04:09 2010 +
+++ b/opal/event/evutil.h   Fri Feb 26 10:29:31 2010 +0100
@@ -59,7 +59,7 @@ extern "C" {
 #elif defined(WIN32)
 #define ev_uint64_t unsigned __int64
 #define ev_int64_t signed __int64
-#elif _EVENT_SIZEOF_LONG_LONG == 8
+#elif SIZEOF_LONG_LONG == 8
 #define ev_uint64_t unsigned long long
 #define ev_int64_t long long
 #elif SIZEOF_LONG == 8
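
(For illustration - not from the OMPI tree: a tiny example of the effect.
The undefined macro in the #elif silently evaluates to 0, and a -Wundef-style
warning, as produced by picky builds, is the message quoted above.)

#if defined(SOME_CONFIG_MACRO)       /* SOME_CONFIG_MACRO is hypothetical */
typedef long long my_int64;
#elif _EVENT_SIZEOF_LONG_LONG == 8   /* warning: "_EVENT_SIZEOF_LONG_LONG" is not defined */
typedef long long my_int64;
#else
typedef long my_int64;
#endif

int main(void) { my_int64 x = 0; return (int) x; }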


Regards,
Nadia
-- 
Nadia Derbey <nadia.der...@bull.net>



Re: [OMPI devel] PATCH: remove trailing colon at the end of the generated LD_LIBRARY_PATH

2010-02-18 Thread Nadia Derbey
On Wed, 2010-02-17 at 17:14 -0500, Jeff Squyres wrote:
> Looks good to me!
> 
> Please commit and file CMRs for v1.4 and v1.5 (assuming this patch applies 
> cleanly to both branches).

Not sure I have the rights to do these things?

Regards,
Nadia
> 
> 
> On Feb 16, 2010, at 6:46 AM, Nadia Derbey wrote:
> 
> > Hi,
> > 
> > The mpivars.sh generated in openmpi.spec might in some cases lead to a
> > LD_LIBRARY_PATH that contains a trailing ":". This happens if
> > LD_LIBRARY_PATH is originally unset.
> > This means that the current directory is included in the search path for the
> > loader, which might not be the desired result.
> > 
> > The following patch proposal fixes this potential issue by adding the
> > ":" only if LD_LIBRARY_PATH is already set.
> > 
> > Regards,
> > Nadia
> > 
> > 
> > diff -r 6609b6ba7637 contrib/dist/linux/openmpi.spec
> > --- a/contrib/dist/linux/openmpi.spec   Mon Feb 15 22:14:59 2010 +
> > +++ b/contrib/dist/linux/openmpi.spec   Tue Feb 16 12:44:41 2010 +0100
> > @@ -505,7 +505,7 @@ fi
> > 
> >  # LD_LIBRARY_PATH
> >  if test -z "\`echo \$LD_LIBRARY_PATH | grep %{_libdir}\`"; then
> > -LD_LIBRARY_PATH=%{_libdir}:\${LD_LIBRARY_PATH}
> > +LD_LIBRARY_PATH=%{_libdir}\${LD_LIBRARY_PATH:+:}\${LD_LIBRARY_PATH}
> >  export LD_LIBRARY_PATH
> >  fi
> > 
> > 
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > 
> 
> 
-- 
Nadia Derbey <nadia.der...@bull.net>



[OMPI devel] PATCH: remove trailing colon at the end of the generated LD_LIBRARY_PATH

2010-02-16 Thread Nadia Derbey
Hi,

The mpivars.sh generated in openmpi.spec might in some cases lead to a
LD_LIBRARY_PATH that contains a trailing ":". This happens if
LD_LIBRARY_PATH is originally unset.
This means that the current directory is included in the search path for the
loader, which might not be the desired result.

The following patch proposal fixes this potential issue by adding the
":" only if LD_LIBRARY_PATH is already set.

Regards,
Nadia


diff -r 6609b6ba7637 contrib/dist/linux/openmpi.spec
--- a/contrib/dist/linux/openmpi.spec   Mon Feb 15 22:14:59 2010 +
+++ b/contrib/dist/linux/openmpi.spec   Tue Feb 16 12:44:41 2010 +0100
@@ -505,7 +505,7 @@ fi

 # LD_LIBRARY_PATH
 if test -z "\`echo \$LD_LIBRARY_PATH | grep %{_libdir}\`"; then
-LD_LIBRARY_PATH=%{_libdir}:\${LD_LIBRARY_PATH}
+LD_LIBRARY_PATH=%{_libdir}\${LD_LIBRARY_PATH:+:}\${LD_LIBRARY_PATH}
 export LD_LIBRARY_PATH
 fi




Re: [OMPI devel] HOSTNAME environment variable

2010-01-22 Thread Nadia Derbey
On Fri, 2010-01-22 at 08:22 -0700, Ralph Castain wrote:
> A quick and easy way to answer my question of slurm vs ompi:
> 
> Just do "srun script-that-echos-hostname-and-gethostname". If you get the 
> right hostnames, then OMPI is to blame, not slurm.
> 

No, I'm not...
Will check the configuration.

Thanks a lot,
Nadia

> On Jan 22, 2010, at 8:07 AM, Ralph Castain wrote:
> 
> > Hi Nadia
> > 
> > That sounds like a bug in your SLURM config file - SLURM certainly doesn't 
> > propagate "hostname" by default as that would definitely mess things up for 
> > more than OMPI.
> > 
> > Are you sure that SLURM is propagating the environment (something I have 
> > never seen before)? Or is OMPI mistakenly picking it up and propagating it?
> > 
> > On Jan 22, 2010, at 7:25 AM, Nadia Derbey wrote:
> > 
> >> Hi,
> >> 
> >> I'm wondering whether the HOSTNAME environment variable shouldn't be
> >> handled as a "special case" when the orted daemons launch the remote
> >> jobs. This particularly applies to batch schedulers where the caller's
> >> environment is copied to the remote job: we are inheriting a $HOSTNAME
> >> which is the name of the host mpirun was called from:
> >> 
> >> I tried to run the following small test (see getenv.c in attachment - it
> >> essentially gets the hostname once through $HOSTNAME, and once through
> >> gethostname(2)):
> >> 
> >> 
> >> [derbeyn@pichu0 ~]$ hostname
> >> pichu0
> >> [derbeyn@pichu0 ~]$ salloc -N 2 -p pichu mpirun ./getenv
> >> salloc: Granted job allocation 358789
> >> Processor 0 of 2 on $HOSTNAME pichu0: Hello World
> >> Processor 0 of 2 on host pichu93: Hello World
> >> Processor 1 of 2 on $HOSTNAME pichu0: Hello World
> >> Processor 1 of 2 on host pichu94: Hello World
> >> salloc: Relinquishing job allocation 358789
> >> 
> >> 
> >> Shouldn't we be getting the same value when using getenv("HOSTNAME") and 
> >> gethostname()?
> >> Applying the following small patch, we actually do.
> >> 
> >> Regards,
> >> Nadia
> >> 
> >> --
> >> 
> >> Do not propagate the HOSTNAME environment variable on remote hosts
> >> 
> >> diff -r 4ab256be2a17 orte/orted/orted_main.c
> >> --- a/orte/orted/orted_main.c   Wed Jan 20 16:45:07 2010 +0100
> >> +++ b/orte/orted/orted_main.c   Fri Jan 22 14:54:02 2010 +0100
> >> @@ -299,12 +299,17 @@ int orte_daemon(int argc, char *argv[])
> >> */
> >>orte_launch_environ = opal_argv_copy(environ);
> >> 
> >> +/*
> >> + * Set HOSTNAME to the actual hostname in order to avoid propagating
> >> + * the caller's HOSTNAME.
> >> + */
> >> +gethostname(hostname, 100);
> >> +opal_setenv("HOSTNAME", hostname, true, _launch_environ);
> >> 
> >>/* if orte_daemon_debug is set, let someone know we are alive right
> >> * away just in case we have a problem along the way
> >> */
> >>if (orted_globals.debug) {
> >> -gethostname(hostname, 100);
> >>fprintf(stderr, "Daemon was launched on %s - beginning to initialize\n", hostname);
> >>}
> >> 
> >> ___
> >> devel mailing list
> >> de...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > 
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
-- 
Nadia Derbey <nadia.der...@bull.net>



Re: [OMPI devel] HOSTNAME environment variable

2010-01-22 Thread Nadia Derbey
On Fri, 2010-01-22 at 08:12 -0700, Ralph Castain wrote:
> For SLURM, there is a config file where you can specify what gets propagated. 
> It is clearly an error to include hostname as it messes many things up, not 
> just OMPI. Frankly, I've never seen someone do that on SLURM.
> 
I'm going to check that.

Thanks,
Nadia

> I believe in this case OMPI is likely incorrectly picking up the environment 
> and propagating it. We know this is incorrectly happening on Torque, and it 
> appears to also be happening on SLURM. This is a bug that I will be fixing on 
> Torque - and as soon as Nadia confirms, on SLURM as well.
> 
> I know that on Torque it was an innocent mistake where a line got added to 
> the launch code that shouldn't have...
> 
> On Jan 22, 2010, at 8:07 AM, N.M. Maclaren wrote:
> 
> > On Jan 22 2010, Nadia Derbey wrote:
> >> 
> >> I'm wondering whether the HOSTNAME environment variable shouldn't be
> >> handled as a "special case" when the orted daemons launch the remote
> >> jobs. This particularly applies to batch schedulers where the caller's
> >> environment is copied to the remote job: we are inheriting a $HOSTNAME
> >> which is the name of the host mpirun was called from:
> > 
> > This is slightly orthogonal, but relevant.
> > 
> > This is an ancient mess with propagating environment variables, and predates
> > MPI by many years.  The most traditional form was the demented connexion
> > protocols that propagated TERM - truly wonderful when logging in from SunOS
> > to HP-UX!  Whether it is worth kludging up one variable and leaving the rest
> > is unclear.
> > 
> > Even if systems are fairly homogeneous, it is common for the head node to
> > have a different set of standard values from the others.  TMPDIR is one
> > very common one, but any of the dozen of so path variables is likely to
> > vary, at least sometimes, as are many of the others.
> > 
> > I used to have to write the most DISGUSTING hacks to stop unwanted export
> > when I managed our supercomputer.  Yet there are other systems that will
> > work only if you DO export environment variables.  And there are systems
> > where the secondary nodes aren't real systems, and using the parent hostname
> > would be better, though I haven't managed any.
> > 
> > Realistically, there should really be some kind of hook to control which
> > are transferred and which are not.  I haven't found one - if there is, it's
> > a better way to tackle this.
> > 
> > Regards,
> > Nick Maclaren.
> > 
> > 
> > _______
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
-- 
Nadia Derbey <nadia.der...@bull.net>



[OMPI devel] HOSTNAME environment variable

2010-01-22 Thread Nadia Derbey
Hi,

I'm wondering whether the HOSTNAME environment variable shouldn't be
handled as a "special case" when the orted daemons launch the remote
jobs. This particularly applies to batch schedulers where the caller's
environment is copied to the remote job: we are inheriting a $HOSTNAME
which is the name of the host mpirun was called from:

I tried to run the following small test (see getenv.c in attachment - it
essentially gets the hostname once through $HOSTNAME, and once through
gethostname(2)):


[derbeyn@pichu0 ~]$ hostname
pichu0
[derbeyn@pichu0 ~]$ salloc -N 2 -p pichu mpirun ./getenv
salloc: Granted job allocation 358789
Processor 0 of 2 on $HOSTNAME pichu0: Hello World
Processor 0 of 2 on host pichu93: Hello World
Processor 1 of 2 on $HOSTNAME pichu0: Hello World
Processor 1 of 2 on host pichu94: Hello World
salloc: Relinquishing job allocation 358789


Shouldn't we be getting the same value when using getenv("HOSTNAME") and 
gethostname()?
Applying the following small patch, we actually do.

Regards,
Nadia

--

Do not propagate the HOSTNAME environment variable on remote hosts

diff -r 4ab256be2a17 orte/orted/orted_main.c
--- a/orte/orted/orted_main.c   Wed Jan 20 16:45:07 2010 +0100
+++ b/orte/orted/orted_main.c   Fri Jan 22 14:54:02 2010 +0100
@@ -299,12 +299,17 @@ int orte_daemon(int argc, char *argv[])
  */
 orte_launch_environ = opal_argv_copy(environ);

+/*
+ * Set HOSTNAME to the actual hostname in order to avoid propagating
+ * the caller's HOSTNAME.
+ */
+gethostname(hostname, 100);
+opal_setenv("HOSTNAME", hostname, true, _launch_environ);

 /* if orte_daemon_debug is set, let someone know we are alive right
  * away just in case we have a problem along the way
  */
 if (orted_globals.debug) {
-gethostname(hostname, 100);
 fprintf(stderr, "Daemon was launched on %s - beginning to initialize\n", hostname);
 }

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>


int main(int argc, char **argv)
{
char *env_hostname;
char hostname[255];
int myrank, size;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Comm_size(MPI_COMM_WORLD, &size);

env_hostname = getenv("HOSTNAME");
if (NULL != env_hostname) {
printf("Processor %d of %d on $HOSTNAME %s: Hello World\n",
   myrank, size, env_hostname);
} else {
printf("Processor %d of %d on $HOSTNAME NULL: Hello World\n",
   myrank, size);
}
if (0 == gethostname(hostname, 255)) {
printf("Processor %d of %d on host %s: Hello World\n",
   myrank, size, hostname);
}

MPI_Finalize();

exit(0);
}


Re: [OMPI devel] VT config.h.in

2010-01-19 Thread Nadia Derbey
Sylvain,

I'm replying in "private":

The file ./ompi/contrib/vt/vt/config.h.in should be in
your .hgignore ...

Bye


On Tue, 2010-01-19 at 10:59 +0100, Sylvain Jeaugey wrote:
> Hi list,
> 
> The file ompi/contrib/vt/vt/config.h.in seems to have been added to the 
> repository, but it is also created by autogen.sh.
> 
> Is it normal ?
> 
> The result is that when I commit after autogen, I have my patches polluted 
> with diffs in this file.
> 
> Sylvain
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
-- 
Nadia Derbey <nadia.der...@bull.net>



Re: [OMPI devel] mca_btl_openib_post_srr() posts to an uncreated SRQ when ibv_resize_cq() has failed

2009-11-26 Thread Nadia Derbey
On Mon, 2009-10-26 at 15:06 -0700, Paul H. Hargrove wrote:
> Retrying w/ fewer CQ entires as Jeff describes is a good idea to help 
> ensure that EINVAL actually does signify that the count exceeds the max 
> instead of just assuming this is so).  If it actually was signifying 
> some other error case, then one would probably not want to continue.

Sorry for the delay, but I had many other things to do...

You'll find a patch proposal in attachment, ready for review.

The only part I'm not sure about is the following hunk:

@@ -496,7 +540,13 @@ int mca_btl_openib_add_procs(
 peers[i] = endpoint;
 }

-return mca_btl_openib_size_queues(openib_btl, nprocs);
+rc = mca_btl_openib_size_queues(openib_btl, nprocs);
+if (OMPI_SUCCESS != rc) {
+mca_btl_openib_del_procs(btl, nprocs, ompi_procs, peers);
+opal_bitmap_clear_all_bits(reachable);
+}
+
+return rc;

Don't know if there's a "less violent" way of undoing things.

Anyway, things work well with the patch applied.

You'll also find in attachment:
1. the output without the patch applied
2. the output with the patch applied
3. the output with the patch applied + an emulation of an EINVAL that is
still returned.

Comments would be welcome.

Regards,
Nadia


> 
> -Paul
> 
> Jeff Squyres wrote:
> > Thanks for the analysis!
> >
> > We've argued about btl_r2_add_btls() before -- IIRC, the consensus is 
> > that we want it to be able to continue even if a BTL fails.  So I 
> > *think* that your #1 answer is better.
> >
> > However, we might want to try a little harder if EINVAL is returned -- 
> > perhaps try decreasing number of CQ entries and try again until either 
> > we have too few CQ entries to be useful (e.g., 0 or some higher number 
> > that is still "too small"), or fail the BTL alltogether...?
> >
> > On Oct 23, 2009, at 10:10 AM, Nadia Derbey wrote:
> >
> >> Hi,
> >>
> >> Yesterday I had to analyze a SIGSEGV occurring after the following
> >> message had been output:
> >> [ adjust_cq] cannot resize completion queue, error: 22
> >>
> >>
> >> What I found is the following:
> >>
> >> When ibv_resize_cq() fails to resize a CQ (in my case it returned
> >> EINVAL), adjust_cq() returns an error and create_srq() is not called by
> >> mca_btl_openib_size_queues().
> >>
> >> Note: One of our infiniband specialists told me that EINVAL was returned
> >> in that case because we were asking for more CQ entries than the max
> >> available.
> >>
> >> mca_bml_r2_add_btls() goes on executing.
> >>
> >> Then qp_create_all() is called (connect/btl_openib_connect_oob.c).
> >> ibv_create_qp() succeeds even though init_attr.srq is a NULL pointer
> >> (remember that create_srq() has not been previously called).
> >>
> >> Since all the QPs have been successfully created, qp_create_all() then
> >> calls:
> >> mca_btl_openib_endpoint_post_recvs()
> >>   --> mca_btl_openib_post_srr()
> >>   --> ibv_post_srq_recv() on a NULL SRQ
> >> ==> SIGSEGV
> >>
> >>
> >> If I'm not wrong in the analysis above, we have the choice between 2
> >> solutions to fix this problem:
> >>
> >> 1. if EINVAL is returned by ibv_resize_cq() in adjust_cq(), treat this
> >> as the ENOSYS case: do not return an error, since the CQ has
> >> successfully been created may be with less entries than needed, but it
> >> is there.
> >>
> >> Doing this we assume that EINVAL will always be the symptom of a "too
> >> many entries asked for" error from the IB stack. I don't have the
> >> answer...
> >> + I don't know if this won't imply a degraded mode in terms of
> >> performances.
> >>
> >> 2. Fix mca_bml_r2_add_btls() to cleanly exit if an error occurs during
> >> btl_add_procs().
> >>
> >> FYI I tested solution #1 and it worked...
> >>
> >> Any suggestion or comment would be welcome.
> >>
> >> Regards,
> >> Nadia
> >>
> >> -- 
> >> Nadia Derbey <nadia.der...@bull.net>
> >>

> 
-- 
Nadia Derbey <nadia.der...@bull.net>
btl/openib: correctly manage ibv_resize_cq() returning EINVAL

When ibv_resize_cq() returns EINVAL, retry several times, lowering the
CQ size (because EINVAL may be due to a CQ size that is too large).

If EINVAL is still returned, mca_btl_openib_size_queues() will return it too.
In that case, since this is a true error, clean everything by calling
mca_btl_openib_del_procs().


diff -r cf107f
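
(A minimal sketch of the retry-on-EINVAL idea described above - not the
attached patch - assuming the standard libibverbs ibv_resize_cq() call,
which returns 0 on success or an errno value on failure.)

#include <errno.h>
#include <infiniband/verbs.h>

static int resize_cq_with_retry(struct ibv_cq *cq, int wanted_cqe)
{
    int cqe = wanted_cqe;
    int rc;

    while (cqe > 0) {
        rc = ibv_resize_cq(cq, cqe);
        if (0 == rc) {
            return 0;       /* resized, possibly to fewer entries than wanted */
        }
        if (EINVAL != rc) {
            return rc;      /* a real error: give up immediately */
        }
        cqe /= 2;           /* EINVAL: assume the size was too large, retry */
    }
    return EINVAL;          /* could not resize even with a very small CQ */
}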

Re: [OMPI devel] RFC - "system-wide-only" MCA parameters

2009-09-04 Thread Nadia Derbey
On Fri, 2009-09-04 at 07:50 -0600, Ralph Castain wrote:
> Let me point out the obvious since this has plagued us at LANL with  
> regard to this concept. If a user wants to do something different, all  
> they have to do is download and build their own copy of OMPI.
> 
> Amazingly enough, that is exactly what they do. When we build our  
> production versions, we actually "no-build" modules we don't want them  
> using (e.g., certain BTL's that use privileged network interfaces) so  
> even MCA params won't let them do something undesirable.
> 
> No good - they just try until they realize it won't work, then  
> download and build their own version...and merrily hose the system.
> 
> My point here: this concept can help, but it should in no way be  
> viewed as a solution to the problem you are trying to solve. It is at  
> best a minor obstacle as we made it very simple for a user to  
> circumvent such measures.
> 
> Which is why I never made the effort to actually implement what was in  
> that ticket. It was decided that it really wouldn't help us here, and  
> would only result in further encouraging user-owned builds.

Ralph,

Let's forget those people who intentionally do bad things: it's true
that they will always find a way to bypass whatever has been done...

We are not talking about security here, but there are customer sites where
people do not want to have to care about certain MCA parameter values, and
where those system-wide params should not be *unintentionally* set to
different values.

Regards,
Nadia


> 
> :-(
> 
> 
> On Sep 4, 2009, at 12:42 AM, Jeff Squyres wrote:
> 
> > On Sep 4, 2009, at 8:26 AM, Nadia Derbey wrote:
> >
> >> > Can the file name ( openmpi-priv-mca-params.conf ) also be  
> >> configurable ?
> >>
> >> No, it isn't, presently, but this can be changed if needed.
> >>
> >
> >
> > If it's configurable, it must be configurable at configure time --  
> > not run time -- otherwise, a user could just give a different  
> > filename at runtime and get around all the "privileged" values.
> >
> > -- 
> > Jeff Squyres
> > jsquy...@cisco.com
> >
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
-- 
Nadia Derbey <nadia.der...@bull.net>



Re: [OMPI devel] RFC - "system-wide-only" MCA parameters

2009-09-04 Thread Nadia Derbey
On Fri, 2009-09-04 at 13:55 +0200, Sylvain Jeaugey wrote:
> On Fri, 4 Sep 2009, Jeff Squyres wrote:
> 
> > I haven't looked at the code deeply, so forgive me if I'm parsing this 
> > wrong: 
> > is the code actually reading the file into one list and then moving the 
> > values to another list?  If so, that seems a little hackish.  Can't it just 
> > read directly to the target list?
> On the basic approach, I would have another suggestion, reducing parsing 
> and maybe a bit less hackish : do not introduce another file but only a 
> keyword indicating that further overriding is disabled ("fixed", 
> "restricted", "read-only" ?).
> 
> You would therefore write in your configuration file something like:
> notifier_threshold_severity=notice fixed
> or more generally :
> key=value flags
> 
> Maybe we don't have a way to differentiate flags at the end with the 
> current parser, so maybe a leading "!" or "%" or any other strong 
> character would be simpler to implement while still ensuring 
> backward compatibility.
> 


Sylvain,

The current parser eats up all characters after the "=": everything that
is between the "=" and the eol is considered as the paramter's value.
So the second approach you're proposing seems better to me.

But if we do this, the extension I proposed in my RFC will be harder
to implement:


This new functionality can be extended in the future in the following
way: allow the administrator to specify boundaries within which an MCA
parameter is allowed to be changed by a higher priority setting. This
means that the administrator should declare min and max values (or even
a set of discrete values) for any such parameter. Then, any higher
priority setting will be done only if the new value belongs to the
declared set.


But actually, maybe that extension is not desirable at all. In that
case, I agree that your proposal is a very good compromise:
. single parser (though it should be enhanced)
. single configuration file

Regards,
Nadia

-- 
Nadia Derbey <nadia.der...@bull.net>



Re: [OMPI devel] RFC - "system-wide-only" MCA parameters

2009-09-04 Thread Nadia Derbey
On Fri, 2009-09-04 at 13:34 +0200, Sylvain Jeaugey wrote:
> On Fri, 4 Sep 2009, Jeff Squyres wrote:
> 
> > --
> > *** Checking versions
> > checking for SVN version... done
> > checking Open MPI version... 1.4a1hgf11244ed72b5
> > up to changeset c4b117c5439b
> > checking Open MPI release date... Unreleased developer copy
> > checking Open MPI Subversion repository version... hgf11244ed72b5
> > up to changeset c4b117c5439b
> > checking for SVN version... done
> > ...etc.
> > --
> >
> > Do you see this, or do you get a single-line version number?
> I get the same. The reason is simple :
> 
> $ hg tip
> changeset:   9:f11244ed72b5
> tag: tip
> user:Nadia Derbey <nadia.der...@bull.net>
> date:Thu Sep 03 14:21:47 2009 +0200
> summary: up to changeset c4b117c5439b
> 
> $ hg -v tip | grep changeset | cut -d: -f3 # done by configure
> f11244ed72b5
> up to changeset c4b117c5439b
> 
> So yes, if anyone includes the word "changeset" in the commit message, 
> you'll have the same bug :-)
> 
> So,
> hg -R "$srcdir" tip | head -1 | grep "^changeset:" | cut -d: -f3
> would certainly be safer.
> 

Thx Sylvain!
just pushed a fake patch as a workaround.
Actually, I didn't have the problem on my side, because hg is not known
in my build environment. Never noticed these lines:

-

*** Checking versions
checking for SVN version... ../configure: line 4285: hg: command not
found
done
checking Open MPI version... 1.4a1hg
checking Open MPI release date... Unreleased developer copy
checking Open MPI Subversion repository version... hg
checking for SVN version... ../configure: line 4397: hg: command not
found
done

-

Regards,
Nadia

-- 
Nadia Derbey <nadia.der...@bull.net>



Re: [OMPI devel] RFC - "system-wide-only" MCA parameters

2009-09-04 Thread Nadia Derbey
On Fri, 2009-09-04 at 10:05 +0300, Jeff Squyres wrote:
> On Sep 3, 2009, at 12:23 PM, Nadia Derbey wrote:
> 
> > What: Define a way for the system administrator to prevent users from
> >   overwriting the default system-wide MCA parameters settings.
> >
> 
> In general, I think this is great stuff.  I have a few nit picks.

Thanks for having a look at it!

> 
> (BTW: you might want to run contrib/hg/build-hgignore.pl in your svn 
> +hg tree to generate a proper .hgignore file...?)
> 
> I'm currently unable to build the hg tree -- it gets a version with a  
> newline in it which causes badness in the Makefile.  For example:
> 
> --
> *** Checking versions
> checking for SVN version... done
> checking Open MPI version... 1.4a1hgf11244ed72b5
> up to changeset c4b117c5439b
> checking Open MPI release date... Unreleased developer copy
> checking Open MPI Subversion repository version... hgf11244ed72b5
> up to changeset c4b117c5439b
> checking for SVN version... done
> ...etc.
> --

??? Maybe something went wrong with my mqueues for the last changeset
(I recognize the qnew message in the version...).

Until I fix that, the solution would be to revert back to changeset
db3595643dd2 (the very last changeset is not important at all).

> 
> Do you see this, or do you get a single-line version number?
> 
> > When openmpi-priv-mca-params.conf is parsed, any parameter listed in
> > that file is moved from the mca_base_param_file_values list to a new
> > parallel list (mca_base_priv_param_file_values). The parameter remains
> > "hidden" in that new list.
> >
> 
> I haven't looked at the code deeply, so forgive me if I'm parsing this  
> wrong: is the code actually reading the file into one list and then  
> moving the values to another list?  If so, that seems a little  
> hackish.  Can't it just read directly to the target list?

We could read directly to the target list if we had 2 files giving settings
(as described in ticket 75):
one for the "classical" system-wide parameters
one for the "privileged", system-wide parameters.

But the solution I'm proposing is different: we have one file for the
settings, and one file that lists the parameter names that should not
be changed once they have been set:

1) openmpi-mca-params.conf is the one we all know.
It contains the system-wide mca parameters settings.

ex:
notifier_threshold_severity = notice

2) openmpi-priv-mca-params.conf. This file contains the list of mca
parameters that cannot be changed once they have been set in
openmpi-mca-params.conf (only the parameter name is in that file, not
its setting).

ex:
notifier_threshold_severity


openmpi-mca-params.conf is parsed first. The list
mca_base_param_file_values is populated with all the parameters set in
that file.
Then openmpi-priv-mca-params.conf is parsed. This file, as described
earlier, only contains the list of those parameters that if set in
openmpi-mca-params.conf, shouldn't be changed. So, any parameter listed
in openmpi-priv-mca-params.conf is moved from mca_base_param_file_values
to a newly defined list (mca_base_priv_param_file_values).
Then the remaining files are parsed as usual.
As explained below, the lookup order is changed to ensure that
mca_base_priv_param_file_values is the very first list to be scanned: if
a parameter is found there, we are done.
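
(A simplified sketch of that lookup precedence - illustrative only, not the
actual mca_base_param code; the parameter names and lists below are just
examples.)

#include <stdio.h>
#include <string.h>

struct kv { const char *name; const char *value; };

/* populated from openmpi-mca-params.conf for names listed in
 * openmpi-priv-mca-params.conf */
static struct kv priv_params[] = { { "notifier_threshold_severity", "notice" } };
/* later, lower-priority settings (user file, environment, command line) */
static struct kv user_params[] = { { "notifier_threshold_severity", "debug" },
                                   { "btl_tcp_if_exclude", "eth0" } };

static const char *lookup(const char *name)
{
    size_t i;
    /* privileged (system-wide-only) values are scanned first: if found, done */
    for (i = 0; i < sizeof(priv_params) / sizeof(priv_params[0]); i++)
        if (0 == strcmp(priv_params[i].name, name))
            return priv_params[i].value;
    for (i = 0; i < sizeof(user_params) / sizeof(user_params[0]); i++)
        if (0 == strcmp(user_params[i].name, name))
            return user_params[i].value;
    return NULL;
}

int main(void)
{
    printf("%s\n", lookup("notifier_threshold_severity"));  /* prints "notice" */
    printf("%s\n", lookup("btl_tcp_if_exclude"));            /* prints "eth0" */
    return 0;
}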


> 
> > The lookup order has been changed: the new
> > mca_base_priv_param_file_values list is the very 1st one to be  
> > scanned:
> > this is how we ensure that the "privileged" parameters are never
> > overwritten once defined on a system-wide basis.
> >
> > Other external changes:
> > 1. The man page for mpirun(1) has been changed.
> > 2. The ompi_info(1) output has been changed to reflect the status of
> > these "privileged" parameters. The status field can now be one of
> > "read-only", "writable" or "privileged".
> >
> 
> This seems a little funky, too.  Isn't a privileged param actually  
> read only (in an abstract sense)?  Meaning: you can't change priv  
> param values, so they're "read only".  "Priv" feels more like a  
> "source" attribute, doesn't it...?  I.e., it's a read-only param, and  
> the source of the attribute is the special priv file.

It's true that I first thought of that read-only status, but I haven't
kept the idea: from an Open MPI point of view, read-only is an "immutable"
status: it is hard coded, and a variable that is read-only cannot easily
become writable. So an MPI user who runs ompi_info and sees that a
parameter is read-only knows that he will never have a chance to change
it.
While these "privileged" params have a different status: they are
read-o

Re: [OMPI devel] RFC - "system-wide-only" MCA parameters

2009-09-04 Thread Nadia Derbey
On Thu, 2009-09-03 at 19:29 -0400, Graham, Richard L. wrote:
> What happens if $sysconfdir/openmpi-priv-mca-params.conf is missing ?

If it is missing, everything works as today: any parameter declared in
$sysconfdir/openmpi-mca-params.conf is considered as system-wide and can
be overwritten as usual.

> 
> Can the file name ( openmpi-priv-mca-params.conf ) also be configurable ?

No, it isn't, presently, but this can be changed if needed.

Regards,
Nadia

> 
> Rich
> 
> 
> On 9/3/09 5:23 AM, "Nadia Derbey" <nadia.der...@bull.net> wrote:
> 
> 
> 
> What: Define a way for the system administrator to prevent users from
>   overwriting the default system-wide MCA parameters settings.
> 
> Why: Leaving the user completely free to change any MCA parameter
>  that has been defined on a system-wide basis may sometimes be
>  undesirable.
> 
> Where: code can be found at
> http://derb...@bitbucket.org/derbeyn/opal-ro-mca-params/
> 
> ---
> 
> Description:
> 
> 
> Leaving the user completely free to change any parameter that has been
> defined on a system-wide basis may sometimes be undesirable. For
> example, on a node with multiple TCP interfaces, the system
> administrator may for some reason want to restrict MPI applications to a
> fixed subset of these TCP interfaces.
> 
> Today, there is no way for the system administrator to prevent users
> from overwriting the default system-wide settings: even if he has
> excluded eth0 from the interfaces usable by an MPI application, he
> cannot prevent any user from explicitly using that excluded interface.
> 
> The purpose of this feature is to provide a new "system-wide-only"
> property for the MCA parameters, specifying that their system-wide
> value, if it has been defined, can never be overridden. This implies
> changing the existing parameters setting rules.
> 
> 
> Implementation:
> 
> It should be noted that this feature is already described in
> https://svn.open-mpi.org/trac/ompi/ticket/75 as follows:
> "another MCA parameter file that has "privileged" MCA parameters. This
> file is hard-coded in the code base (based on $prefix, likely
> "$sysconfdir/openmpi-priv-mca-params.conf", or somesuch). Any parameters
> set in this file will be treated specially and cannot be overridden."
> 
> But we chose another way to implement this feature: the file
> $sysconfdir/openmpi-mca-params.conf is kept with all the parameters
> settings (even those that are considered as system-wide-only).
> And the new file $sysconfdir/openmpi-priv-mca-params.conf is introduced
> to contain only the list of the "privileged" parameters (not their value).
> Doing it that way makes it easier to change the status of the MCA
> parameters: moving them from privileged to non-privileged is only a
> matter of removing the new file, which preserves compatibility.
> With the solution proposed in the ticket, by contrast, we would have to merge both files.
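
As a purely hypothetical illustration, reusing the eth0 example above (the
parameter name is only an example; the priv file lists names, not values),
the two files could look like this:

# $sysconfdir/openmpi-mca-params.conf -- values, exactly as today
btl_tcp_if_exclude = eth0

# $sysconfdir/openmpi-priv-mca-params.conf -- names only
btl_tcp_if_exclude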
> 
> The configuration files are parsed in the following order:
> 1. $sysconfdir/openmpi-mca-params.conf.
> 2. $sysconfdir/openmpi-priv-mca-params.conf
> 3. $HOME/.openmpi/mca-param.conf
> 
> When openmpi-priv-mca-params.conf is parsed, any parameter listed in
> that file is moved from the mca_base_param_file_values list to a new
> parallel list (mca_base_priv_param_file_values). The parameter remains
> "hidden" in that new list.
> The lookup order has been changed: the new
> mca_base_priv_param_file_values list is the very 1st one to be scanned:
> this is how we ensure that the "privileged" parameters are never
> overwritten once defined on a system-wide basis.
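
A minimal, self-contained sketch of that precedence (helper names and sample
values are hypothetical; the real lookup lives in the MCA base code and walks
more sources than shown here):

#include <stdio.h>
#include <string.h>

struct kv { const char *name; const char *value; };

/* Hypothetical sample data: the value still comes from the system-wide file,
 * the priv file only promoted the name to "privileged". */
static const struct kv priv_file_values[] = {   /* mca_base_priv_param_file_values */
    { "btl_tcp_if_exclude", "eth0" },
    { NULL, NULL }
};
static const struct kv user_values[] = {        /* mpirun -mca, environment, ... */
    { "btl_tcp_if_exclude", "" },
    { NULL, NULL }
};
static const struct kv file_values[] = {        /* mca_base_param_file_values */
    { "btl", "self,sm,tcp" },
    { NULL, NULL }
};

static const char *find(const struct kv *l, const char *name)
{
    for (; NULL != l->name; l++) {
        if (0 == strcmp(l->name, name)) {
            return l->value;
        }
    }
    return NULL;
}

static const char *lookup(const char *name)
{
    const char *v;
    /* the privileged list is scanned first, so nothing can override it */
    if (NULL != (v = find(priv_file_values, name))) return v;
    if (NULL != (v = find(user_values, name)))      return v;
    if (NULL != (v = find(file_values, name)))      return v;
    return NULL;    /* fall back to the compiled-in default */
}

int main(void)
{
    /* the user tried to re-enable eth0, but the privileged value wins */
    printf("btl_tcp_if_exclude = %s\n", lookup("btl_tcp_if_exclude"));
    return 0;
}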
> 
> Other external changes:
> 1. The man page for mpirun(1) has been changed.
> 2. The ompi_info(1) output has been changed to reflect the status of
> these "privileged" parameters. The status field can now be one of
> "read-only", "writable" or "privileged".
> 3. A new option has been added to ompi_info(1): --privileged.
> It outputs only the list of parameters that have been defined as
> system-wide-only.
> 
> 
> TODO list:
> . Make this feature configurable.
> . Output a warning message as described in the ticket.
> 
> 
> Possible extension to this functionality:
> 
> This new functionality can be extended in the future in the following
> way: allow the administrator to specify boundaries within which an MCA
> parameter is allowed to be changed by a higher priority setting. This
> means that the administrator should declare min and max values (or even
> a set of discrete values) for any such parameter. Then, any

Re: [OMPI devel] problem in the ORTE notifier framework

2009-05-28 Thread Nadia Derbey
On Tue, 2009-05-26 at 17:24 -0600, Ralph Castain wrote:
> First, to answer Nadia's question: you will find that the init
> function for the module is already called when it is selected - see
> the code in orte/mca/base/notifier_base_select.c, lines 72-76 (in the
> trunk).

Strange? Our repository is a clone of the trunk?
> 
It's true that if I "hg update" to v1.3 I see that the fix is there.

Regards,
Nadia

> It would be a good idea to tie into the sos work to avoid conflicts
> when it all gets merged back together, assuming that isn't a big
> problem for you.
> 
> As for Jeff's suggestion: dealing with the performance hit problem is
> why I suggested ORTE_NOTIFIER_VERBOSE, modeled after the
> OPAL_OUTPUT_VERBOSE model. The idea was to compile it in -only- when
> the system is built for it - maybe using a --with-notifier-verbose
> configuration option. Frankly, some organizations would happily pay a
> small performance penalty for the benefits.
> 
> I would personally recommend that the notifier framework keep the
> stats so things can be compact and self-contained. We still get
> atomicity by allowing each framework/component/whatever to specify the
> threshold. Creating yet another system to do nothing more than track
> error/warning frequencies to decide whether or not to notify seems
> wasteful.
> 
> Perhaps worth a phone call to decide path forward?
> 
> 
> On Tue, May 26, 2009 at 1:06 PM, Jeff Squyres <jsquy...@cisco.com>
> wrote:
> Nadia --
> 
> Sorry I didn't get to jump in on the other thread earlier.
> 
> We have made considerable changes to the notifier framework in
> a branch to better support "SOS" functionality:
> 
> 
>  https://www.open-mpi.org/hg/auth/hgwebdir.cgi/jsquyres/opal-sos
> 
> Cisco and Indiana U. have been working on this branch for a
> while.  A description of the SOS stuff is here:
> 
>https://svn.open-mpi.org/trac/ompi/wiki/ErrorMessages
> 
> As for setting up an external web server with hg, don't bother
> -- just get an account at bitbucket.org.  They're free and
> allow you to host hg repositories there.  I've used bitbucket
> to collaborate on code before it hits OMPI's SVN trunk with
> both internal and external OMPI developers.
> 
> We can certainly move the opal-sos repo to bitbucket (or
> branch again off opal-sos to bitbucket -- whatever makes more
> sense) to facilitate collaborating with you.
> 
> Back on topic...
> 
> I'd actually suggest a combination of what has been discussed
> in the other thread.  The notifier can be the mechanism that
> actually sends the output message, but it doesn't have to be
> the mechanism that tracks the stats and decides when to output
> a message.  That can be separate logic, and therefore be more
> fine-grained (and potentially even specific to the MPI layer).
> 
> The Big Question will be how to do this with zero performance
> impact when it is not being used. This has always been the
> difficult issue when trying to implement any kind of
> monitoring inside the core OMPI performance-sensitive paths.
>  Even adding individual branches has met with resistance (in
> performance-critical code paths)...
> 
> 
> 
> 
> 
> On May 26, 2009, at 10:59 AM, Nadia Derbey wrote:
> 
> 
> 
> Hi,
> 
> While having a look at the notifier framework under
> orte, I noticed that
> the way it is written, the init routine for the
> selected module cannot
> be called.
> 
> Attached is a small patch that fixes this issue.
> 
> Regards,
> Nadia
> 
> 
> 
> 
> 
> -- 
> Jeff Squyres
> Cisco Systems
> 
-- 
Nadia Derbey <nadia.der...@bull.net>



Re: [OMPI devel] problem in the ORTE notifier framework

2009-05-28 Thread Nadia Derbey
On Wed, 2009-05-27 at 14:25 -0400, Jeff Squyres wrote:
> Excellent points; Ralph and I chatted about this on the phone today --  
> we concur with George.
> 
> Bull -- would peruse work for you?  I think you mentioned before that  
> it didn't seem attractive to you.

Well, it didn't, because from what I understood the MPI program needs to
be changed (register a callback routine for the event, activate the
event, etc.), and this is something we wanted to avoid.

Now, if we are allowed to
1. define new "internal" PERUSE events,
2. internally set the associated callback routines,
why not use PERUSE? This, combined with the ORTE notifier framework,
could do the job, I think.
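
Just to make the idea concrete, here is a tiny sketch with purely made-up
hook names (the real hooks would be PERUSE events fired from inside the
library, with the callbacks registered internally rather than by the MPI
program, as George described for the request-creation and first-fragment
events):

#include <stdio.h>

/* Hypothetical names throughout: the point is only that two internal
 * callbacks are enough to measure how long a send request sat in the
 * internal queues, and that a long wait is what we would hand over to
 * the notifier instead of printing directly. */
static double req_created_at;

static void on_request_create(double now)        /* internal event #1 */
{
    req_created_at = now;
}

static void on_first_fragment_sent(double now)   /* internal event #2 */
{
    double waited = now - req_created_at;
    if (waited > 0.1) {                          /* arbitrary 100 ms threshold */
        printf("send request queued for %.3f s before reaching the wire\n", waited);
    }
}

int main(void)
{
    on_request_create(0.0);
    on_first_fragment_sent(0.25);                /* simulated 250 ms delay */
    return 0;
}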

Regards,
Nadia

>   I think George's point is that we  
> already have lots of hooks in place in the PML -- and they're called  
> peruse.  So if we could use those hooks, then a) they're run-time  
> selectable already, and b) there's no additional cost in performance  
> critical/not-critical code paths (for the case where these stats are  
> not being collected) because PERUSE has been in the code base for a  
> long time.
> 
> I think the idea is that your callbacks could be invoked by the peruse  
> hooks and then they can do whatever they want -- increment counters,  
> conditionally invoke the ORTE notifier system, etc.
> 
> 
> 
> On May 27, 2009, at 11:34 AM, George Bosilca wrote:
> 
> > What is a generic threshold? And what is a counter? We have a policy
> > against such coding standards, and to be honest I would like to stick
> > to it. The reason is that the PML is a very complex piece of code, and
> > I would like to keep it as easy to understand as possible. If people
> > start adding #if/#endif all over the code, we are diverging from this
> > goal.
> >
> > The only way to make this work is to call the notifier or some other
> > framework in this "slow path" and let this other framework do its own
> > logic to determine what and when to print. Of course the cost of this
> > is a function call plus an atomic operation (which is already not
> > cheap). It's starting to get expensive, even for a "slow path", which
> > in this particular context is just one insertion in an atomic FIFO.
> >
> > If, instead of counting the number of times we try to send the fragment,
> > we switch to a time-based approach, this can be solved with the PERUSE
> > calls. There is a callback when the request is created, and another
> > callback when the first fragment is pushed successfully into the
> > network. Computing the time between these two, allow a tool to figure
> > out how much time the request was waiting in some internal queues, and
> > therefore how much delay this added to the execution time.
> >
> >george.
> >
> > On May 27, 2009, at 06:59 , Ralph Castain wrote:
> >
> > > ORTE_NOTIFIER_VERBOSE(api, counter, threshold, ...)
> > >
> > > #if WANT_NOTIFIER_VERBOSE
> > >     opal_atomic_increment(counter);
> > >     if (counter > threshold) {
> > >         orte_notifier.api(...);
> > >     }
> > > #endif
> >
> >
> 
> 
-- 
Nadia Derbey <nadia.der...@bull.net>



[OMPI devel] problem in the ORTE notifier framework

2009-05-26 Thread Nadia Derbey
Hi,

While having a look at the notifier framework under orte, I noticed that
the way it is written, the init routine for the selected module cannot
be called.

Attached is a small patch that fixes this issue.

Regards,
Nadia
ORTE notifier module init routine is never called: orte_notifier.init checking
should be done after orte_notifier has been set.


diff -r 876c02c65058 orte/mca/notifier/base/notifier_base_select.c
--- a/orte/mca/notifier/base/notifier_base_select.c	Mon May 25 14:17:38 2009 +0200
+++ b/orte/mca/notifier/base/notifier_base_select.c	Tue May 26 17:00:28 2009 +0200
@@ -69,17 +69,16 @@ int orte_notifier_base_select(void)
         goto cleanup;
     }
 
+    /* Save the winner */
+    orte_notifier = *best_module;
+
     if (NULL != orte_notifier.init) {
         /* if an init function is provided, use it */
         if (ORTE_SUCCESS != (ret = orte_notifier.init()) ) {
            exit_status = ret;
-            goto cleanup;
         }
     }
 
-    /* Save the winner */
-    orte_notifier = *best_module;
-
  cleanup:
     return exit_status;
 }


Re: [OMPI devel] RFC: Diagnostic framework for MPI

2009-05-26 Thread Nadia Derbey
On Tue, 2009-05-26 at 05:35 -0600, Ralph Castain wrote:
> Hi Nadia
> 
> We actually have a framework in the system for this purpose, though it
> might require some minor modifications to do precisely what you
> describe. It is the ORTE "notifier" framework - you will find it at
> orte/mca/notifier. There are several components, each of which
> supports a different notification mechanism (e.g., message into the
> sys log, smtp, and even "twitter").

Ralph,

Thanks a lot for your detailed answer. I'll have a look at the notifier
framework to see if it could serve our purpose. Actually, from what you
describe, it looks like it does.

Regards,
Nadia
> 
> The system works by adding orte_notifier calls to the OMPI code
> wherever we deem it advisable to alert someone. For example, if we
> think a sys admin might want to be alerted when the number of IB send
> retries exceeds some limit, we add a call to orte_notifier to the IB
> code with:
> 
> if (#retries > threshold) {
> orte_notifier.xxx();
> }
> 
> I believe we could easily extend this to support your proposed
> functionality. A couple of possibilities that immediately spring to
> mind would be:
> 
> 1. you could create a new component (or we could modify the existing
> ones) that tracks how many times it is called for a given error, and
> only actually issues a notification for that specific error when the
> count exceeds a threshold. The negative to this approach is that the
> threshold would be uniform across all errors.
> 
> 2. we could extend the current notifier APIs to add a threshold count
> upon which the notification is to be sent, perhaps creating a new
> macro ORTE_NOTIFIER_VERBOSE that takes the threshold as one of its
> arguments. We could then let each OMPI framework have a new
> "threshold" MCA param, thus allowing the sys admins to "tune" the
> frequency of error reporting by framework. Of course, we could let
> them get as detailed here as you want - they could even have
> "threshold" params for each component, function, or whatever. This
> would be combined with #1 above to alert only when the count exceeded
> the threshold for that specific error message.
> 
> I'm sure you and others will come up with additional (probably better)
> ways of implementing this extension. My point here was simply to
> ensure you knew that the basic mechanism already exists, and to
> stimulate some thought as to how to use it for your proposed purpose.
> 
> I would be happy to help you do so as this is something we (LANL) have
> put at a high priority - our sys admins on the large clusters really
> need the help.
> 
> HTH
> Ralph
> 
> 
> On Mon, May 25, 2009 at 11:33 PM, Nadia Derbey <nadia.der...@bull.net>
> wrote:
> What: Warn the administrator when unusual events are occurring
> too
> frequently.
> 
> Why: Such unusual events might be the symptom of some problem
> that can
> easily be fixed (by a better tuning, for example)
> 
> Where: Adds a new ompi framework
> 
> ---
> 
> Description:
> 
> The objective of the Open MPI library is to make applications
> run to
> completion, given that no fatal error is encountered.
> In some situations, unusual events may occur. Since these
> events are not
> considered to be fatal enough, the library arbitrarily chooses
> to bypass
> them using a software mechanism, instead of actually stopping
> the
> application. But even though this choice helps in completing
> the
> application, it may frequently result in significant
> performance
> degradation. This is not an issue if such “unusual events”
> don't occur
> too frequently. But if they actually do, that might be
> representative of
> a real problem that could sometimes be easily avoided.
> 
> For example, when mca_pml_ob1_send_request_start() starts a
> send request
> and faces a resource shortage, it silently calls
> add_request_to_send_pending() to queue that send request into
> the list
> of pending send requests in order to process it later on. If
> an adapting
> mechanism is not provided at runtime to increase the receive
> queue
> length, at least a message can be sent to the administrator to
> let him
> do the tuning by hand before the next run.
> 
> We had a lo

[OMPI devel] RFC: Diagnostic framework for MPI

2009-05-26 Thread Nadia Derbey
What: Warn the administrator when unusual events are occurring too
frequently.

Why: Such unusual events might be the symptom of some problem that can
easily be fixed (by a better tuning, for example)

Where: Adds a new ompi framework

---

Description:

The objective of the Open MPI library is to make applications run to
completion, given that no fatal error is encountered.
In some situations, unusual events may occur. Since these events are not
considered to be fatal enough, the library arbitrarily chooses to bypass
them using a software mechanism, instead of actually stopping the
application. But even though this choice helps in completing the
application, it may frequently result in significant performance
degradation. This is not an issue if such “unusual events” don't occur
too frequently. But if they actually do, that might be representative of
a real problem that could sometimes be easily avoided.

For example, when mca_pml_ob1_send_request_start() starts a send request
and faces a resource shortage, it silently calls
add_request_to_send_pending() to queue that send request into the list
of pending send requests in order to process it later on. If an adapting
mechanism is not provided at runtime to increase the receive queue
length, at least a message can be sent to the administrator to let him
do the tuning by hand before the next run.

We had a look at other tracing utilities (like PMPI, PERUSE, VT), but
found them either too high level or too intrusive at the application
level.

The “diagnostic framework” we'd like to propose would help capture
such “unusual events” and trace them, while having a very low impact
on performance. This is obtained by defining tracing routines that
can be called from the ompi code. The collected events are aggregated
per MPI process and only traced once a threshold has been reached. Another
threshold (a time threshold) can be used to condition the generation of
subsequent traces for an already-traced event.

This is obtained by defining 2 MCA parameters and a rule:
. the count threshold C
. the time delay T
The rule is: an event will only be traced if it happened C times, and it
won't be traced more than once every T seconds.

Thus, events happening at a very low rate will never generate a trace
except one at MPI_Finalize summarizing:
[time] At finalize : 23 times : pre-allocated buffers all full, calling
malloc

Those happening "a little too much" will sometimes generate a trace
saying something like:
[time] 1000 warnings : could not send in openib now, delaying
[time+12345 sec] 1000 warnings : could not send in openib now, delaying

And events occurring at a high frequency will only generate a message
every T seconds saying:
[time] 1000 warnings : adding buffers in the SRQ
[time+T]   1,234,567 warnings (in T seconds) : adding buffers in the SRQ
[time+2*T] 2,345,678 warnings (in T seconds) : adding buffers in the SRQ

The count threshold and time delay are defined per event.
They can also be defined as MCA parameters. In that case, the mca
parameter value overrides the per event values.
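
As a minimal, self-contained sketch of that rule (the structure and function
names are hypothetical; the MPI_Finalize summary and the HNP-side reporting
mentioned below are not shown):

#include <stdio.h>
#include <time.h>

struct diag_event {
    const char *msg;         /* e.g. "adding buffers in the SRQ"             */
    unsigned long count;     /* occurrences aggregated since the last trace  */
    unsigned long threshold; /* count threshold C (per event or MCA param)   */
    double period;           /* time delay T, in seconds                     */
    time_t last_trace;       /* 0 until a first trace has been emitted       */
};

static void diag_record(struct diag_event *ev)
{
    time_t now = time(NULL);

    ev->count++;
    if (ev->count < ev->threshold) {
        return;                                   /* not frequent enough yet */
    }
    if (0 != ev->last_trace && difftime(now, ev->last_trace) < ev->period) {
        return;                                   /* traced too recently     */
    }
    printf("[%ld] %lu warnings : %s\n", (long)now, ev->count, ev->msg);
    ev->last_trace = now;
    ev->count = 0;
}

int main(void)
{
    struct diag_event ev = { "adding buffers in the SRQ", 0, 1000, 60.0, 0 };
    int i;

    for (i = 0; i < 5000; i++) {
        diag_record(&ev);    /* called from the "unusual event" code path */
    }
    return 0;
}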

The following information is traced too:
  . job family
  . the local job id
  . the job vpid

Another aspect of performance savings is that a mechanism a la
show_help() can be used in order to let the HNP actually do the job.

We started the implementation of this feature, so patches are available if
needed. We are currently trying to set up hgweb on an external server.

Since I'm an Open MPI newbie, I'm submitting this RFC to have your
opinion about its usefulness, or even to know if there's an already
existing mechanism to do this job.

Regards,
Nadia

-- 
Nadia Derbey <nadia.der...@bull.net>