Re: [OMPI users] "-bind-to numa" of openmpi-1.7.4rc1 doesn't work for our Magny-Cours-based 32-core node

2013-12-25 Thread Ralph Castain
Added this info to the ticket, and added you to it as well.

Thanks again
Ralph


On Dec 25, 2013, at 3:42 PM, tmish...@jcity.maeda.co.jp wrote:

> 
> 
> Hi Ralph,
> 
> Thank you for your reply. After that, I found a more reasonable fix,
> I guess. I moved the OBJ_CONSTRUCT of opal_tree_item_t out of the
> debug-only part of opal_tree_construct, as shown below:
> 
> static void opal_tree_construct(opal_tree_t *tree)
> {
>     OBJ_CONSTRUCT( &(tree->opal_tree_sentinel), opal_tree_item_t ); /* tmishima */
> #if OPAL_ENABLE_DEBUG
>     /* These refcounts should never be used in assertions because they
>        should never be removed from this list, added to another list,
>        etc.  So set them to sentinel values. */
> 
>     tree->opal_tree_sentinel.opal_tree_item_refcount  = 1;
>     tree->opal_tree_sentinel.opal_tree_item_belong_to = tree;
> #endif
>     tree->opal_tree_sentinel.opal_tree_container = tree;
>     tree->opal_tree_sentinel.opal_tree_parent = &tree->opal_tree_sentinel;
>     tree->opal_tree_sentinel.opal_tree_num_ancestors = -1;
> 
>     tree->opal_tree_sentinel.opal_tree_next_sibling = &tree->opal_tree_sentinel;
>     tree->opal_tree_sentinel.opal_tree_prev_sibling = &tree->opal_tree_sentinel;
> 
>     tree->opal_tree_sentinel.opal_tree_first_child = &tree->opal_tree_sentinel;
>     tree->opal_tree_sentinel.opal_tree_last_child  = &tree->opal_tree_sentinel;
> 
>     tree->opal_tree_num_items = 0;
>     tree->comp = NULL;
>     tree->serialize = NULL;
>     tree->deserialize = NULL;
>     tree->get_key = NULL;
> }
> 
> In addition, I checked how lama handles the hierarchy inversion.
> It did not work on node04, which has the inversion, and it worked on
> node09, which has the normal hierarchy. Please forward this information
> to the lama developers.
> 
> Regards,
> Tetsuya Mishima
> 
> qsub: job 8380.manage.cluster completed
> [mishima@manage openmpi-1.7.4rc2r30069]$ qsub -I -l nodes=4:ppn=8
> qsub: waiting for job 8381.manage.cluster to start
> qsub: job 8381.manage.cluster ready
> 
> [mishima@node09 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> [mishima@node09 demos]$ mpirun -np 2 -report-bindings -mca rmaps lama -mca rmaps_lama_bind 1N -mca rmaps_lama_map Ncsbnh myprog
> [node09.cluster:20144] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> [node09.cluster:20144] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> Hello world from process 1 of 2
> Hello world from process 0 of 2
> [mishima@node09 demos]$
> 
> 
> qsub: job 8383.manage.cluster completed
> [mishima@manage openmpi-1.7.4rc2r30069]$ qsub -I -l nodes=1:ppn=32
> qsub: waiting for job 8384.manage.cluster to start
> qsub: job 8384.manage.cluster ready
> 
> [mishima@node04 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> [mishima@node04 demos]$ mpirun -np 2 -report-bindings -mca rmaps lama -mca rmaps_lama_bind 1N -mca rmaps_lama_map Ncsbnh myprog
> --
> RMaps LAMA detected that there are not enough resources to map the
> remainder of the job. Check the command line options, and the number of
> nodes allocated to this job.
> Application Context : 0
> # of Processes Successfully Mapped: 0
> # of Processes Requested  : 2
> Mapping  : Ncsbnh
> Binding  : 1N
> MPPR : [Not Provided]
> Ordering : s
> --
> [node04.cluster:20298] [[21003,0],0] ORTE_ERROR_LOG: Error in file
> rmaps_lama_module.c at line 309
> 
> [node04.cluster:20298] [[21003,0],0] ORTE_ERROR_LOG: Error in file
> base/rmaps_base_map_job.c at line 217
> 
>> Deeply appreciate all your help! Your fix looks reasonable to me and is
>> the kind of difference we frequently see between compilers and
>> environments, which is why initializing variables is so important.
>> This one apparently slipped by the lama developers.
>> 
>> I'll apply to trunk and cmr it across to 1.7.4.
>> 
>> Thanks again
>> Ralph
>> 
>> On Dec 25, 2013, at 3:39 AM, tmish...@jcity.maeda.co.jp wrote:
>> 
>>> 
>>> 
>>> Hi Ralph,
>>> 
>>> I ran valgrind and found uninitialised-value errors. All of them
>>> occurred in opal_tree_add_child, as shown at the bottom. As a quick
>>> fix, I put one line in "opal_tree.c", although it's not elegant:
>>> 
>>> void opal_tree_init(opal_tree_t *tree, opal_tree_comp_fn_t comp,
>>>                     opal_tree_item_serialize_fn_t serialize,
>>>                     opal_tree_item_deserialize_fn_t deserialize,
>>>                     opal_tree_get_key_fn_t get_key)
>>> {
>>>     tree->comp = comp;
>>>     tree->serialize = serialize;
>>>     tree->deserialize = deserialize;
>>>     tree->get_key = get_key;
>>>     opal_tree_get_root(tree)->opal_tree_num_children = 0; /* added by tmishima */
>>> }
>>> 
>>> Then, these errors all disappeared and openmpi with lama worked fine.
>>> A

Re: [OMPI users] "-bind-to numa" of openmpi-1.7.4rc1 doesn't work for our Magny-Cours-based 32-core node

2013-12-25 Thread tmishima


Hi Ralph,

Thank you for your reply. After that, I found a more reasonable fix,
I guess. I moved the OBJ_CONSTRUCT of opal_tree_item_t out of the
debug-only part of opal_tree_construct, as shown below:

static void opal_tree_construct(opal_tree_t *tree)
{
    OBJ_CONSTRUCT( &(tree->opal_tree_sentinel), opal_tree_item_t ); /* tmishima */
#if OPAL_ENABLE_DEBUG
    /* These refcounts should never be used in assertions because they
       should never be removed from this list, added to another list,
       etc.  So set them to sentinel values. */

    tree->opal_tree_sentinel.opal_tree_item_refcount  = 1;
    tree->opal_tree_sentinel.opal_tree_item_belong_to = tree;
#endif
    tree->opal_tree_sentinel.opal_tree_container = tree;
    tree->opal_tree_sentinel.opal_tree_parent = &tree->opal_tree_sentinel;
    tree->opal_tree_sentinel.opal_tree_num_ancestors = -1;

    tree->opal_tree_sentinel.opal_tree_next_sibling = &tree->opal_tree_sentinel;
    tree->opal_tree_sentinel.opal_tree_prev_sibling = &tree->opal_tree_sentinel;

    tree->opal_tree_sentinel.opal_tree_first_child = &tree->opal_tree_sentinel;
    tree->opal_tree_sentinel.opal_tree_last_child  = &tree->opal_tree_sentinel;

    tree->opal_tree_num_items = 0;
    tree->comp = NULL;
    tree->serialize = NULL;
    tree->deserialize = NULL;
    tree->get_key = NULL;
}
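
For what it's worth, here is a minimal standalone C sketch of the failure mode (toy_item and its field are hypothetical stand-ins, not the real OPAL types): a member that is never assigned holds an indeterminate value, and branching on it is exactly the "conditional jump ... depends on uninitialised value(s)" class of error valgrind reported.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for an opal_tree-like item; NOT the real OPAL type. */
struct toy_item {
    int num_children;
};

int main(void)
{
    /* malloc() returns uninitialized memory, so reading num_children before
     * it is ever assigned is undefined behaviour.  Some builds happen to see
     * zero here; others (e.g. a different compiler/runtime) do not, which is
     * why the bug only showed up with one toolchain. */
    struct toy_item *it = malloc(sizeof *it);
    if (it == NULL) {
        return 1;
    }
    if (it->num_children == 0) {   /* valgrind: "depends on uninitialised value(s)" */
        printf("looks empty\n");
    }

    /* The fix pattern: initialize every member up front, which is what the
     * OBJ_CONSTRUCT call above achieves for the real sentinel item. */
    it->num_children = 0;

    free(it);
    return 0;
}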

In addition, I checked how lama handles the hierarchy inversion.
It did not work on node04, which has the inversion, and it worked on
node09, which has the normal hierarchy. Please forward this information
to the lama developers.
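
For reference, a quick way to compare the two node topologies is hwloc's lstopo; the commands below are only a sketch (the exact output layout will vary), but the relative nesting of the topology levels should show the inversion directly:

[mishima@node09 ~]$ lstopo --no-io     # the node with the normal hierarchy
[mishima@node04 ~]$ lstopo --no-io     # the node with the inverted hierarchy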

Regards,
Tetsuya Mishima

qsub: job 8380.manage.cluster completed
[mishima@manage openmpi-1.7.4rc2r30069]$ qsub -I -l nodes=4:ppn=8
qsub: waiting for job 8381.manage.cluster to start
qsub: job 8381.manage.cluster ready

[mishima@node09 ~]$ cd ~/Desktop/openmpi-1.7/demos/
[mishima@node09 demos]$ mpirun -np 2 -report-bindings -mca rmaps lama -mca rmaps_lama_bind 1N -mca rmaps_lama_map Ncsbnh myprog
[node09.cluster:20144] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
[node09.cluster:20144] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
Hello world from process 1 of 2
Hello world from process 0 of 2
[mishima@node09 demos]$


qsub: job 8383.manage.cluster completed
[mishima@manage openmpi-1.7.4rc2r30069]$ qsub -I -l nodes=1:ppn=32
qsub: waiting for job 8384.manage.cluster to start
qsub: job 8384.manage.cluster ready

[mishima@node04 ~]$ cd ~/Desktop/openmpi-1.7/demos/
[mishima@node04 demos]$ mpirun -np 2 -report-bindings -mca rmaps lama -mca rmaps_lama_bind 1N -mca rmaps_lama_map Ncsbnh myprog
--
RMaps LAMA detected that there are not enough resources to map the
remainder of the job. Check the command line options, and the number of
nodes allocated to this job.
 Application Context : 0
 # of Processes Successfully Mapped: 0
 # of Processes Requested  : 2
 Mapping  : Ncsbnh
 Binding  : 1N
 MPPR : [Not Provided]
 Ordering : s
--
[node04.cluster:20298] [[21003,0],0] ORTE_ERROR_LOG: Error in file
rmaps_lama_module.c at line 309

[node04.cluster:20298] [[21003,0],0] ORTE_ERROR_LOG: Error in file
base/rmaps_base_map_job.c at line 217

> Deeply appreciate all your help! Your fix looks reasonable to me and is
> the kind of difference we frequently see between compilers and
> environments, which is why initializing variables is so important.
> This one apparently slipped by the lama developers.
>
> I'll apply to trunk and cmr it across to 1.7.4.
>
> Thanks again
> Ralph
>
> On Dec 25, 2013, at 3:39 AM, tmish...@jcity.maeda.co.jp wrote:
>
> >
> >
> > Hi Ralph,
> >
> > I ran valgrind and found uninitialised-value errors. All of them
> > occurred in opal_tree_add_child, as shown at the bottom. As a quick
> > fix, I put one line in "opal_tree.c", although it's not elegant:
> >
> > void opal_tree_init(opal_tree_t *tree, opal_tree_comp_fn_t comp,
> >                     opal_tree_item_serialize_fn_t serialize,
> >                     opal_tree_item_deserialize_fn_t deserialize,
> >                     opal_tree_get_key_fn_t get_key)
> > {
> >     tree->comp = comp;
> >     tree->serialize = serialize;
> >     tree->deserialize = deserialize;
> >     tree->get_key = get_key;
> >     opal_tree_get_root(tree)->opal_tree_num_children = 0; /* added by tmishima */
> > }
> >
> > Then, these errors all disappeared and openmpi with lama worked fine.
> > As I told you before, I built openmpi with PGI 13.10. As far as I
> > checked, no error was detected by valgrind with openmpi built by the
> > GNU compiler. Therefore, it might depend on the compiler...
> > Anyway, I would like to ask you (or the openmpi team) to continue
> > investigating this further.
> >
> > Regards,
> > Tetsuya Mishima

Re: [OMPI users] "-bind-to numa" of openmpi-1.7.4rc1 doesn't work for our Magny-Cours-based 32-core node

2013-12-25 Thread Ralph Castain
Deeply appreciate all your help! Your fix looks reasonable to me and is the kind 
of difference we frequently see between compilers and environments, which is 
why initializing variables is so important. This one apparently slipped by the 
lama developers.

I'll apply to trunk and cmr it across to 1.7.4.

Thanks again
Ralph

On Dec 25, 2013, at 3:39 AM, tmish...@jcity.maeda.co.jp wrote:

> 
> 
> Hi Ralph,
> 
> I ran valgrind and found uninitialised-value errors. All of them
> occurred in opal_tree_add_child, as shown at the bottom. As a quick
> fix, I put one line in "opal_tree.c", although it's not elegant:
> 
> void opal_tree_init(opal_tree_t *tree, opal_tree_comp_fn_t comp,
>                     opal_tree_item_serialize_fn_t serialize,
>                     opal_tree_item_deserialize_fn_t deserialize,
>                     opal_tree_get_key_fn_t get_key)
> {
>     tree->comp = comp;
>     tree->serialize = serialize;
>     tree->deserialize = deserialize;
>     tree->get_key = get_key;
>     opal_tree_get_root(tree)->opal_tree_num_children = 0; /* added by tmishima */
> }
> 
> Then, these errors all disappeared and openmpi with lama worked fine.
> As I told you before, I built openmpi with PGI 13.10. As far as I
> checked, no error was detected by valgrind with openmpi built by the
> GNU compiler. Therefore, it might depend on the compiler...
> Anyway, I would like to ask you (or the openmpi team) to continue
> investigating this further.
> 
> Regards,
> Tetsuya Mishima
> 
> valgrind -v --error-limit=no --leak-check=yes --show-reachable=no mpirun
> -np 1 -mca rmaps lama -report-bindings -mca rmaps_base_verbose 100
> --display-map ~/Desktop/openmpi-1.7/demos/myprog 2>&1 | tee valgrind.log
> 
> 
> ==27313== Conditional jump or move depends on uninitialised value(s)
> ==27313==at 0x4EC52A4: opal_tree_add_child (opal_tree.c:191)
> ==27313==by 0x81E3314: rmaps_lama_convert_hwloc_subtree
> (rmaps_lama_max_tree.c:320)
> ==27313==by 0x81E321D: rmaps_lama_convert_hwloc_tree_to_opal_tree
> (rmaps_lama_max_tree.c:267)
> ==27313==by 0x81E2EE8: rmaps_lama_build_max_tree
> (rmaps_lama_max_tree.c:154)
> ==27313==by 0x81E0E58: orte_rmaps_lama_map_core
> (rmaps_lama_module.c:664)
> ==27313==by 0x81E02D7: orte_rmaps_lama_map (rmaps_lama_module.c:303)
> ==27313==by 0x4C6468B: orte_rmaps_base_map_job
> (rmaps_base_map_job.c:204)
> ==27313==by 0x4F094CC: event_process_active_single_queue (event.c:1366)
> ==27313==by 0x4F090D8: event_process_active (event.c:1434)
> ==27313==by 0x4F050FF: opal_libevent2021_event_base_loop (event.c:1645)
> ==27313==by 0x4079A6: orterun (orterun.c:1049)
> ==27313==by 0x40694A: main (main.c:13)
> .
> ==27313== Conditional jump or move depends on uninitialised value(s)
> ==27313==at 0x4EC52A4: opal_tree_add_child (opal_tree.c:191)
> ==27313==by 0x4EC5D0E: deserialize_add_tree_item (opal_tree.c:496)
> ==27313==by 0x4EC5578: opal_tree_deserialize (opal_tree.c:524)
> ==27313==by 0x4EC5609: opal_tree_dup (opal_tree.c:544)
> ==27313==by 0x81E2FF6: rmaps_lama_build_max_tree
> (rmaps_lama_max_tree.c:202)
> ==27313==by 0x81E0E58: orte_rmaps_lama_map_core
> (rmaps_lama_module.c:664)
> ==27313==by 0x81E02D7: orte_rmaps_lama_map (rmaps_lama_module.c:303)
> ==27313==by 0x4C6468B: orte_rmaps_base_map_job
> (rmaps_base_map_job.c:204)
> ==27313==by 0x4F094CC: event_process_active_single_queue (event.c:1366)
> ==27313==by 0x4F090D8: event_process_active (event.c:1434)
> ==27313==by 0x4F050FF: opal_libevent2021_event_base_loop (event.c:1645)
> ==27313==by 0x4079A6: orterun (orterun.c:1049)
> 
> ==27313== Conditional jump or move depends on uninitialised value(s)
> ==27313==at 0x4EC52A4: opal_tree_add_child (opal_tree.c:191)
> ==27313==by 0x4EC5D0E: deserialize_add_tree_item (opal_tree.c:496)
> ==27313==by 0x4EC5578: opal_tree_deserialize (opal_tree.c:524)
> ==27313==by 0x4EC5609: opal_tree_dup (opal_tree.c:544)
> ==27313==by 0x81E2FF6: ???
> ==27313==by 0x81E0E58: ???
> ==27313==by 0x81E02D7: ???
> ==27313==by 0x4C6468B: orte_rmaps_base_map_job
> (rmaps_base_map_job.c:204)
> ==27313==by 0x4F094CC: event_process_active_single_queue (event.c:1366)
> ==27313==by 0x4F090D8: event_process_active (event.c:1434)
> ==27313==by 0x4F050FF: opal_libevent2021_event_base_loop (event.c:1645)
> ==27313==by 0x4079A6: orterun (orterun.c:1049)
> .
> ==27313== Conditional jump or move depends on uninitialised value(s)
> ==27313==at 0x4EC52A4: opal_tree_add_child (opal_tree.c:191)
> ==27313==by 0x81E3314: ???
> ==27313==by 0x81E321D: ???
> ==27313==by 0x81E2EE8: ???
> ==27313==by 0x81E0E58: ???
> ==27313==by 0x81E02D7: ???
> ==27313==by 0x4C6468B: orte_rmaps_base_map_job
> (rmaps_base_map_job.c:204)
> ==27313==by 0x4F094CC: event_process_active_single_queue (event.c:1366)
> ==27313==by 0x4F090D8: event_process_active (event.c:1434)
> ==27313==by 0x4F050FF: op

Re: [OMPI users] [EXTERNAL] Re: Configuration for rendezvous and eager protocols: two-sided comm

2013-12-25 Thread Siddhartha Jana
Hi,

A few questions:

1.  In the following:
$ ompi_info  --param btl tcp | grep eager
 MCA btl: parameter "btl_tcp_rndv_eager_limit" (current value: <65536>, data source: default value)
          Size (in bytes) of "phase 1" fragment sent for all large messages (must be >= 0 and <= eager_limit)
 MCA btl: parameter "btl_tcp_eager_limit" (current value: <65536>, data source: default value)
          Messages smaller than this size (in bytes) will not use the RDMA pipeline protocol.  Instead, they will be split into fragments of max_send_size and sent using send/receive semantics (must be >= 0, and is automatically adjusted up to at least (eager_limit+btl_rdma_pipeline_send_length); only relevant when the PUT flag is set)

1.1. What is the meaning of "phase 1" fragment?
1.2. Is my understanding correct that the btl_*_eager_limit parameters are
applicable only in the case of one-sided communication?
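
For context, here is a minimal sketch of how these limits can be inspected and overridden at run time (the parameter names are the ones printed above; "myprog" and the byte values are placeholders, not recommendations):

$ ompi_info --param btl tcp | grep eager
$ ompi_info --param btl sm  | grep eager
$ mpirun -np 2 -mca btl_tcp_eager_limit 131072 -mca btl_sm_eager_limit 8192 myprog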


As always, thanks for the help,
Season's greetings

-- Sid





On 16 December 2013 14:36, Jeff Squyres (jsquyres) wrote:

> Everything that Brian said, plus: note that the MCA param that Christoph
> mentioned is specifically for the "sm" (shared memory) transport.  Each
> transport has its own set of MCA params (e.g., mca_btl_tcp_eager_limit,
> and friends).
>
>
> On Dec 16, 2013, at 3:19 PM, "Barrett, Brian W" 
> wrote:
>
> > Siddhartha -
> >
> > Christoph mentioned how to change the cross-over for shared memory, but
> it's really per-transport (so you'd have to change it for your off-node
> transport as well).  That's all in the FAQ you mentioned, so hopefully you
> can take it from there.  Note that, in general, moving the eager limits has
> some unintended side effects.  For example, it can cause more / less
> copies.  It can also greatly increase memory usage.
> >
> > Good luck,
> >
> > Brian
> >
> > On 12/16/13 1:49 AM, "Siddhartha Jana" 
> wrote:
> >
> >> Thanks Christoph.
> >> I should have looked into the FAQ section on MCA params setting @ :
> >> http://www.open-mpi.org/faq/?category=tuning#available-mca-params
> >>
> >> Thanks again,
> >> -- Siddhartha
> >>
> >>
> >> On 16 December 2013 02:41, Christoph Niethammer 
> wrote:
> >>> Hi Siddhartha,
> >>>
> >>> MPI_Send/Recv in Open MPI implement both protocols and choose, based
> >>> on the message size, which one to use.
> >>> You can use the MCA parameter "btl_sm_eager_limit" to modify this
> >>> behaviour.
> >>>
> >>> Here is the corresponding info obtained from the ompi_info tool:
> >>>
> >>> "btl_sm_eager_limit" (current value: <4096>, data source: default value)
> >>> Maximum size (in bytes) of "short" messages (must be >= 1)
> >>>
> >>> Regards
> >>> Christoph Niethammer
> >>>
> >>> --
> >>>
> >>> Christoph Niethammer
> >>> High Performance Computing Center Stuttgart (HLRS)
> >>> Nobelstrasse 19
> >>> 70569 Stuttgart
> >>>
> >>> Tel: ++49(0)711-685-87203
> >>> email: nietham...@hlrs.de
> >>> http://www.hlrs.de/people/niethammer
> >>>
> >>>
> >>>
> >>> - Original Message -
> >>> From: "Siddhartha Jana" 
> >>> To: "OpenMPI users mailing list" 
> >>> Sent: Saturday, 14 December 2013 13:44:12
> >>> Subject: [OMPI users] Configuration for rendezvous and eager protocols: two-sided comm
> >>>
> >>>
> >>>
> >>> Hi
> >>>
> >>>
> >>> In OpenMPI, are MPI_Send, MPI_Recv (and friends) implemented using
> rendezvous protocol or eager protocol?
> >>>
> >>>
> >>> If both, is there a way to choose one or the other during runtime or
> while building the library?
> >>>
> >>>
> >>> If there is a threshold of the message size that dictates the protocol
> to be used, is there a way I can alter that threshold value?
> >>>
> >>>
> >>> If different protocols were used for different versions of the library
> in the past, could someone please direct me to the exact version numbers of
> the implementations that used one or the other protocol?
> >>>
> >>>
> >>> Thanks a lot,
> >>> Siddhartha
> >>
> >
> >
> > --
> >   Brian W. Barrett
> >   Scalable System Software Group
> >   Sandia National Laboratories
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>


Re: [OMPI users] "-bind-to numa" of openmpi-1.7.4rc1 doesn't work for our Magny-Cours-based 32-core node

2013-12-25 Thread tmishima


Hi Ralph,

I ran valgrind and found uninitialised-value errors. All of them
occurred in opal_tree_add_child, as shown at the bottom. As a quick
fix, I put one line in "opal_tree.c", although it's not elegant:

void opal_tree_init(opal_tree_t *tree, opal_tree_comp_fn_t comp,
                    opal_tree_item_serialize_fn_t serialize,
                    opal_tree_item_deserialize_fn_t deserialize,
                    opal_tree_get_key_fn_t get_key)
{
    tree->comp = comp;
    tree->serialize = serialize;
    tree->deserialize = deserialize;
    tree->get_key = get_key;
    opal_tree_get_root(tree)->opal_tree_num_children = 0; /* added by tmishima */
}

Then, these errors all disappeared and openmpi with lama worked fine.
As I told you before, I built openmpi with PGI 13.10. As far as I
checked, no error was detected by valgrind with openmpi built by the
GNU compiler. Therefore, it might depend on the compiler...
Anyway, I would like to ask you (or the openmpi team) to continue
investigating this further.
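
For completeness, here is a sketch of how the two installations could be built side by side for such a comparison (the install prefixes are placeholders; CXX and FC would be set analogously for a full build):

$ ./configure CC=pgcc --prefix=$HOME/openmpi-pgi && make all install
$ ./configure CC=gcc  --prefix=$HOME/openmpi-gnu && make all install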

Regards,
Tetsuya Mishima

valgrind -v --error-limit=no --leak-check=yes --show-reachable=no mpirun
-np 1 -mca rmaps lama -report-bindings -mca rmaps_base_verbose 100
--display-map ~/Desktop/openmpi-1.7/demos/myprog 2>&1 | tee valgrind.log


==27313== Conditional jump or move depends on uninitialised value(s)
==27313==at 0x4EC52A4: opal_tree_add_child (opal_tree.c:191)
==27313==by 0x81E3314: rmaps_lama_convert_hwloc_subtree
(rmaps_lama_max_tree.c:320)
==27313==by 0x81E321D: rmaps_lama_convert_hwloc_tree_to_opal_tree
(rmaps_lama_max_tree.c:267)
==27313==by 0x81E2EE8: rmaps_lama_build_max_tree
(rmaps_lama_max_tree.c:154)
==27313==by 0x81E0E58: orte_rmaps_lama_map_core
(rmaps_lama_module.c:664)
==27313==by 0x81E02D7: orte_rmaps_lama_map (rmaps_lama_module.c:303)
==27313==by 0x4C6468B: orte_rmaps_base_map_job
(rmaps_base_map_job.c:204)
==27313==by 0x4F094CC: event_process_active_single_queue (event.c:1366)
==27313==by 0x4F090D8: event_process_active (event.c:1434)
==27313==by 0x4F050FF: opal_libevent2021_event_base_loop (event.c:1645)
==27313==by 0x4079A6: orterun (orterun.c:1049)
==27313==by 0x40694A: main (main.c:13)
.
==27313== Conditional jump or move depends on uninitialised value(s)
==27313==at 0x4EC52A4: opal_tree_add_child (opal_tree.c:191)
==27313==by 0x4EC5D0E: deserialize_add_tree_item (opal_tree.c:496)
==27313==by 0x4EC5578: opal_tree_deserialize (opal_tree.c:524)
==27313==by 0x4EC5609: opal_tree_dup (opal_tree.c:544)
==27313==by 0x81E2FF6: rmaps_lama_build_max_tree
(rmaps_lama_max_tree.c:202)
==27313==by 0x81E0E58: orte_rmaps_lama_map_core
(rmaps_lama_module.c:664)
==27313==by 0x81E02D7: orte_rmaps_lama_map (rmaps_lama_module.c:303)
==27313==by 0x4C6468B: orte_rmaps_base_map_job
(rmaps_base_map_job.c:204)
==27313==by 0x4F094CC: event_process_active_single_queue (event.c:1366)
==27313==by 0x4F090D8: event_process_active (event.c:1434)
==27313==by 0x4F050FF: opal_libevent2021_event_base_loop (event.c:1645)
==27313==by 0x4079A6: orterun (orterun.c:1049)

==27313== Conditional jump or move depends on uninitialised value(s)
==27313==at 0x4EC52A4: opal_tree_add_child (opal_tree.c:191)
==27313==by 0x4EC5D0E: deserialize_add_tree_item (opal_tree.c:496)
==27313==by 0x4EC5578: opal_tree_deserialize (opal_tree.c:524)
==27313==by 0x4EC5609: opal_tree_dup (opal_tree.c:544)
==27313==by 0x81E2FF6: ???
==27313==by 0x81E0E58: ???
==27313==by 0x81E02D7: ???
==27313==by 0x4C6468B: orte_rmaps_base_map_job
(rmaps_base_map_job.c:204)
==27313==by 0x4F094CC: event_process_active_single_queue (event.c:1366)
==27313==by 0x4F090D8: event_process_active (event.c:1434)
==27313==by 0x4F050FF: opal_libevent2021_event_base_loop (event.c:1645)
==27313==by 0x4079A6: orterun (orterun.c:1049)
.
==27313== Conditional jump or move depends on uninitialised value(s)
==27313==at 0x4EC52A4: opal_tree_add_child (opal_tree.c:191)
==27313==by 0x81E3314: ???
==27313==by 0x81E321D: ???
==27313==by 0x81E2EE8: ???
==27313==by 0x81E0E58: ???
==27313==by 0x81E02D7: ???
==27313==by 0x4C6468B: orte_rmaps_base_map_job
(rmaps_base_map_job.c:204)
==27313==by 0x4F094CC: event_process_active_single_queue (event.c:1366)
==27313==by 0x4F090D8: event_process_active (event.c:1434)
==27313==by 0x4F050FF: opal_libevent2021_event_base_loop (event.c:1645)
==27313==by 0x4079A6: orterun (orterun.c:1049)
==27313==by 0x40694A: main (main.c:13)



> Hi Ralph,
>
> Here is the output when I put "-mca rmaps_base_verbose 10 --display-map",
> and where it stopped (by gdb), which shows it stopped in a function of
> lama.
>
> I usually use PGI 13.10, so I tried changing to the GNU compiler.
> Then it works, so this problem depends on the compiler.
>
> That's all I could find today.
>
> Regards,
> Tetsuya Mishima
>
> [mishima@manage ~]$ gdb
> GNU gdb (GDB) CentOS (7.0.1-42.el5.