[OMPI users] 1.7.5rc1, error "COLL-ML ml_discover_hierarchy exited with error."

2014-03-03 Thread Filippo Spiga
Dear Open MPI developers,

I hit an unexpected error running the OSU osu_alltoall benchmark with Open MPI 
1.7.5rc1. Here is the error:

$ mpirun -np 4 --map-by ppr:1:socket -bind-to core osu_alltoall 
In bcol_comm_query hmca_bcol_basesmuma_allocate_sm_ctl_memory failed 
In bcol_comm_query hmca_bcol_basesmuma_allocate_sm_ctl_memory failed 
[tesla50][[6927,1],1][../../../../../ompi/mca/coll/ml/coll_ml_module.c:2996:mca_coll_ml_comm_query]
 COLL-ML ml_discover_hierarchy exited with error.

[tesla50:42200] In base_bcol_masesmuma_setup_library_buffers and mpool was not 
successfully setup!
[tesla50][[6927,1],0][../../../../../ompi/mca/coll/ml/coll_ml_module.c:2996:mca_coll_ml_comm_query]
 COLL-ML ml_discover_hierarchy exited with error.

[tesla50:42201] In base_bcol_masesmuma_setup_library_buffers and mpool was not 
successfully setup!
# OSU MPI All-to-All Personalized Exchange Latency Test v4.2
# Size   Avg Latency(us)
--
mpirun noticed that process rank 3 with PID 4508 on node tesla51 exited on 
signal 11 (Segmentation fault).
--
2 total processes killed (some possibly by mpirun during cleanup)

Any idea where this comes from?

I compiled Open MPI using Intel 12.1, the latest Mellanox stack and CUDA 6.0 RC.
The outputs captured from configure, make and the run are attached. The configure command was:

export MXM_DIR=/opt/mellanox/mxm
export KNEM_DIR=$(find /opt -maxdepth 1 -type d -name "knem*" -print0)
export FCA_DIR=/opt/mellanox/fca
export HCOLL_DIR=/opt/mellanox/hcoll

../configure CC=icc CXX=icpc F77=ifort FC=ifort FFLAGS="-xSSE4.2 -axAVX -ip -O3 
-fno-fnalias" FCFLAGS="-xSSE4.2 -axAVX -ip -O3 -fno-fnalias" --prefix=<...>  
--enable-mpirun-prefix-by-default --with-fca=$FCA_DIR --with-mxm=$MXM_DIR 
--with-knem=$KNEM_DIR  --with-cuda=$CUDA_INSTALL_PATH 
--enable-mpi-thread-multiple --with-hwloc=internal --with-verbs 2>&1 | tee 
config.out
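
One small note on the KNEM_DIR line above: -print0 inside $(...) terminates the path with a NUL byte, which the shell drops (newer bash warns about it), and if more than one knem directory matches the paths end up concatenated with no separator. A minimal alternative, assuming a single knem install under /opt:

export KNEM_DIR=$(find /opt -maxdepth 1 -type d -name "knem*" -print -quit)   # newline-terminated, stop at the first match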


Thanks in advance,
Regards

Filippo

--
Mr. Filippo SPIGA, M.Sc.
http://www.linkedin.com/in/filippospiga ~ skype: filippo.spiga

«Nobody will drive us out of Cantor's paradise.» ~ David Hilbert




openmpi-1.7.5rc1_wrong.tar.gz
Description: GNU Zip compressed data


Re: [OMPI users] 1.7.5rc1, error "COLL-ML ml_discover_hierarchy exited with error."

2014-03-03 Thread Rolf vandeVaart
Can you try running with --mca coll ^ml and see if things work? 

Rolf
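
If it helps, the same exclusion can also be made persistent instead of being passed on every mpirun line, via Open MPI's usual MCA parameter mechanisms; a minimal sketch (values as in the command above):

# per shell session
export OMPI_MCA_coll=^ml
mpirun -np 4 --map-by ppr:1:socket -bind-to core osu_alltoall

# or per user, via the MCA parameter file
echo "coll = ^ml" >> $HOME/.openmpi/mca-params.conf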




Re: [OMPI users] 1.7.5rc1, error "COLL-ML ml_discover_hierarchy exited with error."

2014-03-03 Thread Filippo Spiga
Dear Rolf,

your suggestion works!

$ mpirun -np 4 --map-by ppr:1:socket -bind-to core  --mca coll ^ml osu_alltoall
# OSU MPI All-to-All Personalized Exchange Latency Test v4.2
# Size                Avg Latency(us)
1                       8.02
2                       2.96
4                       2.91
8                       2.91
16                      2.96
32                      3.07
64                      3.25
128                     3.74
256                     3.85
512                     4.11
1024                    4.79
2048                    5.91
4096                   15.84
8192                   24.88
16384                  35.35
32768                  56.20
65536                  66.88
131072                114.89
262144                209.36
524288                396.12
1048576               765.65


Can you clarify exactly where the problem comes from?

Regards,
Filippo


On Mar 4, 2014, at 12:17 AM, Rolf vandeVaart  wrote:
> Can you try running with --mca coll ^ml and see if things work? 
> 
> Rolf

Re: [OMPI users] 1.7.5rc1, error "COLL-ML ml_discover_hierarchy exited with error."

2014-03-03 Thread Rolf vandeVaart
There is something going wrong with the ml collective component, so if you 
disable it, things work.
I just reconfigured without any CUDA-aware support and I see the same failure, 
so it has nothing to do with CUDA.

Looks like Jeff Squyres just filed a bug report for it:

https://svn.open-mpi.org/trac/ompi/ticket/4331
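
For anyone hitting the same failure, a quick way to check which coll components a build provides and which ones get selected (the verbosity parameter below is a sketch; the exact name and output format can differ between releases):

# list the coll components compiled into the build
ompi_info | grep "MCA coll"

# re-run with collective component selection traced
mpirun -np 4 --mca coll ^ml --mca coll_base_verbose 10 osu_alltoall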


