Looks to me like the warning message says it all - the problem is in
openib.

The reason we took this action was to force the problems to the surface
across the code base so that people would address them. We've tried before
to just ask people to set the right flags to enable async progress and fix
things, but nobody ever does it. Hence this action.

So please investigate the openib BTL and see what needs to be done. I'll
poke Nathan in a couple of hours as well.

Thanks
Ralph



On Wed, Jun 25, 2014 at 6:28 AM, Mike Dubman <mi...@dev.mellanox.co.il>
wrote:

> tried with vader - same crash
>
> *14:14:22* [vegas12:32068] 7 more processes have sent help message help-mca-var.txt / deprecated-mca-env
> *14:14:22* [vegas12:32068] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> *14:14:22* + LD_LIBRARY_PATH=/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/oshm_install2/lib
> *14:14:22* + OMPI_MCA_scoll_fca_enable=1
> *14:14:22* + OMPI_MCA_scoll_fca_np=0
> *14:14:22* + OMPI_MCA_pml=ob1
> *14:14:22* + OMPI_MCA_btl=vader,self,openib
> *14:14:22* + OMPI_MCA_spml=yoda
> *14:14:22* + OMPI_MCA_memheap_mr_interleave_factor=8
> *14:14:22* + OMPI_MCA_memheap=ptmalloc
> *14:14:22* + OMPI_MCA_btl_openib_if_include=mlx4_0:1
> *14:14:22* + OMPI_MCA_rmaps_base_dist_hca=mlx4_0
> *14:14:22* + OMPI_MCA_memheap_base_hca_name=mlx4_0
> *14:14:22* + OMPI_MCA_rmaps_base_mapping_policy=dist:mlx4_0
> *14:14:22* + MXM_RDMA_PORTS=mlx4_0:1
> *14:14:22* + SHMEM_SYMMETRIC_HEAP_SIZE=1024M
> *14:14:22* + timeout -s SIGSEGV 3m /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/oshm_install2/bin/oshrun -np 8 /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/examples/hello_shmem
> *14:14:22* [vegas12][[4652,1],1][btl_openib_component.c:909:device_destruct] Failed to cancel OpenIB progress thread
> *14:14:22* [vegas12][[4652,1],0][btl_openib_component.c:909:device_destruct] Failed to cancel OpenIB progress thread
> *14:14:22* --------------------------------------------------------------------------
> *14:14:22* WARNING: The openib BTL was directed to use "eager RDMA" for short
> *14:14:22* messages, but the openib BTL was compiled with progress threads
> *14:14:22* support.  Short eager RDMA is not yet supported with progress threads;
> *14:14:22* its use has been disabled in this job.
> *14:14:22*
> *14:14:22* This is a warning only; you job will attempt to continue.
> *14:14:22* --------------------------------------------------------------------------
> *14:14:22* [vegas12][[4652,1],5][btl_openib_component.c:909:device_destruct] Failed to cancel OpenIB progress thread
> *14:14:22* [vegas12:32108] *** Process received signal ***
> *14:14:22* [vegas12:32108] Signal: Segmentation fault (11)
> *14:14:22* [vegas12:32108] Signal code: Address not mapped (1)
> *14:14:22* [vegas12:32108] Failing at address: (nil)
> *14:14:22* [vegas12:32108] [ 0] /lib64/libpthread.so.0[0x3937c0f500]
> *14:14:22* [vegas12:32108] [ 1] /usr/lib64/libibverbs.so.1(ibv_destroy_comp_channel+0x16)[0x3b7760bf46]
> *14:14:22* [vegas12:32108] [ 2] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/oshm_install2/lib/openmpi/mca_btl_openib.so(+0xdf02)[0x7ffff3fc1f02]
> *14:14:22* [vegas12:32108] [ 3] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/oshm_install2/lib/openmpi/mca_btl_openib.so(+0xf161)[0x7ffff3fc3161]
> *14:14:22* [vegas12:32108] [ 4] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/oshm_install2/lib/openmpi/mca_btl_openib.so(+0x12ab1)[0x7ffff3fc6ab1]
> *14:14:22* [vegas12:32108] [ 5] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/oshm_install2/lib/libmpi.so.0(mca_btl_base_select+0x117)[0x7ffff7a29807]
> *14:14:22* [vegas12:32108] [ 6] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/oshm_install2/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x12)[0x7ffff41ed7e2]
> *14:14:22* [vegas12:32108] [ 7] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/oshm_install2/lib/libmpi.so.0(mca_bml_base_init+0x99)[0x7ffff7a29009]
> *14:14:22* [vegas12:32108] [ 8] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/oshm_install2/lib/openmpi/mca_pml_ob1.so(+0x58b5)[0x7ffff35848b5]
> *14:14:22* [vegas12:32108] [ 9] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/oshm_install2/lib/libmpi.so.0(mca_pml_base_select+0x1e0)[0x7ffff7a3c590]
> *14:14:22* [vegas12:32108] [10] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/oshm_install2/lib/libmpi.so.0(ompi_mpi_init+0x455)[0x7ffff7a06bf5]
> *14:14:22* [vegas12:32108] [11] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/oshm_install2/lib/liboshmem.so.0(oshmem_shmem_init+0xfd)[0x7ffff7ca66dd]
> *14:14:22* [vegas12:32108] [12] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/oshm_install2/lib/liboshmem.so.0(shmem_init+0x28)[0x7ffff7ca9328]
> *14:14:22* [vegas12:32108] [13] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/examples/hello_shmem[0x40077d]
> *14:14:22* [vegas12:32108] [14] /lib64/libc.so.6(__libc_start_main+0xfd)[0x393741ecdd]
> *14:14:22* [vegas12:32108] [15] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/examples/hello_shmem[0x4006a9]
> *14:14:22* [vegas12:32108] *** End of error message ***
> *14:14:22* [vegas12:32112] *** Process received signal ***
> *14:14:22* [vegas12:32112] Signal: Segmentation fault (11)
> *14:14:*
>
>
>
> On Wed, Jun 25, 2014 at 9:11 AM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
>> Mike,
>>
>> could you try again with
>>
>> OMPI_MCA_btl=vader,self,openib
>>
>> it seems the sm module causes a hang
>> (after which the timeout fires and sends a SIGSEGV)
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/06/25 14:22, Mike Dubman wrote:
>> > Hi,
>> > The following commit broke trunk in jenkins:
>> >
>> >>>> Per the OMPI developer conference, remove the last vestiges of
>> > OMPI_USE_PROGRESS_THREADS
>> >
>> > *22:15:09* + LD_LIBRARY_PATH=/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/oshm_install2/lib
>> > *22:15:09* + OMPI_MCA_scoll_fca_enable=1
>> > *22:15:09* + OMPI_MCA_scoll_fca_np=0
>> > *22:15:09* + OMPI_MCA_pml=ob1
>> > *22:15:09* + OMPI_MCA_btl=sm,self,openib
>> > *22:15:09* + OMPI_MCA_spml=yoda
>> > *22:15:09* + OMPI_MCA_memheap_mr_interleave_factor=8
>> > *22:15:09* + OMPI_MCA_memheap=ptmalloc
>> > *22:15:09* + OMPI_MCA_btl_openib_if_include=mlx4_0:1
>> > *22:15:09* + OMPI_MCA_rmaps_base_dist_hca=mlx4_0
>> > *22:15:09* + OMPI_MCA_memheap_base_hca_name=mlx4_0
>> > *22:15:09* + OMPI_MCA_rmaps_base_mapping_policy=dist:mlx4_0
>> > *22:15:09* + MXM_RDMA_PORTS=mlx4_0:1
>> > *22:15:09* + SHMEM_SYMMETRIC_HEAP_SIZE=1024M
>> > *22:15:09* + timeout -s SIGSEGV 3m /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/oshm_install2/bin/oshrun -np 8 /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/examples/hello_shmem
>> > *22:15:09* [vegas12:08101] *** Process received signal ***
>> > *22:15:09* [vegas12:08101] Signal: Segmentation fault (11)
>> > *22:15:09* [vegas12:08101] Signal code: Address not mapped (1)
>> > *22:15:09* [vegas12:08101] Failing at address: (nil)
>> > *22:15:09* [vegas12:08101] [
>> >
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/06/15055.php
>>
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/06/15061.php
>
