As mentioned on the ticket, I cannot reproduce this on master (same version of vader) with netcdf master, HDF5 1.8.15, and gcc 4.8.2. The test runs to completion with both ompio and romio, and with both xpmem and no single-copy support. This could be a romio bug, as the version shipped in 1.8.4 lags behind master.
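In case it helps anyone else poking at this, below is a minimal stand-alone sketch (a toy case of mine, not the actual nc_test4/tst_nc4perf test) that drives MPI_File_write_at_all, the MPI-IO entry point that shows up in frames #11-#13 of the trace quoted below. Assuming the usual component-selection syntax, it can be built with mpicc and run as, e.g., "mpirun -np 4 --mca io romio ./write_at_all" versus "--mca io ompio", and with "--mca btl ^vader", to see which layer changes the behavior. It may not take exactly the same ADIOI aggregation path as the netcdf test, but it keeps the test surface small.

/* Toy reproducer sketch (assumption, not the netcdf test): every rank does a
 * collective write through MPI_File_write_at_all, the same MPI-IO entry point
 * seen in the quoted trace. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_File fh;
    MPI_Status status;
    char buf[2440];                 /* same count as frame #13, otherwise arbitrary */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    memset(buf, rank, sizeof(buf));

    MPI_File_open(MPI_COMM_WORLD, "write_at_all.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes its block at a distinct, non-overlapping offset. */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * sizeof(buf), buf,
                          (int)sizeof(buf), MPI_BYTE, &status);

    MPI_File_close(&fh);
    if (rank == 0)
        printf("collective write completed on %d ranks\n", size);
    MPI_Finalize();
    return 0;
}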
-Nathan

On Fri, Mar 13, 2015 at 09:43:14PM +0000, Jeff Squyres (jsquyres) wrote:
> https://github.com/open-mpi/ompi/issues/473 filed.
>
>
> > On Mar 13, 2015, at 4:28 PM, Orion Poplawski <or...@cora.nwra.com> wrote:
> >
> > That does appear to make it work. So I guess the issue is in the vader btl somewhere. FWIW I don't see any warning compiling the vader btl code.
> >
> > On 03/13/2015 01:08 PM, George Bosilca wrote:
> >> Do you have the same behavior when you disable the vader BTL? (--mca btl ^vader).
> >>
> >> George.
> >>
> >>
> >> On Fri, Mar 13, 2015 at 2:20 PM, Orion Poplawski <or...@cora.nwra.com> wrote:
> >>
> >> We currently have openmpi-1.8.4-99-20150228 built in Fedora Rawhide. I'm now seeing crashes/hangs when running the netcdf test suite on i686. Crashes include:
> >>
> >> [mock1:23702] *** An error occurred in MPI_Allreduce
> >> [mock1:23702] *** reported by process [3653173249,1]
> >> [mock1:23702] *** on communicator MPI COMMUNICATOR 7 DUP FROM 6
> >> [mock1:23702] *** MPI_ERR_IN_STATUS: error code in status
> >> [mock1:23702] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> >> [mock1:23702] *** and potentially your MPI job)
> >>
> >> and a similar one in MPI_Bcast.
> >>
> >> Hangs (100% cpu) seem to be in opal_condition_wait() -> opal_progress(), calling both mca_pml_ob1_progress and mca_btl_vader_component_progress.
> >>
> >> #0  mca_btl_vader_check_fboxes () at btl_vader_fbox.h:192
> >> #1  mca_btl_vader_component_progress () at btl_vader_component.c:694
> >> #2  0xf3971b69 in opal_progress () at runtime/opal_progress.c:187
> >> #3  0xf40b4695 in opal_condition_wait (c=<optimized out>, m=<optimized out>) at ../opal/threads/condition.h:78
> >> #4  ompi_request_default_wait_all (count=6, requests=<optimized out>, statuses=0x0) at request/req_wait.c:281
> >> #5  0xf28bb5e7 in ompi_coll_tuned_alltoall_intra_basic_linear (sbuf=sbuf@entry=0xf7a2d328, scount=scount@entry=1, sdtype=sdtype@entry=0xf4148240 <ompi_mpi_int>, rbuf=rbuf@entry=0xf7af1920, rcount=rcount@entry=1, rdtype=rdtype@entry=0xf4148240 <ompi_mpi_int>, comm=comm@entry=0xf7b051d8, module=module@entry=0xf7a2b4d0) at coll_tuned_alltoall.c:700
> >> #6  0xf28b4d08 in ompi_coll_tuned_alltoall_intra_dec_fixed (sbuf=0xf7a2d328, scount=1, sdtype=0xf4148240 <ompi_mpi_int>, rbuf=0xf7af1920, rcount=1, rdtype=0xf4148240 <ompi_mpi_int>, comm=0xf7b051d8, module=0xf7a2b4d0) at coll_tuned_decision_fixed.c:130
> >> #7  0xf40c7899 in PMPI_Alltoall (sendbuf=sendbuf@entry=0xf7a2d328, sendcount=sendcount@entry=1, sendtype=sendtype@entry=0xf4148240 <ompi_mpi_int>, recvbuf=recvbuf@entry=0xf7af1920, recvcount=recvcount@entry=1, recvtype=recvtype@entry=0xf4148240 <ompi_mpi_int>, comm=0xf7b051d8) at palltoall.c:111
> >> #8  0xe9780da0 in ADIOI_Calc_others_req (fd=fd@entry=0xf7b12640, count_my_req_procs=1, count_my_req_per_proc=0xf7a2d328, my_req=0xf7b00750, nprocs=4, myrank=0, count_others_req_procs_ptr=count_others_req_procs_ptr@entry=0xffbea6e8, others_req_ptr=others_req_ptr@entry=0xffbea6cc) at adio/common/ad_aggregate.c:453
> >> #9  0xe9796a14 in ADIOI_GEN_WriteStridedColl (fd=0xf7b12640, buf=0xf7aa0148, count=2440, datatype=0xf4148840 <ompi_mpi_byte>, file_ptr_type=100, offset=0, status=0xffbea8b8, error_code=0xffbea790) at adio/common/ad_write_coll.c:192
> >> #10 0xe97779e0 in MPIOI_File_write_all (fh=fh@entry=0xf7b12640, offset=offset@entry=0, file_ptr_type=file_ptr_type@entry=100, buf=buf@entry=0xf7aa0148, count=count@entry=2440, datatype=datatype@entry=0xf4148840 <ompi_mpi_byte>, myname=myname@entry=0xe97a9a1c <myname.9354> "MPI_FILE_WRITE_AT_ALL", status=status@entry=0xffbea8b8) at mpi-io/write_all.c:116
> >> #11 0xe9778176 in mca_io_romio_dist_MPI_File_write_at_all (fh=0xf7b12640, offset=offset@entry=0, buf=buf@entry=0xf7aa0148, count=count@entry=2440, datatype=datatype@entry=0xf4148840 <ompi_mpi_byte>, status=status@entry=0xffbea8b8) at mpi-io/write_atall.c:55
> >> #12 0xe9770bcc in mca_io_romio_file_write_at_all (fh=0xf7aa27c8, offset=0, buf=0xf7aa0148, count=2440, datatype=0xf4148840 <ompi_mpi_byte>, status=0xffbea8b8) at src/io_romio_file_write.c:61
> >> #13 0xf40ff3ce in PMPI_File_write_at_all (fh=0xf7aa27c8, offset=0, buf=buf@entry=0xf7aa0148, count=count@entry=2440, e=0xf4148840 <ompi_mpi_byte>, status=status@entry=0xffbea8b8) at pfile_write_at_all.c:75
> >> #14 0xf437a43c in H5FD_mpio_write (_file=_file@entry=0xf7b074a8, type=type@entry=H5FD_MEM_DRAW, dxpl_id=167772177, addr=31780, size=size@entry=2440, buf=buf@entry=0xf7aa0148) at ../../src/H5FDmpio.c:1840
> >> #15 0xf4375cd5 in H5FD_write (file=0xf7b074a8, dxpl=0xf7a47d20, type=H5FD_MEM_DRAW, addr=31780, size=size@entry=2440, buf=buf@entry=0xf7aa0148) at ../../src/H5FDint.c:245
> >> #16 0xf4360932 in H5F__accum_write (fio_info=fio_info@entry=0xffbea9d4, type=type@entry=H5FD_MEM_DRAW, addr=31780, size=size@entry=2440, buf=buf@entry=0xf7aa0148) at ../../src/H5Faccum.c:824
> >> #17 0xf436430c in H5F_block_write (f=0xf7a31860, type=type@entry=H5FD_MEM_DRAW, addr=31780, size=size@entry=2440, dxpl_id=167772177, buf=0xf7aa0148) at ../../src/H5Fio.c:170
> >> #18 0xf43413ee in H5D__mpio_select_write (io_info=0xffbeab60, type_info=0xffbeab1c, mpi_buf_count=2440, file_space=0x0, mem_space=0x0) at ../../src/H5Dmpio.c:296
> >> #19 0xf4341f33 in H5D__final_collective_io (mpi_buf_type=0xffbeaa7c, mpi_file_type=0xffbeaa78, mpi_buf_count=<optimized out>, type_info=0xffbeab1c, io_info=0xffbeab60) at ../../src/H5Dmpio.c:1444
> >> #20 H5D__inter_collective_io (mem_space=0xf7a38120, file_space=0xf7a55590, type_info=0xffbeab1c, io_info=0xffbeab60) at ../../src/H5Dmpio.c:1400
> >> #21 H5D__contig_collective_write (io_info=0xffbeab60, type_info=0xffbeab1c, nelmts=610, file_space=0xf7a55590, mem_space=0xf7a38120, fm=0xffbeace0) at ../../src/H5Dmpio.c:528
> >> #22 0xf433ae8d in H5D__write (buf=0xf7aa0148, dxpl_id=167772177, file_space=0xf7a55590, mem_space=0xf7a38120, mem_type_id=-140159600, dataset=0xf7a3eb40) at ../../src/H5Dio.c:787
> >> #23 H5D__pre_write (dset=dset@entry=0xf7a3eb40, direct_write=<optimized out>, mem_type_id=mem_type_id@entry=50331747, mem_space=mem_space@entry=0xf7a38120, file_space=0xf7a55590, dxpl_id=dxpl_id@entry=167772177, buf=buf@entry=0xf7aa0148) at ../../src/H5Dio.c:351
> >> #24 0xf433b74c in H5Dwrite (dset_id=83886085, mem_type_id=50331747, mem_space_id=mem_space_id@entry=67108867, file_space_id=file_space_id@entry=67108866, dxpl_id=dxpl_id@entry=167772177, buf=buf@entry=0xf7aa0148) at ../../src/H5Dio.c:270
> >> #25 0xf466b603 in nc4_put_vara (nc=0xf7a05c58, ncid=ncid@entry=65536, varid=varid@entry=3, startp=startp@entry=0xffbf6a08, countp=countp@entry=0xffbf6a10, mem_nc_type=mem_nc_type@entry=5, is_long=is_long@entry=0, data=data@entry=0xf7a07c40) at ../../libsrc4/nc4hdf.c:788
> >> #26 0xf4673c55 in nc4_put_vara_tc (mem_type_is_long=0, op=0xf7a07c40, countp=0xffbf6a10, startp=0xffbf6a08, mem_type=5, varid=3, ncid=65536) at ../../libsrc4/nc4var.c:1429
> >> #27 NC4_put_vara (ncid=65536, varid=3, startp=0xffbf6a08, countp=0xffbf6a10, op=0xf7a07c40, memtype=5) at ../../libsrc4/nc4var.c:1565
> >> #28 0xf460a377 in NC_put_vara (ncid=ncid@entry=65536, varid=varid@entry=3, start=start@entry=0xffbf6a08, edges=edges@entry=0xffbf6a10, value=value@entry=0xf7a07c40, memtype=memtype@entry=5) at ../../libdispatch/dvarput.c:79
> >> #29 0xf460b541 in nc_put_vara_float (ncid=65536, varid=3, startp=0xffbf6a08, countp=0xffbf6a10, op=0xf7a07c40) at ../../libdispatch/dvarput.c:655
> >> #30 0xf77d06ed in test_pio_2d (cache_size=67108864, facc_type=8192, access_flag=1, comm=0xf414d800 <ompi_mpi_comm_world>, info=0xf4154240 <ompi_mpi_info_null>, mpi_size=4, mpi_rank=0, chunk_size=0xffbf76f4) at ../../nc_test4/tst_nc4perf.c:96
> >> #31 0xf77cfdb1 in main (argc=1, argv=0xffbf7804) at ../../nc_test4/tst_nc4perf.c:299
> >>
> >> Any suggestions as to where to look next would be greatly appreciated.
> >>
> >> --
> >> Orion Poplawski
> >> Technical Manager                     303-415-9701 x222
> >> NWRA, Boulder/CoRA Office             FAX: 303-415-9702
> >> 3380 Mitchell Lane                    or...@nwra.com
> >> Boulder, CO 80301                     http://www.nwra.com
> >> _______________________________________________
> >> devel mailing list
> >> de...@open-mpi.org
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> Link to this post: http://www.open-mpi.org/community/lists/devel/2015/03/17131.php
> >>
> >>
> >> _______________________________________________
> >> devel mailing list
> >> de...@open-mpi.org
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> Link to this post: http://www.open-mpi.org/community/lists/devel/2015/03/17132.php
> >
> >
> > --
> > Orion Poplawski
> > Technical Manager                     303-415-9701 x222
> > NWRA, Boulder/CoRA Office             FAX: 303-415-9702
> > 3380 Mitchell Lane                    or...@nwra.com
> > Boulder, CO 80301                     http://www.nwra.com
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post: http://www.open-mpi.org/community/lists/devel/2015/03/17133.php
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2015/03/17134.php
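For reference, one way to get past the MPI_ERRORS_ARE_FATAL abort shown above and see the underlying error code is to install MPI_ERRORS_RETURN on MPI_COMM_WORLD before the parallel file is opened; communicators dup'ed from it, as HDF5 does internally, should inherit the handler. A rough sketch, assuming a plain MPI-IO write rather than the full netcdf path:

/* Rough sketch (assumption, not code from the netcdf suite): return MPI
 * errors to the caller instead of aborting, so the code behind
 * MPI_ERR_IN_STATUS can be translated and printed. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Status status;
    char buf[2440];
    char msg[MPI_MAX_ERROR_STRING];
    int rc, len;

    MPI_Init(&argc, &argv);

    /* Errors on MPI_COMM_WORLD (and, it is assumed, on communicators dup'ed
     * from it) are now reported as return codes rather than aborting. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    memset(buf, 0, sizeof(buf));
    rc = MPI_File_open(MPI_COMM_WORLD, "errcheck.out",
                       MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    if (rc == MPI_SUCCESS)
        rc = MPI_File_write_at_all(fh, 0, buf, (int)sizeof(buf), MPI_BYTE,
                                   &status);
    if (rc != MPI_SUCCESS) {
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "MPI-IO failure: %s\n", msg);
    } else {
        MPI_File_close(&fh);
    }

    MPI_Finalize();
    return 0;
}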