On Thu, May 14, 2009 at 10:47 AM, Terry Dontje <terry.don...@sun.com> wrote:
> Ralph Castain wrote:
>> Hi folks
>>
>> I encourage people to please look at your MTT outputs. As we are
>> preparing to roll the 1.3.3 release, I am seeing a lot of problems on
>> the branch:
>>
>> 1. timeouts, coming in two forms: (a) MPI_Abort hanging, and (b)
>> collectives hanging (this is mostly on Solaris)
>>
> Can you clarify or send me a link that makes you believe (b) is mostly
> Solaris? Looking at last night's Sun MTT 1.3 nightly runs, I see 47
> timeouts on Linux and 24 timeouts on Solaris. That doesn't constitute
> mostly Solaris to me. Also, how are you determining these timeouts are
> collective-based? I have a theory they are, but I don't have a clear
> smoking gun as of yet.

I looked at this MTT report, which showed hangs in a whole bunch of collective tests:

http://www.open-mpi.org/mtt/index.php?limit=&wrap=&trial=&enable_drilldowns=&yaxis_scale=&xaxis_scale=&hide_subtitle=&split_graphs=&remote_go=&do_cookies=&phase=test_run&text_start_timestamp=2009-05-13+15%3A15%3A25+-+2009-05-14+15%3A15%3A25&text_platform_hardware=^x86_64%24&show_platform_hardware=show&text_os_name=^Linux%24&show_os_name=show&text_mpi_name=^ompi-nightly-v1.3%24&show_mpi_name=show&text_mpi_version=^1.3.3a1r21173%24&show_mpi_version=show&text_suite_name=all&show_suite_name=show&text_test_name=all&show_test_name=hide&text_np=all&show_np=show&text_full_command=&show_full_command=show&text_http_username=^sun%24&show_http_username=show&text_local_username=all&show_local_username=hide&text_platform_name=^burl-ct-v20z-10%24&show_platform_name=show&click=Detail&phase=test_run&test_result=_rt&text_os_version=&show_os_version=&text_platform_type=&show_platform_type=&text_hostname=&show_hostname=&text_compiler_name=&show_compiler_name=&text_compiler_version=&show_compiler_version=&text_vpath_mode=&show_vpath_mode=&text_endian=&show_endian=&text_bitness=&show_bitness=&text_configure_arguments=&text_exit_value=&show_exit_value=&text_exit_signal=&show_exit_signal=&text_duration=&show_duration=&text_client_serial=&show_client_serial=&text_result_message=&text_result_stdout=&text_result_stderr=&text_environment=&text_description=&text_launcher=&show_launcher=&text_resource_mgr=&show_resource_mgr=&text_network=&show_network=&text_parameters=&show_parameters=&lastgo=summary

When I look at the hangs on other systems, they are in non-collective tests. I'm not sure what that really means, though - it was just an observation based on this one set of tests.

> I've been looking at some collective hangs and segv's. These seem to
> happen across different platforms and OSes (Linux and Solaris). I've
> been finding it really hard to reproduce. I ran MPI_Allreduce_loc_c on
> three clusters for 2 days without a hang or segv. I am really concerned
> whether we'll even be able to get this to fail with debugging on.
> I have not been able to get a core or time with a hung run in order to
> get more information.
>
>> 2. segfaults - mostly on sif, but occasionally elsewhere
>>
>> 3. daemon failed to report back - this was only on sif
>>
>> We will need to correct many of these for the release - unless it
>> proves to be due to trivial errors, I don't see how we will be ready
>> to roll release candidates next week.
>>
>> So let's please start taking a look at these?!
>>
> I've actually been looking at ours, though I have not been extremely
> vocal. I was hoping to get more info on our timeouts before requesting
> help.

No problem - I wasn't pointing a finger at anyone in particular.
Just wanted to highlight that the branch is not in great shape, since we had talked on the telecon about trying to do a release next week.

> Ralph
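
In case it helps anyone trying to reproduce the collective hang outside of MTT, below is roughly the kind of loop I assume MPI_Allreduce_loc_c is running - a MAXLOC reduction on value/rank pairs, iterated for a long time. This is just a sketch on my part, not the actual Intel test source:

/*
 * Hypothetical minimal reproducer (NOT the MPI_Allreduce_loc_c source):
 * a long-running MPI_Allreduce loop using the MAXLOC reduction on a
 * (double value, int rank) pair.  Launch with mpirun across several
 * nodes and let it spin; kill the job to stop it.
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    long iter;
    struct {
        double value;
        int    index;
    } in, out;                      /* layout matches MPI_DOUBLE_INT */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (iter = 0; ; ++iter) {
        in.value = (double)((rank + iter) % size);
        in.index = rank;

        /* The collective under suspicion: if this hangs, every rank
         * should be stuck here, which is what a stack trace of a hung
         * run would need to confirm. */
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC,
                      MPI_COMM_WORLD);

        if (rank == 0 && (iter % 100000) == 0) {
            printf("iteration %ld: max %f at rank %d\n",
                   iter, out.value, out.index);
            fflush(stdout);
        }
    }

    MPI_Finalize();                 /* never reached */
    return 0;
}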