[OMPI devel] OMPI Tuesday teleconf: US just changed time
FYI: the OMPI teleconf tomorrow is at the same time *in the US*: 11am US Eastern time.

If you're joining the call from outside the US, remember that the US changed time this past weekend. Please adjust your call-in time accordingly.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] C/R and orte_oob
On Fri, Mar 07, 2014 at 06:54:18AM -0800, Ralph Castain wrote:
> > If you like, I can define the required code in the trunk and let you
> > fill in the event functionality.
> > That would be great.
> >>>
> >>> Thanks for your changes. When using --with-ft there are a few compiler
> >>> errors which I tried to fix with following patch:
> >>>
> >>> https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=71521789ef9d248a7eef53030d2ec5de900faa4c
> >>
> >> That looks okay, with the only caveat being that you wouldn't ordinarily
> >> pass the state_caddy_t into a function. It's just there to pass along the
> >> job etc in case the callback function needs to reference something. In
> >> this case, I can't think of anything the FT event function would need to
> >> know - you just want it to quiet all messaging.
> >
> > I need to pass the type of state to the ft_event() functions:
> >
> > enum opal_crs_state_type_t {
> >     OPAL_CRS_NONE        = 0,
> >     OPAL_CRS_CHECKPOINT  = 1,
> >     OPAL_CRS_RESTART_PRE = 2,
> >     OPAL_CRS_RESTART     = 3, /* RESTART_POST */
> >
> > so an int is all I need. So I probably need to encode it into *cbdata. Do I
> > just use an int directly in *cbdata or should it be part of a struct?
>
> Why don't you define a job state for each of those, and then you can walk the
> state machine thru them if needed? That way the state caddy will already
> provide you with the state and you can just pass it to the functions.

Like this?

https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=79d6c8262bf809bb2f9ecc853d4a7a42a88654da

Adrian
Re: [OMPI devel] C/R and orte_oob
On Mar 10, 2014, at 1:29 PM, Adrian Reber wrote:

> On Fri, Mar 07, 2014 at 06:54:18AM -0800, Ralph Castain wrote:
>>> If you like, I can define the required code in the trunk and let you
>>> fill in the event functionality.
>>
>> That would be great.
>
> Thanks for your changes. When using --with-ft there are a few compiler
> errors which I tried to fix with following patch:
>
> https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=71521789ef9d248a7eef53030d2ec5de900faa4c

That looks okay, with the only caveat being that you wouldn't ordinarily pass the state_caddy_t into a function. It's just there to pass along the job etc in case the callback function needs to reference something. In this case, I can't think of anything the FT event function would need to know - you just want it to quiet all messaging.

>>>
>>> I need to pass the type of state to the ft_event() functions:
>>>
>>> enum opal_crs_state_type_t {
>>>     OPAL_CRS_NONE        = 0,
>>>     OPAL_CRS_CHECKPOINT  = 1,
>>>     OPAL_CRS_RESTART_PRE = 2,
>>>     OPAL_CRS_RESTART     = 3, /* RESTART_POST */
>>>
>>> so an int is all I need. So I probably need to encode it into *cbdata. Do I
>>> just use an int directly in *cbdata or should it be part of a struct?
>>
>> Why don't you define a job state for each of those, and then you can walk
>> the state machine thru them if needed? That way the state caddy will already
>> provide you with the state and you can just pass it to the functions.
>
> Like this?
>
> https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=79d6c8262bf809bb2f9ecc853d4a7a42a88654da

Yep! You can now use those states to sequence things as desired.

>
> Adrian
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/03/14315.php
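For readers following the thread, the idea of "one job state per FT event" can be sketched roughly as below. This is only an illustration, not the code from the linked commits: every name prefixed with `sketch_` or `SKETCH_` is hypothetical, the caddy struct is a stand-in that carries nothing but the job state, and the enum lists only the values quoted above (the real enum may define more).

```c
#include <assert.h>
#include <stddef.h>

/* Only the values quoted in the thread; the real enum may define more. */
typedef enum {
    OPAL_CRS_NONE        = 0,
    OPAL_CRS_CHECKPOINT  = 1,
    OPAL_CRS_RESTART_PRE = 2,
    OPAL_CRS_RESTART     = 3  /* RESTART_POST */
} opal_crs_state_type_t;

/* Hypothetical job states, one per FT event (names and values are
 * illustrative assumptions, not the ones used in the actual patch). */
typedef enum {
    SKETCH_JOB_STATE_FT_CHECKPOINT = 100,
    SKETCH_JOB_STATE_FT_RESTART_PRE,
    SKETCH_JOB_STATE_FT_RESTART
} sketch_job_state_t;

/* Hypothetical stand-in for the state caddy: here it only carries the
 * job state that triggered the callback. */
typedef struct {
    sketch_job_state_t job_state;
} sketch_state_caddy_t;

/* Records the last state received; stands in for a real ft_event(). */
int sketch_last_ft_state = OPAL_CRS_NONE;

int sketch_ft_event(int state)
{
    sketch_last_ft_state = state;
    return 0;
}

/* Map a (hypothetical) job state to the CRS state ft_event() expects. */
opal_crs_state_type_t sketch_job_state_to_crs(sketch_job_state_t s)
{
    switch (s) {
    case SKETCH_JOB_STATE_FT_CHECKPOINT:  return OPAL_CRS_CHECKPOINT;
    case SKETCH_JOB_STATE_FT_RESTART_PRE: return OPAL_CRS_RESTART_PRE;
    case SKETCH_JOB_STATE_FT_RESTART:     return OPAL_CRS_RESTART;
    }
    return OPAL_CRS_NONE;
}

/* State-machine callback: cbdata points at the caddy, which already
 * holds the state, so nothing extra has to be encoded into cbdata. */
void sketch_ft_event_cb(void *cbdata)
{
    sketch_state_caddy_t *caddy = (sketch_state_caddy_t *)cbdata;
    assert(caddy != NULL);
    (void)sketch_ft_event((int)sketch_job_state_to_crs(caddy->job_state));
}
```

The point of the design is that the callback signature stays a plain `void *cbdata`: the state travels in the caddy, and the FT event function itself still receives only an int.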
[OMPI devel] 1.7.5 status
Hi folks

Here's a quick status for our discussion tomorrow on 1.7.5:

MPI tests
* idx_null continues to fail
* other failures may have been fixed by CMRs that came in after my tests started

OSHMEM tests
* quite a few failures
* performance tests uniformly fail - finally had to just abort them and set them to be skipped from now on

Link to OSHMEM failures: http://mtt.open-mpi.org/index.php?do_redir=2156

Right now, I'm leaning toward releasing 1.7.5 with OSHMEM *disabled* by default, with a plan to enable OSHMEM by default once the code stabilizes and becomes less error prone.

Talk to you tomorrow
Ralph
[OMPI devel] onesided/test_acc2 failures
Nathan,

The onesided/test_acc2 test is failing in our Cisco MTT runs on the trunk and v1.7.5 branches:

8<
test_acc2 == Mon Mar 10 15:31:47 2014
Time per int accumulate 0.769040 microsecs
P0, Test No. 0, PASSED: accumulate performance Mon Mar 10 15:31:47 2014
test_acc2 == Mon Mar 10 15:31:47 2014
P7, Test No. 0, PASSED: multi-offset accumulate Mon Mar 10 15:31:47 2014
P0, Test No. 1 CHECK: accumulate self without permission, nfail=1, Mon Mar 10 15:31:47 2014
P0, Test No. 2 CHECK: accumulate self, nfail=1, Mon Mar 10 15:31:49 2014
P1, Test No. 0 CHECK: accumulate non-self, nfail=1, Mon Mar 10 15:31:51 2014
---
Primary job terminated normally, but 1 process returned a non-zero exit code.. Per user-direction, the job has been aborted.
---
--
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[30384,1],1]
Exit code: 16
--
8<

I've bisected from v1.7.4 to r30977, and the test begins failing around the time of r30894 (the big one-sided CMR from trunk):

https://svn.open-mpi.org/trac/ompi/changeset/30894

The build on the v1.7 branch was in the (inclusive) range r30893:r30896, so there's a slim chance that this is one of r30893, r30895, or r30896 instead, but given the test failure that seems unlikely.

Valgrind runs are not clean on this test, but they were insufficient to point me to a suspicious part of the code (there are some finalization bugs that should probably be fixed, but I don't think they are causing this issue). Sample VG messages and my (possibly flawed) interpretations are at the end of this mail.

Can you take a look? This is a regression from v1.7.4 as we're headed into v1.7.5.

Thanks,
-Dave

No idea here, but probably unrelated to the one-sided test failures:

==22608== Conditional jump or move depends on uninitialised value(s)
==22608==    at 0xA72E6B9: ml_init_k_nomial_trees (coll_ml_module.c:651)
==22608==    by 0xA7336C8: mca_coll_ml_tree_hierarchy_discovery (coll_ml_module.c:2162)
==22608==    by 0xA733D46: mca_coll_ml_fulltree_ptp_only_hierarchy_discovery (coll_ml_module.c:2333)
==22608==    by 0xA7314C9: ml_discover_hierarchy (coll_ml_module.c:1565)
==22608==    by 0xA735DCD: mca_coll_ml_comm_query (coll_ml_module.c:2992)
==22608==    by 0x4CD23AE: query_2_0_0 (coll_base_comm_select.c:395)
==22608==    by 0x4CD2372: query (coll_base_comm_select.c:378)
==22608==    by 0x4CD2285: check_one_component (coll_base_comm_select.c:340)
==22608==    by 0x4CD20D7: check_components (coll_base_comm_select.c:304)
==22608==    by 0x4CCAE11: mca_coll_base_comm_select (coll_base_comm_select.c:131)
==22608==    by 0x4C60B49: ompi_mpi_init (ompi_mpi_init.c:888)
==22608==    by 0x4C93AE2: PMPI_Init (pinit.c:84)
==22608== Uninitialised value was created by a heap allocation
==22608==    at 0x4A07844: malloc (vg_replace_malloc.c:291)
==22608==    by 0x4A079B8: realloc (vg_replace_malloc.c:687)
==22608==    by 0xA72F91B: get_new_subgroup_data (coll_ml_module.c:1044)
==22608==    by 0xA73292E: mca_coll_ml_tree_hierarchy_discovery (coll_ml_module.c:1939)
==22608==    by 0xA733D46: mca_coll_ml_fulltree_ptp_only_hierarchy_discovery (coll_ml_module.c:2333)
==22608==    by 0xA7314C9: ml_discover_hierarchy (coll_ml_module.c:1565)
==22608==    by 0xA735DCD: mca_coll_ml_comm_query (coll_ml_module.c:2992)
==22608==    by 0x4CD23AE: query_2_0_0 (coll_base_comm_select.c:395)
==22608==    by 0x4CD2372: query (coll_base_comm_select.c:378)
==22608==    by 0x4CD2285: check_one_component (coll_base_comm_select.c:340)
==22608==    by 0x4CD20D7: check_components (coll_base_comm_select.c:304)
==22608==    by 0x4CCAE11: mca_coll_base_comm_select (coll_base_comm_select.c:131)

Possibly related:

==22608== Syscall param writev(vector[...]) points to uninitialised byte(s)
==22608==    at 0x354BAE0C57: writev (in /lib64/libc-2.12.so)
==22608==    by 0x95F7CD2: mca_btl_tcp_frag_send (btl_tcp_frag.c:107)
==22608==    by 0x95F61CE: mca_btl_tcp_endpoint_send (btl_tcp_endpoint.c:261)
==22608==    by 0x95F1C08: mca_btl_tcp_send (btl_tcp.c:387)
==22608==    by 0x9CD4098: mca_bml_base_send (bml.h:276)
==22608==    by 0x9CD636F: mca_pml_ob1_send_request_start_prepare (pml_ob1_sendreq.c:650)
==22608==    by 0x9CC9B39: mca_pml_ob1_send_request_start_btl (pml_ob1_sendreq.h:388)
==22608==    by 0x9CC9DF1: mca_pml_ob1_send_request_start (pml_ob1_sendreq.h:461)
==22608==    by 0x9CCA4FA: mca_pml_ob1_isend (pml_ob1_isend.c:85)
==22608==    by 0x4C699BD: comm_allreduce_pml (allreduce.c:178)
==22608==    by 0xA73172C: ml_discover_hierarchy (coll_ml_module.c:1616)
==22608==    by 0xA735DCD: