[OMPI devel] OMPI Tuesday teleconf: US just changed time

2014-03-10 Thread Jeff Squyres (jsquyres)
FYI: the OMPI teleconf tomorrow is at the same time *in the US*: 11am US 
Eastern time.

If you're joining the call from outside the US, remember that the US changed 
time this past weekend.  Please adjust your call-in time accordingly.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] C/R and orte_oob

2014-03-10 Thread Adrian Reber
On Fri, Mar 07, 2014 at 06:54:18AM -0800, Ralph Castain wrote:
> >>>>> If you like, I can define the required code in the trunk and let you
> >>>>> fill in the event functionality.
> >>>>
> >>>> That would be great.
> >>> 
> >>> Thanks for your changes. When using --with-ft there are a few compiler
> >>> errors which I tried to fix with following patch:
> >>> 
> >>> https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=71521789ef9d248a7eef53030d2ec5de900faa4c
> >> 
> >> That looks okay, with the only caveat being that you wouldn't ordinarily 
> >> pass the state_caddy_t into a function. It's just there to pass along the 
> >> job etc in case the callback function needs to reference something. In 
> >> this case, I can't think of anything the FT event function would need to 
> >> know - you just want it to quiet all messaging.
> > 
> > I need to pass the type of state to the ft_event() functions:
> > 
> > enum opal_crs_state_type_t {
> >     OPAL_CRS_NONE        = 0,
> >     OPAL_CRS_CHECKPOINT  = 1,
> >     OPAL_CRS_RESTART_PRE = 2,
> >     OPAL_CRS_RESTART     = 3, /* RESTART_POST */
> > 
> > so an int is all I need. So I probably need to encode it into *cbdata. Do I
> > just use an int directly in *cbdata or should it be part of a struct?
> 
> Why don't you define a job state for each of those, and then you can walk the 
> state machine thru them if needed? That way the state caddy will already 
> provide you with the state and you can just pass it to the functions.

Like this?

https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=79d6c8262bf809bb2f9ecc853d4a7a42a88654da

Adrian


Re: [OMPI devel] C/R and orte_oob

2014-03-10 Thread Ralph Castain

On Mar 10, 2014, at 1:29 PM, Adrian Reber wrote:

> On Fri, Mar 07, 2014 at 06:54:18AM -0800, Ralph Castain wrote:
>>>>>>> If you like, I can define the required code in the trunk and let you
>>>>>>> fill in the event functionality.
>>>>>>
>>>>>> That would be great.
>>>>>
>>>>> Thanks for your changes. When using --with-ft there are a few compiler
>>>>> errors which I tried to fix with following patch:
>>>>>
>>>>> https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=71521789ef9d248a7eef53030d2ec5de900faa4c
>>>>
>>>> That looks okay, with the only caveat being that you wouldn't ordinarily
>>>> pass the state_caddy_t into a function. It's just there to pass along the
>>>> job etc in case the callback function needs to reference something. In
>>>> this case, I can't think of anything the FT event function would need to
>>>> know - you just want it to quiet all messaging.
>>> 
>>> I need to pass the type of state to the ft_event() functions:
>>> 
>>> enum opal_crs_state_type_t {
>>>     OPAL_CRS_NONE        = 0,
>>>     OPAL_CRS_CHECKPOINT  = 1,
>>>     OPAL_CRS_RESTART_PRE = 2,
>>>     OPAL_CRS_RESTART     = 3, /* RESTART_POST */
>>> 
>>> so an int is all I need. So I probably need to encode it into *cbdata. Do I
>>> just use an int directly in *cbdata or should it be part of a struct?
>> 
>> Why don't you define a job state for each of those, and then you can walk 
>> the state machine thru them if needed? That way the state caddy will already 
>> provide you with the state and you can just pass it to the functions.
> 
> Like this?
> 
> https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=79d6c8262bf809bb2f9ecc853d4a7a42a88654da

Yep! You can now use those states to sequence things as desired.
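
A minimal sketch of the idea, for illustration only: one job state per C/R phase, with the callback reading the state from the caddy and handing the matching phase to the ft_event() function. Only the opal_crs_state_type_t values come from this thread; every sketch_* name, the job state values, and the one-field caddy are hypothetical stand-ins, not the actual ORTE definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Checkpoint/restart phases, as quoted earlier in this thread. */
typedef enum {
    OPAL_CRS_NONE        = 0,
    OPAL_CRS_CHECKPOINT  = 1,
    OPAL_CRS_RESTART_PRE = 2,
    OPAL_CRS_RESTART     = 3  /* RESTART_POST */
} opal_crs_state_type_t;

/* HYPOTHETICAL job states, one per C/R phase (names and values invented). */
typedef enum {
    SKETCH_JOB_STATE_FT_CHECKPOINT = 100,
    SKETCH_JOB_STATE_FT_RESTART_PRE,
    SKETCH_JOB_STATE_FT_RESTART
} sketch_job_state_t;

/* HYPOTHETICAL stand-in for the state caddy: it only carries the job state. */
typedef struct {
    sketch_job_state_t job_state;
} sketch_caddy_t;

/* Records the last phase delivered, so the sketch can be checked. */
static opal_crs_state_type_t sketch_last_phase = OPAL_CRS_NONE;

/* HYPOTHETICAL ft_event(): a real one would quiet messaging for the phase. */
static int sketch_ft_event(opal_crs_state_type_t phase)
{
    sketch_last_phase = phase;
    return 0;
}

/* Map a job state to the C/R phase it represents. */
static opal_crs_state_type_t sketch_phase_for(sketch_job_state_t state)
{
    switch (state) {
    case SKETCH_JOB_STATE_FT_CHECKPOINT:  return OPAL_CRS_CHECKPOINT;
    case SKETCH_JOB_STATE_FT_RESTART_PRE: return OPAL_CRS_RESTART_PRE;
    case SKETCH_JOB_STATE_FT_RESTART:     return OPAL_CRS_RESTART;
    }
    return OPAL_CRS_NONE;
}

/* State-machine callback: cbdata is the caddy, which already holds the state,
 * so no extra int has to be encoded into cbdata. */
static void sketch_ft_state_cb(void *cbdata)
{
    sketch_caddy_t *caddy = (sketch_caddy_t *)cbdata;
    assert(caddy != NULL);
    (void)sketch_ft_event(sketch_phase_for(caddy->job_state));
}

/* Walk the three phases in order; returns the last phase delivered. */
static opal_crs_state_type_t sketch_walk_states(void)
{
    const sketch_job_state_t order[] = {
        SKETCH_JOB_STATE_FT_CHECKPOINT,
        SKETCH_JOB_STATE_FT_RESTART_PRE,
        SKETCH_JOB_STATE_FT_RESTART
    };
    size_t i;
    for (i = 0; i < sizeof(order) / sizeof(order[0]); i++) {
        sketch_caddy_t caddy = { order[i] };
        sketch_ft_state_cb(&caddy);
    }
    return sketch_last_phase;
}
```

In this sketch the callback never receives a separate int: activating a given job state is what selects the phase, which is the point of defining one state per phase.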


> 
>   Adrian
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/03/14315.php



[OMPI devel] 1.7.5 status

2014-03-10 Thread Ralph Castain
Hi folks

Here's a quick status for our discussion tomorrow on 1.7.5:

MPI tests

* idx_null continues to fail

* other failures may already have been fixed by a CMR that came in after my tests started

OSHMEM tests

* quite a few failures

* performance tests uniformly fail - finally had to just abort them and set 
them to be skipped from now on

Link to OSHMEM failures:

http://mtt.open-mpi.org/index.php?do_redir=2156

Right now, I'm leaning toward releasing 1.7.5 with OSHMEM *disabled* by default, 
with a plan to enable OSHMEM by default once the code stabilizes and becomes 
less error-prone.

Talk to you tomorrow
Ralph



[OMPI devel] onesided/test_acc2 failures

2014-03-10 Thread Dave Goodell (dgoodell)
Nathan,

The onesided/test_acc2 test is failing in our Cisco MTT runs on the trunk and 
v1.7.5 branches:

8<
 test_acc2 == Mon Mar 10 15:31:47 2014

Time per int accumulate 0.769040 microsecs
P0, Test No. 0, PASSED: accumulate performance Mon Mar 10 15:31:47 2014

 test_acc2 == Mon Mar 10 15:31:47 2014

P7, Test No. 0, PASSED: multi-offset accumulate Mon Mar 10 15:31:47 2014

P0, Test No. 1 CHECK: accumulate self without permission, nfail=1, Mon Mar 10 15:31:47 2014

P0, Test No. 2 CHECK: accumulate self, nfail=1, Mon Mar 10 15:31:49 2014

P1, Test No. 0 CHECK: accumulate non-self, nfail=1, Mon Mar 10 15:31:51 2014

---
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
---
--
mpirun detected that one or more processes exited with non-zero status, thus 
causing
the job to be terminated. The first process to do so was:

  Process name: [[30384,1],1]
  Exit code:16
--
8<

I've bisected from v1.7.4 to r30977, and the test begins failing around the 
time of r30894 (the big one-sided CMR from trunk): 
https://svn.open-mpi.org/trac/ompi/changeset/30894

The build on the v1.7 branch was in the (inclusive) range r30893:r30896, so 
there's a slim chance that this is one of r30893, r30895, or r30896 instead, 
but given the test failure that seems unlikely.
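
For reference, the bisection amounts to a binary search over revision numbers. The sketch below assumes a caller-supplied predicate that builds a revision and reports whether test_acc2 passes; first_bad_revision, rev_passes_fn, and stub_passes are hypothetical names, and the stub only encodes the suspicion stated above (failures start at r30894), not a measured result.

```c
#include <assert.h>

/* Predicate type: returns nonzero if the test passes at revision rev.
 * In practice this would build that revision and run the test. */
typedef int (*rev_passes_fn)(long rev);

/* Binary search for the first failing revision.
 * Preconditions: the test passes at `good`, fails at `bad`, and good < bad.
 * Assumes the test never goes back to passing once it starts failing. */
static long first_bad_revision(long good, long bad, rev_passes_fn passes)
{
    assert(good < bad);
    while (bad - good > 1) {
        long mid = good + (bad - good) / 2;
        if (passes(mid)) {
            good = mid;   /* still passing: the first bad revision is later */
        } else {
            bad = mid;    /* already failing: the first bad revision is here or earlier */
        }
    }
    return bad;
}

/* STUB predicate for illustration only: pretends the regression starts at
 * r30894, mirroring the suspicion in this mail. */
static int stub_passes(long rev)
{
    return rev < 30894;
}
```

Because each build here covers a range of revisions, a search like this can only narrow the failure down to the range it first appears in, which is why r30893, r30895, and r30896 cannot be ruled out entirely.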

Valgrind runs are not clean on this test, but they were insufficient to point 
me to a suspicious part of the code (there are some finalization bugs that 
should probably be fixed, but I don't think they are causing this issue).  
Sample VG messages and my (possibly flawed) interpretations are at the end of 
this mail.

Can you take a look?  This is a regression from v1.7.4 as we're headed into 
v1.7.5.

Thanks,
-Dave


No idea here, but probably unrelated to the one-sided test failures:
==22608== Conditional jump or move depends on uninitialised value(s)
==22608==    at 0xA72E6B9: ml_init_k_nomial_trees (coll_ml_module.c:651)
==22608==    by 0xA7336C8: mca_coll_ml_tree_hierarchy_discovery (coll_ml_module.c:2162)
==22608==    by 0xA733D46: mca_coll_ml_fulltree_ptp_only_hierarchy_discovery (coll_ml_module.c:2333)
==22608==    by 0xA7314C9: ml_discover_hierarchy (coll_ml_module.c:1565)
==22608==    by 0xA735DCD: mca_coll_ml_comm_query (coll_ml_module.c:2992)
==22608==    by 0x4CD23AE: query_2_0_0 (coll_base_comm_select.c:395)
==22608==    by 0x4CD2372: query (coll_base_comm_select.c:378)
==22608==    by 0x4CD2285: check_one_component (coll_base_comm_select.c:340)
==22608==    by 0x4CD20D7: check_components (coll_base_comm_select.c:304)
==22608==    by 0x4CCAE11: mca_coll_base_comm_select (coll_base_comm_select.c:131)
==22608==    by 0x4C60B49: ompi_mpi_init (ompi_mpi_init.c:888)
==22608==    by 0x4C93AE2: PMPI_Init (pinit.c:84)
==22608==  Uninitialised value was created by a heap allocation
==22608==    at 0x4A07844: malloc (vg_replace_malloc.c:291)
==22608==    by 0x4A079B8: realloc (vg_replace_malloc.c:687)
==22608==    by 0xA72F91B: get_new_subgroup_data (coll_ml_module.c:1044)
==22608==    by 0xA73292E: mca_coll_ml_tree_hierarchy_discovery (coll_ml_module.c:1939)
==22608==    by 0xA733D46: mca_coll_ml_fulltree_ptp_only_hierarchy_discovery (coll_ml_module.c:2333)
==22608==    by 0xA7314C9: ml_discover_hierarchy (coll_ml_module.c:1565)
==22608==    by 0xA735DCD: mca_coll_ml_comm_query (coll_ml_module.c:2992)
==22608==    by 0x4CD23AE: query_2_0_0 (coll_base_comm_select.c:395)
==22608==    by 0x4CD2372: query (coll_base_comm_select.c:378)
==22608==    by 0x4CD2285: check_one_component (coll_base_comm_select.c:340)
==22608==    by 0x4CD20D7: check_components (coll_base_comm_select.c:304)
==22608==    by 0x4CCAE11: mca_coll_base_comm_select (coll_base_comm_select.c:131)

Possibly related:
==22608== Syscall param writev(vector[...]) points to uninitialised byte(s)
==22608==    at 0x354BAE0C57: writev (in /lib64/libc-2.12.so)
==22608==    by 0x95F7CD2: mca_btl_tcp_frag_send (btl_tcp_frag.c:107)
==22608==    by 0x95F61CE: mca_btl_tcp_endpoint_send (btl_tcp_endpoint.c:261)
==22608==    by 0x95F1C08: mca_btl_tcp_send (btl_tcp.c:387)
==22608==    by 0x9CD4098: mca_bml_base_send (bml.h:276)
==22608==    by 0x9CD636F: mca_pml_ob1_send_request_start_prepare (pml_ob1_sendreq.c:650)
==22608==    by 0x9CC9B39: mca_pml_ob1_send_request_start_btl (pml_ob1_sendreq.h:388)
==22608==    by 0x9CC9DF1: mca_pml_ob1_send_request_start (pml_ob1_sendreq.h:461)
==22608==    by 0x9CCA4FA: mca_pml_ob1_isend (pml_ob1_isend.c:85)
==22608==    by 0x4C699BD: comm_allreduce_pml (allreduce.c:178)
==22608==    by 0xA73172C: ml_discover_hierarchy (coll_ml_module.c:1616)
==22608==    by 0xA735DCD: