[OMPI devel] MTT GAE/GDS call this morning

2009-04-23 Thread Jeff Squyres
Ok, I opened a Google account with the name "openmpi" this morning,  
and using Ethan's cell phone (heh; turns out that I had already used  
mine), we verified it for use with GAE.  Woo hoo!


"open-mpi-mtt" is the name of the Google Apps application.

I'll send the relevant authentication information out-of-band.

--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] MTT GAE/GDS call this morning

2009-04-23 Thread Jeff Squyres

Blah!  Wrong "devel" list -- sorry folks!

On Apr 23, 2009, at 10:33 AM, Jeff Squyres (jsquyres) wrote:


Ok, I opened a Google account with the name "openmpi" this morning,
and using Ethan's cell phone (heh; turns out that I had already used
mine), we verified it for use with GAE.  Woo hoo!

"open-mpi-mtt" is the name of the Google Apps application.

I'll send the relevant authentication information out-of-band.

--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems



[OMPI devel] Patch to resolve valgrind warnings for 1.3.2

2009-04-23 Thread Number Cruncher
Although --enable-mem-debug resolves the issue, I get warnings about 
uninitialized bytes in writev from the opal_if_t structs in opal_ifinit:


==25777== Syscall param writev(vector[...]) points to uninitialised byte(s)
==25777==    at 0x34DE2C9F0C: writev (in /lib64/libc-2.6.so)
==25777==    by 0xAC233B: mca_oob_tcp_msg_send_handler (oob_tcp_msg.c:265)
==25777==    by 0xABAC92: mca_oob_tcp_peer_send (oob_tcp_peer.c:197)
==25777==    by 0xAC0A80: mca_oob_tcp_send_nb (oob_tcp_send.c:167)
==25777==    by 0xAD025E: orte_rml_oob_send (rml_oob_send.c:137)
==25777==    by 0xAD0CE3: orte_rml_oob_send_buffer (rml_oob_send.c:269)
==25777==    by 0xAA50A6: allgather (grpcomm_bad_module.c:370)
==25777==    by 0xAA592D: modex (grpcomm_bad_module.c:498)
==25777==    by 0x92EE48: ompi_mpi_init (ompi_mpi_init.c:626)
==25777==    by 0x95351C: PMPI_Init (pinit.c:80)

Since this isn't a performance-critical part of Open MPI, why not follow 
the reasoning already noted in a comment at opal/util/if.c:208 and 
zero out the struct even outside OMPI_ENABLE_MEM_DEBUG?


The attached patch makes this one-line change and clears up all valgrind 
warnings (when --with-valgrind is enabled).
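
The idea can be sketched in isolation. This is only an illustration, not Open MPI code: demo_if is a hypothetical stand-in for opal_if_t, and both helper functions are made up for the example. A struct with a char array followed by an int usually contains padding bytes, and those bytes stay uninitialised unless the whole struct is cleared before its fields are set. If the struct is later copied byte-for-byte into a send buffer, valgrind flags them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for opal_if_t: on most ABIs the compiler inserts
 * padding between name[] and index. */
struct demo_if {
    char name[5];
    int  index;
};

/* Clear every byte first (padding included), then fill in the fields.
 * In practice the padding stays zero after the member stores, although
 * the C standard does not strictly promise that. */
static void demo_if_init(struct demo_if *intf, const char *name, int index)
{
    size_t i;

    assert(intf != NULL && name != NULL);
    memset(intf, 0, sizeof(*intf));
    for (i = 0; i + 1 < sizeof(intf->name) && name[i] != '\0'; i++) {
        intf->name[i] = name[i];
    }
    intf->index = index;
}

/* Returns 1 if the two records are identical byte for byte, padding included. */
static int demo_if_same_bytes(const struct demo_if *a, const struct demo_if *b)
{
    return memcmp(a, b, sizeof(*a)) == 0;
}
```

Two records initialised this way from different stack garbage compare equal with demo_if_same_bytes, which is the property that matters when the raw bytes end up in a writev call.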


Regards,
Simon

diff -r -U 5 openmpi-1.3.2/opal/util/if.c openmpi-1.3.2.edited/opal/util/if.c
--- openmpi-1.3.2/opal/util/if.c	2009-04-16 20:02:42.0 +0100
+++ openmpi-1.3.2.edited/opal/util/if.c	2009-04-23 16:18:09.0 +0100
@@ -258,11 +258,12 @@
 struct ifreq* ifr = (struct ifreq*) ptr;
 opal_if_t intf;
 opal_if_t *intf_ptr;
 int length;

-OMPI_DEBUG_ZERO(intf);
+/* Again, make valgrind and purify happy - this isn't performance critical. */
+memset(&intf, 0, sizeof(intf));
 OBJ_CONSTRUCT(&intf, opal_list_item_t);

 /* compute offset for entries */
 #ifdef HAVE_STRUCT_SOCKADDR_SA_LEN
 length = sizeof(struct sockaddr);


Re: [OMPI devel] Patch to resolve valgrind warnings for 1.3.2

2009-04-23 Thread Jeff Squyres

Done; thanks!

https://svn.open-mpi.org/trac/ompi/changeset/21060

(I added in a few more memset's and calloc's elsewhere in if.c, just  
to be complete)





On Apr 23, 2009, at 11:36 AM, Number Cruncher wrote:


Although --enable-mem-debug resolves the issue, I get warnings about
uninitialized bytes in writev from the opal_if_t structs in opal_ifinit:

==25777== Syscall param writev(vector[...]) points to uninitialised byte(s)
==25777==    at 0x34DE2C9F0C: writev (in /lib64/libc-2.6.so)
==25777==    by 0xAC233B: mca_oob_tcp_msg_send_handler (oob_tcp_msg.c:265)
==25777==    by 0xABAC92: mca_oob_tcp_peer_send (oob_tcp_peer.c:197)
==25777==    by 0xAC0A80: mca_oob_tcp_send_nb (oob_tcp_send.c:167)
==25777==    by 0xAD025E: orte_rml_oob_send (rml_oob_send.c:137)
==25777==    by 0xAD0CE3: orte_rml_oob_send_buffer (rml_oob_send.c:269)
==25777==    by 0xAA50A6: allgather (grpcomm_bad_module.c:370)
==25777==    by 0xAA592D: modex (grpcomm_bad_module.c:498)
==25777==    by 0x92EE48: ompi_mpi_init (ompi_mpi_init.c:626)
==25777==    by 0x95351C: PMPI_Init (pinit.c:80)

Since this isn't a performance-critical part of Open MPI, why not follow
the reasoning already noted in a comment at opal/util/if.c:208 and
zero out the struct even outside OMPI_ENABLE_MEM_DEBUG?

The attached patch makes this one-line change and clears up all valgrind
warnings (when --with-valgrind is enabled).

Regards,
Simon

diff -r -U 5 openmpi-1.3.2/opal/util/if.c openmpi-1.3.2.edited/opal/util/if.c
--- openmpi-1.3.2/opal/util/if.c	2009-04-16 20:02:42.0 +0100
+++ openmpi-1.3.2.edited/opal/util/if.c	2009-04-23 16:18:09.0 +0100
@@ -258,11 +258,12 @@
 struct ifreq* ifr = (struct ifreq*) ptr;
 opal_if_t intf;
 opal_if_t *intf_ptr;
 int length;

-OMPI_DEBUG_ZERO(intf);
+/* Again, make valgrind and purify happy - this isn't performance critical. */
+memset(&intf, 0, sizeof(intf));
 OBJ_CONSTRUCT(&intf, opal_list_item_t);

 /* compute offset for entries */
 #ifdef HAVE_STRUCT_SOCKADDR_SA_LEN
 length = sizeof(struct sockaddr);




--
Jeff Squyres
Cisco Systems



[OMPI devel] size of items in mca_osc_pt2pt_component.p2p_c_buffers

2009-04-23 Thread doriankrause

Hi,

I'm currently looking at this bug: 
http://www.open-mpi.org/community/lists/users/2008/12/7611.php

I'm using the 1.3.2 tarball.

Valgrind tells me that there is an invalid write (of size 1) in 
osc_pt2pt_data_move.c at line 229, which is the statement

   memcpy((unsigned char*) buffer->payload + written_data,
          packed_ddt, packed_ddt_len);

in the function ompi_osc_pt2pt_sendreq_send.

I have

(gdb) p packed_ddt_len
$2 = 44852

and

(gdb) p written_data
$3 = 36

but I can't figure out what the actual size of buffer->payload is. I have

(gdb) p *buffer
$6 = {mpireq = {super = {super = {super = {
          obj_magic_id = 16046253926196952813, obj_class = 0x4f5240,
          obj_reference_count = 1,
          cls_init_file_name = 0x2efe0b "class/opal_free_list.c",
          cls_init_lineno = 114}, opal_list_next = 0x0, opal_list_prev = 0x0,
        item_free = 1, opal_list_item_refcount = 0,
        opal_list_item_belong_to = 0x0}}, request = 0x5a35, status = {
      MPI_SOURCE = 23094, MPI_TAG = 23095, MPI_ERROR = 23096, _count = 23097,
      _cancelled = 23098}, cbfunc = 0x4e6cc5,
    cbdata = 0x8681080}, payload = 0x86bc0d8, len = 23102}

Is len the size of payload?

In osc_pt2pt_component.c I found the statement

   /* adjust size to be multiple of ompi_ptr_t to avoid alignment issues*/
   aligned_size = sizeof(ompi_osc_pt2pt_buffer_t) +
       (sizeof(ompi_osc_pt2pt_buffer_t) % sizeof(ompi_ptr_t)) +
       mca_osc_pt2pt_component.p2p_c_eager_size;
   OBJ_CONSTRUCT(&mca_osc_pt2pt_component.p2p_c_buffers, opal_free_list_t);
   opal_free_list_init(&mca_osc_pt2pt_component.p2p_c_buffers,
                       aligned_size,
                       OBJ_CLASS(ompi_osc_pt2pt_buffer_t),
                       1, -1, 1);

but this doesn't help me to understand ...
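
To make the arithmetic concrete, here is a minimal sketch of the size computation and of a bounds check for the memcpy. It is not Open MPI code. The function names are hypothetical, and it assumes that the payload area starts right after the buffer header inside each free-list item, which the snippet above suggests but does not confirm. Only the offset 36 and the length 44852 come from the gdb session; any header size or eager size plugged in is a placeholder.

```c
#include <assert.h>
#include <stddef.h>

/* Same arithmetic as the quoted snippet: header size, plus header size
 * modulo the pointer-union size, plus the eager payload size. */
static size_t item_size(size_t header_size, size_t ptr_size, size_t eager_size)
{
    assert(ptr_size > 0);
    return header_size + (header_size % ptr_size) + eager_size;
}

/* Bytes available after the header, ASSUMING the payload starts there. */
static size_t payload_capacity(size_t item_bytes, size_t header_size)
{
    assert(item_bytes >= header_size);
    return item_bytes - header_size;
}

/* Returns 1 if copying len bytes at the given offset stays within capacity. */
static int copy_fits(size_t offset, size_t len, size_t capacity)
{
    return offset <= capacity && len <= capacity - offset;
}
```

With a hypothetical 96-byte header and a hypothetical eager size of 16384 bytes, item_size gives 16480 and payload_capacity gives 16384, so copy_fits(36, 44852, 16384) returns 0: a copy of that length at that offset would run past the end of the item. Whether that is what happens here depends on the real value of p2p_c_eager_size.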


Can you help with this? Where can I find the allocation routine for the 
buffer?

Or do you know why there could be an invalid write?

Thanks + Best regards,
Dorian




Re: [OMPI devel] RFC: Final cleanup of included headers

2009-04-23 Thread Rainer Keller
Hi,
what were we talking about again ;-)?
Aah, right: 1.3.2 is out the door, the dust has settled.

Tuesday, next week, I'd like to apply the patch (produced by  
contrib/check_unnecessary_headers.sh, see the first email).
This will require just two patches:
 - one (independent) patch with a few minor additions, mostly includes of 
string.h and other header files
 - then the "big" one, which removes unnecessary header includes in 832 files 
(again mostly one-liners).
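
The first patch is needed because a file that calls string functions may so far have picked up string.h only through some lower-level header, and stops compiling cleanly once that header no longer includes it. A minimal sketch of the pattern, using a hypothetical helper that is not taken from the Open MPI tree:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>   /* included directly: this file calls strlen and memcpy */

/* Hypothetical helper: copy src into dst (capacity cap bytes), truncating
 * if needed and always NUL-terminating. Returns the characters copied. */
static size_t demo_copy_name(char *dst, size_t cap, const char *src)
{
    size_t n;

    assert(dst != NULL && src != NULL && cap > 0);
    n = strlen(src);
    if (n >= cap) {
        n = cap - 1;
    }
    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

Because the file names the header it depends on, removing includes from lower-level headers does not affect it.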

For this, I expect to not require exclusive access to the trunk...

Would that be reasonable?

CU,
Rainer

PS: Regarding MTT, it _has_ been tested on several clusters (getting coverage 
of more MCAs, compilers and so on), but not on other OSes or architectures, so...




On Friday 20 March 2009 08:45:27 am Jeff Squyres wrote:
> This sounds reasonable -- small changes are always appreciated.  If we
> can hold off on the big patch until 1.3 is about to morph into 1.4,
> that would be good.
>
> Can you remind us about this issue when 1.3.2 gets out the door?
> (sorry, not trying to be a jerk; I just know that my short term memory
> is so non-existent that I'll have forgotten about it by then :-(
> ...what were we talking about again?)
>
> On Mar 19, 2009, at 5:12 PM, Rainer Keller wrote:
> > Hi Ralph,
> >
> > On Wednesday 18 March 2009 09:00:36 am Ralph Castain wrote:
> > > Could we hold off on this until after 1.3.2 is out the door and has a
> > > couple of days to stabilize? All these header file changes are making
> > > it more difficult to cleanly apply patches to the 1.3 branch.
> >
> > Hmm, sure, we can hold off the big patch.
> > With the current plan, 1.3.2 should be out on 4/3.
> >
> > Some intermediate (small!) steps however I'd still like to be able
> > to apply?
> >
> > > When we get past the next couple of weeks, the 1.3 branch should clear
> > > out the backlog of CMRs, and we should have the usual immediate "oops"
> > > fixes in to 1.3.2. Then this won't be such a problem.
> >
> > However, it would be nice, if You could test the patch on Your
> > systems, prior to me moving it into trunk. I want to limit the
> > "down-time" of trunk (There may be a few places, where additional
> > headers are required -- as unnecessary headers were removed in
> > lower-level headers).
> >
> > Thanks,
> > Rainer
> > --
> > 
> > Rainer Keller, PhD  Tel: +1 (865) 241-6293
> > Oak Ridge National Lab  Fax: +1 (865) 241-4811
> > PO Box 2008 MS 6164   Email: kel...@ornl.gov
> > Oak Ridge, TN 37831-2008  AIM/Skype: rusraink

-- 

Rainer Keller, PhD  Tel: +1 (865) 241-6293
Oak Ridge National Lab  Fax: +1 (865) 241-4811
PO Box 2008 MS 6164   Email: kel...@ornl.gov
Oak Ridge, TN 37831-2008  AIM/Skype: rusraink