[OMPI users] Spawn_multiple with tight integration to SGE grid engine

2012-01-27 Thread Tom Bryan
I am in the process of setting up a grid engine (SGE) cluster for running
Open MPI applications.  I'll detail the setup below, but my current problem
is that this call to Spawn_multiple never seems to return.

// Spawn all of the children processes.
_intercomm = MPI::COMM_WORLD.Spawn_multiple( _nProc,
const_cast(_command),
const_cast(_arg),
_maxProc, _info, 0, errCode );
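
For reference, here is a minimal, self-contained sketch of the equivalent call
through the C API (MPI_Comm_spawn_multiple) rather than the C++ bindings I'm
using; the worker commands, counts, and Info objects are placeholders, not the
values from the attached mpitest code:

#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm intercomm;
    /* Placeholders: two worker commands, one instance of each. */
    char *commands[2] = { "./worker_a", "./worker_b" };
    int maxprocs[2]   = { 1, 1 };
    MPI_Info infos[2] = { MPI_INFO_NULL, MPI_INFO_NULL };
    int errcodes[2];

    MPI_Init(&argc, &argv);

    /* Collective over MPI_COMM_WORLD; rank 0 acts as the root of the spawn.
       The children see the parents through MPI_Comm_get_parent() and can
       talk to them over the returned inter-communicator. */
    MPI_Comm_spawn_multiple(2, commands, MPI_ARGVS_NULL, maxprocs, infos,
                            0, MPI_COMM_WORLD, &intercomm, errcodes);

    MPI_Finalize();
    return 0;
}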

I'm new to both SGE and MPI, which is making this problem difficult for me
to troubleshoot.

I can schedule simple (non-MPI) jobs on the SGE grid with qsub.

I can use qsub to schedule multiple copies of a simple Hello World type of
application, using mpirun to spawn the processes in a script like this:
#!/bin/sh
#
#$ -S /bin/sh
#$ -V
#$ -pe orte 4
#$ -cwd
#$ -j yes
export LD_LIBRARY_PATH=/${VXR_STATIC}/openmpi-1.5.4/lib
mpirun -np 4 ./mpihello $*

That seems to work.  The processes report the hostname where they were run,
and they appear to be scheduled on different machines in my SGE grid.
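
A minimal example of the kind of hello-world I'm using (each rank reports its
hostname) looks roughly like this; it is only an illustration of the pattern,
not the exact mpihello source:

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    char host[256];
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Report which node each rank landed on. */
    gethostname(host, sizeof(host));
    printf("Hello from rank %d of %d on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}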

The problem is with a program, mpitest, that tries to use Spawn_multiple to
launch multiple child processes.  The script that I submit to the SGE grid
looks like this:
#!/bin/sh
#
#$ -S /bin/sh
#$ -V
#$ -pe orte 1-
#$ -cwd
#$ -j yes
export LD_LIBRARY_PATH=/${VXR_STATIC}/openmpi-1.5.4/lib
./mpitest $*

The mpitest program is the one that calls Spawn_multiple.  In this case, it
just tries to run multiple copies of itself.  If I restrict my SGE
configuration so that the orte parallel environment has to run all jobs on a
single host, then mpitest runs to completion, spawning 4 "child" processes
that are scheduled via SGE to run on the same host as the root process.  The
processes Send and Recv some messages, and the program exits.

If I permit SGE to schedule jobs on multiple hosts, then the child processes
appear to be scheduled and launched.  (That is, I can see them as children
of the sge_execd and sge_shepherd processes on various machines.)  But the
original call to Spawn_multiple doesn't appear to return in the root
mpitest.  I assume that there's some problem setting up the communications
channel among the different processes, but it's possible that my mpitest
code is just buggy.  I already tried disabling the firewall on all of the
machines.  I'm not sure how else to get useful debug information at this
stage of the troubleshooting.

It would be great if someone could look at the attached code and just let me
know whether what I'm doing is horribly incorrect.  If it should work, then
I can focus on systems and SGE configuration issues.  If the code is broken
and really shouldn't work, then I'd like to fix that first, of course.

Thanks,
---Tom




mpitest.tgz
Description: Binary data


Re: [OMPI users] MPI_Comm_split and intercommunicator - Problem

2012-01-27 Thread Rodrigo Silva Oliveira
Hi Jeff, thanks for replying.

Does this mean that you don't have it working properly yet? I read the thread
on the devel list where you addressed the problem and a possible solution,
but I was not able to find a conclusion about the problem.

I'm in trouble without this function. I'll probably need to redesign my whole
implementation to achieve what I need.


On Fri, Jan 27, 2012 at 2:35 PM, Jeff Squyres  wrote:

> Unfortunately, I think that this is a known problem with INTERCOMM_MERGE
> and COMM_SPAWN parents and children:
>
>https://svn.open-mpi.org/trac/ompi/ticket/2904
>
>
> On Jan 26, 2012, at 12:11 PM, Rodrigo Oliveira wrote:
>
> > Hi there, I tried to understand the behavior Thatyene described and I think
> it is a bug in the Open MPI implementation.
> >
> > I do not know exactly what is happening because I am not an expert in
> ompi code, but I could see that when one process defines its color as
> MPI_UNDEFINED, one of the processes on the inter-communicator blocks in the
> call to the function below:
> >
> > /* Step 3: set up the communicator   */
> > /* - */
> > /* Create the communicator finally */
> > rc = ompi_comm_set ( &newcomp,   /* new comm */
> >  comm,   /* old comm */
> >  my_size,/* local_size */
> >  lranks, /* local_ranks */
> >  my_rsize,   /* remote_size */
> >  rranks, /* remote_ranks */
> >  NULL,   /* attrs */
> >  comm->error_handler,/* error handler */
> >  (pass_on_topo)?
> >  (mca_base_component_t *)comm->c_topo_component:
> >  NULL,   /* topo component */
> >  NULL,   /* local group */
> >  NULL/* remote group */
> > );
> >
> > This function is called inside ompi_comm_split, in the file
> ompi/communicator/comm.c
> >
> > Is there a solution for this problem in some revision? I insist in this
> problem because I need to use this function for a similar purpose.
> >
> > Any idea?
> >
> >
> > On Wed, Jan 25, 2012 at 4:50 PM, Thatyene Louise Alves de Souza Ramos <
> thaty...@gmail.com> wrote:
> > It seems the split blocks when it must return MPI_COMM_NULL, in the
> case where I have one process with a color that does not exist in the other group
> or with color = MPI_UNDEFINED.
> >
> > On Wed, Jan 25, 2012 at 4:28 PM, Rodrigo Oliveira <
> rsilva.olive...@gmail.com> wrote:
> > Hi Thatyene,
> >
> > I took a look at your code and it seems to be logically correct. Maybe
> there is some problem when you call the split function with one client
> process having color = MPI_UNDEFINED. I understand you are trying to isolate
> one of the client processes to do something applicable only to it, am I
> wrong? According to the Open MPI documentation, this function can be used to do
> that, but it is not working. Does anyone have any idea of what the cause can be?
> >
> > Best regards
> >
> > Rodrigo Oliveira
> >
> >
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>


Re: [OMPI users] pure static "mpirun" launcher

2012-01-27 Thread Jeff Squyres
Ah ha, I think I got it.  There was actually a bug about disabling the memory 
manager in trunk/v1.5.x/v1.4.x.  I fixed it on the trunk and scheduled it for 
v1.6 (since we're trying very hard to get v1.5.5 out the door) and v1.4.5.

On the OMPI trunk on RHEL 5 with gcc 4.4.6, I can do this:

./configure --without-memory-manager LDFLAGS=--static --disable-shared 
--enable-static

And get a fully static set of OMPI executables.  For example:

-
[10:41] svbu-mpi:~ % cd $prefix/bin
[10:41] svbu-mpi:/home/jsquyres/bogus/bin % ldd *
mpic++:
not a dynamic executable
mpicc:
not a dynamic executable
mpiCC:
not a dynamic executable
mpicxx:
not a dynamic executable
mpiexec:
not a dynamic executable
mpif77:
not a dynamic executable
mpif90:
not a dynamic executable
mpirun:
not a dynamic executable
ompi-clean:
not a dynamic executable
ompi_info:
not a dynamic executable
ompi-ps:
not a dynamic executable
ompi-server:
not a dynamic executable
ompi-top:
not a dynamic executable
opal_wrapper:
not a dynamic executable
ortec++:
not a dynamic executable
ortecc:
not a dynamic executable
orteCC:
not a dynamic executable
orte-clean:
not a dynamic executable
orted:
not a dynamic executable
orte-info:
not a dynamic executable
orte-ps:
not a dynamic executable
orterun:
not a dynamic executable
orte-top:
not a dynamic executable
-

So I think the answer here is: it depends on a few factors:

1. Need that bug fix that I just committed.
2. Libtool is stripping out -static (and/or --static?).  So you have to find 
some other flags to make your compiler/linker do static.
3. Your OS has to support static builds.  For example, RHEL6 doesn't install 
libc.a by default (it's apparently on the optional DVD, which I don't have).  
My RHEL 5.5 install does have it, though.


On Jan 27, 2012, at 11:16 AM, Jeff Squyres wrote:

> I've tried a bunch of variations on this, but I'm actually getting stymied by 
> my underlying OS not supporting static linking properly.  :-\
> 
> I do see that Libtool is stripping out the "-static" standalone flag that you 
> passed into LDFLAGS.  Yuck.  What's -Wl,-E?  Can you try "-Wl,-static" 
> instead?
> 
> 
> On Jan 25, 2012, at 1:24 AM, Ilias Miroslav wrote:
> 
>> Hello again,
>> 
>> I need my own static "mpirun" for porting (together with the static executable) 
>> onto various (unknown) grid servers. In grid computing one cannot expect an 
>> OpenMPI-ILP64 installation on each computing element. 
>> 
>> Jeff: I tried LDFLAGS in configure
>> 
>> ilias@194.160.135.47:~/bin/ompi-ilp64_full_static/openmpi-1.4.4/../configure 
>> --prefix=/home/ilias/bin/ompi-ilp64_full_static -without-memory-manager 
>> --without-libnuma --enable-static --disable-shared CXX=g++ CC=gcc 
>> F77=gfortran FC=gfortran FFLAGS="-m64 -fdefault-integer-8 -static" 
>> FCFLAGS="-m64 -fdefault-integer-8 -static" CFLAGS="-m64 -static" 
>> CXXFLAGS="-m64 -static"  LDFLAGS="-static  -Wl,-E" 
>> 
>> but still got dynamic, not static "mpirun":
>> ilias@194.160.135.47:~/bin/ompi-ilp64_full_static/bin/.ldd ./mpirun
>>  linux-vdso.so.1 =>  (0x7fff6090c000)
>>  libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7fd7277cf000)
>>  libnsl.so.1 => /lib/x86_64-linux-gnu/libnsl.so.1 (0x7fd7275b7000)
>>  libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x7fd7273b3000)
>>  libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7fd727131000)
>>  libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 
>> (0x7fd726f15000)
>>  libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7fd726b9)
>>  /lib64/ld-linux-x86-64.so.2 (0x7fd7279ef000)
>> 
>> Any help please ? config.log is here:
>> 
>> https://docs.google.com/open?id=0B8qBHKNhZAipNTNkMzUxZDEtNjJmZi00YzY3LWI4MmYtY2RkZDVkMjhiOTM1
>> 
>> Best, Miro
>> --
>> Message: 10
>> Date: Tue, 24 Jan 2012 11:55:21 -0500
>> From: Jeff Squyres 
>> Subject: Re: [OMPI users] pure static "mpirun" launcher
>> To: Open MPI Users 
>> Message-ID: 
>> Content-Type: text/plain; charset=windows-1252
>> 
>> Ilias: Have you simply tried building Open MPI with flags that force static 
>> linking?  E.g., something like this:
>> 
>> ./configure --enable-static --disable-shared LDFLAGS=-Wl,-static
>> 
>> I.e., put in LDFLAGS whatever flags your compiler/linker needs to force 
>> static linking.  These LDFLAGS will be applied to all of Open MPI's 
>> executables, including mpirun.
>> 
>> 
>> On Jan 24, 2012, at 10:28 AM, Ralph Castain wrote:
>> 
>>> Good point! I'm traveling this week with limited resources, but will try to 
>>> address when able.
>>> 
>>> Sent from my iPad
>>> 
>>> On Jan 24, 2012, at 7:07 AM, Reuti  wrote:
>>> 
 Am 24.01.2012 um 15:49 schrieb Ralph Castain:
 
> I'm a little confused. Building procs static makes sense as libraries may 
> not

Re: [OMPI users] MPI_AllGather null terminator character

2012-01-27 Thread Jeff Squyres
Ah, I have an idea what might be happening here: I believe that valgrind is 
actually pretty smart.

If you have a buffer of size 128, and gethostname() only fills in, say, the 
first 32 bytes (including the \0), the other 128-32=96 bytes are uninitialized. 
 You can MPI_Allgather these, in which case those 96 uninitialized bytes will 
be copied over to the hostname_recv_buf buffer.

For each rank, valgrind can actually track the local memcpy from local_hostname 
to hostname_recv_buf[rank * MAX_LEN_SIZE], and it knows that those 96 bytes are 
still uninitialized.

So when you go to strcmp them later, valgrind says "ah ha! those are 
uninitialized!"

Meaning: I think that in some cases, valgrind is actually tracking the memcpy 
of uninitialized bytes and then alerting you later when you access those 
secondary uninitialized bytes.

If I'm right, you can memset the local_hostname buffer (or use calloc), and 
then valgrind warnings will go away.
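
A minimal sketch of that fix, following the shape of the earlier sample in this
thread; MAX_LEN and the strcmp loop here are stand-ins for the original
all_gather.c, not its actual contents:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MAX_LEN 64

int main(int argc, char *argv[])
{
    char local_hostname[MAX_LEN];
    char *hostname_recv_buf;
    int i, rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Zero the send buffer so every one of the MAX_LEN bytes that
       MPI_Allgather copies is initialized, not just the hostname part. */
    memset(local_hostname, 0, MAX_LEN);
    gethostname(local_hostname, MAX_LEN - 1);

    /* calloc zero-fills the receive buffer as well. */
    hostname_recv_buf = calloc(size * MAX_LEN, sizeof(char));
    MPI_Allgather(local_hostname, MAX_LEN, MPI_CHAR,
                  hostname_recv_buf, MAX_LEN, MPI_CHAR, MPI_COMM_WORLD);

    /* strcmp now only reads initialized bytes, so valgrind stays quiet. */
    for (i = 0; i < size; ++i)
        if (i != rank &&
            strcmp(&hostname_recv_buf[i * MAX_LEN], local_hostname) == 0)
            printf("Rank %d shares a node with rank %d\n", rank, i);

    free(hostname_recv_buf);
    MPI_Finalize();
    return 0;
}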



On Jan 27, 2012, at 8:21 AM, Gabriele Fatigati wrote:

> Hi Jeff,
> 
> yes, very stupid bug in a code, but also with the correction the problem with 
> Valgrind in strcmp remains:
> 
> ==21779== Conditional jump or move depends on uninitialised value(s)
> ==21779==at 0x4A0898C: strcmp (mc_replace_strmem.c:711)
> ==21779==by 0x400BA8: main (all_gather.c:28)
> ==21779==
> ==21779== Conditional jump or move depends on uninitialised value(s)
> ==21779==at 0x4A0899A: strcmp (mc_replace_strmem.c:711)
> ==21779==by 0x400BA8: main (all_gather.c:28)
> ==21779==
> ==21779== Conditional jump or move depends on uninitialised value(s)
> ==21779==at 0x4A089BA: strcmp (mc_replace_strmem.c:711)
> ==21779==by 0x400BA8: main (all_gather.c:28)
> 
> 
> Do you have the same warning with Valgrind?  Localhost name is something like 
> "node343" "node344" and so on.
> 
> 
> 2012/1/27 Jeff Squyres 
> I see one problem:
> 
>gethostname(local_hostname, sizeof(local_hostname));
> 
> That should be:
> 
>gethostname(local_hostname, max_name_len);
> 
> because sizeof(local_hostname) will be sizeof(void*).
> 
> But if that's what you were intending, just to simulate a small hostname 
> buffer, then be aware that gethostname() will not put a \0 after the string, 
> because it'll copy in sizeof(local_hostname) characters and then stop.
> 
> Specifically, the man page on OS X says:
> 
> The gethostname() function returns the standard host name for the current
> processor, as previously set by sethostname().  The namelen argument
> specifies the size of the name array.  The returned name is null-termi-
> nated, unless insufficient space is provided.
> 
> Hence, MPI is transmitting the entire 255 characters in your source array 
> (regardless of content -- MPI is not looking for \0's; you gave it the 
> explicit length of the buffer), but if they weren't filled with \0's, then 
> the receiver's printf will have problems handling it.
> 
> 
> 
> On Jan 27, 2012, at 4:03 AM, Gabriele Fatigati wrote:
> 
> > Sorry,
> >
> > this is the right code.
> >
> > 2012/1/27 Gabriele Fatigati 
> > Hi Jeff,
> >
> > The problem is when I use strcmp on ALLGather buffer and Valgrind that 
> > raise a warning.
> >
> > Please check if the attached code is right, where size(local_hostname) is 
> > very small.
> >
> > Valgrind is used as:
> >
> > mpirun valgrind --leak-check=full --tool=memcheck ./all_gather
> >
> > and openmpi/1.4.4 compiled with "-O0 -g"
> >
> > Thanks!
> >
> > 2012/1/26 Jeff Squyres 
> > I'm not sure what you're asking.
> >
> > The entire contents of hostname[] will be sent -- from position 0 to 
> > position (MAX_STRING_LEN-1).  If there's a \0 in there, it will be sent.  
> > If the \0 occurs after that, then it won't.
> >
> > Be aware that get_hostname(buf, size) will not put a \0 in the buffer if 
> > the hostname is exactly "size" bytes.  So you might want to double check 
> > that your get_hostname() is returning a \0-terminated string.
> >
> > Does that make sense?
> >
> > Here's a sample I wrote to verify this:
> >
> > #include 
> > #include 
> > #include 
> > #include 
> >
> > #define MAX_LEN 64
> >
> > static void where_null(char *ptr, int len, int rank)
> > {
> >int i;
> >
> >for (i = 0; i < len; ++i) {
> >if ('\0' == ptr[i]) {
> >printf("Rank %d: Null found at position %d (string: %s)\n",
> >   rank, i, ptr);
> >return;
> >}
> >}
> >
> >printf("Rank %d: Null not found! (string: ", rank);
> >for (i = 0; i < len; ++i) putc(ptr[i], stdout);
> >putc('\n', stdout);
> > }
> >
> > int main()
> > {
> >int i;
> >char hostname[MAX_LEN];
> >char *hostname_recv_buf;
> >int rank, size;
> >
> >MPI_Init(NULL, NULL);
> >MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >MPI_Comm_size(MPI_COMM_WORLD, &size);
> >
> >gethostname(hostname, MAX_LEN - 1);
> >where_null(hostname, MAX_LEN, rank);
> >
> >hostname_recv_buf = calloc(size * (MAX_LEN), (sizeof(char)))

Re: [OMPI users] MPI_Comm_split and intercommunicator - Problem

2012-01-27 Thread Jeff Squyres
Unfortunately, I think that this is a known problem with INTERCOMM_MERGE and 
COMM_SPAWN parents and children:

https://svn.open-mpi.org/trac/ompi/ticket/2904
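
For context, the pattern being discussed boils down to something like the
following minimal sketch: a parent spawns children, and everyone then calls
MPI_Comm_split on the resulting inter-communicator with one process passing
MPI_UNDEFINED. The executable name and the choice of which process opts out
are placeholders, not taken from the code in this thread:

#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm parent, intercomm, split;
    int rank, color;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* Parent side: spawn 3 copies of this same binary (placeholder). */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 3, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
    } else {
        intercomm = parent;   /* Child side: the parent inter-communicator. */
    }

    MPI_Comm_rank(intercomm, &rank);   /* rank within the local group */

    /* One child opts out of the split; the reported hang shows up when a
       process passes MPI_UNDEFINED (it should simply get MPI_COMM_NULL). */
    color = (parent != MPI_COMM_NULL && rank == 0) ? MPI_UNDEFINED : 0;
    MPI_Comm_split(intercomm, color, rank, &split);

    if (split != MPI_COMM_NULL)
        MPI_Comm_free(&split);
    MPI_Finalize();
    return 0;
}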


On Jan 26, 2012, at 12:11 PM, Rodrigo Oliveira wrote:

> Hi there, I tried to understand the behavior Thatyene described and I think it is a 
> bug in the Open MPI implementation.
> 
> I do not know exactly what is happening because I am not an expert in ompi 
> code, but I could see that when one process defines its color as 
> MPI_UNDEFINED, one of the processes on the inter-communicator blocks in the 
> call to the function below:
> 
> /* Step 3: set up the communicator   */
> /* - */
> /* Create the communicator finally */
> rc = ompi_comm_set ( &newcomp,   /* new comm */
>  comm,   /* old comm */
>  my_size,/* local_size */
>  lranks, /* local_ranks */
>  my_rsize,   /* remote_size */
>  rranks, /* remote_ranks */
>  NULL,   /* attrs */
>  comm->error_handler,/* error handler */
>  (pass_on_topo)?
>  (mca_base_component_t *)comm->c_topo_component:
>  NULL,   /* topo component */
>  NULL,   /* local group */
>  NULL/* remote group */
> );
> 
> This function is called inside ompi_comm_split, in the file 
> ompi/communicator/comm.c
> 
> Is there a solution for this problem in some revision? I insist in this 
> problem because I need to use this function for a similar purpose.
> 
> Any idea?
> 
> 
> On Wed, Jan 25, 2012 at 4:50 PM, Thatyene Louise Alves de Souza Ramos 
>  wrote:
> It seems the split blocks when it must return MPI_COMM_NULL, in the case where I 
> have one process with a color that does not exist in the other group or with 
> color = MPI_UNDEFINED.
> 
> On Wed, Jan 25, 2012 at 4:28 PM, Rodrigo Oliveira  
> wrote:
> Hi Thatyene,
> 
> I took a look at your code and it seems to be logically correct. Maybe there 
> is some problem when you call the split function with one client process 
> having color = MPI_UNDEFINED. I understand you are trying to isolate one of the 
> client processes to do something applicable only to it, am I wrong? According 
> to the Open MPI documentation, this function can be used to do that, but it is 
> not working. Does anyone have any idea of what the cause can be?
> 
> Best regards
> 
> Rodrigo Oliveira
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] pure static "mpirun" launcher

2012-01-27 Thread Jeff Squyres
I've tried a bunch of variations on this, but I'm actually getting stymied by 
my underlying OS not supporting static linking properly.  :-\

I do see that Libtool is stripping out the "-static" standalone flag that you 
passed into LDFLAGS.  Yuck.  What's -Wl,-E?  Can you try "-Wl,-static" instead?


On Jan 25, 2012, at 1:24 AM, Ilias Miroslav wrote:

> Hello again,
> 
> I need my own static "mpirun" for porting (together with the static executable) 
> onto various (unknown) grid servers. In grid computing one cannot expect an 
> OpenMPI-ILP64 installation on each computing element. 
> 
> Jeff: I tried LDFLAGS in configure
> 
> ilias@194.160.135.47:~/bin/ompi-ilp64_full_static/openmpi-1.4.4/../configure 
> --prefix=/home/ilias/bin/ompi-ilp64_full_static -without-memory-manager 
> --without-libnuma --enable-static --disable-shared CXX=g++ CC=gcc 
> F77=gfortran FC=gfortran FFLAGS="-m64 -fdefault-integer-8 -static" 
> FCFLAGS="-m64 -fdefault-integer-8 -static" CFLAGS="-m64 -static" 
> CXXFLAGS="-m64 -static"  LDFLAGS="-static  -Wl,-E" 
> 
> but still got dynamic, not static "mpirun":
> ilias@194.160.135.47:~/bin/ompi-ilp64_full_static/bin/.ldd ./mpirun
>   linux-vdso.so.1 =>  (0x7fff6090c000)
>   libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7fd7277cf000)
>   libnsl.so.1 => /lib/x86_64-linux-gnu/libnsl.so.1 (0x7fd7275b7000)
>   libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x7fd7273b3000)
>   libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7fd727131000)
>   libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 
> (0x7fd726f15000)
>   libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7fd726b9)
>   /lib64/ld-linux-x86-64.so.2 (0x7fd7279ef000)
> 
> Any help please ? config.log is here:
> 
> https://docs.google.com/open?id=0B8qBHKNhZAipNTNkMzUxZDEtNjJmZi00YzY3LWI4MmYtY2RkZDVkMjhiOTM1
> 
> Best, Miro
> --
> Message: 10
> Date: Tue, 24 Jan 2012 11:55:21 -0500
> From: Jeff Squyres 
> Subject: Re: [OMPI users] pure static "mpirun" launcher
> To: Open MPI Users 
> Message-ID: 
> Content-Type: text/plain; charset=windows-1252
> 
> Ilias: Have you simply tried building Open MPI with flags that force static 
> linking?  E.g., something like this:
> 
>  ./configure --enable-static --disable-shared LDFLAGS=-Wl,-static
> 
> I.e., put in LDFLAGS whatever flags your compiler/linker needs to force 
> static linking.  These LDFLAGS will be applied to all of Open MPI's 
> executables, including mpirun.
> 
> 
> On Jan 24, 2012, at 10:28 AM, Ralph Castain wrote:
> 
>> Good point! I'm traveling this week with limited resources, but will try to 
>> address when able.
>> 
>> Sent from my iPad
>> 
>> On Jan 24, 2012, at 7:07 AM, Reuti  wrote:
>> 
>>> Am 24.01.2012 um 15:49 schrieb Ralph Castain:
>>> 
 I'm a little confused. Building procs static makes sense as libraries may 
 not be available on compute nodes. However, mpirun is only executed in one 
 place, usually the head node where it was built. So there is less reason 
 to build it purely static.
 
 Are you trying to move mpirun somewhere? Or is it the daemons that mpirun 
 launches that are the real problem?
>>> 
>>> This depends: if you have a queuing system, the master node of a parallel 
>>> job may be one of the slave nodes already where the jobscript runs. 
>>> Nevertheless I have the nodes uniform, but I saw places where it wasn't the 
>>> case.
>>> 
>>> An option would be to have a special queue, which will execute the 
>>> jobscript always on the headnode (i.e. without generating any load) and use 
>>> only non-local granted slots for mpirun. For this it might be necssary to 
>>> have a high number of slots on the headnode for this queue, and request 
>>> always one slot on this machine in addition to the necessary ones on the 
>>> computing node.
>>> 
>>> -- Reuti
>>> 
>>> 
 Sent from my iPad
 
 On Jan 24, 2012, at 5:54 AM, Ilias Miroslav  wrote:
 
> Dear experts,
> 
> following http://www.open-mpi.org/faq/?category=building#static-build I 
> successfully built a static OpenMPI library.
> Using such prepared library I succeeded in building parallel static 
> executable - dirac.x (ldd dirac.x-not a dynamic executable).
> 
> The problem remains, however,  with the mpirun (orterun) launcher.
> While on the local machine, where I compiled both static OpenMPI & static 
> dirac.x  I am able to launch parallel job
> /mpirun -np 2 dirac.x ,
> I cannot launch it elsewhere, because "mpirun" is dynamically linked, 
> thus machine dependent:
> 
> ldd mpirun:
>linux-vdso.so.1 =>  (0x7fff13792000)
>libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7f40f8cab000)
>libnsl.so.1 => /lib/x86_64-linux-gnu/libnsl.so.1 (0x7f40f8a93000)
>libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x7f40f888f000)
>libm.so.6 => /lib/x86_64-linux-g

Re: [OMPI users] OpenMPI: How many connections?

2012-01-27 Thread Prentice Bisbal
I would like to nominate the quote below for the best explanation of how
a piece of software works  that I've ever read.

Kudos, Jeff.

On 01/26/2012 04:38 PM, Jeff Squyres wrote:
> You send a message, a miracle occurs, and the message is received on the 
> other side. 

--
Prentice


[OMPI users] MPI_Barrier, again

2012-01-27 Thread Evgeniy Shapiro
Hi

I have a strange problem with MPI_Barrier occurring when writing to a
file. The output subroutine (the code is in FORTRAN) is called from
the main program and there is an MPI_Barrier just before the call.

In the subroutine

1. Process 0 checks whether the first file exists and, if not,
creates file 1, writes the file header and closes the file

2. There is a loop over the data sets with an embedded barrier:
   do i = 0, iDatasets
      call MPI_Barrier
      if I do not own the data: cycle and go to the next dataset (and barrier)
      check if the file exists; if not, sleep and check again until it
      does (needed to make sure the buffer has been flushed)
      write my portion of the file
   end do
   In theory the above should result in a sequential write of datasets
   to the file.

3. Process 0 checks whether the second file exists and, if not,
creates file 2, writes the file header and closes the file

4. There is a loop over the data sets with an embedded barrier:
   do i = 0, iDatasets
      call MPI_Barrier
      if I do not own the data: cycle and go to the next dataset (and barrier)
      check if the file exists; if not, sleep and check again until it
      does (needed to make sure the buffer has been flushed)
      write my portion of the file, including a link to the 1st file
   end do

The sub is called several times (different files/datasets) with a
barrier between calls; erratically, the program hangs in one of the
calls. The likelihood of the program hanging increases with the
number of processes.  DDT shows that when this happens
some of the processes, including 0, are waiting at the barrier inside the
first loop, some at the second barrier, whereas one process
is in the sleep/check-file-status cycle in the second loop. So somehow
a part of the processes get through the 1st barrier before process 0.
This is a debug version, so no loop unrolling etc.

Is there anything I can do to make sure that the first barrier is
observed by all processes? Any advice greatly appreciated.

Evgeniy


OpenMPI: 1.4.3
(I cannot use parallel mpi io in this situation for various reasons)
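
For what it's worth, one alternative I could try instead of barriers plus
file-existence polling is to serialize the writes explicitly with a token
passed from rank to rank, so the write calls are issued in strict rank order.
A minimal C sketch of that idea; write_my_part() is a placeholder for the real
output routine, not code from my program:

#include <mpi.h>
#include <stdio.h>

/* Placeholder for the real output routine. */
static void write_my_part(int rank)
{
    printf("rank %d writing its portion\n", rank);
}

int main(int argc, char *argv[])
{
    int rank, size, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Wait for the token from the previous rank, write, then pass the
       token on.  Rank 0 starts the chain, so its header write is issued
       before any other rank starts writing. */
    if (rank > 0)
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    write_my_part(rank);

    if (rank < size - 1)
        MPI_Send(&token, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}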


Re: [OMPI users] MPI_AllGather null terminator character

2012-01-27 Thread Gabriele Fatigati
Dear Ralph, thanks for the suggestion, but it doesn't solve the problem :(

The warning still exists.

2012/1/27 Ralph Castain 

> I suspect that valgrind doesn't recognize that MPI_Allgather will ensure
> that hostname_recv_buf is filled prior to calling strcmp. If you want to
> eliminate the warning, you should memset hostname_recv_buf to 0 so it has a
> guaranteed value.
>
> On Jan 27, 2012, at 6:21 AM, Gabriele Fatigati wrote:
>
> Hi Jeff,
>
> yes, very stupid bug in a code, but also with the correction the problem
> with Valgrind in strcmp remains:
>
> ==21779== Conditional jump or move depends on uninitialised value(s)
> ==21779==at 0x4A0898C: strcmp (mc_replace_strmem.c:711)
> ==21779==by 0x400BA8: main (all_gather.c:28)
> ==21779==
> ==21779== Conditional jump or move depends on uninitialised value(s)
> ==21779==at 0x4A0899A: strcmp (mc_replace_strmem.c:711)
> ==21779==by 0x400BA8: main (all_gather.c:28)
> ==21779==
> ==21779== Conditional jump or move depends on uninitialised value(s)
> ==21779==at 0x4A089BA: strcmp (mc_replace_strmem.c:711)
> ==21779==by 0x400BA8: main (all_gather.c:28)
>
>
> Do you have the same warning with Valgrind?  Localhost name is something
> like "node343" "node344" and so on.
>
>
> 2012/1/27 Jeff Squyres 
>
>> I see one problem:
>>
>>gethostname(local_hostname, sizeof(local_hostname));
>>
>> That should be:
>>
>>gethostname(local_hostname, max_name_len);
>>
>> because sizeof(local_hostname) will be sizeof(void*).
>>
>> But if that's what you were intending, just to simulate a small hostname
>> buffer, then be aware that gethostname() will not put a \0 after the
>> string, because it'll copy in sizeof(local_hostname) characters and then
>> stop.
>>
>> Specifically, the man page on OS X says:
>>
>> The gethostname() function returns the standard host name for the
>> current
>> processor, as previously set by sethostname().  The namelen argument
>> specifies the size of the name array.  The returned name is
>> null-termi-
>> nated, unless insufficient space is provided.
>>
>> Hence, MPI is transmitting the entire 255 characters in your source array
>> (regardless of content -- MPI is not looking for \0's; you gave it the
>> explicit length of the buffer), but if they weren't filled with \0's, then
>> the receiver's printf will have problems handling it.
>>
>>
>>
>> On Jan 27, 2012, at 4:03 AM, Gabriele Fatigati wrote:
>>
>> > Sorry,
>> >
>> > this is the right code.
>> >
>> > 2012/1/27 Gabriele Fatigati 
>> > Hi Jeff,
>> >
>> > The problem is when I use strcmp on ALLGather buffer and Valgrind that
>> raise a warning.
>> >
>> > Please check if the attached code is right, where size(local_hostname)
>> is very small.
>> >
>> > Valgrind is used as:
>> >
>> > mpirun valgrind --leak-check=full --tool=memcheck ./all_gather
>> >
>> > and openmpi/1.4.4 compiled with "-O0 -g"
>> >
>> > Thanks!
>> >
>> > 2012/1/26 Jeff Squyres 
>> > I'm not sure what you're asking.
>> >
>> > The entire contents of hostname[] will be sent -- from position 0 to
>> position (MAX_STRING_LEN-1).  If there's a \0 in there, it will be sent.
>>  If the \0 occurs after that, then it won't.
>> >
>> > Be aware that get_hostname(buf, size) will not put a \0 in the buffer
>> if the hostname is exactly "size" bytes.  So you might want to double check
>> that your get_hostname() is returning a \0-terminated string.
>> >
>> > Does that make sense?
>> >
>> > Here's a sample I wrote to verify this:
>> >
>> > #include 
>> > #include 
>> > #include 
>> > #include 
>> >
>> > #define MAX_LEN 64
>> >
>> > static void where_null(char *ptr, int len, int rank)
>> > {
>> >int i;
>> >
>> >for (i = 0; i < len; ++i) {
>> >if ('\0' == ptr[i]) {
>> >printf("Rank %d: Null found at position %d (string: %s)\n",
>> >   rank, i, ptr);
>> >return;
>> >}
>> >}
>> >
>> >printf("Rank %d: Null not found! (string: ", rank);
>> >for (i = 0; i < len; ++i) putc(ptr[i], stdout);
>> >putc('\n', stdout);
>> > }
>> >
>> > int main()
>> > {
>> >int i;
>> >char hostname[MAX_LEN];
>> >char *hostname_recv_buf;
>> >int rank, size;
>> >
>> >MPI_Init(NULL, NULL);
>> >MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>> >MPI_Comm_size(MPI_COMM_WORLD, &size);
>> >
>> >gethostname(hostname, MAX_LEN - 1);
>> >where_null(hostname, MAX_LEN, rank);
>> >
>> >hostname_recv_buf = calloc(size * (MAX_LEN), (sizeof(char)));
>> >MPI_Allgather(hostname, MAX_LEN, MPI_CHAR,
>> >  hostname_recv_buf, MAX_LEN, MPI_CHAR, MPI_COMM_WORLD);
>> >for (i = 0; i < size; ++i) {
>> >where_null(hostname_recv_buf + i * MAX_LEN, MAX_LEN, rank);
>> >}
>> >
>> >MPI_Finalize();
>> >return 0;
>> > }
>> >
>> >
>> >
>> > On Jan 13, 2012, at 2:32 AM, Gabriele Fatigati wrote:
>> >
>> > > Dear OpenMPI,
>> > >
>> > > using MPI_Allgather with MPI_CHAR type, I have a doubt about
>> null-terminated character. Im

Re: [OMPI users] MPI_AllGather null terminator character

2012-01-27 Thread Ralph Castain
I suspect that valgrind doesn't recognize that MPI_Allgather will ensure that 
hostname_recv_buf is filled prior to calling strcmp. If you want to eliminate 
the warning, you should memset hostname_recv_buf to 0 so it has a guaranteed 
value.

On Jan 27, 2012, at 6:21 AM, Gabriele Fatigati wrote:

> Hi Jeff,
> 
> yes, very stupid bug in a code, but also with the correction the problem with 
> Valgrind in strcmp remains:
> 
> ==21779== Conditional jump or move depends on uninitialised value(s)
> ==21779==at 0x4A0898C: strcmp (mc_replace_strmem.c:711)
> ==21779==by 0x400BA8: main (all_gather.c:28)
> ==21779==
> ==21779== Conditional jump or move depends on uninitialised value(s)
> ==21779==at 0x4A0899A: strcmp (mc_replace_strmem.c:711)
> ==21779==by 0x400BA8: main (all_gather.c:28)
> ==21779==
> ==21779== Conditional jump or move depends on uninitialised value(s)
> ==21779==at 0x4A089BA: strcmp (mc_replace_strmem.c:711)
> ==21779==by 0x400BA8: main (all_gather.c:28)
> 
> 
> Do you have the same warning with Valgrind?  Localhost name is something like 
> "node343" "node344" and so on.
> 
> 
> 2012/1/27 Jeff Squyres 
> I see one problem:
> 
>gethostname(local_hostname, sizeof(local_hostname));
> 
> That should be:
> 
>gethostname(local_hostname, max_name_len);
> 
> because sizeof(local_hostname) will be sizeof(void*).
> 
> But if that's what you were intending, just to simulate a small hostname 
> buffer, then be aware that gethostname() will not put a \0 after the string, 
> because it'll copy in sizeof(local_hostname) characters and then stop.
> 
> Specifically, the man page on OS X says:
> 
> The gethostname() function returns the standard host name for the current
> processor, as previously set by sethostname().  The namelen argument
> specifies the size of the name array.  The returned name is null-termi-
> nated, unless insufficient space is provided.
> 
> Hence, MPI is transmitting the entire 255 characters in your source array 
> (regardless of content -- MPI is not looking for \0's; you gave it the 
> explicit length of the buffer), but if they weren't filled with \0's, then 
> the receiver's printf will have problems handling it.
> 
> 
> 
> On Jan 27, 2012, at 4:03 AM, Gabriele Fatigati wrote:
> 
> > Sorry,
> >
> > this is the right code.
> >
> > 2012/1/27 Gabriele Fatigati 
> > Hi Jeff,
> >
> > The problem is when I use strcmp on ALLGather buffer and Valgrind that 
> > raise a warning.
> >
> > Please check if the attached code is right, where size(local_hostname) is 
> > very small.
> >
> > Valgrind is used as:
> >
> > mpirun valgrind --leak-check=full --tool=memcheck ./all_gather
> >
> > and openmpi/1.4.4 compiled with "-O0 -g"
> >
> > Thanks!
> >
> > 2012/1/26 Jeff Squyres 
> > I'm not sure what you're asking.
> >
> > The entire contents of hostname[] will be sent -- from position 0 to 
> > position (MAX_STRING_LEN-1).  If there's a \0 in there, it will be sent.  
> > If the \0 occurs after that, then it won't.
> >
> > Be aware that get_hostname(buf, size) will not put a \0 in the buffer if 
> > the hostname is exactly "size" bytes.  So you might want to double check 
> > that your get_hostname() is returning a \0-terminated string.
> >
> > Does that make sense?
> >
> > Here's a sample I wrote to verify this:
> >
> > #include 
> > #include 
> > #include 
> > #include 
> >
> > #define MAX_LEN 64
> >
> > static void where_null(char *ptr, int len, int rank)
> > {
> >int i;
> >
> >for (i = 0; i < len; ++i) {
> >if ('\0' == ptr[i]) {
> >printf("Rank %d: Null found at position %d (string: %s)\n",
> >   rank, i, ptr);
> >return;
> >}
> >}
> >
> >printf("Rank %d: Null not found! (string: ", rank);
> >for (i = 0; i < len; ++i) putc(ptr[i], stdout);
> >putc('\n', stdout);
> > }
> >
> > int main()
> > {
> >int i;
> >char hostname[MAX_LEN];
> >char *hostname_recv_buf;
> >int rank, size;
> >
> >MPI_Init(NULL, NULL);
> >MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >MPI_Comm_size(MPI_COMM_WORLD, &size);
> >
> >gethostname(hostname, MAX_LEN - 1);
> >where_null(hostname, MAX_LEN, rank);
> >
> >hostname_recv_buf = calloc(size * (MAX_LEN), (sizeof(char)));
> >MPI_Allgather(hostname, MAX_LEN, MPI_CHAR,
> >  hostname_recv_buf, MAX_LEN, MPI_CHAR, MPI_COMM_WORLD);
> >for (i = 0; i < size; ++i) {
> >where_null(hostname_recv_buf + i * MAX_LEN, MAX_LEN, rank);
> >}
> >
> >MPI_Finalize();
> >return 0;
> > }
> >
> >
> >
> > On Jan 13, 2012, at 2:32 AM, Gabriele Fatigati wrote:
> >
> > > Dear OpenMPI,
> > >
> > > using MPI_Allgather with MPI_CHAR type, I have a doubt about 
> > > null-terminated character. Imaging I want to spawn node names where my 
> > > program is running on:
> > >
> > >
> > > 
> > >
> > > char hostname[MAX_LEN];
> > >
> > > char* 
> > > hostname_recv_buf=(char*)c

Re: [OMPI users] MPI_AllGather null terminator character

2012-01-27 Thread Gabriele Fatigati
Hi Jeff,

yes, a very stupid bug in the code, but even with the correction the problem
with Valgrind in strcmp remains:

==21779== Conditional jump or move depends on uninitialised value(s)
==21779==at 0x4A0898C: strcmp (mc_replace_strmem.c:711)
==21779==by 0x400BA8: main (all_gather.c:28)
==21779==
==21779== Conditional jump or move depends on uninitialised value(s)
==21779==at 0x4A0899A: strcmp (mc_replace_strmem.c:711)
==21779==by 0x400BA8: main (all_gather.c:28)
==21779==
==21779== Conditional jump or move depends on uninitialised value(s)
==21779==at 0x4A089BA: strcmp (mc_replace_strmem.c:711)
==21779==by 0x400BA8: main (all_gather.c:28)


Do you have the same warning with Valgrind?  The local host names are something
like "node343", "node344", and so on.


2012/1/27 Jeff Squyres 

> I see one problem:
>
>gethostname(local_hostname, sizeof(local_hostname));
>
> That should be:
>
>gethostname(local_hostname, max_name_len);
>
> because sizeof(local_hostname) will be sizeof(void*).
>
> But if that's what you were intending, just to simulate a small hostname
> buffer, then be aware that gethostname() will not put a \0 after the
> string, because it'll copy in sizeof(local_hostname) characters and then
> stop.
>
> Specifically, the man page on OS X says:
>
> The gethostname() function returns the standard host name for the
> current
> processor, as previously set by sethostname().  The namelen argument
> specifies the size of the name array.  The returned name is null-termi-
> nated, unless insufficient space is provided.
>
> Hence, MPI is transmitting the entire 255 characters in your source array
> (regardless of content -- MPI is not looking for \0's; you gave it the
> explicit length of the buffer), but if they weren't filled with \0's, then
> the receiver's printf will have problems handling it.
>
>
>
> On Jan 27, 2012, at 4:03 AM, Gabriele Fatigati wrote:
>
> > Sorry,
> >
> > this is the right code.
> >
> > 2012/1/27 Gabriele Fatigati 
> > Hi Jeff,
> >
> > The problem is when I use strcmp on ALLGather buffer and Valgrind that
> raise a warning.
> >
> > Please check if the attached code is right, where size(local_hostname)
> is very small.
> >
> > Valgrind is used as:
> >
> > mpirun valgrind --leak-check=full --tool=memcheck ./all_gather
> >
> > and openmpi/1.4.4 compiled with "-O0 -g"
> >
> > Thanks!
> >
> > 2012/1/26 Jeff Squyres 
> > I'm not sure what you're asking.
> >
> > The entire contents of hostname[] will be sent -- from position 0 to
> position (MAX_STRING_LEN-1).  If there's a \0 in there, it will be sent.
>  If the \0 occurs after that, then it won't.
> >
> > Be aware that get_hostname(buf, size) will not put a \0 in the buffer if
> the hostname is exactly "size" bytes.  So you might want to double check
> that your get_hostname() is returning a \0-terminated string.
> >
> > Does that make sense?
> >
> > Here's a sample I wrote to verify this:
> >
> > #include 
> > #include 
> > #include 
> > #include 
> >
> > #define MAX_LEN 64
> >
> > static void where_null(char *ptr, int len, int rank)
> > {
> >int i;
> >
> >for (i = 0; i < len; ++i) {
> >if ('\0' == ptr[i]) {
> >printf("Rank %d: Null found at position %d (string: %s)\n",
> >   rank, i, ptr);
> >return;
> >}
> >}
> >
> >printf("Rank %d: Null not found! (string: ", rank);
> >for (i = 0; i < len; ++i) putc(ptr[i], stdout);
> >putc('\n', stdout);
> > }
> >
> > int main()
> > {
> >int i;
> >char hostname[MAX_LEN];
> >char *hostname_recv_buf;
> >int rank, size;
> >
> >MPI_Init(NULL, NULL);
> >MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >MPI_Comm_size(MPI_COMM_WORLD, &size);
> >
> >gethostname(hostname, MAX_LEN - 1);
> >where_null(hostname, MAX_LEN, rank);
> >
> >hostname_recv_buf = calloc(size * (MAX_LEN), (sizeof(char)));
> >MPI_Allgather(hostname, MAX_LEN, MPI_CHAR,
> >  hostname_recv_buf, MAX_LEN, MPI_CHAR, MPI_COMM_WORLD);
> >for (i = 0; i < size; ++i) {
> >where_null(hostname_recv_buf + i * MAX_LEN, MAX_LEN, rank);
> >}
> >
> >MPI_Finalize();
> >return 0;
> > }
> >
> >
> >
> > On Jan 13, 2012, at 2:32 AM, Gabriele Fatigati wrote:
> >
> > > Dear OpenMPI,
> > >
> > > using MPI_Allgather with MPI_CHAR type, I have a doubt about
> null-terminated character. Imaging I want to spawn node names where my
> program is running on:
> > >
> > >
> > > 
> > >
> > > char hostname[MAX_LEN];
> > >
> > > char*
> hostname_recv_buf=(char*)calloc(num_procs*(MAX_STRING_LEN),(sizeof(char)));
> > >
> > > MPI_Allgather(hostname, MAX_STRING_LEN, MPI_CHAR, hostname_recv_buf,
> MAX_STRING_LEN, MPI_CHAR, MPI_COMM_WORLD);
> > >
> > > 
> > >
> > >
> > > Now, is the null-terminated character of each local string included?
> Or I have to send and receive in MPI_Allgather MAX_STRING_LEN+1 elements?

Re: [OMPI users] MPI_Allgather problem

2012-01-27 Thread Jeff Squyres
On Jan 27, 2012, at 5:12 AM, Brett Tully wrote:

> Looking at the change log for 1.5.1 I see:
> - Use memmove (instead of memcpy) when necessary (e.g., source and 
> destination overlap).

Checking the logs, it looks like that fix was in 1.4.3, too.

Do you know if your application has sends/receives where the source and 
destination overlap?

Just curious -- have you run your application through a memory-checking 
debugger, like valgrind?  Sometimes application memory corruption can show up 
in very strange (and non-deterministic) ways.
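
As an aside -- I don't know whether this applies to your application -- if a
send buffer ever aliases part of the receive buffer, MPI_Allgather also
supports MPI_IN_PLACE, which removes the separate send buffer entirely.  A
minimal sketch:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, size, *gathered;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank writes its contribution directly into its own slot... */
    gathered = malloc(size * sizeof(int));
    gathered[rank] = rank;

    /* ...and MPI_IN_PLACE tells Allgather to take the send data from the
       receive buffer itself; sendcount/sendtype are ignored in this mode. */
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                  gathered, 1, MPI_INT, MPI_COMM_WORLD);

    free(gathered);
    MPI_Finalize();
    return 0;
}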

> It seems as though this might be a likely candidate for a change that might 
> fix my problems if I am indeed using 1.5.3 following the installation of 
> OpenFOAM?
> 
> On Fri, Jan 27, 2012 at 10:02 AM, Brett Tully wrote:
> Interesting. In the same set of updates, I installed OpenFOAM from their 
> Ubuntu deb package and it claims to ship with openmpi. I just downloaded 
> their Third-party source tar and unzipped it to see what version of openmpi 
> they are using, and it is 1.5.3. However, when I do man openmpi, or 
> ompi_info, I get the same version as before (1.4.3). How do I determine for 
> sure what is being included when I compile something using mpicc?

You need to be sure that the mpicc (etc.) you are using to compile your app 
exactly matches the mpirun.  mpicc --showme:version will show you the version 
that it is using.  In general, you should be able to "which mpicc;which 
mpirun;which ompi_info" to see where your executables are coming from.  This 
will likely give you a good clue to ensure that a) everything is matching (you 
want to ensure that your LD_LIBRARY_PATH also matches, or that your desired 
libmpi.so is in a system default library path), and b) what version they are. 

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] MPI_AllGather null terminator character

2012-01-27 Thread Jeff Squyres
I see one problem:

gethostname(local_hostname, sizeof(local_hostname));

That should be:

gethostname(local_hostname, max_name_len);

because sizeof(local_hostname) will be sizeof(void*).

But if that's what you were intending, just to simulate a small hostname 
buffer, then be aware that gethostname() will not put a \0 after the string, 
because it'll copy in sizeof(local_hostname) characters and then stop.

Specifically, the man page on OS X says:

 The gethostname() function returns the standard host name for the current
 processor, as previously set by sethostname().  The namelen argument
 specifies the size of the name array.  The returned name is
 null-terminated, unless insufficient space is provided.

Hence, MPI is transmitting the entire 255 characters in your source array 
(regardless of content -- MPI is not looking for \0's; you gave it the explicit 
length of the buffer), but if they weren't filled with \0's, then the 
receiver's printf will have problems handling it.



On Jan 27, 2012, at 4:03 AM, Gabriele Fatigati wrote:

> Sorry, 
> 
> this is the right code.
> 
> 2012/1/27 Gabriele Fatigati 
> Hi Jeff,
> 
> The problem is when I use strcmp on ALLGather buffer and Valgrind that raise 
> a warning.
> 
> Please check if the attached code is right, where size(local_hostname) is 
> very small. 
> 
> Valgrind is used as:
> 
> mpirun valgrind --leak-check=full --tool=memcheck ./all_gather
> 
> and openmpi/1.4.4 compiled with "-O0 -g"
> 
> Thanks!
> 
> 2012/1/26 Jeff Squyres 
> I'm not sure what you're asking.
> 
> The entire contents of hostname[] will be sent -- from position 0 to position 
> (MAX_STRING_LEN-1).  If there's a \0 in there, it will be sent.  If the \0 
> occurs after that, then it won't.
> 
> Be aware that get_hostname(buf, size) will not put a \0 in the buffer if the 
> hostname is exactly "size" bytes.  So you might want to double check that 
> your get_hostname() is returning a \0-terminated string.
> 
> Does that make sense?
> 
> Here's a sample I wrote to verify this:
> 
> #include 
> #include 
> #include 
> #include 
> 
> #define MAX_LEN 64
> 
> static void where_null(char *ptr, int len, int rank)
> {
>int i;
> 
>for (i = 0; i < len; ++i) {
>if ('\0' == ptr[i]) {
>printf("Rank %d: Null found at position %d (string: %s)\n",
>   rank, i, ptr);
>return;
>}
>}
> 
>printf("Rank %d: Null not found! (string: ", rank);
>for (i = 0; i < len; ++i) putc(ptr[i], stdout);
>putc('\n', stdout);
> }
> 
> int main()
> {
>int i;
>char hostname[MAX_LEN];
>char *hostname_recv_buf;
>int rank, size;
> 
>MPI_Init(NULL, NULL);
>MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>MPI_Comm_size(MPI_COMM_WORLD, &size);
> 
>gethostname(hostname, MAX_LEN - 1);
>where_null(hostname, MAX_LEN, rank);
> 
>hostname_recv_buf = calloc(size * (MAX_LEN), (sizeof(char)));
>MPI_Allgather(hostname, MAX_LEN, MPI_CHAR,
>  hostname_recv_buf, MAX_LEN, MPI_CHAR, MPI_COMM_WORLD);
>for (i = 0; i < size; ++i) {
>where_null(hostname_recv_buf + i * MAX_LEN, MAX_LEN, rank);
>}
> 
>MPI_Finalize();
>return 0;
> }
> 
> 
> 
> On Jan 13, 2012, at 2:32 AM, Gabriele Fatigati wrote:
> 
> > Dear OpenMPI,
> >
> > using MPI_Allgather with MPI_CHAR type, I have a doubt about 
> > null-terminated character. Imaging I want to spawn node names where my 
> > program is running on:
> >
> >
> > 
> >
> > char hostname[MAX_LEN];
> >
> > char* 
> > hostname_recv_buf=(char*)calloc(num_procs*(MAX_STRING_LEN),(sizeof(char)));
> >
> > MPI_Allgather(hostname, MAX_STRING_LEN, MPI_CHAR, hostname_recv_buf, 
> > MAX_STRING_LEN, MPI_CHAR, MPI_COMM_WORLD);
> >
> > 
> >
> >
> > Now, is the null-terminated character of each local string included? Or I 
> > have to send and receive in MPI_Allgather MAX_STRING_LEN+1 elements?
> >
> > Using Valgrind, in a subsequent simple strcmp:
> >
> > for( i= 0; i< num_procs; i++){
> > if(strcmp(&hostname_recv_buf[MAX_STRING_LEN*i], 
> > local_hostname)==0){
> >... doing something
> > }
> > }
> >
> > raise a warning:
> >
> > Conditional jump or move depends on uninitialised value(s)
> > ==19931==at 0x4A06E5C: strcmp (mc_replace_strmem.c:412)
> >
> > The same warning is not present if I use MAX_STRING_LEN+1 in MPI_Allgather.
> >
> >
> > Thanks in forward.
> >
> > --
> > Ing. Gabriele Fatigati
> >
> > HPC specialist
> >
> > SuperComputing Applications and Innovation Department
> >
> > Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
> >
> > www.cineca.itTel:   +39 051 6171722
> >
> > g.fatigati [AT] cineca.it
> 
> 
> --
> Jeff Squ

Re: [OMPI users] MPI_Allgather problem

2012-01-27 Thread TERRY DONTJE
 ompi_info should tell you the current version of Open MPI your path is 
pointing to.
Are you sure your path is pointing to the area that the OpenFOAM package 
delivered Open MPI into?


--td
On 1/27/2012 5:02 AM, Brett Tully wrote:
Interesting. In the same set of updates, I installed OpenFOAM from 
their Ubuntu deb package and it claims to ship with openmpi. I just 
downloaded their Third-party source tar and unzipped it to see what 
version of openmpi they are using, and it is 1.5.3. However, when I do 
man openmpi, or ompi_info, I get the same version as before (1.4.3). 
How do I determine for sure what is being included when I compile 
something using mpicc?


Thanks,
Brett.


On Thu, Jan 26, 2012 at 10:05 PM, Jeff Squyres wrote:


What version did you upgrade to?  (we don't control the Ubuntu
packaging)

I see a bullet in the soon-to-be-released 1.4.5 release notes:

- Fix obscure cases where MPI_ALLGATHER could crash.  Thanks to Andrew
 Senin for reporting the problem.

But that would be surprising if this is what fixed your issue,
especially since it's not released yet.  :-)



On Jan 26, 2012, at 5:24 AM, Brett Tully wrote:

> As of two days ago, this problem has disappeared and the tests
that I had written and run each night are now passing. Having
looked through the update log of my machine (Ubuntu 11.10) it
appears as though I got a new version of mpi-default-dev
(0.6ubuntu1). I would like to understand this problem in more
detail -- is it possible to see what changed in this update?
> Thanks,
> Brett.
>
>
>
> On Fri, Dec 9, 2011 at 6:43 PM, teng ma <t...@eecs.utk.edu> wrote:
> I guess your output is from different ranks.   YOu can add rank
infor inside print to tell like follows:
>
> (void) printf("rank %d: gathered[%d].node = %d\n", rank, i,
gathered[i].node);
>
> From my side, I did not see anything wrong from your code in
Open MPI 1.4.3. after I add rank, the output is
> rank 5: gathered[0].node = 0
> rank 5: gathered[1].node = 1
> rank 5: gathered[2].node = 2
> rank 5: gathered[3].node = 3
> rank 5: gathered[4].node = 4
> rank 5: gathered[5].node = 5
> rank 3: gathered[0].node = 0
> rank 3: gathered[1].node = 1
> rank 3: gathered[2].node = 2
> rank 3: gathered[3].node = 3
> rank 3: gathered[4].node = 4
> rank 3: gathered[5].node = 5
> rank 1: gathered[0].node = 0
> rank 1: gathered[1].node = 1
> rank 1: gathered[2].node = 2
> rank 1: gathered[3].node = 3
> rank 1: gathered[4].node = 4
> rank 1: gathered[5].node = 5
> rank 0: gathered[0].node = 0
> rank 0: gathered[1].node = 1
> rank 0: gathered[2].node = 2
> rank 0: gathered[3].node = 3
> rank 0: gathered[4].node = 4
> rank 0: gathered[5].node = 5
> rank 4: gathered[0].node = 0
> rank 4: gathered[1].node = 1
> rank 4: gathered[2].node = 2
> rank 4: gathered[3].node = 3
> rank 4: gathered[4].node = 4
> rank 4: gathered[5].node = 5
> rank 2: gathered[0].node = 0
> rank 2: gathered[1].node = 1
> rank 2: gathered[2].node = 2
> rank 2: gathered[3].node = 3
> rank 2: gathered[4].node = 4
> rank 2: gathered[5].node = 5
>
> Is that what you expected?
>
> On Fri, Dec 9, 2011 at 12:03 PM, Brett Tully <brett.tu...@oxyntix.com> wrote:
> Dear all,
>
> I have not used OpenMPI much before, but am maintaining a large
legacy application. We noticed a bug to do with a call to
MPI_Allgather as summarised in this post to Stackoverflow:

http://stackoverflow.com/questions/8445398/mpi-allgather-produces-inconsistent-results
>
> In the process of looking further into the problem, I noticed
that the following function results in strange behaviour.
>
> void test_all_gather() {
>
> struct _TEST_ALL_GATHER {
> int node;
> };
>
> int ierr, size, rank;
> ierr = MPI_Comm_size(MPI_COMM_WORLD, &size);
> ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
> struct _TEST_ALL_GATHER local;
> struct _TEST_ALL_GATHER *gathered;
>
> gathered = (struct _TEST_ALL_GATHER*) malloc(size *
sizeof(*gathered));
>
> local.node = rank;
>
> MPI_Allgather(&local, sizeof(struct _TEST_ALL_GATHER), MPI_BYTE,
> gathered, sizeof(struct _TEST_ALL_GATHER), MPI_BYTE,
MPI_COMM_WORLD);
>
> int i;
> for (i = 0; i < numnodes; ++i) {
> (void) printf("gathered[%d].node = %d\n", i,
gathered[i].node);
> }
>
> FREE(gathered);
> }
>
> At one point, this function printed the following:
> gathered[0].node = 2
> gathered[1].node = 3
> gathered[2].node = 2
> gathered[3].node = 3
> gathered[4].node = 4
> gathered[5].node = 5
>
> 

Re: [OMPI users] MPI_Allgather problem

2012-01-27 Thread Brett Tully
Looking at the change log for 1.5.1 I see:
- Use memmove (instead of memcpy) when necessary (e.g., source
and destination overlap).

This seems like a likely candidate for the change that fixed my problems, if I
am indeed using 1.5.3 following the installation of OpenFOAM.

On Fri, Jan 27, 2012 at 10:02 AM, Brett Tully wrote:

> Interesting. In the same set of updates, I installed OpenFOAM from their
> Ubuntu deb package and it claims to ship with openmpi. I just downloaded
> their Third-party source tar and unzipped it to see what version of openmpi
> they are using, and it is 1.5.3. However, when I do man openmpi, or
> ompi_info, I get the same version as before (1.4.3). How do I determine for
> sure what is being included when I compile something using mpicc?
>
> Thanks,
> Brett.
>
>
>
> On Thu, Jan 26, 2012 at 10:05 PM, Jeff Squyres  wrote:
>
>> What version did you upgrade to?  (we don't control the Ubuntu packaging)
>>
>> I see a bullet in the soon-to-be-released 1.4.5 release notes:
>>
>> - Fix obscure cases where MPI_ALLGATHER could crash.  Thanks to Andrew
>>  Senin for reporting the problem.
>>
>> But that would be surprising if this is what fixed your issue, especially
>> since it's not released yet.  :-)
>>
>>
>>
>> On Jan 26, 2012, at 5:24 AM, Brett Tully wrote:
>>
>> > As of two days ago, this problem has disappeared and the tests that I
>> had written and run each night are now passing. Having looked through the
>> update log of my machine (Ubuntu 11.10) it appears as though I got a new
>> version of mpi-default-dev (0.6ubuntu1). I would like to understand this
>> problem in more detail -- is it possible to see what changed in this update?
>> > Thanks,
>> > Brett.
>> >
>> >
>> >
>> > On Fri, Dec 9, 2011 at 6:43 PM, teng ma  wrote:
>> > I guess your output is from different ranks.   YOu can add rank infor
>> inside print to tell like follows:
>> >
>> > (void) printf("rank %d: gathered[%d].node = %d\n", rank, i,
>> gathered[i].node);
>> >
>> > From my side, I did not see anything wrong from your code in Open MPI
>> 1.4.3. after I add rank, the output is
>> > rank 5: gathered[0].node = 0
>> > rank 5: gathered[1].node = 1
>> > rank 5: gathered[2].node = 2
>> > rank 5: gathered[3].node = 3
>> > rank 5: gathered[4].node = 4
>> > rank 5: gathered[5].node = 5
>> > rank 3: gathered[0].node = 0
>> > rank 3: gathered[1].node = 1
>> > rank 3: gathered[2].node = 2
>> > rank 3: gathered[3].node = 3
>> > rank 3: gathered[4].node = 4
>> > rank 3: gathered[5].node = 5
>> > rank 1: gathered[0].node = 0
>> > rank 1: gathered[1].node = 1
>> > rank 1: gathered[2].node = 2
>> > rank 1: gathered[3].node = 3
>> > rank 1: gathered[4].node = 4
>> > rank 1: gathered[5].node = 5
>> > rank 0: gathered[0].node = 0
>> > rank 0: gathered[1].node = 1
>> > rank 0: gathered[2].node = 2
>> > rank 0: gathered[3].node = 3
>> > rank 0: gathered[4].node = 4
>> > rank 0: gathered[5].node = 5
>> > rank 4: gathered[0].node = 0
>> > rank 4: gathered[1].node = 1
>> > rank 4: gathered[2].node = 2
>> > rank 4: gathered[3].node = 3
>> > rank 4: gathered[4].node = 4
>> > rank 4: gathered[5].node = 5
>> > rank 2: gathered[0].node = 0
>> > rank 2: gathered[1].node = 1
>> > rank 2: gathered[2].node = 2
>> > rank 2: gathered[3].node = 3
>> > rank 2: gathered[4].node = 4
>> > rank 2: gathered[5].node = 5
>> >
>> > Is that what you expected?
>> >
>> > On Fri, Dec 9, 2011 at 12:03 PM, Brett Tully 
>> wrote:
>> > Dear all,
>> >
>> > I have not used OpenMPI much before, but am maintaining a large legacy
>> application. We noticed a bug to do with a call to MPI_Allgather as
>> summarised in this post to Stackoverflow:
>> http://stackoverflow.com/questions/8445398/mpi-allgather-produces-inconsistent-results
>> >
>> > In the process of looking further into the problem, I noticed that the
>> following function results in strange behaviour.
>> >
>> > void test_all_gather() {
>> >
>> > struct _TEST_ALL_GATHER {
>> > int node;
>> > };
>> >
>> > int ierr, size, rank;
>> > ierr = MPI_Comm_size(MPI_COMM_WORLD, &size);
>> > ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>> >
>> > struct _TEST_ALL_GATHER local;
>> > struct _TEST_ALL_GATHER *gathered;
>> >
>> > gathered = (struct _TEST_ALL_GATHER*) malloc(size *
>> sizeof(*gathered));
>> >
>> > local.node = rank;
>> >
>> > MPI_Allgather(&local, sizeof(struct _TEST_ALL_GATHER), MPI_BYTE,
>> > gathered, sizeof(struct _TEST_ALL_GATHER), MPI_BYTE,
>> MPI_COMM_WORLD);
>> >
>> > int i;
>> > for (i = 0; i < numnodes; ++i) {
>> > (void) printf("gathered[%d].node = %d\n", i, gathered[i].node);
>> > }
>> >
>> > FREE(gathered);
>> > }
>> >
>> > At one point, this function printed the following:
>> > gathered[0].node = 2
>> > gathered[1].node = 3
>> > gathered[2].node = 2
>> > gathered[3].node = 3
>> > gathered[4].node = 4
>> > gathered[5].node = 5
>> >
>> > Can anyone suggest a place to start looking into why this might be
>> h

Re: [OMPI users] MPI_Allgather problem

2012-01-27 Thread Brett Tully
Interesting. In the same set of updates, I installed OpenFOAM from their
Ubuntu deb package and it claims to ship with openmpi. I just downloaded
their Third-party source tar and unzipped it to see what version of openmpi
they are using, and it is 1.5.3. However, when I do man openmpi, or
ompi_info, I get the same version as before (1.4.3). How do I determine for
sure what is being included when I compile something using mpicc?

Thanks,
Brett.
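
A quick way to answer that last question -- a sketch, not part of the thread:
Open MPI's wrapper compilers accept "mpicc --showme" to print the underlying
compile/link line (and therefore which include and library directories are
used), and a program can print the version it was actually built against.
OMPI_MAJOR_VERSION, OMPI_MINOR_VERSION, OMPI_RELEASE_VERSION and OPEN_MPI are
assumed to come from Open MPI's mpi.h; adjust as needed.

/* Sketch: report which MPI the binary was actually built against. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int std_major, std_minor;

    MPI_Init(&argc, &argv);
    MPI_Get_version(&std_major, &std_minor);    /* MPI standard level */
#ifdef OPEN_MPI
    printf("Built against Open MPI %d.%d.%d (MPI standard %d.%d)\n",
           OMPI_MAJOR_VERSION, OMPI_MINOR_VERSION, OMPI_RELEASE_VERSION,
           std_major, std_minor);
#else
    printf("Not built against Open MPI (MPI standard %d.%d)\n",
           std_major, std_minor);
#endif
    MPI_Finalize();
    return 0;
}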


On Thu, Jan 26, 2012 at 10:05 PM, Jeff Squyres  wrote:

> What version did you upgrade to?  (we don't control the Ubuntu packaging)
>
> I see a bullet in the soon-to-be-released 1.4.5 release notes:
>
> - Fix obscure cases where MPI_ALLGATHER could crash.  Thanks to Andrew
>  Senin for reporting the problem.
>
> But that would be surprising if this is what fixed your issue, especially
> since it's not released yet.  :-)
>
>
>
> On Jan 26, 2012, at 5:24 AM, Brett Tully wrote:
>
> > As of two days ago, this problem has disappeared and the tests that I
> had written and run each night are now passing. Having looked through the
> update log of my machine (Ubuntu 11.10) it appears as though I got a new
> version of mpi-default-dev (0.6ubuntu1). I would like to understand this
> problem in more detail -- is it possible to see what changed in this update?
> > Thanks,
> > Brett.
> >
> >
> >
> > On Fri, Dec 9, 2011 at 6:43 PM, teng ma  wrote:
> > I guess your output is from different ranks.  You can add the rank
> inside the print to tell them apart, like this:
> >
> > (void) printf("rank %d: gathered[%d].node = %d\n", rank, i,
> gathered[i].node);
> >
> > From my side, I did not see anything wrong with your code in Open MPI
> 1.4.3. After I add the rank, the output is:
> > rank 5: gathered[0].node = 0
> > rank 5: gathered[1].node = 1
> > rank 5: gathered[2].node = 2
> > rank 5: gathered[3].node = 3
> > rank 5: gathered[4].node = 4
> > rank 5: gathered[5].node = 5
> > rank 3: gathered[0].node = 0
> > rank 3: gathered[1].node = 1
> > rank 3: gathered[2].node = 2
> > rank 3: gathered[3].node = 3
> > rank 3: gathered[4].node = 4
> > rank 3: gathered[5].node = 5
> > rank 1: gathered[0].node = 0
> > rank 1: gathered[1].node = 1
> > rank 1: gathered[2].node = 2
> > rank 1: gathered[3].node = 3
> > rank 1: gathered[4].node = 4
> > rank 1: gathered[5].node = 5
> > rank 0: gathered[0].node = 0
> > rank 0: gathered[1].node = 1
> > rank 0: gathered[2].node = 2
> > rank 0: gathered[3].node = 3
> > rank 0: gathered[4].node = 4
> > rank 0: gathered[5].node = 5
> > rank 4: gathered[0].node = 0
> > rank 4: gathered[1].node = 1
> > rank 4: gathered[2].node = 2
> > rank 4: gathered[3].node = 3
> > rank 4: gathered[4].node = 4
> > rank 4: gathered[5].node = 5
> > rank 2: gathered[0].node = 0
> > rank 2: gathered[1].node = 1
> > rank 2: gathered[2].node = 2
> > rank 2: gathered[3].node = 3
> > rank 2: gathered[4].node = 4
> > rank 2: gathered[5].node = 5
> >
> > Is that what you expected?
> >
> > On Fri, Dec 9, 2011 at 12:03 PM, Brett Tully 
> wrote:
> > Dear all,
> >
> > I have not used OpenMPI much before, but am maintaining a large legacy
> application. We noticed a bug to do with a call to MPI_Allgather as
> summarised in this post to Stackoverflow:
> http://stackoverflow.com/questions/8445398/mpi-allgather-produces-inconsistent-results
> >
> > In the process of looking further into the problem, I noticed that the
> following function results in strange behaviour.
> >
> > void test_all_gather() {
> >
> > struct _TEST_ALL_GATHER {
> > int node;
> > };
> >
> > int ierr, size, rank;
> > ierr = MPI_Comm_size(MPI_COMM_WORLD, &size);
> > ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >
> > struct _TEST_ALL_GATHER local;
> > struct _TEST_ALL_GATHER *gathered;
> >
> > gathered = (struct _TEST_ALL_GATHER*) malloc(size *
> sizeof(*gathered));
> >
> > local.node = rank;
> >
> > MPI_Allgather(&local, sizeof(struct _TEST_ALL_GATHER), MPI_BYTE,
> > gathered, sizeof(struct _TEST_ALL_GATHER), MPI_BYTE,
> MPI_COMM_WORLD);
> >
> > int i;
> > for (i = 0; i < size; ++i) {
> > (void) printf("gathered[%d].node = %d\n", i, gathered[i].node);
> > }
> >
> > free(gathered);
> > }
> >
> > At one point, this function printed the following:
> > gathered[0].node = 2
> > gathered[1].node = 3
> > gathered[2].node = 2
> > gathered[3].node = 3
> > gathered[4].node = 4
> > gathered[5].node = 5
> >
> > Can anyone suggest a place to start looking into why this might be
> happening? There is a section of the code that calls MPI_Comm_split, but I
> am not sure if that is related...
> >
> > Running on Ubuntu 11.10 and a summary of ompi_info:
> > Package: Open MPI buildd@allspice Distribution
> > Open MPI: 1.4.3
> > Open MPI SVN revision: r23834
> > Open MPI release date: Oct 05, 2010
> > Open RTE: 1.4.3
> > Open RTE SVN revision: r23834
> > Open RTE release date: Oct 05, 2010
> > OPAL: 1.4.3
> > OPAL SVN revision: r23834
> > OPAL release date: Oct 05, 2010
> > 
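
An aside on the test_all_gather() code quoted above: the same exchange can be
written with a committed MPI datatype for the struct instead of raw MPI_BYTE.
The sketch below illustrates that technique only (the names are made up, and
it is not offered as a diagnosis of the inconsistent results):

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

struct test_item { int node; };

int main(int argc, char *argv[])
{
    int i, rank, size;
    MPI_Datatype item_type;
    struct test_item local, *gathered;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* The struct holds a single int, so a contiguous type of one MPI_INT
     * matches its layout; commit it before use. */
    MPI_Type_contiguous(1, MPI_INT, &item_type);
    MPI_Type_commit(&item_type);

    gathered = malloc(size * sizeof(*gathered));
    local.node = rank;

    MPI_Allgather(&local, 1, item_type, gathered, 1, item_type,
                  MPI_COMM_WORLD);

    for (i = 0; i < size; ++i)
        printf("rank %d: gathered[%d].node = %d\n", rank, i, gathered[i].node);

    free(gathered);
    MPI_Type_free(&item_type);
    MPI_Finalize();
    return 0;
}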

Re: [OMPI users] MPI_AllGather null terminator character

2012-01-27 Thread Gabriele Fatigati
Sorry,

this is the right code.

2012/1/27 Gabriele Fatigati 

> Hi Jeff,
>
> The problem occurs when I use strcmp on the Allgather buffer, and Valgrind
> raises a warning.
>
> Please check whether the attached code is right; the size of local_hostname
> is very small.
>
> Valgrind is used as:
>
> mpirun valgrind --leak-check=full --tool=memcheck ./all_gather
>
> and openmpi/1.4.4 compiled with "-O0 -g"
>
> Thanks!
>
> 2012/1/26 Jeff Squyres 
>
>> I'm not sure what you're asking.
>>
>> The entire contents of hostname[] will be sent -- from position 0 to
>> position (MAX_STRING_LEN-1).  If there's a \0 in there, it will be sent.
>>  If the \0 occurs after that, then it won't.
>>
>> Be aware that gethostname(buf, size) will not put a \0 in the buffer if
>> the hostname is exactly "size" bytes.  So you might want to double check
>> that your gethostname() call is returning a \0-terminated string.
>>
>> Does that make sense?
>>
>> Here's a sample I wrote to verify this:
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <unistd.h>
>> #include <mpi.h>
>>
>> #define MAX_LEN 64
>>
>> static void where_null(char *ptr, int len, int rank)
>> {
>>     int i;
>>
>>     for (i = 0; i < len; ++i) {
>>         if ('\0' == ptr[i]) {
>>             printf("Rank %d: Null found at position %d (string: %s)\n",
>>                    rank, i, ptr);
>>             return;
>>         }
>>     }
>>
>>     printf("Rank %d: Null not found! (string: ", rank);
>>     for (i = 0; i < len; ++i) putc(ptr[i], stdout);
>>     putc('\n', stdout);
>> }
>>
>> int main()
>> {
>>     int i;
>>     char hostname[MAX_LEN];
>>     char *hostname_recv_buf;
>>     int rank, size;
>>
>>     MPI_Init(NULL, NULL);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>
>>     gethostname(hostname, MAX_LEN - 1);
>>     where_null(hostname, MAX_LEN, rank);
>>
>>     hostname_recv_buf = calloc(size * (MAX_LEN), (sizeof(char)));
>>     MPI_Allgather(hostname, MAX_LEN, MPI_CHAR,
>>                   hostname_recv_buf, MAX_LEN, MPI_CHAR, MPI_COMM_WORLD);
>>     for (i = 0; i < size; ++i) {
>>         where_null(hostname_recv_buf + i * MAX_LEN, MAX_LEN, rank);
>>     }
>>
>>     MPI_Finalize();
>>     return 0;
>> }
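
Following on from the sample above, a minimal guard for the case Jeff
describes -- a sketch added for illustration, not code from the thread; the
helper name is made up:

#include <string.h>
#include <unistd.h>

/* Fill buf with the local host name and guarantee null termination even
 * when the name is as long as the buffer. */
static void get_hostname_terminated(char *buf, size_t len)
{
    memset(buf, 0, len);           /* start from an all-zero buffer     */
    gethostname(buf, len - 1);     /* leave room for the terminator     */
    buf[len - 1] = '\0';           /* force termination in every case   */
}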
>>
>>
>>
>> On Jan 13, 2012, at 2:32 AM, Gabriele Fatigati wrote:
>>
>> > Dear OpenMPI,
>> >
>> > using MPI_Allgather with the MPI_CHAR type, I have a doubt about the
>> null-terminating character. Imagine I want to gather the names of the nodes
>> my program is running on:
>> >
>> >
>> > 
>> >
>> > char hostname[MAX_LEN];
>> >
>> > char*
>> hostname_recv_buf=(char*)calloc(num_procs*(MAX_STRING_LEN),(sizeof(char)));
>> >
>> > MPI_Allgather(hostname, MAX_STRING_LEN, MPI_CHAR, hostname_recv_buf,
>> MAX_STRING_LEN, MPI_CHAR, MPI_COMM_WORLD);
>> >
>> > 
>> >
>> >
>> > Now, is the null-terminating character of each local string included? Or
>> do I have to send and receive MAX_STRING_LEN+1 elements in MPI_Allgather?
>> >
>> > Using Valgrind, in a subsequent simple strcmp:
>> >
>> > for( i= 0; i< num_procs; i++){
>> > if(strcmp(&hostname_recv_buf[MAX_STRING_LEN*i],
>> local_hostname)==0){
>> >... doing something
>> > }
>> > }
>> >
>> > raise a warning:
>> >
>> > Conditional jump or move depends on uninitialised value(s)
>> > ==19931==    at 0x4A06E5C: strcmp (mc_replace_strmem.c:412)
>> >
>> > The same warning is not present if I use MAX_STRING_LEN+1 in
>> MPI_Allgather.
>> >
>> >
>> > Thanks in advance.
>> >
>> > --
>> > Ing. Gabriele Fatigati
>> >
>> > HPC specialist
>> >
>> > SuperComputing Applications and Innovation Department
>> >
>> > Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
>> >
>> > www.cineca.it    Tel: +39 051 6171722
>> >
>> > g.fatigati [AT] cineca.it
>> > ___
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
>
> --
> Ing. Gabriele Fatigati
>
> HPC specialist
>
> SuperComputing Applications and Innovation Department
>
> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
>
> www.cineca.it    Tel: +39 051 6171722
>
> g.fatigati [AT] cineca.it
>



-- 
Ing. Gabriele Fatigati

HPC specialist

SuperComputing Applications and Innovation Department

Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy

www.cineca.it    Tel: +39 051 6171722

g.fatigati [AT] cineca.it
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[]){

    MPI_Init(&argc,&argv);
    const int max_name_len = 255;
    char* local_hostname;
    char* hostname_recv_buf;

Re: [OMPI users] MPI_AllGather null terminator character

2012-01-27 Thread Gabriele Fatigati
Hi Jeff,

The problem occurs when I use strcmp on the Allgather buffer, and Valgrind
raises a warning.

Please check whether the attached code is right; the size of local_hostname
is very small.

Valgrind is used as:

mpirun valgrind --leak-check=full --tool=memcheck ./all_gather

and openmpi/1.4.4 compiled with "-O0 -g"

Thanks!
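
For reference, a minimal, self-contained sketch of the approach being
discussed -- each rank contributes max_name_len+1 characters so the
terminator always travels with the name, and the receive buffer is
zero-initialized so strcmp() never touches uninitialized bytes. This is an
illustration with assumed buffer sizes, not the attached program:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    const int max_name_len = 255;
    int i, rank, num_procs;
    char local_hostname[256];                  /* max_name_len + 1 */
    char *hostname_recv_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

    memset(local_hostname, 0, sizeof(local_hostname));
    gethostname(local_hostname, max_name_len);   /* last byte stays '\0' */

    hostname_recv_buf = calloc(num_procs * (max_name_len + 1), sizeof(char));

    MPI_Allgather(local_hostname, max_name_len + 1, MPI_CHAR,
                  hostname_recv_buf, max_name_len + 1, MPI_CHAR,
                  MPI_COMM_WORLD);

    for (i = 0; i < num_procs; ++i) {
        if (strcmp(&hostname_recv_buf[(max_name_len + 1) * i],
                   local_hostname) == 0) {
            printf("rank %d: rank %d runs on the same host (%s)\n",
                   rank, i, local_hostname);
        }
    }

    free(hostname_recv_buf);
    MPI_Finalize();
    return 0;
}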

2012/1/26 Jeff Squyres 

> I'm not sure what you're asking.
>
> The entire contents of hostname[] will be sent -- from position 0 to
> position (MAX_STRING_LEN-1).  If there's a \0 in there, it will be sent.
>  If the \0 occurs after that, then it won't.
>
> Be aware that gethostname(buf, size) will not put a \0 in the buffer if
> the hostname is exactly "size" bytes.  So you might want to double check
> that your gethostname() call is returning a \0-terminated string.
>
> Does that make sense?
>
> Here's a sample I wrote to verify this:
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <mpi.h>
>
> #define MAX_LEN 64
>
> static void where_null(char *ptr, int len, int rank)
> {
>     int i;
>
>     for (i = 0; i < len; ++i) {
>         if ('\0' == ptr[i]) {
>             printf("Rank %d: Null found at position %d (string: %s)\n",
>                    rank, i, ptr);
>             return;
>         }
>     }
>
>     printf("Rank %d: Null not found! (string: ", rank);
>     for (i = 0; i < len; ++i) putc(ptr[i], stdout);
>     putc('\n', stdout);
> }
>
> int main()
> {
>     int i;
>     char hostname[MAX_LEN];
>     char *hostname_recv_buf;
>     int rank, size;
>
>     MPI_Init(NULL, NULL);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>     gethostname(hostname, MAX_LEN - 1);
>     where_null(hostname, MAX_LEN, rank);
>
>     hostname_recv_buf = calloc(size * (MAX_LEN), (sizeof(char)));
>     MPI_Allgather(hostname, MAX_LEN, MPI_CHAR,
>                   hostname_recv_buf, MAX_LEN, MPI_CHAR, MPI_COMM_WORLD);
>     for (i = 0; i < size; ++i) {
>         where_null(hostname_recv_buf + i * MAX_LEN, MAX_LEN, rank);
>     }
>
>     MPI_Finalize();
>     return 0;
> }
>
>
>
> On Jan 13, 2012, at 2:32 AM, Gabriele Fatigati wrote:
>
> > Dear OpenMPI,
> >
> > using MPI_Allgather with the MPI_CHAR type, I have a doubt about the
> null-terminating character. Imagine I want to gather the names of the nodes
> my program is running on:
> >
> >
> > 
> >
> > char hostname[MAX_LEN];
> >
> > char*
> hostname_recv_buf=(char*)calloc(num_procs*(MAX_STRING_LEN),(sizeof(char)));
> >
> > MPI_Allgather(hostname, MAX_STRING_LEN, MPI_CHAR, hostname_recv_buf,
> MAX_STRING_LEN, MPI_CHAR, MPI_COMM_WORLD);
> >
> > 
> >
> >
> > Now, is the null-terminating character of each local string included? Or
> do I have to send and receive MAX_STRING_LEN+1 elements in MPI_Allgather?
> >
> > Using Valgrind, in a subsequent simple strcmp:
> >
> > for( i= 0; i< num_procs; i++){
> > if(strcmp(&hostname_recv_buf[MAX_STRING_LEN*i],
> local_hostname)==0){
> >... doing something
> > }
> > }
> >
> > raise a warning:
> >
> > Conditional jump or move depends on uninitialised value(s)
> > ==19931==    at 0x4A06E5C: strcmp (mc_replace_strmem.c:412)
> >
> > The same warning is not present if I use MAX_STRING_LEN+1 in
> MPI_Allgather.
> >
> >
> > Thanks in advance.
> >
> > --
> > Ing. Gabriele Fatigati
> >
> > HPC specialist
> >
> > SuperComputing Applications and Innovation Department
> >
> > Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
> >
> > www.cineca.it    Tel: +39 051 6171722
> >
> > g.fatigati [AT] cineca.it
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
Ing. Gabriele Fatigati

HPC specialist

SuperComputing Applications and Innovation Department

Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy

www.cineca.it    Tel: +39 051 6171722

g.fatigati [AT] cineca.it
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[]){

    MPI_Init(&argc,&argv);
    const int max_name_len = 255;
    char* local_hostname;
    char* hostname_recv_buf;
    char** hostname_list_final;
    int i, num_procs, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local_hostname = (char*) malloc(max_name_len*sizeof(char));
    hostname_recv_buf=(char*)malloc(num_procs*(max_name_len+1)*(sizeof(char)));

    hostname_list_final=(char**)malloc(num_procs*(sizeof(char*)));

    for (i=0; i< num_procs; i++)
        hostname_list_final[i] = (char*)malloc((max_name_len+