Re: [OMPI users] mpi problems/many cpus per node

2012-12-15 Thread Ralph Castain
It must be making contact, or ORTE wouldn't be attempting to launch your 
application's procs. It looks more like the daemon never received the launch 
command. Looking at the code, I suspect you're getting caught in a race 
condition that causes the message to get "stuck".

Just to see if that's the case, you might try running this with the 1.7 release 
candidate, or even the developer's nightly build. Both use a different timing 
mechanism intended to resolve such situations.


On Dec 14, 2012, at 2:49 PM, Daniel Davidson  wrote:

> Thank you for the help so far.  Here is the information that the debugging 
> gives me.  It looks like the daemon on the non-local node never makes 
> contact.  If I step NP back by two, though, it does.
> 
> Dan
> 
> [root@compute-2-1 etc]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
> compute-2-0,compute-2-1 -v  -np 34 --leave-session-attached -mca 
> odls_base_verbose 5 hostname
> [compute-2-1.local:44855] mca:base:select:( odls) Querying component [default]
> [compute-2-1.local:44855] mca:base:select:( odls) Query of component 
> [default] set priority to 1
> [compute-2-1.local:44855] mca:base:select:( odls) Selected component [default]
> [compute-2-0.local:29282] mca:base:select:( odls) Querying component [default]
> [compute-2-0.local:29282] mca:base:select:( odls) Query of component 
> [default] set priority to 1
> [compute-2-0.local:29282] mca:base:select:( odls) Selected component [default]
> [compute-2-1.local:44855] [[49524,0],0] odls:update:daemon:info updating 
> nidmap
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list unpacking 
> data to launch job [49524,1]
> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list adding new 
> jobdat for job [49524,1]
> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list unpacking 1 
> app_contexts
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],0] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],1] on daemon 0
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
> proc [[49524,1],1] for me!
> [compute-2-1.local:44855] adding proc [[49524,1],1] (1) to my local list
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],2] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],3] on daemon 0
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
> proc [[49524,1],3] for me!
> [compute-2-1.local:44855] adding proc [[49524,1],3] (3) to my local list
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],4] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],5] on daemon 0
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
> proc [[49524,1],5] for me!
> [compute-2-1.local:44855] adding proc [[49524,1],5] (5) to my local list
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],6] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],7] on daemon 0
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
> proc [[49524,1],7] for me!
> [compute-2-1.local:44855] adding proc [[49524,1],7] (7) to my local list
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],8] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],9] on daemon 0
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
> proc [[49524,1],9] for me!
> [compute-2-1.local:44855] adding proc [[49524,1],9] (9) to my local list
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],10] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],11] on daemon 0
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
> proc [[49524,1],11] for me!
> [compute-2-1.local:44855] adding proc [[49524,1],11] (11) to my local list
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],12] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],13] on daemon 0
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
> proc [[49524,1],13] for me!
> [compute-2-1.local:44855] adding proc [[49524,1],13] (13) to my local list
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],14] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constru

Re: [OMPI users] questions to some open problems

2012-12-15 Thread Siegmar Gross
Hi Ralph

> > some weeks ago (mainly in the beginning of October) I reported
> > several problems, and I would be grateful if you could tell me if
> > and, roughly, when somebody will try to solve them.
> > 
> > 1) I don't get the expected results, when I try to send or scatter
> >   the columns of a matrix in Java. The received column values have
> >   nothing to do with the original values, if I use a homogeneous
> >   environment and the program breaks with "An error occurred in
> >   MPI_Comm_dup" and "MPI_ERR_INTERN: internal error", if I use
> >   a heterogeneous environment. I would like to use the Java API.
> > 
> > 2) I don't get the expected result, when I try to scatter an object
> >   in Java.
> >   https://svn.open-mpi.org/trac/ompi/ticket/3351
> 
> Nothing has happened on these yet

Do you have an idea when somebody will have time to fix these problems?


> > 3) I still get only a message that all nodes are already filled up
> >   when I use a "rankfile" and nothing else happens. I would like
> >   to use a rankfile. You filed a bug fix for it.
> > 
> 
> I believe rankfile was fixed, at least on the trunk - not sure if it
> was moved to 1.7. I assume that's the release you are talking about?

I'm using the trunk for my tests. It didn't work for me because I used
the rankfile without a hostfile or a hostlist (it is not enough to
specify the hosts in the rankfile). Everything works fine when I provide
a "correct" hostfile or hostlist and the binding isn't too complicated
(see my last example below).

My rankfile:

rank 0=sunpc0 slot=0:0
rank 1=sunpc1 slot=0:0
rank 2=sunpc0 slot=1:0
rank 3=sunpc1 slot=1:0


My hostfile:

sunpc0 slots=4
sunpc1 slots=4


It will not work without a hostfile or hostlist.

sunpc0 mpi-probleme 128 mpiexec -report-bindings -rf rankfile_1.openmpi \
  -np 4 hostname

The rankfile that was used claimed that a host was either not
allocated or oversubscribed its slots.  Please review your rank-slot
assignments and your host allocation to ensure a proper match.  Also,
some systems may require using full hostnames, such as
"host1.example.com" (instead of just plain "host1").

  Host: sunpc1

sunpc0 mpi-probleme 129 


I get the expected output, if I add "-hostfile host_sunpc" or
"-host sunpc0,sunpc1" on the command line.

sunpc0 mpi-probleme 129 mpiexec -report-bindings -rf rankfile_1.openmpi \
  -np 4 -hostfile host_sunpc hostname
[sunpc0:06954] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/.][./.]
[sunpc0:06954] MCW rank 2 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
sunpc0
sunpc0
[sunpc1:12583] MCW rank 1 bound to socket 0[core 0[hwt 0]]: [B/.][./.]
[sunpc1:12583] MCW rank 3 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
sunpc1
sunpc1
sunpc0 mpi-probleme 130 


Furthermore, the rankfile and the hostfile must both contain either
qualified or unqualified hostnames consistently. Otherwise it will not
work, as you can see in the following output, where my hostfile contains
a qualified hostname and my rankfile only the hostname without the
domain name.

sunpc0 mpi-probleme 131 mpiexec -report-bindings -rf rankfile_1.openmpi \
  -np 4 -hostfile host_sunpc_full hostname

The rankfile that was used claimed that a host was either not
allocated or oversubscribed its slots.  Please review your rank-slot
assignments and your host allocation to ensure a proper match.  Also,
some systems may require using full hostnames, such as
"host1.example.com" (instead of just plain "host1").

  Host: sunpc1

sunpc0 mpi-probleme 132 


Unfortunately my complicated rankfile still doesn't work, although
you told me some weeks ago that it is correct.

rank 0=sunpc0 slot=0:0-1,1:0-1
rank 1=sunpc1 slot=0:0-1
rank 2=sunpc1 slot=1:0
rank 3=sunpc1 slot=1:1

sunpc1 mpi-probleme 103 mpiexec -report-bindings -rf rankfile -np 4 \
  -hostfile host_sunpc hostname
sunpc1
sunpc1
sunpc1
[sunpc1:12741] MCW rank 2 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
[sunpc1:12741] MCW rank 3 bound to socket 1[core 3[hwt 0]]: [./.][./B]
[sunpc1:12741] MCW rank 1 bound to socket 0[core 0[hwt 0]],
   socket 0[core 1[hwt 0]]: [B/B][./.]
[sunpc0:07075] MCW rank 0 bound to socket 0[core 0[hwt 0]],
   socket 0[core 1[hwt 0]]: [B/B][./.]
sunpc0
sunpc1 mpi-probleme 104 

The bindings for ranks 1 to 3 are correct, but rank 0 didn't get the
cores from the second socket.



> > 4) I would like to have "-cpus-per-proc", "-npersocket", etc for
> >   every set of machines/applications and not globally for all
> >   machines/applications if I specify several colon-separated sets
> >   of machines or applications on the command line. You told me that
> >   it could be done.
> > 
> > 5) By the way, it seems that the option "-cpus-per-proc" isn't any
> >   lo

Re: [OMPI users] Problems with shared libraries while launching jobs

2012-12-15 Thread Jeff Squyres
Note that exporting the LD_LIBRARY_PATH on the mpirun command line does not 
necessarily apply to launching the remote orteds (it applies to launching the 
remote MPI processes, which are children of the orteds).

Since you're using ssh, you might want to check the shell startup scripts on 
the target nodes (e.g., .bashrc).  It's not sufficient merely to avoid 
overwriting LD_LIBRARY_PATH -- ensure that it is actually being set to the 
location of the Intel support libraries.

You might also want to check that your .bashrc isn't setting LD_LIBRARY_PATH 
(or PATH, or ...) after the point where it exits for non-interactive shells.  
This is a common optimization trick in shell startup files -- exit early when 
the script detects a non-interactive shell, and therefore skip a bunch of 
setup that is presumably only needed for interactive logins (e.g., creating 
shell aliases and the like).
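
For illustration, here's a sketch of a .bashrc that gets this right (the 
Intel library path shown is only an assumption -- substitute your actual 
install location):

  # Set critical paths *before* any early exit.
  export LD_LIBRARY_PATH=/opt/intel/lib/intel64:$LD_LIBRARY_PATH

  # Common optimization: stop here for non-interactive shells (e.g., the
  # ssh sessions that mpirun uses to start the remote orteds).
  case $- in
      *i*) ;;         # interactive: keep going
      *)   return ;;  # non-interactive: exit early
  esac

  # Interactive-only conveniences below here are never seen by the orteds.
  alias ll='ls -l'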

Random question: is there a reason you're not using Torque support?  When you 
use Torque support, Torque automatically copies your current environment -- 
including LD_LIBRARY_PATH -- to the target node before launching the orted.  
Hence, it can make LD_LIBRARY_PATH issues like this easier to avoid.



On Dec 14, 2012, at 3:17 PM, Blosch, Edwin L wrote:

> I am having a weird problem launching cases with OpenMPI 1.4.3.  It is most 
> likely a problem with a particular node of our cluster, as the jobs will run 
> fine on some submissions but not others.  It seems to depend on the node 
> list.  I'm just having trouble diagnosing which node is at fault and what 
> the nature of its problem is.
>  
> One or perhaps more of the orted are indicating they cannot find an Intel 
> Math library.  The error is:
> /release/cfd/openmpi-intel/bin/orted: error while loading shared libraries: 
> libimf.so: cannot open shared object file: No such file or directory
>  
> I’ve checked the environment just before launching mpirun, and 
> LD_LIBRARY_PATH includes the necessary component to point to where the Intel 
> shared libraries are located.  Furthermore, my mpirun command line says to 
> export the LD_LIBRARY_PATH variable:
> Executing ['/release/cfd/openmpi-intel/bin/mpirun', '--machinefile 
> /var/spool/PBS/aux/20761.maruhpc4-mgt', '-np 160', '-x LD_LIBRARY_PATH', '-x 
> MPI_ENVIRONMENT=1', '/tmp/fv420761.maruhpc4-mgt/falconv4_openmpi_jsgl', '-v', 
> '-cycles', '1', '-ri', 'restart.1', '-ro', 
> '/tmp/fv420761.maruhpc4-mgt/restart.1']
>  
> My shell-initialization script (.bashrc) does not overwrite LD_LIBRARY_PATH.  
> OpenMPI is built explicitly --without-torque and should be using ssh to 
> launch the orted.
>  
> What options can I add to get more debugging of problems launching orted?
>  
> Thanks,
>  
> Ed


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Possible memory error

2012-12-15 Thread Jeff Squyres
On Dec 14, 2012, at 4:31 PM, Handerson, Steven wrote:

> I’m trying to track down an instance of Open MPI writing to a freed block of 
> memory.
> This occurs with the most recent release (1.6.3) as well as 1.6, on a 64-bit 
> Intel architecture running Fedora 14.
> It occurs with a very simple reduction (an allreduce minimum) over a single 
> int value.

Can you send a reproducer program?  The simpler, the better.

> I’m wondering if the Open MPI developers use power tools such as valgrind / 
> dmalloc / etc. on the releases to try to catch these things via exhaustive 
> testing -- but I understand that memory problems in C are such that a 
> mistake made anywhere can propagate, so I haven’t ruled out problems in 
> our own code.
> Also, I’m wondering if anyone has suggestions on how to track this down 
> further.

Yes, we do use such tools.

Can you cite the specific file/line where the problem is occurring?  The 
allreduce algorithms are fairly self-contained; it should be (relatively) 
straightforward to examine that code and see if there's a problem with the 
memory allocation there.

> I’m using Allinea DDT and its built-in dmalloc, which catches the error; it 
> appears in the second memcpy in opal_convertor_pack(), but I don’t have more 
> details than that at the moment.
> All I know so far is that one of those values has been freed.
> Obviously, I haven’t seen anything in earlier parts of the code that might 
> have triggered memory corruption, although both Open MPI and Intel IPP do 
> things with uninitialized values before this (according to Valgrind).

There are a number of issues that can lead to false positives for reads of 
uninitialized values.  Here are two of the most common cases:

1. When using TCP, one of our data headers has a padding hole in it, but we 
write the whole struct down a TCP socket file descriptor anyway.  Hence, it 
will generate a "read from uninit" warning.

2. When using OpenFabrics-based networks, tools like valgrind don't see the 
OS-bypass initialization of the memory (which frequently comes directly from 
the hardware), and they generate a lot of false "read from uninit" positives.

One thing you can try is to compile Open MPI --with-valgrind.  This adds a 
little performance penalty, but we take extra steps to eliminate most false 
positives.  It could help separate the wheat from the chaff, in your case.
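
For reference, that's a configure-time option; assuming a typical source 
build, the invocation would look something like:

  ./configure --with-valgrind --prefix=/path/to/install
  make all install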

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] question to scattering an object in openmpi-1.9a1r27380

2012-12-15 Thread Jeff Squyres
Greetings Siegmar; sorry for the horrid delay in replying.  :-(

Ralph opened a ticket about this a while ago 
(https://svn.open-mpi.org/trac/ompi/ticket/3351).  I answered it this morning 
-- see the ticket for the details.

Short version: I don't think that your program is correct.


On Oct 11, 2012, at 7:40 AM, Siegmar Gross wrote:

> Hi,
> 
> I have built openmpi-1.9a1r27380 with Java support and am trying some small
> programs. When I try to scatter an object, I get a ClassCastException.
> I use the following object.
> 
> public class MyData implements java.io.Serializable
> {
>  static final long serialVersionUID = -5243516570672186644L;
> 
>  private int   age;
>  private String name;
>  private double salary;
> 
>  public MyData ()
>  {
>    age    = 0;
>    name   = "";
>    salary = 0.0;
>  }
> 
>  public void setAge (int newAge)
>  {
>    age = newAge;
>  }
> ...
> }
> 
> 
> I use the following main program.
> 
> import mpi.*;
> 
> public class ObjectScatterMain
> {
>  public static void main (String args[]) throws MPIException
>  {
>    int    mytid;                  /* my task id                */
>    MyData dataItem, objBuffer;
>    String processor_name;         /* name of local machine     */
> 
>    MPI.Init (args);
>    processor_name = MPI.Get_processor_name ();
>    mytid = MPI.COMM_WORLD.Rank ();
>    dataItem  = new MyData ();
>    objBuffer = new MyData ();
>    if (mytid == 0)
>    {
>      /* initialize data item  */
>      dataItem.setAge (35);
>      dataItem.setName ("Smith");
>      dataItem.setSalary (2545.75);
>    }
>    MPI.COMM_WORLD.Scatter (dataItem, 0, 1, MPI.OBJECT,
>                            objBuffer, 0, 1, MPI.OBJECT, 0);
>    /* Each process prints its received data item. The outputs
>     * can intermingle on the screen so that you must use
>     * "-output-filename" in Open MPI.
>     */
>    System.out.printf ("\nProcess %d running on %s.\n" +
>                       "  Age:  %d\n" +
>                       "  Name: %s\n" +
>                       "  Salary: %10.2f\n",
>                       mytid, processor_name,
>                       objBuffer.getAge (),
>                       objBuffer.getName (),
>                       objBuffer.getSalary ());
>    MPI.Finalize ();
>  }
> }
> 
> 
> I get the following error, when I compile and run the program.
> 
> tyr java 218 mpijavac ObjectScatterMain.java
> tyr java 219 mpiexec java ObjectScatterMain
> Exception in thread "main" java.lang.ClassCastException:
>  MyData cannot be cast to [Ljava.lang.Object;
>at mpi.Intracomm.copyBuffer(Intracomm.java:119)
>at mpi.Intracomm.Scatter(Intracomm.java:389)
>at ObjectScatterMain.main(ObjectScatterMain.java:45)
> --
> mpiexec has exited due to process rank 0 with PID 25898 on
> ...
> 
> 
> Does anybody have an idea why I get a ClassCastException, or how I must 
> define an object that I can use in a scatter operation? Thank you very much 
> for any help in advance.
> 
> 
> Kind regards
> 
> Siegmar
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] questions to some open problems

2012-12-15 Thread Jeff Squyres
On Dec 15, 2012, at 4:41 AM, Siegmar Gross wrote:

>>> 2) I don't get the expected result, when I try to scatter an object
>>>  in Java.
>>>  https://svn.open-mpi.org/trac/ompi/ticket/3351
> 
> Do you have an idea when somebody will have time to fix these problems?

Sorry for the horrid delay.  :-( 

I just replied to the ticket -- see the ticket for details.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] [Open MPI] #3351: JAVA scatter error

2012-12-15 Thread Siegmar Gross
Hello,

> #3351: JAVA scatter error
> -+-
> Reporter:  rhc   |   Owner:  jsquyres
> Type:  defect|  Status:  closed
> Priority:  critical  |   Milestone:  Open MPI 1.7.1
>  Version:  trunk |  Resolution:  invalid
> Keywords:|
> -+-
> Changes (by jsquyres):
> 
>  * status:  new => closed
>  * resolution:   => invalid
> 
> 
> Comment:
> 
>  I do not believe that the sample code provided is a valid MPI program, for
>  two reasons (disclaimer: I do not know Java -- I'm just reading the
>  code and making some assumptions about Java):
> 
>   1. The datatypes passed to Scatter are not valid MPI datatypes
>  (MPI.OBJECT).  You need to construct a datatype that is specific to the
>  MyData class, just like you would in C/C++.  I think that this is the
>  first error that you are seeing (i.e., OMPI is trying to treat
>  MPI.OBJECT as an MPI Datatype object, failing, and therefore throwing
>  a ClassCastException).

Perhaps you are right and my small example program is not a valid MPI
program. The problem is that I couldn't find any good documentation or
example programs showing how to write a program that uses a structured
data type. Therefore I stuck to the mpiJava specification, which states
for derived datatypes in chapter 3.12 that the effect of MPI_Type_struct
can be achieved by using MPI.OBJECT as the buffer type and relying on
Java object serialization. "dataItem" is a serializable Java object and
I used MPI.OBJECT as the buffer type. How can I create a valid MPI
datatype MPI.OBJECT so that I get a working example program?

MPI.COMM_WORLD.Scatter (dataItem, 0, 1, MPI.OBJECT,
objBuffer, 0, 1, MPI.OBJECT, 0);
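
If the exception text ("MyData cannot be cast to [Ljava.lang.Object;") means
that the buffers must be arrays of serializable objects, then a minimal
sketch of what I would try is the following (the array wrapping is only my
guess, not something I found in the specification):

MyData[] sendBuf = new MyData[nproc];  /* root fills one element per task */
MyData[] recvBuf = new MyData[1];      /* each task receives one element  */
if (mytid == 0)
{
  for (int i = 0; i < nproc; ++i)      /* nproc from MPI.COMM_WORLD.Size() */
  {
    sendBuf[i] = new MyData ();
    sendBuf[i].setAge (35);
    sendBuf[i].setName ("Smith");
    sendBuf[i].setSalary (2545.75);
  }
}
MPI.COMM_WORLD.Scatter (sendBuf, 0, 1, MPI.OBJECT,
                        recvBuf, 0, 1, MPI.OBJECT, 0);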


>   2. It looks like you're trying to Scatter a single object to N peers.
>  That's invalid MPI -- you need to scatter (N*M) objects to N peers, where
>  M is a positive integer value (e.g., 1 or 2).  Are you trying to
>  broadcast?

This is the very first version of the program, in which I scatter one object
to the process itself (admittedly not the normal use case for scatter, but
it should nevertheless work). I didn't continue due to the error. I get the
same error when I broadcast my data item.

tyr java 116 mpiexec -np 1 java -cp $DIRPREFIX_LOCAL/mpi_classfiles \
  ObjectScatterMain
Exception in thread "main" java.lang.ClassCastException: MyData cannot
  be cast to [Ljava.lang.Object;
at mpi.Intracomm.copyBuffer(Intracomm.java:119)
at mpi.Intracomm.Scatter(Intracomm.java:389)
at ObjectScatterMain.main(ObjectScatterMain.java:45)


"Broadcast" works if I have only a root process and it fails when I have
one more process.

tyr java 117 mpiexec -np 1 java -cp $DIRPREFIX_LOCAL/mpi_classfiles \
  ObjectBroadcastMain

Process 0 running on tyr.informatik.hs-fulda.de.
  Age:  35
  Name: Smith
  Salary:    2545.75


tyr java 118 mpiexec -np 2 java -cp $DIRPREFIX_LOCAL/mpi_classfiles \
  ObjectBroadcastMain
Exception in thread "main" java.lang.ClassCastException: MyData cannot
  be cast to [Ljava.lang.Object;
at mpi.Comm.Object_Serialize(Comm.java:207)
at mpi.Comm.Send(Comm.java:292)
at mpi.Intracomm.Bcast(Intracomm.java:202)
at ObjectBroadcastMain.main(ObjectBroadcastMain.java:44)
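
If the array assumption above is right, the broadcast would presumably have
to be written like this (again only a guess from the exception text):

MyData[] buf = new MyData[1];
buf[0] = dataItem;                       /* wrap the object in an array */
MPI.COMM_WORLD.Bcast (buf, 0, 1, MPI.OBJECT, 0);
objBuffer = buf[0];                      /* unwrap on every task */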


>  Short version -- I don't think this bug is valid.  I'm closing the ticket.

If I misunderstood the mpiJava specification and must create a special
MPI object from my Java object: how do I create it? Thank you very much
for any help in advance.

Kind regards

Siegmar


[Attachment: ObjectBroadcastMain.java]
[Attachment: MyData.java]


Re: [OMPI users] questions to some open problems

2012-12-15 Thread Ralph Castain
Hmmm...you shouldn't need to specify a hostfile in addition to the rankfile, so 
something has gotten messed up in the allocator. I'll take a look at it.

As for cpus-per-proc, I'm hoping to tackle it over the holiday while I take a 
break from my regular job. Will let you know when fixed.

Thanks for your patience!


On Dec 15, 2012, at 1:41 AM, Siegmar Gross wrote:

> Hi Ralph
> 
>>> some weeks ago (mainly in the beginning of October) I reported
>>> several problems, and I would be grateful if you could tell me if
>>> and, roughly, when somebody will try to solve them.
>>> 
>>> 1) I don't get the expected results, when I try to send or scatter
>>>  the columns of a matrix in Java. The received column values have
>>>  nothing to do with the original values, if I use a homogeneous
>>>  environment and the program breaks with "An error occurred in
>>>  MPI_Comm_dup" and "MPI_ERR_INTERN: internal error", if I use
>>>  a heterogeneous environment. I would like to use the Java API.
>>> 
>>> 2) I don't get the expected result, when I try to scatter an object
>>>  in Java.
>>>  https://svn.open-mpi.org/trac/ompi/ticket/3351
>> 
>> Nothing has happened on these yet
> 
> Do you have an idea when somebody will have time to fix these problems?
> 
> 
>>> 3) I still get only a message that all nodes are already filled up
>>>  when I use a "rankfile" and nothing else happens. I would like
>>>  to use a rankfile. You filed a bug fix for it.
>>> 
>> 
>> I believe rankfile was fixed, at least on the trunk - not sure if it
>> was moved to 1.7. I assume that's the release you are talking about?
> 
> I'm using the trunk for my tests. It didn't work for me because I used
> the rankfile without a hostfile or a hostlist (it is not enough to
> specify the hosts in the rankfile). Everything works fine when I provide
> a "correct" hostfile or hostlist and the binding isn't too complicated
> (see my last example below).
> 
> My rankfile:
> 
> rank 0=sunpc0 slot=0:0
> rank 1=sunpc1 slot=0:0
> rank 2=sunpc0 slot=1:0
> rank 3=sunpc1 slot=1:0
> 
> 
> My hostfile:
> 
> sunpc0 slots=4
> sunpc1 slots=4
> 
> 
> It will not work without a hostfile or hostlist.
> 
> sunpc0 mpi-probleme 128 mpiexec -report-bindings -rf rankfile_1.openmpi \
>  -np 4 hostname
> 
> The rankfile that was used claimed that a host was either not
> allocated or oversubscribed its slots.  Please review your rank-slot
> assignments and your host allocation to ensure a proper match.  Also,
> some systems may require using full hostnames, such as
> "host1.example.com" (instead of just plain "host1").
> 
>  Host: sunpc1
> 
> sunpc0 mpi-probleme 129 
> 
> 
> I get the expected output, if I add "-hostfile host_sunpc" or
> "-host sunpc0,sunpc1" on the command line.
> 
> sunpc0 mpi-probleme 129 mpiexec -report-bindings -rf rankfile_1.openmpi \
>  -np 4 -hostfile host_sunpc hostname
> [sunpc0:06954] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/.][./.]
> [sunpc0:06954] MCW rank 2 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
> sunpc0
> sunpc0
> [sunpc1:12583] MCW rank 1 bound to socket 0[core 0[hwt 0]]: [B/.][./.]
> [sunpc1:12583] MCW rank 3 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
> sunpc1
> sunpc1
> sunpc0 mpi-probleme 130 
> 
> 
> Furthermore, the rankfile and the hostfile must both contain either
> qualified or unqualified hostnames consistently. Otherwise it will not
> work, as you can see in the following output, where my hostfile contains
> a qualified hostname and my rankfile only the hostname without the
> domain name.
> 
> sunpc0 mpi-probleme 131 mpiexec -report-bindings -rf rankfile_1.openmpi \
>  -np 4 -hostfile host_sunpc_full hostname
> 
> The rankfile that was used claimed that a host was either not
> allocated or oversubscribed its slots.  Please review your rank-slot
> assignments and your host allocation to ensure a proper match.  Also,
> some systems may require using full hostnames, such as
> "host1.example.com" (instead of just plain "host1").
> 
>  Host: sunpc1
> 
> sunpc0 mpi-probleme 132 
> 
> 
> Unfortunately my complicated rankfile still doesn't work, although
> you told me some weeks ago that it is correct.
> 
> rank 0=sunpc0 slot=0:0-1,1:0-1
> rank 1=sunpc1 slot=0:0-1
> rank 2=sunpc1 slot=1:0
> rank 3=sunpc1 slot=1:1
> 
> sunpc1 mpi-probleme 103 mpiexec -report-bindings -rf rankfile -np 4 \
>  -hostfile host_sunpc hostname
> sunpc1
> sunpc1
> sunpc1
> [sunpc1:12741] MCW rank 2 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
> [sunpc1:12741] MCW rank 3 bound to socket 1[core 3[hwt 0]]: [./.][./B]
> [sunpc1:12741] MCW rank 1 bound to socket 0[core 0[hwt 0]],
>   socket 0[core 1[hwt 0]]: [B/B][./.]
> [sunpc0:07075] MCW rank 0 bound to socket 0[core 0[hwt 0]],
>  

[OMPI users] segfault with one-sided communication and derived datatypes

2012-12-15 Thread Stephan Mohr

Dear community

I get a segfault in the small Fortran program that is attached. I use 
one-sided communication and derived datatypes.


I tried it with different versions of Open MPI. Versions 1.4.2 and 1.4.5 
work, but with 1.6.1 and 1.6.3 it crashes.


Can anybody confirm this?

Many thanks
Stephan
program test
  implicit none
  include 'mpif.h'

  ! Variables
  integer :: iproc, nproc, ierr

  call mpi_init(ierr)
  call mpi_comm_size(mpi_comm_world, nproc, ierr)
  call mpi_comm_rank(mpi_comm_world, iproc, ierr)

  write(*,'(2(a,i0))') 'I am task ',iproc,' out of ',nproc

  call test_mpi_get(iproc, nproc)

  write(*,'(a,i0,a)') 'task ',iproc,' is at the end of the program'

  call mpi_finalize(ierr)


end program test



subroutine test_mpi_get(iproc, nproc)
  implicit none
  include 'mpif.h'

  ! Calling arguments
  integer,intent(in) :: iproc, nproc

  ! Local variables
  integer,parameter :: n=1
  real(kind=8),dimension(n) :: sendbuf, recvbuf
  integer :: window, size_of_double, ierr, mpi_type, nsize, nelements

  ! Initialize sendbuf on process 0.
  if (iproc==0) then
      sendbuf=51.d0
  end if

  ! Size of a double precision number in bytes.
  call mpi_type_size(mpi_double_precision, size_of_double, ierr)

  ! Create the memory window at sendbuf.
  call mpi_win_create(sendbuf(1), int(n*size_of_double,kind=mpi_address_kind), size_of_double, &
   mpi_info_null, mpi_comm_world, window, ierr)

  ! Synchronize.
  call mpi_win_fence(0, window, ierr)

  ! Create a new derived datatype (should be identical to a double
  ! precision number).
  call mpi_type_create_hvector(1, 1, int(0,kind=mpi_address_kind), &
   mpi_double_precision, mpi_type, ierr)
  call mpi_type_commit(mpi_type, ierr)

  ! Size of the datatype in bytes.
  call mpi_type_size(mpi_type, nsize, ierr)

  ! Number of double precision elements that are communicated.
  nelements=nsize/size_of_double
  write(*,*)'nelements,nsize,size_of_double',nelements,nsize,size_of_double

  ! Communicate the data from process 0 to all other process, i.e. transfer them from 
  ! the memory window (at sendbuf) to the receive buffer (recvbuf).
  if (iproc/=0) then
      call mpi_get(recvbuf(1), nelements, &
           mpi_double_precision, 0, int(0,kind=mpi_address_kind), &
           1, mpi_type, window, ierr)
  end if

  ! Free the derived datatype. According to the MPI standard this should
  ! not affect the communication.
  call mpi_type_free(mpi_type, ierr)

  ! Synchronize. Here the code crashes with a segfault.
  call mpi_win_fence(0, window, ierr)

  ! Free the window.
  call mpi_win_free(window, ierr)

  ! Write the results.
  if (iproc/=0) then
      write(*,'(a,i0,a,es9.2)') 'process ',iproc,' received the value ',recvbuf(1)
  end if

end subroutine test_mpi_get