Re: [OMPI users] mpi problems/many cpus per node
It must be making contact or ORTE wouldn't be attempting to launch your application's procs. Looks more like it never received the launch command. Looking at the code, I suspect you're getting caught in a race condition that causes the message to get "stuck". Just to see if that's the case, you might try running this with the 1.7 release candidate, or even the developer's nightly build. Both use a different timing mechanism intended to resolve such situations.

On Dec 14, 2012, at 2:49 PM, Daniel Davidson wrote:

> Thank you for the help so far. Here is the information that the debugging
> gives me. Looks like the daemon on the non-local node never makes
> contact. If I step NP back by two, though, it does.
>
> Dan
>
> [root@compute-2-1 etc]# /home/apps/openmpi-1.6.3/bin/mpirun -host compute-2-0,compute-2-1 \
>     -v -np 34 --leave-session-attached -mca odls_base_verbose 5 hostname
> [compute-2-1.local:44855] mca:base:select:( odls) Querying component [default]
> [compute-2-1.local:44855] mca:base:select:( odls) Query of component [default] set priority to 1
> [compute-2-1.local:44855] mca:base:select:( odls) Selected component [default]
> [compute-2-0.local:29282] mca:base:select:( odls) Querying component [default]
> [compute-2-0.local:29282] mca:base:select:( odls) Query of component [default] set priority to 1
> [compute-2-0.local:29282] mca:base:select:( odls) Selected component [default]
> [compute-2-1.local:44855] [[49524,0],0] odls:update:daemon:info updating nidmap
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list unpacking data to launch job [49524,1]
> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list adding new jobdat for job [49524,1]
> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list unpacking 1 app_contexts
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],0] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],1] on daemon 0
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found proc [[49524,1],1] for me!
> [compute-2-1.local:44855] adding proc [[49524,1],1] (1) to my local list
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],2] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],3] on daemon 0
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found proc [[49524,1],3] for me!
> [compute-2-1.local:44855] adding proc [[49524,1],3] (3) to my local list
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],4] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],5] on daemon 0
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found proc [[49524,1],5] for me!
> [compute-2-1.local:44855] adding proc [[49524,1],5] (5) to my local list
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],6] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],7] on daemon 0
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found proc [[49524,1],7] for me!
> [compute-2-1.local:44855] adding proc [[49524,1],7] (7) to my local list
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],8] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],9] on daemon 0
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found proc [[49524,1],9] for me!
> [compute-2-1.local:44855] adding proc [[49524,1],9] (9) to my local list
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],10] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],11] on daemon 0
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found proc [[49524,1],11] for me!
> [compute-2-1.local:44855] adding proc [[49524,1],11] (11) to my local list
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],12] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],13] on daemon 0
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found proc [[49524,1],13] for me!
> [compute-2-1.local:44855] adding proc [[49524,1],13] (13) to my local list
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],14] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constru
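A hedged aside for anyone chasing the same hang: raising the launcher's verbosity alongside the odls output may show whether the launch command is ever sent from mpirun's side. An untested sketch reusing Daniel's command line (plm_base_verbose is the launch framework's verbosity knob, following the same -mca convention as odls_base_verbose above):

  /home/apps/openmpi-1.6.3/bin/mpirun -host compute-2-0,compute-2-1 -v -np 34 \
      --leave-session-attached -mca plm_base_verbose 5 -mca odls_base_verbose 5 hostname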
Re: [OMPI users] questions to some open problems
Hi Ralph

> > some weeks ago (mainly in the beginning of October) I reported
> > several problems and I would be grateful if you can tell me if
> > and probably when somebody will try to solve them.
> >
> > 1) I don't get the expected results, when I try to send or scatter
> > the columns of a matrix in Java. The received column values have
> > nothing to do with the original values, if I use a homogeneous
> > environment, and the program breaks with "An error occurred in
> > MPI_Comm_dup" and "MPI_ERR_INTERN: internal error", if I use
> > a heterogeneous environment. I would like to use the Java API.
> >
> > 2) I don't get the expected result, when I try to scatter an object
> > in Java.
> > https://svn.open-mpi.org/trac/ompi/ticket/3351
>
> Nothing has happened on these yet

Do you have an idea when somebody will have time to fix these problems?

> > 3) I still get only a message that all nodes are already filled up
> > when I use a "rankfile" and nothing else happens. I would like
> > to use a rankfile. You filed a bug fix for it.
> >
>
> I believe rankfile was fixed, at least on the trunk - not sure if it
> was moved to 1.7. I assume that's the release you are talking about?

I'm using the trunk for my tests. It didn't work for me because I used the rankfile without a hostfile or a hostlist (it is not enough to specify the hosts in the rankfile). Everything works fine when I provide a "correct" hostfile or hostlist and the binding isn't too complicated (see my last example below).

My rankfile:

rank 0=sunpc0 slot=0:0
rank 1=sunpc1 slot=0:0
rank 2=sunpc0 slot=1:0
rank 3=sunpc1 slot=1:0

My hostfile:

sunpc0 slots=4
sunpc1 slots=4

It will not work without a hostfile or hostlist.

sunpc0 mpi-probleme 128 mpiexec -report-bindings -rf rankfile_1.openmpi \
  -np 4 hostname

The rankfile that was used claimed that a host was either not
allocated or oversubscribed its slots. Please review your rank-slot
assignments and your host allocation to ensure a proper match. Also,
some systems may require using full hostnames, such as
"host1.example.com" (instead of just plain "host1").

Host: sunpc1

sunpc0 mpi-probleme 129

I get the expected output, if I add "-hostfile host_sunpc" or "-host sunpc0,sunpc1" on the command line.

sunpc0 mpi-probleme 129 mpiexec -report-bindings -rf rankfile_1.openmpi \
  -np 4 -hostfile host_sunpc hostname
[sunpc0:06954] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/.][./.]
[sunpc0:06954] MCW rank 2 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
sunpc0
sunpc0
[sunpc1:12583] MCW rank 1 bound to socket 0[core 0[hwt 0]]: [B/.][./.]
[sunpc1:12583] MCW rank 3 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
sunpc1
sunpc1
sunpc0 mpi-probleme 130

Furthermore it is necessary that both the rankfile and the hostfile contain qualified or unqualified hostnames in the same way. Otherwise it will not work, as you can see in the following output, where my hostfile contains a qualified hostname and my rankfile only the hostname without the domain name.

sunpc0 mpi-probleme 131 mpiexec -report-bindings -rf rankfile_1.openmpi \
  -np 4 -hostfile host_sunpc_full hostname

The rankfile that was used claimed that a host was either not
allocated or oversubscribed its slots. Please review your rank-slot
assignments and your host allocation to ensure a proper match. Also,
some systems may require using full hostnames, such as
"host1.example.com" (instead of just plain "host1").

Host: sunpc1

sunpc0 mpi-probleme 132

Unfortunately my complicated rankfile still doesn't work, although you told me some weeks ago that it is correct.

rank 0=sunpc0 slot=0:0-1,1:0-1
rank 1=sunpc1 slot=0:0-1
rank 2=sunpc1 slot=1:0
rank 3=sunpc1 slot=1:1

sunpc1 mpi-probleme 103 mpiexec -report-bindings -rf rankfile -np 4 \
  -hostfile host_sunpc hostname
sunpc1
sunpc1
sunpc1
[sunpc1:12741] MCW rank 2 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
[sunpc1:12741] MCW rank 3 bound to socket 1[core 3[hwt 0]]: [./.][./B]
[sunpc1:12741] MCW rank 1 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B][./.]
[sunpc0:07075] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B][./.]
sunpc0
sunpc1 mpi-probleme 104

The bindings for ranks 1 to 3 are correct, but rank 0 didn't get the cores from the second socket.

> > 4) I would like to have "-cpus-per-proc", "-npersocket", etc. for
> > every set of machines/applications and not globally for all
> > machines/applications if I specify several colon-separated sets
> > of machines or applications on the command line. You told me that
> > it could be done.
> >
> > 5) By the way, it seems that the option "-cpus-per-proc" isn't any
> > lo
Re: [OMPI users] Problems with shared libraries while launching jobs
Note that exporting LD_LIBRARY_PATH on the mpirun command line does not necessarily apply to launching the remote orteds (it applies to launching the remote MPI processes, which are children of the orteds). Since you're using ssh, you might want to check the shell startup scripts on the target nodes (e.g., .bashrc). It is not sufficient merely to avoid overwriting LD_LIBRARY_PATH -- ensure that it actually gets set to the right location of the Intel support libraries.

You might also want to check that your .bashrc isn't setting LD_LIBRARY_PATH (or PATH, or ...) after the point where it exits for non-interactive shells. This is a common optimization trick in shell startup files -- exit early when the script detects a non-interactive shell, and skip the work that is presumably only needed for interactive logins (e.g., creating shell aliases and the like).

Random question: is there a reason you're not using Torque support? When you use Torque support, Torque will automatically copy your current environment -- including LD_LIBRARY_PATH -- to the target node before launching orted. Hence, it can actually make LD_LIBRARY_PATH issues like this easier to deal with.

On Dec 14, 2012, at 3:17 PM, Blosch, Edwin L wrote:

> I am having a weird problem launching cases with OpenMPI 1.4.3. It is most
> likely a problem with a particular node of our cluster, as the jobs will run
> fine on some submissions, but not on others. It seems to depend on the node
> list. I am just having trouble diagnosing which node it is, and what the
> nature of its problem is.
>
> One or perhaps more of the orteds are indicating they cannot find an Intel
> math library. The error is:
>
> /release/cfd/openmpi-intel/bin/orted: error while loading shared libraries:
> libimf.so: cannot open shared object file: No such file or directory
>
> I've checked the environment just before launching mpirun, and
> LD_LIBRARY_PATH includes the necessary component to point to where the Intel
> shared libraries are located. Furthermore, my mpirun command line says to
> export the LD_LIBRARY_PATH variable:
>
> Executing ['/release/cfd/openmpi-intel/bin/mpirun', '--machinefile
> /var/spool/PBS/aux/20761.maruhpc4-mgt', '-np 160', '-x LD_LIBRARY_PATH', '-x
> MPI_ENVIRONMENT=1', '/tmp/fv420761.maruhpc4-mgt/falconv4_openmpi_jsgl', '-v',
> '-cycles', '1', '-ri', 'restart.1', '-ro',
> '/tmp/fv420761.maruhpc4-mgt/restart.1']
>
> My shell-initialization script (.bashrc) does not overwrite LD_LIBRARY_PATH.
> OpenMPI is built explicitly --without-torque and should be using ssh to
> launch the orted.
>
> What options can I add to get more debugging of problems launching orted?
>
> Thanks,
>
> Ed

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
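To make the early-exit pitfall concrete, here is a minimal sketch of the pattern Jeff describes (a hypothetical .bashrc, not Ed's actual file; the Intel path is just an example):

  # Typical early-exit: bail out for non-interactive shells.
  # An ssh-launched orted takes this branch, so everything below is skipped.
  [ -z "$PS1" ] && return

  # Never reached when orted is launched over ssh:
  export LD_LIBRARY_PATH=/opt/intel/lib/intel64:$LD_LIBRARY_PATH

If the export is needed for remote launches, it has to come before the early return (or live in a file that non-interactive shells also read).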
Re: [OMPI users] Possible memory error
On Dec 14, 2012, at 4:31 PM, Handerson, Steven wrote:

> I'm trying to track down an instance of openMPI writing to a freed block of
> memory. This occurs with the most recent release (1.6.3) as well as 1.6, on
> a 64 bit Intel architecture, Fedora 14. It occurs with a very simple
> reduction (allreduce minimum), over a single int value.

Can you send a reproducer program? The simpler, the better.

> I'm wondering if the openMPI developers use power tools such as valgrind /
> dmalloc / etc on the releases to try to catch these things via exhaustive
> testing -- but I understand memory problems in C are of the nature that
> anyone making a mistake can propagate, so I haven't ruled out problems in
> our own code. Also, I'm wondering if anyone has suggestions on how to track
> this down further.

Yes, we do use such tools. Can you cite the specific file/line where the problem is occurring? The allreduce algorithms are fairly self-contained; it should be (relatively) straightforward to examine that code and see if there's a problem with the memory allocation there.

> I'm using Allinea DDT and their builtin dmalloc, which catches the error,
> which appears in the second memcpy in opal_convertor_pack(), but I don't
> have more details than that at the moment. All I know so far is that one of
> those values has been freed. Obviously, I haven't seen anything in earlier
> parts of the code which might have triggered memory corruption, although
> both openMPI and Intel IPP do things with uninitialized values before this
> (according to Valgrind).

There are a number of issues that can lead to false positives for reads of uninitialized values. Here are two of the most common cases:

1. When using TCP, one of our data headers has a padding hole in it, but we write the whole struct down a TCP socket file descriptor anyway. Hence, it will generate a "read from uninit" warning.

2. When using OpenFabrics-based networks, tools like valgrind don't see the OS-bypass initialization of the memory (which frequently comes directly from the hardware), and they generate a lot of false "read from uninit" positives.

One thing you can try is to compile Open MPI --with-valgrind. This adds a little performance penalty, but we take extra steps to eliminate most false positives. It could help separate the wheat from the chaff in your case.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
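For reference, a sketch of a valgrind-friendly build along the lines Jeff suggests (the prefix and source paths are examples, and flag spelling should be double-checked against ./configure --help for your version):

  ./configure --prefix=/opt/openmpi-1.6.3-memchecker \
      --enable-debug --enable-memchecker --with-valgrind=/usr
  make -j4 all install

  # Then run the reproducer under valgrind:
  mpirun -np 2 valgrind --leak-check=full ./reproducer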
Re: [OMPI users] question to scattering an object in openmpi-1.9a1r27380
Greetings Siegmar; sorry for the horrid delay in replying. :-(

Ralph opened a ticket about this a while ago (https://svn.open-mpi.org/trac/ompi/ticket/3351). I answered it this morning -- see the ticket for the details.

Short version: I don't think that your program is correct.

On Oct 11, 2012, at 7:40 AM, Siegmar Gross wrote:

> Hi,
>
> I have built openmpi-1.9a1r27380 with Java support and try some small
> programs. When I try to scatter an object, I get a ClassCastException.
> I use the following object.
>
> public class MyData implements java.io.Serializable
> {
>   static final long serialVersionUID = -5243516570672186644L;
>
>   private int age;
>   private String name;
>   private double salary;
>
>   public MyData ()
>   {
>     age    = 0;
>     name   = "";
>     salary = 0.0;
>   }
>
>   public void setAge (int newAge)
>   {
>     age = newAge;
>   }
>   ...
> }
>
>
> I use the following main program.
>
> import mpi.*;
>
> public class ObjectScatterMain
> {
>   public static void main (String args[]) throws MPIException
>   {
>     int mytid;                  /* my task id            */
>     MyData dataItem, objBuffer;
>     String processor_name;      /* name of local machine */
>
>     MPI.Init (args);
>     processor_name = MPI.Get_processor_name ();
>     mytid = MPI.COMM_WORLD.Rank ();
>     dataItem  = new MyData ();
>     objBuffer = new MyData ();
>     if (mytid == 0)
>     {
>       /* initialize data item */
>       dataItem.setAge (35);
>       dataItem.setName ("Smith");
>       dataItem.setSalary (2545.75);
>     }
>     MPI.COMM_WORLD.Scatter (dataItem, 0, 1, MPI.OBJECT,
>                             objBuffer, 0, 1, MPI.OBJECT, 0);
>     /* Each process prints its received data item. The outputs
>      * can intermingle on the screen so that you must use
>      * "-output-filename" in Open MPI.
>      */
>     System.out.printf ("\nProcess %d running on %s.\n" +
>                        "  Age:    %d\n" +
>                        "  Name:   %s\n" +
>                        "  Salary: %10.2f\n",
>                        mytid, processor_name,
>                        objBuffer.getAge (),
>                        objBuffer.getName (),
>                        objBuffer.getSalary ());
>     MPI.Finalize ();
>   }
> }
>
>
> I get the following error, when I compile and run the program.
>
> tyr java 218 mpijavac ObjectScatterMain.java
> tyr java 219 mpiexec java ObjectScatterMain
> Exception in thread "main" java.lang.ClassCastException:
>   MyData cannot be cast to [Ljava.lang.Object;
>     at mpi.Intracomm.copyBuffer(Intracomm.java:119)
>     at mpi.Intracomm.Scatter(Intracomm.java:389)
>     at ObjectScatterMain.main(ObjectScatterMain.java:45)
> --
> mpiexec has exited due to process rank 0 with PID 25898 on
> ...
>
> Does anybody have an idea why I get a ClassCastException, or how I must
> define an object which I can use in a scatter operation? Thank you very
> much for any help in advance.
>
>
> Kind regards
>
> Siegmar

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] questions to some open problems
On Dec 15, 2012, at 4:41 AM, Siegmar Gross wrote:

>>> 2) I don't get the expected result, when I try to scatter an object
>>> in Java.
>>> https://svn.open-mpi.org/trac/ompi/ticket/3351
>
> Do you have an idea when somebody will have time to fix these problems?

Sorry for the horrid delay. :-( I just replied to the ticket -- see the ticket for details.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] [Open MPI] #3351: JAVA scatter error
Hello,

> #3351: JAVA scatter error
> -------------------------+-------------------------------
>  Reporter: rhc           |      Owner: jsquyres
>      Type: defect        |     Status: closed
>  Priority: critical      |  Milestone: Open MPI 1.7.1
>   Version: trunk         | Resolution: invalid
>  Keywords:               |
> -------------------------+-------------------------------
> Changes (by jsquyres):
>
>  * status: new => closed
>  * resolution: => invalid
>
>
> Comment:
>
> I do not believe that the sample code provided is a valid MPI program, for
> two reasons (disclaimer: I do ''not'' know Java -- I'm just reading the
> code and making some assumptions about Java):
>
> 1. The datatypes passed to Scatter are not valid MPI datatypes
> (MPI.OBJECT). You need to construct a datatype that is specific to the
> !MyData class, just like you would in C/C++. I think that this is the
> first error that you are seeing (i.e., that OMPI is trying to treat
> MPI.OBJECT as an MPI Datatype object, and failing (and therefore throwing
> a !ClassCastException exception).

Perhaps you are right and my small example program is not a valid MPI program. The problem is that I couldn't find any good documentation or example programs showing how to write a program which uses a structured data type. Therefore I stuck to the mpiJava specification, which states for derived datatypes in chapter 3.12 that the effect of MPI_Type_struct can be achieved by using MPI.OBJECT as the buffer type and relying on Java object serialization. "dataItem" is a serializable Java object and I used MPI.OBJECT as the buffer type. How can I create a valid MPI datatype for MPI.OBJECT so that I get a working example program?

  MPI.COMM_WORLD.Scatter (dataItem, 0, 1, MPI.OBJECT,
                          objBuffer, 0, 1, MPI.OBJECT, 0);

> 2. It looks like you're trying to Scatter a single object to N peers.
> That's invalid MPI -- you need to scatter (N*M) objects to N peers, where
> M is a positive integer value (e.g., 1 or 2). Are you trying to
> broadcast?

It is the very first version of the program, in which I scatter one object to the process itself (admittedly not the normal use case for scatter, but it should nevertheless work). I didn't continue due to the error. I get the same error when I broadcast my data item.

tyr java 116 mpiexec -np 1 java -cp $DIRPREFIX_LOCAL/mpi_classfiles \
  ObjectScatterMain
Exception in thread "main" java.lang.ClassCastException:
  MyData cannot be cast to [Ljava.lang.Object;
    at mpi.Intracomm.copyBuffer(Intracomm.java:119)
    at mpi.Intracomm.Scatter(Intracomm.java:389)
    at ObjectScatterMain.main(ObjectScatterMain.java:45)

"Broadcast" works if I have only the root process, and it fails when I have one more process.

tyr java 117 mpiexec -np 1 java -cp $DIRPREFIX_LOCAL/mpi_classfiles \
  ObjectBroadcastMain
Process 0 running on tyr.informatik.hs-fulda.de.
  Age:    35
  Name:   Smith
  Salary:    2545.75

tyr java 118 mpiexec -np 2 java -cp $DIRPREFIX_LOCAL/mpi_classfiles \
  ObjectBroadcastMain
Exception in thread "main" java.lang.ClassCastException:
  MyData cannot be cast to [Ljava.lang.Object;
    at mpi.Comm.Object_Serialize(Comm.java:207)
    at mpi.Comm.Send(Comm.java:292)
    at mpi.Intracomm.Bcast(Intracomm.java:202)
    at ObjectBroadcastMain.main(ObjectBroadcastMain.java:44)

> Short version -- I don't think this bug is valid. I'm closing the ticket.

If I misunderstood the mpiJava specification and I must create a special MPI object from my Java object: how do I create it? Thank you very much for any help in advance.

Kind regards

Siegmar

Attachments: ObjectBroadcastMain.java, MyData.java
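A minimal sketch of a scatter that avoids the cast error, assuming (from the stack trace, which shows Intracomm.copyBuffer casting the buffer to [Ljava.lang.Object;) that the mpiJava bindings expect MPI.OBJECT buffers to be Java arrays of serializable objects -- one array element per scattered item -- rather than a bare object. Class and method names follow Siegmar's MyData; the program itself is mine and untested:

  import mpi.*;

  public class ObjectScatterArray
  {
    public static void main (String args[]) throws MPIException
    {
      MPI.Init (args);
      int mytid  = MPI.COMM_WORLD.Rank ();
      int ntasks = MPI.COMM_WORLD.Size ();
      /* Scatter needs N*M objects for N peers; here M = 1,
       * i.e. one array element per task. */
      MyData[] sendBuf = new MyData[ntasks];
      MyData[] recvBuf = new MyData[1];
      if (mytid == 0)
      {
        for (int i = 0; i < ntasks; ++i)
        {
          sendBuf[i] = new MyData ();
          sendBuf[i].setAge (35 + i);
          sendBuf[i].setName ("Smith");
          sendBuf[i].setSalary (2545.75);
        }
      }
      /* Both buffers are Object[] now, so copyBuffer's cast should
       * succeed; the elements travel via Java object serialization. */
      MPI.COMM_WORLD.Scatter (sendBuf, 0, 1, MPI.OBJECT,
                              recvBuf, 0, 1, MPI.OBJECT, 0);
      System.out.println ("task " + mytid + " received age " +
                          recvBuf[0].getAge ());
      MPI.Finalize ();
    }
  }

If this form still fails, that would support Jeff's reading that a MyData-specific datatype is needed; if it works, it matches the serialization behaviour the mpiJava spec (chapter 3.12) describes.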
Re: [OMPI users] questions to some open problems
Hmmm...you shouldn't need to specify a hostfile in addition to the rankfile, so something has gotten messed up in the allocator. I'll take a look at it.

As for cpus-per-proc, I'm hoping to tackle it over the holiday while I take a break from my regular job. Will let you know when fixed.

Thanks for your patience!

On Dec 15, 2012, at 1:41 AM, Siegmar Gross wrote:

> Hi Ralph
>
>>> some weeks ago (mainly in the beginning of October) I reported
>>> several problems and I would be grateful if you can tell me if
>>> and probably when somebody will try to solve them.
>>>
>>> 1) I don't get the expected results, when I try to send or scatter
>>> the columns of a matrix in Java. The received column values have
>>> nothing to do with the original values, if I use a homogeneous
>>> environment, and the program breaks with "An error occurred in
>>> MPI_Comm_dup" and "MPI_ERR_INTERN: internal error", if I use
>>> a heterogeneous environment. I would like to use the Java API.
>>>
>>> 2) I don't get the expected result, when I try to scatter an object
>>> in Java.
>>> https://svn.open-mpi.org/trac/ompi/ticket/3351
>>
>> Nothing has happened on these yet
>
> Do you have an idea when somebody will have time to fix these problems?
>
>
>>> 3) I still get only a message that all nodes are already filled up
>>> when I use a "rankfile" and nothing else happens. I would like
>>> to use a rankfile. You filed a bug fix for it.
>>>
>>
>> I believe rankfile was fixed, at least on the trunk - not sure if it
>> was moved to 1.7. I assume that's the release you are talking about?
>
> I'm using the trunk for my tests. It didn't work for me because I used
> the rankfile without a hostfile or a hostlist (it is not enough to
> specify the hosts in the rankfile). Everything works fine when I provide
> a "correct" hostfile or hostlist and the binding isn't too complicated
> (see my last example below).
>
> My rankfile:
>
> rank 0=sunpc0 slot=0:0
> rank 1=sunpc1 slot=0:0
> rank 2=sunpc0 slot=1:0
> rank 3=sunpc1 slot=1:0
>
>
> My hostfile:
>
> sunpc0 slots=4
> sunpc1 slots=4
>
>
> It will not work without a hostfile or hostlist.
>
> sunpc0 mpi-probleme 128 mpiexec -report-bindings -rf rankfile_1.openmpi \
>   -np 4 hostname
>
> The rankfile that was used claimed that a host was either not
> allocated or oversubscribed its slots. Please review your rank-slot
> assignments and your host allocation to ensure a proper match. Also,
> some systems may require using full hostnames, such as
> "host1.example.com" (instead of just plain "host1").
>
> Host: sunpc1
>
> sunpc0 mpi-probleme 129
>
>
> I get the expected output, if I add "-hostfile host_sunpc" or
> "-host sunpc0,sunpc1" on the command line.
>
> sunpc0 mpi-probleme 129 mpiexec -report-bindings -rf rankfile_1.openmpi \
>   -np 4 -hostfile host_sunpc hostname
> [sunpc0:06954] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/.][./.]
> [sunpc0:06954] MCW rank 2 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
> sunpc0
> sunpc0
> [sunpc1:12583] MCW rank 1 bound to socket 0[core 0[hwt 0]]: [B/.][./.]
> [sunpc1:12583] MCW rank 3 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
> sunpc1
> sunpc1
> sunpc0 mpi-probleme 130
>
>
> Furthermore it is necessary that both the rankfile and the hostfile
> contain qualified or unqualified hostnames in the same way. Otherwise
> it will not work, as you can see in the following output, where my
> hostfile contains a qualified hostname and my rankfile only the
> hostname without the domain name.
>
> sunpc0 mpi-probleme 131 mpiexec -report-bindings -rf rankfile_1.openmpi \
>   -np 4 -hostfile host_sunpc_full hostname
>
> The rankfile that was used claimed that a host was either not
> allocated or oversubscribed its slots. Please review your rank-slot
> assignments and your host allocation to ensure a proper match. Also,
> some systems may require using full hostnames, such as
> "host1.example.com" (instead of just plain "host1").
>
> Host: sunpc1
>
> sunpc0 mpi-probleme 132
>
>
> Unfortunately my complicated rankfile still doesn't work, although
> you told me some weeks ago that it is correct.
>
> rank 0=sunpc0 slot=0:0-1,1:0-1
> rank 1=sunpc1 slot=0:0-1
> rank 2=sunpc1 slot=1:0
> rank 3=sunpc1 slot=1:1
>
> sunpc1 mpi-probleme 103 mpiexec -report-bindings -rf rankfile -np 4 \
>   -hostfile host_sunpc hostname
> sunpc1
> sunpc1
> sunpc1
> [sunpc1:12741] MCW rank 2 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
> [sunpc1:12741] MCW rank 3 bound to socket 1[core 3[hwt 0]]: [./.][./B]
> [sunpc1:12741] MCW rank 1 bound to socket 0[core 0[hwt 0]],
> socket 0[core 1[hwt 0]]: [B/B][./.]
> [sunpc0:07075] MCW rank 0 bound to socket 0[core 0[hwt 0]],
> socket 0[core 1[hwt 0]]:
[OMPI users] segfault with one-sided communication and derived datatypes
Dear community,

I get a segfault in the small Fortran program that is attached. I use one-sided communication and derived datatypes. I tried it with different versions of Open MPI. Versions 1.4.2 and 1.4.5 work, but with 1.6.1 and 1.6.3 it crashes. Can anybody confirm this?

Many thanks,
Stephan


program test
  implicit none
  include 'mpif.h'
  ! Variables
  integer :: iproc, nproc, ierr
  call mpi_init(ierr)
  call mpi_comm_size(mpi_comm_world, nproc, ierr)
  call mpi_comm_rank(mpi_comm_world, iproc, ierr)
  write(*,'(2(a,i0))') 'I am task ',iproc,' out of ',nproc
  call test_mpi_get(iproc, nproc)
  write(*,'(a,i0,a)') 'task ',iproc,' is at the end of the program'
  call mpi_finalize(ierr)
end program test


subroutine test_mpi_get(iproc, nproc)
  implicit none
  include 'mpif.h'
  ! Calling arguments
  integer,intent(in) :: iproc, nproc
  ! Local variables
  integer,parameter :: n=1
  real(kind=8),dimension(n) :: sendbuf, recvbuf
  integer :: window, size_of_double, ierr, mpi_type, nsize, nelements

  ! Initialize sendbuf on process 0.
  if (iproc==0) then
     sendbuf=51.d0
  end if

  ! Size of a double precision number in bytes.
  call mpi_type_size(mpi_double_precision, size_of_double, ierr)

  ! Create the memory window at sendbuf.
  call mpi_win_create(sendbuf(1), int(n*size_of_double,kind=mpi_address_kind), size_of_double, &
       mpi_info_null, mpi_comm_world, window, ierr)

  ! Synchronize.
  call mpi_win_fence(0, window, ierr)

  ! Create a new derived datatype (should be identical to a double
  ! precision number).
  call mpi_type_create_hvector(1, 1, int(0,kind=mpi_address_kind), &
       mpi_double_precision, mpi_type, ierr)
  call mpi_type_commit(mpi_type, ierr)

  ! Size of the datatype in bytes.
  call mpi_type_size(mpi_type, nsize, ierr)

  ! Number of double precision elements that are communicated.
  nelements=nsize/size_of_double
  write(*,*) 'nelements,nsize,size_of_double',nelements,nsize,size_of_double

  ! Communicate the data from process 0 to all other processes, i.e. transfer
  ! them from the memory window (at sendbuf) to the receive buffer (recvbuf).
  if (iproc/=0) then
     call mpi_get(recvbuf(1), nelements, &
          mpi_double_precision, 0, int(0,kind=mpi_address_kind), &
          1, mpi_type, window, ierr)
  end if

  ! Free the derived datatype. According to the MPI standard this should
  ! not affect the communication.
  call mpi_type_free(mpi_type, ierr)

  ! Synchronize. Here the code crashes with a segfault.
  call mpi_win_fence(0, window, ierr)

  ! Free the window.
  call mpi_win_free(window, ierr)

  ! Write the results.
  if (iproc/=0) then
     write(*,'(a,i0,a,es9.2)') 'process ',iproc,' received the value ',recvbuf(1)
  end if

end subroutine test_mpi_get
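A possible workaround sketch (untested, and my own suggestion rather than anything from the thread): keep the datatype alive until the access epoch has closed, i.e. swap the order of mpi_type_free and the closing fence at the end of test_mpi_get:

  ! Untested reordering: close the epoch first, then free the datatype.
  call mpi_win_fence(0, window, ierr)   ! completes the pending mpi_get
  call mpi_type_free(mpi_type, ierr)    ! free only after the epoch has closed
  call mpi_win_free(window, ierr)

If the segfault disappears with this ordering, that would point at the 1.6-series one-sided code still referencing the datatype after it was freed, which would be consistent with the crash appearing between 1.4.5 and 1.6.1.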