Re: [OMPI users] users Digest, Vol 3510, Issue 2

2016-05-24 Thread Megdich Islem
Yes, Empire does the fluid structure coupling. It couples OpenFoam (fluid 
analysis) and Abaqus (structural analysis).
Does all the software need to have the same MPI architecture in order to
communicate?
Regards,
Islem

On Tuesday, 24 May 2016 at 01:02, Gilles Gouaillardet wrote:

What do you mean by coupling? Do Empire and OpenFoam communicate via MPI?
Wouldn't it be much easier if you rebuilt OpenFoam with mpich or intelmpi?

Cheers,
Gilles

 On 5/24/2016 8:44 AM, Megdich Islem wrote:
  
  "Open MPI does not work when MPICH or intel MPI are installed" 
  Thank you for your suggestion. But I need to run OpenFoam and Empire at the 
same time. In fact, Empire couples OpenFoam with another software. 
  Is there any solution for this case ? 
  
  Regards, Islem 
 
On Monday, 23 May 2016 at 17:00, "users-requ...@open-mpi.org" wrote:
  
 
 Send users mailing list submissions to
     us...@open-mpi.org
 
 To subscribe or unsubscribe via the World Wide Web, visit
     https://www.open-mpi.org/mailman/listinfo.cgi/users
 or, via email, send a message with subject or body 'help' to
     users-requ...@open-mpi.org
 
 You can reach the person managing the list at
     users-ow...@open-mpi.org
 
 When replying, please edit your Subject line so it is more specific
 than "Re: Contents of users digest..."
 
 
 Today's Topics:
 
   1. Re: Open MPI does not work when MPICH or intel MPI are
       installed (Andy Riebs)
   2. segmentation fault for slot-list and openmpi-1.10.3rc2
       (Siegmar Gross)
   3. Re: problem about mpirun on two nodes (Jeff Squyres (jsquyres))
   4. Re: Open MPI does not work when MPICH or intel MPI are
       installed (Gilles Gouaillardet)
   5. Re: segmentation fault for slot-list and openmpi-1.10.3rc2
       (Ralph Castain)
   6. mpirun java (Claudio Stamile)
 
 
--
 
 [Message discarded by content filter]
 --
 
 Message: 2
 Date: Mon, 23 May 2016 15:26:52 +0200
 From: Siegmar Gross 
 To: Open MPI Users 
 Subject: [OMPI users] segmentation fault for slot-list and
     openmpi-1.10.3rc2
 Message-ID:
     <241613b1-ada6-292f-eeb9-722fc8fa2...@informatik.hs-fulda.de>
 Content-Type: text/plain; charset=utf-8; format=flowed
 
 Hi,
 
 I installed openmpi-1.10.3rc2 on my "SUSE Linux Enterprise Server
 12 (x86_64)" with Sun C 5.13  and gcc-6.1.0. Unfortunately I get
 a segmentation fault for "--slot-list" for one of my small programs.
 
 
 loki spawn 119 ompi_info | grep -e "OPAL repo revision:" -e "C compiler 
absolute:"
       OPAL repo revision: v1.10.2-201-gd23dda8
       C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
 
 
 loki spawn 120 mpiexec -np 1 --host loki,loki,loki,loki,loki spawn_master
 
 Parent process 0 running on loki
   I create 4 slave processes
 
 Parent process 0: tasks in MPI_COMM_WORLD:                    1
                   tasks in COMM_CHILD_PROCESSES local group:  1
                   tasks in COMM_CHILD_PROCESSES remote group: 4
 
 Slave process 0 of 4 running on loki
 Slave process 1 of 4 running on loki
 Slave process 2 of 4 running on loki
 spawn_slave 2: argv[0]: spawn_slave
 Slave process 3 of 4 running on loki
 spawn_slave 0: argv[0]: spawn_slave
 spawn_slave 1: argv[0]: spawn_slave
 spawn_slave 3: argv[0]: spawn_slave
 
 
 
 
 loki spawn 121 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
 
 Parent process 0 running on loki
   I create 4 slave processes
 
 [loki:17326] *** Process received signal ***
 [loki:17326] Signal: Segmentation fault (11)
 [loki:17326] Signal code: Address not mapped (1)
 [loki:17326] Failing at address: 0x8
 [loki:17326] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f4e469b3870]
 [loki:17326] [ 1] *** An error occurred in MPI_Init
 *** on a NULL communicator
 *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
 ***    and potentially your MPI job)
 [loki:17324] Local abort before MPI_INIT completed successfully; not able to 
 aggregate error messages, and not able to guarantee that all other processes 
 were killed!
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7f4e46c165b0]
 [loki:17326] [ 2] 
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7f4e46bf5b08]
 [loki:17326] [ 3] *** An error occurred in MPI_Init
 *** on a NULL communicator
 *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
 ***    and potentially your MPI job)
 [loki:17325] Local abort before MPI_INIT completed successfully; not able to 
 aggregate error messages, and not able to guarantee that all other processes 
 were killed!
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa90)[0x7f4e46c1be8a]
 [loki:17326] [ 4] 
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(MPI_Init+0x180)[0x7f4e46c5828e]
 [loki:17326] [ 5] spawn_slave[0x40097e]
 [loki:17326] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f4e4661db05]
 [loki:1732

Re: [OMPI users] wtime implementation in 1.10

2016-05-24 Thread Dave Love
Ralph Castain  writes:

> Nobody ever filed a PR to update the branch with the patch - looks
> like you never responded to confirm that George’s proposed patch was
> acceptable.

I've never seen anything asking me about it, but I'm not an OMPI
developer in a position to review backports or even put things in a bug
tracker.

1.10 isn't used here, and I just subvert gettimeofday whenever I'm
running something that might use it for timing short intervals.
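
For what it's worth, the sort of replacement I mean is tiny.  A sketch,
assuming POSIX clock_gettime with CLOCK_MONOTONIC is available (this is
not the actual patch under discussion):

#include <time.h>

/* MPI_Wtime-style timer backed by a monotonic clock, so it is immune
   to wall-clock (gettimeofday) adjustments; link with -lrt on older
   glibc. */
static double monotonic_wtime(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double) ts.tv_sec + 1.0e-9 * (double) ts.tv_nsec;
}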

> I’ll create the PR and copy you for review
>
>
>> On May 23, 2016, at 9:17 AM, Dave Love  wrote:
>> 
>> I thought the 1.10 branch had been fixed to use clock_gettime for
>> MPI_Wtime where it's available, a la
>> https://www.open-mpi.org/community/lists/users/2016/04/28899.php -- and
>> have been telling people so!  However, I realize it hasn't, and it looks
>> as if 1.10 is still being maintained.
>> 
>> Is there a good reason for that, or could it be fixed?


Re: [OMPI users] users Digest, Vol 3510, Issue 2

2016-05-24 Thread Dave Love
Megdich Islem  writes:

> Yes, Empire does the fluid structure coupling. It couples OpenFoam (fluid 
> analysis) and Abaqus (structural analysis).
> Does all the software need to have the same MPI architecture in order to 
> communicate ?

I doubt it's doing that, and presumably you have no control over abaqus,
which is a major source of pain here.

You could wrap one (set of) program(s) in a script to set the
appropriate environment before invoking the real program.  That might be
a bit painful if you need many of the OF components, but it should be
straightforward to put scripts somewhere on PATH ahead of the real
versions.
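
For instance, a minimal sketch of such a wrapper (hypothetical paths;
adjust to the real install locations):

#!/bin/sh
# Hypothetical wrapper placed on PATH ahead of the real binary:
# pick the matching MPI environment, then hand off untouched.
export PATH=/opt/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH
exec /opt/OpenFOAM/platforms/bin/simpleFoam.real "$@"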

On the other hand, it never ceases to amaze me how difficult proprietary
engineering applications make life on HPC systems; I could believe
there's a catch.  Also, you (or systems people) normally want programs to
use the system MPI, assuming that's been set up appropriately.


Re: [OMPI users] segmentation fault for slot-list and openmpi-1.10.3rc2

2016-05-24 Thread Siegmar Gross

Hi Ralph,

thank you very much for your answer and your example program.

On 05/23/16 17:45, Ralph Castain wrote:

I cannot replicate the problem - both scenarios work fine for me. I'm not
convinced your test code is correct, however, as you call Comm_free on the
inter-communicator but didn't call Comm_disconnect. Check out the attached
for a correct version and see if it works for you.


I thought that I would only need MPI_Comm_disconnect if I had established a
connection with MPI_Comm_connect before. The man page for MPI_Comm_free states

"This  operation marks the communicator object for deallocation. The
handle is set to MPI_COMM_NULL. Any pending operations that use this
communicator will complete normally; the object is actually deallocated only
if there are no other active references to it.".

The man page for MPI_Comm_disconnect states

"MPI_Comm_disconnect waits for all pending communication on comm to complete
internally, deallocates the communicator object, and sets the handle to
MPI_COMM_NULL. It is  a  collective operation.".

I don't see a difference for my spawned processes, because both functions
"wait" until all pending operations have finished before the object is
destroyed. Nevertheless, perhaps my small example program has worked all
these years by chance.

However, I don't understand why my program works with
"mpiexec -np 1 --host loki,loki,loki,loki,loki spawn_master" and breaks with
"mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master". You are right,
my slot-list is equivalent to "-bind-to none". I could also have used
"mpiexec -np 1 --host loki --oversubscribe spawn_master", which works as well.

The program breaks with "There are not enough slots available in the system
to satisfy ...", if I only use "--host loki" or different host names
without mentioning five host names, "slot-list", or "oversubscribe".
Unfortunately "--host <host>:<slots>" isn't available in
openmpi-1.10.3rc2 to specify the number of available slots.

Your program behaves the same way as mine, so MPI_Comm_disconnect
will not solve my problem. I had to modify your program in a negligible way
to get it to compile.

loki spawn 153 ompi_info | grep -e "OPAL repo revision:" -e "C compiler 
absolute:"
  OPAL repo revision: v1.10.2-201-gd23dda8
 C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
loki spawn 154 mpicc simple_spawn.c
loki spawn 155 mpiexec -np 1 a.out
[pid 24008] starting up!
0 completed MPI_Init
Parent [pid 24008] about to spawn!
[pid 24010] starting up!
[pid 24011] starting up!
[pid 24012] starting up!
Parent done with spawn
Parent sending message to child
0 completed MPI_Init
Hello from the child 0 of 3 on host loki pid 24010
1 completed MPI_Init
Hello from the child 1 of 3 on host loki pid 24011
2 completed MPI_Init
Hello from the child 2 of 3 on host loki pid 24012
Child 0 received msg: 38
Child 0 disconnected
Child 1 disconnected
Child 2 disconnected
Parent disconnected
24012: exiting
24010: exiting
24008: exiting
24011: exiting


Is something wrong with my command line? I hadn't used slot-list before, so
I'm not sure whether I'm using it in the intended way.

loki spawn 156 mpiexec -np 1 --host loki --slot-list 0-5 a.out
[pid 24102] starting up!
0 completed MPI_Init
Parent [pid 24102] about to spawn!
[pid 24104] starting up!
[pid 24105] starting up!
[loki:24105] *** Process received signal ***
[loki:24105] Signal: Segmentation fault (11)
[loki:24105] Signal code: Address not mapped (1)
[loki:24105] Failing at address: 0x8
[loki:24105] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f39aa76f870]
[loki:24105] [ 1] 
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7f39aa9d25b0]
[loki:24105] [ 2] 
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7f39aa9b1b08]

[loki:24105] [ 3] *** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***and potentially your MPI job)
[loki:24104] Local abort before MPI_INIT completed successfully; not able to 
aggregate error messages, and not able to guarantee that all other processes 
were killed!

/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa90)[0x7f39aa9d7e8a]
[loki:24105] [ 4] 
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(MPI_Init+0x1a0)[0x7f39aaa142ae]

[loki:24105] [ 5] a.out[0x400d0c]
[loki:24105] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f39aa3d9b05]
[loki:24105] [ 7] a.out[0x400bf9]
[loki:24105] *** End of error message ***
---
Child job 2 terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
---
--
mpiexec detected that one or more processes exited with non-zero status, thus 
causing

the job to be terminated. The first process to do so was:

  Process 

Re: [OMPI users] segmentation fault for slot-list and openmpi-1.10.3rc2

2016-05-24 Thread Ralph Castain

> On May 24, 2016, at 4:19 AM, Siegmar Gross 
>  wrote:
> 
> Hi Ralph,
> 
> thank you very much for your answer and your example program.
> 
> On 05/23/16 17:45, Ralph Castain wrote:
>> I cannot replicate the problem - both scenarios work fine for me. I’m not
>> convinced your test code is correct, however, as you call Comm_free the
>> inter-communicator but didn’t call Comm_disconnect. Checkout the attached
>> for a correct code and see if it works for you.
> 
> I thought that I only need MPI_Comm_Disconnect, if I would have established a
> connection with MPI_Comm_connect before. The man page for MPI_Comm_free states
> 
> "This  operation marks the communicator object for deallocation. The
> handle is set to MPI_COMM_NULL. Any pending operations that use this
> communicator will complete normally; the object is actually deallocated only
> if there are no other active references to it.".
> 
> The man page for MPI_Comm_disconnect states
> 
> "MPI_Comm_disconnect waits for all pending communication on comm to complete
> internally, deallocates the communicator object, and sets the handle to
> MPI_COMM_NULL. It is  a  collective operation.".
> 
> I don't see a difference for my spawned processes, because both functions will
> "wait" until all pending operations have finished, before the object will be
> destroyed. Nevertheless, perhaps my small example program worked all the years
> by chance.
> 
> However, I don't understand, why my program works with
> "mpiexec -np 1 --host loki,loki,loki,loki,loki spawn_master" and breaks with
> "mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master". You are 
> right,
> my slot-list is equivalent to "-bind-to none". I could also have used
> "mpiexec -np 1 --host loki --oversubscribe spawn_master" which works as well.

Well, you are only giving us one slot when you specify "-host loki”, and then 
you are trying to launch multiple processes into it. The “slot-list” option 
only tells us what cpus to bind each process to - it doesn’t allocate process 
slots. So you have to tell us how many processes are allowed to run on this 
node.
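
For example, with the 1.10-era hostfile syntax ("loki" standing in for
your node):

$ cat myhosts
loki slots=6
$ mpiexec -np 1 --hostfile myhosts --slot-list 0:0-5,1:0-5 spawn_master

The hostfile grants six process slots on the node; the slot-list then
only controls which cpus each process is bound to.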

> 
> The program breaks with "There are not enough slots available in the system
> to satisfy ...", if I only use "--host loki" or different host names,
> without mentioning five host names, using "slot-list", or "oversubscribe",
> Unfortunately "--host :" isn't available for
> openmpi-1.10.3rc2 to specify the number of available slots.

Correct - we did not backport the new syntax

> 
> Your program behaves the same way as mine, so that MPI_Comm_disconnect
> will not solve my problem. I had to modify your program in a negligible way
> to get it compiled.
> 
> loki spawn 153 ompi_info | grep -e "OPAL repo revision:" -e "C compiler 
> absolute:"
>  OPAL repo revision: v1.10.2-201-gd23dda8
> C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
> loki spawn 154 mpicc simple_spawn.c
> loki spawn 155 mpiexec -np 1 a.out
> [pid 24008] starting up!
> 0 completed MPI_Init
> Parent [pid 24008] about to spawn!
> [pid 24010] starting up!
> [pid 24011] starting up!
> [pid 24012] starting up!
> Parent done with spawn
> Parent sending message to child
> 0 completed MPI_Init
> Hello from the child 0 of 3 on host loki pid 24010
> 1 completed MPI_Init
> Hello from the child 1 of 3 on host loki pid 24011
> 2 completed MPI_Init
> Hello from the child 2 of 3 on host loki pid 24012
> Child 0 received msg: 38
> Child 0 disconnected
> Child 1 disconnected
> Child 2 disconnected
> Parent disconnected
> 24012: exiting
> 24010: exiting
> 24008: exiting
> 24011: exiting
> 
> 
> Is something wrong with my command line? I didn't use slot-list before, so
> that I'm not sure, if I use it in the intended way.

I don’t know what “a.out” is, but it looks like there is some memory corruption 
there.

> 
> loki spawn 156 mpiexec -np 1 --host loki --slot-list 0-5 a.out
> [pid 24102] starting up!
> 0 completed MPI_Init
> Parent [pid 24102] about to spawn!
> [pid 24104] starting up!
> [pid 24105] starting up!
> [loki:24105] *** Process received signal ***
> [loki:24105] Signal: Segmentation fault (11)
> [loki:24105] Signal code: Address not mapped (1)
> [loki:24105] Failing at address: 0x8
> [loki:24105] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f39aa76f870]
> [loki:24105] [ 1] 
> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7f39aa9d25b0]
> [loki:24105] [ 2] 
> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7f39aa9b1b08]
> [loki:24105] [ 3] *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***and potentially your MPI job)
> [loki:24104] Local abort before MPI_INIT completed successfully; not able to 
> aggregate error messages, and not able to guarantee that all other processes 
> were killed!
> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa90)[0x7f39aa9d7e8a]
> [loki:24105] [ 4] 
> /usr/local/ope

Re: [OMPI users] users Digest, Vol 3510, Issue 2

2016-05-24 Thread Jeff Squyres (jsquyres)
Doesn't Abaqus do its own environment setup?  I.e., I'm *guessing* that you 
should be able to set your environment startup files (e.g., $HOME/.bashrc) to 
point your PATH / LD_LIBRARY_PATH to whichever MPI implementation you want, 
and Abaqus will do whatever it needs to a) be independent of your 
environment, and b) be able to function with whatever underlying MPI it wants 
to use.

This is a supposition on my part -- I am not an Abaqus user.

Note, too, that popular MPI implementations support a command line option such 
that if you invoke the absolute path name of mpirun/mpiexec, it'll set up the 
PATH / LD_LIBRARY_PATH on the remote servers to echo that of the local server.  
E.g.:

  /path/to/open/mpi/install/bin/mpirun -np 4 --host a,b,c,d my_program
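
And the $HOME/.bashrc route would be something like this (the install
path is a placeholder):

  # Select the MPI that manually-launched tools should see
  export PATH=/path/to/open/mpi/install/bin:$PATH
  export LD_LIBRARY_PATH=/path/to/open/mpi/install/lib:$LD_LIBRARY_PATH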


> On May 24, 2016, at 6:25 AM, Dave Love  wrote:
> 
> Megdich Islem  writes:
> 
>> Yes, Empire does the fluid structure coupling. It couples OpenFoam (fluid 
>> analysis) and Abaqus (structural analysis).
>> Does all the software need to have the same MPI architecture in order to 
>> communicate ?
> 
> I doubt it's doing that, and presumably you have no control over abaqus,
> which is a major source of pain here.
> 
> You could wrap one (set of) program(s) in a script to set the
> appropriate environment before invoking the real program.  That might be
> a bit painful if you need many of the OF components, but it should be
> straightforward to put scripts somewhere on PATH ahead of the real
> versions.
> 
> On the other hand, it never ceases to amaze how difficult proprietary
> engineering applications make life on HPC systems; I could believe
> there's a catch.  Also you (or systems people) normally want programs to
> use the system MPI, assuming that's been set up appropriately.


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] segmentation fault for slot-list and openmpi-1.10.3rc2

2016-05-24 Thread Jeff Squyres (jsquyres)
On May 24, 2016, at 7:19 AM, Siegmar Gross 
 wrote:
> 
> I don't see a difference for my spawned processes, because both functions will
> "wait" until all pending operations have finished, before the object will be
> destroyed. Nevertheless, perhaps my small example program worked all the years
> by chance.

FWIW: COMM_DISCONNECT is (effectively) used to guarantee that a communicator is 
destroyed.  It can actually be used with any communicator (other than 
COMM_WORLD and COMM_SELF), but is typically used with communicators created by 
CONNECT, ACCEPT, JOIN, SPAWN, or SPAWN_MULTIPLE.

When used with a communicator that was created by the dynamic operations (e.g., 
SPAWN), it effectively "disconnects" the two groups in the communicator that it 
just freed (assuming that that communicator was the only one spanning the two 
groups).
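
In code, the parent-side pattern looks roughly like this -- a minimal
sketch, not the actual test program from this thread ("spawn_slave" is a
placeholder child binary):

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm children;

    MPI_Init(&argc, &argv);
    /* Spawn 3 children over an inter-communicator */
    MPI_Comm_spawn("spawn_slave", MPI_ARGV_NULL, 3, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);
    /* ... exchange messages over the inter-communicator ... */
    /* Wait for pending traffic, then sever the connection between
       the parent and child groups */
    MPI_Comm_disconnect(&children);
    MPI_Finalize();
    return 0;
}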

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] segmentation fault for slot-list and openmpi-1.10.3rc2

2016-05-24 Thread Siegmar Gross

Hi Ralph,

I have copied the relevant lines here, so that it is easier to see what
happens. "a.out" is your program, which I compiled with mpicc.

>> loki spawn 153 ompi_info | grep -e "OPAL repo revision:" -e "C compiler
>> absolute:"
>>  OPAL repo revision: v1.10.2-201-gd23dda8
>> C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
>> loki spawn 154 mpicc simple_spawn.c

>> loki spawn 155 mpiexec -np 1 a.out
>> [pid 24008] starting up!
>> 0 completed MPI_Init
...

"mpiexec -np 1 a.out" works.



> I don’t know what “a.out” is, but it looks like there is some memory
> corruption there.

"a.out" is still your program. I get the same error on different
machines, so that it is not very likely, that the (hardware) memory
is corrupted.


>> loki spawn 156 mpiexec -np 1 --host loki --slot-list 0-5 a.out
>> [pid 24102] starting up!
>> 0 completed MPI_Init
>> Parent [pid 24102] about to spawn!
>> [pid 24104] starting up!
>> [pid 24105] starting up!
>> [loki:24105] *** Process received signal ***
>> [loki:24105] Signal: Segmentation fault (11)
>> [loki:24105] Signal code: Address not mapped (1)
...

"mpiexec -np 1 --host loki --slot-list 0-5 a.out" breaks with a segmentation
faUlt. Can I do something, so that you can find out, what happens?


Kind regards

Siegmar



On 05/24/16 15:07, Ralph Castain wrote:



On May 24, 2016, at 4:19 AM, Siegmar Gross
mailto:siegmar.gr...@informatik.hs-fulda.de>> wrote:

Hi Ralph,

thank you very much for your answer and your example program.

On 05/23/16 17:45, Ralph Castain wrote:

I cannot replicate the problem - both scenarios work fine for me. I’m not
convinced your test code is correct, however, as you call Comm_free the
inter-communicator but didn’t call Comm_disconnect. Checkout the attached
for a correct code and see if it works for you.


I thought that I only need MPI_Comm_Disconnect, if I would have established a
connection with MPI_Comm_connect before. The man page for MPI_Comm_free states

"This  operation marks the communicator object for deallocation. The
handle is set to MPI_COMM_NULL. Any pending operations that use this
communicator will complete normally; the object is actually deallocated only
if there are no other active references to it.".

The man page for MPI_Comm_disconnect states

"MPI_Comm_disconnect waits for all pending communication on comm to complete
internally, deallocates the communicator object, and sets the handle to
MPI_COMM_NULL. It is  a  collective operation.".

I don't see a difference for my spawned processes, because both functions will
"wait" until all pending operations have finished, before the object will be
destroyed. Nevertheless, perhaps my small example program worked all the years
by chance.

However, I don't understand, why my program works with
"mpiexec -np 1 --host loki,loki,loki,loki,loki spawn_master" and breaks with
"mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master". You are right,
my slot-list is equivalent to "-bind-to none". I could also have used
"mpiexec -np 1 --host loki --oversubscribe spawn_master" which works as well.


Well, you are only giving us one slot when you specify "-host loki”, and then
you are trying to launch multiple processes into it. The “slot-list” option only
tells us what cpus to bind each process to - it doesn’t allocate process slots.
So you have to tell us how many processes are allowed to run on this node.



The program breaks with "There are not enough slots available in the system
to satisfy ...", if I only use "--host loki" or different host names,
without mentioning five host names, using "slot-list", or "oversubscribe",
Unfortunately "--host :" isn't available for
openmpi-1.10.3rc2 to specify the number of available slots.


Correct - we did not backport the new syntax



Your program behaves the same way as mine, so that MPI_Comm_disconnect
will not solve my problem. I had to modify your program in a negligible way
to get it compiled.

loki spawn 153 ompi_info | grep -e "OPAL repo revision:" -e "C compiler 
absolute:"
 OPAL repo revision: v1.10.2-201-gd23dda8
C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
loki spawn 154 mpicc simple_spawn.c
loki spawn 155 mpiexec -np 1 a.out
[pid 24008] starting up!
0 completed MPI_Init
Parent [pid 24008] about to spawn!
[pid 24010] starting up!
[pid 24011] starting up!
[pid 24012] starting up!
Parent done with spawn
Parent sending message to child
0 completed MPI_Init
Hello from the child 0 of 3 on host loki pid 24010
1 completed MPI_Init
Hello from the child 1 of 3 on host loki pid 24011
2 completed MPI_Init
Hello from the child 2 of 3 on host loki pid 24012
Child 0 received msg: 38
Child 0 disconnected
Child 1 disconnected
Child 2 disconnected
Parent disconnected
24012: exiting
24010: exiting
24008: exiting
24011: exiting


Is something wrong with my command line? I didn't use slot-list before, so
that I'm not sure, if I use it in the intended way.


I don’t know what “a.out” is, but it looks like there is some memo

Re: [OMPI users] users Digest, Vol 3510, Issue 2

2016-05-24 Thread Ralph Castain
Just to clarify, as this is a frequent misconception: the statement that the 
absolute path will set up your remote environment is only true when using the 
rsh/ssh launcher. It is not true when running under a resource manager (e.g., 
SLURM, LSF, PBSPro, etc.). In those cases, it is up to the RM configuration 
whether or not the local environment is forwarded, and we have no control or 
influence over the remote path.


> On May 24, 2016, at 6:14 AM, Jeff Squyres (jsquyres)  
> wrote:
> 
> Doesn't Abaqus do its own environment setup?  I.e., I'm *guessing* that you 
> should be able to set your environment startup files (e.g., $HOME/.bashrc) to 
> point your PATH / LD_LIBRARY_PATH to point to whichever MPI implementation 
> you want, and Abaqus will do whatever it needs to a) be independent of your 
> environment, and b) be able to function with whatever underlying MPI it wants 
> to use.
> 
> This is a supposition on my part -- I am not an Abaqus user.
> 
> Note, too that popular MPI implementations support a command line option such 
> that if you invoke the absolute path name of mpirun/mpiexec, it'll setup the 
> PATH / LD_LIBRARY_PATH on the remote servers to echo that of the local 
> server.  E.g.:
> 
>  /path/to/open/mpi/install/bin/mpirun -np 4 --host a,b,c,d my_program
> 
> 
>> On May 24, 2016, at 6:25 AM, Dave Love  wrote:
>> 
>> Megdich Islem  writes:
>> 
>>> Yes, Empire does the fluid structure coupling. It couples OpenFoam (fluid 
>>> analysis) and Abaqus (structural analysis).
>>> Does all the software need to have the same MPI architecture in order to 
>>> communicate ?
>> 
>> I doubt it's doing that, and presumably you have no control over abaqus,
>> which is a major source of pain here.
>> 
>> You could wrap one (set of) program(s) in a script to set the
>> appropriate environment before invoking the real program.  That might be
>> a bit painful if you need many of the OF components, but it should be
>> straightforward to put scripts somewhere on PATH ahead of the real
>> versions.
>> 
>> On the other hand, it never ceases to amaze how difficult proprietary
>> engineering applications make life on HPC systems; I could believe
>> there's a catch.  Also you (or systems people) normally want programs to
>> use the system MPI, assuming that's been set up appropriately.
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 



Re: [OMPI users] users Digest, Vol 3510, Issue 2

2016-05-24 Thread Scott Shaw
Most commercial applications (e.g., Ansys Fluent, Abaqus, NASTRAN, and 
PAM-CRASH) bundle IBM Platform MPI with the application, and it is the default 
MPI when running parallel simulations.  Depending on which Abaqus release 
you're using, your choices are IBM Platform MPI or Intel MPI.  I don't recall 
OpenMPI being a supported MPI implementation for Abaqus. 

Scott Shaw
SGI - Principal Apps Engineer



Re: [OMPI users] segmentation fault for slot-list and openmpi-1.10.3rc2

2016-05-24 Thread Ralph Castain

> On May 24, 2016, at 6:21 AM, Siegmar Gross 
>  wrote:
> 
> Hi Ralph,
> 
> I copy the relevant lines to this place, so that it is easier to see what
> happens. "a.out" is your program, which I compiled with mpicc.
> 
> >> loki spawn 153 ompi_info | grep -e "OPAL repo revision:" -e "C compiler
> >> absolute:"
> >>  OPAL repo revision: v1.10.2-201-gd23dda8
> >> C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
> >> loki spawn 154 mpicc simple_spawn.c
> 
> >> loki spawn 155 mpiexec -np 1 a.out
> >> [pid 24008] starting up!
> >> 0 completed MPI_Init
> ...
> 
> "mpiexec -np 1 a.out" works.
> 
> 
> 
> > I don’t know what “a.out” is, but it looks like there is some memory
> > corruption there.
> 
> "a.out" is still your program. I get the same error on different
> machines, so that it is not very likely, that the (hardware) memory
> is corrupted.
> 
> 
> >> loki spawn 156 mpiexec -np 1 --host loki --slot-list 0-5 a.out
> >> [pid 24102] starting up!
> >> 0 completed MPI_Init
> >> Parent [pid 24102] about to spawn!
> >> [pid 24104] starting up!
> >> [pid 24105] starting up!
> >> [loki:24105] *** Process received signal ***
> >> [loki:24105] Signal: Segmentation fault (11)
> >> [loki:24105] Signal code: Address not mapped (1)
> ...
> 
> "mpiexec -np 1 --host loki --slot-list 0-5 a.out" breaks with a segmentation
> faUlt. Can I do something, so that you can find out, what happens?

I honestly have no idea - perhaps Gilles can help, as I have no access to that 
kind of environment. We aren’t seeing such problems elsewhere, so it is likely 
something local.

> 
> 
> Kind regards
> 
> Siegmar
> 
> 
> 
> On 05/24/16 15:07, Ralph Castain wrote:
>> 
>>> On May 24, 2016, at 4:19 AM, Siegmar Gross
>>> >> > wrote:
>>> 
>>> Hi Ralph,
>>> 
>>> thank you very much for your answer and your example program.
>>> 
>>> On 05/23/16 17:45, Ralph Castain wrote:
 I cannot replicate the problem - both scenarios work fine for me. I’m not
 convinced your test code is correct, however, as you call Comm_free the
 inter-communicator but didn’t call Comm_disconnect. Checkout the attached
 for a correct code and see if it works for you.
>>> 
>>> I thought that I only need MPI_Comm_Disconnect, if I would have established 
>>> a
>>> connection with MPI_Comm_connect before. The man page for MPI_Comm_free 
>>> states
>>> 
>>> "This  operation marks the communicator object for deallocation. The
>>> handle is set to MPI_COMM_NULL. Any pending operations that use this
>>> communicator will complete normally; the object is actually deallocated only
>>> if there are no other active references to it.".
>>> 
>>> The man page for MPI_Comm_disconnect states
>>> 
>>> "MPI_Comm_disconnect waits for all pending communication on comm to complete
>>> internally, deallocates the communicator object, and sets the handle to
>>> MPI_COMM_NULL. It is  a  collective operation.".
>>> 
>>> I don't see a difference for my spawned processes, because both functions 
>>> will
>>> "wait" until all pending operations have finished, before the object will be
>>> destroyed. Nevertheless, perhaps my small example program worked all the 
>>> years
>>> by chance.
>>> 
>>> However, I don't understand, why my program works with
>>> "mpiexec -np 1 --host loki,loki,loki,loki,loki spawn_master" and breaks with
>>> "mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master". You are 
>>> right,
>>> my slot-list is equivalent to "-bind-to none". I could also have used
>>> "mpiexec -np 1 --host loki --oversubscribe spawn_master" which works as 
>>> well.
>> 
>> Well, you are only giving us one slot when you specify "-host loki”, and then
>> you are trying to launch multiple processes into it. The “slot-list” option 
>> only
>> tells us what cpus to bind each process to - it doesn’t allocate process 
>> slots.
>> So you have to tell us how many processes are allowed to run on this node.
>> 
>>> 
>>> The program breaks with "There are not enough slots available in the system
>>> to satisfy ...", if I only use "--host loki" or different host names,
>>> without mentioning five host names, using "slot-list", or "oversubscribe",
>>> Unfortunately "--host :" isn't available for
>>> openmpi-1.10.3rc2 to specify the number of available slots.
>> 
>> Correct - we did not backport the new syntax
>> 
>>> 
>>> Your program behaves the same way as mine, so that MPI_Comm_disconnect
>>> will not solve my problem. I had to modify your program in a negligible way
>>> to get it compiled.
>>> 
>>> loki spawn 153 ompi_info | grep -e "OPAL repo revision:" -e "C compiler 
>>> absolute:"
>>> OPAL repo revision: v1.10.2-201-gd23dda8
>>>C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
>>> loki spawn 154 mpicc simple_spawn.c
>>> loki spawn 155 mpiexec -np 1 a.out
>>> [pid 24008] starting up!
>>> 0 completed MPI_Init
>>> Parent [pid 24008] about to spawn!
>>> [pid 24010] starting up!
>>> [pid 24011] starting up!
>>> [pid 24012] start

Re: [OMPI users] segmentation fault for slot-list and openmpi-1.10.3rc2

2016-05-24 Thread Siegmar Gross

Hi Ralph and Gilles,

The program breaks only if I combine "--host" and "--slot-list". Perhaps this
information is helpful. I am using a different machine now, so that you can see
that the problem is not restricted to "loki".


pc03 spawn 115 ompi_info | grep -e "OPAL repo revision:" -e "C compiler 
absolute:"
  OPAL repo revision: v1.10.2-201-gd23dda8
 C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc


pc03 spawn 116 uname -a
Linux pc03 3.12.55-52.42-default #1 SMP Thu Mar 3 10:35:46 UTC 2016 (4354e1d) 
x86_64 x86_64 x86_64 GNU/Linux



pc03 spawn 117 cat host_pc03.openmpi
pc03.informatik.hs-fulda.de slots=12 max_slots=12


pc03 spawn 118 mpicc simple_spawn.c


pc03 spawn 119 mpiexec -np 1 --report-bindings a.out
[pc03:03711] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: 
[BB/../../../../..][../../../../../..]

[pid 3713] starting up!
0 completed MPI_Init
Parent [pid 3713] about to spawn!
[pc03:03711] MCW rank 0 bound to socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 
0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 
10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: [../../../../../..][BB/BB/BB/BB/BB/BB]
[pc03:03711] MCW rank 1 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 
0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 
0-1]], socket 0[core 5[hwt 0-1]]: [BB/BB/BB/BB/BB/BB][../../../../../..]
[pc03:03711] MCW rank 2 bound to socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 
0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 
10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: [../../../../../..][BB/BB/BB/BB/BB/BB]

[pid 3715] starting up!
[pid 3716] starting up!
[pid 3717] starting up!
Parent done with spawn
Parent sending message to child
0 completed MPI_Init
Hello from the child 0 of 3 on host pc03 pid 3715
1 completed MPI_Init
Hello from the child 1 of 3 on host pc03 pid 3716
2 completed MPI_Init
Hello from the child 2 of 3 on host pc03 pid 3717
Child 0 received msg: 38
Child 0 disconnected
Child 2 disconnected
Parent disconnected
Child 1 disconnected
3713: exiting
3715: exiting
3716: exiting
3717: exiting


pc03 spawn 120 mpiexec -np 1 --hostfile host_pc03.openmpi --slot-list 
0:0-1,1:0-1 --report-bindings a.out
[pc03:03729] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 
0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: 
[BB/BB/../../../..][BB/BB/../../../..]

[pid 3731] starting up!
0 completed MPI_Init
Parent [pid 3731] about to spawn!
[pc03:03729] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 
0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: 
[BB/BB/../../../..][BB/BB/../../../..]
[pc03:03729] MCW rank 1 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 
0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: 
[BB/BB/../../../..][BB/BB/../../../..]
[pc03:03729] MCW rank 2 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 
0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: 
[BB/BB/../../../..][BB/BB/../../../..]

[pid 3733] starting up!
[pid 3734] starting up!
[pid 3735] starting up!
Parent done with spawn
Parent sending message to child
2 completed MPI_Init
Hello from the child 2 of 3 on host pc03 pid 3735
1 completed MPI_Init
Hello from the child 1 of 3 on host pc03 pid 3734
0 completed MPI_Init
Hello from the child 0 of 3 on host pc03 pid 3733
Child 0 received msg: 38
Child 0 disconnected
Child 2 disconnected
Child 1 disconnected
Parent disconnected
3731: exiting
3734: exiting
3733: exiting
3735: exiting


pc03 spawn 121 mpiexec -np 1 --host pc03 --slot-list 0:0-1,1:0-1 
--report-bindings a.out
[pc03:03744] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 
0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: 
[BB/BB/../../../..][BB/BB/../../../..]

[pid 3746] starting up!
0 completed MPI_Init
Parent [pid 3746] about to spawn!
[pc03:03744] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 
0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: 
[BB/BB/../../../..][BB/BB/../../../..]
[pc03:03744] MCW rank 2 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 
0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: 
[BB/BB/../../../..][BB/BB/../../../..]

[pid 3748] starting up!
[pid 3749] starting up!
[pc03:03749] *** Process received signal ***
[pc03:03749] Signal: Segmentation fault (11)
[pc03:03749] Signal code: Address not mapped (1)
[pc03:03749] Failing at address: 0x8
[pc03:03749] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7fe6f0d1f870]
[pc03:03749] [ 1] 
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7fe6f0f825b0]
[pc03:03749] [ 2] 
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7fe6f0f61b08]
[pc03:03749] [ 3] 
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa90)[0x7fe6f0f87e8a]
[pc03:03749] [ 4] 
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(MPI_Init+0x1a0)[0x7fe6f0fc42ae]

[pc03:03749]

Re: [OMPI users] segmentation fault for slot-list and openmpi-1.10.3rc2

2016-05-24 Thread Ralph Castain
Works perfectly for me, so I believe this must be an environment issue - I am 
using gcc 6.0.0 on CentOS7 with x86:

$ mpirun -n 1 -host bend001 --slot-list 0:0-1,1:0-1 --report-bindings 
./simple_spawn
[bend001:17599] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 
1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: 
[BB/BB/../../../..][BB/BB/../../../..]
[pid 17601] starting up!
0 completed MPI_Init
Parent [pid 17601] about to spawn!
[pid 17603] starting up!
[bend001:17599] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 
1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: 
[BB/BB/../../../..][BB/BB/../../../..]
[bend001:17599] MCW rank 1 bound to socket 0[core 0[hwt 0-1]], socket 0[core 
1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: 
[BB/BB/../../../..][BB/BB/../../../..]
[bend001:17599] MCW rank 2 bound to socket 0[core 0[hwt 0-1]], socket 0[core 
1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: 
[BB/BB/../../../..][BB/BB/../../../..]
[pid 17604] starting up!
[pid 17605] starting up!
Parent done with spawn
Parent sending message to child
0 completed MPI_Init
Hello from the child 0 of 3 on host bend001 pid 17603
Child 0 received msg: 38
1 completed MPI_Init
Hello from the child 1 of 3 on host bend001 pid 17604
2 completed MPI_Init
Hello from the child 2 of 3 on host bend001 pid 17605
Child 0 disconnected
Child 2 disconnected
Parent disconnected
Child 1 disconnected
17603: exiting
17605: exiting
17601: exiting
17604: exiting
$

> On May 24, 2016, at 7:18 AM, Siegmar Gross 
>  wrote:
> 
> Hi Ralph and Gilles,
> 
> the program breaks only, if I combine "--host" and "--slot-list". Perhaps this
> information is helpful. I use a different machine now, so that you can see 
> that
> the problem is not restricted to "loki".
> 
> 
> pc03 spawn 115 ompi_info | grep -e "OPAL repo revision:" -e "C compiler 
> absolute:"
>  OPAL repo revision: v1.10.2-201-gd23dda8
> C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
> 
> 
> pc03 spawn 116 uname -a
> Linux pc03 3.12.55-52.42-default #1 SMP Thu Mar 3 10:35:46 UTC 2016 (4354e1d) 
> x86_64 x86_64 x86_64 GNU/Linux
> 
> 
> pc03 spawn 117 cat host_pc03.openmpi
> pc03.informatik.hs-fulda.de slots=12 max_slots=12
> 
> 
> pc03 spawn 118 mpicc simple_spawn.c
> 
> 
> pc03 spawn 119 mpiexec -np 1 --report-bindings a.out
> [pc03:03711] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: 
> [BB/../../../../..][../../../../../..]
> [pid 3713] starting up!
> 0 completed MPI_Init
> Parent [pid 3713] about to spawn!
> [pc03:03711] MCW rank 0 bound to socket 1[core 6[hwt 0-1]], socket 1[core 
> 7[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 
> 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: 
> [../../../../../..][BB/BB/BB/BB/BB/BB]
> [pc03:03711] MCW rank 1 bound to socket 0[core 0[hwt 0-1]], socket 0[core 
> 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 
> 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]: 
> [BB/BB/BB/BB/BB/BB][../../../../../..]
> [pc03:03711] MCW rank 2 bound to socket 1[core 6[hwt 0-1]], socket 1[core 
> 7[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 
> 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: 
> [../../../../../..][BB/BB/BB/BB/BB/BB]
> [pid 3715] starting up!
> [pid 3716] starting up!
> [pid 3717] starting up!
> Parent done with spawn
> Parent sending message to child
> 0 completed MPI_Init
> Hello from the child 0 of 3 on host pc03 pid 3715
> 1 completed MPI_Init
> Hello from the child 1 of 3 on host pc03 pid 3716
> 2 completed MPI_Init
> Hello from the child 2 of 3 on host pc03 pid 3717
> Child 0 received msg: 38
> Child 0 disconnected
> Child 2 disconnected
> Parent disconnected
> Child 1 disconnected
> 3713: exiting
> 3715: exiting
> 3716: exiting
> 3717: exiting
> 
> 
> pc03 spawn 120 mpiexec -np 1 --hostfile host_pc03.openmpi --slot-list 
> 0:0-1,1:0-1 --report-bindings a.out
> [pc03:03729] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 
> 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: 
> [BB/BB/../../../..][BB/BB/../../../..]
> [pid 3731] starting up!
> 0 completed MPI_Init
> Parent [pid 3731] about to spawn!
> [pc03:03729] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 
> 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: 
> [BB/BB/../../../..][BB/BB/../../../..]
> [pc03:03729] MCW rank 1 bound to socket 0[core 0[hwt 0-1]], socket 0[core 
> 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: 
> [BB/BB/../../../..][BB/BB/../../../..]
> [pc03:03729] MCW rank 2 bound to socket 0[core 0[hwt 0-1]], socket 0[core 
> 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: 
> [BB/BB/../../../..][BB/BB/../../../..]
> [pid 3733] starting up!
> [pid 3734] starting up!
> [pid 3735] starting up!
> Parent done with spawn
> Parent sending message to child
> 2 completed MPI_Init
> Hello from the child 

Re: [OMPI users] problem with exceptions in Java interface

2016-05-24 Thread Howard Pritchard
Hi Siegmar,

Sorry for the delay, I seem to have missed this one.

It looks like there's an error in the way the native methods process
Java exceptions.  The code correctly builds up an exception message for
cases where the MPI C layer returns non-success, but not if the problem
occurred in one of the JNI utilities.
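
For reference, the user-side pattern being exercised looks roughly like
this -- a hypothetical reconstruction of Exception_1_Main, not the
original source:

import mpi.*;

public class Exception_1_Main {
    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        // Ask for exceptions instead of aborting on error
        MPI.COMM_WORLD.setErrhandler(MPI.ERRORS_RETURN);
        int[] buf = new int[1];
        try {
            MPI.COMM_WORLD.bcast(buf, 1, MPI.INT, -1); // invalid root
        } catch (MPIException e) {
            System.out.println("Caught an exception.");
            System.out.println(e.getMessage());
        }
        MPI.Finalize();
    }
}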

Issue filed:
https://github.com/open-mpi/ompi/issues/1698


Thanks for reporting this.


Howard


2016-05-20 9:25 GMT-06:00 Siegmar Gross <
siegmar.gr...@informatik.hs-fulda.de>:

> Hi,
>
> I tried MPI.ERRORS_RETURN in a small Java program with Open MPI
> 1.10.2 and master. I get the expected behaviour, if I use a
> wrong value for the root process in "bcast". Unfortunately I
> get an MPI or Java error message if I try to broadcast more data
> than available. Is this intended or is it a problem in the Java
> interface of Open MPI? I would be grateful if somebody can answer
> my question.
>
> loki java 194 mpijavac Exception_1_Main.java
> loki java 195 mpijavac Exception_2_Main.java
>
> loki java 196 mpiexec -np 1 java Exception_1_Main
> Set error handler for MPI.COMM_WORLD to MPI.ERRORS_RETURN.
> Call "bcast" with wrong "root" process.
> Caught an exception.
> MPI_ERR_ROOT: invalid root
>
>
> loki java 197 mpiexec -np 1 java Exception_2_Main
> Set error handler for MPI.COMM_WORLD to MPI.ERRORS_RETURN.
> Call "bcast" with index out-of bounds.
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
> at mpi.Comm.bcast(Native Method)
> at mpi.Comm.bcast(Comm.java:1231)
> at Exception_2_Main.main(Exception_2_Main.java:44)
> ---
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> ---
> --
> mpiexec detected that one or more processes exited with non-zero status,
> thus causing
> the job to be terminated. The first process to do so was:
>
>   Process name: [[38300,1],0]
>   Exit code:1
> --
> loki java 198
>
>
> Kind regards and thank you very much for any help in advance
>
> Siegmar
>
>


Re: [OMPI users] OpenMPI 1.6.5 on CentOS 7.1, silence ib-locked-pages?

2016-05-24 Thread Ryan Novosielski
> On May 18, 2016, at 6:59 PM, Jeff Squyres (jsquyres)  
> wrote:
> 
> On May 18, 2016, at 6:16 PM, Ryan Novosielski  wrote:
>> 
>> I’m pretty sure this is no longer relevant (having read Roland’s messages 
>> about it from a couple of years ago now). Can you please confirm that for 
>> me, and then let me know if there is any way that I can silence this old 
>> copy of OpenMPI that I need to use with some software that depends on it for 
>> some reason? It is causing my users to report it as an issue pretty 
>> regularly.
> 
> The message cites that only 32MB is able to be registered out of a total of 
> 128MB.  That seems low to me.

That’s 32768 MiB, 32GB (out of 128GB).

> Did you look at the FAQ item and see if there are system limits that you 
> should increase?

I did, but again, I’ve seen other messages about this that indicate that it’s 
not required, such as this one:

https://www.open-mpi.org/community/lists/users/2014/08/25090.php

What can happen in these circumstances where you set something that’s not 
required is that you end up finding out down the road that the default has 
changed appropriately but you’re still hard-coding the wrong settings. I’d 
prefer to avoid laying those sorts of traps for myself whenever possible. :)

I’m not as experienced with OpenMPI as I could be; is this not the same issue? 
I am using the CentOS-supplied Mellanox drivers.

[novosirj@perceval2 profile.d]$ modinfo mlx4_en
filename:   
/lib/modules/3.10.0-229.20.1.el7.x86_64/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_en.ko
version:2.2-1 (Feb 2014)
license:Dual BSD/GPL
description:Mellanox ConnectX HCA Ethernet driver
author: Liran Liss, Yevgeny Petrilin
rhelversion:7.1
srcversion: DC68737527B57AD77CD3AD6
depends:mlx4_core,ptp,vxlan
intree: Y
vermagic:   3.10.0-229.20.1.el7.x86_64 SMP mod_unload modversions
signer: CentOS Linux kernel signing key
sig_key:38:C3:70:0F:5B:84:90:11:D3:72:15:7D:E5:CD:06:17:C8:15:DE:03
sig_hashalgo:   sha256
parm:   udp_rss:Enable RSS for incoming UDP traffic or disabled (0) 
(uint)
parm:   pfctx:Priority based Flow Control policy on TX[7:0]. Per 
priority bit mask (uint)
parm:   pfcrx:Priority based Flow Control policy on RX[7:0]. Per 
priority bit mask (uint)
parm:   inline_thold:Threshold for using inline data (range: 17-104, 
default: 104) (uint)

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
   `'



Re: [OMPI users] fortran problem when mixing "use mpi" and "use mpi_f08" with gfortran 5

2016-05-24 Thread Jeff Squyres (jsquyres)
On May 21, 2016, at 12:17 PM, Andrea Negri  wrote:
> 
> Hi, in the last few days I ported my entire fortran mpi code to "use
> mpi_f08". You really did a great job with this interface. However,
> since HDF5 still uses integers to handle communicators, I have a
> module where I still use "use mpi", and with gfortran 5.3.0 and
> openmpi-1.10.2 I got some errors.

FWIW, you can actually mix integer handles with mpi_f08 handles.  The mpi_f08 
handles are derived datatypes that contain exactly one member: an integer.

For example, if you call a subroutine with an integer MPI handle as a dummy 
parameter, and that subroutine has a Type(MPI_Comm)::comm variable, you can 
assign:

comm%mpi_val = integer_handle

And then use that Type(MPI_Comm) as an mpi_f08 handle.

The opposite is true, too -- you can extract the %mpi_val from an mpi_f08 
handle and use it as an integer handle with "use mpi" or mpif.h interfaces.

Meaning: the %mpi_val value is exactly equivalent to the integer handles.
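
As a concrete sketch (with a hypothetical "legacy_routine" standing in
for an HDF5-style interface that takes an INTEGER communicator):

subroutine call_legacy(comm)
  use mpi_f08
  implicit none
  type(MPI_Comm), intent(in) :: comm
  integer :: icomm

  icomm = comm%mpi_val          ! mpi_f08 handle -> integer handle
  call legacy_routine(icomm)    ! "use mpi" / mpif.h style interface
end subroutine call_legacy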

> I have been able to produce an extremely minimalistic example that
> reproduces the same errors. If you try to compile with mpifort -c this
> file

I think you had a typo -- it should be "use test1_mod", right?

I expanded your code a little to give it a program, and run it:

-
  !==   
  
module test1_mod
  ! I use ONLY here just to show you that errors happen even with  ONLY 
  
  use mpi, only: MPI_BARRIER,  MPI_COMM_WORLD
  implicit none
  private
  public ::  test1
contains
  subroutine test1(a)
implicit none
real, intent(inout) :: a
integer :: ierr
a=0
print *, "Here I am"
call MPI_INIT(ierr)
call mpi_barrier(MPI_COMM_WORLD, ierr)
call MPI_FINALIZE(ierr)
print *, "Done with finalize"
  endsubroutine test1
endmodule test1_mod



module prova2
  use mpi_f08
  implicit none
  public :: prova3
contains
  subroutine prova3
use test1_mod
implicit none
real :: a

call test1(a)
  endsubroutine prova3
endmodule prova2
!== 
  

program doit
  use prova2
  implicit none

  call prova3()
end program doit
-

I then compiled it with the Intel compiler and ran it:

-
$ mpifort mix-usempi-usempif08-2.f90 -I. && mpirun -np 2 ./a.out
 Here I am
 Here I am
 Done with finalize
 Done with finalize
-

Now that does not mean that there isn't a bug in OMPI -- I'm just saying that 
it works with the Intel compiler.

I tried the following:

- Open MPI dev master with icc: works
- Open MPI dev master with gcc 5.2.0: works
- Open MPI v1.10.x head with icc: works
- Open MPI v1.10.x head with gcc 5.2.0: same errors you got
- Open MPI v2.x head with gcc 5.2.0: works

We have clearly changed something since v1.10.x, but I don't know offhand 
exactly what would have caused this difference.

Is there a chance you can use the Intel compiler?  Or the Open MPI v2.0.0 rc?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/