Re: [OMPI users] use accept/connect to merge a new intra-comm

2015-02-01 Thread Ralph Castain
Which OMPI version?

> On Jan 25, 2015, at 5:41 AM, haozi  wrote:
> 
> Hi guys.
> 
> I am interested in an example from Open MPI (attached: 
> singleton_client_server.c), so I wrote another example of my own and ran 
> into errors.
> My example includes two servers and one client. First, server1 starts. Then 
> the client starts, and server1 and the client merge into a new 
> intra-comm. Next, server2 starts: server2 opens a port, and server1 and the client try 
> to connect to server2. At that point things go wrong: the client and server2 
> block, and server1 reports the following errors:
> 
> [datanode-2:06824] [[53818,0],0]:route_callback tried routing message from 
> [[53818,1],0] to [[53822,1],0]:16, can't find route
> [0] func:/usr/lib/libopen-pal.so.0(opal_backtrace_print+0x30) [0xb769fbc0]
> [1] func:/usr/lib/openmpi/lib/openmpi/mca_rml_oob.so(+0x1bfd) [0xb748fbfd]
> [2] func:/usr/lib/openmpi/lib/openmpi/mca_oob_tcp.so(+0x6cfa) [0xb7484cfa]
> [3] func:/usr/lib/openmpi/lib/openmpi/mca_oob_tcp.so(+0x81c2) [0xb74861c2]
> [4] func:/usr/lib/libopen-pal.so.0(+0x1aca4) [0xb7688ca4]
> [5] func:/usr/lib/libopen-pal.so.0(opal_event_loop+0x25) [0xb7688ea5]
> [6] func:/usr/lib/libopen-pal.so.0(opal_event_dispatch+0x1b) [0xb7688ecb]
> [7] func:mpiexec() [0x804ac0a]
> [8] func:mpiexec() [0x8049f4f]
> [9] func:/lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0xb74c14d3]
> [10] func:mpiexec() [0x8049ea1] 
> 
> My code is attached.
> 
> Regards.
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/01/26233.php
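
For readers without the attachments, here is a minimal C sketch of the
accept/connect/merge pattern described in the post above. It is illustrative
only (the program name, port hand-off, and ordering are assumptions, not the
poster's actual attached code):

#include <mpi.h>
#include <stdio.h>
#include <string.h>

/* Run with no arguments to act as a server; pass the printed port string
 * as the first argument to act as the client. */
int main(int argc, char *argv[])
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter, merged;
    int is_server = (argc == 1);

    MPI_Init(&argc, &argv);

    if (is_server) {
        MPI_Open_port(MPI_INFO_NULL, port);
        printf("server port: %s\n", port);  /* hand this string to the client */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    } else {
        strncpy(port, argv[1], MPI_MAX_PORT_NAME - 1);
        port[MPI_MAX_PORT_NAME - 1] = '\0';
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    }

    /* Both sides turn the inter-communicator into a single intra-communicator.
     * The "high" argument (0 on the accept side, 1 on the connect side)
     * orders the ranks in the merged communicator. */
    MPI_Intercomm_merge(inter, is_server ? 0 : 1, &merged);

    /* ... the poster's failure appears when this merged group then tries a
     * further connect/accept against a second, separately started server ... */

    MPI_Comm_free(&merged);
    MPI_Comm_disconnect(&inter);
    if (is_server)
        MPI_Close_port(port);
    MPI_Finalize();
    return 0;
}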



Re: [OMPI users] orted seg fault when using MPI_Comm_spawn on more than one host

2015-02-01 Thread Ralph Castain
Well, I can reproduce it - but I won’t have time to address it until I return 
later this week.

Whether or not procs get spawned onto a remote host depends on the number of 
local slots. You asked for 8 processes, so if there are more than 8 slots on 
the node, then it will launch them all on the local node. If you want to spread 
them across nodes, you need to use --map-by node

Also, FWIW: this job will “hang” as the spawned procs (“hostname”) never call 
MPI_Init. You can only use MPI_Comm_spawn to launch MPI processes, as the 
spawning parent will blissfully wait forever for the child processes to call 
MPI_Init.
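
As an illustration of that point, here is a minimal C sketch of the singleton
spawn pattern (this is not Evan's attached master.c; the "./worker" path and
process count are placeholders):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm children;
    int errcodes[4];

    /* Started without mpirun: MPI_Init makes this process a singleton, and
     * Open MPI forks an HNP orted behind the scenes (the "FORKING HNP" line
     * in the quoted output below). */
    MPI_Init(&argc, &argv);

    /* "./worker" stands for a program that itself calls MPI_Init; spawning a
     * non-MPI binary such as hostname would block here forever. */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &children, errcodes);

    printf("spawn completed; the children have called MPI_Init\n");

    MPI_Comm_disconnect(&children);
    MPI_Finalize();
    return 0;
}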


> On Jan 26, 2015, at 11:29 AM, Evan  wrote:
> 
> Hi,
> 
> I am using OpenMPI 1.8.4 on a Ubuntu 14.04 machine and 5 Ubuntu 12.04 
> machines.  I am using ssh to launch MPI jobs and I'm able to run simple 
> programs like 'mpirun -np 8 --host localhost,pachy1 hostname' and get the 
> expected output (pachy1 being an entry in my /etc/hosts file).
> 
> I started using MPI_Comm_spawn in my app with the intent of NOT calling 
> mpirun to launch the program that calls MPI_Comm_spawn (my attempt at using 
> the Singleton MPI_INIT pattern described in 10.5.2 of MPI 3.0 standard).  The 
> app needs to launch an MPI job of a given size from a given hostfile, where 
> the job needs to report some info back to the app, so it seemed 
> MPI_Comm_spawn was my best bet.  The app is only rarely going to be used this 
> way, thus mpirun not being used to launch the app that is the parent in the 
> MPI_Comm_spawn operation.  This pattern works fine if the only entries in the 
> hostfile are 'localhost'.  However if I add a host that isn't local I get a 
> segmentation fault from the orted process.
> 
> In any case, I distilled my example down as small as I could.  I've attached 
> the C code of the master and the hostfile I'm using. Here's the output:
> 
> evan@lasarti:~/devel/toy_progs/mpi_spawn$ ./master 
> ~/mpi/test_distributed.hostfile
> [lasarti:32020] [[21014,1],0] FORKING HNP: orted --hnp --set-sid --report-uri 
> 14 --singleton-died-pipe 15 -mca state_novm_select 1 -mca ess_base_jobid 
> 1377173504
> [lasarti:32022] *** Process received signal ***
> [lasarti:32022] Signal: Segmentation fault (11)
> [lasarti:32022] Signal code: Address not mapped (1)
> [lasarti:32022] Failing at address: (nil)
> [lasarti:32022] [ 0] 
> /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x7f07af039340]
> [lasarti:32022] [ 1] 
> /opt/openmpi-1.8.4/lib/libopen-pal.so.6(opal_hwloc191_hwloc_get_obj_by_depth+0x32)[0x7f07aea227c2]
> [lasarti:32022] [ 2] 
> /opt/openmpi-1.8.4/lib/libopen-pal.so.6(opal_hwloc_base_get_nbobjs_by_type+0x90)[0x7f07ae9f5430]
> [lasarti:32022] [ 3] 
> /opt/openmpi-1.8.4/lib/openmpi/mca_rmaps_round_robin.so(orte_rmaps_rr_byobj+0x134)[0x7f07ab2fb154]
> [lasarti:32022] [ 4] 
> /opt/openmpi-1.8.4/lib/openmpi/mca_rmaps_round_robin.so(+0x12c6)[0x7f07ab2fa2c6]
> [lasarti:32022] [ 5] 
> /opt/openmpi-1.8.4/lib/libopen-rte.so.7(orte_rmaps_base_map_job+0x21a)[0x7f07af299f7a]
> [lasarti:32022] [ 6] 
> /opt/openmpi-1.8.4/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0x6e4)[0x7f07ae9e7034]
> [lasarti:32022] [ 7] 
> /opt/openmpi-1.8.4/lib/libopen-rte.so.7(orte_daemon+0xdff)[0x7f07af27a86f]
> [lasarti:32022] [ 8] orted(main+0x47)[0x400877]
> [lasarti:32022] [ 9] 
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f07aec84ec5]
> [lasarti:32022] [10] orted[0x4008cb]
> [lasarti:32022] *** End of error message ***
> 
> If I launch 'master.c' using mpirun, I don't get a segmentation fault, but it 
> doesn't seem to be launching the process on anything more than localhost, no 
> matter what hostfile I give it.
> 
> For what it's worth, I fully expected to have to debug some path issues with the 
> binary I wanted to launch via MPI_Comm_spawn once I ran this across machines, 
> but this error at first glance doesn't appear to have anything to do with 
> that.  I'm sure this is something silly I'm doing wrong, but I don't really 
> know how to debug this further given this error.
> 
> Evan
> 
> P.S. Only including zipped config.log since the "ompi_info -v ompi full 
> --parsable" command I got from http://www.open-mpi.org/community/help/ 
> doesn't seem to work anymore.
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/01/26235.php



Re: [OMPI users] slurm openmpi 1.8.3 core bindings

2015-02-01 Thread Ralph Castain
Yeah, I don’t think that the slurm bindings will work for you. Problem is that 
the slurm directive gets applied to the launch of our daemon, not the 
application procs. So what you’ve done is bind our daemon to 3 cpus. This has 
nothing to do with the OMPI-Slurm integration - you told slurm to bind any  
process it launches to 3 cpus, and the only “processes” slurm launches are our 
daemons.

The only way to get what you want is to have slurm make the allocation without 
specifying cpus-per-task, and then have mpirun do the pe=N.


> On Jan 30, 2015, at 8:20 AM, Michael Di Domenico  
> wrote:
> 
> I'm trying to get slurm and openmpi to cooperate when running
> multi-threaded jobs.  i'm sure i'm doing something wrong, but i can't figure
> out what
> 
> my node configuration is
> 
> 2 nodes
> 2 sockets
> 6 cores per socket
> 
> i want to run
> 
> sbatch -N2 -n 8 --ntasks-per-node=4 --cpus-per-task=3 -w node1,node2
> program.sbatch
> 
> inside the program.sbatch i'm calling openmpi
> 
> mpirun -n $SLURM_NTASKS --report-bindings program
> 
> when the bindings report comes out i get
> 
> node1 rank 0 socket 0 core 0
> node1 rank 1 socket 1 core 6
> node1 rank 2 socket 0 core 1
> node1 rank 3 socket 1 core 7
> node2 rank 4 socket 0 core 0
> node2 rank 5 socket 1 core 6
> node2 rank 6 socket 0 core 1
> node2 rank 7 socket 1 core 7
> 
> which is semi-fine, but when the job runs the resulting threads from
> the program are locked (according to top) to those eight cores rather
> than spreading themselves over the 24 cores available
> 
> i tried a few incantations of the map-by, bind-to, etc, but openmpi
> basically complained about everything i tried for one reason or
> another
> 
> my understanding is that slurm should be passing the requested config to
> openmpi (or openmpi is pulling from the environment somehow) and it
> should magically work
> 
> if i skip slurm and run
> 
> mpirun -n 8 --map-by node:pe=3 -bind-to core -host node1,node2
> --report-bindings program
> 
> node1 rank 0 socket 0 core 0
> node2 rank 1 socket 0 core 0
> node1 rank 2 socket 0 core 3
> node2 rank 3 socket 0 core 3
> node1 rank 4 socket 1 core 6
> node2 rank 5 socket 1 core 6
> node1 rank 6 socket 1 core 9
> node2 rank 7 socket 1 core 9
> 
> i do get the behavior i want (though i would prefer a -npernode switch
> in there, but openmpi complains).  the bindings look better and the
> threads are not locked to the particular cores
> 
> therefore i'm pretty sure this is a problem between openmpi and slurm
> and not necessarily with either individually
> 
> i did compile openmpi with the slurm support switch and we're using
> the cgroups taskplugin within slurm
> 
> i guess ancillary to this, is there a way to turn off core
> binding/placement routines and control the placement manually?
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/01/26245.php



Re: [OMPI users] using multiple IB connections between hosts

2015-02-01 Thread Gilles Gouaillardet
Dave,

the QDR Infiniband uses the openib btl (by default :
btl_openib_exclusivity=1024)
i assume the RoCE 10Gbps card is using the tcp btl (by default :
btl_tcp_exclusivity=100)

that means that by default, when both openib and tcp btl could be used,
the tcp btl is discarded.

could you give it a try by setting the same exclusivity value on both btls?
e.g.
OMPI_MCA_btl_tcp_exclusivity=1024 mpirun ...

assuming this is enough to get traffic on both interfaces, you might
want *not* to use the eth0 interface
(e.g. OMPI_MCA_btl_tcp_if_exclude=eth0 ...)

you might also have to tweak the bandwidth parameters (i assume the QDR
interface should get 4 times more
traffic than the 10GbE interface)
by default :
btl_openib_bandwidth=4
btl_tcp_bandwidth=100
/* value is in Mbps, so the openib value should be 40960 (!), and in
your case, tcp bandwidth should be 10240 */
you might also want to try btl_*_bandwidth=0 (auto-detect value at run time)

i hope this helps,

Cheers,

Gilles
On 2015/01/29 9:45, Dave Turner wrote:
>  I ran some aggregate bandwidth tests between 2 hosts connected by
> both QDR InfiniBand and RoCE enabled 10 Gbps Mellanox cards.  The tests
> measured the aggregate performance for 16 cores on one host communicating
> with 16 on the second host.  I saw the same performance as with the QDR
> InfiniBand alone, so it appears that the addition of the 10 Gbps RoCE cards
> is
> not helping.
>
>  Should OpenMPI be using both in this case by default, or is there
> something
> I need to configure to allow for this?  I suppose this is the same question
> as
> how to make use of 2 identical IB connections on each node, or is the system
> simply ignoring the 10 Gbps cards because they are the slower option?
>
>  Any clarification on this would be helpful.  The only posts I've found
> are very
> old and discuss mostly channel bonding of 1 Gbps cards.
>
>  Dave Turner
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/01/26243.php



Re: [OMPI users] independent startup of orted and orterun

2015-02-01 Thread Ralph Castain
I have pushed the changes to the OMPI master. It took a little bit more than I 
had hoped due to the changes to the ORTE infrastructure, but hopefully this 
will meet your needs. It consists of two new tools:

(a) orte-dvm - starts the virtual machine by launching a daemon on every node 
of the allocation, as constrained by -host and/or -hostfile. Check the options 
for outputting the URI as you’ll need that info for the other tool. The DVM 
remains “up” until you issue the orte-submit -terminate command, or hit the 
orte-dvm process with a sigterm.

(b) orte-submit - takes the place of mpirun. Basically just packages your app 
and arguments and sends it to orte-dvm for execution. Requires the URI of 
orte-dvm. The tool exits once the job has completed execution, though you can 
run multiple jobs in parallel by backgrounding orte-submit or issuing commands 
from separate shells.

I’ve added man pages for both tools, though they may not be complete. Also, I 
don’t have all the mapping/ranking/binding options supported just yet as I 
first wanted to see if this meets your basic needs before worrying about the 
detail.

Let me know what you think
Ralph


> On Jan 21, 2015, at 4:07 PM, Mark Santcroos  
> wrote:
> 
> Hi Ralph,
> 
> All makes sense! Thanks a lot!
> 
> Looking forward to your modifications.
> Please don't hesitate to throw things with rough edges at me!
> 
> Cheers,
> 
> Mark
> 
>> On 21 Jan 2015, at 23:21 , Ralph Castain  wrote:
>> 
>> Let me address your questions up here so you don’t have to scan thru the 
>> entire note.
>> 
>> PMIx rationale: PMI has been around for a long time, primarily used inside 
>> the MPI library implementations to perform wireup. It provided a link from 
>> the MPI library to the local resource manager. However, as we move towards 
>> exascale, two things became apparent:
>> 
>> 1. the current PMI implementations don’t scale adequately to get there. The 
>> API created too many communications and assumed everything was a blocking 
>> operation, thus preventing asynchronous progress
>> 
>> 2. there were increasing requests for application-level interactions to the 
>> resource manager. People want ways to spawn jobs (and not just from within 
>> MPI), request pre-location of data, control power, etc. Rather than having 
>> every RM write its own interface (and thus make everyone’s code 
>> non-portable), we at Intel decided to extend the existing PMI definitions to 
>> support those functions. Thus, an application developer can directly access 
>> PMIx functions to perform all those operations.
>> 
>> PMIx v1.0 is about to be released - it’ll be backward compatible with PMI-1 
>> and PMI-2, plus add non-blocking operations and significantly reduce the 
>> number of communications. PMIx 2.0 is slated for this summer and will 
>> include the advanced controls capabilities.
>> 
>> ORCM is being developed because we needed a BSD-licensed, fully featured 
>> resource manager. This will allow us to integrate the RM even more tightly 
>> to the file system, networking, and other subsystems, thus achieving higher 
>> launch performance and providing desired features such as QoS management. 
>> PMIx is a part of that plan, but as you say, they each play their separate 
>> roles in the overall stack.
>> 
>> 
>> Persistent ORTE: there is a learning curve on ORTE, I fear. We do have some 
>> videos on the web site that can help get you started, and I’ve given a 
>> number of “classes" at Intel now for that purpose. I still have it on my 
>> “to-do” list that I summarize those classes and post them on the web site.
>> 
>> For now, let me summarize how things work. At startup, mpirun reads the 
>> allocation (usually from the environment, but it depends on the host RM) and 
>> launches a daemon on each allocated node. Each daemon reads its local 
>> hardware environment and “phones home” to let mpirun know it is alive. Once 
>> all daemons have reported, mpirun maps the processes to the nodes and sends 
>> that map to all the daemons in a scalable broadcast pattern.
>> 
>> Upon receipt of the launch message, each daemon parses it to identify which 
>> procs it needs to locally spawn. Once spawned, each proc connects back to 
>> its local daemon via a Unix domain socket for wireup support. As procs 
>> complete, the daemon maintains bookkeeping and reports back to mpirun once 
>> all procs are done. When all procs are reported complete (or one reports as 
>> abnormally terminated), mpirun sends a “die” message to every daemon so it 
>> will cleanly terminate.
>> 
>> What I will do is simply tell mpirun to not do that last step, but instead 
>> to wait to receive a “terminate” cmd before ending the daemons. This will 
>> allow you to reuse the existing DVM, making each independent job start a 
>> great deal faster. You’ll need to either manually terminate the DVM, or the 
>> RM will do so when the allocation expires.
>> 
>> HTH
>> Ralph
>> 
>> 
>>> On Jan 21, 2015, at 12:52 PM, Mar

Re: [OMPI users] vector type

2015-02-01 Thread Nick Papior Andersen
Because the compiler does not know that you want to send the entire
sub-matrix, passing non-contiguous array sections to a function is, at best,
dangerous; do not do that unless you know the function can handle it.
Do AA(1,1,2) and then it works (in principle you then pass the starting
memory location and MPI assumes it to be contiguous).
Or do AA(:,:,2:3), which ensures that the array is passed by reference and
hence the memory will also be contiguous.
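
As a point of comparison, here is a minimal C sketch of the same kind of
transfer (illustrative only, not Diego's attached code; it assumes the
Fortran AA(4,5,3) layout, i.e. float A[3][5][4] in C, where each of the three
4x5 "elements" is a contiguous block of 20 floats):

#include <mpi.h>

int main(int argc, char *argv[])
{
    float A[3][5][4];      /* A[k][j][i] corresponds to Fortran AA(i+1,j+1,k+1) */
    MPI_Datatype slab;     /* datatype describing one 4x5 block */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* 5 columns of 4 floats, with a stride of 4 floats between column starts;
     * since the block is contiguous this is equivalent to
     * MPI_Type_contiguous(20, MPI_FLOAT, &slab); a vector type only pays off
     * when the block is actually strided. */
    MPI_Type_vector(5, 4, 4, MPI_FLOAT, &slab);
    MPI_Type_commit(&slab);

    if (rank == 0) {
        /* analogue of "CALL MPI_SEND(AA(1,1,2), 2, rowtype, 1, 300, ...)":
         * pass the address of the first element of slab 2 and let the
         * datatype and count describe the rest of the contiguous memory */
        MPI_Send(&A[1][0][0], 2, slab, 1, 300, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&A[1][0][0], 2, slab, 0, 300, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Type_free(&slab);
    MPI_Finalize();
    return 0;
}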




2015-01-31 22:16 GMT+00:00 Diego Avesani :

> Dear all,
> here is how I create a 2D vector type to send 3D array elements:
>
> (in the attachment)
>
> The vectors are:
> real*4  AA(4,5,3), BB(4,5,3)
> In my idea, both AA and BB have three elements (the last dimension) and each
> element has (4x5) features.
>
> 1) What do you think?
>
> 2) Why can I not send AA(1,1,2:3) as
>  call MPI_SEND(AA(1,1,2:3), 2, rowtype, 1, 300, MPI_COMM_WORLD, ierr)
>
> Thanks a lot
>
> Diego
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/01/26246.php
>



-- 
Kind regards Nick


Re: [OMPI users] Segfault in mpi-java

2015-02-01 Thread Oscar Vega-Gisbert

Hi,

I created an issue with a simplified example:

https://github.com/open-mpi/ompi/issues/369

Regards,
Oscar


El 25/01/15 a las 19:36, Oscar Vega-Gisbert escribió:

Hi,

I can also reproduce this behaviour, but I think this crash is not related 
to the garbage collector. Java is much better than you think.


Maybe MPI corrupts the Java runtime heap.

Regards,
Oscar

El 22/01/15 a las 08:07, Gilles Gouaillardet escribió:

Alexander,

i was able to reproduce this behaviour.

basically, bad things happen when the garbage collector is invoked ...
i was even able to reproduce some crashes (but those happen at random
stages) very early in the code
by manually inserting calls to the garbage collector (e.g. System.gc();)

Cheers,

Gilles

On 2015/01/19 9:03, Alexander Daryin wrote:

Hi

I am using the Java MPI bindings and periodically get fatal errors. This 
is illustrated by the following model Java program.


import mpi.MPI;
import mpi.MPIException;
import mpi.Prequest;
import mpi.Request;
import mpi.Status;

import java.nio.ByteBuffer;
import java.util.Random;

public class TestJavaMPI {

    private static final int NREQ = 16;
    private static final int BUFFSIZE = 0x2000;
    private static final int NSTEP = 10;

    public static void main(String... args) throws MPIException {
        MPI.Init(args);
        Random random = new Random();
        Prequest[] receiveRequests = new Prequest[NREQ];
        Request[] sendRequests = new Request[NREQ];
        ByteBuffer[] receiveBuffers = new ByteBuffer[NREQ];
        ByteBuffer[] sendBuffers = new ByteBuffer[NREQ];
        for (int i = 0; i < NREQ; i++) {
            receiveBuffers[i] = MPI.newByteBuffer(BUFFSIZE);
            sendBuffers[i] = MPI.newByteBuffer(BUFFSIZE);
            receiveRequests[i] = MPI.COMM_WORLD.recvInit(receiveBuffers[i],
                    BUFFSIZE, MPI.BYTE, MPI.ANY_SOURCE, MPI.ANY_TAG);
            receiveRequests[i].start();
            sendRequests[i] = MPI.COMM_WORLD.iSend(sendBuffers[i], 0,
                    MPI.BYTE, MPI.PROC_NULL, 0);
        }
        for (int step = 0; step < NSTEP; step++) {
            if (step % 128 == 0) System.out.println(step);
            int index;
            do {
                Status status = Request.testAnyStatus(receiveRequests);
                if (status != null)
                    receiveRequests[status.getIndex()].start();
                index = Request.testAny(sendRequests);
            } while (index == MPI.UNDEFINED);
            sendRequests[index].free();
            sendRequests[index] = MPI.COMM_WORLD.iSend(sendBuffers[index],
                    BUFFSIZE, MPI.BYTE,
                    random.nextInt(MPI.COMM_WORLD.getSize()), 0);
        }
        MPI.Finalize();
    }
}

On Linux, this produces a segfault after about a million steps. On 
OS X, instead of a segfault it prints the following error message:


java(64053,0x127e4d000) malloc: *** error for object 0x7f80eb828808: 
incorrect checksum for freed object - object was probably modified 
after being freed.

*** set a breakpoint in malloc_error_break to debug
[mbp:64053] *** Process received signal ***
[mbp:64053] Signal: Abort trap: 6 (6)
[mbp:64053] Signal code:  (0)
[mbp:64053] [ 0] 0   libsystem_platform.dylib 0x7fff86b5ff1a 
_sigtramp + 26

[mbp:64053] [ 1] 0   ??? 0x 0x0 + 0
[mbp:64053] [ 2] 0   libsystem_c.dylib 0x7fff80c7bb73 abort + 129
[mbp:64053] [ 3] 0   libsystem_malloc.dylib 0x7fff8c26ce06 
szone_error + 625
[mbp:64053] [ 4] 0   libsystem_malloc.dylib 0x7fff8c2645c8 
small_free_list_remove_ptr + 154
[mbp:64053] [ 5] 0   libsystem_malloc.dylib 0x7fff8c2632bf 
szone_free_definite_size + 1856
[mbp:64053] [ 6] 0   libjvm.dylib 0x00010e257d89 _ZN2os4freeEPvt 
+ 63
[mbp:64053] [ 7] 0   libjvm.dylib 0x00010dea2b0a 
_ZN9ChunkPool12free_all_butEm + 136
[mbp:64053] [ 8] 0   libjvm.dylib 0x00010e30ab33 
_ZN12PeriodicTask14real_time_tickEi + 77
[mbp:64053] [ 9] 0   libjvm.dylib 0x00010e3372a3 
_ZN13WatcherThread3runEv + 267
[mbp:64053] [10] 0   libjvm.dylib 0x00010e25d87e 
_ZL10java_startP6Thread + 246
[mbp:64053] [11] 0   libsystem_pthread.dylib 0x7fff8f1402fc 
_pthread_body + 131
[mbp:64053] [12] 0   libsystem_pthread.dylib 0x7fff8f140279 
_pthread_body + 0
[mbp:64053] [13] 0   libsystem_pthread.dylib 0x7fff8f13e4b1 
thread_start + 13

[mbp:64053] *** End of error message ***

OpenMPI version is 1.8.4. Java version is 1.8.0_25-b17.

Best regards,
Alexander Daryin
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/01/26215.php

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/01/26230.php