Re: [OMPI devel] Cuda build break

2017-10-04 Thread Sylvain Jeaugey
See my last comment on #4257 : https://github.com/open-mpi/ompi/pull/4257#issuecomment-332900393 We should completely disable CUDA in hwloc. It is breaking the build, but more importantly, it creates an extra dependency on the CUDA runtime that Open MPI doesn't have, even when compiled with --

Re: [OMPI devel] CUDA kernels in OpenMPI

2017-01-27 Thread Sylvain Jeaugey
Hi Chris, First, you will need to have some configure stuff to detect nvcc and use it inside your Makefile. UTK may have some examples to show here. For the C/C++ API, you need to add 'extern "C"' statements around the interfaces you want to export in C so that you can use them inside Open MP
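
A minimal sketch (not from the thread) of the 'extern "C"' wrapping described above; the header name and the wrapper function are made up for illustration:
-- my_cuda_kernels.h (hypothetical) --
#ifndef MY_CUDA_KERNELS_H
#define MY_CUDA_KERNELS_H

#include <stddef.h>

/* When this header is seen by C++/CUDA code (compiled by nvcc), keep plain
 * C linkage so the symbols can be called from Open MPI's C sources. */
#ifdef __cplusplus
extern "C" {
#endif

/* Hypothetical wrapper that launches a reduction kernel on the device. */
int my_cuda_reduce(void *device_buf, size_t count, int op);

#ifdef __cplusplus
}
#endif

#endif /* MY_CUDA_KERNELS_H */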

Re: [OMPI devel] Process affinity detection

2016-04-26 Thread Sylvain Jeaugey
ote: On Apr 26, 2016, at 3:35 PM, Sylvain Jeaugey wrote: Indeed, I implied that affinity was set before MPI_Init (usually even before the process is launched). And yes, that would require a modex ... but I thought there was one already and maybe we could pack the affinity information inside t

Re: [OMPI devel] Process affinity detection

2016-04-26 Thread Sylvain Jeaugey
we could do it - but at the cost of forcing a modex. You can only detect your own affinity, so to get the relative placement, you have to do an exchange if we can’t pass it to you. Perhaps we could offer it as an option? On Apr 26, 2016, at 2:27 PM, Sylvain Jeaugey wrote: Within the BTL code

[OMPI devel] Process affinity detection

2016-04-26 Thread Sylvain Jeaugey
Within the BTL code (and surely elsewhere), we can use those convenient OPAL_PROC_ON_LOCAL_{NODE,SOCKET, ...} macros to figure out where another endpoint is located compared to us. The problem is that it only works when ORTE defines it. The NODE works almost always since ORTE is always doing i
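
For illustration, a sketch (not from the thread) of the kind of locality check these macros allow; the exact header providing OPAL_PROC_ON_LOCAL_* and the origin of the flags (e.g. a proc_flags field) vary across Open MPI versions, so treat the names as approximate:
/* 'flags' would come from the peer's proc structure, filled in by ORTE. */
static int pick_path(uint16_t flags)
{
    if (OPAL_PROC_ON_LOCAL_SOCKET(flags)) {
        return 0;   /* same socket: shared memory, no NUMA hop */
    } else if (OPAL_PROC_ON_LOCAL_NODE(flags)) {
        return 1;   /* same node, different socket: still shared memory */
    }
    return 2;       /* remote peer: go through a network BTL */
}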

Re: [OMPI devel] Crash in orte_iof_hnp_read_local_handler

2016-02-26 Thread Sylvain Jeaugey
. does it write to mpirun stdin ? On 02/26/2016 11:46 AM, Ralph Castain wrote: So the child processes are not calling orte_init or anything like that? I can check it - any chance you can give me a line number via a debug build? On Feb 26, 2016, at 11:42 AM, Sylvain Jeaugey wrote: I got

[OMPI devel] Crash in orte_iof_hnp_read_local_handler

2016-02-26 Thread Sylvain Jeaugey
I got this strange crash on master last night running nv/mpix_test: Signal: Segmentation fault (11) Signal code: Address not mapped (1) Failing at address: 0x50 [ 0] /lib64/libpthread.so.0(+0xf710)[0x7f9f19a80710] [ 1] /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-4/installs/eGXW/inst

Re: [OMPI devel] [OMPI users] configuring open mpi 10.1.2 with cuda on NVIDIA TK1

2016-01-22 Thread Sylvain Jeaugey
. Thanks, Sylvain On 01/22/2016 10:07 AM, Sylvain Jeaugey wrote: It looks like the errors are produced by the hwloc configure ; this one somehow can't find CUDA (I have to check if that's a problem btw). Anyway, later in the configure, the VT configure finds cuda correctly, so it seems s

Re: [OMPI devel] FOSS for scientists devroom at FOSDEM 2013

2012-11-20 Thread Sylvain Jeaugey
Hi Jeff, Do you mean "attend" or "do a talk"? Sylvain On 20/11/2012 16:16, Jeff Squyres wrote: Cool! Thanks for the invite. Do we have any European friends who would be able to attend this conference? On Nov 20, 2012, at 10:02 AM, Sylwester Arabas wrote: Dear Open MPI Team, A day-lo

Re: [OMPI devel] poor btl sm latency

2012-02-13 Thread sylvain . jeaugey
Hi Matthias, You might want to play with process binding to see if your problem is related to bad memory affinity. Try to launch pingpong on two CPUs of the same socket, then on different sockets (i.e. bind each process to a core, and try different configurations). Sylvain De :Matthias

Re: [OMPI devel] Bull Vendor ID disappeared from IB ini file

2011-09-07 Thread sylvain . jeaugey
Please note that configure requirements on components HAVE CHANGED. For example, a configure.params file is no longer required in each component directory. See Jeff's emails for an explanation. From: devel-boun...@op

Re: [OMPI devel] Bull Vendor ID disappeared from IB ini file

2011-09-07 Thread sylvain . jeaugey
e.params file is no longer required in each component directory. See Jeff's emails for an explanation. From: devel-boun...@open-mpi.org [devel-boun...@open-mpi.org] On Behalf Of Sylvain Jeaugey [sylvain.jeau...@bull.net]

[OMPI devel] Bull Vendor ID disappeared from IB ini file

2011-09-07 Thread Sylvain Jeaugey
Hi All, I just realized that Bull Vendor IDs for Infiniband cards disappeared from the trunk. Actually, they were removed shortly after we included them in last September. The original commit was : r23715 | derbeyn | 2010-09-03 16:13:19 +0200 (Fri, 03 Sep 2010) | 1 line Added Bull vendor id f

Re: [OMPI devel] "Open MPI"-based MPI library used by K computer

2011-06-29 Thread sylvain . jeaugey
Kawashima-san, Congratulations on your machine, this is a stunning achievement! > Kawashima wrote: > Also, we modified tuned COLL to implement interconnect-and-topology- > specific bcast/allgather/alltoall/allreduce algorithm. These algorithm > implementations also bypass PML/BML/BTL to elimi

Re: [OMPI devel] BTL preferred_protocol , large message

2011-03-10 Thread Sylvain Jeaugey
On Wed, 9 Mar 2011, George Bosilca wrote: One gets multiple non-overlapping BTL (in terms of peers), each with its own set of parameters and eventually accepted protocols. Mainly there will be one BTL per memory hierarchy. Pretty cool :-) I'll cleanup the code and send you a patch. We'd be

Re: [OMPI devel] BTL preferred_protocol , large message

2011-03-09 Thread Sylvain Jeaugey
Hi George, This certainly looks like our motivations are close. However, I don't see in the presentation how you implement it (maybe I misread it), especially how you manage to not modify the BTL interface. Do you have any code / SVN commit references for us to better understand what it's ab

Re: [OMPI devel] [RFC] Hierarchical Topology

2010-11-16 Thread Sylvain Jeaugey
in locality. Sylvain On Mon, Nov 15, 2010 at 9:00 AM, Sylvain Jeaugey wrote: I already mentionned it answering Terry's e-mail, but to be sure I'm clear : don't confuse node full topology with MPI job topology. It _is_ different. And every process does not get the whole top

Re: [OMPI devel] [RFC] Hierarchical Topology

2010-11-15 Thread Sylvain Jeaugey
code may not have direct relationship to hitopo the use of hwloc and standardization of what you call level 4-7 might help avoid some user confusions. --td On 11/15/2010 06:56 AM, Sylvain Jeaugey wrote: As a followup of Stuttgart's developer's meeting, here is an RFC for our

Re: [OMPI devel] [RFC] Hierarchical Topology

2010-11-15 Thread Sylvain Jeaugey
to inter- node. Sylvain On 11/15/2010 06:56 AM, Sylvain Jeaugey wrote: As a followup of Stuttgart's developer's meeting, here is an RFC for our topology detection framework. WHAT: Add a framework for hardware topology detection to be used by any other part of Open MPI to help optim

[OMPI devel] [RFC] Hierarchical Topology

2010-11-15 Thread Sylvain Jeaugey
As a followup of Stuttgart's developer's meeting, here is an RFC for our topology detection framework. WHAT: Add a framework for hardware topology detection to be used by any other part of Open MPI to help optimization. WHY: Collective operations or shared memory algorithms among others may

Re: [OMPI devel] Use of OPAL_PREFIX to relocate a lib

2010-11-04 Thread Sylvain Jeaugey
, 2010, at 6:01 AM, Sylvain Jeaugey wrote: On Tue, 26 Oct 2010, Jeff Squyres wrote: I don't think this is the right way to fix it. Sorry! :-( I don't think it is the right way to do it either :-) I say this because it worked somewhat by luck before, and now it's broken. If

Re: [OMPI devel] Use of OPAL_PREFIX to relocate a lib

2010-10-26 Thread Sylvain Jeaugey
On Tue, 26 Oct 2010, Jeff Squyres wrote: I don't think this is the right way to fix it. Sorry! :-( I don't think it is the right way to do it either :-) I say this because it worked somewhat by luck before, and now it's broken. If we put in another "it'll work because of a side effect of a

Re: [OMPI devel] Use of OPAL_PREFIX to relocate a lib

2010-10-26 Thread Sylvain Jeaugey
components (one to get the priorities, and then another to execute) and additional API functions in the various modules. On Oct 7, 2010, at 6:25 AM, Sylvain Jeaugey wrote: Hi list, Remember this old bug ? I think I finally found out what was going wrong. The opal "installdirs"

Re: [OMPI devel] New Romio for OpenMPI available in bitbucket

2010-10-07 Thread Sylvain Jeaugey
On Wed, 29 Sep 2010, Ashley Pittman wrote: On 17 Sep 2010, at 11:36, Pascal Deveze wrote: Hi all, In charge of ticket 1888 (see at https://svn.open-mpi.org/trac/ompi/ticket/1888) , I have put the resulting code in bitbucket at: http://bitbucket.org/devezep/new-romio-for-openmpi/ The work in

Re: [OMPI devel] Use of OPAL_PREFIX to relocate a lib

2010-10-07 Thread Sylvain Jeaugey
opened first regardless of its position in the static components array ; 3. Any other idea ? Sylvain On Fri, 19 Jun 2009, Sylvain Jeaugey wrote: On Thu, 18 Jun 2009, Jeff Squyres wrote: On Jun 18, 2009, at 11:25 AM, Sylvain Jeaugey wrote: My problem seems related to library generation throu

Re: [OMPI devel] Possible memory leak

2010-09-01 Thread Sylvain Jeaugey
Hi Ananda, I didn't try to run your program, but this seems logical to me. The problem with calling MPI_Bcast repeatedly is that you may have an infinite desynchronization between the sender and the receiver(s). MPI_Bcast is a unidirectional operation. It does not necessarily block until the r
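
A minimal sketch (not Ananda's program) of the pattern being discussed: in a tight MPI_Bcast loop the root can run far ahead of the receivers, and an occasional barrier bounds the drift:
-- bcast_drift.c (sketch) --
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, i;
    double *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(1024 * sizeof(double));

    for (i = 0; i < 100000; i++) {
        /* MPI_Bcast is one-way: the root may return long before the
         * receivers enter the call, so unacknowledged data piles up. */
        MPI_Bcast(buf, 1024, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* Re-synchronizing from time to time bounds the backlog. */
        if (i % 1000 == 0) {
            MPI_Barrier(MPI_COMM_WORLD);
        }
    }

    free(buf);
    MPI_Finalize();
    return 0;
}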

Re: [OMPI devel] delivering SIGUSR2 to an ompi process

2010-08-26 Thread Sylvain Jeaugey
Steve, This is indeed strange. The mechanism you describe works for me. Here is my simple test: -- mpi-sig.c -- #include "mpi.h" #include <stdio.h> #include <signal.h> void warn(int sig) { printf("Got signal %d\n", sig); } int main (int argc, char ** argv) {
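
The preview cuts the test short; a reconstruction along the same lines (the body of main() is a guess, not the original attachment) would be:
-- mpi-sig.c (reconstruction) --
#include "mpi.h"
#include <stdio.h>
#include <signal.h>
#include <unistd.h>

void warn(int sig)
{
    printf("Got signal %d\n", sig);
}

int main(int argc, char **argv)
{
    /* Install the handler, then leave time to send SIGUSR2 to the rank
     * from the outside (e.g. kill -USR2 <pid>). */
    signal(SIGUSR2, warn);
    MPI_Init(&argc, &argv);
    sleep(30);
    MPI_Finalize();
    return 0;
}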

Re: [OMPI devel] Committing to release branches

2010-07-26 Thread Sylvain Jeaugey
Thanks Jeff for this very useful explanation. I guess locking is not needed as long as the system is well understood by everyone (which was not the case for us, sorry). Sylvain On Thu, 22 Jul 2010, Ralph Castain wrote: On Jul 22, 2010, at 8:01 AM, Jeff Squyres wrote: On Jul 22, 2010, at 9

Re: [OMPI devel] v1.5: sigsegv in case of extremely low settings in the SRQs

2010-06-23 Thread Sylvain Jeaugey
On Wed, 23 Jun 2010, Jeff Squyres wrote: BTW, are you guys waiting for us to commit that, or do we ever give you guys SVN commit access? Nadia is off today. She should commit it tomorrow. Sylvain

Re: [OMPI devel] v1.5: sigsegv in case of extremely low settings in the SRQs

2010-06-23 Thread Sylvain Jeaugey
Hi Jeff, Why do we want to set this value so low? Well, just to see if it crashes :-) More seriously, we're working on lowering the memory usage of the openib BTL, which is achieved mostly by having only 1 send queue element (at very large scale, send queues prevail). This "extreme" conf

Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

2010-06-11 Thread Sylvain Jeaugey
On Fri, 11 Jun 2010, Jeff Squyres wrote: On Jun 11, 2010, at 5:43 AM, Paul H. Hargrove wrote: Interesting. Do you think this behavior of the linux kernel would change if the file was unlink()ed after attach ? After a little talk with kernel guys, it seems that unlinking wouldn't change anythi

Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

2010-06-10 Thread Sylvain Jeaugey
On Thu, 10 Jun 2010, Jeff Squyres wrote: Sam -- if the shmat stuff fails because the limits are too low, it'll (silently) fall back to the mmap module, right? From my experience, it completely disabled the sm component. Having a nice fallback would indeed be a very good thing. Sylvain

Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

2010-06-10 Thread Sylvain Jeaugey
On Thu, 10 Jun 2010, Paul H. Hargrove wrote: One should not ignore the option of POSIX shared memory: shm_open() and shm_unlink(). When present this mechanism usually does not suffer from the small (eg 32MB) limits of SysV, and uses a "filename" (in an abstract namespace) which can portably b
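
A self-contained sketch of the shm_open()/shm_unlink() pattern mentioned above; the segment name and size are made up for illustration (link with -lrt on older glibc):
-- posix_shm.c (sketch) --
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    const char *name = "/ompi_sm_example";   /* abstract namespace, no real path */
    size_t size = 64 * 1024 * 1024;          /* not limited by SysV SHMMAX */
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, size) != 0) { perror("ftruncate"); return 1; }

    void *seg = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (seg == MAP_FAILED) { perror("mmap"); return 1; }

    /* Unlinking right away means the segment disappears once the last
     * process detaches, even after an abnormal termination. */
    shm_unlink(name);
    close(fd);

    /* ... use 'seg' as the shared-memory backing ... */
    munmap(seg, size);
    return 0;
}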

Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

2010-06-10 Thread Sylvain Jeaugey
On Wed, 9 Jun 2010, Jeff Squyres wrote: On Jun 9, 2010, at 3:26 PM, Samuel K. Gutierrez wrote: System V shared memory cleanup is a concern only if a process dies in between shmat and shmctl IPC_RMID. Shared memory segment cleanup should happen automagically in most cases, including abnormal p

Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

2010-06-09 Thread Sylvain Jeaugey
As stated at the conf call, I did some performance testing on a 32-core node. So, here is a graph showing 500 timings of an allreduce operation (repeated 15,000 times for good timing) with sysv, mmap on /dev/shm and mmap on /tmp. What it shows: - sysv has the better performance; - having

Re: [OMPI devel] BTL add procs errors

2010-06-02 Thread Sylvain Jeaugey
On Wed, 2 Jun 2010, Jeff Squyres wrote: Don't you mean return NULL? This function is supposed to return a (struct ibv_cq *). Oops. My bad. Yes, it should return NULL. And it seems that if I make ibv_create_cq always return NULL, the scenario described by George works smoothly : returned OMPI

Re: [OMPI devel] BTL add procs errors

2010-06-02 Thread Sylvain Jeaugey
On Tue, 1 Jun 2010, Jeff Squyres wrote: On May 31, 2010, at 5:10 AM, Sylvain Jeaugey wrote: In my case, the error happens in : mca_btl_openib_add_procs() mca_btl_openib_size_queues() adjust_cq() ibv_create_cq_compat() ibv_create_cq() Can you nail this down

Re: [OMPI devel] BTL add procs errors

2010-06-02 Thread Sylvain Jeaugey
Couldn't explain it better. Thanks Jeff for the summary ! On Tue, 1 Jun 2010, Jeff Squyres wrote: On May 31, 2010, at 10:27 AM, Ralph Castain wrote: Just curious - your proposed fix sounds exactly like what was done in the OPAL SOS work. Are you therefore proposing to use SOS to provide a mo

Re: [OMPI devel] BTL add procs errors

2010-05-31 Thread Sylvain Jeaugey
L init / query sequence is it returning an error for you, Sylvain? Is it just a matter of tidying something up properly before returning the error? On May 28, 2010, at 2:21 PM, George Bosilca wrote: On May 28, 2010, at 10:03 , Sylvain Jeaugey wrote: On Fri, 28 May 2010, Jeff Squyres wr

Re: [OMPI devel] BTL add procs errors

2010-05-28 Thread Sylvain Jeaugey
On Fri, 28 May 2010, Jeff Squyres wrote: On May 28, 2010, at 9:32 AM, Jeff Squyres wrote: Understood, and I agreed that the bug should be fixed. Patches would be welcome. :-) I sent a patch on the bml layer in my first e-mail. We will apply it on our tree, but as always we're trying to send

Re: [OMPI devel] BTL add procs errors

2010-05-28 Thread Sylvain Jeaugey
On Fri, 28 May 2010, Jeff Squyres wrote: Herein lies the quandary: we don't/can't know the user or sysadmin intent. They may not care if the IB is borked -- they might just want the job to fall over to TCP and continue. But they may care a lot if IB is borked -- they might want the job to ab

Re: [OMPI devel] BTL add procs errors

2010-05-28 Thread Sylvain Jeaugey
On Thu, 27 May 2010, Jeff Squyres wrote: On May 27, 2010, at 10:32 AM, Sylvain Jeaugey wrote: That's pretty much my first proposition : abort when an error arises, because if we don't, we'll crash soon afterwards. That's my original concern and this should really be fixe

Re: [OMPI devel] BTL add procs errors

2010-05-27 Thread Sylvain Jeaugey
rocs does return an error, the job should abort. Brian -- Brian W. Barrett Scalable System Software Group Sandia National Laboratories From: devel-boun...@open-mpi.org [devel-boun...@open-mpi.org] On Behalf Of Sylvain Jeaugey [sylvain.jeau...@bull.net]

Re: [OMPI devel] BTL add procs errors

2010-05-27 Thread Sylvain Jeaugey
rdma endpoint arrays will not be built. george. On May 25, 2010, at 05:10, Sylvain Jeaugey wrote: Hi, I'm currently trying to have Open MPI exit more gracefully when a BTL returns an error during the "add procs" phase. The current bml/r2 code silently ignores btl->add_procs

[OMPI devel] BTL add procs errors

2010-05-25 Thread Sylvain Jeaugey
Hi, I'm currently trying to have Open MPI exit more gracefully when a BTL returns an error during the "add procs" phase. The current bml/r2 code silently ignores btl->add_procs() error codes with the following comment : ompi/mca/bml/r2/bml_r2.c:208 /* This BTL has troubles adding
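
For illustration, a fragment-style sketch (not the actual patch, and not compilable on its own) of the kind of check being proposed in bml_r2; structure and field names are approximate:
/* Proposed behavior: propagate the BTL's error instead of dropping it. */
rc = btl->btl_add_procs(btl, n_new_procs, new_procs, btl_endpoints, reachable);
if (OMPI_SUCCESS != rc) {
    opal_output(0, "BTL %s failed in add_procs (rc=%d)",
                btl->btl_component->btl_version.mca_component_name, rc);
    free(btl_endpoints);
    return rc;   /* let the caller, and ultimately MPI_Init, fail cleanly */
}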

Re: [OMPI devel] Infiniband memory usage with XRC

2010-05-19 Thread Sylvain Jeaugey
On Mon, 17 May 2010, Pavel Shamis (Pasha) wrote: Sylvain Jeaugey wrote: The XRC protocol seems to create shared receive queues, which is a good thing. However, comparing memory used by an "X" queue versus and "S" queue, we can see a large difference. Digging a bit into

Re: [OMPI devel] Very poor performance with btl sm on twin nehalem servers with Mellanox ConnectX installed

2010-05-18 Thread Sylvain Jeaugey
s wrote: How's this? http://www.open-mpi.org/faq/?category=sm#poor-sm-btl-performance What's the advantage of /dev/shm? (I don't know anything about /dev/shm) On May 17, 2010, at 4:08 AM, Sylvain Jeaugey wrote: I agree with Paul on the fact that a FAQ update would be grea

Re: [OMPI devel] Infiniband memory usage with XRC

2010-05-17 Thread Sylvain Jeaugey
Thanks Pasha for these details. On Mon, 17 May 2010, Pavel Shamis (Pasha) wrote: blocking is the receive queues, because they are created during MPI_Init, so in a way, they are the "basic fare" of MPI. BTW SRQ resources are also allocated on demand. We start with very small SRQ and it is incre

[OMPI devel] Infiniband memory usage with XRC

2010-05-17 Thread Sylvain Jeaugey
Hi list, We did some testing on memory taken by Infiniband queues in Open MPI using the XRC protocol, which is supposed to reduce the needed memory for Infiniband connections. When using XRC queues, Open MPI is indeed creating only one XRC queue per node (instead of per-host). The problem is

Re: [OMPI devel] Very poor performance with btl sm on twin nehalem servers with Mellanox ConnectX installed

2010-05-17 Thread Sylvain Jeaugey
I agree with Paul on the fact that a FAQ update would be great on this subject. /dev/shm seems a good place to put the temporary files (when available, of course). Putting files in /dev/shm also showed better performance on our systems, even with /tmp on a local disk. Sylvain On Sun, 16 May

Re: [OMPI devel] Thread safety levels

2010-05-10 Thread Sylvain Jeaugey
On Mon, 10 May 2010, N.M. Maclaren wrote: As explained by Sylvain, current Open MPI implementation always returns MPI_THREAD_SINGLE as provided thread level if neither --enable-mpi-threads nor --enable-progress-threads was specified at configure (v1.4). That is definitely the correct action.

[OMPI devel] RDMA with ob1 and openib

2010-04-27 Thread Sylvain Jeaugey
Hi list, I'm currently working on IB bandwidth improvements and maybe some of you can help me understand some things. I'm trying to align every IB RDMA operation to 64 bytes, because having it unaligned can hurt performance anywhere from slightly to very badly, depending on your architecture.
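
The alignment computation itself is easy to show; a small self-contained sketch (not from the thread):
-- align64.c (sketch) --
#include <stdint.h>
#include <stdio.h>

/* Round an address down or up to a 64-byte boundary so that an RDMA
 * transfer starts and ends on a cache-line-friendly offset. */
#define ALIGN64_DOWN(x) ((uintptr_t)(x) & ~(uintptr_t)63)
#define ALIGN64_UP(x)   (((uintptr_t)(x) + 63) & ~(uintptr_t)63)

int main(void)
{
    char buf[256];
    uintptr_t start = (uintptr_t)&buf[10];

    printf("raw start    : %p\n", (void *)start);
    printf("aligned down : %p\n", (void *)ALIGN64_DOWN(start));
    printf("aligned up   : %p\n", (void *)ALIGN64_UP(start));
    return 0;
}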

Re: [OMPI devel] RFC 1/1: improvements to the "notifier" framework and ORTE WDC

2010-03-30 Thread Sylvain Jeaugey
On Mon, 29 Mar 2010, Abhishek Kulkarni wrote: #define ORTE_NOTIFIER_DEFINE_EVENT(eventstr, associated_text) { static int event = -1; if (OPAL_UNLIKELY(event == -1)) { event = opal_sos_create_new_event(eventstr, associated_text); } .. } This

Re: [OMPI devel] RFC 1/1: improvements to the "notifier" framework and ORTE WDC

2010-03-29 Thread Sylvain Jeaugey
Hi Ralph, For now, I think that yes, this is a unique identifier. However, in my opinion, this could be improved in the future replacing it by a unique string. Something like : #define ORTE_NOTIFIER_DEFINE_EVENT(eventstr, associated_text) { static int event = -1; if (OPAL_UNL

Re: [OMPI devel] RFC: s/ENABLE_MPI_THREADS/ENABLE_THREAD_SAFETY/g

2010-02-09 Thread Sylvain Jeaugey
While we're at it, why not call the option giving MPI_THREAD_MULTIPLE support --enable-thread-multiple ? About ORTE and OPAL, if you have --enable-thread-multiple=yes, it may force the usage of --enable-thread-safety to configure OPAL and/or ORTE. I know there are other projects using ORTE an

[OMPI devel] VT config.h.in

2010-01-19 Thread Sylvain Jeaugey
Hi list, The file ompi/contrib/vt/vt/config.h.in seems to have been added to the repository, but it is also created by autogen.sh. Is that normal? The result is that when I commit after autogen, my patches are polluted with diffs in this file. Sylvain

Re: [OMPI devel] MALLOC_MMAP_MAX (and MALLOC_MMAP_THRESHOLD)

2010-01-18 Thread Sylvain Jeaugey
On Jan 17, 2010, at 11:31 AM, Ashley Pittman wrote: Tuning the libc malloc implementation using the options they provide to do is is valid and provides real benefit to a lot of applications. For the record we used to disable mmap based allocations by default on Quadrics systems and I can't thi

Re: [OMPI devel] MALLOC_MMAP_MAX (and MALLOC_MMAP_THRESHOLD)

2010-01-08 Thread Sylvain Jeaugey
On Thu, 7 Jan 2010, Eugene Loh wrote: Could someone tell me how these settings are used in OMPI or give any guidance on how they should or should not be used? This is a very good question :-) As is this whole e-mail, though it's hard (in my opinion) to give it a Good (TM) answer. This means that

[OMPI devel] Thread safety levels

2010-01-05 Thread Sylvain Jeaugey
Hi list, I'm currently playing with thread levels in Open MPI and I'm quite surprised by the current code. First, the C interface : at ompi/mpi/c/init_thread.c:56 we have : #if OPAL_ENABLE_MPI_THREADS *provided = MPI_THREAD_MULTIPLE; #else *provided = MPI_THREAD_SINGLE; #endif prior to
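
For reference, a minimal sketch of how an application asks for and checks the provided thread level, independent of the internal code quoted above:
-- thread_level.c (sketch) --
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    /* Ask for the highest level; the library reports what it really grants. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        printf("Requested MPI_THREAD_MULTIPLE, got level %d\n", provided);
    }
    MPI_Finalize();
    return 0;
}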

Re: [OMPI devel] Crash when using MPI_REAL8

2009-12-08 Thread Sylvain Jeaugey
Thanks Rainer for the patch. I confirm it solves my testcase as well as the real application that triggered the bug. Sylvain On Mon, 7 Dec 2009, Rainer Keller wrote: Hello Sylvain, On Friday 04 December 2009 02:27:22 pm Sylvain Jeaugey wrote: There is definitely something wrong in types

Re: [OMPI devel] Crash when using MPI_REAL8

2009-12-04 Thread Sylvain Jeaugey
l_datatype.h Fri Dec 04 19:59:26 2009 +0100 @@ -56,7 +56,7 @@ * * XXX TODO Adapt to whatever the OMPI-layer needs */ -#define OPAL_DATATYPE_MAX_SUPPORTED 46 +#define OPAL_DATATYPE_MAX_SUPPORTED 56 /* flags for the datatypes. */ On Fri, 4 Dec 2009, Sylvain Jeaugey wrote: For t

Re: [OMPI devel] Crash when using MPI_REAL8

2009-12-04 Thread Sylvain Jeaugey
For the record, and to try to explain why all MTT tests may have missed this "bug", configuring without --enable-debug makes the bug disappear. Still trying to figure out why. Sylvain On Thu, 3 Dec 2009, Sylvain Jeaugey wrote: Hi list, I hope this time I won't be the onl

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-12-03 Thread Sylvain Jeaugey
conds [rhc@odin mpi]$ Sorry I don't have more time to continue pursuing this. I have no idea what is going on with your system(s), but it clearly is something peculiar to what you are doing or the system(s) you are running on. Ralph On Dec 2, 2009, at 1:56 AM, Sylvain Jeaugey wrote: Ok,

[OMPI devel] Crash when using MPI_REAL8

2009-12-03 Thread Sylvain Jeaugey
Hi list, I hope this time I won't be the only one to suffer from this bug :) It is very simple indeed: just perform an allreduce with MPI_REAL8 (Fortran) and you should get a crash in ompi/op/op.h:411. Tested with trunk and v1.5, working fine on v1.3. From what I understand, in the trunk, MPI_REA

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-12-02 Thread Sylvain Jeaugey
t it). But since this is a race condition, your mileage may vary on a different cluster. With the patch however, I'm in every time. I'll continue to try different configurations (e.g. without slurm ...) to see if I can reproduce it on much common configurations. Sylvain On Mon, 30 Nov 2

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-30 Thread Sylvain Jeaugey
en FC11 and the compiler. On Nov 30, 2009, at 8:48 AM, Sylvain Jeaugey wrote: Hi Ralph, I'm also puzzled :-) Here is what I did today : * download the latest nightly build (openmpi-1.7a1r22241) * untar it * patch it with my "ORTE_RELAY_DELAY" patch * build it directly on t

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-30 Thread Sylvain Jeaugey
ain wrote: On Nov 27, 2009, at 8:23 AM, Sylvain Jeaugey wrote: Hi Ralph, I tried with the trunk and it makes no difference for me. Strange Looking at potential differences, I found out something strange. The bug may have something to do with the "routed" framework. I can repro

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-27 Thread Sylvain Jeaugey
hreads?? That is the only way I can recreate this behavior. I plan to modify the relay/message processing method anyway to clean it up. But there doesn't appear to be anything wrong with the current code. Ralph On Nov 20, 2009, at 6:55 AM, Sylvain Jeaugey wrote: Hi Ralph, Thanks for your e

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-20 Thread Sylvain Jeaugey
l-crcp2,crcp enable_io_romio=no On Nov 19, 2009, at 8:08 AM, Ralph Castain wrote: On Nov 19, 2009, at 7:52 AM, Sylvain Jeaugey wrote: Thank you Ralph for this precious help. I setup a quick-and-dirty patch basically postponing process_msg (hence daemon_collective) until the launch is done. I

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-19 Thread Sylvain Jeaugey
2. send the relay - the daemon collective can now proceed without a "wait" in it 3. now launch the local procs It would be a fairly simple reorganization of the code in the orte/mca/odls area. I can do it this weekend if you like, or you can do it - either way is fine, but if you

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-19 Thread Sylvain Jeaugey
n Nov 17, 2009, at 9:01 AM, Sylvain Jeaugey wrote: I don't think so, and I'm not doing it explicitely at least. How do I know ? Sylvain On Tue, 17 Nov 2009, Ralph Castain wrote: We routinely launch across thousands of nodes without a problem...I have never seen it stick in this fash

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-17 Thread Sylvain Jeaugey
ded by any chance? If so, that definitely won't work. On Nov 17, 2009, at 9:27 AM, Sylvain Jeaugey wrote: Hi all, We are currently experiencing problems at launch on the 1.5 branch on relatively large number of nodes (at least 80). Some processes are not spawned and orted processes are de

[OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-17 Thread Sylvain Jeaugey
Hi all, We are currently experiencing problems at launch on the 1.5 branch on relatively large number of nodes (at least 80). Some processes are not spawned and orted processes are deadlocked. When MPI processes are calling MPI_Init before send_relay is complete, the send_relay function and

Re: [OMPI devel] [OMPI users] cartofile

2009-10-13 Thread Sylvain Jeaugey
We worked a bit on it and yes, there is some work to do: * The syntax used to describe the various components is far from consistent from one usage to another ("SOCKET", "NODE", ...). We managed to make things work by reading the various out-of-date example files - but mainly the code. * Th

Re: [OMPI devel] Deadlock with comm_create since cid allocator change

2009-09-21 Thread Sylvain Jeaugey
You were faster to fix the bug than I was to send my bug report :-) So I confirm : this fixes the problem. Thanks ! Sylvain On Mon, 21 Sep 2009, Edgar Gabriel wrote: what version of OpenMPI did you use? Patch #21970 should have fixed this issue on the trunk... Thanks Edgar Sylvain Jeaugey

[OMPI devel] Deadlock with comm_create since cid allocator change

2009-09-21 Thread Sylvain Jeaugey
Hi list, We are currently experiencing deadlocks when using communicators other than MPI_COMM_WORLD. So we made a very simple reproducer (Comm_create then MPI_Barrier on the communicator - see end of e-mail). We can reproduce the deadlock only with openib and with at least 8 cores (no succes
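
A sketch along the lines of the described reproducer (not the exact code attached to the original mail); run with at least 2 ranks:
-- comm_create_barrier.c (sketch) --
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Group world_group;
    MPI_Comm newcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    /* Create a communicator other than MPI_COMM_WORLD (here with the same
     * membership) and synchronize on it. */
    MPI_Comm_create(MPI_COMM_WORLD, world_group, &newcomm);
    MPI_Barrier(newcomm);

    MPI_Comm_free(&newcomm);
    MPI_Group_free(&world_group);
    MPI_Finalize();
    return 0;
}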

Re: [OMPI devel] Deadlock on openib when using hindexed types

2009-09-04 Thread Sylvain Jeaugey
penib, but if I'm not mistaken (again !) tcp still hangs. Sylvain On Fri, 4 Sep 2009, Sylvain Jeaugey wrote: Hi Rolf, I was indeed running a more than 4 weeks old trunk, but after pulling the latest version (and checking the patch was in the code), it seems to make no difference. Howev

Re: [OMPI devel] RFC - "system-wide-only" MCA parameters

2009-09-04 Thread Sylvain Jeaugey
Understood. So, let's say that we're only implementing a hurdle to discourage users from doing things wrong. I guess the efficiency of this will reside in the message displayed to the user ("You are about to break the entire machine and you will be fined if you try to circumvent this in any way

Re: [OMPI devel] RFC - "system-wide-only" MCA parameters

2009-09-04 Thread Sylvain Jeaugey
Looks like users at LANL are not very nice ;) Indeed, this is no hard security. Only a way to prevent users from doing mistakes. We often give users special tuning for their application and when they see their application is going faster, they start messing with every parameter hoping that it

Re: [OMPI devel] Deadlock on openib when using hindexed types

2009-09-04 Thread Sylvain Jeaugey
/changeset/21833 If you are running the latest bits and still seeing the problem, then I guess it is something else. Rolf On 09/04/09 04:40, Sylvain Jeaugey wrote: Hi all, We're currently working with romio and we hit a problem when exchanging data with hindexed types with the openi

Re: [OMPI devel] RFC - "system-wide-only" MCA parameters

2009-09-04 Thread Sylvain Jeaugey
On Fri, 4 Sep 2009, Jeff Squyres wrote: I haven't looked at the code deeply, so forgive me if I'm parsing this wrong: is the code actually reading the file into one list and then moving the values to another list? If so, that seems a little hackish. Can't it just read directly to the target

Re: [OMPI devel] RFC - "system-wide-only" MCA parameters

2009-09-04 Thread Sylvain Jeaugey
On Fri, 4 Sep 2009, Jeff Squyres wrote: -- *** Checking versions checking for SVN version... done checking Open MPI version... 1.4a1hgf11244ed72b5 up to changeset c4b117c5439b checking Open MPI release date... Unreleased developer copy checking Open MPI Subversion repository version... hgf11

[OMPI devel] Deadlock on openib when using hindexed types

2009-09-04 Thread Sylvain Jeaugey
Hi all, We're currently working with romio and we hit a problem when exchanging data with hindexed types with the openib btl. The attached reproducer (adapted from romio) is working fine on tcp, blocks on openib when using 1 port but works if we use 2 ports (!). I tested it against the trunk
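
For illustration, a minimal sketch (not the attached reproducer) of building and exchanging an hindexed type; run with exactly 2 ranks:
-- hindexed_example.c (sketch) --
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, blocklens[2] = { 4, 4 };
    MPI_Aint displs[2] = { 0, 1024 };
    MPI_Datatype hidx;
    char sendbuf[2048] = { 0 }, recvbuf[2048];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Two 4-int blocks at byte offsets 0 and 1024: a simple hindexed layout
     * of the kind ROMIO builds for file views. */
    MPI_Type_create_hindexed(2, blocklens, displs, MPI_INT, &hidx);
    MPI_Type_commit(&hidx);

    /* Exchange with the other rank (assumes exactly 2 ranks). */
    MPI_Sendrecv(sendbuf, 1, hidx, (rank + 1) % 2, 0,
                 recvbuf, 1, hidx, (rank + 1) % 2, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&hidx);
    MPI_Finalize();
    return 0;
}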

Re: [OMPI devel] RFC: convert send to ssend

2009-08-24 Thread Sylvain Jeaugey
For the record, I see a big interest in this. Sometimes, you have to answer calls for tender featuring applications that must work with no code change, even if the code is completely non-MPI-compliant. That's sad, but true (no pun intended :-)) Sylvain On Mon, 24 Aug 2009, George Bosilca w

Re: [OMPI devel] Improvement of openmpi.spec

2009-08-06 Thread Sylvain Jeaugey
e RPM build command passing --pkgname or somesuch to OMPI's configure to override the built-in name? Hum, I guess you're right, this is indeed not something to change. Sorry about that. Sylvain On Jul 31, 2009, at 11:51 AM, Sylvain Jeaugey wrote: Hi all, We had to apply a litt

Re: [OMPI devel] [OT] Who's going to Helsinki?

2009-08-04 Thread Sylvain Jeaugey
Hi Jeff, I bet you're referring to Euro PVM MPI 09? If that is what you're referring to, I should attend as usual. And of course, I'm very interested in joining a devel meeting :) Sylvain On Tue, 4 Aug 2009, Jeff Squyres wrote: Who's going to Helsinki? Does anyone want to meet up for some

Re: [OMPI devel] [PATCH] Better error reporting when failing to load a component

2009-08-03 Thread Sylvain Jeaugey
On Mon, 3 Aug 2009, Jeff Squyres wrote: On Aug 3, 2009, at 8:23 AM, Arthur Huillet wrote: I have recently started working on OpenMPI, and part of my job consists in adding a new module to OpenMPI. Cool. What are you adding? A collective component to support some Bull specific hardware. Sy

[OMPI devel] Improvement of openmpi.spec

2009-07-31 Thread Sylvain Jeaugey
n a couple of places - Add an %{opt_prefix} option to be able to install in a specific path (e.g. in /opt//mpi/-/ instead of /opt/-) The patch is done with "hg extract" but should apply on the SVN trunk. Sylvain# HG changeset patch # User Sylvain Jeaugey # Date 124904

Re: [OMPI devel] OpenMPI, PLPA and Linux cpuset/cgroup support

2009-07-22 Thread Sylvain Jeaugey
Hi Jeff, I'm interested in joining the effort, since we will likely have the same problem with SLURM's cpuset support. On Wed, 22 Jul 2009, Jeff Squyres wrote: But as to why it's getting EINVAL, that could be wonky. We might want to take this to the PLPA list and have you run some small, no

Re: [OMPI devel] Use of OPAL_PREFIX to relocate a lib

2009-06-19 Thread Sylvain Jeaugey
On Thu, 18 Jun 2009, Jeff Squyres wrote: On Jun 18, 2009, at 11:25 AM, Sylvain Jeaugey wrote: My problem seems related to library generation through RPM, not with 1.3.2, nor the patch. I'm not sure I understand -- is there something we need to fix in our SRPM? I need to dig a bit

Re: [OMPI devel] Use of OPAL_PREFIX to relocate a lib

2009-06-18 Thread Sylvain Jeaugey
Ok, never mind. My problem seems related to library generation through RPM, not with 1.3.2, nor the patch. Sylvain On Thu, 18 Jun 2009, Sylvain Jeaugey wrote: Hi all, Until Open MPI 1.3 (maybe 1.3.1), I used to find it convenient to be able to move a library from its "normal"

[OMPI devel] Use of OPAL_PREFIX to relocate a lib

2009-06-18 Thread Sylvain Jeaugey
Hi all, Until Open MPI 1.3 (maybe 1.3.1), I used to find it convenient to be able to move a library from its "normal" place (either /usr or /opt) to somewhere else (i.e. my NFS home account) to be able to try things only on my account. So, I used to set OPAL_PREFIX to the root of the Open MP

Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-12 Thread Sylvain Jeaugey
ou seem to have a real reproducer). Sylvain On Wed, 10 Jun 2009, Sylvain Jeaugey wrote: Hum, very glad that padb works with Open MPI, I couldn't live without it. In my opinion, the best debug tool for parallel applications, and more importantly, the only one that scales. About the is

Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Sylvain Jeaugey
Hum, very glad that padb works with Open MPI, I couldn't live without it. In my opinion, the best debug tool for parallel applications, and more importantly, the only one that scales. About the issue, I couldn't reproduce it on my platform (tried 2 nodes with 2 to 8 processes each, nodes are t

Re: [OMPI devel] [RFC] Low pressure OPAL progress

2009-06-10 Thread Sylvain Jeaugey
putting the process to sleep. You could let someone know so a human can decide what, if anything, to do about it, or provide a hook so that people can explore/utilize different response strategies...or both! HTH Ralph On Tue, Jun 9, 2009 at 6:52 AM, Sylvain Jeaugey wrote: I understand your

Re: [OMPI devel] [RFC] Low pressure OPAL progress

2009-06-09 Thread Sylvain Jeaugey
o about it, or provide a hook so that people can explore/utilize different response strategies...or both! HTH Ralph On Tue, Jun 9, 2009 at 6:52 AM, Sylvain Jeaugey wrote: I understand your point of view, and mostly share it. I think the biggest point in my example is that sleep

Re: [OMPI devel] [RFC] Low pressure OPAL progress

2009-06-09 Thread Sylvain Jeaugey
MPI processes wait for us to reach a communication point. We -want- those processes spinning away so that, when the comm starts, it can proceed as quickly as possible. Just some thoughts... Ralph On Jun 9, 2009, at 5:28 AM, Terry Dontje wrote: Sylvain Jeaugey wrote: Hi Ralph, I'm enti

Re: [OMPI devel] Multi-rail on openib

2009-06-09 Thread Sylvain Jeaugey
On Mon, 8 Jun 2009, NiftyOMPI Tom Mitchell wrote: ??? dual rail does double the number of switch ports. If you want to address switch failure each rail must connect to a different switch. If you do not want to have isolated fabrics you must have some additional ports on all switches to connect

Re: [OMPI devel] [RFC] Low pressure OPAL progress

2009-06-09 Thread Sylvain Jeaugey
system to behave similarly to today isn't enough - we still wind up adding logic into a very critical timing loop for no reason. A simple configure option of --enable-mpi-progress-monitoring would be sufficient to protect the code. HTH Ralph On Jun 8, 2009, at 9:50 AM, Sylvain

[OMPI devel] [RFC] Low pressure OPAL progress

2009-06-08 Thread Sylvain Jeaugey
What: when nothing has been received for a very long time - e.g. 5 minutes - stop busy polling in opal_progress and switch to a usleep-based one. Why: when we have long waits, and especially when an application is deadlocked, detecting it is not easy and a lot of power is wasted until the e
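
A self-contained sketch of the idea (not the actual opal_progress patch): keep busy polling while there is activity, back off to usleep after a long idle period:
-- low_pressure_poll.c (sketch) --
#include <stdbool.h>
#include <time.h>
#include <unistd.h>

#define IDLE_BEFORE_SLEEP_SEC 300   /* e.g. 5 minutes of inactivity */

/* Stand-in for the real progress engine: returns true if an event was
 * handled.  Here it always reports "nothing to do", so the loop backs off. */
static bool poll_once(void) { return false; }

int main(void)
{
    time_t last_event = time(NULL);
    for (int i = 0; i < 1000000; i++) {
        if (poll_once()) {
            last_event = time(NULL);          /* activity: keep busy polling */
        } else if (time(NULL) - last_event > IDLE_BEFORE_SLEEP_SEC) {
            usleep(1000);                     /* long idle: low-pressure mode */
        }
    }
    return 0;
}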
