WHAT: Add support to send data directly from CUDA device memory via MPI calls.
TIMEOUT: April 25, 2011
DETAILS: When programming in a mixed MPI and CUDA environment, one cannot
currently send data directly from CUDA device memory. The programmer first has
to move the data into host memory and send it from there.
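As an illustration, here is a minimal sketch of the staging pattern described above versus what this RFC would allow (buffer sizes, tags, and function names are made up for the example):

#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>

/* Today: stage the GPU data through host memory before handing it to MPI */
void send_from_gpu_staged(void *devbuf, int count, int dest)
{
    void *hostbuf = malloc(count * sizeof(double));
    cudaMemcpy(hostbuf, devbuf, count * sizeof(double), cudaMemcpyDeviceToHost);
    MPI_Send(hostbuf, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    free(hostbuf);
}

/* With the proposed support, the device pointer could be passed directly */
void send_from_gpu_direct(void *devbuf, int count, int dest)
{
    MPI_Send(devbuf, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
}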
the same values ?
Do you need GPUDirect for "to improve performance, the internal host buffers
have to also be registered with the CUDA environment"?
Regards,
Brice
On 13/04/2011 18:47, Rolf vandeVaart wrote:
WHAT: Add support to send data directly from CUDA device memory via MPI calls.
WHAT: Second try to add support to send data directly from CUDA device memory
via MPI calls.
TIMEOUT: 4/26/2011
DETAILS: Based on all the feedback (thanks to everyone who looked at it), I
have whittled down what I hope to accomplish with this RFC. There were
suggestions to better modularize
Forgot the link...
https://bitbucket.org/rolfv/ompi-trunk-cuda-rfc2
From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf
Of Rolf vandeVaart
Sent: Tuesday, April 19, 2011 4:45 PM
To: Open MPI Developers
Subject: [OMPI devel] RFC: Second Try: Add support to send
Sent: Thursday, April 21, 2011 6:19 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] RFC: Second Try: Add support to send/receive CUDA
device memory directly
George -- what say you?
On Apr 19, 2011, at 4:54 PM, Rolf vandeVaart wrote:
> Forgot the link...
>
> https://bitbucket.org/rolfv/ompi-trunk-cuda-rfc2
I see in the sm BTL that there is the concept of memory affinity and the
potential to support multiple memory pools. I am curious if anyone is making
use of that feature? I am looking in the function sm_btl_first_time_init() in
the btl_sm.c file.
Thanks,
Rolf
---
WHAT: Add CUDA registration of host memory in sm and openib BTLs.
TIMEOUT: 8/4/2011
DETAILS: In order to improve performance of sending GPU device memory,
we need to register the host memory with the CUDA framework. These
changes allow that to happen. These changes are somewhat different
from w
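For reference, registering an already-allocated host buffer with the CUDA runtime looks roughly like the sketch below; the actual BTL changes may use the driver API (cuMemHostRegister) instead, so treat this only as an illustration of the idea:

#include <stddef.h>
#include <cuda_runtime.h>

/* Pin an existing host buffer so the CUDA DMA engines can use it directly.
 * Registered (page-locked) memory makes device<->host copies faster and is
 * required for asynchronous copies. */
int register_host_buffer(void *buf, size_t len)
{
    return (cudaHostRegister(buf, len, cudaHostRegisterDefault) == cudaSuccess) ? 0 : -1;
}

int unregister_host_buffer(void *buf)
{
    return (cudaHostUnregister(buf) == cudaSuccess) ? 0 : -1;
}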
To: Open MPI Developers
Subject: Re: [OMPI devel] RFC: CUDA register sm and openib host memory
Rolf -
Can you send a cumulative SVN diff against the SVN HEAD?
Sent from my phone. No type good.
On Jul 28, 2011, at 5:52 PM, "Rolf vandeVaart"
<rvandeva...@nvidia.com> wrote:
WHAT: Add CUDA registration of host memory in sm and openib BTLs.
of
>sending GPU device memory? I fail to see how registering the backend shared
>memory file with CUDA is supposed to do anything at all, as this memory is
>internal to Open MPI and not supposed to be visible at any other level.
>
> Thanks,
>george.
>
>On Jul 28, 20
I think this is a good idea.
I have spent a fair amount of time in the past analyzing timeouts from this set
of tests. I had to figure out if it was an actual timeout or if the test was
just running very slowly.
In fact, I see that sometime in the past I throttled back the number of
iterations
Actually, I think you are off by which commit undid the change. It was this
one. And the message does suggest it might have caused problems.
https://svn.open-mpi.org/trac/ompi/changeset/23764
Timestamp:
09/17/10 19:04:06 (12 months ago)
Author:
rhc
Message:
WARNING: Work on the te
This is a pre-RFC of some changes I am hoping to bring into the trunk.
(I call this a pre-RFC as I have no timeout and I am not done with the code
yet.)
With some prior commits, I have added the ability to send GPU buffers directly.
This support consists of forcing the use of only the send/receive protocol.
> george.
>
>PS: Regarding the hand-copy instead of the memcpy, we tried to avoid using
>memcpy in performance-critical code, especially when we know the size of
>the data and the alignment. This relieves the compiler of adding ugly
>intrinsics, allowing it to nicely pipeline the loads/stores. An
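A toy illustration of the hand-copy idea George describes, with a made-up header type (not an actual Open MPI structure):

#include <stdint.h>

/* Hypothetical small, fixed-size, aligned header */
struct hdr {
    uint64_t seq;
    uint32_t tag;
    uint32_t len;
};

/* Hand copy: size and alignment are known at compile time, so the compiler
 * can emit a handful of plain loads/stores and schedule them nicely. */
static inline void hdr_copy(struct hdr *dst, const struct hdr *src)
{
    dst->seq = src->seq;
    dst->tag = src->tag;
    dst->len = src->len;
}

/* The generic alternative would be: memcpy(dst, src, sizeof(*dst)); */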
This may seem trivial, but should we name them:
#define MCA_BTL_DES_FLAGS_PUT 0x0010
#define MCA_BTL_DES_FLAGS_GET 0x0020
Although I see there is some inconsistency in how these flags are named, two of
the three original ones have "BTL_DES_FLAGS" in them.
Rolf
rvandeva...@nvidia.com
781-275-5
WHAT: Add new sm BTL, and supporting mpools, that can also support CUDA RDMA.
WHY: With CUDA 4.1, there is some GPU IPC support available that we can take
advantage of to move data efficiently between GPUs within a node.
WHERE: new--> ompi/mca/btl/smcuda, ompi/mca/mpool/cuda, ompi/mca/mpool/rcud
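The CUDA 4.1 IPC support referred to above works roughly as follows (a sketch only; in an MPI BTL the handle would be exchanged with the peer process over the shared-memory channel):

#include <cuda_runtime.h>

/* Exporting process: create a handle for a device allocation */
cudaIpcMemHandle_t export_devptr(void *devptr)
{
    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, devptr);
    /* ship 'handle' to the peer process on the same node */
    return handle;
}

/* Importing process: map the peer's device memory and copy from it
 * GPU-to-GPU, with no staging through host memory */
void import_and_copy(cudaIpcMemHandle_t handle, void *local_devptr, size_t len)
{
    void *peer_devptr;
    cudaIpcOpenMemHandle(&peer_devptr, handle, cudaIpcMemLazyEnablePeerAccess);
    cudaMemcpy(local_devptr, peer_devptr, len, cudaMemcpyDeviceToDevice);
    cudaIpcCloseMemHandle(peer_devptr);
}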
Your observations are correct. If the payload is non-contiguous, then RDMA is
not used. The data has to be copied first into an intermediate buffer and then
sent.
This has not changed in later versions of Open MPI.
Rolf
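As an example of what "non-contiguous" means here, a strided datatype like the one sketched below cannot be handed to the RDMA path as-is; the data is packed into an intermediate buffer and sent from there:

#include <mpi.h>

/* Picks every other double out of an array: non-contiguous in memory */
MPI_Datatype make_strided_type(int count)
{
    MPI_Datatype strided;
    MPI_Type_vector(count, 1 /* blocklength */, 2 /* stride */, MPI_DOUBLE, &strided);
    MPI_Type_commit(&strided);
    return strided;
}

/* A send such as MPI_Send(buf, 1, strided, dest, tag, comm) takes the copy path */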
>-Original Message-
>From: devel-boun...@open-mpi.org [mailto:de
I am not aware of any issues. Can you send me a test program and I can try it
out?
Which version of CUDA are you using?
Rolf
>-Original Message-
>From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org]
>On Behalf Of Sebastian Rinke
>Sent: Tuesday, January 17, 2012 8:50 AM
>
These are wickedly interesting.
>>
>> Ken
>> -Original Message-
>> From: devel-boun...@open-mpi.org [mailto:devel-bounces@open-
>mpi.org]
>> On Behalf Of Rolf vandeVaart
>> Sent: Tuesday, January 17, 2012 7:54 AM
>> To: Open MPI Developers
>
Rolf vandeVaart wrote:
I ran your test case against Open MPI 1.4.2
* Use a kernel with GPUDirect support
* Use the MLNX OFED stack with GPUDirect support
* Install the CUDA developer driver
Does using CUDA >= 4.0 make one of the above steps redundant?
I.e., RHEL or different kernel or MLNX OFED stack with GPUDirect support is
not needed any more?
Sebastian.
Rolf
There are several things going on here that make their library perform better.
With respect to inter-node performance, both MVAPICH2 and Open MPI copy the GPU
memory into host memory first. However, they are using special host buffers
and a code path that allow them to copy the data asynchronously.
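Roughly speaking, the "special host buffers" are pinned (page-locked) allocations, which is what lets the device-to-host copy run asynchronously and overlap with other work. A sketch of the mechanism (not the actual MVAPICH2 or Open MPI code):

#include <cuda_runtime.h>

/* Pinned staging buffer: required for a truly asynchronous cudaMemcpyAsync */
void *alloc_staging(size_t len)
{
    void *hostbuf = NULL;
    cudaMallocHost(&hostbuf, len);
    return hostbuf;
}

/* Kick off the copy and return immediately */
void start_d2h_copy(void *hostbuf, const void *devbuf, size_t len, cudaStream_t s)
{
    cudaMemcpyAsync(hostbuf, devbuf, len, cudaMemcpyDeviceToHost, s);
}

/* Poll from the progress loop; send the staged data once this returns true */
int copy_done(cudaStream_t s)
{
    return cudaStreamQuery(s) == cudaSuccess;
}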
I think I am OK with this.
Alternatively, you could have done something like is done in the TCP BTL where
the payload and header are added together for the frag size?
To state more clearly, I was trying to say you could do something similar to
what is done at line 1015 in btl_tcp_component.c a
Hi Jeff:
It is set in opal/config/opal_configure_options.m4
>-Original Message-
>From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org]
>On Behalf Of Jeffrey Squyres
>Sent: Friday, February 24, 2012 6:07 AM
>To: de...@open-mpi.org
>Subject: Re: [OMPI devel] [OMPI svn-full]
[Comment at bottom]
>-Original Message-
>From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org]
>On Behalf Of Nathan Hjelm
>Sent: Friday, March 09, 2012 2:23 PM
>To: Open MPI Developers
>Subject: Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
>
>
>
>On Fri, 9 Mar 2012, J
I am running a simple test and using the -bind-to-core or -bind-to-socket
options. I think the CPU binding is working fine, but I see these warnings
about not being able to bind to memory. Is this expected? This is trunk code
(266128)
[dt]$ mpirun --report-bindings -np 2 -bind-to-core conne
Here is my explanation. The call to MCA_BTL_TCP_FRAG_ALLOC_EAGER or
MCA_BTL_TCP_FRAG_ALLOC_MAX allocate a chunk of memory that has space for both
the fragment as well as any payload. So, when we do the frag+1, we are setting
the pointer in the frag to point to where the payload of the message lives.
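In other words, something like the simplified sketch below, where the descriptor and its payload live in a single allocation (this is not the actual macro, just the idiom):

#include <stdlib.h>

struct frag {
    size_t payload_len;
    void  *payload;          /* points just past the struct itself */
};

struct frag *frag_alloc(size_t max_payload)
{
    /* one chunk holds the fragment header followed by the payload space */
    struct frag *f = malloc(sizeof(struct frag) + max_payload);
    if (NULL == f) {
        return NULL;
    }
    f->payload_len = max_payload;
    f->payload = (void *)(f + 1);   /* the "frag+1" referred to above */
    return f;
}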
After doing a fresh checkout of the trunk, and then running autogen, I see this:
M opal/mca/event/libevent2019/libevent/Makefile.in
M opal/mca/event/libevent2019/libevent/depcomp
M opal/mca/event/libevent2019/libevent/include/Makefile.in
M opal/mca/event/libevent2019/libeve
Hi Nathan:
I downloaded and tried it out. There were a few issues that I had to work
through, but finally got things working.
Can you apply this patch to your changes prior to checking things in?
I also would suggest configuring with --enable-picky as there are something
like 10 warnings generated.
WHAT: Add support for doing asynchronous copies of GPU memory with larger
messages.
WHY: Improve performance for sending/receiving of larger GPU messages over IB
WHERE: ob1, openib, and convertor code. All is protected by compiler directives
so no effect on non-CUDA builds.
REFERENCE:
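The general idea can be sketched as a simple two-buffer pipeline that overlaps the device-to-host copy of one chunk with the send of the previous one. The real ob1/openib code paths are of course more involved; the chunking and protocol handling below are illustrative only:

#include <mpi.h>
#include <cuda_runtime.h>

void pipelined_send(const char *devbuf, size_t total, size_t chunk,
                    int dest, MPI_Comm comm)
{
    void *stage[2];
    cudaStream_t stream;
    MPI_Request req = MPI_REQUEST_NULL;
    int cur = 0;

    cudaStreamCreate(&stream);
    cudaMallocHost(&stage[0], chunk);   /* pinned staging buffers */
    cudaMallocHost(&stage[1], chunk);

    for (size_t off = 0; off < total; off += chunk, cur ^= 1) {
        size_t len = (total - off < chunk) ? (total - off) : chunk;
        /* start copying this chunk into its staging buffer */
        cudaMemcpyAsync(stage[cur], devbuf + off, len,
                        cudaMemcpyDeviceToHost, stream);
        /* keep at most one send outstanding; this wait overlaps the copy */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        /* once the copy is finished, send it while the next chunk copies */
        cudaStreamSynchronize(stream);
        MPI_Isend(stage[cur], (int)len, MPI_BYTE, dest, 0, comm, &req);
    }
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    cudaFreeHost(stage[0]);
    cudaFreeHost(stage[1]);
    cudaStreamDestroy(stream);
}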
PU
>buffers
>
>Can you make your repository public or add me to the access list?
>
>-Nathan
>
>On Wed, Jun 27, 2012 at 03:12:34PM -0700, Rolf vandeVaart wrote:
>> WHAT: Add support for doing asynchronous copies of GPU memory with
>larger messages.
>> WHY: I
Adding a timeout to this RFC.
TIMEOUT: July 17, 2012
rvandeva...@nvidia.com
781-275-5358
-Original Message-
From: Rolf vandeVaart
Sent: Wednesday, June 27, 2012 6:13 PM
To: de...@open-mpi.org
Subject: RFC: add asynchronous copies for large GPU buffers
WHAT: Add support for doing asynchronous copies of GPU memory with larger messages.
>-Original Message-
>From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org]
>On Behalf Of Ralph Castain
>Sent: Monday, July 30, 2012 9:29 AM
>To: Open MPI Developers
>Subject: Re: [OMPI devel] The hostfile option
>
>
>On Jul 30, 2012, at 2:37 AM, George Bosilca wrote:
>
>> I t
37 PM
>To: Rolf vandeVaart
>Cc: de...@open-mpi.org
>Subject: Re: OpenMPI CUDA 5 readiness?
>
>CUDA 5 basically changes char* to void* in some functions. Attached is a small
>patch which changes the prototypes depending on the CUDA version used. Tested
>with the CUDA 5 preview and with 4.2.
>
>
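[For context: this kind of prototype difference is usually handled by checking the CUDA_VERSION macro from cuda.h. The sketch below is hypothetical, not the attached patch:]

#include <cuda.h>

/* Hypothetical example: pick the buffer argument type per toolkit version */
#if CUDA_VERSION >= 5000
typedef void *ompi_cuda_buf_t;
#else
typedef char *ompi_cuda_buf_t;
#endif

/* Code written against ompi_cuda_buf_t then compiles with either toolkit. */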
[I sent this out in June, but did not commit it. So resending. Timeout of Jan
5, 2013. Note that this does not use GPU Direct RDMA.]
WHAT: Add support for doing asynchronous copies of GPU memory with larger
messages.
WHY: Improve performance for sending/receiving of larger GPU messages over IB
Thanks for this report. I will look into this. Can you tell me what your
mpirun command looked like and do you know what transport you are running over?
Specifically, is this on a single node or multiple nodes?
Rolf
From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf
I have stumbled into a problem with the -host argument. This problem appears
to be introduced with changeset r27879 on 1/19/2013 by rhc.
With r27877, things work:
[rolf@node]$ which mpirun
/home/rolf/ompi-trunk-r27877/64/bin/mpirun
[rolf@node]$ mpirun -np 2 -host c0-0,c0-3 hostname
c0-3
c0-0
, Ralph Castain wrote:
>
>> Ummm...that was fixed a long time ago. You might try a later version.
>>
>> Or are you saying the head of the trunk doesn't work too?
>>
>> On Jan 31, 2013, at 7:31 AM, Rolf vandeVaart
>wrote:
>>
>>> I have stum
31, 2013 11:51 AM
>To: Open MPI Developers
>Subject: Re: [OMPI devel] mpirun -host does not work from r27879 and
>forward on trunk
>
>Yes - no hostfile and no RM allocation, just -host.
>
>What is your setup?
>
>On Jan 31, 2013, at 8:44 AM, Rolf vandeVaart
>wrote:
>
I have noticed several warnings while building the trunk. Feel free to fix
anything that you are familiar with.
CC sys_limits.lo
../../../opal/util/sys_limits.c: In function 'opal_util_init_sys_limits':
../../../opal/util/sys_limits.c:107:20: warning: 'lim' may be used
uninitialized in this function
I ran into a hang in a test in which the sender sends less data than the
receiver is expecting. For example, the following shows the receiver expecting
twice what the sender is sending.
Rank 0: MPI_Send(buf, BUFSIZE, MPI_INT, 1, 99, MPI_COMM_WORLD)
Rank 1: MPI_Recv(buf, BUFSIZE*2, MPI_INT, 0, 99, MPI_COMM_WORLD, &status)
No changes here.
>NVIDIA
>==
>rolfv:Rolf Vandevaart
>
George.
>
>On Aug 21, 2013, at 23:00 , svn-commit-mai...@open-mpi.org wrote:
>
>> Author: rolfv (Rolf Vandevaart)
>> Date: 2013-08-21 17:00:09 EDT (Wed, 21 Aug 2013) New Revision: 29055
>> URL: https://svn.open-mpi.org/trac/ompi/changeset/29055
>>
>> Log:
>> Fi
rg] On Behalf Of George
>Bosilca
>Sent: Friday, August 23, 2013 7:36 AM
>To: Open MPI Developers
>Subject: Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r29055 - in
>trunk/ompi/mca: btl btl/smcuda common/cuda pml/ob1
>
>Rolf,
>
>On Aug 22, 2013, at 19:24 , Rolf vandeVaart
The ompi/mca/rcache/rb component has been .ompi_ignored for almost 7 years.
Should we delete it?
>interested in implementing in the future (an intern or some PhD student).
>
>On Aug 23, 2013, at 21:53 , Rolf vandeVaart wrote:
>
>> Yes, I agree that the CUDA support is more intrusive and ends up in
>different areas. The problem is that the changes could not be simply isolated.
As mentioned in the weekly conference call, I am seeing some strange errors
when using the openib BTL. I have narrowed down the changeset that broke
things to the ORTE async code.
https://svn.open-mpi.org/trac/ompi/changeset/29058 (and
https://svn.open-mpi.org/trac/ompi/changeset/29061 which
something up in the OOB connect code
itself. I'll take a look and see if something leaps out at me - it seems to be
working fine on IU's odin cluster, which is the only IB-based system I can
access
On Sep 3, 2013, at 11:34 AM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
M, Ralph Castain
<r...@open-mpi.org> wrote:
Dang - I just finished running it on odin without a problem. Are you seeing
this with a debug or optimized build?
On Sep 3, 2013, at 12:16 PM, Rolf vandeVaart
<rvandeva...@nvidia.com> wrote:
Yes, it fails on the current trunk.
. I've tried up to np=16 without getting a single
hiccup.
Try a fresh checkout - let's make sure you don't have some old cruft laying
around.
On Sep 3, 2013, at 12:26 PM, Rolf vandeVaart
<rvandeva...@nvidia.com> wrote:
I am running a debug build. Here is my configur
Correction: That line below should be:
gmake run FILE=p2p_c
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart
Sent: Tuesday, September 03, 2013 4:50 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes
I just retried and I
: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart
Sent: Tuesday, September 03, 2013 4:52 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes
Correction: That line below should be:
gmake run FILE=p2p_c
From: devel [mailto:devel-boun
WHAT: Remove alignment code from ompi/mca/rcache/vma module
WHY: Because it is redundant and causing problems for memory pools that want
different alignment
WHERE: ompi/mca/rcache/vma/rcache_vma.c,
ompi/mca/mpool/grdma/mpool_grdma_module.c (Detailed changes attached)
WHEN: Tuesday, September 17,
Hi Max:
You say that that the function keeps "allocating memory in the pml free list."
How do you know that is happening?
Do you know which free list it is happening on? There are something like 8
free lists associated with the pml ob1 so it would be interesting to know which
one you observe growing.
In a private email, I had Max add some instrumentation so
we could see which list was growing. We now know it is the
mca_pml_base_send_requests list.
>-Original Message-
>From: Max Staufer [mailto:max.stau...@gmx.net]
>Sent: Friday, September 13, 2013 7:06 AM
>To: Rolf va
I will wait another week on this since I know a lot of folks were traveling.
Any input welcome.
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart
Sent: Tuesday, September 10, 2013 2:46 PM
To: de...@open-mpi.org
Subject: [OMPI devel] RFC: Remove alignment code from
WHAT: Add GPU Direct RDMA support to openib btl
WHY: Better latency for small GPU message transfers
WHERE: Several files, see ticket for list
WHEN: Friday, October 18, 2013 COB
More detail:
This RFC looks to make use of GPU Direct RDMA support that is coming in the
future in Mellanox libraries.
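The core of what GPU Direct RDMA enables can be sketched with the verbs API as below. This assumes the peer-memory support from the Mellanox stack mentioned above is installed so that ibv_reg_mr() accepts a device pointer; it is an illustration, not the proposed openib BTL code:

#include <cuda_runtime.h>
#include <infiniband/verbs.h>

/* Register GPU device memory directly with the HCA so the NIC can read and
 * write it without staging through a host buffer. */
struct ibv_mr *register_gpu_memory(struct ibv_pd *pd, size_t len)
{
    void *devptr = NULL;
    cudaMalloc(&devptr, len);

    /* With peer-memory support this pins and maps the GPU memory for the
     * HCA; without it, the registration simply fails. */
    return ibv_reg_mr(pd, devptr, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}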
Yes, that is from one of my CMRs. I always configure with -enable-picky but
that did not pick up this warning.
I will fix this in the trunk in the morning (watching the Red Sox right now :))
and then file CMR to bring over.
Rolf
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralp
I noticed that there were some CFLAGS that were no longer set when enabling
with --enable-picky for gcc. Specifically, -Wundef and -pedantic were no
longer set.
This is not a problem for Open MPI 1.7.
I believe this is happening because of some code in the
config/oshmem_configure_options.m4 file.
>-Original Message-
>From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Jeff Squyres
>(jsquyres)
>Sent: Thursday, October 31, 2013 4:12 PM
>To: Open MPI Developers
>Subject: Re: [OMPI devel] oshmem and CFLAGS removal
>
>On Oct 31, 2013, at 3:46 PM,
Hello Solibakke:
Let me try and reproduce with your configure options.
Rolf
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Solibakke Per
Bjarte
Sent: Thursday, November 07, 2013 8:40 AM
To: 'de...@open-mpi.org'
Subject: [OMPI devel] MPIRUN error message after ./configure and sudo m
use --enable-mca-dso...though I don't
know if that is the source of the problem.
On Nov 7, 2013, at 6:00 AM, Rolf vandeVaart
<rvandeva...@nvidia.com> wrote:
Hello Solibakke:
Let me try and reproduce with your configure options.
Rolf
From: devel [mailto:devel-boun...@open-
Let me know of any other issues you are seeing. Ralph fixed the issue with ob1
and we will move that into Open MPI 1.7.4.
Not sure why I never saw that issue. Will investigate some more.
>-Original Message-
>From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Jörg
>Bornschein
I believe I found a bug in the openib BTL and just want to see if folks agree with
this. When we are running on a NUMA node and we are bound to a CPU, we only
want to use the IB device that is closest to us. However, I observed that we
always used both devices regardless. I believe there is a bug
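A sketch of the kind of locality check I have in mind, using hwloc's verbs helper (assuming hwloc_ibv_get_device_cpuset() is available in the hwloc build; the actual openib BTL code is structured differently):

#include <hwloc.h>
#include <hwloc/openfabrics-verbs.h>

/* Return 1 if the IB device's locality overlaps the CPUs we are bound to */
int device_is_near_me(hwloc_topology_t topo, struct ibv_device *ibdev)
{
    int near = 1;   /* on any failure, fall back to using the device */
    hwloc_cpuset_t mine = hwloc_bitmap_alloc();
    hwloc_cpuset_t devcpus = hwloc_bitmap_alloc();

    if (0 == hwloc_get_cpubind(topo, mine, HWLOC_CPUBIND_PROCESS) &&
        0 == hwloc_ibv_get_device_cpuset(topo, ibdev, devcpus)) {
        near = hwloc_bitmap_intersects(mine, devcpus);
    }
    hwloc_bitmap_free(mine);
    hwloc_bitmap_free(devcpus);
    return near;
}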
Hi Ralph:
In my opinion, we still try to get to a stable 1.7.4. I think we can just keep
the bar high (as you said in the meeting) about what types of fixes need to get
into 1.7.4. I have been telling folks 1.7.4 would be ready "really soon" so
the idea of folding in 1.7.5 CMRs and delaying it
I am seeing this happening to me very intermittently. Looks like mpirun is
getting a SEGV. Is anyone else seeing this?
This is 1.7.4 built yesterday. (Note that I added some stuff to what is being
printed out so the message is slightly different than 1.7.4 output)
mpirun -np 6 -host drosse
, January 30, 2014 11:51 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] Intermittent mpirun crash?
Huh - not much info there, I'm afraid. I gather you didn't build this with
--enable-debug?
On Jan 30, 2014, at 8:26 AM, Rolf vandeVaart wrote:
> I am seeing this happeni
>fixes the problem, I did not investigate any further.
>
>Do you see a similar behavior?
>
> George.
>
>On Jan 30, 2014, at 17:26 , Rolf vandeVaart wrote:
>
>> I am seeing this happening to me very intermittently. Looks like mpirun is
>getting a SEGV. Is anyone el
gfaulted as
>well), but obviously wouldn't have anything to do with mpirun
>
>On Jan 30, 2014, at 9:29 AM, Rolf vandeVaart
>wrote:
>
>> I just retested with --mca mpi_leave_pinned 0 and that made no difference.
>I still see the mpirun crash.
>>
>>> -
I have seen this same issue although my core dump is a little bit different. I
am running with tcp,self. The first entry in the list of BTLs is garbage, but
then there is tcp and self in the list. Strange. This is my core dump. Line
208 in bml_r2.c is where I get the SEGV.
Program termina
I have tracked this down. There is a missing commit that affects
ompi_mpi_init.c causing it to initialize bml twice.
Ralph, can you apply r30310 to 1.7?
Thanks,
Rolf
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart
Sent: Monday, February 10, 2014 12:29 PM
To: Open
It could. I added that argument 4 years ago to support my failover work
with the BFO. It was a way for a BTL to pass some type of string back to the
PML, identifying which BTL it was, so that verbose output made it easier to
understand what was happening.
>-Original Message-
>From: devel [mailto:devel-b
WHAT: Add two new verbose outputs to BML layer
WHY: There are times that I really want to know which BTLs are being used.
These verbose outputs can help with that.
WHERE: ompi/mca/bml/r2/bml_r2.c
TIMEOUT: COB Friday, 7 March 2014
MORE DETAIL: I have run into some cases where I have added to a
I am still seeing the same issue where I get some type of segv unless I disable
the coll ml component. This may be an issue at my end, but just thought I
would double check that we are sure this is fixed.
Thanks,
Rolf
>-Original Message-
>From: devel [mailto:devel-boun...@open-mpi.org]
SVN
>-Original Message-
>From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Nathan
>Hjelm
>Sent: Wednesday, April 16, 2014 10:35 AM
>To: Open MPI Developers
>Subject: Re: [OMPI devel] 1-question developer poll
>
>* PGP Signed by an unknown key
>
>Git
>
>On Wed, Apr 16, 2014 at 10
I have seen errors when running the intel test suite using the openib BTL when
transferring derived datatypes. I do not see the error with sm or tcp BTLs.
The errors begin after this checkin.
https://svn.open-mpi.org/trac/ompi/changeset/31370
Timestamp: 04/11/14 16:06:56 (5 days ago)
Author: b
g set to 1. I would be
>interested in the output you get on your machine.
>
>George.
>
>
>On Apr 16, 2014, at 14:34 , Rolf vandeVaart wrote:
>
>> I have seen errors when running the intel test suite using the openib BTL
>when transferring derived datatypes. I do not s
This seems similar to what I reported on a different thread.
http://www.open-mpi.org/community/lists/devel/2014/05/14688.php
I need to try and reproduce again. Elena, what kind of cluster were you
running on?
Rolf
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Elena Elkina
Sent
OK. So, I investigated a little more. I only see the issue when I am running
with multiple ports enabled such that I have two openib BTLs instantiated. In
addition, large message RDMA has to be enabled. If those conditions are not
met, then I do not see the problem. For example:
FAILS:
_send_size 23” should always transfer wrong data, even when only one
single BTL is in play.
George.
On May 7, 2014, at 13:11 , Rolf vandeVaart
<rvandeva...@nvidia.com> wrote:
OK. So, I investigated a little more. I only see the issue when I am running
with multiple ports ena
Open MPI 1.6:
- Release was waiting on
https://svn.open-mpi.org/trac/ompi/ticket/3079 but during meeting we decided it
was not necessary. Therefore, Jeff will go ahead and roll Open MPI 1.6.6 RC1.
Open MPI 1.8:
- Several tickets have been applied. Some discussion about other
WHAT: Add some basic support so that reduction functions can support GPU
buffers.
All this patch does is move the GPU data into a host buffer before the
reduction call and move it back to GPU after the reduction call.
Changes have no effect if CUDA-aware support is not compiled in.
WHY: Users
The bfo PML is mostly a duplicate of the ob1 PML but with extra code to handle
failover when running with a cluster with multiple IB NICs. A few
observations.
1. Almost no one uses the bfo PML. I have kept it around just in case someone
thinks about failover again.
2. The code where you are s
NOTE: This is an update to the RFC after review and help from George
WHAT: Add some basic support so that reduction functions can support GPU
buffers. Create new coll module that is only compiled in when CUDA-aware
support is compiled in. This patch moves the GPU data into a host buffer
before the reduction call and moves it back to the GPU after the reduction call.
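Essentially the staging is the following, wrapped inside the new coll module (a simplified sketch for double/sum only, with hypothetical names):

#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>

int reduce_gpu(void *d_send, void *d_recv, int count, int root, MPI_Comm comm)
{
    int rank;
    size_t len = count * sizeof(double);
    void *h_send = malloc(len);
    void *h_recv = malloc(len);

    MPI_Comm_rank(comm, &rank);
    /* move the GPU data into a host buffer, reduce, move the result back */
    cudaMemcpy(h_send, d_send, len, cudaMemcpyDeviceToHost);
    MPI_Reduce(h_send, h_recv, count, MPI_DOUBLE, MPI_SUM, root, comm);
    if (rank == root) {
        cudaMemcpy(d_recv, h_recv, len, cudaMemcpyHostToDevice);
    }
    free(h_send);
    free(h_recv);
    return MPI_SUCCESS;
}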
I am still seeing problems with del_procs with openib. Do we believe
everything should be working? This is with the latest trunk (updated 1 hour
ago).
[rvandevaart@drossetti-ivy0 examples]$ mpirun --mca btl_openib_if_include
mlx5_0:1 -np 2 -host drossetti-ivy0,drossetti-ivy1 connectivity_cCon
Ralph:
I am seeing cases where mpirun seems to hang when one of the applications exits
with non-zero. For example, the intel test MPI_Cart_get_c will exit that way
if there are not enough processes to run the test. In most cases, mpirun seems
to return fine with the error code, but sometimes it hangs.
s
>we force the exclusive usage of the send protocol, with an unconventional
>fragment size.
>>>>
>>>> In other words using the following flags "--mca btl tcp,self --mca
>btl_tcp_flags 3 --mca btl_tcp_rndv_eager_limit 23 --mca btl_tcp_eager_limit
>23 --mca btl_tcp_max_send_s
On the trunk, I am seeing failures of the ibm tests iallgather and
iallgather_in_place. Is this a known issue?
$ mpirun --mca btl self,sm,tcp --mca coll ml,basic,libnbc --host
drossetti-ivy0,drossetti-ivy0,drossetti-ivy1,drossetti-ivy1 -np 4 iallgather
[**ERROR**]: MPI_COMM_WORLD rank 0, file i
I am seeing an interesting failure on trunk. intercomm_create, spawn, and
spawn_multiple from the IBM tests hang if I explicitly list the hostnames to
run on. For example:
Good:
$ mpirun -np 2 --mca btl self,sm,tcp spawn_multiple
Parent: 0 of 2, drossetti-ivy0.nvidia.com (0 in init)
Parent: 1
n isend to process 3
dpm_base_disconnect_init: error -12 in isend to process 3
dpm_base_disconnect_init: error -12 in isend to process 1
dpm_base_disconnect_init: error -12 in isend to process 3
[rhc@bend001 mpi]$
On Jun 6, 2014, at 11:26 AM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
Minutes of June 10, 2014 Open MPI Core Developer Meeting
1. Review 1.6 - Nothing new
2. Review 1.8 - Most things are doing fine. Still several tickets
awaiting review. If influx of bugs slows, then we will get 1.8.2 release
ready. Rolf was concerned about intermittent hangs, but
Hearing no response, I assume this is not a known issue so I submitted
https://svn.open-mpi.org/trac/ompi/ticket/4709
Nathan, is this something that you can look at?
Rolf
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart
Sent: Friday, June 06, 2014 1:55 PM
To: de
I have noticed that I am seeing some tests hang on the trunk. For example:
$ mpirun --mca btl_tcp_if_include eth0 --host drossetti-ivy0,drossetti-ivy1 -np
2 --mca pml ob1 --mca btl sm,tcp,self --mca coll_mdisable_allgather 1 --mca
btl_openib_warn_default_gid_prefix 0 send
It is not unusual for
the conversions in ob1.
>>
>> -Nathan
>>
>> On Mon, Jul 14, 2014 at 01:38:38PM -0700, Rolf vandeVaart wrote:
>> >I have noticed that I am seeing some tests hang on the trunk. For
>> >example:
>> >
>> >
>> >
>> >$
With the latest trunk (r32246) I am getting crashes while the program is
shutting down. I assume this is related to some of the changes George just
made. George, can you take a look when you get a chance?
Looks like everyone is getting the segv during shutdown (mpirun, orted, and
application)
On both 1.8 and trunk (as Ralph mentioned in meeting) we are seeing three tests
fail.
http://mtt.open-mpi.org/index.php?do_redir=2205
Ibm/onesided/win_allocate_shared
Ibm/onesided/win_allocated_shared_mpifh
Ibm/onesided/win_allocated_shared_usempi
Is there a ticket that covers these failures?
T
Unless I am missing something obvious, I will update the test tomorrow and add
a comm split to ensure MPI_Win_allocate_shared is called from a single-node
communicator, and skip the test if this is impossible.
Cheers,
Gilles
Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
On both 1.8 and trunk (as Ralph m
My guess is that no one is testing the bfo PML. However, I would have expected
it to still work with Open MPI 1.6.5. From your description, it works for
smaller messages but fails with larger ones? So, if you just send smaller
messages and pull the cable, things work correctly?
One idea is t
WHAT: Bump up the minimum sm pool size to 128K from 64K.
WHY: When running OSU benchmark on 2 nodes and utilizing a larger
btl_smcuda_max_send_size, we can run into the case where the free list cannot
grow. This is not a common case, but it is something that folks sometimes
experiment with.
Yes (my mistake)
Sent from my iPhone
On Jul 26, 2014, at 3:19 PM, "George Bosilca"
<bosi...@icl.utk.edu> wrote:
We are talking MB, not KB, aren't we?
George.
On Thu, Jul 24, 2014 at 2:57 PM, Rolf vandeVaart
<rvandeva...@nvidia.com> wrote:
WHAT:
On 04/13/09 09:40, George Bosilca wrote:
On Apr 12, 2009, at 21:58 , Timothy Hayes wrote:
I was wondering if someone might be able to shed some light on a
couple of questions I have.
When you receive a fragment/base_descriptor in a BTL module, is the
raw data allowed to be fragmented when y