WHAT: Add support to send data directly from CUDA device memory via MPI calls.
TIMEOUT: April 25, 2011
DETAILS: When programming in a mixed MPI and CUDA environment, one cannot
currently send data directly from CUDA device memory. The programmer first has
to move the data into host memory and send it from there.
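As an illustration, here is a minimal sketch of the staging pattern described above versus what this RFC would allow (buffer sizes, tags, and function names are made up for the example):

#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>

/* Today: stage the GPU data through host memory before handing it to MPI */
void send_from_gpu_staged(void *devbuf, int count, int dest)
{
    void *hostbuf = malloc(count * sizeof(double));
    cudaMemcpy(hostbuf, devbuf, count * sizeof(double), cudaMemcpyDeviceToHost);
    MPI_Send(hostbuf, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    free(hostbuf);
}

/* With the proposed support, the device pointer could be passed directly */
void send_from_gpu_direct(void *devbuf, int count, int dest)
{
    MPI_Send(devbuf, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
}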
the same values ?
Do you need GPUDirect for "to improve performance, the internal host buffers
have to also be registered with the CUDA environment"?
Regards,
Brice
On 13/04/2011 18:47, Rolf vandeVaart wrote:
WHAT: Add support to send data directly from CUDA device memory via MPI calls.
WHAT: Second try to add support to send data directly from CUDA device memory
via MPI calls.
TIMEOUT: 4/26/2011
DETAILS: Based on all the feedback (thanks to everyone who looked at it), I
have whittled down what I hope to accomplish with this RFC. There were
suggestions to better modularize
Forgot the link...
https://bitbucket.org/rolfv/ompi-trunk-cuda-rfc2
From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf
Of Rolf vandeVaart
Sent: Tuesday, April 19, 2011 4:45 PM
To: Open MPI Developers
Subject: [OMPI devel] RFC: Second Try: Add support to send
Sent: Thursday, April 21, 2011 6:19 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] RFC: Second Try: Add support to send/receive CUDA
device memory directly
George -- what say you?
On Apr 19, 2011, at 4:54 PM, Rolf vandeVaart wrote:
> Forgot the link...
>
> https://bitbucket.org/rolfv/ompi-trunk-cuda-rfc2
I see in the sm BTL that there is the concept of memory affinity and the
potential to support multiple memory pools. I am curious if anyone is making
use of that feature? I am looking in the function sm_btl_first_time_init() in
the btl_sm.c file.
Thanks,
Rolf
---
WHAT: Add CUDA registration of host memory in sm and openib BTLs.
TIMEOUT: 8/4/2011
DETAILS: In order to improve performance of sending GPU device memory,
we need to register the host memory with the CUDA framework. These
changes allow that to happen. These changes are somewhat different
from w
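For reference, registering an already-allocated host buffer with the CUDA runtime looks roughly like the sketch below; the actual BTL changes may use the driver API (cuMemHostRegister) instead, so treat this only as an illustration of the idea:

#include <stddef.h>
#include <cuda_runtime.h>

/* Pin an existing host buffer so the CUDA DMA engines can use it directly.
 * Registered (page-locked) memory makes device<->host copies faster and is
 * required for asynchronous copies. */
int register_host_buffer(void *buf, size_t len)
{
    return (cudaHostRegister(buf, len, cudaHostRegisterDefault) == cudaSuccess) ? 0 : -1;
}

int unregister_host_buffer(void *buf)
{
    return (cudaHostUnregister(buf) == cudaSuccess) ? 0 : -1;
}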
To: Open MPI Developers
Subject: Re: [OMPI devel] RFC: CUDA register sm and openib host memory
Rolf -
Can you send a cumulative SVN diff against the SVN HEAD?
Sent from my phone. No type good.
On Jul 28, 2011, at 5:52 PM, "Rolf vandeVaart"
<rvandeva...@nvidia.com> wrote:
WHAT: Add CUDA registration of host memory in sm and openib BTLs.
of
>sending GPU device memory? I fail to see how registering the backend shared
>memory file with CUDA is supposed to do anything at all, as this memory is
>internal to Open MPI and not supposed to be visible at any other level.
>
> Thanks,
>george.
>
>On Jul 28, 20
I think this is a good idea.
I have spent a fair amount of time in the past analyzing timeouts from this set
of tests. I had to figure out if it was an actual timeout or if the test was
just running very slowly.
In fact, I see that sometime in the past I throttled back the number of
iterations
Actually, I think you are off by which commit undid the change. It was this
one. And the message does suggest it might have caused problems.
https://svn.open-mpi.org/trac/ompi/changeset/23764
Timestamp:
09/17/10 19:04:06 (12 months ago)
Author:
rhc
Message:
WARNING: Work on the te
This is a pre-RFC of some changes I am hoping to bring into the trunk.
(I call this a pre-RFC as I have no timeout and I am not done with the code
yet.)
With some prior commits, I have added the ability to send GPU buffers directly.
This support consists of forcing the use of only the send/receive protocol.
> george.
>
>PS: Regarding the hand-copy instead of the memcpy, we tried to avoid using
>memcpy in performance-critical code, especially when we know the size of
>the data and the alignment. This relieves the compiler of adding ugly
>intrinsics, allowing it to nicely pipeline the loads/stores. An
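A toy illustration of the hand-copy idea George describes, with a made-up header type (not an actual Open MPI structure):

#include <stdint.h>

/* Hypothetical small, fixed-size, aligned header */
struct hdr {
    uint64_t seq;
    uint32_t tag;
    uint32_t len;
};

/* Hand copy: size and alignment are known at compile time, so the compiler
 * can emit a handful of plain loads/stores and schedule them nicely. */
static inline void hdr_copy(struct hdr *dst, const struct hdr *src)
{
    dst->seq = src->seq;
    dst->tag = src->tag;
    dst->len = src->len;
}

/* The generic alternative would be: memcpy(dst, src, sizeof(*dst)); */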
This may seem trivial, but should we name them:
#define MCA_BTL_DES_FLAGS_PUT 0x0010
#define MCA_BTL_DES_FLAGS_GET 0x0020
Although I see there is some inconsistency in how these flags are named, two of
the three original ones have "BTL_DES_FLAGS" in them.
Rolf
rvandeva...@nvidia.com
781-275-5
WHAT: Add new sm BTL, and supporting mpools, that can also support CUDA RDMA.
WHY: With CUDA 4.1, there is some GPU IPC support available that we can take
advantage of to move data efficiently between GPUs within a node.
WHERE: new--> ompi/mca/btl/smcuda, ompi/mca/mpool/cuda, ompi/mca/mpool/rcud
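The CUDA 4.1 IPC support referred to above works roughly as follows (a sketch only; in an MPI BTL the handle would be exchanged with the peer process over the shared-memory channel):

#include <cuda_runtime.h>

/* Exporting process: create a handle for a device allocation */
cudaIpcMemHandle_t export_devptr(void *devptr)
{
    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, devptr);
    /* ship 'handle' to the peer process on the same node */
    return handle;
}

/* Importing process: map the peer's device memory and copy from it
 * GPU-to-GPU, with no staging through host memory */
void import_and_copy(cudaIpcMemHandle_t handle, void *local_devptr, size_t len)
{
    void *peer_devptr;
    cudaIpcOpenMemHandle(&peer_devptr, handle, cudaIpcMemLazyEnablePeerAccess);
    cudaMemcpy(local_devptr, peer_devptr, len, cudaMemcpyDeviceToDevice);
    cudaIpcCloseMemHandle(peer_devptr);
}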
Your observations are correct. If the payload is non-contiguous, then RDMA is
not used. The data has to be copied first into an intermediate buffer and then
sent.
This has not changed in later versions of Open MPI.
Rolf
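As an example of what "non-contiguous" means here, a strided datatype like the one sketched below cannot be handed to the RDMA path as-is; the data is packed into an intermediate buffer and sent from there:

#include <mpi.h>

/* Picks every other double out of an array: non-contiguous in memory */
MPI_Datatype make_strided_type(int count)
{
    MPI_Datatype strided;
    MPI_Type_vector(count, 1 /* blocklength */, 2 /* stride */, MPI_DOUBLE, &strided);
    MPI_Type_commit(&strided);
    return strided;
}

/* A send such as MPI_Send(buf, 1, strided, dest, tag, comm) takes the copy path */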
>-Original Message-
>From: devel-boun...@open-mpi.org [mailto:de
I am not aware of any issues. Can you send me a test program and I can try it
out?
Which version of CUDA are you using?
Rolf
>-Original Message-
>From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org]
>On Behalf Of Sebastian Rinke
>Sent: Tuesday, January 17, 2012 8:50 AM
>
These are wickedly interesting.
>>
>> Ken
>> -Original Message-
>> From: devel-boun...@open-mpi.org [mailto:devel-bounces@open-
>mpi.org]
>> On Behalf Of Rolf vandeVaart
>> Sent: Tuesday, January 17, 2012 7:54 AM
>> To: Open MPI Developers
>
Rolf vandeVaart wrote:
I ran your test case against Open MPI 1.4.2
* Use a kernel with GPUDirect support
* Use the MLNX OFED stack with GPUDirect support
* Install the CUDA developer driver
Does using CUDA >= 4.0 make one of the above steps redundant?
I.e., RHEL or different kernel or MLNX OFED stack with GPUDirect support is
not needed any more?
Sebastian.
Rolf
There are several things going on here that make their library perform better.
With respect to inter-node performance, both MVAPICH2 and Open MPI copy the GPU
memory into host memory first. However, they are using special host buffers
and a code path that allow them to copy the data asynchronously.
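Roughly speaking, the "special host buffers" are pinned (page-locked) allocations, which is what lets the device-to-host copy run asynchronously and overlap with other work. A sketch of the mechanism (not the actual MVAPICH2 or Open MPI code):

#include <cuda_runtime.h>

/* Pinned staging buffer: required for a truly asynchronous cudaMemcpyAsync */
void *alloc_staging(size_t len)
{
    void *hostbuf = NULL;
    cudaMallocHost(&hostbuf, len);
    return hostbuf;
}

/* Kick off the copy and return immediately */
void start_d2h_copy(void *hostbuf, const void *devbuf, size_t len, cudaStream_t s)
{
    cudaMemcpyAsync(hostbuf, devbuf, len, cudaMemcpyDeviceToHost, s);
}

/* Poll from the progress loop; send the staged data once this returns true */
int copy_done(cudaStream_t s)
{
    return cudaStreamQuery(s) == cudaSuccess;
}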
I think I am OK with this.
Alternatively, you could have done something like is done in the TCP BTL where
the payload and header are added together for the frag size?
To state more clearly, I was trying to say you could do something similar to
what is done at line 1015 in btl_tcp_component.c a
Hi Jeff:
It is set in opal/config/opal_configure_options.m4
>-Original Message-
>From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org]
>On Behalf Of Jeffrey Squyres
>Sent: Friday, February 24, 2012 6:07 AM
>To: de...@open-mpi.org
>Subject: Re: [OMPI devel] [OMPI svn-full]
[Comment at bottom]
>-Original Message-
>From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org]
>On Behalf Of Nathan Hjelm
>Sent: Friday, March 09, 2012 2:23 PM
>To: Open MPI Developers
>Subject: Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
>
>
>
>On Fri, 9 Mar 2012, J
I am running a simple test and using the -bind-to-core or -bind-to-socket
options. I think the CPU binding is working fine, but I see these warnings
about not being able to bind to memory. Is this expected? This is trunk code
(266128)
[dt]$ mpirun --report-bindings -np 2 -bind-to-core conne
Here is my explanation. The call to MCA_BTL_TCP_FRAG_ALLOC_EAGER or
MCA_BTL_TCP_FRAG_ALLOC_MAX allocate a chunk of memory that has space for both
the fragment as well as any payload. So, when we do the frag+1, we are setting
the pointer in the frag to point to where the payload of the message lives.
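In other words, something like the simplified sketch below, where the descriptor and its payload live in a single allocation (this is not the actual macro, just the idiom):

#include <stdlib.h>

struct frag {
    size_t payload_len;
    void  *payload;          /* points just past the struct itself */
};

struct frag *frag_alloc(size_t max_payload)
{
    /* one chunk holds the fragment header followed by the payload space */
    struct frag *f = malloc(sizeof(struct frag) + max_payload);
    if (NULL == f) {
        return NULL;
    }
    f->payload_len = max_payload;
    f->payload = (void *)(f + 1);   /* the "frag+1" referred to above */
    return f;
}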
After doing a fresh checkout of the trunk, and then running autogen, I see this:
M opal/mca/event/libevent2019/libevent/Makefile.in
M opal/mca/event/libevent2019/libevent/depcomp
M opal/mca/event/libevent2019/libevent/include/Makefile.in
M opal/mca/event/libevent2019/libeve
Hi Nathan:
I downloaded and tried it out. There were a few issues that I had to work
through, but finally got things working.
Can you apply this patch to your changes prior to checking things in?
I also would suggest configuring with --enable-picky as there are something
like 10 warnings generated.
WHAT: Add support for doing asynchronous copies of GPU memory with larger
messages.
WHY: Improve performance for sending/receiving of larger GPU messages over IB
WHERE: ob1, openib, and convertor code. All is protected by compiler directives
so no effect on non-CUDA builds.
REFERENCE:
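The general idea can be sketched as a simple two-buffer pipeline that overlaps the device-to-host copy of one chunk with the send of the previous one. The real ob1/openib code paths are of course more involved; the chunking and protocol handling below are illustrative only:

#include <mpi.h>
#include <cuda_runtime.h>

void pipelined_send(const char *devbuf, size_t total, size_t chunk,
                    int dest, MPI_Comm comm)
{
    void *stage[2];
    cudaStream_t stream;
    MPI_Request req = MPI_REQUEST_NULL;
    int cur = 0;

    cudaStreamCreate(&stream);
    cudaMallocHost(&stage[0], chunk);   /* pinned staging buffers */
    cudaMallocHost(&stage[1], chunk);

    for (size_t off = 0; off < total; off += chunk, cur ^= 1) {
        size_t len = (total - off < chunk) ? (total - off) : chunk;
        /* start copying this chunk into its staging buffer */
        cudaMemcpyAsync(stage[cur], devbuf + off, len,
                        cudaMemcpyDeviceToHost, stream);
        /* keep at most one send outstanding; this wait overlaps the copy */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        /* once the copy is finished, send it while the next chunk copies */
        cudaStreamSynchronize(stream);
        MPI_Isend(stage[cur], (int)len, MPI_BYTE, dest, 0, comm, &req);
    }
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    cudaFreeHost(stage[0]);
    cudaFreeHost(stage[1]);
    cudaStreamDestroy(stream);
}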
PU
>buffers
>
>Can you make your repository public or add me to the access list?
>
>-Nathan
>
>On Wed, Jun 27, 2012 at 03:12:34PM -0700, Rolf vandeVaart wrote:
>> WHAT: Add support for doing asynchronous copies of GPU memory with
>larger messages.
>> WHY: I
Adding a timeout to this RFC.
TIMEOUT: July 17, 2012
rvandeva...@nvidia.com
781-275-5358
-Original Message-
From: Rolf vandeVaart
Sent: Wednesday, June 27, 2012 6:13 PM
To: de...@open-mpi.org
Subject: RFC: add asynchronous copies for large GPU buffers
WHAT: Add support for doing asynchronous copies of GPU memory with larger messages.
>-Original Message-
>From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org]
>On Behalf Of Ralph Castain
>Sent: Monday, July 30, 2012 9:29 AM
>To: Open MPI Developers
>Subject: Re: [OMPI devel] The hostfile option
>
>
>On Jul 30, 2012, at 2:37 AM, George Bosilca wrote:
>
>> I t
37 PM
>To: Rolf vandeVaart
>Cc: de...@open-mpi.org
>Subject: Re: OpenMPI CUDA 5 readiness?
>
>CUDA 5 basically changes char* to void* in some functions. Attached is a small
>patch which changes the prototypes depending on the CUDA version used. Tested
>with the CUDA 5 preview and with 4.2.
>
>
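[For context: this kind of prototype difference is usually handled by checking the CUDA_VERSION macro from cuda.h. The sketch below is hypothetical, not the attached patch:]

#include <cuda.h>

/* Hypothetical example: pick the buffer argument type per toolkit version */
#if CUDA_VERSION >= 5000
typedef void *ompi_cuda_buf_t;
#else
typedef char *ompi_cuda_buf_t;
#endif

/* Code written against ompi_cuda_buf_t then compiles with either toolkit. */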
[I sent this out in June, but did not commit it. So resending. Timeout of Jan
5, 2013. Note that this does not use GPU Direct RDMA.]
WHAT: Add support for doing asynchronous copies of GPU memory with larger
messages.
WHY: Improve performance for sending/receiving of larger GPU messages over IB
Thanks for this report. I will look into this. Can you tell me what your
mpirun command looked like and do you know what transport you are running over?
Specifically, is this on a single node or multiple nodes?
Rolf
From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf
I have stumbled into a problem with the -host argument. This problem appears
to be introduced with changeset r27879 on 1/19/2013 by rhc.
With r27877, things work:
[rolf@node]$ which mpirun
/home/rolf/ompi-trunk-r27877/64/bin/mpirun
[rolf@node]$ mpirun -np 2 -host c0-0,c0-3 hostname
c0-3
c0-0
, Ralph Castain wrote:
>
>> Ummm...that was fixed a long time ago. You might try a later version.
>>
>> Or are you saying the head of the trunk doesn't work too?
>>
>> On Jan 31, 2013, at 7:31 AM, Rolf vandeVaart
>wrote:
>>
>>> I have stum
31, 2013 11:51 AM
>To: Open MPI Developers
>Subject: Re: [OMPI devel] mpirun -host does not work from r27879 and
>forward on trunk
>
>Yes - no hostfile and no RM allocation, just -host.
>
>What is your setup?
>
>On Jan 31, 2013, at 8:44 AM, Rolf vandeVaart
>wrote:
>
I have noticed several warnings while building the trunk. Feel free to fix
anything that you are familiar with.
CC sys_limits.lo
../../../opal/util/sys_limits.c: In function 'opal_util_init_sys_limits':
../../../opal/util/sys_limits.c:107:20: warning: 'lim' may be used
uninitialized in this function
I ran into a hang in a test in which the sender sends less data than the
receiver is expecting. For example, the following shows the receiver expecting
twice what the sender is sending.
Rank 0: MPI_Send(buf, BUFSIZE, MPI_INT, 1, 99, MPI_COMM_WORLD)
Rank 1: MPI_Recv(buf, BUFSIZE*2, MPI_INT, 0, 99, MPI_COMM_WORLD, &status)
No changes here.
>NVIDIA
>==
>rolfv:Rolf Vandevaart
>
George.
>
>On Aug 21, 2013, at 23:00 , svn-commit-mai...@open-mpi.org wrote:
>
>> Author: rolfv (Rolf Vandevaart)
>> Date: 2013-08-21 17:00:09 EDT (Wed, 21 Aug 2013) New Revision: 29055
>> URL: https://svn.open-mpi.org/trac/ompi/changeset/29055
>>
>> Log:
>> Fi
rg] On Behalf Of George
>Bosilca
>Sent: Friday, August 23, 2013 7:36 AM
>To: Open MPI Developers
>Subject: Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r29055 - in
>trunk/ompi/mca: btl btl/smcuda common/cuda pml/ob1
>
>Rolf,
>
>On Aug 22, 2013, at 19:24 , Rolf vandeVaart
The ompi/mca/rcache/rb component has been .ompi_ignored for almost 7 years.
Should we delete it?
>interested in implementing in the future (an intern or some PhD student).
>
>On Aug 23, 2013, at 21:53 , Rolf vandeVaart wrote:
>
>> Yes, I agree that the CUDA support is more intrusive and ends up in
>different areas. The problem is that the changes could not be simply isolated.
As mentioned in the weekly conference call, I am seeing some strange errors
when using the openib BTL. I have narrowed down the changeset that broke
things to the ORTE async code.
https://svn.open-mpi.org/trac/ompi/changeset/29058 (and
https://svn.open-mpi.org/trac/ompi/changeset/29061 which
something up in the OOB connect code
itself. I'll take a look and see if something leaps out at me - it seems to be
working fine on IU's odin cluster, which is the only IB-based system I can
access
On Sep 3, 2013, at 11:34 AM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
M, Ralph Castain
<r...@open-mpi.org> wrote:
Dang - I just finished running it on odin without a problem. Are you seeing
this with a debug or optimized build?
On Sep 3, 2013, at 12:16 PM, Rolf vandeVaart
<rvandeva...@nvidia.com> wrote:
Yes, it fails on the current trunk.
. I've tried up to np=16 without getting a single
hiccup.
Try a fresh checkout - let's make sure you don't have some old cruft laying
around.
On Sep 3, 2013, at 12:26 PM, Rolf vandeVaart
<rvandeva...@nvidia.com> wrote:
I am running a debug build. Here is my configur
Correction: That line below should be:
gmake run FILE=p2p_c
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart
Sent: Tuesday, September 03, 2013 4:50 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes
I just retried and I
: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart
Sent: Tuesday, September 03, 2013 4:52 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes
Correction: That line below should be:
gmake run FILE=p2p_c
From: devel [mailto:devel-boun
WHAT: Remove alignment code from ompi/mca/rcache/vma module
WHY: Because it is redundant and causing problems for memory pools that want
different alignment
WHERE: ompi/mca/rcache/vma/rcache_vma.c,
ompi/mca/mpool/grdma/mpool_grdma_module.c (Detailed changes attached)
WHEN: Tuesday, September 17,
Hi Max:
You say that that the function keeps "allocating memory in the pml free list."
How do you know that is happening?
Do you know which free list it is happening on? There are something like 8
free lists associated with the pml ob1 so it would be interesting to know which
one you observe growing.
In a private email, I had Max add some instrumentation so
we could see which list was growing. We now know it is the
mca_pml_base_send_requests list.
>-Original Message-
>From: Max Staufer [mailto:max.stau...@gmx.net]
>Sent: Friday, September 13, 2013 7:06 AM
>To: Rolf va
I will wait another week on this since I know a lot of folks were traveling.
Any input welcome.
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart
Sent: Tuesday, September 10, 2013 2:46 PM
To: de...@open-mpi.org
Subject: [OMPI devel] RFC: Remove alignment code from
WHAT: Add GPU Direct RDMA support to openib btl
WHY: Better latency for small GPU message transfers
WHERE: Several files, see ticket for list
WHEN: Friday, October 18, 2013 COB
More detail:
This RFC looks to make use of GPU Direct RDMA support that is coming in the
future in Mellanox libraries.
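The core of what GPU Direct RDMA enables can be sketched with the verbs API as below. This assumes the peer-memory support from the Mellanox stack mentioned above is installed so that ibv_reg_mr() accepts a device pointer; it is an illustration, not the proposed openib BTL code:

#include <cuda_runtime.h>
#include <infiniband/verbs.h>

/* Register GPU device memory directly with the HCA so the NIC can read and
 * write it without staging through a host buffer. */
struct ibv_mr *register_gpu_memory(struct ibv_pd *pd, size_t len)
{
    void *devptr = NULL;
    cudaMalloc(&devptr, len);

    /* With peer-memory support this pins and maps the GPU memory for the
     * HCA; without it, the registration simply fails. */
    return ibv_reg_mr(pd, devptr, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}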
Yes, that is from one of my CMRs. I always configure with -enable-picky but
that did not pick up this warning.
I will fix this in the trunk in the morning (watching the Red Sox right now :))
and then file CMR to bring over.
Rolf
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralp
I noticed that there were some CFLAGS that were no longer set when enabling
with --enable-picky for gcc. Specifically, -Wundef and -pedantic were no
longer set.
This is not a problem for Open MPI 1.7.
I believe this is happening because of some code in the
config/oshmem_configure_options.m4 file.
>-Original Message-
>From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Jeff Squyres
>(jsquyres)
>Sent: Thursday, October 31, 2013 4:12 PM
>To: Open MPI Developers
>Subject: Re: [OMPI devel] oshmem and CFLAGS removal
>
>On Oct 31, 2013, at 3:46 PM,
Hello Solibakke:
Let me try and reproduce with your configure options.
Rolf
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Solibakke Per
Bjarte
Sent: Thursday, November 07, 2013 8:40 AM
To: 'de...@open-mpi.org'
Subject: [OMPI devel] MPIRUN error message after ./configure and sudo m
use --enable-mca-dso...though I don't
know if that is the source of the problem.
On Nov 7, 2013, at 6:00 AM, Rolf vandeVaart
<rvandeva...@nvidia.com> wrote:
Hello Solibakke:
Let me try and reproduce with your configure options.
Rolf
From: devel [mailto:devel-boun...@open-
Let me know of any other issues you are seeing. Ralph fixed the issue with ob1
and we will move that into Open MPI 1.7.4.
Not sure why I never saw that issue. Will investigate some more.
>-Original Message-
>From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Jörg
>Bornschein
I believe I found a bug in the openib BTL and just want to see if folks agree with
this. When we are running on a NUMA node and we are bound to a CPU, we only
want to use the IB device that is closest to us. However, I observed that we
always used both devices regardless. I believe there is a bug
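A sketch of the kind of locality check I have in mind, using hwloc's verbs helper (assuming hwloc_ibv_get_device_cpuset() is available in the hwloc build; the actual openib BTL code is structured differently):

#include <hwloc.h>
#include <hwloc/openfabrics-verbs.h>

/* Return 1 if the IB device's locality overlaps the CPUs we are bound to */
int device_is_near_me(hwloc_topology_t topo, struct ibv_device *ibdev)
{
    int near = 1;   /* on any failure, fall back to using the device */
    hwloc_cpuset_t mine = hwloc_bitmap_alloc();
    hwloc_cpuset_t devcpus = hwloc_bitmap_alloc();

    if (0 == hwloc_get_cpubind(topo, mine, HWLOC_CPUBIND_PROCESS) &&
        0 == hwloc_ibv_get_device_cpuset(topo, ibdev, devcpus)) {
        near = hwloc_bitmap_intersects(mine, devcpus);
    }
    hwloc_bitmap_free(mine);
    hwloc_bitmap_free(devcpus);
    return near;
}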
Hi Ralph:
In my opinion, we still try to get to a stable 1.7.4. I think we can just keep
the bar high (as you said in the meeting) about what types of fixes need to get
into 1.7.4. I have been telling folks 1.7.4 would be ready "really soon" so
the idea of folding in 1.7.5 CMRs and delaying it
I am seeing this happening to me very intermittently. Looks like mpirun is
getting a SEGV. Is anyone else seeing this?
This is 1.7.4 built yesterday. (Note that I added some stuff to what is being
printed out so the message is slightly different than 1.7.4 output)
mpirun -np 6 -host drosse
, January 30, 2014 11:51 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] Intermittent mpirun crash?
Huh - not much info there, I'm afraid. I gather you didn't build this with
--enable-debug?
On Jan 30, 2014, at 8:26 AM, Rolf vandeVaart wrote:
> I am seeing this happeni
>fixes the problem, I did not investigate any further.
>
>Do you see a similar behavior?
>
> George.
>
>On Jan 30, 2014, at 17:26 , Rolf vandeVaart wrote:
>
>> I am seeing this happening to me very intermittently. Looks like mpirun is
>getting a SEGV. Is anyone el
gfaulted as
>well), but obviously wouldn't have anything to do with mpirun
>
>On Jan 30, 2014, at 9:29 AM, Rolf vandeVaart
>wrote:
>
>> I just retested with --mca mpi_leave_pinned 0 and that made no difference.
>I still see the mpirun crash.
>>
>>> -
I have seen this same issue although my core dump is a little bit different. I
am running with tcp,self. The first entry in the list of BTLs is garbage, but
then there is tcp and self in the list. Strange. This is my core dump. Line
208 in bml_r2.c is where I get the SEGV.
Program termina
I have tracked this down. There is a missing commit that affects
ompi_mpi_init.c causing it to initialize bml twice.
Ralph, can you apply r30310 to 1.7?
Thanks,
Rolf
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart
Sent: Monday, February 10, 2014 12:29 PM
To: Open
It could. I added that argument 4 years ago to support my failover work
with the BFO. It was a way for a BTL to pass some type of string back to the
PML, identifying which BTL it was, so that verbose output made it easier to
understand what was happening.
>-Original Message-
>From: devel [mailto:devel-b
WHAT: Add two new verbose outputs to BML layer
WHY: There are times that I really want to know which BTLs are being used.
These verbose outputs can help with that.
WHERE: ompi/mca/bml/r2/bml_r2.c
TIMEOUT: COB Friday, 7 March 2014
MORE DETAIL: I have run into some cases where I have added to a
I am still seeing the same issue where I get some type of segv unless I disable
the coll ml component. This may be an issue at my end, but just thought I
would double check that we are sure this is fixed.
Thanks,
Rolf
>-Original Message-
>From: devel [mailto:devel-boun...@open-mpi.org]
SVN
>-Original Message-
>From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Nathan
>Hjelm
>Sent: Wednesday, April 16, 2014 10:35 AM
>To: Open MPI Developers
>Subject: Re: [OMPI devel] 1-question developer poll
>
>* PGP Signed by an unknown key
>
>Git
>
>On Wed, Apr 16, 2014 at 10
I have seen errors when running the intel test suite using the openib BTL when
transferring derived datatypes. I do not see the error with sm or tcp BTLs.
The errors begin after this checkin.
https://svn.open-mpi.org/trac/ompi/changeset/31370
Timestamp: 04/11/14 16:06:56 (5 days ago)
Author: b
g set to 1. I would be
>interested in the output you get on your machine.
>
>George.
>
>
>On Apr 16, 2014, at 14:34 , Rolf vandeVaart wrote:
>
>> I have seen errors when running the intel test suite using the openib BTL
>when transferring derived datatypes. I do not s
This seems similar to what I reported on a different thread.
http://www.open-mpi.org/community/lists/devel/2014/05/14688.php
I need to try and reproduce again. Elena, what kind of cluster were you
running on?
Rolf
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Elena Elkina
Sent
OK. So, I investigated a little more. I only see the issue when I am running
with multiple ports enabled such that I have two openib BTLs instantiated. In
addition, large message RDMA has to be enabled. If those conditions are not
met, then I do not see the problem. For example:
FAILS:
_send_size 23” should always transfer wrong data, even when only one
single BTL is in play.
George.
On May 7, 2014, at 13:11 , Rolf vandeVaart
<rvandeva...@nvidia.com> wrote:
OK. So, I investigated a little more. I only see the issue when I am running
with multiple ports ena
Open MPI 1.6:
- Release was waiting on
https://svn.open-mpi.org/trac/ompi/ticket/3079 but during meeting we decided it
was not necessary. Therefore, Jeff will go ahead and roll Open MPI 1.6.6 RC1.
Open MPI 1.8:
- Several tickets have been applied. Some discussion about other
WHAT: Add some basic support so that reduction functions can support GPU
buffers.
All this patch does is move the GPU data into a host buffer before the
reduction call and move it back to GPU after the reduction call.
Changes have no effect if CUDA-aware support is not compiled in.
WHY: Users
The bfo PML is mostly a duplicate of the ob1 PML but with extra code to handle
failover when running with a cluster with multiple IB NICs. A few
observations.
1. Almost no one uses the bfo PML. I have kept it around just in case someone
thinks about failover again.
2. The code where you are s
NOTE: This is an update to the RFC after review and help from George
WHAT: Add some basic support so that reduction functions can support GPU
buffers. Create new coll module that is only compiled in when CUDA-aware
support is compiled in. This patch moves the GPU data into a host buffer
before the reduction call and moves it back to the GPU after the reduction call.
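Essentially the staging is the following, wrapped inside the new coll module (a simplified sketch for double/sum only, with hypothetical names):

#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>

int reduce_gpu(void *d_send, void *d_recv, int count, int root, MPI_Comm comm)
{
    int rank;
    size_t len = count * sizeof(double);
    void *h_send = malloc(len);
    void *h_recv = malloc(len);

    MPI_Comm_rank(comm, &rank);
    /* move the GPU data into a host buffer, reduce, move the result back */
    cudaMemcpy(h_send, d_send, len, cudaMemcpyDeviceToHost);
    MPI_Reduce(h_send, h_recv, count, MPI_DOUBLE, MPI_SUM, root, comm);
    if (rank == root) {
        cudaMemcpy(d_recv, h_recv, len, cudaMemcpyHostToDevice);
    }
    free(h_send);
    free(h_recv);
    return MPI_SUCCESS;
}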
I am still seeing problems with del_procs with openib. Do we believe
everything should be working? This is with the latest trunk (updated 1 hour
ago).
[rvandevaart@drossetti-ivy0 examples]$ mpirun --mca btl_openib_if_include
mlx5_0:1 -np 2 -host drossetti-ivy0,drossetti-ivy1 connectivity_cCon
Ralph:
I am seeing cases where mpirun seems to hang when one of the applications exits
with non-zero. For example, the intel test MPI_Cart_get_c will exit that way
if there are not enough processes to run the test. In most cases, mpirun seems
to return fine with the error code, but sometimes it hangs.
s
>we force the exclusive usage of the send protocol, with an unconventional
>fragment size.
>>>>
>>>> In other words using the following flags "--mca btl tcp,self --mca
>btl_tcp_flags 3 --mca btl_tcp_rndv_eager_limit 23 --mca btl_tcp_eager_limit
>23 --mca btl_tcp_max_send_s
On the trunk, I am seeing failures of the ibm tests iallgather and
iallgather_in_place. Is this a known issue?
$ mpirun --mca btl self,sm,tcp --mca coll ml,basic,libnbc --host
drossetti-ivy0,drossetti-ivy0,drossetti-ivy1,drossetti-ivy1 -np 4 iallgather
[**ERROR**]: MPI_COMM_WORLD rank 0, file i
I am seeing an interesting failure on trunk. intercomm_create, spawn, and
spawn_multiple from the IBM tests hang if I explicitly list the hostnames to
run on. For example:
Good:
$ mpirun -np 2 --mca btl self,sm,tcp spawn_multiple
Parent: 0 of 2, drossetti-ivy0.nvidia.com (0 in init)
Parent: 1
n isend to process 3
dpm_base_disconnect_init: error -12 in isend to process 3
dpm_base_disconnect_init: error -12 in isend to process 1
dpm_base_disconnect_init: error -12 in isend to process 3
[rhc@bend001 mpi]$
On Jun 6, 2014, at 11:26 AM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
Minutes of June 10, 2014 Open MPI Core Developer Meeting
1. Review 1.6 - Nothing new
2. Review 1.8 - Most things are doing fine. Still several tickets
awaiting review. If influx of bugs slows, then we will get 1.8.2 release
ready. Rolf was concerned about intermittent hangs, but
Hearing no response, I assume this is not a known issue so I submitted
https://svn.open-mpi.org/trac/ompi/ticket/4709
Nathan, is this something that you can look at?
Rolf
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart
Sent: Friday, June 06, 2014 1:55 PM
To: de
I have noticed that I am seeing some tests hang on the trunk. For example:
$ mpirun --mca btl_tcp_if_include eth0 --host drossetti-ivy0,drossetti-ivy1 -np
2 --mca pml ob1 --mca btl sm,tcp,self --mca coll_mdisable_allgather 1 --mca
btl_openib_warn_default_gid_prefix 0 send
It is not unusual for
the conversions in ob1.
>>
>> -Nathan
>>
>> On Mon, Jul 14, 2014 at 01:38:38PM -0700, Rolf vandeVaart wrote:
>> >I have noticed that I am seeing some tests hang on the trunk. For
>> >example:
>> >
>> >
>> >
>> >$
With the latest trunk (r32246) I am getting crashes while the program is
shutting down. I assume this is related to some of the changes George just
made. George, can you take a look when you get a chance?
Looks like everyone is getting the segv during shutdown (mpirun, orted, and
application)
On both 1.8 and trunk (as Ralph mentioned in meeting) we are seeing three tests
fail.
http://mtt.open-mpi.org/index.php?do_redir=2205
Ibm/onesided/win_allocate_shared
Ibm/onesided/win_allocated_shared_mpifh
Ibm/onesided/win_allocated_shared_usempi
Is there a ticket that covers these failures?
T
Unless I am missing something obvious, I will update the test tomorrow and add
a comm split to ensure MPI_Win_allocate_shared is called from a single-node
communicator, and skip the test if this is impossible.
Cheers,
Gilles
Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
On both 1.8 and trunk (as Ralph m
My guess is that no one is testing the bfo PML. However, I would have expected
it to still work with Open MPI 1.6.5. From your description, it works for
smaller messages but fails with larger ones? So, if you just send smaller
messages and pull the cable, things work correctly?
One idea is t
WHAT: Bump up the minimum sm pool size to 128K from 64K.
WHY: When running OSU benchmark on 2 nodes and utilizing a larger
btl_smcuda_max_send_size, we can run into the case where the free list cannot
grow. This is not a common case, but it is something that folks sometimes
experiment with.
Yes (my mistake)
Sent from my iPhone
On Jul 26, 2014, at 3:19 PM, "George Bosilca"
<bosi...@icl.utk.edu> wrote:
We are talking MB, not KB, aren't we?
George.
On Thu, Jul 24, 2014 at 2:57 PM, Rolf vandeVaart
<rvandeva...@nvidia.com> wrote:
WHAT:
On 04/13/09 09:40, George Bosilca wrote:
On Apr 12, 2009, at 21:58 , Timothy Hayes wrote:
I was wondering if someone might be able to shed some light on a
couple of questions I have.
When you receive a fragment/base_descriptor in a BTL module, is the
raw data allowed to be fragmented when y