Hi Ralph:
Just an FYI that the following change broke the use of --host on master last
night.
[rvandevaart@drossetti-ivy4 ompi-master-rolfv]$ git bisect bad
169c44258d5c98870872b77166390d4f9a81105e is the first bad commit
commit 169c44258d5c98870872b77166390d4f9a81105e
Author: Ralph Castain
Li
Hi there,
Rolf vandeVaart (rvandeva...@nvidia.com) invites you to participate in
the Doodle poll "Open MPI Weekly Meetings."
Should we have Open MPI weekly meetings during SC15 and Thanksgiving
week? Let me know if you want to attend one or both of them.
Participate now
https://doodl
The bfo was my creation many years ago. Can we keep it around for a little
longer? If we blow it away, then we should probably clean up all the code I
also have in the openib BTL for supporting failover. There is also some
configure code that would have to go as well.
Rolf
>-Original Me
There was a problem reported on the User's list about Open MPI always picking
one Mellanox card when there were two in the machine.
http://www.open-mpi.org/community/lists/users/2015/08/27507.php
We dug a little deeper and I think this has to do with how hwloc is figuring
out where one of the
I just tested this against the PGI 15.7 compiler and I see the same thing. It
appears that we get this error on some of the files called out in
ompi/mpi/fortran/use-mpi-f08/mpi-f-interfaces-bind.h as not having an
"easy-peasy" solution. All the other files compile just fine. I checked the
list
There have been two reports on the user list about memory leaks. I have
reproduced this leak with LAMMPS. Note that this has nothing to do with
CUDA-aware features. The steps that Stefan has provided make it easy to
reproduce.
Here are some more specific steps to reproduce derived from Stefa
A few observations.
1. The smcuda btl is only built when --with-cuda is part of the configure line
so folks who do not do this will not even have this btl and will never run into
this issue.
2. The priority of the smcuda btl has been higher since Open MPI 1.7.5 (March
2014). The idea is that if
I am seeing it on my cluster too.
[ivy4:27085] mca_base_component_repository_open: unable to open mca_btl_usnic:
/ivylogin/home/rvandevaart/ompi-repos/ompi-master-uvm/64-dbg/lib/libmca_common_libfabric.so.0:
undefined symbol: psmx_eq_open (ignored)
[ivy4:27085] mca_base_component_repository
Hi Gilles:
Is your failure similar to this ticket?
https://github.com/open-mpi/ompi/issues/393
Rolf
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Gilles Gouaillardet
Sent: Monday, April 20, 2015 9:12 AM
To: Open MPI Developers
Subject: [OMPI devel] c_accumulate
Folks,
i (sometimes
I ended up looking at this and it was a bug in this set of tests. Needed to
check for MPI_COMM_NULL in a few places.
This has been fixed.
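The typical pattern for such a fix looks like this; a minimal sketch using hypothetical test code (the actual ibm tests may create their communicators differently):

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, one = 1, sum;
    MPI_Comm newcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Odd ranks opt out of the new communicator. */
    int color = (rank % 2 == 0) ? 0 : MPI_UNDEFINED;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &newcomm);

    /* Guard: excluded ranks get MPI_COMM_NULL and must skip the collective. */
    if (newcomm != MPI_COMM_NULL) {
        MPI_Allreduce(&one, &sum, 1, MPI_INT, MPI_SUM, newcomm);
        MPI_Comm_free(&newcomm);
    }

    MPI_Finalize();
    return 0;
}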
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart
Sent: Thursday, April 02, 2015 10:10 AM
To: de...@open-mpi.org
Subject: [OMPI
I just recently bumped some tests from np=4 to np=6. I am now seeing
failures on the following tests in the ibm/collective directory.
ineighbor_allgather, ineighbor_allgatherv, ineighbor_alltoall,
ineighbor_alltoallv, ineighbor_alltoallw
neighbor_allgather, neighbor_allgatherv, neighbo
Greetings:
I am now seeing the following message for all my calls to mpirun on ompi
master. This started with last night's MTT run. Is this intentional?
[rvandevaart@ivy0 ~]$ mpirun -np 1 hostname
--
WARNING: a request wa
This message is mostly for Nathan, but figured I would go with the wider
distribution. I have noticed some different behaviour that I assume started
with this change.
https://github.com/open-mpi/ompi/commit/4bf7a207e90997e75ba1c60d9d191d9d96402d04
I am noticing that the openib BTL will also b
I think this has already been fixed by Ralph this morning. I had observed the
same issue but it is now gone.
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Brice Goglin
Sent: Wednesday, December 17, 2014 3:53 PM
To: de...@open-mpi.org
Subject: Re: [OMPI devel] Solaris/x86-64 SEGV with
may be related to change set 32659.
If you back this change out, do the tests pass?
Howard
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart
Sent: Monday, September 15, 2014 8:55 AM
To: de...@open-mpi.org
Subject: [OMPI devel] coll ml
I wonder if anyone else is seeing this failure. Not sure when this started but
it is only on the trunk. Here is a link to my failures as well as an example
below that. There are a variety of nonblocking collectives failing like this.
http://mtt.open-mpi.org/index.php?do_redir=2208
[rvandevaar
I noticed MTT failures from last night and then reproduced this morning on 1.8
branch. Looks like maybe a double free. I assume it is related to fixes for
aborting programs. Maybe related to
https://svn.open-mpi.org/trac/ompi/changeset/32508 but not sure.
[rvandevaart@drossetti-ivy0 environme
WHAT: Change default behavior in openib to not call ibv_fork_init() even if
available.
WHY: There are some strange interactions with ummunotify that cause errors. In
addition, see the additional points below.
WHEN: After next weekly meeting, August 5, 2014
DETAILS: This change will just be a co
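For background, ibv_fork_init() takes no arguments and returns 0 on success, and it only has an effect if it is called before any other verbs resources are created. The sketch below shows the kind of guarded call being discussed; the flag is a hypothetical stand-in for whatever MCA parameter the RFC settles on:

#include <infiniband/verbs.h>
#include <stdio.h>

/* Hypothetical flag standing in for an MCA parameter; the real parameter
 * name and default are whatever the RFC decides. */
static int want_fork_support = 0;

int main(void)
{
    if (want_fork_support) {
        /* Must be called before any other verbs routine to be effective. */
        int rc = ibv_fork_init();
        if (rc != 0) {
            fprintf(stderr, "ibv_fork_init() failed: %d\n", rc);
            return 1;
        }
    }
    /* Proposed new default: skip ibv_fork_init() entirely. */
    return 0;
}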
mit it and
drop a note to #4815
(I am afk until tomorrow)
Cheers,
Gilles
Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
Just an FYI that my trunk version (r32355) does not work at all anymore if I do
not include "--mca coll ^ml".Here is a stack trace from the ibm/pt
Just an FYI that my trunk version (r32355) does not work at all anymore if I do
not include "--mca coll ^ml".Here is a stack trace from the ibm/pt2pt/send
test running on a single node.
(gdb) where
#0 0x7f6c0d1321d0 in ?? ()
#1
#2 0x7f6c183abd52 in orte_util_compare_name_fie
Yes (my mistake)
Sent from my iPhone
On Jul 26, 2014, at 3:19 PM, "George Bosilca"
<bosi...@icl.utk.edu> wrote:
We are talking MB not KB, aren't we?
George.
On Thu, Jul 24, 2014 at 2:57 PM, Rolf vandeVaart
<rvandeva...@nvidia.com> wrote:
WHAT:
WHAT: Bump up the minimum sm pool size to 128K from 64K.
WHY: When running OSU benchmark on 2 nodes and utilizing a larger
btl_smcuda_max_send_size, we can run into the case where the free list cannot
grow. This is not a common case, but it is something that folks sometimes
experiment with.
My guess is that no one is testing the bfo PML. However, I would have expected
it to still work with Open MPI 1.6.5. From your description, it works for
smaller messages but fails with larger ones? So, if you just send smaller
messages and pull the cable, things work correctly?
One idea is t
ssing something obvious, I will update the test tomorrow and add
a comm split to ensure MPI_Win_allocate_shared is called from a single-node
communicator and skip the test if this is impossible
Cheers,
Gilles
Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
On both 1.8 and trunk (as Ralph m
On both 1.8 and trunk (as Ralph mentioned in meeting) we are seeing three tests
fail.
http://mtt.open-mpi.org/index.php?do_redir=2205
ibm/onesided/win_allocate_shared
ibm/onesided/win_allocate_shared_mpifh
ibm/onesided/win_allocate_shared_usempi
Is there a ticket that covers these failures?
T
With the latest trunk (r32246) I am getting crashes while the program is
shutting down. I assume this is related to some of the changes George just
made. George, can you take a look when you get a chance?
Looks like everyone is getting the segv during shutdown (mpirun, orted, and
application)
the conversions in ob1.
>>
>> -Nathan
>>
>> On Mon, Jul 14, 2014 at 01:38:38PM -0700, Rolf vandeVaart wrote:
>> >I have noticed that I am seeing some tests hang on the trunk. For
>> >example:
>> >
>> >
>> >
>> >$
I have noticed that I am seeing some tests hang on the trunk. For example:
$ mpirun --mca btl_tcp_if_include eth0 --host drossetti-ivy0,drossetti-ivy1 -np
2 --mca pml ob1 --mca btl sm,tcp,self --mca coll_ml_disable_allgather 1 --mca
btl_openib_warn_default_gid_prefix 0 send
It is not unusual for
Hearing no response, I assume this is not a known issue so I submitted
https://svn.open-mpi.org/trac/ompi/ticket/4709
Nathan, is this something that you can look at?
Rolf
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart
Sent: Friday, June 06, 2014 1:55 PM
To: de
Minutes of June 10, 2014 Open MPI Core Developer Meeting
1. Review 1.6 - Nothing new
2. Review 1.8 - Most things are doing fine. Still several tickets
awaiting review. If influx of bugs slows, then we will get 1.8.2 release
ready. Rolf was concerned about intermittent hangs, but
n isend to process 3
dpm_base_disconnect_init: error -12 in isend to process 3
dpm_base_disconnect_init: error -12 in isend to process 1
dpm_base_disconnect_init: error -12 in isend to process 3
[rhc@bend001 mpi]$
On Jun 6, 2014, at 11:26 AM, Rolf vandeVaart
mailto:rvandeva...@nvidia.co
I am seeing an interesting failure on trunk. intercomm_create, spawn, and
spawn_multiple from the IBM tests hang if I explicitly list the hostnames to
run on. For example:
Good:
$ mpirun -np 2 --mca btl self,sm,tcp spawn_multiple
Parent: 0 of 2, drossetti-ivy0.nvidia.com (0 in init)
Parent: 1
On the trunk, I am seeing failures of the ibm tests iallgather and
iallgather_in_place. Is this a known issue?
$ mpirun --mca btl self,sm,tcp --mca coll ml,basic,libnbc --host
drossetti-ivy0,drossetti-ivy0,drossetti-ivy1,drossetti-ivy1 -np 4 iallgather
[**ERROR**]: MPI_COMM_WORLD rank 0, file i
s
>we force the exclusive usage of the send protocol, with an unconventional
>fragment size.
>>>>
>>>> In other words using the following flags "--mca btl tcp,self --mca
>btl_tcp_flags 3 --mca btl_tcp_rndv_eager_limit 23 --mca btl_tcp_eager_limit
>23 --mca btl_tcp_max_send_s
Ralph:
I am seeing cases where mpirun seems to hang when one of the applications exits
with non-zero. For example, the intel test MPI_Cart_get_c will exit that way
if there are not enough processes to run the test. In most cases, mpirun seems
to return fine with the error code, but sometimes i
I am still seeing problems with del_procs with openib. Do we believe
everything should be working? This is with the latest trunk (updated 1 hour
ago).
[rvandevaart@drossetti-ivy0 examples]$ mpirun --mca btl_openib_if_include
mlx5_0:1 -np 2 -host drossetti-ivy0,drossetti-ivy1 connectivity_cCon
NOTE: This is an update to the RFC after review and help from George
WHAT: Add some basic support so that reduction functions can support GPU
buffers. Create new coll module that is only compiled in when CUDA-aware
support is compiled in. This patch moves the GPU data into a host buffer
befor
The bfo PML is mostly a duplicate of the ob1 PML but with extra code to handle
failover when running with a cluster with multiple IB NICs. A few
observations.
1. Almost no one uses the bfo PML. I have kept it around just in case someone
thinks about failover again.
2. The code where you are s
WHAT: Add some basic support so that reduction functions can support GPU
buffers.
All this patch does is move the GPU data into a host buffer before the
reduction call and move it back to GPU after the reduction call.
Changes have no effect if CUDA-aware support is not compiled in.
WHY: Users
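The data movement itself is simple; a rough sketch of the idea using plain CUDA runtime calls (the real code sits inside the new coll module rather than application code, and the function below is only illustrative):

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

/* Reduce 'count' doubles that live in GPU memory by staging through host
 * buffers: copy device->host, run the normal reduction, copy the result
 * back.  Error checking omitted for brevity. */
static int reduce_gpu_buffer(const double *d_send, double *d_recv,
                             int count, MPI_Comm comm)
{
    size_t bytes = (size_t)count * sizeof(double);
    double *h_send = malloc(bytes);
    double *h_recv = malloc(bytes);

    cudaMemcpy(h_send, d_send, bytes, cudaMemcpyDeviceToHost);
    int rc = MPI_Allreduce(h_send, h_recv, count, MPI_DOUBLE, MPI_SUM, comm);
    cudaMemcpy(d_recv, h_recv, bytes, cudaMemcpyHostToDevice);

    free(h_send);
    free(h_recv);
    return rc;
}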
Open MPI 1.6:
- Release was waiting on
https://svn.open-mpi.org/trac/ompi/ticket/3079 but during meeting we decided it
was not necessary. Therefore, Jeff will go ahead and roll Open MPI 1.6.6 RC1.
Open MPI 1.8:
- Several tickets have been applied. Some discussion about other
_send_size 23” should always transfer wrong data, even when only one
single BTL is in play.
George.
On May 7, 2014, at 13:11 , Rolf vandeVaart
<rvandeva...@nvidia.com> wrote:
OK. So, I investigated a little more. I only see the issue when I am running
with multiple ports ena
OK. So, I investigated a little more. I only see the issue when I am running
with multiple ports enabled such that I have two openib BTLs instantiated. In
addition, large message RDMA has to be enabled. If those conditions are not
met, then I do not see the problem. For example:
FAILS:
This seems similar to what I reported on a different thread.
http://www.open-mpi.org/community/lists/devel/2014/05/14688.php
I need to try and reproduce again. Elena, what kind of cluster were you
running on?
Rolf
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Elena Elkina
Sent
g set to 1. I would be
>interested in the output you get on your machine.
>
>George.
>
>
>On Apr 16, 2014, at 14:34 , Rolf vandeVaart wrote:
>
>> I have seen errors when running the intel test suite using the openib BTL
>when transferring derived datatypes. I do not s
I have seen errors when running the intel test suite using the openib BTL when
transferring derived datatypes. I do not see the error with sm or tcp BTLs.
The errors begin after this checkin.
https://svn.open-mpi.org/trac/ompi/changeset/31370
Timestamp: 04/11/14 16:06:56 (5 days ago)
Author: b
SVN
>-Original Message-
>From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Nathan
>Hjelm
>Sent: Wednesday, April 16, 2014 10:35 AM
>To: Open MPI Developers
>Subject: Re: [OMPI devel] 1-question developer poll
>
>* PGP Signed by an unknown key
>
>Git
>
>On Wed, Apr 16, 2014 at 10
I am still seeing the same issue where I get some type of segv unless I disable
the coll ml component. This may be an issue at my end, but just thought I
would double check that we are sure this is fixed.
Thanks,
Rolf
>-Original Message-
>From: devel [mailto:devel-boun...@open-mpi.org]
WHAT: Add two new verbose outputs to BML layer
WHY: There are times that I really want to know which BTLs are being used.
These verbose outputs can help with that.
WHERE: ompi/mca/bml/r2/bml_r2.c
TIMEOUT: COB Friday, 7 March 2014
MORE DETAIL: I have run into some cases where I have added to a
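Mechanically these would just be opal_output_verbose() calls keyed off the BML verbosity level; something along these lines, where the output id and message text are illustrative rather than the committed code:

#include "opal/util/output.h"

/* Illustrative only: 'bml_output_id' stands in for whatever output stream
 * the BML framework registers; levels and wording may differ in the commit. */
static int bml_output_id = -1;

static void report_btl_choice(const char *peer, const char *btl_name)
{
    opal_output_verbose(5, bml_output_id,
                        "bml: using btl %s for peer %s", btl_name, peer);
}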
It could. I added that argument 4 years ago to support my failover work
with the BFO. It was a way for a BTL to pass some type of string back to the
PML identifying itself, so the verbose output could show what was
happening.
>-Original Message-
>From: devel [mailto:devel-b
I have tracked this down. There is a missing commit that affects
ompi_mpi_init.c causing it to initialize bml twice.
Ralph, can you apply r30310 to 1.7?
Thanks,
Rolf
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart
Sent: Monday, February 10, 2014 12:29 PM
To: Open
I have seen this same issue although my core dump is a little bit different. I
am running with tcp,self. The first entry in the list of BTLs is garbage, but
then there is tcp and self in the list. Strange. This is my core dump. Line
208 in bml_r2.c is where I get the SEGV.
Program termina
gfaulted as
>well), but obviously wouldn't have anything to do with mpirun
>
>On Jan 30, 2014, at 9:29 AM, Rolf vandeVaart
>wrote:
>
>> I just retested with --mca mpi_leave_pinned 0 and that made no difference.
>I still see the mpirun crash.
>>
>>> -
>fixes the problem, I did not investigate any further.
>
>Do you see a similar behavior?
>
> George.
>
>On Jan 30, 2014, at 17:26 , Rolf vandeVaart wrote:
>
>> I am seeing this happening to me very intermittently. Looks like mpirun is
>getting a SEGV. Is anyone el
, January 30, 2014 11:51 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] Intermittent mpirun crash?
Huh - not much info there, I'm afraid. I gather you didn't build this with
--enable-debug?
On Jan 30, 2014, at 8:26 AM, Rolf vandeVaart wrote:
> I am seeing this happeni
I am seeing this happening to me very intermittently. Looks like mpirun is
getting a SEGV. Is anyone else seeing this?
This is 1.7.4 built yesterday. (Note that I added some stuff to what is being
printed out so the message is slightly different than 1.7.4 output)
mpirun -np 6 -host
drosse
Hi Ralph:
In my opinion, we should still try to get to a stable 1.7.4. I think we can just keep
the bar high (as you said in the meeting) about what types of fixes need to get
into 1.7.4. I have been telling folks 1.7.4 would be ready "really soon" so
the idea of folding in 1.7.5 CMRs and delaying it
I believe I found a bug in openib BTL and just want to see if folks agree with
this. When we are running on a NUMA node and we are bound to a CPU, we only
want to use the IB device that is closest to us. However, I observed that we
always used both devices regardless. I believe there is a bug
Let me know of any other issues you are seeing. Ralph fixed the issue with ob1
and we will move that into Open MPI 1.7.4.
Not sure why I never saw that issue. Will investigate some more.
>-Original Message-
>From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Jörg
>Bornschein
use --enable-mca-dso...though I don't
know if that is the source of the problem.
On Nov 7, 2013, at 6:00 AM, Rolf vandeVaart
<rvandeva...@nvidia.com> wrote:
Hello Solibakke:
Let me try and reproduce with your configure options.
Rolf
From: devel [mailto:devel-boun...@open-
Hello Solibakke:
Let me try and reproduce with your configure options.
Rolf
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Solibakke Per
Bjarte
Sent: Thursday, November 07, 2013 8:40 AM
To: 'de...@open-mpi.org'
Subject: [OMPI devel] MPIRUN error message after ./configure and sudo m
>-Original Message-
>From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Jeff Squyres
>(jsquyres)
>Sent: Thursday, October 31, 2013 4:12 PM
>To: Open MPI Developers
>Subject: Re: [OMPI devel] oshmem and CFLAGS removal
>
>On Oct 31, 2013, at 3:46 PM,
I noticed that there were some CFLAGS that were no longer set when enabling
with --enable-picky for gcc. Specifically, -Wundef and -pedantic were no
longer set.
This is not a problem for Open MPI 1.7.
I believe this is happening because of some code in the
config/oshmem_configure_options.m4 f
Yes, that is from one of my CMRs. I always configure with --enable-picky but
that did not pick up this warning.
I will fix this in the trunk in the morning (watching the Red Sox right now :))
and then file CMR to bring over.
Rolf
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralp
WHAT: Add GPU Direct RDMA support to openib btl
WHY: Better latency for small GPU message transfers
WHERE: Several files, see ticket for list
WHEN: Friday, October 18, 2013 COB
More detail:
This RFC looks to make use of GPU Direct RDMA support that is coming in the
future in Mellanox libraries.
I will wait another week on this since I know a lot of folks were traveling.
Any input welcome.
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart
Sent: Tuesday, September 10, 2013 2:46 PM
To: de...@open-mpi.org
Subject: [OMPI devel] RFC: Remove alignment code from
a private email, I had Max add some instrumentation so
we could see which list was growing. We now know it is the
mca_pml_base_send_requests list.
>-Original Message-
>From: Max Staufer [mailto:max.stau...@gmx.net]
>Sent: Friday, September 13, 2013 7:06 AM
>To: Rolf va
Hi Max:
You say that the function keeps "allocating memory in the pml free list."
How do you know that is happening?
Do you know which free list it is happening on? There are something like 8
free lists associated with the pml ob1 so it would be interesting to know which
one you observe
WHAT: Remove alignment code from ompi/mca/rcache/vma module
WHY: Because it is redundant and causing problems for memory pools that want
different alignment
WHERE: ompi/mca/rcache/vma/rcache_vma.c,
ompi/mca/mpool/grdma/mpool_grdma_module.c (Detailed changes attached)
WHEN: Tuesday, September 17,
: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart
Sent: Tuesday, September 03, 2013 4:52 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes
Correction: That line below should be:
gmake run FILE=p2p_c
From: devel [mailto:devel-boun
Correction: That line below should be:
gmake run FILE=p2p_c
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart
Sent: Tuesday, September 03, 2013 4:50 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes
I just retried and I
. I've tried up to np=16 without getting a single
hiccup.
Try a fresh checkout - let's make sure you don't have some old cruft laying
around.
On Sep 3, 2013, at 12:26 PM, Rolf vandeVaart
<rvandeva...@nvidia.com> wrote:
I am running a debug build. Here is my configur
M, Ralph Castain
<r...@open-mpi.org> wrote:
Dang - I just finished running it on odin without a problem. Are you seeing
this with a debug or optimized build?
On Sep 3, 2013, at 12:16 PM, Rolf vandeVaart
<rvandeva...@nvidia.com> wrote:
Yes, it fails on the current t
something up in the OOB connect code
itself. I'll take a look and see if something leaps out at me - it seems to be
working fine on IU's odin cluster, which is the only IB-based system I can
access
On Sep 3, 2013, at 11:34 AM, Rolf vandeVaart
<rvandeva...@nvidia.com> wr
As mentioned in the weekly conference call, I am seeing some strange errors
when using the openib BTL. I have narrowed down the changeset that broke
things to the ORTE async code.
https://svn.open-mpi.org/trac/ompi/changeset/29058 (and
https://svn.open-mpi.org/trac/ompi/changeset/29061 which
>interested in implementing in the future (an intern or some PhD student).
>
>On Aug 23, 2013, at 21:53 , Rolf vandeVaart wrote:
>
>> Yes, I agree that the CUDA support is more intrusive and ends up in
>different areas. The problem is that the changes could not be simply isol
The ompi/mca/rcache/rb component has been .ompi_ignored for almost 7 years.
Should we delete it?
rg] On Behalf Of George
>Bosilca
>Sent: Friday, August 23, 2013 7:36 AM
>To: Open MPI Developers
>Subject: Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r29055 - in
>trunk/ompi/mca: btl btl/smcuda common/cuda pml/ob1
>
>Rolf,
>
>On Aug 22, 2013, at 19:24 , Rolf vandeVaart
George.
>
>On Aug 21, 2013, at 23:00 , svn-commit-mai...@open-mpi.org wrote:
>
>> Author: rolfv (Rolf Vandevaart)
>> Date: 2013-08-21 17:00:09 EDT (Wed, 21 Aug 2013) New Revision: 29055
>> URL: https://svn.open-mpi.org/trac/ompi/changeset/29055
>>
>> Log:
>> Fi
No changes here.
>NVIDIA
>==
>rolfv:Rolf Vandevaart
>
I ran into a hang in a test in which the sender sends less data than the
receiver is expecting. For example, the following shows the receiver expecting
twice what the sender is sending.
Rank 0: MPI_Send(buf, BUFSIZE, MPI_INT, 1, 99, MPI_COMM_WORLD)
Rank 1: MPI_Recv(buf, BUFSIZE*2, MPI_INT, 0,
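A self-contained version of that mismatch for anyone who wants to try it (BUFSIZE is arbitrary here; posting a receive count larger than the incoming message is legal MPI, so this must not hang):

#include <mpi.h>
#include <stdio.h>

#define BUFSIZE 1048576   /* large enough to go past the eager limit */

int main(int argc, char **argv)
{
    static int buf[BUFSIZE * 2];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Sender provides BUFSIZE ints... */
        MPI_Send(buf, BUFSIZE, MPI_INT, 1, 99, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* ...receiver is willing to accept up to twice that. */
        MPI_Status status;
        int received;
        MPI_Recv(buf, BUFSIZE * 2, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_INT, &received);
        printf("received %d ints\n", received);
    }

    MPI_Finalize();
    return 0;
}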
I have noticed several warnings while building the trunk. Feel free to fix
anything that you are familiar with.
CC sys_limits.lo
../../../opal/util/sys_limits.c: In function 'opal_util_init_sys_limits':
../../../opal/util/sys_limits.c:107:20: warning: 'lim' may be used
uninitialized in t
31, 2013 11:51 AM
>To: Open MPI Developers
>Subject: Re: [OMPI devel] mpirun -host does not work from r27879 and
>forward on trunk
>
>Yes - no hostfile and no RM allocation, just -host.
>
>What is your setup?
>
>On Jan 31, 2013, at 8:44 AM, Rolf vandeVaart
>wrote:
>
, Ralph Castain wrote:
>
>> Ummm...that was fixed a long time ago. You might try a later version.
>>
>> Or are you saying the head of the trunk doesn't work too?
>>
>> On Jan 31, 2013, at 7:31 AM, Rolf vandeVaart
>wrote:
>>
>>> I have stum
I have stumbled into a problem with the -host argument. This problem appears
to be introduced with changeset r27879 on 1/19/2013 by rhc.
With r27877, things work:
[rolf@node]$ which mpirun
/home/rolf/ompi-trunk-r27877/64/bin/mpirun
[rolf@node]$ mpirun -np 2 -host c0-0,c0-3 hostname
c0-3
c0-0
Thanks for this report. I will look into this. Can you tell me what your
mpirun command looked like and do you know what transport you are running over?
Specifically, is this on a single node or multiple nodes?
Rolf
From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf
[I sent this out in June, but did not commit it. So resending. Timeout of Jan
5, 2012. Note that this does not use the GPU Direct RDMA]
WHAT: Add support for doing asynchronous copies of GPU memory with larger
messages.
WHY: Improve performance for sending/receiving of larger GPU messages over
37 PM
>To: Rolf vandeVaart
>Cc: de...@open-mpi.org
>Subject: Re: OpenMPI CUDA 5 readiness?
>
>CUDA 5 basically changes char* to void* in some functions. Attached is a small
>patch which changes prototypes, depending on used CUDA version. Tested
>with CUDA 5 preview and 4.2.
>
>
>-Original Message-
>From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org]
>On Behalf Of Ralph Castain
>Sent: Monday, July 30, 2012 9:29 AM
>To: Open MPI Developers
>Subject: Re: [OMPI devel] The hostfile option
>
>
>On Jul 30, 2012, at 2:37 AM, George Bosilca wrote:
>
>> I t
Adding a timeout to this RFC.
TIMEOUT: July 17, 2012
rvandeva...@nvidia.com
781-275-5358
-Original Message-
From: Rolf vandeVaart
Sent: Wednesday, June 27, 2012 6:13 PM
To: de...@open-mpi.org
Subject: RFC: add asynchronous copies for large GPU buffers
WHAT: Add support for doing
PU
>buffers
>
>Can you make your repository public or add me to the access list?
>
>-Nathan
>
>On Wed, Jun 27, 2012 at 03:12:34PM -0700, Rolf vandeVaart wrote:
>> WHAT: Add support for doing asynchronous copies of GPU memory with
>larger messages.
>> WHY: I
WHAT: Add support for doing asynchronous copies of GPU memory with larger
messages.
WHY: Improve performance for sending/receiving of larger GPU messages over IB
WHERE: ob1, openib, and convertor code. All is protected by compiler directives
so no effect on non-CUDA builds.
REFEREN
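The core of the change is pipelining device-to-host copies with cudaMemcpyAsync on a dedicated stream and progressing them via events; a stripped-down sketch of that pattern (illustrative only, the real code lives in the ob1/openib/convertor path):

#include <cuda_runtime.h>

/* Start an asynchronous device->host copy of one pipeline chunk and record
 * an event behind it.  The caller polls the event (cudaEventQuery) from the
 * progress loop and hands the chunk to the BTL once the copy has finished,
 * so the copy of chunk i+1 overlaps the send of chunk i. */
static void stage_chunk_async(void *host_dst, const void *dev_src, size_t len,
                              cudaStream_t stream, cudaEvent_t done)
{
    cudaMemcpyAsync(host_dst, dev_src, len, cudaMemcpyDeviceToHost, stream);
    cudaEventRecord(done, stream);
}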
Hi Nathan:
I downloaded and tried it out. There were a few issues that I had to work
through, but finally got things working.
Can you apply this patch to your changes prior to checking things in?
I also would suggest configuring with --enable-picky as there are something
like 10 warnings genera
After doing a fresh checkout of the trunk, and then running autogen, I see this:
M opal/mca/event/libevent2019/libevent/Makefile.in
M opal/mca/event/libevent2019/libevent/depcomp
M opal/mca/event/libevent2019/libevent/include/Makefile.in
M opal/mca/event/libevent2019/libeve
Here is my explanation. The call to MCA_BTL_TCP_FRAG_ALLOC_EAGER or
MCA_BTL_TCP_FRAG_ALLOC_MAX allocates a chunk of memory that has space for both
the fragment as well as any payload. So, when we do the frag+1, we are setting
the pointer in the frag to point where the payload of the message liv
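In code terms this is the usual header-plus-payload allocation trick; a simplified stand-in for the TCP fragment allocation (the real structures in the tcp BTL carry many more fields):

#include <stdlib.h>

/* Simplified stand-in for mca_btl_tcp_frag_t: the allocator hands back one
 * chunk large enough for the descriptor plus the payload, and "frag + 1"
 * is simply the first byte past the descriptor, i.e. the payload region. */
typedef struct {
    size_t size;      /* payload bytes available after the descriptor */
} frag_t;

static frag_t *frag_alloc(size_t payload_size)
{
    frag_t *frag = malloc(sizeof(frag_t) + payload_size);
    frag->size = payload_size;
    return frag;
}

static void *frag_payload(frag_t *frag)
{
    return (void *)(frag + 1);   /* payload lives immediately after the frag */
}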
I am running a simple test and using the -bind-to-core or -bind-to-socket
options. I think the CPU binding is working fine, but I see these warnings
about not being able to bind to memory. Is this expected? This is trunk code
(266128)
[dt]$ mpirun --report-bindings -np 2 -bind-to-core conne
[Comment at bottom]
>-Original Message-
>From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org]
>On Behalf Of Nathan Hjelm
>Sent: Friday, March 09, 2012 2:23 PM
>To: Open MPI Developers
>Subject: Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
>
>
>
>On Fri, 9 Mar 2012, J
Hi Jeff:
It is set in opal/config/opal_configure_options.m4
>-Original Message-
>From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org]
>On Behalf Of Jeffrey Squyres
>Sent: Friday, February 24, 2012 6:07 AM
>To: de...@open-mpi.org
>Subject: Re: [OMPI devel] [OMPI svn-full]
I think I am OK with this.
Alternatively, you could have done something like what is done in the TCP BTL, where
the payload and header are added together for the frag size?
To state more clearly, I was trying to say you could do something similar to
what is done at line 1015 in btl_tcp_component.c a
There are several things going on here that make their library perform better.
With respect to inter-node performance, both MVAPICH2 and Open MPI copy the GPU
memory into host memory first. However, they are using special host buffers
and a code path that allows them to copy the data async
el with GPUDirect support
* Use the MLNX OFED stack with GPUDirect support
* Install the CUDA developer driver
Does using CUDA >= 4.0 make one of the above steps redundant?
I.e., RHEL or different kernel or MLNX OFED stack with GPUDirect support is
not needed any more?
Sebastian.
Rolf