Re: [OMPI devel] Parent terminates when child crashes/terminates (without finalizing)

2010-12-18 Thread Ken Lloyd
Nick Maclaren, Yes, this is a hard problem. It is not endemic to OpenMPI, however. It hints at distributed memory/process/thread issues, whether within the various OSs or external to them, across many solution spaces. Jeff Squyres' statement that "flexible dynamic processing is not

Re: [OMPI devel] Minor OMPI SVN configuration change

2011-02-17 Thread Ken Lloyd
Makes sense to me. On Thu, 2011-02-17 at 08:49 -0700, Barrett, Brian W wrote: > Why did "we" make this change? It was originally this way, and we changed it > to the no-auth way for a reason. > > Brian > > > - Original Message - > From: Jeff Squyres [mailto:jsquy...@cisco.com] > Sent:

Re: [OMPI devel] affinity MPI extension not included in OMPI 1.5.2

2011-03-09 Thread Ken Lloyd
Please, do. On Wed, 2011-03-09 at 15:58 -0500, Jeff Squyres wrote: > Crud. It's specifically listed in the NEWS, but somehow it didn't get > included in the tarball. I'll investigate. > > Should we do a 1.5.3 in the immediate future with the affinity extension? > -- Kenneth A. Lloyd Directo

Re: [OMPI devel] 1.5.3rc1 has been posted

2011-03-11 Thread Ken Lloyd
Thanks! I'll put it through its paces ASAP. On Fri, 2011-03-11 at 12:34 -0500, Jeff Squyres wrote: > The only difference is the addition of the "affinity" MPI extension that was > missing in 1.5.2. > > It seems to be ok for me. If no one else finds any problems, I'll release it > Sunday or so

Re: [OMPI devel] RFC: Add support to send/receive CUDA device memory directly

2011-04-13 Thread Ken Lloyd
Rolf, I haven't had a chance to review the code yet, but how do these changes relate to CUDA 4.0 - especially the UVA and GPUDirect 2.0 implementation? Ken On Wed, 2011-04-13 at 09:47 -0700, Rolf vandeVaart wrote: > WHAT: Add support to send data directly from CUDA device memory via > MPI calls.

Re: [OMPI devel] RFC: Add support to send/receive CUDA device memory directly

2011-04-13 Thread Ken Lloyd
George, Yes. GPUDirect eliminated an additional (host) memory buffering step between the HCA and the GPU that took CPU cycles. I was never very comfortable with the kernel patch necessary, nor the patched OFED required to make it all work. Having said that, it did provide a ~14% improvement in th

Re: [OMPI devel] Exit status

2011-04-14 Thread Ken Lloyd
Point well made, Nick. In other words, irrespective of OS or language, are we citing the need for "application-correcting code" from OpenMPI (relocate and/or retry), similar to ECC in memory? Ken On Thu, 2011-04-14 at 14:31 +0100, N.M. Maclaren wrote: > On Apr 14 2011, Ralph Castain wrote: > >> >

Re: [OMPI devel] RFC: Add support to send/receive CUDA device memory directly

2011-04-14 Thread Ken Lloyd
I'd suggest supporting CUDA device queries in carto and hwloc. Ken On Thu, 2011-04-14 at 11:25 -0400, Jeff Squyres wrote: > On Apr 13, 2011, at 12:47 PM, Rolf vandeVaart wrote: > > > By default, the code is disabled and has to be configured into the library. > > --with-cuda(=DIR) Build

Re: [OMPI devel] RFC: Second Try: Add support to send/receive CUDA device memory directly

2011-04-19 Thread Ken Lloyd
Thanks Rolf. We'll try it out. Ken On Tue, 2011-04-19 at 13:45 -0700, Rolf vandeVaart wrote: > WHAT: Second try to add support to send data directly from CUDA device > memory via MPI calls. > > > > TIMEOUT: 4/26/2011 > > > > DETAILS: Based on all the feedback (thanks to everyone who look
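For anyone wanting to try the CUDA support described in this RFC, a minimal build sketch follows. The `--with-cuda(=DIR)` flag is quoted from Rolf's earlier mail; the CUDA install path is an assumption, and the `ompi_info` check uses the `mpi_built_with_cuda_support` parameter found in later Open MPI releases, which may not exist in this RFC's branch:

```shell
# Configure Open MPI with CUDA device-memory support (disabled by default).
# /usr/local/cuda is an assumed install prefix; adjust for your system.
./configure --with-cuda=/usr/local/cuda
make -j4 && make install

# Later releases expose an MCA parameter to verify the build;
# hypothetical for this branch:
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
```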

[OMPI devel] Adaptive or fault-tolerant MPI

2011-04-22 Thread Ken Lloyd
Before I jump in, is anyone already actively working in this area? Ken

Re: [OMPI devel] Adaptive or fault-tolerant MPI

2011-04-25 Thread Ken Lloyd
Thanks. I've read your (Joshua Hursey's) Ph.D. thesis on fault tolerance using checkpointing with much interest. It would be of further interest to get the range of possible user requirements for defining the behaviors in response to various faults. Ken Lloyd On Fri, 2011-04-22 at 1

Re: [OMPI devel] Multiple Memory Pools

2011-05-16 Thread Ken Lloyd
Rolf, We have identified a need in certain high performance applications to specify memory sections -> L3 -> L2 -> L1 -> specific core -> specific CPU -> specific machine. These tend toward hybridized CUDA apps where other sections of the CPU are involved in non-CUDA (non-GPU) functions. In our di
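The core/CPU/machine end of the placement Ken describes can be approximated with Open MPI's process-binding options; a hedged launch sketch is below (flag spellings are from 1.5-era mpirun and vary by release; `./hybrid_app` is a hypothetical binary):

```shell
# Bind each MPI rank to a single core so CPU-side (non-GPU) work
# stays close to one cache hierarchy (private L1/L2, shared L3 on the socket).
mpirun -np 4 --bind-to-core ./hybrid_app

# Or bind by socket when several ranks share a GPU attached to that socket.
mpirun -np 2 --bysocket --bind-to-socket ./hybrid_app
```

Finer cache-level placement (pinning data to a specific L2/L3 slice) is not expressible through mpirun alone and would need hwloc-level control inside the application.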

[OMPI devel] Resilience 2011

2011-06-24 Thread Ken Lloyd
Josh and Wesley, Will you be presenting Resilient ORTE at Resilience 2011 in Bordeaux? http://xcr.cenit.latech.edu/resilience2011/ = Kenneth A. Lloyd CEO - Director of Systems Science Watt Systems Technologies Inc. www.wattsys.com kenneth.ll...@wattsys.com

Re: [OMPI devel] Resilience 2011

2011-06-27 Thread Ken Lloyd
that has occurred, we'll probably be > close to what we would call a "resilient" system. > > > Until then, we are improving, but still far from "resilient". > > > > > > On Jun 24, 2011, at 10:24 AM, Ken Lloyd wrote: >