I am curious which tests are being used when running tests on larger
clusters. And by larger clusters, I mean anything with np > 128.
(I realize that is not very large, but it is bigger than most of the
clusters I assume tests are being run on.)
I ask this because I planned on using some ...
Sounds great to me.
Aurelien
On Sep 11, 2007, at 1:03 PM, Jeff Squyres wrote:
If you genericize the concept, I think it's compatible with FT:
1. during MPI_INIT, one of the MPI processes can request a "notify"
exit pattern for the job: a process must notify the RTE before it
actually exits (i.e. ...
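(For illustration, a minimal C sketch of that "notify" exit pattern; rte_notify_exit() is a hypothetical RTE hook, stubbed out here so the sketch compiles, and is not an existing Open MPI/ORTE API:

    /* Sketch of the "notify" exit pattern: announce the exit to the
     * RTE before terminating, so an unannounced exit can be treated
     * as a failure.  rte_notify_exit() is a hypothetical stand-in. */
    #include <stdio.h>
    #include <stdlib.h>

    static void rte_notify_exit(int status)
    {
        /* stand-in for a message to the runtime */
        fprintf(stderr, "RTE notified: exiting with status %d\n", status);
    }

    static void clean_exit(int status)
    {
        rte_notify_exit(status);  /* announce the exit first, so the RTE  */
        exit(status);             /* can tell a clean exit from a failure */
    }

    int main(void)
    {
        clean_exit(0);
    }
)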
On Sep 8, 2007, at 2:33 PM, Aurelien Bouteiller wrote:
I agree (b) is not a good idea. However, I am not very pleased by (a)
either. It totally prevents any process fault-tolerance mechanism if we
go that way. If we plan to add some failure detection mechanism to the
RTE and failure management (to avoid ...
First off, I've managed to reproduce this with nbcbench using only 16
procs (two per node) and setting btl_ofud_sd_num to 12, which eases
debugging with fewer procs to look at.
ompi_coll_tuned_alltoall_intra_basic_linear is the alltoall routine that
is being called. What I'm seeing from TotalView ...
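(An invocation along these lines matches the setup described above; the exact nbcbench arguments are assumed, not taken from the report:

    mpirun -np 16 --mca btl ofud,self --mca btl_ofud_sd_num 12 ./nbcbench

The --mca flag is the standard Open MPI way to override an MCA parameter such as btl_ofud_sd_num from the command line.)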
Hi Aurelien,
Thank you for the pointers. I was able to plug in a component to an
existing framework.
Thanks again,
Sajjad
Aurelien Bouteiller
Sent by: devel-boun...@open-mpi.org
09/08/07 01:34 PM
Reply-To: Open MPI Developers
To: Open MPI Developers
Subject: Re: [OMPI devel]
David fixed a problem this morning where Coverity wasn't quite running
right because the directory where OMPI lived was changing every night,
so a few of the old runs were pruned.
--
Jeff Squyres
Cisco Systems
Gleb Natapov wrote:
On Tue, Sep 11, 2007 at 10:00:07AM -0500, Edgar Gabriel wrote:
Gleb,
in the scenario which you describe in the comment to the patch, what
should happen is that the communicator with the cid which already
started the allreduce will basically 'hang' until the other processes
'allow' ...
On Tue, Sep 11, 2007 at 11:30:53AM -0400, George Bosilca wrote:
>
> On Sep 11, 2007, at 11:05 AM, Gleb Natapov wrote:
>
>> On Tue, Sep 11, 2007 at 10:54:25AM -0400, George Bosilca wrote:
>>> We don't want to prevent two threads from entering the code at the same time.
>>> The algorithm you cited supports ...
On Sep 11, 2007, at 11:05 AM, Gleb Natapov wrote:
On Tue, Sep 11, 2007 at 10:54:25AM -0400, George Bosilca wrote:
We don't want to prevent two threads from entering the code at the
same time.
The algorithm you cited supports this case. There is only one
moment that is ...
Are you sure it supports this case? ...
On Tue, Sep 11, 2007 at 10:00:07AM -0500, Edgar Gabriel wrote:
> Gleb,
>
> in the scenario which you describe in the comment to the patch, what
> should happen is that the communicator with the cid which already
> started the allreduce will basically 'hang' until the other processes
> 'allow' the lower cids to continue ...
On Tue, Sep 11, 2007 at 10:54:25AM -0400, George Bosilca wrote:
> We don't want to prevent two threads from entering the code at the same time.
> The algorithm you cited supports this case. There is only one moment that is ...
Are you sure it supports this case? There is a global var mask_in_use
that prevents ...
Gleb,
in the scenario which you describe in the comment to the patch, what
should happen is that the communicator with the cid which already
started the allreduce will basically 'hang' until the other processes
'allow' the lower cids to continue. It should basically be blocked in
the allreduce ...
We don't want to prevent two threads from entering the code at the
same time. The algorithm you cited supports this case. There is only
one critical moment: the local selection of the next available cid,
and this is what we try to protect there. If, after the first run,
the collective call ...
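(As a rough illustration of the serialization being debated, a simplified C sketch with assumed names, not the actual comm_cid.c code: a single global mask_in_use flag lets only one CID selection proceed at a time, so a second thread must back off and retry:

    /* Simplified sketch, not the real Open MPI code: one global flag
     * guards the cid mask, so concurrent selections are serialized. */
    #include <pthread.h>

    static pthread_mutex_t cid_lock = PTHREAD_MUTEX_INITIALIZER;
    static int mask_in_use = 0;          /* global across all communicators */

    /* find_free_bit is a stand-in for the local cid selection;
     * returns the chosen cid, or -1 if another selection is in flight */
    static int try_next_cid(int (*find_free_bit)(void))
    {
        int cid = -1;
        pthread_mutex_lock(&cid_lock);
        if (!mask_in_use) {
            mask_in_use = 1;             /* claim the mask */
            cid = find_free_bit();       /* local selection of next cid */
        }
        pthread_mutex_unlock(&cid_lock);
        return cid;                      /* caller retries on -1 */
    }

    /* tiny demo: a trivial selection function and one call */
    static int first_free(void) { return 42; }

    int main(void)
    {
        return try_next_cid(first_free) == 42 ? 0 : 1;
    }
)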
On Tue, Sep 11, 2007 at 10:14:30AM -0400, George Bosilca wrote:
> Gleb,
>
> This patch is not correct. The code preventing the registration of the same
> communicator twice is later in the code (same file, in the function
> ompi_comm_register_cid, line 326). Once the function ompi_comm_register_cid is called ...
Gleb,
This patch is not correct. The code preventing the registration of
the same communicator twice is later in the code (same file, in the
function ompi_comm_register_cid, line 326). Once the function
ompi_comm_register_cid is called, we know that each communicator only
handles one "commun...
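(To illustrate that dedup step, a hedged C sketch with invented names, not the code at line 326: scan a pending list before adding, so the same communicator is registered only once; the caller is assumed to hold the relevant lock:

    /* Illustrative only: register a communicator for cid negotiation
     * at most once by scanning the pending list first. */
    #include <stddef.h>

    struct reg_entry {
        int comm_id;                 /* identifies the communicator */
        struct reg_entry *next;
    };

    static struct reg_entry *pending = NULL;   /* lock held by caller */

    /* returns 1 if newly registered, 0 if it was already on the list */
    static int register_cid_once(struct reg_entry *e)
    {
        for (struct reg_entry *p = pending; p != NULL; p = p->next) {
            if (p->comm_id == e->comm_id) {
                return 0;            /* already registered: do nothing */
            }
        }
        e->next = pending;           /* push onto the pending list */
        pending = e;
        return 1;
    }

    int main(void)
    {
        struct reg_entry a = { 7, NULL }, b = { 7, NULL };
        return (register_cid_once(&a) == 1 && register_cid_once(&b) == 0) ? 0 : 1;
    }
)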