On Mon, Dec 17, 2007 at 08:08:02PM -0500, Richard Graham wrote:
> Needless to say (for the nth time :-) ) that changing this bit of code
> makes me nervous.

I've noticed it already :)
Gleb,
Needless to say (for the nth time :-) ) that changing this bit of code
makes me nervous. However, it occurred to me that there is a much better
way to test this code than setting up an environment that generates some
out-of-order events without us being able to specify the order. Sin…
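One way to read Rich's suggestion is to drive the matching logic directly with every possible arrival order, instead of hoping a live network reorders fragments the right way. A minimal sketch of such a harness, in Python rather than the actual OB1 C code (the `OrderedMatcher` class and its fields are hypothetical stand-ins, not Open MPI structures):

```python
from itertools import permutations

class OrderedMatcher:
    """Toy stand-in for OB1's per-peer matching: deliver frags in seq order."""
    def __init__(self):
        self.next_seq = 0    # next expected sequence number
        self.cache = {}      # out-of-order frags parked until their turn
        self.delivered = []  # what reached "the user", in delivery order

    def arrive(self, seq):
        if seq != self.next_seq:
            self.cache[seq] = True   # not the expected frag: park it
            return
        self.delivered.append(seq)   # match the expected frag...
        self.next_seq += 1
        while self.next_seq in self.cache:  # ...then drain what it unblocks
            del self.cache[self.next_seq]
            self.delivered.append(self.next_seq)
            self.next_seq += 1

# Exercise every arrival order of 5 frags: delivery must always be in order.
for order in permutations(range(5)):
    m = OrderedMatcher()
    for seq in order:
        m.arrive(seq)
    assert m.delivered == [0, 1, 2, 3, 4], order
print("all 120 arrival orders deliver in order")
```

The point of the harness is exactly what Rich asks for: the test specifies the order, so every permutation (not just whatever a given fabric happens to produce) is exercised deterministically.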
On Fri, Dec 14, 2007 at 06:53:55AM -0500, Richard Graham wrote:
> If you have positive confirmation that such things have happened, this
> will go a long way.

I instrumented the code to log all kinds of info about fragment reordering
while I chased a bug in openib that caused the matching logic to malfunction…
If you have positive confirmation that such things have happened, this will
go a long way. I will not trust the code until this has also been done with
multiple independent network paths. I very rarely express such strong
opinions, even if I don't agree with what is being done, but this is the
co…
Yes, should be a bit more clear. We need an independent way to verify that
data is matched in the correct order; sending this information as payload
is one way to do this. So, sending unique data in every message, and
making sure that it arrives in the user buffers in the expected order, is
a way…
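The self-check Rich describes can be as simple as deriving every byte of a message's payload from its sequence number, so the receiver can verify both content and order from the user buffers alone. A sketch of the idea (the helper names are hypothetical, not Open MPI API):

```python
def make_payload(seq, size=16):
    """Fill a buffer with bytes derived from the message's sequence number,
    so every message carries unique, predictable data."""
    return bytes((seq + i) % 256 for i in range(size))

def check_payload(buf, expected_seq):
    """Verify a received user buffer holds exactly the payload for expected_seq."""
    return buf == make_payload(expected_seq, len(buf))

# Receiver side: user buffers must contain payloads 0, 1, 2, ... in posting order.
received = [make_payload(s) for s in (0, 1, 2)]   # correct matching order: passes
assert all(check_payload(b, i) for i, b in enumerate(received))

swapped = [make_payload(s) for s in (0, 2, 1)]    # mis-ordered matching: caught
assert not all(check_payload(b, i) for i, b in enumerate(swapped))
```

Because the check needs only the data that lands in the user buffers, it verifies matching order independently of any instrumentation inside the library.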
The situation that needs to be triggered, just as George mentioned, is
where we have a lot of unexpected messages, to make sure that when one
that we can match against comes in, all the unexpected messages that can
be matched with pre-posted receives are matched. Since we attempt to match
only…
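The invariant being described can be modeled in a few lines: fragments are considered strictly in sequence order, each matchable fragment is paired with the oldest pre-posted receive for its tag, and closing a sequence gap must release the whole run of fragments it was blocking. A toy model (simplified tag-only matching, not the OB1 data structures):

```python
from collections import deque

class Peer:
    """Toy model of per-peer matching with an unexpected-message queue."""
    def __init__(self):
        self.next_seq = 0
        self.ooo = {}              # frags that arrived ahead of the sequence
        self.posted = deque()      # pre-posted receive tags, oldest first
        self.unexpected = deque()  # in-sequence frags with no posted receive
        self.matches = []          # (seq, tag) pairs handed to receives

    def post_recv(self, tag):
        # a newly posted receive must first search the unexpected queue
        for frag in list(self.unexpected):
            if frag[1] == tag:
                self.unexpected.remove(frag)
                self.matches.append(frag)
                return
        self.posted.append(tag)

    def arrive(self, seq, tag):
        self.ooo[seq] = tag
        # consider frags strictly in sequence order; closing a gap may
        # release a whole run of previously unmatchable frags at once
        while self.next_seq in self.ooo:
            t = self.ooo.pop(self.next_seq)
            if t in self.posted:
                self.posted.remove(t)
                self.matches.append((self.next_seq, t))
            else:
                self.unexpected.append((self.next_seq, t))
            self.next_seq += 1

p = Peer()
for tag in (10, 11, 12):
    p.post_recv(tag)
p.arrive(2, 12)   # ahead of sequence: nothing can match yet
p.arrive(1, 11)   # still waiting on seq 0
assert p.matches == []
p.arrive(0, 10)   # gap closes: all three match, in sequence order
assert p.matches == [(0, 10), (1, 11), (2, 12)]
```

The last three lines are exactly the scenario Rich wants triggered: a backlog of unmatchable messages that must all be matched against pre-posted receives the moment the matchable one arrives.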
Rich was referring to the fact that the reordering of fragments other
than the matching ones is irrelevant to Gleb's change. In order to
trigger the changes we need to force a lot of small unexpected
messages over multiple networks. The testing environment should have
multiple similar networks…
Was Rich referring to ensuring that the test codes checked that their
payloads were correct (and not re-assembled in the wrong order)?
Re: [OMPI devel] matching code rewrite in OB1
This is better than nothing, but really not very helpful for looking at the
specific issues that can arise with this, unless these systems have several
parallel networks, with tests that will generate a lot of parallel network
traffic, and be able to self-check for out-of-order receives - i.e. this…
On Wed, Dec 12, 2007 at 11:57:11AM -0500, Jeff Squyres wrote:
> Gleb --
>
> How about making a tarball with this patch in it that can be thrown at
> everyone's MTT? (we can put the tarball on www.open-mpi.org somewhere)

I don't have access to www.open-mpi.org, but I can send you the patch.
I can send you a tarball too…
Gleb --
How about making a tarball with this patch in it that can be thrown at
everyone's MTT? (we can put the tarball on www.open-mpi.org somewhere)
I will re-iterate my concern. The code that is there now is mostly nine
years old (with some mods made when it was brought over to Open MPI). It
took about 2 months of testing on systems with 5-13 way network parallelism
to track down all KNOWN race conditions. This code is at the center of MPI…
On Tue, Dec 11, 2007 at 08:03:52AM -0800, Andrew Friedley wrote:
> Try UD, frags are reordered at a very high rate so should be a good test.

mpi-ping works fine with UD BTL and the patch.
Possibly, though I have results from a benchmark I've written indicating
the reordering happens at the sender. I believe I found it was due to
the QP striping trick I use to get more bandwidth -- if you back down to
one QP (there's a define in the code you can change), the reordering
rate drops…
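The striping effect Andrew describes can be reproduced with a toy model: stripe sends round-robin across several queues that each stay FIFO internally but progress independently, then count how often a fragment arrives after a higher-numbered one. A hedged sketch (a simulation of the effect, not the UD BTL code; queue count and seed are illustrative):

```python
import random

def reorder_rate(num_qps, num_frags=10000, seed=42):
    """Stripe frags round-robin over num_qps FIFO queues, drain the queues in
    a random interleaving, and return the fraction of adjacent arrivals that
    are out of order."""
    rng = random.Random(seed)
    queues = [[] for _ in range(num_qps)]
    for seq in range(num_frags):
        queues[seq % num_qps].append(seq)        # sender-side striping
    arrivals = []
    while any(queues):
        q = rng.choice([q for q in queues if q]) # queues progress independently
        arrivals.append(q.pop(0))                # each queue is itself FIFO
    ooo = sum(1 for a, b in zip(arrivals, arrivals[1:]) if b < a)
    return ooo / num_frags

# One queue can never reorder; striping across queues reorders heavily,
# which matches the observation that backing down to one QP drops the rate.
assert reorder_rate(1) == 0.0
assert reorder_rate(4) > 0.05
```

This is consistent with Andrew's observation: the reordering originates at the sender's striping, so reducing to a single QP makes it vanish even though each individual queue delivers in order.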
On Tue, Dec 11, 2007 at 08:03:52AM -0800, Andrew Friedley wrote:
> Try UD, frags are reordered at a very high rate so should be a good test.

Good idea, I'll try this. BTW I think the reason for such a high rate of
reordering in UD is that it polls for MCA_BTL_UD_NUM_WC completions
(500) and process…
Try UD, frags are reordered at a very high rate so should be a good test.

Andrew
Gleb,
I would suggest that before this is checked in this be tested on a system
that has N-way network parallelism, where N is as large as you can find.
This is a key bit of code for MPI correctness, and out-of-order operations
will break it, so you want to maximize the chance for such operations…
On Tue, 11 Dec 2007, Gleb Natapov wrote:
I did a rewrite of the matching code in OB1. I made it much simpler and two
times smaller (which is good: less code, less bugs). I also got rid of the
huge macros - very helpful if you need to debug something. There is no
performance degradation, actually I even…