I am not 100% sure I correctly understand what you are asking, so
correct me (or just ignore me) if I have totally missed the point here.
You have said you have a lightweight socket implementation for Linux and
managed to add a simple polling function into the module. I am
guessing that you
Hi everyone,
I'm currently writing my own BTL component that utilises a lightweight Linux
socket module. It wouldn't have nearly as much functionality as a TCP/IP
socket, but it does the job, and I managed to add a simple polling function
into the module; it sleeps for whatever amount of time is en
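Purely as an illustration of what such a polling helper might look like (a guess, not the poster's code -- the name lw_socket_poll, the signature, and the return convention are all made up), a poll(2)-based wait with a caller-supplied timeout could be sketched as:

    #include <poll.h>
    #include <errno.h>

    /* Wait for data on one socket fd, sleeping up to timeout_ms.
     * Returns 1 if readable, 0 on timeout, -1 on error. */
    static int lw_socket_poll(int fd, int timeout_ms)
    {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        int rc;

        do {
            rc = poll(&pfd, 1, timeout_ms);  /* blocks up to timeout_ms */
        } while (rc < 0 && errno == EINTR);  /* restart if interrupted */

        if (rc < 0)
            return -1;                       /* poll() error */
        return (rc > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
    }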
I would agree with Brian - in fact, it was my understanding from the
beginning of the project that we were pursuing Andrew's first vision: an MPI
implementation with whatever run time support is required, and no
more.
I would only expand on the statement about "...do not detract from the
prim
On Wed, 11 Mar 2009, Andrew Lumsdaine wrote:
Hi all -- There is a meta question that I think is underlying some of the
discussion about what to do with BTLs etc. Namely, is Open MPI an MPI
implementation with a portable run time system -- or is it a distributed OS
with an MPI interface? It s
On Mar 11, 2009, at 2:18 PM, Eugene Loh wrote:
> Can this error happen on any test?
Presumably yes if two or more processes are on the same node.
Yes, because these failures were occurring during MPI_INIT (i.e.,
'zactly what Eugene said...).
--
Jeff Squyres
Cisco Systems
Ethan Mallove wrote:
Can this error happen on any test?
Presumably yes if two or more processes are on the same node.
What do these tests have in common?
They all try to start. :^) The problem is in MPI_Init.
It almost looks like the problem is more likely to occur if MPI_UB or
MPI_L
On Wed, Mar/11/2009 11:38:19AM, Jeff Squyres wrote:
> As Terry stated, I think this bugger is quite rare. I'm having a helluva
> time trying to reproduce it manually (over 5k runs this morning and still
> no segv). Ugh.
5k of which test(s)? Can this error happen on any test? I am wondering
if
Hi all -- There is a meta question that I think is underlying some of
the discussion about what to do with BTLs etc. Namely, is Open MPI an
MPI implementation with a portable run time system -- or is it a
distributed OS with an MPI interface? It seems like some of the
changes being asked
On Wed, 11 Mar 2009, Richard Graham wrote:
Brian,
Going back over the e-mail trail it seems like you have raised two
concerns:
- BTL performance after the change, which I would take to be
  - btl latency
  - btl bandwidth
- Code maintainability
  - repeated code changes that impact a large number
If it is that hard to replicate outside of MTT, then by all means
let's just release it - users will probably never see it.
On Mar 11, 2009, at 10:07 AM, Terry Dontje wrote:
Ralph Castain wrote:
You know, this isn't the first time we have encountered errors that
-only- appear when running
Ralph Castain wrote:
Could be nobody is saying anything...but I would be surprised if
-nobody- barked at a segfault during startup.
Well, if it segfaulted during startup, someone's first reaction would
probably be, "Oh really?" They would try again, have success, attribute
it to cosmic rays,
Ralph Castain wrote:
You know, this isn't the first time we have encountered errors that
-only- appear when running under MTT. As per my other note, we are not
seeing these failures here, even though almost all our users run under
batch/scripts.
This has been the case with at least some of th
You know, this isn't the first time we have encountered errors that
-only- appear when running under MTT. As per my other note, we are not
seeing these failures here, even though almost all our users run under
batch/scripts.
This has been the case with at least some of these other MTT-only
FWIW, we have people running dozens of jobs every day with 1.3.0 built
with Intel 10.0.23 and PGI 7.2-5 compilers, using -mca btl
sm,openib,self...and have not received a single report of this failure.
This is all on Linux machines (various kernels), under both slurm and
torque environments
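For reference, that BTL selection corresponds to an mpirun invocation along these lines (the process count and binary name here are only placeholders):

    mpirun -np 4 -mca btl sm,openib,self ./my_mpi_app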
As Terry stated, I think this bugger is quite rare. I'm having a
helluva time trying to reproduce it manually (over 5k runs this
morning and still no segv). Ugh.
Looking through the sm startup code, I can't see exactly what the
problem would be. :-(
On Mar 11, 2009, at 11:34 AM, Ralph
I'll run some tests with 1.3.1 on one of our systems and see if it
shows up there. If it is truly rare and was in 1.3.0, then I
personally don't have a problem with it. Got bigger problems with
hanging collectives, frankly - and we don't know how the sm changes
will affect this problem, if
Jeff Squyres wrote:
So -- Brad/George -- this technically isn't a regression against
v1.3.0 (do we know if this can happen in 1.2? I don't recall seeing
it there, but if it's so elusive... I haven't been MTT testing the
1.2 series in a long time). But it is a nonzero problem.
I have not se
So -- Brad/George -- this technically isn't a regression against
v1.3.0 (do we know if this can happen in 1.2? I don't recall seeing
it there, but if it's so elusive... I haven't been MTT testing the
1.2 series in a long time). But it is a nonzero problem.
Should we release 1.3.1 without
Could be true; it unfortunately doesn't help us for 1.3.1, though. :-(
Maybe I'll add a big memset of 0 across the sm segment at the
beginning of time and see if this problem goes away.
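A rough sketch of that idea, assuming a POSIX shm_open/mmap-style setup (the function name and error handling below are hypothetical, not the actual sm BTL code):

    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Map a shared-memory segment and zero-fill it so no process
     * can read stale bytes left over from a previous mapping. */
    static void *map_and_zero_segment(const char *name, size_t size)
    {
        int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
        if (fd < 0)
            return NULL;
        if (ftruncate(fd, (off_t)size) != 0) {
            close(fd);
            return NULL;
        }
        void *seg = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        close(fd);
        if (seg == MAP_FAILED)
            return NULL;
        memset(seg, 0, size);  /* the "big memset of 0" across the segment */
        return seg;
    }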
On Mar 11, 2009, at 7:30 AM, Ralph Castain wrote:
I actually wasn't implying that Eugene's changes -ca
I actually wasn't implying that Eugene's changes -caused- the problem,
but rather that I thought they might have -fixed- the problem.
:-)
On Mar 11, 2009, at 4:34 AM, Terry Dontje wrote:
I forgot to mention that since I ran into this issue so long ago I
really doubt that Eugene's SM change
I forgot to mention that since I ran into this issue so long ago I
really doubt that Eugene's SM changes have caused this issue.
--td
Terry Dontje wrote:
Hey!!! I ran into this problem many months ago but it's been so
elusive that I haven't nailed it down. First time we saw this was
last O
Hey!!! I ran into this problem many months ago but it's been so elusive
that I haven't nailed it down. First time we saw this was last
October. I did some MTT gleaning and could not find anyone but Solaris
having this issue under MTT. What's interesting is I gleaned Sun's MTT
results and
Brian,
Going back over the e-mail trail it seems like you have raised two
concerns:
- BTL performance after the change, which I would take to be
  - btl latency
  - btl bandwidth
- Code maintainability
  - repeated code changes that impact a large number of files
- A demonstration t
Ralph Castain wrote:
Hey Jeff
I seem to recall seeing the identical problem reported on the user
list not long ago...or it may have been the devel list. Anyway, it was
during btl_sm_add_procs, and the code was segv'ing.
I don't have the archives handy here, but perhaps you might search
the