Re: [OMPI devel] Manipulating OPAL event system

2009-03-11 Thread Paul H. Hargrove
I am not 100% sure I correctly understand what you are asking, so correct me (or just ignore me) if I have totally missed the point here. You have said you have a lightweight socket implementation for Linux and managed to add a simple polling function into the module. I am guessing that you

[OMPI devel] Manipulating OPAL event system

2009-03-11 Thread Timothy Hayes
Hi everyone, I'm currently writing my own BTL component that utilises a lightweight Linux socket module. It wouldn't have nearly as much functionality as a TCP/IP socket, but it does the job, and I managed to add a simple polling function into the module; it sleeps for whatever amount of time is en

Re: [OMPI devel] Meta Question -- Open MPI: Is it a dessert topping or is it a floor wax?

2009-03-11 Thread Ralph Castain
I would agree with Brian - in fact, it was my understanding from the beginning of the project that we were Andrew's first vision: an MPI implementation with whatever run time support that is required, and no more. I would only expand on the statement about "...do not detract from the prim

Re: [OMPI devel] Meta Question -- Open MPI: Is it a dessert topping or is it a floor wax?

2009-03-11 Thread Brian W. Barrett
On Wed, 11 Mar 2009, Andrew Lumsdaine wrote: Hi all -- There is a meta question that I think is underlying some of the discussion about what to do with BTLs etc. Namely, is Open MPI an MPI implementation with a portable run time system -- or is it a distributed OS with an MPI interface? It s

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Jeff Squyres
On Mar 11, 2009, at 2:18 PM, Eugene Loh wrote: > Can this error happen on any test? Presumably yes if two or more processes are on the same node. Yes, because these failures were occurring during MPI_INIT (i.e., 'zactly what Eugene said...). -- Jeff Squyres Cisco Systems

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Eugene Loh
Ethan Mallove wrote: Can this error happen on any test? Presumably yes if two or more processes are on the same node. What do these tests have in common? They all try to start. :^) The problem is in MPI_Init. It almost looks like the problem is more likely to occur if MPI_UB or MPI_L

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Ethan Mallove
On Wed, Mar/11/2009 11:38:19AM, Jeff Squyres wrote: > As Terry stated, I think this bugger is quite rare. I'm having a helluva > time trying to reproduce it manually (over 5k runs this morning and still > no segv). Ugh. 5k of which test(s)? Can this error happen on any test? I am wondering if

[OMPI devel] Meta Question -- Open MPI: Is it a dessert topping or is it a floor wax?

2009-03-11 Thread Andrew Lumsdaine
Hi all -- There is a meta question that I think is underlying some of the discussion about what to do with BTLs etc. Namely, is Open MPI an MPI implementation with a portable run time system -- or is it a distributed OS with an MPI interface? It seems like some of the changes being asked

Re: [OMPI devel] RFC: move BTLs out of ompi into separate layer

2009-03-11 Thread Brian W. Barrett
On Wed, 11 Mar 2009, Richard Graham wrote: Brian, Going back over the e-mail trail it seems like you have raised two concerns: - BTL performance after the change, which I would take to be - btl latency - btl bandwidth - Code maintainability - repeated code changes that impact a large number

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Ralph Castain
If it is that hard to replicate outside of MTT, then by all means let's just release it - users will probably never see it. On Mar 11, 2009, at 10:07 AM, Terry Dontje wrote: Ralph Castain wrote: You know, this isn't the first time we have encountered errors that -only- appear when running

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Eugene Loh
Ralph Castain wrote: Could be nobody is saying anything...but I would be surprised if -nobody- barked at a segfault during startup. Well, if it segfaulted during startup, someone's first reaction would probably be, "Oh really?" They would try again, have success, attribute to cosmic rays,

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Terry Dontje
Ralph Castain wrote: You know, this isn't the first time we have encountered errors that -only- appear when running under MTT. As per my other note, we are not seeing these failures here, even though almost all our users run under batch/scripts. This has been the case with at least some of th

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Ralph Castain
You know, this isn't the first time we have encountered errors that -only- appear when running under MTT. As per my other note, we are not seeing these failures here, even though almost all our users run under batch/scripts. This has been the case with at least some of these other MTT-only

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Ralph Castain
FWIW, we have people running dozens of jobs every day with 1.3.0 built with Intel 10.0.23 and PGI 7.2-5 compilers, using -mca btl sm,openib,self...and have not received a single report of this failure. This is all on Linux machines (various kernels), under both slurm and torque environments

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Jeff Squyres
As Terry stated, I think this bugger is quite rare. I'm having a helluva time trying to reproduce it manually (over 5k runs this morning and still no segv). Ugh. Looking through the sm startup code, I can't see exactly what the problem would be. :-( On Mar 11, 2009, at 11:34 AM, Ralph

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Ralph Castain
I'll run some tests with 1.3.1 on one of our systems and see if it shows up there. If it is truly rare and was in 1.3.0, then I personally don't have a problem with it. Got bigger problems with hanging collectives, frankly - and we don't know how the sm changes will affect this problem, if

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Terry Dontje
Jeff Squyres wrote: So -- Brad/George -- this technically isn't a regression against v1.3.0 (do we know if this can happen in 1.2? I don't recall seeing it there, but if it's so elusive... I haven't been MTT testing the 1.2 series in a long time). But it is a nonzero problem. I have not se

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Jeff Squyres
So -- Brad/George -- this technically isn't a regression against v1.3.0 (do we know if this can happen in 1.2? I don't recall seeing it there, but if it's so elusive... I haven't been MTT testing the 1.2 series in a long time). But it is a nonzero problem. Should we release 1.3.1 without

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Jeff Squyres
Could be true; it unfortunately doesn't help us for 1.3.1, though. :-( Maybe I'll add a big memset of 0 across the sm segment at the beginning of time and see if this problem goes away. On Mar 11, 2009, at 7:30 AM, Ralph Castain wrote: I actually wasn't implying that Eugene's changes -ca

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Ralph Castain
I actually wasn't implying that Eugene's changes -caused- the problem, but rather that I thought they might have -fixed- the problem. :-) On Mar 11, 2009, at 4:34 AM, Terry Dontje wrote: I forgot to mention that since I ran into this issue so long ago I really doubt that Eugene's SM change

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Terry Dontje
I forgot to mention that since I ran into this issue so long ago I really doubt that Eugene's SM changes have caused this issue. --td Terry Dontje wrote: Hey!!! I ran into this problem many months ago but it's been so elusive that I haven't nailed it down. First time we saw this was last O

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Terry Dontje
Hey!!! I ran into this problem many months ago but it's been so elusive that I haven't nailed it down. First time we saw this was last October. I did some MTT gleaning and could not find anyone but Solaris having this issue under MTT. What's interesting is I gleaned Sun's MTT results and

Re: [OMPI devel] RFC: move BTLs out of ompi into separate layer

2009-03-11 Thread Richard Graham
Brian, Going back over the e-mail trail it seems like you have raised two concerns: - BTL performance after the change, which I would take to be - btl latency - btl bandwidth - Code maintainability - repeated code changes that impact a large number of files - A demonstration t

Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

2009-03-11 Thread Eugene Loh
Ralph Castain wrote: Hey Jeff I seem to recall seeing the identical problem reported on the user list not long ago...or may have been the devel list. Anyway, it was during btl_sm_add_procs, and the code was segv'ing. I don't have the archives handy here, but perhaps you might search the