Re: [OMPI devel] Changes: opal_output and opal_show_help

2008-05-12 Thread Josh Hursey


On May 10, 2008, at 9:00 AM, Jeff Squyres wrote:


Er, no.  I thought the group had agreed to the main idea last Tuesday
(framework for filtering output).  We were racing against the time-to-
branch clock and didn't take the time for an RFC after we agreed on
the design.  Do we need to?


I don't think so. But I'd just kinda like a more formal description of
what this fix is and its implications on how the developers are
expected to use it going forward since this is altering the coding
standards.





The side effect of eliminating duplicate error messages is new / was
not discussed last Tuesday -- I can put out an RFC for that if you'd
like, but the benefit is so obvious that I didn't think it would be
controversial.


Don't get me wrong, I'm not arguing the benefit, just that I'd like to
know what is expected of me as a developer after this change.


Not something to hold up the merge, just something I'd like to see.

Cheers,
Josh





On May 9, 2008, at 8:48 PM, Josh Hursey wrote:


Is there an RFC telling us when we might expect this?

On May 9, 2008, at 5:52 PM, Jeff Squyres wrote:


So when this stuff hits the trunk,


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems





[OMPI devel] heterogeneous OpenFabrics adapters

2008-05-12 Thread Jeff Squyres
I think that this issue has come up before, but I filed a ticket about  
it because at least one developer (Jon) has a system with both IB and  
iWARP adapters:


https://svn.open-mpi.org/trac/ompi/ticket/1282

My question: do we care about the heterogeneous adapter scenarios?   
For v1.3?  For v1.4?  For ...some version in the future?


I think the first issue I identified in the ticket is grunt work to  
fix (annoying and tedious, but not difficult), but the second one will  
be a little dicey -- it has scalability issues (e.g., sending around  
all info in the modex, etc.).


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] Changes: opal_output and opal_show_help

2008-05-12 Thread Jeff Squyres
Sorry it took so long for a reply; Ralph and I were working on this  
code much of the day in an attempt to have it all complete / tidied up  
for the teleconf tomorrow.



On May 12, 2008, at 10:04 AM, Josh Hursey wrote:


Er, no.  I thought the group had agreed to the main idea last Tuesday
(framework for filtering output).  We were racing against the time-to-
branch clock and didn't take the time for an RFC after we agreed on
the design.  Do we need to?


I don't think so. But I'd just kinda like a more formal description of
what this fix is and its implications on how the developers are
expected to use it going forward since this is altering the coding
standards.


Fair enough, will do.

Since this one was kinda weird, do you want an after-the-fact RFC, or  
a page on the wiki?  I'm partial to the latter; it'll be more durable.



The side effect of eliminating duplicate error messages is new / was
not discussed last Tuesday -- I can put out an RFC for that if you'd
like, but the benefit is so obvious that I didn't think it would be
controversial.


Don't get me wrong, I'm not arguing the benefit, just that I'd like to
know what is expected of me as a developer after this change.


That's perfectly reasonable.  In short: s/opal_show_help/orte_show_help/
in the ORTE and OMPI layers, and you're done (which we already did
throughout the code base).  Use orte_show_help in the ORTE and OMPI
layers in the future.  I think this information should go on the wiki.


Finally, per a conversation that I had with Terry earlier today, I  
added a new MCA parameter that will turn off the show_help message  
aggregation.  It defaults to aggregation enabled, but you can disable  
it with:


... --mca orte_base_help_aggregation 0 ...

This will show *all* show_help messages, regardless of duplication.   
Terry was worried that aggregating the same (filename, tuple) messages  
may actually mask different errors because we allow %s expansion in  
the message.


Re-examining George's mail in this thread, I think he may have had  
similar concerns, but I didn't grok that at the time.


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] Changes: opal_output and opal_show_help

2008-05-12 Thread Ralph Castain



On 5/12/08 3:49 PM, "Jeff Squyres"  wrote:

> Sorry it took so long for a reply; Ralph and I were working on this
> code much of the day in an attempt to have it all complete / tidied up
> for the teleconf tomorrow.
> 
> 
> On May 12, 2008, at 10:04 AM, Josh Hursey wrote:
> 
>>> Er, no.  I thought the group had agreed to the main idea last Tuesday
>>> (framework for filtering output).  We were racing against the time-to-
>>> branch clock and didn't take the time for an RFC after we agreed on
>>> the design.  Do we need to?
>> 
>> I don't think so. But I'd just kinda like a more formal description of
>> what this fix is and its implications on how the developers are
>> expected to use it going forward since this is altering the coding
>> standards.
> 
> Fair enough, will do.
> 
> Since this one was kinda weird, do you want an after-the-fact RFC, or
> a page on the wiki?  I'm partial to the latter; it'll be more durable.
> 
>>> The side effect of eliminating duplicate error messages is new / was
>>> not discussed last Tuesday -- I can put out an RFC for that if you'd
>>> like, but the benefit is so obvious that I didn't think it would be
>>> controversial.
>> 
>> Don't get me wrong, I'm not arguing the benefit, just that I'd like to
>> know what is expected of me as a developer after this change.
> 
> That's perfectly reasonable.  In short: s/opal_show_help/orte_show_help/
> in the ORTE and OMPI layers, and you're done (which we already did
> throughout the code base).  Use orte_show_help in the ORTE and OMPI
> layers in the future.  I think this information should go on the wiki.

Just to complete that, you also should:

s/opal_output/orte_output/
s/OPAL_OUTPUT/ORTE_OUTPUT/
s/OPAL_OUTPUT_VERBOSE/ORTE_OUTPUT_VERBOSE/

throughout the ORTE and OMPI layers in the future.

This has also been done in the current code base.

> 
> Finally, per a conversation that I had with Terry earlier today, I
> added a new MCA parameter that will turn off the show_help message
> aggregation.  It defaults to aggregation enabled, but you can disable
> it with:
> 
>  ... --mca orte_base_help_aggregation 0 ...
> 
> This will show *all* show_help messages, regardless of duplication.
> Terry was worried that aggregating the same (filename, tuple) messages
> may actually mask different errors because we allow %s expansion in
> the message.
> 
> Re-examining George's mail in this thread, I think he may have had
> similar concerns, but I didn't grok that at the time.




Re: [OMPI devel] Changes: opal_output and opal_show_help

2008-05-12 Thread Josh Hursey
I think a wiki page describing this should be fine. Just wanted to  
make sure I use the new functionality properly.


Cheers,
Josh

On May 12, 2008, at 5:59 PM, Ralph Castain wrote:





On 5/12/08 3:49 PM, "Jeff Squyres"  wrote:


Sorry it took so long for a reply; Ralph and I were working on this
code much of the day in an attempt to have it all complete / tidied up
for the teleconf tomorrow.


On May 12, 2008, at 10:04 AM, Josh Hursey wrote:

Er, no.  I thought the group had agreed to the main idea last Tuesday
(framework for filtering output).  We were racing against the time-to-
branch clock and didn't take the time for an RFC after we agreed on
the design.  Do we need to?


I don't think so. But I'd just kinda like a more formal description of
what this fix is and its implications on how the developers are
expected to use it going forward since this is altering the coding
standards.


Fair enough, will do.

Since this one was kinda weird, do you want an after-the-fact RFC, or
a page on the wiki?  I'm partial to the latter; it'll be more durable.


The side effect of eliminating duplicate error messages is new / was
not discussed last Tuesday -- I can put out an RFC for that if you'd
like, but the benefit is so obvious that I didn't think it would be
controversial.


Don't get me wrong, I'm not arguing the benefit, just that I'd like to
know what is expected of me as a developer after this change.


That's perfectly reasonable.  In short: s/opal_show_help/orte_show_help/
in the ORTE and OMPI layers, and you're done (which we already did
throughout the code base).  Use orte_show_help in the ORTE and OMPI
layers in the future.  I think this information should go on the wiki.


Just to complete that, you also should:

s/opal_output/orte_output/
s/OPAL_OUTPUT/ORTE_OUTPUT/
s/OPAL_OUTPUT_VERBOSE/ORTE_OUTPUT_VERBOSE/

throughout the ORTE and OMPI layers in the future.

This has also been done in the current code base.


Finally, per a conversation that I had with Terry earlier today, I
added a new MCA parameter that will turn off the show_help message
aggregation.  It defaults to aggregation enabled, but you can disable
it with:

 ... --mca orte_base_help_aggregation 0 ...

This will show *all* show_help messages, regardless of duplication.
Terry was worried that aggregating the same (filename, tuple) messages
may actually mask different errors because we allow %s expansion in
the message.

Re-examining George's mail in this thread, I think he may have had
similar concerns, but I didn't grok that at the time.







Re: [OMPI devel] heterogeneous OpenFabrics adapters

2008-05-12 Thread Jeff Squyres
After looking at the code a bit, I realized that I completely forgot  
that the INI file was invented to solve at least the heterogeneous- 
adapters-in-a-host problem.


So I amended the ticket to reflect that that problem is already  
solved.  The other part is not, though -- consider two MPI procs on  
different hosts, each with an iWARP NIC, but one NIC supports SRQs and  
one does not.



On May 12, 2008, at 5:36 PM, Jeff Squyres wrote:

I think that this issue has come up before, but I filed a ticket  
about it because at least one developer (Jon) has a system with both  
IB and iWARP adapters:


   https://svn.open-mpi.org/trac/ompi/ticket/1282

My question: do we care about the heterogeneous adapter scenarios?   
For v1.3?  For v1.4?  For ...some version in the future?


I think the first issue I identified in the ticket is grunt work to  
fix (annoying and tedious, but not difficult), but the second one  
will be a little dicey -- it has scalability issues (e.g., sending  
around all info in the modex, etc.).


--
Jeff Squyres
Cisco Systems




--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] heterogeneous OpenFabrics adapters

2008-05-12 Thread Jeff Squyres

Short version:
--

I propose that, for the v1.3 series, we disallow using multiple
different mca_btl_openib_receive_queues values (or receive_queues
values from the INI file) in a single MPI job.


More details:
-

The reason I'm looking into this heterogeneity stuff is to help  
Chelsio support their iWARP NIC in OMPI.  Their NIC needs a specific  
value for mca_btl_openib_receive_queues (specifically: it does not  
support SRQ and it has the wireup race condition that we discussed  
before).


The major problem is that all the BSRQ information is currently stored
on the openib component -- it is *not* maintained on a per-HCA (or
per-port) basis.  We *could* move all the BSRQ info to live on the
hca_t struct (or even the openib module struct), but that has at least
3 big consequences:


1. It would touch a lot of code.  But touching all this code is  
relatively low risk; it will be easy to check for correctness because  
the changes will either compile or not.


2. There are functions (some of which are static inline) that read the  
BSRQ data.  These functions would have to take an additional (hca_t*)  
(or (btl_openib_module_t*)) parameter.


3. Getting to the BSRQ info will take at least 1 or 2 more  
dereferences (e.g., module->hca->bsrq_info or module->bsrq_info...).


I'm not too concerned about #1 (it's grunt work), but I am a bit  
concerned about #2 and #3 since at least some of these places are in  
the critical performance path.


Given these concerns, I propose the following for v1.3:

- Add a "receive_queues" field to the INI file so that the Chelsio  
adapter can run out of the box (i.e., "mpirun -np 4 a.out" with hosts  
containing Chelsio NICs will get a value for btl_openib_receive_queues  
that will work).


- NetEffect NICs will also require overriding  
btl_openib_receive_queues, but will likely have a different value than  
Chelsio NICs (they don't have the wireup race condition that Chelsio  
does).


- Because the BSRQ info is on the component (i.e., global), we should  
detect when multiple different receive_queues values are specified and  
gracefully abort.


I think it'll be quite uncommon to need two different receive_queues
values, and that this proposal will be fine for v1.3.
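For context, a hypothetical INI entry of the kind being proposed (the section name and vendor/part IDs are illustrative placeholders, and the queue specification is made up; the real values would come from Chelsio):

```ini
# Hypothetical entry in the openib params INI file -- IDs and the
# queue specification below are illustrative, not real values.
[Chelsio T3]
vendor_id = 0x1425
vendor_part_id = 0x0030
# Per-peer (P) receive queues only, since this NIC does not support SRQ:
receive_queues = P,65536,256,192,128
```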


Comments?



On May 12, 2008, at 6:44 PM, Jeff Squyres wrote:


After looking at the code a bit, I realized that I completely forgot
that the INI file was invented to solve at least the heterogeneous-
adapters-in-a-host problem.

So I amended the ticket to reflect that that problem is already
solved.  The other part is not, though -- consider two MPI procs on
different hosts, each with an iWARP NIC, but one NIC supports SRQs and
one does not.


On May 12, 2008, at 5:36 PM, Jeff Squyres wrote:


I think that this issue has come up before, but I filed a ticket
about it because at least one developer (Jon) has a system with both
IB and iWARP adapters:

  https://svn.open-mpi.org/trac/ompi/ticket/1282

My question: do we care about the heterogeneous adapter scenarios?
For v1.3?  For v1.4?  For ...some version in the future?

I think the first issue I identified in the ticket is grunt work to
fix (annoying and tedious, but not difficult), but the second one
will be a little dicey -- it has scalability issues (e.g., sending
around all info in the modex, etc.).

--
Jeff Squyres
Cisco Systems




--
Jeff Squyres
Cisco Systems




--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] [RFC] mca_base_select()

2008-05-12 Thread Ralph Castain
I -think- I may have found the problem here, but don't have a real test case
- try r18429 and see if it works.


On 5/11/08 4:32 PM, "Josh Hursey"  wrote:

>  From the stacktrace, this doesn't look like a problem with
> base_select, but with 'orte_util_encode_pidmap'. You may want to
> start looking there.
> 
> -- Josh
> 
> On May 11, 2008, at 1:30 PM, Lenny Verkhovsky wrote:
> 
>> Hi,
>> I tried r18423 with the rank_file component and got a segfault
>> (I increased the priority of the component if rmaps_rank_file_path exists).
>> 
>> 
>> /home/USERS/lenny/OMPI_ORTE_SMD/bin/mpirun -np 4 -hostfile
>> hostfile_ompi -mca rmaps_rank_file_path rankfile -mca
>> paffinity_base_verbose 5 ./mpi_p_SMD -t bw -output 1 -order 1
>> [witch1:25456] mca:base:select: Querying component [linux]
>> [witch1:25456] mca:base:select: Query of component [linux] set
>> priority to 10
>> [witch1:25456] mca:base:select: Selected component [linux]
>> [witch1:25456] *** Process received signal ***
>> [witch1:25456] Signal: Segmentation fault (11)
>> [witch1:25456] Signal code: Invalid permissions (2)
>> [witch1:25456] Failing at address: 0x2b2875530030
>> [witch1:25456] [ 0] /lib64/libpthread.so.0 [0x2b28759dfc10]
>> [witch1:25456] [ 1] /home/USERS/lenny/OMPI_ORTE_SMD/lib/libopen-
>> pal.so.0 [0x2b28753e2bb6]
>> [witch1:25456] [ 2] /home/USERS/lenny/OMPI_ORTE_SMD/lib/libopen-
>> pal.so.0 [0x2b28753e23b6]
>> [witch1:25456] [ 3] /home/USERS/lenny/OMPI_ORTE_SMD/lib/libopen-
>> pal.so.0 [0x2b28753e22fd]
>> [witch1:25456] [ 4] /home/USERS/lenny/OMPI_ORTE_SMD/lib/libopen-
>> rte.so.0(orte_util_encode_pidmap+0x2f4) [0x2b287527f412]
>> [witch1:25456] [ 5] /home/USERS/lenny/OMPI_ORTE_SMD/lib/libopen-
>> rte.so.0(orte_odls_base_default_get_add_procs_data+0x989)
>> [0x2b28752934f5]
>> [witch1:25456] [ 6] /home/USERS/lenny/OMPI_ORTE_SMD/lib/libopen-
>> rte.so.0(orte_plm_base_launch_apps+0x1a3) [0x2b287529e60b]
>> [witch1:25456] [ 7] /home/USERS/lenny/OMPI_ORTE_SMD/lib/openmpi/
>> mca_plm_rsh.so [0x2b287612f788]
>> [witch1:25456] [ 8] /home/USERS/lenny/OMPI_ORTE_SMD/bin/mpirun
>> [0x4032bf]
>> [witch1:25456] [ 9] /home/USERS/lenny/OMPI_ORTE_SMD/bin/mpirun
>> [0x402b53]
>> [witch1:25456] [10] /lib64/libc.so.6(__libc_start_main+0xf4)
>> [0x2b2875b06154]
>> [witch1:25456] [11] /home/USERS/lenny/OMPI_ORTE_SMD/bin/mpirun
>> [0x402aa9]
>> [witch1:25456] *** End of error message ***
>> Segmentation fault
>> 
>> 
>> 
>> 
>> On Tue, May 6, 2008 at 9:09 PM, Josh Hursey 
>> wrote:
>> This has been committed in r18381
>> 
>> Please let me know if you have any problems with this commit.
>> 
>> Cheers,
>> Josh
>> 
>> On May 5, 2008, at 10:41 AM, Josh Hursey wrote:
>> 
>>> Awesome.
>>> 
>>> The branch is updated to the latest trunk head. I encourage folks to
>>> check out this repository and make sure that it builds on their
>>> system. A normal build of the branch should be enough to find out if
>>> there are any cut-n-paste problems (though I tried to be careful,
>>> mistakes do happen).
>>> 
>>> I haven't heard any problems so this is looking like it will come in
>>> tomorrow after the teleconf. I'll ask again there to see if there
>> are
>>> any voices of concern.
>>> 
>>> Cheers,
>>> Josh
>>> 
>>> On May 5, 2008, at 9:58 AM, Jeff Squyres wrote:
>>> 
 This all sounds good to me!
 
 On Apr 29, 2008, at 6:35 PM, Josh Hursey wrote:
 
> What:  Add mca_base_select() and adjust frameworks & components to
> use
> it.
> Why:   Consolidation of code for general goodness.
> Where: https://svn.open-mpi.org/svn/ompi/tmp-public/jjh-mca-play
> When:  Code ready now. Documentation ready soon.
> Timeout: May 6, 2008 (After teleconf) [1 week]
> 
> Discussion:
> ---
> For a number of years a few developers have been talking about
> creating a MCA base component selection function. For various
> reasons
> this was never implemented. Recently I decided to give it a try.
> 
> A base select function will allow Open MPI to provide completely
> consistent selection behavior for many of its frameworks (18 of 31,
> to be exact, at the moment). The primary goal of this work is to
> improve code maintainability through code reuse. Other benefits also
> result, such as a slightly smaller memory footprint.
> 
> The mca_base_select() function implements the most commonly used
> logic for component selection: select the one component with the
> highest priority and close all of the not-selected components. This
> function can be found at the path below in the branch:
> opal/mca/base/mca_base_components_select.c
> 
> To support this I had to formalize a query() function in the
> mca_base_component_t of the form:
> int mca_base_query_component_fn(mca_base_module_t **module, int
> *priority);
> 
> This function is specified after the open and close component
> functions in this structure so as to allow compatibility with
> fram