Re: [OMPI devel] [OMPI svn] svn:open-mpi r21517

2009-06-24 Thread George Bosilca
Ralph, I didn't do it so you could appreciate it, and there is no need to revisit the logic behind anything. I missed the exception for the orted command, so this whole mess is my fault. As a side note, let me just remind you that the trunk is meant to be more or less stable, so there is absol

Re: [OMPI devel] [OMPI svn] svn:open-mpi r21517

2009-06-24 Thread Ralph Castain
Thanks George - you may ignore the note I just sent! :-) I am happy to revisit the logic behind the prior work, without the time pressure of dealing with it right away. I do appreciate this! Ralph On Jun 24, 2009, at 5:48 PM, bosi...@osl.iu.edu wrote: Author: bosilca Date: 2009-06-24 19:4

Re: [OMPI devel] [OMPI svn] svn:open-mpi r21513

2009-06-24 Thread Ralph Castain
Just to be specific, here is how we handle the orte_launch_agent in rsh that makes it work: /* now get the orted cmd - as specified by user - into our tmp array. * The function returns the location where the actual orted command is * located - usually in the final spot, but s
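
As a minimal standalone sketch of the handling described above (this is not the actual orte_plm_base_setup_orted_cmd implementation; the helper name and the agent string are purely illustrative), splitting a multi-word launch agent into separate argv entries might look roughly like this in plain C:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative helper only: split a whitespace-separated launch-agent
 * string into a NULL-terminated argv-style array, so that a value such
 * as "orted --mca routed_base_verbose 1" becomes separate arguments
 * instead of one quoted token. */
static char **split_launch_agent(const char *agent, int *count)
{
    char *copy = strdup(agent);
    char **argv = NULL;
    int n = 0;

    for (char *tok = strtok(copy, " \t"); tok != NULL;
         tok = strtok(NULL, " \t")) {
        argv = realloc(argv, (n + 2) * sizeof(char *));
        argv[n++] = strdup(tok);
        argv[n] = NULL;
    }
    free(copy);
    *count = n;
    return argv;
}

int main(void)
{
    int agent_argc = 0;
    char **agent_argv =
        split_launch_agent("orted --mca routed_base_verbose 1", &agent_argc);

    for (int i = 0; i < agent_argc; i++) {
        printf("argv[%d] = %s\n", i, agent_argv[i]);
    }
    return 0;
}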

Re: [OMPI devel] [OMPI svn] svn:open-mpi r21513

2009-06-24 Thread Ralph Castain
I believe you are using a bad example here George. If you look closely at the code, you will see that we treat the orte_launch_agent separately from everything else - it gets passed through the following code: int orte_plm_base_setup_orted_cmd(int *argc, char ***argv) { int i, loc;

Re: [OMPI devel] [OMPI svn] svn:open-mpi r21513

2009-06-24 Thread George Bosilca
Just for the sake of it. A funny command line to try: [bosilca@dancer ~]$ mpirun --mca routed_base_verbose 0 --leave-session-attached -np 1 --mca orte_launch_agent "orted --mca routed_base_verbose 1" uptime [node03:22355] [[14661,0],1] routed_linear: init routes for daemon job [14661,0]

Re: [OMPI devel] [OMPI svn] svn:open-mpi r21513

2009-06-24 Thread George Bosilca
On Jun 24, 2009, at 17:41, Jeff Squyres wrote: - [14:38] svbu-mpi:~/svn/ompi/orte % mpirun --mca plm_base_verbose 100 --leave-session-attached -np 1 --mca orte_launch_agent "$bogus/bin/orted -s" uptime ...lots of output... srun --nodes=1 --ntasks=1 --kill-on-bad-exit --nodelist=svbu-

Re: [OMPI devel] [OMPI svn] svn:open-mpi r21513

2009-06-24 Thread Jeff Squyres
Hmm. Doesn't seem to work for me... First, without the quotes -- a single argument ($bogus is the tree where my OMPI is installed): - [14:36] svbu-mpi:~/svn/ompi/orte % mpirun --mca plm_base_verbose 100 --leave-session-attached -np 1 --mca orte_launch_agent $bogus/bin/orted uptime .

Re: [OMPI devel] [OMPI svn] svn:open-mpi r21513

2009-06-24 Thread Ralph Castain
If you read the original comment, we had concluded that there were no multi-word options that were being passed back to the orteds. All multi-word options known to us at that time, and still today, -only- apply to the HNP. Hence, dropping them has zero impact. To update you on the history:

Re: [OMPI devel] [OMPI svn] svn:open-mpi r21513

2009-06-24 Thread George Bosilca
I can't guarantee this for all PLMs, but I can confirm that rsh and slurm (1.3.12) work well with this. We tried with and without Open MPI, and the outcome is the same. [bosilca@dancer c]$ srun -n 4 echo "1 2 3 4 5 it works" 1 2 3 4 5 it works 1 2 3 4 5 it works 1 2 3 4 5 it works 1 2 3 4 5 it wo

Re: [OMPI devel] [OMPI svn] svn:open-mpi r21513

2009-06-24 Thread Ralph Castain
The problem is that they don't get there properly. We have been through this debate multiple times for several years - every so often, someone decides to try this again. The problem is that the mca param that reaches the other end has quotes around it in some environments, and doesn't in ot

Re: [OMPI devel] [OMPI svn] svn:open-mpi r21513

2009-06-24 Thread George Bosilca
Then I guess you will be happy to learn that instead of eating your multi-word arguments we now pass them to your srun as expected. george. On Jun 24, 2009, at 16:18, Jeff Squyres wrote: As a non-rsh'er (I run all my jobs in SLURM), this is very important to me. Please revert. On Ju

Re: [OMPI devel] [OMPI svn] svn:open-mpi r21513

2009-06-24 Thread Jeff Squyres
As a non-rsh'er (I run all my jobs in SLURM), this is very important to me. Please revert. On Jun 24, 2009, at 4:15 PM, Ralph Castain wrote: Yo George This commit is going to break non-rsh launchers. While it is true that the rsh launcher may handle multi-word options by putting them

Re: [OMPI devel] [OMPI svn] svn:open-mpi r21513

2009-06-24 Thread Ralph Castain
Yo George: This commit is going to break non-rsh launchers. While it is true that the rsh launcher may handle multi-word options by putting them in quotes, we specifically avoided it here because it breaks SLURM, Torque, and others. This is why we specifically put the inclusion of multi-word optio

Re: [OMPI devel] trac ticket 1944 and pending sends

2009-06-24 Thread Eugene Loh
George Bosilca wrote: Here is a simple fix for both problems. Enforce a reasonable limit on the number of fragments in the BTL free list (1K should be more than enough), and make sure the fifo has a size equal to p * number_of_allowed_fragments_in_the_free_list, where p is the number of l
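
For a concrete reading of that sizing rule (the numbers below are only an illustration, not values taken from the thread), a trivial C calculation:

#include <stdio.h>

int main(void)
{
    /* Illustrative numbers only. */
    const int free_list_max = 1024; /* proposed cap on fragments per free list */
    const int local_procs   = 16;   /* "p": processes sharing the node */

    /* Each FIFO must be able to hold every fragment that every local
     * peer could have outstanding at once, so it never overflows. */
    printf("fifo size = %d * %d = %d slots\n",
           local_procs, free_list_max, local_procs * free_list_max);
    return 0;
}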

Re: [OMPI devel] trac ticket 1944 and pending sends

2009-06-24 Thread Eugene Loh
George Bosilca wrote: In other words, as long as a queue is peer-based (peer, not peers), the management of the pending send list was doing what it was supposed to, and there was no possibility of deadlock. I disagree. It is true that I can fill up a remote FIFO with sends. In such a case

Re: [OMPI devel] trac ticket 1944 and pending sends

2009-06-24 Thread Brian W. Barrett
On Wed, 24 Jun 2009, Eugene Loh wrote: Brian Barrett wrote: Or go to what I proposed and USE A LINKED LIST! (as I said before, not an original idea, but one I think has merit) Then you don't have to size the fifo, because there isn't a fifo. Limit the number of send fragments any one p

Re: [OMPI devel] trac ticket 1944 and pending sends

2009-06-24 Thread Ralph Castain
I'm not sure the two questions in your second item are separable, Eugene. I fear that the only real solution will be to rearch the sm BTL, which was originally a flawed design. I think you did a great job of building on it, but we are now finding that the foundation is just too shaky, so no matter

Re: [OMPI devel] trac ticket 1944 and pending sends

2009-06-24 Thread Eugene Loh
Brian Barrett wrote: Or go to what I proposed and USE A LINKED LIST! (as I said before, not an original idea, but one I think has merit) Then you don't have to size the fifo, because there isn't a fifo. Limit the number of send fragments any one proc can allocate and the only place memor

Re: [OMPI devel] sm BTL flow management

2009-06-24 Thread Bryan Lally
Ralph Castain wrote: Be happy to put it through the wringer... :-) My wringer is available, too. - Bryan -- Bryan Lally, la...@lanl.gov 505.667.9954 CCS-2 Los Alamos National Laboratory Los Alamos, New Mexico

Re: [OMPI devel] trac ticket 1944 and pending sends

2009-06-24 Thread Ralph Castain
I'm afraid that this solution doesn't pass the acid test - our reproducers still lock up if we set the #frags to 1K and fifo size to p*that. In other words, adding: -mca btl_sm_free_list_max 1024 -mca btl_sm_fifo_size p*1024 where p=ppn still causes our reproducers to hang. Sorry... sigh. *From

Re: [OMPI devel] trac ticket 1944 and pending sends

2009-06-24 Thread Brian Barrett
Or go to what I proposed and USE A LINKED LIST! (as I said before, not an original idea, but one I think has merit) Then you don't have to size the fifo, because there isn't a fifo. Limit the number of send fragments any one proc can allocate and the only place memory can grow without bo
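
As a rough standalone sketch of that idea (a growable per-receiver list of fragments plus a per-sender allocation cap), written in plain C and deliberately ignoring the shared-memory and atomicity details the real sm BTL would need; every name here is invented for illustration:

#include <stdio.h>
#include <stdlib.h>

/* Invented types for illustration; not the sm BTL's actual structures. */
typedef struct frag {
    int          src_rank;   /* which local peer allocated this fragment */
    struct frag *next;
} frag_t;

typedef struct {
    frag_t *head;            /* oldest pending fragment */
    frag_t *tail;            /* newest pending fragment */
} frag_list_t;

enum { MAX_FRAGS_PER_PEER = 1024 };   /* per-sender allocation cap */
static int allocated[64];             /* outstanding frags per local rank
                                       * (toy limit of 64 local ranks) */

/* Append a fragment: the list grows as needed, so there is no FIFO to
 * size and no "queue full" failure mode on the receive side. */
static void frag_list_append(frag_list_t *q, frag_t *f)
{
    f->next = NULL;
    if (q->tail) q->tail->next = f;
    else         q->head = f;
    q->tail = f;
}

/* Remove the oldest fragment, or return NULL if the list is empty. */
static frag_t *frag_list_pop(frag_list_t *q)
{
    frag_t *f = q->head;
    if (f) {
        q->head = f->next;
        if (!q->head) q->tail = NULL;
    }
    return f;
}

/* Sender side: memory growth is bounded by refusing to allocate more
 * than MAX_FRAGS_PER_PEER fragments per process; callers retry later. */
static frag_t *frag_alloc(int my_rank)
{
    if (allocated[my_rank] >= MAX_FRAGS_PER_PEER) return NULL;
    frag_t *f = calloc(1, sizeof(*f));
    if (f) { f->src_rank = my_rank; allocated[my_rank]++; }
    return f;
}

int main(void)
{
    frag_list_t queue = { NULL, NULL };

    frag_t *f = frag_alloc(0);
    frag_list_append(&queue, f);

    frag_t *g = frag_list_pop(&queue);
    printf("popped fragment from rank %d\n", g ? g->src_rank : -1);
    free(g);
    return 0;
}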

Re: [OMPI devel] Does open MPI support nodes behind NAT or Firewall

2009-06-24 Thread Ralph Castain
Check the devel mailing list over the last few weeks - I believe I and others provided some fairly detailed explanation of what would need to be done when an identical question was asked. It is definitely a development project, not just a configuration issue. On Jun 24, 2009, at 5:43 AM, J

Re: [OMPI devel] Does open MPI support nodes behind NAT or Firewall

2009-06-24 Thread Jeff Squyres
On Jun 10, 2009, at 9:23 AM, Anjin Pradhan wrote: I wanted to know whether Open MPI supported nodes that are behind a NAT or a firewall. If it doesn't do this by default, can anyone let me know how I should go about making Open MPI support NAT and firewalls. Sorry for the delay in replying.

Re: [OMPI devel] trac ticket 1944 and pending sends

2009-06-24 Thread George Bosilca
In other words, as long as a queue is peer-based (peer, not peers), the management of the pending send list was doing what it was supposed to, and there was no possibility of deadlock. With the new code, as a third party can fill up a remote queue, getting a fragment back [as you stated] bec

Re: [OMPI devel] OMPI performance competitiveness

2009-06-24 Thread neeraj
Hi Eugene, We have licenses for HPMPI, IntelMPI, and SpecMPI. We usually do comparison tests periodically; recently we ran a collective performance test on our cluster. I would like to help the Open MPI community benchmark their collectives and other calls if given the opp