Ralph,
I didn't do it for appreciation, and there is no need to revisit the logic behind anything. I missed the exception for the orted command, so this whole mess is my fault.
As a side note, let me just remind you that the trunk is meant to be
more or less stable, so there is absol
Thanks George - you may ignore the note I just sent! :-) I am happy to
revisit the logic behind the prior work, without the time pressure of
dealing with it right away.
I do appreciate this!
Ralph
On Jun 24, 2009, at 5:48 PM, bosi...@osl.iu.edu wrote:
Author: bosilca
Date: 2009-06-24 19:4
Just to be specific, here is how we handle the orte_launch_agent in the rsh launcher, which makes it work:
/* now get the orted cmd - as specified by user - into our tmp array.
 * The function returns the location where the actual orted command is
 * located - usually in the final spot, but s
I believe you are using a bad example here, George. If you look closely
at the code, you will see that we treat the orte_launch_agent
separately from everything else - it gets passed through the following
code:
int orte_plm_base_setup_orted_cmd(int *argc, char ***argv)
{
    int i, loc;
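(The snippet above is truncated. For illustration only, here is a minimal, self-contained sketch of what such a routine does; the names and the strtok-based parsing are hypothetical, not the actual OMPI implementation. It splits a possibly multi-word launch agent into an argv array and records where the orted command landed.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the real setup routine: split a user-supplied,
 * possibly multi-word launch agent on whitespace and record where the
 * orted executable itself ended up. */
static char **split_launch_agent(const char *agent, int *argc, int *orted_loc)
{
    char *copy = strdup(agent);
    char **argv = NULL;
    int n = 0;

    *orted_loc = -1;
    for (char *tok = strtok(copy, " \t"); NULL != tok; tok = strtok(NULL, " \t")) {
        argv = realloc(argv, (n + 2) * sizeof(char *));
        argv[n] = strdup(tok);
        /* The orted command is usually the first token of the agent string,
         * but a wrapper (e.g. "valgrind orted ...") could push it later. */
        if (*orted_loc < 0 && NULL != strstr(argv[n], "orted")) {
            *orted_loc = n;
        }
        argv[++n] = NULL;
    }
    free(copy);
    *argc = n;
    return argv;
}

int main(void)
{
    int ac, loc;
    char **av = split_launch_agent("orted --mca routed_base_verbose 1", &ac, &loc);
    for (int i = 0; i < ac; i++) {
        printf("argv[%d] = %s\n", i, av[i]);
    }
    printf("orted located at index %d\n", loc);
    return 0;
}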
Just for the sake of it, a funny command line to try:
[bosilca@dancer ~]$ mpirun --mca routed_base_verbose 0 --leave-session-attached -np 1 --mca orte_launch_agent "orted --mca routed_base_verbose 1" uptime
[node03:22355] [[14661,0],1] routed_linear: init routes for daemon job [14661,0]
On Jun 24, 2009, at 17:41, Jeff Squyres wrote:
-
[14:38] svbu-mpi:~/svn/ompi/orte % mpirun --mca plm_base_verbose 100 --leave-session-attached -np 1 --mca orte_launch_agent "$bogus/bin/orted -s" uptime
...lots of output...
srun --nodes=1 --ntasks=1 --kill-on-bad-exit --nodelist=svbu-
Hmm. Doesn't seem to work for me...
First, without the quotes -- a single argument ($bogus is the tree
where my OMPI is installed):
-
[14:36] svbu-mpi:~/svn/ompi/orte % mpirun --mca plm_base_verbose 100 --leave-session-attached -np 1 --mca orte_launch_agent $bogus/bin/orted uptime
.
If you read the original comment, we had concluded that there were no
multi-word options that were being passed back to the orteds. All
multi-word options known to us at that time, and still today, -only-
apply to the HNP. Hence, dropping them has zero impact.
To update you on the history:
I can't guarantee this for all PLMs, but I can confirm that rsh and slurm (1.3.12) work well with this.
We tried it with and without Open MPI, and the outcome is the same.
[bosilca@dancer c]$ srun -n 4 echo "1 2 3 4 5 it works"
1 2 3 4 5 it works
1 2 3 4 5 it works
1 2 3 4 5 it works
1 2 3 4 5 it works
The problem is that they don't get there properly. We have been
through this debate multiple times for several years - every so often,
someone decides to try this again.
The problem is that the mca param that reaches the other end has quotes around it in some environments, and doesn't in others.
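(A tiny self-contained illustration of that failure mode - hypothetical C, not the actual plumbing: if the quotes survive to the remote side, a naive whitespace split produces a first token that starts with a literal quote character, so no such executable can be found.)

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Illustration only: the same launch-agent value as it might arrive
     * on the remote end in two different environments. */
    char quoted[]   = "\"orted --mca routed_base_verbose 1\"";
    char unquoted[] = "orted --mca routed_base_verbose 1";

    /* A naive whitespace split (stand-in for the real parsing): the
     * quoted form yields a first token of "orted - with a literal quote
     * character - which is not an executable that exists anywhere. */
    printf("first token, quoted form:   %s\n", strtok(quoted, " "));
    printf("first token, unquoted form: %s\n", strtok(unquoted, " "));
    return 0;
}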
Then I guess you will be happy to learn that instead of eating your multi-word arguments we now pass them to your srun as expected.
george.
On Jun 24, 2009, at 16:18, Jeff Squyres wrote:
As a non-rsh'er (I run all my jobs in SLURM), this is very important
to me.
Please revert.
On Jun 24, 2009, at 4:15 PM, Ralph Castain wrote:
Yo George
This commit is going to break non-rsh launchers. While it is true that the
rsh launcher may handle multi-word options by putting them in quotes, we
specifically avoided it here because it breaks SLURM, Torque, and others.
This is why we specifically put the inclusion of multi-word options
George Bosilca wrote:
Here is a simple fix for both problems. Enforce a reasonable limit on the number of fragments in the BTL free list (1K should be more than enough), and make sure the fifo has a size equal to p * number_of_allowed_fragments_in_the_free_list, where p is the number of local processes.
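(To make the proposed sizing concrete, a small sketch - the 1K cap is from the text above; the sample values of p are made up:)

#include <stdio.h>

int main(void)
{
    const int free_list_max = 1024;      /* proposed cap on fragments per free list */

    /* fifo entries = p * allowed fragments, so even if every local peer
     * posts its full allotment of fragments, the fifo cannot be overrun. */
    for (int p = 2; p <= 16; p *= 2) {   /* p = number of local processes */
        printf("p = %2d  ->  fifo size = %5d entries\n", p, p * free_list_max);
    }
    return 0;
}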
George Bosilca wrote:
In other words, as long as a queue is peer based (peer not peers),
the management of the pending send list was doing what it was
supposed to, and there was no possibility of deadlock.
I disagree. It is true that I can fill up a remote FIFO with sends. In
such a case
On Wed, 24 Jun 2009, Eugene Loh wrote:
Brian Barrett wrote:
Or go to what I proposed and USE A LINKED LIST! (As I said before, not an original idea, but one I think has merit.) Then you don't have to size the fifo, because there isn't a fifo. Limit the number of send fragments any one proc can allocate, and the only place memory can grow without bo
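(A minimal sketch of that idea - hypothetical types and names, ignoring the shared-memory and atomicity concerns of the real sm BTL: a linked list of fragments with a hard cap on how many any one proc may allocate, so nothing needs a fixed fifo size.)

#include <stdio.h>
#include <stdlib.h>

#define MAX_FRAGS_PER_PROC 1024   /* cap on send fragments any one proc may allocate */

typedef struct frag {
    struct frag *next;
    int          payload;         /* stand-in for the real fragment contents */
} frag_t;

typedef struct {
    frag_t *head;
    frag_t *tail;                 /* linked list replaces the fixed-size fifo */
    int     allocated;            /* fragments currently held by this proc */
} send_queue_t;

/* Refuse the allocation once the proc hits its cap; the caller must
 * hold the send locally and retry after fragments are returned. */
static frag_t *frag_alloc(send_queue_t *q)
{
    if (q->allocated >= MAX_FRAGS_PER_PROC) {
        return NULL;
    }
    frag_t *f = calloc(1, sizeof(*f));
    if (NULL != f) {
        q->allocated++;
    }
    return f;
}

static void frag_enqueue(send_queue_t *q, frag_t *f)
{
    f->next = NULL;
    if (q->tail) q->tail->next = f;
    else         q->head = f;
    q->tail = f;
}

static frag_t *frag_dequeue(send_queue_t *q)
{
    frag_t *f = q->head;
    if (f) {
        q->head = f->next;
        if (NULL == q->head) q->tail = NULL;
    }
    return f;
}

int main(void)
{
    send_queue_t q = { NULL, NULL, 0 };
    frag_t *f = frag_alloc(&q);
    if (f) {
        f->payload = 42;
        frag_enqueue(&q, f);
    }
    frag_t *out = frag_dequeue(&q);
    if (out) {
        printf("dequeued payload %d\n", out->payload);
        free(out);
        q.allocated--;            /* returning a fragment frees up the cap */
    }
    return 0;
}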
I'm not sure the two questions in your second item are separable, Eugene. I fear that the only real solution will be to re-architect the sm BTL, which was originally a flawed design. I think you did a great job of building on it, but we are now finding that the foundation is just too shaky, so no matter
Ralph Castain wrote:
Be happy to put it through the wringer... :-)
My wringer is available, too.
- Bryan
--
Bryan Lally, la...@lanl.gov
505.667.9954
CCS-2
Los Alamos National Laboratory
Los Alamos, New Mexico
I'm afraid that this solution doesn't pass the acid test - our reproducers
still lock up if we set the #frags to 1K and fifo size to p*that. In other
words, adding:
-mca btl_sm_free_list_max 1024 -mca btl_sm_fifo_size p*1024
where p=ppn still causes our reproducers to hang.
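(For example, assuming a hypothetical 8 processes per node, so p*1024 = 8192, that translates to:
mpirun -np 8 -mca btl_sm_free_list_max 1024 -mca btl_sm_fifo_size 8192 ./reproducer
where ./reproducer stands in for the actual test program.)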
Sorry...sigh.
Check the devel mailing list over the last few weeks - I believe I and
others provided some fairly detailed explanation of what would need to
be done when an identical question was asked. It is definitely a
development project, not just a configuration issue.
On Jun 24, 2009, at 5:43 AM, J
On Jun 10, 2009, at 9:23 AM, Anjin Pradhan wrote:
I wanted to know whether Open MPI supports nodes that are behind a NAT or a firewall. If it doesn't do this by default, can anyone let me know how I should go about making Open MPI support NAT and firewalls?
Sorry for the delay in replying.
In other words, as long as a queue is peer based (peer not peers), the
management of the pending send list was doing what it was supposed to,
and there was no possibility of deadlock. With the new code, as a
third party can fill up a remote queue, getting a fragment back [as
you stated] bec
Hi Eugene,
We have licenses for HPMPI, IntelMPI, and SpecMPI. We usually run comparison tests periodically; for example, we recently ran a collective performance test on our cluster.
I would like to help the Open MPI community benchmark their collectives and other calls if given the opportunity.