Jeff,
Some limited testing shows that srun does seem to work where the
quote-y one did not. I'm working with our admins now to make sure it lets
the prolog work as expected as well.
I'll keep you informed,
Matt
On Thu, Sep 4, 2014 at 1:26 PM, Jeff Squyres (jsquyres)
wrote:
Try this (typed in editor, not tested!):
#! /usr/bin/perl -w
use strict;
use warnings;
use FindBin;
# Specify the path to the prolog.
my $prolog = '--task-prolog=/gpfsm//.task.prolog';
# Build the path to the SLURM srun command.
my $srun_slurm = "${FindBin::Bin}/srun.slurm";
# Add the
Jeff,
Here is the script (with a bit of munging for safety's sake):
#! /usr/bin/perl -w
use strict;
use warnings;
use FindBin;
# Specify the path to the prolog.
my $prolog = '--task-prolog=/gpfsm//.task.prolog';
# Build the path to the SLURM srun command.
my $srun_slurm = "${FindBin::Bin}/srun.slurm";
Still begs the bigger question, though, as others have used script wrappers
before - and I'm not sure we (OMPI) want to be in the business of dictating the
scripting language they can use. :-)
Jeff and I will argue that one out
On Sep 4, 2014, at 7:38 AM, Jeff Squyres (jsquyres) wrote:
Ah, if it's perl, it might be easy. It might just be the difference between
system("...string...") and system(@argv).
Sent from my phone. No type good.
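The distinction Jeff points at can be sketched in shell terms (the values below are illustrative, not from the actual wrapper): when a command line is re-parsed by a shell, an unquoted `;` splits the argument, whereas passing the argument directly preserves it.

```shell
#!/bin/sh
# Illustrative sketch: 'hello;world' stands in for any argument
# containing a ';'.
arg='hello;world'

# Direct argument passing (analogous to Perl's system(@argv)):
# the ';' survives because no shell re-parses it.
printf 'direct:   %s\n' "$arg"

# Re-parsing through a shell (analogous to system("...string...")):
# the unquoted ';' becomes a command separator and truncates the argument.
printf 'reparsed: %s\n' "$(sh -c "echo $arg" 2>/dev/null)"
```

The first line prints the full `hello;world`; the second prints only `hello`, because the inner shell treated everything after the `;` as a separate command.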
On Sep 4, 2014, at 8:35 AM, "Matt Thompson" <fort...@gmail.com> wrote:
Jeff,
I actually misspoke earlier. It turns out our srun is a *Perl* script
around the SLURM srun. I'll speak with our admins to see if they can
massage the script to not interpret the arguments. If possible, I'll ask
them if I can share the script with you (privately or on the list) and
maybe you
On Sep 3, 2014, at 9:27 AM, Matt Thompson wrote:
> Just saw this, sorry. Our srun is indeed a shell script. It seems to be a
> wrapper around the regular srun that runs a --task-prolog. What it
> does...that's beyond my ken, but I could ask. My guess is that it probably
> does something that h
Thanks Matt - that does indeed resolve the "how" question :-)
We'll talk internally about how best to resolve the issue. We could, of course,
add a flag to indicate "we are using a shellscript version of srun" so we know
to quote things, but it would mean another thing that the user would have t
On Tue, Sep 2, 2014 at 8:38 PM, Jeff Squyres (jsquyres)
wrote:
> Matt: Random thought -- is your "srun" a shell script, perchance? (it
> shouldn't be, but perhaps there's some kind of local override...?)
>
> Ralph's point on the call today is that it doesn't matter *how* this
> problem is happen
Jeff,
I tried your script and I saw:
(1027) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpirun
-np 8 ./script.sh
(1028) $
Now, the very first time I ran it, I think I might have noticed a blip of
orted on the nodes, but it disappeared fast. When I re-run the same
command, it ju
Ah, I see the "sh: tcp://10.1.25.142,172.31.1.254,10.12.25.142:41686: No such
file or directory" message now -- I was looking for something like that when I
replied before and missed it.
I really wish I understood why the heck that is happening; it doesn't seem to
make sense.
I can answer that for you right now. The launch of the orted's is what is
failing, and they are "silently" failing at this time. The reason is simple:
1. we are failing due to truncation of the HNP uri at the first semicolon. This
causes the orted to emit an ORTE_ERROR_LOG message and then abort
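The truncation Ralph describes can be reproduced with a stand-in uri (the value below is made up for illustration): when a wrapper re-parses its arguments through a shell, everything after the first unquoted `;` becomes a separate command, which would also account for the `sh: tcp://...: No such file or directory` message seen earlier.

```shell
#!/bin/sh
# Made-up stand-in for an HNP uri: a job id, a ';', then tcp contact info.
uri='1234.0;tcp://10.1.25.142:41686'

# Re-parsing the uri through a shell truncates it at the ';'.
# The shell then tries to run 'tcp://10.1.25.142:41686' as a command,
# which (since it contains a '/') fails with "No such file or directory".
sh -c "echo $uri" 2>/dev/null
```

Only `1234.0` reaches echo; the tcp portion is lost, so the orted never receives a usable contact uri.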
Matt --
We were discussing this issue on our weekly OMPI engineering call today.
Can you check one thing for me? With the un-edited 1.8.2 tarball installation,
I see that you're getting no output for commands that you run -- but also no
errors.
Can you verify and see if your commands are actu
On that machine, it would be SLES 11 SP1. I think it's soon transitioning
to SLES 11 SP3.
I also use Open MPI on an RHEL 6.5 box (possibly soon to be RHEL 7).
On Mon, Sep 1, 2014 at 8:41 PM, Ralph Castain wrote:
Thanks - I expect we'll have to release 1.8.3 soon to fix this in case others
have similar issues. Out of curiosity, what OS are you using?
On Sep 1, 2014, at 9:00 AM, Matt Thompson wrote:
Ralph,
Okay that seems to have done it here (well, minus the
usual shmem_mmap_enable_nfs_warning that our system always generates):
(1033) $ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0
(1034) $
/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug-patch/bin/mpirun
-np 8 ./helloWorld.1
Hmmm... I may see the problem. Would you be so kind as to apply the attached patch to your 1.8.2 code, rebuild, and try again?
Much appreciate the help. Everyone's system is slightly different, and I think you've uncovered one of those differences.
Ralph
uri.diff
Description: Binary data
On Aug 31,
Ralph,
Sorry it took me a bit of time. Here you go:
(1002) $
/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun
--leave-session-attached --debug-daemons --mca oob_base_verbose 10 -mca
plm_base_verbose 5 -np 8 ./helloWorld.182-debug.x
[borg01w063:03815] mca:base:select:( plm
Rats - I also need "-mca plm_base_verbose 5" on there so I can see the cmd line
being executed. Can you add it?
On Aug 29, 2014, at 11:16 AM, Matt Thompson wrote:
Ralph,
Here you go:
(1080) $
/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun
--leave-session-attached --debug-daemons --mca oob_base_verbose 10 -np 8
./helloWorld.182-debug.x
[borg01x142:29232] mca: base: components_register: registering oob
components
[borg01x142:29232]
Okay, something quite weird is happening here. I can't replicate using the
1.8.2 release tarball on a slurm machine, so my guess is that something else is
going on here.
Could you please rebuild the 1.8.2 code with --enable-debug on the configure
line (assuming you haven't already done so), and
Ralph,
For 1.8.2rc4 I get:
(1003) $
/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun
--leave-session-attached --debug-daemons -np 8 ./helloWorld.182.x
srun.slurm: cluster configuration lacks support for cpu binding
srun.slurm: cluster configuration lacks support for cpu bindi
I'm unaware of any changes to the Slurm integration between rc4 and final
release. It sounds like this might be something else going on - try adding
"--leave-session-attached --debug-daemons" to your 1.8.2 command line and let's
see if any errors get reported.
On Aug 28, 2014, at 12:20 PM, Mat
Open MPI List,
I recently encountered an odd bug with Open MPI 1.8.1 and GCC 4.9.1 on our
cluster (reported on this list), and decided to try it with 1.8.2. However,
we seem to be having an issue with Open MPI 1.8.2 and SLURM. Even weirder,
Open MPI 1.8.2rc4 doesn't show the bug. And the bug is: I