Re: [OMPI users] running OpenMPI jobs (either 1.10.1 or 1.8.7) on SoGE more problems

2016-03-17 Thread Dave Love
Ralph Castain writes:

> That’s an SGE error message - looks like your tmp file system on one
> of the remote nodes is full.

Yes; surely that just needs to be fixed, and I'd expect the host not to
accept jobs in that state.  It's not just going to break ompi.

> We don’t control where SGE puts its
> files, but it might be that your backend nodes are having issues with
> us doing a tree-based launch (i.e., where each backend daemon launches
> more daemons along the tree).

I doubt that's relevant.  You just need space for the SGE tmpdir, which
is where the ompi session directory will go, for instance.  Also, too
many things don't recognize TMPDIR and will fail if they can't write to
/tmp specifically, even if there's reason to avoid /tmp for tmpdir.
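
A quick way to check the point above on a suspect node is to look at the free space on both the scheduler's tmpdir and /tmp itself. This is only a sketch: free_kb is a made-up helper name, not an SGE or Open MPI command, and it assumes a POSIX df and awk are available on the node.

```shell
#!/bin/sh
# Sketch: report free space (in KiB) on the file system holding a directory.
# free_kb is a hypothetical helper name, not an SGE or Open MPI command.
free_kb() {
    # df -P forces the POSIX one-line-per-file-system layout and -k reports
    # 1024-byte blocks; field 4 of the second line is the available space.
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Check the scheduler-provided TMPDIR (falling back to /tmp if unset)
# and /tmp itself, since some tools write to /tmp regardless of TMPDIR.
for dir in "${TMPDIR:-/tmp}" /tmp; do
    printf '%s: %s KiB free\n' "$dir" "$(free_kb "$dir")"
done
```

Run on the node that reported "No space left on device", a value at or near zero for /tmp would be consistent with the qrsh_error message in the original post.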


Re: [OMPI users] running OpenMPI jobs (either 1.10.1 or 1.8.7) on SoGE more problems

2016-03-16 Thread Ralph Castain
That’s an SGE error message - looks like your tmp file system on one of the 
remote nodes is full. We don’t control where SGE puts its files, but it might 
be that your backend nodes are having issues with us doing a tree-based launch 
(i.e., where each backend daemon launches more daemons along the tree).

You could try turning the tree-based launch “off” and see if that helps:
"-mca plm_rsh_no_tree_spawn 1"
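
As a rough illustration of how that option might be applied in a job script: MCA parameters can be given either on the mpirun command line or through an OMPI_MCA_<name> environment variable. The executable name ./my_mpi_app below is a placeholder, not something from this thread, and the guard around mpirun is only there so the fragment does nothing harmful where Open MPI is not installed.

```shell
#!/bin/sh
# Setting OMPI_MCA_<name> has the same effect as passing
# "-mca <name> <value>" to mpirun, as long as the option is not
# also given on the command line.
export OMPI_MCA_plm_rsh_no_tree_spawn=1

# ./my_mpi_app is a placeholder for the real executable.
app=./my_mpi_app
if command -v mpirun >/dev/null 2>&1 && [ -x "$app" ]; then
    # Same effect as: mpirun -mca plm_rsh_no_tree_spawn 1 ./my_mpi_app
    mpirun "$app"
else
    echo "mpirun or $app not found; nothing launched" >&2
fi
```

Whether this changes anything for the "No space left on device" error is an open question; it only removes the tree-based launch as a variable.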


> On Mar 16, 2016, at 3:50 PM, Lane, William wrote:
> 
> I'm getting an error message early on:
> [csclprd3-0-11:17355] [[36373,0],17] plm:rsh: using
> "/opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose" for launching
> unable to write to file /tmp/285019.1.verylong.q/qrsh_error: No space left on
> device
> [csclprd3-6-10:18352] [[36373,0],21] plm:rsh: using
> "/opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose" for launching
> 
> According to the OpenMPI FAQ:
> 
> 'You may want to alter other parameters, but the important one is 
> "control_slaves", specifying that the environment has "tight integration". 
> Note also the lack of a start or stop procedure. The tight integration means 
> that mpirun automatically picks up the slot count to use as a default in 
> place of the '-np' argument, picks up a host file, spawns remote processes 
> via 'qrsh' so that SGE can control and monitor them, and creates and destroys 
> a per-job temporary directory ($TMPDIR), in which Open MPI's directory will 
> be created (by default).'
> 
> When I look at my OpenMPI environment there is no $TMPDIR environment 
> variable.
> 
> How does OpenMPI determine where it's going to put the "per-job temporary 
> directory ($TMPDIR)"? Does it use an SoGE defined environment variable? Is 
> the host file used by OpenMPI spawned in this $TMPDIR temporary directory?
> 
> Bill L.
> 
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/03/28719.php 
> 


[OMPI users] running OpenMPI jobs (either 1.10.1 or 1.8.7) on SoGE more problems

2016-03-16 Thread Lane, William
I'm getting an error message early on:
[csclprd3-0-11:17355] [[36373,0],17] plm:rsh: using "/opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose" for launching
unable to write to file /tmp/285019.1.verylong.q/qrsh_error: No space left on device
[csclprd3-6-10:18352] [[36373,0],21] plm:rsh: using "/opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose" for launching

According to the OpenMPI FAQ:

'You may want to alter other parameters, but the important one is 
"control_slaves", specifying that the environment has "tight integration". Note 
also the lack of a start or stop procedure. The tight integration means that 
mpirun automatically picks up the slot count to use as a default in place of 
the '-np' argument, picks up a host file, spawns remote processes via 'qrsh' so 
that SGE can control and monitor them, and creates and destroys a per-job 
temporary directory ($TMPDIR), in which Open MPI's directory will be created 
(by default).'

When I look at my OpenMPI environment there is no $TMPDIR environment variable.

How does OpenMPI determine where it's going to put the "per-job temporary 
directory ($TMPDIR)"? Does it use an SoGE defined environment variable? Is the 
host file used by OpenMPI spawned in this $TMPDIR temporary directory?
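
One way to see what the scheduler actually provides is to print the relevant variables from inside the job script itself rather than from a login shell. The sketch below assumes a Grid Engine job environment, where JOB_ID, TMPDIR, NSLOTS and PE_HOSTFILE are normally exported for the job; show_var is a made-up helper name. Outside a job the variables are simply reported as unset.

```shell
#!/bin/sh
# Sketch: print the Grid Engine variables relevant to the question.
# Assumption: this runs inside a job script, where the execution daemon
# normally exports these variables; elsewhere they show up as <unset>.
show_var() {
    # $1 is a variable name; eval is acceptable here because the names
    # passed in are the fixed ones listed in the loop below.
    eval "val=\${$1-}"
    printf '%s=%s\n' "$1" "${val:-<unset>}"
}

for name in JOB_ID TMPDIR NSLOTS PE_HOSTFILE; do
    show_var "$name"
done
```

If TMPDIR shows up here but not in an interactive shell, that would suggest the variable is set per job rather than missing from the installation.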

Bill L.