Re: [OMPI users] Specifying -wdir

2008-02-21 Thread Kevin Durda
Hi Roberto,

I think that you can do what you want if you use an appfile with something
like this:

-host node1,node2,node3 -np 6 -wdir /WorkingDir/ appname
-host node4 -np 2 -wdir /DifferentWorkingDir/ appname

Then run your program using "mpirun --app appfilename".
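
As a rough sketch of the appfile approach above, the two lines can be written to a file and handed to mpirun. The node names, working directories and "appname" are the placeholders from this message; the appfile path is a hypothetical choice, and the mpirun command is only printed so the sketch runs on a machine without Open MPI.

```shell
# Hypothetical appfile location; adjust to taste.
appfile="${TMPDIR:-/tmp}/wdir_example.appfile"

# One line per group of hosts, each with its own -wdir.
printf '%s\n' \
    '-host node1,node2,node3 -np 6 -wdir /WorkingDir/ appname' \
    '-host node4 -np 2 -wdir /DifferentWorkingDir/ appname' > "$appfile"

# Sum the -np values over all appfile lines.
total_np=$(awk '{ for (i = 1; i < NF; i++) if ($i == "-np") sum += $(i + 1) }
                END { print sum + 0 }' "$appfile")
echo "appfile describes $total_np processes"

# Printed rather than executed, so no MPI installation is needed here.
echo "mpirun --app $appfile"
```

Both lines belong to the same launch, so the per-line -np values add up to the total number of processes.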

Kevin


On Wed, Feb 20, 2008 at 5:30 PM, R C Pasianot wrote:

>
>  Hello list,
>
>  Is there a way to specify different working directories for
>  different hosts? I mean, for a single application launched
>  from one of them. It seems I can't do that in the hostfile ... :/
>
>  Thanks in advance,
>
>  Roberto
>
>  PS: I was unable to find the answer in the archives, sorry if
>  it's too trivial.
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Openmpi with SGE

2008-02-21 Thread Pak Lui
I am not quite sure. It seems that your AR (advance reservation) 
snapshot3 build is a bit new, and it may be a problem coming from it. I 
am not quite familiar with this new SGE feature. I'd ping the gridengine 
list to check on that error message coming from execd.


Neeraj Chourasia wrote:

Hello everyone,

I am facing a problem when calling mpirun in a loop under SGE.
My SGE version is SGE6.1AR_snapshot3. The script I am submitting
via SGE is:


x
let i=0

while [ $i -lt 100 ]
do
    echo ""
    echo "Iteration :$i"
    /usr/local/openmpi-1.2.4/bin/mpirun -np $NP -hostfile $TMP/machines send
    let "i+=1"
    echo ""
done


The above script runs well for 15-20 iterations and then fails with
the following message:


--- Error Message ---
error: executing task of job 3869 failed: execution daemon on host "n101" didn't accept task

[n199:11989] ERROR: A daemon on node n101 failed to start as expected.
[n199:11989] ERROR: There may be more information available from
[n199:11989] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[n199:11989] ERROR: If the problem persists, please restart the
[n199:11989] ERROR: Grid Engine PE job
[n199:11989] ERROR: The daemon exited unexpectedly with status 1.
---

When I ssh to n101, neither orted nor qrsh_starter is running. While
checking its spool file, I came across the following message:
--- Execd spool Error Message ---
|execd|n101|E|no free queue for job 3869 of user neeraj@n199 (localhost = n101)

---

What could be the reason for it?
While checking the mailing list, I came across the following link:
http://www.open-mpi.org/community/lists/users/2007/03/2771.php
but I don't think it's the same problem. Any help is appreciated.
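
To pin down exactly which iteration fails, the loop could stop at the first non-zero exit status instead of carrying on. The following is only a sketch: run_iterations is a hypothetical helper, not part of Open MPI or SGE, and it diagnoses the failure rather than fixing the execd error.

```shell
# run_iterations COUNT CMD [ARGS...]
# Runs CMD up to COUNT times. Stops at the first non-zero exit status,
# reports the failing iteration on stderr, and returns that status.
run_iterations() {
    count=$1
    shift
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "Iteration :$i"
        status=0
        "$@" || status=$?
        if [ "$status" -ne 0 ]; then
            echo "iteration $i failed with status $status" >&2
            return "$status"
        fi
        i=$((i + 1))
    done
    return 0
}
```

With the command from the script above, this would be called as:
run_iterations 100 /usr/local/openmpi-1.2.4/bin/mpirun -np "$NP" -hostfile "$TMP/machines" send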

Regards
Neeraj







--

- Pak Lui
pak@sun.com


Re: [OMPI users] Specifying -wdir

2008-02-21 Thread R.C.Pasianot

 Thanks a lot Kevin,

 It seemed to me that something like your suggestion would
 launch two unrelated "appname", ..., luckily I was wrong ;-).
 Indeed it does what I want.

 Thanks again,

 Roberto

On Thu, 21 Feb 2008, Kevin Durda wrote:

> Hi Roberto,
>
> I think that you can do what you want if you use an appfile with something
> like this:
>
> -host node1,node2,node3 -np 6 -wdir /WorkingDir/ appname
> -host node4 -np 2 -wdir /DifferentWorkingDir/ appname
>
> Then run your program using "mpirun --app appfilename".
>
> Kevin




[OMPI users] ofa-default-subnet-gid

2008-02-21 Thread Bill Wichser


In trying to get openmpi up and running on a new cluster, I came across
this error about having both of my IB switches set to the same
subnet-gid.  Snooping around on my hosts which run the opensm daemon, I
indeed found this to be the case in the /var/log/osm-ib[0-1].log files,
after giving up on finding it with ibstat, which showed these values to be
different, at least in the second part of the GID.


Before I try and pursue how to actually change this value for the opensm 
daemon, I do have a question.


Since both of my hosts are connected to each switch, how am I to 
instruct openmpi to use port0?  I'm trying to use port0 as the MPI 
network and port1 as the storage network.  Is there something that I 
need to add someplace forcing connections only to some default-subnet-gid?


Thanks,
Bill


Re: [OMPI users] ofa-default-subnet-gid

2008-02-21 Thread George Bosilca

Here are the MCA parameters that you can use:

 MCA btl: parameter "btl_openib_if_include" (current value: )
          Comma-delimited list of HCAs/ports to be used (e.g.
          "mthca0,mthca1:2"; empty value means to use all ports found).
          Mutually exclusive with btl_openib_if_exclude.
 MCA btl: parameter "btl_openib_if_exclude" (current value: )
          Comma-delimited list of HCAs/ports to be excluded (empty value
          means to not exclude any ports).  Mutually exclusive with
          btl_openib_if_include.

  george.

On Feb 21, 2008, at 2:45 PM, Bill Wichser wrote:



In trying to get openmpi up and running on a new cluster, I came across
this error about having both of my IB switches set to the same
subnet-gid.  Snooping around on my hosts which run the opensm daemon, I
indeed found this to be the case in the /var/log/osm-ib[0-1].log files,
giving up finding it with ibstat which showed these values to be
different, at least the second part of the GID.

Before I try and pursue how to actually change this value for the opensm
daemon, I do have a question.

Since both of my hosts are connected to each switch, how am I to
instruct openmpi to use port0?  I'm trying to use port0 as the MPI
network and port1 as the storage network.  Is there something that I
need to add someplace forcing connections only to some default-subnet-gid?


Thanks,
Bill


Re: [OMPI users] ofa-default-subnet-gid

2008-02-21 Thread Bill Wichser

Thanks George!

I've added:
--mca btl_openib_if_include mthca0
--mca btl_openib_warn_default_gid_prefix 0

and hopefully it'll do the right thing without any warnings.
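
For reference, the two options above might be combined into a full command line along these lines. This is a sketch only: build_mpirun_cmd is a hypothetical helper, "-np 4 ./a.out" is a made-up application, and a later message in this thread notes that btl_openib_if_include is not available in the v1.2 series.

```shell
# build_mpirun_cmd IF_INCLUDE [MPIRUN_ARGS...]
# Prints (does not run) an mpirun command line that restricts the openib
# BTL to the given HCA/port list and disables the default GID prefix warning.
build_mpirun_cmd() {
    if_include=$1
    shift
    printf 'mpirun --mca btl_openib_if_include %s' "$if_include"
    printf ' --mca btl_openib_warn_default_gid_prefix 0'
    for arg in "$@"; do
        printf ' %s' "$arg"
    done
    printf '\n'
}

# Hypothetical application and process count.
build_mpirun_cmd mthca0 -np 4 ./a.out
```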

Bill

George Bosilca wrote:

Here are the MCA parameters that you can use:

 MCA btl: parameter "btl_openib_if_include" (current value: )
          Comma-delimited list of HCAs/ports to be used (e.g.
          "mthca0,mthca1:2"; empty value means to use all ports found).
          Mutually exclusive with btl_openib_if_exclude.
 MCA btl: parameter "btl_openib_if_exclude" (current value: )
          Comma-delimited list of HCAs/ports to be excluded (empty value
          means to not exclude any ports).  Mutually exclusive with
          btl_openib_if_include.

  george.

On Feb 21, 2008, at 2:45 PM, Bill Wichser wrote:



In trying to get openmpi up and running on a new cluster, I came across
this error about having both of my IB switches set to the same
subnet-gid.  Snooping around on my hosts which run the opensm daemon, I
indeed found this to be the case in the /var/log/osm-ib[0-1].log files,
giving up finding it with ibstat which showed these values to be
different, at least the second part of the GID.

Before I try and pursue how to actually change this value for the opensm
daemon, I do have a question.

Since both of my hosts are connected to each switch, how am I to
instruct openmpi to use port0?  I'm trying to use port0 as the MPI
network and port1 as the storage network.  Is there something that I
need to add someplace forcing connections only to some default-subnet-gid?


Thanks,
Bill


Re: [OMPI users] ofa-default-subnet-gid

2008-02-21 Thread Jeff Squyres

On Feb 21, 2008, at 12:36 PM, George Bosilca wrote:


Here are the MCA parameters that you can use:

MCA btl: parameter "btl_openib_if_include" (current value: )
         Comma-delimited list of HCAs/ports to be used (e.g.
         "mthca0,mthca1:2"; empty value means to use all ports found).
         Mutually exclusive with btl_openib_if_exclude.
MCA btl: parameter "btl_openib_if_exclude" (current value: )
         Comma-delimited list of HCAs/ports to be excluded (empty value
         means to not exclude any ports).  Mutually exclusive with
         btl_openib_if_include.


These parameters are [upcoming] v1.3 only -- they do not exist in the
v1.2 series.


(more below)


On Feb 21, 2008, at 2:45 PM, Bill Wichser wrote:


In trying to get openmpi up and running on a new cluster, I came across
this error about having both of my IB switches set to the same
subnet-gid.  Snooping around on my hosts which run the opensm daemon, I
indeed found this to be the case in the /var/log/osm-ib[0-1].log files,
giving up finding it with ibstat which showed these values to be
different, at least the second part of the GID.

Before I try and pursue how to actually change this value for the opensm
daemon, I do have a question.

Since both of my hosts are connected to each switch, how am I to
instruct openmpi to use port0?  I'm trying to use port0 as the MPI
network and port1 as the storage network.  Is there something that I
need to add someplace forcing connections only to some default-subnet-gid?


The v1.3 series will have the parameters that George mentioned above;
those give you fine-grained control over which HCAs and ports you are
using.


In the v1.2 series, you cannot explicitly control which HCAs/ports you
are using.  Instead, you can only limit the *number* of active ports
that Open MPI will use:


 MCA btl: parameter "btl_openib_max_btls" (current value: "-1")
          Maximum number of HCA ports to use (-1 = use all
          available, otherwise must be >= 1)

Open MPI starts with the first port on the first interface and goes
upward until it finds max_btls of active ports.  This is admittedly
imperfect, but it was only somewhat recently that someone asked for
explicit control over which HCAs/ports to use.  Sorry...  :-\
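
The selection rule described above can be illustrated with a small simulation. This is not Open MPI code: select_ports and the NAME:STATE port list are hypothetical stand-ins that only mimic "walk the ports in order and keep the first max_btls active ones".

```shell
# select_ports MAX_BTLS NAME:STATE...
# Walks the ports in the order given and prints the names of the first
# MAX_BTLS ports whose state is "active" (-1 means: keep all active ports).
select_ports() {
    max=$1
    shift
    picked=0
    for entry in "$@"; do
        name=${entry%:*}     # text before the last colon
        state=${entry##*:}   # text after the last colon
        [ "$state" = "active" ] || continue
        if [ "$max" -ne -1 ] && [ "$picked" -ge "$max" ]; then
            break
        fi
        echo "$name"
        picked=$((picked + 1))
    done
}

# Hypothetical two-port HCA with both ports active; only the first is kept.
select_ports 1 hca0-port1:active hca0-port2:active
```

With both ports active, a limit of 1 keeps the first port in the list. If that port were down, the same limit would select the next active one instead, which is why this setting cannot reliably pin MPI traffic to a specific port.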


--
Jeff Squyres
Cisco Systems



Re: [OMPI users] mpi.h macro naming

2008-02-21 Thread Jeff Squyres

On Feb 20, 2008, at 9:45 AM, Ben Allan wrote:

Our assumption was that if some other package defined these values, they
would either likely be coming from the same standard autoconf tests or
use the same #define conventions as the autoconf tests.  As such, the
values that they are #defined to would be the same (and compilers don't
whine about multiple #defines of the same macro to the same value -- they
only whine if the values are different).


The particular offending packages in question are indeed using
autoconf/autoheader, however ompi's defines
#define HAVE_LONG_LONG 1
while the others only
#define HAVE_LONG_LONG

more ac version madness?
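
A quick way to spot this kind of mismatch between two generated headers is to compare the replacement text of the macro in each. The sketch below is an assumption-laden illustration: the two header files are throwaway stand-ins created on the spot, and macro_value is a hypothetical helper that only understands simple one-line #define directives.

```shell
workdir=$(mktemp -d)

# Stand-ins for the two styles of generated header described above.
printf '#define HAVE_LONG_LONG 1\n' > "$workdir/ompi_style.h"
printf '#define HAVE_LONG_LONG\n'   > "$workdir/other_style.h"

# macro_value FILE NAME
# Prints the replacement text of the first "#define NAME ..." line in FILE
# (an empty line if the macro is defined to nothing).
macro_value() {
    awk -v name="$2" '
        $1 == "#define" && $2 == name {
            $1 = ""; $2 = ""       # drop the directive and the macro name
            sub(/^ +/, "")         # trim the separators left behind
            print
            exit
        }' "$1"
}

a=$(macro_value "$workdir/ompi_style.h" HAVE_LONG_LONG)
b=$(macro_value "$workdir/other_style.h" HAVE_LONG_LONG)

if [ "$a" != "$b" ]; then
    echo "HAVE_LONG_LONG differs: '$a' vs '$b'"
fi
```

Two definitions with different replacement text are the case a compiler complains about when both headers end up in the same translation unit.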


Gaa!  Yes, this could definitely be the case.  :-(


There's two places that would need to be changed:

- the relevant parts of OMPI's configure script to *also* define an
OMPI_* equivalent of the macro (which will sometimes mean extracting
non-public information from the Autoconf tests -- usually a risky
proposition because Autoconf can change their internals at any time).
The only safe way I can think of would be to AC_TRY_RUN and write the
#define'd value out to a temp file.  This, of course, won't work for
cross-compiling environments, though.

- modify mpi.h.in to use the new OMPI_* macros.

Keep in mind that mpi.h only has a small subset of the #defines from
OMPI's configure script.  opal_config.h (an internal OMPI file that is
not installed) has *all* the #defines; that's what's used to compile the
OMPI code base.  mpi.h replicates a small number of these defines that
are used by OMPI's public interface.


I will think about this guidance and see what kind of patches and
alternative patches I can suggest.
I did not detect autoheader being used in the process of building
mpi.h; is that correct? it would make some simpler workarounds easier.



Correct.  We have mpi.h.in in the SVN repository -- it is *not*
automatically generated.  We just put the #undef HAVE_LONG_LONG (etc.)
lines in there (which is the same format that autoheader generates),
and config.status will morph these into #define... as relevant.
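
The "#undef becomes #define" step can be imitated with sed to see what the generated header looks like. This is only an imitation of the effect, not what config.status actually runs; the template content is made up, and HAVE_EXAMPLE_FEATURE is a hypothetical macro standing in for a test that failed.

```shell
template=$(mktemp)

# Made-up template in the same "#undef NAME" format as mpi.h.in.
printf '%s\n' \
    '#undef HAVE_LONG_LONG' \
    '#undef HAVE_EXAMPLE_FEATURE' > "$template"

# Pretend configure found "long long" but not the example feature:
# the first symbol becomes a #define, the remaining #undef is commented out.
generated=$(sed \
    -e 's|^#undef HAVE_LONG_LONG$|#define HAVE_LONG_LONG 1|' \
    -e 's|^#undef \(.*\)$|/* #undef \1 */|' \
    "$template")

printf '%s\n' "$generated"
```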


While I agree that having AC actually define them to a value is a Good
Thing (better than just defining it to be empty), I do see the pickle
that it has put us in.  :-\  I don't see an obvious solution.


--
Jeff Squyres
Cisco Systems



Re: [OMPI users] openmpi/openib problems

2008-02-21 Thread Jeff Squyres

On Feb 20, 2008, at 9:53 AM, jessie puls wrote:


Specifically, jobs are never being handed to other nodes.  Running

mpirun -mca btl openib,self -np 20 /bin/hostname

will return the same hostname 20 times, even if I specify -bynode as an
argument.



This is normal, and not an InfiniBand issue.  You need to tell Open
MPI where to run your jobs (i.e., which hosts).  If you don't, Open
MPI assumes you want to run on the localhost.


A simple way to specify where to run is to use a hostfile:

http://www.open-mpi.org/faq/?category=running#run-prereqs
http://www.open-mpi.org/faq/?category=running#simple-launch
http://www.open-mpi.org/faq/?category=running#simple-spmd-run
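
As a minimal sketch of the hostfile approach, assuming two hypothetical hosts named node1 and node2 with 10 slots each (the hostfile path is also made up), the file and the matching command could look like this. The mpirun line is only printed so the sketch runs without Open MPI installed.

```shell
# Hypothetical hostfile location and contents.
hostfile="${TMPDIR:-/tmp}/example_hosts"

printf '%s\n' \
    'node1 slots=10' \
    'node2 slots=10' > "$hostfile"

# Add up the slots so -np can be checked against what the hosts offer.
total_slots=$(awk -F'slots=' 'NF > 1 { sum += $2 } END { print sum + 0 }' "$hostfile")
echo "hostfile offers $total_slots slots"

# Printed rather than executed.
echo "mpirun -hostfile $hostfile -mca btl openib,self -np $total_slots /bin/hostname"
```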

--
Jeff Squyres
Cisco Systems