Re: [OMPI users] large jobs hang on startup (deadlock?)

2007-02-11 Thread Jeff Squyres

On Feb 6, 2007, at 6:05 PM, Heywood, Todd wrote:

I know this is an OpenMPI list, but does anyone know how common or  
uncommon LDAP-based clusters are? I would have thought this issue  
would have arisen elsewhere, but Googling MPI+LDAP (and similar)  
doesn't turn up much.


FWIW, when I was back at Indiana University, we had a similar issue  
with a 128 node cluster -- starting parallel jobs would overwhelm the  
central slapd's and logins would start failing.


IIRC, the admins tried a variety of things that didn't end up working  
or were too complicated to maintain in the long term.  So they ended  
up replicating the /etc/shadow and /etc/passwd from LDAP every X  
hours (24, I think?) so that all authentications on the cluster were  
local.  Then they simply disallowed changing user information on the  
cluster (password, shell, etc.) and said "if you want to change your  
information, change it elsewhere and it will sync to the cluster  
within X hours".


Not an optimal solution, but it was the one they opted for because  
all things being equal, I think it was the simplest.


This is all from quite a while ago, so I might not have the details  
exactly correct.


I don't know much about LDAP, but if proxying / caching LDAP servers  
exist, it might help considerably (e.g., put a caching proxy on the  
cluster head node that can respond quickly to hundreds of  
simultaneous LDAP requests from across the cluster instead of having  
the cluster nodes all talk to a central LDAP server).  I don't know  
if that even makes sense (caching LDAP queries), but it was just a  
thought...
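One concrete way to get that caching behavior (an assumption on my part -- the thread doesn't say what the site actually deployed) is glibc's nscd, which caches passwd/group/shadow lookups locally on each node regardless of whether the backend is LDAP. An illustrative /etc/nscd.conf fragment; the directive names are real nscd options, but the TTL values are arbitrary examples:

```
# Cache passwd lookups on each compute node so NSS queries
# (including LDAP-backed ones) don't hit the central server.
enable-cache            passwd  yes
positive-time-to-live   passwd  600   # seconds to cache a hit
negative-time-to-live   passwd  20    # seconds to cache a miss
```

With something like this on every node, a burst of simultaneous logins at job launch mostly hits the local cache instead of the central slapd.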


--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems



Re: [OMPI users] Error using MPI_WAITALL

2007-02-11 Thread Jeff Squyres
This list is for supporting Open MPI, not MPICH (MPICH is an entirely  
different software package).  You should probably redirect your  
question to their support lists.


FWIW, based on the error message, it sounds like you may have an  
incorrect MPI application with a race condition -- sometimes it  
works, sometimes it doesn't.  E.g., in some cases, you're passing an  
invalid MPI request to MPI_WAITALL.  You should double check your  
code to ensure that every entry in the request array is valid (e.g.,  
check what happens after you call WAITALL the first time; are all  
requests re-generated properly?  Is your count accurate?  etc.)


Good luck.



On Feb 10, 2007, at 1:07 PM, Vadivelan Ranjith wrote:


Hi
I am using mpich2-1.0.3 to compile our code. Our code calls  
MPI_WAITALL. We ran the case on an Intel dual-core machine without any  
problem and the solution was fine. I then tried to run the code on an  
Intel quad-core machine. Compilation with mpif90 was fine, but when I  
started running the executable I got the following error:
---------------------------------------------------------------------

Fatal error in MPI_Waitall: Invalid MPI_Request, error stack:
MPI_Waitall(241): MPI_Waitall(count=250, req_array=0x23e52e0,  
status_array=0x7fbfffe3a0) failed

MPI_Waitall(109): Invalid MPI_Request
---------------------------------------------------------------------


So I removed all the lines where MPI_WAITALL is used, recompiled the  
code with mpif90 (MPICH), and ran it. Now it runs without any  
problem. Can you please explain what is happening here?


Thanks
Velan

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems




[OMPI users] openMPI 1.1.4 - connect() failed with errno=111

2007-02-11 Thread matteo . guglielmi
Since installing Open MPI I cannot run any job that uses CPUs from
different machines.

### hostfile ###
lcbcpc02.epfl.ch slots=4 max-slots=4
lcbcpc04.epfl.ch slots=4 max-slots=4


### error message ###
[matteo@lcbcpc02 TEST]$ mpirun --hostfile ~matteo/hostfile -np 8
/home/matteo/Software/NWChem/5.0/bin/nwchem ./nwchem.nw
[0,1,5][../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
[0,1,6][../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=111
6: lcbcpc04.epfl.ch len=16
[0,1,4][../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=111
4: lcbcpc04.epfl.ch len=16
[0,1,7][../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=111
7: lcbcpc04.epfl.ch len=16
connect() failed with errno=111
5: lcbcpc04.epfl.ch len=16
#

I did disable the firewall on both machines but I still get that error message.

Thanks,
MG.