Re: [OMPI users] large jobs hang on startup (deadlock?)
On Feb 6, 2007, at 6:05 PM, Heywood, Todd wrote:

> I know this is an OpenMPI list, but does anyone know how common or uncommon LDAP-based clusters are? I would have thought this issue would have arisen elsewhere, but Googling MPI+LDAP (and similar) doesn't turn up much.

FWIW, when I was back at Indiana University, we had a similar issue with a 128-node cluster -- starting parallel jobs would overwhelm the central slapd servers and logins would start failing. IIRC, the admins tried a variety of things that didn't end up working or were too complicated to maintain in the long term. So they ended up replicating /etc/shadow and /etc/passwd from LDAP every X hours (24, I think?) so that all authentications on the cluster were local. Then they simply disallowed changing user information on the cluster (password, shell, etc.) and said "if you want to change information, change it elsewhere and it will sync to the cluster within X hours." Not an optimal solution, but it was the one they opted for because, all things being equal, I think it was the simplest.

This is all from quite a while ago, so I might not have the details exactly correct.

I don't know much about LDAP, but if proxying / caching LDAP servers exist, it might help considerably (e.g., put a caching proxy on the cluster head node that can respond quickly to hundreds of simultaneous LDAP requests from across the cluster, instead of having the cluster nodes all talk to a central LDAP server). I don't know if that even makes sense (caching LDAP queries), but it was just a thought...

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems
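[For context: launching a process on a node typically triggers user/group lookups through NSS (getpwuid() and friends), and on an LDAP-backed cluster each such lookup can become a query to the central slapd -- that fan-out is what falls over at scale, and it is what both local passwd replication and a caching proxy (e.g., nscd) short-circuit. A minimal sketch of the kind of lookup involved, plain POSIX, nothing Open MPI-specific:]

    #include <pwd.h>      /* getpwuid(), struct passwd */
    #include <stdio.h>
    #include <unistd.h>   /* getuid() */

    int main(void)
    {
        /* On a system using nss_ldap, this single call can become a
         * network round-trip to the LDAP server, unless a local cache
         * (e.g., nscd) or a local /etc/passwd copy answers it first. */
        struct passwd *pw = getpwuid(getuid());
        if (pw == NULL) {
            perror("getpwuid");
            return 1;
        }
        printf("user=%s shell=%s\n", pw->pw_name, pw->pw_shell);
        return 0;
    }

[Multiply that one round-trip by every rank in a large job starting simultaneously, and the load on a single central slapd is easy to picture.]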
Re: [OMPI users] Error using MPI_WAITALL
This list is for supporting Open MPI, not MPICH (MPICH is an entirely different software package). You should probably redirect your question to their support lists.

FWIW, based on the error message, it sounds like you may have an incorrect MPI application with a race condition -- sometimes it works, sometimes it doesn't. E.g., in some cases, you're passing an invalid MPI request to MPI_WAITALL. You should double-check your code to ensure that every entry in the request array is valid (e.g., check what happens after you call WAITALL the first time: are all requests re-generated properly? Is your count accurate? etc.). Good luck.

On Feb 10, 2007, at 1:07 PM, Vadivelan Ranjith wrote:

> Hi,
> We use mpich2-1.0.3 to compile our code, which calls MPI_WAITALL. We ran the case on an Intel dual-core machine without any problem, and the solution was fine. Then I tried to run the code on an Intel quad-core machine. Compilation with mpif90 went fine, but when I started running the executable, I got the following error:
>
> Fatal error in MPI_Waitall: Invalid MPI_Request, error stack:
> MPI_Waitall(241): MPI_Waitall(count=250, req_array=0x23e52e0, status_array=0x7fbfffe3a0) failed
> MPI_Waitall(109): Invalid MPI_Request
>
> So I removed all the lines that use MPI_WAITALL, compiled the code again with mpif90 (MPICH), and ran it. Now it runs without any problem. Can you please explain what is happening here?
>
> Thanks,
> Velan

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems
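[To illustrate the checks suggested above, a hedged sketch in C -- the poster's code is Fortran, but the rules are identical: every entry handed to MPI_Waitall must be a request returned by a nonblocking call (or MPI_REQUEST_NULL), the count must match the number of entries actually posted, and requests must be re-posted each iteration. The ring exchange, buffer sizes, and iteration count below are illustrative, not taken from the poster's code:]

    #include <mpi.h>
    #include <stdio.h>

    #define NREQ 2   /* count passed to MPI_Waitall == entries we post */

    int main(int argc, char **argv)
    {
        int rank, size, iter, sendbuf, recvbuf = 0;
        MPI_Request reqs[NREQ];
        MPI_Status  stats[NREQ];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int right = (rank + 1) % size;
        int left  = (rank + size - 1) % size;

        for (iter = 0; iter < 3; ++iter) {
            /* Re-post ALL requests each iteration. Uninitialized array
             * entries, or a count larger than the number of requests
             * actually posted, are exactly the kind of bug that produces
             * "Invalid MPI_Request" inside MPI_Waitall. */
            sendbuf = rank + iter;
            MPI_Irecv(&recvbuf, 1, MPI_INT, left,  0, MPI_COMM_WORLD, &reqs[0]);
            MPI_Isend(&sendbuf, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &reqs[1]);

            /* Waits on exactly the NREQ requests posted above. */
            MPI_Waitall(NREQ, reqs, stats);
        }

        printf("rank %d done, last recv = %d\n", rank, recvbuf);
        MPI_Finalize();
        return 0;
    }

[Note that "works on one machine, fails on another" is the classic signature of such a bug: uninitialized request handles may happen to contain harmless values on one platform and garbage on another.]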
[OMPI users] openMPI 1.1.4 - connect() failed with errno=111
Since I installed Open MPI, I cannot submit any job that uses CPUs from different machines.

### hostfile ###
lcbcpc02.epfl.ch slots=4 max-slots=4
lcbcpc04.epfl.ch slots=4 max-slots=4

### error message ###
[matteo@lcbcpc02 TEST]$ mpirun --hostfile ~matteo/hostfile -np 8 /home/matteo/Software/NWChem/5.0/bin/nwchem ./nwchem.nw
[0,1,5][../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
[0,1,6][../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=111 6: lcbcpc04.epfl.ch len=16
[0,1,4][../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=111 4: lcbcpc04.epfl.ch len=16
[0,1,7][../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=111 7: lcbcpc04.epfl.ch len=16
connect() failed with errno=111 5: lcbcpc04.epfl.ch len=16

I disabled the firewall on both machines, but I still get this error message.

Thanks, MG.
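[For reference: errno 111 on Linux is ECONNREFUSED -- the remote machine answered, but actively refused the connection on the port the TCP BTL tried. Typical causes are a firewall rule that rejects rather than drops, or Open MPI advertising an address/interface that is not actually reachable from the peer. A minimal standalone sketch to reproduce the same connect() outside of Open MPI; host and port are command-line placeholders, not values from the error output above:]

    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <sys/socket.h>

    /* Usage: ./tcpcheck <host> <port>
     * Performs the same kind of connect() as the Open MPI TCP BTL, so the
     * same errno (111 = ECONNREFUSED) can be observed without MPI. */
    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <host> <port>\n", argv[0]);
            return 2;
        }

        struct addrinfo hints, *res;
        memset(&hints, 0, sizeof(hints));
        hints.ai_family   = AF_INET;
        hints.ai_socktype = SOCK_STREAM;

        int err = getaddrinfo(argv[1], argv[2], &hints, &res);
        if (err != 0) {
            fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(err));
            return 1;
        }

        int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (fd < 0) { perror("socket"); return 1; }

        if (connect(fd, res->ai_addr, res->ai_addrlen) < 0)
            fprintf(stderr, "connect() failed with errno=%d (%s)\n",
                    errno, strerror(errno));
        else
            printf("connected to %s:%s\n", argv[1], argv[2]);

        close(fd);
        freeaddrinfo(res);
        return 0;
    }

[Running it from lcbcpc02 against lcbcpc04 (e.g., against a port known to be open, such as sshd's, and then against a high ephemeral port) helps separate a network/firewall problem from an Open MPI configuration problem.]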