Hi Gus,

Thanks for the suggestions!

I know that QCSCRATCH and QCLOCALSCR are not the problem. When I set QCSCRATCH="." and unset QCLOCALSCR, Q-Chem writes all the scratch files to the current directory, which is the behavior I want. The environment variables are correctly passed on the mpirun command line.
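For concreteness, the launch looks roughly like this (the executable path is illustrative, not my exact job line):

    export QCSCRATCH=.
    unset QCLOCALSCR
    mpirun -np 8 -x QCSCRATCH /path/to/qcprog qcopt_reactants.in

Open MPI's -x flag exports the named environment variable to every launched rank, so each process sees the same QCSCRATCH.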
Since my jobs have a fair bit of I/O, I make sure to change to the locally mounted /tmp folder before running the calculations, and I do have permission to write there. When I run jobs without OpenMPI they are stable on the Blue Waters compute nodes, which suggests the issues are not due to any of the above.

I compiled Q-Chem from the source code, so I built OpenMPI 1.8.3 first, added $OMPI/bin to my PATH (and $OMPI/lib to LD_LIBRARY_PATH), and configured the Q-Chem build so it properly uses "mpicc", etc. The environment variables for OpenMPI are correctly set at runtime.

At this point, I think the main problem is a limitation of the networking on the compute nodes, and I believe Blue Waters support is currently working on this. I'll make sure to send an update if anything happens.

- Lee-Ping
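P.S. Regarding the socket code quoted further down the thread: new_server_socket(NULL, 0) already appears to ask the kernel for a free ephemeral port (port 0), so a hard-coded port number may not be the culprit. For reference, here is a minimal, self-contained sketch of that pattern as I understand it - this is my own reconstruction, not the actual Q-Chem implementation:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    int main(void) {
        /* Create a TCP socket. */
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        /* Allow quick re-runs to rebind even if an old socket is in TIME_WAIT. */
        int yes = 1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes));

        /* Port 0 tells the kernel to assign any free ephemeral port. */
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(0);

        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("bind");
            close(fd);
            return 1;
        }
        if (listen(fd, 8) < 0) { perror("listen"); close(fd); return 1; }

        /* Recover the port the kernel chose (the role get_sockport presumably
         * plays in the snippet below). */
        socklen_t len = sizeof(addr);
        getsockname(fd, (struct sockaddr *)&addr, &len);
        printf("listening on port %d\n", (int) ntohs(addr.sin_port));

        close(fd);
        return 0;
    }

Binding to port 0 with SO_REUSEADDR is the usual way to avoid "address already in use" failures when a short job is re-run in rapid succession, since the OS can take a while to release a socket.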
On Oct 2, 2014, at 12:09 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:

> Hi Lee-Ping
>
> Computational Chemistry is Greek to me.
> However, on p. 12 of the Q-Chem manual 3.2
> (PDF online: http://www.q-chem.com/qchem-website/doc_for_web/qchem_manual_3.2.pdf)
> there are explanations of the meaning of QCSCRATCH and QCLOCALSCR, etc.,
> which, as Ralph pointed out, seem to be a sticking point and showed up
> in the warning messages, which I enclose below.
>
> QCLOCALSCR specifies a local disk for I/O.
> I wonder if the node(s) is (are) diskless, and whether this might cause the problem.
> Another possibility is that mpiexec may not be passing these environment variables.
> (Do you pass them on the mpiexec/mpirun command line?)
>
> QCSCRATCH defines a directory for temporary files.
> If this is a network-shared directory, could it be that some nodes
> are not mounting it correctly?
> Likewise, if your home directory or your job run directory is not
> mounted, that could be a problem.
> Or maybe you don't have write permission (sometimes this happens in /tmp,
> especially if it is a ramdisk/tmpfs, which may also have a small size).
>
> Your Blue Waters system administrator may be able to shed some light on
> these things.
>
> Also, the Q-Chem manual says it is a pre-compiled executable,
> which as far as I know would require a matching version of OpenMPI.
> (Ralph, please correct me if I am wrong.)
>
> However, you seem to have the source code; at least you sent a
> snippet of it. [With all those sockets being opened besides MPI ...]
>
> Did you recompile with OpenMPI?
> Did you add $OMPI/bin to PATH and $OMPI/lib to LD_LIBRARY_PATH,
> and are these environment variables propagated to the job execution nodes
> (especially those that are failing)?
>
> Anyway, just a bunch of guesses ...
> Gus Correa
>
> *********************************************
> QCSCRATCH   Defines the directory in which Q-Chem will store temporary
>             files. Q-Chem will usually remove these files on successful
>             completion of the job, but they can be saved, if so wished.
>             Therefore, QCSCRATCH should not reside in a directory that
>             will be automatically removed at the end of a job, if the
>             files are to be kept for further calculations. Note that many
>             of these files can be very large, and it should be ensured
>             that the volume that contains this directory has sufficient
>             disk space available. The QCSCRATCH directory should be
>             periodically checked for scratch files remaining from
>             abnormally terminated jobs. QCSCRATCH defaults to the working
>             directory if not explicitly set. Please see section 2.6 for
>             details on saving temporary files and consult your systems
>             administrator.
>
> QCLOCALSCR  On certain platforms, such as Linux clusters, it is sometimes
>             preferable to write the temporary files to a disk local to the
>             node. QCLOCALSCR specifies this directory. The temporary files
>             will be copied to QCSCRATCH at the end of the job, unless the
>             job is terminated abnormally. In such cases Q-Chem will
>             attempt to remove the files in QCLOCALSCR, but may not be able
>             to due to access restrictions. Please specify this variable
>             only if required.
> *********************************************
>
> On 10/02/2014 02:08 PM, Lee-Ping Wang wrote:
>> Hi Ralph,
>>
>> I've been troubleshooting this issue and communicating with Blue Waters
>> support. It turns out that Q-Chem and OpenMPI are both trying to open
>> sockets, and I get different error messages depending on which one fails.
>>
>> As an aside, I don't know why Q-Chem needs sockets of its own to
>> communicate between ranks; shouldn't OpenMPI be taking care of all that?
>> (I'm unfamiliar with this part of the Q-Chem code base; maybe it's trying
>> to duplicate some functionality?)
>>
>> Blue Waters support has indicated that there's a problem with their
>> realm-specific IP addressing (RSIP) for the compute nodes, which they're
>> working on fixing. I also tried running the same Q-Chem / OpenMPI job on
>> a management node, which I think has the same hardware (but not the
>> RSIP), and the problem went away. So I think I'll shelve this problem
>> for now, until Blue Waters support gets back to me with the fix. :)
>>
>> Thanks,
>> - Lee-Ping
>>
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lee-Ping Wang
>> Sent: Tuesday, September 30, 2014 1:15 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] General question about running single-node jobs.
>>
>> Hi Ralph,
>>
>> Thanks. I'll add some print statements to the code and try to figure out
>> precisely where the failure is happening.
>>
>> - Lee-Ping
>>
>> On Sep 30, 2014, at 12:06 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> On Sep 30, 2014, at 11:19 AM, Lee-Ping Wang <leep...@stanford.edu> wrote:
>>
>> Hi Ralph,
>>
>> If so, then I should be able to (1) locate where the port number is
>> defined in the code, and (2) randomize the port number every time it's
>> called to work around the issue. What do you think?
>>
>> That might work, depending on the code. I'm not sure what it is trying
>> to connect to, and whether that code knows how to handle arbitrary
>> connections.
>>
>> The main reason why Q-Chem is using MPI is for executing parallel tasks
>> on a single node. Thus, I think it's just the MPI ranks attempting to
>> connect with each other on the same machine. This could be off the mark,
>> because I'm still a novice with respect to MPI concepts, but I am sure it
>> is just one machine.
>>
>> Your statement doesn't match what you sent us - you showed that it was
>> your connection code that was failing, not ours. You wouldn't have
>> gotten that far if our connections failed, as you would have failed in
>> MPI_Init.
>> You are clearly much further than that, as you already passed an
>> MPI_Barrier before reaching the code in question.
>>
>> You might check about those warnings - could be that QCLOCALSCR and
>> QCREF need to be set for the code to work.
>>
>> Thanks; I don't think these environment variables are the issue, but I
>> will check again. The calculation runs without any problems on four
>> different clusters (where I don't set these environment variables
>> either); it's only broken on the Blue Waters compute node. Also, the
>> calculation runs without any problems the first time it's executed on
>> the BW compute node - it's only subsequent executions that give the
>> error messages.
>>
>> Thanks,
>> - Lee-Ping
>>
>> On Sep 30, 2014, at 11:05 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> On Sep 30, 2014, at 10:49 AM, Lee-Ping Wang <leep...@stanford.edu> wrote:
>>
>> Hi Ralph,
>>
>> Thank you. I think your diagnosis is probably correct. Are these
>> sockets the same as the TCP/UDP ports (though with different numbers)
>> that are used by web servers, email, etc.?
>>
>> Yes
>>
>> If so, then I should be able to (1) locate where the port number is
>> defined in the code, and (2) randomize the port number every time it's
>> called to work around the issue. What do you think?
>>
>> That might work, depending on the code. I'm not sure what it is trying
>> to connect to, and whether that code knows how to handle arbitrary
>> connections.
>>
>> You might check about those warnings - could be that QCLOCALSCR and
>> QCREF need to be set for the code to work.
>>
>> - Lee-Ping
>>
>> On Sep 29, 2014, at 8:45 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> I don't know anything about your application, or what the functions in
>> your code are doing. I imagine it's possible that you are trying to open
>> statically defined ports, which means that running the job again too
>> soon could leave the OS thinking the socket is already busy. It takes a
>> while for the OS to release a socket resource.
>>
>> On Sep 29, 2014, at 5:49 PM, Lee-Ping Wang <leep...@stanford.edu> wrote:
>>
>> Here's another data point that might be useful: the error message is
>> much rarer if I run my application on 4 cores instead of 8.
>>
>> Thanks,
>> - Lee-Ping
>>
>> On Sep 29, 2014, at 5:38 PM, Lee-Ping Wang <leep...@stanford.edu> wrote:
>>
>> Sorry for my last email - I think I spoke too quickly. I realized after
>> reading some more documentation that OpenMPI always uses TCP sockets for
>> out-of-band communication, so it doesn't make sense for me to set
>> OMPI_MCA_oob=^tcp. That said, I am still running into a strange problem
>> in my application when running on a specific machine (a Blue Waters
>> compute node); I don't see this problem on any other nodes.
>>
>> When I run the same job (~5 seconds) in rapid succession, I see the
>> following error message on the second execution:
>>
>> /tmp/leeping/opt/qchem-4.2/bin/parallel.csh, , qcopt_reactants.in, 8, 0, ./qchem24825/
>> MPIRUN in parallel.csh is /tmp/leeping/opt/qchem-4.2/ext-libs/openmpi/bin/mpirun
>> P4_RSHCOMMAND in parallel.csh is ssh
>> QCOUTFILE is stdout
>> Q-Chem machineFile is /tmp/leeping/opt/qchem-4.2/bin/mpi/machines
>> [nid15081:24859] Warning: could not find environment variable "QCLOCALSCR"
>> [nid15081:24859] Warning: could not find environment variable "QCREF"
>> initial socket setup ...start
>> -------------------------------------------------------
>> Primary job terminated normally, but 1 process returned
>> a non-zero exit code.. Per user-direction, the job has been aborted.
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun detected that one or more processes exited with non-zero status,
>> thus causing the job to be terminated. The first process to do so was:
>>
>>   Process name: [[46773,1],0]
>>   Exit code:    255
>> --------------------------------------------------------------------------
>>
>> And here's the source code where the program is exiting (before "initial
>> socket setup ...done"):
>>
>> int GPICommSoc::init(MPI_Comm comm0) {
>>     /* setup basic MPI information */
>>     init_comm(comm0);
>>     MPI_Barrier(comm);
>>
>>     /*-- start inisock and set serveraddr[] array --*/
>>     if (me == 0) {
>>         fprintf(stdout, "initial socket setup ...start\n");
>>         fflush(stdout);
>>     }
>>
>>     // create the initial socket
>>     inisock = new_server_socket(NULL, 0);
>>
>>     // fill and gather the serveraddr array
>>     int szsock = sizeof(SOCKADDR);
>>     memset(&serveraddr[0], 0, szsock*nproc);
>>     int iniport = get_sockport(inisock);
>>     set_sockaddr_byhname(NULL, iniport, &serveraddr[me]);
>>     //printsockaddr( serveraddr[me] );
>>     SOCKADDR addrsend = serveraddr[me];
>>     MPI_Allgather(&addrsend, szsock, MPI_BYTE,
>>                   &serveraddr[0], szsock, MPI_BYTE, comm);
>>
>>     if (me == 0) {
>>         fprintf(stdout, "initial socket setup ...done \n");
>>         fflush(stdout);
>>     }
>>
>> I didn't write this part of the program, and I'm really a novice at MPI -
>> but it seems like the initial execution of the program isn't freeing up
>> some system resource as it should. Is there something that needs to be
>> corrected in the code?
>>
>> Thanks,
>> - Lee-Ping
>>
>> On Sep 29, 2014, at 5:12 PM, Lee-Ping Wang <leep...@stanford.edu> wrote:
>>
>> Hi there,
>>
>> My application uses MPI to run parallel jobs on a single node, so I have
>> no need of any support for communication between nodes. However, when I
>> use mpirun to launch my application I see strange errors such as:
>>
>> --------------------------------------------------------------------------
>> No network interfaces were found for out-of-band communications. We require
>> at least one available network for out-of-band messaging.
>> --------------------------------------------------------------------------
>> [nid23206:10697] [[33772,1],0] ORTE_ERROR_LOG: Unable to open a TCP
>> socket for out-of-band communications in file oob_tcp_listener.c at line 113
>> [nid23206:10697] [[33772,1],0] ORTE_ERROR_LOG: Unable to open a TCP
>> socket for out-of-band communications in file oob_tcp_component.c at line 584
>> --------------------------------------------------------------------------
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems. This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>>   orte_oob_base_select failed
>>   --> Returned value (null) (-43) instead of ORTE_SUCCESS
>> --------------------------------------------------------------------------
>>
>> /home/leeping/opt/qchem-4.2/ext-libs/openmpi/lib/libmpi.so.1(+0xfeaa9)[0x2b77e9de5aa9]
>> /home/leeping/opt/qchem-4.2/ext-libs/openmpi/lib/libmpi.so.1(ompi_btl_openib_connect_base_select_for_local_port+0xd0)[0x2b77e9de13a0]
>>
>> It seems like in each case, OpenMPI is trying to use some feature
>> related to networking and crashing as a result. My workaround is to
>> deduce the components that are crashing and disable them in my
>> environment variables, like this:
>>
>> export OMPI_MCA_btl=self,sm
>> export OMPI_MCA_oob=^tcp
>>
>> Is there a better way to do this - i.e., explicitly prohibit OpenMPI
>> from using any network-related feature and run only on the local node?
>>
>> Thanks,
>> - Lee-Ping
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2014/10/25429.php