Hi,

On 18.06.2014 at 20:45, Connell, Jesse wrote:
> We've been having a seemingly-random problem with MPI jobs on our install
> of Open Grid Scheduler 2011.11. For some varying length of time from when
> the execd processes start up, MPI jobs running across multiple hosts will
> run fine. Then, at some point, they will start failing at the mpirun
> step, and will keep failing until execd is restarted on the affected
> hosts. They then work again, before eventually failing, and so on. If I
> increase the SGE debug level before calling mpirun in my job script, I see
> things like this:
>
> 842 11556 main ../clients/qsh/qsh.c 1840 executing task of

qsh? Are you using an X11 session?

-- Reuti

> job 6805430 failed: failed sending task to execd@<hostname>: got send error
>
> ...but nothing more interesting that I can see. (I also get the same sort
> of "send error" message from mpirun itself if I use its --mca
> ras_gridengine_debug --mca ras_gridengine_verbose flags, but nothing
> else.) Jobs that run on multiple cores on a single host are fine, but
> ones that try to start up workers on additional hosts fail. Since
> restarting execd makes it work again, I assumed the problem was on that
> end, and tried dumping verbose log output for execd (using dl 10) to a
> file. But, despite many thousands of lines, I can't spot anything that
> looks different when the jobs start failing from when they are working, as
> far as execd is concerned. Ordinary grid jobs (no parallel environment)
> continue to run fine no matter what.
>
> So for now, I'm stumped! Any other ideas of what to look for, or thoughts
> of what the unpredictable off-and-on behavior could possibly mean? Thanks
> in advance,
>
> Jesse
>
> P.S. This is on CentOS 6, with its openmpi 1.5.4 package.
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
