On Mon, Dec 10, 2012 at 9:27 AM, Forster, Robert <[email protected]> wrote:

> Hello all:
>
> I'm running a small Rocks cluster (Rocks 5.4, 7 nodes, 56 cores). I
> need to run many iterations of a program that takes 13 hrs to finish
> on 53 cores. I can successfully run the program via the command line;
> however, when I tried an SGE script it failed. I then tested ring_c
> and hello_c, and they also both failed. I really need to queue this
> program up so I'm not just running it once per day.
>
> When submitted with qsub -pe mpi 56 mpi-ring.qsub:
>
> mpi-ring.qsub:
>
> #!/bin/bash
> #
> #$ -cwd
> #$ -j y
> #$ -S /bin/bash
> #
>
> /share/apps/mpi/gcc460/openmpi-1.4.3/bin/mpirun /share/apps/test/ring_c
>
> [mono-addon] -bash-3.2$ cat mpi-ring.qsub.o869
> error: executing task of job 869 failed: execution daemon on host
> "compute-0-0.local" didn't accept task
> error: executing task of job 869 failed: execution daemon on host
> "compute-0-10.local" didn't accept task
> ...

It seems that you have a problem with the SGE installation on those two
nodes, so fix that first. Log in to each node and check whether anything
is wrong with them (full disks? restart the execution daemon with
/etc/init.d/sgeexecXXXX restart). You can then try an SGE script with
this extra line:

#$ -l hostname=compute-0-0.local

which forces the job onto that node, so a script that simply runs
hostname there is an easy way to debug.
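A minimal sketch of such a debug script might look like this (the host
name is taken from the errors above; adjust it for each suspect node):

#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#
# Pin the job to the suspect node; if its execd refuses work, this
# job will fail the same way as the MPI runs did.
#$ -l hostname=compute-0-0.local

# Should print compute-0-0.local if the execution daemon accepts tasks.
hostname

Submit it with a plain qsub (no parallel environment needed), then
repeat with compute-0-10.local and any other host named in the errors
to narrow down which execds are unhealthy.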
> When submitted with qsub -pe orte 56 mpi-ring.qsub:
>
> [compute-0-2:13313] *** Process received signal ***
> [compute-0-2:13313] Signal: Segmentation fault (11)
> [compute-0-2:13313] Signal code: Address not mapped (1)
> [compute-0-2:13313] Failing at address: 0x206
> [compute-0-2:13313] [ 0] /lib64/libpthread.so.0 [0x3a0c40eb10]
> [compute-0-2:13313] [ 1]
> /share/apps/mpi/gcc460/openmpi-1.4.3/lib/openmpi/mca_btl_sm.so
> [0x2ac3f3ba6188]
> [compute-0-2:13313] [ 2]
> /share/apps/mpi/gcc460/openmpi-1.4.3/lib/openmpi/mca_bml_r2.so
> [0x2ac3f2f467f2]
> [compute-0-2:13313] [ 3]
> /share/apps/mpi/gcc460/openmpi-1.4.3/lib/openmpi/mca_pml_ob1.so
> [0x2ac3f2b302ee]
> [compute-0-2:13313] [ 4]
> /share/apps/mpi/gcc460/openmpi-1.4.3/lib/libmpi.so.0 [0x2ac3f019b6e9]
> [compute-0-2:13313] [ 5]
> /share/apps/mpi/gcc460/openmpi-1.4.3/lib/libmpi.so.0(MPI_Init+0x16b)
> [0x2ac3f01ba38b]
> [compute-0-2:13313] [ 6] /share/apps/test/ring_c(main+0x29) [0x4009dd]
> [compute-0-2:13313] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4)
> [0x3a0bc1d994]
> [compute-0-2:13313] [ 8] /share/apps/test/ring_c [0x4008f9]
> [compute-0-2:13313] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 14 with PID 13313 on node
> compute-0-2.local exited on signal 11 (Segmentation fault).
>
> When I change the allocation rule to $pe_slots and only run 8
> processes, it works. However, this doesn't help.
>
> Since this will be the major work for this computer over the next
> month or two, I'm thinking of starting over and installing Rocks 6.1,
> especially if InfiniBand is built in. Unless there is a simple fix, is
> there something I need to do to set up SGE to run across multiple
> nodes?

SGE works out of the box on a Rocks cluster. Upgrading is always a good
idea.

Luca
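One thing worth noting about the orte backtrace above: it dies inside
Open MPI's shared-memory transport (mca_btl_sm) during MPI_Init. A
quick way to test whether the sm BTL is the culprit is to exclude it so
that ranks on the same node fall back to TCP. This is a diagnostic
sketch, not a verified fix for this cluster:

# Diagnostic only: run the ring test without the shared-memory BTL.
# If this succeeds where the plain run segfaults, the problem is in
# the sm transport (a full or NFS-mounted temp/session directory is
# one common cause).
/share/apps/mpi/gcc460/openmpi-1.4.3/bin/mpirun --mca btl ^sm \
    /share/apps/test/ring_c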
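On the allocation-rule observation: $pe_slots requires all of a job's
slots to come from a single host, which is why 8 processes fit on one
8-core node but 56 cannot be placed. A multi-node PE needs a rule such
as $fill_up or $round_robin. For comparison, a tightly integrated PE on
Rocks typically looks something like this (the values below are
assumptions; check the real output of qconf -sp orte on your frontend):

$ qconf -sp orte
pe_name            orte
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE

control_slaves TRUE is what allows mpirun to start remote ranks through
SGE's qrsh -inherit; if it is FALSE, execds refuse those tasks, which
is one possible source of the "didn't accept task" errors seen with the
mpi PE. A PE can be edited with qconf -mp <pe_name>.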
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users