Hi Ashwani,

Also, check whether there are rogue processes left over from old jobs on your compute nodes holding large numbers of file descriptors open. A reboot should fix this easily.
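[Editor's note: the rogue-process check above can be done without a reboot. A minimal sketch, assuming a Linux `/proc` filesystem; the five-process cutoff is arbitrary:]

```shell
# Count open file descriptors per process via /proc (Linux only).
# Sort so the heaviest consumers - candidate rogue processes - appear last.
for pid_dir in /proc/[0-9]*; do
    pid=${pid_dir#/proc/}
    # Some fd directories are unreadable without root; skip those quietly.
    count=$(ls "$pid_dir/fd" 2>/dev/null | wc -l)
    echo "$count $pid"
done | sort -n | tail -n 5
```

Running this on each compute node (e.g. via ssh in a loop over the node list) would show whether an old job's processes are still holding descriptors.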
My two cents.
Gus Correa

On Oct 15, 2011, at 10:34 AM, Ralph Castain wrote:

> Okay, let's try spreading them out more, just to avoid putting more procs on a node than you actually need. Add -bynode to your cmd line. This will spread the procs across all the nodes.
>
> Our default mode is "byslot", which means we fill each node before adding procs to the next one. "bynode" puts one proc on each node, wrapping around until all procs have been assigned. You may lose a little performance, as shared memory can't be used as much, but at least it has a better chance of running.
>
> On Oct 14, 2011, at 1:29 PM, Ashwani Kumar Mishra wrote:
>
>> Hi Ralph,
>> No idea how many file descriptors this program consumes :(
>>
>> Best Regards,
>> Ashwani
>>
>> On Sat, Oct 15, 2011 at 12:08 AM, Ralph Castain <r...@open-mpi.org> wrote:
>> Should be plenty for us - does your program consume a lot?
>>
>> On Oct 14, 2011, at 12:25 PM, Ashwani Kumar Mishra wrote:
>>
>>> Hi Ralph,
>>> fs.file-max = 100000
>>> Is this OK, or too low?
>>>
>>> Best Regards,
>>> Ashwani
>>>
>>> On Fri, Oct 14, 2011 at 11:45 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>> Can't offer much about the qsub job. On the first one, what is your limit on the number of file descriptors? Could be your sysadmin has it set too low.
>>>
>>> On Oct 14, 2011, at 12:07 PM, Ashwani Kumar Mishra wrote:
>>>
>>>> Hello,
>>>> When I submit the job below on a cluster of 40 nodes, each with 8 processors and 8 GB RAM, I receive the following errors.
>>>>
>>>> Both commands work well as long as I use up to 88 processors, but the moment I allocate more than 88 processors I get the two errors below.
>>>>
>>>> I tried setting ulimit to unlimited and setting the MCA parameter opal_set_max_sys_limits to 1, but the problem won't go away.
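[Editor's note: Ralph's byslot/bynode description can be illustrated with a toy round-robin. This is an editorial sketch, not Open MPI code; the node names and slot counts are made up:]

```shell
# Toy illustration of the two placement policies for 6 ranks on 3 nodes
# with 2 slots each. "bynode" wraps ranks around the node list; "byslot"
# fills each node's slots before moving to the next node.
num_nodes=3
slots_per_node=2

for rank in 0 1 2 3 4 5; do
    bynode=$((rank % num_nodes))      # bynode: rank r -> node (r mod num_nodes)
    byslot=$((rank / slots_per_node)) # byslot: rank r -> node (r / slots_per_node)
    echo "rank $rank: bynode -> node$bynode, byslot -> node$byslot"
done
```

The actual change Ralph suggests is just one extra flag, e.g. `mpirun -np 100 -bynode ...`.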
>>>>
>>>> $ mpirun=/opt/psc/ompi/bin/mpirun abyss-pe np=100 name=cattle k=50 n=10 in=s_1_1_sequence.txt
>>>>
>>>> /opt/mpi/openmpi/1.3.3/intel/bin/mpirun -np 100 ABYSS-P -k50 -q3 --coverage-hist=coverage.hist -s cattle-bubbles.fa -o cattle-1.fa s_1_1_sequence.txt
>>>>
>>>> [coe:19807] [[62863,0],0] ORTE_ERROR_LOG: The system limit on number of pipes a process can open was reached in file base/iof_base_setup.c at line 107
>>>> [coe.:19807] [[62863,0],0] ORTE_ERROR_LOG: The system limit on number of pipes a process can open was reached in file odls_default_module.c at line 203
>>>> [coe.:19807] [[62863,0],0] ORTE_ERROR_LOG: The system limit on number of network connections a process can open was reached in file oob_tcp.c at line 447
>>>> --------------------------------------------------------------------------
>>>> Error: system limit exceeded on number of network connections that can be open
>>>>
>>>> This can be resolved by setting the MCA parameter opal_set_max_sys_limits to 1,
>>>> increasing your limit descriptor setting (using limit or ulimit commands),
>>>> or asking the system administrator to increase the system limit.
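[Editor's note: the help text in the error above points at two separate limits. A hedged sketch of the places to check, assuming Linux and bash; the `ulimit -n 4096` value is arbitrary, and the mpirun flag in the last comment is the MCA parameter the error message itself names:]

```shell
# Per-process soft limit on open descriptors (what ulimit controls):
ulimit -n

# Hard limit - the ceiling a non-root user can raise the soft limit to:
ulimit -Hn

# System-wide ceiling across all processes (Ashwani's fs.file-max = 100000):
cat /proc/sys/fs/file-max

# To raise the soft limit for the current shell (up to the hard limit):
#   ulimit -n 4096
# To ask Open MPI to raise limits itself, per the error text:
#   mpirun --mca opal_set_max_sys_limits 1 ...
```

Note that the soft limit must be raised on every node where mpirun spawns children, not just the submit host, which is why setting it interactively on one machine often does not help.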
>>>> --------------------------------------------------------------------------
>>>> make: *** [cattle-1.fa] Error 1
>>>>
>>>> When I submit the same job through qsub, I receive the following error:
>>>>
>>>> $ qsub -cwd -pe orte 100 -o qsub.out -e qsub.err -b y -N abyss `which mpirun` /home/genome/abyss/bin/ABYSS-P -k 50 s_1_1_sequence.txt -o av
>>>>
>>>> [compute-0-19.local][[28273,1],125][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect] connect() to 173.16.255.231 failed: Connection refused (111)
>>>> [compute-0-19.local][[28273,1],127][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect] connect() to 173.16.255.231 failed: Connection refused (111)
>>>> [compute-0-23.local][[28273,1],135][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect] connect() to 173.16.255.228 failed: Connection refused (111)
>>>> [compute-0-23.local][[28273,1],133][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect] connect() to 173.16.255.228 failed: Connection refused (111)
>>>> [compute-0-4.local][[28273,1],113][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect] connect() to 173.16.255.231 failed: Connection refused (111)
>>>>
>>>> Best Regards,
>>>> Ashwani
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
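[Editor's note: the "Connection refused" failures in the qsub run name specific peer addresses, which is where debugging would start (firewall rules, a downed interface, or a node not running the job's processes). A sketch of extracting the distinct failing peers from such a log; the two sample lines are abbreviated from the output above:]

```shell
# Two representative lines from the qsub run above:
cat > sample.err <<'EOF'
[compute-0-19.local][[28273,1],125][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect] connect() to 173.16.255.231 failed: Connection refused (111)
[compute-0-23.local][[28273,1],135][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect] connect() to 173.16.255.228 failed: Connection refused (111)
EOF

# Pull out the distinct peer addresses that refused connections, so each
# one can be checked for a firewall rule or a downed node:
grep -o 'connect() to [0-9.]* failed' sample.err | awk '{print $3}' | sort -u
```

Against the real `qsub.err`, this would reduce hundreds of error lines to the short list of hosts worth inspecting.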