Hi Ashwani,

Also, check whether there are rogue processes left over from old jobs on your compute nodes holding large numbers of file descriptors open. A reboot should fix this easily.
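[Editor's note: the rogue-process check above can be done without a reboot. A minimal sketch, assuming a Linux `/proc` filesystem; the five-process cutoff is arbitrary:]

```shell
# Count open file descriptors per process via /proc (Linux only).
# Sort so the heaviest consumers - candidate rogue processes - appear last.
for pid_dir in /proc/[0-9]*; do
    pid=${pid_dir#/proc/}
    # Some fd directories are unreadable without root; skip those quietly.
    count=$(ls "$pid_dir/fd" 2>/dev/null | wc -l)
    echo "$count $pid"
done | sort -n | tail -n 5
```

Running this on each compute node (e.g. via ssh in a loop over the node list) would show whether an old job's processes are still holding descriptors.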
My two cents.
Gus Correa

On Oct 15, 2011, at 10:34 AM, Ralph Castain wrote:

> Okay, let's try spreading them out more, just to avoid putting more procs on a node than you actually need. Add -bynode to your cmd line. This will spread the procs across all the nodes.
>
> Our default mode is "byslot", which means we fill each node before adding procs to the next one. "bynode" puts one proc on each node, wrapping around until all procs have been assigned. You may lose a little performance, as shared memory can't be used as much, but at least it has a better chance of running.
>
> On Oct 14, 2011, at 1:29 PM, Ashwani Kumar Mishra wrote:
>
>> Hi Ralph,
>> No idea how many file descriptors this program consumes :(
>>
>> Best Regards,
>> Ashwani
>>
>> On Sat, Oct 15, 2011 at 12:08 AM, Ralph Castain <r...@open-mpi.org> wrote:
>> Should be plenty for us - does your program consume a lot?
>>
>> On Oct 14, 2011, at 12:25 PM, Ashwani Kumar Mishra wrote:
>>
>>> Hi Ralph,
>>> fs.file-max = 100000
>>> Is this OK, or too low?
>>>
>>> Best Regards,
>>> Ashwani
>>>
>>> On Fri, Oct 14, 2011 at 11:45 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>> Can't offer much about the qsub job. On the first one, what is your limit on the number of file descriptors? Could be your sysadmin has it set too low.
>>>
>>> On Oct 14, 2011, at 12:07 PM, Ashwani Kumar Mishra wrote:
>>>
>>>> Hello,
>>>> When I submit the job below on a cluster of 40 nodes, each with 8 processors and 8 GB RAM, I receive the following errors.
>>>>
>>>> Both commands work well as long as I use up to 88 processors, but the moment I allocate more than 88 processors I get the two errors below.
>>>>
>>>> I tried setting ulimit to unlimited and setting the MCA parameter opal_set_max_sys_limits to 1, but the problem won't go away.
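[Editor's note: Ralph's byslot/bynode description can be illustrated with a toy round-robin. This is an editorial sketch, not Open MPI code; the node names and slot counts are made up:]

```shell
# Toy illustration of the two placement policies for 6 ranks on 3 nodes
# with 2 slots each. "bynode" wraps ranks around the node list; "byslot"
# fills each node's slots before moving to the next node.
num_nodes=3
slots_per_node=2

for rank in 0 1 2 3 4 5; do
    bynode=$((rank % num_nodes))      # bynode: rank r -> node (r mod num_nodes)
    byslot=$((rank / slots_per_node)) # byslot: rank r -> node (r / slots_per_node)
    echo "rank $rank: bynode -> node$bynode, byslot -> node$byslot"
done
```

The actual change Ralph suggests is just one extra flag, e.g. `mpirun -np 100 -bynode ...`.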
>>>>
>>>> $ mpirun=/opt/psc/ompi/bin/mpirun abyss-pe np=100 name=cattle k=50 n=10 in=s_1_1_sequence.txt
>>>>
>>>> /opt/mpi/openmpi/1.3.3/intel/bin/mpirun -np 100 ABYSS-P -k50 -q3 --coverage-hist=coverage.hist -s cattle-bubbles.fa -o cattle-1.fa s_1_1_sequence.txt
>>>>
>>>> [coe:19807] [[62863,0],0] ORTE_ERROR_LOG: The system limit on number of pipes a process can open was reached in file base/iof_base_setup.c at line 107
>>>> [coe.:19807] [[62863,0],0] ORTE_ERROR_LOG: The system limit on number of pipes a process can open was reached in file odls_default_module.c at line 203
>>>> [coe.:19807] [[62863,0],0] ORTE_ERROR_LOG: The system limit on number of network connections a process can open was reached in file oob_tcp.c at line 447
>>>> --------------------------------------------------------------------------
>>>> Error: system limit exceeded on number of network connections that can be open
>>>>
>>>> This can be resolved by setting the MCA parameter opal_set_max_sys_limits to 1,
>>>> increasing your limit descriptor setting (using limit or ulimit commands),
>>>> or asking the system administrator to increase the system limit.
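[Editor's note: the help text in the error above points at two separate limits. A hedged sketch of the places to check, assuming Linux and bash; the `ulimit -n 4096` value is arbitrary, and the mpirun flag in the last comment is the MCA parameter the error message itself names:]

```shell
# Per-process soft limit on open descriptors (what ulimit controls):
ulimit -n

# Hard limit - the ceiling a non-root user can raise the soft limit to:
ulimit -Hn

# System-wide ceiling across all processes (Ashwani's fs.file-max = 100000):
cat /proc/sys/fs/file-max

# To raise the soft limit for the current shell (up to the hard limit):
#   ulimit -n 4096
# To ask Open MPI to raise limits itself, per the error text:
#   mpirun --mca opal_set_max_sys_limits 1 ...
```

Note that the soft limit must be raised on every node where mpirun spawns children, not just the submit host, which is why setting it interactively on one machine often does not help.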
>>>> --------------------------------------------------------------------------
>>>> make: *** [cattle-1.fa] Error 1
>>>>
>>>> When I submit the same job through qsub, I receive the following error:
>>>>
>>>> $ qsub -cwd -pe orte 100 -o qsub.out -e qsub.err -b y -N abyss `which mpirun` /home/genome/abyss/bin/ABYSS-P -k 50 s_1_1_sequence.txt -o av
>>>>
>>>> [compute-0-19.local][[28273,1],125][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect] connect() to 173.16.255.231 failed: Connection refused (111)
>>>> [compute-0-19.local][[28273,1],127][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect] connect() to 173.16.255.231 failed: Connection refused (111)
>>>> [compute-0-23.local][[28273,1],135][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect] connect() to 173.16.255.228 failed: Connection refused (111)
>>>> [compute-0-23.local][[28273,1],133][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect] connect() to 173.16.255.228 failed: Connection refused (111)
>>>> [compute-0-4.local][[28273,1],113][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect] connect() to 173.16.255.231 failed: Connection refused (111)
>>>>
>>>> Best Regards,
>>>> Ashwani
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
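[Editor's note: the "Connection refused" failures in the qsub run name specific peer addresses, which is where debugging would start (firewall rules, a downed interface, or a node not running the job's processes). A sketch of extracting the distinct failing peers from such a log; the two sample lines are abbreviated from the output above:]

```shell
# Two representative lines from the qsub run above:
cat > sample.err <<'EOF'
[compute-0-19.local][[28273,1],125][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect] connect() to 173.16.255.231 failed: Connection refused (111)
[compute-0-23.local][[28273,1],135][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect] connect() to 173.16.255.228 failed: Connection refused (111)
EOF

# Pull out the distinct peer addresses that refused connections, so each
# one can be checked for a firewall rule or a downed node:
grep -o 'connect() to [0-9.]* failed' sample.err | awk '{print $3}' | sort -u
```

Against the real `qsub.err`, this would reduce hundreds of error lines to the short list of hosts worth inspecting.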