On Wed, May 28, 2008 at 10:16:06AM -0500, Stuart Martin wrote:

> What client got this error message?  Are you using globusrun-ws?  Or some 
> other client?

globusrun-ws, from VDT.


> If the issue is the remote host closing the connection because the client 
> is unable to respond in time, maybe there is an ssh timeout value that can 
> be increased?  Note, this is unrelated to GRAM as you say just below.  The 
> issue here is between one compute node and a set of others in a cluster.
>
> Is rsh an option here for you?  Maybe that would offer greater scalability?
>
>>
>> I suspect that there is limit on simultaneous connections in sshd.
>
> Interesting.  Looks like you're starting to pin down the problem and 
> understand the limits.

Actually, changing the sshd configuration parameter MaxStartups to a
larger value fixed that issue. I tried rsh and it also works, but we
will not use it unless really necessary. 
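For reference, the change was roughly the following in /etc/ssh/sshd_config (the number here is illustrative; pick a value to suit the cluster):

```
# /etc/ssh/sshd_config
# MaxStartups caps concurrent unauthenticated ssh connections.
# It takes either a single number or a start:rate:full random
# early-drop form; the old default of 10 is easy to exceed when
# mpiexec fans out over many compute nodes at once.
MaxStartups 100
```

Reload sshd on the compute nodes (e.g. `/etc/init.d/sshd reload`, or your distribution's equivalent) for the change to take effect.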

Regards,
Yuriy

>>> the
>>> situation.
>>>
>>> At what scale do problems occur with this?  By that I mean, how many PBS
>>> processes/nodes are trying to access that file at (nearly) the same time
>>> when errors begin to occur?
>>>
>>>>
>>>>
>>>> Also, the <count> tag seems to have no effect on the number of jobs
>>>> executed, other than when it is equal to one, in which case all jobs
>>>> execute on a single node.
>>>
>>> Here is the 4.0 doc on extension handling:
>>>     
>>> http://www.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Job_Desc_Extensions.html#r-wsgram-extensions-constructs
>>>
>>> This is not well documented, but for PBS, when the
>>> resourceAllocationGroup extension is used, count is ignored, as you
>>> can see in the if-then-else below that chooses what sets the PBS
>>> nodes directive.
>>>
>>> From PBS.pm >>>>>>>>
>>>    if (defined $description->nodes())
>>>    {
>>>        # Generated by ExtensionsHandler.pm from
>>>        # resourceAllocationGroup elements
>>>        print JOB '#PBS -l nodes=', $description->nodes(), "\n";
>>>    }
>>>    elsif($description->host_count() != 0)
>>>    {
>>>        print JOB '#PBS -l nodes=', $description->host_count(), "\n";
>>>    }
>>>    elsif($cluster && $cpu_per_node != 0)
>>>    {
>>>        print JOB '#PBS -l nodes=',
>>>        myceil($description->count() / $cpu_per_node), "\n";
>>>    }
>>> <<<<<<<<
>>>
>>>>
>>>>
>>>> Example job description:
>>>>
>>>> <job>
>>>>   <factoryEndpoint
>>>>           xmlns:gram="http://www.globus.org/namespaces/2004/10/gram/job"
>>>>           xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing">
>>>>       <wsa:Address>
>>>>
>>>> https://ng2test.auckland.ac.nz:8443/wsrf/services/ManagedJobFactoryService
>>>>       </wsa:Address>
>>>>       <wsa:ReferenceProperties>
>>>>           <gram:ResourceID>PBS</gram:ResourceID>
>>>>       </wsa:ReferenceProperties>
>>>>   </factoryEndpoint>
>>>> <executable>/bin/hostname</executable>
>>>> <count>200</count>
>>>> <queue>[EMAIL PROTECTED]</queue>
>>>> <jobType>multiple</jobType>
>>>>   <extensions>
>>>>       <resourceAllocationGroup>
>>>>               <hostCount>10</hostCount>
>>>>               <cpusPerHost>8</cpusPerHost>
>>>>               <processCount>162</processCount>
>>>>       </resourceAllocationGroup>
>>>>   </extensions>
>>>> </job>
>>>>
>>>> For MPI jobs the limit seems to be 20 * number of cores; for larger
>>>> numbers of processes I see errors like this:
>>>>
>>>> --------------------------------------------------------------------------
>>>> *** An error occurred in MPI_Init
>>>> *** before MPI was initialized
>>>> *** MPI_ERRORS_ARE_FATAL (goodbye)
>>>> [compute-1.local:23438] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>>>> base/pls_base_orted_cmds.c at line 275
>>>> [compute-1.local:23438] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>>>> pls_rsh_module.c at line 1164
>>>> [compute-1.local:23438] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>>>> errmgr_hnp.c at line 90
>>>> mpiexec noticed that job rank 8 with PID 22257 on node compute-10 exited
>>>> on signal 15 (Terminated).
>>>> [compute-1.local:23438] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>>>> base/pls_base_orted_cmds.c at line 188
>>>> [compute-1.local:23438] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>>>> pls_rsh_module.c at line 1196
>>>> --------------------------------------------------------------------------
>>>>
>>>> Again, this does not happen all the time.
>>>>
>>>>
>>>> Example job description:
>>>>
>>>> <job>
>>>>   <factoryEndpoint
>>>>           xmlns:gram="http://www.globus.org/namespaces/2004/10/gram/job"
>>>>           xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing">
>>>>       <wsa:Address>
>>>>
>>>> https://ng2test.auckland.ac.nz:9443/wsrf/services/ManagedJobFactoryService
>>>>       </wsa:Address>
>>>>       <wsa:ReferenceProperties>
>>>>           <gram:ResourceID>PBS</gram:ResourceID>
>>>>       </wsa:ReferenceProperties>
>>>>   </factoryEndpoint>
>>>>
>>>> <executable>test</executable>
>>>> <directory>/home/grid-bestgrid/MPI/</directory>
>>>> <queue>[EMAIL PROTECTED]</queue>
>>>> <jobType>mpi</jobType>
>>>>
>>>>   <extensions>
>>>>       <resourceAllocationGroup>
>>>>               <hostCount>5</hostCount>
>>>>               <cpusPerHost>8</cpusPerHost>
>>>>               <processCount>900</processCount>
>>>>       </resourceAllocationGroup>
>>>>   </extensions>
>>>> </job>
>>>>
>>>>
>>>>
>>>> Can anyone explain what is going on here?
>>>>
>>>>
>>>> Regards,
>>>> Yuriy
>>>>
>>>
>>
>
