Hi Yves -

How frequently do you see these warnings?  Do they cause any
servers or clients to hang?

If they are neither common nor destructive, this was likely just a
transient error case on the InfiniBand fabric that caused the operation
to time out in PVFS.  That case can safely be ignored, since the
operation will eventually be retransmitted.

If you see this a lot, it may be one of a few issues that we've fixed
in recent releases.  Which version of OrangeFS/PVFS are you using?
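
If you're not sure which version you're running, the server usually
records its version string in its log at startup; assuming a default
log location (yours may differ), something like this will show it:

        grep -i version /var/log/pvfs2-server.log | head -n 1
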
~Kyle

Kyle Schochenmaier


On Thu, Oct 18, 2012 at 4:31 PM, Becky Ligon <[email protected]> wrote:
> Yves:
>
> The timeouts that you listed below are in the configuration file.
>
> ClientJobBMITimeoutSecs (default 300) - The client's job scheduler limits each
> "job" sent across the network to this timeout.  If a job exceeds this limit,
> it is cancelled.  Depending on the request, the job may be retried.  Keep in
> mind that one PVFS request can be made up of many jobs.
>
> ClientJobFlowTimeoutSecs - This value limits the time spent on a particular
> kind of job called a flow.  A flow transfers data across the network from the
> client to a server, or from a server to the client.  Again, if the flow
> exceeds this timeout, the flow is cancelled.
>
> The server counterparts for these settings are rarely used, since the server
> doesn't normally initiate reads or writes.
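>
> For reference, here is a minimal sketch of how these four settings might
> appear in the <Defaults> section of the file system configuration file (the
> numbers shown are just the shipped defaults quoted below, not a
> recommendation):
>
>     <Defaults>
>         ServerJobBMITimeoutSecs 30
>         ServerJobFlowTimeoutSecs 30
>         ClientJobBMITimeoutSecs 300
>         ClientJobFlowTimeoutSecs 300
>     </Defaults>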
>
> I think your real problem has something to do with IB, but I am not an
> expert in that area.  I have cc'd Kyle Schochenmaier to see if he can help.
>
> Becky
>
>
>
> On Thu, Oct 18, 2012 at 4:07 PM, Yves Revaz <[email protected]> wrote:
>>
>>
>> Dear list,
>>
>> I sometimes see the following errors occurring in my PVFS server log.
>>
>> [E 10/18/2012 20:59:50] Warning: encourage_recv_incoming: mop_id 150c320
>> in RTS_DONE message not found.
>> [E 10/18/2012 21:00:50] job_time_mgr_expire: job time out: cancelling flow
>> operation, job_id: 33307291.
>> [E 10/18/2012 21:00:50] fp_multiqueue_cancel: flow proto cancel called on
>> 0xf18c80
>> [E 10/18/2012 21:00:50] fp_multiqueue_cancel: I/O error occurred
>> [E 10/18/2012 21:00:50] handle_io_error: flow proto error cleanup started
>> on 0xf18c80: Operation cancelled (possibly due to timeout)
>> [E 10/18/2012 21:00:50] handle_io_error: flow proto 0xf18c80 canceled 1
>> operations, will clean up.
>> [E 10/18/2012 21:00:50] bmi_recv_callback_fn: I/O error occurred
>> [E 10/18/2012 21:00:50] handle_io_error: flow proto 0xf18c80 error cleanup
>> finished: Operation cancelled (possibly due to time
>>
>>
>> Searching the mailing list archives, I found suggestions to increase these
>> settings from their default values
>>
>>         ServerJobBMITimeoutSecs 30
>>         ServerJobFlowTimeoutSecs 30
>>         ClientJobBMITimeoutSecs 300
>>         ClientJobFlowTimeoutSecs 300
>>
>> to 600.
>>
>> What is the origin of these timeouts?
>>
>> Thanks,
>>
>>
>> yves
>>
>> --
>>                                                  (o o)
>> --------------------------------------------oOO--(_)--OOo-------
>>   Dr. Yves Revaz
>>   Laboratory of Astrophysics EPFL
>>
>>   Observatoire de Sauverny     Tel : ++ 41 22 379 24 28
>>   51. Ch. des Maillettes       Fax : ++ 41 22 379 22 05
>>   1290 Sauverny             e-mail : [email protected]
>>   SWITZERLAND                  Web : http://www.lunix.ch/revaz/
>> ----------------------------------------------------------------
>>
>
>
>
>
> --
> Becky Ligon
> OrangeFS Support and Development
> Omnibond Systems
> Anderson, South Carolina
>
>
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
