I'm working on a set of patches for the IB support.  There are several
issues I'm working through on the patches before I commit them.  I'll send
you a copy when I have them ready for release so you can test them.


-Randy


On 2/7/13 8:54 AM, "Yves Revaz" <[email protected]> wrote:

>On 10/18/2012 11:41 PM, Kyle Schochenmaier wrote:
>> Hi Yves -
>>
>> How frequently do you see these warnings?  Does it cause any
>> servers/clients to hang?
>
>Hi Kyle and the list,
>
>In a previous mail, I was mentioning the following errors:
>
>[E 02/07/2013 14:39:24] Warning: encourage_recv_incoming: mop_id d0e680
>in RTS_DONE message not found.
>[E 02/07/2013 14:39:54] job_time_mgr_expire: job time out: cancelling
>flow operation, job_id: 17549115350.
>[E 02/07/2013 14:39:54] fp_multiqueue_cancel: flow proto cancel called
>on 0x1bce5e0
>[E 02/07/2013 14:39:54] fp_multiqueue_cancel: I/O error occurred
>[E 02/07/2013 14:39:54] handle_io_error: flow proto error cleanup
>started on 0x1bce5e0: Operation cancelled (possibly due to timeout)
>[E 02/07/2013 14:39:54] handle_io_error: flow proto 0x1bce5e0 canceled 1
>operations, will clean up.
>[E 02/07/2013 14:39:54] bmi_recv_callback_fn: I/O error occurred
>[E 02/07/2013 14:39:54] handle_io_error: flow proto 0x1bce5e0 error
>cleanup finished: Operation cancelled (possibly due to timeout)
>
>In fact, I'm trying to move 10 TB of data into our PVFS file system using
>rsync. When a lot of data is transferred, these errors occur very
>frequently, about every 5 minutes, which is very annoying.
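
To put a number on "about every 5 minutes", the timestamps in the server log can be compared directly. The following is a minimal Python sketch, assuming the log lines keep the `[E MM/DD/YYYY HH:MM:SS]` prefix shown in the excerpts above; the helper names and the third sample line are hypothetical and only there for illustration.

```python
import re
from datetime import datetime

# Matches the timestamp prefix of a job timeout entry, e.g.
# "[E 02/07/2013 14:39:54] job_time_mgr_expire: job time out: ..."
STAMP = re.compile(
    r"^\[E (\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\] job_time_mgr_expire:"
)

def timeout_times(lines):
    """Return the timestamps of all job_time_mgr_expire entries."""
    times = []
    for line in lines:
        # Strip mail quote markers (">") so pasted excerpts also work.
        match = STAMP.match(line.lstrip("> "))
        if match:
            times.append(datetime.strptime(match.group(1), "%m/%d/%Y %H:%M:%S"))
    return times

def gaps_in_seconds(times):
    """Seconds elapsed between consecutive timeout entries."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]

sample = [
    "[E 02/07/2013 14:39:24] Warning: encourage_recv_incoming: mop_id d0e680",
    "[E 02/07/2013 14:39:54] job_time_mgr_expire: job time out: cancelling",
    # Hypothetical second timeout, five minutes after the first.
    "[E 02/07/2013 14:44:54] job_time_mgr_expire: job time out: cancelling",
]

times = timeout_times(sample)
gaps = gaps_in_seconds(times)
```

For a real log, the same functions can be fed an open file object in place of `sample`.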
>
>I've checked our IB network, which is perfectly sane.
>I'm currently using orangefs-2.8.6. Should I move to 2.8.7?
>Looking at the changelog of the 2.8.7 release, I don't think any
>IB-related problems have been fixed.
>
>Thanks,
>
>yves
>
>> If this is neither common nor destructive, it may simply be that an
>> error occurred on the InfiniBand fabric and the operation timed out in
>> PVFS. That can safely be ignored, as the operation would eventually be
>> retransmitted.
>>
>> If you see this a lot, it may be one of a few issues that we've fixed
>> in recent releases. Which version of orangefs/pvfs are you using?
>> ~Kyle
>>
>> Kyle Schochenmaier
>>
>>
>> On Thu, Oct 18, 2012 at 4:31 PM, Becky Ligon<[email protected]>  wrote:
>>> Yves:
>>>
>>> The timeouts that you listed below are in the configuration file.
>>>
>>> ClientJobBMITimeoutSecs 300 - The client's job scheduler limits each
>>> "job" sent across the network to this timeout.  If the job exceeds this
>>> limit, the job is cancelled.  Depending on the request, the job may be
>>> retried.  Keep in mind that one PVFS request can be made up of many jobs.
>>>
>>> ClientJobFlowTimeoutSecs - This value limits the time spent on a
>>> particular job called a flow.  A flow is used to transfer data across
>>> the network to a server or to transfer data from a server to the client.
>>> Again, if the flow exceeds this timeout, then the flow is cancelled.
>>>
>>> The server counterparts for these settings are rarely used, since the
>>> server doesn't normally initiate reads or writes.
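
As a rough illustration of the two client settings described above, the fragment below shows them raised from 300 to the 600 seconds mentioned in the original question. It assumes the options sit in the <Defaults> section of the file system configuration file, as in files produced by pvfs2-genconfig, and that `#` starts a comment there; treat the placement and the values as a sketch, not a recommendation.

```
<Defaults>
    # Per-job network timeout on the client, in seconds (was 300)
    ClientJobBMITimeoutSecs 600
    # Per-flow data transfer timeout on the client, in seconds (was 300)
    ClientJobFlowTimeoutSecs 600
</Defaults>
```

Longer timeouts only give a slow transfer more time to finish; they would not be expected to fix an underlying IB problem.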
>>>
>>> I think your real problem has something to do with IB, but I am not an
>>> expert in that area.  I have cc'd Kyle Schochenmaier to see if he can
>>> help.
>>>
>>> Becky
>>>
>>>
>>>
>>> On Thu, Oct 18, 2012 at 4:07 PM, Yves Revaz<[email protected]>  wrote:
>>>>
>>>> Dear list,
>>>>
>>>> I sometimes have the following error occurring in my pvfs server log.
>>>>
>>>> [E 10/18/2012 20:59:50] Warning: encourage_recv_incoming: mop_id 150c320 in RTS_DONE message not found.
>>>> [E 10/18/2012 21:00:50] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 33307291.
>>>> [E 10/18/2012 21:00:50] fp_multiqueue_cancel: flow proto cancel called on 0xf18c80
>>>> [E 10/18/2012 21:00:50] fp_multiqueue_cancel: I/O error occurred
>>>> [E 10/18/2012 21:00:50] handle_io_error: flow proto error cleanup started on 0xf18c80: Operation cancelled (possibly due to timeout)
>>>> [E 10/18/2012 21:00:50] handle_io_error: flow proto 0xf18c80 canceled 1 operations, will clean up.
>>>> [E 10/18/2012 21:00:50] bmi_recv_callback_fn: I/O error occurred
>>>> [E 10/18/2012 21:00:50] handle_io_error: flow proto 0xf18c80 error cleanup finished: Operation cancelled (possibly due to time
>>>>
>>>>
>>>> Looking at the mailing list, I've found a suggestion to increase these
>>>> default values (300)
>>>>
>>>>          ServerJobBMITimeoutSecs 30
>>>>          ServerJobFlowTimeoutSecs 30
>>>>          ClientJobBMITimeoutSecs 300
>>>>          ClientJobFlowTimeoutSecs 300
>>>>
>>>> to 600.
>>>>
>>>> What is the origin of these timeouts?
>>>>
>>>> Thanks,
>>>>
>>>>
>>>> yves
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>                                                   (o o)
>>>> --------------------------------------------oOO--(_)--OOo-------
>>>>    Dr. Yves Revaz
>>>>    Laboratory of Astrophysics EPFL
>>>>
>>>>    Observatoire de Sauverny     Tel : ++ 41 22 379 24 28
>>>>    51. Ch. des Maillettes       Fax : ++ 41 22 379 22 05
>>>>    1290 Sauverny             e-mail : [email protected]
>>>>    SWITZERLAND                  Web : http://www.lunix.ch/revaz/
>>>> ----------------------------------------------------------------
>>>>
>>>> _______________________________________________
>>>> Pvfs2-users mailing list
>>>> [email protected]
>>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
>>>
>>>
>>>
>>> --
>>> Becky Ligon
>>> OrangeFS Support and Development
>>> Omnibond Systems
>>> Anderson, South Carolina
>>>
>>>
>
>
>-- 
>
>----------------------------------------------------------------
>   Dr. Yves Revaz
>   Laboratory of Astrophysics
>   Ecole Polytechnique Fédérale de Lausanne (EPFL)
>   Observatoire de Sauverny     Tel : ++ 41 22 379 24 28
>   51. Ch. des Maillettes       Fax : ++ 41 22 379 22 05
>   1290 Sauverny             e-mail : [email protected]
>   SWITZERLAND                  Web : http://www.lunix.ch/revaz/
>----------------------------------------------------------------
>


