Hi Christine,
I have tested your patch, but it does not solve my problem. By adding printf
calls, I found that whether or not a retransmission occurred in my test case,
the retrans_message_queue is always empty. It seems that the
retrans_message_queue is used only in the recovery state?
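
A toy sketch of the distinction, to make the question concrete. This is not corosync code: the structs, field names, and the function toy_may_hold_token() are all hypothetical stand-ins. It assumes the pending retransmit requests (the "Retransmit List" in the logs) are carried by the token rather than by retrans_message_queue, so a hold check that looks only at that queue would never see them.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for illustration only; NOT the real corosync structs. */
struct toy_token {
    size_t rtr_list_entries;          /* assumed: retransmit requests carried by the token */
};

struct toy_node {
    size_t new_message_queue_len;     /* messages waiting to be sent */
    size_t retrans_message_queue_len; /* observed to stay empty in this test */
};

/* Return true when the token may enter hold mode (hypothetical rule). */
bool toy_may_hold_token(const struct toy_node *node,
                        const struct toy_token *token)
{
    if (node->new_message_queue_len > 0) {
        return false;   /* ordinary traffic pending */
    }
    if (node->retrans_message_queue_len > 0) {
        return false;   /* queued retransmits pending */
    }
    if (token->rtr_list_entries > 0) {
        return false;   /* retransmit requests still outstanding on the token */
    }
    return true;        /* ring is idle, holding is safe */
}
```

With both queues empty and one entry in the token's retransmit list, this returns false, which is the case the first two checks alone would miss.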
On Aug 5, 2014 3:50 PM, "Christine Caulfield" <[email protected]> wrote:

> On 01/08/14 10:50, Christine Caulfield wrote:
>
>> On 01/08/14 10:42, Jan Friesse wrote:
>>
>>> Jason,
>>>
>>>
>>>  Hi All,
>>>>
>>>> I have encountered a problem: when there is no other activity on
>>>> the ring but
>>>> only retransmission, and the token is in hold mode, retransmission
>>>> becomes
>>>> slow. Moreover, if the retransmission always fails but token
>>>>
>>>
>>> Yes
>>>
>>>  rotation works well,
>>>> then it takes quite a long time (fail_to_recv_const * token_hold = 2500
>>>> * 180ms = 450sec) for the retransmitting node to meet the "FAILED TO
>>>> RECEIVE" condition to
>>>> re-construct a new ring. This can be reproduced by the following steps:
>>>>
>>>>      1) Create a two-node cluster in udpu transport mode.
>>>>      2) Wait until there is no other activity on the ring.
>>>>      3) On one or both nodes, delete the other node from the nodelist in
>>>> corosync.conf
>>>>      4) corosync-cfgtool -R, this can cause a message retransmission,
>>>> but I am
>>>>      not sure why.
>>>>      5) Token rotation still works well, but the retransmission
>>>> cannot be
>>>>      satisfied due to the node deletion, so only the "FAILED TO RECEIVE"
>>>> condition can form a new
>>>>      ring. But we need to wait 450 seconds for it to happen. During
>>>> this wait,
>>>>      we saw the following logs:
>>>>
>>>>
>>> This is a really weird case.
>>>
>>>       Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
>>>>      Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
>>>>      Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
>>>>      Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
>>>>      Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
>>>>      ...
>>>>
>>>>
>>>> This problem can be solved by adding token_hold_cancel_send() in both
>>>> retransmission request and response conditions in orf_token_rtr() to
>>>> speed up
>>>> retransmission. I created a patch below, any comments?
>>>>
>>>>
>>> Ok. The patch looks fine, but during review I had another idea. What about
>>> prohibiting the start of hold mode when there are messages to retransmit?
>>> Such a solution may be cleaner, don't you think?
>>>
>>> Anyway, this is a change in a very critical part of the code, so Chrissie,
>>> can you please take a look at the patch and express your opinion?
>>>
>>
>>
>> I was looking it over yesterday. It's a problem I have definitely
>> seen myself on some VM systems so it's certainly not an isolated case. I
>> think Honza is right that there might be a better way of fixing it so
>> I'll have a look.
>>
>> Chrissie
>>
>
>
> Annoyingly my common reproducer seems not to be working and I can't get
> yours to make it happen either. If you can still reproduce it could you try
> this patch for me please?
>
> Chrissie
>
>
> _______________________________________________
> discuss mailing list
> [email protected]
> http://lists.corosync.org/mailman/listinfo/discuss
>
>
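
As a quick check of the 450-second figure in the original report quoted above, the arithmetic can be written out as below. This is only a sketch: failed_to_recv_wait_ms() is a hypothetical helper, not a corosync function, and it assumes the token is forwarded once per hold interval and that the "FAILED TO RECEIVE" counter advances once per token visit, as the report describes.

```c
#include <stdint.h>

/* Hypothetical helper, not part of corosync.
 * Estimated wait before "FAILED TO RECEIVE" triggers when the token
 * is released only once per hold interval. */
uint64_t failed_to_recv_wait_ms(uint32_t fail_to_recv_const,
                                uint32_t token_hold_ms)
{
    /* Widen before multiplying so large settings cannot overflow. */
    return (uint64_t)fail_to_recv_const * token_hold_ms;
}
```

With the values from the report, failed_to_recv_wait_ms(2500, 180) gives 450000 ms, i.e. 450 seconds.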