On Fri, Aug 1, 2014 at 2:50 AM, Christine Caulfield <[email protected]>
wrote:

> On 01/08/14 10:42, Jan Friesse wrote:
>
>> Jason,
>>
>>
>>  Hi All,
>>>
>>> I have encountered a problem: when there is no other activity on the
>>> ring but only retransmission, and the token is in hold mode, the
>>> retransmission becomes slow. Moreover, if the retransmission always
>>> fails but token
>>>
>>
>> Yes
>>
>>  rotation works well,
>>> then it takes quite a long time (fail_to_recv_const * token_hold = 2500
>>> * 180 ms = 450 sec) for the retransmitting node to meet the "FAILED TO
>>> RECEIVE" condition and re-form a new ring. This can be reproduced with
>>> the following steps:
>>>
>>>      1) Create a two-node cluster in udpu transport mode.
>>>      2) Wait until there is no other activity on the ring.
>>>      3) On one or both nodes, delete the other node from the nodelist
>>>      in corosync.conf.
>>>      4) Run corosync-cfgtool -R; this causes a message retransmission,
>>>      but I am not sure why.
>>>      5) Token rotation still works, but the retransmission can never be
>>>      satisfied because of the node deletion, so only the "FAILED TO
>>>      RECEIVE" condition can form a new ring. But we need to wait 450
>>>      seconds for that to happen. During this wait, we saw the
>>>      following logs:
>>>
>>>
>> This is a really weird case.
>>
>>>      Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
>>>      Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
>>>      Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
>>>      Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
>>>      Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
>>>      ...
>>>
>>>
>>> This problem can be solved by adding token_hold_cancel_send() in both
>>> the retransmission-request and retransmission-response paths in
>>> orf_token_rtr() to speed up retransmission. I created a patch below;
>>> any comments?
>>>
>>>
>> Ok. The patch looks fine, but during review I had another idea: what
>> about prohibiting the start of hold mode when there are messages to
>> retransmit? Such a solution might be cleaner, wouldn't it?
>>
>>
This seems better to me. Pragmatically speaking, it avoids the scenario
where a hold-cancel message is lost over UDP, and it adds only one more
conditional rather than extra complication.


>> Anyway, this is a change in a very critical part of the code, so
>> Chrissie, can you please take a look at the patch and give your opinion?
>>
>
>
> I was looking it over yesterday. It's a problem I have definitely seen
> myself on some VM systems, so it's certainly not an isolated case. I think
> Honza is right that there might be a better way of fixing it, so I'll have
> a look.
>
> The patch will work, but I think Honza's approach is better.

Regards,
-steve


> Chrissie
>
>  Regards,
>>    Honza
>>
>>
>>>      Signed-off-by: Jason HU <[email protected]>
>>>
>>> ------------------------------- exec/totemsrp.c
>>> -------------------------------
>>> index dcda8d1..c227c44 100644
>>> @@ -2672,6 +2672,7 @@ static int orf_token_rtr (
>>>
>>>       strcpy (retransmit_msg, "Retransmit List: ");
>>>       if (orf_token->rtr_list_entries) {
>>> +        token_hold_cancel_send(instance);
>>>           log_printf (instance->totemsrp_log_level_debug,
>>>               "Retransmit List %d", orf_token->rtr_list_entries);
>>>           for (i = 0; i < orf_token->rtr_list_entries; i++) {
>>> @@ -2726,6 +2727,10 @@ static int orf_token_rtr (
>>>       range = orf_token->seq - instance->my_aru;
>>>       assert (range < QUEUE_RTR_ITEMS_SIZE_MAX);
>>>
>>> +    if (range >= 1) {
>>> +        token_hold_cancel_send(instance);
>>> +    }
>>> +
>>>       for (i = 1; (orf_token->rtr_list_entries <
>>> RETRANSMIT_ENTRIES_MAX) &&
>>>           (i <= range); i++) {
>>>
>>>
>>>
>>>
>>>
>>
> _______________________________________________
> discuss mailing list
> [email protected]
> http://lists.corosync.org/mailman/listinfo/discuss
>
