Hi Christine,
Has there been any progress on a fix for this problem?
We think it is a serious problem that this log message confuses system
administrators.
* The problem occurs rarely, but when it does occur the impact on the
system administrator is large.
Even if a full fix is difficult, a log message reporting that the LEAVE message
was discarded would, for now, help the system administrator avoid this confusion.
>> static int net_deliver_fn (
>> int fd,
>> int revents,
>> void *data)
>> {
>> struct totemudp_instance *instance = (struct totemudp_instance *)data;
>> struct msghdr msg_recv;
>> struct iovec *iovec;
>> (snip)
>> /*
>> * Drop all non-mcast messages (more specifically join
>> * messages should be dropped)
>> */
>> message_type = (char *)iovec->iov_base;
>> if (instance->flushing == 1 && *message_type == MESSAGE_TYPE_MEMB_JOIN) {
>> iovec->iov_len = FRAME_SIZE_MAX;
------> I think that some kind of log should appear here.
>> return (0);
>> }
>> (snip)
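
A minimal sketch of the kind of log I mean is below. It is a standalone
illustration, not a patch against totemudp.c: the struct, the function, the
DEMO_ constant and its value, and the drop counter are all hypothetical
stand-ins, and fprintf() is used only as a placeholder for whatever logging
call corosync would really use at this point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the real message type constant; the value
 * here is an assumption made only for this illustration. */
#define DEMO_MESSAGE_TYPE_MEMB_JOIN 3

/* Hypothetical, reduced stand-in for struct totemudp_instance. */
struct demo_instance {
	int flushing;                         /* 1 while the socket is being flushed */
	unsigned long dropped_while_flushing; /* assumed counter, not in the original code */
};

/*
 * Mirrors the check quoted above: a JOIN-type message that arrives while
 * flushing is dropped. The only addition is that the drop is counted and
 * logged, so an administrator can see that a JOIN (or a LEAVE, which is
 * sent as a special case of JOIN) was discarded.
 *
 * Returns 1 if the message was dropped, 0 if it should be delivered.
 */
static int demo_drop_during_flush (
	struct demo_instance *instance,
	const char *buf,
	size_t len)
{
	assert (instance != NULL);

	if (buf == NULL || len == 0) {
		return (0);
	}
	if (instance->flushing == 1 && buf[0] == DEMO_MESSAGE_TYPE_MEMB_JOIN) {
		instance->dropped_while_flushing++;
		/* Placeholder for the real logging call. */
		fprintf (stderr,
			"JOIN/LEAVE message dropped while flushing (total %lu)\n",
			instance->dropped_while_flushing);
		return (1);
	}
	return (0);
}
```

With a message like this in the log, the later "A processor failed" line could
at least be matched to a discarded LEAVE message.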
Best Regards,
Hideo Yamauchi.
----- Original Message -----
> From: "[email protected]" <[email protected]>
> To: Christine Caulfield <[email protected]>; "[email protected]"
> <[email protected]>
> Cc:
> Date: 2015/3/11, Wed 06:46
> Subject: Re: [corosync] It is sometimes judged to be node trouble.
>
> Hi Christine,
>
> Thank you for comments!
>
>> Yes, I think I can see what's happening here. JOIN messages get
>> discarded during flushing because that can cause entry into GATHER state
>> at an inappropriate time. For a normal JOIN message that's fine because
>> the joining node will re-send the message. But this also causes LEAVE
>> messages to be discarded too (as they are a special case of JOIN). This
>> causes the error you are seeing.
>>
>> The fix is non-trivial, sadly, but I'm looking into it
>
>
> I understand that the fix for this problem may be large.
> However, we would like a node that leaves normally not to be judged as
> failed. We hope it can be fixed.
>
> Best Regards,
> Hideo Yamauchi.
>
>
>
>
> ----- Original Message -----
>> From: Christine Caulfield <[email protected]>
>> To: [email protected]
>> Cc:
>> Date: 2015/3/10, Tue 21:55
>> Subject: Re: [corosync] It is sometimes judged to be node trouble.
>>
>> On 09/03/15 01:09, [email protected] wrote:
>>> Hi All,
>>>
>>> We built a cluster with corosync.
>>> Afterwards, we shut down one node.
>>>
>>> The node that we shut down is then sometimes judged as failed by the cluster.
>>>
>>> ---------------------------------------
>>> Oct 21 11:03:30 XXX corosync[21677]: [TOTEM ] A processor failed, forming new configuration.
>>> ---------------------------------------
>>>
>>> This phenomenon seems to occur with very low probability.
>>>
>>> We think it is a problem that the log reports the node that we shut down as failed.
>>>
>>> The cause is that the leave message (memb_leave_message_send), which is
>>> sent when a user stops corosync, may be discarded.
>>>
>>> static int net_deliver_fn (
>>> int fd,
>>> int revents,
>>> void *data)
>>> {
>>> struct totemudp_instance *instance = (struct totemudp_instance *)data;
>>> struct msghdr msg_recv;
>>> struct iovec *iovec;
>>> (snip)
>>> /*
>>> * Drop all non-mcast messages (more specifically join
>>> * messages should be dropped)
>>> */
>>> message_type = (char *)iovec->iov_base;
>>> if (instance->flushing == 1 && *message_type == MESSAGE_TYPE_MEMB_JOIN) {
>>> iovec->iov_len = FRAME_SIZE_MAX;
>>> return (0);
>>> }
>>> (snip)
>>>
>>> We would like the leave to be handled reliably when a node is stopped.
>>> Is it possible to fix this handling in corosync?
>>> * We think it is a problem that the log reports the node that we shut
>>> down as failed.
>>
>> Yes, I think I can see what's happening here. JOIN messages get
>> discarded during flushing because that can cause entry into GATHER state
>> at an inappropriate time. For a normal JOIN message that's fine because
>> the joining node will re-send the message. But this also causes LEAVE
>> messages to be discarded too (as they are a special case of JOIN). This
>> causes the error you are seeing.
>>
>> The fix is non-trivial, sadly, but I'm looking into it
>>
>> Thanks for the report!
>>
>> Chrissie
>>
>>
>>
>>> We hope that this problem can be fixed in the next version if possible.
>>>
>>>
>>> Best Regards,
>>> Hideo Yamauchi.
>>>
>>>
>>> _______________________________________________
>>> discuss mailing list
>>> [email protected]
>>> http://lists.corosync.org/mailman/listinfo/discuss
>>>