Re: [corosync] It is sometimes judged to be node trouble.

renayama19661014 Wed, 15 Apr 2015 23:14:06 -0700

Hi Christine,

Has there been any progress on a fix for this problem?


We consider it a serious problem that this log message confuses system
administrators.
 * The problem occurs rarely, but its impact on a system administrator is
   large when it does occur.

Even if a proper fix is difficult, logging a message whenever a LEAVE message
is discarded would help avoid that confusion for now.


>>  static int net_deliver_fn (
>>  int fd,
>>  int revents,
>>  void *data)
>>  {
>>  struct totemudp_instance *instance = (struct totemudp_instance *)data;
>>  struct msghdr msg_recv;
>>  struct iovec *iovec;
>>  (snip)
>>  /*
>>   * Drop all non-mcast messages (more specifically join
>>   * messages should be dropped)
>>   */
>>  message_type = (char *)iovec->iov_base;
>>  if (instance->flushing == 1 && *message_type == MESSAGE_TYPE_MEMB_JOIN) {
>>  iovec->iov_len = FRAME_SIZE_MAX;

 ------> I think that some kind of log should appear here.

>>  return (0);
>>  }
>>  (snip)
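
A minimal standalone sketch of the kind of logging meant here. It is not the
corosync code: struct flush_state, drop_during_flush() and the dropped counter
are hypothetical names used only for illustration, the value given to
MESSAGE_TYPE_MEMB_JOIN is a stand-in, and fprintf() takes the place of
whatever logging call totemudp.c would really use.

```c
#include <stdio.h>

/* Stand-in value for this sketch only; the real definition lives in the
 * corosync sources and may differ. */
#define MESSAGE_TYPE_MEMB_JOIN 3

/* Hypothetical, cut-down stand-in for struct totemudp_instance. */
struct flush_state {
    int flushing;           /* 1 while the receive queue is being flushed */
    unsigned long dropped;  /* hypothetical count of discarded messages */
};

/*
 * Mirrors the check quoted above: while flushing, a JOIN-type message is
 * discarded. The only addition is that the discard is counted and logged.
 * Returns 1 if the message was discarded, 0 if it should be processed.
 */
static int drop_during_flush(struct flush_state *st, const char *msg, FILE *log)
{
    if (st->flushing == 1 && msg[0] == MESSAGE_TYPE_MEMB_JOIN) {
        st->dropped++;
        fprintf(log,
                "discarded JOIN-type message during flush "
                "(it may have been a LEAVE), total discarded: %lu\n",
                st->dropped);
        return 1;
    }
    return 0;
}
```

Because a LEAVE is sent as a special case of JOIN, a line like this would at
least tell the administrator that a message which could have been a LEAVE was
discarded shortly before "A processor failed" appears in the log.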

Best Regards,
Hideo Yamauchi.



----- Original Message -----
> From: "[email protected]" <[email protected]>
> To: Christine Caulfield <[email protected]>; "[email protected]" 
> <[email protected]>
> Cc: 
> Date: 2015/3/11, Wed 06:46
> Subject: Re: [corosync] It is sometimes judged to be node trouble.
> 
> Hi Christine,
> 
> Thank you for your comments!
> 
>>  Yes, I think I can see what's happening here. JOIN messages get
>>  discarded during flushing because that can cause entry into GATHER state
>>  at an inappropriate time. For a normal JOIN message that's fine because
>>  the joining node will re-send the message. But this also causes LEAVE
>>  messages to be discarded too (as they are a special case of JOIN). This
>>  causes the error you are seeing.
>>  
>>  The fix is non-trivial, sadly, but I'm looking into it
> 
> 
> I understand that the fix for this problem may be a large one.
> However, we would like a node that leaves the cluster not to be judged as
> having failed. We hope it can be fixed.
> 
> Best Regards,
> Hideo Yamauchi.
> 
> 
> 
> 
> ----- Original Message -----
>>  From: Christine Caulfield <[email protected]>
>>  To: [email protected]
>>  Cc: 
>>  Date: 2015/3/10, Tue 21:55
>>  Subject: Re: [corosync] It is sometimes judged to be node trouble.
>> 
>>  On 09/03/15 01:09, [email protected] wrote:
>>>   Hi All,
>>> 
>>>   We have built a cluster with corosync.
>>>   Afterwards, we shut down one node.
>>> 
>>>   The cluster then sometimes judges the node that we shut down to have
>>>   failed.
>>> 
>>>   ---------------------------------------
>>>   Oct 21 11:03:30 XXX corosync[21677]: [TOTEM ] A processor failed, forming new configuration.
>>>   ---------------------------------------
>>> 
>>>   This phenomenon seems to occur with very low probability.
>>> 
>>>   We think it is a problem that the log reports the node that we shut
>>>   down as having failed.
>>> 
>>>   The cause is that the leave message (memb_leave_message_send), which is
>>>   sent when a user stops corosync, may be discarded.
>>> 
>>>   static int net_deliver_fn (
>>>   int fd,
>>>   int revents,
>>>   void *data)
>>>   {
>>>   struct totemudp_instance *instance = (struct totemudp_instance *)data;
>>>   struct msghdr msg_recv;
>>>   struct iovec *iovec;
>>>   (snip)
>>>   /*
>>>    * Drop all non-mcast messages (more specifically join
>>>    * messages should be dropped)
>>>    */
>>>   message_type = (char *)iovec->iov_base;
>>>   if (instance->flushing == 1 && *message_type == MESSAGE_TYPE_MEMB_JOIN) {
>>>   iovec->iov_len = FRAME_SIZE_MAX;
>>>   return (0);
>>>   }
>>>   (snip)
>>> 
>>>   We would like a leave to be handled reliably so that the node stops
>>>   cleanly.
>>>   Is it possible to fix how corosync handles this?
>>>    * We think it is a problem that the log reports the node that we
>>>      shut down as having failed.
>> 
>>  Yes, I think I can see what's happening here. JOIN messages get
>>  discarded during flushing because that can cause entry into GATHER state
>>  at an inappropriate time. For a normal JOIN message that's fine because
>>  the joining node will re-send the message. But this also causes LEAVE
>>  messages to be discarded too (as they are a special case of JOIN). This
>>  causes the error you are seeing.
>> 
>>  The fix is non-trivial, sadly, but I'm looking into it
>> 
>>  Thanks for the report!
>> 
>>  Chrissie
>> 
>> 
>> 
>>>   We hope that this problem can be fixed in the next version if possible.
>>> 
>>> 
>>>   Best Regards,
>>>   Hideo Yamauchi.
>>> 
>>> 
>>>   _______________________________________________
>>>   discuss mailing list
>>>   [email protected]
>>>   http://lists.corosync.org/mailman/listinfo/discuss
>>> 
>> 

