Hi Honza,
Thank you for your comments.
> can you please tell me the exact reproducer for physical hw? (because brctl
> delif is, I believe, not valid on hw at all).
The following is the environment in which I reported the problem the second
time, on physical hardware:
-------------------------
Enclosure : BladeSystem c7000 Enclosure
node1, node2, node3 : HP ProLiant BL460c G6 (CPU: Xeon E5540, Mem: 16GB) --- Blade
    NIC: Flex-10 Embedded Ethernet x 1 (2 ports)
    NIC: NC325m Quad Port 1Gb NIC for c-Class BladeSystem (4 ports)
SW : GbE2c Ethernet Blade Switch x 6
-------------------------
In addition, I performed the interface cut on the switch side.
* In the second report, I did not execute the brctl command at all.
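For reference, a software-only equivalent of the switch-side cut (a sketch of
what I believe reproduces the same condition, not the commands we actually
ran) would be to drop the cluster traffic on the node while leaving the links
up:
-------------------------
# hypothetical equivalent of a switch-side cut: the links stay up
# (unlike "ip link set ... down"), but the cluster traffic is dropped
[root@bl460g6a ~]# iptables -A INPUT  -i eth1 -j DROP
[root@bl460g6a ~]# iptables -A OUTPUT -o eth1 -j DROP
[root@bl460g6a ~]# iptables -A INPUT  -i eth2 -j DROP
[root@bl460g6a ~]# iptables -A OUTPUT -o eth2 -j DROP
-------------------------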
Do you need more detailed hardware information?
If anything further is necessary, I will send it.
Best Regards,
Hideo Yamauchi.
--- On Wed, 2013/6/12, Jan Friesse <[email protected]> wrote:
> Hideo,
> can you please tell me the exact reproducer for physical hw? (because brctl
> delif is, I believe, not valid on hw at all).
>
> Thanks,
> Honza
>
> [email protected] napsal(a):
> > Hi Fabio,
> >
> > Thank you for your comment.
> >
> >> I'll let Honza look at it, I don't have enough physical hardware to
> >> reproduce.
> >
> > All right.
> >
> > Many Thanks!
> > Hideo Yamauchi.
> >
> >
> > --- On Tue, 2013/6/11, Fabio M. Di Nitto <[email protected]> wrote:
> >
> >> Hi Yamauchi-san,
> >>
> >> I'll let Honza look at it, I don't have enough physical hardware to
> >> reproduce.
> >>
> >> Fabio
> >>
> >> On 06/11/2013 01:15 AM, [email protected] wrote:
> >>> Hi Fabio,
> >>>
> >>> Thank you for your comments.
> >>>
> >>> We confirmed this problem in a physical environment.
> >>> The corosync communication goes over eth1 and eth2.
> >>>
> >>> -------------------------------------------------------
> >>> [root@bl460g6a ~]# ip addr show
> >>> (snip)
> >>> 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
> >>> qlen 1000
> >>> link/ether f4:ce:46:b3:fe:3c brd ff:ff:ff:ff:ff:ff
> >>> inet 192.168.101.9/24 brd 192.168.101.255 scope global eth1
> >>> inet6 fe80::f6ce:46ff:feb3:fe3c/64 scope link
> >>> valid_lft forever preferred_lft forever
> >>> 4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
> >>> qlen 1000
> >>> link/ether 18:a9:05:78:6c:f0 brd ff:ff:ff:ff:ff:ff
> >>> inet 192.168.102.9/24 brd 192.168.102.255 scope global eth2
> >>> inet6 fe80::1aa9:5ff:fe78:6cf0/64 scope link
> >>> valid_lft forever preferred_lft forever
> >>> (snip)
> >>> 8: virbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state
> >>> UNKNOWN
> >>> link/ether 52:54:00:7f:f3:0a brd ff:ff:ff:ff:ff:ff
> >>> inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
> >>> 9: virbr0-nic: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen
> >>> 500
> >>> link/ether 52:54:00:7f:f3:0a brd ff:ff:ff:ff:ff:ff
> >>> -------------------------------------------------------
> >>>
> >>> I do not think this is a problem specific to a virtual environment.
> >>>
> >>> Just to be sure, I attach the logs that I collected on the three
> >>> blades (RHEL6.4).
> >>> * I blocked the communication at the network switch.
> >>>
> >>> The phenomenon is similar: one node keeps looping back into the
> >>> OPERATIONAL state, while the other two nodes never settle into an
> >>> OPERATIONAL state.
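> >>>
> >>> In case it helps, the same symptom can also be checked on each node with
> >>> the standard ring status command (just a suggestion for confirming it;
> >>> the attached logs are what we actually collected):
> >>>
> >>> -------------------------------------------------------
> >>> # shows the status of ring 0 and ring 1 on the local node
> >>> [root@bl460g6a ~]# corosync-cfgtool -s
> >>> -------------------------------------------------------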
> >>>
> >>> After all, is this the same problem as the bug that you pointed out?
> >>>> Check this thread as reference:
> >>>> http://lists.linuxfoundation.org/pipermail/openais/2013-April/016792.html
> >>>
> >>>
> >>> Best Regards,
> >>> Hideo Yamauchi.
> >>>
> >>>
> >>>
> >>> --- On Fri, 2013/5/31, Fabio M. Di Nitto <[email protected]> wrote:
> >>>
> >>>> On 5/31/2013 7:12 AM, [email protected] wrote:
> >>>>> Hi All,
> >>>>>
> >>>>> We discovered a problem in the network communication of corosync.
> >>>>>
> >>>>> We built a cluster of three corosync nodes on KVM.
> >>>>>
> >>>>> Step 1) Start the corosync service on all nodes.
> >>>>>
> >>>>> Step 2) Confirm that the cluster is formed by all of the nodes and that
> >>>>> every node has reached the OPERATIONAL state.
> >>>>>
> >>>>> Step 3) Cut off the network of node1 (rh64-coro1) and node2 (rh64-coro2)
> >>>>> from the KVM host.
> >>>>>
> >>>>> [root@kvm-host ~]# brctl delif virbr3 vnet5; brctl delif virbr2 vnet1
> >>>>>
> >>>>> Step 4) Because the problem occurred, we stopped all nodes.
> >>>>>
> >>>>>
> >>>>> The problem occurs at step 3.
> >>>>>
> >>>>> One node (rh64-coro1) keeps cycling through states, repeatedly returning
> >>>>> to the OPERATIONAL state.
> >>>>>
> >>>>> The other two nodes (rh64-coro2 and rh64-coro3) keep changing state.
> >>>>> They never seem to reach an OPERATIONAL state while the first node is
> >>>>> running.
> >>>>>
> >>>>> This means that the two nodes (rh64-coro2 and rh64-coro3) cannot complete
> >>>>> the cluster membership.
> >>>>> When this network trouble happens in a configuration where corosync is
> >>>>> combined with Pacemaker, corosync cannot notify Pacemaker of the change
> >>>>> in cluster membership (we confirm this on the Pacemaker side as shown
> >>>>> below).
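> >>>>>
> >>>>> We check the membership from the Pacemaker side with the standard
> >>>>> one-shot monitor (shown only for illustration):
> >>>>>
> >>>>> # cluster membership and resources as Pacemaker currently sees them
> >>>>> [root@rh64-coro1 ~]# crm_mon -1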
> >>>>>
> >>>>>
> >>>>> Question 1) Are there any parameters in corosync.conf that would solve
> >>>>> this problem?
> >>>>> * We think it could be avoided by bonding the interfaces into a single
> >>>>> one and setting "rrp_mode: none", but we do not want to give up the
> >>>>> redundant ring (see the configuration sketch after these questions).
> >>>>>
> >>>>> Question 2) Is this a bug? Or is this the intended behavior of corosync
> >>>>> communication?
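> >>>>>
> >>>>> For reference, a minimal sketch of the redundant-ring configuration we
> >>>>> have in mind (the addresses, multicast groups, and exact values here are
> >>>>> illustrative, not copied from our real corosync.conf):
> >>>>>
> >>>>> totem {
> >>>>>     version: 2
> >>>>>     rrp_mode: passive            # the redundant ring we want to keep
> >>>>>     interface {
> >>>>>         ringnumber: 0
> >>>>>         bindnetaddr: 192.168.101.0   # illustrative ring 0 network
> >>>>>         mcastaddr: 239.255.1.1       # illustrative multicast group
> >>>>>         mcastport: 5405
> >>>>>     }
> >>>>>     interface {
> >>>>>         ringnumber: 1
> >>>>>         bindnetaddr: 192.168.102.0   # illustrative ring 1 network
> >>>>>         mcastaddr: 239.255.2.1       # illustrative multicast group
> >>>>>         mcastport: 5405
> >>>>>     }
> >>>>> }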
> >>>>
> >>>> We already checked this specific test, and it appears to be a bug in
> >>>> the kernel bridge code when handling multicast traffic (groups are not
> >>>> joined correctly and traffic is not forwarded).
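> >>>>
> >>>> A workaround that is often suggested for that bridge bug (an assumption
> >>>> on my side, not verified against your exact setup) is to disable
> >>>> multicast snooping on the bridges used by the cluster:
> >>>>
> >>>> # on the KVM host; virbr2/virbr3 are the bridges from your reproducer
> >>>> # (note: this setting does not persist across reboots)
> >>>> echo 0 > /sys/class/net/virbr2/bridge/multicast_snooping
> >>>> echo 0 > /sys/class/net/virbr3/bridge/multicast_snooping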
> >>>>
> >>>> Check this thread as reference:
> >>>> http://lists.linuxfoundation.org/pipermail/openais/2013-April/016792.html
> >>>>
> >>>> Thanks
> >>>> Fabio
> >>>>
> >>>>
> >>
> >>
> >
>
>
_______________________________________________
discuss mailing list
[email protected]
http://lists.corosync.org/mailman/listinfo/discuss