Ali, could you share the output of "ps | grep ovsdb" and "netstat -lpn | grep 6641" on the new slave node after you run "crm resource move"?
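
In case it helps, something along these lines would collect both outputs in one go. It is only a sketch: the helper names are made up, I am assuming the 6641/6642 ports used in this thread, and "netstat -p" usually needs root to show the owning process.

```shell
#!/bin/sh
# Sketch only: gather the outputs requested above on the node that was
# just demoted. Helper names are hypothetical; ports 6641 (NB) and
# 6642 (SB) are the ones used in this thread.

# Print ovsdb processes with full command lines. The [o] bracket keeps
# grep from matching its own command line.
show_ovsdb_procs() {
    ps -ef 2>/dev/null | grep '[o]vsdb'
}

# Read "netstat -lpn"-style lines on stdin and keep those whose local
# address (4th field) ends in ":<port>".
filter_listeners() {
    awk -v suffix=":$1" '$4 ~ (suffix "$")'
}

if command -v ps >/dev/null 2>&1; then
    echo "== ovsdb processes =="
    show_ovsdb_procs || echo "no ovsdb process found"
fi

if command -v netstat >/dev/null 2>&1; then
    for port in 6641 6642; do
        echo "== listeners on port $port =="
        netstat -lpn 2>/dev/null | filter_listeners "$port"
    done
fi
```

An empty "listeners" section on the slave would mean the TCP port was closed on demote, which is what the LB needs.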

On Fri, May 11, 2018 at 2:25 PM, aginwala <aginw...@asu.edu> wrote:

> Thanks Han for more suggestions:
>
>
> I did test failover by gracefully stopping pacemaker+corosync on master
> node along with crm move and it works as expected too as crm move is
> triggering promote of new master and hence the new master gets elected
> along with slave getting demoted as expected to listen on sync-from node.
> Hence, whatever code change I posted earlier is well and good.
>
> # crm stat
> Stack: corosync
> Current DC: test-pace1-2365293 (version 1.1.14-70404b0) - partition with
> quorum
> 2 nodes and 2 resources configured
>
> Online: [ test-pace1-2365293 test-pace2-2365308 ]
>
> Full list of resources:
>
> Master/Slave Set: ovndb_servers-master [ovndb_servers]
> Masters: [ test-pace2-2365308 ]
> Slaves: [ test-pace1-2365293 ]
>
> #crm --debug resource move ovndb_servers test-pace1-2365293
> DEBUG: pacemaker version: [err: ][out: CRM Version: 1.1.14 (70404b0)]
> DEBUG: found pacemaker version: 1.1.14
> DEBUG: invoke: crm_resource --quiet --move -r 'ovndb_servers'
> --node='test-pace1-2365293'
> # crm stat
>
> Stack: corosync
> Current DC: test-pace1-2365293 (version 1.1.14-70404b0) - partition with
> quorum
> 2 nodes and 2 resources configured
>
> Online: [ test-pace1-2365293 test-pace2-2365308 ]
>
> Full list of resources:
>
> Master/Slave Set: ovndb_servers-master [ovndb_servers]
> Masters: [ test-pace1-2365293 ]
> Slaves: [ test-pace2-2365308 ]
>
> Failed Actions:
> * ovndb_servers_monitor_10000 on test-pace2-2365308 'master' (8): call=46,
> status=complete, exitreason='none',
> last-rc-change='Fri May 11 14:08:35 2018', queued=0ms, exec=83ms
>
> Note: Failed Actions warning only comes for crm move command and not
> using reboot/kill/service pacemaker/corosync stop/start
>
> I cleaned up the warning using below commad:
> #crm_resource -P
> Waiting for 1 replies from the CRMd.
> OK
>
> Also wanted to call out above findings noticed that ocf_attribute_target
> is not getting called as per pacemaker logs as code says it will not work
> for older pacemaker versions and not sure what versions exactly as I am on
> version 1.1.14
> # pacemaker logs
> notice: operation_finished: ovndb_servers_monitor_10000:7561:stderr [
> /usr/lib/ocf/resource.d/ovn/ovndb-servers: line 31: ocf_attribute_target:
> command not found ]
>
>
> # Also need nb db logs are showing socket util errors which I think need a
> code change too to skip stamping it as functionality is still working as
> expected (may be in a separate commit since its ovsdb change)
> 018-05-11T21:14:25.958Z|00560|socket_util|ERR|6641:10.149.4.252: bind:
> Cannot assign requested address
> 2018-05-11T21:14:25.958Z|00561|socket_util|ERR|6641:10.149.4.252: bind:
> Cannot assign requested address
> 2018-05-11T21:14:27.859Z|00562|socket_util|ERR|6641:10.149.4.252: bind:
> Cannot assign requested address
>
>
>
> Let me know for any suggestions further.
>
>
> Regards,
> Aliasgar
>
>
> On Thu, May 10, 2018 at 3:49 PM, Han Zhou <zhou...@gmail.com> wrote:
>
>> Good progress!
>>
>> I think at least one more change is needed to ensure when demote happens,
>> the TCP port is shut down. Otherwise, the LB will be confused again and
>> can't figure out which one is active. This is the graceful failover
>> scenario which can be tested by crm resource move instead of reboot/killing
>> process.
>>
>> This may be done by the same approach you did for promote, i.e. stop
>> ovsdb and then call ovsdb_server_start() so the parameters are reset
>> correctly before starting. Alternatively we can add a command in
>> ovsdb-server, in addition to the commands that switches to/from
>> active/backup modes, to open/close the TCP ports, to avoid restarting
>> during failover, but I am not sure if this is valuable. It depends on
>> whether restarting ovsdb-server during failover is sufficient enough.
>> Could you add the restart logic for demote and try more? Thanks!
>>
>> Thanks,
>> Han
>>
>> On Thu, May 10, 2018 at 1:54 PM, aginwala <aginw...@asu.edu> wrote:
>>
>>> Hi :
>>>
>>> Just to further update, I am able to re-open tcp port for failover
>>> scenario when new master is getting promoted with additional code changes
>>> as below which do require stop of ovs service on the new selected master to
>>> reset the tcp settings:
>>>
>>>
>>> diff --git a/ovn/utilities/ovndb-servers.ocf
>>> b/ovn/utilities/ovndb-servers.ocf
>>> index 164b6bc..8cb4c25 100755
>>> --- a/ovn/utilities/ovndb-servers.ocf
>>> +++ b/ovn/utilities/ovndb-servers.ocf
>>> @@ -295,8 +295,8 @@ ovsdb_server_start() {
>>>
>>>      set ${OVN_CTL}
>>>
>>> -    set $@ --db-nb-addr=${MASTER_IP} --db-nb-port=${NB_MASTER_PORT}
>>> -    set $@ --db-sb-addr=${MASTER_IP} --db-sb-port=${SB_MASTER_PORT}
>>> +    set $@ --db-nb-port=${NB_MASTER_PORT}
>>> +    set $@ --db-sb-port=${SB_MASTER_PORT}
>>>
>>>      if [ "x${NB_MASTER_PROTO}" = xtcp ]; then
>>>          set $@ --db-nb-create-insecure-remote=yes
>>> @@ -307,6 +307,8 @@ ovsdb_server_start() {
>>>      fi
>>>
>>>      if [ "x${present_master}" = x ]; then
>>> +        set $@ --db-nb-create-insecure-remote=yes
>>> +        set $@ --db-sb-create-insecure-remote=yes
>>>          # No master detected, or the previous master is not among the
>>>          # set starting.
>>>          #
>>> @@ -316,6 +318,8 @@ ovsdb_server_start() {
>>>          set $@ --db-nb-sync-from-addr=${INVALID_IP_ADDRESS}
>>> --db-sb-sync-from-addr=${INVALID_IP_ADDRESS}
>>>
>>>      elif [ ${present_master} != ${host_name} ]; then
>>> +        set $@ --db-nb-create-insecure-remote=no
>>> +        set $@ --db-sb-create-insecure-remote=no
>>>          # An existing master is active, connect to it
>>>          set $@ --db-nb-sync-from-addr=${MASTER_IP}
>>> --db-sb-sync-from-addr=${MASTER_IP}
>>>          set $@ --db-nb-sync-from-port=${NB_MASTER_PORT}
>>> @@ -416,6 +420,8 @@ ovsdb_server_promote() {
>>>          ;;
>>>      esac
>>>
>>> +    ${OVN_CTL} stop_ovsdb
>>> +    ovsdb_server_start
>>>      ${OVN_CTL} promote_ovnnb
>>>      ${OVN_CTL} promote_ovnsb
>>>
>>>
>>>
>>> Below are the scenarios tested:
>>> Master  Slave  Scenario  Result
>>>
>>> -
>>>
>>>
>>> -
>>>
>>> reboot/failure New master gets promoted with tcp ports enabled to start
>>> taking LB traffic.
>>>
>>> -
>>>
>>>
>>> -
>>>
>>> reboot/failure
>>> No change and current master continues taking traffic with slave
>>> continue to sync from master.
>>>
>>> -
>>>
>>>
>>> -
>>>
>>> reboot/failure
>>> New master gets promoted with tcp ports enabled to start taking LB
>>> traffic.
>>>
>>> Also sync on slaves from master works as expected:
>>> # On master
>>> ovn-nbctl --db=tcp:10.169.129.33:6641 ls-add 556
>>> # on slave port is shutdown as expected
>>> ovn-nbctl --db=tcp:10.169.129.34:6641 show
>>> ovn-nbctl: tcp:10.169.129.34:6641: database connection failed
>>> (Connection refused)
>>> # on slave local unix socket, above lswitch 556 gets replicated too as
>>> --sync-from=tcp:10.149.4.252:6641
>>> ovn-nbctl show
>>> switch 2bd07b67-fd6b-401d-9612-da75e8f9ffc8 (556)
>>>
>>> # Same testing for sb db too
>>> # Slave port 6642 is shutdown too
>>> ovn-sbctl --db=tcp:10.169.129.34:6642 show hangs and
>>> # Using master ip works
>>> ovn-sbctl --db=tcp:10.169.129.33:6642 show
>>> Chassis "21f12bd6-e9e8-4ee2-afeb-28b331df6715"
>>>     hostname: "test-pace2-2365308.lvs02.dev.ebayc3.com"
>>>     Encap geneve
>>>         ip: "10.169.129.34"
>>>         options: {csum="true"}
>>>
>>>
>>>
>>> # Accessing via LB vip works fine too as only one member is active:
>>> for i in `seq 1 500`; do ovn-sbctl --db=tcp:10.149.4.252:6642 show; done
>>> switch 2bd07b67-fd6b-401d-9612-da75e8f9ffc8 (556)
>>> switch 2bd07b67-fd6b-401d-9612-da75e8f9ffc8 (556)
>>> switch 2bd07b67-fd6b-401d-9612-da75e8f9ffc8 (556)
>>> switch 2bd07b67-fd6b-401d-9612-da75e8f9ffc8 (556)
>>> switch 2bd07b67-fd6b-401d-9612-da75e8f9ffc8 (556)
>>>
>>>
>>> Everything works fine as expected. Let me know for any corner case
>>> missed. I will submit a formal patch using LISTEN_ON_MASTER_IP_ONLY for
>>> using LB with tcp to avoid breaking existing functionality accordingly.
>>>
>>>
>>>
>>> Regards,
>>> Aliasgar
>>>
>>>
>>>
>>> On Thu, May 10, 2018 at 9:55 AM, aginwala <aginw...@asu.edu> wrote:
>>>
>>>> Thanks folks for suggestions:
>>>>
>>>> For LB vip configurations, I did the testing further and yes it does
>>>> tries to hit the slave db as per the logs below and fails as slave do not
>>>> have write permission of which LB is not aware of:
>>>> for i in `seq 1 500`; do ovn-nbctl --db=tcp:10.149.4.252:6641 ls-add
>>>> $i590;done
>>>> ovn-nbctl: transaction error: {"details":"insert operation not allowed
>>>> when database server is in read only mode","error":"not allowed"}
>>>> ovn-nbctl: transaction error: {"details":"insert operation not allowed
>>>> when database server is in read only mode","error":"not allowed"}
>>>> ovn-nbctl: transaction error: {"details":"insert operation not allowed
>>>> when database server is in read only mode","error":"not allowed"}
>>>>
>>>> Hence, with little more code changes(in the same patch without the flag
>>>> variable suggestion), I am able to shutdown the tcp port on the slave and
>>>> it works fine as below:
>>>> #Master Node
>>>> # ovn-nbctl --db=tcp:10.169.129.33:6641 ls-add test444
>>>> #Slave Node
>>>> # ovn-nbctl --db=tcp:10.169.129.34:6641 ls-add test444
>>>> ovn-nbctl: tcp:10.169.129.34:6641: database connection failed
>>>> (Connection refused)
>>>>
>>>> Code to shutdown tcp port on slave db along with only master listening
>>>> on tcp ports:
>>>> diff --git a/ovn/utilities/ovndb-servers.ocf
>>>> b/ovn/utilities/ovndb-servers.ocf
>>>> index 164b6bc..b265df6 100755
>>>> --- a/ovn/utilities/ovndb-servers.ocf
>>>> +++ b/ovn/utilities/ovndb-servers.ocf
>>>> @@ -295,8 +295,8 @@ ovsdb_server_start() {
>>>>
>>>>      set ${OVN_CTL}
>>>>
>>>> -    set $@ --db-nb-addr=${MASTER_IP} --db-nb-port=${NB_MASTER_PORT}
>>>> -    set $@ --db-sb-addr=${MASTER_IP} --db-sb-port=${SB_MASTER_PORT}
>>>> +    set $@ --db-nb-port=${NB_MASTER_PORT}
>>>> +    set $@ --db-sb-port=${SB_MASTER_PORT}
>>>>
>>>>      if [
"x${NB_MASTER_PROTO}" = xtcp ]; then >>>> set $@ --db-nb-create-insecure-remote=yes >>>> @@ -307,6 +307,8 @@ ovsdb_server_start() { >>>> fi >>>> >>>> if [ "x${present_master}" = x ]; then >>>> + set $@ --db-nb-create-insecure-remote=yes >>>> + set $@ --db-sb-create-insecure-remote=yes >>>> # No master detected, or the previous master is not among the >>>> # set starting. >>>> # >>>> @@ -316,6 +318,8 @@ ovsdb_server_start() { >>>> set $@ --db-nb-sync-from-addr=${INVALID_IP_ADDRESS} >>>> --db-sb-sync-from-addr=${INVALID_IP_ADDR >>>> >>>> elif [ ${present_master} != ${host_name} ]; then >>>> + set $@ --db-nb-create-insecure-remote=no >>>> + set $@ --db-sb-create-insecure-remote=no >>>> >>>> >>>> But I noticed that if the slave becomes active post failover after >>>> active node reboot/failure, pacemaker shows it online but I am not able to >>>> access the dbs. >>>> >>>> # crm status >>>> Online: [ test-pace2-2365308 ] >>>> OFFLINE: [ test-pace1-2365293 ] >>>> >>>> Full list of resources: >>>> >>>> Master/Slave Set: ovndb_servers-master [ovndb_servers] >>>> Masters: [ test-pace2-2365308 ] >>>> Stopped: [ test-pace1-2365293 ] >>>> >>>> >>>> # ovn-nbctl --db=tcp:10.169.129.33:6641 ls-add test444 >>>> ovn-nbctl: tcp:10.169.129.33:6641: database connection failed >>>> (Connection refused) >>>> # ovn-nbctl --db=tcp:10.169.129.34:6641 ls-add test444 >>>> ovn-nbctl: tcp:10.169.129.34:6641: database connection failed >>>> (Connection refused) >>>> >>>> Hence, if failover happens, slave is already running with >>>> --sync-from=lbVIP:6641/6642 for nb and sb db respectively. Thus, re-opening >>>> of tcp ports for nb and sb db on the slave that is getting promoted to >>>> master is not happening automatically. >>>> >>>> Let me know if there is a valid way/approach too which I am missing to >>>> handle it during slave promote logic? Will do further code changes >>>> accordingly. >>>> >>>> Note: Current code changes for use with LB will needs to be handled for >>>> ssl too. 
>>>> Will have to handle that separately but want to get the tcp
>>>> working first and we can add ssl support later.
>>>>
>>>>
>>>> Regards,
>>>> Aliasgar
>>>>
>>>> On Wed, May 9, 2018 at 12:19 PM, Numan Siddique <nusid...@redhat.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Thu, May 10, 2018 at 12:44 AM, Han Zhou <zhou...@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, May 9, 2018 at 11:51 AM, Numan Siddique <nusid...@redhat.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, May 10, 2018 at 12:15 AM, Han Zhou <zhou...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks Ali for the quick patch. Please see my comments inline.
>>>>>>>>
>>>>>>>> On Wed, May 9, 2018 at 9:30 AM, aginwala <aginw...@asu.edu> wrote:
>>>>>>>> >
>>>>>>>> > Thanks Han and Numan for the clarity to help sort it out.
>>>>>>>> >
>>>>>>>> > For making vip work with using LB in my two node setup, I had
>>>>>>>> changed below code to skip setting master IP when creating pcs
>>>>>>>> resource
>>>>>>>> for ovndbs and listen on 0.0.0.0 instead. Hence, the discussion seems
>>>>>>>> inline with the code change which is small for sure as below:
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > diff --git a/ovn/utilities/ovndb-servers.ocf
>>>>>>>> b/ovn/utilities/ovndb-servers.ocf
>>>>>>>> > index 164b6bc..d4c9ad7 100755
>>>>>>>> > --- a/ovn/utilities/ovndb-servers.ocf
>>>>>>>> > +++ b/ovn/utilities/ovndb-servers.ocf
>>>>>>>> > @@ -295,8 +295,8 @@ ovsdb_server_start() {
>>>>>>>> >
>>>>>>>> >      set ${OVN_CTL}
>>>>>>>> >
>>>>>>>> > -    set $@ --db-nb-addr=${MASTER_IP}
>>>>>>>> --db-nb-port=${NB_MASTER_PORT}
>>>>>>>> > -    set $@ --db-sb-addr=${MASTER_IP}
>>>>>>>> --db-sb-port=${SB_MASTER_PORT}
>>>>>>>> > +    set $@ --db-nb-port=${NB_MASTER_PORT}
>>>>>>>> > +    set $@ --db-sb-port=${SB_MASTER_PORT}
>>>>>>>> >
>>>>>>>> >      if [ "x${NB_MASTER_PROTO}" = xtcp ]; then
>>>>>>>> >          set $@ --db-nb-create-insecure-remote=yes
>>>>>>>> >
>>>>>>>>
>>>>>>>> This change solves the IP binding problem. It will just listen on
>>>>>>>> 0.0.0.0.
>>>>>>>>
>>>>>>>
>>>>>>> One problem with this approach I see is that it would listen on all
>>>>>>> the IPs. May be it's not a good idea and may have some security issues.
>>>>>>>
>>>>>>> Can we instead check the value of MASTER_IP param something like
>>>>>>> below ?
>>>>>>>
>>>>>>> if [ "$MASTER_IP" == "0.0.0.0" ]; then
>>>>>>>     set $@ --db-nb-addr=${MASTER_IP} --db-nb-port=${NB_MASTER_PORT}
>>>>>>>     set $@ --db-sb-addr=${MASTER_IP} --db-sb-port=${SB_MASTER_PORT}
>>>>>>> else
>>>>>>>     set $@ --db-nb-port=${NB_MASTER_PORT}
>>>>>>>     set $@ --db-sb-port=${SB_MASTER_PORT}
>>>>>>> fi
>>>>>>>
>>>>>>> And when you create OVN pacemaker resource in your deployment, you
>>>>>>> can pass master_ip=0.0.0.0
>>>>>>>
>>>>>>> Will this work ?
>>>>>>>
>>>>>>>
>>>>>> Maybe some misunderstanding here. We still need to use master_ip = LB
>>>>>> VIP, so that the standby nodes can "sync-from" the active node. So we
>>>>>> cannot pass 0.0.0.0 explicitly.
>>>>>>
>>>>>
>>>>> I misunderstood earlier. I thought you wouldn't need master ip at all.
>>>>> Thanks for the clarification.
>>>>>
>>>>>>
>>>>>> I didn't understand your code above either. Why would we specify the
>>>>>> master_ip if we know it is 0.0.0.0? Or do you mean the other way around
>>>>>> but
>>>>>> just a typo in the code?
>>>>>>
>>>>>> For security of listening on any IP, I am not quit sure. It may be a
>>>>>> problem if the nodes sits on multiple networks and some of them are
>>>>>> considered insecure, and you want to listen on the security one only. If
>>>>>> this is the concern, we can add a parameter e.g.
>>>>>> LISTEN_ON_MASTER_IP_ONLY,
>>>>>> and set it to true by default. What do you think?
>>>>>>
>>>>>
>>>>> I would prefer adding the parameter as you have suggested so that the
>>>>> existing behavior remain intact.
>>>>>
>>>>> Thanks
>>>>> Numan
>>>>>
>>>>>
>>>>>> Thanks,
>>>>>> Han
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
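
To make the LISTEN_ON_MASTER_IP_ONLY idea from the quoted discussion a bit more concrete, here is a rough sketch of how such an option might gate the address arguments. The parameter name and the --db-*-addr / --db-*-port options come from the thread; the helper function, the yes/no values and the port defaults are assumptions for illustration only, not the actual ovndb-servers.ocf code.

```shell
#!/bin/sh
# Hypothetical helper, not part of ovndb-servers.ocf: prints the listen
# arguments that would be appended to the ovn-ctl command line.
build_listen_args() {
    # Assumed default "yes" keeps the existing behavior: bind to
    # MASTER_IP only.
    if [ "x${LISTEN_ON_MASTER_IP_ONLY:-yes}" = xyes ]; then
        printf '%s ' "--db-nb-addr=${MASTER_IP}" "--db-sb-addr=${MASTER_IP}"
    fi
    # With the addresses omitted the server listens on 0.0.0.0, which
    # is what the LB VIP setup in this thread relies on.
    printf '%s %s\n' "--db-nb-port=${NB_MASTER_PORT:-6641}" \
                     "--db-sb-port=${SB_MASTER_PORT:-6642}"
}

# Example run in a subshell so the settings do not leak:
( MASTER_IP=10.149.4.252; LISTEN_ON_MASTER_IP_ONLY=no; build_listen_args )
# prints: --db-nb-port=6641 --db-sb-port=6642
```

MASTER_IP would still be set to the LB VIP in both modes, since the standby nodes need it for sync-from; only the bind address changes.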
_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss