[Yahoo-eng-team] [Bug 1974057] [NEW] [neutron-dynamic-routing] Plugin RPC queue should be consumed by RPC workers

2022-05-18 Thread Renat Nurgaliyev
Public bug reported:

Currently, the RPC queue of the BGP service plugin is consumed directly
in the plugin thread. This may lead to unprocessed data in TCP queues
due to infrequent polling, AMQP connection drops due to missed
heartbeats, and other unwanted behavior. Instead, the RPC queue should
be consumed by RPC workers, in the same way as is already done in other
service plugins, such as the L3 plugin and the metering plugin.
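
A minimal sketch of the intended pattern, assuming the plugin hands its
endpoints to the neutron-server RPC workers via start_rpc_listeners();
the class, attribute and import names below are illustrative and may
differ from the actual patch:

# Illustrative sketch only, not the actual change.
from neutron_lib import rpc as n_rpc  # older branches: neutron.common.rpc


class BgpRpcListenerMixin(object):
    """Hypothetical mixin showing the RPC-worker consumption pattern."""

    def start_rpc_listeners(self):
        # Called by the neutron-server RPC workers; the returned servers
        # run there, so the AMQP queue is polled by dedicated workers
        # instead of the plugin thread.
        self.conn = n_rpc.Connection()
        # rpc_topic / rpc_endpoints stand in for whatever the plugin
        # already uses today.
        self.conn.create_consumer(self.rpc_topic, self.rpc_endpoints,
                                  fanout=False)
        return self.conn.consume_in_threads()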

** Affects: neutron
 Importance: Undecided
 Assignee: Renat Nurgaliyev (rnurgaliyev)
 Status: In Progress

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1974057

Title:
  [neutron-dynamic-routing] Plugin RPC queue should be consumed by RPC
  workers

Status in neutron:
  In Progress

Bug description:
  Currently, the RPC queue of the BGP service plugin is consumed
  directly in the plugin thread. This may lead to unprocessed data in
  TCP queues due to infrequent polling, AMQP connection drops due to
  missed heartbeats, and other unwanted behavior. Instead, the RPC queue
  should be consumed by RPC workers, in the same way as is already done
  in other service plugins, such as the L3 plugin and the metering plugin.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1974057/+subscriptions




[Yahoo-eng-team] [Bug 1973347] [NEW] OVN revision_number infinite update loop

2022-05-13 Thread Renat Nurgaliyev
Public bug reported:

After the change described in
https://mail.openvswitch.org/pipermail/ovs-dev/2022-May/393966.html was
merged and released in stable OVN 22.03, it is possible to end up in an
endless loop of revision_number updates in the external_ids of ports
and router_ports. We have confirmed the bug in Ussuri and Yoga. When
the problem happens, the Neutron log looks like this:

2022-05-13 09:30:56.318 25 ... Successfully bumped revision number for resource 8af189bd-c5bf-48a9-b072-3fb6c69ae592 (type: router_ports) to 4815
2022-05-13 09:30:56.366 25 ... Running txn n=1 command(idx=0): CheckRevisionNumberCommand(...)
2022-05-13 09:30:56.367 25 ... Running txn n=1 command(idx=1): SetLSwitchPortCommand(...)
2022-05-13 09:30:56.367 25 ... Running txn n=1 command(idx=2): PgDelPortCommand(...)
2022-05-13 09:30:56.467 25 ... Successfully bumped revision number for resource 8af189bd-c5bf-48a9-b072-3fb6c69ae592 (type: ports) to 4815
2022-05-13 09:30:56.880 25 ... Running txn n=1 command(idx=0): CheckRevisionNumberCommand(...)
2022-05-13 09:30:56.881 25 ... Running txn n=1 command(idx=1): UpdateLRouterPortCommand(...)
2022-05-13 09:30:56.881 25 ... Running txn n=1 command(idx=2): SetLRouterPortInLSwitchPortCommand(...)
2022-05-13 09:30:56.984 25 ... Successfully bumped revision number for resource 8af189bd-c5bf-48a9-b072-3fb6c69ae592 (type: router_ports) to 4816
2022-05-13 09:30:57.057 25 ... Running txn n=1 command(idx=0): CheckRevisionNumberCommand(...)
2022-05-13 09:30:57.057 25 ... Running txn n=1 command(idx=1): SetLSwitchPortCommand(...)
2022-05-13 09:30:57.058 25 ... Running txn n=1 command(idx=2): PgDelPortCommand(...)
2022-05-13 09:30:57.159 25 ... Successfully bumped revision number for resource 8af189bd-c5bf-48a9-b072-3fb6c69ae592 (type: ports) to 4816
2022-05-13 09:30:57.523 25 ... Running txn n=1 command(idx=0): CheckRevisionNumberCommand(...)
2022-05-13 09:30:57.523 25 ... Running txn n=1 command(idx=1): UpdateLRouterPortCommand(...)
2022-05-13 09:30:57.524 25 ... Running txn n=1 command(idx=2): SetLRouterPortInLSwitchPortCommand(...)
2022-05-13 09:30:57.627 25 ... Successfully bumped revision number for resource 8af189bd-c5bf-48a9-b072-3fb6c69ae592 (type: router_ports) to 4817
2022-05-13 09:30:57.674 25 ... Running txn n=1 command(idx=0): CheckRevisionNumberCommand(...)
2022-05-13 09:30:57.674 25 ... Running txn n=1 command(idx=1): SetLSwitchPortCommand(...)
2022-05-13 09:30:57.675 25 ... Running txn n=1 command(idx=2): PgDelPortCommand(...)
2022-05-13 09:30:57.765 25 ... Successfully bumped revision number for resource 8af189bd-c5bf-48a9-b072-3fb6c69ae592 (type: ports) to 4817

(full version here: https://pastebin.com/raw/NLP1b6Qm).

In our lab environment we have confirmed that the problem goes away
after the mentioned change is rolled back.
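
For illustration only, a toy sketch of the feedback loop visible in the
log above; the functions are made up and are not actual Neutron OVN
driver code. Each write of neutron:revision_number into the external_ids
of one resource generates an update event for the paired resource, whose
handler bumps the other revision again, and so on indefinitely:

# Toy model of the loop; names are illustrative, not driver code.
def bump(kind, revisions):
    revisions[kind] += 1
    # Writing the new number into external_ids (neutron:revision_number)
    # itself produces an OVN DB update that Neutron receives as an
    # update event for the paired resource.
    return 'router_ports' if kind == 'ports' else 'ports'


revisions = {'ports': 4814, 'router_ports': 4814}
next_kind = 'router_ports'
for _ in range(6):  # in the bug this never terminates
    next_kind = bump(next_kind, revisions)
print(revisions)  # both counters keep climbing, as in the log excerpt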

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1973347

Title:
  OVN revision_number infinite update loop

Status in neutron:
  New

Bug description:
  After the change described in
  https://mail.openvswitch.org/pipermail/ovs-dev/2022-May/393966.html
  was merged and released in stable OVN 22.03, it is possible to end up
  in an endless loop of revision_number updates in the external_ids of
  ports and router_ports. We have confirmed the bug in Ussuri and Yoga.
  When the problem happens, the Neutron log looks like this:

  2022-05-13 09:30:56.318 25 ... Successfully bumped revision number for resource 8af189bd-c5bf-48a9-b072-3fb6c69ae592 (type: router_ports) to 4815
  2022-05-13 09:30:56.366 25 ... Running txn n=1 command(idx=0): CheckRevisionNumberCommand(...)
  2022-05-13 09:30:56.367 25 ... Running txn n=1 command(idx=1): SetLSwitchPortCommand(...)
  2022-05-13 09:30:56.367 25 ... Running txn n=1 command(idx=2): PgDelPortCommand(...)
  2022-05-13 09:30:56.467 25 ... Successfully bumped revision number for resource 8af189bd-c5bf-48a9-b072-3fb6c69ae592 (type: ports) to 4815
  2022-05-13 09:30:56.880 25 ... Running txn n=1 command(idx=0): CheckRevisionNumberCommand(...)
  2022-05-13 09:30:56.881 25 ... Running txn n=1 command(idx=1): UpdateLRouterPortCommand(...)
  2022-05-13 09:30:56.881 25 ... Running txn n=1 command(idx=2): SetLRouterPortInLSwitchPortCommand(...)
  2022-05-13 09:30:56.984 25 ... Successfully bumped revision number for resource 8af189bd-c5bf-48a9-b072-3fb6c69ae592 (type: router_ports) to 4816
  2022-05-13 09:30:57.057 25 ... Running txn n=1 command(idx=0): CheckRevisionNumberCommand(...)
  2022-05-13 09:30:57.057 25 ... Running txn n=1 command(idx=1): SetLSwitchPortCommand(...)
  2022-05-13 09:30:57.058 25 ... Running txn n=1 command(idx=2): PgDelPortCommand(...)
  2022-05-13 09:30:57.159 25 ... Successfully bumped revision number for resource 8af189bd-c5bf-48a9-b072-3fb6c69ae592 (type: ports) to 4816

[Yahoo-eng-team] [Bug 1920065] [NEW] Automatic rescheduling of BGP speakers on DrAgents

2021-03-18 Thread Renat Nurgaliyev
Public bug reported:

When a dynamic routing agent becomes unreachable, Neutron takes the
following actions:

1. Remove all BGP speakers from unreachable agents
2. Schedule all unassigned BGP speakers on available DrAgents

This behavior can be undesirable in the following cases:

1. Speakers are removed from a DrAgent even if there is no other alive
agent running. Sometimes I would prefer them to stay configured exactly
where they are and come back once the DrAgent is online again, for
example after the server has been restarted. The current behavior can
lead to situations, especially when there is only one active DrAgent,
where speakers are not configured on any DrAgent at all.

2. Sometimes it is desirable to let the operator control which
components run where. For example, not every node running a DrAgent has
reachability to all iBGP peers, and the network designer places route
reflectors, DrAgents and BGP speakers in their appropriate locations
with high availability and other concerns in mind. In these setups it
can be better to let the speaker fail on the DrAgent that is down.
Moving a speaker to another DrAgent also changes the source IP address
of the BGP session, which can be inconvenient to reconfigure on the
other side of the peering and is not predictable at all.

These situations may happen after the following change was introduced:
https://review.opendev.org/c/openstack/neutron-dynamic-routing/+/478455

My proposal is to add a configuration flag to control this behavior:
https://review.opendev.org/c/openstack/neutron-dynamic-routing/+/780675
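
A minimal sketch of what such a switch could look like with oslo.config;
the option and group names below are hypothetical placeholders, the real
flag is defined in the review linked above:

from oslo_config import cfg

# Hypothetical option name, for illustration only.
bgp_dragent_scheduler_opts = [
    cfg.BoolOpt('reschedule_speakers_from_down_agents',
                default=True,
                help='If set to False, BGP speakers hosted by a dynamic '
                     'routing agent that is reported down stay bound to '
                     'it instead of being removed and rescheduled to '
                     'another DrAgent.'),
]


def register_opts(conf=cfg.CONF):
    # Group name is also illustrative.
    conf.register_opts(bgp_dragent_scheduler_opts, group='bgp_scheduler')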

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1920065

Title:
  Automatic rescheduling of BGP speakers on DrAgents

Status in neutron:
  New

Bug description:
  When a dynamic routing agent becomes unreachable, Neutron takes the
  following actions:

  1. Remove all BGP speakers from unreachable agents
  2. Schedule all unassigned BGP speakers on available DrAgents

  This behavior can be undesirable in the following cases:

  1. Speakers are removed from a DrAgent even if there is no other alive
  agent running. Sometimes I would prefer them to stay configured
  exactly where they are and come back once the DrAgent is online again,
  for example after the server has been restarted. The current behavior
  can lead to situations, especially when there is only one active
  DrAgent, where speakers are not configured on any DrAgent at all.

  2. Sometimes it is desirable to let the operator control which
  components run where. For example, not every node running a DrAgent
  has reachability to all iBGP peers, and the network designer places
  route reflectors, DrAgents and BGP speakers in their appropriate
  locations with high availability and other concerns in mind. In these
  setups it can be better to let the speaker fail on the DrAgent that is
  down. Moving a speaker to another DrAgent also changes the source IP
  address of the BGP session, which can be inconvenient to reconfigure
  on the other side of the peering and is not predictable at all.

  These situations may happen after the following change was introduced:
  https://review.opendev.org/c/openstack/neutron-dynamic-routing/+/478455

  My proposal is to add a configuration flag to control this behavior:
  https://review.opendev.org/c/openstack/neutron-dynamic-routing/+/780675

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1920065/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp