[ovs-dev] [ovn] ha-chassis-group false positive failover

2023-03-30 Thread Vladislav Odintsov
Hi all,

I’m facing a false positive failover when the OpenFlow connection gets dropped 
after an inactivity probe timeout, caused by ovn-controller being under high 
load while handling a huge amount of OVSDB updates.
I wonder whether it’s a bug or whether I have to tune something. See the 
ovn-controller logs:

2023-03-30T07:57:24.469Z|58440|inc_proc_eng|INFO|node: logical_flow_output, 
recompute (failed handler for input port_groups) took 35774ms
2023-03-30T07:57:28.521Z|58441|lflow_cache|INFO|Detected cache inactivity (last 
active 40006 ms ago): trimming cache
2023-03-30T07:57:28.528Z|58442|timeval|WARN|Unreasonably long 40084ms poll 
interval (32947ms user, 6968ms system)
2023-03-30T07:57:28.528Z|58443|timeval|WARN|faults: 43468 minor, 0 major
…..
2023-03-30T07:57:48.784Z|58496|reconnect|ERR|ssl::6642: no response to 
inactivity probe after 60.2 seconds, disconnecting
2023-03-30T07:57:48.784Z|58497|reconnect|INFO|ssl::6642: connection dropped
2023-03-30T07:57:48.905Z|58498|main|INFO|OVNSB commit failed, force recompute 
next time.
2023-03-30T07:57:49.786Z|58499|reconnect|INFO|ssl::6642: connecting...
2023-03-30T07:57:49.924Z|58500|reconnect|INFO|ssl::6642: connected
2023-03-30T07:57:29.781Z|58501|poll_loop|INFO|wakeup due to 0-ms timeout at 
lib/stream-ssl.c:838 (74% CPU usage)
2023-03-30T07:57:29.836Z|58502|poll_loop|INFO|wakeup due to 0-ms timeout at 
lib/stream-ssl.c:838 (74% CPU usage)
….
2023-03-30T07:57:05.976Z|58545|poll_loop|INFO|wakeup due to 0-ms timeout at 
lib/stream-ssl.c:838 (101% CPU usage)
2023-03-30T07:57:07.002Z|58546|memory|INFO|peak resident set size grew 54% in 
last 682257.2 seconds, from 4435696 kB to 6824820 kB
2023-03-30T07:57:07.002Z|58547|memory|INFO|idl-cells-OVN_Southbound:11491359 
idl-cells-Open_vSwitch:14416 lflow-cache-entries-cache-expr:530298 
lflow-cache-entries-cache-matches:31770 lflow-cache-size-KB:769553 
local_datapath_usage-KB:630
 ofctrl_desired_flow_usage-KB:577052 ofctrl_installed_flow_usage-KB:409538 
ofctrl_sb_flow_ref_usage-KB:273707
…
2023-03-30T07:57:31.667Z|58552|rconn|WARN|unix:/run/openvswitch/br-int.mgmt: 
connection dropped (Broken pipe)
2023-03-30T07:57:31.739Z|58553|binding|INFO|Releasing lport cr-xxx from this 
chassis (sb_readonly=0)
2023-03-30T07:57:31.739Z|58554|if_status|WARN|Dropped 1 log messages in last 
1254725 seconds (most recently, 1254725 seconds ago) due to excessive rate
2023-03-30T07:57:31.739Z|58555|if_status|WARN|Trying to release unknown 
interface cr-xxx
2023-03-30T07:57:31.743Z|58556|binding|INFO|Releasing lport cr-yyy from this 
chassis (sb_readonly=0)
2023-03-30T07:57:31.743Z|58557|binding|INFO|Releasing lport cr-zzz from this 
chassis (sb_readonly=0)
…

After some time these ports are claimed back:

2023-03-30T07:57:49.320Z|59137|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: 
connecting...
2023-03-30T07:57:49.320Z|59138|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: 
connected
2023-03-30T07:57:49.523Z|59139|poll_loop|INFO|wakeup due to 0-ms timeout at 
controller/ovn-controller.c:4296 (99% CPU usage)
2023-03-30T07:57:49.574Z|59140|poll_loop|INFO|wakeup due to 0-ms timeout at 
controller/ovn-controller.c:4286 (99% CPU usage)
2023-03-30T07:58:01.451Z|59141|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 
(:34736<->:6642) at lib/stream-ssl.c:836 (99% CPU usage)
2023-03-30T07:58:02.449Z|59142|binding|INFO|Claiming lport cr-xxx for this 
chassis.
2023-03-30T07:58:02.449Z|59143|binding|INFO|cr-xxx: Claiming xx:xx:xx:xx:xx 
x.x.x.x/xx
2023-03-30T07:58:02.453Z|59144|binding|INFO|Claiming lport cr-yyy for this 
chassis.
2023-03-30T07:58:02.453Z|59145|binding|INFO|cr-yyy: Claiming yy:yy:yy:yy:yy:yy 
y.y.y.y/yy
…

I’ve previously run into a similar problem when the OpenFlow probe interval was 
30 seconds, and I raised it to 60 and then to 120 seconds.
But it seems that OVS was still using the default OpenFlow inactivity probe 
timeout of 60 seconds, so the error above occurred. In the OVS logs it looks 
like this:

2023-03-30T07:57:28.632Z|61003|rconn|ERR|br-int<->unix#8: no response to 
inactivity probe after 60 seconds, disconnecting

This node is a dedicated l3gateway which has the is-interconn=true setting 
(interconnection case) and a quite big OpenFlow table of around 2.7M flows.
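
For reference, a rough sketch of how such a flow count can be measured on a node 
(assuming the integration bridge is br-int; -O may need adjusting to a protocol 
version enabled on the bridge):

# Aggregate statistics, including flow_count, without dumping every flow:
ovs-ofctl -O OpenFlow15 dump-aggregate br-int

# Alternative: count dumped flows (approximate, includes a header line; slow
# on multi-million-flow tables):
ovs-ofctl -O OpenFlow15 dump-flows br-int | wc -l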

So, I’ve got some questions for OVN and OVS:
1. I couldn’t find where in the ovn-controller code an OpenFlow connection drop 
is handled. Is it a bug that the claimed cr-ports are released after the 
OpenFlow connection drops? I guess it is. Can you please point me to the 
correct place in the code to start investigating?
2. I see almost the same number of OpenFlow flows on each l3gateway with the 
interconnection role. Even l3gateways which have no allocated cr-port-bindings 
have 2M+ OpenFlow flows. Is this normal operation, or is it a place for 
possible optimisation? It seems that excess flows are installed.
3. How should long OVS OF updates be handled, which block the inactivity probe 
for the OF connection so that it then gets dropped? From ovn-controller it’s 
possible to configure inactivity pro

Re: [ovs-dev] [ovn] ha-chassis-group false positive failover

2023-05-16 Thread Vladislav Odintsov
Hi Numan, Dumitru, Ilya,

I guess I’ve managed to deal with some of these problems and have some 
conclusions.
Please see inline.

> On 14 Apr 2023, at 16:26, Vladislav Odintsov  wrote:
> 
> And as a follow-up, I see sometimes next error message in ovn-controller log, 
> though there is a configured much higher limit in seconds for inactivity 
> probe value for the chassis:
> 

ovn-controller establishes three OpenFlow connections to the local ovs-vswitchd 
for the following purposes:
1. openflow rules management (openflow probe interval is configured through 
ovn-openflow-probe-interval external_ids setting [1]; default — disabled [2]);
2. openflow features collection (hardcoded probe interval 5s [3]);
3. pinctrl thread (hardcoded probe interval 5s [4]).
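
For reference, a minimal sketch of how that setting for connection (1) can be 
inspected and adjusted on a chassis (a value of 0, or an absent key, leaves 
probing disabled on that connection):

# Show the currently configured value (errors out if the key is not set,
# in which case the default of disabled probing applies):
ovs-vsctl get open . external_ids:ovn-openflow-probe-interval

# Raise the probe interval for the flow-management connection to 120 seconds:
ovs-vsctl set open . external_ids:ovn-openflow-probe-interval=120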

> ovs|04426|rconn|ERR|unix:/run/openvswitch/br-int.mgmt: no response to 
> inactivity probe after 8 seconds, disconnecting
> 
> ovs|26560|rconn|ERR|unix:/run/openvswitch/br-int.mgmt: no response to 
> inactivity probe after 10 seconds, disconnecting
> 

These messages are most likely related to the connection for features collection 
or to the pinctrl thread, so ovn-openflow-probe-interval does not matter here.

> 
> server with first log message:
> # ovs-vsctl get open . external_ids:ovn-openflow-probe-interval
> "30"
> 
> server with second log message:
> # ovs-vsctl get open . external_ids:ovn-openflow-probe-interval
> "60"
> 
> 
> 
>> On 30 Mar 2023, at 23:19, Vladislav Odintsov  wrote:
>> 
>> Hi all,
>> 
>> I’m facing a false positive failover in the event of openflow connection got 
>> dropped after inactivity probe timeout due to high load of ovn-controller 
>> handling huge amount of ovsdb updates.
>> I wonder wether its a bug or I have to tune any specifics. See 
>> ovn-controller logs:
>> 
>> 2023-03-30T07:57:24.469Z|58440|inc_proc_eng|INFO|node: logical_flow_output, 
>> recompute (failed handler for input port_groups) took 35774ms
>> 2023-03-30T07:57:28.521Z|58441|lflow_cache|INFO|Detected cache inactivity 
>> (last active 40006 ms ago): trimming cache
>> 2023-03-30T07:57:28.528Z|58442|timeval|WARN|Unreasonably long 40084ms poll 
>> interval (32947ms user, 6968ms system)
>> 2023-03-30T07:57:28.528Z|58443|timeval|WARN|faults: 43468 minor, 0 major
>> …..
>> 2023-03-30T07:57:48.784Z|58496|reconnect|ERR|ssl::6642: no response to 
>> inactivity probe after 60.2 seconds, disconnecting
>> 2023-03-30T07:57:48.784Z|58497|reconnect|INFO|ssl::6642: connection 
>> dropped
>> 2023-03-30T07:57:48.905Z|58498|main|INFO|OVNSB commit failed, force 
>> recompute next time.
>> 2023-03-30T07:57:49.786Z|58499|reconnect|INFO|ssl::6642: connecting...
>> 2023-03-30T07:57:49.924Z|58500|reconnect|INFO|ssl::6642: connected
>> 2023-03-30T07:57:29.781Z|58501|poll_loop|INFO|wakeup due to 0-ms timeout at 
>> lib/stream-ssl.c:838 (74% CPU usage)
>> 2023-03-30T07:57:29.836Z|58502|poll_loop|INFO|wakeup due to 0-ms timeout at 
>> lib/stream-ssl.c:838 (74% CPU usage)
>> ….
>> 2023-03-30T07:57:05.976Z|58545|poll_loop|INFO|wakeup due to 0-ms timeout at 
>> lib/stream-ssl.c:838 (101% CPU usage)
>> 2023-03-30T07:57:07.002Z|58546|memory|INFO|peak resident set size grew 54% 
>> in last 682257.2 seconds, from 4435696 kB to 6824820 kB
>> 2023-03-30T07:57:07.002Z|58547|memory|INFO|idl-cells-OVN_Southbound:11491359 
>> idl-cells-Open_vSwitch:14416 lflow-cache-entries-cache-expr:530298 
>> lflow-cache-entries-cache-matches:31770 lflow-cache-size-KB:769553 
>> local_datapath_usage-KB:630
>>  ofctrl_desired_flow_usage-KB:577052 ofctrl_installed_flow_usage-KB:409538 
>> ofctrl_sb_flow_ref_usage-KB:273707
>> …
>> 2023-03-30T07:57:31.667Z|58552|rconn|WARN|unix:/run/openvswitch/br-int.mgmt: 
>> connection dropped (Broken pipe)
>> 2023-03-30T07:57:31.739Z|58553|binding|INFO|Releasing lport cr-xxx from this 
>> chassis (sb_readonly=0)
>> 2023-03-30T07:57:31.739Z|58554|if_status|WARN|Dropped 1 log messages in last 
>> 1254725 seconds (most recently, 1254725 seconds ago) due to excessive rate
>> 2023-03-30T07:57:31.739Z|58555|if_status|WARN|Trying to release unknown 
>> interface cr-xxx
>> 2023-03-30T07:57:31.743Z|58556|binding|INFO|Releasing lport cr-yyy from this 
>> chassis (sb_readonly=0)
>> 2023-03-30T07:57:31.743Z|58557|binding|INFO|Releasing lport cr-zzz from this 
>> chassis (sb_readonly=0)
>> …
>> 
>> After some time these ports are claimed back:
>> 
>> 2023-03-30T07:57:49.320Z|59137|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: 
>> connecting...
>> 2023-03-30T07:57:49.320Z|59138|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: 
>> connected
>> 2023-03-30T07:57:49.523Z|59139|poll_loop|INFO|wakeup due to 0-ms timeout at 
>> controller/ovn-controller.c:4296 (99% CPU usage)
>> 2023-03-30T07:57:49.574Z|59140|poll_loop|INFO|wakeup due to 0-ms timeout at 
>> controller/ovn-controller.c:4286 (99% CPU usage)
>> 2023-03-30T07:58:01.451Z|59141|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>> 22 (:34736<->:6642) at lib/stream-ssl.c:836 (99% CPU usage)
>> 2023-03-30T07:58:02.449Z|59142|binding|INFO|Claiming lport

Re: [ovs-dev] [ovn] ha-chassis-group false positive failover

2023-05-16 Thread Numan Siddique
Hi Vladislav,

Sorry for the late reply.

PSB for a few comments.



On Tue, May 16, 2023 at 3:42 PM Vladislav Odintsov  wrote:
>
> Hi Numan, Dumitru, Ilya,
>
> I guess I’ve managed to deal with some of these problems and have some 
> conclusions.
> Please see inline.
>
> > On 14 Apr 2023, at 16:26, Vladislav Odintsov  wrote:
> >
> > And as a follow-up, I see sometimes next error message in ovn-controller 
> > log, though there is a configured much higher limit in seconds for 
> > inactivity probe value for the chassis:
> >
>
> ovn-controller establishes 3 openflow connections to local ovs-vswitchd for 
> next purposes:
> 1. openflow rules management (openflow probe interval is configured through 
> ovn-openflow-probe-interval external_ids setting [1]; default — disabled [2]);
> 2. openflow features collection (hardcoded probe interval 5s [3]);
> 3. pinctrl thread (hardcoded probe interval 5s [4]).
>
> > ovs|04426|rconn|ERR|unix:/run/openvswitch/br-int.mgmt: no response to 
> > inactivity probe after 8 seconds, disconnecting
> >
> > ovs|26560|rconn|ERR|unix:/run/openvswitch/br-int.mgmt: no response to 
> > inactivity probe after 10 seconds, disconnecting
> >
>
> These messages most likely related to the connection for features collection 
> or pinctrl thread. So ovn-openflow-probe-interval has no matter here.
>
> >
> > server with first log message:
> > # ovs-vsctl get open . external_ids:ovn-openflow-probe-interval
> > "30"
> >
> > server with second log message:
> > # ovs-vsctl get open . external_ids:ovn-openflow-probe-interval
> > "60"
> >
> >
> >
> >> On 30 Mar 2023, at 23:19, Vladislav Odintsov  wrote:
> >>
> >> Hi all,
> >>
> >> I’m facing a false positive failover in the event of openflow connection 
> >> got dropped after inactivity probe timeout due to high load of 
> >> ovn-controller handling huge amount of ovsdb updates.
> >> I wonder wether its a bug or I have to tune any specifics. See 
> >> ovn-controller logs:
> >>
> >> 2023-03-30T07:57:24.469Z|58440|inc_proc_eng|INFO|node: 
> >> logical_flow_output, recompute (failed handler for input port_groups) took 
> >> 35774ms
> >> 2023-03-30T07:57:28.521Z|58441|lflow_cache|INFO|Detected cache inactivity 
> >> (last active 40006 ms ago): trimming cache
> >> 2023-03-30T07:57:28.528Z|58442|timeval|WARN|Unreasonably long 40084ms poll 
> >> interval (32947ms user, 6968ms system)
> >> 2023-03-30T07:57:28.528Z|58443|timeval|WARN|faults: 43468 minor, 0 major
> >> …..
> >> 2023-03-30T07:57:48.784Z|58496|reconnect|ERR|ssl::6642: no response to 
> >> inactivity probe after 60.2 seconds, disconnecting
> >> 2023-03-30T07:57:48.784Z|58497|reconnect|INFO|ssl::6642: connection 
> >> dropped
> >> 2023-03-30T07:57:48.905Z|58498|main|INFO|OVNSB commit failed, force 
> >> recompute next time.
> >> 2023-03-30T07:57:49.786Z|58499|reconnect|INFO|ssl::6642: connecting...
> >> 2023-03-30T07:57:49.924Z|58500|reconnect|INFO|ssl::6642: connected
> >> 2023-03-30T07:57:29.781Z|58501|poll_loop|INFO|wakeup due to 0-ms timeout 
> >> at lib/stream-ssl.c:838 (74% CPU usage)
> >> 2023-03-30T07:57:29.836Z|58502|poll_loop|INFO|wakeup due to 0-ms timeout 
> >> at lib/stream-ssl.c:838 (74% CPU usage)
> >> ….
> >> 2023-03-30T07:57:05.976Z|58545|poll_loop|INFO|wakeup due to 0-ms timeout 
> >> at lib/stream-ssl.c:838 (101% CPU usage)
> >> 2023-03-30T07:57:07.002Z|58546|memory|INFO|peak resident set size grew 54% 
> >> in last 682257.2 seconds, from 4435696 kB to 6824820 kB
> >> 2023-03-30T07:57:07.002Z|58547|memory|INFO|idl-cells-OVN_Southbound:11491359
> >>  idl-cells-Open_vSwitch:14416 lflow-cache-entries-cache-expr:530298 
> >> lflow-cache-entries-cache-matches:31770 lflow-cache-size-KB:769553 
> >> local_datapath_usage-KB:630
> >>  ofctrl_desired_flow_usage-KB:577052 ofctrl_installed_flow_usage-KB:409538 
> >> ofctrl_sb_flow_ref_usage-KB:273707
> >> …
> >> 2023-03-30T07:57:31.667Z|58552|rconn|WARN|unix:/run/openvswitch/br-int.mgmt:
> >>  connection dropped (Broken pipe)
> >> 2023-03-30T07:57:31.739Z|58553|binding|INFO|Releasing lport cr-xxx from 
> >> this chassis (sb_readonly=0)
> >> 2023-03-30T07:57:31.739Z|58554|if_status|WARN|Dropped 1 log messages in 
> >> last 1254725 seconds (most recently, 1254725 seconds ago) due to excessive 
> >> rate
> >> 2023-03-30T07:57:31.739Z|58555|if_status|WARN|Trying to release unknown 
> >> interface cr-xxx
> >> 2023-03-30T07:57:31.743Z|58556|binding|INFO|Releasing lport cr-yyy from 
> >> this chassis (sb_readonly=0)
> >> 2023-03-30T07:57:31.743Z|58557|binding|INFO|Releasing lport cr-zzz from 
> >> this chassis (sb_readonly=0)
> >> …
> >>
> >> After some time these ports are claimed back:
> >>
> >> 2023-03-30T07:57:49.320Z|59137|rconn|INFO|unix:/run/openvswitch/br-int.mgmt:
> >>  connecting...
> >> 2023-03-30T07:57:49.320Z|59138|rconn|INFO|unix:/run/openvswitch/br-int.mgmt:
> >>  connected
> >> 2023-03-30T07:57:49.523Z|59139|poll_loop|INFO|wakeup due to 0-ms timeout 
> >> at controller/ovn-controller.c:4296 (99% CPU usage)
> >> 2023-03-

Re: [ovs-dev] [ovn] ha-chassis-group false positive failover

2023-05-17 Thread Vladislav Odintsov


> On 16 May 2023, at 23:54, Numan Siddique  wrote:
> 
> Hi Vladislav,
> 
> Sorry for the late reply.
> 
> PSB for few comments.


Thanks for your reply!
My answers are inline.

> 
> 
> 
> On Tue, May 16, 2023 at 3:42 PM Vladislav Odintsov  > wrote:
>> 
>> Hi Numan, Dumitru, Ilya,
>> 
>> I guess I’ve managed to deal with some of these problems and have some 
>> conclusions.
>> Please see inline.
>> 
>>> On 14 Apr 2023, at 16:26, Vladislav Odintsov  wrote:
>>> 
>>> And as a follow-up, I see sometimes next error message in ovn-controller 
>>> log, though there is a configured much higher limit in seconds for 
>>> inactivity probe value for the chassis:
>>> 
>> 
>> ovn-controller establishes 3 openflow connections to local ovs-vswitchd for 
>> next purposes:
>> 1. openflow rules management (openflow probe interval is configured through 
>> ovn-openflow-probe-interval external_ids setting [1]; default — disabled 
>> [2]);
>> 2. openflow features collection (hardcoded probe interval 5s [3]);
>> 3. pinctrl thread (hardcoded probe interval 5s [4]).
>> 
>>> ovs|04426|rconn|ERR|unix:/run/openvswitch/br-int.mgmt: no response to 
>>> inactivity probe after 8 seconds, disconnecting
>>> 
>>> ovs|26560|rconn|ERR|unix:/run/openvswitch/br-int.mgmt: no response to 
>>> inactivity probe after 10 seconds, disconnecting
>>> 
>> 
>> These messages most likely related to the connection for features collection 
>> or pinctrl thread. So ovn-openflow-probe-interval has no matter here.
>> 
>>> 
>>> server with first log message:
>>> # ovs-vsctl get open . external_ids:ovn-openflow-probe-interval
>>> "30"
>>> 
>>> server with second log message:
>>> # ovs-vsctl get open . external_ids:ovn-openflow-probe-interval
>>> "60"
>>> 
>>> 
>>> 
 On 30 Mar 2023, at 23:19, Vladislav Odintsov  wrote:
 
 Hi all,
 
 I’m facing a false positive failover in the event of openflow connection 
 got dropped after inactivity probe timeout due to high load of 
 ovn-controller handling huge amount of ovsdb updates.
 I wonder wether its a bug or I have to tune any specifics. See 
 ovn-controller logs:
 
 2023-03-30T07:57:24.469Z|58440|inc_proc_eng|INFO|node: 
 logical_flow_output, recompute (failed handler for input port_groups) took 
 35774ms
 2023-03-30T07:57:28.521Z|58441|lflow_cache|INFO|Detected cache inactivity 
 (last active 40006 ms ago): trimming cache
 2023-03-30T07:57:28.528Z|58442|timeval|WARN|Unreasonably long 40084ms poll 
 interval (32947ms user, 6968ms system)
 2023-03-30T07:57:28.528Z|58443|timeval|WARN|faults: 43468 minor, 0 major
 …..
 2023-03-30T07:57:48.784Z|58496|reconnect|ERR|ssl::6642: no response to 
 inactivity probe after 60.2 seconds, disconnecting
 2023-03-30T07:57:48.784Z|58497|reconnect|INFO|ssl::6642: connection 
 dropped
 2023-03-30T07:57:48.905Z|58498|main|INFO|OVNSB commit failed, force 
 recompute next time.
 2023-03-30T07:57:49.786Z|58499|reconnect|INFO|ssl::6642: connecting...
 2023-03-30T07:57:49.924Z|58500|reconnect|INFO|ssl::6642: connected
 2023-03-30T07:57:29.781Z|58501|poll_loop|INFO|wakeup due to 0-ms timeout 
 at lib/stream-ssl.c:838 (74% CPU usage)
 2023-03-30T07:57:29.836Z|58502|poll_loop|INFO|wakeup due to 0-ms timeout 
 at lib/stream-ssl.c:838 (74% CPU usage)
 ….
 2023-03-30T07:57:05.976Z|58545|poll_loop|INFO|wakeup due to 0-ms timeout 
 at lib/stream-ssl.c:838 (101% CPU usage)
 2023-03-30T07:57:07.002Z|58546|memory|INFO|peak resident set size grew 54% 
 in last 682257.2 seconds, from 4435696 kB to 6824820 kB
 2023-03-30T07:57:07.002Z|58547|memory|INFO|idl-cells-OVN_Southbound:11491359
  idl-cells-Open_vSwitch:14416 lflow-cache-entries-cache-expr:530298 
 lflow-cache-entries-cache-matches:31770 lflow-cache-size-KB:769553 
 local_datapath_usage-KB:630
 ofctrl_desired_flow_usage-KB:577052 ofctrl_installed_flow_usage-KB:409538 
 ofctrl_sb_flow_ref_usage-KB:273707
 …
 2023-03-30T07:57:31.667Z|58552|rconn|WARN|unix:/run/openvswitch/br-int.mgmt:
  connection dropped (Broken pipe)
 2023-03-30T07:57:31.739Z|58553|binding|INFO|Releasing lport cr-xxx from 
 this chassis (sb_readonly=0)
 2023-03-30T07:57:31.739Z|58554|if_status|WARN|Dropped 1 log messages in 
 last 1254725 seconds (most recently, 1254725 seconds ago) due to excessive 
 rate
 2023-03-30T07:57:31.739Z|58555|if_status|WARN|Trying to release unknown 
 interface cr-xxx
 2023-03-30T07:57:31.743Z|58556|binding|INFO|Releasing lport cr-yyy from 
 this chassis (sb_readonly=0)
 2023-03-30T07:57:31.743Z|58557|binding|INFO|Releasing lport cr-zzz from 
 this chassis (sb_readonly=0)
 …
 
 After some time these ports are claimed back:
 
 2023-03-30T07:57:49.320Z|59137|rconn|INFO|unix:/run/openvswitch/br-int.mgmt:
  connecting...
 2023-03-30T07:57:49.320Z|59138|rconn|INFO|unix:/

Re: [ovs-dev] [ovn] ha-chassis-group false positive failover

2023-05-17 Thread Numan Siddique
On Wed, May 17, 2023 at 4:44 AM Vladislav Odintsov  wrote:
>
>
>
> > On 16 May 2023, at 23:54, Numan Siddique  wrote:
> >
> > Hi Vladislav,
> >
> > Sorry for the late reply.
> >
> > PSB for few comments.
>
>
> Thanks for your reply!
> My answers are inline.
>
> >
> >
> >
> > On Tue, May 16, 2023 at 3:42 PM Vladislav Odintsov  > > wrote:
> >>
> >> Hi Numan, Dumitru, Ilya,
> >>
> >> I guess I’ve managed to deal with some of these problems and have some 
> >> conclusions.
> >> Please see inline.
> >>
> >>> On 14 Apr 2023, at 16:26, Vladislav Odintsov  wrote:
> >>>
> >>> And as a follow-up, I see sometimes next error message in ovn-controller 
> >>> log, though there is a configured much higher limit in seconds for 
> >>> inactivity probe value for the chassis:
> >>>
> >>
> >> ovn-controller establishes 3 openflow connections to local ovs-vswitchd 
> >> for next purposes:
> >> 1. openflow rules management (openflow probe interval is configured 
> >> through ovn-openflow-probe-interval external_ids setting [1]; default — 
> >> disabled [2]);
> >> 2. openflow features collection (hardcoded probe interval 5s [3]);
> >> 3. pinctrl thread (hardcoded probe interval 5s [4]).
> >>
> >>> ovs|04426|rconn|ERR|unix:/run/openvswitch/br-int.mgmt: no response to 
> >>> inactivity probe after 8 seconds, disconnecting
> >>>
> >>> ovs|26560|rconn|ERR|unix:/run/openvswitch/br-int.mgmt: no response to 
> >>> inactivity probe after 10 seconds, disconnecting
> >>>
> >>
> >> These messages most likely related to the connection for features 
> >> collection or pinctrl thread. So ovn-openflow-probe-interval has no matter 
> >> here.
> >>
> >>>
> >>> server with first log message:
> >>> # ovs-vsctl get open . external_ids:ovn-openflow-probe-interval
> >>> "30"
> >>>
> >>> server with second log message:
> >>> # ovs-vsctl get open . external_ids:ovn-openflow-probe-interval
> >>> "60"
> >>>
> >>>
> >>>
>  On 30 Mar 2023, at 23:19, Vladislav Odintsov  wrote:
> 
>  Hi all,
> 
>  I’m facing a false positive failover in the event of openflow connection 
>  got dropped after inactivity probe timeout due to high load of 
>  ovn-controller handling huge amount of ovsdb updates.
>  I wonder wether its a bug or I have to tune any specifics. See 
>  ovn-controller logs:
> 
>  2023-03-30T07:57:24.469Z|58440|inc_proc_eng|INFO|node: 
>  logical_flow_output, recompute (failed handler for input port_groups) 
>  took 35774ms
>  2023-03-30T07:57:28.521Z|58441|lflow_cache|INFO|Detected cache 
>  inactivity (last active 40006 ms ago): trimming cache
>  2023-03-30T07:57:28.528Z|58442|timeval|WARN|Unreasonably long 40084ms 
>  poll interval (32947ms user, 6968ms system)
>  2023-03-30T07:57:28.528Z|58443|timeval|WARN|faults: 43468 minor, 0 major
>  …..
>  2023-03-30T07:57:48.784Z|58496|reconnect|ERR|ssl::6642: no response 
>  to inactivity probe after 60.2 seconds, disconnecting
>  2023-03-30T07:57:48.784Z|58497|reconnect|INFO|ssl::6642: connection 
>  dropped
>  2023-03-30T07:57:48.905Z|58498|main|INFO|OVNSB commit failed, force 
>  recompute next time.
>  2023-03-30T07:57:49.786Z|58499|reconnect|INFO|ssl::6642: 
>  connecting...
>  2023-03-30T07:57:49.924Z|58500|reconnect|INFO|ssl::6642: connected
>  2023-03-30T07:57:29.781Z|58501|poll_loop|INFO|wakeup due to 0-ms timeout 
>  at lib/stream-ssl.c:838 (74% CPU usage)
>  2023-03-30T07:57:29.836Z|58502|poll_loop|INFO|wakeup due to 0-ms timeout 
>  at lib/stream-ssl.c:838 (74% CPU usage)
>  ….
>  2023-03-30T07:57:05.976Z|58545|poll_loop|INFO|wakeup due to 0-ms timeout 
>  at lib/stream-ssl.c:838 (101% CPU usage)
>  2023-03-30T07:57:07.002Z|58546|memory|INFO|peak resident set size grew 
>  54% in last 682257.2 seconds, from 4435696 kB to 6824820 kB
>  2023-03-30T07:57:07.002Z|58547|memory|INFO|idl-cells-OVN_Southbound:11491359
>   idl-cells-Open_vSwitch:14416 lflow-cache-entries-cache-expr:530298 
>  lflow-cache-entries-cache-matches:31770 lflow-cache-size-KB:769553 
>  local_datapath_usage-KB:630
>  ofctrl_desired_flow_usage-KB:577052 
>  ofctrl_installed_flow_usage-KB:409538 ofctrl_sb_flow_ref_usage-KB:273707
>  …
>  2023-03-30T07:57:31.667Z|58552|rconn|WARN|unix:/run/openvswitch/br-int.mgmt:
>   connection dropped (Broken pipe)
>  2023-03-30T07:57:31.739Z|58553|binding|INFO|Releasing lport cr-xxx from 
>  this chassis (sb_readonly=0)
>  2023-03-30T07:57:31.739Z|58554|if_status|WARN|Dropped 1 log messages in 
>  last 1254725 seconds (most recently, 1254725 seconds ago) due to 
>  excessive rate
>  2023-03-30T07:57:31.739Z|58555|if_status|WARN|Trying to release unknown 
>  interface cr-xxx
>  2023-03-30T07:57:31.743Z|58556|binding|INFO|Releasing lport cr-yyy from 
>  this chassis (sb_readonly=0)
>  2023-03-30T07:57:31.743Z|58557|binding|INFO|Releasing

Re: [ovs-dev] [ovn] ha-chassis-group false positive failover

2023-06-05 Thread Vladislav Odintsov
Hi Dumitru, Ilya, Han,

do you guys have any comments here?
I’m gonna start working on this change in the near future...

Thank you.

> On 17 May 2023, at 18:34, Numan Siddique  wrote:
> 
> On Wed, May 17, 2023 at 4:44 AM Vladislav Odintsov  > wrote:
>> 
>> 
>> 
>>> On 16 May 2023, at 23:54, Numan Siddique  wrote:
>>> 
>>> Hi Vladislav,
>>> 
>>> Sorry for the late reply.
>>> 
>>> PSB for few comments.
>> 
>> 
>> Thanks for your reply!
>> My answers are inline.
>> 
>>> 
>>> 
>>> 
>>> On Tue, May 16, 2023 at 3:42 PM Vladislav Odintsov >> > wrote:
 
 Hi Numan, Dumitru, Ilya,
 
 I guess I’ve managed to deal with some of these problems and have some 
 conclusions.
 Please see inline.
 
> On 14 Apr 2023, at 16:26, Vladislav Odintsov  wrote:
> 
> And as a follow-up, I see sometimes next error message in ovn-controller 
> log, though there is a configured much higher limit in seconds for 
> inactivity probe value for the chassis:
> 
 
 ovn-controller establishes 3 openflow connections to local ovs-vswitchd 
 for next purposes:
 1. openflow rules management (openflow probe interval is configured 
 through ovn-openflow-probe-interval external_ids setting [1]; default — 
 disabled [2]);
 2. openflow features collection (hardcoded probe interval 5s [3]);
 3. pinctrl thread (hardcoded probe interval 5s [4]).
 
> ovs|04426|rconn|ERR|unix:/run/openvswitch/br-int.mgmt: no response to 
> inactivity probe after 8 seconds, disconnecting
> 
> ovs|26560|rconn|ERR|unix:/run/openvswitch/br-int.mgmt: no response to 
> inactivity probe after 10 seconds, disconnecting
> 
 
 These messages most likely related to the connection for features 
 collection or pinctrl thread. So ovn-openflow-probe-interval has no matter 
 here.
 
> 
> server with first log message:
> # ovs-vsctl get open . external_ids:ovn-openflow-probe-interval
> "30"
> 
> server with second log message:
> # ovs-vsctl get open . external_ids:ovn-openflow-probe-interval
> "60"
> 
> 
> 
>> On 30 Mar 2023, at 23:19, Vladislav Odintsov  wrote:
>> 
>> Hi all,
>> 
>> I’m facing a false positive failover in the event of openflow connection 
>> got dropped after inactivity probe timeout due to high load of 
>> ovn-controller handling huge amount of ovsdb updates.
>> I wonder wether its a bug or I have to tune any specifics. See 
>> ovn-controller logs:
>> 
>> 2023-03-30T07:57:24.469Z|58440|inc_proc_eng|INFO|node: 
>> logical_flow_output, recompute (failed handler for input port_groups) 
>> took 35774ms
>> 2023-03-30T07:57:28.521Z|58441|lflow_cache|INFO|Detected cache 
>> inactivity (last active 40006 ms ago): trimming cache
>> 2023-03-30T07:57:28.528Z|58442|timeval|WARN|Unreasonably long 40084ms 
>> poll interval (32947ms user, 6968ms system)
>> 2023-03-30T07:57:28.528Z|58443|timeval|WARN|faults: 43468 minor, 0 major
>> …..
>> 2023-03-30T07:57:48.784Z|58496|reconnect|ERR|ssl::6642: no response 
>> to inactivity probe after 60.2 seconds, disconnecting
>> 2023-03-30T07:57:48.784Z|58497|reconnect|INFO|ssl::6642: connection 
>> dropped
>> 2023-03-30T07:57:48.905Z|58498|main|INFO|OVNSB commit failed, force 
>> recompute next time.
>> 2023-03-30T07:57:49.786Z|58499|reconnect|INFO|ssl::6642: 
>> connecting...
>> 2023-03-30T07:57:49.924Z|58500|reconnect|INFO|ssl::6642: connected
>> 2023-03-30T07:57:29.781Z|58501|poll_loop|INFO|wakeup due to 0-ms timeout 
>> at lib/stream-ssl.c:838 (74% CPU usage)
>> 2023-03-30T07:57:29.836Z|58502|poll_loop|INFO|wakeup due to 0-ms timeout 
>> at lib/stream-ssl.c:838 (74% CPU usage)
>> ….
>> 2023-03-30T07:57:05.976Z|58545|poll_loop|INFO|wakeup due to 0-ms timeout 
>> at lib/stream-ssl.c:838 (101% CPU usage)
>> 2023-03-30T07:57:07.002Z|58546|memory|INFO|peak resident set size grew 
>> 54% in last 682257.2 seconds, from 4435696 kB to 6824820 kB
>> 2023-03-30T07:57:07.002Z|58547|memory|INFO|idl-cells-OVN_Southbound:11491359
>>  idl-cells-Open_vSwitch:14416 lflow-cache-entries-cache-expr:530298 
>> lflow-cache-entries-cache-matches:31770 lflow-cache-size-KB:769553 
>> local_datapath_usage-KB:630
>> ofctrl_desired_flow_usage-KB:577052 
>> ofctrl_installed_flow_usage-KB:409538 ofctrl_sb_flow_ref_usage-KB:273707
>> …
>> 2023-03-30T07:57:31.667Z|58552|rconn|WARN|unix:/run/openvswitch/br-int.mgmt:
>>  connection dropped (Broken pipe)
>> 2023-03-30T07:57:31.739Z|58553|binding|INFO|Releasing lport cr-xxx from 
>> this chassis (sb_readonly=0)
>> 2023-03-30T07:57:31.739Z|58554|if_status|WARN|Dropped 1 log messages in 
>> last 1254725 seconds (most recently, 1254725 seconds ago) due to 
>> excessive rate
>> 2023-03-30T07:57:31.

Re: [ovs-dev] [ovn] ha-chassis-group false positive failover

2023-06-05 Thread Han Zhou
On Mon, Jun 5, 2023 at 8:58 AM Vladislav Odintsov  wrote:
>
> Hi Dumitru, Ilya, Han,
>
> do you guys have any comments here?
> I’m gonna start working on this change in near future...
>
> Thank you.
>
> On 17 May 2023, at 18:34, Numan Siddique  wrote:
>
> On Wed, May 17, 2023 at 4:44 AM Vladislav Odintsov 
wrote:
>
>
>
>
> On 16 May 2023, at 23:54, Numan Siddique  wrote:
>
> Hi Vladislav,
>
> Sorry for the late reply.
>
> PSB for few comments.
>
>
>
> Thanks for your reply!
> My answers are inline.
>
>
>
>
> On Tue, May 16, 2023 at 3:42 PM Vladislav Odintsov mailto:odiv...@gmail.com>> wrote:
>
>
> Hi Numan, Dumitru, Ilya,
>
> I guess I’ve managed to deal with some of these problems and have some
conclusions.
> Please see inline.
>
> On 14 Apr 2023, at 16:26, Vladislav Odintsov  wrote:
>
> And as a follow-up, I see sometimes next error message in ovn-controller
log, though there is a configured much higher limit in seconds for
inactivity probe value for the chassis:
>
>
> ovn-controller establishes 3 openflow connections to local ovs-vswitchd
for next purposes:
> 1. openflow rules management (openflow probe interval is configured
through ovn-openflow-probe-interval external_ids setting [1]; default —
disabled [2]);
> 2. openflow features collection (hardcoded probe interval 5s [3]);
> 3. pinctrl thread (hardcoded probe interval 5s [4]).
>
> ovs|04426|rconn|ERR|unix:/run/openvswitch/br-int.mgmt: no response to
inactivity probe after 8 seconds, disconnecting
>
> ovs|26560|rconn|ERR|unix:/run/openvswitch/br-int.mgmt: no response to
inactivity probe after 10 seconds, disconnecting
>
>
> These messages most likely related to the connection for features
collection or pinctrl thread. So ovn-openflow-probe-interval has no matter
here.
>
>
> server with first log message:
> # ovs-vsctl get open . external_ids:ovn-openflow-probe-interval
> "30"
>
> server with second log message:
> # ovs-vsctl get open . external_ids:ovn-openflow-probe-interval
> "60"
>
>
>
> On 30 Mar 2023, at 23:19, Vladislav Odintsov  wrote:
>
> Hi all,
>
> I’m facing a false positive failover in the event of openflow connection
got dropped after inactivity probe timeout due to high load of
ovn-controller handling huge amount of ovsdb updates.
> I wonder wether its a bug or I have to tune any specifics. See
ovn-controller logs:
>
> 2023-03-30T07:57:24.469Z|58440|inc_proc_eng|INFO|node:
logical_flow_output, recompute (failed handler for input port_groups) took
35774ms
> 2023-03-30T07:57:28.521Z|58441|lflow_cache|INFO|Detected cache inactivity
(last active 40006 ms ago): trimming cache
> 2023-03-30T07:57:28.528Z|58442|timeval|WARN|Unreasonably long 40084ms
poll interval (32947ms user, 6968ms system)
> 2023-03-30T07:57:28.528Z|58443|timeval|WARN|faults: 43468 minor, 0 major
> …..
> 2023-03-30T07:57:48.784Z|58496|reconnect|ERR|ssl::6642: no response
to inactivity probe after 60.2 seconds, disconnecting
> 2023-03-30T07:57:48.784Z|58497|reconnect|INFO|ssl::6642: connection
dropped
> 2023-03-30T07:57:48.905Z|58498|main|INFO|OVNSB commit failed, force
recompute next time.
> 2023-03-30T07:57:49.786Z|58499|reconnect|INFO|ssl::6642: connecting...
> 2023-03-30T07:57:49.924Z|58500|reconnect|INFO|ssl::6642: connected
> 2023-03-30T07:57:29.781Z|58501|poll_loop|INFO|wakeup due to 0-ms timeout
at lib/stream-ssl.c:838 (74% CPU usage)
> 2023-03-30T07:57:29.836Z|58502|poll_loop|INFO|wakeup due to 0-ms timeout
at lib/stream-ssl.c:838 (74% CPU usage)
> ….
> 2023-03-30T07:57:05.976Z|58545|poll_loop|INFO|wakeup due to 0-ms timeout
at lib/stream-ssl.c:838 (101% CPU usage)
> 2023-03-30T07:57:07.002Z|58546|memory|INFO|peak resident set size grew
54% in last 682257.2 seconds, from 4435696 kB to 6824820 kB
>
2023-03-30T07:57:07.002Z|58547|memory|INFO|idl-cells-OVN_Southbound:11491359
idl-cells-Open_vSwitch:14416 lflow-cache-entries-cache-expr:530298
lflow-cache-entries-cache-matches:31770 lflow-cache-size-KB:769553
local_datapath_usage-KB:630
> ofctrl_desired_flow_usage-KB:577052 ofctrl_installed_flow_usage-KB:409538
ofctrl_sb_flow_ref_usage-KB:273707
> …
>
2023-03-30T07:57:31.667Z|58552|rconn|WARN|unix:/run/openvswitch/br-int.mgmt:
connection dropped (Broken pipe)
> 2023-03-30T07:57:31.739Z|58553|binding|INFO|Releasing lport cr-xxx from
this chassis (sb_readonly=0)
> 2023-03-30T07:57:31.739Z|58554|if_status|WARN|Dropped 1 log messages in
last 1254725 seconds (most recently, 1254725 seconds ago) due to excessive
rate
> 2023-03-30T07:57:31.739Z|58555|if_status|WARN|Trying to release unknown
interface cr-xxx
> 2023-03-30T07:57:31.743Z|58556|binding|INFO|Releasing lport cr-yyy from
this chassis (sb_readonly=0)
> 2023-03-30T07:57:31.743Z|58557|binding|INFO|Releasing lport cr-zzz from
this chassis (sb_readonly=0)
> …
>
> After some time these ports are claimed back:
>
>
2023-03-30T07:57:49.320Z|59137|rconn|INFO|unix:/run/openvswitch/br-int.mgmt:
connecting...
>
2023-03-30T07:57:49.320Z|59138|rconn|INFO|unix:/run/openvswitch/br-int.mgmt:
connected
> 2023-03-30T07:57

Re: [ovs-dev] [ovn] ha-chassis-group false positive failover

2023-06-06 Thread Vladislav Odintsov
Hi Han,

thanks for your inputs!
Please see comments inline.

> On 5 Jun 2023, at 21:01, Han Zhou  wrote:
> 
> On Mon, Jun 5, 2023 at 8:58 AM Vladislav Odintsov  > wrote:
>> 
>> Hi Dumitru, Ilya, Han,
>> 
>> do you guys have any comments here?
>> I’m gonna start working on this change in near future...
>> 
>> Thank you.
>> 
>> On 17 May 2023, at 18:34, Numan Siddique > > wrote:
>> 
>> On Wed, May 17, 2023 at 4:44 AM Vladislav Odintsov > >
> wrote:
>> 
>> 
>> 
>> 
>> On 16 May 2023, at 23:54, Numan Siddique > > wrote:
>> 
>> Hi Vladislav,
>> 
>> Sorry for the late reply.
>> 
>> PSB for few comments.
>> 
>> 
>> 
>> Thanks for your reply!
>> My answers are inline.
>> 
>> 
>> 
>> 
>> On Tue, May 16, 2023 at 3:42 PM Vladislav Odintsov > 
> > wrote:
>> 
>> 
>> Hi Numan, Dumitru, Ilya,
>> 
>> I guess I’ve managed to deal with some of these problems and have some
> conclusions.
>> Please see inline.
>> 
>> On 14 Apr 2023, at 16:26, Vladislav Odintsov > > wrote:
>> 
>> And as a follow-up, I see sometimes next error message in ovn-controller
> log, though there is a configured much higher limit in seconds for
> inactivity probe value for the chassis:
>> 
>> 
>> ovn-controller establishes 3 openflow connections to local ovs-vswitchd
> for next purposes:
>> 1. openflow rules management (openflow probe interval is configured
> through ovn-openflow-probe-interval external_ids setting [1]; default —
> disabled [2]);
>> 2. openflow features collection (hardcoded probe interval 5s [3]);
>> 3. pinctrl thread (hardcoded probe interval 5s [4]).
>> 
>> ovs|04426|rconn|ERR|unix:/run/openvswitch/br-int.mgmt: no response to
> inactivity probe after 8 seconds, disconnecting
>> 
>> ovs|26560|rconn|ERR|unix:/run/openvswitch/br-int.mgmt: no response to
> inactivity probe after 10 seconds, disconnecting
>> 
>> 
>> These messages most likely related to the connection for features
> collection or pinctrl thread. So ovn-openflow-probe-interval has no matter
> here.
>> 
>> 
>> server with first log message:
>> # ovs-vsctl get open . external_ids:ovn-openflow-probe-interval
>> "30"
>> 
>> server with second log message:
>> # ovs-vsctl get open . external_ids:ovn-openflow-probe-interval
>> "60"
>> 
>> 
>> 
>> On 30 Mar 2023, at 23:19, Vladislav Odintsov > > wrote:
>> 
>> Hi all,
>> 
>> I’m facing a false positive failover in the event of openflow connection
> got dropped after inactivity probe timeout due to high load of
> ovn-controller handling huge amount of ovsdb updates.
>> I wonder wether its a bug or I have to tune any specifics. See
> ovn-controller logs:
>> 
>> 2023-03-30T07:57:24.469Z|58440|inc_proc_eng|INFO|node:
> logical_flow_output, recompute (failed handler for input port_groups) took
> 35774ms
>> 2023-03-30T07:57:28.521Z|58441|lflow_cache|INFO|Detected cache inactivity
> (last active 40006 ms ago): trimming cache
>> 2023-03-30T07:57:28.528Z|58442|timeval|WARN|Unreasonably long 40084ms
> poll interval (32947ms user, 6968ms system)
>> 2023-03-30T07:57:28.528Z|58443|timeval|WARN|faults: 43468 minor, 0 major
>> …..
>> 2023-03-30T07:57:48.784Z|58496|reconnect|ERR|ssl::6642: no response
> to inactivity probe after 60.2 seconds, disconnecting
>> 2023-03-30T07:57:48.784Z|58497|reconnect|INFO|ssl::6642: connection
> dropped
>> 2023-03-30T07:57:48.905Z|58498|main|INFO|OVNSB commit failed, force
> recompute next time.
>> 2023-03-30T07:57:49.786Z|58499|reconnect|INFO|ssl::6642: connecting...
>> 2023-03-30T07:57:49.924Z|58500|reconnect|INFO|ssl::6642: connected
>> 2023-03-30T07:57:29.781Z|58501|poll_loop|INFO|wakeup due to 0-ms timeout
> at lib/stream-ssl.c:838 (74% CPU usage)
>> 2023-03-30T07:57:29.836Z|58502|poll_loop|INFO|wakeup due to 0-ms timeout
> at lib/stream-ssl.c:838 (74% CPU usage)
>> ….
>> 2023-03-30T07:57:05.976Z|58545|poll_loop|INFO|wakeup due to 0-ms timeout
> at lib/stream-ssl.c:838 (101% CPU usage)
>> 2023-03-30T07:57:07.002Z|58546|memory|INFO|peak resident set size grew
> 54% in last 682257.2 seconds, from 4435696 kB to 6824820 kB
>> 
> 2023-03-30T07:57:07.002Z|58547|memory|INFO|idl-cells-OVN_Southbound:11491359
> idl-cells-Open_vSwitch:14416 lflow-cache-entries-cache-expr:530298
> lflow-cache-entries-cache-matches:31770 lflow-cache-size-KB:769553
> local_datapath_usage-KB:630
>> ofctrl_desired_flow_usage-KB:577052 ofctrl_installed_flow_usage-KB:409538
> ofctrl_sb_flow_ref_usage-KB:273707
>> …
>> 
> 2023-03-30T07:57:31.667Z|58552|rconn|WARN|unix:/run/openvswitch/br-int.mgmt:
> connection dropped (Broken pipe)
>> 2023-03-30T07:57:31.739Z|58553|binding|INFO|Releasing lport cr-xxx from
> this chassis (sb_readonly=0)
>> 2023-03-30T07:57:31.739Z|58554|if_status|WARN|Dropped 1 log messages in
> last 1254725 seconds (most recently, 1254725 seconds ago) due to excessive
> rate
>> 2023-03-30T07:57:31.739Z|58555|if_

Re: [ovs-dev] [ovn] ha-chassis-group false positive failover

2023-06-08 Thread Vladislav Odintsov
Hi,

I’ve sent the first part (OF probing disable) - 
https://patchwork.ozlabs.org/project/ovn/patch/20230608141555.1339905-1-odiv...@gmail.com/

Regarding configuration on the OVS side, I see it’s not working like I wrote 
before; I’d deceived myself :)
I’ve written some code in ovn-controller which creates a controller record for 
the integration bridge with an inactivity probe. It successfully creates the 
record, but it hits the OVS controller path restrictions:

2023-06-07T21:08:01.552Z|00338|bridge|ERR|bridge br-int: Not adding Unix domain 
socket controller "punix:/var/run/ovn/br-int.mgmt" due to possibility of 
overwriting local files. Instead, specify path in permitted format 
"punix:/var/run/openvswitch/br-int.*" or connect to 
"unix:/var/run/openvswitch/br-int.mgmt" (which is always available without 
special configuration).

And in my manual tests I created the controller in a directory under /var/run, 
which on CentOS served as a workaround to add a socket on the same path while 
bypassing the OVS checks (/var/run/openvswitch/… vs /run/openvswitch/…), 
because of the symlink /var/run -> /run.

So, if you guys have any suggestion on how to disable OF probes on the standard 
OVS controller socket (br-int.mgmt), please advise.
Alternatively, I’ve been thinking about creating another socket on a different 
path for ovs-vswitchd<->ovn-controller OF communication, but I’m concerned 
about path consistency across different distributions…

Thoughts?
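
For context, a quick way to inspect the path layout described above (the paths 
below are the usual defaults and may differ between distributions):

# /var/run is typically a symlink to /run, which is why writing the socket
# under /var/run appeared to sidestep the permitted-path check:
ls -ld /var/run

# The always-available management socket that ovs-vswitchd creates for br-int:
ls -l /run/openvswitch/br-int.mgmt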

> On 6 Jun 2023, at 13:28, Vladislav Odintsov  wrote:
> 
> Hi Han,
> 
> thanks for your inputs!
> Please see comments inline.
> 
>> On 5 Jun 2023, at 21:01, Han Zhou  wrote:
>> 
>> On Mon, Jun 5, 2023 at 8:58 AM Vladislav Odintsov > > wrote:
>>> 
>>> Hi Dumitru, Ilya, Han,
>>> 
>>> do you guys have any comments here?
>>> I’m gonna start working on this change in near future...
>>> 
>>> Thank you.
>>> 
>>> On 17 May 2023, at 18:34, Numan Siddique >> > wrote:
>>> 
>>> On Wed, May 17, 2023 at 4:44 AM Vladislav Odintsov >> >
>> wrote:
>>> 
>>> 
>>> 
>>> 
>>> On 16 May 2023, at 23:54, Numan Siddique >> > wrote:
>>> 
>>> Hi Vladislav,
>>> 
>>> Sorry for the late reply.
>>> 
>>> PSB for few comments.
>>> 
>>> 
>>> 
>>> Thanks for your reply!
>>> My answers are inline.
>>> 
>>> 
>>> 
>>> 
>>> On Tue, May 16, 2023 at 3:42 PM Vladislav Odintsov >> 
>> > wrote:
>>> 
>>> 
>>> Hi Numan, Dumitru, Ilya,
>>> 
>>> I guess I’ve managed to deal with some of these problems and have some
>> conclusions.
>>> Please see inline.
>>> 
>>> On 14 Apr 2023, at 16:26, Vladislav Odintsov >> > wrote:
>>> 
>>> And as a follow-up, I see sometimes next error message in ovn-controller
>> log, though there is a configured much higher limit in seconds for
>> inactivity probe value for the chassis:
>>> 
>>> 
>>> ovn-controller establishes 3 openflow connections to local ovs-vswitchd
>> for next purposes:
>>> 1. openflow rules management (openflow probe interval is configured
>> through ovn-openflow-probe-interval external_ids setting [1]; default —
>> disabled [2]);
>>> 2. openflow features collection (hardcoded probe interval 5s [3]);
>>> 3. pinctrl thread (hardcoded probe interval 5s [4]).
>>> 
>>> ovs|04426|rconn|ERR|unix:/run/openvswitch/br-int.mgmt: no response to
>> inactivity probe after 8 seconds, disconnecting
>>> 
>>> ovs|26560|rconn|ERR|unix:/run/openvswitch/br-int.mgmt: no response to
>> inactivity probe after 10 seconds, disconnecting
>>> 
>>> 
>>> These messages most likely related to the connection for features
>> collection or pinctrl thread. So ovn-openflow-probe-interval has no matter
>> here.
>>> 
>>> 
>>> server with first log message:
>>> # ovs-vsctl get open . external_ids:ovn-openflow-probe-interval
>>> "30"
>>> 
>>> server with second log message:
>>> # ovs-vsctl get open . external_ids:ovn-openflow-probe-interval
>>> "60"
>>> 
>>> 
>>> 
>>> On 30 Mar 2023, at 23:19, Vladislav Odintsov >> > wrote:
>>> 
>>> Hi all,
>>> 
>>> I’m facing a false positive failover in the event of openflow connection
>> got dropped after inactivity probe timeout due to high load of
>> ovn-controller handling huge amount of ovsdb updates.
>>> I wonder wether its a bug or I have to tune any specifics. See
>> ovn-controller logs:
>>> 
>>> 2023-03-30T07:57:24.469Z|58440|inc_proc_eng|INFO|node:
>> logical_flow_output, recompute (failed handler for input port_groups) took
>> 35774ms
>>> 2023-03-30T07:57:28.521Z|58441|lflow_cache|INFO|Detected cache inactivity
>> (last active 40006 ms ago): trimming cache
>>> 2023-03-30T07:57:28.528Z|58442|timeval|WARN|Unreasonably long 40084ms
>> poll interval (32947ms user, 6968ms system)
>>> 2023-03-30T07:57:28.528Z|58443|timeval|WARN|faults: 43468 minor, 0 major
>>> …..
>>> 2023-03-30T07:57:48.784Z|58496|reconnect|ERR|ssl::6642: no response
>> to inactivity probe after 60.2 seconds, disconnec

Re: [ovs-dev] [ovn] ha-chassis-group false positive failover

2023-09-22 Thread Han Zhou
Hi Vladislav,

(Sorry, I didn't notice your update until now, while reading the discussion of 
your OVN patch.)

I think a simple solution is to update the hardcoded 60s probe in OVS to 0,
for the pre-created punix mgmt controller.
We could make it configurable, but it may not make much sense since it is a
unix socket anyway.
We have been using this 0 value downstream for years (cc Winson).
@Ilya Maximets  do you think it is reasonable?
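
For reference, a quick check (assuming a standard OVN chassis where no explicit 
controller is configured on br-int) of why that probe value is hardcoded rather 
than configurable through the database:

# The pre-created punix:.../br-int.mgmt connection is implicit: it is not
# backed by a Controller record, so there is no inactivity_probe column to set.
ovs-vsctl get-controller br-int   # usually prints nothing on an OVN chassis
ovs-vsctl list Controller         # no row corresponds to the implicit mgmt socket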

Thanks,
Han

On Thu, Jun 8, 2023 at 7:38 AM Vladislav Odintsov  wrote:

> Hi,
>
> I’ve sent the first part (OF probing disable) -
> https://patchwork.ozlabs.org/project/ovn/patch/20230608141555.1339905-1-odiv...@gmail.com/
>
> Regarding configuration on the OVS side, I see it’s not working like I
> wrote before, I’ve deceived myself :)
> I’ve written some code in ovn-controller, which creates controller for
> integration bridge with inactivity probe, it successfully creates record,
> but it hits OVS controller paths restrictions:
>
> 2023-06-07T21:08:01.552Z|00338|bridge|ERR|bridge br-int: Not adding Unix
> domain socket controller "punix:/var/run/ovn/br-int.mgmt" due to
> possibility of overwriting local files. Instead, specify path in permitted
> format "punix:/var/run/openvswitch/br-int.*" or connect to
> "unix:/var/run/openvswitch/br-int.mgmt" (which is always available without
> special configuration).
>
> And in my manual tests I created controller in a directory /var/run, which
> in centos made a workaround how to add socket on the same path, but allowed
> to bypass ovs checks (/var/run/openvswitch/… vs /run/openvswitch/…),
> because of symlink /var/run/ -> /run.
>
> So, if you guys have any suggestion how to disable OF probes on a standard
> OVS controller socket (br-int.mgmt), please give an advice.
> Alternatively I’ve been thinking about creation another socket with
> another path for ovs-vswitchd<->ovn-controller OF communication.
> But I’m concerned about paths consistency across different distributions…
>
> Thoughts?
>
> On 6 Jun 2023, at 13:28, Vladislav Odintsov  wrote:
>
> Hi Han,
>
> thanks for your inputs!
> Please see comments inline.
>
> On 5 Jun 2023, at 21:01, Han Zhou  wrote:
>
> On Mon, Jun 5, 2023 at 8:58 AM Vladislav Odintsov  > wrote:
>
>
> Hi Dumitru, Ilya, Han,
>
> do you guys have any comments here?
> I’m gonna start working on this change in near future...
>
> Thank you.
>
> On 17 May 2023, at 18:34, Numan Siddique  num...@ovn.org>> wrote:
>
> On Wed, May 17, 2023 at 4:44 AM Vladislav Odintsov  >
>
> wrote:
>
>
>
>
>
> On 16 May 2023, at 23:54, Numan Siddique  num...@ovn.org>> wrote:
>
> Hi Vladislav,
>
> Sorry for the late reply.
>
> PSB for few comments.
>
>
>
> Thanks for your reply!
> My answers are inline.
>
>
>
>
> On Tue, May 16, 2023 at 3:42 PM Vladislav Odintsov  
>
> > wrote:
>
>
>
> Hi Numan, Dumitru, Ilya,
>
> I guess I’ve managed to deal with some of these problems and have some
>
> conclusions.
>
> Please see inline.
>
> On 14 Apr 2023, at 16:26, Vladislav Odintsov  odiv...@gmail.com>> wrote:
>
> And as a follow-up, I see sometimes next error message in ovn-controller
>
> log, though there is a configured much higher limit in seconds for
> inactivity probe value for the chassis:
>
>
>
> ovn-controller establishes 3 openflow connections to local ovs-vswitchd
>
> for next purposes:
>
> 1. openflow rules management (openflow probe interval is configured
>
> through ovn-openflow-probe-interval external_ids setting [1]; default —
> disabled [2]);
>
> 2. openflow features collection (hardcoded probe interval 5s [3]);
> 3. pinctrl thread (hardcoded probe interval 5s [4]).
>
> ovs|04426|rconn|ERR|unix:/run/openvswitch/br-int.mgmt: no response to
>
> inactivity probe after 8 seconds, disconnecting
>
>
> ovs|26560|rconn|ERR|unix:/run/openvswitch/br-int.mgmt: no response to
>
> inactivity probe after 10 seconds, disconnecting
>
>
>
> These messages most likely related to the connection for features
>
> collection or pinctrl thread. So ovn-openflow-probe-interval has no matter
> here.
>
>
>
> server with first log message:
> # ovs-vsctl get open . external_ids:ovn-openflow-probe-interval
> "30"
>
> server with second log message:
> # ovs-vsctl get open . external_ids:ovn-openflow-probe-interval
> "60"
>
>
>
> On 30 Mar 2023, at 23:19, Vladislav Odintsov  odiv...@gmail.com>> wrote:
>
> Hi all,
>
> I’m facing a false positive failover in the event of openflow connection
>
> got dropped after inactivity probe timeout due to high load of
> ovn-controller handling huge amount of ovsdb updates.
>
> I wonder wether its a bug or I have to tune any specifics. See
>
> ovn-controller logs:
>
>
> 2023-03-30T07:57:24.469Z|58440|inc_proc_eng|INFO|node:
>
> logical_flow_output, recompute (failed handler for input port_groups) took
> 35774ms
>
> 2023-03-30T07:57:28.521Z|58441|lflow_cache|INFO|Detected cache inactivity
>
> (last active 40006 ms ago): trimming

Re: [ovs-dev] [ovn] ha-chassis-group false positive failover

2023-04-14 Thread Vladislav Odintsov
And as a follow-up, I sometimes see the following error messages in the 
ovn-controller log, even though a much higher inactivity probe limit (in 
seconds) is configured for the chassis:

ovs|04426|rconn|ERR|unix:/run/openvswitch/br-int.mgmt: no response to 
inactivity probe after 8 seconds, disconnecting

ovs|26560|rconn|ERR|unix:/run/openvswitch/br-int.mgmt: no response to 
inactivity probe after 10 seconds, disconnecting


server with first log message:
# ovs-vsctl get open . external_ids:ovn-openflow-probe-interval
"30"

server with second log message:
# ovs-vsctl get open . external_ids:ovn-openflow-probe-interval
"60"



> On 30 Mar 2023, at 23:19, Vladislav Odintsov  wrote:
> 
> Hi all,
> 
> I’m facing a false positive failover in the event of openflow connection got 
> dropped after inactivity probe timeout due to high load of ovn-controller 
> handling huge amount of ovsdb updates.
> I wonder wether its a bug or I have to tune any specifics. See ovn-controller 
> logs:
> 
> 2023-03-30T07:57:24.469Z|58440|inc_proc_eng|INFO|node: logical_flow_output, 
> recompute (failed handler for input port_groups) took 35774ms
> 2023-03-30T07:57:28.521Z|58441|lflow_cache|INFO|Detected cache inactivity 
> (last active 40006 ms ago): trimming cache
> 2023-03-30T07:57:28.528Z|58442|timeval|WARN|Unreasonably long 40084ms poll 
> interval (32947ms user, 6968ms system)
> 2023-03-30T07:57:28.528Z|58443|timeval|WARN|faults: 43468 minor, 0 major
> …..
> 2023-03-30T07:57:48.784Z|58496|reconnect|ERR|ssl::6642: no response to 
> inactivity probe after 60.2 seconds, disconnecting
> 2023-03-30T07:57:48.784Z|58497|reconnect|INFO|ssl::6642: connection 
> dropped
> 2023-03-30T07:57:48.905Z|58498|main|INFO|OVNSB commit failed, force recompute 
> next time.
> 2023-03-30T07:57:49.786Z|58499|reconnect|INFO|ssl::6642: connecting...
> 2023-03-30T07:57:49.924Z|58500|reconnect|INFO|ssl::6642: connected
> 2023-03-30T07:57:29.781Z|58501|poll_loop|INFO|wakeup due to 0-ms timeout at 
> lib/stream-ssl.c:838 (74% CPU usage)
> 2023-03-30T07:57:29.836Z|58502|poll_loop|INFO|wakeup due to 0-ms timeout at 
> lib/stream-ssl.c:838 (74% CPU usage)
> ….
> 2023-03-30T07:57:05.976Z|58545|poll_loop|INFO|wakeup due to 0-ms timeout at 
> lib/stream-ssl.c:838 (101% CPU usage)
> 2023-03-30T07:57:07.002Z|58546|memory|INFO|peak resident set size grew 54% in 
> last 682257.2 seconds, from 4435696 kB to 6824820 kB
> 2023-03-30T07:57:07.002Z|58547|memory|INFO|idl-cells-OVN_Southbound:11491359 
> idl-cells-Open_vSwitch:14416 lflow-cache-entries-cache-expr:530298 
> lflow-cache-entries-cache-matches:31770 lflow-cache-size-KB:769553 
> local_datapath_usage-KB:630
>  ofctrl_desired_flow_usage-KB:577052 ofctrl_installed_flow_usage-KB:409538 
> ofctrl_sb_flow_ref_usage-KB:273707
> …
> 2023-03-30T07:57:31.667Z|58552|rconn|WARN|unix:/run/openvswitch/br-int.mgmt: 
> connection dropped (Broken pipe)
> 2023-03-30T07:57:31.739Z|58553|binding|INFO|Releasing lport cr-xxx from this 
> chassis (sb_readonly=0)
> 2023-03-30T07:57:31.739Z|58554|if_status|WARN|Dropped 1 log messages in last 
> 1254725 seconds (most recently, 1254725 seconds ago) due to excessive rate
> 2023-03-30T07:57:31.739Z|58555|if_status|WARN|Trying to release unknown 
> interface cr-xxx
> 2023-03-30T07:57:31.743Z|58556|binding|INFO|Releasing lport cr-yyy from this 
> chassis (sb_readonly=0)
> 2023-03-30T07:57:31.743Z|58557|binding|INFO|Releasing lport cr-zzz from this 
> chassis (sb_readonly=0)
> …
> 
> After some time these ports are claimed back:
> 
> 2023-03-30T07:57:49.320Z|59137|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: 
> connecting...
> 2023-03-30T07:57:49.320Z|59138|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: 
> connected
> 2023-03-30T07:57:49.523Z|59139|poll_loop|INFO|wakeup due to 0-ms timeout at 
> controller/ovn-controller.c:4296 (99% CPU usage)
> 2023-03-30T07:57:49.574Z|59140|poll_loop|INFO|wakeup due to 0-ms timeout at 
> controller/ovn-controller.c:4286 (99% CPU usage)
> 2023-03-30T07:58:01.451Z|59141|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 
> (:34736<->:6642) at lib/stream-ssl.c:836 (99% CPU usage)
> 2023-03-30T07:58:02.449Z|59142|binding|INFO|Claiming lport cr-xxx for this 
> chassis.
> 2023-03-30T07:58:02.449Z|59143|binding|INFO|cr-xxx: Claiming xx:xx:xx:xx:xx 
> x.x.x.x/xx
> 2023-03-30T07:58:02.453Z|59144|binding|INFO|Claiming lport cr-yyy for this 
> chassis.
> 2023-03-30T07:58:02.453Z|59145|binding|INFO|cr-yyy: Claiming 
> yy:yy:yy:yy:yy:yy y.y.y.y/yy
> …
> 
> Previously I’ve met with similar problem but the openflow probe interval was 
> 30 seconds and I set it to 60 and then to 120 seconds.
> But it seems that OVS was configured on default inactivity probe timeout for 
> openflow - 60 seconds. So, the error above occured. In OVS logs it looks like 
> next:
> 
> 2023-03-30T07:57:28.632Z|61003|rconn|ERR|br-int<->unix#8: no response to 
> inactivity probe after 60 seconds, disconnecting
> 
> This node is a dedicated l3gateway which has is-interconn=true sett