Re: [ovs-dev] [PATCH v3] dpif-netdev: Forwarding optimization for flows with a simple match.

2022-01-07 Thread Ilya Maximets
On 12/22/21 06:38, Sriharsha Basavapatna wrote:
> On Tue, Nov 2, 2021 at 1:48 AM Ilya Maximets  wrote:
>>
>> On 10/30/21 09:04, Sriharsha Basavapatna wrote:
>>> On Mon, Aug 9, 2021 at 6:28 PM Ilya Maximets  wrote:

 There are cases where users might want simple forwarding or drop rules
 for all packets received from a specific port, e.g ::

   "in_port=1,actions=2"
   "in_port=2,actions=IN_PORT"
   "in_port=3,vlan_tci=0x1234/0x1fff,actions=drop"
   "in_port=4,actions=push_vlan:0x8100,set_field:4196->vlan_vid,output:3"

 There are also cases where complex OpenFlow rules can be simplified
 down to datapath flows with very simple match criteria.

 In theory, for very simple forwarding, OVS doesn't need to parse
 packets at all in order to follow these rules.  "Simple match" lookup
 optimization is intended to speed up packet forwarding in these cases.

 Design:

 Due to various implementation constraints userspace datapath has
 following flow fields always in exact match (i.e. it's required to
 match at least these fields of a packet even if the OF rule doesn't
 need that):

   - recirc_id
   - in_port
   - packet_type
   - dl_type
   - vlan_tci (CFI + VID) - in most cases
   - nw_frag - for ip packets

 Not all of these fields are related to packet itself.  We already
 know the current 'recirc_id' and the 'in_port' before starting the
 packet processing.  It also seems safe to assume that we're working
 with Ethernet packets.  So, for the simple OF rule we need to match
 only on 'dl_type', 'vlan_tci' and 'nw_frag'.

 'in_port', 'dl_type', 'nw_frag' and 13 bits of 'vlan_tci' can be
 combined in a single 64bit integer (mark) that can be used as a
 hash in hash map.  We are using only VID and CFI form the 'vlan_tci',
 flows that need to match on PCP will not qualify for the optimization.
 Workaround for matching on non-existence of vlan updated to match on
 CFI and VID only in order to qualify for the optimization.  CFI is
 always set by OVS if vlan is present in a packet, so there is no need
 to match on PCP in this case.  'nw_frag' takes 2 bits of PCP inside
 the simple match mark.

 New per-PMD flow table 'simple_match_table' introduced to store
 simple match flows only.  'dp_netdev_flow_add' adds flow to the
 usual 'flow_table' and to the 'simple_match_table' if the flow
 meets following constraints:

   - 'recirc_id' in flow match is 0.
   - 'packet_type' in flow match is Ethernet.
   - Flow wildcards contains only minimal set of non-wildcarded fields
 (listed above).

 If the number of flows for current 'in_port' in a regular 'flow_table'
 equals number of flows for current 'in_port' in a 'simple_match_table',
 we may use simple match optimization, because all the flows we have
 are simple match flows.  This means that we only need to parse
 'dl_type', 'vlan_tci' and 'nw_frag' to perform packet matching.
 Now we make the unique flow mark from the 'in_port', 'dl_type',
 'nw_frag' and 'vlan_tci' and looking for it in the 'simple_match_table'.
 On successful lookup we don't need to run full 'miniflow_extract()'.

 Unsuccessful lookup technically means that we have no suitable flow
 in the datapath and upcall will be required.  So, in this case EMC and
 SMC lookups are disabled.  We may optimize this path in the future by
 bypassing the dpcls lookup too.

 Performance improvement of this solution on a 'simple match' flows
 should be comparable with partial HW offloading, because it parses same
 packet fields and uses similar flow lookup scheme.
 However, unlike partial HW offloading, it works for all port types
 including virtual ones.

 Performance results when compared to EMC:

 Test setup:

  virtio-user   OVSvirtio-user
   Testpmd1  >  pmd1  >  Testpmd2
   (txonly)   x<--  pmd2  < (mac swap)

 Single stream of 64byte packets.  Actions:
   in_port=vhost0,actions=vhost1
   in_port=vhost1,actions=vhost0

 Stats collected from pmd1 and pmd2, so there are 2 scenarios:
 Virt-to-Virt   : Testpmd1 --> pmd1 --> Testpmd2.
 Virt-to-NoCopy : Testpmd2 --> pmd2 --->x   Testpmd1.
 Here the packet sent from pmd2 to Testpmd1 is always dropped, because
 the virtqueue is full since Testpmd1 is in txonly mode and doesn't
 receive any packets.  This should be closer to the performance of a
 VM-to-Phy scenario.

 Test performed on machine with Intel Xeon CPU E5-2690 v4 @ 2.60GHz.
 Table below represents improvement in throughput when compared to EMC.

  ++++
  ||Def

Re: [ovs-dev] [PATCH v3] dpif-netdev: Forwarding optimization for flows with a simple match.

2021-12-21 Thread Sriharsha Basavapatna via dev
On Tue, Nov 2, 2021 at 1:48 AM Ilya Maximets  wrote:
>
> On 10/30/21 09:04, Sriharsha Basavapatna wrote:
> > On Mon, Aug 9, 2021 at 6:28 PM Ilya Maximets  wrote:
> >>
> >> There are cases where users might want simple forwarding or drop rules
> >> for all packets received from a specific port, e.g ::
> >>
> >>   "in_port=1,actions=2"
> >>   "in_port=2,actions=IN_PORT"
> >>   "in_port=3,vlan_tci=0x1234/0x1fff,actions=drop"
> >>   "in_port=4,actions=push_vlan:0x8100,set_field:4196->vlan_vid,output:3"
> >>
> >> There are also cases where complex OpenFlow rules can be simplified
> >> down to datapath flows with very simple match criteria.
> >>
> >> In theory, for very simple forwarding, OVS doesn't need to parse
> >> packets at all in order to follow these rules.  "Simple match" lookup
> >> optimization is intended to speed up packet forwarding in these cases.
> >>
> >> Design:
> >>
> >> Due to various implementation constraints userspace datapath has
> >> following flow fields always in exact match (i.e. it's required to
> >> match at least these fields of a packet even if the OF rule doesn't
> >> need that):
> >>
> >>   - recirc_id
> >>   - in_port
> >>   - packet_type
> >>   - dl_type
> >>   - vlan_tci (CFI + VID) - in most cases
> >>   - nw_frag - for ip packets
> >>
> >> Not all of these fields are related to packet itself.  We already
> >> know the current 'recirc_id' and the 'in_port' before starting the
> >> packet processing.  It also seems safe to assume that we're working
> >> with Ethernet packets.  So, for the simple OF rule we need to match
> >> only on 'dl_type', 'vlan_tci' and 'nw_frag'.
> >>
> >> 'in_port', 'dl_type', 'nw_frag' and 13 bits of 'vlan_tci' can be
> >> combined in a single 64bit integer (mark) that can be used as a
> >> hash in hash map.  We are using only VID and CFI form the 'vlan_tci',
> >> flows that need to match on PCP will not qualify for the optimization.
> >> Workaround for matching on non-existence of vlan updated to match on
> >> CFI and VID only in order to qualify for the optimization.  CFI is
> >> always set by OVS if vlan is present in a packet, so there is no need
> >> to match on PCP in this case.  'nw_frag' takes 2 bits of PCP inside
> >> the simple match mark.
> >>
> >> New per-PMD flow table 'simple_match_table' introduced to store
> >> simple match flows only.  'dp_netdev_flow_add' adds flow to the
> >> usual 'flow_table' and to the 'simple_match_table' if the flow
> >> meets following constraints:
> >>
> >>   - 'recirc_id' in flow match is 0.
> >>   - 'packet_type' in flow match is Ethernet.
> >>   - Flow wildcards contains only minimal set of non-wildcarded fields
> >> (listed above).
> >>
> >> If the number of flows for current 'in_port' in a regular 'flow_table'
> >> equals number of flows for current 'in_port' in a 'simple_match_table',
> >> we may use simple match optimization, because all the flows we have
> >> are simple match flows.  This means that we only need to parse
> >> 'dl_type', 'vlan_tci' and 'nw_frag' to perform packet matching.
> >> Now we make the unique flow mark from the 'in_port', 'dl_type',
> >> 'nw_frag' and 'vlan_tci' and looking for it in the 'simple_match_table'.
> >> On successful lookup we don't need to run full 'miniflow_extract()'.
> >>
> >> Unsuccessful lookup technically means that we have no suitable flow
> >> in the datapath and upcall will be required.  So, in this case EMC and
> >> SMC lookups are disabled.  We may optimize this path in the future by
> >> bypassing the dpcls lookup too.
> >>
> >> Performance improvement of this solution on a 'simple match' flows
> >> should be comparable with partial HW offloading, because it parses same
> >> packet fields and uses similar flow lookup scheme.
> >> However, unlike partial HW offloading, it works for all port types
> >> including virtual ones.
> >>
> >> Performance results when compared to EMC:
> >>
> >> Test setup:
> >>
> >>  virtio-user   OVSvirtio-user
> >>   Testpmd1  >  pmd1  >  Testpmd2
> >>   (txonly)   x<--  pmd2  < (mac swap)
> >>
> >> Single stream of 64byte packets.  Actions:
> >>   in_port=vhost0,actions=vhost1
> >>   in_port=vhost1,actions=vhost0
> >>
> >> Stats collected from pmd1 and pmd2, so there are 2 scenarios:
> >> Virt-to-Virt   : Testpmd1 --> pmd1 --> Testpmd2.
> >> Virt-to-NoCopy : Testpmd2 --> pmd2 --->x   Testpmd1.
> >> Here the packet sent from pmd2 to Testpmd1 is always dropped, because
> >> the virtqueue is full since Testpmd1 is in txonly mode and doesn't
> >> receive any packets.  This should be closer to the performance of a
> >> VM-to-Phy scenario.
> >>
> >> Test performed on machine with Intel Xeon CPU E5-2690 v4 @ 2.60GHz.
> >> Table below represents improvement in throughput when compared to EMC.
> >>
> >>  ++++
> >>  ||Default (-g -O2)| "-Ofast -march=native" |
> >>  | 

Re: [ovs-dev] [PATCH v3] dpif-netdev: Forwarding optimization for flows with a simple match.

2021-11-01 Thread Ilya Maximets
On 10/30/21 09:04, Sriharsha Basavapatna wrote:
> On Mon, Aug 9, 2021 at 6:28 PM Ilya Maximets  wrote:
>>
>> There are cases where users might want simple forwarding or drop rules
>> for all packets received from a specific port, e.g ::
>>
>>   "in_port=1,actions=2"
>>   "in_port=2,actions=IN_PORT"
>>   "in_port=3,vlan_tci=0x1234/0x1fff,actions=drop"
>>   "in_port=4,actions=push_vlan:0x8100,set_field:4196->vlan_vid,output:3"
>>
>> There are also cases where complex OpenFlow rules can be simplified
>> down to datapath flows with very simple match criteria.
>>
>> In theory, for very simple forwarding, OVS doesn't need to parse
>> packets at all in order to follow these rules.  "Simple match" lookup
>> optimization is intended to speed up packet forwarding in these cases.
>>
>> Design:
>>
>> Due to various implementation constraints userspace datapath has
>> following flow fields always in exact match (i.e. it's required to
>> match at least these fields of a packet even if the OF rule doesn't
>> need that):
>>
>>   - recirc_id
>>   - in_port
>>   - packet_type
>>   - dl_type
>>   - vlan_tci (CFI + VID) - in most cases
>>   - nw_frag - for ip packets
>>
>> Not all of these fields are related to packet itself.  We already
>> know the current 'recirc_id' and the 'in_port' before starting the
>> packet processing.  It also seems safe to assume that we're working
>> with Ethernet packets.  So, for the simple OF rule we need to match
>> only on 'dl_type', 'vlan_tci' and 'nw_frag'.
>>
>> 'in_port', 'dl_type', 'nw_frag' and 13 bits of 'vlan_tci' can be
>> combined in a single 64bit integer (mark) that can be used as a
>> hash in hash map.  We are using only VID and CFI form the 'vlan_tci',
>> flows that need to match on PCP will not qualify for the optimization.
>> Workaround for matching on non-existence of vlan updated to match on
>> CFI and VID only in order to qualify for the optimization.  CFI is
>> always set by OVS if vlan is present in a packet, so there is no need
>> to match on PCP in this case.  'nw_frag' takes 2 bits of PCP inside
>> the simple match mark.
>>
>> New per-PMD flow table 'simple_match_table' introduced to store
>> simple match flows only.  'dp_netdev_flow_add' adds flow to the
>> usual 'flow_table' and to the 'simple_match_table' if the flow
>> meets following constraints:
>>
>>   - 'recirc_id' in flow match is 0.
>>   - 'packet_type' in flow match is Ethernet.
>>   - Flow wildcards contains only minimal set of non-wildcarded fields
>> (listed above).
>>
>> If the number of flows for current 'in_port' in a regular 'flow_table'
>> equals number of flows for current 'in_port' in a 'simple_match_table',
>> we may use simple match optimization, because all the flows we have
>> are simple match flows.  This means that we only need to parse
>> 'dl_type', 'vlan_tci' and 'nw_frag' to perform packet matching.
>> Now we make the unique flow mark from the 'in_port', 'dl_type',
>> 'nw_frag' and 'vlan_tci' and looking for it in the 'simple_match_table'.
>> On successful lookup we don't need to run full 'miniflow_extract()'.
>>
>> Unsuccessful lookup technically means that we have no suitable flow
>> in the datapath and upcall will be required.  So, in this case EMC and
>> SMC lookups are disabled.  We may optimize this path in the future by
>> bypassing the dpcls lookup too.
>>
>> Performance improvement of this solution on a 'simple match' flows
>> should be comparable with partial HW offloading, because it parses same
>> packet fields and uses similar flow lookup scheme.
>> However, unlike partial HW offloading, it works for all port types
>> including virtual ones.
>>
>> Performance results when compared to EMC:
>>
>> Test setup:
>>
>>  virtio-user   OVSvirtio-user
>>   Testpmd1  >  pmd1  >  Testpmd2
>>   (txonly)   x<--  pmd2  < (mac swap)
>>
>> Single stream of 64byte packets.  Actions:
>>   in_port=vhost0,actions=vhost1
>>   in_port=vhost1,actions=vhost0
>>
>> Stats collected from pmd1 and pmd2, so there are 2 scenarios:
>> Virt-to-Virt   : Testpmd1 --> pmd1 --> Testpmd2.
>> Virt-to-NoCopy : Testpmd2 --> pmd2 --->x   Testpmd1.
>> Here the packet sent from pmd2 to Testpmd1 is always dropped, because
>> the virtqueue is full since Testpmd1 is in txonly mode and doesn't
>> receive any packets.  This should be closer to the performance of a
>> VM-to-Phy scenario.
>>
>> Test performed on machine with Intel Xeon CPU E5-2690 v4 @ 2.60GHz.
>> Table below represents improvement in throughput when compared to EMC.
>>
>>  ++++
>>  ||Default (-g -O2)| "-Ofast -march=native" |
>>  |   Scenario ++---++---+
>>  || GCC|   Clang   | GCC|   Clang   |
>>  +++---++---+
>>  | Virt-to-Virt   |+18.9%  |   +25.5%  |   

Re: [ovs-dev] [PATCH v3] dpif-netdev: Forwarding optimization for flows with a simple match.

2021-10-30 Thread Sriharsha Basavapatna via dev
On Mon, Aug 9, 2021 at 6:28 PM Ilya Maximets  wrote:
>
> There are cases where users might want simple forwarding or drop rules
> for all packets received from a specific port, e.g ::
>
>   "in_port=1,actions=2"
>   "in_port=2,actions=IN_PORT"
>   "in_port=3,vlan_tci=0x1234/0x1fff,actions=drop"
>   "in_port=4,actions=push_vlan:0x8100,set_field:4196->vlan_vid,output:3"
>
> There are also cases where complex OpenFlow rules can be simplified
> down to datapath flows with very simple match criteria.
>
> In theory, for very simple forwarding, OVS doesn't need to parse
> packets at all in order to follow these rules.  "Simple match" lookup
> optimization is intended to speed up packet forwarding in these cases.
>
> Design:
>
> Due to various implementation constraints userspace datapath has
> following flow fields always in exact match (i.e. it's required to
> match at least these fields of a packet even if the OF rule doesn't
> need that):
>
>   - recirc_id
>   - in_port
>   - packet_type
>   - dl_type
>   - vlan_tci (CFI + VID) - in most cases
>   - nw_frag - for ip packets
>
> Not all of these fields are related to packet itself.  We already
> know the current 'recirc_id' and the 'in_port' before starting the
> packet processing.  It also seems safe to assume that we're working
> with Ethernet packets.  So, for the simple OF rule we need to match
> only on 'dl_type', 'vlan_tci' and 'nw_frag'.
>
> 'in_port', 'dl_type', 'nw_frag' and 13 bits of 'vlan_tci' can be
> combined in a single 64bit integer (mark) that can be used as a
> hash in hash map.  We are using only VID and CFI form the 'vlan_tci',
> flows that need to match on PCP will not qualify for the optimization.
> Workaround for matching on non-existence of vlan updated to match on
> CFI and VID only in order to qualify for the optimization.  CFI is
> always set by OVS if vlan is present in a packet, so there is no need
> to match on PCP in this case.  'nw_frag' takes 2 bits of PCP inside
> the simple match mark.
>
> New per-PMD flow table 'simple_match_table' introduced to store
> simple match flows only.  'dp_netdev_flow_add' adds flow to the
> usual 'flow_table' and to the 'simple_match_table' if the flow
> meets following constraints:
>
>   - 'recirc_id' in flow match is 0.
>   - 'packet_type' in flow match is Ethernet.
>   - Flow wildcards contains only minimal set of non-wildcarded fields
> (listed above).
>
> If the number of flows for current 'in_port' in a regular 'flow_table'
> equals number of flows for current 'in_port' in a 'simple_match_table',
> we may use simple match optimization, because all the flows we have
> are simple match flows.  This means that we only need to parse
> 'dl_type', 'vlan_tci' and 'nw_frag' to perform packet matching.
> Now we make the unique flow mark from the 'in_port', 'dl_type',
> 'nw_frag' and 'vlan_tci' and looking for it in the 'simple_match_table'.
> On successful lookup we don't need to run full 'miniflow_extract()'.
>
> Unsuccessful lookup technically means that we have no suitable flow
> in the datapath and upcall will be required.  So, in this case EMC and
> SMC lookups are disabled.  We may optimize this path in the future by
> bypassing the dpcls lookup too.
>
> Performance improvement of this solution on a 'simple match' flows
> should be comparable with partial HW offloading, because it parses same
> packet fields and uses similar flow lookup scheme.
> However, unlike partial HW offloading, it works for all port types
> including virtual ones.
>
> Performance results when compared to EMC:
>
> Test setup:
>
>  virtio-user   OVSvirtio-user
>   Testpmd1  >  pmd1  >  Testpmd2
>   (txonly)   x<--  pmd2  < (mac swap)
>
> Single stream of 64byte packets.  Actions:
>   in_port=vhost0,actions=vhost1
>   in_port=vhost1,actions=vhost0
>
> Stats collected from pmd1 and pmd2, so there are 2 scenarios:
> Virt-to-Virt   : Testpmd1 --> pmd1 --> Testpmd2.
> Virt-to-NoCopy : Testpmd2 --> pmd2 --->x   Testpmd1.
> Here the packet sent from pmd2 to Testpmd1 is always dropped, because
> the virtqueue is full since Testpmd1 is in txonly mode and doesn't
> receive any packets.  This should be closer to the performance of a
> VM-to-Phy scenario.
>
> Test performed on machine with Intel Xeon CPU E5-2690 v4 @ 2.60GHz.
> Table below represents improvement in throughput when compared to EMC.
>
>  ++++
>  ||Default (-g -O2)| "-Ofast -march=native" |
>  |   Scenario ++---++---+
>  || GCC|   Clang   | GCC|   Clang   |
>  +++---++---+
>  | Virt-to-Virt   |+18.9%  |   +25.5%  |+10.8%  |   +16.7%  |
>  | Virt-to-NoCopy |+24.3%  |   +33.7%  |+14.9%  |   +22.0%  |
>  +++---++-

[ovs-dev] [PATCH v3] dpif-netdev: Forwarding optimization for flows with a simple match.

2021-08-09 Thread Ilya Maximets
There are cases where users might want simple forwarding or drop rules
for all packets received from a specific port, e.g ::

  "in_port=1,actions=2"
  "in_port=2,actions=IN_PORT"
  "in_port=3,vlan_tci=0x1234/0x1fff,actions=drop"
  "in_port=4,actions=push_vlan:0x8100,set_field:4196->vlan_vid,output:3"

There are also cases where complex OpenFlow rules can be simplified
down to datapath flows with very simple match criteria.

In theory, for very simple forwarding, OVS doesn't need to parse
packets at all in order to follow these rules.  "Simple match" lookup
optimization is intended to speed up packet forwarding in these cases.

Design:

Due to various implementation constraints userspace datapath has
following flow fields always in exact match (i.e. it's required to
match at least these fields of a packet even if the OF rule doesn't
need that):

  - recirc_id
  - in_port
  - packet_type
  - dl_type
  - vlan_tci (CFI + VID) - in most cases
  - nw_frag - for ip packets

Not all of these fields are related to packet itself.  We already
know the current 'recirc_id' and the 'in_port' before starting the
packet processing.  It also seems safe to assume that we're working
with Ethernet packets.  So, for the simple OF rule we need to match
only on 'dl_type', 'vlan_tci' and 'nw_frag'.

'in_port', 'dl_type', 'nw_frag' and 13 bits of 'vlan_tci' can be
combined in a single 64bit integer (mark) that can be used as a
hash in hash map.  We are using only VID and CFI form the 'vlan_tci',
flows that need to match on PCP will not qualify for the optimization.
Workaround for matching on non-existence of vlan updated to match on
CFI and VID only in order to qualify for the optimization.  CFI is
always set by OVS if vlan is present in a packet, so there is no need
to match on PCP in this case.  'nw_frag' takes 2 bits of PCP inside
the simple match mark.

New per-PMD flow table 'simple_match_table' introduced to store
simple match flows only.  'dp_netdev_flow_add' adds flow to the
usual 'flow_table' and to the 'simple_match_table' if the flow
meets following constraints:

  - 'recirc_id' in flow match is 0.
  - 'packet_type' in flow match is Ethernet.
  - Flow wildcards contains only minimal set of non-wildcarded fields
(listed above).

If the number of flows for current 'in_port' in a regular 'flow_table'
equals number of flows for current 'in_port' in a 'simple_match_table',
we may use simple match optimization, because all the flows we have
are simple match flows.  This means that we only need to parse
'dl_type', 'vlan_tci' and 'nw_frag' to perform packet matching.
Now we make the unique flow mark from the 'in_port', 'dl_type',
'nw_frag' and 'vlan_tci' and looking for it in the 'simple_match_table'.
On successful lookup we don't need to run full 'miniflow_extract()'.

Unsuccessful lookup technically means that we have no suitable flow
in the datapath and upcall will be required.  So, in this case EMC and
SMC lookups are disabled.  We may optimize this path in the future by
bypassing the dpcls lookup too.

Performance improvement of this solution on a 'simple match' flows
should be comparable with partial HW offloading, because it parses same
packet fields and uses similar flow lookup scheme.
However, unlike partial HW offloading, it works for all port types
including virtual ones.

Performance results when compared to EMC:

Test setup:

 virtio-user   OVSvirtio-user
  Testpmd1  >  pmd1  >  Testpmd2
  (txonly)   x<--  pmd2  < (mac swap)

Single stream of 64byte packets.  Actions:
  in_port=vhost0,actions=vhost1
  in_port=vhost1,actions=vhost0

Stats collected from pmd1 and pmd2, so there are 2 scenarios:
Virt-to-Virt   : Testpmd1 --> pmd1 --> Testpmd2.
Virt-to-NoCopy : Testpmd2 --> pmd2 --->x   Testpmd1.
Here the packet sent from pmd2 to Testpmd1 is always dropped, because
the virtqueue is full since Testpmd1 is in txonly mode and doesn't
receive any packets.  This should be closer to the performance of a
VM-to-Phy scenario.

Test performed on machine with Intel Xeon CPU E5-2690 v4 @ 2.60GHz.
Table below represents improvement in throughput when compared to EMC.

 ++++
 ||Default (-g -O2)| "-Ofast -march=native" |
 |   Scenario ++---++---+
 || GCC|   Clang   | GCC|   Clang   |
 +++---++---+
 | Virt-to-Virt   |+18.9%  |   +25.5%  |+10.8%  |   +16.7%  |
 | Virt-to-NoCopy |+24.3%  |   +33.7%  |+14.9%  |   +22.0%  |
 +++---++---+

For Phy-to-Phy case performance improvement should be even higher, but
it's not the main use-case for this functionality.  Performance
difference for the non-simple flows is within a margin of error.

Signed-off-by: Ilya Maximets 
---