On Mon, Feb 8, 2021 at 4:02 PM Dumitru Ceara <dce...@redhat.com> wrote: > > On 2/8/21 11:09 AM, Numan Siddique wrote: > > On Sat, Feb 6, 2021 at 4:00 AM Dumitru Ceara <dce...@redhat.com> wrote: > >> > >> The main trait of load balancer hairpin traffic is that it never leaves > >> the local hypervisor. Essentially this means that only hairpin > >> openflows installed for logical switches that have at least one logical > >> switch port bound locally can ever be hit. > >> > >> Until now, if a load balancer was applied on multiple logical switches > >> that are connected through a distributed router, ovn-controller would > >> install flows to detect hairpin replies for all logical switches. In > >> practice this leads to a very high number of openflows out of which > >> most will never be used. > >> > >> Instead we now use an additional action, learn(), on flows that match on > >> packets that create the hairpin session. The learn() action will then > >> generate the necessary flows to handle hairpin replies, but only for > >> the local datapaths which actually generate hairpin traffic. > >> > >> For example, simulating how ovn-k8s uses load balancer for services, > >> in a "switch per node" scenario, the script below would generate > >> 10K (n_nodes * n_vips * n_backends) openflows on every node in table=69 > >> (hairpin reply). With this patch the maximum number of openflows that > >> can be created for hairpin replies is 200 (n_vips * n_backends). > >> > >> In general, for deployments that leverage switch-per-node topologies, > >> the number of openflows is reduced by a factor of N, where N is the > >> number of nodes. > >> > >> $ cat lbs.sh > >> NODES=50 > >> VIPS=20 > >> BACKENDS=10 > >> ovn-nbctl lr-add rtr > >> for ((i = 1; i <= $NODES; i++)); do > >> ovn-nbctl \ > >> -- ls-add ls$i \ > >> -- lsp-add ls$i vm$i \ > >> -- lsp-add ls$i ls$i-rtr \ > >> -- lsp-set-type ls$i-rtr router \ > >> -- lsp-set-options ls$i-rtr router-port=rtr-ls$i \ > >> -- lrp-add rtr rtr-ls$i 00:00:00:00:01:00 42.42.42.$i/24 > >> done > >> > >> for ((i = 1; i <= $VIPS; i++)); do > >> lb=lb$i > >> vip=10.10.10.$i:1 > >> bip=20.20.20.1:2 > >> for ((j = 2; j <= $BACKENDS; j++)); do > >> bip="$bip,20.20.20.$j:2" > >> done > >> ovn-nbctl lb-add $lb $vip $backends > >> done > >> > >> for ((i = 1; i <= $NODES; i++)); do > >> for ((j = 1; j <= $VIPS; j++)); do > >> ovn-nbctl ls-lb-add ls$i lb$j > >> done > >> done > >> > >> ovs-vsctl add-port br-int vm1 \ > >> -- set interface vm1 type=internal \ > >> -- set interface vm1 external-ids:iface-id=vm1 > >> > >> Suggested-by: Ilya Maximets <i.maxim...@ovn.org> > >> Signed-off-by: Dumitru Ceara <dce...@redhat.com> > > > > Hi Dumitru, > > Hi Numan, > > > > > Thanks for this patch. The patch LGTM. I tested it out with huge NB > > DB with lots > > of load balancer and backends. I see a significant reduction in OF flows. > > Thanks for the review! > > > > > Please see the few comments below. > > > > Thanks > > Numan > > > >> --- > >> controller/lflow.c | 204 ++++++++++++++++--- > >> tests/ofproto-macros.at | 5 +- > >> tests/ovn.at | 516 > >> +++++++++++++++++++++++++----------------------- > >> 3 files changed, 455 insertions(+), 270 deletions(-) > >> > >> diff --git a/controller/lflow.c b/controller/lflow.c > >> index 946c1e0..2b7d356 100644 > >> --- a/controller/lflow.c > >> +++ b/controller/lflow.c > >> @@ -1171,6 +1171,178 @@ add_neighbor_flows(struct ovsdb_idl_index > >> *sbrec_port_binding_by_name, > >> } > >> } > >> > >> +/* Builds the "learn()" action to be triggered by packets initiating a > >> + * hairpin session. > >> + * > >> + * This will generate flows in table OFTABLE_CHK_LB_HAIRPIN_REPLY of the > >> form: > >> + * - match: > >> + * metadata=<orig-pkt-metadata>,ip/ipv6,ip.src=<backend>,ip.dst=<vip> > >> + * nw_proto='lb_proto',tp_src_port=<backend-port> > >> + * - action: > >> + * set MLF_LOOKUP_LB_HAIRPIN_BIT=1 > >> + */ > >> +static void > >> +add_lb_vip_hairpin_reply_action(struct in6_addr *vip6, ovs_be32 vip, > >> + uint8_t lb_proto, bool has_l4_port, > >> + uint64_t cookie, struct ofpbuf *ofpacts) > >> +{ > >> + struct match match = MATCH_CATCHALL_INITIALIZER; > >> + struct ofpact_learn *ol = ofpact_put_LEARN(ofpacts); > >> + struct ofpact_learn_spec *ol_spec; > >> + unsigned int imm_bytes; > >> + uint8_t *src_imm; > >> + > >> + /* Once learned, hairpin reply flows are permanent until the > >> VIP/backend > >> + * is removed. > >> + */ > >> + ol->flags = NX_LEARN_F_DELETE_LEARNED; > >> + ol->idle_timeout = OFP_FLOW_PERMANENT; > >> + ol->hard_timeout = OFP_FLOW_PERMANENT; > >> + ol->priority = OFP_DEFAULT_PRIORITY; > >> + ol->table_id = OFTABLE_CHK_LB_HAIRPIN_REPLY; > >> + ol->cookie = htonll(cookie); > >> + > >> + /* Match on metadata of the packet that created the hairpin session. > >> */ > >> + ol_spec = ofpbuf_put_zeros(ofpacts, sizeof *ol_spec); > >> + > >> + ol_spec->dst.field = mf_from_id(MFF_METADATA); > >> + ol_spec->dst.ofs = 0; > >> + ol_spec->dst.n_bits = ol_spec->dst.field->n_bits; > >> + ol_spec->n_bits = ol_spec->dst.n_bits; > >> + ol_spec->dst_type = NX_LEARN_DST_MATCH; > >> + ol_spec->src_type = NX_LEARN_SRC_FIELD; > >> + ol_spec->src.field = mf_from_id(MFF_METADATA); > >> + > >> + /* Match on the same ETH type as the packet that created the hairpin > >> + * session. > >> + */ > >> + ol_spec = ofpbuf_put_zeros(ofpacts, sizeof *ol_spec); > >> + ol_spec->dst.field = mf_from_id(MFF_ETH_TYPE); > >> + ol_spec->dst.ofs = 0; > >> + ol_spec->dst.n_bits = ol_spec->dst.field->n_bits; > >> + ol_spec->n_bits = ol_spec->dst.n_bits; > >> + ol_spec->dst_type = NX_LEARN_DST_MATCH; > >> + ol_spec->src_type = NX_LEARN_SRC_IMMEDIATE; > >> + union mf_value imm_eth_type = { > >> + .be16 = !vip6 ? htons(ETH_TYPE_IP) : htons(ETH_TYPE_IPV6) > >> + }; > >> + mf_write_subfield_value(&ol_spec->dst, &imm_eth_type, &match); > >> + > >> + /* Push value last, as this may reallocate 'ol_spec'. */ > >> + imm_bytes = DIV_ROUND_UP(ol_spec->dst.n_bits, 8); > >> + src_imm = ofpbuf_put_zeros(ofpacts, OFPACT_ALIGN(imm_bytes)); > >> + memcpy(src_imm, &imm_eth_type, imm_bytes); > >> + > >> + /* Hairpin replies have ip.src == <backend-ip>. */ > >> + ol_spec = ofpbuf_put_zeros(ofpacts, sizeof *ol_spec); > >> + if (!vip6) { > >> + ol_spec->dst.field = mf_from_id(MFF_IPV4_SRC); > >> + ol_spec->src.field = mf_from_id(MFF_IPV4_SRC); > >> + } else { > >> + ol_spec->dst.field = mf_from_id(MFF_IPV6_SRC); > >> + ol_spec->src.field = mf_from_id(MFF_IPV6_SRC); > >> + } > >> + ol_spec->dst.ofs = 0; > >> + ol_spec->dst.n_bits = ol_spec->dst.field->n_bits; > >> + ol_spec->n_bits = ol_spec->dst.n_bits; > >> + ol_spec->dst_type = NX_LEARN_DST_MATCH; > >> + ol_spec->src_type = NX_LEARN_SRC_FIELD; > >> + > >> + /* Hairpin replies have ip.dst == <vip>. */ > >> + union mf_value imm_ip; > >> + ol_spec = ofpbuf_put_zeros(ofpacts, sizeof *ol_spec); > >> + if (!vip6) { > >> + ol_spec->dst.field = mf_from_id(MFF_IPV4_DST); > >> + imm_ip = (union mf_value) { > >> + .be32 = vip > >> + }; > >> + } else { > >> + ol_spec->dst.field = mf_from_id(MFF_IPV6_DST); > >> + imm_ip = (union mf_value) { > >> + .ipv6 = *vip6 > >> + }; > >> + } > >> + ol_spec->dst.ofs = 0; > >> + ol_spec->dst.n_bits = ol_spec->dst.field->n_bits; > >> + ol_spec->n_bits = ol_spec->dst.n_bits; > >> + ol_spec->dst_type = NX_LEARN_DST_MATCH; > >> + ol_spec->src_type = NX_LEARN_SRC_IMMEDIATE; > >> + mf_write_subfield_value(&ol_spec->dst, &imm_ip, &match); > >> + > >> + /* Push value last, as this may reallocate 'ol_spec' */ > >> + imm_bytes = DIV_ROUND_UP(ol_spec->dst.n_bits, 8); > >> + src_imm = ofpbuf_put_zeros(ofpacts, OFPACT_ALIGN(imm_bytes)); > >> + memcpy(src_imm, &imm_ip, imm_bytes); > >> + > >> + /* Hairpin replies have the same nw_proto as packets that created the > >> + * session. > >> + */ > >> + union mf_value imm_proto = { > >> + .u8 = lb_proto, > >> + }; > >> + ol_spec = ofpbuf_put_zeros(ofpacts, sizeof *ol_spec); > >> + ol_spec->dst.field = mf_from_id(MFF_IP_PROTO); > >> + ol_spec->src.field = mf_from_id(MFF_IP_PROTO); > >> + ol_spec->dst.ofs = 0; > >> + ol_spec->dst.n_bits = ol_spec->dst.field->n_bits; > >> + ol_spec->n_bits = ol_spec->dst.n_bits; > >> + ol_spec->dst_type = NX_LEARN_DST_MATCH; > >> + ol_spec->src_type = NX_LEARN_SRC_IMMEDIATE; > >> + mf_write_subfield_value(&ol_spec->dst, &imm_proto, &match); > >> + > >> + /* Push value last, as this may reallocate 'ol_spec' */ > >> + imm_bytes = DIV_ROUND_UP(ol_spec->dst.n_bits, 8); > >> + src_imm = ofpbuf_put_zeros(ofpacts, OFPACT_ALIGN(imm_bytes)); > >> + memcpy(src_imm, &imm_proto, imm_bytes); > >> + > >> + /* Hairpin replies have source port == <backend-port>. */ > >> + if (has_l4_port) { > >> + ol_spec = ofpbuf_put_zeros(ofpacts, sizeof *ol_spec); > >> + switch (lb_proto) { > >> + case IPPROTO_TCP: > >> + ol_spec->dst.field = mf_from_id(MFF_TCP_SRC); > >> + ol_spec->src.field = mf_from_id(MFF_TCP_DST); > >> + break; > >> + case IPPROTO_UDP: > >> + ol_spec->dst.field = mf_from_id(MFF_UDP_SRC); > >> + ol_spec->src.field = mf_from_id(MFF_UDP_DST); > >> + break; > >> + case IPPROTO_SCTP: > >> + ol_spec->dst.field = mf_from_id(MFF_SCTP_SRC); > >> + ol_spec->src.field = mf_from_id(MFF_SCTP_DST); > >> + break; > >> + default: > >> + OVS_NOT_REACHED(); > >> + break; > >> + } > >> + ol_spec->dst.ofs = 0; > >> + ol_spec->dst.n_bits = ol_spec->dst.field->n_bits; > >> + ol_spec->n_bits = ol_spec->dst.n_bits; > >> + ol_spec->dst_type = NX_LEARN_DST_MATCH; > >> + ol_spec->src_type = NX_LEARN_SRC_FIELD; > >> + } > >> + > >> + /* Set MLF_LOOKUP_LB_HAIRPIN_BIT for hairpin replies. */ > >> + ol_spec = ofpbuf_put_zeros(ofpacts, sizeof *ol_spec); > >> + ol_spec->dst.field = mf_from_id(MFF_LOG_FLAGS); > >> + ol_spec->dst.ofs = MLF_LOOKUP_LB_HAIRPIN_BIT; > >> + ol_spec->dst.n_bits = 1; > >> + ol_spec->n_bits = ol_spec->dst.n_bits; > >> + ol_spec->dst_type = NX_LEARN_DST_LOAD; > >> + ol_spec->src_type = NX_LEARN_SRC_IMMEDIATE; > >> + union mf_value imm_reg_value = { > >> + .u8 = 1 > >> + }; > >> + mf_write_subfield_value(&ol_spec->dst, &imm_reg_value, &match); > >> + > >> + /* Push value last, as this may reallocate 'ol_spec' */ > >> + imm_bytes = DIV_ROUND_UP(ol_spec->dst.n_bits, 8); > >> + src_imm = ofpbuf_put_zeros(ofpacts, OFPACT_ALIGN(imm_bytes)); > >> + memcpy(src_imm, &imm_reg_value, imm_bytes); > >> + > >> + ofpact_finish_LEARN(ofpacts, &ol); > > > > The above code makes perfect sense. But I think we can reduce the amount of > > code by making use of the 'learn_parse()' function present in > > lib/learn.c of OVS. > > > > This would also make the code much more simpler. Otherwise the reader > > has to understand all the intricacies of adding 'learn' ofpact. > > > > I tried something like below to test it out and It worked fine to me > > > > ***************** > > struct ds learn = DS_EMPTY_INITIALIZER; > > ds_put_format(&learn, "table=%d,delete_learned,OXM_OF_METADATA[]," > > > > "eth_type=0x800,nw_proto=6,NXM_OF_IP_SRC[],ip_dst="IP_FMT"," > > > > "NXM_OF_TCP_SRC[]=NXM_OF_TCP_DST[],load:0x1->NXM_NX_REG10[7]", > > OFTABLE_CHK_LB_HAIRPIN_REPLY, IP_ARGS(vip)); > > char *error = learn_parse(ds_cstr(&learn), NULL, NULL, &ofpbuf_learn); > > if (!error) { > > ofpbuf_put(ofpacts, ofpbuf_learn.data, ofpbuf_learn.size); > > } > > ... > > .. > > > > ****************** > > > > What do you think ? > > To be honest, I'm not so sure about this. It would create a dependency > on how OVS formats openflows internally. > > I think we can argue the same thing about intricacies of adding other > OVS actions, e.g.: > > https://github.com/ovn-org/ovn/blob/44ea2ec88136f83e7eab9790473025b6c95bdcc0/controller/lflow.c#L1267 > > https://github.com/ovn-org/ovn/blob/44ea2ec88136f83e7eab9790473025b6c95bdcc0/lib/actions.c#L422 > > We could change those too and use OVS parsing internal utilities to > parse strings but we don't and, in my opinion, the proper way to go is > to configure the actions programatically by using the data structures in > include/openvswitch/ofp-actions.h > > If it helps I can try to refactor a bit the code that creates the > learn() action for hairpin although I'm not sure at the moment how to do > that.
Thanks for the reply. Fair enough. I applied this patch to master. Numan > > > > > Thanks > > Numan > > Regards, > Dumitru > > _______________________________________________ > dev mailing list > d...@openvswitch.org > https://mail.openvswitch.org/mailman/listinfo/ovs-dev > _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev