On 11/23/20 6:30 PM, Numan Siddique wrote:
> On Fri, Nov 20, 2020 at 9:08 PM Dumitru Ceara <dce...@redhat.com> wrote:
>>
>> When a port binding of type "l3gateway" is claimed its remote peer
>> port_binding is also stored in local_datapath.peer_ports[].remote.
>>
>> If the remote peer port_binding is deleted first (i.e., before the local
>> "l3gateway" one) then we need to remove the complete
>> local_datapath.peer_ports[] entry in order to avoid ending up using
>> dangling pointers to already freed port bindings.
>>
>> Also, properly reset local_datapath->has_local_l3gateway in
>> remove_pb_from_local_datapath().
>>
>> Ilya reported this issue found by AddressSanitizer during his testing:
>>
>> ==1816017==ERROR: AddressSanitizer: heap-use-after-free on address 
>> 0x6140000cb170 at pc 0x0000005ab574 bp 0x7fff68925a30 sp 0x7fff68925a28
>> READ of size 8 at 0x6140000cb170 thread T0
>>     #0 0x5ab573 in put_replace_chassis_mac_flows 
>> git/ovn/controller/physical.c:550:9
>>     #1 0x5a65eb in consider_port_binding 
>> git/ovn/controller/physical.c:1168:13
>>     #2 0x5a8764 in physical_run git/ovn/controller/physical.c:1607:9
>>     #3 0x5a0064 in flow_output_physical_flow_changes_handler 
>> git/ovn/controller/ovn-controller.c:2127:9
>>     #4 0x5db423 in engine_compute git/ovn/lib/inc-proc-eng.c:306:18
>>     #5 0x5dae1f in engine_run_node git/ovn/lib/inc-proc-eng.c:352:14
>>     #6 0x5dac74 in engine_run git/ovn/lib/inc-proc-eng.c:377:9
>>     #7 0x59ad64 in main git/ovn/controller/ovn-controller.c
>>     #8 0x7f39fa6491a2 in __libc_start_main (/lib64/libc.so.6+0x271a2)
>>     #9 0x480b2d in _start (git/ovn/controller/ovn-controller+0x480b2d)
>>
>> 0x6140000cb170 is located 304 bytes inside of 408-byte region 
>> [0x6140000cb040,0x6140000cb1d8)
>> freed by thread T0 here:
>>     #0 0x520d07 in free (git/ovn/controller/ovn-controller+0x520d07)
>>     #1 0x712de7 in ovsdb_idl_db_track_clear git/ovs/lib/ovsdb-idl.c:1984:21
>>     #2 0x59b5cd in main git/ovn/controller/ovn-controller.c:2762:9
>>     #3 0x7f39fa6491a2 in __libc_start_main (/lib64/libc.so.6+0x271a2)
>>
>> Reported-by: Ilya Maximets <i.maxim...@ovn.org>
>> Fixes: 354bdba51abf ("ovn-controller: I-P for SB port binding and OVS 
>> interface in runtime_data.")
>> Signed-off-by: Dumitru Ceara <dce...@redhat.com>
> 
> Hi Dumitru,
> 
> Please see below for few comments.
> 
>> ---
>>  controller/binding.c | 47 ++++++++++++++++++++++++++++++++++++++++++-----
>>  1 file changed, 42 insertions(+), 5 deletions(-)
>>
>> diff --git a/controller/binding.c b/controller/binding.c
>> index eb92679..cb60c5d 100644
>> --- a/controller/binding.c
>> +++ b/controller/binding.c
>> @@ -1523,21 +1523,41 @@ binding_cleanup(struct ovsdb_idl_txn *ovnsb_idl_txn,
>>  }
>>
>>  static const struct sbrec_port_binding *
>> -get_peer_lport(const struct sbrec_port_binding *pb,
>> -               struct binding_ctx_in *b_ctx_in)
>> +get_peer_lport__(const struct sbrec_port_binding *pb,
>> +                 struct binding_ctx_in *b_ctx_in)
>>  {
>>      const char *peer_name = smap_get(&pb->options, "peer");
>> -    if (strcmp(pb->type, "patch") || !peer_name) {
>> +
>> +    if (!peer_name) {
>>          return NULL;
>>      }
>>
>>      const struct sbrec_port_binding *peer;
>>      peer = lport_lookup_by_name(b_ctx_in->sbrec_port_binding_by_name,
>>                                  peer_name);
>> -
>>      return (peer && peer->datapath) ? peer : NULL;
>>  }
>>
>> +static const struct sbrec_port_binding *
>> +get_l3gw_peer_lport(const struct sbrec_port_binding *pb,
>> +                    struct binding_ctx_in *b_ctx_in)
>> +{
>> +    if (strcmp(pb->type, "l3gateway")) {
>> +        return NULL;
>> +    }
>> +    return get_peer_lport__(pb, b_ctx_in);
>> +}
>> +
>> +static const struct sbrec_port_binding *
>> +get_peer_lport(const struct sbrec_port_binding *pb,
>> +               struct binding_ctx_in *b_ctx_in)
>> +{
>> +    if (strcmp(pb->type, "patch")) {
>> +        return NULL;
>> +    }
>> +    return get_peer_lport__(pb, b_ctx_in);
>> +}
>> +
>>  /* This function adds the local datapath of the 'peer' of
>>   * lport 'pb' to the local datapaths if it is not yet added.
>>   */
>> @@ -1654,7 +1674,9 @@ remove_pb_from_local_datapath(const struct 
>> sbrec_port_binding *pb,
>>                                           pb->logical_port)) {
>>              ld->localnet_port = NULL;
>>          }
>> -    } else if (!strcmp(pb->type, "l3gateway")) {
>> +    }
> 
> This makes sense. Thanks for spotting this.
> 
>> +
>> +    if (!strcmp(pb->type, "l3gateway")) {
>>          const char *chassis_id = smap_get(&pb->options,
>>                                            "l3gateway-chassis");
>>          if (chassis_id && !strcmp(chassis_id, chassis_rec->name)) {
>> @@ -1956,12 +1978,27 @@ handle_deleted_lport(const struct sbrec_port_binding 
>> *pb,
>>                       struct binding_ctx_in *b_ctx_in,
>>                       struct binding_ctx_out *b_ctx_out)
>>  {
>> +    /* If the binding is local, remove it. */
>>      struct local_datapath *ld =
>>          get_local_datapath(b_ctx_out->local_datapaths,
>>                             pb->datapath->tunnel_key);
>>      if (ld) {
>>          remove_pb_from_local_datapath(pb, b_ctx_in->chassis_rec,
>>                                        b_ctx_out, ld);
>> +        return;
>> +    }
>> +
>> +    /* If the binding is not local, if 'pb' is a L3 gateway port, we should
>> +     * remove its peer, if that one is local.
>> +     */
> 
> Hi Dumitru,
> 

Hi Numan,

> Is this issue seen wiith this Ilya's patch series applied ?
> https://patchwork.ozlabs.org/project/ovn/list/?series=214426
> 

Yes, that's the easiest way to see the issue but the bug is also present
on master branch.

> I have not looked into that series yet. But I fail to understand how
> can we encounter this issue with the present master ?
> 
> Because
>   - If a logical router is a gateway router, then all the logical
> router ports of that router and the peer logical switch ports
>     are claimed by the same chassis where the options:chassis=<ch> is set.
>   - So when a logical router port of l3gateway is deleted or its peer
> (of type l3gatewayport) is deleted, then
>     if that port_binding is not in the local datapath, then its peer
> cannot be in the local datapath too.
> 
> Do you see a scenarion where this could happen with the present master ?
> 

As far as I understand there's no guarantee that SB updates that happen
in the same ovn-northd transaction are sent to ovn-controller in the
same jsonrpc update message and I think that's what's causing this.

I managed to replicate the issue on the master branch by simulating a
bit of delay in ovn-controller and making the failing test run faster by
skipping the packet checks:

https://github.com/dceara/ovn/commit/801327f0dd593f2593de19a34f2e6d0101f7aa1c

With AddressSanitizer enabled:

$ ./boot.sh && ./configure --with-ovs-source=~/git-repos/ovs CFLAGS="-g
-O2 -fsanitize=address -fno-omit-frame-pointer -fno-common" && make

And running test 159 in a loop until it fails, I occasionally hit the
use after free:

$ while true; do make check TESTSUITEFLAGS="-d 159" || break; done

$ head -10 tests/testsuite.dir/159/asan*
=================================================================
==1964550==ERROR: AddressSanitizer: heap-use-after-free on address
0x614000033b70 at pc 0x000000480fd6 bp 0x7fff7920c640 sp 0x7fff7920c630
READ of size 8 at 0x614000033b70 thread T0
    #0 0x480fd5 in put_replace_chassis_mac_flows controller/physical.c:550
    #1 0x480fd5 in consider_port_binding controller/physical.c:1168
    #2 0x482f58 in physical_run controller/physical.c:1605
    #3 0x474432 in flow_output_physical_flow_changes_handler
controller/ovn-controller.c:2170
    #4 0x4bf417 in engine_compute lib/inc-proc-eng.c:306
    #5 0x4bf417 in engine_run_node lib/inc-proc-eng.c:352
    #6 0x4bf417 in engine_run lib/inc-proc-eng.c:377

Thanks,
Dumitru

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to