On 8/22/22 17:21, Mike Pattrick wrote:
> On Thu, Aug 11, 2022 at 10:55 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>
>> If the PACKET_OUT from controller ends up with sending packet to
>> a bond interface, the main thread will take locks in the following
>> order:
>>   handle_openflow
>>   --> take ofproto_mutex
>>   handle_packet_out
>>   packet_xlate
>>   output_normal
>>   bond_update_post_recirc_rules
>>   --> take rwlock in bond.c
>>
>> If at the same time revalidator thread is processing other packet
>> with the output to the same bond:
>>   xlate_actions
>>   output_normal
>>   bond_update_post_recirc_rules
>>   --> take rwlock in bond.c
>>   update_recirc_rules
>>   ofproto_dpif_add_internal_flow
>>   ofproto_flow_mod
>>   --> take ofproto_mutex
>>
>> So, it is possible for these 2 threads to lock each other by
>> taking one lock and waiting for another thread to release the
>> second lock.
>>
>> It might also be possible for the main thread to lock itself
>> up by trying to acquire ofproto_mutex for the second time, if
>> it will actually proceed with update_recirc_rules() after
>> taking the bond rwlock.
>>
>> The problem appears to be that bond_update_post_recirc_rules()
>> is called during the flow translation even if side effects are
>> prohibited, which is the case for openflow PACKET_OUT handling.
>>
>> Skipping actual flow updates during the flow translation if
>> side effects are disabled to avoid the deadlock.
>>
>> Since flow are not installed now when actions translated for
>> very first packet, installing initial flows in bond_reconfigure().
>> This will cover the case of allocating a new recirc_id.
>>
>> Also checking if we need to update flows in bond_run() to cover
>> link state changes.
>>
>> Reported-at: https://github.com/openvswitch/ovs-issues/issues/259
>> Fixes: adcf00ba35a0 ("ofproto/bond: Implement bond megaflow using 
>> recirculation")
>> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org>
>> ---
>>
>> I'm not sure how to create a unit test for the lock race with
>> revalidator, but it might be possible to trigger the self-lockup
>> issue on the main thread.  I'll try to work on the test while
>> waiting for review.
>>
> 
> Hello Ilya,
> 
> I've tested this and confirmed that it solves the problem for me.
> 
> Acked-by: Mike Pattrick <m...@redhat.com>
> 
> I created a reproducer that doesn't require dpdk or real interfaces:
> 
> ip netns add sidea
> ip netns add sideb
> ovs-vsctl add-br br-sidea
> ovs-vsctl add-br br-sideb
> ip link add portab0 type veth peer name portbb0
> ip link add portab1 type veth peer name portbb1
> ip link set dev portab0 up
> ip link set dev portab1 up
> ip link set dev portbb0 up
> ip link set dev portbb1 up
> ovs-vsctl add-bond br-sidea bonda portab0 portab1 -- set port bonda
> lacp=active -- set port bonda bond_mode=balance-tcp
> ovs-vsctl add-bond br-sideb bondb portbb0 portbb1 -- set port bondb
> lacp=active -- set port bondb bond_mode=balance-tcp
> ip link add porta2 type veth peer name ovs-porta2
> ip link add portb2 type veth peer name ovs-portb2
> ip link set dev ovs-porta2 up
> ip link set dev ovs-portb2 up
> ip link set porta2 netns sidea
> ip link set portb2 netns sideb
> ip netns exec sidea ip link set dev porta2 up
> ip netns exec sideb ip link set dev portb2 up
> ip netns exec sidea ip addr add 172.31.1.1/24 dev porta2
> ip netns exec sideb ip addr add 172.31.1.2/24 dev portb2
> ovs-vsctl add-port br-sidea ovs-porta2
> ovs-vsctl add-port br-sideb ovs-portb2
> ovs-ofctl add-flow br-sidea "actions=normal"
> ovs-ofctl add-flow br-sideb "actions=normal"
> while [[ 1 ]]; do
>     ovs-ofctl packet-out br-sidea
> "packet=ffffffffffff000000000000080045000054fbc340004001e4a3ac1f0101ac1f01020800e9af9e540001ce8e036300000000da35050000000000101112131415161718191a1b1c1d1e1f202122231415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f3031323334353637,
> actions=resubmit(,0)" 2>/dev/null >/dev/null
>     ovs-ofctl packet-out br-sideb
> "packet=ffffffffffff000000000000080045000054de00000040014267ac1f0102ac1f01010000efecd19a00019c90036300000000d5b00a0000000000101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f3031323334353637,
> actions=resubmit(,0)"
> done &
> while [[ 1 ]]; do
>     ovs-vsctl --may-exist br-sidea
>     ovs-vsctl --if-exists del-port bonda
>     ovs-vsctl add-bond br-sidea bonda portab0 portab1 -- set port
> bonda lacp=active -- set port bonda bond_mode=balance-tcp
>     ovs-vsctl add-port br-sidea ovs-porta2
>     ovs-ofctl add-flow br-sidea "actions=normal"
> done &
> ip netns exec sidea ping -f 172.31.1.2
> 
> 
> On my machine it triggers a deadlock fairly consistently in under 3
> seconds. Do you think it's worth turning something like this into a
> test? Or would this be too slow and resource intensive?

Thanks, Mike.  It looks hard to reproduce this way in CI, and
also a bit unreliable.  I manged to create a test that is
triggering double locking.  I sent v2 with that test included.
Please, take a look.

Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to