On 8/22/22 17:21, Mike Pattrick wrote: > On Thu, Aug 11, 2022 at 10:55 AM Ilya Maximets <i.maxim...@ovn.org> wrote: >> >> If the PACKET_OUT from controller ends up with sending packet to >> a bond interface, the main thread will take locks in the following >> order: >> handle_openflow >> --> take ofproto_mutex >> handle_packet_out >> packet_xlate >> output_normal >> bond_update_post_recirc_rules >> --> take rwlock in bond.c >> >> If at the same time revalidator thread is processing other packet >> with the output to the same bond: >> xlate_actions >> output_normal >> bond_update_post_recirc_rules >> --> take rwlock in bond.c >> update_recirc_rules >> ofproto_dpif_add_internal_flow >> ofproto_flow_mod >> --> take ofproto_mutex >> >> So, it is possible for these 2 threads to lock each other by >> taking one lock and waiting for another thread to release the >> second lock. >> >> It might also be possible for the main thread to lock itself >> up by trying to acquire ofproto_mutex for the second time, if >> it will actually proceed with update_recirc_rules() after >> taking the bond rwlock. >> >> The problem appears to be that bond_update_post_recirc_rules() >> is called during the flow translation even if side effects are >> prohibited, which is the case for openflow PACKET_OUT handling. >> >> Skipping actual flow updates during the flow translation if >> side effects are disabled to avoid the deadlock. >> >> Since flow are not installed now when actions translated for >> very first packet, installing initial flows in bond_reconfigure(). >> This will cover the case of allocating a new recirc_id. >> >> Also checking if we need to update flows in bond_run() to cover >> link state changes. >> >> Reported-at: https://github.com/openvswitch/ovs-issues/issues/259 >> Fixes: adcf00ba35a0 ("ofproto/bond: Implement bond megaflow using >> recirculation") >> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org> >> --- >> >> I'm not sure how to create a unit test for the lock race with >> revalidator, but it might be possible to trigger the self-lockup >> issue on the main thread. I'll try to work on the test while >> waiting for review. >> > > Hello Ilya, > > I've tested this and confirmed that it solves the problem for me. > > Acked-by: Mike Pattrick <m...@redhat.com> > > I created a reproducer that doesn't require dpdk or real interfaces: > > ip netns add sidea > ip netns add sideb > ovs-vsctl add-br br-sidea > ovs-vsctl add-br br-sideb > ip link add portab0 type veth peer name portbb0 > ip link add portab1 type veth peer name portbb1 > ip link set dev portab0 up > ip link set dev portab1 up > ip link set dev portbb0 up > ip link set dev portbb1 up > ovs-vsctl add-bond br-sidea bonda portab0 portab1 -- set port bonda > lacp=active -- set port bonda bond_mode=balance-tcp > ovs-vsctl add-bond br-sideb bondb portbb0 portbb1 -- set port bondb > lacp=active -- set port bondb bond_mode=balance-tcp > ip link add porta2 type veth peer name ovs-porta2 > ip link add portb2 type veth peer name ovs-portb2 > ip link set dev ovs-porta2 up > ip link set dev ovs-portb2 up > ip link set porta2 netns sidea > ip link set portb2 netns sideb > ip netns exec sidea ip link set dev porta2 up > ip netns exec sideb ip link set dev portb2 up > ip netns exec sidea ip addr add 172.31.1.1/24 dev porta2 > ip netns exec sideb ip addr add 172.31.1.2/24 dev portb2 > ovs-vsctl add-port br-sidea ovs-porta2 > ovs-vsctl add-port br-sideb ovs-portb2 > ovs-ofctl add-flow br-sidea "actions=normal" > ovs-ofctl add-flow br-sideb "actions=normal" > while [[ 1 ]]; do > ovs-ofctl packet-out br-sidea > "packet=ffffffffffff000000000000080045000054fbc340004001e4a3ac1f0101ac1f01020800e9af9e540001ce8e036300000000da35050000000000101112131415161718191a1b1c1d1e1f202122231415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f3031323334353637, > actions=resubmit(,0)" 2>/dev/null >/dev/null > ovs-ofctl packet-out br-sideb > "packet=ffffffffffff000000000000080045000054de00000040014267ac1f0102ac1f01010000efecd19a00019c90036300000000d5b00a0000000000101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f3031323334353637, > actions=resubmit(,0)" > done & > while [[ 1 ]]; do > ovs-vsctl --may-exist br-sidea > ovs-vsctl --if-exists del-port bonda > ovs-vsctl add-bond br-sidea bonda portab0 portab1 -- set port > bonda lacp=active -- set port bonda bond_mode=balance-tcp > ovs-vsctl add-port br-sidea ovs-porta2 > ovs-ofctl add-flow br-sidea "actions=normal" > done & > ip netns exec sidea ping -f 172.31.1.2 > > > On my machine it triggers a deadlock fairly consistently in under 3 > seconds. Do you think it's worth turning something like this into a > test? Or would this be too slow and resource intensive?
Thanks, Mike. It looks hard to reproduce this way in CI, and also a bit unreliable. I manged to create a test that is triggering double locking. I sent v2 with that test included. Please, take a look. Best regards, Ilya Maximets. _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev