> During IPMP stress testing, we hit the following interesting deadlock:

FWIW, I just hit a variant of this deadlock in which the thread inside the
perimeter got there via promiscoff rather than disabmulti:

  stack pointer for thread ffffff00077fdc60: ffffff00077fd9f0
  [ ffffff00077fd9f0 _resume_from_idle+0xf1() ]
    ffffff00077fda20 swtch+0x200()
    ffffff00077fda50 cv_wait+0x75(ffffff02012239b8, ffffff02012239a0)
    ffffff00077fda80 mac_callback_remove_wait+0x43(ffffff02012239b0)
    ffffff00077fdac0 mac_promisc_remove+0xa5(ffffff021eeb5600)
    ffffff00077fdb00 dls_promisc+0xba(ffffff020106c898, 2)
    ffffff00077fdb60 proto_promiscoff_req+0xda(ffffff020106c898, ffffff01da3fe2a0)
    ffffff00077fdb80 dld_proto+0x51(ffffff020106c898, ffffff01da3fe2a0)
    ffffff00077fdbb0 dld_wput_nondata_task+0x8b(ffffff020106c898)
    ffffff00077fdc40 taskq_d_thread+0xbc(ffffff01d1beeaa0)
    ffffff00077fdc50 thread_start+8()

The same fix you're working on should handle it, though.

 > T1: Inside the mac perimeter for "under1", waiting for it to quiesce:
 > 
 > stack pointer for thread ffffff000770dc60: ffffff000770d960
 > [ ffffff000770d960 _resume_from_idle+0xf1() ]
 >   ffffff000770d990 swtch+0x200()
 >   ffffff000770d9c0 cv_wait+0x75(ffffff146a245d98, ffffff146a245ce8)
 >   ffffff000770da00 mac_flow_wait+0x5c(ffffff146a245488, 0)
 >   ffffff000770da60 mac_bcast_delete+0x24b(ffffff01d51a7c08, 
 > ffffff02697df97c, 0)
 >   ffffff000770daa0 mac_multicast_remove+0x50(ffffff01d51a7c08, 
 > ffffff02697df97c)
 >   ffffff000770db00 dls_multicst_remove+0xa0(ffffff025168cd78, 
 > ffffff02697df97c)
 >   ffffff000770db60 proto_disabmulti_req+0xa0(ffffff025168cd78, 
 > ffffff0208231840)
 >   ffffff000770db80 dld_proto+0x5f(ffffff025168cd78, ffffff0208231840)
 >   ffffff000770dbb0 dld_wput_nondata_task+0x8b(ffffff025168cd78)
 >   ffffff000770dc40 taskq_d_thread+0xbc(ffffff01d0310320)
 >   ffffff000770dc50 thread_start+8()
 > 
 > T2: An interrupt from the datapath for "under1", which led to entering the
 >     IPSQ in ip_ndp_failure() because it discovered an address conflict.
 >     On the backend of exiting the IPSQ, it dequeued a request for it to
 >     leave the group, which then retrieved the latest kstats for "under1"
 >     (to add to the group baseline), which in turn called back into the
 >     mac perimeter:
 > 
 > stack pointer for thread ffffff0007a0bc60: ffffff0007a0ac00
 > [ ffffff0007a0ac00 resume_from_intr+0xb4() ]
 >   ffffff0007a0ac30 swtch+0xb1()
 >   ffffff0007a0ac60 cv_wait+0x75()
 >   ffffff0007a0ac90 i_mac_perim_enter+0x65()
 >   ffffff0007a0acc0 mac_perim_enter_by_mh+0x1f()
 >   ffffff0007a0ad00 mac_perim_enter_by_macname+0x32()
 >   ffffff0007a0ad50 dls_devnet_stat_update+0x37()
 >   ffffff0007a0adb0 ipmp_phyint_get_kstats+0x8c()
 >   ffffff0007a0ae80 ipmp_phyint_leave_grp+0x158()
 >   ffffff0007a0af10 ip_sioctl_groupname+0x1e4()
 >   ffffff0007a0af90 ip_process_ioctl+0x217()
 >   ffffff0007a0afd0 ipsq_exit+0xb8()
 >   ffffff0007a0b040 qwriter_ip+0x82()
 >   ffffff0007a0b080 ip_ndp_failure+0x7b()
 >   ffffff0007a0b120 ndp_input_solicit+0x483()
 >   ffffff0007a0b170 ndp_input+0xff()
 >   ffffff0007a0b230 icmp_inbound_v6+0x486()
 >   ffffff0007a0b3e0 ip_rput_data_v6+0x1961()
 >   ffffff0007a0b4c0 ip_rput_v6+0x71a()
 >   ffffff0007a0b540 putnext+0x2f9()
 >   ffffff0007a0b590 dld_str_rx_fastpath+0xaa()
 >   ffffff0007a0b680 i_dls_link_rx+0x2ea()
 >   ffffff0007a0b6c0 mac_rx_deliver+0x5d()
 >   ffffff0007a0b750 mac_rx_soft_ring_process+0x192()
 >   ffffff0007a0b840 mac_rx_srs_proto_fanout+0x4de()
 >   ffffff0007a0b8d0 mac_rx_srs_drain+0x2c0()
 >   ffffff0007a0b960 mac_rx_srs_process+0x4c9()
 >   ffffff0007a0b9f0 mac_bcast_send+0x157()
 >   ffffff0007a0ba40 mac_rx_classify+0x17f()
 >   ffffff0007a0baa0 mac_rx_flow+0x54()
 >   ffffff0007a0baf0 mac_rx+0x11b()
 >   ffffff0007a0bb30 mac_rx_ring+0x4c()
 >   ffffff0007a0bba0 e1000g_intr+0x230()
 >   ffffff0007a0bc00 av_dispatch_autovect+0x8f()
 >   ffffff0007a0bc40 dispatch_hardint+0x33()
 >   ffffff0007fae470 switch_sp_and_call+0x13()
 > 
 > I don't see any locking rules in IP that were violated (e.g., no locks are
 > held in IP across the call down to GLDv3).  This seems to be an
 > unexpected (and very serious) problem in the new mac perimeter design.
 > 
 > Thoughts?
 > 
 > -- 
 > meem
 > _______________________________________________
 > clearview-dev mailing list
 > clearview-dev at opensolaris.org
 > http://mail.opensolaris.org/mailman/listinfo/clearview-dev

-- 
meem
