Thiru,

During IPMP stress testing, we hit the following interesting deadlock:

T1: Inside the mac perimeter for "under1", waiting in mac_flow_wait() for the
    flow to quiesce:

stack pointer for thread ffffff000770dc60: ffffff000770d960
[ ffffff000770d960 _resume_from_idle+0xf1() ]
  ffffff000770d990 swtch+0x200()
  ffffff000770d9c0 cv_wait+0x75(ffffff146a245d98, ffffff146a245ce8)
  ffffff000770da00 mac_flow_wait+0x5c(ffffff146a245488, 0)
  ffffff000770da60 mac_bcast_delete+0x24b(ffffff01d51a7c08, ffffff02697df97c, 0)
  ffffff000770daa0 mac_multicast_remove+0x50(ffffff01d51a7c08, ffffff02697df97c)
  ffffff000770db00 dls_multicst_remove+0xa0(ffffff025168cd78, ffffff02697df97c)
  ffffff000770db60 proto_disabmulti_req+0xa0(ffffff025168cd78, ffffff0208231840)
  ffffff000770db80 dld_proto+0x5f(ffffff025168cd78, ffffff0208231840)
  ffffff000770dbb0 dld_wput_nondata_task+0x8b(ffffff025168cd78)
  ffffff000770dc40 taskq_d_thread+0xbc(ffffff01d0310320)
  ffffff000770dc50 thread_start+8()

T2: An interrupt thread in the datapath for "under1" that entered the IPSQ
    in ip_ndp_failure() after discovering an address conflict.  On the way
    out of the IPSQ, it dequeued a pending request for "under1" to leave
    its group.  That request retrieved the latest kstats for "under1" (to
    add to the group baseline), which in turn tried to enter the mac
    perimeter.

stack pointer for thread ffffff0007a0bc60: ffffff0007a0ac00
[ ffffff0007a0ac00 resume_from_intr+0xb4() ]
  ffffff0007a0ac30 swtch+0xb1()
  ffffff0007a0ac60 cv_wait+0x75()
  ffffff0007a0ac90 i_mac_perim_enter+0x65()
  ffffff0007a0acc0 mac_perim_enter_by_mh+0x1f()
  ffffff0007a0ad00 mac_perim_enter_by_macname+0x32()
  ffffff0007a0ad50 dls_devnet_stat_update+0x37()
  ffffff0007a0adb0 ipmp_phyint_get_kstats+0x8c()
  ffffff0007a0ae80 ipmp_phyint_leave_grp+0x158()
  ffffff0007a0af10 ip_sioctl_groupname+0x1e4()
  ffffff0007a0af90 ip_process_ioctl+0x217()
  ffffff0007a0afd0 ipsq_exit+0xb8()
  ffffff0007a0b040 qwriter_ip+0x82()
  ffffff0007a0b080 ip_ndp_failure+0x7b()
  ffffff0007a0b120 ndp_input_solicit+0x483()
  ffffff0007a0b170 ndp_input+0xff()
  ffffff0007a0b230 icmp_inbound_v6+0x486()
  ffffff0007a0b3e0 ip_rput_data_v6+0x1961()
  ffffff0007a0b4c0 ip_rput_v6+0x71a()
  ffffff0007a0b540 putnext+0x2f9()
  ffffff0007a0b590 dld_str_rx_fastpath+0xaa()
  ffffff0007a0b680 i_dls_link_rx+0x2ea()
  ffffff0007a0b6c0 mac_rx_deliver+0x5d()
  ffffff0007a0b750 mac_rx_soft_ring_process+0x192()
  ffffff0007a0b840 mac_rx_srs_proto_fanout+0x4de()
  ffffff0007a0b8d0 mac_rx_srs_drain+0x2c0()
  ffffff0007a0b960 mac_rx_srs_process+0x4c9()
  ffffff0007a0b9f0 mac_bcast_send+0x157()
  ffffff0007a0ba40 mac_rx_classify+0x17f()
  ffffff0007a0baa0 mac_rx_flow+0x54()
  ffffff0007a0baf0 mac_rx+0x11b()
  ffffff0007a0bb30 mac_rx_ring+0x4c()
  ffffff0007a0bba0 e1000g_intr+0x230()
  ffffff0007a0bc00 av_dispatch_autovect+0x8f()
  ffffff0007a0bc40 dispatch_hardint+0x33()
  ffffff0007fae470 switch_sp_and_call+0x13()

I don't see any locking rules in IP that were violated (e.g., no locks are
being held in IP across the call down to GLDv3).  This seems to just be an
unexpected (and very serious) problem in the new mac perimeter design.
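
To make the cycle explicit, here is a minimal wait-for sketch of the two
threads.  It is only a model, not kernel code: the resource names and the
find_cycle() helper are made up for illustration, and it assumes (as the
stacks suggest, but I have not confirmed) that T1's mac_flow_wait() cannot
return until T2 leaves the receive path.

```python
# Hypothetical wait-for model of the two stacks above; all names are illustrative.

# Each resource maps to the thread that currently holds it.
holders = {
    "mac_perimeter(under1)": "T1",   # T1 is inside the perimeter (mac_multicast_remove path)
    "flow_reference(under1)": "T2",  # assumption: T2 is still inside the flow's receive path
}

# Each blocked thread maps to the resource it is waiting on.
waiting_for = {
    "T1": "flow_reference(under1)",  # blocked in mac_flow_wait()
    "T2": "mac_perimeter(under1)",   # blocked in i_mac_perim_enter()
}


def find_cycle(start, holders, waiting_for):
    """Follow thread -> wanted resource -> holder until a thread repeats or the chain ends."""
    path = []
    thread = start
    while thread is not None and thread not in path:
        path.append(thread)
        resource = waiting_for.get(thread)
        thread = holders.get(resource)
    if thread is None:
        return None  # the chain ended at a runnable thread: no deadlock
    return path[path.index(thread):]


cycle = find_cycle("T1", holders, waiting_for)
print(cycle)
```

If the model is right, the chain from T1 leads back to T1 through T2, so
neither thread can make progress without the other.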

Thoughts?

-- 
meem
