Meem, I think dls_devnet_stat_update() doesn't need the mac perimeter itself. Instead it needs to increment the dd_ref of the dls_devnet_t to make sure it can't disappear; that should synchronize the dls_devnet_stat_create/update/destroy functions. A related point is that these functions will have to drop i_dls_devnet_lock before calling the kstat functions, so that the lock order is always kstat internal lock (or perimeter) -> i_dls_devnet_lock. I will try this and let you know, so that you can run the stress test on it (unless you have already coded up a fix).
Thanks,
Thirumalai

Peter Memishian wrote:
> Thiru,
>
> During IPMP stress testing, we hit the following interesting deadlock:
>
> T1: Inside the mac perimeter for "under1", waiting for it to quiesce:
>
>   stack pointer for thread ffffff000770dc60: ffffff000770d960
>   [ ffffff000770d960 _resume_from_idle+0xf1() ]
>     ffffff000770d990 swtch+0x200()
>     ffffff000770d9c0 cv_wait+0x75(ffffff146a245d98, ffffff146a245ce8)
>     ffffff000770da00 mac_flow_wait+0x5c(ffffff146a245488, 0)
>     ffffff000770da60 mac_bcast_delete+0x24b(ffffff01d51a7c08, ffffff02697df97c, 0)
>     ffffff000770daa0 mac_multicast_remove+0x50(ffffff01d51a7c08, ffffff02697df97c)
>     ffffff000770db00 dls_multicst_remove+0xa0(ffffff025168cd78, ffffff02697df97c)
>     ffffff000770db60 proto_disabmulti_req+0xa0(ffffff025168cd78, ffffff0208231840)
>     ffffff000770db80 dld_proto+0x5f(ffffff025168cd78, ffffff0208231840)
>     ffffff000770dbb0 dld_wput_nondata_task+0x8b(ffffff025168cd78)
>     ffffff000770dc40 taskq_d_thread+0xbc(ffffff01d0310320)
>     ffffff000770dc50 thread_start+8()
>
> T2: An interrupt from the datapath for "under1", that led to entering the
>     IPSQ in ip_ndp_failure() because it discovered an address conflict.
>     On the backend of exiting the IPSQ, it dequeued a request for it to
>     leave the group, which then retrieved the latest kstats for "under1"
>     (to add to the group baseline, which then called back into the mac
>     perimeter).
>
>   stack pointer for thread ffffff0007a0bc60: ffffff0007a0ac00
>   [ ffffff0007a0ac00 resume_from_intr+0xb4() ]
>     ffffff0007a0ac30 swtch+0xb1()
>     ffffff0007a0ac60 cv_wait+0x75()
>     ffffff0007a0ac90 i_mac_perim_enter+0x65()
>     ffffff0007a0acc0 mac_perim_enter_by_mh+0x1f()
>     ffffff0007a0ad00 mac_perim_enter_by_macname+0x32()
>     ffffff0007a0ad50 dls_devnet_stat_update+0x37()
>     ffffff0007a0adb0 ipmp_phyint_get_kstats+0x8c()
>     ffffff0007a0ae80 ipmp_phyint_leave_grp+0x158()
>     ffffff0007a0af10 ip_sioctl_groupname+0x1e4()
>     ffffff0007a0af90 ip_process_ioctl+0x217()
>     ffffff0007a0afd0 ipsq_exit+0xb8()
>     ffffff0007a0b040 qwriter_ip+0x82()
>     ffffff0007a0b080 ip_ndp_failure+0x7b()
>     ffffff0007a0b120 ndp_input_solicit+0x483()
>     ffffff0007a0b170 ndp_input+0xff()
>     ffffff0007a0b230 icmp_inbound_v6+0x486()
>     ffffff0007a0b3e0 ip_rput_data_v6+0x1961()
>     ffffff0007a0b4c0 ip_rput_v6+0x71a()
>     ffffff0007a0b540 putnext+0x2f9()
>     ffffff0007a0b590 dld_str_rx_fastpath+0xaa()
>     ffffff0007a0b680 i_dls_link_rx+0x2ea()
>     ffffff0007a0b6c0 mac_rx_deliver+0x5d()
>     ffffff0007a0b750 mac_rx_soft_ring_process+0x192()
>     ffffff0007a0b840 mac_rx_srs_proto_fanout+0x4de()
>     ffffff0007a0b8d0 mac_rx_srs_drain+0x2c0()
>     ffffff0007a0b960 mac_rx_srs_process+0x4c9()
>     ffffff0007a0b9f0 mac_bcast_send+0x157()
>     ffffff0007a0ba40 mac_rx_classify+0x17f()
>     ffffff0007a0baa0 mac_rx_flow+0x54()
>     ffffff0007a0baf0 mac_rx+0x11b()
>     ffffff0007a0bb30 mac_rx_ring+0x4c()
>     ffffff0007a0bba0 e1000g_intr+0x230()
>     ffffff0007a0bc00 av_dispatch_autovect+0x8f()
>     ffffff0007a0bc40 dispatch_hardint+0x33()
>     ffffff0007fae470 switch_sp_and_call+0x13()
>
> I don't see any locking rules in IP that were violated (e.g., no locks are
> being held in IP across the call down to GLDv3). This seems to just be an
> unexpected (and very serious) problem in the new mac perimeter design.
>
> Thoughts?
