Meem,

I think dls_devnet_stat_update doesn't need the mac perimeter itself. 
Instead it needs to increment the dd_ref of the dls_devnet_t to make 
sure it can't disappear. That should synchronize the 
dls_devnet_stat_create/update/destroy functions. A related change is that 
the functions will have to drop i_dls_devnet_lock before calling the kstat 
functions, so that the lock order is kstat internal lock (or perimeter) -> 
i_dls_devnet_lock. I will try this and let you know, so that you can run 
the stress test on it (unless you have already coded up a fix).

Thanks
Thirumalai


Peter Memishian wrote:
> Thiru,
>
> During IPMP stress testing, we hit the following interesting deadlock:
>
> T1: Inside the mac perimeter for "under1", waiting for it to quiesce:
>
> stack pointer for thread ffffff000770dc60: ffffff000770d960
> [ ffffff000770d960 _resume_from_idle+0xf1() ]
>   ffffff000770d990 swtch+0x200()
>   ffffff000770d9c0 cv_wait+0x75(ffffff146a245d98, ffffff146a245ce8)
>   ffffff000770da00 mac_flow_wait+0x5c(ffffff146a245488, 0)
>   ffffff000770da60 mac_bcast_delete+0x24b(ffffff01d51a7c08, ffffff02697df97c, 0)
>   ffffff000770daa0 mac_multicast_remove+0x50(ffffff01d51a7c08, ffffff02697df97c)
>   ffffff000770db00 dls_multicst_remove+0xa0(ffffff025168cd78, ffffff02697df97c)
>   ffffff000770db60 proto_disabmulti_req+0xa0(ffffff025168cd78, ffffff0208231840)
>   ffffff000770db80 dld_proto+0x5f(ffffff025168cd78, ffffff0208231840)
>   ffffff000770dbb0 dld_wput_nondata_task+0x8b(ffffff025168cd78)
>   ffffff000770dc40 taskq_d_thread+0xbc(ffffff01d0310320)
>   ffffff000770dc50 thread_start+8()
>
> T2: An interrupt from the datapath for "under1", that led to entering the
>     IPSQ in ip_ndp_failure() because it discovered an address conflict.
>     On the backend of exiting the IPSQ, it dequeued a request for it to
>     leave the group, which then retrieved the latest kstats for "under1"
>     (to add to the group baseline, which then called back into the mac
>     perimeter).
>
> stack pointer for thread ffffff0007a0bc60: ffffff0007a0ac00
> [ ffffff0007a0ac00 resume_from_intr+0xb4() ]
>   ffffff0007a0ac30 swtch+0xb1()
>   ffffff0007a0ac60 cv_wait+0x75()
>   ffffff0007a0ac90 i_mac_perim_enter+0x65()
>   ffffff0007a0acc0 mac_perim_enter_by_mh+0x1f()
>   ffffff0007a0ad00 mac_perim_enter_by_macname+0x32()
>   ffffff0007a0ad50 dls_devnet_stat_update+0x37()
>   ffffff0007a0adb0 ipmp_phyint_get_kstats+0x8c()
>   ffffff0007a0ae80 ipmp_phyint_leave_grp+0x158()
>   ffffff0007a0af10 ip_sioctl_groupname+0x1e4()
>   ffffff0007a0af90 ip_process_ioctl+0x217()
>   ffffff0007a0afd0 ipsq_exit+0xb8()
>   ffffff0007a0b040 qwriter_ip+0x82()
>   ffffff0007a0b080 ip_ndp_failure+0x7b()
>   ffffff0007a0b120 ndp_input_solicit+0x483()
>   ffffff0007a0b170 ndp_input+0xff()
>   ffffff0007a0b230 icmp_inbound_v6+0x486()
>   ffffff0007a0b3e0 ip_rput_data_v6+0x1961()
>   ffffff0007a0b4c0 ip_rput_v6+0x71a()
>   ffffff0007a0b540 putnext+0x2f9()
>   ffffff0007a0b590 dld_str_rx_fastpath+0xaa()
>   ffffff0007a0b680 i_dls_link_rx+0x2ea()
>   ffffff0007a0b6c0 mac_rx_deliver+0x5d()
>   ffffff0007a0b750 mac_rx_soft_ring_process+0x192()
>   ffffff0007a0b840 mac_rx_srs_proto_fanout+0x4de()
>   ffffff0007a0b8d0 mac_rx_srs_drain+0x2c0()
>   ffffff0007a0b960 mac_rx_srs_process+0x4c9()
>   ffffff0007a0b9f0 mac_bcast_send+0x157()
>   ffffff0007a0ba40 mac_rx_classify+0x17f()
>   ffffff0007a0baa0 mac_rx_flow+0x54()
>   ffffff0007a0baf0 mac_rx+0x11b()
>   ffffff0007a0bb30 mac_rx_ring+0x4c()
>   ffffff0007a0bba0 e1000g_intr+0x230()
>   ffffff0007a0bc00 av_dispatch_autovect+0x8f()
>   ffffff0007a0bc40 dispatch_hardint+0x33()
>   ffffff0007fae470 switch_sp_and_call+0x13()
>
> I don't see any locking rules in IP that were violated (e.g., no locks are
> being held in IP across the call down to GLDv3).  This seems to just be an
> unexpected (and very serious) problem in the new mac perimeter design.
>
> Thoughts?
>
>   