Re: [networking-discuss] Re: what is " Deadlock: cycle in blocking chain "

Oliver Yang Thu, 29 Mar 2007 00:09:18 -0800

Tom Chen wrote:

Guys, thanks for your help! I finally found that this happens because a thread 
holding a mutex is preempted by a high-priority timer (soft) interrupt that 
wants the same mutex, so a deadlock occurs.


I did some research and noticed that it crashes only when my send( ) function 
is called at the same time as the 1-second timer function. In my send( ) code, 
after a packet is prepared and saved in a DMA-accessible area, it writes a 
device register to tell the hardware to read the packet and send it out. The 
1-second timer also needs to read hardware registers to get the current link 
status. I use a hardware mutex to control access to the hardware. So if 
send( ) has the mutex but is preempted by the 1-second timer, the timer must 
get this mutex before it can read or write the hardware, and it has to wait 
until the mutex becomes available. Since send( ) does not run, the timer 
interrupt waits forever and the system crashes.

If timer( ) does not call mutex_enter( ) and just reads/writes hardware 
registers directly, the deadlock no longer happens. But this is not a good 
solution. Any other ideas? Do you guys use a mutex lock to control access to a 
PCI network card? Do you have a similar issue?


Tom

Oliver, below is my analysis using mdb, following your article. What can you 
see from it? Sorry, I do not have much experience with this. How do you get 
the ACT tool?

panic[cpu0]/thread=ffffff0003eddc80: Deadlock: cycle in blocking chain

ffffff0003eddaa0 genunix:turnstile_block+9f3 ()
ffffff0003eddb20 unix:mutex_vector_enter+38d ()
ffffff0003eddb50 qla:qla_link_state_machine+22 ()
ffffff0003eddb70 qla:qla_timer+78 ()
ffffff0003eddbd0 genunix:callout_execute+b1 ()
ffffff0003eddc60 genunix:taskq_thread+1dc ()
ffffff0003eddc70 unix:thread_start+8 ()

It seems the panicking thread tried to acquire a driver mutex and blocked. You
shouldn't do that: I think you must not block in callout_execute context,
because that also blocks every thread that depends on callout table
processing. For example, a thread in your driver might use cv_timedwait, which
could also be blocked because the callout table is locked. At that point the
clock interrupt thread could be blocked as well.

If you can't avoid using a driver mutex or rwlock in the qla_timer-related
routines, maybe you can trigger another soft interrupt to handle the work.
ddi_intr_add_softint(9F) should work for you.
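
As a rough illustration of that suggestion, here is a user-space model in C
using pthreads, not kernel code. It shows a timer callback that only records
that link work is needed and never touches the hardware mutex, while a separate
deferred handler takes the mutex later. All names (toy_dev, toy_timer,
toy_link_worker) are hypothetical. In a real driver the deferred handler would
presumably be registered with ddi_intr_add_softint(9F) and triggered with
ddi_intr_trigger_softint(9F), which this sketch does not attempt to reproduce.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver's soft state; not the real qla data. */
struct toy_dev {
    pthread_mutex_t hw_lock;        /* plays the role of the hardware mutex */
    atomic_bool link_work_pending;  /* set by the timer, consumed by the worker */
    int link_polls;                 /* how many times link status was "read" */
};

void toy_dev_init(struct toy_dev *d)
{
    pthread_mutex_init(&d->hw_lock, NULL);
    atomic_init(&d->link_work_pending, false);
    d->link_polls = 0;
}

/* Timer callback: never takes hw_lock, so it cannot block on it. */
void toy_timer(struct toy_dev *d)
{
    atomic_store(&d->link_work_pending, true);
    /* Assumption: a driver would trigger its soft interrupt at this point. */
}

/*
 * Deferred handler: runs in a context where waiting for hw_lock is acceptable.
 * Returns true if it polled the link, false if no work was pending.
 */
bool toy_link_worker(struct toy_dev *d)
{
    if (!atomic_exchange(&d->link_work_pending, false))
        return false;

    pthread_mutex_lock(&d->hw_lock);
    d->link_polls++;                /* stand-in for reading link registers */
    pthread_mutex_unlock(&d->hw_lock);
    return true;
}
```

With this split, a timer tick that arrives while send( ) holds the mutex just
sets the flag and returns. The link poll happens later, once the handler runs
and can get the lock.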

I'm not an expert in driver development; maybe other guys can give you a
better solution.

ffffff0003eddc80::findstack -v

stack pointer for thread ffffff0003eddc80: ffffff0003edd540
  ffffff0003edd580 0x292()
  ffffff0003edd5a0 7()
  ffffff0003edd670 xc_common+0x3fb(fffffffffb81c240, 0, fffffffffb861981, ffffff0003edd680, 0, ff, ffffff0003edd700)
  ffffff0003edd700 xc_mbox_lock+0x10()
  ffffff0003edd750 xc_wait_sync+0x2b(fffffffec586f180, 0, fffffffffb81bf5a, ffffff0003edd750, ff, fffffffffb81c240)
  ffffff0003edd7f0 x86pte_inval+0x1e3(fffffffec586f180, c1, 8000000002045561, 0)
  ffffff0003edd880 hat_pte_unmap+0x21b(1b70e7008975471b, 1, fffffffffbbe8a53, ffffff0003edd890, 1388)
  ffffff0003edd890 1()
  ffffff0003edd910 dosoftint_prolog+0xa2(246, 91409, ffffff0003edd920, 0)
  ffffff0003edda00 panic+0x9c()
  ffffff0003eddaa0 turnstile_block+0x9f3(0, 0, fffffffee1c178f8, fffffffffbc04550, 0, 0)
  ffffff0003eddb20 mutex_vector_enter+0x38d(fffffffee1c178f8)
  ffffff0003eddb50 qla_link_state_machine+0x22()
  ffffff0003eddb70 qla_timer+0x78()
  ffffff0003eddbd0 callout_execute+0xb1(fffffffec65e8000)
  ffffff0003eddc60 taskq_thread+0x1dc(fffffffec57f7698)
  ffffff0003eddc70 thread_start+8()

fffffffee1c178f8::rwlock

            ADDR      OWNER/COUNT FLAGS          WAITERS
fffffffee1c178f8 READERS=23058428717  B001 ffffff0003eddc80 (W)
                                      |
                  HAS_WAITERS --------+

My blog's example is about a system hang with an rwlock. From your stack
backtrace, you can see that your driver thread was acquiring a mutex instead
of an rwlock, so you should use ::mutex instead of ::rwlock. If you check
address fffffffee1c178f8 with the ::mutex dcmd, you might find the owner of
the mutex. Then you can run ::findstack against that owner.


Anyway, you can find the deadlock chain by following similar steps.
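
Those manual steps (look up the lock's owner, check what the owner is itself
blocked on, repeat) amount to walking a chain until it either ends or comes
back to the thread you started from. Below is a toy model of that walk in C.
The structs and the function name are hypothetical simplifications, not kernel
data structures, and each thread is assumed to wait on at most one lock.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_lock;

/* Hypothetical, simplified thread: waits on at most one lock. */
struct toy_thread {
    const char *name;
    struct toy_lock *blocked_on;    /* NULL if the thread is not blocked */
};

/* Hypothetical, simplified lock: at most one owner. */
struct toy_lock {
    struct toy_thread *owner;       /* NULL if the lock is free */
};

/*
 * Follow thread -> lock it waits for -> owner of that lock -> ...
 * Returns true if the walk comes back to the starting thread.
 * max_hops bounds the walk so that a cycle which does not include
 * the starting thread cannot make the loop run forever.
 */
bool blocking_chain_has_cycle(const struct toy_thread *start, int max_hops)
{
    const struct toy_thread *t = start;

    for (int hop = 0; hop < max_hops; hop++) {
        if (t->blocked_on == NULL)
            return false;           /* chain ends at a runnable thread */

        const struct toy_thread *owner = t->blocked_on->owner;
        if (owner == NULL)
            return false;           /* lock is free, nothing to wait for */
        if (owner == start)
            return true;            /* back where we started: deadlock */

        t = owner;
    }
    return false;
}
```

In mdb terms, each hop corresponds to one lock lookup followed by one
::findstack on the owner that lookup reports.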

ACT should be available on sunsolve.sun.com, and it can find the deadlocked
threads automatically; in particular, it should work for your case. But I'm
not sure whether it's a free download. I also heard from other guys that there
is another automated tool named SCAT available for download.

--
Cheers,

----------------------------------------------------------------------
Oliver Yang | [EMAIL PROTECTED] | x82229 | Work from office
