[ https://issues.apache.org/jira/browse/HDFS-9137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14946085#comment-14946085 ]
Yi Liu commented on HDFS-9137:
------------------------------

I think it's OK to do the fix using this approach and update the {{BPOS#toString()}} in a follow-on. The new patch looks good to me, +1 pending Jenkins. Thanks Uma, Vinay, Colin.
What do you think, [~vinayrpet], [~cmccabe]?

> DeadLock between DataNode#refreshVolumes and BPOfferService#registrationSucceeded
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-9137
>                 URL: https://issues.apache.org/jira/browse/HDFS-9137
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 3.0.0, 2.7.1
>            Reporter: Uma Maheswara Rao G
>            Assignee: Uma Maheswara Rao G
>         Attachments: HDFS-9137.00.patch, HDFS-9137.01-WithPreservingRootExceptions.patch
>
>
> I can see that the code flows between DataNode#refreshVolumes and
> BPOfferService#registrationSucceeded could cause a deadlock.
> In practice the situation may be rare, since the user would have to call
> refreshVolumes at the time of DN registration with the NN. But it seems the
> issue can happen.
> Reason for the deadlock:
>   1) refreshVolumes is called with the DN lock held, and at the end it also
> triggers a block report. In the block report call,
> BPServiceActor#triggerBlockReport calls toString on bpos, which takes the
> readLock on bpos.
> Lock order: DN lock, then bpos lock.
>   2) BPOfferService#registrationSucceeded takes the writeLock on bpos and
> calls dn.bpRegistrationSucceeded, which is again a synchronized call on the DN.
> Lock order: bpos lock, then DN lock.
> So this can clearly create a deadlock.
> I think a simple fix could be to move the triggerBlockReport call outside of
> the DN lock; I feel that call may not really be needed inside the DN lock.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
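
For readers following the lock-ordering argument in the issue description, here is a minimal sketch of the proposed fix. It is not the Hadoop code: {{LockOrderSketch}}, {{dnLock}}, {{bposLock}} and {{runConcurrently}} are hypothetical stand-ins, and only the order in which the two locks are taken is modelled. The refresh path releases the DN lock before it touches the bpos lock, so the two paths can no longer wait on each other in opposite orders.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the two lock holders; NOT the real Hadoop classes.
class LockOrderSketch {
    // Plays the role of the DataNode monitor ("DN lock").
    private final Object dnLock = new Object();
    // Plays the role of the BPOfferService read/write lock ("bpos lock").
    private final ReentrantReadWriteLock bposLock = new ReentrantReadWriteLock();
    private int volumes = 0;
    private int registrations = 0;

    // Path 1 with the proposed fix: volume work happens under the DN lock,
    // and the block report is triggered only after that lock is released.
    void refreshVolumes() {
        synchronized (dnLock) {
            volumes++;
        }
        triggerBlockReport(); // DN lock is not held here
    }

    // Takes the bpos read lock, as toString() on bpos does in the report path.
    String triggerBlockReport() {
        bposLock.readLock().lock();
        try {
            return "bpos(registrations=" + registrations + ")";
        } finally {
            bposLock.readLock().unlock();
        }
    }

    // Path 2, unchanged: bpos write lock first, then the DN lock.
    void registrationSucceeded() {
        bposLock.writeLock().lock();
        try {
            synchronized (dnLock) {
                registrations++;
            }
        } finally {
            bposLock.writeLock().unlock();
        }
    }

    // Runs both paths concurrently; returns true if both finish in time
    // and every iteration was counted.
    static boolean runConcurrently(int iterations) {
        LockOrderSketch s = new LockOrderSketch();
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<?> refresh = pool.submit(() -> {
                for (int i = 0; i < iterations; i++) s.refreshVolumes();
            });
            Future<?> register = pool.submit(() -> {
                for (int i = 0; i < iterations; i++) s.registrationSucceeded();
            });
            refresh.get(10, TimeUnit.SECONDS);
            register.get(10, TimeUnit.SECONDS);
            return s.volumes == iterations && s.registrations == iterations;
        } catch (Exception e) {
            return false; // timeout, interruption, or a failure in a task
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(runConcurrently(10_000) ? "completed" : "did not complete");
    }
}
```

If {{triggerBlockReport()}} were moved back inside the {{synchronized (dnLock)}} block, the refresh path would take the DN lock and then the bpos lock while the registration path takes them in the opposite order, which is the cycle described in the issue.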